Asthma Genetics in the UK Biobank and the Global Biobank Engine

I recently wrote a blog post with Manuel Rivas for the Rivas Lab website that explores the genetics of asthma in UK Biobank participants.  In that post we look for known variants associated with asthma and identify a loss-of-function (LOF) mutation in IL33 that confers a protective effect on carriers.  Any LOF mutation that reduces the incidence of disease is interesting because the gene it disrupts might be a good target for drug development.  It is far easier to develop a drug that reduces the function of a protein than one that enhances it, so finding variants like these is quite exciting.
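To make "protective effect" concrete, here is a minimal sketch of how one might compute an odds ratio for carriers versus non-carriers from a 2x2 table of cases and controls.  The counts are made up purely for illustration (they are not the UK Biobank numbers from our post), and the function name is my own.  An odds ratio below 1 means carriers have lower odds of disease than non-carriers.

```python
import math


def odds_ratio_with_ci(carrier_cases, carrier_controls,
                       noncarrier_cases, noncarrier_controls, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a 2x2 table.

    The interval is a Wald interval computed on the log-odds scale;
    z=1.96 gives an approximate 95% confidence interval.
    """
    odds_ratio = (carrier_cases * noncarrier_controls) / (
        carrier_controls * noncarrier_cases
    )
    # Standard error of log(OR) is the square root of the summed reciprocal counts.
    se = math.sqrt(
        1 / carrier_cases
        + 1 / carrier_controls
        + 1 / noncarrier_cases
        + 1 / noncarrier_controls
    )
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts, for illustration only.
or_, ci_low, ci_high = odds_ratio_with_ci(
    carrier_cases=20,
    carrier_controls=480,
    noncarrier_cases=4000,
    noncarrier_controls=46000,
)
print(f"OR = {or_:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

If the whole interval sits below 1, as it does for these invented counts, the data are consistent with a protective effect.  A real analysis would also adjust for covariates such as age, sex, and ancestry, which this sketch ignores.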

The post we wrote also serves as a guide to using the Global Biobank Engine (GBE).  GBE is a project I worked on during my rotation in the Rivas Lab that has since gone live and is available for the world to use.  GBE allows users to explore genetic and phenotypic data from the UK Biobank and perform statistical analyses including genome-wide association studies and phenome-wide association studies.  

The UK Biobank is a very exciting dataset that provides an unprecedented opportunity to study the genetics of disease across a broad range of phenotypes.  The phenotypes include lots of things you would expect, like all sorts of diseases, demographics, and medications.  They also include some ridiculous stuff like "Thickness of butter spread on baguettes".  But what if the individual doesn't like baguettes?  What if they prefer oatcakes?  Well, don't worry, because they asked about that, too.  Now of course you also want to know how many buttered oatcakes they're eating per day, so they asked that as well.  In case you were actually wondering, most people who eat buttered oatcakes have two per day and apply a thin layer of butter.  You may be disappointed to learn that we do not currently have a GWAS available for thickness of butter spread on oatcakes.  Sorry.

We built GBE in Python using the Flask framework.  The backend database is managed by SciDB.  We initially built it on MongoDB but found that once we scaled up to the full dataset, queries ran too slowly to serve an interactive website.

I have found GBE to be a great resource for quickly exploring phenotype-genotype relationships when working on GWAS.  I encourage everyone to go take a look and make it part of their regular analysis pipeline.

Cloud-based Interactive Analytics for Terabytes of Genomic Variants Data

I’m excited to announce that my first peer-reviewed publication, Cloud-based Interactive Analytics for Terabytes of Genomic Variants Data, is available online in the journal Bioinformatics.  This work was done in close collaboration with Cuiping Pan of the VA and Nicole Deflaux at Google.  In this article, we evaluate methods for processing large amounts of genomic data.  Specifically, we use a tool called BigQuery on Google Cloud and demonstrate that it can perform standard genomic analyses that normally take hours or days in only seconds.  We performed this analysis on variant data from 500 whole human genomes, which was quite a lot when we started this project several years ago.  Of course there are now much larger sequencing efforts underway, but BigQuery can still handle that.  We demonstrate that the format we use to store the data in BigQuery can scale to thousands of genomes and still perform analyses quickly.  Nicole has since come out with a newer version of the storage model that performs even better.
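To give a flavor of what a SQL-style genomic analysis looks like, here is a small sketch that computes allele frequencies with a single GROUP BY query.  It uses Python's built-in sqlite3 module as a local stand-in so it runs anywhere.  It is not BigQuery, and the table layout (one row per sample per variant with an alternate allele count) is a toy schema I made up for this example, not the storage format from the paper.  It also assumes diploid calls with no missing genotypes.

```python
import sqlite3

# Toy schema (hypothetical): one row per sample per variant.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE calls (
        sample_id TEXT,
        chrom     TEXT,
        pos       INTEGER,
        ref       TEXT,
        alt       TEXT,
        alt_count INTEGER  -- 0, 1, or 2 copies of the alternate allele
    )
    """
)

# Made-up genotypes for four samples at two variants.
toy_calls = [
    ("S1", "1", 100, "A", "G", 0),
    ("S2", "1", 100, "A", "G", 1),
    ("S3", "1", 100, "A", "G", 1),
    ("S4", "1", 100, "A", "G", 2),
    ("S1", "1", 200, "C", "T", 0),
    ("S2", "1", 200, "C", "T", 0),
    ("S3", "1", 200, "C", "T", 1),
    ("S4", "1", 200, "C", "T", 0),
]
conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?)", toy_calls)


def allele_frequencies(connection):
    """Return (chrom, pos, ref, alt, allele_freq) rows, one per variant."""
    query = """
        SELECT chrom, pos, ref, alt,
               1.0 * SUM(alt_count) / (2 * COUNT(*)) AS allele_freq
        FROM calls
        GROUP BY chrom, pos, ref, alt
        ORDER BY chrom, pos
    """
    return connection.execute(query).fetchall()


frequencies = allele_frequencies(conn)
for chrom, pos, ref, alt, freq in frequencies:
    print(f"chr{chrom}:{pos} {ref}>{alt} allele frequency = {freq}")
```

The appeal of this style is that the whole analysis is one declarative query, so the database engine decides how to spread the work across machines.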

On my GitHub page I have a detailed, if slightly out-of-date, codelab that walks through how to set up and run analyses in BigQuery.  Check out the Google Genomics Cookbook for the latest on how to run genomic analyses on Google Cloud.