Cloud-based Interactive Analytics for Terabytes of Genomic Variants Data

I’m excited to announce that my first peer-reviewed publication, Cloud-based Interactive Analytics for Terabytes of Genomic Variants Data, is available online in the journal Bioinformatics. This work was done in close collaboration with Cuiping Pan of the VA and Nicole Deflaux at Google. In the article, we evaluate methods for processing large amounts of genomic data. Specifically, we use BigQuery, a tool on Google Cloud, and demonstrate that it can perform, in only seconds, standard genomic analyses that normally take hours or days. We performed this analysis on variant data from 500 whole human genomes, which was quite a lot when we started this project several years ago. Much larger sequencing efforts are now underway, of course, but BigQuery can still handle them. We demonstrate that the format we use to store the data in BigQuery can scale to thousands of genomes and still support fast analysis. Nicole has since come out with a newer version of the storage model that performs even better.

I have a detailed, if slightly out-of-date, codelab on my GitHub page that walks through how to set up and run analyses in BigQuery. For the latest on running genomic analyses on Google Cloud, check out the Google Genomics Cookbook.