We're building scalable software and platforms to enable efficient analysis of very large genetic datasets. All of our tools are open access and free for use by the scientific community.
The widespread application of massively parallel sequencing to complex trait analysis offers unprecedented power to link genetics with disease risk. However, these projects pose substantial challenges of scale and complexity, making even trivial analytic tasks increasingly cumbersome. To address these challenges, we are actively developing Hail, an open-source framework for scalable genetic data analysis.
The foundation of Hail is infrastructure for representing and computing on genetic data, built on open-source distributed computing frameworks including Hadoop and Spark. Hail achieves near-perfect scalability for many tasks and scales seamlessly to whole-genome datasets of thousands of individuals. On top of this infrastructure, we have implemented a suite of standard tools and analysis modules, including data import/export, quality control (QC), analysis of population structure, and methods for both common and rare variant association. At the same time, we and other groups are using Hail to handle the engineering details of distributed computation while developing and deploying new methods at scale.
In addition, Hail exposes a high-level domain-specific language (DSL) for manipulating genetic data and assembling pipelines. As an example, porting a rare-variant analysis from Python to Hail reduced the number of lines of code by ~10x and improved performance by ~100x. We aim to grow Hail into a scalable, reliable, and expressive framework on which the genetics community develops, validates, and shares new analytic approaches on massive datasets to uncover the biology of disease.
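To give a flavor of the kind of pipeline described above, here is an illustrative sketch using Hail's Python interface. The input path, the `phenotype` sample annotation, and the QC threshold are hypothetical placeholders, and this is not an excerpt from the rare-variant analysis mentioned above; it simply shows how import, QC, and association compose in a few lines.

```
# Hypothetical sketch of a Hail pipeline; 'data.vcf.bgz' and the
# 'pheno.phenotype' annotation are illustrative placeholders, assumed
# to have been joined onto the dataset beforehand.
import hail as hl

mt = hl.import_vcf('data.vcf.bgz')       # distributed import of genotype data
mt = hl.variant_qc(mt)                   # per-variant QC metrics (call rate, HWE, ...)
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95)  # drop poorly called variants

# Common-variant association: regress the phenotype on allele count,
# with an intercept term as the only covariate.
results = hl.linear_regression_rows(
    y=mt.pheno.phenotype,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0])
results.export('association_results.tsv')
```

Each step above is a lazy, distributed transformation, so the same few lines run unchanged on a laptop or a Spark cluster.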
Picopili: Pedigree Imputation Consortium Pipeline
Family-based study designs can contribute valuable insights in genome-wide association studies (GWAS), but they require different statistical considerations in quality control (QC), imputation, and analysis. Standardizing this process allows more efficient and uniform processing of data from family-based cohorts, facilitating their inclusion in meta-analyses. We have therefore developed picopili (Pedigree Imputation Consortium Pipeline), a standardized pipeline for processing GWAS data from family-based cohorts.
Paralleling the design of ricopili, this pipeline supports QC, PCA, pedigree validation, imputation, and case/control association appropriate for family designs ranging from sib-pairs to complex, multigenerational pedigrees. Multiple association models are supported, including logistic mixed models and generalized estimating equations (GEE). Tasks are automatically parallelized, with flexible support for common cluster computing environments.
Code is available at: https://github.com/Nealelab/picopili
LD Hub

LD Hub is a centralized database of summary-level genome-wide association study (GWAS) results for 173 diseases/traits, drawn from publicly available resources and consortia, together with a web interface that automates the LD score regression analysis pipeline. LD score regression is a reliable and efficient method that uses GWAS summary statistics to estimate the SNP heritability of complex traits and diseases, partition this heritability into functional categories, and estimate the genetic correlation between different phenotypes.
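The core idea of LD score regression can be sketched in a few lines: under the model, a SNP's expected chi-squared association statistic grows linearly with its LD score, E[χ²] ≈ (N·h²/M)·ℓ + 1, so regressing χ² on ℓ recovers the SNP heritability h² from the slope. The following is a minimal, self-contained sketch of that regression on simulated data (not LD Hub's or ldsc's actual implementation, which also handles weights, the confounding intercept, and jackknife standard errors); all numbers are simulated for illustration.

```python
# Minimal sketch of LD score regression on simulated summary statistics.
# Model: E[chi2_j] = (N * h2 / M) * l_j + 1, so the fitted slope times
# M / N estimates the SNP heritability h2. Data here are simulated.
import random

random.seed(0)

N = 50_000        # GWAS sample size
M = 10_000        # number of SNPs
h2_true = 0.4     # simulated SNP heritability

# Simulate LD scores and chi-squared statistics scattered around the
# model's expected value (real chi2 noise is larger; this is illustrative).
ld_scores = [random.uniform(1.0, 200.0) for _ in range(M)]
chi2 = [(N * h2_true / M) * l + 1.0 + random.gauss(0.0, 0.5)
        for l in ld_scores]

# Ordinary least squares for one predictor plus an intercept.
mean_l = sum(ld_scores) / M
mean_c = sum(chi2) / M
slope = (sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
         / sum((l - mean_l) ** 2 for l in ld_scores))
intercept = mean_c - slope * mean_l

h2_est = slope * M / N
print(f"estimated h2 = {h2_est:.3f}, intercept = {intercept:.2f}")
```

Because only summary statistics (per-SNP χ² and precomputed LD scores) enter the regression, no individual-level genotype data are needed, which is what makes a web service over public GWAS results feasible.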
LD Hub was developed collaboratively by the Broad Institute of MIT and Harvard and the MRC Integrative Epidemiology Unit at the University of Bristol. The site is hosted by the Broad Institute. Major developers include Jie Zheng, Tom Gaunt, David Evans, and Benjamin Neale.