Details and considerations of the UK Biobank GWAS


In an effort to generate GWAS summary statistics from the 500K UK Biobank release and share them with the scientific community, we faced a set of practical challenges in efficiently running GWAS analyses at such a large scale in order to quickly provide association results that may inform variant interpretation and downstream analyses. The result is a “quick-and-dirty” analysis that strives to provide reliable, albeit imperfect, insight into the UK Biobank data; we hope it will expedite scientific inquiry and empower the scientific community to use these results to answer broader questions about the genetic underpinnings of disease and trait biology. These analysis results are being shared as they are produced, and because we will continue to work on and analyze these data, we expect that they may change as we iterate on QC and dig deeper into the results themselves. In this post, we outline some of the decisions that we made in this process, the limitations inherent to each one, and some keys to interpreting the association results.



For a detailed step-by-step guide to how we QCed and analyzed the UK Biobank phenotypes and genotypes, the Github repository outlines the steps and provides scripts for running association models at scale in Hail. If you have specific questions about what was performed in the analysis, you are much more likely to find your answer there.

For those analyzing case/control phenotypes, we suggest that you filter SNPs using these criteria:

1)    Identify the sample size of the smaller of the two groups in the case/control phenotype

2)    Determine the allele frequency where we would expect at least 25 minor alleles in the smaller group (e.g. minor allele frequency (MAF) cutoff = 25 / ([sample_size]*2))

3)    Remove SNPs below this allele frequency cutoff




When we set out to collect and analyze a wide range of phenotypes in the UK Biobank, the central challenge was to harmonize the variety of categorization modes, variable scaling, and follow-up responses in a meaningful way. Fortunately, the folks behind the PHEnome Scan ANalysis Tool (PHESANT) had already done a lot of the heavy lifting in this domain. Duncan Palmer, a post-doc in our lab, set out to add the finishing touches: writing out a collection of manually curated phenotypes and performing transformations to make the phenotype data amenable to analysis using Hail. In short, phenotypes were converted into normally-distributed quantitative variables or collections of binary (TRUE/FALSE) categorical variables. Thus, model performance would maintain some consistency despite the wide range of variable scaling across phenotypes. Full details of the phenotype pipeline are provided in our Github lab repository and are summarized here.


Benefits:

-       Simplicity: Minimize the model specifications across the range of phenotypes analyzed

-       Efficiency: Automated recoding and processing would minimize time spent going through each individual phenotype

-       Consistency: Applying the same rules to all phenotypes makes it easy to track the type of transformation applied to any given phenotype



Limitations:

-       Over-specification: No higher-level processing was put into combining phenotypes or otherwise organizing disparate phenotypes into a more coherent, meaningful variable for model interpretation

-       Over-simplification: By forcing phenotypes to either have a normal distribution (often through inverse rank-normalization) or a binomial distribution (often by splitting multi-level categories into multiple binary categories), we lose sensitivity in our measurement and subsequently power to detect real effects. Further, any transformation poses the risk of misinterpreting the effect sizes, as the scale may have changed from that listed in the UK Biobank showcase, and with it the meaning of any SNP effect size. This can pose challenges in properly determining additive vs. dominant effects in downstream analyses.
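
To illustrate the kind of transformation involved, here is a simplified sketch of a rank-based inverse normal transform of the sort applied to continuous traits; the function name, the rank offset, and the tie handling (ties broken by input order) are our own simplifications, not the exact PHESANT procedure:

```python
from statistics import NormalDist

def inverse_rank_normalize(values):
    """Map each value to the normal quantile of its rank, yielding an
    approximately standard-normal distribution regardless of the input
    scale. Simplified sketch: ties are broken by input order."""
    n = len(values)
    nd = NormalDist()
    # sort indices by value, then place each observation at the
    # centre of its quantile bin: (rank + 0.5) / n
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    for rank, i in enumerate(order):
        transformed[i] = nd.inv_cdf((rank + 0.5) / n)
    return transformed
```

Note how a heavily skewed input such as [5.0, 1.0, 100.0] maps onto symmetric normal quantiles: the linear model then sees a well-behaved outcome, but the resulting betas are on the transformed scale rather than the original showcase units, which is exactly the interpretability caveat above.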

As of this writing, we have run 1513 unique phenotypes through PHESANT. You can view the summaries of each phenotype in the application-specific ‘phenosummary_final.tsv’ files.



Starting from the imputed .bgen files provided by UK Biobank, our goal was to stick to the most commonly agreed upon sample and SNP QC parameters currently used across GWAS analyses. Too little QC and we would have to deal with an abundance of false positive results; too much QC and we would lose a lot of real genetic information in the aim of providing clean results. For our own efficiency, and to maintain as much consistency as possible with the broader UK Biobank effort, we used the kinship relatedness, predicted ancestry, principal components, and per-SNP INFO scores provided by the UK Biobank genetics team rather than re-calculating these ourselves.

Starting from the 487,409 individuals with phased and imputed genotype data, we filtered down to 337,199 QC positive individuals. The filter that removed the most individuals was the restriction to samples of white British genetic ancestry. This was done not because we aren’t interested in the genetics of individuals outside this arbitrary threshold, but rather because differences in population allele frequencies can often drive spurious associations when individuals of sufficiently different ancestral backgrounds are included. Other filters included removing closely related individuals (or at least one of each related pair), individuals with sex chromosome aneuploidies, and individuals who had withdrawn consent from the UK Biobank study.

Over 92 million imputed autosomal SNPs were available for analysis. As of this writing, over half of these SNPs were undergoing re-imputation by the UK Biobank team due to mapping issues, leaving the ~40 million autosomal SNPs imputed from the Haplotype Reference Consortium as eligible for analysis. We further restricted to SNPs with MAF > 0.1% and HWE p-value > 1e-10 in the 337,199 QC positive individuals, and an INFO score > 0.8 (taken directly from UK Biobank), leaving 10.8 million SNPs for analysis.
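
In pseudocode terms, the variant filter amounts to three thresholds applied per SNP. The sketch below uses illustrative field names (not the actual .bgen annotations) and toy values:

```python
# Hypothetical per-variant QC metrics; field names are illustrative only
variants = [
    {"rsid": "rs1", "maf": 0.05,   "hwe_p": 0.3,   "info": 0.99},
    {"rsid": "rs2", "maf": 0.0005, "hwe_p": 0.5,   "info": 0.97},  # fails MAF
    {"rsid": "rs3", "maf": 0.2,    "hwe_p": 1e-12, "info": 0.95},  # fails HWE
    {"rsid": "rs4", "maf": 0.1,    "hwe_p": 0.8,   "info": 0.6},   # fails INFO
]

def passes_qc(v):
    """The three per-SNP thresholds described above."""
    return v["maf"] > 0.001 and v["hwe_p"] > 1e-10 and v["info"] > 0.8

kept = [v["rsid"] for v in variants if passes_qc(v)]  # -> ["rs1"]
```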


Benefits:

-       Quality: Restricting to a single ancestry and high quality SNPs attenuates the primary confounding factors (population stratification, imputation quality) that can drive false positive results in GWAS analyses.

-       Consistency: Maintaining these basic parameters across all GWAS runs allows for easier downstream processing across phenotypes



Limitations:

-       Restrictive: By removing related individuals and applying a strict ancestry inclusion, we are reducing our statistical power, particularly for phenotypes that already have a low prevalence or a high level of missing values. More sophisticated linear mixed models can account for related individuals, but computational costs are still prohibitively large for this scale of analysis.

-       Specificity over Sensitivity: We are removing around 75% of analysis-ready SNPs, including many well-imputed SNPs with MAF < 0.1% or INFO scores < 0.8. For well-powered phenotypes where most individuals have a non-missing value, we will inevitably miss real associations at the lower end of the MAF spectrum. Additionally, the variants most likely to be under recent positive or negative selection may be excluded by the HWE filter that we’ve chosen (HWE p-value > 1e-10), as we are quite well-powered with 330K individuals to detect even minor deviations from expectation.



Our ability to generate summary statistics at such a large scale (of samples, SNPs, and phenotypes) relies on access to large cluster computing on the Google Cloud Platform and massively parallel processing via Apache Spark, which Hail employs to perform all computational tasks. To simplify the process of association testing, association for all phenotypes used a least-squares linear model predicting the phenotype with an additive genotype coding (0, 1, or 2 copies of the minor allele), with sex and the first 10 principal components from the UK Biobank sample QC file as covariates. Using the linreg3 command in Hail, we could simultaneously run up to 110 separate phenotypes before reaching memory limitations. Results are stored in key tables, and individual phenotype association summary statistics can be exported to a tab-separated file.
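
Stripped of the covariates (sex and the 10 PCs) and of Hail’s parallelism, the per-SNP test is ordinary least squares of the phenotype on allele dosage. A minimal single-SNP sketch (our own simplification, using a normal approximation for the p-value, which is reasonable at biobank sample sizes):

```python
import math

def linreg_snp(dosages, phenotype):
    """Least-squares fit of phenotype ~ intercept + dosage (0/1/2 copies).
    Returns (beta, se, p). Covariates are omitted for brevity; p uses a
    normal approximation to the t-test."""
    n = len(dosages)
    xbar = sum(dosages) / n
    ybar = sum(phenotype) / n
    sxx = sum((x - xbar) ** 2 for x in dosages)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(dosages, phenotype))
    beta = sxy / sxx                      # additive effect per allele copy
    alpha = ybar - beta * xbar            # intercept
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(dosages, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of beta
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return beta, se, p
```

In the real pipeline this model is fit for every SNP and up to 110 phenotypes at once inside Hail’s linreg3, which is what makes the computation tractable at this scale.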


Benefits:

-       Speed: The combination of cloud computing infrastructure and low-level parallelization of tasks via Hail and Apache Spark allows us to provide summary statistics at scale

-       Simplicity: Using the same association model allows for easy scaling up across phenotypes


Limitations:

-       Model misspecification: While normally distributed quantitative traits are suited to linear regression models, binary traits are better suited to a logistic model, and the linear assumptions can create biases in the beta coefficients and significance, which we address below.



Our primary goal in assessing the quality of the association statistics is to get a quick picture of the behavior of the results without digging too deeply into any single phenotype – that’s what we hope users will do! With this in mind, we focused on signs of statistical inflation, namely lambda GC, across a small but representative set of phenotypes to gauge how well our association model behaved across the allele frequency spectrum and genotype quality metrics. Lambda GC compares the median test statistic against the expected median test statistic under the null hypothesis of no association. For well-powered quantitative traits with a known polygenic inheritance, we expect inflation of lambda GC, but for traits with no expected association, we expect lambda GC to be around 1. Additionally, we stratified our measurement of lambda GC by MAF and INFO score bins to get a more detailed picture of the variant properties that may signal spurious (or polygenic) results.
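
Concretely, lambda GC can be computed from a vector of p-values by converting each to a 1-df chi-square statistic and dividing the observed median by its null expectation (~0.455). A stdlib-only sketch (function name ours):

```python
from statistics import NormalDist, median

def lambda_gc(pvalues):
    """Genomic control lambda: median observed 1-df chi-square statistic
    divided by the expected median under the null (~0.4549)."""
    nd = NormalDist()
    # invert each two-sided p-value to its 1-df chi-square statistic
    chisq = [nd.inv_cdf(1 - p / 2) ** 2 for p in pvalues]
    expected_median = nd.inv_cdf(1 - 0.5 / 2) ** 2  # chi-square at p = 0.5
    return median(chisq) / expected_median
```

Under the null, p-values are uniform and lambda GC sits at 1; genuine polygenic signal (or confounding) pushes it above 1, which is why we inspect it stratified by MAF and INFO below.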

It is important to keep in mind that the assessment we detail below is directed towards the model behavior across a set of different phenotype arrangements, and is not evaluating any specific quality of the phenotype measurement itself, transformation in PHESANT, or replicability against previous findings.


Quantitative phenotypes

Quantitative traits performed quite well overall, with moderate inflation in lambda GC driven predominantly by more common variation. Taking trait neuroticism as an example, where we had 274K non-missing trait scores, we assessed the overall p-value distribution:



We also checked how the lambda GC varied as a function of MAF, where we measure MAF on a log10 scale whereby a MAF of 1% is -2, and MAF 10% is -1. Lambda GC is the black line and its scale is measured on the vertical axis on the right hand side of the plot:


As you can see, more common variants are driving the inflation in the overall summary statistics, suggesting a polygenic signal that is being tagged by variants likely to be in LD with causal variants in the population. While not shown here, we see a similar pattern of inflation for other well characterized quantitative polygenic traits, such as height, weight, and BMI in the UK Biobank samples. 

We also checked how lambda GC varied as a function of imputation INFO scores:


We see the same inflation in test statistics among higher INFO scores (which capture most QC positive SNPs), which could reflect the fact that better genotype quality reduces noise and leads to cleaner association signals. However, it turns out that most common SNPs fall predominantly in the highest INFO score bin, meaning that these two measurements are not independent of each other (MAF is measured on the vertical axis on the right hand side of the plot):


Finally, we looked at the Manhattan plot, which shows many genome-wide significant associations with characteristic patterns of LD creating dense peaks of SNPs rising to significance (NOTE: we excluded plotting of SNPs with p-value > 0.001 for ease of generating the plot).



Case/control phenotypes

Case/control phenotypes presented a different challenge, as we knew that fitting a linear model to a binary outcome misspecifies the model assumptions and, under certain scenarios, will lead to imprecision in our statistical inference. To the extent that we test larger case and control cohorts and focus on more common SNPs, the performance of linear regression is comparable to a logistic model. In contrast, when the case sample is small and the SNP is rare, uncertainty in allele frequency estimation can drive false positive associations. This is because the asymptotic assumptions of a linear model are not met when the allele count is low in either the case or control group, and with the addition of model covariates, individuals with large model residuals (often measured through leverage and Cook’s distance) will drive false positive associations.

Two case examples

To demonstrate the point above, we selected two case/control phenotypes in which the case group contains 500 samples. The key difference is that one example has an equally sized control group, while the other uses the remainder of the UK Biobank as controls; together they give a sense of the limitations of rare variant analysis in both small samples and highly imbalanced case/control samples.

1.     Bipolar disorder status (500 vs 453 individuals)

After sample QC, there were 500 Bipolar I cases & 453 Bipolar II cases. When we look at the QQ plot, we don’t see any genome-wide significant hits, which is expected for such a small sample size and no established loci of large effect between Bipolar types, but we do see more inflation in the lambda GC than we’d expect:


When we stratify lambda GC by MAF, we see strong inflation among the lowest frequency SNPs. For reference, a SNP with MAF of 1% within this sample would have a total allele count of ~19:


Lambda GC steadily drops toward expectation as the MAF bins get larger, first crossing below expectation around a MAF of 1.5%, or an allele count of ~30.

Lower allele counts, whether they predict risk for Bipolar I or II cases, have not met the asymptotic assumptions of the linear model, and lead to inflated statistical significance. Often, rare alleles are instead tested using a Fisher’s exact test or stratified Cochran–Mantel–Haenszel (CMH) test, which is based on a hypergeometric distribution that considers all possible outcomes, thus controlling for inflation at low allele counts. These tests, however, do not account for covariates and require more computational resources. Furthermore, under a basic 2x2 testing framework (i.e. case/control x allele A/allele B), a fully penetrant allele would require around 25 alleles in either the case or control cohort to surpass genome-wide significance (p < 5e-8). When we remove SNPs with allele count < 25, our lambda GC drops to 1.007.
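
As a back-of-the-envelope check of that 25-allele rule of thumb (the function and numbers below are our own sketch, not part of the pipeline), the one-sided Fisher p-value for a fully penetrant allele at these sample sizes, with all 25 minor alleles falling in cases, can be computed directly from the hypergeometric distribution:

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p for the 2x2 table [[a, b], [c, d]]:
    probability of observing >= a in the top-left cell, margins fixed."""
    r1, c1, n = a + b, a + c, a + b + c + d
    return sum(
        comb(r1, k) * comb(n - r1, c1 - k) / comb(n, c1)
        for k in range(a, min(r1, c1) + 1)
    )

# 500 cases (1000 alleles) vs 453 controls (906 alleles),
# all 25 minor alleles observed in cases
p = fisher_one_sided(25, 1000 - 25, 0, 906)
```

The resulting p-value is on the order of 1e-7, sitting right at the edge of genome-wide significance, consistent with ~25 alleles being roughly the floor for a detectable fully penetrant effect in a sample of this size.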


2.     Essential Hypertension (ICD10 code = I10; 500 vs 336,699 individuals)

Many of the ICD10 codes were tested such that any individual not assigned the ICD10 code is considered a control. After QC, there were 500 Essential Hypertension cases, and the remaining 336,699 QC positive individuals in the UK Biobank cohort were considered controls. When we look at the QQ plot, we do see genome-wide significant hits and little inflation in the lambda GC:


When we look at the Manhattan plot, however, most of the genome-wide significant hits are likely false positives as there is little to no LD-pattern of association around the top hits.


When we stratify lambda GC by MAF, we see a different pattern of statistical deviation from the Bipolar Disorder Status phenotype above:


This pattern of inflation and deflation is indicative of the strong imbalance in case and control sample sizes, as the small case group yields much less precise allele frequency estimates than the large control group. For example, an unassociated SNP with MAF of 0.1% is expected to be seen only once in the 500 cases, whereas it is expected to be observed 673 times in the 336,699 controls. If this SNP was seen 0 or 2 times in the case sample, it could have a large effect on the frequency estimate in cases relative to controls.
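
The expected-count arithmetic in that example is just 2 × N × MAF for each group:

```python
def expected_minor_alleles(n_individuals, maf):
    """Expected minor allele count: two alleles per diploid individual
    at an autosomal site."""
    return 2 * n_individuals * maf

cases = expected_minor_alleles(500, 0.001)         # -> 1.0
controls = expected_minor_alleles(336_699, 0.001)  # -> ~673.4
```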

One option is to restrict to SNPs with a high enough allele count in the smaller sample; however, this leads to biased ascertainment of SNPs meeting a specific criterion in only one category. As the example below shows, if we restrict to an allele count > 25 just in cases, we see a spike in lambda GC inflation among the variants that just meet this criterion but do not meet an equivalent criterion in controls:


Conversely, if we set an allele frequency cutoff equivalent to the AC > 25 above, we would get an MAF cutoff of 0.025 (i.e. 2.5%). When we use this cutoff instead, we do not see any inflation in the test statistics:




While there is no hard-and-fast rule about the optimal cutoff for ensuring robust results, for case/control phenotypes we recommend removing SNPs that are under-dispersed in either the case or control cohort; this removes much of the uncertainty in allele frequency estimation from smaller cohorts and mitigates the leverage that large residuals have in driving false positive signals. Much of this revolves around the allele count, which sets a floor on the dispersion needed to detect genetic variation. Given that larger case/control cohorts increase association power for lower frequency variants, the key is to adapt the allele frequency cutoff for any given phenotype to the point where allele counts render the case/control difference uninformative for genome-wide interrogation.

Here are the quick steps we recommend:

1)    Identify the sample size of the smaller of the two groups in the case/control phenotype

2)    Determine the allele frequency where we would expect at least 25 minor alleles in the smaller group (e.g. MAF cutoff = 25 / ([case_sample_size]*2))

3)    Remove SNPs below this allele frequency cutoff
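
The three steps above reduce to a one-line cutoff; a minimal sketch (function name ours):

```python
def maf_cutoff(n_cases, n_controls, min_alleles=25):
    """MAF below which the smaller group is expected to carry fewer than
    `min_alleles` copies of the minor allele (diploid individuals)."""
    n_smaller = min(n_cases, n_controls)
    return min_alleles / (2 * n_smaller)

# e.g. the Essential Hypertension example: 500 cases vs 336,699 controls
cutoff = maf_cutoff(500, 336_699)  # -> 0.025, i.e. a MAF of 2.5%
```

SNPs with MAF below the returned cutoff would then be dropped from the summary statistics before interpretation.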

While there are likely real genetic associations that are being filtered out using these criteria, many of the SNPs are simply too unreliable under the current model configuration to be properly assessed for genetic association. Thus, many false positive signals will be found among these SNPs.



Our primary goal in this exercise is to provide the scientific community with a set of actionable summary statistics from the UK Biobank release, removing the computational burden associated with generating these initial results and driving downstream analysis beyond the discovery of individual genetic loci. These results should not be regarded as a definitive set of curated and reviewed association statistics, but rather as an early guidepost of “quick and dirty” results that represent the preliminary landscape of new findings in the UK Biobank resource. Thus, we urge you to view these results with the same skeptical inquiry as any other preliminary set of data, and we are open to any feedback from the community about how to improve this resource moving forward.


Authored by Daniel Howrigan, with contributions from Liam Abbott, Claire Churchhouse and Duncan Palmer
