Heritability 501 - LDSR-based h2 in UKBB for the technically-minded

Wish we had written a manuscript rather than blog posts so you could read a formal Methods section? Saw the initial post and had 100 “yeah, but…” questions? Want a technical description of the heritability analysis that isn’t just source code on github? Then this is the post for you!

(If you’re looking for a gentler place to start with more accessible explanations of heritability and how we may estimate it, check out Heritability 101 and 201. And no, you haven’t missed a 301 or 401, we’re diving straight into the deep end here.)

LD Score Regression: A short primer

All of the heritability analyses of UKBB we’re presenting here are estimated using LD Score Regression (LDSR). LDSR is a method developed by Bulik-Sullivan et al. for estimating heritability from GWAS results. Code is available in the ldsc package on github, along with tutorials and FAQs, and is supported by an active user group.

Briefly, begin with a standard additive model y=X𝛽+e for standardized phenotype y and matrix of standardized genotypes X (i.e. with dimensions N x M for some set of N individuals and M SNPs). Then model effect 𝛽 as random with mean 0 and variance h2g/M. Make the strong simplifying assumptions that effect sizes 𝛽 are independent across SNPs and are not correlated with the amount of linkage disequilibrium (LD) between SNPs.

Given this model and assumptions, it can be shown that the expected 𝜒2 statistic from a GWAS of SNP j is approximately:

where a is an index for population stratification or other confounding independent of LD, and lj is the “LD score” of SNP j defined as:

where rjk is the correlation between SNP j and SNP k. Using lj estimated in a population reference sample, it’s then possible to regress the observed 𝜒2 statistics from a GWAS on lj to get an estimate of heritability.

As described by Finucane et al and expanded by Gazal et al, this model can be generalized to allow the variance of B to be modelled as a function of SNP annotations rather than a single value of h2g/M. We refer to this as partitioned LDSR, as opposed to univariate LDSR using the the model without annotations.


What “flavor” of heritability does LDSR estimate?

We covered this some in the Heritability 201 post, but this is the technical post so let’s be very precise: LDSR aims to estimate the proportion of phenotypic variance in the population explained by the best linear predictor comprised of common SNPs in the LD score reference panel (assuming infinite sample size). In other words it is h2g or SNP-heritability, where the “universe” of SNPs is those SNPs present with MAF > 5% in the chosen reference panel.

LDSR gets to this estimate as described above. Conceptually, this amounts to estimating the average conditional variance explained by each SNP in the regression[1] (i.e. as an estimate of h2g/M), and extrapolating that effect size to the full reference panel. “Conditional” here refers to the 𝛽 described above, which is the effect size of the SNP in a joint model with all other SNPs in the reference panel as part of the (hypothetical) best linear predictor that defines h2g. The LD score regression serves to estimate this effect size based on the observed marginal effect size of the SNP, using its LD score as a proxy for how much that marginal effect reflects the total conditional effects of other correlated SNPs. In the case of univariate LDSR, this average conditional variance explained is a single value estimated genome-wide; in the partitioned heritability case it’s modelled for each SNP conditional on the included annotations.


Why are we doing LDSR rather than some other method?

The short answer is convenience.

It’s hard to overstate the benefit of estimation from summary statistics when the raw genetic data involves samples sizes in the 100,000s. As noted in the benchmarks on the github repo, once the initial GWAS is done we can estimate heritability for ~1,500 phenotypes in ~10 hours of Google VM time with LDSR. That’s with very little optimization of how we’re running the cloud compute. So given GWAS summary statistics were being produced (more on that effort here), LDSR is a very effective way to leverage these for a quick first look at the overall genetic signal across the large number phenotypes in UK Biobank.

There’s also a benefit of familiarity. LDSR is only a couple years old, but it’s already seen robust use across a wide variety of phenotypes[2] meaning a lot of experience with how it works, how the estimates behave, and how to interpret the results.[3]

That’s certainly not to say that the heritability analysis should end with LDSR though. Methods using individual-level data, such as REML (e.g. GCTA) and mixed models (e.g. BOLT-LMM), will almost certainly provide more precise estimates[4] and facilitate additional modelling, though the computational burden may require more careful thought about which phenotypes to prioritize for analysis. Similarly, other summary statistics-based methods will likely provide additional insights beyond the simplistic modelling in LDSR. But since LDSR is informative, familiar, and convenient, we think it’s the right place to start for this analysis.


The many limitations of this analysis  

Just to reiterate: this is a rough initial analysis of heritability in a rough initial analysis of UK Biobank data. Expect lots of iteration on data quality control, phenotype refinement, methods evaluation, etc. before this stabilizes. We want to share what we’re doing and how we’re doing it more promptly than can be done through formal publications. It’s reasonable to take every analysis with a grain of salt, but you may want to apply a little extra here. (See below for how we’re interpreting the results currently.)

In particular, it’s important to highlight the following:

  • Garbage GWAS in, garbage h2 out: This heritability analysis starts with the very rough initial GWAS being done by the Neale Lab team (read more here). Any artifacts that haven’t been caught yet will almost certainly impact the LDSR analysis.

  • UKBB phenotypes with automated cleaning. This initial analysis relies on automated pre-processing of the phenotypes with PHESANT. Since UKBB provides fairly raw phenotypes, we can expect that, with time, experts in each field covered by UKBB are going to be able to construct more meaningful phenotypes as endpoints for analysis. For example, there’s a long list of medication codes that will probably be much more useful after careful curation of sets of medication relevant to particular diseases and conditions, plus their relevant dosages. Until then, we’re just looking at the heritability of taking each particular medication on that list separately.

  • Automated phenotype cleaning, part 2: As part of the phenotype cleaning, we’ve opted to inverse rank-normalize the continuous phenotypes. This is convenient for automation purposes, since it prevents needing to make a decision about normalization for every phenotype separately, but it is hugely important to understanding the results. To be clear: the heritability of the inverse rank-normalized phenotype is not the same quantity as the heritability of the original continuous phenotype. The nonlinear transformation changes what genetic effects are additive. The magnitude of the impact of this normalization will depend on the original distribution of the phenotype and the nature of the true genetic effects.

  • All regression is linear. Part of the optimizations to allow the rapid GWAS of so many phenotypes in such large-scale data is that every phenotype has been analyzed with linear regression. That means we do not have the conventional logistic regression analysis for binary phenotypes. Central limit theorem and large sample sizes provide a fair amount of protection here (or maybe you’re happy to see linear probability models) but it’s still worth noting.

  • Sample size and rare phenotypes: Random (or quasi-random) sampling from the population means that phenotypes that are rare in the population will be rare in sample. Even at the scale of UK Biobank, phenotypes that are 1-in-1000 in the population would be expected to be observed in only a few hundred individuals. The result is that there is going to be limited statistical power for analyses of rare phenotypes, most notably the many rare diseases/disorders.

  • The UKBB population: The UK Biobank cohort is far from being a random sample of the UK population. They’re intentionally an older cohort (age 40-69 at recruitment), and with a ~5% participation rate there’s strong selection bias towards being healthier and from better socioeconomic conditions, among other features (Fry et al. 2017). Since heritability is a parameter of a given population it’s crucial to keep in mind that we’re estimating heritability in this “healthy volunteer” population, which may or may not be the same as the heritability in the general population.

  • Bootstrapped standard errors: We’ve tried to be very careful about describing the idiosyncrasies of LDSR, but one final statistical feature to note are its standard errors. There aren’t cleanly derived SEs for the LDSR model, so instead it depends on block-jackknife standard errors.  Additionally, since LDSR is running from summary statistics there is no individual level data to jackknife, so it’s instead jackknifed over sets of SNPs. Anecdotally, this pivot to sampling over SNPs rather than over individuals means that the block-jackknife SEs might not have quite the same behavior an conventional SEs. The SEs and corresponding p-values should be evaluated accordingly.


How are you interpreting the results?

Cautiously. Because of all of the above subtleties to the quantity estimated by LDSR and the many limitations of the current analysis, we are primarily focused on qualitative observations than on the raw quantitative value of h2g for any particular phenotype.

As a starting point, we’re looking at things like:

  • What phenotypes have robust evidence for meaningful SNP-heritability vs. what phenotypes have robust evidence for little-to-no heritability from common SNPs (keeping in mind the sample size limitations for rare binary outcomes)? This is especially useful for prioritizing traits for more careful follow-up.

  • Are there noteworthy trends in the results for various sets of phenotypes? For example, do anthropometric measures often have stronger heritability results than behavioral traits? (spoiler: probably yes)

  • Are we observing estimates consistent with previous analysis of these traits in other settings? If not, are the differences potentially informative about either our analysis (especially any technical issues) or the UKBB cohort?


In short, as a preliminary analysis LDSR is a great hypothesis-generating activity to identify things that look “interesting” in UKBB and deserve more attention.


Some final technical notes on implementation

We really recommend the github repo for complete implementation details, but here’s the executive summary for analysts familiar with ldsc:

  • LDSR run using MTAG, which provides a modified version of ldsc with a more convenient interface for iterating LDSR over hundreds of phenotypes.

  • We follow the conventional approach of restricting to GWAS results from HapMap3 sites passing MAF and INFO thresholds as input, but we’ve reimplemented this to define a single passing list in UKBB rather than running munge_sumstats.py with the w_hm3.snplist reference file.

  • We’ve run both univariate and partitioned heritability analyses. For reasons that will be discussed in a later post we’re currently presenting the partitioned heritability LDSR results as the primary analysis in the summary browser, but the univariate results are also available.

  • The univariate LDSR analysis was run with default settings using the precomputed LD scores from 1000 Genomes European ancestry samples (i.e. “./eur_w_ld/”)

  • The partitioned LDSR analysis was run using the v1.1 of the Baseline-LD annotations described by Gazal et al. computed from 1000 Genomes Phase 3 data from European ancestry populations and corresponding allele frequencies (all available from the ldsc reference downloads page). Default settings were used, with the exception of removing the maximum chi2 filter due to the extreme sample size of UKBB.

  • All analyses were run on the Google Cloud Platform, with the assistance of Hail, cloudtools, and many other tools.





i.e. the intersection of the reference panel and the chosen input GWAS results. In our case, as is generally recommended, that input is HapMap3 SNPs passing QC in the UKBB analysis sample.



One of the benefits of blogging these results is that we can provide a lot more context on these points all in one place. Plus we can have that conversation colloquially, with less jargon, no word count restrictions, and no need to dig through lots of supplementary information and cross-reference a dozen journal articles.



Direct comparison of those estimates to LDSR is somewhat complicated since they estimate subtly different flavors of heritability, but it’s fairly safe to anticipate that e.g. GCTA’s estimate of it’s flavor of heritability will be more precise than LDSR’s estimate of it’s flavor.


Authored by Raymond Walters with contributions from Claire Churchhouse and Rosy Hosking

Claire Churchhouse