Insights from estimates of SNP-heritability for >2,000 traits and disorders in UK Biobank

So we’ve estimated heritability for thousands of phenotypes [1] in the UK Biobank. Although this is only a rough, early analysis (see the laundry list of technical limitations), we still want to explore these results to see what trends emerge, and what evidence we have about the heritability of these traits and about the performance of LD score regression (LDSR). In short: what have we learned?

Lots of things are heritable

Starting with the obvious question: yes, many traits show significant SNP heritability in the UK Biobank data.

Of the 2,419 phenotypes analyzed as of this writing, 350 have statistically significant heritability estimates after correction for multiple testing, and 697 are at least nominally significant [2]. Setting aside significance, which is influenced by the number of measured individuals and the number of cases, the average SNP heritability estimate is somewhere around 0.1.

Distribution of LDSR SNP-heritability estimates for phenotypes with $N_{eff}$ > 10,000.

QQ plot of significance for all heritability estimates. Interactive versions of these plots can be found on the UKBB heritability results site.

Many of the non-significant results likely reflect limited statistical power rather than a true lack of heritability. Using effective sample size ($N_{eff}$) [3] as a rough measure of statistical power, 56% of phenotypes with $N_{eff}$ > 10,000, and an impressive 89.9% of phenotypes with $N_{eff}$ > 100,000, have significant heritability estimates. To understand why negative estimates of heritability occur, please see the FAQ on the ldsc GitHub.

This widespread heritability across many traits shouldn’t be a surprise. Countless twin studies have shown that a wide variety of traits are heritable, to the point that Eric Turkheimer famously proposed the first law of behavioral genetics: “all human behavioral traits are heritable”. But it’s encouraging to see molecular genetics evidence continue to support the same conclusion as previous methods: that almost all human traits, not just behavior, have a heritable component.

Anthropometric measures are really heritable, but so is behavior

If you browse the heritability results (either in a table or in graphical form), one of the most obvious trends is that the top heritability results are often for anthropometric measures (i.e. physical measures of the human body).

Defining a “most heritable” phenotype depends a bit on your choice of metric (e.g. highest $h^2$ estimate, most significant $h^2$ estimate, highest $h^2$ conditional on significance level, etc.), but height stands out from the crowd ($h^2$=.46, p=7.5e-109). This perhaps isn’t surprising, given that height was one of the first traits studied to formalize the concept of heritability in humans, has among the highest twin and family heritability estimates, and has seen extremely successful mapping of specific associated genetic loci over the past decade. Many less commonly studied anthropometric measures show similarly strong results (e.g. percentage of fat in each arm and leg, bone mineral density, impedance measures, waist circumference).

Many behavioral outcomes also show strong heritability though. Having a college degree appears near the top of the heritability results ($h^2$=.28, p=6.6e-195), consistent with previous work on the genetics of educational attainment. Multiple phenotypes related to alcohol consumption and cigarette smoking also show significant heritability, as do personality features (e.g. neuroticism, restlessness, mood swings). Even something as simple as the amount of time someone spends watching TV appears to be at least a little bit heritable ($h^2$=.096, p=2.8e-114).

The heritability of a trait is heavily influenced by how well the trait is measured. Heritability is a proportion of the total variation in the measured trait, so increasing measurement error inflates the denominator and decreases the heritability. Many anthropometric measures show extremely low variability in repeated measurement studies, which likely contributes to their high heritability estimates.
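As a rough illustration (a textbook sketch under classical measurement error, not part of the LDSR analysis itself), suppose the measured phenotype is the true phenotype plus independent error with variance $\sigma^2_e$ that is unrelated to genotype. The symbols $\sigma^2_g$ (additive genetic variance) and $\sigma^2_P$ (variance of the true phenotype) are introduced here purely for illustration:

$$h^2_{measured} = \frac{\sigma^2_g}{\sigma^2_P + \sigma^2_e} = h^2_{true} \times \frac{\sigma^2_P}{\sigma^2_P + \sigma^2_e}$$

The second factor is the reliability of the measurement, so under these assumptions a measure with reliability 0.8 would recover only 80% of the heritability of the underlying trait.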

If anything, it’s more difficult to find phenotypes with strong evidence for a complete lack of heritability. At large effective sample sizes, even very low heritability estimates around .01, for phenotypes such as the amount of time employed at one’s current job or the death of a close relative in the last 2 years, have nominal statistical evidence for non-zero heritability. This may be evidence of some statistical artifact of LDSR at large sample sizes, or may reflect, for example, indirect influences of genetics on general health.

There’s still some inflation or model misspecification

In addition to heritability, LDSR also estimates an intercept term that aims to gauge the amount of confounding in each analysis (indeed, this was the original goal of LDSR). This confounding could be population stratification, subtle familial relatedness, or other model misspecification in the LDSR model.
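For readers who want the mechanics, the standard LDSR model regresses each SNP's $\chi^2$ statistic on its LD score $\ell_j$: $E[\chi^2_j] = \frac{N h^2}{M}\ell_j + Na + 1$, where $N$ is the sample size, $M$ the number of SNPs, and $a$ the contribution of confounding. The slope carries the heritability and the intercept ($Na + 1$) carries the confounding. The sketch below is a deliberately simplified, unweighted least-squares fit on noise-free toy data; the function name and all input values are hypothetical, and the real ldsc software uses a weighted regression with block-jackknife standard errors rather than this plain fit.

```python
def fit_ld_score_regression(ld_scores, chi2_stats, n_samples, n_snps):
    """Unweighted least-squares fit of chi^2 on LD score (toy sketch only).

    Returns (h2_estimate, intercept). Real LDSR uses regression weights and
    block-jackknife standard errors, which are omitted here.
    """
    n = len(ld_scores)
    mean_l = sum(ld_scores) / n
    mean_c = sum(chi2_stats) / n
    sxx = sum((l - mean_l) ** 2 for l in ld_scores)
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2_stats))
    slope = sxy / sxx
    intercept = mean_c - slope * mean_l
    # The slope equals N * h2 / M, so rescale it to recover h2.
    h2_estimate = slope * n_snps / n_samples
    return h2_estimate, intercept


# Hypothetical toy inputs: chi^2 generated exactly from the model, no noise.
N, M = 100_000, 1_000_000
true_h2, true_intercept = 0.2, 1.03
toy_ld_scores = [10.0, 50.0, 100.0, 200.0, 400.0]
toy_chi2 = [true_intercept + (N * true_h2 / M) * l for l in toy_ld_scores]

h2_hat, intercept_hat = fit_ld_score_regression(toy_ld_scores, toy_chi2, N, M)
```

Because the toy data contain no sampling noise, the fit recovers the heritability and intercept that generated them; with real GWAS summary statistics both estimates carry substantial uncertainty.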

Distribution of LDSR intercept estimates for phenotypes with $N_{eff}$ > 10,000.

QQ plot of significance for all intercept estimates. Interactive versions of these plots can be found on the UKBB heritability results site.

In the UKBB LDSR analysis we observe many highly significant intercept estimates. Among well-powered phenotypes with $N_{eff}$ > 10,000 the average intercept is 1.025, though there is a tail of phenotypes with much higher values. Consistent with the intercept being an index of population structure, some of the most significant intercepts are observed for GWAS of home location (e.g. an intercept >2 for GWAS of living in a large city in Scotland vs. elsewhere in the UK).

Distribution of LDSR ratio estimates for phenotypes with $N_{eff}$ > 50,000. An interactive version of this plot can be found on the UKBB heritability results site.

Reassuringly, the LDSR “ratio” estimate, a gauge of the relative balance of confounding and genetic effects in the GWAS [4], indicates that most of the signal in most of the UK Biobank GWAS is likely genetic rather than some form of confounding or model misspecification, despite the highly significant intercepts. Specifically, the mean ratio for phenotypes with $N_{eff}$ > 10,000 is roughly 0.11, suggesting nearly 90% of the observed signal in the average UK Biobank GWAS is attributable to genetics rather than confounding [5].

Partitioned LDSR is helpful

Astute readers familiar with LDSR have probably noticed that we’re emphasizing heritability estimates from partitioned, rather than univariate, LDSR and that we’ve been adding “model misspecification” to the conventional list of factors that may increase the LDSR intercept term. If you’ve already read the technical details post, then you may also have noted that in the partitioned LDSR analysis we’ve removed the default use of two-stage estimation and a cap on maximum $\chi^2$ in estimation of the intercept term. These changes come in response to initial observations by Steven Gazal, Patrick Turley, and colleagues about the behaviour of partitioned vs. univariate LDSR and the impact of the $\chi^2$ threshold, with subsequent affirmation of those observations in this analysis of UK Biobank. Much of this also follows work from Jian Yang, Naomi Wray, Peter Visscher and others on stratified analyses for GREML.

Comparison of estimated intercept from univariate and partitioned LDSR. Orange reference line indicates equal values.

QQ plots for the significance of the intercept term in univariate (blue) and partitioned (orange) LDSR. Both plots restricted to phenotypes with univariate LDSR results. Interactive versions of these plots can be found on the UKBB heritability results site.

Specifically, we observe that partitioned LDSR without $\chi^2$ thresholding yields substantially smaller and less significant estimates of the regression intercept. The decrease in significance may in part be attributable to increased uncertainty in the partitioned model. The clear trend towards reduced point estimates however suggests that (a) imposing the $\chi^2$ threshold in two-stage estimation may flatten the LDSR regression, increasing the intercept, and/or (b) the partitioned model may capture variability in regional polygenic signal across the genome that is otherwise reflected in the intercept term of the univariate LDSR model.

Comparison of estimated $h^2$ from univariate and partitioned LDSR for phenotypes with $N_{eff}$ > 10,000. Orange reference line indicates equal values.

QQ plots for the significance of the estimated heritability in univariate (blue) and partitioned (orange) LDSR for all phenotypes. Both plots restricted to phenotypes with univariate LDSR results. Interactive versions of these plots can be found on the UKBB heritability results site.

Consistent with the hypothesis of the partitioned LDSR model providing a better fit to the polygenic signal, we also observe higher heritability estimates from partitioned LDSR compared to univariate LDSR. This is also consistent with $\chi^2$ thresholding flattening the regression in LDSR. The increases in heritability are modest but noticeable. The higher estimates of heritability do not yield higher statistical significance however, since the extra model complexity increases standard errors.

QQ plot of significance of coefficient for genomic annotations in partitioned LDSR. An interactive version of this plot, including a full legend for the 38 plotted annotations, can be found on the UKBB heritability results site.

Of course the partitioned LDSR model isn’t only intended to improve model fit; it also provides insight on genomic annotations associated with stronger polygenic signal. We observe strongly significant associations for multiple annotations across many phenotypes, with the most statistically significant results observed for (in descending order): CpG content, GERP scores, evolutionarily conserved regions, and predicted allele age. Differences in the power to detect heritability for different annotations, as well as the overlap among annotations fit in a joint model, mean there’s some nuance to interpreting these results both within and between phenotypes, but we anticipate there will be many interesting observations from these partitioned heritability results as time allows for more thorough investigation.

LDSR has issues at small effective sample sizes

The large sample size and comprehensive data collection efforts of the UK Biobank do not mean that GWAS is well-powered for every phenotype. In many cases the UK Biobank data has small effective sample sizes and thus low statistical power, especially for rare diseases (more on that below) and phenotypes that are only defined for a subset of the population (e.g. which eye is affected for individuals with cataracts).

This range of effective sample sizes allows us to evaluate the impact of sample size on the performance of LDSR. Unsurprisingly, power of LDSR to identify significant heritability scales with sample size (not shown, see the UKBB heritability results site). More notably, however, the average estimate of heritability tends towards zero at small effective sample sizes.

Loess fit of average estimated heritability as a function of effective sample size. Zoomed to show detail at small sample sizes. Phenotypes with $N_{eff}$ < 100 omitted to avoid outliers with extreme heritability estimates. An interactive version of this plot can be found on the UKBB heritability results site.

This trend towards an average estimate of zero heritability is noteworthy because there’s no reason to believe that the phenotypes that are rare in UK Biobank aren’t heritable. In fact, for many of these rare phenotypes there is documented evidence for heritability in other studies. Instead, it seems likely that this reflects a statistical artifact in LDSR, such that GWAS with small effective sample sizes have insufficient power for LDSR to detect polygenic effects, leading to near-zero estimates of heritability [6]. Identifying that this deflation exists primarily below an effective sample size of 5,000-10,000, with average heritability estimates stabilizing above that range, may provide useful information in determining sample size requirements for future LDSR analyses.

Population samples have limited information about rare phenotypes

Because of these sample size concerns we’ve focused on phenotypes with an effective sample size $N_{eff}$ > 10,000 for many of the results descriptions above. That may seem like a laughably lenient threshold given >300,000 individuals in the UK Biobank GWAS analysis set, but in fact only 26% of the 2,419 phenotypes analyzed as of this writing pass that threshold. Indeed, nearly 1,000 phenotypes have effective sample sizes below 1,000.

The reason for these low effective sample sizes is rare binary phenotypes. Many of the ICD diagnostic codes and medication codes are quite rare, leading to low case counts in the biobank and very small effective sample sizes. In many cases these phenotypes are even rarer than would be expected in 300,000 random individuals, since UK Biobank participants tend to be healthier than the general population.

Distribution of liability-scale heritability estimates for 854 binary phenotypes with effective sample sizes between 200 and 1000. Effective sample sizes below 200 omitted for clarity. An interactive version of this plot can be found on the UKBB heritability results site.

We’ve chosen to retain these rare phenotypes in the current LDSR analysis since there’s strong scientific interest in many of these outcomes, especially medical disorders. However we caution that in most cases these estimates are unstable [7]. The conversion factor between observed and liability scale heritability (see this blog post for more on that distinction) is quite large for these rare phenotypes, which further exaggerates the instability of heritability estimation for these rare binary phenotypes. For many disorders, more stable SNP heritability estimates may be available from other large GWAS in more focused ascertained case/control studies.
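To give a feel for how large that conversion factor can get, here is a minimal sketch assuming the standard liability-threshold conversion, and further assuming that the prevalence in the sample equals the prevalence in the population (i.e. no case/control ascertainment). Under those assumptions the factor is $K(1-K)/z^2$, where $K$ is the prevalence and $z$ is the standard normal density at the liability threshold. The function name and the example prevalences are hypothetical and are not taken from the UK Biobank results.

```python
from statistics import NormalDist


def liability_scale_factor(prevalence: float) -> float:
    """Multiplier from observed-scale to liability-scale heritability.

    Assumes sample prevalence equals population prevalence, so the factor
    is K * (1 - K) / z^2 with z the normal density at the threshold.
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    std_normal = NormalDist()
    # Liability threshold above which an individual is a case.
    threshold = std_normal.inv_cdf(1.0 - prevalence)
    density = std_normal.pdf(threshold)
    return prevalence * (1.0 - prevalence) / density ** 2


# Hypothetical prevalences: a common, an uncommon, and a rare phenotype.
factor_common = liability_scale_factor(0.5)
factor_uncommon = liability_scale_factor(0.01)
factor_rare = liability_scale_factor(0.001)
```

The factor grows quickly as the phenotype gets rarer, and it multiplies the standard error as well as the point estimate, which is why a small amount of noise on the observed scale turns into a large swing on the liability scale.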

Conclusion

LDSR analysis of >2,000 phenotypes in UK Biobank reaffirms substantial variance explained by common SNPs for a broad range of human traits and disorders. Analysis on this scale also provides new information on the performance of LDSR, including the potential contribution of model misspecification to the LDSR intercept, the corresponding benefits of partitioned LDSR, and the importance of sample size in getting stable estimates of heritability.

If you’d like to explore the LDSR results further or have questions about the analysis, then be sure to check out the UKBB Heritability results page, the other posts in the Neale Lab blog (especially the technical description of the LDSR analysis), and the UKBB ldsc code repository, or simply download the results file. And for any other questions or comments, we’d be happy to hear from you at: nealelab.ukb@gmail.com.

Footnotes:

Authored by Raymond Walters, with contributions from Claire Churchhouse and Ben Neale.

[1]

“Phenotype” refers to some observed characteristic of an individual. If you’re coming from the introductory post on heritability you’ll note that we previously described these as “traits”. The word trait carries some additional meaning in the scientific context so we’re switching to the term phenotype here for the more formal discussion of the results, but feel free to substitute “trait” wherever we say “phenotype” if that helps for understanding this post.

[2]

Based on p < 2e-5 (.05/2419) for Bonferroni adjusted significance, or p < .05 for nominal significance. This involves taking the LDSR standard errors at face value for hypothesis testing, which may have some issues (see “Bootstrapped standard errors” in the technical post), but it’s still a useful benchmark for discussing the result.
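As a small illustration of these thresholds, the sketch below converts a heritability estimate and its standard error into a p-value and classifies it. It assumes a one-sided normal test of $h^2 > 0$, which is an assumption of this example rather than something stated in this post; the function names and the example estimates are hypothetical.

```python
import math

N_PHENOTYPES = 2419
NOMINAL_ALPHA = 0.05
BONFERRONI_ALPHA = NOMINAL_ALPHA / N_PHENOTYPES  # roughly 2e-5


def one_sided_p_value(h2_estimate: float, standard_error: float) -> float:
    """P-value for h2 > 0, taking the standard error at face value."""
    z = h2_estimate / standard_error
    # erfc keeps precision in the far tail, unlike 1 - cdf(z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def classify(h2_estimate: float, standard_error: float) -> str:
    """Label an estimate by the strictest threshold it passes."""
    p = one_sided_p_value(h2_estimate, standard_error)
    if p < BONFERRONI_ALPHA:
        return "bonferroni"
    if p < NOMINAL_ALPHA:
        return "nominal"
    return "not significant"


# Hypothetical estimates and standard errors, for illustration only.
strong = classify(0.10, 0.02)    # z = 5.0
marginal = classify(0.017, 0.01) # z = 1.7
weak = classify(0.005, 0.01)     # z = 0.5
```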

[3]

For binary traits, effective sample size is defined as: $$N_{eff} = \frac{4}{\frac{1}{N_{cases}}+\frac{1}{N_{controls}}}$$

For continuous traits, the effective sample size is simply the number of non-missing individuals.
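A minimal sketch of the binary-trait formula follows. The function name and the case/control counts are hypothetical, chosen only to show how quickly $N_{eff}$ shrinks for a rare phenotype in a cohort of roughly 300,000 people.

```python
def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """N_eff = 4 / (1/N_cases + 1/N_controls) for a binary phenotype."""
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("case and control counts must be positive")
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


# Hypothetical counts, not taken from the UK Biobank results.
balanced_n_eff = effective_sample_size(150_000, 150_000)  # equals the total N
rare_n_eff = effective_sample_size(500, 299_500)          # just under 2,000
```

With a perfectly balanced split the effective sample size equals the total sample size, while 500 cases among 300,000 people behave like a balanced study of fewer than 2,000 individuals.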

[4]

Specifically, the LDSR “ratio” is defined as: $$\frac{Intercept - 1}{mean(\chi^2) - 1}$$

It’s expected to take values between zero (no inflation in the estimated intercept term) and one (intercept term is equal to the mean $\chi^2$, leaving no additional inflation in mean $\chi^2$ attributable to polygenic signal), though it can take values outside that range due to sampling variation in the intercept term.
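The definition translates directly into code. In this sketch the function name is hypothetical, the intercept of 1.025 echoes the average reported in this post, and the mean $\chi^2$ of 1.25 is an invented value used only for illustration.

```python
def ldsr_ratio(intercept: float, mean_chi2: float) -> float:
    """(intercept - 1) / (mean chi^2 - 1), as defined in this footnote."""
    if mean_chi2 <= 1.0:
        # With no inflation in mean chi^2 the ratio is not meaningful.
        raise ValueError("mean chi^2 must exceed 1 for the ratio to be defined")
    return (intercept - 1.0) / (mean_chi2 - 1.0)


# Intercept echoes the post's average; the mean chi^2 is hypothetical.
example_ratio = ldsr_ratio(intercept=1.025, mean_chi2=1.25)  # about 0.1
```

Raising an error when the mean $\chi^2$ is at or below one is a design choice of this sketch, made so that a meaningless ratio never propagates silently into downstream summaries.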

[5]

Interpretation of the LDSR ratio as the proportion of inflation from confounding is admittedly heuristic, but we find it provides a useful intuition for understanding the intercept estimate.

[6]

Strictly speaking, the observed relationship between effective sample size and LDSR heritability estimates could alternatively reflect an inflation of estimates at large sample sizes rather than deflation at low sample sizes. However, given the existing evidence for heritability across numerous traits, studies, and statistical estimation methods, it seems relatively safe to hypothesize that “true” average heritability across phenotypes is closer to 0.1 than 0, and thus that the observed trend for LDSR at least primarily reflects deflation at small sample sizes.

[7]

Given the severe instability at small sample sizes, the UKBB Heritability results browser omits phenotypes with effective sample size $N_{eff}$ < 200, and presents a warning on the phenotype results page for any phenotype with $N_{eff}$ < 5,000. For the sake of openness, the LDSR results for the omitted phenotypes are still included in the results file available for download.

Claire Churchhouse, September 20, 2017