Genotyped SNPs in UK Biobank failing Hardy-Weinberg equilibrium test

Testing SNPs for deviation from Hardy-Weinberg equilibrium (HWE) is a standard way to detect potential genotyping errors. In the rapid-GWAS marker QC, we observed 44,184 genotyped autosomal variants with HWE p-value < 10^-12 when computed using genotypes of 361,194 white British individuals. Of these, 15,069 are retained in the imputed bgen files with INFO = 1 and HWE p-value < 10^-12. Of particular concern, 3,987 SNPs have no homozygous alternative genotypes despite having MAF > 1%.
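To see why zero homozygous alternative genotypes at MAF > 1% is so striking, here is a rough back-of-the-envelope sketch. It assumes the full 361,194 samples with no missing genotypes and uses a Poisson approximation for the chance of observing no homozygotes; both are simplifications of ours, and the function name is only illustrative.

```python
import math

def expected_hom_alt(n_samples: int, maf: float) -> float:
    """Expected number of minor-allele homozygotes under HWE: n * q^2."""
    return n_samples * maf ** 2

n = 361_194   # white British individuals used in the QC
maf = 0.01    # lower bound on MAF for the SNPs of concern

expected = expected_hom_alt(n, maf)
# Poisson approximation (an assumption) to P(zero homozygotes)
p_zero = math.exp(-expected)

print(f"expected hom-alt count under HWE: {expected:.1f}")
print(f"approximate P(no hom-alt genotypes): {p_zero:.1e}")
```

Even at the 1% lower bound, roughly 36 homozygous alternative individuals would be expected under HWE, so observing none is very unlikely to happen by chance.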

One hypothesis for why these SNPs were not filtered is that the QC criteria relied on a per-batch HWE test. According to Supplementary Note S2.3 of Bycroft, C., et al. (Nature, 2018), UK Biobank applied marker-based QC to the raw genotypes using a per-batch approach, where each batch contains up to ~4,000 samples (after restricting to 463,844 ancestrally homogeneous individuals). Within each batch, SNPs with HWE p-value < 10^-12 were set to missing for samples in that batch. If no HWE test across the full sample was performed after the per-batch QC, that would explain the high number of HWE-failing SNPs seen in the genotype data.

For example, rs2237897 (MAF = 4.2% in gnomAD non-Finnish European) showed HWE p-value = 8.1 × 10^-233 (counts of homozygous alternative / heterozygous / homozygous reference: 0 / 26777 / 321577; missing: 12840). However, only a single batch out of 106 failed the per-batch HWE test (p < 10^-12). As a result, rs2237897 in the imputed bgen shows INFO = 1 and HWE p-value = 6.3 × 10^-134. (Because missing genotypes were imputed for genotyped variants, the HWE p-value can differ from the one calculated using only non-missing genotypes, and the INFO score can be less than 1.) Finally, the SNP intensity plot for rs2237897 in UK Biobank below confirms that the deviation from HWE is indeed a result of poor genotyping quality.

Scatterplot of SNP intensity for rs2237897 in UK Biobank courtesy of and brought to our attention by Adam Butterworth (@aidanbutty). Thanks Adam!

Retaining these SNPs could not only reduce imputation quality at the surrounding loci but could also be of substantial concern for interpreting associations around these HWE-failing variants, particularly in downstream analyses such as fine-mapping. Although we excluded all markers with HWE p-value ≤ 10^-10 from the rapid-GWAS results (except for those annotated as protein-coding by VEP), we recommend that users re-evaluate HWE when interpreting GWAS signals at individual loci.
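For readers who want to re-check HWE from genotype counts, the following is a minimal sketch of an exact HWE test in the style of Wigginton et al. (2005). We do not claim it is the test or software used to produce the p-values quoted above, so its output need not match them exactly. The per-batch counts are hypothetical: they simply divide the pooled rs2237897 counts evenly across 106 batches to illustrate how a SNP can pass a per-batch threshold while failing in the full sample.

```python
def hwe_exact_pvalue(n_het: int, n_hom_a: int, n_hom_b: int) -> float:
    """Two-sided exact HWE p-value from genotype counts."""
    n_hom_rare = min(n_hom_a, n_hom_b)
    n_hom_common = max(n_hom_a, n_hom_b)
    n = n_het + n_hom_rare + n_hom_common
    if n == 0:
        return 1.0
    rare = 2 * n_hom_rare + n_het          # copies of the rarer allele

    # Relative probabilities of each possible heterozygote count,
    # scaled so that the most likely count has probability 1.
    probs = [0.0] * (rare + 1)
    mid = rare * (2 * n - rare) // (2 * n)
    if mid % 2 != rare % 2:                # het count shares parity with rare
        mid += 1
    probs[mid] = 1.0
    total = 1.0

    # Walk down from the mode: two fewer hets, one more of each homozygote.
    het, hom_r = mid, (rare - mid) // 2
    hom_c = n - het - hom_r
    while het > 1:
        probs[het - 2] = (probs[het] * het * (het - 1)
                          / (4.0 * (hom_r + 1) * (hom_c + 1)))
        total += probs[het - 2]
        het, hom_r, hom_c = het - 2, hom_r + 1, hom_c + 1

    # Walk up from the mode: two more hets, one fewer of each homozygote.
    het, hom_r = mid, (rare - mid) // 2
    hom_c = n - het - hom_r
    while het <= rare - 2:
        probs[het + 2] = (probs[het] * 4.0 * hom_r * hom_c
                          / ((het + 2) * (het + 1)))
        total += probs[het + 2]
        het, hom_r, hom_c = het + 2, hom_r - 1, hom_c - 1

    # Sum the configurations that are no more likely than the observed one.
    p = sum(x for x in probs if x <= probs[n_het]) / total
    return min(1.0, p)


# Pooled counts for rs2237897 as reported above (hom alt / het / hom ref).
pooled_p = hwe_exact_pvalue(n_het=26777, n_hom_a=0, n_hom_b=321577)

# Hypothetical single batch: the pooled counts split evenly over 106 batches.
batch_p = hwe_exact_pvalue(n_het=253, n_hom_a=0, n_hom_b=3034)

threshold = 1e-12
print(f"pooled p-value:    {pooled_p:.1e} (fails: {pooled_p < threshold})")
print(f"per-batch p-value: {batch_p:.1e} (fails: {batch_p < threshold})")
```

Under this even-split assumption, a single batch has far too few samples for the missing homozygotes to reach p < 10^-12, whereas the pooled counts fail the same threshold by a wide margin.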

Authored by Masahiro Kanai, with input from Daniel Howrigan, Mark Daly, and Hilary Finucane

Daniel Howrigan