GWAS – Genome-Wide Association Studies (Lectures 12–14)
The six key GWAS design considerations are:
1. Phenotype definition: Precise trait classification (binary, quantitative, or ordinal) is essential. Misclassification increases heterogeneity and reduces statistical power. Different disease subtypes should be distinguished.
2. Structure of common genetic variation (LD): Understanding LD blocks enables efficient genotyping with tag SNPs. LD patterns vary across populations, so study design must account for the target population's LD structure.
3. Sample size: Must be adequate for the complexity of the trait. Complex diseases need at least about 700–1,000 individuals, and often more; molecular traits may require only ~250. Larger samples detect more loci and make the results more reliable.
4. Population structure/stratification: Ancestry differences between subgroups can confound results. Must be assessed (via PCA/MDS) and corrected by including ancestry components as covariates.
5. Genome-wide significance and multiple testing correction: Testing millions of SNPs generates many false positives. Correction methods include Bonferroni, fixed threshold (P < 5 × 10⁻⁸), FDR, and permutation testing.
6. Replication: Findings must be validated in an independent cohort with a similar phenotype definition and genetic background. Successful replication in other populations is called generalization.
First, calculate expected haplotype frequencies under linkage equilibrium (product of allele frequencies):
Now compute D = observed − expected for each haplotype:
The absolute value |D| = 0.14 is the same for all four haplotypes. The sign is positive for two of them and negative for the other two, so the deviations cancel and the haplotype frequencies still sum to 1. Since D ≠ 0, the two SNPs are in linkage disequilibrium. The positive D for G-A means this haplotype is observed more often than expected under equilibrium, which indicates non-random co-inheritance of the two alleles.
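The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name ld_stats and the input frequencies are assumptions, chosen so that D comes out at 0.14, and they are not the values from the original exercise. The sketch also returns D′ and r², the two normalized LD measures used later for fine-mapping.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic SNPs.

    p_ab     -- observed frequency of the haplotype carrying allele a and allele b
    p_a, p_b -- frequencies of allele a (SNP 1) and allele b (SNP 2)
    Both SNPs must be polymorphic (frequencies strictly between 0 and 1).
    """
    # D = observed haplotype frequency minus the product expected under equilibrium
    d = p_ab - p_a * p_b

    # Largest |D| that the allele frequencies allow, depending on the sign of D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))

    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical frequencies (NOT taken from the original exercise)
d, d_prime, r2 = ld_stats(p_ab=0.56, p_a=0.6, p_b=0.7)
print(round(d, 2), round(d_prime, 3), round(r2, 3))
```

D depends on the allele frequencies, which is why D′ (D scaled by its maximum possible value) and r² (the squared correlation between the two SNPs) are usually reported alongside it.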
GWAS relies on indirect association through LD. Genotyping arrays use tag SNPs that are representative markers for LD blocks. When a tag SNP shows significant association, it may be in strong LD with the true causal variant, which was never directly genotyped. The detected signal reflects the correlation between the tag and causal variant, not direct causality.
Post-GWAS steps include:
1. Fine-mapping: Examine the LD structure (r², D′) around the top SNP to identify the boundaries of the associated region and narrow down candidate variants.
2. Gene annotation: Identify nearby genes using databases (e.g., GFF files, BioMart) and tools like BEDTools intersect, often within a defined window (e.g., 0.5 Mb).
3. Biological evaluation: Assess candidate gene relevance using GeneCards, GWAS Catalog, Mouse Genome Informatics (MGI), and scientific literature.
4. Functional enrichment: Use over-representation analysis (ORA) tools such as DAVID or EnrichR to test whether the candidate gene set is enriched for specific biological pathways.
5. Functional validation: Conduct experimental studies (e.g., gene expression, knockouts) to confirm the causal role of the candidate variant/gene.
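The gene annotation step (step 2) can be sketched as a simple interval overlap, similar in spirit to what BEDTools intersect does. This is a minimal illustration, not a replacement for those tools: the Gene tuple, the function genes_near_snp, the gene names, and all coordinates are hypothetical, and the 0.5 Mb window is assumed to extend on each side of the SNP.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # first base of the gene
    end: int    # last base of the gene


def genes_near_snp(genes, chrom, pos, flank=500_000):
    """Return names of genes overlapping [pos - flank, pos + flank] on chrom."""
    lo, hi = max(0, pos - flank), pos + flank
    return [
        g.name
        for g in genes
        if g.chrom == chrom and g.start <= hi and g.end >= lo
    ]


# Hypothetical gene models; real ones would come from a GFF file or BioMart.
genes = [
    Gene("GENE_A", "chr1", 1_200_000, 1_250_000),  # inside the window
    Gene("GENE_B", "chr1", 2_100_000, 2_150_000),  # same chromosome, too far away
    Gene("GENE_C", "chr2", 1_300_000, 1_340_000),  # different chromosome
]

# Hypothetical top SNP at chr1:1,500,000
candidates = genes_near_snp(genes, "chr1", 1_500_000)
print(candidates)
```

The resulting candidate list is what would then be passed on to biological evaluation and enrichment analysis.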
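(a) Bonferroni correction: threshold = α / m = 0.05 / 500,000 = 1 × 10⁻⁷, taking the conventional α = 0.05 and the m = 500,000 SNPs tested.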
(b) Why the fixed threshold is more appropriate: Bonferroni treats all 500,000 SNP tests as if they were independent, but many SNPs are correlated through LD, so the effective number of independent tests is lower. This makes Bonferroni overly conservative. The fixed threshold of P < 5 × 10⁻⁸ was derived from ~1 million independent LD blocks (Pe'er et al., 2008), which corresponds to 0.05 / 1,000,000. It is LD-aware, standardized across studies, and does not change with the genotyping platform used.
(c) FDR BH threshold for rank 10: (rank / m) × q = (10 / 500,000) × 0.05 = 1.0 × 10⁻⁶, with m = 500,000 tests and q = 0.05.
The 10th-ranked p-value must be below 1.0 × 10⁻⁶ to be declared significant under FDR.
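The threshold arithmetic in (a) and (c) can be checked with a short Python sketch. The function names are illustrative, and α = q = 0.05 is an assumption; it is the value that reproduces the 1.0 × 10⁻⁶ rank-10 threshold for 500,000 SNPs. The last function sketches the Benjamini-Hochberg step-up rule on a list of p-values.

```python
def bonferroni_threshold(alpha, m):
    """Per-test significance threshold when m tests are performed."""
    return alpha / m


def bh_rank_threshold(rank, m, q):
    """Benjamini-Hochberg threshold for the p-value at a given rank (1 = smallest)."""
    return rank / m * q


def bh_significant(p_values, q=0.05):
    """Return the indices of p-values declared significant by the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value is at or below its threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= bh_rank_threshold(rank, m, q):
            k = rank
    # Everything ranked at or above k is significant, even if it missed its own threshold
    return sorted(order[:k])


m = 500_000    # number of SNPs tested
alpha = 0.05   # assumed family-wise error rate / FDR level

bonf = bonferroni_threshold(alpha, m)
bh10 = bh_rank_threshold(10, m, alpha)
print(f"Bonferroni: {bonf:.1e}, BH rank 10: {bh10:.1e}")
```

Note that the BH threshold grows with rank, so it is less strict than Bonferroni for every rank above 1; at rank 1 the two thresholds coincide.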
(a) Good control: Most observed p-values follow the diagonal (expected under the null hypothesis). Only a few points deviate upward at the extreme tail (top-right), representing true associations. This indicates that the vast majority of SNPs behave as expected (no association), and only a handful show genuine signal.
(b) Uncorrected stratification: The entire distribution shifts upward — many points across the full range lie above the diagonal, not just the tail. This systematic inflation means that ancestry differences are creating widespread false signals, not just a few true associations.
λGC complements the QQ-plot: It quantifies the inflation numerically by comparing the median of observed chi-squared test statistics to the expected median under the null. λGC ≈ 1 confirms good control (matching the QQ-plot diagonal). λGC > 1 quantifies the degree of inflation seen in the QQ-plot. λGC < 1 flags overcorrection (too many PCs as covariates), which may cause false negatives. Together, the QQ-plot provides visual assessment and λGC provides a numerical summary — both are essential QC tools before interpreting GWAS results.
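As a rough sketch of how λGC is obtained from GWAS p-values, the snippet below converts each p-value back to a chi-squared statistic and divides the observed median by the expected one. It assumes 1-degree-of-freedom tests, and the function name lambda_gc and the toy p-values are illustrative only, not real study data.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-squared distribution with 1 df (about 0.455):
# P(Z^2 <= x) = 0.5 is equivalent to Phi(sqrt(x)) = 0.75.
EXPECTED_MEDIAN_CHI2 = _STD_NORMAL.inv_cdf(0.75) ** 2


def lambda_gc(p_values):
    """Genomic inflation factor from two-sided p-values of 1-df tests.

    Each p-value must lie in (0, 1].
    """
    # A two-sided p-value p corresponds to z = Phi^-1(p / 2), and chi2 = z^2
    chi2 = [_STD_NORMAL.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / EXPECTED_MEDIAN_CHI2


# Toy data: evenly spaced p-values mimic a well-behaved null distribution
null_p = [(i + 0.5) / 1000 for i in range(1000)]

# Crude toy inflation: squaring pushes every p-value toward zero
inflated_p = [p ** 2 for p in null_p]

print(round(lambda_gc(null_p), 2), round(lambda_gc(inflated_p), 2))
```

Because the median is used rather than the mean, a handful of genuinely associated SNPs in the extreme tail barely moves λGC, whereas a shift of the whole distribution does.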