GWAS – Genome-Wide Association Studies (Lectures 12–14)

📝GWAS — Concepts, LD & Study Design
Q1 Easy
What is the primary goal of a Genome-Wide Association Study (GWAS)?
A. To sequence the entire genome of an individual at single-nucleotide resolution
B. To identify statistical associations between genetic variants and phenotypic traits across the genome
C. To determine the complete haplotype structure of an organism
D. To identify all protein-coding genes in the genome
Explanation
GWAS analyzes DNA sequence variation across a large population to find statistical associations between specific genetic variants (typically SNPs) and phenotypic traits. It does not sequence the whole genome — it genotypes known variant positions. The ultimate goals include understanding genetic basis of traits, disease prediction/prevention, and improving breeding programs.
Q2 Medium
According to the Common Disease/Common Variant (CD/CV) hypothesis, which statement is correct?
A. Common diseases are caused by single rare mutations with large effect sizes
B. Common variants each explain a large proportion of disease heritability on their own
C. Common diseases are influenced by many common variants, each with a small effect size
D. Common variants have high penetrance and follow Mendelian inheritance patterns
Explanation
The CD/CV hypothesis states that common diseases are influenced by genetic variants that are also common in the population. Each individual variant has a small effect size (low penetrance), but their aggregate, polygenic effect explains the observed heritability. This is in contrast to rare Mendelian disorders caused by single high-penetrance mutations (like Huntington's disease).
Q3 Easy
What does Linkage Disequilibrium (LD) describe?
A. The non-random association of alleles at different loci within a population
B. The random segregation of alleles during meiosis
C. The physical distance between two genes on a chromosome
D. The rate of recombination between two loci
Explanation
LD describes the tendency of certain alleles at different loci (commonly SNPs) to be inherited together more often than expected by chance. When two SNPs are in LD, the presence of one allele can predict the presence of another nearby allele. LD is shaped by recombination history, genetic drift, selection, and population history.
Q4 Medium
What is the role of tag SNPs in GWAS?
A. They are always the causal variants responsible for disease
B. They mark the boundaries of chromosomes
C. They are used to increase the number of SNPs tested in GWAS
D. They are representative SNPs that capture genetic variation within an LD block without genotyping every variant
Explanation
Because SNPs in LD carry redundant information, researchers don't need to genotype every variant. Tag SNPs serve as proxies for other variants in the same LD block. This reduces genotyping costs and data complexity while still enabling genome-wide coverage. If a tag SNP is associated with a trait, the actual causal variant is likely nearby and in LD with it — this is called indirect association.
Q5 Hard
Given two bi-allelic SNPs with allele frequencies q₁ = 0.2 and q₂ = 0.3, and observed haplotype frequency q₁₂ = 0.20, what is the value of D?
A. 0.06
B. 0.14
C. 0.20
D. −0.14
Explanation
D = q₁₂ − q₁ × q₂ = 0.20 − (0.2 × 0.3) = 0.20 − 0.06 = 0.14. The expected haplotype frequency under linkage equilibrium is 0.06, but the observed is 0.20, indicating strong LD. Since D ≠ 0, the two loci are in linkage disequilibrium — the alleles co-occur more often than expected by chance.
Q6 Medium
What does D′ = 1 indicate about two SNPs?
A. The two SNPs are in perfect linkage equilibrium
B. The two SNPs have identical allele frequencies
C. Complete LD: the strongest possible non-random association given the allele frequencies
D. The two SNPs are on different chromosomes
Explanation
D′ is D normalized by its maximum possible value given the allele frequencies. D′ = 1 means complete LD, the strongest possible association between the two SNPs. D′ = 0 means no LD. Note that D′ = 1 does not necessarily mean r² = 1 — perfect correlation (r² = 1) additionally requires that the allele frequencies are the same at both loci.
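The relationship between D, D′ and r² can be checked numerically. The sketch below is a minimal illustration, assuming the usual normalization of D by its maximum attainable value (which depends on the sign of D) and allele frequencies strictly between 0 and 1; the function name `ld_measures` is hypothetical. It reuses the numbers from Q5.

```python
def ld_measures(q1, q2, q12):
    """Return (D, D_prime, r2) for two bi-allelic SNPs.

    q1, q2: frequency of the allele of interest at SNP 1 and SNP 2
            (assumed to lie strictly between 0 and 1).
    q12:    observed frequency of the haplotype carrying both alleles.
    """
    d = q12 - q1 * q2
    # The largest |D| that the allele frequencies allow depends on the sign of D.
    if d >= 0:
        d_max = min(q1 * (1 - q2), (1 - q1) * q2)
    else:
        d_max = min(q1 * q2, (1 - q1) * (1 - q2))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r2 is the squared correlation between the two alleles.
    r2 = d * d / (q1 * (1 - q1) * q2 * (1 - q2))
    return d, d_prime, r2


d, d_prime, r2 = ld_measures(0.2, 0.3, 0.20)
print(f"D = {d:.2f}, D' = {d_prime:.2f}, r^2 = {r2:.2f}")
```

With these inputs D′ reaches 1 while r² stays at roughly 0.58, because the two allele frequencies (0.2 and 0.3) differ.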
Q7 Tricky
Which LD measure is most commonly used to assess the coverage quality of a GWAS genotyping array?
A. r², because it measures the correlation between tag SNP and causal variant
B. D′, because it indicates complete LD regardless of allele frequency
C. D, because it directly measures haplotype frequency deviations
D. χ², because it tests significance of association
Explanation
r² is the preferred measure for GWAS array design because it directly quantifies the correlation between a tag SNP and any other SNP. GWAS genotyping products select tag SNPs that guarantee coverage of common polymorphisms at some predetermined threshold of r². The lecture specifically states that LD between the causal polymorphism and a tested tag SNP is "measured by r²" and affects power. D′ can equal 1 even when prediction is poor (if allele frequencies differ), making it less useful for assessing genotyping coverage.
Q8 Easy
Over many generations, what happens to linkage disequilibrium between two loci?
A. LD always increases due to genetic drift
B. LD remains constant regardless of distance
C. LD increases with physical distance between the loci
D. Recombination gradually breaks LD, with distant loci losing LD faster than close loci
Explanation
Recombination events accumulate over generations and break apart linked regions. Variants that are physically close on a chromosome are less likely to be separated by recombination and remain in LD longer. Variants farther apart are more likely to be separated, moving toward linkage equilibrium. This leaves only small blocks of variants (haplotype blocks) still in LD.
Q9 Medium
In the context of GWAS phenotype definition, which phenotype type is analyzed using logistic regression?
A. Quantitative traits like height or BMI
B. Binary (dichotomous) traits like disease presence/absence
C. Ordinal traits like disease severity scores
D. All phenotype types use logistic regression in GWAS
Explanation
Binary (dichotomous) traits use logistic regression or contingency table methods (e.g., Fisher's exact test, chi-squared). Quantitative (continuous) traits use linear regression / generalized linear models (GLM). Semi-quantitative/ordinal traits may require ordinal regression or non-parametric tests. The statistical test depends primarily on the phenotype type.
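As an illustration of the contingency-table route for a binary trait, the sketch below runs a chi-squared test on a 2×2 table of allele counts (cases vs. controls, minor vs. major allele). It is a minimal sketch: the function name `chi2_allelic_test` and the counts are hypothetical, all row and column totals are assumed to be non-zero, and the p-value uses the identity P(χ²₁ > x) = erfc(√(x/2)) for one degree of freedom.

```python
import math


def chi2_allelic_test(case_minor, case_major, ctrl_minor, ctrl_major):
    """Chi-squared test (1 df) on a 2x2 table of allele counts."""
    observed = [[case_minor, case_major], [ctrl_minor, ctrl_major]]
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if allele and case status were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Upper-tail probability of a chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts for one SNP.
chi2, p = chi2_allelic_test(60, 140, 40, 160)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```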
Q10 Tricky
Why can a poorly defined phenotype in a case-control GWAS reduce statistical power?
A. It increases the number of SNPs that need to be tested
B. It reduces linkage disequilibrium across the genome
C. Misclassified individuals increase heterogeneity in causal polymorphisms, diluting the genetic signal
D. It causes the Bonferroni correction threshold to become more stringent
Explanation
Non-specific case–control definitions increase heterogeneity in the underlying causal genetic polymorphisms and non-genetic risk factors. For example, mixing different disease subtypes (e.g., inflammatory vs. non-inflammatory forms) in the "case" group means different causal variants are at play, diluting the signal for any single one. This leads to decreased power for detection, spurious associations, and invalid conclusions.
Q11 Medium
Which sample size would typically be needed for a GWAS studying complex diseases like diabetes?
A. ≥700–1000+ individuals due to small effect sizes and polygenic nature
B. ~50 individuals since GWAS genotyping arrays are highly precise
C. ~250 individuals, same as for molecular traits
D. Sample size does not affect GWAS power for complex diseases
Explanation
Complex diseases require large cohorts (≥700–1000+ individuals) because of their polygenic nature and the involvement of environmental factors. Molecular traits (e.g., metabolite levels) can often be studied with smaller cohorts (~250 individuals) because they are more directly linked to genetic variation. External traits (height, hair color) need medium sample sizes.
Q12 Easy
What is the standard genome-wide significance threshold used in GWAS?
A. P < 0.05
B. P < 0.01
C. P < 1 × 10⁻⁶
D. P < 5 × 10⁻⁸
Explanation
The widely accepted genome-wide significance threshold is P < 5 × 10⁻⁸. It corresponds to correcting α = 0.05 for approximately 1 million independent LD blocks across the genome (0.05 / 1,000,000 = 5 × 10⁻⁸; Pe'er et al., 2008). It is now standard across GWAS studies: a SNP whose p-value falls below this threshold is considered significantly associated with the trait.
Q13 Hard
If you test 150,000 SNPs at α = 0.05 without correction, approximately how many false positives would you expect?
A. 150
B. 7,500
C. 75
D. 750
Explanation
At α = 0.05, you expect 5% of all tested SNPs to appear significant by chance alone (even if no real associations exist). 150,000 × 0.05 = 7,500 expected false positives. This is exactly why multiple testing correction is critical in GWAS.
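The expected count can be reproduced with a small simulation. This is a minimal sketch, assuming that p-values are uniformly distributed between 0 and 1 when no SNP is truly associated and that the tests are independent; the function name `count_null_hits` and the seed are illustrative choices.

```python
import random


def count_null_hits(n_tests, alpha, seed=0):
    """Count how many uniform(0, 1) p-values fall below alpha."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_tests) if rng.random() < alpha)


n_tests, alpha = 150_000, 0.05
print("expected false positives:", n_tests * alpha)
print("simulated false positives:", count_null_hits(n_tests, alpha))
```

The simulated count varies with the seed but stays close to the expected 7,500.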
Q14 Medium
What is the main disadvantage of the Bonferroni correction in GWAS?
A. It is computationally too expensive for large datasets
B. It does not correct for multiple testing at all
C. It is overly conservative because it assumes all tests are independent, ignoring LD between SNPs
D. It produces too many false positives compared to FDR
Explanation
Bonferroni correction divides α by the total number of SNPs tested, assuming each test is independent. However, in GWAS, many SNPs are correlated due to linkage disequilibrium (LD), so the effective number of independent tests is lower. This makes Bonferroni overly conservative — it may miss true associations (increased false negatives). Additionally, the threshold changes depending on the genotyping panel used (panel-dependence).
Q15 Tricky
The fixed genome-wide significance threshold of P < 5 × 10⁻⁸ was derived based on:
A. An estimated 1 million independent LD blocks across the genome
B. The total number of SNPs on the Illumina 1M array
C. The number of protein-coding genes in the human genome
D. The False Discovery Rate method at q = 0.05
Explanation
The threshold P < 5 × 10⁻⁸ comes from correcting for approximately 1 million independent LD blocks across the genome (Pe'er et al., 2008). Rather than correcting per dataset, this predefined threshold is LD-aware and standardized across studies. It is neither based on a specific genotyping array nor on FDR — it's an empirical, fixed correction derived from the estimated number of independent tests genome-wide.
Q16 Medium
How does the False Discovery Rate (FDR) approach differ from Bonferroni correction?
A. FDR is more conservative and rejects fewer SNPs
B. FDR controls the proportion of false positives among significant results, rather than controlling the family-wise error rate
C. FDR requires permutation of genotype data
D. FDR uses a fixed threshold of P < 5 × 10⁻⁸
Explanation
FDR (Benjamini-Hochberg) controls the expected proportion of false positives among all declared significant associations. It is less conservative than Bonferroni and better suited for exploratory research because it retains more true positives. The procedure ranks p-values, calculates thresholds q(i) = (i/m) × α, and identifies the largest p-value meeting the criterion. Bonferroni controls the family-wise error rate (probability of ≥1 false positive).
Q17 Medium
What is population stratification in the context of GWAS?
A. The random sampling of individuals from a homogeneous population
B. The sequencing of DNA in distinct population layers
C. The division of a population into cases and controls for analysis
D. The presence of subgroups differing in genetic ancestry and trait prevalence, causing confounding in GWAS
Explanation
Population stratification occurs when a study population contains subgroups that differ in both genetic ancestry and trait prevalence. SNPs associated with ancestry may falsely appear associated with the disease (confounding by ancestry). For example, if Southern Europeans have both higher disease rates (for environmental reasons) and different allele frequencies, a GWAS might detect spurious associations. If uncorrected, it creates false positives and can mask true associations.
Q18 Easy
What is the purpose of MDS (Multidimensional Scaling) in GWAS?
A. To detect and visualize population structure by reducing high-dimensional genotype data
B. To calculate p-values for SNP associations
C. To perform multiple testing correction
D. To phase haplotypes from genotype data
Explanation
MDS is a dimensionality reduction technique that summarizes genome-wide genetic variation into a few dimensions. Each point on an MDS plot represents an individual, and clusters indicate groups of genetically similar individuals. If distinct clusters appear, it signals population stratification that must be accounted for in GWAS. MDS outputs can be used as covariates in the association model to correct for structure.
Q19 Tricky
In the mouse body weight GWAS example, why did almost every SNP appear significantly associated with body weight?
A. Body weight is controlled by every locus in the genome
B. The genotyping array had a very high error rate
C. Population structure between wild-derived and classical inbred strains confounded the results: SNPs differentiating strains correlated with weight differences
D. The Bonferroni correction threshold was too lenient
Explanation
The mice came from genetically distinct strains (wild-derived vs. classical inbred), and wild-derived strains had much lower body weight (3–4× difference). MDS revealed two major genetic clusters. The GWAS wasn't detecting causal genes — it was detecting genetic background differences that correlated with body weight. Every SNP differentiating the two strain groups appeared associated. This is a textbook example of population stratification creating massive false associations.
Q20 Medium
What does a genomic control inflation factor (λGC) of approximately 1.00 indicate?
A. The study has many true positive associations
B. No inflation: the test statistics match the expected null distribution, suggesting proper population structure control
C. Significant overcorrection for population structure
D. The Bonferroni threshold was applied correctly
Explanation
λGC ≈ 1.00 is the ideal scenario — the observed test statistics match the null distribution, indicating no inflation from population stratification or other confounding. λGC > 1 indicates inflation (possible uncorrected confounding, leading to false positives). λGC < 1 suggests overcorrection (too many covariates), potentially causing false negatives.
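One way to compute λGC from a list of p-values is sketched below. It is a minimal illustration, assuming each p-value comes from a two-sided test that maps to a 1-df chi-squared statistic (χ² = z²), and that no p-value is so small that 1 − p/2 rounds to exactly 1.0; the function name `lambda_gc` is hypothetical.

```python
from statistics import NormalDist, median


def lambda_gc(p_values):
    """Genomic inflation factor: median observed chi2 / median expected chi2."""
    std_normal = NormalDist()
    # Convert each two-sided p-value back to a 1-df chi-squared statistic.
    chi2_stats = [std_normal.inv_cdf(1 - p / 2) ** 2 for p in p_values]
    # Median of a 1-df chi-squared distribution under the null (about 0.455).
    expected_median = std_normal.inv_cdf(0.75) ** 2
    return median(chi2_stats) / expected_median


# Evenly spaced p-values mimic a perfectly calibrated null distribution.
null_p = [(i + 0.5) / 1001 for i in range(1001)]
# Squaring the p-values makes them systematically too small (inflation).
inflated_p = [p ** 2 for p in null_p]

print(f"null:     lambda = {lambda_gc(null_p):.2f}")
print(f"inflated: lambda = {lambda_gc(inflated_p):.2f}")
```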
Q21 Medium
On a QQ-plot from a well-controlled GWAS, what pattern indicates true associations?
A. All points lie exactly on the diagonal line
B. An overall upward curve with all points above the diagonal
C. All points fall below the diagonal line
D. Most points follow the diagonal, with an upward deviation only at the tail (top-right corner)
Explanation
In a well-controlled GWAS: most SNPs (not associated with the trait) should fall along the diagonal, and only a few SNPs with true associations deviate upward at the tail. An overall upward curve (many points above the line) would indicate systematic inflation due to population stratification or technical artifacts. If all points are on the diagonal with no deviation, there may be no true associations detected.
Q22 Easy
In a Manhattan plot, what does the Y-axis represent?
A. −log₁₀(p-value), so that more significant SNPs appear as higher points
B. The physical position of each SNP along the chromosome
C. The allele frequency difference between cases and controls
D. The effect size (beta coefficient) of each SNP
Explanation
In a Manhattan plot, the X-axis shows SNP positions along chromosomes, and the Y-axis shows −log₁₀(p-value). This transformation means smaller p-values (more significant) appear as higher points. Peaks indicate clusters of SNPs with significant associations, and a horizontal line typically marks the genome-wide significance threshold.
Q23 Tricky
In GWAS, a single isolated SNP signal above the significance threshold is most likely:
A. A definitive causal variant for the trait
B. A tag SNP in perfect LD with many other variants
C. A potential false positive caused by genotyping errors or mapping issues
D. Always more reliable than a peak with multiple linked SNPs
Explanation
A true GWAS association usually appears as a peak with several linked SNPs (in LD with each other), because LD means multiple nearby SNPs carry similar association signals. A single isolated SNP signal — without supporting nearby SNPs — is suspicious and may represent a false positive from genotyping errors or mapping issues. The presence of multiple linked SNPs strengthens confidence in the association.
Q24 Medium
In the additive genetic model used in GWAS, how are genotypes coded?
A. AA = 1, AG = 0, GG = −1
B. By the count of minor alleles: 0 (homozygous major), 1 (heterozygous), 2 (homozygous minor)
C. AA = 0, AG = 0, GG = 1 (dominant model)
D. Genotypes are not numerically coded in GWAS
Explanation
The additive model is the most commonly used in GWAS. It counts the number of copies of the minor allele: 0 (homozygous major, e.g., AA), 1 (heterozygous, e.g., AG), and 2 (homozygous minor, e.g., GG). A linear regression then tests whether the number of minor alleles is predictive of the phenotype value, assuming a trend per copy of the minor allele.
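The additive coding and the per-allele trend can be sketched as follows. This is a minimal illustration with hypothetical genotypes and phenotype values; the helper names `additive_code` and `regression_slope` are not from the lecture, and the slope is an ordinary least-squares estimate without covariates.

```python
def additive_code(genotype, minor_allele):
    """Count copies of the minor allele in a genotype string such as 'AG'."""
    return genotype.count(minor_allele)


def regression_slope(x, y):
    """Ordinary least-squares slope of y on x (effect per minor-allele copy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data: G is the minor allele, phenotype rises by 2 units per copy.
genotypes = ["AA", "AG", "GG", "AG", "AA", "GG"]
phenotype = [10.0, 12.0, 14.0, 12.0, 10.0, 14.0]

codes = [additive_code(g, "G") for g in genotypes]
print("codes:", codes)
print("effect per minor allele:", regression_slope(codes, phenotype))
```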
Q25 Medium
Why is covariate adjustment important in GWAS linear mixed models?
A. It increases the number of SNPs that can be tested
B. It replaces the need for multiple testing correction
C. It eliminates all environmental effects on phenotype
D. It reduces spurious associations due to sampling artifacts, biases, and known confounding factors like sex, age, and population substructure
Explanation
The model Y = Xb + Zu + e combines fixed effects (Xb: known covariates such as sex, age, study site, and population substructure) with random effects (Zu) and a residual error term (e). Covariate adjustment reduces spurious associations caused by sampling artifacts or biases in study design. However, each covariate uses an additional degree of freedom, which may reduce statistical power. Population substructure is one of the most important covariates.
Q26 Easy
What is the purpose of replication in GWAS?
A. To validate that identified associations are robust and not statistical artifacts, using an independent sample
B. To increase the number of SNPs tested in the original study
C. To apply a different multiple testing correction method
D. To genotype the same individuals using a different SNP array
Explanation
Replication is the gold standard for validation. It should be done in an independent dataset, drawn from the same population, with similar phenotype definition and genotyping platform. Once confirmed in the target population, other populations may be sampled — successful replication in additional populations is called "generalization."
Q27 Medium
What is the concept of "indirect association" in GWAS?
A. A SNP that is found to be causal through functional studies
B. An association detected only in meta-analyses, not in individual studies
C. When the detected SNP is not the causal variant but is in strong LD with it, acting as a proxy
D. An association caused by population stratification rather than biology
Explanation
Indirect association is central to GWAS interpretation. The SNP that shows up in the GWAS results is often not the causal variant, but is in strong LD with it. The real causal variant may not have been genotyped, but its "tag" shows association due to LD. This is why further fine-mapping and functional studies are needed after GWAS to identify the true causal variants.
Q28 Medium
What is the main purpose of genotype imputation in meta-GWAS?
A. To correct genotyping errors in individual studies
B. To generate a common set of SNPs across studies that used different genotyping platforms
C. To increase the sample size of individual GWAS studies
D. To reduce the number of SNPs tested and thus relax the significance threshold
Explanation
Meta-analysis requires assessing the effect of the same allele across studies. When different studies use different genotyping platforms (with different SNP sets), imputation estimates genotypes for SNPs not directly genotyped by exploiting known LD patterns and haplotype frequencies from reference panels like HapMap or 1000 Genomes. This creates a common set of SNPs for comparison.
Q29 Tricky
Which statement about permutation testing for multiple testing correction in GWAS is correct?
A. It is the most commonly used method in standard GWAS because of its simplicity
B. It assumes all SNP tests are independent, like Bonferroni
C. It controls the false discovery rate rather than family-wise error rate
D. It preserves LD structure and generates empirical p-values, but is computationally too intensive for routine GWAS use
Explanation
Permutation testing shuffles the phenotype labels across individuals many times, leaving each individual's genotypes intact, to generate an empirical null distribution. Because the genotypes themselves are never rearranged, the LD structure between SNPs is preserved, which yields accurate empirical p-values. However, it is computationally intensive, especially with millions of SNPs and tens of thousands of samples, making it impractical for routine GWAS. It is powerful but rarely used in standard analysis.
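A toy version of the idea is sketched below. It assumes one common implementation, the max-statistic permutation approach: case/control labels are permuted while every individual's genotype vector stays intact, and each SNP's observed statistic is compared with the largest statistic across all SNPs in each permutation. The function names, the test statistic (absolute difference in mean minor-allele count) and the data are all hypothetical.

```python
import random


def mean_diff_stat(genotypes, is_case):
    """Absolute difference in mean minor-allele count, cases vs. controls."""
    case_vals = [g for g, c in zip(genotypes, is_case) if c]
    ctrl_vals = [g for g, c in zip(genotypes, is_case) if not c]
    return abs(sum(case_vals) / len(case_vals) - sum(ctrl_vals) / len(ctrl_vals))


def max_stat_permutation_p(snp_matrix, is_case, n_perm=200, seed=0):
    """Empirical, multiple-testing-adjusted p-value for each SNP."""
    rng = random.Random(seed)
    observed = [mean_diff_stat(snp, is_case) for snp in snp_matrix]
    labels = list(is_case)
    exceed = [0] * len(snp_matrix)
    for _ in range(n_perm):
        # Only the labels move, so correlations between SNPs (LD) are untouched.
        rng.shuffle(labels)
        perm_max = max(mean_diff_stat(snp, labels) for snp in snp_matrix)
        for i, obs in enumerate(observed):
            if perm_max >= obs - 1e-12:
                exceed[i] += 1
    return [(1 + e) / (1 + n_perm) for e in exceed]


# Hypothetical data: 10 cases followed by 10 controls, two SNPs.
is_case = [True] * 10 + [False] * 10
snp_matrix = [
    [2] * 10 + [0] * 10,  # perfectly separates cases from controls
    [0, 1] * 10,          # unrelated to case status
]
print(max_stat_permutation_p(snp_matrix, is_case))
```

Even this toy loop recomputes every SNP statistic in every permutation, which is why the cost grows quickly with realistic SNP and sample counts.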
Q30 Easy
Which of the following is NOT a typical GWAS application in livestock?
A. Identifying genetic risk factors for schizophrenia
B. Finding causal genes for milk yield and quality
C. Identifying genes for coat color
D. Discovering variants for disease resistance
Explanation
Schizophrenia is a human complex disease, not a livestock trait. GWAS in livestock focuses on economically important traits such as milk yield/quality, fertility, growth, coat color, disease resistance, and performance traits (e.g., racing ability in horses). Crop GWAS focuses on yield, flowering time, drought tolerance, and nutritional value.
Q31 Hard
In the FDR (Benjamini-Hochberg) procedure with 20 SNPs at q* = 0.05, the BH threshold for the 3rd ranked p-value is:
A. 0.0025
B. 0.0050
C. 0.0075
D. 0.0100
Explanation
The BH threshold for rank i is: q(i) = (i/m) × α. For the 3rd ranked p-value: q(3) = (3/20) × 0.05 = 0.0075. The procedure ranks all p-values from smallest to largest, assigns each a BH threshold, and the largest p-value that is at or below its own threshold becomes the cutoff. All SNPs with p-values at or below that cutoff are declared significant.
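The procedure described above can be sketched in a few lines. This is a minimal illustration of the Benjamini-Hochberg step-up rule; the function names and the ten example p-values are hypothetical.

```python
def bh_threshold(rank, n_tests, alpha=0.05):
    """BH threshold for the p-value at a given rank (1 = smallest)."""
    return rank / n_tests * alpha


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the test is declared significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is at or below its threshold.
        if p_values[idx] <= bh_threshold(rank, m, alpha):
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


print(bh_threshold(3, 20))  # threshold for the 3rd of 20 ranked p-values

# Hypothetical p-values for 10 SNPs.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values))
```

In this example only the two smallest p-values pass, because 0.039 already exceeds its rank-3 threshold of 0.015 and no later rank meets its own threshold.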
Q32 Medium
What is the formula for the basic LD measure D?
A. D = q₁ × q₂ − q₁₂
B. D = q₁₂ − q₁ × q₂
C. D = q₁₂ / (q₁ × q₂)
D. D = (q₁ + q₂) − q₁₂
Explanation
D = q₁₂ − q₁ × q₂, where q₁₂ is the observed haplotype frequency and q₁ × q₂ is the expected haplotype frequency under linkage equilibrium (random combination of alleles). When D = 0, the loci are in linkage equilibrium. When D ≠ 0, the loci are in LD. However, D is sensitive to allele frequencies, which is why standardized versions D′ and r² are preferred.
Q33 Tricky
A λGC value less than 1.00 in a GWAS most likely suggests:
A. Uncorrected population stratification inflating results
B. Many true associations were detected
C. The study has perfect statistical power
D. Overcorrection for population structure, potentially leading to false negatives
Explanation
λGC < 1.00 means observed test statistics are smaller than expected, suggesting the model is too conservative. This can happen when over-adjusting for population structure (e.g., including too many principal components as covariates). The consequence is an increased risk of false negatives — real associations may be missed because the test statistics are deflated.
Q34 Medium
What is over-representation analysis (ORA) used for in post-GWAS analysis?
A. Testing whether specific biological functions are significantly enriched in a set of GWAS-identified genes compared to chance
B. Identifying additional SNPs not tested in the original GWAS
C. Calculating linkage disequilibrium between candidate genes
D. Replicating GWAS findings in an independent population
Explanation
ORA tests whether biological functions/pathways are significantly more frequent (over-represented) in a GWAS gene set than expected by chance. For example, if your GWAS identifies 50 candidate genes, ORA can determine whether immune-related functions are enriched in that gene set. Tools like DAVID and EnrichR perform this analysis, producing p-values for each functional term.
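The enrichment question can be framed as a one-sided hypergeometric test, which is one common way to compute an ORA p-value. The sketch below is illustrative only and is not claimed to reproduce the exact statistics used by DAVID or EnrichR; the function name `ora_p_value` and all gene counts are hypothetical.

```python
from math import comb


def ora_p_value(n_background, n_annotated, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random.

    n_background: genes in the background set
    n_annotated:  background genes carrying the functional term
    n_selected:   genes in the GWAS candidate set
    n_overlap:    candidate genes carrying the functional term
    """
    total = comb(n_background, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 500 immune-related,
# 50 GWAS candidate genes, 8 of which are immune-related.
print(ora_p_value(20_000, 500, 50, 8))
```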
Q35 Easy
Which tool can be used to calculate LD measures (D, D′, r²) and visualize LD structure from genotype data?
A. BEDTools
B. EnrichR
C. PLINK
D. GeneCards
Explanation
PLINK is the primary tool for LD calculation, handling large SNP datasets and computing D, D′, and r². It also supports LD pruning and block identification. Haploview is another tool used specifically for LD visualization. BEDTools is for genomic interval operations, EnrichR for pathway enrichment analysis, and GeneCards is a gene-centric database.
Q36 Medium
What does the GWAS Catalog (NHGRI-EBI) provide?
A. Raw sequencing reads from GWAS experiments
B. A curated collection of published SNP-trait associations from GWAS, with summary statistics and genomic visualization
C. Reference genomes for assembly purposes
D. LD block definitions for all human populations
Explanation
The GWAS Catalog (founded by NHGRI in 2008) is a curated repository of published SNP-trait associations from genome-wide association studies. It offers a search interface, downloadable data, API access, summary statistics, and an iconic GWAS diagram showing associations mapped onto the human karyotype. It is a key resource for post-GWAS annotation and prior knowledge.
Q37 Easy
What tool is used to find overlaps between genomic features such as GWAS peaks and gene annotations (GFF files)?
A. Haploview
B. DAVID
C. PLINK
D. BEDTools intersect
Explanation
BEDTools intersect allows screening for overlaps between two sets of genomic features (e.g., GWAS-associated SNP positions and annotated genes in a GFF file). It works with BED, GFF, VCF, and BAM files. This is commonly used in post-GWAS analysis to identify genes near association peaks, often within a defined window (e.g., 0.5 Mb).
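The overlap logic behind such a window search can be illustrated in plain Python. This is only a sketch of the idea, not BEDTools itself: the function name `genes_near_snp`, the gene names and the coordinates are hypothetical, and a gene counts as a hit if any part of it falls within the window around the SNP.

```python
def genes_near_snp(snp_chrom, snp_pos, genes, window=500_000):
    """Return names of genes overlapping [snp_pos - window, snp_pos + window].

    genes: iterable of (name, chrom, start, end) tuples.
    """
    lo, hi = snp_pos - window, snp_pos + window
    return [
        name
        for name, chrom, start, end in genes
        # Two intervals overlap when each starts before the other ends.
        if chrom == snp_chrom and start <= hi and end >= lo
    ]


# Hypothetical annotation and a hypothetical top SNP at chr1:1,400,000.
genes = [
    ("geneA", "chr1", 1_000_000, 1_050_000),
    ("geneB", "chr1", 2_000_000, 2_100_000),
    ("geneC", "chr2", 1_200_000, 1_250_000),
]
print(genes_near_snp("chr1", 1_400_000, genes))
```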
Q38 Tricky
In a meta-GWAS, which of the following is NOT a requirement?
A. All studies must use the exact same genotyping platform
B. All studies must have examined the same hypothesis
C. QC procedures and covariate adjustments should be standardized across studies
D. The sample sets across all studies should be independent
Explanation
Studies in a meta-GWAS do NOT need to use the exact same platform — this is precisely why imputation exists: to generate a common set of SNPs across studies using different arrays. However, all studies must examine the same hypothesis, use standardized QC and covariate adjustments, have consistent phenotype measurements, and use independent sample sets. Meta-analysis allows pooling results without transferring protected genotype data.
Q39 Medium
What data sources does genotype imputation rely on?
A. Protein crystal structure databases
B. RNA-Seq expression profiles from the same individuals
C. Known LD patterns and haplotype frequencies from reference panels like HapMap or 1000 Genomes
D. Phenotype data from case-control studies
Explanation
Genotype imputation exploits known LD patterns and haplotype frequencies from reference panels (HapMap, 1000 Genomes) to statistically estimate genotypes at SNP positions that were not directly genotyped in the study. It leverages the principle that nearby SNPs in LD are inherited together, so if you know the genotype of surrounding SNPs, you can predict the missing ones.
Q40 Medium
In GWAS, what does "fine-mapping" refer to?
A. Increasing the sample size of the study
B. Investigating the LD structure and nearby genes within significant GWAS peaks to prioritize candidate causal variants
C. Applying more stringent multiple testing corrections
D. Removing SNPs in LD from the genotyping panel
Explanation
Fine-mapping is the step after identifying GWAS peaks where researchers zoom into significant regions to examine LD structure (using r² or D′), identify nearby genes, and prioritize candidate causal variants for functional studies. Because GWAS detects indirect associations through LD, fine-mapping helps narrow down from a region to the actual causal variant(s).
Q41 — Open Short Answer
Describe the six key design considerations a GWAS should address. For each, explain why it matters for study quality.
✓ Model Answer

The six key GWAS design considerations are:

1. Phenotype definition: Precise trait classification (binary, quantitative, or ordinal) is essential. Misclassification increases heterogeneity and reduces statistical power. Different disease subtypes should be distinguished.

2. Structure of common genetic variation (LD): Understanding LD blocks enables efficient genotyping with tag SNPs. LD patterns vary across populations, so study design must account for the target population's LD structure.

3. Sample size: Must be adequate for the trait complexity. Complex diseases need ≥700–1000+ individuals; molecular traits may require ~250. Larger samples detect more loci and improve reliability.

4. Population structure/stratification: Ancestry differences between subgroups can confound results. Must be assessed (via PCA/MDS) and corrected by including ancestry components as covariates.

5. Genome-wide significance and multiple testing correction: Testing millions of SNPs generates many false positives. Correction methods include Bonferroni, fixed threshold (P < 5 × 10⁻⁸), FDR, and permutation testing.

6. Replication: Findings must be validated in an independent cohort with similar phenotype definition and genetic background. Successful replication in other populations = generalization.

Q42 — Open Calculation
Given two SNPs: SNP A (alleles T and G, with freq(G) = 0.2) and SNP B (alleles G and A, with freq(A) = 0.3). The observed haplotype frequencies are: T-G = 0.70, T-A = 0.10, G-G = 0.00, G-A = 0.20. Calculate D for the G-A haplotype and verify it matches for all other haplotypes.
✓ Model Answer

First, calculate expected haplotype frequencies under linkage equilibrium (product of allele frequencies):

SNP A: freq(T) = 0.8, freq(G) = 0.2; SNP B: freq(G) = 0.7, freq(A) = 0.3
Expected T-G = 0.8 × 0.7 = 0.56
Expected T-A = 0.8 × 0.3 = 0.24
Expected G-G = 0.2 × 0.7 = 0.14
Expected G-A = 0.2 × 0.3 = 0.06

Now compute D = observed − expected for each haplotype:

D(T-G) = 0.70 − 0.56 = +0.14
D(T-A) = 0.10 − 0.24 = −0.14
D(G-G) = 0.00 − 0.14 = −0.14
D(G-A) = 0.20 − 0.06 = +0.14

The absolute value |D| = 0.14 is the same for all four haplotypes, and the signs alternate so that the four deviations cancel and the haplotype frequencies still sum to 1. Since D ≠ 0, the two SNPs are in linkage disequilibrium. The positive D for G-A means this haplotype is observed more often than expected under linkage equilibrium, which indicates non-random co-inheritance of the two alleles.
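The same check can be run programmatically from the haplotype table alone. The sketch below is a minimal illustration: it derives the allele frequencies by summing haplotype frequencies and then computes observed minus expected for each haplotype. The function name `d_per_haplotype` is hypothetical; the frequencies are those given in the question.

```python
def d_per_haplotype(hap_freq):
    """Return D = observed - expected for every haplotype in the table.

    hap_freq: dict mapping (allele at SNP A, allele at SNP B) -> frequency.
    """
    freq_a, freq_b = {}, {}
    for (allele_a, allele_b), f in hap_freq.items():
        # Allele frequencies are the row and column sums of the table.
        freq_a[allele_a] = freq_a.get(allele_a, 0.0) + f
        freq_b[allele_b] = freq_b.get(allele_b, 0.0) + f
    return {
        hap: f - freq_a[hap[0]] * freq_b[hap[1]]
        for hap, f in hap_freq.items()
    }


hap_freq = {("T", "G"): 0.70, ("T", "A"): 0.10, ("G", "G"): 0.00, ("G", "A"): 0.20}
for hap, d in d_per_haplotype(hap_freq).items():
    print(f"D({hap[0]}-{hap[1]}) = {d:+.2f}")
```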

Q43 — Open Tricky
Explain why a GWAS-significant SNP is often not the actual causal variant. What steps would a researcher take after identifying a significant association peak?
✓ Model Answer

GWAS relies on indirect association through LD. Genotyping arrays use tag SNPs that are representative markers for LD blocks. When a tag SNP shows significant association, it may be in strong LD with the true causal variant, which was never directly genotyped. The detected signal reflects the correlation between the tag and causal variant, not direct causality.

Post-GWAS steps include:

1. Fine-mapping: Examine the LD structure (r², D′) around the top SNP to identify the boundaries of the associated region and narrow down candidate variants.

2. Gene annotation: Identify nearby genes using databases (e.g., GFF files, BioMart) and tools like BEDTools intersect, often within a defined window (e.g., 0.5 Mb).

3. Biological evaluation: Assess candidate gene relevance using GeneCards, GWAS Catalog, Mouse Genome Informatics (MGI), and scientific literature.

4. Functional enrichment: Use ORA tools (DAVID, EnrichR) to test if the gene set is enriched for specific biological pathways.

5. Functional validation: Conduct experimental studies (e.g., gene expression, knockouts) to confirm the causal role of the candidate variant/gene.

Q44 — Open Calculation
You are performing a GWAS with a panel of 500,000 SNPs. (a) What is the Bonferroni-corrected significance threshold at α = 0.05? (b) Why might the fixed threshold of P < 5 × 10⁻⁸ be more appropriate? (c) Using FDR at q* = 0.05 with 500,000 SNPs, what is the BH threshold for the SNP ranked 10th?
✓ Model Answer

(a) Bonferroni correction:

α_corrected = α / N = 0.05 / 500,000 = 1.0 × 10⁻⁷

(b) Why the fixed threshold is more appropriate: Bonferroni assumes all 500,000 SNP tests are independent, although many SNPs are correlated through LD, and its threshold changes with every genotyping panel. Note that for this panel the Bonferroni threshold (1.0 × 10⁻⁷) is actually less stringent than 5 × 10⁻⁸, so the argument for the fixed threshold here is not that it is more lenient. Rather, the fixed threshold of P < 5 × 10⁻⁸ was derived from ~1 million independent LD blocks across the genome (Pe'er et al., 2008), is LD-aware, standardized across studies, and does not change with the genotyping platform used.

(c) FDR BH threshold for rank 10:

q(10) = (10 / 500,000) × 0.05 = 0.000001 = 1.0 × 10⁻⁶

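Under the BH step-up procedure, if the 10th-ranked p-value is at or below 1.0 × 10⁻⁶, it and the nine smaller p-values are declared significant. It is also declared significant if any higher-ranked p-value meets its own threshold.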
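The three thresholds from this question can be put side by side with a few lines of arithmetic. This is a minimal sketch; the function names are hypothetical, and the fixed threshold is reproduced as 0.05 divided by the ~1 million independent LD blocks cited above.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def bh_rank_threshold(rank, n_tests, alpha):
    """Benjamini-Hochberg threshold for the p-value at a given rank."""
    return rank / n_tests * alpha


alpha, n_snps = 0.05, 500_000
panel_bonferroni = bonferroni_threshold(alpha, n_snps)   # part (a)
fixed_genome_wide = bonferroni_threshold(alpha, 1_000_000)  # part (b)
bh_rank_10 = bh_rank_threshold(10, n_snps, alpha)        # part (c)

print(f"Bonferroni for this panel: {panel_bonferroni:.1e}")
print(f"Fixed genome-wide:         {fixed_genome_wide:.1e}")
print(f"BH threshold at rank 10:   {bh_rank_10:.1e}")
```

The comparison makes the ordering explicit: the fixed genome-wide threshold is the strictest of the three, and the BH threshold at rank 10 is the most lenient.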

Q45 — Open Short Answer
Explain the difference between a QQ-plot showing (a) good population structure control and (b) uncorrected stratification. What role does the genomic inflation factor λGC play alongside the QQ-plot?
✓ Model Answer

(a) Good control: Most observed p-values follow the diagonal (expected under the null hypothesis). Only a few points deviate upward at the extreme tail (top-right), representing true associations. This indicates that the vast majority of SNPs behave as expected (no association), and only a handful show genuine signal.

(b) Uncorrected stratification: The entire distribution shifts upward — many points across the full range lie above the diagonal, not just the tail. This systematic inflation means that ancestry differences are creating widespread false signals, not just a few true associations.

λGC complements the QQ-plot: It quantifies the inflation numerically by comparing the median of observed chi-squared test statistics to the expected median under the null. λGC ≈ 1 confirms good control (matching the QQ-plot diagonal). λGC > 1 quantifies the degree of inflation seen in the QQ-plot. λGC < 1 flags overcorrection (too many PCs as covariates), which may cause false negatives. Together, the QQ-plot provides visual assessment and λGC provides a numerical summary — both are essential QC tools before interpreting GWAS results.