Lectures 11-12-13: Genotyping Tools, CNV, Population Genomics, ROH & GWAS

📝High Throughput Genotyping Tools
Q1 Easy
What does "genotyping" mean in the context of high-throughput genomic studies?
A. Sequencing the entire genome of an organism de novo
B. Determination of the genotype at polymorphic loci
C. Identifying all genes present in an organism's genome
D. Measuring gene expression levels across all tissues
Explanation
Genotyping specifically refers to the determination of the genotype at polymorphic loci — i.e., identifying which alleles an individual carries at variable positions in the genome. It does not involve whole-genome sequencing or gene expression analysis. The value of genetic information relies largely on variants/polymorphisms, which are the informative parts of a genome.
Q2 Easy
Which of the following is NOT listed as an application of high-throughput genotyping in agricultural species?
A. Genome-wide association studies (GWAS)
B. Genomic selection
C. Parentage testing
D. De novo genome assembly
Explanation
The four main applications listed are: GWAS/QTL mapping, genomic selection, marker-assisted selection/breeding (MAS/MAB), and parentage testing. De novo genome assembly is a different process that aims to reconstruct the full genome sequence and is not a direct application of high-throughput genotyping tools like SNP chips.
Q3 Medium
A restriction enzyme that recognizes a 4 bp sequence would be expected to cut, on average, once every:
A. 64 bp
B. 128 bp
C. 256 bp
D. 4,096 bp
Explanation
The expected spacing between cut sites is 4ⁿ bp, where n is the number of bases in the recognition sequence (assuming the four bases are equally frequent and randomly ordered). For a 4 bp cutter: 4⁴ = 256 bp. Similarly, a 6 bp cutter cuts every 4⁶ = 4,096 bp, and an 8 bp cutter cuts every 4⁸ = 65,536 bp. AluI is an example of a 4 bp cutter.
Q4 Medium
In the Illumina Infinium BeadChip genotyping assay, how is allele specificity determined?
A. A single base extension that incorporates one of four labeled nucleotides
B. Hybridization of two differently colored probes to both alleles simultaneously
C. Sequencing the region surrounding each SNP
D. PCR amplification with allele-specific primers
Explanation
In the Illumina BeadChip, each probe binds to complementary sequence in the sample DNA, stopping one base before the locus of interest. Allele specificity is then conferred by a single base extension that incorporates one of four labeled nucleotides. When excited by a laser, the nucleotide label emits a signal whose intensity conveys information about the allelic ratio at that locus.
Q5 Tricky
How does the Axiom (Affymetrix) genotyping assay differ from the Illumina Infinium assay in interrogating simple SNPs?
A. Axiom uses two probes per SNP; Illumina uses one probe with single base extension
B. Axiom uses one probe with two-color readout via differentially labeled nonamers; Illumina uses single base extension
C. Axiom uses sequencing-by-synthesis; Illumina uses hybridization
D. Both platforms use identical chemistry but differ in array density
Explanation
In the Axiom system, simple SNPs are interrogated using one standard probe with allelic discrimination achieved by differentially labeled nonamers that hybridize to each allele (one probe, two color readout). In contrast, Illumina Infinium uses a probe that stops one base before the SNP and relies on single base extension with labeled nucleotides. This is a subtle but important distinction between the two major genotyping platforms.
Q6 Medium
In the GenomeStudio Genoplot for SNP genotyping, what do the three clusters (red, purple, blue) represent?
A. Different chromosomes where the SNP is located
B. Different quality scores: high, medium, and low
C. The three possible haplotypes at the locus
D. The three genotype classes: AA, AB, and BB
Explanation
In a GenomeStudio Genoplot, data points are color coded by genotype call: red = AA, purple = AB, blue = BB. Each dot represents one sample, plotted by signal intensity (Norm R) against allele frequency (Norm Theta) relative to the canonical cluster positions for a given SNP marker.
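A rough sketch of how the two Genoplot axes can be derived from a two-channel signal is shown here. It assumes the commonly described polar transformation, Norm Theta = (2/π)·arctan(Y/X) and Norm R = X + Y, for normalized A-allele (X) and B-allele (Y) intensities. The function name and the example intensities are hypothetical and are not taken from GenomeStudio itself.

```python
import math


def genoplot_coordinates(x_norm, y_norm):
    """Return (norm_theta, norm_r) for normalized A (x) and B (y) intensities.

    Assumed transformation: theta = (2/pi) * arctan(y/x), r = x + y.
    """
    norm_theta = (2.0 / math.pi) * math.atan2(y_norm, x_norm)
    norm_r = x_norm + y_norm
    return norm_theta, norm_r


# Hypothetical samples: mostly A signal, balanced signal, mostly B signal.
samples = [("AA-like", 1.0, 0.05), ("AB-like", 0.6, 0.6), ("BB-like", 0.05, 1.0)]
for label, x, y in samples:
    theta, r = genoplot_coordinates(x, y)
    print(f"{label}: Norm Theta = {theta:.2f}, Norm R = {r:.2f}")
```

Under this assumption, samples with Norm Theta near 0, 0.5 and 1 would fall into the AA, AB and BB clusters respectively.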
Q7 Medium
What is a key advantage of custom genotyping arrays over commercially available SNP chips?
A. They are always cheaper per sample than commercial arrays
B. They provide higher data quality and fewer genotyping errors
C. They enable studies of species or populations not supported by standard products
D. They always include more SNPs than commercial arrays
Explanation
Custom genotyping arrays allow researchers to target regions relevant to their specific research interests. Key advantages include: enabling studies of species/populations not supported by standard products, allowing focus on genes/variants/regions of interest not covered in pre-designed products, and conserving resources by avoiding irrelevant genome regions. They are not necessarily cheaper or higher quality — their value lies in flexibility and specificity.
Q8 Medium
What is the main advantage of restriction enzyme GBS (RE-GBS) methods like RAD-Seq over array-based genotyping?
A. Reduced ascertainment bias and simultaneous SNP discovery and genotyping
B. Higher per-sample cost but better data quality
C. No need for a reference genome under any circumstances
D. Complete absence of missing data in the final dataset
Explanation
RE-GBS methods have reduced ascertainment bias over array-based methods, the ability to discover and characterize polymorphisms simultaneously, and low cost per sample (<$20 USD). However, they can have issues with missing data (especially in divergent populations), and a reference genome or genome knowledge can be helpful. High divergence can result in missing data, while low divergence may yield fewer SNPs.
Q9 Tricky
In the Ramos et al. study on pig SNP discovery using reduced representation libraries (RRLs), which of the following was used as a criterion to discard unreliable SNPs?
A. SNPs with read depth lower than 120 were discarded
B. SNPs with read depth higher than 120 were discarded
C. The minor allele had to be present in at least 10 reads
D. Only reads mapping to multiple locations were considered
Explanation
In the SNP discovery pipeline, SNPs with total read depth higher than 120 were discarded (to avoid repetitive/duplicated regions), the minor allele needed to be represented in at least 3 reads (not 10), and only reads mapping to a single unique location were considered (not multiple locations). The quality thresholds for MAQ mapping quality, consensus quality, and best mapping read quality were all set at 10.
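The filtering criteria in this explanation can be sketched as a single predicate. This is an illustrative sketch, not the pipeline used in the study: the function and parameter names are hypothetical, and treating the quality threshold of 10 as an inclusive minimum is an assumption.

```python
def keep_snp(total_depth, minor_allele_reads, n_map_locations,
             map_qual, consensus_qual, best_read_qual,
             max_depth=120, min_minor_reads=3, min_qual=10):
    """Return True if a candidate SNP passes all filters described above."""
    if total_depth > max_depth:
        # Very deep coverage suggests a repetitive or duplicated region.
        return False
    if minor_allele_reads < min_minor_reads:
        return False
    if n_map_locations != 1:
        # Only reads mapping to a single unique location are considered.
        return False
    # Assumption: the three quality thresholds are inclusive minimums.
    return min(map_qual, consensus_qual, best_read_qual) >= min_qual


# Hypothetical candidates: (depth, minor reads, locations, three qualities)
candidates = {
    "snp_ok": (60, 5, 1, 30, 25, 40),
    "snp_too_deep": (150, 20, 1, 30, 25, 40),
    "snp_rare_allele": (60, 2, 1, 30, 25, 40),
}
for name, values in candidates.items():
    print(name, keep_snp(*values))
```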
Q10 Medium
In the three-step process for converting NGS data into genotype calls, what is the correct order?
A. SNP calling → alignment → filtering
B. Filtering → alignment → SNP calling
C. Pre-processing (alignment + quality scores) → SNP/genotype calling → post-processing (filtering)
D. SNP calling → pre-processing → post-processing
Explanation
The correct pipeline is: (1) pre-processing steps that transform NGS data into aligned reads with quality scores, (2) SNP or genotype calls using multi-sample or single-sample calling procedures depending on the number of samples and depth of coverage, and (3) post-processing steps that filter the called SNPs to remove unreliable variants.
Q11 Easy
The formula P = G + E in the context of genotyping studies refers to:
A. Phenotype = Genotype + Environment
B. Population = Genes + Evolution
C. Probability = Genetics + Error
D. Power = Genotyping + Efficiency
Explanation
P = G + E is a fundamental equation in quantitative genetics where an individual's Phenotype (P) is determined by their Genotype (G) plus Environment (E). High-throughput genotyping studies are crucial for generating large volumes of genotyping data to identify associations between genotypes and phenotypes (such as diseases or production traits).
Q12 — Open Calculation
A restriction enzyme recognizes a 6 bp sequence. On average, how frequently would you expect this enzyme to cut in a random DNA sequence? If you digest a 3 Gbp genome with this enzyme, approximately how many fragments would you expect?
✓ Model Answer

The expected spacing between restriction sites for an n-base recognition sequence is 4ⁿ bp.

For a 6 bp cutter: 4⁶ = 4,096 bp
The enzyme cuts on average once every 4,096 bp.
Number of fragments ≈ Genome size / Cut frequency = 3,000,000,000 / 4,096 ≈ 732,422 fragments

So a 6 bp restriction enzyme would generate approximately 730,000 fragments from a 3 Gbp genome. This principle underlies why enzymes with longer recognition sites produce fewer, larger fragments (e.g., an 8 bp cutter would produce ~46,000 fragments).
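The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes a random sequence with equal base frequencies; the function names are hypothetical.

```python
def expected_cut_interval(site_length):
    """Expected spacing (bp) between sites for a recognition sequence of n bases."""
    return 4 ** site_length


def expected_fragments(genome_size_bp, site_length):
    """Approximate number of fragments after complete digestion."""
    return genome_size_bp / expected_cut_interval(site_length)


genome_size = 3_000_000_000  # 3 Gbp genome, as in the question
for n in (4, 6, 8):
    interval = expected_cut_interval(n)
    fragments = expected_fragments(genome_size, n)
    print(f"{n} bp cutter: one site every {interval:,} bp, about {fragments:,.0f} fragments")
```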

Q13 — Open Short Answer
Explain what a Reduced Representation Library (RRL) is and why it is useful for SNP discovery in livestock species. Describe the general steps to construct one.
✓ Model Answer

A Reduced Representation Library (RRL) is a method to sequence only a fraction of the genome by using restriction enzymes to fragment the DNA and selecting fragments of a specific size range. This dramatically reduces the amount of sequencing needed while still sampling reproducible, genome-wide locations.

Steps (as in the Ramos et al. pig study):

1. Pool DNA from multiple individuals (equal amounts per individual) to capture population-level variation.

2. Digest pooled DNA with restriction enzymes (e.g., AluI, HaeIII, MspI).

3. Select fragments of a specific size range (reduced representation).

4. Sequence the selected fragments using NGS (e.g., Illumina).

5. Align reads to the reference genome and call SNPs using quality filters.

Why useful: It enables cost-effective, large-scale SNP discovery across the genome without the expense of whole-genome sequencing all individuals. The discovered SNPs can then be used to design SNP chips (e.g., PorcineSNP60 BeadChip).

Q14 — Open Short Answer
Compare and contrast commercially available SNP chips versus custom genotyping arrays. When would you choose one over the other?
✓ Model Answer

Commercial SNP chips (e.g., PorcineSNP60, BovineSNP50): Pre-designed for common species with known SNPs. They offer standardized, trusted data quality, widespread adoption enabling cross-study comparisons, low per-sample cost, and comprehensive genome coverage. Best for well-studied species with existing tools.

Custom genotyping arrays: Designed for specific research needs. Advantages: enable studies of species/populations not supported by standard products, allow focus on specific genes/regions of interest, and conserve resources by excluding irrelevant regions.

Choose commercial when working with common species (cattle, pigs, humans), needing standardized results, or conducting large population studies. Choose custom when studying non-model species, targeting specific genomic regions relevant to a particular disease/trait, or when commercial products don't cover your variants of interest.

Q15 Tricky
In the RE-GBS context, what happens when the target populations are more divergent than expected?
A. It results in a lower number of detected SNPs
B. It results in increased missing data, complicating downstream analysis
C. It increases the per-sample cost by more than 10-fold
D. It makes restriction enzyme digestion more efficient
Explanation
When populations are more divergent than expected, RE-GBS protocols can result in increased missing data (because restriction sites may differ between divergent individuals, leading to different fragments being sequenced), complicating downstream analysis. Conversely, low divergence results in a lower number of detected SNPs. This is a tricky distinction — high divergence → more missing data; low divergence → fewer SNPs.
Q16 Medium
What is the purpose of genomic selection in agricultural species?
A. To sequence and assemble the genomes of all individuals in a breeding population
B. To identify and remove deleterious mutations from the population
C. To select animals based on a single marker linked to one trait
D. To improve quantitative traits using whole-genome molecular markers combined with phenotypic and pedigree data
Explanation
Genomic selection aims to improve quantitative traits in large breeding populations through the use of whole-genome molecular markers. Genomic prediction combines marker data with phenotypic and pedigree data (when available) to increase the accuracy of predicting breeding and genotypic values. This differs from marker-assisted selection (MAS), which selects based on individual markers linked to specific traits.

📝Copy Number Variation (CNV)
Q1 Easy
According to Redon et al. (2006), a copy number variation (CNV) is defined as a DNA segment that is:
A. At least 100 bp and present at variable copy number
B. At least 500 bp and differs by a single nucleotide
C. 1 kb or larger and present at variable copy number compared to a reference genome
D. At least 1 Mb and completely deleted from some genomes
Explanation
Redon et al. (2006) defined a CNV as a DNA segment that is 1 kb or larger and present at variable copy number in comparison with a reference genome. Lee et al. (2008) similarly defined CNVs as intra-specific gains or losses of more than 1 kb of genomic DNA. Note that CNVs are not simply SNPs — they involve larger structural changes.
Q2 Medium
Which of the following is NOT a platform for CNV analysis?
A. GWAS with chi-squared testing
B. Array Comparative Genome Hybridization (aCGH)
C. Comparative intensity analysis of SNP genotyping chips
D. Next-generation sequencing platforms
Explanation
The three main platforms for CNV analysis are: (1) Array Comparative Genome Hybridization (aCGH), including whole genome tilepath arrays and oligonucleotide arrays; (2) Comparative intensity analysis of SNP genotyping chips (Affymetrix and Illumina); and (3) Next-Generation Sequencing platforms. GWAS with chi-squared testing is used for association studies, not specifically for CNV detection.
Q3 Medium
In oligonucleotide-based aCGH, the reference DNA and test DNA are labeled with:
A. Reference with Cy3 (green) and test with Cy5 (red)
B. Reference with Cy5 (red) and test with Cy3 (green)
C. Both with the same fluorescent dye at different concentrations
D. Neither is labeled; hybridization is detected by mass spectrometry
Explanation
In aCGH, the reference DNA is labeled with Cy5 and the sample/test DNA is labeled with Cy3. Both are then co-hybridized to the oligonucleotide microarray. The ratio of the two fluorescent signals at each probe position indicates whether the test sample has a gain (more Cy3), loss (more Cy5), or normal copy number (equal signals) relative to the reference.
Q4 Hard
What are the four main NGS-based methods for detecting CNVs?
A. Alignment, Variant Calling, Filtering, Annotation
B. PCR, Sanger, Microarray, FISH
C. Read-Pair, Split-Read, Read-Depth, Haplotype-based
D. Read-Pair, Split-Read, Read-Depth, Assembly-based
Explanation
The four main methods are: (1) Read-Pair (RP), which compares the insert sizes of mapped read pairs with the expected insert size; (2) Split-Read (SR), which uses read pairs in which one mate maps and the other fails to map fully, in order to locate breakpoints; (3) Read-Depth (RD), which detects CNVs from the correlation between depth of coverage and copy number; and (4) Assembly-based (AS), which assembles contigs/scaffolds and compares them with the reference.
Q5 Tricky
Which NGS-based CNV detection method can determine the exact number of copies, unlike the others which only report positions?
A. Read-Pair (RP)
B. Split-Read (SR)
C. Read-Depth (RD)
D. Assembly-based (AS)
Explanation
Compared to RP and SR, the Read-Depth (RD) method can estimate the exact number of copies of a region, whereas RP and SR can only report the position of potential CNVs and not the copy counts. RD works particularly well for large CNVs, which are hard to detect with RP and SR. The method is based on the hypothesis that depth of coverage correlates with the copy number of a region.
Q6 Medium
What is a major limitation of assembly-based (AS) methods for CNV detection?
A. They can only detect homozygous structural variants and demand extensive computational resources
B. They require paired-end sequencing data exclusively
C. They are limited to CNVs smaller than 1 kb
D. They are unable to detect any insertions, only deletions
Explanation
Assembly-based methods have several limitations: (1) overwhelming demand on computational resources, (2) eukaryotic genomes contain repeats and segmental duplications which reduce accuracy, and (3) they are unable to handle haplotype sequences, meaning only homozygous structural variations can be detected. This is why AS methods are less commonly used for CNV detection in practice.
Q7 Hard
In PennCNV, what is the expected B-Allele Frequency (BAF) pattern for a triploid (duplicated) region?
A. Two bands at 0 and 1
B. Four bands at 0, 0.33, 0.66, and 1
C. Three bands at 0, 0.5, and 1
D. A single band at 0.5
Explanation
For a triploid (duplicated) region with three allele copies, the possible genotypes are AAA (BAF=0), AAB (BAF≈0.33), ABB (BAF≈0.66), and BBB (BAF=1), giving four allele tracks. In contrast, normal diploid regions have three bands (0, 0.5, 1), and deleted regions have only two bands (0 and 1). This is a key pattern for identifying duplications from SNP chip data.
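The band pattern follows from counting how many of the copies carry the B allele. A minimal sketch, with a hypothetical function name, that reproduces the idealized bands for one, two and three copies:

```python
def expected_baf_bands(copy_number):
    """Idealized BAF values for a region present in `copy_number` copies.

    With n copies, a SNP can carry 0..n B alleles, so BAF takes the values k/n.
    """
    if copy_number < 1:
        raise ValueError("BAF is undefined when no copies are present")
    return [b_count / copy_number for b_count in range(copy_number + 1)]


for label, cn in [("deletion (1 copy)", 1), ("normal (2 copies)", 2), ("duplication (3 copies)", 3)]:
    bands = ", ".join(f"{value:.2f}" for value in expected_baf_bands(cn))
    print(f"{label}: {bands}")
```

Real array data are noisy, so observed bands are clusters around these values rather than exact fractions.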
Q8 Medium
What does a Log2 Ratio (LogR) value of 0 indicate in CNV analysis using microarrays?
A. Complete deletion of the region
B. Duplication (copy number gain)
C. Loss of heterozygosity
D. Normal copy state (copy number = 2)
Explanation
LogR is the log2 ratio of the signal in the sample of interest to the signal at the reference data point. A value of 0 represents the normal copy state of 2 (equal signal in reference and sample). Positive LogR values indicate copy number gain (duplication), while negative values indicate copy number loss (deletion).
Q9 Tricky
PennCNV uses a hidden Markov model (HMM) for CNV calling. What makes it different from segmentation-based algorithms?
A. It integrates SNP allelic ratio distribution and other factors in addition to signal intensity
B. It only uses signal intensity data, ignoring allelic information
C. It cannot use family information for CNV calling
D. It works only with Illumina data, not Affymetrix
Explanation
PennCNV differs from segmentation-based algorithms in that it considers the SNP allelic ratio distribution (BAF) and other factors in addition to signal intensity (LRR), rather than signal intensity alone. It integrates these sources of information through its HMM framework. PennCNV can handle both Illumina and Affymetrix data and can optionally use family information to generate family-based CNV calls.
Q10 Tricky
In loss of heterozygosity (LOH) regions, what is characteristic of the BAF pattern?
A. Three bands at 0, 0.5, and 1 (normal pattern)
B. Four bands at 0, 0.33, 0.66, and 1 (like duplication)
C. Two bands at 0 and 1 only (no heterozygous SNPs), with unchanged copy number
D. A single band at 0.5 indicating all heterozygous SNPs
Explanation
In LOH regions, copy number is unchanged (LogR ≈ 0), but only homozygous SNPs (AA or BB) are present, giving only two BAF bands at 0 and 1. This distinguishes LOH from deletion: in deletion, you also see two bands, but LogR is negative (reduced copy number). LOH can arise through mechanisms like mitotic recombination or gene conversion without actual loss of DNA.
Q11 Medium
What is the purpose of cross-species aCGH, as used in the goat genome study?
A. To determine the phylogenetic relationship between cattle and goats
B. To detect CNVs in goats using a chip designed based on the bovine genome
C. To identify SNPs shared between cattle and goats
D. To compare gene expression levels between cattle and goats
Explanation
Cross-species aCGH involves using a microarray chip designed based on one species' genome (bovine, Btau_4.0 and UMD 2.0 assemblies) to detect CNVs in a closely related species (goat). This was done because a goat-specific aCGH chip was not available. It leverages the conservation between closely related genomes to study structural variation in species that lack their own dedicated genomic tools.
Q12 Medium
In the Read-Depth (RD) method for CNV detection, what is the purpose of normalizing the read counts?
A. To increase the number of detected CNVs
B. To align reads to the correct chromosomal positions
C. To convert read counts to allele frequencies
D. To remove potential biases mainly due to GC content and repeat regions
Explanation
In the RD method: (1) reads are aligned and depth is counted per window; (2) counts are normalized to remove potential biases, mainly due to GC content and repeat regions, after which a segmentation algorithm identifies contiguous windows with the same copy number; (3) statistical significance is assessed and filtering is applied. GC content bias is a well-known confounder of sequencing depth: GC-rich regions tend to have different coverage than expected.
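A toy sketch of the normalization and copy-number estimation steps is given here. It is not the algorithm of any particular RD tool: the GC-bin median normalization, the function names, and the example depths are all assumptions made for illustration, and the segmentation and significance steps are omitted.

```python
from statistics import median


def gc_normalize(depths, gc_fractions, n_bins=10):
    """Divide each window's depth by the median depth of windows with similar GC."""
    bins = [min(int(gc * n_bins), n_bins - 1) for gc in gc_fractions]
    bin_medians = {
        b: median(d for d, wb in zip(depths, bins) if wb == b) for b in set(bins)
    }
    return [d / bin_medians[b] if bin_medians[b] > 0 else 0.0
            for d, b in zip(depths, bins)]


def estimate_copy_number(normalized, ploidy=2):
    """Scale normalized depth so that a ratio of 1.0 corresponds to `ploidy` copies."""
    return [round(ploidy * ratio) for ratio in normalized]


# Hypothetical per-window read depths; all windows share the same GC fraction here.
depths = [30, 31, 29, 45, 46, 30, 15, 14, 30]
gc = [0.45] * len(depths)
copies = estimate_copy_number(gc_normalize(depths, gc))
print(copies)
```

In this example the two windows at roughly 1.5 times the typical depth are called as three copies and the two at roughly half depth as one copy, which is the sense in which RD reports copy counts and not only positions.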
Q13 — Open Short Answer
Describe the general steps of an oligonucleotide-based aCGH experiment for CNV detection. What data does aCGH produce and how is it interpreted?
✓ Model Answer

Steps:

1. Design microarray with long oligonucleotide probes (50-70 bp) based on the reference genome, spaced at regular intervals while avoiding repetitive sequences.

2. Extract high-quality DNA from both a reference sample and the test sample.

3. Label reference DNA with Cy5 and test DNA with Cy3.

4. Co-hybridize both labeled DNAs to the microarray — they compete to bind the probes.

5. Scan fluorescent images using a microarray scanner.

6. Normalize the data.

7. Analyze CNVs using specialized software (e.g., CGHWeb, SignalMap).

Interpretation: The Log2 ratio of Cy3/Cy5 signal at each probe is calculated. A Log2 ratio of 0 indicates normal copy number, positive values indicate gains (duplications), and negative values indicate losses (deletions). Various smoothing/segmentation algorithms (e.g., CBS, BioHMM, GLAD) can be used to identify CNV regions from the raw data.
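The interpretation rule can be sketched as follows. The function names are hypothetical, and the ±0.3 calling threshold is an arbitrary illustrative value, not one taken from the lecture or from any specific aCGH software.

```python
import math


def log2_ratio(test_cy3, reference_cy5):
    """Log2 ratio of test (Cy3) to reference (Cy5) signal at one probe."""
    return math.log2(test_cy3 / reference_cy5)


def classify_probe(ratio, threshold=0.3):
    """Label a probe as gain, loss or normal using a symmetric threshold (assumed)."""
    if ratio > threshold:
        return "gain"
    if ratio < -threshold:
        return "loss"
    return "normal"


# Hypothetical probe intensities: (Cy3 test, Cy5 reference)
probes = [(1000, 1000), (1500, 1000), (500, 1000)]
for cy3, cy5 in probes:
    ratio = log2_ratio(cy3, cy5)
    print(f"Cy3={cy3}, Cy5={cy5}: log2 ratio = {ratio:+.2f} -> {classify_probe(ratio)}")
```

In practice calls are made on segments of neighbouring probes after smoothing, not on single probes.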

Q14 — Open Short Answer
Compare the four NGS-based methods for CNV detection (Read-Pair, Split-Read, Read-Depth, Assembly-based). What are the strengths and limitations of each?
✓ Model Answer

Read-Pair (RP): Compares insert sizes of paired-end reads with expected size from reference. Detects medium-sized insertions and deletions. Limitation: insensitive to small events because small perturbations are hard to distinguish from normal variability. Reports positions but not copy number counts.

Split-Read (SR): Uses reads where one mate maps but the other fails to map fully. Provides precise breakpoints at single base pair resolution. Limitation: requires reads to span breakpoints, may miss larger events. Reports positions but not copy counts.

Read-Depth (RD): Based on the correlation between coverage depth and copy number. Unique advantage: can detect the exact number of copies (not just positions). Works well for large CNVs. Can be applied to single samples, case/control pairs, or populations. Requires normalization for GC content and repeat biases.

Assembly-based (AS): Generates contigs/scaffolds and compares them with the reference. Theoretically can detect all forms of variation. Major limitations: high computational resource demands, poor performance in repeat regions, and can only detect homozygous structural variants (cannot handle haplotype sequences).

Q15 — Open Short Answer
What data does PennCNV require for CNV calling from SNP genotyping arrays, and what statistical model does it use? How does it differ from segmentation-based algorithms?
✓ Model Answer

PennCNV requires: (1) LRR (Log R Ratio) and BAF (B-Allele Frequency) values from the signal intensity file, (2) population frequency of B alleles, (3) SNP genome coordinates, and (4) an appropriate HMM model.

PennCNV uses a Hidden Markov Model (HMM) that integrates multiple sources of information to infer CNV calls. Unlike segmentation-based algorithms that rely primarily on signal intensity alone, PennCNV also considers the SNP allelic ratio distribution (BAF) and other factors. This integration of multiple data sources (LRR + BAF + population frequencies) makes it more robust at distinguishing true CNVs from noise.

PennCNV can handle both Illumina and Affymetrix array data and can optionally utilize family information to generate family-based CNV calls or use a validation-calling algorithm for specific candidate CNV regions.


📝Population Genomics, Inbreeding & ROH
Q1 Easy
Which of the following is NOT a condition required for Hardy-Weinberg Equilibrium (HWE)?
A. Random mating
B. Small population size
C. No mutation, migration, or selection
D. Organisms are diploid
Explanation
HWE requires an infinitely large population size to eliminate genetic drift, so a small population size violates its assumptions. The full list of conditions includes: diploid organisms, exclusively sexual reproduction, non-overlapping generations, random mating, infinitely large population, equal allele frequencies between sexes, and no evolutionary forces (mutation, migration/gene flow, selection).
Q2 Easy
Genetic drift is defined as:
A. Directed changes in allele frequency due to natural selection
B. Changes in allele frequency due to migration between populations
C. Random changes in allele frequencies from one generation to the next, especially in small populations
D. Increase in genetic diversity over time due to mutations
Explanation
Genetic drift refers to random changes in allele frequencies from one generation to the next due to chance events, not natural selection. It is most impactful in small populations where not all alleles are guaranteed to be transmitted. The direction of drift is entirely random — no selective pressure guides it. Small populations can lose alleles entirely by chance, while larger populations are less prone to this but still experience small fluctuations.
Q3 Medium
What happens to genetic diversity after a population bottleneck?
A. Genetic diversity increases because selection is relaxed
B. Allele frequencies remain unchanged because bottlenecks are neutral events
C. Only beneficial alleles are retained through the bottleneck
D. Genetic diversity is reduced, allele frequencies shift randomly, and homozygosity increases
Explanation
A bottleneck is a special case of genetic drift where a population experiences a sharp reduction in size due to a random, drastic event. The surviving gene pool is a non-representative sample of the original. Alleles that were common might be lost while rare alleles may become fixed, leading to reduced genetic diversity and increased homozygosity. The changes are random, not selective — common alleles aren't preferentially retained.
Q4 Medium
What is an "outlier locus" in population genomics?
A. A genomic region showing significantly stronger allele frequency differences between populations than expected under neutral conditions
B. A region where all individuals in a population have identical genotypes
C. A genetic locus that is located outside the coding regions of the genome
D. A region with a high mutation rate that creates new alleles every generation
Explanation
Outlier loci are genomic regions that show much stronger allele frequency differences between populations than expected under neutral conditions (genetic drift, migration, demography). These loci may be under selection — either natural (adaptive traits) or artificial (breeding). They are identified using tools like FST scans, PCA, or Bayesian methods like BayeScan.
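To make the idea of an FST scan concrete, the sketch below computes a simple heterozygosity-based FST, (H_T − H_S) / H_T, for one biallelic locus in two equally weighted populations. This is one textbook form chosen for illustration; it ignores sample-size corrections, and the function name and allele frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """FST = (H_T - H_S) / H_T for one biallelic locus, two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                        # expected het. in pooled population
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2    # mean within-population het.
    if h_total == 0:
        return 0.0  # locus is fixed for the same allele in both populations
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies of the same allele in two populations.
loci = {"locus_1": (0.50, 0.52), "locus_2": (0.40, 0.45), "locus_3": (0.05, 0.95)}
for name, (p1, p2) in loci.items():
    print(f"{name}: FST = {fst_two_populations(p1, p2):.3f}")
```

A locus such as the hypothetical `locus_3`, whose FST lies far above the genome-wide background, is the kind of signal an outlier scan would flag for follow-up.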
Q5 Easy
What does the inbreeding coefficient (F) represent?
A. The number of deleterious mutations in an individual's genome
B. The probability that two alleles at a given locus are identical by descent (IBD)
C. The proportion of heterozygous loci in an individual
D. The rate of mutation at each genomic locus
Explanation
The inbreeding coefficient (F) represents the probability that two alleles at a randomly chosen locus are identical by descent (IBD) — meaning both alleles originated from a common ancestor. F = 0 means no inbreeding (alleles from unrelated parents); F = 1 means complete inbreeding (all loci are homozygous for IBD alleles). F also reflects the level of autozygosity — the proportion of an individual's genome that is homozygous due to descent from a common ancestor.
Q6 Tricky
What is the key difference between identical by descent (IBD) and identical by state (IBS)?
A. IBD refers to alleles on the same chromosome; IBS refers to alleles on different chromosomes
B. IBD alleles are always in heterozygous state; IBS alleles are always homozygous
C. IBD alleles are identical because they were inherited from a common ancestor; IBS alleles are identical by chance without known shared ancestry
D. There is no practical difference; IBD and IBS are interchangeable terms
Explanation
IBD (Identical by Descent): Two alleles are genetically identical AND came from the same common ancestor through inheritance. IBS (Identical by State): Two alleles look the same (same nucleotide sequence), but there is no known shared ancestor — they may be identical by chance. Only IBD contributes to inbreeding. IBS does not necessarily reflect inbreeding and may occur even in outbred populations. This distinction is crucial for correctly interpreting genomic inbreeding estimates.
Q7 Hard
Which of the following is NOT a limitation of the pedigree-based inbreeding coefficient (F_PED)?
A. It assumes all animals of the base population are unrelated
B. It does not account for the stochasticity of recombination during meiosis
C. It assumes all pedigree registrations are correct
D. It directly measures the actual homozygous regions in the individual's DNA
Explanation
Option D is not a limitation of F_PED: F_PED does not measure the DNA at all, and directly measuring homozygous regions is a feature of genomic methods. The actual limitations are: (i) assumes founder animals are unrelated, (ii) needs complete pedigree registration, (iii) assumes correct pedigree records, (iv) does not account for stochastic recombination events, and (v) does not consider selection biases on specific genomic regions. These are the five specific limitations listed in the lecture.
Q8 Easy
What is a Run of Homozygosity (ROH)?
A. A continuous stretch of DNA where all polymorphic loci are homozygous, with no heterozygous genotype
B. A region of the genome with higher-than-expected heterozygosity
C. A stretch of DNA where copy number is variable between individuals
D. A region where recombination rates are extremely high
Explanation
ROH are continuous and uninterrupted chromosome portions showing homozygosity at all loci without any heterozygous genotype. They provide evidence of identical by descent (IBD) inheritance. Only polymorphic sites (SNPs) are used to detect ROHs. The ROH ends where the first heterozygous SNP is encountered. ROHs of different sizes indicate different inbreeding histories: long ROHs suggest recent inbreeding, short ROHs suggest more distant ancestral events.
Q9 Medium
How is the genomic inbreeding coefficient F_ROH calculated?
A. Number of ROH segments divided by total number of chromosomes
B. Sum of lengths of all ROH segments divided by the total autosomal genome length
C. Average length of ROH segments divided by average chromosome length
D. Number of homozygous SNPs divided by total number of SNPs
Explanation
F_ROH = S_ROH / L_GEN, where S_ROH is the sum of all ROH segment lengths and L_GEN is the total length of the autosomal genome. This gives the proportion of the genome that is covered by homozygous segments inherited from a common ancestor. It directly measures autozygosity from DNA data, unlike pedigree-based estimates, which are theoretical probabilities.
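The ratio of summed ROH length to autosomal genome length can be computed directly from a list of ROH coordinates. A minimal sketch, assuming ROH segments are given as (start, end) positions in bp; the function name, the segment coordinates and the 2.5 Gbp autosome length are hypothetical.

```python
def f_roh(roh_segments, autosome_length_bp):
    """Genomic inbreeding coefficient: summed ROH length / autosomal genome length."""
    total_roh_length = sum(end - start for start, end in roh_segments)
    return total_roh_length / autosome_length_bp


# Hypothetical individual with two ROH segments (5 Mb and 2.5 Mb).
segments = [(0, 5_000_000), (10_000_000, 12_500_000)]
autosome_length = 2_500_000_000  # assumed autosomal genome length in bp
print(f"F_ROH = {f_roh(segments, autosome_length):.4f}")
```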
Q10 Tricky
What does a long ROH indicate, compared to a short ROH?
A. Long ROH = ancient inbreeding; Short ROH = recent inbreeding
B. Long ROH = higher mutation rate; Short ROH = lower mutation rate
C. Long ROH = recent inbreeding; Short ROH = ancient or distant inbreeding
D. Long ROH = more recombination events; Short ROH = fewer recombination events
Explanation
Long ROHs indicate recent inbreeding because recombination has not had enough generations to break up the large haplotype block inherited from close common ancestors. Short ROHs reflect ancient or distant relatedness where many recombination events have gradually broken down the original haplotype over many generations. This is the same principle used for LD-based age estimation of mutations: long haplotype + strong LD → recent; short haplotype + weak LD → older.
Q11 Medium
What effect does inbreeding have on genotype frequencies?
AIncreases homozygous frequencies and decreases heterozygous frequency
BChanges allele frequencies by favoring dominant alleles
CIncreases heterozygous frequency and decreases homozygous frequencies
DHas no effect on genotype frequencies but changes allele frequencies
Explanation
Inbreeding does NOT change allele frequencies, but it does alter genotype frequencies: it increases homozygous genotype frequencies (AA and aa) and decreases heterozygous genotype frequency (Aa). This deviation from expected frequencies violates HWE assumptions. A significant excess of homozygotes and deficit of heterozygotes detected by chi-squared testing is a hallmark of inbreeding.
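The claim that genotype frequencies shift while allele frequencies stay fixed can be checked numerically. The sketch below uses the standard textbook expressions for genotype frequencies under an inbreeding coefficient F (p² + Fpq, 2pq(1 − F), q² + Fpq); these expressions and the function names are assumptions brought in for illustration and are not quoted from the lecture.

```python
def genotype_frequencies(p, f):
    """Genotype frequencies at a biallelic locus with allele frequency p
    for allele A and inbreeding coefficient f (f = 0 gives HWE)."""
    q = 1.0 - p
    return {
        "AA": p * p + f * p * q,
        "Aa": 2.0 * p * q * (1.0 - f),
        "aa": q * q + f * p * q,
    }


def allele_frequency_of_a_upper(freqs):
    """Frequency of allele A recovered from genotype frequencies."""
    return freqs["AA"] + 0.5 * freqs["Aa"]


hwe = genotype_frequencies(0.6, 0.0)       # random mating
inbred = genotype_frequencies(0.6, 0.25)   # hypothetical F = 0.25

for label, freqs in (("F=0", hwe), ("F=0.25", inbred)):
    print(label, {g: round(v, 3) for g, v in freqs.items()},
          "p =", round(allele_frequency_of_a_upper(freqs), 3))
```

With p = 0.6, heterozygosity falls from 0.48 to 0.36 while the frequency of A stays at 0.6.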
Q12 Medium
What is an ROH island?
AA region of the genome with no ROHs in any individual
BA single very long ROH in one individual's genome
CA genomic region where ROHs overlap between chromosomes in the same individual
DA genomic region where a high proportion of individuals in a population share ROHs at the same location
Explanation
ROH islands are specific chromosome regions where a high frequency of individuals in a population share ROHs at the same genomic location — they are "hotspots" of shared homozygosity. They often indicate selection pressure (natural or artificial) acting on that region, because an allele in that region is advantageous when homozygous. For example, in African populations, ROH islands may contain genes for trypanosome resistance.
Q13 Tricky
A genotyping error inside a true ROH would most likely cause:
AThe ROH to appear longer than it actually is
BThe ROH to be broken into smaller pieces, underestimating inbreeding
CA false detection of a CNV at that position
DNo effect on ROH detection because algorithms account for all errors
Explanation
A single genotyping error inside a true ROH (e.g., a homozygous SNP mistakenly called as heterozygous) can break that continuous homozygous stretch into smaller pieces, artificially shortening the detected ROH length. This leads to underestimation of ROH lengths and potentially the overall inbreeding level. Mitigation: use high-quality SNP data, strict quality control, and appropriate minimum window sizes.
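The effect of one miscalled SNP can be shown with a deliberately naive ROH scanner. This is only a toy sketch: the function `find_roh`, its `min_snps` parameter, and the genotype strings are hypothetical, and real tools use sliding windows and tolerate a limited number of heterozygous or missing calls, which this sketch does not.

```python
def find_roh(genotypes, min_snps=3):
    """Return (first_index, last_index) for each run of consecutive
    homozygous calls that spans at least min_snps SNPs.

    genotypes: list of two-character strings such as "AA" or "AG".
    A single heterozygous call ends the current run (no error tolerance).
    """
    runs = []
    start = None
    for i, call in enumerate(genotypes):
        if call[0] == call[1]:          # homozygous call
            if start is None:
                start = i
        else:                           # heterozygous call closes the run
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))
            start = None
    # close a run that reaches the end of the list
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes) - 1))
    return runs


true_run = ["AA", "GG", "CC", "TT", "AA", "GG", "CC", "TT", "AA"]
with_error = list(true_run)
with_error[4] = "AG"   # one homozygous SNP miscalled as heterozygous

print(find_roh(true_run))                 # one run covering all 9 SNPs
print(find_roh(with_error))               # two shorter runs
print(find_roh(with_error, min_snps=5))   # nothing passes the length filter
```

With a stricter minimum length, the two fragments are both discarded, which is how a single error can remove a whole ROH from the inbreeding estimate.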
Q14 Medium
Which ROH pattern would you expect in a consanguineous population (close-kin mating)?
AVery high number and length of ROHs
BFew ROHs, mostly short
CVery few ROHs with high heterozygosity
DNo ROHs detectable
Explanation
Different populations show distinct ROH signatures: Large outbred = few, short ROHs; Admixed = very few ROHs, high heterozygosity; Small population = more numerous and longer ROHs; Consanguineous = very high number and length of ROHs (close-kin mating); Bottleneck = many ROHs, variable length. A consanguineous population has the most extreme ROH burden because close relatives share large identical chromosome segments.
Q15 Medium
What does FST = 0 between two populations indicate?
AComplete genetic differentiation with no shared alleles
BOne population is a subset of the other
CBoth populations have undergone a recent bottleneck
DNo genetic differentiation — both populations have the same allele frequencies
Explanation
FST = (HT − HS) / HT, where HT is total expected heterozygosity and HS is average expected heterozygosity within subpopulations. FST = 0 means no genetic differentiation (HT = HS), indicating populations are genetically identical in allele frequencies. FST = 1 means complete differentiation with no shared alleles. Values between 0 and 1 indicate partial differentiation.
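The formula can be tried on a single biallelic locus. The sketch below assumes equally sized subpopulations and takes expected heterozygosity as 2p(1 − p); the function names and the example allele frequencies are illustrative assumptions.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a biallelic locus."""
    return 2.0 * p * (1.0 - p)


def fst(subpop_freqs):
    """FST = (HT - HS) / HT for one locus.

    subpop_freqs: allele frequency of the same allele in each subpopulation
    (subpopulations are assumed to be of equal size).
    """
    p_bar = sum(subpop_freqs) / len(subpop_freqs)
    h_t = expected_heterozygosity(p_bar)          # total heterozygosity
    h_s = sum(expected_heterozygosity(p) for p in subpop_freqs) / len(subpop_freqs)
    if h_t == 0:
        raise ValueError("locus is monomorphic overall; FST is undefined")
    return (h_t - h_s) / h_t


print(fst([0.5, 0.5]))   # identical frequencies
print(fst([1.0, 0.0]))   # each population fixed for a different allele
print(fst([0.8, 0.2]))   # partial differentiation
```

The three calls illustrate the two extremes and an intermediate case (HT = 0.5, HS = 0.32 for the last one).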
Q16 Medium
Why are FST values typically calculated over genomic windows rather than individual SNPs?
AIndividual SNPs are always uninformative
BWindow-based analysis is computationally faster
CAveraging over windows captures linkage disequilibrium and provides more robust, less noisy estimates
DGenomic windows can only be applied to whole-genome sequencing data, not SNP chips
Explanation
Individual SNP-level FST can be noisy or biased. Averaging over genomic windows (e.g., 1 Mb) captures linkage disequilibrium — the causal mutation may influence nearby genetic variants. This sliding window approach provides a more robust measure of genetic differentiation and increases the signal-to-noise ratio. This is analogous to how sliding windows are used in selection scans.
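A minimal sketch of window averaging is given below. It uses fixed, non-overlapping windows and a plain mean of per-SNP values, which is a simplification of a true sliding-window scan; the function name, the input layout of (position, FST) pairs, and the example values are assumptions.

```python
from collections import defaultdict


def windowed_mean_fst(snp_fst, window_bp=1_000_000):
    """Average per-SNP FST values inside fixed, non-overlapping windows.

    snp_fst: iterable of (bp_position, fst_value) pairs on one chromosome.
    Returns {window_start_bp: mean FST of SNPs in that window}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for position, value in snp_fst:
        window = position // window_bp      # index of the window holding the SNP
        sums[window] += value
        counts[window] += 1
    return {w * window_bp: sums[w] / counts[w] for w in sorted(sums)}


# Hypothetical per-SNP values: two SNPs in the first Mb, one in the second.
per_snp = [(100_000, 0.1), (900_000, 0.3), (1_500_000, 0.8)]
windows = windowed_mean_fst(per_snp)
print(windows)
```

Windows containing very few SNPs remain noisy, so a minimum SNP count per window is a sensible extra filter in practice.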
Q17 Medium
In the PLINK PED file format, how many columns are needed to represent genotypes for 50 SNPs per individual?
A56 columns
B106 columns
C50 columns
D100 columns
Explanation
The PED file has 6 mandatory columns (Family ID, Individual ID, Paternal ID, Maternal ID, Sex, Phenotype) plus 2 columns per SNP (for the two alleles in a diploid organism). For n SNPs: total columns = 6 + 2n. For 50 SNPs: 6 + 2(50) = 106 columns.
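The column rule can be sketched in a few lines of Python. The helper names and the example line are hypothetical; the code only assumes the whitespace-delimited layout described above (6 mandatory columns, then two allele columns per SNP).

```python
MANDATORY_FIELDS = ("family_id", "individual_id", "paternal_id",
                    "maternal_id", "sex", "phenotype")


def ped_column_count(n_snps):
    """Total columns in a PED line: 6 mandatory + 2 alleles per SNP."""
    return len(MANDATORY_FIELDS) + 2 * n_snps


def parse_ped_line(line):
    """Split one PED line into (metadata dict, list of allele pairs)."""
    fields = line.split()
    n_allele_columns = len(fields) - len(MANDATORY_FIELDS)
    if n_allele_columns < 0 or n_allele_columns % 2 != 0:
        raise ValueError("expected 6 columns plus an even number of alleles")
    meta = dict(zip(MANDATORY_FIELDS, fields[:6]))
    alleles = fields[6:]
    # pair up consecutive allele columns: (a1, a2) per SNP
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return meta, genotypes


print(ped_column_count(50))
meta, genotypes = parse_ped_line("FAM1 IND1 0 0 1 -9 A A G T")
print(meta["individual_id"], genotypes)
```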
Q18 Tricky
In the PLINK MAP file, which of the following is the correct column order?
AChromosome, SNP Identifier, Genetic distance, Base-pair position
BSNP Identifier, Chromosome, Base-pair position, Genetic distance
CChromosome, Base-pair position, SNP Identifier, Genetic distance
DBase-pair position, Chromosome, Genetic distance, SNP Identifier
Explanation
The standard PLINK .map file has 4 columns in this order: (1) Chromosome number, (2) SNP Identifier (e.g., rs123456 or custom label), (3) Genetic distance (usually set to 0), (4) Base-pair position. This file serves as the reference for aligning genotype data in the .ped file and ensures correct interpretation of SNPs during analysis.
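To make the column order concrete, here is a short parsing sketch. The `MapRecord` type, the function name, and the example line (including its position value) are hypothetical; only the four-column order comes from the explanation above.

```python
from typing import NamedTuple


class MapRecord(NamedTuple):
    chromosome: str
    snp_id: str
    genetic_distance: float   # usually 0 when unknown
    bp_position: int


def parse_map_line(line):
    """Parse one whitespace-delimited .map line in the 4-column order:
    chromosome, SNP identifier, genetic distance, base-pair position."""
    chromosome, snp_id, distance, position = line.split()
    return MapRecord(chromosome, snp_id, float(distance), int(position))


record = parse_map_line("1 rs123456 0 752566")
print(record)
```

The i-th line of the .map file describes the i-th allele pair in every .ped line, so the two files must be kept in the same SNP order.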
Q19 Medium
In the context of linkage disequilibrium (LD), a "selective sweep" refers to:
AThe removal of all deleterious alleles from a population by natural selection
BThe random loss of alleles during a population bottleneck
CThe increase in frequency of a beneficial mutation along with nearby linked neutral variants
DThe gradual decay of LD between distant loci over evolutionary time
Explanation
A selective sweep occurs when a beneficial mutation increases in frequency in the population, and nearby neutral loci also increase in frequency — not because they are beneficial, but because they are physically linked to the selected allele on the same chromosome (hitchhiking). This creates a region of reduced variation and strong LD around the selected site. Over time, recombination gradually breaks down this LD block.
Q20 Hard
Which combination of genomic tools correctly matches: detecting outlier loci, visualizing population structure, and Bayesian selection testing?
APLINK, GWAS, BLAST
BPennCNV, PCA, TASSEL
CBayeScan, PLINK, GenomeStudio
DFST, PCA, BayeScan
Explanation
The three main statistical tools for identifying selection from the lecture are: (1) FST (Fixation Index) for detecting outlier loci by measuring genetic differentiation between populations; (2) PCA (Principal Component Analysis) for visualizing and grouping populations based on genetic similarity; and (3) BayeScan for Bayesian testing that explicitly compares a selection model versus a neutral drift model for each locus.
Q21 — Open Short Answer
Explain at least four limitations of the pedigree-based inbreeding coefficient (FPED), and describe how genomic inbreeding estimation methods (FROH) overcome these limitations.
✓ Model Answer

Limitations of FPED:

1. Assumes founders are unrelated: Does not account for true relatedness of base population animals.

2. Requires complete pedigree: Needs full registration for both paternal and maternal lineages; incomplete pedigrees lead to underestimation.

3. Assumes correct records: Cannot verify pedigree accuracy, especially in extensive production systems.

4. Ignores stochastic recombination: Assumes equal 25% inheritance from each grandparent, but actual inheritance varies (0-50%) due to random recombination.

5. Ignores selection: Does not consider biases from selection on specific genomic regions.

How FROH overcomes these:

FROH is calculated from actual DNA data (SNP genotyping or WGS) by measuring Runs of Homozygosity. It requires no pedigree information, directly measures autozygosity from the individual's genome, captures both recent inbreeding (long ROHs) and ancient inbreeding (short ROHs), and can detect hidden inbreeding from unknown or distant relatives. It reflects the real consequences of recombination and selection, providing more accurate individual-level estimates.

Q22 — Open Short Answer
Describe the FST statistic: what does it measure, how is it calculated, and how is it used in population genomics studies? Include interpretation of extreme values.
✓ Model Answer

FST (Fixation Index) measures the proportion of genetic diversity due to differences between populations versus within populations.

FST = (HT − HS) / HT

Where HT = total expected heterozygosity across all populations combined, and HS = average expected heterozygosity within individual subpopulations.

Interpretation:

• FST = 0: No differentiation — populations have identical allele frequencies.

• FST = 1: Complete differentiation — populations share no alleles (each fixed for different alleles).

• 0 < FST < 1: Partial differentiation — allele frequencies differ but overlap.

Practical use: Genomes are divided into windows (e.g., 1 Mb). FST is calculated for each window between population pairs and visualized in Manhattan plots. High FST peaks indicate regions of strong differentiation, potentially under selection. Regions with low FST show little differentiation, which is consistent with neutral evolution. This allows identification of genomic regions associated with adaptive traits or breed-specific selection.

Q23 — Open Calculation
In a PLINK PED file, a father has genotype TT at a locus, the mother has genotype AT, and their child has genotype AA. Explain whether this is consistent with Mendelian inheritance and what could cause such a result.
✓ Model Answer

This is a Mendelian inconsistency.

Father: TT → can only pass T allele to offspring
Mother: AT → can pass either A or T allele
Possible child genotypes: AT (T from father + A from mother) or TT (T from father + T from mother)
Child genotype AA is IMPOSSIBLE — the father cannot provide an A allele

Possible causes of this inconsistency include: (1) sequencing/genotyping error (e.g., the child's genotype was miscalled), (2) data formatting error in the PED file, (3) sample mislabeling (wrong sample assigned to the child), or (4) incorrect pedigree (the stated father is not the biological father).

Tools like PLINK can automatically identify and flag these Mendelian errors during quality control checks.
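The reasoning in this answer can be expressed as a small consistency check. This is a sketch of the logic only, not of how PLINK implements its Mendelian error checks; the function names are hypothetical and genotypes are written as two-letter strings.

```python
from itertools import product


def possible_child_genotypes(father, mother):
    """All genotypes obtainable from one paternal and one maternal allele.
    Genotypes are two-character strings; allele order is ignored."""
    return {"".join(sorted(pair)) for pair in product(father, mother)}


def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can arise from the stated parents."""
    return "".join(sorted(child)) in possible_child_genotypes(father, mother)


print(possible_child_genotypes("TT", "AT"))
print(is_mendelian_consistent("TT", "AT", "AA"))
print(is_mendelian_consistent("TT", "AT", "TA"))
```

A flagged trio says only that something is wrong; deciding between a miscalled genotype, a mislabeled sample, and a wrong pedigree requires looking at many loci.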

Q24 — Open Short Answer
Describe how linkage disequilibrium (LD) originates from a new beneficial mutation and explain how the length of the LD block can be used to estimate the age of the mutation.
✓ Model Answer

Origin of LD from a new mutation:

1. A new beneficial mutation arises on a single chromosome within a specific haplotype context (surrounding markers).

2. The mutation is initially in complete LD with all nearby variants on that chromosome (they form a single haplotype block).

3. If the mutation is advantageous, natural selection increases its frequency in the population — and the linked nearby markers "hitchhike" along (selective sweep).

4. Over generations, recombination during meiosis gradually breaks up the original haplotype, shortening the LD block around the mutation.

Estimating mutation age from LD:

• Long haplotype + strong LD around the mutation → the mutation is recent (recombination has not had time to break the block).

• Short haplotype + weak LD → the mutation is older (many generations of recombination have eroded the original haplotype).

This principle is used in population genomic scans to detect recent versus ancient selection events and to estimate when adaptive alleles arose in a population.
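The erosion of LD over generations can be sketched with the standard textbook decay model, D_t = D_0 (1 − c)^t, where c is the recombination fraction between the two loci. This model, the function names, and the numbers below are assumptions added for illustration; the lecture itself gives only the qualitative rule.

```python
import math


def ld_after_generations(d0, recomb_rate, generations):
    """LD coefficient D after a number of generations of random mating,
    using D_t = D_0 * (1 - c)^t."""
    return d0 * (1.0 - recomb_rate) ** generations


def generations_to_fraction(recomb_rate, fraction):
    """Generations needed for D to fall to `fraction` of its starting value."""
    return math.log(fraction) / math.log(1.0 - recomb_rate)


# Tightly linked marker (c = 0.001) versus a more distant one (c = 0.01)
for c in (0.001, 0.01):
    remaining = ld_after_generations(0.25, c, 100)
    half_life = generations_to_fraction(c, 0.5)
    print(f"c={c}: D after 100 generations = {remaining:.4f}, "
          f"half-life = {half_life:.0f} generations")
```

Markers far from the mutation lose LD first, so the block of strong LD around an old mutation shrinks toward the mutation itself.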

Q25 — Open Short Answer
List at least three factors that can cause biases or errors in ROH detection from genomic data. For each, explain how it affects ROH detection and how it can be mitigated.
✓ Model Answer

1. Genotyping Errors: A homozygous SNP miscalled as heterozygous breaks a true ROH into smaller pieces → underestimates inbreeding. Mitigation: Use high-quality SNP data, strict QC, appropriate minimum window sizes.

2. SNP Density and Distribution: Uneven SNP spacing means regions with low density may miss real ROHs or inaccurately size them, while high-density regions may detect more small ROHs. Mitigation: Consider SNP chip design, set minimum SNP number thresholds, interpret with caution in poorly covered regions.

3. Missing Data: Failed genotype calls create gaps that can break up ROHs → underestimation. Mitigation: Filter samples/SNPs with excessive missing data, use imputation, allow some tolerance for missing SNPs in ROH detection parameters.

4. Window Size Parameters: Thresholds set too low yield many spurious short ROHs (overestimate); thresholds set too high miss real short ROHs (underestimate). Mitigation: Choose parameters based on population-specific LD and SNP density; consult literature.

5. LD Variation: High LD populations may have long homozygous stretches by chance (false positives); low LD populations may have meaningful short ROHs. Mitigation: Adjust minimum ROH length thresholds based on typical LD structure.


📝Genome-Wide Association Studies (GWAS)
Q1 Easy
What is the primary aim of a genome-wide association study (GWAS)?
ATo sequence the complete genome of each individual in the study
BTo identify genetic variants (SNPs) associated with a trait or disease across the genome
CTo determine the complete pedigree of all study participants
DTo identify all genes in an organism's genome
Explanation
GWAS involve testing genetic variants across the genomes of many individuals to identify genotype–phenotype associations. Population-based association studies focus on identifying SNPs for which genotypes are associated with the trait under investigation, meaning they have different frequencies in affected vs. unaffected individuals, or different mean quantitative measures.
Q2 Medium
The common disease/common variant (CD/CV) hypothesis states that:
ACommon disorders are likely influenced by genetic variation that is also common in the population, with small individual effect sizes
BCommon diseases are caused by rare mutations with very large effect sizes
COnly one common variant is responsible for each common disease
DDiseases become common when their causative variants undergo positive selection
Explanation
The CD/CV hypothesis states that common disorders are influenced by common genetic variants. Key ramifications: (1) any single common variant must have a small effect size, and (2) multiple common alleles must influence disease susceptibility (the total genetic risk is spread across multiple genetic factors). This contrasts with Mendelian disorders where rare, highly penetrant alleles have large effect sizes.
Q3 Medium
What are the two main categories of phenotypes investigated in GWAS?
ADominant traits and recessive traits
BCoding variants and non-coding variants
CStructural variants and single nucleotide variants
DBinary disease/affected phenotypes (case-control) and quantitative (continuous) measurements
Explanation
The two most widely considered categories of traits in GWAS are: (a) binary disease/affected phenotypes, where individuals are classified as affected cases or unaffected controls (e.g., coronary heart disease), and (b) quantitative (continuous) measurements, such as lipid profiles, BMI, stature, etc. Each requires different statistical approaches.
Q4 Medium
What is a "tag SNP" in the context of GWAS?
AA SNP that directly causes the disease or trait being studied
BA SNP used to label different chromosomes for identification
CA SNP that serves as a proxy for nearby SNPs through LD, allowing genome-wide coverage without genotyping all SNPs
DA SNP located at the start of every LD block in the genome
Explanation
Tag SNPs are selected to guarantee coverage of all common polymorphisms at some threshold of r². Because SNPs within LD blocks are strongly correlated, we need not genotype all common polymorphisms genome-wide. Instead, GWAS arrays use a smaller number of tag SNPs from which we can recover information about common variation across the genome. GWASs thus rely on "indirect association" — tag SNPs may not be causal but serve as proxies for causal variants within the same LD block.
Q5 Hard
The widely accepted genome-wide significance threshold of p < 5 × 10⁻⁸ in GWAS corrects for approximately:
A500,000 independent SNPs
B1 million independent LD blocks
C10 million individual SNPs on the array
DThe total number of genes in the human genome
Explanation
The p < 5 × 10⁻⁸ threshold corrects for approximately 1 million blocks of LD across the genome, within which common SNPs are assumed to be strongly correlated (Pe'er et al., 2008). This is essentially a Bonferroni correction for 1 million independent tests: 0.05 / 1,000,000 = 5 × 10⁻⁸. It accounts for the LD structure rather than treating every single SNP as independent.
Q6 Medium
What is the Bonferroni correction in the context of GWAS?
ADividing the significance level α by the number of tests (N) to achieve an experimentwise false positive rate of α
BMultiplying each p-value by the sample size to account for population structure
CTaking the logarithm of all p-values to normalize the distribution
DUsing only SNPs with minor allele frequency above 5%
Explanation
The Bonferroni correction adjusts the significance level to maintain an overall experimentwise false positive error rate. When testing N SNPs, the SNP-wise significance level is set to α/N. The disadvantage is that it assumes each test is independent, but in GWAS, SNPs are correlated due to LD, making the correction conservative (too strict). The widely accepted threshold of 5 × 10⁻⁸ accounts for this by estimating ~1 million effective independent tests.
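The α/N rule is simple enough to check directly. The sketch below is illustrative; the function names are hypothetical and the p-values passed in are made up.

```python
def bonferroni_threshold(alpha, n_tests):
    """SNP-wise significance level that keeps the experimentwise
    false positive rate at alpha across n_tests independent tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def is_significant(p_value, alpha, n_tests):
    """True if p_value passes the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# About 1 million effective independent tests gives the usual GWAS threshold
threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"{threshold:.1e}")
print(is_significant(3e-9, 0.05, 1_000_000))   # passes
print(is_significant(2e-7, 0.05, 1_000_000))   # fails
```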
Q7 Tricky
Why is the Bonferroni correction considered conservative for GWAS?
ABecause it uses too few SNPs in the correction
BBecause it only works for quantitative traits, not binary traits
CBecause it assumes each test is independent, but SNPs are correlated due to LD, overcorrecting the significance level
DBecause it does not account for sample size
Explanation
The Bonferroni correction treats each SNP test as independent (α/N). However, in GWAS, many SNPs are correlated with each other due to LD (linkage disequilibrium). The actual number of independent tests is therefore lower than the total number of SNPs tested, meaning Bonferroni overcorrects and may miss true associations (loss of power). This is why the effective number of independent LD blocks (~1 million) is used rather than the total number of SNPs.
Q8 Medium
What does a genomic control inflation factor (λGC) greater than 1 indicate in a GWAS?
AThe study has too few samples
BAll associations found are true positives
CThe genotyping platform has a high error rate
DUnmeasured confounding due to genetic structure (population stratification)
Explanation
The genomic control inflation factor λGC is estimated by comparing the median of the observed test statistics with the median expected under the null distribution. λGC > 1 indicates inflation of test statistics due to unmeasured confounding from genetic structure (population stratification or cryptic relatedness). This is visualized on a QQ plot as observed p-values being more significant than expected under the null. A simple (but imperfect) correction is to divide all test statistics by λGC.
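A minimal sketch of the λGC calculation follows. It assumes the association tests are 1-degree-of-freedom chi-square statistics, whose null median is about 0.4549; the function names and the toy statistics are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom (assumed test type)
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """lambda_GC = median(observed statistics) / median under the null."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_correct(chi2_stats):
    """Divide every statistic by lambda_GC (the simple correction)."""
    lam = genomic_inflation(chi2_stats)
    return [x / lam for x in chi2_stats]


# Toy statistics whose median is exactly twice the null median
inflated = [0.2, 0.9098728, 4.0]
print(round(genomic_inflation(inflated), 3))
corrected = genomic_control_correct(inflated)
print(round(genomic_inflation(corrected), 3))
```

Because every statistic is scaled by the same factor, true signals are shrunk along with the inflated background, which is one reason the correction is described as imperfect.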
Q9 Hard
How does population stratification cause spurious associations in a case-control GWAS?
AIf disease prevalence differs between strata, cases are enriched from one stratum, and any SNP differing in frequency between strata will appear associated even without true association
BPopulation stratification increases the mutation rate in cases relative to controls
CStratification reduces linkage disequilibrium in the population, making tag SNPs unreliable
DIt causes genotyping errors that are specific to one population stratum
Explanation
Consider a population with two underlying strata that differ in disease prevalence. Cases will more often be selected from the stratum with higher disease prevalence. As a result, any SNP that differs in allele/genotype frequency between the strata will appear to be associated with disease, even if there is no true association within each stratum. This is a confounding effect — the SNP frequency difference is due to population structure, not disease biology. Solutions include matching cases and controls by stratum, or using statistical methods like PCA to correct for structure.
Q10 Medium
On a GWAS quantile-quantile (QQ) plot, what does inflation of observed -log10(p-values) above the y=x line indicate?
ANo significant associations were found
BPopulation structure that has not been accounted for in the analysis
CThe GWAS had too many samples
DAll genotyped SNPs are in perfect linkage equilibrium
Explanation
On a QQ plot, each SNP is plotted by its ranked observed -log10(p-value) against the expected ranked value under the null hypothesis. If most points fall on the y=x line, the study is well-calibrated. Systematic inflation above this line indicates that there are more significant signals than expected by chance, which is indicative of population structure not accounted for in the analysis. A few points deviating at the tail (far right) represent potentially true associations.
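The coordinates of a QQ plot can be computed without any plotting library. The sketch below assumes the common plotting position (i − 0.5)/n for the i-th smallest expected p-value; that choice, the function name, and the toy p-values are assumptions.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, sorted ascending.

    Expected values come from uniform quantiles (i - 0.5) / n, which is
    what p-values look like under the null hypothesis.
    """
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


# Perfectly calibrated toy data: p-values sit exactly on the uniform quantiles
calibrated = qq_points([0.125, 0.375, 0.625, 0.875])
# Same data with every p-value ten times smaller: systematic inflation
inflated = qq_points([0.0125, 0.0375, 0.0625, 0.0875])

for exp, obs in inflated:
    print(f"expected {exp:.3f}  observed {obs:.3f}")
```

In the inflated set every point sits one unit above the y = x line, the pattern expected from uncorrected structure rather than from a handful of true hits.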
Q11 Medium
Which of the following is NOT one of the six key design considerations listed for a GWAS?
APopulation structure and stratification
BGenome-wide significance and correction for multiple testing
CChoice of restriction enzyme for library preparation
DSample size
Explanation
The six key GWAS design considerations are: (1) Phenotype definition, (2) Structure of common genetic variation (LD), (3) Sample size, (4) Population structure/stratification, (5) Genome-wide significance and correction for multiple testing, and (6) Replication. Choice of restriction enzyme is relevant to RRL/GBS library preparation, not to GWAS study design.
Q12 Tricky
The False Discovery Rate (FDR) correction in GWAS is calculated as:
Aα × k / N, where k is number of significant SNPs and N is total SNPs
Bα / k, where k is the number of significant SNPs
Ck / (N × α), where N is total tests
DNα / k, where N is total SNPs, α is the SNP-wise significance level, and k is the number of SNPs with p < α
Explanation
The FDR (Benjamini and Hochberg, 1995) controls the expected proportion of false positives among the associations declared significant. For an uncorrected SNP-wise significance level of α, about Nα false positives are expected among the k SNPs with p < α, so FDR = Nα/k, where N is the total number of SNPs tested. Using this relationship, one can choose the SNP-wise significance threshold that gives a desired overall FDR. Unlike Bonferroni, FDR accounts for the number of actual discoveries made.
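The Nα/k relationship can be sketched as follows. The function name and the toy p-values are hypothetical; the estimate is capped at 1, and the case k = 0 (no discoveries) is returned as `None` because the ratio is undefined there.

```python
def fdr_estimate(p_values, alpha):
    """Estimated FDR = N * alpha / k at SNP-wise significance level alpha.

    N is the number of SNPs tested and k the number with p < alpha.
    Returns None when nothing is significant (k = 0).
    """
    n = len(p_values)
    k = sum(1 for p in p_values if p < alpha)
    if k == 0:
        return None
    return min(1.0, n * alpha / k)


# Toy data: 1,000 SNPs, 100 of them with small p-values
toy_p = [0.001] * 100 + [0.5] * 900
print(fdr_estimate(toy_p, 0.01))    # 10 expected false positives among 100 hits
print(fdr_estimate(toy_p, 0.0001))  # nothing passes this level
```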
Q13 Medium
Why is careful phenotype definition critical in a case-control GWAS design?
ANon-specific case-control definitions increase heterogeneity in causal polymorphisms, reducing power for detection
BIt determines which restriction enzymes are used for genotyping
CPhenotype definition only matters for quantitative traits, not binary traits
DIt is only important for replication studies, not discovery studies
Explanation
Careful phenotype definition is essential because non-specific case-control definitions can increase heterogeneity in the underlying causal genetic polymorphisms (and non-genetic risk factors), leading to decreased power for detection. If "cases" include individuals with different subtypes of a disease (each driven by different genetic variants), the signal from any single variant is diluted. The same principle applies to control definitions.
Q14 Medium
What is an "indirect association" in the context of GWAS?
AAn association between two different diseases mediated by the same gene
BAn association detected only in a replication cohort
CA genotyped tag SNP that is associated with a trait as a surrogate for the true causal variant through LD
DAn association that is not statistically significant after Bonferroni correction
Explanation
Genotyped tag SNPs often lie in a region of high linkage disequilibrium with the actual causal variant. The tag SNP will be statistically associated with the trait as a surrogate for the disease SNP through an indirect association. The tag SNP may not itself be causal, but its genotypes serve as proxies for those at the causal polymorphism located within the same block of LD. This is why GWAS identifies "associated regions" rather than specific causal variants.
Q15 — Open Short Answer
List and briefly explain the six key design considerations for planning a GWAS study.
✓ Model Answer

1. Phenotype definition: Precisely define the trait under investigation. For binary traits (case-control), ensure case and control definitions are specific to avoid heterogeneity that reduces power. For quantitative traits, use standardized measurements (e.g., BMI, height).

2. Structure of common genetic variation (LD): Understand the LD structure in the target population to select appropriate tag SNPs. Common SNPs are arranged in LD blocks; genotyping arrays exploit this to cover variation efficiently.

3. Sample size: Key determinant of statistical power. Power depends on significance level, effect size, causal allele frequency, and LD between causal variant and tag SNP. Effect sizes for complex traits are small, so large samples are needed.

4. Population structure/stratification: Unmeasured confounding from population structure can cause spurious associations. Must be detected (QQ plots, λGC) and corrected (PCA, genomic control, matching).

5. Genome-wide significance and multiple testing correction: Must correct for testing hundreds of thousands of SNPs. Standard threshold: p < 5 × 10⁻⁸. Methods include Bonferroni, FDR, and permutation procedures.

6. Replication: Findings should be validated in independent cohorts to confirm true associations and rule out false positives.

Q16 — Open Calculation
In a GWAS testing 500,000 SNPs at a significance level of α = 0.05: (a) How many false positive associations would you expect by chance without correction? (b) What is the Bonferroni-corrected significance threshold? (c) Why might the standard threshold of 5 × 10⁻⁸ differ from the strict Bonferroni value here?
✓ Model Answer

(a) Expected false positives without correction:

Expected false positives = N × α = 500,000 × 0.05 = 25,000 SNPs

Without correction, 5% of all SNPs (25,000) would appear significant by chance — a huge false positive problem.

(b) Bonferroni-corrected threshold:

α_corrected = α / N = 0.05 / 500,000 = 1 × 10⁻⁷

Each SNP must reach p < 1 × 10⁻⁷ to be declared significant.

(c) Why the standard threshold differs:

The standard GWAS threshold of 5 × 10⁻⁸ was derived by correcting for approximately 1 million independent LD blocks across the human genome, not the raw number of SNPs on the array. SNPs on a 500K array are correlated (in LD), so the array itself carries at most 500,000 independent tests; the ~1 million figure instead estimates the number of independent LD blocks of common variation in the whole genome (from HapMap data), which the tag SNPs stand in for. The Bonferroni correction for 1 million tests is 0.05 / 1,000,000 = 5 × 10⁻⁸. This more stringent threshold ensures genome-wide significance regardless of array density.

Q17 — Open Short Answer
Explain what population stratification is in GWAS, how it leads to spurious associations, and describe at least two methods to detect or correct for it.
✓ Model Answer

Population stratification arises when a study population consists of subgroups (strata) that differ in both allele frequencies and disease prevalence. If cases are preferentially drawn from one stratum, any SNP that differs between strata will appear associated with disease — even without a true biological link.

Example: A population has two ethnic strata. Disease X is more common in stratum 1. Cases will disproportionately come from stratum 1. A SNP that happens to be more common in stratum 1 (for ancestral reasons) will appear disease-associated even though it has no causal role.
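The example can be made numeric with a small deterministic calculation. In the sketch below the SNP has no effect on disease inside either stratum, yet its frequency differs between cases and controls once the strata are pooled. All numbers (stratum sizes, prevalences, allele frequencies) and the function name are made-up assumptions.

```python
def case_control_allele_freqs(strata):
    """Pooled allele frequency among cases and among controls.

    strata: list of dicts with keys
      'weight'      share of the total population in the stratum,
      'prevalence'  disease risk in the stratum,
      'allele_freq' SNP allele frequency in the stratum.
    Within each stratum the SNP is independent of disease status.
    """
    case_mass = sum(s["weight"] * s["prevalence"] for s in strata)
    control_mass = sum(s["weight"] * (1 - s["prevalence"]) for s in strata)
    case_freq = sum(s["weight"] * s["prevalence"] * s["allele_freq"]
                    for s in strata) / case_mass
    control_freq = sum(s["weight"] * (1 - s["prevalence"]) * s["allele_freq"]
                       for s in strata) / control_mass
    return case_freq, control_freq


strata = [
    {"weight": 0.5, "prevalence": 0.20, "allele_freq": 0.8},  # stratum 1
    {"weight": 0.5, "prevalence": 0.05, "allele_freq": 0.2},  # stratum 2
]
cases, controls = case_control_allele_freqs(strata)
print(f"cases: {cases:.3f}  controls: {controls:.3f}")
```

Setting the two prevalences equal makes the case and control frequencies identical, which shows that the apparent association comes entirely from the difference in disease prevalence between strata.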

Detection methods:

1. QQ plot inspection: Systematic inflation of observed p-values above the expected y=x line indicates population structure.

2. Genomic control (λGC): Comparing the median of observed test statistics with the null distribution. λGC > 1 indicates confounding from structure.

Correction methods:

1. Matching cases and controls by stratum to equalize population composition.

2. Dividing test statistics by λGC (simple but assumes uniform confounding across all SNPs, which may lose power).

3. PCA-based correction: Including principal components as covariates in the association model to adjust for ancestry differences.

4. Mixed models: Using kinship/relatedness matrices to account for both population structure and cryptic relatedness.

Q18 — Open Short Answer
Explain what linkage disequilibrium (LD) is, define the measures D' and r², and explain how LD is exploited in GWAS design through tag SNPs.
✓ Model Answer

Linkage Disequilibrium (LD) is a property of SNPs on a contiguous stretch of genomic sequence that describes the degree to which an allele of one SNP is inherited or correlated with an allele of another SNP within a population.

LD Measures:

The basic statistic is D = q₁₂ − q₁q₂, where q₁ and q₂ are allele frequencies and q₁₂ is the haplotype frequency. Under linkage equilibrium, D = 0 (alleles are randomly associated). To reduce dependence on allele frequencies, two standardized measures are used:

D': Ranges from 0 to 1. D' = 1 indicates complete LD (no recombination has occurred between the two loci).

r²: Ranges from 0 to 1. Represents the correlation between alleles. r² = 1 means the two SNPs are perfect proxies for each other. This is the most commonly used measure for GWAS design.
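A short sketch of the three statistics for a pair of biallelic SNPs is given below. D follows the definition above; D' and r² use the usual standardizations (|D| divided by its maximum possible value given the allele frequencies, and D² divided by q₁(1 − q₁)q₂(1 − q₂)). Those formulas, the function name, and the example frequencies are stated here as assumptions, since the answer above does not spell them out.

```python
def ld_measures(q1, q2, q12):
    """Return (D, D', r^2) for two biallelic SNPs.

    q1, q2: frequencies of the reference allele at SNP 1 and SNP 2.
    q12:    frequency of the haplotype carrying both reference alleles.
    Both SNPs must be polymorphic (0 < q < 1).
    """
    d = q12 - q1 * q2
    if d >= 0:
        d_max = min(q1 * (1 - q2), (1 - q1) * q2)
    else:
        d_max = min(q1 * q2, (1 - q1) * (1 - q2))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (q1 * (1 - q1) * q2 * (1 - q2))
    return d, d_prime, r2


print(ld_measures(0.5, 0.5, 0.25))  # linkage equilibrium
print(ld_measures(0.5, 0.5, 0.5))   # perfect proxies
print(ld_measures(0.5, 0.2, 0.2))   # complete LD but unequal allele frequencies
```

The last case gives D' = 1 with r² of only 0.25, which is why r² rather than D' is used to decide whether one SNP can tag another.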

Exploitation in GWAS:

Because SNPs within LD blocks are strongly correlated, GWAS arrays need not genotype every common SNP. Instead, "tag SNPs" are selected that guarantee coverage of all common polymorphisms at a predetermined r² threshold. This enables efficient genome-wide coverage with fewer markers. GWAS then identifies tag SNPs with "indirect association" — they are proxies for the causal variant located within the same LD block. The International HapMap Project characterized LD patterns across populations to enable this approach.