Lectures 11-12-13: Genotyping Tools, CNV, Population Genomics, ROH & GWAS
The expected frequency of a restriction site with an n-base recognition sequence is 1 in 4ⁿ bp, because each of the n positions matches with probability 1/4.
So an enzyme with a 6 bp recognition site would generate approximately 730,000 fragments from a 3 Gbp genome (3 × 10⁹ / 4⁶ ≈ 732,000). This principle underlies why enzymes with longer recognition sites produce fewer, larger fragments (e.g., an 8 bp cutter would produce ~46,000 fragments).
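As a quick check, a minimal Python sketch of this arithmetic, assuming uniform base composition and the 3 Gbp genome size used above:

```python
GENOME_SIZE = 3_000_000_000  # 3 Gbp, as in the example above

def expected_fragments(site_length: int, genome_size: int = GENOME_SIZE) -> int:
    # A recognition site is expected once every 4**site_length bases on average.
    return round(genome_size / 4 ** site_length)

for n in (4, 6, 8):
    print(f"{n} bp cutter: ~{expected_fragments(n):,} fragments")
# 6 bp cutter -> ~732,422 (the ~730,000 quoted above); 8 bp cutter -> ~45,776 (~46,000)
```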
A Reduced Representation Library (RRL) is a method to sequence only a fraction of the genome by using restriction enzymes to fragment the DNA and selecting fragments of a specific size range. This dramatically reduces the amount of sequencing needed while still sampling reproducible, genome-wide locations.
Steps (as in the Ramos et al. pig study):
1. Pool DNA from multiple individuals (equal amounts per individual) to capture population-level variation.
2. Digest pooled DNA with restriction enzymes (e.g., AluI, HaeIII, MspI).
3. Select fragments of a specific size range (reduced representation).
4. Sequence the selected fragments using NGS (e.g., Illumina).
5. Align reads to the reference genome and call SNPs using quality filters.
Why useful: It enables cost-effective, large-scale SNP discovery across the genome without the expense of whole-genome sequencing all individuals. The discovered SNPs can then be used to design SNP chips (e.g., PorcineSNP60 BeadChip).
Commercial SNP chips (e.g., PorcineSNP60, BovineSNP50): Pre-designed for common species with known SNPs. They offer standardized, trusted data quality, widespread adoption enabling cross-study comparisons, low per-sample cost, and comprehensive genome coverage. Best for well-studied species with existing tools.
Custom genotyping arrays: Designed for specific research needs. Advantages: enable studies of species/populations not supported by standard products, allow focus on specific genes/regions of interest, and conserve resources by excluding irrelevant regions.
Choose commercial when working with common species (cattle, pigs, humans), needing standardized results, or conducting large population studies. Choose custom when studying non-model species, targeting specific genomic regions relevant to a particular disease/trait, or when commercial products don't cover your variants of interest.
Steps of array CGH (aCGH) for CNV detection:
1. Design microarray with long oligonucleotide probes (50-70 bp) based on the reference genome, spaced at regular intervals while avoiding repetitive sequences.
2. Extract high-quality DNA from both a reference sample and the test sample.
3. Label reference DNA with Cy5 and test DNA with Cy3.
4. Co-hybridize both labeled DNAs to the microarray — they compete to bind the probes.
5. Scan fluorescent images using a microarray scanner.
6. Normalize the data.
7. Analyze CNVs using specialized software (e.g., CGHWeb, SignalMap).
Interpretation: The Log2 ratio of Cy3/Cy5 signal at each probe is calculated. A Log2 ratio of 0 indicates normal copy number, positive values indicate gains (duplications), and negative values indicate losses (deletions). Various smoothing/segmentation algorithms (e.g., CBS, BioHMM, GLAD) can be used to identify CNV regions from the raw data.
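A minimal sketch of this interpretation step, with hypothetical probe intensities and an illustrative ±0.3 calling cutoff (real pipelines use the segmentation algorithms named above rather than a fixed threshold):

```python
import numpy as np

# Per-probe log2(test/reference) ratios. Near 0 = normal copy number;
# a one-copy gain in a diploid genome gives log2(3/2) ~ +0.58,
# a one-copy loss gives log2(1/2) = -1.
cy3_test = np.array([1000.0, 1480.0, 510.0, 990.0])  # hypothetical intensities
cy5_ref = np.array([1010.0, 1000.0, 1000.0, 1005.0])

for ratio in np.log2(cy3_test / cy5_ref):
    state = "gain" if ratio > 0.3 else "loss" if ratio < -0.3 else "normal"
    print(f"{ratio:+.2f}  {state}")
```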
Read-Pair (RP): Compares insert sizes of paired-end reads with expected size from reference. Detects medium-sized insertions and deletions. Limitation: insensitive to small events because small perturbations are hard to distinguish from normal variability. Reports positions but not copy number counts.
Split-Read (SR): Uses read pairs where one mate maps normally but the other fails to map fully because it spans a breakpoint; splitting that read into segments that align on either side of the breakpoint locates it at single base pair resolution. Limitation: requires reads to span breakpoints, so it may miss larger events. Reports positions but not copy counts.
Read-Depth (RD): Based on the correlation between coverage depth and copy number. Unique advantage: can detect the exact number of copies (not just positions). Works well for large CNVs. Can be applied to single samples, case/control pairs, or populations. Requires normalization for GC content and repeat biases (see the sketch after this list).
Assembly-based (AS): Generates contigs/scaffolds and compares them with the reference. Theoretically can detect all forms of variation. Major limitations: high computational resource demands, poor performance in repeat regions, and can only detect homozygous structural variants (cannot handle haplotype sequences).
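A toy sketch of the read-depth logic from the list above: in a diploid genome, the copy number of a window is roughly 2 × (window depth / genome-wide median depth). GC and mappability normalization, which real RD tools require, is omitted here, and the depth values are hypothetical.

```python
import numpy as np

def estimate_copy_number(window_depths: np.ndarray) -> np.ndarray:
    # Scale per-window coverage so the genome-wide median corresponds to 2 copies.
    return 2 * window_depths / np.median(window_depths)

depths = np.array([30.0, 31.0, 15.5, 29.0, 61.0, 30.5])  # hypothetical per-window coverage
print(np.round(estimate_copy_number(depths)))  # [2. 2. 1. 2. 4. 2.] -> one deletion, one duplication
```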
PennCNV requires: (1) LRR (Log R Ratio) and BAF (B-Allele Frequency) values from the signal intensity file, (2) population frequency of B alleles, (3) SNP genome coordinates, and (4) an appropriate HMM model.
PennCNV uses a Hidden Markov Model (HMM) that integrates multiple sources of information to infer CNV calls. Unlike segmentation-based algorithms that rely primarily on signal intensity alone, PennCNV also considers the SNP allelic ratio distribution (BAF) and other factors. This integration of multiple data sources (LRR + BAF + population frequencies) makes it more robust at distinguishing true CNVs from noise.
PennCNV can handle both Illumina and Affymetrix array data and can optionally utilize family information to generate family-based CNV calls or use a validation-calling algorithm for specific candidate CNV regions.
Limitations of FPED (the pedigree-based inbreeding coefficient):
1. Assumes founders are unrelated: Does not account for true relatedness of base population animals.
2. Requires complete pedigree: Needs full registration for both paternal and maternal lineages; incomplete pedigrees lead to underestimation.
3. Assumes correct records: Cannot verify pedigree accuracy, especially in extensive production systems.
4. Ignores stochastic recombination: Assumes equal 25% inheritance from each grandparent, but actual inheritance varies (0-50%) due to random recombination.
5. Ignores selection: Does not consider biases from selection on specific genomic regions.
How FROH overcomes these:
FROH is calculated from actual DNA data (SNP genotyping or WGS) by measuring Runs of Homozygosity. It requires no pedigree information, directly measures autozygosity from the individual's genome, captures both recent inbreeding (long ROHs) and ancient inbreeding (short ROHs), and can detect hidden inbreeding from unknown or distant relatives. It reflects the real consequences of recombination and selection, providing more accurate individual-level estimates.
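A minimal FROH computation, with illustrative ROH segment lengths and a roughly 2.5 Gb autosomal length (both are assumptions, not values from the lecture):

```python
# FROH = total length of ROH segments / total autosomal length covered by SNPs
roh_segments_bp = [12_000_000, 4_500_000, 850_000]  # hypothetical detected ROHs
autosome_length_bp = 2_500_000_000

f_roh = sum(roh_segments_bp) / autosome_length_bp
print(f"FROH = {f_roh:.4f}")  # 0.0069
```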
FST (Fixation Index) measures the proportion of genetic diversity due to differences between populations versus within populations:
FST = (HT − HS) / HT
where HT = total expected heterozygosity across all populations combined, and HS = average expected heterozygosity within individual subpopulations.
Interpretation:
• FST = 0: No differentiation — populations have identical allele frequencies.
• FST = 1: Complete differentiation — populations share no alleles (each fixed for different alleles).
• 0 < FST < 1: Partial differentiation — allele frequencies differ but overlap.
Practical use: Genomes are divided into windows (e.g., 1 Mb). FST is calculated for each window between population pairs and visualized in Manhattan plots. High FST peaks indicate regions of strong differentiation, potentially under selection. Low FST regions evolve neutrally. This allows identification of genomic regions associated with adaptive traits or breeding-specific selection.
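A sketch of the windowed calculation for two populations, using FST = (HT − HS) / HT with H = 2p(1 − p); the per-window allele frequencies are hypothetical:

```python
def fst(p1: float, p2: float) -> float:
    p_bar = (p1 + p2) / 2                                # pooled allele frequency
    h_t = 2 * p_bar * (1 - p_bar)                        # total expected heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2    # mean within-population heterozygosity
    return (h_t - h_s) / h_t if h_t > 0 else 0.0

# One representative SNP frequency pair per 1 Mb window; window 2 would
# appear as a high-FST peak in a Manhattan plot.
for i, (p1, p2) in enumerate([(0.50, 0.52), (0.48, 0.51), (0.10, 0.90), (0.30, 0.35)]):
    print(f"window {i}: FST = {fst(p1, p2):.3f}")
```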
A genotype that cannot have been produced by the recorded parents (e.g., a child homozygous for an allele that neither parent carries) is a Mendelian inconsistency.
Possible causes of this inconsistency include: (1) sequencing/genotyping error (e.g., the child's genotype was miscalled), (2) data formatting error in the PED file, (3) sample mislabeling (wrong sample assigned to the child), or (4) incorrect pedigree (the stated father is not the biological father).
Tools like PLINK can automatically identify and flag these Mendelian errors during quality control checks.
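A sketch of the underlying check for a biallelic SNP, using 0/1/2 allele-count genotypes (conceptually what PLINK's Mendel-error screen does; this is not PLINK's implementation):

```python
def possible_offspring(father: int, mother: int) -> set[int]:
    # A parent transmits one allele: a heterozygote (1) can pass 0 or 1 copies
    # of the alternate allele; homozygotes (0 or 2) pass exactly g // 2 copies.
    gametes = lambda g: {0, 1} if g == 1 else {g // 2}
    return {f + m for f in gametes(father) for m in gametes(mother)}

def is_mendelian_error(father: int, mother: int, child: int) -> bool:
    return child not in possible_offspring(father, mother)

print(is_mendelian_error(0, 0, 1))  # True: child carries an allele neither parent has
print(is_mendelian_error(1, 1, 2))  # False: consistent with the pedigree
```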
Origin of LD from a new mutation:
1. A new beneficial mutation arises on a single chromosome within a specific haplotype context (surrounding markers).
2. The mutation is initially in complete LD with all nearby variants on that chromosome (they form a single haplotype block).
3. If the mutation is advantageous, natural selection increases its frequency in the population — and the linked nearby markers "hitchhike" along (selective sweep).
4. Over generations, recombination during meiosis gradually breaks up the original haplotype, shortening the LD block around the mutation.
Estimating mutation age from LD:
• Long haplotype + strong LD around the mutation → the mutation is recent (recombination has not had time to break the block).
• Short haplotype + weak LD → the mutation is older (many generations of recombination have eroded the original haplotype).
This principle is used in population genomic scans to detect recent versus ancient selection events and to estimate when adaptive alleles arose in a population.
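The age estimate can be made quantitative with the standard decay formula D_t = D₀(1 − c)ᵗ, where c is the recombination fraction per generation; solving for t gives a rough age, as in this sketch (values are illustrative):

```python
import math

def generations_elapsed(d0: float, dt: float, c: float) -> float:
    # From D_t = d0 * (1 - c)**t  =>  t = ln(dt / d0) / ln(1 - c)
    return math.log(dt / d0) / math.log(1 - c)

# LD with a marker ~1 cM away (c = 0.01) has decayed to half its initial value:
print(f"~{generations_elapsed(1.0, 0.5, 0.01):.0f} generations")  # ~69
```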
1. Genotyping Errors: A homozygous SNP miscalled as heterozygous breaks a true ROH into smaller pieces → underestimates inbreeding. Mitigation: Use high-quality SNP data, strict QC, appropriate minimum window sizes.
2. SNP Density and Distribution: Uneven SNP spacing means regions with low density may miss real ROHs or inaccurately size them, while high-density regions may detect more small ROHs. Mitigation: Consider SNP chip design, set minimum SNP number thresholds, interpret with caution in poorly covered regions.
3. Missing Data: Failed genotype calls create gaps that can break up ROHs → underestimation. Mitigation: Filter samples/SNPs with excessive missing data, use imputation, allow some tolerance for missing SNPs in ROH detection parameters.
4. Window Size Parameters: Too low thresholds → many spurious short ROHs (overestimate); too high → miss real short ROHs (underestimate). Mitigation: Choose parameters based on population-specific LD and SNP density; consult literature.
5. LD Variation: High LD populations may have long homozygous stretches by chance (false positives); low LD populations may have meaningful short ROHs. Mitigation: Adjust minimum ROH length thresholds based on typical LD structure.
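A toy ROH scan illustrating pitfalls 1 and 4 above: it tolerates a limited number of heterozygous calls per run (genotyping-error allowance) and requires a minimum SNP count (window-size threshold). Real tools such as PLINK's --homozyg add length, density, and missing-data parameters; the genotype codes and thresholds here are illustrative.

```python
def find_roh(genotypes: list[int], min_snps: int = 5, max_hets: int = 1) -> list[tuple[int, int]]:
    # genotypes: 0/2 = homozygous, 1 = heterozygous; returns (start, end) index pairs.
    runs, start, hets = [], None, 0
    for i, g in enumerate(genotypes + [None]):  # None sentinel flushes the final run
        homozygous = g in (0, 2)
        tolerated_het = g == 1 and start is not None and hets < max_hets
        if homozygous or tolerated_het:
            hets += (g == 1)
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))
            start, hets = None, 0
    return runs

geno = [0, 0, 2, 0, 1, 0, 0, 1, 1, 2, 0, 0, 0, 0, 2]
print(find_roh(geno))  # [(0, 6), (9, 14)]: one het is tolerated, two consecutive hets break the run
```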
1. Phenotype definition: Precisely define the trait under investigation. For binary traits (case-control), ensure case and control definitions are specific to avoid heterogeneity that reduces power. For quantitative traits, use standardized measurements (e.g., BMI, height).
2. Structure of common genetic variation (LD): Understand the LD structure in the target population to select appropriate tag SNPs. Common SNPs are arranged in LD blocks; genotyping arrays exploit this to cover variation efficiently.
3. Sample size: Key determinant of statistical power. Power depends on significance level, effect size, causal allele frequency, and LD between causal variant and tag SNP. Effect sizes for complex traits are small, so large samples are needed.
4. Population structure/stratification: Unmeasured confounding from population structure can cause spurious associations. Must be detected (QQ plots, λGC) and corrected (PCA, genomic control, matching).
5. Genome-wide significance and multiple testing correction: Must correct for testing hundreds of thousands of SNPs. Standard threshold: p < 5 × 10⁻⁸. Methods include Bonferroni, FDR, and permutation procedures.
6. Replication: Findings should be validated in independent cohorts to confirm true associations and rule out false positives.
(a) Expected false positives without correction:
With 500,000 SNPs tested at α = 0.05, roughly 0.05 × 500,000 = 25,000 SNPs would appear significant by chance — a huge false positive problem.
(b) Bonferroni-corrected threshold:
0.05 / 500,000 = 1 × 10⁻⁷, so each SNP must reach p < 1 × 10⁻⁷ to be declared significant.
(c) Why the standard threshold differs:
The standard GWAS threshold of 5 × 10⁻⁸ was derived by correcting for approximately 1 million independent common-variant tests (LD blocks) across the human genome, not the raw number of SNPs on the array. Because many SNPs on a 500K array are correlated (in LD), the array size understates the number of independent signals genome-wide; HapMap data suggest roughly 1 million independent blocks. Bonferroni correction for 1 million tests gives 0.05 / 1,000,000 = 5 × 10⁻⁸. This more stringent threshold ensures genome-wide significance regardless of array density.
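The arithmetic from (a)–(c) in one place, assuming the 500K array and α = 0.05 from the question:

```python
alpha, n_snps, n_independent = 0.05, 500_000, 1_000_000

print(alpha * n_snps)         # (a) expected chance hits without correction: 25000.0
print(alpha / n_snps)         # (b) per-array Bonferroni threshold: 1e-07
print(alpha / n_independent)  # (c) genome-wide threshold from ~1M independent tests: 5e-08
```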
Population stratification arises when a study population consists of subgroups (strata) that differ in both allele frequencies and disease prevalence. If cases are preferentially drawn from one stratum, any SNP that differs between strata will appear associated with disease — even without a true biological link.
Example: A population has two ethnic strata. Disease X is more common in stratum 1. Cases will disproportionately come from stratum 1. A SNP that happens to be more common in stratum 1 (for ancestral reasons) will appear disease-associated even though it has no causal role.
Detection methods:
1. QQ plot inspection: Systematic inflation of observed p-values above the expected y=x line indicates population structure.
2. Genomic control (λGC): the ratio of the median observed test statistic to its expected median under the null (≈0.455 for a 1 df χ²). λGC > 1 indicates inflation consistent with confounding from structure.
Correction methods:
1. Matching cases and controls by stratum to equalize population composition.
2. Dividing test statistics by λGC (simple but assumes uniform confounding across all SNPs, which may lose power).
3. PCA-based correction: Including principal components as covariates in the association model to adjust for ancestry differences.
4. Mixed models: Using kinship/relatedness matrices to account for both population structure and cryptic relatedness.
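A sketch of computing λGC from a vector of GWAS p-values (simulated under the null here, so the result should be ≈1; values well above 1 would suggest stratification):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pvalues = rng.uniform(size=100_000)        # stand-in for genome-wide association p-values

chi2_obs = stats.chi2.isf(pvalues, df=1)   # back-transform p-values to 1-df chi-square statistics
null_median = stats.chi2.ppf(0.5, df=1)    # ~0.455
lambda_gc = np.median(chi2_obs) / null_median
print(f"lambda_GC = {lambda_gc:.3f}")      # ~1.00 under the null
```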
Linkage Disequilibrium (LD) is a property of SNPs on a contiguous stretch of genomic sequence that describes the degree to which an allele of one SNP is inherited or correlated with an allele of another SNP within a population.
LD Measures:
The basic statistic is D = q₁₂ − q₁q₂, where q₁ and q₂ are the frequencies of one allele at each of the two loci and q₁₂ is the frequency of the haplotype carrying both alleles. Under linkage equilibrium, D = 0 (alleles are randomly associated). Because the range of D depends on allele frequencies, two standardized measures are used:
• D': D scaled by its maximum attainable value given the allele frequencies; |D'| ranges from 0 to 1, and |D'| = 1 indicates complete LD (no recombination has occurred between the two loci).
• r²: Ranges from 0 to 1. Represents the correlation between alleles. r² = 1 means the two SNPs are perfect proxies for each other. This is the most commonly used measure for GWAS design.
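A sketch computing D, |D'|, and r² from phased haplotype counts at two biallelic SNPs (the counts are hypothetical):

```python
def ld_measures(n_AB: int, n_Ab: int, n_aB: int, n_ab: int):
    n = n_AB + n_Ab + n_aB + n_ab
    p_A, p_B = (n_AB + n_Ab) / n, (n_AB + n_aB) / n    # allele frequencies at each locus
    d = n_AB / n - p_A * p_B                           # basic LD statistic
    d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B) if d > 0 \
        else min(p_A * p_B, (1 - p_A) * (1 - p_B))     # max |D| given the frequencies
    r2 = d ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))  # squared allelic correlation
    return d, abs(d) / d_max, r2

print(ld_measures(45, 5, 5, 45))  # strong LD: D = 0.2, |D'| = 0.8, r2 = 0.64
```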
Exploitation in GWAS:
Because SNPs within LD blocks are strongly correlated, GWAS arrays need not genotype every common SNP. Instead, "tag SNPs" are selected that guarantee coverage of all common polymorphisms at a predetermined r² threshold. This enables efficient genome-wide coverage with fewer markers. GWAS then identifies tag SNPs with "indirect association" — they are proxies for the causal variant located within the same LD block. The International HapMap Project characterized LD patterns across populations to enable this approach.