Lecture 9 – Application of NGS: Different Approaches
Approach: Pool-seq combines DNA from multiple individuals in equimolar amounts into a single pool, prepares one library from that pool, and sequences the whole genome. Reads are mapped to a reference genome, and allele frequencies are estimated at each variant position from the proportion of reads carrying each allele.
Information provided: Population-level allele frequency estimates for SNPs across the genome. It enables comparison of allele frequencies between groups (e.g., using FST).
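As a rough illustration of the two quantities mentioned above, the sketch below estimates an allele frequency from pooled read counts and computes a simple per-SNP FST between two pools. The function names and read counts are hypothetical, and the formula is the basic heterozygosity-based FST with equally weighted pools; dedicated Pool-seq tools additionally correct for pool size and sequencing depth, which this sketch does not.

```python
def allele_frequency(alt_reads, total_reads):
    """Estimate the alternate-allele frequency in a pool from read counts."""
    if total_reads == 0:
        raise ValueError("no coverage at this site")
    return alt_reads / total_reads


def fst(p1, p2):
    """Basic FST for one biallelic SNP and two equally weighted pools.

    FST = (H_T - H_S) / H_T, with expected heterozygosity H = 2p(1 - p).
    """
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)          # heterozygosity of the combined pools
    if h_t == 0:
        return 0.0                         # site is fixed for the same allele in both pools
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-pool heterozygosity
    return (h_t - h_s) / h_t


# Hypothetical read counts at one SNP: (alternate reads, total reads) per pool
red = allele_frequency(45, 50)       # 0.9
yellow = allele_frequency(5, 50)     # 0.1
print(round(fst(red, yellow), 2))    # 0.64
```

SNPs with FST close to 1 are the ones whose allele frequencies differ most between the two groups; values near 0 indicate similar frequencies.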
Limitations: (1) No individual genotype data: variants cannot be traced to specific individuals. (2) Rare variants are hard to detect because low-frequency alleles are difficult to distinguish from sequencing errors. (3) Haplotype phasing is impossible. (4) Unequal DNA input per individual biases the allele frequency estimates. (5) Not suitable for clinical diagnostics.
When preferred: When comparing populations or extreme phenotype groups (e.g., red vs. yellow canaries, healthy vs. diseased), when budget limits individual sequencing, and when the goal is allele frequency estimation rather than individual-level genotyping.
Purpose: ChIP-seq identifies the genome-wide binding sites of DNA-associated proteins (transcription factors, or histones carrying a specific modification) to understand gene regulation.
Steps: (1) Crosslink proteins to DNA using formaldehyde. (2) Fragment chromatin by sonication or enzymatic digestion. (3) Immunoprecipitate protein-DNA complexes using a specific antibody against the protein of interest. (4) Reverse crosslinks and extract the captured DNA. (5) Sequence the DNA using NGS.
Identifying binding sites: Sequenced reads are aligned to a reference genome. Regions with significantly enriched read coverage (peaks) indicate where the protein was bound. Peak-calling algorithms identify these enriched regions. Peaks can then be annotated for overlap with promoters, enhancers, or other regulatory elements, which suggests which genes the protein may regulate.
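The idea of "significantly enriched read coverage" can be sketched with a deliberately simplified peak caller: count reads in fixed windows and flag windows whose count is unlikely under a Poisson background. The window size, threshold, function names, and toy read positions are all assumptions made for illustration; real peak callers such as MACS use a more careful, locally estimated background and usually a control sample.

```python
import math
from collections import Counter


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower tail."""
    term = math.exp(-lam)        # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term              # add P(X = i)
        term *= lam / (i + 1)    # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)


def call_peaks(read_starts, genome_length, window=200, alpha=1e-5):
    """Return (window_start, read_count, p_value) for enriched windows."""
    counts = Counter(pos // window for pos in read_starts)
    n_windows = math.ceil(genome_length / window)
    lam = len(read_starts) / n_windows   # background: mean reads per window
    peaks = []
    for idx in sorted(counts):
        p = poisson_sf(counts[idx], lam)
        if p < alpha:
            peaks.append((idx * window, counts[idx], p))
    return peaks


# Toy data: one background read per window, plus 30 extra reads near position 4000
reads = list(range(0, 10_000, 200)) + [4000 + i for i in range(30)]
for start, count, _ in call_peaks(reads, genome_length=10_000):
    print(start, count)   # 4000 31
```

Only the window starting at position 4000 is reported; the single background read in every other window is fully consistent with the expected rate.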
The average exome depth is 300×. This is well above the recommended ~100× for confident genotype calling from NGS data. However, this is an ideal calculation: in practice, not all reads map on-target (capture efficiency is typically 60–80%), so the effective depth would be roughly 180–240×, lower but still sufficient.
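The on-target adjustment can be checked with a couple of lines. This is only a sketch of the arithmetic, using the 300× ideal depth, the ~100× recommendation, and the 60–80% capture efficiency stated above; the function name is hypothetical.

```python
def effective_depth(ideal_depth, on_target_fraction):
    """Depth remaining after discarding reads that map off-target."""
    return ideal_depth * on_target_fraction


RECOMMENDED_DEPTH = 100   # ~100x, as stated in the notes

for fraction in (0.6, 0.8):
    depth = effective_depth(300, fraction)
    verdict = "sufficient" if depth >= RECOMMENDED_DEPTH else "too low"
    print(f"{fraction:.0%} on-target: about {depth:.0f}x ({verdict})")
```

Even at the lower end of the capture-efficiency range, the effective depth stays above the recommended value.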
(1) Unrelated affected individuals: Sequence the exomes of multiple unrelated patients with the same disease. Apply discrete filtering to remove common variants (dbSNP, 1000 Genomes, gnomAD). Look for novel/rare variants in the same gene that are shared across affected individuals. Powerful for rare Mendelian disorders because ~98% of the variants in an exome are already known, so filtering leaves only a small set of candidates.
(2) Family-based segregation: Sequence affected and unaffected family members. Identify variants that co-segregate with the disease (present in all affected, absent in unaffected; for a recessive trait, unaffected relatives may carry the variant but only affected individuals are homozygous). This increases confidence that the variant tracks with the phenotype across generations. Used for dominant or recessive trait mapping.
(3) De novo trio analysis: Sequence the child and both healthy parents. Remove all shared/inherited variants and common variants. What remains are novel, de novo mutations unique to the child — strong candidates for ultra-rare syndromes with unclear inheritance.
(4) Extreme phenotype sequencing: For quantitative traits, select individuals at phenotypic extremes (e.g., tallest vs. shortest). Rare causative variants are enriched at the extremes. Can combine with Pool-seq to reduce costs. Used for height, BMI, fertility, and other continuous traits.
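The first three strategies are, at their core, set operations on variant lists. The sketch below shows them under simplifying assumptions: each exome is a set of variant identifiers, presence/absence is used instead of genotypes (so the segregation filter corresponds to a dominant-style model), and the variant IDs, gene names, and function names are all hypothetical.

```python
def discrete_filter(variants, known_common):
    """Drop variants already catalogued as common (e.g. in dbSNP or gnomAD)."""
    return variants - known_common


def shared_gene_candidates(patients, known_common, gene_of):
    """Strategy 1: genes carrying a rare variant in every unrelated patient."""
    gene_sets = [
        {gene_of[v] for v in discrete_filter(variants, known_common)}
        for variants in patients
    ]
    return set.intersection(*gene_sets)


def segregating(affected, unaffected):
    """Strategy 2: variants in all affected and in no unaffected relatives."""
    in_all_affected = set.intersection(*affected)
    in_any_unaffected = set.union(*unaffected) if unaffected else set()
    return in_all_affected - in_any_unaffected


def de_novo(child, mother, father, known_common):
    """Strategy 3: variants in the child found in neither parent nor databases."""
    return child - mother - father - known_common


# Hypothetical toy data
known_common = {"chr1:100:A>G", "chr2:200:C>T"}
gene_of = {
    "chr1:100:A>G": "GENE_A",
    "chr2:200:C>T": "GENE_B",
    "chr3:300:G>A": "GENE_X",
    "chr3:350:T>C": "GENE_X",
    "chr4:400:G>T": "GENE_Y",
}

patient_1 = {"chr1:100:A>G", "chr3:300:G>A"}
patient_2 = {"chr2:200:C>T", "chr3:350:T>C"}
print(shared_gene_candidates([patient_1, patient_2], known_common, gene_of))  # {'GENE_X'}

affected = [{"chr1:100:A>G", "chr3:300:G>A"}, {"chr3:300:G>A"}]
unaffected = [{"chr1:100:A>G"}]
print(segregating(affected, unaffected))             # {'chr3:300:G>A'}

child = {"chr1:100:A>G", "chr2:200:C>T", "chr4:400:G>T"}
mother = {"chr1:100:A>G"}
father = {"chr2:200:C>T"}
print(de_novo(child, mother, father, known_common))  # {'chr4:400:G>T'}
```

Note that strategy 1 intersects at the gene level rather than the variant level: unrelated patients often carry different rare variants in the same gene.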
Diagnosing the cause of a deviation from Hardy-Weinberg equilibrium (HWE):
(1) Check across populations: If the same SNP is out of HWE in all populations → likely a technical issue (poor probe, repetitive region). If only in some populations → may reflect biology (inbreeding, selection, population structure).
(2) Examine the genotype clustering plot: Good clustering (three clearly separated groups) → SNP is reliable, deviation may be biological. Poor/noisy clustering or missing clusters → technical failure.
(3) Consider genomic context: Is the SNP in a repetitive region or near a CNV? These locations cause unreliable probe binding.
(4) Adjust software parameters: Try tuning clustering thresholds in GenomeStudio to see if genotype calls improve.
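Before working through the checks above, the deviation itself has to be quantified. A minimal sketch of the standard chi-square test for HWE at one biallelic SNP follows; the genotype counts are hypothetical, and for small samples or rare alleles an exact test would normally be preferred over this approximation.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 degree of freedom) for Hardy-Weinberg equilibrium.

    Takes observed genotype counts AA, AB, BB and returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes called")
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))   # chi-square upper tail for 1 df
    return chi2, p_value


# Hypothetical genotype counts (AA, AB, BB) for one SNP in two populations
for name, counts in {"pop1": (25, 50, 25), "pop2": (40, 20, 40)}.items():
    chi2, p_value = hwe_chi_square(*counts)
    print(name, f"chi2={chi2:.1f}", "deviates" if p_value < 0.05 else "in HWE")
```

In this toy example pop2 shows a strong heterozygote deficit while pop1 fits expectations exactly, which is the "only in some populations" pattern from step (1).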
For version 2: Flag persistently problematic SNPs. Remove those that consistently fail HWE across all populations or show poor clustering. Replace them with new informative SNPs from better-characterized regions. Keep SNPs with biologically explainable HWE deviations if the clustering is clean.
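The version 2 rule can be written down as a small decision function. This is a sketch of the logic as stated in these notes, not an established pipeline: the function name and inputs are hypothetical, and whether a deviation is "biologically explainable" still needs human judgment, so the function only separates clear removals from SNPs worth keeping.

```python
def version2_decision(hwe_fail_by_population, clustering_is_clean):
    """Return 'remove' or 'keep' for one SNP.

    hwe_fail_by_population maps population name -> True if the SNP fails HWE there.
    clustering_is_clean is True if the genotype cluster plot shows clear separation.
    """
    if not clustering_is_clean:
        return "remove"                    # poor clustering points to technical failure
    fails = list(hwe_fail_by_population.values())
    if fails and all(fails):
        return "remove"                    # fails HWE everywhere: likely a probe problem
    return "keep"                          # clean clusters; any deviation may be biological


print(version2_decision({"pop1": True, "pop2": True}, clustering_is_clean=True))    # remove
print(version2_decision({"pop1": True, "pop2": False}, clustering_is_clean=True))   # keep
print(version2_decision({"pop1": False, "pop2": False}, clustering_is_clean=False)) # remove
```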
SNP chips: Require a reference genome and a pre-designed set of SNPs. Provide fixed, high-accuracy genotyping with built-in probe redundancy, making them robust even with poor DNA quality. Cost-effective for large sample sizes when markers are already established. Best for: GWAS, genomic selection, parentage testing in well-studied species. Limitation: they genotype only the pre-selected SNPs and cannot discover new variants.
NGS-based (GBS/RAD-seq): Can work without a reference genome (restriction enzymes create reproducible fragments). Enable simultaneous SNP discovery and genotyping. Cost-effective per locus but require higher sequencing depth (~100×) for confident genotype calls. Best for: population genomics in non-model organisms, diversity studies, evolutionary studies. Limitation: higher risk of genotyping errors at low depth; computationally intensive.
Key trade-off: SNP chips are more reliable and standardized; NGS-based methods are more flexible and can discover novel variation. The choice depends on the species (model vs. non-model), available resources, and whether discovery of new variants is needed.