Lecture 9 – Application of NGS: Different Approaches

📝Pool-seq & Targeted Sequencing
Q1 Easy
What is the primary advantage of Pool-seq over individual whole-genome sequencing?
A. It provides individual genotype data for each sample
B. It is a cost-effective way to estimate allele frequencies across a population
C. It allows haplotype phasing of complex variants
D. It detects rare variants more accurately than individual sequencing
Explanation
Pool-seq sequences the combined DNA of multiple individuals together, providing population-level allele frequency estimates at a much lower cost than sequencing each individual separately. However, it sacrifices individual genotype data and haplotype information. Rare variant detection is actually harder with Pool-seq because low-frequency alleles can be lost in noise.
Q2 Medium
When preparing a Pool-seq experiment, what does "equimolar pooling" ensure?
A. Each individual contributes the same number of reads after sequencing
B. All SNPs have equal minor allele frequency in the pool
C. Each individual's DNA contributes an equal number of genome copies to the pool
D. Equal amounts of PCR product are added from each individual
Explanation
Equimolar pooling means that each individual's DNA contributes an equal number of genome copies to the pool. This requires precise DNA quantification (e.g., spectrophotometry or fluorometry) before pooling. Without equimolar input, some individuals would be overrepresented, distorting allele frequency estimates and biasing population structure conclusions.
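The pooling arithmetic can be sketched as follows. This is a minimal illustration, assuming hypothetical fluorometry readings in ng/µL and individuals of the same species (so equal DNA mass corresponds to an equal number of genome copies); the function name `pooling_volumes` and all values are made up for the example.

```python
def pooling_volumes(concentrations_ng_per_ul, mass_per_sample_ng):
    """Return the volume (µL) of each sample that contains the same DNA mass."""
    volumes = {}
    for sample, conc in concentrations_ng_per_ul.items():
        if conc <= 0:
            raise ValueError(f"{sample}: concentration must be positive")
        # volume = mass / concentration
        volumes[sample] = mass_per_sample_ng / conc
    return volumes

# Hypothetical quantification results (ng/µL).
concentrations = {"ind1": 50.0, "ind2": 25.0, "ind3": 100.0}
volumes = pooling_volumes(concentrations, mass_per_sample_ng=100.0)
for sample, vol in volumes.items():
    print(f"{sample}: {vol:.1f} µL")
```

The more dilute a sample is, the larger the volume it contributes, so every individual ends up with the same DNA mass in the pool.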
Q3 Tricky
Which of the following is NOT a limitation of Pool-seq?
A. Individual genotypes cannot be recovered from pooled data
B. Haplotype phasing is impossible in pooled samples
C. Low-frequency alleles may be lost in sequencing noise
D. Allele frequencies cannot be estimated from pooled data
Explanation
Estimating allele frequencies is exactly what Pool-seq is designed for — it's the main strength, not a limitation. The actual limitations include: loss of individual genotypes (A), impossible haplotype phasing (B), difficulty detecting rare variants (C), potential bias from unequal DNA input, and unsuitability for clinical diagnostics.
Q4 Medium
In the Pool-seq study of red vs. yellow canaries, what statistical measure was used to identify genomic regions of differentiation between the two pools?
A. FST index
B. Chi-squared test for HWE
C. Linkage disequilibrium (r²)
D. FPKM normalization
Explanation
The FST index measures population differentiation based on allele frequency differences between groups. In the canary study, allele frequencies were compared between red and yellow pools, and FST peaks indicated genomic regions with strong differentiation — candidate regions for the red coloration phenotype.
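As a rough illustration of how an FST value follows from two pool allele frequencies, the sketch below uses the simple heterozygosity-based form FST = (H_T − H_S) / H_T for a single biallelic SNP. This is an assumption made for the example: the estimator actually used in the canary study is not stated here, and real Pool-seq tools also correct for pool size and read depth. The read counts are invented.

```python
def allele_frequency(alt_reads, total_reads):
    """Estimate the alternate-allele frequency of a pool from read counts."""
    return alt_reads / total_reads

def fst_two_pools(p1, p2):
    """Simple FST = (H_T - H_S) / H_T for one SNP and two equally weighted pools."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pools combined
    if h_t == 0:
        return 0.0                          # SNP is fixed for the same allele in both pools
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-pool heterozygosity
    return (h_t - h_s) / h_t

# Hypothetical read counts at one SNP: (alternate reads, total reads).
p_red = allele_frequency(90, 100)
p_yellow = allele_frequency(10, 100)
print(round(fst_two_pools(p_red, p_yellow), 2))
```

Scanning this value along the genome and looking for peaks is the idea behind the FST plots described above.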
Q5 Easy
Which targeted sequencing method is best suited for sequencing a small number of specific genomic regions?
A. Hybridization-based capture
B. PCR amplification combined with Sanger sequencing
C. Whole-exome sequencing
D. Pool-seq
Explanation
For a small number of targeted regions, PCR amplification followed by Sanger sequencing is the appropriate choice. As the number of targets increases: Ion AmpliSeq™ handles hundreds of genes, and hybridization-based capture (Ion TargetSeq™) is used for larger target regions up to ~60 Mb.
Q6 Medium
In Illumina amplicon sequencing, what makes it particularly useful for detecting rare somatic mutations in tumor biopsies?
A. It sequences the entire genome at low coverage
B. It uses random fragmentation to cover all genomic regions
C. Ultra-deep sequencing of PCR amplicons provides high sensitivity for variant detection
D. It eliminates the need for a reference genome
Explanation
Amplicon sequencing provides ultra-deep coverage of targeted regions. This high sequencing depth is critical for detecting rare somatic mutations in complex samples like tumor biopsies, where DNA from cancer cells is mixed with DNA from normal cells. A mutation present in only a small fraction of cells can still be detected with sufficient depth.
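Why depth matters can be shown with a simple binomial model. The sketch below assumes that each read independently carries the variant with probability equal to the variant allele fraction and ignores sequencing errors; the threshold of 5 supporting reads is a hypothetical calling rule, not one taken from the lecture.

```python
from math import comb

def prob_at_least_k_variant_reads(depth, vaf, k):
    """P(X >= k) for X ~ Binomial(depth, vaf), ignoring sequencing errors."""
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i) for i in range(k)
    )
    return 1 - p_fewer

# A mutation at 1% allele fraction, requiring at least 5 supporting reads.
for depth in (30, 100, 1000, 5000):
    p = prob_at_least_k_variant_reads(depth, vaf=0.01, k=5)
    print(f"{depth:>5}x: {p:.4f}")
```

At typical whole-genome depths the variant is almost never seen often enough to be called, while at amplicon-level depths detection becomes nearly certain under this model.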
Q7 Medium
Amplicon sequencing of bacterial 16S rRNA genes is widely used for:
A. Detecting human copy number variations
B. Performing genome-wide association studies
C. Whole-genome assembly of bacterial species
D. Phylogenetic and taxonomy studies in diverse metagenomics samples
Explanation
16S rRNA gene amplicon sequencing is a standard method for characterizing microbial communities (e.g., soil, water, human gut). The 16S rRNA gene contains conserved regions (for universal primer design) and variable regions (for species identification), making it ideal for phylogenetic classification and taxonomy assignment in metagenomics.
Q8 Tricky
Ion AmpliSeq™ panels consist of:
A. A pool of oligonucleotide primer pairs, each designed to amplify a specified genomic region
B. Biotinylated probes that hybridize to target regions and are captured with streptavidin
C. Short DNA fragments immobilized on a glass slide for hybridization
D. Restriction enzymes that cut DNA at specific sites for reduced representation
Explanation
AmpliSeq panels are PCR-based: they consist of pools of oligonucleotide primer pairs that amplify specified genomic regions. Option B describes hybridization capture (e.g., exome sequencing). Option C describes microarrays. Option D describes RAD-seq or reduced representation library approaches. Knowing the difference between amplicon-based and hybridization-based methods is key.
Q9 Easy
What percentage of the human genome does the exome represent?
A. Less than 0.5%
B. Less than 2%
C. About 15%
D. About 85%
Explanation
The human exome represents less than 2% of the genome but contains ~85% of known disease-related variants. This is why WES is so cost-effective: you sequence only ~4–5 Gb per exome vs. ~90 Gb for a whole genome, yet capture the vast majority of clinically relevant variation. Don't confuse the 2% (genome fraction) with the 85% (disease variant fraction).
Q10 Hard
In hybridization capture for whole-exome sequencing, what is the role of biotin-labeled probes and streptavidin beads?
A. Biotin fragments DNA; streptavidin sequences the fragments
B. Biotin amplifies target regions; streptavidin removes PCR duplicates
C. Biotin-labeled probes hybridize to target sequences; streptavidin beads pull down the probe-DNA complexes for isolation
D. Biotin labels the adapters; streptavidin separates the two DNA strands for sequencing
Explanation
In hybridization capture: (1) biotinylated probes (baits) hybridize to the target DNA regions (e.g., exons); (2) streptavidin-coated magnetic beads bind to the biotin on the probes, allowing physical separation of the probe-DNA complexes from non-target fragments. This biotin-streptavidin interaction is one of the strongest non-covalent bonds in nature, making capture highly efficient.

📝WES Strategies, Discrete Filtering & Epigenomics
Q11 Medium
In discrete filtering for Mendelian disorders, what is the purpose of the "filter set"?
A. To amplify rare variants before sequencing
B. To select common variants for GWAS analysis
C. To identify all variants shared between patients and controls
D. To eliminate common or known benign variants found in healthy populations, leaving only rare candidates
Explanation
The filter set (from databases like dbSNP, 1000 Genomes, gnomAD, or unaffected controls) contains common variants. By removing any variant found in the filter set, researchers eliminate likely benign polymorphisms. For rare Mendelian disorders, only ~2% of exome variants are novel, making this approach highly effective at narrowing candidates.
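At its core, discrete filtering is a set difference. The sketch below represents each variant as a hypothetical `(chromosome, position, ref, alt)` tuple with invented coordinates; in practice the filter set would be built from database exports rather than typed by hand.

```python
def discrete_filter(patient_variants, filter_set):
    """Keep only variants that are absent from the filter set."""
    return patient_variants - filter_set

# Variants as (chromosome, position, ref, alt); all values are made up.
patient = {
    ("chr1", 1000, "A", "G"),
    ("chr2", 5000, "C", "T"),
    ("chr7", 42000, "G", "A"),
}
filter_set = {
    ("chr1", 1000, "A", "G"),   # common polymorphism seen in public databases
    ("chr2", 5000, "C", "T"),   # seen in unaffected controls
}
candidates = discrete_filter(patient, filter_set)
print(candidates)
```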
Q12 Medium
A child presents with a rare genetic syndrome, but both parents are healthy. What WES-based strategy is most appropriate?
A. Trio sequencing (child + both parents) to identify de novo mutations
B. Pool-seq of the child's DNA with unrelated controls
C. RNA-seq of the affected tissue to find expression changes
D. Extreme phenotype sequencing comparing the child with healthy siblings
Explanation
When a child has a rare disorder but healthy parents, a de novo mutation is a likely cause. The trio approach sequences the child and both parents. Filtering removes all shared/inherited variants and common variants, leaving novel variants unique to the child as strong disease-causing candidates.
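The trio filter can be sketched with the same set logic: remove everything seen in either parent, then remove known common variants. Variant tuples and coordinates below are hypothetical, and the sketch ignores genotype quality and coverage checks that a real pipeline would need to avoid false de novo calls.

```python
def de_novo_candidates(child, mother, father, common_variants):
    """Variants present in the child but in neither parent nor the common set."""
    inherited = mother | father
    return child - inherited - common_variants

# Variants as (chromosome, position, ref, alt); all values are made up.
mother = {("chr1", 100, "A", "G"), ("chr3", 300, "T", "C")}
father = {("chr1", 100, "A", "G"), ("chr5", 500, "G", "A")}
child = {
    ("chr1", 100, "A", "G"),    # inherited
    ("chr5", 500, "G", "A"),    # inherited from father
    ("chr9", 900, "C", "T"),    # known common variant missed in the parents
    ("chrX", 1234, "G", "T"),   # candidate de novo mutation
}
common = {("chr9", 900, "C", "T")}
de_novo = de_novo_candidates(child, mother, father, common)
print(de_novo)
```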
Q13 Tricky
In extreme phenotype sequencing for a quantitative trait like height, what is the main rationale for selecting individuals from the tails of the distribution?
A. Extreme individuals have more de novo mutations
B. Rare causative variants are more likely to be enriched at phenotypic extremes
C. Extreme individuals have simpler genomes that are easier to sequence
D. It avoids the need for a reference genome during analysis
Explanation
By selecting individuals at the extremes of a quantitative trait distribution (e.g., tallest vs. shortest), rare causative variants are more likely to be concentrated in one tail. This increases the statistical power to detect them without sequencing the entire population. This approach can be combined with Pool-seq to further reduce costs.
Q14 Easy
In bisulfite sequencing, what happens to unmethylated cytosines?
A. They are converted to adenine
B. They remain as cytosine
C. They are converted to uracil, which is read as thymine during sequencing
D. They are removed from the DNA strand
Explanation
Sodium bisulfite converts unmethylated cytosines → uracil → read as thymine (C→T). Methylated cytosines are protected and remain as C. Therefore: C→C in reads = methylated; C→T in reads = unmethylated. This is the fundamental principle of bisulfite sequencing for studying DNA methylation.
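The conversion rule can be simulated in a few lines. This sketch handles only the top strand, assumes complete conversion, and uses a made-up sequence; `methylation_level` applies the usual C / (C + T) read-count ratio at a single cytosine.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Return the sequence as read after bisulfite treatment (top strand only).

    Unmethylated C is read as T; methylated C (0-based index listed in
    methylated_positions) stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence)
    )

def methylation_level(c_reads, t_reads):
    """Fraction of reads supporting a methylated C at one position."""
    total = c_reads + t_reads
    return c_reads / total if total else None

converted = bisulfite_convert("ACGTCCG", methylated_positions={1})
print(converted)
print(methylation_level(c_reads=8, t_reads=2))
```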
Q15 Hard
A major analytical challenge of bisulfite sequencing is distinguishing between:
A. Methylated CpG sites and unmethylated CpG islands
B. 5-methylcytosine and 5-hydroxymethylcytosine
C. CpG shores and CpG shelves
D. True C→T SNPs and C→T changes caused by bisulfite conversion of unmethylated cytosines
Explanation
A C→T change in bisulfite-treated reads could be either an epigenetic signal (bisulfite conversion of unmethylated C) or a real genetic variant (a true C→T SNP). To resolve this ambiguity, researchers can: (1) perform parallel WES/WGS without bisulfite treatment, (2) use specialized bioinformatics tools, or (3) exploit strand-specificity — the G on the opposing strand is unaffected by bisulfite treatment.
Q16 Medium
About 98% of DNA methylation in the human genome occurs at:
A. Adenine residues in GATC motifs
B. CpG dinucleotides
C. Thymine residues in repetitive elements
D. Guanine residues in GC-rich promoters
Explanation
In the human genome, about 98% of cytosine methylation occurs at CpG dinucleotides. These CpG sites often cluster into CpG islands (regions >500 bp), which are typically located near gene promoters. Highly methylated promoters are generally associated with repressed gene expression, while low methylation often indicates active transcription.
Q17 Medium
What is the correct order of steps in ChIP-seq?
A. Crosslink → Fragment chromatin → Immunoprecipitate with antibody → Extract DNA → Sequence
B. Extract DNA → Fragment → Bisulfite treat → Sequence → Call peaks
C. Immunoprecipitate → Crosslink → Sequence → Fragment → Align
D. Fragment → Hybridize probes → Pull down with streptavidin → Sequence → Quantify expression
Explanation
ChIP-seq workflow: (1) Crosslink proteins to DNA with formaldehyde; (2) Fragment chromatin (sonication/enzymatic); (3) Immunoprecipitate with a protein-specific antibody; (4) Reverse crosslinks and extract DNA; (5) Sequence and align reads. Peaks in read depth indicate protein binding sites. Option B describes bisulfite sequencing; option D describes hybridization capture.
Q18 Easy
What does ChIP-seq identify?
A. Differentially methylated regions across the genome
B. Gene expression levels across tissues
C. Genome-wide binding sites of DNA-associated proteins
D. Copy number variations in the genome
Explanation
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) identifies the binding sites of DNA-associated proteins genome-wide. These include transcription factors, histones carrying specific modifications, and other regulatory proteins. The "peaks" in the sequencing data correspond to protein binding locations, which help map regulatory elements like promoters and enhancers.
Q19 Tricky
Which statement about CpG islands is correct?
A. They are defined as regions shorter than 100 bp enriched in CpG dinucleotides
B. They are regions greater than 500 bp, typically located near gene promoters, and surrounded by shores and shelves
C. Heavily methylated CpG islands always indicate active gene transcription
D. They are found exclusively in intergenic regions far from any gene
Explanation
CpG islands are defined as regions >500 bp with high CpG density, typically near gene promoters. They are flanked by CpG shores (up to ~2 kb from the island) and, beyond those, CpG shelves (~2–4 kb from the island). Critically, high methylation of promoter CpG islands correlates with gene repression, not activation, so option C is reversed. Low methylation at promoters generally means active transcription.
Q20 Medium
In an RNA-seq experiment, which method is used to enrich mRNA from total RNA?
A. Bisulfite treatment
B. Immunoprecipitation with anti-RNA antibodies
C. Restriction enzyme digestion
D. Poly-A selection using oligo-dT primers/beads
Explanation
mRNA has a poly-A tail, so it can be enriched using oligo-dT primers or beads that bind these tails. An alternative approach is ribosomal RNA depletion, since rRNA makes up ~80% of total RNA. These two methods serve the same goal — enriching mRNA — but work differently. Bisulfite is for methylation, immunoprecipitation is for ChIP, and restriction enzymes are for DNA fragmentation.
Q21 Hard
Which RNA-seq normalization metric is most appropriate for comparing gene expression across samples?
A. TPM (Transcripts Per Million)
B. Raw read counts
C. RPKM (Reads Per Kilobase per Million)
D. GC content ratio
Explanation
TPM (Transcripts Per Million) is considered better for comparing gene expression across samples because the total TPM values sum to the same number in each sample. RPKM and FPKM normalize for sequencing depth and gene length but their totals can differ between samples, making cross-sample comparisons less reliable. Raw counts need separate normalization for both depth and length.
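The difference between the two metrics is only the order of the normalization steps, which a small example makes concrete. The gene names, lengths, and counts below are invented; the formulas are the standard definitions (RPKM: depth first, then length; TPM: length first, then rescale to one million).

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    return {
        g: counts[g] / (lengths_bp[g] / 1e3) / (total_reads / 1e6) for g in counts
    }

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: counts[g] / (lengths_bp[g] / 1e3) for g in counts}
    scale = sum(rate.values()) / 1e6
    return {g: r / scale for g, r in rate.items()}

lengths = {"geneA": 1000, "geneB": 2000, "geneC": 7000}   # hypothetical lengths (bp)
sample1 = {"geneA": 100, "geneB": 200, "geneC": 700}      # hypothetical read counts
sample2 = {"geneA": 500, "geneB": 200, "geneC": 300}

for name, sample in (("sample1", sample1), ("sample2", sample2)):
    print(name,
          "TPM total:", round(sum(tpm(sample, lengths).values())),
          "RPKM total:", round(sum(rpkm(sample, lengths).values())))
```

Both samples have the same total TPM, while their RPKM totals differ, which is why a given TPM value means the same share of the transcript pool in every sample.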
Q22 Medium
In RNA-seq, when no reference genome is available, which approach is used?
A. Align-then-assemble using STAR or HISAT2
B. ChIP-seq peak calling
C. De novo transcriptome assembly using tools like Trinity
D. Bisulfite sequencing of cDNA libraries
Explanation
When no reference genome exists (e.g., non-model organisms), reads are assembled into transcripts de novo using tools like Trinity or SOAPdenovo-Trans. This is the "assemble-then-align" approach. Note: de novo assembly works best for the most abundant transcripts. The "align-then-assemble" approach (option A) requires a reference genome.

📝High-Throughput Genotyping & SNP Chips
Q23 Easy
What is the concept behind using "tag SNPs" for genotyping?
A. Tag SNPs are the rarest variants in the genome and thus most informative
B. Due to linkage disequilibrium, a representative subset of SNPs can capture most genetic variation without genotyping every variant
C. Tag SNPs are always located in exonic regions encoding proteins
D. Tag SNPs must have a MAF below 0.01 to be useful
Explanation
Because nearby variants on the same chromosome are inherited together in blocks (linkage disequilibrium), genotyping one representative "tag" SNP per block captures the variation of all other SNPs in that block. This allows researchers to reduce millions of SNPs to tens or hundreds of thousands of tag SNPs, saving cost while retaining most genetic information.
Q24 Medium
A SNP with a Minor Allele Frequency (MAF) of 0.5 indicates:
A. The SNP is monomorphic in the population
B. Only one individual carries the minor allele
C. The SNP is likely a sequencing error
D. Both alleles are present at equal frequency (maximum informativeness)
Explanation
MAF = 0.5 means the two alleles are equally frequent (50%/50%). This is the most informative state for a SNP because there is maximum chance that any two individuals will differ at that position. A higher MAF = more informative for detecting genetic differences. Monomorphic would mean MAF = 0 (only one allele present).
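The link between MAF and informativeness can be checked numerically. The sketch below computes MAF from hypothetical genotype counts and uses expected heterozygosity, 2pq, as a simple proxy for informativeness (the choice of 2pq as the measure is an assumption made for this illustration).

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """MAF from counts of AA, AB, and BB genotypes."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1 - freq_a)

def expected_heterozygosity(maf):
    """2pq: the chance that two randomly drawn alleles differ."""
    return 2 * maf * (1 - maf)

print(minor_allele_frequency(25, 50, 25))    # balanced SNP
print(minor_allele_frequency(100, 0, 0))     # monomorphic SNP
for maf in (0.01, 0.1, 0.3, 0.5):
    print(maf, round(expected_heterozygosity(maf), 4))
```

Expected heterozygosity rises steadily with MAF and reaches its maximum at MAF = 0.5.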
Q25 Medium
In the Illumina Infinium array, how is a SNP genotype determined?
A. By hybridization of sample DNA to multiple overlapping probes
B. By restriction enzyme digestion and fragment length analysis
C. By single-base extension of a probe that stops just before the SNP, followed by fluorescence detection
D. By PCR amplification and gel electrophoresis
Explanation
The Infinium (Illumina) array uses single-base extension: a probe on a glass slide matches the DNA up to just before the SNP position. DNA polymerase extends by one fluorescently labeled nucleotide corresponding to the SNP allele. A camera detects the color — one color = homozygous, mixed colors = heterozygous. Option A describes the Affymetrix approach.
Q26 Tricky
What is the key difference between Infinium (Illumina) and Affymetrix genotyping arrays?
A. Infinium uses single-base extension; Affymetrix uses hybridization of sample DNA to multiple probes
B. Infinium is based on hybridization capture; Affymetrix uses bisulfite conversion
C. Infinium can only genotype 100 SNPs; Affymetrix handles millions
D. They are identical technologies from different manufacturers
Explanation
Infinium (Illumina) uses single-base extension: one probe per SNP, extending by one fluorescent nucleotide. Affymetrix uses hybridization: multiple overlapping probes per SNP, detecting differential binding. Infinium is generally considered more robust and accurate; Affymetrix is more dependent on probe design and hybridization conditions. Both handle large numbers of SNPs.
Q27 Medium
In GenomeStudio's Genoplot, what do the axes Norm R and Norm Theta represent?
A. Norm R = allele frequency; Norm Theta = sequencing depth
B. Norm R = signal intensity; Norm Theta = allele frequency (balance between alleles)
C. Norm R = mapping quality; Norm Theta = GC content
D. Norm R = chromosome position; Norm Theta = p-value
Explanation
In GenomeStudio's Genoplot: Norm R represents signal intensity (how strong the overall signal is) and Norm Theta represents allele frequency (the balance between the two alleles). Dots are color-coded: Red = AA homozygous, Blue = BB homozygous, Purple = AB heterozygous. These clusters make genotype calling visual and intuitive.
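To see why the three genotypes separate along the Theta axis, the sketch below applies the polar transformation commonly described for two-colour array data: R = X + Y and Theta = (2/π)·arctan(Y/X), where X and Y are the normalized intensities for alleles A and B. This form is an assumption for illustration; GenomeStudio's own normalization steps are not reproduced, and the intensity values are invented.

```python
from math import atan2, pi

def norm_r_theta(x, y):
    """Polar coordinates of a two-allele intensity pair (assumed transformation)."""
    r = x + y                         # total signal intensity
    theta = (2 / pi) * atan2(y, x)    # 0 = only allele A, 1 = only allele B
    return r, theta

# Hypothetical normalized intensities (X for allele A, Y for allele B).
samples = {"AA-like": (1.0, 0.05), "AB-like": (0.6, 0.6), "BB-like": (0.04, 1.1)}
for label, (x, y) in samples.items():
    r, theta = norm_r_theta(x, y)
    print(f"{label}: R={r:.2f}, Theta={theta:.2f}")
```

Under this transformation, AA samples cluster near Theta = 0, AB near 0.5, and BB near 1, while R reflects how strong the overall signal is.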
Q28 Hard
A SNP consistently deviates from Hardy-Weinberg Equilibrium across all populations tested. This most likely indicates:
A. Strong natural selection acting on that locus in all populations
B. High inbreeding in every tested population
C. A technical problem with the genotyping assay (e.g., poor probe design or repetitive region)
D. The SNP has a MAF of exactly 0.5
Explanation
If a SNP is out of HWE in all populations, it's a strong signal of a technical problem — poor probe design, location in a repetitive/duplicated region, or assay chemistry issues. Biological causes (selection, inbreeding) would typically affect only some populations. If the HWE deviation is population-specific, then biological explanations become more plausible.
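A basic version of this check is a chi-squared goodness-of-fit test per population, followed by asking whether the SNP fails everywhere. The sketch below uses the standard 1-degree-of-freedom critical value of 3.84 (α = 0.05); the population names and genotype counts are invented, and an exact test would be preferable for small samples.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-squared statistic for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

def fails_hwe_everywhere(counts_by_population, critical_value=3.84):
    """True if the SNP deviates from HWE in every tested population."""
    return all(
        hwe_chi_square(*counts) > critical_value
        for counts in counts_by_population.values()
    )

# Hypothetical (AA, AB, BB) counts for one SNP in three populations.
suspicious_snp = {"pop1": (50, 0, 50), "pop2": (40, 5, 55), "pop3": (60, 2, 38)}
print(fails_hwe_everywhere(suspicious_snp))
```

A SNP that is flagged in every population is a candidate for a technical problem; one flagged in a single population deserves a closer biological look first.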
Q29 Medium
In the pig 60K SNP chip study, why were Reduced Representation Libraries (RRL) used?
A. To sequence the entire pig genome at high coverage
B. To enrich repetitive DNA elements for mapping
C. To amplify only exonic regions of the pig genome
D. To reduce sequencing effort by focusing on a non-repetitive subset of the genome using restriction enzymes and size selection
Explanation
RRL uses restriction enzymes (e.g., AluI) to cut genomic DNA, followed by size selection via gel electrophoresis. This focuses sequencing on a representative, non-repetitive subset of the genome. Repetitive regions (like SINEs) are avoided because SNPs there can't be reliably mapped to a single location, making them useless for genotyping.
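The two RRL steps, digestion and size selection, can be mimicked in silico. The sketch below cuts at the AluI recognition site AG^CT (a blunt cut between G and C) and keeps fragments within a size window; the toy sequence and the size window are invented and far smaller than anything used in practice.

```python
def digest(sequence, site="AGCT", cut_offset=2):
    """Cut a sequence at every occurrence of a restriction site.

    cut_offset is the position of the cut within the site (2 for AG^CT).
    """
    fragments, start = [], 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments

def size_select(fragments, min_len, max_len):
    """Keep fragments within the chosen size window (the gel-excision step)."""
    return [f for f in fragments if min_len <= len(f) <= max_len]

fragments = digest("TTAGCTGGGGAGCTA")
selected = size_select(fragments, min_len=5, max_len=10)
print(fragments)
print(selected)
```

Because the same enzyme cuts at the same sites in every individual, the selected fragments represent the same subset of the genome across samples.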
Q30 Tricky
Why might a SNP on a genotyping chip fail to show two alleles (appearing monomorphic when it shouldn't be)?
A. The SNP may be in a repetitive region, the reference genome may be misassembled, or the assay chemistry may have failed
B. The population has too much genetic diversity
C. The MAF of 0.5 makes the alleles invisible to the chip
D. Pool-seq was used instead of individual genotyping
Explanation
Technical reasons for SNP genotyping failure include: (1) the SNP is in a repetitive region causing ambiguous probe binding; (2) the reference genome has the SNP in a misassembled or unassigned contig; (3) the assay chemistry fails. High MAF (option C) would actually make a SNP easier to detect, not harder. Too much diversity (option B) wouldn't cause monomorphism.
Q31 Medium
What minimum sequencing depth is recommended for reliable genotype calling from NGS data?
A.
B. 100× or more
C. 10×
D.
Explanation
For confident genotype calling from NGS, ~100× coverage or more is recommended. At low depth (e.g., 10×), a heterozygous position might appear homozygous due to random sampling: by chance, one allele may be covered by too few reads, or none at all, to be called. SNP chips are more robust because each SNP is measured with many replicate copies of its probe, providing built-in redundancy.
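The sampling problem can be quantified with a binomial model. The sketch below assumes each read at a heterozygous site comes from either allele with probability 0.5 and ignores sequencing errors and mapping bias; `min_reads_per_allele` is a hypothetical calling threshold, not a value from the lecture.

```python
from math import comb

def prob_het_missed(depth, min_reads_per_allele=1):
    """P(at least one allele has fewer than min_reads_per_allele reads),
    assuming each read samples either allele with probability 0.5."""
    lo = min_reads_per_allele
    hi = depth - min_reads_per_allele
    # Count outcomes in which both alleles reach the threshold.
    callable_outcomes = sum(comb(depth, x) for x in range(lo, hi + 1))
    return 1 - callable_outcomes / 2**depth

for depth in (10, 30, 100):
    print(depth, prob_het_missed(depth, min_reads_per_allele=3))
```

With a requirement of at least three reads per allele, roughly one in nine heterozygous sites would be missed at 10×, while the risk becomes negligible at high depth.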
Q32 Medium
Which NGS-based genotyping method does NOT require a reference genome?
A. Hybridization-based enrichment (exome sequencing)
B. SNP chip genotyping
C. RAD-seq (Restriction-site Associated DNA sequencing)
D. Whole-genome resequencing
Explanation
RAD-seq (and GBS) can work without a reference genome, making them ideal for non-model species. They use restriction enzymes to create reproducible genomic fragments across individuals. Exome sequencing requires probe design based on a reference, SNP chips need mapped positions, and WGS resequencing needs a reference for alignment. Amplicon sequencing also doesn't require a full reference but needs known target sequences.
Q33 Easy
What distinguishes genomic selection from marker-assisted selection (MAS)?
A. Genomic selection uses thousands of genome-wide markers; MAS uses a few specific markers linked to traits
B. Genomic selection is only for plants; MAS is only for animals
C. MAS requires whole-genome sequencing; genomic selection does not
D. There is no difference; they are the same method
Explanation
Genomic selection uses thousands to hundreds of thousands of markers genome-wide to predict an individual's total genetic potential for complex traits (many genes, small effects). MAS targets a few specific markers linked to traits controlled by major genes. Genomic selection enables early-life prediction and faster breeding cycles; MAS is simpler but limited to well-characterized traits.
Q34 Medium
Copy Number Variations (CNVs) are defined as:
A. Single nucleotide changes scattered randomly throughout the genome
B. Segments of DNA smaller than 100 bp that are duplicated
C. Insertions of transposable elements at random positions
D. DNA segments ≥1 kb that vary in copy number compared to a reference genome, typically occurring as tandem repeats
Explanation
CNVs are segments of DNA ≥1 kb (kilobase) that vary in copy number between individuals compared to a reference. They typically occur as tandem repeats (adjacent copies on the same haplotype), not as dispersed elements. Despite being fewer in number than SNPs, CNVs contribute more total nucleotide variation and have been linked to many traits and diseases.
Q35 Hard
In inbreeding studies, why might a single incorrectly called heterozygous SNP within a long homozygous stretch be problematic?
A. It would falsely increase the estimated MAF of the region
B. It would break a run of homozygosity (ROH), leading to underestimation of inbreeding
C. It would cause the entire chromosome to be excluded from analysis
D. It would cause HWE violation in the entire population
Explanation
In inbreeding analysis, researchers look for long runs of homozygosity (ROH). A single falsely heterozygous SNP due to technical noise would split a long ROH into two shorter ones (or eliminate it), leading to incorrect conclusions about the degree of inbreeding. This is why such problematic SNPs should be identified and removed from the dataset.
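The effect of one miscalled heterozygote is easy to demonstrate with a toy ROH scan. The sketch below measures runs in number of consecutive homozygous SNPs and tolerates no heterozygotes; real ROH tools typically work with physical length in bp and may allow a small number of heterozygous calls. The genotype series and the `min_snps` threshold are invented.

```python
def runs_of_homozygosity(genotypes, min_snps):
    """Return (start, end) index pairs of homozygous runs with >= min_snps SNPs."""
    runs, start = [], None
    for i, gt in enumerate(genotypes):
        is_hom = gt in ("AA", "BB")
        if is_hom and start is None:
            start = i                       # a new run begins
        elif not is_hom and start is not None:
            if i - start >= min_snps:       # the run ended at i - 1
                runs.append((start, i - 1))
            start = None
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes) - 1))
    return runs

clean = ["AA"] * 20
noisy = clean.copy()
noisy[10] = "AB"   # one falsely heterozygous call in the middle

print(runs_of_homozygosity(clean, min_snps=15))
print(runs_of_homozygosity(noisy, min_snps=15))
```

The clean series yields one long run, while the single false heterozygote leaves two fragments that are both too short to count, so the ROH disappears entirely.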
Q36 Tricky
In the pig 60K SNP chip design, starting from 2.6 million initial SNPs, approximately how many were selected for the final chip?
A. 60,000
B. 2,600,000
C. 1,000,000
D. 106,000
Explanation
The pipeline was: 2.6M initial SNPs → filtered to ~106,000 → selected 60,000 for the final chip. Option D (106,000) was the intermediate step after quality filtering for MAF, mapping quality, etc. The final product is the "60K SNP chip" with 60,000 highly informative SNPs. This multi-step filtering is essential because not all discovered SNPs are suitable for genotyping.

📝Open Questions — Lecture 9
Q37 — Open Short Answer
Describe the Pool-seq approach. What type of information does it provide, what are its main limitations, and when is it preferred over individual whole-genome sequencing?
✓ Model Answer

Approach: Pool-seq involves combining DNA from multiple individuals into a single pool (equimolar amounts), preparing one library, and performing whole-genome sequencing. Reads are mapped to a reference genome and allele frequencies are estimated at each variant position.

Information provided: Population-level allele frequency estimates for SNPs across the genome. It enables comparison of allele frequencies between groups (e.g., using FST).

Limitations: (1) No individual genotype data — variants cannot be traced to specific individuals. (2) Hard to detect rare variants (low-frequency alleles lost in noise). (3) Haplotype phasing is impossible. (4) Bias from unequal DNA input can distort results. (5) Not suitable for clinical diagnostics.

When preferred: When comparing populations or extreme phenotype groups (e.g., red vs. yellow canaries, healthy vs. diseased), when budget limits individual sequencing, and when the goal is allele frequency estimation rather than individual-level genotyping.

Q38 — Open Short Answer
What is the primary purpose of ChIP-seq? Describe its main steps and explain how binding sites are identified from the data.
✓ Model Answer

Purpose: ChIP-seq identifies genome-wide binding sites of DNA-associated proteins (e.g., transcription factors, histones carrying specific modifications) to understand gene regulation.

Steps: (1) Crosslink proteins to DNA using formaldehyde. (2) Fragment chromatin by sonication or enzymatic digestion. (3) Immunoprecipitate protein-DNA complexes using a specific antibody against the protein of interest. (4) Reverse crosslinks and extract the captured DNA. (5) Sequence the DNA using NGS.

Identifying binding sites: Sequenced reads are aligned to a reference genome. Regions with significantly enriched read coverage (peaks) indicate where the protein was bound. Peak calling algorithms identify these enriched regions. Peaks can be annotated to determine overlap with promoters, enhancers, or other regulatory elements, revealing which genes the protein regulates.
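The idea of peak calling can be sketched as counting aligned reads per genomic bin and flagging bins that are strongly enriched over the average. This is a deliberately simplified stand-in: real peak callers use statistical models and control samples, and the bin size, fold threshold, and read positions here are all hypothetical.

```python
from collections import Counter

def call_enriched_bins(read_positions, bin_size, fold_threshold):
    """Return start coordinates of bins whose read count is at least
    fold_threshold times the mean count per bin."""
    if not read_positions:
        return []
    counts = Counter(pos // bin_size for pos in read_positions)
    n_bins = max(counts) + 1
    background = len(read_positions) / n_bins   # mean reads per bin
    return sorted(
        b * bin_size for b, c in counts.items() if c >= fold_threshold * background
    )

# Uniform background: one read every 100 bp over 10 kb,
# plus 90 extra reads piled up around position 5,200 (a simulated binding site).
reads = list(range(0, 10_000, 100)) + [5_200] * 90
peaks = call_enriched_bins(reads, bin_size=1_000, fold_threshold=3)
print(peaks)
```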

Q39 — Open Calculation
A whole-exome sequencing experiment produces 50 million read pairs (paired-end, 2 × 150 bp). The target exome size is 50 Mb. What is the average sequencing depth of the exome? Is this sufficient for reliable variant calling?
✓ Model Answer
Total bases sequenced = 50,000,000 read pairs × 2 reads per pair × 150 bp = 15,000,000,000 bp = 15 Gb
Exome size = 50 Mb = 50,000,000 bp
Average depth = 15,000,000,000 / 50,000,000 = 300×

The average exome depth is 300×. This is well above the recommended ~100× for confident genotype calling from NGS data. However, this is an ideal calculation — in practice, not all reads will map on-target (capture efficiency is typically 60–80%), so effective depth would be lower but still sufficient.
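The same calculation, including the effect of capture efficiency, can be written as a small helper. The function name and the `on_target_fraction` parameter are introduced here for illustration; the 60% and 80% values are the range quoted above.

```python
def mean_depth(read_pairs, read_length_bp, target_size_bp, on_target_fraction=1.0):
    """Average depth over the target = on-target bases / target size."""
    total_bases = read_pairs * 2 * read_length_bp   # two reads per pair
    return total_bases * on_target_fraction / target_size_bp

ideal = mean_depth(50_000_000, 150, 50_000_000)
low = mean_depth(50_000_000, 150, 50_000_000, on_target_fraction=0.6)
high = mean_depth(50_000_000, 150, 50_000_000, on_target_fraction=0.8)
print(f"ideal: {ideal:.0f}x, effective: {low:.0f}x to {high:.0f}x")
```

Even at the lower end of the capture efficiency range, the effective depth stays above the ~100× guideline.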

Q40 — Open Short Answer
Compare the four main strategies for finding disease-causing rare variants using exome sequencing: (1) filtering across unrelated individuals, (2) family-based segregation, (3) de novo trio analysis, and (4) extreme phenotype sequencing.
✓ Model Answer

(1) Unrelated affected individuals: Sequence exomes of multiple unrelated patients with the same disease. Apply discrete filtering to remove common variants (dbSNP, 1000 Genomes, gnomAD). Look for novel/rare variants shared across affected individuals in the same gene. Powerful for rare Mendelian disorders because ~98% of exome variants are already known and can be filtered out, leaving only ~2% as novel candidates.

(2) Family-based segregation: Sequence affected and unaffected family members. Identify variants that co-segregate with the disease (present in all affected, absent in unaffected). Increases confidence that the variant tracks with the phenotype across generations. Used for dominant or recessive trait mapping.

(3) De novo trio analysis: Sequence the child and both healthy parents. Remove all shared/inherited variants and common variants. What remains are novel, de novo mutations unique to the child — strong candidates for ultra-rare syndromes with unclear inheritance.

(4) Extreme phenotype sequencing: For quantitative traits, select individuals at phenotypic extremes (e.g., tallest vs. shortest). Rare causative variants are enriched at the extremes. Can combine with Pool-seq to reduce costs. Used for height, BMI, fertility, and other continuous traits.

Q41 — Open Tricky
You are designing a custom SNP genotyping chip for a livestock species. After genotyping your first batch of samples, you notice that several SNPs violate Hardy-Weinberg Equilibrium. How would you determine whether this is a technical problem or reflects real biology? What would you do for version 2 of the chip?
✓ Model Answer

Diagnosing the cause:

(1) Check across populations: If the same SNP is out of HWE in all populations → likely a technical issue (poor probe, repetitive region). If only in some populations → may reflect biology (inbreeding, selection, population structure).

(2) Examine the genotype clustering plot: Good clustering (three clearly separated groups) → SNP is reliable, deviation may be biological. Poor/noisy clustering or missing clusters → technical failure.

(3) Consider genomic context: Is the SNP in a repetitive region or near a CNV? These locations cause unreliable probe binding.

(4) Adjust software parameters: Try tuning clustering thresholds in GenomeStudio to see if genotype calls improve.

For version 2: Flag persistently problematic SNPs. Remove those that consistently fail HWE across all populations or show poor clustering. Replace them with new informative SNPs from better-characterized regions. Keep SNPs with biologically explainable HWE deviations if the clustering is clean.

Q42 — Open Short Answer
Compare SNP chip-based genotyping with NGS-based genotyping (e.g., GBS or RAD-seq). Discuss their requirements, advantages, and typical use cases.
✓ Model Answer

SNP chips: Require a reference genome and a pre-designed set of SNPs. Provide fixed, high-accuracy genotyping with built-in probe redundancy, making them robust even with poor DNA quality. Cost-effective for large samples with established markers. Best for: GWAS, genomic selection, parentage testing in well-studied species. Limitation: only genotype pre-selected SNPs; cannot discover new variants.

NGS-based (GBS/RAD-seq): Can work without a reference genome (restriction enzymes create reproducible fragments). Enable simultaneous SNP discovery and genotyping. Cost-effective per locus but require higher sequencing depth (~100×) for confident genotype calls. Best for: population genomics in non-model organisms, diversity studies, evolutionary studies. Limitation: higher risk of genotyping errors at low depth; computationally intensive.

Key trade-off: SNP chips are more reliable and standardized; NGS-based methods are more flexible and can discover novel variation. The choice depends on the species (model vs. non-model), available resources, and whether discovery of new variants is needed.