📝 Exam Simulation — Version B

📝 Applied Genomics — Final Exam Simulation B
Q1 Easy
What does "equimolar DNA pool" mean?
A. DNA pooled from individuals of the same species only
B. DNA mixed with equal volumes regardless of concentration
C. DNA pooled with equal amounts from each individual
D. DNA fragmented into equal-length pieces
Explanation
Equimolar means equal molar quantities of DNA from each individual. This ensures equal genomic representation in the pool so that allele frequencies estimated from read counts are proportional to the true population frequencies. If one individual contributes more DNA, its genome is over-represented and biases frequency estimation.
Q2 Medium
In a VCF file, the genotype notation "1|2" indicates:
A. Homozygous for the first alternative allele
B. Unphased heterozygous genotype
C. Missing genotype data
D. Phased heterozygous with two different alternative alleles
Explanation
The pipe "|" indicates phased data (as opposed to "/" for unphased). "0" = reference allele, "1" = first alternative, "2" = second alternative. So "1|2" means one chromosome carries ALT1 and the other carries ALT2 — a phased heterozygous genotype with two different alternative alleles.
Q3 Easy
Which sequencing technology uses fluorescently labeled reversible terminators?
A. Ion Torrent
B. Illumina
C. PacBio
D. Sanger
Explanation
Illumina sequencing by synthesis uses fluorescently labeled reversible terminators. Each nucleotide has a fluorescent dye and a 3' blocking group. After incorporation, the cluster is imaged, then the dye and block are cleaved. Sanger uses irreversible dideoxy terminators; Ion Torrent detects H⁺ ions; PacBio uses real-time detection of fluorescent nucleotides.
Q4 Medium
What is the main advantage of mate-pair libraries over standard paired-end libraries?
A. Larger insert sizes (up to ~5 kb), helping scaffold across repeats
B. Higher sequencing accuracy
C. Lower DNA input requirement
D. Simpler library preparation protocol
Explanation
Mate-pair libraries use biotinylation and circularization of large fragments (up to ~5 kb) to sequence the ends of distant genomic regions. This larger insert size helps link contigs across repetitive regions during genome assembly. Standard paired-end libraries have inserts of ~200–800 bp. The mate-pair protocol is actually more complex, not simpler.
Q5 Medium
The FLAG field in a SAM file is found in:
A. Column 1
B. Column 2
C. Column 6
D. Column 10
Explanation
In SAM format: column 1 = read name (QNAME), column 2 = FLAG (bitwise flags indicating properties like primary/secondary alignment, unmapped, mate unmapped, strand), column 3 = reference name, column 4 = position, column 5 = MAPQ, column 6 = CIGAR, etc. The FLAG field encodes multiple properties as a single integer using bit flags.
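As an illustration of how one integer can carry several properties, here is a minimal sketch that decodes a FLAG value into named bits. The bit values follow the SAM specification; the helper name `decode_flag` and the short labels are made up for this example.

```python
# Bit values as defined in the SAM specification; labels are informal.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
print(decode_flag(99))
```

A FLAG of 4 decodes to just `unmapped`, and 0 means none of the listed properties apply.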
Q6 Easy
BUSCO is used to evaluate:
A. Sequencing error rate
B. Read quality scores
C. Genome assembly completeness
D. Population genetic diversity
Explanation
BUSCO (Benchmarking Universal Single-Copy Orthologs) evaluates genome assembly completeness by searching for conserved single-copy genes expected to be present in all organisms of a given lineage. It reports percentages of complete, fragmented, duplicated, and missing orthologs. It is one of the most frequently asked topics on the exam.
Q7 Medium
In a de Bruijn graph used for genome assembly, vertices represent:
A. Complete sequencing reads
B. K-mers themselves
C. Overlapping regions between reads
D. (k−1)-mers
Explanation
In a de Bruijn graph, vertices are (k−1)-mers and edges are k-mers. A k-mer connects vertex X (its prefix of length k−1) to vertex Y (its suffix of length k−1) with a directed edge. The graph is then traversed using an Eulerian path (visiting each edge exactly once) to reconstruct the sequence.
Q8 Medium
Why should duplicate reads be removed before variant calling?
A. They can reinforce sequencing errors as real variants
B. They increase the file size beyond storage limits
C. They cause the reference genome index to fail
D. They originate from mitochondrial contamination
Explanation
Duplicate reads arise from PCR amplification during library preparation. If a read containing a sequencing error is duplicated, the error appears in multiple reads and can be mistakenly called as a true variant. Removing duplicates ensures each molecule is counted once, giving accurate allele frequency estimates.
Q9 Easy
Which technique identifies genome-wide protein–DNA binding sites?
A. Bisulfite sequencing
B. ChIP-seq
C. RNA-seq
D. Exome sequencing
Explanation
ChIP-seq (Chromatin Immunoprecipitation sequencing) identifies DNA regions bound by specific proteins such as transcription factors. The DNA-protein complex is cross-linked, fragmented, immunoprecipitated with an antibody specific to the protein of interest, and the recovered DNA is sequenced. Bisulfite seq detects methylation; RNA-seq measures expression.
Q10 Easy
What does FST measure?
A. Individual inbreeding level
B. Sequencing error rate
C. Linkage disequilibrium decay
D. Genetic differentiation between populations
Explanation
FST (fixation index) is a measure of population differentiation based on allele frequency differences between subpopulations. FST = 0 means no genetic differentiation; FST = 1 means complete fixation of different alleles. It is a key statistic in population genomics for comparing pools or populations and detecting signatures of selection.
Q11 Medium
In GWAS, a genomic inflation factor (λGC) greater than 1 suggests:
A. Population stratification was not properly corrected
B. The sample size is too large
C. All SNPs are in linkage equilibrium
D. No significant associations exist
Explanation
Lambda GC is the ratio of the median observed χ² test statistic to the median expected under the null hypothesis (≈0.455 for a χ² distribution with 1 degree of freedom). λ > 1 indicates systematic inflation of test statistics, typically caused by unaccounted population structure (stratification). This must be corrected, for example by including principal components as covariates in the model.
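The calculation can be sketched in a few lines, assuming the association tests produce 1-degree-of-freedom χ² statistics. The function name `genomic_inflation` is illustrative; the constant is the median of a χ² distribution with 1 df.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Lambda GC: observed median chi-square over the expected null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN
```

A value close to 1 suggests no systematic inflation; a value clearly above 1 is the pattern described above.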
Q12 Easy
What is the C-value?
A. The number of chromosomes in a cell
B. The GC content percentage of a genome
C. The mass of DNA in a haploid chromosome set
D. The coverage depth of sequencing data
Explanation
The C-value represents the amount (mass) of DNA in picograms contained in a haploid chromosome set. It is used to estimate genome size before sequencing by comparing with databases of known C-values from related species. This is measured using techniques like flow cytometry.
Q13 Easy
Oxford Nanopore sequencing detects nucleotides by measuring:
A. Fluorescent emissions from labeled nucleotides
B. Changes in ionic current as DNA passes through a pore
C. pH changes from hydrogen ion release
D. Light emitted during pyrophosphate cleavage
Explanation
Nanopore sequencing passes a single strand of DNA through a biological nanopore embedded in a membrane. As each nucleotide passes through, it causes a characteristic disruption in the ionic current flowing through the pore. This signal is decoded to determine the DNA sequence in real-time, enabling very long reads.
Q14 Easy
Why are highly inbred individuals preferred for de novo genome assembly?
A. They have more transposable elements
B. They produce more DNA per cell
C. They have larger genomes
D. High homozygosity makes read overlap easier
Explanation
Inbred individuals are nearly homozygous at all positions. In a heterozygous individual, reads from the two haplotypes may differ at SNP positions, making it difficult to overlap and connect them during assembly. Homozygous individuals have identical haplotypes, so reads overlap cleanly and contigs extend more easily.
Q15 Medium
The CIGAR string "5M2I8M" means:
A. 5 matches, 2 insertions in the read, 8 matches
B. 5 mismatches, 2 introns, 8 mismatches
C. 5 matches, 2 deletions in the read, 8 matches
D. 5 soft clips, 2 insertions, 8 soft clips
Explanation
CIGAR string operations: M = alignment match (can include both matches and mismatches), I = insertion in the read relative to reference, D = deletion from the reference, S = soft clipping. So "5M2I8M" = 5 aligned bases, then 2 extra bases in the read (insertion), then 8 more aligned bases. The read consumes 15 bases; the reference consumes 13.
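The read/reference bookkeeping can be checked with a short sketch. Which operations consume read or reference bases follows the SAM specification; `cigar_lengths` is a name chosen for this example.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
READ_OPS = set("MIS=X")  # operations that consume bases of the read
REF_OPS = set("MDN=X")   # operations that consume bases of the reference


def cigar_lengths(cigar):
    """Return (read bases consumed, reference bases consumed) for a CIGAR."""
    read_len = ref_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in READ_OPS:
            read_len += n
        if op in REF_OPS:
            ref_len += n
    return read_len, ref_len


print(cigar_lengths("5M2I8M"))  # (15, 13)
```

Swapping the insertion for a deletion ("5M2D8M") reverses the result: 13 read bases against 15 reference bases.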
Q16 Easy
In pool sequencing (Pool-seq), allele frequency is estimated from:
A. The number of individuals in the pool
B. Gel electrophoresis band intensity
C. Read counts supporting each allele at a position
D. Individual genotype calls from GATK
Explanation
In pool sequencing, DNA from multiple individuals is mixed and sequenced together. At any polymorphic position, the proportion of reads carrying each allele approximates the allele frequency in the pool. For example, if 6 out of 10 reads show allele A, the estimated frequency of A is ~60%. This requires equimolar pooling and sufficient sequencing depth.
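A minimal sketch of the read-count estimate, using the 6-out-of-10 example above. The function name and the dictionary input format are assumptions made for illustration; real Pool-seq tools also account for sequencing errors and pool size.

```python
def pool_allele_frequencies(read_counts):
    """Estimate allele frequencies at one position from per-allele read counts."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads cover this position")
    return {allele: n / total for allele, n in read_counts.items()}


freqs = pool_allele_frequencies({"A": 6, "G": 4})
print(freqs)
```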
Q17 Medium
In a GWAS QQ plot, what does early deviation of observed from expected p-values indicate?
A. The Bonferroni threshold is too lenient
B. Systematic bias, likely from population stratification
C. The study has perfect statistical power
D. All tested SNPs are associated with the trait
Explanation
In a QQ plot, observed −log10(p) values are plotted against expected ones. Under the null hypothesis (no association), points should fall on the diagonal. If points deviate early (across the entire distribution), this indicates systematic inflation — typically from uncorrected population stratification. Late deviation only at the tail suggests true associations.
Q18 Medium
The biotin–streptavidin interaction is exploited in which procedures?
A. Sanger sequencing only
B. FastQC quality control
C. De Bruijn graph construction
D. Mate-pair library prep and exome capture
Explanation
Biotin–streptavidin binding is used in mate-pair libraries (biotinylated adapters mark circularized fragment junctions, then streptavidin pulldown selects these fragments) and in hybridization-based exome capture (biotinylated probes hybridize to exonic fragments, then streptavidin beads capture them). Both exploit the extremely strong and specific biotin–streptavidin bond.
Q19 Medium
A pyramid-shaped peak in a Manhattan plot is caused by:
A. LD decay around the causal variant
B. Sequencing errors at that locus
C. A repetitive element in that region
D. Random statistical noise
Explanation
The pyramid or "skyline" shape occurs because SNPs near the causal variant are in strong LD with it and show high significance, while SNPs farther away have decreasing LD and lower significance. This creates a peak that tapers off on both sides. The width of the peak reflects the extent of LD in that genomic region and population.
Q20 Medium
An Eulerian path through a de Bruijn graph exists if the graph contains:
A. No balanced vertices at all
B. Exactly four semibalanced vertices
C. At most two semibalanced vertices
D. Only vertices with in-degree of zero
Explanation
An Eulerian path visits every edge exactly once. It exists in a connected directed graph when at most two vertices are semibalanced (|in-degree − out-degree| = 1) and all other vertices are balanced (in-degree = out-degree). The two semibalanced vertices serve as the start (one extra outgoing edge) and end (one extra incoming edge) of the path. This is the traversal used in de Bruijn graph–based assembly.
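The degree condition can be tested directly on an edge list. This sketch checks only the balanced/semibalanced rule and assumes the graph is connected; the function name is illustrative.

```python
from collections import Counter


def satisfies_eulerian_degrees(edges):
    """Check the degree condition for an Eulerian path (connectivity assumed)."""
    out_deg, in_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    diffs = [out_deg[n] - in_deg[n] for n in nodes]
    if any(abs(d) > 1 for d in diffs):
        return False  # some vertex is neither balanced nor semibalanced
    starts = diffs.count(1)   # one extra outgoing edge
    ends = diffs.count(-1)    # one extra incoming edge
    return (starts, ends) in {(0, 0), (1, 1)}


# Edges for the k-mers ATG, TGC, GCG (k = 3)
print(satisfies_eulerian_degrees([("AT", "TG"), ("TG", "GC"), ("GC", "CG")]))
```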
Q21 Medium
High GC-content regions may have low coverage with Illumina. A potential solution is:
A. Using shorter k-mers in assembly
B. Supplementing with Nanopore sequencing
C. Removing those regions from analysis
D. Increasing the Phred quality threshold
Explanation
Illumina sequencing can fail in GC-rich regions due to PCR amplification bias during library prep. Nanopore sequencing is much less affected by GC content because it does not require PCR amplification and directly reads native DNA. Using a complementary technology resolves coverage gaps.
Q22 Easy
In the additive GWAS model, the genotypes AA, AG, GG (where G is the minor allele) are coded as:
A. 0, 1, 2
B. 1, 2, 3
C. 0, 0, 1
D. −1, 0, 1
Explanation
The additive model codes genotypes by counting the number of minor allele copies: AA = 0 copies of G, AG = 1 copy, GG = 2 copies. This allows linear regression between genotype (0, 1, 2) and phenotype. Each additional copy of the minor allele is assumed to have an equal additive effect on the phenotype.
Q23 Easy
The Bonferroni correction in GWAS adjusts for:
A. Sequencing depth differences
B. Sample size imbalance
C. Linkage disequilibrium between SNPs
D. Multiple testing across many SNPs
Explanation
When testing thousands of SNPs simultaneously, some will appear significant by chance. Bonferroni divides the significance threshold (e.g., α = 0.05) by the number of tests. For 50,000 SNPs: 0.05/50,000 = 10⁻⁶. This is conservative because it assumes tests are independent, while SNPs in LD are correlated. In human GWAS, the genome-wide threshold of 5 × 10⁻⁸ (0.05 divided by roughly one million independent tests) is commonly used instead.
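As a small worked sketch of the correction: the helper names and the SNP identifiers below are hypothetical, and the p-values are made up purely to show the filtering step.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant_snps(p_values, alpha=0.05):
    """Return SNP names whose p-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [snp for snp, p in p_values.items() if p < threshold]


# Hypothetical p-values for three SNPs; threshold is 0.05 / 3
print(significant_snps({"snp_a": 1e-9, "snp_b": 0.02, "snp_c": 0.04}))
```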
Q24 Medium
ROH islands (regions frequently in ROH across many individuals) suggest:
A. Genotyping errors at those loci
B. Random genetic drift only
C. Possible selection pressure favoring homozygosity
D. High sequencing coverage artifacts
Explanation
ROH islands are genomic regions where a high percentage of individuals in a population share runs of homozygosity. This non-random pattern suggests that being homozygous at those positions increases fitness. These regions often harbor genes under selection pressure and can be visualized in Manhattan-style plots showing ROH frequency per genomic position.
Q25 Medium
When genomic DNA is digested with a restriction enzyme, visible bands on a gel are caused by:
A. Coding exon sequences only
B. Repetitive elements producing many fragments of the same size
C. Complete chromosomes migrating together
D. RNA contamination in the sample
Explanation
Random restriction digestion produces a smear of fragment sizes. Visible bands appear because repetitive elements (which have the same sequence repeated throughout the genome) produce many fragments of identical size. In reduced representation library construction, these bands are deliberately avoided to prevent sequencing repetitive DNA.
Q26 Medium
Genotyping by sequencing (GBS) is characterized by:
A. Restriction digestion to sequence a reduced genome fraction across individuals
B. Whole genome sequencing at 30× coverage per individual
C. Using SNP arrays with pre-designed probes
D. Sequencing only mitochondrial DNA
Explanation
GBS uses restriction enzymes to select and sequence the same small fraction of the genome across many individuals. It is cost-effective (~€20–30/sample), does not require a pre-existing SNP chip, and can even work without a reference genome. The sequenced fraction is determined by the restriction enzyme used, and SNPs are identified by comparing sequences across individuals.
Q27 Easy
The CIGAR string is found in which column of a SAM file?
A. Column 2
B. Column 4
C. Column 10
D. Column 6
Explanation
SAM format mandatory columns: 1 = QNAME (read name), 2 = FLAG, 3 = RNAME (reference name), 4 = POS (position), 5 = MAPQ (mapping quality), 6 = CIGAR (alignment summary string), 7–9 = mate information, 10 = SEQ (sequence), 11 = QUAL (quality string).
Q28 Easy
A Phred quality score of Q20 means the base call has:
A. 1 in 10 chance of error (90% accuracy)
B. 1 in 100 chance of error (99% accuracy)
C. 1 in 1,000 chance of error (99.9% accuracy)
D. 1 in 10,000 chance of error (99.99% accuracy)
Explanation
Phred score Q = −10 × log₁₀(P), where P is the error probability. Q10 = 10% error, Q20 = 1% error, Q30 = 0.1% error, Q40 = 0.01% error. So Q20 means 99% accuracy, which is generally considered a minimum acceptable quality for many analyses.
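The formula and its inverse can be written as two small helpers (names chosen for this sketch), which makes it easy to reproduce the Q10 to Q40 table above.

```python
import math


def phred_to_error(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score Q for an error probability P."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error(q))
```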
Q29 Medium
In array CGH, a log₂ ratio below zero between test and reference DNA indicates:
A. A gain of copies in the test sample
B. Equal copy number in both samples
C. A loss (deletion) in the test sample
D. A sequencing error at that position
Explanation
aCGH compares hybridization intensity between test DNA and reference DNA. log₂(test/reference) = 0 means equal copies. log₂ < 0 means the test has fewer copies (loss/deletion). log₂ > 0 means the test has more copies (gain/duplication). The minimum resolution depends on the average probe spacing — at least 3 consecutive probes must show the same ratio.
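A rough sketch of the calling logic, assuming probe log₂ ratios are given in genomic order. The ±0.2 cutoff and the function name are illustrative assumptions, not values from any particular aCGH platform; the three-probe rule is the one stated above.

```python
from itertools import groupby


def call_cnv_segments(log2_ratios, threshold=0.2, min_probes=3):
    """Return (first_probe, last_probe, state) for runs of consistent probes."""
    calls = [
        "gain" if r > threshold else "loss" if r < -threshold else "neutral"
        for r in log2_ratios
    ]
    segments = []
    index = 0
    for state, run in groupby(calls):
        length = len(list(run))
        # Require enough consecutive probes so a single outlier is not called.
        if state != "neutral" and length >= min_probes:
            segments.append((index, index + length - 1, state))
        index += length
    return segments


# Probes 1-3 show a consistent loss; the single high probe 5 is ignored.
print(call_cnv_segments([0.0, -0.9, -1.0, -1.1, 0.05, 0.6, 0.1]))
```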
Q30 Easy
What is the difference between structural and functional genome annotation?
A. Structural identifies gene locations; functional assigns biological roles
B. Structural uses RNA-seq; functional uses DNA-seq
C. They are two names for the same process
D. Structural annotates proteins; functional annotates DNA
Explanation
Structural annotation identifies the positions of genomic features — exons, introns, UTRs, promoters, genes — along the assembled sequence. Functional annotation then assigns biological functions to those features using tools like Gene Ontology, pathway databases, and homology searches. Both are essential steps after de novo genome assembly.
Q31 Medium
Using a smaller k-mer size in de Bruijn graph assembly:
A. Eliminates all sequencing errors
B. Increases the number of unique k-mers
C. Makes the graph impossible to traverse
D. Reduces the fraction of k-mers affected by a single error
Explanation
If a read of length L has one error and k equals L, the single k-mer contains the error, so 100% of k-mers are affected. With smaller k, more k-mers are generated and only those overlapping the error position (at most k of them) contain the error. For example, k=3 on a 10 bp read gives 8 k-mers, and at most 3 contain the error. This is why assemblers often test multiple k-mer sizes and combine results.
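The counting argument can be verified with a short sketch. Positions are 0-based and the function name is an assumption for this example.

```python
def kmers_hit_by_error(read_length, k, error_pos):
    """Return (k-mers containing the error, total k-mers) for one read."""
    total = read_length - k + 1
    # A k-mer starting at s covers positions s .. s + k - 1.
    first_start = max(0, error_pos - k + 1)
    last_start = min(error_pos, read_length - k)
    return last_start - first_start + 1, total


print(kmers_hit_by_error(10, 3, 5))   # (3, 8)
print(kmers_hit_by_error(10, 10, 5))  # (1, 1)
```

An error at the very first or last base of the read falls in only one k-mer, which is why "at most k" is the safe statement.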
Q32 Easy
The Variant Effect Predictor (VEP) is used to:
A. Align reads to a reference genome
B. Predict the biological impact of detected variants
C. Perform genome assembly from raw reads
D. Calculate population allele frequencies
Explanation
VEP (from Ensembl) annotates variants with their predicted biological consequences — e.g., synonymous, missense, stop-gain, splice site, intergenic. It requires matching the VCF chromosome naming convention with the annotation database. VEP and SnpEff are common tools for functional annotation of variants.
Q33 Easy
In SNP array genotyping, "call rate" refers to:
A. The speed of the genotyping instrument
B. The minor allele frequency threshold
C. The percentage of SNPs successfully genotyped
D. The number of samples per chip
Explanation
Call rate is the fraction of SNPs for which a genotype could be reliably determined. A typical call rate is ~98%, meaning for a 10,000-SNP chip, about 9,800 SNPs are genotyped and ~200 fail. Failed calls appear as "0 0" (missing) in the data. Low call rates may indicate poor DNA quality or technical issues.
Q34 Easy
Inbreeding depression refers to:
A. Reduced fitness from increased homozygosity of deleterious alleles
B. Higher heterozygosity in large populations
C. Increased mutation rate in inbred lines
D. Improved assembly quality from homozygosity
Explanation
Inbreeding increases the frequency of homozygous genotypes across the genome — including for deleterious recessive alleles that would normally be masked in the heterozygous state. When these alleles become homozygous, they reduce individual fitness (survival, reproduction). This population-level phenomenon is called inbreeding depression.
Q35 Easy
In a FASTQ file, the third line ("+") serves to:
A. Store the reference genome name
B. Indicate the strand of the read
C. Store alignment coordinates
D. Separate the sequence from quality scores
Explanation
FASTQ format has 4 lines per read: line 1 = header starting with "@", line 2 = nucleotide sequence, line 3 = "+" separator (optionally followed by the header again), line 4 = ASCII-encoded quality scores (one character per base). The "+" line is simply a delimiter between sequence and quality data. FASTQ contains raw reads, not alignments.
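A minimal reader for this four-line layout might look like the sketch below. It assumes unwrapped records (exactly four lines per read) and Phred+33 quality encoding; `read_fastq` is a name made up for the example, not a library function.

```python
import io


def read_fastq(handle, offset=33):
    """Yield (name, sequence, quality scores) from a 4-line-per-read FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # One ASCII character per base; subtract the offset to get Phred.
        yield header[1:], seq, [ord(c) - offset for c in qual]


example = io.StringIO("@read1\nACGT\n+\nII5!\n")
print(list(read_fastq(example)))
```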
Q36 Medium
A reference-guided genome assembly may introduce errors because:
A. It always requires PacBio reads
B. Structural rearrangements in the guide species may misplace contigs
C. It cannot use Illumina data
D. It skips the annotation step
Explanation
In reference-guided assembly, the genome of a related species provides a scaffold for ordering contigs. However, if the guide species has structural rearrangements (inversions, translocations) relative to the target species, contigs may be placed in the wrong order or orientation. This saves computational time but introduces potential errors. De novo assembly avoids this but is more resource-intensive.
Q37 Medium
A SNP with minor allele frequency (MAF) of 0.5 is considered:
A. Monomorphic and uninformative
B. Rare and difficult to detect
C. Maximally informative for population studies
D. Likely a sequencing artifact
Explanation
MAF = 0.5 means both alleles are equally frequent in the population. This provides maximum heterozygosity and thus maximum informativeness for distinguishing individuals and detecting genetic associations. SNP arrays aim to include SNPs with high MAF (ideally ≥0.3) across target populations. A good average MAF for a genotyping panel is around 0.3.
Q38 Easy
In bisulfite sequencing, sodium bisulfite converts:
A. Unmethylated cytosine to uracil (read as thymine)
B. Methylated cytosine to uracil
C. Adenine to guanine
D. Thymine to cytosine
Explanation
Sodium bisulfite treatment converts unmethylated cytosine → uracil → thymine (during PCR), while methylated cytosine (5-methylcytosine) remains unchanged as C. After sequencing and alignment, positions where a reference C reads as T were unmethylated; positions retaining C were methylated. A challenge: distinguishing bisulfite-induced C→T from true C→T SNPs may require parallel sequencing of untreated genomic DNA.
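A toy simulation of the readout logic, assuming a single strand, complete conversion, and no SNPs. Both function names and the 0-based position set are illustrative only.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate the sequenced read: unmethylated C becomes T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, converted):
    """For each reference C, report True if it still reads as C (methylated)."""
    return {
        i: converted[i] == "C"
        for i, base in enumerate(reference)
        if base == "C"
    }


reference = "ACGCT"
read = bisulfite_convert(reference, methylated_positions={3})
print(read, call_methylation(reference, read))
```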
Q39 Medium
Long ROH segments in an individual's genome most likely indicate:
A. Ancient inbreeding many generations ago
B. An admixed population background
C. High sequencing error rate
D. Recent inbreeding (parents closely related)
Explanation
Long ROH segments indicate recent inbreeding because there has been little time for recombination to break them down. Ancient inbreeding produces short ROH (fragments broken by many generations of crossing over). Admixed populations typically show very few ROH. The size distribution of ROH can reconstruct the genetic history of individuals and populations.
Q40 Medium
The minimum resolution of an aCGH system with probes spaced every 10 kb is approximately:
A. 10 kb
B. 30 kb
C. 100 kb
D. 1 kb
Explanation
At least 3 consecutive probes must show the same log₂ ratio shift to reliably call a CNV (otherwise a single probe deviation could be an artifact). Therefore, minimum resolution ≈ 3 × average probe spacing. With probes every 10 kb: resolution ≈ 30 kb. CNVs smaller than 30 kb would be missed with this design.
Q41 Easy
PCR amplification of DNA before sequencing can introduce:
A. Longer read lengths
B. Higher quality scores
C. Amplification biases and duplicate artifacts
D. Better genome coverage uniformity
Explanation
PCR amplification during library preparation can introduce errors through polymerase mistakes, create duplicate molecules from the same template, and preferentially amplify certain fragments (e.g., those with moderate GC content). This is why duplicates are marked/removed during analysis, and why PCR-free library protocols or technologies like Nanopore (no PCR needed) can be advantageous.
Q42 Easy
In a VCF file, the genotype "0/0" represents:
A. Homozygous reference
B. Heterozygous
C. Missing genotype
D. Homozygous alternative
Explanation
VCF genotype encoding: 0 = reference allele, 1 = first alternative, 2 = second alternative. "/" means unphased, "|" means phased. So 0/0 = homozygous reference, 0/1 = heterozygous, 1/1 = homozygous alternative, ./. = missing data. Genotypes are encoded numerically, not with nucleotide letters.
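The encoding rules above can be turned into a small interpreter for the GT field. This is a sketch for diploid genotypes only, and `describe_genotype` is a hypothetical helper rather than part of any VCF library.

```python
import re


def describe_genotype(gt):
    """Describe a diploid VCF GT string such as '0/1' or '1|2'."""
    alleles = re.split(r"[|/]", gt)
    if "." in alleles:
        return "missing"
    phase = "phased" if "|" in gt else "unphased"
    if len(set(alleles)) > 1:
        zygosity = "heterozygous"
    elif alleles[0] == "0":
        zygosity = "homozygous reference"
    else:
        zygosity = "homozygous alternative"
    return f"{phase} {zygosity}"


for gt in ("0/0", "0/1", "1/1", "1|2", "./."):
    print(gt, "->", describe_genotype(gt))
```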
Q43 Medium
The effective population size (Ne) primarily influences:
A. The sequencing error rate
B. The physical size of the genome
C. The cost of DNA extraction
D. The extent of linkage disequilibrium in a population
Explanation
Effective population size strongly influences how far linkage disequilibrium extends. Large Ne (e.g., humans) means weaker drift and more ancestral recombination events, so LD decays rapidly (within a few kb) and denser SNP arrays are needed. Small Ne (e.g., livestock breeds) means LD extends much further (~100 kb), so fewer SNPs suffice. This directly affects the design and cost of genotyping tools.
Q44 Medium
A "tag SNP" in GWAS is useful because it:
A. Is always the causal mutation itself
B. Is in high LD with nearby ungenotyped variants
C. Has the lowest MAF in the population
D. Is located only in coding regions
Explanation
Tag SNPs represent nearby variants through linkage disequilibrium. If a tag SNP and a causal variant have r² ≈ 1, genotyping the tag SNP captures the same information without genotyping the causal variant directly. This is the basis of indirect association in GWAS — the SNP array samples tag SNPs across the genome, and associated tags point to nearby causal regions for fine-mapping.
Q45 Easy
RepeatMasker is used to:
A. Detect single nucleotide variants
B. Assemble contigs into scaffolds
C. Identify and mask repetitive elements in a genome
D. Predict protein structures from DNA
Explanation
RepeatMasker screens genome sequences for interspersed repeats (SINEs, LINEs, DNA transposons, etc.) and low-complexity regions, then replaces them with Ns or lowercase letters. This is essential before: (1) SNP calling — to avoid calling false variants in repetitive regions, (2) read mapping — to prevent multi-mapping artifacts, and (3) CNV detection — to avoid bias from repetitive elements.
Q46 — Open Calculation
A mammalian genome is 2.8 Gbp. You want 40× average coverage using 150 bp reads. How many reads do you need?
✓ Model Answer

Using the coverage formula rearranged to solve for number of reads:

Coverage = (N × L) / G → N = (Coverage × G) / L
N = (40 × 2,800,000,000) / 150
N = 112,000,000,000 / 150
N ≈ 746,666,667 reads ≈ 747 million reads

So approximately 747 million reads of 150 bp are needed to achieve 40× coverage of a 2.8 Gbp genome.
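The same rearranged formula as a quick check, rounding up because a fraction of a read cannot be sequenced. The function name is chosen for this sketch.

```python
import math


def reads_needed(coverage, genome_size_bp, read_length_bp):
    """N = (Coverage x G) / L, rounded up to a whole number of reads."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


print(reads_needed(40, 2_800_000_000, 150))  # 746666667
```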

Q47 — Open Calculation
Given these contig lengths (in kb): 150, 100, 85, 65, 55, 35, 20, 15, 10. Calculate the N50.
✓ Model Answer

Step 1: Contigs are already sorted from largest to smallest: 150, 100, 85, 65, 55, 35, 20, 15, 10 kb.

Total assembly length = 150 + 100 + 85 + 65 + 55 + 35 + 20 + 15 + 10 = 535 kb
Half of total = 535 / 2 = 267.5 kb

Step 2: Cumulative sum from largest:

150 → cumulative = 150 (< 267.5)
150 + 100 = 250 → cumulative = 250 (< 267.5)
250 + 85 = 335 → cumulative = 335 (≥ 267.5) ✓

N50 = 85 kb — the contig that crosses the 50% cumulative threshold.
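The two steps can be sketched as a short function (the name `n50` is just a label for this example); it sorts the contigs itself, so the input order does not matter.

```python
def n50(lengths):
    """Length of the contig at which the cumulative sum reaches half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


print(n50([150, 100, 85, 65, 55, 35, 20, 15, 10]))  # 85
```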

Q48 — Open Short Answer
Describe the de Bruijn graph approach for genome assembly. Include: what k-mers are, how the graph is built, and how the sequence is reconstructed.
✓ Model Answer

K-mers: Substrings of fixed length k extracted by sliding a window across each read. For example, the sequence ATGCG with k=3 produces: ATG, TGC, GCG.

Graph construction: (1) Extract all k-mers from reads and retain unique ones. (2) Vertices represent (k−1)-mers (prefixes and suffixes of k-mers). (3) Edges represent k-mers — each k-mer connects its prefix vertex to its suffix vertex with a directed edge.

Sequence reconstruction: The graph is traversed using an Eulerian path, which visits every edge exactly once. This requires at most two semibalanced vertices (|in-degree − out-degree| = 1). The sequence is reconstructed by writing out the first vertex in full and then appending the last base of each subsequent vertex along the path, since consecutive vertices overlap by k−2 bases.

Challenges: Sequencing errors create false k-mers; repetitive elements cause ambiguous paths (multiple valid Eulerian paths). Using multiple k-mer sizes and paired-end data helps resolve these issues.
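The whole procedure can be illustrated with a toy sketch. It assumes error-free reads from one strand, no repeated (k−1)-mers, and a graph that already satisfies the Eulerian condition; real assemblers handle none of these so simply. All function names are illustrative, and the traversal uses Hierholzer's algorithm, one standard way to find an Eulerian path.

```python
from collections import defaultdict


def kmers(seq, k):
    """Slide a window of length k across the sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_de_bruijn(kmer_list):
    """Vertices are (k-1)-mers; each k-mer is an edge from prefix to suffix."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the vertex with one extra outgoing edge, else at any vertex.
    start = next(
        (u for u, targets in graph.items() if len(targets) - in_deg[u] == 1),
        next(iter(graph)),
    )
    remaining = {u: list(targets) for u, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Unique k-mers -> graph -> Eulerian path -> spelled-out sequence."""
    unique = list(dict.fromkeys(km for read in reads for km in kmers(read, k)))
    path = eulerian_path(build_de_bruijn(unique))
    # First vertex in full, then the last base of each following vertex.
    return path[0] + "".join(node[-1] for node in path[1:])


print(assemble(["ATGC", "TGCG"], 3))  # ATGCG
```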

Q49 — Open Short Answer
What is pool sequencing (Pool-seq)? Explain how it works, what information it provides, and why it is cost-effective.
✓ Model Answer

Definition: Pool sequencing involves mixing equimolar DNA from multiple individuals into a single pool and sequencing the pool together, rather than sequencing each individual separately.

How it works: DNA is extracted from each individual, quantified, and combined in equal amounts (equimolar pooling). The pooled DNA is then used for library preparation and sequenced. Reads map randomly to the reference genome. At any polymorphic position, the proportion of reads carrying each allele approximates the allele frequency in the pooled population.

Information provided: Pool-seq gives population-level allele frequencies at each variant position, not individual genotypes. This allows detection of variants, estimation of allele frequencies, and comparison of frequencies between populations (e.g., using FST).

Cost-effectiveness: Instead of sequencing N individuals separately (cost = N × per-sample cost), only one pooled library is sequenced. For example, comparing two populations of 50 individuals each requires 2 sequencing runs instead of 100. This dramatically reduces cost while preserving population-level variant information.

Q50 — Open Short Answer
Describe the RNA-seq technique. Include: what is sequenced, the two main library preparation strategies (random priming vs. poly-A selection), and two applications.
✓ Model Answer

What is sequenced: RNA (transcripts) is extracted, converted to cDNA, and sequenced. RNA-seq captures the transcriptome — all RNA molecules expressed at the time of sampling.

Library preparation strategies:

(1) Poly-A selection: Mature mRNAs have a poly-A tail. Probes with poly-T sequences capture these mRNAs specifically. This enriches for protein-coding transcripts and excludes rRNA/tRNA.

(2) Random priming: RNA is fragmented and random hexamer primers are used for cDNA synthesis. This captures a broader range of RNAs but may include unwanted rRNA.

Applications:

(1) Gene expression quantification: Comparing transcript abundance between conditions (e.g., healthy vs. diseased tissue) to identify differentially expressed genes.

(2) Genome annotation: RNA-seq data provides evidence of transcribed regions, helping identify gene structures, exon boundaries, and transcript isoforms (alternative splicing) during structural annotation of a new genome assembly.

Q51 — Open Calculation
In a population of 800 individuals, the observed genotype counts are: 320 GG, 400 Gg, 80 gg. Test whether this population is in Hardy–Weinberg equilibrium.
✓ Model Answer
Total individuals = 800, Total alleles = 1,600

Step 1 — Allele frequencies:

G alleles = (2 × 320) + 400 = 1,040 → p = 1,040/1,600 = 0.65
g alleles = (2 × 80) + 400 = 560 → q = 560/1,600 = 0.35

Step 2 — Expected genotype frequencies and counts:

p² = 0.4225 → Expected GG = 0.4225 × 800 = 338
2pq = 0.455 → Expected Gg = 0.455 × 800 = 364
q² = 0.1225 → Expected gg = 0.1225 × 800 = 98

Step 3 — Chi-squared test:

χ² = (320−338)²/338 + (400−364)²/364 + (80−98)²/98
χ² = 324/338 + 1,296/364 + 324/98
χ² = 0.96 + 3.56 + 3.31 = 7.83
Critical value (df=1, α=0.05) = 3.84

Conclusion: χ² = 7.83 > 3.84 → Reject H₀. The population is NOT in Hardy–Weinberg equilibrium. There is an excess of heterozygotes compared to expectations, which could indicate balancing selection or recent admixture.
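The three steps can be reproduced with a short sketch. The function name is illustrative, and the critical value 3.84 (df = 1, α = 0.05) is taken from the answer above. Because the code does not round the intermediate terms, it gives χ² ≈ 7.825 rather than 7.83; the conclusion is the same.

```python
def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Return (p, q, expected counts, chi-square) for a biallelic locus."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, q, expected, chi2


p, q, expected, chi2 = hwe_chi_square(320, 400, 80)
print(p, q, expected, chi2)
print("reject HWE" if chi2 > 3.84 else "consistent with HWE")
```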

Q52 — Open Short Answer
Explain the concept of whole exome sequencing (WES). Describe the hybridization capture method used and why WES is cost-effective compared to whole genome sequencing.
✓ Model Answer

WES concept: Whole exome sequencing targets only the protein-coding regions (exons) of the genome, which constitute approximately 2% of the human genome (~45 Mb vs. ~3 Gb).

Hybridization capture method: (1) Genomic DNA is fragmented. (2) Biotinylated probes complementary to all known exonic sequences are hybridized to the fragments. (3) Streptavidin-coated beads capture biotinylated probe–fragment complexes. (4) Non-exonic fragments are washed away. (5) Captured exonic fragments are eluted and sequenced.

Cost-effectiveness: By sequencing only ~2% of the genome, WES generates much smaller FASTQ files (~45 Gb vs. ~90 Gb for WGS), requires less sequencing output, and enables faster bioinformatic analysis. The trade-off is that regulatory variants in non-coding regions (introns, intergenic regions) are missed. WES is ideal when the hypothesis is that causal variants alter protein-coding sequences.

Q53 — Open Short Answer
Describe the k-mer frequency distribution plot. Explain the three typical regions (left peak, main peak, right tail) and what each represents biologically.
✓ Model Answer

A k-mer frequency distribution plots k-mer multiplicity (x-axis) against the number of distinct k-mers with that multiplicity (y-axis).

Region 1 — Left peak (frequency = 1): K-mers appearing only once. These are mostly derived from sequencing errors — a single nucleotide error creates a k-mer unique to that read. This peak should be excluded when estimating genome size.

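Region 2 — Main peak (central): K-mers appearing at the expected coverage depth. These represent unique (single-copy) genomic sequences. The position of this peak corresponds to the average k-mer coverage depth, which is slightly lower than the per-base sequencing depth. Genome size can be estimated as: G = (total number of k-mer occurrences, i.e. the sum of multiplicity × number of distinct k-mers, with the error peak excluded) / (mean coverage from the main peak).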

Region 3 — Right tail (high frequency): K-mers appearing much more frequently than average. These derive from repetitive elements (SINEs, LINEs, transposons) that are present in many copies throughout the genome. The height and extent of this tail reflect the repeat content of the genome.
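
To make the plot concrete, the following sketch builds a k-mer spectrum from a list of reads and applies the genome-size formula above. It is a simplified illustration under stated assumptions: the function names are hypothetical, the reads are made-up toy strings, reverse-complement (canonical) k-mers are ignored, and the main-peak depth is passed in by hand instead of being detected from the histogram. Dedicated k-mer counters are used on real data.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return {multiplicity: number of distinct k-mers with that multiplicity}."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())


def estimate_genome_size(spectrum, peak_depth, min_multiplicity=2):
    """Total k-mer occurrences (error peak excluded) divided by main-peak depth."""
    total = sum(m * n for m, n in spectrum.items() if m >= min_multiplicity)
    return total / peak_depth


reads = ["ACGTA", "ACGTA", "ACGAA"]  # toy reads, made up for illustration
spectrum = kmer_spectrum(reads, k=3)
for multiplicity in sorted(spectrum):
    # x-axis value, then y-axis value of the k-mer frequency plot
    print(multiplicity, spectrum[multiplicity])
```

The min_multiplicity cutoff plays the role of discarding the left (error) peak; on real data it would be chosen from the valley between the error peak and the main peak.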

Q54 — Open Calculation
You sequenced 300 million reads of 100 bp from a genome estimated to be 1.5 Gbp. Calculate the average sequencing depth (coverage).
✓ Model Answer
Coverage = (N × L) / G
N = 300,000,000 reads
L = 100 bp
G = 1,500,000,000 bp
Coverage = (300,000,000 × 100) / 1,500,000,000
Coverage = 30,000,000,000 / 1,500,000,000 = 20×

The average depth of coverage is 20×, meaning each position in the genome is covered by ~20 reads on average.
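
The same formula can be written as a small helper, together with its rearrangement for planning a run. This is only a sketch; the function names and the 30× target in the second call are illustrative choices, not part of the question.

```python
import math


def coverage(n_reads, read_length_bp, genome_size_bp):
    """Average depth: Coverage = (N x L) / G."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_depth, read_length_bp, genome_size_bp):
    """Rearranged formula: N = (Coverage x G) / L, rounded up to whole reads."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


print(coverage(300_000_000, 100, 1_500_000_000))  # values from the question
print(reads_needed(30, 100, 1_500_000_000))       # hypothetical 30x target
```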

Q55 — Open Short Answer
What is population stratification in GWAS? Explain how it causes false positives and how multidimensional scaling (MDS) or PCA is used to address it.
✓ Model Answer

Population stratification: When a GWAS sample includes individuals from genetically distinct subpopulations that also differ in phenotype, allele frequency differences between subpopulations are confounded with phenotype differences — creating false positive associations.

Example: If wild mice (low body weight, genotype profile A) and laboratory strains (high body weight, genotype profile B) are analyzed together, nearly every SNP differing between strains appears associated with body weight — not because those SNPs cause weight differences, but because they track population membership.

MDS/PCA solution: MDS or PCA compresses genome-wide genotype data into a few components that capture population structure. Each individual gets component scores, and individuals from the same subpopulation cluster together. These components are then included as covariates in the GWAS statistical model to correct for ancestry differences. After correction, associations that merely track population membership lose significance, so the remaining signals are much more likely to reflect true genotype–phenotype associations.

Verification: The QQ plot and λGC metric are used to confirm that stratification has been properly corrected (λ ≈ 1 indicates no inflation).
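
As an illustration of the verification step, the sketch below computes λGC from a list of association p-values. It assumes that each p-value comes from a 1-degree-of-freedom test, so it can be converted back to a chi-squared statistic through the normal quantile function; the function name genomic_inflation is hypothetical, and the two p-value lists are synthetic, not real GWAS output.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Lambda_GC: median observed 1-df chi-squared / expected median under H0."""
    norm = NormalDist()
    # For a 1-df test, chi2 = z^2 where z is the normal quantile of p/2.
    # Each p-value must lie in (0, 1].
    chi2 = [norm.inv_cdf(p / 2) ** 2 for p in p_values]
    expected_median = norm.inv_cdf(0.25) ** 2  # about 0.455
    return median(chi2) / expected_median


# Synthetic p-values: an evenly spread (null-like) set and an inflated set
null_like = [(i + 0.5) / 1000 for i in range(1000)]
inflated = [p ** 2 for p in null_like]

print(round(genomic_inflation(null_like), 2))
print(round(genomic_inflation(inflated), 2))
```

A value close to 1 for the first list and clearly above 1 for the second matches the interpretation given above: inflation of the median test statistic signals uncorrected stratification.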

Q56 — Open Short Answer
Describe the complete variant discovery pipeline from FASTQ files to annotated VCF. List each major step, the file format at each stage, and one key tool for each step.
✓ Model Answer

Step 1 — Quality Control: Raw reads (FASTQ) → Quality assessment with FastQC → Evaluate per-base quality, adapter content, GC distribution.

Step 2 — Trimming: FASTQ → Trimmed FASTQ using Trimmomatic or fastp → Remove low-quality bases and adapter sequences. Re-run FastQC to confirm improvement.

Step 3 — Alignment: Trimmed FASTQ + Reference genome (FASTA) → SAM file using BWA (Burrows-Wheeler Aligner) → Reads are mapped to the reference genome. Convert SAM → BAM (binary, compressed) and sort.

Step 4 — Duplicate Removal: Sorted BAM → Deduplicated BAM using Picard MarkDuplicates → Flag (or, optionally, remove) PCR duplicates so that they do not bias variant calling.

Step 5 — Variant Calling: Deduplicated BAM → VCF file using GATK HaplotypeCaller → Identify SNPs and indels. Each variant line includes chromosome, position, REF, ALT, quality score, and sample genotypes.

Step 6 — Variant Annotation: VCF → Annotated VCF using VEP or SnpEff → Each variant is annotated with its predicted biological effect (synonymous, missense, stop-gain, splice-site, etc.) and compared to known variant databases (e.g., dbSNP).
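
To show what the VCF produced in Step 5 contains, the sketch below splits one tab-separated VCF data line into the fields listed there. It is a simplified illustration: the function name parse_vcf_line, the sample names, and the example record are made up, and real pipelines would normally use a dedicated VCF library instead of manual string splitting.

```python
def parse_vcf_line(line, sample_names):
    """Split one VCF data line into its fixed columns and per-sample fields."""
    fields = line.rstrip("\n").split("\t")
    # The 9 fixed columns, in the order defined by the VCF format
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt = fields[:9]
    keys = fmt.split(":")  # e.g. GT:DP
    genotypes = {
        name: dict(zip(keys, sample.split(":")))
        for name, sample in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt.split(","),  # several ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "genotypes": genotypes,
    }


# Made-up record for illustration only
line = "chr1\t12345\t.\tA\tG,T\t50.0\tPASS\tDP=30\tGT:DP\t0/1:14\t1|2:16"
record = parse_vcf_line(line, ["sampleA", "sampleB"])
print(record["chrom"], record["pos"], record["ref"], record["alt"])
print(record["genotypes"]["sampleB"]["GT"])
```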

Q57 — Open Short Answer
Explain how a Manhattan plot is constructed and interpreted in a GWAS. What do the axes represent? What does a "peak" indicate? What determines the significance threshold line?
✓ Model Answer

Axes: The x-axis represents genomic position, with SNPs ordered along each chromosome (chromosomes are shown in alternating colors). The y-axis represents −log₁₀(p-value) from the association test between each SNP and the phenotype. Higher values mean stronger statistical significance.

Construction: For each genotyped SNP, a statistical test (e.g., linear regression with additive model: phenotype ~ genotype + covariates) produces a p-value. Each SNP is plotted as a dot at its chromosomal position (x) and its −log₁₀(p) value (y).

Peaks: A "peak" or cluster of highly significant SNPs indicates a genomic region associated with the trait. The pyramid shape occurs because the causal variant and its neighbors in strong LD all show elevated significance, tapering off as LD decays with distance. The peak width reflects the extent of LD in that region.

Significance threshold: A horizontal line marks the genome-wide significance threshold. Using Bonferroni correction: α/number_of_tests. The conventional threshold is 5 × 10⁻⁸ (−log₁₀ ≈ 7.3), which accounts for approximately 1 million independent tests across the genome. SNPs above this line are considered genome-wide significant associations.
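
The threshold arithmetic can be sketched as follows. The function names are illustrative, α = 0.05 and one million tests are the conventional values quoted above, and the SNP p-values are invented for the example.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


def manhattan_y(p_value):
    """Height of a SNP (or of the threshold line) on the Manhattan plot."""
    return -math.log10(p_value)


threshold = bonferroni_threshold(0.05, 1_000_000)
line_height = manhattan_y(threshold)
print(threshold, round(line_height, 2))

# Invented p-values for three hypothetical SNPs
snps = {"snp_a": 3e-9, "snp_b": 2e-6, "snp_c": 0.04}
significant = [name for name, p in snps.items() if p < threshold]
print(significant)
```

Comparing p < threshold is equivalent to asking whether the SNP's dot lies above the horizontal line on the plot.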

Important note: Associated SNPs are usually tag SNPs in LD with the true causal variant — they are not necessarily the causal mutation itself. Fine-mapping is needed to identify the actual causal variant within the associated region.