Applied Genomics — Final Exam Simulation

📝Final Exam — 45 MCQ + 12 Open Questions
Q1 Easy
An allele is best defined as:
AA segment of DNA that codes for a protein
BOne of two or more alternative forms of a gene at a given locus
CA mutation that alters gene function
DA chromosome region inherited as a block
Explanation
An allele is one of two or more alternative forms of a gene (or a genetic locus) at the same position on a chromosome. Different alleles can produce variation in the trait that the gene controls. This is a fundamental concept in genetics that underpins all genomic analyses.
Q2 Medium
In a De Bruijn graph, the genome assembly problem is solved by finding:
AA Hamiltonian path (visiting every node once)
BThe shortest path between start and end nodes
CAn Eulerian path (visiting every edge once)
DA maximum spanning tree
Explanation
In a De Bruijn graph, nodes are (k−1)-mers and edges are k-mers. Assembly requires finding an Eulerian path, one that traverses every edge exactly once. This is computationally tractable (an Eulerian path can be found in time linear in the number of edges), unlike the Hamiltonian path problem (visiting every node once) that arises in OLC assembly and is NP-complete. This distinction is a frequently tested concept.
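As an illustration of the Eulerian-path idea, here is a minimal sketch, assuming error-free reads and a graph that contains exactly one Eulerian path. The function names (`de_bruijn_edges`, `eulerian_path`, `assemble`) and the toy reads are hypothetical; real assemblers also handle errors, repeats, and both strands.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Return one (prefix, suffix) edge per distinct k-mer found in the reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    # Each k-mer is an edge from its first (k-1)-mer to its last (k-1)-mer.
    return [(kmer[:-1], kmer[1:]) for kmer in sorted(kmers)]

def eulerian_path(edges):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for src, dst in edges:
        graph[src].append(dst)
        balance[src] += 1
        balance[dst] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((node for node, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: node is finished
    return path[::-1]

def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the read k-mers."""
    path = eulerian_path(de_bruijn_edges(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])

# Toy example: two overlapping reads, k = 4.
contig = assemble(["ATGGCGT", "GCGTACA"], 4)
```

With these two toy reads, the seven distinct 4-mers form a single linear path, so every edge is used exactly once and the path spells one contig.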
Q3 Easy
What does a FASTQ file contain?
ASequencing reads and per-base quality scores
BAligned reads and their genomic positions
CVariant calls (SNPs and indels)
DGene annotations and coordinates
Explanation
FASTQ stores raw sequencing reads with per-base quality scores. Each entry has 4 lines: identifier (@), sequence, separator (+), and quality string. BAM/SAM stores alignments, VCF stores variant calls, and GFF/BED store annotations. This is one of the most fundamental file formats in NGS analysis.
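The 4-line record layout can be sketched with a small reader. This is an illustrative parser, not a replacement for an established library; it assumes unwrapped records (exactly four lines each) and Phred+33 quality encoding, which is an assumption not stated above. The names `read_fastq` and `phred_scores` are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality

def phred_scores(quality, offset=33):
    """Decode a quality string into integer scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]

# Two made-up records held in memory instead of a file.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n")
records = list(read_fastq(example))
```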
Q4 Medium
What is the primary goal of a GWAS?
ATo sequence entire genomes of affected individuals
BTo determine the haplotype structure of a population
CTo identify all genes in the genome
DTo find statistical associations between genetic variants and traits
Explanation
GWAS identifies statistical associations between SNPs (or other genetic variants) and phenotypic traits across a population. It does not require sequencing entire genomes: typically, known variant positions are genotyped using SNP arrays. The goal is to link specific genomic regions with traits of interest for downstream biological investigation.
Q5 Easy
The Ion Torrent sequencer detects nucleotide incorporation by measuring:
AFluorescent light emission
BpH changes from H⁺ ion release
CBioluminescent signal from luciferase
DChanges in electrical current through a nanopore
Explanation
Ion Torrent uses semiconductor sequencing. When a nucleotide is incorporated, H⁺ ions are released, causing a pH change detected by an ion-sensitive layer. This is electronic detection — no optics, cameras, or fluorescent labels are needed. Option A describes Illumina, C describes 454 pyrosequencing, and D describes Nanopore.
Q6 Medium
BUSCO evaluates genome assembly quality by assessing:
AGC content uniformity
BScaffold length distribution
CPresence of conserved single-copy orthologs expected for the lineage
DSequencing error rates in consensus bases
Explanation
BUSCO (Benchmarking Universal Single-Copy Orthologs) checks for conserved genes expected in a given lineage. A high BUSCO score indicates a complete assembly. Missing BUSCOs suggest gaps; duplicated BUSCOs may indicate assembly errors or redundancy. BUSCO measures completeness, while N50 measures contiguity — both are needed for comprehensive QC.
Q7 Medium
In a VCF file, the genotype 0|1 indicates:
AA phased heterozygous genotype
BAn unphased heterozygous genotype
CA homozygous alternative genotype
DA missing genotype call
Explanation
The pipe "|" indicates a phased genotype — you know which allele is on which chromosome. The slash "/" indicates unphased (allele assignment to chromosomes is unknown). Both 0|1 and 0/1 are heterozygous, but phasing information differs. 0 = reference allele, 1 = first alternative allele.
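The phased/unphased distinction can be checked mechanically from the separator. The sketch below assumes diploid GT strings and uses the VCF convention that "." marks a missing allele; the function name `describe_genotype` is hypothetical.

```python
def describe_genotype(gt):
    """Classify a diploid VCF GT string such as '0|1' or '0/1'."""
    if "|" in gt:
        phased, alleles = True, gt.split("|")
    else:
        phased, alleles = False, gt.split("/")
    if "." in alleles:
        return {"phased": phased, "zygosity": "missing"}
    # Identical allele indices mean homozygous, different ones heterozygous.
    zygosity = "homozygous" if len(set(alleles)) == 1 else "heterozygous"
    return {"phased": phased, "zygosity": zygosity}
```

For example, `describe_genotype("0|1")` and `describe_genotype("0/1")` both report a heterozygous call and differ only in the `phased` field.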
Q8 Easy
Linkage disequilibrium describes:
AThe mutation rate of genetic markers
BThe non-random association of alleles at different loci
CThe rate of linked contigs during genome assembly
DThe degree of similarity between two populations
Explanation
LD describes the tendency of alleles at different loci to be inherited together more often than expected by chance. It is shaped by recombination, genetic drift, selection, and population history. LD is distinct from physical linkage — even unlinked loci can be in LD due to population structure or recent admixture.
Q9 Medium
ABI SOLiD sequencing uses:
AFluorescent reversible terminators
BSemiconductor pH detection
CPyrosequencing with luciferase
DSequencing by ligation with a two-base encoding system
Explanation
SOLiD (Sequencing by Oligonucleotide Ligation and Detection) uses a ligation-based approach with fluorescently labeled di-base probes. Each base is interrogated twice through five rounds of primer reset, enabling high accuracy (up to 99.99% with ECC). The output is in color-space, requiring conversion to nucleotide sequences.
Q10 Medium
PacBio SMRT sequencing detects nucleotide incorporation using:
ApH changes in semiconductor wells
BCurrent disruptions through a protein pore
CFluorescent signals in zero-mode waveguides (ZMWs)
DBioluminescence from a luciferase reaction
Explanation
PacBio uses ZMWs — tiny wells with a single immobilized polymerase at the bottom. Fluorescently labeled nucleotides are incorporated continuously; the camera detects the color and timing of each incorporation in real time. PacBio can also detect DNA modifications through interpulse duration changes. Typical read lengths are 10–25 kb or more.
Q11 Easy
A SAM/BAM file stores:
ARaw sequencing reads and quality scores only
BSequence alignments to a reference genome
CSNP and indel variant calls
DGenome annotation information
Explanation
SAM (Sequence Alignment/Map) stores reads aligned to a reference, including mapping positions, CIGAR strings, mapping quality, and mate information for paired reads. BAM is the compressed binary version. FASTQ stores raw reads, VCF stores variant calls, and GFF stores annotations.
Q12 Medium
In a Manhattan plot, the vertical axis represents:
A−log₁₀(p-value)
BMinor allele frequency
CEffect size (β)
DPhysical position in base pairs
Explanation
In a Manhattan plot, the X-axis shows chromosomal positions and the Y-axis shows −log₁₀(p-value). This transformation makes more significant associations appear as higher points. A horizontal line typically marks the genome-wide significance threshold at P < 5 × 10⁻⁸.
Q13 Medium
Paired-end sequencing means:
ASequencing two different samples on the same flow cell
BSequencing the same strand twice for error correction
CSequencing with two different chemistries on one library
DSequencing both ends of a DNA fragment
Explanation
Paired-end sequencing reads both ends of a DNA fragment, producing two linked reads per molecule. The middle portion may remain unsequenced, but the known insert size provides positional information that is critical for detecting structural variants, indels, and gene fusions, which are much harder to detect with single-end reads.
Q14 Medium
If an individual has a high degree of inbreeding, what effect does this have on genome assembly?
AIt makes assembly harder due to increased heterozygosity
BIt has no effect on assembly quality
CIt makes assembly easier due to increased homozygosity
DIt requires long-read sequencing exclusively
Explanation
High inbreeding increases homozygosity, which simplifies assembly because there is less allelic variation to confuse the assembler. In highly heterozygous genomes, the assembler may interpret allelic variants as separate genomic regions, causing fragmentation and inflated genome size. This is why inbred lines are preferred for reference genome assembly.
Q15 Medium
Runs of Homozygosity (ROH) in a genome indicate:
ARegions of high mutation rate
BStretches of homozygous genotypes reflecting autozygosity from a common ancestor
CRegions with high recombination rates
DErrors in genotyping array data
Explanation
ROH are long continuous stretches of homozygous genotypes that arise when an individual inherits identical haplotype segments from both parents due to a common ancestor. The total length and number of ROH correlate with the degree of inbreeding — longer ROH indicate more recent inbreeding events.
Q16 Easy
Illumina sequencing generates clusters using:
ABridge amplification on a flow cell
BEmulsion PCR on beads
CRolling circle amplification
DIsothermal strand displacement
Explanation
Illumina uses bridge amplification: single-stranded DNA hybridizes to oligos on the flow cell surface, folds over to bridge with adjacent primers, and is amplified into clonal clusters of ~1,000 copies. Emulsion PCR is used by Ion Torrent, 454, and SOLiD.
Q17 Medium
The CIGAR string "5M2I3M1D4M" in a SAM record means:
A15 bases aligned with no indels
B5 soft-clipped, 2 insertions, 3 matches, 1 deletion, 4 matches
C5 deletions followed by 2 insertions
D5 aligned, 2 inserted in read, 3 aligned, 1 deleted from read, 4 aligned
Explanation
CIGAR operations: M = match/mismatch (aligned), I = insertion in read (bases present in read but not reference), D = deletion (bases present in reference but missing from the read), S = soft clip. Here: 5M (5 aligned) + 2I (2 bases inserted) + 3M (3 aligned) + 1D (1 base deleted) + 4M (4 aligned). Read length = 5+2+3+4 = 14 bases (D does not consume read bases); the alignment spans 5+3+1+4 = 13 reference bases (I does not consume reference bases).
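The same bookkeeping can be sketched in code: each CIGAR operation either consumes read bases, reference bases, or both. The function name `cigar_lengths` is hypothetical; the operation set follows the SAM specification.

```python
import re

def cigar_lengths(cigar):
    """Return (read_length, reference_span) for a CIGAR string."""
    read_length = 0
    reference_span = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in "MIS=X":   # operations that consume read bases
            read_length += n
        if op in "MDN=X":   # operations that consume reference bases
            reference_span += n
    return read_length, reference_span
```

Applied to the string in the question, `cigar_lengths("5M2I3M1D4M")` gives a read length of 14 and a reference span of 13.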
Q18 Medium
BWA-MEM is based on:
AHash table indexing
BSmith-Waterman local alignment
CBurrows-Wheeler Transform
DK-mer frequency counting
Explanation
BWA (Burrows-Wheeler Aligner) uses the Burrows-Wheeler Transform for efficient read alignment. It is the default aligner in many standard pipelines (e.g., GATK best practices). Different aligners can produce substantially different variant calls — one study showed only 24.5% concordance between BWA-MEM and Bowtie2.
Q19 Medium
The genome-wide significance threshold in GWAS is typically:
AP < 0.05
BP < 5 × 10⁻⁸
CP < 1 × 10⁻³
DP < 0.01
Explanation
The widely accepted threshold is P < 5 × 10⁻⁸, derived from correcting for approximately 1 million independent LD blocks across the genome (Pe'er et al., 2008): 0.05 / 10⁶ = 5 × 10⁻⁸. This is a fixed, LD-aware threshold that replaced a per-study Bonferroni correction over every genotyped SNP, which was considered overly conservative because SNPs in LD are not independent tests.
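The arithmetic behind the threshold can be sketched as follows, assuming a family-wise α of 0.05 divided over about 1 million independent tests. The function name and default values are illustrative.

```python
import math

def genome_wide_threshold(alpha=0.05, independent_tests=1_000_000):
    """Divide the family-wise alpha by the number of independent tests."""
    return alpha / independent_tests

threshold = genome_wide_threshold()
# Height of the significance line on a -log10(p) scale.
line_height = -math.log10(threshold)
```

Here `threshold` is approximately 5 × 10⁻⁸ and `line_height` is approximately 7.3.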
Q20 Medium
Structural annotation of a genome refers to:
AIdentifying the positions and structures of genes (exons, introns, UTRs)
BAssigning biological function to predicted genes
CDetermining the 3D structure of encoded proteins
DMeasuring gene expression levels across tissues
Explanation
Structural annotation identifies where genes are located and what they look like (exon-intron boundaries, start/stop codons, UTRs). Functional annotation then assigns biological roles (e.g., enzyme activity, pathway involvement) to those predicted genes. Both are essential steps after genome assembly.
Q21 Medium
MDS (Multidimensional Scaling) in GWAS is used to:
ACalculate p-values for each SNP
BPerform multiple testing correction
CPhase haplotypes from genotype data
DDetect and visualize population structure
Explanation
MDS reduces high-dimensional genotype data into a few dimensions, where each point represents an individual. Clusters on the MDS plot reveal population subgroups. If distinct clusters correlate with case/control status, population stratification is confounding results. MDS components can be included as covariates to correct for this.
Q22 Medium
What is aCGH?
AA chip-based genome resequencing technology
BAn NGS paired-end sequencing approach
CA microarray-based method to identify CNVs
DA method for evaluating chromosomal heterozygosity
Explanation
Array Comparative Genomic Hybridization (aCGH) is a microarray-based method that compares a test genome with a reference genome to detect copy number variations (CNVs) — duplications and deletions. Test and reference DNA are labeled with different fluorescent dyes and co-hybridized to the array.
Q23 Easy
A Phred quality score of Q20 corresponds to a base call accuracy of:
A90%
B99%
C99.9%
D99.99%
Explanation
Q = −10 × log₁₀(e), where e is the probability that the base call is wrong. For Q20: e = 10⁻² = 0.01, meaning 1 error in 100 bases = 99% accuracy. Q10 = 90%, Q30 = 99.9%, Q40 = 99.99%. Illumina typically achieves Q30+, while Ion Torrent averages around Q20.
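A short sketch of the conversion in both directions, using the formula above. The function names are hypothetical.

```python
import math

def phred_to_error(q):
    """Error probability for a Phred score: e = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_to_phred(e):
    """Phred score for an error probability: Q = -10 * log10(e)."""
    return -10 * math.log10(e)

def phred_to_accuracy(q):
    """Base call accuracy as a fraction (1 - e)."""
    return 1 - phred_to_error(q)
```

For instance, `phred_to_accuracy(20)` is approximately 0.99 and `error_to_phred(0.001)` is approximately 30.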
Q24 Medium
In equimolar DNA pooling for Pool-seq, what must be ensured?
AEach individual contributes equal amounts of DNA to the pool
BAll individuals are homozygous at target loci
CThe pool contains only coding sequences
DEach individual is sequenced separately before pooling
Explanation
Equimolar pooling means each individual contributes the same amount of DNA so that allele frequencies in the pool accurately represent the population. Unequal contributions would bias allele frequency estimates. Pool-seq estimates population-level allele frequencies but cannot determine individual genotypes.
Q25 Medium
Hardy-Weinberg Equilibrium assumes all of the following EXCEPT:
ARandom mating
BNo mutation
CLarge population size
DSelection favoring heterozygotes
Explanation
HWE assumes: random mating, no selection, no mutation, no migration, and large population size. Selection (including heterozygote advantage) violates HWE. If a population deviates from HWE, it may indicate selection, non-random mating, population structure, or genotyping errors.
Q26 Medium
The Sanger chain-termination method uses:
AFluorescent reversible terminators added simultaneously
BH⁺ ion detection in semiconductor wells
CDideoxynucleotides (ddNTPs) that terminate chain elongation
DLigation of fluorescent di-base probes
Explanation
Sanger sequencing uses ddNTPs that lack a 3'-OH group, terminating chain elongation when incorporated. Each ddNTP is labeled with a different fluorescent dye. This is classified as first-generation sequencing — producing long, high-accuracy reads (~800–1000 bp) but at low throughput.
Q27 Medium
Illumina SBS has higher accuracy in homopolymer regions than Ion Torrent because:
AIllumina uses a more sensitive camera
BReversible terminators ensure only one base is incorporated per cycle
CIllumina reads are inherently longer
DIllumina uses a two-base encoding system
Explanation
Illumina's reversible terminator chemistry blocks the 3' end after each nucleotide incorporation, ensuring exactly one base is added per cycle. Even in homopolymer runs like AAAA, each A is read in a separate cycle. Ion Torrent flows nucleotides without terminators, so multiple identical bases may incorporate in a single flow, and the base count must be estimated from the signal intensity, which is error-prone.
Q28 Medium
Over-Representation Analysis (ORA) in post-GWAS analysis tests whether:
ASpecific biological functions are enriched in GWAS-identified genes
BSNPs are in Hardy-Weinberg Equilibrium
CPopulation stratification has been corrected
DThe genotyping call rate exceeds 95%
Explanation
ORA determines whether certain biological pathways or Gene Ontology (GO) terms are more frequently represented in GWAS-identified genes than expected by chance. Tools like DAVID and EnrichR perform this analysis. It helps translate lists of candidate genes into biologically meaningful insights about the trait.
Q29 Medium
What does FST measure?
AThe rate of mutation between two loci
BThe inbreeding coefficient of an individual
CGenetic differentiation between populations
DThe proportion of missing genotypes
Explanation
FST (Fixation Index) measures the proportion of genetic variance found between populations relative to the total variance. FST = 0 means no differentiation (same allele frequencies); FST = 1 means complete fixation of different alleles. It is widely used in population genomics and was used in the Pool-seq study of red vs. yellow canaries.
Q30 Easy
ChIP-seq identifies:
ACopy number variations
BMethylation patterns at single-base resolution
CmRNA expression levels
DGenome-wide protein-DNA binding sites
Explanation
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) identifies where proteins (e.g., transcription factors, histones) bind to DNA across the genome. The process involves crosslinking, fragmentation, immunoprecipitation with a specific antibody, and sequencing the enriched DNA fragments.
Q31 Medium
Typical Illumina read lengths are approximately:
A35–50 bp
B100–300 bp
C1–5 kb
D10–25 kb
Explanation
Illumina platforms produce short reads of ~100–300 bp depending on the platform and chemistry. SOLiD produces 35–50 bp. PacBio produces 10–25 kb (or more with HiFi). The trade-off is: Illumina has high accuracy and throughput but shorter reads; PacBio/Nanopore have long reads but historically higher error rates.
Q32 Medium
Oxford Nanopore sequencing can directly sequence RNA without:
AConverting it to cDNA first
BUsing any electrical current
CFragmenting the molecules
DA protein nanopore
Explanation
Nanopore can sequence RNA directly, without reverse transcription to cDNA. RNA molecules pass through the pore, and current changes are used to infer the sequence. This preserves RNA modifications (e.g., m6A) that would be lost in cDNA conversion. Nanopore always requires electrical current and a protein pore for detection.
Q33 Medium
Population stratification in GWAS can cause:
AIncreased read length
BDecreased sequencing depth
CSpurious associations between ancestry-related SNPs and the trait
DImproved statistical power
Explanation
Population stratification occurs when subgroups differ in both ancestry and trait prevalence. SNPs that differ between subgroups may appear associated with the trait simply because they track ancestry, not biology. This creates false positives. It is detected using MDS/PCA and corrected by including ancestry components as covariates.
Q34 Medium
The exome represents approximately what percentage of the human genome?
A0.1%
B10%
C5%
D~1–2%
Explanation
The exome (all protein-coding exons) represents only about 1–2% of the human genome, yet contains ~85% of known disease-causing mutations. Whole-exome sequencing (WES) targets this fraction, making it far cheaper than WGS while capturing most clinically relevant variants.
Q35 Medium
In bisulfite sequencing, unmethylated cytosines are converted to:
AGuanine
BUracil (read as thymine after PCR)
CAdenine
D5-methylcytosine
Explanation
Bisulfite treatment converts unmethylated cytosines to uracil (read as T after PCR amplification), while methylated cytosines (5mC) are protected and remain as C. By comparing the treated sequence to the reference, methylation status at each C can be determined. The main analytical challenge is distinguishing true C→T conversions from C→T SNPs.
Q36 Medium
A VCF file stores:
ASNPs, indels, and structural variant calls
BRaw sequencing reads
CRead alignment coordinates in binary format
DGenome annotation features
Explanation
VCF (Variant Call Format) is a standardized text file for variant calls. It contains columns for chromosome, position, ID, reference allele, alternative allele(s), quality, filter status, info annotations, format, and per-sample genotype data. Meta-information lines begin with ## and the header line with #.
Q37 Medium
In RNA-seq, poly(A) selection is used to:
ARemove adapter sequences
BFragment the cDNA library
CEnrich mRNA from total RNA
DNormalize expression levels across samples
Explanation
Most eukaryotic mRNAs have a poly(A) tail. Oligo(dT) beads capture polyadenylated transcripts, separating mRNA from the dominant rRNA (~80% of total RNA). This enrichment is essential because sequencing total RNA without enrichment would be overwhelmed by ribosomal RNA.
Q38 Medium
Tag SNPs in GWAS genotyping arrays are selected because they:
AAre always the causal variants
BHave the highest mutation rates
CAre located exclusively in coding regions
DCapture variation within LD blocks without genotyping every SNP
Explanation
Because SNPs within an LD block are correlated, genotyping one representative "tag" SNP captures information about the others. This reduces cost while maintaining genome-wide coverage. The detected association is typically indirect — the tag SNP is in LD with the actual causal variant, which may not be on the array.
Q39 Medium
A SAM FLAG value of 4 indicates the read is:
AA PCR duplicate
BUnmapped to the reference
CA secondary alignment
DProperly paired with its mate
Explanation
SAM FLAG is a bitwise flag: FLAG 4 = unmapped, FLAG 256 = secondary alignment, FLAG 1024 = PCR/optical duplicate. Unmapped reads can be useful for metagenomics — they may come from contaminant organisms (bacteria, viruses) not present in the host reference genome.
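Because the FLAG is a sum of bits, it can be decoded with bitwise AND. The sketch below covers the three bits named above plus three more taken from the SAM specification (1, 2, 16); the table is deliberately incomplete and the name `decode_flag` is hypothetical.

```python
# Partial table of SAM FLAG bits (not the full specification).
FLAG_BITS = {
    1: "paired",
    2: "properly paired",
    4: "unmapped",
    16: "reverse strand",
    256: "secondary alignment",
    1024: "PCR or optical duplicate",
}

def decode_flag(flag):
    """List the names of all known bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]
```

A FLAG of 4 decodes to a single property, while 1028 (1024 + 4) decodes to both "unmapped" and "PCR or optical duplicate".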
Q40 Medium
The correct order of file formats in a variant discovery pipeline is:
AFASTQ → BAM → VCF
BBAM → FASTQ → VCF
CVCF → BAM → FASTQ
DFASTQ → VCF → BAM
Explanation
Raw reads (FASTQ) are aligned to a reference with BWA to produce BAM files, which are then processed through variant callers like GATK to produce VCF files. Quality control and filtering happen between every step. This FASTQ → BAM → VCF flow is the backbone of any resequencing analysis.
Q41 Medium
Class I transposable elements (retrotransposons) move via:
ACut-and-paste through a DNA intermediate
BDirect excision and reinsertion
CCopy-and-paste through an RNA intermediate
DHorizontal gene transfer
Explanation
Class I (retrotransposons: LINEs, SINEs, LTR elements) use "copy and paste" via an RNA intermediate — the original stays in place while a copy inserts elsewhere. Class II (DNA transposons) use "cut and paste" via a DNA intermediate. This is a commonly confused distinction — reversing them is a classic exam trap.
Q42 Medium
A genomic inflation factor (λGC) of 1.00 in a GWAS indicates:
AThe study detected many true associations
BUncorrected population stratification
COvercorrection for population structure
DNo systematic inflation — proper control of confounders
Explanation
λGC ≈ 1.00 is the ideal scenario: observed test statistics match the expected null distribution. λ > 1 indicates inflation (possible stratification, false positives). λ < 1 suggests overcorrection (too many covariates, risking false negatives). The QQ-plot provides the visual counterpart to this numeric assessment.
Q43 Medium
In the FastQC "Per base sequence content" module, all four bases showing approximately equal frequencies at every position suggests:
AAdapter contamination
BRandom fragmentation — a good-quality WGS library
CRestriction enzyme digestion
DPCR duplicate artifacts
Explanation
Random fragmentation produces roughly equal proportions of A, T, G, C at every position along the read — expected for a good WGS library. If specific bases dominate at certain positions (e.g., T always at position 1), it suggests restriction enzyme digestion was used. Understanding the library prep method is key to interpreting FastQC output correctly.
Q44 Medium
The additive genetic model in GWAS codes genotypes as:
AAA = 1, Aa = 1, aa = 0 (dominant model)
BAA = 1, Aa = 0, aa = −1
C0, 1, or 2 copies of the minor allele
DGenotypes are not coded numerically
Explanation
The additive model, the most commonly used in GWAS, counts minor allele copies: 0 (homozygous major), 1 (heterozygous), 2 (homozygous minor). Regression (linear for quantitative traits, logistic for case/control traits) then tests whether the phenotype changes with each additional copy of the minor allele.
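A minimal sketch of additive coding followed by an ordinary least-squares slope for a quantitative trait. The genotypes and phenotype values are made up for illustration, and the function names are hypothetical; a real GWAS would also include covariates and a significance test.

```python
def additive_code(genotype, minor_allele):
    """Count copies of the minor allele in a diploid genotype such as 'Aa'."""
    return sum(1 for allele in genotype if allele == minor_allele)

def least_squares_slope(dosages, phenotypes):
    """Slope of phenotype on allele dosage (effect per extra minor allele)."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    return sxy / sxx

# Hypothetical data: 'a' is the minor allele.
genotypes = ["AA", "Aa", "aa", "AA", "Aa", "aa"]
phenotypes = [10.0, 12.0, 14.0, 10.0, 12.0, 14.0]
dosages = [additive_code(g, "a") for g in genotypes]
slope = least_squares_slope(dosages, phenotypes)
```

In this toy data set the phenotype rises by 2 units per copy of the minor allele, which is what the slope recovers.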
Q45 Medium
RepeatMasker is used for:
AIdentifying and masking repetitive elements in a genome assembly
BPredicting gene structures using HMMs
CEvaluating assembly completeness
DVisualizing read alignments in a genome browser
Explanation
RepeatMasker identifies repetitive elements by comparing the genome against databases (Dfam, Repbase). Repeats can be hard-masked (replaced with Ns) or soft-masked (converted to lowercase). Masking repeats before gene prediction prevents false gene predictions in repetitive regions. AUGUSTUS is for gene prediction, BUSCO for completeness, and IGV for visualization.
Q46 — Open Calculation
You sequence a genome of 2.5 Gbp using Illumina paired-end 150 bp reads. You obtain 500 million reads. Calculate the sequencing depth. Is this sufficient for robust SNP detection (minimum ~10×)?
✓ Model Answer
Depth = (N × L) / G
N = 500,000,000 reads; L = 150 bp; G = 2,500,000,000 bp
Depth = (500,000,000 × 150) / 2,500,000,000
= 75,000,000,000 / 2,500,000,000 = 30×

Yes, 30× exceeds the recommended minimum of ~10× for robust SNP detection. This depth provides high confidence for variant calling and genotyping.
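
The same calculation as a small helper, as a sketch; the function name `sequencing_depth` is hypothetical, and the 10× cutoff is the minimum quoted in the question.

```python
def sequencing_depth(n_reads, read_length, genome_size):
    """Depth = (N x L) / G, with all lengths in base pairs."""
    return n_reads * read_length / genome_size

depth = sequencing_depth(500_000_000, 150, 2_500_000_000)
sufficient_for_snps = depth >= 10
```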

Q47 — Open Calculation
Given the following contig lengths (in kb): 120, 90, 80, 70, 50, 40, 30, 20. Calculate the N50 value.
✓ Model Answer
Step 1: Sort contigs from longest to shortest: 120, 90, 80, 70, 50, 40, 30, 20
Step 2: Total assembly size = 120 + 90 + 80 + 70 + 50 + 40 + 30 + 20 = 500 kb
Step 3: Half of total = 250 kb
Step 4: Cumulative sum from longest:
120 → cumulative = 120 (below 250)
120 + 90 = 210 (below 250)
210 + 80 = 290 (exceeds 250)
N50 = 80 kb

The N50 is the length of the contig that, when added, causes the cumulative sum to cross 50% of the total assembly size. Note: N50 measures contiguity, not correctness — a high N50 does not guarantee an error-free assembly.
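
The four steps can be sketched as a short function. It assumes the convention that N50 is reached when the cumulative sum is at least half the total; the name `n50` is hypothetical.

```python
def n50(lengths):
    """Length of the contig at which the cumulative sum reaches 50% of the total."""
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length

contigs_kb = [120, 90, 80, 70, 50, 40, 30, 20]
result = n50(contigs_kb)
```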

Q48 — Open Short Answer
Describe the bisulfite sequencing strategy. What chemical conversion occurs, and how does it allow detection of methylated cytosines?
✓ Model Answer

Bisulfite sequencing works by treating genomic DNA with sodium bisulfite, which converts unmethylated cytosines to uracil (read as thymine after PCR amplification). Methylated cytosines (5-methylcytosine) are protected from this conversion and remain as C.

After sequencing, the reads are aligned to the reference genome. At each cytosine position: if the read shows C → the position was methylated; if it shows T → the position was unmethylated. This provides single-base resolution methylation mapping.

A major analytical challenge is distinguishing bisulfite-induced C→T conversions from genuine C→T SNPs in the genome. About 98% of methylation in the human genome occurs at CpG dinucleotides. CpG islands (regions dense in CpG sites) near gene promoters are of particular interest as their methylation status often regulates gene expression.
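
The conversion and the read-versus-reference comparison can be sketched in silico. This toy example assumes a single strand, perfectly efficient conversion, no sequencing errors, and no SNPs; the function names, sequence, and methylated positions are made up for illustration.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite treatment plus PCR: unmethylated C is read as T."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence)
    )

def call_methylation(reference, read):
    """Compare a converted read to the reference at every reference C."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = "methylated"
        elif read_base == "T":
            calls[i] = "unmethylated"
        else:
            calls[i] = "other"  # neither C nor T: error or variant
    return calls

reference = "ACGTCCGA"          # cytosines at positions 1, 4, 5 (0-based)
read = bisulfite_convert(reference, methylated_positions={1, 5})
calls = call_methylation(reference, read)
```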

Q49 — Open Short Answer
How can genome size be estimated before sequencing? Describe two approaches.
✓ Model Answer

1. C-value (Flow Cytometry): The C-value is the amount of DNA in picograms (pg) in a haploid genome. It is measured using flow cytometry or Feulgen densitometry, typically by comparing staining intensity to a reference species with known genome size. The conversion formula is: Genome size (bp) = C-value (pg) × 0.978 × 10⁹. For example, a C-value of 2.0 pg gives approximately 1.96 Gbp.

2. K-mer frequency analysis: This approach needs sequencing reads but no assembly, so it is applied after an initial sequencing run. Reads are decomposed into K-mers and their frequency distribution is plotted. The genome size is estimated by: Genome size = Total number of K-mers (area under the curve) / Average K-mer coverage (position of the main peak). The distribution typically shows three features: a left peak of low-frequency K-mers (sequencing errors), a main peak (true genomic K-mers at average coverage), and a right tail of high-frequency K-mers (repetitive regions).
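
The C-value conversion in approach 1 is a one-line calculation. The sketch below uses the factor quoted above (0.978 × 10⁹ bp per pg); the function name is hypothetical.

```python
BP_PER_PICOGRAM = 0.978e9  # conversion factor quoted in the model answer

def genome_size_from_c_value(c_value_pg):
    """Convert a haploid C-value in picograms to genome size in base pairs."""
    return c_value_pg * BP_PER_PICOGRAM

size_bp = genome_size_from_c_value(2.0)
size_gbp = size_bp / 1e9
```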

Q50 — Open Short Answer
Describe and draw a Manhattan plot. What information does it display, what do the axes represent, and how do you identify significant associations?
✓ Model Answer

A Manhattan plot is the standard visualization of GWAS results. It displays all tested SNPs across the genome:

X-axis: Genomic position — SNPs are plotted by their physical location, ordered by chromosome. Each chromosome is shown in a different color.

Y-axis: −log₁₀(p-value) — the negative log-transformed p-value of each SNP-trait association. This transformation makes more significant associations appear as taller points (a p-value of 10⁻⁸ appears as 8 on the Y-axis).

Significance threshold: A horizontal line at −log₁₀(5 × 10⁻⁸) ≈ 7.3 marks the genome-wide significance threshold. SNPs above this line are considered significantly associated.

Interpretation: True associations appear as "peaks" — clusters of linked SNPs (in LD) rising above the background. The peak shape reflects LD structure: the top SNP has the strongest signal, and nearby correlated SNPs form a hill. An isolated single SNP above the threshold (without supporting nearby SNPs) is suspicious and may be a false positive due to genotyping errors.

[Drawing: a scatter plot with chromosomes along the X-axis separated by alternating colors, dots scattered at low Y values (1–4), with one or more sharp peaks exceeding the horizontal significance line around Y = 7.3]

Q51 — Open Calculation
In a population of 1000 individuals, you observe: 500 AA, 200 Aa, 300 aa. Calculate allele frequencies, expected genotype counts under HWE, and determine whether this population is in Hardy-Weinberg Equilibrium.
✓ Model Answer
Total individuals = 1000; Total alleles = 2000
Allele A count: (2 × 500) + (1 × 200) = 1200
p = freq(A) = 1200 / 2000 = 0.6
q = freq(a) = 1 − 0.6 = 0.4
Expected under HWE:
AA: p² × 1000 = 0.36 × 1000 = 360
Aa: 2pq × 1000 = 0.48 × 1000 = 480
aa: q² × 1000 = 0.16 × 1000 = 160
Chi-squared test:
χ² = (500−360)²/360 + (200−480)²/480 + (300−160)²/160
= 19600/360 + 78400/480 + 19600/160
= 54.44 + 163.33 + 122.50 = 340.28

With 1 degree of freedom, the critical value at α = 0.05 is 3.84. Since 340.28 >> 3.84, the population is not in Hardy-Weinberg Equilibrium. There is a large excess of homozygotes and a deficit of heterozygotes, suggesting non-random mating, selection, or population substructure.
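
The calculation can be reproduced with a short function, as a sketch. The function name `hwe_chi_square` is hypothetical, and 3.84 is the critical value for 1 degree of freedom at α = 0.05 quoted above.

```python
def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Return (p, q, expected counts, chi-squared) for one biallelic locus."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    q = 1 - p                         # frequency of allele a
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, q, expected, chi2

p, q, expected, chi2 = hwe_chi_square(500, 200, 300)
in_hwe = chi2 < 3.84  # critical value, 1 degree of freedom, alpha = 0.05
```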

Q52 — Open Short Answer
What is the primary purpose of ChIP-seq? Describe its main experimental steps and explain how protein-DNA binding sites are identified from the data.
✓ Model Answer

Purpose: ChIP-seq identifies genome-wide binding sites of proteins (transcription factors, histones, etc.) to DNA.

Steps:

1. Crosslinking: Formaldehyde covalently links proteins to the DNA they are bound to in vivo.

2. Fragmentation: Chromatin is sheared into small fragments (~200–500 bp) by sonication or enzymatic digestion.

3. Immunoprecipitation: An antibody specific to the target protein pulls down protein-DNA complexes.

4. Reverse crosslinking & purification: The crosslinks are reversed and DNA is purified.

5. Sequencing: The enriched DNA fragments are sequenced using NGS.

Identifying binding sites: Reads are aligned to the reference genome. Regions with significantly more reads than the background (input control) form "peaks." Peak-calling algorithms (e.g., MACS2) identify these enriched regions as binding sites. The height and shape of peaks indicate binding strength and precision.

Q53 — Open Calculation
You want to sequence a 1.2 Gbp genome at 60× coverage using 150 bp reads. How many reads do you need?
✓ Model Answer
Coverage = (N × L) / G → N = (Coverage × G) / L
N = (60 × 1,200,000,000) / 150
= 72,000,000,000 / 150
= 480,000,000 reads

You need approximately 480 million reads of 150 bp each to achieve 60× coverage of a 1.2 Gbp genome.
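
The rearranged formula as a helper, as a sketch; it rounds up because a fractional read cannot be sequenced. The function name `reads_needed` is hypothetical.

```python
import math

def reads_needed(coverage, genome_size, read_length):
    """N = (coverage x G) / L, rounded up to a whole number of reads."""
    return math.ceil(coverage * genome_size / read_length)

n_reads = reads_needed(60, 1_200_000_000, 150)
```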

Q54 — Open Short Answer
Describe the FastQC "Per base sequence quality" module. What does the boxplot at each position represent, and when should trimming be applied?
✓ Model Answer

The "Per base sequence quality" module shows quality score distributions at each position along the read. At each position, a boxplot displays:

- The median quality score (central line)

- The interquartile range (IQR, the box: 25th–75th percentile)

- The 10th and 90th percentiles (whiskers)

- The mean quality (blue line)

The background is color-coded: green (good, Q ≥ 28), orange (acceptable, Q 20–28), and red (poor, Q < 20).

When to trim: Trimming should be applied when quality scores drop into the orange or red zones, which typically occurs toward the 3' end of reads. A sliding window approach (e.g., with Trimmomatic) calculates the average quality within a window and trims when it falls below a threshold (e.g., Q20). After trimming, FastQC should be re-run to confirm improvement. Reads shorter than a minimum length (e.g., 25 bp) should be discarded entirely.

Q55 — Open Short Answer
Describe the K-mer frequency distribution graph. What are the three main regions visible, and what does each represent?
✓ Model Answer

A K-mer frequency distribution plots K-mer frequency (X-axis) against the number of distinct K-mers at that frequency (Y-axis). Three main regions are visible:

1. Left peak (low frequencies, e.g., 1–5×): Represents K-mers caused by sequencing errors. Errors create unique, erroneous K-mers that appear only once or a few times. These should be discarded before assembly.

2. Main peak (moderate frequency): Represents true genomic K-mers. The position of this peak corresponds to the average sequencing depth. For example, a peak at 30× means each genomic K-mer was sequenced approximately 30 times.

3. Right tail (high frequencies, extending well beyond the main peak): Represents K-mers from repetitive regions. Repeats occur multiple times in the genome, so their K-mers appear at multiples of the average coverage. A prominent right tail indicates high repeat content, which will complicate assembly.

Genome size is estimated by: Total K-mers (area under the curve, excluding error peak) / Main peak position.
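
A minimal sketch of building the K-mer spectrum and applying the estimate above. It assumes a user-chosen cutoff that separates the error peak from the main peak; the function names and the example spectrum are made up for illustration, and dedicated K-mer counters are used on real data.

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Map frequency -> number of distinct k-mers observed at that frequency."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())

def estimate_genome_size(spectrum, error_cutoff):
    """Total k-mers above the error cutoff divided by the main peak position."""
    kept = {freq: n for freq, n in spectrum.items() if freq > error_cutoff}
    peak = max(kept, key=kept.get)                      # most common frequency
    total_kmers = sum(freq * n for freq, n in kept.items())  # area under curve
    return total_kmers / peak

# Hypothetical spectrum: error peak at 1-2x, main peak at 30x, repeats at 60x.
spectrum = {1: 5000, 2: 300, 29: 800, 30: 1000, 31: 800, 60: 50}
genome_size = estimate_genome_size(spectrum, error_cutoff=5)
```

In this made-up spectrum the 50 distinct K-mers at 60× count twice toward the estimate, which is how repeats contribute to genome size.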

Q56 — Open Short Answer
Describe the complete variant discovery pipeline from raw reads to annotated variants. For each step, name the input format, output format, and one commonly used tool.
✓ Model Answer

1. Quality Control & Trimming: Input: FASTQ → Tool: FastQC (QC), Trimmomatic (trimming) → Output: cleaned FASTQ. Assess per-base quality, GC content, duplications. Trim low-quality ends and remove short reads.

2. Alignment: Input: cleaned FASTQ + reference FASTA → Tool: BWA-MEM → Output: SAM/BAM. Map reads to reference genome. Post-alignment: sort, index, and remove PCR duplicates (Picard). Filter by mapping quality (MAPQ).

3. Variant Calling: Input: filtered BAM → Tool: GATK HaplotypeCaller → Output: VCF. Identify SNPs and indels at each position. Joint calling across samples is preferred for population studies.

4. Variant Annotation: Input: VCF → Tool: Ensembl VEP or SnpEff → Output: annotated VCF. Determine effect of each variant (missense, synonymous, intronic, splice site, TFBS gain/loss). Provide functional impact predictions (SIFT, PolyPhen-2).

Quality control and filtering occur between every step — this iterative QC cycle is essential for reliable results.

Q57 — Open Short Answer
Explain why a GWAS-significant SNP is usually not the causal variant. What is "indirect association," and what steps follow a GWAS to identify the true causal variant?
✓ Model Answer

GWAS uses tag SNPs on genotyping arrays, which are representative markers for LD blocks. The detected SNP is typically in LD with the true causal variant rather than being causal itself — this is called indirect association. The causal variant may not be on the array at all.

Post-GWAS steps:

1. Fine-mapping: Examine LD structure (r², D′) around the peak to narrow the candidate region and prioritize variants most likely to be causal.

2. Gene annotation: Use BEDTools intersect to identify genes within a defined window (e.g., 0.5 Mb) of the top SNPs. Consult databases like GeneCards and the GWAS Catalog.

3. Functional enrichment: Apply ORA (DAVID, EnrichR) to test whether candidate genes are enriched for specific pathways.

4. Replication: Validate findings in an independent cohort with the same phenotype definition.

5. Functional validation: Experimental studies (gene expression analysis, knockouts, reporter assays) to confirm the causal role of the candidate variant.