Applied Genomics — Final Comprehensive Exam

Instructions: Answer all questions. For MCQs, select the single best answer. For open questions, provide concise but complete answers.


1 Medium
What does linkage disequilibrium describe?
AThe degree of similarity between two populations
BThe correlation between alleles of two SNPs within a population
CThe rate of linked contigs during genome assembly
DThe mutation rate of genetic markers
Explanation
Linkage disequilibrium (LD) refers to the non-random association of alleles at different loci. When two SNPs are in LD, certain allele combinations occur more frequently than expected by chance, reflecting shared evolutionary history and limited recombination between them.
2 Easy
Which NGS technology is known for producing long reads, often used in de novo genome assembly?
AIllumina
BIon Torrent
CPacBio
DRoche 454
Explanation
Pacific Biosciences (PacBio) produces long reads averaging 10-25 kb, which are invaluable for resolving repetitive regions and constructing de novo genome assemblies. Illumina and Ion Torrent are short-read technologies, while Roche 454 is discontinued.
3 Medium
In a Manhattan plot (GWAS analysis), what does the vertical axis typically represent?
AThe SNP position (bp) along the chromosome
BThe minor allele frequency (MAF)
CThe significance level [−log₁₀(P)] of each SNP association
DThe effect size (β) of each SNP
Explanation
The Y-axis shows −log₁₀(P-value), so more significant associations appear as higher points. A horizontal reference line typically marks the genome-wide significance threshold (P < 5 × 10⁻⁸). The X-axis displays genomic position, ordered by chromosome.
4 Easy
What is the purpose of a SAM file in NGS data analysis?
ATo store raw sequencing reads and quality scores
BTo store sequence alignments to a reference genome
CTo store variant calls and inferred genotypes
DTo store genome annotation information
Explanation
SAM (Sequence Alignment/Map) files contain alignments of sequencing reads to a reference genome. BAM is the binary compressed version. FASTQ files store raw reads, VCF files store variants, and annotation goes in GFF/BED files.
5 Medium
What is aCGH?
AA chip-based genome resequencing technology
BA microarray-based method to identify copy number variations
CAn NGS paired-end sequencing approach
DA method for advanced evaluation of chromosomal heterozygosity
Explanation
Array Comparative Genomic Hybridization (aCGH) is a microarray-based technique that detects copy number variations (CNVs) by comparing test and reference DNA hybridization signals. It was widely used before NGS-based CNV detection became common.
6 Medium
In a FASTQ file, each sequence entry consists of how many lines?
A2 lines
B3 lines
C4 lines
D5 lines
Explanation
FASTQ format uses exactly 4 lines per read: (1) header starting with @, (2) the nucleotide sequence, (3) separator line starting with +, and (4) quality scores encoded as ASCII characters.
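
The four-line layout can be illustrated with a small reader. This is a minimal sketch, not a production parser: the function names `read_fastq` and `phred_scores` are hypothetical, the two example reads are made up, and the quality decoding assumes the common Phred+33 ASCII offset.

```python
from io import StringIO
from itertools import islice


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples, consuming 4 lines per record."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # end of file
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        header, seq, sep, qual = record
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(quality, offset=33):
    """Decode an ASCII quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


# Two made-up reads, 4 lines each.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!#I\n"
records = list(read_fastq(StringIO(example)))
```

Here `records` holds two entries, and `phred_scores("IIII")` decodes each `I` to a Phred score of 40.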
7 Medium
What is the sequencing depth formula?
ADepth = G × L / N
BDepth = (N × L) / G
CDepth = N / (L × G)
DDepth = (G × N) / L
Explanation
Coverage (depth) = (number of reads × read length) / genome size. For example, 100 million 150-bp reads on a 3 Gb genome gives (100M × 150) / 3G = 5× coverage.
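
The formula can be checked numerically, and rearranged to give the number of reads needed for a target depth. The sketch below uses hypothetical function names and assumes every read has the same length and maps to the genome.

```python
import math


def sequencing_depth(n_reads, read_length, genome_size):
    """Depth = (N x L) / G."""
    return n_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Rearranged formula: N = (Depth x G) / L, rounded up to whole reads."""
    return math.ceil(target_depth * genome_size / read_length)


# The example from the explanation: 100 million 150-bp reads on a 3 Gb genome.
depth = sequencing_depth(100_000_000, 150, 3_000_000_000)

# The inverse problem: reads required for 30x on the same genome.
n_for_30x = reads_needed(30, 150, 3_000_000_000)
```

With these inputs `depth` is 5.0 and `n_for_30x` is 600,000,000.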
8 Medium
Which Ion Torrent technology principle involves hydrogen ion release?
APyrosequencing with light detection
BSequencing by synthesis with reversible terminators
CDetection of H⁺ ions released during nucleotide incorporation
DSingle-molecule sequencing using zero-mode waveguides
Explanation
Ion Torrent detects the pH change caused by hydrogen ions released when DNA polymerase incorporates a nucleotide into the growing strand. Each incorporation releases one H⁺ ion, which is detected by an ion sensor.
9 Medium
What does ABI SOLiD technology use for encoding?
ASingle-base encoding
BThree-base encoding
CFour-base encoding
DTwo-base encoding system
Explanation
ABI SOLiD uses di-base (two-base) encoding: each of the four fluorescence colors represents a set of four of the 16 possible dinucleotides. This provides built-in error checking since each base is interrogated twice, in two overlapping dinucleotides.
10 Medium
What is the approximate read length for Illumina sequencing?
A10-50 bp
B100-300 bp
C1-5 kb
D10-25 kb
Explanation
Illumina platforms (e.g., NovaSeq, HiSeq, MiSeq) produce short reads, typically 100-300 bp. PacBio produces long reads (10-25 kb), and Nanopore can produce reads exceeding 100 kb.
11 Medium
What is the primary goal of a GWAS?
ATo sequence entire genomes of affected individuals
BTo identify statistical associations between genetic variants and phenotypic traits
CTo determine the complete haplotype structure of populations
DTo develop new therapeutic drugs for genetic diseases
Explanation
Genome-Wide Association Studies aim to identify statistical associations between genetic variants (typically SNPs) and phenotypic traits across the genome. This helps uncover the genetic basis of complex diseases and traits.
12 Medium
What does CIGAR string represent in alignment files?
AThe chromosome identity of the read
BThe quality score of the alignment
CThe mapping coordinates of the read
DThe pattern of matches, mismatches, insertions and deletions in the alignment
Explanation
CIGAR (Compact Idiosyncratic Gapped Alignment Report) describes the alignment through operations like M (alignment match, which can be a sequence match or a mismatch), I (insertion), D (deletion), S (soft clip), and H (hard clip). For example, "8M2I4M" means 8 aligned bases, a 2-base insertion relative to the reference, then 4 more aligned bases.
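
A CIGAR string can be split into (length, operation) pairs with a regular expression. The sketch below uses hypothetical function names. It follows the SAM convention that M, I, S, = and X consume read bases while M, D, N, = and X consume reference bases.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")   # operations that consume bases of the read
REF_OPS = set("MDN=X")     # operations that consume bases of the reference


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def consumed_lengths(cigar):
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    query = sum(n for n, op in ops if op in QUERY_OPS)
    ref = sum(n for n, op in ops if op in REF_OPS)
    return query, ref


ops = parse_cigar("8M2I4M")
query_len, ref_len = consumed_lengths("8M2I4M")
```

For "8M2I4M" the read contributes 14 bases but spans only 12 reference bases, because the 2 inserted bases have no counterpart in the reference.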
13 Medium
What does BWA stand for?
ABurrows-Wheeler Aligner
BBase-wise Alignment Algorithm
CBinary Read Alignment Tool
DBioinformatics Workflow Analyzer
Explanation
BWA (Burrows-Wheeler Aligner) is a widely-used sequence alignment tool that employs the Burrows-Wheeler Transform (BWT) for efficient read mapping. BWA-MEM is the recommended algorithm for longer reads.
14 Medium
In paired-end sequencing, what is being sequenced?
ATwo separate DNA fragments from different regions
BThe same fragment sequenced twice independently
CThe ends of the same DNA fragment
DForward and reverse strands of double-stranded DNA
Explanation
Paired-end sequencing reads both ends of the same DNA fragment, with a known insert size between them. This provides information about the distance between reads, useful for assembly and structural variant detection.
15 Medium
In VCF format, what does genotype notation "0|1" indicate?
AHomozygous reference genotype
BHomozygous alternative genotype
CHeterozygous phased genotype
DHeterozygous unphased genotype
Explanation
In VCF, 0 = reference allele, 1 = first alternative allele. The pipe "|" indicates phased genotype (chromosome of origin known), while slash "/" indicates unphased genotype. So 0|1 is a phased heterozygous call.
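
The notation can be decoded mechanically. The function below is an illustrative sketch with a hypothetical name, limited to diploid GT strings with one separator.

```python
def describe_genotype(gt):
    """Describe a diploid VCF GT string such as '0|1' or '1/1'."""
    phased = "|" in gt
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        zygosity = "heterozygous"
    elif alleles[0] == "0":
        zygosity = "homozygous reference"
    else:
        zygosity = "homozygous alternate"
    return f"{'phased' if phased else 'unphased'} {zygosity}"


calls = {gt: describe_genotype(gt) for gt in ["0|1", "0/1", "0/0", "1/1", "./."]}
```

In `calls`, "0|1" maps to "phased heterozygous" and "0/1" to "unphased heterozygous".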
16 Medium
What is the difference between structural and functional annotation?
AStructural annotation identifies gene function; functional annotation identifies gene locations
BThey are different names for the same process
CStructural annotation is computational; functional annotation is experimental only
DStructural annotation identifies gene locations; functional annotation describes gene products and biological roles
Explanation
Structural (or computational) annotation identifies genomic features like genes, exons, introns, and regulatory elements. Functional annotation describes what these features do—their biological functions, pathways, and interactions.
17 Hard
What algorithm is used in De Bruijn graph genome assembly?
AEulerian path algorithm
BHamiltonian path algorithm
CSmith-Waterman algorithm
DNeedleman-Wunsch algorithm
Explanation
De Bruijn graph assembly uses the Eulerian path algorithm, which efficiently traverses edges (k-mers) to reconstruct the genome. This approach is computationally efficient for short reads but struggles with repeats. OLC uses Hamiltonian paths.
18 Medium
What does BUSCO evaluate in genome assemblies?
ABase-level accuracy of the assembly
BCompleteness by checking for conserved single-copy orthologous genes
CAssembly contiguity statistics
DRead mapping rates to the assembly
Explanation
BUSCO (Benchmarking Universal Single-Copy Orthologs) assesses assembly completeness by checking for evolutionarily conserved genes expected in the organism's lineage. High BUSCO scores indicate a biologically meaningful, complete assembly.
19 Medium
In Hardy-Weinberg equilibrium, what does the equation p² + 2pq + q² = 1 represent?
AThe expected genotype frequencies in a population
BThe allele frequencies in a population
CThe mutation rate in a population
DThe selection coefficient in a population
Explanation
With allele frequencies p and q (p + q = 1), the HWE equation gives the expected genotype frequencies: p² (homozygous for the first allele), 2pq (heterozygous), and q² (homozygous for the second allele), which sum to 1. This assumes random mating, no selection, infinite population size, and no migration or mutation.
20 Medium
What is multidimensional scaling (MDS) used for in population genomics?
ATo calculate linkage disequilibrium between SNPs
BTo phase haplotypes from genotype data
CTo detect and visualize population structure by reducing high-dimensional genotype data
DTo perform genome-wide association testing
Explanation
MDS is a dimensionality reduction technique that summarizes genome-wide genetic variation into a few dimensions. It visualizes population structure—distinct clusters indicate groups with different genetic ancestry, which must be corrected for in GWAS to avoid false associations.
21 Medium
What is over-representation analysis (ORA) applied after GWAS?
ATo identify additional SNPs not tested in the original GWAS
BTo test whether specific biological functions are enriched in GWAS-identified genes
CTo calculate linkage disequilibrium between candidate variants
DTo replicate GWAS findings in independent populations
Explanation
ORA determines whether specific biological functions, pathways, or processes are over-represented in the GWAS gene list compared to what would be expected by chance. Tools like DAVID and Enrichr perform this analysis.
22 Medium
How do you estimate genome size before sequencing?
AUsing the C-value from flow cytometry
BBy counting all genes in related species
CBy measuring DNA concentration with spectrophotometry
DBy performing a small pilot sequencing run
Explanation
The C-value (DNA content in picograms) is measured by flow cytometry. Since 1 pg ≈ 978 Mb, genome size (bp) = C-value × 0.978 × 10⁹. K-mer analysis is another computational approach to estimate genome size from sequencing data.
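
The conversion is a single multiplication. The sketch below uses the 1 pg ≈ 978 Mb factor from the explanation; the function names are hypothetical.

```python
BP_PER_PG = 0.978e9  # 1 pg of DNA is approximately 978 Mb


def genome_size_bp(c_value_pg):
    """Convert a C-value in picograms to an approximate genome size in bp."""
    return c_value_pg * BP_PER_PG


def c_value_pg(genome_size):
    """Inverse conversion: genome size in bp to picograms of DNA."""
    return genome_size / BP_PER_PG


size = genome_size_bp(2.0)
```

A C-value of 2.0 pg gives about 1.956 × 10⁹ bp.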
23 Hard
What is bisulfite sequencing used to detect?
ADNA sequence variants
BCopy number variations
CChromosomal rearrangements
DDNA methylation patterns
Explanation
Bisulfite sequencing treats DNA with bisulfite, which converts unmethylated cytosines to uracil (read as thymine after PCR), while methylated cytosines remain unchanged. Comparing treated and untreated sequences reveals the methylation status of each cytosine.
24 Medium
In ChIP-seq, what does the "peak" represent?
AA genomic region where a DNA-binding protein is likely attached
BA sequencing error in the data
CA copy number variation in the genome
DA gene fusion event
Explanation
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) identifies protein binding sites. Regions with enriched read coverage ("peaks") indicate where the protein of interest (e.g., a transcription factor or a modified histone) was bound to DNA.
25 Medium
What is the standard genome-wide significance threshold in GWAS?
AP < 0.05
BP < 0.01
CP < 5 × 10⁻⁸
DP < 0.001
Explanation
P < 5 × 10⁻⁸ is the widely accepted genome-wide significance threshold. It is a Bonferroni correction of α = 0.05 for approximately 1 million independent tests (the estimated number of independent LD blocks across the genome): 0.05 / 10⁶ = 5 × 10⁻⁸. This threshold accounts for the massive multiple testing burden in GWAS.
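
The arithmetic behind the threshold, and its position on the −log₁₀(P) scale used in Manhattan plots, can be sketched as follows. The function names are hypothetical, and the figure of one million independent tests is the approximation quoted in the explanation.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def neg_log10(p_value):
    """Transform a P-value to the scale plotted on a Manhattan plot's Y-axis."""
    return -math.log10(p_value)


threshold = bonferroni_threshold(0.05, 1_000_000)
line_height = neg_log10(threshold)
```

`threshold` comes out at 5 × 10⁻⁸, and `line_height` is about 7.3, which is where the significance line sits on the Y-axis.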
26 Medium
What does ROH stand for in population genomics?
ARate of Homoplasy
BRecombination Output Hierarchy
CReference Ontology Hub
DRuns of Homozygosity
Explanation
Runs of Homozygosity (ROH) are contiguous stretches of homozygous genotypes. Longer ROH indicate recent inbreeding, as identical-by-descent alleles are inherited from a common ancestor. ROH analysis is used to quantify inbreeding coefficients.
27 Medium
What is the primary purpose of Pool-seq?
ATo obtain individual genotypes for all participants
BTo estimate allele frequencies across a population cost-effectively
CTo phase haplotypes in family data
DTo identify rare variants in individuals
Explanation
Pool-seq combines DNA from multiple individuals into a single pool and sequences it to estimate population-level allele frequencies. It's cost-effective for population comparison studies (e.g., case vs. control pools) but loses individual genotype information.
28 Medium
What is the key advantage of exome sequencing over whole-genome sequencing?
ALower cost while capturing most disease-relevant variants
BDetection of structural variants
CSequencing of non-coding regulatory regions
DAssembly of novel genomes
Explanation
Exome sequencing targets only the protein-coding regions, which make up ~1-2% of the genome but contain ~85% of known disease-related variants. This makes it much cheaper than WGS while remaining clinically relevant for many genetic disorders.
29 Medium
In RNA-seq, what does TPM normalization account for?
AOnly gene length differences
BOnly sequencing depth differences
CBoth gene length and sequencing depth, enabling cross-sample comparison
DGC content bias only
Explanation
TPM (Transcripts Per Million) normalizes for both gene length and sequencing depth, making it suitable for comparing gene expression across samples. Unlike RPKM/FPKM, TPM values sum to the same total in each sample.
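
The two normalization steps can be written out directly. This is a minimal sketch: the function name `tpm` is hypothetical, the counts and lengths are made up, and it uses plain annotated gene length rather than the effective length some tools apply.

```python
def tpm(counts, lengths_bp):
    """Compute TPM from raw read counts and gene lengths in bp."""
    # Step 1: correct for gene length (reads per kilobase).
    rates = [count / (length / 1000) for count, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        raise ValueError("no reads assigned to any gene")
    # Step 2: correct for depth by scaling the rates to sum to one million.
    return [rate / total * 1_000_000 for rate in rates]


# Made-up example: raw counts differ only because the genes differ in length.
values = tpm(counts=[100, 200, 300], lengths_bp=[1000, 2000, 3000])
```

All three genes receive the same TPM here, and the values sum to one million, the property that makes samples comparable.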
30 Medium
What is the difference between genomic selection and marker-assisted selection?
AGenomic selection uses phenotype data; marker-assisted selection uses genotype data
BThey are identical methods
CGenomic selection uses few markers; marker-assisted selection uses many
DGenomic selection uses genome-wide markers for complex traits; marker-assisted selection uses few markers for traits with major genes
Explanation
Genomic selection uses thousands of genome-wide markers to predict total genetic merit for complex quantitative traits. Marker-assisted selection (MAS) uses specific markers linked to major-effect genes. GS enables selection for hard-to-measure traits early in life.
31 Medium
What is the definition of an allele?
AA segment of DNA that encodes a protein
BOne of two or more alternative forms of a gene or DNA sequence
CA mutation that causes disease
DThe complete set of chromosomes in an organism
Explanation
An allele is an alternative form of a gene or DNA sequence at a specific locus. For example, at the ABO gene the A, B, and O alleles represent different versions. Individuals inherit one allele from each parent.
32 Medium
If a population is NOT in Hardy-Weinberg equilibrium, what might this indicate?
AEvolutionary forces are acting (selection, migration, mutation, drift) or non-random mating
BThe population is extremely large
CThe DNA sequencing was performed incorrectly
DAll individuals are genetically identical
Explanation
Deviation from HWE suggests evolutionary forces are at work: natural selection, migration (gene flow), mutation, genetic drift (especially in small populations), or non-random mating (including inbreeding). This is a fundamental test in population genetics.
33 Medium
What does FST measure?
AThe frequency of somatic mutations in a population
BThe forward substitution rate in DNA sequences
CThe level of genetic differentiation between subpopulations
DThe fixation index of alleles within individuals
Explanation
FST measures genetic differentiation among subpopulations. Values range from 0 (no differentiation, populations identical) to 1 (complete differentiation, no shared alleles). High FST indicates population structure, which is important for GWAS to avoid confounding.
34 Medium
What is the primary goal of genome annotation?
ATo assemble reads into contigs
BTo align reads to a reference genome
CTo identify variants between samples
DTo identify and describe functional elements in the genome
Explanation
Genome annotation identifies and describes genomic elements including genes, exons, introns, regulatory regions, and other functional elements. Structural annotation locates features; functional annotation describes their biological roles.
35 Hard
In a De Bruijn graph, what do vertices represent?
A(k−1)-mers (prefixes and suffixes of k-mers)
BIndividual sequencing reads
CComplete genes
DK-mers connecting nodes
Explanation
In De Bruijn graphs, vertices are (k−1)-mers derived from decomposing reads into k-mers. Edges represent the k-mers themselves, each connecting its prefix (k−1)-mer to its suffix (k−1)-mer. For example, with k = 4 the k-mer "ATGC" connects node "ATG" to node "TGC".
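
The construction can be sketched in a few lines. The function name is hypothetical. The graph is kept as an adjacency list in which a k-mer seen several times yields several parallel edges, and no error correction or graph simplification is attempted.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, each k-mer is one edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            graph[prefix].append(suffix)  # edge prefix -> suffix
    return dict(graph)


single = de_bruijn_edges(["ATGC"], k=4)
chain = de_bruijn_edges(["ATGCA"], k=4)
```

`single` holds the one edge from "ATG" to "TGC". `chain` adds a second edge from "TGC" to "GCA", so an Eulerian walk over both edges spells out ATGCA.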
36 Medium
What is N50 in genome assembly?
AThe total number of contigs in the assembly
BThe contig length where 50% of the assembly is in contigs of this length or longer
CThe average contig length
DThe number of gaps in the assembly
Explanation
N50 is a contiguity metric: sort contigs longest to shortest, sum lengths until reaching 50% of total assembly length—the length of the last contig added is N50. Higher N50 means less fragmented assembly, but doesn't guarantee correctness.
37 Medium
What is population stratification in GWAS?
AThe random sampling of individuals from a population
BThe division of a population into cases and controls
CSubgroups differing in genetic ancestry that can cause confounding in GWAS
DThe stratification of DNA by GC content during sequencing
Explanation
Population stratification occurs when a study includes genetically distinct subgroups (e.g., different ancestries). Allele frequency differences between groups can mimic associations with traits, causing false positives. MDS/PCA and covariate adjustment are used to correct for this.
38 Medium
What is copy number variation (CNV)?
ASingle nucleotide differences between individuals
BSmall insertions and deletions (1-50 bp)
CChanges in chromosome number (aneuploidy)
DDNA segments ≥1 kb that vary in copy number
Explanation
CNVs are DNA segments ≥1 kb that vary in copy number between individuals. They include duplications, deletions, and complex rearrangements. Despite being fewer than SNPs, CNVs contribute significantly to phenotypic diversity and disease susceptibility.
39 Medium
What is imputation in the context of GWAS?
AStatistical method to infer genotypes at untyped SNPs using reference panels
BThe process of filling gaps in genome assemblies
CA quality control procedure to remove low-quality reads
DEstimating missing phenotype data
Explanation
Genotype imputation uses LD patterns from reference panels (HapMap, 1000 Genomes) to statistically estimate genotypes at SNP positions not directly genotyped. This enables meta-analysis across studies using different genotyping platforms.
40 Medium
Why is high heterozygosity problematic for genome assembly?
AIt increases sequencing error rates
BAllelic differences can be misassembled as separate regions, causing fragmentation
CIt reduces coverage depth
DIt makes DNA extraction more difficult
Explanation
In highly heterozygous genomes, assemblers may interpret allelic variation (from maternal and paternal chromosomes) as distinct genomic regions, assembling both separately. This leads to fragmented assemblies and inflated genome sizes. Using inbred lines or haploid tissues helps.
41 Medium
What does a λGC value of approximately 1.0 indicate in GWAS?
AMany true positive associations detected
BSevere population stratification causing false positives
CNo inflation—test statistics match expected null distribution
DOver-correction requiring more covariates
Explanation
λGC (genomic control inflation factor) ≈ 1.0 means test statistics follow the expected null distribution—indicating proper population structure control. λGC > 1 suggests inflation (confounding); λGC < 1 suggests over-correction.
42 Hard
In a GFF file, the score column represents:
AThe GC content of the feature
BThe length of the feature in base pairs
CThe number of reads supporting the feature
DA confidence value for the prediction (higher = more confident)
Explanation
GFF (General Feature Format) has 9 columns: seqname, source, feature, start, end, score, strand, frame, and attribute. The score column is a floating-point value typically representing confidence—higher values indicate more reliable predictions.
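
The nine tab-separated columns map naturally onto a record type. The sketch below uses hypothetical names and a made-up example line, and it treats "." in the score column as a missing value.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    attribute: str


def parse_gff_line(line):
    """Parse one tab-separated GFF feature line into a GffRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attribute = fields
    return GffRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),  # "." means no score
        strand, frame, attribute,
    )


# Made-up feature line for illustration only.
example_line = "chr1\tAUGUSTUS\tgene\t1000\t5000\t0.95\t+\t.\tID=gene1"
rec = parse_gff_line(example_line)
```

Here `rec.score` is 0.95 and the feature spans `rec.end - rec.start + 1` = 4001 bp, since GFF coordinates are 1-based and inclusive.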
43 Medium
What is RAD-seq primarily used for?
ARestriction site-associated DNA sequencing for population genetics
BFull genome assembly of new species
CRNA expression profiling
DMetagenomic community analysis
Explanation
RAD-seq (Restriction site-Associated DNA sequencing) uses restriction enzymes to generate consistent genomic fragments across samples. It's cost-effective for population genetics, genetic mapping, and species without reference genomes, enabling SNP discovery and genotyping simultaneously.
44 Medium
What is the primary purpose of variant annotation?
ATo align reads to the reference genome
BTo determine the biological impact of identified variants
CTo filter out low-quality reads
DTo estimate population allele frequencies
Explanation
Variant annotation determines the functional consequences of variants: where they occur (exonic, intronic, UTR), what type of change (missense, nonsense, splice site), and predicted impact (using SIFT, PolyPhen). This prioritizes variants for follow-up studies.
45 Hard
Calculate N50: Contig lengths are 100, 70, 60, 50, 50, 40, 30 kb. What is the N50?
A70 kb
B50 kb
C60 kb
D100 kb
Explanation
Total = 400 kb. Half = 200 kb. Sort descending: 100, 70, 60, 50... Cumulative: 100 (100), 170 (100+70), 230 (170+60)—exceeds 200 at 60 kb. So N50 = 60 kb.
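
The same procedure, written as a short function (the name `n50` is hypothetical):

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length  # contig that brings the cumulative sum to 50%
    raise ValueError("no contigs given")


result = n50([100, 70, 60, 50, 50, 40, 30])  # lengths in kb
```

`result` is 60, matching the hand calculation.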

46 — Open Calculation
You have a genome of 3 Gbp and want to achieve 30× coverage using 150-bp Illumina reads. How many reads do you need? Show your calculation.
✓ Model Answer

Using the coverage formula: Coverage = (N × L) / G

30× = (N × 150 bp) / 3,000,000,000 bp
N = (30 × 3,000,000,000) / 150
N = 90,000,000,000 / 150
N = 600,000,000 reads

Answer: 600 million reads

47 — Open Calculation
A population has 500 AA individuals, 200 Aa individuals, and 300 aa individuals (total 1000). Calculate allele frequencies and determine if this population is in Hardy-Weinberg equilibrium.
✓ Model Answer

Step 1: Calculate allele frequencies

Total alleles = 1000 × 2 = 2000
A alleles = (500 × 2) + (200 × 1) = 1000 + 200 = 1200
p = freq(A) = 1200 / 2000 = 0.6
q = freq(a) = 1 − 0.6 = 0.4

Step 2: Expected HWE genotype frequencies

AA = p² = 0.36 → 360 individuals
Aa = 2pq = 2 × 0.6 × 0.4 = 0.48 → 480 individuals
aa = q² = 0.16 → 160 individuals

Step 3: Comparison

Observed: 500 AA, 200 Aa, 300 aa
Expected: 360 AA, 480 Aa, 160 aa

Conclusion: This population is NOT in HWE—there is a large excess of homozygotes and deficit of heterozygotes, suggesting inbreeding, selection, or population structure.
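
The comparison can be made formal with a chi-square goodness-of-fit test. This is an optional extension of the model answer, sketched with a hypothetical function name. The test has 1 degree of freedom (3 genotype classes, minus 1, minus 1 estimated allele frequency), for which the 5% critical value is 3.841.

```python
def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Return (p, q, expected counts, chi-square statistic) for a biallelic locus."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele a
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_AA, n_Aa, n_aa]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, q, expected, chi2


CRITICAL_DF1_5PCT = 3.841  # chi-square critical value, 1 df, alpha = 0.05

p, q, expected, chi2 = hwe_chi_square(500, 200, 300)
rejects_hwe = chi2 > CRITICAL_DF1_5PCT
```

For the counts in this question the statistic is about 340, far above 3.841, so HWE is rejected.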

48 — Open Short Answer
Describe the key steps in the NGS variant discovery pipeline, from raw sequencing data to annotated variants. Name the file format and one tool for each step.
✓ Model Answer

1. Quality Control & Trimming: Input FASTQ → FastQC (assessment), Trimmomatic (trimming) → Output cleaned FASTQ. Removes low-quality bases and adapters.

2. Alignment: Input FASTQ → BWA-MEM (alignment) → Output SAM/BAM. Maps reads to reference genome.

3. Variant Calling: Input BAM → GATK (variant calling) → Output VCF. Identifies SNPs and indels.

4. Variant Annotation: Input VCF → Ensembl VEP or SnpEff → Output annotated VCF. Determines functional impact of variants.

5. Filtering & Quality Control: Input VCF → GATK VariantFiltration or bcftools filter → Output filtered VCF. Applies filters for depth, quality, and variant type to obtain high-confidence variant calls.

49 — Open Short Answer
Explain the difference between de novo genome assembly and reference-guided assembly. When would you choose each approach?
✓ Model Answer

De novo assembly: Reconstructs genome from scratch using overlapping reads without a reference. Required when no reference exists (non-model organisms). More computationally intensive and challenging for repetitive genomes.

Reference-guided assembly: Aligns reads to an existing reference genome and builds the sequence from those alignments. More efficient, requires lower coverage, but may miss sample-specific sequences or structural differences that are absent from the reference.

Choose de novo when: No reference genome available, studying novel species, or characterizing unique genomic regions absent from reference.

Choose reference-guided when: Reference exists, studying well-characterized species, or resources are limited (lower coverage needed).

50 — Open Short Answer
What is linkage disequilibrium (LD)? How does it differ from physical linkage, and why is it important for GWAS?
✓ Model Answer

Linkage disequilibrium (LD): The non-random association of alleles at different loci—the tendency of certain allele combinations to be inherited together more (or less) frequently than expected by chance.

Difference from physical linkage: Physical linkage means genes/loci are on the same chromosome. LD describes the statistical association between alleles, which is influenced by physical linkage BUT also by selection, drift, population history, and mutation.

Importance for GWAS: LD enables tag SNP strategies—genotyping a subset of variants (tag SNPs) can capture information about nearby variants in the same LD block. This reduces genotyping costs while maintaining genome-wide coverage. However, detected associations are often indirect—the true causal variant may not be genotyped but is in LD with the tag SNP.

51 — Open Short Answer
Describe the main steps in ChIP-seq and explain how binding sites are identified from the data.
✓ Model Answer

ChIP-seq steps:

1. Crosslink: Formaldehyde crosslinks proteins to DNA in vivo.

2. Fragment: Sonication breaks chromatin into small fragments.

3. Immunoprecipitate: Antibody against the protein of interest pulls down protein-DNA complexes.

4. Reverse crosslinks & purify: Extract and purify the DNA.

5. Sequence: NGS library preparation and sequencing.

Identifying binding sites: Sequence reads are aligned to the genome. Regions with significantly enriched read coverage ("peaks") compared to input control indicate protein binding locations. Peak calling algorithms (MACS, SICER) identify these enriched regions.

52 — Open Short Answer
What are the three main strategies for gene prediction (structural annotation)? Explain each briefly.
✓ Model Answer

1. Ab initio (intrinsic): Uses statistical models trained on known genes to predict features from genomic sequence alone. Advantages: detects novel genes, no external data needed. Limitations: requires species-specific training, moderate accuracy.

2. Homology-based (extrinsic): Compares genome to known genes/proteins in databases. If similarity is found, a gene is predicted. Advantages: leverages conserved sequences. Limitations: cannot detect truly novel genes absent from databases.

3. Combined (hybrid): Integrates both approaches—uses ab initio predictions guided by evidence from RNA-seq, ESTs, or protein data. Most accurate and widely used approach (e.g., AUGUSTUS with evidence).

53 — Open Calculation
A species has a C-value of 2.0 pg. Estimate the genome size in base pairs. If using 60× coverage with 150-bp reads, how many reads are needed?
✓ Model Answer

Step 1: Genome size

Genome size (bp) = C-value × 0.978 × 10⁹
= 2.0 × 0.978 × 10⁹
= 1.956 × 10⁹ bp ≈ 1.96 Gb

Step 2: Number of reads

Coverage = (N × L) / G
60 = (N × 150) / 1,956,000,000
N = (60 × 1,956,000,000) / 150
N = 782,400,000 reads ≈ 782 million reads

Answer: genome size ≈ 1.96 Gb; ≈ 782 million reads

54 — Open Short Answer
Explain the concept of "indirect association" in GWAS. Why is a significant SNP often not the causal variant?
✓ Model Answer

Indirect association: The detected SNP shows statistical association with the trait but is not itself the causal variant—it is correlated with the causal variant through LD.

Why significant SNPs are often not causal: GWAS genotyping arrays use tag SNPs designed to capture genetic variation in LD blocks. When a tag SNP shows association, the signal may reflect the presence of the true causal variant (which was not directly genotyped) due to their correlation. The detected SNP and causal variant are inherited together because recombination hasn't separated them.

Consequence: Post-GWAS fine-mapping is needed to narrow the association signal and identify the actual causal variant(s) for functional studies.

55 — Open Short Answer
Compare RNA-seq and DNA-seq (whole-genome sequencing). What different information does each provide about an organism?
✓ Model Answer

DNA-seq (WGS):

• Sequences the entire genome (all DNA)

• Captures all variant types: SNPs, indels, CNVs, structural variants

• Identifies variants in coding and non-coding regions

• Can determine genotype, population ancestry, evolutionary relationships

• Does not directly measure gene expression or functional activity

RNA-seq:

• Sequences transcribed RNA (the transcriptome)

• Measures which genes are actively expressed and at what levels

• Captures alternative splicing, allele-specific expression, novel transcripts

• Provides functional readouts—shows which variants may affect gene regulation

• Cannot detect variants in non-expressed genes or genomic rearrangements not affecting transcription

Together: Combining DNA and RNA data provides comprehensive understanding—genetic variants (DNA) and their functional consequences (RNA expression).

56 — Open Short Answer
What is population stratification and how does it affect GWAS results? How can it be detected and corrected?
✓ Model Answer

Population stratification: Presence of genetically distinct subgroups within a study population (e.g., different ancestries). These groups differ in both allele frequencies and trait prevalence, creating confounding—association signals may reflect ancestry rather than genuine genotype-phenotype relationships.

Detection: Principal Component Analysis (PCA) or Multidimensional Scaling (MDS) plots of genotype data reveal clustering. The genomic inflation factor (λGC) quantifies statistical inflation—values >1 indicate stratification.

Correction: Include top principal components or MDS dimensions as covariates in the association model. This accounts for genetic ancestry differences. Genomic control can also adjust test statistics. Family-based designs or matching cases/controls by ancestry help prevent stratification from the start.

57 — Open Short Answer
Describe the key differences between Illumina (short-read) and PacBio/Nanopore (long-read) sequencing technologies. What are the advantages and disadvantages of each?
✓ Model Answer

Illumina (short-read):

• Read length: 100-300 bp

• High accuracy (>99.9%)

• Lower cost per base

• Requires clonal amplification of each fragment (bridge amplification on the flow cell)

• Challenges with repetitive regions and structural variants

• Best for: variant calling, RNA-seq, ChIP-seq, population studies

PacBio/Nanopore (long-read):

• Read length: 10 kb to >100 kb

• Lower raw accuracy (85-95% for single-pass reads), though consensus approaches such as PacBio HiFi exceed 99%

• Higher cost per base

• Can sequence without amplification (native DNA)

• Excellent for: genome assembly, structural variants, haplotype phasing, epigenetic detection

Hybrid approaches: Combine short-read accuracy with long-read contiguity for optimal assemblies.