Applied Genomics — Final Comprehensive Exam
Instructions: Answer all questions. For MCQs, select the single best answer. For open questions, provide concise but complete answers.
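Using the coverage formula: Coverage = (N × L) / G, where N is the number of reads, L is the read length, and G is the genome size. Solving for N gives N = (Coverage × G) / L.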
Answer: 600 million reads
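The rearranged formula can be checked with a few lines of Python. This is a minimal sketch: the function name `reads_needed` is hypothetical, and the inputs (30x coverage, a 3 Gb genome, 150 bp reads) are illustrative assumptions, not values taken from the question, although they do reproduce a 600-million-read answer.

```python
import math

def reads_needed(coverage, genome_size_bp, read_length_bp):
    """Return the number of reads N that satisfies Coverage = (N * L) / G."""
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    # Round up: a fractional read cannot be sequenced.
    return math.ceil(coverage * genome_size_bp / read_length_bp)

# Assumed example inputs (not from the exam question).
n = reads_needed(coverage=30, genome_size_bp=3_000_000_000, read_length_bp=150)
print(f"{n:,} reads")  # 600,000,000 reads
```

For paired-end data, each read of a pair counts separately in N, so the number of fragments (read pairs) would be half this value.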
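Step 1: Calculate allele frequencies: p(A) = (2 × 500 + 200) / (2 × 1000) = 0.6; q(a) = 1 − p = 0.4
Step 2: Expected HWE genotype frequencies: p² = 0.36 (AA), 2pq = 0.48 (Aa), q² = 0.16 (aa), each multiplied by N = 1000 individuals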
Step 3: Comparison
Observed: 500 AA, 200 Aa, 300 aa
Expected: 360 AA, 480 Aa, 160 aa
Conclusion: This population is NOT in HWE. Homozygotes are in large excess (800 observed vs. 520 expected) and heterozygotes are deficient (200 observed vs. 480 expected), suggesting inbreeding, selection, or population structure.
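The three steps can be reproduced in code. The sketch below recomputes the allele frequencies and expected counts from the observed genotypes, and adds a chi-square goodness-of-fit statistic as one common way to formalize the comparison. The chi-square test is an addition, not part of the answer above, and the function names are hypothetical.

```python
def hwe_expected(n_AA, n_Aa, n_aa):
    """Return (p, q, expected genotype counts) under Hardy-Weinberg equilibrium."""
    total = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * total)  # frequency of allele A
    q = 1 - p                            # frequency of allele a
    expected = (p * p * total, 2 * p * q * total, q * q * total)
    return p, q, expected

def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over genotype classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = (500, 200, 300)  # AA, Aa, aa counts from the answer above
p, q, expected = hwe_expected(*observed)
stat = chi_square(observed, expected)

print(f"p = {p:.2f}, q = {q:.2f}")
print("expected:", [round(e) for e in expected])
print(f"chi-square = {stat:.1f}")
```

With three genotype classes and one estimated allele frequency, the test has 1 degree of freedom, so the 5% critical value is 3.84. The statistic here (about 340) is far above it, consistent with the conclusion.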
1. Quality Control & Trimming: Input FASTQ → FastQC (assessment), Trimmomatic (trimming) → Output cleaned FASTQ. Removes low-quality bases and adapters.
2. Alignment: Input cleaned FASTQ → BWA-MEM (alignment) → Output SAM, converted to sorted, indexed BAM (e.g., with samtools). Maps reads to reference genome.
3. Variant Calling: Input sorted BAM → GATK HaplotypeCaller (variant calling) → Output VCF. Identifies SNPs and indels.
4. Variant Annotation: Input VCF → Ensembl VEP or SnpEff → Output annotated VCF. Determines functional impact of variants.
5. Filtering & Quality Control: Input annotated VCF → filters on depth, quality, and variant type (e.g., bcftools filter or GATK VariantFiltration) → Output high-confidence variant calls.
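To make the final filtering step concrete, here is a minimal pure-Python sketch that keeps VCF records passing a quality and depth cutoff. It assumes the standard tab-separated VCF column order (QUAL in column 6, INFO in column 8) and a `DP` key in INFO. The thresholds, the function names, and the example records are all illustrative assumptions; a real pipeline would normally use a dedicated tool rather than hand-rolled parsing.

```python
def parse_info(info_field):
    """Turn 'DP=35;AF=0.5' into {'DP': '35', 'AF': '0.5'}; bare flags map to True."""
    info = {}
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info

def passes(vcf_line, min_qual=30.0, min_depth=10):
    """Return True if a VCF data line meets both the QUAL and DP thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    if qual == "." or float(qual) < min_qual:
        return False
    depth = parse_info(fields[7]).get("DP")
    # Records without a numeric DP value are treated as failing.
    return isinstance(depth, str) and int(depth) >= min_depth

def filter_vcf(lines, **thresholds):
    """Yield header lines unchanged and only those data lines that pass."""
    for line in lines:
        if line.startswith("#") or passes(line, **thresholds):
            yield line

# Made-up example records (hypothetical data).
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t50.0\t.\tDP=35",   # passes both filters
    "chr1\t2000\t.\tC\tT\t12.0\t.\tDP=40",   # fails QUAL
    "chr1\t3000\t.\tG\tGA\t60.0\t.\tDP=4",   # fails depth
]
kept = [line for line in filter_vcf(records) if not line.startswith("#")]
print(kept)
```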
De novo assembly: Reconstructs genome from scratch using overlapping reads without a reference. Required when no reference exists (non-model organisms). More computationally intensive and challenging for repetitive genomes.
Reference-guided assembly: Aligns reads to an existing reference genome. More efficient and requires lower coverage, but is biased toward the reference and may miss sequence or structural differences that the reference lacks.
Choose de novo when: No reference genome available, studying novel species, or characterizing unique genomic regions absent from reference.
Choose reference-guided when: Reference exists, studying well-characterized species, or resources are limited (lower coverage needed).
Linkage disequilibrium (LD): The non-random association of alleles at different loci—the tendency of certain allele combinations to be inherited together more (or less) frequently than expected by chance.
Difference from physical linkage: Physical linkage means genes/loci are on the same chromosome. LD describes the statistical association between alleles, which is influenced by physical linkage BUT also by selection, drift, population history, and mutation.
Importance for GWAS: LD enables tag SNP strategies—genotyping a subset of variants (tag SNPs) can capture information about nearby variants in the same LD block. This reduces genotyping costs while maintaining genome-wide coverage. However, detected associations are often indirect—the true causal variant may not be genotyped but is in LD with the tag SNP.
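LD between two biallelic loci is commonly quantified with D (the deviation of the observed haplotype frequency from the product of the allele frequencies) and r² (D² scaled by the allele frequencies). The sketch below computes both from haplotype counts. The function name and the example counts are hypothetical.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r^2) for two biallelic loci from the four haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / total    # frequency of allele B at locus 2
    d = p_AB - p_A * p_B           # zero when alleles associate at random
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denom

# Made-up counts: AB and ab haplotypes are over-represented.
d, r2 = ld_stats(n_AB=40, n_Ab=10, n_aB=10, n_ab=40)
print(f"D = {d:.2f}, r^2 = {r2:.2f}")  # D = 0.15, r^2 = 0.36
```

A tag SNP with high r² to an ungenotyped variant captures most of its association signal, which is why r² thresholds are typically used when selecting tag SNPs.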
ChIP-seq steps:
1. Crosslink: Formaldehyde crosslinks proteins to DNA in vivo.
2. Fragment: Sonication breaks chromatin into small fragments.
3. Immunoprecipitate: Antibody against the protein of interest pulls down protein-DNA complexes.
4. Reverse crosslinks & purify: Extract and purify the DNA.
5. Sequence: NGS library preparation and sequencing.
Identifying binding sites: Sequence reads are aligned to the genome. Regions with significantly enriched read coverage ("peaks") compared to input control indicate protein binding locations. Peak calling algorithms (MACS, SICER) identify these enriched regions.
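The idea of enrichment over input can be illustrated with a toy window scan. This is a deliberately simplified sketch and not the algorithm used by MACS or SICER, which model background statistically. It assumes reads have already been counted in fixed-size windows and that the control has been scaled to the same sequencing depth as the ChIP sample. The function name, thresholds, and counts are hypothetical.

```python
def call_peaks(chip, control, window=200, min_fold=4.0, pseudocount=1.0):
    """Return (start, end) intervals in bp where ChIP/control fold enrichment >= min_fold.

    chip and control are per-window read counts of equal length.
    Adjacent enriched windows are merged into a single interval.
    """
    peaks = []
    start = None
    for i, (c, k) in enumerate(zip(chip, control)):
        # The pseudocount avoids division by zero in empty control windows.
        fold = (c + pseudocount) / (k + pseudocount)
        if fold >= min_fold:
            if start is None:
                start = i                      # open a new enriched region
        elif start is not None:
            peaks.append((start * window, i * window))
            start = None                       # close the region
    if start is not None:                      # region runs to the last window
        peaks.append((start * window, len(chip) * window))
    return peaks

# Made-up counts for ten 200 bp windows.
chip    = [5, 6, 4, 60, 80, 55, 5, 6, 4, 5]
control = [5, 5, 5, 6, 5, 5, 5, 5, 5, 5]
print(call_peaks(chip, control))  # [(600, 1200)]
```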
1. Ab initio (intrinsic): Uses statistical models trained on known genes to predict features from genomic sequence alone. Advantages: detects novel genes, no external data needed. Limitations: requires species-specific training, moderate accuracy.
2. Homology-based (extrinsic): Compares genome to known genes/proteins in databases. If similarity is found, a gene is predicted. Advantages: leverages conserved sequences. Limitations: cannot detect truly novel genes absent from databases.
3. Combined (hybrid): Integrates both approaches by guiding ab initio predictions with evidence from RNA-seq, ESTs, or protein data. Most accurate and widely used approach (e.g., AUGUSTUS run with extrinsic hints).
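As a rough illustration of what "from genomic sequence alone" means, the sketch below scans the forward strand for open reading frames (ATG to an in-frame stop codon). This is far simpler than the trained statistical models real ab initio predictors use, and it ignores introns and the reverse strand. The function name, length cutoff, and sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return 0-based half-open (start, end) coordinates of forward-strand ORFs.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon.
    min_codons counts every codon in the ORF, including the start and the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                           # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))     # include the stop codon
                start = None
    return sorted(orfs)

# Made-up sequence with one ORF in the third reading frame.
seq = "CCATGAAATTTGGGTAACC"
for s, e in find_orfs(seq):
    print(s, e, seq[s:e])  # 2 17 ATGAAATTTGGGTAA
```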
Step 1: Genome size
Step 2: Number of reads
Indirect association: The detected SNP shows statistical association with the trait but is not itself the causal variant—it is correlated with the causal variant through LD.
Why significant SNPs are often not causal: GWAS genotyping arrays use tag SNPs designed to capture genetic variation in LD blocks. When a tag SNP shows association, the signal may reflect the presence of the true causal variant (which was not directly genotyped) due to their correlation. The detected SNP and causal variant are inherited together because recombination hasn't separated them.
Consequence: Post-GWAS fine-mapping is needed to narrow the association signal and identify the actual causal variant(s) for functional studies.
DNA-seq (WGS):
• Sequences the entire genome (all DNA)
• Captures all variant types: SNPs, indels, CNVs, structural variants
• Identifies variants in coding and non-coding regions
• Can determine genotype, population ancestry, evolutionary relationships
• Does not directly measure gene expression or functional activity
RNA-seq:
• Sequences transcribed RNA (the transcriptome)
• Measures which genes are actively expressed and at what levels
• Captures alternative splicing, allele-specific expression, novel transcripts
• Provides functional readouts—shows which variants may affect gene regulation
• Cannot detect variants in non-expressed genes or genomic rearrangements not affecting transcription
Together: Combining DNA and RNA data provides comprehensive understanding—genetic variants (DNA) and their functional consequences (RNA expression).
Population stratification: Presence of genetically distinct subgroups within a study population (e.g., different ancestries). These groups differ in both allele frequencies and trait prevalence, creating confounding—association signals may reflect ancestry rather than genuine genotype-phenotype relationships.
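Detection: Principal Component Analysis (PCA) or Multidimensional Scaling (MDS) plots of genotype data reveal clustering by ancestry. The genomic inflation factor (λGC), the ratio of the observed median association test statistic to its expected median under the null, quantifies statistical inflation: values near 1 indicate little inflation, while values clearly above 1 suggest stratification (or other causes such as cryptic relatedness or polygenicity).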
Correction: Include top principal components or MDS dimensions as covariates in the association model. This accounts for genetic ancestry differences. Genomic control can also adjust test statistics. Family-based designs or matching cases/controls by ancestry help prevent stratification from the start.
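The genomic inflation factor and the genomic control adjustment are simple to compute. The sketch below assumes 1-degree-of-freedom chi-square association statistics, whose median under the null is about 0.4549. The function names and the example statistics are hypothetical, and leaving statistics unadjusted when λGC is below 1 is a common convention assumed here rather than something stated above.

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(chi2_stats):
    """lambda_GC: observed median test statistic over the expected null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN

def genomic_control(chi2_stats):
    """Divide every statistic by lambda_GC (left unchanged if lambda_GC < 1)."""
    lam = max(genomic_inflation(chi2_stats), 1.0)
    return [x / lam for x in chi2_stats]

# Made-up association statistics for five SNPs.
stats = [0.1, 0.3, 0.5, 0.9, 2.0]
lam = genomic_inflation(stats)
corrected = genomic_control(stats)
print(f"lambda_GC = {lam:.3f}")
print(f"median after correction = {median(corrected):.4f}")
```

After correction the median statistic equals the null median by construction. Genomic control rescales all SNPs uniformly, so it cannot remove stratification that affects some SNPs more than others, which is one reason principal components are usually included as covariates as well.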
Illumina (short-read):
• Read length: 100-300 bp
• High accuracy (>99.9%)
• Lower cost per base
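• Requires clonal amplification of each fragment on the flow cell (bridge amplification); PCR during library preparation is common but optional (PCR-free kits exist)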
• Challenges with repetitive regions and structural variants
• Best for: variant calling, RNA-seq, ChIP-seq, population studies
PacBio/Nanopore (long-read):
• Read length: 10 kb to >100 kb
• Lower raw accuracy (85-95% for single-pass reads) but improving; PacBio HiFi consensus reads exceed 99%
• Higher cost per base
• Can sequence without amplification (native DNA)
• Excellent for: genome assembly, structural variants, haplotype phasing, epigenetic detection
Hybrid approaches: Combine short-read accuracy with long-read contiguity for optimal assemblies.