📝 Exam Simulation — Version B
Using the coverage formula rearranged to solve for number of reads:
So approximately 747 million reads of 150 bp are needed to achieve 40× coverage of a 2.8 Gbp genome.
Step 1: Contigs are already sorted from largest to smallest: 150, 100, 85, 65, 55, 35, 20, 15, 10 kb.
Step 2: Cumulative sum from largest:
N50 = 85 kb — the contig that crosses the 50% cumulative threshold.
K-mers: Substrings of fixed length k extracted by sliding a window across each read. For example, the sequence ATGCG with k=3 produces: ATG, TGC, GCG.
Graph construction: (1) Extract all k-mers from reads and retain unique ones. (2) Vertices represent (k−1)-mers (prefixes and suffixes of k-mers). (3) Edges represent k-mers — each k-mer connects its prefix vertex to its suffix vertex with a directed edge.
Sequence reconstruction: The graph is traversed using an Eulerian path, which visits every edge exactly once. This requires at most two semibalanced vertices (|in-degree − out-degree| = 1). The sequence is reconstructed by concatenating the vertices along the path.
Challenges: Sequencing errors create false k-mers; repetitive elements cause ambiguous paths (multiple valid Eulerian paths). Using multiple k-mer sizes and paired-end data helps resolve these issues.
Definition: Pool sequencing involves mixing equimolar DNA from multiple individuals into a single pool and sequencing the pool together, rather than sequencing each individual separately.
How it works: DNA is extracted from each individual, quantified, and combined in equal amounts (equimolar pooling). The pooled DNA is then used for library preparation and sequenced. Reads map randomly to the reference genome. At any polymorphic position, the proportion of reads carrying each allele approximates the allele frequency in the pooled population.
Information provided: Pool-seq gives population-level allele frequencies at each variant position, not individual genotypes. This allows detection of variants, estimation of allele frequencies, and comparison of frequencies between populations (e.g., using FST).
Cost-effectiveness: Instead of sequencing N individuals separately (cost = N × per-sample cost), only one pooled library is sequenced. For example, comparing two populations of 50 individuals each requires 2 sequencing runs instead of 100. This dramatically reduces cost while preserving population-level variant information.
What is sequenced: RNA (transcripts) is extracted, converted to cDNA, and sequenced. RNA-seq captures the transcriptome — all RNA molecules expressed at the time of sampling.
Library preparation strategies:
(1) Poly-A selection: Mature mRNAs have a poly-A tail. Probes with poly-T sequences capture these mRNAs specifically. This enriches for protein-coding transcripts and excludes rRNA/tRNA.
(2) Random priming: RNA is fragmented and random hexamer primers are used for cDNA synthesis. This captures a broader range of RNAs but may include unwanted rRNA.
Applications:
(1) Gene expression quantification: Comparing transcript abundance between conditions (e.g., healthy vs. diseased tissue) to identify differentially expressed genes.
(2) Genome annotation: RNA-seq data provides evidence of transcribed regions, helping identify gene structures, exon boundaries, and transcript isoforms (alternative splicing) during functional annotation of a new genome assembly.
Step 1 — Allele frequencies:
Step 2 — Expected genotype frequencies and counts:
Step 3 — Chi-squared test:
Conclusion: χ² = 7.83 > 3.84 → Reject H₀. The population is NOT in Hardy–Weinberg equilibrium. There is an excess of heterozygotes compared to expectations, which could indicate balancing selection or recent admixture.
WES concept: Whole exome sequencing targets only the protein-coding regions (exons) of the genome, which constitute approximately 2% of the human genome (~45 Mb vs. ~3 Gb).
Hybridization capture method: (1) Genomic DNA is fragmented. (2) Biotinylated probes complementary to all known exonic sequences are hybridized to the fragments. (3) Streptavidin-coated beads capture biotinylated probe–fragment complexes. (4) Non-exonic fragments are washed away. (5) Captured exonic fragments are eluted and sequenced.
Cost-effectiveness: By sequencing only ~2% of the genome, WES generates much smaller FASTQ files (~45 Gb vs. ~90 Gb for WGS), requires less sequencing output, and enables faster bioinformatic analysis. The trade-off is that regulatory variants in non-coding regions (introns, intergenic regions) are missed. WES is ideal when the hypothesis is that causal variants alter protein-coding sequences.
A k-mer frequency distribution plots k-mer multiplicity (x-axis) against the number of distinct k-mers with that multiplicity (y-axis).
Region 1 — Left peak (frequency = 1): K-mers appearing only once. These are mostly derived from sequencing errors — a single nucleotide error creates a k-mer unique to that read. This peak should be excluded when estimating genome size.
Region 2 — Main peak (central): K-mers appearing at the expected coverage depth. These represent unique genomic sequences. The position of this peak corresponds to the average sequencing depth. Genome size can be estimated as: G = (total k-mers under the curve) / (mean coverage from the main peak).
Region 3 — Right tail (high frequency): K-mers appearing much more frequently than average. These derive from repetitive elements (SINEs, LINEs, transposons) that are present in many copies throughout the genome. The height and extent of this tail reflects the repeat content of the genome.
The average depth of coverage is 20×, meaning each position in the genome is covered by ~20 reads on average.
Population stratification: When a GWAS sample includes individuals from genetically distinct subpopulations that also differ in phenotype, allele frequency differences between subpopulations are confounded with phenotype differences — creating false positive associations.
Example: If wild mice (low body weight, genotype profile A) and laboratory strains (high body weight, genotype profile B) are analyzed together, nearly every SNP differing between strains appears associated with body weight — not because those SNPs cause weight differences, but because they track population membership.
MDS/PCA solution: MDS or PCA compresses genome-wide genotype data into a few principal components that capture population structure. Each individual gets component scores. Individuals from the same subpopulation cluster together. These components are then included as covariates in the GWAS statistical model to correct for ancestry differences. After correction, only true genotype–phenotype associations remain significant.
Verification: The QQ plot and λGC metric are used to confirm that stratification has been properly corrected (λ ≈ 1 indicates no inflation).
Step 1 — Quality Control: Raw reads (FASTQ) → Quality assessment with FastQC → Evaluate per-base quality, adapter content, GC distribution.
Step 2 — Trimming: FASTQ → Trimmed FASTQ using Trimmomatic or fastp → Remove low-quality bases and adapter sequences. Re-run FastQC to confirm improvement.
Step 3 — Alignment: Trimmed FASTQ + Reference genome (FASTA) → SAM file using BWA (Burrows-Wheeler Aligner) → Reads are mapped to the reference genome. Convert SAM → BAM (binary, compressed) and sort.
Step 4 — Duplicate Removal: Sorted BAM → Deduplicated BAM using Picard MarkDuplicates → Remove PCR duplicates that could bias variant calling.
Step 5 — Variant Calling: Deduplicated BAM → VCF file using GATK HaplotypeCaller → Identify SNPs and indels. Each variant line includes chromosome, position, REF, ALT, quality score, and sample genotypes.
Step 6 — Variant Annotation: VCF → Annotated VCF using VEP or SnpEff → Each variant is annotated with its predicted biological effect (synonymous, missense, stop-gain, splice-site, etc.) and compared to known variant databases (e.g., dbSNP).
Axes: The x-axis represents genomic position, with SNPs ordered along each chromosome (chromosomes are shown in alternating colors). The y-axis represents −log₁₀(p-value) from the association test between each SNP and the phenotype. Higher values mean stronger statistical significance.
Construction: For each genotyped SNP, a statistical test (e.g., linear regression with additive model: phenotype ~ genotype + covariates) produces a p-value. Each SNP is plotted as a dot at its chromosomal position (x) and its −log₁₀(p) value (y).
Peaks: A "peak" or cluster of highly significant SNPs indicates a genomic region associated with the trait. The pyramid shape occurs because the causal variant and its neighbors in strong LD all show elevated significance, tapering off as LD decays with distance. The peak width reflects the extent of LD in that region.
Significance threshold: A horizontal line marks the genome-wide significance threshold. Using Bonferroni correction: α/number_of_tests. The conventional threshold is 5 × 10⁻⁸ (−log₁₀ ≈ 7.3), which accounts for approximately 1 million independent tests across the genome. SNPs above this line are considered genome-wide significant associations.
Important note: Associated SNPs are usually tag SNPs in LD with the true causal variant — they are not necessarily the causal mutation itself. Fine-mapping is needed to identify the actual causal variant within the associated region.