Applied Genomics — Final Exam Simulation
0|1 indicates:
Yes, 30× exceeds the recommended minimum of ~10× for robust SNP detection. This depth provides high confidence for variant calling and genotyping.
The N50 is found by sorting contigs from longest to shortest and summing their lengths: it is the length of the contig that, when added, causes the cumulative sum to reach or cross 50% of the total assembly size. Note: N50 measures contiguity, not correctness; a high N50 does not guarantee an error-free assembly.
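A minimal sketch of the N50 calculation, assuming contigs are processed from longest to shortest. The contig lengths in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            # This contig pushes the cumulative sum to 50% of the assembly.
            return length


# Hypothetical assembly: total 300 bp, so the 50% mark is 150 bp.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))  # 80 + 70 = 150, so N50 = 70
```

The function says nothing about whether the contigs are correct, which matches the caveat above: a misassembly that joins two contigs would raise this number.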
Bisulfite sequencing works by treating genomic DNA with sodium bisulfite, which converts unmethylated cytosines to uracil (read as thymine after PCR amplification). Methylated cytosines (5-methylcytosine) are protected from this conversion and remain as C.
After sequencing, the reads are aligned to the reference genome. At each cytosine position: if the read shows C → the position was methylated; if it shows T → the position was unmethylated. This provides single-base resolution methylation mapping.
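The read-level rule above is usually summarized per site as the fraction of reads that still show C. The sketch below assumes C and T read counts have already been tallied per cytosine position; the site names and counts are hypothetical.

```python
def methylation_level(c_reads, t_reads):
    """Fraction of reads supporting methylation at one cytosine position.

    c_reads: reads showing C (protected from conversion, i.e. methylated)
    t_reads: reads showing T (converted, i.e. unmethylated)
    Returns None when the site has no coverage.
    """
    total = c_reads + t_reads
    if total == 0:
        return None
    return c_reads / total


# Hypothetical per-site counts: (reads with C, reads with T)
sites = {
    "chr1:1050": (18, 2),   # mostly methylated
    "chr1:1073": (1, 19),   # mostly unmethylated
    "chr1:1102": (0, 0),    # no coverage
}
for site, (c, t) in sites.items():
    print(site, methylation_level(c, t))
```

This simple ratio does not separate a true C→T SNP from bisulfite conversion; a site where a genuine SNP is present would look unmethylated here.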
A major analytical challenge is distinguishing bisulfite-induced C→T conversions from genuine C→T SNPs in the genome. In human somatic cells, about 98% of methylation occurs at CpG dinucleotides. CpG islands (regions dense in CpG sites) near gene promoters are of particular interest as their methylation status often regulates gene expression.
1. C-value (Flow Cytometry): The C-value is the amount of DNA in picograms (pg) in a haploid genome. It is measured using flow cytometry or Feulgen densitometry, typically by comparing staining intensity to a reference species with known genome size. The conversion formula is: Genome size (bp) = C-value (pg) × 0.978 × 10⁹. For example, a C-value of 2.0 pg gives approximately 1.96 Gbp.
2. K-mer frequency analysis: After sequencing, reads are decomposed into K-mers and their frequency distribution is plotted. The genome size is estimated by: Genome size = Total number of K-mers (area under the curve) / Average K-mer coverage (position of the main peak). The distribution typically shows three features: a left peak of low-frequency K-mers (sequencing errors), a main peak (true genomic K-mers at average coverage), and a right tail of high-frequency K-mers (repetitive regions).
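The C-value conversion from method 1 can be checked with a few lines of Python. This is only a sketch of the formula given above; the function and constant names are illustrative.

```python
# Conversion factor from the formula above: 1 pg of DNA is about 0.978 x 10^9 bp.
BP_PER_PG = 0.978e9


def c_value_to_bp(c_value_pg):
    """Convert a haploid C-value in picograms to genome size in base pairs."""
    return c_value_pg * BP_PER_PG


genome_size = c_value_to_bp(2.0)
print(f"{genome_size / 1e9:.2f} Gbp")  # 1.96 Gbp, matching the example in the text
```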
A Manhattan plot is the standard visualization of GWAS results. It displays all tested SNPs across the genome:
X-axis: Genomic position — SNPs are plotted by their physical location, ordered by chromosome. Each chromosome is shown in a different color.
Y-axis: −log₁₀(p-value) — the negative log-transformed p-value of each SNP-trait association. This transformation makes more significant associations appear as taller points (a p-value of 10⁻⁸ appears as 8 on the Y-axis).
Significance threshold: A horizontal line at −log₁₀(5 × 10⁻⁸) ≈ 7.3 marks the genome-wide significance threshold. SNPs above this line are considered significantly associated.
Interpretation: True associations appear as "peaks" — clusters of linked SNPs (in LD) rising above the background. The peak shape reflects LD structure: the top SNP has the strongest signal, and nearby correlated SNPs form a hill. An isolated single SNP above the threshold (without supporting nearby SNPs) is suspicious and may be a false positive due to genotyping errors.
[Drawing: a scatter plot with chromosomes along the X-axis separated by alternating colors, dots scattered at low Y values (1–4), with one or more sharp peaks exceeding the horizontal significance line around Y = 7.3]
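The Y-axis transformation and the significance line can be reproduced numerically. The sketch below assumes a small list of hypothetical SNP records (name, chromosome, position, p-value); it computes the values a Manhattan plot would display rather than drawing the plot.

```python
import math

GENOME_WIDE_ALPHA = 5e-8  # genome-wide significance threshold from the text


def neg_log10(p_value):
    """Y-axis value on a Manhattan plot."""
    return -math.log10(p_value)


threshold = neg_log10(GENOME_WIDE_ALPHA)

# Hypothetical results: (SNP name, chromosome, position in bp, p-value)
snps = [
    ("snp_a", 1, 1_200_000, 3e-3),
    ("snp_b", 2, 5_400_000, 1e-8),
    ("snp_c", 2, 5_410_000, 4e-9),
    ("snp_d", 7, 9_900_000, 6e-8),
]

# Keep only SNPs that rise above the horizontal significance line.
significant = [name for name, _chrom, _pos, p in snps if neg_log10(p) > threshold]

print(round(threshold, 2))  # 7.3
print(significant)          # ['snp_b', 'snp_c']
```

Note that snp_d (p = 6 × 10⁻⁸) falls just below the line, and that snp_b and snp_c sit close together on chromosome 2, which is the kind of supporting cluster described under Interpretation.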
With 1 degree of freedom, the critical value at α = 0.05 is 3.84. Since 340.28 >> 3.84, the population is not in Hardy-Weinberg Equilibrium. There is a large excess of homozygotes and a deficit of heterozygotes, suggesting non-random mating, selection, or population substructure.
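For reference, the χ² statistic for a biallelic locus can be computed as in the sketch below. The genotype counts are hypothetical and are not the data from this exam question (which gave 340.28); they only illustrate the same pattern of excess homozygotes.

```python
CRITICAL_VALUE_DF1 = 3.84  # chi-square critical value, 1 df, alpha = 0.05


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for Hardy-Weinberg equilibrium at a biallelic locus."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    if p in (0.0, 1.0):
        raise ValueError("locus is monomorphic; the test is undefined")
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: AA = 400, AB = 200, BB = 400 (expected 250 / 500 / 250)
chi2 = hwe_chi_square(400, 200, 400)
print(chi2)                       # 360.0
print(chi2 > CRITICAL_VALUE_DF1)  # True: reject HWE
```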
Purpose: ChIP-seq identifies genome-wide binding sites of proteins (transcription factors, histones, etc.) to DNA.
Steps:
1. Crosslinking: Formaldehyde covalently links proteins to the DNA they are bound to in vivo.
2. Fragmentation: Chromatin is sheared into small fragments (~200–500 bp) by sonication or enzymatic digestion.
3. Immunoprecipitation: An antibody specific to the target protein pulls down protein-DNA complexes.
4. Reverse crosslinking & purification: The crosslinks are reversed and DNA is purified.
5. Sequencing: The enriched DNA fragments are sequenced using NGS.
Identifying binding sites: Reads are aligned to the reference genome. Regions with significantly more reads than the background (input control) form "peaks." Peak-calling algorithms (e.g., MACS2) identify these enriched regions as binding sites. The height and shape of peaks indicate binding strength and precision.
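The idea of "more reads than the input control" can be illustrated with a toy fold-enrichment calculation over fixed genomic bins. This is a deliberately simplified sketch and not the statistical model MACS2 uses; the bin counts, the pseudocount, and the fold threshold are all assumptions made for illustration.

```python
def enriched_bins(chip_counts, input_counts, min_fold=2.0, pseudocount=1.0):
    """Return indices of bins where ChIP signal exceeds the scaled input control."""
    input_total = sum(input_counts)
    if input_total == 0:
        raise ValueError("input control has no reads")
    # Scale the input control to the ChIP library size before comparing.
    scale = sum(chip_counts) / input_total
    hits = []
    for index, (chip, control) in enumerate(zip(chip_counts, input_counts)):
        # The pseudocount avoids division by zero in empty control bins.
        fold = (chip + pseudocount) / (control * scale + pseudocount)
        if fold >= min_fold:
            hits.append(index)
    return hits


# Hypothetical read counts per bin; bins 2 and 3 contain a binding site.
chip = [5, 6, 60, 55, 4, 5, 6, 5, 4, 5]
control = [5, 5, 6, 5, 4, 5, 6, 5, 4, 5]
print(enriched_bins(chip, control))  # [2, 3]
```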
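Number of reads = (coverage × genome size) / read length = (60 × 1.2 × 10⁹ bp) / 150 bp = 4.8 × 10⁸. You need approximately 480 million reads of 150 bp each to achieve 60× coverage of a 1.2 Gbp genome.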
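The read-count answer follows from coverage = (number of reads × read length) / genome size, solved for the number of reads. A short sketch of that calculation, with an illustrative function name:

```python
import math


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


print(reads_needed(1_200_000_000, 60, 150))  # 480000000
```

For paired-end sequencing this is the total number of reads, so the number of read pairs would be half of it.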
The "Per base sequence quality" module shows quality score distributions at each position along the read. At each position, a boxplot displays:
- The median quality score (central line)
- The interquartile range (IQR, the box: 25th–75th percentile)
- The 10th and 90th percentiles (whiskers)
- The mean quality (blue line)
The background is color-coded: green (good, Q ≥ 28), orange (acceptable, Q 20–28), and red (poor, Q < 20).
When to trim: Trimming should be applied when quality scores drop into the orange or red zones, which typically occurs toward the 3' end of reads. A sliding window approach (e.g., with Trimmomatic) calculates the average quality within a window and cuts the read at the point where that average falls below a threshold (e.g., Q20). After trimming, FastQC should be re-run to confirm improvement. Reads that end up shorter than a minimum length (e.g., 25 bp) should be discarded entirely.
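The sliding window idea can be sketched in a few lines. This is a simplified illustration, not Trimmomatic's implementation, and its exact cut point may differ from the tool's; the window size of 4 and the example quality values are assumptions.

```python
def sliding_window_cut(qualities, window=4, min_avg_q=20):
    """Return how many bases to keep, scanning from the 5' end.

    The read is cut at the start of the first window whose mean quality
    falls below min_avg_q. Reads shorter than the window are left untrimmed.
    """
    for start in range(len(qualities) - window + 1):
        if sum(qualities[start:start + window]) / window < min_avg_q:
            return start
    return len(qualities)


def trim_read(sequence, qualities, window=4, min_avg_q=20, min_length=25):
    """Trim one read; return None if the result is shorter than min_length."""
    keep = sliding_window_cut(qualities, window, min_avg_q)
    if keep < min_length:
        return None  # discard the read entirely
    return sequence[:keep], qualities[:keep]


# Hypothetical 36 bp read whose quality decays toward the 3' end.
example_quals = [38] * 30 + [30, 25, 18, 12, 8, 5]
example_seq = "A" * 36
trimmed = trim_read(example_seq, example_quals)
print(len(trimmed[0]))  # 31 bases kept
```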
A K-mer frequency distribution plots K-mer frequency (X-axis) against the number of distinct K-mers at that frequency (Y-axis). Three main regions are visible:
1. Left peak (low frequencies, e.g., 1–5×): Represents K-mers caused by sequencing errors. Errors create unique, erroneous K-mers that appear only once or a few times. These should be discarded before assembly.
2. Main peak (moderate frequency): Represents true genomic K-mers. The position of this peak corresponds to the average sequencing depth. For example, a peak at 30× means each genomic K-mer was sequenced approximately 30 times.
3. Right tail (high frequencies, extending well beyond the main peak): Represents K-mers from repetitive regions. Repeats occur multiple times in the genome, so their K-mers appear at multiples of the average coverage. A prominent right tail indicates high repeat content, which will complicate assembly.
Genome size is estimated by: Total K-mers (area under the curve, i.e. the sum of frequency × number of distinct K-mers, excluding the error peak) / Main peak position.
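A minimal sketch of this estimate, assuming the histogram is available as a mapping from K-mer frequency to the number of distinct K-mers at that frequency. The histogram values and the error cutoff are hypothetical, and the total is taken as the sum of frequency × count over the non-error part of the curve.

```python
def estimate_genome_size(histogram, error_cutoff):
    """Estimate genome size from a K-mer frequency histogram.

    histogram: dict mapping frequency -> number of distinct K-mers
    error_cutoff: frequencies at or below this value are treated as errors
    Returns (estimated genome size, main peak position).
    """
    solid = {freq: n for freq, n in histogram.items() if freq > error_cutoff}
    if not solid:
        raise ValueError("no K-mers above the error cutoff")
    # Main peak: the frequency shared by the most distinct K-mers.
    peak = max(solid, key=solid.get)
    # Total K-mer occurrences, excluding the error peak.
    total_kmers = sum(freq * n for freq, n in solid.items())
    return total_kmers / peak, peak


# Hypothetical histogram: error peak at 1-3x, main peak at 30x, repeats at 60x.
hist = {1: 5000, 2: 800, 3: 100,
        28: 200, 29: 600, 30: 1000, 31: 600, 32: 200,
        60: 50}
size, peak = estimate_genome_size(hist, error_cutoff=5)
print(peak, size)  # 30 2700.0
```

The 50 distinct K-mers at 60× contribute twice as much per K-mer as those at the main peak, which is how repeats are counted toward genome size.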
1. Quality Control & Trimming: Input: FASTQ → Tool: FastQC (QC), Trimmomatic (trimming) → Output: cleaned FASTQ. Assess per-base quality, GC content, duplications. Trim low-quality ends and remove short reads.
2. Alignment: Input: cleaned FASTQ + reference FASTA → Tool: BWA-MEM → Output: SAM/BAM. Map reads to the reference genome. Post-alignment: sort, index, and mark or remove PCR duplicates (Picard MarkDuplicates). Filter by mapping quality (MAPQ).
3. Variant Calling: Input: filtered BAM → Tool: GATK HaplotypeCaller → Output: VCF. Identify SNPs and indels at each position. Joint calling across samples is preferred for population studies.
4. Variant Annotation: Input: VCF → Tool: Ensembl VEP or SnpEff → Output: annotated VCF. Determine effect of each variant (missense, synonymous, intronic, splice site, TFBS gain/loss). Provide functional impact predictions (SIFT, PolyPhen-2).
Quality control and filtering occur between every step — this iterative QC cycle is essential for reliable results.
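As one concrete example of filtering between steps, the post-alignment filter in step 2 can be sketched on plain SAM text. In SAM, the FLAG is the second column and MAPQ the fifth; flag bit 0x4 marks an unmapped read and 0x400 a PCR or optical duplicate. The MAPQ cutoff of 20 and the example records are assumptions for illustration; in practice this filtering is done with dedicated tools rather than hand-written code.

```python
FLAG_UNMAPPED = 0x4
FLAG_DUPLICATE = 0x400


def passes_filter(sam_line, min_mapq=20):
    """Return True if an alignment record should be kept."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    if flag & FLAG_UNMAPPED or flag & FLAG_DUPLICATE:
        return False
    return mapq >= min_mapq


# Hypothetical SAM records (11 mandatory tab-separated fields each).
records = [
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchr1\t150\t5\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",     # low MAPQ
    "read3\t1024\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",  # duplicate
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",             # unmapped
]
kept = [line.split("\t")[0] for line in records if passes_filter(line)]
print(kept)  # ['read1']
```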
GWAS uses tag SNPs on genotyping arrays, which are representative markers for LD blocks. The detected SNP is typically in LD with the true causal variant rather than being causal itself — this is called indirect association. The causal variant may not be on the array at all.
Post-GWAS steps:
1. Fine-mapping: Examine LD structure (r², D′) around the peak to narrow the candidate region and prioritize variants most likely to be causal.
2. Gene annotation: Use BEDTools intersect to identify genes within a defined window (e.g., 0.5 Mb) of the top SNPs. Consult databases like GeneCards and the GWAS Catalog.
3. Functional enrichment: Apply ORA (DAVID, EnrichR) to test whether candidate genes are enriched for specific pathways.
4. Replication: Validate findings in an independent cohort with the same phenotype definition.
5. Functional validation: Experimental studies (gene expression analysis, knockouts, reporter assays) to confirm the causal role of the candidate variant.
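The gene annotation step amounts to an interval overlap query around each top SNP. The sketch below shows the logic in plain Python rather than BEDTools; the gene names and coordinates are hypothetical, and the 0.5 Mb window is the example value from step 2.

```python
def genes_near_snp(snp_chrom, snp_pos, genes, window_bp=500_000):
    """Return names of genes overlapping a window around a SNP.

    genes: iterable of (chromosome, start, end, name) tuples
    """
    window_start = max(0, snp_pos - window_bp)
    window_end = snp_pos + window_bp
    return [
        name
        for chrom, start, end, name in genes
        # Two intervals overlap when each starts before the other ends.
        if chrom == snp_chrom and start <= window_end and end >= window_start
    ]


# Hypothetical gene annotation: (chromosome, start, end, name)
annotation = [
    ("chr2", 1_000_000, 1_050_000, "GENE_A"),
    ("chr2", 1_900_000, 1_950_000, "GENE_B"),
    ("chr2", 3_000_000, 3_100_000, "GENE_C"),  # too far from the SNP
    ("chr3", 1_400_000, 1_450_000, "GENE_D"),  # different chromosome
]
candidates = genes_near_snp("chr2", 1_500_000, annotation)
print(candidates)  # ['GENE_A', 'GENE_B']
```

The resulting candidate list is what would be passed on to the functional enrichment step.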