NGS Data Analysis — Exam Practice
(a) Sequencing depth: Depth = (N × L) / G, where N = number of reads, L = read length, and G = genome size; with the values given, this works out to 33.3×.
(b) Yes, 33.3× exceeds the recommended minimum of ~10× for robust SNP detection. This depth provides high confidence for variant calling.
(c) Phred score for 99.99% accuracy: Q = −10 log₁₀(P_error); for 99.99% accuracy, P_error = 0.0001, so Q = 40 (Q40).
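To sanity-check (a) and (c) numerically, a minimal Python sketch; the read count, read length, and genome size here are hypothetical values chosen only to reproduce the 33.3× figure, not taken from the question:

```python
import math

def sequencing_depth(n_reads: int, read_len: int, genome_size: int) -> float:
    """Average depth of coverage: Depth = (N * L) / G."""
    return n_reads * read_len / genome_size

def phred_score(accuracy: float) -> float:
    """Phred quality: Q = -10 * log10(P_error), with P_error = 1 - accuracy."""
    return -10 * math.log10(1 - accuracy)

# Hypothetical inputs (not from the exam question): 1e9 100-bp reads, 3-Gb genome.
print(f"{sequencing_depth(1_000_000_000, 100, 3_000_000_000):.1f}x")  # 33.3x
print(f"Q{phred_score(0.9999):.0f}")  # Q40
```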
The variant discovery pipeline consists of four major steps, with QC/filtering between them; a minimal command sketch follows the list:
1. Quality Control & Trimming: Input: FASTQ → Tool: FastQC (assessment), Trimmomatic or PRINSEQ (trimming) → Output: cleaned FASTQ. Checks per-base quality, GC content, and duplication levels. Removes low-quality bases using sliding-window or threshold approaches.
2. Alignment: Input: cleaned FASTQ → Tool: BWA-MEM (Burrows-Wheeler Aligner) → Output: BAM file. Maps reads to the reference genome; BAM is the compressed binary version of SAM. Post-alignment: filter by mapping quality (MAPQ) and remove PCR duplicates (Picard MarkDuplicates).
3. Variant Calling: Input: filtered BAM → Tool: GATK HaplotypeCaller (following GATK Best Practices) → Output: VCF file. Examines aligned bases at each position to identify SNPs and indels. Detection is affected by base quality, proximity to homopolymers, mapping quality, and sequencing depth. Calling can be per-individual or joint across samples.
4. Variant Annotation: Input: VCF → Tool: Ensembl VEP, SnpEff, or ANNOVAR → Output: annotated VCF. Determines the effect of variants on genes/transcripts/proteins (e.g., missense, frameshift, intronic, or transcription factor binding site (TFBS) variants). Provides functional impact predictions (SIFT, PolyPhen-2).
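One possible way to chain the four steps, sketched via Python's subprocess; every file name, the read group, and the SnpEff database name are placeholders, and exact flags differ between tool versions, so treat this as an outline rather than a drop-in script:

```python
import subprocess

def run(cmd: str) -> None:
    """Run a shell command, raising on non-zero exit."""
    subprocess.run(cmd, shell=True, check=True)

# 1. QC + trimming (FastQC report, then Trimmomatic sliding-window trim)
run("fastqc sample_R1.fastq.gz sample_R2.fastq.gz")
run("trimmomatic PE sample_R1.fastq.gz sample_R2.fastq.gz "
    "trim_R1.fq.gz unpaired_R1.fq.gz trim_R2.fq.gz unpaired_R2.fq.gz "
    "SLIDINGWINDOW:4:20 MINLEN:36")

# 2. Alignment with BWA-MEM, sort to BAM, mark PCR duplicates, index
run("bwa mem -R '@RG\\tID:s1\\tSM:s1' ref.fa trim_R1.fq.gz trim_R2.fq.gz "
    "| samtools sort -o sorted.bam -")
run("picard MarkDuplicates I=sorted.bam O=dedup.bam M=dup_metrics.txt")
run("samtools index dedup.bam")

# 3. Variant calling with GATK HaplotypeCaller
run("gatk HaplotypeCaller -R ref.fa -I dedup.bam -O raw.vcf.gz")

# 4. Annotation with SnpEff (assumes a pre-built database, here GRCh38.99)
run("snpEff GRCh38.99 raw.vcf.gz > annotated.vcf")
```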
Depth of coverage = the average number of times each base is read (expressed as "×", e.g., 30×). It measures redundancy and confidence. Formula: Depth = (N × L) / G, where N = number of reads, L = read length, and G = genome (or target) size.
Breadth of coverage = the percentage of the target genome covered at a minimum depth (expressed as %, e.g., 95% at 1×). It measures completeness.
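To make the depth/breadth distinction concrete, a small sketch over an invented per-base depth list (the kind of values `samtools depth` reports), deliberately skewed to show high average depth with poor breadth:

```python
# Depth vs. breadth from per-base coverage; values are invented to mimic a
# biased library: a few hot spots, many untouched bases.
per_base_depth = [120, 115, 130, 0, 0, 0, 0, 0, 2, 1]  # one count per position

genome_size = len(per_base_depth)
avg_depth = sum(per_base_depth) / genome_size
breadth_1x = sum(d >= 1 for d in per_base_depth) / genome_size

print(f"average depth : {avg_depth:.1f}x")  # 36.8x -- looks great
print(f"breadth at 1x : {breadth_1x:.0%}")  # 50% -- half the genome unseen
```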
High depth, low breadth scenario: In a PCR-amplified library with severe amplification bias, certain genomic regions may be vastly over-represented (giving very high local depth), while other regions receive no reads at all. The average depth might be reported as 30×, but large portions of the genome have 0× coverage. This means variants in uncovered regions are completely missed, making the analysis incomplete despite seemingly adequate depth. This is why breadth is especially important in clinical diagnostics where missing regions could mean missing pathogenic variants.
Another example: targeted sequencing (e.g., exome capture) inherently has high depth in target regions but low breadth over the whole genome — which is by design, but must be understood when interpreting results.
A bioinformatician must understand what happened before data generation because ignoring upstream processes leads to incorrect assumptions or flawed analyses. Key examples:
1. PCR amplification vs. PCR-free: If PCR was used during library prep, duplicate reads are expected and must be removed (e.g., with Picard MarkDuplicates; see the duplicate-rate sketch after this list). Without knowing this, duplicates would be treated as independent evidence, inflating coverage and producing false-positive variants. PCR-free protocols avoid this but require more input DNA.
2. DNA quality/degradation: Degraded DNA (e.g., from ancient samples, honey, or soil) produces shorter fragments. This affects which sequencing platform is appropriate — degraded DNA is unsuitable for long-read platforms (PacBio/Nanopore). Low-quality DNA also affects alignment success and error rates.
3. Sequencing platform choice: Different platforms have different error profiles. Illumina has very low error rates; Ion Torrent has higher error rates especially in homopolymer regions. Knowing the platform tells you which types of errors to expect and filter for.
4. Expected sequencing depth: If coverage was planned at 5× vs. 30×, the confidence in variant calls differs dramatically. Low-coverage data (1–5×) requires different analytical approaches than high-coverage data.
5. Fragmentation method: Random fragmentation vs. restriction enzymes produces different sequence content patterns visible in FastQC (e.g., consistent bases at read starts with restriction enzymes).
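As a quick check for point 1, a sketch that measures the duplicate rate in a duplicate-marked BAM; it assumes the pysam library and a placeholder file name. A PCR-amplified library will typically show a noticeably higher rate than a PCR-free one:

```python
# Count reads flagged as PCR/optical duplicates in a duplicate-marked BAM
# (e.g., the dedup.bam produced by the pipeline sketch above).
import pysam

total = dups = 0
with pysam.AlignmentFile("dedup.bam", "rb") as bam:
    for read in bam:
        total += 1
        dups += read.is_duplicate

print(f"duplicate rate: {dups / total:.1%} of {total} reads")
```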