NGS Data Analysis — Exam Practice

📝 NGS Data Analysis – Variant Discovery Pipeline
Q1 Medium
An A260/280 ratio of 1.5 for a DNA sample most likely indicates:
A. RNA contamination
B. Protein or phenol contamination
C. Pure, high-quality DNA
D. Carbohydrate contamination
Explanation
The ideal A260/280 ratio for pure DNA is ~1.8. A ratio <1.8 indicates contamination by proteins or phenol, which absorb at 280 nm and lower the ratio. A ratio >1.8 may suggest RNA contamination. Carbohydrate contamination is detected by the A260/230 ratio, not A260/280.
Q2 Easy
The A260/230 ratio is used to assess:
A. DNA fragment length
B. Protein contamination
C. DNA integrity
D. Chemical contaminants such as carbohydrates, phenol, or guanidine salts
Explanation
The A260/230 ratio (ideal 2.0–2.2) detects chemical contaminants that absorb at 230 nm: carbohydrates (common in plant DNA), residual phenol, guanidine salts (from column kits), and glycogen. Protein contamination is assessed by A260/280. DNA integrity is assessed by gel electrophoresis, not absorbance ratios.
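The two absorbance checks above can be combined into a small helper. This is only a sketch: the function name and the tolerance band `tol` are illustrative choices, not values from the lecture; the thresholds (~1.8 for A260/280, ~2.0 for A260/230) come from the explanations above.

```python
def assess_purity(a260_280, a260_230, tol=0.1):
    """Interpret spectrophotometer ratios for a DNA sample.

    Ideal A260/280 is ~1.8; ideal A260/230 is ~2.0-2.2.
    `tol` is an assumed tolerance band around the ideal values.
    """
    notes = []
    if a260_280 < 1.8 - tol:
        notes.append("possible protein or phenol contamination")
    elif a260_280 > 1.8 + tol:
        notes.append("possible RNA contamination")
    if a260_230 < 2.0 - tol:
        notes.append("possible carbohydrate/phenol/guanidine contamination")
    return notes or ["ratios consistent with pure DNA"]

print(assess_purity(1.5, 2.1))  # ['possible protein or phenol contamination']
```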
Q3 Medium
On an agarose gel, high-quality intact genomic DNA appears as:
A. A sharp, high molecular weight band near the well
B. A smear distributed evenly across the gel
C. Multiple discrete bands of different sizes
D. A band at the bottom of the gel near small fragments
Explanation
Intact genomic DNA is composed of very long fragments that migrate slowly in the gel, producing a sharp, high molecular weight band near the wells. A smear indicates degradation. Degraded DNA (important for ancient DNA or complex-matrix samples) may still work for short-read sequencing but will fail on long-read platforms like PacBio or Nanopore.
Q4 Tricky
A bioinformatician receives NGS data but does not ask about the library preparation protocol. Which of the following errors is MOST likely to occur?
A. Failure to install the alignment software
B. The reference genome will not be available
C. Incorrect interpretation of duplicate reads or coverage biases introduced by PCR amplification
D. The FASTQ files will be unreadable
Explanation
Knowing whether PCR amplification was used in library prep is essential. PCR introduces duplicate reads and coverage biases that affect variant calling, allele frequency estimation, and coverage evenness. Without this knowledge, a bioinformatician may misinterpret duplicates as genuine high coverage supporting a variant, leading to false positives. The lecture emphasizes: "Always ask what has been done to generate the data."
Q5 Easy
The formula for sequencing depth (coverage) is:
A. Depth = G × L / N
B. Depth = (N × L) / G
C. Depth = N / (L × G)
D. Depth = (G × N) / L
Explanation
Depth (X) = (N × L) / G, where N = number of reads, L = read length (bp), G = genome size (bp). For example, 100 million reads of 150 bp on a 3 Gb genome gives (100M × 150) / 3G = 5×.
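The formula translates directly into a one-line function (the function name is illustrative):

```python
def sequencing_depth(n_reads, read_len_bp, genome_size_bp):
    """Average depth of coverage: Depth = (N * L) / G."""
    return (n_reads * read_len_bp) / genome_size_bp

# The worked example above: 100 million 150 bp reads on a 3 Gb genome.
print(sequencing_depth(100_000_000, 150, 3_000_000_000))  # 5.0
```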
Q6 Tricky
Breadth of coverage of 95% at 20× means:
A. 95% of reads have a quality score ≥20
B. Each base in the genome has been read exactly 20 times
C. 95% of reads aligned with a mapping quality of 20
D. 95% of the target genome bases are covered by at least 20 reads
Explanation
Breadth of coverage is the percentage of the target genome covered at a specified minimum depth. "95% at 20×" means 95% of bases have ≥20 reads mapped to them. This is different from depth of coverage, which is the average number of times each base is read. Breadth measures completeness; depth measures redundancy/confidence.
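The depth/breadth distinction is easy to see on a toy per-base depth vector (illustrative numbers, not real data):

```python
def breadth_of_coverage(per_base_depth, min_depth=20):
    """Fraction of target positions covered by at least `min_depth` reads."""
    covered = sum(1 for d in per_base_depth if d >= min_depth)
    return covered / len(per_base_depth)

depths = [25, 30, 5, 22, 40, 0, 21, 28, 35, 26]   # toy per-base depths
print(sum(depths) / len(depths))                  # mean depth: 23.2
print(breadth_of_coverage(depths, min_depth=20))  # breadth at 20x: 0.8
```

Note how a mean depth above 20× still leaves 20% of positions below the 20× threshold.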
Q7 Easy
In a FASTQ file, each sequence entry consists of:
A. 4 lines: header (@), sequence, separator (+), quality scores
B. 3 lines: header (>), sequence, quality scores
C. 2 lines: sequence and quality scores
D. 5 lines: header, sequence, separator, quality, checksum
Explanation
FASTQ uses exactly 4 lines per read: Line 1 begins with '@' followed by the sequence identifier; Line 2 is the raw nucleotide sequence; Line 3 begins with '+' (optionally repeating the identifier); Line 4 encodes quality values as ASCII characters, with the same number of characters as bases in Line 2.
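The 4-line structure can be parsed in a few lines of Python. This is a sketch for uncompressed, well-formed input; real FASTQ files are usually gzipped and would be opened with `gzip.open`.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples; 4 lines per record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break                       # end of file
        seq = handle.readline().rstrip()
        handle.readline()               # '+' separator line (ignored)
        qual = handle.readline().rstrip()
        assert header.startswith("@") and len(seq) == len(qual)
        yield header[1:], seq, qual

record = io.StringIO("@read1\nACGT\n+\nIIII\n")
print(next(read_fastq(record)))  # ('read1', 'ACGT', 'IIII')
```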
Q8 Medium
A Phred quality score of Q30 corresponds to:
A. A 1 in 100 chance of an incorrect base call (99% accuracy)
B. A 1 in 10 chance of an incorrect base call (90% accuracy)
C. A 1 in 1000 chance of an incorrect base call (99.9% accuracy)
D. A 1 in 10000 chance of an incorrect base call (99.99% accuracy)
Explanation
The Phred formula is Q = −10 × log₁₀(P), where P is the probability of error. For Q30: P = 10^(−30/10) = 10^(−3) = 1/1000. So there is a 0.1% chance of error, or 99.9% base call accuracy. Q20 = 99%, Q10 = 90%, Q40 = 99.99%.
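The Phred formula and its inverse in code:

```python
import math

def phred_q(p_error):
    """Phred score: Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)

def error_prob(q):
    """Inverse: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

print(phred_q(0.001))   # 30.0  -> Q30 = 99.9% accuracy
print(error_prob(20))   # 0.01  -> Q20 = 99% accuracy
```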
Q9 Medium
Quality scores in FASTQ files are encoded using:
A. Binary values representing Phred scores directly
B. Single ASCII characters, where each character maps to a Phred score
C. Two-digit integers separated by commas
D. Hexadecimal values encoding error probabilities
Explanation
Quality scores are encoded as single ASCII characters for brevity: each character in the quality string maps to one Phred score, so the quality string is exactly as long as the sequence. ASCII printable characters (range 33–126) are used. Illumina 1.8+ uses the same encoding as the original Sanger format.
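For the Sanger/Illumina 1.8+ encoding (ASCII offset 33), conversion is simple arithmetic:

```python
def qual_char_to_phred(ch, offset=33):
    """Sanger/Illumina 1.8+ encoding: Phred score = ASCII code - 33."""
    return ord(ch) - offset

def phred_to_qual_char(q, offset=33):
    """Inverse mapping: ASCII character = chr(Phred score + 33)."""
    return chr(q + offset)

print(qual_char_to_phred("I"))  # 40   ('I' is ASCII 73; 73 - 33 = 40)
print(phred_to_qual_char(30))   # '?'  (30 + 33 = ASCII 63)
```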
Q10 Medium
In paired-end sequencing, the two FASTQ files (*_1.fastq.gz and *_2.fastq.gz) are characterized by:
A. Reads sorted in the same order — the n-th read in file 1 is the mate of the n-th read in file 2
B. Reads sorted by mapping position along the reference genome
C. File 1 contains forward reads and file 2 contains all the quality scores
D. The two files can be read in any order and paired later by sequence similarity
Explanation
In paired-end sequencing, reads follow the same order in both files — the first read in *_1.fastq.gz is the mate pair of the first read in *_2.fastq.gz, and so on. They are NOT sorted by genomic position (that happens only after alignment to produce BAM files). Maintaining this order is critical for correct downstream alignment and analysis.
Q11 Easy
Which file formats can FastQC accept as input?
A. Only FASTQ files
B. FASTQ and VCF files
C. Only BAM files
D. BAM, SAM, and FASTQ files
Explanation
FastQC can import data from BAM, SAM, or FASTQ files (any variant). It provides a modular set of analyses with summary graphs and exports results as an HTML report. It does not accept VCF or other variant files.
Q12 Medium
In the FastQC "Per base sequence quality" module, a box plot at position 140 showing a median Phred score of 15 indicates:
A. Excellent quality — no action needed
B. Moderate quality — acceptable for most analyses
C. Poor quality — trimming of read ends is recommended
D. The sequencing run failed and data should be discarded entirely
Explanation
A Phred score of 15 means ~96.8% accuracy — this falls in the poor/red zone. Quality typically drops toward the end of reads. A median of 15 at position 140 suggests the read ends need trimming. However, the entire run is not necessarily a failure — trimming the low-quality tails may rescue the usable portion of the data.
Q13 Medium
In the FastQC "Per sequence GC content" module, a distribution with two distinct peaks (instead of a single normal curve) most likely indicates:
A. High sequencing quality
B. Contamination from another organism
C. Normal variation in GC content across the genome
D. Low sequencing depth
Explanation
A normal WGS library should produce a roughly normal (single-peak) GC distribution matching the reference genome. Multiple peaks or an unusually shaped distribution suggest contamination from another organism with a different GC content. For example, bacterial DNA in a cattle sample would produce a secondary peak. Environmental DNA samples (soil, honey) naturally show complex multi-peak distributions.
Q14 Medium
The "Sequence Duplication Levels" module in FastQC shows that 35% of reads appear 10+ times. The most likely cause is:
A. Over-amplification during PCR in library preparation
B. A very large and complex genome
C. High sequencing depth with PCR-free library prep
D. Adapter contamination
Explanation
In a random WGS library, most sequences should occur only once. High duplication levels (35% appearing 10+ times) strongly suggest over-amplification during PCR library preparation. These duplicates don't add new information and should be removed before analysis. PCR-free protocols avoid this bias but require more starting DNA material.
Q15 Tricky
In a WGS experiment, the "Per base sequence content" plot shows that position 1 always starts with a T and position 2 always starts with an A. This is most likely because:
A. The genome is AT-rich
B. Sequencing machine error at the beginning of every read
C. A restriction enzyme was used during library preparation that cuts at a specific recognition site
D. Adapter sequences were not trimmed
Explanation
Random DNA fragmentation should produce roughly equal proportions of all four bases at each position. Consistent base enrichment at specific positions indicates a restriction enzyme was used for fragmentation — these enzymes cut at defined recognition sequences, so all fragments begin with the same bases. This is a key example from the lecture about why understanding the library prep method is crucial for interpreting QC results.
Q16 Medium
What is the advantage of sliding window trimming over simple threshold-based trimming?
A. It is faster and requires less memory
B. It better preserves moderate-quality bases surrounded by high-quality neighbors
C. It removes adapter sequences simultaneously
D. It increases the overall read length
Explanation
Sliding window trimming uses a window (e.g., 5 bases) and calculates the average quality within it, trimming only when the average drops below the threshold. This approach preserves individual bases of moderate quality that are surrounded by high-quality neighbors, whereas simple threshold trimming would remove any single base below the cutoff. Tools like Prinseq and Trimmomatic implement sliding window trimming.
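A toy version of the sliding-window idea, loosely modelled on Trimmomatic's SLIDINGWINDOW step (the window size, threshold, and cut behaviour here are illustrative simplifications, not the tool's exact algorithm):

```python
def sliding_window_trim(qualities, window=5, threshold=20):
    """Scan windows left to right; cut the read where the mean quality
    within a window first drops below the threshold."""
    for start in range(0, len(qualities) - window + 1):
        win = qualities[start:start + window]
        if sum(win) / window < threshold:
            return qualities[:start]   # keep only bases before the bad window
    return qualities                   # no window fell below the threshold

quals = [30, 28, 25, 30, 32, 31, 10, 8, 5, 4, 3]
print(sliding_window_trim(quals))  # [30, 28, 25, 30]
```

Note how position 2 (quality 25) survives because its neighbors keep the window average high, whereas a simple per-base threshold of, say, Q28 would have cut it.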
Q17 Easy
What should be done immediately after trimming reads?
A. Proceed directly to variant calling
B. Submit the trimmed reads to a public database
C. Re-sequence the sample
D. Re-run quality control (e.g., FastQC) to confirm improvements
Explanation
After trimming, you should always re-run QC (e.g., FastQC) to verify that the quality has improved. This is part of the iterative QC-filter cycle emphasized throughout the pipeline. Skipping this verification risks retaining errors that lead to unreliable downstream conclusions.
Q18 Easy
What is the primary purpose of a SAM/BAM file?
A. To store the alignment of sequencing reads to a reference genome
B. To store raw sequencing reads and quality scores
C. To store a list of genetic variants (SNPs and indels)
D. To store the reference genome sequence
Explanation
SAM (Sequence Alignment Map) stores alignments of reads to a reference genome. BAM is the compressed binary version of SAM — smaller file size, indexed access, but not human-readable. FASTQ stores raw reads + quality scores. VCF stores variant calls. The reference genome is stored separately (e.g., as a FASTA file).
Q19 Medium
The SAM file header line "@SQ SN:chr1 LN:248956422" indicates:
A. A sequencing quality score of 248956422
B. The alignment tool version number
C. A reference sequence named chr1 with a length of 248,956,422 bp
D. The number of reads aligned to chromosome 1
Explanation
In the SAM header, @SQ is the reference sequence dictionary tag. SN stands for the reference Sequence Name (e.g., chr1), and LN stands for the reference Length in base pairs. There is one @SQ line per chromosome/contig. The @PG line (not @SQ) records alignment tool information.
Q20 Hard
A SAM alignment record has the FLAG value 1024. This read is:
A. Unmapped to the reference genome
B. A PCR or optical duplicate
C. A secondary alignment
D. Part of a properly paired read
Explanation
SAM FLAG is an integer where each bit encodes a different property. FLAG 4 = unmapped read, FLAG 256 = secondary alignment, FLAG 1024 = PCR or optical duplicate. Duplicates should typically be removed (e.g., using Picard) because they don't add independent information and can artificially inflate coverage, leading to false variant support.
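Because FLAG is a bit field, individual properties are tested with a bitwise AND rather than equality (a read can carry several flags at once):

```python
# SAM flag bits named in the explanation above.
FLAG_UNMAPPED  = 4     # 0x4
FLAG_SECONDARY = 256   # 0x100
FLAG_DUPLICATE = 1024  # 0x400

def is_duplicate(flag):
    """True if the duplicate bit (0x400) is set, whatever else is set."""
    return bool(flag & FLAG_DUPLICATE)

print(is_duplicate(1024))       # True
print(is_duplicate(1024 + 16))  # True  (duplicate AND reverse strand)
print(is_duplicate(4))          # False (unmapped, not a duplicate)
```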
Q21 Hard
Given the CIGAR string "4S8M2I4M1D3M", which statement is correct?
A. The read has 4 deletions at the start
B. The read is 22 bases long and all bases align to the reference
C. There are 2 deletions and 1 insertion in this alignment
D. The first 4 bases are soft-clipped, there is a 2-base insertion and a 1-base deletion relative to the reference
Explanation
Parsing "4S8M2I4M1D3M": 4S = 4 bases soft-clipped (present in read but not aligned); 8M = 8 matching/mismatching bases; 2I = 2 bases inserted in read (not in reference); 4M = 4 matching; 1D = 1 base deleted from read (present in reference, missing in read); 3M = 3 matching. The read length = 4+8+2+4+3 = 21 bases. Note: M includes both matches AND mismatches — variant detection requires separate tools.
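A small parser makes the read-length bookkeeping explicit: S, M, I, =, and X operations consume read bases, while D and N do not.

```python
import re

def cigar_stats(cigar):
    """Sum the length of each CIGAR operation and derive the read length."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    totals = {}
    for length, op in ops:
        totals[op] = totals.get(op, 0) + int(length)
    # Operations that consume bases of the read itself:
    read_len = sum(totals.get(op, 0) for op in "MIS=X")
    return totals, read_len

totals, read_len = cigar_stats("4S8M2I4M1D3M")
print(totals)    # {'S': 4, 'M': 15, 'I': 2, 'D': 1}
print(read_len)  # 21  (the 1 D base exists only in the reference)
```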
Q22 Medium
A mapping quality (MAPQ) score of 0 indicates that:
A. The read maps equally well to multiple locations in the genome
B. The read has perfect alignment with no mismatches
C. The base quality of all positions in the read is zero
D. The read was not sequenced correctly
Explanation
MAPQ is a Phred-scaled score of mapping confidence. MAPQ = 0 means the read aligns equally well to multiple locations, often due to repetitive/low-complexity regions or genome duplications. Such reads should be filtered out before variant calling to avoid false positives. Higher MAPQ = higher confidence in unique placement.
Q23 Medium
BWA-MEM is based on which algorithm?
A. Hash table indexing
B. Smith-Waterman local alignment
C. Burrows-Wheeler Transform
D. k-mer frequency counting
Explanation
BWA (Burrows-Wheeler Aligner) uses the Burrows-Wheeler Transform (BWT) for efficient sequence alignment. Bowtie also uses BWT. In contrast, some other aligners use hash table-based approaches. The two approaches differ in speed, CPU/memory usage, and sensitivity — affecting downstream variant discovery. BWA-MEM is the default aligner in many standard genomic pipelines.
Q24 Tricky
When comparing BWA-MEM and Bowtie2 using the same variant caller (SAMtools), only 24.5% of SNPs were concordant. This demonstrates that:
A. SAMtools is an unreliable variant caller
B. The choice of read aligner has a major impact on downstream variant discovery
C. Both aligners produce identical results and the difference is due to random variation
D. The reference genome was incorrectly assembled
Explanation
Only 24.5% concordance between BWA-MEM and Bowtie2 (with the same variant caller) shows that aligner choice is NOT trivial — it profoundly affects which variants are discovered. A practical recommendation is to run both aligners and compare their results, especially for complex genomes like polyploid plants. This is a key point students often underestimate.
Q25 Medium
PCR duplicates are identified by sharing:
A. The same base quality scores
B. The same read name in the FASTQ file
C. Alignment to different chromosomes with similar sequences
D. Common coordinates, sequencing direction, and the same sequence
Explanation
PCR duplicates are identified as reads that share common genomic coordinates (start/end position), the same sequencing direction, and the same sequence — indicating they originated from the same amplified fragment rather than independent DNA molecules. Tools like Picard mark and remove these duplicates. Alternatively, PCR-free library protocols can be used if sufficient input DNA is available.
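The identification criterion can be sketched as a grouping key. This is a simplification: real tools such as Picard operate on BAM records and use clip-adjusted alignment coordinates, not plain dicts like the illustrative ones below.

```python
def duplicate_key(read):
    """Reads sharing this key (coordinates + strand + sequence)
    are candidate PCR duplicates; `read` is an assumed dict here."""
    return (read["chrom"], read["pos"], read["is_reverse"], read["seq"])

reads = [
    {"chrom": "chr1", "pos": 100, "is_reverse": False, "seq": "ACGT"},
    {"chrom": "chr1", "pos": 100, "is_reverse": False, "seq": "ACGT"},  # duplicate
    {"chrom": "chr1", "pos": 100, "is_reverse": True,  "seq": "ACGT"},  # other strand
]
unique = {duplicate_key(r): r for r in reads}  # keep one read per key
print(len(unique))  # 2 independent observations
```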
Q26 Medium
Why is duplicate removal important before variant calling?
A. Duplicates artificially inflate coverage and can give false support to variants
B. Duplicates reduce the mapping quality of all reads
C. Duplicates change the reference genome sequence
D. Duplicates decrease the file size of BAM files
Explanation
PCR duplicates are copies of the same DNA fragment. They artificially inflate coverage at certain positions, lending spuriously strong support to variants (including errors from the original fragment). This can lead to false positive variant calls. Removing duplicates ensures that only independent observations contribute to variant evidence.
Q27 Medium
Which of the following factors does NOT directly affect variant calling accuracy?
A. Base call quality of supporting reads
B. Proximity to homopolymer runs
C. The GC content of the entire genome
D. Mapping quality of the aligned reads
Explanation
Variant calling accuracy is affected by: base call quality, proximity to indels/homopolymer runs (which cause sequencing errors), mapping quality, and sequencing depth. The overall GC content of the genome affects sequencing coverage evenness but does not directly impact variant calling at a specific position in the way the other factors do.
Q28 Hard
What is the main advantage of joint variant calling over individual variant calling followed by merging?
A. It produces smaller VCF files
B. A low-confidence variant in one sample can be confirmed by evidence from other samples
C. It requires less computational resources
D. It does not require a reference genome
Explanation
Joint variant calling analyzes all samples simultaneously, allowing the caller to use cross-sample evidence. A variant that appears with low confidence in one sample (e.g., due to low coverage) can be confirmed if it appears confidently in other samples. In individual calling, a missing variant in some VCFs is ambiguous — it could be wild-type or just insufficient coverage. Joint calling resolves this ambiguity.
Q29 Tricky
A variant is called in a region with a homopolymer run of 8 adenines (AAAAAAAA). How should this variant be treated?
A. With caution — homopolymer regions are prone to sequencing errors that produce false positive variants
B. With high confidence — repetitive regions are easier to sequence accurately
C. It should be automatically accepted because modern callers handle homopolymers perfectly
D. It should be ignored — variants cannot occur in homopolymer regions
Explanation
Homopolymer runs are regions where the same nucleotide repeats many times (e.g., AAAAAAAA). Sequencing platforms are prone to insertion/deletion errors in these regions; Ion Torrent and Nanopore are especially affected, and even Illumina is not immune. Variants found in homopolymers are often false positives. Modern variant callers include filters for these regions, but they are not perfect. Manual inspection (e.g., in IGV) is recommended.
Q30 Easy
In a VCF file, which column contains the alternative (non-reference) allele?
A. REF
B. QUAL
C. ALT
D. INFO
Explanation
The VCF mandatory columns are: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, and then sample columns. ALT contains the alternative (non-reference) allele(s), comma-separated if more than one. REF contains the reference allele. QUAL is the Phred-scaled quality score. INFO contains extensible annotations.
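A VCF data line is tab-separated, so the first 8 mandatory columns split out directly (a minimal sketch that ignores FORMAT and sample columns; the example record is invented for illustration):

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Map the first 8 mandatory columns of a VCF data line to names."""
    fields = line.rstrip("\n").split("\t")
    return dict(zip(VCF_COLUMNS, fields))

rec = parse_vcf_line("chr1\t604\t.\tC\tG\t99\tPASS\tDP=33")
print(rec["REF"], rec["ALT"])  # C G
```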
Q31 Medium
In a VCF file, meta-information lines begin with:
A. @ (at sign)
B. > (greater-than sign)
C. # (single hash)
D. ## (double hash)
Explanation
In VCF files: ## (double hash) marks meta-information lines (key=value pairs describing filters, info fields, etc.); # (single hash) marks the column header line (CHROM, POS, ID, etc.). Don't confuse with FASTQ where @ begins each entry, or FASTA where > begins each entry. SAM headers use @.
Q32 Easy
Which tool determines the effect of variants on genes, transcripts, and protein sequence?
A. BWA-MEM
B. Ensembl Variant Effect Predictor (VEP)
C. FastQC
D. Picard
Explanation
VEP (Variant Effect Predictor) from Ensembl determines variant effects on genes, transcripts, and protein sequences. It also provides SIFT and PolyPhen-2 scores for protein-altering changes. Other annotation tools include SnpEff and ANNOVAR. BWA-MEM is an aligner, FastQC does quality control, and Picard handles duplicate removal.
Q33 Medium
A "gain of TFBS" variant means:
A. The transcription factor binding site exists only for the alternative allele of the SNP
B. The transcription factor binding site exists only for the reference allele
C. Both alleles have identical transcription factor binding affinity
D. The variant is located in a coding region and causes a missense change
Explanation
TFBS variant consequences: Loss of TFBS = binding site exists only for the reference (0) allele; Gain of TFBS = binding site exists only for the alternative (1) allele; Score-Change = binding affinity differs between alleles; No Change = both alleles predicted with same binding affinity. In entropy logos, larger letters indicate more critical positions — mutations there are more likely to impact binding.
Q34 Medium
In the albino donkey case study, what type of variant was identified in the TYR gene?
A. A frameshift deletion
B. A synonymous substitution
C. A splice site variant
D. A missense mutation (c.604C>G, p.His202Asp) disrupting copper binding in tyrosinase
Explanation
The albino donkeys from Asinara island carry a missense mutation c.604C>G in the TYR gene, causing a histidine to aspartate substitution at position 202 (p.His202Asp). This disrupts copper binding in the tyrosinase enzyme, inactivating melanin production. Parents are heterozygous (C/G), albino offspring are homozygous (G/G) — demonstrating autosomal recessive inheritance.
Q35 Easy
The correct order of file formats in a standard variant discovery pipeline is:
A. BAM → FASTQ → VCF
B. VCF → BAM → FASTQ
C. FASTQ → BAM → VCF
D. FASTQ → VCF → BAM
Explanation
The standard pipeline flows: FASTQ (raw reads) → alignment with BWA → BAM (aligned reads) → variant calling with GATK → VCF (variant calls) → annotation with VEP/SnpEff. Quality control and filtering occur between every step. This is the core workflow emphasized throughout the lecture.
Q36 Tricky
Unmapped reads (FLAG 4) from cattle WGS data could be useful for:
A. Improving the alignment of mapped reads
B. Metagenomics — detecting contaminant bacteria or viruses in the sample
C. Increasing the sequencing depth of the cattle genome
D. Generating a new reference genome
Explanation
Unmapped reads (FLAG 4) did not align to the host (cattle) genome. These can be extracted and re-aligned to microbial databases to identify contaminant bacteria, viruses, or other organisms. This is the foundation of metagenomics and is relevant to the "One Health" approach linking human, animal, and environmental health. This is a concept the lecture specifically highlights as a practical application of SAM flag filtering.
Q37 Easy
Galaxy is primarily described as:
A. An open, web-based platform for accessible, reproducible, and transparent computational research
B. A command-line only tool for variant calling
C. A commercial sequencing platform
D. A local software package requiring complex installation
Explanation
Galaxy is an open, web-based platform. Its key features are: accessibility (no programming required, point-and-click), reproducibility (captures all analysis details), transparency (users share histories and workflows), and community-centered design. It requires only an internet connection and a browser — no installation or complex commands needed.
Q38 Easy
The Galaxy interface consists of three main panels:
A. Upload, Download, and Settings
B. Code editor, Terminal, and Output
C. Data + Available tools, Run tools and view results, Analysis history
D. Alignment, Variant calling, and Annotation panels
Explanation
Galaxy has three main panels: (1) Left panel for data and available tools, (2) Middle/center panel for running tools and viewing results, and (3) Right panel for the analysis history which tracks all files and operations. The history panel shows files with three action buttons: view (eye icon), edit attributes (pencil), and delete.
Q39 — Open Calculation
You sequence a cattle genome (genome size = 2.7 Gbp) using Illumina paired-end 150 bp reads. You generate 600 million reads in total. (a) Calculate the sequencing depth. (b) If the minimum recommended depth for robust SNP detection is 10×, is this sufficient? (c) What is the Phred score corresponding to 99.99% base call accuracy?
✓ Model Answer

(a) Sequencing depth:

Depth = (N × L) / G
N = 600,000,000 reads; L = 150 bp; G = 2,700,000,000 bp
Depth = (600,000,000 × 150) / 2,700,000,000
= 90,000,000,000 / 2,700,000,000
= 33.3×

(b) Yes, 33.3× exceeds the minimum recommended ~10× for robust SNP detection. This depth provides high confidence for variant calling.

(c) Phred score for 99.99% accuracy:

P(error) = 1 − 0.9999 = 0.0001 = 10⁻⁴
Q = −10 × log₁₀(10⁻⁴) = −10 × (−4) = 40
Answer: Q40
Q40 — Open Short Answer
Describe the complete variant discovery pipeline from raw sequencing data to annotated variants. For each major step, name the input file format, the output file format, and one commonly used tool.
✓ Model Answer

The variant discovery pipeline consists of four major steps, each with QC/filtering between them:

1. Quality Control & Trimming: Input: FASTQ → Tool: FastQC (assessment), Trimmomatic or Prinseq (trimming) → Output: cleaned FASTQ. Checks per-base quality, GC content, duplication levels. Removes low-quality bases using sliding window or threshold approaches.

2. Alignment: Input: cleaned FASTQ → Tool: BWA-MEM (Burrows-Wheeler Transform) → Output: BAM file. Maps reads to the reference genome. BAM is the binary compressed version of SAM. Post-alignment: filter by mapping quality (MAPQ) and remove PCR duplicates (Picard).

3. Variant Calling: Input: filtered BAM → Tool: GATK (following GATK Best Practices) → Output: VCF file. Examines aligned bases at each position to identify SNPs and indels. Detection affected by base quality, proximity to homopolymers, mapping quality, and sequencing depth. Can be individual or joint calling across samples.

4. Variant Annotation: Input: VCF → Tool: Ensembl VEP, SnpEff, or ANNOVAR → Output: annotated VCF. Determines effect of variants on genes/transcripts/proteins (e.g., missense, frameshift, intronic, TFBS variants). Provides functional impact predictions (SIFT, PolyPhen-2).

Q41 — Open Tricky
Explain the difference between depth of coverage and breadth of coverage. Give a scenario where you might have high depth but low breadth, and explain why this would be problematic for variant discovery.
✓ Model Answer

Depth of coverage = the average number of times each base is read (expressed as "X", e.g., 30×). It measures redundancy and confidence. Formula: Depth = (N × L) / G.

Breadth of coverage = the percentage of the target genome covered at a minimum depth (expressed as %, e.g., 95% at 1×). It measures completeness.

High depth, low breadth scenario: In a PCR-amplified library with severe amplification bias, certain genomic regions may be vastly over-represented (giving very high local depth), while other regions receive no reads at all. The average depth might be reported as 30×, but large portions of the genome have 0× coverage. This means variants in uncovered regions are completely missed, making the analysis incomplete despite seemingly adequate depth. This is why breadth is especially important in clinical diagnostics where missing regions could mean missing pathogenic variants.

Another example: targeted sequencing (e.g., exome capture) inherently has high depth in target regions but low breadth over the whole genome — which is by design, but must be understood when interpreting results.

Q42 — Open Short Answer
Why is it important for a bioinformatician to understand the upstream wet-lab steps before analyzing NGS data? Give at least three specific examples of how lab decisions affect data analysis.
✓ Model Answer

A bioinformatician must understand what happened before data generation because ignoring upstream processes leads to incorrect assumptions or flawed analyses. Key examples:

1. PCR amplification vs. PCR-free: If PCR was used during library prep, duplicate reads are expected and must be removed (e.g., with Picard). Without knowing this, duplicates would be treated as independent evidence, inflating coverage and producing false positive variants. PCR-free protocols avoid this but require more input DNA.

2. DNA quality/degradation: Degraded DNA (e.g., from ancient samples, honey, or soil) produces shorter fragments. This affects which sequencing platform is appropriate — degraded DNA is unsuitable for long-read platforms (PacBio/Nanopore). Low-quality DNA also affects alignment success and error rates.

3. Sequencing platform choice: Different platforms have different error profiles. Illumina has very low error rates; Ion Torrent has higher error rates especially in homopolymer regions. Knowing the platform tells you which types of errors to expect and filter for.

4. Expected sequencing depth: If coverage was planned at 5× vs. 30×, the confidence in variant calls differs dramatically. Low-coverage data (1–5×) requires different analytical approaches than high-coverage data.

5. Fragmentation method: Random fragmentation vs. restriction enzymes produces different sequence content patterns visible in FastQC (e.g., consistent bases at read starts with restriction enzymes).