NGS Data Analysis — Exam Practice

📝 NGS Data Analysis – Variant Discovery Pipeline
Q1 Medium
An A260/280 ratio of 1.5 for a DNA sample most likely indicates:
A. RNA contamination
B. Protein or phenol contamination
C. Pure, high-quality DNA
D. Carbohydrate contamination
Explanation
The ideal A260/280 ratio for pure DNA is ~1.8. A ratio <1.8 indicates contamination by proteins or phenol, which absorb at 280 nm and lower the ratio. A ratio >1.8 may suggest RNA contamination. Carbohydrate contamination is detected by the A260/230 ratio, not A260/280.
Q2 Easy
The A260/230 ratio is used to assess:
A. DNA fragment length
B. Protein contamination
C. DNA integrity
D. Chemical contaminants such as carbohydrates, phenol, or guanidine salts
Explanation
The A260/230 ratio (ideal 2.0–2.2) detects chemical contaminants that absorb at 230 nm: carbohydrates (common in plant DNA), residual phenol, guanidine salts (from column kits), and glycogen. Protein contamination is assessed by A260/280. DNA integrity is assessed by gel electrophoresis, not absorbance ratios.
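The two absorbance checks above can be combined into a small helper. This is only a sketch: the function name and the tolerance band `tol` are illustrative choices, not values from the lecture; the thresholds (~1.8 for A260/280, ~2.0 for A260/230) come from the explanations above.

```python
def assess_purity(a260_280, a260_230, tol=0.1):
    """Interpret spectrophotometer ratios for a DNA sample.

    Ideal A260/280 is ~1.8; ideal A260/230 is ~2.0-2.2.
    `tol` is an assumed tolerance band around the ideal values.
    """
    notes = []
    if a260_280 < 1.8 - tol:
        notes.append("possible protein or phenol contamination")
    elif a260_280 > 1.8 + tol:
        notes.append("possible RNA contamination")
    if a260_230 < 2.0 - tol:
        notes.append("possible carbohydrate/phenol/guanidine contamination")
    return notes or ["ratios consistent with pure DNA"]

print(assess_purity(1.5, 2.1))  # ['possible protein or phenol contamination']
```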
Q3 Medium
On an agarose gel, high-quality intact genomic DNA appears as:
A. A sharp, high molecular weight band near the well
B. A smear distributed evenly across the gel
C. Multiple discrete bands of different sizes
D. A band at the bottom of the gel near small fragments
Explanation
Intact genomic DNA is composed of very long fragments that migrate slowly in the gel, producing a sharp, high molecular weight band near the wells. A smear indicates degradation. Degraded DNA (important for ancient DNA or complex-matrix samples) may still work for short-read sequencing but will fail on long-read platforms like PacBio or Nanopore.
Q4 Tricky
A bioinformatician receives NGS data but does not ask about the library preparation protocol. Which of the following errors is MOST likely to occur?
A. Failure to install the alignment software
B. The reference genome will not be available
C. Incorrect interpretation of duplicate reads or coverage biases introduced by PCR amplification
D. The FASTQ files will be unreadable
Explanation
Knowing whether PCR amplification was used in library prep is essential. PCR introduces duplicate reads and coverage biases that affect variant calling, allele frequency estimation, and coverage evenness. Without this knowledge, a bioinformatician may misinterpret duplicates as genuine high coverage supporting a variant, leading to false positives. The lecture emphasizes: "Always ask what has been done to generate the data."
Q5 Easy
The formula for sequencing depth (coverage) is:
A. Depth = G × L / N
B. Depth = (N × L) / G
C. Depth = N / (L × G)
D. Depth = (G × N) / L
Explanation
Depth (X) = (N × L) / G, where N = number of reads, L = read length (bp), G = genome size (bp). For example, 100 million reads of 150 bp on a 3 Gb genome gives (100M × 150) / 3G = 5×.
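The formula translates directly into a one-line function (the function name is illustrative):

```python
def sequencing_depth(n_reads, read_len_bp, genome_size_bp):
    """Average depth of coverage: Depth = (N * L) / G."""
    return (n_reads * read_len_bp) / genome_size_bp

# The worked example above: 100 million 150 bp reads on a 3 Gb genome.
print(sequencing_depth(100_000_000, 150, 3_000_000_000))  # 5.0
```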
Q6 Tricky
Breadth of coverage of 95% at 20× means:
A. 95% of reads have a quality score ≥20
B. Each base in the genome has been read exactly 20 times
C. 95% of reads aligned with a mapping quality of 20
D. 95% of the target genome bases are covered by at least 20 reads
Explanation
Breadth of coverage is the percentage of the target genome covered at a specified minimum depth. "95% at 20×" means 95% of bases have ≥20 reads mapped to them. This is different from depth of coverage, which is the average number of times each base is read. Breadth measures completeness; depth measures redundancy/confidence.
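The depth/breadth distinction is easy to see on a toy per-base depth vector (illustrative numbers, not real data):

```python
def breadth_of_coverage(per_base_depth, min_depth=20):
    """Fraction of target positions covered by at least `min_depth` reads."""
    covered = sum(1 for d in per_base_depth if d >= min_depth)
    return covered / len(per_base_depth)

depths = [25, 30, 5, 22, 40, 0, 21, 28, 35, 26]   # toy per-base depths
print(sum(depths) / len(depths))                  # mean depth: 23.2
print(breadth_of_coverage(depths, min_depth=20))  # breadth at 20x: 0.8
```

Note how a mean depth above 20× still leaves 20% of positions below the 20× threshold.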
Q7 Easy
In a FASTQ file, each sequence entry consists of:
A. 4 lines: header (@), sequence, separator (+), quality scores
B. 3 lines: header (>), sequence, quality scores
C. 2 lines: sequence and quality scores
D. 5 lines: header, sequence, separator, quality, checksum
Explanation
FASTQ uses exactly 4 lines per read: Line 1 begins with '@' followed by the sequence identifier; Line 2 is the raw nucleotide sequence; Line 3 begins with '+' (optionally repeating the identifier); Line 4 encodes quality values as ASCII characters, with the same number of characters as bases in Line 2.
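The 4-line structure can be parsed in a few lines of Python. This is a sketch for uncompressed, well-formed input; real FASTQ files are usually gzipped and would be opened with `gzip.open`.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples; 4 lines per record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break                       # end of file
        seq = handle.readline().rstrip()
        handle.readline()               # '+' separator line (ignored)
        qual = handle.readline().rstrip()
        assert header.startswith("@") and len(seq) == len(qual)
        yield header[1:], seq, qual

record = io.StringIO("@read1\nACGT\n+\nIIII\n")
print(next(read_fastq(record)))  # ('read1', 'ACGT', 'IIII')
```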
Q8 Medium
A Phred quality score of Q30 corresponds to:
A. A 1 in 100 chance of an incorrect base call (99% accuracy)
B. A 1 in 10 chance of an incorrect base call (90% accuracy)
C. A 1 in 1000 chance of an incorrect base call (99.9% accuracy)
D. A 1 in 10000 chance of an incorrect base call (99.99% accuracy)
Explanation
The Phred formula is Q = −10 × log₁₀(P), where P is the probability of error. For Q30: P = 10^(−30/10) = 10^(−3) = 1/1000. So there is a 0.1% chance of error, or 99.9% base call accuracy. Q20 = 99%, Q10 = 90%, Q40 = 99.99%.
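The Phred formula and its inverse in code:

```python
import math

def phred_q(p_error):
    """Phred score: Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)

def error_prob(q):
    """Inverse: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

print(phred_q(0.001))   # 30.0  -> Q30 = 99.9% accuracy
print(error_prob(20))   # 0.01  -> Q20 = 99% accuracy
```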
Q9 Medium
Quality scores in FASTQ files are encoded using:
A. Binary values representing Phred scores directly
B. Single ASCII characters, where each character maps to a Phred score
C. Two-digit integers separated by commas
D. Hexadecimal values encoding error probabilities
Explanation
Quality scores are encoded as single ASCII characters for brevity: each character in the quality string maps to one Phred score, so the quality string is exactly as long as the sequence. ASCII printable characters (range 33–126) are used. Illumina 1.8+ uses the same encoding as the original Sanger format.
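For the Sanger/Illumina 1.8+ encoding (ASCII offset 33), conversion is simple arithmetic:

```python
def qual_char_to_phred(ch, offset=33):
    """Sanger/Illumina 1.8+ encoding: Phred score = ASCII code - 33."""
    return ord(ch) - offset

def phred_to_qual_char(q, offset=33):
    """Inverse mapping: ASCII character = chr(Phred score + 33)."""
    return chr(q + offset)

print(qual_char_to_phred("I"))  # 40   ('I' is ASCII 73; 73 - 33 = 40)
print(phred_to_qual_char(30))   # '?'  (30 + 33 = ASCII 63)
```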
Q10 Medium
In paired-end sequencing, the two FASTQ files (*_1.fastq.gz and *_2.fastq.gz) are characterized by:
A. Reads sorted in the same order — the n-th read in file 1 is the mate of the n-th read in file 2
B. Reads sorted by mapping position along the reference genome
C. File 1 contains forward reads and file 2 contains all the quality scores
D. The two files can be read in any order and paired later by sequence similarity
Explanation
In paired-end sequencing, reads follow the same order in both files — the first read in *_1.fastq.gz is the mate pair of the first read in *_2.fastq.gz, and so on. They are NOT sorted by genomic position (that happens only after alignment to produce BAM files). Maintaining this order is critical for correct downstream alignment and analysis.
Q11 Easy
Which file formats can FastQC accept as input?
A. Only FASTQ files
B. FASTQ and VCF files
C. Only BAM files
D. BAM, SAM, and FASTQ files
Explanation
FastQC can import data from BAM, SAM, or FASTQ files (any variant). It provides a modular set of analyses with summary graphs and exports results as an HTML report. It does not accept VCF or other variant files.
Q12 Medium
In the FastQC "Per base sequence quality" module, a box plot at position 140 showing a median Phred score of 15 indicates:
A. Excellent quality — no action needed
B. Moderate quality — acceptable for most analyses
C. Poor quality — trimming of read ends is recommended
D. The sequencing run failed and data should be discarded entirely
Explanation
A Phred score of 15 means ~96.8% accuracy — this falls in the poor/red zone. Quality typically drops toward the end of reads. A median of 15 at position 140 suggests the read ends need trimming. However, the entire run is not necessarily a failure — trimming the low-quality tails may rescue the usable portion of the data.
Q13 Medium
In the FastQC "Per sequence GC content" module, a distribution with two distinct peaks (instead of a single normal curve) most likely indicates:
A. High sequencing quality
B. Contamination from another organism
C. Normal variation in GC content across the genome
D. Low sequencing depth
Explanation
A normal WGS library should produce a roughly normal (single-peak) GC distribution matching the reference genome. Multiple peaks or an unusually shaped distribution suggest contamination from another organism with a different GC content. For example, bacterial DNA in a cattle sample would produce a secondary peak. Environmental DNA samples (soil, honey) naturally show complex multi-peak distributions.
Q14 Medium
The "Sequence Duplication Levels" module in FastQC shows that 35% of reads appear 10+ times. The most likely cause is:
A. Over-amplification during PCR in library preparation
B. A very large and complex genome
C. High sequencing depth with PCR-free library prep
D. Adapter contamination
Explanation
In a random WGS library, most sequences should occur only once. High duplication levels (35% appearing 10+ times) strongly suggest over-amplification during PCR library preparation. These duplicates don't add new information and should be removed before analysis. PCR-free protocols avoid this bias but require more starting DNA material.
Q15 Tricky
In a WGS experiment, the "Per base sequence content" plot shows that position 1 always starts with a T and position 2 always starts with an A. This is most likely because:
A. The genome is AT-rich
B. Sequencing machine error at the beginning of every read
C. A restriction enzyme was used during library preparation that cuts at a specific recognition site
D. Adapter sequences were not trimmed
Explanation
Random DNA fragmentation should produce roughly equal proportions of all four bases at each position. Consistent base enrichment at specific positions indicates a restriction enzyme was used for fragmentation — these enzymes cut at defined recognition sequences, so all fragments begin with the same bases. This is a key example from the lecture about why understanding the library prep method is crucial for interpreting QC results.
Q16 Medium
What is the advantage of sliding window trimming over simple threshold-based trimming?
A. It is faster and requires less memory
B. It better preserves moderate-quality bases surrounded by high-quality neighbors
C. It removes adapter sequences simultaneously
D. It increases the overall read length
Explanation
Sliding window trimming uses a window (e.g., 5 bases) and calculates the average quality within it, trimming only when the average drops below the threshold. This approach preserves individual bases of moderate quality that are surrounded by high-quality neighbors, whereas simple threshold trimming would remove any single base below the cutoff. Tools like Prinseq and Trimmomatic implement sliding window trimming.
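A toy version of the sliding-window idea, loosely modelled on Trimmomatic's SLIDINGWINDOW step (the window size, threshold, and cut behaviour here are illustrative simplifications, not the tool's exact algorithm):

```python
def sliding_window_trim(qualities, window=5, threshold=20):
    """Scan windows left to right; cut the read where the mean quality
    within a window first drops below the threshold."""
    for start in range(0, len(qualities) - window + 1):
        win = qualities[start:start + window]
        if sum(win) / window < threshold:
            return qualities[:start]   # keep only bases before the bad window
    return qualities                   # no window fell below the threshold

quals = [30, 28, 25, 30, 32, 31, 10, 8, 5, 4, 3]
print(sliding_window_trim(quals))  # [30, 28, 25, 30]
```

Note how position 2 (quality 25) survives because its neighbors keep the window average high, whereas a simple per-base threshold of, say, Q28 would have cut it.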
Q17 Easy
What should be done immediately after trimming reads?
A. Proceed directly to variant calling
B. Submit the trimmed reads to a public database
C. Re-sequence the sample
D. Re-run quality control (e.g., FastQC) to confirm improvements
Explanation
After trimming, you should always re-run QC (e.g., FastQC) to verify that the quality has improved. This is part of the iterative QC-filter cycle emphasized throughout the pipeline. Skipping this verification risks retaining errors that lead to unreliable downstream conclusions.
Q18 Easy
What is the primary purpose of a SAM/BAM file?
A. To store the alignment of sequencing reads to a reference genome
B. To store raw sequencing reads and quality scores
C. To store a list of genetic variants (SNPs and indels)
D. To store the reference genome sequence
Explanation
SAM (Sequence Alignment Map) stores alignments of reads to a reference genome. BAM is the compressed binary version of SAM — smaller file size, indexed access, but not human-readable. FASTQ stores raw reads + quality scores. VCF stores variant calls. The reference genome is stored separately (e.g., as a FASTA file).
Q19 Medium
The SAM file header line "@SQ SN:chr1 LN:248956422" indicates:
A. A sequencing quality score of 248956422
B. The alignment tool version number
C. A reference sequence named chr1 with a length of 248,956,422 bp
D. The number of reads aligned to chromosome 1
Explanation
In the SAM header, @SQ is the reference sequence dictionary tag. SN stands for the reference Sequence Name (e.g., chr1), and LN stands for the reference Length in base pairs. There is one @SQ line per chromosome/contig. The @PG line (not @SQ) records alignment tool information.
Q20 Hard
A SAM alignment record has the FLAG value 1024. This read is:
A. Unmapped to the reference genome
B. A PCR or optical duplicate
C. A secondary alignment
D. Part of a properly paired read
Explanation
SAM FLAG is an integer where each bit encodes a different property. FLAG 4 = unmapped read, FLAG 256 = secondary alignment, FLAG 1024 = PCR or optical duplicate. Duplicates should typically be removed (e.g., using Picard) because they don't add independent information and can artificially inflate coverage, leading to false variant support.
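Because FLAG is a bit field, individual properties are tested with a bitwise AND rather than equality (a read can carry several flags at once):

```python
# SAM flag bits named in the explanation above.
FLAG_UNMAPPED  = 4     # 0x4
FLAG_SECONDARY = 256   # 0x100
FLAG_DUPLICATE = 1024  # 0x400

def is_duplicate(flag):
    """True if the duplicate bit (0x400) is set, whatever else is set."""
    return bool(flag & FLAG_DUPLICATE)

print(is_duplicate(1024))       # True
print(is_duplicate(1024 + 16))  # True  (duplicate AND reverse strand)
print(is_duplicate(4))          # False (unmapped, not a duplicate)
```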
Q21 Hard
Given the CIGAR string "4S8M2I4M1D3M", which statement is correct?
A. The read has 4 deletions at the start
B. The read is 22 bases long and all bases align to the reference
C. There are 2 deletions and 1 insertion in this alignment
D. The first 4 bases are soft-clipped, there is a 2-base insertion and a 1-base deletion relative to the reference
Explanation
Parsing "4S8M2I4M1D3M": 4S = 4 bases soft-clipped (present in read but not aligned); 8M = 8 matching/mismatching bases; 2I = 2 bases inserted in read (not in reference); 4M = 4 matching; 1D = 1 base deleted from read (present in reference, missing in read); 3M = 3 matching. The read length = 4+8+2+4+3 = 21 bases. Note: M includes both matches AND mismatches — variant detection requires separate tools.
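A small parser makes the read-length bookkeeping explicit: S, M, I, =, and X operations consume read bases, while D and N do not.

```python
import re

def cigar_stats(cigar):
    """Sum the length of each CIGAR operation and derive the read length."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    totals = {}
    for length, op in ops:
        totals[op] = totals.get(op, 0) + int(length)
    # Operations that consume bases of the read itself:
    read_len = sum(totals.get(op, 0) for op in "MIS=X")
    return totals, read_len

totals, read_len = cigar_stats("4S8M2I4M1D3M")
print(totals)    # {'S': 4, 'M': 15, 'I': 2, 'D': 1}
print(read_len)  # 21  (the 1 D base exists only in the reference)
```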
Q22 Medium
A mapping quality (MAPQ) score of 0 indicates that:
A. The read maps equally well to multiple locations in the genome
B. The read has perfect alignment with no mismatches
C. The base quality of all positions in the read is zero
D. The read was not sequenced correctly
Explanation
MAPQ is a Phred-scaled score of mapping confidence. MAPQ = 0 means the read aligns equally well to multiple locations, often due to repetitive/low-complexity regions or genome duplications. Such reads should be filtered out before variant calling to avoid false positives. Higher MAPQ = higher confidence in unique placement.
Q23 Medium
BWA-MEM is based on which algorithm?
A. Hash table indexing
B. Smith-Waterman local alignment
C. Burrows-Wheeler Transform
D. k-mer frequency counting
Explanation
BWA (Burrows-Wheeler Aligner) uses the Burrows-Wheeler Transform (BWT) for efficient sequence alignment. Bowtie also uses BWT. In contrast, some other aligners use hash table-based approaches. The two approaches differ in speed, CPU/memory usage, and sensitivity — affecting downstream variant discovery. BWA-MEM is the default aligner in many standard genomic pipelines.
Q24 Tricky
When comparing BWA-MEM and Bowtie2 using the same variant caller (SAMtools), only 24.5% of SNPs were concordant. This demonstrates that:
A. SAMtools is an unreliable variant caller
B. The choice of read aligner has a major impact on downstream variant discovery
C. Both aligners produce identical results and the difference is due to random variation
D. The reference genome was incorrectly assembled
Explanation
Only 24.5% concordance between BWA-MEM and Bowtie2 (with the same variant caller) shows that aligner choice is NOT trivial — it profoundly affects which variants are discovered. A practical recommendation is to run both aligners and compare their results, especially for complex genomes like polyploid plants. This is a key point students often underestimate.
Q25 Medium
PCR duplicates are identified by sharing:
A. The same base quality scores
B. The same read name in the FASTQ file
C. Alignment to different chromosomes with similar sequences
D. Common coordinates, sequencing direction, and the same sequence
Explanation
PCR duplicates are identified as reads that share common genomic coordinates (start/end position), the same sequencing direction, and the same sequence — indicating they originated from the same amplified fragment rather than independent DNA molecules. Tools like Picard mark and remove these duplicates. Alternatively, PCR-free library protocols can be used if sufficient input DNA is available.
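The identification criterion can be sketched as a grouping key. This is a simplification: real tools such as Picard operate on BAM records and use clip-adjusted alignment coordinates, not plain dicts like the illustrative ones below.

```python
def duplicate_key(read):
    """Reads sharing this key (coordinates + strand + sequence)
    are candidate PCR duplicates; `read` is an assumed dict here."""
    return (read["chrom"], read["pos"], read["is_reverse"], read["seq"])

reads = [
    {"chrom": "chr1", "pos": 100, "is_reverse": False, "seq": "ACGT"},
    {"chrom": "chr1", "pos": 100, "is_reverse": False, "seq": "ACGT"},  # duplicate
    {"chrom": "chr1", "pos": 100, "is_reverse": True,  "seq": "ACGT"},  # other strand
]
unique = {duplicate_key(r): r for r in reads}  # keep one read per key
print(len(unique))  # 2 independent observations
```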
Q26 Medium
Why is duplicate removal important before variant calling?
A. Duplicates artificially inflate coverage and can give false support to variants
B. Duplicates reduce the mapping quality of all reads
C. Duplicates change the reference genome sequence
D. Duplicates decrease the file size of BAM files
Explanation
PCR duplicates are copies of the same DNA fragment. They artificially inflate coverage at certain positions, lending spuriously strong support to variants (including errors from the original fragment). This can lead to false positive variant calls. Removing duplicates ensures that only independent observations contribute to variant evidence.
Q27 Medium
Which of the following factors does NOT directly affect variant calling accuracy?
A. Base call quality of supporting reads
B. Proximity to homopolymer runs
C. The GC content of the entire genome
D. Mapping quality of the aligned reads
Explanation
Variant calling accuracy is affected by: base call quality, proximity to indels/homopolymer runs (which cause sequencing errors), mapping quality, and sequencing depth. The overall GC content of the genome affects sequencing coverage evenness but does not directly impact variant calling at a specific position in the way the other factors do.
Q28 Hard
What is the main advantage of joint variant calling over individual variant calling followed by merging?
A. It produces smaller VCF files
B. A low-confidence variant in one sample can be confirmed by evidence from other samples
C. It requires less computational resources
D. It does not require a reference genome
Explanation
Joint variant calling analyzes all samples simultaneously, allowing the caller to use cross-sample evidence. A variant that appears with low confidence in one sample (e.g., due to low coverage) can be confirmed if it appears confidently in other samples. In individual calling, a missing variant in some VCFs is ambiguous — it could be wild-type or just insufficient coverage. Joint calling resolves this ambiguity.
Q29 Tricky
A variant is called in a region with a homopolymer run of 8 adenines (AAAAAAAA). How should this variant be treated?
A. With caution — homopolymer regions are prone to sequencing errors that produce false positive variants
B. With high confidence — repetitive regions are easier to sequence accurately
C. It should be automatically accepted because modern callers handle homopolymers perfectly
D. It should be ignored — variants cannot occur in homopolymer regions
Explanation
Homopolymer runs are regions where the same nucleotide repeats many times (e.g., AAAAAAAA). Sequencing platforms are prone to insertion/deletion errors in these regions; Ion Torrent and Nanopore are especially affected, and even Illumina is not immune. Variants found in homopolymers are often false positives. Modern variant callers include filters for these regions, but they are not perfect. Manual inspection (e.g., in IGV) is recommended.
Q30 Easy
In a VCF file, which column contains the alternative (non-reference) allele?
A. REF
B. QUAL
C. ALT
D. INFO
Explanation
The VCF mandatory columns are: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, and then sample columns. ALT contains the alternative (non-reference) allele(s), comma-separated if more than one. REF contains the reference allele. QUAL is the Phred-scaled quality score. INFO contains extensible annotations.
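A VCF data line is tab-separated, so the first 8 mandatory columns split out directly (a minimal sketch that ignores FORMAT and sample columns; the example record is invented for illustration):

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Map the first 8 mandatory columns of a VCF data line to names."""
    fields = line.rstrip("\n").split("\t")
    return dict(zip(VCF_COLUMNS, fields))

rec = parse_vcf_line("chr1\t604\t.\tC\tG\t99\tPASS\tDP=33")
print(rec["REF"], rec["ALT"])  # C G
```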
Q31 Medium
In a VCF file, meta-information lines begin with:
A. @ (at sign)
B. > (greater-than sign)
C. # (single hash)
D. ## (double hash)
Explanation
In VCF files: ## (double hash) marks meta-information lines (key=value pairs describing filters, info fields, etc.); # (single hash) marks the column header line (CHROM, POS, ID, etc.). Don't confuse with FASTQ where @ begins each entry, or FASTA where > begins each entry. SAM headers use @.
Q32 Easy
Which tool determines the effect of variants on genes, transcripts, and protein sequence?
A. BWA-MEM
B. Ensembl Variant Effect Predictor (VEP)
C. FastQC
D. Picard
Explanation
VEP (Variant Effect Predictor) from Ensembl determines variant effects on genes, transcripts, and protein sequences. It also provides SIFT and PolyPhen-2 scores for protein-altering changes. Other annotation tools include SnpEff and ANNOVAR. BWA-MEM is an aligner, FastQC does quality control, and Picard handles duplicate removal.
Q33 Medium
A "gain of TFBS" variant means:
A. The transcription factor binding site exists only for the alternative allele of the SNP
B. The transcription factor binding site exists only for the reference allele
C. Both alleles have identical transcription factor binding affinity
D. The variant is located in a coding region and causes a missense change
Explanation
TFBS variant consequences: Loss of TFBS = binding site exists only for the reference (0) allele; Gain of TFBS = binding site exists only for the alternative (1) allele; Score-Change = binding affinity differs between alleles; No Change = both alleles predicted with same binding affinity. In entropy logos, larger letters indicate more critical positions — mutations there are more likely to impact binding.
Q34 Medium
In the albino donkey case study, what type of variant was identified in the TYR gene?
A. A frameshift deletion
B. A synonymous substitution
C. A splice site variant
D. A missense mutation (c.604C>G, p.His202Asp) disrupting copper binding in tyrosinase
Explanation
The albino donkeys from Asinara island carry a missense mutation c.604C>G in the TYR gene, causing a histidine to aspartate substitution at position 202 (p.His202Asp). This disrupts copper binding in the tyrosinase enzyme, inactivating melanin production. Parents are heterozygous (C/G), albino offspring are homozygous (G/G) — demonstrating autosomal recessive inheritance.
Q35 Easy
The correct order of file formats in a standard variant discovery pipeline is:
A. BAM → FASTQ → VCF
B. VCF → BAM → FASTQ
C. FASTQ → BAM → VCF
D. FASTQ → VCF → BAM
Explanation
The standard pipeline flows: FASTQ (raw reads) → alignment with BWA → BAM (aligned reads) → variant calling with GATK → VCF (variant calls) → annotation with VEP/SnpEff. Quality control and filtering occur between every step. This is the core workflow emphasized throughout the lecture.
Q36 Tricky
Unmapped reads (FLAG 4) from cattle WGS data could be useful for:
A. Improving the alignment of mapped reads
B. Metagenomics — detecting contaminant bacteria or viruses in the sample
C. Increasing the sequencing depth of the cattle genome
D. Generating a new reference genome
Explanation
Unmapped reads (FLAG 4) did not align to the host (cattle) genome. These can be extracted and re-aligned to microbial databases to identify contaminant bacteria, viruses, or other organisms. This is the foundation of metagenomics and is relevant to the "One Health" approach linking human, animal, and environmental health. This is a concept the lecture specifically highlights as a practical application of SAM flag filtering.
Q37 Easy
Galaxy is primarily described as:
A. An open, web-based platform for accessible, reproducible, and transparent computational research
B. A command-line only tool for variant calling
C. A commercial sequencing platform
D. A local software package requiring complex installation
Explanation
Galaxy is an open, web-based platform. Its key features are: accessibility (no programming required, point-and-click), reproducibility (captures all analysis details), transparency (users share histories and workflows), and community-centered design. It requires only an internet connection and a browser — no installation or complex commands needed.
Q38 Easy
The Galaxy interface consists of three main panels:
A. Upload, Download, and Settings
B. Code editor, Terminal, and Output
C. Data + Available tools, Run tools and view results, Analysis history
D. Alignment, Variant calling, and Annotation panels
Explanation
Galaxy has three main panels: (1) Left panel for data and available tools, (2) Middle/center panel for running tools and viewing results, and (3) Right panel for the analysis history which tracks all files and operations. The history panel shows files with three action buttons: view (eye icon), edit attributes (pencil), and delete.
Q39 — Open Calculation
You sequence a cattle genome (genome size = 2.7 Gbp) using Illumina paired-end 150 bp reads. You generate 600 million reads in total. (a) Calculate the sequencing depth. (b) If the minimum recommended depth for robust SNP detection is 10×, is this sufficient? (c) What is the Phred score corresponding to 99.99% base call accuracy?
✓ Model Answer

(a) Sequencing depth:

Depth = (N × L) / G
N = 600,000,000 reads; L = 150 bp; G = 2,700,000,000 bp
Depth = (600,000,000 × 150) / 2,700,000,000
= 90,000,000,000 / 2,700,000,000
= 33.3×

(b) Yes, 33.3× exceeds the minimum recommended ~10× for robust SNP detection. This depth provides high confidence for variant calling.

(c) Phred score for 99.99% accuracy:

P(error) = 1 − 0.9999 = 0.0001 = 10⁻⁴
Q = −10 × log₁₀(10⁻⁴) = −10 × (−4) = 40
Answer: Q40
Q40 — Open Short Answer
Describe the complete variant discovery pipeline from raw sequencing data to annotated variants. For each major step, name the input file format, the output file format, and one commonly used tool.
✓ Model Answer

The variant discovery pipeline consists of four major steps, each with QC/filtering between them:

1. Quality Control & Trimming: Input: FASTQ → Tool: FastQC (assessment), Trimmomatic or Prinseq (trimming) → Output: cleaned FASTQ. Checks per-base quality, GC content, duplication levels. Removes low-quality bases using sliding window or threshold approaches.

2. Alignment: Input: cleaned FASTQ → Tool: BWA-MEM (Burrows-Wheeler Transform) → Output: BAM file. Maps reads to the reference genome. BAM is the binary compressed version of SAM. Post-alignment: filter by mapping quality (MAPQ) and remove PCR duplicates (Picard).

3. Variant Calling: Input: filtered BAM → Tool: GATK (following GATK Best Practices) → Output: VCF file. Examines aligned bases at each position to identify SNPs and indels. Detection affected by base quality, proximity to homopolymers, mapping quality, and sequencing depth. Can be individual or joint calling across samples.

4. Variant Annotation: Input: VCF → Tool: Ensembl VEP, SnpEff, or ANNOVAR → Output: annotated VCF. Determines effect of variants on genes/transcripts/proteins (e.g., missense, frameshift, intronic, TFBS variants). Provides functional impact predictions (SIFT, PolyPhen-2).

Q41 — Open Tricky
Explain the difference between depth of coverage and breadth of coverage. Give a scenario where you might have high depth but low breadth, and explain why this would be problematic for variant discovery.
✓ Model Answer

Depth of coverage = the average number of times each base is read (expressed as "X", e.g., 30×). It measures redundancy and confidence. Formula: Depth = (N × L) / G.

Breadth of coverage = the percentage of the target genome covered at a minimum depth (expressed as %, e.g., 95% at 1×). It measures completeness.

High depth, low breadth scenario: In a PCR-amplified library with severe amplification bias, certain genomic regions may be vastly over-represented (giving very high local depth), while other regions receive no reads at all. The average depth might be reported as 30×, but large portions of the genome have 0× coverage. This means variants in uncovered regions are completely missed, making the analysis incomplete despite seemingly adequate depth. This is why breadth is especially important in clinical diagnostics where missing regions could mean missing pathogenic variants.

Another example: targeted sequencing (e.g., exome capture) inherently has high depth in target regions but low breadth over the whole genome — which is by design, but must be understood when interpreting results.

Q42 — Open Short Answer
Why is it important for a bioinformatician to understand the upstream wet-lab steps before analyzing NGS data? Give at least three specific examples of how lab decisions affect data analysis.
✓ Model Answer

A bioinformatician must understand what happened before data generation because ignoring upstream processes leads to incorrect assumptions or flawed analyses. Key examples:

1. PCR amplification vs. PCR-free: If PCR was used during library prep, duplicate reads are expected and must be removed (e.g., with Picard). Without knowing this, duplicates would be treated as independent evidence, inflating coverage and producing false positive variants. PCR-free protocols avoid this but require more input DNA.

2. DNA quality/degradation: Degraded DNA (e.g., from ancient samples, honey, or soil) produces shorter fragments. This affects which sequencing platform is appropriate — degraded DNA is unsuitable for long-read platforms (PacBio/Nanopore). Low-quality DNA also affects alignment success and error rates.

3. Sequencing platform choice: Different platforms have different error profiles. Illumina has very low error rates; Ion Torrent has higher error rates especially in homopolymer regions. Knowing the platform tells you which types of errors to expect and filter for.

4. Expected sequencing depth: If coverage was planned at 5× vs. 30×, the confidence in variant calls differs dramatically. Low-coverage data (1–5×) requires different analytical approaches than high-coverage data.

5. Fragmentation method: Random fragmentation vs. restriction enzymes produces different sequence content patterns visible in FastQC (e.g., consistent bases at read starts with restriction enzymes).