Quality Control in Next-Generation Sequencing
Introduction: Why Sequencing Isn't Perfect
Next-generation sequencing (NGS) has revolutionized genomics, but it's not error-free. Every sequencing run introduces errors, and understanding these errors is crucial for reliable variant discovery. In this article, we'll explore how errors occur, how quality is measured, and how to analyze sequencing data quality using Python.
Poor quality control can lead to false variant calls, wasting weeks of downstream analysis. Always perform QC before proceeding!
How Sequencing Errors Happen
Sequencing errors occur at multiple stages of the NGS process. Let's understand the main sources:
1. Cluster Generation Errors
In Illumina sequencing, DNA fragments are amplified into clusters on a flow cell. Each cluster should contain identical copies of the same fragment.
What can go wrong:
- Incomplete amplification: Some molecules in the cluster don't amplify properly
- Mixed clusters: Multiple different DNA fragments amplify in the same location
- Phasing errors: Molecules in a cluster get out of sync during sequencing
Imagine sequencing the sequence "ATCGATCG":
- Cycle 1: All molecules read "A" ✅
- Cycle 2: All molecules read "T" ✅
- Cycle 3: 99% read "C", but 1% lagged and still read "T" ⚠️
- Cycle 4: Now signals are mixed - getting worse each cycle
Result: Quality degrades as the read progresses!
2. Terminator Not Removed
During sequencing-by-synthesis:
- A fluorescent nucleotide with a reversible terminator is added
- The terminator prevents the next nucleotide from being added
- After imaging, the terminator should be cleaved off
- Problem: If the terminator isn't removed, the sequence stops prematurely
This creates shorter reads and reduces coverage at later positions.
3. Optical Errors
- Incorrect base calling: The imaging system misidentifies which fluorescent signal is present
- Signal bleeding: Fluorescent signals from nearby clusters interfere with each other
- Photobleaching: Fluorescent dyes fade over time, reducing signal strength
4. Biochemical Errors
- Incorrect nucleotide incorporation: DNA polymerase occasionally adds the wrong base
- Damaged bases: Pre-existing DNA damage can cause misreads
- Secondary structures: GC-rich or repetitive regions can form structures that interfere with sequencing
Typical Illumina sequencing error rates are around 0.1-1%, meaning 99-99.9% of bases are correct. However, with billions of bases sequenced, this still means millions of errors!
Understanding Base Quality Scores
Since every base call can be wrong, sequencers assign a quality score to each base, representing the confidence that the base call is correct.
Probability vs Quality Score
Instead of storing raw probabilities, sequencing platforms use Phred quality scores:
Q = -10 × log₁₀(P)
Where P is the probability that the base call is incorrect.
Why Use Quality Scores Instead of Probabilities?
There are several practical reasons:
- Easier to interpret: Q=30 is easier to remember than P=0.001
- Compact storage: Single ASCII characters encode quality (more on this later)
- Natural scale: Higher numbers = better quality (intuitive)
- Historical: Originally developed for Sanger sequencing, now standard across platforms
Quality Score Reference Table
| Quality Score (Q) | Error Probability (P) | Accuracy | Interpretation |
|---|---|---|---|
| Q10 | 1 in 10 (0.1) | 90% | Low quality |
| Q20 | 1 in 100 (0.01) | 99% | Acceptable |
| Q30 | 1 in 1,000 (0.001) | 99.9% | Good quality |
| Q40 | 1 in 10,000 (0.0001) | 99.99% | Excellent quality |
Q30 is generally considered the minimum acceptable quality for variant calling. Bases below Q20 are often filtered out.
Calculating Quality Scores
Let's see some examples:
Example 1: A base with 99% confidence (P = 0.01)
Q = -10 × log₁₀(0.01)
Q = -10 × (-2)
Q = 20
Example 2: A base with 99.9% confidence (P = 0.001)
Q = -10 × log₁₀(0.001)
Q = -10 × (-3)
Q = 30
If a base has a quality score of Q=25, what's the probability it's correct?
Click to see answer
P = 10^(-Q/10) = 10^(-25/10) = 10^(-2.5) ≈ 0.00316
So accuracy = 1 - 0.00316 = 99.68% correct
The FASTQ File Format
Sequencing data is typically stored in FASTQ format, which contains both the DNA sequence and quality scores for each base.
A text-based format for storing both nucleotide sequences and their corresponding quality scores. Each read is represented by exactly 4 lines.
FASTQ File Structure
Each sequencing read takes exactly 4 lines:
@SEQ_ID ← Line 1: Header (starts with @)
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT ← Line 2: Sequence
+ ← Line 3: Separator (starts with +)
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 ← Line 4: Quality scores
Breaking it down:
-
Line 1 - Header: Starts with
@, contains read identifier and optional description- Example:
@SRR123456.1 M01234:23:000000000-A1B2C:1:1101:15555:1234 1:N:0:1
- Example:
-
Line 2 - Sequence: The actual DNA sequence (A, T, C, G, sometimes N for unknown)
-
Line 3 - Separator: Always starts with
+, optionally repeats the header (usually just+) -
Line 4 - Quality Scores: ASCII-encoded quality scores (one character per base)
ASCII Encoding of Quality Scores
Quality scores are encoded as single ASCII characters to save space. The encoding formula is:
ASCII_character = chr(Quality_Score + 33)
The +33 offset is called Phred+33 encoding (also known as Sanger format).
Quality score Q=30:
- ASCII value = 30 + 33 = 63
- Character = chr(63) = '?'
Quality score Q=40:
- ASCII value = 40 + 33 = 73
- Character = chr(73) = 'I'
Quality Character Reference
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
| | | | |
0 10 20 30 40
!= Q0 (worst quality, 50% error rate)+= Q10 (10% error rate)5= Q20 (1% error rate)?= Q30 (0.1% error rate)I= Q40 (0.01% error rate)
Older Illumina data used Phred+64 encoding (adding 64 instead of 33). Always check which encoding your data uses! Modern data uses Phred+33.
Parsing FASTQ Files with Python
Now let's write Python code to read and analyze FASTQ files. We'll build this step-by-step, as if working in a Jupyter notebook.
Step 1: Reading a FASTQ File
First, let's write a function to parse FASTQ files:
def read_fastq(filename):
"""
Read a FASTQ file and return lists of sequences and quality strings.
Parameters:
-----------
filename : str
Path to the FASTQ file
Returns:
--------
sequences : list
List of DNA sequences
qualities : list
List of quality strings (ASCII encoded)
"""
sequences = []
qualities = []
with open(filename, 'r') as f:
while True:
# Read 4 lines at a time
header = f.readline().strip()
if not header: # End of file
break
seq = f.readline().strip()
plus = f.readline().strip()
qual = f.readline().strip()
sequences.append(seq)
qualities.append(qual)
return sequences, qualities
For very large FASTQ files (common in NGS), consider using generators or the BioPython library to avoid loading everything into memory at once.
Step 2: Converting Phred+33 to Numeric Quality Scores
Now let's create a helper function to convert ASCII characters to numeric quality scores:
def phred33_to_q(qual_str):
"""
Convert a Phred+33 encoded quality string to numeric quality scores.
Parameters:
-----------
qual_str : str
Quality string with ASCII-encoded scores
Returns:
--------
list of int
Numeric quality scores
"""
return [ord(char) - 33 for char in qual_str]
Let's test it:
# Example quality string
example_qual = "!5?II"
# Convert to numeric scores
scores = phred33_to_q(example_qual)
print(f"Quality string: {example_qual}")
print(f"Numeric scores: {scores}")
print(f"Interpretation:")
for char, score in zip(example_qual, scores):
error_prob = 10 ** (-score / 10)
accuracy = (1 - error_prob) * 100
print(f" '{char}' → Q{score} → {accuracy:.2f}% accurate")
Output:
Quality string: !5?II
Numeric scores: [0, 20, 30, 40, 40]
Interpretation:
'!' → Q0 → 50.00% accurate
'5' → Q20 → 99.00% accurate
'?' → Q30 → 99.90% accurate
'I' → Q40 → 99.99% accurate
'I' → Q40 → 99.99% accurate
Visualizing Quality Distributions
Step 3: Creating a Quality Score Histogram
Let's write a function to compute quality score distributions:
def quality_histogram(qualities, phred_offset=33):
"""
Calculate histogram of quality scores across all bases.
Parameters:
-----------
qualities : list of str
List of quality strings from FASTQ
phred_offset : int
Phred encoding offset (33 for Phred+33, 64 for Phred+64)
Returns:
--------
dict
Dictionary with quality scores as keys and counts as values
"""
from collections import Counter
all_scores = []
for qual_str in qualities:
scores = [ord(char) - phred_offset for char in qual_str]
all_scores.extend(scores)
return Counter(all_scores)
Step 4: Visualizing with Matplotlib
import matplotlib.pyplot as plt
import numpy as np
def plot_quality_distribution(qualities, title="Quality Score Distribution"):
"""
Plot histogram of quality scores.
Parameters:
-----------
qualities : list of str
List of quality strings from FASTQ
title : str
Plot title
"""
# Get histogram data
hist = quality_histogram(qualities)
# Prepare data for plotting
scores = sorted(hist.keys())
counts = [hist[s] for s in scores]
# Create plot
plt.figure(figsize=(12, 6))
plt.bar(scores, counts, color='steelblue', alpha=0.7, edgecolor='black')
# Add reference lines for quality thresholds
plt.axvline(x=20, color='orange', linestyle='--', linewidth=2, label='Q20 (99% accurate)')
plt.axvline(x=30, color='green', linestyle='--', linewidth=2, label='Q30 (99.9% accurate)')
plt.xlabel('Quality Score (Q)', fontsize=12)
plt.ylabel('Number of Bases', fontsize=12)
plt.title(title, fontsize=14, fontweight='bold')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
# Example usage:
# sequences, qualities = read_fastq('sample.fastq')
# plot_quality_distribution(qualities)
You now have a complete pipeline to read FASTQ files and visualize quality distributions!
Quality by Read Position
One of the most important QC checks is looking at how quality changes across read positions. Remember our phasing error example? Quality typically degrades toward the end of reads.
Step 5: Computing Mean Quality by Position
def quality_by_position(qualities, phred_offset=33):
"""
Calculate mean quality score at each position along the read.
Parameters:
-----------
qualities : list of str
List of quality strings from FASTQ
phred_offset : int
Phred encoding offset
Returns:
--------
positions : list
Position numbers (0-indexed)
mean_qualities : list
Mean quality score at each position
"""
# Find maximum read length
max_len = max(len(q) for q in qualities)
# Initialize lists to store quality scores at each position
position_scores = [[] for _ in range(max_len)]
# Collect all scores at each position
for qual_str in qualities:
scores = [ord(char) - phred_offset for char in qual_str]
for pos, score in enumerate(scores):
position_scores[pos].append(score)
# Calculate mean at each position
positions = list(range(max_len))
mean_qualities = [np.mean(scores) if scores else 0
for scores in position_scores]
return positions, mean_qualities
Step 6: Plotting Quality by Position
def plot_quality_by_position(qualities, title="Quality Scores by Position"):
"""
Plot mean quality score across read positions.
Parameters:
-----------
qualities : list of str
List of quality strings from FASTQ
title : str
Plot title
"""
positions, mean_quals = quality_by_position(qualities)
plt.figure(figsize=(14, 6))
plt.plot(positions, mean_quals, linewidth=2, color='steelblue', marker='o',
markersize=3, markevery=5)
# Add reference lines
plt.axhline(y=20, color='orange', linestyle='--', linewidth=2,
label='Q20 threshold', alpha=0.7)
plt.axhline(y=30, color='green', linestyle='--', linewidth=2,
label='Q30 threshold', alpha=0.7)
plt.xlabel('Position in Read (bp)', fontsize=12)
plt.ylabel('Mean Quality Score (Q)', fontsize=12)
plt.title(title, fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(0, max(mean_quals) + 5)
plt.tight_layout()
plt.show()
# Example usage:
# plot_quality_by_position(qualities, title="Quality Degradation Across Read")
In a typical quality-by-position plot:
- ✅ Quality starts high (Q30-40) at the beginning
- ⚠️ Gradual decline is normal (phasing effects)
- 🚫 Sudden drops indicate problems (adapter contamination, chemistry issues)
- 🚫 Quality below Q20 for most of the read → consider re-sequencing
Analyzing GC Content
GC content analysis is another crucial quality control metric. Let's understand why it matters and how to analyze it.
Why Analyze GC Content?
GC content is the percentage of bases in a DNA sequence that are either Guanine (G) or Cytosine (C).
Formula: GC% = (G + C) / (A + T + G + C) × 100
Reasons to monitor GC content:
- Bias detection: PCR amplification can be biased toward or against GC-rich regions
- Contamination: Unexpected GC distribution may indicate adapter contamination or sample contamination
- Coverage issues: Extreme GC content (very high or low) is harder to sequence accurately
- Species verification: Different organisms have characteristic GC content ranges
- Humans: ~41% GC
- E. coli: ~51% GC
- P. falciparum (malaria parasite): ~19% GC (very AT-rich!)
- Some bacteria: up to ~75% GC
Step 7: Calculating GC Content
def calculate_gc_content(sequence):
"""
Calculate GC content percentage for a DNA sequence.
Parameters:
-----------
sequence : str
DNA sequence string
Returns:
--------
float
GC content as a percentage (0-100)
"""
sequence = sequence.upper()
gc_count = sequence.count('G') + sequence.count('C')
total = len(sequence)
if total == 0:
return 0.0
return (gc_count / total) * 100
def gc_content_per_read(sequences):
"""
Calculate GC content for each read.
Parameters:
-----------
sequences : list of str
List of DNA sequences
Returns:
--------
list of float
GC content percentage for each read
"""
return [calculate_gc_content(seq) for seq in sequences]
Step 8: Plotting GC Content Distribution
def plot_gc_distribution(sequences, expected_gc=None, title="GC Content Distribution"):
"""
Plot histogram of GC content across all reads.
Parameters:
-----------
sequences : list of str
List of DNA sequences
expected_gc : float, optional
Expected GC content for the organism (will add reference line)
title : str
Plot title
"""
gc_contents = gc_content_per_read(sequences)
plt.figure(figsize=(12, 6))
plt.hist(gc_contents, bins=50, color='steelblue', alpha=0.7,
edgecolor='black', linewidth=0.5)
# Add reference line for expected GC content
if expected_gc is not None:
plt.axvline(x=expected_gc, color='red', linestyle='--', linewidth=2,
label=f'Expected GC: {expected_gc}%')
plt.legend()
# Add mean line
mean_gc = np.mean(gc_contents)
plt.axvline(x=mean_gc, color='green', linestyle='-', linewidth=2,
label=f'Observed Mean: {mean_gc:.1f}%', alpha=0.7)
plt.xlabel('GC Content (%)', fontsize=12)
plt.ylabel('Number of Reads', fontsize=12)
plt.title(title, fontsize=14, fontweight='bold')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
# Example usage:
# sequences, qualities = read_fastq('sample.fastq')
# plot_gc_distribution(sequences, expected_gc=41, title="Human Genome GC Content")
Step 9: GC Content by Position
Sometimes GC content varies along the read length, which can indicate:
- Adapter sequences (usually very different GC content)
- Random hexamer priming bias (in RNA-seq)
- Fragmentation bias
def gc_by_position(sequences):
"""
Calculate GC content at each position along reads.
Parameters:
-----------
sequences : list of str
List of DNA sequences
Returns:
--------
positions : list
Position numbers
gc_percentages : list
GC percentage at each position
"""
max_len = max(len(seq) for seq in sequences)
# Count G/C and total bases at each position
gc_counts = [0] * max_len
total_counts = [0] * max_len
for seq in sequences:
seq = seq.upper()
for pos, base in enumerate(seq):
if base in 'ATGC':
total_counts[pos] += 1
if base in 'GC':
gc_counts[pos] += 1
# Calculate percentages
positions = list(range(max_len))
gc_percentages = [(gc_counts[i] / total_counts[i] * 100) if total_counts[i] > 0 else 0
for i in range(max_len)]
return positions, gc_percentages
def plot_gc_by_position(sequences, expected_gc=None,
title="GC Content by Position"):
"""
Plot GC content across read positions.
Parameters:
-----------
sequences : list of str
List of DNA sequences
expected_gc : float, optional
Expected GC content percentage
title : str
Plot title
"""
positions, gc_pcts = gc_by_position(sequences)
plt.figure(figsize=(14, 6))
plt.plot(positions, gc_pcts, linewidth=2, color='steelblue',
marker='o', markersize=3, markevery=5)
if expected_gc is not None:
plt.axhline(y=expected_gc, color='red', linestyle='--', linewidth=2,
label=f'Expected: {expected_gc}%', alpha=0.7)
plt.legend()
plt.xlabel('Position in Read (bp)', fontsize=12)
plt.ylabel('GC Content (%)', fontsize=12)
plt.title(title, fontsize=14, fontweight='bold')
plt.ylim(0, 100)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Example usage:
# plot_gc_by_position(sequences, expected_gc=41)
- 🚫 Sharp peaks/valleys: May indicate adapter contamination
- 🚫 Bimodal distribution: Possible mixed samples or contamination
- 🚫 Spike at read ends: Adapter sequences not trimmed
- ⚠️ Shift from expected: May indicate PCR bias or wrong reference
Putting It All Together: Complete QC Pipeline
Let's create a comprehensive quality control function:
def comprehensive_qc(fastq_file, expected_gc=None, output_prefix="qc"):
"""
Perform comprehensive quality control on a FASTQ file.
Parameters:
-----------
fastq_file : str
Path to FASTQ file
expected_gc : float, optional
Expected GC content percentage
output_prefix : str
Prefix for output plot files
"""
print("Reading FASTQ file...")
sequences, qualities = read_fastq(fastq_file)
print(f"Total reads: {len(sequences):,}")
print(f"Mean read length: {np.mean([len(s) for s in sequences]):.1f} bp")
# Calculate summary statistics
all_quals = []
for qual_str in qualities:
all_quals.extend(phred33_to_q(qual_str))
mean_q = np.mean(all_quals)
median_q = np.median(all_quals)
q20_pct = (np.sum(np.array(all_quals) >= 20) / len(all_quals)) * 100
q30_pct = (np.sum(np.array(all_quals) >= 30) / len(all_quals)) * 100
print(f"\nQuality Statistics:")
print(f" Mean quality: Q{mean_q:.1f}")
print(f" Median quality: Q{median_q:.1f}")
print(f" Bases ≥ Q20: {q20_pct:.2f}%")
print(f" Bases ≥ Q30: {q30_pct:.2f}%")
gc_contents = gc_content_per_read(sequences)
print(f"\nGC Content Statistics:")
print(f" Mean GC: {np.mean(gc_contents):.2f}%")
print(f" Median GC: {np.median(gc_contents):.2f}%")
if expected_gc:
print(f" Expected GC: {expected_gc}%")
# Generate plots
print("\nGenerating QC plots...")
plot_quality_distribution(qualities, title=f"Quality Score Distribution - {output_prefix}")
plot_quality_by_position(qualities, title=f"Quality by Position - {output_prefix}")
plot_gc_distribution(sequences, expected_gc=expected_gc,
title=f"GC Content Distribution - {output_prefix}")
plot_gc_by_position(sequences, expected_gc=expected_gc,
title=f"GC Content by Position - {output_prefix}")
print("\nQC analysis complete!")
# Return summary dictionary
return {
'n_reads': len(sequences),
'mean_length': np.mean([len(s) for s in sequences]),
'mean_quality': mean_q,
'q20_percent': q20_pct,
'q30_percent': q30_pct,
'mean_gc': np.mean(gc_contents)
}
#