Quality Control in Next-Generation Sequencing

Introduction: Why Sequencing Isn't Perfect

Next-generation sequencing (NGS) has revolutionized genomics, but it's not error-free. Every sequencing run introduces errors, and understanding these errors is crucial for reliable variant discovery. In this article, we'll explore how errors occur, how quality is measured, and how to analyze sequencing data quality using Python.

⚠️

Warning

Poor quality control can lead to false variant calls, wasting weeks of downstream analysis. Always perform QC before proceeding!

How Sequencing Errors Happen

Sequencing errors occur at multiple stages of the NGS process. Let's understand the main sources:

1. Cluster Generation Errors

In Illumina sequencing, DNA fragments are amplified into clusters on a flow cell. Each cluster should contain identical copies of the same fragment.

What can go wrong:

Incomplete amplification: Some molecules in the cluster don't amplify properly
Mixed clusters: Multiple different DNA fragments amplify in the same location
Phasing errors: Molecules in a cluster get out of sync during sequencing

💻

Example: Phasing Error

Imagine sequencing the sequence "ATCGATCG":

Cycle 1: All molecules read "A" ✅
Cycle 2: All molecules read "T" ✅
Cycle 3: 99% read "C", but 1% lagged and still read "T" ⚠️
Cycle 4: Now signals are mixed - getting worse each cycle

Result: Quality degrades as the read progresses!

2. Terminator Not Removed

During sequencing-by-synthesis:

A fluorescent nucleotide with a reversible terminator is added
The terminator prevents the next nucleotide from being added
After imaging, the terminator should be cleaved off
Problem: If the terminator isn't removed, the sequence stops prematurely

This creates shorter reads and reduces coverage at later positions.

3. Optical Errors

Incorrect base calling: The imaging system misidentifies which fluorescent signal is present
Signal bleeding: Fluorescent signals from nearby clusters interfere with each other
Photobleaching: Fluorescent dyes fade over time, reducing signal strength

4. Biochemical Errors

Incorrect nucleotide incorporation: DNA polymerase occasionally adds the wrong base
Damaged bases: Pre-existing DNA damage can cause misreads
Secondary structures: GC-rich or repetitive regions can form structures that interfere with sequencing

🔬

Fact

Typical Illumina sequencing error rates are around 0.1-1%, meaning 99-99.9% of bases are correct. However, with billions of bases sequenced, this still means millions of errors!

Understanding Base Quality Scores

Since every base call can be wrong, sequencers assign a quality score to each base, representing the confidence that the base call is correct.

Probability vs Quality Score

Instead of storing raw probabilities, sequencing platforms use Phred quality scores:

📖

Definition: Phred Quality Score

Q = -10 × log₁₀(P)

Where P is the probability that the base call is incorrect.

Why Use Quality Scores Instead of Probabilities?

There are several practical reasons:

Easier to interpret: Q=30 is easier to remember than P=0.001
Compact storage: Single ASCII characters encode quality (more on this later)
Natural scale: Higher numbers = better quality (intuitive)
Historical: Originally developed for Sanger sequencing, now standard across platforms

Quality Score Reference Table

Quality Score (Q)	Error Probability (P)	Accuracy	Interpretation
Q10	1 in 10 (0.1)	90%	Low quality
Q20	1 in 100 (0.01)	99%	Acceptable
Q30	1 in 1,000 (0.001)	99.9%	Good quality
Q40	1 in 10,000 (0.0001)	99.99%	Excellent quality

💡

Tip

Q30 is generally considered the minimum acceptable quality for variant calling. Bases below Q20 are often filtered out.

Calculating Quality Scores

Let's see some examples:

Example 1: A base with 99% confidence (P = 0.01)

Q = -10 × log₁₀(0.01)
Q = -10 × (-2)
Q = 20

Example 2: A base with 99.9% confidence (P = 0.001)

Q = -10 × log₁₀(0.001)
Q = -10 × (-3)
Q = 30

❓

Question

If a base has a quality score of Q=25, what's the probability it's correct?

Click to see answer

P = 10^(-Q/10) = 10^(-25/10) = 10^(-2.5) ≈ 0.00316

So accuracy = 1 - 0.00316 = 99.68% correct

The FASTQ File Format

Sequencing data is typically stored in FASTQ format, which contains both the DNA sequence and quality scores for each base.

📖

Definition: FASTQ Format

A text-based format for storing both nucleotide sequences and their corresponding quality scores. Each read is represented by exactly 4 lines.

FASTQ File Structure

Each sequencing read takes exactly 4 lines:

@SEQ_ID                           ← Line 1: Header (starts with @)
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT  ← Line 2: Sequence
+                                 ← Line 3: Separator (starts with +)
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65  ← Line 4: Quality scores

Breaking it down:

Line 1 - Header: Starts with @, contains read identifier and optional description
- Example: @SRR123456.1 M01234:23:000000000-A1B2C:1:1101:15555:1234 1:N:0:1
Line 2 - Sequence: The actual DNA sequence (A, T, C, G, sometimes N for unknown)
Line 3 - Separator: Always starts with +, optionally repeats the header (usually just +)
Line 4 - Quality Scores: ASCII-encoded quality scores (one character per base)

ASCII Encoding of Quality Scores

Quality scores are encoded as single ASCII characters to save space. The encoding formula is:

ASCII_character = chr(Quality_Score + 33)

The +33 offset is called Phred+33 encoding (also known as Sanger format).

💻

Example: Quality Encoding

Quality score Q=30:

ASCII value = 30 + 33 = 63
Character = chr(63) = '?'

Quality score Q=40:

ASCII value = 40 + 33 = 73
Character = chr(73) = 'I'

Quality Character Reference

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
|         |         |         |         |
0        10        20        30        40

! = Q0 (worst quality, 50% error rate)
+ = Q10 (10% error rate)
5 = Q20 (1% error rate)
? = Q30 (0.1% error rate)
I = Q40 (0.01% error rate)

⚠️

Warning: Different Encodings Exist

Older Illumina data used Phred+64 encoding (adding 64 instead of 33). Always check which encoding your data uses! Modern data uses Phred+33.

Parsing FASTQ Files with Python

Now let's write Python code to read and analyze FASTQ files. We'll build this step-by-step, as if working in a Jupyter notebook.

Step 1: Reading a FASTQ File

First, let's write a function to parse FASTQ files:

def read_fastq(filename):
    """
    Read a FASTQ file and return lists of sequences and quality strings.
    
    Parameters:
    -----------
    filename : str
        Path to the FASTQ file
        
    Returns:
    --------
    sequences : list
        List of DNA sequences
    qualities : list
        List of quality strings (ASCII encoded)
    """
    sequences = []
    qualities = []
    
    with open(filename, 'r') as f:
        while True:
            # Read 4 lines at a time
            header = f.readline().strip()
            if not header:  # End of file
                break
            
            seq = f.readline().strip()
            plus = f.readline().strip()
            qual = f.readline().strip()
            
            sequences.append(seq)
            qualities.append(qual)
    
    return sequences, qualities

💡

Tip

For very large FASTQ files (common in NGS), consider using generators or the BioPython library to avoid loading everything into memory at once.

Step 2: Converting Phred+33 to Numeric Quality Scores

Now let's create a helper function to convert ASCII characters to numeric quality scores:

def phred33_to_q(qual_str):
    """
    Convert a Phred+33 encoded quality string to numeric quality scores.
    
    Parameters:
    -----------
    qual_str : str
        Quality string with ASCII-encoded scores
        
    Returns:
    --------
    list of int
        Numeric quality scores
    """
    return [ord(char) - 33 for char in qual_str]

Let's test it:

# Example quality string
example_qual = "!5?II"

# Convert to numeric scores
scores = phred33_to_q(example_qual)
print(f"Quality string: {example_qual}")
print(f"Numeric scores: {scores}")
print(f"Interpretation:")
for char, score in zip(example_qual, scores):
    error_prob = 10 ** (-score / 10)
    accuracy = (1 - error_prob) * 100
    print(f"  '{char}' → Q{score} → {accuracy:.2f}% accurate")

Output:

Quality string: !5?II
Numeric scores: [0, 20, 30, 40, 40]
Interpretation:
  '!' → Q0 → 50.00% accurate
  '5' → Q20 → 99.00% accurate
  '?' → Q30 → 99.90% accurate
  'I' → Q40 → 99.99% accurate
  'I' → Q40 → 99.99% accurate

Visualizing Quality Distributions

Step 3: Creating a Quality Score Histogram

Let's write a function to compute quality score distributions:

def quality_histogram(qualities, phred_offset=33):
    """
    Calculate histogram of quality scores across all bases.
    
    Parameters:
    -----------
    qualities : list of str
        List of quality strings from FASTQ
    phred_offset : int
        Phred encoding offset (33 for Phred+33, 64 for Phred+64)
        
    Returns:
    --------
    dict
        Dictionary with quality scores as keys and counts as values
    """
    from collections import Counter
    
    all_scores = []
    for qual_str in qualities:
        scores = [ord(char) - phred_offset for char in qual_str]
        all_scores.extend(scores)
    
    return Counter(all_scores)

Step 4: Visualizing with Matplotlib

import matplotlib.pyplot as plt
import numpy as np

def plot_quality_distribution(qualities, title="Quality Score Distribution"):
    """
    Plot histogram of quality scores.
    
    Parameters:
    -----------
    qualities : list of str
        List of quality strings from FASTQ
    title : str
        Plot title
    """
    # Get histogram data
    hist = quality_histogram(qualities)
    
    # Prepare data for plotting
    scores = sorted(hist.keys())
    counts = [hist[s] for s in scores]
    
    # Create plot
    plt.figure(figsize=(12, 6))
    plt.bar(scores, counts, color='steelblue', alpha=0.7, edgecolor='black')
    
    # Add reference lines for quality thresholds
    plt.axvline(x=20, color='orange', linestyle='--', linewidth=2, label='Q20 (99% accurate)')
    plt.axvline(x=30, color='green', linestyle='--', linewidth=2, label='Q30 (99.9% accurate)')
    
    plt.xlabel('Quality Score (Q)', fontsize=12)
    plt.ylabel('Number of Bases', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()

# Example usage:
# sequences, qualities = read_fastq('sample.fastq')
# plot_quality_distribution(qualities)

✅

Success

You now have a complete pipeline to read FASTQ files and visualize quality distributions!

Quality by Read Position

One of the most important QC checks is looking at how quality changes across read positions. Remember our phasing error example? Quality typically degrades toward the end of reads.

Step 5: Computing Mean Quality by Position

def quality_by_position(qualities, phred_offset=33):
    """
    Calculate mean quality score at each position along the read.
    
    Parameters:
    -----------
    qualities : list of str
        List of quality strings from FASTQ
    phred_offset : int
        Phred encoding offset
        
    Returns:
    --------
    positions : list
        Position numbers (0-indexed)
    mean_qualities : list
        Mean quality score at each position
    """
    # Find maximum read length
    max_len = max(len(q) for q in qualities)
    
    # Initialize lists to store quality scores at each position
    position_scores = [[] for _ in range(max_len)]
    
    # Collect all scores at each position
    for qual_str in qualities:
        scores = [ord(char) - phred_offset for char in qual_str]
        for pos, score in enumerate(scores):
            position_scores[pos].append(score)
    
    # Calculate mean at each position
    positions = list(range(max_len))
    mean_qualities = [np.mean(scores) if scores else 0 
                      for scores in position_scores]
    
    return positions, mean_qualities

Step 6: Plotting Quality by Position

def plot_quality_by_position(qualities, title="Quality Scores by Position"):
    """
    Plot mean quality score across read positions.
    
    Parameters:
    -----------
    qualities : list of str
        List of quality strings from FASTQ
    title : str
        Plot title
    """
    positions, mean_quals = quality_by_position(qualities)
    
    plt.figure(figsize=(14, 6))
    plt.plot(positions, mean_quals, linewidth=2, color='steelblue', marker='o', 
             markersize=3, markevery=5)
    
    # Add reference lines
    plt.axhline(y=20, color='orange', linestyle='--', linewidth=2, 
                label='Q20 threshold', alpha=0.7)
    plt.axhline(y=30, color='green', linestyle='--', linewidth=2, 
                label='Q30 threshold', alpha=0.7)
    
    plt.xlabel('Position in Read (bp)', fontsize=12)
    plt.ylabel('Mean Quality Score (Q)', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.ylim(0, max(mean_quals) + 5)
    plt.tight_layout()
    plt.show()

# Example usage:
# plot_quality_by_position(qualities, title="Quality Degradation Across Read")

📝

What to Look For

In a typical quality-by-position plot:

✅ Quality starts high (Q30-40) at the beginning
⚠️ Gradual decline is normal (phasing effects)
🚫 Sudden drops indicate problems (adapter contamination, chemistry issues)
🚫 Quality below Q20 for most of the read → consider re-sequencing

Analyzing GC Content

GC content analysis is another crucial quality control metric. Let's understand why it matters and how to analyze it.

Why Analyze GC Content?

📖

Definition: GC Content

GC content is the percentage of bases in a DNA sequence that are either Guanine (G) or Cytosine (C).

Formula: GC% = (G + C) / (A + T + G + C) × 100

Reasons to monitor GC content:

Bias detection: PCR amplification can be biased toward or against GC-rich regions
Contamination: Unexpected GC distribution may indicate adapter contamination or sample contamination
Coverage issues: Extreme GC content (very high or low) is harder to sequence accurately
Species verification: Different organisms have characteristic GC content ranges

🔬

Organism GC Content Examples

Humans: ~41% GC
E. coli: ~51% GC
P. falciparum (malaria parasite): ~19% GC (very AT-rich!)
Some bacteria: up to ~75% GC

Step 7: Calculating GC Content

def calculate_gc_content(sequence):
    """
    Calculate GC content percentage for a DNA sequence.
    
    Parameters:
    -----------
    sequence : str
        DNA sequence string
        
    Returns:
    --------
    float
        GC content as a percentage (0-100)
    """
    sequence = sequence.upper()
    gc_count = sequence.count('G') + sequence.count('C')
    total = len(sequence)
    
    if total == 0:
        return 0.0
    
    return (gc_count / total) * 100

def gc_content_per_read(sequences):
    """
    Calculate GC content for each read.
    
    Parameters:
    -----------
    sequences : list of str
        List of DNA sequences
        
    Returns:
    --------
    list of float
        GC content percentage for each read
    """
    return [calculate_gc_content(seq) for seq in sequences]

Step 8: Plotting GC Content Distribution

def plot_gc_distribution(sequences, expected_gc=None, title="GC Content Distribution"):
    """
    Plot histogram of GC content across all reads.
    
    Parameters:
    -----------
    sequences : list of str
        List of DNA sequences
    expected_gc : float, optional
        Expected GC content for the organism (will add reference line)
    title : str
        Plot title
    """
    gc_contents = gc_content_per_read(sequences)
    
    plt.figure(figsize=(12, 6))
    plt.hist(gc_contents, bins=50, color='steelblue', alpha=0.7, 
             edgecolor='black', linewidth=0.5)
    
    # Add reference line for expected GC content
    if expected_gc is not None:
        plt.axvline(x=expected_gc, color='red', linestyle='--', linewidth=2,
                   label=f'Expected GC: {expected_gc}%')
        plt.legend()
    
    # Add mean line
    mean_gc = np.mean(gc_contents)
    plt.axvline(x=mean_gc, color='green', linestyle='-', linewidth=2,
               label=f'Observed Mean: {mean_gc:.1f}%', alpha=0.7)
    
    plt.xlabel('GC Content (%)', fontsize=12)
    plt.ylabel('Number of Reads', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()

# Example usage:
# sequences, qualities = read_fastq('sample.fastq')
# plot_gc_distribution(sequences, expected_gc=41, title="Human Genome GC Content")

Step 9: GC Content by Position

Sometimes GC content varies along the read length, which can indicate:

Adapter sequences (usually very different GC content)
Random hexamer priming bias (in RNA-seq)
Fragmentation bias

def gc_by_position(sequences):
    """
    Calculate GC content at each position along reads.
    
    Parameters:
    -----------
    sequences : list of str
        List of DNA sequences
        
    Returns:
    --------
    positions : list
        Position numbers
    gc_percentages : list
        GC percentage at each position
    """
    max_len = max(len(seq) for seq in sequences)
    
    # Count G/C and total bases at each position
    gc_counts = [0] * max_len
    total_counts = [0] * max_len
    
    for seq in sequences:
        seq = seq.upper()
        for pos, base in enumerate(seq):
            if base in 'ATGC':
                total_counts[pos] += 1
                if base in 'GC':
                    gc_counts[pos] += 1
    
    # Calculate percentages
    positions = list(range(max_len))
    gc_percentages = [(gc_counts[i] / total_counts[i] * 100) if total_counts[i] > 0 else 0
                      for i in range(max_len)]
    
    return positions, gc_percentages

def plot_gc_by_position(sequences, expected_gc=None, 
                        title="GC Content by Position"):
    """
    Plot GC content across read positions.
    
    Parameters:
    -----------
    sequences : list of str
        List of DNA sequences
    expected_gc : float, optional
        Expected GC content percentage
    title : str
        Plot title
    """
    positions, gc_pcts = gc_by_position(sequences)
    
    plt.figure(figsize=(14, 6))
    plt.plot(positions, gc_pcts, linewidth=2, color='steelblue', 
             marker='o', markersize=3, markevery=5)
    
    if expected_gc is not None:
        plt.axhline(y=expected_gc, color='red', linestyle='--', linewidth=2,
                   label=f'Expected: {expected_gc}%', alpha=0.7)
        plt.legend()
    
    plt.xlabel('Position in Read (bp)', fontsize=12)
    plt.ylabel('GC Content (%)', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.ylim(0, 100)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

# Example usage:
# plot_gc_by_position(sequences, expected_gc=41)

⚠️

Warning Signs in GC Content

🚫 Sharp peaks/valleys: May indicate adapter contamination
🚫 Bimodal distribution: Possible mixed samples or contamination
🚫 Spike at read ends: Adapter sequences not trimmed
⚠️ Shift from expected: May indicate PCR bias or wrong reference

Putting It All Together: Complete QC Pipeline

Let's create a comprehensive quality control function:

def comprehensive_qc(fastq_file, expected_gc=None, output_prefix="qc"):
    """
    Perform comprehensive quality control on a FASTQ file.
    
    Parameters:
    -----------
    fastq_file : str
        Path to FASTQ file
    expected_gc : float, optional
        Expected GC content percentage
    output_prefix : str
        Prefix for output plot files
    """
    print("Reading FASTQ file...")
    sequences, qualities = read_fastq(fastq_file)
    
    print(f"Total reads: {len(sequences):,}")
    print(f"Mean read length: {np.mean([len(s) for s in sequences]):.1f} bp")
    
    # Calculate summary statistics
    all_quals = []
    for qual_str in qualities:
        all_quals.extend(phred33_to_q(qual_str))
    
    mean_q = np.mean(all_quals)
    median_q = np.median(all_quals)
    q20_pct = (np.sum(np.array(all_quals) >= 20) / len(all_quals)) * 100
    q30_pct = (np.sum(np.array(all_quals) >= 30) / len(all_quals)) * 100
    
    print(f"\nQuality Statistics:")
    print(f"  Mean quality: Q{mean_q:.1f}")
    print(f"  Median quality: Q{median_q:.1f}")
    print(f"  Bases ≥ Q20: {q20_pct:.2f}%")
    print(f"  Bases ≥ Q30: {q30_pct:.2f}%")
    
    gc_contents = gc_content_per_read(sequences)
    print(f"\nGC Content Statistics:")
    print(f"  Mean GC: {np.mean(gc_contents):.2f}%")
    print(f"  Median GC: {np.median(gc_contents):.2f}%")
    if expected_gc:
        print(f"  Expected GC: {expected_gc}%")
    
    # Generate plots
    print("\nGenerating QC plots...")
    plot_quality_distribution(qualities, title=f"Quality Score Distribution - {output_prefix}")
    plot_quality_by_position(qualities, title=f"Quality by Position - {output_prefix}")
    plot_gc_distribution(sequences, expected_gc=expected_gc, 
                        title=f"GC Content Distribution - {output_prefix}")
    plot_gc_by_position(sequences, expected_gc=expected_gc,
                       title=f"GC Content by Position - {output_prefix}")
    
    print("\nQC analysis complete!")
    
    # Return summary dictionary
    return {
        'n_reads': len(sequences),
        'mean_length': np.mean([len(s) for s in sequences]),
        'mean_quality': mean_q,
        'q20_percent': q20_pct,
        'q30_percent': q30_pct,
        'mean_gc': np.mean(gc_contents)
    }

#

Bioinformatics Forever