Sequence Alignment
Why Do We Align Sequences?
Because similarity reveals relationships.
If two protein or DNA sequences are similar, they likely:
- Share a common ancestor (homology)
- Have similar functions (we can transfer annotations)
- Adopt similar structures (especially for proteins)
The core idea: Evolution preserves what works. Similar sequences suggest shared evolutionary history, which means shared function and structure.
Without alignment, we can't quantify similarity. Alignment gives us a systematic way to compare sequences and measure their relatedness.
Pairwise vs Multiple Sequence Alignment
| Feature | Pairwise Alignment | Multiple Sequence Alignment (MSA) |
|---|---|---|
| Definition | Align two sequences | Align three or more sequences |
| Purpose | Find similarity between two sequences | Find conserved regions across multiple sequences |
| Algorithms | Needleman-Wunsch (global) Smith-Waterman (local) | Progressive (ClustalW, MUSCLE) Iterative (MAFFT) Consistency-based (T-Coffee) |
| Complexity | O(n²) - fast | O(n^k) where k = number of sequences - slow |
| Common Tools | BLAST, FASTA EMBOSS (Needle, Water) | ClustalW, ClustalOmega MUSCLE, MAFFT T-Coffee, Clustal Phi |
| Output | One optimal alignment | Consensus of all sequences |
| Best For | Comparing two proteins/genes Database searches | Phylogenetic analysis Finding conserved motifs Family analysis |
Pairwise Sequence Alignment
The basic scenario in bioinformatics:
- You have a sequence of interest (newly discovered, unknown function)
- You have a known sequence (well-studied, annotated)
- Question: Are they similar?
- Hypothesis: If similar, they might share function/structure
Sequence Identity
Sequence identity is the percentage of exact matches between aligned sequences.
Example:
Seq1: ACGTACGT
Seq2: ACGTCCGT
||||.|||
Identity: 7/8 = 87.5%
But identity alone doesn't tell the whole story - we need to consider biological similarity (similar but not identical amino acids).
Two Types of Sequence Alignment
Global Alignment
Goal: Align every residue in both sequences from start to end.
Residue = individual unit in a sequence:
- For DNA/RNA: nucleotide (A, C, G, T/U)
- For proteins: amino acid
How it works:
- Start sequences at the same position
- Optimize alignment by inserting gaps where needed
- Forces alignment of entire sequences
Example (ASCII):
Seq1: ACGTACGT----
|||| |||
Seq2: ACGTTACGTAGC
Best for: Sequences of similar length that are expected to be similar along their entire length.
Local Alignment
Goal: Find the most similar regions between sequences, ignoring less similar parts.
How it works:
- Identify regions of high similarity
- Ignore dissimilar terminals and regions
- Can find multiple local alignments in the same pair
Example (ASCII):
Seq1: GTACGT
||||||
Seq2: AAAAGTGTACGTCCCC
Only the middle region is aligned; terminals are ignored.
Best for:
- Short sequence vs. longer sequence
- Distantly related sequences
- Finding conserved domains in otherwise divergent proteins
Scoring Alignments
Because there are many possible ways to align two sequences, we need a scoring function to assess alignment quality.
Simple Scoring: Percent Match
Basic approach: Count matches and calculate percentage.
Seq1: ACGTACGT
|||| |||
Seq2: ACGTTCGT
Matches: 7/8 = 87.5%
Problem: This treats all mismatches equally. But some substitutions are more biologically likely than others.
Additive Scoring with Linear Gap Penalty
Better approach: Assign scores to matches, mismatches, and gaps.
Simple scoring scheme:
- Match (SIM): +1
- Mismatch: -1
- Gap penalty (GAP): -1
Formula:
Score = Σ[SIM(s1[pos], s2[pos])] + (gap_positions à GAP)
Example:
Seq1: ACGT-ACGT
|||| ||||
Seq2: ACGTTACGT
Matches: 8 Ć (+1) = +8
Gap: 1 Ć (-1) = -1
Total Score = +7
Affine Gap Penalty: A Better Model
Problem with linear gap penalty: Five gaps in one place vs. five gaps in different places - which is more biologically realistic?
Answer: Consecutive gaps (one insertion/deletion event) are more likely than multiple separate events.
Affine gap penalty:
- GOP (Gap Opening Penalty): Cost to START a gap (e.g., -5)
- GEP (Gap Extension Penalty): Cost to EXTEND an existing gap (e.g., -1)
Formula:
Score = Σ[SIM(s1[pos], s2[pos])] + (number_of_gaps à GOP) + (total_gap_length à GEP)
Example:
One gap of length 3: GOP + (3 Ć GEP) = -5 + (3 Ć -1) = -8
Three gaps of length 1: 3 Ć (GOP + GEP) = 3 Ć (-5 + -1) = -18
Consecutive gaps are penalized less - matches biological reality better.
DNA vs. Protein Level Alignment
The Problem
Consider these DNA sequences:
DNA1: CAC
DNA2: CAT
||.
At the DNA level: C matches C, A matches A, but C doesn't match T (67% identity).
But translate to protein:
CAC ā Histidine (His)
CAT ā Histidine (His)
Both code for the same amino acid! At the protein level, they're 100% identical.
Which Level to Use?
DNA alignment:
- More sensitive to recent changes
- Can detect synonymous mutations
- Good for closely related sequences
Protein alignment:
- Captures functional conservation
- More robust for distant relationships
- Ignores silent mutations
Rule of thumb: For evolutionary distant sequences, protein alignment is more informative because the genetic code is redundant - multiple codons can encode the same amino acid.
Substitution Matrices: Beyond Simple Scoring
The DNA Problem: Not All Mutations Are Equal
Transitions (purine ā purine or pyrimidine ā pyrimidine):
- A ā G
- C ā T
- More common in evolution
Transversions (purine ā pyrimidine):
- A/G ā C/T
- Less common (different ring structures)
Implication: Not all mismatches should have the same penalty. A transition should be penalized less than a transversion.
The Protein Problem: Chemical Similarity
Amino acids have different chemical properties:
- Hydrophobic vs. hydrophilic
- Charged vs. neutral
- Small vs. large
- Aromatic vs. aliphatic
Key insight: Substitutions between chemically similar amino acids (same set in the diagram) occur with higher probability in evolution.
Example:
- Leucine (Leu) ā Isoleucine (Ile): Both hydrophobic, similar size ā common
- Leucine (Leu) ā Aspartic acid (Asp): Hydrophobic ā charged ā rare
Problem: Venn diagrams aren't computer-friendly. We need numbers.
Solution: Substitution matrices.
PAM Matrices (Point Accepted Mutation)
Image Ā© Anthony S. Serianni. Used under fair use for educational purposes.
Source: https://www3.nd.edu/~aseriann/CHAP7B.html/sld017.htm
PAM matrices encode the probability of amino acid substitutions.
How to read the matrix:
- This is a symmetric matrix (half shown, diagonal contains self-matches)
- Diagonal values (e.g., Cys-Cys = 12): Score for matching the same amino acid
- Off-diagonal values: Score for substituting one amino acid for another
Examples from PAM250:
- Cys ā Cys: +12 (perfect match, high score)
- Pro ā Leu: -3 (not very similar, small penalty)
- Pro ā Trp: -6 (very different, larger penalty)
Key principle: Similar amino acids (chemically) have higher substitution probabilities and therefore higher scores in the matrix.
What Does PAM250 Mean?
PAM = Point Accepted Mutation
PAM1: 1% of amino acids have been substituted (very similar sequences)
PAM250: Extrapolated to 250 PAMs (very distant sequences)
Higher PAM number = more evolutionary distance = use for distantly related proteins
BLOSUM Matrices (BLOcks SUbstitution Matrix)
BLOSUM is another family of substitution matrices, built differently from PAM.
How BLOSUM is Built
Block database: Collections of ungapped, aligned sequences from related proteins.
Amino acids in the blocks are grouped by chemistry of the side chain (like in the Venn diagram).
Each value in the matrix is calculated by:
Frequency of (amino acid pair in database)
Ć·
Frequency expected by chance
Then converted to a log-odds score.
Interpreting BLOSUM Scores
Zero score:
Amino acid pair occurs as often as expected by random chance.
Positive score:
Amino acid pair occurs more often than by chance (conserved substitution).
Negative score:
Amino acid pair occurs less often than by chance (rare/unfavorable substitution).
BLOSUM Naming: The Percentage
BLOSUM62: Matrix built from blocks with no more than 62% similarity.
What this means:
- BLOSUM62: Mid-range, general purpose
- BLOSUM80: More related proteins (higher % identity)
- BLOSUM45: Distantly related proteins (lower % identity)
Note: Higher number = MORE similar sequences used to build matrix.
Which BLOSUM to Use?
Depends on how related you think your sequences are:
Comparing two cow proteins?
Use BLOSUM80 (closely related species, expect high similarity)
Comparing human protein to bacteria?
Use BLOSUM45 (distantly related, expect low similarity)
Don't know how related they are?
Use BLOSUM62 (default, works well for most cases)
PAM vs. BLOSUM: Summary
| Feature | PAM | BLOSUM |
|---|---|---|
| Based on | Evolutionary model (extrapolated mutations) | Observed alignments (block database) |
| Numbers mean | Evolutionary distance (PAM units) | % similarity of sequences used |
| Higher number | More distant sequences | More similar sequences (opposite!) |
| PAM250 ā | BLOSUM45 | (both for distant proteins) |
| PAM100 ā | BLOSUM80 | (both for close proteins) |
| Most common | PAM250 | BLOSUM62 |
Key difference in naming:
- PAM: Higher number = MORE evolutionary distance
- BLOSUM: Higher number = LESS evolutionary distance (MORE similar sequences)
Which to use?
- BLOSUM is more commonly used today (especially BLOSUM62)
- PAM is more theoretically grounded but less practical
- For most purposes: Start with BLOSUM62
Dynamic Programming
Please see the complete topic written in this seperate page