Sequence Alignment

Why Do We Align Sequences?

Because similarity reveals relationships.

If two protein or DNA sequences are similar, they likely:

  • Share a common ancestor (homology)
  • Have similar functions (we can transfer annotations)
  • Adopt similar structures (especially for proteins)

The core idea: Evolution preserves what works. Similar sequences suggest shared evolutionary history, which means shared function and structure.

Without alignment, we can't quantify similarity. Alignment gives us a systematic way to compare sequences and measure their relatedness.

Pairwise vs Multiple Sequence Alignment

FeaturePairwise AlignmentMultiple Sequence Alignment (MSA)
DefinitionAlign two sequencesAlign three or more sequences
PurposeFind similarity between two sequencesFind conserved regions across multiple sequences
AlgorithmsNeedleman-Wunsch (global)
Smith-Waterman (local)
Progressive (ClustalW, MUSCLE)
Iterative (MAFFT)
Consistency-based (T-Coffee)
ComplexityO(n²) - fastO(n^k) where k = number of sequences - slow
Common ToolsBLAST, FASTA
EMBOSS (Needle, Water)
ClustalW, ClustalOmega
MUSCLE, MAFFT
T-Coffee, Clustal Phi
OutputOne optimal alignmentConsensus of all sequences
Best ForComparing two proteins/genes
Database searches
Phylogenetic analysis
Finding conserved motifs
Family analysis

Pairwise Sequence Alignment

The basic scenario in bioinformatics:

  • You have a sequence of interest (newly discovered, unknown function)
  • You have a known sequence (well-studied, annotated)
  • Question: Are they similar?
  • Hypothesis: If similar, they might share function/structure

Sequence Identity

Sequence identity is the percentage of exact matches between aligned sequences.

Example:

Seq1: ACGTACGT
Seq2: ACGTCCGT
      ||||.|||
Identity: 7/8 = 87.5%

But identity alone doesn't tell the whole story - we need to consider biological similarity (similar but not identical amino acids).


Two Types of Sequence Alignment

Global Alignment

Goal: Align every residue in both sequences from start to end.

Residue = individual unit in a sequence:

  • For DNA/RNA: nucleotide (A, C, G, T/U)
  • For proteins: amino acid

How it works:

  • Start sequences at the same position
  • Optimize alignment by inserting gaps where needed
  • Forces alignment of entire sequences

Example (ASCII):

Seq1: ACGTACGT----
      |||| |||
Seq2: ACGTTACGTAGC

Best for: Sequences of similar length that are expected to be similar along their entire length.

Local Alignment

Goal: Find the most similar regions between sequences, ignoring less similar parts.

How it works:

  • Identify regions of high similarity
  • Ignore dissimilar terminals and regions
  • Can find multiple local alignments in the same pair

Example (ASCII):

Seq1:        GTACGT
             ||||||
Seq2: AAAAGTGTACGTCCCC

Only the middle region is aligned; terminals are ignored.

Best for:

  • Short sequence vs. longer sequence
  • Distantly related sequences
  • Finding conserved domains in otherwise divergent proteins

Scoring Alignments

Because there are many possible ways to align two sequences, we need a scoring function to assess alignment quality.

Simple Scoring: Percent Match

Basic approach: Count matches and calculate percentage.

Seq1: ACGTACGT
      |||| |||
Seq2: ACGTTCGT

Matches: 7/8 = 87.5%

Problem: This treats all mismatches equally. But some substitutions are more biologically likely than others.


Additive Scoring with Linear Gap Penalty

Better approach: Assign scores to matches, mismatches, and gaps.

Simple scoring scheme:

  • Match (SIM): +1
  • Mismatch: -1
  • Gap penalty (GAP): -1

Formula:

Score = Ī£[SIM(s1[pos], s2[pos])] + (gap_positions Ɨ GAP)

Example:

Seq1: ACGT-ACGT
      |||| ||||
Seq2: ACGTTACGT

Matches: 8 Ɨ (+1) = +8
Gap: 1 Ɨ (-1) = -1
Total Score = +7

Affine Gap Penalty: A Better Model

Problem with linear gap penalty: Five gaps in one place vs. five gaps in different places - which is more biologically realistic?

Answer: Consecutive gaps (one insertion/deletion event) are more likely than multiple separate events.

Affine gap penalty:

  • GOP (Gap Opening Penalty): Cost to START a gap (e.g., -5)
  • GEP (Gap Extension Penalty): Cost to EXTEND an existing gap (e.g., -1)

Formula:

Score = Ī£[SIM(s1[pos], s2[pos])] + (number_of_gaps Ɨ GOP) + (total_gap_length Ɨ GEP)

Example:

One gap of length 3: GOP + (3 Ɨ GEP) = -5 + (3 Ɨ -1) = -8
Three gaps of length 1: 3 Ɨ (GOP + GEP) = 3 Ɨ (-5 + -1) = -18

Consecutive gaps are penalized less - matches biological reality better.


DNA vs. Protein Level Alignment

The Problem

Consider these DNA sequences:

DNA1: CAC
DNA2: CAT
      ||.

At the DNA level: C matches C, A matches A, but C doesn't match T (67% identity).

But translate to protein:

CAC → Histidine (His)
CAT → Histidine (His)

Both code for the same amino acid! At the protein level, they're 100% identical.

Which Level to Use?

DNA alignment:

  • More sensitive to recent changes
  • Can detect synonymous mutations
  • Good for closely related sequences

Protein alignment:

  • Captures functional conservation
  • More robust for distant relationships
  • Ignores silent mutations

Rule of thumb: For evolutionary distant sequences, protein alignment is more informative because the genetic code is redundant - multiple codons can encode the same amino acid.


Substitution Matrices: Beyond Simple Scoring

The DNA Problem: Not All Mutations Are Equal

Transversion and Transition

Transitions (purine ↔ purine or pyrimidine ↔ pyrimidine):

  • A ↔ G
  • C ↔ T
  • More common in evolution

Transversions (purine ↔ pyrimidine):

  • A/G ↔ C/T
  • Less common (different ring structures)

Implication: Not all mismatches should have the same penalty. A transition should be penalized less than a transversion.

The Protein Problem: Chemical Similarity

Amino Acid Properties Venn Diagram

Amino acids have different chemical properties:

  • Hydrophobic vs. hydrophilic
  • Charged vs. neutral
  • Small vs. large
  • Aromatic vs. aliphatic

Key insight: Substitutions between chemically similar amino acids (same set in the diagram) occur with higher probability in evolution.

Example:

  • Leucine (Leu) → Isoleucine (Ile): Both hydrophobic, similar size → common
  • Leucine (Leu) → Aspartic acid (Asp): Hydrophobic → charged → rare

Problem: Venn diagrams aren't computer-friendly. We need numbers.

Solution: Substitution matrices.


PAM Matrices (Point Accepted Mutation)

PAM250 Matrix

Image Ā© Anthony S. Serianni. Used under fair use for educational purposes.
Source: https://www3.nd.edu/~aseriann/CHAP7B.html/sld017.htm

PAM matrices encode the probability of amino acid substitutions.

How to read the matrix:

  • This is a symmetric matrix (half shown, diagonal contains self-matches)
  • Diagonal values (e.g., Cys-Cys = 12): Score for matching the same amino acid
  • Off-diagonal values: Score for substituting one amino acid for another

Examples from PAM250:

  • Cys ↔ Cys: +12 (perfect match, high score)
  • Pro ↔ Leu: -3 (not very similar, small penalty)
  • Pro ↔ Trp: -6 (very different, larger penalty)

Key principle: Similar amino acids (chemically) have higher substitution probabilities and therefore higher scores in the matrix.

What Does PAM250 Mean?

PAM = Point Accepted Mutation

PAM1: 1% of amino acids have been substituted (very similar sequences)
PAM250: Extrapolated to 250 PAMs (very distant sequences)

Higher PAM number = more evolutionary distance = use for distantly related proteins


BLOSUM Matrices (BLOcks SUbstitution Matrix)

BLOSUM is another family of substitution matrices, built differently from PAM.

How BLOSUM is Built

Block database: Collections of ungapped, aligned sequences from related proteins.

Amino acids in the blocks are grouped by chemistry of the side chain (like in the Venn diagram).

Each value in the matrix is calculated by:

Frequency of (amino acid pair in database)
Ć·
Frequency expected by chance

Then converted to a log-odds score.

Interpreting BLOSUM Scores

Zero score:
Amino acid pair occurs as often as expected by random chance.

Positive score:
Amino acid pair occurs more often than by chance (conserved substitution).

Negative score:
Amino acid pair occurs less often than by chance (rare/unfavorable substitution).

BLOSUM Naming: The Percentage

BLOSUM62: Matrix built from blocks with no more than 62% similarity.

What this means:

  • BLOSUM62: Mid-range, general purpose
  • BLOSUM80: More related proteins (higher % identity)
  • BLOSUM45: Distantly related proteins (lower % identity)

Note: Higher number = MORE similar sequences used to build matrix.

Which BLOSUM to Use?

Depends on how related you think your sequences are:

Comparing two cow proteins?
Use BLOSUM80 (closely related species, expect high similarity)

Comparing human protein to bacteria?
Use BLOSUM45 (distantly related, expect low similarity)

Don't know how related they are?
Use BLOSUM62 (default, works well for most cases)


PAM vs. BLOSUM: Summary

FeaturePAMBLOSUM
Based onEvolutionary model (extrapolated mutations)Observed alignments (block database)
Numbers meanEvolutionary distance (PAM units)% similarity of sequences used
Higher numberMore distant sequencesMore similar sequences (opposite!)
PAM250 ā‰ˆBLOSUM45(both for distant proteins)
PAM100 ā‰ˆBLOSUM80(both for close proteins)
Most commonPAM250BLOSUM62

Key difference in naming:

  • PAM: Higher number = MORE evolutionary distance
  • BLOSUM: Higher number = LESS evolutionary distance (MORE similar sequences)

Which to use?

  • BLOSUM is more commonly used today (especially BLOSUM62)
  • PAM is more theoretically grounded but less practical
  • For most purposes: Start with BLOSUM62

Dynamic Programming

Please see the complete topic written in this seperate page

Needleman Wunsch Algorithm