String Metrics and Alignments
1. String Metrics and Alignments
What is a String Metric?
A string metric measures the distance or similarity between two text strings. We use these in bioinformatics to compare DNA, RNA, or protein sequences.
Edit Distance
Edit distances measure how many operations you need to transform one string into another. The operations can be:
Substitution: Replace one character with another
CAT → CUT
|
A → U
Insertion/Deletion (Indel): Add or remove a character
MONKEY → MON-EY (deletion of K, or viewing it differently, insertion of gap)
Transposition: Swap adjacent characters
FORM → FROM
Each operation has a cost. The total cost defines the distance.
Distance vs Similarity
Distance metric: Range is [0, +∞)
- Score of 0 means identical strings
- Higher scores mean more different strings
Similarity metric: Range is (-∞, +∞)
- High positive score means similar strings
- Negative score means distant strings
In bioinformatics, we typically work with similarity because we want to find sequences that are alike.
2. Hamming Distance
The simplest edit distance. It only counts substitutions and requires strings of equal length.
Example
ACCCTCGCTAGATAAATAGATCTGATAG
||x||||||||||x|||||x||||x|||
ACTCTCGCTAGATGAATAGGTCTGTTAG
Count the mismatches (marked with x): positions 3, 14, 19, 25
Hamming Distance = 4
Converting to Similarity
For the example above:
- Total positions: 28
- Matches: 24
- Mismatches: 4
Similarity = +24 - 4 = 20
(We add +1 for each match, subtract 1 for each mismatch)
Implementation: AlnSeq Class (Basic)
We start by creating a class to hold two sequences and compute Hamming metrics.
Step 1: Define the class and constructor
class AlnSeq:
def __init__(self, seq1, seq2):
if len(seq1) != len(seq2):
raise ValueError("Sequences must have equal length")
self.seq1 = seq1
self.seq2 = seq2
Step 2: Add Hamming distance method
def compute_hamming_distance(self):
self.distance = 0
for i in range(len(self.seq1)):
if self.seq1[i] != self.seq2[i]:
self.distance += 1
return self.distance
Step 3: Add Hamming similarity method
def compute_hamming_similarity(self):
if getattr(self, 'distance', None) == None:
self.compute_hamming_distance()
# +1 * numbers of matchs + (-1 * numbers of mismatchs)
# It's possible simplify this math experssion, but left for readabilty
return (len(self.seq1) - self.distance) - self.distance
Usage
aln = AlnSeq("ACCCTCGCTAG", "ACTCTCGCTAG")
print(aln.compute_hamming_distance()) # 1
print(aln.compute_hamming_similarity()) # 9
3. Biologically Relevant Substitution Matrices
Not all substitutions are equal in biology. Some amino acids or nucleotides are chemically similar, so swapping between them is "less bad" than swapping very different ones.
Transition/Transversion Matrix (for DNA)
Transitions are mutations between chemically similar bases:
- A ↔ G (both purines)
- T ↔ C (both pyrimidines)
Transversions are mutations between different types:
- A ↔ T, A ↔ C, G ↔ T, G ↔ C
A T C G
A 2 -1 -1 0
T -1 2 0 -1
C -1 0 2 -1
G 0 -1 -1 2
Example calculation:
AAAA
|XX/
ATCG
Score = Score(A,A) + Score(A,T) + Score(A,C) + Score(A,G) = 2 + (-1) + (-1) + 0 = 0
PAM Matrices (for Proteins)
PAM = Point Accepted Mutation
PAMn[i,j] gives the likelihood of residue i being replaced by residue j through evolutionary mutations over a time when n mutations occur per 100 residues.
Most commonly used: PAM250
BLOSUM Matrices (for Proteins)
BLOSUMn[i,j] is based on observed substitution frequencies in aligned protein sequences that share n% sequence identity.
Most commonly used: BLOSUM62
Implementation: SubstitutionMatrix Class
A class to store and access substitution matrix values.
Step 1: Define the class with matrix data
class SubstitutionMatrix:
def __init__(self, matrix_dict):
# matrix_dict is a nested dictionary
# matrix_dict['A']['T'] gives score for A->T
self.matrix = matrix_dict
Step 2: Implement getitem for easy indexing
This lets us use matrix['A', 'T'] syntax.
def __getitem__(self, key):
# key is a tuple like ('A', 'T')
res1, res2 = key
return self.matrix[res1][res2]
Step 3: Create the Transition/Transversion matrix
# Define the matrix as nested dictionaries
tt_matrix_data = {
'A': {'A': 2, 'T': -1, 'C': -1, 'G': 0},
'T': {'A': -1, 'T': 2, 'C': 0, 'G': -1},
'C': {'A': -1, 'T': 0, 'C': 2, 'G': -1},
'G': {'A': 0, 'T': -1, 'C': -1, 'G': 2}
}
tt_matrix = SubstitutionMatrix(tt_matrix_data)
Usage
print(tt_matrix['A', 'A']) # 2 (match)
print(tt_matrix['A', 'G']) # 0 (transition)
print(tt_matrix['A', 'T']) # -1 (transversion)
Implementation: Extend AlnSeq with Matrix-Based Similarity
Add a method to compute similarity using any substitution matrix.
class AlnSeq:
# ... previous methods ...
def hamming_similarity_matrix(self, sub_matrix):
if len(self.seq1) != len(self.seq2):
raise ValueError("Sequences must have equal length")
score = 0
for i in range(len(self.seq1)):
# Use the substitution matrix for scoring
score += sub_matrix[self.seq1[i], self.seq2[i]]
return score
Usage
aln = AlnSeq("AAAA", "ATCG")
print(aln.hamming_similarity_matrix(tt_matrix)) # 2 + (-1) + (-1) + 0 = 0
4. Levenshtein Distance
Unlike Hamming distance, Levenshtein allows insertions and deletions (indels), not just substitutions.
Example
KITTEN → SITTING
Step 1: K → S (substitution)
Step 2: E → I (substitution)
Step 3: insert G (insertion)
Levenshtein Distance = 3
Levenshtein vs Hamming
For strings of the same length: Hamming Distance ≥ Levenshtein Distance
FLAW vs LAWN
Hamming (no gaps allowed):
FLAW
XXXX
LAWN
Hamming Distance = 4
Levenshtein (gaps allowed):
-FLAW
|||^
-LAWN
Levenshtein Distance = 2 (delete F, insert N)
This shows why we need sequence alignment.
5. Sequence Alignment
Each character of one sequence is matched with:
- A character in the other sequence, OR
- A gap (-)
After alignment, both gapped strings have the same length.
How Many Possible Alignments?
Simple case (no internal gaps): Just slide sequences past each other
--TCA -TCA TCA TCA TCA- TCA--
TA--- TA-- TA- -TA --TA ---TA
Number of alignments = len1 + len2 + 1
Complex case (internal gaps allowed): The number explodes
| 1 if len1 = 0
N(len1, len2) = | 1 if len2 = 0
| N(len1-1, len2) + N(len1, len2-1) + N(len1-1, len2-1) otherwise
Examples:
- N(3,3) = 63
- N(5,5) = 1,683
- N(9,9) = 1,462,563
Brute force is not feasible. We need a smarter approach.
Implementation: Counting Possible Alignments
A recursive function to calculate the number of possible alignments.
def count_alignments(len1, len2):
# Base cases: if one sequence is empty,
# only one alignment is possible (all gaps)
if len1 == 0:
return 1
if len2 == 0:
return 1
# Recursive case: three choices at each position
return (count_alignments(len1 - 1, len2) + # gap in seq2
count_alignments(len1, len2 - 1) + # gap in seq1
count_alignments(len1 - 1, len2 - 1)) # match/mismatch
Usage
print(count_alignments(3, 3)) # 63
print(count_alignments(5, 5)) # 1683
print(count_alignments(9, 9)) # 1462563
This function is slow because it recalculates the same subproblems many times. That is exactly why we need dynamic programming.
6. Dynamic Programming
An algorithmic paradigm that works when a problem has:
-
Optimal Substructure: The optimal solution depends on solutions to smaller sub-problems
-
Overlapping Subproblems: The same sub-problems are solved multiple times
Key idea: Store solutions to sub-problems and build up to the full solution.
7. Needleman-Wunsch Algorithm (Global Alignment)
Finds the best alignment across the entire length of both sequences.
Setup
Given:
- Sequence 1: TCA
- Sequence 2: TA
- Gap penalty: -2
- Substitution scores: Transition/Transversion matrix
Step 1: Initialize the Matrix
Create a matrix with sequence 1 across the top and sequence 2 down the side. Add a gap column/row at the start.
- T C A
- 0 -2 -4 -6
T -2
A -4
First row: Each cell = previous + gap penalty (0, -2, -4, -6) First column: Same logic (0, -2, -4, -6)
Step 2: Fill the Matrix
For each cell (i,j), take the maximum of three options:
Cell[i,j] = max(
Cell[i-1, j] + gap, // come from above (gap in seq1)
Cell[i, j-1] + gap, // come from left (gap in seq2)
Cell[i-1, j-1] + match(i,j) // come from diagonal (match/mismatch)
)
Filling cell (T,T):
max(-2 + (-2), 0 + match(T,T), -2 + (-2))
= max(-4, 0 + 2, -4)
= max(-4, 2, -4)
= 2
Complete filled matrix:
- T C A
- 0 -2 -4 -6
T -2 2 0 -2
A -4 0 1 2
The bottom-right cell (2) is the optimal alignment score.
Step 3: Backtracking
Start from bottom-right. At each cell, go back the way you came:
- Diagonal: match characters
- Up: gap in sequence 1
- Left: gap in sequence 2
- T C A
- 0 ← -2 ← -4 ← -6
↑ ↖↑ ↖↑ ↖↑
T -2 ← 2 ← 0 ← -2
↑ ↖ ↖ ↖
A -4 ← 0 ← 1 ← [2]
Path from [2]: diagonal → diagonal → diagonal
Final Alignment:
TCA
|X|
T-A
Score = 2
8. Smith-Waterman Algorithm (Local Alignment)
Finds the best alignment between substrings of the sequences.
Differences from Needleman-Wunsch
-
Initialization: First row and column are all zeros (no penalty for starting/ending gaps)
-
Filling: Add a fourth option, zero:
Cell[i,j] = max(
Cell[i-1, j] + gap,
Cell[i, j-1] + gap,
Cell[i-1, j-1] + match(i,j),
0 // NEW: discard negative paths
)
- Backtracking: Start from the highest value in the matrix (not bottom-right). Stop when you hit zero.
Example
Sequences: TGA and GA, Gap penalty: -2
Initialization:
- T G A
- 0 0 0 0
G 0
A 0
Filling cell (G,T):
max(0 + (-2), 0 + match(T,G), 0 + (-2), 0)
= max(-2, -1, -2, 0)
= 0
Complete matrix:
- T G A
- 0 0 0 0
G 0 0 2 0
A 0 0 0 [4]
Backtracking from highest value [4]:
- (A,A): came from diagonal, match A-A
- (G,G): came from diagonal, match G-G
- (G,T): value is 0, STOP
Final Local Alignment:
GA
||
GA
Score = 4
The algorithm found the best matching substring, ignoring the T at the beginning.
Implementation: SeqPair Class (Needleman-Wunsch)
Step 1: Define the class
class SeqPair:
def __init__(self, seq1, seq2, sub_matrix, gap_penalty=-2):
self.seq1 = seq1
self.seq2 = seq2
self.sub_matrix = sub_matrix
self.gap = gap_penalty
Step 2: Initialize the scoring matrix
def _init_matrix(self, rows, cols):
# Create matrix filled with zeros
matrix = [[0] * cols for _ in range(rows)]
return matrix
Step 3: Needleman-Wunsch matrix filling
def needleman_wunsch(self):
rows = len(self.seq2) + 1
cols = len(self.seq1) + 1
# Initialize score matrix
score = self._init_matrix(rows, cols)
# Initialize traceback matrix
# 0 = diagonal, 1 = up, 2 = left
trace = self._init_matrix(rows, cols)
# Fill first row (gaps in seq2)
for j in range(1, cols):
score[0][j] = score[0][j-1] + self.gap
trace[0][j] = 2 # came from left
# Fill first column (gaps in seq1)
for i in range(1, rows):
score[i][0] = score[i-1][0] + self.gap
trace[i][0] = 1 # came from up
# Fill rest of matrix
for i in range(1, rows):
for j in range(1, cols):
# Three options
diag = score[i-1][j-1] + self.sub_matrix[self.seq1[j-1], self.seq2[i-1]]
up = score[i-1][j] + self.gap
left = score[i][j-1] + self.gap
# Take maximum
score[i][j] = max(diag, up, left)
# Record which direction we came from
if score[i][j] == diag:
trace[i][j] = 0
elif score[i][j] == up:
trace[i][j] = 1
else:
trace[i][j] = 2
self.nw_score = score
self.nw_trace = trace
return score[rows-1][cols-1] # final score
Step 4: Backtracking to get alignment
def nw_traceback(self):
aligned1 = ""
aligned2 = ""
# Start from bottom-right
i = len(self.seq2)
j = len(self.seq1)
while i > 0 or j > 0:
if i > 0 and j > 0 and self.nw_trace[i][j] == 0:
# Diagonal: match/mismatch
aligned1 = self.seq1[j-1] + aligned1
aligned2 = self.seq2[i-1] + aligned2
i -= 1
j -= 1
elif i > 0 and self.nw_trace[i][j] == 1:
# Up: gap in seq1
aligned1 = "-" + aligned1
aligned2 = self.seq2[i-1] + aligned2
i -= 1
else:
# Left: gap in seq2
aligned1 = self.seq1[j-1] + aligned1
aligned2 = "-" + aligned2
j -= 1
return aligned1, aligned2
Usage
seq_pair = SeqPair("TCA", "TA", tt_matrix, gap_penalty=-2)
score = seq_pair.needleman_wunsch()
print(f"Score: {score}") # Score: 2
aln1, aln2 = seq_pair.nw_traceback()
print(aln1) # TCA
print(aln2) # T-A
Implementation: Smith-Waterman (Extend SeqPair)
Add local alignment method
def smith_waterman(self):
rows = len(self.seq2) + 1
cols = len(self.seq1) + 1
score = self._init_matrix(rows, cols)
trace = self._init_matrix(rows, cols)
# First row and column stay 0 (no initialization needed)
# Track position of maximum score
max_score = 0
max_pos = (0, 0)
# Fill matrix
for i in range(1, rows):
for j in range(1, cols):
diag = score[i-1][j-1] + self.sub_matrix[self.seq1[j-1], self.seq2[i-1]]
up = score[i-1][j] + self.gap
left = score[i][j-1] + self.gap
# Take maximum, but floor at 0
score[i][j] = max(diag, up, left, 0)
# Record direction (only if not 0)
if score[i][j] == 0:
trace[i][j] = -1 # stop here
elif score[i][j] == diag:
trace[i][j] = 0
elif score[i][j] == up:
trace[i][j] = 1
else:
trace[i][j] = 2
# Update max position
if score[i][j] > max_score:
max_score = score[i][j]
max_pos = (i, j)
self.sw_score = score
self.sw_trace = trace
self.sw_max_pos = max_pos
return max_score
Add local alignment backtracking
def sw_traceback(self):
aligned1 = ""
aligned2 = ""
# Start from maximum position
i, j = self.sw_max_pos
# Stop when we hit 0
while i > 0 and j > 0 and self.sw_trace[i][j] != -1:
if self.sw_trace[i][j] == 0:
aligned1 = self.seq1[j-1] + aligned1
aligned2 = self.seq2[i-1] + aligned2
i -= 1
j -= 1
elif self.sw_trace[i][j] == 1:
aligned1 = "-" + aligned1
aligned2 = self.seq2[i-1] + aligned2
i -= 1
else:
aligned1 = self.seq1[j-1] + aligned1
aligned2 = "-" + aligned2
j -= 1
return aligned1, aligned2
Usage
seq_pair = SeqPair("TGA", "GA", tt_matrix, gap_penalty=-2)
score = seq_pair.smith_waterman()
print(f"Score: {score}") # Score: 4
aln1, aln2 = seq_pair.sw_traceback()
print(aln1) # GA
print(aln2) # GA
9. Global vs Local: When to Use Which
Needleman-Wunsch (Global):
- Use when comparing sequences of similar length
- Use when you expect the sequences to be related across their entire length
- Example: comparing two versions of the same gene
Smith-Waterman (Local):
- Use when looking for conserved regions within longer sequences
- Use when sequences have very different lengths
- Example: finding a motif within a larger protein
10. Implementation Classes
AlnSeq Class
Holds two sequences and computes:
- Hamming distance
- Hamming similarity
- Levenshtein similarity (for pre-aligned sequences)
- Custom str for visual output:
Where: | = match, X = substitution, ^ = insertion, v = deletionACTG |X|^ AGTG-
SubstitutionMatrix Class
- Stores the scoring matrix values
- Implements getitem for easy access:
matrix['A', 'T']returns the score
SeqPair Class
- Computes alignment matrices (Needleman-Wunsch and Smith-Waterman)
- Reconstructs alignments through backtracking
Implementation: Extend AlnSeq with Levenshtein and str
Levenshtein similarity for aligned sequences
This assumes sequences are already aligned (may contain gaps).
class AlnSeq:
# ... previous methods ...
def levenshtein_similarity(self, sub_matrix, gap_penalty=-2):
# For aligned sequences (with gaps already inserted)
if len(self.seq1) != len(self.seq2):
raise ValueError("Aligned sequences must have equal length")
score = 0
for i in range(len(self.seq1)):
c1 = self.seq1[i]
c2 = self.seq2[i]
if c1 == '-' or c2 == '-':
# Gap penalty
score += gap_penalty
else:
# Use substitution matrix
score += sub_matrix[c1, c2]
return score
Custom str method for visualization
def __str__(self):
if len(self.seq1) != len(self.seq2):
return f"{self.seq1}\n{self.seq2}"
# Build match string
match_str = ""
for i in range(len(self.seq1)):
c1 = self.seq1[i]
c2 = self.seq2[i]
if c1 == c2:
match_str += "|" # identity
elif c1 == '-':
match_str += "^" # insertion (gap in seq1)
elif c2 == '-':
match_str += "v" # deletion (gap in seq2)
else:
match_str += "X" # substitution
return f"{self.seq1}\n{match_str}\n{self.seq2}"
Usage
aln = AlnSeq("TCA", "T-A")
print(aln)
# Output:
# TCA
# |v|
# T-A
print(aln.levenshtein_similarity(tt_matrix, gap_penalty=-2)) # 2 + (-2) + 2 = 2
Complete AlnSeq Class
class AlnSeq:
def __init__(self, seq1, seq2):
self.seq1 = seq1
self.seq2 = seq2
def hamming_distance(self):
if len(self.seq1) != len(self.seq2):
raise ValueError("Sequences must have equal length")
distance = 0
for i in range(len(self.seq1)):
if self.seq1[i] != self.seq2[i]:
distance += 1
return distance
def hamming_similarity(self):
if len(self.seq1) != len(self.seq2):
raise ValueError("Sequences must have equal length")
score = 0
for i in range(len(self.seq1)):
if self.seq1[i] == self.seq2[i]:
score += 1
else:
score -= 1
return score
def hamming_similarity_matrix(self, sub_matrix):
if len(self.seq1) != len(self.seq2):
raise ValueError("Sequences must have equal length")
score = 0
for i in range(len(self.seq1)):
score += sub_matrix[self.seq1[i], self.seq2[i]]
return score
def levenshtein_similarity(self, sub_matrix, gap_penalty=-2):
if len(self.seq1) != len(self.seq2):
raise ValueError("Aligned sequences must have equal length")
score = 0
for i in range(len(self.seq1)):
c1 = self.seq1[i]
c2 = self.seq2[i]
if c1 == '-' or c2 == '-':
score += gap_penalty
else:
score += sub_matrix[c1, c2]
return score
def __str__(self):
if len(self.seq1) != len(self.seq2):
return f"{self.seq1}\n{self.seq2}"
match_str = ""
for i in range(len(self.seq1)):
c1 = self.seq1[i]
c2 = self.seq2[i]
if c1 == c2:
match_str += "|"
elif c1 == '-':
match_str += "^"
elif c2 == '-':
match_str += "v"
else:
match_str += "X"
return f"{self.seq1}\n{match_str}\n{self.seq2}"
Complete SubstitutionMatrix Class
class SubstitutionMatrix:
def __init__(self, matrix_dict):
self.matrix = matrix_dict
def __getitem__(self, key):
res1, res2 = key
return self.matrix[res1][res2]
# Transition/Transversion Matrix
tt_matrix = SubstitutionMatrix({
'A': {'A': 2, 'T': -1, 'C': -1, 'G': 0},
'T': {'A': -1, 'T': 2, 'C': 0, 'G': -1},
'C': {'A': -1, 'T': 0, 'C': 2, 'G': -1},
'G': {'A': 0, 'T': -1, 'C': -1, 'G': 2}
})
Complete SeqPair Class
class SeqPair:
def __init__(self, seq1, seq2, sub_matrix, gap_penalty=-2):
self.seq1 = seq1
self.seq2 = seq2
self.sub_matrix = sub_matrix
self.gap = gap_penalty
def _init_matrix(self, rows, cols):
return [[0] * cols for _ in range(rows)]
# --- Needleman-Wunsch (Global) ---
def needleman_wunsch(self):
rows = len(self.seq2) + 1
cols = len(self.seq1) + 1
score = self._init_matrix(rows, cols)
trace = self._init_matrix(rows, cols)
for j in range(1, cols):
score[0][j] = score[0][j-1] + self.gap
trace[0][j] = 2
for i in range(1, rows):
score[i][0] = score[i-1][0] + self.gap
trace[i][0] = 1
for i in range(1, rows):
for j in range(1, cols):
diag = score[i-1][j-1] + self.sub_matrix[self.seq1[j-1], self.seq2[i-1]]
up = score[i-1][j] + self.gap
left = score[i][j-1] + self.gap
score[i][j] = max(diag, up, left)
if score[i][j] == diag:
trace[i][j] = 0
elif score[i][j] == up:
trace[i][j] = 1
else:
trace[i][j] = 2
self.nw_score = score
self.nw_trace = trace
return score[rows-1][cols-1]
def nw_traceback(self):
aligned1 = ""
aligned2 = ""
i = len(self.seq2)
j = len(self.seq1)
while i > 0 or j > 0:
if i > 0 and j > 0 and self.nw_trace[i][j] == 0:
aligned1 = self.seq1[j-1] + aligned1
aligned2 = self.seq2[i-1] + aligned2
i -= 1
j -= 1
elif i > 0 and self.nw_trace[i][j] == 1:
aligned1 = "-" + aligned1
aligned2 = self.seq2[i-1] + aligned2
i -= 1
else:
aligned1 = self.seq1[j-1] + aligned1
aligned2 = "-" + aligned2
j -= 1
return aligned1, aligned2
# --- Smith-Waterman (Local) ---
def smith_waterman(self):
rows = len(self.seq2) + 1
cols = len(self.seq1) + 1
score = self._init_matrix(rows, cols)
trace = self._init_matrix(rows, cols)
max_score = 0
max_pos = (0, 0)
for i in range(1, rows):
for j in range(1, cols):
diag = score[i-1][j-1] + self.sub_matrix[self.seq1[j-1], self.seq2[i-1]]
up = score[i-1][j] + self.gap
left = score[i][j-1] + self.gap
score[i][j] = max(diag, up, left, 0)
if score[i][j] == 0:
trace[i][j] = -1
elif score[i][j] == diag:
trace[i][j] = 0
elif score[i][j] == up:
trace[i][j] = 1
else:
trace[i][j] = 2
if score[i][j] > max_score:
max_score = score[i][j]
max_pos = (i, j)
self.sw_score = score
self.sw_trace = trace
self.sw_max_pos = max_pos
return max_score
def sw_traceback(self):
aligned1 = ""
aligned2 = ""
i, j = self.sw_max_pos
while i > 0 and j > 0 and self.sw_trace[i][j] != -1:
if self.sw_trace[i][j] == 0:
aligned1 = self.seq1[j-1] + aligned1
aligned2 = self.seq2[i-1] + aligned2
i -= 1
j -= 1
elif self.sw_trace[i][j] == 1:
aligned1 = "-" + aligned1
aligned2 = self.seq2[i-1] + aligned2
i -= 1
else:
aligned1 = self.seq1[j-1] + aligned1
aligned2 = "-" + aligned2
j -= 1
return aligned1, aligned2