Regular Expressions in Python
Regular expressions (regex) are patterns for searching, matching, and manipulating text. Instead of looking only for exact text, you describe the shape of the text you want to find.

Examples:
- Find all email addresses in a document
- Validate phone numbers
- Extract gene IDs from biological data
- Find DNA/RNA sequence patterns
- Clean messy text data
Getting Started
Import the Module
import re
Write regex patterns with the r prefix: r"pattern"
Why Raw Strings Matter
# Normal string - \n becomes a newline
print("Hello\nWorld")
# Output:
# Hello
# World
# Raw string - \n stays as literal characters
print(r"Hello\nWorld")
# Output: Hello\nWorld
In regex, backslashes are special! Raw strings prevent confusion:
# ❌ Confusing without raw string
pattern = "\\d+"
# ✅ Clean with raw string
pattern = r"\d+"
Always write regex patterns as raw strings: r"pattern"
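To make the difference concrete, here is a small sketch (the sample text is made up for illustration). It checks that the two spellings above produce the same pattern, then shows a case where forgetting the r prefix silently breaks a pattern. The \b anchor used here is covered in Level 7.

```python
import re

# Both spellings build the same three-character pattern: \d+
escaped = "\\d+"
raw = r"\d+"
same_pattern = (escaped == raw)

# Pitfall: in a normal string, "\b" is a backspace character,
# so the regex word-boundary anchor \b never reaches the re module.
without_raw = re.findall("\bcat\b", "the cat sat")
with_raw = re.findall(r"\bcat\b", "the cat sat")

print(same_pattern)  # True
print(without_raw)   # []
print(with_raw)      # ['cat']
```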
Level 1: Literal Matching
The simplest regex matches exact text.
import re
dna = "ATGCGATCG"
# Search for exact text "ATG"
if re.search(r"ATG", dna):
    print("Found ATG!")
Your First Function: re.search()
Looks for the first occurrence of a pattern anywhere in the text. Returns a match object if found, None if not.
match = re.search(r"ATG", "ATGCCC")
if match:
    print("Found:", match.group()) # Found: ATG
    print("Position:", match.start()) # Position: 0
Regex is case-sensitive by default! "ATG" ≠ "atg"
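If case should not matter, one option is the re.IGNORECASE flag, passed as a third argument. A minimal sketch, using a made-up lowercase sequence:

```python
import re

seq = "ccatgcc"  # lowercase sequence, invented for this example

strict = re.search(r"ATG", seq)                  # case-sensitive search
relaxed = re.search(r"ATG", seq, re.IGNORECASE)  # case-insensitive search

print(strict)           # None
print(relaxed.group())  # atg
print(relaxed.start())  # 2
```

Note that match.group() returns the text as it appears in the input, not as it is written in the pattern.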
Practice
Find which sequences contain "ATG": ["ATGCCC", "TTTAAA", "ATGATG"]
Check if "PYTHON" appears in: "I love PYTHON programming"
Level 2: The Dot . - Match Any Character
The dot . matches any single character (except newline).
# Find "A" + any character + "G"
dna = "ATGCGACG"
matches = re.findall(r"A.G", dna)
print(matches) # ['ATG', 'ACG']
New Function: re.findall()
Finds all non-overlapping matches and returns them as a list of strings.
text = "cat bat rat"
print(re.findall(r".at", text)) # ['cat', 'bat', 'rat']
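Two details of re.findall() and the dot are easy to miss, so here is a short sketch with made-up inputs. Matches never overlap, and the dot skips newlines unless the re.DOTALL flag is passed.

```python
import re

# Matches do not overlap: once "AAA" is consumed, scanning resumes after it,
# and the remaining "AA" is too short to match again.
non_overlapping = re.findall(r"A.A", "AAAAA")
print(non_overlapping)  # ['AAA']

# The dot does not match a newline by default...
two_lines = "A\nG"
default = re.findall(r"A.G", two_lines)
print(default)  # []

# ...unless re.DOTALL is passed as a flag.
dotall = re.findall(r"A.G", two_lines, re.DOTALL)
print(dotall)  # ['A\nG']
```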
Practice
Match "b.t" (b + any char + t) in: "bat bet bit bot but"
Find all 3-letter patterns starting with 'c' in: "cat cow cup car"
Level 3: Character Classes [ ]
Square brackets let you specify which characters to match.
# Match any nucleotide (A, T, G, or C)
dna = "ATGCXYZ"
nucleotides = re.findall(r"[ATGC]", dna)
print(nucleotides) # ['A', 'T', 'G', 'C']
Character Ranges
Use - for ranges:
re.findall(r"[0-9]", "Room 123") # ['1', '2', '3']
re.findall(r"[a-z]", "Hello") # ['e', 'l', 'l', 'o']
re.findall(r"[A-Z]", "Hello") # ['H']
re.findall(r"[A-Za-z]", "Hello123") # ['H', 'e', 'l', 'l', 'o']
Negation with ^
^ inside brackets means "NOT these characters":
# Match anything that's NOT a nucleotide
dna = "ATGC-X123"
non_nucleotides = re.findall(r"[^ATGC]", dna)
print(non_nucleotides) # ['-', 'X', '1', '2', '3']
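Ranges and negation can be combined in one class. The sketch below uses invented sample strings. It also shows that ^ only negates when it comes first, and that the dot is an ordinary character inside brackets.

```python
import re

record = "Gene-ID: abc_123!"

# Anything that is NOT an ASCII letter or digit
specials = re.findall(r"[^A-Za-z0-9]", record)
print(specials)  # ['-', ':', ' ', '_', '!']

# ^ negates only in first position; elsewhere it is a literal caret
caret = re.findall(r"[G^]", "A^G")
print(caret)  # ['^', 'G']

# Inside brackets the dot loses its "any character" meaning
dots = re.findall(r"[.]", "v1.2.3")
print(dots)  # ['.', '.']
```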
Practice
Find all digits in: "Gene ID: ABC123"
Find all vowels in: "bioinformatics"
Find all NON-digits in: "Room123"
Level 4: Quantifiers - Repeating Patterns
Quantifiers specify how many times a pattern repeats.
* → 0 or more times
+ → 1 or more times
? → 0 or 1 time (optional)
{n} → Exactly n times
{n,} → n or more times
{n,m} → Between n and m times (inclusive)
Examples
# Find runs of C's (C+ needs at least one C, C{2,} needs at least two)
dna = "ATGCCCAAAGGG"
print(re.findall(r"C+", dna)) # ['CCC']
print(re.findall(r"C{2,}", dna)) # ['CCC']
# Find all digit groups
text = "Call 123 or 4567"
print(re.findall(r"\d+", text)) # ['123', '4567']
# Optional minus sign
print(re.findall(r"-?\d+", "123 -456 789")) # ['123', '-456', '789']
Combining with Character Classes
# Find all 3-letter codons
dna = "ATGCCCAAATTT"
codons = re.findall(r"[ATGC]{3}", dna)
print(codons) # ['ATG', 'CCC', 'AAA', 'TTT']
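The counted forms {n}, {n,m} and {n,} are not shown above, so here is a small sketch on a made-up sequence. Quantifiers are greedy: they take as many repeats as the limits allow. Note also that fixed-size chunks silently drop any leftover characters.

```python
import re

dna = "ATGAAAAAACCG"  # contains a run of six A's

exactly_three = re.findall(r"A{3}", dna)
print(exactly_three)  # ['AAA', 'AAA']

# Greedy: the first match takes four A's, leaving two for the next match
two_to_four = re.findall(r"A{2,4}", dna)
print(two_to_four)  # ['AAAA', 'AA']

two_or_more = re.findall(r"A{2,}", dna)
print(two_or_more)  # ['AAAAAA']

# The trailing "AA" is not a full codon, so it is not returned
codons = re.findall(r"[ATGC]{3}", "ATGCCCAA")
print(codons)  # ['ATG', 'CCC']
```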
Practice
Find sequences of exactly 3 A's in: "ATGCCCAAAGGGTTT"
Match "colou?r" (u is optional) in: "color colour"
Find all digit sequences in: "123 4567 89"
Level 5: Escaping Special Characters
Characters such as . * + ? [ ] ( ) { } ^ $ | \ have special meanings in a pattern. To match one literally, escape it with a backslash (\).
# ❌ Wrong - dot matches ANY character
text = "file.txt and fileXtxt"
print(re.findall(r"file.txt", text)) # ['file.txt', 'fileXtxt']
# ✅ Correct - escaped dot matches only literal dot
print(re.findall(r"file\.txt", text)) # ['file.txt']
Common Examples
re.search(r"\$100", "$100") # Literal dollar sign
re.search(r"What\?", "What?") # Literal question mark
re.search(r"C\+\+", "C++") # Literal plus signs
re.search(r"\(test\)", "(test)") # Literal parentheses
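When the literal text comes from a variable, escaping by hand is error-prone. The standard library's re.escape() adds the backslashes for you. A minimal sketch, with a made-up filename:

```python
import re

filename = "results(v1.0).txt"
text = "Saved results(v1.0).txt and resultsXv1Y0Z.txt"

# Unescaped, the parentheses form a group and the dots match any character,
# so the pattern no longer describes the literal filename.
unescaped = re.findall(filename, text)
print(unescaped)  # []

# re.escape() turns every special character into a literal one
escaped = re.findall(re.escape(filename), text)
print(escaped)  # ['results(v1.0).txt']
```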
Practice
Match "data.txt" (with literal dot) in: "File: data.txt"
Match "c++" in: "I code in c++ and python"
Level 6: Predefined Shortcuts
Python provides shortcuts for common character types.
\d → Any digit (same as [0-9] for ASCII text)
\D → Any non-digit
\w → Word character (letter, digit, or underscore; [A-Za-z0-9_] for ASCII text)
\W → Non-word character
\s → Whitespace (space, tab, newline)
\S → Non-whitespace
Examples
# Find all digits
text = "Room 123, Floor 4"
print(re.findall(r"\d+", text)) # ['123', '4']
# Find all words
sentence = "DNA_seq-123 test"
print(re.findall(r"\w+", sentence)) # ['DNA_seq', '123', 'test']
# Split on whitespace
data = "ATG CCC\tAAA"
print(re.split(r"\s+", data)) # ['ATG', 'CCC', 'AAA']
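The uppercase shortcuts are the opposites of the lowercase ones. A short sketch on an invented label shows all three:

```python
import re

label = "Sample_7: 42 ng/uL"

non_digits = re.findall(r"\D+", label)
print(non_digits)  # ['Sample_', ': ', ' ng/uL']

non_word = re.findall(r"\W+", label)
print(non_word)  # [': ', ' ', '/']

non_space = re.findall(r"\S+", label)
print(non_space)  # ['Sample_7:', '42', 'ng/uL']
```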
Practice
Find all word characters in: "Hello-World"
Split on whitespace: "ATG CCC\tAAA"
Level 7: Anchors - Position Matching
Anchors match positions, not characters.
^ → Start of string
$ → End of string
\b → Word boundary
\B → Not a word boundary
Examples
dna = "ATGCCCATG"
# Match only at start
print(re.search(r"^ATG", dna)) # Matches!
print(re.search(r"^CCC", dna)) # None
# Match only at end
print(re.search(r"ATG$", dna)) # Matches!
print(re.search(r"CCC$", dna)) # None
# Word boundaries - whole words only
text = "The cat concatenated strings"
print(re.findall(r"\bcat\b", text)) # ['cat'] - only the word
print(re.findall(r"cat", text)) # ['cat', 'cat'] - both
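Two more anchor behaviours, sketched with the same sentence plus a made-up multi-line string. \B matches only inside a word, and ^ can be made to match at the start of every line with the re.MULTILINE flag.

```python
import re

text = "The cat concatenated strings"

# \b...\b finds the standalone word; \B...\B finds only the embedded one
whole_word = re.search(r"\bcat\b", text).start()
inside_word = re.search(r"\Bcat\B", text).start()
print(whole_word)   # 4
print(inside_word)  # 11

# By default ^ matches only at the very start of the string
reads = "ATGCCC\nCCCATG\nATGAAA"
first_only = re.findall(r"^ATG", reads)
print(first_only)  # ['ATG']

# With re.MULTILINE, ^ also matches right after each newline
each_line = re.findall(r"^ATG", reads, re.MULTILINE)
print(each_line)  # ['ATG', 'ATG']
```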
Practice
Find sequences starting with "ATG": ["ATGCCC", "CCCATG", "TACATG"]
Match whole word "cat" (not "concatenate") in: "The cat sat"
Level 8: Alternation - OR Operator |
The pipe | means "match this OR that".
# Match either ATG or AUG
dna = "ATG is DNA, AUG is RNA"
print(re.findall(r"ATG|AUG", dna)) # ['ATG', 'AUG']
# Match stop codons
rna = "AUGCCCUAAUAGUGA"
print(re.findall(r"UAA|UAG|UGA", rna)) # ['UAA', 'UAG', 'UGA']
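One thing to watch: | has very low precedence, so an anchor written next to it applies to only one alternative. The sketch below uses (?:...), a group that does not capture, to keep the alternatives together (groups are covered in Level 9). When the alternatives differ by a single character, a character class is a compact alternative.

```python
import re

# Reads as "^ATG" OR "AUG": the anchor covers only the first alternative
loose = re.findall(r"^ATG|AUG", "CCAUG")
print(loose)  # ['AUG']

# Grouping makes the anchor apply to both alternatives
anchored = re.findall(r"^(?:ATG|AUG)", "CCAUG")
print(anchored)  # []

at_start = re.findall(r"^(?:ATG|AUG)", "AUGCCC")
print(at_start)  # ['AUG']

# Same matches as ATG|AUG, written with a character class
compact = re.findall(r"A[TU]G", "ATG is DNA, AUG is RNA")
print(compact)  # ['ATG', 'AUG']
```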
Practice
Match "email" or "phone" in: "Contact via email or phone"
Find stop codons (TAA, TAG, TGA) in: ["ATG", "TAA", "TAG"]
Level 9: Groups and Capturing ( )
Parentheses create groups you can extract separately.
# Extract parts of an email
email = "user@example.com"
match = re.search(r"(\w+)@(\w+)\.(\w+)", email)
if match:
    print("Username:", match.group(1)) # user
    print("Domain:", match.group(2)) # example
    print("TLD:", match.group(3)) # com
    print("Full:", match.group(0)) # user@example.com
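Groups also change what re.findall() returns: with more than one group in the pattern, each match comes back as a tuple of the captured parts. A small sketch with made-up addresses, which also shows match.groups() for getting all groups at once:

```python
import re

text = "Contacts: alice@example.com, bob@test.org"
pattern = r"(\w+)@(\w+)\.(\w+)"

# One tuple per match, one item per group
pairs = re.findall(pattern, text)
print(pairs)  # [('alice', 'example', 'com'), ('bob', 'test', 'org')]

# groups() returns every captured group of a single match
first = re.search(pattern, text)
print(first.groups())  # ('alice', 'example', 'com')
```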
Named Groups
Use (?P<name>...) for readable names:
gene_id = "NM_001234"
match = re.search(r"(?P<prefix>[A-Z]+)_(?P<number>\d+)", gene_id)
if match:
    print(match.group('prefix')) # NM
    print(match.group('number')) # 001234
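Named groups can also be read all at once as a dictionary with match.groupdict(), or one at a time with square brackets (available since Python 3.6). A minimal sketch reusing the gene ID pattern:

```python
import re

match = re.search(r"(?P<prefix>[A-Z]+)_(?P<number>\d+)", "NM_001234")

# All named groups as a dictionary
fields = match.groupdict()
print(fields)  # {'prefix': 'NM', 'number': '001234'}

# Index-style access works with names and with numbers
print(match["prefix"])  # NM
print(match[2])         # 001234
```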
Practice
Extract area code from: "Call 123-456-7890"
Extract year, month, day from: "2024-11-20"
Level 10: More Useful Functions
re.sub() - Find and Replace
# Mask stop codons
dna = "ATGTAACCC"
masked = re.sub(r"TAA|TAG|TGA", "XXX", dna)
print(masked) # ATGXXXCCC
# Clean multiple spaces
text = "too    many     spaces"
clean = re.sub(r"\s+", " ", text)
print(clean) # "too many spaces"
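re.sub() has a few more options worth knowing. The sketch below uses made-up strings. The replacement can refer back to captured groups with \1, \2, the count argument limits how many replacements are made, and the replacement can be a function that receives each match.

```python
import re

# \1 and \2 in the replacement stand for the captured groups
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "gene=BRCA1")
print(swapped)  # BRCA1=gene

# count limits the number of replacements
limited = re.sub(r"\d", "#", "ID 1234", count=2)
print(limited)  # ID ##34

# A function replacement: uppercase every lowercase run of bases
upper = re.sub(r"[atgc]+", lambda m: m.group().upper(), "ATGccaTT")
print(upper)  # ATGCCATT
```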
re.compile() - Reusable Patterns
# Compile once, use many times (more efficient!)
pattern = re.compile(r"ATG")
for seq in ["ATGCCC", "TTTAAA", "GCGCGC"]:
    if pattern.search(seq):
        print(f"{seq} contains ATG")
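A compiled pattern offers the same operations as the module-level functions (search, findall, sub, and so on) as methods, and flags are given once at compile time. A short sketch, with the variable name start_codon chosen just for this example:

```python
import re

# Flags are passed once, when the pattern is compiled
start_codon = re.compile(r"ATG", re.IGNORECASE)

seqs = ["ATGCCC", "tttaaa", "gcatgc"]
hits = [s for s in seqs if start_codon.search(s)]
print(hits)  # ['ATGCCC', 'gcatgc']

found = start_codon.findall("ATGatgAtg")
print(found)  # ['ATG', 'atg', 'Atg']

masked = start_codon.sub("NNN", "ccATGcc")
print(masked)  # ccNNNcc
```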
Practice
Replace all A's with N's in: "ATGCCCAAA"
Mask all digits with "X" in: "Room123Floor4"
Biological Examples
Here's how regex is used in bioinformatics!
Validate DNA Sequences
def is_valid_dna(sequence):
    """Check if sequence contains only A, T, G, C"""
    # fullmatch requires the whole string to match the pattern
    return bool(re.fullmatch(r"[ATGC]+", sequence))
print(is_valid_dna("ATGCCC")) # True
print(is_valid_dna("ATGXCC")) # False
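One subtlety when validating whole strings: $ also matches just before a trailing newline, so a pattern written as ^...$ accepts a line that still carries its "\n" (common when reading from files). re.fullmatch() does not. A minimal comparison:

```python
import re

line = "ATGC\n"  # e.g. a line read from a file, newline still attached

# $ matches before the final newline, so this reports a valid sequence
with_dollar = bool(re.match(r"^[ATGC]+$", line))
print(with_dollar)  # True

# fullmatch requires every character, including the newline, to fit
with_fullmatch = bool(re.fullmatch(r"[ATGC]+", line))
print(with_fullmatch)  # False

# Stripping the line first makes both approaches agree
print(bool(re.fullmatch(r"[ATGC]+", line.strip())))  # True
```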
Find Restriction Sites
def find_ecori(dna):
    """Find EcoRI recognition sites (GAATTC)"""
    matches = re.finditer(r"GAATTC", dna)
    return [(m.start(), m.group()) for m in matches]
dna = "ATGGAATTCCCCGAATTC"
print(find_ecori(dna)) # [(3, 'GAATTC'), (12, 'GAATTC')]
Count Codons
def count_codons(dna):
    """Split DNA into codons (groups of 3)"""
    return re.findall(r"[ATGC]{3}", dna)
dna = "ATGCCCAAATTT"
print(count_codons(dna)) # ['ATG', 'CCC', 'AAA', 'TTT']
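The function above returns the codons themselves. To tally how often each one occurs, one option is to feed the list to collections.Counter. This is a sketch only: the helper name codon_frequencies and the sample sequence are made up for illustration.

```python
import re
from collections import Counter

def codon_frequencies(dna):
    """Tally how many times each codon appears (hypothetical helper)."""
    return Counter(re.findall(r"[ATGC]{3}", dna))

counts = codon_frequencies("ATGCCCATGAAA")
print(counts["ATG"])          # 2
print(counts["CCC"])          # 1
print(counts.most_common(1))  # [('ATG', 2)]
```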
Extract Gene IDs
def extract_gene_ids(text):
    """Extract gene IDs like NM_123456"""
    return re.findall(r"[A-Z]{2}_\d+", text)
text = "Genes NM_001234 and XM_567890 are important"
print(extract_gene_ids(text)) # ['NM_001234', 'XM_567890']
Quick Reference
abc → Literal text
. → Any character
[abc] → Any of a, b, c
[^abc] → NOT a, b, c
[a-z] → Range
* → 0 or more
+ → 1 or more
? → 0 or 1 (optional)
{n} → Exactly n times
\d → Digit
\w → Word character
\s → Whitespace
^ → Start of string
$ → End of string
\b → Word boundary
| → OR
(...) → Capture group
Key Functions Summary
re.search(pattern, text) → Find first match
re.findall(pattern, text) → Find all matches
re.finditer(pattern, text) → Iterator of matches
re.sub(pattern, replacement, text) → Replace matches
re.split(pattern, text) → Split on pattern
re.compile(pattern) → Reusable pattern
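As a recap, here is a sketch that runs all six functions against one made-up string of gene IDs, so their return values can be compared side by side.

```python
import re

text = "NM_001234, XM_567890; NR_000042"
id_pattern = r"[A-Z]{2}_\d+"

first = re.search(id_pattern, text).group()
print(first)  # NM_001234

all_ids = re.findall(id_pattern, text)
print(all_ids)  # ['NM_001234', 'XM_567890', 'NR_000042']

positions = [m.start() for m in re.finditer(id_pattern, text)]
print(positions)  # [0, 11, 22]

masked = re.sub(id_pattern, "ID", text)
print(masked)  # ID, ID; ID

parts = re.split(r"[,;]\s*", text)
print(parts)  # ['NM_001234', 'XM_567890', 'NR_000042']

compiled = re.compile(id_pattern)
print(len(compiled.findall(text)))  # 3
```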