Regular Expressions in Python

📖

What are Regular Expressions?

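Regular expressions (regex) are patterns that describe the shape of the text you want to search, match, or manipulate. Instead of looking for one exact string, you can look for anything that fits a rule, such as "three digits followed by a dash".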


Examples:

Find all email addresses in a document
Validate phone numbers
Extract gene IDs from biological data
Find DNA/RNA sequence patterns
Clean messy text data

Getting Started

Import the Module

import re

💡

Always Use Raw Strings

Write regex patterns with the r prefix: r"pattern"

Why Raw Strings Matter

# Normal string - \n becomes a newline
print("Hello\nWorld")
# Output:
# Hello
# World

# Raw string - \n stays as literal characters
print(r"Hello\nWorld")
# Output: Hello\nWorld

In regex, backslashes are special! Raw strings prevent confusion:

# ❌ Confusing without raw string
pattern = "\\d+"

# ✅ Clean with raw string
pattern = r"\d+"

✅

Golden Rule

Always write regex patterns as raw strings: r"pattern"
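As a small illustration of what can go wrong without the r prefix, the sketch below uses \b, which means "word boundary" in regex (covered in Level 7) but "backspace" in an ordinary Python string. The sample text is made up for the demonstration.

```python
import re

text = "The cat sat"

# Normal string: Python turns "\b" into a backspace character first,
# so the regex engine searches for backspace + "cat" + backspace.
plain_result = re.findall("\bcat\b", text)
print(plain_result)  # []

# Raw string: the backslash survives, so the regex engine sees \b
# and treats it as a word boundary.
raw_result = re.findall(r"\bcat\b", text)
print(raw_result)  # ['cat']

# The two patterns are genuinely different strings.
print(len("\bcat\b"), len(r"\bcat\b"))  # 5 7
```

The pattern that "looks right" silently matches nothing, which is why the raw-string habit is worth forming early.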

Level 1: Literal Matching

The simplest regex matches exact text.

import re

dna = "ATGCGATCG"

# Search for exact text "ATG"
if re.search(r"ATG", dna):
    print("Found ATG!")

Your First Function: `re.search()`

ℹ️

re.search(pattern, text)

Looks for a pattern anywhere in text. Returns a match object if found, None if not.

match = re.search(r"ATG", "ATGCCC")
if match:
    print("Found:", match.group())    # Found: ATG
    print("Position:", match.start())  # Position: 0
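Because re.search() returns None when nothing matches, it helps to check the result before calling methods on it. A minimal sketch follows; the helper name locate is hypothetical and exists only for this illustration.

```python
import re

def locate(pattern, text):
    """Return (matched_text, start, end) for the first match, or None."""
    match = re.search(pattern, text)
    if match is None:
        # No match: calling match.group() here would raise AttributeError
        return None
    return match.group(), match.start(), match.end()

print(locate(r"ATG", "CCCATGCCC"))  # ('ATG', 3, 6)
print(locate(r"ATG", "CCCCCC"))     # None
```

Note that end() is the index just past the match, so text[start:end] gives back the matched text.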

⚠️

Case Sensitive

Regex is case-sensitive by default! "ATG" ≠ "atg"
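If you do want to ignore case, one option is the re.IGNORECASE flag, passed as an extra argument. The sequences below are invented for the example.

```python
import re

seq = "ccatgcc"

# Default behaviour: uppercase pattern does not match lowercase text
strict = re.search(r"ATG", seq)
print(strict)  # None

# With re.IGNORECASE the same pattern matches regardless of case
relaxed = re.search(r"ATG", seq, re.IGNORECASE)
print(relaxed.group())  # atg

# The flag works with the other re functions too
mixed = re.findall(r"atg", "ATGatgAtg", re.IGNORECASE)
print(mixed)  # ['ATG', 'atg', 'Atg']
```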

Practice

💻

Exercise 1.1

Find which sequences contain "ATG": ["ATGCCC", "TTTAAA", "ATGATG"]

💻

Exercise 1.2

Check if "PYTHON" appears in: "I love PYTHON programming"

Level 2: The Dot `.` - Match Any Character

The dot . matches any single character (except newline).

# Find "A" + any character + "G"
dna = "ATGCGATCG"
matches = re.findall(r"A.G", dna)
print(matches)  # ['ATG']  (the second A is followed by "TC", and "ATC" does not end in G)

New Function: `re.findall()`

ℹ️

re.findall(pattern, text)

Finds all non-overlapping matches, scanning left to right, and returns them as a list of strings. If nothing matches, the list is empty.

text = "cat bat rat"
print(re.findall(r".at", text))  # ['cat', 'bat', 'rat']
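One detail worth seeing in action: re.findall() never reuses a character, so overlapping occurrences are counted only once. A short sketch with made-up sequences:

```python
import re

# "AAAA" contains "AA" at positions 0, 1 and 2, but matches cannot overlap
pairs = re.findall(r"AA", "AAAA")
print(pairs)  # ['AA', 'AA']

# After matching "ATA" at position 0, the scan resumes at position 3,
# so the second "ATA" (positions 2-4) is not reported
triples = re.findall(r"A.A", "ATATA")
print(triples)  # ['ATA']

# No match gives an empty list, not None
nothing = re.findall(r"XYZ", "ATGC")
print(nothing)  # []
```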

Practice

💻

Exercise 2.1

Match "b.t" (b + any char + t) in: "bat bet bit bot but"

💻

Exercise 2.2

Find all 3-letter patterns starting with 'c' in: "cat cow cup car"

Level 3: Character Classes `[ ]`

Square brackets let you specify which characters to match.

# Match any nucleotide (A, T, G, or C)
dna = "ATGCXYZ"
nucleotides = re.findall(r"[ATGC]", dna)
print(nucleotides)  # ['A', 'T', 'G', 'C']

Character Ranges

Use - for ranges:

re.findall(r"[0-9]", "Room 123")      # ['1', '2', '3']
re.findall(r"[a-z]", "Hello")         # ['e', 'l', 'l', 'o']
re.findall(r"[A-Z]", "Hello")         # ['H']
re.findall(r"[A-Za-z]", "Hello123")   # ['H', 'e', 'l', 'l', 'o']

Negation with `^`

^ inside brackets means "NOT these characters":

# Match anything that's NOT a nucleotide
dna = "ATGC-X123"
non_nucleotides = re.findall(r"[^ATGC]", dna)
print(non_nucleotides)  # ['-', 'X', '1', '2', '3']
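Two details about brackets are easy to miss: ^ only negates when it is the first character, and most special characters lose their special meaning inside brackets. The strings below are illustrative only.

```python
import re

# ^ first: negation
negated = re.findall(r"[^ATGC]", "ATGN-C")
print(negated)  # ['N', '-']

# ^ anywhere else: just a literal caret
caret = re.findall(r"[A^]", "A^T")
print(caret)  # ['A', '^']

# Inside brackets, . and + are literal characters
literal = re.findall(r"[.+]", "1.5+2")
print(literal)  # ['.', '+']

# A hyphen placed last is literal instead of forming a range
hyphen = re.findall(r"[AT-]", "A-T-G")
print(hyphen)  # ['A', '-', 'T', '-']
```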

Practice

💻

Exercise 3.1

Find all digits in: "Gene ID: ABC123"

💻

Exercise 3.2

Find all vowels in: "bioinformatics"

💻

Exercise 3.3

Find all NON-digits in: "Room123"

Level 4: Quantifiers - Repeating Patterns

Quantifiers specify how many times a pattern repeats.

📝

Quantifier Reference

* → 0 or more times
+ → 1 or more times
? → 0 or 1 time (optional)
{n} → Exactly n times
{n,m} → Between n and m times
{n,} → n or more times

Examples

# Find runs of C's: C+ needs at least 1 C, C{2,} needs at least 2
dna = "ATGCCCAAAGGG"
print(re.findall(r"C+", dna))       # ['CCC']
print(re.findall(r"C{2,}", dna))    # ['CCC']

# Find all digit groups
text = "Call 123 or 4567"
print(re.findall(r"\d+", text))     # ['123', '4567']

# Optional minus sign
print(re.findall(r"-?\d+", "123 -456 789"))  # ['123', '-456', '789']
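To compare the quantifiers side by side, the sketch below tests each one against the same four strings. It uses re.fullmatch(), which succeeds only if the entire string fits the pattern; the sample strings are invented for the comparison.

```python
import re

samples = ["AG", "ATG", "ATTG", "ATTTG"]
patterns = [r"AT*G", r"AT+G", r"AT?G", r"AT{2}G", r"AT{1,2}G", r"AT{2,}G"]

results = {}
for pattern in patterns:
    # Keep the samples that the pattern matches from start to end
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]
    print(pattern, results[pattern])

# AT*G ['AG', 'ATG', 'ATTG', 'ATTTG']
# AT+G ['ATG', 'ATTG', 'ATTTG']
# AT?G ['AG', 'ATG']
# AT{2}G ['ATTG']
# AT{1,2}G ['ATG', 'ATTG']
# AT{2,}G ['ATTG', 'ATTTG']
```

Each quantifier applies only to the item immediately before it, here the single T.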

Combining with Character Classes

# Find all 3-letter codons
dna = "ATGCCCAAATTT"
codons = re.findall(r"[ATGC]{3}", dna)
print(codons)  # ['ATG', 'CCC', 'AAA', 'TTT']
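A caveat with this codon pattern, shown as a rough sketch with made-up sequences: bases that do not fill a complete group of three are silently dropped, and an unexpected character such as N is skipped without any warning.

```python
import re

# 8 bases: two full codons plus 2 leftover bases
dna = "ATGCCCAA"
codons = re.findall(r"[ATGC]{3}", dna)
print(codons)  # ['ATG', 'CCC']

# Works here because every character is a valid base
leftover = len(dna) - 3 * len(codons)
print(leftover)  # 2

# The N is skipped and matching simply resumes after it
with_n = re.findall(r"[ATGC]{3}", "ATGNCCCAAA")
print(with_n)  # ['ATG', 'CCC', 'AAA']
```

If either situation matters for your data, check the sequence length and alphabet before splitting.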

Practice

💻

Exercise 4.1

Find sequences of exactly 3 A's in: "ATGCCCAAAGGGTTT"

💻

Exercise 4.2

Match "colou?r" (u is optional) in: "color colour"

💻

Exercise 4.3

Find all digit sequences in: "123 4567 89"

Level 5: Escaping Special Characters

Characters such as . * + ? [ ] ( ) ^ $ | and \ itself have special meanings in regex. To match one of them literally, put a backslash (\) in front of it.

# ❌ Wrong - dot matches ANY character
text = "file.txt and fileXtxt"
print(re.findall(r"file.txt", text))  # ['file.txt', 'fileXtxt']

# ✅ Correct - escaped dot matches only literal dot
print(re.findall(r"file\.txt", text))  # ['file.txt']

Common Examples

re.search(r"\$100", "$100")           # Literal dollar sign
re.search(r"What\?", "What?")         # Literal question mark
re.search(r"C\+\+", "C++")            # Literal plus signs
re.search(r"\(test\)", "(test)")      # Literal parentheses
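When the text to match comes from a variable (a filename, a user query), escaping by hand is error-prone. The standard library's re.escape() adds the backslashes for you. A minimal sketch, with an invented filename:

```python
import re

filename = "data.v2.txt"

# re.escape() puts a backslash before each special character
pattern = re.escape(filename)
print(pattern)  # data\.v2\.txt

text = "data.v2.txt dataXv2Ytxt"

# Unescaped: each dot matches any character
unescaped_hits = re.findall(filename, text)
print(unescaped_hits)  # ['data.v2.txt', 'dataXv2Ytxt']

# Escaped: only the literal filename matches
escaped_hits = re.findall(pattern, text)
print(escaped_hits)  # ['data.v2.txt']
```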

Practice

💻

Exercise 5.1

Match "data.txt" (with literal dot) in: "File: data.txt"

💻

Exercise 5.2

Match "c++" in: "I code in c++ and python"

Level 6: Predefined Shortcuts

Python provides shortcuts for common character types.

📝

Common Shortcuts

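\d → Any digit [0-9] (for text strings, also digits from other scripts)
\D → Any non-digit
\w → Word character [A-Za-z0-9_] (for text strings, also accented and non-Latin letters)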
\W → Non-word character
\s → Whitespace (space, tab, newline)
\S → Non-whitespace

Examples

# Find all digits
text = "Room 123, Floor 4"
print(re.findall(r"\d+", text))  # ['123', '4']

# Find all words
sentence = "DNA_seq-123 test"
print(re.findall(r"\w+", sentence))  # ['DNA_seq', '123', 'test']

# Split on whitespace
data = "ATG  CCC\tAAA"
print(re.split(r"\s+", data))  # ['ATG', 'CCC', 'AAA']
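In Python 3, \w and \d applied to ordinary text strings also accept Unicode letters and digits, not only ASCII. If you need the strict ASCII meaning, the re.ASCII flag is one way to get it. The sample strings are made up, and the non-ASCII characters are written as escapes so the example does not depend on file encoding.

```python
import re

text = "caf\u00e9 123"  # "café 123"

# Default: the accented letter counts as a word character
unicode_words = re.findall(r"\w+", text)
print(unicode_words)  # ['café', '123']

# re.ASCII restricts \w to [A-Za-z0-9_]
ascii_words = re.findall(r"\w+", text, re.ASCII)
print(ascii_words)  # ['caf', '123']

# \u0663 is the Arabic-Indic digit three; \d accepts it unless re.ASCII is set
digits_default = re.findall(r"\d", "3\u0663")
digits_ascii = re.findall(r"\d", "3\u0663", re.ASCII)
print(len(digits_default), len(digits_ascii))  # 2 1
```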

Practice

💻

Exercise 6.1

Find all word characters in: "Hello-World"

💻

Exercise 6.2

Split on whitespace: "ATG CCC\tAAA"

Level 7: Anchors - Position Matching

Anchors match positions, not characters.

📝

Anchor Reference

^ → Start of string
$ → End of string (or just before a single trailing newline)
\b → Word boundary
\B → Not a word boundary

Examples

dna = "ATGCCCATG"

# Match only at start
print(re.search(r"^ATG", dna))   # <re.Match object; span=(0, 3), match='ATG'>
print(re.search(r"^CCC", dna))   # None

# Match only at end
print(re.search(r"ATG$", dna))   # <re.Match object; span=(6, 9), match='ATG'>
print(re.search(r"CCC$", dna))   # None

# Word boundaries - whole words only
text = "The cat concatenated strings"
print(re.findall(r"\bcat\b", text))  # ['cat'] - only the word
print(re.findall(r"cat", text))      # ['cat', 'cat'] - both
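By default ^ and $ refer to the whole string. For multi-line text, such as a FASTA-style record, the re.MULTILINE flag makes them apply to every line instead. The tiny record below is invented for the sketch.

```python
import re

fasta = ">seq1\nATGCCC\n>seq2\nTTTAAA\n"

# Without the flag, ^ matches only at the very start of the string
first_only = re.findall(r"^>.*", fasta)
print(first_only)  # ['>seq1']

# With re.MULTILINE, ^ matches at the start of each line
headers = re.findall(r"^>.*", fasta, re.MULTILINE)
print(headers)  # ['>seq1', '>seq2']

# Lines consisting only of nucleotides
sequences = re.findall(r"^[ATGC]+$", fasta, re.MULTILINE)
print(sequences)  # ['ATGCCC', 'TTTAAA']
```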

Practice

💻

Exercise 7.1

Find sequences starting with "ATG": ["ATGCCC", "CCCATG", "TACATG"]

💻

Exercise 7.2

Match whole word "cat" (not "concatenate") in: "The cat sat"

Level 8: Alternation - OR Operator `|`

The pipe | means "match this OR that".

# Match either ATG or AUG
dna = "ATG is DNA, AUG is RNA"
print(re.findall(r"ATG|AUG", dna))  # ['ATG', 'AUG']

# Match stop codons
rna = "AUGCCCUAAUAGUGA"
print(re.findall(r"UAA|UAG|UGA", rna))  # ['UAA', 'UAG', 'UGA']
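Keep in mind that a plain search for stop codons ignores the reading frame: it reports any position where the three letters happen to occur. One possible way to restrict the search to frame 0 is to split into codons first and then test each codon, as in this sketch (the sequence is invented, and re.fullmatch() requires the whole codon to match):

```python
import re

rna = "AUGAUAGCCUAA"  # codons in frame 0: AUG AUA GCC UAA

# Frame-blind search: also finds UGA and UAG, which straddle codon borders
anywhere = re.findall(r"UAA|UAG|UGA", rna)
print(anywhere)  # ['UGA', 'UAG', 'UAA']

# Frame-aware search: split into codons, then test each one
codons = re.findall(r"[AUGC]{3}", rna)
in_frame = [
    (index * 3, codon)
    for index, codon in enumerate(codons)
    if re.fullmatch(r"UAA|UAG|UGA", codon)
]
print(in_frame)  # [(9, 'UAA')]
```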

Practice

💻

Exercise 8.1

Match "email" or "phone" in: "Contact via email or phone"

💻

Exercise 8.2

Find stop codons (TAA, TAG, TGA) in: ["ATG", "TAA", "TAG"]

Level 9: Groups and Capturing `( )`

Parentheses create groups you can extract separately.

# Extract parts of an email
email = "user@example.com"
match = re.search(r"(\w+)@(\w+)\.(\w+)", email)
if match:
    print("Username:", match.group(1))   # user
    print("Domain:", match.group(2))     # example
    print("TLD:", match.group(3))        # com
    print("Full:", match.group(0))       # user@example.com
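Groups also change what re.findall() returns: with more than one group in the pattern, you get a list of tuples, one tuple per match. A short sketch with made-up addresses:

```python
import re

text = "Contacts: alice@example.com, bob@lab.org"
pattern = r"(\w+)@(\w+)\.(\w+)"

# One tuple per match, one item per group
parts = re.findall(pattern, text)
print(parts)  # [('alice', 'example', 'com'), ('bob', 'lab', 'org')]

# On a single match object, groups() returns all groups at once
first = re.search(pattern, text)
print(first.groups())  # ('alice', 'example', 'com')

# finditer() yields match objects, so the full match is still available
full = [m.group(0) for m in re.finditer(pattern, text)]
print(full)  # ['alice@example.com', 'bob@lab.org']
```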

Named Groups

Use (?P<name>...) for readable names:

gene_id = "NM_001234"
match = re.search(r"(?P<prefix>[A-Z]+)_(?P<number>\d+)", gene_id)
if match:
    print(match.group('prefix'))  # NM
    print(match.group('number'))  # 001234
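Named groups pair well with match.groupdict(), which returns all names and values as a dictionary. The sketch below reuses the gene ID pattern; the constant name GENE_ID and the sample sentence are assumptions made for the example.

```python
import re

GENE_ID = r"(?P<prefix>[A-Z]+)_(?P<number>\d+)"

match = re.search(GENE_ID, "NM_001234")
print(match.groupdict())  # {'prefix': 'NM', 'number': '001234'}

# Collect one dictionary per ID found in a longer text
text = "Genes NM_001234 and XM_567890 are important"
records = [m.groupdict() for m in re.finditer(GENE_ID, text)]
print(records)
# [{'prefix': 'NM', 'number': '001234'}, {'prefix': 'XM', 'number': '567890'}]
```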

Practice

💻

Exercise 9.1

Extract area code from: "Call 123-456-7890"

💻

Exercise 9.2

Extract year, month, day from: "2024-11-20"

Level 10: More Useful Functions

`re.sub()` - Find and Replace

# Mask stop codons
dna = "ATGTAACCC"
masked = re.sub(r"TAA|TAG|TGA", "XXX", dna)
print(masked)  # ATGXXXCCC

# Clean multiple spaces
text = "too    many     spaces"
clean = re.sub(r"\s+", " ", text)
print(clean)  # "too many spaces"
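The replacement in re.sub() does not have to be fixed text. It can refer back to captured groups with \1, \2, and so on, or it can be a function that receives each match. Both are sketched below; the complement helper and its lookup table are hypothetical names written for this illustration.

```python
import re

# Backreferences: \1 and \2 stand for the first and second groups
relabelled = re.sub(r"([A-Z]{2})_(\d+)", r"\1:\2", "NM_001234 and XM_567890")
print(relabelled)  # NM:001234 and XM:567890

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(dna):
    """Replace each base with its complement, one match at a time."""
    return re.sub(r"[ATGC]", lambda m: COMPLEMENT[m.group()], dna)

print(complement("ATGC"))        # TACG
print(complement("ATGC")[::-1])  # GCAT (reverse complement)
```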

`re.compile()` - Reusable Patterns

# Compile once, use many times (Python caches recent patterns, so the
# speed gain is usually small; the main benefit is a named, reusable object)
pattern = re.compile(r"ATG")

for seq in ["ATGCCC", "TTTAAA", "GCGCGC"]:
    if pattern.search(seq):
        print(f"{seq} contains ATG")
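A compiled pattern offers the same operations as the re module functions, as methods, and any flags are fixed at compile time. A brief sketch; the variable name start_codon and the sample sequence are illustrative.

```python
import re

# Flags are supplied once, when compiling
start_codon = re.compile(r"ATG", re.IGNORECASE)

seq = "atgCCCATG"

print(start_codon.pattern)       # ATG
print(start_codon.findall(seq))  # ['atg', 'ATG']
print(start_codon.sub("NNN", seq))  # NNNCCCNNN

positions = [m.start() for m in start_codon.finditer(seq)]
print(positions)  # [0, 6]
```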

Practice

💻

Exercise 10.1

Replace all A's with N's in: "ATGCCCAAA"

💻

Exercise 10.2

Mask all digits with "X" in: "Room123Floor4"

Biological Examples

💡

Real Applications

Here's how regex is used in bioinformatics!

Validate DNA Sequences

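def is_valid_dna(sequence):
    """Check if sequence contains only A, T, G, C"""
    # fullmatch requires the entire string to match, so a stray
    # trailing newline (which $ would tolerate) is rejected
    return bool(re.fullmatch(r"[ATGC]+", sequence))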

print(is_valid_dna("ATGCCC"))  # True
print(is_valid_dna("ATGXCC"))  # False
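One subtlety when validating lines read from a file: $ also matches just before a trailing newline, so an anchored pattern like ^[ATGC]+$ accepts "ATGC\n". The sketch below contrasts that with re.fullmatch(), which insists that the whole string fits; stripping the line first is another common option.

```python
import re

line = "ATGC\n"  # e.g. a line read from a file, newline still attached

# $ tolerates the trailing newline
anchored = bool(re.match(r"^[ATGC]+$", line))
print(anchored)  # True

# fullmatch does not
strict = bool(re.fullmatch(r"[ATGC]+", line))
print(strict)  # False

# Stripping whitespace first makes both approaches agree
stripped = bool(re.fullmatch(r"[ATGC]+", line.strip()))
print(stripped)  # True
```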

Find Restriction Sites

def find_ecori(dna):
    """Find EcoRI recognition sites (GAATTC)"""
    matches = re.finditer(r"GAATTC", dna)
    return [(m.start(), m.group()) for m in matches]

dna = "ATGGAATTCCCCGAATTC"
print(find_ecori(dna))  # [(3, 'GAATTC'), (12, 'GAATTC')]
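The same idea extends to several enzymes at once. The sketch below is one possible generalization: the function name find_sites and the SITES table are hypothetical, while the three recognition sequences are the standard ones for EcoRI, BamHI, and HindIII.

```python
import re

# Enzyme name -> recognition sequence (plain letters, so no escaping needed)
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def find_sites(dna, sites=SITES):
    """Return a sorted list of (position, enzyme) for every site found."""
    hits = []
    for enzyme, site in sites.items():
        for m in re.finditer(site, dna):
            hits.append((m.start(), enzyme))
    return sorted(hits)

dna = "ATGGAATTCCCGGATCCAAGCTT"
print(find_sites(dna))  # [(3, 'EcoRI'), (11, 'BamHI'), (17, 'HindIII')]
```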

Count Codons

def count_codons(dna):
    """Split DNA into codons (groups of 3)"""
    return re.findall(r"[ATGC]{3}", dna)

dna = "ATGCCCAAATTT"
print(count_codons(dna))  # ['ATG', 'CCC', 'AAA', 'TTT']
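To go from the list of codons to actual counts, one straightforward option is collections.Counter from the standard library. The helper name codon_counts is hypothetical, and the sequence is made up.

```python
import re
from collections import Counter

def codon_counts(dna):
    """Count how often each codon occurs, reading from position 0."""
    return Counter(re.findall(r"[ATGC]{3}", dna))

counts = codon_counts("ATGCCCATGAAA")  # ATG CCC ATG AAA
print(counts["ATG"])          # 2
print(counts["CCC"])          # 1
print(counts.most_common(1))  # [('ATG', 2)]
```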

Extract Gene IDs

def extract_gene_ids(text):
    """Extract gene IDs like NM_123456"""
    return re.findall(r"[A-Z]{2}_\d+", text)

text = "Genes NM_001234 and XM_567890 are important"
print(extract_gene_ids(text))  # ['NM_001234', 'XM_567890']
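As written, the pattern can also match inside a longer token. Adding word boundaries (\b) is one way to require that the ID stands on its own. The text below, including the deliberately malformed XXNM_999, is invented to show the difference.

```python
import re

text = "Genes NM_001234, XXNM_999 and XM_567890"

# Without boundaries, "NM_999" is pulled out of the longer token "XXNM_999"
loose = re.findall(r"[A-Z]{2}_\d+", text)
print(loose)   # ['NM_001234', 'NM_999', 'XM_567890']

# With \b on both sides, only stand-alone IDs are returned
strict = re.findall(r"\b[A-Z]{2}_\d+\b", text)
print(strict)  # ['NM_001234', 'XM_567890']
```

Which behaviour you want depends on how clean the input is, so treat the stricter pattern as a starting point and not as a validated ID parser.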

Pattern Summary

abc → Literal text
. → Any character
[abc] → Any of a, b, c
[^abc] → NOT a, b, c
[a-z] → Range
* → 0 or more
+ → 1 or more
? → 0 or 1 (optional)
{n} → Exactly n times
\d → Digit
\w → Word character
\s → Whitespace
^ → Start of string
$ → End of string
\b → Word boundary
| → OR
(...) → Capture group

Key Functions Summary

ℹ️

Function Reference

re.search(pattern, text) → Find first match
re.findall(pattern, text) → Find all matches
re.finditer(pattern, text) → Iterator of matches
re.sub(pattern, replacement, text) → Replace matches
re.split(pattern, text) → Split on pattern
re.compile(pattern) → Reusable pattern
