Greedy Algorithm for Genome Assembly
Watch Video Walkthrough
Genome Assembly
Greedy assembly is a simple approach that repeatedly finds and merges the two reads with the largest overlap, continuing until no more merges are possible.
How It Works
The algorithm follows these steps:
- Find the pair of reads with the longest overlap
- Merge those two reads into one longer sequence
- Repeat steps 1-2 until no overlaps remain (or overlaps are too small)
- Result is a set of contigs (assembled fragments)
Starting reads:
- Read A: ATCGAT
- Read B: CGATGC
- Read C: TGCAAA
Step 1: Best overlap is A+B (4 bp): ATCGAT + CGATGC → ATCGATGC
Step 2: Best overlap is AB+C (3 bp): ATCGATGC + TGCAAA → ATCGATGCAAA
Done! Final contig: ATCGATGCAAA
Why "Greedy"?
It's called "greedy" because it always takes the best immediate option (longest overlap right now) without considering if this might prevent better assemblies later.
Repeats break greedy assembly! If a sequence appears multiple times in the genome, the greedy algorithm doesn't know which copy it's assembling and can merge reads from different genome locations incorrectly.
Advantages & Disadvantages
Advantages:
- Simple and intuitive
- Fast for small datasets
- Works well for genomes with few repeats
Disadvantages:
- Fails on repetitive sequences
- Makes locally optimal choices that may be globally wrong
- Can create chimeric contigs (incorrectly merged sequences)
Modern assemblers use more sophisticated approaches (like De Bruijn graphs) that handle repeats better. Greedy assembly is rarely used alone for real genome projects.