Greedy Algorithm for Genome Assembly

Watch Video Walkthrough

Genome Assembly

📖

Definition: Greedy Assembly

Greedy assembly is a simple approach that repeatedly finds and merges the two reads with the largest overlap, continuing until no more merges are possible.

How It Works

The algorithm follows these steps:

Find the pair of reads with the longest overlap
Merge those two reads into one longer sequence
Repeat steps 1-2 until no overlaps remain (or overlaps are too small)
Result is a set of contigs (assembled fragments)

💻

Simple Example

Starting reads:

Read A: ATCGAT
Read B: CGATGC
Read C: TGCAAA

Step 1: Best overlap is A+B (4 bp): ATCGAT + CGATGC → ATCGATGC

Step 2: Best overlap is AB+C (3 bp): ATCGATGC + TGCAAA → ATCGATGCAAA

Done! Final contig: ATCGATGCAAA

Why "Greedy"?

It's called "greedy" because it always takes the best immediate option (longest overlap right now) without considering if this might prevent better assemblies later.

⚠️

Major Problem

Repeats break greedy assembly! If a sequence appears multiple times in the genome, the greedy algorithm doesn't know which copy it's assembling and can merge reads from different genome locations incorrectly.

Advantages & Disadvantages

Advantages:

Simple and intuitive
Fast for small datasets
Works well for genomes with few repeats

Disadvantages:

Fails on repetitive sequences
Makes locally optimal choices that may be globally wrong
Can create chimeric contigs (incorrectly merged sequences)

📝

Note

Modern assemblers use more sophisticated approaches (like De Bruijn graphs) that handle repeats better. Greedy assembly is rarely used alone for real genome projects.

Bioinformatics Forever

Greedy Algorithm for Genome Assembly

Watch Video Walkthrough

How It Works

Why "Greedy"?

Advantages & Disadvantages