Greedy Algorithm for Genome Assembly

Watch Video Walkthrough

Genome Assembly

📖
Definition: Greedy Assembly

Greedy assembly is a simple approach that repeatedly finds and merges the two reads with the largest overlap, continuing until no more merges are possible.

How It Works

The algorithm follows these steps:

  1. Find the pair of reads with the longest overlap
  2. Merge those two reads into one longer sequence
  3. Repeat steps 1-2 until no overlaps remain (or overlaps are too small)
  4. Result is a set of contigs (assembled fragments)
💻
Simple Example

Starting reads:

  • Read A: ATCGAT
  • Read B: CGATGC
  • Read C: TGCAAA

Step 1: Best overlap is A+B (4 bp): ATCGAT + CGATGC → ATCGATGC

Step 2: Best overlap is AB+C (3 bp): ATCGATGC + TGCAAA → ATCGATGCAAA

Done! Final contig: ATCGATGCAAA

Why "Greedy"?

It's called "greedy" because it always takes the best immediate option (longest overlap right now) without considering if this might prevent better assemblies later.

⚠️
Major Problem

Repeats break greedy assembly! If a sequence appears multiple times in the genome, the greedy algorithm doesn't know which copy it's assembling and can merge reads from different genome locations incorrectly.

Advantages & Disadvantages

Advantages:

  • Simple and intuitive
  • Fast for small datasets
  • Works well for genomes with few repeats

Disadvantages:

  • Fails on repetitive sequences
  • Makes locally optimal choices that may be globally wrong
  • Can create chimeric contigs (incorrectly merged sequences)
📝
Note

Modern assemblers use more sophisticated approaches (like De Bruijn graphs) that handle repeats better. Greedy assembly is rarely used alone for real genome projects.