Genome Assembly

📖
Definition: Genome Assembly

Genome assembly is the computational process of reconstructing the complete genome sequence from millions of short DNA fragments (reads) produced by sequencing.

The Puzzle Analogy

Think of it like solving a jigsaw puzzle:

  • The reads = individual puzzle pieces (short DNA sequences, typically 50-300 bp on short-read platforms)
  • The genome = complete picture (the full chromosome sequences)
  • Assembly = finding overlaps between pieces to reconstruct the whole picture

Why Is It Needed?

Sequencing technologies read DNA only in fragments far shorter than a chromosome, but we need the complete genome sequence. Assembly algorithms find overlapping regions between reads and merge them into longer sequences called contigs (contiguous sequences).

💻
Example

Read 1: ATCGATTGCA
Read 2: TTGCAGGCTAA
Read 3: GGCTAATCGA

Assembled: ATCGATTGCAGGCTAATCGA

(The overlapping regions helped merge them: Read 1 and Read 2 share TTGCA, and Read 2 and Read 3 share GGCTAA.)
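
The merge in this example can be sketched in a few lines of Python. This is only a toy illustration, assuming the reads are error-free, all on the same strand, and already listed in left-to-right order. The function names and the min_overlap parameter are hypothetical, chosen for this sketch. Real assemblers have to work out the read order themselves and cope with sequencing errors, repeats, and reverse complements.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0  # no overlap of at least min_overlap bases


def merge_in_order(reads: list[str], min_overlap: int = 3) -> str:
    """Merge reads that are already in left-to-right order into one contig."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap_length(contig, read, min_overlap)
        if size == 0:
            raise ValueError(f"no overlap of at least {min_overlap} bases with {read!r}")
        # Append only the part of the read that extends past the overlap.
        contig += read[size:]
    return contig


reads = ["ATCGATTGCA", "TTGCAGGCTAA", "GGCTAATCGA"]
contig = merge_in_order(reads)
print(contig)
```

The printed contig is the assembled sequence shown in the example. The min_overlap threshold is there because very short matches between two reads can easily occur by chance.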

Two Main Approaches

  1. De novo assembly: Building the genome from scratch without a reference (like solving a puzzle without the box picture)

  2. Reference-guided assembly: Using an existing genome as a template (like having the box picture to guide you)
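
To make the second approach concrete, here is a minimal sketch of placing reads on a reference. It assumes the reads match the reference exactly, and it reuses the sequence from the example above as a stand-in reference. The place_reads function is a hypothetical name for this illustration. Real read aligners use indexed search and tolerate mismatches and gaps.

```python
def place_reads(reference: str, reads: list[str]) -> dict[str, int]:
    """Map each read to the 0-based start of its first exact match in the reference.

    A value of -1 means the read does not occur in the reference.
    """
    return {read: reference.find(read) for read in reads}


# Stand-in reference (an assumption for this sketch) and reads in arbitrary order.
reference = "ATCGATTGCAGGCTAATCGA"
reads = ["GGCTAATCGA", "ATCGATTGCA", "TTGCAGGCTAA"]

positions = place_reads(reference, reads)
# Sorting by position recovers the left-to-right order of the reads.
ordered = sorted(reads, key=positions.get)

for read in ordered:
    print(positions[read], read)
```

With the reference as the "box picture", each read is located directly. Without it, de novo assembly has to infer the same ordering from read-to-read overlaps alone.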

🔬
Fact

The first human genome took more than a decade to sequence and assemble. Now, with better algorithms and longer reads, we can assemble genomes in days or weeks!

Assembly turns fragmented sequencing data into meaningful, complete genome sequences.