Genome Assembly
Genome assembly is the computational process of reconstructing the complete genome sequence from millions of short DNA fragments (reads) produced by sequencing.
The Puzzle Analogy
Think of it like solving a jigsaw puzzle:
- The reads = individual puzzle pieces (short DNA sequences, typically 50-300 bp for short-read sequencing)
- The genome = complete picture (the full chromosome sequences)
- Assembly = finding overlaps between pieces to reconstruct the whole picture
Why Is It Needed?
Sequencing technologies read DNA only in fragments that are far shorter than a chromosome, but we need the complete genome sequence. Assembly algorithms find overlapping regions between reads (where the end of one read matches the start of another) and merge them into longer sequences called contigs (contiguous sequences).
Read 1: ATCGATTGCA
Read 2: TTGCAGGCTAA
Read 3: GGCTAATCGA
Assembled: ATCGATTGCAGGCTAATCGA
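(The overlaps that allow the merge: TTGCA is shared by the end of Read 1 and the start of Read 2, and GGCTAA by the end of Read 2 and the start of Read 3.)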
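The merge above can be reproduced with a minimal greedy sketch: repeatedly find the pair of sequences with the longest suffix-to-prefix overlap and join them. The function names `overlap` and `greedy_assemble` and the `min_len` threshold are illustrative assumptions, not part of any particular assembler, and real tools handle sequencing errors, reverse complements, and repeats, which this sketch does not.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Greedily merge the best-overlapping pair until no overlaps remain."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        # Compare every ordered pair (a before b) and keep the longest overlap.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:  # ties are broken by input order
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge; several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATCGATTGCA", "TTGCAGGCTAA", "GGCTAATCGA"]
print(greedy_assemble(reads))  # ['ATCGATTGCAGGCTAATCGA']
```

Note that in this toy data the end of Read 3 (ATCGA) also matches the start of Read 1, so the final merge is a tie that the sketch resolves by input order. Ambiguities of this kind, usually caused by repeated sequence, are a large part of what makes real assembly hard.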
Two Main Approaches
- De novo assembly: Building the genome from scratch without a reference (like solving a puzzle without the box picture)
- Reference-guided assembly: Using an existing genome as a template (like having the box picture to guide you)
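To contrast with the de novo case, here is a rough sketch of the reference-guided idea: each read is placed where it best matches the reference, and the bases stacked at each position are combined by majority vote. The names `best_position` and `reference_guided_consensus` are hypothetical, the sample reads are made up for illustration (they carry one C-to-T difference from the reference), and the ungapped brute-force matching is a simplifying assumption. Real pipelines use indexed aligners that tolerate insertions and deletions.

```python
from collections import Counter


def best_position(read: str, reference: str) -> int:
    """Offset in `reference` where `read` aligns with the fewest mismatches.

    Ungapped, brute-force comparison; the earliest offset wins ties.
    """
    def mismatches(pos: int) -> int:
        window = reference[pos:pos + len(read)]
        return sum(r != g for r, g in zip(read, window))

    return min(range(len(reference) - len(read) + 1), key=mismatches)


def reference_guided_consensus(reads, reference: str) -> str:
    """Place reads on the reference and call the majority base per position."""
    votes = [Counter() for _ in reference]
    for read in reads:
        pos = best_position(read, reference)
        for offset, base in enumerate(read):
            votes[pos + offset][base] += 1
    # Positions that no read covers are reported as 'N' (unknown).
    return "".join(v.most_common(1)[0][0] if v else "N" for v in votes)


reference = "ATCGATTGCAGGCTAATCGA"
# Hypothetical sample reads: same layout as the reads above, with one C->T change.
sample_reads = ["ATCGATTGCA", "TTGCAGGTTAA", "GGTTAATCGA"]
print(reference_guided_consensus(sample_reads, reference))
# ATCGATTGCAGGTTAATCGA
```

The reference supplies the layout, so no read-to-read overlap search is needed. The trade-off is that sequence absent from the reference cannot be placed this way.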
The first human genome took more than a decade to sequence and assemble (the Human Genome Project ran from 1990 to 2003). Now, with better algorithms and longer reads, we can assemble genomes in days or weeks!
Assembly turns fragmented sequencing data into meaningful, complete genome sequences.