Three Laws of Genome Assembly
Genome assembly follows three fundamental principles that determine success or failure. Understanding these "laws" helps explain why some genomes are easy to assemble while others remain challenging.
Law #1: Overlaps Reveal Relationships
If the suffix of read A is similar to the prefix of read B, then A and B might overlap in the genome.
What this means:
When the end of one read matches the beginning of another read, they likely came from adjacent or overlapping regions in the original DNA molecule.
Read A: ATCGATTGCA
Read B: ATTGCAGGCT
The suffix of A (ATTGCA) matches the prefix of B (ATTGCA) → They overlap!
Assembled: ATCGATTGCAGGCT
Important caveat: The word "might" is crucial. Just because two reads overlap doesn't guarantee they're from the same genomic location—they could be from repeated sequences!
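The suffix/prefix check above can be sketched in a few lines of Python. This is a minimal illustration, not how production assemblers work: the function names `longest_overlap` and `merge` and the `min_len` threshold are assumptions made for this example, and it requires an exact match, whereas real tools tolerate sequencing errors ("similar" rather than identical).

```python
from typing import Optional


def longest_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    # Try the longest possible overlap first and shrink until one matches.
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge(a: str, b: str, min_len: int = 3) -> Optional[str]:
    """Join `a` and `b` across their overlap, or return None if there is none."""
    k = longest_overlap(a, b, min_len)
    return a + b[k:] if k else None


read_a = "ATCGATTGCA"
read_b = "ATTGCAGGCT"
print(longest_overlap(read_a, read_b))  # 6
print(merge(read_a, read_b))            # ATCGATTGCAGGCT
```

Note that a successful merge only shows that the reads are compatible. As the caveat says, it does not prove they came from the same place in the genome.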
Law #2: Coverage Enables Assembly
More coverage means more overlaps, which means better assembly.
What this means:
Higher sequencing depth (coverage) generates more reads spanning each genomic region, creating more overlapping read pairs that can be assembled together.
The relationship:
- Low coverage (5-10×): Sparse overlaps, many gaps, fragmented assembly
- Medium coverage (30-50×): Good overlaps, most regions covered, decent contigs
- High coverage (100×+): Abundant overlaps, nearly complete assembly, longer contigs
More coverage generally improves assembly, but with diminishing returns: going from 10× to 50× makes a huge difference, while going from 100× to 200× improves things far less.
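One way to see the diminishing returns is the classic idealized model in which reads land uniformly at random along the genome. Under that assumption the chance that a given base is hit by no read at all is roughly e^(-c), where c is the average coverage. The sketch below assumes that simplified model (it ignores sequencing bias, repeats, and the minimum overlap length an assembler needs), and the function name is hypothetical.

```python
import math


def expected_uncovered_fraction(coverage: float) -> float:
    """Expected fraction of bases covered by zero reads.

    Assumes reads start uniformly at random (Poisson approximation),
    which gives e^(-coverage). Real data are less uniform than this.
    """
    return math.exp(-coverage)


# Each step up in coverage removes less and less of the remaining gap.
for c in (5, 10, 30, 50, 100):
    print(f"{c:>3}x coverage: uncovered fraction ~ {expected_uncovered_fraction(c):.2e}")
```

Under this model the uncovered fraction is already well below 1% at 10×, so further sequencing mainly buys confidence in overlaps and error correction rather than new territory.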
Why it works:
Imagine trying to assemble a sentence with only a few random words versus having many overlapping phrases—more data gives more context and connections.
Genome region: ATCGATCGATCG (12 bp)
Low coverage (5 short reads, about 2× average depth):
ATCGAT------
--CGAT------
----ATCGAT--
-----TCGA---
-------GATCG
Result: Thin coverage at the ends, uncertain overlaps
Higher coverage (20 reads of similar length, about 8× average depth):
Many more reads covering every position multiple times
Result: Clear overlaps, confident assembly
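Coverage is the total number of sequenced bases divided by the length of the region, so it depends on read length as well as read count. The sketch below computes per-base depth for the five toy reads, each placed at an offset where it matches the 12 bp region exactly. The offsets are assumptions chosen for this illustration (the region is itself repetitive, so other placements are possible).

```python
genome = "ATCGATCGATCG"

# (offset, read) pairs; offsets are 0-based and chosen for this illustration.
placed_reads = [
    (0, "ATCGAT"),
    (2, "CGAT"),
    (4, "ATCGAT"),
    (5, "TCGA"),
    (7, "GATCG"),
]

depth = [0] * len(genome)
for offset, read in placed_reads:
    # Sanity check: the read really matches the region at this offset.
    assert genome[offset:offset + len(read)] == read
    for i in range(offset, offset + len(read)):
        depth[i] += 1

total_bases = sum(len(read) for _, read in placed_reads)
mean_depth = total_bases / len(genome)

print(depth)                              # [1, 1, 2, 2, 3, 4, 2, 3, 3, 2, 1, 1]
print(f"mean depth = {mean_depth:.1f}x")  # mean depth = 2.1x
```

Five short reads give only about 2× average depth here, and the first and last two bases are supported by a single read each, which is why the low-coverage picture leaves overlaps uncertain.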
Law #3: Repeats Are The Enemy
Repeats are bad for assembly. Very bad.
What this means:
When a DNA sequence appears multiple times in the genome (repeats), assembly algorithms cannot determine which copy a read came from, leading to ambiguous or incorrect assemblies.
Types of problematic repeats:
- Exact repeats: Identical sequences appearing multiple times
- Transposable elements: Mobile DNA sequences copied throughout the genome
- Tandem repeats: Sequences repeated back-to-back (CAGCAGCAGCAG...)
- Segmental duplications: Large blocks of duplicated DNA
Genome:
ATCG[REPEAT]GGGG...CCCC[REPEAT]TACG
Problem:
When you find a read containing "REPEAT", you don't know if it came from the first location or the second location!
Result:
Assembly breaks into multiple contigs at repeat boundaries, or worse, creates chimeric assemblies by incorrectly connecting different genomic regions.
The challenge:
If a repeat is longer than your read length, no single read can span it, so the reads alone cannot determine the correct path through the assembly.
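The ambiguity can be demonstrated with a toy genome shaped like the one above. In this sketch the 10 bp repeat sequence, the flanking bases, and the helper name `find_all` are all hypothetical values made up for illustration. A read that lies entirely inside the repeat matches two places, while a read long enough to include unique sequence on both sides matches only one.

```python
from typing import List


def find_all(genome: str, read: str) -> List[int]:
    """Return every 0-based start position where `read` occurs in `genome`."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        # Step by one so overlapping occurrences are also found.
        start = genome.find(read, start + 1)
    return positions


repeat = "TTGACCATGA"  # hypothetical 10 bp repeat
genome = "ATCG" + repeat + "GGGG" + "CCCC" + repeat + "TACG"

short_read = repeat[1:9]            # 8 bp, lies entirely inside the repeat
long_read = "CG" + repeat + "GG"    # 14 bp, spans the repeat plus both unique flanks

print(find_all(genome, short_read))  # [5, 23]
print(find_all(genome, long_read))   # [2]
```

The short read cannot be assigned to either copy, whereas the long read is anchored by the unique bases on each side of the repeat.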
The human genome is ~50% repetitive sequences! This is why:
- Early human genome assemblies had thousands of gaps
- Some regions remained unassembled for decades
- Long-read sequencing (10kb+ reads) was needed to finally span repeats
Solutions to the repeat problem:
- Longer reads: Span the entire repeat in a single read
- Paired-end reads: Use insert size information to bridge repeats
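- High coverage: Makes small differences between near-identical repeat copies easier to detect, and unusually high read depth can flag a repeat that has been collapsed into one copy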
- Reference genomes: Use a related species' genome as a guide
The final 8% of the human genome (highly repetitive centromeres and telomeres) wasn't fully assembled until 2022, nearly 20 years after the "complete" Human Genome Project, thanks to highly accurate long reads from PacBio and ultra-long reads from Oxford Nanopore sequencing!
Summary: The Three Laws
- Overlaps suggest adjacency – matching suffix/prefix indicates reads might be neighbors
- Coverage enables confidence – more reads mean more overlaps and better assembly
- Repeats create ambiguity – identical sequences break assembly continuity
Understanding these principles explains why genome assembly remains challenging and why different strategies (long reads, paired ends, high coverage) are needed for complex genomes.
The three laws create a fundamental trade-off:
- Want to resolve repeats? → Need longer reads (but more expensive)
- Want better coverage? → Need more sequencing (costs more money/time)
- Want perfect assembly? → May be impossible for highly repetitive genomes
Every genome assembly project must balance accuracy, completeness, and cost.