Three Laws of Genome Assembly

Genome assembly follows three fundamental principles that determine success or failure. Understanding these "laws" helps explain why some genomes are easy to assemble while others remain challenging.


Law #1: Overlaps Reveal Relationships

📖
First Law

If the suffix of read A is similar to the prefix of read B, then A and B might overlap in the genome.

What this means:

When the end of one read matches the beginning of another read, they likely came from adjacent or overlapping regions in the original DNA molecule.

💻
Example

Read A: ATCGATTGCA
Read B: ATTGCAGGCT

The suffix of A (ATTGCA) matches the prefix of B (ATTGCA) → They overlap!
Assembled: ATCGATTGCAGGCT

Important caveat: The word "might" is crucial. A matching suffix and prefix doesn't guarantee that two reads come from neighboring positions in the genome. They could come from different copies of a repeated sequence!
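The check in the example above can be sketched in a few lines of Python. This is a minimal illustration that assumes exact matches only. The function names and the `min_length` parameter are made up for this sketch, and real assemblers tolerate sequencing errors and use far faster indexing.

```python
from typing import Optional


def suffix_prefix_overlap(a: str, b: str, min_length: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_length` bases exists.
    """
    max_possible = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink until one matches.
    for length in range(max_possible, min_length - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def merge_reads(a: str, b: str, min_length: int = 3) -> Optional[str]:
    """Merge `a` and `b` on their longest exact overlap, or return None."""
    length = suffix_prefix_overlap(a, b, min_length)
    if length == 0:
        return None
    # Keep all of `a`, then append the part of `b` that extends past the overlap.
    return a + b[length:]


read_a = "ATCGATTGCA"
read_b = "ATTGCAGGCT"
print(suffix_prefix_overlap(read_a, read_b))  # 6
print(merge_reads(read_a, read_b))            # ATCGATTGCAGGCT
```

Requiring a minimum overlap length is one simple way to act on the "might": very short matches occur by chance, so they are weak evidence that two reads are neighbors.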



Law #2: Coverage Enables Assembly

📖
Second Law

More coverage means more overlaps, which means better assembly.

What this means:

Higher sequencing depth (coverage) generates more reads spanning each genomic region, creating more overlapping read pairs that can be assembled together.

The relationship:

  • Low coverage (5-10×): Sparse overlaps, many gaps, fragmented assembly
  • Medium coverage (30-50×): Good overlaps, most regions covered, decent contigs
  • High coverage (100×+): Abundant overlaps, nearly complete assembly, longer contigs

💡
Tip

More coverage generally improves assembly, but with diminishing returns. Going from 10× to 50× makes a huge difference; going from 100× to 200× helps much less.
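One way to see the diminishing returns is with a rough calculation. Average coverage is commonly computed as C = N × L / G (number of reads times read length, divided by genome size). Under an idealized model where reads land uniformly at random (the Lander-Waterman model), the expected fraction of bases hit by no read is about e^(-C). The sketch below assumes that model and uses made-up function names. Real data also suffers from sequencing bias and repeats, which this ignores.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average sequencing depth, C = N * L / G."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Expected fraction of bases covered by no read, e^(-C).

    Assumes reads are placed uniformly at random (an idealization).
    """
    return math.exp(-coverage)


# Hypothetical project: one million 150 bp reads on a 5 Mbp genome.
print(mean_coverage(1_000_000, 150, 5_000_000))  # 30.0

for c in (5, 10, 50, 100):
    fraction = expected_uncovered_fraction(c)
    print(f"{c:>3}x coverage: about {fraction:.1e} of bases uncovered")
```

In this model the uncovered fraction is already tiny by 10×, so extra depth beyond that mainly adds redundancy for error correction and longer, more reliable overlaps rather than filling gaps.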

Why it works:

Imagine trying to assemble a sentence with only a few random words versus having many overlapping phrases—more data gives more context and connections.

💻
Coverage Example

Genome region: ATCGATCGATCG (12 bp)

Low coverage (5 reads, about 2× average depth):
ATCGAT------
--CGAT------
----ATCGAT--
-----TCGA---
-------GATCG
Result: The ends are covered by a single read, overlaps are uncertain

High coverage (20× or more):
Many more reads covering every position multiple times
Result: Clear overlaps, confident assembly
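The number of reads is not the same as depth: depth is how many reads cover each base. As a rough sketch, the five example reads can be placed on the 12 bp region and counted position by position. The offsets below are assumptions chosen so that each read matches the region. Because the region is itself repetitive, some reads would also match at other offsets.

```python
genome = "ATCGATCGATCG"

# (offset, read) pairs; the offsets are assumed placements for this sketch.
placed_reads = [
    (0, "ATCGAT"),
    (2, "CGAT"),
    (4, "ATCGAT"),
    (5, "TCGA"),
    (7, "GATCG"),
]

depth = [0] * len(genome)
for offset, read in placed_reads:
    # Sanity check: the read must match the genome at its assumed offset.
    assert genome[offset:offset + len(read)] == read, "read does not match here"
    for position in range(offset, offset + len(read)):
        depth[position] += 1

mean_depth = sum(depth) / len(genome)
print("per-base depth:", depth)  # [1, 1, 2, 2, 3, 4, 2, 3, 3, 2, 1, 1]
print(f"mean depth: {mean_depth:.1f}x, minimum depth: {min(depth)}x")
```

Five short reads give only about 2× average depth here, and the first and last two bases are each covered by a single read, so one sequencing error there would go unnoticed.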


Law #3: Repeats Are The Enemy

🚫
Third Law

Repeats are bad for assembly. Very bad.

What this means:

When a DNA sequence appears multiple times in the genome (repeats), assembly algorithms cannot determine which copy a read came from, leading to ambiguous or incorrect assemblies.

Types of problematic repeats:

  • Exact repeats: Identical sequences appearing multiple times
  • Transposable elements: Mobile DNA sequences copied throughout the genome
  • Tandem repeats: Sequences repeated back-to-back (CAGCAGCAGCAG...)
  • Segmental duplications: Large blocks of duplicated DNA

💻
Why Repeats Break Assembly

Genome:
ATCG[REPEAT]GGGG...CCCC[REPEAT]TACG

Problem:
When you find a read containing "REPEAT", you don't know if it came from the first location or the second location!

Result:
Assembly breaks into multiple contigs at repeat boundaries, or worse, creates chimeric assemblies by incorrectly connecting different genomic regions.

The challenge:

If a repeat is longer than your read length, no single read can span it together with its unique flanking sequence, so the reads alone cannot determine the correct path through the assembly.
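The ambiguity is easy to reproduce on a toy example. The sketch below builds a hypothetical genome with two copies of one repeat and looks up where a read matches exactly. The sequences and the `find_all` helper are invented for illustration only.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        # Continue one base further so overlapping matches are also found.
        start = genome.find(read, start + 1)
    return positions


# Hypothetical toy genome: unique flanks around two copies of one repeat.
repeat = "GATTACAGATTACA"
genome = "ATCG" + repeat + "GGGGTTCCCC" + repeat + "TACG"

short_read = "TTACAGATT"            # lies entirely inside the repeat
long_read = "TCG" + repeat + "GGG"  # spans the repeat plus both flanks

print(find_all(genome, short_read))  # [6, 30] -> two candidate origins
print(find_all(genome, long_read))   # [1]     -> unique placement
```

The short read fits both copies equally well, while the read that reaches unique sequence on both sides of the repeat has only one possible placement.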

⚠️
Real-World Impact

The human genome is ~50% repetitive sequences! This is why:

  • Early human genome assemblies had thousands of gaps
  • Some regions remained unassembled for decades
  • Long-read sequencing (10kb+ reads) was needed to finally span repeats

Solutions to the repeat problem:

  1. Longer reads: Span the entire repeat in a single read
  2. Paired-end reads: Use insert size information to bridge repeats
  3. High coverage: May help distinguish near-identical repeat copies by the small differences between them
  4. Reference genomes: Use a related species' genome as a guide

🔬
Fact

The final 8% of the human genome (highly repetitive regions such as centromeres, segmental duplications, and the short arms of acrocentric chromosomes) wasn't fully assembled until 2022, nearly 20 years after the "complete" Human Genome Project, thanks to highly accurate long reads from PacBio and ultra-long reads from Oxford Nanopore sequencing!


Summary: The Three Laws

Remember These Three Laws
  1. Overlaps suggest adjacency – matching suffix/prefix indicates reads might be neighbors
  2. Coverage enables confidence – more reads mean more overlaps and better assembly
  3. Repeats create ambiguity – identical sequences break assembly continuity

Understanding these principles explains why genome assembly remains challenging and why different strategies (long reads, paired ends, high coverage) are needed for complex genomes.

📝
Assembly Quality Trade-offs

The three laws create a fundamental trade-off:

  • Want to resolve repeats? → Need longer reads (but more expensive)
  • Want better coverage? → Need more sequencing (costs more money/time)
  • Want perfect assembly? → May be impossible for highly repetitive genomes

Every genome assembly project must balance accuracy, completeness, and cost.