Genome Assembly – Exam Practice
VCF genotype notation:
• FORMAT GT:GQ:DP with value 0/1:99:32 — 0/1 = heterozygous (0 = reference allele, 1 = first alternate allele); the "/" indicates the genotype is unphased. GQ=99 means very high genotype confidence. DP=32 means 32 reads support this site.
• 0/1 vs 0|1 — "/" marks an unphased genotype, "|" a phased one: 0|1 asserts that the reference allele lies on the first haplotype and the alternate allele on the second.
• 2/1 — a heterozygote carrying two different alternate alleles (the second and first ALT alleles), with no reference allele.
Coverage calculation:
Step 1: Determine the genome size in base pairs.
Step 2: Calculate the number of reads needed for the target coverage.
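The two coverage-planning steps above can be sketched numerically; a minimal example, with genome size, target coverage, and read length chosen purely for illustration:

```python
def reads_needed(genome_size_bp, target_coverage, read_length_bp):
    """Coverage = (reads * read length) / genome size,
    so reads = (coverage * genome size) / read length."""
    return (target_coverage * genome_size_bp) / read_length_bp

# Assumed example: a 3 Gb genome at 30x coverage with 150 bp reads
n = reads_needed(3_000_000_000, 30, 150)
print(f"{n:.0f} reads")  # 600000000 reads
```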
1. Gather information about the target genome — Investigate genome size, repeats, heterozygosity, ploidy, and GC content to plan the assembly strategy.
2. Extract high-quality DNA — Obtain pure, intact, high-molecular-weight DNA suitable for the chosen sequencing technology.
3. Design the best experimental workflow — Define experimental goals, select sequencing strategy (de novo vs reference-guided), and plan coverage and library types.
4. Choose sequencing technology and library preparation — Select between SGS, TGS, or hybrid approaches and prepare appropriate libraries (PE, mate-pair, PCR-free).
5. Evaluate computational resources — Ensure sufficient CPU, RAM, and storage are available for the assembly algorithm chosen.
6. Assemble the genome — Apply the chosen assembly algorithm (greedy, OLC, De Bruijn graph, or hybrid) to build contigs from sequencing reads.
7. Polish the assembly — Correct residual sequencing and assembly errors to improve base-level accuracy using tools like Pilon or Racon.
8. Check assembly quality — Evaluate using metrics such as N50 (contiguity), BUSCO (completeness), assembly size, and read mapping rates.
9. Scaffolding and gap filling — Connect contigs into scaffolds using paired reads, long reads, Hi-C, or optical mapping; fill gaps with sequence or Ns.
10. Re-evaluate assembly quality — Repeat quality control to ensure scaffolding and gap filling improved the assembly.
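The N50 contiguity metric from step 8 is simple to compute: sort contigs from longest to shortest and report the length at which the running total first reaches half the assembly size. A minimal sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    """N50: the contig length at which half of the total assembly
    size is contained in contigs of that length or longer."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

# Toy assembly (lengths in bp, invented for illustration)
print(n50([100, 200, 300, 400, 500]))  # 400
```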
OLC (Overlap-Layout-Consensus):
• Nodes = individual reads; Edges = overlaps between reads
• Three steps: (i) compute overlaps between all reads, (ii) lay out overlap information in a graph, (iii) infer consensus sequence
• Requires comparing all reads to all other reads → computationally impractical for billions of short reads
• Best suited for long reads (e.g., PacBio, Nanopore) where the number of reads is smaller
• Assembly corresponds to finding a Hamiltonian path (visiting every node once)
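The overlap step above can be sketched with toy reads (sequences invented for illustration); note the all-vs-all loop, which is exactly the quadratic cost that makes OLC impractical for billions of short reads:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    for i in range(len(a) - min_len + 1):
        if b.startswith(a[i:]):
            return len(a) - i
    return 0

reads = ["ATGCGT", "GCGTAC", "GTACGA"]  # toy reads (assumed)

# Overlap step: compare every read against every other read,
# keeping suffix-prefix overlaps as graph edges (read -> read).
edges = {}
for a in reads:
    for b in reads:
        if a != b:
            olen = overlap(a, b)
            if olen > 0:
                edges[(a, b)] = olen

print(edges)  # {('ATGCGT', 'GCGTAC'): 4, ('GCGTAC', 'GTACGA'): 4}
```

The resulting graph has reads as nodes and overlaps as edges; laying out a path through every node (the Hamiltonian path of the bullet above) and taking the consensus yields the assembled sequence.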
De Bruijn Graph (DBG):
• Reads are decomposed into k-mers; Nodes = (k−1)-mers; Edges = k-mers connecting prefix to suffix
• Does not require all-vs-all read comparison, making it much more scalable
• Assembly corresponds to finding an Eulerian path (visiting every edge once), which is computationally easier than Hamiltonian paths
• Best suited for short reads (e.g., Illumina)
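Graph construction from the bullets above can be sketched directly: each k-mer contributes one edge from its (k−1)-mer prefix to its (k−1)-mer suffix, and no read-vs-read comparison is ever needed. Toy reads are assumed for illustration:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Toy reads (assumed), k = 3; overlapping reads reuse the same edges
g = de_bruijn(["ATGCG", "TGCGT"], 3)
print(dict(g))  # {'AT': ['TG'], 'TG': ['GC', 'GC'], 'GC': ['CG', 'CG'], 'CG': ['GT']}
```

An Eulerian path through this graph (using every edge once) spells out the assembled sequence.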
Why DBG became dominant: With SGS, the number of reads increased exponentially while lengths shortened. OLC's all-vs-all comparison became impractical. DBG avoids this by working with k-mers, consuming less computational time and memory.
Gene prediction strategies:
1. Intrinsic (Ab initio): Relies solely on the genomic sequence itself, using mathematical models (e.g., Hidden Markov Models) trained on known genes to identify gene-like features (ORFs, start/stop codons, splice sites). Strengths: no external data needed, can detect novel genes, high sensitivity (~100%). Limitations: species-specific training required, moderate accuracy (~60-70% for exon-intron structures), struggles with complex eukaryotic genes.
2. Extrinsic (Homology-based): Compares the genome to known gene/protein sequences in databases (NCBI, UniProt). If similarity is found, a gene is inferred. Strengths: leverages extensive existing databases, protein sequences are conserved even between distant species. Limitations: cannot detect truly novel genes absent from databases.
3. Combined (Hybrid/Combiner): Integrates both ab initio models and extrinsic evidence (RNA-Seq data, protein databases, known genes). Strengths: most accurate and widely used approach, benefits from both computational prediction and experimental evidence. Example: AUGUSTUS can work both de novo and with external data. This is the most popular strategy in modern annotation projects.
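The ab initio idea in strategy 1 can be illustrated with the simplest possible signal: scanning each reading frame for an ORF running from ATG to the first in-frame stop codon. This is only a sketch of the intuition; a real gene finder such as AUGUSTUS models many more signals (splice sites, codon bias) with HMMs.

```python
def find_orfs(seq, min_len=6):
    """Naive ab initio scan: report ORFs from ATG to the first
    in-frame stop codon, in each of the three forward frames."""
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in stops and start is not None:
                if i + 3 - start >= min_len:
                    orfs.append(seq[start:i + 3])
                start = None
    return orfs

# Toy sequence (assumed); the ORF sits in the third reading frame
print(find_orfs("CCATGAAATAGGG"))  # ['ATGAAATAG']
```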
Genome size estimation: after discarding error K-mers, genome size ≈ (total number of K-mers) ÷ (K-mer depth at the main peak).
Interpretation of the distribution:
• Left peak (1–5× frequency): These low-frequency K-mers are likely caused by sequencing errors. Errors introduce unique, erroneous K-mers that appear only once or a few times. These should be discarded before assembly.
• Main peak (~10× coverage): Represents the true genomic K-mers. The position of this peak corresponds to the average sequencing depth/coverage of the genome.
• Right tail (extending to 50×): These over-represented K-mers are likely derived from repetitive regions in the genome, which are sequenced multiple times. A prominent right tail suggests the genome contains significant repetitive content, which may complicate assembly.
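The interpretation above translates directly into an estimate: drop the error peak, take the depth of the main peak as the average K-mer coverage, and divide the total K-mer count by it. A minimal sketch with an invented histogram (depth → number of distinct K-mers observed at that depth):

```python
def estimate_genome_size(hist, error_cutoff=5):
    """Genome size ~= total K-mers / main-peak depth, after
    discarding low-frequency (likely erroneous) K-mers."""
    filtered = {d: n for d, n in hist.items() if d > error_cutoff}
    peak_depth = max(filtered, key=filtered.get)           # mode of the main peak
    total_kmers = sum(d * n for d, n in filtered.items())  # area under the curve
    return total_kmers / peak_depth

# Toy K-mer spectrum (assumed): error peak at 1-2x, main peak ~10x,
# a small repeat tail at 50x
hist = {1: 5000, 2: 1000, 9: 300, 10: 800, 11: 300, 50: 20}
print(f"{estimate_genome_size(hist):.0f}")  # 1500
```

Repeat-derived K-mers in the right tail inflate the total count, which is one reason repeat-rich genomes are harder to size and assemble accurately.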