Genome Assembly – Exam Practice

📝 Genome Assembly – Full Coverage
Q1 Easy
What is variant calling?
A. The process of assembling reads into contigs
B. The process of identifying differences (SNPs, indels) between sequenced reads and a reference genome
C. The process of aligning reads to a reference genome
D. The process of annotating genes in the genome
Explanation
Variant calling identifies differences (like SNPs or indels) between sequenced reads and a reference genome. Not every difference is a true variant — sequencing errors, alignment issues, and low-quality bases can produce false positives.
Q2 Medium
Which of the following is most likely to produce false-positive variant calls?
A. High mapping quality reads
B. Sequencing depth of 30×
C. Variants located in homopolymer regions
D. Variants supported by high base-call quality scores
Explanation
Homopolymer regions (e.g., AAAAA) are prone to sequencing errors, especially with certain technologies. Variants found in these regions are often false positives. Modern variant callers include filters to detect and handle homopolymeric regions.
Q3 Medium
What is the advantage of joint variant calling over individual variant calling?
A. It requires fewer computational resources
B. It produces one VCF file per sample
C. It only identifies homozygous variants
D. A low-confidence variant in one sample may be confirmed by evidence from other samples
Explanation
Joint variant calling analyzes all samples simultaneously. A key advantage is improved sensitivity: a low-confidence variant in one sample can be confidently called if supported in other samples. It also helps detect shared variants even when some individuals have low coverage. However, it requires more computational resources than individual calling.
Q4 Tricky
In individual variant calling, if a variant is missing from a sample's VCF file, what can be concluded?
A. The variant may still be present but was missed due to insufficient coverage
B. The variant is definitely absent in that sample
C. The reference genome is incorrect at that position
D. The variant is homozygous in that sample
Explanation
A key limitation of individual variant calling: if a variant is missing from some VCF files, it does not necessarily mean it is absent in those samples. It could be due to insufficient sequencing coverage or different variant types. This is one reason joint calling is generally preferred for multi-sample projects.
Q5 Easy
In a VCF file, what does the QUAL column represent?
A. The base quality of the reference allele
B. The confidence score for the variant call
C. The mapping quality of reads at the position
D. The read depth at the variant position
Explanation
The QUAL column in VCF indicates confidence in the variant call. Read depth (DP) and allele frequency (AF) are found in the INFO column. Mapping quality is a separate concept from variant call quality.
Q6 Hard
In a VCF file, a sample shows FORMAT GT:GQ:DP and value 0/1:99:32. What does this mean?
A. Homozygous reference, quality 99, depth 32
B. Homozygous alternative, quality 32, depth 99
C. Heterozygous (one ref, one alt allele), quality 99, depth 32
D. Heterozygous, phased genotype, quality 99, depth 32
Explanation
FORMAT GT:GQ:DP means the sample data is structured as Genotype:Genotype Quality:Read Depth. 0/1 = heterozygous (0 = reference, 1 = first alt allele), the "/" indicates unphased. GQ=99 means very high confidence. DP=32 means 32 reads support this site. If it were phased, it would use "|" instead of "/".
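For readers who want to decode such a field programmatically, a minimal Python sketch (the helper function and example values are illustrative, not from any VCF library):

```python
# Minimal sketch: pair FORMAT keys with a sample's values from a VCF record.
def parse_sample(format_str: str, sample_str: str) -> dict:
    return dict(zip(format_str.split(":"), sample_str.split(":")))

fields = parse_sample("GT:GQ:DP", "0/1:99:32")
gt = fields["GT"]
phased = "|" in gt                      # '|' = phased, '/' = unphased
alleles = gt.replace("|", "/").split("/")
heterozygous = len(set(alleles)) > 1    # {'0', '1'} -> heterozygous
print(fields, phased, heterozygous)
# {'GT': '0/1', 'GQ': '99', 'DP': '32'} False True
```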
Q7 Tricky
What is the difference between 0/1 and 0|1 in VCF genotype notation?
A. 0/1 is homozygous while 0|1 is heterozygous
B. 0/1 is unphased (allele-to-chromosome assignment unknown) while 0|1 is phased (known assignment)
C. 0/1 means low quality while 0|1 means high quality
D. 0/1 is from short-read data while 0|1 is from long-read data
Explanation
The "/" (forward slash) indicates an unphased genotype — you know which alleles are present but not which chromosome each came from. The "|" (pipe) indicates a phased genotype — you know the exact combination of alleles on each chromosome. Both 0/1 and 0|1 are heterozygous. Phasing is useful in haplotype or linkage studies.
Q8 Medium
In VCF genotype notation, what does 2/1 indicate?
A. Homozygous for the second alternative allele
B. Heterozygous with one reference and one alternative allele
C. Missing genotype data
D. Heterozygous with the first ALT allele and the second ALT allele
Explanation
In VCF: 0 = reference allele, 1 = first alternative allele, 2 = second alternative allele (at multiallelic sites). So 2/1 means heterozygous with one allele being the first ALT and the other being the second ALT. This is different from 0/1 (ref + first ALT).
Q9 Medium
In IGV, how would you identify a heterozygous variant (0/1)?
A. About half of the reads show the variant allele and half show the reference
B. All reads match the reference
C. Nearly all reads show the variant allele
D. Only 1–2 reads show the variant
Explanation
Heterozygous (0/1): approximately half the reads show the variant. Homozygous reference (0/0): all reads match reference. Homozygous alternative (1/1): nearly all reads show the variant. If only 1–2 reads show the variant, it's likely a sequencing or mapping error, not a true variant.
Q10 Medium
Which of the following is NOT a recommended post-calling quality control step?
A. Filter variants by read depth (exclude too low or too high)
B. Retain only variants marked as "PASS" in the FILTER field
C. Retain all variants regardless of quality score to maximize sensitivity
D. Exclude variants in repetitive or low-complexity regions
Explanation
Post-calling QC should filter by quality score (e.g., QUAL > 30) to keep only high-confidence variants. Retaining all variants regardless of quality would include many false positives. Other valid steps include filtering by read depth, excluding repetitive regions, and checking the FILTER field for "PASS".
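As an illustration of these filters, a minimal sketch in Python, assuming tab-separated VCF body lines and taking DP from the INFO column; the thresholds (QUAL > 30, 10 ≤ DP ≤ 100, FILTER == "PASS") echo the examples in this section, and the record itself is invented:

```python
# Illustrative post-calling QC on a single VCF body line.
# VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
def passes_qc(vcf_line: str, min_qual=30.0, min_dp=10, max_dp=100) -> bool:
    fields = vcf_line.rstrip("\n").split("\t")
    qual, filt, info = float(fields[5]), fields[6], fields[7]
    info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    dp = int(info_map.get("DP", 0))
    return qual > min_qual and filt == "PASS" and min_dp <= dp <= max_dp

line = "chr1\t12345\t.\tA\tG\t57.3\tPASS\tDP=32;AF=0.48"
print(passes_qc(line))  # True
```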
Q11 Easy
What is a typical minimum read depth threshold for reliable variant calling?
A. ≥ 3 reads
B. ≥ 10 reads
C. ≥ 50 reads
D. ≥ 100 reads
Explanation
A common minimum depth threshold is ≥ 10 reads. Fewer than 3 reads means low confidence. A maximum depth threshold (e.g., > 100 reads) is also useful to filter out positions with unusually high coverage, which often correspond to repetitive regions.
Q12 Medium
What is the primary goal of variant annotation?
A. To determine the biological impact and consequences of each identified variant
B. To align reads to the reference genome
C. To increase the sequencing depth of variants
D. To remove false-positive variant calls
Explanation
Variant annotation determines the impact of each variant. Variants can be in genes, introns, or regulatory regions, and their effects vary by location. Tools like ENSEMBL provide gene locations, functions, variant tables with rsIDs, and predicted consequences. Databases like dbSNP record SNP information and modification types.
Q13 Tricky
When checking variants against a variant table in ENSEMBL, approximately what percentage of variants in your sample are expected to be previously known?
A. Around 50%
B. Around 25%
C. Around 1%
D. Around 8%
Explanation
The lecture notes specifically state that "usually around 8%" of variants in your sample are previously known. Checking the variant table helps verify your results, even though the main goal is to discover new variants. This is a detail easily overlooked by students.
Q14 Easy
When is de novo genome assembly necessary?
A. When performing variant calling on a well-studied model organism
B. When RNA-seq data is available
C. When no reference genome is available for the species
D. When the genome is very small
Explanation
De novo genome assembly is essential when no reference genome is available. It can also be used to improve an existing reference. However, given its complexity and resource demands, researchers must first assess whether it's truly necessary or if a reference-guided approach can be used.
Q15 Easy
In shotgun sequencing, how are longer sequences reconstructed?
A. By searching for overlaps between the sequences of individual fragments
B. By aligning each fragment to a reference genome
C. By using restriction enzymes to cut at known positions
D. By sequencing each chromosome separately
Explanation
Shotgun sequencing works by fragmenting DNA, sequencing the fragments, and then using overlaps between fragments to reconstruct longer sequences. The method relies on random fragmentation producing overlapping pieces that can be computationally assembled.
Q16 Medium
In hierarchical shotgun sequencing, what are BAC libraries used for?
A. To sequence the genome directly in one step
B. To clone and amplify large DNA fragments (~300 kb) whose order and overlap are known
C. To store short sequencing reads for downstream analysis
D. To perform variant calling on sequenced reads
Explanation
In hierarchical shotgun sequencing, large DNA fragments are inserted into BAC (Bacterial Artificial Chromosome) libraries. Because BAC fragments are large (~300 kb) and their order/overlap are known beforehand, each BAC is individually shotgun-sequenced and then all are aligned to reconstruct the full genome. This method is mostly obsolete now.
Q17 Hard
Using flow cytometry, the C-value of a species is measured at 2.5 pg. What is the estimated genome size in base pairs?
A. ~2.5 × 10⁹ bp
B. ~978 Mb
C. ~1.95 × 10⁹ bp
D. ~2.445 × 10⁹ bp
Explanation
Using the formula: Genome size (bp) = DNA content (pg) × 0.978 × 10⁹. So: 2.5 × 0.978 × 10⁹ = 2.445 × 10⁹ bp ≈ 2,445 Mb. The C-value is the amount of DNA in picograms in a haploid genome, and 1 pg = 978 Mb.
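The conversion is easy to verify in Python (the function name is ours):

```python
# C-value (pg) to genome size (bp): 1 pg ≈ 0.978 × 10^9 bp.
def genome_size_bp(c_value_pg: float) -> float:
    return c_value_pg * 0.978e9

size = genome_size_bp(2.5)
print(f"{size:.3e} bp = {size / 1e6:.0f} Mb")  # 2.445e+09 bp = 2445 Mb
```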
Q18 Medium
In a K-mer frequency distribution, what do low-frequency K-mers (appearing only 1–20 times) most likely represent?
A. Highly conserved coding regions
B. Repetitive regions
C. Sequencing errors that introduce unique erroneous K-mers
D. True genomic K-mers at average coverage
Explanation
In a K-mer frequency distribution: low-frequency K-mers (left peak) = sequencing errors creating unique, erroneous K-mers; the main peak = true genomic K-mers at average coverage; high-frequency K-mers (right tail) = repetitive regions sequenced multiple times.
Q19 Hard
How is genome size estimated from K-mer frequency analysis?
A. Total number of K-mers (area under the curve) divided by average K-mer coverage (peak position)
B. Number of unique K-mers multiplied by K-mer length
C. Total number of reads multiplied by read length
D. Maximum K-mer frequency divided by read length
Explanation
Genome Size = Total number of K-mers (area under the curve) / Average K-mer coverage (mean coverage = position of the main peak). This provides an approximate genome size based solely on sequencing data. It's particularly useful for unknown or poorly studied genomes.
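A minimal Python sketch of this calculation from a k-mer histogram (frequency → count). The histogram values and the error cutoff here are invented for illustration; dedicated tools (e.g., GenomeScope) fit statistical models rather than reading off a raw peak:

```python
# Estimate genome size from a k-mer histogram {frequency: number of k-mers}.
def genome_size_from_kmers(hist: dict[int, int], error_cutoff: int = 5) -> float:
    # Discard low-frequency (likely erroneous) k-mers first.
    clean = {f: n for f, n in hist.items() if f > error_cutoff}
    total_kmers = sum(f * n for f, n in clean.items())  # area under the curve
    peak = max(clean, key=clean.get)                    # main-peak position
    return total_kmers / peak

hist = {1: 9_000_000, 2: 1_000_000, 9: 40_000_000, 10: 60_000_000, 11: 35_000_000}
print(f"{genome_size_from_kmers(hist):.3e} bp")  # 1.345e+08 bp
```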
Q20 Medium
How does high heterozygosity affect genome assembly?
A. It simplifies the assembly by reducing the number of contigs
B. Allelic variation can be assembled as separate regions, causing fragmentation and inflated genome size
C. It has no effect on assembly quality
D. It only affects GC content bias in Illumina sequencing
Explanation
In highly heterozygous genomes, the assembler may interpret allelic variation as two separate genomic regions. This leads to fragmentation and inflated genome size estimates — heterozygous regions may be reported twice for diploid organisms. Solutions: use inbred lines or haploid individuals, or bioinformatics tools that distinguish allelic differences from true duplications.
Q21 Medium
Why does extreme GC content cause problems for Illumina sequencing?
A. It increases the error rate of base calling
B. It causes reads to be too long for assembly
C. It causes amplification bias during PCR, resulting in low or no coverage in affected regions
D. It makes the reference genome incompatible with the reads
Explanation
Extremely low or high GC content causes amplification bias during PCR-based Illumina sequencing. This results in low or no coverage in those regions. Solutions: use GC-insensitive platforms like PacBio or Nanopore, or over-sequence to compensate for coverage gaps.
Q22 Tricky
Why is it recommended to sequence inbred individuals for genome assembly?
A. Inbred individuals have larger genomes
B. Inbred individuals have more repetitive elements
C. Inbred individuals have higher GC content
D. Low polymorphism in inbred individuals greatly simplifies assembly by reducing heterozygosity
Explanation
Inbred organisms (e.g., lab strains) have low levels of polymorphism, which greatly simplifies assembly. High heterozygosity causes assemblers to misinterpret allelic differences as separate genomic regions, leading to fragmentation. The lecture specifically mentions that differences between reads due to polymorphism "may be misinterpreted by assemblers and errors introduced in the sequence."
Q23 Medium
What is required for long-read sequencing that is NOT necessary for short-read sequencing?
A. DNA extraction from the sample
B. High-molecular-weight DNA from fresh or well-preserved tissue
C. Library preparation
D. PCR amplification
Explanation
Long-read sequencing requires high-molecular-weight (HMW) DNA (≥20 kbp), mainly obtained from fresh material. Short-read sequencing can work with fragmented or degraded DNA, making it suitable for ancient or poor-quality samples. PCR amplification is sometimes needed when DNA is limited, but it introduces bias.
Q24 Tricky
Why is PCR amplification of genomic DNA a potential problem for genome assembly?
A. Some regions amplify more efficiently than others, leading to uneven coverage and potential gaps
B. PCR destroys the DNA fragments
C. PCR only works with long-read sequencing
D. PCR introduces indels into the reads
Explanation
PCR introduces bias because some genomic regions amplify more efficiently than others. This leads to uneven coverage and potential gaps in the genome assembly. PCR-free library preparation methods are preferred when possible to avoid this bias.
Q25 Easy
What is a standard minimum coverage/depth recommended for genome assembly?
A. 10×
B. 30×
C. At least 60×
D. 100×
Explanation
A coverage of at least 60× is standard practice for genome assembly, ensuring each region is sequenced enough times for accurate assembly. This is explicitly mentioned in the lecture when discussing fold-coverage requirements for a "good assembly (>60x)".
Q26 Medium
What is the main advantage of a hybrid assembly approach (combining SGS + TGS)?
A. It is cheaper than using only short reads
B. It eliminates all assembly errors
C. It only requires de Bruijn graph assembly
D. Short reads correct errors in long reads, while long reads improve assembly continuity across repeats
Explanation
The hybrid approach compensates for the downsides of both technologies: SGS (Illumina) provides high accuracy to correct errors in TGS reads, while TGS (PacBio/Nanopore) provides long reads that span repeats and improve continuity. It's a cost-effective strategy since SGS data can correct errors in TGS reads.
Q27 Hard
In a De Bruijn graph, what do vertices and edges represent?
A. Vertices = reads; Edges = overlaps between reads
B. Vertices = (k−1)-mers; Edges = k-mers connecting prefix to suffix
C. Vertices = k-mers; Edges = (k−1)-mers
D. Vertices = chromosomes; Edges = contigs
Explanation
In a De Bruijn graph: vertices = (k−1)-mers (prefix and suffix of each k-mer), and edges = k-mers. For example, with k=3, the k-mer ATG has prefix AT and suffix TG, so the edge ATG connects node AT to node TG. Option A describes the OLC approach, not De Bruijn. Option C reverses vertices and edges — a very common exam trap!
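A toy Python construction of this graph, using exactly the prefix/suffix convention described (the read and the value of k are invented):

```python
# Nodes are (k-1)-mers; each k-mer contributes an edge prefix -> suffix.
from collections import defaultdict

def de_bruijn(reads: list[str], k: int) -> dict[str, list[str]]:
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return graph

print(dict(de_bruijn(["ATGGC"], k=3)))
# {'AT': ['TG'], 'TG': ['GG'], 'GG': ['GC']}
```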
Q28 Hard
What condition must be met for an Eulerian path to exist in a De Bruijn graph?
A. The graph must have all nodes balanced (equal in-degree and out-degree) or exactly two semi-balanced nodes
B. The graph must have exactly one node with maximum out-degree
C. Every node must be visited exactly once
D. The graph must be undirected
Explanation
An Eulerian path visits every edge exactly once (not every node — that's a Hamiltonian path). For an Eulerian path to exist, the graph must have all nodes balanced (indegree = outdegree) or at most two semi-balanced nodes (where |indegree − outdegree| = 1). De Bruijn graphs are directed, not undirected.
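A small Python check of the degree condition (connectivity is assumed and not tested here; the example edges are the toy graph from the previous question):

```python
# Degree-condition check for an Eulerian path in a directed graph.
from collections import Counter

def has_eulerian_path(edges: list[tuple[str, str]]) -> bool:
    out_deg, in_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    diffs = [in_deg[n] - out_deg[n] for n in set(out_deg) | set(in_deg)]
    semi = sum(1 for d in diffs if abs(d) == 1)  # semi-balanced nodes
    bad = sum(1 for d in diffs if abs(d) > 1)    # irreparably unbalanced
    return bad == 0 and semi in (0, 2)

# AT -> TG -> GG -> GC: all balanced except two semi-balanced endpoints.
print(has_eulerian_path([("AT", "TG"), ("TG", "GG"), ("GG", "GC")]))  # True
```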
Q29 Tricky
Why should you NOT choose the longest possible k-mer for De Bruijn graph assembly?
A. Longer k-mers produce more ambiguous assemblies
B. Longer k-mers require more sequencing depth
C. A single sequencing error affects 100% of k-mers from that read when k equals the read length, versus only a few k-mers with smaller k
D. Longer k-mers create larger, more connected graphs
Explanation
The assumption in De Bruijn graph assembly is that all k-mers are error-free, which is not true for NGS data. If you choose k = read length, then a single sequencing error affects 100% of the k-mers from that read. With a smaller k, an error only affects a limited number of k-mers. However, too small a k increases ambiguity (many repeated k-mers). Multiple assemblies with different k values can be compared.
Q30 Medium
Why did De Bruijn graph-based assemblers replace OLC for short-read data?
A. OLC produces better assemblies but is more expensive
B. De Bruijn graphs produce error-free assemblies
C. OLC cannot handle reads shorter than 1000 bp
D. OLC requires comparing all reads to all other reads, which is impractical with billions of short reads
Explanation
With SGS, the number of reads increased exponentially while read lengths shortened. The OLC approach requires comparing all reads with every other read — computationally impractical with millions or billions of reads. De Bruijn graphs are more efficient because they decompose reads into k-mers, avoiding direct all-vs-all comparisons. OLC is still used for long-read data.
Q31 Medium
What causes branching structures in De Bruijn graphs?
A. Repetitive DNA regions that create multiple possible paths
B. Too few reads in the dataset
C. Using too large a k-mer size
D. The use of paired-end reads
Explanation
Repeated sequences create branches in the De Bruijn graph because identical k-mers from different genomic locations converge, creating ambiguity about which path to follow. Paired-end reads can actually help resolve these branches — if a fragment spans the repeat, its paired reads anchor the assembly in unique flanking regions.
Q32 Medium
What is the key advantage of mate-pair sequencing over paired-end sequencing?
A. Mate-pair is simpler and cheaper to prepare
B. Mate-pair covers much larger distances (2–10 kb inserts), useful for scaffolding across repeats and gaps
C. Mate-pair produces longer reads
D. Mate-pair has higher base-call accuracy
Explanation
Mate-pair sequencing uses long DNA fragments (2–10 kb) that are circularized and labeled with biotin. This enables spanning large distances, which is crucial for scaffolding across repeats, detecting structural variations, and connecting distant contigs. Paired-end inserts are typically only 50–500 bp. However, mate-pair preparation is more complex and labor-intensive.
Q33 Tricky
In paired-end sequencing, when can two reads from the same fragment be merged into a single longer read?
A. When the fragment is very long (>1 kb)
B. When using mate-pair libraries
C. When the DNA insert is short enough that the two reads from each end overlap
D. When long-read technology is used simultaneously
Explanation
If the DNA insert is short (e.g., 200 bp) and the reads are long (e.g., 150 bp each), the two reads overlap in the middle and can be merged into a longer, more accurate read that behaves like a single-end read. This only works when the insert size is less than 2× the read length.
Q34 Easy
What is the purpose of assembly polishing?
A. To increase sequencing depth
B. To correct sequencing and assembly errors, improving base-level accuracy
C. To fragment the assembly into smaller contigs
D. To annotate genes in the assembly
Explanation
Polishing corrects sequencing errors and improves the accuracy of the consensus sequence. It is especially important for long-read assemblies, which tend to have higher error rates. Polishing tools (e.g., Pilon, Racon, Medaka) use aligned reads to detect mismatches and correct them iteratively.
Q35 Hard
What does N50 measure in a genome assembly?
A. The average length of all contigs
B. The percentage of genes correctly assembled
C. The total number of contigs in the assembly
D. The shortest contig length such that contigs of that length or longer cover 50% of the total assembly
Explanation
N50 is calculated by ranking contigs from longest to shortest, then summing their lengths until 50% of the total assembly size is reached — the length of the last contig added is the N50. A higher N50 implies a less fragmented assembly. Critically, N50 measures contiguity, NOT correctness — aggressive assemblers may produce high N50 but with misassemblies.
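Since the procedure is purely mechanical, a short Python sketch (with invented contig lengths) makes it concrete:

```python
# N50: sort contigs longest-first, accumulate until half the total is reached.
def n50(contig_lengths: list[int]) -> int:
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0

print(n50([100, 80, 60, 40, 20]))  # total 300, half 150; 100+80 >= 150 -> N50 = 80
```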
Q36 Tricky
Why is a high N50 value alone NOT sufficient to confirm assembly quality?
A. Aggressive assemblers may produce long contigs with misjoins (wrong order/orientation), inflating N50
B. N50 can only be calculated for scaffolds, not contigs
C. N50 measures correctness but not completeness
D. N50 is only valid for long-read assemblies
Explanation
N50 is a measure of contiguity, NOT correctness. Aggressive assemblers may join regions in the wrong order or orientation, producing artificially long contigs and a high N50 — but with structural errors. That's why additional metrics (BUSCO scores, assembly size, read mapping) are needed to evaluate assembly quality comprehensively.
Q37 Medium
What does BUSCO evaluate in a genome assembly?
A. Sequencing error rates
B. The length distribution of scaffolds
C. Completeness by checking for conserved single-copy orthologous genes expected for the lineage
D. GC content uniformity across the assembly
Explanation
BUSCO (Benchmarking Universal Single-Copy Orthologs) evaluates assembly completeness by checking for conserved single-copy genes expected for a given lineage. A high BUSCO score suggests a complete and biologically meaningful assembly. Duplicated or missing BUSCOs may indicate assembly errors or gene prediction artifacts.
Q38 Medium
In genome scaffolding, what are gaps between contigs typically filled with?
A. Random nucleotide sequences
B. 'N's as placeholders for unknown sequence
C. Repetitive sequences from a database
D. Reference genome sequences from a related species
Explanation
In scaffolding, unknown sequences between contigs are filled with 'N's as placeholders (often 50 Ns as standard). If long reads or other matching reads span the gap, actual sequence can fill it — this is called "gap filling." Technologies like BioNano, 10X Genomics Chromium, and Hi-C help improve scaffold contiguity.
Q39 Tricky
What is a potential risk of reference-guided genome assembly?
A. It may introduce bias if the reference has inversions or translocations, masking unique structural features
B. It requires more computational resources than de novo assembly
C. It can only use long-read sequencing data
D. It is incompatible with scaffolding techniques
Explanation
Reference-guided assembly aligns reads to a closely related reference genome. While this is more efficient and requires lower coverage, it carries the risk of bias: if the reference differs structurally (inversions, translocations), the assembly may assume the reference structure is correct, overlooking unique features in the genome of interest. De novo assembly is unbiased but more computationally demanding.
Q40 Easy
What are the two main types of masking used for repetitive regions?
A. Forward masking and reverse masking
B. Full masking and partial masking
C. Hard masking (replace with N's) and soft masking (convert to lowercase letters)
D. Static masking and dynamic masking
Explanation
Hard masking replaces repeat regions with 'N's (e.g., ACGTACGT → ACNNNNNN). Soft masking converts repeats to lowercase letters (e.g., ACGTACGT → acgtACGT). Soft masking is preferred because it preserves sequence data while signaling repeat regions, allowing flexibility in downstream analyses.
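A minimal Python illustration of both transformations, mirroring the examples above (coordinates are 0-based and end-exclusive, an assumption of this sketch):

```python
# Hard masking replaces a repeat interval with N's; soft masking lowercases it.
def hard_mask(seq: str, start: int, end: int) -> str:
    return seq[:start] + "N" * (end - start) + seq[end:]

def soft_mask(seq: str, start: int, end: int) -> str:
    return seq[:start] + seq[start:end].lower() + seq[end:]

seq = "ACGTACGT"
print(hard_mask(seq, 2, 8))  # ACNNNNNN
print(soft_mask(seq, 0, 4))  # acgtACGT
```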
Q41 Hard
What distinguishes Class I (retrotransposons) from Class II (DNA transposons)?
A. Class I uses cut-and-paste; Class II uses copy-and-paste
B. Class I are only found in prokaryotes; Class II only in eukaryotes
C. Class I are smaller and less abundant than Class II
D. Class I uses copy-and-paste via RNA intermediate; Class II uses cut-and-paste via DNA intermediate
Explanation
Class I retrotransposons (LINEs, SINEs, LTR retrotransposons) use a "copy and paste" mechanism via an RNA intermediate — the original element stays in place while a copy inserts elsewhere. Class II DNA transposons use "cut and paste" via a DNA intermediate — the element is excised and reinserted at a new location. Option A reverses them — a classic exam trap!
Q42 Medium
Which tool is most widely used for homology-based repeat annotation?
A. RepeatMasker
B. AUGUSTUS
C. BUSCO
D. IGV
Explanation
RepeatMasker is the most widely used tool for repeat annotation. It uses homology-based approaches, comparing the genome against databases like Dfam and Repbase. It integrates algorithms like NHMMER (Hidden Markov Models) to detect even divergent repeat elements. AUGUSTUS is for gene prediction, BUSCO for assembly completeness, and IGV for visualization.
Q43 Hard
Why is gene annotation in eukaryotes much harder than in prokaryotes?
A. Prokaryotes have more genes than eukaryotes
B. Eukaryotic genes are interrupted by introns, have abundant intergenic DNA (~62%), and require analysis of UTRs and regulatory elements
C. Eukaryotic genomes cannot be sequenced with current technology
D. Prokaryotic genes have more complex intron-exon structures
Explanation
Prokaryotic genomes are simpler: ORFs are long (300–350 codons), there's minimal intergenic DNA (11% in E. coli), and genes rarely overlap. Eukaryotic genomes are complex: up to 62% intergenic DNA, genes interrupted by introns, exons, UTRs, and regulatory elements. This makes ab initio gene prediction in eukaryotes much more difficult and error-prone.
Q44 Medium
What is the "combiner" approach to gene annotation?
A. Using only ab initio prediction methods
B. Using only homology-based prediction methods
C. Integrating both intrinsic (ab initio) and extrinsic (homology-based) methods for improved accuracy
D. Combining short reads and long reads during assembly
Explanation
Combiners merge ab initio (intrinsic) and extrinsic methods. They leverage statistical models from the genome sequence AND sequence similarity from external databases (RNA-Seq, protein evidence). AUGUSTUS is a key example — it can work both de novo and by incorporating external evidence. Combiners are the most popular and widely used approach.
Q45 Medium
In the GFF file format, what does the "Score" column (column 6) represent?
A. The GC content of the feature
B. The number of reads covering the feature
C. The length of the feature in base pairs
D. A confidence value for the feature prediction (higher = more confident)
Explanation
The Score column in GFF is a floating-point value representing confidence in the feature prediction — higher numbers indicate higher confidence. The GFF file has 9 columns: Seqname, Source, Feature, Start, End, Score, Strand, Frame, and Attribute.
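A quick Python sketch mapping the nine columns onto named fields (the GFF line shown is invented):

```python
# Map the nine GFF columns of one feature line onto named fields.
GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]

def parse_gff_line(line: str) -> dict:
    return dict(zip(GFF_COLUMNS, line.rstrip("\n").split("\t")))

row = parse_gff_line("chr1\tAUGUSTUS\tgene\t1300\t9000\t0.92\t+\t.\tID=gene001")
print(row["feature"], row["score"])  # gene 0.92
```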
Q46 — Open Calculation
A species has a C-value of 1.8 pg measured by flow cytometry. Calculate the estimated genome size in base pairs and in megabases. Then, if you plan to sequence at 60× coverage using 150 bp reads, how many reads do you need?
✓ Model Answer

Step 1: Genome size in base pairs

Genome size (bp) = DNA content (pg) × 0.978 × 10⁹
= 1.8 × 0.978 × 10⁹ = 1.7604 × 10⁹ bp ≈ 1,760 Mb ≈ 1.76 Gb

Step 2: Number of reads needed

Coverage = (Number of reads × Read length) / Genome size
60 = (N × 150) / 1,760,400,000
N = (60 × 1,760,400,000) / 150 = 704,160,000 reads ≈ 704 million reads
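Both steps can be checked with a few lines of Python:

```python
# Recomputing the model answer: C-value -> genome size -> reads at 60x.
genome_bp = 1.8 * 0.978e9               # 1.7604e9 bp
reads = round(60 * genome_bp / 150)     # coverage * genome size / read length
print(f"{genome_bp:.4e} bp, {reads:,} reads")
# 1.7604e+09 bp, 704,160,000 reads
```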
Q47 — Open Short Answer
Describe the 10-step genome assembly pipeline. For each step, provide a brief (one-sentence) explanation of its purpose.
✓ Model Answer

1. Gather information about the target genome — Investigate genome size, repeats, heterozygosity, ploidy, and GC content to plan the assembly strategy.

2. Extract high-quality DNA — Obtain pure, intact, high-molecular-weight DNA suitable for the chosen sequencing technology.

3. Design the best experimental workflow — Define experimental goals, select sequencing strategy (de novo vs reference-guided), and plan coverage and library types.

4. Choose sequencing technology and library preparation — Select between SGS, TGS, or hybrid approaches and prepare appropriate libraries (PE, mate-pair, PCR-free).

5. Evaluate computational resources — Ensure sufficient CPU, RAM, and storage are available for the assembly algorithm chosen.

6. Assemble the genome — Apply the chosen assembly algorithm (greedy, OLC, De Bruijn graph, or hybrid) to build contigs from sequencing reads.

7. Polish the assembly — Correct residual sequencing and assembly errors to improve base-level accuracy using tools like Pilon or Racon.

8. Check assembly quality — Evaluate using metrics such as N50 (contiguity), BUSCO (completeness), assembly size, and read mapping rates.

9. Scaffolding and gap filling — Connect contigs into scaffolds using paired reads, long reads, Hi-C, or optical mapping; fill gaps with sequence or Ns.

10. Re-evaluate assembly quality — Repeat quality control to ensure scaffolding and gap filling improved the assembly.

Q48 — Open Tricky
Explain the difference between De Bruijn graph and Overlap-Layout-Consensus (OLC) approaches for genome assembly. Include: what serves as nodes and edges in each, which sequencing data type each is best suited for, and why De Bruijn became dominant for short-read assemblies.
✓ Model Answer

OLC (Overlap-Layout-Consensus):

• Nodes = individual reads; Edges = overlaps between reads

• Three steps: (i) compute overlaps between all reads, (ii) lay out overlap information in a graph, (iii) infer consensus sequence

• Requires comparing all reads to all other reads → computationally impractical for billions of short reads

• Best suited for long reads (e.g., PacBio, Nanopore) where the number of reads is smaller

• Assembly corresponds to finding a Hamiltonian path (visiting every node once)

De Bruijn Graph (DBG):

• Reads are decomposed into k-mers; Nodes = (k−1)-mers; Edges = k-mers connecting prefix to suffix

• Does not require all-vs-all read comparison, making it much more scalable

• Assembly corresponds to finding an Eulerian path (visiting every edge once), which is computationally easier than Hamiltonian paths

• Best suited for short reads (e.g., Illumina)

Why DBG became dominant: With SGS, the number of reads increased exponentially while lengths shortened. OLC's all-vs-all comparison became impractical. DBG avoids this by working with k-mers, consuming less computational time and memory.

Q49 — Open Short Answer
Describe the three main strategies for gene prediction (structural annotation) in genome annotation. For each, explain its principle, strengths, and limitations.
✓ Model Answer

1. Intrinsic (Ab initio): Relies solely on the genomic sequence itself, using mathematical models (e.g., Hidden Markov Models) trained on known genes to identify gene-like features (ORFs, start/stop codons, splice sites). Strengths: no external data needed, can detect novel genes, high sensitivity (~100%). Limitations: species-specific training required, moderate accuracy (~60-70% for exon-intron structures), struggles with complex eukaryotic genes.

2. Extrinsic (Homology-based): Compares the genome to known gene/protein sequences in databases (NCBI, UniProt). If similarity is found, a gene is inferred. Strengths: leverages extensive existing databases, protein sequences are conserved even between distant species. Limitations: cannot detect truly novel genes absent from databases.

3. Combined (Hybrid/Combiner): Integrates both ab initio models and extrinsic evidence (RNA-Seq data, protein databases, known genes). Strengths: most accurate and widely used approach, benefits from both computational prediction and experimental evidence. Example: AUGUSTUS can work both de novo and with external data. This is the most popular strategy in modern annotation projects.

Q50 — Open Short Answer
A K-mer frequency analysis of raw sequencing reads shows a total area under the curve of 5 × 10⁹ k-mers and a main peak at coverage 10×. Estimate the genome size. The distribution also shows a prominent left peak at 1–5× frequency and a long right tail extending to 50×. Interpret these features.
✓ Model Answer

Genome size estimation:

Genome Size = Total K-mers / Average K-mer coverage = 5 × 10⁹ / 10 = 5 × 10⁸ bp = 500 Mb

Interpretation of the distribution:

Left peak (1–5× frequency): These low-frequency K-mers are likely caused by sequencing errors. Errors introduce unique, erroneous K-mers that appear only once or a few times. These should be discarded before assembly.

Main peak (~10× coverage): Represents the true genomic K-mers. The position of this peak corresponds to the average sequencing depth/coverage of the genome.

Right tail (extending to 50×): These over-represented K-mers are likely derived from repetitive regions in the genome, which are sequenced multiple times. A prominent right tail suggests the genome contains significant repetitive content, which may complicate assembly.