Genome Assembly – Exam Practice
VCF genotype notation:
• FORMAT GT:GQ:DP with value 0/1:99:32 — 0/1 = heterozygous (0 = reference allele, 1 = first alternate allele); the "/" indicates the genotype is unphased. GQ=99 means very high genotype confidence. DP=32 means 32 reads support this site.
• 0/1 vs 0|1 — "/" marks an unphased genotype, "|" a phased one: 0|1 asserts that the reference allele lies on the first haplotype and the alternate allele on the second.
• 2/1 — a heterozygote carrying two different alternate alleles (the second and first ALT alleles), with no reference allele.
Coverage calculation:
Step 1: Determine the genome size in base pairs.
Step 2: Calculate the number of reads needed for the target coverage.
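The two coverage-planning steps above can be sketched numerically; a minimal example, with genome size, target coverage, and read length chosen purely for illustration:

```python
def reads_needed(genome_size_bp, target_coverage, read_length_bp):
    """Coverage = (reads * read length) / genome size,
    so reads = (coverage * genome size) / read length."""
    return (target_coverage * genome_size_bp) / read_length_bp

# Assumed example: a 3 Gb genome at 30x coverage with 150 bp reads
n = reads_needed(3_000_000_000, 30, 150)
print(f"{n:.0f} reads")  # 600000000 reads
```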
1. Gather information about the target genome — Investigate genome size, repeats, heterozygosity, ploidy, and GC content to plan the assembly strategy.
2. Extract high-quality DNA — Obtain pure, intact, high-molecular-weight DNA suitable for the chosen sequencing technology.
3. Design the best experimental workflow — Define experimental goals, select sequencing strategy (de novo vs reference-guided), and plan coverage and library types.
4. Choose sequencing technology and library preparation — Select between SGS, TGS, or hybrid approaches and prepare appropriate libraries (PE, mate-pair, PCR-free).
5. Evaluate computational resources — Ensure sufficient CPU, RAM, and storage are available for the assembly algorithm chosen.
6. Assemble the genome — Apply the chosen assembly algorithm (greedy, OLC, De Bruijn graph, or hybrid) to build contigs from sequencing reads.
7. Polish the assembly — Correct residual sequencing and assembly errors to improve base-level accuracy using tools like Pilon or Racon.
8. Check assembly quality — Evaluate using metrics such as N50 (contiguity), BUSCO (completeness), assembly size, and read mapping rates.
9. Scaffolding and gap filling — Connect contigs into scaffolds using paired reads, long reads, Hi-C, or optical mapping; fill gaps with sequence or Ns.
10. Re-evaluate assembly quality — Repeat quality control to ensure scaffolding and gap filling improved the assembly.
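The N50 contiguity metric from step 8 is simple to compute: sort contigs from longest to shortest and report the length at which the running total first reaches half the assembly size. A minimal sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    """N50: the contig length at which half of the total assembly
    size is contained in contigs of that length or longer."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

# Toy assembly (lengths in bp, invented for illustration)
print(n50([100, 200, 300, 400, 500]))  # 400
```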
OLC (Overlap-Layout-Consensus):
• Nodes = individual reads; Edges = overlaps between reads
• Three steps: (i) compute overlaps between all reads, (ii) lay out overlap information in a graph, (iii) infer consensus sequence
• Requires comparing all reads to all other reads → computationally impractical for billions of short reads
• Best suited for long reads (e.g., PacBio, Nanopore) where the number of reads is smaller
• Assembly corresponds to finding a Hamiltonian path (visiting every node once)
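The overlap step above can be sketched with toy reads (sequences invented for illustration); note the all-vs-all loop, which is exactly the quadratic cost that makes OLC impractical for billions of short reads:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    for i in range(len(a) - min_len + 1):
        if b.startswith(a[i:]):
            return len(a) - i
    return 0

reads = ["ATGCGT", "GCGTAC", "GTACGA"]  # toy reads (assumed)

# Overlap step: compare every read against every other read,
# keeping suffix-prefix overlaps as graph edges (read -> read).
edges = {}
for a in reads:
    for b in reads:
        if a != b:
            olen = overlap(a, b)
            if olen > 0:
                edges[(a, b)] = olen

print(edges)  # {('ATGCGT', 'GCGTAC'): 4, ('GCGTAC', 'GTACGA'): 4}
```

The resulting graph has reads as nodes and overlaps as edges; laying out a path through every node (the Hamiltonian path of the bullet above) and taking the consensus yields the assembled sequence.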
De Bruijn Graph (DBG):
• Reads are decomposed into k-mers; Nodes = (k−1)-mers; Edges = k-mers connecting prefix to suffix
• Does not require all-vs-all read comparison, making it much more scalable
• Assembly corresponds to finding an Eulerian path (visiting every edge once), which is computationally easier than Hamiltonian paths
• Best suited for short reads (e.g., Illumina)
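Graph construction from the bullets above can be sketched directly: each k-mer contributes one edge from its (k−1)-mer prefix to its (k−1)-mer suffix, and no read-vs-read comparison is ever needed. Toy reads are assumed for illustration:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Toy reads (assumed), k = 3; overlapping reads reuse the same edges
g = de_bruijn(["ATGCG", "TGCGT"], 3)
print(dict(g))  # {'AT': ['TG'], 'TG': ['GC', 'GC'], 'GC': ['CG', 'CG'], 'CG': ['GT']}
```

An Eulerian path through this graph (using every edge once) spells out the assembled sequence.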
Why DBG became dominant: With SGS, the number of reads increased exponentially while lengths shortened. OLC's all-vs-all comparison became impractical. DBG avoids this by working with k-mers, consuming less computational time and memory.
Gene prediction strategies:
1. Intrinsic (Ab initio): Relies solely on the genomic sequence itself, using mathematical models (e.g., Hidden Markov Models) trained on known genes to identify gene-like features (ORFs, start/stop codons, splice sites). Strengths: no external data needed, can detect novel genes, high sensitivity (~100%). Limitations: species-specific training required, moderate accuracy (~60-70% for exon-intron structures), struggles with complex eukaryotic genes.
2. Extrinsic (Homology-based): Compares the genome to known gene/protein sequences in databases (NCBI, UniProt). If similarity is found, a gene is inferred. Strengths: leverages extensive existing databases, protein sequences are conserved even between distant species. Limitations: cannot detect truly novel genes absent from databases.
3. Combined (Hybrid/Combiner): Integrates both ab initio models and extrinsic evidence (RNA-Seq data, protein databases, known genes). Strengths: most accurate and widely used approach, benefits from both computational prediction and experimental evidence. Example: AUGUSTUS can work both de novo and with external data. This is the most popular strategy in modern annotation projects.
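The ab initio idea in strategy 1 can be illustrated with the simplest possible signal: scanning each reading frame for an ORF running from ATG to the first in-frame stop codon. This is only a sketch of the intuition; a real gene finder such as AUGUSTUS models many more signals (splice sites, codon bias) with HMMs.

```python
def find_orfs(seq, min_len=6):
    """Naive ab initio scan: report ORFs from ATG to the first
    in-frame stop codon, in each of the three forward frames."""
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in stops and start is not None:
                if i + 3 - start >= min_len:
                    orfs.append(seq[start:i + 3])
                start = None
    return orfs

# Toy sequence (assumed); the ORF sits in the third reading frame
print(find_orfs("CCATGAAATAGGG"))  # ['ATGAAATAG']
```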
Genome size estimation: after discarding error K-mers, genome size ≈ (total number of K-mers) ÷ (K-mer depth at the main peak).
Interpretation of the distribution:
• Left peak (1–5× frequency): These low-frequency K-mers are likely caused by sequencing errors. Errors introduce unique, erroneous K-mers that appear only once or a few times. These should be discarded before assembly.
• Main peak (~10× coverage): Represents the true genomic K-mers. The position of this peak corresponds to the average sequencing depth/coverage of the genome.
• Right tail (extending to 50×): These over-represented K-mers are likely derived from repetitive regions in the genome, which are sequenced multiple times. A prominent right tail suggests the genome contains significant repetitive content, which may complicate assembly.
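The interpretation above translates directly into an estimate: drop the error peak, take the depth of the main peak as the average K-mer coverage, and divide the total K-mer count by it. A minimal sketch with an invented histogram (depth → number of distinct K-mers observed at that depth):

```python
def estimate_genome_size(hist, error_cutoff=5):
    """Genome size ~= total K-mers / main-peak depth, after
    discarding low-frequency (likely erroneous) K-mers."""
    filtered = {d: n for d, n in hist.items() if d > error_cutoff}
    peak_depth = max(filtered, key=filtered.get)           # mode of the main peak
    total_kmers = sum(d * n for d, n in filtered.items())  # area under the curve
    return total_kmers / peak_depth

# Toy K-mer spectrum (assumed): error peak at 1-2x, main peak ~10x,
# a small repeat tail at 50x
hist = {1: 5000, 2: 1000, 9: 300, 10: 800, 11: 300, 50: 20}
print(f"{estimate_genome_size(hist):.0f}")  # 1500
```

Repeat-derived K-mers in the right tail inflate the total count, which is one reason repeat-rich genomes are harder to size and assemble accurately.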