Before Data Analysis

Understanding the Problem First

A common mistake in applied genomics is rushing to analysis before fully understanding the problem. Many researchers want to jump straight to implementation before proper design, or analyze sequences before understanding their origin and quality.

The Requirements Phase is Critical

Never underestimate the importance of thoroughly defining requirements. Solving problems is exciting and rewarding, but spending weeks solving the wrong problem wastes all of that effort. I've learned this lesson the hard way, by delivering excellent solutions that didn't address the actual need. As the saying goes, "the operation was a success, but the patient died."

Before investing significant time, money, and effort (resources you may not be able to recoup), invest in understanding the problem:

  • Interview all stakeholders multiple times
  • Don't worry about asking "obvious" questions—assumptions cause problems
  • Create scenarios to test your understanding
  • Have others explain the problem back to you from their own perspective
  • Ask people to validate your interpretation

Many critical details go unmentioned because experts assume they're obvious. It's your responsibility to ask clarifying questions until you're confident you understand the requirements completely.


DNA Quality Requirements

Quality assessment of DNA is a critical step before next-generation sequencing (NGS). Both library preparation and sequencing success depend heavily on:

  • Sample concentration: sufficient DNA quantity for the workflow
  • DNA purity: absence of contaminants that interfere with enzymes

Understanding DNA Purity Measurements

The 260/280 absorbance ratio is the standard purity metric:

  • Nucleic acids absorb maximally at 260 nm wavelength
  • Proteins absorb maximally at 280 nm wavelength
  • The ratio between these measurements indicates sample composition

Interpreting the 260/280 ratio:

  • ~1.8 = pure DNA (target value)
  • Higher ratios (~2.0 or above) = possible RNA contamination
  • Lower ratios = protein contamination

Abnormal 260/280 ratios suggest contamination by proteins, residual extraction reagents (like phenol), or measurement errors.
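
The interpretation rules above can be sketched as a small helper. The function names and the acceptance window of 1.7 to 2.0 are assumptions chosen for illustration, not values from a specific protocol; use the thresholds your sequencing provider or kit specifies.

```python
def purity_ratio(a260: float, a280: float) -> float:
    """Return the A260/A280 ratio from two absorbance readings."""
    if a280 <= 0:
        raise ValueError("A280 must be a positive absorbance reading")
    return a260 / a280


def classify_purity(ratio: float, low: float = 1.7, high: float = 2.0) -> str:
    """Label a 260/280 ratio. `low` and `high` are illustrative cutoffs."""
    if ratio < low:
        return "low: possible protein or reagent contamination"
    if ratio > high:
        return "high: above expected range, check for RNA or re-measure"
    return "acceptable: close to the ~1.8 expected for pure DNA"


# Hypothetical readings for one sample
ratio = purity_ratio(a260=0.90, a280=0.50)
print(f"260/280 = {ratio:.2f} -> {classify_purity(ratio)}")
```

A label like this is only a first screen; a borderline sample is usually worth re-measuring before it is rejected.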


Understanding Your Sequencing Report

Every sequencing experiment generates a detailed report—always request and review it carefully!

Example: Whole Genome Sequencing (WGS)

What is WGS? Whole Genome Sequencing reads the complete DNA sequence of an organism's genome in a single experiment.

Example calculation: If you ordered 40× WGS coverage of Sus scrofa (pig) DNA:

  • S. scrofa genome size: ~2.8 billion base pairs (2.8 Gb)
  • Expected data: at least 112 Gb (calculated as 40× × 2.8 Gb)

Pro tip: Calculate these expected values before requesting a quotation so you can verify the company delivers what you paid for.
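
The calculation above is simple enough to script, which helps when comparing quotations for several species or coverage levels. This is a minimal sketch; the function names and the delivered values in the usage lines are hypothetical.

```python
def expected_yield_gb(coverage: float, genome_size_gb: float) -> float:
    """Minimum sequenced bases (in Gb) needed for the requested coverage."""
    return coverage * genome_size_gb


def meets_order(delivered_gb: float, coverage: float, genome_size_gb: float) -> bool:
    """True if the delivered data reaches the expected minimum yield."""
    return delivered_gb >= expected_yield_gb(coverage, genome_size_gb)


# 40x coverage of a ~2.8 Gb genome, as in the pig example
expected = expected_yield_gb(coverage=40, genome_size_gb=2.8)
print(f"Expected yield: {expected:.1f} Gb")  # Expected yield: 112.0 Gb

# Hypothetical delivered amounts taken from two sequencing reports
print(meets_order(delivered_gb=118.4, coverage=40, genome_size_gb=2.8))
print(meets_order(delivered_gb=97.0, coverage=40, genome_size_gb=2.8))
```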


Sequencing Depth and Coverage Explained

Depth of Coverage

Definition: The average number of times each base in the genome is sequenced.

Formula: Depth = (L × N) / G

Where:

  • L = read length (base pairs per sequence read)
  • N = total number of reads generated
  • G = haploid genome size (total base pairs)

This can be simplified to: Depth = Total sequenced base pairs / Genome size

Notation: Depth is expressed as "X×" (e.g., 5×, 10×, 30×, 100×), where X is the number of times, on average, each base was sequenced.
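
The depth formula can be checked with a few lines of code. The sketch below also inverts the formula to estimate how many reads a target depth requires. The 150 bp read length and the read count in the usage lines are illustrative assumptions, not values from a particular platform or run.

```python
import math


def depth_of_coverage(read_length_bp: int, num_reads: int, genome_size_bp: int) -> float:
    """Average depth = (L x N) / G."""
    if genome_size_bp <= 0:
        raise ValueError("Genome size must be positive")
    return read_length_bp * num_reads / genome_size_bp


def reads_needed(target_depth: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Smallest N such that (L x N) / G reaches the target depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


# Assumed inputs: 150 bp reads, 600 million reads, ~3 Gb genome
depth = depth_of_coverage(150, 600_000_000, 3_000_000_000)
print(f"Average depth: {depth:.0f}x")  # Average depth: 30x

print(reads_needed(30, 150, 3_000_000_000))  # 600000000
```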

Breadth of Coverage

Definition: The percentage of the target genome that has been sequenced at or above a given minimum depth (for example, at least 1× or at least 10×).

Example for Human Genome (~3 Gb):

Average Depth    Breadth of Coverage
<1×              Maximum 33% of genome
                 Maximum 67% of genome
1–3×             >99% of genome
3–5×             >99% of genome
7–8×             >99% of genome

Key insight: Higher depth doesn't just mean more reads per base—it ensures more complete coverage across the entire genome. Even at 1× average depth, many regions may have zero coverage due to uneven distribution of reads.
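
The difference between depth and breadth is easy to see on a toy example. The sketch below computes breadth directly from its definition, given a list of per-base depths. The function name and the ten-position depth list are made up for illustration; in practice such values would come from aligned reads, for example the per-position output of a tool such as samtools depth.

```python
from typing import Sequence


def breadth_of_coverage(per_base_depth: Sequence[int], min_depth: int = 1) -> float:
    """Percentage of positions covered by at least `min_depth` reads."""
    if not per_base_depth:
        raise ValueError("Need at least one position")
    covered = sum(1 for d in per_base_depth if d >= min_depth)
    return 100.0 * covered / len(per_base_depth)


# Toy genome of 10 positions with unevenly distributed reads
depths = [0, 0, 3, 1, 0, 2, 0, 2, 1, 1]

mean_depth = sum(depths) / len(depths)
print(f"Average depth: {mean_depth:.1f}x")                        # Average depth: 1.0x
print(f"Breadth at >=1x: {breadth_of_coverage(depths, 1):.0f}%")  # Breadth at >=1x: 60%
print(f"Breadth at >=2x: {breadth_of_coverage(depths, 2):.0f}%")  # Breadth at >=2x: 30%
```

In this toy case the average depth is 1×, yet 40% of positions have no reads at all, which is the situation the key insight describes.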