NCBI: A Practical Guide
So you need to search for nucleotide sequences, reference sequences, or gene information? Welcome to NCBI — the American counterpart to Europe's EBI, and home to GenBank, RefSeq, and about 40 other interconnected databases.
What is NCBI?
NCBI is the National Center for Biotechnology Information, created in 1988 as part of the National Library of Medicine (NLM) at the NIH in Bethesda, Maryland.
What it gives you:
- GenBank (primary nucleotide sequences)
- RefSeq (curated reference sequences)
- Gene database (gene-centric information)
- PubMed (literature)
- dbSNP, ClinVar, OMIM (variants & clinical)
- BLAST (sequence alignment)
- And ~40 more databases, all cross-linked
Search any term (e.g., "HBB") from the NCBI homepage and it returns results across ALL databases — Literature, Genes, Proteins, Genomes, Genetics, Chemicals. Then drill down into the specific database you need.
The Three Main Sequence Databases
| Database | What it is | Key Point |
|---|---|---|
| Nucleotide | Collection from GenBank, RefSeq, TPA, PDB | Primary entry point for sequences |
| GenBank | Primary archive — anyone can submit | Raw data, may have duplicates/contradictions |
| RefSeq | Curated, non-redundant reference sequences | Clean, reviewed, NCBI-maintained |
GenBank vs RefSeq: Know the Difference
This is crucial — they serve different purposes:
| Aspect | GenBank | RefSeq |
|---|---|---|
| Curation | Not curated | Curated by NCBI |
| Who submits | Authors/labs | NCBI creates from existing data |
| Who revises | Only original author | NCBI updates continuously |
| Redundancy | Multiple records for same locus | Single record per molecule |
| Consistency | Records can contradict each other | Consistent, reviewed |
| Scope | Any species | Organisms with enough data to curate (viruses to eukaryotes) |
| Data sharing | Shared via INSDC | NCBI exclusive |
| Analogy | Primary literature | Review articles |
GenBank: When you need all available sequences, including rare species or unpublished data.
RefSeq: When you need a reliable, canonical reference sequence for analysis.
INSDC: The Global Sequence Collaboration
GenBank doesn't exist in isolation. Since 2005, three databases synchronize daily:
- DDBJ (Japan)
- NCBI/GenBank (USA)
- ENA/EBI (Europe)
Each one exchanges its records with the other two through INSDC.
Submit to one, it appears in all three. This is why you sometimes see the same sequence with different accession prefixes.
Understanding Accession Numbers
GenBank Accessions
The LOCUS line tells you a lot:
LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
- Name: SCU49845
- Length: 5028 bp
- Type: DNA
- Division: PLN
- Date: 21-JUN-1999
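The fields of a LOCUS line can be pulled apart with a few lines of Python. This is a minimal sketch, not an official parser: `Locus` and `parse_locus` are names made up for illustration, it only handles nucleotide records (length given in `bp`), and it assumes that any extra token between the molecule type and the division is the topology (linear or circular).

```python
from typing import NamedTuple, Optional


class Locus(NamedTuple):
    name: str
    length_bp: int
    molecule_type: str
    topology: Optional[str]  # e.g. "linear" or "circular" when present
    division: str
    date: str


def parse_locus(line: str) -> Locus:
    """Split a GenBank LOCUS line into its fields (whitespace-based sketch)."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"not a recognizable nucleotide LOCUS line: {line!r}")
    # Division and date are the last two tokens. Anything between the
    # molecule type and the division is treated as the optional topology.
    middle = tokens[5:-2]
    return Locus(
        name=tokens[1],
        length_bp=int(tokens[2]),
        molecule_type=tokens[4],
        topology=middle[0] if middle else None,
        division=tokens[-2],
        date=tokens[-1],
    )


record = parse_locus("LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999")
print(record.division, record.length_bp)  # PLN 5028
```

For real work, an established parser is a safer choice than splitting on whitespace, since LOCUS lines vary between record types.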
GenBank Divisions (the 3-letter code):
| Code | Division |
|---|---|
| PRI | Primate sequences |
| ROD | Rodent sequences |
| MAM | Other mammalian |
| VRT | Other vertebrate |
| INV | Invertebrate |
| PLN | Plant, fungal, algal |
| BCT | Bacterial |
| VRL | Viral |
| PHG | Bacteriophage |
| SYN | Synthetic |
Query by division: gbdiv_pln[Properties]
RefSeq Accession Prefixes
This is important — the prefix tells you exactly what type of sequence it is:
| Prefix | Type | Curation Level |
|---|---|---|
| NM_ | mRNA | Curated ✓ |
| NP_ | Protein | Curated ✓ |
| NR_ | Non-coding RNA | Curated ✓ |
| XM_ | mRNA | Predicted (computational) |
| XP_ | Protein | Predicted (computational) |
| XR_ | Non-coding RNA | Predicted (computational) |
| NG_ | Genomic region | Reference |
| NC_ | Chromosome | Complete |
| NT_ | Contig | Assembly |
| NW_ | WGS Supercontig | Assembly |
NM_, NR_, NP_ = Curated, experimentally supported
XM_, XR_, XP_ = Predicted by algorithms, not yet reviewed
For reliable analyses, prefer the NM_/NR_/NP_ record over its XM_/XR_/XP_ counterpart when one is available!
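If you are sorting through a list of accessions, the prefix table above translates directly into a lookup. The sketch below only knows the ten prefixes listed in this guide; `REFSEQ_PREFIXES`, `classify_refseq`, and `is_predicted` are illustrative names, and the `XM_` accession in the example is a made-up placeholder.

```python
from typing import Optional, Tuple

# (sequence type, curation level), copied from the prefix table above.
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", "curated"),
    "NP_": ("protein", "curated"),
    "NR_": ("non-coding RNA", "curated"),
    "XM_": ("mRNA", "predicted"),
    "XP_": ("protein", "predicted"),
    "XR_": ("non-coding RNA", "predicted"),
    "NG_": ("genomic region", "reference"),
    "NC_": ("chromosome", "complete"),
    "NT_": ("contig", "assembly"),
    "NW_": ("WGS supercontig", "assembly"),
}


def classify_refseq(accession: str) -> Optional[Tuple[str, str]]:
    """Return (type, curation level), or None if the prefix is not in the table."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


def is_predicted(accession: str) -> bool:
    """True for XM_/XP_/XR_ records, i.e. computational predictions."""
    info = classify_refseq(accession)
    return info is not None and info[1] == "predicted"


print(classify_refseq("NM_001234"))    # ('mRNA', 'curated')
print(is_predicted("XM_000001.1"))     # True (placeholder accession)
print(classify_refseq("U49845"))       # None: no RefSeq prefix
```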
RefSeq Status Codes
| Status | Meaning | Reliability |
|---|---|---|
| REVIEWED | Reviewed by NCBI staff, literature-backed | ⭐⭐⭐ Highest |
| VALIDATED | Initial review done, preferred sequence | ⭐⭐ High |
| PROVISIONAL | Not yet reviewed, gene association established | ⭐ Medium |
| PREDICTED | Computational prediction, some aspects predicted | ⭐ Medium |
| INFERRED | Predicted, partially supported by homology | Low |
| MODEL | Automatic pipeline, no individual review | Lowest |
NCBI Search Syntax
This is where it gets powerful. NCBI uses field tags in square brackets.
Basic Syntax
search_term[Field Tag]
Boolean operators must be UPPERCASE: AND, OR, NOT
Common Field Tags
| Field Tag | What it searches | Example |
|---|---|---|
| [Title] | Definition line | "glyceraldehyde 3 phosphate dehydrogenase"[Title] |
| [Organism] | NCBI taxonomy | mouse[Organism], "Homo sapiens"[Organism] |
| [Properties] | Molecule type, source, etc. | "biomol mrna"[Properties] |
| [Filter] | Subsets of data | "nucleotide omim"[Filter] |
| [Gene Name] | Gene symbol | BRCA1[Gene Name] |
| [EC/RN Number] | Enzyme Commission number | 2.1.1.1[EC/RN Number] |
| [Accession] | Accession number | NM_001234[Accession] |
Useful Properties Field Terms
Molecule Type
biomol_mrna[Properties]
biomol_genomic[Properties]
biomol_rrna[Properties]
GenBank Division
gbdiv_pri[Properties] (primates)
gbdiv_rod[Properties] (rodents)
gbdiv_est[Properties] (ESTs)
gbdiv_htg[Properties] (high throughput genomic)
Gene Location
gene_in_mitochondrion[Properties]
gene_in_chloroplast[Properties]
gene_in_genomic[Properties]
Source Database
srcdb_refseq[Properties] (any RefSeq)
srcdb_refseq_reviewed[Properties] (reviewed RefSeq only)
srcdb_refseq_validated[Properties] (validated RefSeq only)
srcdb_pdb[Properties]
srcdb_swiss_prot[Properties]
Gene Database Search
The Gene database is the best starting point for gene-specific searches. It integrates information from multiple sources: nomenclature, RefSeqs, maps, pathways, variations, phenotypes.
Gene-Specific Field Tags
| Find genes by... | Search syntax |
|---|---|
| Free text | human muscular dystrophy |
| Gene symbol | BRCA1[sym] |
| Organism | human[Organism] |
| Chromosome | Y[CHR] AND human[ORGN] |
| Gene Ontology term | "cell adhesion"[GO] (by term name), 10030[GO] (by GO ID) |
| EC number | 1.9.3.1[EC] |
| PubMed ID | 11331580[PMID] |
| Accession | M11313[accn] |
Gene Properties
"genetype protein coding"[Properties]
"genetype pseudo"[Properties]
"has transcript variants"[Properties]
"srcdb refseq reviewed"[Properties]
"feattype regulatory"[Properties]
Gene Filters
"gene clinvar"[Filter] (has ClinVar entries)
"gene omim"[Filter] (has OMIM entries)
"gene structure"[Filter] (has 3D structure)
"gene type noncoding"[Filter]
"gene type pseudo"[Filter]
"src genomic"[Filter]
"src organelle"[Filter]
Building Complex Queries
Query Structure
term1[Field] AND term2[Field] AND (term3[Field] OR term4[Field])
AND, OR, NOT — lowercase won't work!
Example Query Walkthrough
Goal: Find all reviewed/validated RefSeq mRNA entries for mouse enzymes with EC 2.1.1.1 or 2.1.1.10
Breaking it down:
| Requirement | Query Component |
|---|---|
| mRNA sequences | "biomol mrna"[Properties] |
| EC 2.1.1.1 OR 2.1.1.10 | (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number]) |
| Mouse | "mus musculus"[Organism] |
| Reviewed OR validated RefSeq | ("srcdb refseq reviewed"[Properties] OR "srcdb refseq validated"[Properties]) |
Final query:
"biomol mrna"[Properties] AND (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number]) AND "mus musculus"[Organism] AND ("srcdb refseq reviewed"[Properties] OR "srcdb refseq validated"[Properties])
Result: 9 entries (the count shifts as the databases are updated)
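When queries get this long, assembling them from parts is less error-prone than typing brackets and quotes by hand. Here is a small sketch that rebuilds the final query of the walkthrough. The helpers `tag`, `any_of`, and `all_of` are hypothetical names, not part of any NCBI tool.

```python
def tag(term: str, field: str) -> str:
    """Attach a field tag, quoting the term if it contains a space."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def any_of(*parts: str) -> str:
    """Join alternatives with OR and wrap them in parentheses."""
    return "(" + " OR ".join(parts) + ")"


def all_of(*parts: str) -> str:
    """Join requirements with AND."""
    return " AND ".join(parts)


query = all_of(
    tag("biomol mrna", "Properties"),
    any_of(tag("2.1.1.1", "EC/RN Number"), tag("2.1.1.10", "EC/RN Number")),
    tag("mus musculus", "Organism"),
    any_of(
        tag("srcdb refseq reviewed", "Properties"),
        tag("srcdb refseq validated", "Properties"),
    ),
)
print(query)
```

The printed string matches the final query above character for character, so it can be pasted straight into the Nucleotide search box.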
Practice Exercises
Exercise 1: Nucleotide Database Query
Q: In NCBI "Nucleotide", find all entries containing:
- mRNA sequences
- coding for enzymes with EC Numbers 2.1.1.1 or 2.1.1.10
- from Mus musculus
- which have been reviewed or validated in RefSeq
How many entries?
Answer: 9 entries (range: 1-10)
Query:
"biomol mrna"[Properties] AND (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number]) AND "mus musculus"[Organism] AND ("srcdb refseq reviewed"[Properties] OR "srcdb refseq validated"[Properties])
How to build it:
| Requirement | Field Tag |
|---|---|
| mRNA | "biomol mrna"[Properties] |
| EC numbers (OR) | (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number]) |
| Mouse | "mus musculus"[Organism] |
| RefSeq quality | ("srcdb refseq reviewed"[Properties] OR "srcdb refseq validated"[Properties]) |
Exercise 2: Gene Database Query
Q: In the "Gene" database, look for all genes:
- coding for proteins (protein-coding genes)
- associated with the GO term "ATP synthase"
- whose source is mitochondrial or genomic
- annotated in ClinVar OR OMIM
How many entries?
Answer: 32 entries (range: 31-40)
Query:
"genetype protein coding"[Properties] AND "atp synthase"[Gene Ontology] AND ("source mitochondrion"[Properties] OR "source genomic"[Properties]) AND ("gene clinvar"[Filter] OR "gene omim"[Filter])
How to build it:
| Requirement | Field Tag |
|---|---|
| Protein-coding | "genetype protein coding"[Properties] |
| GO term | "atp synthase"[Gene Ontology] |
| Source (OR) | ("source mitochondrion"[Properties] OR "source genomic"[Properties]) |
| Clinical (OR) | ("gene clinvar"[Filter] OR "gene omim"[Filter]) |
Common Query Patterns
Pattern 1: Species + Molecule Type + Quality
"homo sapiens"[Organism] AND "biomol mrna"[Properties] AND "srcdb refseq reviewed"[Properties]
Pattern 2: Gene Function + Clinical Relevance
"kinase"[Gene Ontology] AND "gene clinvar"[Filter] AND human[Organism]
Pattern 3: Chromosome Region + Gene Type
7[CHR] AND human[ORGN] AND "genetype protein coding"[Properties]
Pattern 4: Multiple EC Numbers
(1.1.1.1[EC/RN Number] OR 1.1.1.2[EC/RN Number] OR 1.1.1.3[EC/RN Number])
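Any of these patterns can also be run outside the browser through NCBI's E-utilities. The sketch below only builds the request URL for the ESearch endpoint and does not send it. It assumes the Nucleotide database is addressed as `nuccore` and uses the standard `db`, `term`, `retmax`, and `retmode` parameters; `esearch_url` is an illustrative helper name.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL; urlencode escapes brackets, slashes and spaces."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


# Pattern 4: OR together any number of EC numbers.
ec_numbers = ["1.1.1.1", "1.1.1.2", "1.1.1.3"]
term = "(" + " OR ".join(f"{ec}[EC/RN Number]" for ec in ec_numbers) + ")"

url = esearch_url("nuccore", term)
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should return the hit count and matching record IDs. If you script many requests, check NCBI's usage guidelines on request rates first.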
⚠️ Common NCBI Search Mistakes
Mistake #1: Lowercase Boolean Operators
❌ "biomol mrna"[Properties] and mouse[Organism]
✓ "biomol mrna"[Properties] AND mouse[Organism]
The fix: Always use UPPERCASE AND, OR, NOT
Mistake #2: Missing Quotes Around Multi-Word Terms
❌ mus musculus[Organism]
✓ "mus musculus"[Organism]
❌ biomol mrna[Properties]
✓ "biomol mrna"[Properties]
The fix: Use quotes around phrases with spaces
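Mistakes #1 and #2 are easy to catch automatically before you submit a query. The checker below is a rough heuristic, not a real Entrez parser: `lint_query` is a made-up helper that hides quoted phrases, then looks for lowercase `and`/`or`/`not` and for tagged terms that still contain a space.

```python
import re
from typing import List

_QUOTED = re.compile(r'"[^"]*"')
_LOWER_OP = re.compile(r"\b(and|or|not)\b")
# A run of text without brackets/parentheses, followed by a [Field Tag].
_TAGGED = re.compile(r"([^\[\]()]+)\[[^\]]+\]")
_LEADING_OPS = re.compile(r"^\s*(?:(?:AND|OR|NOT)\s+)+", re.IGNORECASE)


def lint_query(query: str) -> List[str]:
    """Return warnings for lowercase operators and unquoted multi-word terms."""
    warnings = []
    bare = _QUOTED.sub("Q", query)  # quoted phrases are fine, so hide them
    for op in _LOWER_OP.findall(bare):
        warnings.append(f"lowercase operator {op!r}: use {op.upper()}")
    for match in _TAGGED.finditer(bare):
        # Drop an operator left over from the previous clause, e.g. " AND mouse".
        term = _LEADING_OPS.sub("", match.group(1)).strip()
        if " " in term:
            warnings.append(f"unquoted multi-word term {term!r}")
    return warnings


for warning in lint_query("biomol mrna[Properties] and mouse[Organism]"):
    print(warning)
print(lint_query('"biomol mrna"[Properties] AND "mus musculus"[Organism]'))  # []
```

The first call reports both problems (the lowercase `and` and the unquoted `biomol mrna`); the corrected query comes back clean.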
Mistake #3: Wrong Database for Your Query
| You want... | Use this database |
|---|---|
| Gene information, GO terms, pathways | Gene |
| Nucleotide sequences | Nucleotide |
| Protein sequences | Protein |
| Variants | dbSNP, ClinVar |
| Literature | PubMed |
Mistake #4: Confusing Properties vs Filters
| Type | Purpose | Example |
|---|---|---|
| Properties | Content-based attributes | "biomol mrna"[Properties] |
| Filters | Relationships to other databases | "gene clinvar"[Filter] |
Rule of thumb:
- Properties = what the sequence IS
- Filters = what the sequence is LINKED to
Mistake #5: Using GenBank When You Need RefSeq
If you need a reliable reference sequence for analysis, don't just search Nucleotide — filter for RefSeq:
"srcdb refseq"[Properties]
Or, for the highest-quality records only:
"srcdb refseq reviewed"[Properties]
Quick Reference: Field Tags Cheatsheet
Nucleotide Database
| Purpose | Query |
|---|---|
| mRNA only | "biomol mrna"[Properties] |
| Genomic DNA | "biomol genomic"[Properties] |
| RefSeq only | "srcdb refseq"[Properties] |
| RefSeq reviewed | "srcdb refseq reviewed"[Properties] |
| Specific organism | "Homo sapiens"[Organism] |
| EC number | 1.1.1.1[EC/RN Number] |
| GenBank division | gbdiv_pri[Properties] |
Gene Database
| Purpose | Query |
|---|---|
| Protein-coding genes | "genetype protein coding"[Properties] |
| Pseudogenes | "genetype pseudo"[Properties] |
| GO term | "term"[Gene Ontology] |
| Has ClinVar | "gene clinvar"[Filter] |
| Has OMIM | "gene omim"[Filter] |
| Has structure | "gene structure"[Filter] |
| Chromosome | 7[CHR] |
| Gene symbol | BRCA1[sym] |
Cytogenetic Location Quick Reference
Gene records give map positions in cytogenetic notation. Reading 7q31.2 from left to right:
- 7 = chromosome
- q = arm
- 3 = region
- 1 = band
- 2 = sub-band
p = short arm (petit)
q = long arm
Example: CFTR gene is at 7q31.2 = Chromosome 7, long arm, region 3, band 1, sub-band 2
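The same breakdown can be done with a regular expression. This is a sketch for simple single-band locations like the one above; it does not handle ranges or whole-arm notations. `CytoLocation` and `parse_cyto` are illustrative names, and the sub-band is kept as a string because it can have more than one digit.

```python
import re
from typing import NamedTuple, Optional


class CytoLocation(NamedTuple):
    chromosome: str
    arm: str  # "short" (p) or "long" (q)
    region: int
    band: int
    sub_band: Optional[str]


# chromosome, arm, one-digit region, one-digit band, optional ".sub-band"
_CYTO = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d)(\d)(?:\.(\d+))?$")


def parse_cyto(location: str) -> CytoLocation:
    """Parse a location such as '7q31.2' into its named parts."""
    match = _CYTO.match(location.strip())
    if match is None:
        raise ValueError(f"unsupported cytogenetic location: {location!r}")
    chromosome, arm, region, band, sub_band = match.groups()
    return CytoLocation(
        chromosome=chromosome,
        arm="short" if arm == "p" else "long",
        region=int(region),
        band=int(band),
        sub_band=sub_band,
    )


cftr = parse_cyto("7q31.2")
print(cftr.chromosome, cftr.arm, cftr.region, cftr.band, cftr.sub_band)
# 7 long 3 1 2
```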
TL;DR
- NCBI = US hub for biological databases (GenBank, RefSeq, Gene, PubMed, etc.)
- GenBank = primary archive (raw submissions) vs RefSeq = curated reference (cleaned up)
- RefSeq prefixes: NM/NP = curated, XM/XP = predicted; prefer NM/NP for reliable analysis
- Boolean operators MUST be UPPERCASE: AND, OR, NOT
- Use quotes around multi-word terms: "homo sapiens"[Organism]
- Gene database = best starting point for gene-centric searches
- Properties = what it IS, Filters = what it's LINKED to
Now go query some databases! 🧬