NCBI: A Practical Guide

So you need to search for nucleotide sequences, reference sequences, or gene information? Welcome to NCBI — the American counterpart to Europe's EBI, and home to GenBank, RefSeq, and about 40 other interconnected databases.

What is NCBI?

NCBI stands for National Center for Biotechnology Information. It was created in 1988 as part of the National Library of Medicine (NLM) at the NIH in Bethesda, Maryland.

What it gives you:

  • GenBank (primary nucleotide sequences)
  • RefSeq (curated reference sequences)
  • Gene database (gene-centric information)
  • PubMed (literature)
  • dbSNP, ClinVar, OMIM (variants & clinical)
  • BLAST (sequence alignment)
  • And ~40 more databases, all cross-linked

ℹ️ Global Search

Search any term (e.g., "HBB") from the NCBI homepage and it returns results across ALL databases — Literature, Genes, Proteins, Genomes, Genetics, Chemicals. Then drill down into the specific database you need.


The Three Main Sequence Databases

Database   | What it is                                 | Key point
Nucleotide | Collection from GenBank, RefSeq, TPA, PDB  | Primary entry point for sequences
GenBank    | Primary archive; anyone can submit         | Raw data, may have duplicates/contradictions
RefSeq     | Curated, non-redundant reference sequences | Clean, reviewed, NCBI-maintained

GenBank vs RefSeq: Know the Difference

This is crucial — they serve different purposes:

Aspect       | GenBank                           | RefSeq
Curation     | Not curated                       | Curated by NCBI
Who submits  | Authors/labs                      | NCBI creates from existing data
Who revises  | Only original author              | NCBI updates continuously
Redundancy   | Multiple records for same locus   | Single record per molecule
Consistency  | Records can contradict each other | Consistent, reviewed
Scope        | Any species                       | Model organisms mainly
Data sharing | Shared via INSDC                  | NCBI exclusive
Analogy      | Primary literature                | Review articles

💡 When to Use Which?

GenBank: When you need all available sequences, including rare species or unpublished data.
RefSeq: When you need a reliable, canonical reference sequence for analysis.


INSDC: The Global Sequence Collaboration

GenBank doesn't exist in isolation. Since 2005, three databases synchronize daily:

        DDBJ (Japan)
           ↓
    ← → INSDC ← →
           ↓
NCBI/GenBank    ENA/EBI (Europe)
    (USA)

Submit to one and the record appears in all three under the same accession number. The accession prefix reflects which partner database issued it, which is why you see a mix of prefixes in search results.


Understanding Accession Numbers

GenBank Accessions

The LOCUS line tells you a lot:

LOCUS       SCU49845    5028 bp    DNA    PLN    21-JUN-1999
            ↑           ↑          ↑      ↑      ↑
         Name        Length     Type  Division  Date
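
If you want to pull those five fields out programmatically, the sketch below splits the example line on whitespace. It assumes the simplified layout shown above; real GenBank records often carry extra tokens (for instance topology such as "linear"), which this version deliberately rejects rather than guesses at. The `Locus` type and `parse_locus` function are illustrative names, not part of any NCBI tool.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length_bp: int
    molecule: str
    division: str
    date: str


def parse_locus(line: str) -> Locus:
    """Split a simplified LOCUS line into the five fields labelled above."""
    tokens = line.split()
    # Expected tokens: LOCUS, name, length, "bp", molecule type, division, date
    if len(tokens) != 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    _, name, length, _, molecule, division, date = tokens
    return Locus(name, int(length), molecule, division, date)


record = parse_locus("LOCUS       SCU49845    5028 bp    DNA    PLN    21-JUN-1999")
print(record.name, record.length_bp, record.division)
# SCU49845 5028 PLN
```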

GenBank Divisions (the 3-letter code):

Code | Division
PRI  | Primate sequences
ROD  | Rodent sequences
MAM  | Other mammalian
VRT  | Other vertebrate
INV  | Invertebrate
PLN  | Plant, fungal, algal
BCT  | Bacterial
VRL  | Viral
PHG  | Bacteriophage
SYN  | Synthetic

Query by division: gbdiv_pln[Properties]


RefSeq Accession Prefixes

This is important — the prefix tells you exactly what type of sequence it is:

Prefix | Type            | Curation level
NM_    | mRNA            | Curated ✓
NP_    | Protein         | Curated ✓
NR_    | Non-coding RNA  | Curated ✓
XM_    | mRNA            | Predicted (computational)
XP_    | Protein         | Predicted (computational)
XR_    | Non-coding RNA  | Predicted (computational)
NG_    | Genomic region  | Reference
NC_    | Chromosome      | Complete
NT_    | Contig          | Assembly
NW_    | WGS supercontig | Assembly

⚠️ N vs X Prefix

NM_, NP_ = Curated, experimentally supported
XM_, XP_ = Predicted by algorithms, not yet reviewed

For reliable analyses, prefer N* prefixes when available!
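
When you have a long list of accessions, the prefix table can be turned into a lookup. This is a minimal sketch that covers only the ten prefixes listed above; `classify_refseq` and `is_predicted` are hypothetical helper names, and `XM_000001` is a placeholder accession used purely for illustration.

```python
# Prefix -> (molecule type, curation level), copied from the table above.
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", "Curated"),
    "NP_": ("Protein", "Curated"),
    "NR_": ("Non-coding RNA", "Curated"),
    "XM_": ("mRNA", "Predicted"),
    "XP_": ("Protein", "Predicted"),
    "XR_": ("Non-coding RNA", "Predicted"),
    "NG_": ("Genomic region", "Reference"),
    "NC_": ("Chromosome", "Complete"),
    "NT_": ("Contig", "Assembly"),
    "NW_": ("WGS supercontig", "Assembly"),
}


def classify_refseq(accession: str) -> tuple[str, str]:
    """Return (type, curation level) based on the first three characters."""
    prefix = accession[:3].upper()
    if prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"not a RefSeq prefix from the table: {accession!r}")
    return REFSEQ_PREFIXES[prefix]


def is_predicted(accession: str) -> bool:
    """True for the X* prefixes (computational predictions)."""
    return classify_refseq(accession)[1] == "Predicted"


print(classify_refseq("NM_001234"))   # ('mRNA', 'Curated')
print(is_predicted("XM_000001"))      # True (placeholder accession)
```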


RefSeq Status Codes

Status      | Meaning                                          | Reliability
REVIEWED    | Reviewed by NCBI staff, literature-backed        | ⭐⭐⭐ Highest
VALIDATED   | Initial review done, preferred sequence          | ⭐⭐ High
PROVISIONAL | Not yet reviewed, gene association established   | ⭐ Medium
PREDICTED   | Computational prediction, some aspects predicted | ⭐ Medium
INFERRED    | Predicted, partially supported by homology       | Low
MODEL       | Automatic pipeline, no individual review         | Lowest
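
If you end up with several candidate records for the same gene, one way to apply this table is to sort them by status. The sketch below assumes the row order of the table is the ranking (so PROVISIONAL sorts ahead of PREDICTED even though both are "Medium"), and the accessions in the example are placeholders, not real records.

```python
# Most reliable first, following the row order of the status table.
STATUS_ORDER = ["REVIEWED", "VALIDATED", "PROVISIONAL", "PREDICTED", "INFERRED", "MODEL"]
_RANK = {status: position for position, status in enumerate(STATUS_ORDER)}


def rank_by_status(records):
    """Sort (accession, status) pairs so the most reliable status comes first."""
    return sorted(records, key=lambda record: _RANK[record[1].upper()])


candidates = [
    ("XM_000002", "MODEL"),        # placeholder accessions for illustration
    ("NM_000003", "PROVISIONAL"),
    ("NM_000001", "REVIEWED"),
]
for accession, status in rank_by_status(candidates):
    print(accession, status)
```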

NCBI Search Syntax

This is where it gets powerful. NCBI uses field tags in square brackets.

Basic Syntax

search_term[Field Tag]

Boolean operators must be UPPERCASE: AND, OR, NOT

Common Field Tags

Field Tag      | What it searches            | Example
[Title]        | Definition line             | "glyceraldehyde 3 phosphate dehydrogenase"[Title]
[Organism]     | NCBI taxonomy               | mouse[Organism], "Homo sapiens"[Organism]
[Properties]   | Molecule type, source, etc. | "biomol mrna"[Properties]
[Filter]       | Subsets of data             | "nucleotide omim"[Filter]
[Gene Name]    | Gene symbol                 | BRCA1[Gene Name]
[EC/RN Number] | Enzyme Commission number    | 2.1.1.1[EC/RN Number]
[Accession]    | Accession number            | NM_001234[Accession]
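
If you assemble queries in a script, a tiny helper keeps the `term[Field Tag]` shape consistent. This sketch follows the quoting convention used in this guide (multi-word terms go in double quotes); `tagged` is a hypothetical name, not an NCBI function.

```python
def tagged(term: str, field: str) -> str:
    """Return term[field], wrapping the term in quotes if it contains whitespace."""
    term = term.strip()
    already_quoted = len(term) >= 2 and term.startswith('"') and term.endswith('"')
    if any(ch.isspace() for ch in term) and not already_quoted:
        term = f'"{term}"'
    return f"{term}[{field}]"


print(tagged("BRCA1", "Gene Name"))         # BRCA1[Gene Name]
print(tagged("Homo sapiens", "Organism"))   # "Homo sapiens"[Organism]
print(tagged("biomol mrna", "Properties"))  # "biomol mrna"[Properties]
```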

Useful Properties Field Terms

Molecule Type

biomol_mrna[Properties]
biomol_genomic[Properties]
biomol_rrna[Properties]

GenBank Division

gbdiv_pri[Properties]    (primates)
gbdiv_rod[Properties]    (rodents)
gbdiv_est[Properties]    (ESTs)
gbdiv_htg[Properties]    (high throughput genomic)

Gene Location

gene_in_mitochondrion[Properties]
gene_in_chloroplast[Properties]
gene_in_genomic[Properties]

Source Database

srcdb_refseq[Properties]           (any RefSeq)
srcdb_refseq_reviewed[Properties]  (reviewed RefSeq only)
srcdb_refseq_validated[Properties] (validated RefSeq only)
srcdb_pdb[Properties]
srcdb_swiss_prot[Properties]

The Gene Database

The Gene database is the best starting point for gene-specific searches. It integrates information from multiple sources: nomenclature, RefSeqs, maps, pathways, variations, phenotypes.

Gene-Specific Field Tags

Find genes by...   | Search syntax
Free text          | human muscular dystrophy
Gene symbol        | BRCA1[sym]
Organism           | human[Organism]
Chromosome         | Y[CHR] AND human[ORGN]
Gene Ontology term | "cell adhesion"[GO] or 10030[GO]
EC number          | 1.9.3.1[EC]
PubMed ID          | 11331580[PMID]
Accession          | M11313[accn]

Gene Properties

"genetype protein coding"[Properties]
"genetype pseudo"[Properties]
"has transcript variants"[Properties]
"srcdb refseq reviewed"[Properties]
"feattype regulatory"[Properties]

Gene Filters

"gene clinvar"[Filter]      (has ClinVar entries)
"gene omim"[Filter]         (has OMIM entries)
"gene structure"[Filter]    (has 3D structure)
"gene type noncoding"[Filter]
"gene type pseudo"[Filter]
"src genomic"[Filter]
"src organelle"[Filter]

Building Complex Queries

Query Structure

term1[Field] AND term2[Field] AND (term3[Field] OR term4[Field])

⚠️ Boolean Operators Must Be UPPERCASE

AND, OR, NOT — lowercase won't work!
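
The same structure can be composed in code, which makes the parentheses and the uppercase operators hard to get wrong. A minimal sketch; `all_of` and `any_of` are hypothetical helper names.

```python
def all_of(*clauses: str) -> str:
    """Join clauses with AND (every clause must match)."""
    return " AND ".join(clauses)


def any_of(*clauses: str) -> str:
    """Join clauses with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(clauses) + ")"


query = all_of("term1[Field]", "term2[Field]", any_of("term3[Field]", "term4[Field]"))
print(query)
# term1[Field] AND term2[Field] AND (term3[Field] OR term4[Field])
```

Wrapping only the OR group in parentheses mirrors the query structure shown above and keeps the grouping explicit when the pieces are nested.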

Example Query Walkthrough

Goal: Find all reviewed/validated RefSeq mRNA entries for mouse enzymes with EC 2.1.1.1 or 2.1.1.10

Breaking it down:

Requirement                  | Query component
mRNA sequences               | "biomol mrna"[Properties]
EC 2.1.1.1 OR 2.1.1.10       | (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number])
Mouse                        | "mus musculus"[Organism]
Reviewed OR validated RefSeq | ("srcdb refseq reviewed"[Properties] OR "srcdb refseq validated"[Properties])

Final query:

"biomol mrna"[Properties] AND (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number]) AND "mus musculus"[Organism] AND ("srcdb refseq reviewed"[Properties] OR "srcdb refseq validated"[Properties])

Result: 9 entries (counts shift as the databases are updated)
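
To re-check that count from a script rather than the web page, the final query can be sent to NCBI's E-utilities `esearch` endpoint. The sketch below only builds and prints the URL; `count_hits` shows how the JSON response could be read but is not called here, since it needs network access. The function names are illustrative, and using `nuccore` as the database name for Nucleotide is an assumption you may want to confirm. If you do run it, keep request rates low, as NCBI throttles heavy use.

```python
import json
import urllib.request
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

QUERY = (
    '"biomol mrna"[Properties] AND (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number]) '
    'AND "mus musculus"[Organism] AND ("srcdb refseq reviewed"[Properties] '
    'OR "srcdb refseq validated"[Properties])'
)


def esearch_url(db: str, term: str) -> str:
    """Build an esearch URL that asks only for the hit count (retmax=0)."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": 0}
    return f"{ESEARCH}?{urlencode(params)}"


def count_hits(db: str, term: str) -> int:
    """Fetch the number of matching records (performs a network request)."""
    with urllib.request.urlopen(esearch_url(db, term), timeout=30) as response:
        return int(json.load(response)["esearchresult"]["count"])


print(esearch_url("nuccore", QUERY))
```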


Practice Exercises

Exercise 1: Nucleotide Database Query

Q: In NCBI "Nucleotide", find all entries containing:

  • mRNA sequences
  • coding for enzymes with EC Numbers 2.1.1.1 or 2.1.1.10
  • from Mus musculus
  • which have been reviewed or validated in RefSeq

How many entries?

Answer: 9 entries (range: 1-10)

Query:

"biomol mrna"[Properties] AND (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number]) AND "mus musculus"[Organism] AND ("srcdb refseq reviewed"[Properties] OR "srcdb refseq validated"[Properties])

How to build it:

Requirement     | Field Tag
mRNA            | "biomol mrna"[Properties]
EC numbers (OR) | (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number])
Mouse           | "mus musculus"[Organism]
RefSeq quality  | ("srcdb refseq reviewed"[Properties] OR "srcdb refseq validated"[Properties])

Exercise 2: Gene Database Query

Q: In the «Gene» database, look for all genes:

  • coding for proteins (protein-coding genes)
  • associated with the GO term "ATP synthase"
  • whose source is mitochondrial or genomic
  • annotated in ClinVar OR OMIM

How many entries?

Answer: 32 entries (range: 31-40)

Query:

"genetype protein coding"[Properties] AND "atp synthase"[Gene Ontology] AND ("source mitochondrion"[Properties] OR "source genomic"[Properties]) AND ("gene clinvar"[Filter] OR "gene omim"[Filter])

How to build it:

Requirement    | Field Tag
Protein-coding | "genetype protein coding"[Properties]
GO term        | "atp synthase"[Gene Ontology]
Source (OR)    | ("source mitochondrion"[Properties] OR "source genomic"[Properties])
Clinical (OR)  | ("gene clinvar"[Filter] OR "gene omim"[Filter])

Common Query Patterns

Pattern 1: Species + Molecule Type + Quality

"homo sapiens"[Organism] AND "biomol mrna"[Properties] AND "srcdb refseq reviewed"[Properties]

Pattern 2: Gene Function + Clinical Relevance

"kinase"[Gene Ontology] AND "gene clinvar"[Filter] AND human[Organism]

Pattern 3: Chromosome Region + Gene Type

7[CHR] AND human[ORGN] AND "genetype protein coding"[Properties]

Pattern 4: Multiple EC Numbers

(1.1.1.1[EC/RN Number] OR 1.1.1.2[EC/RN Number] OR 1.1.1.3[EC/RN Number])
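
When the list of EC numbers is long, typing the OR group by hand gets error-prone. Here is a small sketch that generates it from a list; `ec_or_group` is a hypothetical helper, and the format check only accepts complete four-part numeric EC numbers (no wildcards or partial classes).

```python
import re

# Four dot-separated integers, e.g. 1.1.1.1
_EC_PATTERN = re.compile(r"^\d+(\.\d+){3}$")


def ec_or_group(ec_numbers):
    """Build a parenthesised OR group of [EC/RN Number] terms."""
    if not ec_numbers:
        raise ValueError("need at least one EC number")
    terms = []
    for ec in ec_numbers:
        if not _EC_PATTERN.match(ec):
            raise ValueError(f"not a complete EC number: {ec!r}")
        terms.append(f"{ec}[EC/RN Number]")
    return "(" + " OR ".join(terms) + ")"


print(ec_or_group(["1.1.1.1", "1.1.1.2", "1.1.1.3"]))
# (1.1.1.1[EC/RN Number] OR 1.1.1.2[EC/RN Number] OR 1.1.1.3[EC/RN Number])
```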

⚠️ Common NCBI Search Mistakes

Mistake #1: Lowercase Boolean Operators

❌ "biomol mrna"[Properties] and mouse[Organism]
✓ "biomol mrna"[Properties] AND mouse[Organism]

The fix: Always use UPPERCASE AND, OR, NOT
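
A quick pre-flight check can catch this before you paste a query into the search box. The sketch below is a heuristic, not a parser: it ignores anything inside quotes or square brackets and flags any remaining and/or/not that is not fully uppercase. `non_uppercase_operators` is a hypothetical name, and a free-text search that legitimately contains the word "and" unquoted would be flagged too.

```python
import re

_OPERATOR = re.compile(r"\b(and|or|not)\b", re.IGNORECASE)
_QUOTED_OR_TAG = re.compile(r'"[^"]*"|\[[^\]]*\]')


def non_uppercase_operators(query: str) -> list[str]:
    """Return and/or/not words outside quotes and [tags] that are not uppercase."""
    # Blank out quoted phrases and field tags so words inside them are ignored.
    visible = _QUOTED_OR_TAG.sub(" ", query)
    return [word for word in _OPERATOR.findall(visible) if word != word.upper()]


print(non_uppercase_operators('"biomol mrna"[Properties] and mouse[Organism]'))
# ['and']
print(non_uppercase_operators('"biomol mrna"[Properties] AND mouse[Organism]'))
# []
```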


Mistake #2: Missing Quotes Around Multi-Word Terms

❌ mus musculus[Organism]
✓ "mus musculus"[Organism]

❌ biomol mrna[Properties]
✓ "biomol mrna"[Properties]

The fix: Use quotes around phrases with spaces
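
The same kind of pre-flight check works for missing quotes. This sketch finds every `term[Tag]` pair with a regular expression and reports the ones where the term has whitespace but no surrounding quotes. It is a heuristic under simple assumptions (uppercase Boolean operators, no nested brackets), and `unquoted_phrases` is a hypothetical name.

```python
import re

# A term is either a quoted phrase or a run of text without brackets, parens or quotes,
# immediately followed by a [Field Tag].
_TAGGED = re.compile(r'("[^"]*"|[^\[\]()"]+)\[([^\]]+)\]')
_BOOLEAN = re.compile(r"\b(?:AND|OR|NOT)\b")


def unquoted_phrases(query: str) -> list[str]:
    """Return tagged terms that contain whitespace but are not quoted."""
    problems = []
    for term, field in _TAGGED.findall(query):
        if term.startswith('"'):
            continue  # already quoted
        # Drop any leading operator picked up from the previous clause.
        term = _BOOLEAN.split(term)[-1].strip()
        if re.search(r"\s", term):
            problems.append(f"{term}[{field}]")
    return problems


print(unquoted_phrases("mus musculus[Organism] AND biomol mrna[Properties]"))
# ['mus musculus[Organism]', 'biomol mrna[Properties]']
print(unquoted_phrases('"mus musculus"[Organism] AND "biomol mrna"[Properties]'))
# []
```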


Mistake #3: Wrong Database for Your Query

You want...                          | Use this database
Gene information, GO terms, pathways | Gene
Nucleotide sequences                 | Nucleotide
Protein sequences                    | Protein
Variants                             | dbSNP, ClinVar
Literature                           | PubMed

Mistake #4: Confusing Properties vs Filters

Type       | Purpose                          | Example
Properties | Content-based attributes         | "biomol mrna"[Properties]
Filters    | Relationships to other databases | "gene clinvar"[Filter]

Rule of thumb:

  • Properties = what the sequence IS
  • Filters = what the sequence is LINKED to

Mistake #5: Using GenBank When You Need RefSeq

If you need a reliable reference sequence for analysis, don't just search Nucleotide — filter for RefSeq:

"srcdb refseq"[Properties]

Or for highest quality:

"srcdb refseq reviewed"[Properties]

Quick Reference: Field Tags Cheatsheet

Nucleotide Database

Purpose           | Query
mRNA only         | "biomol mrna"[Properties]
Genomic DNA       | "biomol genomic"[Properties]
RefSeq only       | "srcdb refseq"[Properties]
RefSeq reviewed   | "srcdb refseq reviewed"[Properties]
Specific organism | "Homo sapiens"[Organism]
EC number         | 1.1.1.1[EC/RN Number]
GenBank division  | gbdiv_pri[Properties]

Gene Database

Purpose              | Query
Protein-coding genes | "genetype protein coding"[Properties]
Pseudogenes          | "genetype pseudo"[Properties]
GO term              | "term"[Gene Ontology]
Has ClinVar          | "gene clinvar"[Filter]
Has OMIM             | "gene omim"[Filter]
Has structure        | "gene structure"[Filter]
Chromosome           | 7[CHR]
Gene symbol          | BRCA1[sym]

Cytogenetic Location Quick Reference

For the Gene database, understanding cytogenetic notation:

    7    q     3      1     . 2
    ↑    ↑     ↑      ↑       ↑
   Chr  Arm  Region  Band  Sub-band
   
   p = short arm (petit)
   q = long arm

Example: CFTR gene is at 7q31.2 = Chromosome 7, long arm, region 3, band 1, sub-band 2
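
The notation is regular enough to split with a regular expression. This sketch handles single locations written like the example above (chromosome, arm, one region digit, one band digit, optional sub-band) and assumes human chromosome names (1-22, X, Y); it does not cover ranges or other variants of the notation. `CytoLocation` and `parse_cyto` are illustrative names.

```python
import re
from typing import NamedTuple, Optional


class CytoLocation(NamedTuple):
    chromosome: str
    arm: str                 # "short" (p) or "long" (q)
    region: int
    band: int
    sub_band: Optional[str]  # kept as text; None when absent


_CYTO = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d)(\d)(?:\.(\d+))?$")


def parse_cyto(location: str) -> CytoLocation:
    """Split a location such as 7q31.2 into its labelled parts."""
    match = _CYTO.match(location.strip())
    if match is None:
        raise ValueError(f"unsupported cytogenetic location: {location!r}")
    chromosome, arm, region, band, sub_band = match.groups()
    arm_name = "short" if arm == "p" else "long"
    return CytoLocation(chromosome, arm_name, int(region), int(band), sub_band)


print(parse_cyto("7q31.2"))
# CytoLocation(chromosome='7', arm='long', region=3, band=1, sub_band='2')
```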


TL;DR

  • NCBI = US hub for biological databases (GenBank, RefSeq, Gene, PubMed, etc.)
  • GenBank = primary archive (raw submissions) vs RefSeq = curated reference (cleaned up)
  • RefSeq prefixes: NM/NP = curated, XM/XP = predicted — prefer N* for reliable analysis
  • Boolean operators MUST be UPPERCASE: AND, OR, NOT
  • Use quotes around multi-word terms: "homo sapiens"[Organism]
  • Gene database = best starting point for gene-centric searches
  • Properties = what it IS, Filters = what it's LINKED to

Now go query some databases! 🧬