UniProt

Introduction

So you need protein sequences, functions, domains, or disease associations? Welcome to UniProt — the world's most comprehensive protein database, and your one-stop shop for everything protein-related.

UniProt stands for Universal Protein Resource. It is maintained by the UniProt Consortium, a collaboration between three major institutions since 2002:

| Institution | Location | Contribution |
|---|---|---|
| SIB | Swiss Institute of Bioinformatics, Lausanne | UniProtKB/Swiss-Prot |
| EBI | European Bioinformatics Institute, UK | UniProtKB/TrEMBL, UniParc |
| PIR | Protein Information Resource, Georgetown | UniRef |

What it gives you:

  • Protein sequences and functions
  • Domains, families, PTMs
  • Disease associations and variants
  • Subcellular localization
  • Cross-references to 180+ external databases
  • Proteomes for complete organisms
  • BLAST, Align, ID mapping tools

The UniProt Structure

UniProt isn't just one database — it's a collection:

                    UniProt
                       │
       ┌───────────────┼───────────────┐
       │               │               │
   UniProtKB        UniRef         UniParc
   (Knowledge)    (Clusters)      (Archive)
       │
   ┌───┴───┐
   │       │
Swiss-Prot TrEMBL
(Reviewed) (Unreviewed)

| Database | What it is | Size (approx.) |
|---|---|---|
| Swiss-Prot | Manually curated, reviewed | ~570,000 entries |
| TrEMBL | Automatically annotated | ~250,000,000 entries |
| UniRef | Clustered sequences (100%, 90%, 50% identity) | Reduced redundancy |
| UniParc | Complete archive of all sequences | Non-redundant archive |
| Proteomes | Complete protein sets per organism | ~160,000 proteomes |

Swiss-Prot vs TrEMBL: Know the Difference

This is the most important distinction in UniProt:

| Aspect | Swiss-Prot (Reviewed) | TrEMBL (Unreviewed) |
|---|---|---|
| Curation | Manually reviewed by experts | Computationally analyzed |
| Data source | Scientific publications | Sequence repositories |
| Isoforms | Grouped together per gene | Individual entries |
| Quality | High confidence | Variable |
| Size | ~570K entries | ~250M entries |
| Icon | ⭐ Gold star | 📄 Document |

⚠️ Always Filter by Review Status!

When you need reliable annotations, always add reviewed:true to your query. TrEMBL entries can be useful for breadth, but Swiss-Prot entries are the gold standard.


UniProt Identifiers

Accession Numbers

The primary identifier — stable and persistent:

P05067      (6 characters: 1 letter + 5 alphanumeric)
A0A024RBG1  (10 characters: newer format)
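If you handle accessions in scripts, a quick format check catches typos before you query anything. The sketch below uses the regular expression that UniProt's help pages give for accession numbers; the helper name `looks_like_accession` is made up for this example, and a match only means the string has the right shape, not that the entry exists.

```python
import re

# Accession pattern as published in UniProt's help pages.
# A match is a format check only; it does not prove the entry exists.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_accession(text: str) -> bool:
    """Return True if text has the shape of a UniProtKB accession.

    Hypothetical helper: input is stripped and upper-cased first, so
    'p05067' is accepted as well.
    """
    return ACCESSION_RE.fullmatch(text.strip().upper()) is not None


for candidate in ["P05067", "A0A024RBG1", "APP_HUMAN"]:
    print(candidate, looks_like_accession(candidate))
```

Both example accessions pass, while an entry name such as APP_HUMAN does not, which is a handy way to tell the two identifier types apart in mixed input.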

Entry Names

Human-readable format: GENE_SPECIES

APP_HUMAN    → Amyloid precursor protein, Human
INS_HUMAN    → Insulin, Human
SPIKE_SARS2  → Spike protein, SARS-CoV-2

💡 Accession vs Entry Name

Accession (P05067) = stable, use for databases and scripts
Entry name (APP_HUMAN) = readable, can change if gene name updates


Protein Existence Levels

UniProt classifies how confident we are that a protein actually exists:

| Level | Evidence | Description |
|---|---|---|
| 1 | Protein level | Experimental evidence (MS, X-ray, etc.) |
| 2 | Transcript level | mRNA evidence, no protein detected |
| 3 | Homology | Inferred from similar sequences |
| 4 | Predicted | Gene prediction, no other evidence |
| 5 | Uncertain | Dubious, may not exist |

Query syntax: existence:1 (for protein-level evidence)


Annotation Score

A 1-5 score indicating annotation completeness (not accuracy!):

| Score | Meaning |
|---|---|
| 5/5 | Well-characterized, extensively annotated |
| 4/5 | Good annotation coverage |
| 3/5 | Moderate annotation |
| 2/5 | Basic annotation |
| 1/5 | Minimal annotation |

ℹ️ Score ≠ Accuracy

A score of 5/5 means the entry has lots of annotations — it doesn't guarantee they're all correct. A score of 1/5 might just mean the protein hasn't been studied much yet.


UniProt Search Syntax

UniProt uses a field-based query syntax. The general format:

field:value

Basic Query Structure

term1 AND term2 AND (term3 OR term4)

Boolean operators: AND, OR, NOT (can be uppercase or lowercase)
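If you assemble queries in a script, two tiny helpers keep the grouping explicit so you never rely on operator precedence. This is a minimal sketch; `all_of` and `any_of` are hypothetical names for this example, not part of any UniProt client.

```python
def all_of(*clauses: str) -> str:
    """Join clauses with AND and wrap the group in parentheses."""
    if not clauses:
        raise ValueError("need at least one clause")
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses: str) -> str:
    """Join clauses with OR and wrap the group in parentheses."""
    if not clauses:
        raise ValueError("need at least one clause")
    return "(" + " OR ".join(clauses) + ")"


# Rebuild the basic structure shown above.
query = all_of("term1", "term2", any_of("term3", "term4"))
print(query)  # (term1 AND term2 AND (term3 OR term4))
```

Because every group comes back already parenthesized, nesting the helpers cannot change the meaning of an inner group.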


Key Search Fields

Organism and Taxonomy

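| Field | Example | Description |
|---|---|---|
| organism_name | organism_name:human | Search by name |
| organism_id | organism_id:9606 | Search by NCBI taxonomy ID (that exact organism) |
| taxonomy_id | taxonomy_id:9606 | Search by NCBI taxonomy ID, including all descendant taxa (e.g. subspecies) |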

Common taxonomy IDs:

  • Human: 9606
  • Mouse: 10090
  • Rat: 10116
  • Zebrafish: 7955
  • E. coli K12: 83333
  • Yeast: 559292
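Nobody remembers these numbers for long, so it can help to keep them in a small lookup table. The sketch below turns organism names into organism_id clauses; the dictionary keys and the `organism_clause` helper are invented for this example, and only the IDs listed above are included.

```python
# NCBI taxonomy IDs copied from the list above; keys are our own labels.
TAXONOMY_IDS = {
    "human": 9606,
    "mouse": 10090,
    "rat": 10116,
    "zebrafish": 7955,
    "e. coli k12": 83333,
    "yeast": 559292,
}


def organism_clause(*names: str) -> str:
    """Build an organism_id clause; several organisms are OR-ed together."""
    if not names:
        raise ValueError("need at least one organism name")
    parts = [f"(organism_id:{TAXONOMY_IDS[name.lower()]})" for name in names]
    if len(parts) == 1:
        return parts[0]
    return "(" + " OR ".join(parts) + ")"


print(organism_clause("human"))
print(organism_clause("rat", "zebrafish"))
```

An unknown name raises a KeyError, which is usually what you want: better a loud failure than a query against the wrong organism.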

Review Status and Existence

| Field | Example | Description |
|---|---|---|
| reviewed | reviewed:true | Swiss-Prot only |
| reviewed | reviewed:false | TrEMBL only |
| existence | existence:1 | Protein-level evidence |

Enzyme Classification (EC Numbers)

| Field | Example | Description |
|---|---|---|
| ec | ec:3.4.21.1 | Exact EC number |
| ec | ec:3.4.21.* | Wildcard for all serine endopeptidases |
| ec | ec:3.4.* | All peptidases |

💡 EC Number Wildcards

Use * as wildcard: ec:3.4.21.* matches all serine endopeptidases (3.4.21.1, 3.4.21.2, etc.)


Proteomes

| Field | Example | Description |
|---|---|---|
| proteome | proteome:UP000005640 | Human reference proteome |
| proteome | proteome:UP000000589 | Mouse reference proteome |

Finding proteome IDs: Go to UniProt → Proteomes → Search your organism


Cross-References (External Databases)

| Field | Example | Description |
|---|---|---|
| database | database:pdb | Has PDB structure |
| database | database:smr | Has Swiss-Model structure |
| database | database:ensembl | Has Ensembl cross-ref |
| xref | xref:pdb-1abc | Specific PDB ID |

Function and Annotation

| Field | Example | Description |
|---|---|---|
| cc_function | cc_function:"ion transport" | Function comment |
| cc_scl_term | cc_scl_term:SL-0039 | Subcellular location term |
| keyword | keyword:kinase | UniProt keyword |
| family | family:kinase | Protein family |

Gene Ontology

| Field | Example | Description |
|---|---|---|
| go | go:0007155 | Any GO term (by ID) |
| go | go:"cell adhesion" | Any GO term (by name) |
| goa | goa:0007155 | GO annotation (same as go) |

Sequence Properties

| Field | Example | Description |
|---|---|---|
| length | length:[100 TO 500] | Sequence length range |
| mass | mass:[10000 TO 50000] | Molecular weight range |
| cc_mass_spectrometry | cc_mass_spectrometry:* | Has MS data |
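The range syntax is easy to mistype (square brackets, uppercase TO), so here is a small sketch that formats it for you. `range_clause` is a hypothetical helper name; the output simply mirrors the length and mass examples in the table above.

```python
def range_clause(field: str, low: int, high: int) -> str:
    """Format an inclusive range query such as length:[100 TO 500]."""
    if low > high:
        raise ValueError(f"low ({low}) must not exceed high ({high})")
    return f"{field}:[{low} TO {high}]"


print(range_clause("length", 100, 500))    # length:[100 TO 500]
print(range_clause("mass", 10000, 50000))  # mass:[10000 TO 50000]
```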

Building Complex Queries

Pattern 1: Reviewed + Organism + Function

reviewed:true AND organism_id:9606 AND cc_function:"kinase"

Pattern 2: Multiple EC Numbers

(ec:3.4.21.*) OR (ec:3.4.22.*)

Pattern 3: Multiple Organisms

(organism_id:10116) OR (organism_id:7955)

Pattern 4: Proteome + Database Cross-Reference

proteome:UP000005640 AND (database:pdb OR database:smr) AND reviewed:true

Pattern 5: Complex Boolean Logic

For "exactly two of three conditions" (A, B, C):

((A AND B) OR (B AND C) OR (A AND C)) NOT (A AND B AND C)
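Not convinced the pattern really means "exactly two"? You can check it against all eight combinations of A, B and C. The sketch below assumes the trailing NOT in the query behaves like AND NOT, which is how the pattern is described here; the function name `pattern` is just for this example.

```python
from itertools import product


def pattern(a: bool, b: bool, c: bool) -> bool:
    """Mirror ((A AND B) OR (B AND C) OR (A AND C)) NOT (A AND B AND C).

    Assumption: the query's trailing NOT is read as AND NOT.
    """
    return ((a and b) or (b and c) or (a and c)) and not (a and b and c)


# Compare the pattern with a direct count for every combination.
for a, b, c in product([False, True], repeat=3):
    exactly_two = (a + b + c) == 2
    assert pattern(a, b, c) == exactly_two
    print(a, b, c, "->", pattern(a, b, c))
```

The three pairwise ANDs cover "at least two", and excluding the triple AND removes the single case where all three hold.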

Practice Exercises

Exercise 1: Protein Existence Statistics

Q: (1) What percentage of TrEMBL entries have evidence at "protein level"? (2) What percentage of Swiss-Prot entries have evidence at "protein level"?

Answers:

  1. TrEMBL: ~0.17% (343,595 / 199,006,239)
  2. Swiss-Prot: ~20.7% (118,866 / 573,661)

Queries:

(1) (existence:1) AND (reviewed:false)
(2) (existence:1) AND (reviewed:true)

Takeaway: Swiss-Prot has a roughly 120x higher share of entries with protein-level evidence (20.7% vs 0.17%), which is why manual curation matters!
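The percentages come straight from the result counts of the two queries divided by the total number of entries in each section. Here is the arithmetic spelled out, using only the counts quoted in the answer (they will drift with every UniProt release):

```python
# Counts quoted in the answer above; they change with each release.
trembl_protein_level, trembl_total = 343_595, 199_006_239
swissprot_protein_level, swissprot_total = 118_866, 573_661

trembl_pct = 100 * trembl_protein_level / trembl_total
swissprot_pct = 100 * swissprot_protein_level / swissprot_total
ratio = swissprot_pct / trembl_pct

print(f"TrEMBL:     {trembl_pct:.2f}%")
print(f"Swiss-Prot: {swissprot_pct:.1f}%")
print(f"Ratio:      about {ratio:.0f}x")
```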


Exercise 2: EC Numbers + Multiple Organisms

Q: Retrieve all reviewed proteins annotated as either:

  • Cysteine endopeptidases (EC 3.4.22.*)
  • Serine endopeptidases (EC 3.4.21.*)

From: Rattus norvegicus [10116] and Danio rerio [7955]

How many?

Answer: 132 entries (121 rat, 11 zebrafish)

Query:

((ec:3.4.21.*) OR (ec:3.4.22.*)) AND ((organism_id:10116) OR (organism_id:7955)) AND (reviewed:true)

How to build it:

| Requirement | Query Component |
|---|---|
| Serine OR Cysteine peptidases | (ec:3.4.21.*) OR (ec:3.4.22.*) |
| Rat OR Zebrafish | (organism_id:10116) OR (organism_id:7955) |
| Reviewed only | reviewed:true |

⚠️ Watch the parentheses! Without proper grouping, you'll get wrong results.


Exercise 3: Proteome + Structure Cross-References

Q: Retrieve all reviewed entries from the Human Reference Proteome that have either:

  • A PDB structure, OR
  • A Swiss-Model Repository structure

How many?

Answer: 17,695 entries

Query:

proteome:UP000005640 AND ((database:pdb) OR (database:smr)) AND (reviewed:true)

Components:

| Requirement | Query |
|---|---|
| Human Reference Proteome | proteome:UP000005640 |
| PDB OR SMR structure | (database:pdb) OR (database:smr) |
| Reviewed | reviewed:true |

Exercise 4: Complex Boolean — "Exactly Two of Three"

Q: Find all reviewed entries with exactly two of these three properties:

  • Function: "ion transport" (CC field)
  • Subcellular location: "cell membrane" (SL-0039)
  • GO term: "cell adhesion" (GO:0007155)

Answer: 2,022 entries

Query:

((cc_function:"ion transport" AND cc_scl_term:SL-0039) OR (cc_scl_term:SL-0039 AND go:0007155) OR (cc_function:"ion transport" AND go:0007155)) NOT (cc_function:"ion transport" AND cc_scl_term:SL-0039 AND go:0007155) AND (reviewed:true)

Logic breakdown:

"Exactly two of three" = (A AND B) OR (B AND C) OR (A AND C), but NOT (A AND B AND C)

| Variable | Condition |
|---|---|
| A | cc_function:"ion transport" |
| B | cc_scl_term:SL-0039 |
| C | go:0007155 |

⚠️ Common UniProt Search Mistakes

Mistake #1: Forgetting reviewed:true

❌ organism_id:9606 AND ec:3.4.21.*
   → Mixes unreviewed TrEMBL entries in with the Swiss-Prot ones

✓ organism_id:9606 AND ec:3.4.21.* AND reviewed:true
   → Returns curated Swiss-Prot entries only

Mistake #2: Wrong Parentheses Grouping

❌ ec:3.4.21.* OR ec:3.4.22.* AND organism_id:9606
   → Parsed as: ec:3.4.21.* OR (ec:3.4.22.* AND organism_id:9606)
   → Gets ALL serine peptidases from ANY organism

✓ (ec:3.4.21.* OR ec:3.4.22.*) AND organism_id:9606
   → Gets both types, but only from human

Rule: Always use parentheses to make grouping explicit!
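You can see the same effect offline. Python's `and` also binds tighter than `or`, so filtering a few made-up records reproduces the mistake. Everything in this sketch is hypothetical: the four records, their accessions and the helper names exist only for this demonstration.

```python
# Four made-up records: two serine (3.4.21.x), two cysteine (3.4.22.x).
entries = [
    {"acc": "E1", "ec": "3.4.21.1", "organism_id": 9606},
    {"acc": "E2", "ec": "3.4.21.5", "organism_id": 10090},
    {"acc": "E3", "ec": "3.4.22.1", "organism_id": 9606},
    {"acc": "E4", "ec": "3.4.22.2", "organism_id": 10090},
]


def is_serine(entry):
    return entry["ec"].startswith("3.4.21.")


def is_cysteine(entry):
    return entry["ec"].startswith("3.4.22.")


def is_human(entry):
    return entry["organism_id"] == 9606


# Without parentheses: serine OR (cysteine AND human).
ungrouped = [e["acc"] for e in entries
             if is_serine(e) or is_cysteine(e) and is_human(e)]

# With parentheses: (serine OR cysteine) AND human.
grouped = [e["acc"] for e in entries
           if (is_serine(e) or is_cysteine(e)) and is_human(e)]

print(ungrouped)  # ['E1', 'E2', 'E3']
print(grouped)    # ['E1', 'E3']
```

The ungrouped filter lets the mouse serine peptidase E2 through, which is exactly the "ANY organism" leak described above.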


Mistake #3: Confusing Taxonomy Fields

organism_id:9606    → Works ✓
organism_name:human → Works ✓
taxonomy:human      → Doesn't work as expected

Best practice: Use organism_id with the NCBI taxonomy ID for precision.


Mistake #4: Missing Quotes Around Phrases

❌ cc_function:ion transport
   → Searches for "ion" in function AND "transport" anywhere

✓ cc_function:"ion transport"
   → Searches for the phrase "ion transport" in function
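If your search terms come from user input or a spreadsheet, it is safer to add the quotes in code than to remember them by hand. A minimal sketch, assuming text values only; `field_clause` is a hypothetical helper, and it leaves range values such as [100 TO 500] and already-quoted values alone.

```python
def field_clause(field: str, value: str) -> str:
    """Build field:value, quoting the value when it contains whitespace."""
    value = value.strip()
    already_quoted = (
        len(value) >= 2 and value.startswith('"') and value.endswith('"')
    )
    is_range = value.startswith("[") and value.endswith("]")
    has_space = any(ch.isspace() for ch in value)
    if has_space and not already_quoted and not is_range:
        value = f'"{value}"'
    return f"{field}:{value}"


print(field_clause("cc_function", "ion transport"))
print(field_clause("keyword", "kinase"))
```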

Mistake #5: Using Wrong Field for Cross-References

❌ pdb:1ABC
   → Not a valid field

✓ database:pdb AND xref:pdb-1ABC
   → Correct way to search for specific PDB

Or to find ANY protein with PDB:

database:pdb

Quick Reference: Common Query Patterns

By Organism

organism_id:9606                           # Human
organism_id:10090                          # Mouse
(organism_id:9606) OR (organism_id:10090)  # Human OR Mouse

By Enzyme Class

ec:1.1.1.1              # Exact EC
ec:1.1.1.*              # All in 1.1.1.x
ec:1.*                  # All oxidoreductases

By Evidence Level

reviewed:true                    # Swiss-Prot only
reviewed:false                   # TrEMBL only
existence:1                      # Protein-level evidence
existence:1 AND reviewed:true    # Best quality

By Database Cross-Reference

database:pdb                     # Has any PDB structure
database:smr                     # Has Swiss-Model
database:ensembl                 # Has Ensembl link
(database:pdb) OR (database:smr) # Has any 3D structure

By Proteome

proteome:UP000005640    # Human reference proteome
proteome:UP000000589    # Mouse reference proteome
proteome:UP000000625    # E. coli K12 proteome

By Function/Location

cc_function:"kinase"              # Function contains "kinase"
cc_scl_term:SL-0039               # Cell membrane
keyword:phosphoprotein            # UniProt keyword
go:0007155                        # GO term by ID
go:"cell adhesion"                # GO term by name

Entry Sections Quick Reference

A UniProtKB entry contains these sections:

| Section | What you find |
|---|---|
| Function | Catalytic activity, cofactors, pathway |
| Names & Taxonomy | Protein names, gene names, organism |
| Subcellular Location | Where in the cell |
| Disease & Variants | Associated diseases, natural variants |
| PTM/Processing | Post-translational modifications |
| Expression | Tissue specificity, developmental stage |
| Interaction | Protein-protein interactions |
| Structure | 3D structure info, links to PDB |
| Family & Domains | Pfam, InterPro, PROSITE |
| Sequence | Amino acid sequence, isoforms |
| Cross-references | Links to 180+ external databases |

Tools Available in UniProt

| Tool | What it does |
|---|---|
| BLAST | Sequence similarity search |
| Align | Multiple sequence alignment |
| Peptide Search | Find proteins containing a peptide |
| ID Mapping | Convert between ID systems |
| Batch Retrieval | Get multiple entries at once |

Download Formats

| Format | Use case |
|---|---|
| FASTA | Sequences for analysis tools |
| TSV | Tabular data for Excel/R/Python |
| Excel | Direct spreadsheet use |
| JSON | Programmatic access |
| XML | Structured data exchange |
| GFF | Genome annotations |
| List | Just accession numbers |

💡 Customize Your Download

Before downloading, click "Customize columns" to select exactly which fields you need. This saves processing time later!
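The same choices (query, format, columns) can be expressed as a URL for scripted downloads. The sketch below only builds the URL and fetches nothing. It assumes UniProt's public REST search endpoint at rest.uniprot.org and its `query`, `format`, `fields` and `size` parameters, none of which are described on this page, so check the current API documentation before relying on them. The column names `accession`, `id` and `length` are likewise assumptions, and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UniProt REST search API (not from this page).
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, fields, fmt="tsv", size=500):
    """Return a search URL; parameter names are assumptions, see above."""
    params = {
        "query": query,
        "format": fmt,
        "fields": ",".join(fields),  # plays the role of "Customize columns"
        "size": size,                # results per page
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url(
    "proteome:UP000005640 AND reviewed:true",
    ["accession", "id", "length"],
)
print(url)
```

Letting `urlencode` do the escaping avoids hand-encoding colons, spaces and quotes in the query string.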


Automatic Annotation Systems

For TrEMBL entries, annotations come from:

| System | How it works |
|---|---|
| UniRule | Manually curated rules based on Swiss-Prot templates |
| ARBA | Association-Rule-Based Annotator, using InterPro |
| ProtNLM | Google's natural language model that predicts protein names from sequence |

Evidence codes (ECO):

  • ECO:0000269 — Experimental evidence
  • ECO:0000305 — Curator inference
  • ECO:0000256 — Sequence model (automatic)
  • ECO:0000259 — InterPro match (automatic)
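When you parse downloaded annotations, a quick way to separate curated statements from predicted ones is to look at the ECO code attached to each. A minimal sketch covering only the four codes listed above; treating ECO:0000269 and ECO:0000305 as manual assertions is our reading of those codes, and the names `ECO_CODES` and `is_automatic` are made up for this example.

```python
# Only the four codes listed above; real entries use more ECO codes.
ECO_CODES = {
    "ECO:0000269": ("Experimental evidence", "manual"),
    "ECO:0000305": ("Curator inference", "manual"),
    "ECO:0000256": ("Sequence model", "automatic"),
    "ECO:0000259": ("InterPro match", "automatic"),
}


def is_automatic(eco_code):
    """Return True/False for known codes and None for unknown ones."""
    info = ECO_CODES.get(eco_code)
    if info is None:
        return None
    return info[1] == "automatic"


for code, (label, kind) in ECO_CODES.items():
    print(f"{code}  {label:<22} {kind}")
```

Returning None for an unknown code keeps "not automatic" and "not in this table" apart, which matters once you meet codes beyond these four.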

TL;DR

  • UniProt = protein database = Swiss-Prot (reviewed, high quality) + TrEMBL (unreviewed, comprehensive)
  • Always add reviewed:true when you need reliable annotations
  • Query syntax: field:value with AND, OR, NOT
  • Use parentheses to group OR conditions properly
  • Common fields: organism_id, ec, reviewed, existence, database, proteome, go
  • Wildcards: Use * for EC numbers (e.g., ec:3.4.21.*)
  • Protein existence: Level 1 = experimental evidence, Level 5 = uncertain

Now go find some proteins! 🧬