UniProt
Introduction
So you need protein sequences, functions, domains, or disease associations? Welcome to UniProt, a comprehensive, freely accessible resource for protein sequences and functional annotation, and a sensible first stop for most protein questions.
UniProt stands for Universal Protein Resource. It has been maintained since 2002 as a collaboration between three institutions:
| Institution | Full name and location | Contribution |
|---|---|---|
| SIB | Swiss Institute of Bioinformatics, Lausanne | UniProtKB/Swiss-Prot |
| EBI | European Bioinformatics Institute, UK | UniProtKB/TrEMBL, UniParc |
| PIR | Protein Information Resource, Georgetown | UniRef |
What it gives you:
- Protein sequences and functions
- Domains, families, PTMs
- Disease associations and variants
- Subcellular localization
- Cross-references to 180+ external databases
- Proteomes for complete organisms
- BLAST, Align, ID mapping tools
The UniProt Structure
UniProt isn't just one database — it's a collection:
                            UniProt
                               │
               ┌───────────────┼───────────────┐
               │               │               │
           UniProtKB        UniRef          UniParc
          (Knowledge)     (Clusters)       (Archive)
               │
         ┌─────┴─────┐
         │           │
    Swiss-Prot    TrEMBL
    (Reviewed)  (Unreviewed)
| Database | What it is | Size (approx.) |
|---|---|---|
| Swiss-Prot | Manually curated, reviewed | ~570,000 entries |
| TrEMBL | Automatically annotated | ~250,000,000 entries |
| UniRef | Clustered sequences (100%, 90%, 50% identity) | Reduced redundancy |
| UniParc | Complete archive of all sequences | Non-redundant archive |
| Proteomes | Complete protein sets per organism | ~160,000 proteomes |
Swiss-Prot vs TrEMBL: Know the Difference
This is the most important distinction in UniProt:
| Aspect | Swiss-Prot (Reviewed) | TrEMBL (Unreviewed) |
|---|---|---|
| Curation | Manually reviewed by experts | Computationally analyzed |
| Data source | Scientific publications | Sequence repositories |
| Isoforms | Grouped together per gene | Individual entries |
| Quality | High confidence | Variable |
| Size | ~570K entries | ~250M entries |
| Icon | ⭐ Gold star | 📄 Document |
When you need reliable annotations, add reviewed:true to your query. TrEMBL entries are useful for breadth, but Swiss-Prot entries are the gold standard.
UniProt Identifiers
Accession Numbers
The primary identifier — stable and persistent:
P05067 (6 characters: 1 letter + 5 alphanumeric)
A0A024RBG1 (10 characters: newer format)
Entry Names
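Human-readable mnemonic in the form PROTEIN_SPECIES. The first part often matches the gene symbol, but not always:
A4_HUMAN → Amyloid-beta precursor protein (gene APP), Human
INS_HUMAN → Insulin, Human
SPIKE_SARS2 → Spike glycoprotein, SARS-CoV-2
Accession (P05067) = stable, use for databases and scripts
Entry name (A4_HUMAN) = readable, but can change between releases (for example when a TrEMBL entry is promoted to Swiss-Prot)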
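If a script receives a mix of identifiers, it helps to tell accessions from entry names before doing anything else. The sketch below uses the accession regular expression published in the UniProt help pages; the entry-name pattern and the function name `classify_identifier` are assumptions made for illustration, not an official validator.

```python
import re

# Accession pattern as documented in the UniProt help pages.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Assumed shape of an entry name: MNEMONIC_SPECIES in upper-case alphanumerics.
ENTRY_NAME_RE = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify_identifier(text: str) -> str:
    """Return 'accession', 'entry_name', or 'unknown' for a UniProtKB identifier."""
    candidate = text.strip().upper()
    if ACCESSION_RE.fullmatch(candidate):
        return "accession"
    if ENTRY_NAME_RE.fullmatch(candidate):
        return "entry_name"
    return "unknown"


for identifier in ["P05067", "A0A024RBG1", "INS_HUMAN", "SPIKE_SARS2", "not an id"]:
    print(identifier, "->", classify_identifier(identifier))
```

A match only says the string has the right shape; it does not prove the entry exists in UniProtKB.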
Protein Existence Levels
UniProt classifies how confident we are that a protein actually exists:
| Level | Evidence | Description |
|---|---|---|
| 1 | Protein level | Experimental evidence (MS, X-ray, etc.) |
| 2 | Transcript level | mRNA evidence, no protein detected |
| 3 | Homology | Inferred from similar sequences |
| 4 | Predicted | Gene prediction, no other evidence |
| 5 | Uncertain | Dubious, may not exist |
Query syntax: existence:1 (for protein-level evidence)
Annotation Score
A 1-5 score indicating annotation completeness (not accuracy!):
| Score | Meaning |
|---|---|
| 5/5 | Well-characterized, extensively annotated |
| 4/5 | Good annotation coverage |
| 3/5 | Moderate annotation |
| 2/5 | Basic annotation |
| 1/5 | Minimal annotation |
A score of 5/5 means the entry has lots of annotations — it doesn't guarantee they're all correct. A score of 1/5 might just mean the protein hasn't been studied much yet.
UniProt Search Syntax
UniProt uses a field-based query syntax. The general format:
field:value
Basic Query Structure
term1 AND term2 AND (term3 OR term4)
Boolean operators: AND, OR, NOT (can be uppercase or lowercase)
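When queries are assembled in a script, a pair of small helpers can keep the grouping explicit. This is a minimal sketch; `all_of` and `any_of` are hypothetical helper names, not part of any UniProt client library.

```python
def all_of(*terms: str) -> str:
    """Join terms with AND and wrap the group in parentheses."""
    if not terms:
        raise ValueError("at least one term is required")
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms: str) -> str:
    """Join terms with OR and wrap the group in parentheses."""
    if not terms:
        raise ValueError("at least one term is required")
    return "(" + " OR ".join(terms) + ")"


query = all_of(
    "reviewed:true",
    "organism_id:9606",
    any_of("database:pdb", "database:smr"),
)
print(query)
# (reviewed:true AND organism_id:9606 AND (database:pdb OR database:smr))
```

Because every group is wrapped, nested calls never rely on operator precedence.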
Key Search Fields
Organism and Taxonomy
| Field | Example | Description |
|---|---|---|
| organism_name | organism_name:human | Search by organism name |
| organism_id | organism_id:9606 | Search by NCBI taxonomy ID of the organism |
| taxonomy_id | taxonomy_id:9606 | Matches the given taxon and all taxa below it (e.g. taxonomy_id:40674 for all mammals) |
Common taxonomy IDs:
- Human: 9606
- Mouse: 10090
- Rat: 10116
- Zebrafish: 7955
- E. coli K12: 83333
- Yeast: 559292
Review Status and Existence
| Field | Example | Description |
|---|---|---|
| reviewed | reviewed:true | Swiss-Prot only |
| reviewed | reviewed:false | TrEMBL only |
| existence | existence:1 | Protein-level evidence |
Enzyme Classification (EC Numbers)
| Field | Example | Description |
|---|---|---|
| ec | ec:3.4.21.1 | Exact EC number |
| ec | ec:3.4.21.* | Wildcard for all serine endopeptidases |
| ec | ec:3.4.* | All peptidases |
Use * as wildcard: ec:3.4.21.* matches all serine endopeptidases (3.4.21.1, 3.4.21.2, etc.)
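To get a feel for what a wildcard pattern covers, the same pattern can be applied locally to a handful of EC numbers. This sketch uses Python's `fnmatchcase` as a stand-in for the server-side matching, which is an assumption: it only illustrates the idea that a trailing `*` matches every remaining level.

```python
from fnmatch import fnmatchcase

# A few EC numbers to test patterns against.
ec_numbers = ["3.4.21.1", "3.4.21.4", "3.4.22.1", "3.5.1.1", "1.1.1.1"]


def matching(pattern: str, numbers: list[str]) -> list[str]:
    """Return the EC numbers that match a shell-style wildcard pattern."""
    return [ec for ec in numbers if fnmatchcase(ec, pattern)]


print(matching("3.4.21.*", ec_numbers))  # ['3.4.21.1', '3.4.21.4']
print(matching("3.4.*", ec_numbers))     # ['3.4.21.1', '3.4.21.4', '3.4.22.1']
```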
Proteomes
| Field | Example | Description |
|---|---|---|
| proteome | proteome:UP000005640 | Human reference proteome |
| proteome | proteome:UP000000589 | Mouse reference proteome |
Finding proteome IDs: Go to UniProt → Proteomes → Search your organism
Cross-References (External Databases)
| Field | Example | Description |
|---|---|---|
| database | database:pdb | Has PDB structure |
| database | database:smr | Has Swiss-Model structure |
| database | database:ensembl | Has Ensembl cross-ref |
| xref | xref:pdb-1abc | Specific PDB ID |
Function and Annotation
| Field | Example | Description |
|---|---|---|
| cc_function | cc_function:"ion transport" | Function comment |
| cc_scl_term | cc_scl_term:SL-0039 | Subcellular location term |
| keyword | keyword:kinase | UniProt keyword |
| family | family:kinase | Protein family |
Gene Ontology
| Field | Example | Description |
|---|---|---|
| go | go:0007155 | Any GO term (by ID) |
| go | go:"cell adhesion" | Any GO term (by name) |
| goa | goa:0007155 | GO annotation (same as go) |
Sequence Properties
| Field | Example | Description |
|---|---|---|
| length | length:[100 TO 500] | Sequence length range |
| mass | mass:[10000 TO 50000] | Molecular weight range |
| cc_mass_spectrometry | cc_mass_spectrometry:* | Has MS data |
Building Complex Queries
Pattern 1: Reviewed + Organism + Function
reviewed:true AND organism_id:9606 AND cc_function:"kinase"
Pattern 2: Multiple EC Numbers
(ec:3.4.21.*) OR (ec:3.4.22.*)
Pattern 3: Multiple Organisms
(organism_id:10116) OR (organism_id:7955)
Pattern 4: Proteome + Database Cross-Reference
proteome:UP000005640 AND (database:pdb OR database:smr) AND reviewed:true
Pattern 5: Complex Boolean Logic
For "exactly two of three conditions" (A, B, C):
((A AND B) OR (B AND C) OR (A AND C)) NOT (A AND B AND C)
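The "exactly two of three" formula is easy to get wrong by hand, so it is worth checking against a truth table and generating the query string from its parts. This is a sketch: `exactly_two_of_three` and `formula` are hypothetical helper names, and `NOT` is read as "and not", following the pattern above.

```python
from itertools import combinations, product


def exactly_two_of_three(a: str, b: str, c: str) -> str:
    """Build the 'exactly two of three' query from three search terms."""
    pairs = " OR ".join(f"({x} AND {y})" for x, y in combinations((a, b, c), 2))
    return f"(({pairs}) NOT ({a} AND {b} AND {c}))"


def formula(a: bool, b: bool, c: bool) -> bool:
    """The same logic expressed on Python booleans."""
    return ((a and b) or (b and c) or (a and c)) and not (a and b and c)


# Compare the formula with a direct count over all eight combinations.
for a, b, c in product([False, True], repeat=3):
    assert formula(a, b, c) == (a + b + c == 2)

print(exactly_two_of_three("A", "B", "C"))
# (((A AND B) OR (A AND C) OR (B AND C)) NOT (A AND B AND C))
```

The pairs come out in a different order than in the hand-written pattern, which does not matter because OR is commutative.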
Practice Exercises
Exercise 1: Protein Existence Statistics
Q: (1) What percentage of TrEMBL entries have evidence at "protein level"? (2) What percentage of Swiss-Prot entries have evidence at "protein level"?
Answers (counts depend on the UniProt release, so current numbers will differ):
- TrEMBL: ~0.17% (343,595 / 199,006,239)
- Swiss-Prot: ~20.7% (118,866 / 573,661)
Queries:
(1) (existence:1) AND (reviewed:false)
(2) (existence:1) AND (reviewed:true)
Takeaway: The share of entries with protein-level evidence is roughly 120 times higher in Swiss-Prot than in TrEMBL. That's why manual curation matters!
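The percentages follow directly from the four counts in the answer. A short calculation, using only those numbers, makes the comparison reproducible:

```python
# Counts copied from the answer above (they depend on the UniProt release).
trembl_pe1, trembl_total = 343_595, 199_006_239
sprot_pe1, sprot_total = 118_866, 573_661

trembl_pct = 100 * trembl_pe1 / trembl_total
sprot_pct = 100 * sprot_pe1 / sprot_total
ratio = sprot_pct / trembl_pct

print(f"TrEMBL:     {trembl_pct:.2f}%")  # 0.17%
print(f"Swiss-Prot: {sprot_pct:.1f}%")   # 20.7%
print(f"Ratio:      {ratio:.0f}x")       # 120x
```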
Exercise 2: EC Numbers + Multiple Organisms
Q: Retrieve all reviewed proteins annotated as either:
- Cysteine endopeptidases (EC 3.4.22.*)
- Serine endopeptidases (EC 3.4.21.*)
From: Rattus norvegicus [10116] and Danio rerio [7955]
How many?
Answer: 132 entries (121 rat, 11 zebrafish)
Query:
((ec:3.4.21.*) OR (ec:3.4.22.*)) AND ((organism_id:10116) OR (organism_id:7955)) AND (reviewed:true)
How to build it:
| Requirement | Query Component |
|---|---|
| Serine OR Cysteine peptidases | (ec:3.4.21.*) OR (ec:3.4.22.*) |
| Rat OR Zebrafish | (organism_id:10116) OR (organism_id:7955) |
| Reviewed only | reviewed:true |
⚠️ Watch the parentheses! Without proper grouping, you'll get wrong results.
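The same query can be sent to the UniProt REST API instead of the website. The sketch below assumes the `https://rest.uniprot.org/uniprotkb/search` endpoint with its `query`, `format`, `fields` and `size` parameters, and assumes the total hit count is returned in the `X-Total-Results` response header; check the current API documentation before relying on either. `build_search_url` and `count_results` are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://rest.uniprot.org/uniprotkb/search"

QUERY = (
    "((ec:3.4.21.*) OR (ec:3.4.22.*)) "
    "AND ((organism_id:10116) OR (organism_id:7955)) "
    "AND (reviewed:true)"
)


def build_search_url(query, fields=("accession", "id", "organism_name", "length"), size=25):
    """Return a search URL; urlencode takes care of spaces, colons and parentheses."""
    params = {
        "query": query,
        "format": "tsv",
        "fields": ",".join(fields),
        "size": size,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def count_results(query):
    """Ask the server how many entries match (requires network access)."""
    url = build_search_url(query, fields=("accession",), size=1)
    with urlopen(url) as response:
        return int(response.headers["X-Total-Results"])


print(build_search_url(QUERY))
# count_results(QUERY) would contact the server; the number depends on the release.
```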
Exercise 3: Proteome + Structure Cross-References
Q: Retrieve all reviewed entries from the Human Reference Proteome that have either:
- A PDB structure, OR
- A Swiss-Model Repository structure
How many?
Answer: 17,695 entries
Query:
proteome:UP000005640 AND ((database:pdb) OR (database:smr)) AND (reviewed:true)
Components:
| Requirement | Query |
|---|---|
| Human Reference Proteome | proteome:UP000005640 |
| PDB OR SMR structure | (database:pdb) OR (database:smr) |
| Reviewed | reviewed:true |
Exercise 4: Complex Boolean — "Exactly Two of Three"
Q: Find all reviewed entries with exactly two of these three properties:
- Function: "ion transport" (CC field)
- Subcellular location: "cell membrane" (SL-0039)
- GO term: "cell adhesion" (GO:0007155)
Answer: 2,022 entries
Query:
(((cc_function:"ion transport" AND cc_scl_term:SL-0039) OR (cc_scl_term:SL-0039 AND go:0007155) OR (cc_function:"ion transport" AND go:0007155)) NOT (cc_function:"ion transport" AND cc_scl_term:SL-0039 AND go:0007155)) AND (reviewed:true)
Logic breakdown:
"Exactly two of three" = (A AND B) OR (B AND C) OR (A AND C), but NOT (A AND B AND C)
| Variable | Condition |
|---|---|
| A | cc_function:"ion transport" |
| B | cc_scl_term:SL-0039 |
| C | go:0007155 |
⚠️ Common UniProt Search Mistakes
Mistake #1: Forgetting reviewed:true
❌ organism_id:9606 AND ec:3.4.21.*
→ Returns reviewed and unreviewed entries mixed together, with TrEMBL hits usually outnumbering the curated ones
✓ organism_id:9606 AND ec:3.4.21.* AND reviewed:true
→ Returns curated Swiss-Prot entries only
Mistake #2: Wrong Parentheses Grouping
❌ ec:3.4.21.* OR ec:3.4.22.* AND organism_id:9606
→ Parsed as: ec:3.4.21.* OR (ec:3.4.22.* AND organism_id:9606)
→ Gets ALL serine peptidases from ANY organism
✓ (ec:3.4.21.* OR ec:3.4.22.*) AND organism_id:9606
→ Gets both types, but only from human
Rule: Always use parentheses to make grouping explicit!
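Python's `and` also binds more tightly than `or`, so the same mistake can be reproduced on a toy dataset. The four records below are made up purely to show the effect; they are not UniProt entries.

```python
# Hypothetical records: one serine and one cysteine peptidase per organism.
proteins = [
    {"id": "human_serine", "ec": "3.4.21", "organism_id": 9606},
    {"id": "human_cysteine", "ec": "3.4.22", "organism_id": 9606},
    {"id": "rat_serine", "ec": "3.4.21", "organism_id": 10116},
    {"id": "rat_cysteine", "ec": "3.4.22", "organism_id": 10116},
]


def ungrouped(p):
    # Evaluated as: serine OR (cysteine AND human)
    return p["ec"] == "3.4.21" or p["ec"] == "3.4.22" and p["organism_id"] == 9606


def grouped(p):
    # Evaluated as: (serine OR cysteine) AND human
    return (p["ec"] == "3.4.21" or p["ec"] == "3.4.22") and p["organism_id"] == 9606


print([p["id"] for p in proteins if ungrouped(p)])
# ['human_serine', 'human_cysteine', 'rat_serine']
print([p["id"] for p in proteins if grouped(p)])
# ['human_serine', 'human_cysteine']
```

The rat serine peptidase slips into the ungrouped result, which is exactly the "ANY organism" leak described above.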
Mistake #3: Confusing Taxonomy Fields
organism_id:9606 → Works ✓
organism_name:human → Works ✓
taxonomy:human → Doesn't work as expected
Best practice: Use organism_id with the NCBI taxonomy ID for precision.
Mistake #4: Missing Quotes Around Phrases
❌ cc_function:ion transport
→ Searches for "ion" in function AND "transport" anywhere
✓ cc_function:"ion transport"
→ Searches for the phrase "ion transport" in function
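In scripts, the quoting step is easy to forget, so it can be pushed into a helper that formats every field:value pair. This is a minimal sketch; `field_term` is a hypothetical name, and the rule "quote anything with whitespace unless it is a range or already quoted" is an assumption that covers the examples in this guide.

```python
def field_term(field: str, value: str) -> str:
    """Format field:value, quoting the value when it contains whitespace."""
    value = value.strip()
    is_range = value.startswith("[") and value.endswith("]")
    is_quoted = len(value) >= 2 and value.startswith('"') and value.endswith('"')
    if any(ch.isspace() for ch in value) and not (is_range or is_quoted):
        value = f'"{value}"'
    return f"{field}:{value}"


print(field_term("cc_function", "ion transport"))  # cc_function:"ion transport"
print(field_term("organism_id", "9606"))           # organism_id:9606
print(field_term("length", "[100 TO 500]"))        # length:[100 TO 500]
```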
Mistake #5: Using Wrong Field for Cross-References
❌ pdb:1ABC
→ Not a valid field
✓ database:pdb AND xref:pdb-1ABC
→ Correct way to search for specific PDB
Or to find ANY protein with PDB:
database:pdb
Quick Reference: Common Query Patterns
By Organism
organism_id:9606 # Human
organism_id:10090 # Mouse
(organism_id:9606) OR (organism_id:10090) # Human OR Mouse
By Enzyme Class
ec:1.1.1.1 # Exact EC
ec:1.1.1.* # All in 1.1.1.x
ec:1.* # All oxidoreductases
By Evidence Level
reviewed:true # Swiss-Prot only
reviewed:false # TrEMBL only
existence:1 # Protein-level evidence
existence:1 AND reviewed:true # Best quality
By Database Cross-Reference
database:pdb # Has any PDB structure
database:smr # Has Swiss-Model
database:ensembl # Has Ensembl link
(database:pdb) OR (database:smr) # Has any 3D structure
By Proteome
proteome:UP000005640 # Human reference proteome
proteome:UP000000589 # Mouse reference proteome
proteome:UP000000625 # E. coli K12 proteome
By Function/Location
cc_function:"kinase" # Function contains "kinase"
cc_scl_term:SL-0039 # Cell membrane
keyword:phosphoprotein # UniProt keyword
go:0007155 # GO term by ID
go:"cell adhesion" # GO term by name
Entry Sections Quick Reference
A UniProtKB entry contains these sections:
| Section | What you find |
|---|---|
| Function | Catalytic activity, cofactors, pathway |
| Names & Taxonomy | Protein names, gene names, organism |
| Subcellular Location | Where in the cell |
| Disease & Variants | Associated diseases, natural variants |
| PTM/Processing | Post-translational modifications |
| Expression | Tissue specificity, developmental stage |
| Interaction | Protein-protein interactions |
| Structure | 3D structure info, links to PDB |
| Family & Domains | Pfam, InterPro, PROSITE |
| Sequence | Amino acid sequence, isoforms |
| Cross-references | Links to 180+ external databases |
Tools Available in UniProt
| Tool | What it does |
|---|---|
| BLAST | Sequence similarity search |
| Align | Multiple sequence alignment |
| Peptide Search | Find proteins containing a peptide |
| ID Mapping | Convert between ID systems |
| Batch Retrieval | Get multiple entries at once |
Download Formats
| Format | Use case |
|---|---|
| FASTA | Sequences for analysis tools |
| TSV | Tabular data for Excel/R/Python |
| Excel | Direct spreadsheet use |
| JSON | Programmatic access |
| XML | Structured data exchange |
| GFF | Genome annotations |
| List | Just accession numbers |
Before downloading, click "Customize columns" to select exactly which fields you need. This saves processing time later!
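Once a TSV file is downloaded, Python's standard `csv` module is enough to read it. The sketch below parses a small inline sample shaped like an export with three columns selected; the sample rows and the `read_tsv` helper are illustrative, and the column headers in a real export depend on which columns were chosen.

```python
import csv
import io

# Illustrative sample in the shape of a TSV export with three columns selected.
SAMPLE_TSV = (
    "Entry\tEntry Name\tLength\n"
    "P01308\tINS_HUMAN\t110\n"
    "P0DTC2\tSPIKE_SARS2\t1273\n"
)


def read_tsv(text: str) -> list[dict]:
    """Parse tab-separated text into dicts, converting Length to an integer."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [{**row, "Length": int(row["Length"])} for row in reader]


rows = read_tsv(SAMPLE_TSV)
longest = max(rows, key=lambda row: row["Length"])
print(longest["Entry Name"])  # SPIKE_SARS2
```

For a real file, pass the contents of the download to `read_tsv` instead of `SAMPLE_TSV`.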
Automatic Annotation Systems
For TrEMBL entries, annotations come from:
| System | How it works |
|---|---|
| UniRule | Manually curated rules based on Swiss-Prot templates |
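| ARBA | Association-Rule-Based Annotator: rules learned automatically from Swiss-Prot, based on InterPro membership and taxonomy |
| ProtNLM | Protein language model from Google Research that predicts protein names for uncharacterized sequences |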
Evidence codes (ECO):
- ECO:0000269: Experimental evidence
- ECO:0000305: Curator inference
- ECO:0000256: Sequence model (automatic)
- ECO:0000259: InterPro match (automatic)
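When annotations are processed in bulk, the four evidence codes listed here can be used to separate manual from automatic assertions. This is a sketch under stated assumptions: the lookup table covers only these four codes, the manual/automatic split follows the "(automatic)" labels in the list, and the example annotations are invented to exercise the function.

```python
# Codes and meanings from the list of evidence codes in this section.
ECO_CODES = {
    "ECO:0000269": ("experimental evidence", "manual"),
    "ECO:0000305": ("curator inference", "manual"),
    "ECO:0000256": ("sequence model match", "automatic"),
    "ECO:0000259": ("InterPro match", "automatic"),
}


def split_by_assertion(annotations):
    """Group (text, eco_code) pairs into manual, automatic and unknown."""
    groups = {"manual": [], "automatic": [], "unknown": []}
    for text, code in annotations:
        _, kind = ECO_CODES.get(code, ("", "unknown"))
        groups[kind].append(text)
    return groups


# Hypothetical annotations, only to show the grouping.
example = [
    ("Binds zinc", "ECO:0000269"),
    ("Belongs to family X", "ECO:0000256"),
    ("Has domain Y", "ECO:0000259"),
]
groups = split_by_assertion(example)
print(groups["manual"])     # ['Binds zinc']
print(groups["automatic"])  # ['Belongs to family X', 'Has domain Y']
```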
TL;DR
- UniProt = protein database = Swiss-Prot (reviewed, high quality) + TrEMBL (unreviewed, comprehensive)
- Always add reviewed:true when you need reliable annotations
- Query syntax: field:value with AND, OR, NOT
- Use parentheses to group OR conditions properly
- Common fields: organism_id, ec, reviewed, existence, database, proteome, go
- Wildcards: Use * for EC numbers (e.g., ec:3.4.21.*)
- Protein existence: Level 1 = experimental evidence, Level 5 = uncertain
Now go find some proteins! 🧬