Protein Databases
Protein databases store information about protein sequences, structures, and functions. The data come from experimental methods or computational predictions.
PDB
What is PDB? PDB (Protein Data Bank) is the main global database that stores 3D structures of proteins, DNA, RNA, and their complexes.
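How is experimental structure data obtained? (3 methods; percentages are approximate shares of PDB entries as of 2017)
- X-ray Crystallography (88%): uses crystals + X-ray diffraction to map atomic positions.
- NMR Spectroscopy (10%): uses magnetic fields to determine structures in solution.
- Cryo-Electron Microscopy (Cryo-EM) (1% in 2017, now ~10%): images flash-frozen particles with an electron beam and averages them into a 3D density map.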
What is a ligand? A ligand is any small molecule, ion, or cofactor that binds to the protein in the structure, often as part of its biological function. Example: the iron-containing heme group in hemoglobin.
What is Resolution (Å)? Resolution (in Ångström) measures the level of detail; smaller value = sharper, more accurate structure.
PDB at a Glance
The PDB is the first place to look when you want to know what a protein looks like in 3D.
Current stats (approximate):
- ~227,000 experimental structures
- More than 1,000,000 computed structure models (AlphaFold)
The wwPDB Consortium
wwPDB (worldwide Protein Data Bank) was established in 2003. Three data centers maintain it:
| Center | Location | Website |
|---|---|---|
| RCSB PDB | USA | rcsb.org |
| PDBe | Europe (EMBL-EBI) | ebi.ac.uk/pdbe |
| PDBj | Japan | pdbj.org |
They all share the same data, but each has different tools and interfaces.
What wwPDB Does
- Structure deposition — researchers submit their structures through OneDep (deposit.wwpdb.org)
- Structure validation — quality checking before release
- Structure archive — maintaining the database
Related Archives
| Archive | What it stores |
|---|---|
| PDB | Atomic coordinates |
| EMDB | Electron microscopy density maps |
| BMRB | NMR data (chemical shifts, restraints) |
SIFTS
SIFTS (Structure Integration with Function, Taxonomy and Sequence) provides residue-level mapping between:
- PDB entries ↔ UniProt sequences
- Connections to: GO, InterPro, Pfam, CATH, SCOP, PubMed, Ensembl
This is how you can search PDB by Pfam domain or UniProt ID.
Part 1: Experimental Methods
Three main methods to determine protein structures:
| Method | % of PDB (2017) | Size limit | Resolution |
|---|---|---|---|
| X-ray crystallography | 88% | None | Can be <1 Å |
| NMR spectroscopy | 10% | <50-70 kDa | N/A |
| Cryo-EM | 1% (now ~10%) | >50 kDa | Rarely <2.2 Å |
Important: Cryo-EM's share of the PDB has grown rapidly since 2017 (from about 1% to about 10%), driven by the "Resolution Revolution."
X-ray Crystallography
The Process
Protein → Crystallize → X-ray beam → Diffraction pattern →
Electron density map → Atomic model
- Crystallization — grow protein crystals (ordered molecular packing)
- X-ray diffraction — shoot X-rays at the crystal
- Diffraction pattern — X-rays scatter, creating spots on detector
- Phase determination — the "phase problem" (you measure intensities but need phases)
- Electron density map — Fourier transform gives you electron density
- Model fitting — build atomic model into the density
Why X-rays?
Wavelength matters:
- Visible light: λ ≈ 10⁻⁵ cm — too big to resolve atoms
- X-rays: λ ≈ 10⁻⁸ cm — comparable to atomic distances (~1-2 Å)
Problem: no lens can refocus scattered X-rays into an image. The image (the electron density) has to be computed instead, via an inverse Fourier transform of the diffraction data.
Why Crystals?
A single molecule gives too weak a signal. Crystals contain millions of molecules in identical orientations, amplifying the diffraction signal.
The Phase Problem
When X-rays scatter, you measure:
- Amplitudes |F(hkl)| — from diffraction spot intensities ✓
- Phases α(hkl) — LOST in the measurement ✗
Phases must be determined indirectly (molecular replacement, heavy atom methods, etc.). This is why X-ray crystallography is hard.
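The effect of losing the phases can be illustrated with a toy calculation. The sketch below is a minimal 1D analogy, not a crystallographic program: the 8-point "density", the helper names `dft` and `inverse_dft`, and the two-grid-point shift are all assumptions made for illustration. It shows that two different density maps can share exactly the same amplitudes, and that a map computed from amplitudes alone is wrong.

```python
import cmath

def dft(values):
    """Discrete Fourier transform of a real 1D sequence (toy structure factors)."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * h * x / n) for x, v in enumerate(values))
        for h in range(n)
    ]

def inverse_dft(coeffs):
    """Inverse transform; returns the real part (the toy density map)."""
    n = len(coeffs)
    return [
        (sum(c * cmath.exp(2j * cmath.pi * h * x / n) for h, c in enumerate(coeffs)) / n).real
        for x in range(n)
    ]

# Hypothetical 1D "electron density": a heavy and a light atom on 8 grid points.
density = [0.0, 3.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
# The same atoms shifted by two grid points: a different map.
shifted = density[-2:] + density[:-2]

f_density = dft(density)
f_shifted = dft(shifted)

# What the detector gives you: amplitudes only.
amplitudes_density = [abs(f) for f in f_density]
amplitudes_shifted = [abs(f) for f in f_shifted]
# What the detector loses: phases.
phases_density = [cmath.phase(f) for f in f_density]

# Amplitudes + phases recover the density.
recovered = inverse_dft(
    [cmath.rect(a, p) for a, p in zip(amplitudes_density, phases_density)]
)
# Amplitudes with all phases set to zero give a wrong map.
zero_phase_map = inverse_dft([complex(a, 0.0) for a in amplitudes_density])

print("same amplitudes:", all(
    abs(a - b) < 1e-9 for a, b in zip(amplitudes_density, amplitudes_shifted)
))
print("recovered:", [round(v, 3) for v in recovered])
print("zero-phase map:", [round(v, 3) for v in zero_phase_map])
```

Because `density` and `shifted` have identical amplitudes, the measured intensities alone cannot tell them apart; only the phases carry that information.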
Resolution
Definition: The smallest detail you can see in the structure.
What limits resolution: If molecules in the crystal aren't perfectly aligned (due to flexibility or disorder), fine details are lost.
| Resolution | Quality | What you can see |
|---|---|---|
| 0.5-1.5 Å | Exceptional | Individual atoms, hydrogens sometimes visible |
| 1.5-2.5 Å | High | Most features clear, good for detailed analysis |
| 2.5-3.5 Å | Medium | Overall fold clear, some ambiguity in sidechains |
| >3.5 Å | Low | Only general shape, significant uncertainty |
Lower number = better resolution. A 1.5 Å structure is better than a 3.0 Å structure.
Cryo-Electron Microscopy (Cryo-EM)
The Resolution Revolution
Cryo-EM was recognized with the Nobel Prize in Chemistry in 2017. Resolution achieved for β-galactosidase over time:
| Year | Resolution |
|---|---|
| 2005 | 25 Å (blob) |
| 2011 | 11 Å |
| 2013 | 6 Å |
| 2014 | 3.8 Å |
| 2015 | 2.2 Å |
The Process
Protein → Flash-freeze in vitreous ice → Image thousands of particles →
Align and average → 3D reconstruction → Build model
- Sample preparation — purify protein, flash-freeze in thin ice layer
- Imaging — electron beam through frozen sample
- Data collection — thousands of images of individual particles
- Image processing — classify, align, and average particles
- 3D reconstruction — combine to get density map
- Model building — fit atomic model into density
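Why averaging thousands of particles helps can be sketched numerically. The example below is a simplified assumption-laden illustration: the "images" are 1D, already perfectly aligned, and corrupted by Gaussian noise with a made-up `NOISE_SIGMA`; real processing must first classify and align the particles. It shows the residual noise dropping roughly as 1/√N when N copies are averaged.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical 1D "particle image": a weak signal buried in strong noise.
true_signal = [1.0 if 40 <= i < 60 else 0.0 for i in range(100)]
NOISE_SIGMA = 5.0  # assumed noise level, much larger than the signal

def noisy_copy(signal):
    """One simulated exposure of the particle."""
    return [s + random.gauss(0.0, NOISE_SIGMA) for s in signal]

def average_images(images):
    """Pixel-wise mean over aligned images."""
    n = len(images)
    return [sum(pixels) / n for pixels in zip(*images)]

def residual_noise(image):
    """Standard deviation of (image - true signal)."""
    return statistics.pstdev([p - s for p, s in zip(image, true_signal)])

single = noisy_copy(true_signal)
averaged = average_images([noisy_copy(true_signal) for _ in range(400)])

noise_single = residual_noise(single)
noise_averaged = residual_noise(averaged)
print(f"noise in one image:      {noise_single:.2f}")
print(f"noise after 400 images:  {noise_averaged:.2f}")
```

With 400 images the expected reduction is about a factor of 20 (√400), which is why data collection aims for very large particle counts.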
Advantages
- No crystals needed — works on samples that won't crystallize
- Large complexes — good for ribosomes, viruses, membrane proteins
- Multiple conformations — can separate different states
Limitations
- Size limit: Generally requires proteins >50 kDa (small proteins are hard to image)
- Resolution: Very rarely reaches below ~2.2 Å
NMR Spectroscopy
How It Works
NMR doesn't give you a single structure. It gives you restraints (constraints):
- Dihedral angles — backbone and sidechain torsion angles
- Inter-proton distances — from NOE (Nuclear Overhauser Effect)
- Other restraints — hydrogen bonds, orientations
The Output
NMR produces a bundle of structures (ensemble), all compatible with the restraints.
Restraints → Model 1, Model 2, Model 3, ... → all satisfy the experimental data
A single reference structure can be derived from the ensemble, for example by averaging the coordinates of the models.
What Does Variation Mean?
When NMR models differ from each other, it could mean:
- Real flexibility — the protein actually moves
- Uncertainty — not enough data to pin down the position
This is ambiguous and requires careful interpretation.
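The variation between models can be quantified per atom as the root-mean-square fluctuation (RMSF) around the mean position. The sketch below uses a made-up three-model ensemble and assumes the models are already superimposed; `rmsf_per_atom` and `mean_position` are illustrative helper names. Note that the number itself does not say whether the spread is real motion or missing restraints.

```python
import math

# Hypothetical ensemble: 3 superimposed models, same 2 atoms in each, (x, y, z) in Å.
ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 3.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, -3.0, 0.0)],
]

def mean_position(positions):
    """Average (x, y, z) of one atom over all models."""
    n = len(positions)
    return tuple(sum(p[k] for p in positions) / n for k in range(3))

def rmsf_per_atom(models):
    """RMSF of each atom around its mean position across the ensemble."""
    result = []
    for atom_positions in zip(*models):  # same atom, all models
        mean = mean_position(atom_positions)
        msd = sum(math.dist(p, mean) ** 2 for p in atom_positions) / len(atom_positions)
        result.append(math.sqrt(msd))
    return result

rmsf = rmsf_per_atom(ensemble)
print([round(v, 2) for v in rmsf])  # first atom is fixed, second atom varies
```

The same `mean_position` call, applied to every atom, gives the averaged reference structure.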
Advantages
- Dynamics — can observe protein folding, conformational changes
- Solution state — protein in solution, not crystal
Limitations
- Size limit: ≤50-70 kDa (larger proteins have overlapping signals)
Method Comparison Summary
| Feature | X-ray | Cryo-EM | NMR |
|---|---|---|---|
| Sample | Crystal required | Frozen in ice | Solution |
| Size limit | None | >50 kDa | <50-70 kDa |
| Resolution | Can be <1 Å | Rarely <2.2 Å | N/A |
| Dynamics | No | Limited | Yes |
| Multiple states | Difficult | Yes | Yes |
| Membrane proteins | Difficult | Good | Limited |
Part 2: AlphaFold and Computed Structure Models
Timeline
| Method | First structure | Nobel Prize |
|---|---|---|
| X-ray | 1958 | 1962 |
| NMR | 1988 | 2002 |
| Cryo-EM | 2014 | 2017 |
| AlphaFold | 2020 | 2024 |
What is AlphaFold?
A deep learning system that predicts protein structure from sequence.
Amino acid sequence → AlphaFold neural network → 3D structure prediction
How It Works
Input features:
- MSA (Multiple Sequence Alignment) — find related sequences in:
  - UniRef90 (using jackhmmer)
  - MGnify (metagenomic sequences)
  - BFD (2.5 billion proteins)
- Template structures — search PDB70 for similar known structures
Key concept: Co-evolution
If two positions in a protein tend to mutate together across evolution (a change at one is compensated by a change at the other), they are probably in contact in 3D.
Example:
Position 3: R, R, R, K, K, K (all positive)
Position 9: D, D, D, E, E, E (all negative)
These positions probably form a salt bridge.
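One simple way to put a number on "mutating together" is the mutual information between two alignment columns. This is a stand-in chosen for illustration, not how AlphaFold processes the MSA (the network learns such signals itself); the `conserved` column is a hypothetical addition for contrast.

```python
import math
from collections import Counter

def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two alignment columns of equal length."""
    n = len(column_a)
    count_a = Counter(column_a)
    count_b = Counter(column_b)
    count_ab = Counter(zip(column_a, column_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions
        mi += p_ab * math.log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi

position_3 = list("RRRKKK")  # from the example above
position_9 = list("DDDEEE")  # from the example above
conserved = list("GGGGGG")   # hypothetical fully conserved column

print(mutual_information(position_3, position_9))  # covarying pair
print(mutual_information(position_3, conserved))   # no covariation signal
```

The covarying pair scores 1 bit, the maximum for two equally frequent residue types, while a fully conserved column scores 0 because it carries no information about its partner.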
AlphaFold Performance
At CASP14 (2020), AlphaFold2 scored ~92 GDT (Global Distance Test).
- GDT > 90 ≈ experimental structure accuracy
- Previous best methods: 40-60 GDT
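AlphaFold2 essentially solved the structure prediction problem for single domains: given a sequence, it usually predicts the folded structure accurately, although it does not describe how the chain folds.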
AlphaFold Database
- Created: July 2021
- Current size: ~214 million structures
- Coverage: 48 complete proteomes (including human)
- Access: UniProt, RCSB PDB, Ensembl
AlphaFold Confidence Metrics
These are critical for interpreting AlphaFold predictions.
pLDDT (predicted Local Distance Difference Test)
Stored in the B-factor column of AlphaFold PDB files.
| pLDDT | Confidence | Interpretation |
|---|---|---|
| >90 | Very high | Side chains reliable, can analyze active sites |
| 70-90 | Confident | Backbone reliable |
| 50-70 | Low | Uncertain |
| <50 | Very low | Likely disordered, NOT a structure prediction |
What pLDDT measures: Confidence in local structure (not global fold).
Uses:
- Identify structured domains vs disordered regions
- Decide which parts to trust
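Both uses can be sketched in a few lines. The example below assumes a list of per-residue pLDDT values has already been read from a model; the values are invented, and the handling of the exact boundaries (70 and 50 counted in the higher band) is an assumption, since the table above does not specify it.

```python
def plddt_category(plddt):
    """Map a pLDDT value to the confidence bands in the table above."""
    if plddt > 90:
        return "very high"
    if plddt >= 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low"

def confident_segments(plddt_values, threshold=70.0):
    """Return (start, end) residue ranges (1-based, inclusive) with pLDDT >= threshold."""
    segments = []
    start = None
    for index, value in enumerate(plddt_values, start=1):
        if value >= threshold and start is None:
            start = index                      # a confident stretch begins
        elif value < threshold and start is not None:
            segments.append((start, index - 1))  # the stretch ended at the previous residue
            start = None
    if start is not None:                      # stretch runs to the last residue
        segments.append((start, len(plddt_values)))
    return segments

# Hypothetical per-residue pLDDT values for an 11-residue model.
plddt = [35.2, 41.0, 72.5, 88.1, 93.4, 95.0, 91.2, 64.3, 48.9, 77.0, 81.5]

print([plddt_category(v) for v in plddt])
print(confident_segments(plddt))
```

Residues outside the returned segments are the ones to treat as possibly disordered rather than as a predicted structure.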
PAE (Predicted Aligned Error)
A 2D matrix showing confidence in relative positions between residues.
Residue j →
┌─────────────────┐
R │ ■■■ ░░░ │ ■ = low error (confident)
e │ ■■■ ░░░ │ ░ = high error (uncertain)
s │ │
i │ ■■■■■ │
d │ ■■■■■ │
u │ │
e │ ░░░░░░ │
i ↓ │ ░░░░░░ │
└─────────────────┘
- Dark blocks on the diagonal: confident domains
- Off-diagonal dark blocks: confident domain-domain interactions
- Light regions: uncertain relative positions (domains may be connected but their orientation is unknown)
Use PAE for: Determining if domain arrangements are reliable.
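A rough way to turn the PAE matrix into a yes/no answer is to compare the mean error within each domain to the mean error between domains. The matrix, the domain boundaries, and the helper `mean_pae` below are all hypothetical; in practice the domain ranges would come from the pLDDT profile or an annotation.

```python
def mean_pae(pae, rows, cols):
    """Mean predicted aligned error over the block pae[rows][cols]."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)

# Hypothetical 4-residue PAE matrix (Å): residues 0-1 form domain A, 2-3 form domain B.
pae = [
    [0.5, 1.0, 20.0, 22.0],
    [1.0, 0.5, 21.0, 23.0],
    [19.0, 20.0, 0.5, 1.5],
    [21.0, 22.0, 1.5, 0.5],
]
domain_a = range(0, 2)
domain_b = range(2, 4)

within_a = mean_pae(pae, domain_a, domain_a)
within_b = mean_pae(pae, domain_b, domain_b)
# PAE is not necessarily symmetric, so both off-diagonal blocks are averaged.
between = (mean_pae(pae, domain_a, domain_b) + mean_pae(pae, domain_b, domain_a)) / 2

print(f"within A: {within_a:.2f} Å, within B: {within_b:.2f} Å, between: {between:.2f} Å")
```

Low within-domain values combined with a high between-domain value match the "confident domains, unknown orientation" case described above.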
Part 3: PDB File Formats
Legacy PDB Format
ATOM 1 N LYS A 1 -21.816 -8.515 19.632 1.00 41.97
ATOM 2 CA LYS A 1 -20.532 -9.114 20.100 1.00 41.18
| Column | Meaning |
|---|---|
| ATOM | Record type |
| 1, 2 | Atom serial number |
| N, CA | Atom name |
| LYS | Residue name |
| A | Chain ID |
| 1 | Residue number |
| -21.816, -8.515, 19.632 | X, Y, Z coordinates (Å) |
| 1.00 | Occupancy |
| 41.97 | B-factor |
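The legacy format is fixed-width: each field lives in specific character columns, so parsing is done by slicing rather than by splitting on whitespace. The sketch below covers only columns 1-66 of an ATOM record. `format_atom_line` is a hypothetical helper used here to build a correctly spaced line, and `parse_atom_line` reads it back by column.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z, occupancy, b_factor):
    """Build a fixed-width ATOM line (hypothetical helper, columns 1-66 only)."""
    # For the common one-letter elements (C, N, O, S) the atom name starts in column 14.
    padded_name = name if len(name) == 4 else f" {name:<3s}"
    parts = [
        "ATOM".ljust(6),                     # columns 1-6: record type
        f"{serial:>5d}",                     # 7-11: atom serial number
        " ",                                 # 12: blank
        padded_name,                         # 13-16: atom name
        " ",                                 # 17: alternate location (unused here)
        f"{res_name:>3s}",                   # 18-20: residue name
        " ",                                 # 21: blank
        chain,                               # 22: chain ID
        f"{res_seq:>4d}",                    # 23-26: residue number
        " " * 4,                             # 27-30: insertion code + blanks
        f"{x:8.3f}{y:8.3f}{z:8.3f}",         # 31-54: X, Y, Z coordinates
        f"{occupancy:6.2f}{b_factor:6.2f}",  # 55-60 occupancy, 61-66 B-factor
    ]
    return "".join(parts)

def parse_atom_line(line):
    """Slice an ATOM line by column (0-based Python slices of the 1-based columns)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }

line = format_atom_line(1, "N", "LYS", "A", 1, -21.816, -8.515, 19.632, 1.00, 41.97)
atom = parse_atom_line(line)
print(repr(line))
print(atom)
```

For an AlphaFold model, the `b_factor` field returned by the same slice would hold the pLDDT of the residue.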
mmCIF Format
Current standard. More flexible than legacy PDB format:
- Can handle >99,999 atoms
- Machine-readable
- Extensible
The B-factor Column
The B-factor means different things depending on the method:
| Method | B-factor contains | Meaning |
|---|---|---|
| X-ray | Temperature factor | Atomic mobility/disorder |
| NMR | RMSF | Fluctuation across models |
| AlphaFold | pLDDT | Prediction confidence |
For X-ray: $$B = 8\pi^2 U^2$$
Where U² is the mean square displacement of the atom (in Å²), so the RMS displacement is $U = \sqrt{B / (8\pi^2)}$.
| B-factor | RMS displacement | Interpretation |
|---|---|---|
| 15 Å² | ~0.44 Å | Rigid |
| 60 Å² | ~0.87 Å | Flexible |
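The displacement values in the table follow directly from inverting the X-ray relation (B equals 8π² times the mean square displacement). A minimal check, with `rms_displacement` as an illustrative function name:

```python
import math

def rms_displacement(b_factor):
    """RMS atomic displacement (Å) from a crystallographic B-factor (Å^2)."""
    return math.sqrt(b_factor / (8 * math.pi ** 2))

for b in (15.0, 60.0):
    print(f"B = {b:4.0f} Å^2  ->  RMS displacement ≈ {rms_displacement(b):.2f} Å")
```

Because of the square root, quadrupling the B-factor only doubles the displacement. This conversion applies to experimental B-factors only; it is meaningless for the pLDDT values stored in the same column of AlphaFold files.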
Part 4: Data Validation
Why Validation Matters
Not all PDB structures are of equal quality. You need to check:
- Resolution (for X-ray/Cryo-EM)
- R-factors (for X-ray)
- Geometry (for all)
Resolution
Most important quality indicator for X-ray and Cryo-EM.
Lower = better. A 1.5 Å structure shows more detail than a 3.0 Å structure.
R-factor (X-ray only)
Measures how well the model fits the experimental data.
$$R = \frac{\sum |F_{obs} - F_{calc}|}{\sum |F_{obs}|}$$
| R-factor | Interpretation |
|---|---|
| <0.20 | Good fit |
| 0.20-0.25 | Acceptable |
| >0.30 | Significant errors likely |
Types of R-factors:
- R-work: Calculated on data used for refinement
- R-free: Calculated on test set NOT used for refinement (more honest)
R-free is more reliable. If R-work is much lower than R-free, the model may be overfitted.
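The formula and the R-work/R-free comparison can be sketched as follows. The amplitude lists are invented numbers, and the sketch applies the formula exactly as written above, without the scale factor that refinement programs normally place on the calculated amplitudes.

```python
def r_factor(f_obs, f_calc):
    """R = sum(|F_obs - F_calc|) / sum(|F_obs|) over a set of reflections."""
    numerator = sum(abs(o - c) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator

# Hypothetical structure factor amplitudes.
# Working set: reflections used during refinement.
f_obs_work = [100.0, 80.0, 60.0, 40.0]
f_calc_work = [90.0, 85.0, 50.0, 45.0]
# Test set: reflections held out from refinement.
f_obs_free = [70.0, 30.0]
f_calc_free = [55.0, 40.0]

r_work = r_factor(f_obs_work, f_calc_work)
r_free = r_factor(f_obs_free, f_calc_free)
print(f"R-work = {r_work:.3f}, R-free = {r_free:.3f}, gap = {r_free - r_work:.3f}")
```

In this made-up case R-work looks good while R-free sits at the edge of the acceptable range, the kind of gap that would prompt a closer look for overfitting.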
Geometry Validation
| Metric | What it checks |
|---|---|
| Clashscore | Steric clashes between atoms |
| Ramachandran outliers | Unusual backbone angles (φ/ψ) |
| Sidechain outliers | Unusual rotamer conformations |
| RSRZ outliers | Residues that don't fit electron density |
RSRZ: Real Space R-value Z-score
- Measures fit between residue and electron density
- RSRZ > 2 = outlier
wwPDB Validation Report
Every PDB entry has a validation report with:
- Overall quality metrics
- Chain-by-chain analysis
- Residue-level indicators
- Color coding (green = good, red = bad)
Always check the validation report before trusting a structure!
Part 5: Advanced Search in RCSB PDB
Query Builder Categories
- Attribute Search
  - Structure attributes (method, resolution, date)
  - Chemical attributes (ligands)
  - Full text
- Sequence-based Search
  - Sequence similarity (BLAST)
  - Sequence motif
- Structure-based Search
  - 3D shape similarity
  - Structure motif
- Chemical Search
  - Ligand similarity
Key Search Fields
| Field | Use for |
|---|---|
| Experimental Method | "X-RAY DIFFRACTION", "ELECTRON MICROSCOPY", "SOLUTION NMR" |
| Data Collection Resolution | X-ray resolution |
| Reconstruction Resolution | Cryo-EM resolution |
| Source Organism | Species |
| UniProt Accession | Link to UniProt |
| Pfam Identifier | Domain family |
| CATH Identifier | Structure classification |
| Reference Sequence Coverage | How much of UniProt sequence is in structure |
Boolean Logic
AND — both conditions must be true
OR — either condition can be true
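Important: X-ray and Cryo-EM store resolution in different fields, so a query that covers both methods must join the two resolution conditions with OR inside one group, not with AND.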
Practice Exercises
Exercise 1: Pfam Domain Search
Find X-ray structures at resolution ≤2.5 Å, from human and mouse, containing Pfam domain PF00004.
Query:
Experimental Method = "X-RAY DIFFRACTION"
AND Identifier = "PF00004" AND Annotation Type = "Pfam"
AND (Source Organism = "Homo sapiens" OR Source Organism = "Mus musculus")
AND Data Collection Resolution <= 2.5
Answer: 11-50 (15 entries)
Exercise 2: UniProt ID List with Filters
Find X-ray structures for a list of UniProt IDs, with resolution ≤2.2 Å and sequence coverage ≥0.90.
Query:
Accession Code(s) IS ANY OF [list of UniProt IDs]
AND Database Name = "UniProt"
AND Experimental Method = "X-RAY DIFFRACTION"
AND Data Collection Resolution <= 2.2
AND Reference Sequence Coverage >= 0.9
Answer: 501-1000 (811 entries)
Note: "Reference Sequence Coverage" tells you what fraction of the UniProt sequence is present in the PDB structure. Coverage of 0.90 means at least 90% of the protein is in the structure.
Exercise 3: Combining X-ray and Cryo-EM
Find all X-ray structures with resolution ≤2.2 Å AND all Cryo-EM structures with reconstruction resolution ≤2.2 Å.
The tricky part: X-ray uses "Data Collection Resolution" but Cryo-EM uses "Reconstruction Resolution". You need to combine them correctly.
Query:
(Experimental Method = "X-RAY DIFFRACTION" OR Experimental Method = "ELECTRON MICROSCOPY")
AND (Data Collection Resolution <= 2.2 OR Reconstruction Resolution <= 2.2)
Answer: 100001-1000000 (128,107 entries: 127,405 X-ray + 702 EM)
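Why this works: the two resolution fields are method-specific, so an X-ray entry normally has no Reconstruction Resolution and an EM entry has no Data Collection Resolution. Each entry will therefore match either: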
- X-ray AND Data Collection Resolution ≤2.2, OR
- EM AND Reconstruction Resolution ≤2.2
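The equivalence can be checked on a handful of made-up records. The sketch below does not call the RCSB search service; the entry IDs, dictionary keys, and the convention that an unpopulated field is `None` are assumptions for illustration.

```python
# Hypothetical entries; None means the field is not populated for that method.
XRAY = "X-RAY DIFFRACTION"
EM = "ELECTRON MICROSCOPY"
entries = [
    {"id": "XRAY_HI", "method": XRAY, "data_collection": 1.8, "reconstruction": None},
    {"id": "XRAY_LO", "method": XRAY, "data_collection": 3.0, "reconstruction": None},
    {"id": "EM_HI", "method": EM, "data_collection": None, "reconstruction": 2.0},
    {"id": "EM_LO", "method": EM, "data_collection": None, "reconstruction": 3.5},
    {"id": "NMR_1", "method": "SOLUTION NMR", "data_collection": None, "reconstruction": None},
]

def at_most(value, cutoff):
    """A missing value never satisfies a resolution condition."""
    return value is not None and value <= cutoff

def grouped_query(entry):
    """(method is X-ray OR EM) AND (either resolution field <= 2.2)."""
    method_ok = entry["method"] in (XRAY, EM)
    resolution_ok = (
        at_most(entry["data_collection"], 2.2) or at_most(entry["reconstruction"], 2.2)
    )
    return method_ok and resolution_ok

def explicit_query(entry):
    """(X-ray AND data collection <= 2.2) OR (EM AND reconstruction <= 2.2)."""
    xray_hit = entry["method"] == XRAY and at_most(entry["data_collection"], 2.2)
    em_hit = entry["method"] == EM and at_most(entry["reconstruction"], 2.2)
    return xray_hit or em_hit

grouped_hits = [e["id"] for e in entries if grouped_query(e)]
explicit_hits = [e["id"] for e in entries if explicit_query(e)]
print(grouped_hits, explicit_hits)
```

Both forms return the same entries as long as each resolution field is populated only for its own method. Joining the two resolution conditions with AND instead would return nothing in this toy set, since no entry has both fields.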
Exercise 4: Cryo-EM Quality Filter
Among Cryo-EM structures with resolution ≤2.2 Å, how many have Ramachandran outliers ≤1%?
Query:
Experimental Method = "ELECTRON MICROSCOPY"
AND Reconstruction Resolution <= 2.2
AND Molprobity Percentage Ramachandran Outliers <= 1
Answer: 101-1000 (687 out of 702 total)
This tells you that most high-resolution Cryo-EM structures (687 of 702, about 98%) have good backbone geometry.
Query Building Tips
1. Use the Right Resolution Field
| Method | Resolution Field |
|---|---|
| X-ray | Data Collection Resolution |
| Cryo-EM | Reconstruction Resolution |
| NMR | N/A (no resolution) |
2. Experimental Method Exact Names
Use exactly:
- "X-RAY DIFFRACTION" (not "X-ray" or "crystallography")
- "ELECTRON MICROSCOPY" (not "Cryo-EM" or "EM")
- "SOLUTION NMR" (not just "NMR")
3. Organism Names
Use full taxonomic name:
- "Homo sapiens" (not "human")
- "Mus musculus" (not "mouse")
- "Rattus norvegicus" (not "rat")
4. UniProt Queries
When searching by UniProt ID, specify:
Accession Code = [ID] AND Database Name = "UniProt"
5. Combining OR Conditions
Always put OR conditions in parentheses:
(Source Organism = "Homo sapiens" OR Source Organism = "Mus musculus")
Otherwise precedence may give unexpected results.
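The same pitfall exists in most programming languages, which makes it easy to demonstrate. The sketch below uses Python's own operators as an analogy (in Python, `and` binds more tightly than `or`); how the RCSB query builder groups ungrouped conditions is not shown here, and the entry is hypothetical.

```python
# Hypothetical low-resolution human entry.
entry = {"organism": "Homo sapiens", "resolution": 3.4}

is_human = entry["organism"] == "Homo sapiens"
is_mouse = entry["organism"] == "Mus musculus"
high_res = entry["resolution"] <= 2.5

# Without parentheses Python reads this as: is_human or (is_mouse and high_res)
ungrouped = is_human or is_mouse and high_res
# With parentheses the resolution filter applies to both organisms.
grouped = (is_human or is_mouse) and high_res

print("ungrouped:", ungrouped)  # the low-resolution entry slips through
print("grouped:  ", grouped)    # the entry is correctly rejected
```

The ungrouped form lets every human entry through regardless of resolution, which is exactly the kind of unexpected result the parentheses prevent.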
What to Check When Using a PDB Structure
- Experimental method — X-ray? NMR? Cryo-EM?
- Resolution — <2.5 Å is generally good for most purposes
- R-factors — R-free should be reasonable for the resolution
- Validation report — check for outliers in your region of interest
- Sequence coverage — does the structure include the region you care about?
- Ligands/cofactors — are they present? Are they what you expect?
Comparing Experimental vs AlphaFold Structures
When AlphaFold structures are available:
| Check | Experimental | AlphaFold |
|---|---|---|
| Overall reliability | Resolution, R-factor | pLDDT, PAE |
| Local confidence | B-factor (flexibility) | pLDDT (prediction confidence) |
| Disordered regions | Often missing | Low pLDDT (<50) |
| Ligand binding sites | Can have ligands | No ligands |
| Protein-protein interfaces | Shown in complex structures | Not reliable unless AlphaFold-Multimer |
Key insight: Low-confidence AlphaFold regions often correspond to regions missing in experimental structures — both are telling you the same thing (disorder/flexibility).
Quick Reference
PDB Quality Indicators
| Indicator | Good value | Bad value |
|---|---|---|
| Resolution | <2.5 Å | >3.5 Å |
| R-free | <0.25 | >0.30 |
| Ramachandran outliers | <1% | >5% |
| Clashscore | <5 | >20 |
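The table can be applied mechanically as a first-pass filter. In the sketch below the thresholds are copied from the table, values between the "good" and "bad" cutoffs are labelled "intermediate" (an assumption, since the table leaves that range open), and the example entry is invented.

```python
# (good if below, bad if above), taken from the table above.
THRESHOLDS = {
    "resolution": (2.5, 3.5),                      # Å
    "r_free": (0.25, 0.30),
    "ramachandran_outliers_percent": (1.0, 5.0),
    "clashscore": (5.0, 20.0),
}

def rate(indicator, value):
    """Label one indicator as good, intermediate, or bad."""
    good_below, bad_above = THRESHOLDS[indicator]
    if value < good_below:
        return "good"
    if value > bad_above:
        return "bad"
    return "intermediate"

# Hypothetical values read from a validation report.
entry = {
    "resolution": 1.9,
    "r_free": 0.27,
    "ramachandran_outliers_percent": 0.4,
    "clashscore": 25.0,
}

report = {name: rate(name, value) for name, value in entry.items()}
for name, label in report.items():
    print(f"{name:32s} {entry[name]:6.2f}  {label}")
```

A single "bad" label is a reason to open the full validation report and look at the region of interest, not an automatic reason to discard the structure.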
AlphaFold Confidence
| pLDDT | Meaning |
|---|---|
| >90 | Very confident, analyze details |
| 70-90 | Confident backbone |
| 50-70 | Low confidence |
| <50 | Likely disordered |
Search Field Cheatsheet
| What you want | Field to use |
|---|---|
| X-ray resolution | Data Collection Resolution |
| Cryo-EM resolution | Reconstruction Resolution |
| Species | Source Organism Taxonomy Name |
| UniProt link | Accession Code + Database Name = "UniProt" |
| Pfam domain | Identifier + Annotation Type = "Pfam" |
| CATH superfamily | Lineage Identifier (CATH) |
| Coverage | Reference Sequence Coverage |
| Geometry quality | Molprobity Percentage Ramachandran Outliers |
For the Oral Exam
Be prepared to explain:
- Why crystallography needs crystals — signal amplification from ordered molecular packing
- The phase problem — you measure amplitudes but lose phases; must determine indirectly
- What resolution means — ability to distinguish fine details; limited by crystal order
- Why Cryo-EM grew so fast — no crystals needed, good for large complexes, computational advances
- NMR gives ensembles, not single structures — restraints satisfied by multiple conformations
- What pLDDT means — local prediction confidence, stored in B-factor column
- Difference between pLDDT and PAE — pLDDT is local confidence, PAE is relative domain positioning
- How to assess structure quality — resolution, R-factors, validation metrics
- B-factor means different things — mobility (X-ray), fluctuation (NMR), confidence (AlphaFold)
- How to construct complex PDB queries — combining method, resolution, organism, domain annotations