Protein Databases

Protein databases store information about protein structures, sequences, and functions. The data come from experimental methods or computational predictions.

PDB

Definition

What is PDB? PDB (Protein Data Bank) is the main global database that stores 3D structures of proteins, DNA, RNA, and their complexes.

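How is experimental structure data obtained? (3 methods)

  1. X-ray Crystallography (88% of PDB entries as of 2017): uses crystals + X-ray diffraction to map atomic positions.
  2. NMR Spectroscopy (10%): uses magnetic fields to determine structures in solution.
  3. Cryo-Electron Microscopy (Cryo-EM) (1% in 2017, now ~10%): images flash-frozen particles with an electron beam and averages them into a 3D density map.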

What is a Ligand? A ligand is any small molecule, ion, or cofactor that binds to the protein in the structure, often to perform a specific biological function. Example: the iron-containing heme group in hemoglobin.

What is Resolution (Å)? Resolution (in Ångström) measures the level of detail; smaller value = sharper, more accurate structure.

What is the PDB? (Again)

The Protein Data Bank is the central repository for 3D structures of biological macromolecules (proteins, DNA, RNA). If you want to know what a protein looks like in 3D, you go to PDB.

Current stats:

  • ~227,000 experimental structures
  • ~1,000,000+ computed structure models (AlphaFold)

The wwPDB Consortium

wwPDB (worldwide Protein Data Bank) was established in 2003. Three data centers maintain it:

| Center | Location | Website |
|---|---|---|
| RCSB PDB | USA | rcsb.org |
| PDBe | Europe (EMBL-EBI) | ebi.ac.uk/pdbe |
| PDBj | Japan | pdbj.org |

They all share the same data, but each has different tools and interfaces.

What wwPDB Does

  1. Structure deposition — researchers submit their structures through OneDep (deposit.wwpdb.org)
  2. Structure validation — quality checking before release
  3. Structure archive — maintaining the database

| Archive | What it stores |
|---|---|
| PDB | Atomic coordinates |
| EMDB | Electron microscopy density maps |
| BMRB | NMR data (chemical shifts, restraints) |

SIFTS

SIFTS (Structure Integration with Function, Taxonomy and Sequence) provides residue-level mapping between:

  • PDB entries ↔ UniProt sequences
  • Connections to: GO, InterPro, Pfam, CATH, SCOP, PubMed, Ensembl

This is how you can search PDB by Pfam domain or UniProt ID.


Part 1: Experimental Methods

Three main methods to determine protein structures:

| Method | % of PDB (2017) | Size limit | Resolution |
|---|---|---|---|
| X-ray crystallography | 88% | None | Can be <1 Å |
| NMR spectroscopy | 10% | <50-70 kDa | N/A |
| Cryo-EM | 1% (now ~10%) | >50 kDa | Rarely <2.2 Å |

Important: Cryo-EM's share of the PDB has grown rapidly since the mid-2010s "Resolution Revolution", from about 1% of entries in 2017 to roughly 10% now.


X-ray Crystallography

The Process

Protein → Crystallize → X-ray beam → Diffraction pattern → 
Electron density map → Atomic model
  1. Crystallization — grow protein crystals (ordered molecular packing)
  2. X-ray diffraction — shoot X-rays at the crystal
  3. Diffraction pattern — X-rays scatter, creating spots on detector
  4. Phase determination — the "phase problem" (you measure intensities but need phases)
  5. Electron density map — Fourier transform gives you electron density
  6. Model fitting — build atomic model into the density

Why X-rays?

Wavelength matters:

  • Visible light: λ ≈ 4-7 × 10⁻⁵ cm (400-700 nm), far too long to resolve atoms
  • X-rays: λ ≈ 10⁻⁸ cm (1 Å), comparable to atomic distances (~1-2 Å)

Problem: No practical lens can refocus scattered X-rays into an image, so the image has to be reconstructed computationally with an inverse Fourier transform.

Why Crystals?

A single molecule gives too weak a signal. Crystals contain millions of molecules in identical orientations, amplifying the diffraction signal.

The Phase Problem

When X-rays scatter, you measure:

  • Amplitudes |F(hkl)| — from diffraction spot intensities ✓
  • Phases α(hkl) — LOST in the measurement ✗

Phases must be determined indirectly (molecular replacement, heavy atom methods, etc.). This is why X-ray crystallography is hard.
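
The effect of losing the phases can be illustrated with a toy calculation. The sketch below is only an analogy under stated assumptions: a hypothetical 1D "density" with two atoms stands in for a crystal, and NumPy's FFT stands in for diffraction. It shows that amplitudes plus phases recover the density, while amplitudes alone do not.

```python
import numpy as np

# Hypothetical 1D "electron density": two atoms of different height on a 64-point grid.
density = np.zeros(64)
density[10] = 3.0
density[25] = 1.0

# Complex Fourier coefficients play the role of structure factors F(hkl).
F = np.fft.fft(density)
amplitudes = np.abs(F)   # recoverable from spot intensities (I is proportional to |F|^2)
phases = np.angle(F)     # not recorded by the detector

# Amplitudes AND phases: the inverse transform gives back the density.
recovered = np.fft.ifft(amplitudes * np.exp(1j * phases)).real

# Amplitudes only (all phases set to zero): the resulting map is wrong.
no_phase_map = np.fft.ifft(amplitudes).real

print(np.allclose(recovered, density))      # True
print(np.allclose(no_phase_map, density))   # False
```

Methods such as molecular replacement supply approximate phases from a related model, which is why a good starting model matters so much.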

Resolution

Definition: The smallest detail you can see in the structure.

What limits resolution: If molecules in the crystal aren't perfectly aligned (due to flexibility or disorder), fine details are lost.

| Resolution | Quality | What you can see |
|---|---|---|
| 0.5-1.5 Å | Exceptional | Individual atoms, hydrogens sometimes visible |
| 1.5-2.5 Å | High | Most features clear, good for detailed analysis |
| 2.5-3.5 Å | Medium | Overall fold clear, some ambiguity in sidechains |
| >3.5 Å | Low | Only general shape, significant uncertainty |

Lower number = better resolution. A 1.5 Å structure is better than a 3.0 Å structure.


Cryo-Electron Microscopy (Cryo-EM)

The Resolution Revolution

Nobel Prize in Chemistry 2017. Progress on β-Galactosidase:

| Year | Resolution |
|---|---|
| 2005 | 25 Å (blob) |
| 2011 | 11 Å |
| 2013 | 6 Å |
| 2014 | 3.8 Å |
| 2015 | 2.2 Å |

The Process

Protein → Flash-freeze in vitreous ice → Image thousands of particles → 
Align and average → 3D reconstruction → Build model
  1. Sample preparation — purify protein, flash-freeze in thin ice layer
  2. Imaging — electron beam through frozen sample
  3. Data collection — thousands of images of individual particles
  4. Image processing — classify, align, and average particles
  5. 3D reconstruction — combine to get density map
  6. Model building — fit atomic model into density
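
Why averaging thousands of particles helps can be sketched numerically. This is a simplified illustration, not a real processing pipeline: it assumes the particle images are already perfectly aligned, uses a hypothetical square "particle", and models noise as Gaussian. Under those assumptions the residual noise falls roughly as 1/√N.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" 2D projection: a bright square on a dark background.
true_image = np.zeros((32, 32))
true_image[12:20, 12:20] = 1.0

def noisy_copy(image, sigma):
    """One simulated particle image: the true projection plus Gaussian noise."""
    return image + rng.normal(0.0, sigma, size=image.shape)

def average_particles(n, sigma=5.0):
    """Average n noisy copies (alignment is assumed to be already done)."""
    stack = np.stack([noisy_copy(true_image, sigma) for _ in range(n)])
    return stack.mean(axis=0)

def residual_noise(image):
    """Standard deviation of the difference between an image and the truth."""
    return float(np.std(image - true_image))

noise_1 = residual_noise(average_particles(1))
noise_400 = residual_noise(average_particles(400))
print(f"1 particle: {noise_1:.2f}, 400 particles: {noise_400:.2f}")
```

With 400 particles the noise should drop by about a factor of 20, which is why single images look like noise but the average shows the particle.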

Advantages

  • No crystals needed — works on samples that won't crystallize
  • Large complexes — good for ribosomes, viruses, membrane proteins
  • Multiple conformations — can separate different states

Limitations

  • Size limit: Generally requires proteins >50 kDa (small proteins are hard to image)
  • Resolution: Very rarely reaches below ~2.2 Å

NMR Spectroscopy

How It Works

NMR doesn't give you a single structure. It gives you restraints (constraints):

  1. Dihedral angles — backbone and sidechain torsion angles
  2. Inter-proton distances — from NOE (Nuclear Overhauser Effect)
  3. Other restraints — hydrogen bonds, orientations

The Output

NMR produces a bundle of structures (ensemble), all compatible with the restraints.

                Model 1
               /
Restraints → Model 2  → All satisfy the experimental data
               \
                Model 3

A reference structure can be calculated by averaging.

What Does Variation Mean?

When NMR models differ from each other, it could mean:

  • Real flexibility — the protein actually moves
  • Uncertainty — not enough data to pin down the position

This is ambiguous and requires careful interpretation.
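
Variation across an ensemble is often summarized as a per-atom RMSF (root-mean-square fluctuation) around the averaged structure. The sketch below assumes a hypothetical three-model, four-atom ensemble that has already been superimposed; the coordinates are made up for illustration.

```python
import numpy as np

# Hypothetical ensemble: 3 models x 4 atoms x (x, y, z) in Å, already superimposed.
ensemble = np.array([
    [[0.0, 0.0, 0.0], [1.5,  0.0, 0.0], [3.0, 0.0, 0.0], [4.5,  0.0, 0.0]],
    [[0.0, 0.0, 0.0], [1.5,  0.1, 0.0], [3.0, 0.0, 0.0], [4.5,  1.0, 0.0]],
    [[0.0, 0.0, 0.0], [1.5, -0.1, 0.0], [3.0, 0.0, 0.0], [4.5, -1.0, 0.0]],
])

def mean_structure(models):
    """Reference structure obtained by averaging coordinates over models."""
    return models.mean(axis=0)

def rmsf(models):
    """Per-atom root-mean-square fluctuation around the mean position (Å)."""
    deviations = models - mean_structure(models)   # shape (n_models, n_atoms, 3)
    squared_distance = (deviations ** 2).sum(axis=2)
    return np.sqrt(squared_distance.mean(axis=0))

print(np.round(rmsf(ensemble), 2))   # [0.   0.08 0.   0.82]
```

A large RMSF for the last atom flags it as variable, but the number alone cannot say whether that reflects real motion or a lack of restraints.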

Advantages

  • Dynamics — can observe protein folding, conformational changes
  • Solution state — protein in solution, not crystal

Limitations

  • Size limit: ≤50-70 kDa (larger proteins have overlapping signals)

Method Comparison Summary

| Feature | X-ray | Cryo-EM | NMR |
|---|---|---|---|
| Sample | Crystal required | Frozen in ice | Solution |
| Size limit | None | >50 kDa | <50-70 kDa |
| Resolution | Can be <1 Å | Rarely <2.2 Å | N/A |
| Dynamics | No | Limited | Yes |
| Multiple states | Difficult | Yes | Yes |
| Membrane proteins | Difficult | Good | Limited |

Part 2: AlphaFold and Computed Structure Models

Timeline

| Method | First structure | Nobel Prize |
|---|---|---|
| X-ray | 1958 | 1962 |
| NMR | 1988 | 2002 |
| Cryo-EM | 2014 | 2017 |
| AlphaFold | 2020 | 2024 |

What is AlphaFold?

A deep learning system that predicts protein structure from sequence.

Amino acid sequence → AlphaFold neural network → 3D structure prediction

How It Works

Input features:

  1. MSA (Multiple Sequence Alignment) — find related sequences in:

    • UniRef90 (using jackhmmer)
    • MGnify (metagenomic sequences)
    • BFD (2.5 billion proteins)
  2. Template structures — search PDB70 for similar known structures

Key concept: Co-evolution

If two positions in a protein always mutate together across evolution, they're probably in contact in 3D.

Example:

Position 3: R, R, R, K, K, K    (all positive)
Position 9: D, D, D, E, E, E    (all negative)

These positions probably form a salt bridge.
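
One classical way to quantify "mutating together" is the mutual information between two alignment columns. The sketch below applies it to the two columns from the example, plus a hypothetical fully conserved column for contrast. It is an illustration of the co-evolution idea only; AlphaFold learns such signals from the MSA inside its network and is not assumed to compute this statistic.

```python
from collections import Counter
from math import log2

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns of equal length."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        mi += p_ab * log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

position_3 = list("RRRKKK")   # from the example above
position_9 = list("DDDEEE")   # from the example above
position_5 = list("AAAAAA")   # hypothetical conserved column, for contrast

print(mutual_information(position_3, position_9))   # 1.0
print(mutual_information(position_3, position_5))   # 0.0
```

Positions 3 and 9 share one full bit of information because R always pairs with D and K with E; a conserved column carries no co-variation signal at all.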

AlphaFold Performance

At CASP14 (2020), AlphaFold2 scored ~92 GDT (Global Distance Test).

  • GDT > 90 ≈ experimental structure accuracy
  • Previous best methods: 40-60 GDT

AlphaFold essentially solved the structure prediction problem for single domains: given a sequence, it predicts the folded structure, though not the folding pathway.

AlphaFold Database

  • Created: July 2021
  • Current size: ~214 million structures
  • Coverage: 48 complete proteomes (including human)
  • Access: UniProt, RCSB PDB, Ensembl

AlphaFold Confidence Metrics

These are critical for interpreting AlphaFold predictions.

pLDDT (predicted Local Distance Difference Test)

Stored in the B-factor column of AlphaFold PDB files.

| pLDDT | Confidence | Interpretation |
|---|---|---|
| >90 | Very high | Side chains reliable, can analyze active sites |
| 70-90 | Confident | Backbone reliable |
| 50-70 | Low | Uncertain |
| <50 | Very low | Likely disordered, NOT a structure prediction |

What pLDDT measures: Confidence in local structure (not global fold).

Uses:

  • Identify structured domains vs disordered regions
  • Decide which parts to trust
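
The two uses above can be sketched in a few lines. The per-residue scores here are hypothetical (in a real AlphaFold PDB file they would be read from the B-factor column), and the handling of exact boundary values such as 70 or 90 is an assumption, since the table does not specify it.

```python
def plddt_band(score):
    """Map a pLDDT score to the confidence bands in the table above."""
    if score > 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"

def low_confidence_segments(plddt, cutoff=50.0):
    """Return (start, end) index ranges (0-based, inclusive) where pLDDT < cutoff."""
    segments = []
    start = None
    for i, score in enumerate(plddt):
        if score < cutoff and start is None:
            start = i                         # a low-confidence run begins
        elif score >= cutoff and start is not None:
            segments.append((start, i - 1))   # the run ended at the previous residue
            start = None
    if start is not None:                     # run extends to the last residue
        segments.append((start, len(plddt) - 1))
    return segments

# Hypothetical per-residue scores for an 8-residue protein.
plddt = [35.2, 41.0, 72.5, 91.3, 95.0, 88.1, 48.9, 30.0]

print([plddt_band(s) for s in plddt])
print(low_confidence_segments(plddt))   # [(0, 1), (6, 7)]
```

The returned segments are candidates for disordered termini or linkers that should be excluded from detailed structural analysis.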

PAE (Predicted Aligned Error)

A 2D matrix showing confidence in relative positions between residues.

        Residue j →
      ┌─────────────────┐
  R   │ ■■■     ░░░     │  ■ = low error (confident)
  e   │ ■■■     ░░░     │  ░ = high error (uncertain)
  s   │                 │
  i   │     ■■■■■       │
  d   │     ■■■■■       │
  u   │                 │
  e   │         ░░░░░░  │
  i ↓ │         ░░░░░░  │
      └─────────────────┘

  • Dark blocks on the diagonal: confident domains
  • Off-diagonal dark blocks: confident domain-domain interactions
  • Light regions: uncertain relative positions (domains may be connected, but their orientation is unknown)

Use PAE for: Determining if domain arrangements are reliable.
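
A minimal sketch of that check, assuming the PAE matrix is available as a NumPy array in Å and that the domain boundaries are already known. The 6-residue matrix and the two 3-residue "domains" are hypothetical.

```python
import numpy as np

def mean_block_pae(pae, range_a, range_b):
    """Mean PAE (Å) between two residue ranges given as (start, end), end exclusive."""
    a = slice(*range_a)
    b = slice(*range_b)
    # PAE is not symmetric, so both blocks are averaged.
    return float((pae[a, b].mean() + pae[b, a].mean()) / 2)

# Hypothetical protein with two 3-residue domains.
pae = np.full((6, 6), 25.0)   # high error everywhere by default
pae[0:3, 0:3] = 2.0           # domain 1 is internally confident
pae[3:6, 3:6] = 3.0           # domain 2 is internally confident

print(mean_block_pae(pae, (0, 3), (0, 3)))   # 2.0
print(mean_block_pae(pae, (0, 3), (3, 6)))   # 25.0
```

Here each domain is well predicted on its own, but the high inter-domain value says their relative orientation should not be trusted.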


Part 3: PDB File Formats

Legacy PDB Format

ATOM      1  N   LYS A   1     -21.816  -8.515  19.632  1.00 41.97
ATOM      2  CA  LYS A   1     -20.532  -9.114  20.100  1.00 41.18

| Column | Meaning |
|---|---|
| ATOM | Record type |
| 1, 2 | Atom serial number |
| N, CA | Atom name |
| LYS | Residue name |
| A | Chain ID |
| 1 | Residue number |
| -21.816, -8.515, 19.632 | X, Y, Z coordinates (Å) |
| 1.00 | Occupancy |
| 41.97 | B-factor |
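
Because the legacy format is fixed-width, fields are read by character position rather than by splitting on whitespace. The sketch below parses the first example line using the standard ATOM column ranges; the function name and dictionary keys are illustrative choices, and for real work a library parser is the safer option.

```python
def parse_atom_line(line):
    """Parse one ATOM/HETATM record using fixed column positions (0-based slices)."""
    return {
        "record": line[0:6].strip(),       # columns 1-6
        "serial": int(line[6:11]),         # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21],              # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
    }

line = "ATOM      1  N   LYS A   1     -21.816  -8.515  19.632  1.00 41.97"
atom = parse_atom_line(line)
print(atom["atom_name"], atom["res_name"], atom["x"], atom["b_factor"])
```

Splitting on whitespace would break on real files, where adjacent fields can touch (for example large negative coordinates) or be blank.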

mmCIF Format

Current standard. More flexible than legacy PDB format:

  • Can handle >99,999 atoms
  • Machine-readable
  • Extensible

The B-factor Column

The B-factor means different things depending on the method:

| Method | B-factor contains | Meaning |
|---|---|---|
| X-ray | Temperature factor | Atomic mobility/disorder |
| NMR | RMSF | Fluctuation across models |
| AlphaFold | pLDDT | Prediction confidence |

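For X-ray: $$B = 8\pi^2 \langle u^2 \rangle$$

Where ⟨u²⟩ is the mean square displacement of the atom from its average position (in Å²). The displacements in the table below are its square root, √⟨u²⟩.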

| B-factor | RMS displacement | Interpretation |
|---|---|---|
| 15 Ų | ~0.44 Å | Rigid |
| 60 Ų | ~0.87 Å | Flexible |
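
The table values follow from inverting the B-factor formula above. A small sketch, assuming an isotropic B-factor (the function name is illustrative):

```python
from math import pi, sqrt

def rms_displacement(b_factor):
    """RMS displacement (Å) from an isotropic B-factor (Å^2): sqrt(B / (8 * pi^2))."""
    return sqrt(b_factor / (8 * pi ** 2))

for b in (15, 60):
    print(b, round(rms_displacement(b), 2))   # 15 -> 0.44, 60 -> 0.87
```

Note that quadrupling the B-factor only doubles the displacement, because B scales with the square of the displacement.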

Part 4: Data Validation

Why Validation Matters

Not all PDB structures are of equal quality. You need to check:

  • Resolution (for X-ray/Cryo-EM)
  • R-factors (for X-ray)
  • Geometry (for all)

Resolution

Most important quality indicator for X-ray and Cryo-EM.

Lower = better. A 1.5 Å structure shows more detail than a 3.0 Å structure.

R-factor (X-ray only)

Measures how well the model fits the experimental data.

$$R = \frac{\sum \big|\, |F_{obs}| - |F_{calc}| \,\big|}{\sum |F_{obs}|}$$

| R-factor | Interpretation |
|---|---|
| <0.20 | Good fit |
| 0.20-0.25 | Acceptable |
| >0.30 | Significant errors likely |

Types of R-factors:

  • R-work: Calculated on data used for refinement
  • R-free: Calculated on test set NOT used for refinement (more honest)

R-free is more reliable. If R-work is much lower than R-free, the model may be overfitted.
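
The R-work versus R-free comparison can be sketched with made-up numbers. Everything below is hypothetical: six reflections, two of them flagged as the held-out test set, and no scale factor between observed and calculated amplitudes (real refinement programs apply one and typically hold out a random few percent of reflections).

```python
import numpy as np

def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|)."""
    f_obs = np.abs(np.asarray(f_obs, dtype=float))
    f_calc = np.abs(np.asarray(f_calc, dtype=float))
    return float(np.sum(np.abs(f_obs - f_calc)) / np.sum(f_obs))

# Hypothetical structure factor amplitudes for six reflections.
f_obs = np.array([100.0, 80.0, 60.0, 40.0, 20.0, 50.0])
f_calc = np.array([90.0, 85.0, 50.0, 45.0, 25.0, 65.0])

# Hypothetical flags: True marks reflections excluded from refinement.
free = np.array([False, False, False, False, True, True])

r_work = r_factor(f_obs[~free], f_calc[~free])
r_free = r_factor(f_obs[free], f_calc[free])
print(round(r_work, 3), round(r_free, 3))   # 0.107 0.286
```

A gap this large between the two values is the pattern that suggests overfitting: the model matches the data it was refined against far better than data it never saw.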

Geometry Validation

| Metric | What it checks |
|---|---|
| Clashscore | Steric clashes between atoms |
| Ramachandran outliers | Unusual backbone angles (φ/ψ) |
| Sidechain outliers | Unusual rotamer conformations |
| RSRZ outliers | Residues that don't fit electron density |

RSRZ: Real Space R-value Z-score

  • Measures fit between residue and electron density
  • RSRZ > 2 = outlier

wwPDB Validation Report

Every PDB entry has a validation report with:

  • Overall quality metrics
  • Chain-by-chain analysis
  • Residue-level indicators
  • Color coding (green = good, red = bad)

Always check the validation report before trusting a structure!


Part 5: Advanced Search in RCSB PDB

Query Builder Categories

  1. Attribute Search

    • Structure attributes (method, resolution, date)
    • Chemical attributes (ligands)
    • Full text
  2. Sequence-based Search

    • Sequence similarity (BLAST)
    • Sequence motif
  3. Structure-based Search

    • 3D shape similarity
    • Structure motif
  4. Chemical Search

    • Ligand similarity

Key Search Fields

| Field | Use for |
|---|---|
| Experimental Method | "X-RAY DIFFRACTION", "ELECTRON MICROSCOPY", "SOLUTION NMR" |
| Data Collection Resolution | X-ray resolution |
| Reconstruction Resolution | Cryo-EM resolution |
| Source Organism | Species |
| UniProt Accession | Link to UniProt |
| Pfam Identifier | Domain family |
| CATH Identifier | Structure classification |
| Reference Sequence Coverage | How much of UniProt sequence is in structure |

Boolean Logic

AND — both conditions must be true
OR  — either condition can be true

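Important: X-ray and Cryo-EM resolutions are stored in different fields. When a query covers both methods, join the two resolution conditions with OR inside parentheses; joining them with AND returns nothing useful, because an entry normally has only one of the two fields.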


Practice Exercises

Exercise 1: Pfam Domain with Organism Filter

Find X-ray structures at resolution ≤2.5 Å, from human or mouse, containing Pfam domain PF00004.

Query:

Experimental Method = "X-RAY DIFFRACTION"
AND Identifier = "PF00004" AND Annotation Type = "Pfam"
AND (Source Organism = "Homo sapiens" OR Source Organism = "Mus musculus")
AND Data Collection Resolution <= 2.5

Answer: 11-50 (15 entries)


Exercise 2: UniProt ID List with Filters

Find X-ray structures for a list of UniProt IDs, with resolution ≤2.2 Å and sequence coverage ≥0.90.

Query:

Accession Code(s) IS ANY OF [list of UniProt IDs]
AND Database Name = "UniProt"
AND Experimental Method = "X-RAY DIFFRACTION"
AND Data Collection Resolution <= 2.2
AND Reference Sequence Coverage >= 0.9

Answer: 501-1000 (811 entries)

Note: "Reference Sequence Coverage" tells you what fraction of the UniProt sequence is present in the PDB structure. Coverage of 0.90 means at least 90% of the protein is in the structure.


Exercise 3: Combining X-ray and Cryo-EM

Find all X-ray structures with resolution ≤2.2 Å AND all Cryo-EM structures with reconstruction resolution ≤2.2 Å.

The tricky part: X-ray uses "Data Collection Resolution" but Cryo-EM uses "Reconstruction Resolution". You need to combine them correctly.

Query:

(Experimental Method = "X-RAY DIFFRACTION" OR Experimental Method = "ELECTRON MICROSCOPY")
AND (Data Collection Resolution <= 2.2 OR Reconstruction Resolution <= 2.2)

Answer: 100001-1000000 (128,107 entries: 127,405 X-ray + 702 EM)

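Why this works: an X-ray entry has no Reconstruction Resolution and an EM entry normally has no Data Collection Resolution, so each hit matches either: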

  • X-ray AND Data Collection Resolution ≤2.2, OR
  • EM AND Reconstruction Resolution ≤2.2
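
The logic can be checked on a handful of toy records. The entries and field names below are hypothetical stand-ins for the two resolution fields; `None` marks a field that does not apply to that method.

```python
# Hypothetical entries; None means the field is not populated for that method.
entries = [
    {"id": "A", "method": "X-RAY DIFFRACTION", "data_collection_res": 1.8, "reconstruction_res": None},
    {"id": "B", "method": "X-RAY DIFFRACTION", "data_collection_res": 3.0, "reconstruction_res": None},
    {"id": "C", "method": "ELECTRON MICROSCOPY", "data_collection_res": None, "reconstruction_res": 2.1},
    {"id": "D", "method": "ELECTRON MICROSCOPY", "data_collection_res": None, "reconstruction_res": 3.4},
    {"id": "E", "method": "SOLUTION NMR", "data_collection_res": None, "reconstruction_res": None},
]

def at_most(value, cutoff):
    """A missing value never satisfies a resolution condition."""
    return value is not None and value <= cutoff

def matches_with_or(entry, cutoff=2.2):
    """(X-ray OR EM) AND (data collection res <= cutoff OR reconstruction res <= cutoff)."""
    method_ok = entry["method"] in ("X-RAY DIFFRACTION", "ELECTRON MICROSCOPY")
    return method_ok and (at_most(entry["data_collection_res"], cutoff)
                          or at_most(entry["reconstruction_res"], cutoff))

def matches_with_and(entry, cutoff=2.2):
    """The mistaken version: both resolution fields required at once."""
    return (at_most(entry["data_collection_res"], cutoff)
            and at_most(entry["reconstruction_res"], cutoff))

print([e["id"] for e in entries if matches_with_or(e)])    # ['A', 'C']
print([e["id"] for e in entries if matches_with_and(e)])   # []
```

Joining the two resolution conditions with AND returns nothing here, because no entry carries both fields.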

Exercise 4: Cryo-EM Quality Filter

Among Cryo-EM structures with resolution ≤2.2 Å, how many have Ramachandran outliers ≤1%?

Query:

Experimental Method = "ELECTRON MICROSCOPY"
AND Reconstruction Resolution <= 2.2
AND Molprobity Percentage Ramachandran Outliers <= 1

Answer: 101-1000 (687 out of 702 total)

This tells you that most high-resolution Cryo-EM structures (687 of 702, about 98%) have good backbone geometry.


Query Building Tips

1. Use the Right Resolution Field

| Method | Resolution Field |
|---|---|
| X-ray | Data Collection Resolution |
| Cryo-EM | Reconstruction Resolution |
| NMR | N/A (no resolution) |

2. Experimental Method Exact Names

Use exactly:

  • "X-RAY DIFFRACTION" (not "X-ray" or "crystallography")
  • "ELECTRON MICROSCOPY" (not "Cryo-EM" or "EM")
  • "SOLUTION NMR" (not just "NMR")

3. Organism Names

Use full taxonomic name:

  • "Homo sapiens" (not "human")
  • "Mus musculus" (not "mouse")
  • "Rattus norvegicus" (not "rat")

4. UniProt Queries

When searching by UniProt ID, specify:

Accession Code = [ID] AND Database Name = "UniProt"

5. Combining OR Conditions

Always put OR conditions in parentheses:

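(Source Organism = "Homo sapiens" OR Source Organism = "Mus musculus")

Otherwise the conditions may be grouped differently from what you intended. For example, A AND B OR C can be read as (A AND B) OR C, which also returns every entry matching C alone.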
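
In a programmatic query the parentheses become explicit nesting. The sketch below builds a JSON payload in the general shape used by the RCSB Search API (nested "group" nodes containing "terminal" nodes). Treat the details as assumptions to verify against the current RCSB documentation: the attribute names `exptl.method` and `rcsb_entity_source_organism.taxonomy_lineage.name`, the `exact_match` operator, and the helper function names are not taken from these notes. Nothing is sent over the network.

```python
import json

def terminal(attribute, operator, value):
    """One search condition (a leaf of the query tree)."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }

def group(logical_operator, *nodes):
    """A parenthesized group: all child nodes joined by 'and' or 'or'."""
    return {"type": "group", "logical_operator": logical_operator, "nodes": list(nodes)}

# Assumed attribute names; check them against the RCSB search schema before use.
METHOD = "exptl.method"
ORGANISM = "rcsb_entity_source_organism.taxonomy_lineage.name"

# Method = X-ray AND (Organism = human OR Organism = mouse)
query = group(
    "and",
    terminal(METHOD, "exact_match", "X-RAY DIFFRACTION"),
    group(
        "or",
        terminal(ORGANISM, "exact_match", "Homo sapiens"),
        terminal(ORGANISM, "exact_match", "Mus musculus"),
    ),
)

payload = {"query": query, "return_type": "entry"}
print(json.dumps(payload, indent=2))
```

Because the OR conditions live in their own nested group, there is no precedence to get wrong: the tree structure is the parenthesization.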


What to Check When Using a PDB Structure

  1. Experimental method — X-ray? NMR? Cryo-EM?
  2. Resolution — <2.5 Å is generally good for most purposes
  3. R-factors — R-free should be reasonable for the resolution
  4. Validation report — check for outliers in your region of interest
  5. Sequence coverage — does the structure include the region you care about?
  6. Ligands/cofactors — are they present? Are they what you expect?

Comparing Experimental vs AlphaFold Structures

When AlphaFold structures are available:

| Check | Experimental | AlphaFold |
|---|---|---|
| Overall reliability | Resolution, R-factor | pLDDT, PAE |
| Local confidence | B-factor (flexibility) | pLDDT (prediction confidence) |
| Disordered regions | Often missing | Low pLDDT (<50) |
| Ligand binding sites | Can have ligands | No ligands |
| Protein-protein interfaces | Shown in complex structures | Not reliable unless AlphaFold-Multimer |

Key insight: Low-confidence AlphaFold regions often correspond to regions missing in experimental structures — both are telling you the same thing (disorder/flexibility).


Quick Reference

PDB Quality Indicators

| Indicator | Good value | Bad value |
|---|---|---|
| Resolution | <2.5 Å | >3.5 Å |
| R-free | <0.25 | >0.30 |
| Ramachandran outliers | <1% | >5% |
| Clashscore | <5 | >20 |

AlphaFold Confidence

| pLDDT | Meaning |
|---|---|
| >90 | Very confident, analyze details |
| 70-90 | Confident backbone |
| 50-70 | Low confidence |
| <50 | Likely disordered |

Search Field Cheatsheet

| What you want | Field to use |
|---|---|
| X-ray resolution | Data Collection Resolution |
| Cryo-EM resolution | Reconstruction Resolution |
| Species | Source Organism Taxonomy Name |
| UniProt link | Accession Code + Database Name = "UniProt" |
| Pfam domain | Identifier + Annotation Type = "Pfam" |
| CATH superfamily | Lineage Identifier (CATH) |
| Coverage | Reference Sequence Coverage |
| Geometry quality | Molprobity Percentage Ramachandran Outliers |

For the Oral Exam

Be prepared to explain:

  1. Why crystallography needs crystals — signal amplification from ordered molecular packing

  2. The phase problem — you measure amplitudes but lose phases; must determine indirectly

  3. What resolution means — ability to distinguish fine details; limited by crystal order

  4. Why Cryo-EM grew so fast — no crystals needed, good for large complexes, computational advances

  5. NMR gives ensembles, not single structures — restraints satisfied by multiple conformations

  6. What pLDDT means — local prediction confidence, stored in B-factor column

  7. Difference between pLDDT and PAE — pLDDT is local confidence, PAE is relative domain positioning

  8. How to assess structure quality — resolution, R-factors, validation metrics

  9. B-factor means different things — mobility (X-ray), fluctuation (NMR), confidence (AlphaFold)

  10. How to construct complex PDB queries — combining method, resolution, organism, domain annotations