Everything

What is a database? A database is a large, structured set of persistent data, usually in computer-readable form.

A DBMS is a software package that enables users:

  • to access the data
  • to manipulate (create, edit, link, update) files as needed
  • to preserve the integrity of the data
  • to deal with security issues (who should have access)

PubMed/MeSH

PubMed comprises more than 39 million citations for biomedical literature from MEDLINE, life science journals, and online books.

MeSH database (Medical Subject Headings) – controlled vocabulary thesaurus

Queries are simple to write, but pay close attention to OR, AND, and parentheses '()'. Read the question carefully to work out the correct Boolean grouping and the correct MeSH terms.
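A minimal sketch of why the parentheses matter, using Python sets as stand-ins for search results. The citation IDs are invented for illustration, and the query strings assume the PubMed [MeSH Terms] field tag; check the MeSH database for the exact headings you need.

```python
# Hypothetical sets of citation IDs returned by three single-term searches.
aspirin = {1, 2, 3, 4}
stroke = {3, 4, 5}
migraine = {4, 6, 7}

# "Aspirin"[MeSH Terms] AND ("Stroke"[MeSH Terms] OR "Migraine Disorders"[MeSH Terms])
# Aspirin papers that are about either condition.
grouped = aspirin & (stroke | migraine)

# ("Aspirin"[MeSH Terms] AND "Stroke"[MeSH Terms]) OR "Migraine Disorders"[MeSH Terms]
# Every migraine paper is included, whether or not it mentions aspirin.
ungrouped = (aspirin & stroke) | migraine

print(sorted(grouped))    # [3, 4]
print(sorted(ungrouped))  # [3, 4, 6, 7]
```

The same three terms give different result sets depending only on where the parentheses sit, which is the usual trap in exam queries.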

PDB

Definition: What is PDB? PDB (Protein Data Bank) is the main global database that stores 3D structures of proteins, DNA, RNA, and their complexes.

How is experimental structure data obtained? (3 methods)

  1. X-ray Crystallography (88%): uses crystals + X-ray diffraction to map atomic positions.
  2. NMR Spectroscopy (10%): uses magnetic fields to determine structures in solution.
  3. Cryo-Electron Microscopy (Cryo-EM) (1%): images samples frozen in ice with an electron microscope.

What is Resolution (Å)? Resolution (in Ångström) measures the level of detail; smaller value = sharper, more accurate structure.

SIFTS (Structure Integration with Function, Taxonomy and Sequence) provides residue-level mapping between:

  1. PDB entries ↔ UniProt sequences
  2. Connections to: GO, InterPro, Pfam, CATH, SCOP, PubMed, Ensembl

This is how you can search PDB by Pfam domain or UniProt ID.

Method Comparison Summary

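Feature           | X-ray            | Cryo-EM       | NMR
------------------+------------------+---------------+-----------
Sample            | Crystal required | Frozen in ice | Solution
Size limit        | None             | >50 kDa       | <50-70 kDa
Resolution        | Can be <1 Å      | Rarely <2.2 Å | N/A
Dynamics          | No               | Limited       | Yes
Multiple states   | Difficult        | Yes           | Yes
Membrane proteins | Difficult        | Good          | Limited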

AlphaFold

What is AlphaFold? A deep learning system that predicts protein structure from the amino acid sequence.

At CASP14 (2020), AlphaFold2 scored ~92 GDT (Global Distance Test).

AlphaFold2 essentially solved the structure prediction problem for single protein domains.

pLDDT (predicted Local Distance Difference Test): Stored in the B-factor column of AlphaFold PDB files.

What pLDDT measures: confidence in local structure (not the global fold), on a scale of 0 to 100.

Use pLDDT to:

  1. Identify structured domains vs disordered regions
  2. Decide which parts to trust
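Because pLDDT sits in the B-factor column, it can be read with plain fixed-column slicing. The sketch below assumes the legacy PDB ATOM record layout (B-factor in columns 61-66) and the commonly quoted confidence bands (>90, 70-90, 50-70, <50). The sample records are generated in the code and are not from a real AlphaFold model; `make_atom_line`, `plddt_per_residue` and `confidence_band` are illustrative helper names.

```python
def make_atom_line(serial, resseq, plddt):
    """Build a sample CA ATOM record with legacy PDB column widths (fake data)."""
    return (f"ATOM  {serial:5d}  CA  ALA A{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


def plddt_per_residue(pdb_lines):
    """Return {residue number: pLDDT}, taking one value per residue from its CA atom."""
    scores = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resseq = int(line[22:26])               # columns 23-26: residue number
            scores[resseq] = float(line[60:66])     # columns 61-66: B-factor field
    return scores


def confidence_band(plddt):
    """Map a pLDDT value to a confidence label (assumed band boundaries)."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


sample = [make_atom_line(i, i, p) for i, p in enumerate([95.1, 82.4, 61.0, 34.7], start=1)]
scores = plddt_per_residue(sample)
for resseq, value in scores.items():
    print(resseq, value, confidence_band(value))
```

Residues that land in the "very low" band are the ones most likely to be disordered and least worth interpreting.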

PAE (Predicted Aligned Error):

  • Dark blocks on the diagonal: confident domains
  • Off-diagonal dark blocks: confident domain-domain interactions
  • Light regions: uncertain relative positions (domains may be connected but orientation unknown)

Use PAE for: Determining if domain arrangements are reliable.
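A small sketch of reading a PAE matrix block by block. The 6x6 matrix is made up (values in Å; residues 0-2 are treated as domain A and 3-5 as domain B), and `mean_pae` is a hypothetical helper, not part of any AlphaFold tooling.

```python
def mean_pae(pae, rows, cols):
    """Average predicted aligned error over one block of the matrix."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


domain_a = range(0, 3)
domain_b = range(3, 6)

# Fake matrix: low error inside each domain, high error between domains.
pae = [[2.0 if (i < 3) == (j < 3) else 20.0 for j in range(6)] for i in range(6)]

within_a = mean_pae(pae, domain_a, domain_a)   # diagonal block
between = mean_pae(pae, domain_a, domain_b)    # off-diagonal block

print(within_a, between)
```

Here each domain is predicted confidently on its own (low error on the diagonal), but the high off-diagonal error means the relative orientation of A and B should not be trusted.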

PDB file formats: legacy PDB format and mmCIF (the current standard)

The B-factor Column

The B-factor means different things depending on the method:

Method    | B-factor contains  | Meaning
----------+--------------------+--------------------------
X-ray     | Temperature factor | Atomic mobility/disorder
NMR       | RMSF               | Fluctuation across models
AlphaFold | pLDDT              | Prediction confidence

When validating a structure, check:

  1. Resolution (for X-ray/Cryo-EM)
  2. R-factors (for X-ray)
  3. Geometry (for all)

R-factor (X-ray only): Measures how well the model fits the experimental data. A value below 0.20 indicates a good fit.

Types of R-factors:

  1. R-work: Calculated on data used for refinement
  2. R-free: Calculated on test set NOT used for refinement (more honest)

R-free is more reliable. If R-work is much lower than R-free, the model may be overfitted.
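As a rough illustration, the standard definition R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs| can be computed separately on the working set and the held-out test set. The structure-factor amplitudes below are invented, and `r_factor` is an illustrative helper name.

```python
def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|) over a set of reflections."""
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


# Fake amplitudes: reflections used in refinement (work) and held out (free).
work_obs, work_calc = [100.0, 200.0, 300.0, 400.0], [80.0, 220.0, 260.0, 500.0]
free_obs, free_calc = [150.0, 250.0], [120.0, 310.0]

r_work = r_factor(work_obs, work_calc)
r_free = r_factor(free_obs, free_calc)
gap = r_free - r_work

print(f"R-work={r_work:.3f}  R-free={r_free:.3f}  gap={gap:.3f}")
```

A small gap suggests the model generalises to data it was not refined against; a large gap is the overfitting warning sign described above.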

Data Validation:

  1. Resolution
  2. Geometry
  3. R-Factor

Key Search Fields

Field                       | Use for
----------------------------+-----------------------------------------------------------
Experimental Method         | "X-RAY DIFFRACTION", "ELECTRON MICROSCOPY", "SOLUTION NMR"
Data Collection Resolution  | X-ray resolution
Reconstruction Resolution   | Cryo-EM resolution
Source Organism             | Species
UniProt Accession           | Link to UniProt
Pfam Identifier             | Domain family
CATH Identifier             | Structure classification
Reference Sequence Coverage | How much of UniProt sequence is in structure
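The same fields can be combined programmatically. The sketch below only builds a JSON query in the style of the RCSB Search API and does not send it. The attribute names (`exptl.method`, `rcsb_entry_info.resolution_combined`, `rcsb_entity_source_organism.taxonomy_lineage.name`) and operators are assumptions that should be checked against the RCSB Search API documentation; `terminal` and `build_pdb_query` are hypothetical helpers.

```python
import json


def terminal(attribute, operator, value):
    """One search condition on a single field (assumed RCSB 'text' service shape)."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def build_pdb_query(method, max_resolution, organism):
    """AND together method, resolution cutoff and source organism."""
    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [
                terminal("exptl.method", "exact_match", method),
                terminal("rcsb_entry_info.resolution_combined", "less_or_equal", max_resolution),
                terminal("rcsb_entity_source_organism.taxonomy_lineage.name", "exact_match", organism),
            ],
        },
        "return_type": "entry",
    }


query = build_pdb_query("X-RAY DIFFRACTION", 2.0, "Homo sapiens")
print(json.dumps(query, indent=2))
```

Each extra condition (Pfam identifier, UniProt accession, coverage) would be one more node in the same group.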

Comparing Experimental vs AlphaFold Structures

When AlphaFold structures are available:

Check                      | Experimental                | AlphaFold
---------------------------+-----------------------------+---------------------------------------
Overall reliability        | Resolution, R-factor        | pLDDT, PAE
Local confidence           | B-factor (flexibility)      | pLDDT (prediction confidence)
Disordered regions         | Often missing               | Low pLDDT (<50)
Ligand binding sites       | Can have ligands            | No ligands
Protein-protein interfaces | Shown in complex structures | Not reliable unless AlphaFold-Multimer

Key insight: Low-confidence AlphaFold regions often correspond to regions missing in experimental structures — both are telling you the same thing (disorder/flexibility).

For the Oral Exam

Be prepared to explain:

  1. Why crystallography needs crystals — signal amplification from ordered molecular packing

  2. The phase problem — you measure amplitudes but lose phases; must determine indirectly

  3. What resolution means — ability to distinguish fine details; limited by crystal order

  4. Why Cryo-EM grew so fast — no crystals needed, good for large complexes, computational advances

  5. NMR gives ensembles, not single structures — restraints satisfied by multiple conformations

  6. What pLDDT means — local prediction confidence, stored in B-factor column

  7. Difference between pLDDT and PAE — pLDDT is local confidence, PAE is relative domain positioning

  8. How to assess structure quality — resolution, R-factors, validation metrics

  9. B-factor means different things — mobility (X-ray), fluctuation (NMR), confidence (AlphaFold)

  10. How to construct complex PDB queries — combining method, resolution, organism, domain annotations

UniProt

What it gives you:

  1. Protein sequences and functions
  2. Domains, families, PTMs
  3. Disease associations and variants
  4. Subcellular localization
  5. Cross-references to 180+ external databases
  6. Proteomes for complete organisms
  7. BLAST, Align, ID mapping tools
                    UniProt
                       │
       ┌───────────────┼───────────────┐
       │               │               │
   UniProtKB        UniRef         UniParc
   (Knowledge)    (Clusters)      (Archive)
       │
   ┌───┴───┐
   │       │
Swiss-Prot TrEMBL
(Reviewed) (Unreviewed)

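UniProt classifies how confident we are that a protein actually exists, using five protein existence (PE) levels: 1 = evidence at protein level, 2 = evidence at transcript level, 3 = inferred from homology, 4 = predicted, 5 = uncertain. Query syntax: existence:1 (for protein-level evidence).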

It also has ID Mapping: Convert between ID systems

TL;DR

  • UniProt = protein database = Swiss-Prot (reviewed, high quality) + TrEMBL (unreviewed, comprehensive)
  • Always add reviewed:true when you need reliable annotations
  • Query syntax: field:value with AND, OR, NOT
  • Use parentheses to group OR conditions properly
  • Common fields: organism_id, ec, reviewed, existence, database, proteome, go
  • Wildcards: Use * for EC numbers (e.g., ec:3.4.21.*)
  • Protein existence: Level 1 = experimental evidence, Level 5 = uncertain
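A minimal sketch of assembling such a query string, with OR conditions wrapped in parentheses before they are combined with AND. `any_of` and `all_of` are hypothetical helpers; the URL assumes the UniProt REST search endpoint at rest.uniprot.org with `query` and `format` parameters, and nothing is actually requested.

```python
from urllib.parse import urlencode


def any_of(field, values):
    """Group alternatives for one field: (field:a OR field:b)."""
    return "(" + " OR ".join(f"{field}:{v}" for v in values) + ")"


def all_of(*clauses):
    """Require every clause: a AND b AND c."""
    return " AND ".join(clauses)


# Reviewed human or mouse serine proteases (EC 3.4.21.*) with protein-level evidence.
query = all_of(
    any_of("organism_id", [9606, 10090]),
    "reviewed:true",
    "ec:3.4.21.*",
    "existence:1",
)
url = "https://rest.uniprot.org/uniprotkb/search?" + urlencode({"query": query, "format": "tsv"})

print(query)
print(url)
```

Without the parentheses from `any_of`, the mouse alternative would no longer be tied to the other conditions in the way the query intends.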

NCBI

What is NCBI? National Center for Biotechnology Information — created in 1988 as part of the National Library of Medicine (NLM) at NIH, Bethesda, Maryland.

What it gives you:

  1. GenBank (primary nucleotide sequences)
  2. RefSeq (curated reference sequences)
  3. Gene database (gene-centric information)
  4. PubMed (literature)
  5. dbSNP, ClinVar, OMIM (variants & clinical)
  6. BLAST (sequence alignment)
  7. And ~40 more databases, all cross-linked

TL;DR

  • NCBI = US hub for biological databases (GenBank, RefSeq, Gene, PubMed, etc.)
  • GenBank = primary archive (raw submissions) vs RefSeq = curated reference (cleaned up)
  • RefSeq prefixes: NM/NP = curated, XM/XP = predicted — prefer N* for reliable analysis
  • Boolean operators MUST be UPPERCASE: AND, OR, NOT
  • Use quotes around multi-word terms: "homo sapiens"[Organism]
  • Gene database = best starting point for gene-centric searches
  • Properties = what it IS, Filters = what it's LINKED to
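Two of the points above, sketched in code: sorting RefSeq accessions by prefix, and writing an Entrez-style term with an uppercase operator and a quoted multi-word organism. Only the four prefixes listed in these notes are covered, the accessions are format examples only, and the [Gene Name] field tag plus both helper names are assumptions for illustration.

```python
# Prefix -> (molecule type, status), limited to the prefixes listed in the notes.
REFSEQ_PREFIXES = {
    "NM": ("mRNA", "curated"),
    "NP": ("protein", "curated"),
    "XM": ("mRNA", "predicted"),
    "XP": ("protein", "predicted"),
}


def classify_refseq(accession):
    """Return (molecule, status) for a RefSeq accession, or None if the prefix is not listed."""
    prefix = accession.split("_", 1)[0].upper()
    return REFSEQ_PREFIXES.get(prefix)


def entrez_term(organism, gene):
    """Quote the multi-word organism and join the clauses with an uppercase AND."""
    return f'"{organism}"[Organism] AND {gene}[Gene Name]'


for acc in ["NM_000001.1", "XP_000002.1", "AB000003.1"]:
    print(acc, classify_refseq(acc))

print(entrez_term("homo sapiens", "BRCA1"))
```

Filtering a result list down to accessions whose status is "curated" is a quick way to apply the "prefer N*" rule.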

Ensembl

Ensembl is a genome browser and database launched in 1999 as a joint project of the EBI (European Bioinformatics Institute) and the Wellcome Trust Sanger Institute. Think of it as Google Maps, but for genomes.

What it gives you:

  1. Gene sets (splice variants, proteins, ncRNAs)
  2. Comparative genomics (alignments, protein trees, orthologues)
  3. Variation data (SNPs, InDels, CNVs)
  4. BioMart for bulk data export
  5. REST API for programmatic access
  6. Everything is open source

BioMart: Bulk Data Queries

Workflow Example: ID Conversion

Goal: Convert RefSeq protein IDs to Ensembl Gene IDs
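The same conversion can be expressed as a BioMart XML query. The sketch below only builds the XML string and does not submit it. The internal names used (`hsapiens_gene_ensembl` dataset, `refseq_peptide` filter, `ensembl_gene_id` attribute) are assumptions to confirm with the XML button in the BioMart web interface, the input IDs are placeholders, and `biomart_query_xml` is a hypothetical helper.

```python
from xml.sax.saxutils import quoteattr


def biomart_query_xml(dataset, filter_name, ids, attributes):
    """Build a BioMart query: one dataset, one ID-list filter, several output attributes."""
    attribute_tags = "".join(f"<Attribute name={quoteattr(a)} />" for a in attributes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="TSV" header="1" '
        'uniqueRows="1" datasetConfigVersion="0.6">'
        f'<Dataset name={quoteattr(dataset)} interface="default">'
        f'<Filter name={quoteattr(filter_name)} value={quoteattr(",".join(ids))} />'
        f"{attribute_tags}"
        "</Dataset></Query>"
    )


xml = biomart_query_xml(
    dataset="hsapiens_gene_ensembl",                  # species of the INPUT IDs
    filter_name="refseq_peptide",                     # external filter, not "Gene stable ID"
    ids=["NP_000001", "NP_000002"],                   # placeholder RefSeq protein IDs
    attributes=["ensembl_gene_id", "refseq_peptide"], # keep the filter column in the output
)
print(xml)
```

Including `refseq_peptide` among the attributes keeps the input ID next to each Ensembl gene ID, so the mapping can be read back row by row.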

TL;DR

  • Ensembl = genome browser + database for genes, transcripts, variants, orthologues
  • IDs: ENSG (gene), ENST (transcript), ENSP (protein) — learn to recognize them
  • MANE Select = highest quality transcript annotation (use these when possible)
  • BioMart = bulk query tool: Dataset → Filters → Attributes → Export
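Recognising the ID types can be sketched with one regular expression. This assumes the usual pattern of "ENS", an optional species code (for example MUS for mouse), a type letter, and 11 digits, and it covers only the three types listed above. The IDs in the example are for format illustration only, and `classify_ensembl_id` is a hypothetical helper.

```python
import re

ENSEMBL_TYPES = {"G": "gene", "T": "transcript", "P": "protein"}

# ENS + optional species letters + type letter + 11 digits + optional ".version"
ENSEMBL_ID = re.compile(r"^ENS([A-Z]*)([GTP])(\d{11})(\.\d+)?$")


def classify_ensembl_id(identifier):
    """Return 'gene', 'transcript' or 'protein', or None if it is not an Ensembl-style ID."""
    match = ENSEMBL_ID.match(identifier)
    if match is None:
        return None
    return ENSEMBL_TYPES[match.group(2)]


for example in ["ENSG00000000001", "ENST00000000002.3", "ENSMUSP00000000003", "NP_000001"]:
    print(example, classify_ensembl_id(example))
```

A check like this catches the common mistake of pasting RefSeq or UniProt IDs into a field that expects Ensembl stable IDs.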

Avoid these mistakes:

  1. Don't paste RefSeq/UniProt IDs in "Gene stable ID" field — use EXTERNAL filters
  2. Use the text input field, not just checkboxes
  3. Orthologue = cross-species, Paralogue = same species
  4. Start with the species of your INPUT IDs as your dataset
  5. Always include your filter column in output attributes