Everything
What is a database? A database is a large, structured set of persistent data, usually in computer-readable form.
A DBMS (database management system) is a software package that enables users:
- to access the data
- to manipulate (create, edit, link, update) files as needed
- to preserve the integrity of the data
- to deal with security issues (who should have access)
PubMed/MeSH
PubMed comprises more than 39 million citations for biomedical literature from MEDLINE, life science journals, and online books.
MeSH database (Medical Subject Headings) – controlled vocabulary thesaurus
Queries are simple, but be careful with OR, AND, and parentheses '()'. Read the question closely to work out the correct Boolean structure and the correct MeSH terms.
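To see why the parentheses matter, here is a minimal Python sketch that assembles a PubMed-style query string. The helper names `or_group` and `and_join` are hypothetical, and the MeSH headings are only examples. PubMed evaluates Boolean operators left to right unless parentheses group them, so the OR alternatives are wrapped explicitly.

```python
def or_group(terms, field="MeSH Terms"):
    """Tag each term with a field and join the alternatives with OR in parentheses."""
    tagged = [f'"{term}"[{field}]' for term in terms]
    return "(" + " OR ".join(tagged) + ")"


def and_join(groups):
    """Combine already-parenthesised groups with AND."""
    return " AND ".join(groups)


# Example headings only; swap in the MeSH terms your question actually needs.
query = and_join([
    or_group(["Neoplasms", "Leukemia"]),
    or_group(["Aspirin"]),
])
print(query)
```

Without the parentheses, `A OR B AND C` would be read as `(A OR B) AND C` only by accident of ordering; grouping makes the intent explicit.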
PDB
Definition: What is PDB? PDB (Protein Data Bank) is the main global database that stores 3D structures of proteins, DNA, RNA, and their complexes.
How is experimental structure data obtained? (3 methods)
- X-ray Crystallography (88%): uses crystals + X-ray diffraction to map atomic positions.
- NMR Spectroscopy (10%): uses magnetic fields to determine structures in solution.
- Cryo-Electron Microscopy (Cryo-EM) (1%): images molecules frozen in a thin layer of ice with an electron beam.
What is Resolution (Å)? Resolution (in Ångström) measures the level of detail; smaller value = sharper, more accurate structure.
SIFTS (Structure Integration with Function, Taxonomy and Sequence) provides residue-level mapping between:
- PDB entries ↔ UniProt sequences
- Connections to: GO, InterPro, Pfam, CATH, SCOP, PubMed, Ensembl
This is how you can search PDB by Pfam domain or UniProt ID.
Method Comparison Summary
| Feature | X-ray | Cryo-EM | NMR |
|---|---|---|---|
| Sample | Crystal required | Frozen in ice | Solution |
| Size limit | None | >50 kDa | <50-70 kDa |
| Resolution | Can be <1 Å | Rarely <2.2 Å | N/A |
| Dynamics | No | Limited | Yes |
| Multiple states | Difficult | Yes | Yes |
| Membrane proteins | Difficult | Good | Limited |
AlphaFold
What is AlphaFold? A deep learning system that predicts protein structure from an amino acid sequence.
At CASP14 (2020), AlphaFold2 scored ~92 GDT (Global Distance Test).
AlphaFold essentially solved the structure prediction side of the protein folding problem for single domains.
pLDDT (predicted Local Distance Difference Test): a per-residue confidence score from 0 to 100, stored in the B-factor column of AlphaFold PDB files.
What pLDDT measures: Confidence in local structure (not global fold).
Use pLDDT to:
- Identify structured domains vs disordered regions
- Decide which parts to trust
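As a rough sketch of how to read pLDDT out of the B-factor column, the Python below parses the fixed-width ATOM records of a legacy PDB file (B-factor in columns 61-66) and labels each residue. The function names are hypothetical, `ca_line` only fabricates demo records with zeroed coordinates, and the band cut-offs follow the usual AlphaFold Database colour scheme (90 / 70 / 50); how exact boundary values are assigned is an assumption.

```python
def ca_line(res_seq, plddt):
    """Format a minimal CA ATOM record (demo data only, coordinates zeroed)."""
    return (f"ATOM  {res_seq:>5}  CA  ALA A{res_seq:>4}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


def plddt_per_residue(pdb_text):
    """Return {(chain, residue_number): pLDDT} taken from CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, 22 the chain, 23-26 the residue number.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            res_seq = int(line[22:26])
            scores[(chain, res_seq)] = float(line[60:66])  # B-factor column
    return scores


def confidence_band(plddt):
    """Map a pLDDT value to a confidence label (boundary handling is assumed)."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low (often disordered)"


demo = "\n".join(ca_line(i, s) for i, s in [(1, 95.2), (2, 74.0), (3, 38.5)])
for (chain, res), score in plddt_per_residue(demo).items():
    print(chain, res, score, confidence_band(score))
```

For a real AlphaFold model you would pass the file contents instead of `demo`; mmCIF files need a proper parser rather than column slicing.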
PAE (Predicted Aligned Error):
- Dark blocks on the diagonal: confident domains
- Off-diagonal dark blocks: confident domain-domain interactions
- Light regions: uncertain relative positions (domains may be connected but orientation unknown)
Use PAE for: Determining if domain arrangements are reliable.
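A toy illustration of reading a PAE matrix numerically: average the error inside one domain (a diagonal block) and between two domains (the off-diagonal blocks). The matrix values, the domain boundaries, and the 5 Å cut-off are all made up for the example, and `mean_pae` is a hypothetical helper.

```python
def mean_pae(pae, rows, cols):
    """Average predicted aligned error (in Å) over one block of the matrix."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


# 4-residue toy matrix: residues 0-1 form domain A, residues 2-3 form domain B.
pae = [
    [0.5, 2.0, 25.0, 27.0],
    [2.0, 0.5, 26.0, 24.0],
    [24.0, 26.0, 0.5, 2.0],
    [27.0, 25.0, 2.0, 0.5],
]
domain_a = range(0, 2)
domain_b = range(2, 4)

within_a = mean_pae(pae, domain_a, domain_a)
# PAE is not necessarily symmetric, so average both off-diagonal blocks.
between = (mean_pae(pae, domain_a, domain_b) + mean_pae(pae, domain_b, domain_a)) / 2

CUTOFF = 5.0  # assumed threshold, not an official value
print(f"within A: {within_a:.2f} Å, between A and B: {between:.2f} Å")
print("relative orientation trusted:", between < CUTOFF)
```

Here the low within-domain value corresponds to a dark diagonal block, and the high between-domain value to a light off-diagonal region.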
PDB file formats: the legacy PDB format and mmCIF (the current standard)
The B-factor Column
The B-factor means different things depending on the method:
| Method | B-factor contains | Meaning |
|---|---|---|
| X-ray | Temperature factor | Atomic mobility/disorder |
| NMR | RMSF | Fluctuation across models |
| AlphaFold | pLDDT | Prediction confidence |
When validating a structure, check:
- Resolution (for X-ray/Cryo-EM)
- R-factors (for X-ray)
- Geometry (for all)
R-factor (X-ray only): Measures how well the model fits the experimental data. A value below 0.20 indicates a good fit.
Types of R-factors:
- R-work: Calculated on data used for refinement
- R-free: Calculated on test set NOT used for refinement (more honest)
R-free is more reliable. If R-work is much lower than R-free, the model may be overfitted.
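The R-factor is the summed mismatch between observed and calculated structure-factor amplitudes, divided by the summed observed amplitudes. The sketch below computes it for a made-up "work" set and a made-up "free" set; the amplitudes are invented, scaling between observed and calculated data is ignored, and the 0.05 gap used to flag possible overfitting is an assumed rule of thumb rather than a fixed standard.

```python
def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|) over paired reflections."""
    if len(f_obs) != len(f_calc):
        raise ValueError("need one calculated amplitude per observed amplitude")
    mismatch = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    return mismatch / sum(abs(o) for o in f_obs)


# Invented amplitudes: reflections used in refinement vs. the held-out test set.
work_obs, work_calc = [100.0, 80.0, 60.0, 40.0], [90.0, 85.0, 55.0, 45.0]
free_obs, free_calc = [50.0, 30.0, 20.0], [40.0, 36.0, 24.0]

r_work = r_factor(work_obs, work_calc)
r_free = r_factor(free_obs, free_calc)
gap = r_free - r_work

MAX_GAP = 0.05  # assumed rule of thumb
possibly_overfitted = gap > MAX_GAP
print(f"R-work={r_work:.3f}  R-free={r_free:.3f}  gap={gap:.3f}")
print("possibly overfitted:", possibly_overfitted)
```

Because the free set is never used during refinement, a model that only memorises the work set scores well on R-work but not on R-free, which is what the gap picks up.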
Data Validation:
- Resolution
- Geometry
- R-Factor
Key Search Fields
| Field | Use for |
|---|---|
| Experimental Method | "X-RAY DIFFRACTION", "ELECTRON MICROSCOPY", "SOLUTION NMR" |
| Data Collection Resolution | X-ray resolution |
| Reconstruction Resolution | Cryo-EM resolution |
| Source Organism | Species |
| UniProt Accession | Link to UniProt |
| Pfam Identifier | Domain family |
| CATH Identifier | Structure classification |
| Reference Sequence Coverage | How much of UniProt sequence is in structure |
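Combining search fields is just AND logic over several criteria. The sketch below mimics that on a local list of invented records; it does not call the real PDB search service, and the record keys, entry IDs, and the `matches` helper are all hypothetical. Only the method strings and the Pfam accessions (PF00069 protein kinase domain, PF00018 SH3 domain) are real values.

```python
# Invented entries standing in for search results.
entries = [
    {"id": "demo1", "method": "X-RAY DIFFRACTION", "resolution": 1.8,
     "organism": "Homo sapiens", "pfam": ["PF00069"]},
    {"id": "demo2", "method": "ELECTRON MICROSCOPY", "resolution": 3.4,
     "organism": "Homo sapiens", "pfam": ["PF00069"]},
    {"id": "demo3", "method": "X-RAY DIFFRACTION", "resolution": 2.9,
     "organism": "Mus musculus", "pfam": ["PF00069"]},
    {"id": "demo4", "method": "SOLUTION NMR", "resolution": None,
     "organism": "Homo sapiens", "pfam": ["PF00018"]},
]


def matches(entry, method=None, max_resolution=None, organism=None, pfam=None):
    """Return True only if the entry satisfies every criterion that was given."""
    if method is not None and entry["method"] != method:
        return False
    if max_resolution is not None:
        # NMR entries carry no resolution, so they never pass a resolution filter.
        if entry["resolution"] is None or entry["resolution"] > max_resolution:
            return False
    if organism is not None and entry["organism"] != organism:
        return False
    if pfam is not None and pfam not in entry["pfam"]:
        return False
    return True


hits = [e["id"] for e in entries
        if matches(e, method="X-RAY DIFFRACTION", max_resolution=2.0,
                   organism="Homo sapiens", pfam="PF00069")]
print(hits)
```

Each extra criterion can only shrink the result set, which is a handy sanity check when a complex query returns more hits than a simpler one.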
Comparing Experimental vs AlphaFold Structures
When AlphaFold structures are available:
| Check | Experimental | AlphaFold |
|---|---|---|
| Overall reliability | Resolution, R-factor | pLDDT, PAE |
| Local confidence | B-factor (flexibility) | pLDDT (prediction confidence) |
| Disordered regions | Often missing | Low pLDDT (<50) |
| Ligand binding sites | Can have ligands | No ligands |
| Protein-protein interfaces | Shown in complex structures | Not reliable unless AlphaFold-Multimer |
Key insight: Low-confidence AlphaFold regions often correspond to regions missing in experimental structures — both are telling you the same thing (disorder/flexibility).
For the Oral Exam
Be prepared to explain:
- Why crystallography needs crystals: signal amplification from ordered molecular packing
- The phase problem: you measure amplitudes but lose phases; must determine indirectly
- What resolution means: ability to distinguish fine details; limited by crystal order
- Why Cryo-EM grew so fast: no crystals needed, good for large complexes, computational advances
- NMR gives ensembles, not single structures: restraints satisfied by multiple conformations
- What pLDDT means: local prediction confidence, stored in B-factor column
- Difference between pLDDT and PAE: pLDDT is local confidence, PAE is relative domain positioning
- How to assess structure quality: resolution, R-factors, validation metrics
- B-factor means different things: mobility (X-ray), fluctuation (NMR), confidence (AlphaFold)
- How to construct complex PDB queries: combining method, resolution, organism, domain annotations
UniProt
What it gives you:
- Protein sequences and functions
- Domains, families, PTMs
- Disease associations and variants
- Subcellular localization
- Cross-references to 180+ external databases
- Proteomes for complete organisms
- BLAST, Align, ID mapping tools
UniProt is organized as:
- UniProtKB (Knowledge)
  - Swiss-Prot (Reviewed)
  - TrEMBL (Unreviewed)
- UniRef (Clusters)
- UniParc (Archive)
UniProt classifies how confident we are that a protein actually exists. Query syntax: existence:1 (for protein-level evidence)
UniProt also offers ID Mapping: convert between identifier systems.
TL;DR
- UniProt = protein database = Swiss-Prot (reviewed, high quality) + TrEMBL (unreviewed, comprehensive)
- Always add `reviewed:true` when you need reliable annotations
- Query syntax: `field:value` with `AND`, `OR`, `NOT`
- Use parentheses to group OR conditions properly
- Common fields: `organism_id`, `ec`, `reviewed`, `existence`, `database`, `proteome`, `go`
- Wildcards: Use `*` for EC numbers (e.g., `ec:3.4.21.*`)
- Protein existence: Level 1 = experimental evidence, Level 5 = uncertain
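A small Python sketch of the `field:value` syntax with grouped OR conditions. The helpers `clause`, `either`, and `every` are hypothetical names; the taxonomy IDs 9606 (human) and 10090 (mouse) and the EC wildcard are just example values.

```python
def clause(field, value):
    """One field:value condition, wrapped so it can be combined safely."""
    return f"({field}:{value})"


def either(*clauses):
    """OR group: the outer parentheses keep it intact when ANDed with other clauses."""
    return "(" + " OR ".join(clauses) + ")"


def every(*clauses):
    """AND together clauses or groups."""
    return " AND ".join(clauses)


query = every(
    either(clause("organism_id", 9606), clause("organism_id", 10090)),
    clause("reviewed", "true"),
    clause("ec", "3.4.21.*"),
)
print(query)
```

Dropping the `either` wrapper would leave the second organism ORed against the whole rest of the query, which is the classic grouping mistake.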
NCBI
What is NCBI? National Center for Biotechnology Information — created in 1988 as part of the National Library of Medicine (NLM) at NIH, Bethesda, Maryland.
What it gives you:
- GenBank (primary nucleotide sequences)
- RefSeq (curated reference sequences)
- Gene database (gene-centric information)
- PubMed (literature)
- dbSNP, ClinVar, OMIM (variants & clinical)
- BLAST (sequence alignment)
- And ~40 more databases, all cross-linked
TL;DR
- NCBI = US hub for biological databases (GenBank, RefSeq, Gene, PubMed, etc.)
- GenBank = primary archive (raw submissions) vs RefSeq = curated reference (cleaned up)
- RefSeq prefixes: NM/NP = curated, XM/XP = predicted — prefer N* for reliable analysis
- Boolean operators MUST be UPPERCASE: `AND`, `OR`, `NOT`
- Use quotes around multi-word terms: `"homo sapiens"[Organism]`
- Gene database = best starting point for gene-centric searches
- Properties = what it IS, Filters = what it's LINKED to
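The RefSeq prefix rule can be checked mechanically. The sketch below covers only the four prefixes mentioned in these notes (RefSeq has others, such as NC_ and NR_, which this ignores); `classify_refseq` is a hypothetical helper and the accession strings are format examples, not lookups.

```python
REFSEQ_PREFIXES = {
    "NM": ("mRNA", "curated"),
    "NP": ("protein", "curated"),
    "XM": ("mRNA", "predicted"),
    "XP": ("protein", "predicted"),
}


def classify_refseq(accession):
    """Return (molecule, status) for a RefSeq accession, or None if unrecognised."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ["NM_000546.6", "XP_123456.1", "P04637"]:
    print(acc, classify_refseq(acc))
```

A filter such as `classify_refseq(acc)[1] == "curated"` (after checking for None) keeps only the NM/NP records when reliability matters.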
Ensembl
Ensembl is a genome browser and database jointly run by the EBI (European Bioinformatics Institute) and the Wellcome Trust Sanger Institute since 1999. Think of it as Google Maps, but for genomes.
What it gives you:
- Gene sets (splice variants, proteins, ncRNAs)
- Comparative genomics (alignments, protein trees, orthologues)
- Variation data (SNPs, InDels, CNVs)
- BioMart for bulk data export
- REST API for programmatic access
- Everything is open source
BioMart: Bulk Data Queries
Workflow Example: ID Conversion
Goal: Convert RefSeq protein IDs to Ensembl Gene IDs
TL;DR
- Ensembl = genome browser + database for genes, transcripts, variants, orthologues
- IDs: ENSG (gene), ENST (transcript), ENSP (protein) — learn to recognize them
- MANE Select = highest quality transcript annotation (use these when possible)
- BioMart = bulk query tool: Dataset → Filters → Attributes → Export
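Recognising the three ID types can be sketched with one regular expression. The pattern below assumes the common layout: `ENS`, an optional species code of three or four letters for non-human species (for example `MUS` for mouse), a type letter G/T/P, eleven digits, and an optional version suffix. It deliberately ignores other Ensembl feature types, and `classify_ensembl_id` is a hypothetical helper.

```python
import re

ENSEMBL_ID = re.compile(r"^ENS(?:[A-Z]{3,4})?([GTP])\d{11}(?:\.\d+)?$")
KIND = {"G": "gene", "T": "transcript", "P": "protein"}


def classify_ensembl_id(identifier):
    """Return 'gene', 'transcript' or 'protein', or None if the ID does not fit."""
    match = ENSEMBL_ID.match(identifier.strip())
    return KIND[match.group(1)] if match else None


for example in ["ENSG00000139618", "ENST00000380152.8",
                "ENSMUSG00000041147", "NP_000050.3"]:
    print(example, classify_ensembl_id(example))
```

A quick check like this before pasting a list into BioMart helps catch RefSeq or UniProt IDs that belong in an external-ID filter instead.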
Avoid these mistakes:
- Don't paste RefSeq/UniProt IDs in "Gene stable ID" field — use EXTERNAL filters
- Use the text input field, not just checkboxes
- Orthologue = cross-species, Paralogue = same species
- Start with the species of your INPUT IDs as your dataset
- Always include your filter column in output attributes