Everything

What is a database? A database is a large, structured set of persistent data, usually in computer-readable form.

A DBMS is a software package that enables users:

to access the data
to manipulate (create, edit, link, update) files as needed
to preserve the integrity of the data
to deal with security issues (who should have access)
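The first three points can be seen in miniature with SQLite, which ships with Python. This is only a sketch: the table, columns, and the DEMO1 record are made up for illustration, and SQLite has no user accounts, so the security point is not shown.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein ("
    " accession TEXT PRIMARY KEY,"          # integrity: no duplicate accessions
    " name TEXT NOT NULL,"                  # integrity: a name is required
    " length INTEGER CHECK (length > 0))"   # integrity: length must be positive
)

# Manipulate: create a record, then update it.
conn.execute("INSERT INTO protein VALUES (?, ?, ?)", ("DEMO1", "Example protein", 100))
conn.execute("UPDATE protein SET length = ? WHERE accession = ?", (120, "DEMO1"))

# Access: read the stored data back.
rows = conn.execute("SELECT accession, name, length FROM protein").fetchall()

# Integrity: the DBMS itself rejects a second row with the same primary key.
try:
    conn.execute("INSERT INTO protein VALUES (?, ?, ?)", ("DEMO1", "Duplicate", 10))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(rows, duplicate_rejected)
```

The point of the last step is that the rule lives in the database, not in the script: any program that writes to this table gets the same check.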

PubMed/MeSH

PubMed comprises more than 39 million citations for biomedical literature from MEDLINE, life science journals, and online books.

MeSH database (Medical Subject Headings) – controlled vocabulary thesaurus

Queries are straightforward, but be careful with OR, AND, and parentheses '()'. Read the question carefully to work out the correct query and the correct MeSH terms.
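One way to keep the grouping explicit is to build the query from small helpers, so every OR group is always wrapped in parentheses. This is a sketch only: the helper names are hypothetical, and the three MeSH headings are just an example topic.

```python
def mesh(term):
    """Tag a term as a MeSH heading; quotes keep multi-word terms together."""
    return f'"{term}"[MeSH Terms]'


def any_of(*parts):
    """Join alternatives with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(parts) + ")"


def all_of(*parts):
    """Join required conditions with AND."""
    return " AND ".join(parts)


# "(breast OR ovarian cancer) AND BRCA1": the OR group is closed before the AND.
query = all_of(
    any_of(mesh("Breast Neoplasms"), mesh("Ovarian Neoplasms")),
    mesh("Genes, BRCA1"),
)
print(query)
```

Moving the closing parenthesis changes which records match, so it helps to read the finished string back against the question being asked.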

PDB

Definition: What is PDB? PDB (Protein Data Bank) is the main global database that stores 3D structures of proteins, DNA, RNA, and their complexes.

How is experimental structure data obtained? (3 methods)

X-ray Crystallography (88%): uses crystals + X-ray diffraction to map atomic positions.
NMR Spectroscopy (10%): uses magnetic fields to determine structures in solution.
Cryo-Electron Microscopy (Cryo-EM) (1%)

What is Resolution (Å)? Resolution (in Ångström) measures the level of detail; smaller value = sharper, more accurate structure.

SIFTS (Structure Integration with Function, Taxonomy and Sequence) provides residue-level mapping between:

PDB entries ↔ UniProt sequences
Connections to: GO, InterPro, Pfam, CATH, SCOP, PubMed, Ensembl

This is how you can search PDB by Pfam domain or UniProt ID.

Method Comparison Summary

Feature	X-ray	Cryo-EM	NMR
Sample	Crystal required	Frozen in ice	Solution
Size limit	None	>50 kDa	<50-70 kDa
Resolution	Can be <1 Å	Rarely <2.2 Å	N/A
Dynamics	No	Limited	Yes
Multiple states	Difficult	Yes	Yes
Membrane proteins	Difficult	Good	Limited

AlphaFold

What is AlphaFold? A deep learning system that predicts a protein's 3D structure from its amino acid sequence.

At CASP14 (2020), AlphaFold2 scored ~92 GDT (Global Distance Test).

AlphaFold is widely regarded as having essentially solved structure prediction for single protein domains.

pLDDT (predicted Local Distance Difference Test): Stored in the B-factor column of AlphaFold PDB files.

What pLDDT measures: Confidence in local structure (not global fold).

Use pLDDT to:

Identify structured domains vs disordered regions
Decide which parts to trust
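Because pLDDT sits in the B-factor column, it can be read with plain fixed-width parsing. The sketch below assumes the legacy PDB column layout (atom name in columns 13-16, residue number in 23-26, B-factor in 61-66) and uses the commonly quoted confidence bands (90, 70, 50) as an assumption. The three ATOM records are synthetic, built by a hypothetical helper so the columns line up.

```python
def atom_line(serial, name, res_name, chain, res_seq, x, y, z, bfactor):
    """Format one fixed-width ATOM record (demo helper, not a full PDB writer)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.00:6.2f}{bfactor:6.2f}"
    )


def plddt_per_residue(pdb_text):
    """Return {residue number: pLDDT}, read from the B-factor column of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Map a pLDDT value to a confidence band (thresholds are an assumption)."""
    if plddt >= 90:
        return "very high"
    if plddt >= 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low (possibly disordered)"


# Synthetic three-residue model; coordinates and scores are made up.
demo = "\n".join([
    atom_line(1, "CA", "MET", "A", 1, 0.0, 0.0, 0.0, 42.10),
    atom_line(2, "CA", "LYS", "A", 2, 3.8, 0.0, 0.0, 71.55),
    atom_line(3, "CA", "LEU", "A", 3, 7.6, 0.0, 0.0, 95.30),
])
scores = plddt_per_residue(demo)
bands = {res: confidence_band(value) for res, value in scores.items()}
print(bands)
```

Only CA atoms are read because AlphaFold assigns one pLDDT per residue, so every atom of a residue carries the same value.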

PAE (Predicted Aligned Error):

Dark blocks on diagonal: Confident domains
Off-diagonal dark blocks: Confident domain-domain interactions
Light regions: Uncertain relative positions (domains may be connected but orientation unknown)

Use PAE for: Determining if domain arrangements are reliable.
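The same reading can be done numerically by averaging the PAE matrix over blocks. This is a toy sketch: the 4x4 matrix, the domain boundaries, and the 5 Å cutoff are all invented for illustration, and real PAE matrices are one row and column per residue.

```python
def mean_pae(pae, rows, cols):
    """Average predicted aligned error (in Å) over one block of the matrix."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


# Toy PAE matrix: residues 0-1 form domain A, residues 2-3 form domain B.
pae = [
    [0.5, 1.0, 20.0, 22.0],
    [1.0, 0.5, 21.0, 19.0],
    [18.0, 20.0, 0.5, 1.5],
    [22.0, 20.0, 1.5, 0.5],
]
domain_a = range(0, 2)
domain_b = range(2, 4)

# Diagonal block: low error means the domain itself is predicted confidently.
within_a = mean_pae(pae, domain_a, domain_a)

# Off-diagonal blocks: PAE is not symmetric, so both directions are averaged.
between = (mean_pae(pae, domain_a, domain_b) + mean_pae(pae, domain_b, domain_a)) / 2

CUTOFF = 5.0  # hypothetical threshold for calling an arrangement reliable
arrangement_reliable = between < CUTOFF
print(within_a, between, arrangement_reliable)
```

Here domain A is well defined (0.75 Å on average) but its position relative to domain B is not (20.25 Å), which is the "light off-diagonal region" case.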

PDB file formats: legacy PDB format and mmCIF (current standard)

The B-factor Column

The B-factor means different things depending on the method:

Method	B-factor contains	Meaning
X-ray	Temperature factor	Atomic mobility/disorder
NMR	RMSF	Fluctuation across models
AlphaFold	pLDDT	Prediction confidence

When validating a structure, check:

Resolution (for X-ray/Cryo-EM)
R-factors (for X-ray)
Geometry (for all)

R-factor (X-ray only): Measures how well the model fits the experimental data. A value below 0.20 indicates a good fit.

Types of R-factors:

R-work: Calculated on data used for refinement
R-free: Calculated on test set NOT used for refinement (more honest)

R-free is more reliable. If R-work is much lower than R-free, the model may be overfitted.
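The R-factor is the summed difference between observed and calculated structure factor amplitudes, divided by the summed observed amplitudes. A small sketch with made-up amplitudes follows; scaling between the two sets is ignored, and the 0.05 gap used to flag possible overfitting is a rule-of-thumb assumption, not a fixed standard.

```python
def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|), for non-negative amplitudes."""
    numerator = sum(abs(obs - calc) for obs, calc in zip(f_obs, f_calc))
    return numerator / sum(f_obs)


# Toy amplitudes: reflections used in refinement (work set)...
work_obs, work_calc = [100.0, 80.0, 60.0, 40.0], [90.0, 85.0, 50.0, 45.0]
# ...and reflections held out from refinement (free set).
free_obs, free_calc = [50.0, 30.0], [40.0, 38.0]

r_work = r_factor(work_obs, work_calc)   # 30 / 280
r_free = r_factor(free_obs, free_calc)   # 18 / 80

MAX_GAP = 0.05  # assumed rule of thumb
possibly_overfitted = (r_free - r_work) > MAX_GAP
print(round(r_work, 3), round(r_free, 3), possibly_overfitted)
```

In this toy case R-work looks good (about 0.107) while R-free is clearly worse (0.225), which is the pattern that suggests the model was fitted too closely to the refinement data.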

Data Validation:

Resolution
Geometry
R-factor

Key Search Fields

Field	Use for
Experimental Method	"X-RAY DIFFRACTION", "ELECTRON MICROSCOPY", "SOLUTION NMR"
Data Collection Resolution	X-ray resolution
Reconstruction Resolution	Cryo-EM resolution
Source Organism	Species
UniProt Accession	Link to UniProt
Pfam Identifier	Domain family
CATH Identifier	Structure classification
Reference Sequence Coverage	How much of UniProt sequence is in structure
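These fields can also be combined programmatically. The sketch below only builds a JSON query in the shape used by the RCSB Search API and sends nothing. The attribute names (exptl.method, rcsb_entry_info.resolution_combined, rcsb_entity_source_organism.ncbi_scientific_name) and operators are written from memory and should be treated as assumptions to check against the official attribute list; the terminal helper is hypothetical.

```python
import json


def terminal(attribute, operator, value):
    """One search condition on a single attribute (hypothetical helper)."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


# Human X-ray structures at 2.0 Å or better: three conditions joined with AND.
query = {
    "query": {
        "type": "group",
        "logical_operator": "and",
        "nodes": [
            terminal("exptl.method", "exact_match", "X-RAY DIFFRACTION"),
            terminal("rcsb_entry_info.resolution_combined", "less_or_equal", 2.0),
            terminal(
                "rcsb_entity_source_organism.ncbi_scientific_name",
                "exact_match",
                "Homo sapiens",
            ),
        ],
    },
    "return_type": "entry",
}

payload = json.dumps(query, indent=2)
print(payload)
```

The structure mirrors the advanced search form: each terminal node is one field from the table, and the group node plays the role of the AND/OR selector.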

Comparing Experimental vs AlphaFold Structures

When AlphaFold structures are available:

Check	Experimental	AlphaFold
Overall reliability	Resolution, R-factor	pLDDT, PAE
Local confidence	B-factor (flexibility)	pLDDT (prediction confidence)
Disordered regions	Often missing	Low pLDDT (<50)
Ligand binding sites	Can have ligands	No ligands
Protein-protein interfaces	Shown in complex structures	Not reliable unless AlphaFold-Multimer

Key insight: Low-confidence AlphaFold regions often correspond to regions missing in experimental structures — both are telling you the same thing (disorder/flexibility).

For the Oral Exam

Be prepared to explain:

Why crystallography needs crystals — signal amplification from ordered molecular packing
The phase problem — you measure amplitudes but lose phases; must determine indirectly
What resolution means — ability to distinguish fine details; limited by crystal order
Why Cryo-EM grew so fast — no crystals needed, good for large complexes, computational advances
NMR gives ensembles, not single structures — restraints satisfied by multiple conformations
What pLDDT means — local prediction confidence, stored in B-factor column
Difference between pLDDT and PAE — pLDDT is local confidence, PAE is relative domain positioning
How to assess structure quality — resolution, R-factors, validation metrics
B-factor means different things — mobility (X-ray), fluctuation (NMR), confidence (AlphaFold)
How to construct complex PDB queries — combining method, resolution, organism, domain annotations

UniProt

What it gives you:

Protein sequences and functions
Domains, families, PTMs
Disease associations and variants
Subcellular localization
Cross-references to 180+ external databases
Proteomes for complete organisms
BLAST, Align, ID mapping tools

                    UniProt
                       │
       ┌───────────────┼───────────────┐
       │               │               │
   UniProtKB        UniRef         UniParc
   (Knowledge)    (Clusters)      (Archive)
       │
   ┌───┴───┐
   │       │
Swiss-Prot TrEMBL
(Reviewed) (Unreviewed)

UniProt assigns each entry a protein existence level (1 to 5) that reflects how confident we are that the protein actually exists. Query syntax: existence:1 (for protein-level evidence)

UniProt also offers an ID Mapping tool to convert between identifier systems.

TL;DR

UniProt = protein database = Swiss-Prot (reviewed, high quality) + TrEMBL (unreviewed, comprehensive)
Always add reviewed:true when you need reliable annotations
Query syntax: field:value with AND, OR, NOT
Use parentheses to group OR conditions properly
Common fields: organism_id, ec, reviewed, existence, database, proteome, go
Wildcards: Use * for EC numbers (e.g., ec:3.4.21.*)
Protein existence: Level 1 = experimental evidence, Level 5 = uncertain
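The rules above can be combined into one query string. The sketch below builds a query and a search URL without sending a request; the helper names are hypothetical, and the endpoint and its format and size parameters are stated as I understand the UniProt REST API, so verify them before relying on them. Taxonomy IDs 9606 and 10090 are human and mouse.

```python
from urllib.parse import urlencode


def field(name, value):
    """One field:value condition."""
    return f"{name}:{value}"


def any_of(*parts):
    """OR group, always parenthesised so it cannot leak into the ANDs."""
    return "(" + " OR ".join(parts) + ")"


def all_of(*parts):
    """AND-join the required conditions."""
    return " AND ".join(parts)


# Reviewed human or mouse serine endopeptidases with protein-level evidence.
query = all_of(
    field("reviewed", "true"),
    any_of(field("organism_id", 9606), field("organism_id", 10090)),
    field("ec", "3.4.21.*"),
    field("existence", 1),
)

# Assumed endpoint and parameters; nothing is requested here.
url = "https://rest.uniprot.org/uniprotkb/search?" + urlencode(
    {"query": query, "format": "tsv", "size": 5}
)
print(query)
print(url)
```

Without the parentheses around the organism group, the query would mix the second organism with the remaining AND conditions in a way that is easy to misread, which is the mistake the TL;DR warns about.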

NCBI

What is NCBI? National Center for Biotechnology Information — created in 1988 as part of the National Library of Medicine (NLM) at NIH, Bethesda, Maryland.

What it gives you:

GenBank (primary nucleotide sequences)
RefSeq (curated reference sequences)
Gene database (gene-centric information)
PubMed (literature)
dbSNP, ClinVar, OMIM (variants & clinical)
BLAST (sequence alignment)
And ~40 more databases, all cross-linked

TL;DR

NCBI = US hub for biological databases (GenBank, RefSeq, Gene, PubMed, etc.)
GenBank = primary archive (raw submissions) vs RefSeq = curated reference (cleaned up)
RefSeq prefixes: NM_/NP_ = curated, XM_/XP_ = predicted; prefer NM_/NP_ for reliable analysis
Boolean operators MUST be UPPERCASE: AND, OR, NOT
Use quotes around multi-word terms: "homo sapiens"[Organism]
Gene database = best starting point for gene-centric searches
Properties = what it IS, Filters = what it's LINKED to
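The RefSeq prefix rule is easy to apply in code when filtering a list of accessions. A minimal sketch, covering only the four prefixes mentioned above (RefSeq has others); the function name is hypothetical and the example accessions are illustrative.

```python
# prefix -> (molecule type, curation status)
REFSEQ_PREFIXES = {
    "NM": ("mRNA", "curated"),
    "NP": ("protein", "curated"),
    "XM": ("mRNA", "predicted"),
    "XP": ("protein", "predicted"),
}


def classify_refseq(accession):
    """Return (molecule, status) for a RefSeq accession, or None if unrecognised."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


accessions = ["NM_000546.6", "XP_011522345.1", "P04637"]
curated = [acc for acc in accessions if (classify_refseq(acc) or ("", ""))[1] == "curated"]
print(curated)
```

The third accession has no underscore and is skipped, which also catches the common slip of mixing UniProt accessions into a RefSeq list.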

Ensembl

Ensembl is a genome browser and database launched in 1999 as a joint project of the EBI (European Bioinformatics Institute) and the Wellcome Trust Sanger Institute. Think of it as Google Maps, but for genomes.

What it gives you:

Gene sets (splice variants, proteins, ncRNAs)
Comparative genomics (alignments, protein trees, orthologues)
Variation data (SNPs, InDels, CNVs)
BioMart for bulk data export
REST API for programmatic access
Everything is open source

BioMart: Bulk Data Queries

Workflow Example: ID Conversion
Goal: Convert RefSeq protein IDs to Ensembl Gene IDs

TL;DR

Ensembl = genome browser + database for genes, transcripts, variants, orthologues
IDs: ENSG (gene), ENST (transcript), ENSP (protein) — learn to recognize them
MANE Select = highest quality transcript annotation (use these when possible)
BioMart = bulk query tool: Dataset → Filters → Attributes → Export
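Recognising the ID type before pasting it into a BioMart filter avoids most input errors. The pattern below is an assumption about the usual layout of Ensembl stable IDs (ENS, an optional species code for non-human species, a type letter, 11 digits, and an optional version); it covers only gene, transcript, and protein IDs, and the example IDs are illustrative.

```python
import re

# ENS + optional species code (e.g. MUS) + G/T/P + 11 digits + optional ".version"
ENSEMBL_ID = re.compile(r"^ENS([A-Z]{3,4})?([GTP])(\d{11})(\.\d+)?$")

KIND = {"G": "gene", "T": "transcript", "P": "protein"}


def classify_ensembl(stable_id):
    """Return 'gene', 'transcript' or 'protein', or None if it is not such an ID."""
    match = ENSEMBL_ID.match(stable_id)
    if match is None:
        return None
    return KIND[match.group(2)]


for example in ["ENSG00000139618", "ENST00000380152.8", "ENSMUSP00000000001", "NP_000050"]:
    print(example, "->", classify_ensembl(example))
```

An ID that comes back as None, such as a RefSeq or UniProt accession, belongs in one of the external ID filters, not in the "Gene stable ID" field.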

Avoid these mistakes:

Don't paste RefSeq/UniProt IDs in "Gene stable ID" field — use EXTERNAL filters
Use the text input field, not just checkboxes
Orthologue = cross-species, Paralogue = same species
Start with the species of your INPUT IDs as your dataset
Always include your filter column in output attributes

Bioinformatics Forever