Ensembl: A Practical Guide
So you need to look up genes, transcripts, variants, or convert IDs between databases? Welcome to Ensembl — the genome browser that bioinformaticians actually use daily.
What is Ensembl?
Ensembl is a genome browser and database, launched in 1999 as a joint project between EMBL-EBI (European Bioinformatics Institute) and the Wellcome Trust Sanger Institute. Think of it as Google Maps, but for genomes.
What it gives you:
- Gene sets (splice variants, proteins, ncRNAs)
- Comparative genomics (alignments, protein trees, orthologues)
- Variation data (SNPs, InDels, CNVs)
- BioMart for bulk data export
- REST API for programmatic access
- Everything is open source
The current human assembly is GRCh38.p14, maintained by the Genome Reference Consortium. The original Human Genome Project finished in 2003; it cost about $3 billion and took 13 years (1990 to 2003). Now you can access the result for free in seconds. Science is wild.
Ensembl Stable Identifiers
This is the ID system you'll see everywhere. Memorize the prefixes:
| Prefix | Meaning | Example |
|---|---|---|
| ENSG | Gene ID | ENSG00000141510 |
| ENST | Transcript ID | ENST00000269305 |
| ENSP | Peptide/Protein ID | ENSP00000269305 |
| ENSE | Exon ID | ENSE00001146308 |
| ENSR | Regulatory Feature | ENSR00000000001 |
| ENSFM | Protein Family | ENSFM00250000000001 |
For other species, a species code (usually three letters, sometimes four) is inserted after "ENS": ENSMUSG (mouse), ENSDARG (zebrafish), ENSCSAVG (Ciona savignyi), etc.
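If you handle mixed ID lists, it can help to split an ID into its parts before deciding what to do with it. The sketch below is a minimal parser built only from the prefixes and species codes mentioned in this guide; the `FEATURE_TYPES` and `SPECIES_CODES` dictionaries are illustrative assumptions and far from complete.

```python
import re

# Feature-type letters taken from the prefix table in this guide.
FEATURE_TYPES = {
    "G": "gene",
    "T": "transcript",
    "P": "protein",
    "E": "exon",
    "R": "regulatory feature",
    "FM": "protein family",
}

# Only the species codes named in this guide; not a complete list.
SPECIES_CODES = {
    None: "human",
    "MUS": "mouse",
    "DAR": "zebrafish",
    "CSAV": "Ciona savignyi",
}

# ENS + optional species code + feature type + digits + optional ".version"
STABLE_ID = re.compile(r"^ENS([A-Z]{3,4})?(FM|[GTPER])(\d+)(?:\.\d+)?$")


def parse_stable_id(stable_id):
    """Split an Ensembl stable ID into (species, feature type, digits)."""
    match = STABLE_ID.match(stable_id.strip())
    if match is None:
        raise ValueError(f"not an Ensembl stable ID: {stable_id!r}")
    code, feature, digits = match.groups()
    species = SPECIES_CODES.get(code, f"unknown species code {code}")
    return species, FEATURE_TYPES[feature], digits


for example in ("ENSG00000141510", "ENSCSAVG00000000002", "ENSFM00250000000001"):
    print(example, "->", parse_stable_id(example))
```

Anything that does not start with "ENS" raises a `ValueError`, which is a cheap way to catch RefSeq or UniProt IDs that slipped into an Ensembl ID list.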
Transcript Quality Tiers
Not all transcripts are created equal. Here's the hierarchy:
MANE Select (Gold Standard) 🥇
- Matched Annotation from NCBI and EMBL-EBI (that is what MANE stands for)
- Perfectly aligned to GRCh38
- Complete sequence identity with RefSeq
- This is your go-to transcript
Merged (Ensembl/Havana) 🥈
- Automatically annotated + manually curated
- High confidence
CCDS (Consensus CDS)
- Collaborative effort for consistent protein-coding annotations
- Shared between NCBI, EBI, UCSC, and others
Ensembl Protein Coding (Red)
- Automatic annotation based on mRNA/protein evidence
- Good, but not manually verified
When doing variant analysis, prefer MANE Select transcripts. Using a low-confidence transcript can give you wrong coordinates or missed variants.
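If you have several candidate transcripts for a gene, the preference order above can be written as a simple ranking. This is only a sketch: the tier names mirror the list in this section, and the record layout (`id`, `tier`) is hypothetical rather than the shape of any Ensembl export.

```python
# Lower rank = preferred. Tier names follow the hierarchy described above.
TIER_RANK = {
    "MANE Select": 0,
    "Merged": 1,
    "CCDS": 2,
    "Ensembl protein coding": 3,
}


def pick_transcript(transcripts):
    """Return the transcript whose tier ranks best; unknown tiers rank last."""
    if not transcripts:
        raise ValueError("no transcripts given")
    return min(transcripts, key=lambda t: TIER_RANK.get(t["tier"], len(TIER_RANK)))


# Hypothetical records, for illustration only.
candidates = [
    {"id": "transcript_B", "tier": "Merged"},
    {"id": "transcript_A", "tier": "MANE Select"},
    {"id": "transcript_C", "tier": "Ensembl protein coding"},
]
best = pick_transcript(candidates)
print(best["id"])
```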
Using the Ensembl Browser
Basic Navigation
- Go to ensembl.org
- Search by: gene name, Ensembl ID, coordinates, or variant ID (rs number)
- Gene page shows: location, transcripts, variants, orthologues, etc.
Key Information You Can Find
For any gene (e.g., MYH9):
- Ensembl Gene ID → ENSG00000100345
- Chromosomal coordinates → 22:36,281,270-36,393,331
- Cytogenetic location → 22q12.3
- Strand → Forward (+) or Reverse (-)
- Number of transcripts → and which are protein-coding
- MANE Select transcript → with CCDS and RefSeq cross-references
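Coordinates copied from the browser come with thousands separators, so they usually need cleaning before you can compute with them. Here is a small sketch of a region parser; the function names are my own, and it assumes the `chromosome:start-end` layout shown above.

```python
import re

# Optional "chr" prefix, chromosome name, then two numbers that may contain commas.
REGION = re.compile(r"^(?:chr)?(\w+):([\d,]+)-([\d,]+)$", re.IGNORECASE)


def parse_region(text):
    """Parse 'chrom:start-end' into (chrom, start, end) with start <= end."""
    match = REGION.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse region: {text!r}")
    chrom = match.group(1)
    first = int(match.group(2).replace(",", ""))
    second = int(match.group(3).replace(",", ""))
    # Reverse-strand regions are sometimes written high-to-low; normalise them.
    start, end = sorted((first, second))
    return chrom, start, end


def region_length(text):
    """Length in bases; Ensembl coordinates are 1-based and inclusive."""
    _, start, end = parse_region(text)
    return end - start + 1


print(parse_region("22:36,281,270-36,393,331"))
print(region_length("22:36,281,270-36,393,331"))  # 112062
```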
Viewing Variants
- Navigate to your gene
- Go to "Variant table" or zoom into a specific region
- Filter by: consequence type, clinical significance (ClinVar), etc.
- Click on any variant (e.g., rs80338828) to see:
- Alleles and frequencies
- Consequence (missense, synonymous, etc.)
- Clinical annotations (ClinVar, OMIM)
- Population frequencies
BioMart: Bulk Data Queries
BioMart is where Ensembl gets powerful. No programming required — it's a web interface for mining data in bulk.
Access: ensembl.org → BioMart (top menu)
The Three-Step Process
1. DATASET → Choose species/database (e.g., Human genes GRCh38.p14)
2. FILTERS → Narrow down what you want (gene list, chromosome, biotype...)
3. ATTRIBUTES → Choose what columns to export (IDs, names, sequences...)
Goal: Convert RefSeq protein IDs to Ensembl Gene IDs
- Dataset: Human genes (GRCh38.p14)
- Filters → External References → RefSeq peptide ID → paste your list
- Attributes: Gene stable ID, Gene name, RefSeq peptide ID
- Results → Export as CSV/TSV/HTML
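The same Dataset → Filters → Attributes structure is what BioMart uses behind the web form: the query is an XML document sent to the `martservice` endpoint. The sketch below builds that XML for the RefSeq conversion above without sending it. The internal names (`hsapiens_gene_ensembl`, `refseq_peptide`, `ensembl_gene_id`, `external_gene_name`) are what I would expect for these fields, but treat them as assumptions and confirm them with the XML button on the BioMart results page.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

MART_URL = "https://www.ensembl.org/biomart/martservice"


def build_query(dataset, filters, attributes):
    """Build a BioMart XML query: one dataset, its filters, its attributes."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",  # same idea as "Unique results only" in the web form
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, values in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


xml_query = build_query(
    dataset="hsapiens_gene_ensembl",                          # step 1: DATASET
    filters={"refseq_peptide": ["NP_001214", "NP_001216"]},   # step 2: FILTERS
    attributes=["ensembl_gene_id", "external_gene_name", "refseq_peptide"],  # step 3
)
request_url = MART_URL + "?" + urlencode({"query": xml_query})
print(xml_query)
```

Fetching `request_url` should return the same tab-separated table you would export from the web interface, provided the internal names match your Ensembl release.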
⚠️ Common BioMart Mistakes (And How to Avoid Them)
These will save you hours of frustration. Learn from pain.
Mistake #1: Pasting IDs in the Wrong Filter Field
You have RefSeq IDs (NP_001214, NP_001216...) and you paste them into the "Gene stable ID(s)" field. Result? Empty results.
Why it happens: The "Gene stable ID(s)" field expects Ensembl IDs (ENSG...), not RefSeq IDs.
The fix:
| ID Type | Where to Paste |
|---|---|
| ENSG00000xxxxx | Filters → GENE → Gene stable ID(s) |
| NP_xxxxxx (RefSeq protein) | Filters → EXTERNAL → RefSeq peptide ID(s) |
| NM_xxxxxx (RefSeq mRNA) | Filters → EXTERNAL → RefSeq mRNA ID(s) |
| P12345 (UniProt) | Filters → EXTERNAL → UniProtKB/Swiss-Prot ID(s) |
Look at your ID prefix. If it's NOT "ENS...", you need to find the matching field under EXTERNAL → External References.
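The "look at your prefix" rule is easy to automate. Below is a rough sketch that maps an ID to the filter field from the table above. The internal filter names (`ensembl_gene_id`, `refseq_peptide`, `refseq_mrna`, `uniprotswissprot`) are assumptions to check against your dataset's filter list, and note that a UniProt accession pattern alone cannot tell Swiss-Prot from TrEMBL entries.

```python
import re

# (pattern, assumed internal filter name, label shown in the web interface)
FILTER_RULES = [
    (re.compile(r"^ENS[A-Z]*G\d+"), "ensembl_gene_id", "Gene stable ID(s)"),
    (re.compile(r"^NP_\d+"), "refseq_peptide", "RefSeq peptide ID(s)"),
    (re.compile(r"^NM_\d+"), "refseq_mrna", "RefSeq mRNA ID(s)"),
    (
        # UniProt accession format, e.g. P12345
        re.compile(r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"),
        "uniprotswissprot",
        "UniProtKB/Swiss-Prot ID(s)",
    ),
]


def filter_for(identifier):
    """Return (filter name, web label) for the first rule matching the ID."""
    cleaned = identifier.strip()
    for pattern, filter_name, label in FILTER_RULES:
        if pattern.match(cleaned):
            return filter_name, label
    raise ValueError(f"no filter rule for {identifier!r}")


for example in ("ENSG00000137752", "NP_001214", "P12345"):
    print(example, "->", filter_for(example))
```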
Mistake #2: Checkbox vs Text Input Confusion
Some filter options have both a checkbox AND a text field:
☑ With RefSeq peptide ID(s): Only ← Checkbox (just filters for genes that HAVE RefSeq IDs)
[________________________] ← Text field (where you paste YOUR specific IDs)
The mistake: Checking the box but not pasting IDs in the text field.
What happens:
- Checkbox alone = "Give me all genes that have ANY RefSeq ID" (thousands of results)
- Text field = "Give me only genes matching THESE specific RefSeq IDs" (your actual query)
The fix: Always paste your ID list in the text input field, not just check the box.
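In BioMart's XML query format the two controls become two different kinds of filter, which makes the difference easier to see. A minimal sketch follows; the filter names `with_refseq_peptide` and `refseq_peptide` are my assumption of the internal names, so verify them for your release.

```python
import xml.etree.ElementTree as ET

# Checkbox ("Only"): a boolean filter. It carries no ID list, so it keeps
# every gene that has at least one RefSeq peptide ID.
checkbox_only = ET.Element("Filter", {"name": "with_refseq_peptide", "excluded": "0"})

# Text field: a value filter restricted to the IDs you supply.
text_field = ET.Element(
    "Filter", {"name": "refseq_peptide", "value": "NP_001214,NP_001216"}
)

print(ET.tostring(checkbox_only, encoding="unicode"))
print(ET.tostring(text_field, encoding="unicode"))
```

Only the second element contains your IDs, which is why the checkbox on its own returns thousands of rows.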
Mistake #3: Orthologue vs Paralogue Mix-up
You want to find human equivalents of Ciona genes. You select Paralogue %id. Result? Wrong data or empty results.
| Term | Meaning | Use When |
|---|---|---|
| Orthologue | Corresponding gene in a different species (separated by a speciation event) | Ciona gene → Human equivalent |
| Paralogue | Related gene in the same species (separated by a duplication event) | Human HBA1 → Human HBA2 |
The fix:
For cross-species queries (e.g., Ciona → Human):
Attributes → Homologues → Human Orthologues
✓ Human gene stable ID
✓ Human gene name
✓ %id. target Human gene identical to query gene
NOT:
Attributes → Homologues → Paralogues ← WRONG for cross-species!
Mistake #4: Forgetting to Include Filter Column in Attributes
The scenario: You filter by RefSeq peptide ID, but don't include it in your output attributes.
What happens: You get a list of Ensembl IDs with no way to match them back to your original input!
| Gene stable ID | Gene name |
|---|---|
| ENSG00000137752 | CASP1 |
| ENSG00000196954 | CASP4 |
Wait... which RefSeq ID was CASP1 again? 🤷
The fix: Always include your filter field as an output attribute:
Attributes:
✓ Gene stable ID
✓ Gene name
✓ RefSeq peptide ID ← Include this for verification!
Now you get:
| Gene stable ID | Gene name | RefSeq peptide ID |
|---|---|---|
| ENSG00000137752 | CASP1 | NP_001214 |
| ENSG00000196954 | CASP4 | NP_001216 |
Much better!
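Once the filter column is in the export, matching rows back to your input is a dictionary lookup. Here is a small sketch using the two rows from the table above, laid out as tab-separated text; the helper name `index_by_input` is my own.

```python
import csv
import io

# Two rows copied from the table above, as BioMart TSV output would look.
tsv_text = (
    "Gene stable ID\tGene name\tRefSeq peptide ID\n"
    "ENSG00000137752\tCASP1\tNP_001214\n"
    "ENSG00000196954\tCASP4\tNP_001216\n"
)


def index_by_input(tsv, key_column):
    """Map each value of key_column to the list of rows that mention it."""
    index = {}
    for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
        index.setdefault(row[key_column], []).append(row)
    return index


by_refseq = index_by_input(tsv_text, "RefSeq peptide ID")
gene = by_refseq["NP_001214"][0]["Gene name"]
print(gene)  # CASP1
```

Without the "RefSeq peptide ID" column there would be nothing to key the index on, which is exactly the problem described above.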
Mistake #5: Wrong Dataset for Cross-Species Queries
The scenario: You want human orthologues of Ciona genes. You select "Human genes" as your dataset.
What happens: You can't input Ciona gene IDs because you're in the Human database!
The fix: Start from the source species:
Dataset: Ciona savignyi genes ← Start here (your input species)
Filters: Gene stable ID → paste Ciona IDs
Attributes:
- Gene stable ID (Ciona)
- Human orthologue gene ID ← Get human data as attributes
- Human gene name
Rule: Dataset = species of your INPUT IDs. Other species come through Homologues attributes.
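The rule can be turned into a quick sanity check before you open BioMart: read the species code from your input IDs and pick the dataset from that. The sketch below covers only the species named in this guide, and the dataset names (`hsapiens_gene_ensembl` and so on) are assumptions to confirm in the dataset menu.

```python
import re

# Species code -> assumed BioMart dataset name ("" means no code, i.e. human).
DATASETS = {
    "": "hsapiens_gene_ensembl",
    "MUS": "mmusculus_gene_ensembl",
    "DAR": "drerio_gene_ensembl",
    "CSAV": "csavignyi_gene_ensembl",
}

GENE_ID = re.compile(r"^ENS([A-Z]{3,4})?G\d+$")


def dataset_for(gene_ids):
    """Return the one dataset that matches every input gene ID."""
    found = set()
    for gene_id in gene_ids:
        match = GENE_ID.match(gene_id.strip())
        if match is None:
            raise ValueError(f"not an Ensembl gene ID: {gene_id!r}")
        code = match.group(1) or ""
        if code not in DATASETS:
            raise ValueError(f"species code {code!r} is not in this sketch")
        found.add(DATASETS[code])
    if len(found) != 1:
        raise ValueError("expected IDs from exactly one species")
    return found.pop()


print(dataset_for(["ENSCSAVG00000000002", "ENSCSAVG00000000003"]))
```

A mixed list (say, human and Ciona IDs together) raises an error, which is a hint to split it into one query per species.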
BioMart Mistakes Cheatsheet
| Symptom | Likely Cause | Fix |
|---|---|---|
| Empty results | IDs in wrong filter field | Match ID prefix to correct filter (EXTERNAL for non-Ensembl IDs) |
| Way too many results | Used checkbox without text input | Paste specific IDs in the text field |
| Wrong species data | Selected Paralogue instead of Orthologue | Use Orthologue for cross-species |
| Can't match results to input | Didn't include filter column in output | Add your filter field to Attributes |
| Can't input your IDs | Wrong dataset selected | Dataset = species of your INPUT IDs |
Common BioMart Queries
Query Type 1: ID Conversion
RefSeq → Ensembl + HGNC Symbol
| Step | Action |
|---|---|
| Dataset | Human genes (GRCh38.p14) |
| Filters | EXTERNAL → RefSeq peptide ID(s) → paste list |
| Attributes | Gene stable ID, HGNC symbol, RefSeq peptide ID |
Query Type 2: Finding Orthologues
Find human orthologues of genes from another species
| Step | Action |
|---|---|
| Dataset | Source species (e.g., Ciona savignyi genes) |
| Filters | Gene stable ID → paste your list |
| Attributes | Gene stable ID, Human orthologue gene ID, Human gene name, % identity |
Orthologue = cross-species. Paralogue = same species. Don't mix them up!
Query Type 3: Variant Export
Get all missense variants for a gene list
| Step | Action |
|---|---|
| Dataset | Human genes (GRCh38.p14) |
| Filters | Gene name → your list; Variant consequence → missense_variant |
| Attributes | Gene name, Variant name (rs ID), Consequence, Amino acid change |
Query Type 4: Find Genes with PDB Structures
Count/export genes that have associated 3D structures
| Step | Action |
|---|---|
| Dataset | Human genes (GRCh38.p14) |
| Filters | With PDB ID → Only |
| Attributes | Gene stable ID, Gene name, PDB ID, UniProtKB/Swiss-Prot ID |
Practice Exercises
Exercise 1: SNP Nucleotide Lookup
Q: In Ensembl, consider the SNP variation rs80338826. Which is the DNA nucleotide triplet coding for the wild-type amino acid residue (transcript MYH9-201)?
Answer: The triplet is CGT (coding for Arginine).
How to find it:
- Search rs80338826 in Ensembl
- Go to the variant page
- Look at transcript MYH9-201 consequences
- Check the codon column for the reference allele
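To double-check which amino acid a codon encodes, you can translate it with the standard genetic code. The sketch below builds the 64-codon table from the usual TCAG ordering and confirms that CGT codes for arginine (R); the helper name `translate_codon` is my own.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon):
    """Return the one-letter amino acid for a DNA codon ('*' means stop)."""
    return CODON_TABLE[codon.upper()]


print(translate_codon("CGT"))  # R (arginine)
```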
Exercise 2: RefSeq to Ensembl Conversion
Q: Convert these RefSeq protein IDs to Ensembl Gene IDs and HGNC symbols:
NP_203126, NP_001214, NP_001216, NP_001220, NP_036246, NP_203519, NP_203520, NP_203522
Answer:
BioMart Setup:
| Step | What to do |
|---|---|
| Dataset | Human genes (GRCh38.p14) |
| Filters | EXTERNAL → RefSeq peptide ID(s) → paste the NP_ IDs |
| Attributes | Gene stable ID, HGNC symbol, RefSeq peptide ID |
⚠️ Don't paste NP_ IDs in "Gene stable ID" field — that's for ENSG IDs only!
Results:
| Gene stable ID | HGNC symbol | RefSeq peptide ID |
|---|---|---|
| ENSG00000137752 | CASP1 | NP_001214 |
| ENSG00000196954 | CASP4 | NP_001216 |
| ENSG00000132906 | CASP9 | NP_001220 |
| ENSG00000105141 | CASP14 | NP_036246 |
| ENSG00000165806 | CASP7 | NP_203126 |
| ENSG00000064012 | CASP8 | NP_203519 |
| ENSG00000064012 | CASP8 | NP_203520 |
(Notice: CASP8 has multiple RefSeq IDs mapping to it — different isoforms!)
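With many-to-one mappings like this, it is worth grouping the export by gene and comparing it against your input list. The sketch below uses the input IDs and result rows exactly as printed in this exercise; in the table as printed, NP_203522 has no row, and a check like this makes such gaps visible.

```python
from collections import defaultdict

input_ids = [
    "NP_203126", "NP_001214", "NP_001216", "NP_001220",
    "NP_036246", "NP_203519", "NP_203520", "NP_203522",
]

# (Gene stable ID, HGNC symbol, RefSeq peptide ID), copied from the results table.
rows = [
    ("ENSG00000137752", "CASP1", "NP_001214"),
    ("ENSG00000196954", "CASP4", "NP_001216"),
    ("ENSG00000132906", "CASP9", "NP_001220"),
    ("ENSG00000105141", "CASP14", "NP_036246"),
    ("ENSG00000165806", "CASP7", "NP_203126"),
    ("ENSG00000064012", "CASP8", "NP_203519"),
    ("ENSG00000064012", "CASP8", "NP_203520"),
]

# Group RefSeq IDs under the gene they map to.
by_gene = defaultdict(list)
for gene_id, symbol, refseq_id in rows:
    by_gene[symbol].append(refseq_id)

# Genes reached by more than one input ID, and inputs that returned nothing.
multi_mapped = {s: ids for s, ids in by_gene.items() if len(ids) > 1}
returned = {refseq_id for _, _, refseq_id in rows}
unmatched = [i for i in input_ids if i not in returned]

print(multi_mapped)  # {'CASP8': ['NP_203519', 'NP_203520']}
print(unmatched)     # ['NP_203522']
```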
Exercise 3: Cross-Species Orthologue Search
Q: Find human orthologues for these Ciona savignyi genes:
ENSCSAVG00000000002, ENSCSAVG00000000003, ENSCSAVG00000000006, ENSCSAVG00000000007, ENSCSAVG00000000009, ENSCSAVG00000000011
Answer:
BioMart Setup:
| Step | What to do |
|---|---|
| Dataset | Ciona savignyi genes (NOT Human!) |
| Filters | Gene stable ID(s) → paste the ENSCSAVG IDs |
| Attributes | Gene stable ID, Human orthologue gene ID, Human gene name, %id target Human |
⚠️ Use Orthologue (cross-species), NOT Paralogue (same species)!
Results:
| C. savignyi Gene ID | Human Gene ID | Human Gene Name | % Identity |
|---|---|---|---|
| ENSCSAVG00000000002 | ENSG00000156026 | MCU | 55.1% |
| ENSCSAVG00000000003 | ENSG00000169435 | RASSF6 | 29.6% |
| ENSCSAVG00000000003 | ENSG00000101265 | RASSF2 | 35.4% |
| ENSCSAVG00000000003 | ENSG00000107551 | RASSF4 | 33.1% |
| ENSCSAVG00000000007 | ENSG00000145416 | MARCHF1 | 58.8% |
| ENSCSAVG00000000009 | ENSG00000171865 | RNASEH1 | 39.4% |
| ENSCSAVG00000000011 | ENSG00000146856 | AGBL3 | 69.1% |
(Note: ENSCSAVG00000000003 maps to multiple RASSF family members — gene family expansion!)
(Note: ENSCSAVG00000000006 has no human orthologue)
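When one gene has several orthologues, a common first pass is to keep the hit with the highest % identity. The sketch below does that for the rows in the results table. It is only a convenience for summarising: all of the one-to-many hits are real orthologues, and the highest-scoring one is not automatically the "right" one for your question.

```python
# (C. savignyi gene ID, human gene name, % identity), copied from the table above.
orthologues = [
    ("ENSCSAVG00000000002", "MCU", 55.1),
    ("ENSCSAVG00000000003", "RASSF6", 29.6),
    ("ENSCSAVG00000000003", "RASSF2", 35.4),
    ("ENSCSAVG00000000003", "RASSF4", 33.1),
    ("ENSCSAVG00000000007", "MARCHF1", 58.8),
    ("ENSCSAVG00000000009", "RNASEH1", 39.4),
    ("ENSCSAVG00000000011", "AGBL3", 69.1),
]


def best_hits(rows):
    """Keep, per query gene, the human hit with the highest % identity."""
    best = {}
    for query_id, human_name, identity in rows:
        if query_id not in best or identity > best[query_id][1]:
            best[query_id] = (human_name, identity)
    return best


best = best_hits(orthologues)
print(best["ENSCSAVG00000000003"])  # ('RASSF2', 35.4)
```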
Exercise 4: MYH9 Gene Exploration
Q: For the human MYH9 gene:
- What's the Ensembl code? How many transcripts? All protein-coding? Forward or reverse strand?
- What's the MANE Select transcript code? CCDS code? RefSeq codes?
- Chromosomal coordinates? Cytogenetic location?
- Zoom to exon 17 (22:36,306,051-36,305,930). Any variants annotated in both ClinVar and OMIM? Check rs80338828.
Answer:
- Ensembl Gene ID: ENSG00000100345. Transcripts: multiple (check the current count, since it changes between releases). Not all are protein-coding; some are processed transcripts, nonsense-mediated decay, etc. Strand: Reverse (-).
- MANE Select: ENST00000216181 (MYH9-201). CCDS: CCDS14099. RefSeq: NM_002473 (mRNA), NP_002464 (protein).
- Coordinates: Chr22:36,281,270-36,393,331 (GRCh38). Cytogenetic: 22q12.3.
- rs80338828: Yes, annotated in both ClinVar and OMIM. Associated with MYH9-related disorders (May-Hegglin anomaly, etc.).
Quick Reference: BioMart Checklist
□ Selected correct dataset (species of your INPUT IDs)
□ Pasted IDs in the CORRECT filter field (match ID prefix!)
□ Used text input field, not just checkbox
□ Selected Orthologue (not Paralogue) for cross-species queries
□ Included filter column in attributes (for verification)
□ Checked "Unique results only" if needed
□ Tested with small subset before full export
- BioMart can be slow with large queries — be patient or split into batches
- Always double-check your assembly version (GRCh37 vs GRCh38)
- For programmatic access, use the Ensembl REST API instead
- Video tutorial: EBI BioMart Tutorial
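For the REST API route mentioned above, here is a minimal sketch using only the Python standard library. It targets the `lookup/id` endpoint and lets you pick the server by assembly, which ties in with the GRCh37 vs GRCh38 tip. The function names are my own, nothing is fetched until you call `lookup()`, and you should check the REST documentation for the fields returned in your release.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# The main server serves GRCh38; GRCh37 has its own dedicated server.
SERVERS = {
    "GRCh38": "https://rest.ensembl.org",
    "GRCh37": "https://grch37.rest.ensembl.org",
}


def lookup_url(stable_id, assembly="GRCh38"):
    """Build the lookup URL for one stable ID on the chosen assembly."""
    return f"{SERVERS[assembly]}/lookup/id/{quote(stable_id)}"


def lookup(stable_id, assembly="GRCh38", timeout=30):
    """Fetch the lookup record as a dict (needs network access)."""
    request = Request(
        lookup_url(stable_id, assembly),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(lookup_url("ENSG00000100345"))
```

Calling `lookup("ENSG00000100345")` should return a record for MYH9 with fields such as the display name, coordinates, and strand, which you can compare against what the gene page shows.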
TL;DR
- Ensembl = genome browser + database for genes, transcripts, variants, orthologues
- IDs: ENSG (gene), ENST (transcript), ENSP (protein) — learn to recognize them
- MANE Select = highest quality transcript annotation (use these when possible)
- BioMart = bulk query tool: Dataset → Filters → Attributes → Export
Avoid these mistakes:
- Don't paste RefSeq/UniProt IDs in "Gene stable ID" field — use EXTERNAL filters
- Use the text input field, not just checkboxes
- Orthologue = cross-species, Paralogue = same species
- Start with the species of your INPUT IDs as your dataset
- Always include your filter column in output attributes
Now go explore some genomes! 🧬