Ensembl: A Practical Guide

So you need to look up genes, transcripts, variants, or convert IDs between databases? Welcome to Ensembl — the genome browser that bioinformaticians actually use daily.

What is Ensembl?

Ensembl is a genome browser and database launched in 1999 as a joint project between the EBI (European Bioinformatics Institute) and the Wellcome Trust Sanger Institute. Think of it as Google Maps, but for genomes.

What it gives you:

  • Gene sets (splice variants, proteins, ncRNAs)
  • Comparative genomics (alignments, protein trees, orthologues)
  • Variation data (SNPs, InDels, CNVs)
  • BioMart for bulk data export
  • REST API for programmatic access
  • Everything is open source
ℹ️
The Human Reference Genome

The current human reference assembly is GRCh38.p14, maintained by the Genome Reference Consortium. The original Human Genome Project finished in 2003 after 13 years and roughly $3 billion. Now you can access it for free in seconds. Science is wild.


Ensembl Stable Identifiers

This is the ID system you'll see everywhere. Memorize the prefixes:

| Prefix | Meaning | Example |
|--------|---------|---------|
| ENSG | Gene ID | ENSG00000141510 |
| ENST | Transcript ID | ENST00000269305 |
| ENSP | Peptide/Protein ID | ENSP00000269305 |
| ENSE | Exon ID | ENSE00001146308 |
| ENSR | Regulatory Feature | ENSR00000000001 |
| ENSFM | Protein Family | ENSFM00250000000001 |
💡
Non-Human Species

For other species, a short species code (usually three or four letters) is inserted after ENS: ENSMUSG (mouse), ENSDARG (zebrafish), ENSCSAVG (Ciona savignyi), etc.
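
The prefix scheme is regular enough to parse in a script. Here's a minimal sketch, assuming the pattern is ENS + optional species code + feature letters + digits (plus an optional `.version` suffix). The names `parse_stable_id` and `FEATURE_TYPES` are made up for this example, and IDs that Ensembl imports from other databases won't match this pattern.

```python
import re

# Feature codes taken from the prefix table above.
FEATURE_TYPES = {
    "G": "gene",
    "T": "transcript",
    "P": "protein",
    "E": "exon",
    "R": "regulatory feature",
    "FM": "protein family",
}

# ENS, optional 3-4 letter species code, feature code, digits, optional version.
_STABLE_ID = re.compile(r"^ENS([A-Z]{3,4})?(FM|[GTPER])(\d+)(?:\.(\d+))?$")


def parse_stable_id(stable_id):
    """Split an Ensembl-style stable ID into its parts, or return None."""
    match = _STABLE_ID.match(stable_id.strip())
    if match is None:
        return None
    species, feature, number, version = match.groups()
    return {
        "species_code": species or "",  # empty string: no species code (human)
        "feature": FEATURE_TYPES[feature],
        "number": number,
        "version": int(version) if version else None,
    }


for example in ("ENSG00000141510", "ENSCSAVG00000000002",
                "ENSFM00250000000001", "NP_001214"):
    print(example, "->", parse_stable_id(example))
```

Anything that comes back as `None` (like the RefSeq ID in the last example) is not an Ensembl stable ID, which matters later when choosing BioMart filters.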


Transcript Quality Tiers

Not all transcripts are created equal. Here's the hierarchy:

MANE Select (Gold Standard) 🥇

  • Matched Annotation from NCBI and EMBL-EBI (that's what MANE stands for)
  • Perfectly aligned to GRCh38
  • Complete sequence identity with RefSeq
  • This is your go-to transcript

Merged (Ensembl/Havana) 🥈

  • Automatically annotated + manually curated
  • High confidence

CCDS (Consensus CDS)

  • Collaborative effort for consistent protein-coding annotations
  • Shared between NCBI, EBI, UCSC, and others

Ensembl Protein Coding (Red)

  • Automatic annotation based on mRNA/protein evidence
  • Good, but not manually verified
⚠️
Always Check Transcript Quality

When doing variant analysis, prefer MANE Select transcripts. Using a low-confidence transcript can give you wrong coordinates or missed variants.
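
One way to apply that preference in a script is to rank candidate transcripts by tier and take the best one available. This is only a sketch: the `tier` labels and the record layout are hypothetical stand-ins for whatever your annotation source provides, and the ordering simply mirrors the list above.

```python
# Lower rank = preferred. Order follows the tiers described above.
TIER_RANK = {
    "MANE Select": 0,
    "Merged": 1,
    "CCDS": 2,
    "Ensembl protein coding": 3,
}


def pick_transcript(transcripts):
    """Return the transcript record with the best (lowest) tier rank.

    Each record is a dict with the hypothetical keys "id" and "tier".
    Unknown tiers sort last; ties keep the input order.
    """
    if not transcripts:
        return None
    return min(transcripts, key=lambda t: TIER_RANK.get(t["tier"], len(TIER_RANK)))


# Placeholder records, not real annotation.
candidates = [
    {"id": "transcript_A", "tier": "Ensembl protein coding"},
    {"id": "transcript_B", "tier": "MANE Select"},
    {"id": "transcript_C", "tier": "Merged"},
]
print(pick_transcript(candidates)["id"])
```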


Using the Ensembl Browser

Basic Navigation

  1. Go to ensembl.org
  2. Search by: gene name, Ensembl ID, coordinates, or variant ID (rs number)
  3. Gene page shows: location, transcripts, variants, orthologues, etc.

Key Information You Can Find

For any gene (e.g., MYH9):

  • Ensembl Gene ID → ENSG00000100345
  • Chromosomal coordinates → 22:36,281,270-36,393,331
  • Cytogenetic location → 22q12.3
  • Strand → Forward (+) or Reverse (-)
  • Number of transcripts → and which are protein-coding
  • MANE Select transcript → with CCDS and RefSeq cross-references
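
The same fields are available programmatically. The sketch below builds a request for the Ensembl REST API's `lookup/symbol` endpoint and formats the location from the JSON it returns. The helper names are invented for this example. The field names (`seq_region_name`, `start`, `end`, `strand`) are the ones I'd expect in the response, so check them against the REST documentation. The sample record reuses the MYH9 values quoted in this guide rather than a live response.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def lookup_url(symbol, species="homo_sapiens"):
    """Build the URL for a gene-symbol lookup on the Ensembl REST API."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    return REST_SERVER + path + "?content-type=application/json"


def fetch_gene(symbol, species="homo_sapiens"):
    """Fetch the lookup record (needs network access; not called here)."""
    with urllib.request.urlopen(lookup_url(symbol, species), timeout=30) as resp:
        return json.load(resp)


def format_location(record):
    """Render a lookup record as 'chrom:start-end (strand)'."""
    strand = "+" if record["strand"] == 1 else "-"
    return "{}:{:,}-{:,} ({})".format(
        record["seq_region_name"], record["start"], record["end"], strand
    )


# Sample record built from the MYH9 values quoted in this guide.
sample = {"id": "ENSG00000100345", "seq_region_name": "22",
          "start": 36281270, "end": 36393331, "strand": -1}
print(lookup_url("MYH9"))
print(format_location(sample))
```

With network access, `format_location(fetch_gene("MYH9"))` should give the live coordinates for the current release.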

Viewing Variants

  1. Navigate to your gene
  2. Go to "Variant table" or zoom into a specific region
  3. Filter by: consequence type, clinical significance (ClinVar), etc.
  4. Click on any variant (e.g., rs80338828) to see:
    • Alleles and frequencies
    • Consequence (missense, synonymous, etc.)
    • Clinical annotations (ClinVar, OMIM)
    • Population frequencies

BioMart: Bulk Data Queries

BioMart is where Ensembl gets powerful. No programming required — it's a web interface for mining data in bulk.

Access: ensembl.org → BioMart (top menu)

The Three-Step Process

1. DATASET    → Choose species/database (e.g., Human genes GRCh38.p14)
2. FILTERS    → Narrow down what you want (gene list, chromosome, biotype...)
3. ATTRIBUTES → Choose what columns to export (IDs, names, sequences...)
💻
Workflow Example: ID Conversion

Goal: Convert RefSeq protein IDs to Ensembl Gene IDs

  1. Dataset: Human genes (GRCh38.p14)
  2. Filters → External References → RefSeq peptide ID → paste your list
  3. Attributes: Gene stable ID, Gene name, RefSeq peptide ID
  4. Results → Export as CSV/TSV/HTML
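
The same three steps map directly onto the XML query that BioMart accepts, which is handy once a query needs to be repeated. A minimal sketch: `build_query` is a helper written for this example, and the internal names (`hsapiens_gene_ensembl`, `refseq_peptide`, `ensembl_gene_id`, `external_gene_name`) are what I'd expect for the human gene mart. Confirm them with the XML button on the BioMart results page before relying on them.

```python
import urllib.parse
import xml.etree.ElementTree as ET

MARTSERVICE = "https://www.ensembl.org/biomart/martservice"


def build_query(dataset, filters, attributes):
    """Return a BioMart XML query string: dataset -> filters -> attributes."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, values in filters.items():
        # Multiple IDs go into one comma-separated value.
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


xml_query = build_query(
    dataset="hsapiens_gene_ensembl",
    filters={"refseq_peptide": ["NP_001214", "NP_001216"]},
    attributes=["ensembl_gene_id", "external_gene_name", "refseq_peptide"],
)
# URL to request (e.g. with urllib.request.urlopen) to run the query.
request_url = MARTSERVICE + "?" + urllib.parse.urlencode({"query": xml_query})
print(xml_query)
```

Note that the filter also appears in the attribute list, so every output row can be traced back to the ID that produced it.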

⚠️ Common BioMart Mistakes (And How to Avoid Them)

These will save you hours of frustration. Learn from pain.

Mistake #1: Pasting IDs in the Wrong Filter Field

⚠️
The Classic Blunder

You have RefSeq IDs (NP_001214, NP_001216...) and you paste them into "Gene stable ID(s)" field. Result? Empty results.

Why it happens: The "Gene stable ID(s)" field expects Ensembl IDs (ENSG...), not RefSeq IDs.

The fix:

| ID Type | Where to Paste |
|---------|----------------|
| ENSG00000xxxxx | Filters → GENE → Gene stable ID(s) |
| NP_xxxxxx (RefSeq protein) | Filters → EXTERNAL → RefSeq peptide ID(s) |
| NM_xxxxxx (RefSeq mRNA) | Filters → EXTERNAL → RefSeq mRNA ID(s) |
| P12345 (UniProt) | Filters → EXTERNAL → UniProtKB/Swiss-Prot ID(s) |
💡
Rule of Thumb

Look at your ID prefix. If it's NOT "ENS...", you need to find the matching field under EXTERNAL → External References.
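
The rule of thumb can be turned into a quick pre-flight check before pasting a list. A rough sketch, assuming the ID shapes from the table above. `suggest_filter` is a made-up helper, and the UniProt pattern is deliberately simplified to the six-character `P12345` style, so it will miss other accession formats.

```python
import re

# (pattern, filter path) pairs, following the table above.
FILTER_RULES = [
    (re.compile(r"^ENS[A-Z]*G\d+"), "Filters → GENE → Gene stable ID(s)"),
    (re.compile(r"^NP_\d+"), "Filters → EXTERNAL → RefSeq peptide ID(s)"),
    (re.compile(r"^NM_\d+"), "Filters → EXTERNAL → RefSeq mRNA ID(s)"),
    # Simplified UniProt accession shape, e.g. P12345.
    (re.compile(r"^[OPQ]\d[A-Z0-9]{3}\d$"),
     "Filters → EXTERNAL → UniProtKB/Swiss-Prot ID(s)"),
]


def suggest_filter(identifier):
    """Return the BioMart filter field suggested for one ID, or None."""
    cleaned = identifier.strip()
    for pattern, field in FILTER_RULES:
        if pattern.match(cleaned):
            return field
    return None


for example in ("ENSG00000137752", "NP_001214", "NM_002473", "P12345", "rs80338828"):
    print(example, "->", suggest_filter(example))
```

A `None` result (as for the rs number) means the ID type is not covered by these rules, so look for a matching field by hand.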


Mistake #2: Checkbox vs Text Input Confusion

Some filter options have both a checkbox AND a text field:

☑ With RefSeq peptide ID(s): Only    ← Checkbox (just filters for genes that HAVE RefSeq IDs)
[________________________]           ← Text field (where you paste YOUR specific IDs)

The mistake: Checking the box but not pasting IDs in the text field.

What happens:

  • Checkbox alone = "Give me all genes that have ANY RefSeq ID" (thousands of results)
  • Text field = "Give me only genes matching THESE specific RefSeq IDs" (your actual query)

The fix: Always paste your ID list in the text input field, not just check the box.


Mistake #3: Orthologue vs Paralogue Mix-up

⚠️
Know the Difference!

You want to find human equivalents of Ciona genes. You select Paralogue %id. Result? Wrong data or empty results.

| Term | Meaning | Use When |
|------|---------|----------|
| Orthologue | Same gene in different species (separated by speciation) | Ciona gene → Human equivalent |
| Paralogue | Related gene in the same species (separated by duplication) | Human HBA1 → Human HBB |

The fix:

For cross-species queries (e.g., Ciona → Human):

Attributes → Homologues → Human Orthologues
    ✓ Human gene stable ID
    ✓ Human gene name
    ✓ %id. target Human gene identical to query gene

NOT:

Attributes → Homologues → Paralogues   ← WRONG for cross-species!

Mistake #4: Forgetting to Include Filter Column in Attributes

The scenario: You filter by RefSeq peptide ID, but don't include it in your output attributes.

What happens: You get a list of Ensembl IDs with no way to match them back to your original input!

| Gene stable ID | Gene name |
|----------------|-----------|
| ENSG00000137752 | CASP1 |
| ENSG00000196954 | CASP4 |

Wait... which RefSeq ID was CASP1 again? 🤷

The fix: Always include your filter field as an output attribute:

Attributes:
    ✓ Gene stable ID
    ✓ Gene name
    ✓ RefSeq peptide ID    ← Include this for verification!

Now you get:

| Gene stable ID | Gene name | RefSeq peptide ID |
|----------------|-----------|-------------------|
| ENSG00000137752 | CASP1 | NP_001214 |
| ENSG00000196954 | CASP4 | NP_001216 |

Much better!
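
Once the filter column is in the export, matching results back to your input is a dictionary lookup, and it also shows which inputs returned nothing. A small sketch using the two rows above. `match_back` is a name invented here, and `NP_000000` is a made-up placeholder ID added only to show a miss.

```python
import csv
import io

# TSV as exported from BioMart; rows copied from the table above.
exported = (
    "Gene stable ID\tGene name\tRefSeq peptide ID\n"
    "ENSG00000137752\tCASP1\tNP_001214\n"
    "ENSG00000196954\tCASP4\tNP_001216\n"
)

# NP_000000 is a made-up placeholder to show an input with no hit.
submitted = ["NP_001214", "NP_001216", "NP_000000"]


def match_back(tsv_text, input_ids, key="RefSeq peptide ID"):
    """Map each input ID to its result rows and list inputs with no row."""
    hits = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        hits.setdefault(row[key], []).append(row)
    missing = [i for i in input_ids if i not in hits]
    return hits, missing


hits, missing = match_back(exported, submitted)
for refseq_id in submitted:
    names = [row["Gene name"] for row in hits.get(refseq_id, [])]
    print(refseq_id, "->", names or "no result")
```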


Mistake #5: Wrong Dataset for Cross-Species Queries

The scenario: You want human orthologues of Ciona genes. You select "Human genes" as your dataset.

What happens: You can't input Ciona gene IDs because you're in the Human database!

The fix: Start from the source species:

Dataset: Ciona savignyi genes    ← Start here (your input species)
Filters: Gene stable ID → paste Ciona IDs
Attributes: 
    - Gene stable ID (Ciona)
    - Human orthologue gene ID    ← Get human data as attributes
    - Human gene name

Rule: Dataset = species of your INPUT IDs. Other species come through Homologues attributes.
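
If you script your queries, the rule can be enforced before anything is sent. A hedged sketch: `dataset_for` is a hypothetical helper, and the dataset names in the table are the internal BioMart names I'd expect for these species, so confirm them in BioMart before use.

```python
import re

# Gene ID prefix -> BioMart dataset name. Names are assumed, not verified here.
DATASETS = {
    "ENSG": "hsapiens_gene_ensembl",
    "ENSMUSG": "mmusculus_gene_ensembl",
    "ENSDARG": "drerio_gene_ensembl",
    "ENSCSAVG": "csavignyi_gene_ensembl",
}


def dataset_for(gene_ids):
    """Return the one dataset matching all gene IDs, else raise ValueError."""
    prefixes = set()
    for gene_id in gene_ids:
        match = re.match(r"^(ENS[A-Z]*G)\d+$", gene_id.strip())
        if match is None:
            raise ValueError("not an Ensembl gene ID: " + gene_id)
        prefixes.add(match.group(1))
    if not prefixes:
        raise ValueError("no IDs given")
    if len(prefixes) > 1:
        raise ValueError("IDs from several species: " + ", ".join(sorted(prefixes)))
    prefix = prefixes.pop()
    if prefix not in DATASETS:
        raise ValueError("no dataset listed for prefix " + prefix)
    return DATASETS[prefix]


print(dataset_for(["ENSCSAVG00000000002", "ENSCSAVG00000000003"]))
```

A mixed list (say, human and Ciona IDs together) raises an error instead of silently returning partial results.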


BioMart Mistakes Cheatsheet

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Empty results | IDs in wrong filter field | Match ID prefix to correct filter (EXTERNAL for non-Ensembl IDs) |
| Way too many results | Used checkbox without text input | Paste specific IDs in the text field |
| Wrong species data | Selected Paralogue instead of Orthologue | Use Orthologue for cross-species |
| Can't match results to input | Didn't include filter column in output | Add your filter field to Attributes |
| Can't input your IDs | Wrong dataset selected | Dataset = species of your INPUT IDs |

Common BioMart Queries

Query Type 1: ID Conversion

RefSeq → Ensembl + HGNC Symbol

| Step | Action |
|------|--------|
| Dataset | Human genes (GRCh38.p14) |
| Filters | EXTERNAL → RefSeq peptide ID(s) → paste list |
| Attributes | Gene stable ID, HGNC symbol, RefSeq peptide ID |

Query Type 2: Finding Orthologues

Find human orthologues of genes from another species

| Step | Action |
|------|--------|
| Dataset | Source species (e.g., Ciona savignyi genes) |
| Filters | Gene stable ID → paste your list |
| Attributes | Gene stable ID, Human orthologue gene ID, Human gene name, % identity |
⚠️
Remember

Orthologue = cross-species. Paralogue = same species. Don't mix them up!


Query Type 3: Variant Export

Get all missense variants for a gene list

| Step | Action |
|------|--------|
| Dataset | Human genes (GRCh38.p14) |
| Filters | Gene name → your list; Variant consequence → missense_variant |
| Attributes | Gene name, Variant name (rs ID), Consequence, Amino acid change |

Query Type 4: Find Genes with PDB Structures

Count/export genes that have associated 3D structures

| Step | Action |
|------|--------|
| Dataset | Human genes (GRCh38.p14) |
| Filters | With PDB ID → Only |
| Attributes | Gene stable ID, Gene name, PDB ID, UniProtKB/Swiss-Prot ID |

Practice Exercises

Exercise 1: SNP Nucleotide Lookup

Q: In Ensembl, consider the SNP variation rs80338826. Which is the DNA nucleotide triplet coding for the wild-type amino acid residue (transcript MYH9-201)?

Answer: The triplet is CGT (coding for Arginine).

How to find it:

  1. Search rs80338826 in Ensembl
  2. Go to the variant page
  3. Look at transcript MYH9-201 consequences
  4. Check the codon column for the reference allele
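
To double-check a codon read off the variant page, a tiny translator built from the standard genetic code is enough. The sketch below is for illustration only: `translate_codon` and `substitute` are made-up helpers, and the substitution shown is an arbitrary one, so look up the actual alternate allele on the variant page.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon):
    """Translate one DNA codon to its one-letter amino acid ('*' = stop)."""
    return CODON_TABLE[codon.upper()]


def substitute(codon, position, base):
    """Return the codon with one base replaced (position is 0-based)."""
    return codon[:position] + base + codon[position + 1:]


reference = "CGT"
print(reference, translate_codon(reference))  # CGT R (arginine)

# Arbitrary substitution chosen for illustration.
altered = substitute(reference, 1, "A")
print(altered, translate_codon(altered))  # CAT H (histidine)
```

Keep in mind that codons are read in transcript orientation, so for a reverse-strand gene they are the reverse complement of the genomic forward-strand sequence.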

Exercise 2: RefSeq to Ensembl Conversion

Q: Convert these RefSeq protein IDs to Ensembl Gene IDs and HGNC symbols:

NP_203126, NP_001214, NP_001216, NP_001220
NP_036246, NP_203519, NP_203520, NP_203522

Answer:

BioMart Setup:

| Step | What to do |
|------|------------|
| Dataset | Human genes (GRCh38.p14) |
| Filters | EXTERNAL → RefSeq peptide ID(s) → paste the NP_ IDs |
| Attributes | Gene stable ID, HGNC symbol, RefSeq peptide ID |

⚠️ Don't paste NP_ IDs in "Gene stable ID" field — that's for ENSG IDs only!

Results:

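| Gene stable ID | HGNC symbol | RefSeq peptide ID |
|----------------|-------------|-------------------|
| ENSG00000137752 | CASP1 | NP_001214 |
| ENSG00000196954 | CASP4 | NP_001216 |
| ENSG00000132906 | CASP9 | NP_001220 |
| ENSG00000105141 | CASP14 | NP_036246 |
| ENSG00000165806 | CASP7 | NP_203126 |
| ENSG00000064012 | CASP8 | NP_203519 |
| ENSG00000064012 | CASP8 | NP_203520 |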

(Notice: CASP8 has multiple RefSeq IDs mapping to it — different isoforms!)


Exercise 3: Finding Human Orthologues

Q: Find human orthologues for these Ciona savignyi genes:

ENSCSAVG00000000002, ENSCSAVG00000000003, ENSCSAVG00000000006
ENSCSAVG00000000007, ENSCSAVG00000000009, ENSCSAVG00000000011

Answer:

BioMart Setup:

| Step | What to do |
|------|------------|
| Dataset | Ciona savignyi genes (NOT Human!) |
| Filters | Gene stable ID(s) → paste the ENSCSAVG IDs |
| Attributes | Gene stable ID, Human orthologue gene ID, Human gene name, %id target Human |

⚠️ Use Orthologue (cross-species), NOT Paralogue (same species)!

Results:

| C. savignyi Gene ID | Human Gene ID | Human Gene Name | % Identity |
|---------------------|---------------|-----------------|------------|
| ENSCSAVG00000000002 | ENSG00000156026 | MCU | 55.1% |
| ENSCSAVG00000000003 | ENSG00000169435 | RASSF6 | 29.6% |
| ENSCSAVG00000000003 | ENSG00000101265 | RASSF2 | 35.4% |
| ENSCSAVG00000000003 | ENSG00000107551 | RASSF4 | 33.1% |
| ENSCSAVG00000000007 | ENSG00000145416 | MARCHF1 | 58.8% |
| ENSCSAVG00000000009 | ENSG00000171865 | RNASEH1 | 39.4% |
| ENSCSAVG00000000011 | ENSG00000146856 | AGBL3 | 69.1% |

(Note: ENSCSAVG00000000003 maps to multiple RASSF family members — gene family expansion!)

(Note: ENSCSAVG00000000006 has no human orthologue)
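
Both notes fall out automatically if you group the exported rows by query gene. A short sketch using the rows from the results above. `summarise` is a name chosen for this example, and hits are sorted by % identity purely for readability.

```python
from collections import defaultdict

# (Ciona gene, human gene name, % identity) rows copied from the results above.
rows = [
    ("ENSCSAVG00000000002", "MCU", 55.1),
    ("ENSCSAVG00000000003", "RASSF6", 29.6),
    ("ENSCSAVG00000000003", "RASSF2", 35.4),
    ("ENSCSAVG00000000003", "RASSF4", 33.1),
    ("ENSCSAVG00000000007", "MARCHF1", 58.8),
    ("ENSCSAVG00000000009", "RNASEH1", 39.4),
    ("ENSCSAVG00000000011", "AGBL3", 69.1),
]

# The six IDs from the exercise question.
queried = [
    "ENSCSAVG00000000002", "ENSCSAVG00000000003", "ENSCSAVG00000000006",
    "ENSCSAVG00000000007", "ENSCSAVG00000000009", "ENSCSAVG00000000011",
]


def summarise(result_rows, query_ids):
    """Group orthologue hits per query gene, best % identity first."""
    by_gene = defaultdict(list)
    for ciona_id, human_name, pct in result_rows:
        by_gene[ciona_id].append((human_name, pct))
    return {
        gene: sorted(by_gene.get(gene, []), key=lambda hit: hit[1], reverse=True)
        for gene in query_ids
    }


summary = summarise(rows, queried)
for gene, hits in summary.items():
    if not hits:
        print(gene, "-> no human orthologue")
    elif len(hits) > 1:
        print(gene, "-> one-to-many:", hits)
    else:
        print(gene, "->", hits[0])
```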


Exercise 4: MYH9 Gene Exploration

Q: For the human MYH9 gene:

  1. What's the Ensembl code? How many transcripts? All protein-coding? Forward or reverse strand?
  2. What's the MANE Select transcript code? CCDS code? RefSeq codes?
  3. Chromosomal coordinates? Cytogenetic location?
  4. Zoom to exon 17 (22:36,306,051-36,305,930). Any variants annotated in both ClinVar and OMIM? Check rs80338828.

Answer:

  1. Ensembl Gene ID: ENSG00000100345
    Transcripts: Multiple (check current count — it changes between releases)
    Not all protein-coding — some are processed transcripts, nonsense-mediated decay, etc.
    Strand: Reverse (-)

  2. MANE Select: ENST00000216181 (MYH9-201)
    CCDS: CCDS14099
    RefSeq: NM_002473 (mRNA), NP_002464 (protein)

  3. Coordinates: Chr22:36,281,270-36,393,331 (GRCh38)
    Cytogenetic: 22q12.3

  4. rs80338828: Yes, annotated in both ClinVar and OMIM
    Associated with MYH9-related disorders (May-Hegglin anomaly, etc.)


Quick Reference: BioMart Checklist

□ Selected correct dataset (species of your INPUT IDs)
□ Pasted IDs in the CORRECT filter field (match ID prefix!)
□ Used text input field, not just checkbox
□ Selected Orthologue (not Paralogue) for cross-species queries
□ Included filter column in attributes (for verification)
□ Checked "Unique results only" if needed
□ Tested with small subset before full export
📝
Pro Tips
  • BioMart can be slow with large queries — be patient or split into batches
  • Always double-check your assembly version (GRCh37 vs GRCh38)
  • For programmatic access, use the Ensembl REST API instead
  • Video tutorial: EBI BioMart Tutorial
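
For the batching tip, a few lines are enough to split a long ID list into chunks you can paste or submit one at a time. A minimal sketch: `batches` is a made-up helper and the default size of 500 is an arbitrary choice, not a documented BioMart limit.

```python
def batches(ids, size=500):
    """Yield chunks of at most `size` IDs, dropping duplicate IDs first."""
    unique = list(dict.fromkeys(ids))  # de-duplicate, keeping first-seen order
    for start in range(0, len(unique), size):
        yield unique[start:start + size]


# IDs taken from Exercise 2, with one deliberate duplicate.
id_list = ["NP_203126", "NP_001214", "NP_001216", "NP_001214", "NP_001220"]
for number, chunk in enumerate(batches(id_list, size=2), start=1):
    print("batch", number, ",".join(chunk))
```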

TL;DR

  • Ensembl = genome browser + database for genes, transcripts, variants, orthologues
  • IDs: ENSG (gene), ENST (transcript), ENSP (protein) — learn to recognize them
  • MANE Select = highest quality transcript annotation (use these when possible)
  • BioMart = bulk query tool: Dataset → Filters → Attributes → Export

Avoid these mistakes:

  1. Don't paste RefSeq/UniProt IDs in "Gene stable ID" field — use EXTERNAL filters
  2. Use the text input field, not just checkboxes
  3. Orthologue = cross-species, Paralogue = same species
  4. Start with the species of your INPUT IDs as your dataset
  5. Always include your filter column in output attributes

Now go explore some genomes! 🧬