Programmatic Access to Databases
Why Programmatic Access?
You've used web interfaces to search databases. But what if you need to:
- Query 500 proteins automatically
- Extract specific fields from thousands of entries
- Build a pipeline that updates daily
You need to talk to databases programmatically — through their APIs.
Part 1: How the Web Works
URLs
A URL (Uniform Resource Locator) is an address for a resource on the web:
https://www.rcsb.org/structure/4GYD
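To make the parts of that address concrete, here is a small sketch that splits the example URL with Python's standard library. Treating the last path segment as the PDB ID is an assumption that holds for this URL only, not a general rule.

```python
from urllib.parse import urlparse

url = "https://www.rcsb.org/structure/4GYD"
parts = urlparse(url)

print(parts.scheme)   # https  (the protocol)
print(parts.netloc)   # www.rcsb.org  (the server)
print(parts.path)     # /structure/4GYD  (the resource on that server)

# In this particular URL the last path segment is the PDB entry ID
pdb_id = parts.path.rstrip("/").split("/")[-1]
print(pdb_id)         # 4GYD
```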
HTTP Protocol
When your browser opens a page:
- Browser identifies the server from the URL
- Sends a request using HTTP (or HTTPS, the encrypted version)
- Server responds with content + status code
HTTP Methods:
- GET — retrieve data (what we'll mostly use)
- POST — send data to create/update
- PUT — update data
- DELETE — remove data
Status Codes
Every HTTP response includes a status code:
| Range | Meaning | Example |
|---|---|---|
| 1XX | Informational | 100 Continue |
| 2XX | Success | 200 OK |
| 3XX | Redirect | 301 Moved Permanently |
| 4XX | Client error | 404 Not Found |
| 5XX | Server error | 500 Internal Server Error |
Key rule: Always check that the status code is 200 (or at least in the 2XX range) before processing the response.
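The ranges in the table can be expressed in a few lines of Python. This is only an illustrative sketch: the helper names `describe_status` and `is_success` are made up for this example, and the reason phrases come from the standard library's `http.HTTPStatus`.

```python
from http import HTTPStatus

# Category names follow the table above, keyed by the first digit
CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirect",
    4: "Client error",
    5: "Server error",
}

def describe_status(code):
    """Return (category, reason phrase) for an HTTP status code."""
    category = CATEGORIES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code is not a registered HTTP status
        phrase = "Unknown"
    return category, phrase

def is_success(code):
    """True for any 2XX status code."""
    return 200 <= code < 300

for code in (200, 301, 404, 500):
    print(code, describe_status(code), is_success(code))
```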
Part 2: REST and JSON
REST
REST (REpresentational State Transfer) is an architectural style for web services.
A REST API lets you:
- Send HTTP requests to specific URLs
- Get structured data back
Most bioinformatics databases offer REST APIs: PDB, UniProt, NCBI, Ensembl.
JSON
JSON (JavaScript Object Notation) is the standard format for API responses.
Four rules:
- Data is in name/value pairs
- Data is separated by commas
- Curly braces {} hold objects (like Python dictionaries)
- Square brackets [] hold arrays (like Python lists)
Example:
{
  "entry_id": "4GYD",
  "resolution": 1.86,
  "chains": ["A", "B"],
  "ligands": [
    {"id": "CFF", "name": "Caffeine"},
    {"id": "HOH", "name": "Water"}
  ]
}
This maps directly to Python:
- {} → dictionary
- [] → list
- "text" → string
- numbers → int or float
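As a quick check of that mapping, the example record can be parsed with the standard `json` module. The variable names here are chosen just for this sketch.

```python
import json

text = '''{
  "entry_id": "4GYD",
  "resolution": 1.86,
  "chains": ["A", "B"],
  "ligands": [
    {"id": "CFF", "name": "Caffeine"},
    {"id": "HOH", "name": "Water"}
  ]
}'''

record = json.loads(text)            # JSON text -> Python objects

print(type(record))                  # <class 'dict'>
print(type(record["chains"]))        # <class 'list'>
print(type(record["resolution"]))    # <class 'float'>

# Arrays of objects become lists of dictionaries
ligand_names = [ligand["name"] for ligand in record["ligands"]]
print(ligand_names)                  # ['Caffeine', 'Water']
```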
Part 3: The requests Module
The third-party requests library (install it with pip install requests) makes HTTP requests simple.
Basic GET Request
import requests
res = requests.get('http://www.google.com')
print(res.status_code) # 200
Check Status Before Processing
res = requests.get('http://www.google.com')
if res.status_code == 200:
    print(res.text) # The HTML content
else:
    print(f"Error: {res.status_code}")
What Happens with Errors
r = requests.get('https://github.com/timelines.json')
print(r.status_code) # 404
print(r.text) # Error message from GitHub
Always check the status code. Don't assume success.
Getting JSON Responses
Most APIs return JSON. Convert it to a Python object (usually a dictionary, sometimes a list):
r = requests.get('https://some-api.com/data')
data = r.json() # Now it's a dictionary
print(type(data)) # <class 'dict'>
print(data.keys()) # See what's inside
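One way to combine the status check with JSON decoding is a small helper like the sketch below. The name `parse_json_response` is hypothetical, and the Content-Type test assumes the server labels JSON bodies with a type containing "json". `FakeResponse` is only a stand-in for a real `requests` response so the sketch runs without network access.

```python
def parse_json_response(response):
    """Return decoded JSON from a requests-style response, or None on failure."""
    if not (200 <= response.status_code < 300):
        print(f"Request failed with status {response.status_code}")
        return None
    content_type = response.headers.get("Content-Type", "")
    if "json" not in content_type.lower():
        print(f"Expected JSON but got {content_type!r}")
        return None
    try:
        return response.json()
    except ValueError:  # invalid JSON bodies raise a ValueError subclass
        print("Body could not be decoded as JSON")
        return None


class FakeResponse:
    """Stand-in exposing the same attributes used above (not part of requests)."""

    def __init__(self, status_code, content_type, payload=None):
        self.status_code = status_code
        self.headers = {"Content-Type": content_type}
        self._payload = payload

    def json(self):
        if self._payload is None:
            raise ValueError("no JSON body")
        return self._payload


ok = FakeResponse(200, "application/json", {"entry_id": "4GYD"})
missing = FakeResponse(404, "text/html")

print(parse_json_response(ok))        # {'entry_id': '4GYD'}
print(parse_json_response(missing))   # None
```

With the real library, the same helper would be called as `parse_json_response(requests.get(url))`.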
Part 4: PDB REST API
The Protein Data Bank has multiple APIs. Let's start with the REST API.
PDB Terminology
| Term | Meaning | Example |
|---|---|---|
| Entry | Complete structure from one experiment | 4GYD |
| Polymer Entity | One distinct polymer molecule (protein, DNA, RNA); it may appear as several chains | 4GYD entity 1 |
| Chemical Component | Small molecule, ligand, ion | CFF (caffeine) |
Get Entry Information
r = requests.get('https://data.rcsb.org/rest/v1/core/entry/4GYD')
data = r.json()
print(data.keys())
# dict_keys(['cell', 'citation', 'diffrn', 'entry', 'exptl', ...])
print(data['cell'])
# {'Z_PDB': 4, 'angle_alpha': 90.0, 'angle_beta': 90.0, ...}
Get Polymer Entity (Chain) Information
# 4GYD, entity 1
r = requests.get('https://data.rcsb.org/rest/v1/core/polymer_entity/4GYD/1')
data = r.json()
print(data['entity_poly'])
# Contains sequence, polymer type, etc.
Get PubMed Annotations
r = requests.get('https://data.rcsb.org/rest/v1/core/pubmed/4GYD')
data = r.json()
print(data['rcsb_pubmed_abstract_text'])
# The paper's abstract
Get Chemical Component Information
# CFF = Caffeine
r = requests.get('https://data.rcsb.org/rest/v1/core/chemcomp/CFF')
data = r.json()
print(data['chem_comp'])
# {'formula': 'C8 H10 N4 O2', 'formula_weight': 194.191, 'name': 'CAFFEINE', ...}
Get DrugBank Information
r = requests.get('https://data.rcsb.org/rest/v1/core/drugbank/CFF')
data = r.json()
print(data['drugbank_info']['description'])
# "A methylxanthine naturally occurring in some beverages..."
print(data['drugbank_info']['indication'])
# What the drug is used for
Get FASTA Sequence
Note: This returns plain text, not JSON.
r = requests.get('https://www.rcsb.org/fasta/entry/4GYD/download')
print(r.text)
# >4GYD_1|Chain A|...
# (the amino-acid sequence of that entity follows on the next line)
Process Multiple Proteins
protein_ids = ['4GYD', '4H0J', '4H0K']
protein_dict = dict()
for protein in protein_ids:
    r = requests.get(f'https://data.rcsb.org/rest/v1/core/entry/{protein}')
    # Skip entries that could not be retrieved
    if r.status_code != 200:
        print(f"Failed to get {protein}: {r.status_code}")
        continue
    data = r.json()
    protein_dict[protein] = data['cell']
# Print cell dimensions
for protein_id, cell in protein_dict.items():
    print(f"{protein_id}: a={cell['length_a']}, b={cell['length_b']}, c={cell['length_c']}")
Part 5: PDB Search API
The Search API lets you query across the entire PDB database.
Base URL: https://search.rcsb.org/rcsbsearch/v2/query?json=<query>
Important: The query must be URL-encoded.
URL Encoding
Special characters in URLs must be encoded. Use requests.utils.requote_uri():
my_query = '{"query": ...}' # JSON query string
encoded = requests.utils.requote_uri(my_query)
url = f'https://search.rcsb.org/rcsbsearch/v2/query?json={encoded}'
r = requests.get(url)
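To see what encoding actually does, the sketch below builds a query as a Python dictionary, serializes it with `json.dumps`, and percent-encodes it with the standard library's `urllib.parse.quote`. This is an alternative to `requests.utils.requote_uri()`, not what the text above uses. Building the query as a dictionary avoids hand-written JSON typos.

```python
import json
from urllib.parse import quote, unquote

# Search for entries containing caffeine (CFF)
query = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_nonpolymer_instance_annotation.comp_id",
            "operator": "exact_match",
            "value": "CFF",
        },
    },
    "return_type": "entry",
}

query_text = json.dumps(query)          # dictionary -> JSON string
encoded = quote(query_text, safe="")    # escape every reserved character
url = f"https://search.rcsb.org/rcsbsearch/v2/query?json={encoded}"

print(query_text[:20])   # {"query": {"type":
print(encoded[:32])      # %7B%22query%22%3A%20%7B%22type%22

# Decoding gives back exactly the same query
assert json.loads(unquote(encoded)) == query
```

Passing `params={"json": query_text}` to `requests.get()` is another option, since `requests` encodes parameter values itself.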
Sequence Similarity Search (BLAST-like)
Find structures with similar sequences:
fasta = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS"
my_query = '''{
"query": {
"type": "terminal",
"service": "sequence",
"parameters": {
"evalue_cutoff": 1,
"identity_cutoff": 0.9,
"sequence_type": "protein",
"value": "%s"
}
},
"request_options": {
"scoring_strategy": "sequence"
},
"return_type": "polymer_entity"
}''' % fasta
r = requests.get('https://search.rcsb.org/rcsbsearch/v2/query?json=%s' % requests.utils.requote_uri(my_query))
j = r.json()
print(f"Total matches: {j['total_count']}")
for item in j['result_set']:
    print(item['identifier'], "score =", item['score'])
Sequence Motif Search (PROSITE)
Find structures containing a specific motif:
# Zinc finger Cys2His2-like fold group
# PROSITE pattern: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
my_query = '''{
"query": {
"type": "terminal",
"service": "seqmotif",
"parameters": {
"value": "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H",
"pattern_type": "prosite",
"sequence_type": "protein"
}
},
"return_type": "polymer_entity"
}'''
r = requests.get('https://search.rcsb.org/rcsbsearch/v2/query?json=%s' % requests.utils.requote_uri(my_query))
j = r.json()
print(f"Total: {j['total_count']}, returned: {len(j['result_set'])}")
Search by Chemical Component
Find all entries containing caffeine:
my_query = '''{
"query": {
"type": "terminal",
"service": "text",
"parameters": {
"attribute": "rcsb_nonpolymer_instance_annotation.comp_id",
"operator": "exact_match",
"value": "CFF"
}
},
"return_type": "entry"
}'''
url = "https://search.rcsb.org/rcsbsearch/v2/query?json=%s" % requests.utils.requote_uri(my_query)
r = requests.get(url)
data = r.json()
pdb_ids = [row["identifier"] for row in data.get("result_set", [])]
print(f"Entries with caffeine: {len(pdb_ids)}")
print(pdb_ids)
Understanding the Response
j = r.json()
j.keys()
# dict_keys(['query_id', 'result_type', 'total_count', 'result_set'])
j['total_count'] # Total number of matches
j['result_set'] # List of results (may be paginated)
# Each result
j['result_set'][0]
# {'identifier': '4GYD_1', 'score': 1.0, ...}
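A response with that shape can be reduced to a sorted list of hits with a helper like the one sketched here. `summarize_results` is a hypothetical name, and the `sample` dictionary is hand-written to mimic the keys shown above. Apart from 4GYD_1, its identifiers and all scores are placeholders, not real search output.

```python
def summarize_results(response_json, min_score=0.0):
    """Return (total_count, [(identifier, score), ...]) sorted by descending score."""
    total = response_json.get("total_count", 0)
    hits = [
        (row.get("identifier"), row.get("score", 0.0))
        for row in response_json.get("result_set", [])
        if row.get("score", 0.0) >= min_score
    ]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return total, hits


# Hand-written stand-in for r.json(); values are placeholders
sample = {
    "query_id": "example-query",
    "result_type": "polymer_entity",
    "total_count": 3,
    "result_set": [
        {"identifier": "AAAA_1", "score": 0.42},
        {"identifier": "4GYD_1", "score": 1.0},
        {"identifier": "BBBB_2", "score": 0.87},
    ],
}

total, hits = summarize_results(sample, min_score=0.5)
print(total)   # 3
print(hits)    # [('4GYD_1', 1.0), ('BBBB_2', 0.87)]
```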
Part 6: PDB GraphQL API
GraphQL is a query language that lets you request exactly the fields you need.
Endpoint: https://data.rcsb.org/graphql
Interactive testing: https://data.rcsb.org/graphql/index.html (GraphiQL)
Why GraphQL?
- REST: multiple requests for related data
- GraphQL: one request that specifies exactly the fields you want
Basic Query
my_query = '''{
entry(entry_id: "4GYD") {
cell {
Z_PDB
angle_alpha
angle_beta
angle_gamma
length_a
length_b
length_c
volume
}
}
}'''
r = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(my_query))
j = r.json()
print(j.keys()) # dict_keys(['data'])
print(j['data'])
# {'entry': {'cell': {'Z_PDB': 4, 'angle_alpha': 90.0, ...}}}
Accessing the Data
params = j['data']['entry']['cell']
for key, value in params.items():
    print(f"{key}: {value}")
Query Multiple Entries
my_query = '''{
entries(entry_ids: ["4GYD", "4H0J", "4H0K"]) {
rcsb_id
cell {
length_a
length_b
length_c
}
}
}'''
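The multi-entry query can also be generated from a Python list and its response flattened into a lookup table, as sketched below. Both function names are hypothetical. The response shape (`data` → `entries`) is assumed by analogy with the single-entry response shown earlier, and the numbers in `sample` are placeholders rather than real cell dimensions.

```python
import json

def build_entries_query(pdb_ids):
    """Build a GraphQL query string for the cell lengths of several entries."""
    # json.dumps renders the list as ["4GYD", "4H0J"], a valid GraphQL list of strings
    ids = json.dumps(list(pdb_ids))
    return "{ entries(entry_ids: %s) { rcsb_id cell { length_a length_b length_c } } }" % ids

def cell_lengths(response_json):
    """Map each rcsb_id to an (a, b, c) tuple; missing cells give None values."""
    table = {}
    for entry in response_json["data"]["entries"]:
        cell = entry.get("cell") or {}
        table[entry["rcsb_id"]] = (
            cell.get("length_a"),
            cell.get("length_b"),
            cell.get("length_c"),
        )
    return table


query = build_entries_query(["4GYD", "4H0J"])
print(query)

# Hand-written stand-in for the decoded response; numbers are placeholders
sample = {
    "data": {
        "entries": [
            {"rcsb_id": "4GYD", "cell": {"length_a": 10.0, "length_b": 20.0, "length_c": 30.0}},
            {"rcsb_id": "4H0J", "cell": None},
        ]
    }
}
print(cell_lengths(sample))
```

The query string would be sent the same way as the single-entry query above, after URL encoding.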
Find UniProt Mappings
my_query = '''{
polymer_entity(entry_id: "4GYD", entity_id: "1") {
rcsb_polymer_entity_container_identifiers {
entry_id
entity_id
}
rcsb_polymer_entity_align {
aligned_regions {
entity_beg_seq_id
length
}
reference_database_name
reference_database_accession
}
}
}'''
Part 7: UniProt API
UniProt uses the Proteins REST API at https://www.ebi.ac.uk/proteins/api/
Important: Specify JSON Format
UniProt doesn't return JSON by default. You must request it:
headers = {"Accept": "application/json"}
requestURL = "https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession=P0A3X7&reviewed=true"
r = requests.get(requestURL, headers=headers)
j = r.json()
Response Structure
UniProt returns a list, not a dictionary:
type(j) # <class 'list'>
len(j) # Number of entries returned
# Access first entry
j[0].keys()
# dict_keys(['accession', 'id', 'proteinExistence', 'info', 'organism', ...])
Extract Gene Ontology Information
print(f"Accession: {j[0]['accession']}") # P0A3X7
print(f"ID: {j[0]['id']}") # CYC6_NOSS1
print("Gene Ontologies:")
for item in j[0]['dbReferences']:
    if item['type'] == "GO":
        print(f" {item['id']}: {item['properties']['term']}")
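The same extraction can be wrapped in a function that tolerates missing keys. This is a sketch under two assumptions: the name `go_terms` is invented here, and the `term` string is assumed to carry a one-letter aspect prefix such as `F:` before the term name. The `record` below is a hand-written stand-in, not real output for any accession.

```python
def go_terms(record):
    """Return [(go_id, aspect, name), ...] from one UniProt-style entry."""
    terms = []
    for ref in record.get("dbReferences", []):
        if ref.get("type") != "GO":
            continue
        term = ref.get("properties", {}).get("term", "")
        aspect, sep, name = term.partition(":")
        if not sep:  # no aspect prefix present
            aspect, name = "", term
        terms.append((ref.get("id"), aspect, name))
    return terms


# Hand-written stand-in for j[0]; not real API output
record = {
    "accession": "EXAMPLE",
    "dbReferences": [
        {"type": "GO", "id": "GO:0020037", "properties": {"term": "F:heme binding"}},
        {"type": "PDB", "id": "4GYD", "properties": {"method": "X-ray"}},
        {"type": "GO", "id": "GO:0009055", "properties": {"term": "F:electron transfer activity"}},
    ],
}

for go_id, aspect, name in go_terms(record):
    print(go_id, aspect, name)
```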
Part 8: NCBI API
NCBI also offers REST APIs for programmatic access.
Gene Information
headers = {'Accept': 'application/json'}
gene_id = 8291 # DYSF (dysferlin)
r = requests.get(f'https://api.ncbi.nlm.nih.gov/datasets/v1alpha/gene/id/{gene_id}', headers=headers)
j = r.json()
gene = j['genes'][0]['gene']
print(gene['description']) # dysferlin
print(gene['symbol']) # DYSF
print(gene['taxname']) # Homo sapiens
Part 9: Common Patterns
Pattern 1: Always Check Status
r = requests.get(url)
if r.status_code != 200:
    print(f"Error: {r.status_code}")
    print(r.text)
else:
    data = r.json()
    # process data
Pattern 2: Loop Through Multiple IDs
ids = ['4GYD', '4H0J', '4H0K']
results = {}
# pdb_id avoids shadowing Python's built-in id()
for pdb_id in ids:
    r = requests.get(f'https://data.rcsb.org/rest/v1/core/entry/{pdb_id}')
    if r.status_code == 200:
        results[pdb_id] = r.json()
    else:
        print(f"Failed to get {pdb_id}")
Pattern 3: Extract Specific Fields
# Get resolution for multiple structures
resolutions = {}
for pdb_id in ids:
    r = requests.get(f'https://data.rcsb.org/rest/v1/core/entry/{pdb_id}')
    # Skip entries that could not be retrieved
    if r.status_code != 200:
        continue
    data = r.json()
    # Navigate nested structure
    resolutions[pdb_id] = data['rcsb_entry_info']['resolution_combined'][0]
Pattern 4: Build URL with Parameters
base_url = "https://www.ebi.ac.uk/proteins/api/proteins"
params = {
'offset': 0,
'size': 10,
'accession': 'P0A3X7',
'reviewed': 'true'
}
# Build query string
query = '&'.join([f"{k}={v}" for k, v in params.items()])
url = f"{base_url}?{query}"
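The manual join works for simple values but does not escape special characters. A sketch of the same step using the standard library's `urlencode`, which does the escaping, follows. The `q` parameter in the second call is a made-up example, not a real UniProt parameter.

```python
from urllib.parse import urlencode

base_url = "https://www.ebi.ac.uk/proteins/api/proteins"
params = {
    "offset": 0,
    "size": 10,
    "accession": "P0A3X7",
    "reviewed": "true",
}

url = f"{base_url}?{urlencode(params)}"
print(url)
# https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession=P0A3X7&reviewed=true

# Values containing spaces or '&' are escaped automatically
print(urlencode({"q": "zinc finger & kinase"}))
# q=zinc+finger+%26+kinase
```

With `requests`, passing the dictionary directly as `requests.get(base_url, params=params)` has the same effect.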
Pattern 5: Handle Paginated Results
Search APIs often return limited results per page:
j = r.json()
print(f"Total: {j['total_count']}")
print(f"Returned: {len(j['result_set'])}")
# If total > returned, you need pagination
# Check API docs for how to request more pages
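As a rough sketch of what pagination can look like, the generator below produces one query per page. It assumes the RCSB Search API accepts a `paginate` object with `start` and `rows` under `request_options`. Check this against the current API documentation before relying on it. The function name `paginated_queries` is hypothetical.

```python
import copy

def paginated_queries(query, total_count, rows=25):
    """Yield a copy of the query for each page of `rows` results."""
    for start in range(0, total_count, rows):
        page = copy.deepcopy(query)  # leave the caller's query untouched
        # Assumed pagination syntax: request_options.paginate.{start, rows}
        page.setdefault("request_options", {})["paginate"] = {
            "start": start,
            "rows": rows,
        }
        yield page


base_query = {"return_type": "entry"}  # query body omitted for brevity
pages = list(paginated_queries(base_query, total_count=60, rows=25))

print(len(pages))   # 3
print([p["request_options"]["paginate"]["start"] for p in pages])   # [0, 25, 50]
```

Each page would then be serialized, URL-encoded, and sent like any other search query, with the `result_set` lists concatenated.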
API Summary
| Database | Base URL | JSON by default? | Notes |
|---|---|---|---|
| PDB REST | data.rcsb.org/rest/v1/core/ | Yes | Entry, entity, chemcomp |
| PDB Search | search.rcsb.org/rcsbsearch/v2/query | Yes | URL-encode query |
| PDB GraphQL | data.rcsb.org/graphql | Yes | Flexible queries |
| UniProt | ebi.ac.uk/proteins/api/ | No (need header) | Returns list |
| NCBI | api.ncbi.nlm.nih.gov/datasets/ | No (need header) | Gene, genome, etc. |
Quick Reference
requests Basics
import requests
# GET request
r = requests.get(url)
r = requests.get(url, headers={'Accept': 'application/json'})
# Check status
r.status_code # 200 = success
# Get response
r.text # As string
r.json() # As dictionary (if JSON)
URL Encoding
# For Search API queries
encoded = requests.utils.requote_uri(query_string)
url = f'https://search.rcsb.org/rcsbsearch/v2/query?json={encoded}'
PDB API URLs
# Entry info
f'https://data.rcsb.org/rest/v1/core/entry/{pdb_id}'
# Polymer entity
f'https://data.rcsb.org/rest/v1/core/polymer_entity/{pdb_id}/{entity_id}'
# Chemical component
f'https://data.rcsb.org/rest/v1/core/chemcomp/{ccd_id}'
# DrugBank
f'https://data.rcsb.org/rest/v1/core/drugbank/{ccd_id}'
# PubMed
f'https://data.rcsb.org/rest/v1/core/pubmed/{pdb_id}'
# FASTA
f'https://www.rcsb.org/fasta/entry/{pdb_id}/download'
# GraphQL
f'https://data.rcsb.org/graphql?query={encoded_query}'
# Search
f'https://search.rcsb.org/rcsbsearch/v2/query?json={encoded_query}'
UniProt API URL
# Needs header: {"Accept": "application/json"}
f'https://www.ebi.ac.uk/proteins/api/proteins?accession={uniprot_id}&reviewed=true'
Common Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| Not checking status code | Process garbage data | Always check r.status_code == 200 |
| Forgetting JSON header for UniProt | Get HTML instead of JSON | Add headers={"Accept": "application/json"} |
| Not URL-encoding search queries | Query fails | Use requests.utils.requote_uri() |
| Assuming dict when it's a list | TypeError | Check type(r.json()) |
| Calling .json() on non-JSON | Error | Check if response is actually JSON |
| Not handling missing keys | KeyError | Use .get('key', default) |
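For deeply nested responses, chaining `.get()` calls gets awkward. One possible helper is sketched below. The name `dig` is made up, and the `entry` dictionary is a hand-written fragment that reuses the field names from Pattern 3 with an illustrative value.

```python
def dig(data, *keys, default=None):
    """Follow a path of dict keys / list indexes; return default if any step fails."""
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hand-written fragment shaped like an entry response; the value is illustrative
entry = {"rcsb_entry_info": {"resolution_combined": [1.86]}}

print(dig(entry, "rcsb_entry_info", "resolution_combined", 0))   # 1.86
print(dig(entry, "rcsb_entry_info", "missing_field", 0))         # None
print(dig(entry, "cell", "length_a", default="n/a"))             # n/a
```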
Workflow Example: Get GO Terms for a PDB Structure
Complete workflow combining PDB and UniProt:
import requests
# 1. Get UniProt ID from PDB
pdb_id = "4GYD"
query = '''{
polymer_entity(entry_id: "%s", entity_id: "1") {
rcsb_polymer_entity_align {
reference_database_name
reference_database_accession
}
}
}''' % pdb_id
r = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(query))
data = r.json()
# Find UniProt accession (stays None if no UniProt mapping exists)
uniprot_id = None
for align in data['data']['polymer_entity']['rcsb_polymer_entity_align']:
    if align['reference_database_name'] == 'UniProt':
        uniprot_id = align['reference_database_accession']
        break
print(f"UniProt ID: {uniprot_id}")
# 2. Get GO terms from UniProt
url = f"https://www.ebi.ac.uk/proteins/api/proteins?accession={uniprot_id}&reviewed=true"
r = requests.get(url, headers={"Accept": "application/json"})
j = r.json()
print("Gene Ontology terms:")
for item in j[0]['dbReferences']:
    if item['type'] == "GO":
        print(f" {item['id']}: {item['properties']['term']}")