Programmatic Access to Databases

Why Programmatic Access?

You've used web interfaces to search databases. But what if you need to:

Query 500 proteins automatically
Extract specific fields from thousands of entries
Build a pipeline that updates daily

You need to talk to databases programmatically — through their APIs.

Part 1: How the Web Works

URLs

A URL (Uniform Resource Locator) is an address for a resource on the web:

https://www.rcsb.org/structure/4GYD

HTTP Protocol

When your browser opens a page:

Browser identifies the server from the URL
Sends a request using HTTP (or HTTPS for an encrypted connection)
Server responds with content + status code

HTTP Methods:

GET — retrieve data (what we'll mostly use)
POST — send data to create/update
PUT — update data
DELETE — remove data

Status Codes

Every HTTP response includes a status code:

Range	Meaning	Example
1XX	Information	100 Continue
2XX	Success	200 OK
3XX	Redirect	301 Moved Permanently
4XX	Client error	404 Not Found
5XX	Server error	500 Internal Server Error

Key rule: Always check that the status code is 200 (or at least in the 2XX range) before processing the response.
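The ranges in the table can be turned into a small lookup. This is a minimal sketch using only the standard library; the helper name describe_status is made up for illustration.

```python
from http import HTTPStatus

# First digit of the status code -> meaning, as in the table above.
RANGES = {
    1: "Information",
    2: "Success",
    3: "Redirect",
    4: "Client error",
    5: "Server error",
}

def describe_status(code):
    """Return (range meaning, standard phrase) for an HTTP status code."""
    meaning = RANGES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code is not a registered HTTP status
        phrase = ""
    return meaning, phrase

for code in (200, 301, 404, 500):
    print(code, *describe_status(code))
```

Only codes whose range is "Success" should be passed on to the parsing step.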

Part 2: REST and JSON

REST

REST (REpresentational State Transfer) is an architectural style for web services.

A REST API lets you:

Send HTTP requests to specific URLs
Get structured data back

Most bioinformatics databases offer REST APIs: PDB, UniProt, NCBI, Ensembl.

JSON

JSON (JavaScript Object Notation) is the most common format for API responses.

Four rules:

Data is in name/value pairs
Data is separated by commas
Curly braces {} hold objects (like Python dictionaries)
Square brackets [] hold arrays (like Python lists)

Example:

{
    "entry_id": "4GYD",
    "resolution": 1.86,
    "chains": ["A", "B"],
    "ligands": [
        {"id": "CFF", "name": "Caffeine"},
        {"id": "HOH", "name": "Water"}
    ]
}

This maps directly to Python:

{} → dictionary
[] → list
"text" → string
numbers → int or float
true / false → True / False
null → None
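The mapping can be checked with the standard json module. A short sketch that parses the example above from a string; no network access is needed.

```python
import json

text = '''{
    "entry_id": "4GYD",
    "resolution": 1.86,
    "chains": ["A", "B"],
    "ligands": [
        {"id": "CFF", "name": "Caffeine"},
        {"id": "HOH", "name": "Water"}
    ]
}'''

data = json.loads(text)  # JSON text -> Python objects

print(type(data))            # the outer {} became a dict
print(type(data["chains"]))  # the [] became a list
print(data["resolution"])    # the number became a float

# Nested objects are ordinary dicts inside an ordinary list.
ligand_names = [ligand["name"] for ligand in data["ligands"]]
print(ligand_names)
```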

Part 3: The requests Module

Python's requests module makes HTTP requests simple.

Basic GET Request

import requests

res = requests.get('http://www.google.com')
print(res.status_code)  # 200

Check Status Before Processing

res = requests.get('http://www.google.com')

if res.status_code == 200:
    print(res.text)  # The HTML content
else:
    print(f"Error: {res.status_code}")

What Happens with Errors

r = requests.get('https://github.com/timelines.json')
print(r.status_code)  # 404
print(r.text)  # Error message from GitHub

Always check the status code. Don't assume success.
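One way to make the check hard to forget is to wrap it in a helper. The sketch below is only one possible design: the name fetch_json and the 10-second timeout are assumptions, not part of any API described here. It relies on raise_for_status(), which raises an exception for 4XX and 5XX responses.

```python
import requests

def fetch_json(url, timeout=10.0):
    """Return the parsed JSON body of url, or None if anything goes wrong."""
    try:
        r = requests.get(url, headers={"Accept": "application/json"}, timeout=timeout)
        r.raise_for_status()  # raises requests.HTTPError on 4XX/5XX
        return r.json()       # raises ValueError if the body is not JSON
    except (requests.RequestException, ValueError) as err:
        # Covers connection problems, timeouts, bad URLs, HTTP errors, bad JSON.
        print(f"Request failed: {err}")
        return None
```

Callers then only need to test whether the result is None before using it.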

Getting JSON Responses

Most APIs return JSON. Convert it to a Python object (usually a dictionary, sometimes a list):

r = requests.get('https://some-api.com/data')
data = r.json()  # Now it's a dictionary

print(type(data))  # <class 'dict'>
print(data.keys())  # See what's inside

Part 4: PDB REST API

The Protein Data Bank has multiple APIs. Let's start with the REST API.

PDB Terminology

Term	Meaning	Example
Entry	Complete structure from one experiment	4GYD
Polymer Entity	One distinct polymer molecule (protein, DNA, RNA); may be present as several chains	4GYD entity 1
Chemical Component	Small molecule, ligand, ion	CFF (caffeine)

Get Entry Information

r = requests.get('https://data.rcsb.org/rest/v1/core/entry/4GYD')
data = r.json()

print(data.keys())
# dict_keys(['cell', 'citation', 'diffrn', 'entry', 'exptl', ...])

print(data['cell'])
# {'Z_PDB': 4, 'angle_alpha': 90.0, 'angle_beta': 90.0, ...}

Get Polymer Entity (Chain) Information

# 4GYD, entity 1
r = requests.get('https://data.rcsb.org/rest/v1/core/polymer_entity/4GYD/1')
data = r.json()

print(data['entity_poly'])
# Contains sequence, polymer type, etc.

Get PubMed Annotations

r = requests.get('https://data.rcsb.org/rest/v1/core/pubmed/4GYD')
data = r.json()

print(data['rcsb_pubmed_abstract_text'])
# The paper's abstract

Get Chemical Component Information

# CFF = Caffeine
r = requests.get('https://data.rcsb.org/rest/v1/core/chemcomp/CFF')
data = r.json()

print(data['chem_comp'])
# {'formula': 'C8 H10 N4 O2', 'formula_weight': 194.191, 'name': 'CAFFEINE', ...}

Get DrugBank Information

r = requests.get('https://data.rcsb.org/rest/v1/core/drugbank/CFF')
data = r.json()

print(data['drugbank_info']['description'])
# "A methylxanthine naturally occurring in some beverages..."

print(data['drugbank_info']['indication'])
# What the drug is used for

Get FASTA Sequence

Note: This returns plain text, not JSON.

r = requests.get('https://www.rcsb.org/fasta/entry/4GYD/download')
print(r.text)

# >4GYD_1|Chain A|...
# (the amino-acid sequence of entity 1 follows on the next line)

Process Multiple Proteins

protein_ids = ['4GYD', '4H0J', '4H0K']

protein_dict = {}
for protein in protein_ids:
    r = requests.get(f'https://data.rcsb.org/rest/v1/core/entry/{protein}')
    if r.status_code != 200:
        print(f"Skipping {protein}: status {r.status_code}")
        continue
    data = r.json()
    protein_dict[protein] = data['cell']

# Print cell dimensions
for protein_id, cell in protein_dict.items():
    print(f"{protein_id}: a={cell['length_a']}, b={cell['length_b']}, c={cell['length_c']}")

Part 5: PDB Search API

The Search API lets you query across the entire PDB database.

Base URL: https://search.rcsb.org/rcsbsearch/v2/query?json=<query>

Important: The query must be URL-encoded.

URL Encoding

Special characters in URLs must be encoded. Use requests.utils.requote_uri():

my_query = '{"query": ...}'  # JSON query string
encoded = requests.utils.requote_uri(my_query)
url = f'https://search.rcsb.org/rcsbsearch/v2/query?json={encoded}'
r = requests.get(url)
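A stricter alternative can be sketched with the standard library. requests.utils.requote_uri() leaves reserved characters such as & and + as they are, which matters if a query value ever contains them; urllib.parse.quote() with safe="" percent-encodes all of them. The query below is the chemical-component search from this part, written as a Python dict so that json.dumps() produces valid JSON.

```python
import json
from urllib.parse import quote, unquote

# A Search API query as a Python dict, serialized to a JSON string.
query = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_nonpolymer_instance_annotation.comp_id",
            "operator": "exact_match",
            "value": "CFF",
        },
    },
    "return_type": "entry",
}
query_text = json.dumps(query)

# safe="" encodes every reserved character, including & and =,
# so the JSON cannot be mistaken for extra URL parameters.
encoded = quote(query_text, safe="")
url = f"https://search.rcsb.org/rcsbsearch/v2/query?json={encoded}"

print(url)
print(unquote(encoded) == query_text)  # encoding is reversible
```

With requests, passing params={"json": query_text} to requests.get() together with the bare base URL encodes the value for you.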

Sequence Similarity Search (BLAST-like)

Find structures with similar sequences:

fasta = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS"

my_query = '''{
  "query": {
    "type": "terminal",
    "service": "sequence",
    "parameters": {
      "evalue_cutoff": 1,
      "identity_cutoff": 0.9,
      "sequence_type": "protein",
      "value": "%s"
    }
  },
  "request_options": {
    "scoring_strategy": "sequence"
  },
  "return_type": "polymer_entity"
}''' % fasta

r = requests.get('https://search.rcsb.org/rcsbsearch/v2/query?json=%s' % requests.utils.requote_uri(my_query))
j = r.json()

print(f"Total matches: {j['total_count']}")
for item in j['result_set']:
    print(item['identifier'], "score =", item['score'])
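Pasting a sequence into a string template works, but a stray quote or line break in the sequence produces invalid JSON. A hedged alternative is to build the same query as a dict and let json.dumps() handle the quoting. The function name and its default cutoffs are illustrative and simply mirror the values used above.

```python
import json

def build_sequence_query(sequence, identity_cutoff=0.9, evalue_cutoff=1):
    """Return the sequence-search query as a JSON string."""
    # Drop whitespace and line breaks so a pasted FASTA body becomes one string.
    cleaned = "".join(sequence.split()).upper()
    query = {
        "query": {
            "type": "terminal",
            "service": "sequence",
            "parameters": {
                "evalue_cutoff": evalue_cutoff,
                "identity_cutoff": identity_cutoff,
                "sequence_type": "protein",
                "value": cleaned,
            },
        },
        "request_options": {"scoring_strategy": "sequence"},
        "return_type": "polymer_entity",
    }
    return json.dumps(query)

query_text = build_sequence_query("mtey klv\nvvg")
print(query_text)
```

The returned string is URL-encoded and sent in the same way as my_query above.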

Sequence Motif Search (PROSITE)

Find structures containing a specific motif:

# Zinc finger Cys2His2-like fold group
# PROSITE pattern: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

my_query = '''{
  "query": {
    "type": "terminal",
    "service": "seqmotif",
    "parameters": {
      "value": "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H",
      "pattern_type": "prosite",
      "sequence_type": "protein"
    }
  },
  "return_type": "polymer_entity"
}'''

r = requests.get('https://search.rcsb.org/rcsbsearch/v2/query?json=%s' % requests.utils.requote_uri(my_query))
j = r.json()

print(f"Total: {j['total_count']}, returned: {len(j['result_set'])}")
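To see what the PROSITE pattern means, it can help to translate it into a regular expression and try it locally before sending the search. The converter below is a simplified sketch: it covers the common syntax elements (x, [..], {..}, (n) and (n,m) repeats, < and > anchors) and not every corner of the PROSITE notation. The test sequence is synthetic, built by hand to satisfy the pattern.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                      # '-' only separates positions
    regex = regex.replace("x", ".")                     # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {AB} = any residue except A, B
    regex = regex.replace("(", "{").replace(")", "}")   # (2,4) = repeat 2 to 4 times
    regex = regex.replace("<", "^").replace(">", "$")   # sequence start / end anchors
    return regex

zinc_finger = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(zinc_finger)
print(regex)

# Synthetic sequence: MK + C AA C AAA F AAAAAAAA H AAA H + GS
sequence = "MKCAACAAAFAAAAAAAAHAAAHGS"
match = re.search(regex, sequence)
print(match.start(), match.end())
```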

Search by Chemical Component

Find all entries containing caffeine:

my_query = '''{
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_nonpolymer_instance_annotation.comp_id",
            "operator": "exact_match",
            "value": "CFF"
        }
    },
    "return_type": "entry"
}'''

url = "https://search.rcsb.org/rcsbsearch/v2/query?json=%s" % requests.utils.requote_uri(my_query)
r = requests.get(url)
data = r.json()

pdb_ids = [row["identifier"] for row in data.get("result_set", [])]
print(f"Entries with caffeine: {len(pdb_ids)}")
print(pdb_ids)

Understanding the Response

j = r.json()

j.keys()
# dict_keys(['query_id', 'result_type', 'total_count', 'result_set'])

j['total_count']  # Total number of matches
j['result_set']   # List of results (may be paginated)

# Each result
j['result_set'][0]
# {'identifier': '4GYD_1', 'score': 1.0, ...}
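Polymer-entity identifiers combine the entry ID and the entity number, so they often need to be split before further lookups. A small sketch on a hand-written response in the shape shown above; the identifiers, scores, and query_id in sample are made up for illustration.

```python
# Illustrative response; not real search output.
sample = {
    "query_id": "example",
    "result_type": "polymer_entity",
    "total_count": 3,
    "result_set": [
        {"identifier": "4GYD_1", "score": 1.0},
        {"identifier": "4H0J_1", "score": 0.98},
        {"identifier": "4H0K_2", "score": 0.97},
    ],
}

def entities_by_entry(response):
    """Group identifiers such as '4GYD_1' into {entry_id: [entity_id, ...]}."""
    grouped = {}
    for hit in response.get("result_set", []):
        entry_id, _, entity_id = hit["identifier"].partition("_")
        grouped.setdefault(entry_id, []).append(entity_id)
    return grouped

print(entities_by_entry(sample))
```

Each entry_id and entity_id pair can then be passed to the polymer_entity endpoint from Part 4.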

Part 6: PDB GraphQL API

GraphQL is a query language that lets you request exactly the fields you need.

Endpoint: https://data.rcsb.org/graphql

Interactive testing: https://data.rcsb.org/graphql/index.html (GraphiQL)

Why GraphQL?

REST: Multiple requests for related data
GraphQL: One request, specify exactly what you want

Basic Query

my_query = '''{
    entry(entry_id: "4GYD") {
        cell {
            Z_PDB
            angle_alpha
            angle_beta
            angle_gamma
            length_a
            length_b
            length_c
            volume
        }
    }
}'''

r = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(my_query))
j = r.json()

print(j.keys())  # dict_keys(['data'])

print(j['data'])
# {'entry': {'cell': {'Z_PDB': 4, 'angle_alpha': 90.0, ...}}}

Accessing the Data

params = j['data']['entry']['cell']

for key, value in params.items():
    print(f"{key}: {value}")

Query Multiple Entries

my_query = '''{
    entries(entry_ids: ["4GYD", "4H0J", "4H0K"]) {
        rcsb_id
        cell {
            length_a
            length_b
            length_c
        }
    }
}'''
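When the list of IDs comes from a file or an earlier search, the query text can be generated instead of typed. A minimal sketch: build_entries_query is a hypothetical helper, and it relies on json.dumps() producing a list literal that is also valid GraphQL for plain strings.

```python
import json

def build_entries_query(pdb_ids):
    """Return a GraphQL query asking for the cell lengths of several entries."""
    ids_literal = json.dumps(list(pdb_ids))  # e.g. ["4GYD", "4H0J"]
    return '''{
    entries(entry_ids: %s) {
        rcsb_id
        cell {
            length_a
            length_b
            length_c
        }
    }
}''' % ids_literal

my_query = build_entries_query(["4GYD", "4H0J", "4H0K"])
print(my_query)
```

The string is sent exactly like the basic query. The response is expected to hold a list under j['data']['entries'], one dictionary per entry; confirm this against an actual response.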

Find UniProt Mappings

my_query = '''{
    polymer_entity(entry_id: "4GYD", entity_id: "1") {
        rcsb_polymer_entity_container_identifiers {
            entry_id
            entity_id
        }
        rcsb_polymer_entity_align {
            aligned_regions {
                entity_beg_seq_id
                length
            }
            reference_database_name
            reference_database_accession
        }
    }
}'''

Part 7: UniProt API

UniProt data can be queried through the EBI Proteins REST API at https://www.ebi.ac.uk/proteins/api/

Important: Specify JSON Format

UniProt doesn't return JSON by default. You must request it:

headers = {"Accept": "application/json"}
requestURL = "https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession=P0A3X7&reviewed=true"

r = requests.get(requestURL, headers=headers)
j = r.json()

Response Structure

UniProt returns a list, not a dictionary:

type(j)  # <class 'list'>
len(j)   # Number of entries returned

# Access first entry
j[0].keys()
# dict_keys(['accession', 'id', 'proteinExistence', 'info', 'organism', ...])

Extract Gene Ontology Information

print(f"Accession: {j[0]['accession']}")  # P0A3X7
print(f"ID: {j[0]['id']}")  # CYC6_NOSS1

print("Gene Ontologies:")
for item in j[0]['dbReferences']:
    if item['type'] == "GO":
        print(f"  {item['id']}: {item['properties']['term']}")
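GO terms are easier to read when grouped by aspect. The sketch below assumes that each term string starts with a one-letter prefix (C:, F: or P:) for cellular component, molecular function, and biological process. Check that assumption against a real response before relying on it. The sample entry is hand-written, not actual API output.

```python
ASPECTS = {
    "C": "cellular component",
    "F": "molecular function",
    "P": "biological process",
}

def group_go_terms(entry):
    """Return {aspect: [(go_id, term_name), ...]} for one entry dictionary."""
    grouped = {}
    for ref in entry.get("dbReferences", []):
        if ref.get("type") != "GO":
            continue
        term = ref.get("properties", {}).get("term", "")
        prefix, sep, name = term.partition(":")
        if sep and prefix in ASPECTS:
            aspect = ASPECTS[prefix]
        else:  # no recognizable prefix: keep the whole string
            aspect, name = "unknown", term
        grouped.setdefault(aspect, []).append((ref["id"], name))
    return grouped

# Hand-written sample in the assumed shape of j[0].
sample_entry = {
    "accession": "P0A3X7",
    "dbReferences": [
        {"type": "PDB", "id": "4GYD"},
        {"type": "GO", "id": "GO:0009055", "properties": {"term": "F:electron transfer activity"}},
        {"type": "GO", "id": "GO:0020037", "properties": {"term": "F:heme binding"}},
        {"type": "GO", "id": "GO:0015979", "properties": {"term": "P:photosynthesis"}},
    ],
}

for aspect, terms in group_go_terms(sample_entry).items():
    print(aspect, terms)
```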

Part 8: NCBI API

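NCBI also offers REST APIs for programmatic access. The example below uses the v1alpha Datasets endpoint; NCBI has published newer API versions since, so check the current Datasets documentation if the URL stops working.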

Gene Information

headers = {'Accept': 'application/json'}
gene_id = 8291  # DYSF (dysferlin)

r = requests.get(f'https://api.ncbi.nlm.nih.gov/datasets/v1alpha/gene/id/{gene_id}', headers=headers)
j = r.json()

gene = j['genes'][0]['gene']
print(gene['description'])  # dysferlin
print(gene['symbol'])       # DYSF
print(gene['taxname'])      # Homo sapiens

Part 9: Common Patterns

Pattern 1: Always Check Status

r = requests.get(url)
if r.status_code != 200:
    print(f"Error: {r.status_code}")
    print(r.text)
else:
    data = r.json()
    # process data

Pattern 2: Loop Through Multiple IDs

ids = ['4GYD', '4H0J', '4H0K']
results = {}

for pdb_id in ids:  # avoid naming the loop variable "id", which shadows a built-in
    r = requests.get(f'https://data.rcsb.org/rest/v1/core/entry/{pdb_id}')
    if r.status_code == 200:
        results[pdb_id] = r.json()
    else:
        print(f"Failed to get {pdb_id}")

Pattern 3: Extract Specific Fields

# Get resolution for multiple structures
resolutions = {}
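for pdb_id in ids:
    r = requests.get(f'https://data.rcsb.org/rest/v1/core/entry/{pdb_id}')
    if r.status_code != 200:
        continue
    data = r.json()
    # Navigate nested structure; resolution can be missing (e.g. NMR structures)
    values = data.get('rcsb_entry_info', {}).get('resolution_combined') or [None]
    resolutions[pdb_id] = values[0]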

Pattern 4: Build URL with Parameters

base_url = "https://www.ebi.ac.uk/proteins/api/proteins"
params = {
    'offset': 0,
    'size': 10,
    'accession': 'P0A3X7',
    'reviewed': 'true'
}

# Build the query string; urlencode also escapes special characters in values
from urllib.parse import urlencode
query = urlencode(params)
url = f"{base_url}?{query}"

Pattern 5: Handle Paginated Results

Search APIs often return limited results per page:

j = r.json()
print(f"Total: {j['total_count']}")
print(f"Returned: {len(j['result_set'])}")

# If total > returned, you need pagination
# Check API docs for how to request more pages
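For the RCSB Search API specifically, paging is requested inside the query itself. The sketch below assumes the paginate option with start and rows under request_options; confirm the field names against the current Search API documentation. The helper names with_page and page_starts are made up, and the code only builds the query dicts, it does not send them.

```python
import copy

def with_page(query, start, rows):
    """Return a copy of a Search API query dict that asks for one page of results."""
    paged = copy.deepcopy(query)  # leave the caller's query untouched
    options = paged.setdefault("request_options", {})
    options["paginate"] = {"start": start, "rows": rows}
    return paged

def page_starts(total_count, rows):
    """Return the start offsets needed to cover total_count results."""
    return list(range(0, total_count, rows))

base_query = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_nonpolymer_instance_annotation.comp_id",
            "operator": "exact_match",
            "value": "CFF",
        },
    },
    "return_type": "entry",
}

# Suppose the first response reported total_count = 250 (a made-up number).
for start in page_starts(250, 100):
    page_query = with_page(base_query, start, 100)
    print(page_query["request_options"]["paginate"])
```

Each page_query is serialized with json.dumps(), URL-encoded, and sent like any other search; the result_set lists are then concatenated.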

API Summary

Database	Base URL	JSON by default?	Notes
PDB REST	data.rcsb.org/rest/v1/core/	Yes	Entry, entity, chemcomp
PDB Search	search.rcsb.org/rcsbsearch/v2/query	Yes	URL-encode query
PDB GraphQL	data.rcsb.org/graphql	Yes	Flexible queries
UniProt	ebi.ac.uk/proteins/api/	No (need header)	Returns list
NCBI	api.ncbi.nlm.nih.gov/datasets/	No (need header)	Gene, genome, etc.

Quick Reference

requests Basics

import requests

# GET request
r = requests.get(url)
r = requests.get(url, headers={'Accept': 'application/json'})

# Check status
r.status_code  # 200 = success

# Get response
r.text  # As string
r.json()  # As dictionary (if JSON)

URL Encoding

# For Search API queries
encoded = requests.utils.requote_uri(query_string)
url = f'https://search.rcsb.org/rcsbsearch/v2/query?json={encoded}'

PDB API URLs

# Entry info
f'https://data.rcsb.org/rest/v1/core/entry/{pdb_id}'

# Polymer entity
f'https://data.rcsb.org/rest/v1/core/polymer_entity/{pdb_id}/{entity_id}'

# Chemical component
f'https://data.rcsb.org/rest/v1/core/chemcomp/{ccd_id}'

# DrugBank
f'https://data.rcsb.org/rest/v1/core/drugbank/{ccd_id}'

# PubMed
f'https://data.rcsb.org/rest/v1/core/pubmed/{pdb_id}'

# FASTA
f'https://www.rcsb.org/fasta/entry/{pdb_id}/download'

# GraphQL
f'https://data.rcsb.org/graphql?query={encoded_query}'

# Search
f'https://search.rcsb.org/rcsbsearch/v2/query?json={encoded_query}'

UniProt API URL

# Needs header: {"Accept": "application/json"}
f'https://www.ebi.ac.uk/proteins/api/proteins?accession={uniprot_id}&reviewed=true'

Common Mistakes

Mistake	Problem	Fix
Not checking status code	Process garbage data	Always check `r.status_code == 200`
Forgetting JSON header for UniProt	Get HTML instead of JSON	Add `headers={"Accept": "application/json"}`
Not URL-encoding search queries	Query fails	Use `requests.utils.requote_uri()`
Assuming dict when it's a list	TypeError	Check `type(r.json())`
Calling `.json()` on non-JSON	JSON decode error	Check the status code and `Content-Type` header first
Not handling missing keys	KeyError	Use `.get('key', default)`
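Chained .get() calls become awkward once lists are involved. A small helper can walk a nested response and fall back to a default at the first missing step. This is a sketch; the name dig is made up, and the sample dictionaries only imitate the shape of an entry response.

```python
def dig(data, *path, default=None):
    """Follow keys and list indexes through nested data; return default if a step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            # Missing key, index out of range, or indexing into None/str/number.
            return default
    return current

entry = {"rcsb_entry_info": {"resolution_combined": [1.86]}}
no_resolution = {"rcsb_entry_info": {"resolution_combined": None}}

print(dig(entry, "rcsb_entry_info", "resolution_combined", 0))
print(dig(no_resolution, "rcsb_entry_info", "resolution_combined", 0, default="n/a"))
```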

Workflow Example: Get GO Terms for a PDB Structure

Complete workflow combining PDB and UniProt:

import requests

# 1. Get UniProt ID from PDB
pdb_id = "4GYD"
query = '''{
    polymer_entity(entry_id: "%s", entity_id: "1") {
        rcsb_polymer_entity_align {
            reference_database_name
            reference_database_accession
        }
    }
}''' % pdb_id

r = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(query))
data = r.json()

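# Find UniProt accession (stays None if the entity has no UniProt mapping)
uniprot_id = None
entity = data['data']['polymer_entity'] or {}
for align in entity.get('rcsb_polymer_entity_align') or []:
    if align['reference_database_name'] == 'UniProt':
        uniprot_id = align['reference_database_accession']
        break

if uniprot_id is None:
    raise SystemExit(f"No UniProt mapping found for {pdb_id}")
print(f"UniProt ID: {uniprot_id}")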

# 2. Get GO terms from UniProt
url = f"https://www.ebi.ac.uk/proteins/api/proteins?accession={uniprot_id}&reviewed=true"
r = requests.get(url, headers={"Accept": "application/json"})
j = r.json()

print("Gene Ontology terms:")
for item in j[0]['dbReferences']:
    if item['type'] == "GO":
        print(f"  {item['id']}: {item['properties']['term']}")
