Programmatic Access to Databases

Why Programmatic Access?

You've used web interfaces to search databases. But what if you need to:

  • Query 500 proteins automatically
  • Extract specific fields from thousands of entries
  • Build a pipeline that updates daily

You need to talk to databases programmatically — through their APIs.


Part 1: How the Web Works

URLs

A URL (Uniform Resource Locator) is an address for a resource on the web:

https://www.rcsb.org/structure/4GYD
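
As a small illustration, the address above can be taken apart with Python's standard urllib.parse module. This is only a sketch: the variable names are illustrative, and reading the last path segment as a PDB ID is an assumption that holds for this particular URL layout.

```python
from urllib.parse import urlsplit

url = "https://www.rcsb.org/structure/4GYD"
parts = urlsplit(url)

print(parts.scheme)   # protocol used to talk to the server
print(parts.netloc)   # the server that holds the resource
print(parts.path)     # the resource on that server

# In this URL layout the last path segment is the PDB ID
pdb_id = parts.path.rsplit("/", 1)[-1]
print(pdb_id)
```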

HTTP Protocol

When your browser opens a page:

  1. Browser identifies the server from the URL
  2. Sends a request using HTTP (or HTTPS for secure)
  3. Server responds with content + status code

HTTP Methods:

  • GET — retrieve data (what we'll mostly use)
  • POST — send data to create/update
  • PUT — update data
  • DELETE — remove data

Status Codes

Every HTTP response includes a status code:

Range   Meaning         Example
1XX     Informational   100 Continue
2XX     Success         200 OK
3XX     Redirect        301 Moved Permanently
4XX     Client error    404 Not Found
5XX     Server error    500 Internal Server Error

Key rule: Always check that the status code is 200 (or at least in the 2XX range) before processing the response.
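
The ranges can be checked in code with a little integer arithmetic. The sketch below uses the standard library's http.HTTPStatus for the official phrases; the helper names status_class and is_success are made up for this example.

```python
from http import HTTPStatus

# Labels follow the status-code ranges listed above
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirect",
    4: "Client error",
    5: "Server error",
}

def status_class(code):
    """Return the broad class of an HTTP status code."""
    return STATUS_CLASSES.get(code // 100, "Unknown")

def is_success(code):
    """True for any code in the 2XX range."""
    return 200 <= code < 300

for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase, "|", status_class(code), "| success:", is_success(code))
```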


Part 2: REST and JSON

REST

REST (REpresentational State Transfer) is an architecture for web services.

A REST API lets you:

  • Send HTTP requests to specific URLs
  • Get structured data back

Most bioinformatics databases offer REST APIs: PDB, UniProt, NCBI, Ensembl.

JSON

JSON (JavaScript Object Notation) is the standard format for API responses.

Four rules:

  1. Data is in name/value pairs
  2. Data is separated by commas
  3. Curly braces {} hold objects (like Python dictionaries)
  4. Square brackets [] hold arrays (like Python lists)

Example:

{
    "entry_id": "4GYD",
    "resolution": 1.86,
    "chains": ["A", "B"],
    "ligands": [
        {"id": "CFF", "name": "Caffeine"},
        {"id": "HOH", "name": "Water"}
    ]
}

This maps directly to Python:

  • {} → dictionary
  • [] → list
  • "text" → string
  • numbers → int or float
  • true / false → True / False
  • null → None
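
To see the mapping without touching the network, the example document above can be parsed with the standard json module. A minimal sketch; the values are the illustrative ones from the example, not data fetched from the PDB.

```python
import json

raw = '''{
    "entry_id": "4GYD",
    "resolution": 1.86,
    "chains": ["A", "B"],
    "ligands": [
        {"id": "CFF", "name": "Caffeine"},
        {"id": "HOH", "name": "Water"}
    ]
}'''

data = json.loads(raw)  # JSON text -> Python objects

print(type(data))             # the outer {} became a dict
print(type(data["chains"]))   # [] became a list
print(data["resolution"])     # a JSON number became a float

# Nested structures are reached by chaining keys and indexes
ligand_names = [ligand["name"] for ligand in data["ligands"]]
print(ligand_names)
```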

Part 3: The requests Module

Python's requests module makes HTTP requests simple.

Basic GET Request

import requests

res = requests.get('http://www.google.com')
print(res.status_code)  # 200

Check Status Before Processing

res = requests.get('http://www.google.com')

if res.status_code == 200:
    print(res.text)  # The HTML content
else:
    print(f"Error: {res.status_code}")

What Happens with Errors

r = requests.get('https://github.com/timelines.json')
print(r.status_code)  # 404
print(r.text)  # Error message from GitHub

Always check the status code. Don't assume success.

Getting JSON Responses

Most APIs return JSON. Call .json() to parse it into Python objects (usually a dictionary, sometimes a list):

r = requests.get('https://some-api.com/data')
data = r.json()  # Parsed JSON; a dictionary for this kind of response

print(type(data))  # <class 'dict'>
print(data.keys())  # See what's inside

Part 4: PDB REST API

The Protein Data Bank has multiple APIs. Let's start with the REST API.

PDB Terminology

Term                 Meaning                                  Example
Entry                Complete structure from one experiment   4GYD
Polymer Entity       One chain (protein, DNA, RNA)            4GYD entity 1
Chemical Component   Small molecule, ligand, ion              CFF (caffeine)

Get Entry Information

r = requests.get('https://data.rcsb.org/rest/v1/core/entry/4GYD')
data = r.json()

print(data.keys())
# dict_keys(['cell', 'citation', 'diffrn', 'entry', 'exptl', ...])

print(data['cell'])
# {'Z_PDB': 4, 'angle_alpha': 90.0, 'angle_beta': 90.0, ...}

Get Polymer Entity (Chain) Information

# 4GYD, entity 1
r = requests.get('https://data.rcsb.org/rest/v1/core/polymer_entity/4GYD/1')
data = r.json()

print(data['entity_poly'])
# Contains sequence, polymer type, etc.

Get PubMed Annotations

r = requests.get('https://data.rcsb.org/rest/v1/core/pubmed/4GYD')
data = r.json()

print(data['rcsb_pubmed_abstract_text'])
# The paper's abstract

Get Chemical Component Information

# CFF = Caffeine
r = requests.get('https://data.rcsb.org/rest/v1/core/chemcomp/CFF')
data = r.json()

print(data['chem_comp'])
# {'formula': 'C8 H10 N4 O2', 'formula_weight': 194.191, 'name': 'CAFFEINE', ...}

Get DrugBank Information

r = requests.get('https://data.rcsb.org/rest/v1/core/drugbank/CFF')
data = r.json()

print(data['drugbank_info']['description'])
# "A methylxanthine naturally occurring in some beverages..."

print(data['drugbank_info']['indication'])
# What the drug is used for

Get FASTA Sequence

Note: This returns plain text, not JSON.

r = requests.get('https://www.rcsb.org/fasta/entry/4GYD/download')
print(r.text)

# >4GYD_1|Chain A|...
# (one-letter amino acid sequence of the entity follows)

Process Multiple Proteins

protein_ids = ['4GYD', '4H0J', '4H0K']

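protein_dict = {}
for protein in protein_ids:
    r = requests.get(f'https://data.rcsb.org/rest/v1/core/entry/{protein}')
    if r.status_code != 200:
        # Skip entries that could not be fetched instead of crashing on .json()
        print(f"Skipping {protein}: HTTP {r.status_code}")
        continue
    data = r.json()
    protein_dict[protein] = data['cell']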

# Print cell dimensions
for protein_id, cell in protein_dict.items():
    print(f"{protein_id}: a={cell['length_a']}, b={cell['length_b']}, c={cell['length_c']}")

Part 5: PDB Search API

The Search API lets you query across the entire PDB database.

Base URL: https://search.rcsb.org/rcsbsearch/v2/query?json=<query>

Important: The query must be URL-encoded.

URL Encoding

Characters such as spaces, quotes, and braces cannot appear raw in a URL; they must be percent-encoded. requests.utils.requote_uri() does this:

my_query = '{"query": ...}'  # JSON query string
encoded = requests.utils.requote_uri(my_query)
url = f'https://search.rcsb.org/rcsbsearch/v2/query?json={encoded}'
r = requests.get(url)
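
To see what percent-encoding actually does to a query, the sketch below uses the standard library's urllib.parse.quote as a stand-in. That substitution is an assumption made so the example runs without any third-party package; requests.utils.requote_uri may leave a somewhat different set of characters untouched. The query string is an illustrative fragment, not a complete search.

```python
from urllib.parse import quote, unquote

# Illustrative fragment only, not a complete Search API query
my_query = '{"query": {"type": "terminal", "service": "text"}}'

encoded = quote(my_query)
print(encoded)  # braces, quotes, colons and spaces are now %XX sequences

# Decoding gives back exactly the original text
decoded = unquote(encoded)
print(decoded == my_query)
```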

Sequence Similarity Search (BLAST-like)

Find structures with similar sequences:

fasta = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS"

my_query = '''{
  "query": {
    "type": "terminal",
    "service": "sequence",
    "parameters": {
      "evalue_cutoff": 1,
      "identity_cutoff": 0.9,
      "sequence_type": "protein",
      "value": "%s"
    }
  },
  "request_options": {
    "scoring_strategy": "sequence"
  },
  "return_type": "polymer_entity"
}''' % fasta

r = requests.get('https://search.rcsb.org/rcsbsearch/v2/query?json=%s' % requests.utils.requote_uri(my_query))
j = r.json()

print(f"Total matches: {j['total_count']}")
for item in j['result_set']:
    print(item['identifier'], "score =", item['score'])
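
An alternative to formatting the sequence into a JSON string by hand is to build the query as a Python dictionary and serialise it with json.dumps, which takes care of quoting. The sketch below mirrors the fields of the query above; the function name build_sequence_query is hypothetical, and the sequence is truncated for brevity.

```python
import json

def build_sequence_query(sequence, identity_cutoff=0.9, evalue_cutoff=1):
    """Return the sequence-search query as a JSON string."""
    query = {
        "query": {
            "type": "terminal",
            "service": "sequence",
            "parameters": {
                "evalue_cutoff": evalue_cutoff,
                "identity_cutoff": identity_cutoff,
                "sequence_type": "protein",
                "value": sequence,
            },
        },
        "request_options": {"scoring_strategy": "sequence"},
        "return_type": "polymer_entity",
    }
    return json.dumps(query)

# Truncated sequence, used here only to keep the example short
query_text = build_sequence_query("MTEYKLVVVGAGGVGKSALTIQ", identity_cutoff=0.95)
print(query_text)
```

The resulting string can then be URL-encoded and sent exactly like the hand-written query.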

Sequence Motif Search (PROSITE)

Find structures containing a specific motif:

# Zinc finger Cys2His2-like fold group
# PROSITE pattern: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

my_query = '''{
  "query": {
    "type": "terminal",
    "service": "seqmotif",
    "parameters": {
      "value": "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H",
      "pattern_type": "prosite",
      "sequence_type": "protein"
    }
  },
  "return_type": "polymer_entity"
}'''

r = requests.get('https://search.rcsb.org/rcsbsearch/v2/query?json=%s' % requests.utils.requote_uri(my_query))
j = r.json()

print(f"Total: {j['total_count']}, returned: {len(j['result_set'])}")

Search by Chemical Component

Find all entries containing caffeine:

my_query = '''{
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_nonpolymer_instance_annotation.comp_id",
            "operator": "exact_match",
            "value": "CFF"
        }
    },
    "return_type": "entry"
}'''

url = "https://search.rcsb.org/rcsbsearch/v2/query?json=%s" % requests.utils.requote_uri(my_query)
r = requests.get(url)
data = r.json()

pdb_ids = [row["identifier"] for row in data.get("result_set", [])]
print(f"Entries with caffeine: {len(pdb_ids)}")
print(pdb_ids)

Understanding the Response

j = r.json()

j.keys()
# dict_keys(['query_id', 'result_type', 'total_count', 'result_set'])

j['total_count']  # Total number of matches
j['result_set']   # List of results (may be paginated)

# Each result
j['result_set'][0]
# {'identifier': '4GYD_1', 'score': 1.0, ...}
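
Once the response is a dictionary, filtering it is ordinary Python. The sketch below works on a hand-written sample shaped like the response described above; the identifiers XXXX_1 and YYYY_2, the scores, and the helper name top_hits are all made up for illustration.

```python
# Hand-written sample; identifiers and scores are placeholders, not real results
sample = {
    "query_id": "example",
    "result_type": "polymer_entity",
    "total_count": 3,
    "result_set": [
        {"identifier": "4GYD_1", "score": 1.0},
        {"identifier": "XXXX_1", "score": 0.8},
        {"identifier": "YYYY_2", "score": 0.4},
    ],
}

def top_hits(response, min_score=0.5):
    """Return (entry_id, entity_id, score) for results scoring at least min_score."""
    hits = []
    for item in response.get("result_set", []):
        if item["score"] >= min_score:
            # Polymer entity identifiers look like "<entry>_<entity>"
            entry_id, _, entity_id = item["identifier"].partition("_")
            hits.append((entry_id, entity_id, item["score"]))
    return hits

print(top_hits(sample))
```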

Part 6: PDB GraphQL API

GraphQL is a query language that lets you request exactly the fields you need.

Endpoint: https://data.rcsb.org/graphql

Interactive testing: https://data.rcsb.org/graphql/index.html (GraphiQL)

Why GraphQL?

  • REST: Multiple requests for related data
  • GraphQL: One request, specify exactly what you want

Basic Query

my_query = '''{
    entry(entry_id: "4GYD") {
        cell {
            Z_PDB
            angle_alpha
            angle_beta
            angle_gamma
            length_a
            length_b
            length_c
            volume
        }
    }
}'''

r = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(my_query))
j = r.json()

print(j.keys())  # dict_keys(['data'])

print(j['data'])
# {'entry': {'cell': {'Z_PDB': 4, 'angle_alpha': 90.0, ...}}}

Accessing the Data

params = j['data']['entry']['cell']

for key, value in params.items():
    print(f"{key}: {value}")

Query Multiple Entries

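my_query = '''{
    entries(entry_ids: ["4GYD", "4H0J", "4H0K"]) {
        rcsb_id
        cell {
            length_a
            length_b
            length_c
        }
    }
}'''

r = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(my_query))
j = r.json()

# One dictionary per requested entry
for entry in j['data']['entries']:
    print(entry['rcsb_id'], entry['cell'])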
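
When the list of IDs comes from a file or an earlier search, the query text can be generated rather than typed. A minimal sketch, assuming the same entries query shape as above; build_entries_query is a hypothetical helper name. It relies on json.dumps producing a double-quoted list, which is also how a list of strings is written in GraphQL.

```python
import json

def build_entries_query(entry_ids, cell_fields=("length_a", "length_b", "length_c")):
    """Build the text of an `entries` query for the given PDB IDs and cell fields."""
    ids_literal = json.dumps(list(entry_ids))    # e.g. ["4GYD", "4H0J"]
    fields = "\n            ".join(cell_fields)  # one field per line, indented
    return f'''{{
    entries(entry_ids: {ids_literal}) {{
        rcsb_id
        cell {{
            {fields}
        }}
    }}
}}'''

query_text = build_entries_query(["4GYD", "4H0J", "4H0K"])
print(query_text)
```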

Find UniProt Mappings

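my_query = '''{
    polymer_entity(entry_id: "4GYD", entity_id: "1") {
        rcsb_polymer_entity_container_identifiers {
            entry_id
            entity_id
        }
        rcsb_polymer_entity_align {
            aligned_regions {
                entity_beg_seq_id
                length
            }
            reference_database_name
            reference_database_accession
        }
    }
}'''

r = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(my_query))
j = r.json()

# Each alignment names a reference database and an accession in it
for align in j['data']['polymer_entity']['rcsb_polymer_entity_align']:
    print(align['reference_database_name'], align['reference_database_accession'])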
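
Pulling the accession out of such a response is easier to reuse when wrapped in a small function that tolerates missing pieces. This is a sketch under assumptions: the sample dictionary is written by hand to match the fields requested above (it is not an actual server response), and first_uniprot_accession is a hypothetical helper name.

```python
def first_uniprot_accession(response):
    """Return the first UniProt accession in a polymer_entity response, or None."""
    entity = (response.get("data") or {}).get("polymer_entity") or {}
    for align in entity.get("rcsb_polymer_entity_align") or []:
        if align.get("reference_database_name") == "UniProt":
            return align.get("reference_database_accession")
    return None

# Hand-written sample in the shape of the query above, for illustration only
sample = {
    "data": {
        "polymer_entity": {
            "rcsb_polymer_entity_align": [
                {
                    "reference_database_name": "UniProt",
                    "reference_database_accession": "P0A3X7",
                }
            ]
        }
    }
}

print(first_uniprot_accession(sample))
print(first_uniprot_accession({"data": {"polymer_entity": None}}))
```

Returning None instead of raising lets the caller decide what a missing mapping means.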

Part 7: UniProt API

UniProt data is available through the EBI Proteins REST API at https://www.ebi.ac.uk/proteins/api/

Important: Specify JSON Format

This API doesn't return JSON by default. Request it with an Accept header:

headers = {"Accept": "application/json"}
requestURL = "https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession=P0A3X7&reviewed=true"

r = requests.get(requestURL, headers=headers)
j = r.json()

Response Structure

UniProt returns a list, not a dictionary:

type(j)  # <class 'list'>
len(j)   # Number of entries returned

# Access first entry
j[0].keys()
# dict_keys(['accession', 'id', 'proteinExistence', 'info', 'organism', ...])

Extract Gene Ontology Information

print(f"Accession: {j[0]['accession']}")  # P0A3X7
print(f"ID: {j[0]['id']}")  # CYC6_NOSS1

print("Gene Ontologies:")
for item in j[0]['dbReferences']:
    if item['type'] == "GO":
        print(f"  {item['id']}: {item['properties']['term']}")
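
The same loop can be packaged as a function that returns the terms instead of printing them and that skips references lacking a term. The sketch below runs on a made-up entry: the accession, IDs, and terms are placeholders rather than real annotations, and extract_go_terms is a hypothetical helper name.

```python
def extract_go_terms(entry):
    """Return {GO id: term} for one entry, skipping references without a term."""
    terms = {}
    for ref in entry.get("dbReferences", []):
        if ref.get("type") != "GO":
            continue
        term = ref.get("properties", {}).get("term")
        if term is not None:
            terms[ref["id"]] = term
    return terms

# Made-up entry for illustration; all values are placeholders
sample_entry = {
    "accession": "EXAMPLE",
    "dbReferences": [
        {"type": "GO", "id": "GO:0000001", "properties": {"term": "F:placeholder term"}},
        {"type": "PDB", "id": "0XXX", "properties": {"method": "X-ray"}},
        {"type": "GO", "id": "GO:0000002"},  # no properties: skipped
    ],
}

print(extract_go_terms(sample_entry))
```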

Part 8: NCBI API

NCBI also offers REST APIs for programmatic access.

Gene Information

headers = {'Accept': 'application/json'}
gene_id = 8291  # DYSF (dysferlin)

r = requests.get(f'https://api.ncbi.nlm.nih.gov/datasets/v1alpha/gene/id/{gene_id}', headers=headers)
j = r.json()

gene = j['genes'][0]['gene']
print(gene['description'])  # dysferlin
print(gene['symbol'])       # DYSF
print(gene['taxname'])      # Homo sapiens

Part 9: Common Patterns

Pattern 1: Always Check Status

r = requests.get(url)
if r.status_code != 200:
    print(f"Error: {r.status_code}")
    print(r.text)
else:
    data = r.json()
    # process data

Pattern 2: Loop Through Multiple IDs

ids = ['4GYD', '4H0J', '4H0K']
results = {}

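for pdb_id in ids:  # avoid naming the variable `id`, which shadows a Python builtin
    r = requests.get(f'https://data.rcsb.org/rest/v1/core/entry/{pdb_id}')
    if r.status_code == 200:
        results[pdb_id] = r.json()
    else:
        print(f"Failed to get {pdb_id}: HTTP {r.status_code}")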

Pattern 3: Extract Specific Fields

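# Get resolution for multiple structures
resolutions = {}
for pdb_id in ids:
    r = requests.get(f'https://data.rcsb.org/rest/v1/core/entry/{pdb_id}')
    if r.status_code != 200:
        continue
    data = r.json()
    # Navigate nested structure; the resolution field may be missing for some entries
    values = data.get('rcsb_entry_info', {}).get('resolution_combined') or [None]
    resolutions[pdb_id] = values[0]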

Pattern 4: Build URL with Parameters

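from urllib.parse import urlencode

base_url = "https://www.ebi.ac.uk/proteins/api/proteins"
params = {
    'offset': 0,
    'size': 10,
    'accession': 'P0A3X7',
    'reviewed': 'true'
}

# Build the query string; urlencode() percent-encodes keys and values
query = urlencode(params)
url = f"{base_url}?{query}"

# Alternative: let requests build the URL from the dictionary
# r = requests.get(base_url, params=params, headers={"Accept": "application/json"})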

Pattern 5: Handle Paginated Results

Search APIs often return limited results per page:

j = r.json()
print(f"Total: {j['total_count']}")
print(f"Returned: {len(j['result_set'])}")

# If total > returned, you need pagination
# Check API docs for how to request more pages
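
The bookkeeping behind pagination can be sketched without calling any API. The example below assumes the search service accepts a paginate block with start and rows under request_options; that is how the RCSB Search API is understood to work here, but it should be confirmed against the current documentation. The helper names page_starts and with_page are hypothetical.

```python
import copy

def page_starts(total_count, rows_per_page):
    """Return the start offset of every page needed to cover total_count results."""
    return list(range(0, total_count, rows_per_page))

def with_page(query, start, rows):
    """Return a copy of the query dict asking for one page (assumed paginate schema)."""
    paged = copy.deepcopy(query)  # leave the caller's query untouched
    paged.setdefault("request_options", {})["paginate"] = {"start": start, "rows": rows}
    return paged

# The caffeine text query from Part 5, written as a dictionary
base_query = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_nonpolymer_instance_annotation.comp_id",
            "operator": "exact_match",
            "value": "CFF",
        },
    },
    "return_type": "entry",
}

# 250 is a stand-in for j['total_count'] from a first request
for start in page_starts(250, 100):
    page_query = with_page(base_query, start, 100)
    print(page_query["request_options"]["paginate"])
```

Each page_query would then be serialised with json.dumps, URL-encoded, and sent like any other search.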

API Summary

Database      Base URL                              JSON by default?   Notes
PDB REST      data.rcsb.org/rest/v1/core/           Yes                Entry, entity, chemcomp
PDB Search    search.rcsb.org/rcsbsearch/v2/query   Yes                URL-encode query
PDB GraphQL   data.rcsb.org/graphql                 Yes                Flexible queries
UniProt       ebi.ac.uk/proteins/api/               No (need header)   Returns list
NCBI          api.ncbi.nlm.nih.gov/datasets/        No (need header)   Gene, genome, etc.

Quick Reference

requests Basics

import requests

# GET request
r = requests.get(url)
r = requests.get(url, headers={'Accept': 'application/json'})

# Check status
r.status_code  # 200 = success

# Get response
r.text  # As string
r.json()  # As dict or list (if JSON)

URL Encoding

# For Search API queries
encoded = requests.utils.requote_uri(query_string)
url = f'https://search.rcsb.org/rcsbsearch/v2/query?json={encoded}'

PDB API URLs

# Entry info
f'https://data.rcsb.org/rest/v1/core/entry/{pdb_id}'

# Polymer entity
f'https://data.rcsb.org/rest/v1/core/polymer_entity/{pdb_id}/{entity_id}'

# Chemical component
f'https://data.rcsb.org/rest/v1/core/chemcomp/{ccd_id}'

# DrugBank
f'https://data.rcsb.org/rest/v1/core/drugbank/{ccd_id}'

# PubMed
f'https://data.rcsb.org/rest/v1/core/pubmed/{pdb_id}'

# FASTA
f'https://www.rcsb.org/fasta/entry/{pdb_id}/download'

# GraphQL
f'https://data.rcsb.org/graphql?query={encoded_query}'

# Search
f'https://search.rcsb.org/rcsbsearch/v2/query?json={encoded_query}'

UniProt API URL

# Needs header: {"Accept": "application/json"}
f'https://www.ebi.ac.uk/proteins/api/proteins?accession={uniprot_id}&reviewed=true'

Common Mistakes

Mistake                              Problem                           Fix
Not checking status code             Process garbage data              Always check r.status_code == 200
Forgetting JSON header for UniProt   Get a non-JSON response           Add headers={"Accept": "application/json"}
Not URL-encoding search queries      Query fails                       Use requests.utils.requote_uri()
Assuming dict when it's a list       TypeError                         Check type(r.json())
Calling .json() on non-JSON          ValueError (JSON decode error)    Check if response is actually JSON
Not handling missing keys            KeyError                          Use .get('key', default)
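
For deeply nested responses, chaining .get() calls gets unwieldy. One option is a small helper that walks a path of keys and indexes and falls back to a default. This is a sketch, not part of requests or any database client; the name dig and the sample dictionary are made up for the example.

```python
def dig(data, *keys, default=None):
    """Follow keys/indexes through nested dicts and lists; return default if a step is missing."""
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current

# Hand-written sample shaped like part of an entry response
entry = {"rcsb_entry_info": {"resolution_combined": [1.86]}}

print(dig(entry, "rcsb_entry_info", "resolution_combined", 0))
print(dig(entry, "rcsb_entry_info", "no_such_field", 0, default="n/a"))
```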

Workflow Example: Get GO Terms for a PDB Structure

Complete workflow combining PDB and UniProt:

import requests

# 1. Get UniProt ID from PDB
pdb_id = "4GYD"
query = '''{
    polymer_entity(entry_id: "%s", entity_id: "1") {
        rcsb_polymer_entity_align {
            reference_database_name
            reference_database_accession
        }
    }
}''' % pdb_id

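r = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(query))
r.raise_for_status()  # Stop here on an HTTP error
data = r.json()

# Find the first UniProt accession (stays None if there is no mapping)
uniprot_id = None
for align in data['data']['polymer_entity']['rcsb_polymer_entity_align'] or []:
    if align['reference_database_name'] == 'UniProt':
        uniprot_id = align['reference_database_accession']
        break

if uniprot_id is None:
    raise SystemExit(f"No UniProt mapping found for {pdb_id}")
print(f"UniProt ID: {uniprot_id}")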

# 2. Get GO terms from UniProt
url = f"https://www.ebi.ac.uk/proteins/api/proteins?accession={uniprot_id}&reviewed=true"
r = requests.get(url, headers={"Accept": "application/json"})
j = r.json()

print("Gene Ontology terms:")
for item in j[0]['dbReferences']:
    if item['type'] == "GO":
        print(f"  {item['id']}: {item['properties']['term']}")