Create Your Own Database
The Goal
Combine everything you've learned:
- SQLite databases
- PDB GraphQL API
- UniProt REST API
Into one project: Build your own local database that integrates data from multiple sources.
Part 1: The Problem
You have PDB IDs (e.g., 4GYD, 1TU2). You want to store:
From PDB:
- Structure weight (kDa)
- Atom count
- Residue count
- Polymer information
- UniProt IDs
- Source organism
From UniProt:
- Gene Ontology (GO) annotations
Why a local database? Because:
- Faster queries than hitting APIs repeatedly
- Combine data from multiple sources
- Custom queries across all your data
- Works offline
Part 2: Gene Ontology (GO)
What is GO?
Gene Ontology is a standardized vocabulary for describing protein functions. It lets you compare proteins across species using consistent terminology.
Three Categories
| Category | Code | What it describes | Example |
|---|---|---|---|
| Molecular Function | F | What the protein does at molecular level | F:iron ion binding |
| Biological Process | P | What pathway/process it's involved in | P:photosynthesis |
| Cellular Component | C | Where in the cell it's located | C:plasma membrane |
GO ID Format
GO:0005506
Seven digits after "GO:". Each ID maps to a specific term.
Example GO Entry
{
'type': 'GO',
'id': 'GO:0005506',
'properties': {
'term': 'F:iron ion binding',
'source': 'IEA:InterPro'
}
}
- id: The GO identifier
- term: Category code + description
- source: Where the annotation came from (evidence)
Part 3: Database Schema Design
Why Multiple Tables?
One PDB structure can have:
- Multiple polymers (chains)
- Each polymer can have multiple GO annotations
This is a one-to-many relationship. Storing everything in one table would mean massive data duplication.
The Three Tables
structures (1) ----< (N) polymers (1) ----< (N) go_annotations
One structure → many polymers → many GO annotations
Table 1: structures
CREATE TABLE structures (
pdb_id TEXT PRIMARY KEY,
title TEXT,
total_weight REAL,
atom_count INTEGER,
residue_count INTEGER
)
One row per PDB entry.
Table 2: polymers
CREATE TABLE polymers (
polymer_id TEXT PRIMARY KEY,
pdb_id TEXT NOT NULL,
uniprot_accession TEXT,
protein_name TEXT,
scientific_name TEXT,
FOREIGN KEY (pdb_id) REFERENCES structures(pdb_id),
UNIQUE (polymer_id, scientific_name, uniprot_accession)
)
One row per polymer (chain) in a structure.
The FOREIGN KEY links back to the structures table.
Table 3: go_annotations
CREATE TABLE go_annotations (
id INTEGER PRIMARY KEY,
go_id TEXT NOT NULL,
go_term TEXT NOT NULL,
go_source TEXT NOT NULL,
polymer_id TEXT NOT NULL,
FOREIGN KEY (polymer_id) REFERENCES polymers(polymer_id),
UNIQUE (polymer_id, go_id)
)
One row per GO annotation per polymer.
The id INTEGER PRIMARY KEY auto-increments — you don't specify it when inserting.
Part 4: Creating the Schema
import sqlite3 as sql
import requests
# Connect to database (creates file if doesn't exist)
conn = sql.connect('my_database.sqlite')
cur = conn.cursor()
# Drop existing tables (start fresh)
cur.execute('DROP TABLE IF EXISTS structures')
cur.execute('DROP TABLE IF EXISTS polymers')
cur.execute('DROP TABLE IF EXISTS go_annotations')
# Create tables
cur.execute('''CREATE TABLE structures (
pdb_id TEXT PRIMARY KEY,
title TEXT,
total_weight REAL,
atom_count INTEGER,
residue_count INTEGER
)''')
cur.execute('''CREATE TABLE polymers (
polymer_id TEXT PRIMARY KEY,
pdb_id TEXT NOT NULL,
uniprot_accession TEXT,
protein_name TEXT,
scientific_name TEXT,
FOREIGN KEY (pdb_id) REFERENCES structures(pdb_id),
UNIQUE (polymer_id, scientific_name, uniprot_accession)
)''')
cur.execute('''CREATE TABLE go_annotations (
id INTEGER PRIMARY KEY,
go_id TEXT NOT NULL,
go_term TEXT NOT NULL,
go_source TEXT NOT NULL,
polymer_id TEXT NOT NULL,
FOREIGN KEY (polymer_id) REFERENCES polymers(polymer_id),
UNIQUE (polymer_id, go_id)
)''')
conn.commit()
Part 5: The GraphQL Query
What We Need from PDB
{
entries(entry_ids: ["4GYD", "1TU2"]) {
rcsb_id
struct { title }
rcsb_entry_info {
molecular_weight
deposited_atom_count
deposited_modeled_polymer_monomer_count
}
polymer_entities {
rcsb_id
rcsb_entity_source_organism {
ncbi_scientific_name
}
uniprots {
rcsb_uniprot_container_identifiers {
uniprot_id
}
rcsb_uniprot_protein {
name {
value
}
}
}
}
}
}
Understanding the Response Structure
The response is nested:
entries (list)
└── each entry (one per PDB ID)
├── rcsb_id
├── struct.title
├── rcsb_entry_info (weight, counts)
└── polymer_entities (list)
└── each polymer
├── rcsb_id (polymer ID like "4GYD_1")
├── rcsb_entity_source_organism (list of organisms)
└── uniprots (list)
├── rcsb_uniprot_container_identifiers.uniprot_id
└── rcsb_uniprot_protein.name.value
Execute the Query
pdb_query = '''
{
entries(entry_ids: ["4GYD", "1TU2"]) {
rcsb_id
struct { title }
rcsb_entry_info {
molecular_weight
deposited_atom_count
deposited_modeled_polymer_monomer_count
}
polymer_entities {
rcsb_id
rcsb_entity_source_organism {
ncbi_scientific_name
}
uniprots {
rcsb_uniprot_container_identifiers {
uniprot_id
}
rcsb_uniprot_protein {
name {
value
}
}
}
}
}
}
'''
p = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(pdb_query))
j = p.json()
Part 6: Populating the Database
Step 1: Insert into structures table
for prot in j['data']['entries']:
pdb_id = prot['rcsb_id']
title = prot['struct']['title']
weight = prot['rcsb_entry_info']['molecular_weight']
atom_count = prot['rcsb_entry_info']['deposited_atom_count']
residue_count = prot['rcsb_entry_info']['deposited_modeled_polymer_monomer_count']
cur.execute('INSERT INTO structures VALUES (?, ?, ?, ?, ?)',
(pdb_id, title, weight, atom_count, residue_count))
Step 2: Insert into polymers table
# Still inside the loop over entries
for polymer in prot['polymer_entities']:
polymer_id = polymer['rcsb_id']
# Extract all source organisms (could be multiple)
source_organisms = []
for so in polymer['rcsb_entity_source_organism']:
source_organisms.append(so['ncbi_scientific_name'])
# Extract all UniProt info
uniprots = []
for up in polymer['uniprots']:
uniprot_id = up['rcsb_uniprot_container_identifiers']['uniprot_id']
protein_name = up['rcsb_uniprot_protein']['name']['value']
uniprots.append((uniprot_id, protein_name))
# Create all combinations (organism × uniprot)
combinations = [(org, up) for org in source_organisms for up in uniprots]
# Insert each combination
for (organism, uniprot_info) in combinations:
cur.execute('INSERT INTO polymers VALUES (?, ?, ?, ?, ?)',
(polymer_id,
pdb_id,
uniprot_info[0], # UniProt accession
uniprot_info[1], # Protein name
organism)) # Scientific name
Step 3: Query UniProt and insert GO annotations
# For each UniProt ID, get GO annotations
for up in uniprots:
accession_id = up[0]
# Query UniProt API
uniprot_url = f'https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession={accession_id}'
r = requests.get(uniprot_url, headers={"Accept": "application/json"})
# GO info is in dbReferences
db_info = r.json()[0]['dbReferences']
for db in db_info:
if db['type'] == 'GO':
go_id = db['id']
go_term = db['properties']['term']
go_source = db['properties']['source']
# Insert (don't specify id - it auto-increments)
cur.execute('''INSERT INTO go_annotations
(go_id, go_term, go_source, polymer_id)
VALUES (?, ?, ?, ?)''',
(go_id, go_term, go_source, polymer_id))
conn.commit()
Part 7: The Complete Code
import sqlite3 as sql
import requests
# Connect
conn = sql.connect('my_database.sqlite')
cur = conn.cursor()
# Create schema
cur.execute('DROP TABLE IF EXISTS structures')
cur.execute('DROP TABLE IF EXISTS polymers')
cur.execute('DROP TABLE IF EXISTS go_annotations')
cur.execute('''CREATE TABLE structures (
pdb_id TEXT PRIMARY KEY,
title TEXT,
total_weight REAL,
atom_count INTEGER,
residue_count INTEGER
)''')
cur.execute('''CREATE TABLE polymers (
polymer_id TEXT PRIMARY KEY,
pdb_id TEXT NOT NULL,
uniprot_accession TEXT,
protein_name TEXT,
scientific_name TEXT,
FOREIGN KEY (pdb_id) REFERENCES structures(pdb_id),
UNIQUE (polymer_id, scientific_name, uniprot_accession)
)''')
cur.execute('''CREATE TABLE go_annotations (
id INTEGER PRIMARY KEY,
go_id TEXT NOT NULL,
go_term TEXT NOT NULL,
go_source TEXT NOT NULL,
polymer_id TEXT NOT NULL,
FOREIGN KEY (polymer_id) REFERENCES polymers(polymer_id),
UNIQUE (polymer_id, go_id)
)''')
conn.commit()
# Query PDB
pdb_query = '''{ entries(entry_ids: ["4GYD", "1TU2"]) { ... } }''' # Full query here
p = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(pdb_query))
j = p.json()
# Populate database
for prot in j['data']['entries']:
# Insert structure
pdb_id = prot['rcsb_id']
title = prot['struct']['title']
weight = prot['rcsb_entry_info']['molecular_weight']
atom_count = prot['rcsb_entry_info']['deposited_atom_count']
residue_count = prot['rcsb_entry_info']['deposited_modeled_polymer_monomer_count']
cur.execute('INSERT INTO structures VALUES (?, ?, ?, ?, ?)',
(pdb_id, title, weight, atom_count, residue_count))
# Insert polymers and GO annotations
for polymer in prot['polymer_entities']:
polymer_id = polymer['rcsb_id']
source_organisms = [so['ncbi_scientific_name']
for so in polymer['rcsb_entity_source_organism']]
uniprots = [(up['rcsb_uniprot_container_identifiers']['uniprot_id'],
up['rcsb_uniprot_protein']['name']['value'])
for up in polymer['uniprots']]
combinations = [(org, up) for org in source_organisms for up in uniprots]
for (organism, uniprot_info) in combinations:
cur.execute('INSERT INTO polymers VALUES (?, ?, ?, ?, ?)',
(polymer_id, pdb_id, uniprot_info[0], uniprot_info[1], organism))
# Get GO annotations from UniProt
for up in uniprots:
accession_id = up[0]
uniprot_url = f'https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession={accession_id}'
r = requests.get(uniprot_url, headers={"Accept": "application/json"})
for db in r.json()[0]['dbReferences']:
if db['type'] == 'GO':
cur.execute('''INSERT INTO go_annotations
(go_id, go_term, go_source, polymer_id)
VALUES (?, ?, ?, ?)''',
(db['id'], db['properties']['term'],
db['properties']['source'], polymer_id))
conn.commit()
conn.close()
Part 8: Querying Your Database
Basic Queries
Get all info for a PDB ID:
cur.execute('SELECT * FROM structures WHERE pdb_id = ?', ("4GYD",))
print(cur.fetchall())
# [('4GYD', 'Nostoc sp Cytochrome c6', 58.57, 4598, 516)]
Get all polymers for a PDB ID:
cur.execute('SELECT * FROM polymers WHERE pdb_id = ?', ("4GYD",))
print(cur.fetchall())
# [('4GYD_1', '4GYD', 'P0A3X7', 'Cytochrome c6', 'Nostoc sp. PCC 7120')]
Top 10 heaviest structures:
cur.execute('''SELECT pdb_id, title, total_weight
FROM structures
ORDER BY total_weight DESC
LIMIT 10''')
print(cur.fetchall())
GO annotations from a specific source:
cur.execute('SELECT * FROM go_annotations WHERE go_source LIKE "%UniProtKB-UniRule%"')
print(cur.fetchall())
Queries Across Tables (JOINs)
Get all GO IDs for a UniProt accession (using subquery):
cur.execute('''
SELECT go_id FROM go_annotations AS ga
WHERE ga.polymer_id IN (
SELECT p.polymer_id
FROM polymers AS p
WHERE p.uniprot_accession = ?
)
''', ("P46444",))
print(cur.fetchall())
Same query using JOIN:
cur.execute('''
SELECT g.go_id
FROM go_annotations AS g
JOIN polymers AS p ON p.polymer_id = g.polymer_id
WHERE p.uniprot_accession = ?
''', ("P46444",))
print(cur.fetchall())
Both return the same result. The AS creates aliases (shortcuts for table names).
Count GO annotations per structure:
cur.execute('''
SELECT COUNT(go_annotations.go_id)
FROM go_annotations
WHERE polymer_id IN (
SELECT polymer_id
FROM polymers
WHERE pdb_id = ?
)
''', ("1TU2",))
print(cur.fetchall())
# [(8,)]
Part 9: Understanding JOINs
What is a JOIN?
A JOIN combines rows from two tables based on a related column.
The Tables
polymers:
polymer_id | pdb_id | uniprot_accession | ...
-----------+--------+-------------------+----
4GYD_1 | 4GYD | P0A3X7 | ...
1TU2_1 | 1TU2 | P46444 | ...
go_annotations:
id | go_id | polymer_id | ...
---+-------------+------------+----
1 | GO:0005506 | 4GYD_1 | ...
2 | GO:0009055 | 4GYD_1 | ...
3 | GO:0005507 | 1TU2_1 | ...
JOIN in Action
SELECT g.go_id, p.uniprot_accession
FROM go_annotations AS g
JOIN polymers AS p ON p.polymer_id = g.polymer_id
WHERE p.pdb_id = '4GYD'
This:
- Takes each row from
go_annotations - Finds the matching row in
polymers(where polymer_ids match) - Combines them
- Filters by pdb_id
Result:
go_id | uniprot_accession
-----------+------------------
GO:0005506 | P0A3X7
GO:0009055 | P0A3X7
Subquery Alternative
Same result, different approach:
SELECT go_id FROM go_annotations
WHERE polymer_id IN (
SELECT polymer_id FROM polymers WHERE pdb_id = '4GYD'
)
- Inner query gets polymer_ids for 4GYD
- Outer query gets GO IDs for those polymers
Part 10: Exporting the Schema
Why Export Schema?
You might want to:
- Document your database structure
- Recreate the database elsewhere
- Share the schema without the data
export_schema.py
import sqlite3
import os
import sys
def export_sqlite_schema(db_path, output_file):
"""
Extracts the schema from a SQLite database and writes it to a file.
"""
if not os.path.isfile(db_path):
print(f"Error: Database file '{db_path}' not found.")
return False
try:
# Connect read-only
conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
cursor = conn.cursor()
# Get schema from sqlite_master
cursor.execute("SELECT sql FROM sqlite_master WHERE sql IS NOT NULL;")
schema_statements = cursor.fetchall()
if not schema_statements:
print("No schema found in the database.")
return False
# Write to file
with open(output_file, "w", encoding="utf-8") as f:
for stmt in schema_statements:
f.write(stmt[0] + ";\n\n")
print(f"Schema successfully exported to '{output_file}'")
return True
except sqlite3.Error as e:
print(f"SQLite error: {e}")
return False
finally:
if 'conn' in locals():
conn.close()
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: python export_schema.py <database_path> <output_sql_file>")
sys.exit(1)
db_file = sys.argv[1]
output_file = sys.argv[2]
export_sqlite_schema(db_file, output_file)
Usage
python export_schema.py my_database.sqlite schema.sql
What sqlite_master Contains
Every SQLite database has a special table called sqlite_master that stores:
- Table definitions (CREATE TABLE statements)
- Index definitions
- View definitions
- Trigger definitions
SELECT sql FROM sqlite_master WHERE sql IS NOT NULL;
Returns all the CREATE statements that define your database structure.
Part 11: Key Concepts Summary
Database Design
| Concept | Application |
|---|---|
| Primary Key | Unique identifier for each row (pdb_id, polymer_id) |
| Foreign Key | Links tables together (polymers.pdb_id → structures.pdb_id) |
| One-to-Many | One structure has many polymers; one polymer has many GO annotations |
| UNIQUE constraint | Prevents duplicate combinations |
| Auto-increment | id INTEGER PRIMARY KEY auto-generates values |
Data Flow
PDB GraphQL API
↓
Extract structure info → INSERT INTO structures
↓
Extract polymer info → INSERT INTO polymers
↓
For each UniProt ID:
↓
UniProt REST API
↓
Extract GO annotations → INSERT INTO go_annotations
↓
conn.commit()
SQL Operations
| Operation | Example |
|---|---|
| SELECT | SELECT * FROM structures WHERE pdb_id = '4GYD' |
| WHERE | Filter rows |
| ORDER BY | ORDER BY total_weight DESC |
| LIMIT | LIMIT 10 |
| LIKE | WHERE go_source LIKE '%UniRule%' |
| COUNT | SELECT COUNT(go_id) FROM ... |
| JOIN | Combine related tables |
| Subquery | Nested SELECT |
Quick Reference
Schema Creation Pattern
cur.execute('DROP TABLE IF EXISTS tablename')
cur.execute('''CREATE TABLE tablename (
column1 TYPE CONSTRAINT,
column2 TYPE CONSTRAINT,
FOREIGN KEY (column) REFERENCES other_table(column)
)''')
conn.commit()
Insert Pattern
# With all columns
cur.execute('INSERT INTO table VALUES (?, ?, ?)', (val1, val2, val3))
# With specific columns (skip auto-increment)
cur.execute('INSERT INTO table (col1, col2) VALUES (?, ?)', (val1, val2))
Query Pattern
cur.execute('SELECT columns FROM table WHERE condition', (params,))
results = cur.fetchall()
JOIN Pattern
cur.execute('''
SELECT t1.col, t2.col
FROM table1 AS t1
JOIN table2 AS t2 ON t1.key = t2.key
WHERE condition
''')
Common Mistakes
| Mistake | Problem | Fix |
|---|---|---|
Forgetting conn.commit() | Data not saved | Always commit after inserts |
Wrong number of ? | Insert fails | Count columns carefully |
| Not handling lists | Missing data | Check if lists could have multiple items |
| Hardcoding IDs | Not reusable | Use variables and parameters |
| Not closing connection | Resource leak | Always conn.close() |
| Duplicate primary key | Insert fails | Use UNIQUE constraints or check first |