Create Your Own Database

The Goal

Combine everything you've learned:

  • SQLite databases
  • PDB GraphQL API
  • UniProt REST API

Into one project: Build your own local database that integrates data from multiple sources.


Part 1: The Problem

You have PDB IDs (e.g., 4GYD, 1TU2). You want to store:

From PDB:

  • Structure weight (kDa)
  • Atom count
  • Residue count
  • Polymer information
  • UniProt IDs
  • Source organism

From UniProt:

  • Gene Ontology (GO) annotations

Why a local database? Because:

  • Faster queries than hitting APIs repeatedly
  • Combine data from multiple sources
  • Custom queries across all your data
  • Works offline

Part 2: Gene Ontology (GO)

What is GO?

Gene Ontology is a standardized vocabulary for describing protein functions. It lets you compare proteins across species using consistent terminology.

Three Categories

CategoryCodeWhat it describesExample
Molecular FunctionFWhat the protein does at molecular levelF:iron ion binding
Biological ProcessPWhat pathway/process it's involved inP:photosynthesis
Cellular ComponentCWhere in the cell it's locatedC:plasma membrane

GO ID Format

GO:0005506

Seven digits after "GO:". Each ID maps to a specific term.

Example GO Entry

{
    'type': 'GO',
    'id': 'GO:0005506',
    'properties': {
        'term': 'F:iron ion binding',
        'source': 'IEA:InterPro'
    }
}
  • id: The GO identifier
  • term: Category code + description
  • source: Where the annotation came from (evidence)

Part 3: Database Schema Design

Why Multiple Tables?

One PDB structure can have:

  • Multiple polymers (chains)
  • Each polymer can have multiple GO annotations

This is a one-to-many relationship. Storing everything in one table would mean massive data duplication.

The Three Tables

structures (1) ----< (N) polymers (1) ----< (N) go_annotations

One structure → many polymers → many GO annotations

Table 1: structures

CREATE TABLE structures (
    pdb_id TEXT PRIMARY KEY,
    title TEXT,
    total_weight REAL,
    atom_count INTEGER,
    residue_count INTEGER
)

One row per PDB entry.

Table 2: polymers

CREATE TABLE polymers (
    polymer_id TEXT PRIMARY KEY,
    pdb_id TEXT NOT NULL,
    uniprot_accession TEXT,
    protein_name TEXT,
    scientific_name TEXT,
    FOREIGN KEY (pdb_id) REFERENCES structures(pdb_id),
    UNIQUE (polymer_id, scientific_name, uniprot_accession)
)

One row per polymer (chain) in a structure.

The FOREIGN KEY links back to the structures table.

Table 3: go_annotations

CREATE TABLE go_annotations (
    id INTEGER PRIMARY KEY,
    go_id TEXT NOT NULL,
    go_term TEXT NOT NULL,
    go_source TEXT NOT NULL,
    polymer_id TEXT NOT NULL,
    FOREIGN KEY (polymer_id) REFERENCES polymers(polymer_id),
    UNIQUE (polymer_id, go_id)
)

One row per GO annotation per polymer.

The id INTEGER PRIMARY KEY auto-increments — you don't specify it when inserting.


Part 4: Creating the Schema

import sqlite3 as sql
import requests

# Connect to database (creates file if doesn't exist)
conn = sql.connect('my_database.sqlite')
cur = conn.cursor()

# Drop existing tables (start fresh)
cur.execute('DROP TABLE IF EXISTS structures')
cur.execute('DROP TABLE IF EXISTS polymers')
cur.execute('DROP TABLE IF EXISTS go_annotations')

# Create tables
cur.execute('''CREATE TABLE structures (
    pdb_id TEXT PRIMARY KEY,
    title TEXT,
    total_weight REAL,
    atom_count INTEGER,
    residue_count INTEGER
)''')

cur.execute('''CREATE TABLE polymers (
    polymer_id TEXT PRIMARY KEY,
    pdb_id TEXT NOT NULL,
    uniprot_accession TEXT,
    protein_name TEXT,
    scientific_name TEXT,
    FOREIGN KEY (pdb_id) REFERENCES structures(pdb_id),
    UNIQUE (polymer_id, scientific_name, uniprot_accession)
)''')

cur.execute('''CREATE TABLE go_annotations (
    id INTEGER PRIMARY KEY,
    go_id TEXT NOT NULL,
    go_term TEXT NOT NULL,
    go_source TEXT NOT NULL,
    polymer_id TEXT NOT NULL,
    FOREIGN KEY (polymer_id) REFERENCES polymers(polymer_id),
    UNIQUE (polymer_id, go_id)
)''')

conn.commit()

Part 5: The GraphQL Query

What We Need from PDB

{
  entries(entry_ids: ["4GYD", "1TU2"]) {
    rcsb_id
    struct { title }
    rcsb_entry_info {
      molecular_weight
      deposited_atom_count
      deposited_modeled_polymer_monomer_count
    }
    polymer_entities {
      rcsb_id
      rcsb_entity_source_organism {
        ncbi_scientific_name
      }
      uniprots {
        rcsb_uniprot_container_identifiers {
          uniprot_id
        }
        rcsb_uniprot_protein {
          name {
            value
          }
        }
      }
    }
  }
}

Understanding the Response Structure

The response is nested:

entries (list)
  └── each entry (one per PDB ID)
        ├── rcsb_id
        ├── struct.title
        ├── rcsb_entry_info (weight, counts)
        └── polymer_entities (list)
              └── each polymer
                    ├── rcsb_id (polymer ID like "4GYD_1")
                    ├── rcsb_entity_source_organism (list of organisms)
                    └── uniprots (list)
                          ├── rcsb_uniprot_container_identifiers.uniprot_id
                          └── rcsb_uniprot_protein.name.value

Execute the Query

pdb_query = '''
{
  entries(entry_ids: ["4GYD", "1TU2"]) {
    rcsb_id
    struct { title }
    rcsb_entry_info {
      molecular_weight
      deposited_atom_count
      deposited_modeled_polymer_monomer_count
    }
    polymer_entities {
      rcsb_id
      rcsb_entity_source_organism {
        ncbi_scientific_name
      }
      uniprots {
        rcsb_uniprot_container_identifiers {
          uniprot_id
        }
        rcsb_uniprot_protein {
          name {
            value
          }
        }
      }
    }
  }
}
'''

p = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(pdb_query))
j = p.json()

Part 6: Populating the Database

Step 1: Insert into structures table

for prot in j['data']['entries']:
    pdb_id = prot['rcsb_id']
    title = prot['struct']['title']
    weight = prot['rcsb_entry_info']['molecular_weight']
    atom_count = prot['rcsb_entry_info']['deposited_atom_count']
    residue_count = prot['rcsb_entry_info']['deposited_modeled_polymer_monomer_count']
    
    cur.execute('INSERT INTO structures VALUES (?, ?, ?, ?, ?)',
                (pdb_id, title, weight, atom_count, residue_count))

Step 2: Insert into polymers table

    # Still inside the loop over entries
    for polymer in prot['polymer_entities']:
        polymer_id = polymer['rcsb_id']
        
        # Extract all source organisms (could be multiple)
        source_organisms = []
        for so in polymer['rcsb_entity_source_organism']:
            source_organisms.append(so['ncbi_scientific_name'])
        
        # Extract all UniProt info
        uniprots = []
        for up in polymer['uniprots']:
            uniprot_id = up['rcsb_uniprot_container_identifiers']['uniprot_id']
            protein_name = up['rcsb_uniprot_protein']['name']['value']
            uniprots.append((uniprot_id, protein_name))
        
        # Create all combinations (organism × uniprot)
        combinations = [(org, up) for org in source_organisms for up in uniprots]
        
        # Insert each combination
        for (organism, uniprot_info) in combinations:
            cur.execute('INSERT INTO polymers VALUES (?, ?, ?, ?, ?)',
                        (polymer_id,
                         pdb_id,
                         uniprot_info[0],  # UniProt accession
                         uniprot_info[1],  # Protein name
                         organism))        # Scientific name

Step 3: Query UniProt and insert GO annotations

        # For each UniProt ID, get GO annotations
        for up in uniprots:
            accession_id = up[0]
            
            # Query UniProt API
            uniprot_url = f'https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession={accession_id}'
            r = requests.get(uniprot_url, headers={"Accept": "application/json"})
            
            # GO info is in dbReferences
            db_info = r.json()[0]['dbReferences']
            
            for db in db_info:
                if db['type'] == 'GO':
                    go_id = db['id']
                    go_term = db['properties']['term']
                    go_source = db['properties']['source']
                    
                    # Insert (don't specify id - it auto-increments)
                    cur.execute('''INSERT INTO go_annotations 
                                   (go_id, go_term, go_source, polymer_id)
                                   VALUES (?, ?, ?, ?)''',
                                (go_id, go_term, go_source, polymer_id))

conn.commit()

Part 7: The Complete Code

import sqlite3 as sql
import requests

# Connect
conn = sql.connect('my_database.sqlite')
cur = conn.cursor()

# Create schema
cur.execute('DROP TABLE IF EXISTS structures')
cur.execute('DROP TABLE IF EXISTS polymers')
cur.execute('DROP TABLE IF EXISTS go_annotations')

cur.execute('''CREATE TABLE structures (
    pdb_id TEXT PRIMARY KEY,
    title TEXT,
    total_weight REAL,
    atom_count INTEGER,
    residue_count INTEGER
)''')

cur.execute('''CREATE TABLE polymers (
    polymer_id TEXT PRIMARY KEY,
    pdb_id TEXT NOT NULL,
    uniprot_accession TEXT,
    protein_name TEXT,
    scientific_name TEXT,
    FOREIGN KEY (pdb_id) REFERENCES structures(pdb_id),
    UNIQUE (polymer_id, scientific_name, uniprot_accession)
)''')

cur.execute('''CREATE TABLE go_annotations (
    id INTEGER PRIMARY KEY,
    go_id TEXT NOT NULL,
    go_term TEXT NOT NULL,
    go_source TEXT NOT NULL,
    polymer_id TEXT NOT NULL,
    FOREIGN KEY (polymer_id) REFERENCES polymers(polymer_id),
    UNIQUE (polymer_id, go_id)
)''')

conn.commit()

# Query PDB
pdb_query = '''{ entries(entry_ids: ["4GYD", "1TU2"]) { ... } }'''  # Full query here
p = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(pdb_query))
j = p.json()

# Populate database
for prot in j['data']['entries']:
    # Insert structure
    pdb_id = prot['rcsb_id']
    title = prot['struct']['title']
    weight = prot['rcsb_entry_info']['molecular_weight']
    atom_count = prot['rcsb_entry_info']['deposited_atom_count']
    residue_count = prot['rcsb_entry_info']['deposited_modeled_polymer_monomer_count']
    
    cur.execute('INSERT INTO structures VALUES (?, ?, ?, ?, ?)',
                (pdb_id, title, weight, atom_count, residue_count))
    
    # Insert polymers and GO annotations
    for polymer in prot['polymer_entities']:
        polymer_id = polymer['rcsb_id']
        
        source_organisms = [so['ncbi_scientific_name'] 
                          for so in polymer['rcsb_entity_source_organism']]
        
        uniprots = [(up['rcsb_uniprot_container_identifiers']['uniprot_id'],
                    up['rcsb_uniprot_protein']['name']['value'])
                   for up in polymer['uniprots']]
        
        combinations = [(org, up) for org in source_organisms for up in uniprots]
        
        for (organism, uniprot_info) in combinations:
            cur.execute('INSERT INTO polymers VALUES (?, ?, ?, ?, ?)',
                        (polymer_id, pdb_id, uniprot_info[0], uniprot_info[1], organism))
        
        # Get GO annotations from UniProt
        for up in uniprots:
            accession_id = up[0]
            uniprot_url = f'https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession={accession_id}'
            r = requests.get(uniprot_url, headers={"Accept": "application/json"})
            
            for db in r.json()[0]['dbReferences']:
                if db['type'] == 'GO':
                    cur.execute('''INSERT INTO go_annotations 
                                   (go_id, go_term, go_source, polymer_id)
                                   VALUES (?, ?, ?, ?)''',
                                (db['id'], db['properties']['term'], 
                                 db['properties']['source'], polymer_id))

conn.commit()
conn.close()

Part 8: Querying Your Database

Basic Queries

Get all info for a PDB ID:

cur.execute('SELECT * FROM structures WHERE pdb_id = ?', ("4GYD",))
print(cur.fetchall())
# [('4GYD', 'Nostoc sp Cytochrome c6', 58.57, 4598, 516)]

Get all polymers for a PDB ID:

cur.execute('SELECT * FROM polymers WHERE pdb_id = ?', ("4GYD",))
print(cur.fetchall())
# [('4GYD_1', '4GYD', 'P0A3X7', 'Cytochrome c6', 'Nostoc sp. PCC 7120')]

Top 10 heaviest structures:

cur.execute('''SELECT pdb_id, title, total_weight 
               FROM structures
               ORDER BY total_weight DESC
               LIMIT 10''')
print(cur.fetchall())

GO annotations from a specific source:

cur.execute('SELECT * FROM go_annotations WHERE go_source LIKE "%UniProtKB-UniRule%"')
print(cur.fetchall())

Queries Across Tables (JOINs)

Get all GO IDs for a UniProt accession (using subquery):

cur.execute('''
    SELECT go_id FROM go_annotations AS ga
    WHERE ga.polymer_id IN (
        SELECT p.polymer_id
        FROM polymers AS p
        WHERE p.uniprot_accession = ?
    )
''', ("P46444",))
print(cur.fetchall())

Same query using JOIN:

cur.execute('''
    SELECT g.go_id
    FROM go_annotations AS g
    JOIN polymers AS p ON p.polymer_id = g.polymer_id
    WHERE p.uniprot_accession = ?
''', ("P46444",))
print(cur.fetchall())

Both return the same result. The AS creates aliases (shortcuts for table names).

Count GO annotations per structure:

cur.execute('''
    SELECT COUNT(go_annotations.go_id)
    FROM go_annotations
    WHERE polymer_id IN (
        SELECT polymer_id
        FROM polymers
        WHERE pdb_id = ?
    )
''', ("1TU2",))
print(cur.fetchall())
# [(8,)]

Part 9: Understanding JOINs

What is a JOIN?

A JOIN combines rows from two tables based on a related column.

The Tables

polymers:
polymer_id | pdb_id | uniprot_accession | ...
-----------+--------+-------------------+----
4GYD_1     | 4GYD   | P0A3X7           | ...
1TU2_1     | 1TU2   | P46444           | ...

go_annotations:
id | go_id       | polymer_id | ...
---+-------------+------------+----
1  | GO:0005506  | 4GYD_1     | ...
2  | GO:0009055  | 4GYD_1     | ...
3  | GO:0005507  | 1TU2_1     | ...

JOIN in Action

SELECT g.go_id, p.uniprot_accession
FROM go_annotations AS g
JOIN polymers AS p ON p.polymer_id = g.polymer_id
WHERE p.pdb_id = '4GYD'

This:

  1. Takes each row from go_annotations
  2. Finds the matching row in polymers (where polymer_ids match)
  3. Combines them
  4. Filters by pdb_id

Result:

go_id      | uniprot_accession
-----------+------------------
GO:0005506 | P0A3X7
GO:0009055 | P0A3X7

Subquery Alternative

Same result, different approach:

SELECT go_id FROM go_annotations
WHERE polymer_id IN (
    SELECT polymer_id FROM polymers WHERE pdb_id = '4GYD'
)
  1. Inner query gets polymer_ids for 4GYD
  2. Outer query gets GO IDs for those polymers

Part 10: Exporting the Schema

Why Export Schema?

You might want to:

  • Document your database structure
  • Recreate the database elsewhere
  • Share the schema without the data

export_schema.py

import sqlite3
import os
import sys

def export_sqlite_schema(db_path, output_file):
    """
    Extracts the schema from a SQLite database and writes it to a file.
    """
    if not os.path.isfile(db_path):
        print(f"Error: Database file '{db_path}' not found.")
        return False
    
    try:
        # Connect read-only
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        cursor = conn.cursor()
        
        # Get schema from sqlite_master
        cursor.execute("SELECT sql FROM sqlite_master WHERE sql IS NOT NULL;")
        schema_statements = cursor.fetchall()
        
        if not schema_statements:
            print("No schema found in the database.")
            return False
        
        # Write to file
        with open(output_file, "w", encoding="utf-8") as f:
            for stmt in schema_statements:
                f.write(stmt[0] + ";\n\n")
        
        print(f"Schema successfully exported to '{output_file}'")
        return True
        
    except sqlite3.Error as e:
        print(f"SQLite error: {e}")
        return False
    finally:
        if 'conn' in locals():
            conn.close()

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python export_schema.py <database_path> <output_sql_file>")
        sys.exit(1)
    
    db_file = sys.argv[1]
    output_file = sys.argv[2]
    export_sqlite_schema(db_file, output_file)

Usage

python export_schema.py my_database.sqlite schema.sql

What sqlite_master Contains

Every SQLite database has a special table called sqlite_master that stores:

  • Table definitions (CREATE TABLE statements)
  • Index definitions
  • View definitions
  • Trigger definitions
SELECT sql FROM sqlite_master WHERE sql IS NOT NULL;

Returns all the CREATE statements that define your database structure.


Part 11: Key Concepts Summary

Database Design

ConceptApplication
Primary KeyUnique identifier for each row (pdb_id, polymer_id)
Foreign KeyLinks tables together (polymers.pdb_id → structures.pdb_id)
One-to-ManyOne structure has many polymers; one polymer has many GO annotations
UNIQUE constraintPrevents duplicate combinations
Auto-incrementid INTEGER PRIMARY KEY auto-generates values

Data Flow

PDB GraphQL API
      ↓
Extract structure info → INSERT INTO structures
      ↓
Extract polymer info → INSERT INTO polymers
      ↓
For each UniProt ID:
      ↓
UniProt REST API
      ↓
Extract GO annotations → INSERT INTO go_annotations
      ↓
conn.commit()

SQL Operations

OperationExample
SELECTSELECT * FROM structures WHERE pdb_id = '4GYD'
WHEREFilter rows
ORDER BYORDER BY total_weight DESC
LIMITLIMIT 10
LIKEWHERE go_source LIKE '%UniRule%'
COUNTSELECT COUNT(go_id) FROM ...
JOINCombine related tables
SubqueryNested SELECT

Quick Reference

Schema Creation Pattern

cur.execute('DROP TABLE IF EXISTS tablename')
cur.execute('''CREATE TABLE tablename (
    column1 TYPE CONSTRAINT,
    column2 TYPE CONSTRAINT,
    FOREIGN KEY (column) REFERENCES other_table(column)
)''')
conn.commit()

Insert Pattern

# With all columns
cur.execute('INSERT INTO table VALUES (?, ?, ?)', (val1, val2, val3))

# With specific columns (skip auto-increment)
cur.execute('INSERT INTO table (col1, col2) VALUES (?, ?)', (val1, val2))

Query Pattern

cur.execute('SELECT columns FROM table WHERE condition', (params,))
results = cur.fetchall()

JOIN Pattern

cur.execute('''
    SELECT t1.col, t2.col
    FROM table1 AS t1
    JOIN table2 AS t2 ON t1.key = t2.key
    WHERE condition
''')

Common Mistakes

MistakeProblemFix
Forgetting conn.commit()Data not savedAlways commit after inserts
Wrong number of ?Insert failsCount columns carefully
Not handling listsMissing dataCheck if lists could have multiple items
Hardcoding IDsNot reusableUse variables and parameters
Not closing connectionResource leakAlways conn.close()
Duplicate primary keyInsert failsUse UNIQUE constraints or check first