SQLite
What is a Relational Database?
A relational database organizes data into tables. Each table has:
- Rows (also called records) — individual entries
- Columns (also called fields) — attributes of each entry
Tables can be linked together through common fields. This is the "relational" part.
Part 1: Core Concepts
Schema
The schema is the structure definition of your database:
- Number of tables
- Column names and data types
- Constraints (what's allowed)
Critical: You must define the schema BEFORE you can store any data. This is a fixed structure — not flexible like a spreadsheet.
Primary Key
A column that uniquely identifies each row.
Rules:
- Only ONE primary key per table
- Values must be unique (no duplicates)
- Cannot be NULL
Example: Student ID, Order Number, ISBN
Foreign Key
A column that references a primary key in another table.
Rules:
- Can have multiple foreign keys in one table
- Values don't need to be unique
- Creates relationships between tables
Example: student_id in an Enrollments table references id in a Students table.
Why Use Multiple Tables?
Instead of repeating data:
# Bad: Student info repeated for each course
(1, 'Alice', 22, 'alice@unibo.it', 'BDB')
(2, 'Alice', 22, 'alice@unibo.it', 'BDP1') # Alice duplicated!
(3, 'Bob', 23, 'bob@unibo.it', 'BDB')
Use two linked tables:
Students:
(1, 'Alice', 22, 'alice@unibo.it')
(2, 'Bob', 23, 'bob@unibo.it')
Enrollments:
(1, 1, 'BDB') # student_id=1 (Alice)
(2, 1, 'BDP1') # student_id=1 (Alice)
(3, 2, 'BDB') # student_id=2 (Bob)
Benefits:
- No data duplication
- Update student info in one place
- Smaller storage
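To query across linked tables, SQL provides JOIN. A minimal sketch (the column names here are illustrative, since the conceptual tables above only show values):
SELECT Students.name, Enrollments.course
FROM Students
JOIN Enrollments ON Enrollments.student_id = Students.id
This returns one row per enrollment, with each student's data pulled in from Students instead of duplicated.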
Part 2: ACID Properties
A transaction is a unit of work — a set of operations that must either all succeed or all fail.
ACID guarantees for transactions:
| Property | Meaning |
|---|---|
| Atomic | All operations complete, or none do |
| Consistent | Database goes from one valid state to another |
| Isolated | Transactions don't interfere with each other |
| Durable | Once committed, changes survive crashes/power failures |
Relational databases provide ACID compliance. This is why banks use them.
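In Python's sqlite3 module (introduced below), atomicity maps onto commit() and rollback(). A minimal sketch of a bank-style transfer, using an illustrative Accounts table:
import sqlite3 as sql
conn = sql.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Accounts (id INTEGER PRIMARY KEY, balance REAL)')
cur.executemany('INSERT INTO Accounts VALUES (?, ?)', [(1, 100.0), (2, 50.0)])
conn.commit()
try:
    # Both updates must succeed together, or neither should apply
    cur.execute('UPDATE Accounts SET balance = balance - 30 WHERE id = 1')
    cur.execute('UPDATE Accounts SET balance = balance + 30 WHERE id = 2')
    conn.commit()    # make both changes permanent at once
except sql.Error:
    conn.rollback()  # undo all uncommitted changes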
Part 3: SQLite
What is SQLite?
SQLite is a relational database that lives in a single file. No server needed.
It's everywhere:
- Every Android and iOS device
- Windows 10/11, macOS
- Firefox, Chrome, Safari
- Estimated 1 trillion+ SQLite databases in active use
SQLite vs Traditional Databases
| Traditional (PostgreSQL, MySQL) | SQLite |
|---|---|
| Separate server process | No server |
| Client connects via network | Direct file access |
| Multiple files | Single file |
| Complex setup | Zero configuration |
Python Support
Python has built-in SQLite support. No installation needed:
import sqlite3
Part 4: Connecting to SQLite
Basic Connection
import sqlite3 as sql
# Connect to database (creates file if it doesn't exist)
conn = sql.connect('my_database.db')
# Get a cursor (your pointer into the database)
cur = conn.cursor()
After this, you'll have a file called my_database.db in your directory.
In-Memory Database
For testing or temporary work:
conn = sql.connect(':memory:')
Fast, but everything is lost when you close Python.
What's a Cursor?
The cursor is how you execute commands and retrieve results. Think of it as your interface to the database.
cur.execute('SQL command here')
Part 5: Creating Tables
CREATE TABLE Syntax
cur.execute('''
CREATE TABLE Students (
id INTEGER PRIMARY KEY,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
age INTEGER,
email TEXT NOT NULL UNIQUE
)
''')
Data Types
| Type | What it stores |
|---|---|
| INTEGER | Whole numbers |
| TEXT | Strings |
| REAL | Floating point numbers |
| BLOB | Binary data |
Constraints
| Constraint | Meaning |
|---|---|
| PRIMARY KEY | Unique identifier for each row |
| NOT NULL | Cannot be empty |
| UNIQUE | No duplicate values allowed |
The "Table Already Exists" Problem
If you run CREATE TABLE twice, you get an error.
Solution: Drop the table first if it exists.
cur.execute('DROP TABLE IF EXISTS Students')
cur.execute('''
CREATE TABLE Students (
id INTEGER PRIMARY KEY,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
age INTEGER,
email TEXT NOT NULL UNIQUE
)
''')
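If you would rather keep existing data than wipe it, SQLite also accepts an IF NOT EXISTS clause, which silently skips creation when the table is already there:
cur.execute('''
CREATE TABLE IF NOT EXISTS Students (
    id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    age INTEGER,
    email TEXT NOT NULL UNIQUE
)
''')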
Creating Tables with Foreign Keys
cur.execute('''DROP TABLE IF EXISTS Students''')
cur.execute('''
CREATE TABLE Students (
id INTEGER PRIMARY KEY,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
age INTEGER,
email TEXT NOT NULL UNIQUE
)
''')
cur.execute('''DROP TABLE IF EXISTS Student_courses''')
cur.execute('''
CREATE TABLE Student_courses (
id INTEGER PRIMARY KEY,
student_id INTEGER NOT NULL,
course_id INTEGER,
course_name TEXT NOT NULL,
FOREIGN KEY(student_id) REFERENCES Students(id)
)
''')
conn.commit()
Part 6: Inserting Data
Single Row
cur.execute('''
INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')
''')
What Happens If You Insert a Duplicate Primary Key?
cur.execute('''INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')''')
cur.execute('''INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')''')
# ERROR! id=1 already exists
Primary keys must be unique.
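In Python this failure surfaces as sqlite3.IntegrityError (the same exception covers NOT NULL and UNIQUE violations), so you can catch it:
try:
    cur.execute("INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')")
except sql.IntegrityError as error:
    print('Insert rejected:', error)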
Multiple Rows with executemany()
the_students = (
(1, 'John', 'Doe', 21, 'john@doe.com'),
(2, 'Alice', 'Doe', 22, 'alice@doe.com'),
(3, 'Rose', 'Short', 21, 'rose@short.com')
)
cur.executemany('''INSERT INTO Students VALUES(?, ?, ?, ?, ?)''', the_students)
The ? Placeholders
Each ? gets replaced by one value from your tuple.
# 5 columns = 5 question marks
cur.executemany('''INSERT INTO Students VALUES(?, ?, ?, ?, ?)''', the_students)
Why use ? instead of string formatting?
- Cleaner code
- Prevents SQL injection attacks
- Handles escaping automatically
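For a single row, execute() takes the same placeholders plus a tuple of values. As a sketch of the contrast (user_email is a made-up variable standing in for untrusted input):
user_email = 'rose@short.com'
# Safe: the value travels separately from the SQL text
cur.execute('SELECT * FROM Students WHERE email = ?', (user_email,))
# Unsafe: the input becomes part of the SQL itself, so a crafted
# string could change the meaning of the query
# cur.execute(f"SELECT * FROM Students WHERE email = '{user_email}'")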
Part 7: The Commit Rule
Critical: Changes are NOT saved until you call commit().
cur.execute("INSERT INTO Students VALUES (4, 'Diana', 'Smith', 20, 'diana@smith.com')")
# At this point, the data is only in memory
conn.commit() # NOW it's written to disk
If you close the connection without committing, all changes since the last commit are lost.
The Complete Pattern
# Make changes
cur.execute('INSERT ...')
cur.execute('UPDATE ...')
cur.execute('DELETE ...')
# Save to disk
conn.commit()
# Close when done
conn.close()
Part 8: Querying Data (SELECT)
Get All Rows
cur.execute('SELECT * FROM Students')
print(cur.fetchall())
Output:
[(1, 'John', 'Doe', 21, 'john@doe.com'),
(2, 'Alice', 'Doe', 22, 'alice@doe.com'),
(3, 'Rose', 'Short', 21, 'rose@short.com')]
fetchall() vs fetchone()
fetchall() returns a list of all rows:
cur.execute('SELECT * FROM Students')
all_rows = cur.fetchall() # List of tuples
fetchone() returns one row at a time:
cur.execute('SELECT * FROM Students')
first = cur.fetchone() # (1, 'John', 'Doe', 21, 'john@doe.com')
second = cur.fetchone() # (2, 'Alice', 'Doe', 22, 'alice@doe.com')
third = cur.fetchone() # (3, 'Rose', 'Short', 21, 'rose@short.com')
fourth = cur.fetchone() # None (no more rows)
Important: fetchall() Exhausts the Cursor
cur.execute('SELECT * FROM Students')
print(cur.fetchall()) # Returns all rows
print(cur.fetchall()) # Returns [] (empty list!)
Once you've fetched all rows, there's nothing left to fetch. You need to execute the query again.
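You can also iterate over the cursor itself, which fetches rows one at a time without building the full list in memory:
cur.execute('SELECT * FROM Students')
for row in cur:   # each row is a tuple
    print(row)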
Select Specific Columns
cur.execute('SELECT last_name, email FROM Students')
print(cur.fetchall())
# [('Doe', 'john@doe.com'), ('Doe', 'alice@doe.com'), ('Short', 'rose@short.com')]
Filter with WHERE
cur.execute('SELECT * FROM Students WHERE id=3')
print(cur.fetchall())
# [(3, 'Rose', 'Short', 21, 'rose@short.com')]
Pattern Matching with LIKE
# Emails ending with 'doe.com'
cur.execute("SELECT * FROM Students WHERE email LIKE '%doe.com'")
print(cur.fetchall())
# [(1, 'John', 'Doe', 21, 'john@doe.com'), (2, 'Alice', 'Doe', 22, 'alice@doe.com')]
Wildcards:
- % — any sequence of characters (including none)
- _ — exactly one character
Examples:
LIKE 'A%' # Starts with A
LIKE '%e' # Ends with e
LIKE '%li%' # Contains 'li'
LIKE '_ohn' # 4 characters ending in 'ohn' (John, Bohn, etc.)
Note: LIKE is case-insensitive in SQLite (by default, for ASCII characters only).
Part 9: Deleting Data
cur.execute('DELETE FROM Students WHERE id=1')
conn.commit()
Warning: Without WHERE, you delete everything:
cur.execute('DELETE FROM Students') # Deletes ALL rows!
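The same caution applies to UPDATE (used in the commit pattern above): without WHERE it rewrites every row. A minimal sketch:
# Change one student's age
cur.execute('UPDATE Students SET age = 22 WHERE id = 3')
conn.commit()
# Without WHERE, this would set EVERY student's age to 22:
# cur.execute('UPDATE Students SET age = 22')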
Part 10: Error Handling
The Proper Pattern
import sqlite3 as sql
conn = None  # so finally can check whether the connection was ever opened
try:
    conn = sql.connect('my_database.db')
    cur = conn.cursor()
    print("Connection successful")
    # Your database operations here
    cur.execute('SELECT * FROM Students')
    print(cur.fetchall())
    cur.close()  # Close cursor to free memory
except sql.Error as error:
    print("Error in SQLite:", error)
finally:
    if conn:
        conn.close()  # Always close connection, even if error occurred
Why use try/except/finally?
- Database operations can fail (file locked, disk full, etc.)
- finally ensures the connection is always closed
- Prevents resource leaks
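The sqlite3 connection also works as a context manager: with conn: commits if the block succeeds and rolls back if it raises. Note that it does not close the connection, so you still close it yourself (the row values below are illustrative):
conn = sql.connect('my_database.db')
try:
    with conn:  # commit on success, rollback on exception
        cur = conn.cursor()
        cur.execute("INSERT INTO Students VALUES (5, 'Eve', 'Long', 23, 'eve@long.com')")
finally:
    conn.close()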
Part 11: Pandas Integration
This is where SQLite becomes really useful for data analysis.
Read SQLite into DataFrame
import pandas as pd
import sqlite3 as sql
conn = sql.connect('gubbio_env_2018.sqlite')
df = pd.read_sql_query('SELECT * FROM gubbio', conn)
conn.close()
Now you have a DataFrame with all the Pandas functionality.
df.head()
df.info()
df.describe()
Filter in SQL vs Filter in Pandas
Option 1: Filter in SQL (better for large databases)
df = pd.read_sql_query('SELECT * FROM gubbio WHERE NO2 > 50', conn)
Only matching rows are loaded into memory.
Option 2: Load all, filter in Pandas
df = pd.read_sql_query('SELECT * FROM gubbio', conn)
df_filtered = df[df['NO2'] > 50]
Loads everything, then filters.
Use SQL filtering when:
- Database is large
- You only need a small subset
Use Pandas filtering when:
- Data fits in memory
- You need multiple different analyses
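The ? placeholders from Part 6 work here as well: pd.read_sql_query accepts a params argument, so thresholds need not be pasted into the SQL string:
threshold = 50
df = pd.read_sql_query('SELECT * FROM gubbio WHERE NO2 > ?', conn, params=(threshold,))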
Write DataFrame to SQLite
conn = sql.connect('output.sqlite')
df.to_sql('table_name', conn, if_exists='replace')
conn.close()
if_exists options:
- 'fail' — raise an error if the table exists (default)
- 'replace' — drop the table and recreate it
- 'append' — add rows to the existing table
Part 12: Data Cleaning Example (Gubbio Dataset)
The Dataset
Environmental monitoring data from Gubbio, Italy (2018):
- Columns: year, month, day, hour, NO2, O3, PM10, PM25
- Values are in µg/m³
- Problem: Missing/invalid readings are coded as -999
The Problem with -999 Values
df = pd.read_sql_query('SELECT * FROM gubbio', conn)
print(df['NO2'].mean()) # Wrong! Includes -999 values
The -999 values will drastically lower your mean.
Solution 1: Replace with 0 (for visualization only)
df.loc[df.NO2 < 0, 'NO2'] = 0
df.loc[df.O3 < 0, 'O3'] = 0
df.loc[df.PM10 < 0, 'PM10'] = 0
df.loc[df.PM25 < 0, 'PM25'] = 0
Good for plotting (no negative spikes), but bad for statistics — zeros still affect the mean.
Solution 2: Replace with NaN (for analysis)
import numpy as np
df.loc[df.NO2 < 0, 'NO2'] = np.nan
df.loc[df.O3 < 0, 'O3'] = np.nan
df.loc[df.PM10 < 0, 'PM10'] = np.nan
df.loc[df.PM25 < 0, 'PM25'] = np.nan
This is the correct approach. Pandas ignores NaN in calculations:
df['NO2'].mean() # Calculates mean of valid values only
Using loc[] to Find and Modify
Find rows matching condition:
# All rows where NO2 is negative
print(df.loc[df.NO2 < 0])
# Just the NO2 column where NO2 is negative
print(df.loc[df.NO2 < 0, 'NO2'])
Modify matching rows:
df.loc[df.NO2 < 0, 'NO2'] = np.nan
This reads: "For rows where NO2 < 0, set the NO2 column to NaN."
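If -999 is the only sentinel code in the data (as it is here), a more compact equivalent of the four loc[] lines is replace():
cols = ['NO2', 'O3', 'PM10', 'PM25']
df[cols] = df[cols].replace(-999, np.nan)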
Part 13: DateTime Handling
Creating DateTime from Components
The Gubbio dataset has separate year, month, day, hour columns. Combine them:
df['timerep'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
Result: A proper datetime column like 2018-01-01 00:00:00.
Setting DateTime as Index
df.set_index('timerep', inplace=True)
Now you can do time-based operations.
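For example, a DatetimeIndex supports partial-string slicing, selecting whole periods by label:
df.loc['2018-01']                   # all hourly rows from January 2018
df.loc['2018-06-01':'2018-06-07']   # the first week of June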
Check the Result
df.info()
You'll see DatetimeIndex instead of RangeIndex.
Part 14: Resampling (Time Aggregation)
What is Resampling?
Converting from higher frequency (hourly) to lower frequency (daily, monthly, yearly).
Basic Syntax
df.resample('D').mean() # Daily mean
df.resample('M').mean() # Monthly mean
df.resample('A').mean() # Annual mean
Resample Codes
| Code | Frequency |
|---|---|
| 'H' | Hourly |
| 'D' | Daily |
| 'W' | Weekly |
| 'M' | Monthly |
| 'A' | Annual/Yearly |
Note: newer pandas releases (2.2+) deprecate 'M' and 'A' in favor of 'ME' and 'YE'.
Examples
Daily mean of PM10, PM25, NO2:
df.resample('D').mean()[['PM10', 'PM25', 'NO2']]
Yearly mean:
df.resample('A').mean()[['PM10', 'PM25']]
Combining Resample with Query
Find days where PM10 exceeded 50 µg/m³ (WHO 24-hour limit):
df.resample('D').mean().query('PM10 > 50')[['PM10']]
This:
- Resamples to daily
- Computes the mean
- Filters to days where PM10 > 50
- Shows only the PM10 column
Find days where PM2.5 exceeded 24 µg/m³:
df.resample('D').mean().query('PM25 > 24')[['PM25']]
WHO Air Quality Limits
| Pollutant | Annual Limit | 24-Hour Limit |
|---|---|---|
| PM2.5 | 10 µg/m³ | 24 µg/m³ |
| PM10 | 20 µg/m³ | 50 µg/m³ |
Part 15: Saving and Loading with DateTime Index
The Problem
When you save a DataFrame with a datetime index to SQLite and read it back, the index might not be preserved correctly.
Wrong Way
# Save
df.to_sql('gubbio', conn, if_exists='replace')
# Load
df2 = pd.read_sql('SELECT * FROM gubbio', conn)
df2.plot(y=['NO2']) # X-axis is wrong!
Correct Way: Preserve the Index
Saving:
df.to_sql('gubbio', conn, if_exists='replace', index=True, index_label='timerep')
Loading:
df2 = pd.read_sql('SELECT * FROM gubbio', conn, index_col='timerep', parse_dates=['timerep'])
Parameters:
- index=True — save the index as a column
- index_label='timerep' — name the index column
- index_col='timerep' — use this column as the index when loading
- parse_dates=['timerep'] — parse it as datetime
Part 16: Complete Workflow
Typical pattern: Load → Clean → Analyze → Save
import pandas as pd
import sqlite3 as sql
import numpy as np
# 1. Connect and load
conn = sql.connect('gubbio_env_2018.sqlite')
df = pd.read_sql_query('SELECT * FROM gubbio', conn)
# 2. Clean bad values (replace -999 with NaN)
df.loc[df.NO2 < 0, 'NO2'] = np.nan
df.loc[df.O3 < 0, 'O3'] = np.nan
df.loc[df.PM10 < 0, 'PM10'] = np.nan
df.loc[df.PM25 < 0, 'PM25'] = np.nan
# 3. Create datetime index
df['timerep'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
df.set_index('timerep', inplace=True)
# 4. Analyze
# Daily averages
daily = df.resample('D').mean()[['PM10', 'PM25', 'NO2']]
# Days exceeding WHO PM10 limit
bad_pm10_days = df.resample('D').mean().query('PM10 > 50')[['PM10']]
print(f"Days PM10 > 50: {len(bad_pm10_days)}")
# Yearly average
yearly = df.resample('A').mean()[['PM10', 'PM25']]
print(yearly)
# 5. Plot
df.plot(y=['NO2'])
df.plot(y=['O3'])
# 6. Save results
df.to_sql('gubbio_clean', conn, if_exists='replace', index=True, index_label='timerep')
# 7. Close
conn.close()
SQL Commands Summary
| Command | Purpose | Example |
|---|---|---|
| CREATE TABLE | Define schema | CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT) |
| DROP TABLE | Delete table | DROP TABLE IF EXISTS Students |
| INSERT INTO | Add rows | INSERT INTO Students VALUES (1, 'Alice') |
| SELECT | Query data | SELECT * FROM Students WHERE age > 20 |
| DELETE | Remove rows | DELETE FROM Students WHERE id = 1 |
| LIKE | Pattern match | SELECT * FROM Students WHERE name LIKE 'A%' |
Python SQLite Summary
| Operation | Code |
|---|---|
| Connect | conn = sql.connect('file.db') |
| Get cursor | cur = conn.cursor() |
| Execute | cur.execute('SQL') |
| Execute many | cur.executemany('SQL', list_of_tuples) |
| Fetch one | cur.fetchone() |
| Fetch all | cur.fetchall() |
| Save changes | conn.commit() |
| Close cursor | cur.close() |
| Close connection | conn.close() |
Pandas + SQLite Summary
| Operation | Code |
|---|---|
| Read | pd.read_sql_query('SELECT...', conn) |
| Read with index | pd.read_sql_query('...', conn, index_col='col', parse_dates=['col']) |
| Write | df.to_sql('table', conn, if_exists='replace') |
| Write with index | df.to_sql('table', conn, if_exists='replace', index=True, index_label='name') |
Common Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| Forgot conn.commit() | Changes not saved | Always commit after INSERT/UPDATE/DELETE |
| Using == in SQL | Nonstandard; SQLite tolerates it, but most databases reject it | Use single = for equality |
| Replace -999 with 0 | Wrong statistics | Use np.nan instead |
| DELETE FROM table without WHERE | Deletes everything | Always specify condition |
| CREATE TABLE twice | Error | Use DROP TABLE IF EXISTS first |
| Wrong number of ? | Error | Must match column count |
| Not closing connection | Resource leak | Always conn.close() |
| fetchall() twice | Empty second result | Re-execute query or use fetchone() |
Quick Reference Card
import sqlite3 as sql
import pandas as pd
import numpy as np
# Connect
conn = sql.connect('database.db')
cur = conn.cursor()
# Create table
cur.execute('DROP TABLE IF EXISTS MyTable')
cur.execute('CREATE TABLE MyTable (id INTEGER PRIMARY KEY, value REAL)')
# Insert
cur.executemany('INSERT INTO MyTable VALUES (?, ?)', [(1, 10.5), (2, 20.3)])
conn.commit()
# Query
cur.execute('SELECT * FROM MyTable WHERE value > 15')
results = cur.fetchall()
# Load into Pandas
df = pd.read_sql_query('SELECT * FROM MyTable', conn)
# Clean data
df.loc[df.value < 0, 'value'] = np.nan
# Save back
df.to_sql('MyTable', conn, if_exists='replace', index=False)
# Close
conn.close()