SQLite

What is a Relational Database?

A relational database organizes data into tables. Each table has:

  • Rows (also called records) — individual entries
  • Columns (also called fields) — attributes of each entry

Tables can be linked together through common fields. This is the "relational" part.


Part 1: Core Concepts

Schema

The schema is the structure definition of your database:

  • Number of tables
  • Column names and data types
  • Constraints (what's allowed)

Critical: You must define the schema BEFORE you can store any data. This is a fixed structure — not flexible like a spreadsheet.

Primary Key

A column that uniquely identifies each row.

Rules:

  • Only ONE primary key per table
  • Values must be unique (no duplicates)
  • Cannot be NULL

Example: Student ID, Order Number, ISBN

Foreign Key

A column that references a primary key in another table.

Rules:

  • Can have multiple foreign keys in one table
  • Values don't need to be unique
  • Creates relationships between tables

Example: student_id in an Enrollments table references id in a Students table.

Why Use Multiple Tables?

Instead of repeating data:

# Bad: Student info repeated for each course
(1, 'Alice', 22, 'alice@unibo.it', 'BDB')
(2, 'Alice', 22, 'alice@unibo.it', 'BDP1')  # Alice duplicated!
(3, 'Bob', 23, 'bob@unibo.it', 'BDB')

Use two linked tables:

Students:
(1, 'Alice', 22, 'alice@unibo.it')
(2, 'Bob', 23, 'bob@unibo.it')

Enrollments:
(1, 1, 'BDB')    # student_id=1 (Alice)
(2, 1, 'BDP1')   # student_id=1 (Alice)
(3, 2, 'BDB')    # student_id=2 (Bob)

Benefits:

  • No data duplication
  • Update student info in one place
  • Smaller storage

Part 2: ACID Properties

A transaction is a unit of work — a set of operations that must either all succeed or all fail.

ACID guarantees for transactions:

Property     Meaning
Atomic       All operations complete, or none do
Consistent   Database goes from one valid state to another
Isolated     Transactions don't interfere with each other
Durable      Once committed, changes survive crashes/power failures

Relational databases provide ACID compliance. This is why banks use them.
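
Atomicity can be demonstrated directly from Python's built-in sqlite3 module. A minimal sketch (the accounts table and transfer amounts are made up for illustration): the two updates of a transfer are committed together, or both rolled back.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)')
cur.executemany('INSERT INTO accounts VALUES (?, ?)', [(1, 100.0), (2, 50.0)])
conn.commit()

# Transfer 30 from account 1 to account 2: both updates, or neither
try:
    cur.execute('UPDATE accounts SET balance = balance - 30 WHERE id = 1')
    cur.execute('UPDATE accounts SET balance = balance + 30 WHERE id = 2')
    conn.commit()          # both changes become permanent together
except sqlite3.Error:
    conn.rollback()        # undo every change since the last commit

balances = dict(cur.execute('SELECT id, balance FROM accounts'))
print(balances)  # {1: 70.0, 2: 80.0}
conn.close()
```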


Part 3: SQLite

What is SQLite?

SQLite is a relational database that lives in a single file. No server needed.

It's everywhere:

  • Every Android and iOS device
  • Windows 10/11, macOS
  • Firefox, Chrome, Safari
  • Estimated 1 trillion+ SQLite databases in active use

SQLite vs Traditional Databases

Traditional (PostgreSQL, MySQL)    SQLite
Separate server process            No server
Client connects via network        Direct file access
Multiple files                     Single file
Complex setup                      Zero configuration

Python Support

Python has built-in SQLite support. No installation needed:

import sqlite3

Part 4: Connecting to SQLite

Basic Connection

import sqlite3 as sql

# Connect to database (creates file if it doesn't exist)
conn = sql.connect('my_database.db')

# Get a cursor (your pointer into the database)
cur = conn.cursor()

After this, you'll have a file called my_database.db in your directory.

In-Memory Database

For testing or temporary work:

conn = sql.connect(':memory:')

Fast, but everything is lost when you close Python.
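
The ephemeral behaviour is easy to verify: every new ':memory:' connection is a brand-new, empty database. A small sketch (note that Connection.execute() is a shortcut that creates a cursor for you):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (x INTEGER)')   # execute() works on the connection too
conn.execute('INSERT INTO t VALUES (1)')
rows = conn.execute('SELECT * FROM t').fetchall()
print(rows)  # [(1,)]
conn.close()

# Reconnecting to ':memory:' gives a fresh database: no tables at all
conn2 = sqlite3.connect(':memory:')
tables = conn2.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)  # [] (the earlier table is gone)
conn2.close()
```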

What's a Cursor?

The cursor is how you execute commands and retrieve results. Think of it as your interface to the database.

cur.execute('SQL command here')

Part 5: Creating Tables

CREATE TABLE Syntax

cur.execute('''
    CREATE TABLE Students (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        age INTEGER,
        email TEXT NOT NULL UNIQUE
    )
''')

Data Types

Type       What it stores
INTEGER    Whole numbers
TEXT       Strings
REAL       Floating point numbers
BLOB       Binary data

Constraints

Constraint     Meaning
PRIMARY KEY    Unique identifier for each row
NOT NULL       Cannot be empty
UNIQUE         No duplicate values allowed

The "Table Already Exists" Problem

If you run CREATE TABLE twice, you get an error.

Solution: Drop the table first if it exists. (Alternatively, CREATE TABLE IF NOT EXISTS skips creation when the table already exists, keeping any existing data.)

cur.execute('DROP TABLE IF EXISTS Students')

cur.execute('''
    CREATE TABLE Students (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        age INTEGER,
        email TEXT NOT NULL UNIQUE
    )
''')

Creating Tables with Foreign Keys

cur.execute('''DROP TABLE IF EXISTS Students''')
cur.execute('''
    CREATE TABLE Students (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        age INTEGER,
        email TEXT NOT NULL UNIQUE
    )
''')

cur.execute('''DROP TABLE IF EXISTS Student_courses''')
cur.execute('''
    CREATE TABLE Student_courses (
        id INTEGER PRIMARY KEY,
        student_id INTEGER NOT NULL,
        course_id INTEGER,
        course_name TEXT NOT NULL,
        FOREIGN KEY(student_id) REFERENCES Students(id)
    )
''')

conn.commit()

Note: SQLite does not enforce FOREIGN KEY constraints by default. Enable enforcement per connection with cur.execute('PRAGMA foreign_keys = ON').

Part 6: Inserting Data

Single Row

cur.execute('''
    INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')
''')

What Happens If You Insert a Duplicate Primary Key?

cur.execute('''INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')''')
cur.execute('''INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')''')
# ERROR! id=1 already exists

Primary keys must be unique.
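
In Python, the duplicate surfaces as a sqlite3.IntegrityError, which you can catch. A minimal sketch:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT)')
cur.execute('INSERT INTO Students VALUES (?, ?)', (1, 'John'))

try:
    cur.execute('INSERT INTO Students VALUES (?, ?)', (1, 'John'))  # duplicate id
    duplicate_ok = True
except sqlite3.IntegrityError as err:
    duplicate_ok = False
    print('Rejected:', err)   # e.g. UNIQUE constraint failed: Students.id

conn.close()
```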

Multiple Rows with executemany()

the_students = (
    (1, 'John', 'Doe', 21, 'john@doe.com'),
    (2, 'Alice', 'Doe', 22, 'alice@doe.com'),
    (3, 'Rose', 'Short', 21, 'rose@short.com')
)

cur.executemany('''INSERT INTO Students VALUES(?, ?, ?, ?, ?)''', the_students)

The ? Placeholders

Each ? gets replaced by one value from your tuple.

# 5 columns = 5 question marks
cur.executemany('''INSERT INTO Students VALUES(?, ?, ?, ?, ?)''', the_students)

Why use ? instead of string formatting?

  1. Cleaner code
  2. Prevents SQL injection attacks
  3. Handles escaping automatically
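
The injection risk is easy to demonstrate: if you splice user input into the SQL string yourself, the input can rewrite the query, while a ? placeholder keeps it as plain data. A sketch with a made-up malicious input:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT)')
cur.executemany('INSERT INTO Students VALUES (?, ?)', [(1, 'Alice'), (2, 'Bob')])

user_input = "Alice' OR '1'='1"   # malicious "name"

# Unsafe: the input becomes part of the SQL and matches every row
unsafe = cur.execute(
    f"SELECT * FROM Students WHERE name = '{user_input}'").fetchall()
print(len(unsafe))   # 2 (both rows leak)

# Safe: the input is bound as a plain value; no student has this literal name
safe = cur.execute(
    'SELECT * FROM Students WHERE name = ?', (user_input,)).fetchall()
print(len(safe))     # 0

conn.close()
```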

Part 7: The Commit Rule

Critical: Changes are NOT saved until you call commit().

cur.execute('INSERT INTO Students VALUES (?, ?, ?, ?, ?)',
            (4, 'Diana', 'Smith', 20, 'diana@smith.com'))

# At this point, the data is only in memory

conn.commit()  # NOW it's written to disk

If you close the connection without committing, all changes since the last commit are lost.

The Complete Pattern

# Make changes
cur.execute('INSERT ...')
cur.execute('UPDATE ...')
cur.execute('DELETE ...')

# Save to disk
conn.commit()

# Close when done
conn.close()

Part 8: Querying Data (SELECT)

Get All Rows

cur.execute('SELECT * FROM Students')
print(cur.fetchall())

Output:

[(1, 'John', 'Doe', 21, 'john@doe.com'),
 (2, 'Alice', 'Doe', 22, 'alice@doe.com'),
 (3, 'Rose', 'Short', 21, 'rose@short.com')]

fetchall() vs fetchone()

fetchall() returns a list of all rows:

cur.execute('SELECT * FROM Students')
all_rows = cur.fetchall()  # List of tuples

fetchone() returns one row at a time:

cur.execute('SELECT * FROM Students')
first = cur.fetchone()   # (1, 'John', 'Doe', 21, 'john@doe.com')
second = cur.fetchone()  # (2, 'Alice', 'Doe', 22, 'alice@doe.com')
third = cur.fetchone()   # (3, 'Rose', 'Short', 21, 'rose@short.com')
fourth = cur.fetchone()  # None (no more rows)

Important: fetchall() Exhausts the Cursor

cur.execute('SELECT * FROM Students')
print(cur.fetchall())  # Returns all rows
print(cur.fetchall())  # Returns [] (empty list!)

Once you've fetched all rows, there's nothing left to fetch. You need to execute the query again.
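
When you only need one pass over the results, an alternative is to loop over the cursor itself: cursors are iterable and yield one row at a time without building a list (iterating also exhausts the cursor). A sketch:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT)')
cur.executemany('INSERT INTO Students VALUES (?, ?)',
                [(1, 'John'), (2, 'Alice'), (3, 'Rose')])

cur.execute('SELECT id, name FROM Students ORDER BY id')
names = []
for row in cur:                 # fetches rows one at a time
    names.append(row[1])
print(names)  # ['John', 'Alice', 'Rose']

conn.close()
```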

Select Specific Columns

cur.execute('SELECT last_name, email FROM Students')
print(cur.fetchall())
# [('Doe', 'john@doe.com'), ('Doe', 'alice@doe.com'), ('Short', 'rose@short.com')]

Filter with WHERE

cur.execute('SELECT * FROM Students WHERE id=3')
print(cur.fetchall())
# [(3, 'Rose', 'Short', 21, 'rose@short.com')]

Pattern Matching with LIKE

# Emails ending with 'doe.com'
cur.execute("SELECT * FROM Students WHERE email LIKE '%doe.com'")
print(cur.fetchall())
# [(1, 'John', 'Doe', 21, 'john@doe.com'), (2, 'Alice', 'Doe', 22, 'alice@doe.com')]

Wildcards:

  • % — any sequence of characters (including none)
  • _ — exactly one character

Examples:

LIKE 'A%'      # Starts with A
LIKE '%e'      # Ends with e
LIKE '%li%'    # Contains 'li'
LIKE '_ohn'    # 4 characters ending in 'ohn' (John, Bohn, etc.)

Note: LIKE is case-insensitive in SQLite by default (for ASCII characters only).


Part 9: Deleting Data

cur.execute('DELETE FROM Students WHERE id=1')
conn.commit()

Warning: Without WHERE, you delete everything:

cur.execute('DELETE FROM Students')  # Deletes ALL rows!

Part 10: Error Handling

The Proper Pattern

import sqlite3 as sql

try:
    conn = sql.connect('my_database.db')
    cur = conn.cursor()
    print("Connection successful")
    
    # Your database operations here
    cur.execute('SELECT * FROM Students')
    print(cur.fetchall())
    
    cur.close()  # Close cursor to free memory
    
except sql.Error as error:
    print("Error in SQLite:", error)
    
finally:
    conn.close()  # Always close connection, even if error occurred

Why use try/except/finally?

  • Database operations can fail (file locked, disk full, etc.)
  • finally ensures connection is always closed
  • Prevents resource leaks
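
A related idiom: the sqlite3 connection is also a context manager. A `with conn:` block wraps its statements in a transaction, committing on success and rolling back on an exception. Note that it does NOT close the connection; you still need conn.close(). A minimal sketch:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT)')
conn.commit()

try:
    with conn:   # opens a transaction; commits if the block succeeds
        conn.execute('INSERT INTO Students VALUES (?, ?)', (1, 'Alice'))
        conn.execute('INSERT INTO Students VALUES (?, ?)', (1, 'Bob'))  # duplicate id!
except sqlite3.IntegrityError:
    pass  # the with-block rolled back, so Alice was not inserted either

count = conn.execute('SELECT COUNT(*) FROM Students').fetchone()[0]
print(count)  # 0 (atomic: both inserts were undone)
conn.close()
```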

Part 11: Pandas Integration

This is where SQLite becomes really useful for data analysis.

Read SQLite into DataFrame

import pandas as pd
import sqlite3 as sql

conn = sql.connect('gubbio_env_2018.sqlite')
df = pd.read_sql_query('SELECT * FROM gubbio', conn)
conn.close()

Now you have a DataFrame with all the Pandas functionality.

df.head()
df.info()
df.describe()

Filter in SQL vs Filter in Pandas

Option 1: Filter in SQL (better for large databases)

df = pd.read_sql_query('SELECT * FROM gubbio WHERE NO2 > 50', conn)

Only matching rows are loaded into memory.

Option 2: Load all, filter in Pandas

df = pd.read_sql_query('SELECT * FROM gubbio', conn)
df_filtered = df[df['NO2'] > 50]

Loads everything, then filters.

Use SQL filtering when:

  • Database is large
  • You only need a small subset

Use Pandas filtering when:

  • Data fits in memory
  • You need multiple different analyses

Write DataFrame to SQLite

conn = sql.connect('output.sqlite')
df.to_sql('table_name', conn, if_exists='replace')
conn.close()

if_exists options:

  • 'fail' — raise error if table exists (default)
  • 'replace' — drop table and recreate
  • 'append' — add rows to existing table

Part 12: Data Cleaning Example (Gubbio Dataset)

The Dataset

Environmental monitoring data from Gubbio, Italy (2018):

  • Columns: year, month, day, hour, NO2, O3, PM10, PM25
  • Values are in µg/m³
  • Problem: Missing/invalid readings are coded as -999

The Problem with -999 Values

df = pd.read_sql_query('SELECT * FROM gubbio', conn)
print(df['NO2'].mean())  # Wrong! Includes -999 values

The -999 values will drastically lower your mean.

Solution 1: Replace with 0 (for visualization only)

df.loc[df.NO2 < 0, 'NO2'] = 0
df.loc[df.O3 < 0, 'O3'] = 0
df.loc[df.PM10 < 0, 'PM10'] = 0
df.loc[df.PM25 < 0, 'PM25'] = 0

Good for plotting (no negative spikes), but bad for statistics — zeros still affect the mean.

Solution 2: Replace with NaN (for analysis)

import numpy as np

df.loc[df.NO2 < 0, 'NO2'] = np.nan
df.loc[df.O3 < 0, 'O3'] = np.nan
df.loc[df.PM10 < 0, 'PM10'] = np.nan
df.loc[df.PM25 < 0, 'PM25'] = np.nan

This is the correct approach. Pandas ignores NaN in calculations:

df['NO2'].mean()  # Calculates mean of valid values only
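
The four nearly identical replacement lines can be collapsed into a loop over the pollutant columns. A sketch with a tiny made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Tiny made-up sample: -999 marks invalid readings
df = pd.DataFrame({
    'NO2':  [20.0, -999.0, 40.0],
    'O3':   [60.0, 80.0, -999.0],
    'PM10': [-999.0, 30.0, 50.0],
    'PM25': [10.0, 20.0, 30.0],
})

# Same effect as the four separate lines, one column at a time
for col in ['NO2', 'O3', 'PM10', 'PM25']:
    df.loc[df[col] < 0, col] = np.nan

print(df['NO2'].mean())  # 30.0 (mean of 20 and 40; NaN ignored)
```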

Using loc[] to Find and Modify

Find rows matching condition:

# All rows where NO2 is negative
print(df.loc[df.NO2 < 0])

# Just the NO2 column where NO2 is negative
print(df.loc[df.NO2 < 0, 'NO2'])

Modify matching rows:

df.loc[df.NO2 < 0, 'NO2'] = np.nan

This reads: "For rows where NO2 < 0, set the NO2 column to NaN."


Part 13: DateTime Handling

Creating DateTime from Components

The Gubbio dataset has separate year, month, day, hour columns. Combine them:

df['timerep'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])

Result: A proper datetime column like 2018-01-01 00:00:00.

Setting DateTime as Index

df.set_index('timerep', inplace=True)

Now you can do time-based operations.

Check the Result

df.info()

You'll see DatetimeIndex instead of RangeIndex.


Part 14: Resampling (Time Aggregation)

What is Resampling?

Converting from higher frequency (hourly) to lower frequency (daily, monthly, yearly).

Basic Syntax

df.resample('D').mean()  # Daily mean
df.resample('M').mean()  # Monthly mean
df.resample('A').mean()  # Annual mean

Resample Codes

Code    Frequency
'H'     Hourly
'D'     Daily
'W'     Weekly
'M'     Monthly
'A'     Annual/Yearly

Note: in pandas 2.2+, the 'H', 'M', and 'A' aliases are deprecated in favor of 'h', 'ME', and 'YE'; 'D' and 'W' are unchanged.
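
A self-contained demo with synthetic hourly data (two days, constant readings per day, so the daily means are easy to check by hand):

```python
import pandas as pd

# 48 hours of synthetic readings: 10.0 all of day 1, 30.0 all of day 2
idx = pd.date_range('2018-01-01', periods=48, freq='h')
df = pd.DataFrame({'PM10': [10.0] * 24 + [30.0] * 24}, index=idx)

daily = df.resample('D').mean()   # 48 hourly rows collapse to 2 daily rows
print(daily)
#             PM10
# 2018-01-01  10.0
# 2018-01-02  30.0
```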

Examples

Daily mean of PM10, PM25, NO2:

df.resample('D').mean()[['PM10', 'PM25', 'NO2']]

Yearly mean:

df.resample('A').mean()[['PM10', 'PM25']]

Combining Resample with Query

Find days where PM10 exceeded 50 µg/m³ (WHO 24-hour limit):

df.resample('D').mean().query('PM10 > 50')[['PM10']]

This:

  1. Resamples to daily
  2. Computes the mean
  3. Filters to days where PM10 > 50
  4. Shows only the PM10 column

Find days where PM2.5 exceeded 24 µg/m³:

df.resample('D').mean().query('PM25 > 24')[['PM25']]

WHO Air Quality Limits

Pollutant    Annual Limit    24-Hour Limit
PM2.5        10 µg/m³        24 µg/m³
PM10         20 µg/m³        50 µg/m³

Part 15: Saving and Loading with DateTime Index

The Problem

When you save a DataFrame with a datetime index to SQLite and read it back, the index might not be preserved correctly.

Wrong Way

# Save
df.to_sql('gubbio', conn, if_exists='replace')

# Load
df2 = pd.read_sql('SELECT * FROM gubbio', conn)
df2.plot(y=['NO2'])  # X-axis is wrong!

Correct Way: Preserve the Index

Saving:

df.to_sql('gubbio', conn, if_exists='replace', index=True, index_label='timerep')

Loading:

df2 = pd.read_sql('SELECT * FROM gubbio', conn, index_col='timerep', parse_dates=['timerep'])

Parameters:

  • index=True — save the index as a column
  • index_label='timerep' — name the index column
  • index_col='timerep' — use this column as index when loading
  • parse_dates=['timerep'] — parse as datetime
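
A compact round-trip check with made-up data (the table name roundtrip is arbitrary): save with the index as a labeled column, read it back, and confirm the datetime index survives.

```python
import sqlite3
import pandas as pd

# Made-up hourly data with a datetime index
idx = pd.date_range('2018-01-01', periods=3, freq='h')
df = pd.DataFrame({'NO2': [10.0, 20.0, 30.0]}, index=idx)
df.index.name = 'timerep'

conn = sqlite3.connect(':memory:')
df.to_sql('roundtrip', conn, if_exists='replace', index=True, index_label='timerep')

# SQLite stores the timestamps as text; parse_dates turns them back into datetimes
df2 = pd.read_sql('SELECT * FROM roundtrip', conn,
                  index_col='timerep', parse_dates=['timerep'])
conn.close()

print(type(df2.index).__name__)  # DatetimeIndex
print(df2['NO2'].tolist())       # [10.0, 20.0, 30.0]
```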

Part 16: Complete Workflow

Typical pattern: Load → Clean → Analyze → Save

import pandas as pd
import sqlite3 as sql
import numpy as np

# 1. Connect and load
conn = sql.connect('gubbio_env_2018.sqlite')
df = pd.read_sql_query('SELECT * FROM gubbio', conn)

# 2. Clean bad values (replace -999 with NaN)
df.loc[df.NO2 < 0, 'NO2'] = np.nan
df.loc[df.O3 < 0, 'O3'] = np.nan
df.loc[df.PM10 < 0, 'PM10'] = np.nan
df.loc[df.PM25 < 0, 'PM25'] = np.nan

# 3. Create datetime index
df['timerep'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
df.set_index('timerep', inplace=True)

# 4. Analyze
# Daily averages
daily = df.resample('D').mean()[['PM10', 'PM25', 'NO2']]

# Days exceeding WHO PM10 limit
bad_pm10_days = df.resample('D').mean().query('PM10 > 50')[['PM10']]
print(f"Days PM10 > 50: {len(bad_pm10_days)}")

# Yearly average
yearly = df.resample('A').mean()[['PM10', 'PM25']]
print(yearly)

# 5. Plot
df.plot(y=['NO2'])
df.plot(y=['O3'])

# 6. Save results
df.to_sql('gubbio_clean', conn, if_exists='replace', index=True, index_label='timerep')

# 7. Close
conn.close()

SQL Commands Summary

Command        Purpose         Example
CREATE TABLE   Define schema   CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT)
DROP TABLE     Delete table    DROP TABLE IF EXISTS Students
INSERT INTO    Add rows        INSERT INTO Students VALUES (1, 'Alice')
SELECT         Query data      SELECT * FROM Students WHERE age > 20
DELETE         Remove rows     DELETE FROM Students WHERE id = 1
LIKE           Pattern match   SELECT * FROM Students WHERE name LIKE 'A%'

Python SQLite Summary

Operation          Code
Connect            conn = sql.connect('file.db')
Get cursor         cur = conn.cursor()
Execute            cur.execute('SQL')
Execute many       cur.executemany('SQL', list_of_tuples)
Fetch one          cur.fetchone()
Fetch all          cur.fetchall()
Save changes       conn.commit()
Close cursor       cur.close()
Close connection   conn.close()

Pandas + SQLite Summary

Operation          Code
Read               pd.read_sql_query('SELECT...', conn)
Read with index    pd.read_sql_query('...', conn, index_col='col', parse_dates=['col'])
Write              df.to_sql('table', conn, if_exists='replace')
Write with index   df.to_sql('table', conn, if_exists='replace', index=True, index_label='name')

Common Mistakes

Mistake                            Problem                                          Fix
Forgot conn.commit()               Changes not saved                                Always commit after INSERT/UPDATE/DELETE
Using == in SQL                    Not portable (SQLite allows it; most don't)      Use single = for equality
Replace -999 with 0                Wrong statistics                                 Use np.nan instead
DELETE FROM table without WHERE    Deletes everything                               Always specify condition
CREATE TABLE twice                 Error                                            Use DROP TABLE IF EXISTS first
Wrong number of ?                  Error                                            Must match column count
Not closing connection             Resource leak                                    Always conn.close()
fetchall() twice                   Empty second result                              Re-execute query or use fetchone()

Quick Reference Card

import sqlite3 as sql
import pandas as pd
import numpy as np

# Connect
conn = sql.connect('database.db')
cur = conn.cursor()

# Create table
cur.execute('DROP TABLE IF EXISTS MyTable')
cur.execute('CREATE TABLE MyTable (id INTEGER PRIMARY KEY, value REAL)')

# Insert
cur.executemany('INSERT INTO MyTable VALUES (?, ?)', [(1, 10.5), (2, 20.3)])
conn.commit()

# Query
cur.execute('SELECT * FROM MyTable WHERE value > 15')
results = cur.fetchall()

# Load into Pandas
df = pd.read_sql_query('SELECT * FROM MyTable', conn)

# Clean data
df.loc[df.value < 0, 'value'] = np.nan

# Save back
df.to_sql('MyTable', conn, if_exists='replace', index=False)

# Close
conn.close()