SQLite
What is a Relational Database?
A relational database organizes data into tables. Each table has:
- Rows (also called records) — individual entries
- Columns (also called fields) — attributes of each entry
Tables can be linked together through common fields. This is the "relational" part.
Part 1: Core Concepts
Schema
The schema is the structure definition of your database:
- Number of tables
- Column names and data types
- Constraints (what's allowed)
Critical: You must define the schema BEFORE you can store any data. This is a fixed structure — not flexible like a spreadsheet.
Primary Key
A column that uniquely identifies each row.
Rules:
- Only ONE primary key per table
- Values must be unique (no duplicates)
- Cannot be NULL
Example: Student ID, Order Number, ISBN
Foreign Key
A column that references a primary key in another table.
Rules:
- Can have multiple foreign keys in one table
- Values don't need to be unique
- Creates relationships between tables
Example: student_id in an Enrollments table references id in a Students table.
Why Use Multiple Tables?
Instead of repeating data:
# Bad: Student info repeated for each course
(1, 'Alice', 22, 'alice@unibo.it', 'BDB')
(2, 'Alice', 22, 'alice@unibo.it', 'BDP1') # Alice duplicated!
(3, 'Bob', 23, 'bob@unibo.it', 'BDB')
Use two linked tables:
Students:
(1, 'Alice', 22, 'alice@unibo.it')
(2, 'Bob', 23, 'bob@unibo.it')
Enrollments:
(1, 1, 'BDB') # student_id=1 (Alice)
(2, 1, 'BDP1') # student_id=1 (Alice)
(3, 2, 'BDB') # student_id=2 (Bob)
Benefits:
- No data duplication
- Update student info in one place
- Smaller storage
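To query across linked tables, SQL provides JOIN. A minimal sketch (the column names here are illustrative, since the conceptual tables above only show values):
SELECT Students.name, Enrollments.course
FROM Students
JOIN Enrollments ON Enrollments.student_id = Students.id
This returns one row per enrollment, with each student's data pulled in from Students instead of duplicated.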
Part 2: ACID Properties
A transaction is a unit of work — a set of operations that must either all succeed or all fail.
ACID guarantees for transactions:
| Property | Meaning |
|---|---|
| Atomic | All operations complete, or none do |
| Consistent | Database goes from one valid state to another |
| Isolated | Transactions don't interfere with each other |
| Durable | Once committed, changes survive crashes/power failures |
Relational databases provide ACID compliance. This is why banks use them.
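In Python's sqlite3 module (introduced below), atomicity maps onto commit() and rollback(). A minimal sketch of a bank-style transfer, using an illustrative Accounts table:
import sqlite3 as sql
conn = sql.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Accounts (id INTEGER PRIMARY KEY, balance REAL)')
cur.executemany('INSERT INTO Accounts VALUES (?, ?)', [(1, 100.0), (2, 50.0)])
conn.commit()
try:
    # Both updates must succeed together, or neither should apply
    cur.execute('UPDATE Accounts SET balance = balance - 30 WHERE id = 1')
    cur.execute('UPDATE Accounts SET balance = balance + 30 WHERE id = 2')
    conn.commit()    # make both changes permanent at once
except sql.Error:
    conn.rollback()  # undo all uncommitted changes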
Part 3: SQLite
What is SQLite?
SQLite is a relational database that lives in a single file. No server needed.
It's everywhere:
- Every Android and iOS device
- Windows 10/11, macOS
- Firefox, Chrome, Safari
- Estimated 1 trillion+ SQLite databases in active use
SQLite vs Traditional Databases
| Traditional (PostgreSQL, MySQL) | SQLite |
|---|---|
| Separate server process | No server |
| Client connects via network | Direct file access |
| Multiple files | Single file |
| Complex setup | Zero configuration |
Python Support
Python has built-in SQLite support. No installation needed:
import sqlite3
Part 4: Connecting to SQLite
Basic Connection
import sqlite3 as sql
# Connect to database (creates file if it doesn't exist)
conn = sql.connect('my_database.db')
# Get a cursor (your pointer into the database)
cur = conn.cursor()
After this, you'll have a file called my_database.db in your directory.
In-Memory Database
For testing or temporary work:
conn = sql.connect(':memory:')
Fast, but everything is lost when you close Python.
What's a Cursor?
The cursor is how you execute commands and retrieve results. Think of it as your interface to the database.
cur.execute('SQL command here')
Part 5: Creating Tables
CREATE TABLE Syntax
cur.execute('''
CREATE TABLE Students (
id INTEGER PRIMARY KEY,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
age INTEGER,
email TEXT NOT NULL UNIQUE
)
''')
Data Types
| Type | What it stores |
|---|---|
| INTEGER | Whole numbers |
| TEXT | Strings |
| REAL | Floating point numbers |
| BLOB | Binary data |
Constraints
| Constraint | Meaning |
|---|---|
| PRIMARY KEY | Unique identifier for each row |
| NOT NULL | Cannot be empty |
| UNIQUE | No duplicate values allowed |
The "Table Already Exists" Problem
If you run CREATE TABLE twice, you get an error.
Solution: Drop the table first if it exists.
cur.execute('DROP TABLE IF EXISTS Students')
cur.execute('''
CREATE TABLE Students (
id INTEGER PRIMARY KEY,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
age INTEGER,
email TEXT NOT NULL UNIQUE
)
''')
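If you would rather keep existing data than wipe it, SQLite also accepts an IF NOT EXISTS clause, which silently skips creation when the table is already there:
cur.execute('''
CREATE TABLE IF NOT EXISTS Students (
    id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    age INTEGER,
    email TEXT NOT NULL UNIQUE
)
''')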
Creating Tables with Foreign Keys
cur.execute('''DROP TABLE IF EXISTS Students''')
cur.execute('''
CREATE TABLE Students (
id INTEGER PRIMARY KEY,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
age INTEGER,
email TEXT NOT NULL UNIQUE
)
''')
cur.execute('''DROP TABLE IF EXISTS Student_courses''')
cur.execute('''
CREATE TABLE Student_courses (
id INTEGER PRIMARY KEY,
student_id INTEGER NOT NULL,
course_id INTEGER,
course_name TEXT NOT NULL,
FOREIGN KEY(student_id) REFERENCES Students(id)
)
''')
conn.commit()
Part 6: Inserting Data
Single Row
cur.execute('''
INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')
''')
What Happens If You Insert a Duplicate Primary Key?
cur.execute('''INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')''')
cur.execute('''INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')''')
# ERROR! id=1 already exists
Primary keys must be unique.
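In Python this failure surfaces as sqlite3.IntegrityError (the same exception covers NOT NULL and UNIQUE violations), so you can catch it:
try:
    cur.execute("INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')")
except sql.IntegrityError as error:
    print('Insert rejected:', error)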
Multiple Rows with executemany()
the_students = (
(1, 'John', 'Doe', 21, 'john@doe.com'),
(2, 'Alice', 'Doe', 22, 'alice@doe.com'),
(3, 'Rose', 'Short', 21, 'rose@short.com')
)
cur.executemany('''INSERT INTO Students VALUES(?, ?, ?, ?, ?)''', the_students)
The ? Placeholders
Each ? gets replaced by one value from your tuple.
# 5 columns = 5 question marks
cur.executemany('''INSERT INTO Students VALUES(?, ?, ?, ?, ?)''', the_students)
Why use ? instead of string formatting?
- Cleaner code
- Prevents SQL injection attacks
- Handles escaping automatically
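For a single row, execute() takes the same placeholders plus a tuple of values. As a sketch of the contrast (user_email is a made-up variable standing in for untrusted input):
user_email = 'rose@short.com'
# Safe: the value travels separately from the SQL text
cur.execute('SELECT * FROM Students WHERE email = ?', (user_email,))
# Unsafe: the input becomes part of the SQL itself, so a crafted
# string could change the meaning of the query
# cur.execute(f"SELECT * FROM Students WHERE email = '{user_email}'")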
Part 7: The Commit Rule
Critical: Changes are NOT saved until you call commit().
cur.execute("INSERT INTO Students VALUES (4, 'Diana', 'Smith', 20, 'diana@smith.com')")
# At this point, the data is only in memory
conn.commit() # NOW it's written to disk
If you close the connection without committing, all changes since the last commit are lost.
The Complete Pattern
# Make changes
cur.execute('INSERT ...')
cur.execute('UPDATE ...')
cur.execute('DELETE ...')
# Save to disk
conn.commit()
# Close when done
conn.close()
Part 8: Querying Data (SELECT)
Get All Rows
cur.execute('SELECT * FROM Students')
print(cur.fetchall())
Output:
[(1, 'John', 'Doe', 21, 'john@doe.com'),
(2, 'Alice', 'Doe', 22, 'alice@doe.com'),
(3, 'Rose', 'Short', 21, 'rose@short.com')]
fetchall() vs fetchone()
fetchall() returns a list of all rows:
cur.execute('SELECT * FROM Students')
all_rows = cur.fetchall() # List of tuples
fetchone() returns one row at a time:
cur.execute('SELECT * FROM Students')
first = cur.fetchone() # (1, 'John', 'Doe', 21, 'john@doe.com')
second = cur.fetchone() # (2, 'Alice', 'Doe', 22, 'alice@doe.com')
third = cur.fetchone() # (3, 'Rose', 'Short', 21, 'rose@short.com')
fourth = cur.fetchone() # None (no more rows)
Important: fetchall() Exhausts the Cursor
cur.execute('SELECT * FROM Students')
print(cur.fetchall()) # Returns all rows
print(cur.fetchall()) # Returns [] (empty list!)
Once you've fetched all rows, there's nothing left to fetch. You need to execute the query again.
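You can also iterate over the cursor itself, which fetches rows one at a time without building the full list in memory:
cur.execute('SELECT * FROM Students')
for row in cur:   # each row is a tuple
    print(row)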
Select Specific Columns
cur.execute('SELECT last_name, email FROM Students')
print(cur.fetchall())
# [('Doe', 'john@doe.com'), ('Doe', 'alice@doe.com'), ('Short', 'rose@short.com')]
Filter with WHERE
cur.execute('SELECT * FROM Students WHERE id=3')
print(cur.fetchall())
# [(3, 'Rose', 'Short', 21, 'rose@short.com')]
Pattern Matching with LIKE
# Emails ending with 'doe.com'
cur.execute("SELECT * FROM Students WHERE email LIKE '%doe.com'")
print(cur.fetchall())
# [(1, 'John', 'Doe', 21, 'john@doe.com'), (2, 'Alice', 'Doe', 22, 'alice@doe.com')]
Wildcards:
- % — any sequence of characters (including none)
- _ — exactly one character
Examples:
LIKE 'A%' # Starts with A
LIKE '%e' # Ends with e
LIKE '%li%' # Contains 'li'
LIKE '_ohn' # 4 characters ending in 'ohn' (John, Bohn, etc.)
Note: LIKE is case-insensitive in SQLite (by default, for ASCII characters only).
Part 9: Deleting Data
cur.execute('DELETE FROM Students WHERE id=1')
conn.commit()
Warning: Without WHERE, you delete everything:
cur.execute('DELETE FROM Students') # Deletes ALL rows!
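The same caution applies to UPDATE (used in the commit pattern above): without WHERE it rewrites every row. A minimal sketch:
# Change one student's age
cur.execute('UPDATE Students SET age = 22 WHERE id = 3')
conn.commit()
# Without WHERE, this would set EVERY student's age to 22:
# cur.execute('UPDATE Students SET age = 22')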
Part 10: Error Handling
The Proper Pattern
import sqlite3 as sql
conn = None  # so finally can check whether the connection was ever opened
try:
    conn = sql.connect('my_database.db')
    cur = conn.cursor()
    print("Connection successful")
    # Your database operations here
    cur.execute('SELECT * FROM Students')
    print(cur.fetchall())
    cur.close()  # Close cursor to free memory
except sql.Error as error:
    print("Error in SQLite:", error)
finally:
    if conn:
        conn.close()  # Always close connection, even if error occurred
Why use try/except/finally?
- Database operations can fail (file locked, disk full, etc.)
- finally ensures the connection is always closed
- Prevents resource leaks
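The sqlite3 connection also works as a context manager: with conn: commits if the block succeeds and rolls back if it raises. Note that it does not close the connection, so you still close it yourself (the row values below are illustrative):
conn = sql.connect('my_database.db')
try:
    with conn:  # commit on success, rollback on exception
        cur = conn.cursor()
        cur.execute("INSERT INTO Students VALUES (5, 'Eve', 'Long', 23, 'eve@long.com')")
finally:
    conn.close()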
Part 11: Pandas Integration
This is where SQLite becomes really useful for data analysis.
Read SQLite into DataFrame
import pandas as pd
import sqlite3 as sql
conn = sql.connect('gubbio_env_2018.sqlite')
df = pd.read_sql_query('SELECT * FROM gubbio', conn)
conn.close()
Now you have a DataFrame with all the Pandas functionality.
df.head()
df.info()
df.describe()
Filter in SQL vs Filter in Pandas
Option 1: Filter in SQL (better for large databases)
df = pd.read_sql_query('SELECT * FROM gubbio WHERE NO2 > 50', conn)
Only matching rows are loaded into memory.
Option 2: Load all, filter in Pandas
df = pd.read_sql_query('SELECT * FROM gubbio', conn)
df_filtered = df[df['NO2'] > 50]
Loads everything, then filters.
Use SQL filtering when:
- Database is large
- You only need a small subset
Use Pandas filtering when:
- Data fits in memory
- You need multiple different analyses
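The ? placeholders from Part 6 work here as well: pd.read_sql_query accepts a params argument, so thresholds need not be pasted into the SQL string:
threshold = 50
df = pd.read_sql_query('SELECT * FROM gubbio WHERE NO2 > ?', conn, params=(threshold,))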
Write DataFrame to SQLite
conn = sql.connect('output.sqlite')
df.to_sql('table_name', conn, if_exists='replace')
conn.close()
if_exists options:
- 'fail' — raise an error if the table exists (default)
- 'replace' — drop the table and recreate it
- 'append' — add rows to the existing table
Part 12: Data Cleaning Example (Gubbio Dataset)
The Dataset
Environmental monitoring data from Gubbio, Italy (2018):
- Columns: year, month, day, hour, NO2, O3, PM10, PM25
- Values are in µg/m³
- Problem: Missing/invalid readings are coded as -999
The Problem with -999 Values
df = pd.read_sql_query('SELECT * FROM gubbio', conn)
print(df['NO2'].mean()) # Wrong! Includes -999 values
The -999 values will drastically lower your mean.
Solution 1: Replace with 0 (for visualization only)
df.loc[df.NO2 < 0, 'NO2'] = 0
df.loc[df.O3 < 0, 'O3'] = 0
df.loc[df.PM10 < 0, 'PM10'] = 0
df.loc[df.PM25 < 0, 'PM25'] = 0
Good for plotting (no negative spikes), but bad for statistics — zeros still affect the mean.
Solution 2: Replace with NaN (for analysis)
import numpy as np
df.loc[df.NO2 < 0, 'NO2'] = np.nan
df.loc[df.O3 < 0, 'O3'] = np.nan
df.loc[df.PM10 < 0, 'PM10'] = np.nan
df.loc[df.PM25 < 0, 'PM25'] = np.nan
This is the correct approach. Pandas ignores NaN in calculations:
df['NO2'].mean() # Calculates mean of valid values only
Using loc[] to Find and Modify
Find rows matching condition:
# All rows where NO2 is negative
print(df.loc[df.NO2 < 0])
# Just the NO2 column where NO2 is negative
print(df.loc[df.NO2 < 0, 'NO2'])
Modify matching rows:
df.loc[df.NO2 < 0, 'NO2'] = np.nan
This reads: "For rows where NO2 < 0, set the NO2 column to NaN."
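If -999 is the only sentinel code in the data (as it is here), a more compact equivalent of the four loc[] lines is replace():
cols = ['NO2', 'O3', 'PM10', 'PM25']
df[cols] = df[cols].replace(-999, np.nan)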
Part 13: DateTime Handling
Creating DateTime from Components
The Gubbio dataset has separate year, month, day, hour columns. Combine them:
df['timerep'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
Result: A proper datetime column like 2018-01-01 00:00:00.
Setting DateTime as Index
df.set_index('timerep', inplace=True)
Now you can do time-based operations.
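For example, a DatetimeIndex supports partial-string slicing, selecting whole periods by label:
df.loc['2018-01']                   # all hourly rows from January 2018
df.loc['2018-06-01':'2018-06-07']   # the first week of June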
Check the Result
df.info()
You'll see DatetimeIndex instead of RangeIndex.
Part 14: Resampling (Time Aggregation)
What is Resampling?
Converting from higher frequency (hourly) to lower frequency (daily, monthly, yearly).
Basic Syntax
df.resample('D').mean() # Daily mean
df.resample('M').mean() # Monthly mean
df.resample('A').mean() # Annual mean
Resample Codes
| Code | Frequency |
|---|---|
| 'H' | Hourly |
| 'D' | Daily |
| 'W' | Weekly |
| 'M' | Monthly |
| 'A' | Annual/Yearly |
Note: newer pandas releases (2.2+) deprecate 'M' and 'A' in favor of 'ME' and 'YE'.
Examples
Daily mean of PM10, PM25, NO2:
df.resample('D').mean()[['PM10', 'PM25', 'NO2']]
Yearly mean:
df.resample('A').mean()[['PM10', 'PM25']]
Combining Resample with Query
Find days where PM10 exceeded 50 µg/m³ (WHO 24-hour limit):
df.resample('D').mean().query('PM10 > 50')[['PM10']]
This:
- Resamples to daily
- Computes the mean
- Filters to days where PM10 > 50
- Shows only the PM10 column
Find days where PM2.5 exceeded 24 µg/m³:
df.resample('D').mean().query('PM25 > 24')[['PM25']]
WHO Air Quality Limits
| Pollutant | Annual Limit | 24-Hour Limit |
|---|---|---|
| PM2.5 | 10 µg/m³ | 24 µg/m³ |
| PM10 | 20 µg/m³ | 50 µg/m³ |
Part 15: Saving and Loading with DateTime Index
The Problem
When you save a DataFrame with a datetime index to SQLite and read it back, the index might not be preserved correctly.
Wrong Way
# Save
df.to_sql('gubbio', conn, if_exists='replace')
# Load
df2 = pd.read_sql('SELECT * FROM gubbio', conn)
df2.plot(y=['NO2']) # X-axis is wrong!
Correct Way: Preserve the Index
Saving:
df.to_sql('gubbio', conn, if_exists='replace', index=True, index_label='timerep')
Loading:
df2 = pd.read_sql('SELECT * FROM gubbio', conn, index_col='timerep', parse_dates=['timerep'])
Parameters:
- index=True — save the index as a column
- index_label='timerep' — name the index column
- index_col='timerep' — use this column as the index when loading
- parse_dates=['timerep'] — parse it as datetime
Part 16: Complete Workflow
Typical pattern: Load → Clean → Analyze → Save
import pandas as pd
import sqlite3 as sql
import numpy as np
# 1. Connect and load
conn = sql.connect('gubbio_env_2018.sqlite')
df = pd.read_sql_query('SELECT * FROM gubbio', conn)
# 2. Clean bad values (replace -999 with NaN)
df.loc[df.NO2 < 0, 'NO2'] = np.nan
df.loc[df.O3 < 0, 'O3'] = np.nan
df.loc[df.PM10 < 0, 'PM10'] = np.nan
df.loc[df.PM25 < 0, 'PM25'] = np.nan
# 3. Create datetime index
df['timerep'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
df.set_index('timerep', inplace=True)
# 4. Analyze
# Daily averages
daily = df.resample('D').mean()[['PM10', 'PM25', 'NO2']]
# Days exceeding WHO PM10 limit
bad_pm10_days = df.resample('D').mean().query('PM10 > 50')[['PM10']]
print(f"Days PM10 > 50: {len(bad_pm10_days)}")
# Yearly average
yearly = df.resample('A').mean()[['PM10', 'PM25']]
print(yearly)
# 5. Plot
df.plot(y=['NO2'])
df.plot(y=['O3'])
# 6. Save results
df.to_sql('gubbio_clean', conn, if_exists='replace', index=True, index_label='timerep')
# 7. Close
conn.close()
SQL Commands Summary
| Command | Purpose | Example |
|---|---|---|
| CREATE TABLE | Define schema | CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT) |
| DROP TABLE | Delete table | DROP TABLE IF EXISTS Students |
| INSERT INTO | Add rows | INSERT INTO Students VALUES (1, 'Alice') |
| SELECT | Query data | SELECT * FROM Students WHERE age > 20 |
| DELETE | Remove rows | DELETE FROM Students WHERE id = 1 |
| LIKE | Pattern match | SELECT * FROM Students WHERE name LIKE 'A%' |
Python SQLite Summary
| Operation | Code |
|---|---|
| Connect | conn = sql.connect('file.db') |
| Get cursor | cur = conn.cursor() |
| Execute | cur.execute('SQL') |
| Execute many | cur.executemany('SQL', list_of_tuples) |
| Fetch one | cur.fetchone() |
| Fetch all | cur.fetchall() |
| Save changes | conn.commit() |
| Close cursor | cur.close() |
| Close connection | conn.close() |
Pandas + SQLite Summary
| Operation | Code |
|---|---|
| Read | pd.read_sql_query('SELECT...', conn) |
| Read with index | pd.read_sql_query('...', conn, index_col='col', parse_dates=['col']) |
| Write | df.to_sql('table', conn, if_exists='replace') |
| Write with index | df.to_sql('table', conn, if_exists='replace', index=True, index_label='name') |
Common Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| Forgot conn.commit() | Changes not saved | Always commit after INSERT/UPDATE/DELETE |
| Using == in SQL | Nonstandard; SQLite tolerates it, but most databases reject it | Use single = for equality |
| Replace -999 with 0 | Wrong statistics | Use np.nan instead |
| DELETE FROM table without WHERE | Deletes everything | Always specify condition |
| CREATE TABLE twice | Error | Use DROP TABLE IF EXISTS first |
| Wrong number of ? | Error | Must match column count |
| Not closing connection | Resource leak | Always conn.close() |
| fetchall() twice | Empty second result | Re-execute query or use fetchone() |
Quick Reference Card
import sqlite3 as sql
import pandas as pd
import numpy as np
# Connect
conn = sql.connect('database.db')
cur = conn.cursor()
# Create table
cur.execute('DROP TABLE IF EXISTS MyTable')
cur.execute('CREATE TABLE MyTable (id INTEGER PRIMARY KEY, value REAL)')
# Insert
cur.executemany('INSERT INTO MyTable VALUES (?, ?)', [(1, 10.5), (2, 20.3)])
conn.commit()
# Query
cur.execute('SELECT * FROM MyTable WHERE value > 15')
results = cur.fetchall()
# Load into Pandas
df = pd.read_sql_query('SELECT * FROM MyTable', conn)
# Clean data
df.loc[df.value < 0, 'value'] = np.nan
# Save back
df.to_sql('MyTable', conn, if_exists='replace', index=False)
# Close
conn.close()