Pandas: Complete Notes
Setup
Every Pandas script starts with:
import pandas as pd
import numpy as np
pd and np are the conventional aliases; virtually all Pandas code uses them.
Part 1: Series
A Series is a 1-dimensional array with labels (called an index).
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
Output:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
The left column (0, 1, 2...) is the index. The right column is the values.
Accessing Index and Values Separately
print(s.index) # RangeIndex(start=0, stop=6, step=1)
print(s.values) # array([ 1., 3., 5., nan, 6., 8.])
Iterating Over a Series
# Just values
for i in s.values:
    print(i)
# Both index and values
for i, v in s.items():
    print(i, v)
Slicing a Series
Works like Python lists:
print(s[1:3]) # Elements at index 1 and 2 (3 is excluded)
Check for NaN
np.isnan(s[3]) # True
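Pandas also has a vectorized check that works on the whole Series at once:
print(s.isna())        # Boolean Series: True at label 3, False elsewhere
print(s.isna().sum())  # 1 → number of missing values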
Custom Index
The index doesn't have to be integers:
students = pd.Series(
    [28, 15, 30, 24, 10, 19],
    index=['Lorenzo', 'Alessandra', 'Sofia', 'Giovanni', 'Matteo', 'Chiara']
)
print(students)
Output:
Lorenzo 28
Alessandra 15
Sofia 30
Giovanni 24
Matteo 10
Chiara 19
dtype: int64
Now you access by name:
print(students['Sofia']) # 30
Filtering a Series
print(students[students >= 18])
Output:
Lorenzo 28
Sofia 30
Giovanni 24
Chiara 19
dtype: int64
Creating Series from a Dictionary
capitals = pd.Series({
    'Italy': 'Rome',
    'Germany': 'Berlin',
    'France': 'Paris',
    'Spain': 'Madrid',
    'Portugal': 'Lisbon'
})
Unlike regular Python dictionaries, Series support slicing. Note that label-based slices include both endpoints:
print(capitals['France':'Portugal'])
Output:
France Paris
Spain Madrid
Portugal Lisbon
dtype: object
Convert Series to List
capitals.to_list() # ['Rome', 'Berlin', 'Paris', 'Madrid', 'Lisbon']
Part 2: DataFrames
A DataFrame is a 2-dimensional table. Each column is a Series.
Creating a DataFrame from Series
# First, create two Series with the same index
capitals = pd.Series({
    'Italy': 'Rome',
    'Germany': 'Berlin',
    'France': 'Paris',
    'Spain': 'Madrid',
    'Portugal': 'Lisbon'
})
population = pd.Series({
    'Italy': 58_800_000,
    'Spain': 48_400_000,
    'Germany': 84_400_000,
    'Portugal': 10_400_000,
    'France': 68_200_000
})
# Combine into DataFrame
countries = pd.DataFrame({'capitals': capitals, 'population': population})
print(countries)
Output:
capitals population
Italy Rome 58800000
Germany Berlin 84400000
France Paris 68200000
Spain Madrid 48400000
Portugal Lisbon 10400000
Creating a DataFrame from a Dictionary
df = pd.DataFrame({
    'country': ['France', 'Germany', 'Italy', 'Portugal', 'Spain'],
    'capital': ['Paris', 'Berlin', 'Rome', 'Lisbon', 'Madrid'],
    'population': [68_200_000, 84_400_000, 58_800_000, 10_400_000, 48_400_000]
})
This creates an automatic numeric index (0, 1, 2...).
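A quick check of that automatic index (using the df just created):
print(df.index)   # RangeIndex(start=0, stop=5, step=1)
print(df.loc[2])  # row labeled 2: country Italy, capital Rome, population 58800000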
DataFrame Properties
print(countries.index) # Index(['Italy', 'Germany', 'France', 'Spain', 'Portugal'])
print(countries.columns) # Index(['capitals', 'population'])
print(countries.shape) # (5, 2) → 5 rows, 2 columns
print(countries.size) # 10 → total elements
Accessing Columns
Two ways:
# Bracket notation
countries['population']
# Dot notation
countries.population
Both return a Series. Dot notation only works when the column name is a valid Python identifier (no spaces) and doesn't clash with an existing DataFrame attribute or method.
Accessing Multiple Columns
Use a list inside brackets:
countries[['capitals', 'population']] # Returns a DataFrame
Slicing Rows
countries['Italy':] # From Italy to the end
countries[0:3] # Rows 0, 1, 2
Filtering
countries[countries.population > 60_000_000]
Iterating Over a Single Column
for cap in countries['capitals']:
    print(cap)
Convert DataFrame to Dictionary
countries.to_dict()
Part 3: Reading Files
CSV Files
df = pd.read_csv('covid19-sample.csv')
Excel Files
First install openpyxl (only once):
!pip install openpyxl
Then:
df = pd.read_excel('covid19-sample.xlsx')
Important: reading an Excel file can be 500-1000x slower than reading the equivalent CSV. Use CSV whenever possible.
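To measure the difference on your own machine, a minimal timing sketch (using the sample files from above; adjust the names as needed):
import time
start = time.perf_counter()
df_csv = pd.read_csv('covid19-sample.csv')
csv_time = time.perf_counter() - start
start = time.perf_counter()
df_xlsx = pd.read_excel('covid19-sample.xlsx')   # requires openpyxl
xlsx_time = time.perf_counter() - start
print(f'CSV: {csv_time:.3f}s  Excel: {xlsx_time:.3f}s')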
Reading from a URL
df = pd.read_csv('https://github.com/dsalomoni/bdb-2024/raw/main/covid/covid19-sample.csv')
Reading Only Specific Columns
my_columns = ['country', 'weekly_count', 'year_week']
df = pd.read_csv('covid19-sample.csv', usecols=my_columns)
Part 4: Inspecting Data
First/Last Rows
df.head() # First 5 rows
df.head(10) # First 10 rows
df.tail(3) # Last 3 rows
Shape and Size
df.shape # (rows, columns) tuple
df.size # Total elements = rows × columns
Column Names
df.columns
Unique Values in a Column
df['indicator'].unique() # Array of unique values
df['indicator'].nunique() # Count of unique values
Part 5: Selecting and Slicing
By Row Number
df[3500:3504] # Rows 3500, 3501, 3502, 3503
df[777:778] # Just row 777
Specific Column from a Slice
df[777:778]['year_week']
# or
df[777:778].year_week
Multiple Columns
df.head()[['country', 'year_week']]
Using loc[]
Access rows by index label or by condition:
# By label
df.loc[19828]
# By condition
df.loc[df.weekly_count > 4500]
Part 6: Filtering with Conditions
Direct Filtering
df[df['grade'] > 27]
Multiple Conditions
# AND - use &
df[(df['grade'] > 27) & (df['age'] < 30)]
# OR - use |
df[(df['grade'] > 29) | (df['age'] > 30)]
Important: Wrap each condition in parentheses.
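Why: & and | bind more tightly than comparisons, so without parentheses the expression is parsed differently and typically fails. A sketch of the broken version:
# Parsed as df['grade'] > (27 & df['age']) < 30, a chained comparison that needs
# "and" between two Series → raises "The truth value of a Series is ambiguous"
df[df['grade'] > 27 & df['age'] < 30]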
Using query() — The Better Way
df.query('country=="Italy" and indicator=="cases"')
With variables:
start_week = '2020-10'
end_week = '2021-48'
df.query('year_week >= @start_week and year_week <= @end_week')
Or using string formatting:
df.query('country=="Italy" and indicator=="cases" and year_week>="%s" and year_week<="%s"' % (start_week, end_week))
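An f-string works just as well here:
df.query(f'country=="Italy" and indicator=="cases" and '
         f'year_week>="{start_week}" and year_week<="{end_week}"')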
Part 7: iterrows() vs query()
The Slow Way: iterrows()
it_cases = dict()
for index, row in df.iterrows():
    if row['country'] == 'Italy':
        if row['indicator'] == 'cases':
            week = row['year_week']
            if (week >= start_week) and (week <= end_week):
                it_cases[week] = row['weekly_count']
df2 = pd.DataFrame(list(it_cases.items()), columns=['week', 'cases'])
Time: ~1.52 seconds for 41,000 rows
The Fast Way: query()
df3 = df.query('country=="Italy" and indicator=="cases" and year_week>="%s" and year_week<="%s"' % (start_week, end_week))
Time: ~0.01 seconds
query() is about 150x faster than iterrows().
Part 8: Sorting
Sort a Series
series.sort_values() # Ascending
series.sort_values(ascending=False) # Descending
Sort a DataFrame
df.sort_values(by='quantity') # By one column
df.sort_values(by='quantity', ascending=False) # Descending
df.sort_values(by=['column1', 'column2']) # By multiple columns
Sort a Dictionary with Pandas
x = {'apple': 5, 'banana': 2, 'orange': 8, 'grape': 1}
series_x = pd.Series(x)
sorted_x = series_x.sort_values().to_dict()
# {'grape': 1, 'banana': 2, 'apple': 5, 'orange': 8}
Part 9: Common Functions
sum()
df['weekly_count'].sum()
describe()
Generates statistics for numerical columns:
df.describe()
Output includes: count, mean, std, min, 25%, 50%, 75%, max
nunique() and unique()
df['country'].nunique() # Number of unique values
df['country'].unique() # Array of unique values
mean() and median()
df['salary'].mean() # Average
df['salary'].median() # Middle value
When to use which:
- Mean: When data is symmetrically distributed, no outliers
- Median: When data has outliers or is skewed
Example:
Blood pressure readings: 142, 124, 121, 150, 215
Mean = (142+124+121+150+215)/5 = 150.4
Median = 142 (middle value when sorted: 121, 124, 142, 150, 215)
The 215 outlier pulls the mean up but doesn't affect the median.
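The same calculation in Pandas:
bp = pd.Series([142, 124, 121, 150, 215])
print(bp.mean())    # 150.4
print(bp.median())  # 142.0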
Part 10: groupby()
Split data into groups, then apply a function.
Basic groupby
df_grouped = df.groupby('continent')
This returns a DataFrameGroupBy object, which is not useful by itself; you need to apply an aggregation function to it:
df.groupby('continent').sum()
df.groupby('continent')['weekly_count'].mean()
df.groupby('continent')['weekly_count'].count()
Multiple Statistics with agg()
df.groupby('Agency')['Salary Range From'].agg(['mean', 'median'])
Group by Multiple Columns
df.groupby(['Agency', 'Posting Type'])['Salary Range From'].mean()
To prevent the grouped columns from becoming the index:
df.groupby(['Agency', 'Posting Type'], as_index=False)['Salary Range From'].mean()
Accessing Groups
grouped = df.groupby('continent')
# What groups exist?
grouped.groups.keys()
# Get one specific group
grouped.get_group('Oceania')
# How many unique countries in Oceania?
grouped.get_group('Oceania')['country'].nunique()
# Which countries?
grouped.get_group('Oceania')['country'].unique()
Sorting groupby Results
df.groupby('Agency')['# Of Positions'].count().sort_values(ascending=False).head(10)
Part 11: cut() — Binning Data
Convert continuous values into categories.
Basic Usage
df = pd.DataFrame({'age': [25, 30, 35, 40, 45, 50, 55, 60, 65]})
bins = [20, 40, 60, 80]
df['age_group'] = pd.cut(df['age'], bins)
Result:
age age_group
0 25 (20, 40]
1 30 (20, 40]
2 35 (20, 40]
3 40 (20, 40]
4 45 (40, 60]
...
With Labels
df['age_group'] = pd.cut(df['age'], bins, labels=['young', 'middle', 'old'])
Automatic Bins
pd.cut(df['Salary Range From'], bins=3, labels=['low', 'middle', 'high'])
Pandas automatically calculates the bin ranges.
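To inspect the edges Pandas chose, pass retbins=True (a sketch using the age DataFrame from the basic example above):
categories, edges = pd.cut(df['age'], bins=3, labels=['low', 'middle', 'high'], retbins=True)
print(edges)  # three equal-width bins spanning the min and max age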
Combining cut() with groupby()
# Add salary category column
jobs['salary_bin'] = pd.cut(jobs['Salary Range From'], bins=3, labels=['low', 'middle', 'high'])
# Now group by it
jobs.groupby('salary_bin')['Salary Range From'].count()
Part 12: Data Cleaning
Removing Duplicates
# Remove duplicate rows (all columns must match)
df.drop_duplicates(inplace=True)
# Remove duplicates based on specific column
df.drop_duplicates(subset=['B'], inplace=True)
Handling Missing Values (NaN)
Option 1: Fill with a value
# Fill with mean (assigning back avoids chained-assignment warnings
# that inplace=True on a single column can trigger)
df['A'] = df['A'].fillna(df['A'].mean())
# Fill with median
df['B'] = df['B'].fillna(df['B'].median())
Option 2: Drop rows with NaN
df.dropna() # Drop any row with NaN
df.dropna(subset=['grade']) # Only if specific column is NaN
Part 13: Data Scaling
When columns have very different scales (e.g., age: 20-60, salary: 50000-200000), analysis and visualization become difficult.
Standardization (StandardScaler)
Transforms data to have mean = 0 and standard deviation = 1.
from sklearn.preprocessing import StandardScaler
df_unscaled = pd.DataFrame({'A': [1, 3, 2, 2, 1], 'B': [65, 130, 80, 70, 50]})
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_unscaled)
df_scaled = pd.DataFrame(df_scaled, columns=df_unscaled.columns)
When to use: Data follows a Gaussian (bell-shaped) distribution.
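You can verify the result on df_scaled: every column now has mean ≈ 0 and (population) standard deviation 1:
print(df_scaled.mean())       # ≈ 0 for both columns (up to floating-point noise)
print(df_scaled.std(ddof=0))  # 1.0 — StandardScaler divides by the population std (ddof=0)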
Normalization (MinMaxScaler)
Transforms data to range [0, 1].
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df_unscaled)
df_scaled = pd.DataFrame(df_scaled, columns=df_unscaled.columns)
When to use: Distribution is unknown or not Gaussian.
Warning: Normalization is more sensitive to outliers than standardization.
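A small sketch of why: one outlier stretches the [0, 1] range and squeezes everything else together (same min-max formula, written by hand):
s = pd.Series([1, 2, 3, 4, 1000])
s_minmax = (s - s.min()) / (s.max() - s.min())
print(s_minmax.round(3))  # 0.000, 0.001, 0.002, 0.003, 1.000 — the first four values become nearly indistinguishable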
Part 14: Plotting
Basic Syntax
df.plot(x='column_x', y='column_y', kind='line')
Plot Types
| kind= | Plot Type |
|---|---|
| 'line' | Line plot |
| 'bar' | Bar chart |
| 'barh' | Horizontal bar |
| 'pie' | Pie chart |
| 'scatter' | Scatter plot |
| 'hist' | Histogram |
Examples
# Bar plot
df.plot(x='name', y='age', kind='bar', title='Ages')
# Line plot
df.plot(x='month', y='sales', kind='line', title='Monthly Sales')
# With more options
df.plot(kind='bar', ylabel='Total cases', title='COVID-19', grid=True, logy=True)
Plotting Two DataFrames Together
# Get axis from first plot
ax = df1.plot(kind='line', x='Month', title='Comparison')
# Add second plot to same axis
df2.plot(ax=ax, kind='line')
ax.set_xlabel('Month')
ax.set_ylabel('Sales')
ax.legend(['Vendor A', 'Vendor B'])
Part 15: Exporting Data
To CSV
df.to_csv('output.csv', index=False)
To Excel
df.to_excel('output.xlsx', index=False)
index=False prevents writing the row numbers as a column.
Part 16: Statistics Refresher
Variance and Standard Deviation
Variance (σ²): Average of squared differences from the mean.
$$\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$
Standard Deviation (σ): Square root of variance.
$$\sigma = \sqrt{\sigma^2}$$
Why use standard deviation instead of variance?
- Variance has squared units (meters² if data is in meters)
- Standard deviation has the same units as the original data
- Standard deviation is more interpretable
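In Pandas, note that .var() and .std() default to the sample formula (n − 1 in the denominator); pass ddof=0 to match the population formula above:
x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])
print(x.var(ddof=0))  # 4.0
print(x.std(ddof=0))  # 2.0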
Gaussian Distribution
A bell-shaped curve where:
- Mean, median, and mode are equal (at the center)
- ~68% of data falls within 1 standard deviation of the mean
- ~95% within 2 standard deviations
- ~99.7% within 3 standard deviations (see the sketch below)
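A quick numerical check of the 68-95-99.7 rule with synthetic data (results vary slightly per run):
rng = np.random.default_rng(0)
data = pd.Series(rng.normal(loc=0, scale=1, size=100_000))
print(((data > -1) & (data < 1)).mean())  # ≈ 0.68
print(((data > -2) & (data < 2)).mean())  # ≈ 0.95
print(((data > -3) & (data < 3)).mean())  # ≈ 0.997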
Quick Reference
Reading
pd.read_csv('file.csv')
pd.read_csv('file.csv', usecols=['col1', 'col2'])
pd.read_excel('file.xlsx')
Inspecting
df.head(), df.tail()
df.shape, df.size, df.columns
df.describe()
df['col'].unique(), df['col'].nunique()
Selecting
df['col'] # Single column (Series)
df[['col1', 'col2']] # Multiple columns (DataFrame)
df[0:5] # Rows 0-4
df.loc[df['col'] > x] # By condition
df.query('col > x') # By condition (faster)
Cleaning
df.dropna()
df.fillna(value)
df.drop_duplicates()
Aggregating
df['col'].sum(), .mean(), .median(), .std(), .count()
df.groupby('col')['val'].mean()
df.groupby('col')['val'].agg(['mean', 'median', 'count'])
Sorting
df.sort_values(by='col')
df.sort_values(by='col', ascending=False)
Exporting
df.to_csv('out.csv', index=False)
df.to_excel('out.xlsx', index=False)
Performance Summary
| Operation | Speed |
|---|---|
| read_csv() | Fast |
| read_excel() | 500-1000x slower |
| query() | Fast |
| df[condition] | Fast |
| iterrows() | ~150x slower than query() |
Rule: Avoid iterrows() on large datasets. Use query() or boolean indexing instead.