Pandas: Complete Notes

Setup

Every Pandas script starts with:

import pandas as pd
import numpy as np

pd and np are the conventional aliases; virtually all Pandas code uses them.


Part 1: Series

A Series is a 1-dimensional array with labels (called an index).

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

Output:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

The left column (0, 1, 2, ...) is the index; the right column holds the values.

Accessing Index and Values Separately

print(s.index)    # RangeIndex(start=0, stop=6, step=1)
print(s.values)   # array([ 1.,  3.,  5., nan,  6.,  8.])

Iterating Over a Series

# Just values
for i in s.values:
    print(i)

# Both index and values
for i, v in s.items():
    print(i, v)

Slicing a Series

With the default integer index, slicing is positional, just like on Python lists:

print(s[1:3])  # Elements at positions 1 and 2 (3 is excluded)

Check for NaN

np.isnan(s[3])  # True
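Pandas also has its own missing-value checks, usable on scalars or element-wise on the whole Series:

pd.isna(s[3])     # True, scalar check equivalent to np.isnan
s.isna()          # Boolean Series, True where values are NaN
s.isna().sum()    # 1 → number of NaN values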

Custom Index

The index doesn't have to be integers:

students = pd.Series(
    [28, 15, 30, 24, 10, 19], 
    index=['Lorenzo', 'Alessandra', 'Sofia', 'Giovanni', 'Matteo', 'Chiara']
)
print(students)

Output:

Lorenzo       28
Alessandra    15
Sofia         30
Giovanni      24
Matteo        10
Chiara        19
dtype: int64

Now you access by name:

print(students['Sofia'])  # 30

Filtering a Series

print(students[students >= 18])

Output:

Lorenzo     28
Sofia       30
Giovanni    24
Chiara      19
dtype: int64

Creating Series from a Dictionary

capitals = pd.Series({
    'Italy': 'Rome',
    'Germany': 'Berlin',
    'France': 'Paris',
    'Spain': 'Madrid',
    'Portugal': 'Lisbon'
})

Unlike regular Python dictionaries, Series support slicing. Note that label-based slices include both endpoints:

print(capitals['France':'Portugal'])

Output:

France      Paris
Spain       Madrid
Portugal    Lisbon
dtype: object

Convert Series to List

capitals.to_list()  # ['Rome', 'Berlin', 'Paris', 'Madrid', 'Lisbon']

Part 2: DataFrames

A DataFrame is a 2-dimensional table. Each column is a Series.

Creating a DataFrame from Series

# First, create two Series with the same index
capitals = pd.Series({
    'Italy': 'Rome',
    'Germany': 'Berlin',
    'France': 'Paris',
    'Spain': 'Madrid',
    'Portugal': 'Lisbon'
})

population = pd.Series({
    'Italy': 58_800_000,
    'Spain': 48_400_000,
    'Germany': 84_400_000,
    'Portugal': 10_400_000,
    'France': 68_200_000
})

# Combine into DataFrame
countries = pd.DataFrame({'capitals': capitals, 'population': population})
print(countries)

Output:

          capitals  population
Italy         Rome    58800000
Germany     Berlin    84400000
France       Paris    68200000
Spain       Madrid    48400000
Portugal    Lisbon    10400000

Creating a DataFrame from a Dictionary

df = pd.DataFrame({
    'country': ['France', 'Germany', 'Italy', 'Portugal', 'Spain'],
    'capital': ['Paris', 'Berlin', 'Rome', 'Lisbon', 'Madrid'],
    'population': [68_200_000, 84_400_000, 58_800_000, 10_400_000, 48_400_000]
})

This creates an automatic numeric index (0, 1, 2...).
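If you want one of the columns as row labels instead, as in the previous example, one option is set_index (a small sketch building on the df above):

df = df.set_index('country')
print(df.loc['Italy'])   # rows are now accessed by country name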

DataFrame Properties

print(countries.index)    # Index(['Italy', 'Germany', 'France', 'Spain', 'Portugal'])
print(countries.columns)  # Index(['capitals', 'population'])
print(countries.shape)    # (5, 2) → 5 rows, 2 columns
print(countries.size)     # 10 → total elements

Accessing Columns

Two ways:

# Bracket notation
countries['population']

# Dot notation
countries.population

Both return a Series.
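Dot notation only works when the column name is a valid Python identifier and doesn't collide with an existing DataFrame attribute; otherwise use brackets. A small sketch with hypothetical columns:

df = pd.DataFrame({'Salary Range From': [42000, 55000], 'size': ['S', 'M']})

df['Salary Range From']   # brackets required: the name contains spaces
# df.Salary Range From    # SyntaxError

df['size']                # the column
df.size                   # the built-in property (total elements: 4)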

Accessing Multiple Columns

Use a list inside brackets:

countries[['capitals', 'population']]  # Returns a DataFrame

Slicing Rows

countries['Italy':]      # From Italy to the end
countries[0:3]           # Rows 0, 1, 2

Filtering

countries[countries.population > 60_000_000]

Iterating Over a Single Column

for cap in countries['capitals']:
    print(cap)

Convert DataFrame to Dictionary

countries.to_dict()

Part 3: Reading Files

CSV Files

df = pd.read_csv('covid19-sample.csv')

Excel Files

First install openpyxl (only once; the ! prefix runs a shell command from a Jupyter notebook):

!pip install openpyxl

Then:

df = pd.read_excel('covid19-sample.xlsx')

Important: Excel reading is 500-1000x slower than CSV. Use CSV when possible.
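You can measure the difference on your own files with a quick timing sketch (assuming both files exist):

import time

t0 = time.perf_counter()
pd.read_csv('covid19-sample.csv')
print('CSV:  ', time.perf_counter() - t0, 'seconds')

t0 = time.perf_counter()
pd.read_excel('covid19-sample.xlsx')
print('Excel:', time.perf_counter() - t0, 'seconds')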

Reading from a URL

df = pd.read_csv('https://github.com/dsalomoni/bdb-2024/raw/main/covid/covid19-sample.csv')

Reading Only Specific Columns

my_columns = ['country', 'weekly_count', 'year_week']
df = pd.read_csv('covid19-sample.csv', usecols=my_columns)

Part 4: Inspecting Data

First/Last Rows

df.head()      # First 5 rows
df.head(10)    # First 10 rows
df.tail(3)     # Last 3 rows

Shape and Size

df.shape  # (rows, columns) tuple
df.size   # Total elements = rows × columns

Column Names

df.columns

Unique Values in a Column

df['indicator'].unique()    # Array of unique values
df['indicator'].nunique()   # Count of unique values

Part 5: Selecting and Slicing

By Row Number

df[3500:3504]  # Rows 3500, 3501, 3502, 3503
df[777:778]    # Just row 777

Specific Column from a Slice

df[777:778]['year_week']
# or
df[777:778].year_week

Multiple Columns

df.head()[['country', 'year_week']]

Using loc[]

Access rows by index label or by condition:

# By label
df.loc[19828]

# By condition
df.loc[df.weekly_count > 4500]

Part 6: Filtering with Conditions

Direct Filtering

df[df['grade'] > 27]

Multiple Conditions

# AND - use &
df[(df['grade'] > 27) & (df['age'] < 30)]

# OR - use |
df[(df['grade'] > 29) | (df['age'] > 30)]

Important: Wrap each condition in parentheses. The & and | operators bind more tightly than comparisons like > and <, so without parentheses Python misgroups the expression.
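Without them, the expression fails:

# df[df['grade'] > 27 & df['age'] < 30]
# → parsed as df['grade'] > (27 & df['age']) < 30, a chained comparison
#   between Series, which raises "The truth value of a Series is ambiguous"

df[(df['grade'] > 27) & (df['age'] < 30)]   # correct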

Using query() — The Better Way

df.query('country=="Italy" and indicator=="cases"')

With variables:

start_week = '2020-10'
end_week = '2021-48'
df.query('year_week >= @start_week and year_week <= @end_week')

Or using string formatting:

df.query('country=="Italy" and indicator=="cases" and year_week>="%s" and year_week<="%s"' % (start_week, end_week))
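An f-string works too and is arguably easier to read:

df.query(f'country=="Italy" and indicator=="cases" and year_week>="{start_week}" and year_week<="{end_week}"')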

Part 7: iterrows() vs query()

The Slow Way: iterrows()

it_cases = dict()
for index, row in df.iterrows():
    if row['country'] == 'Italy':
        if row['indicator'] == 'cases':
            week = row['year_week']
            if (week >= start_week) and (week <= end_week):
                it_cases[week] = row['weekly_count']

df2 = pd.DataFrame(list(it_cases.items()), columns=['week', 'cases'])

Time: ~1.52 seconds for 41,000 rows

The Fast Way: query()

df3 = df.query('country=="Italy" and indicator=="cases" and year_week>="%s" and year_week<="%s"' % (start_week, end_week))

Time: ~0.01 seconds

query() is about 150x faster than iterrows().


Part 8: Sorting

Sort a Series

series.sort_values()                    # Ascending
series.sort_values(ascending=False)     # Descending

Sort a DataFrame

df.sort_values(by='quantity')                     # By one column
df.sort_values(by='quantity', ascending=False)    # Descending
df.sort_values(by=['column1', 'column2'])         # By multiple columns

Sort a Dictionary with Pandas

x = {'apple': 5, 'banana': 2, 'orange': 8, 'grape': 1}
series_x = pd.Series(x)
sorted_x = series_x.sort_values().to_dict()
# {'grape': 1, 'banana': 2, 'apple': 5, 'orange': 8}

Part 9: Common Functions

sum()

df['weekly_count'].sum()

describe()

Generates statistics for numerical columns:

df.describe()

Output includes: count, mean, std, min, 25%, 50%, 75%, max

nunique() and unique()

df['country'].nunique()  # Number of unique values
df['country'].unique()   # Array of unique values

mean() and median()

df['salary'].mean()    # Average
df['salary'].median()  # Middle value

When to use which:

  • Mean: When data is symmetrically distributed, no outliers
  • Median: When data has outliers or is skewed

Example:

Blood pressure readings: 142, 124, 121, 150, 215

Mean = (142+124+121+150+215)/5 = 150.4
Median = 142 (middle value when sorted: 121, 124, 142, 150, 215)

The 215 outlier pulls the mean up but doesn't affect the median.
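The same computation in Pandas:

bp = pd.Series([142, 124, 121, 150, 215])
print(bp.mean())     # 150.4
print(bp.median())   # 142.0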


Part 10: groupby()

Split data into groups, then apply a function.

Basic groupby

df_grouped = df.groupby('continent')

This returns a DataFrameGroupBy object. On its own it computes nothing; you need to apply an aggregation function:

df.groupby('continent').sum()
df.groupby('continent')['weekly_count'].mean()
df.groupby('continent')['weekly_count'].count()

Multiple Statistics with agg()

df.groupby('Agency')['Salary Range From'].agg(['mean', 'median'])

Group by Multiple Columns

df.groupby(['Agency', 'Posting Type'])['Salary Range From'].mean()

To prevent the grouped columns from becoming the index:

df.groupby(['Agency', 'Posting Type'], as_index=False)['Salary Range From'].mean()

Accessing Groups

grouped = df.groupby('continent')

# What groups exist?
grouped.groups.keys()

# Get one specific group
grouped.get_group('Oceania')

# How many unique countries in Oceania?
grouped.get_group('Oceania')['country'].nunique()

# Which countries?
grouped.get_group('Oceania')['country'].unique()

Sorting groupby Results

df.groupby('Agency')['# Of Positions'].count().sort_values(ascending=False).head(10)

Part 11: cut() — Binning Data

Convert continuous values into categories.

Basic Usage

df = pd.DataFrame({'age': [25, 30, 35, 40, 45, 50, 55, 60, 65]})
bins = [20, 40, 60, 80]

df['age_group'] = pd.cut(df['age'], bins)

Result:

   age   age_group
0   25    (20, 40]
1   30    (20, 40]
2   35    (20, 40]
3   40    (20, 40]
4   45    (40, 60]
...

With Labels

df['age_group'] = pd.cut(df['age'], bins, labels=['young', 'middle', 'old'])

Automatic Bins

pd.cut(df['Salary Range From'], bins=3, labels=['low', 'middle', 'high'])

Pandas automatically calculates the bin ranges.
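To see the edges Pandas picked, pass retbins=True, which returns them alongside the binned values (a small sketch reusing the age data from above):

ages = pd.DataFrame({'age': [25, 30, 35, 40, 45, 50, 55, 60, 65]})
binned, edges = pd.cut(ages['age'], bins=3, retbins=True)
print(edges)   # ~[24.96, 38.33, 51.67, 65.0] (lowest edge widened slightly to include the minimum)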

Combining cut() with groupby()

# Add salary category column
jobs['salary_bin'] = pd.cut(jobs['Salary Range From'], bins=3, labels=['low', 'middle', 'high'])

# Now group by it
jobs.groupby('salary_bin')['Salary Range From'].count()

Part 12: Data Cleaning

Removing Duplicates

# Remove duplicate rows (all columns must match)
df.drop_duplicates(inplace=True)

# Remove duplicates based on specific column
df.drop_duplicates(subset=['B'], inplace=True)

Handling Missing Values (NaN)

Option 1: Fill with a value

# Fill with mean
df['A'] = df['A'].fillna(df['A'].mean())

# Fill with median
df['B'] = df['B'].fillna(df['B'].median())

Prefer assignment over inplace=True on a single column: with copy-on-write (the default from Pandas 3.0), the inplace call no longer modifies the parent DataFrame.

Option 2: Drop rows with NaN

df.dropna()                     # Drop any row with NaN
df.dropna(subset=['grade'])     # Only if specific column is NaN

Part 13: Data Scaling

When columns have very different scales (e.g., age: 20-60, salary: 50000-200000), analysis and visualization become difficult.

Standardization (StandardScaler)

Transforms data to have mean = 0 and standard deviation = 1.
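Each value x is replaced by its z-score:

$$z = \frac{x - \mu}{\sigma}$$

where μ is the column mean and σ its standard deviation.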

from sklearn.preprocessing import StandardScaler

df_unscaled = pd.DataFrame({'A': [1, 3, 2, 2, 1], 'B': [65, 130, 80, 70, 50]})

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_unscaled)
df_scaled = pd.DataFrame(df_scaled, columns=df_unscaled.columns)

When to use: Data follows a Gaussian (bell-shaped) distribution.

Normalization (MinMaxScaler)

Transforms data to range [0, 1].
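Each value is rescaled against the column minimum and maximum:

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$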

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df_unscaled)
df_scaled = pd.DataFrame(df_scaled, columns=df_unscaled.columns)

When to use: Distribution is unknown or not Gaussian.

Warning: Normalization is more sensitive to outliers than standardization.


Part 14: Plotting

Basic Syntax

df.plot(x='column_x', y='column_y', kind='line')
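Pandas plotting is a thin wrapper around Matplotlib. In a Jupyter notebook the figure renders automatically; in a plain script you must show it explicitly:

import matplotlib.pyplot as plt

df.plot(x='column_x', y='column_y', kind='line')
plt.show()   # needed in scripts; notebooks display the figure on their own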

Plot Types

kind=        Plot type
'line'       Line plot
'bar'        Bar chart
'barh'       Horizontal bar chart
'pie'        Pie chart
'scatter'    Scatter plot
'hist'       Histogram

Examples

# Bar plot
df.plot(x='name', y='age', kind='bar', title='Ages')

# Line plot
df.plot(x='month', y='sales', kind='line', title='Monthly Sales')

# With more options
df.plot(kind='bar', ylabel='Total cases', title='COVID-19', grid=True, logy=True)

Plotting Two DataFrames Together

# Get axis from first plot
ax = df1.plot(kind='line', x='Month', title='Comparison')

# Add second plot to same axis
df2.plot(ax=ax, kind='line')

ax.set_xlabel('Month')
ax.set_ylabel('Sales')
ax.legend(['Vendor A', 'Vendor B'])

Part 15: Exporting Data

To CSV

df.to_csv('output.csv', index=False)

To Excel

df.to_excel('output.xlsx', index=False)

index=False prevents writing the row index as an extra column.


Part 16: Statistics Refresher

Variance and Standard Deviation

Variance (σ²): Average of squared differences from the mean.

$$\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$

Standard Deviation (σ): Square root of variance.

$$\sigma = \sqrt{\sigma^2}$$

Why use standard deviation instead of variance?

  • Variance has squared units (meters² if data is in meters)
  • Standard deviation has the same units as the original data
  • Standard deviation is more interpretable
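One subtlety: the formula above divides by n (population variance). Pandas' .var() and .std() default to ddof=1, dividing by n-1 (sample variance), while NumPy's default is ddof=0. A quick check:

s = pd.Series([1, 2, 3, 4, 5])
print(s.var())         # 2.5 → divides by n-1
print(s.var(ddof=0))   # 2.0 → divides by n, matching the formula above
print(np.var(s))       # 2.0 → NumPy defaults to ddof=0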

Gaussian Distribution

A bell-shaped curve where:

  • Mean, median, and mode are equal (at the center)
  • ~68% of data falls within 1 standard deviation of the mean
  • ~95% within 2 standard deviations
  • ~99.7% within 3 standard deviations
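A quick empirical check of the 68% figure with simulated data (a sketch; the fraction varies slightly per run):

rng = np.random.default_rng(seed=1)
data = pd.Series(rng.normal(loc=0, scale=1, size=100_000))
print(((data > -1) & (data < 1)).mean())   # ≈ 0.683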

Quick Reference

Reading

pd.read_csv('file.csv')
pd.read_csv('file.csv', usecols=['col1', 'col2'])
pd.read_excel('file.xlsx')

Inspecting

df.head(), df.tail()
df.shape, df.size, df.columns
df.describe()
df['col'].unique(), df['col'].nunique()

Selecting

df['col']                    # Single column (Series)
df[['col1', 'col2']]         # Multiple columns (DataFrame)
df[0:5]                      # Rows 0-4
df.loc[df['col'] > x]        # By condition
df.query('col > x')          # By condition (faster)

Cleaning

df.dropna()
df.fillna(value)
df.drop_duplicates()

Aggregating

df['col'].sum(), .mean(), .median(), .std(), .count()
df.groupby('col')['val'].mean()
df.groupby('col')['val'].agg(['mean', 'median', 'count'])

Sorting

df.sort_values(by='col')
df.sort_values(by='col', ascending=False)

Exporting

df.to_csv('out.csv', index=False)
df.to_excel('out.xlsx', index=False)

Performance Summary

Operation        Speed
read_csv()       Fast
read_excel()     500-1000x slower than read_csv()
query()          Fast
df[condition]    Fast (boolean indexing)
iterrows()       ~150x slower than query()

Rule: Avoid iterrows() on large datasets. Use query() or boolean indexing instead.