Welcome

Hello, I'm Mahmoud, and these are my notes. Since you're reading something I've written, I want to share a bit about how I approach learning and what you can expect here.

Want to know more about me? Check out my blog at mahmoud.ninja

A Personal Note

I didn't make the wrong choice coming here. It's just a matter of time. And if everything seems very easy and known already, why am I here? I'm here to feel this, to feel that I don't belong here. To make the complex things (the prof calls them "simple") easier and more understandable, without using AI. I hate using AI when learning something new.

I don't know where I should start, but I should start somewhere and learn recursively, not in order.

I love three things in my life: viruses, genetics, and databases. I hoped they would be four, but yeah, that's life.

Anyway, I will never fail at something I'm keen on and love. I hope this won't be one-way love. I am talking about genetics here.

A Word About AI

I don't use AI for learning or creating the core content in these notes. If I did, there would be no point in making this, as anyone can ask ChatGPT or Claude for explanations.

What I may use AI for:

  • Proofreading and fixing grammar or typos
  • Reformatting text or code
  • Catching mistakes I missed

What I never use AI for:

  • Understanding new concepts
  • Generating explanations or examples
  • Writing the actual content you're reading

If you wanted AI-generated content, you wouldn't need these notes. You're here because sometimes learning from someone who's figuring it out alongside you is more helpful than learning from something that already knows everything.

My Philosophy

I believe there are no shortcuts to success. To me, success means respecting your time, and before investing that time, you need a plan rooted in where you want to go in life.

Learn, learn, learn, and when you think you've learned enough, write it down and share it.

About These Notes

I don't strive for perfection. Sometimes I write something and hope someone will point out where I'm wrong so we can both learn from it. That's the beauty of sharing knowledge: it's a two-way street.

I tend to be pretty chill, and I occasionally throw in some sarcasm when it feels appropriate. These are my notes after all, so please don't be annoyed if you encounter something that doesn't resonate with you; just skip ahead.

I'm creating this resource purely out of love for sharing and teaching. Ironically, I'm learning more by organizing and explaining these concepts than I ever did just studying them. Sharing is learning. Imagine if scientists never shared their research; we'd still be in the dark ages.

Everything I create here is released under Creative Commons (CC BY 4.0). You're free to share, copy, remix, and build upon this material for any purpose, even commercially, as long as you give appropriate credit.

I deeply respect intellectual property rights. I will never share copyrighted materials, proprietary resources, or content that was shared with our class under restricted access. All external resources linked here are publicly available or properly attributed.

If you notice any copyright violations or improperly shared materials, please contact me immediately at mahmoudahmedxyz@gmail.com, and I will remove the content right away and make necessary corrections.

Final Thoughts

I have tremendous respect for everyone in this learning journey. We're all here trying to understand complex topics, and we all learn differently. If these notes help you even a little bit, then this project has served its purpose.

What's New 📢

Stay up to date with the latest additions and updates to the notes.


December 2025

Week 1 (Dec 5)

✨ New Content Added

Applied Genomics:


Computational Methods for Bioinformatics

Course Overview

Exam Format: Writing/debugging and discussing code (on paper or IDE)

What This Course Is Really About

Solving problems. Not just any problems: the 2D and 3D problems that show up constantly in bioinformatics.

Don't always use brute force. No "try everything and hope it works." Use the right data structures. Think before you code.

Most of the difficult problems will involve arrays and strings.

Practice Resources

LeetCode
Practice problem-solving patterns, especially:

  • Array/matrix manipulation (2D problems)
  • Hash tables and sets
  • Graph algorithms (for 3D structures)
  • Dynamic programming basics

Recommended pace: Solve 2 problems for every lecture

Study Strategy

  1. Understand the problem first - Don't jump to coding
  2. Think about data structures - What fits this problem naturally?
  3. Sketch the approach - On paper
  4. Code it cleanly - Readable beats clever

Exam Preparation

You'll write and debug code, then discuss your choices. They want to hear your reasoning:

  • Why this data structure?
  • Could you optimize it further?

Practice explaining your code out loud. If you can't articulate why you chose an approach, you probably don't understand it well enough.

Linux Fundamentals

The History of Linux

In 1880, the French government awarded the Volta Prize to Alexander Graham Bell. Instead of going to the Maldives (kidding...he had work to do), he used the prize money to fund laboratory research in America, a lineage that eventually became Bell Labs (founded as Bell Telephone Laboratories in 1925).

This lab researched electronics and something revolutionary called the mathematical theory of communication (Claude Shannon's work). The transistor, invented there in 1947, set off a revolution through the 1950s. Bell Labs scientists won 10 Nobel Prizes...not too shabby.

But around this time, Russia made the USA nervous by launching the first satellite, Sputnik, in 1957. This had nothing to do with operating systems, it was literally just a satellite beeping in space, but it scared America enough to kickstart the space race.

President Eisenhower responded by creating ARPA (Advanced Research Projects Agency) in 1958, and asked James Killian, MIT's president, to help develop computer technology. This led to Project MAC (Mathematics and Computation) at MIT.

Before Project MAC, using a computer meant bringing a stack of punch cards with your instructions, feeding them into the machine, and waiting. During this time no one else could use the computer; it was one job at a time.

The big goal of Project MAC was to allow multiple programmers to use the same computer simultaneously, executing different instructions at the same time. This concept was called time-sharing.

MIT developed the first operating system to support time-sharing: CTSS (Compatible Time-Sharing System). To expand the idea to larger mainframe computers, MIT partnered with Bell Labs and with General Electric (GE), who manufactured these machines. In 1964 they began Multics, the first full-scale OS designed around time-sharing. It also popularized the terminal as a new type of input device.

In the late 1960s, GE and Bell Labs left the project. GE's computer department was bought by Honeywell, which continued the project with MIT and created a commercial version that sold for 25 years.

In 1969, Bell Labs engineers (Dennis Ritchie and Ken Thompson) developed a new OS based on Multics. In 1970, they introduced Unics (later called Unix, the name was a sarcastic play on "Multics," implying it was simpler).

The first two versions of Unix were written in assembly language, which was then translated by an assembler and linker into machine code. The big problem with assembly was that it was tightly coupled to specific processors, meaning you'd need to rewrite Unix for each processor architecture. So Dennis Ritchie decided to create a new programming language: C.

They rebuilt Unix using C. At this time, AT&T owned Bell Labs (now it's Nokia). AT&T declared that Unix was theirs and no one else could touch it, classic monopolization.

AT&T did make one merciful agreement: universities could use Unix for educational purposes. But after AT&T was broken up into smaller companies in 1984, even this stopped. Things got worse.

One person was watching all this and decided to take action: Andrew S. Tanenbaum. In 1987, he created a new Unix-inspired OS called MINIX. It was free for universities and designed to work on Intel chips. It had some issues, occasional crashes and overheating, but this was just the beginning. This was the first time someone made a Unix-like OS outside of AT&T.

The main difference between Unix and MINIX was that MINIX was built on a microkernel architecture. Unix had a larger monolithic kernel, but MINIX separated some modules, for example, device drivers were moved from kernel space to user space.

It's unclear if MINIX was truly open source, but people outside universities wanted access and wanted to contribute and modify it.

Around the same time MINIX was being developed, another person named Richard Stallman started the free software movement based on four freedoms: the freedom to run, to study, to modify, and to share. This led to the GPL (GNU General Public License), a license ensuring that if you build on GPL-licensed code, whatever you distribute must stay free under the same terms. Stallman's GNU Project produced many important tools: the GCC compiler, the Bash shell, and more.

But there was one problem: the kernel, the beating heart of the operating system that talks to the hardware, was missing.

Let's leave the USA and cross the Atlantic Ocean. In Finland, a student named Linus Torvalds was stuck at home while his classmates vacationed in Baltim, Egypt (kidding). He was frustrated with MINIX, had heard about the GPL and GNU, and decided to make something new. "I know what I should do with my life," he thought. As a hobby side project in 1991, he started working on a new kernel (not based on MINIX) and posted a famous message to the comp.os.minix newsgroup discussing it.

Linus initially called it Freax (likely a blend of "free," "freak," and Unix). After six months he released another version under the name Linux, adopted the GPL, improved the kernel, and integrated many GNU Project tools. He uploaded the source code to the internet (though Git came much later; he initially used FTP). This mini-project became the most widely used OS on Earth.

The penguin mascot (Tux) came from multiple stories: Linus was supposedly bitten by a penguin at a zoo, and he also watched March of the Penguins and was inspired by how they cooperate and share to protect their eggs and each other. Cute and fitting.

...And that's the history intro.

Linux Distributions

Okay... let's install Linux. Which Linux? Wait, really? There are multiple Linuxes?

Here's the deal: the open-source part is the kernel, but different developers take it and add their own packages, libraries, and maybe create a GUI. Others add their own tweaks and features. This leads to many different versions, which we call distributions (or distros for short).

Some examples: Red Hat, Slackware, Debian.

Even distros themselves can be modified with additional features, which creates a version of a version. For example, Debian led to Ubuntu; these are called derivatives.

How many distros and derivatives exist in the world? Many. How many exactly? I said many. Anyone with a computer can create one.

So what's the main difference between these distros, so I know which one is suitable for me? The main differences fall into two categories: philosophical and technical.

One of the biggest technical differences is package management: the system that lets you install software, including the type and format of the packages themselves.

Another difference is configuration files; their locations differ from one distro to another.

We agreed that everything is free, right? Well, you may find some paid versions like Red Hat Enterprise Linux, which charges for features like an additional layer of security, professional support, and guaranteed upgrades. Fedora is also backed by Red Hat and acts as a testing ground for new features before they hit Red Hat Enterprise.

The philosophical part is linked to the functional part. If you're using Linux for research, there are distros with specialized software for that. Maybe you're into ethical hacking; Kali Linux is for you. If you're afraid of switching from another OS, you might like Linux Mint, which even has themes that make it look like Windows.

Okay, which one should I install now? Heh... There are a ton of options and you can install any of them, but my preference is Ubuntu.

Ubuntu is the most popular for development and data engineering. But remember, in all cases, you'll be using the terminal a lot. So install Ubuntu, maybe in dual boot, and keep Windows if possible so you don't regret it later and blame me.


The Terminal

Yes, this is what matters for us. Every distro comes with a default terminal, but you can install others if you want. Anyway, open the terminal from the apps or just press Ctrl+Alt+T.


Zoom in using Ctrl+Shift++ or out using Ctrl+-

By default, the first thing you'll see is the prompt name@host:path$: your username, then @, then the machine name, then a colon, then the current path (~ means your home directory), then $. After the $ you can write your command.

You can change the colors and all the preferences, and save them per profile.

You can even change the prompt itself, as it is just a variable (more on variables later).

Basic Commands

First, everything is case sensitive, so be careful.

[1] echo

This command echoes whatever you write after it.

$ echo "Hello, terminal"

Output:

Hello, terminal

[2] pwd

This prints the current directory.

$ pwd

Output:

/home/mahmoudxyz

[3] cd

This is for changing the directory.

$ cd Desktop

The directory changed with no output; you can check this using pwd.

To go back to the home directory use:

$ cd ~

Or just:

$ cd

Note that this means we are back to /home/mahmoudxyz

To go up to the parent directory (in this case /home), even if you don't know its name, you can use:

$ cd ..

[4] ls

This command outputs the current files and directories (folders).

First let's go to desktop again:

$ cd /home/mahmoudxyz/Desktop

Yes, you can go straight to a specific directory if you know its path. Note that Linux uses / as the path separator, not \ like Windows.

Now let's see what files and directories are in my Desktop:

$ ls

Output:

file1  python  testdir

You may notice that my terminal supports colors: the blue entries are directories and the grey (maybe black) one is a file.

But you may encounter a terminal that doesn't support colors; in that case you can use:

$ ls -F

Output:

file1  python/  testdir/

Whatever ends with / (like python/) is a directory; otherwise it's a file (like file1).

You can see the hidden files using:

$ ls -a

Output:

.  ..  file1  python  testdir  .you-cant-see-me

We saw .you-cant-see-me, but that doesn't make us hackers: files are hidden more for organizational purposes than for actually hiding anything.

You can also list the files in the long format using:

$ ls -l

Output:

total 8
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz    0 Nov  2 10:48 file1
drwxrwxr-x 2 mahmoudxyz mahmoudxyz 4096 Oct 16 15:20 python
drwxrwxr-x 2 mahmoudxyz mahmoudxyz 4096 Nov  1 21:45 testdir

Let's take file1 and analyze the output:

Column      | Meaning
----------- | -------
-rw-rw-r--  | File type + permissions (more on this later)
1           | Number of hard links (more on this later)
mahmoudxyz  | Owner name
mahmoudxyz  | Group name
0           | File size (bytes)
Nov 2 10:48 | Last modification date & time
file1       | File or directory name

We can also combine these flags/options:

$ ls -l -a -F

Output:

total 16
drwxr-xr-x  4 mahmoudxyz mahmoudxyz 4096 Nov  2 10:53 ./
drwxr-x--- 47 mahmoudxyz mahmoudxyz 4096 Nov  1 21:55 ../
-rw-rw-r--  1 mahmoudxyz mahmoudxyz    0 Nov  2 10:48 file1
drwxrwxr-x  2 mahmoudxyz mahmoudxyz 4096 Oct 16 15:20 python/
drwxrwxr-x  2 mahmoudxyz mahmoudxyz 4096 Nov  1 21:45 testdir/
-rw-rw-r--  1 mahmoudxyz mahmoudxyz    0 Nov  2 10:53 .you-cant-see-me

Or shortly:

$ ls -laF

The same output. The order of options doesn't matter, so ls -lFa will work as well.

[5] clear

This clears your terminal. You can also use the shortcut Ctrl+L.

[6] mkdir

This makes a new directory.

$ mkdir new-dir

Then let's see the output:

$ ls -F

Output:

file1  new-dir/  python/  testdir/

[7] rmdir

This will remove the directory.

$ rmdir new-dir

Then let's see the output:

$ ls -F

Output:

file1  python/  testdir/

[8] touch

This command is for creating a new file.

$ mkdir new-dir
$ cd new-dir
$ touch file1
$ ls -l

Output:

total 0
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:26 file1

You can also make more than one file with:

$ touch file2 file3
$ ls -l

Output:

total 0
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:26 file1
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:28 file2
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:28 file3

In fact, touch was created for modifying a file's timestamp, so let's try again:

$ touch file1
$ ls -l

Output:

total 0
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:30 file1
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:28 file2
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:28 file3

What changed? The timestamp of file1. touch is the easiest way to create a new file: it updates the file's timestamp, and if the file doesn't exist, it creates an empty one.

[9] rm

This will remove the file.

$ rm file1
$ ls -l

Output:

total 0
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:28 file2
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:28 file3

[10] echo & cat (revisited)

Yes, echo again, but this time it will be used to create a new file with some text inside it.

$ echo "Hello, World" > file1

To output this file we can use:

$ cat file1

Output:

Hello, World

Notes:

  • If file1 doesn't exist, it will create a new one.
  • If it does exist, it will be overwritten.

To append text instead of overwrite use >>:

$ echo "Hello, Mah" >> file1

To output this file we can use:

$ cat file1

Output:

Hello, World
Hello, Mah

[11] rm -r

Let's go back:

$ cd ..

And then let's try to remove the directory:

$ rmdir new-dir

Output:

rmdir: failed to remove 'new-dir': Directory not empty

In case the directory is not empty, we can use rm (the same command we used for removing a file), this time with the -r flag, which recursively removes everything in the folder.

$ rm -r new-dir

[12] cp

This command is for copying a file.

cp source destination

(you can also rename it while copying it)

For example, let's copy the hosts file:

$ cp /etc/hosts .

The dot (.) means the current directory: copy this file from that source to here. You can see the content of the file using cat as before.

[13] man

man is the built-in manual for commands. It contains short descriptions of each command, its options, and their functions. It's still useful, even though nowadays you might reach for an online search or even AI instead.

Try:

$ man ls

And then try:

$ man cd

No manual entry for cd. That's because cd is built into the shell itself rather than being an external program, so it's documented in the shell's manual instead; in bash, try help cd.


Unix Philosophy

Second System Syndrome: if a system succeeds, the follow-up that tries to replace it will likely fail. It's largely psychological: developers constantly compare themselves to the successful system, wanting to be like it but better, and the fear of not matching that success often causes failure. Maybe you can succeed if you don't compare yourself to it.

Another thing: when developers started making software for Linux, everything was chaotic and random. This led to the creation of principles to govern development, a philosophy to follow. These principles ensure that when you develop something, you follow the same Unix mentality:

  1. Small is Beautiful - Keep programs compact and focused; bloat is the enemy.
  2. Each Program Does One Thing Well - Master one task instead of being mediocre at many.
  3. Prototype as Soon as Possible - Build it, test it, break it, learn from it; fast iteration wins.
  4. Choose Portability Over Efficiency - Code that runs everywhere beats code that's blazing fast on one system.
  5. Store Data in Flat Text Files - Text is universal, readable, and easy to parse; proprietary formats lock you in.
  6. Use Software Leverage - Don't reinvent the wheel; use existing tools and combine them creatively.
  7. Use Shell Scripts to Increase Leverage and Portability - Automate tasks and glue programs together with simple scripts.
  8. Avoid Captive User Interfaces - Don't trap users in rigid menus; let them pipe, redirect, and automate.
  9. Make Every Program a Filter - Take input, transform it, produce output; programs should be composable building blocks.

These concepts all lead to one fundamental Unix principle: everything is a file. Devices, processes, sockets, treat them all as files for consistency and simplicity.
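A quick way to feel this principle (a minimal sketch; both paths are standard on Linux): devices and even kernel state show up as files, readable with the same tools you use on ordinary files.

$ ls -l /dev/null     # a device node, listed like any other file
$ cat /proc/uptime    # kernel state exposed as a plain text file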

Not everyone follows this philosophy now, but is that important? For you, as a data engineer or analyst who will deal with data across different distros and different (often remote) machines: yes, it is very important.

Text Files

It's a bit strange that we are talking about editing text files in 2025. Really, does it matter?

Yes, it matters and it's a big topic in Linux because of what we discussed in the previous section.

There are a lot of editors on Linux like vi, nano and emacs. There is a famous debate between emacs and vim.

You can find vi in almost every distro. Its shortcuts are many and hard to memorize if you don't use it often, but you can lean on cheatsheets.

Simply put, vi is just two things: insert mode and command mode. When you open a file, you start in command mode. To start writing something, enter insert mode by pressing i.

You might wonder why vi uses keyboard letters for navigation instead of arrow keys. Simple answer: arrow keys didn't exist on keyboards when vi was created in 1976. You're the lucky generation with arrow keys, the original vi users had to make do with what they had.

nano, on the other hand, is simpler and easier to use for editing files.

Use any editor, probably vi or nano, and start practicing with one.

Terminal vs Shell

Terminal โ‰  Shell. Let's clear this up.

The shell is the thing that actually interprets your commands. It's the engine doing the work. File manipulation, running programs, printing text. That's all the shell.

The terminal is just the program that opens a window so you can talk to the shell. It's the middleman, the GUI wrapper, the pretty face.

Historical note:

This distinction mattered more when terminals were physical devices, actual hardware connected to mainframes. Today, we use terminal emulators (software), so the difference is mostly semantic. For practical purposes, just know: the shell runs your commands, the terminal displays them.
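A quick check you can run yourself (standard commands; output varies by system) to see the shell as a process in its own right:

$ echo $SHELL    # your login shell, e.g. /bin/bash
$ ps -p $$       # the shell process you're talking to right now ($$ is its PID)

The terminal emulator is just the window around this process.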

Pipes, Filters and Redirection

Standard Streams

Unix processes use I/O streams to read and write data.

Input stream sources include keyboards, terminals, devices, files, output from other processes, etc.

Unix processes have three standard streams:

  • STDIN (0) - Standard Input (data coming in from keyboard, file, etc.)
  • STDOUT (1) - Standard Output (normal output going to terminal, file, etc.)
  • STDERR (2) - Standard Error (error messages going to terminal, file, etc.)

Example: Try running cat with no arguments, it waits for input from STDIN and echoes it to STDOUT.

  • Ctrl+D - Stops the input stream and sends an EOF (End of File) signal to the process.
  • Ctrl+C - Sends an INT (Interrupt) signal to the process (i.e., kills the process).

Redirection

Redirection allows you to change the defaults for stdin, stdout, or stderr, sending them to different devices or files using their file descriptors.

File Descriptors

A file descriptor is a reference (or handle) used by the kernel to access a file. Every process gets its own file descriptor table.
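On Linux you can actually peek at your shell's file descriptor table through /proc; a small illustration:

$ ls -l /proc/$$/fd   # $$ expands to the current shell's PID
# descriptors 0, 1, 2 appear as symlinks to your terminal device (e.g. /dev/pts/1)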

Redirect stdin with <

Use the < operator to redirect standard input from a file:

$ wc < textfile

Using Heredocs with <<

Accepts input until a specified delimiter word is reached:

$ cat << EOF
# Type multiple lines here
# Press Enter, then type EOF to end
EOF

Using Herestrings with <<<

Pass a string directly as input:

$ cat <<< "Hello, Linux"

Redirect stdout using > and >>

Overwrite a file with > (or explicitly with 1>):

$ who > file      # Redirect stdout to file (overwrite)
$ cat file        # View the file

Append to a file with >>:

$ whoami >> file  # Append stdout to file
$ cat file        # View the file

Redirect stderr using 2> and 2>>

Redirect error messages to a file:

$ ls /xyz 2> err  # /xyz doesn't exist, error goes to err file
$ cat err         # View the error

Combining stdout and stderr

Redirect both stdout and stderr to the same file:

# Method 1: Redirect stderr to err, then stdout to the same place
$ ls /etc /xyz 2> err 1>&2

# Method 2: Redirect stdout to err, then stderr to the same place
$ ls /etc /xyz 1> err 2>&1

# Method 3: Shorthand for redirecting both
$ ls /etc /xyz &> err

$ cat err  # View both output and errors

Ignoring Error Messages with /dev/null

The black hole of Unix, anything sent here disappears:

$ ls /xyz 2> /dev/null  # Suppress error messages
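This section's title also promises pipes, so here's the short version: a pipe (|) connects one command's STDOUT to the next command's STDIN, which is what makes the "every program is a filter" idea practical. A minimal sketch:

$ ls /etc | wc -l                 # count the entries in /etc
$ grep bash /etc/passwd | wc -l   # filter first, then count the matches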

User and Group Management

It is not complicated. A user here is like on any other OS: an account with some permissions that can do some operations.

There are three types of users in Linux:

Super user

The administrator that can do anything in the world. It is called root.

  • ID 0 (always)

System user

This represents software and not a real person. Some software may need some access and permissions to do some tasks and operations or maybe install something.

  • ID from 1 to 999

Normal user

This is us.

  • ID >= 1000

Each user has its own ID, shell, environment variables, and home directory.
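You can inspect your own user account with a few standard commands; a small sketch:

$ id                       # your UID, GID, and group memberships
$ grep $USER /etc/passwd   # your record: name:x:UID:GID:comment:home:shell
$ echo $HOME $SHELL        # your home directory and login shell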

File Ownership and Permissions

(Content to be added)


More on Navigating the Filesystem

Absolute vs Relative Paths

The root directory (/) is like C:\ in Windows: the top of the filesystem hierarchy.

Absolute path: Starts from root, always begins with /

/home/mahmoudxyz/Documents/notes.txt
/etc/passwd
/usr/bin/python3

Relative path: Starts from your current location

Documents/notes.txt          # Relative to current directory
../Desktop/file.txt          # Go up one level, then into Desktop
../../etc/hosts              # Go up two levels, then into etc

Special directory references:

  • . = current directory
  • .. = parent directory
  • ~ = your home directory
  • - = previous directory (used with cd -)

Useful Navigation Commands

ls -lh - List in long format with human-readable sizes

$ ls -lh
-rw-r--r-- 1 mahmoud mahmoud 1.5M Nov 10 14:23 data.csv
-rw-r--r-- 1 mahmoud mahmoud  12K Nov 10 14:25 notes.txt

ls -lhd - Show directory itself, not contents

$ ls -lhd /home/mahmoud
drwxr-xr-x 47 mahmoud mahmoud 4.0K Nov 10 12:00 /home/mahmoud

ls -lR - Recursive listing (all subdirectories)

$ ls -lR
./Documents:
-rw-r--r-- 1 mahmoud mahmoud 1234 Nov 10 14:23 file1.txt

./Documents/Projects:
-rw-r--r-- 1 mahmoud mahmoud 5678 Nov 10 14:25 file2.txt

tree - Visual directory tree (may need to install)

$ tree
.
├── Documents
│   ├── file1.txt
│   └── Projects
│       └── file2.txt
├── Downloads
└── Desktop

stat - Detailed file information

$ stat notes.txt
  File: notes.txt
  Size: 1234       Blocks: 8          IO Block: 4096   regular file
Device: 803h/2051d  Inode: 12345678   Links: 1
Access: 2024-11-10 14:23:45.123456789 +0100
Modify: 2024-11-10 14:23:45.123456789 +0100
Change: 2024-11-10 14:23:45.123456789 +0100

Shows: size, inode number, links, permissions, timestamps

Shell Globbing (Wildcards)

Wildcards let you match multiple files with patterns.

* - Matches any number of any characters (including none)

$ echo *                    # All files in current directory
$ echo *.txt                # All files ending with .txt
$ echo file*                # All files starting with "file"
$ echo *data*               # All files containing "data"

? - Matches exactly one character

$ echo b?at                 # Matches: boat, beat, b1at, b@at
$ echo file?.txt            # Matches: file1.txt, fileA.txt
$ echo ???                  # Matches any 3-character filename

[...] - Matches any character inside brackets

$ echo file[123].txt        # Matches: file1.txt, file2.txt, file3.txt
$ echo [a-z]*               # Files starting with lowercase letter
$ echo [A-Z]*               # Files starting with uppercase letter
$ echo *[0-9]               # Files ending with a digit

[!...] - Matches any character NOT in brackets

$ echo [!a-z]*              # Files NOT starting with lowercase letter
$ echo *[!0-9].txt          # .txt files NOT ending with a digit before extension

Practical examples:

$ ls *.jpg *.png            # All image files (jpg or png)
$ rm temp*                  # Delete all files starting with "temp"
$ cp *.txt backup/          # Copy all text files to backup folder
$ mv file[1-5].txt archive/ # Move file1.txt through file5.txt

File Structure: The Three Components

Every file in Linux consists of three parts:

1. Filename

The human-readable name you see and use.

2. Data Block

The actual content stored on disk, the file's data.

3. Inode (Index Node)

Metadata about the file stored in a data structure. Contains:

  • File size
  • Owner (UID) and group (GID)
  • Permissions
  • Timestamps (access, modify, change)
  • Number of hard links
  • Pointers to data blocks on disk
  • NOT the filename (filenames are stored in directory entries)

View inode number:

$ ls -i
12345678 file1.txt
12345679 file2.txt

View detailed inode information:

$ stat file1.txt

A link is a way to reference the same file from multiple locations. Think of it like shortcuts in Windows, but with two different types.


Hard Links

Concept: Another filename pointing to the same inode and data.

It's like having two labels on the same box. Both names are equally valid, neither is "original" or "copy."

Create a hard link:

$ ln original.txt hardlink.txt

What happens:

  • Both filenames point to the same inode
  • Both have equal status (no "original")
  • Changing content via either name affects both (same data)
  • File size, permissions, content are identical (because they ARE the same file)

Check with ls -i:

$ ls -i
12345678 original.txt
12345678 hardlink.txt    # Same inode number!

What if you delete the original?

$ rm original.txt
$ cat hardlink.txt        # Still works! Data is intact

Why? The data isn't deleted until all hard links are removed. The inode keeps a link count, only when it reaches 0 does the system delete the data.

Limitations of hard links:

  • Cannot cross filesystems (different partitions/drives)
  • Cannot link to directories (to prevent circular references)
  • Both files must be on the same partition

Soft Links (Symbolic Links)

Concept: A special file that points to another filename, like a shortcut in Windows.

The soft link has its own inode, separate from the target file.

Create a soft link:

$ ln -s original.txt softlink.txt

What happens:

  • softlink.txt has a different inode
  • It contains the path to original.txt
  • Reading softlink.txt automatically redirects to original.txt

Check with ls -li:

$ ls -li
12345678 -rw-r--r-- 1 mahmoud mahmoud 100 Nov 10 14:00 original.txt
12345680 lrwxrwxrwx 1 mahmoud mahmoud  12 Nov 10 14:01 softlink.txt -> original.txt

Notice:

  • Different inode numbers
  • l at the start (link file type)
  • -> shows what it points to

What if you delete the original?

$ rm original.txt
$ cat softlink.txt        # Error: No such file or directory

The softlink still exists, but it's now a broken link (points to nothing).

Advantages of soft links:

  • Can cross filesystems (different partitions/drives)
  • Can link to directories
  • Can link to files that don't exist yet (forward reference)

Feature             | Hard Link                   | Soft Link
------------------- | --------------------------- | ---------
Inode               | Same as original            | Different (own inode)
Content             | Points to data              | Points to filename
Delete original     | Link still works            | Link breaks
Cross filesystems   | No                          | Yes
Link to directories | No                          | Yes
Shows target        | No (looks like normal file) | Yes (-> in ls -l)
Link count          | Increases                   | Doesn't affect original

When to use each:

Hard links:

  • Backup/versioning within same filesystem
  • Ensure file persists even if "original" name is deleted
  • Save space (no duplicate data)

Soft links:

  • Link across different partitions
  • Link to directories
  • Create shortcuts for convenience
  • When you want the link to break if target is moved/deleted (intentional dependency)

Practical Examples

Hard link example:

$ echo "Important data" > data.txt
$ ln data.txt backup.txt              # Create hard link
$ rm data.txt                         # "Original" deleted
$ cat backup.txt                      # Still accessible!
Important data

Soft link example:

$ ln -s /usr/bin/python3 ~/python     # Shortcut to Python
$ ~/python --version                  # Works!
Python 3.10.0
$ rm /usr/bin/python3                 # If Python is removed
$ ~/python --version                  # Link breaks
bash: ~/python: No such file or directory

Link to directory (only soft link):

$ ln -s /var/log/nginx ~/nginx-logs   # Easy access to logs
$ cd ~/nginx-logs                     # Navigate via link
$ pwd                                 # Shows real path
/var/log/nginx

Understanding the Filesystem Hierarchy Standard

Mounting

There's no link between the hierarchy of directories and their physical location on disk: any directory in the tree can serve as a mount point where a separate device or partition is attached.
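You can see this decoupling with standard tools (device names below will differ on your machine):

$ df -h       # which devices are mounted where, with sizes
$ findmnt /   # details for the root mount point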

For more details, see: Linux Foundation FHS 3.0

File Management

[1] grep

This command prints lines matching a pattern.

Let's create a file to try some examples on:

$ echo -e "root\nhello\nroot\nRoot" > file

Now let's use grep to search for the word root in this file:

$ grep root file

Output:

root
root

You can invert the match to show lines NOT containing root:

$ grep -v root file

Output:

hello
Root

You can also search ignoring case:

$ grep -i root file

Output:

root
root
Root

You can also use regular expressions (REGEX). Here r. matches r followed by any single character:

$ grep -i r. file

Output:

root
root
Root

[2] less

less pages through a file (an alternative to more). Useful keys:

  • /word - search forward for word
  • ?word - search backward for word
  • n - go to the next occurrence
  • N - go to the previous occurrence
  • q - quit

[3] diff

It compares files line by line.
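A minimal sketch (the files old and new are invented for illustration):

$ echo -e "a\nb\nc" > old
$ echo -e "a\nx\nc" > new
$ diff old new
2c2
< b
---
> x

2c2 means line 2 changed; < shows the old line, > the new one.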

[4] file

It determines the file type:

$ file file
file: ASCII text

[5] find and locate

Both search for files in a directory hierarchy: find walks the directory tree live, while locate queries a prebuilt index (refreshed by updatedb), so it's faster but can be out of date.
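A few illustrative examples (the paths are assumptions; adjust to your system):

$ find . -name "*.txt"            # search from the current dir by name
$ find /etc -type d -name "ssh"   # directories named ssh under /etc
$ locate hosts                    # fast, but reads the prebuilt index
$ sudo updatedb                   # refresh that index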

[6] head and tail

head outputs the first part of files:

$ head /usr/share/dict/words         # first 10 lines
$ head -n 20 /usr/share/dict/words   # first 20 lines

tail outputs the last part of files:

$ tail /usr/share/dict/words         # last 10 lines
$ tail -n 20 /usr/share/dict/words   # last 20 lines

[7] mv

mv moves (renames) files:

$ mv file1 file2   # rename file1 to file2

[8] cp

cp copies files and directories:

$ cp file1 file2   # copy file1 to file2

[9] tar

tar is the archive utility: it bundles many files into a single archive (a "tarball"), optionally compressed.
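A minimal sketch (mydir and the archive name are invented for illustration):

$ tar -czf backup.tar.gz mydir/   # create (c) a gzip-compressed (z) archive file (f)
$ tar -tzf backup.tar.gz          # list the contents without extracting
$ tar -xzf backup.tar.gz          # extract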

[10] gzip

gzip compresses a single file; gunzip decompresses it.
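A minimal sketch, assuming a file named big.log:

$ gzip big.log        # replaces big.log with big.log.gz
$ gunzip big.log.gz   # restores the original
$ gzip -k big.log     # -k keeps the original next to the .gz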

[11] mount and umount

Mounting means attaching a filesystem (a disk partition, USB drive, ISO image, etc.) to a directory in the tree so its contents become accessible there; umount detaches it.
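A hedged sketch (the device name /dev/sdb1 is an assumption; yours will differ):

$ mount                       # list everything currently mounted
$ sudo mount /dev/sdb1 /mnt   # attach a partition's filesystem at /mnt
$ ls /mnt                     # its files now appear here
$ sudo umount /mnt            # detach it (note the spelling: umount)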

Managing Linux Processes

What is a Process?

When Linux executes a program, it:

  1. Reads the file from disk
  2. Loads it into memory
  3. Reads the instructions inside it
  4. Executes them one by one

A process is the running instance of that program. It might be visible in your GUI or running invisibly in the background.

Types of Processes

Processes can be executed from different sources:

By origin:

  • Compiled programs (C, C++, Rust, etc.)
  • Shell scripts containing commands
  • Interpreted languages (Python, Perl, etc.)

By trigger:

  • Manually executed by a user
  • Scheduled (via cron or systemd timers)
  • Triggered by events or other processes

By category:

  • System processes - Managed by the kernel
  • User processes - Started by users (manually, scheduled, or remotely)

The Process Hierarchy

Every Linux system starts with a parent process that spawns all other processes. This is either:

  • init or sysvinit (older systems)
  • systemd (modern systems)

The first process gets PID 1 (Process ID 1), even though it's technically branched from the kernel itself (PID 0, which you never see directly).

From PID 1, all other processes branch out in a tree structure. Every process has:

  • PID (Process ID) - Its own unique identifier
  • PPID (Parent Process ID) - The ID of the process that started it

Viewing Processes

[1] ps - Process Snapshot

Basic usage - current terminal only:

$ ps

Output:

    PID TTY          TIME CMD
  14829 pts/1    00:00:00 bash
  14838 pts/1    00:00:00 ps

This shows only processes running in your current terminal session for your user.

Processes from all users (those attached to a terminal):

$ ps -a

Output:

    PID TTY          TIME CMD
   2955 tty2     00:00:00 gnome-session-b
  14971 pts/1    00:00:00 ps

All processes in the system:

$ ps -e

Output:

    PID TTY          TIME CMD
      1 ?        00:00:00 systemd
      2 ?        00:00:00 kthreadd
      3 ?        00:00:00 rcu_gp
    ... (hundreds more)

Note: The ? in the TTY column means the process was started by the kernel and has no controlling terminal.

Detailed process information:

$ ps -l

Output:

F S   UID     PID    PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 S  1000   14829   14821  0  80   0 -  2865 do_wai pts/1    00:00:00 bash
4 R  1000   15702   14829  0  80   0 -  3445 -      pts/1    00:00:00 ps

Here you can see the PPID (parent process ID). Notice that ps has bash as its parent (the PPID of ps matches the PID of bash).

Most commonly used:

$ ps -efl

This shows all processes with full details - PID, PPID, user, CPU time, memory, and command.

Understanding Daemons

Any system process running in the background typically ends with d (named after "daemon"). Examples:

  • systemd - System and service manager
  • sshd - SSH server
  • httpd or nginx - Web servers
  • crond - Job scheduler

Daemons are like Windows services - processes that run in the background, whether they're system or user processes.
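To spot the convention on your own machine, a small illustration (pgrep is standard on Linux):

$ pgrep -l 'd$'   # list process names ending in d: systemd, sshd, containerd, ...

Not every process ending in d is a daemon, but the pattern is hard to miss.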


[2] pstree - Process Tree Visualization

See the hierarchy of all running processes:

$ pstree

Output:

systemd─┬─ModemManager───3*[{ModemManager}]
        ├─NetworkManager───3*[{NetworkManager}]
        ├─accounts-daemon───3*[{accounts-daemon}]
        ├─avahi-daemon───avahi-daemon
        ├─bluetoothd
        ├─colord───3*[{colord}]
        ├─containerd───15*[{containerd}]
        ├─cron
        ├─cups-browsed───3*[{cups-browsed}]
        ├─cupsd───5*[dbus]
        ├─dbus-daemon
        ├─dockerd───19*[{dockerd}]
        ├─fwupd───5*[{fwupd}]
        ... (continues)

What you're seeing:

  • systemd is the parent process (PID 1)
  • Everything else branches from it
  • Multiple processes run in parallel
  • Some processes spawn their own children or threads (the N*[{name}] entries are threads; dockerd has 19)

This visualization makes it easy to understand process relationships.


[3] top - Live Process Monitor

Unlike ps (which shows a snapshot), top shows real-time process information:

$ top

You'll see:

  • Processes sorted by CPU usage (by default)
  • Live updates of CPU and memory consumption
  • System load averages
  • Running vs sleeping processes

Press q to quit.

Useful top commands while running:

  • k - Kill a process (prompts for PID)
  • M - Sort by memory usage
  • P - Sort by CPU usage
  • 1 - Show individual CPU cores
  • h - Help

[4] htop - Better Process Monitor

htop is like top but modern, colorful, and more interactive.

Installation (if not already installed):

$ which htop   # Check if installed
$ sudo apt install htop   # Install if needed

Run it:

$ htop

Features:

  • Color-coded display
  • Mouse support (click to select processes)
  • Easy process filtering and searching
  • Visual CPU and memory bars
  • Tree view of process hierarchy
  • Built-in kill/nice/priority management

Navigation:

  • Arrow keys to move
  • F3 - Search for a process
  • F4 - Filter by name
  • F5 - Tree view
  • F9 - Kill a process
  • F10 or q - Quit

Foreground vs Background Processes

Sometimes you only have one terminal and want to run multiple long-running tasks. Background processes let you do this.

Foreground Processes (Default)

When you run a command normally, it runs in the foreground and blocks your terminal:

$ sleep 10

Your terminal is blocked for 10 seconds. You can't type anything until it finishes.

Background Processes

Add & at the end to run in the background:

$ sleep 10 &

Output:

[1] 12345

The terminal is immediately available. The numbers show [job_number] PID.

Managing Jobs

View running jobs:

$ jobs

Output:

[1]+  Running                 sleep 10 &

Bring a background job to foreground:

$ fg

If you have multiple jobs:

$ fg %1   # Bring job 1 to foreground
$ fg %2   # Bring job 2 to foreground

Send current foreground process to background:

  1. Press Ctrl+Z (suspends the process)
  2. Type bg (resumes it in background)

Example:

$ sleep 25
^Z
[1]+  Stopped                 sleep 25

$ bg
[1]+ sleep 25 &

$ jobs
[1]+  Running                 sleep 25 &

Stopping Processes

Process Signals

The kill command doesn't just "kill" - it sends signals to processes. The process decides how to respond.

Common signals:

Signal  | Number | Meaning                        | Process Can Ignore?
------- | ------ | ------------------------------ | -------------------
SIGHUP  | 1      | Hang up (terminal closed)      | Yes
SIGINT  | 2      | Interrupt (Ctrl+C)             | Yes
SIGTERM | 15     | Terminate gracefully (default) | Yes
SIGKILL | 9      | Kill immediately               | NO
SIGSTOP | 19     | Stop/pause process             | NO
SIGCONT | 18     | Continue stopped process       | NO

Using kill

Syntax:

$ kill -SIGNAL PID

Example - find a process:

$ ps
    PID TTY          TIME CMD
  14829 pts/1    00:00:00 bash
  17584 pts/1    00:00:00 sleep
  18865 pts/1    00:00:00 ps

Try graceful termination first (SIGTERM):

$ kill -SIGTERM 17584

Or use the number:

$ kill -15 17584

Or just use default (SIGTERM is default):

$ kill 17584

If the process ignores SIGTERM, force kill (SIGKILL):

$ kill -SIGKILL 17584

Or:

$ kill -9 17584

Verify it's gone:

$ ps
    PID TTY          TIME CMD
  14829 pts/1    00:00:00 bash
  19085 pts/1    00:00:00 ps
[2]+  Killed                  sleep 10

Why SIGTERM vs SIGKILL?

SIGTERM (15) - Graceful shutdown:

  • Process can clean up (save files, close connections)
  • Child processes are also terminated properly
  • Always try this first

SIGKILL (9) - Immediate death:

  • Process cannot ignore or handle this signal
  • No cleanup happens
  • Can create zombie processes if parent doesn't reap children
  • Can cause memory leaks or corrupted files
  • Use only as last resort

Zombie Processes

A zombie is a dead process that hasn't been cleaned up by its parent.

What happens:

  1. Process finishes execution
  2. Kernel marks it as terminated
  3. Parent should read the exit status (called "reaping")
  4. If parent doesn't reap it, it becomes a zombie

Identifying zombies:

$ ps aux | grep Z

Look for processes with state Z (zombie).
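If you want to see a zombie safely, here's a small shell trick (a sketch, not something for production): the subshell backgrounds a child, then execs into a sleep that never reaps it.

$ (sleep 1 & exec sleep 30) &
$ sleep 2; ps -o pid,ppid,stat,cmd --ppid $!
# after ~1s the child shows STAT Z and CMD [sleep] <defunct>,
# until the parent sleep exits and it gets reparented and reaped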

Fixing zombies:

  • Kill the parent process (zombies are already dead)
  • The parent's death makes the kernel reparent the zombies to init/systemd, which reaps them
  • Or wait - some zombies disappear when the parent finally checks on them

killall - Kill by Name

Instead of finding PIDs, kill all processes with a specific name:

$ killall sleep

This kills ALL processes named sleep, regardless of their PID.

With signals:

$ killall -SIGTERM firefox
$ killall -9 chrome   # Force kill all Chrome processes

Warning: Be careful with killall - it affects all matching processes, even ones you might not want to kill.


Managing Services with systemctl

Modern Linux systems use systemd to manage services (daemons). The systemctl command controls them.

Service Status

Check if a service is running:

$ systemctl status ssh

Output shows:

  • Active/inactive status
  • PID of the main process
  • Recent log entries
  • Memory and CPU usage

Starting and Stopping Services

Start a service:

$ sudo systemctl start nginx

Stop a service:

$ sudo systemctl stop nginx

Restart a service (stop then start):

$ sudo systemctl restart nginx

Reload configuration without restarting:

$ sudo systemctl reload nginx

Enable/Disable Services at Boot

Enable a service to start automatically at boot:

$ sudo systemctl enable ssh

Disable a service from starting at boot:

$ sudo systemctl disable ssh

Enable AND start immediately:

$ sudo systemctl enable --now nginx

Listing Services

List all running services:

$ systemctl list-units --type=service --state=running

List all services (running or not):

$ systemctl list-units --type=service --all

List enabled services:

$ systemctl list-unit-files --type=service --state=enabled

Viewing Logs

See logs for a specific service:

$ journalctl -u nginx

Follow logs in real-time:

$ journalctl -u nginx -f

See only recent logs:

$ journalctl -u nginx --since "10 minutes ago"

Practical Examples

Example 1: Finding and Killing a Hung Process

# Find the process
$ ps aux | grep firefox

# Kill it gracefully
$ kill 12345

# Wait a few seconds, check if still there
$ ps aux | grep firefox

# Force kill if necessary
$ kill -9 12345

Example 2: Running a Long Script in Background

# Start a long-running analysis
$ python analyze_genome.py &

# Check it's running
$ jobs

# Do other work...

# Bring it back to see output
$ fg

Example 3: Checking System Load

# See what's consuming resources
$ htop

# Or check load average
$ uptime

# Or see top CPU processes
$ ps aux --sort=-%cpu | head

Example 4: Restarting a Web Server

# Check status
$ systemctl status nginx

# Restart it
$ sudo systemctl restart nginx

# Check logs if something went wrong
$ journalctl -u nginx -n 50

Summary: Process Management Commands

Command           | Purpose
----------------- | -------
ps                | Snapshot of processes
ps -efl           | All processes with details
pstree            | Process hierarchy tree
top               | Real-time process monitor
htop              | Better real-time monitor
jobs              | List background jobs
fg                | Bring job to foreground
bg                | Continue job in background
command &         | Run command in background
Ctrl+Z            | Suspend current process
kill PID          | Send SIGTERM to process
kill -9 PID       | Force kill process
killall name      | Kill all processes by name
systemctl status  | Check service status
systemctl start   | Start a service
systemctl stop    | Stop a service
systemctl restart | Restart a service
systemctl enable  | Enable at boot

Shell Scripts (Bash Scripting)

A shell script is simply a collection of commands written in a text file. That's it. Nothing magical.

The original name was "shell script," but when GNU created bash (Bourne Again SHell), the term "bash script" became common.

Why Shell Scripts Matter

1. Automation
If you're typing the same commands repeatedly, write them once in a script.

2. Portability
Scripts work across different Linux machines and distributions (mostly).

3. Scheduling
Combine scripts with cron jobs to run tasks automatically (see the sketch after this list).

4. DRY Principle
Don't Repeat Yourself - write once, run many times.
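Since scheduling came up above, here's what a cron entry looks like (the script path is hypothetical; edit your table with crontab -e):

$ crontab -e
# minute hour day-of-month month day-of-week command
30 2 * * * /home/mahmoud/backup.sh   # run backup.sh every day at 02:30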

Important: Nothing new here. Everything you've already learned about Linux commands applies. Shell scripts just let you organize and automate them.


Creating Your First Script

Create a file called first-script.sh:

$ nano first-script.sh

Write some commands:

echo "Hello, World"

Note: The .sh extension doesn't technically matter in Linux (unlike Windows), but it's convention. Use it so humans know it's a shell script.


Making Scripts Executable

Check the current permissions:

$ ls -l first-script.sh

Output:

-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 21 Nov  6 07:21 first-script.sh

Notice: No x (execute) permission. The file isn't executable yet.

Adding Execute Permission

$ chmod +x first-script.sh

Permission options:

  • u+x - Execute for user (owner) only
  • g+x - Execute for group only
  • o+x - Execute for others only
  • a+x or just +x - Execute for all (user, group, others)

Check permissions again:

$ ls -l first-script.sh

Output:

-rwxrwxr-x 1 mahmoudxyz mahmoudxyz 21 Nov  6 07:21 first-script.sh

Now we have x for user, group, and others.


Running Shell Scripts

There are two main ways to execute a script:

Method 1: Specify the Shell

$ sh first-script.sh

Or:

$ bash first-script.sh

This explicitly tells which shell to use.

Method 2: Direct Execution

$ ./first-script.sh

Why the ./ ?

Let's try without it:

$ first-script.sh

You'll get an error:

first-script.sh: command not found

Why? When you type a command without a path, the shell searches through directories listed in $PATH looking for that command. Your current directory (.) is usually NOT in $PATH for security reasons.

The ./ explicitly says: "Run the script in the current directory (.), don't search $PATH."

You could do this:

$ PATH=.:$PATH

Now first-script.sh would work without ./, but DON'T DO THIS. It's a security risk - you might accidentally execute malicious scripts in your current directory.

Best practices:

  1. Use ./script.sh for local scripts
  2. Put system-wide scripts in /usr/local/bin (which IS in $PATH)

The Shebang Line

Problem: How does the system know which interpreter to use for your script? Bash? Zsh? Python?

Solution: The shebang (#!) on the first line.

Basic Shebang

#!/bin/bash
echo "Hello, World"

What this means:
"Execute this script using /bin/bash"

When you run ./first-script.sh, the system:

  1. Reads the first line
  2. Sees #!/bin/bash
  3. Runs /bin/bash first-script.sh

Shebang with Other Languages

You can use shebang for any interpreted language:

#!/usr/bin/python3
print("Hello, World")

Now this file runs as a Python script!

The Portable Shebang

Problem: What if bash isn't at /bin/bash? What if python3 is at /usr/local/bin/python3 instead of /usr/bin/python3?

Solution: Use env to find the interpreter:

#!/usr/bin/env bash
echo "Hello, World"

Or for Python:

#!/usr/bin/env python3
print("Hello, World")

How it works:
env searches through $PATH to find the command. The shebang becomes: "Please find (env) where bash is located and execute this script with it."

Why env is better:

  • More portable across systems
  • Finds interpreters wherever they're installed
  • env itself is almost always at /usr/bin/env

Basic Shell Syntax

Command Separators

Semicolon (;) - Run commands sequentially:

$ echo "Hello" ; ls

This runs echo, then runs ls (regardless of whether echo succeeded).

AND (&&) - Run second command only if first succeeds:

$ echo "Hello" && ls

If echo succeeds (exit code 0), then run ls. If it fails, stop.

OR (||) - Run second command only if first fails:

$ false || ls

If false fails (exit code non-zero), then run ls. If it succeeds, stop.

Practical example:

$ cd /some/directory && echo "Changed directory successfully"

Only prints the message if cd succeeded.

$ cd /some/directory || echo "Failed to change directory"

Only prints the message if cd failed.


Variables

Variables store data that you can use throughout your script.

Declaring Variables

#!/bin/bash

# Integer variable
declare -i sum=16

# String variable
declare name="Mahmoud"

# Constant (read-only)
declare -r PI=3.14

# Array
declare -a names=()
names[0]="Alice"
names[1]="Bob"
names[2]="Charlie"

Key points:

  • declare -i = integer type
  • declare -r = read-only (constant)
  • declare -a = array
  • You can also just use sum=16 without declare (it works, but less explicit)

Using Variables

Access variables with $:

echo $sum          # Prints: 16
echo $name         # Prints: Mahmoud
echo $PI           # Prints: 3.14

For arrays and complex expressions, use ${}:

echo ${names[0]}   # Prints: Alice
echo ${names[1]}   # Prints: Bob
echo ${names[2]}   # Prints: Charlie

Why ${} matters:

echo "$nameTest"   # Looks for variable called "nameTest" (doesn't exist)
echo "${name}Test" # Prints: MahmoudTest (correct!)

Important Script Options

set -e

What it does: Exit script immediately if any command fails (non-zero exit code).

Why it matters: Prevents cascading errors. If step 1 fails, don't continue to step 2.

Example without set -e:

cd /nonexistent/directory
rm -rf *  # DANGER! This still runs even though cd failed

Example with set -e:

set -e
cd /nonexistent/directory  # Script stops here if this fails
rm -rf *                   # Never executes

Exit Codes

Every command returns an exit code:

  • 0 = Success
  • Non-zero = Failure (different numbers mean different errors)

Check the last command's exit code:

$ true
$ echo $?   # Prints: 0

$ false
$ echo $?   # Prints: 1

In scripts, explicitly exit with a code:

#!/bin/bash
echo "Script completed successfully"
exit 0  # Return 0 (success) to the calling process

Arithmetic Operations

There are multiple ways to do math in bash. Pick one and stick with it for consistency.

Method 1: $(( )) Arithmetic Expansion

#!/bin/bash

num=4
echo $((num * 5))      # Prints: 20
echo $((num + 10))     # Prints: 14
echo $((num ** 2))     # Prints: 16 (exponentiation)

Operators:

  • + addition
  • - subtraction
  • * multiplication
  • / integer division
  • % modulo (remainder)
  • ** exponentiation

Pros: Built into bash, fast, clean syntax
Cons: Integer-only (no decimals)

Method 2: expr

#!/bin/bash

num=4
expr $num + 6      # Prints: 10
expr $num \* 5     # Prints: 20 (note the backslash before *)

Pros: Traditional, works in older shells
Cons: Awkward syntax, needs escaping for *

Method 3: bc (For Floating Point)

#!/bin/bash

echo "4.5 + 2.3" | bc       # Prints: 6.8
echo "10 / 3" | bc -l       # Prints: 3.33333... (-l for decimals)
echo "scale=2; 10/3" | bc   # Prints: 3.33 (2 decimal places)

Pros: Supports floating-point arithmetic
Cons: External program (slower), more complex

My recommendation: Use $(( )) for most cases. Use bc when you need decimals.


Logical Operations and Conditionals

Exit Code Testing

#!/bin/bash

true ; echo $?    # Prints: 0
false ; echo $?   # Prints: 1

Logical Operators

true && echo "True"     # Prints: True (because true succeeds)
false || echo "False"   # Prints: False (because false fails)

Comparison Operators

There are TWO syntaxes for comparisons in bash. Stick to one.

Option 1: [[ ]] (Modern)

For integers:

[[ 1 -le 2 ]]  # Less than or equal
[[ 3 -ge 2 ]]  # Greater than or equal
[[ 5 -lt 10 ]] # Less than
[[ 8 -gt 4 ]]  # Greater than
[[ 5 -eq 5 ]]  # Equal
[[ 5 -ne 3 ]]  # Not equal

For strings and mixed:

[[ 3 == 3 ]]   # Equal
[[ 3 != 4 ]]   # Not equal
[[ 5 > 3 ]]    # Greater than (lexicographic for strings)
[[ 2 < 9 ]]    # Less than (lexicographic for strings)

Testing the result:

[[ 3 == 3 ]] ; echo $?   # Prints: 0 (true)
[[ 3 != 3 ]] ; echo $?   # Prints: 1 (false)
[[ 5 > 3 ]] ; echo $?    # Prints: 0 (true)
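One caution from me (easy to verify in any bash): inside [[ ]], < and > compare strings lexicographically, which bites with multi-digit numbers:

[[ 10 < 9 ]] ; echo $?    # Prints: 0 (true!) because the string "10" sorts before "9"
[[ 10 -lt 9 ]] ; echo $?  # Prints: 1 (false) - use -lt for numeric comparison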
Option 2: test Command (Traditional)

test 1 -le 5 ; echo $?   # Prints: 0 (true)
test 10 -lt 5 ; echo $?  # Prints: 1 (false)

test is equivalent to [ ] (note: single brackets):

[ 1 -le 5 ] ; echo $?    # Same as test

My recommendation: Use [[ ]] (double brackets). It's more powerful and less error-prone than [ ] or test.

File Test Operators

Check file properties:

test -f /etc/hosts ; echo $?     # Does file exist? (0 = yes)
test -d /home ; echo $?           # Is it a directory? (0 = yes)
test -r /etc/shadow ; echo $?    # Do I have read permission? (1 = no)
test -w /tmp ; echo $?            # Do I have write permission? (0 = yes)
test -x /usr/bin/ls ; echo $?    # Is it executable? (0 = yes)

Common file tests:

  • -f file exists and is a regular file
  • -d directory exists
  • -e exists (any type)
  • -r readable
  • -w writable
  • -x executable
  • -s file exists and is not empty

Using [[ ]] syntax:

[[ -f /etc/hosts ]] && echo "File exists"
[[ -r /etc/shadow ]] || echo "Cannot read this file"

Positional Parameters (Command-Line Arguments)

When you run a script with arguments, bash provides special variables to access them.

Special Variables

#!/bin/bash

# $0 - Name of the script itself
# $# - Number of command-line arguments
# $* - All arguments as a single string
# $@ - All arguments as separate strings (array-like)
# $1 - First argument
# $2 - Second argument
# $3 - Third argument
# ... and so on

Example Script

#!/bin/bash

echo "Script name: $0"
echo "Total number of arguments: $#"
echo "All arguments: $*"
echo "First argument: $1"
echo "Second argument: $2"

Running it:

$ ./script.sh hello world 123

Output:

Script name: ./script.sh
Total number of arguments: 3
All arguments: hello world 123
First argument: hello
Second argument: world

$* vs $@

$* - Treats all arguments as a single string:

for arg in "$*"; do
    echo $arg
done
# Output: hello world 123 (all as one)

$@ - Treats arguments as separate items:

for arg in "$@"; do
    echo $arg
done
# Output:
# hello
# world
# 123

Recommendation: Use "$@" when looping through arguments.


Functions

Functions let you organize code into reusable blocks.

Basic Function

#!/bin/bash

Hello() {
    echo "Hello Functions!"
}

Hello  # Call the function

Alternative syntax:

function Hello() {
    echo "Hello Functions!"
}

Both work the same. Pick one style and be consistent.

Functions with Return Values

#!/bin/bash

function Hello() {
    echo "Hello Functions!"
    return 0  # Success
}

function GetTimestamp() {
    echo "The time now is $(date +%m/%d/%y' '%R)"
    return 0
}

Hello
echo "Exit code: $?"  # Prints: 0

GetTimestamp

Important: return only returns exit codes (0-255), NOT values like other languages.

To return a value, use echo:

function Add() {
    local result=$(($1 + $2))
    echo $result  # "Return" the value via stdout
}

sum=$(Add 5 3)  # Capture the output
echo "Sum: $sum"  # Prints: Sum: 8

Function Arguments

Functions can take arguments like scripts:

#!/bin/bash

Greet() {
    echo "Hello, $1!"  # $1 is first argument to function
}

Greet "Mahmoud"  # Prints: Hello, Mahmoud!
Greet "World"    # Prints: Hello, World!

Reading User Input

Basic read Command

#!/bin/bash

echo "What is your name?"
read name
echo "Hello, $name!"

How it works:

  1. Script displays prompt
  2. Waits for user to type and press Enter
  3. Stores input in variable name

read with Inline Prompt

#!/bin/bash

read -p "What is your name? " name
echo "Hello, $name!"

-p flag: Display prompt on same line as input

Reading Multiple Variables

#!/bin/bash

read -p "Enter your first and last name: " first last
echo "Hello, $first $last!"

Input: Mahmoud Xyz
Output: Hello, Mahmoud Xyz!

Reading Passwords (Securely)

#!/bin/bash

read -sp "Enter your password: " password
echo ""  # New line after hidden input
echo "Password received (length: ${#password})"

-s flag: Silent mode - doesn't display what user types
-p flag: Inline prompt

Security note: This hides the password from screen, but it's still in memory as plain text. For real password handling, use dedicated tools.

Reading from Files

#!/bin/bash

while read line; do
    echo "Line: $line"
done < /etc/passwd

Reads /etc/passwd line by line.
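
A slightly more robust variant you'll often see (optional here, but a good habit): -r stops backslash interpretation, and clearing IFS preserves leading/trailing whitespace.

#!/bin/bash

while IFS= read -r line; do
    echo "Line: $line"
done < /etc/passwd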


Best Practices

  1. Always use shebang: #!/usr/bin/env bash
  2. Use set -e: Stop on errors
  3. Use set -u: Stop on undefined variables
  4. Use set -o pipefail: Catch errors in pipes
  5. Quote variables: Use "$var" not $var (prevents word splitting)
  6. Check return codes: Test if commands succeeded
  7. Add comments: Explain non-obvious logic
  8. Use functions: Break complex scripts into smaller pieces
  9. Test thoroughly: Run scripts in safe environment first

The Holy Trinity of Safety

#!/usr/bin/env bash
set -euo pipefail

  • -e exit on error
  • -u exit on undefined variable
  • -o pipefail exit on pipe failures
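
A minimal sketch of what each flag catches (the file path is a made-up example):

#!/usr/bin/env bash
set -euo pipefail

cat /no/such/file | wc -l   # pipefail: the pipeline fails even though wc succeeds,
                            # and -e stops the script right here
echo "$UNDEFINED_VAR"       # -u: expanding an unset variable would also abort (never reached)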

About Course Materials

These notes contain NO copied course materials. Everything here is my personal understanding and recitation of concepts, synthesized from publicly available resources (bash documentation, shell scripting tutorials, Linux guides).

This is my academic work, how I've processed and reorganized information from legitimate sources. I take full responsibility for any errors in my understanding.

If you believe any content violates copyright, contact me at mahmoudahmedxyz@gmail.com and I'll remove it immediately.

References

[1] Ahmed Sami (Architect @ Microsoft).
Linux for Data Engineers (Arabic – Egyptian Dialect), 11h 30m.
YouTube

Python

[Image: Python comic]

💡
Philosophy

I don't like cheat sheets. What we really need is daily problem-solving. Read other people's code, understand how they think - this is the only real way to improve.

This is a quick overview combined with practice problems. Things might appear in a reversed order sometimes - we'll introduce concepts by solving problems and covering tools as needed.

ℹ️
Need Help?

If you need help setting up something, write me.


Resources

Free Books:

If you want to buy:


Your First Program

print("Hello, World!")
⚠️
Everything is Case Sensitive

print() works. Print() does not!

The print() Function

Optional arguments: sep and end

sep (separator) - what goes between values:

print("A", "B", "C")              # A B C (default: space)
print("A", "B", "C", sep="-")     # A-B-C
print(1, 2, 3, sep=" | ")         # 1 | 2 | 3

end - what prints after the line:

print("Hello")
print("World")
# Output:
# Hello
# World

print("Hello", end=" ")
print("World")
# Output: Hello World

Escape Characters

📝
Common Escape Characters

\n → New line
\t → Tab
\\ → Backslash
\' → Single quote
\" → Double quote

Practice

💻
Exercise 1

Print a box of asterisks (4 rows, 19 asterisks each)

💻
Exercise 2

Print a hollow box (asterisks on edges, spaces inside)

💻
Exercise 3

Print a triangle pattern starting with one asterisk


Variables and Assignment

A variable stores a value in memory so you can use it later.

x = 7
y = 3
total = x + y
print(total)  # 11

⚠️
Assignment vs Equality

The = sign is for assignment, not mathematical equality. You're telling Python to store the right side value in the left side variable.

Multiple assignment:

x, y, z = 1, 2, 3

Variable Naming Rules

  • Must start with letter or underscore
  • Can contain letters, numbers, underscores
  • Cannot start with number
  • Cannot contain spaces
  • Cannot use Python keywords (for, if, class, etc.)
  • Case sensitive: age, Age, AGE are different

Assignment Operators

📝
Shortcuts

x += 3 → Same as x = x + 3
x -= 2 → Same as x = x - 2
x *= 4 → Same as x = x * 4
x /= 2 → Same as x = x / 2


Reading Input

name = input("What's your name? ")
print(f"Hello, {name}!")
⚠️
Important

input() always returns a string! Even if the user types 42, you get "42".

Converting input:

age = int(input("How old are you? "))
price = float(input("Enter price: $"))

Practice

💻
Exercise 1

Ask for a number, print its square in a complete sentence ending with a period (use sep)

💻
Exercise 2

Compute: (512 - 282) / (47 × 48 + 5)

💻
Exercise 3

Convert kilograms to pounds (2.2 pounds per kilogram)


Basic Data Types

Strings

Text inside quotes:

name = "Mahmoud"
message = 'Hello'

Can use single or double quotes. Strings can contain letters, numbers, spaces, symbols.

Numbers

  • int → Whole numbers: 7, 0, -100
  • float → Decimals: 3.14, 0.5, -2.7

Boolean

True or false values:

print(5 > 3)        # True
print(2 == 10)      # False
print("a" in "cat") # True

Logical Operators

📝
Operators

and → Both must be true
or → At least one must be true
not → Reverses the boolean
== → Equal to
!= → Not equal to
>, <, >=, <= → Comparisons

Practice

💻
DNA Validation Exercises

Read a DNA sequence and check:
1. Contains BOTH "A" AND "T"
2. Contains "U" OR "T"
3. Is pure RNA (no "T")
4. Is empty or only whitespace
5. Is valid DNA (only A, T, G, C)
6. Contains "A" OR "G" but NOT both
7. Contains any stop codon ("TAA", "TAG", "TGA")

Type Checking and Casting

print(type("hello"))  # <class 'str'>
print(type(10))       # <class 'int'>
print(type(3.5))      # <class 'float'>
print(type(True))     # <class 'bool'>

Type casting:

int("10")      # 10
float(5)       # 5.0
str(3.14)      # "3.14"
bool(0)        # False
bool(5)        # True
list("hi")     # ['h', 'i']
⚠️
Invalid Casts

int("hello") and float("abc") will cause errors!


Sequences

[Image: sequences.png]

Strings

Strings are sequences of characters.

Indexing

Indexes start from 0:

name = "Python"
print(name[0])   # P
print(name[3])   # h
⚠️
Strings Are Immutable

You cannot change characters directly: name[0] = "J" causes an error!
But you can reassign the whole string: name = "Java"

String Operations

# Concatenation
"Hello" + " " + "World"  # "Hello World"

# Multiplication
"ha" * 3                 # "hahaha"

# Length
len("Python")            # 6

# Methods
text = "hello"
text.upper()             # "HELLO"
text.replace("h", "j")   # "jello"

Common String Methods

📝
Useful Methods

.upper(), .lower(), .capitalize(), .title()
.strip(), .lstrip(), .rstrip()
.replace(old, new), .split(sep), .join(list)
.find(sub), .count(sub)
.startswith(), .endswith()
.isalpha(), .isdigit(), .isalnum()
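
A few of these in action:

text = "  Hello World  "
print(text.strip())                    # "Hello World"
print(text.lower())                    # "  hello world  "
print("banana".count("a"))             # 3
print("report.csv".endswith(".csv"))   # True
print("one two three".split())         # ['one', 'two', 'three']
print("-".join(["A", "T", "G"]))       # "A-T-G"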

Practice

💻
Exercise 1

Convert DNA → RNA only if T exists (don't use if)

💻
Exercise 2

Check if DNA starts with "ATG" AND ends with "TAA"

💻
Exercise 3

Read text and print the last character


Lists

Lists can contain different types and are mutable (changeable).

numbers = [1, 2, 3]
mixed = [1, "hello", True]

List Operations

# Accessing
colors = ["red", "blue", "green"]
print(colors[1])  # "blue"

# Modifying (lists ARE mutable!)
colors[1] = "yellow"

# Adding
colors.append("black")          # Add at end
colors.insert(1, "white")       # Add at position

# Removing
del colors[1]                   # Remove by index
value = colors.pop()            # Remove last
colors.remove("red")            # Remove by value

# Sorting
numbers = [3, 1, 2]
numbers.sort()                  # Permanent
sorted(numbers)                 # Temporary

# Other operations
numbers.reverse()               # Reverse in place
len(numbers)                    # Length

Practice

💻
Exercise 1

Print the middle element of a list

💻
Exercise 2

Mutate RNA: ["A", "U", "G", "C", "U", "A"]
- Change first "A" to "G"
- Change last "A" to "C"

💻
Exercise 3

Swap first and last codon in: ["A","U","G","C","G","A","U","U","G"]

💻
Exercise 4

Create complementary DNA: A↔T, G↔C for ["A","T","G","C"]


Slicing

Extract portions of sequences: [start:stop:step]

⚠️
Stop is Excluded

[0:3] gives indices 0, 1, 2 (NOT 3)

Basic Slicing

numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

numbers[2:5]      # [2, 3, 4]
numbers[:3]       # [0, 1, 2] - from beginning
numbers[5:]       # [5, 6, 7, 8, 9] - to end
numbers[:]        # Copy everything
numbers[::2]      # [0, 2, 4, 6, 8] - every 2nd element

Negative Indices

Count from the end: -1 is last, -2 is second-to-last

numbers[-1]       # 9 - last element
numbers[-3:]      # [7, 8, 9] - last 3 elements
numbers[:-2]      # [0, 1, 2, 3, 4, 5, 6, 7] - all except last 2
numbers[::-1]     # Reverse!

Practice

💻
Exercise 1

Reverse middle 6 elements (indices 2-7) of [0,1,2,3,4,5,6,7,8,9]

💻
Exercise 2

Get every 3rd element backwards from ['a','b',...,'j']

💻
Exercise 3

Swap first 3 and last 3 characters in "abcdefghij"


Control Flow

If Statements

age = 18

if age >= 18:
    print("Adult")
elif age >= 13:
    print("Teen")
else:
    print("Child")
💡
elif vs Separate if

elif stops checking after first match. Separate if statements check all conditions.
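
A quick illustration of the difference:

score = 95

# elif: only the first true branch runs
if score >= 90:
    print("A")      # Prints: A
elif score >= 60:
    print("Pass")   # Skipped

# Separate ifs: every condition is checked
if score >= 90:
    print("A")      # Prints: A
if score >= 60:
    print("Pass")   # ALSO prints: Pass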

Practice

💻
Exercise 1

Convert cm to inches (2.54 cm/inch). Print "invalid" if negative.

💻
Exercise 2

Print student year: ≤23: freshman, 24-53: sophomore, 54-83: junior, ≥84: senior

💻
Exercise 3

Number guessing game (1-10)


Loops

For Loops

# Loop through list
for fruit in ["apple", "banana"]:
    print(fruit)

# With index
for i, fruit in enumerate(["apple", "banana"]):
    print(f"{i}: {fruit}")

# Range
for i in range(5):        # 0, 1, 2, 3, 4
    print(i)

for i in range(2, 5):     # 2, 3, 4
    print(i)

for i in range(0, 10, 2): # 0, 2, 4, 6, 8
    print(i)

While Loops

count = 0
while count < 5:
    print(count)
    count += 1
⚠️
Infinite Loops

Make sure your condition eventually becomes False!

Control Statements

📝
Loop Control

break → Exit loop immediately
continue → Skip to next iteration
pass → Do nothing (placeholder)
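
Each of these in a loop:

for i in range(5):
    if i == 3:
        break       # Loop ends entirely
    print(i)        # 0, 1, 2

for i in range(5):
    if i == 3:
        continue    # Skip only this iteration
    print(i)        # 0, 1, 2, 4

for i in range(5):
    pass            # Placeholder: loop runs, does nothing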

Practice

💻
Exercise 1

Print your name 100 times

💻
Exercise 2

Print numbers and their squares from 1-20

💻
Exercise 3

Print: 8, 11, 14, 17, ..., 89 using a for loop


String & List Exercises

💻
String Challenges

1. Count spaces to estimate words
2. Check if parentheses are balanced
3. Check if word contains vowels
4. Encrypt by rearranging even/odd indices
5. Capitalize first letter of each word

💻
List Challenges

1. Replace all values > 10 with 10
2. Remove duplicates from list
3. Find longest run of zeros
4. Create [1,1,0,1,0,0,1,0,0,0,...]
5. Remove first character from each string


F-Strings (String Formatting)

Modern, clean way to format strings:

name = 'Ahmed'
age = 45
txt = f"My name is {name}, I am {age}"

Number Formatting

pi = 3.14159265359

f'{pi:.2f}'              # '3.14' - 2 decimals
f'{10:03d}'              # '010' - pad with zeros
f'{12345678:,d}'         # '12,345,678' - commas
f'{42:>10d}'             # '        42' - right align
f'{1234.5:>10,.2f}'      # '  1,234.50' - combined

Functions in F-Strings

name = "alice"
f"Hello, {name.upper()}!"        # 'Hello, ALICE!'

numbers = [3, 1, 4]
f"Sum: {sum(numbers)}"           # 'Sum: 8'

String Methods

split() and join()

# Split
text = "one,two,three"
words = text.split(',')          # ['one', 'two', 'three']
text.split()                     # Split on any whitespace

# Join
words = ['one', 'two', 'three']
', '.join(words)                 # 'one, two, three'
''.join(['H','e','l','l','o'])   # 'Hello'

partition()

Splits at first occurrence:

email = "user@example.com"
username, _, domain = email.partition('@')
# username = 'user', domain = 'example.com'

Character Checks

'123'.isdigit()          # True - all digits
'Hello123'.isalnum()     # True - letters and numbers
'hello'.isalpha()        # True - only letters
'hello'.islower()        # True - all lowercase
'HELLO'.isupper()        # True - all uppercase

Two Sum Problem

โ“
Problem

Given an array of integers and a target, return indices of two numbers that add up to target.

# Input: nums = [2, 7, 11, 15], target = 9
# Output: [0, 1]  (because 2 + 7 = 9)

Brute Force Solution (O(n²))

nums = [2, 7, 11, 15]
target = 9

for i in range(len(nums)):
    for j in range(i + 1, len(nums)):
        if nums[i] + nums[j] == target:
            print([i, j])
⚠️
Nested Loops = Slow

Time complexity: O(n²)
10 elements = ~100 operations
1,000 elements = ~1,000,000 operations!


Unpacking with * and **

Unpacking Iterables (*)

# Basic unpacking
numbers = [1, 2, 3]
a, b, c = numbers

# Catch remaining items
first, *middle, last = [1, 2, 3, 4, 5]
# first = 1, middle = [2, 3, 4], last = 5

# In function calls
def add(a, b, c):
    return a + b + c

numbers = [1, 2, 3]
add(*numbers)  # Same as add(1, 2, 3)

# Combining lists
list1 = [1, 2]
list2 = [3, 4]
combined = [*list1, *list2]  # [1, 2, 3, 4]

Unpacking Dictionaries (**)

# Merge dictionaries
defaults = {'color': 'blue', 'size': 'M'}
custom = {'size': 'L'}
final = {**defaults, **custom}
# {'color': 'blue', 'size': 'L'}

# In function calls
def create_user(name, age, city):
    print(f"{name}, {age}, {city}")

data = {'name': 'Bob', 'age': 30, 'city': 'NYC'}
create_user(**data)
💡
Remember

* unpacks iterables into positional arguments
** unpacks dictionaries into keyword arguments

Python Dictionaries

What is a Dictionary?

A dictionary stores data as key-value pairs.

# Basic structure
student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

# Access by key
print(student['name'])   # Alex
print(student['age'])    # 20

Creating Dictionaries

# Empty dictionary
empty = {}

# With initial values
person = {'name': 'Alex', 'age': 20}

# Using dict() constructor
person = dict(name='Alex', age=20)

Basic Operations

Adding and Modifying

student = {'name': 'Alex', 'age': 20}

# Add new key
student['major'] = 'CS'
print(student)  # {'name': 'Alex', 'age': 20, 'major': 'CS'}

# Modify existing value
student['age'] = 21
print(student)  # {'name': 'Alex', 'age': 21, 'major': 'CS'}

Deleting

student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

# Delete specific key
del student['major']
print(student)  # {'name': 'Alex', 'age': 20}

# Remove and return value
age = student.pop('age')
print(age)      # 20
print(student)  # {'name': 'Alex'}

Getting Values Safely

student = {'name': 'Alex', 'age': 20}

# Direct access - raises error if key missing
print(student['name'])      # Alex
# print(student['grade'])   # KeyError!

# Safe access with .get() - returns None if missing
print(student.get('name'))   # Alex
print(student.get('grade'))  # None

# Provide default value
print(student.get('grade', 'N/A'))  # N/A

Useful Methods

student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

# Get all keys
print(student.keys())    # dict_keys(['name', 'age', 'major'])

# Get all values
print(student.values())  # dict_values(['Alex', 20, 'CS'])

# Get all key-value pairs
print(student.items())   # dict_items([('name', 'Alex'), ('age', 20), ('major', 'CS')])

# Get length
print(len(student))      # 3

Membership Testing

Use in to check if a key exists (not value!):

student = {'name': 'Alex', 'age': 20}

# Check if key exists
print('name' in student)     # True
print('grade' in student)    # False

# Check if key does NOT exist
print('grade' not in student)  # True

# To check values, use .values()
print('Alex' in student.values())  # True
print(20 in student.values())      # True

Important: Checking in on a dictionary is O(1) - instant! This is why dictionaries are so powerful.


Looping Through Dictionaries

Loop Over Keys (Default)

student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

# Default: loops over keys
for key in student:
    print(key)
# name
# age
# major

# Explicit (same result)
for key in student.keys():
    print(key)

Loop Over Values

student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

for value in student.values():
    print(value)
# Alex
# 20
# CS

Loop Over Keys and Values Together

student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

for key, value in student.items():
    print(f"{key}: {value}")
# name: Alex
# age: 20
# major: CS

Loop With Index Using enumerate()

student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

for index, key in enumerate(student):
    print(f"{index}: {key} = {student[key]}")
# 0: name = Alex
# 1: age = 20
# 2: major = CS

# Or with items()
for index, (key, value) in enumerate(student.items()):
    print(f"{index}: {key} = {value}")

Dictionary Order

Python 3.7+: Dictionaries maintain insertion order.

# Items stay in the order you add them
d = {}
d['first'] = 1
d['second'] = 2
d['third'] = 3

for key in d:
    print(key)
# first
# second
# third  (guaranteed order!)

Note: Before Python 3.7, dictionary order was not guaranteed. If you need to support older Python, don't rely on order.

Important: While keys maintain insertion order, this doesn't mean dictionaries are sorted. They just remember the order you added things.

# Not sorted - just insertion order
d = {'c': 3, 'a': 1, 'b': 2}
print(list(d.keys()))  # ['c', 'a', 'b'] - insertion order, not alphabetical

Complex Values

Lists as Values

student = {
    'name': 'Alex',
    'courses': ['Math', 'Physics', 'CS']
}

# Access list items
print(student['courses'][0])  # Math

# Modify list
student['courses'].append('Biology')
print(student['courses'])  # ['Math', 'Physics', 'CS', 'Biology']

Nested Dictionaries

students = {
    1: {'name': 'Alex', 'age': 20},
    2: {'name': 'Maria', 'age': 22},
    3: {'name': 'Jordan', 'age': 21}
}

# Access nested values
print(students[1]['name'])  # Alex
print(students[2]['age'])   # 22

# Modify nested values
students[3]['age'] = 22

# Add new entry
students[4] = {'name': 'Casey', 'age': 19}

Why Dictionaries Are Fast: Hashing

Dictionaries use hashing to achieve O(1) lookup time.

How it works:

  1. When you add a key, Python computes a hash (a number) from the key
  2. This hash tells Python exactly where to store the value in memory
  3. When you look up the key, Python computes the same hash and goes directly to that location

Result: Looking up a key takes the same time whether your dictionary has 10 items or 10 million items.

# List: O(n) - must check each element
my_list = [2, 7, 11, 15]
if 7 in my_list:  # Checks: 2? no. 7? yes! (2 checks)
    print("Found")

# Dictionary: O(1) - instant lookup
my_dict = {2: 'a', 7: 'b', 11: 'c', 15: 'd'}
if 7 in my_dict:  # Goes directly to location (1 check)
    print("Found")

Practical Example: Two Sum Problem

Problem: Find two numbers that add up to a target.

Slow approach (nested loops - O(n²)):

nums = [2, 7, 11, 15]
target = 9

for i in range(len(nums)):
    for j in range(i + 1, len(nums)):
        if nums[i] + nums[j] == target:
            print([i, j])  # [0, 1]

Fast approach (dictionary - O(n)):

nums = [2, 7, 11, 15]
target = 9
seen = {}

for i, num in enumerate(nums):
    complement = target - num
    if complement in seen:
        print([seen[complement], i])  # [0, 1]
    else:
        seen[num] = i

Why it's faster:

  • We loop once through the array
  • For each number, we check if its complement exists (O(1) lookup)
  • Total: O(n) instead of O(n²)

Trace through:

i=0, num=2: complement=7, not in seen, add {2: 0}
i=1, num=7: complement=2, IS in seen at index 0, return [0, 1]

Exercises

Exercise 1: Create a dictionary of 5 countries and their capitals. Print each country and its capital.

Exercise 2: Write a program that counts how many times each character appears in a string.

Exercise 3: Given a list of numbers, create a dictionary where keys are numbers and values are their squares.

Exercise 4: Create a program that stores product names and prices. Let the user look up prices by product name.

Exercise 5: Given a 5×5 list of numbers, count how many times each number appears and print the three most common.

Exercise 6: DNA pattern matching - given a list of DNA sequences and a pattern with wildcards (*), find matching sequences:

sequences = ['ATGCATGC', 'ATGGATGC', 'TTGCATGC']
pattern = 'ATG*ATGC'  # * matches any character
# Should match: 'ATGCATGC', 'ATGGATGC'

Solutions

# Exercise 1
capitals = {'France': 'Paris', 'Japan': 'Tokyo', 'Italy': 'Rome', 
            'Egypt': 'Cairo', 'Brazil': 'Brasilia'}
for country, capital in capitals.items():
    print(f"{country}: {capital}")

# Exercise 2
text = "hello world"
char_count = {}
for char in text:
    char_count[char] = char_count.get(char, 0) + 1
print(char_count)

# Exercise 3
numbers = [1, 2, 3, 4, 5]
squares = {n: n**2 for n in numbers}
print(squares)  # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

# Exercise 4
products = {}
while True:
    name = input("Product name (or 'done'): ")
    if name == 'done':
        break
    price = float(input("Price: "))
    products[name] = price

while True:
    lookup = input("Look up product (or 'quit'): ")
    if lookup == 'quit':
        break
    print(products.get(lookup, "Product not found"))

# Exercise 5
import random
grid = [[random.randint(1, 10) for _ in range(5)] for _ in range(5)]
counts = {}
for row in grid:
    for num in row:
        counts[num] = counts.get(num, 0) + 1
# Sort by count and get top 3
top3 = sorted(counts.items(), key=lambda x: x[1], reverse=True)[:3]
print("Top 3:", top3)

# Exercise 6
sequences = ['ATGCATGC', 'ATGGATGC', 'TTGCATGC']
pattern = 'ATG*ATGC'

for seq in sequences:
    if len(seq) != len(pattern):  # Guard: only equal-length sequences can match position by position
        continue
    match = True
    for i, char in enumerate(pattern):
        if char != '*' and char != seq[i]:
            match = False
            break
    if match:
        print(f"Match: {seq}")

Summary

Operation          Syntax              Time
Create             d = {'a': 1}        O(1)
Access             d['key']            O(1)
Add/Modify         d['key'] = value    O(1)
Delete             del d['key']        O(1)
Check key exists   'key' in d          O(1)
Get all keys       d.keys()            O(1)
Get all values     d.values()          O(1)
Loop               for k in d          O(n)

Key takeaways:

  • Dictionaries are fast for lookups (O(1))
  • Use .get() for safe access with default values
  • Loop with .items() to get both keys and values
  • Python 3.7+ maintains insertion order
  • Perfect for counting, caching, and mapping data

Functions

📖
What is a Function?

A function is a reusable block of code that performs a specific task. It's like a recipe you can follow multiple times without rewriting the steps.

The DRY Principle

💡
DRY = Don't Repeat Yourself

If you're copying and pasting code, you should probably write a function instead!

Without a function (repetitive):

# Calculating area three times - notice the pattern?
area1 = 10 * 5
print(f"Area 1: {area1}")

area2 = 8 * 6
print(f"Area 2: {area2}")

area3 = 12 * 4
print(f"Area 3: {area3}")

With a function (clean):

def calculate_area(length, width):
    return length * width

print(f"Area 1: {calculate_area(10, 5)}")
print(f"Area 2: {calculate_area(8, 6)}")
print(f"Area 3: {calculate_area(12, 4)}")

Basic Function Syntax

Declaring a Function

def greet():
    print("Hello, World!")

Anatomy:

  • def โ†’ keyword to start a function
  • greet โ†’ function name (use descriptive names!)
  • () โ†’ parentheses for parameters
  • : โ†’ colon to start the body
  • Indented code โ†’ what the function does

Calling a Function

⚠️
Important

Defining a function doesn't run it! You must call it.

def greet():
    print("Hello, World!")

greet()  # Now it runs!
greet()  # You can call it multiple times

Parameters and Arguments

📝
Terminology

Parameters are in the definition. Arguments are the actual values you pass.

def greet(name):      # 'name' is a parameter
    print(f"Hello, {name}!")

greet("Alice")        # "Alice" is an argument

Multiple parameters:

def add_numbers(a, b):
    result = a + b
    print(f"{a} + {b} = {result}")

add_numbers(5, 3)     # Output: 5 + 3 = 8

Return Values

Functions can give back results using return:

def multiply(a, b):
    return a * b

result = multiply(4, 5)
print(result)  # 20

# Use the result directly in calculations
total = multiply(3, 7) + multiply(2, 4)  # 21 + 8 = 29
ℹ️
print() vs return

print() shows output on screen. return sends a value back so you can use it later.
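
A small example of the difference:

def show_double(n):
    print(n * 2)     # Displays the value, returns None

def get_double(n):
    return n * 2     # Sends the value back to the caller

a = show_double(5)   # Prints: 10
print(a)             # None - nothing came back

b = get_double(5)    # Prints nothing
print(b + 1)         # 11 - the returned value is usable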


Default Arguments

Give parameters default values if no argument is provided:

def power(base, exponent=2):  # exponent defaults to 2
    return base ** exponent

print(power(5))      # 25 (5²)
print(power(5, 3))   # 125 (5³)

Multiple defaults:

def create_profile(name, age=18, country="USA"):
    print(f"{name}, {age} years old, from {country}")

create_profile("Alice")                    # Uses both defaults
create_profile("Bob", 25)                  # Uses country default
create_profile("Charlie", 30, "Canada")    # No defaults used
⚠️
Rule

Parameters with defaults must come after parameters without defaults!

# โŒ Wrong
def bad(a=5, b):
    pass

# ✅ Correct
def good(b, a=5):
    pass

Variable Number of Arguments

*args (Positional Arguments)

Use when you don't know how many arguments will be passed:

def sum_all(*numbers):
    total = 0
    for num in numbers:
        total += num
    return total

print(sum_all(1, 2, 3))           # 6
print(sum_all(10, 20, 30, 40))    # 100

**kwargs (Keyword Arguments)

Use for named arguments as a dictionary:

def print_info(**details):
    for key, value in details.items():
        print(f"{key}: {value}")

print_info(name="Alice", age=25, city="New York")
# Output:
# name: Alice
# age: 25
# city: New York

Combining Everything

💡
Order Matters

When combining, use this order: regular params → *args → default params → **kwargs

def flexible(required, *args, default="default", **kwargs):
    print(f"Required: {required}")
    print(f"Args: {args}")
    print(f"Default: {default}")
    print(f"Kwargs: {kwargs}")

flexible("Must have", 1, 2, 3, default="Custom", extra="value")

Scope: Local vs Global

📖
Scope

Scope determines where a variable can be accessed in your code.

Local scope: Variables inside functions only exist inside that function

def calculate():
    result = 10 * 5  # Local variable
    print(result)

calculate()        # 50
print(result)      # ❌ ERROR! result doesn't exist here

Global scope: Variables outside functions can be accessed anywhere

total = 0  # Global variable

def add_to_total(amount):
    global total  # Modify the global variable
    total += amount

add_to_total(10)
print(total)  # 10
💡
Best Practice

Avoid global variables! Pass values as arguments and return results instead.

Better approach:

def add_to_total(current, amount):
    return current + amount

total = 0
total = add_to_total(total, 10)  # 10
total = add_to_total(total, 5)   # 15

Decomposition

📖
Decomposition

Breaking complex problems into smaller, manageable functions. Each function should do one thing well.

Bad (one giant function):

def process_order(items, customer):
    # Calculate, discount, tax, print - all in one!
    total = sum(item['price'] for item in items)
    if total > 100:
        total *= 0.9
    total *= 1.08
    print(f"Customer: {customer}")
    print(f"Total: ${total:.2f}")

Good (decomposed):

def calculate_subtotal(items):
    return sum(item['price'] for item in items)

def apply_discount(amount):
    return amount * 0.9 if amount > 100 else amount

def add_tax(amount):
    return amount * 1.08

def print_receipt(customer, total):
    print(f"Customer: {customer}")
    print(f"Total: ${total:.2f}")

def process_order(items, customer):
    subtotal = calculate_subtotal(items)
    discounted = apply_discount(subtotal)
    final = add_tax(discounted)
    print_receipt(customer, final)

Benefits: ✅ Easier to understand ✅ Easier to test ✅ Reusable components ✅ Easier to debug


Recursion

📖
Recursion

When a function calls itself to solve smaller versions of the same problem.

Classic example: Factorial (5! = 5 × 4 × 3 × 2 × 1)

def factorial(n):
    # Base case: stop condition
    if n == 0 or n == 1:
        return 1
    
    # Recursive case: call itself
    return n * factorial(n - 1)

print(factorial(5))  # 120

How it works:

factorial(5) = 5 × factorial(4)
             = 5 × (4 × factorial(3))
             = 5 × (4 × (3 × factorial(2)))
             = 5 × (4 × (3 × (2 × factorial(1))))
             = 5 × (4 × (3 × (2 × 1)))
             = 120

Key parts of recursion:

📝
Recursion Checklist

1. Base case: When to stop
2. Recursive case: Call itself with simpler input
3. Progress: Each call must get closer to the base case

Another example: Countdown

def countdown(n):
    if n == 0:
        print("Blast off!")
        return
    print(n)
    countdown(n - 1)

countdown(3)
# Output: 3, 2, 1, Blast off!
⚠️
Watch Out

Deep recursion can cause memory issues. Python has a default recursion limit.
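
You can inspect the limit with the sys module:

import sys

print(sys.getrecursionlimit())   # Usually 1000

def forever(n):
    return forever(n + 1)        # No base case - never stops!

# forever(1)  # RecursionError: maximum recursion depth exceeded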


Practice Exercises

💻
Exercise 1: Rectangle Printer

Write a function rectangle(m, n) that prints an m × n box of asterisks.

rectangle(2, 4)
# Output:
# ****
# ****
💻
Exercise 2: Add Excitement

Write add_excitement(words) that adds "!" to each string in a list.

  • Version A: Modify the original list
  • Version B: Return a new list without modifying the original
words = ["hello", "world"]
add_excitement(words)
# words is now ["hello!", "world!"]
💻
Exercise 3: Sum Digits

Write sum_digits(num) that returns the sum of all digits in a number.

sum_digits(123)   # Returns: 6 (1 + 2 + 3)
sum_digits(4567)  # Returns: 22 (4 + 5 + 6 + 7)
💻
Exercise 4: First Difference

Write first_diff(str1, str2) that returns the first position where strings differ, or -1 if identical.

first_diff("hello", "world")  # Returns: 0
first_diff("test", "tent")    # Returns: 2
first_diff("same", "same")    # Returns: -1
💻
Exercise 5: Tic-Tac-Toe

A 3×3 board uses: 0 = empty, 1 = X, 2 = O

  • Part A: Write a function that randomly places a 2 in an empty spot
  • Part B: Write a function that checks if someone has won (returns True/False)
💻
Exercise 6: String Matching

Write matches(str1, str2) that counts how many positions have the same character.

matches("python", "path")  # Returns: 3 (positions 0, 2, 3)
💻
Exercise 7: Find All Occurrences

Write findall(string, char) that returns a list of all positions where a character appears.

findall("hello", "l")  # Returns: [2, 3]
findall("test", "x")   # Returns: []
💻
Exercise 8: Case Swap

Write change_case(string) that swaps uppercase ↔ lowercase.

change_case("Hello World")  # Returns: "hELLO wORLD"

Challenge Exercises

โ“
Challenge 1: Merge Sorted Lists

Write merge(list1, list2) that combines two sorted lists into one sorted list.

  • Try it with .sort() method
  • Try it without using .sort()
merge([1, 3, 5], [2, 4, 6])  # Returns: [1, 2, 3, 4, 5, 6]
โ“
Challenge 2: Number to English

Write verbose(num) that converts numbers to English words (up to 10ยนโต).

verbose(123456)  
# Returns: "one hundred twenty-three thousand, four hundred fifty-six"
โ“
Challenge 3: Base 20 Conversion

Convert base 10 numbers to base 20 using letters A-T (A=0, B=1, ..., T=19).

base20(0)    # Returns: "A"
base20(20)   # Returns: "BA"
base20(39)   # Returns: "BT"
base20(400)  # Returns: "BAA"
โ“
Challenge 4: Closest Value

Write closest(L, n) that returns the largest element in L that doesn't exceed n.

closest([1, 6, 3, 9, 11], 8)  # Returns: 6
closest([5, 10, 15, 20], 12)  # Returns: 10

Higher-Order Functions

📖
Definition

Higher-Order Function: A function that either takes another function as a parameter OR returns a function as a result.

Why Do We Need Them?

Imagine you have a list of numbers and you want to:

  • Keep only the even numbers
  • Keep only numbers greater than 10
  • Keep only numbers divisible by 3

You could write three different functions... or write ONE function that accepts different "rules" as parameters!

💡
Key Idea

Separate what to do (iterate through a list) from how to decide (the specific rule)


Worked Example: Filtering Numbers

Step 1: The Problem

We have a list of numbers: [3, 8, 15, 4, 22, 7, 11]

We want to filter them based on different conditions.

Step 2: Without Higher-Order Functions (Repetitive)

# Filter for even numbers
def filter_even(numbers):
    result = []
    for num in numbers:
        if num % 2 == 0:
            result.append(num)
    return result

# Filter for numbers > 10
def filter_large(numbers):
    result = []
    for num in numbers:
        if num > 10:
            result.append(num)
    return result
⚠️
Problem

Notice how we're repeating the same loop structure? Only the condition changes!

Step 3: With Higher-Order Function (Smart)

def filter_numbers(numbers, condition):
    """
    Filter numbers based on any condition function.
    
    numbers: list of numbers
    condition: a function that returns True/False
    """
    result = []
    for num in numbers:
        if condition(num):  # Call the function we received!
            result.append(num)
    return result
✅
Solution

Now we have ONE function that can work with ANY condition!

Step 4: Define Simple Condition Functions

def is_even(n):
    return n % 2 == 0

def is_large(n):
    return n > 10

def is_small(n):
    return n < 10

Step 5: Use It!

numbers = [3, 8, 15, 4, 22, 7, 11]

print(filter_numbers(numbers, is_even))   # [8, 4, 22]
print(filter_numbers(numbers, is_large))  # [15, 22, 11]
print(filter_numbers(numbers, is_small))  # [3, 8, 4, 7]
ℹ️
Notice

We pass the function name WITHOUT parentheses: is_even not is_even()


Practice Exercises

💻
Exercise 1: String Filter

Complete this function:

def filter_words(words, condition):
    # Your code here
    pass

def is_long(word):
    return len(word) > 5

def starts_with_a(word):
    return word.lower().startswith('a')

# Test it:
words = ["apple", "cat", "banana", "amazing", "dog"]
print(filter_words(words, is_long))         # Should print: ["banana", "amazing"]
print(filter_words(words, starts_with_a))   # Should print: ["apple", "amazing"]
💻
Exercise 2: Number Transformer

Write a higher-order function that transforms numbers:

def transform_numbers(numbers, transformer):
    # Your code here: apply transformer to each number
    pass

def double(n):
    return n * 2

def square(n):
    return n ** 2

# Test it:
nums = [1, 2, 3, 4, 5]
print(transform_numbers(nums, double))   # Should print: [2, 4, 6, 8, 10]
print(transform_numbers(nums, square))   # Should print: [1, 4, 9, 16, 25]
💻
Exercise 3: Grade Calculator

Create a function that grades scores using different grading systems:

def apply_grading(scores, grade_function):
    # Your code here
    pass

def strict_grade(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    else:
        return 'C'

def pass_fail(score):
    return 'Pass' if score >= 60 else 'Fail'

# Test it:
scores = [95, 75, 85, 55]
print(apply_grading(scores, strict_grade))  # Should print: ['A', 'C', 'B', 'C']
print(apply_grading(scores, pass_fail))     # Should print: ['Pass', 'Pass', 'Pass', 'Fail']

Conclusion

📝
Remember

1. Functions can be passed as parameters (like any other value)
2. The higher-order function provides the structure (loop, collection)
3. The parameter function provides the specific behavior (condition, transformation)
4. This makes code more reusable and flexible

💡
Real Python Examples

Python has built-in higher-order functions you'll use all the time:
• sorted(items, key=function)
• map(function, items)
• filter(function, items)
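
For example:

words = ["banana", "fig", "apple"]

print(sorted(words, key=len))                    # ['fig', 'apple', 'banana']
print(list(map(str.upper, words)))               # ['BANANA', 'FIG', 'APPLE']
print(list(filter(lambda w: "a" in w, words)))   # ['banana', 'apple']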


Challenge Exercise

โ“
DNA Sequence Validator

Write a higher-order function validate_sequences(sequences, validator) that checks a list of DNA sequences using different validation rules.

Validation functions to create:

  • is_valid_dna(seq) - checks if sequence contains only A, C, G, T
  • is_long_enough(seq) - checks if sequence is at least 10 characters
  • has_start_codon(seq) - checks if sequence starts with "ATG"
sequences = ["ATGCGATCG", "ATGXYZ", "AT", "ATGCCCCCCCCCC"]

# Your solution should work like this:
print(validate_sequences(sequences, is_valid_dna))
# [True, False, True, True]

print(validate_sequences(sequences, is_long_enough))
# [False, False, False, True]

Tuples and Sets


Part 1: Tuples

What is a Tuple?

A tuple is essentially an immutable list. Once created, you cannot change its contents.

# List - mutable (can change)
L = [1, 2, 3]
L[0] = 100  # Works fine

# Tuple - immutable (cannot change)
t = (1, 2, 3)
t[0] = 100  # TypeError: 'tuple' object does not support item assignment

Creating Tuples

# With parentheses
t = (1, 2, 3)

# Without parentheses (comma makes it a tuple)
t = 1, 2, 3

# Single element tuple (comma is required!)
t = (1,)    # This is a tuple
t = (1)     # This is just an integer!

# Empty tuple
t = ()
t = tuple()

# From a list
t = tuple([1, 2, 3])

# From a string
t = tuple("hello")  # ('h', 'e', 'l', 'l', 'o')

Common mistake:

# This is NOT a tuple
x = (5)
print(type(x))  # <class 'int'>

# This IS a tuple
x = (5,)
print(type(x))  # <class 'tuple'>

Accessing Tuple Elements

t = ('a', 'b', 'c', 'd', 'e')

# Indexing (same as lists)
print(t[0])     # 'a'
print(t[-1])    # 'e'

# Slicing
print(t[1:3])   # ('b', 'c')
print(t[:3])    # ('a', 'b', 'c')
print(t[2:])    # ('c', 'd', 'e')

# Length
print(len(t))   # 5

Why Use Tuples?

1. Faster and Less Memory

Tuples are more efficient than lists:

import sys

L = [1, 2, 3, 4, 5]
t = (1, 2, 3, 4, 5)

print(sys.getsizeof(L))  # 104 bytes
print(sys.getsizeof(t))  # 80 bytes (smaller!)

2. Safe - Data Cannot Be Changed

When you want to ensure data stays constant:

# RGB color that shouldn't change
RED = (255, 0, 0)
# RED[0] = 200  # Error! Can't modify

# Coordinates
location = (40.7128, -74.0060)  # New York

3. Can Be Dictionary Keys

Lists cannot be dictionary keys, but tuples can:

# This works
locations = {
    (40.7128, -74.0060): "New York",
    (51.5074, -0.1278): "London"
}
print(locations[(40.7128, -74.0060)])  # New York

# This fails
# locations = {[40.7128, -74.0060]: "New York"}  # TypeError!

4. Return Multiple Values

Functions can return tuples:

def get_stats(numbers):
    return min(numbers), max(numbers), sum(numbers)

low, high, total = get_stats([1, 2, 3, 4, 5])
print(low, high, total)  # 1 5 15

Tuple Unpacking

# Basic unpacking
t = (1, 2, 3)
a, b, c = t
print(a, b, c)  # 1 2 3

# Swap values (elegant!)
x, y = 10, 20
x, y = y, x
print(x, y)  # 20 10

# Unpacking with *
t = (1, 2, 3, 4, 5)
first, *middle, last = t
print(first)   # 1
print(middle)  # [2, 3, 4]
print(last)    # 5

Looping Through Tuples

t = ('a', 'b', 'c')

# Basic loop
for item in t:
    print(item)

# With index
for i, item in enumerate(t):
    print(f"{i}: {item}")

# Loop through list of tuples
points = [(0, 0), (1, 2), (3, 4)]
for x, y in points:
    print(f"x={x}, y={y}")

Tuple Methods

Tuples have only two methods (because they're immutable):

t = (1, 2, 3, 2, 2, 4)

# Count occurrences
print(t.count(2))   # 3

# Find index
print(t.index(3))   # 2

Tuples vs Lists Summary

Feature          Tuple           List
Syntax           (1, 2, 3)       [1, 2, 3]
Mutable          No              Yes
Speed            Faster          Slower
Memory           Less            More
Dictionary key   Yes             No
Use case         Fixed data      Changing data

Tuple Exercises

Exercise 1: Create a tuple with your name, age, and city. Print each element.

Exercise 2: Given t = (1, 2, 3, 4, 5), print the first and last elements.

Exercise 3: Write a function that returns the min, max, and average of a list as a tuple.

Exercise 4: Swap two variables using tuple unpacking.

Exercise 5: Create a tuple from the string "ATGC" and count how many times 'A' appears.

Exercise 6: Given a list of (x, y) coordinates, calculate the distance of each from origin.

Exercise 7: Use a tuple as a dictionary key to store city names by their (latitude, longitude).

Exercise 8: Unpack (1, 2, 3, 4, 5) into first, middle (as list), and last.

Exercise 9: Create a function that returns the quotient and remainder of two numbers as a tuple.

Exercise 10: Loop through [(1, 'a'), (2, 'b'), (3, 'c')] and print each pair.

Exercise 11: Convert a list [1, 2, 3] to a tuple and back to a list.

Exercise 12: Find the index of 'G' in the tuple ('A', 'T', 'G', 'C').

Exercise 13: Create a tuple of tuples representing a 3x3 grid and print the center element.

Exercise 14: Given two tuples, concatenate them into a new tuple.

Exercise 15: Sort a list of (name, score) tuples by score in descending order.

Solutions

# Exercise 1
person = ("Mahmoud", 25, "Bologna")
print(person[0], person[1], person[2])

# Exercise 2
t = (1, 2, 3, 4, 5)
print(t[0], t[-1])

# Exercise 3
def stats(numbers):
    return min(numbers), max(numbers), sum(numbers)/len(numbers)
print(stats([1, 2, 3, 4, 5]))

# Exercise 4
x, y = 10, 20
x, y = y, x
print(x, y)

# Exercise 5
dna = tuple("ATGC")
print(dna.count('A'))

# Exercise 6
import math
coords = [(3, 4), (0, 5), (1, 1)]
for x, y in coords:
    dist = math.sqrt(x**2 + y**2)
    print(f"({x}, {y}): {dist:.2f}")

# Exercise 7
cities = {
    (40.71, -74.00): "New York",
    (51.51, -0.13): "London"
}
print(cities[(40.71, -74.00)])

# Exercise 8
t = (1, 2, 3, 4, 5)
first, *middle, last = t
print(first, middle, last)

# Exercise 9
def div_mod(a, b):
    return a // b, a % b
print(div_mod(17, 5))  # (3, 2)

# Exercise 10
pairs = [(1, 'a'), (2, 'b'), (3, 'c')]
for num, letter in pairs:
    print(f"{num}: {letter}")

# Exercise 11
L = [1, 2, 3]
t = tuple(L)
L2 = list(t)
print(t, L2)

# Exercise 12
dna = ('A', 'T', 'G', 'C')
print(dna.index('G'))  # 2

# Exercise 13
grid = ((1, 2, 3), (4, 5, 6), (7, 8, 9))
print(grid[1][1])  # 5

# Exercise 14
t1 = (1, 2)
t2 = (3, 4)
t3 = t1 + t2
print(t3)  # (1, 2, 3, 4)

# Exercise 15
scores = [("Alice", 85), ("Bob", 92), ("Charlie", 78)]
sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
print(sorted_scores)

Part 2: Sets

What is a Set?

A set is a collection of unique elements with no duplicates. Sets work like mathematical sets.

# Duplicates are automatically removed
S = {1, 2, 2, 3, 3, 3}
print(S)  # {1, 2, 3}

# Unordered - no indexing
# print(S[0])  # TypeError!

Creating Sets

# With curly braces
S = {1, 2, 3, 4, 5}

# From a list (removes duplicates)
S = set([1, 2, 2, 3, 3])
print(S)  # {1, 2, 3}

# From a string
S = set("hello")
print(S)  # {'h', 'e', 'l', 'o'}  (no duplicate 'l')

# Empty set (NOT {} - that's an empty dict!)
S = set()
print(type(S))   # <class 'set'>
print(type({}))  # <class 'dict'>

Adding and Removing Elements

S = {1, 2, 3}

# Add single element
S.add(4)
print(S)  # {1, 2, 3, 4}

# Add multiple elements
S.update([5, 6, 7])
print(S)  # {1, 2, 3, 4, 5, 6, 7}

# Remove element (raises error if not found)
S.remove(7)
print(S)  # {1, 2, 3, 4, 5, 6}

# Discard element (no error if not found)
S.discard(100)  # No error
S.discard(6)
print(S)  # {1, 2, 3, 4, 5}

# Pop random element
x = S.pop()
print(x)  # Some element (unpredictable which one)

# Clear all elements
S.clear()
print(S)  # set()

Membership Testing

Very fast - O(1):

S = {1, 2, 3, 4, 5}

print(3 in S)     # True
print(100 in S)   # False
print(100 not in S)  # True

Looping Through Sets

S = {'a', 'b', 'c'}

# Basic loop
for item in S:
    print(item)

# With enumerate
for i, item in enumerate(S):
    print(f"{i}: {item}")

Note: Sets are unordered - iteration order is not guaranteed!


Set Operations (The Powerful Part!)

Sets support mathematical set operations.

Union: Elements in Either Set

A = {1, 2, 3}
B = {3, 4, 5}

# Using | operator
print(A | B)  # {1, 2, 3, 4, 5}

# Using method
print(A.union(B))  # {1, 2, 3, 4, 5}

Intersection: Elements in Both Sets

A = {1, 2, 3}
B = {3, 4, 5}

# Using & operator
print(A & B)  # {3}

# Using method
print(A.intersection(B))  # {3}

Difference: Elements in A but Not in B

A = {1, 2, 3}
B = {3, 4, 5}

# Using - operator
print(A - B)  # {1, 2}
print(B - A)  # {4, 5}

# Using method
print(A.difference(B))  # {1, 2}

Symmetric Difference: Elements in Either but Not Both

A = {1, 2, 3}
B = {3, 4, 5}

# Using ^ operator
print(A ^ B)  # {1, 2, 4, 5}

# Using method
print(A.symmetric_difference(B))  # {1, 2, 4, 5}

Subset and Superset

A = {1, 2}
B = {1, 2, 3, 4}

# Is A a subset of B?
print(A <= B)        # True
print(A.issubset(B)) # True

# Is B a superset of A?
print(B >= A)          # True
print(B.issuperset(A)) # True

# Proper subset (subset but not equal)
print(A < B)  # True
print(A < A)  # False

Disjoint: No Common Elements

A = {1, 2}
B = {3, 4}
C = {2, 3}

print(A.isdisjoint(B))  # True (no overlap)
print(A.isdisjoint(C))  # False (2 is common)

Set Operations Summary

Operation        Operator   Method                       Result
Union            A | B      A.union(B)                   All elements from both
Intersection     A & B      A.intersection(B)            Common elements
Difference       A - B      A.difference(B)              In A but not in B
Symmetric Diff   A ^ B      A.symmetric_difference(B)    In either but not both
Subset           A <= B     A.issubset(B)                True if A ⊆ B
Superset         A >= B     A.issuperset(B)              True if A ⊇ B
Disjoint         -          A.isdisjoint(B)              True if no overlap

In-Place Operations

Modify the set directly (note the method names end in _update):

A = {1, 2, 3}
B = {3, 4, 5}

# Union in-place
A |= B  # or A.update(B)
print(A)  # {1, 2, 3, 4, 5}

# Intersection in-place
A = {1, 2, 3}
A &= B  # or A.intersection_update(B)
print(A)  # {3}

# Difference in-place
A = {1, 2, 3}
A -= B  # or A.difference_update(B)
print(A)  # {1, 2}

Practical Examples

Remove Duplicates from List

L = [1, 2, 2, 3, 3, 3, 4]
unique = list(set(L))
print(unique)  # [1, 2, 3, 4]

Find Common Elements

list1 = [1, 2, 3, 4]
list2 = [3, 4, 5, 6]
common = set(list1) & set(list2)
print(common)  # {3, 4}

Find Unique DNA Bases

dna = "ATGCATGCATGC"
bases = set(dna)
print(bases)  # {'A', 'T', 'G', 'C'}

Set Exercises

Exercise 1: Create a set from the list [1, 2, 2, 3, 3, 3] and print it.

Exercise 2: Add the number 10 to a set {1, 2, 3}.

Exercise 3: Remove duplicates from [1, 1, 2, 2, 3, 3, 4, 4].

Exercise 4: Find common elements between {1, 2, 3, 4} and {3, 4, 5, 6}.

Exercise 5: Find elements in {1, 2, 3} but not in {2, 3, 4}.

Exercise 6: Find all unique characters in the string "mississippi".

Exercise 7: Check if {1, 2} is a subset of {1, 2, 3, 4}.

Exercise 8: Find symmetric difference of {1, 2, 3} and {3, 4, 5}.

Exercise 9: Check if two sets {1, 2} and {3, 4} have no common elements.

Exercise 10: Given DNA sequence "ATGCATGC", create set of unique nucleotides.

Exercise 11: Combine sets {1, 2}, {3, 4}, {5, 6} into one set.

Exercise 12: Given two lists of students, find students in both classes.

Exercise 13: Remove element 3 from set {1, 2, 3, 4} safely (no error if missing).

Exercise 14: Create a set of prime numbers less than 20 and check membership of 17.

Exercise 15: Given three sets A, B, C, find elements that are in all three.

Solutions

# Exercise 1
S = set([1, 2, 2, 3, 3, 3])
print(S)  # {1, 2, 3}

# Exercise 2
S = {1, 2, 3}
S.add(10)
print(S)

# Exercise 3
L = [1, 1, 2, 2, 3, 3, 4, 4]
print(list(set(L)))

# Exercise 4
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
print(A & B)  # {3, 4}

# Exercise 5
A = {1, 2, 3}
B = {2, 3, 4}
print(A - B)  # {1}

# Exercise 6
print(set("mississippi"))

# Exercise 7
A = {1, 2}
B = {1, 2, 3, 4}
print(A <= B)  # True

# Exercise 8
A = {1, 2, 3}
B = {3, 4, 5}
print(A ^ B)  # {1, 2, 4, 5}

# Exercise 9
A = {1, 2}
B = {3, 4}
print(A.isdisjoint(B))  # True

# Exercise 10
dna = "ATGCATGC"
print(set(dna))  # {'A', 'T', 'G', 'C'}

# Exercise 11
A = {1, 2}
B = {3, 4}
C = {5, 6}
print(A | B | C)  # {1, 2, 3, 4, 5, 6}

# Exercise 12
class1 = ["Alice", "Bob", "Charlie"]
class2 = ["Bob", "Diana", "Charlie"]
print(set(class1) & set(class2))  # {'Bob', 'Charlie'}

# Exercise 13
S = {1, 2, 3, 4}
S.discard(3)  # Safe removal
S.discard(100)  # No error
print(S)

# Exercise 14
primes = {2, 3, 5, 7, 11, 13, 17, 19}
print(17 in primes)  # True

# Exercise 15
A = {1, 2, 3, 4}
B = {2, 3, 4, 5}
C = {3, 4, 5, 6}
print(A & B & C)  # {3, 4}

Summary: When to Use What?

Data Type   Use When
List        Ordered, allows duplicates, need to modify
Tuple       Ordered, no modification needed, dictionary keys
Set         No duplicates, fast membership testing, set operations
Dict        Key-value mapping, fast lookup by key

Python Exceptions

Errors vs Bugs vs Exceptions

Syntax Errors

Errors in your code before it runs. Python can't even understand what you wrote.

# Missing colon
if True
    print("Hello")  # SyntaxError: expected ':'

# Unclosed parenthesis
print("Hello"  # SyntaxError: '(' was never closed

Fix: Correct the syntax. Python tells you exactly where the problem is.

Bugs

Your code runs, but it does the wrong thing. No error message - just incorrect behavior.

# Bug: wrong formula
def circle_area(radius):
    return 2 * 3.14 * radius  # Wrong! This is circumference, not area

print(circle_area(5))  # Returns 31.4, should be 78.5

Why "bug"? Legend says early computers had actual insects causing problems. The term stuck.

Fix: Debug your code - find and fix the logic error.

Exceptions

Errors that occur during execution. The code is syntactically correct, but something goes wrong at runtime.

# Runs fine until...
x = 10 / 0  # ZeroDivisionError: division by zero

# Or...
my_list = [1, 2, 3]
print(my_list[10])  # IndexError: list index out of range

Fix: Handle the exception or prevent the error condition.


What is an Exception?

An exception is Python's way of saying "something unexpected happened and I can't continue."

When an exception occurs:

  1. Python stops normal execution
  2. Creates an exception object with error details
  3. Looks for code to handle it
  4. If no handler found, program crashes with traceback
# Exception in action
print("Start")
x = 10 / 0  # Exception here!
print("End")  # Never reached

# Output:
# Start
# Traceback (most recent call last):
#   File "example.py", line 2, in <module>
#     x = 10 / 0
# ZeroDivisionError: division by zero

Common Exceptions

# ZeroDivisionError
10 / 0

# TypeError - wrong type
"hello" + 5

# ValueError - right type, wrong value
int("hello")

# IndexError - list index out of range
[1, 2, 3][10]

# KeyError - dictionary key not found
{'a': 1}['b']

# FileNotFoundError
open("nonexistent.txt")

# AttributeError - object has no attribute
"hello".append("!")

# NameError - variable not defined
print(undefined_variable)

# ImportError - module not found
import nonexistent_module

Handling Exceptions

Basic try/except

try:
    x = 10 / 0
except:
    print("Something went wrong!")

# Output: Something went wrong!

Problem: This catches ALL exceptions - even ones you didn't expect. Not recommended.

Catching a Specific Exception

try:
    x = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero!")

# Output: Cannot divide by zero!

Catching Multiple Specific Exceptions

try:
    value = int(input("Enter a number: "))
    result = 10 / value
except ValueError:
    print("That's not a valid number!")
except ZeroDivisionError:
    print("Cannot divide by zero!")

Catching Multiple Exceptions Together

try:
    # Some risky code
    pass
except (ValueError, TypeError):
    print("Value or Type error occurred!")

Getting Exception Details

try:
    x = 10 / 0
except ZeroDivisionError as e:
    print(f"Error: {e}")
    print(f"Type: {type(e).__name__}")

# Output:
# Error: division by zero
# Type: ZeroDivisionError

The Complete try/except/else/finally

try:
    # Code that might raise an exception
    result = 10 / 2
except ZeroDivisionError:
    # Runs if exception occurs
    print("Cannot divide by zero!")
else:
    # Runs if NO exception occurs
    print(f"Result: {result}")
finally:
    # ALWAYS runs, exception or not
    print("Cleanup complete")

# Output:
# Result: 5.0
# Cleanup complete

When to Use Each Part

Block     When It Runs          Use For
try       Always attempts       Code that might fail
except    If exception occurs   Handle the error
else      If NO exception       Code that depends on try success
finally   ALWAYS                Cleanup (close files, connections)

finally is Guaranteed

def risky_function():
    try:
        return 10 / 0
    except ZeroDivisionError:
        return "Error!"
    finally:
        print("This ALWAYS prints!")

result = risky_function()
# Output: This ALWAYS prints!
# result = "Error!"

Best Practices

1. Be Specific - Don't Catch Everything

# BAD - catches everything, hides bugs
try:
    do_something()
except:
    pass

# GOOD - catches only what you expect
try:
    do_something()
except ValueError:
    handle_value_error()

2. Don't Silence Exceptions Without Reason

# BAD - silently ignores errors
try:
    important_operation()
except Exception:
    pass  # What went wrong? We'll never know!

# GOOD - at least log it
try:
    important_operation()
except Exception as e:
    print(f"Error occurred: {e}")
    # or use logging.error(e)

3. Use else for Code That Depends on try Success

# Less clear
try:
    file = open("data.txt")
    content = file.read()
    process(content)
except FileNotFoundError:
    print("File not found")

# More clear - separate "risky" from "safe" code
try:
    file = open("data.txt")
except FileNotFoundError:
    print("File not found")
else:
    content = file.read()
    process(content)

4. Use finally for Cleanup

file = None
try:
    file = open("data.txt")
    content = file.read()
except FileNotFoundError:
    print("File not found")
finally:
    if file:
        file.close()  # Always close, even if error

# Even better - use context manager
with open("data.txt") as file:
    content = file.read()  # Automatically closes!

5. Catch Exceptions at the Right Level

# Don't catch too early
def read_config():
    # Let the caller handle missing file
    with open("config.txt") as f:
        return f.read()

# Catch at appropriate level
def main():
    try:
        config = read_config()
    except FileNotFoundError:
        print("Config file missing, using defaults")
        config = get_defaults()

Raising Exceptions

Use raise to throw your own exceptions:

def divide(a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero!")
    return a / b

try:
    result = divide(10, 0)
except ValueError as e:
    print(e)  # Cannot divide by zero!

Re-raising Exceptions

try:
    risky_operation()
except ValueError:
    print("Logging this error...")
    raise  # Re-raise the same exception

Built-in Exception Hierarchy

All exceptions inherit from BaseException. Here's the hierarchy:

BaseException
├── SystemExit
├── KeyboardInterrupt
├── GeneratorExit
└── Exception
    ├── StopIteration
    ├── ArithmeticError
    │   ├── FloatingPointError
    │   ├── OverflowError
    │   └── ZeroDivisionError
    ├── AssertionError
    ├── AttributeError
    ├── BufferError
    ├── EOFError
    ├── ImportError
    │   └── ModuleNotFoundError
    ├── LookupError
    │   ├── IndexError
    │   └── KeyError
    ├── MemoryError
    ├── NameError
    │   └── UnboundLocalError
    ├── OSError
    │   ├── FileExistsError
    │   ├── FileNotFoundError
    │   ├── IsADirectoryError
    │   ├── NotADirectoryError
    │   ├── PermissionError
    │   └── TimeoutError
    ├── ReferenceError
    ├── RuntimeError
    │   ├── NotImplementedError
    │   └── RecursionError
    ├── SyntaxError
    │   └── IndentationError
    │       └── TabError
    ├── TypeError
    └── ValueError
        └── UnicodeError
            ├── UnicodeDecodeError
            ├── UnicodeEncodeError
            └── UnicodeTranslateError

Why Hierarchy Matters

Catching a parent catches all children:

# Catches ZeroDivisionError, OverflowError, FloatingPointError
try:
    result = 10 / 0
except ArithmeticError:
    print("Math error!")

# Catches IndexError and KeyError
try:
    my_list[100]
except LookupError:
    print("Lookup failed!")

Tip: Catch Exception instead of using a bare except: clause. Exception does not include KeyboardInterrupt or SystemExit, so Ctrl+C can still stop your program.

# Better than bare except
try:
    do_something()
except Exception as e:
    print(f"Error: {e}")

User-Defined Exceptions

Create custom exceptions by inheriting from Exception:

Basic Custom Exception

class InvalidDNAError(Exception):
    """Raised when DNA sequence contains invalid characters"""
    pass

def validate_dna(sequence):
    valid_bases = set("ATGC")
    for base in sequence.upper():
        if base not in valid_bases:
            raise InvalidDNAError(f"Invalid base: {base}")
    return True

try:
    validate_dna("ATGXCCC")
except InvalidDNAError as e:
    print(f"Invalid DNA: {e}")

Custom Exception with Attributes

class InsufficientFundsError(Exception):
    """Raised when account has insufficient funds"""
    
    def __init__(self, balance, amount):
        self.balance = balance
        self.amount = amount
        self.shortage = amount - balance
        super().__init__(
            f"Cannot withdraw ${amount}. "
            f"Balance: ${balance}. "
            f"Short by: ${self.shortage}"
        )

class BankAccount:
    def __init__(self, balance):
        self.balance = balance
    
    def withdraw(self, amount):
        if amount > self.balance:
            raise InsufficientFundsError(self.balance, amount)
        self.balance -= amount
        return amount

# Usage
account = BankAccount(100)
try:
    account.withdraw(150)
except InsufficientFundsError as e:
    print(e)
    print(f"You need ${e.shortage} more")

# Output:
# Cannot withdraw $150. Balance: $100. Short by: $50
# You need $50 more

Exception Hierarchy for Your Project

# Base exception for your application
class BioinformaticsError(Exception):
    """Base exception for bioinformatics operations"""
    pass

# Specific exceptions
class SequenceError(BioinformaticsError):
    """Base for sequence-related errors"""
    pass

class InvalidDNAError(SequenceError):
    """Invalid DNA sequence"""
    pass

class InvalidRNAError(SequenceError):
    """Invalid RNA sequence"""
    pass

class AlignmentError(BioinformaticsError):
    """Sequence alignment failed"""
    pass

# Now you can catch at different levels
try:
    process_sequence()
except InvalidDNAError:
    print("DNA issue")
except SequenceError:
    print("Some sequence issue")
except BioinformaticsError:
    print("General bioinformatics error")

Exercises

Exercise 1: Write code that catches a ZeroDivisionError and prints a friendly message.

Exercise 2: Ask the user for a number and divide 100 by it, handling both ValueError (input is not a number) and ZeroDivisionError (input is zero).

Exercise 3: Write a function that opens a file and handles FileNotFoundError.

Exercise 4: Create a function that takes a list and index, returns the element, handles IndexError.

Exercise 5: Write code that handles KeyError when accessing a dictionary.

Exercise 6: Create a custom NegativeNumberError and raise it if a number is negative.

Exercise 7: Write a function that converts string to int, handling ValueError, and returns 0 on failure.

Exercise 8: Use try/except/else/finally to read a file and ensure it's always closed.

Exercise 9: Create a custom InvalidAgeError with min and max age attributes.

Exercise 10: Write a function that validates an email (must contain @), raise ValueError if invalid.

Exercise 11: Handle multiple exceptions: TypeError, ValueError, ZeroDivisionError in one block.

Exercise 12: Create a hierarchy: ValidationError → EmailError, PhoneError.

Exercise 13: Re-raise an exception after logging it.

Exercise 14: Create an InvalidSequenceError for DNA validation that stores the invalid character as an attribute.

Exercise 15: Write a "safe divide" function that returns None on any error instead of crashing.

Solutions
# Exercise 1
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero!")

# Exercise 2
try:
    num = int(input("Enter a number: "))
    result = 100 / num
    print(f"100 / {num} = {result}")
except ValueError:
    print("That's not a valid number!")
except ZeroDivisionError:
    print("Cannot divide by zero!")

# Exercise 3
def read_file(filename):
    try:
        with open(filename) as f:
            return f.read()
    except FileNotFoundError:
        print(f"File '{filename}' not found")
        return None

# Exercise 4
def safe_get(lst, index):
    try:
        return lst[index]
    except IndexError:
        print(f"Index {index} out of range")
        return None

# Exercise 5
d = {'a': 1, 'b': 2}
try:
    value = d['c']
except KeyError:
    print("Key not found!")
    value = None

# Exercise 6
class NegativeNumberError(Exception):
    pass

def check_positive(n):
    if n < 0:
        raise NegativeNumberError(f"{n} is negative!")
    return n

# Exercise 7
def safe_int(s):
    try:
        return int(s)
    except ValueError:
        return 0

# Exercise 8
file = None
try:
    file = open("data.txt")
    content = file.read()
except FileNotFoundError:
    print("File not found")
    content = ""
else:
    print("File read successfully")
finally:
    if file:
        file.close()
    print("Cleanup done")

# Exercise 9
class InvalidAgeError(Exception):
    def __init__(self, age, min_age=0, max_age=150):
        self.age = age
        self.min_age = min_age
        self.max_age = max_age
        super().__init__(f"Age {age} not in range [{min_age}, {max_age}]")

# Exercise 10
def validate_email(email):
    if '@' not in email:
        raise ValueError(f"Invalid email: {email} (missing @)")
    return True

# Exercise 11
try:
    # risky code
    pass
except (TypeError, ValueError, ZeroDivisionError) as e:
    print(f"Error: {e}")

# Exercise 12
class ValidationError(Exception):
    pass

class EmailError(ValidationError):
    pass

class PhoneError(ValidationError):
    pass

# Exercise 13
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Logging: Division by zero occurred")
    raise

# Exercise 14
class InvalidSequenceError(Exception):
    def __init__(self, sequence, invalid_char):
        self.sequence = sequence
        self.invalid_char = invalid_char
        super().__init__(f"Invalid character '{invalid_char}' in sequence")

def validate_dna(seq):
    for char in seq:
        if char not in "ATGC":
            raise InvalidSequenceError(seq, char)
    return True

# Exercise 15
def safe_divide(a, b):
    try:
        return a / b
    except Exception:
        return None

print(safe_divide(10, 2))   # 5.0
print(safe_divide(10, 0))   # None
print(safe_divide("a", 2))  # None

Summary

Syntax Error → Code is malformed, won't run
Bug → Code runs but gives wrong result
Exception → Runtime error, can be handled
try/except → Catch and handle exceptions
else → Runs if no exception
finally → Always runs (cleanup)
raise → Throw an exception
Custom Exception → Inherit from Exception

Best Practices:

  1. Catch specific exceptions, not bare except:
  2. Don't silence exceptions without reason
  3. Use finally for cleanup
  4. Create custom exceptions for your domain
  5. Build exception hierarchies for complex projects

Useful modules

This section is planned to be added later.

Files and the sys Module

Reading Files

✅
Always Use Context Manager (with)

Files automatically close, even if errors occur. This is the modern, safe way.

# ✅ Best way - file automatically closes
with open("data.txt", "r") as file:
    content = file.read()
    print(content)

# โŒ Old way - must manually close (don't do this)
file = open("data.txt", "r")
content = file.read()
file.close()  # Easy to forget!

File Modes

📝
Common Modes

"r" → Read (default)
"w" → Write (overwrites entire file!)
"a" → Append (adds to end)
"x" → Create (fails if exists)
"rb"/"wb" → Binary modes

# Read
with open("data.txt", "r") as f:
    content = f.read()

# Write (overwrites!)
with open("output.txt", "w") as f:
    f.write("Hello, World!")

# Append (adds to end)
with open("log.txt", "a") as f:
    f.write("New entry\n")

Reading Methods

read() - Entire File

with open("data.txt") as f:
    content = f.read()  # Whole file as string

readline() - One Line at a Time

with open("data.txt") as f:
    first = f.readline()   # First line
    second = f.readline()  # Second line

readlines() - All Lines as List

with open("data.txt") as f:
    lines = f.readlines()  # ['line1\n', 'line2\n', ...]

Looping Through Files

💡
Best Practice: Iterate Directly

Most memory efficient - reads one line at a time. Works with huge files!

# Best way - memory efficient
with open("data.txt") as f:
    for line in f:
        print(line, end="")  # Line already has \n

# With line numbers
with open("data.txt") as f:
    for i, line in enumerate(f, start=1):
        print(f"{i}: {line}", end="")

# Strip newlines
with open("data.txt") as f:
    for line in f:
        line = line.strip()  # Remove \n
        print(line)

# Process as list
with open("data.txt") as f:
    lines = [line.strip() for line in f]

Writing Files

write() - Single String

with open("output.txt", "w") as f:
    f.write("Hello\n")
    f.write("World\n")

writelines() - List of Strings

⚠️
writelines() Doesn't Add Newlines

You must include \n yourself!

lines = ["Line 1\n", "Line 2\n", "Line 3\n"]
with open("output.txt", "w") as f:
    f.writelines(lines)
with open("output.txt", "w") as f:
    print("Hello, World!", file=f)
    print("Another line", file=f)

Processing Lines

Splitting

# By delimiter
line = "name,age,city"
parts = line.split(",")  # ['name', 'age', 'city']

# By whitespace (default)
line = "John   25   NYC"
parts = line.split()  # ['John', '25', 'NYC']

# With max splits
line = "a,b,c,d,e"
parts = line.split(",", 2)  # ['a', 'b', 'c,d,e']

Joining

words = ['Hello', 'World']
sentence = " ".join(words)  # "Hello World"

lines = ['line1', 'line2', 'line3']
content = "\n".join(lines)

Processing CSV Data

with open("data.csv") as f:
    for line in f:
        parts = line.strip().split(",")
        name, age, city = parts
        print(f"{name} is {age} from {city}")

The sys Module

Command Line Arguments

import sys

print(sys.argv)  # List of all arguments
# python script.py hello world
# Output: ['script.py', 'hello', 'world']

print(sys.argv[0])  # Script name
print(sys.argv[1])  # First argument
print(len(sys.argv))  # Number of arguments

Basic Argument Handling

import sys

if len(sys.argv) < 2:
    print("Usage: python script.py <filename>")
    sys.exit(1)

filename = sys.argv[1]
print(f"Processing: {filename}")

Processing Multiple Arguments

import sys

# python script.py file1.txt file2.txt file3.txt
for filename in sys.argv[1:]:  # Skip script name
    print(f"Processing: {filename}")

Argument Validation

💻
Complete Template

Validation pattern for command-line scripts

import sys
import os

def main():
    # Check argument count
    if len(sys.argv) != 3:
        print("Usage: python script.py <input> <output>")
        sys.exit(1)
    
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    
    # Check if input exists
    if not os.path.exists(input_file):
        print(f"Error: {input_file} not found")
        sys.exit(1)
    
    # Check if output exists
    if os.path.exists(output_file):
        response = input(f"{output_file} exists. Overwrite? (y/n): ")
        if response.lower() != 'y':
            print("Aborted")
            sys.exit(0)
    
    # Process files
    process(input_file, output_file)

if __name__ == "__main__":
    main()

Standard Streams

stdin, stdout, stderr

import sys

# Read from stdin
line = sys.stdin.readline()

# Write to stdout (like print)
sys.stdout.write("Hello\n")

# Write to stderr (for errors)
sys.stderr.write("Error: failed\n")

Reading from Pipe

# In terminal
cat data.txt | python script.py
echo "Hello" | python script.py
# script.py
import sys

for line in sys.stdin:
    print(f"Received: {line.strip()}")

Exit Codes

📝
Convention

0 → Success
1 → General error
2 → Command line error

import sys

# Exit with success
sys.exit(0)

# Exit with error
sys.exit(1)

# Exit with message
sys.exit("Error: something went wrong")

Useful sys Attributes

import sys

# Python version
print(sys.version)         # '3.10.0 (default, ...)'
print(sys.version_info)    # sys.version_info(major=3, ...)

# Platform
print(sys.platform)        # 'linux', 'darwin', 'win32'

# Module search paths
print(sys.path)

# Maximum integer
print(sys.maxsize)

# Default encoding
print(sys.getdefaultencoding())  # 'utf-8'

Building Command Line Tools

Simple Script Template

#!/usr/bin/env python3
"""Simple command line tool."""

import sys
import os

def print_usage():
    print("Usage: python tool.py <input_file>")
    print("Options:")
    print("  -h, --help    Show help")
    print("  -v, --verbose Verbose output")

def main():
    # Parse arguments
    if len(sys.argv) < 2 or sys.argv[1] in ['-h', '--help']:
        print_usage()
        sys.exit(0)
    
    verbose = '-v' in sys.argv or '--verbose' in sys.argv
    
    # Get input file
    input_file = None
    for arg in sys.argv[1:]:
        if not arg.startswith('-'):
            input_file = arg
            break
    
    if not input_file:
        print("Error: No input file", file=sys.stderr)
        sys.exit(1)
    
    if not os.path.exists(input_file):
        print(f"Error: {input_file} not found", file=sys.stderr)
        sys.exit(1)
    
    # Process
    if verbose:
        print(f"Processing {input_file}...")
    
    with open(input_file) as f:
        for line in f:
            print(line.strip())
    
    if verbose:
        print("Done!")

if __name__ == "__main__":
    main()

Word Count Tool

💻
Example: wc Clone

Count lines, words, and characters

#!/usr/bin/env python3
import sys

def count_file(filename):
    lines = words = chars = 0
    with open(filename) as f:
        for line in f:
            lines += 1
            words += len(line.split())
            chars += len(line)
    return lines, words, chars

def main():
    if len(sys.argv) < 2:
        print("Usage: python wc.py <file1> [file2] ...")
        sys.exit(1)
    
    total_l = total_w = total_c = 0
    
    for filename in sys.argv[1:]:
        try:
            l, w, c = count_file(filename)
            print(f"{l:8} {w:8} {c:8} {filename}")
            total_l += l
            total_w += w
            total_c += c
        except FileNotFoundError:
            print(f"Error: {filename} not found", file=sys.stderr)
    
    if len(sys.argv) > 2:
        print(f"{total_l:8} {total_w:8} {total_c:8} total")

if __name__ == "__main__":
    main()

FASTA Sequence Counter

#!/usr/bin/env python3
import sys

def process_fasta(filename):
    sequences = 0
    total_bases = 0
    
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                sequences += 1
            else:
                total_bases += len(line)
    
    return sequences, total_bases

def main():
    if len(sys.argv) != 2:
        print("Usage: python fasta_count.py <file.fasta>")
        sys.exit(1)
    
    filename = sys.argv[1]
    
    try:
        seqs, bases = process_fasta(filename)
        print(f"Sequences: {seqs}")
        print(f"Total bases: {bases}")
        print(f"Average: {bases/seqs:.1f}")
    except FileNotFoundError:
        print(f"Error: {filename} not found", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()

File Path Operations

import os

# Join paths (cross-platform)
path = os.path.join("folder", "subfolder", "file.txt")

# Get filename
os.path.basename("/path/to/file.txt")  # "file.txt"

# Get directory
os.path.dirname("/path/to/file.txt")   # "/path/to"

# Split extension
name, ext = os.path.splitext("data.txt")  # "data", ".txt"

# Check existence
os.path.exists("file.txt")    # True/False
os.path.isfile("file.txt")    # True if file
os.path.isdir("folder")       # True if directory

# Get file size
os.path.getsize("file.txt")   # Size in bytes

# Get absolute path
os.path.abspath("file.txt")

Practice Exercises

💻
Basic File Operations

1. Read file and print with line numbers
2. Count lines in a file
3. Copy file contents (use sys.argv)
4. Parse and format CSV rows
5. Reverse file contents

💻
Command Line Tools

6. Search for word and print matching lines
7. Read stdin, write stdout in uppercase
8. Validate arguments (file must exist)
9. Word frequency counter (top 10 words)
10. Parse FASTA (extract names and lengths)

💻
Advanced Tools

11. Merge multiple files into one
12. Remove blank lines from file
13. Convert file to uppercase
14. Log analyzer (count ERROR/WARNING/INFO)
15. Build a grep-like tool: python grep.py <pattern> <file>


Quick Reference

📝
Essential Commands

with open(file) as f: → Open safely
f.read() → Read all
for line in f: → Iterate lines
f.write(string) → Write
sys.argv → Get arguments
sys.exit(code) → Exit program
print(..., file=sys.stderr) → Error output
os.path.exists(file) → Check file
os.path.join(a, b) → Join paths


Best Practices

✅
Follow These Rules

1. Always use with for files
2. Validate command line arguments
3. Handle missing files gracefully
4. Use sys.exit(1) for errors
5. Write errors to stderr
6. Use os.path for cross-platform paths


Solution Hints

💡
Exercise 1: Line Numbers

Use enumerate(f, start=1) when iterating

💡
Exercise 6: Search Tool

Check if word in line: for each line

💡
Exercise 9: Word Frequency

Use from collections import Counter and .most_common(10)

💡
Exercise 15: Grep Tool
Exercise 15: Grep Tool

Use re.search(pattern, line) for pattern matching
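
To make the Counter hint concrete, here is one possible solution to Exercise 9 (assumes the filename arrives via sys.argv):

import sys
from collections import Counter

with open(sys.argv[1]) as f:
    words = f.read().lower().split()

for word, count in Counter(words).most_common(10):
    print(f"{word}: {count}")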

Debugging

Theory

PyCharm Debug Tutorial

Using the IDLE Debugger
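
Even without an IDE, Python ships a command-line debugger (pdb). Since Python 3.7, dropping breakpoint() into your code pauses execution right there; a minimal sketch:

def total(values):
    result = 0
    for v in values:
        breakpoint()  # pauses here; try 'p v', 'p result', 'n' (next), 'c' (continue)
        result += v
    return result

total([1, 2, 3])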

Regular Expressions in Python

📖
What are Regular Expressions?

Regular expressions (regex) are powerful patterns used to search, match, and manipulate text. You can find patterns, not just exact text.

Examples:

  • Find all email addresses in a document
  • Validate phone numbers
  • Extract gene IDs from biological data
  • Find DNA/RNA sequence patterns
  • Clean messy text data
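
For a quick taste of the first use case, here is a rough pattern (good enough for a demo, not a strict email validator); the syntax is explained level by level below:

import re

text = "Contact alice@uni.edu or bob@lab.org for data"
print(re.findall(r"\S+@\S+\.\S+", text))  # ['alice@uni.edu', 'bob@lab.org']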

Getting Started

Import the Module

import re

💡
Always Use Raw Strings

Write regex patterns with the r prefix: r"pattern"

Why Raw Strings Matter

# Normal string - \n becomes a newline
print("Hello\nWorld")
# Output:
# Hello
# World

# Raw string - \n stays as literal characters
print(r"Hello\nWorld")
# Output: Hello\nWorld

In regex, backslashes are special! Raw strings prevent confusion:

# โŒ Confusing without raw string
pattern = "\\d+"

# โœ… Clean with raw string
pattern = r"\d+"
โœ…
Golden Rule

Always write regex patterns as raw strings: r"pattern"


Level 1: Literal Matching

The simplest regex matches exact text.

import re

dna = "ATGCGATCG"

# Search for exact text "ATG"
if re.search(r"ATG", dna):
    print("Found ATG!")

Your First Function: re.search()

ℹ️
re.search(pattern, text)

Looks for a pattern anywhere in text. Returns a match object if found, None if not.

match = re.search(r"ATG", "ATGCCC")
if match:
    print("Found:", match.group())    # Found: ATG
    print("Position:", match.start())  # Position: 0

⚠️
Case Sensitive

Regex is case-sensitive by default! "ATG" ≠ "atg"

Practice

💻
Exercise 1.1

Find which sequences contain "ATG": ["ATGCCC", "TTTAAA", "ATGATG"]

💻
Exercise 1.2

Check if "PYTHON" appears in: "I love PYTHON programming"


Level 2: The Dot . - Match Any Character

The dot . matches any single character (except newline).

# Find "A" + any character + "G"
dna = "ATGCGATCG"
matches = re.findall(r"A.G", dna)
print(matches)  # ['ATG', 'ACG']

New Function: re.findall()

ℹ️
re.findall(pattern, text)

Finds all matches and returns them as a list.

text = "cat bat rat"
print(re.findall(r".at", text))  # ['cat', 'bat', 'rat']

Practice

💻
Exercise 2.1

Match "b.t" (b + any char + t) in: "bat bet bit bot but"

💻
Exercise 2.2

Find all 3-letter patterns starting with 'c' in: "cat cow cup car"


Level 3: Character Classes [ ]

Square brackets let you specify which characters to match.

# Match any nucleotide (A, T, G, or C)
dna = "ATGCXYZ"
nucleotides = re.findall(r"[ATGC]", dna)
print(nucleotides)  # ['A', 'T', 'G', 'C']

Character Ranges

Use - for ranges:

re.findall(r"[0-9]", "Room 123")      # ['1', '2', '3']
re.findall(r"[a-z]", "Hello")         # ['e', 'l', 'l', 'o']
re.findall(r"[A-Z]", "Hello")         # ['H']
re.findall(r"[A-Za-z]", "Hello123")   # ['H', 'e', 'l', 'l', 'o']

Negation with ^

^ inside brackets means "NOT these characters":

# Match anything that's NOT a nucleotide
dna = "ATGC-X123"
non_nucleotides = re.findall(r"[^ATGC]", dna)
print(non_nucleotides)  # ['-', 'X', '1', '2', '3']

Practice

💻
Exercise 3.1

Find all digits in: "Gene ID: ABC123"

💻
Exercise 3.2

Find all vowels in: "bioinformatics"

💻
Exercise 3.3

Find all NON-digits in: "Room123"


Level 4: Quantifiers - Repeating Patterns

Quantifiers specify how many times a pattern repeats.

📝
Quantifier Reference

* → 0 or more times
+ → 1 or more times
? → 0 or 1 time (optional)
{n} → Exactly n times
{n,m} → Between n and m times

Examples

# Find sequences of 2+ C's
dna = "ATGCCCAAAGGG"
print(re.findall(r"C+", dna))       # ['CCC']
print(re.findall(r"C{2,}", dna))    # ['CCC']

# Find all digit groups
text = "Call 123 or 4567"
print(re.findall(r"\d+", text))     # ['123', '4567']

# Optional minus sign
print(re.findall(r"-?\d+", "123 -456 789"))  # ['123', '-456', '789']

Combining with Character Classes

# Find all 3-letter codons
dna = "ATGCCCAAATTT"
codons = re.findall(r"[ATGC]{3}", dna)
print(codons)  # ['ATG', 'CCC', 'AAA', 'TTT']
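
One caveat: quantifiers are greedy by default, matching as much as possible. Adding ? after a quantifier makes it lazy, matching as little as possible:

import re

html = "<b>bold</b> and <i>italic</i>"
print(re.findall(r"<.+>", html))   # ['<b>bold</b> and <i>italic</i>'] - greedy
print(re.findall(r"<.+?>", html))  # ['<b>', '</b>', '<i>', '</i>'] - lazy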

Practice

💻
Exercise 4.1

Find sequences of exactly 3 A's in: "ATGCCCAAAGGGTTT"

💻
Exercise 4.2

Match "colou?r" (u is optional) in: "color colour"

💻
Exercise 4.3

Find all digit sequences in: "123 4567 89"


Level 5: Escaping Special Characters

Special characters like . * + ? [ ] ( ) have special meanings. To match them literally, escape with \.

# โŒ Wrong - dot matches ANY character
text = "file.txt and fileXtxt"
print(re.findall(r"file.txt", text))  # ['file.txt', 'fileXtxt']

# โœ… Correct - escaped dot matches only literal dot
print(re.findall(r"file\.txt", text))  # ['file.txt']

Common Examples

re.search(r"\$100", "$100")           # Literal dollar sign
re.search(r"What\?", "What?")         # Literal question mark
re.search(r"C\+\+", "C++")            # Literal plus signs
re.search(r"\(test\)", "(test)")      # Literal parentheses

Practice

💻
Exercise 5.1

Match "data.txt" (with literal dot) in: "File: data.txt"

💻
Exercise 5.2

Match "c++" in: "I code in c++ and python"


Level 6: Predefined Shortcuts

Python provides shortcuts for common character types.

📝
Common Shortcuts

\d → Any digit [0-9]
\D → Any non-digit
\w → Word character [A-Za-z0-9_]
\W → Non-word character
\s → Whitespace (space, tab, newline)
\S → Non-whitespace

Examples

# Find all digits
text = "Room 123, Floor 4"
print(re.findall(r"\d+", text))  # ['123', '4']

# Find all words
sentence = "DNA_seq-123 test"
print(re.findall(r"\w+", sentence))  # ['DNA_seq', '123', 'test']

# Split on whitespace
data = "ATG  CCC\tAAA"
print(re.split(r"\s+", data))  # ['ATG', 'CCC', 'AAA']

Practice

💻
Exercise 6.1

Find all word characters in: "Hello-World"

💻
Exercise 6.2

Split on whitespace: "ATG CCC\tAAA"


Level 7: Anchors - Position Matching

Anchors match positions, not characters.

📝
Anchor Reference

^ → Start of string
$ → End of string
\b → Word boundary
\B → Not a word boundary

Examples

dna = "ATGCCCATG"

# Match only at start
print(re.search(r"^ATG", dna))   # Matches!
print(re.search(r"^CCC", dna))   # None

# Match only at end
print(re.search(r"ATG$", dna))   # Matches!
print(re.search(r"CCC$", dna))   # None

# Word boundaries - whole words only
text = "The cat concatenated strings"
print(re.findall(r"\bcat\b", text))  # ['cat'] - only the word
print(re.findall(r"cat", text))      # ['cat', 'cat'] - both

Practice

💻
Exercise 7.1

Find sequences starting with "ATG": ["ATGCCC", "CCCATG", "TACATG"]

💻
Exercise 7.2

Match whole word "cat" (not "concatenate") in: "The cat sat"


Level 8: Alternation - OR Operator |

The pipe | means "match this OR that".

# Match either ATG or AUG
dna = "ATG is DNA, AUG is RNA"
print(re.findall(r"ATG|AUG", dna))  # ['ATG', 'AUG']

# Match stop codons
rna = "AUGCCCUAAUAGUGA"
print(re.findall(r"UAA|UAG|UGA", rna))  # ['UAA', 'UAG', 'UGA']

Practice

💻
Exercise 8.1

Match "email" or "phone" in: "Contact via email or phone"

💻
Exercise 8.2

Find stop codons (TAA, TAG, TGA) in: ["ATG", "TAA", "TAG"]


Level 9: Groups and Capturing ( )

Parentheses create groups you can extract separately.

# Extract parts of an email
email = "user@example.com"
match = re.search(r"(\w+)@(\w+)\.(\w+)", email)
if match:
    print("Username:", match.group(1))   # user
    print("Domain:", match.group(2))     # example
    print("TLD:", match.group(3))        # com
    print("Full:", match.group(0))       # user@example.com

Named Groups

Use (?P<name>...) for readable names:

gene_id = "NM_001234"
match = re.search(r"(?P<prefix>[A-Z]+)_(?P<number>\d+)", gene_id)
if match:
    print(match.group('prefix'))  # NM
    print(match.group('number'))  # 001234

Practice

💻
Exercise 9.1

Extract area code from: "Call 123-456-7890"

💻
Exercise 9.2

Extract year, month, day from: "2024-11-20"


Level 10: More Useful Functions

re.sub() - Find and Replace

# Mask stop codons
dna = "ATGTAACCC"
masked = re.sub(r"TAA|TAG|TGA", "XXX", dna)
print(masked)  # ATGXXXCCC

# Clean multiple spaces
text = "too    many     spaces"
clean = re.sub(r"\s+", " ", text)
print(clean)  # "too many spaces"

re.compile() - Reusable Patterns

# Compile once, use many times (more efficient!)
pattern = re.compile(r"ATG")

for seq in ["ATGCCC", "TTTAAA", "GCGCGC"]:
    if pattern.search(seq):
        print(f"{seq} contains ATG")

Practice

💻
Exercise 10.1

Replace all A's with N's in: "ATGCCCAAA"

💻
Exercise 10.2

Mask all digits with "X" in: "Room123Floor4"


Biological Examples

💡
Real Applications

Here's how regex is used in bioinformatics!

Validate DNA Sequences

def is_valid_dna(sequence):
    """Check if sequence contains only A, T, G, C"""
    return bool(re.match(r"^[ATGC]+$", sequence))

print(is_valid_dna("ATGCCC"))  # True
print(is_valid_dna("ATGXCC"))  # False

Find Restriction Sites

def find_ecori(dna):
    """Find EcoRI recognition sites (GAATTC)"""
    matches = re.finditer(r"GAATTC", dna)
    return [(m.start(), m.group()) for m in matches]

dna = "ATGGAATTCCCCGAATTC"
print(find_ecori(dna))  # [(3, 'GAATTC'), (12, 'GAATTC')]

Count Codons

def count_codons(dna):
    """Split DNA into codons (groups of 3)"""
    return re.findall(r"[ATGC]{3}", dna)

dna = "ATGCCCAAATTT"
print(count_codons(dna))  # ['ATG', 'CCC', 'AAA', 'TTT']

Extract Gene IDs

def extract_gene_ids(text):
    """Extract gene IDs like NM_123456"""
    return re.findall(r"[A-Z]{2}_\d+", text)

text = "Genes NM_001234 and XM_567890 are important"
print(extract_gene_ids(text))  # ['NM_001234', 'XM_567890']

Quick Reference

📝
Pattern Cheat Sheet

abc → Literal text
. → Any character
[abc] → Any of a, b, c
[^abc] → NOT a, b, c
[a-z] → Range
* → 0 or more
+ → 1 or more
? → 0 or 1 (optional)
{n} → Exactly n times
\d → Digit
\w → Word character
\s → Whitespace
^ → Start of string
$ → End of string
\b → Word boundary
| → OR
(...) → Capture group


Key Functions Summary

ℹ️
Function Reference

re.search(pattern, text) → Find first match
re.findall(pattern, text) → Find all matches
re.finditer(pattern, text) → Iterator of matches
re.sub(pattern, replacement, text) → Replace matches
re.split(pattern, text) → Split on pattern
re.compile(pattern) → Reusable pattern


Resources

Object-Oriented Programming in Python

Object-Oriented Programming (OOP) is a way of organizing code by bundling related data and functions together into "objects". Instead of writing separate functions that work on data, you create objects that contain both the data and the functions that work with that data.
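
As a minimal sketch of the difference (the syntax is explained step by step below; the gene example is purely illustrative):

# Without OOP: plain data plus a separate function
gene = {"name": "BRCA1", "length": 81189}

def describe(gene):
    return f"{gene['name']} ({gene['length']} bp)"

# With OOP: the data and the behavior live in one object
class Gene:
    def __init__(self, name, length):
        self.name = name
        self.length = length

    def describe(self):
        return f"{self.name} ({self.length} bp)"

print(describe(gene))                   # BRCA1 (81189 bp)
print(Gene("BRCA1", 81189).describe())  # BRCA1 (81189 bp)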

Why Learn OOP?

OOP helps you write code that is easier to understand, reuse, and maintain. It mirrors how we think about the real world - objects with properties and behaviors.

The four pillars of OOP:

  1. Encapsulation - Bundle data and methods together
  2. Abstraction - Hide complex implementation details
  3. Inheritance - Create new classes based on existing ones
  4. Polymorphism - Same interface, different implementations

Level 1: Understanding Classes and Objects

What is a Class?

A class is a blueprint or template for creating objects. Think of it like a cookie cutter - it defines the shape, but it's not the cookie itself.

# This is a class - a blueprint for dogs
class Dog:
    pass  # Empty for now

Naming Convention

Classes use PascalCase (UpperCamelCase):

class Dog:              # ✓ Good
class BankAccount:      # ✓ Good
class DataProcessor:    # ✓ Good

class my_class:         # ✗ Bad (snake_case)
class myClass:          # ✗ Bad (camelCase)

What is an Object (Instance)?

An object (or instance) is an actual "thing" created from the class blueprint. If the class is a cookie cutter, the object is the actual cookie.

class Dog:
    pass

# Creating objects (instances)
buddy = Dog()  # buddy is an object
max_dog = Dog()  # max_dog is another object

# Both are dogs, but they're separate objects
print(type(buddy))  # <class '__main__.Dog'>

Terminology:

  • Dog is the class (blueprint)
  • buddy and max_dog are instances or objects (actual things)
  • We say: "buddy is an instance of Dog" or "buddy is a Dog object"

Level 2: Attributes - Giving Objects Data

Attributes are variables that store data inside an object. They represent the object's properties or state.

Instance Attributes

Instance attributes are unique to each object:

class Dog:
    def __init__(self, name, age):
        self.name = name  # Instance attribute
        self.age = age    # Instance attribute

# Create two different dogs
buddy = Dog("Buddy", 3)
max_dog = Dog("Max", 5)

# Each has its own attributes
print(buddy.name)    # "Buddy"
print(max_dog.name)  # "Max"
print(buddy.age)     # 3
print(max_dog.age)   # 5

Understanding __init__

__init__ is a special method called a constructor. It runs automatically when you create a new object.

class Dog:
    def __init__(self, name, age):
        print(f"Creating a dog named {name}!")
        self.name = name
        self.age = age

buddy = Dog("Buddy", 3)  
# Prints: "Creating a dog named Buddy!"

What __init__ does:

  • Initializes (sets up) the new object's attributes
  • Runs automatically when you call Dog(...)
  • First parameter is always self

The double underscores in __init__ are why it's called a "dunder" (double-underscore) method. Dunders mark special methods that Python recognizes for specific purposes.

Understanding self

self refers to the specific object you're working with:

class Dog:
    def __init__(self, name):
        self.name = name  # self.name means "THIS dog's name"

buddy = Dog("Buddy")
# When creating buddy, self refers to buddy
# So self.name = "Buddy" stores "Buddy" in buddy's name attribute

max_dog = Dog("Max")
# When creating max_dog, self refers to max_dog
# So self.name = "Max" stores "Max" in max_dog's name attribute

Important:

  • self is just a naming convention (you could use another name, but don't!)
  • Always include self as the first parameter in methods
  • You don't pass self when calling methods - Python does it automatically

Class Attributes

Class attributes are shared by ALL objects of that class:

class Dog:
    species = "Canis familiaris"  # Class attribute (shared)
    
    def __init__(self, name):
        self.name = name  # Instance attribute (unique)

buddy = Dog("Buddy")
max_dog = Dog("Max")

print(buddy.species)   # "Canis familiaris"
print(max_dog.species) # "Canis familiaris" (same for both)
print(buddy.name)      # "Buddy" (different)
print(max_dog.name)    # "Max" (different)

Practice:

Exercise 1: Create a Cat class with name and color attributes

Exercise 2: Create two cat objects with different names and colors

Exercise 3: Create a Book class with title, author, and pages attributes

Exercise 4: Add a class attribute book_count to track how many books exist

Exercise 5: Create a Student class with name and grade attributes

Solutions
# Exercise 1 & 2
class Cat:
    def __init__(self, name, color):
        self.name = name
        self.color = color

whiskers = Cat("Whiskers", "orange")
mittens = Cat("Mittens", "black")
print(whiskers.name, whiskers.color)  # Whiskers orange
print(mittens.name, mittens.color)    # Mittens black

# Exercise 3
class Book:
    def __init__(self, title, author, pages):
        self.title = title
        self.author = author
        self.pages = pages

book1 = Book("Python Basics", "John Doe", 300)
print(book1.title)  # Python Basics

# Exercise 4
class Book:
    book_count = 0  # Class attribute
    
    def __init__(self, title, author):
        self.title = title
        self.author = author
        Book.book_count += 1

book1 = Book("Book 1", "Author 1")
book2 = Book("Book 2", "Author 2")
print(Book.book_count)  # 2

# Exercise 5
class Student:
    def __init__(self, name, grade):
        self.name = name
        self.grade = grade

student = Student("Alice", "A")
print(student.name, student.grade)  # Alice A

Level 3: Methods - Giving Objects Behavior

Methods are functions defined inside a class. They define what objects can do.

Instance Methods

Instance methods operate on a specific object and can access its attributes:

class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def bark(self):  # Instance method
        return f"{self.name} says Woof!"
    
    def get_age_in_dog_years(self):
        return self.age * 7

buddy = Dog("Buddy", 3)
print(buddy.bark())                    # "Buddy says Woof!"
print(buddy.get_age_in_dog_years())    # 21

Key points:

  • First parameter is always self
  • Can access object's attributes using self.attribute_name
  • Called using dot notation: object.method()

Methods Can Modify Attributes

Methods can both read and change an object's attributes:

class BankAccount:
    def __init__(self, balance):
        self.balance = balance
    
    def deposit(self, amount):
        self.balance += amount  # Modify the balance
        return self.balance
    
    def withdraw(self, amount):
        if amount <= self.balance:
            self.balance -= amount
            return self.balance
        else:
            return "Insufficient funds"
    
    def get_balance(self):
        return self.balance

account = BankAccount(100)
account.deposit(50)
print(account.get_balance())  # 150
account.withdraw(30)
print(account.get_balance())  # 120

Practice: Methods

Exercise 1: Add a meow() method to the Cat class

Exercise 2: Add a have_birthday() method to Dog that increases age by 1

Exercise 3: Create a Rectangle class with width, height, and methods area() and perimeter()

Exercise 4: Add a description() method to Book that returns a formatted string

Exercise 5: Create a Counter class with increment(), decrement(), and reset() methods

Solutions
# Exercise 1
class Cat:
    def __init__(self, name):
        self.name = name
    
    def meow(self):
        return f"{self.name} says Meow!"

cat = Cat("Whiskers")
print(cat.meow())  # Whiskers says Meow!

# Exercise 2
class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def have_birthday(self):
        self.age += 1
        return f"{self.name} is now {self.age} years old!"

dog = Dog("Buddy", 3)
print(dog.have_birthday())  # Buddy is now 4 years old!

# Exercise 3
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height
    
    def area(self):
        return self.width * self.height
    
    def perimeter(self):
        return 2 * (self.width + self.height)

rect = Rectangle(5, 3)
print(rect.area())       # 15
print(rect.perimeter())  # 16

# Exercise 4
class Book:
    def __init__(self, title, author, pages):
        self.title = title
        self.author = author
        self.pages = pages
    
    def description(self):
        return f"'{self.title}' by {self.author}, {self.pages} pages"

book = Book("Python Basics", "John Doe", 300)
print(book.description())  # 'Python Basics' by John Doe, 300 pages

# Exercise 5
class Counter:
    def __init__(self):
        self.count = 0
    
    def increment(self):
        self.count += 1
    
    def decrement(self):
        self.count -= 1
    
    def reset(self):
        self.count = 0
    
    def get_count(self):
        return self.count

counter = Counter()
counter.increment()
counter.increment()
print(counter.get_count())  # 2
counter.decrement()
print(counter.get_count())  # 1
counter.reset()
print(counter.get_count())  # 0

Level 4: Inheritance - Reusing Code

Inheritance lets you create a new class based on an existing class. The new class inherits attributes and methods from the parent.

Why? Code reuse - don't repeat yourself!

Basic Inheritance

# Parent class (also called base class or superclass)
class Animal:
    def __init__(self, name):
        self.name = name
    
    def speak(self):
        return "Some sound"

# Child class (also called derived class or subclass)
class Dog(Animal):  # Dog inherits from Animal
    def speak(self):  # Override parent method
        return f"{self.name} says Woof!"

class Cat(Animal):
    def speak(self):
        return f"{self.name} says Meow!"

dog = Dog("Buddy")
cat = Cat("Whiskers")

print(dog.speak())  # "Buddy says Woof!"
print(cat.speak())  # "Whiskers says Meow!"

What happened:

  • Dog and Cat inherit __init__ from Animal (no need to rewrite it!)
  • Both override the speak method with their own version
  • Each child gets all parent attributes and methods automatically

Extending Parent's __init__ with super()

Use super() to call the parent's __init__ and then add more:

class Animal:
    def __init__(self, name):
        self.name = name

class Dog(Animal):
    def __init__(self, name, breed):
        super().__init__(name)  # Call parent's __init__
        self.breed = breed      # Add new attribute
    
    def info(self):
        return f"{self.name} is a {self.breed}"

dog = Dog("Buddy", "Golden Retriever")
print(dog.info())  # "Buddy is a Golden Retriever"
print(dog.name)    # "Buddy" (inherited from Animal)

Method Overriding

Method overriding happens when a child class provides its own implementation of a parent's method:

class Animal:
    def speak(self):
        return "Some sound"
    
    def move(self):
        return "Moving"

class Fish(Animal):
    def move(self):  # Override
        return "Swimming"
    
    def speak(self):  # Override
        return "Blub"

class Bird(Animal):
    def move(self):  # Override
        return "Flying"
    # speak() not overridden, so uses parent's version

fish = Fish()
bird = Bird()

print(fish.move())   # "Swimming" (overridden)
print(fish.speak())  # "Blub" (overridden)
print(bird.move())   # "Flying" (overridden)
print(bird.speak())  # "Some sound" (inherited, not overridden)

Rule: When you call a method, Python uses the child's version if it exists, otherwise the parent's version.
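
This rule is also the heart of polymorphism (pillar 4): the same call behaves differently depending on the object's class. Continuing the Fish/Bird example above:

for animal in [Fish(), Bird()]:
    print(animal.move())  # Python picks each object's own move()
# Swimming
# Flying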

Practice: Inheritance

Exercise 1: Create a Vehicle parent class with brand and year attributes

Exercise 2: Create Car and Motorcycle child classes that inherit from Vehicle

Exercise 3: Override a description() method in each child class

Exercise 4: Create an Employee parent class and a Manager child class with additional department attribute

Exercise 5: Create a Shape parent with color attribute, and Circle and Square children

Solutions
# Exercise 1, 2, 3
class Vehicle:
    def __init__(self, brand, year):
        self.brand = brand
        self.year = year
    
    def description(self):
        return f"{self.year} {self.brand}"

class Car(Vehicle):
    def description(self):
        return f"{self.year} {self.brand} Car"

class Motorcycle(Vehicle):
    def description(self):
        return f"{self.year} {self.brand} Motorcycle"

car = Car("Toyota", 2020)
bike = Motorcycle("Harley", 2019)
print(car.description())   # 2020 Toyota Car
print(bike.description())  # 2019 Harley Motorcycle

# Exercise 4
class Employee:
    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

class Manager(Employee):
    def __init__(self, name, salary, department):
        super().__init__(name, salary)
        self.department = department
    
    def info(self):
        return f"{self.name} manages {self.department}"

manager = Manager("Alice", 80000, "Sales")
print(manager.info())  # Alice manages Sales
print(manager.salary)  # 80000

# Exercise 5
class Shape:
    def __init__(self, color):
        self.color = color

class Circle(Shape):
    def __init__(self, color, radius):
        super().__init__(color)
        self.radius = radius
    
    def area(self):
        return 3.14159 * self.radius ** 2

class Square(Shape):
    def __init__(self, color, side):
        super().__init__(color)
        self.side = side
    
    def area(self):
        return self.side ** 2

circle = Circle("red", 5)
square = Square("blue", 4)
print(circle.area())   # 78.53975
print(circle.color)    # red
print(square.area())   # 16
print(square.color)    # blue

Level 5: Special Decorators for Methods

Decorators modify how methods behave. They're marked with @ symbol before the method.

@property - Methods as Attributes

Makes a method accessible like an attribute (no parentheses needed):

class Circle:
    def __init__(self, radius):
        self._radius = radius
    
    @property
    def radius(self):
        return self._radius
    
    @property
    def area(self):
        return 3.14159 * self._radius ** 2
    
    @property
    def circumference(self):
        return 2 * 3.14159 * self._radius

circle = Circle(5)
print(circle.radius)         # 5 (no parentheses!)
print(circle.area)           # 78.53975 (calculated on access)
print(circle.circumference)  # 31.4159
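
A property can also control assignment: pairing @property with a matching @radius.setter keeps attribute-style syntax while validating the new value (a standard pattern, sketched here):

class Circle:
    def __init__(self, radius):
        self._radius = radius
    
    @property
    def radius(self):
        return self._radius
    
    @radius.setter
    def radius(self, value):
        if value <= 0:
            raise ValueError("Radius must be positive")
        self._radius = value

circle = Circle(5)
circle.radius = 10      # goes through the setter
# circle.radius = -1    # would raise ValueError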

@staticmethod - Methods Without self

Static methods don't need access to the instance:

class Math:
    @staticmethod
    def add(x, y):
        return x + y
    
    @staticmethod
    def multiply(x, y):
        return x * y

# Call without creating an instance
print(Math.add(5, 3))       # 8
print(Math.multiply(4, 7))  # 28

@classmethod - Methods That Receive the Class

Class methods receive the class itself (not the instance):

class Dog:
    count = 0  # Class attribute
    
    def __init__(self, name):
        self.name = name
        Dog.count += 1
    
    @classmethod
    def get_count(cls):
        return f"There are {cls.count} dogs"
    
    @classmethod
    def create_default(cls):
        return cls("Default Dog")

dog1 = Dog("Buddy")
dog2 = Dog("Max")
print(Dog.get_count())  # "There are 2 dogs"

# Create a dog using class method
dog3 = Dog.create_default()
print(dog3.name)        # "Default Dog"
print(Dog.get_count())  # "There are 3 dogs"

Practice: Decorators

Exercise 1: Create a Temperature class with celsius property and fahrenheit property

Exercise 2: Add a static method is_freezing(celsius) to check if temperature is below 0

Exercise 3: Create a Person class with class method to count total people created

Exercise 4: Add a property age to calculate age from birth year

Exercise 5: Create utility class StringUtils with static methods for string operations

Solutions
# Exercise 1
class Temperature:
    def __init__(self, celsius):
        self._celsius = celsius
    
    @property
    def celsius(self):
        return self._celsius
    
    @property
    def fahrenheit(self):
        return (self._celsius * 9/5) + 32

temp = Temperature(25)
print(temp.celsius)     # 25
print(temp.fahrenheit)  # 77.0

# Exercise 2
class Temperature:
    def __init__(self, celsius):
        self._celsius = celsius
    
    @property
    def celsius(self):
        return self._celsius
    
    @staticmethod
    def is_freezing(celsius):
        return celsius < 0

print(Temperature.is_freezing(-5))  # True
print(Temperature.is_freezing(10))  # False

# Exercise 3
class Person:
    count = 0
    
    def __init__(self, name):
        self.name = name
        Person.count += 1
    
    @classmethod
    def get_total_people(cls):
        return cls.count

p1 = Person("Alice")
p2 = Person("Bob")
print(Person.get_total_people())  # 2

# Exercise 4
class Person:
    def __init__(self, name, birth_year):
        self.name = name
        self.birth_year = birth_year
    
    @property
    def age(self):
        from datetime import datetime
        current_year = datetime.now().year
        return current_year - self.birth_year

person = Person("Alice", 1990)
print(person.age)  # Calculates current age

# Exercise 5
class StringUtils:
    @staticmethod
    def reverse(text):
        return text[::-1]
    
    @staticmethod
    def word_count(text):
        return len(text.split())
    
    @staticmethod
    def capitalize_words(text):
        return text.title()

print(StringUtils.reverse("hello"))           # "olleh"
print(StringUtils.word_count("hello world"))  # 2
print(StringUtils.capitalize_words("hello world"))  # "Hello World"

Level 6: Abstract Classes - Enforcing Rules

An abstract class is a class that cannot be instantiated directly. It exists only as a blueprint for other classes to inherit from.

Why? To enforce that child classes implement certain methods - it's a contract.

Creating Abstract Classes

Use the abc module (Abstract Base Classes):

from abc import ABC, abstractmethod

class Animal(ABC):  # Inherit from ABC
    def __init__(self, name):
        self.name = name
    
    @abstractmethod  # Must be implemented by children
    def speak(self):
        pass
    
    @abstractmethod
    def move(self):
        pass

# This will cause an error:
# animal = Animal("Generic")  # TypeError: Can't instantiate abstract class

class Dog(Animal):
    def speak(self):  # Must implement
        return f"{self.name} barks"
    
    def move(self):   # Must implement
        return f"{self.name} walks"

dog = Dog("Buddy")  # This works!
print(dog.speak())  # "Buddy barks"
print(dog.move())   # "Buddy walks"

Key points:

  • Abstract classes inherit from ABC
  • Use @abstractmethod for methods that must be implemented
  • Child classes MUST implement all abstract methods
  • Cannot create instances of abstract classes directly

Why Use Abstract Classes?

They enforce consistency across child classes:

from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        pass
    
    @abstractmethod
    def perimeter(self):
        pass

class Rectangle(Shape):
    def __init__(self, width, height):
        self.width = width
        self.height = height
    
    def area(self):
        return self.width * self.height
    
    def perimeter(self):
        return 2 * (self.width + self.height)

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius
    
    def area(self):
        return 3.14159 * self.radius ** 2
    
    def perimeter(self):
        return 2 * 3.14159 * self.radius

# Both Rectangle and Circle MUST have area() and perimeter()
rect = Rectangle(5, 3)
circle = Circle(4)
print(rect.area())      # 15
print(circle.area())    # 50.26544

Practice: Abstract Classes

Exercise 1: Create an abstract Vehicle class with abstract method start_engine()

Exercise 2: Create abstract PaymentMethod class with abstract process_payment(amount) method

Exercise 3: Create concrete classes CreditCard and PayPal that inherit from PaymentMethod

Exercise 4: Create abstract Database class with abstract connect() and query() methods

Exercise 5: Create abstract FileProcessor with abstract read() and write() methods

Solutions
# Exercise 1
from abc import ABC, abstractmethod

class Vehicle(ABC):
    @abstractmethod
    def start_engine(self):
        pass

class Car(Vehicle):
    def start_engine(self):
        return "Car engine started"

car = Car()
print(car.start_engine())  # Car engine started

# Exercise 2 & 3
class PaymentMethod(ABC):
    @abstractmethod
    def process_payment(self, amount):
        pass

class CreditCard(PaymentMethod):
    def __init__(self, card_number):
        self.card_number = card_number
    
    def process_payment(self, amount):
        return f"Charged ${amount} to card {self.card_number}"

class PayPal(PaymentMethod):
    def __init__(self, email):
        self.email = email
    
    def process_payment(self, amount):
        return f"Charged ${amount} to PayPal account {self.email}"

card = CreditCard("1234-5678")
paypal = PayPal("user@email.com")
print(card.process_payment(100))    # Charged $100 to card 1234-5678
print(paypal.process_payment(50))   # Charged $50 to PayPal account user@email.com

# Exercise 4
class Database(ABC):
    @abstractmethod
    def connect(self):
        pass
    
    @abstractmethod
    def query(self, sql):
        pass

class MySQL(Database):
    def connect(self):
        return "Connected to MySQL"
    
    def query(self, sql):
        return f"Executing MySQL query: {sql}"

db = MySQL()
print(db.connect())           # Connected to MySQL
print(db.query("SELECT *"))   # Executing MySQL query: SELECT *

# Exercise 5
class FileProcessor(ABC):
    @abstractmethod
    def read(self, filename):
        pass
    
    @abstractmethod
    def write(self, filename, data):
        pass

class TextFileProcessor(FileProcessor):
    def read(self, filename):
        return f"Reading text from {filename}"
    
    def write(self, filename, data):
        return f"Writing text to {filename}: {data}"

processor = TextFileProcessor()
print(processor.read("data.txt"))              # Reading text from data.txt
print(processor.write("out.txt", "Hello"))     # Writing text to out.txt: Hello

Level 7: Design Pattern - Template Method

The Template Method Pattern defines the skeleton of an algorithm in a parent class, but lets child classes implement specific steps.

from abc import ABC, abstractmethod

class DataProcessor(ABC):
    """Template for processing data"""
    
    def process(self):
        """Template method - defines the workflow"""
        data = self.load_data()
        cleaned = self.clean_data(data)
        result = self.analyze_data(cleaned)
        self.save_results(result)
    
    @abstractmethod
    def load_data(self):
        """Children must implement"""
        pass
    
    @abstractmethod
    def clean_data(self, data):
        """Children must implement"""
        pass
    
    @abstractmethod
    def analyze_data(self, data):
        """Children must implement"""
        pass
    
    def save_results(self, result):
        """Default implementation (can override)"""
        print(f"Saving: {result}")


class CSVProcessor(DataProcessor):
    def load_data(self):
        return "CSV data loaded"
    
    def clean_data(self, data):
        return f"{data} -> cleaned"
    
    def analyze_data(self, data):
        return f"{data} -> analyzed"


class JSONProcessor(DataProcessor):
    def load_data(self):
        return "JSON data loaded"
    
    def clean_data(self, data):
        return f"{data} -> cleaned differently"
    
    def analyze_data(self, data):
        return f"{data} -> analyzed differently"


# Usage
csv = CSVProcessor()
csv.process()
# Output: Saving: CSV data loaded -> cleaned -> analyzed

json_proc = JSONProcessor()  # renamed so it doesn't shadow the json module
json_proc.process()
# Output: Saving: JSON data loaded -> cleaned differently -> analyzed differently

Benefits:

  • Common workflow defined once in parent
  • Each child implements specific steps differently
  • Prevents code duplication
  • Enforces consistent structure

Summary: Key Concepts

Classes and Objects

  • Class = blueprint (use PascalCase)
  • Object/Instance = actual thing created from class
  • __init__ = constructor that runs when creating objects
  • self = reference to the current object

Attributes and Methods

  • Attributes = data (variables) stored in objects
  • Instance attributes = unique to each object (defined in __init__)
  • Class attributes = shared by all objects
  • Methods = functions that define object behavior
  • Access both using self.name inside the class

Inheritance

  • Child class inherits from parent class
  • Use super() to call parent's methods
  • Method overriding = child replaces parent's method
  • Promotes code reuse

Decorators

  • @property = access method like an attribute
  • @staticmethod = method without self, doesn't need instance
  • @classmethod = receives class instead of instance
  • @abstractmethod = marks methods that must be implemented

Abstract Classes

  • Cannot be instantiated directly
  • Use ABC and @abstractmethod
  • Enforce that children implement specific methods
  • Create contracts/interfaces

Design Patterns

  • Template Method = define algorithm structure in parent, implement steps in children
  • Promotes consistency and reduces duplication

Dynamic Programming

What is Dynamic Programming?

Dynamic Programming (DP) is an optimization technique that solves complex problems by breaking them down into simpler subproblems and storing their results to avoid redundant calculations.

The key idea: If you've already solved a subproblem, don't solve it again; just look up the answer!

Two fundamental principles:

  1. Overlapping subproblems - the same smaller problems are solved multiple times
  2. Optimal substructure - the optimal solution can be built from optimal solutions to subproblems

Why it matters: DP can transform exponentially slow algorithms into polynomial or even linear time algorithms by trading memory for speed.


Prerequisites: Why Dictionaries Are Perfect for DP

Before diving into dynamic programming, you should understand Python dictionaries. If you're not comfortable with dictionaries yet, review them first; they're the foundation of most DP solutions.

Quick dictionary essentials for DP:

# Creating and using dictionaries
cache = {}  # Empty dictionary

# Store results
cache[5] = 120
cache[6] = 720

# Check if we've seen this before
if 5 in cache:  # O(1) - instant lookup!
    print(cache[5])

# This is why dictionaries are perfect for DP!

Why dictionaries work for DP:

  • O(1) lookup time - checking if a result exists is instant
  • O(1) insertion time - storing a new result is instant
  • Flexible keys - can store results for any input value
  • Clear mapping - easy relationship between input (key) and result (value)

Now let's see DP in action with a classic example.


The Classic Example: Fibonacci

The Fibonacci sequence is perfect for understanding DP because it clearly shows the problem of redundant calculations.

The Problem: Naive Recursion

Fibonacci definition:

  • F(0) = 0
  • F(1) = 1
  • F(n) = F(n-1) + F(n-2)

Naive recursive solution:

def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(10))  # 55
# Try fibonacci(40) - it takes forever!

Why Is This So Slow?

Look at the redundant calculations for fibonacci(5):

fibonacci(5)
├── fibonacci(4)
│   ├── fibonacci(3)
│   │   ├── fibonacci(2)
│   │   │   ├── fibonacci(1)  ← Calculated
│   │   │   └── fibonacci(0)  ← Calculated
│   │   └── fibonacci(1)      ← Calculated AGAIN
│   └── fibonacci(2)          ← Calculated AGAIN
│       ├── fibonacci(1)      ← Calculated AGAIN
│       └── fibonacci(0)      ← Calculated AGAIN
└── fibonacci(3)              ← Entire subtree calculated AGAIN
    ├── fibonacci(2)          ← Calculated AGAIN
    │   ├── fibonacci(1)      ← Calculated AGAIN
    │   └── fibonacci(0)      ← Calculated AGAIN
    └── fibonacci(1)          ← Calculated AGAIN

The numbers:

  • fibonacci(1) is calculated 5 times
  • fibonacci(2) is calculated 3 times
  • fibonacci(3) is calculated 2 times

For fibonacci(40), you'd do 331,160,281 function calls. That's insane for a simple calculation!

Time complexity: O(2^n) - exponential! Each call spawns two more calls.


Dynamic Programming Solution: Memoization

Memoization = storing (caching) results we've already calculated using a dictionary.

# Dictionary to store computed results
memo = {}

def fibonacci_dp(n):
    # Check if we've already calculated this
    if n in memo:
        return memo[n]
    
    # Base cases
    if n <= 1:
        return n
    
    # Calculate, store, and return
    result = fibonacci_dp(n - 1) + fibonacci_dp(n - 2)
    memo[n] = result
    return result

# First call - calculates and stores results
print(fibonacci_dp(10))   # 55
print(memo)  # {2: 1, 3: 2, 4: 3, 5: 5, 6: 8, 7: 13, 8: 21, 9: 34, 10: 55}

# Subsequent calls - instant lookups!
print(fibonacci_dp(50))   # 12586269025 (instant!)
print(fibonacci_dp(100))  # Works perfectly, still instant!

How Memoization Works: Step-by-Step

Let's trace fibonacci_dp(5) with empty memo:

Call fibonacci_dp(5):
  5 not in memo
  Calculate: fibonacci_dp(4) + fibonacci_dp(3)
  
  Call fibonacci_dp(4):
    4 not in memo
    Calculate: fibonacci_dp(3) + fibonacci_dp(2)
    
    Call fibonacci_dp(3):
      3 not in memo
      Calculate: fibonacci_dp(2) + fibonacci_dp(1)
      
      Call fibonacci_dp(2):
        2 not in memo
        Calculate: fibonacci_dp(1) + fibonacci_dp(0)
        fibonacci_dp(1) = 1 (base case)
        fibonacci_dp(0) = 0 (base case)
        memo[2] = 1, return 1
      
      fibonacci_dp(1) = 1 (base case)
      memo[3] = 2, return 2
    
    Call fibonacci_dp(2):
      2 IS in memo! Return 1 immediately (no calculation!)
    
    memo[4] = 3, return 3
  
  Call fibonacci_dp(3):
    3 IS in memo! Return 2 immediately (no calculation!)
  
  memo[5] = 5, return 5

Final memo: {2: 1, 3: 2, 4: 3, 5: 5}

Notice: We only calculate each Fibonacci number once. All subsequent requests are instant dictionary lookups!

Time complexity: O(n) - we calculate each number from 0 to n exactly once
Space complexity: O(n) - we store n results in the dictionary

Comparison:

  • Without DP: fibonacci(40) = 331,160,281 operations ⏰
  • With DP: fibonacci(40) = 40 operations ⚡

That's over 8 million times faster!


Top-Down vs Bottom-Up Approaches

There are two main ways to implement DP:

Top-Down (Memoization) - What We Just Did

Start with the big problem and recursively break it down, storing results as you go.

memo = {}

def fib_topdown(n):
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fib_topdown(n - 1) + fib_topdown(n - 2)
    return memo[n]

Pros:

  • Intuitive if you think recursively
  • Only calculates what's needed
  • Easy to add memoization to existing recursive code

Cons:

  • Uses recursion (stack space)
  • Slightly slower due to function call overhead

Bottom-Up (Tabulation) - Build From Smallest

Start with the smallest subproblems and build up to the answer.

def fib_bottomup(n):
    if n <= 1:
        return n
    
    # Build table from bottom up
    dp = {0: 0, 1: 1}
    
    for i in range(2, n + 1):
        dp[i] = dp[i - 1] + dp[i - 2]
    
    return dp[n]

print(fib_bottomup(10))  # 55

Even more optimized (space-efficient):

def fib_optimized(n):
    if n <= 1:
        return n
    
    # We only need the last two values
    prev2, prev1 = 0, 1
    
    for i in range(2, n + 1):
        current = prev1 + prev2
        prev2, prev1 = prev1, current
    
    return prev1

print(fib_optimized(100))  # 354224848179261915075

Pros:

  • No recursion (no stack overflow risk)
  • Can optimize space usage (we did it above!)
  • Often slightly faster

Cons:

  • Less intuitive at first
  • Calculates all subproblems even if not needed

When to Use Dynamic Programming

Use DP when you spot these characteristics:

1. Overlapping Subproblems

The same calculations are repeated many times.

Example: In Fibonacci, we calculate F(3) multiple times when computing F(5).

2. Optimal Substructure

The optimal solution to the problem contains optimal solutions to subproblems.

Example: The optimal path from A to C through B must include the optimal path from A to B.

3. You Can Define a Recurrence Relation

You can express the solution in terms of solutions to smaller instances.

Example: F(n) = F(n-1) + F(n-2)


Common DP Problem Patterns

1. Climbing Stairs

Problem: How many distinct ways can you climb n stairs if you can take 1 or 2 steps at a time?

def climbStairs(n):
    if n <= 2:
        return n
    
    memo = {1: 1, 2: 2}
    
    for i in range(3, n + 1):
        memo[i] = memo[i - 1] + memo[i - 2]
    
    return memo[n]

print(climbStairs(5))  # 8
# Ways: 1+1+1+1+1, 1+1+1+2, 1+1+2+1, 1+2+1+1, 2+1+1+1, 1+2+2, 2+1+2, 2+2+1

Key insight: This is actually Fibonacci in disguise! To reach step n, you either came from step n-1 (one step) or step n-2 (two steps).

2. Coin Change

Problem: Given coins of different denominations, find the minimum number of coins needed to make a target amount.

def coinChange(coins, amount):
    # dp[i] = minimum coins needed to make amount i
    dp = {0: 0}
    
    for i in range(1, amount + 1):
        min_coins = float('inf')
        
        # Try each coin
        for coin in coins:
            if i - coin >= 0 and i - coin in dp:
                min_coins = min(min_coins, dp[i - coin] + 1)
        
        if min_coins != float('inf'):
            dp[i] = min_coins
    
    return dp.get(amount, -1)

print(coinChange([1, 2, 5], 11))  # 3 (5 + 5 + 1)
print(coinChange([2], 3))          # -1 (impossible)

The DP Recipe: How to Solve DP Problems

  1. Identify if it's a DP problem

    • Do you see overlapping subproblems?
    • Can you break it into smaller similar problems?
  2. Define the state

    • What information do you need to solve each subproblem?
    • This becomes your dictionary key
  3. Write the recurrence relation

    • How do you calculate dp[n] from smaller subproblems?
    • Example: F(n) = F(n-1) + F(n-2)
  4. Identify base cases

    • What are the smallest subproblems you can solve directly?
    • Example: F(0) = 0, F(1) = 1
  5. Implement and optimize

    • Start with top-down memoization (easier to write)
    • Optimize to bottom-up if needed
    • Consider space optimization

Common Mistakes to Avoid

1. Forgetting to Check the Cache

# Wrong - doesn't check memo first
def fib_wrong(n):
    if n <= 1:
        return n
    memo[n] = fib_wrong(n - 1) + fib_wrong(n - 2)  # Calculates every time!
    return memo[n]

# Correct - checks memo first
def fib_correct(n):
    if n in memo:  # Check first!
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fib_correct(n - 1) + fib_correct(n - 2)
    return memo[n]

2. Not Storing the Result

# Wrong - calculates but doesn't store
def fib_wrong(n):
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    return fib_wrong(n - 1) + fib_wrong(n - 2)  # Doesn't store!

# Correct - stores before returning
def fib_correct(n):
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fib_correct(n - 1) + fib_correct(n - 2)  # Store it!
    return memo[n]

3. Using Mutable Default Arguments

# Wrong - memo persists between calls!
def fib_wrong(n, memo={}):
    # ...

# Correct - create fresh memo or pass it explicitly
def fib_correct(n, memo=None):
    if memo is None:
        memo = {}
    # ...
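By the way, Python's standard library can handle this caching for you: functools.lru_cache wraps a function with a memo dictionary and avoids the mutable-default pitfall entirely. A minimal sketch:

from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache; each distinct n is computed only once
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(100))  # 354224848179261915075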

Summary

Dynamic Programming is about:

  • Recognizing overlapping subproblems
  • Storing solutions to avoid recalculation
  • Trading memory for speed

Key techniques:

  • Top-down (memoization): Recursive + dictionary cache
  • Bottom-up (tabulation): Iterative + build from smallest

When to use:

  • Same subproblems solved repeatedly
  • Optimal substructure exists
  • Can define recurrence relation

The power of DP:

  • Transforms exponential O(2^n) → linear O(n)
  • Essential for many algorithmic problems
  • Dictionaries make implementation clean and fast

Remember: Not every problem needs DP! Use it when you spot repeated calculations. Sometimes a simple loop or greedy algorithm is better.


Practice Problems to Try

  1. House Robber - Maximum money you can rob from houses without robbing adjacent ones
  2. Longest Common Subsequence - Find longest sequence common to two strings
  3. Edit Distance - Minimum operations to convert one string to another
  4. Maximum Subarray - Find contiguous subarray with largest sum
  5. Unique Paths - Count paths in a grid from top-left to bottom-right

Each of these follows the same DP pattern we've learned. Try to identify the state, recurrence relation, and base cases!

Design Tic-Tac-Toe with Python

Project source: Hyperskill - Tic-Tac-Toe

Project Structure

This project is divided into multiple stages on Hyperskill, each with specific instructions and requirements. I'm sharing the final stage here, which integrates all previous components. The final stage instructions may seem brief as they build on earlier stages where the game logic was developed incrementally.

The complete input/output specifications can be found in the link above.

Sample Execution

---------
|       |
|       |
|       |
---------
3 1
---------
|       |
|       |
| X     |
---------
1 1
---------
| O     |
|       |
| X     |
---------
3 2
---------
| O     |
|       |
| X X   |
---------
0 0
Coordinates should be from 1 to 3!
1 2
---------
| O O   |
|       |
| X X   |
---------
3 3
---------
| O O   |
|       |
| X X X |
---------
X wins

Code


xo_arr = [[" "] * 3 for _ in range(3)]

def display_game(arr):
    row_one = " ".join(arr[0])
    row_two = " ".join(arr[1])
    row_three = " ".join(arr[2])

    print("---------")
    print(f"| {row_one} |")
    print(f"| {row_two} |")
    print(f"| {row_three} |")
    print("---------")


# This could be done in a different (shorter) way, I think:
# maybe build a list of all winning combinations
# and then check whether any of them is filled (see the sketch after this code block)
def is_win(s):
    # Rows
    symbol_win = xo_arr[0] == [s] * 3
    symbol_win = symbol_win or xo_arr[1] == [s] * 3
    symbol_win = symbol_win or xo_arr[2] == [s] * 3

    # Columns
    symbol_win = symbol_win or (xo_arr[0][0] == s and xo_arr[1][0] == s and xo_arr[2][0] == s)
    symbol_win = symbol_win or (xo_arr[0][1] == s and xo_arr[1][1] == s and xo_arr[2][1] == s)
    symbol_win = symbol_win or (xo_arr[0][2] == s and xo_arr[1][2] == s and xo_arr[2][2] == s)

    # Diagonals
    symbol_win = symbol_win or (xo_arr[0][0] == s and xo_arr[1][1] == s and xo_arr[2][2] == s)
    symbol_win = symbol_win or (xo_arr[0][2] == s and xo_arr[1][1] == s and xo_arr[2][0] == s)

    return symbol_win


symbol = "X"

display_game(xo_arr)


while True: 

    move = input()

    # Input is expected as "row column", e.g. "1 3"
    row_coordinate = move[0]
    column_coordinate = move[2]

    if not (row_coordinate.isdigit() and column_coordinate.isdigit()):
        print("You should enter numbers!")
        continue
    else:
        row_coordinate = int(row_coordinate)
        column_coordinate = int(column_coordinate)

    if not (1 <= row_coordinate <= 3 and 1 <= column_coordinate <= 3):
        print("Coordinates should be from 1 to 3!")
        continue

    elif xo_arr[row_coordinate - 1][column_coordinate - 1] == "X" or xo_arr[row_coordinate - 1][column_coordinate - 1] == "O":
        print("This cell is occupied! Choose another one!")
        continue

    xo_arr[row_coordinate - 1][column_coordinate - 1] = symbol

    if symbol == "X":
        symbol = "O"
    else:
        symbol = "X"

    display_game(xo_arr)

    o_win = is_win("O")
    x_win = is_win("X")


    if x_win:
        print("X wins")
        break

    elif o_win:
        print("O wins")
        break
    elif " " not in xo_arr[0] and " " not in xo_arr[1] and " " not in xo_arr[2]:
        print("Draw")
        break
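Following up on the comment inside the code: here's a sketch of the shorter win check I had in mind, using a list of all winning coordinate combinations (untested against the Hyperskill judge, so treat it as an idea rather than a verified solution):

def is_win(s):
    # All 8 winning lines as lists of (row, column) indices
    lines = (
        [[(r, c) for c in range(3)] for r in range(3)]    # rows
        + [[(r, c) for r in range(3)] for c in range(3)]  # columns
        + [[(i, i) for i in range(3)]]                    # main diagonal
        + [[(i, 2 - i) for i in range(3)]]                # anti-diagonal
    )
    return any(all(xo_arr[r][c] == s for r, c in line) for line in lines)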

Multiplication Table

Write a multiplication table based on a maximum input value.

example:


> Please input number: 10
1    2    3    4    5    6    7    8    9    10  
2    4    6    8    10   12   14   16   18   20  
3    6    9    12   15   18   21   24   27   30  
4    8    12   16   20   24   28   32   36   40  
5    10   15   20   25   30   35   40   45   50  
6    12   18   24   30   36   42   48   54   60  
7    14   21   28   35   42   49   56   63   70  
8    16   24   32   40   48   56   64   72   80  
9    18   27   36   45   54   63   72   81   90  
10   20   30   40   50   60   70   80   90   100 

Implementation

This solution is dynamic because it depends on the number of digits in each result. If the maximum number in the table is 100, then the results are padded with:

three spaces → 1–9

two spaces → 10–99

one space → 100

So to align everything, you look at the biggest number in the table and check how many digits it has. You can do this mathematically (using powers of ten) or simply by taking the length of the number as a string.

Then you add the right amount of spaces after each value to keep the table lined up.

num = int(input("Please input number: "))
max_spaces = len(str(num * num)) 
row = []

for i in range(1, num + 1):
    for j in range(1, num + 1):
        product = str(i * j)
        space =  " " * (max_spaces + 1 - len(product))
        row.append(product + space)
    
    print(*row)
    row = []
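
As a side note, Python strings already have padding helpers, so the same alignment can be done with str.ljust instead of counting spaces by hand (a sketch of the same idea, with slightly different spacing between columns):

num = int(input("Please input number: "))
width = len(str(num * num))  # digits of the biggest product

for i in range(1, num + 1):
    # ljust pads each product with spaces on the right up to a fixed width
    print(*(str(i * j).ljust(width) for j in range(1, num + 1)))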


Sieve of Eratosthenes

This is an implementation of the Sieve of Eratosthenes.

You can find the full description of the algorithm on its Wikipedia page here.

Code


n = 120

consecutive_int = [True for _ in range(2, n + 1)]

def mark_multiples(ci, p):
    for i in range(p * p, len(ci) + 2, p):
        ci[i - 2] = False
    return ci

def get_next_prime_notmarked(ci, p):
    for i in range(p + 1, len(ci) + 2):
        if ci[i - 2]:
            return i
    return -1
            

next_prime = 2


while True:
    consecutive_int = mark_multiples(consecutive_int, next_prime)
    next_prime = get_next_prime_notmarked(consecutive_int, next_prime)
    if next_prime == -1:
        break

def convert_arr_nums(consecutive_int):
    num = ""
    for i in range(len(consecutive_int)):
        if consecutive_int[i]:
            num += str(i + 2) + " "
    return num
            

print(convert_arr_nums(consecutive_int))
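
For comparison, here's a more compact sieve that indexes a boolean list directly by number, so there is no off-by-two bookkeeping (a sketch of the same algorithm, not a drop-in replacement for the code above):

def sieve(n):
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            # Mark every multiple of p from p*p onward as composite
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return [i for i, prime in enumerate(is_prime) if prime]

print(sieve(120))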

Spiral Matrix

Difficulty: Medium
Source: LeetCode

Description

Given an m x n matrix, return all elements of the matrix in spiral order. The spiral traversal goes clockwise starting from the top-left corner: right → down → left → up, repeating inward until all elements are visited.

Code


# To be solved

Rotate Image

Difficulty: Medium
Source: LeetCode

Description

Given an n x n 2D matrix representing an image, rotate the image by 90 degrees clockwise.

Constraint: You must rotate the image in-place by modifying the input matrix directly. Do not allocate another 2D matrix.

Example

Input: matrix = [[1,2,3],[4,5,6],[7,8,9]]
Output: [[7,4,1],[8,5,2],[9,6,3]]

Code

# To be solved

Set Matrix Zeroes

Difficulty: Medium
Source: LeetCode

Description

Given an m x n integer matrix, if an element is 0, set its entire row and column to 0's.

Constraint: You must do it in place.

Example

Input: matrix = [[1,1,1],
                 [1,0,1],
                 [1,1,1]]
Output: [[1,0,1],
         [0,0,0],
         [1,0,1]]

Code

# To be solved

Two Pointers Intro

2 Pointers Technique

Watch this video to get an overview of the pattern

2 Pointers Problems

Sliding Window Algorithm - Variable Length + Fixed Length

Reverse String

Difficulty: Easy
Source: LeetCode

Description

Write a function that reverses a string in-place.

Example

Input: s = ["h","e","l","l","o"]
Output: ["o","l","l","e","h"]

Code

# To be solved

Two Sum II - Input Array Is Sorted

Difficulty: Medium
Source: LeetCode

Description

You are given a 1-indexed integer array numbers that is sorted in non-decreasing order and an integer target.

Your task is to return the 1-based indices of two different elements in numbers whose sum is exactly equal to target, with the guarantee that exactly one such pair exists.

Please see full description in this link

Example

Example 1:

Input: numbers = [2, 7, 11, 15], target = 9

Expected output: [1, 2]

Explanation: numbers[1] + numbers[2] = 2 + 7 = 9, so the correct indices are [1, 2].

Code

# To be solved

3sum

Difficulty: Medium
Source: LeetCode

Description

You are given an integer array nums, and the goal is to return all unique triplets [nums[i], nums[j], nums[k]] such that each index is distinct and the sum of the three numbers is zero. The answer must not include duplicate triplets, even if the same values appear multiple times in the array.

Please see full description in this link

Example

Example 1:

Input: nums = [-1, 0, 1, 2, -1, -4]

One valid output: [[-1, -1, 2], [-1, 0, 1]] (order of triplets or numbers within a triplet does not matter).

Code

# To be solved

Container With Most Water

Difficulty: Medium
Source: LeetCode

Description

You are given an array height where each element represents the height of a vertical line drawn at that index on the x-axis.

Your goal is to pick two distinct lines such that, using the x-axis as the base, the container formed between these lines holds the maximum amount of water, and you must return that maximum water area.

Please see full description in this link

Example

Example 1:

  • Input: height = [1, 8, 6, 2, 5, 4, 8, 3, 7]
  • Output: 49
  • Explanation (high level): The best container uses the line of height 8 and the line of height 7, which are far enough apart that the width and the shorter height together produce area 49.

Code

# To be solved

Remove Duplicates from Sorted Array

Difficulty: Easy
Source: LeetCode

Description

You are given an integer array nums sorted in non-decreasing order, and you need to modify it in-place so that each distinct value appears only once in the prefix of the array. After the operation, you return an integer k representing how many unique values remain at the start of nums, and the first k positions should contain those unique values in their original relative order.

Please see full description in this link

Example

Example 1:

  • Input: nums = [1, 1, 2]
  • Output: k = 2 and nums's first k elements become [1, 2, _] (the last position can hold any value)
  • Explanation: The unique values are 1 and 2, so they occupy the first two positions and the function returns 2.

Code

# To be solved

Move Zeroes

Difficulty: Easy
Source: LeetCode

Description

You are given an integer array nums and must move every 0 in the array to the end, without changing the relative order of the non-zero values. The rearrangement has to be performed directly on nums (in-place), and the overall extra space usage must remain O(1).

Please see full description in this link

Example

Example 1:

  • Input: nums = [0, 1, 0, 3, 12]
  • Output (final state of nums): [1, 3, 12, 0, 0]
  • Explanation: The non-zero elements 1, 3, 12 stay in the same relative order, and both zeros are moved to the end.

Code

# To be solved

Valid Palindrome

Difficulty: Easy
Source: LeetCode

Description

You are given a string s consisting of printable ASCII characters, and the goal is to determine whether it forms a palindrome when considering only letters and digits and treating uppercase and lowercase as the same. After filtering out non-alphanumeric characters and converting all remaining characters to a single case, the cleaned string must read the same from left to right and right to left to be considered valid.

Please see full description in this link

Example

Example 1:

  • Input: s = "A man, a plan, a canal: Panama"
  • Output: True
  • Explanation: After removing non-alphanumeric characters and lowering case, it becomes "amanaplanacanalpanama", which reads the same forwards and backwards.

Code

# To be solved

Sliding Window Intro

Sliding Window Technique

Watch this video to get an overview of the pattern

Sliding Window Problems

Sliding Window Algorithm - Variable Length + Fixed Length

Longest Substring Without Repeating Characters

Description

You are given a string s, and the goal is to determine the maximum length of any substring that has all unique characters, meaning no character appears more than once in that substring.

The substring must be contiguous within s (no reordering or skipping), and you only need to return the length of the longest such substring, not the substring itself.

Example

Example 1:

  • Input: s = "abcabcbb"
  • Output: 3
  • Explanation: One longest substring without repeating characters is "abc", which has length 3.

Example 2:

  • Input: s = "bbbbb"
  • Output: 1
  • Explanation: Every substring with unique characters is just "b", so the maximum length is 1.

Example 3:

  • Input: s = "pwwkew"
  • Output: 3
  • Explanation: A valid longest substring is "wke" with length 3; note that "pwke" is not allowed because it is not contiguous.

You can test edge cases like s = "" (empty string) or s = " " (single space) to see how the result behaves.

Code

# LeetCode 3: Longest Substring Without Repeating Characters
# Credit: Problem from LeetCode (see problem page for full statement and tests).

def lengthOfLongestSubstring(s: str) -> int:
    """
    Write your solution here.

    Requirements:
    - Consider contiguous substrings of s.
    - Within the chosen substring, all characters must be distinct.
    - Return the maximum length among all such substrings.
    """
    # To be solved
    raise NotImplementedError

Maximum Number of Vowels in a Substring of Given Length

Difficulty: Medium
Source: LeetCode

Description

Given a string s and an integer k, return the maximum number of vowel letters in any substring of s with length k.

Vowel letters in English are 'a', 'e', 'i', 'o', and 'u'.

Examples

Input: s = "abciiidef", k = 3
Output: 3
Explanation: The substring "iii" contains 3 vowel letters
Input: s = "aeiou", k = 2
Output: 2
Explanation: Any substring of length 2 contains 2 vowels
Input: s = "leetcode", k = 3
Output: 2
Explanation: "lee", "eet" and "ode" contain 2 vowels

Code

# To be solved

Climbing Stairs

Difficulty: Easy
Source: LeetCode

Description

You are climbing a staircase. It takes n steps to reach the top.

Each time you can either climb 1 or 2 steps. In how many distinct ways can you climb to the top?

Examples

Input: n = 2
Output: 2
Explanation: There are two ways to climb to the top:
1. 1 step + 1 step
2. 2 steps
Input: n = 3
Output: 3
Explanation: There are three ways to climb to the top:
1. 1 step + 1 step + 1 step
2. 1 step + 2 steps
3. 2 steps + 1 step

Code

# To be solved

Counting Bits

Difficulty: Easy
Source: LeetCode

Description

Given an integer n, return an array ans of length n + 1 such that for each i (0 <= i <= n), ans[i] is the number of 1's in the binary representation of i.

Example

Input: n = 2
Output: [0,1,1]
Explanation:
0 --> 0 (zero 1's)
1 --> 1 (one 1)
2 --> 10 (one 1)

Code

# To be solved

Decode Ways

Difficulty: Medium
Source: LeetCode

Description

Given a string s of digits, return the number of ways to decode it using the mapping:

"1" -> 'A', 
"2" -> 'B',
 ..., 
"26" -> 'Z'

A digit string can be decoded in multiple ways since some codes overlap (e.g., "12" can be "AB" or "L").

Rules:

  • Valid codes are "1" to "26"
  • Leading zeros are invalid (e.g., "06" is invalid, but "6" is valid)
  • Return 0 if the string cannot be decoded

Examples

Input: s = "12"
Output: 2
Explanation: Can be decoded as "AB" (1, 2) or "L" (12)
Input: s = "11106"
Output: 2
Explanation: 
- "AAJF" with grouping (1, 1, 10, 6)
- "KJF" with grouping (11, 10, 6)
- (1, 11, 06) is invalid because "06" is not valid

Code

# To be solved

Maximal Square

Difficulty: Medium
Source: LeetCode

Description

Given an m x n binary matrix filled with 0's and 1's, find the largest square containing only 1's and return its area.

Example

Input: matrix = [
  ["1","0","1","0","0"],
  ["1","0","1","1","1"],
  ["1","1","1","1","1"],
  ["1","0","0","1","0"]
]
Output: 4
Explanation: The largest square of 1's has side length 2, so area = 2 × 2 = 4

Code

# To be solved

Word Break

Difficulty: Medium
Source: LeetCode

Description

Given a string s and a dictionary of strings wordDict, return true if s can be segmented into a space-separated sequence of one or more dictionary words.

Note: The same word in the dictionary may be reused multiple times in the segmentation.

Example

Input: s = "leetcode", wordDict = ["leet","code"]
Output: true
Explanation: "leetcode" can be segmented as "leet code"
Input: s = "applepenapple", wordDict = ["apple","pen"]
Output: true
Explanation: "applepenapple" can be segmented as "apple pen apple"
Note: "apple" is reused

Code

# To be solved

Longest Increasing Subsequence

Difficulty: Medium
Source: LeetCode

Description

Given an integer array nums, return the length of the longest strictly increasing subsequence.

A subsequence is derived by deleting some or no elements without changing the order of the remaining elements.

Example

Input: nums = [10,9,2,5,3,7,101,18]
Output: 4
Explanation: The longest increasing subsequence is [2,3,7,101], with length 4

Code

# To be solved

Subarray Sum Equals K

Problem credit: This note is for practicing the LeetCode problem "Subarray Sum Equals K". For the full official statement, examples, and judge, see the LeetCode problem page.

Description

You are given an integer array nums and an integer k, and the task is to return the number of non-empty contiguous subarrays whose elements add up to k.

A subarray is defined as a sequence of one or more elements that appear consecutively in the original array, without reordering or skipping indices.

Example

Example 1:

  • Input: nums = [1, 1, 1], k = 2
  • Output: 2
  • Explanation: The subarrays [1, 1] using indices [0, 1] and [1, 2] both sum to 2, so the answer is 2.

Example 2:

  • Input: nums = [1, 2, 3], k = 3
  • Output: 2
  • Explanation: The subarrays [1, 2] and [3] each sum to 3, giving a total count of 2.

You can experiment with inputs that include negative numbers, such as [2, 2, -4, 1, 1, 2] and various k values, to see how multiple overlapping subarrays can share the same sum.

Code

# LeetCode 560: Subarray Sum Equals K
# Credit: Problem from LeetCode (see problem page for full statement and tests).

from typing import List

def subarraySum(nums: List[int], k: int) -> int:
    """
    Write your solution here.

    Requirements:
    - Count all non-empty contiguous subarrays whose sum is exactly k.
    - nums may contain positive, negative, and zero values.
    - Return the total number of such subarrays.
    """
    # To be solved
    raise NotImplementedError

Count Vowel Substrings of a String

Difficulty: Easy
Source: LeetCode

Description

Given a string word, return the number of vowel substrings in word.

A vowel substring is a contiguous substring that:

  • Only consists of vowels ('a', 'e', 'i', 'o', 'u')
  • Contains all five vowels at least once

Examples

Input: word = "aeiouu"
Output: 2
Explanation: The vowel substrings are "aeiou" and "aeiouu"
Input: word = "unicornarihan"
Output: 0
Explanation: Not all 5 vowels are present, so there are no vowel substrings

Code

# To be solved

Roman to Integer

The problem can be found here

Solution one

Let's think of a simple solution for this problem: change the way the numeral system works. In other words, instead of doing subtraction, make everything just a sum.

class Solution:
    def romanToInt(self, s: str) -> int:
        roman = {
            "I": 1,
            "V": 5,
            "X": 10,
            "L": 50,
            "C": 100,
            "D": 500,
            "M": 1000
        }
        replace = {
            "IV": "IIII",
            "IX": "VIIII",
            "XL": "XXXX",
            "XC": "LXXXX",
            "CD": "CCCC",
            "CM": "DCCCC"
        }

        for k, v in replace.items(): 
            s = s.replace(k, v)
            
        return sum([roman[char] for char in s])

Solution two

Another way to think about this is: if a smaller value comes before a bigger value, we should subtract; otherwise, we just keep adding numbers.

class Solution:
    def romanToInt(self, s: str) -> int:
        roman = {
            "I": 1,
            "V": 5,
            "X": 10,
            "L": 50,
            "C": 100,
            "D": 500,
            "M": 1000
        }
        total = 0
        pre_value = 0

        for i in s:
            if pre_value < roman[i]:
                total += roman[i] - 2 * pre_value
            else:
                total += roman[i]
            
            pre_value = roman[i]
        
        return total

This solution beats 100% of submissions in runtime, but only about 20% in memory.

Why did I do roman[i] - 2 * pre_value? Because pre_value was already added in the previous step, so we subtract it twice: once to cancel that earlier addition and once for the actual subtraction (e.g., for "IV" we add 1, then add 5 - 2×1 = 3, giving 4).

Basic Calculator

Difficulty: Medium

Description

Given a string expression containing digits and operators (+, -, *, /), evaluate the expression and return the result.

Rules:

  • Follow standard operator precedence (multiplication and division before addition and subtraction)
  • Division should be integer division (truncate toward zero)
  • No parentheses in the expression

Examples

Input: s = "3+2*2"
Output: 7
Explanation: Multiplication first: 3 + (2*2) = 3 + 4 = 7
Input: s = "4-8/2"
Output: 0
Explanation: Division first: 4 - (8/2) = 4 - 4 = 0
Input: s = "14/3*2"
Output: 8
Explanation: Left to right for same precedence: (14/3)*2 = 4*2 = 8

Code

# To be solved

Resources

The exercises and examples in this material are inspired by several open educational resources released under Creative Commons licenses. Instead of referencing each one separately throughout the notes, here is a list of the main books and sources I used:

  • A Practical Introduction to Python Programming, © 2015 Brian Heinold (CC BY-NC-SA 3.0)

All credit goes to the original authors for their openly licensed educational content.

Core Concepts

Structural Bioinformatics

Focus: Protein folding and structure prediction

The main goal of structural bioinformatics is predicting the final 3D structure of a protein starting from its amino acid sequence. This is one of the fundamental challenges in computational biology.

The Central Dogma Connection

Question raised: To be sure that a protein is expressed, you must have a transcript. Why?

Because: DNA → RNA (transcript) → Protein. Without the transcript (mRNA), there's no template for translation into protein. Gene expression requires transcription first.

What is Protein/DNA Folding?

Folding is the process by which a linear sequence (amino acids for proteins, nucleotides for DNA) adopts a specific three-dimensional structure. This structure determines function.

  • Protein folding: Amino acid chain → functional 3D protein
  • DNA folding: Linear DNA → chromatin structure

Structure and Function

A fundamental principle in biology: structure determines function. The 3D shape of a protein dictates what it can do - what it binds to, what reactions it catalyzes, how it interacts with other molecules.

The structure of a molecule depends on its electron density; in reality, the structure itself is just the shape of the molecule's electron density cloud in space. The structure also determines the function → when you know the structure, you can derive properties of the molecule and therefore its function.

🔬
Random Fact

Bioinformatics does not produce data; it analyses existing data. Quality of the data is crucial.

Functional Annotation

One of the most important fields in bioinformatics is functional annotation.

What does it mean?

Functional annotation is the process of assigning biological meaning to sequences or structures. Given a protein sequence, what does it do? What pathways is it involved in? What cellular processes does it regulate?

This involves:

  • Predicting function from sequence similarity
  • Domain identification
  • Pathway assignment
  • Gene Ontology (GO) terms
💡
Reference databases

The reference database for protein structures is the PDB

The reference database for protein function is UNIPROT

The reference database for DNA sequences is GENBANK, which is in the U.S.; in Europe we have the ENA.

The reference database for the human genome is ENSEMBL, located at the Sanger Institute in Hinxton, and UCSC (from the U.S.A.)

Functional annotation in UniProt can be manually curated (Swiss-Prot) or automatic (TrEMBL). Swiss-Prot contains only non-redundant sequences.

🔬
Random Fact

Those databases contain various isoforms of the same proteins.

We can also see the distribution of proteins by length in UniProt. The majority of proteins sit between 100 and 500 residues, with some proteins that are very big and others that are very small. However, it is not a normal distribution: the tail corresponding to the big sequences is larger, because a very small number of amino acids can only generate a small number of unique sequences. We can also see the abundance of the amino acids; the most abundant are the aliphatic ones.

Data Challenges

The professor discussed practical issues in bioinformatics data:

Collection: How do we gather biological data?
Production: How is data generated (sequencing, experiments)?
Quality: How reliable is the data? What are the error rates?
Redundancy: Multiple entries for the same protein/gene - how do we handle duplicates?

Gene Ontology (GO)

A standardized vocabulary for describing:

  • Biological processes (what cellular processes the gene/protein is involved in)
  • Molecular functions (what the protein does at the molecular level)
  • Cellular components (where in the cell it's located)

GO provides a controlled language for functional annotation across all organisms.

Machine Learning in Bioinformatics

📖
Definition

Machine learning is about fitting a function (or line) between input and output

Given input data (like protein sequences), ML tries to learn patterns that map to outputs (like protein function or structure). Essentially: find the line (or curve, or complex function) that best describes the relationship between what you know (input) and what you want to predict (output).

We are in the era of big data, and to manage all this data we need new algorithms. Artificial intelligence is an old concept; in the 80s, however, an algorithm that can train artificial intelligences was developed. Learning is essentially an optimization process.

Deep learning is a variant of machine learning that is more complex, accurate, and performant. Today we call classical machine learning "shallow" machine learning. It is important to have good quality data in order to train these machines so they can associate some information with specific data.

Proteins and Bioinformatics

What is a Protein?

  1. A biopolymer - a biological polymer made of amino acid monomers linked together.
  2. A complex system capable of folding in the solvent
  3. A protein is capable of interactions with other molecules

Are All Proteins Natural?

No.

  • Natural proteins: Encoded by genes, produced by cells
  • Synthetic proteins: Designed and manufactured in labs
  • Modified proteins: Natural proteins with artificial modifications

This distinction matters for understanding protein databases and experimental vs. computational protein design.

Protein Sequence

The linear order of amino acids in a protein. This is the primary structure and is directly encoded by DNA/RNA.

Proteins as Complex Systems

Proteins aren't just simple chains - they're complex biological systems that:

  • Fold into specific 3D structures
  • Interact with other molecules
  • Respond to environmental conditions
  • Have dynamic behavior (not static structures)

As biopolymers, they exhibit emergent properties that aren't obvious from just reading the sequence.

🔬
Random Fact

Complex models can be very useful; for example, organoids are at the forefront of medicine. Having a reliable cellular model is a challenge to solve.

Protein Stability

Measured by ΔG (delta G) of folding

ΔG represents the change in free energy during the folding process:

  • Negative ΔG: Folding is favorable (stable protein)
  • Positive ΔG: Folding is unfavorable (unstable)
  • ΔG ≈ 0: Marginal stability

This thermodynamic measurement tells us how stable a folded protein is compared to its unfolded state.

Transfer of Knowledge (Annotation)

One of the key principles in bioinformatics: we can transfer functional information from well-studied proteins to newly discovered ones based on sequence or structural similarity.

If protein A is well-characterized and protein B is similar, we can infer that B likely has similar function. This is the basis of homology-based annotation.

🔬
Random Fact

Protein phases are aggregations of proteins that presumably have a common goal. For example, proteins in the Krebs cycle aggregate, generating a protein phase. This process is driven by the proteins' affinity for each other. The process is considered so important that diseases can arise if some of those phases do not occur.

Structure vs. Sequence

Key principle: The structure of a protein is more informative than its sequence.

Why?

  • Sequences can diverge significantly while structure remains conserved
  • Different sequences can fold into similar structures (convergent evolution)
  • Structure directly relates to function
  • Structural similarity reveals evolutionary relationships that sequence alone might miss

This is why structural bioinformatics is so important - knowing the 3D structure gives you more information about function than just the sequence.

Macromolecular Crowding

Concept: Inside cells, it's crowded. Really crowded.

ℹ️
Info

Macromolecular crowding: the cytoplasm of any cell is a dynamic environment. Macromolecular crowding is how the cell balances the number of molecules with the number of processes.

Proteins don't fold and function in isolation - they're surrounded by other proteins, RNA, DNA, and small molecules. This crowding affects:

  • Folding kinetics
  • Protein stability
  • Protein-protein interactions
  • Diffusion rates

It is important to remember that the intracellular environment is very crowded, and studying all the interactions is both very important and an open issue nowadays. For example, one thing that we don't understand is how chromosomes interact within the nucleus, and understanding this can lead to the production of models. A model is crucial for doing data analysis; if the model is not there, we have to produce it.

Lab experiments often use dilute solutions, but cells are packed with macromolecules. This environmental difference matters for understanding real protein behavior.

Protein Quality and Databases

Where to find reliable protein data?

UniProt: Universal protein database

  • Contains both reviewed and unreviewed entries
  • Comprehensive but variable quality

Swiss-Prot (part of UniProt):

  • Manually curated and reviewed
  • High-quality, experimentally validated annotations
  • Gold standard for protein information
  • Smaller than UniProt but much more reliable

Rule of thumb: For critical analyses, prefer Swiss-Prot. For exploratory work, UniProt is broader but requires more careful validation.

Interoperability: the ability of databases to talk to each other. To retrieve complete information, it is important that databases talk to each other.

Data quality management: the quality of data is a very important issue. It is crucial to be able to discriminate between good and bad data; even in databases there is both good data and very bad data.

Folding of proteins

📝
Note

The most important cause that drives the folding of a protein is the hydrophobic effect. The folding of a protein is specific to the protein's family. Proteins can be composed of more than one polypeptide chain; in this case we say they are heteropolymers.

Summary: What We've Covered

  • Structural bioinformatics and protein folding

  • Structure-function relationship

  • Functional annotation and Gene Ontology

  • Data quality challenges

  • ML as function fitting

  • Proteins as biopolymers and complex systems

  • Natural vs. synthetic proteins

  • Protein stability (ΔG)

  • Structure is more informative than sequence

  • Macromolecular crowding

  • Data quality: UniProt vs. Swiss-Prot

Main themes:

  • Predicting protein structure and function from sequence
  • Understanding proteins as complex, context-dependent systems
  • Data quality and annotation are critical challenges
  • Computational methods (especially ML) are essential tools

Folding and Proteins

Folding occurs in solvent → in a polar solvent a protein can only fold, and it does so spontaneously.

A protein is a complex system, because the properties of a protein cannot be derived from the sum of the chemical-physical properties of its residues. Also, proteins are social entities.

Proteins can be composed of more than one polypeptide chain; in this case we say they are heteropolymers.

Stabilizing interactions in proteins:

  • Dipole-Dipole interactions: molecules with non-symmetrical electron distributions.
  • Ion-Ion interactions: interactions within oppositely charged molecules.
  • Van der Waals interactions: mainly occurs between non-polar molecules.
  • Hydrogen bonding.
  • Disulfide bonds.
ℹ️
Proteins can be classified according to their most common secondary structure (SCOP classification):

1. All alpha-proteins: they have at least 70% alpha helices

2. All beta-proteins

3. Alpha+beta proteins: alpha helices and beta sheets occur separately along the protein → beta sheets are therefore mostly antiparallel

4. Alpha/beta proteins: alpha helices and beta sheets alternate along the protein → beta sheets are therefore mostly parallel

Protein identity: proteins with at least 30% sequence identity have the same structure. This is an important statistic because, if we want to train a machine, we want to avoid having a lot of proteins with the same structure. We can see the number of non-redundant structures by identity in the statistics section of the PDB.

Dihedral angles

The most mobile angles of a protein backbone are the dihedral angles. The peptide bond is very rigid because it is stabilized by resonance, so it is not mobile; the average length of the peptide bond is 1.32 Å. The possible dihedral angles of a polypeptide are represented in the Ramachandran plot. It shows the favoured, allowed, and generously allowed (and forbidden) dihedral angles for each residue. The Ramachandran plot has the Phi degrees on the x axis and the Psi degrees on the y axis. Each dot represents a residue.

The Phi (line + circle) angle is the rotation around the bond between the alpha carbon and the nitrogen; the Psi (trident) angle is the rotation around the bond between the alpha carbon and the carbon of the carboxylic acid.

Protein surface

Van der Waals volume: the van der Waals volume of a specific atom is the volume occupied by that atom. The volume has the shape of a sphere → two atoms cannot approach each other (to interact) at a distance smaller than their van der Waals radii, but, in a covalent bond, the space occupied by two atoms is not the sum of their van der Waals volumes, because in a covalent bond the van der Waals volumes overlap.

The solvent accessible surface is computed using a probe in the shape of a sphere (the sphere represents the solvent, so it has the van der Waals volume of a solvent molecule). The probe is moved across the surface of the protein, and the resulting line that the centre of the sphere draws is the solvent accessible surface.

The solvent excluded surface instead, is more similar to the real surface of the protein, since it is an approximation of the van der waals radii of the protein obtained by the boundary that separates protein and solvent.

Protein domains

📖
Definition

A protein domain is a portion of a protein characterized by a set of secondary structures with a specific organization in space.

PFAM is a database: a large collection of protein families, represented by multiple sequence alignments and HMMs. PFAM models are HMMs trained to recognize protein domains. It is the most used database for detecting domains in full-length proteins.

🔬
Fact

PFAM → HMMs and MSAs for protein family representation

PROSITE → small domains, motifs and conserved/active sites. Sequence analysis

INTERPRO → meta-database for annotation

PROSITE: a database that contains motifs and small domains. It focuses on active sites, binding sites, etc. It contains patterns (regular expressions) and profiles. Not used for whole domains.

INTERPRO: It is a meta-database that integrates many databases (PFAM and PROSITE for example). It is mainly used for functional annotation.

CATH: Class Architecture Topology/fold Homologous superfamily. It is a database resource that provides information on the evolutionary relationships of protein domains.

SCOP

SCOP → structural classification of domains:

Similar to CATH and Pfam databases, SCOP (structural classification of proteins) provides a classification of individual structural domains of proteins, rather than a classification of the entire proteins which may include a significant number of different domains. It focuses on the relationship between proteins and the classification of proteins into families starting from their structure. It has a hierarchical classification system.

Protein Families (SCOP):

Families: clearly evolutionarily related. Proteins in one family almost all have at least 30% sequence identity. → Below 30% sequence identity we can have proteins that share the same structure and proteins that have completely different structures. Sometimes proteins can share the same structure even below 10% sequence identity, but we have to superimpose the structures to find out. The 30% threshold comes from the methods used in sequence alignment → those methods cannot predict the same structure for proteins under 30% sequence identity. It is important to note that some protein families have the same function but different structures → in this case, the way to know the structure of a protein in such a family is to look at the length of the protein and see which structure inside that family fits best.

Superfamily: groups two or more families with a probable common evolutionary origin, even if their sequence identity is low. Proteins in a superfamily have sequence identity below 30%. Proteins in a superfamily have similar structures, and sometimes (not always) share function.

Fold: major structural similarity; proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Having the same fold does not imply that the proteins share evolutionary history; it is purely a structural classification and may be the result of convergent evolution. Folds provide a useful way to understand the limited number of structural solutions used by nature.

Class: secondary structure-based classification (alpha proteins, beta proteins, alpha+beta, alpha/beta)

Sequence Alignment

Why Do We Align Sequences?

Because similarity reveals relationships.

If two protein or DNA sequences are similar, they likely:

  • Share a common ancestor (homology)
  • Have similar functions (we can transfer annotations)
  • Adopt similar structures (especially for proteins)

The core idea: Evolution preserves what works. Similar sequences suggest shared evolutionary history, which means shared function and structure.

Without alignment, we can't quantify similarity. Alignment gives us a systematic way to compare sequences and measure their relatedness.

Pairwise vs Multiple Sequence Alignment

Feature      | Pairwise Alignment                                | Multiple Sequence Alignment (MSA)
Definition   | Align two sequences                               | Align three or more sequences
Purpose      | Find similarity between two sequences             | Find conserved regions across multiple sequences
Algorithms   | Needleman-Wunsch (global), Smith-Waterman (local) | Progressive (ClustalW, MUSCLE), iterative (MAFFT), consistency-based (T-Coffee)
Complexity   | O(n²) - fast                                      | O(n^k), where k = number of sequences - slow
Common tools | BLAST, FASTA, EMBOSS (Needle, Water)              | ClustalW, Clustal Omega, MUSCLE, MAFFT, T-Coffee
Output       | One optimal alignment                             | Consensus of all sequences
Best for     | Comparing two proteins/genes, database searches   | Phylogenetic analysis, finding conserved motifs, family analysis

Pairwise Sequence Alignment

The basic scenario in bioinformatics:

  • You have a sequence of interest (newly discovered, unknown function)
  • You have a known sequence (well-studied, annotated)
  • Question: Are they similar?
  • Hypothesis: If similar, they might share function/structure

Sequence Identity

Sequence identity is the percentage of exact matches between aligned sequences.

Example:

Seq1: ACGTACGT
Seq2: ACGTCCGT
      ||||.|||
Identity: 7/8 = 87.5%
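
A quick way to check this arithmetic, assuming two already-aligned, gap-free sequences of equal length (the helper name is mine):

def percent_identity(seq1, seq2):
    # Count positions where both sequences carry the same residue
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100 * matches / len(seq1)

print(percent_identity("ACGTACGT", "ACGTCCGT"))  # 87.5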

But identity alone doesn't tell the whole story - we need to consider biological similarity (similar but not identical amino acids).


Two Types of Sequence Alignment

Global Alignment

Goal: Align every residue in both sequences from start to end.

Residue = individual unit in a sequence:

  • For DNA/RNA: nucleotide (A, C, G, T/U)
  • For proteins: amino acid

How it works:

  • Start sequences at the same position
  • Optimize alignment by inserting gaps where needed
  • Forces alignment of entire sequences

Example (ASCII):

Seq1: ACGT-ACGT---
      |||| ||||
Seq2: ACGTTACGTAGC

Best for: Sequences of similar length that are expected to be similar along their entire length.

Local Alignment

Goal: Find the most similar regions between sequences, ignoring less similar parts.

How it works:

  • Identify regions of high similarity
  • Ignore dissimilar terminals and regions
  • Can find multiple local alignments in the same pair

Example (ASCII):

Seq1:       GTACGT
            ||||||
Seq2: AAAAGTGTACGTCCCC

Only the middle region is aligned; terminals are ignored.

Best for:

  • Short sequence vs. longer sequence
  • Distantly related sequences
  • Finding conserved domains in otherwise divergent proteins

Scoring Alignments

Because there are many possible ways to align two sequences, we need a scoring function to assess alignment quality.

Simple Scoring: Percent Match

Basic approach: Count matches and calculate percentage.

Seq1: ACGTACGT
      |||| |||
Seq2: ACGTTCGT

Matches: 7/8 = 87.5%

Problem: This treats all mismatches equally. But some substitutions are more biologically likely than others.


Additive Scoring with Linear Gap Penalty

Better approach: Assign scores to matches, mismatches, and gaps.

Simple scoring scheme:

  • Match (SIM): +1
  • Mismatch: -1
  • Gap penalty (GAP): -1

Formula:

Score = Σ[SIM(s1[pos], s2[pos])] + (gap_positions × GAP)

Example:

Seq1: ACGT-ACGT
      |||| ||||
Seq2: ACGTTACGT

Matches: 8 × (+1) = +8
Gap: 1 × (-1) = -1
Total Score = +7
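
A tiny scorer for already-aligned sequences under this scheme (the function name and defaults are mine, a sketch rather than a standard API):

def alignment_score(s1, s2, match=1, mismatch=-1, gap=-1):
    # s1 and s2 are aligned strings of equal length; '-' marks a gap
    score = 0
    for a, b in zip(s1, s2):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

print(alignment_score("ACGT-ACGT", "ACGTTACGT"))  # 8 matches - 1 gap = +7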

Affine Gap Penalty: A Better Model

Problem with linear gap penalty: Five gaps in one place vs. five gaps in different places - which is more biologically realistic?

Answer: Consecutive gaps (one insertion/deletion event) are more likely than multiple separate events.

Affine gap penalty:

  • GOP (Gap Opening Penalty): Cost to START a gap (e.g., -5)
  • GEP (Gap Extension Penalty): Cost to EXTEND an existing gap (e.g., -1)

Formula:

Score = Σ[SIM(s1[pos], s2[pos])] + (number_of_gaps × GOP) + (total_gap_length × GEP)

Example:

One gap of length 3: GOP + (3 × GEP) = -5 + (3 × -1) = -8
Three gaps of length 1: 3 × (GOP + GEP) = 3 × (-5 + -1) = -18

Consecutive gaps are penalized less - matches biological reality better.
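
Extending the same sketch to affine gaps, where each gap run pays GOP once plus GEP per position (this matches the formula above; real aligners differ in the exact convention):

def affine_score(s1, s2, match=1, mismatch=-1, gop=-5, gep=-1):
    score = 0
    in_gap = False
    for a, b in zip(s1, s2):
        if a == "-" or b == "-":
            if not in_gap:   # opening a new gap run costs GOP
                score += gop
            score += gep     # every gap position also costs GEP
            in_gap = True
        else:
            in_gap = False
            score += match if a == b else mismatch
    return score

# One gap of length 3 vs. three separate gaps of length 1
print(affine_score("AAA---AAA", "AAACCCAAA"))  # 6 matches - 5 - 3 = -2
print(affine_score("AA-AA-AA-", "AACAACAAC"))  # 6 matches + 3 * (-5 - 1) = -12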


DNA vs. Protein Level Alignment

The Problem

Consider these DNA sequences:

DNA1: CAC
DNA2: CAT
      ||.

At the DNA level: C matches C, A matches A, but C doesn't match T (67% identity).

But translate to protein:

CAC → Histidine (His)
CAT → Histidine (His)

Both code for the same amino acid! At the protein level, they're 100% identical.
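
A toy illustration; the two-entry codon table below is just an excerpt of the standard genetic code, only enough for this example:

# Minimal excerpt of the standard genetic code
CODONS = {"CAC": "His", "CAT": "His"}

print(CODONS["CAC"], CODONS["CAT"])    # His His
print(CODONS["CAC"] == CODONS["CAT"])  # True: a silent (synonymous) mutation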

Which Level to Use?

DNA alignment:

  • More sensitive to recent changes
  • Can detect synonymous mutations
  • Good for closely related sequences

Protein alignment:

  • Captures functional conservation
  • More robust for distant relationships
  • Ignores silent mutations

Rule of thumb: For evolutionary distant sequences, protein alignment is more informative because the genetic code is redundant - multiple codons can encode the same amino acid.


Substitution Matrices: Beyond Simple Scoring

The DNA Problem: Not All Mutations Are Equal

Transversion and Transition

Transitions (purine ↔ purine or pyrimidine ↔ pyrimidine):

  • A ↔ G
  • C ↔ T
  • More common in evolution

Transversions (purine ↔ pyrimidine):

  • A/G ↔ C/T
  • Less common (different ring structures)

Implication: Not all mismatches should have the same penalty. A transition should be penalized less than a transversion.

The Protein Problem: Chemical Similarity

Amino Acid Properties Venn Diagram

Amino acids have different chemical properties:

  • Hydrophobic vs. hydrophilic
  • Charged vs. neutral
  • Small vs. large
  • Aromatic vs. aliphatic

Key insight: Substitutions between chemically similar amino acids (same set in the diagram) occur with higher probability in evolution.

Example:

  • Leucine (Leu) → Isoleucine (Ile): Both hydrophobic, similar size → common
  • Leucine (Leu) → Aspartic acid (Asp): Hydrophobic → charged → rare

Problem: Venn diagrams aren't computer-friendly. We need numbers.

Solution: Substitution matrices.


PAM Matrices (Point Accepted Mutation)

PAM250 Matrix

Image © Anthony S. Serianni. Used under fair use for educational purposes.
Source: https://www3.nd.edu/~aseriann/CHAP7B.html/sld017.htm

PAM matrices encode the probability of amino acid substitutions.

How to read the matrix:

  • This is a symmetric matrix (half shown, diagonal contains self-matches)
  • Diagonal values (e.g., Cys-Cys = 12): Score for matching the same amino acid
  • Off-diagonal values: Score for substituting one amino acid for another

Examples from PAM250:

  • Cys ↔ Cys: +12 (perfect match, high score)
  • Pro ↔ Leu: -3 (not very similar, small penalty)
  • Pro ↔ Trp: -6 (very different, larger penalty)

Key principle: Similar amino acids (chemically) have higher substitution probabilities and therefore higher scores in the matrix.

What Does PAM250 Mean?

PAM = Point Accepted Mutation

PAM1: 1% of amino acids have been substituted (very similar sequences)
PAM250: Extrapolated to 250 PAMs (very distant sequences)

Higher PAM number = more evolutionary distance = use for distantly related proteins


BLOSUM Matrices (BLOcks SUbstitution Matrix)

BLOSUM is another family of substitution matrices, built differently from PAM.

How BLOSUM is Built

Block database: Collections of ungapped, aligned sequences from related proteins.

Amino acids in the blocks are grouped by chemistry of the side chain (like in the Venn diagram).

Each value in the matrix is calculated by:

Frequency of (amino acid pair in database)
÷
Frequency expected by chance

Then converted to a log-odds score.
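
In other words, something like this (the frequencies below are made up just to show the arithmetic; BLOSUM reports the log-odds in half-bit units, hence the factor of 2):

import math

observed = 0.0030  # hypothetical frequency of a pair in the block database
expected = 0.0015  # hypothetical frequency expected by chance

score = 2 * math.log2(observed / expected)
print(round(score))  # 2 -> positive: the pair occurs more often than chance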

Interpreting BLOSUM Scores

Zero score:
Amino acid pair occurs as often as expected by random chance.

Positive score:
Amino acid pair occurs more often than by chance (conserved substitution).

Negative score:
Amino acid pair occurs less often than by chance (rare/unfavorable substitution).

BLOSUM Naming: The Percentage

BLOSUM62: Matrix built from blocks with no more than 62% similarity.

What this means:

  • BLOSUM62: Mid-range, general purpose
  • BLOSUM80: More related proteins (higher % identity)
  • BLOSUM45: Distantly related proteins (lower % identity)

Note: Higher number = MORE similar sequences used to build matrix.

Which BLOSUM to Use?

Depends on how related you think your sequences are:

Comparing two cow proteins?
Use BLOSUM80 (closely related species, expect high similarity)

Comparing human protein to bacteria?
Use BLOSUM45 (distantly related, expect low similarity)

Don't know how related they are?
Use BLOSUM62 (default, works well for most cases)
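
If you want to poke at a real matrix yourself, Biopython ships the standard ones (this assumes you have Biopython installed; the two scores below are from BLOSUM62):

from Bio.Align import substitution_matrices

# Load the BLOSUM62 matrix bundled with Biopython
blosum62 = substitution_matrices.load("BLOSUM62")

print(blosum62["L", "I"])  # 2.0  -> chemically similar pair, positive score
print(blosum62["L", "D"])  # -4.0 -> hydrophobic vs. charged, negative score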


PAM vs. BLOSUM: Summary

Feature       | PAM                                         | BLOSUM
Based on      | Evolutionary model (extrapolated mutations) | Observed alignments (block database)
Numbers mean  | Evolutionary distance (PAM units)           | % similarity of sequences used
Higher number | More distant sequences                      | More similar sequences (opposite!)
Equivalences  | PAM250 ≈ BLOSUM45 (both for distant proteins); PAM100 ≈ BLOSUM80 (both for close proteins)
Most common   | PAM250                                      | BLOSUM62

Key difference in naming:

  • PAM: Higher number = MORE evolutionary distance
  • BLOSUM: Higher number = LESS evolutionary distance (MORE similar sequences)

Which to use?

  • BLOSUM is more commonly used today (especially BLOSUM62)
  • PAM is more theoretically grounded but less practical
  • For most purposes: Start with BLOSUM62

Dynamic Programming

Please see the complete topic written on this separate page.

Needleman-Wunsch Algorithm

Biomedical Databases

Hey! Welcome to my notes for the Biomedical Databases course where biology meets data engineering.

Course Overview

Total Lectures: 14
Pace: About 2 lectures per week
Structure: The course is divided into modules, with the first module focusing on biological databases specifically.

Important Heads-Up

The exam may be split into two sessions based on the modules. The first module is all about biological databases, so pay extra attention when preparing.

Supplementary Learning Resource

If you want to dive deeper into database fundamentals (and I mean really deep), check out:

CMU 15-445/645: Intro to Database Systems (Fall 2024)

About the CMU Course

This is one of the best database courses available online, taught by Andy Pavlo at Carnegie Mellon University. It's more advanced and assumes some C++ knowledge, but the explanations are incredibly clear.

Recommended approach:

  • Watch about 2 CMU videos for every 1 lecture we have
  • Don't worry if you don't understand everything - it's graduate-level content
  • Focus on the conceptual explanations rather than the C++ implementation details
  • Use it to deepen your understanding, not as a replacement for our course

The CMU course covers database internals, query optimization, storage systems, and transaction management at a much deeper level. It's perfect if you're curious about how databases actually work under the hood.

Study Strategy

Here's what works for me:

  1. Attend the lecture and take rough notes
  2. Review and organize the notes here within 24 hours (while it's fresh)
  3. Watch relevant CMU videos for deeper understanding (optional but recommended)
  4. Practice with real databases when applicable
  5. Connect concepts between biological applications and database theory

Boolean Algebra in a Nutshell

There are only two Boolean values:

  • True (1, yes, on)
  • False (0, no, off)

Basic Operators

AND Operator (∧)

The AND operator returns True only when both inputs are True.

Truth Table:

A      B      A AND B
False  False  False
False  True   False
True   False  False
True   True   True

OR Operator (∨)

The OR operator returns True when at least one input is True.

Truth Table:

A      B      A OR B
False  False  False
False  True   True
True   False  True
True   True   True

NOT Operator (¬)

The NOT operator flips the value - True becomes False, False becomes True.

Truth Table:

A      NOT A
False  True
True   False

Combining Operators

You can combine operators to create complex logical expressions.

Operator Precedence (Order of Operations)

⚠️
Order Matters

1. NOT (highest priority)
2. AND
3. OR (lowest priority)

Example: A OR B AND C

  • First do: B AND C
  • Then do: A OR (result)

Use parentheses to be clear: (A OR B) AND C
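
Python's boolean operators follow the same precedence, so you can sanity-check this yourself (a tiny sketch):

A, B, C = True, False, False

# not > and > or, so A or B and C is parsed as A or (B and C)
print(A or B and C)    # True
print((A or B) and C)  # False - parentheses change the result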

Venn Diagrams

Write an expression to represent the outlined part of the Venn diagram shown.

[Image: Set Operations Venn Diagrams]

ℹ️
Image Source

Image from Book Title by David Lippman, Pierce College. Licensed under CC BY-SA. View original

โ“
Problem 1: Morning Beverages

A survey asks 200 people "What beverage do you drink in the morning?", and offers these choices:

  • Tea only
  • Coffee only
  • Both coffee and tea

Suppose 20 report tea only, 80 report coffee only, 40 report both.

Questions:
a) How many people drink tea in the morning?
b) How many people drink neither tea nor coffee?

โ“
Problem 2: Course Enrollment

Fifty students were surveyed and asked if they were taking a social science (SS), humanities (HM) or a natural science (NS) course the next quarter.

  • 21 were taking a SS course
  • 26 were taking a HM course
  • 19 were taking a NS course
  • 9 were taking SS and HM
  • 7 were taking SS and NS
  • 10 were taking HM and NS
  • 3 were taking all three
  • 7 were taking none

Question: How many students are taking only a SS course?

ℹ️
Source Attribution

Problems adapted from David Lippman, Pierce College. Licensed under CC BY-SA.
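
If you want to check your answers, here's a quick inclusion-exclusion sketch in Python (my own addition, not part of Lippman's problems):

# Problem 1: "drinks tea" includes the people who drink both
total = 200
tea_only, coffee_only, both = 20, 80, 40
print(tea_only + both)                          # a) 60 people drink tea
print(total - (tea_only + coffee_only + both))  # b) 60 drink neither

# Problem 2: peel off the overlaps that are not "SS only"
# (the pairwise counts still include the 3 students taking all three)
ss, ss_hm, ss_ns, all_three = 21, 9, 7, 3
print(ss - (ss_hm - all_three) - (ss_ns - all_three) - all_three)  # 8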

PubMed/MeSH

Learn a systematic approach to find relevant articles on a given topic in PubMed, combined with MeSH.

PubMed is a free search engine maintained by the U.S. National Library of Medicine (NLM) that gives you access to more than 39 million citations from biomedical and life-science literature.


PubMed
├── Search
│   ├── Basic Search
│   └── Advanced Search
│       └── MeSH Search
│
├── Filters
│   ├── Year
│   ├── Article Type
│   └── Free Full Text
│
├── Databases
│   ├── MEDLINE
│   ├── PubMed Central
│   └── Bookshelf
│
└── Article Page
    ├── Citation
    ├── Abstract
    ├── MeSH Terms
    └── Links to Full Text


What is MeSH DB?

MeSH terms are like tags attached to research papers. You can access the MeSH database at https://www.ncbi.nlm.nih.gov/mesh/.

MeSH DB (Medical Subject Headings Database) is a controlled vocabulary system used to tag, organize, and standardize biomedical topics for precise searching in PubMed.
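
For example, here's a hypothetical search combining two MeSH terms with a Boolean operator (any real topic pair works the same way):

"Diabetes Mellitus, Type 2"[MeSH Terms] AND "Metformin"[MeSH Terms]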


Protein Databases

Protein databases store information about protein structures, sequences, and functions. They come from experimental methods or computational predictions.

PDB

📖
Definition

What is PDB? PDB (Protein Data Bank) is the main global database that stores 3D structures of proteins, DNA, RNA, and their complexes.

How is experimental structure data obtained? (3 methods)

  1. X-ray Crystallography (88%): uses crystals + X-ray diffraction to map atomic positions.
  2. NMR Spectroscopy (10%): uses magnetic fields to determine structures in solution.
  3. Cryo-Electron Microscopy (Cryo-EM) (1%): images flash-frozen samples with an electron microscope.

What is a Ligand?: A ligand is any small molecule, ion, or cofactor that binds to the protein in the structure, often to perform a specific biological function. Example: iron in hemoglobin

What is Resolution (Å)? Resolution (in Ångström) measures the level of detail; smaller value = sharper, more accurate structure.

UCSF-Chimera

Resources:

Short Playlist:

Ramachandran Plots

What are Ramachandran plots?

Ramachandran Plots in UCSF-Chimera here

UniProt

NCBI

ENSEMBL

This Container Has a Snake Inside

We will talk in this topic about containers and how to put the snake (Python) inside them.

[Image] This image is a reference to a scene from an Egyptian movie, where a character humorously asks what's inside the box.

Introduction to Containers

📖
Definition

Containers: an easy way to bundle an application together with its requirements, with the ability to deploy it in many places.

Applications inside a box with some requirements? Hmmm, but a Virtual Machine can do this too. We need to know how the whole story began.

The Beginning: Bare Metal

📝
One App, One Server

Each application needed its own physical server. Servers ran at 5-15% capacity but you paid for 100%.

Virtual Machines (VMs) Solution

✅
Split One Server Into Many

Hypervisor software lets you run multiple "virtual servers" on one physical machine.

How it works:

Physical Server
├── Hypervisor
├── VM 1 (Full OS + App)
├── VM 2 (Full OS + App)
└── VM 3 (Full OS + App)

[Image: VMs running on a hypervisor]

⚠️
The Hidden Costs

VMs solved hardware waste but created new problems at scale.

Every VM runs a complete operating system. If you have 1,000 VMs, you're running 1,000 complete operating systems, each consuming 2-4GB of RAM, taking minutes to boot, and requiring constant maintenance.

Every operating system needs a license.

Each VM's operating system needs monthly patches, security updates, backups, monitoring, and troubleshooting. At 1,000 VMs, you're maintaining 1,000 separate operating systems.

You need specialized VMware administrators, OS administrators for each type of VM, network virtualization experts, and storage specialists. Even with templates, deploying a new VM takes days because it requires coordination across multiple expert teams.

Container Architecture

If you notice in the previous image, we are repeating the OS. We just need to change the app and its requirements.

Think about it: an OS is just a kernel (for hardware recognition - the black screen that appears when you turn on the PC) and user space. For running applications, we don't need the full user space, we only need the kernel (for hardware access).

Another thing - the VMs are already installed on a real (physical) machine that already has a kernel, so why not just use it? If we could use the host's kernel and get rid of the OS for each VM, we'd solve half the problem. This is one of the main ideas behind containers.

How can we do this? First, remember that the Linux kernel is the same everywhere in the world - what makes distributions different is the user space. Start with the kernel, add some tools and configurations, you get Debian. Add different tools, you get Ubuntu. It's always: kernel + different stuff on top = different distributions.

How do containers achieve this idea? By using layers. Think of it like a cake:

[Image: container image layers, stacked like a cake]

You can stop at any layer! Layer 1 alone (just the base OS files) is a valid container - yes, you can have a "container of an OS", but remember it's not a full OS, just the user space files without a kernel. Each additional layer adds something specific you need.

After you finish building these layers, you can save the complete stack as a template. This template is called an image. When you run an image, it becomes a running container.


Remember, we don't care about the OS - Windows, Linux, macOS - they all have kernels. If your app needs Linux-specific tools or Windows-specific tools, you can add just those specific components in a layer and continue building. This reduces dependencies dramatically.

The idea is: start from the kernel and build up only what you need. But how exactly does this work?

The Linux Magic: cgroups and namespaces

Containers utilize Linux kernel features, specifically cgroups and namespaces.

cgroups (control groups): control how much CPU, memory, and disk a process can use.

Example:

  • Process A: Use maximum 2 CPU cores and 4GB RAM
  • Process B: Use maximum 1 CPU core and 2GB RAM
  • Result: cgroups ensure Process A can't steal resources from Process B

namespaces: These manage process isolation and hierarchy, they make processes think they're alone on the system.

Example: Process tree isolation

Host System:
├── Process 1 (PID 1)
├── Process 2 (PID 2)
└── Process 3 (PID 3)

Inside Container (namespace):
└── Process 1 (thinks it's PID 1, but it's actually PID 453 on host)
    └── Process 2 (thinks it's PID 2, but it's actually PID 454 on host)

The container's processes think they're the only processes on the system, completely unaware of other containers or host processes.

Containers = cgroups + namespaces + layers

If you think about it, cgroups + namespaces = container isolation. You start with one process, isolated in its own namespace with resource limits from cgroups. From that process, you install specific libraries, then Python, then pip install your dependencies, and each step is a layer.


You can even use the familiar idea of Unix signals to control containers: send SIGTERM to the main process and, by extension, stop the entire container.

Because namespaces and cgroups are built into the Linux kernel, we only need the kernel, nothing else! No full operating system required.

The Tool: Docker

There are many technologies that achieve containerization (rkt, Podman, containerd), but the most famous one is made by Docker Inc. The software? They called it "Docker."

Yeah, super creative naming there, folks. :)


If you install Docker on Windows, you are actually installing Docker Desktop, which creates a lightweight virtual machine behind the scenes. Inside that VM, Docker runs a Linux environment, and your Linux containers run there.

If you want to run Windows containers, Docker Desktop can switch to Windows container mode, but those require the Windows kernel and cannot run inside the Linux VM.

Same for macOS.

If you install Docker on Linux, there is no virtual machine involved. You simply get the tools to create and run containers directly.

Install Docker

For Windows or macOS, see: Overview of Docker Desktop.

If you are on Ubuntu, run these commands:

curl -fsSL https://get.docker.com -o get-docker.sh

Then preview what the script will do (optional):

sudo sh ./get-docker.sh --dry-run

Then run it without --dry-run to actually install:

sudo sh ./get-docker.sh

Then run to verify:

sudo docker info

If writing sudo every time is annoying, you need to add yourself (your username) to the docker group and then restart your machine.

Run the following, replacing mahmoudxyz with your username:

sudo usermod -aG docker mahmoudxyz

After you restart your PC, you will not need to use sudo again before docker.

Basic Docker Commands

Let's start with a simple command:

docker run -it python

This command creates and starts a container (a shortcut for docker create + docker start). The -i flag keeps STDIN open (interactive), and -t allocates a terminal (TTY).

Another useful thing about docker run is that if you don't have the image locally, Docker will automatically pull it from Docker Hub.

The output of this command shows some downloads and other logs, but the most important part is something like:

Digest: sha256:[text here]

This string can also serve as your image ID.

After the download finishes, Docker will directly open the Python interactive mode:

[Image: Python interactive mode]

You can write Python code here, but if you exit Python, the entire container stops. This illustrates an important concept: a container is designed to run a single process. Once that process ends, the container itself ends.

Command        Description                                                    Example
docker pull    Downloads an image from Docker Hub (or another registry)      docker pull fedora
docker create  Creates a container from an image without starting it         docker create fedora
docker run     Creates and starts a container (shortcut for create + start)  docker run fedora
docker ps      Lists running containers                                      docker ps
docker ps -a   Lists all containers (stopped + running)                      docker ps -a
docker images  Shows all downloaded images                                   docker images

Useful Flags

Flag  Meaning                                                Example
-i    Keep STDIN open (interactive)                          docker run -i fedora
-t    Allocate a TTY (terminal)                              docker run -t fedora
-it   Interactive + TTY → lets you use the container shell   docker run -it fedora bash
ls    Linux command used inside the container to list files  docker run -it ubuntu ls

To remove a container, use:

docker rm <container_id_or_name>

You can only remove stopped containers. If a container is running, you need to stop it first with:

docker stop <container_id_or_name>

Port Forwarding

When you run a container that exposes a service (like a web server), you often want to access it from your host machine. Docker allows this using the -p flag:

docker run -p <host_port>:<container_port> <image>

Example:

docker run -p 8080:80 nginx

  1. 8080 → the port on your host machine
  2. 80 → the port inside the container that Nginx listens on

Now, you can open your browser and visit http://localhost:8080 … and you'll see the Nginx welcome page.

Docker Networks (in a nutshell)

Docker containers are isolated by default. Each container has its own network stack and cannot automatically see or communicate with other containers unless you connect them.

A Docker network allows containers to:

  • Communicate with each other using container names instead of IPs.
  • Avoid port conflicts and isolate traffic from the host or other containers.
  • Use DNS resolution inside the network (so container1 can reach container2 by name).

Default Networks

Docker automatically creates a few networks:

  1. bridge → the default network for standalone containers.
  2. host → containers share the host's network.
  3. none → containers have no network.

If you want multiple containers (e.g., Jupyter + database) to talk to each other safely and easily, itโ€™s best to create a custom network like bdb-net.

Example:

docker network create bdb-net

Jupyter Docker

Jupyter Notebook can easily run inside a Docker container, which helps avoid installing Python and packages locally.

Don't forget to create the network first:

docker network create bdb-net
docker run -d --rm --name my_jupyter \
  --mount src=bdb_data,dst=/home/jovyan \
  -p 127.0.0.1:8888:8888 \
  --network bdb-net \
  -e JUPYTER_ENABLE_LAB=yes \
  -e JUPYTER_TOKEN="bdb_password" \
  --user root \
  -e CHOWN_HOME=yes -e CHOWN_HOME_OPTS="-R" \
  jupyter/datascience-notebook

Flags and options:

Option                                     Meaning
-d                                         Run container in detached mode (in the background)
--rm                                       Automatically remove the container when it stops
--name my_jupyter                          Assign a custom name to the container
--mount src=bdb_data,dst=/home/jovyan      Mount the volume bdb_data to /home/jovyan inside the container
-p 127.0.0.1:8888:8888                     Forward host localhost port 8888 to container port 8888
--network bdb-net                          Connect the container to the Docker network bdb-net
-e JUPYTER_ENABLE_LAB=yes                  Start JupyterLab instead of the classic Notebook
-e JUPYTER_TOKEN="bdb_password"            Set a token/password for access
--user root                                Run the container as the root user (needed for certain permissions)
-e CHOWN_HOME=yes -e CHOWN_HOME_OPTS="-R"  Change ownership of the home directory to the user inside the container
jupyter/datascience-notebook               The Docker image containing Python, Jupyter, and data science packages

After running this, access Jupyter Lab at: http://127.0.0.1:8888. Use the token bdb_password to log in.

Topics (coming soon)

Docker engine architecture, Docker image deep dives, container deep dives, networking

Pandas

Introduction to Databases

📖
Definition

A database (DB) is an organized collection of structured data stored electronically in a computer system, managed by a Database Management System (DBMS).

Let's Invent Database

Alright, so imagine you're building a movie collection app with Python. At first, you might think "I'll just use files!"

You create a file for each movie - titanic.txt, inception.txt, and so on. Inside each file, you write the title, director, year, rating. Simple enough!

But then problems start piling up. You want to find all movies from 2010? Now you're writing Python code to open every single file, read it, parse it, check the year. Slow and messy.

Your friend wants to update a movie's rating while you're reading it? Boom! File corruption or lost data because two programs can't safely write to the same file simultaneously.

You want to find all movies directed by Nolan AND released after 2010? Now your Python script is getting complex, looping through thousands of files, filtering multiple conditions.

What if the power goes out mid-write? Half-updated file, corrupted data.

This is where you start thinking, "there has to be a better way!" What if instead of scattered files, we had one organized system that could handle all this? A system designed from the ground up for concurrent access, fast searching, data integrity, and complex queries. That's the core idea behind what we'd call a database.

Database Management System

So you've realized you need a better system. Enter the DBMS, the Database Management System.

Instead of your Python code directly wrestling with files, the DBMS handles all the heavy lifting, managing storage, handling concurrent users, ensuring data doesn't get corrupted, and executing queries efficiently.

But here's the key question: how should we actually structure this data?

This is where the data model comes in. It's your blueprint for organizing information. For movies, you might think: "Every movie has attributes: title, director, year, rating." That's relational-model thinking: data organized in tables with rows and columns, like a spreadsheet but much more powerful.

Relational Model - Tables:

movie_id  title         director  year  rating
1         Inception     Nolan     2010  8.8
2         Titanic       Cameron   1997  7.9
3         Interstellar  Nolan     2014  8.7

Or maybe you think: "Movies are connected, directors make movies, actors star in them, movies belong to genres." That's more of a graph model, focusing on relationships between entities.

Graph Model - Nodes and Relationships:

(Movie: Inception)
       |
       |--[DIRECTED_BY]--> (Director: Nolan)
       |
       |--[RELEASED_IN]--> (Year: 2010)
       |
       |--[HAS_RATING]--> (Rating: 8.8)

(Movie: Interstellar)
       |
       |--[DIRECTED_BY]--> (Director: Nolan)
       |
       |--[RELEASED_IN]--> (Year: 2014)

The data model you choose shapes everything, how you store data, how you query it, how it performs. It's the fundamental architectural decision that defines your database.

What Is a Schema?

The schema is the blueprint (like a class in Java or Python) or structure of your database. It defines what can be stored and how it's organized, but not the actual data itself.

For our movie table, the schema would be:

Movies (
  movie_id: INTEGER,
  title: TEXT,
  director: TEXT,
  year: INTEGER,
  rating: FLOAT
)

It specifies the table name, column names, and data types. It's like the architectural plan of a building, it shows the rooms and layout, but the furniture (actual data) comes later.

The schema enforces rules: you can't suddenly add a movie with a text value in the year field, or store a rating as a string. It keeps your data consistent and predictable.
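
Here's a minimal sketch of that schema in SQLite, via Python's built-in sqlite3 module (the table and column names just mirror the example above):

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""
    CREATE TABLE movies (
        movie_id INTEGER PRIMARY KEY,
        title    TEXT,
        director TEXT,
        year     INTEGER,
        rating   REAL
    )
""")

# Rows must fit the declared structure; the actual data comes later
conn.execute("INSERT INTO movies VALUES (1, 'Inception', 'Nolan', 2010, 8.8)")
print(conn.execute("SELECT title FROM movies WHERE year > 2009").fetchall())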

Data Models

These are just examples to be aware of; we will study only a few, so it's OK if they sound complex. They aren't.

Relational (SQL)

  • Examples: PostgreSQL, MySQL, SQLite
  • Use case: transactions that need ACID guarantees and complex joins between related data.

Key-Value

  • Examples: Redis, Memcached
  • Use case: Session storage, user login tokens. Lightning-fast lookups by key, simple get/set operations.

Document/JSON (NoSQL)

  • Examples: MongoDB, CouchDB
  • Use case: Blog platform, each post is a JSON document with nested comments, tags, metadata. Flexible schema, easy to evolve.

Wide Column / Column Family

  • Examples: Cassandra, HBase
  • Use case: Time-series data like IoT sensors. Billions of writes per day, queried by device_id and timestamp range.

Array/Matrix/Vector

  • Examples: PostgreSQL with pgvector, Pinecone, Weaviate
  • Use case: AI embeddings for semantic search - store vectors representing documents, find similar items by vector distance.

Legacy Models:

  • Hierarchical
  • Network
  • Semantic
  • Entity-Relationship

The CAP Theorem

So you're building a distributed system. Maybe you've got servers in New York, London, and Tokyo because you want to be fancy and global. Everything's going great until someone asks you a simple question: "What happens when the network breaks?"

Welcome to the CAP theorem, where you learn that you can't have your cake, eat it too, and share it perfectly across three continents simultaneously.

The Three Musketeers (But Only Two Can Fight at Once)

CAP stands for Consistency, Availability, and Partition Tolerance. The theorem, courtesy of Eric Brewer in 2000, says you can only pick two out of three. It's like a cruel database version of "choose your fighter."

Consistency (C): Every node in your distributed system sees the same data at the same time. You read from Tokyo, you read from New York - same answer, guaranteed.

Availability (A): Every request gets a response, even if some nodes are down. The system never says "sorry, come back later."

Partition Tolerance (P): The system keeps working even when network connections between nodes fail. Because networks will fail - it's not if, it's when.

⚠️
Mind-Bender Alert

The "C" in CAP is NOT the same as the "C" in ACID! ACID consistency means your data follows all the rules (constraints, foreign keys). CAP consistency means all nodes agree on what the data is right now. Totally different beasts.

Why P Isn't Really Optional (Spoiler: Physics)

Here's the dirty secret: Partition Tolerance isn't actually optional in distributed systems. Network failures happen. Cables get cut. Routers die. Someone trips over the ethernet cord. Cosmic rays flip bits (yes, really).

If you're distributed across multiple machines, partitions will occur. So the real choice isn't CAP - it's really CP vs AP. You're choosing between Consistency and Availability when the network inevitably goes haywire.

ℹ️
The Single Machine Exception

If your "distributed system" is actually just one machine, congratulations! You can have CA because there's no network to partition. But then you're not really distributed, are you? This is why traditional RDBMS like PostgreSQL on a single server can give you strong consistency AND high availability.

CP: Consistency Over Availability

The Choice: "I'd rather return an error than return wrong data."

When a network partition happens, CP systems refuse to respond until they can guarantee you're getting consistent data. They basically say "I'm not going to lie to you, so I'm just going to shut up until I know the truth."

Examples: MongoDB (in default config), HBase, Redis (in certain modes), traditional SQL databases with synchronous replication.

When to choose CP:

  • Banking and financial systems - you CANNOT have Bob's account showing different balances on different servers
  • Inventory systems - overselling products because two datacenters disagree is bad for business
  • Configuration management - if half your servers think feature X is on and half think it's off, chaos ensues
  • Anything where stale data causes real problems, and it's better to show an error than a lie
💻
Real World Example

Your bank's ATM won't let you withdraw money during a network partition because it can't verify your balance with the main server. Annoying? Yes. Better than letting you overdraw? Absolutely.

AP: Availability Over Consistency

The Choice: "I'd rather give you an answer (even if it might be stale) than no answer at all."

AP systems keep responding even during network partitions. They might give you slightly outdated data, but hey, at least they're talking to you! They eventually sync up when the network heals - this is called "eventual consistency."

Examples: Cassandra, DynamoDB, Riak, CouchDB, DNS (yes, the internet's phone book).

When to choose AP:

  • Social media - if you see a slightly stale like count during a network issue, the world doesn't end
  • Shopping cart systems - better to let users add items even if inventory count is slightly off, sort it out later
  • Analytics dashboards - last hour's metrics are better than no metrics
  • Caching layers - stale cache beats no cache
  • Anything where availability matters more than perfect accuracy
💻
Real World Example

Twitter/X during high traffic: you might see different follower counts on different servers for a few seconds. But the tweets keep flowing, the system stays up, and eventually everything syncs. For a social platform, staying online beats perfect consistency.

The "It Depends"

Here's where it gets interesting: modern systems often aren't pure CP or AP. They let you tune the trade-off!

Cassandra has a "consistency level" setting. Want CP behavior? Set it to QUORUM. Want AP? Set it to ONE. You're literally sliding the dial between consistency and availability based on what each query needs.

💡
Pro Architecture Move

Different parts of your system can make different choices! Use CP for critical financial data, AP for user preferences and UI state. This is called "polyglot persistence" and it's how the big players actually do it.

The Plot Twist: PACELC

Just when you thought you understood CAP, along comes PACELC to ruin your day. It says: even when there's NO partition (normal operation), you still have to choose between Latency and Consistency.

Want every read to be perfectly consistent? You'll pay for it in latency because nodes have to coordinate. Want fast responses? Accept that reads might be slightly stale.

But that's a story for another day...

📝
Remember

CAP isn't about right or wrong. It's about understanding trade-offs and making conscious choices based on your actual needs. The worst decision is not knowing you're making one at all.

TL;DR

You can't have perfect consistency, perfect availability, AND handle network partitions. Since partitions are inevitable in distributed systems, you're really choosing between CP (consistent but might go down) or AP (always available but might be stale).

Choose CP when wrong data is worse than no data. Choose AP when no data is worse than slightly outdated data.

Now go forth and distribute responsibly!

ACID: The Database's Solemn Vow (NOT EXAM)

Picture this: You're transferring $500 from your savings to your checking account. The database deducts $500 from savings... and then the power goes out. Did the money vanish into the digital void? Did it get added to checking? Are you now $500 poorer for no reason?

This is the nightmare that keeps database architects up at night. And it's exactly why ACID exists.

ACID is a set of properties that guarantees your database transactions are reliable, even when the universe conspires against you. It stands for Atomicity, Consistency, Isolation, and Durability - which sounds like boring corporate jargon until you realize it's the difference between "my money's safe" and "WHERE DID MY MONEY GO?!"

A is for Atomicity: All or Nothing, Baby

Atomicity means a transaction is indivisible - it's an atom (get it?). Either the entire thing happens, or none of it does. No half-baked in-between states.

Back to our money transfer:

BEGIN TRANSACTION;
  UPDATE accounts SET balance = balance - 500 WHERE account_id = 'savings';
  UPDATE accounts SET balance = balance + 500 WHERE account_id = 'checking';
COMMIT;

If the power dies after the first UPDATE, atomicity guarantees that when the system comes back up, it's like that first UPDATE never happened. Your savings account still has the $500. The transaction either completes fully (both updates) or rolls back completely (neither update).
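
Here's a tiny sketch of the same idea using Python's built-in sqlite3 (the account rows are made up, and the "crash" is simulated with an exception):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("savings", 1000), ("checking", 0)])
conn.commit()

try:
    with conn:  # opens a transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 500 "
                     "WHERE account_id = 'savings'")
        raise RuntimeError("power outage!")  # simulate a crash mid-transaction
except RuntimeError:
    pass

# The partial UPDATE was rolled back: savings still has the full $1000
print(conn.execute("SELECT balance FROM accounts "
                   "WHERE account_id = 'savings'").fetchone())  # (1000,)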

💻
Real World Analogy

Ordering a pizza. Either you get the pizza AND they charge your card, or neither happens. You can't end up with "they charged me but I got no pizza" or "I got pizza but they forgot to charge me." Well, okay, in real life that sometimes happens. But in ACID databases? Never.

⚠️
Common Confusion

Atomicity doesn't mean fast or instant. It means indivisible. A transaction can take 10 seconds, but it's still atomic - either all 10 seconds of work commits, or none of it does.

C is for Consistency: Follow the Rules or Get Out

Consistency means your database moves from one valid state to another valid state. All your rules - constraints, triggers, cascades, foreign keys - must be satisfied before and after every transaction.

Let's say you have a rule: "Account balance cannot be negative." Consistency guarantees that no transaction can violate this, even temporarily during execution.

-- This has a constraint: balance >= 0
UPDATE accounts SET balance = balance - 1000 WHERE account_id = 'savings';

If your savings only has $500, this transaction will be rejected. The database won't let you break the rules, even for a nanosecond.
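
A quick sketch of that rule in SQLite (via Python), with a CHECK constraint standing in for "balance cannot be negative":

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts ("
             "account_id TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")  # the business rule
conn.execute("INSERT INTO accounts VALUES ('savings', 500)")

try:
    conn.execute("UPDATE accounts SET balance = balance - 1000 "
                 "WHERE account_id = 'savings'")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the database refuses to break the rule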

ℹ️
The Big Confusion

Remember: ACID consistency is about business rules and constraints within your database. CAP consistency (from the previous article) is about all servers in a distributed system agreeing on the same value. Same word, completely different meanings. Because computer science loves confusing us.

I is for Isolation: Mind Your Own Business

Isolation means concurrent transactions don't step on each other's toes. When multiple transactions run at the same time, they should behave as if they're running one after another, in some order.

Imagine two people trying to book the last seat on a flight at the exact same moment:

Transaction 1: Check if seats available → Yes → Book seat
Transaction 2: Check if seats available → Yes → Book seat

Without isolation, both might see "seats available" and both book the same seat. Chaos! Isolation prevents this by making sure transactions don't see each other's half-finished work.

📝
The Isolation Plot Twist

Isolation actually has different levels (Read Uncommitted, Read Committed, Repeatable Read, Serializable). Stronger isolation = safer but slower. Weaker isolation = faster but riskier. Most databases default to something in the middle because perfect isolation is expensive.

The Classic Problem: Dirty Reads, Phantom Reads, and Other Horror Stories

Without proper isolation, you get gems like:

Dirty Read: You read data that another transaction hasn't committed yet. They roll back, and you read data that never actually existed. Spooky!

Non-Repeatable Read: You read a value, someone else changes it, you read it again in the same transaction and get a different answer. Identity crisis for data!

Phantom Read: You run a query that returns 5 rows. Run it again in the same transaction, now there are 6 rows because someone inserted data. Where did that 6th row come from? It's a phantom!

💻
Example: The Double-Booking Nightmare

Two users book the same hotel room because both checked availability before either transaction committed. Isolation levels (like Serializable) prevent this by locking the relevant rows or using techniques like MVCC (Multi-Version Concurrency Control).

D is for Durability: Once Committed, Forever Committed

Durability means once a transaction is committed, it's permanent. Even if the server explodes, catches fire, and falls into the ocean immediately after, your committed data is safe.

How? Write-Ahead Logging (WAL), journaling, replication - databases use all kinds of tricks to write data to disk before saying "yep, it's committed!"

COMMIT; -- At this moment, the database promises your data is SAFE
-- Server can crash now, data is still there when it comes back up
💡
Behind the Scenes

When you COMMIT, the database doesn't just trust RAM. It writes to persistent storage (disk, SSD) and often waits for the OS to confirm the write completed. This is why commits can feel slow - durability isn't free, but it's worth every millisecond when disaster strikes.

When ACID Matters (Hint: More Than You Think)

Absolutely need ACID:

  • Banking and financial systems - money doesn't just disappear or duplicate
  • E-commerce - orders, payments, inventory must be consistent
  • Medical records - patient data integrity is literally life-or-death
  • Booking systems - double-booking is unacceptable
  • Anything involving legal compliance or auditing

Maybe can relax ACID:

  • Analytics dashboards - approximate counts are fine
  • Social media likes - if a like gets lost in the noise, who cares?
  • Caching layers - stale cache is better than no cache
  • Logging systems - losing 0.01% of logs during a crash might be acceptable
🚫
The "We Don't Need ACID" Famous Last Words

"Our app is simple, we don't need all that ACID overhead!" - said every developer before they had to explain to their CEO why customer orders disappeared. Don't be that developer.

The Trade-off: ACID vs Performance

Here's the uncomfortable truth: ACID guarantees aren't free. They cost performance.

Ensuring atomicity? Needs transaction logs.
Enforcing consistency? Needs constraint checking.
Providing isolation? Needs locking or MVCC overhead.
Guaranteeing durability? Needs disk writes and fsyncs.

This is why NoSQL databases got popular in the early 2010s. They said "what if we... just didn't do all that?" and suddenly you could handle millions of writes per second. Of course, you also had data corruption, lost writes, and race conditions, but hey, it was fast!

🔬
Historical Fun Fact

MongoDB famously had a "durability" setting that was OFF by default for years. Your data wasn't actually safe after a commit unless you explicitly turned on write concerns. They fixed this eventually, but not before countless developers learned about durability the hard way.

Modern Databases: Having Your Cake and Eating It Too

The plot twist? Modern databases are getting really good at ACID without sacrificing too much performance:

  • PostgreSQL uses MVCC (Multi-Version Concurrency Control) for high-performance isolation
  • CockroachDB gives you ACID and horizontal scaling
  • Google Spanner provides global ACID transactions across datacenters

The "NoSQL vs SQL" war has settled into "use the right tool for the job, and maybe that tool is a NewSQL database that gives you both."

💡
Pro Tip

Don't sacrifice ACID unless you have a specific, measured performance problem. Premature optimization killed more projects than slow databases ever did. Start with ACID, relax it only when you must.

TL;DR

ACID is your database's promise that your data is safe and correct:

  • Atomicity: All or nothing - no half-done transactions
  • Consistency: Rules are never broken - constraints always hold
  • Isolation: Transactions don't interfere with each other
  • Durability: Committed means forever - even through disasters

Yes, it costs performance. No, you probably shouldn't skip it unless you really, REALLY know what you're doing and have a very good reason.

Your future self (and your CEO) will thank you when the server crashes and your data is still intact.

Database Management System Architecture [NOT EXAM]

So you've got data. Lots of it. And you need to store it, query it, update it, and make sure it doesn't explode when a thousand users hit it simultaneously. Enter the DBMS - the unsung hero working behind the scenes while you're busy writing SELECT * FROM users.

But what actually happens when you fire off that query? What's going on in the engine room? Let's pop the hood and see how these beautiful machines work.

The Big Picture: Layers Upon Layers

A DBMS is like an onion - layers upon layers, and sometimes it makes you cry when you dig too deep. But unlike an onion, each layer has a specific job and they all work together in harmony (most of the time).

Think of it as a restaurant:

  • Query Interface: The waiter taking your order
  • Query Processor: The chef figuring out how to make your dish
  • Storage Manager: The kitchen staff actually cooking and storing ingredients
  • Transaction Manager: The manager making sure orders don't get mixed up
  • Disk Storage: The pantry and freezer where everything lives

Let's break down each component and see what it actually does.

1. Query Interface: "Hello, How Can I Help You?"

This is where you interact with the database. It's the friendly face (or command line) that accepts your SQL queries, API calls, or whatever language your DBMS speaks.

Components:

  • SQL Parser: Takes your SQL string and turns it into something the computer understands
  • DDL Compiler: Handles schema definitions (CREATE TABLE, ALTER TABLE)
  • DML Compiler: Handles data manipulation (SELECT, INSERT, UPDATE, DELETE)
SELECT * FROM users WHERE age > 18;

The parser looks at this and thinks: "Okay, they want data. From the 'users' table. With a condition. Got it." Then it passes this understanding down the chain.

ℹ️
Fun Fact

When you write terrible SQL with syntax errors, this is where it gets caught. The parser is that friend who tells you "that's not how you spell SELECT" before you embarrass yourself further.

2. Query Processor: The Brain of the Operation

This is where the magic happens. Your query might say "give me all users over 18," but HOW should the database do that? Scan every single row? Use an index? Check the age column first or last? The query processor figures all this out.

Key Components:

Query Optimizer

The optimizer is basically an AI that's been doing its job since the 1970s. It looks at your query and generates multiple execution plans, then picks the best one based on statistics about your data.

SELECT u.name, o.total 
FROM users u 
JOIN orders o ON u.id = o.user_id 
WHERE u.country = 'Italy';

The optimizer thinks: "Should I find Italian users first, then join orders? Or scan orders first? How many Italian users are there? Is there an index on country? On user_id?" It runs the math and picks the fastest path.

💻
Real World Example

This is why adding an index can make queries 1000x faster. The optimizer sees the index and thinks "oh perfect, I can use that instead of scanning millions of rows!" Same query, completely different execution plan.

Query Execution Engine

Once the optimizer picks a plan, the execution engine actually runs it. It's the worker bee that fetches data, applies filters, joins tables, and assembles your result set.

💡
Pro Tip

Most databases let you see the query plan with EXPLAIN or EXPLAIN ANALYZE. If your query is slow, this is your first stop. The optimizer shows you exactly what it's doing, and often you'll spot the problem immediately - like a missing index or an accidental full table scan.
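
As a small illustration, SQLite exposes the same idea through EXPLAIN QUERY PLAN. Here's a sketch using Python's sqlite3 (the exact output format varies by SQLite version):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER)")

# No index on age yet: expect a full table scan
print(conn.execute("EXPLAIN QUERY PLAN "
                   "SELECT * FROM users WHERE age > 18").fetchall())

conn.execute("CREATE INDEX idx_users_age ON users(age)")

# Same query: the planner should now pick the index instead
print(conn.execute("EXPLAIN QUERY PLAN "
                   "SELECT * FROM users WHERE age > 18").fetchall())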

3. Transaction Manager: Keeping the Peace

Remember ACID? This is where it happens. The transaction manager makes sure multiple users can work with the database simultaneously without chaos erupting.

Key Responsibilities:

Concurrency Control

Prevents the classic problems: two people trying to buy the last concert ticket, or withdrawing money from the same account simultaneously. Uses techniques like:

  • Locking: "Sorry, someone else is using this row right now, wait your turn"
  • MVCC (Multi-Version Concurrency Control): "Here's your own snapshot of the data, everyone gets their own version"
  • Timestamp Ordering: "We'll execute transactions in timestamp order, nice and orderly"

Recovery Manager

When things go wrong (power outage, crash, cosmic ray), this component brings the database back to a consistent state. It uses:

  • Write-Ahead Logging (WAL): Write to the log before writing to the database, so you can replay or undo operations
  • Checkpoints: Periodic snapshots so recovery doesn't have to replay the entire history since the Big Bang
  • Rollback: Undo incomplete transactions
  • Roll-forward: Redo committed transactions that didn't make it to disk
⚠️
Why Commits Feel Slow

When you COMMIT, the database doesn't just write to memory and call it a day. It writes to the WAL, flushes to disk, and waits for confirmation. This is why durability costs performance - but it's also why your data survives disasters.

4. Storage Manager: Where Bytes Live

This layer manages the actual storage of data on disk (or SSD, or whatever physical medium you're using). It's the bridge between "logical" concepts like tables and rows, and "physical" reality like disk blocks and file pointers.

Components:

Buffer Manager

RAM is fast, disk is slow. The buffer manager keeps frequently accessed data in memory (the buffer pool) so queries don't have to hit disk constantly.

It's like keeping your favorite snacks on the counter instead of going to the store every time you're hungry.

When memory fills up, it uses replacement policies (LRU - Least Recently Used is popular) to decide what to kick out.
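
Here's a toy LRU buffer pool in Python, just to make the idea concrete (real buffer managers also track dirty pages, pins, and much more):

from collections import OrderedDict

class BufferPool:
    """Toy LRU buffer pool: keeps at most `capacity` pages in memory."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> data, least recent first

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)        # hit: mark recently used
            return self.pages[page_id]
        data = f"<page {page_id} read from disk>"  # miss: pretend disk I/O
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)         # evict least recently used
        return data

pool = BufferPool(capacity=2)
pool.get(1); pool.get(2); pool.get(1)
pool.get(3)              # evicts page 2, the least recently used
print(list(pool.pages))  # [1, 3]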

File Manager

Manages the actual files on disk. Tables aren't stored as neat CSV files - they're stored in complex structures optimized for different access patterns:

  • Heap Files: Unordered collection of records, good for full table scans
  • Sorted Files: Records sorted by some key, good for range queries
  • Hash Files: Records distributed by hash function, good for exact-match lookups
  • Clustered Files: Related records stored together, good for joins

Index Manager

Manages indexes - the phone book of your database. Instead of scanning every row to find what you want, indexes let you jump straight to the relevant data.

Common index types:

  • B-Tree / B+Tree: Sorted tree structure, handles ranges beautifully
  • Hash Index: Lightning fast for exact matches, useless for ranges
  • Bitmap Index: Great for columns with few distinct values (like gender, status)
  • Full-Text Index: Specialized for text search
💻
Example: Why Indexes Matter

Finding a user by ID without an index: scan 10 million rows, takes seconds.
Finding a user by ID with a B-tree index: traverse a tree with height ~4, takes milliseconds.
Same query, 1000x speed difference. Indexes are your friend!

5. The Disk Storage Layer: Ground Zero

At the bottom of it all, your data lives on physical storage. This layer deals with the gritty details:

  • Blocks/Pages: Data is stored in fixed-size chunks (usually 4KB-16KB)
  • Slotted Pages: How records fit inside blocks
  • Free Space Management: Tracking which blocks have room for new data
  • Data Compression: Squeezing more data into less space

Modern databases are incredibly clever here. They use techniques like:

  • Column-oriented storage: Store columns separately for analytics workloads
  • Compression: Save disk space and I/O bandwidth
  • Partitioning: Split huge tables across multiple physical locations
📝
The Performance Hierarchy

- CPU Cache: ~1 nanosecond
- RAM: ~100 nanoseconds
- SSD: ~100 microseconds (1000x slower than RAM!)
- HDD: ~10 milliseconds (100,000x slower than RAM!)

This is why the buffer manager is so critical. Every disk access avoided is a massive win.

Architectural Patterns: Different Strokes for Different Folks

Not all DBMS architectures are the same. They evolved to solve different problems.

Centralized Architecture

Traditional, single-server setup. Everything lives on one machine.

Pros: Simple, full ACID guarantees, no network latency between components
Cons: Limited by one machine's resources, single point of failure

Example: PostgreSQL or MySQL on a single server

Client-Server Architecture

Clients connect to a central database server. Most common pattern today.

Pros: Centralized control, easier security, clients can be lightweight
Cons: Server can become a bottleneck

Example: Your web app connecting to a PostgreSQL server

Distributed Architecture

Data spread across multiple nodes, often in different locations.

Pros: Massive scalability, fault tolerance, can survive node failures
Cons: Complex, CAP theorem strikes, eventual consistency headaches

Example: Cassandra, MongoDB sharded clusters, CockroachDB

Parallel Architecture

Multiple processors/cores working on the same query simultaneously.

Types:

  • Shared Memory: All processors share RAM and disk (symmetric multiprocessing)
  • Shared Disk: Processors have their own memory but share disks
  • Shared Nothing: Each processor has its own memory and disk (most scalable)

Example: Modern PostgreSQL can parallelize queries across cores

ℹ️
The Evolution

We went from centralized mainframes (1970s) → client-server (1990s) → distributed NoSQL (2000s) → distributed NewSQL (2010s). Each era solved the previous era's limitations while introducing new challenges.

Modern Twists: Cloud and Serverless

The cloud changed the game. Now we have:

Database-as-a-Service (DBaaS): Amazon RDS, Google Cloud SQL - you get a managed database without worrying about the infrastructure.

Serverless Databases: Aurora Serverless, Cosmos DB - database scales automatically, you pay per query.

Separation of Storage and Compute: Modern architectures split storage (S3, object storage) from compute (query engines). Scale them independently!

💡
The Big Idea

Traditional databases bundle everything together. Modern cloud databases separate concerns: storage is cheap and infinite (S3), compute is expensive and scales (EC2). Why pay for compute when you're not querying? This is the serverless revolution.

Putting It All Together: A Query's Journey

Let's trace what happens when you run a query:

SELECT name, email FROM users WHERE age > 25 ORDER BY name LIMIT 10;
  1. Query Interface: Parses the SQL, validates syntax
  2. Query Processor: Optimizer creates execution plan ("use age index, sort results, take first 10")
  3. Transaction Manager: Assigns a transaction ID, determines isolation level
  4. Storage Manager:
    • Buffer manager checks if needed data is in memory
    • If not, file manager reads from disk
    • Index manager uses age index to find matching rows
  5. Execution Engine: Applies filter, sorts, limits results
  6. Transaction Manager: Commits transaction, releases locks
  7. Query Interface: Returns results to your application

All this happens in milliseconds. Databases are incredibly sophisticated machines!

✅
Mind Blown Yet?

Next time your query returns in 50ms, take a moment to appreciate the decades of computer science and engineering that made it possible. From parsing to optimization to disk I/O to lock management - it's a symphony of coordinated components.

TL;DR

A DBMS is a complex system with multiple layers:

  • Query Interface: Takes your SQL and validates it
  • Query Processor: Figures out the best way to execute your query
  • Transaction Manager: Ensures ACID properties and handles concurrency
  • Storage Manager: Manages buffer pool, files, and indexes
  • Disk Storage: Where your data actually lives

Different architectures (centralized, distributed, parallel) trade off simplicity vs scalability vs consistency.

Modern databases are moving toward cloud-native, separation of storage and compute, and serverless models.

The next time you write SELECT *, remember: there's a whole orchestra playing in the background to make that query work.

Concurrency Control Theory [NOT EXAM]

Remember our ACID article? We talked about how databases promise to keep your data safe and correct. But there's a problem we glossed over: what happens when multiple transactions run at the same time?

Spoiler alert: chaos. Beautiful, fascinating, wallet-draining chaos.

The $25 That Vanished Into Thin Air

Let's start with a horror story. You've got $100 in your bank account. You try to pay for something that costs $25. Simple, right?

Read Balance: $100
Check if $100 > $25? โœ“
Pay $25
New Balance: $75
Write Balance: $75

Works perfectly! Until the power goes out right after you read the balance but before you write it back. Now what? Did the payment go through? Is your money gone? This is where Atomicity saves you - either the entire transaction happens or none of it does.

But here's an even scarier scenario: What if TWO payments of $25 try to execute at the exact same time?

Transaction 1: Read Balance ($100) → Check funds → Pay $25
Transaction 2: Read Balance ($100) → Check funds → Pay $25
Transaction 1: Write Balance ($75)
Transaction 2: Write Balance ($75)

Both transactions read $100, both think they have enough money, both pay $25... and your final balance is $75 instead of $50. You just got a free $25! (Your bank is not happy.)
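
You can reproduce this exact bug in a few lines of Python - a toy sketch with plain threads instead of a database (the sleep just widens the race window so both "transactions" read the same stale balance):

import threading
import time

balance = 100

def pay_25():
    global balance
    read = balance           # 1. read the balance
    time.sleep(0.01)         # widen the race window: both threads read $100
    if read >= 25:           # 2. check funds
        balance = read - 25  # 3. write back a now-stale computation

t1 = threading.Thread(target=pay_25)
t2 = threading.Thread(target=pay_25)
t1.start(); t2.start()
t1.join(); t2.join()

print(balance)  # 75, not 50: one of the payments was silently lost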

This is the nightmare that keeps database architects awake at night. And it's exactly why concurrency control exists.

🚫
The Real World Impact

These aren't theoretical problems. In 2012, Knight Capital lost $440 million in 45 minutes due to a race condition in their trading system. Concurrent transactions matter!

The Strawman Solution: Just Don't

The simplest solution? Don't allow concurrency at all. Execute one transaction at a time, in order, like a polite British queue.

Transaction 1 → Complete → Transaction 2 → Complete → Transaction 3 → ...

Before each transaction starts, copy the entire database to a new file. If it succeeds, overwrite the original. If it fails, delete the copy. Done!

This actually works! It's perfectly correct! It also has the performance of a potato.

Why? Because while one transaction is waiting for a slow disk read, every other transaction in the world is just... waiting. Doing nothing. Your expensive multi-core server is running one thing at a time like it's 1975.

We can do better.

The Goal: Having Your Cake and Eating It Too

What we actually want:

  • Better utilization: Use all those CPU cores! Don't let them sit idle!
  • Better response times: When one transaction waits for I/O, let another one run
  • Correctness: Don't lose money or corrupt data
  • Fairness: Don't let one transaction starve forever

The challenge is allowing transactions to interleave their operations while still maintaining the illusion that they ran one at a time.

📖
Key Concept: Serializability

A schedule (interleaving of operations) is serializable if its result is equivalent to *some* serial execution of the transactions. We don't care which order, just that there exists *some* valid serial order that produces the same result.

The DBMS View: It's All About Reads and Writes

The database doesn't understand your application logic. It doesn't know you're transferring money or booking hotel rooms. All it sees is:

Transaction T1: R(A), W(A), R(B), W(B)
Transaction T2: R(A), W(A), R(B), W(B)

Where R = Read and W = Write. That's it. The DBMS's job is to interleave these operations in a way that doesn't break correctness.

The Classic Example: Interest vs Transfer

You've got two accounts, A and B, each with $1000. Two transactions run:

T1: Transfer $100 from A to B

A = A - 100  // A becomes $900
B = B + 100  // B becomes $1100

T2: Add 6% interest to both accounts

A = A * 1.06
B = B * 1.06

What should the final balance be? Well, A + B should equal $2120 (the original $2000 plus 6% interest).

Serial Execution: The Safe Path

If T1 runs completely before T2:

A = 1000 - 100 = 900
B = 1000 + 100 = 1100
Then apply interest:
A = 900 * 1.06 = 954
B = 1100 * 1.06 = 1166
Total: $2120 ✓

If T2 runs completely before T1:

A = 1000 * 1.06 = 1060
B = 1000 * 1.06 = 1060
Then transfer:
A = 1060 - 100 = 960
B = 1060 + 100 = 1160
Total: $2120 ✓

Both valid! Different final states, but both correct because A + B = $2120.

Good Interleaving: Still Correct

T1: A = A - 100  (A = 900)
T1: B = B + 100  (B = 1100)
T2: A = A * 1.06 (A = 954)
T2: B = B * 1.06 (B = 1166)
Total: $2120 ✓

This interleaving is equivalent to running T1 then T2 serially. We're good!

Bad Interleaving: Money Disappears

T1: A = A - 100  (A = 900)
T2: A = A * 1.06 (A = 954)  ← interest applied to A after the withdrawal
T2: B = B * 1.06 (B = 1060) ← interest applied to B before the deposit!
T1: B = B + 100  (B = 1160)
Total: $2114 ✗

We lost $6! This schedule is NOT equivalent to any serial execution. It's incorrect.

⚠️
The Problem

T2 applied interest to A after T1's withdrawal, but to B before T1's deposit. Each transaction sees a mix of old and new values, so the transferred $100 never earned interest - and no serial order of T1 and T2 produces that result.
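
A quick way to convince yourself: replay both interleavings in plain Python (my own check; the numbers mirror the example above):

# Good interleaving: equivalent to running T1 fully before T2
A = B = 1000
A = A - 100; B = B + 100      # T1: move $100 from A to B
A = A * 1.06; B = B * 1.06    # T2: 6% interest on both accounts
print(round(A + B, 2))        # 2120.0 - correct

# Bad interleaving: interest hits A after the withdrawal, B before the deposit
A = B = 1000
A = A - 100                   # T1, first half
A = A * 1.06; B = B * 1.06    # T2, all of it
B = B + 100                   # T1, second half
print(round(A + B, 2))        # 2114.0 - the moved $100 earned no interest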

Conflicting Operations: The Root of All Evil

When do operations actually conflict? When they can cause problems if interleaved incorrectly?

Two operations conflict if:

  1. They're from different transactions
  2. They're on the same object (same data item)
  3. At least one is a write

This gives us three types of conflicts:

Read-Write Conflicts: The Unrepeatable Read

T1: R(A) → sees $10
T2: W(A) → writes $19
T1: R(A) → sees $19

T1 reads A twice in the same transaction and gets different values! The data changed underneath it. This is called an unrepeatable read.

Write-Read Conflicts: The Dirty Read

T1: W(A) → writes $12 (not committed yet)
T2: R(A) → reads $12
T2: W(A) → writes $14 (based on dirty data)
T2: COMMIT
T1: ROLLBACK ← Oh no!

T2 read data that T1 wrote but never committed. That data never "really existed" because T1 rolled back. T2 made decisions based on a lie. This is a dirty read.

💻
Real World Example

You're booking the last seat on a flight. The reservation system reads "1 seat available" from a transaction that's updating inventory but hasn't committed. You book the seat. That transaction rolls back. Turns out there were actually 0 seats. Now you're stuck at the airport arguing with gate agents.

Write-Write Conflicts: The Lost Update

T1: W(A) → writes "Bob"
T2: W(A) → writes "Alice"

T2's write overwrites T1's write. If T1 hasn't committed yet, its update is lost. This is the lost update problem.

Conflict Serializability: The Practical Standard

Now we can formally define what makes a schedule acceptable. A schedule is conflict serializable if we can transform it into a serial schedule by swapping non-conflicting operations.

The Dependency Graph Trick

Here's a clever way to check if a schedule is conflict serializable:

  1. Draw one node for each transaction
  2. Draw an edge from Ti to Tj if Ti has an operation that conflicts with an operation in Tj, and Ti's operation comes first
  3. If the graph has a cycle, the schedule is NOT conflict serializable

Example: The Bad Schedule

T1: R(A), W(A), R(B), W(B)
T2: R(A), W(A), R(B), W(B)

With interleaving:

T1: R(A), W(A)
T2: R(A), W(A)
T2: R(B), W(B)
T1: R(B), W(B)

Dependency graph:

T1 → T2  (T1 writes A, T2 reads A - T1 must come first)
T2 → T1  (T2 writes B, T1 reads B - T2 must come first)

There's a cycle! T1 needs to come before T2 AND T2 needs to come before T1. Impossible! This schedule is not conflict serializable.

💡
Why This Matters

The dependency graph gives us a mechanical way to check serializability. If there's no cycle, we can find a valid serial order by doing a topological sort of the graph. This is how the DBMS reasons about schedules!
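
Here is a minimal Python sketch of that mechanical check, assuming a schedule is just a list of (transaction, operation, object) triples; the function name is mine, not a DBMS API. It builds the conflict edges, then peels off nodes with no incoming edges (Kahn's topological sort); anything left over sits on a cycle.

def conflict_serializable(schedule):
    # Edge Ti -> Tj when an earlier op of Ti conflicts with a later op of Tj:
    # different transactions, same object, at least one write.
    edges = set()
    for i, (ti, op1, obj1) in enumerate(schedule):
        for tj, op2, obj2 in schedule[i + 1:]:
            if ti != tj and obj1 == obj2 and "W" in (op1, op2):
                edges.add((ti, tj))
    nodes = {t for t, _, _ in schedule}
    while nodes:
        roots = {n for n in nodes
                 if not any(dst == n for src, dst in edges if src in nodes)}
        if not roots:
            return False      # remaining nodes all sit on a cycle
        nodes -= roots
    return True

bad = [("T1", "R", "A"), ("T1", "W", "A"),
       ("T2", "R", "A"), ("T2", "W", "A"),
       ("T2", "R", "B"), ("T2", "W", "B"),
       ("T1", "R", "B"), ("T1", "W", "B")]
print(conflict_serializable(bad))   # False -- the T1 <-> T2 cycle above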

View Serializability: The Broader Definition

Conflict serializability is practical, but it's also conservative - it rejects some schedules that are actually correct.

View serializability is more permissive. Two schedules are view equivalent if:

  1. If T1 reads the initial value of A in one schedule, it reads the initial value in the other
  2. If T1 reads a value of A written by T2 in one schedule, it does so in the other
  3. If T1 writes the final value of A in one schedule, it does so in the other

Consider this schedule (T2's blind write lands between T1's read and write, and T3 writes last):

T1: R(A)           W(A)
T2:         W(A)
T3:                        W(A)

The dependency graph has cycles (it's not conflict serializable), but it's view serializable! Why? Because T3 writes the final value of A in both the interleaved schedule and the serial schedule T1→T2→T3. The intermediate writes by T1 and T2 don't matter - they're overwritten anyway.

This is called a blind write - writing a value without reading it first.

ℹ️
Why Don't Databases Use View Serializability?

Checking view serializability is NP-Complete. It's computationally expensive and impractical for real-time transaction processing. Conflict serializability is polynomial time and good enough for 99.9% of cases.

The Universe of Schedules

┌─────────────────────────────────────┐
│      All Possible Schedules         │
│  ┌───────────────────────────────┐  │
│  │   View Serializable           │  │
│  │  ┌─────────────────────────┐  │  │
│  │  │ Conflict Serializable   │  │  │
│  │  │  ┌───────────────────┐  │  │  │
│  │  │  │  Serial Schedules │  │  │  │
│  │  │  └───────────────────┘  │  │  │
│  │  └─────────────────────────┘  │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

Most databases enforce conflict serializability because:

  • It's efficient to check
  • It covers the vast majority of practical cases
  • It can be enforced with locks, timestamps, or optimistic methods

How Do We Actually Enforce This?

We've talked about what serializability means, but not how to enforce it. That's the job of concurrency control protocols, which come in two flavors:

Pessimistic: Assume conflicts will happen, prevent them proactively

  • Two-Phase Locking (2PL) - most common
  • Timestamp Ordering
  • "Don't let problems arise in the first place"

Optimistic: Assume conflicts are rare, deal with them when detected

  • Optimistic Concurrency Control (OCC)
  • Multi-Version Concurrency Control (MVCC)
  • "Let transactions run freely, check for conflicts at commit time"

We'll dive deep into these in the next article, but the key insight is that all of them are trying to ensure the schedules they produce are serializable.

📝
Important Distinction

This article is about checking whether schedules are correct. The next article is about generating correct schedules in the first place. The theory tells us what's correct; the protocols tell us how to achieve it.

The NoSQL Backlash (That's Now Backtracking)

Around 2010, the NoSQL movement said "transactions are slow, ACID is overkill, eventual consistency is fine!" Systems like early MongoDB and Cassandra threw out strict serializability for performance.

And you know what? They were fast! They could handle millions of writes per second!

They also had data corruption, lost writes, and developers pulling their hair out debugging race conditions.

The pendulum has swung back. Modern databases (NewSQL, distributed SQL) are proving you can have both performance AND correctness. Turns out the computer scientists in the 1970s knew what they were doing.

🔬
Historical Note

The theory of serializability was developed in the 1970s-1980s by pioneers like Jim Gray, Phil Bernstein, and Christos Papadimitriou. It's stood the test of time because it's based on fundamental principles, not implementation details.

TL;DR

The Problem: Multiple concurrent transactions can interfere with each other, causing lost updates, dirty reads, and inconsistent data.

The Solution: Ensure all schedules are serializable - equivalent to some serial execution.

Key Concepts:

  • Conflicting operations: Two operations on the same object from different transactions, at least one is a write
  • Conflict serializability: Can transform the schedule into a serial one by swapping non-conflicting operations (check with dependency graphs)
  • View serializability: Broader definition, but too expensive to enforce in practice

Types of Conflicts:

  • Read-Write: Unrepeatable reads
  • Write-Read: Dirty reads
  • Write-Write: Lost updates

Next Time: We'll learn about Two-Phase Locking, MVCC, and how databases actually enforce serializability in practice. The theory is beautiful; the implementation is where the magic happens! 🔒

Lec1

Lec2 Notes V2

Lec3 Notes V2

Lec4 Notes V2

Lec5 Notes V2

Lec6 Notes V2

Proteomics Intro

Proteomics is the large-scale study of proteomes

Proteome = ALL the proteins present in a cell, tissue, or organism at a specific time

Why is Proteomics More Complicated Than Genomics?

Because the proteome is CONSTANTLY CHANGING:

  1. Different proteins in brain vs. liver vs. skin
  2. Different proteins when you're a baby vs. adult
  3. Different proteins when you exercise, eat, sleep, or get sick

Why "Same Genes = Same Proteins" is WRONG?

Step 1: Not all genes are active

You have: ~20,000 genes total
Each cell uses: only ~11,000 genes
This determines: what type of cell it is (brain, muscle, skin, etc.)

Step 2: Things get MORE complex because of:

  1. Splicing variants

One gene → can be "edited" into different versions

  2. Post-translational modifications (PTMs)

Proteins get chemically modified AFTER they're made. Like buying a plain t-shirt, then adding patches, cutting it, or dyeing it.

  3. Protein-protein interactions (PPIs)

Proteins work in teams, not alone. Different combinations = different functions.

  4. Subcellular localization

WHERE the protein is located matters. The same protein in the nucleus vs. the membrane = a different job.

Levels Of Protein

Primary: The sequence of amino acids

Secondary:

  1. α-helix (alpha helix)
  2. β-sheet (beta sheet)

Tertiary Structure: The overall 3D shape of the ENTIRE protein chain

Held together by:

  • Hydrogen bonds
  • Ionic bonds
  • Disulfide bridges (strong S-S bonds between cysteines)
  • Hydrophobic interactions

Quaternary Structure: Multiple protein chains coming together


Primary (1°):     ●-●-●-●-●-●-●-●-●
                  (linear chain)

Secondary (2°):   ~~~●~~~  and  ≋≋≋
                  (helix)      (sheet)

Tertiary (3°):    ⬤
                  (one chain folded into 3D shape)

Quaternary (4°):  ⬤⬤
                  ⬤⬤
                  (multiple chains together)

Chaperones help folding but don't determine the final fold.

Chaperones increase during stress; that's why they're called "heat shock proteins".

More on Protein

Top down vs bottom up proteomics

Analytical Chemistry Review

Mass Spectrometry for Visual Learners

Analytical Chemistry Lessons

NMR, Chromatography, Infrared

In Silico Cloning

Theory: Plasmid Design: The Basics

Watch Video Walkthrough: Plasmid Design: The Basics

Watch Another Video: Plasmid Design: The Basics

How to Design Plasmids: Benchling Tutorial

Benchling Tutorial and some discussions about Benchling usage

Antarctic Fish Antifreeze Tutorial

Step by Step Plasmid Design for Antarctic Fish Antifreeze using Benchling

Cells

Watch Video about Cell Organelles

Cell Organelles - Explained in a way that finally makes sense!

Cell Division

Watch Video about Cell Division

Cell Division from MIT

Rules of Inheritance

Watch Video about Rules of Inheritance

Rules of Inheritance from MIT

Applied Genomics

What is Genetics?

Genetics is the connection between phenotype and genotype.

  • Genotype: The gene content - what's written in your DNA
  • Phenotype: What we actually see - the observable traits

Two Approaches to Understanding Genetics

Forward Genetics: Moving from phenotype to genotype
→ "Why do I sneeze in sunlight?" → Find the responsible gene through mapping

Reverse Genetics: Moving from genotype to phenotype
→ "What happens if I break this gene?" → Create mutations and observe the effects

Real Examples of Phenotypes

Examples of how genetics shapes our everyday experiences:

Cilantro Taste: Some people think cilantro tastes like soap. This isn't about preference - it's genetics. Variations in the OR6A2 gene literally change how cilantro tastes to you.

ACHOO Syndrome: Ever sneeze when you look at bright sunlight? That's not random. It's linked to a genetic polymorphism near the ZEB2 gene. (ACHOO stands for Autosomal Dominant Compelling Helio-Ophthalmic Outburst - yes, someone really wanted that acronym to work.)

These examples show that genetic differences create genuinely different experiences of the world, not just different disease risks.

What is a Gene?

This seems like a simple question, but it has multiple valid answers depending on your perspective:

1. DNA Sequence Definition

A gene is simply a stretch of DNA - a sequence of nucleic acids.

2. Functional Definition

A gene corresponds to a phenotype. It's associated with specific traits or conditions (like ACHOO syndrome).

3. Mendelian Definition

A gene is an independently segregating unit in inheritance - the discrete units Mendel discovered with his peas.

4. Genomic Definition

A gene is a specific physical location in the genome. This matters for mapping studies and understanding genomic architecture.

The Structure-Function Connection

DNA's double helix isn't just beautiful - it's functional. The structure provides a mechanism for copying and transmitting genetic information from one generation to the next. Form follows function, and function requires form.


Key Terminology

Let's define the language we'll use throughout this course:

Alleles

Different versions of a gene. Since humans are diploid (two copies of most chromosomes), we have two alleles for most genes. They can be:

  • Homozygous: Both alleles are identical
  • Heterozygous: The two alleles are different

Mutants

An altered version of a gene whose change we have actually observed. Important: we only call something a "mutant" when we witness the mutation for the first time - like seeing a new change in a child that isn't present in either parent.

Genotype

The complete set of alleles in an individual. Your genetic makeup.

Wildtype

A standard reference genome used as a baseline for comparison. Important points:

  • Often highly inbred (identical alleles)
  • Used to identify mutations
  • Does NOT mean "healthy" or "normal"
  • NOT applicable to humans - there is no single "normal" human genotype

Why "Wildtype" Doesn't Work for Humans

There is no universal standard for what is "normal" in human genetics. We have incredible natural variation. What's common in one population might be rare in another. What works well in one environment might be disadvantageous in another.

The idea of a single reference "wildtype" human is both scientifically inaccurate and philosophically problematic. Human genetic diversity is a feature, not a bug.

Pedigrees

Pedigrees are family trees that track the inheritance of traits across generations. They're one of our most powerful tools for understanding genetic inheritance patterns in humans, where we can't do controlled breeding experiments (for obvious ethical reasons).

How These Notes Are Organized

I'm not following a strict linear order because genetics doesn't work linearly. Genes interact. Pathways overlap. Everything connects to everything else.

These notes will grow recursively - starting with foundations, then branching out as connections become clear. Some sections will reference concepts we haven't covered yet. That's fine. Learning genetics is like assembling a puzzle where you can't see the final picture until enough pieces are in place.

My approach:

  1. Start with fundamentals (this page)
  2. Build out core concepts as we cover them in lectures
  3. Connect ideas across topics as patterns emerge
  4. Revisit and refine as understanding deepens

About Course Materials

These notes contain NO copied course materials. Everything here is my personal understanding and recitation of concepts, synthesized from publicly available resources (online courses, YouTube, documentation, textbooks).

This is my academic work - how I've processed and reorganized information from legitimate sources. I take full responsibility for any errors in my understanding.

If you believe any content violates copyright, contact me at mahmoudahmedxyz@gmail.com and I'll remove it immediately.

Resources

The exercises and examples in this material are inspired by several open educational resources released under Creative Commons licenses. Instead of referencing each one separately throughout the notes, here is a list of the main books and sources I used:

  • [Biology 2e © OpenStax] (CC BY-NC-SA 3.0)

All credit goes to the original authors for their openly licensed educational content.

PLINK Genotype File Formats

PLINK is a free, open-source toolset designed for genome-wide association studies (GWAS) and population genetics analysis.

When you're dealing with genotype data from thousands (or millions) of people across hundreds of thousands (or millions) of genetic variants, you face several problems:

  1. File size: Raw genotype data is MASSIVE
  2. Processing speed: Reading and analyzing this data needs to be fast
  3. Standardization: Different labs and companies produce data in different formats
  4. Analysis tools: You need efficient ways to compute allele frequencies, test for associations, filter variants, etc.

PLINK solves these problems by providing:

  • Efficient binary file formats (compact storage)
  • Fast algorithms for common genetic analyses
  • Format conversion tools
  • Quality control utilities

Common PLINK tasks:

  • Analyzing data from genotyping chips (Illumina, Affymetrix)
  • Running genome-wide association studies (GWAS)
  • Computing population genetics statistics
  • Quality control and filtering of genetic variants
  • Converting between different genotype file formats

PLINK Binary Format (.bed/.bim/.fam)

This is PLINK's primary format - a set of three files that work together. It's called "binary" because the main genotype data is stored in a compressed binary format rather than human-readable text.

The .fam File (Family/Sample Information)

The .fam file contains information about each individual (sample) in your study. It has 6 columns with NO header row.

Format:


FamilyID  IndividualID  FatherID  MotherID  Sex  Phenotype

Example .fam file:


FAM001  IND001  0  0  1  2
FAM001  IND002  0  0  2  1
FAM002  IND003  IND004  IND005  1  -9
FAM002  IND004  0  0  1  1
FAM002  IND005  0  0  2  1

Column Breakdown:

Column 1: Family ID

  • Groups individuals into families
  • Can be the same as Individual ID if samples are unrelated
  • Example: FAM001, FAM002

Column 2: Individual ID

  • Unique identifier for each person
  • Must be unique within each family
  • Example: IND001, IND002

Column 3: Paternal ID (Father)

  • Individual ID of the father
  • 0 = father not in dataset (unknown or not genotyped)
  • Used for constructing pedigrees and family-based analyses

Column 4: Maternal ID (Mother)

  • Individual ID of the mother
  • 0 = mother not in dataset
  • Must match an Individual ID if the parent is in the study

Column 5: Sex

  • 1 = Male
  • 2 = Female
  • 0 = Unknown sex
  • Other codes (like -9) are sometimes used for unknown, but 0 is standard

Column 6: Phenotype

  • The trait you're studying (disease status, quantitative trait, etc.)
  • For binary (case-control) traits:
    • 1 = Control (unaffected)
    • 2 = Case (affected)
    • 0 or -9 = Missing phenotype
  • For quantitative traits: Any numeric value
  • -9 = Standard missing value code

Important Notes About Special Codes:

0 (Zero):

  • In Parent columns: Parent not in dataset
  • In Sex column: Unknown sex
  • In Phenotype column: Missing phenotype (though -9 is more common)

-9 (Negative nine):

  • Universal "missing data" code in PLINK
  • Most commonly used for missing phenotype
  • Sometimes used for unknown sex (though 0 is standard)

Why these codes matter:

  • PLINK will skip individuals with missing phenotypes in association tests
  • Parent information is crucial for family-based tests (like TDT)
  • Sex information is needed for X-chromosome analysis
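
As a worked example of these codes, here is a minimal Python sketch that reads a .fam file into dictionaries. The field names and file name are mine; PLINK itself exposes no such parsing API.

SEX = {"1": "male", "2": "female", "0": "unknown"}

def read_fam(path):
    samples = []
    with open(path) as f:
        for line in f:
            fid, iid, father, mother, sex, pheno = line.split()
            samples.append({
                "family": fid,
                "individual": iid,
                "father": None if father == "0" else father,  # 0 = not in dataset
                "mother": None if mother == "0" else mother,
                "sex": SEX.get(sex, "unknown"),
                "phenotype": None if pheno in ("0", "-9") else pheno,  # missing
            })
    return samples

for s in read_fam("study.fam"):
    print(s["individual"], s["sex"], s["phenotype"])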

The .bim File (Variant Information)

The .bim file (binary marker information) describes each genetic variant. It has 6 columns with NO header row.

Format:

Chromosome  VariantID  GeneticDistance  Position  Allele1  Allele2

Example .bim file:

1   rs12345    0    752566    G    A
1   rs67890    0    798959    C    T
2   rs11111    0    1240532   A    G
3   rs22222    0    5820321   T    C
X   rs33333    0    2947392   G    A

Column Breakdown:

Column 1: Chromosome

  • Chromosome number: 1-22 (autosomes)
  • Sex chromosomes: X, Y, XY (pseudoautosomal), MT (mitochondrial)
  • Example: 1, 2, X

Column 2: Variant ID

  • Usually an rsID (reference SNP ID from dbSNP)
  • Format: rs followed by numbers (e.g., rs12345)
  • Can be any unique identifier if rsID isn't available
  • Example: chr1:752566:G:A (chromosome:position:ref:alt format)

Column 3: Genetic Distance

  • Position in centimorgans (cM)
  • Measures recombination distance, not physical distance
  • Often set to 0 if unknown (very common)
  • Used in linkage analysis and some phasing algorithms

Column 4: Base-Pair Position

  • Physical position on the chromosome
  • Measured in base pairs from the start of the chromosome
  • Example: 752566 means 752,566 bases from chromosome start
  • Critical for genome builds: Make sure you know if it's GRCh37 (hg19) or GRCh38 (hg38)!

Column 5: Allele 1

  • First allele (often the reference allele)
  • Single letter: A, C, G, T
  • Can also be I (insertion), D (deletion), or 0 (missing)

Column 6: Allele 2

  • Second allele (often the alternate/effect allele)
  • Same coding as Allele 1

Important Notes:

Allele coding:

  • These alleles define what genotypes mean in the .bed file
  • Genotype AA means homozygous for Allele1
  • Genotype AB means heterozygous
  • Genotype BB means homozygous for Allele2

Strand issues:

  • Alleles should be on the forward strand
  • Mixing strands between datasets causes major problems in meta-analysis
  • Always check strand alignment when combining datasets!

The .bed File (Binary Genotype Data)

The .bed file contains the actual genotype calls in compressed binary format. This file is NOT human-readable - you can't open it in a text editor and make sense of it.

Key characteristics:

Why binary?

  • Space efficiency: A text file with millions of genotypes is huge; binary format compresses this dramatically
  • Speed: Computer can read binary data much faster than parsing text
  • Example: A dataset with 1 million SNPs and 10,000 people:
    • Text format (.ped): ~30 GB
    • Binary format (.bed): ~2.4 GB

What's stored:

  • Genotype calls for every individual at every variant
  • Each genotype is encoded efficiently (2 bits per genotype)
  • Encoding:
    • 00 = Homozygous for allele 1 (AA)
    • 01 = Missing genotype
    • 10 = Heterozygous (AB)
    • 11 = Homozygous for allele 2 (BB)

SNP-major vs. individual-major:

  • PLINK binary files are stored in SNP-major mode by default
  • This means genotypes are organized by variant (all individuals for SNP1, then all individuals for SNP2, etc.)
  • More efficient for most analyses (which process one SNP at a time)

You never edit .bed files manually - always use PLINK commands to modify or convert them.
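
That said, the 2-bit encoding above is simple enough to decode by hand for learning purposes. Here is a minimal Python sketch of reading a SNP-major .bed file directly (real analyses should go through PLINK or a library; the file name is hypothetical):

CODES = {0b00: "AA (hom allele 1)", 0b01: "missing",
         0b10: "AB (het)",          0b11: "BB (hom allele 2)"}

def decode_bed(path, n_samples):
    with open(path, "rb") as f:
        assert f.read(3) == b"\x6c\x1b\x01", "not a SNP-major PLINK .bed"
        bytes_per_snp = (n_samples + 3) // 4     # 4 genotypes per byte
        while block := f.read(bytes_per_snp):
            genos = []
            for byte in block:
                for shift in (0, 2, 4, 6):       # first sample in low-order bits
                    genos.append(CODES[(byte >> shift) & 0b11])
            yield genos[:n_samples]              # drop padding genotypes

for i, genotypes in enumerate(decode_bed("data.bed", n_samples=5)):
    print(f"SNP {i}: {genotypes}")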


PLINK Text Format (.ped/.map)

This is the original PLINK format. It's human-readable but much larger and slower than binary format. Mostly used for small datasets or when you need to manually inspect/edit data.

The .map File (Variant Map)

Similar to .bim but with only 4 columns.

Format:

Chromosome  VariantID  GeneticDistance  Position

Example .map file:

1   rs12345    0    752566
1   rs67890    0    798959
2   rs11111    0    1240532
3   rs22222    0    5820321

Notice: NO allele information in .map files (unlike .bim files).


The .ped File (Pedigree + Genotypes)

Contains both sample information AND genotype data in one large text file.

Format:

FamilyID  IndividualID  FatherID  MotherID  Sex  Phenotype  [Genotypes...]

The first 6 columns are identical to the .fam file. After that, genotypes are listed as pairs of alleles (one pair per SNP).

Example .ped file:

FAM001  IND001  0  0  1  2  G G  C T  A G  T T
FAM001  IND002  0  0  2  1  G A  C C  A A  T C
FAM002  IND003  0  0  1  1  A A  T T  G G  C C

Genotype Encoding:

Each SNP is represented by two alleles separated by a space:

  • G G = Homozygous for G allele
  • G A = Heterozygous (one G, one A)
  • A A = Homozygous for A allele
  • 0 0 = Missing genotype

Important: The order of alleles in heterozygotes doesn't matter (G A = A G).

Problems with .ped format:

  • HUGE files for large datasets (gigabytes to terabytes)
  • Slow to process (text parsing is computationally expensive)
  • No explicit allele definition (you have to infer which alleles exist from the data)

When to use .ped/.map:

  • Small datasets (< 1,000 individuals, < 10,000 SNPs)
  • When you need to manually edit genotypes
  • Importing data from older software
  • Best practice: Convert to binary format (.bed/.bim/.fam) immediately for analysis

Transposed Format (.tped/.tfam)

This format is a "transposed" version of .ped/.map. Instead of one row per individual, you have one row per SNP.

The .tfam File

Identical to .fam file - contains sample information.

Format:

FamilyID  IndividualID  FatherID  MotherID  Sex  Phenotype

The .tped File (Transposed Genotypes)

Each row represents one SNP, with genotypes for all individuals.

Format:

Chromosome  VariantID  GeneticDistance  Position  [Genotypes for all individuals...]

Example .tped file:

1  rs12345  0  752566  G G  G A  A A  G G  A A
1  rs67890  0  798959  C T  C C  T T  C T  C C
2  rs11111  0  1240532 A G  A A  G G  A G  A A

The first 4 columns are like the .map file. After that, genotypes are listed for all individuals (2 alleles per person, space-separated).

When to use .tped/.tfam:

  • When your data is organized by SNP rather than by individual
  • Converting from certain genotyping platforms
  • Some imputation software prefers this format
  • Still text format so same size/speed issues as .ped

Long Format

Long format (also called "additive" or "dosage" format) represents genotypes as numeric values instead of allele pairs.

Format options:

Additive coding (most common):

FamilyID  IndividualID  VariantID  Genotype
FAM001    IND001        rs12345    0
FAM001    IND001        rs67890    1
FAM001    IND001        rs11111    2
FAM001    IND002        rs12345    1

Numeric genotype values:

  • 0 = Homozygous for reference allele (AA)
  • 1 = Heterozygous (AB)
  • 2 = Homozygous for alternate allele (BB)
  • NA or -9 = Missing

Why long format?

  • Easy to use in statistical software (R, Python pandas)
  • Flexible for merging with other data (phenotypes, covariates)
  • Good for database storage (one row per observation)
  • Can include dosages for imputed data (values between 0-2, like 0.85)

Downsides:

  • MASSIVE file size (one row per person per SNP)
  • Example: 10,000 people × 1 million SNPs = 10 billion rows
  • Not practical for genome-wide data without compression

When to use:

  • Working with a small subset of SNPs in R/Python
  • Merging genotypes with other tabular data
  • Machine learning applications where you need a feature matrix
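
For that last use case, here is a minimal sketch, assuming pandas, of pivoting long-format genotypes into a sample-by-SNP matrix; the column names follow the example above.

import pandas as pd

long_df = pd.DataFrame({
    "IndividualID": ["IND001", "IND001", "IND001", "IND002"],
    "VariantID":    ["rs12345", "rs67890", "rs11111", "rs12345"],
    "Genotype":     [0, 1, 2, 1],   # additive coding: copies of alt allele
})

# One row per individual, one column per variant - a ready feature matrix.
matrix = long_df.pivot(index="IndividualID", columns="VariantID",
                       values="Genotype")
print(matrix)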

Variant Call Format (VCF)

VCF is the standard format for storing genetic variation from sequencing data. Unlike genotyping arrays (which only check specific SNPs), sequencing produces all variants, including rare and novel ones.

Key characteristics:

Comprehensive information:

  • Genotypes for all samples at each variant
  • Quality scores for each call
  • Read depth, allele frequencies
  • Functional annotations
  • Multiple alternate alleles at the same position

File structure:

  • Header lines start with ## (metadata about reference genome, samples, etc.)
  • Column header line starts with #CHROM (defines columns)
  • Data lines: One per variant

Standard VCF columns:

#CHROM  POS     ID         REF  ALT     QUAL  FILTER  INFO           FORMAT  [Sample genotypes...]
1       752566  rs12345    G    A       100   PASS    AF=0.23;DP=50  GT:DP   0/1:30  1/1:25  0/0:28

Column Breakdown:

CHROM: Chromosome (1-22, X, Y, MT)

POS: Position on chromosome (1-based coordinate)

ID: Variant identifier (rsID or . if none)

REF: Reference allele (what's in the reference genome)

ALT: Alternate allele(s) - can be multiple, comma-separated

  • Example: A,T means two alternate alleles

QUAL: Quality score (higher = more confident call)

  • Phred-scaled: QUAL=30 means 99.9% confidence
  • . if unavailable
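
Because QUAL is Phred-scaled, the confidence figures can be checked directly with P(error) = 10^(-QUAL/10); a quick sketch:

# Phred scale: P(error) = 10 ** (-QUAL / 10)
for qual in (10, 20, 30, 50):
    p_error = 10 ** (-qual / 10)
    print(f"QUAL={qual}: P(error)={p_error:g}, confidence={1 - p_error:.4%}")
# QUAL=30 -> P(error)=0.001 -> 99.9000% confidence, matching the note above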

FILTER: Quality filter status

  • PASS = passed all filters
  • LowQual, HighMissing, etc. = failed specific filters
  • . = no filtering applied

INFO: Semicolon-separated annotations

  • AF=0.23 = Allele frequency 23%
  • DP=50 = Total read depth
  • AC=10 = Allele count
  • Many possible fields (defined in header)

FORMAT: Describes the per-sample data fields

  • GT = Genotype
  • DP = Read depth for this sample
  • GQ = Genotype quality
  • Example: GT:DP:GQ

Sample columns: One column per individual

  • Data corresponds to FORMAT field
  • Example: 0/1:30:99 means heterozygous, 30 reads, quality 99

Genotype Encoding in VCF:

GT (Genotype) format:

  • 0/0 = Homozygous reference (REF/REF)
  • 0/1 = Heterozygous (REF/ALT)
  • 1/1 = Homozygous alternate (ALT/ALT)
  • ./. = Missing genotype
  • 1/2 = Heterozygous with two different alternate alleles
  • 0|1 = Phased genotype (pipe | instead of slash /)

Phased vs. unphased:

  • / = unphased (don't know which allele came from which parent)
  • | = phased (know parental origin)
  • 0|1 means reference allele from parent 1, alternate from parent 2
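
To tie the encodings together, here is a minimal Python sketch that parses the per-sample fields of the example data line above by hand (real pipelines should use a proper VCF library such as pysam or cyvcf2):

line = ("1\t752566\trs12345\tG\tA\t100\tPASS\tAF=0.23;DP=50\t"
        "GT:DP\t0/1:30\t1/1:25\t0/0:28")

fields = line.split("\t")
ref, alt = fields[3], fields[4]
fmt_keys = fields[8].split(":")            # e.g. ["GT", "DP"]

for sample in fields[9:]:
    data = dict(zip(fmt_keys, sample.split(":")))
    gt = data["GT"]
    phased = "|" in gt                     # pipe = phased, slash = unphased
    alleles = [ref if a == "0" else alt if a == "1" else a
               for a in gt.replace("|", "/").split("/")]
    print(gt, "->", "/".join(alleles), "phased" if phased else "unphased")
# 0/1 -> G/A unphased, 1/1 -> A/A unphased, 0/0 -> G/G unphased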

Compressed VCF (.vcf.gz):

VCF files are usually gzipped and indexed:

  • .vcf.gz = compressed VCF (much smaller)
  • .vcf.gz.tbi = tabix index (allows fast random access)
  • Tools like bcftools and vcftools work directly with compressed VCFs

Example sizes:

  • Uncompressed VCF: 100 GB
  • Compressed .vcf.gz: 10-15 GB
  • Always work with compressed VCFs!

When to use VCF:

  • Sequencing data (whole genome, exome, targeted)
  • When you need detailed variant information
  • Storing rare and novel variants
  • Multi-sample studies with complex annotations
  • NOT typical for genotyping array data (use PLINK binary instead)

Oxford Format (.gen / .bgen + .sample)

Developed by the Oxford statistics group, commonly used in UK Biobank and imputation software (IMPUTE2, SHAPEIT).

The .sample File

Contains sample information, similar to .fam but with a header row.

Format:

ID_1 ID_2 missing sex phenotype
0 0 0 D B
IND001 IND001 0 1 2
IND002 IND002 0 2 1

First two rows are special:

  • Row 1: Column names
  • Row 2: Data types
    • D = Discrete/categorical
    • C = Continuous
    • B = Binary
    • 0 = Not used

Subsequent rows: Sample data

  • ID_1: Usually same as ID_2 for unrelated individuals
  • ID_2: Sample identifier
  • missing: Missingness rate (usually 0)
  • sex: 1=male, 2=female
  • phenotype: Your trait of interest

The .gen File (Genotype Probabilities)

Stores genotype probabilities rather than hard calls. This is crucial for imputed data where you're not certain of the exact genotype.

Format:

Chromosome  VariantID  Position  Allele1  Allele2  [Genotype probabilities for all samples...]

Example .gen file:

1  rs12345  752566  G  A  1 0 0  0.95 0.05 0  0 0.1 0.9

Genotype Probability Triplets:

For each sample, three probabilities (must sum to 1.0):

  • P(AA) = Probability of homozygous for allele 1
  • P(AB) = Probability of heterozygous
  • P(BB) = Probability of homozygous for allele 2

Example interpretations:

  • 1 0 0 = Definitely AA (100% certain)
  • 0 0 1 = Definitely BB (100% certain)
  • 0 1 0 = Definitely AB (100% certain)
  • 0.9 0.1 0 = Probably AA, might be AB (uncertain genotype)
  • 0.33 0.33 0.33 = Completely uncertain (missing data)

Why probabilities matter:

  • Imputed genotypes aren't perfectly certain
  • Better to use probabilities than picking "best guess" genotype
  • Allows proper statistical modeling of uncertainty
  • Example: If imputation says 90% chance of AA, 10% chance AB, you should account for that uncertainty
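
One common way to carry that uncertainty forward, shown in a minimal sketch below, is to collapse each triplet into an expected allele-2 dosage: dosage = 0*P(AA) + 1*P(AB) + 2*P(BB). The line mirrors the .gen example above.

gen_line = "1 rs12345 752566 G A 1 0 0 0.95 0.05 0 0 0.1 0.9"
fields = gen_line.split()
probs = [float(x) for x in fields[5:]]

for i in range(0, len(probs), 3):        # one probability triplet per sample
    p_aa, p_ab, p_bb = probs[i:i + 3]
    dosage = p_ab + 2 * p_bb             # expected count of allele 2
    print(f"sample {i // 3}: dosage = {dosage:.2f}")
# 0.00 (certain AA), 0.05 (probably AA), 1.90 (probably BB)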

The .bgen File (Binary Gen)

Binary version of .gen format - compressed and indexed for fast access.

Key features:

  • Much smaller than text .gen files
  • Includes variant indexing for rapid queries
  • Supports different compression levels
  • Stores genotype probabilities (like .gen) or dosages
  • Used by UK Biobank and other large biobanks

Associated files:

  • .bgen = Main genotype file
  • .bgen.bgi = Index file (for fast lookup)
  • .sample = Sample information (same as with .gen)

When to use Oxford format:

  • Working with imputed data
  • UK Biobank analyses
  • Using Oxford software (SNPTEST, QCTOOL, etc.)
  • When you need to preserve genotype uncertainty

Converting to PLINK:

  • PLINK2 can read .bgen files
  • Can convert to hard calls (loses probability information)
  • Or use dosages (keeps uncertainty as 0-2 continuous values)

23andMe Format

23andMe is a direct-to-consumer genetic testing company. Their raw data format is simple but NOT standardized for research use.

Format:

# rsid    chromosome    position    genotype
rs12345    1    752566    AG
rs67890    1    798959    CC
rs11111    2    1240532   --

Column Breakdown:

rsid: Variant identifier (rsID from dbSNP)

chromosome: Chromosome number (1-22, X, Y, MT)

  • Note: Sometimes uses 23 for X, 24 for Y, 25 for XY, 26 for MT

position: Base-pair position

  • Warning: Build version (GRCh37 vs GRCh38) is often unclear!
  • Check the file header or 23andMe documentation

genotype: Two-letter allele call

  • AG = Heterozygous
  • AA = Homozygous
  • -- = Missing/no call
  • DD or II = Deletion or insertion (rare)

Important Limitations:

Not standardized:

  • Different builds over time (some files are GRCh37, newer ones GRCh38)
  • Allele orientation issues (forward vs. reverse strand)
  • Variant filtering varies by chip version

Only genotyped SNPs:

  • Typically 500k-1M SNPs (depending on chip version)
  • No imputed data in raw download
  • Focused on common variants (rare variants not included)

Missing quality information:

  • No quality scores
  • No read depth or confidence metrics
  • "No call" (--) doesn't tell you why it failed

Privacy and consent issues:

  • Users may not understand research implications
  • IRB approval needed for research use
  • Cannot assume informed consent for specific research

Many online tools exist, but be careful:

  1. Determine genome build (critical!)
  2. Check strand orientation
  3. Handle missing genotypes (-- → 0 0)
  4. Verify chromosome coding (especially X/Y/MT)

Typical workflow:

# Convert to PLINK format (using a conversion script)
python 23andme_to_plink.py raw_data.txt

# Creates .ped and .map files
# Then convert to binary
plink --file raw_data --make-bed --out data
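
For intuition, here is a minimal sketch of what a conversion script like the hypothetical 23andme_to_plink.py above might do for a single individual's file. It deliberately skips the build and strand checks from the list above, which a real conversion must handle.

def convert(raw_path, out_prefix, iid="IND001"):
    alleles = []
    with open(raw_path) as raw, open(out_prefix + ".map", "w") as map_f:
        for line in raw:
            if line.startswith("#"):
                continue                          # skip comment header lines
            rsid, chrom, pos, gt = line.split()
            map_f.write(f"{chrom} {rsid} 0 {pos}\n")
            alleles += ["0", "0"] if gt == "--" else list(gt)   # -- -> 0 0
    with open(out_prefix + ".ped", "w") as ped_f:
        # FamilyID IndividualID Father Mother Sex Phenotype, then allele pairs
        ped_f.write(f"{iid} {iid} 0 0 0 -9 " + " ".join(alleles) + "\n")

convert("raw_data.txt", "raw_data")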

When you'd use 23andMe data:

  • Personal genomics projects
  • Ancestry analysis
  • Polygenic risk score estimation
  • Educational purposes
  • NOT suitable for: Clinical decisions, serious GWAS (too small), research without proper consent

Summary: Choosing the Right Format

| Format | Best For | Pros | Cons |
|---|---|---|---|
| PLINK binary (.bed/.bim/.fam) | GWAS, large genotyping arrays | Fast, compact, standard | Loses probability info |
| PLINK text (.ped/.map) | Small datasets, manual editing | Human-readable | Huge, slow |
| VCF (.vcf/.vcf.gz) | Sequencing data, rare variants | Comprehensive info, standard | Complex, overkill for arrays |
| Oxford (.bgen/.gen) | Imputed data, UK Biobank | Preserves uncertainty | Less common in US |
| 23andMe | Personal genomics | Direct-to-consumer | Not research-grade |
| Long format | Statistical analysis in R/Python | Easy to manipulate | Massive file size |

General recommendations:

  1. For genotyping array data: Use PLINK binary format (.bed/.bim/.fam)
  2. For sequencing data: Use compressed VCF (.vcf.gz)
  3. For imputed data: Use Oxford .bgen or VCF with dosages
  4. For statistical analysis: Convert subset to long format
  5. For personal data: Convert 23andMe to PLINK, but carefully

File conversions:

  • PLINK can convert between most formats
  • Always document your conversions (genome build, strand, filters)
  • Verify a few variants manually after conversion
  • Keep original files - conversions can introduce errors

Sanger Sequencing

The Chemistry: dNTPs vs ddNTPs

dNTP (deoxynucleotide triphosphate):

  • Normal DNA building blocks: dATP, dCTP, dGTP, dTTP
  • Have a 3'-OH group โ†’ DNA polymerase can add another nucleotide
  • Chain continues growing

ddNTP (dideoxynucleotide triphosphate):

  • Modified nucleotides: ddATP, ddCTP, ddGTP, ddTTP
  • Missing the 3'-OH group โ†’ no place to attach next nucleotide
  • Chain terminates (stops growing)

The key idea: Mix normal dNTPs with a small amount of ddNTPs. Sometimes the polymerase adds a normal dNTP (chain continues), sometimes it adds a ddNTP (chain stops). This creates DNA fragments of different lengths, all ending at the same type of base.


The Classic Method: Four Separate Reactions

You set up four tubes, each with:

  • Template DNA (what you want to sequence)
  • Primer (starting point)
  • DNA polymerase
  • All four dNTPs (A, C, G, T)
  • One type of ddNTP (different for each tube)

The Four Reactions:

Tube 1 - ddATP: Chains terminate at every A position
Tube 2 - ddCTP: Chains terminate at every C position
Tube 3 - ddGTP: Chains terminate at every G position
Tube 4 - ddTTP: Chains terminate at every T position

Example Results:

Let's say the sequence being read (the newly synthesized strand) is: 5'-ACGTACGT-3'

Tube A (ddATP): Fragments ending at A positions

A
ACGTA

Tube C (ddCTP): Fragments ending at C positions

AC
ACGTAC

Tube G (ddGTP): Fragments ending at G positions

ACG
ACGTACG

Tube T (ddTTP): Fragments ending at T positions

ACGT
ACGTACGT

Gel Electrophoresis Separation

Run all four samples on a gel. Smallest fragments move furthest, largest stay near the top.

        A    C    G    T
        |    |    |    |
Start → ━━━━━━━━━━━━━━━━━━━  (loading wells)

                      ██    ← ACGTACGT (8 bases)
                 ██         ← ACGTACG (7 bases)
            ██              ← ACGTAC (6 bases)
        ██                  ← ACGTA (5 bases)
                      ██    ← ACGT (4 bases)
                 ██         ← ACG (3 bases)
            ██              ← AC (2 bases)
        ██                  ← A (1 base)

      ↓ Direction of migration ↓

Reading the sequence: Start from the bottom (smallest fragment) and go up:

Bottom → Top:  A - C - G - T - A - C - G - T
Sequence:      A   C   G   T   A   C   G   T

The sequence is ACGTACGT (read from bottom to top).
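
The whole experiment is small enough to simulate. Here is a minimal Python sketch that builds the four tubes for the read sequence above and then "reads the gel" by sorting fragments from shortest to longest:

sequence = "ACGTACGT"

# Each tube holds every prefix that terminates at its ddNTP's base.
tubes = {base: [sequence[:i + 1] for i, b in enumerate(sequence) if b == base]
         for base in "ACGT"}
print(tubes["A"])    # ['A', 'ACGTA'] -- chains stopped at each A position

# Gel reading: shortest fragment runs farthest; its last base is read first.
fragments = sorted((f for frags in tubes.values() for f in frags), key=len)
print("".join(f[-1] for f in fragments))   # ACGTACGT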


Modern Method: Fluorescent Dyes

Instead of four separate tubes, we now use one tube with four different fluorescent ddNTPs:

  • ddATP = Green fluorescence
  • ddCTP = Blue fluorescence
  • ddGTP = Yellow fluorescence
  • ddTTP = Red fluorescence

What happens:

  1. All fragments are created in one tube
  2. Run them through a capillary (tiny tube) instead of a gel
  3. Laser detects fragments as they pass by
  4. Computer records the color (= which base) and timing (= fragment size)

Chromatogram output:

Fluorescence
    ↑
    |      G   C   T   A   G   C   T
    |     /\  /\  /\  /\  /\  /\  /\
    |____/  \/  \/  \/  \/  \/  \/  \____→ Time
    |
Position:  1   2   3   4   5   6   7


Why Sanger Sequencing Still Matters

  • High accuracy (~99.9%)
  • Gold standard for validating variants
  • Good for short reads (up to ~800 bases)
  • Reads one purified DNA species at a time - gives a single clean consensus trace
  • Used for: Confirming mutations, plasmid verification, PCR product sequencing

Limitations:

  • One fragment at a time (not high-throughput)
  • Expensive for large-scale projects (replaced by next-gen sequencing)
  • Can't detect low-frequency variants (< 15-20%)

About Course Materials

These notes contain NO copied course materials. Everything here is my personal understanding and recitation of concepts, synthesized from publicly available resources (textbooks, online tutorials, sequencing method documentation).

This is my academic work - how I've processed and reorganized information from legitimate sources. I take full responsibility for any errors in my understanding.

If you believe any content violates copyright, contact me at mahmoudahmedxyz@gmail.com and I'll remove it immediately.

Lecture 2: Applied Genomics Overview

Key Concepts Covered

Hardy-Weinberg Equilibrium
Population genetics foundation - allele frequencies (p, q, r) in populations remain constant under specific conditions.

Quantitative Genetics (QG)
Study of traits controlled by multiple genes. Used for calculating breeding values in agriculture and understanding complex human traits.

The Human Genome

  • ~3 billion base pairs
  • <5% codes for proteins (the rest: regulatory, structural, "junk")
  • Massive scale creates computational challenges

QTL (Quantitative Trait Loci)
Genomic regions associated with quantitative traits - linking genotype to phenotype.

Genomics Definition
Study of entire genomes - all DNA sequences, genes, and their interactions.

Sequencing Accuracy
Modern sequencing: <1 error per 10,000 bases

Comparative Genomics
Comparing genomes across species to understand evolution, function, and conservation.

Applied Genomics (Why we're here)
Analyze genomes and extract information - turning raw sequence data into biological insights.

Major Challenges in Genomic Data

  1. Storage - Billions of bases = terabytes of data
  2. Transfer - Moving large datasets between systems
  3. Processing - Computational power for analysis

Sequencing Direction Note

Sanger sequencing: Input = what you're reading (direct)
NGS: Reverse problem - detect complement synthesis, infer template

Next-Generation Sequencing (NGS)

Ion Torrent Sequencing

Ion Torrent is a next-generation sequencing technology that detects DNA sequences by measuring pH changes instead of using light or fluorescence. It's fast, relatively cheap, and doesn't require expensive optical systems.


The Chemistry: Detecting Hydrogen Ions

The Core Principle

When DNA polymerase adds a nucleotide to a growing DNA strand, it releases a hydrogen ion (H⁺).

The reaction:

dNTP + DNA(n) → DNA(n+1) + PPi + H⁺

  • DNA polymerase incorporates a nucleotide
  • Pyrophosphate (PPi) is released
  • One H⁺ ion is released per nucleotide added
  • The H⁺ changes the pH of the solution
  • A pH sensor detects this change

Key insight: No fluorescent labels, no lasers, no cameras. Just chemistry and pH sensors.

Why amplification? A single molecule releasing one H⁺ isn't detectable. A million copies releasing a million H⁺ ions at once creates a measurable pH change.

The Homopolymer Problem

What Are Homopolymers?

A homopolymer is a stretch of identical nucleotides in a row:

  • AAAA (4 A's)
  • TTTTTT (6 T's)
  • GGGGG (5 G's)

Why They're a Problem in Ion Torrent

Normal case (single nucleotide):

  • Flow A → 1 nucleotide added → 1 H⁺ released → small pH change → signal = 1

Homopolymer case (multiple identical nucleotides):

  • Flow A → 4 nucleotides added (AAAA) → 4 H⁺ released → larger pH change → signal = 4

The challenge: Distinguishing between signal strengths. Is it 3 A's or 4 A's? Is it 7 T's or 8 T's?

The Math Problem

Signal intensity is proportional to the number of nucleotides incorporated:

  • 1 nucleotide = signal intensity ~100
  • 2 nucleotides = signal intensity ~200
  • 3 nucleotides = signal intensity ~300
  • ...but measurements have noise

Example measurements:

  • True 3 A's might measure as 290-310
  • True 4 A's might measure as 390-410
  • Overlap zone: Is a signal of 305 actually 3 or 4?

The longer the homopolymer, the harder it is to count accurately.
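
A minimal simulation makes the overlap visible. The numbers below follow the example above (one base ~ signal 100), with noise that grows with homopolymer length, which is roughly how real instruments behave:

import random

random.seed(0)

def measure(true_count):
    # Signal ~ 100 per base; noise grows with the square root of run length.
    return true_count * 100 + random.gauss(0, 25 * true_count ** 0.5)

for true_count in (3, 4, 7, 8):
    for _ in range(3):
        signal = measure(true_count)
        called = round(signal / 100)    # best-guess homopolymer length
        print(f"true {true_count}: signal {signal:6.1f} -> called {called}")
# Short runs are called reliably; by 7-8 bases the noise bands overlap and a
# call can land one base off -- exactly the indel errors described below.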

Consequences:

  • Insertions/deletions (indels) in homopolymer regions
  • Frameshifts if in coding regions (completely changes protein)
  • False variants called in genetic studies
  • Harder genome assembly (ambiguous regions)



Ion Torrent Systems

Ion Torrent offers different sequencing systems optimized for various throughput needs.

System Comparison

| Feature | Ion PGM | Ion Proton/S5 |
|---|---|---|
| Throughput | 30 Mb - 2 Gb | Up to 15 Gb |
| Run time | 4-7 hours | 2-4 hours |
| Read length | 35-400 bp | 200 bp |
| Best for | Small targeted panels, single samples | Exomes, large panels, multiple samples |
| Cost per run | Lower | Higher |
| Lab space | Benchtop | Benchtop |

Advantages of Ion Torrent

1. Speed

  • No optical scanning between cycles
  • Direct electronic detection
  • Runs complete in 2-4 hours (vs. days for some platforms)

2. Cost

  • No expensive lasers or cameras
  • Simpler hardware = lower instrument cost
  • Good for small labs or targeted sequencing

3. Scalability

  • Different chip sizes for different throughput needs
  • Can sequence 1 sample or 96 samples
  • Good for clinical applications

4. Long reads (relatively)

  • 200-400 bp reads standard
  • Longer than Illumina (75-300 bp typically)
  • Helpful for some applications

Disadvantages of Ion Torrent

1. Homopolymer errors (the big one)

  • Indel errors in long homopolymers
  • Limits accuracy for some applications

2. Lower overall accuracy

  • ~98-99% accuracy vs. 99.9% for Illumina
  • More errors per base overall

3. Smaller throughput

  • Maximum output: ~15 Gb per run
  • Illumina NovaSeq: up to 6 Tb per run
  • Not ideal for whole genome sequencing of complex organisms

4. Systematic errors

  • Errors aren't random - they cluster in homopolymers
  • Harder to correct computationally

Conclusion

Ion Torrent is a clever technology that trades optical complexity for electronic simplicity. It's fast and cost-effective for targeted applications, but the homopolymer problem remains its Achilles' heel.

The homopolymer issue isn't a deal-breaker - it's manageable with proper bioinformatics and sufficient coverage. But you need to know about it when designing experiments and interpreting results.

For clinical targeted sequencing (like cancer panels), Ion Torrent is excellent. For reference-quality genome assemblies or ultra-high-accuracy applications, other platforms might be better choices.

The key lesson: Every sequencing technology has trade-offs. Understanding them helps you choose the right tool for your specific question.


About Course Materials

These notes contain NO copied course materials. Everything here is my personal understanding and recitation of concepts, synthesized from publicly available resources (sequencing technology documentation, bioinformatics tutorials, scientific literature).

This is my academic work - how I've processed and reorganized information from legitimate sources. I take full responsibility for any errors in my understanding.

If you believe any content violates copyright, contact me at mahmoudahmedxyz@gmail.com and I'll remove it immediately.

Lec3

ABI SOLiD Sequencing (Historical)

What Was SOLiD?

SOLiD (Sequencing by Oligonucleotide Ligation and Detection) was a next-generation sequencing platform developed by Applied Biosystems (later acquired by Life Technologies, then Thermo Fisher).

Status: Essentially discontinued. Replaced by Ion Torrent and other technologies.


The Key Difference: Ligation Instead of Synthesis

Unlike other NGS platforms:

  • Illumina: Sequencing by synthesis (polymerase adds nucleotides)
  • Ion Torrent: Sequencing by synthesis (polymerase adds nucleotides)
  • SOLiD: Sequencing by ligation (ligase joins short probes)

How It Worked (Simplified)

  1. DNA fragments attached to beads (emulsion PCR, like Ion Torrent)
  2. Fluorescent probes (short 8-base oligonucleotides) compete to bind
  3. DNA ligase joins the matching probe to the primer
  4. Detect fluorescence to identify which probe bound
  5. Cleave probe, move to next position
  6. Repeat with different primers to read the sequence

Key concept: Instead of building a complementary strand one nucleotide at a time, SOLiD interrogated the sequence using short probes that bind and get ligated.

Why It's Dead (or Nearly Dead)

Advantages that didn't matter enough:

  • Very high accuracy (>99.9% after two-base encoding)
  • Error detection built into chemistry

Fatal disadvantages:

  1. Complex bioinformatics - two-base encoding required specialized tools
  2. Long run times - 7-14 days per run (vs. hours for Ion Torrent, 1-2 days for Illumina)
  3. Expensive - high cost per base
  4. Company pivot - Life Technologies acquired Ion Torrent and shifted focus there

The market chose: Illumina won on simplicity and throughput, Ion Torrent won on speed.

What You Should Remember

1. Different chemistry - Ligation-based, not synthesis-based

2. Two-base encoding - Clever error-checking mechanism, but added complexity

3. Historical importance - Showed alternative approaches to NGS were possible

4. Why it failed - Too slow, too complex, company shifted to Ion Torrent

5. Legacy - Some older papers used SOLiD data; understanding the platform helps interpret those results


The Bottom Line

SOLiD was an interesting experiment in using ligation chemistry for sequencing. It achieved high accuracy through two-base encoding but couldn't compete with faster, simpler platforms.

Why learn about it?

  • Understand the diversity of approaches to NGS
  • Interpret older literature that used SOLiD
  • Appreciate why chemistry simplicity matters (Illumina's success)

You won't use it, but knowing it existed helps you understand the evolution of sequencing technologies and why certain platforms won the market.


Illumina Sequencing

Illumina is the dominant next-generation sequencing platform worldwide. It uses reversible terminator chemistry and fluorescent detection to sequence millions of DNA fragments simultaneously with high accuracy.


The Chemistry: Reversible Terminators

The Core Principle

Unlike Ion Torrent (which detects H⁺ ions), Illumina detects fluorescent light from labeled nucleotides.

Key innovation: Reversible terminators

Normal dNTP:

  • Has 3'-OH group
  • Polymerase adds it and continues to next base

Reversible terminator (Illumina):

  • Has 3'-OH blocked by a chemical group
  • Has fluorescent dye attached
  • Polymerase adds it and stops
  • After imaging, the block and dye are removed
  • Polymerase continues to next base

Why this matters: You get exactly one base added per cycle, making base calling precise.


How It Works: Step by Step

1. Library Preparation

DNA is fragmented and adapters are ligated to both ends of each fragment.

Adapters contain:

  • Primer binding sites
  • Index sequences (barcodes for sample identification)
  • Sequences complementary to flow cell oligos

2. Cluster Generation (Bridge Amplification)

This is Illumina's signature step - amplification happens on the flow cell surface.

The flow cell:

  • Glass slide with millions of oligos attached to the surface
  • Two types of oligos (P5 and P7) arranged in a lawn

Bridge amplification process:

Step 1: DNA fragments bind to flow cell oligos (one end attaches)

Step 2: The free end bends over and binds to nearby oligo (forms a "bridge")

Step 3: Polymerase copies the fragment, creating double-stranded bridge

Step 4: Bridge is denatured (separated into two strands)

Step 5: Both strands bind to nearby oligos and repeat

Result: Each original fragment creates a cluster of ~1,000 identical copies in a tiny spot on the flow cell.

Why amplification? Like Ion Torrent, a single molecule's fluorescent signal is too weak to detect. A thousand identical molecules in the same spot produce a strong signal.

Visual representation:

Original fragment: ═══DNA═══

After bridge amplification:
║ ║ ║ ║ ║ ║ ║ ║
║ ║ ║ ║ ║ ║ ║ ║  ← ~1,000 copies in one cluster
║ ║ ║ ║ ║ ║ ║ ║
Flow cell surface

3. Sequencing by Synthesis

Now the actual sequencing begins.

Cycle 1:

  1. Add fluorescent reversible terminators (all four: A, C, G, T, each with different color)
  2. Polymerase incorporates one base (only one because it's a terminator)
  3. Wash away unincorporated nucleotides
  4. Image the flow cell with laser
    • Green light = A was added
    • Blue light = C was added
    • Yellow light = G was added
    • Red light = T was added
  5. Cleave off the fluorescent dye and the 3' blocking group
  6. Repeat for next base

Cycle 2, 3, 4... 300+: Same process, one base at a time.

Key difference from Ion Torrent:

  • Illumina: All four nucleotides present at once, polymerase chooses correct one
  • Ion Torrent: One nucleotide type at a time, polymerase adds it only if it matches

Color System

Illumina platforms use either a 4-color chemistry (one fluorescent dye per base, imaged in four channels) or a 2-color chemistry (each base inferred from combinations of red and green signals, as on NextSeq/NovaSeq, which cuts imaging time).

No Homopolymer Problem

Why Illumina Handles Homopolymers Better

Remember Ion Torrent's main weakness? Homopolymers like AAAA produce strong signals that are hard to quantify (is it 3 A's or 4?).

Illumina doesn't have this problem because:

  1. One base per cycle - the terminator ensures only one nucleotide is added
  2. Direct counting - if you see 4 green signals in a row, it's exactly 4 A's
  3. No signal intensity interpretation - just presence/absence of color

Example:

Sequence: AAAA

Illumina:

Cycle 1: Green (A)
Cycle 2: Green (A)
Cycle 3: Green (A)
Cycle 4: Green (A)
→ Exactly 4 A's, no ambiguity

Ion Torrent:

Flow A: Large signal (proportional to 4 H⁺ ions)
→ Is it 4? Or 3? Or 5? (requires signal quantification)

Error Profile: Substitutions, Not Indels

Illumina's Main Error Type

Substitution errors - reading the wrong base (A instead of G, C instead of T)

Error rate: ~0.1% (1 error per 1,000 bases, or 99.9% accuracy)

Common causes:

  1. Phasing/pre-phasing - some molecules in a cluster get out of sync
  2. Dye crosstalk - fluorescent signals bleed between channels
  3. Quality degradation - accuracy decreases toward end of reads

Why Few Indels?

Because of the reversible terminator:

  • Exactly one base per cycle
  • Can't skip a base (would need terminator removal without incorporation)
  • Can't add two bases (terminator blocks second addition)

Comparison:

| Error Type | Illumina | Ion Torrent |
|---|---|---|
| Substitutions | ~99% of errors | ~30% of errors |
| Insertions/Deletions | ~1% of errors | ~70% of errors |
| Homopolymer errors | Rare | Common |

Phasing and Pre-phasing

The Synchronization Problem

In a perfect world, all molecules in a cluster stay perfectly synchronized - all at the same base position.

Reality: Some molecules lag behind (phasing) or jump ahead (pre-phasing).

Phasing (Lagging Behind)

Cycle 1: All molecules at position 1 โœ“
Cycle 2: 98% at position 2, 2% still at position 1 (incomplete extension)
Cycle 3: 96% at position 3, 4% behind...

As cycles progress, the cluster becomes a mix of molecules at different positions.

Result: Blurry signal - you're imaging multiple bases at once.

Pre-phasing (Jumping Ahead)

Cause: Incomplete removal of terminator or dye

A molecule might:

  • Have terminator removed
  • BUT dye not fully removed
  • Next cycle adds another base (now 2 bases ahead of schedule)

Impact on Quality

Early cycles (1-100): High accuracy, minimal phasing
Middle cycles (100-200): Good accuracy, some phasing
Late cycles (200-300+): Lower accuracy, significant phasing

Quality scores decline with read length. This is why:

  • Read 1 (first 150 bases) typically has higher quality than Read 2
  • Paired-end reads are used (sequence both ends, higher quality at each end)

Paired-End Sequencing

What Is Paired-End?

Instead of sequencing only one direction, sequence both ends of the DNA fragment.

Process:

  1. Read 1: Sequence from one end (forward direction) for 150 bases
  2. Regenerate clusters (bridge amplification again)
  3. Read 2: Sequence from the other end (reverse direction) for 150 bases

Result: Two reads from the same fragment, separated by a known distance.

Why Paired-End?

1. Better mapping

  • If one end maps ambiguously, the other might be unique
  • Correct orientation and distance constrain mapping

2. Detect structural variants

  • Deletions: Reads closer than expected
  • Insertions: Reads farther than expected
  • Inversions: Wrong orientation
  • Translocations: Reads on different chromosomes

3. Improve assembly

  • Links across repetitive regions
  • Spans gaps

4. Quality assurance

  • If paired reads don't map correctly, flag as problematic

Illumina Systems

Different Throughput Options

Illumina offers multiple sequencing platforms for different scales:

| System | Throughput | Run Time | Read Length | Best For |
|---|---|---|---|---|
| iSeq 100 | 1.2 Gb | 9-19 hours | 150 bp | Small targeted panels, amplicons |
| MiniSeq | 8 Gb | 4-24 hours | 150 bp | Small labs, targeted sequencing |
| MiSeq | 15 Gb | 4-55 hours | 300 bp | Targeted panels, small genomes, amplicon seq |
| NextSeq | 120 Gb | 12-30 hours | 150 bp | Exomes, transcriptomes, small genomes |
| NovaSeq | 6000 Gb (6 Tb) | 13-44 hours | 250 bp | Whole genomes, large projects, population studies |

Key trade-offs:

  • Higher throughput = longer run time
  • Longer reads = lower throughput or longer run time
  • Bigger machines = higher capital cost but lower cost per Gb

Advantages of Illumina

1. High Accuracy

  • 99.9% base accuracy (Q30 or higher)
  • Few indel errors
  • Reliable base calling

2. High Throughput

  • Billions of reads per run
  • Suitable for whole genomes at population scale

3. Low Cost (at scale)

  • ~$5-10 per Gb for high-throughput systems
  • Cheapest for large projects

4. Mature Technology

  • Well-established protocols
  • Extensive bioinformatics tools
  • Large user community

5. Flexible Read Lengths

  • 50 bp to 300 bp
  • Single-end or paired-end

6. Multiplexing

  • Sequence 96+ samples in one run using barcodes
  • Reduces cost per sample

Disadvantages of Illumina

1. Short Reads

  • Maximum ~300 bp (vs. PacBio: 10-20 kb)
  • Hard to resolve complex repeats
  • Difficult for de novo assembly of large genomes

2. Run Time

  • 12-44 hours for high-throughput systems
  • Longer than Ion Torrent (2-4 hours)
  • Not ideal for ultra-rapid diagnostics

3. PCR Amplification Bias

  • Bridge amplification favors certain sequences
  • GC-rich or AT-rich regions may be underrepresented
  • Some sequences difficult to amplify

4. Equipment Cost

  • NovaSeq: $850,000-$1,000,000
  • High upfront investment
  • Requires dedicated space and trained staff

5. Phasing Issues

  • Quality degrades with read length
  • Limits maximum usable read length

When to Use Illumina

Ideal Applications

Whole Genome Sequencing (WGS)

  • Human, animal, plant genomes
  • Resequencing (alignment to reference)
  • Population genomics

Whole Exome Sequencing (WES)

  • Capture and sequence only coding regions
  • Clinical diagnostics
  • Disease gene discovery

RNA Sequencing (RNA-seq)

  • Gene expression profiling
  • Transcript discovery
  • Differential expression analysis

ChIP-Seq / ATAC-Seq

  • Protein-DNA interactions
  • Chromatin accessibility
  • Epigenomics

Metagenomics

  • Microbial community profiling
  • 16S rRNA sequencing
  • Shotgun metagenomics

Targeted Panels

  • Cancer hotspot panels
  • Carrier screening
  • Pharmacogenomics

Not Ideal For

  • Long-range phasing (use PacBio or Oxford Nanopore)
  • Structural variant detection (short reads struggle with large rearrangements)
  • Ultra-rapid turnaround (use Ion Torrent for speed)
  • De novo assembly of repeat-rich genomes (long reads better)


Illumina vs Ion Torrent: Summary

| Feature | Illumina | Ion Torrent |
|---------|----------|-------------|
| Detection | Fluorescence | pH (H⁺ ions) |
| Chemistry | Reversible terminators | Natural dNTPs (no terminators) |
| Read length | 50-300 bp | 200-400 bp |
| Run time | 12-44 hours (high-throughput) | 2-4 hours |
| Accuracy | 99.9% | 98-99% |
| Main error | Substitutions | Indels (homopolymers) |
| Homopolymers | No problem | Major issue |
| Throughput | Up to 6 Tb (NovaSeq) | Up to 15 Gb |
| Cost per Gb | $5-10 (at scale) | $50-100 |
| Best for | Large projects, WGS, high accuracy | Targeted panels, speed |

The Bottom Line

Illumina is the workhorse of genomics. It's not the fastest (that's Ion Torrent) and it doesn't produce the longest reads (that's PacBio/Nanopore), but it hits the sweet spot of:

  • High accuracy
  • High throughput
  • Reasonable cost
  • Mature ecosystem

For most genomic applications - especially resequencing, RNA-seq, and exomes - Illumina is the default choice.

The main limitation is short reads. For applications requiring long-range information (phasing variants, resolving repeats, de novo assembly), you'd combine Illumina with long-read technologies or use long-read platforms alone.

Key takeaway: Illumina's reversible terminator chemistry elegantly solves the homopolymer problem by ensuring exactly one base per cycle, trading speed (longer run time) for accuracy (99.9%).


About Course Materials

These notes contain NO copied course materials. Everything here is my personal understanding and recitation of concepts, synthesized from publicly available resources (Illumina documentation, sequencing technology literature, bioinformatics tutorials).

This is my academic work: how I've processed and reorganized information from legitimate sources. I take full responsibility for any errors in my understanding.

If you believe any content violates copyright, contact me at mahmoudahmedxyz@gmail.com and I'll remove it immediately.

Nanopore Sequencing

Overview

Oxford Nanopore uses tiny protein pores embedded in a membrane to read DNA directly - no amplification, no fluorescence.


How It Works

The Setup: Membrane with Nanopores

A membrane separates two chambers with different electrical charges. Embedded in the membrane are protein nanopores - tiny holes just big enough for single-stranded DNA to pass through.

     Voltage applied across membrane
              ─────────────
                   ↓
    ════════════╤═════╤════════════  ← Membrane
                │ ◯ ◯ │              ← Nanopores
    ════════════╧═════╧════════════
                   ↑
              DNA threads through

The Detection: Measuring Current

  1. DNA strand is fed through the pore by a motor protein
  2. As each base passes through, it partially blocks the pore
  3. Each base (A, T, G, C) has a different size/shape
  4. Different bases create different electrical resistance
  5. We measure the change in current to identify the base

Key insight: No labels, no cameras, no lasers - just electrical signals!


The Signal: It's Noisy

The raw signal is messy - multiple bases in the pore at once, random fluctuations:

Current
   │
   │ ▄▄▄   ▄▄    ▄▄▄▄   ▄▄   ▄▄▄
   │█   █▄█  █▄▄█    █▄█  █▄█   █▄▄
   │
   └───────────────────────────────── Time

   Base: A  A  T  G   C  C  G  A

Machine learning (neural networks) decodes this noisy signal into base calls.


Why Nanopore?

Ultra-Long Reads

  • Typical: 10-50 kb
  • Record: >4 Mb (yes, megabases!)
  • Limited only by DNA fragment length, not the technology

Cheap and Portable

  • MinION device fits in your hand, costs ~$1000
  • Can sequence in the field (disease outbreaks, remote locations)
  • Real-time data - see results as sequencing happens

Direct Detection

  • Can detect modified bases (methylation) directly
  • No PCR amplification needed
  • Can sequence RNA directly (no cDNA conversion)

Error Rate and Correction

Raw accuracy: ~93-97% (improving with each update)

Error type: Mostly indels, especially in homopolymers

Improving Accuracy

1. Higher coverage: Multiple reads of the same region, errors cancel out

2. Duplex sequencing: DNA is double-stranded - sequence both strands and combine:

Forward strand:  ATGCCCAAA
                 |||||||||
Reverse strand:  TACGGGTTT  (complement)

→ Consensus: Higher accuracy

3. Better basecallers: Neural networks keep improving, accuracy increases with software updates

PacBio Sequencing

Overview

PacBio (Pacific Biosciences) uses SMRT sequencing (Single Molecule Real-Time) to produce long reads - often 10,000 to 25,000+ base pairs.



How It Works

The Setup: ZMW (Zero-Mode Waveguide)

PacBio uses tiny wells called ZMWs - holes so small that light can only illuminate the very bottom.

At the bottom of each well:

  • A single DNA polymerase is fixed in place
  • A single DNA template is threaded through it

The Chemistry: Real-Time Detection

  1. Fluorescent nucleotides (A, T, G, C - each with different color) float in solution
  2. When polymerase grabs the correct nucleotide, it holds it in the detection zone
  3. Laser detects the fluorescence - we see which base is being added
  4. Polymerase incorporates the nucleotide, releases the fluorescent tag
  5. Repeat - watching DNA synthesis in real-time

Key difference from Illumina: We watch a single molecule of polymerase working continuously, not millions of molecules in sync.


Why Long Reads?

The circular template trick:

PacBio uses SMRTbell templates - DNA with hairpin adapters on both ends, forming a circle.

    ╭──────────────╮
    │              │
────┤   Template   ├────
    │              │
    ╰──────────────╯

The polymerase goes around and around, reading the same template multiple times.


Error Correction: Why High Accuracy?

Raw reads have ~10-15% error rate (mostly insertions/deletions)

But: Because polymerase circles the template multiple times, we get multiple reads of the same sequence.

CCS (Circular Consensus Sequencing):

  • Align all passes of the same template
  • Errors are random, so they cancel out
  • Result: >99.9% accuracy (HiFi reads)

Pass 1:  ATGC-CCAAA
Pass 2:  ATGCCC-AAA
Pass 3:  ATGCCCAAAA
Pass 4:  ATGCCC-AAA
         ──────────
Consensus: ATGCCCAAA  ✓
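To see why the consensus works, here's a toy majority-vote sketch (a deliberate simplification: real CCS aligns the passes and models indels, while this assumes the passes are already padded to equal length with '-' for gaps):

from collections import Counter

def majority_consensus(passes):
    """Call the most common symbol at each column; drop gap characters."""
    length = max(len(p) for p in passes)
    padded = [p.ljust(length, '-') for p in passes]
    consensus = []
    for column in zip(*padded):
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != '-':
            consensus.append(symbol)
    return ''.join(consensus)

passes = ["ATGC-CCAAA", "ATGCCC-AAA", "ATGCCCAAAA", "ATGCCC-AAA"]
print(majority_consensus(passes))  # ATGCCCAAA

Each pass has a random error, but at every column the correct base wins the vote, so the consensus recovers the true sequence.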

When to Use PacBio

Ideal for:

  • De novo genome assembly
  • Resolving repetitive regions
  • Detecting structural variants
  • Full-length transcript sequencing
  • Phasing haplotypes

Not ideal for:

  • Large-scale population studies (cost)
  • When short reads are sufficient

Before Data Analysis

Understanding the Problem First

A common mistake in applied genomics is rushing to analysis before fully understanding the problem. Many researchers want to jump straight to implementation before proper design, or analyze sequences before understanding their origin and quality.

The Requirements Phase is Critical

Never underestimate the importance of thoroughly defining requirements. While solving problems is exciting and rewarding, spending weeks solving the wrong problem is far worse. I've learned this lesson the hard way: delivering excellent solutions that didn't address the actual need. As the saying goes, "the operation was a success, but the patient died."

Before investing significant time, money, and effort (resources you may not be able to recoup), invest in understanding the problem:

  • Interview all stakeholders multiple times
  • Don't worry about asking "obvious" questions; assumptions cause problems
  • Create scenarios to test your understanding
  • Have others explain the problem back to you from your perspective
  • Ask people to validate your interpretation

Many critical details go unmentioned because experts assume they're obvious. It's your responsibility to ask clarifying questions until you're confident you understand the requirements completely.


DNA Quality Requirements

Quality assessment of DNA is a critical step before next-generation sequencing (NGS). Both library preparation and sequencing success depend heavily on:

  • Sample concentration: sufficient DNA quantity for the workflow
  • DNA purity: absence of contaminants that interfere with enzymes

Understanding DNA Purity Measurements

The 260/280 absorbance ratio is the standard purity metric:

  • Nucleic acids absorb maximally at 260 nm wavelength
  • Proteins absorb maximally at 280 nm wavelength
  • The ratio between these measurements indicates sample composition

Interpreting the 260/280 ratio:

  • ~1.8 = pure DNA (target value)
  • Higher ratios = possible RNA contamination (RNA also absorbs strongly at 260 nm)
  • Lower ratios = protein contamination

Abnormal 260/280 ratios suggest contamination by proteins, residual extraction reagents (like phenol), or measurement errors.


Understanding Your Sequencing Report

Every sequencing experiment generates a detailed report: always request and review it carefully!

Example: Whole Genome Sequencing (WGS)

What is WGS? Whole Genome Sequencing reads the complete DNA sequence of an organism's genome in a single experiment.

Example calculation: If you ordered 40× WGS coverage of Sus scrofa (pig) DNA:

  • S. scrofa genome size: ~2.8 billion base pairs (2.8 Gb)
  • Expected data: at least 112 Gb (calculated as 40 × 2.8 Gb)

Pro tip: Calculate these expected values before requesting a quotation so you can verify the company delivers what you paid for.


Sequencing Depth and Coverage Explained

Depth of Coverage

Definition: The average number of times each base in the genome is sequenced.

Formula: Depth = (L × N) / G

Where:

  • L = read length (base pairs per sequence read)
  • N = total number of reads generated
  • G = haploid genome size (total base pairs)

This can be simplified to: Depth = Total sequenced base pairs / Genome size

Notation: Depth is expressed as "X×" (e.g., 5×, 10×, 30×, 100×), where X indicates how many times the average base was sequenced.
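The formula is easy to sanity-check in code. A minimal sketch (reusing the S. scrofa genome size from the example above; the read count is a made-up illustration):

def sequencing_depth(read_length_bp, n_reads, genome_size_bp):
    """Depth = (L × N) / G: total sequenced bases divided by genome size."""
    return (read_length_bp * n_reads) / genome_size_bp

# Hypothetical run: 150 bp reads, 800 million reads, pig genome ~2.8 Gb
print(f"{sequencing_depth(150, 800_000_000, 2_800_000_000):.1f}x")  # 42.9x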

Breadth of Coverage

Definition: The percentage of the target genome that has been sequenced at a minimum depth threshold.

Example for Human Genome (~3 Gb):

| Average Depth | Breadth of Coverage |
|---------------|---------------------|
| <1× | Maximum 33% of genome |
| 1× | Maximum 67% of genome |
| 1-3× | >99% of genome |
| 3-5× | >99% of genome |
| 7-8× | >99% of genome |

Key insight: Higher depth doesn't just mean more reads per base; it ensures more complete coverage across the entire genome. Even at 1× average depth, many regions may have zero coverage due to uneven distribution of reads.
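Why does low average depth leave gaps? Under the idealized Lander-Waterman/Poisson model (reads landing uniformly at random, an assumption real data violates), the expected breadth at mean depth c is 1 - e^(-c). A quick sketch:

import math

def expected_breadth(mean_depth):
    """Poisson approximation: fraction of bases covered at least once."""
    return 1 - math.exp(-mean_depth)

for c in [0.5, 1, 3, 5, 8]:
    print(f"{c}x -> ~{expected_breadth(c):.1%} covered")
# 0.5x -> ~39.3%, 1x -> ~63.2%, 3x -> ~95.0%, 5x -> ~99.3%, 8x -> ~100.0%

The numbers differ a bit from the table above because real coverage is never perfectly uniform, but the shape of the curve is the same: breadth saturates quickly once depth passes a few ×.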

Variant Discovery Delivery Framework


Quality Control in Next-Generation Sequencing

Introduction: Why Sequencing Isn't Perfect

Next-generation sequencing (NGS) has revolutionized genomics, but it's not error-free. Every sequencing run introduces errors, and understanding these errors is crucial for reliable variant discovery. In this article, we'll explore how errors occur, how quality is measured, and how to analyze sequencing data quality using Python.

⚠️
Warning

Poor quality control can lead to false variant calls, wasting weeks of downstream analysis. Always perform QC before proceeding!


How Sequencing Errors Happen

Sequencing errors occur at multiple stages of the NGS process. Let's understand the main sources:

1. Cluster Generation Errors

In Illumina sequencing, DNA fragments are amplified into clusters on a flow cell. Each cluster should contain identical copies of the same fragment.

What can go wrong:

  • Incomplete amplification: Some molecules in the cluster don't amplify properly
  • Mixed clusters: Multiple different DNA fragments amplify in the same location
  • Phasing errors: Molecules in a cluster get out of sync during sequencing

💻
Example: Phasing Error

Imagine sequencing the sequence "ATCGATCG":

  • Cycle 1: All molecules read "A" ✅
  • Cycle 2: All molecules read "T" ✅
  • Cycle 3: 99% read "C", but 1% lagged and still read "T" ⚠️
  • Cycle 4: Now signals are mixed - getting worse each cycle

Result: Quality degrades as the read progresses!

2. Terminator Not Removed

During sequencing-by-synthesis:

  1. A fluorescent nucleotide with a reversible terminator is added
  2. The terminator prevents the next nucleotide from being added
  3. After imaging, the terminator should be cleaved off
  4. Problem: If the terminator isn't removed, the sequence stops prematurely

This creates shorter reads and reduces coverage at later positions.

3. Optical Errors

  • Incorrect base calling: The imaging system misidentifies which fluorescent signal is present
  • Signal bleeding: Fluorescent signals from nearby clusters interfere with each other
  • Photobleaching: Fluorescent dyes fade over time, reducing signal strength

4. Biochemical Errors

  • Incorrect nucleotide incorporation: DNA polymerase occasionally adds the wrong base
  • Damaged bases: Pre-existing DNA damage can cause misreads
  • Secondary structures: GC-rich or repetitive regions can form structures that interfere with sequencing

🔬
Fact

Typical Illumina sequencing error rates are around 0.1-1%, meaning 99-99.9% of bases are correct. However, with billions of bases sequenced, this still means millions of errors!


Understanding Base Quality Scores

Since every base call can be wrong, sequencers assign a quality score to each base, representing the confidence that the base call is correct.

Probability vs Quality Score

Instead of storing raw probabilities, sequencing platforms use Phred quality scores:

📖
Definition: Phred Quality Score

Q = -10 × log₁₀(P)

Where P is the probability that the base call is incorrect.

Why Use Quality Scores Instead of Probabilities?

There are several practical reasons:

  1. Easier to interpret: Q=30 is easier to remember than P=0.001
  2. Compact storage: Single ASCII characters encode quality (more on this later)
  3. Natural scale: Higher numbers = better quality (intuitive)
  4. Historical: Originally developed for Sanger sequencing, now standard across platforms

Quality Score Reference Table

| Quality Score (Q) | Error Probability (P) | Accuracy | Interpretation |
|-------------------|-----------------------|----------|----------------|
| Q10 | 1 in 10 (0.1) | 90% | Low quality |
| Q20 | 1 in 100 (0.01) | 99% | Acceptable |
| Q30 | 1 in 1,000 (0.001) | 99.9% | Good quality |
| Q40 | 1 in 10,000 (0.0001) | 99.99% | Excellent quality |

💡
Tip

Q30 is generally considered the minimum acceptable quality for variant calling. Bases below Q20 are often filtered out.

Calculating Quality Scores

Let's see some examples:

Example 1: A base with 99% confidence (P = 0.01)

Q = -10 × log₁₀(0.01)
Q = -10 × (-2)
Q = 20

Example 2: A base with 99.9% confidence (P = 0.001)

Q = -10 × log₁₀(0.001)
Q = -10 × (-3)
Q = 30

❓
Question

If a base has a quality score of Q=25, what's the probability it's correct?


P = 10^(-Q/10) = 10^(-25/10) = 10^(-2.5) ≈ 0.00316

So accuracy = 1 - 0.00316 = 99.68% correct
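These conversions come up constantly, so here's a tiny helper sketch (the function names are mine):

import math

def q_to_error_prob(q):
    """Phred quality score -> probability the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_q(p):
    """Probability of error -> Phred quality score."""
    return -10 * math.log10(p)

print(q_to_error_prob(25))     # ~0.00316
print(error_prob_to_q(0.001))  # 30.0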


The FASTQ File Format

Sequencing data is typically stored in FASTQ format, which contains both the DNA sequence and quality scores for each base.

📖
Definition: FASTQ Format

A text-based format for storing both nucleotide sequences and their corresponding quality scores. Each read is represented by exactly 4 lines.

FASTQ File Structure

Each sequencing read takes exactly 4 lines:

@SEQ_ID                           ← Line 1: Header (starts with @)
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT  ← Line 2: Sequence
+                                 ← Line 3: Separator (starts with +)
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65  ← Line 4: Quality scores

Breaking it down:

  1. Line 1 - Header: Starts with @, contains read identifier and optional description

    • Example: @SRR123456.1 M01234:23:000000000-A1B2C:1:1101:15555:1234 1:N:0:1
  2. Line 2 - Sequence: The actual DNA sequence (A, T, C, G, sometimes N for unknown)

  3. Line 3 - Separator: Always starts with +, optionally repeats the header (usually just +)

  4. Line 4 - Quality Scores: ASCII-encoded quality scores (one character per base)

ASCII Encoding of Quality Scores

Quality scores are encoded as single ASCII characters to save space. The encoding formula is:

ASCII_character = chr(Quality_Score + 33)

The +33 offset is called Phred+33 encoding (also known as Sanger format).

💻
Example: Quality Encoding

Quality score Q=30:

  • ASCII value = 30 + 33 = 63
  • Character = chr(63) = '?'

Quality score Q=40:

  • ASCII value = 40 + 33 = 73
  • Character = chr(73) = 'I'

Quality Character Reference

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
|         |         |         |         |
0        10        20        30        40
  • ! = Q0 (worst quality; the formula gives an error probability of 1)
  • + = Q10 (10% error rate)
  • 5 = Q20 (1% error rate)
  • ? = Q30 (0.1% error rate)
  • I = Q40 (0.01% error rate)
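A quick round trip makes the +33 offset concrete (and shows why assuming the wrong offset silently corrupts every score):

q = 30
char = chr(q + 33)        # Phred+33 encoding: Q30 -> '?'
print(char)               # ?
print(ord(char) - 33)     # 30 (decoded correctly)
print(ord(char) - 64)     # -1 (decoded as Phred+64: obviously wrong)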
⚠️
Warning: Different Encodings Exist

Older Illumina data used Phred+64 encoding (adding 64 instead of 33). Always check which encoding your data uses! Modern data uses Phred+33.


Parsing FASTQ Files with Python

Now let's write Python code to read and analyze FASTQ files. We'll build this step-by-step, as if working in a Jupyter notebook.

Step 1: Reading a FASTQ File

First, let's write a function to parse FASTQ files:

def read_fastq(filename):
    """
    Read a FASTQ file and return lists of sequences and quality strings.
    
    Parameters:
    -----------
    filename : str
        Path to the FASTQ file
        
    Returns:
    --------
    sequences : list
        List of DNA sequences
    qualities : list
        List of quality strings (ASCII encoded)
    """
    sequences = []
    qualities = []
    
    with open(filename, 'r') as f:
        while True:
            # Read 4 lines at a time
            header = f.readline().strip()
            if not header:  # End of file
                break
            
            seq = f.readline().strip()
            plus = f.readline().strip()
            qual = f.readline().strip()
            
            sequences.append(seq)
            qualities.append(qual)
    
    return sequences, qualities

💡
Tip

For very large FASTQ files (common in NGS), consider using generators or the BioPython library to avoid loading everything into memory at once.
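As a sketch of the generator approach, here's a streaming version of read_fastq that yields one record at a time instead of building lists (the function name is mine):

def iter_fastq(filename):
    """Yield (header, sequence, quality) tuples, one read at a time."""
    with open(filename, 'r') as f:
        while True:
            header = f.readline().strip()
            if not header:  # end of file
                break
            seq = f.readline().strip()
            f.readline()  # skip the '+' separator line
            qual = f.readline().strip()
            yield header, seq, qual

# Usage: memory use stays flat even for files with millions of reads.
# for header, seq, qual in iter_fastq('sample.fastq'):
#     ...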

Step 2: Converting Phred+33 to Numeric Quality Scores

Now let's create a helper function to convert ASCII characters to numeric quality scores:

def phred33_to_q(qual_str):
    """
    Convert a Phred+33 encoded quality string to numeric quality scores.
    
    Parameters:
    -----------
    qual_str : str
        Quality string with ASCII-encoded scores
        
    Returns:
    --------
    list of int
        Numeric quality scores
    """
    return [ord(char) - 33 for char in qual_str]

Let's test it:

# Example quality string
example_qual = "!5?II"

# Convert to numeric scores
scores = phred33_to_q(example_qual)
print(f"Quality string: {example_qual}")
print(f"Numeric scores: {scores}")
print(f"Interpretation:")
for char, score in zip(example_qual, scores):
    error_prob = 10 ** (-score / 10)
    accuracy = (1 - error_prob) * 100
    print(f"  '{char}' โ†’ Q{score} โ†’ {accuracy:.2f}% accurate")

Output:

Quality string: !5?II
Numeric scores: [0, 20, 30, 40, 40]
Interpretation:
  '!' → Q0 → 0.00% accurate
  '5' → Q20 → 99.00% accurate
  '?' → Q30 → 99.90% accurate
  'I' → Q40 → 99.99% accurate
  'I' → Q40 → 99.99% accurate

Visualizing Quality Distributions

Step 3: Creating a Quality Score Histogram

Let's write a function to compute quality score distributions:

def quality_histogram(qualities, phred_offset=33):
    """
    Calculate histogram of quality scores across all bases.
    
    Parameters:
    -----------
    qualities : list of str
        List of quality strings from FASTQ
    phred_offset : int
        Phred encoding offset (33 for Phred+33, 64 for Phred+64)
        
    Returns:
    --------
    dict
        Dictionary with quality scores as keys and counts as values
    """
    from collections import Counter
    
    all_scores = []
    for qual_str in qualities:
        scores = [ord(char) - phred_offset for char in qual_str]
        all_scores.extend(scores)
    
    return Counter(all_scores)

Step 4: Visualizing with Matplotlib

import matplotlib.pyplot as plt
import numpy as np

def plot_quality_distribution(qualities, title="Quality Score Distribution"):
    """
    Plot histogram of quality scores.
    
    Parameters:
    -----------
    qualities : list of str
        List of quality strings from FASTQ
    title : str
        Plot title
    """
    # Get histogram data
    hist = quality_histogram(qualities)
    
    # Prepare data for plotting
    scores = sorted(hist.keys())
    counts = [hist[s] for s in scores]
    
    # Create plot
    plt.figure(figsize=(12, 6))
    plt.bar(scores, counts, color='steelblue', alpha=0.7, edgecolor='black')
    
    # Add reference lines for quality thresholds
    plt.axvline(x=20, color='orange', linestyle='--', linewidth=2, label='Q20 (99% accurate)')
    plt.axvline(x=30, color='green', linestyle='--', linewidth=2, label='Q30 (99.9% accurate)')
    
    plt.xlabel('Quality Score (Q)', fontsize=12)
    plt.ylabel('Number of Bases', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()

# Example usage:
# sequences, qualities = read_fastq('sample.fastq')
# plot_quality_distribution(qualities)
✅
Success

You now have a complete pipeline to read FASTQ files and visualize quality distributions!


Quality by Read Position

One of the most important QC checks is looking at how quality changes across read positions. Remember our phasing error example? Quality typically degrades toward the end of reads.

Step 5: Computing Mean Quality by Position

def quality_by_position(qualities, phred_offset=33):
    """
    Calculate mean quality score at each position along the read.
    
    Parameters:
    -----------
    qualities : list of str
        List of quality strings from FASTQ
    phred_offset : int
        Phred encoding offset
        
    Returns:
    --------
    positions : list
        Position numbers (0-indexed)
    mean_qualities : list
        Mean quality score at each position
    """
    # Find maximum read length
    max_len = max(len(q) for q in qualities)
    
    # Initialize lists to store quality scores at each position
    position_scores = [[] for _ in range(max_len)]
    
    # Collect all scores at each position
    for qual_str in qualities:
        scores = [ord(char) - phred_offset for char in qual_str]
        for pos, score in enumerate(scores):
            position_scores[pos].append(score)
    
    # Calculate mean at each position
    positions = list(range(max_len))
    mean_qualities = [np.mean(scores) if scores else 0 
                      for scores in position_scores]
    
    return positions, mean_qualities

Step 6: Plotting Quality by Position

def plot_quality_by_position(qualities, title="Quality Scores by Position"):
    """
    Plot mean quality score across read positions.
    
    Parameters:
    -----------
    qualities : list of str
        List of quality strings from FASTQ
    title : str
        Plot title
    """
    positions, mean_quals = quality_by_position(qualities)
    
    plt.figure(figsize=(14, 6))
    plt.plot(positions, mean_quals, linewidth=2, color='steelblue', marker='o', 
             markersize=3, markevery=5)
    
    # Add reference lines
    plt.axhline(y=20, color='orange', linestyle='--', linewidth=2, 
                label='Q20 threshold', alpha=0.7)
    plt.axhline(y=30, color='green', linestyle='--', linewidth=2, 
                label='Q30 threshold', alpha=0.7)
    
    plt.xlabel('Position in Read (bp)', fontsize=12)
    plt.ylabel('Mean Quality Score (Q)', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.ylim(0, max(mean_quals) + 5)
    plt.tight_layout()
    plt.show()

# Example usage:
# plot_quality_by_position(qualities, title="Quality Degradation Across Read")
📝
What to Look For

In a typical quality-by-position plot:

  • ✅ Quality starts high (Q30-40) at the beginning
  • ⚠️ Gradual decline is normal (phasing effects)
  • 🚫 Sudden drops indicate problems (adapter contamination, chemistry issues)
  • 🚫 Quality below Q20 for most of the read → consider re-sequencing

Analyzing GC Content

GC content analysis is another crucial quality control metric. Let's understand why it matters and how to analyze it.

Why Analyze GC Content?

📖
Definition: GC Content

GC content is the percentage of bases in a DNA sequence that are either Guanine (G) or Cytosine (C).

Formula: GC% = (G + C) / (A + T + G + C) × 100

Reasons to monitor GC content:

  1. Bias detection: PCR amplification can be biased toward or against GC-rich regions
  2. Contamination: Unexpected GC distribution may indicate adapter contamination or sample contamination
  3. Coverage issues: Extreme GC content (very high or low) is harder to sequence accurately
  4. Species verification: Different organisms have characteristic GC content ranges

🔬
Organism GC Content Examples
  • Humans: ~41% GC
  • E. coli: ~51% GC
  • P. falciparum (malaria parasite): ~19% GC (very AT-rich!)
  • Some bacteria: up to ~75% GC

Step 7: Calculating GC Content

def calculate_gc_content(sequence):
    """
    Calculate GC content percentage for a DNA sequence.
    
    Parameters:
    -----------
    sequence : str
        DNA sequence string
        
    Returns:
    --------
    float
        GC content as a percentage (0-100)
    """
    sequence = sequence.upper()
    gc_count = sequence.count('G') + sequence.count('C')
    total = len(sequence)
    
    if total == 0:
        return 0.0
    
    return (gc_count / total) * 100

def gc_content_per_read(sequences):
    """
    Calculate GC content for each read.
    
    Parameters:
    -----------
    sequences : list of str
        List of DNA sequences
        
    Returns:
    --------
    list of float
        GC content percentage for each read
    """
    return [calculate_gc_content(seq) for seq in sequences]

Step 8: Plotting GC Content Distribution

def plot_gc_distribution(sequences, expected_gc=None, title="GC Content Distribution"):
    """
    Plot histogram of GC content across all reads.
    
    Parameters:
    -----------
    sequences : list of str
        List of DNA sequences
    expected_gc : float, optional
        Expected GC content for the organism (will add reference line)
    title : str
        Plot title
    """
    gc_contents = gc_content_per_read(sequences)
    
    plt.figure(figsize=(12, 6))
    plt.hist(gc_contents, bins=50, color='steelblue', alpha=0.7, 
             edgecolor='black', linewidth=0.5)
    
    # Add reference line for expected GC content
    if expected_gc is not None:
        plt.axvline(x=expected_gc, color='red', linestyle='--', linewidth=2,
                   label=f'Expected GC: {expected_gc}%')
        plt.legend()
    
    # Add mean line
    mean_gc = np.mean(gc_contents)
    plt.axvline(x=mean_gc, color='green', linestyle='-', linewidth=2,
               label=f'Observed Mean: {mean_gc:.1f}%', alpha=0.7)
    
    plt.xlabel('GC Content (%)', fontsize=12)
    plt.ylabel('Number of Reads', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()

# Example usage:
# sequences, qualities = read_fastq('sample.fastq')
# plot_gc_distribution(sequences, expected_gc=41, title="Human Genome GC Content")

Step 9: GC Content by Position

Sometimes GC content varies along the read length, which can indicate:

  • Adapter sequences (usually very different GC content)
  • Random hexamer priming bias (in RNA-seq)
  • Fragmentation bias

def gc_by_position(sequences):
    """
    Calculate GC content at each position along reads.
    
    Parameters:
    -----------
    sequences : list of str
        List of DNA sequences
        
    Returns:
    --------
    positions : list
        Position numbers
    gc_percentages : list
        GC percentage at each position
    """
    max_len = max(len(seq) for seq in sequences)
    
    # Count G/C and total bases at each position
    gc_counts = [0] * max_len
    total_counts = [0] * max_len
    
    for seq in sequences:
        seq = seq.upper()
        for pos, base in enumerate(seq):
            if base in 'ATGC':
                total_counts[pos] += 1
                if base in 'GC':
                    gc_counts[pos] += 1
    
    # Calculate percentages
    positions = list(range(max_len))
    gc_percentages = [(gc_counts[i] / total_counts[i] * 100) if total_counts[i] > 0 else 0
                      for i in range(max_len)]
    
    return positions, gc_percentages

def plot_gc_by_position(sequences, expected_gc=None, 
                        title="GC Content by Position"):
    """
    Plot GC content across read positions.
    
    Parameters:
    -----------
    sequences : list of str
        List of DNA sequences
    expected_gc : float, optional
        Expected GC content percentage
    title : str
        Plot title
    """
    positions, gc_pcts = gc_by_position(sequences)
    
    plt.figure(figsize=(14, 6))
    plt.plot(positions, gc_pcts, linewidth=2, color='steelblue', 
             marker='o', markersize=3, markevery=5)
    
    if expected_gc is not None:
        plt.axhline(y=expected_gc, color='red', linestyle='--', linewidth=2,
                   label=f'Expected: {expected_gc}%', alpha=0.7)
        plt.legend()
    
    plt.xlabel('Position in Read (bp)', fontsize=12)
    plt.ylabel('GC Content (%)', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.ylim(0, 100)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

# Example usage:
# plot_gc_by_position(sequences, expected_gc=41)

⚠️
Warning Signs in GC Content

  • 🚫 Sharp peaks/valleys: May indicate adapter contamination
  • 🚫 Bimodal distribution: Possible mixed samples or contamination
  • 🚫 Spike at read ends: Adapter sequences not trimmed
  • ⚠️ Shift from expected: May indicate PCR bias or wrong reference

Putting It All Together: Complete QC Pipeline

Let's create a comprehensive quality control function:

def comprehensive_qc(fastq_file, expected_gc=None, output_prefix="qc"):
    """
    Perform comprehensive quality control on a FASTQ file.
    
    Parameters:
    -----------
    fastq_file : str
        Path to FASTQ file
    expected_gc : float, optional
        Expected GC content percentage
    output_prefix : str
        Prefix for output plot files
    """
    print("Reading FASTQ file...")
    sequences, qualities = read_fastq(fastq_file)
    
    print(f"Total reads: {len(sequences):,}")
    print(f"Mean read length: {np.mean([len(s) for s in sequences]):.1f} bp")
    
    # Calculate summary statistics
    all_quals = []
    for qual_str in qualities:
        all_quals.extend(phred33_to_q(qual_str))
    
    mean_q = np.mean(all_quals)
    median_q = np.median(all_quals)
    q20_pct = (np.sum(np.array(all_quals) >= 20) / len(all_quals)) * 100
    q30_pct = (np.sum(np.array(all_quals) >= 30) / len(all_quals)) * 100
    
    print(f"\nQuality Statistics:")
    print(f"  Mean quality: Q{mean_q:.1f}")
    print(f"  Median quality: Q{median_q:.1f}")
    print(f"  Bases โ‰ฅ Q20: {q20_pct:.2f}%")
    print(f"  Bases โ‰ฅ Q30: {q30_pct:.2f}%")
    
    gc_contents = gc_content_per_read(sequences)
    print(f"\nGC Content Statistics:")
    print(f"  Mean GC: {np.mean(gc_contents):.2f}%")
    print(f"  Median GC: {np.median(gc_contents):.2f}%")
    if expected_gc:
        print(f"  Expected GC: {expected_gc}%")
    
    # Generate plots
    print("\nGenerating QC plots...")
    plot_quality_distribution(qualities, title=f"Quality Score Distribution - {output_prefix}")
    plot_quality_by_position(qualities, title=f"Quality by Position - {output_prefix}")
    plot_gc_distribution(sequences, expected_gc=expected_gc, 
                        title=f"GC Content Distribution - {output_prefix}")
    plot_gc_by_position(sequences, expected_gc=expected_gc,
                       title=f"GC Content by Position - {output_prefix}")
    
    print("\nQC analysis complete!")
    
    # Return summary dictionary
    return {
        'n_reads': len(sequences),
        'mean_length': np.mean([len(s) for s in sequences]),
        'mean_quality': mean_q,
        'q20_percent': q20_pct,
        'q30_percent': q30_pct,
        'mean_gc': np.mean(gc_contents)
    }


FastQC

Genome Assembly

📖
Definition: Genome Assembly

Genome assembly is the computational process of reconstructing the complete genome sequence from millions of short DNA fragments (reads) produced by sequencing.

The Puzzle Analogy

Think of it like solving a jigsaw puzzle:

  • The reads = individual puzzle pieces (short DNA sequences, typically 50-300 bp)
  • The genome = complete picture (the full chromosome sequences)
  • Assembly = finding overlaps between pieces to reconstruct the whole picture

Why Is It Needed?

Sequencing technologies can only read short fragments of DNA at a time, but we need the complete genome sequence. Assembly algorithms find overlapping regions between reads and merge them into longer sequences called contigs (contiguous sequences).

💻
Example

Read 1: ATCGATTGCA
Read 2: TTGCAGGCTAA
Read 3: GGCTAATCGA

Assembled: ATCGATTGCAGGCTAATCGA

(The overlapping regions are what let us merge them.)

Two Main Approaches

  1. De novo assembly: Building the genome from scratch without a reference (like solving a puzzle without the box picture)

  2. Reference-guided assembly: Using an existing genome as a template (like having the box picture to guide you)

🔬
Fact

The human genome required years to assemble initially. Now, with better algorithms and longer reads, we can assemble genomes in days or weeks!

Assembly turns fragmented sequencing data into meaningful, complete genome sequences.

Three Laws of Genome Assembly

Genome assembly follows three fundamental principles that determine success or failure. Understanding these "laws" helps explain why some genomes are easy to assemble while others remain challenging.


Law #1: Overlaps Reveal Relationships

📖
First Law

If the suffix of read A is similar to the prefix of read B, then A and B might overlap in the genome.

What this means:

When the end of one read matches the beginning of another read, they likely came from adjacent or overlapping regions in the original DNA molecule.

💻
Example

Read A: ATCGATTGCA
Read B: ATTGCAGGCT

The suffix of A (ATTGCA) matches the prefix of B (ATTGCA) → They overlap!
Assembled: ATCGATTGCAGGCT

Important caveat: The word "might" is crucial. Just because two reads overlap doesn't guarantee they're from the same genomic locationโ€”they could be from repeated sequences!
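A minimal sketch of the suffix-prefix check (the function name is mine; real assemblers index k-mers rather than brute-forcing, and tolerate mismatches):

def suffix_prefix_overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

print(suffix_prefix_overlap("ATCGATTGCA", "ATTGCAGGCT"))  # 6 (ATTGCA)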



Law #2: Coverage Enables Assembly

📖
Second Law

More coverage means more overlaps, which means better assembly.

What this means:

Higher sequencing depth (coverage) generates more reads spanning each genomic region, creating more overlapping read pairs that can be assembled together.

The relationship:

  • Low coverage (5-10×): Sparse overlaps, many gaps, fragmented assembly
  • Medium coverage (30-50×): Good overlaps, most regions covered, decent contigs
  • High coverage (100×+): Abundant overlaps, nearly complete assembly, longer contigs

💡
Tip

More coverage is always better for assembly, but there are diminishing returns. Going from 10× to 50× makes a huge difference; going from 100× to 200× makes less of an improvement.

Why it works:

Imagine trying to assemble a sentence with only a few random words versus having many overlapping phrases: more data gives more context and connections.

💻
Coverage Example

Genome region: ATCGATCGATCG (12 bp)

5× coverage (5 reads):
ATCGAT----
--CGAT----
----ATCGAT
------TCGA
--------GATCG
Result: Some gaps, uncertain overlaps

20× coverage (20 reads):
Many more reads covering every position multiple times
Result: Clear overlaps, confident assembly


Law #3: Repeats Are The Enemy

🚫
Third Law

Repeats are bad for assembly. Very bad.

What this means:

When a DNA sequence appears multiple times in the genome (repeats), assembly algorithms cannot determine which copy a read came from, leading to ambiguous or incorrect assemblies.

Types of problematic repeats:

  • Exact repeats: Identical sequences appearing multiple times
  • Transposable elements: Mobile DNA sequences copied throughout the genome
  • Tandem repeats: Sequences repeated back-to-back (CAGCAGCAGCAG...)
  • Segmental duplications: Large blocks of duplicated DNA
💻
Why Repeats Break Assembly

Genome:
ATCG[REPEAT]GGGG...CCCC[REPEAT]TACG

Problem:
When you find a read containing "REPEAT", you don't know if it came from the first location or the second location!

Result:
Assembly breaks into multiple contigs at repeat boundaries, or worse, creates chimeric assemblies by incorrectly connecting different genomic regions.

The challenge:

If a repeat is longer than your read length, you cannot span it with a single read, making it impossible to determine the correct path through the assembly.

⚠️
Real-World Impact

The human genome is ~50% repetitive sequences! This is why:

  • Early human genome assemblies had thousands of gaps
  • Some regions remained unassembled for decades
  • Long-read sequencing (10kb+ reads) was needed to finally span repeats

Solutions to the repeat problem:

  1. Longer reads: Span the entire repeat in a single read
  2. Paired-end reads: Use insert size information to bridge repeats
  3. High coverage: May help distinguish repeat copies
  4. Reference genomes: Use a related species' genome as a guide
🔬
Fact

The final 8% of the human genome (highly repetitive centromeres and telomeres) wasn't fully assembled until 2022, nearly 20 years after the "complete" Human Genome Project, thanks to ultra-long reads from PacBio and Oxford Nanopore sequencing!


Summary: The Three Laws

✅
Remember These Three Laws
  1. Overlaps suggest adjacency – matching suffix/prefix indicates reads might be neighbors
  2. Coverage enables confidence – more reads mean more overlaps and better assembly
  3. Repeats create ambiguity – identical sequences break assembly continuity

Understanding these principles explains why genome assembly remains challenging and why different strategies (long reads, paired ends, high coverage) are needed for complex genomes.

📝
Assembly Quality Trade-offs

The three laws create a fundamental trade-off:

  • Want to resolve repeats? → Need longer reads (but more expensive)
  • Want better coverage? → Need more sequencing (costs more money/time)
  • Want perfect assembly? → May be impossible for highly repetitive genomes

Every genome assembly project must balance accuracy, completeness, and cost.

Greedy Algorithm for Genome Assembly


📖
Definition: Greedy Assembly

Greedy assembly is a simple approach that repeatedly finds and merges the two reads with the largest overlap, continuing until no more merges are possible.

How It Works

The algorithm follows these steps:

  1. Find the pair of reads with the longest overlap
  2. Merge those two reads into one longer sequence
  3. Repeat steps 1-2 until no overlaps remain (or overlaps are too small)
  4. Result is a set of contigs (assembled fragments)
💻
Simple Example

Starting reads:

  • Read A: ATCGAT
  • Read B: CGATGC
  • Read C: TGCAAA

Step 1: Best overlap is A+B (4 bp): ATCGAT + CGATGC → ATCGATGC

Step 2: Best overlap is AB+C (3 bp): ATCGATGC + TGCAAA → ATCGATGCAAA

Done! Final contig: ATCGATGCAAA
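Here's a toy implementation of the whole loop (it includes the suffix_prefix_overlap helper from the first law; brute force, so only suitable for small examples):

def suffix_prefix_overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = suffix_prefix_overlap(a, b, min_overlap)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            break  # no overlaps left: remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ATCGAT", "CGATGC", "TGCAAA"]))  # ['ATCGATGCAAA']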

Why "Greedy"?

It's called "greedy" because it always takes the best immediate option (longest overlap right now) without considering if this might prevent better assemblies later.

⚠️
Major Problem

Repeats break greedy assembly! If a sequence appears multiple times in the genome, the greedy algorithm doesn't know which copy it's assembling and can merge reads from different genome locations incorrectly.

Advantages & Disadvantages

Advantages:

  • Simple and intuitive
  • Fast for small datasets
  • Works well for genomes with few repeats

Disadvantages:

  • Fails on repetitive sequences
  • Makes locally optimal choices that may be globally wrong
  • Can create chimeric contigs (incorrectly merged sequences)
📝
Note

Modern assemblers use more sophisticated approaches (like De Bruijn graphs) that handle repeats better. Greedy assembly is rarely used alone for real genome projects.
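As a preview of the De Bruijn approach covered below, here's a sketch of the core idea only: nodes are (k-1)-mers, and each k-mer in the reads contributes one edge (function name and example are mine):

from collections import defaultdict

def de_bruijn_edges(sequence, k=3):
    """Map each k-mer's (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

print(de_bruijn_edges("ATCGATGC", k=3))
# {'AT': ['TC', 'TG'], 'TC': ['CG'], 'CG': ['GA'], 'GA': ['AT'], 'TG': ['GC']}

An assembly then corresponds to a walk through this graph that uses every edge (an Eulerian path), which is why repeats show up as tangles in the graph rather than dead ends.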

Graphs (Math)

Overlap Layout Consensus

De Bruijn Graph Assembly

License

Contributors

A big shout-out to everyone who has contributed to these notes!

  • Mahmoud - mahmoud.ninja - Creator and primary maintainer
  • Vittorio - Contributions and improvements
  • Betül Yalçın - Contributions and improvements

Want to contribute?

If you've helped improve these notes and want to be listed here, or if you'd like to contribute:

  • Submit corrections or improvements via WhatsApp, email, or GitHub PR
  • Share useful resources or examples
  • Help clarify confusing sections

Feel free to reach out at mahmoudahmedxyz@gmail.com, or message me directly through whatever channel we share, and I'll add you to the list.
