Welcome

Hello, I'm Mahmoud, and these are my notes. Since you're reading something I've written, I want to share a bit about how I approach learning and what you can expect here.

Want to know more about me? Check out my blog at mahmoud.ninja

My Philosophy

I believe there are no shortcuts to success. To me, success means respecting your time, and before investing that time, you need a plan rooted in where you want to go in life.

Learn, learn, learn, and when you think you've learned enough, write it down and share it.

About These Notes

I don't strive for perfection. Sometimes I write something and hope someone will point out where I'm wrong so we can both learn from it. That's the beauty of sharing knowledge: it's a two-way street.

I tend to be pretty chill, and I occasionally throw in some sarcasm when it feels appropriate. These are my notes after all, so please don't be annoyed if you encounter something that doesn't resonate with you, just skip ahead.

I'm creating this resource purely out of love for sharing and teaching. Ironically, I'm learning more by organizing and explaining these concepts than I ever did just studying them. Sharing is learning. Imagine if scientists never shared their research, we'd still be in the dark ages.

Everything I create here is released under Creative Commons (CC BY 4.0). You're free to share, copy, remix, and build upon this material for any purpose, even commercially, as long as you give appropriate credit.

I deeply respect intellectual property rights. I will never share copyrighted materials, proprietary resources, or content that was shared with our class under restricted access. All external resources linked here are publicly available or properly attributed.

If you notice any copyright violations or improperly shared materials, please contact me immediately at mahmoudahmedxyz@gmail.com, and I will remove the content right away and make necessary corrections.

Final Thoughts

I have tremendous respect for everyone in this learning journey. We're all here trying to understand complex topics, and we all learn differently. If these notes help you even a little bit, then this project has served its purpose.

Linux Fundamentals

The History of Linux

In 1880, the French government awarded the Volta Prize to Alexander Graham Bell. Instead of going to the Maldives (kidding...he had work to do), he put the money into laboratory research in America, a lineage that eventually grew into Bell Labs (formally founded in 1925 and named after him).

This lab researched electronics and something revolutionary called the mathematical theory of communication. In the 1950s came the transistor revolution. Bell Labs scientists won 10 Nobel Prizes...not too shabby.

But around this time, Russia made the USA nervous by launching the first satellite, Sputnik, in 1957. This had nothing to do with operating systems, it was literally just a satellite beeping in space, but it scared America enough to kickstart the space race.

President Eisenhower responded by creating ARPA (Advanced Research Projects Agency) in 1958, and asked James Killian, MIT's president, to help develop computer technology. This led to Project MAC (Mathematics and Computation) at MIT.

Before Project MAC, using a computer meant bringing a stack of punch cards with your instructions, feeding them into the machine, and waiting. During this time, no one else could use the computer, it was one job at a time.

The big goal of Project MAC was to allow multiple programmers to use the same computer simultaneously, executing different instructions at the same time. This concept was called time-sharing.

MIT and Bell Labs cooperated and developed the first operating system to support time-sharing: CTSS (Compatible Time-Sharing System). They wanted to expand this to larger mainframe computers, so they partnered with General Electric (GE), who manufactured these machines. In 1964, they developed the first real OS with time-sharing support called Multics. It also introduced the terminal as a new type of input device.

In the late 1960s, GE and Bell Labs left the project. GE's computer department was bought by Honeywell, which continued the project with MIT and created a commercial version that sold for 25 years.

In 1969, Bell Labs engineers (Dennis Ritchie and Ken Thompson) developed a new OS based on Multics. In 1970, they introduced Unics (later called Unix, the name was a sarcastic play on "Multics," implying it was simpler).

The first two versions of Unix were written in assembly language, which was then translated by an assembler and linker into machine code. The big problem with assembly was that it was tightly coupled to specific processors, meaning you'd need to rewrite Unix for each processor architecture. So Dennis Ritchie decided to create a new programming language: C.

They rebuilt Unix using C. At this time, AT&T owned Bell Labs (now it's Nokia). AT&T declared that Unix was theirs and no one else could touch it, classic monopolization.

AT&T did make one merciful agreement: universities could use Unix for educational purposes. But after AT&T was broken up into smaller companies in 1984, even this stopped. Things got worse.

One person was watching all this and decided to take action: Andrew S. Tanenbaum. In 1987, he created a new Unix-inspired OS called MINIX. It was free for universities and designed to work on Intel chips. It had some issues, occasional crashes and overheating, but this was just the beginning. This was the first time someone made a Unix-like OS outside of AT&T.

The main difference between Unix and MINIX was that MINIX was built on a microkernel architecture. Unix had a larger monolithic kernel, but MINIX separated some modules, for example, device drivers were moved from kernel space to user space.

MINIX's source code was available (it shipped with Tanenbaum's textbook), but its license restricted redistribution and modification until 2000, so it wasn't open source in the modern sense. People outside universities wanted access and wanted to contribute and modify it.

Around the same time MINIX was being developed, another person named Richard Stallman started the free software movement based on four freedoms: Freedom to run, Freedom to study, Freedom to modify, and Freedom to share. This led to the GPL license (GNU General Public License), which ensured that if you used something free, your product must also be free. They created the GNU Project, which produced many important tools like the GCC compiler, Bash shell, and more.

But there was one problem: the kernel, the beating heart of the operating system that talks to the hardware, was missing.

Let's leave the USA and cross the Atlantic Ocean. In Finland, a student named Linus Torvalds was stuck at home while his classmates vacationed in Baltim Egypt (kidding). He was frustrated with MINIX, had heard about GPL and GNU, and decided to make something new. "I know what I should do with my life," he thought. As a side hobby project in 1991, he started working on a new kernel (not based on MINIX) and sent an email to his classmates discussing it.

Linus announced Freax (maybe meant "free Unix") with a GPL license. After six months, he released another version and called it Linux. He improved the kernel and integrated many GNU Project tools. He uploaded the source code to the internet (though Git came much later, he initially used FTP). This mini-project became the most widely used OS on Earth.

The penguin mascot (Tux) came from multiple stories: Linus was supposedly bitten by a penguin at a zoo, and he also watched March of the Penguins and was inspired by how they cooperate and share to protect their eggs and each other. Cute and fitting.

...And that's the history intro.

Linux Distributions

Okay... let's install Linux. Which Linux? Wait, really? There are multiple Linuxes?

Here's the deal: the open-source part is the kernel, but different developers take it and add their own packages, libraries, and maybe create a GUI. Others add their own tweaks and features. This leads to many different versions, which we call distributions (or distros for short).

Some examples: Red Hat, Slackware, Debian.

Even distros themselves can be modified with additional features, which creates a version of a version. For example, Debian led to Ubuntu, these are called derivatives.

How many distros and derivatives exist in the world? Many. How many exactly? I said many. Anyone with a computer can create one.

So what's the main difference between these distros, so I know which one is suitable for me? The main differences fall into two categories: philosophical and technical.

One of the biggest technical differences is package management, the system that lets you install software, including the type and format of software itself.

Another difference is configuration files; their locations differ from one distro to another.

We agreed that everything is free, right? Well, you may find some paid versions like Red Hat Enterprise Linux, which charges for things like an additional layer of security, professional support, and guaranteed upgrades. Fedora is also sponsored by Red Hat and acts as a testing ground for new features before they reach Red Hat Enterprise Linux.

The philosophical part is linked to the functional part. If you're using Linux for research, there are distros with specialized software for that. Maybe you're into ethical hacking, Kali Linux is for you. If you're afraid of switching from another OS, you might like Linux Mint, which even has themes that make it look like Windows.

Okay, which one should I install now? Heh... There are a ton of options and you can install any of them, but my preference is Ubuntu.

Ubuntu is the most popular for development and data engineering. But remember, in all cases, you'll be using the terminal a lot. So install Ubuntu, maybe in dual boot, and keep Windows if possible so you don't regret it later and blame me.


The Terminal

Yes, this is what matters to us. Every distro comes with a default terminal, but you can install others if you want. Anyway, open the terminal from your apps or just press Ctrl+Alt+T.


Zoom in with Ctrl+Shift++ and zoom out with Ctrl+-.

By default, the first thing you'll see is the prompt name@host:path$: your username, then @, then the machine name, then a colon, then the current path (~ means your home directory), and finally a dollar sign $. After the $ you can type your command.

You can change the colors and other preferences, and save each set of settings as a profile.

You can even change the prompt itself, as it's just a variable (more on variables later).
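For example, in bash the prompt lives in the PS1 variable (a minimal sketch; the exact default prompt string varies by distro and config):

```shell
# Show the current prompt definition (contents vary by distro and config)
echo "$PS1"

# Temporarily change the prompt for this session:
# \u = username, \h = hostname, \w = working directory
PS1='\u@\h:\w\$ '
```

The change only lasts until you close the terminal; making it permanent is a topic for later.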

Basic Commands

First, everything is case sensitive, so be careful.

[1] echo

This command echoes whatever you write after it.

$ echo "Hello, terminal"

Output:

Hello, terminal

[2] pwd

This prints the current directory.

$ pwd

Output:

/home/mahmoudxyz

[3] cd

This is for changing the directory.

$ cd Desktop

The directory changes with no output; you can verify it with pwd.

To go back to your home directory, use:

$ cd ~

Or just:

$ cd

Note that this takes us back to /home/mahmoudxyz

To go up to the parent directory (in this case /home), even if you don't know its name, you can use:

$ cd ..
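Putting it together (a sketch using a throwaway directory so the paths are guaranteed to exist; cd - is a related trick that jumps back to wherever you were before):

```shell
# Build a small directory tree to navigate (names are just for illustration)
dir=$(mktemp -d)
mkdir -p "$dir/a/b"

cd "$dir/a/b"
cd ..          # up to the parent: $dir/a
pwd
cd -           # back to the previous directory: $dir/a/b
pwd
```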

[4] ls

This command outputs the current files and directories (folders).

First let's go to desktop again:

$ cd /home/mahmoudxyz/Desktop

Yes, you can go to a specific dir if you know its path. Note that in Linux we are using / not \ like Windows.

Now let's see what files and directories are in my Desktop:

$ ls

Output:

file1  python  testdir

Notice that in my case the terminal supports colors: the blue entries are directories and the grey (maybe black) one is a file.

But you may run into a terminal that doesn't support colors. In that case you can use:

$ ls -F

Output:

file1  python/  testdir/

Anything ending with / (like python/) is a directory; everything else (like file1) is a file.

You can see the hidden files using:

$ ls -a

Output:

.  ..  file1  python  testdir  .you-cant-see-me

We can now see .you-cant-see-me, but spotting it doesn't make us hackers: any filename starting with a dot is hidden by default, and hiding files is more about keeping things organized than about actually concealing anything.

You can also list the files in the long format using:

$ ls -l

Output:

total 8
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz    0 Nov  2 10:48 file1
drwxrwxr-x 2 mahmoudxyz mahmoudxyz 4096 Oct 16 15:20 python
drwxrwxr-x 2 mahmoudxyz mahmoudxyz 4096 Nov  1 21:45 testdir

Let's take the file1 and analyze the output:

Column         Meaning
-rw-rw-r--     File type + permissions (more on this later)
1              Number of hard links (more on this later)
mahmoudxyz     Owner name
mahmoudxyz     Group name
0              File size (bytes)
Nov 2 10:48    Last modification date & time
file1          File or directory name

We can also combine these flags/options:

$ ls -l -a -F

Output:

total 16
drwxr-xr-x  4 mahmoudxyz mahmoudxyz 4096 Nov  2 10:53 ./
drwxr-x--- 47 mahmoudxyz mahmoudxyz 4096 Nov  1 21:55 ../
-rw-rw-r--  1 mahmoudxyz mahmoudxyz    0 Nov  2 10:48 file1
drwxrwxr-x  2 mahmoudxyz mahmoudxyz 4096 Oct 16 15:20 python/
drwxrwxr-x  2 mahmoudxyz mahmoudxyz 4096 Nov  1 21:45 testdir/
-rw-rw-r--  1 mahmoudxyz mahmoudxyz    0 Nov  2 10:53 .you-cant-see-me

Or, more concisely:

$ ls -laF

The same output. The order of options doesn't matter, so ls -lFa works as well.

[5] clear

This clears your terminal. You can also use the shortcut Ctrl+L.

[6] mkdir

This makes a new directory.

$ mkdir new-dir

Then let's see the output:

$ ls -F

Output:

file1  new-dir/  python/  testdir/

[7] rmdir

This removes a directory (only if it's empty, as we'll see shortly).

$ rmdir new-dir

Then let's see the output:

$ ls -F

Output:

file1  python/  testdir/

[8] touch

This command is for creating a new file.

$ mkdir new-dir
$ cd new-dir
$ touch file1
$ ls -l

Output:

total 0
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:26 file1

You can also make more than one file with:

$ touch file2 file3
$ ls -l

Output:

total 0
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:26 file1
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:28 file2
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:28 file3

In fact, touch was originally created to modify a file's timestamp, so let's try again:

$ touch file1
$ ls -l

Output:

total 0
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:30 file1
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:28 file2
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:28 file3

What changed? The timestamp of file1. touch simply updates a file's timestamp, and if the file doesn't exist, it creates an empty one. That side effect makes it the easiest way to create a new file.
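You can verify the timestamp change yourself with stat (a sketch; this assumes GNU stat, where -c %Y prints the modification time in seconds since the epoch):

```shell
cd "$(mktemp -d)"
touch f
before=$(stat -c %Y f)   # modification time in seconds (GNU stat)
sleep 1
touch f                  # file already exists, so only the timestamp changes
after=$(stat -c %Y f)
[ "$after" -gt "$before" ] && echo "timestamp updated"
```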

[9] rm

This will remove the file.

$ rm file1
$ ls -l

Output:

total 0
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:28 file2
-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 0 Nov  2 11:28 file3

[10] echo & cat (revisited)

Yes again, but this time, it will be used to create a new file with some text inside it.

$ echo "Hello, World" > file1

To output this file we can use:

$ cat file1

Output:

Hello, World

Notes:

  • If file1 doesn't exist, it will create a new one.
  • If it does exist → it will be overwritten.

To append text instead of overwrite use >>:

$ echo "Hello, Mah" >> file1

To output this file we can use:

$ cat file1

Output:

Hello, World
Hello, Mah

[11] rm -r

Let's go back:

$ cd ..

And then let's try to remove the directory:

$ rmdir new-dir

Output:

rmdir: failed to remove 'new-dir': Directory not empty

If the directory is not empty, we can use rm, the same command we used for removing a file, but with the -r flag, which removes everything inside the directory recursively.

$ rm -r new-dir

[12] cp

This command is for copying a file.

cp source destination

(you can also rename it while copying it)

For example, let's copy the hosts file:

$ cp /etc/hosts .

The dot . means the current directory, so this copies the file from its source (/etc/hosts) to here. You can view the file's contents with cat, as before.
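A quick sketch of the rename-while-copying idea (filenames here are invented for the demo):

```shell
cd "$(mktemp -d)"
echo "127.0.0.1 localhost" > hosts-example
cp hosts-example hosts-backup    # copy and rename in one step
cat hosts-backup
```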

[13] man

man is the built-in manual for commands. It contains short descriptions of each command, its options, and what they do. It's still useful, even if nowadays you might reach for an online search or AI instead.

Try:

$ man ls

And then try:

$ man cd

No manual entry for cd. That's because cd is built into the shell itself rather than being an external program, so it's documented in the shell's own manual instead (try man bash, or help cd).


Unix Philosophy

Second System Syndrome: when a system succeeds, its successor (often the same team's more ambitious second attempt) tends to become bloated and fail. This is largely a psychological phenomenon: developers constantly compare themselves to the successful system, wanting to be like it but better, and the fear of not matching that success often causes failure. Maybe you can succeed if you don't compare yourself to it.

Another thing: when developers started building software for Unix systems, everything was chaotic and random. This led to the creation of principles to govern development, a philosophy to follow. These principles ensure that when you develop something, you follow the same Unix mentality:

  1. Small is Beautiful – Keep programs compact and focused; bloat is the enemy.
  2. Each Program Does One Thing Well – Master one task instead of being mediocre at many.
  3. Prototype as Soon as Possible – Build it, test it, break it, learn from it, fast iteration wins.
  4. Choose Portability Over Efficiency – Code that runs everywhere beats code that's blazing fast on one system.
  5. Store Data in Flat Text Files – Text is universal, readable, and easy to parse; proprietary formats lock you in.
  6. Use Software Leverage – Don't reinvent the wheel; use existing tools and combine them creatively.
  7. Use Shell Scripts to Increase Leverage and Portability – Automate tasks and glue programs together with simple scripts.
  8. Avoid Captive User Interfaces – Don't trap users in rigid menus; let them pipe, redirect, and automate.
  9. Make Every Program a Filter – Take input, transform it, produce output, programs should be composable building blocks.

These concepts all lead to one fundamental Unix principle: everything is a file. Devices, processes, sockets, treat them all as files for consistency and simplicity.
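You can see both ideas in a few lines of shell: small programs composing as filters, and a device showing up as an ordinary file (a minimal sketch; the sample words are invented):

```shell
# Filters compose: each program reads input, transforms it, writes output
printf 'banana\napple\nbanana\n' | sort | uniq -c

# Devices are files too: /dev/null is a character device you can list
# and write to like any other file
ls -l /dev/null
echo "discarded" > /dev/null
```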

Not everyone follows these principles now, and you can debate how important they are in general. But are they important for you, as a data engineer or analyst who will deal with data across different distros and different (often remote) machines? Yes, very.

Text Files

It's a bit strange that we are talking about editing text files in 2025. Really, does it matter?

Yes, it matters and it's a big topic in Linux because of what we discussed in the previous section.

There are a lot of editors on Linux like vi, nano and emacs. There is a famous debate between emacs and vim.

You can find vi in almost every distro. It has many shortcuts that are hard to memorize if you don't use it often, but cheatsheets help.

Simply put, vi is just two modes: insert mode and command mode. When you open a file you start in command mode. To start writing, enter insert mode by pressing i; press Esc to return to command mode (from there, :wq saves and quits, and :q! quits without saving).

You might wonder why vi uses keyboard letters for navigation instead of arrow keys. Simple answer: arrow keys didn't exist on keyboards when vi was created in 1976. You're the lucky generation with arrow keys, the original vi users had to make do with what they had.

nano, on the other hand, is simpler and easier to use for editing files.

Pick an editor, probably vi or nano, and start practicing with it.

Terminal vs Shell

Terminal ≠ Shell. Let's clear this up.

The shell is the thing that actually interprets your commands. It's the engine doing the work. File manipulation, running programs, printing text. That's all the shell.

The terminal is just the program that opens a window so you can talk to the shell. It's the middleman, the GUI wrapper, the pretty face.

Historical note:

This distinction mattered more when terminals were physical devices, actual hardware connected to mainframes. Today, we use terminal emulators (software), so the difference is mostly semantic. For practical purposes, just know: the shell runs your commands, the terminal displays them.
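You can see the distinction yourself (a small sketch; the exact output depends on your setup):

```shell
echo "$SHELL"        # your login shell, e.g. /bin/bash
ps -p $$ -o comm=    # the shell process actually interpreting your commands
```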

Pipes, Filters and Redirection

Standard Streams

Unix processes use I/O streams to read and write data.

Input stream sources include keyboards, terminals, devices, files, output from other processes, etc.

Unix processes have three standard streams:

  • STDIN (0) – Standard Input (data coming in from keyboard, file, etc.)
  • STDOUT (1) – Standard Output (normal output going to terminal, file, etc.)
  • STDERR (2) – Standard Error (error messages going to terminal, file, etc.)

Example: Try running cat with no arguments, it waits for input from STDIN and echoes it to STDOUT.

  • Ctrl+D – Ends the input stream with an EOF (End of File); strictly speaking this isn't a signal, the terminal simply closes the input.
  • Ctrl+C – Sends an INT (interrupt) signal, SIGINT, to the process, which usually terminates it.

Redirection

Redirection allows you to change the defaults for stdin, stdout, or stderr, sending them to different devices or files using their file descriptors.

File Descriptors

A file descriptor is a reference (or handle) used by the kernel to access a file. Every process gets its own file descriptor table.

Redirect stdin with <

Use the < operator to redirect standard input from a file:

$ wc < textfile
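A complete, runnable version, creating the input file first (wc prints the line, word, and byte counts of whatever it reads from standard input):

```shell
cd "$(mktemp -d)"
printf 'one two\nthree four five\n' > textfile
wc < textfile    # 2 lines, 5 words, 24 bytes
```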

Using Heredocs with <<

Accepts input until a specified delimiter word is reached:

$ cat << EOF
# Type multiple lines here
# Press Enter, then type EOF to end
EOF
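Heredocs combine nicely with output redirection for writing multi-line files from the shell (a sketch; the filename is invented):

```shell
cd "$(mktemp -d)"
# Everything between the two EOF markers goes into greeting.txt
cat << EOF > greeting.txt
Hello from a heredoc
Second line
EOF
cat greeting.txt
```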

Using Herestrings with <<<

Pass a string directly as input:

$ cat <<< "Hello, Linux"

Redirect stdout using > and >>

Overwrite a file with > (or explicitly with 1>):

$ who > file      # Redirect stdout to file (overwrite)
$ cat file        # View the file

Append to a file with >>:

$ whoami >> file  # Append stdout to file
$ cat file        # View the file

Redirect stderr using 2> and 2>>

Redirect error messages to a file:

$ ls /xyz 2> err  # /xyz doesn't exist, error goes to err file
$ cat err         # View the error

Combining stdout and stderr

Redirect both stdout and stderr to the same file:

# Method 1: Redirect stderr to err, then stdout to the same place
$ ls /etc /xyz 2> err 1>&2

# Method 2: Redirect stdout to err, then stderr to the same place
$ ls /etc /xyz 1> err 2>&1

# Method 3: Shorthand for redirecting both
$ ls /etc /xyz &> err

$ cat err  # View both output and errors

Ignoring Error Messages with /dev/null

The black hole of Unix, anything sent here disappears:

$ ls /xyz 2> /dev/null  # Suppress error messages

User and Group Management

It's not complicated. A user here is like on any other OS: an account with certain permissions that can perform certain operations.

There are three types of users in Linux:

Super user

The administrator that can do anything on the system. It is called root.

  • ID 0 (always)

System user

This represents software, not a real person. Some software needs certain access and permissions to perform tasks and operations, or to install things.

  • ID typically from 1 to 999

Normal user

This is us.

  • ID >= 1000

Each user has their own ID, shell, environment variables, and home directory.
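You can inspect these IDs with the id command (root is always UID 0; your own ID will typically be 1000 or higher):

```shell
id              # your UID, GID, and group memberships
id -u           # just your numeric user ID
id -u root      # prints 0: root's user ID
```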

File Ownership and Permissions

(Content to be added)


More on Navigating the Filesystem

Absolute vs Relative Paths

The root directory (/) is like "C:" in Windows, the top of the filesystem hierarchy.

Absolute path: Starts from root, always begins with /

/home/mahmoudxyz/Documents/notes.txt
/etc/passwd
/usr/bin/python3

Relative path: Starts from your current location

Documents/notes.txt          # Relative to current directory
../Desktop/file.txt          # Go up one level, then into Desktop
../../etc/hosts              # Go up two levels, then into etc

Special directory references:

  • . = current directory
  • .. = parent directory
  • ~ = your home directory
  • - = previous directory (used with cd -)

Useful Navigation Commands

ls -lh - List in long format with human-readable sizes

$ ls -lh
-rw-r--r-- 1 mahmoud mahmoud 1.5M Nov 10 14:23 data.csv
-rw-r--r-- 1 mahmoud mahmoud  12K Nov 10 14:25 notes.txt

ls -lhd - Show directory itself, not contents

$ ls -lhd /home/mahmoud
drwxr-xr-x 47 mahmoud mahmoud 4.0K Nov 10 12:00 /home/mahmoud

ls -lR - Recursive listing (all subdirectories)

$ ls -lR
./Documents:
-rw-r--r-- 1 mahmoud mahmoud 1234 Nov 10 14:23 file1.txt

./Documents/Projects:
-rw-r--r-- 1 mahmoud mahmoud 5678 Nov 10 14:25 file2.txt

tree - Visual directory tree (may need to install)

$ tree
.
├── Documents
│   ├── file1.txt
│   └── Projects
│       └── file2.txt
├── Downloads
└── Desktop

stat - Detailed file information

$ stat notes.txt
  File: notes.txt
  Size: 1234       Blocks: 8          IO Block: 4096   regular file
Device: 803h/2051d  Inode: 12345678   Links: 1
Access: 2024-11-10 14:23:45.123456789 +0100
Modify: 2024-11-10 14:23:45.123456789 +0100
Change: 2024-11-10 14:23:45.123456789 +0100

Shows: size, inode number, links, permissions, timestamps

Shell Globbing (Wildcards)

Wildcards let you match multiple files with patterns.

* - Matches any number of any characters (including none)

$ echo *                    # All files in current directory
$ echo *.txt                # All files ending with .txt
$ echo file*                # All files starting with "file"
$ echo *data*               # All files containing "data"

? - Matches exactly one character

$ echo b?at                 # Matches: boat, beat, b1at, b@at
$ echo file?.txt            # Matches: file1.txt, fileA.txt
$ echo ???                  # Matches any 3-character filename

[...] - Matches any character inside brackets

$ echo file[123].txt        # Matches: file1.txt, file2.txt, file3.txt
$ echo [a-z]*               # Files starting with lowercase letter
$ echo [A-Z]*               # Files starting with uppercase letter
$ echo *[0-9]               # Files ending with a digit

[!...] - Matches any character NOT in brackets

$ echo [!a-z]*              # Files NOT starting with lowercase letter
$ echo *[!0-9].txt          # .txt files NOT ending with a digit before extension

Practical examples:

$ ls *.jpg *.png            # All image files (jpg or png)
$ rm temp*                  # Delete all files starting with "temp"
$ cp *.txt backup/          # Copy all text files to backup folder
$ mv file[1-5].txt archive/ # Move file1.txt through file5.txt
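A runnable sandbox for these patterns (filenames invented for the demo):

```shell
cd "$(mktemp -d)"
touch file1.txt file2.txt fileA.txt notes.md

echo *.txt         # the three .txt files
echo file[12].txt  # file1.txt file2.txt
echo file?.txt     # any single character after "file"
echo *.md          # notes.md
```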

File Structure: The Three Components

Every file in Linux consists of three parts:

1. Filename

The human-readable name you see and use.

2. Data Block

The actual content stored on disk, the file's data.

3. Inode (Index Node)

Metadata about the file stored in a data structure. Contains:

  • File size
  • Owner (UID) and group (GID)
  • Permissions
  • Timestamps (access, modify, change)
  • Number of hard links
  • Pointers to data blocks on disk
  • NOT the filename (filenames are stored in directory entries)

View inode number:

$ ls -i
12345678 file1.txt
12345679 file2.txt

View detailed inode information:

$ stat file1.txt

A link is a way to reference the same file from multiple locations. Think of it like shortcuts in Windows, but with two different types.


Hard Links

Concept: Another filename pointing to the same inode and data.

It's like having two labels on the same box. Both names are equally valid, neither is "original" or "copy."

Create a hard link:

$ ln original.txt hardlink.txt

What happens:

  • Both filenames point to the same inode
  • Both have equal status (no "original")
  • Changing content via either name affects both (same data)
  • File size, permissions, content are identical (because they ARE the same file)

Check with ls -i:

$ ls -i
12345678 original.txt
12345678 hardlink.txt    # Same inode number!

What if you delete the original?

$ rm original.txt
$ cat hardlink.txt        # Still works! Data is intact

Why? The data isn't deleted until all hard links are removed. The inode keeps a link count, only when it reaches 0 does the system delete the data.

Limitations of hard links:

  • Cannot cross filesystems (different partitions/drives)
  • Cannot link to directories (to prevent circular references)
  • Both files must be on the same partition

Soft Links (Symbolic Links)

Concept: A special file that points to another filename, like a shortcut in Windows.

The soft link has its own inode, separate from the target file.

Create a soft link:

$ ln -s original.txt softlink.txt

What happens:

  • softlink.txt has a different inode
  • It contains the path to original.txt
  • Reading softlink.txt automatically redirects to original.txt

Check with ls -li:

$ ls -li
12345678 -rw-r--r-- 1 mahmoud mahmoud 100 Nov 10 14:00 original.txt
12345680 lrwxrwxrwx 1 mahmoud mahmoud  12 Nov 10 14:01 softlink.txt -> original.txt

Notice:

  • Different inode numbers
  • l at the start (link file type)
  • -> shows what it points to

What if you delete the original?

$ rm original.txt
$ cat softlink.txt        # Error: No such file or directory

The softlink still exists, but it's now a broken link (points to nothing).

Advantages of soft links:

  • Can cross filesystems (different partitions/drives)
  • Can link to directories
  • Can link to files that don't exist yet (forward reference)

Feature              Hard Link                      Soft Link
Inode                Same as original               Different (own inode)
Content              Points to data                 Points to filename
Delete original      Link still works               Link breaks
Cross filesystems    No                             Yes
Link to directories  No                             Yes
Shows target         No (looks like a normal file)  Yes (-> in ls -l)
Link count           Increases                      Doesn't affect original

When to use each:

Hard links:

  • Backup/versioning within same filesystem
  • Ensure file persists even if "original" name is deleted
  • Save space (no duplicate data)

Soft links:

  • Link across different partitions
  • Link to directories
  • Create shortcuts for convenience
  • When you want the link to break if target is moved/deleted (intentional dependency)

Practical Examples

Hard link example:

$ echo "Important data" > data.txt
$ ln data.txt backup.txt              # Create hard link
$ rm data.txt                         # "Original" deleted
$ cat backup.txt                      # Still accessible!
Important data

Soft link example:

$ ln -s /usr/bin/python3 ~/python     # Shortcut to Python
$ ~/python --version                  # Works!
Python 3.10.0
$ rm /usr/bin/python3                 # If Python is removed
$ ~/python --version                  # Link breaks
bash: ~/python: No such file or directory

Link to directory (only soft link):

$ ln -s /var/log/nginx ~/nginx-logs   # Easy access to logs
$ cd ~/nginx-logs                     # Navigate via link
$ pwd                                 # Shows the logical path (via the link)
/home/mahmoud/nginx-logs
$ pwd -P                              # -P resolves the link to the real path
/var/log/nginx

Understanding the Filesystem Hierarchy Standard

Mounting

There's no fixed link between the hierarchy of directories and their physical location on disk: a directory anywhere in the tree can live on a different partition or drive, attached via mounting.

For more details, see: Linux Foundation FHS 3.0

File Management

[1] grep

This command prints lines matching a pattern.

Let's create a file to try some examples on:

$ echo -e "root\nhello\nroot\nRoot" > file

Now let's use grep to search for the word root in this file:

$ grep root file

Output:

root
root

You can search for anything excluding the root word:

$ grep -v root file

Output:

hello
Root

You can search ignoring case:

$ grep -i root file

Output:

root
root
Root

You can also use REGEX:

$ grep -i r. file

Output:

root
root
Root

[2] less

to page through a file (an alternative to more)

  • /word – search forward for a word in the file
  • ?word – search backwards for a word in the file
  • n – go to the next occurrence of the word
  • N – go to the previous occurrence of the word
  • q – quit

[3] diff

compare files line by line
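A quick sketch (file contents invented; diff prints nothing when the files match, and shows the differing lines otherwise):

```shell
cd "$(mktemp -d)"
printf 'alpha\nbeta\n'  > f1
printf 'alpha\ngamma\n' > f2
diff f1 f2    # shows that line 2 changed: < beta / > gamma
```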

[4] file

determine file type

$ file file
file: ASCII text

[5] find and locate

search for files in a directory hierarchy
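find walks the live filesystem tree, while locate queries a prebuilt index (it may need installing, and updatedb to refresh). A small find sketch (the directory tree is invented):

```shell
cd "$(mktemp -d)"
mkdir -p logs
touch notes.txt logs/app.log

find . -name '*.txt'          # search by name, recursively from here
find . -type d                # only directories
find . -type f -name '*.log'  # only regular files matching the pattern
```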

[6] head and tail

head – output the first part of files:

$ head /usr/share/dict/words        # display the first 10 lines
$ head -n 20 /usr/share/dict/words  # display the first 20 lines

tail – output the last part of files:

$ tail /usr/share/dict/words        # display the last 10 lines
$ tail -n 20 /usr/share/dict/words  # display the last 20 lines

[7] mv

mv - move (rename) files:

  • mv file1 file2 - rename file1 to file2


[8] cp

cp - copy files and directories:

  • cp file1 file2 - copy file1 to file2

[9] tar

archive utility - bundles many files into a single archive file, commonly combined with gzip compression (.tar.gz)
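A minimal round trip, with made-up file names:

```shell
mkdir -p project
echo "hello" > project/data.txt

tar -czf project.tar.gz project      # c=create, z=gzip-compress, f=archive name
tar -tzf project.tar.gz              # t=list the archive's contents
mkdir -p restore
tar -xzf project.tar.gz -C restore   # x=extract, -C chooses the target dir
cat restore/project/data.txt
```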

[10] gzip

compress or expand files (decompress with gzip -d or gunzip)
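gzip compresses a file in place, replacing it with a .gz version (the file name is invented for the demo):

```shell
echo "some log text" > app.log
gzip app.log             # produces app.log.gz and removes the original
ls app.log.gz
gzip -d app.log.gz       # decompress (equivalent to: gunzip app.log.gz)
cat app.log
```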

[11] mount and umount

mount attaches a filesystem (a disk partition, USB drive, or network share) to a directory in the tree; umount detaches it. This is how one directory hierarchy can span multiple devices.
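You can inspect the current mount table without root; the mount/umount commands themselves need root, so they're shown as comments with example device names:

```shell
# The kernel's table of mounted filesystems:
head -n 3 /proc/mounts

# Attaching and detaching require root; device and mount-point names are examples:
# sudo mount /dev/sdb1 /mnt/usb     # attach the partition at /mnt/usb
# sudo umount /mnt/usb              # detach it again
```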

Managing Linux Processes

What is a Process?

When Linux executes a program, it:

  1. Reads the file from disk
  2. Loads it into memory
  3. Reads the instructions inside it
  4. Executes them one by one

A process is the running instance of that program. It might be visible in your GUI or running invisibly in the background.

Types of Processes

Processes can be executed from different sources:

By origin:

  • Compiled programs (C, C++, Rust, etc.)
  • Shell scripts containing commands
  • Interpreted languages (Python, Perl, etc.)

By trigger:

  • Manually executed by a user
  • Scheduled (via cron or systemd timers)
  • Triggered by events or other processes

By category:

  • System processes - Managed by the kernel
  • User processes - Started by users (manually, scheduled, or remotely)

The Process Hierarchy

Every Linux system starts with a parent process that spawns all other processes. This is either:

  • init or sysvinit (older systems)
  • systemd (modern systems)

The first process gets PID 1 (Process ID 1), even though it's technically branched from the kernel itself (PID 0, which you never see directly).

From PID 1, all other processes branch out in a tree structure. Every process has:

  • PID (Process ID) - Its own unique identifier
  • PPID (Parent Process ID) - The ID of the process that started it
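You can see this pair for your own shell: bash automatically sets $$ to the shell's PID and $PPID to its parent's:

```shell
echo "This shell's PID:  $$"
echo "Its parent (PPID): $PPID"
```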

Viewing Processes

[1] ps - Process Snapshot

Basic usage - current terminal only:

$ ps

Output:

    PID TTY          TIME CMD
  14829 pts/1    00:00:00 bash
  14838 pts/1    00:00:00 ps

This shows only processes running in your current terminal session for your user.

All users' processes (those attached to a terminal):

$ ps -a

Output:

    PID TTY          TIME CMD
   2955 tty2     00:00:00 gnome-session-b
  14971 pts/1    00:00:00 ps

All processes in the system:

$ ps -e

Output:

    PID TTY          TIME CMD
      1 ?        00:00:00 systemd
      2 ?        00:00:00 kthreadd
      3 ?        00:00:00 rcu_gp
    ... (hundreds more)

Note: The ? in the TTY column means the process was started by the kernel and has no controlling terminal.

Detailed process information:

$ ps -l

Output:

F S   UID     PID    PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 S  1000   14829   14821  0  80   0 -  2865 do_wai pts/1    00:00:00 bash
4 R  1000   15702   14829  0  80   0 -  3445 -      pts/1    00:00:00 ps

Here you can see the PPID (parent process ID). Notice that ps has bash as its parent (the PPID of ps matches the PID of bash).

Most commonly used:

$ ps -efl

This shows all processes with full details - PID, PPID, user, CPU time, memory, and command.

Understanding Daemons

Any system process running in the background typically ends with d (named after "daemon"). Examples:

  • systemd - System and service manager
  • sshd - SSH server
  • httpd or nginx - Web servers
  • crond - Job scheduler

Daemons are like Windows services - processes that run in the background, whether they're system or user processes.


[2] pstree - Process Tree Visualization

See the hierarchy of all running processes:

$ pstree

Output:

systemd─┬─ModemManager───3*[{ModemManager}]
        ├─NetworkManager───3*[{NetworkManager}]
        ├─accounts-daemon───3*[{accounts-daemon}]
        ├─avahi-daemon───avahi-daemon
        ├─bluetoothd
        ├─colord───3*[{colord}]
        ├─containerd───15*[{containerd}]
        ├─cron
        ├─cups-browsed───3*[{cups-browsed}]
        ├─cupsd───5*[dbus]
        ├─dbus-daemon
        ├─dockerd───19*[{dockerd}]
        ├─fwupd───5*[{fwupd}]
        ... (continues)

What you're seeing:

  • systemd is the parent process (PID 1)
  • Everything else branches from it
  • Multiple processes run in parallel
  • Some processes spawn their own children (like dockerd with 19 threads)

This visualization makes it easy to understand process relationships.


[3] top - Live Process Monitor

Unlike ps (which shows a snapshot), top shows real-time process information:

$ top

You'll see:

  • Processes sorted by CPU usage (by default)
  • Live updates of CPU and memory consumption
  • System load averages
  • Running vs sleeping processes

Press q to quit.

Useful top commands while running:

  • k - Kill a process (prompts for PID)
  • M - Sort by memory usage
  • P - Sort by CPU usage
  • 1 - Show individual CPU cores
  • h - Help

[4] htop - Better Process Monitor

htop is like top but modern, colorful, and more interactive.

Installation (if not already installed):

$ which htop   # Check if installed
$ sudo apt install htop   # Install if needed

Run it:

$ htop

Features:

  • Color-coded display
  • Mouse support (click to select processes)
  • Easy process filtering and searching
  • Visual CPU and memory bars
  • Tree view of process hierarchy
  • Built-in kill/nice/priority management

Navigation:

  • Arrow keys to move
  • F3 - Search for a process
  • F4 - Filter by name
  • F5 - Tree view
  • F9 - Kill a process
  • F10 or q - Quit

Foreground vs Background Processes

Sometimes you only have one terminal and want to run multiple long-running tasks. Background processes let you do this.

Foreground Processes (Default)

When you run a command normally, it runs in the foreground and blocks your terminal:

$ sleep 10

Your terminal is blocked for 10 seconds. You can't type anything until it finishes.

Background Processes

Add & at the end to run in the background:

$ sleep 10 &

Output:

[1] 12345

The terminal is immediately available. The numbers show [job_number] PID.

Managing Jobs

View running jobs:

$ jobs

Output:

[1]+  Running                 sleep 10 &

Bring a background job to foreground:

$ fg

If you have multiple jobs:

$ fg %1   # Bring job 1 to foreground
$ fg %2   # Bring job 2 to foreground

Send current foreground process to background:

  1. Press Ctrl+Z (suspends the process)
  2. Type bg (resumes it in background)

Example:

$ sleep 25
^Z
[1]+  Stopped                 sleep 25

$ bg
[1]+ sleep 25 &

$ jobs
[1]+  Running                 sleep 25 &

Stopping Processes

Process Signals

The kill command doesn't just "kill" - it sends signals to processes. The process decides how to respond.

Common signals:

Signal    Number   Meaning                          Process Can Ignore?
SIGHUP    1        Hang up (terminal closed)        Yes
SIGINT    2        Interrupt (Ctrl+C)               Yes
SIGTERM   15       Terminate gracefully (default)   Yes
SIGKILL   9        Kill immediately                 NO
SIGSTOP   19       Stop/pause process               NO
SIGCONT   18       Continue stopped process         NO
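You can watch a catchable signal being handled: trap installs a handler, and the process survives SIGTERM (SIGKILL can't be trapped this way):

```shell
bash -c '
  trap "echo caught SIGTERM" TERM   # install a handler for SIGTERM
  kill -TERM $$                     # send the signal to ourselves
  echo "still running"              # we survived because we handled it
'
```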

Using kill

Syntax:

$ kill -SIGNAL PID

Example - find a process:

$ ps
    PID TTY          TIME CMD
  14829 pts/1    00:00:00 bash
  17584 pts/1    00:00:00 sleep
  18865 pts/1    00:00:00 ps

Try graceful termination first (SIGTERM):

$ kill -SIGTERM 17584

Or use the number:

$ kill -15 17584

Or just use default (SIGTERM is default):

$ kill 17584

If the process ignores SIGTERM, force kill (SIGKILL):

$ kill -SIGKILL 17584

Or:

$ kill -9 17584

Verify it's gone:

$ ps
    PID TTY          TIME CMD
  14829 pts/1    00:00:00 bash
  19085 pts/1    00:00:00 ps
[2]+  Killed                  sleep 10

Why SIGTERM vs SIGKILL?

SIGTERM (15) - Graceful shutdown:

  • Process can clean up (save files, close connections)
  • Child processes are also terminated properly
  • Always try this first

SIGKILL (9) - Immediate death:

  • Process cannot ignore or handle this signal
  • No cleanup happens
  • Can create zombie processes if parent doesn't reap children
  • Can cause memory leaks or corrupted files
  • Use only as last resort

Zombie Processes

A zombie is a dead process that hasn't been cleaned up by its parent.

What happens:

  1. Process finishes execution
  2. Kernel marks it as terminated
  3. Parent should read the exit status (called "reaping")
  4. If parent doesn't reap it, it becomes a zombie

Identifying zombies:

$ ps aux | awk '$8 ~ /^Z/'

The state column (STAT, the 8th field) shows Z for zombies. A plain grep Z would also match any line containing a capital Z anywhere.

Fixing zombies:

  • Kill the parent process (zombies are already dead)
  • When the parent dies, its zombies are reparented to init/systemd, which reaps them
  • Or wait - some zombies disappear when the parent finally checks on them

killall - Kill by Name

Instead of finding PIDs, kill all processes with a specific name:

$ killall sleep

This kills ALL processes named sleep, regardless of their PID.

With signals:

$ killall -SIGTERM firefox
$ killall -9 chrome   # Force kill all Chrome processes

Warning: Be careful with killall - it affects all matching processes, even ones you might not want to kill.


Managing Services with systemctl

Modern Linux systems use systemd to manage services (daemons). The systemctl command controls them.

Service Status

Check if a service is running:

$ systemctl status ssh

Output shows:

  • Active/inactive status
  • PID of the main process
  • Recent log entries
  • Memory and CPU usage

Starting and Stopping Services

Start a service:

$ sudo systemctl start nginx

Stop a service:

$ sudo systemctl stop nginx

Restart a service (stop then start):

$ sudo systemctl restart nginx

Reload configuration without restarting:

$ sudo systemctl reload nginx

Enable/Disable Services at Boot

Enable a service to start automatically at boot:

$ sudo systemctl enable ssh

Disable a service from starting at boot:

$ sudo systemctl disable ssh

Enable AND start immediately:

$ sudo systemctl enable --now nginx

Listing Services

List all running services:

$ systemctl list-units --type=service --state=running

List all services (running or not):

$ systemctl list-units --type=service --all

List enabled services:

$ systemctl list-unit-files --type=service --state=enabled

Viewing Logs

See logs for a specific service:

$ journalctl -u nginx

Follow logs in real-time:

$ journalctl -u nginx -f

See only recent logs:

$ journalctl -u nginx --since "10 minutes ago"

Practical Examples

Example 1: Finding and Killing a Hung Process

# Find the process
$ ps aux | grep firefox

# Kill it gracefully
$ kill 12345

# Wait a few seconds, check if still there
$ ps aux | grep firefox

# Force kill if necessary
$ kill -9 12345

Example 2: Running a Long Script in Background

# Start a long-running analysis
$ python analyze_genome.py &

# Check it's running
$ jobs

# Do other work...

# Bring it back to see output
$ fg

Example 3: Checking System Load

# See what's consuming resources
$ htop

# Or check load average
$ uptime

# Or see top CPU processes
$ ps aux --sort=-%cpu | head

Example 4: Restarting a Web Server

# Check status
$ systemctl status nginx

# Restart it
$ sudo systemctl restart nginx

# Check logs if something went wrong
$ journalctl -u nginx -n 50

Summary: Process Management Commands

Command             Purpose
ps                  Snapshot of processes
ps -efl             All processes with details
pstree              Process hierarchy tree
top                 Real-time process monitor
htop                Better real-time monitor
jobs                List background jobs
fg                  Bring job to foreground
bg                  Continue job in background
command &           Run command in background
Ctrl+Z              Suspend current process
kill PID            Send SIGTERM to process
kill -9 PID         Force kill process
killall name        Kill all processes by name
systemctl status    Check service status
systemctl start     Start a service
systemctl stop      Stop a service
systemctl restart   Restart a service
systemctl enable    Enable at boot

Shell Scripts (Bash Scripting)

A shell script is simply a collection of commands written in a text file. That's it. Nothing magical.

The original name was "shell script," but when GNU created bash (Bourne Again SHell), the term "bash script" became common.

Why Shell Scripts Matter

1. Automation
If you're typing the same commands repeatedly, write them once in a script.

2. Portability
Scripts work across different Linux machines and distributions (mostly).

3. Scheduling
Combine scripts with cron jobs to run tasks automatically.

4. DRY Principle
Don't Repeat Yourself - write once, run many times.

Important: Nothing new here. Everything you've already learned about Linux commands applies. Shell scripts just let you organize and automate them.


Creating Your First Script

Create a file called first-script.sh:

$ nano first-script.sh

Write some commands:

echo "Hello, World"

Note: The .sh extension doesn't technically matter in Linux (unlike Windows), but it's convention. Use it so humans know it's a shell script.


Making Scripts Executable

Check the current permissions:

$ ls -l first-script.sh

Output:

-rw-rw-r-- 1 mahmoudxyz mahmoudxyz 21 Nov  6 07:21 first-script.sh

Notice: No x (execute) permission. The file isn't executable yet.

Adding Execute Permission

$ chmod +x first-script.sh

Permission options:

  • u+x - Execute for user (owner) only
  • g+x - Execute for group only
  • o+x - Execute for others only
  • a+x or just +x - Execute for all (user, group, others)

Check permissions again:

$ ls -l first-script.sh

Output:

-rwxrwxr-x 1 mahmoudxyz mahmoudxyz 21 Nov  6 07:21 first-script.sh

Now we have x for user, group, and others.


Running Shell Scripts

There are two main ways to execute a script:

Method 1: Specify the Shell

$ sh first-script.sh

Or:

$ bash first-script.sh

This explicitly tells which shell to use.

Method 2: Direct Execution

$ ./first-script.sh

Why the ./ ?

Let's try without it:

$ first-script.sh

You'll get an error:

first-script.sh: command not found

Why? When you type a command without a path, the shell searches through directories listed in $PATH looking for that command. Your current directory (.) is usually NOT in $PATH for security reasons.

The ./ explicitly says: "Run the script in the current directory (.), don't search $PATH."

You could do this:

$ PATH=.:$PATH

Now first-script.sh would work without ./, but DON'T DO THIS. It's a security risk - you might accidentally execute malicious scripts in your current directory.

Best practices:

  1. Use ./script.sh for local scripts
  2. Put system-wide scripts in /usr/local/bin (which IS in $PATH)

The Shebang Line

Problem: How does the system know which interpreter to use for your script? Bash? Zsh? Python?

Solution: The shebang (#!) on the first line.

Basic Shebang

#!/bin/bash
echo "Hello, World"

What this means:
"Execute this script using /bin/bash"

When you run ./first-script.sh, the system:

  1. Reads the first line
  2. Sees #!/bin/bash
  3. Runs /bin/bash first-script.sh

Shebang with Other Languages

You can use shebang for any interpreted language:

#!/usr/bin/python3
print("Hello, World")

Now this file runs as a Python script!

The Portable Shebang

Problem: What if bash isn't at /bin/bash? What if python3 is at /usr/local/bin/python3 instead of /usr/bin/python3?

Solution: Use env to find the interpreter:

#!/usr/bin/env bash
echo "Hello, World"

Or for Python:

#!/usr/bin/env python3
print("Hello, World")

How it works:
env searches through $PATH to find the command. The shebang becomes: "Please find (env) where bash is located and execute this script with it."

Why env is better:

  • More portable across systems
  • Finds interpreters wherever they're installed
  • env itself is almost always at /usr/bin/env

Basic Shell Syntax

Command Separators

Semicolon (;) - Run commands sequentially:

$ echo "Hello" ; ls

This runs echo, then runs ls (regardless of whether echo succeeded).

AND (&&) - Run second command only if first succeeds:

$ echo "Hello" && ls

If echo succeeds (exit code 0), then run ls. If it fails, stop.

OR (||) - Run second command only if first fails:

$ false || ls

If false fails (exit code non-zero), then run ls. If it succeeds, stop.

Practical example:

$ cd /some/directory && echo "Changed directory successfully"

Only prints the message if cd succeeded.

$ cd /some/directory || echo "Failed to change directory"

Only prints the message if cd failed.


Variables

Variables store data that you can use throughout your script.

Declaring Variables

#!/bin/bash

# Integer variable
declare -i sum=16

# String variable
declare name="Mahmoud"

# Constant (read-only)
declare -r PI=3.14

# Array
declare -a names=()
names[0]="Alice"
names[1]="Bob"
names[2]="Charlie"

Key points:

  • declare -i = integer type
  • declare -r = read-only (constant)
  • declare -a = array
  • You can also just use sum=16 without declare (it works, but less explicit)

Using Variables

Access variables with $:

echo $sum          # Prints: 16
echo $name         # Prints: Mahmoud
echo $PI           # Prints: 3.14

For arrays and complex expressions, use ${}:

echo ${names[0]}   # Prints: Alice
echo ${names[1]}   # Prints: Bob
echo ${names[2]}   # Prints: Charlie

Why ${} matters:

echo "$nameTest"   # Looks for variable called "nameTest" (doesn't exist)
echo "${name}Test" # Prints: MahmoudTest (correct!)

Important Script Options

set -e

What it does: Exit script immediately if any command fails (non-zero exit code).

Why it matters: Prevents cascading errors. If step 1 fails, don't continue to step 2.

Example without set -e:

cd /nonexistent/directory
rm -rf *  # DANGER! This still runs even though cd failed

Example with set -e:

set -e
cd /nonexistent/directory  # Script stops here if this fails
rm -rf *                   # Never executes

Exit Codes

Every command returns an exit code:

  • 0 = Success
  • Non-zero = Failure (different numbers mean different errors)

Check the last command's exit code:

$ true
$ echo $?   # Prints: 0

$ false
$ echo $?   # Prints: 1

In scripts, explicitly exit with a code:

#!/bin/bash
echo "Script completed successfully"
exit 0  # Return 0 (success) to the calling process

Arithmetic Operations

There are multiple ways to do math in bash. Pick one and stick with it for consistency.

Method 1: Arithmetic Expansion $(( ))

#!/bin/bash

num=4
echo $((num * 5))      # Prints: 20
echo $((num + 10))     # Prints: 14
echo $((num ** 2))     # Prints: 16 (exponentiation)

Operators:

  • + addition
  • - subtraction
  • * multiplication
  • / integer division
  • % modulo (remainder)
  • ** exponentiation

Pros: Built into bash, fast, clean syntax
Cons: Integer-only (no decimals)

Method 2: expr

#!/bin/bash

num=4
expr $num + 6      # Prints: 10
expr $num \* 5     # Prints: 20 (note the backslash before *)

Pros: Traditional, works in older shells
Cons: Awkward syntax, needs escaping for *

Method 3: bc (For Floating Point)

#!/bin/bash

echo "4.5 + 2.3" | bc       # Prints: 6.8
echo "10 / 3" | bc -l       # Prints: 3.33333... (-l for decimals)
echo "scale=2; 10/3" | bc   # Prints: 3.33 (2 decimal places)

Pros: Supports floating-point arithmetic
Cons: External program (slower), more complex

My recommendation: Use $(( )) for most cases. Use bc when you need decimals.


Logical Operations and Conditionals

Exit Code Testing

#!/bin/bash

true ; echo $?    # Prints: 0
false ; echo $?   # Prints: 1

Logical Operators

true && echo "True"     # Prints: True (because true succeeds)
false || echo "False"   # Prints: False (because false fails)

Comparison Operators

There are TWO syntaxes for comparisons in bash. Stick to one.

Option 1: [[ ]] (Modern)

For integers:

[[ 1 -le 2 ]]  # Less than or equal
[[ 3 -ge 2 ]]  # Greater than or equal
[[ 5 -lt 10 ]] # Less than
[[ 8 -gt 4 ]]  # Greater than
[[ 5 -eq 5 ]]  # Equal
[[ 5 -ne 3 ]]  # Not equal

For strings and mixed:

[[ 3 == 3 ]]   # Equal
[[ 3 != 4 ]]   # Not equal
[[ 5 > 3 ]]    # Greater than (lexicographic for strings)
[[ 2 < 9 ]]    # Less than (lexicographic for strings)

Testing the result:

[[ 3 == 3 ]] ; echo $?   # Prints: 0 (true)
[[ 3 != 3 ]] ; echo $?   # Prints: 1 (false)
[[ 5 > 3 ]] ; echo $?    # Prints: 0 (true)

Option 2: test Command (Traditional)

test 1 -le 5 ; echo $?   # Prints: 0 (true)
test 10 -lt 5 ; echo $?  # Prints: 1 (false)

test is equivalent to [ ] (note: single brackets):

[ 1 -le 5 ] ; echo $?    # Same as test

My recommendation: Use [[ ]] (double brackets). It's more powerful and less error-prone than [ ] or test.

File Test Operators

Check file properties:

test -f /etc/hosts ; echo $?     # Does file exist? (0 = yes)
test -d /home ; echo $?           # Is it a directory? (0 = yes)
test -r /etc/shadow ; echo $?    # Do I have read permission? (1 = no)
test -w /tmp ; echo $?            # Do I have write permission? (0 = yes)
test -x /usr/bin/ls ; echo $?    # Is it executable? (0 = yes)

Common file tests:

  • -f file exists and is a regular file
  • -d directory exists
  • -e exists (any type)
  • -r readable
  • -w writable
  • -x executable
  • -s file exists and is not empty

Using [[ ]] syntax:

[[ -f /etc/hosts ]] && echo "File exists"
[[ -r /etc/shadow ]] || echo "Cannot read this file"

Positional Parameters (Command-Line Arguments)

When you run a script with arguments, bash provides special variables to access them.

Special Variables

#!/bin/bash

# $0 - Name of the script itself
# $# - Number of command-line arguments
# $* - All arguments as a single string
# $@ - All arguments as separate strings (array-like)
# $1 - First argument
# $2 - Second argument
# $3 - Third argument
# ... and so on

Example Script

#!/bin/bash

echo "Script name: $0"
echo "Total number of arguments: $#"
echo "All arguments: $*"
echo "First argument: $1"
echo "Second argument: $2"

Running it:

$ ./script.sh hello world 123

Output:

Script name: ./script.sh
Total number of arguments: 3
All arguments: hello world 123
First argument: hello
Second argument: world

$* vs $@

$* - Treats all arguments as a single string:

for arg in "$*"; do
    echo $arg
done
# Output: hello world 123 (all as one)

$@ - Treats arguments as separate items:

for arg in "$@"; do
    echo $arg
done
# Output:
# hello
# world
# 123

Recommendation: Use "$@" when looping through arguments.


Functions

Functions let you organize code into reusable blocks.

Basic Function

#!/bin/bash

Hello() {
    echo "Hello Functions!"
}

Hello  # Call the function

Alternative syntax:

function Hello() {
    echo "Hello Functions!"
}

Both work the same. Pick one style and be consistent.

Functions with Return Values

#!/bin/bash

function Hello() {
    echo "Hello Functions!"
    return 0  # Success
}

function GetTimestamp() {
    echo "The time now is $(date +%m/%d/%y' '%R)"
    return 0
}

Hello
echo "Exit code: $?"  # Prints: 0

GetTimestamp

Important: return only returns exit codes (0-255), NOT values like other languages.

To return a value, use echo:

function Add() {
    local result=$(($1 + $2))
    echo $result  # "Return" the value via stdout
}

sum=$(Add 5 3)  # Capture the output
echo "Sum: $sum"  # Prints: Sum: 8

Function Arguments

Functions can take arguments like scripts:

#!/bin/bash

Greet() {
    echo "Hello, $1!"  # $1 is first argument to function
}

Greet "Mahmoud"  # Prints: Hello, Mahmoud!
Greet "World"    # Prints: Hello, World!

Reading User Input

Basic read Command

#!/bin/bash

echo "What is your name?"
read name
echo "Hello, $name!"

How it works:

  1. Script displays prompt
  2. Waits for user to type and press Enter
  3. Stores input in variable name

read with Inline Prompt

#!/bin/bash

read -p "What is your name? " name
echo "Hello, $name!"

-p flag: Display prompt on same line as input

Reading Multiple Variables

#!/bin/bash

read -p "Enter your first and last name: " first last
echo "Hello, $first $last!"

Input: Mahmoud Xyz
Output: Hello, Mahmoud Xyz!

Reading Passwords (Securely)

#!/bin/bash

read -sp "Enter your password: " password
echo ""  # New line after hidden input
echo "Password received (length: ${#password})"

-s flag: Silent mode - doesn't display what user types
-p flag: Inline prompt

Security note: This hides the password from screen, but it's still in memory as plain text. For real password handling, use dedicated tools.

Reading from Files

#!/bin/bash

while IFS= read -r line; do   # -r keeps backslashes literal; IFS= keeps leading spaces
    echo "Line: $line"
done < /etc/passwd

Reads /etc/passwd line by line.


Best Practices

  1. Always use shebang: #!/usr/bin/env bash
  2. Use set -e: Stop on errors
  3. Use set -u: Stop on undefined variables
  4. Use set -o pipefail: Catch errors in pipes
  5. Quote variables: Use "$var" not $var (prevents word splitting)
  6. Check return codes: Test if commands succeeded
  7. Add comments: Explain non-obvious logic
  8. Use functions: Break complex scripts into smaller pieces
  9. Test thoroughly: Run scripts in safe environment first

The Holy Trinity of Safety

#!/usr/bin/env bash
set -euo pipefail

  • -e exit on error
  • -u exit on undefined variable
  • -o pipefail exit on pipe failures
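Putting these practices together, a minimal skeleton might look like this (the function and variable names are just illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail   # the safety options described above

# Quote variables and keep them local to functions.
greet() {
    local name="$1"
    echo "Hello, ${name}!"
}

main() {
    # ${1:-World} falls back to "World" if no argument was given.
    greet "${1:-World}"
}

main "$@"
```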

About Course Materials

These notes contain NO copied course materials. Everything here is my personal understanding and recitation of concepts, synthesized from publicly available resources (bash documentation, shell scripting tutorials, Linux guides).

This is my academic work, how I've processed and reorganized information from legitimate sources. I take full responsibility for any errors in my understanding.

If you believe any content violates copyright, contact me at mahmoudahmedxyz@gmail.com and I'll remove it immediately.

References

[1] Ahmed Sami (Architect @ Microsoft).
Linux for Data Engineers (Arabic – Egyptian Dialect), 11h 30m.
YouTube

Python


💡
Philosophy

I don't like cheat sheets. What we really need is daily problem-solving. Read other people's code, understand how they think - this is the only real way to improve.

This is a quick overview combined with practice problems. Things might appear in a reversed order sometimes - we'll introduce concepts by solving problems and covering tools as needed.

ℹ️
Need Help?

If you need help setting up something, write me.


Resources

Free Books:

If you want to buy:


Your First Program


print("Hello, World!")
⚠️
Everything is Case Sensitive

print() works. Print() does not!

The print() Function

Optional arguments: sep and end

sep (separator) - what goes between values:

print("A", "B", "C")              # A B C (default: space)
print("A", "B", "C", sep="-")     # A-B-C
print(1, 2, 3, sep=" | ")         # 1 | 2 | 3

end - what prints after the line:

print("Hello")
print("World")
# Output:
# Hello
# World

print("Hello", end=" ")
print("World")
# Output: Hello World

Escape Characters

📝
Common Escape Characters

\n → New line
\t → Tab
\\ → Backslash
\' → Single quote
\" → Double quote

Practice

💻
Exercise 1

Print a box of asterisks (4 rows, 19 asterisks each)

💻
Exercise 2

Print a hollow box (asterisks on edges, spaces inside)

💻
Exercise 3

Print a triangle pattern starting with one asterisk


Variables and Assignment

A variable stores a value in memory so you can use it later.

x = 7
y = 3
total = x + y
print(total)  # 11


⚠️
Assignment vs Equality

The = sign is for assignment, not mathematical equality. You're telling Python to store the right side value in the left side variable.

Multiple assignment:

x, y, z = 1, 2, 3

Variable Naming Rules

  • Must start with letter or underscore
  • Can contain letters, numbers, underscores
  • Cannot start with number
  • Cannot contain spaces
  • Cannot use Python keywords (for, if, class, etc.)
  • Case sensitive: age, Age, AGE are different

Assignment Operators

📝
Shortcuts

x += 3 → Same as x = x + 3
x -= 2 → Same as x = x - 2
x *= 4 → Same as x = x * 4
x /= 2 → Same as x = x / 2
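For example:

```python
x = 10
x += 3    # x is now 13
x -= 2    # 11
x *= 4    # 44
x /= 2    # 22.0 -- division always produces a float
print(x)  # 22.0
```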


Reading Input

name = input("What's your name? ")
print(f"Hello, {name}!")
⚠️
Important

input() always returns a string! Even if the user types 42, you get "42".

Converting input:

age = int(input("How old are you? "))
price = float(input("Enter price: $"))

Practice

💻
Exercise 1

Ask for a number, print its square in a complete sentence ending with a period (use sep)

💻
Exercise 2

Compute: (512 - 282) / (47 × 48 + 5)

💻
Exercise 3

Convert kilograms to pounds (2.2 pounds per kilogram)


Basic Data Types

Strings

Text inside quotes:

name = "Mahmoud"
message = 'Hello'

Can use single or double quotes. Strings can contain letters, numbers, spaces, symbols.

Numbers

  • int → Whole numbers: 7, 0, -100
  • float → Decimals: 3.14, 0.5, -2.7

Boolean

True or false values:

print(5 > 3)        # True
print(2 == 10)      # False
print("a" in "cat") # True

Logical Operators

📝
Operators

and → Both must be true
or → At least one must be true
not → Reverses the boolean
== → Equal to
!= → Not equal to
>, <, >=, <= → Comparisons
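Trying these on a small DNA string:

```python
seq = "GATC"
print("A" in seq and "T" in seq)   # True  (both letters present)
print("U" in seq or "T" in seq)    # True  (T is present)
print(not ("U" in seq))            # True  (no U present)
print(len(seq) >= 4)               # True
```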

Practice

💻
DNA Validation Exercises

Read a DNA sequence and check:
1. Contains BOTH "A" AND "T"
2. Contains "U" OR "T"
3. Is pure RNA (no "T")
4. Is empty or only whitespace
5. Is valid DNA (only A, T, G, C)
6. Contains "A" OR "G" but NOT both
7. Contains any stop codon ("TAA", "TAG", "TGA")

Type Checking and Casting

print(type("hello"))  # <class 'str'>
print(type(10))       # <class 'int'>
print(type(3.5))      # <class 'float'>
print(type(True))     # <class 'bool'>

Type casting:

int("10")      # 10
float(5)       # 5.0
str(3.14)      # "3.14"
bool(0)        # False
bool(5)        # True
list("hi")     # ['h', 'i']
⚠️
Invalid Casts

int("hello") and float("abc") will cause errors!


Sequences


Strings

Strings are sequences of characters.

Indexing

Indexes start from 0:


name = "Python"
print(name[0])   # P
print(name[3])   # h
⚠️
Strings Are Immutable

You cannot change characters directly: name[0] = "J" causes an error!
But you can reassign the whole string: name = "Java"

String Operations

# Concatenation
"Hello" + " " + "World"  # "Hello World"

# Multiplication
"ha" * 3                 # "hahaha"

# Length
len("Python")            # 6

# Methods
text = "hello"
text.upper()             # "HELLO"
text.replace("h", "j")   # "jello"

Common String Methods

📝
Useful Methods

.upper(), .lower(), .capitalize(), .title()
.strip(), .lstrip(), .rstrip()
.replace(old, new), .split(sep), .join(list)
.find(sub), .count(sub)
.startswith(), .endswith()
.isalpha(), .isdigit(), .isalnum()
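A few of these in action:

```python
s = "  hello world  "
print(s.strip())                  # "hello world"
print(s.strip().title())          # "Hello World"
print("a,b,c".split(","))         # ['a', 'b', 'c']
print("-".join(["x", "y", "z"]))  # "x-y-z"
print("banana".count("an"))       # 2 (non-overlapping matches)
print("ATGAAA".startswith("ATG")) # True
```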

Practice

💻
Exercise 1

Convert DNA → RNA only if T exists (don't use if)

💻
Exercise 2

Check if DNA starts with "ATG" AND ends with "TAA"

💻
Exercise 3

Read text and print the last character


Lists

Lists can contain different types and are mutable (changeable).

numbers = [1, 2, 3]
mixed = [1, "hello", True]

List Operations

# Accessing
colors = ["red", "blue", "green"]
print(colors[1])  # "blue"

# Modifying (lists ARE mutable!)
colors[1] = "yellow"

# Adding
colors.append("black")          # Add at end
colors.insert(1, "white")       # Add at position

# Removing
del colors[1]                   # Remove by index
value = colors.pop()            # Remove last
colors.remove("red")            # Remove by value

# Sorting
numbers = [3, 1, 2]
numbers.sort()                  # In place - modifies the list
sorted(numbers)                 # Returns a new sorted list

# Other operations
numbers.reverse()               # Reverse in place
len(numbers)                    # Length

Practice

💻
Exercise 1

Print the middle element of a list

💻
Exercise 2

Mutate RNA: ["A", "U", "G", "C", "U", "A"]
- Change first "A" to "G"
- Change last "A" to "C"

💻
Exercise 3

Swap first and last codon in: ["A","U","G","C","G","A","U","U","G"]

💻
Exercise 4

Create complementary DNA: A↔T, G↔C for ["A","T","G","C"]


Slicing

Extract portions of sequences: [start:stop:step]


⚠️
Stop is Excluded

[0:3] gives indices 0, 1, 2 (NOT 3)

Basic Slicing

numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

numbers[2:5]      # [2, 3, 4]
numbers[:3]       # [0, 1, 2] - from beginning
numbers[5:]       # [5, 6, 7, 8, 9] - to end
numbers[:]        # Copy everything
numbers[::2]      # [0, 2, 4, 6, 8] - every 2nd element

Negative Indices

Count from the end: -1 is last, -2 is second-to-last

numbers[-1]       # 9 - last element
numbers[-3:]      # [7, 8, 9] - last 3 elements
numbers[:-2]      # [0, 1, 2, 3, 4, 5, 6, 7] - all except last 2
numbers[::-1]     # Reverse!

Practice

💻
Exercise 1

Reverse middle 6 elements (indices 2-7) of [0,1,2,3,4,5,6,7,8,9]

💻
Exercise 2

Get every 3rd element backwards from ['a','b',...,'j']

💻
Exercise 3

Swap first 3 and last 3 characters in "abcdefghij"


Control Flow

If Statements

age = 18

if age >= 18:
    print("Adult")
elif age >= 13:
    print("Teen")
else:
    print("Child")
💡
elif vs Separate if

elif stops checking after first match. Separate if statements check all conditions.
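A quick sketch of the difference (values are illustrative):

```python
score = 95

# elif chain: only the FIRST true branch runs
if score >= 90:
    print("A")      # prints "A", then stops checking
elif score >= 60:
    print("Pass")   # skipped, even though 95 >= 60

# Separate ifs: EVERY condition is tested
if score >= 90:
    print("A")      # prints "A"
if score >= 60:
    print("Pass")   # ALSO prints "Pass"
```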

Practice

💻
Exercise 1

Convert cm to inches (2.54 cm/inch). Print "invalid" if negative.

💻
Exercise 2

Print student year: ≤23: freshman, 24-53: sophomore, 54-83: junior, ≥84: senior

💻
Exercise 3

Number guessing game (1-10)


Loops

For Loops

# Loop through list
for fruit in ["apple", "banana"]:
    print(fruit)

# With index
for i, fruit in enumerate(["apple", "banana"]):
    print(f"{i}: {fruit}")

# Range
for i in range(5):        # 0, 1, 2, 3, 4
    print(i)

for i in range(2, 5):     # 2, 3, 4
    print(i)

for i in range(0, 10, 2): # 0, 2, 4, 6, 8
    print(i)

While Loops

count = 0
while count < 5:
    print(count)
    count += 1
⚠️
Infinite Loops

Make sure your condition eventually becomes False!
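The most common slip (illustrative) is forgetting to update the loop variable:

```python
# ❌ Infinite: count never changes, so count < 5 stays True forever
# count = 0
# while count < 5:
#     print(count)

# ✅ The update inside the loop guarantees the condition eventually fails
count = 0
while count < 5:
    count += 1
print(count)  # 5
```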

Control Statements

📝
Loop Control

break → Exit loop immediately
continue → Skip to next iteration
pass → Do nothing (placeholder)
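All three in one small loop:

```python
collected = []
for i in range(5):
    if i == 1:
        continue            # skip 1, jump to the next iteration
    if i == 3:
        break               # leave the loop entirely at 3
    if i == 100:
        pass                # placeholder - does nothing
    collected.append(i)

print(collected)  # [0, 2]
```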

Practice

💻
Exercise 1

Print your name 100 times

💻
Exercise 2

Print numbers and their squares from 1-20

💻
Exercise 3

Print: 8, 11, 14, 17, ..., 89 using a for loop


String & List Exercises

💻
String Challenges

1. Count spaces to estimate words
2. Check if parentheses are balanced
3. Check if word contains vowels
4. Encrypt by rearranging even/odd indices
5. Capitalize first letter of each word

💻
List Challenges

1. Replace all values > 10 with 10
2. Remove duplicates from list
3. Find longest run of zeros
4. Create [1,1,0,1,0,0,1,0,0,0,...]
5. Remove first character from each string


F-Strings (String Formatting)

Modern, clean way to format strings:

name = 'Ahmed'
age = 45
txt = f"My name is {name}, I am {age}"

Number Formatting

pi = 3.14159265359

f'{pi:.2f}'              # '3.14' - 2 decimals
f'{10:03d}'              # '010' - pad with zeros
f'{12345678:,d}'         # '12,345,678' - commas
f'{42:>10d}'             # '        42' - right align
f'{1234.5:>10,.2f}'      # '  1,234.50' - combined

Functions in F-Strings

name = "alice"
f"Hello, {name.upper()}!"        # 'Hello, ALICE!'

numbers = [3, 1, 4]
f"Sum: {sum(numbers)}"           # 'Sum: 8'

String Methods

split() and join()

# Split
text = "one,two,three"
words = text.split(',')          # ['one', 'two', 'three']
text.split()                     # Split on any whitespace

# Join
words = ['one', 'two', 'three']
', '.join(words)                 # 'one, two, three'
''.join(['H','e','l','l','o'])   # 'Hello'

partition()

Splits at first occurrence:

email = "user@example.com"
username, _, domain = email.partition('@')
# username = 'user', domain = 'example.com'

Character Checks

'123'.isdigit()          # True - all digits
'Hello123'.isalnum()     # True - letters and numbers
'hello'.isalpha()        # True - only letters
'hello'.islower()        # True - all lowercase
'HELLO'.isupper()        # True - all uppercase

Two Sum Problem

Problem

Given an array of integers and a target, return indices of two numbers that add up to target.

# Input: nums = [2, 7, 11, 15], target = 9
# Output: [0, 1]  (because 2 + 7 = 9)

Brute Force Solution (O(n²))

nums = [2, 7, 11, 15]
target = 9

for i in range(len(nums)):
    for j in range(i + 1, len(nums)):
        if nums[i] + nums[j] == target:
            print([i, j])
⚠️
Nested Loops = Slow

Time complexity: O(n²)
10 elements = ~100 operations
1,000 elements = ~1,000,000 operations!
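A standard way to avoid the nested loop (a well-known technique, shown here as a sketch rather than part of the notes above) is to remember values you've already seen in a dictionary, turning the inner scan into an O(1) lookup:

```python
def two_sum(nums, target):
    """O(n): store each value's index as we walk through the list."""
    seen = {}  # value -> index
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:          # O(1) dictionary lookup
            return [seen[complement], i]
        seen[num] = i
    return None

print(two_sum([2, 7, 11, 15], 9))  # [0, 1]
```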


Unpacking with * and **

Unpacking Iterables (*)

# Basic unpacking
numbers = [1, 2, 3]
a, b, c = numbers

# Catch remaining items
first, *middle, last = [1, 2, 3, 4, 5]
# first = 1, middle = [2, 3, 4], last = 5

# In function calls
def add(a, b, c):
    return a + b + c

numbers = [1, 2, 3]
add(*numbers)  # Same as add(1, 2, 3)

# Combining lists
list1 = [1, 2]
list2 = [3, 4]
combined = [*list1, *list2]  # [1, 2, 3, 4]

Unpacking Dictionaries (**)

# Merge dictionaries
defaults = {'color': 'blue', 'size': 'M'}
custom = {'size': 'L'}
final = {**defaults, **custom}
# {'color': 'blue', 'size': 'L'}

# In function calls
def create_user(name, age, city):
    print(f"{name}, {age}, {city}")

data = {'name': 'Bob', 'age': 30, 'city': 'NYC'}
create_user(**data)
💡
Remember

* unpacks iterables into positional arguments
** unpacks dictionaries into keyword arguments

Functions

📖
What is a Function?

A function is a reusable block of code that performs a specific task. It's like a recipe you can follow multiple times without rewriting the steps.

The DRY Principle

💡
DRY = Don't Repeat Yourself

If you're copying and pasting code, you should probably write a function instead!

Without a function (repetitive):

# Calculating area three times - notice the pattern?
area1 = 10 * 5
print(f"Area 1: {area1}")

area2 = 8 * 6
print(f"Area 2: {area2}")

area3 = 12 * 4
print(f"Area 3: {area3}")

With a function (clean):

def calculate_area(length, width):
    return length * width

print(f"Area 1: {calculate_area(10, 5)}")
print(f"Area 2: {calculate_area(8, 6)}")
print(f"Area 3: {calculate_area(12, 4)}")

Basic Function Syntax

Declaring a Function

def greet():
    print("Hello, World!")

Anatomy:

  • def → keyword to start a function
  • greet → function name (use descriptive names!)
  • () → parentheses for parameters
  • : → colon to start the body
  • Indented code → what the function does

Calling a Function

⚠️
Important

Defining a function doesn't run it! You must call it.

def greet():
    print("Hello, World!")

greet()  # Now it runs!
greet()  # You can call it multiple times

Parameters and Arguments

📝
Terminology

Parameters are in the definition. Arguments are the actual values you pass.

def greet(name):      # 'name' is a parameter
    print(f"Hello, {name}!")

greet("Alice")        # "Alice" is an argument

Multiple parameters:

def add_numbers(a, b):
    result = a + b
    print(f"{a} + {b} = {result}")

add_numbers(5, 3)     # Output: 5 + 3 = 8

Return Values

Functions can give back results using return:

def multiply(a, b):
    return a * b

result = multiply(4, 5)
print(result)  # 20

# Use the result directly in calculations
total = multiply(3, 7) + multiply(2, 4)  # 21 + 8 = 29
ℹ️
print() vs return

print() shows output on screen. return sends a value back so you can use it later.
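Side by side:

```python
def show_double(n):
    print(n * 2)      # displays the value; the function returns None

def get_double(n):
    return n * 2      # hands the value back to the caller

x = show_double(5)    # prints 10
y = get_double(5)     # prints nothing

print(x)              # None - nothing was returned
print(y + 1)          # 11 - the returned value can be used in math
```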


Default Arguments

Give parameters default values if no argument is provided:

def power(base, exponent=2):  # exponent defaults to 2
    return base ** exponent

print(power(5))      # 25 (5²)
print(power(5, 3))   # 125 (5³)

Multiple defaults:

def create_profile(name, age=18, country="USA"):
    print(f"{name}, {age} years old, from {country}")

create_profile("Alice")                    # Uses both defaults
create_profile("Bob", 25)                  # Uses country default
create_profile("Charlie", 30, "Canada")    # No defaults used
⚠️
Rule

Parameters with defaults must come after parameters without defaults!

# ❌ Wrong
def bad(a=5, b):
    pass

# ✅ Correct
def good(b, a=5):
    pass

Variable Number of Arguments

*args (Positional Arguments)

Use when you don't know how many arguments will be passed:

def sum_all(*numbers):
    total = 0
    for num in numbers:
        total += num
    return total

print(sum_all(1, 2, 3))           # 6
print(sum_all(10, 20, 30, 40))    # 100

**kwargs (Keyword Arguments)

Use for named arguments as a dictionary:

def print_info(**details):
    for key, value in details.items():
        print(f"{key}: {value}")

print_info(name="Alice", age=25, city="New York")
# Output:
# name: Alice
# age: 25
# city: New York

Combining Everything

💡
Order Matters

When combining, use this order: regular params → *args → keyword-only params with defaults → **kwargs

def flexible(required, *args, default="default", **kwargs):
    print(f"Required: {required}")
    print(f"Args: {args}")
    print(f"Default: {default}")
    print(f"Kwargs: {kwargs}")

flexible("Must have", 1, 2, 3, default="Custom", extra="value")

Scope: Local vs Global

📖
Scope

Scope determines where a variable can be accessed in your code.

Local scope: Variables inside functions only exist inside that function

def calculate():
    result = 10 * 5  # Local variable
    print(result)

calculate()        # 50
print(result)      # ❌ ERROR! result doesn't exist here

Global scope: Variables outside functions can be accessed anywhere

total = 0  # Global variable

def add_to_total(amount):
    global total  # Modify the global variable
    total += amount

add_to_total(10)
print(total)  # 10
💡
Best Practice

Avoid global variables! Pass values as arguments and return results instead.

Better approach:

def add_to_total(current, amount):
    return current + amount

total = 0
total = add_to_total(total, 10)  # 10
total = add_to_total(total, 5)   # 15

Decomposition

📖
Decomposition

Breaking complex problems into smaller, manageable functions. Each function should do one thing well.

Bad (one giant function):

def process_order(items, customer):
    # Calculate, discount, tax, print - all in one!
    total = sum(item['price'] for item in items)
    if total > 100:
        total *= 0.9
    total *= 1.08
    print(f"Customer: {customer}")
    print(f"Total: ${total:.2f}")

Good (decomposed):

def calculate_subtotal(items):
    return sum(item['price'] for item in items)

def apply_discount(amount):
    return amount * 0.9 if amount > 100 else amount

def add_tax(amount):
    return amount * 1.08

def print_receipt(customer, total):
    print(f"Customer: {customer}")
    print(f"Total: ${total:.2f}")

def process_order(items, customer):
    subtotal = calculate_subtotal(items)
    discounted = apply_discount(subtotal)
    final = add_tax(discounted)
    print_receipt(customer, final)

Benefits: ✅ Easier to understand ✅ Easier to test ✅ Reusable components ✅ Easier to debug


Practice Exercises

💻
Exercise 1: Rectangle Printer

Write a function rectangle(m, n) that prints an m × n box of asterisks.

rectangle(2, 4)
# Output:
# ****
# ****
💻
Exercise 2: Add Excitement

Write add_excitement(words) that adds "!" to each string in a list.

  • Version A: Modify the original list
  • Version B: Return a new list without modifying the original
words = ["hello", "world"]
add_excitement(words)
# words is now ["hello!", "world!"]
💻
Exercise 3: Sum Digits

Write sum_digits(num) that returns the sum of all digits in a number.

sum_digits(123)   # Returns: 6 (1 + 2 + 3)
sum_digits(4567)  # Returns: 22 (4 + 5 + 6 + 7)
💻
Exercise 4: First Difference

Write first_diff(str1, str2) that returns the first position where strings differ, or -1 if identical.

first_diff("hello", "world")  # Returns: 0
first_diff("test", "tent")    # Returns: 2
first_diff("same", "same")    # Returns: -1
💻
Exercise 5: Tic-Tac-Toe

A 3×3 board uses: 0 = empty, 1 = X, 2 = O

  • Part A: Write a function that randomly places a 2 in an empty spot
  • Part B: Write a function that checks if someone has won (returns True/False)
💻
Exercise 6: String Matching

Write matches(str1, str2) that counts how many positions have the same character.

matches("python", "path")  # Returns: 3 (positions 0, 2, 3)
💻
Exercise 7: Find All Occurrences

Write findall(string, char) that returns a list of all positions where a character appears.

findall("hello", "l")  # Returns: [2, 3]
findall("test", "x")   # Returns: []
💻
Exercise 8: Case Swap

Write change_case(string) that swaps uppercase ↔ lowercase.

change_case("Hello World")  # Returns: "hELLO wORLD"

Challenge Exercises

Challenge 1: Merge Sorted Lists

Write merge(list1, list2) that combines two sorted lists into one sorted list.

  • Try it with .sort() method
  • Try it without using .sort()
merge([1, 3, 5], [2, 4, 6])  # Returns: [1, 2, 3, 4, 5, 6]
Challenge 2: Number to English

Write verbose(num) that converts numbers to English words (up to 10¹⁵).

verbose(123456)  
# Returns: "one hundred twenty-three thousand, four hundred fifty-six"
Challenge 3: Base 20 Conversion

Convert base 10 numbers to base 20 using letters A-T (A=0, B=1, ..., T=19).

base20(0)    # Returns: "A"
base20(20)   # Returns: "BA"
base20(39)   # Returns: "BT"
base20(400)  # Returns: "BAA"
Challenge 4: Closest Value

Write closest(L, n) that returns the largest element in L that doesn't exceed n.

closest([1, 6, 3, 9, 11], 8)  # Returns: 6
closest([5, 10, 15, 20], 12)  # Returns: 10

Higher-Order Functions

📖
Definition

Higher-Order Function: A function that either takes another function as a parameter OR returns a function as a result.
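The examples below focus on the first kind (taking a function as a parameter). A minimal sketch of the second kind, a function that returns a function:

```python
def make_multiplier(factor):
    def multiply(n):          # defined inside; remembers `factor`
        return n * factor
    return multiply           # return the function itself (no parentheses!)

double = make_multiplier(2)
triple = make_multiplier(3)
print(double(5))  # 10
print(triple(5))  # 15
```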

Why Do We Need Them?

Imagine you have a list of numbers and you want to:

  • Keep only the even numbers
  • Keep only numbers greater than 10
  • Keep only numbers divisible by 3

You could write three different functions... or write ONE function that accepts different "rules" as parameters!

💡
Key Idea

Separate what to do (iterate through a list) from how to decide (the specific rule)


Worked Example: Filtering Numbers

Step 1: The Problem

We have a list of numbers: [3, 8, 15, 4, 22, 7, 11]

We want to filter them based on different conditions.

Step 2: Without Higher-Order Functions (Repetitive)

# Filter for even numbers
def filter_even(numbers):
    result = []
    for num in numbers:
        if num % 2 == 0:
            result.append(num)
    return result

# Filter for numbers > 10
def filter_large(numbers):
    result = []
    for num in numbers:
        if num > 10:
            result.append(num)
    return result
⚠️
Problem

Notice how we're repeating the same loop structure? Only the condition changes!

Step 3: With Higher-Order Function (Smart)

def filter_numbers(numbers, condition):
    """
    Filter numbers based on any condition function.
    
    numbers: list of numbers
    condition: a function that returns True/False
    """
    result = []
    for num in numbers:
        if condition(num):  # Call the function we received!
            result.append(num)
    return result
Solution

Now we have ONE function that can work with ANY condition!

Step 4: Define Simple Condition Functions

def is_even(n):
    return n % 2 == 0

def is_large(n):
    return n > 10

def is_small(n):
    return n < 10

Step 5: Use It!

numbers = [3, 8, 15, 4, 22, 7, 11]

print(filter_numbers(numbers, is_even))   # [8, 4, 22]
print(filter_numbers(numbers, is_large))  # [15, 22, 11]
print(filter_numbers(numbers, is_small))  # [3, 8, 4, 7]
ℹ️
Notice

We pass the function name WITHOUT parentheses: is_even not is_even()


Practice Exercises

💻
Exercise 1: String Filter

Complete this function:

def filter_words(words, condition):
    # Your code here
    pass

def is_long(word):
    return len(word) > 5

def starts_with_a(word):
    return word.lower().startswith('a')

# Test it:
words = ["apple", "cat", "banana", "amazing", "dog"]
print(filter_words(words, is_long))         # Should print: ["banana", "amazing"]
print(filter_words(words, starts_with_a))   # Should print: ["apple", "amazing"]
💻
Exercise 2: Number Transformer

Write a higher-order function that transforms numbers:

def transform_numbers(numbers, transformer):
    # Your code here: apply transformer to each number
    pass

def double(n):
    return n * 2

def square(n):
    return n ** 2

# Test it:
nums = [1, 2, 3, 4, 5]
print(transform_numbers(nums, double))   # Should print: [2, 4, 6, 8, 10]
print(transform_numbers(nums, square))   # Should print: [1, 4, 9, 16, 25]
💻
Exercise 3: Grade Calculator

Create a function that grades scores using different grading systems:

def apply_grading(scores, grade_function):
    # Your code here
    pass

def strict_grade(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    else:
        return 'C'

def pass_fail(score):
    return 'Pass' if score >= 60 else 'Fail'

# Test it:
scores = [95, 75, 85, 55]
print(apply_grading(scores, strict_grade))  # Should print: ['A', 'C', 'B', 'C']
print(apply_grading(scores, pass_fail))     # Should print: ['Pass', 'Pass', 'Pass', 'Fail']

Conclusion

📝
Remember

1. Functions can be passed as parameters (like any other value)
2. The higher-order function provides the structure (loop, collection)
3. The parameter function provides the specific behavior (condition, transformation)
4. This makes code more reusable and flexible

💡
Real Python Examples

Python has built-in higher-order functions you'll use all the time:
sorted(items, key=function)
map(function, items)
filter(function, items)
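A quick taste of each built-in in action:

```python
words = ["banana", "fig", "apple"]

# sorted with a key function: sort by length instead of alphabetically
print(sorted(words, key=len))            # ['fig', 'apple', 'banana']

# map applies a function to every item (wrap in list() to see the results)
print(list(map(str.upper, words)))       # ['BANANA', 'FIG', 'APPLE']

# filter keeps only items where the function returns True
print(list(filter(lambda w: "a" in w, words)))  # ['banana', 'apple']
```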


Challenge Exercise

DNA Sequence Validator

Write a higher-order function validate_sequences(sequences, validator) that checks a list of DNA sequences using different validation rules.

Validation functions to create:

  • is_valid_dna(seq) - checks if sequence contains only A, C, G, T
  • is_long_enough(seq) - checks if sequence is at least 10 characters
  • has_start_codon(seq) - checks if sequence starts with "ATG"
sequences = ["ATGCGATCG", "ATGXYZ", "AT", "ATGCCCCCCCCCC"]

# Your solution should work like this:
print(validate_sequences(sequences, is_valid_dna))
# [True, False, True, True]

print(validate_sequences(sequences, is_long_enough))
# [False, False, False, True]

Tuples and Sets


Part 1: Tuples

What is a Tuple?

A tuple is essentially an immutable list. Once created, you cannot change its contents.

# List - mutable (can change)
L = [1, 2, 3]
L[0] = 100  # Works fine

# Tuple - immutable (cannot change)
t = (1, 2, 3)
t[0] = 100  # TypeError: 'tuple' object does not support item assignment

Creating Tuples

# With parentheses
t = (1, 2, 3)

# Without parentheses (comma makes it a tuple)
t = 1, 2, 3

# Single element tuple (comma is required!)
t = (1,)    # This is a tuple
t = (1)     # This is just an integer!

# Empty tuple
t = ()
t = tuple()

# From a list
t = tuple([1, 2, 3])

# From a string
t = tuple("hello")  # ('h', 'e', 'l', 'l', 'o')

Common mistake:

# This is NOT a tuple
x = (5)
print(type(x))  # <class 'int'>

# This IS a tuple
x = (5,)
print(type(x))  # <class 'tuple'>

Accessing Tuple Elements

t = ('a', 'b', 'c', 'd', 'e')

# Indexing (same as lists)
print(t[0])     # 'a'
print(t[-1])    # 'e'

# Slicing
print(t[1:3])   # ('b', 'c')
print(t[:3])    # ('a', 'b', 'c')
print(t[2:])    # ('c', 'd', 'e')

# Length
print(len(t))   # 5

Why Use Tuples?

1. Faster and Less Memory

Tuples are more efficient than lists:

import sys

L = [1, 2, 3, 4, 5]
t = (1, 2, 3, 4, 5)

print(sys.getsizeof(L))  # e.g. 104 bytes (exact sizes vary by Python version)
print(sys.getsizeof(t))  # e.g. 80 bytes (smaller!)

2. Safe - Data Cannot Be Changed

When you want to ensure data stays constant:

# RGB color that shouldn't change
RED = (255, 0, 0)
# RED[0] = 200  # Error! Can't modify

# Coordinates
location = (40.7128, -74.0060)  # New York

3. Can Be Dictionary Keys

Lists cannot be dictionary keys, but tuples can:

# This works
locations = {
    (40.7128, -74.0060): "New York",
    (51.5074, -0.1278): "London"
}
print(locations[(40.7128, -74.0060)])  # New York

# This fails
# locations = {[40.7128, -74.0060]: "New York"}  # TypeError!

4. Return Multiple Values

Functions can return tuples:

def get_stats(numbers):
    return min(numbers), max(numbers), sum(numbers)

low, high, total = get_stats([1, 2, 3, 4, 5])
print(low, high, total)  # 1 5 15

Tuple Unpacking

# Basic unpacking
t = (1, 2, 3)
a, b, c = t
print(a, b, c)  # 1 2 3

# Swap values (elegant!)
x, y = 10, 20
x, y = y, x
print(x, y)  # 20 10

# Unpacking with *
t = (1, 2, 3, 4, 5)
first, *middle, last = t
print(first)   # 1
print(middle)  # [2, 3, 4]
print(last)    # 5

Looping Through Tuples

t = ('a', 'b', 'c')

# Basic loop
for item in t:
    print(item)

# With index
for i, item in enumerate(t):
    print(f"{i}: {item}")

# Loop through list of tuples
points = [(0, 0), (1, 2), (3, 4)]
for x, y in points:
    print(f"x={x}, y={y}")

Tuple Methods

Tuples have only two methods (because they're immutable):

t = (1, 2, 3, 2, 2, 4)

# Count occurrences
print(t.count(2))   # 3

# Find index
print(t.index(3))   # 2

Tuples vs Lists Summary

| Feature | Tuple | List |
|---|---|---|
| Syntax | (1, 2, 3) | [1, 2, 3] |
| Mutable | No | Yes |
| Speed | Faster | Slower |
| Memory | Less | More |
| Dictionary key | Yes | No |
| Use case | Fixed data | Changing data |

Tuple Exercises

Exercise 1: Create a tuple with your name, age, and city. Print each element.

Exercise 2: Given t = (1, 2, 3, 4, 5), print the first and last elements.

Exercise 3: Write a function that returns the min, max, and average of a list as a tuple.

Exercise 4: Swap two variables using tuple unpacking.

Exercise 5: Create a tuple from the string "ATGC" and count how many times 'A' appears.

Exercise 6: Given a list of (x, y) coordinates, calculate the distance of each from origin.

Exercise 7: Use a tuple as a dictionary key to store city names by their (latitude, longitude).

Exercise 8: Unpack (1, 2, 3, 4, 5) into first, middle (as list), and last.

Exercise 9: Create a function that returns the quotient and remainder of two numbers as a tuple.

Exercise 10: Loop through [(1, 'a'), (2, 'b'), (3, 'c')] and print each pair.

Exercise 11: Convert a list [1, 2, 3] to a tuple and back to a list.

Exercise 12: Find the index of 'G' in the tuple ('A', 'T', 'G', 'C').

Exercise 13: Create a tuple of tuples representing a 3x3 grid and print the center element.

Exercise 14: Given two tuples, concatenate them into a new tuple.

Exercise 15: Sort a list of (name, score) tuples by score in descending order.

Solutions
# Exercise 1
person = ("Mahmoud", 25, "Bologna")
print(person[0], person[1], person[2])

# Exercise 2
t = (1, 2, 3, 4, 5)
print(t[0], t[-1])

# Exercise 3
def stats(numbers):
    return min(numbers), max(numbers), sum(numbers)/len(numbers)
print(stats([1, 2, 3, 4, 5]))

# Exercise 4
x, y = 10, 20
x, y = y, x
print(x, y)

# Exercise 5
dna = tuple("ATGC")
print(dna.count('A'))

# Exercise 6
import math
coords = [(3, 4), (0, 5), (1, 1)]
for x, y in coords:
    dist = math.sqrt(x**2 + y**2)
    print(f"({x}, {y}): {dist:.2f}")

# Exercise 7
cities = {
    (40.71, -74.00): "New York",
    (51.51, -0.13): "London"
}
print(cities[(40.71, -74.00)])

# Exercise 8
t = (1, 2, 3, 4, 5)
first, *middle, last = t
print(first, middle, last)

# Exercise 9
def div_mod(a, b):
    return a // b, a % b
print(div_mod(17, 5))  # (3, 2)

# Exercise 10
pairs = [(1, 'a'), (2, 'b'), (3, 'c')]
for num, letter in pairs:
    print(f"{num}: {letter}")

# Exercise 11
L = [1, 2, 3]
t = tuple(L)
L2 = list(t)
print(t, L2)

# Exercise 12
dna = ('A', 'T', 'G', 'C')
print(dna.index('G'))  # 2

# Exercise 13
grid = ((1, 2, 3), (4, 5, 6), (7, 8, 9))
print(grid[1][1])  # 5

# Exercise 14
t1 = (1, 2)
t2 = (3, 4)
t3 = t1 + t2
print(t3)  # (1, 2, 3, 4)

# Exercise 15
scores = [("Alice", 85), ("Bob", 92), ("Charlie", 78)]
sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
print(sorted_scores)

Part 2: Sets

What is a Set?

A set is a collection of unique elements with no duplicates. Sets work like mathematical sets.

# Duplicates are automatically removed
S = {1, 2, 2, 3, 3, 3}
print(S)  # {1, 2, 3}

# Unordered - no indexing
# print(S[0])  # TypeError!

Creating Sets

# With curly braces
S = {1, 2, 3, 4, 5}

# From a list (removes duplicates)
S = set([1, 2, 2, 3, 3])
print(S)  # {1, 2, 3}

# From a string
S = set("hello")
print(S)  # {'h', 'e', 'l', 'o'} - no duplicate 'l' (order may vary)

# Empty set (NOT {} - that's an empty dict!)
S = set()
print(type(S))   # <class 'set'>
print(type({}))  # <class 'dict'>

Adding and Removing Elements

S = {1, 2, 3}

# Add single element
S.add(4)
print(S)  # {1, 2, 3, 4}

# Add multiple elements
S.update([5, 6, 7])
print(S)  # {1, 2, 3, 4, 5, 6, 7}

# Remove element (raises error if not found)
S.remove(7)
print(S)  # {1, 2, 3, 4, 5, 6}

# Discard element (no error if not found)
S.discard(100)  # No error
S.discard(6)
print(S)  # {1, 2, 3, 4, 5}

# Pop an arbitrary element
x = S.pop()
print(x)  # Some element (you can't choose which one)

# Clear all elements
S.clear()
print(S)  # set()

Membership Testing

Very fast - O(1):

S = {1, 2, 3, 4, 5}

print(3 in S)     # True
print(100 in S)   # False
print(100 not in S)  # True
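A rough illustration of why this matters, using timeit (exact timings vary by machine, but the gap is large):

```python
import timeit

data_list = list(range(100_000))
data_set = set(data_list)

# A list scans elements one by one (O(n)); a set hashes straight to the answer (O(1))
t_list = timeit.timeit(lambda: 99_999 in data_list, number=100)
t_set = timeit.timeit(lambda: 99_999 in data_set, number=100)

print(f"list: {t_list:.4f}s  set: {t_set:.6f}s")  # set is dramatically faster
```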

Looping Through Sets

S = {'a', 'b', 'c'}

# Basic loop
for item in S:
    print(item)

# With enumerate
for i, item in enumerate(S):
    print(f"{i}: {item}")

Note: Sets are unordered - iteration order is not guaranteed!


Set Operations (The Powerful Part!)

Sets support mathematical set operations.

Union: Elements in Either Set

A = {1, 2, 3}
B = {3, 4, 5}

# Using | operator
print(A | B)  # {1, 2, 3, 4, 5}

# Using method
print(A.union(B))  # {1, 2, 3, 4, 5}

Intersection: Elements in Both Sets

A = {1, 2, 3}
B = {3, 4, 5}

# Using & operator
print(A & B)  # {3}

# Using method
print(A.intersection(B))  # {3}

Difference: Elements in A but Not in B

A = {1, 2, 3}
B = {3, 4, 5}

# Using - operator
print(A - B)  # {1, 2}
print(B - A)  # {4, 5}

# Using method
print(A.difference(B))  # {1, 2}

Symmetric Difference: Elements in Either but Not Both

A = {1, 2, 3}
B = {3, 4, 5}

# Using ^ operator
print(A ^ B)  # {1, 2, 4, 5}

# Using method
print(A.symmetric_difference(B))  # {1, 2, 4, 5}

Subset and Superset

A = {1, 2}
B = {1, 2, 3, 4}

# Is A a subset of B?
print(A <= B)        # True
print(A.issubset(B)) # True

# Is B a superset of A?
print(B >= A)          # True
print(B.issuperset(A)) # True

# Proper subset (subset but not equal)
print(A < B)  # True
print(A < A)  # False

Disjoint: No Common Elements

A = {1, 2}
B = {3, 4}
C = {2, 3}

print(A.isdisjoint(B))  # True (no overlap)
print(A.isdisjoint(C))  # False (2 is common)

Set Operations Summary

| Operation | Operator | Method | Result |
|---|---|---|---|
| Union | A \| B | A.union(B) | All elements from both |
| Intersection | A & B | A.intersection(B) | Common elements |
| Difference | A - B | A.difference(B) | In A but not in B |
| Symmetric Diff | A ^ B | A.symmetric_difference(B) | In either but not both |
| Subset | A <= B | A.issubset(B) | True if A ⊆ B |
| Superset | A >= B | A.issuperset(B) | True if A ⊇ B |
| Disjoint | - | A.isdisjoint(B) | True if no overlap |

In-Place Operations

Modify the set directly (note that the intersection and difference method names end in _update):

A = {1, 2, 3}
B = {3, 4, 5}

# Union in-place
A |= B  # or A.update(B)
print(A)  # {1, 2, 3, 4, 5}

# Intersection in-place
A = {1, 2, 3}
A &= B  # or A.intersection_update(B)
print(A)  # {3}

# Difference in-place
A = {1, 2, 3}
A -= B  # or A.difference_update(B)
print(A)  # {1, 2}

Practical Examples

Remove Duplicates from List

L = [1, 2, 2, 3, 3, 3, 4]
unique = list(set(L))
print(unique)  # [1, 2, 3, 4] (order not guaranteed)

Find Common Elements

list1 = [1, 2, 3, 4]
list2 = [3, 4, 5, 6]
common = set(list1) & set(list2)
print(common)  # {3, 4}

Find Unique DNA Bases

dna = "ATGCATGCATGC"
bases = set(dna)
print(bases)  # {'A', 'T', 'G', 'C'}

Set Exercises

Exercise 1: Create a set from the list [1, 2, 2, 3, 3, 3] and print it.

Exercise 2: Add the number 10 to a set {1, 2, 3}.

Exercise 3: Remove duplicates from [1, 1, 2, 2, 3, 3, 4, 4].

Exercise 4: Find common elements between {1, 2, 3, 4} and {3, 4, 5, 6}.

Exercise 5: Find elements in {1, 2, 3} but not in {2, 3, 4}.

Exercise 6: Find all unique characters in the string "mississippi".

Exercise 7: Check if {1, 2} is a subset of {1, 2, 3, 4}.

Exercise 8: Find symmetric difference of {1, 2, 3} and {3, 4, 5}.

Exercise 9: Check if two sets {1, 2} and {3, 4} have no common elements.

Exercise 10: Given DNA sequence "ATGCATGC", create set of unique nucleotides.

Exercise 11: Combine sets {1, 2}, {3, 4}, {5, 6} into one set.

Exercise 12: Given two lists of students, find students in both classes.

Exercise 13: Remove element 3 from set {1, 2, 3, 4} safely (no error if missing).

Exercise 14: Create a set of prime numbers less than 20 and check membership of 17.

Exercise 15: Given three sets A, B, C, find elements that are in all three.

Solutions
# Exercise 1
S = set([1, 2, 2, 3, 3, 3])
print(S)  # {1, 2, 3}

# Exercise 2
S = {1, 2, 3}
S.add(10)
print(S)

# Exercise 3
L = [1, 1, 2, 2, 3, 3, 4, 4]
print(list(set(L)))

# Exercise 4
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
print(A & B)  # {3, 4}

# Exercise 5
A = {1, 2, 3}
B = {2, 3, 4}
print(A - B)  # {1}

# Exercise 6
print(set("mississippi"))

# Exercise 7
A = {1, 2}
B = {1, 2, 3, 4}
print(A <= B)  # True

# Exercise 8
A = {1, 2, 3}
B = {3, 4, 5}
print(A ^ B)  # {1, 2, 4, 5}

# Exercise 9
A = {1, 2}
B = {3, 4}
print(A.isdisjoint(B))  # True

# Exercise 10
dna = "ATGCATGC"
print(set(dna))  # {'A', 'T', 'G', 'C'}

# Exercise 11
A = {1, 2}
B = {3, 4}
C = {5, 6}
print(A | B | C)  # {1, 2, 3, 4, 5, 6}

# Exercise 12
class1 = ["Alice", "Bob", "Charlie"]
class2 = ["Bob", "Diana", "Charlie"]
print(set(class1) & set(class2))  # {'Bob', 'Charlie'}

# Exercise 13
S = {1, 2, 3, 4}
S.discard(3)  # Safe removal
S.discard(100)  # No error
print(S)

# Exercise 14
primes = {2, 3, 5, 7, 11, 13, 17, 19}
print(17 in primes)  # True

# Exercise 15
A = {1, 2, 3, 4}
B = {2, 3, 4, 5}
C = {3, 4, 5, 6}
print(A & B & C)  # {3, 4}

Summary: When to Use What?

| Data Type | Use When |
|---|---|
| List | Ordered, allow duplicates, need to modify |
| Tuple | Ordered, no modification needed, dictionary keys |
| Set | No duplicates, fast membership testing, set operations |
| Dict | Key-value mapping, fast lookup by key |

Useful modules

This is planned to be added later

Files and Sys Module

Reading Files

Always Use Context Manager (with)

Files automatically close, even if errors occur. This is the modern, safe way.

# ✅ Best way - file automatically closes
with open("data.txt", "r") as file:
    content = file.read()
    print(content)

# ❌ Old way - must manually close (don't do this)
file = open("data.txt", "r")
content = file.read()
file.close()  # Easy to forget!

File Modes

📝
Common Modes

"r" → Read (default)
"w" → Write (overwrites entire file!)
"a" → Append (adds to end)
"x" → Create (fails if exists)
"rb"/"wb" → Binary modes

# Read
with open("data.txt", "r") as f:
    content = f.read()

# Write (overwrites!)
with open("output.txt", "w") as f:
    f.write("Hello, World!")

# Append (adds to end)
with open("log.txt", "a") as f:
    f.write("New entry\n")

Reading Methods

read() - Entire File

with open("data.txt") as f:
    content = f.read()  # Whole file as string

readline() - One Line at a Time

with open("data.txt") as f:
    first = f.readline()   # First line
    second = f.readline()  # Second line

readlines() - All Lines as List

with open("data.txt") as f:
    lines = f.readlines()  # ['line1\n', 'line2\n', ...]

Looping Through Files

💡
Best Practice: Iterate Directly

Most memory efficient - reads one line at a time. Works with huge files!

# Best way - memory efficient
with open("data.txt") as f:
    for line in f:
        print(line, end="")  # Line already has \n

# With line numbers
with open("data.txt") as f:
    for i, line in enumerate(f, start=1):
        print(f"{i}: {line}", end="")

# Strip newlines
with open("data.txt") as f:
    for line in f:
        line = line.strip()  # Remove \n
        print(line)

# Process as list
with open("data.txt") as f:
    lines = [line.strip() for line in f]

Writing Files

write() - Single String

with open("output.txt", "w") as f:
    f.write("Hello\n")
    f.write("World\n")

writelines() - List of Strings

⚠️
writelines() Doesn't Add Newlines

You must include \n yourself!

lines = ["Line 1\n", "Line 2\n", "Line 3\n"]
with open("output.txt", "w") as f:
    f.writelines(lines)

print() - Write to a File

with open("output.txt", "w") as f:
    print("Hello, World!", file=f)
    print("Another line", file=f)

Processing Lines

Splitting

# By delimiter
line = "name,age,city"
parts = line.split(",")  # ['name', 'age', 'city']

# By whitespace (default)
line = "John   25   NYC"
parts = line.split()  # ['John', '25', 'NYC']

# With max splits
line = "a,b,c,d,e"
parts = line.split(",", 2)  # ['a', 'b', 'c,d,e']

Joining

words = ['Hello', 'World']
sentence = " ".join(words)  # "Hello World"

lines = ['line1', 'line2', 'line3']
content = "\n".join(lines)

Processing CSV Data

with open("data.csv") as f:
    for line in f:
        parts = line.strip().split(",")
        name, age, city = parts
        print(f"{name} is {age} from {city}")
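Splitting by hand works for simple data, but fields containing commas (like `"Doe, Jane"`) break it. The standard library's csv module handles quoting for you; here is a minimal sketch that parses an in-memory string instead of a file so it runs as-is:

```python
import csv
import io

# A string standing in for file contents; csv.reader handles the
# quoted field "Doe, Jane" that a plain split(",") would cut in two
data = io.StringIO('"Doe, Jane",25,NYC\nAlex,20,Cairo\n')

for name, age, city in csv.reader(data):  # each row is a list of strings
    print(f"{name} is {age} from {city}")
# Doe, Jane is 25 from NYC
# Alex is 20 from Cairo
```

With a real file, pass `open("data.csv", newline="")` to `csv.reader` instead of the StringIO object.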

The sys Module

Command Line Arguments

import sys

print(sys.argv)  # List of all arguments
# python script.py hello world
# Output: ['script.py', 'hello', 'world']

print(sys.argv[0])  # Script name
print(sys.argv[1])  # First argument
print(len(sys.argv))  # Number of arguments

Basic Argument Handling

import sys

if len(sys.argv) < 2:
    print("Usage: python script.py <filename>")
    sys.exit(1)

filename = sys.argv[1]
print(f"Processing: {filename}")

Processing Multiple Arguments

import sys

# python script.py file1.txt file2.txt file3.txt
for filename in sys.argv[1:]:  # Skip script name
    print(f"Processing: {filename}")

Argument Validation

💻
Complete Template

Validation pattern for command-line scripts

import sys
import os

def main():
    # Check argument count
    if len(sys.argv) != 3:
        print("Usage: python script.py <input> <output>")
        sys.exit(1)
    
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    
    # Check if input exists
    if not os.path.exists(input_file):
        print(f"Error: {input_file} not found")
        sys.exit(1)
    
    # Check if output exists
    if os.path.exists(output_file):
        response = input(f"{output_file} exists. Overwrite? (y/n): ")
        if response.lower() != 'y':
            print("Aborted")
            sys.exit(0)
    
    # Process files
    process(input_file, output_file)

if __name__ == "__main__":
    main()

Standard Streams

stdin, stdout, stderr

import sys

# Read from stdin
line = sys.stdin.readline()

# Write to stdout (like print)
sys.stdout.write("Hello\n")

# Write to stderr (for errors)
sys.stderr.write("Error: failed\n")

Reading from Pipe

# In terminal
cat data.txt | python script.py
echo "Hello" | python script.py

# script.py
import sys

for line in sys.stdin:
    print(f"Received: {line.strip()}")

Exit Codes

📝
Convention

0 → Success
1 → General error
2 → Command line error

import sys

# Exit with success
sys.exit(0)

# Exit with error
sys.exit(1)

# Exit with message
sys.exit("Error: something went wrong")
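Exit codes are what other programs (and shell scripts) see. A small sketch showing how a parent process reads the code back, using subprocess (the inline `-c` script is just a stand-in for any real script):

```python
import subprocess
import sys

# Launch a tiny child script that exits with code 3
result = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])

# The parent can inspect the child's exit code
print(result.returncode)  # 3
```

In a shell, the same code is available as `$?` right after the command finishes.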

Useful sys Attributes

import sys

# Python version
print(sys.version)         # '3.10.0 (default, ...)'
print(sys.version_info)    # sys.version_info(major=3, ...)

# Platform
print(sys.platform)        # 'linux', 'darwin', 'win32'

# Module search paths
print(sys.path)

# Maximum integer
print(sys.maxsize)

# Default encoding
print(sys.getdefaultencoding())  # 'utf-8'

Building Command Line Tools

Simple Script Template

#!/usr/bin/env python3
"""Simple command line tool."""

import sys
import os

def print_usage():
    print("Usage: python tool.py <input_file>")
    print("Options:")
    print("  -h, --help    Show help")
    print("  -v, --verbose Verbose output")

def main():
    # Parse arguments
    if len(sys.argv) < 2 or sys.argv[1] in ['-h', '--help']:
        print_usage()
        sys.exit(0)
    
    verbose = '-v' in sys.argv or '--verbose' in sys.argv
    
    # Get input file
    input_file = None
    for arg in sys.argv[1:]:
        if not arg.startswith('-'):
            input_file = arg
            break
    
    if not input_file:
        print("Error: No input file", file=sys.stderr)
        sys.exit(1)
    
    if not os.path.exists(input_file):
        print(f"Error: {input_file} not found", file=sys.stderr)
        sys.exit(1)
    
    # Process
    if verbose:
        print(f"Processing {input_file}...")
    
    with open(input_file) as f:
        for line in f:
            print(line.strip())
    
    if verbose:
        print("Done!")

if __name__ == "__main__":
    main()

Word Count Tool

💻
Example: wc Clone

Count lines, words, and characters

#!/usr/bin/env python3
import sys

def count_file(filename):
    lines = words = chars = 0
    with open(filename) as f:
        for line in f:
            lines += 1
            words += len(line.split())
            chars += len(line)
    return lines, words, chars

def main():
    if len(sys.argv) < 2:
        print("Usage: python wc.py <file1> [file2] ...")
        sys.exit(1)
    
    total_l = total_w = total_c = 0
    
    for filename in sys.argv[1:]:
        try:
            l, w, c = count_file(filename)
            print(f"{l:8} {w:8} {c:8} {filename}")
            total_l += l
            total_w += w
            total_c += c
        except FileNotFoundError:
            print(f"Error: {filename} not found", file=sys.stderr)
    
    if len(sys.argv) > 2:
        print(f"{total_l:8} {total_w:8} {total_c:8} total")

if __name__ == "__main__":
    main()

FASTA Sequence Counter

#!/usr/bin/env python3
import sys

def process_fasta(filename):
    sequences = 0
    total_bases = 0
    
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                sequences += 1
            else:
                total_bases += len(line)
    
    return sequences, total_bases

def main():
    if len(sys.argv) != 2:
        print("Usage: python fasta_count.py <file.fasta>")
        sys.exit(1)
    
    filename = sys.argv[1]
    
    try:
        seqs, bases = process_fasta(filename)
        print(f"Sequences: {seqs}")
        print(f"Total bases: {bases}")
        if seqs:  # avoid ZeroDivisionError on a file with no sequences
            print(f"Average: {bases/seqs:.1f}")
    except FileNotFoundError:
        print(f"Error: {filename} not found", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()

File Path Operations

import os

# Join paths (cross-platform)
path = os.path.join("folder", "subfolder", "file.txt")

# Get filename
os.path.basename("/path/to/file.txt")  # "file.txt"

# Get directory
os.path.dirname("/path/to/file.txt")   # "/path/to"

# Split extension
name, ext = os.path.splitext("data.txt")  # "data", ".txt"

# Check existence
os.path.exists("file.txt")    # True/False
os.path.isfile("file.txt")    # True if file
os.path.isdir("folder")       # True if directory

# Get file size
os.path.getsize("file.txt")   # Size in bytes

# Get absolute path
os.path.abspath("file.txt")
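The same operations are available in a more object-oriented style through the standard library's pathlib module, which many newer codebases prefer. A quick sketch (the path itself is made up, so exists() will print False here):

```python
from pathlib import Path

# The / operator joins path parts, cross-platform
p = Path("folder") / "subfolder" / "file.txt"

print(p.name)      # file.txt   (like os.path.basename)
print(p.suffix)    # .txt       (like os.path.splitext()[1])
print(p.stem)      # file       (name without extension)
print(p.parent)    # folder/subfolder on POSIX (like os.path.dirname)
print(p.exists())  # False unless the path really exists
```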

Practice Exercises

💻
Basic File Operations

1. Read file and print with line numbers
2. Count lines in a file
3. Copy file contents (use sys.argv)
4. Parse and format CSV rows
5. Reverse file contents

💻
Command Line Tools

6. Search for word and print matching lines
7. Read stdin, write stdout in uppercase
8. Validate arguments (file must exist)
9. Word frequency counter (top 10 words)
10. Parse FASTA (extract names and lengths)

💻
Advanced Tools

11. Merge multiple files into one
12. Remove blank lines from file
13. Convert file to uppercase
14. Log analyzer (count ERROR/WARNING/INFO)
15. Build grep-like tool: python grep.py <pattern> <file>


Quick Reference

📝
Essential Commands

with open(file) as f: → Open safely
f.read() → Read all
for line in f: → Iterate lines
f.write(string) → Write
sys.argv → Get arguments
sys.exit(code) → Exit program
print(..., file=sys.stderr) → Error output
os.path.exists(file) → Check file
os.path.join(a, b) → Join paths


Best Practices

Follow These Rules

1. Always use with for files
2. Validate command line arguments
3. Handle missing files gracefully
4. Use sys.exit(1) for errors
5. Write errors to stderr
6. Use os.path for cross-platform paths


Solution Hints

💡
Exercise 1: Line Numbers

Use enumerate(f, start=1) when iterating

💡
Exercise 6: Search Tool

Check if word in line: for each line

💡
Exercise 9: Word Frequency

Use from collections import Counter and .most_common(10)
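The Counter hint in practice, as a small sketch (the sample sentence is just made-up data):

```python
from collections import Counter

text = "the cat sat on the mat the end"

# Counter maps each word to how many times it appears
counts = Counter(text.split())

# most_common(n) returns the n highest-count (word, count) pairs
print(counts.most_common(3))  # [('the', 3), ('cat', 1), ('sat', 1)]
```

For a real file, feed it `f.read().split()` instead of the hard-coded string.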

💡
Exercise 15: Grep Tool

Use re.search(pattern, line) for pattern matching

Recursive Functions

📖
Recursion

When a function calls itself to solve smaller versions of the same problem.

Classic example: Factorial (5! = 5 × 4 × 3 × 2 × 1)

def factorial(n):
    # Base case: stop condition
    if n == 0 or n == 1:
        return 1
    
    # Recursive case: call itself   
    return n * factorial(n - 1)

print(factorial(5))  # 120

How it works:

factorial(5) = 5 × factorial(4)
             = 5 × (4 × factorial(3))
             = 5 × (4 × (3 × factorial(2)))
             = 5 × (4 × (3 × (2 × factorial(1))))
             = 5 × (4 × (3 × (2 × 1)))
             = 120

Key parts of recursion:

📝
Recursion Checklist

1. Base case: When to stop
2. Recursive case: Call itself with simpler input
3. Progress: Each call must get closer to the base case

Another example: Countdown

def countdown(n):
    if n == 0:
        print("Blast off!")
        return
    print(n)
    countdown(n - 1)

countdown(3)
# Output: 3, 2, 1, Blast off!

⚠️
Watch Out

Deep recursion can cause memory issues. Python has a default recursion limit.
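You can inspect that limit with sys.getrecursionlimit() (and raise it cautiously with sys.setrecursionlimit()). A quick sketch showing what happens when recursion goes too deep:

```python
import sys

print(sys.getrecursionlimit())  # typically 1000

def countdown(n):
    if n == 0:
        return
    countdown(n - 1)

# Recursing deeper than the limit raises RecursionError
try:
    countdown(5000)
except RecursionError as e:
    print("Hit the limit:", e)
```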


Python Exceptions

Errors vs Bugs vs Exceptions

Syntax Errors

Errors in your code before it runs. Python can't even understand what you wrote.

# Missing colon
if True
    print("Hello")  # SyntaxError: expected ':'

# Unclosed parenthesis
print("Hello"  # SyntaxError: '(' was never closed

Fix: Correct the syntax. Python tells you exactly where the problem is.

Bugs

Your code runs, but it does the wrong thing. No error message - just incorrect behavior.

# Bug: wrong formula
def circle_area(radius):
    return 2 * 3.14 * radius  # Wrong! This is circumference, not area

print(circle_area(5))  # Returns 31.4, should be 78.5

Why "bug"? Legend says early computers had actual insects causing problems. The term stuck.

Fix: Debug your code - find and fix the logic error.

Exceptions

Errors that occur during execution. The code is syntactically correct, but something goes wrong at runtime.

# Runs fine until...
x = 10 / 0  # ZeroDivisionError: division by zero

# Or...
my_list = [1, 2, 3]
print(my_list[10])  # IndexError: list index out of range

Fix: Handle the exception or prevent the error condition.


What is an Exception?

An exception is Python's way of saying "something unexpected happened and I can't continue."

When an exception occurs:

  1. Python stops normal execution
  2. Creates an exception object with error details
  3. Looks for code to handle it
  4. If no handler found, program crashes with traceback

# Exception in action
print("Start")
x = 10 / 0  # Exception here!
print("End")  # Never reached

# Output:
# Start
# Traceback (most recent call last):
#   File "example.py", line 2, in <module>
#     x = 10 / 0
# ZeroDivisionError: division by zero

Common Exceptions

# ZeroDivisionError
10 / 0

# TypeError - wrong type
"hello" + 5

# ValueError - right type, wrong value
int("hello")

# IndexError - list index out of range
[1, 2, 3][10]

# KeyError - dictionary key not found
{'a': 1}['b']

# FileNotFoundError
open("nonexistent.txt")

# AttributeError - object has no attribute
"hello".append("!")

# NameError - variable not defined
print(undefined_variable)

# ImportError - module not found
import nonexistent_module

Handling Exceptions

Basic try/except

try:
    x = 10 / 0
except:
    print("Something went wrong!")

# Output: Something went wrong!

Problem: This catches ALL exceptions - even ones you didn't expect. Not recommended.

Catching Specific Exceptions

try:
    x = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero!")

# Output: Cannot divide by zero!

Catching Multiple Specific Exceptions

try:
    value = int(input("Enter a number: "))
    result = 10 / value
except ValueError:
    print("That's not a valid number!")
except ZeroDivisionError:
    print("Cannot divide by zero!")

Catching Multiple Exceptions Together

try:
    # Some risky code
    pass
except (ValueError, TypeError):
    print("Value or Type error occurred!")

Getting Exception Details

try:
    x = 10 / 0
except ZeroDivisionError as e:
    print(f"Error: {e}")
    print(f"Type: {type(e).__name__}")

# Output:
# Error: division by zero
# Type: ZeroDivisionError

The Complete try/except/else/finally

try:
    # Code that might raise an exception
    result = 10 / 2
except ZeroDivisionError:
    # Runs if exception occurs
    print("Cannot divide by zero!")
else:
    # Runs if NO exception occurs
    print(f"Result: {result}")
finally:
    # ALWAYS runs, exception or not
    print("Cleanup complete")

# Output:
# Result: 5.0
# Cleanup complete

When to Use Each Part

Block     When It Runs          Use For
try       Always attempts       Code that might fail
except    If exception occurs   Handle the error
else      If NO exception       Code that depends on try success
finally   ALWAYS                Cleanup (close files, connections)

finally is Guaranteed

def risky_function():
    try:
        return 10 / 0
    except ZeroDivisionError:
        return "Error!"
    finally:
        print("This ALWAYS prints!")

result = risky_function()
# Output: This ALWAYS prints!
# result = "Error!"

Best Practices

1. Be Specific - Don't Catch Everything

# BAD - catches everything, hides bugs
try:
    do_something()
except:
    pass

# GOOD - catches only what you expect
try:
    do_something()
except ValueError:
    handle_value_error()

2. Don't Silence Exceptions Without Reason

# BAD - silently ignores errors
try:
    important_operation()
except Exception:
    pass  # What went wrong? We'll never know!

# GOOD - at least log it
try:
    important_operation()
except Exception as e:
    print(f"Error occurred: {e}")
    # or use logging.error(e)

3. Use else for Code That Depends on try Success

# Less clear
try:
    file = open("data.txt")
    content = file.read()
    process(content)
except FileNotFoundError:
    print("File not found")

# More clear - separate "risky" from "safe" code
try:
    file = open("data.txt")
except FileNotFoundError:
    print("File not found")
else:
    content = file.read()
    process(content)

4. Use finally for Cleanup

file = None
try:
    file = open("data.txt")
    content = file.read()
except FileNotFoundError:
    print("File not found")
finally:
    if file:
        file.close()  # Always close, even if error

# Even better - use context manager
with open("data.txt") as file:
    content = file.read()  # Automatically closes!

5. Catch Exceptions at the Right Level

# Don't catch too early
def read_config():
    # Let the caller handle missing file
    with open("config.txt") as f:
        return f.read()

# Catch at appropriate level
def main():
    try:
        config = read_config()
    except FileNotFoundError:
        print("Config file missing, using defaults")
        config = get_defaults()

Raising Exceptions

Use raise to throw your own exceptions:

def divide(a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero!")
    return a / b

try:
    result = divide(10, 0)
except ValueError as e:
    print(e)  # Cannot divide by zero!

Re-raising Exceptions

try:
    risky_operation()
except ValueError:
    print("Logging this error...")
    raise  # Re-raise the same exception
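Related to re-raising: you can wrap a low-level error in your own exception while keeping the original attached, using the standard `raise ... from` syntax. The chained exception ends up in `__cause__`:

```python
def load_number(text):
    try:
        return int(text)
    except ValueError as e:
        # Chain: the original ValueError is preserved as __cause__
        raise RuntimeError(f"Bad input: {text!r}") from e

try:
    load_number("abc")
except RuntimeError as e:
    print(e)            # Bad input: 'abc'
    print(e.__cause__)  # invalid literal for int() with base 10: 'abc'
```

Tracebacks then show both errors, joined by "The above exception was the direct cause of the following exception", which makes debugging much easier.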

Built-in Exception Hierarchy

All exceptions inherit from BaseException. Here's the hierarchy:

BaseException
├── SystemExit
├── KeyboardInterrupt
├── GeneratorExit
└── Exception
    ├── StopIteration
    ├── ArithmeticError
    │   ├── FloatingPointError
    │   ├── OverflowError
    │   └── ZeroDivisionError
    ├── AssertionError
    ├── AttributeError
    ├── BufferError
    ├── EOFError
    ├── ImportError
    │   └── ModuleNotFoundError
    ├── LookupError
    │   ├── IndexError
    │   └── KeyError
    ├── MemoryError
    ├── NameError
    │   └── UnboundLocalError
    ├── OSError
    │   ├── FileExistsError
    │   ├── FileNotFoundError
    │   ├── IsADirectoryError
    │   ├── NotADirectoryError
    │   ├── PermissionError
    │   └── TimeoutError
    ├── ReferenceError
    ├── RuntimeError
    │   ├── NotImplementedError
    │   └── RecursionError
    ├── SyntaxError
    │   └── IndentationError
    │       └── TabError
    ├── TypeError
    └── ValueError
        └── UnicodeError
            ├── UnicodeDecodeError
            ├── UnicodeEncodeError
            └── UnicodeTranslateError

Why Hierarchy Matters

Catching a parent catches all children:

# Catches ZeroDivisionError, OverflowError, FloatingPointError
try:
    result = 10 / 0
except ArithmeticError:
    print("Math error!")

# Catches IndexError and KeyError
try:
    my_list[100]
except LookupError:
    print("Lookup failed!")

Tip: Catch Exception instead of bare except: - it doesn't catch KeyboardInterrupt or SystemExit.

# Better than bare except
try:
    do_something()
except Exception as e:
    print(f"Error: {e}")

User-Defined Exceptions

Create custom exceptions by inheriting from Exception:

Basic Custom Exception

class InvalidDNAError(Exception):
    """Raised when DNA sequence contains invalid characters"""
    pass

def validate_dna(sequence):
    valid_bases = set("ATGC")
    for base in sequence.upper():
        if base not in valid_bases:
            raise InvalidDNAError(f"Invalid base: {base}")
    return True

try:
    validate_dna("ATGXCCC")
except InvalidDNAError as e:
    print(f"Invalid DNA: {e}")

Custom Exception with Attributes

class InsufficientFundsError(Exception):
    """Raised when account has insufficient funds"""
    
    def __init__(self, balance, amount):
        self.balance = balance
        self.amount = amount
        self.shortage = amount - balance
        super().__init__(
            f"Cannot withdraw ${amount}. "
            f"Balance: ${balance}. "
            f"Short by: ${self.shortage}"
        )

class BankAccount:
    def __init__(self, balance):
        self.balance = balance
    
    def withdraw(self, amount):
        if amount > self.balance:
            raise InsufficientFundsError(self.balance, amount)
        self.balance -= amount
        return amount

# Usage
account = BankAccount(100)
try:
    account.withdraw(150)
except InsufficientFundsError as e:
    print(e)
    print(f"You need ${e.shortage} more")

# Output:
# Cannot withdraw $150. Balance: $100. Short by: $50
# You need $50 more

Exception Hierarchy for Your Project

# Base exception for your application
class BioinformaticsError(Exception):
    """Base exception for bioinformatics operations"""
    pass

# Specific exceptions
class SequenceError(BioinformaticsError):
    """Base for sequence-related errors"""
    pass

class InvalidDNAError(SequenceError):
    """Invalid DNA sequence"""
    pass

class InvalidRNAError(SequenceError):
    """Invalid RNA sequence"""
    pass

class AlignmentError(BioinformaticsError):
    """Sequence alignment failed"""
    pass

# Now you can catch at different levels
try:
    process_sequence()
except InvalidDNAError:
    print("DNA issue")
except SequenceError:
    print("Some sequence issue")
except BioinformaticsError:
    print("General bioinformatics error")

Exercises

Exercise 1: Write code that catches a ZeroDivisionError and prints a friendly message.

Exercise 2: Ask user for a number, handle both ValueError (not a number) and ZeroDivisionError (if dividing by it).

Exercise 3: Write a function that opens a file and handles FileNotFoundError.

Exercise 4: Create a function that takes a list and index, returns the element, handles IndexError.

Exercise 5: Write code that handles KeyError when accessing a dictionary.

Exercise 6: Create a custom NegativeNumberError and raise it if a number is negative.

Exercise 7: Write a function that converts string to int, handling ValueError, and returns 0 on failure.

Exercise 8: Use try/except/else/finally to read a file and ensure it's always closed.

Exercise 9: Create a custom InvalidAgeError with min and max age attributes.

Exercise 10: Write a function that validates an email (must contain @), raise ValueError if invalid.

Exercise 11: Handle multiple exceptions: TypeError, ValueError, ZeroDivisionError in one block.

Exercise 12: Create a hierarchy: ValidationError → EmailError, PhoneError.

Exercise 13: Re-raise an exception after logging it.

Exercise 14: Create an InvalidSequenceError for DNA validation with the invalid character as an attribute.

Exercise 15: Write a "safe divide" function that returns None on any error instead of crashing.

Solutions

# Exercise 1
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero!")

# Exercise 2
try:
    num = int(input("Enter a number: "))
    result = 100 / num
    print(f"100 / {num} = {result}")
except ValueError:
    print("That's not a valid number!")
except ZeroDivisionError:
    print("Cannot divide by zero!")

# Exercise 3
def read_file(filename):
    try:
        with open(filename) as f:
            return f.read()
    except FileNotFoundError:
        print(f"File '{filename}' not found")
        return None

# Exercise 4
def safe_get(lst, index):
    try:
        return lst[index]
    except IndexError:
        print(f"Index {index} out of range")
        return None

# Exercise 5
d = {'a': 1, 'b': 2}
try:
    value = d['c']
except KeyError:
    print("Key not found!")
    value = None

# Exercise 6
class NegativeNumberError(Exception):
    pass

def check_positive(n):
    if n < 0:
        raise NegativeNumberError(f"{n} is negative!")
    return n

# Exercise 7
def safe_int(s):
    try:
        return int(s)
    except ValueError:
        return 0

# Exercise 8
file = None
try:
    file = open("data.txt")
    content = file.read()
except FileNotFoundError:
    print("File not found")
    content = ""
else:
    print("File read successfully")
finally:
    if file:
        file.close()
    print("Cleanup done")

# Exercise 9
class InvalidAgeError(Exception):
    def __init__(self, age, min_age=0, max_age=150):
        self.age = age
        self.min_age = min_age
        self.max_age = max_age
        super().__init__(f"Age {age} not in range [{min_age}, {max_age}]")

# Exercise 10
def validate_email(email):
    if '@' not in email:
        raise ValueError(f"Invalid email: {email} (missing @)")
    return True

# Exercise 11
try:
    # risky code
    pass
except (TypeError, ValueError, ZeroDivisionError) as e:
    print(f"Error: {e}")

# Exercise 12
class ValidationError(Exception):
    pass

class EmailError(ValidationError):
    pass

class PhoneError(ValidationError):
    pass

# Exercise 13
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Logging: Division by zero occurred")
    raise

# Exercise 14
class InvalidSequenceError(Exception):
    def __init__(self, sequence, invalid_char):
        self.sequence = sequence
        self.invalid_char = invalid_char
        super().__init__(f"Invalid character '{invalid_char}' in sequence")

def validate_dna(seq):
    for char in seq:
        if char not in "ATGC":
            raise InvalidSequenceError(seq, char)
    return True

# Exercise 15
def safe_divide(a, b):
    try:
        return a / b
    except Exception:
        return None

print(safe_divide(10, 2))   # 5.0
print(safe_divide(10, 0))   # None
print(safe_divide("a", 2))  # None

Summary

Concept            Description
Syntax Error       Code is malformed, won't run
Bug                Code runs but gives wrong result
Exception          Runtime error, can be handled
try/except         Catch and handle exceptions
else               Runs if no exception
finally            Always runs (cleanup)
raise              Throw an exception
Custom Exception   Inherit from Exception

Best Practices:

  1. Catch specific exceptions, not bare except:
  2. Don't silence exceptions without reason
  3. Use finally for cleanup
  4. Create custom exceptions for your domain
  5. Build exception hierarchies for complex projects

Debugging

Theory

PyCharm Debug Tutorial

Using the IDLE Debugger
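Beyond IDE debuggers, Python ships one in the standard library: calling the built-in breakpoint() (Python 3.7+) drops you into pdb at that line, where you can print variables, step, and continue. A sketch with the call commented out so the script still runs non-interactively (the function and its bug are made up for illustration):

```python
def average(numbers):
    total = sum(numbers)
    # breakpoint()  # uncomment to pause here and inspect total / numbers in pdb
    return total / len(numbers)

print(average([1, 2, 3]))  # 2.0
```

Inside pdb, `p total` prints a value, `n` steps to the next line, and `c` continues execution.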

Python Dictionaries

What is a Dictionary?

A dictionary stores data as key-value pairs.

# Basic structure
student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

# Access by key
print(student['name'])   # Alex
print(student['age'])    # 20

Creating Dictionaries

# Empty dictionary
empty = {}

# With initial values
person = {'name': 'Alex', 'age': 20}

# Using dict() constructor
person = dict(name='Alex', age=20)

Basic Operations

Adding and Modifying

student = {'name': 'Alex', 'age': 20}

# Add new key
student['major'] = 'CS'
print(student)  # {'name': 'Alex', 'age': 20, 'major': 'CS'}

# Modify existing value
student['age'] = 21
print(student)  # {'name': 'Alex', 'age': 21, 'major': 'CS'}

Deleting

student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

# Delete specific key
del student['major']
print(student)  # {'name': 'Alex', 'age': 20}

# Remove and return value
age = student.pop('age')
print(age)      # 20
print(student)  # {'name': 'Alex'}

Getting Values Safely

student = {'name': 'Alex', 'age': 20}

# Direct access - raises error if key missing
print(student['name'])      # Alex
# print(student['grade'])   # KeyError!

# Safe access with .get() - returns None if missing
print(student.get('name'))   # Alex
print(student.get('grade'))  # None

# Provide default value
print(student.get('grade', 'N/A'))  # N/A

Useful Methods

student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

# Get all keys
print(student.keys())    # dict_keys(['name', 'age', 'major'])

# Get all values
print(student.values())  # dict_values(['Alex', 20, 'CS'])

# Get all key-value pairs
print(student.items())   # dict_items([('name', 'Alex'), ('age', 20), ('major', 'CS')])

# Get length
print(len(student))      # 3

Membership Testing

Use in to check if a key exists (not value!):

student = {'name': 'Alex', 'age': 20}

# Check if key exists
print('name' in student)     # True
print('grade' in student)    # False

# Check if key does NOT exist
print('grade' not in student)  # True

# To check values, use .values()
print('Alex' in student.values())  # True
print(20 in student.values())      # True

Important: Checking in on a dictionary is O(1) - instant! This is why dictionaries are so powerful.


Looping Through Dictionaries

Loop Over Keys (Default)

student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

# Default: loops over keys
for key in student:
    print(key)
# name
# age
# major

# Explicit (same result)
for key in student.keys():
    print(key)

Loop Over Values

student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

for value in student.values():
    print(value)
# Alex
# 20
# CS

Loop Over Keys and Values Together

student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

for key, value in student.items():
    print(f"{key}: {value}")
# name: Alex
# age: 20
# major: CS

Loop With Index Using enumerate()

student = {'name': 'Alex', 'age': 20, 'major': 'CS'}

for index, key in enumerate(student):
    print(f"{index}: {key} = {student[key]}")
# 0: name = Alex
# 1: age = 20
# 2: major = CS

# Or with items()
for index, (key, value) in enumerate(student.items()):
    print(f"{index}: {key} = {value}")

Dictionary Order

Python 3.7+: Dictionaries maintain insertion order.

# Items stay in the order you add them
d = {}
d['first'] = 1
d['second'] = 2
d['third'] = 3

for key in d:
    print(key)
# first
# second
# third  (guaranteed order!)

Note: Before Python 3.7, dictionary order was not guaranteed. If you need to support older Python, don't rely on order.

Important: While keys maintain insertion order, this doesn't mean dictionaries are sorted. They just remember the order you added things.

# Not sorted - just insertion order
d = {'c': 3, 'a': 1, 'b': 2}
print(list(d.keys()))  # ['c', 'a', 'b'] - insertion order, not alphabetical
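If you do want sorted output, sort at display time; the dictionary itself stays in insertion order:

```python
d = {'c': 1, 'a': 3, 'b': 2}

# Sorted keys (sorted() iterates over the keys by default)
print(sorted(d))  # ['a', 'b', 'c']

# A new dict sorted by key
print(dict(sorted(d.items())))  # {'a': 3, 'b': 2, 'c': 1}

# Key-value pairs sorted by value instead
print(sorted(d.items(), key=lambda kv: kv[1]))  # [('c', 1), ('b', 2), ('a', 3)]
```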

Complex Values

Lists as Values

student = {
    'name': 'Alex',
    'courses': ['Math', 'Physics', 'CS']
}

# Access list items
print(student['courses'][0])  # Math

# Modify list
student['courses'].append('Biology')
print(student['courses'])  # ['Math', 'Physics', 'CS', 'Biology']

Nested Dictionaries

students = {
    1: {'name': 'Alex', 'age': 20},
    2: {'name': 'Maria', 'age': 22},
    3: {'name': 'Jordan', 'age': 21}
}

# Access nested values
print(students[1]['name'])  # Alex
print(students[2]['age'])   # 22

# Modify nested values
students[3]['age'] = 22

# Add new entry
students[4] = {'name': 'Casey', 'age': 19}

Why Dictionaries Are Fast: Hashing

Dictionaries use hashing to achieve O(1) lookup time.

How it works:

  1. When you add a key, Python computes a hash (a number) from the key
  2. This hash tells Python exactly where to store the value in memory
  3. When you look up the key, Python computes the same hash and goes directly to that location

Result: Looking up a key takes the same time whether your dictionary has 10 items or 10 million items.

# List: O(n) - must check each element
my_list = [2, 7, 11, 15]
if 7 in my_list:  # Checks: 2? no. 7? yes! (2 checks)
    print("Found")

# Dictionary: O(1) - instant lookup
my_dict = {2: 'a', 7: 'b', 11: 'c', 15: 'd'}
if 7 in my_dict:  # Goes directly to location (1 check)
    print("Found")
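You can see the number Python computes with the built-in hash(). Note that string hashes are randomized per interpreter run, so the exact integer varies, but it is stable within one run, which is all a dictionary needs:

```python
key = "ATG"

# Same key, same hash - within a single interpreter run
print(hash(key) == hash("ATG"))  # True

# Mutable objects are unhashable, which is why they can't be dict keys
try:
    hash([1, 2, 3])
except TypeError as e:
    print(e)  # unhashable type: 'list'
```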

Practical Example: Two Sum Problem

Problem: Find two numbers that add up to a target.

Slow approach (nested loops - O(n²)):

nums = [2, 7, 11, 15]
target = 9

for i in range(len(nums)):
    for j in range(i + 1, len(nums)):
        if nums[i] + nums[j] == target:
            print([i, j])  # [0, 1]

Fast approach (dictionary - O(n)):

nums = [2, 7, 11, 15]
target = 9
seen = {}

for i, num in enumerate(nums):
    complement = target - num
    if complement in seen:
        print([seen[complement], i])  # [0, 1]
    else:
        seen[num] = i

Why it's faster:

  • We loop once through the array
  • For each number, we check if its complement exists (O(1) lookup)
  • Total: O(n) instead of O(n²)

Trace through:

i=0, num=2: complement=7, not in seen, add {2: 0}
i=1, num=7: complement=2, IS in seen at index 0, return [0, 1]

Exercises

Exercise 1: Create a dictionary of 5 countries and their capitals. Print each country and its capital.

Exercise 2: Write a program that counts how many times each character appears in a string.

Exercise 3: Given a list of numbers, create a dictionary where keys are numbers and values are their squares.

Exercise 4: Create a program that stores product names and prices. Let the user look up prices by product name.

Exercise 5: Given a 5×5 list of numbers, count how many times each number appears and print the three most common.

Exercise 6: DNA pattern matching - given a list of DNA sequences and a pattern with wildcards (*), find matching sequences:

sequences = ['ATGCATGC', 'ATGGATGC', 'TTGCATGC']
pattern = 'ATG*ATGC'  # * matches any character
# Should match: 'ATGCATGC', 'ATGGATGC'

Solutions
# Exercise 1
capitals = {'France': 'Paris', 'Japan': 'Tokyo', 'Italy': 'Rome', 
            'Egypt': 'Cairo', 'Brazil': 'Brasilia'}
for country, capital in capitals.items():
    print(f"{country}: {capital}")

# Exercise 2
text = "hello world"
char_count = {}
for char in text:
    char_count[char] = char_count.get(char, 0) + 1
print(char_count)

# Exercise 3
numbers = [1, 2, 3, 4, 5]
squares = {n: n**2 for n in numbers}
print(squares)  # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

# Exercise 4
products = {}
while True:
    name = input("Product name (or 'done'): ")
    if name == 'done':
        break
    price = float(input("Price: "))
    products[name] = price

while True:
    lookup = input("Look up product (or 'quit'): ")
    if lookup == 'quit':
        break
    print(products.get(lookup, "Product not found"))

# Exercise 5
import random
grid = [[random.randint(1, 10) for _ in range(5)] for _ in range(5)]
counts = {}
for row in grid:
    for num in row:
        counts[num] = counts.get(num, 0) + 1
# Sort by count and get top 3
top3 = sorted(counts.items(), key=lambda x: x[1], reverse=True)[:3]
print("Top 3:", top3)

# Exercise 6
sequences = ['ATGCATGC', 'ATGGATGC', 'TTGCATGC']
pattern = 'ATG*ATGC'

for seq in sequences:
    match = True
    for i, char in enumerate(pattern):
        if char != '*' and char != seq[i]:
            match = False
            break
    if match:
        print(f"Match: {seq}")

Summary

Operation          Syntax               Time
Create             d = {'a': 1}         O(1)
Access             d['key']             O(1)
Add/Modify         d['key'] = value     O(1)
Delete             del d['key']         O(1)
Check key exists   'key' in d           O(1)
Get all keys       d.keys()             O(1)
Get all values     d.values()           O(1)
Loop               for k in d           O(n)

Key takeaways:

  • Dictionaries are fast for lookups (O(1))
  • Use .get() for safe access with default values
  • Loop with .items() to get both keys and values
  • Python 3.7+ maintains insertion order
  • Perfect for counting, caching, and mapping data

Regular Expressions in Python

📖
What are Regular Expressions?

Regular expressions (regex) are powerful patterns used to search, match, and manipulate text. You can find patterns, not just exact text.

Examples:

  • Find all email addresses in a document
  • Validate phone numbers
  • Extract gene IDs from biological data
  • Find DNA/RNA sequence patterns
  • Clean messy text data

Getting Started

Import the Module

import re
💡
Always Use Raw Strings

Write regex patterns with the r prefix: r"pattern"

Why Raw Strings Matter

# Normal string - \n becomes a newline
print("Hello\nWorld")
# Output:
# Hello
# World

# Raw string - \n stays as literal characters
print(r"Hello\nWorld")
# Output: Hello\nWorld

In regex, backslashes are special! Raw strings prevent confusion:

# ❌ Confusing without raw string
pattern = "\\d+"

# ✅ Clean with raw string
pattern = r"\d+"
Golden Rule

Always write regex patterns as raw strings: r"pattern"


Level 1: Literal Matching

The simplest regex matches exact text.

import re

dna = "ATGCGATCG"

# Search for exact text "ATG"
if re.search(r"ATG", dna):
    print("Found ATG!")

Your First Function: re.search()

ℹ️
re.search(pattern, text)

Looks for a pattern anywhere in text. Returns a match object if found, None if not.

match = re.search(r"ATG", "ATGCCC")
if match:
    print("Found:", match.group())    # Found: ATG
    print("Position:", match.start())  # Position: 0
⚠️
Case Sensitive

Regex is case-sensitive by default! "ATG" ≠ "atg"
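If you do want case-insensitive matching, the re functions accept a flags argument (a side note beyond the basics above):

```python
import re

# re.IGNORECASE makes the pattern match regardless of case
print(bool(re.search(r"atg", "ATGCCC", re.IGNORECASE)))  # True
print(re.findall(r"atg", "ATG atg Atg", re.IGNORECASE))  # ['ATG', 'atg', 'Atg']
```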

Practice

💻
Exercise 1.1

Find which sequences contain "ATG": ["ATGCCC", "TTTAAA", "ATGATG"]

💻
Exercise 1.2

Check if "PYTHON" appears in: "I love PYTHON programming"


Level 2: The Dot . - Match Any Character

The dot . matches any single character (except newline).

# Find "A" + any character + "G"
dna = "ATGCACG"
matches = re.findall(r"A.G", dna)
print(matches)  # ['ATG', 'ACG']

New Function: re.findall()

ℹ️
re.findall(pattern, text)

Finds all matches and returns them as a list.

text = "cat bat rat"
print(re.findall(r".at", text))  # ['cat', 'bat', 'rat']

Practice

💻
Exercise 2.1

Match "b.t" (b + any char + t) in: "bat bet bit bot but"

💻
Exercise 2.2

Find all 3-letter patterns starting with 'c' in: "cat cow cup car"


Level 3: Character Classes [ ]

Square brackets let you specify which characters to match.

# Match any nucleotide (A, T, G, or C)
dna = "ATGCXYZ"
nucleotides = re.findall(r"[ATGC]", dna)
print(nucleotides)  # ['A', 'T', 'G', 'C']

Character Ranges

Use - for ranges:

re.findall(r"[0-9]", "Room 123")      # ['1', '2', '3']
re.findall(r"[a-z]", "Hello")         # ['e', 'l', 'l', 'o']
re.findall(r"[A-Z]", "Hello")         # ['H']
re.findall(r"[A-Za-z]", "Hello123")   # ['H', 'e', 'l', 'l', 'o']

Negation with ^

^ inside brackets means "NOT these characters":

# Match anything that's NOT a nucleotide
dna = "ATGC-X123"
non_nucleotides = re.findall(r"[^ATGC]", dna)
print(non_nucleotides)  # ['-', 'X', '1', '2', '3']

Practice

💻
Exercise 3.1

Find all digits in: "Gene ID: ABC123"

💻
Exercise 3.2

Find all vowels in: "bioinformatics"

💻
Exercise 3.3

Find all NON-digits in: "Room123"


Level 4: Quantifiers - Repeating Patterns

Quantifiers specify how many times a pattern repeats.

📝
Quantifier Reference

* → 0 or more times
+ → 1 or more times
? → 0 or 1 time (optional)
{n} → Exactly n times
{n,m} → Between n and m times

Examples

# Find sequences of 2+ C's
dna = "ATGCCCAAAGGG"
print(re.findall(r"C+", dna))       # ['CCC']
print(re.findall(r"C{2,}", dna))    # ['CCC']

# Find all digit groups
text = "Call 123 or 4567"
print(re.findall(r"\d+", text))     # ['123', '4567']

# Optional minus sign
print(re.findall(r"-?\d+", "123 -456 789"))  # ['123', '-456', '789']
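One subtlety worth knowing (not covered above): quantifiers are greedy by default, grabbing as much text as they can. Adding ? after a quantifier makes it lazy, stopping at the first possible match:

```python
import re

html = "<b>bold</b>"

# Greedy: .+ expands to the longest possible match
print(re.findall(r"<.+>", html))   # ['<b>bold</b>']

# Lazy: .+? stops at the first closing bracket
print(re.findall(r"<.+?>", html))  # ['<b>', '</b>']
```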

Combining with Character Classes

# Find all 3-letter codons
dna = "ATGCCCAAATTT"
codons = re.findall(r"[ATGC]{3}", dna)
print(codons)  # ['ATG', 'CCC', 'AAA', 'TTT']

Practice

💻
Exercise 4.1

Find sequences of exactly 3 A's in: "ATGCCCAAAGGGTTT"

💻
Exercise 4.2

Match "colou?r" (u is optional) in: "color colour"

💻
Exercise 4.3

Find all digit sequences in: "123 4567 89"


Level 5: Escaping Special Characters

Special characters like . * + ? [ ] ( ) have special meanings. To match them literally, escape with \.

# ❌ Wrong - dot matches ANY character
text = "file.txt and fileXtxt"
print(re.findall(r"file.txt", text))  # ['file.txt', 'fileXtxt']

# ✅ Correct - escaped dot matches only literal dot
print(re.findall(r"file\.txt", text))  # ['file.txt']

Common Examples

re.search(r"\$100", "$100")           # Literal dollar sign
re.search(r"What\?", "What?")         # Literal question mark
re.search(r"C\+\+", "C++")            # Literal plus signs
re.search(r"\(test\)", "(test)")      # Literal parentheses

Practice

💻
Exercise 5.1

Match "data.txt" (with literal dot) in: "File: data.txt"

💻
Exercise 5.2

Match "c++" in: "I code in c++ and python"


Level 6: Predefined Shortcuts

Python provides shortcuts for common character types.

📝
Common Shortcuts

\d → Any digit [0-9]
\D → Any non-digit
\w → Word character [A-Za-z0-9_]
\W → Non-word character
\s → Whitespace (space, tab, newline)
\S → Non-whitespace

Examples

# Find all digits
text = "Room 123, Floor 4"
print(re.findall(r"\d+", text))  # ['123', '4']

# Find all words
sentence = "DNA_seq-123 test"
print(re.findall(r"\w+", sentence))  # ['DNA_seq', '123', 'test']

# Split on whitespace
data = "ATG  CCC\tAAA"
print(re.split(r"\s+", data))  # ['ATG', 'CCC', 'AAA']

Practice

💻
Exercise 6.1

Find all word characters in: "Hello-World"

💻
Exercise 6.2

Split on whitespace: "ATG CCC\tAAA"


Level 7: Anchors - Position Matching

Anchors match positions, not characters.

📝
Anchor Reference

^ → Start of string
$ → End of string
\b → Word boundary
\B → Not a word boundary

Examples

dna = "ATGCCCATG"

# Match only at start
print(re.search(r"^ATG", dna))   # Matches!
print(re.search(r"^CCC", dna))   # None

# Match only at end
print(re.search(r"ATG$", dna))   # Matches!
print(re.search(r"CCC$", dna))   # None

# Word boundaries - whole words only
text = "The cat concatenated strings"
print(re.findall(r"\bcat\b", text))  # ['cat'] - only the word
print(re.findall(r"cat", text))      # ['cat', 'cat'] - both

Practice

💻
Exercise 7.1

Find sequences starting with "ATG": ["ATGCCC", "CCCATG", "TACATG"]

💻
Exercise 7.2

Match whole word "cat" (not "concatenate") in: "The cat sat"


Level 8: Alternation - OR Operator |

The pipe | means "match this OR that".

# Match either ATG or AUG
dna = "ATG is DNA, AUG is RNA"
print(re.findall(r"ATG|AUG", dna))  # ['ATG', 'AUG']

# Match stop codons
rna = "AUGCCCUAAUAGUGA"
print(re.findall(r"UAA|UAG|UGA", rna))  # ['UAA', 'UAG', 'UGA']

Practice

💻
Exercise 8.1

Match "email" or "phone" in: "Contact via email or phone"

💻
Exercise 8.2

Find stop codons (TAA, TAG, TGA) in: ["ATG", "TAA", "TAG"]


Level 9: Groups and Capturing ( )

Parentheses create groups you can extract separately.

# Extract parts of an email
email = "user@example.com"
match = re.search(r"(\w+)@(\w+)\.(\w+)", email)
if match:
    print("Username:", match.group(1))   # user
    print("Domain:", match.group(2))     # example
    print("TLD:", match.group(3))        # com
    print("Full:", match.group(0))       # user@example.com

Named Groups

Use (?P<name>...) for readable names:

gene_id = "NM_001234"
match = re.search(r"(?P<prefix>[A-Z]+)_(?P<number>\d+)", gene_id)
if match:
    print(match.group('prefix'))  # NM
    print(match.group('number'))  # 001234
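Named groups also work with match.groupdict(), which returns all named captures as a dictionary:

```python
import re

gene_id = "NM_001234"
match = re.search(r"(?P<prefix>[A-Z]+)_(?P<number>\d+)", gene_id)
if match:
    # All named groups at once, as a dict
    print(match.groupdict())  # {'prefix': 'NM', 'number': '001234'}
```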

Practice

💻
Exercise 9.1

Extract area code from: "Call 123-456-7890"

💻
Exercise 9.2

Extract year, month, day from: "2024-11-20"


Level 10: More Useful Functions

re.sub() - Find and Replace

# Mask stop codons
dna = "ATGTAACCC"
masked = re.sub(r"TAA|TAG|TGA", "XXX", dna)
print(masked)  # ATGXXXCCC

# Clean multiple spaces
text = "too    many     spaces"
clean = re.sub(r"\s+", " ", text)
print(clean)  # "too many spaces"
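re.sub also accepts a function as the replacement: it receives each match object and returns the replacement text. A small sketch:

```python
import re

# Lowercase every run of three uppercase letters by transforming each match
dna = "atgTAAcccTAG"
result = re.sub(r"[A-Z]{3}", lambda m: m.group().lower(), dna)
print(result)  # atgtaaccctag
```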

re.compile() - Reusable Patterns

# Compile once, use many times (more efficient!)
pattern = re.compile(r"ATG")

for seq in ["ATGCCC", "TTTAAA", "GCGCGC"]:
    if pattern.search(seq):
        print(f"{seq} contains ATG")

Practice

💻
Exercise 10.1

Replace all A's with N's in: "ATGCCCAAA"

💻
Exercise 10.2

Mask all digits with "X" in: "Room123Floor4"


Biological Examples

💡
Real Applications

Here's how regex is used in bioinformatics!

Validate DNA Sequences

def is_valid_dna(sequence):
    """Check if sequence contains only A, T, G, C"""
    return bool(re.match(r"^[ATGC]+$", sequence))

print(is_valid_dna("ATGCCC"))  # True
print(is_valid_dna("ATGXCC"))  # False
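A note on the function above: it uses re.match, which wasn't introduced earlier. Unlike re.search, re.match only matches at the start of the string; re.fullmatch (Python 3.4+) must match the entire string, which makes the ^ and $ anchors unnecessary:

```python
import re

# re.match anchors at the START of the string (but not the end)
print(bool(re.match(r"ATG", "ATGCCC")))   # True
print(bool(re.match(r"CCC", "ATGCCC")))   # False (CCC is not at the start)

# re.fullmatch must consume the WHOLE string - no ^ or $ needed
print(bool(re.fullmatch(r"[ATGC]+", "ATGCCC")))  # True
print(bool(re.fullmatch(r"[ATGC]+", "ATGXCC")))  # False
```

So is_valid_dna could equally be written with re.fullmatch(r"[ATGC]+", sequence).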

Find Restriction Sites

def find_ecori(dna):
    """Find EcoRI recognition sites (GAATTC)"""
    matches = re.finditer(r"GAATTC", dna)
    return [(m.start(), m.group()) for m in matches]

dna = "ATGGAATTCCCCGAATTC"
print(find_ecori(dna))  # [(3, 'GAATTC'), (12, 'GAATTC')]

Count Codons

def count_codons(dna):
    """Split DNA into codons (groups of 3)"""
    return re.findall(r"[ATGC]{3}", dna)

dna = "ATGCCCAAATTT"
print(count_codons(dna))  # ['ATG', 'CCC', 'AAA', 'TTT']

Extract Gene IDs

def extract_gene_ids(text):
    """Extract gene IDs like NM_123456"""
    return re.findall(r"[A-Z]{2}_\d+", text)

text = "Genes NM_001234 and XM_567890 are important"
print(extract_gene_ids(text))  # ['NM_001234', 'XM_567890']

Quick Reference

📝
Pattern Cheat Sheet

abc → Literal text
. → Any character
[abc] → Any of a, b, c
[^abc] → NOT a, b, c
[a-z] → Range
* → 0 or more
+ → 1 or more
? → 0 or 1 (optional)
{n} → Exactly n times
\d → Digit
\w → Word character
\s → Whitespace
^ → Start of string
$ → End of string
\b → Word boundary
| → OR
(...) → Capture group


Key Functions Summary

ℹ️
Function Reference

re.search(pattern, text) → Find first match
re.findall(pattern, text) → Find all matches
re.finditer(pattern, text) → Iterator of matches
re.sub(pattern, replacement, text) → Replace matches
re.split(pattern, text) → Split on pattern
re.compile(pattern) → Reusable pattern


Resources

Object-Oriented Programming V2

📖
What is OOP?

Object-Oriented Programming bundles data and the functions that work on that data into one unit called an object. Instead of data floating around with separate functions, everything lives together. Organized chaos.

The shift:

  • Before (imperative): Write instructions, use functions
  • Now (OOP): Create objects that contain both data AND behavior

You've Been Using OOP All Along

Plot twist: every data type in Python is already a class.

# Lists are objects
my_list = [1, 2, 3]
my_list.append(4)      # Method call
my_list.reverse()      # Another method
# Strings are objects
name = "hello"
name.upper()           # Method call
# Even integers are objects
x = 5
x.__add__(3)           # Same as x + 3
ℹ️
Pro Tip

Use help(list) or help(str) to see all methods of a class.


Level 1: Your First Class

The Syntax

class ClassName:
    # stuff goes here
    pass

A Simple Counter

Let's build step by step.

Step 1: Empty class

class Counter:
    pass

Step 2: The constructor

class Counter:
    def __init__(self, value):
        self.val = value

Step 3: A method

class Counter:
    def __init__(self, value):
        self.val = value
    
    def tick(self):
        self.val = self.val + 1

Step 4: More methods

class Counter:
    def __init__(self, value):
        self.val = value
    
    def tick(self):
        self.val = self.val + 1
    
    def reset(self):
        self.val = 0
    
    def value(self):
        return self.val

Step 5: Use it

c1 = Counter(0)
c2 = Counter(3)

c1.tick()
c2.tick()

print(c1.value())    # 1
print(c2.value())    # 4

Level 2: Understanding the Pieces

The Constructor: __init__

📝
What is __init__?

The constructor runs automatically when you create an object. It sets up the initial state.

def __init__(self, value):
    self.val = value

When you write Counter(5):

  1. Python creates a new Counter object
  2. Calls __init__ with value = 5
  3. Returns the object

The self Parameter

self = "this object I'm working on"

def tick(self):
    self.val = self.val + 1
  • self.val means "the val that belongs to THIS object"
  • Each object has its own copy of self.val
c1 = Counter(0)
c2 = Counter(100)

c1.tick()           # c1.val becomes 1
print(c2.value())   # Still 100, different object
⚠️
Don't Forget self!

Every method needs self as the first parameter. But when calling, you don't pass it — Python does that automatically.

# Defining: include self
def tick(self):
    ...

# Calling: don't include self
c1.tick()    # NOT c1.tick(c1)

Instance Variables

Variables attached to self are instance variables — each object gets its own copy.

def __init__(self, value):
    self.val = value      # Instance variable
    self.count = 0        # Another one

Level 3: Special Methods (Magic Methods)

Python has special method names that enable built-in behaviors.

__str__ — For print()

class Counter:
    def __init__(self, value):
        self.val = value
    
    def __str__(self):
        return f"Counter: {self.val}"
c = Counter(5)
print(c)    # Counter: 5

Without __str__, you'd get something ugly like <__main__.Counter object at 0x7f...>
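A related dunder worth knowing (though not required here) is __repr__, which controls how the object displays in the interactive shell and inside containers. If __str__ is missing, Python falls back to __repr__:

```python
class Counter:
    def __init__(self, value):
        self.val = value

    def __repr__(self):
        # Convention: return something that looks like the constructor call
        return f"Counter({self.val})"

c = Counter(5)
print(repr(c))   # Counter(5)
print([c])       # [Counter(5)] - containers use __repr__
print(c)         # Counter(5) - print falls back to __repr__ when __str__ is missing
```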

__add__ — For the + operator

class Counter:
    def __init__(self, value):
        self.val = value
    
    def __add__(self, other):
        return Counter(self.val + other.val)
c1 = Counter(3)
c2 = Counter(7)
c3 = c1 + c2        # Calls c1.__add__(c2)
print(c3.val)       # 10

__len__ — For len()

def __len__(self):
    return self.val
c = Counter(5)
print(len(c))    # 5

__getitem__ — For indexing [ ]

def __getitem__(self, index):
    return self.items[index]  # e.g. delegate to an internal list
📝
Common Special Methods

__init__ → Constructor
__str__ → print() and str()
__add__ → + operator
__sub__ → - operator
__mul__ → * operator
__eq__ → == operator
__len__ → len()
__getitem__ → obj[index]
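Of these, __eq__ deserves a quick demo: without it, == compares object identity, not contents. A sketch using the Counter class:

```python
class Counter:
    def __init__(self, value):
        self.val = value

    def __eq__(self, other):
        # Two counters are equal when their values are equal
        return self.val == other.val

print(Counter(5) == Counter(5))  # True - compares values
# Without __eq__, this would be False: two different objects in memory
```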


Level 4: Encapsulation

📖
Encapsulation

The idea that you should access data through methods, not directly. This lets you change the internals without breaking code that uses the class.

Bad (direct access):

c = Counter(5)
c.val = -100       # Directly messing with internal data

Good (through methods):

c = Counter(5)
c.reset()          # Using the provided interface
💡
Python's Approach

Python doesn't enforce encapsulation — it trusts you. Convention: prefix "private" variables with underscore: self._val
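If you want controlled access that still reads like attribute access, Python's built-in property decorator is the usual tool. A sketch of a Counter variant (beyond what the course requires):

```python
class Counter:
    def __init__(self, value):
        self._val = value  # underscore: "internal, please use the interface"

    @property
    def value(self):
        # Read access: c.value (no parentheses)
        return self._val

    @value.setter
    def value(self, new_val):
        # Write access with validation: c.value = 10
        if new_val < 0:
            raise ValueError("Counter cannot be negative")
        self._val = new_val

c = Counter(5)
print(c.value)   # 5
c.value = 10     # goes through the setter, validation included
```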


Level 5: Designing Classes

When creating a class, think about:

Question                     Becomes
What thing am I modeling?    Class name
What data does it have?      Instance variables
What can it do?              Methods

Example: Student

  • Class: Student
  • Data: name, age, grades
  • Behavior: add_grade(), average(), pass_course()

Full Example: Card Deck

Let's see a more complex class.

Step 1: Constructor — Create all cards

class Deck:
    def __init__(self):
        self.cards = []
        for num in range(1, 11):
            for suit in ["Clubs", "Spades", "Hearts", "Diamonds"]:
                self.cards.append((num, suit))

Step 2: Shuffle method

from random import randint

class Deck:
    def __init__(self):
        self.cards = []
        for num in range(1, 11):
            for suit in ["Clubs", "Spades", "Hearts", "Diamonds"]:
                self.cards.append((num, suit))
    
    def shuffle(self):
        for i in range(200):
            x = randint(0, len(self.cards) - 1)
            y = randint(0, len(self.cards) - 1)
            # Swap
            self.cards[x], self.cards[y] = self.cards[y], self.cards[x]

Step 3: Special methods

def __len__(self):
    return len(self.cards)

def __getitem__(self, i):
    return self.cards[i]

def __str__(self):
    return f"I am a {len(self)} card deck"

Step 4: Using it

deck = Deck()
print(deck)              # I am a 40 card deck
print(deck[0])           # (1, 'Clubs')

deck.shuffle()
print(deck[0])           # Something random now
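Side note: the manual 200-swap shuffle works, but the standard library already provides a proper in-place shuffle (random.shuffle), so the method body can be one line:

```python
import random

class Deck:
    def __init__(self):
        # Same 40 cards as before: 1-10 in each of four suits
        self.cards = [(num, suit)
                      for num in range(1, 11)
                      for suit in ["Clubs", "Spades", "Hearts", "Diamonds"]]

    def shuffle(self):
        random.shuffle(self.cards)  # uniform in-place shuffle from the stdlib

deck = Deck()
deck.shuffle()
print(len(deck.cards))  # 40 - shuffling reorders, never adds or drops cards
```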

Complete Counter Example

Putting it all together:

class Counter:
    def __init__(self, value):
        self.val = value
    
    def tick(self):
        self.val = self.val + 1
    
    def reset(self):
        self.val = 0
    
    def value(self):
        return self.val
    
    def __str__(self):
        return f"Counter: {self.val}"
    
    def __add__(self, other):
        return Counter(self.val + other.val)
c1 = Counter(0)
c2 = Counter(3)

c1.tick()
c2.tick()

c3 = c1 + c2

print(c1.value())    # 1
print(c2)            # Counter: 4
print(c3)            # Counter: 5

Quick Reference

📝
OOP Cheat Sheet

class Name: → Define a class
def __init__(self): → Constructor
self.var = x → Instance variable
obj = Class() → Create object
obj.method() → Call method
__str__ → For print()
__add__ → For +
__eq__ → For ==
__len__ → For len()
__getitem__ → For [ ]


Why Bother?

Classes help organize large programs. Also, many frameworks (like PyTorch for machine learning) require you to define your own classes. So yeah, you need this. Sorry.

OOP: Extra Practice

ℹ️
Exam Note

The exam is problem-solving focused. OOP is just about organizing code cleanly. If you get the logic, the syntax follows. Don't memorize — understand.


Exercise 1: Clock

Build a clock step by step.

Part A: Basic structure

Create a Clock class with:

  • hours, minutes, seconds (all start at 0, or passed to constructor)
  • A tick() method that adds 1 second

Part B: Handle overflow

Make tick() handle:

  • 60 seconds → 1 minute
  • 60 minutes → 1 hour
  • 24 hours → back to 0

Part C: Display

Add __str__ to show time as "HH:MM:SS" (with leading zeros).

c = Clock(23, 59, 59)
print(c)        # 23:59:59
c.tick()
print(c)        # 00:00:00

Part D: Add seconds

Add __add__ to add an integer number of seconds:

c = Clock(10, 30, 0)
c2 = c + 90     # Add 90 seconds
print(c2)       # 10:31:30
💡
Hint for Part D

You can call tick() in a loop, or be smart and use division/modulo.


Exercise 2: Fraction

Create a Fraction class for exact arithmetic (no floating point nonsense).

Part A: Constructor and display

f = Fraction(1, 2)
print(f)        # 1/2

Part B: Simplify automatically

Use math.gcd to always store fractions in simplest form:

f = Fraction(4, 8)
print(f)        # 1/2 (not 4/8)

Part C: Arithmetic

Add these special methods:

  • __add__ → Fraction(1,2) + Fraction(1,3) = Fraction(5,6)
  • __sub__ → subtraction
  • __mul__ → multiplication
  • __eq__ → Fraction(1,2) == Fraction(2,4) is True

Part D: Test these expressions

# Expression 1
f1 = Fraction(1, 4)
f2 = Fraction(1, 6)
f3 = Fraction(3, 2)
result = f1 + f2 * f3
print(result)           # Should be 1/2

# Expression 2
f4 = Fraction(1, 4)
f5 = Fraction(1, 4)
f6 = Fraction(1, 2)
print(f4 + f5 == f6)    # Should be True
⚠️
Operator Precedence

* happens before +, just like normal math. Python handles this automatically with your special methods.


Exercise 3: Calculator

A calculator that remembers its state.

Part A: Basic operations

class Calculator:
    # value starts at 0
    # add(x) → adds x to value
    # subtract(x) → subtracts x
    # multiply(x) → multiplies
    # divide(x) → divides
    # clear() → resets to 0
    # result() → returns current value
calc = Calculator()
calc.add(10)
calc.multiply(2)
calc.subtract(5)
print(calc.result())    # 15

Part B: Chain operations

Make methods return self so you can chain:

calc = Calculator()
calc.add(10).multiply(2).subtract(5)
print(calc.result())    # 15
💡
How to Chain

Each method should end with return self

Part C: Memory

Add:

  • memory_store() → saves current value
  • memory_recall() → adds stored value to current
  • memory_clear() → clears memory

Exercise 4: Playlist

Part A: Song class

class Song:
    # title, artist, duration (in seconds)
    # __str__ returns "Artist - Title (M:SS)"
s = Song("Bohemian Rhapsody", "Queen", 354)
print(s)    # Queen - Bohemian Rhapsody (5:54)

Part B: Playlist class

class Playlist:
    # name
    # songs (list)
    # add_song(song)
    # total_duration() → returns total seconds
    # __len__ → number of songs
    # __getitem__ → access by index
    # __str__ → shows playlist name and song count
p = Playlist("Road Trip")
p.add_song(Song("Song A", "Artist 1", 180))
p.add_song(Song("Song B", "Artist 2", 240))

print(len(p))              # 2
print(p[0])                # Artist 1 - Song A (3:00)
print(p.total_duration())  # 420

Exercise 5: Quick Concepts

No code — just answer:

5.1: What's the difference between a class and an object?

5.2: Why do methods have self as first parameter?

5.3: What happens if you forget __str__ and try to print an object?

5.4: When would you use __eq__ instead of just comparing with ==?

5.5: What's encapsulation and why should you care?


Exercise 6: Debug This

class BankAccount:
    def __init__(self, balance):
        balance = balance
    
    def deposit(amount):
        balance += amount
    
    def __str__(self):
        return f"Balance: {self.balance}"

acc = BankAccount(100)
acc.deposit(50)
print(acc)

This crashes. Find all the bugs.

⚠️
Hint

There are 3 bugs. All involve a missing word.


Done?

If you can do these, you understand OOP basics. I would be proud.

Object-Oriented Programming in Python

Object-Oriented Programming (OOP) is a way of organizing code by bundling related data and functions together into "objects". Instead of writing separate functions that work on data, you create objects that contain both the data and the functions that work with that data.

Why Learn OOP?

OOP helps you write code that is easier to understand, reuse, and maintain. It mirrors how we think about the real world - objects with properties and behaviors.

The four pillars of OOP:

  1. Encapsulation - Bundle data and methods together
  2. Abstraction - Hide complex implementation details
  3. Inheritance - Create new classes based on existing ones
  4. Polymorphism - Same interface, different implementations

Level 1: Understanding Classes and Objects

What is a Class?

A class is a blueprint or template for creating objects. Think of it like a cookie cutter - it defines the shape, but it's not the cookie itself.

# This is a class - a blueprint for dogs
class Dog:
    pass  # Empty for now

Naming Convention

Classes use PascalCase (UpperCamelCase):

class Dog:              # ✓ Good
class BankAccount:      # ✓ Good
class DataProcessor:    # ✓ Good

class my_class:         # ✗ Bad (snake_case)
class myClass:          # ✗ Bad (camelCase)

What is an Object (Instance)?

An object (or instance) is an actual "thing" created from the class blueprint. If the class is a cookie cutter, the object is the actual cookie.

class Dog:
    pass

# Creating objects (instances)
buddy = Dog()  # buddy is an object
max_dog = Dog()  # max_dog is another object

# Both are dogs, but they're separate objects
print(type(buddy))  # <class '__main__.Dog'>

Terminology:

  • Dog is the class (blueprint)
  • buddy and max_dog are instances or objects (actual things)
  • We say: "buddy is an instance of Dog" or "buddy is a Dog object"

Level 2: Attributes - Giving Objects Data

Attributes are variables that store data inside an object. They represent the object's properties or state.

Instance Attributes

Instance attributes are unique to each object:

class Dog:
    def __init__(self, name, age):
        self.name = name  # Instance attribute
        self.age = age    # Instance attribute

# Create two different dogs
buddy = Dog("Buddy", 3)
max_dog = Dog("Max", 5)

# Each has its own attributes
print(buddy.name)    # "Buddy"
print(max_dog.name)  # "Max"
print(buddy.age)     # 3
print(max_dog.age)   # 5

Understanding __init__

__init__ is a special method called a constructor. It runs automatically when you create a new object.

class Dog:
    def __init__(self, name, age):
        print(f"Creating a dog named {name}!")
        self.name = name
        self.age = age

buddy = Dog("Buddy", 3)  
# Prints: "Creating a dog named Buddy!"

What __init__ does:

  • Initializes (sets up) the new object's attributes
  • Runs automatically when you call Dog(...)
  • First parameter is always self

The double underscores in names like __init__ are called "dunder" (short for double underscore). They mark special methods that Python recognizes for specific purposes.

Understanding self

self refers to the specific object you're working with:

class Dog:
    def __init__(self, name):
        self.name = name  # self.name means "THIS dog's name"

buddy = Dog("Buddy")
# When creating buddy, self refers to buddy
# So self.name = "Buddy" stores "Buddy" in buddy's name attribute

max_dog = Dog("Max")
# When creating max_dog, self refers to max_dog
# So self.name = "Max" stores "Max" in max_dog's name attribute

Important:

  • self is just a naming convention (you could use another name, but don't!)
  • Always include self as the first parameter in methods
  • You don't pass self when calling methods - Python does it automatically

Class Attributes

Class attributes are shared by ALL objects of that class:

class Dog:
    species = "Canis familiaris"  # Class attribute (shared)
    
    def __init__(self, name):
        self.name = name  # Instance attribute (unique)

buddy = Dog("Buddy")
max_dog = Dog("Max")

print(buddy.species)   # "Canis familiaris"
print(max_dog.species) # "Canis familiaris" (same for both)
print(buddy.name)      # "Buddy" (different)
print(max_dog.name)    # "Max" (different)
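Because the class attribute is shared, changing it on the class updates what every instance sees, unless an instance has shadowed it with its own attribute:

```python
class Dog:
    species = "Canis familiaris"   # shared by all Dog objects

    def __init__(self, name):
        self.name = name

buddy = Dog("Buddy")
max_dog = Dog("Max")

Dog.species = "Canis lupus familiaris"   # change it on the class
print(buddy.species)    # Canis lupus familiaris
print(max_dog.species)  # Canis lupus familiaris (both see the change)

buddy.species = "Custom"   # assigning via the instance creates an INSTANCE attribute
print(buddy.species)       # Custom (buddy's own copy now shadows the class attribute)
print(max_dog.species)     # Canis lupus familiaris (unaffected)
```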

Practice:

Exercise 1: Create a Cat class with name and color attributes

Exercise 2: Create two cat objects with different names and colors

Exercise 3: Create a Book class with title, author, and pages attributes

Exercise 4: Add a class attribute book_count to track how many books exist

Exercise 5: Create a Student class with name and grade attributes

Solutions
# Exercise 1 & 2
class Cat:
    def __init__(self, name, color):
        self.name = name
        self.color = color

whiskers = Cat("Whiskers", "orange")
mittens = Cat("Mittens", "black")
print(whiskers.name, whiskers.color)  # Whiskers orange
print(mittens.name, mittens.color)    # Mittens black

# Exercise 3
class Book:
    def __init__(self, title, author, pages):
        self.title = title
        self.author = author
        self.pages = pages

book1 = Book("Python Basics", "John Doe", 300)
print(book1.title)  # Python Basics

# Exercise 4
class Book:
    book_count = 0  # Class attribute
    
    def __init__(self, title, author):
        self.title = title
        self.author = author
        Book.book_count += 1

book1 = Book("Book 1", "Author 1")
book2 = Book("Book 2", "Author 2")
print(Book.book_count)  # 2

# Exercise 5
class Student:
    def __init__(self, name, grade):
        self.name = name
        self.grade = grade

student = Student("Alice", "A")
print(student.name, student.grade)  # Alice A

Level 3: Methods - Giving Objects Behavior

Methods are functions defined inside a class. They define what objects can do.

Instance Methods

Instance methods operate on a specific object and can access its attributes:

class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def bark(self):  # Instance method
        return f"{self.name} says Woof!"
    
    def get_age_in_dog_years(self):
        return self.age * 7

buddy = Dog("Buddy", 3)
print(buddy.bark())                    # "Buddy says Woof!"
print(buddy.get_age_in_dog_years())    # 21

Key points:

  • First parameter is always self
  • Can access object's attributes using self.attribute_name
  • Called using dot notation: object.method()

Methods Can Modify Attributes

Methods can both read and change an object's attributes:

class BankAccount:
    def __init__(self, balance):
        self.balance = balance
    
    def deposit(self, amount):
        self.balance += amount  # Modify the balance
        return self.balance
    
    def withdraw(self, amount):
        if amount <= self.balance:
            self.balance -= amount
            return self.balance
        else:
            return "Insufficient funds"
    
    def get_balance(self):
        return self.balance

account = BankAccount(100)
account.deposit(50)
print(account.get_balance())  # 150
account.withdraw(30)
print(account.get_balance())  # 120

Practice: Methods

Exercise 1: Add a meow() method to the Cat class

Exercise 2: Add a have_birthday() method to Dog that increases age by 1

Exercise 3: Create a Rectangle class with width, height, and methods area() and perimeter()

Exercise 4: Add a description() method to Book that returns a formatted string

Exercise 5: Create a Counter class with increment(), decrement(), and reset() methods

Solutions
# Exercise 1
class Cat:
    def __init__(self, name):
        self.name = name
    
    def meow(self):
        return f"{self.name} says Meow!"

cat = Cat("Whiskers")
print(cat.meow())  # Whiskers says Meow!

# Exercise 2
class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def have_birthday(self):
        self.age += 1
        return f"{self.name} is now {self.age} years old!"

dog = Dog("Buddy", 3)
print(dog.have_birthday())  # Buddy is now 4 years old!

# Exercise 3
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height
    
    def area(self):
        return self.width * self.height
    
    def perimeter(self):
        return 2 * (self.width + self.height)

rect = Rectangle(5, 3)
print(rect.area())       # 15
print(rect.perimeter())  # 16

# Exercise 4
class Book:
    def __init__(self, title, author, pages):
        self.title = title
        self.author = author
        self.pages = pages
    
    def description(self):
        return f"'{self.title}' by {self.author}, {self.pages} pages"

book = Book("Python Basics", "John Doe", 300)
print(book.description())  # 'Python Basics' by John Doe, 300 pages

# Exercise 5
class Counter:
    def __init__(self):
        self.count = 0
    
    def increment(self):
        self.count += 1
    
    def decrement(self):
        self.count -= 1
    
    def reset(self):
        self.count = 0
    
    def get_count(self):
        return self.count

counter = Counter()
counter.increment()
counter.increment()
print(counter.get_count())  # 2
counter.decrement()
print(counter.get_count())  # 1
counter.reset()
print(counter.get_count())  # 0

Level 4: Inheritance - Reusing Code

Inheritance lets you create a new class based on an existing class. The new class inherits attributes and methods from the parent.

Why? Code reuse - don't repeat yourself!

Basic Inheritance

# Parent class (also called base class or superclass)
class Animal:
    def __init__(self, name):
        self.name = name
    
    def speak(self):
        return "Some sound"

# Child class (also called derived class or subclass)
class Dog(Animal):  # Dog inherits from Animal
    def speak(self):  # Override parent method
        return f"{self.name} says Woof!"

class Cat(Animal):
    def speak(self):
        return f"{self.name} says Meow!"

dog = Dog("Buddy")
cat = Cat("Whiskers")

print(dog.speak())  # "Buddy says Woof!"
print(cat.speak())  # "Whiskers says Meow!"

What happened:

  • Dog and Cat inherit __init__ from Animal (no need to rewrite it!)
  • Both override the speak method with their own version
  • Each child gets all parent attributes and methods automatically

Extending Parent's __init__ with super()

Use super() to call the parent's __init__ and then add more:

class Animal:
    def __init__(self, name):
        self.name = name

class Dog(Animal):
    def __init__(self, name, breed):
        super().__init__(name)  # Call parent's __init__
        self.breed = breed      # Add new attribute
    
    def info(self):
        return f"{self.name} is a {self.breed}"

dog = Dog("Buddy", "Golden Retriever")
print(dog.info())  # "Buddy is a Golden Retriever"
print(dog.name)    # "Buddy" (inherited from Animal)

Method Overriding

Method overriding happens when a child class provides its own implementation of a parent's method:

class Animal:
    def speak(self):
        return "Some sound"
    
    def move(self):
        return "Moving"

class Fish(Animal):
    def move(self):  # Override
        return "Swimming"
    
    def speak(self):  # Override
        return "Blub"

class Bird(Animal):
    def move(self):  # Override
        return "Flying"
    # speak() not overridden, so uses parent's version

fish = Fish()
bird = Bird()

print(fish.move())   # "Swimming" (overridden)
print(fish.speak())  # "Blub" (overridden)
print(bird.move())   # "Flying" (overridden)
print(bird.speak())  # "Some sound" (inherited, not overridden)

Rule: When you call a method, Python uses the child's version if it exists, otherwise the parent's version.
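That lookup follows the class's method resolution order (MRO): child first, then parent. A small sketch:

```python
class Animal:
    def speak(self):
        return "Some sound"

class Bird(Animal):
    pass  # no override: speak() comes from Animal

class Fish(Animal):
    def speak(self):  # override: the child's version wins
        return "Blub"

print(Fish().speak())  # Blub
print(Bird().speak())  # Some sound

# The exact order Python searches in:
print([cls.__name__ for cls in Fish.__mro__])  # ['Fish', 'Animal', 'object']
```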

Practice: Inheritance

Exercise 1: Create a Vehicle parent class with brand and year attributes

Exercise 2: Create Car and Motorcycle child classes that inherit from Vehicle

Exercise 3: Override a description() method in each child class

Exercise 4: Create an Employee parent class and a Manager child class with additional department attribute

Exercise 5: Create a Shape parent with color attribute, and Circle and Square children

Solutions
# Exercise 1, 2, 3
class Vehicle:
    def __init__(self, brand, year):
        self.brand = brand
        self.year = year
    
    def description(self):
        return f"{self.year} {self.brand}"

class Car(Vehicle):
    def description(self):
        return f"{self.year} {self.brand} Car"

class Motorcycle(Vehicle):
    def description(self):
        return f"{self.year} {self.brand} Motorcycle"

car = Car("Toyota", 2020)
bike = Motorcycle("Harley", 2019)
print(car.description())   # 2020 Toyota Car
print(bike.description())  # 2019 Harley Motorcycle

# Exercise 4
class Employee:
    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

class Manager(Employee):
    def __init__(self, name, salary, department):
        super().__init__(name, salary)
        self.department = department
    
    def info(self):
        return f"{self.name} manages {self.department}"

manager = Manager("Alice", 80000, "Sales")
print(manager.info())  # Alice manages Sales
print(manager.salary)  # 80000

# Exercise 5
class Shape:
    def __init__(self, color):
        self.color = color

class Circle(Shape):
    def __init__(self, color, radius):
        super().__init__(color)
        self.radius = radius
    
    def area(self):
        return 3.14159 * self.radius ** 2

class Square(Shape):
    def __init__(self, color, side):
        super().__init__(color)
        self.side = side
    
    def area(self):
        return self.side ** 2

circle = Circle("red", 5)
square = Square("blue", 4)
print(circle.area())   # 78.53975
print(circle.color)    # red
print(square.area())   # 16
print(square.color)    # blue

Level 5: Special Decorators for Methods

Decorators modify how methods behave. They're marked with the @ symbol placed before the method definition.

@property - Methods as Attributes

Makes a method accessible like an attribute (no parentheses needed):

class Circle:
    def __init__(self, radius):
        self._radius = radius
    
    @property
    def radius(self):
        return self._radius
    
    @property
    def area(self):
        return 3.14159 * self._radius ** 2
    
    @property
    def circumference(self):
        return 2 * 3.14159 * self._radius

circle = Circle(5)
print(circle.radius)         # 5 (no parentheses!)
print(circle.area)           # 78.53975 (calculated on access)
print(circle.circumference)  # 31.4159
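Properties can also validate assignment. This sketch extends the Circle example with a setter via `@radius.setter` (an addition, not shown above):

```python
class Circle:
    def __init__(self, radius):
        self._radius = radius

    @property
    def radius(self):
        return self._radius

    @radius.setter
    def radius(self, value):
        # Runs whenever someone assigns to circle.radius
        if value <= 0:
            raise ValueError("radius must be positive")
        self._radius = value

circle = Circle(5)
circle.radius = 10       # goes through the setter
print(circle.radius)     # 10
# circle.radius = -1     # would raise ValueError
```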

@staticmethod - Methods Without self

Static methods don't need access to the instance:

class Math:
    @staticmethod
    def add(x, y):
        return x + y
    
    @staticmethod
    def multiply(x, y):
        return x * y

# Call without creating an instance
print(Math.add(5, 3))       # 8
print(Math.multiply(4, 7))  # 28

@classmethod - Methods That Receive the Class

Class methods receive the class itself (not the instance):

class Dog:
    count = 0  # Class attribute
    
    def __init__(self, name):
        self.name = name
        Dog.count += 1
    
    @classmethod
    def get_count(cls):
        return f"There are {cls.count} dogs"
    
    @classmethod
    def create_default(cls):
        return cls("Default Dog")

dog1 = Dog("Buddy")
dog2 = Dog("Max")
print(Dog.get_count())  # "There are 2 dogs"

# Create a dog using class method
dog3 = Dog.create_default()
print(dog3.name)        # "Default Dog"
print(Dog.get_count())  # "There are 3 dogs"

Practice: Decorators

Exercise 1: Create a Temperature class with celsius property and fahrenheit property

Exercise 2: Add a static method is_freezing(celsius) to check if temperature is below 0

Exercise 3: Create a Person class with class method to count total people created

Exercise 4: Add a property age to calculate age from birth year

Exercise 5: Create utility class StringUtils with static methods for string operations

Solutions
# Exercise 1
class Temperature:
    def __init__(self, celsius):
        self._celsius = celsius
    
    @property
    def celsius(self):
        return self._celsius
    
    @property
    def fahrenheit(self):
        return (self._celsius * 9/5) + 32

temp = Temperature(25)
print(temp.celsius)     # 25
print(temp.fahrenheit)  # 77.0

# Exercise 2
class Temperature:
    def __init__(self, celsius):
        self._celsius = celsius
    
    @property
    def celsius(self):
        return self._celsius
    
    @staticmethod
    def is_freezing(celsius):
        return celsius < 0

print(Temperature.is_freezing(-5))  # True
print(Temperature.is_freezing(10))  # False

# Exercise 3
class Person:
    count = 0
    
    def __init__(self, name):
        self.name = name
        Person.count += 1
    
    @classmethod
    def get_total_people(cls):
        return cls.count

p1 = Person("Alice")
p2 = Person("Bob")
print(Person.get_total_people())  # 2

# Exercise 4
class Person:
    def __init__(self, name, birth_year):
        self.name = name
        self.birth_year = birth_year
    
    @property
    def age(self):
        from datetime import datetime
        current_year = datetime.now().year
        return current_year - self.birth_year

person = Person("Alice", 1990)
print(person.age)  # Calculates current age

# Exercise 5
class StringUtils:
    @staticmethod
    def reverse(text):
        return text[::-1]
    
    @staticmethod
    def word_count(text):
        return len(text.split())
    
    @staticmethod
    def capitalize_words(text):
        return text.title()

print(StringUtils.reverse("hello"))           # "olleh"
print(StringUtils.word_count("hello world"))  # 2
print(StringUtils.capitalize_words("hello world"))  # "Hello World"

Level 6: Abstract Classes - Enforcing Rules

An abstract class is a class that cannot be instantiated directly. It exists only as a blueprint for other classes to inherit from.

Why? To enforce that child classes implement certain methods - it's a contract.

Creating Abstract Classes

Use the abc module (Abstract Base Classes):

from abc import ABC, abstractmethod

class Animal(ABC):  # Inherit from ABC
    def __init__(self, name):
        self.name = name
    
    @abstractmethod  # Must be implemented by children
    def speak(self):
        pass
    
    @abstractmethod
    def move(self):
        pass

# This will cause an error:
# animal = Animal("Generic")  # TypeError: Can't instantiate abstract class

class Dog(Animal):
    def speak(self):  # Must implement
        return f"{self.name} barks"
    
    def move(self):   # Must implement
        return f"{self.name} walks"

dog = Dog("Buddy")  # This works!
print(dog.speak())  # "Buddy barks"
print(dog.move())   # "Buddy walks"

Key points:

  • Abstract classes inherit from ABC
  • Use @abstractmethod for methods that must be implemented
  • Child classes MUST implement all abstract methods
  • Cannot create instances of abstract classes directly
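Note that forgetting even one abstract method leaves the child abstract too, so it can't be instantiated either. A quick sketch of what happens (the Snake class here is a made-up example):

```python
from abc import ABC, abstractmethod

class Animal(ABC):
    @abstractmethod
    def speak(self):
        pass

    @abstractmethod
    def move(self):
        pass

class Snake(Animal):
    def speak(self):
        return "Hiss"
    # move() not implemented -> Snake is still abstract!

try:
    snake = Snake()
except TypeError as e:
    print(e)  # Can't instantiate abstract class Snake ...
```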

Why Use Abstract Classes?

They enforce consistency across child classes:

from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        pass
    
    @abstractmethod
    def perimeter(self):
        pass

class Rectangle(Shape):
    def __init__(self, width, height):
        self.width = width
        self.height = height
    
    def area(self):
        return self.width * self.height
    
    def perimeter(self):
        return 2 * (self.width + self.height)

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius
    
    def area(self):
        return 3.14159 * self.radius ** 2
    
    def perimeter(self):
        return 2 * 3.14159 * self.radius

# Both Rectangle and Circle MUST have area() and perimeter()
rect = Rectangle(5, 3)
circle = Circle(4)
print(rect.area())      # 15
print(circle.area())    # 50.26544

Practice: Abstract Classes

Exercise 1: Create an abstract Vehicle class with abstract method start_engine()

Exercise 2: Create abstract PaymentMethod class with abstract process_payment(amount) method

Exercise 3: Create concrete classes CreditCard and PayPal that inherit from PaymentMethod

Exercise 4: Create abstract Database class with abstract connect() and query() methods

Exercise 5: Create abstract FileProcessor with abstract read() and write() methods

Solutions
# Exercise 1
from abc import ABC, abstractmethod

class Vehicle(ABC):
    @abstractmethod
    def start_engine(self):
        pass

class Car(Vehicle):
    def start_engine(self):
        return "Car engine started"

car = Car()
print(car.start_engine())  # Car engine started

# Exercise 2 & 3
class PaymentMethod(ABC):
    @abstractmethod
    def process_payment(self, amount):
        pass

class CreditCard(PaymentMethod):
    def __init__(self, card_number):
        self.card_number = card_number
    
    def process_payment(self, amount):
        return f"Charged ${amount} to card {self.card_number}"

class PayPal(PaymentMethod):
    def __init__(self, email):
        self.email = email
    
    def process_payment(self, amount):
        return f"Charged ${amount} to PayPal account {self.email}"

card = CreditCard("1234-5678")
paypal = PayPal("user@email.com")
print(card.process_payment(100))    # Charged $100 to card 1234-5678
print(paypal.process_payment(50))   # Charged $50 to PayPal account user@email.com

# Exercise 4
class Database(ABC):
    @abstractmethod
    def connect(self):
        pass
    
    @abstractmethod
    def query(self, sql):
        pass

class MySQL(Database):
    def connect(self):
        return "Connected to MySQL"
    
    def query(self, sql):
        return f"Executing MySQL query: {sql}"

db = MySQL()
print(db.connect())           # Connected to MySQL
print(db.query("SELECT *"))   # Executing MySQL query: SELECT *

# Exercise 5
class FileProcessor(ABC):
    @abstractmethod
    def read(self, filename):
        pass
    
    @abstractmethod
    def write(self, filename, data):
        pass

class TextFileProcessor(FileProcessor):
    def read(self, filename):
        return f"Reading text from {filename}"
    
    def write(self, filename, data):
        return f"Writing text to {filename}: {data}"

processor = TextFileProcessor()
print(processor.read("data.txt"))              # Reading text from data.txt
print(processor.write("out.txt", "Hello"))     # Writing text to out.txt: Hello

Level 7: Design Pattern - Template Method

The Template Method Pattern defines the skeleton of an algorithm in a parent class, but lets child classes implement specific steps.

from abc import ABC, abstractmethod

class DataProcessor(ABC):
    """Template for processing data"""
    
    def process(self):
        """Template method - defines the workflow"""
        data = self.load_data()
        cleaned = self.clean_data(data)
        result = self.analyze_data(cleaned)
        self.save_results(result)
    
    @abstractmethod
    def load_data(self):
        """Children must implement"""
        pass
    
    @abstractmethod
    def clean_data(self, data):
        """Children must implement"""
        pass
    
    @abstractmethod
    def analyze_data(self, data):
        """Children must implement"""
        pass
    
    def save_results(self, result):
        """Default implementation (can override)"""
        print(f"Saving: {result}")


class CSVProcessor(DataProcessor):
    def load_data(self):
        return "CSV data loaded"
    
    def clean_data(self, data):
        return f"{data} -> cleaned"
    
    def analyze_data(self, data):
        return f"{data} -> analyzed"


class JSONProcessor(DataProcessor):
    def load_data(self):
        return "JSON data loaded"
    
    def clean_data(self, data):
        return f"{data} -> cleaned differently"
    
    def analyze_data(self, data):
        return f"{data} -> analyzed differently"


# Usage
csv = CSVProcessor()
csv.process()
# Output: Saving: CSV data loaded -> cleaned -> analyzed

json_proc = JSONProcessor()  # avoid shadowing the stdlib json module
json_proc.process()
# Output: Saving: JSON data loaded -> cleaned differently -> analyzed differently

Benefits:

  • Common workflow defined once in parent
  • Each child implements specific steps differently
  • Prevents code duplication
  • Enforces consistent structure
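The save_results step above has a default implementation, so a child can override just that hook while keeping the workflow. A minimal standalone sketch of the same pattern (the Report/EmailReport names here are hypothetical, not from the example above):

```python
from abc import ABC, abstractmethod

class Report(ABC):
    def run(self):
        """Template method: the workflow is fixed here."""
        body = self.build_body()
        self.deliver(body)

    @abstractmethod
    def build_body(self):
        pass

    def deliver(self, body):
        """Hook with a default that children MAY override."""
        print(f"Printing: {body}")

class EmailReport(Report):
    def build_body(self):
        return "weekly numbers"

    def deliver(self, body):  # override just the hook step
        print(f"Emailing: {body}")

class ConsoleReport(Report):
    def build_body(self):
        return "daily numbers"
    # deliver() not overridden -> uses the default

EmailReport().run()    # Emailing: weekly numbers
ConsoleReport().run()  # Printing: daily numbers
```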

Summary: Key Concepts

Classes and Objects

  • Class = blueprint (use PascalCase)
  • Object/Instance = actual thing created from class
  • __init__ = constructor that runs when creating objects
  • self = reference to the current object

Attributes and Methods

  • Attributes = data (variables) stored in objects
  • Instance attributes = unique to each object (defined in __init__)
  • Class attributes = shared by all objects
  • Methods = functions that define object behavior
  • Access both using self.name inside the class

Inheritance

  • Child class inherits from parent class
  • Use super() to call parent's methods
  • Method overriding = child replaces parent's method
  • Promotes code reuse

Decorators

  • @property = access method like an attribute
  • @staticmethod = method without self, doesn't need instance
  • @classmethod = receives class instead of instance
  • @abstractmethod = marks methods that must be implemented

Abstract Classes

  • Cannot be instantiated directly
  • Use ABC and @abstractmethod
  • Enforce that children implement specific methods
  • Create contracts/interfaces

Design Patterns

  • Template Method = define algorithm structure in parent, implement steps in children
  • Promotes consistency and reduces duplication

Inheritance

📖
What is Inheritance?

Inheritance allows you to define new classes based on existing ones. The new class "inherits" attributes and methods from the parent, so you don't have to write the same code twice. Because we're lazy. Efficiently lazy.

Why bother?

  • Reuse existing code
  • Build specialized versions of general classes
  • Organize related classes in hierarchies

The Basic Idea

Think about it:

  • A Student is a Person
  • A Student has everything a Person has (name, age, etc.)
  • But a Student also has extra stuff (exams, courses, stress)

Instead of copy-pasting all the Person code into Student, we just say "Student inherits from Person" and add the extra bits.

Person (superclass / parent)
   ↓
Student (subclass / child)
💡
Terminology

Superclass = Parent class = Base class (the original)
Subclass = Child class = Derived class (the new one)


Level 1: Creating a Subclass

The Syntax

Put the parent class name in parentheses:

class Student(Person):
    pass

That's it. Student now has everything Person has.

Let's Build It Step by Step

Step 1: The superclass

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

Step 2: Add a method

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def birthday(self):
        self.age += 1

Step 3: Add __str__

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def birthday(self):
        self.age += 1
    
    def __str__(self):
        return f"{self.name}, age {self.age}"

Step 4: Create the subclass

class Student(Person):
    pass

Step 5: Test it

s = Student("Alice", 20)
print(s)           # Alice, age 20
s.birthday()
print(s)           # Alice, age 21
What Just Happened?

Student inherited __init__, birthday, and __str__ from Person. We wrote zero code in Student but it works!


Level 2: Adding New Stuff to Subclasses

A subclass can have:

  • Additional instance variables
  • Additional methods
  • Its own constructor

Adding a Method

class Student(Person):
    def study(self):
        print(f"{self.name} is studying...")
s = Student("Bob", 19)
s.study()      # Bob is studying...
s.birthday()   # Still works from Person!

Adding Instance Variables

Students have exams. Persons don't. Let's add that.

class Student(Person):
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.exams = []    # New!
⚠️
Problem

We just copy-pasted the parent's __init__ code. That's bad. What if Person changes? We'd have to update Student too.


Level 3: The super() Function

super() lets you call methods from the parent class. Use it to avoid code duplication.

Better Constructor

class Student(Person):
    def __init__(self, name, age):
        super().__init__(name, age)    # Call parent's __init__
        self.exams = []                 # Add our own stuff

Breaking it down:

super().__init__(name, age)

This says: "Hey parent class, run YOUR __init__ with these values."

Then we add the Student-specific stuff after.

Test It

s = Student("Charlie", 21)
print(s.name)      # Charlie (from Person)
print(s.exams)     # [] (from Student)
💡
Golden Rule

When overriding __init__, usually call super().__init__(...) first, then add your stuff.


Level 4: Overriding Methods

If a subclass defines a method with the same name as the parent, it replaces (overrides) the parent's version.

Example: Override __str__

Parent version:

class Person:
    def __str__(self):
        return f"{self.name}, age {self.age}"

Child version (override):

class Student(Person):
    def __init__(self, name, age):
        super().__init__(name, age)
        self.exams = []
    
    def __str__(self):
        return f"Student: {self.name}, age {self.age}"
p = Person("Dan", 30)
s = Student("Eve", 20)

print(p)    # Dan, age 30
print(s)    # Student: Eve, age 20

Using super() in Overridden Methods

You can extend the parent's method instead of replacing it entirely:

class Student(Person):
    def __str__(self):
        base = super().__str__()           # Get parent's version
        return base + f", exams: {len(self.exams)}"
s = Student("Frank", 22)
print(s)    # Frank, age 22, exams: 0
ℹ️
When to Use super() in Methods

Use super().method_name() when you want to extend the parent's behavior, not completely replace it.


Level 5: Inheritance vs Composition

⚠️
Important Decision

Not everything should be a subclass! Choose wisely.

The "is-a" Test

Ask yourself: "Is X a Y?"

  • Student → Person: "A student IS a person" ✅ → Inheritance
  • Exam → Student: "An exam IS a student" ❌ → Nope
  • Car → Vehicle: "A car IS a vehicle" ✅ → Inheritance
  • Engine → Car: "An engine IS a car" ❌ → Nope

When to Use Objects as Instance Variables

If X is NOT a Y, but X HAS a Y, use composition:

# A student HAS exams (not IS an exam)
class Student(Person):
    def __init__(self, name, age):
        super().__init__(name, age)
        self.exams = []    # List of Exam objects
# Exam is its own class, not a subclass
class Exam:
    def __init__(self, name, score, cfu):
        self.name = name
        self.score = score
        self.cfu = cfu
Simple Rule

IS-A → Use inheritance
HAS-A → Use instance variables (composition)


Level 6: Class Hierarchies

Subclasses can have their own subclasses. It's subclasses all the way down.

        Person
           ↓
        Student
           ↓
     ThesisStudent
class Person:
    pass

class Student(Person):
    pass

class ThesisStudent(Student):
    pass

A ThesisStudent inherits from Student, which inherits from Person.

The Secret: Everything Inherits from object

In Python, every class secretly inherits from object:

class Person:       # Actually: class Person(object)
    pass

That's why every class has methods like __str__ and __eq__ even if you don't define them (they're just not very useful by default).
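You can verify this yourself:

```python
class Person:
    pass

print(Person.__bases__)              # (<class 'object'>,)
print(isinstance(Person(), object))  # True

# The default __str__ inherited from object is not very informative:
print(Person())  # something like <__main__.Person object at 0x...>
```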


Putting It Together: Complete Example

The Exam Class

class Exam:
    def __init__(self, name, score, cfu):
        self.name = name
        self.score = score
        self.cfu = cfu
    
    def __str__(self):
        return f"{self.name}: {self.score}/30 ({self.cfu} CFU)"

The Person Class

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def birthday(self):
        self.age += 1
    
    def __str__(self):
        return f"{self.name}, age {self.age}"

The Student Class

class Student(Person):
    def __init__(self, name, age):
        super().__init__(name, age)
        self.exams = []
    
    def pass_exam(self, exam):
        self.exams.append(exam)
    
    def __str__(self):
        base = super().__str__()
        if self.exams:
            exam_info = ", ".join(str(e) for e in self.exams)
            return f"{base}, exams: [{exam_info}]"
        return f"{base}, no exams yet"

Using It

# Create a student
s = Student("Grace", 20)
print(s)
# Grace, age 20, no exams yet

# Pass an exam
s.pass_exam(Exam("Python", 28, 6))
print(s)
# Grace, age 20, exams: [Python: 28/30 (6 CFU)]

# Pass another exam
s.pass_exam(Exam("Databases", 30, 9))
print(s)
# Grace, age 20, exams: [Python: 28/30 (6 CFU), Databases: 30/30 (9 CFU)]

# Have a birthday
s.birthday()
print(s)
# Grace, age 21, exams: [Python: 28/30 (6 CFU), Databases: 30/30 (9 CFU)]

Inheritance: Extra Practice

ℹ️
A Note on Exams

The exam is mostly problem-solving. Writing OOP code is really just organizing your logic nicely. Don't over-stress these — if you understand the concept, you can write the code. These exercises are just for practice, not memorization.


Exercise 1: Coffee Shop

You're building a coffee ordering system. ☕

Part A:

Create a Beverage class with:

  • name and price
  • __str__ that returns something like "Espresso: €2.50"

Part B:

Create a CustomBeverage subclass that:

  • Has an extras list (e.g., ["oat milk", "extra shot"])
  • Has an add_extra(extra_name, extra_price) method
  • Each extra increases the total price
  • Override __str__ to show the extras too

Test it:

drink = CustomBeverage("Latte", 3.00)
drink.add_extra("oat milk", 0.50)
drink.add_extra("vanilla syrup", 0.30)
print(drink)
# Latte: €3.80 (extras: oat milk, vanilla syrup)

Exercise 2: Shapes (Classic but Useful)

Part A:

Create a Shape class with:

  • A name instance variable
  • A method area() that returns 0 (base case)
  • __str__ that returns "Shape: {name}, area: {area}"

Part B:

Create two subclasses:

Rectangle(Shape):

  • Has width and height
  • Override area() to return width * height

Circle(Shape):

  • Has radius
  • Override area() to return π * radius²

Part C:

Create a function (not a method!) that takes a list of shapes and returns the total area:

shapes = [Rectangle(4, 5), Circle(3), Rectangle(2, 2)]
print(total_area(shapes))  # Should work for any mix of shapes
💡
Why This Works

This is polymorphism — you call .area() on each shape and the correct version runs automatically. The function doesn't care if it's a Rectangle or Circle.


Exercise 3: Game Characters

You're making an RPG. Because why not.

Part A:

Create a Character class with:

  • name and health (default 100)
  • take_damage(amount) that reduces health
  • is_alive() that returns True if health > 0
  • __str__ showing name and health

Part B:

Create a Warrior subclass:

  • Has armor (default 10)
  • Override take_damage so damage is reduced by armor first

Create a Mage subclass:

  • Has mana (default 50)
  • Has cast_spell(damage) that costs 10 mana and returns the damage (or 0 if no mana)

Test scenario:

w = Warrior("Ragnar")
m = Mage("Merlin")

w.take_damage(25)  # Should only take 15 damage (25 - 10 armor)
print(w)           # Ragnar: 85 HP

spell_damage = m.cast_spell(30)
print(m.mana)      # 40

Exercise 4: Quick Thinking

No code needed — just answer:

4.1: You have Animal and want to create Dog. Inheritance or instance variable?

4.2: You have Car and want to give it an Engine. Inheritance or instance variable?

4.3: What does super().__init__() do and when would you skip it?

4.4: If both Parent and Child have a method called greet(), which one runs when you call child_obj.greet()?


Exercise 5: Fix The Bug

This code has issues. Find and fix them:

class Vehicle:
    def __init__(self, brand):
        self.brand = brand
        self.fuel = 100
    
    def drive(self):
        self.fuel -= 10

class ElectricCar(Vehicle):
    def __init__(self, brand, battery):
        self.battery = battery
    
    def drive(self):
        self.battery -= 20
tesla = ElectricCar("Tesla", 100)
print(tesla.brand)  # 💥 Crashes! Why?
⚠️
Hint

What's missing in ElectricCar.__init__?


You're Ready

If you can do these, you understand inheritance. Now go touch grass or something. 🌱

---

Quick Reference

📝
Inheritance Cheat Sheet

class Child(Parent): → Create subclass
super().__init__(...) → Call parent's constructor
super().method() → Call parent's method
Same method name → Overrides parent
New method name → Adds to child
IS-A → Use inheritance
HAS-A → Use instance variables


💡
Final Note

This is just the basics. There's more to discover (multiple inheritance, abstract classes, etc.), but now you have some bases to build on. And yes, these notes are correct. You're welcome. 😏

Dynamic Programming

What is Dynamic Programming?

Dynamic Programming (DP) is an optimization technique that solves complex problems by breaking them down into simpler subproblems and storing their results to avoid redundant calculations.

The key idea: If you've already solved a subproblem, don't solve it again—just look up the answer!

Two fundamental principles:

  1. Overlapping subproblems - the same smaller problems are solved multiple times
  2. Optimal substructure - the optimal solution can be built from optimal solutions to subproblems

Why it matters: DP can transform exponentially slow algorithms into polynomial or even linear time algorithms by trading memory for speed.


Prerequisites: Why Dictionaries Are Perfect for DP

Before diving into dynamic programming, you should understand Python dictionaries. If you're not comfortable with dictionaries yet, review them first—they're the foundation of most DP solutions.

Quick dictionary essentials for DP:

# Creating and using dictionaries
cache = {}  # Empty dictionary

# Store results
cache[5] = 120
cache[6] = 720

# Check if we've seen this before
if 5 in cache:  # O(1) - instant lookup!
    print(cache[5])

# This is why dictionaries are perfect for DP!

Why dictionaries work for DP:

  • O(1) lookup time - checking if a result exists is instant
  • O(1) insertion time - storing a new result is instant
  • Flexible keys - can store results for any input value
  • Clear mapping - easy relationship between input (key) and result (value)

Now let's see DP in action with a classic example.


The Classic Example: Fibonacci

The Fibonacci sequence is perfect for understanding DP because it clearly shows the problem of redundant calculations.

The Problem: Naive Recursion

Fibonacci definition:

  • F(0) = 0
  • F(1) = 1
  • F(n) = F(n-1) + F(n-2)

Naive recursive solution:

def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(10))  # 55
# Try fibonacci(40) - it takes forever!

Why Is This So Slow?

Look at the redundant calculations for fibonacci(5):

fibonacci(5)
├── fibonacci(4)
│   ├── fibonacci(3)
│   │   ├── fibonacci(2)
│   │   │   ├── fibonacci(1)  ← Calculated
│   │   │   └── fibonacci(0)  ← Calculated
│   │   └── fibonacci(1)      ← Calculated AGAIN
│   └── fibonacci(2)          ← Calculated AGAIN
│       ├── fibonacci(1)      ← Calculated AGAIN
│       └── fibonacci(0)      ← Calculated AGAIN
└── fibonacci(3)              ← Entire subtree calculated AGAIN
    ├── fibonacci(2)          ← Calculated AGAIN
    │   ├── fibonacci(1)      ← Calculated AGAIN
    │   └── fibonacci(0)      ← Calculated AGAIN
    └── fibonacci(1)          ← Calculated AGAIN

The numbers:

  • fibonacci(1) is calculated 5 times
  • fibonacci(2) is calculated 3 times
  • fibonacci(3) is calculated 2 times

For fibonacci(40), you'd do 331,160,281 function calls. That's insane for a simple calculation!

Time complexity: O(2^n) - exponential! Each call spawns two more calls.


Dynamic Programming Solution: Memoization

Memoization = storing (caching) results we've already calculated using a dictionary.

# Dictionary to store computed results
memo = {}

def fibonacci_dp(n):
    # Check if we've already calculated this
    if n in memo:
        return memo[n]
    
    # Base cases
    if n <= 1:
        return n
    
    # Calculate, store, and return
    result = fibonacci_dp(n - 1) + fibonacci_dp(n - 2)
    memo[n] = result
    return result

# First call - calculates and stores results
print(fibonacci_dp(10))   # 55
print(memo)  # {2: 1, 3: 2, 4: 3, 5: 5, 6: 8, 7: 13, 8: 21, 9: 34, 10: 55}

# Subsequent calls - instant lookups!
print(fibonacci_dp(50))   # 12586269025 (instant!)
print(fibonacci_dp(100))  # Works perfectly, still instant!

How Memoization Works: Step-by-Step

Let's trace fibonacci_dp(5) with empty memo:

Call fibonacci_dp(5):
  5 not in memo
  Calculate: fibonacci_dp(4) + fibonacci_dp(3)
  
  Call fibonacci_dp(4):
    4 not in memo
    Calculate: fibonacci_dp(3) + fibonacci_dp(2)
    
    Call fibonacci_dp(3):
      3 not in memo
      Calculate: fibonacci_dp(2) + fibonacci_dp(1)
      
      Call fibonacci_dp(2):
        2 not in memo
        Calculate: fibonacci_dp(1) + fibonacci_dp(0)
        fibonacci_dp(1) = 1 (base case)
        fibonacci_dp(0) = 0 (base case)
        memo[2] = 1, return 1
      
      fibonacci_dp(1) = 1 (base case)
      memo[3] = 2, return 2
    
    Call fibonacci_dp(2):
      2 IS in memo! Return 1 immediately (no calculation!)
    
    memo[4] = 3, return 3
  
  Call fibonacci_dp(3):
    3 IS in memo! Return 2 immediately (no calculation!)
  
  memo[5] = 5, return 5

Final memo: {2: 1, 3: 2, 4: 3, 5: 5}

Notice: We only calculate each Fibonacci number once. All subsequent requests are instant dictionary lookups!

Time complexity: O(n) - we calculate each number from 0 to n exactly once
Space complexity: O(n) - we store n results in the dictionary

Comparison:

  • Without DP: fibonacci(40) = 331,160,281 operations ⏰
  • With DP: fibonacci(40) = 40 operations ⚡

That's over 8 million times faster!


Top-Down vs Bottom-Up Approaches

There are two main ways to implement DP:

Top-Down (Memoization) - What We Just Did

Start with the big problem and recursively break it down, storing results as you go.

memo = {}

def fib_topdown(n):
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fib_topdown(n - 1) + fib_topdown(n - 2)
    return memo[n]

Pros:

  • Intuitive if you think recursively
  • Only calculates what's needed
  • Easy to add memoization to existing recursive code

Cons:

  • Uses recursion (stack space)
  • Slightly slower due to function call overhead
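In real code, the standard library can do the memoization bookkeeping for you: functools.lru_cache wraps a recursive function and caches its results automatically, the same idea as the manual memo dictionary:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # cache every result, no eviction
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))           # 12586269025, computed in linear time
print(fib.cache_info())  # hit/miss statistics kept by the decorator
```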

Bottom-Up (Tabulation) - Build From Smallest

Start with the smallest subproblems and build up to the answer.

def fib_bottomup(n):
    if n <= 1:
        return n
    
    # Build table from bottom up
    dp = {0: 0, 1: 1}
    
    for i in range(2, n + 1):
        dp[i] = dp[i - 1] + dp[i - 2]
    
    return dp[n]

print(fib_bottomup(10))  # 55

Even more optimized (space-efficient):

def fib_optimized(n):
    if n <= 1:
        return n
    
    # We only need the last two values
    prev2, prev1 = 0, 1
    
    for i in range(2, n + 1):
        current = prev1 + prev2
        prev2, prev1 = prev1, current
    
    return prev1

print(fib_optimized(100))  # 354224848179261915075

Pros:

  • No recursion (no stack overflow risk)
  • Can optimize space usage (we did it above!)
  • Often slightly faster

Cons:

  • Less intuitive at first
  • Calculates all subproblems even if not needed

When to Use Dynamic Programming

Use DP when you spot these characteristics:

1. Overlapping Subproblems

The same calculations are repeated many times.

Example: In Fibonacci, we calculate F(3) multiple times when computing F(5).

2. Optimal Substructure

The optimal solution to the problem contains optimal solutions to subproblems.

Example: The optimal path from A to C through B must include the optimal path from A to B.

3. You Can Define a Recurrence Relation

You can express the solution in terms of solutions to smaller instances.

Example: F(n) = F(n-1) + F(n-2)


Common DP Problem Patterns

1. Climbing Stairs

Problem: How many distinct ways can you climb n stairs if you can take 1 or 2 steps at a time?

def climbStairs(n):
    if n <= 2:
        return n
    
    memo = {1: 1, 2: 2}
    
    for i in range(3, n + 1):
        memo[i] = memo[i - 1] + memo[i - 2]
    
    return memo[n]

print(climbStairs(5))  # 8
# Ways: 1+1+1+1+1, 1+1+1+2, 1+1+2+1, 1+2+1+1, 2+1+1+1, 1+2+2, 2+1+2, 2+2+1

Key insight: This is actually Fibonacci in disguise! To reach step n, you either came from step n-1 (one step) or step n-2 (two steps).

2. Coin Change

Problem: Given coins of different denominations, find the minimum number of coins needed to make a target amount.

def coinChange(coins, amount):
    # dp[i] = minimum coins needed to make amount i
    dp = {0: 0}
    
    for i in range(1, amount + 1):
        min_coins = float('inf')
        
        # Try each coin
        for coin in coins:
            if i - coin >= 0 and i - coin in dp:
                min_coins = min(min_coins, dp[i - coin] + 1)
        
        if min_coins != float('inf'):
            dp[i] = min_coins
    
    return dp.get(amount, -1)

print(coinChange([1, 2, 5], 11))  # 3 (5 + 5 + 1)
print(coinChange([2], 3))          # -1 (impossible)

The DP Recipe: How to Solve DP Problems

  1. Identify if it's a DP problem

    • Do you see overlapping subproblems?
    • Can you break it into smaller similar problems?
  2. Define the state

    • What information do you need to solve each subproblem?
    • This becomes your dictionary key
  3. Write the recurrence relation

    • How do you calculate dp[n] from smaller subproblems?
    • Example: F(n) = F(n-1) + F(n-2)
  4. Identify base cases

    • What are the smallest subproblems you can solve directly?
    • Example: F(0) = 0, F(1) = 1
  5. Implement and optimize

    • Start with top-down memoization (easier to write)
    • Optimize to bottom-up if needed
    • Consider space optimization

Common Mistakes to Avoid

1. Forgetting to Check the Cache

# Wrong - doesn't check memo first
def fib_wrong(n):
    if n <= 1:
        return n
    memo[n] = fib_wrong(n - 1) + fib_wrong(n - 2)  # Calculates every time!
    return memo[n]

# Correct - checks memo first
def fib_correct(n):
    if n in memo:  # Check first!
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fib_correct(n - 1) + fib_correct(n - 2)
    return memo[n]

2. Not Storing the Result

# Wrong - calculates but doesn't store
def fib_wrong(n):
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    return fib_wrong(n - 1) + fib_wrong(n - 2)  # Doesn't store!

# Correct - stores before returning
def fib_correct(n):
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fib_correct(n - 1) + fib_correct(n - 2)  # Store it!
    return memo[n]

3. Using Mutable Default Arguments

# Wrong - memo persists between calls!
def fib_wrong(n, memo={}):
    # ...

# Correct - create fresh memo or pass it explicitly
def fib_correct(n, memo=None):
    if memo is None:
        memo = {}
    # ...
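By the way, Python's standard library can handle the caching for you: functools.lru_cache wraps a function with a memo dictionary and sidesteps the mutable-default pitfall entirely. A minimal sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache, behaves like our memo dict
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(40))  # 102334155
```

On Python 3.9+ you can also write `@functools.cache`, which is shorthand for `lru_cache(maxsize=None)`.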

Summary

Dynamic Programming is about:

  • Recognizing overlapping subproblems
  • Storing solutions to avoid recalculation
  • Trading memory for speed

Key techniques:

  • Top-down (memoization): Recursive + dictionary cache
  • Bottom-up (tabulation): Iterative + build from smallest

When to use:

  • Same subproblems solved repeatedly
  • Optimal substructure exists
  • Can define recurrence relation

The power of DP:

  • Transforms exponential O(2^n) → linear O(n)
  • Essential for many algorithmic problems
  • Dictionaries make implementation clean and fast

Remember: Not every problem needs DP! Use it when you spot repeated calculations. Sometimes a simple loop or greedy algorithm is better.


Practice Problems to Try

  1. House Robber - Maximum money you can rob from houses without robbing adjacent ones
  2. Longest Common Subsequence - Find longest sequence common to two strings
  3. Edit Distance - Minimum operations to convert one string to another
  4. Maximum Subarray - Find contiguous subarray with largest sum
  5. Unique Paths - Count paths in a grid from top-left to bottom-right

Each of these follows the same DP pattern we've learned. Try to identify the state, recurrence relation, and base cases!
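To get you started, here is my sketch for the first one (House Robber), using the same space-optimized bottom-up pattern as fib_optimized. Treat it as a hint, not a polished solution:

```python
def rob(houses):
    # prev2 = best loot up to two houses back, prev1 = best loot up to the previous house
    prev2, prev1 = 0, 0
    for money in houses:
        # Either skip this house (prev1) or rob it (prev2 + money)
        prev2, prev1 = prev1, max(prev1, prev2 + money)
    return prev1

print(rob([2, 7, 9, 3, 1]))  # 12 (rob the houses worth 2, 9, and 1)
```

State: "best total up to house i". Recurrence: dp[i] = max(dp[i-1], dp[i-2] + houses[i]). Base cases: dp of an empty street is 0.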

Design Tic-Tac-Toe with Python

Project source: Hyperskill - Tic-Tac-Toe

Project Structure

This project is divided into multiple stages on Hyperskill, each with specific instructions and requirements. I'm sharing the final stage here, which integrates all previous components. The final stage instructions may seem brief as they build on earlier stages where the game logic was developed incrementally.

The complete input/output specifications can be found in the link above.

Sample Execution

---------
|       |
|       |
|       |
---------
3 1
---------
|       |
|       |
| X     |
---------
1 1
---------
| O     |
|       |
| X     |
---------
3 2
---------
| O     |
|       |
| X X   |
---------
0 0
Coordinates should be from 1 to 3!
1 2
---------
| O O   |
|       |
| X X   |
---------
3 3
---------
| O O   |
|       |
| X X X |
---------
X wins

Code


xo_arr = [[" "] * 3 for _ in range(3)]

def display_game(arr):
    row_one = " ".join(arr[0])
    row_two = " ".join(arr[1])
    row_three = " ".join(arr[2])

    print("---------")
    print(f"| {row_one} |")
    print(f"| {row_two} |")
    print(f"| {row_three} |")
    print("---------")


# This could be written in a shorter way, I think:
# build a list of all winning combinations
# and then check membership against it
def is_win(s):
    win_line = [s] * 3

    # rows
    symbol_win = xo_arr[0] == win_line
    symbol_win = symbol_win or xo_arr[1] == win_line
    symbol_win = symbol_win or xo_arr[2] == win_line

    # columns
    symbol_win = symbol_win or [xo_arr[0][0], xo_arr[1][0], xo_arr[2][0]] == win_line
    symbol_win = symbol_win or [xo_arr[0][1], xo_arr[1][1], xo_arr[2][1]] == win_line
    symbol_win = symbol_win or [xo_arr[0][2], xo_arr[1][2], xo_arr[2][2]] == win_line

    # diagonals
    symbol_win = symbol_win or (xo_arr[0][0] == s and xo_arr[1][1] == s and xo_arr[2][2] == s)
    symbol_win = symbol_win or (xo_arr[0][2] == s and xo_arr[1][1] == s and xo_arr[2][0] == s)

    return symbol_win


symbol = "X"

display_game(xo_arr)


while True: 

    move = input()
    
    parts = move.split()

    if len(parts) != 2 or not all(part.isdigit() for part in parts):
        print("You should enter numbers!")
        continue

    row_coordinate = int(parts[0])
    column_coordinate = int(parts[1])

    if not (1 <= row_coordinate <= 3 and 1 <= column_coordinate <= 3):
        print("Coordinates should be from 1 to 3!")
        continue

    elif xo_arr[row_coordinate - 1][column_coordinate - 1] in ("X", "O"):
        print("This cell is occupied! Choose another one!")
        continue

    xo_arr[row_coordinate - 1][column_coordinate - 1] = symbol

    if symbol == "X":
        symbol = "O"
    else:
        symbol = "X"

    display_game(xo_arr)

    o_win = is_win("O")
    x_win = is_win("X")


    if x_win:
        print("X wins")
        break

    elif o_win:
        print("O wins")
        break
    elif  " " not in xo_arr[0] and " " not in xo_arr[1] and " " not in xo_arr[2] :
        print("Draw")
        break

Multiplication Table

Write a multiplication table based on a maximum input value.

example:


> Please input number: 10
1    2    3    4    5    6    7    8    9    10  
2    4    6    8    10   12   14   16   18   20  
3    6    9    12   15   18   21   24   27   30  
4    8    12   16   20   24   28   32   36   40  
5    10   15   20   25   30   35   40   45   50  
6    12   18   24   30   36   42   48   54   60  
7    14   21   28   35   42   49   56   63   70  
8    16   24   32   40   48   56   64   72   80  
9    18   27   36   45   54   63   72   81   90  
10   20   30   40   50   60   70   80   90   100 

Implementation

This solution aligns the columns dynamically: the padding depends on the number of digits in each result. If the maximum number in the table is 100, then the values get:

three spaces → 1–9

two spaces → 10–99

one space → 100

So to align everything, you look at the biggest number in the table and check how many digits it has. You can do this mathematically (using tens) or simply by getting the length of the string of the number.

Then you add the right amount of spaces before each value to keep the table lined up.

num = int(input("Please input number: "))
max_spaces = len(str(num * num)) 
row = []

for i in range(1, num + 1):
    for j in range(1, num + 1):
        product = str(i * j)
        space =  " " * (max_spaces + 1 - len(product))
        row.append(product + space)
    
    print(*row)
    row = []


Sieve of Eratosthenes

This is an implementation of the Sieve of Eratosthenes.

You can find the full description of the algorithm on its Wikipedia page here.

Code


n = 120

consecutive_int  = [True for _ in range(2, n + 1)]

def mark_multiples(ci, p):
    for i in range(p * p, len(ci) + 2, p):
        ci[i - 2] = False
    return ci

def get_next_prime_notmarked(ci, p):
    for i in range(p + 1, len(ci) + 2):
        if ci[i - 2]:
            return i
    return -1
            

next_prime = 2


while True:
    consecutive_int = mark_multiples(consecutive_int, next_prime)
    next_prime = get_next_prime_notmarked(consecutive_int, next_prime)
    if next_prime == -1:
        break

def convert_arr_nums(consecutive_int):
    num = ""
    for i in range(len(consecutive_int)):
        if consecutive_int[i]:
            num += str(i + 2) + " "
    return num
            

print(convert_arr_nums(consecutive_int))

Spiral Matrix

Difficulty: Medium
Source: LeetCode

Description

Given an m x n matrix, return all elements of the matrix in spiral order. The spiral traversal goes clockwise starting from the top-left corner: right → down → left → up, repeating inward until all elements are visited.

Code


# To be solved
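Until I write my own solution, here is a sketch of the usual boundary-shrinking approach: keep four pointers (top, bottom, left, right) and peel off one edge at a time. My attempt, not verified against LeetCode's judge:

```python
def spiralOrder(matrix):
    result = []
    top, bottom = 0, len(matrix) - 1
    left, right = 0, len(matrix[0]) - 1
    while top <= bottom and left <= right:
        for c in range(left, right + 1):          # left -> right along the top row
            result.append(matrix[top][c])
        top += 1
        for r in range(top, bottom + 1):          # top -> bottom along the right column
            result.append(matrix[r][right])
        right -= 1
        if top <= bottom:
            for c in range(right, left - 1, -1):  # right -> left along the bottom row
                result.append(matrix[bottom][c])
            bottom -= 1
        if left <= right:
            for r in range(bottom, top - 1, -1):  # bottom -> top along the left column
                result.append(matrix[r][left])
            left += 1
    return result

print(spiralOrder([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # [1, 2, 3, 6, 9, 8, 7, 4, 5]
```

The two `if` guards matter for non-square matrices: without them a single remaining row or column would be traversed twice.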

Rotate Image

Difficulty: Medium
Source: LeetCode

Description

Given an n x n 2D matrix representing an image, rotate the image by 90 degrees clockwise.

Constraint: You must rotate the image in-place by modifying the input matrix directly. Do not allocate another 2D matrix.

Example

Input: matrix = [[1,2,3],[4,5,6],[7,8,9]]
Output: [[7,4,1],[8,5,2],[9,6,3]]

Code

# To be solved
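A common in-place trick for this one is: transpose the matrix, then reverse each row. My sketch of that idea, to be checked later:

```python
def rotate(matrix):
    n = len(matrix)
    # Step 1: transpose in place (swap across the main diagonal)
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j], matrix[j][i] = matrix[j][i], matrix[i][j]
    # Step 2: reverse each row to complete the 90-degree clockwise rotation
    for row in matrix:
        row.reverse()

m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rotate(m)
print(m)  # [[7, 4, 1], [8, 5, 2], [9, 6, 3]]
```

Both steps only swap elements, so no second matrix is allocated, which satisfies the in-place constraint.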

Set Matrix Zeroes

Difficulty: Medium
Source: LeetCode

Description

Given an m x n integer matrix, if an element is 0, set its entire row and column to 0's.

Constraint: You must do it in place.

Example

Input: matrix = [[1,1,1],
                 [1,0,1],
                 [1,1,1]]
Output: [[1,0,1],
         [0,0,0],
         [1,0,1]]

Code

# To be solved
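One O(1)-extra-space idea I want to try: use the first row and first column themselves as markers for which rows/columns must be zeroed, remembering separately whether they were zero to begin with. A sketch, not yet submitted:

```python
def setZeroes(matrix):
    m, n = len(matrix), len(matrix[0])
    first_row_zero = any(matrix[0][j] == 0 for j in range(n))
    first_col_zero = any(matrix[i][0] == 0 for i in range(m))

    # Use the first row/column as markers for the rest of the matrix
    for i in range(1, m):
        for j in range(1, n):
            if matrix[i][j] == 0:
                matrix[i][0] = 0
                matrix[0][j] = 0

    # Zero the inner cells based on the markers
    for i in range(1, m):
        for j in range(1, n):
            if matrix[i][0] == 0 or matrix[0][j] == 0:
                matrix[i][j] = 0

    if first_row_zero:
        for j in range(n):
            matrix[0][j] = 0
    if first_col_zero:
        for i in range(m):
            matrix[i][0] = 0

grid = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
setZeroes(grid)
print(grid)  # [[1, 0, 1], [0, 0, 0], [1, 0, 1]]
```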

Two Pointers Intro

2 Pointers Technique

Watch this video to get an overview of the pattern

2 Pointers Problems

Sliding Window Algorithm - Variable Length + Fixed Length

Reverse String

Difficulty: Easy
Source: LeetCode

Description

Write a function that reverses a string in-place.

Example

Input: s = ["h","e","l","l","o"]
Output: ["o","l","l","e","h"]

Code

# To be solved
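This is the textbook two-pointer warm-up, so here is a quick sketch: swap the outermost pair and walk both pointers inward.

```python
def reverseString(s):
    # Two pointers: swap the outermost pair, then move both inward
    left, right = 0, len(s) - 1
    while left < right:
        s[left], s[right] = s[right], s[left]
        left += 1
        right -= 1

chars = ["h", "e", "l", "l", "o"]
reverseString(chars)
print(chars)  # ['o', 'l', 'l', 'e', 'h']
```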

Two Sum II - Input Array Is Sorted

Difficulty: Medium
Source: LeetCode

Description

You are given a 1-indexed integer array numbers that is sorted in non-decreasing order and an integer target.

Your task is to return the 1-based indices of two different elements in numbers whose sum is exactly equal to target, with the guarantee that exactly one such pair exists.

Please see full description in this link

Example

Example 1:

Input: numbers = [2, 7, 11, 15], target = 9

Expected output: [1, 2]

Explanation: numbers[1] + numbers[2] = 2 + 7 = 9, so the correct indices are [1, 2].

Code

# To be solved
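Sketch of the classic sorted-array two-pointer idea (my attempt): start at both ends; a sum that is too small means the left pointer must move right, too big means the right pointer must move left.

```python
def twoSum(numbers, target):
    left, right = 0, len(numbers) - 1
    while left < right:
        total = numbers[left] + numbers[right]
        if total == target:
            return [left + 1, right + 1]  # the problem expects 1-based indices
        if total < target:
            left += 1   # need a bigger sum
        else:
            right -= 1  # need a smaller sum
    return []

print(twoSum([2, 7, 11, 15], 9))  # [1, 2]
```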

3sum

Difficulty: Medium
Source: LeetCode

Description

You are given an integer array nums, and the goal is to return all unique triplets [nums[i], nums[j], nums[k]] such that each index is distinct and the sum of the three numbers is zero. The answer must not include duplicate triplets, even if the same values appear multiple times in the array.

Please see full description in this link

Example

Example 1:

Input: nums = [-1, 0, 1, 2, -1, -4]

One valid output: [[-1, -1, 2], [-1, 0, 1]] (order of triplets or numbers within a triplet does not matter).

Code

# To be solved
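My working sketch: sort the array, fix one anchor element, and run the Two Sum II two-pointer scan on the rest, skipping duplicates to avoid repeated triplets. Not yet verified against the judge:

```python
def threeSum(nums):
    nums.sort()
    result = []
    for i in range(len(nums) - 2):
        if i > 0 and nums[i] == nums[i - 1]:
            continue  # skip duplicate anchor values
        left, right = i + 1, len(nums) - 1
        while left < right:
            total = nums[i] + nums[left] + nums[right]
            if total < 0:
                left += 1
            elif total > 0:
                right -= 1
            else:
                result.append([nums[i], nums[left], nums[right]])
                while left < right and nums[left] == nums[left + 1]:
                    left += 1   # skip duplicate second values
                left += 1
                right -= 1
    return result

print(threeSum([-1, 0, 1, 2, -1, -4]))  # [[-1, -1, 2], [-1, 0, 1]]
```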

Container With Most Water

Difficulty: Medium
Source: LeetCode

Description

You are given an array height where each element represents the height of a vertical line drawn at that index on the x-axis.

Your goal is to pick two distinct lines such that, using the x-axis as the base, the container formed between these lines holds the maximum amount of water, and you must return that maximum water area.

Please see full description in this link

Example

Example 1:

  • Input: height = [1, 8, 6, 2, 5, 4, 8, 3, 7]
  • Output: 49
  • Explanation (high level): The best container uses the line of height 8 and the line of height 7, which are far enough apart that the width and the shorter height together produce area 49.

Code

# To be solved
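Two-pointer sketch (my understanding of the standard argument): the area is width times the shorter line, so the only move that can possibly improve things is advancing the shorter pointer.

```python
def maxArea(height):
    left, right = 0, len(height) - 1
    best = 0
    while left < right:
        # Area is limited by the shorter of the two lines
        width = right - left
        best = max(best, width * min(height[left], height[right]))
        # Move the shorter pointer: keeping it can never give a larger area
        if height[left] < height[right]:
            left += 1
        else:
            right -= 1
    return best

print(maxArea([1, 8, 6, 2, 5, 4, 8, 3, 7]))  # 49
```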

Remove Duplicates from Sorted Array

Difficulty: Medium
Source: LeetCode

Description

You are given an integer array nums sorted in non-decreasing order, and you need to modify it in-place so that each distinct value appears only once in the prefix of the array. After the operation, you return an integer k representing how many unique values remain at the start of nums, and the first k positions should contain those unique values in their original relative order.

Please see full description in this link

Example

Example 1:

  • Input: nums = [1, 1, 2]
  • Output: k = 2 and the first k elements of nums become [1, 2, _] (the last position can hold any value)
  • Explanation: The unique values are 1 and 2, so they occupy the first two positions and the function returns 2.

Code

# To be solved
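A slow/fast pointer sketch: a read pointer scans every element, a write pointer marks the end of the unique prefix, and a value is copied only when it differs from the last unique one.

```python
def removeDuplicates(nums):
    if not nums:
        return 0
    write = 1  # nums[:write] is the unique prefix built so far
    for read in range(1, len(nums)):
        if nums[read] != nums[write - 1]:
            nums[write] = nums[read]
            write += 1
    return write

nums = [1, 1, 2]
k = removeDuplicates(nums)
print(k, nums[:k])  # 2 [1, 2]
```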

Move Zeroes

Difficulty: Medium
Source: LeetCode

Description

You are given an integer array nums and must move every 0 in the array to the end, without changing the relative order of the non-zero values. The rearrangement has to be performed directly on nums (in-place), and the overall extra space usage must remain O(1).

Please see full description in this link

Example

Example 1:

  • Input: nums = [0, 1, 0, 3, 12]
  • Output (final state of nums): [1, 3, 12, 0, 0]
  • Explanation: The non-zero elements 1, 3, 12 stay in the same relative order, and both zeros are moved to the end.

Code

# To be solved
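Same slow/fast pointer idea as Remove Duplicates, sketched for this problem: the write pointer tracks where the next non-zero value belongs, and swapping keeps the relative order.

```python
def moveZeroes(nums):
    write = 0  # next position for a non-zero value
    for read in range(len(nums)):
        if nums[read] != 0:
            nums[write], nums[read] = nums[read], nums[write]
            write += 1

nums = [0, 1, 0, 3, 12]
moveZeroes(nums)
print(nums)  # [1, 3, 12, 0, 0]
```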

Valid Palindrome

Difficulty: Medium
Source: LeetCode

Description

You are given a string s consisting of printable ASCII characters, and the goal is to determine whether it forms a palindrome when considering only letters and digits and treating uppercase and lowercase as the same. After filtering out non-alphanumeric characters and converting all remaining characters to a single case, the cleaned string must read the same from left to right and right to left to be considered valid.

Please see full description in this link

Example

Example 1:

  • Input: s = "A man, a plan, a canal: Panama"​
  • Output: True
  • Explanation: After removing non-alphanumeric characters and lowering case, it becomes "amanaplanacanalpanama", which reads the same forwards and backwards.​

Code

# To be solved
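Instead of building the cleaned string first, the two-pointer version skips non-alphanumeric characters on the fly. My sketch:

```python
def isPalindrome(s):
    left, right = 0, len(s) - 1
    while left < right:
        # Skip anything that is not a letter or digit
        while left < right and not s[left].isalnum():
            left += 1
        while left < right and not s[right].isalnum():
            right -= 1
        if s[left].lower() != s[right].lower():
            return False
        left += 1
        right -= 1
    return True

print(isPalindrome("A man, a plan, a canal: Panama"))  # True
```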

Sliding Window Intro

Sliding Window Technique

Watch this video to get an overview of the pattern

Sliding Window Problems

Sliding Window Algorithm - Variable Length + Fixed Length

Longest Substring Without Repeating Characters

Description

You are given a string s, and the goal is to determine the maximum length of any substring that has all unique characters, meaning no character appears more than once in that substring.

The substring must be contiguous within s (no reordering or skipping), and you only need to return the length of the longest such substring, not the substring itself.

Example

Example 1:

  • Input: s = "abcabcbb"
  • Output: 3
  • Explanation: One longest substring without repeating characters is "abc", which has length 3.

Example 2:

  • Input: s = "bbbbb"
  • Output: 1
  • Explanation: Every substring with unique characters is just "b", so the maximum length is 1.

Example 3:

  • Input: s = "pwwkew"
  • Output: 3
  • Explanation: A valid longest substring is "wke" with length 3; note that "pwke" is not allowed because it is not contiguous.

You can test edge cases like s = "" (empty string) or s = " " (single space) to see how the result behaves.

Code

# LeetCode 3: Longest Substring Without Repeating Characters
# Credit: Problem from LeetCode (see problem page for full statement and tests).

def lengthOfLongestSubstring(s: str) -> int:
    """
    Write your solution here.

    Requirements:
    - Consider contiguous substrings of s.
    - Within the chosen substring, all characters must be distinct.
    - Return the maximum length among all such substrings.
    """
    # To be solved
    raise NotImplementedError
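One sliding-window sketch that would satisfy the stub above: keep a dictionary of each character's last-seen index, and whenever the current character already appears inside the window, jump the left edge past its previous occurrence. The snake_case name is mine, to keep it separate from the stub:

```python
def length_of_longest_substring(s):
    last_seen = {}   # char -> index of its most recent occurrence
    start = 0        # left edge of the current window
    best = 0
    for i, ch in enumerate(s):
        # If ch repeats inside the window, move the left edge past its last occurrence
        if ch in last_seen and last_seen[ch] >= start:
            start = last_seen[ch] + 1
        last_seen[ch] = i
        best = max(best, i - start + 1)
    return best

print(length_of_longest_substring("abcabcbb"))  # 3
print(length_of_longest_substring("pwwkew"))    # 3
```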

Maximum Number of Vowels in a Substring of Given Length

Difficulty: Medium
Source: LeetCode

Description

Given a string s and an integer k, return the maximum number of vowel letters in any substring of s with length k.

Vowel letters in English are 'a', 'e', 'i', 'o', and 'u'.

Examples

Input: s = "abciiidef", k = 3
Output: 3
Explanation: The substring "iii" contains 3 vowel letters
Input: s = "aeiou", k = 2
Output: 2
Explanation: Any substring of length 2 contains 2 vowels
Input: s = "leetcode", k = 3
Output: 2
Explanation: "lee", "eet" and "ode" contain 2 vowels

Code

# To be solved
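This is a fixed-length sliding window, so my sketch only recounts the characters that enter and leave the window instead of rescanning each substring:

```python
def maxVowels(s, k):
    vowels = set("aeiou")
    # Count vowels in the first window of length k
    count = sum(1 for ch in s[:k] if ch in vowels)
    best = count
    # Slide the window: add the entering character, drop the leaving one
    for i in range(k, len(s)):
        count += (s[i] in vowels) - (s[i - k] in vowels)
        best = max(best, count)
    return best

print(maxVowels("abciiidef", 3))  # 3
print(maxVowels("leetcode", 3))   # 2
```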

Climbing Stairs

Difficulty: Easy
Source: LeetCode

Description

You are climbing a staircase. It takes n steps to reach the top.

Each time you can either climb 1 or 2 steps. In how many distinct ways can you climb to the top?

Examples

Input: n = 2
Output: 2
Explanation: There are two ways to climb to the top:
1. 1 step + 1 step
2. 2 steps

Input: n = 3
Output: 3
Explanation: There are three ways to climb to the top:
1. 1 step + 1 step + 1 step
2. 1 step + 2 steps
3. 2 steps + 1 step

Code

# To be solved

Counting Bits

Difficulty: Easy
Source: LeetCode

Description

Given an integer n, return an array ans of length n + 1 such that for each i (0 <= i <= n), ans[i] is the number of 1's in the binary representation of i.

Example

Input: n = 2
Output: [0,1,1]
Explanation:
0 --> 0 (zero 1's)
1 --> 1 (one 1)
2 --> 10 (one 1)

Code

# To be solved
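A DP sketch I like for this one: the bit count of i equals the bit count of i with its last bit chopped off (i >> 1) plus that last bit (i & 1), so each answer reuses an earlier one:

```python
def countBits(n):
    # ans[i] = ans[i >> 1] + lowest bit of i
    ans = [0] * (n + 1)
    for i in range(1, n + 1):
        ans[i] = ans[i >> 1] + (i & 1)
    return ans

print(countBits(5))  # [0, 1, 1, 2, 1, 2]
```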

Decode Ways

Difficulty: Medium
Source: LeetCode

Description

Given a string s of digits, return the number of ways to decode it using the mapping:

"1" -> 'A', 
"2" -> 'B',
 ..., 
"26" -> 'Z'

A digit string can be decoded in multiple ways since some codes overlap (e.g., "12" can be "AB" or "L").

Rules:

  • Valid codes are "1" to "26"
  • Leading zeros are invalid (e.g., "06" is invalid, but "6" is valid)
  • Return 0 if the string cannot be decoded

Examples

Input: s = "12"
Output: 2
Explanation: Can be decoded as "AB" (1, 2) or "L" (12)
Input: s = "11106"
Output: 2
Explanation: 
- "AAJF" with grouping (1, 1, 10, 6)
- "KJF" with grouping (11, 10, 6)
- (1, 11, 06) is invalid because "06" is not valid

Code

# To be solved
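My sketch: this is Fibonacci-shaped. At each position you can take one digit (if it isn't "0") or two digits (if they form "10" to "26"), so the count at i depends only on the counts at i-1 and i-2:

```python
def numDecodings(s):
    if not s or s[0] == "0":
        return 0
    # prev2 = ways to decode s[:i-1], prev1 = ways to decode s[:i]
    prev2, prev1 = 1, 1
    for i in range(1, len(s)):
        current = 0
        if s[i] != "0":                     # take s[i] as a single digit
            current += prev1
        if "10" <= s[i - 1:i + 1] <= "26":  # take two digits together
            current += prev2
        if current == 0:
            return 0  # stuck, e.g. "30" or "00"
        prev2, prev1 = prev1, current
    return prev1

print(numDecodings("12"))     # 2
print(numDecodings("11106"))  # 2
```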

Maximal Square

Difficulty: Medium
Source: LeetCode

Description

Given an m x n binary matrix filled with 0's and 1's, find the largest square containing only 1's and return its area.

Example

Input: matrix = [
  ["1","0","1","0","0"],
  ["1","0","1","1","1"],
  ["1","1","1","1","1"],
  ["1","0","0","1","0"]
]
Output: 4
Explanation: The largest square of 1's has side length 2, so area = 2 × 2 = 4

Code

# To be solved
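DP sketch: let dp[i][j] be the side of the largest all-ones square whose bottom-right corner sits at (i, j); it is limited by the three neighboring squares above, left, and diagonal:

```python
def maximalSquare(matrix):
    m, n = len(matrix), len(matrix[0])
    # dp[i][j] = side of the largest square with bottom-right corner at (i, j)
    dp = [[0] * n for _ in range(m)]
    best = 0
    for i in range(m):
        for j in range(n):
            if matrix[i][j] == "1":
                if i == 0 or j == 0:
                    dp[i][j] = 1
                else:
                    dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
                best = max(best, dp[i][j])
    return best * best

grid = [["1","0","1","0","0"],
        ["1","0","1","1","1"],
        ["1","1","1","1","1"],
        ["1","0","0","1","0"]]
print(maximalSquare(grid))  # 4
```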

Word Break

Difficulty: Medium
Source: LeetCode

Description

Given a string s and a dictionary of strings wordDict, return true if s can be segmented into a space-separated sequence of one or more dictionary words.

Note: The same word in the dictionary may be reused multiple times in the segmentation.

Example

Input: s = "leetcode", wordDict = ["leet","code"]
Output: true
Explanation: "leetcode" can be segmented as "leet code"
Input: s = "applepenapple", wordDict = ["apple","pen"]
Output: true
Explanation: "applepenapple" can be segmented as "apple pen apple"
Note: "apple" is reused

Code

# To be solved
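DP sketch: let dp[i] mean "the prefix s[:i] can be segmented". A prefix is breakable if some earlier breakable prefix s[:j] is followed by a dictionary word s[j:i]:

```python
def wordBreak(s, wordDict):
    words = set(wordDict)  # set lookup is O(1)
    n = len(s)
    # dp[i] = True if s[:i] can be segmented into dictionary words
    dp = [False] * (n + 1)
    dp[0] = True  # the empty prefix is trivially segmentable
    for i in range(1, n + 1):
        for j in range(i):
            if dp[j] and s[j:i] in words:
                dp[i] = True
                break
    return dp[n]

print(wordBreak("leetcode", ["leet", "code"]))  # True
```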

Longest Increasing Subsequence

Difficulty: Medium
Source: LeetCode

Description

Given an integer array nums, return the length of the longest strictly increasing subsequence.

A subsequence is derived by deleting some or no elements without changing the order of the remaining elements.

Example

Input: nums = [10,9,2,5,3,7,101,18]
Output: 4
Explanation: The longest increasing subsequence is [2,3,7,101], with length 4

Code

# To be solved
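The O(n²) DP sketch: dp[i] is the length of the longest increasing subsequence ending exactly at index i, extended from any smaller earlier element. (There is also an O(n log n) version with binary search, which I haven't written up yet.)

```python
def lengthOfLIS(nums):
    # dp[i] = length of the longest increasing subsequence ending at index i
    dp = [1] * len(nums)
    for i in range(1, len(nums)):
        for j in range(i):
            if nums[j] < nums[i]:
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp)

print(lengthOfLIS([10, 9, 2, 5, 3, 7, 101, 18]))  # 4
```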

Subarray Sum Equals K

Problem credit: This note is for practicing the LeetCode problem “Subarray Sum Equals K”. For the full official statement, examples, and judge, see the LeetCode problem page.

Description

You are given an integer array nums and an integer k, and the task is to return the number of non-empty contiguous subarrays whose elements add up to k.

A subarray is defined as a sequence of one or more elements that appear consecutively in the original array, without reordering or skipping indices.

Example

Example 1:

  • Input: nums = [1, 1, 1], k = 2
  • Output: 2
  • Explanation: The subarrays [1, 1] using indices [0, 1] and [1, 2] both sum to 2, so the answer is 2.

Example 2:

  • Input: nums = [1, 2, 3], k = 3
  • Output: 2
  • Explanation: The subarrays [1, 2] and [3] each sum to 3, giving a total count of 2.

You can experiment with inputs that include negative numbers, such as [2, 2, -4, 1, 1, 2] and various k values, to see how multiple overlapping subarrays can share the same sum.

Code

# LeetCode 560: Subarray Sum Equals K
# Credit: Problem from LeetCode (see problem page for full statement and tests).

from typing import List

def subarraySum(nums: List[int], k: int) -> int:
    """
    Write your solution here.

    Requirements:
    - Count all non-empty contiguous subarrays whose sum is exactly k.
    - nums may contain positive, negative, and zero values.
    - Return the total number of such subarrays.
    """
    # To be solved
    raise NotImplementedError
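A reference sketch using the prefix-sum + dictionary trick (separate snake_case name so it doesn't clash with the stub): a subarray ending at the current position sums to k exactly when some earlier prefix sum equals running - k.

```python
from collections import defaultdict

def subarray_sum(nums, k):
    # prefix_counts[p] = how many prefixes seen so far have sum p
    prefix_counts = defaultdict(int)
    prefix_counts[0] = 1  # the empty prefix
    running = 0
    count = 0
    for x in nums:
        running += x
        # A subarray ending here sums to k iff an earlier prefix summed to running - k
        count += prefix_counts[running - k]
        prefix_counts[running] += 1
    return count

print(subarray_sum([1, 1, 1], 2))  # 2
print(subarray_sum([1, 2, 3], 3))  # 2
```

Because it only compares prefix sums, this handles negative numbers too, unlike a plain sliding window.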

Count Vowel Substrings of a String

Difficulty: Easy
Source: LeetCode

Description

Given a string word, return the number of vowel substrings in word.

A vowel substring is a contiguous substring that:

  • Only consists of vowels ('a', 'e', 'i', 'o', 'u')
  • Contains all five vowels at least once

Examples

Input: word = "aeiouu"
Output: 2
Explanation: The vowel substrings are "aeiou" and "aeiouu"
Input: word = "unicornarihan"
Output: 0
Explanation: Not all 5 vowels are present, so there are no vowel substrings

Code

# To be solved
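Since the problem is rated Easy and word lengths are small, my first sketch is brute force: grow a vowel-only substring from each start index and count whenever all five vowels have been seen.

```python
def countVowelSubstrings(word):
    vowels = set("aeiou")
    count = 0
    n = len(word)
    for start in range(n):
        seen = set()
        for end in range(start, n):
            if word[end] not in vowels:
                break  # a vowel substring must consist of vowels only
            seen.add(word[end])
            if len(seen) == 5:
                count += 1
    return count

print(countVowelSubstrings("aeiouu"))  # 2
```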

Roman to Integer

The problem can be found here

Solution one

Let's think of a simple solution for this problem: change the way the numeral system works. In other words, instead of subtracting anywhere, we rewrite the input so that everything becomes a plain sum.

class Solution:
    def romanToInt(self, s: str) -> int:
        roman = {
            "I": 1,
            "V": 5,
            "X": 10,
            "L": 50,
            "C": 100,
            "D": 500,
            "M": 1000
        }
        replace = {
            "IV": "IIII",
            "IX": "VIIII",
            "XL": "XXXX",
            "XC": "LXXXX",
            "CD": "CCCC",
            "CM": "DCCCC"
        }

        for k, v in replace.items(): 
            s = s.replace(k, v)
            
        return sum([roman[char] for char in s])

Solution two

Another way to think about this: if a smaller numeral appears before a bigger one, we should subtract it; otherwise, we just keep adding.

class Solution:
    def romanToInt(self, s: str) -> int:
        roman = {
            "I": 1,
            "V": 5,
            "X": 10,
            "L": 50,
            "C": 100,
            "D": 500,
            "M": 1000
        }
        total = 0
        pre_value = 0

        for i in s:
            if pre_value < roman[i]:
                total += roman[i] - 2 * pre_value
            else:
                total += roman[i]
            
            pre_value = roman[i]
        
        return total

In runtime, this solution beats 100% of submissions, but in memory only about 20%.

Why roman[i] - 2 * pre_value? Because pre_value was added in the previous step, but it should actually be subtracted, so we remove it twice: once to cancel the earlier addition and once for the subtraction itself.

Basic Calculator

Difficulty: Medium

Description

Given a string expression containing digits and operators (+, -, *, /), evaluate the expression and return the result.

Rules:

  • Follow standard operator precedence (multiplication and division before addition and subtraction)
  • Division should be integer division (truncate toward zero)
  • No parentheses in the expression

Examples

Input: s = "3+2*2"
Output: 7
Explanation: Multiplication first: 3 + (2*2) = 3 + 4 = 7
Input: s = "4-8/2"
Output: 0
Explanation: Division first: 4 - (8/2) = 4 - 4 = 0
Input: s = "14/3*2"
Output: 8
Explanation: Left to right for same precedence: (14/3)*2 = 4*2 = 8

Code

# To be solved
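My sketch of the usual stack approach: accumulate the current number, and when an operator (or the end of the string) is reached, either push the number (for + and -) or combine it with the top of the stack (for * and /); the final answer is the sum of the stack.

```python
def calculate(s):
    stack = []
    num = 0
    op = "+"  # the operator that precedes the current number
    for i, ch in enumerate(s):
        if ch.isdigit():
            num = num * 10 + int(ch)
        if ch in "+-*/" or i == len(s) - 1:
            if op == "+":
                stack.append(num)
            elif op == "-":
                stack.append(-num)
            elif op == "*":
                stack.append(stack.pop() * num)
            else:  # "/" with truncation toward zero
                stack.append(int(stack.pop() / num))
            op = ch
            num = 0
    return sum(stack)

print(calculate("3+2*2"))   # 7
print(calculate("4-8/2"))   # 0
print(calculate("14/3*2"))  # 8
```

Note that `int(-8 / 2)` truncates toward zero, which is what the problem asks for; Python's `//` would floor negative results instead.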

String Metrics and Alignments

1. String Metrics and Alignments

What is a String Metric?

A string metric measures the distance or similarity between two text strings. We use these in bioinformatics to compare DNA, RNA, or protein sequences.

Edit Distance

Edit distances measure how many operations you need to transform one string into another. The operations can be:

Substitution: Replace one character with another

CAT → CUT
 |      
 A → U

Insertion/Deletion (Indel): Add or remove a character

MONKEY → MON-EY (deletion of K, or viewing it differently, insertion of gap)

Transposition: Swap adjacent characters

FORM → FROM

Each operation has a cost. The total cost defines the distance.

Distance vs Similarity

Distance metric: Range is [0, +∞)

  • Score of 0 means identical strings
  • Higher scores mean more different strings

Similarity metric: Range is (-∞, +∞)

  • High positive score means similar strings
  • Negative score means distant strings

In bioinformatics, we typically work with similarity because we want to find sequences that are alike.


2. Hamming Distance

The simplest edit distance. It only counts substitutions and requires strings of equal length.

Example

ACCCTCGCTAGATAAATAGATCTGATAG
||x||||||||||x|||||x||||x|||
ACTCTCGCTAGATGAATAGGTCTGTTAG

Count the mismatches (marked with x): positions 3, 14, 19, 25

Hamming Distance = 4

Converting to Similarity

For the example above:

  • Total positions: 28
  • Matches: 24
  • Mismatches: 4

Similarity = +24 - 4 = 20

(We add +1 for each match, subtract 1 for each mismatch)


Implementation: AlnSeq Class (Basic)

We start by creating a class to hold two sequences and compute Hamming metrics.

Step 1: Define the class and constructor

class AlnSeq:
    def __init__(self, seq1, seq2):
        if len(seq1) != len(seq2):
            raise ValueError("Sequences must have equal length")
        
        self.seq1 = seq1
        self.seq2 = seq2

Step 2: Add Hamming distance method

    def compute_hamming_distance(self):
        self.distance = 0
        for i in range(len(self.seq1)):
            if self.seq1[i] != self.seq2[i]:
                self.distance += 1

        return self.distance

Step 3: Add Hamming similarity method

    def compute_hamming_similarity(self):
        if getattr(self, 'distance', None) is None:
            self.compute_hamming_distance()
        # +1 * number of matches + (-1 * number of mismatches)
        # This expression could be simplified, but is left like this for readability
        return (len(self.seq1) - self.distance) - self.distance

Usage

aln = AlnSeq("ACCCTCGCTAG", "ACTCTCGCTAG")
print(aln.compute_hamming_distance())    # 1
print(aln.compute_hamming_similarity())  # 9

3. Biologically Relevant Substitution Matrices

Not all substitutions are equal in biology. Some amino acids or nucleotides are chemically similar, so swapping between them is "less bad" than swapping very different ones.

Transition/Transversion Matrix (for DNA)

Transitions are mutations between chemically similar bases:

  • A ↔ G (both purines)
  • T ↔ C (both pyrimidines)

Transversions are mutations between different types:

  • A ↔ T, A ↔ C, G ↔ T, G ↔ C

An example scoring matrix (match = 2, transition = 0, transversion = -1):

      A    T    C    G
  A   2   -1   -1    0
  T  -1    2    0   -1
  C  -1    0    2   -1
  G   0   -1   -1    2

Example calculation:

AAAA
|XX/
ATCG

Score = Score(A,A) + Score(A,T) + Score(A,C) + Score(A,G) = 2 + (-1) + (-1) + 0 = 0

PAM Matrices (for Proteins)

PAM = Point Accepted Mutation

PAMn[i,j] gives the likelihood of residue i being replaced by residue j through evolutionary mutations over a time when n mutations occur per 100 residues.

Most commonly used: PAM250

BLOSUM Matrices (for Proteins)

BLOSUMn[i,j] is based on observed substitution frequencies in aligned protein sequences that share n% sequence identity.

Most commonly used: BLOSUM62


Implementation: SubstitutionMatrix Class

A class to store and access substitution matrix values.

Step 1: Define the class with matrix data

class SubstitutionMatrix:
    def __init__(self, matrix_dict):
        # matrix_dict is a nested dictionary
        # matrix_dict['A']['T'] gives score for A->T
        self.matrix = matrix_dict

Step 2: Implement getitem for easy indexing

This lets us use matrix['A', 'T'] syntax.

    def __getitem__(self, key):
        # key is a tuple like ('A', 'T')
        res1, res2 = key
        return self.matrix[res1][res2]

Step 3: Create the Transition/Transversion matrix

# Define the matrix as nested dictionaries
tt_matrix_data = {
    'A': {'A': 2, 'T': -1, 'C': -1, 'G': 0},
    'T': {'A': -1, 'T': 2, 'C': 0, 'G': -1},
    'C': {'A': -1, 'T': 0, 'C': 2, 'G': -1},
    'G': {'A': 0, 'T': -1, 'C': -1, 'G': 2}
}

tt_matrix = SubstitutionMatrix(tt_matrix_data)

Usage

print(tt_matrix['A', 'A'])   # 2  (match)
print(tt_matrix['A', 'G'])   # 0  (transition)
print(tt_matrix['A', 'T'])   # -1 (transversion)

Implementation: Extend AlnSeq with Matrix-Based Similarity

Add a method to compute similarity using any substitution matrix.

class AlnSeq:
    # ... previous methods ...
    
    def hamming_similarity_matrix(self, sub_matrix):
        if len(self.seq1) != len(self.seq2):
            raise ValueError("Sequences must have equal length")
        
        score = 0
        for i in range(len(self.seq1)):
            # Use the substitution matrix for scoring
            score += sub_matrix[self.seq1[i], self.seq2[i]]
        return score

Usage

aln = AlnSeq("AAAA", "ATCG")
print(aln.hamming_similarity_matrix(tt_matrix))  # 2 + (-1) + (-1) + 0 = 0

4. Levenshtein Distance

Unlike Hamming distance, Levenshtein allows insertions and deletions (indels), not just substitutions.

Example

KITTEN → SITTING

Step 1: K → S     (substitution)
Step 2: E → I     (substitution)  
Step 3: insert G  (insertion)

Levenshtein Distance = 3

Levenshtein vs Hamming

For strings of the same length: Hamming Distance ≥ Levenshtein Distance

FLAW    vs    LAWN

Hamming (no gaps allowed):
FLAW
XXXX
LAWN
Hamming Distance = 4

Levenshtein (gaps allowed):
FLAW-
v|||^
-LAWN
Levenshtein Distance = 2 (delete F, insert N)

This shows why we need sequence alignment.
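The classic way to compute Levenshtein distance is itself a dynamic-programming algorithm, which previews what comes later in these notes. A minimal sketch, independent of the classes defined here:

```python
def levenshtein(s1, s2):
    # Classic DP edit distance (Wagner-Fischer):
    # d[i][j] = edit distance between s2[:i] and s1[:j]
    rows, cols = len(s2) + 1, len(s1) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j          # turn empty string into s1[:j] by j insertions
    for i in range(rows):
        d[i][0] = i          # turn s2[:i] into empty string by i deletions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[j - 1] == s2[i - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # match/substitution
    return d[rows - 1][cols - 1]

print(levenshtein("KITTEN", "SITTING"))  # 3
print(levenshtein("FLAW", "LAWN"))       # 2
```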


5. Sequence Alignment

Each character of one sequence is matched with:

  • A character in the other sequence, OR
  • A gap (-)

After alignment, both gapped strings have the same length.

How Many Possible Alignments?

Simple case (no internal gaps): Just slide sequences past each other

--TCA    -TCA    TCA    TCA    TCA-    TCA--
TA---    TA--    TA-    -TA    --TA    ---TA

Number of alignments = len1 + len2 + 1

Complex case (internal gaps allowed): The number explodes

                    | 1           if len1 = 0
N(len1, len2) =     | 1           if len2 = 0
                    | N(len1-1, len2) + N(len1, len2-1) + N(len1-1, len2-1)  otherwise

Examples:

  • N(3,3) = 63
  • N(5,5) = 1,683
  • N(9,9) = 1,462,563

Brute force is not feasible. We need a smarter approach.


Implementation: Counting Possible Alignments

A recursive function to calculate the number of possible alignments.

def count_alignments(len1, len2):
    # Base cases: if one sequence is empty,
    # only one alignment is possible (all gaps)
    if len1 == 0:
        return 1
    if len2 == 0:
        return 1
    
    # Recursive case: three choices at each position
    return (count_alignments(len1 - 1, len2) +      # gap in seq2
            count_alignments(len1, len2 - 1) +      # gap in seq1
            count_alignments(len1 - 1, len2 - 1))   # match/mismatch

Usage

print(count_alignments(3, 3))   # 63
print(count_alignments(5, 5))   # 1683
print(count_alignments(9, 9))   # 1462563

This function is slow because it recalculates the same subproblems many times. That is exactly why we need dynamic programming.
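A minimal fix is to memoize the recursion, e.g. with functools.lru_cache, so each (len1, len2) pair is computed only once. This turns the exponential runtime into O(len1 × len2):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_alignments_memo(len1, len2):
    # Same recurrence as before, but cached: each subproblem is solved once.
    if len1 == 0 or len2 == 0:
        return 1
    return (count_alignments_memo(len1 - 1, len2) +      # gap in seq2
            count_alignments_memo(len1, len2 - 1) +      # gap in seq1
            count_alignments_memo(len1 - 1, len2 - 1))   # match/mismatch

print(count_alignments_memo(9, 9))    # 1462563
print(count_alignments_memo(50, 50))  # instant now, hopeless for the naive version
```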


6. Dynamic Programming

An algorithmic paradigm that works when a problem has:

  1. Optimal Substructure: The optimal solution depends on solutions to smaller sub-problems

  2. Overlapping Subproblems: The same sub-problems are solved multiple times

Key idea: Store solutions to sub-problems and build up to the full solution.


7. Needleman-Wunsch Algorithm (Global Alignment)

Finds the best alignment across the entire length of both sequences.

Setup

Given:

  • Sequence 1: TCA
  • Sequence 2: TA
  • Gap penalty: -2
  • Substitution scores: Transition/Transversion matrix

Step 1: Initialize the Matrix

Create a matrix with sequence 1 across the top and sequence 2 down the side. Add a gap column/row at the start.

        -     T     C     A
   -    0    -2    -4    -6
   T   -2
   A   -4

First row: Each cell = previous cell + gap penalty (0, -2, -4, -6)
First column: Same logic (0, -2, -4, -6)

Step 2: Fill the Matrix

For each cell (i,j), take the maximum of three options:

Cell[i,j] = max(
    Cell[i-1, j] + gap,           // come from above (gap in seq1)
    Cell[i, j-1] + gap,           // come from left (gap in seq2)  
    Cell[i-1, j-1] + match(i,j)   // come from diagonal (match/mismatch)
)

Filling cell (T,T):

max(-2 + (-2), 0 + match(T,T), -2 + (-2))
= max(-4, 0 + 2, -4)
= max(-4, 2, -4)
= 2

Complete filled matrix:

        -     T     C     A
   -    0    -2    -4    -6
   T   -2     2     0    -2
   A   -4     0     1     2

The bottom-right cell (2) is the optimal alignment score.

Step 3: Backtracking

Start from bottom-right. At each cell, go back the way you came:

  • Diagonal: match characters
  • Up: gap in sequence 1
  • Left: gap in sequence 2
        -     T     C     A
   -    0 ←  -2 ←  -4 ←  -6
        ↑    ↖↑    ↖↑    ↖↑
   T   -2 ←   2  ←  0  ← -2
        ↑    ↖     ↖     ↖
   A   -4 ←   0  ←  1  ← [2]

Path from [2]: diagonal → diagonal → diagonal

Final Alignment:

TCA
|v|
T-A

Score = 2

8. Smith-Waterman Algorithm (Local Alignment)

Finds the best alignment between substrings of the sequences.

Differences from Needleman-Wunsch

  1. Initialization: First row and column are all zeros (no penalty for starting/ending gaps)

  2. Filling: Add a fourth option, zero:

Cell[i,j] = max(
    Cell[i-1, j] + gap,
    Cell[i, j-1] + gap,
    Cell[i-1, j-1] + match(i,j),
    0                              // NEW: discard negative paths
)
  3. Backtracking: Start from the highest value in the matrix (not bottom-right). Stop when you hit zero.

Example

Sequences: TGA and GA, Gap penalty: -2

Initialization:

        -     T     G     A
   -    0     0     0     0
   G    0
   A    0

Filling cell (G,T):

max(0 + (-2), 0 + match(T,G), 0 + (-2), 0)
= max(-2, -1, -2, 0)
= 0

Complete matrix:

        -     T     G     A
   -    0     0     0     0
   G    0     0     2     0
   A    0     0     0    [4]

Backtracking from highest value [4]:

  • (A,A): came from diagonal, match A-A
  • (G,G): came from diagonal, match G-G
  • (G,T): value is 0, STOP

Final Local Alignment:

GA
||
GA

Score = 4

The algorithm found the best matching substring, ignoring the T at the beginning.


Implementation: SeqPair Class (Needleman-Wunsch)

Step 1: Define the class

class SeqPair:
    def __init__(self, seq1, seq2, sub_matrix, gap_penalty=-2):
        self.seq1 = seq1
        self.seq2 = seq2
        self.sub_matrix = sub_matrix
        self.gap = gap_penalty

Step 2: Initialize the scoring matrix

    def _init_matrix(self, rows, cols):
        # Create matrix filled with zeros
        matrix = [[0] * cols for _ in range(rows)]
        return matrix

Step 3: Needleman-Wunsch matrix filling

    def needleman_wunsch(self):
        rows = len(self.seq2) + 1
        cols = len(self.seq1) + 1
        
        # Initialize score matrix
        score = self._init_matrix(rows, cols)
        
        # Initialize traceback matrix
        # 0 = diagonal, 1 = up, 2 = left
        trace = self._init_matrix(rows, cols)
        
        # Fill first row (gaps in seq2)
        for j in range(1, cols):
            score[0][j] = score[0][j-1] + self.gap
            trace[0][j] = 2  # came from left
        
        # Fill first column (gaps in seq1)
        for i in range(1, rows):
            score[i][0] = score[i-1][0] + self.gap
            trace[i][0] = 1  # came from up
        
        # Fill rest of matrix
        for i in range(1, rows):
            for j in range(1, cols):
                # Three options
                diag = score[i-1][j-1] + self.sub_matrix[self.seq1[j-1], self.seq2[i-1]]
                up = score[i-1][j] + self.gap
                left = score[i][j-1] + self.gap
                
                # Take maximum
                score[i][j] = max(diag, up, left)
                
                # Record which direction we came from
                if score[i][j] == diag:
                    trace[i][j] = 0
                elif score[i][j] == up:
                    trace[i][j] = 1
                else:
                    trace[i][j] = 2
        
        self.nw_score = score
        self.nw_trace = trace
        return score[rows-1][cols-1]  # final score

Step 4: Backtracking to get alignment

    def nw_traceback(self):
        aligned1 = ""
        aligned2 = ""
        
        # Start from bottom-right
        i = len(self.seq2)
        j = len(self.seq1)
        
        while i > 0 or j > 0:
            if i > 0 and j > 0 and self.nw_trace[i][j] == 0:
                # Diagonal: match/mismatch
                aligned1 = self.seq1[j-1] + aligned1
                aligned2 = self.seq2[i-1] + aligned2
                i -= 1
                j -= 1
            elif i > 0 and self.nw_trace[i][j] == 1:
                # Up: gap in seq1
                aligned1 = "-" + aligned1
                aligned2 = self.seq2[i-1] + aligned2
                i -= 1
            else:
                # Left: gap in seq2
                aligned1 = self.seq1[j-1] + aligned1
                aligned2 = "-" + aligned2
                j -= 1
        
        return aligned1, aligned2

Usage

seq_pair = SeqPair("TCA", "TA", tt_matrix, gap_penalty=-2)
score = seq_pair.needleman_wunsch()
print(f"Score: {score}")  # Score: 2

aln1, aln2 = seq_pair.nw_traceback()
print(aln1)  # TCA
print(aln2)  # T-A

Implementation: Smith-Waterman (Extend SeqPair)

Add local alignment method

    def smith_waterman(self):
        rows = len(self.seq2) + 1
        cols = len(self.seq1) + 1
        
        score = self._init_matrix(rows, cols)
        trace = self._init_matrix(rows, cols)
        
        # First row and column stay 0 (no initialization needed)
        
        # Track position of maximum score
        max_score = 0
        max_pos = (0, 0)
        
        # Fill matrix
        for i in range(1, rows):
            for j in range(1, cols):
                diag = score[i-1][j-1] + self.sub_matrix[self.seq1[j-1], self.seq2[i-1]]
                up = score[i-1][j] + self.gap
                left = score[i][j-1] + self.gap
                
                # Take maximum, but floor at 0
                score[i][j] = max(diag, up, left, 0)
                
                # Record direction (only if not 0)
                if score[i][j] == 0:
                    trace[i][j] = -1  # stop here
                elif score[i][j] == diag:
                    trace[i][j] = 0
                elif score[i][j] == up:
                    trace[i][j] = 1
                else:
                    trace[i][j] = 2
                
                # Update max position
                if score[i][j] > max_score:
                    max_score = score[i][j]
                    max_pos = (i, j)
        
        self.sw_score = score
        self.sw_trace = trace
        self.sw_max_pos = max_pos
        return max_score

Add local alignment backtracking

    def sw_traceback(self):
        aligned1 = ""
        aligned2 = ""
        
        # Start from maximum position
        i, j = self.sw_max_pos
        
        # Stop when we hit 0
        while i > 0 and j > 0 and self.sw_trace[i][j] != -1:
            if self.sw_trace[i][j] == 0:
                aligned1 = self.seq1[j-1] + aligned1
                aligned2 = self.seq2[i-1] + aligned2
                i -= 1
                j -= 1
            elif self.sw_trace[i][j] == 1:
                aligned1 = "-" + aligned1
                aligned2 = self.seq2[i-1] + aligned2
                i -= 1
            else:
                aligned1 = self.seq1[j-1] + aligned1
                aligned2 = "-" + aligned2
                j -= 1
        
        return aligned1, aligned2

Usage

seq_pair = SeqPair("TGA", "GA", tt_matrix, gap_penalty=-2)
score = seq_pair.smith_waterman()
print(f"Score: {score}")  # Score: 4

aln1, aln2 = seq_pair.sw_traceback()
print(aln1)  # GA
print(aln2)  # GA

9. Global vs Local: When to Use Which

Needleman-Wunsch (Global):

  • Use when comparing sequences of similar length
  • Use when you expect the sequences to be related across their entire length
  • Example: comparing two versions of the same gene

Smith-Waterman (Local):

  • Use when looking for conserved regions within longer sequences
  • Use when sequences have very different lengths
  • Example: finding a motif within a larger protein

10. Implementation Classes

AlnSeq Class

Holds two sequences and computes:

  • Hamming distance
  • Hamming similarity
  • Levenshtein similarity (for pre-aligned sequences)
  • Custom __str__ for visual output:
    AC-TG
    |X^|v
    AGTT-
    
    Where: | = match, X = substitution, ^ = insertion, v = deletion

SubstitutionMatrix Class

  • Stores the scoring matrix values
  • Implements __getitem__ for easy access: matrix['A', 'T'] returns the score

SeqPair Class

  • Computes alignment matrices (Needleman-Wunsch and Smith-Waterman)
  • Reconstructs alignments through backtracking

Implementation: Extend AlnSeq with Levenshtein and __str__

Levenshtein similarity for aligned sequences

This assumes sequences are already aligned (may contain gaps).

class AlnSeq:
    # ... previous methods ...
    
    def levenshtein_similarity(self, sub_matrix, gap_penalty=-2):
        # For aligned sequences (with gaps already inserted)
        if len(self.seq1) != len(self.seq2):
            raise ValueError("Aligned sequences must have equal length")
        
        score = 0
        for i in range(len(self.seq1)):
            c1 = self.seq1[i]
            c2 = self.seq2[i]
            
            if c1 == '-' or c2 == '-':
                # Gap penalty
                score += gap_penalty
            else:
                # Use substitution matrix
                score += sub_matrix[c1, c2]
        
        return score

Custom __str__ method for visualization

    def __str__(self):
        if len(self.seq1) != len(self.seq2):
            return f"{self.seq1}\n{self.seq2}"
        
        # Build match string
        match_str = ""
        for i in range(len(self.seq1)):
            c1 = self.seq1[i]
            c2 = self.seq2[i]
            
            if c1 == c2:
                match_str += "|"      # identity
            elif c1 == '-':
                match_str += "^"      # insertion (gap in seq1)
            elif c2 == '-':
                match_str += "v"      # deletion (gap in seq2)
            else:
                match_str += "X"      # substitution
        
        return f"{self.seq1}\n{match_str}\n{self.seq2}"

Usage

aln = AlnSeq("TCA", "T-A")
print(aln)
# Output:
# TCA
# |v|
# T-A

print(aln.levenshtein_similarity(tt_matrix, gap_penalty=-2))  # 2 + (-2) + 2 = 2

Complete AlnSeq Class

class AlnSeq:
    def __init__(self, seq1, seq2):
        self.seq1 = seq1
        self.seq2 = seq2
    
    def hamming_distance(self):
        if len(self.seq1) != len(self.seq2):
            raise ValueError("Sequences must have equal length")
        distance = 0
        for i in range(len(self.seq1)):
            if self.seq1[i] != self.seq2[i]:
                distance += 1
        return distance
    
    def hamming_similarity(self):
        if len(self.seq1) != len(self.seq2):
            raise ValueError("Sequences must have equal length")
        score = 0
        for i in range(len(self.seq1)):
            if self.seq1[i] == self.seq2[i]:
                score += 1
            else:
                score -= 1
        return score
    
    def hamming_similarity_matrix(self, sub_matrix):
        if len(self.seq1) != len(self.seq2):
            raise ValueError("Sequences must have equal length")
        score = 0
        for i in range(len(self.seq1)):
            score += sub_matrix[self.seq1[i], self.seq2[i]]
        return score
    
    def levenshtein_similarity(self, sub_matrix, gap_penalty=-2):
        if len(self.seq1) != len(self.seq2):
            raise ValueError("Aligned sequences must have equal length")
        score = 0
        for i in range(len(self.seq1)):
            c1 = self.seq1[i]
            c2 = self.seq2[i]
            if c1 == '-' or c2 == '-':
                score += gap_penalty
            else:
                score += sub_matrix[c1, c2]
        return score
    
    def __str__(self):
        if len(self.seq1) != len(self.seq2):
            return f"{self.seq1}\n{self.seq2}"
        match_str = ""
        for i in range(len(self.seq1)):
            c1 = self.seq1[i]
            c2 = self.seq2[i]
            if c1 == c2:
                match_str += "|"
            elif c1 == '-':
                match_str += "^"
            elif c2 == '-':
                match_str += "v"
            else:
                match_str += "X"
        return f"{self.seq1}\n{match_str}\n{self.seq2}"

Complete SubstitutionMatrix Class

class SubstitutionMatrix:
    def __init__(self, matrix_dict):
        self.matrix = matrix_dict
    
    def __getitem__(self, key):
        res1, res2 = key
        return self.matrix[res1][res2]

# Transition/Transversion Matrix
tt_matrix = SubstitutionMatrix({
    'A': {'A': 2, 'T': -1, 'C': -1, 'G': 0},
    'T': {'A': -1, 'T': 2, 'C': 0, 'G': -1},
    'C': {'A': -1, 'T': 0, 'C': 2, 'G': -1},
    'G': {'A': 0, 'T': -1, 'C': -1, 'G': 2}
})

Complete SeqPair Class

class SeqPair:
    def __init__(self, seq1, seq2, sub_matrix, gap_penalty=-2):
        self.seq1 = seq1
        self.seq2 = seq2
        self.sub_matrix = sub_matrix
        self.gap = gap_penalty
    
    def _init_matrix(self, rows, cols):
        return [[0] * cols for _ in range(rows)]
    
    # --- Needleman-Wunsch (Global) ---
    def needleman_wunsch(self):
        rows = len(self.seq2) + 1
        cols = len(self.seq1) + 1
        
        score = self._init_matrix(rows, cols)
        trace = self._init_matrix(rows, cols)
        
        for j in range(1, cols):
            score[0][j] = score[0][j-1] + self.gap
            trace[0][j] = 2
        
        for i in range(1, rows):
            score[i][0] = score[i-1][0] + self.gap
            trace[i][0] = 1
        
        for i in range(1, rows):
            for j in range(1, cols):
                diag = score[i-1][j-1] + self.sub_matrix[self.seq1[j-1], self.seq2[i-1]]
                up = score[i-1][j] + self.gap
                left = score[i][j-1] + self.gap
                
                score[i][j] = max(diag, up, left)
                
                if score[i][j] == diag:
                    trace[i][j] = 0
                elif score[i][j] == up:
                    trace[i][j] = 1
                else:
                    trace[i][j] = 2
        
        self.nw_score = score
        self.nw_trace = trace
        return score[rows-1][cols-1]
    
    def nw_traceback(self):
        aligned1 = ""
        aligned2 = ""
        i = len(self.seq2)
        j = len(self.seq1)
        
        while i > 0 or j > 0:
            if i > 0 and j > 0 and self.nw_trace[i][j] == 0:
                aligned1 = self.seq1[j-1] + aligned1
                aligned2 = self.seq2[i-1] + aligned2
                i -= 1
                j -= 1
            elif i > 0 and self.nw_trace[i][j] == 1:
                aligned1 = "-" + aligned1
                aligned2 = self.seq2[i-1] + aligned2
                i -= 1
            else:
                aligned1 = self.seq1[j-1] + aligned1
                aligned2 = "-" + aligned2
                j -= 1
        
        return aligned1, aligned2
    
    # --- Smith-Waterman (Local) ---
    def smith_waterman(self):
        rows = len(self.seq2) + 1
        cols = len(self.seq1) + 1
        
        score = self._init_matrix(rows, cols)
        trace = self._init_matrix(rows, cols)
        
        max_score = 0
        max_pos = (0, 0)
        
        for i in range(1, rows):
            for j in range(1, cols):
                diag = score[i-1][j-1] + self.sub_matrix[self.seq1[j-1], self.seq2[i-1]]
                up = score[i-1][j] + self.gap
                left = score[i][j-1] + self.gap
                
                score[i][j] = max(diag, up, left, 0)
                
                if score[i][j] == 0:
                    trace[i][j] = -1
                elif score[i][j] == diag:
                    trace[i][j] = 0
                elif score[i][j] == up:
                    trace[i][j] = 1
                else:
                    trace[i][j] = 2
                
                if score[i][j] > max_score:
                    max_score = score[i][j]
                    max_pos = (i, j)
        
        self.sw_score = score
        self.sw_trace = trace
        self.sw_max_pos = max_pos
        return max_score
    
    def sw_traceback(self):
        aligned1 = ""
        aligned2 = ""
        i, j = self.sw_max_pos
        
        while i > 0 and j > 0 and self.sw_trace[i][j] != -1:
            if self.sw_trace[i][j] == 0:
                aligned1 = self.seq1[j-1] + aligned1
                aligned2 = self.seq2[i-1] + aligned2
                i -= 1
                j -= 1
            elif self.sw_trace[i][j] == 1:
                aligned1 = "-" + aligned1
                aligned2 = self.seq2[i-1] + aligned2
                i -= 1
            else:
                aligned1 = self.seq1[j-1] + aligned1
                aligned2 = "-" + aligned2
                j -= 1
        
        return aligned1, aligned2

Resources

The exercises and examples in this material are inspired by several open educational resources released under Creative Commons licenses. Instead of referencing each one separately throughout the notes, here is a list of the main books and sources I used:

  • [A Practical Introduction to Python Programming, © 2015 Brian Heinold] (CC BY-NC-SA 3.0)

All credit goes to the original authors for their openly licensed educational content.

Note

This course courageously comes without dedicated notes, as most concepts overlap with material from other courses, so cross-referencing should do the job.

For the upcoming project, if anyone needs help with logistics, tool setup, or general organization (fully within the code of conduct), I’m happy to help.

Best of luck.

Biomedical Databases (Protected)

Hey! Welcome to my notes for the Biomedical Databases course where biology meets data engineering.

Course Overview

Important Heads-Up

The exam may be split into two sessions based on the modules. The first module is all about biological databases, so pay extra attention when preparing.

Supplementary Learning Resource

If you want to dive deeper into database fundamentals (and I mean really deep), check out:

CMU 15-445/645: Intro to Database Systems (Fall 2024)

About the CMU Course

This is one of the best database courses available online, taught by Andy Pavlo at Carnegie Mellon University. It's more advanced and assumes some C++ knowledge, but the explanations are incredibly clear.

The CMU course covers database internals, query optimization, storage systems, and transaction management at a much deeper level. It's perfect if you're curious about how databases actually work under the hood.

Everything

What is a database? A database is a large structured set of persistent data, usually in computer-readable form.

A DBMS is a software package that enables users:

  • to access the data
  • to manipulate (create, edit, link, update) files as needed
  • to preserve the integrity of the data
  • to deal with security issues (who should have access)

PubMed/MeSH

It comprises more than 39 million citations for biomedical and related journal articles from MEDLINE, life science journals, and online books.

MeSH database (Medical Subject Headings) – controlled vocabulary thesaurus

Query syntax is straightforward; just be careful with OR, AND, and parentheses. Read each query carefully to work out what it actually matches and which MeSH term applies.

PDB

Definition: What is PDB? PDB (Protein Data Bank) is the main global database that stores 3D structures of proteins, DNA, RNA, and their complexes.

How is experimental structure data obtained? (3 methods)

  1. X-ray Crystallography (88%): uses crystals + X-ray diffraction to map atomic positions.
  2. NMR Spectroscopy (10%): uses magnetic fields to determine structures in solution.
  3. Cryo-Electron Microscopy (Cryo-EM) (1%): images flash-frozen samples with an electron microscope.

What is Resolution (Å)? Resolution (in Ångström) measures the level of detail; smaller value = sharper, more accurate structure.

SIFTS (Structure Integration with Function, Taxonomy and Sequence) provides residue-level mapping between:

  1. PDB entries ↔ UniProt sequences
  2. Connections to: GO, InterPro, Pfam, CATH, SCOP, PubMed, Ensembl

This is how you can search PDB by Pfam domain or UniProt ID.

Method Comparison Summary

Feature              X-ray              Cryo-EM          NMR
Sample               Crystal required   Frozen in ice    Solution
Size limit           None               >50 kDa          <50-70 kDa
Resolution           Can be <1 Å        Rarely <2.2 Å    N/A
Dynamics             No                 Limited          Yes
Multiple states      Difficult          Yes              Yes
Membrane proteins    Difficult          Good             Limited

AlphaFold

What is AlphaFold? A deep learning system that predicts protein structure from the amino acid sequence.

At CASP14 (2020), AlphaFold2 scored ~92 GDT (Global Distance Test).

AlphaFold essentially solved the protein folding problem for single domains.

pLDDT (predicted Local Distance Difference Test): Stored in the B-factor column of AlphaFold PDB files.

What pLDDT measures: Confidence in local structure (not global fold). Use it to:

  1. Identify structured domains vs disordered regions
  2. Decide which parts to trust

PAE (Predicted Aligned Error):

  • Dark blocks on diagonal: confident domains
  • Off-diagonal dark blocks: confident domain-domain interactions
  • Light regions: uncertain relative positions (domains may be connected but orientation unknown)

Use PAE for: Determining if domain arrangements are reliable.

PDB file format: Legacy and mmCIF Format (current standard)

The B-factor Column

The B-factor means different things depending on the method:

Method       B-factor contains     Meaning
X-ray        Temperature factor    Atomic mobility/disorder
NMR          RMSF                  Fluctuation across models
AlphaFold    pLDDT                 Prediction confidence

When validating a structure, you measure:

  1. Resolution (for X-ray/Cryo-EM)
  2. R-factors (for X-ray)
  3. Geometry (for all)

R-factor (X-ray only): Measures how well the model fits the experimental data. <0.20 -> Good fit

Types of R-factors:

  1. R-work: Calculated on data used for refinement
  2. R-free: Calculated on test set NOT used for refinement (more honest)

R-free is more reliable. If R-work is much lower than R-free, the model may be overfitted.

Data Validation:

  1. Resolution
  2. Geometry
  3. R-Factor

Key Search Fields

Field                          Use for
Experimental Method            "X-RAY DIFFRACTION", "ELECTRON MICROSCOPY", "SOLUTION NMR"
Data Collection Resolution     X-ray resolution
Reconstruction Resolution      Cryo-EM resolution
Source Organism                Species
UniProt Accession              Link to UniProt
Pfam Identifier                Domain family
CATH Identifier                Structure classification
Reference Sequence Coverage    How much of UniProt sequence is in structure

Comparing Experimental vs AlphaFold Structures

When AlphaFold structures are available:

Check                         Experimental                   AlphaFold
Overall reliability           Resolution, R-factor           pLDDT, PAE
Local confidence              B-factor (flexibility)         pLDDT (prediction confidence)
Disordered regions            Often missing                  Low pLDDT (<50)
Ligand binding sites          Can have ligands               No ligands
Protein-protein interfaces    Shown in complex structures    Not reliable unless AlphaFold-Multimer

Key insight: Low-confidence AlphaFold regions often correspond to regions missing in experimental structures — both are telling you the same thing (disorder/flexibility).

For the Oral Exam

Be prepared to explain:

  1. Why crystallography needs crystals — signal amplification from ordered molecular packing

  2. The phase problem — you measure amplitudes but lose phases; must determine indirectly

  3. What resolution means — ability to distinguish fine details; limited by crystal order

  4. Why Cryo-EM grew so fast — no crystals needed, good for large complexes, computational advances

  5. NMR gives ensembles, not single structures — restraints satisfied by multiple conformations

  6. What pLDDT means — local prediction confidence, stored in B-factor column

  7. Difference between pLDDT and PAE — pLDDT is local confidence, PAE is relative domain positioning

  8. How to assess structure quality — resolution, R-factors, validation metrics

  9. B-factor means different things — mobility (X-ray), fluctuation (NMR), confidence (AlphaFold)

  10. How to construct complex PDB queries — combining method, resolution, organism, domain annotations

UniProt

What it gives you:

  1. Protein sequences and functions
  2. Domains, families, PTMs
  3. Disease associations and variants
  4. Subcellular localization
  5. Cross-references to 180+ external databases
  6. Proteomes for complete organisms
  7. BLAST, Align, ID mapping tools
                    UniProt
                       │
       ┌───────────────┼───────────────┐
       │               │               │
   UniProtKB        UniRef         UniParc
   (Knowledge)    (Clusters)      (Archive)
       │
   ┌───┴───┐
   │       │
Swiss-Prot TrEMBL
(Reviewed) (Unreviewed)

UniProt classifies how confident we are that a protein actually exists. Query syntax: existence:1 (for protein-level evidence)

It also has ID Mapping: Convert between ID systems

TL;DR

  • UniProt = protein database = Swiss-Prot (reviewed, high quality) + TrEMBL (unreviewed, comprehensive)
  • Always add reviewed:true when you need reliable annotations
  • Query syntax: field:value with AND, OR, NOT
  • Use parentheses to group OR conditions properly
  • Common fields: organism_id, ec, reviewed, existence, database, proteome, go
  • Wildcards: Use * for EC numbers (e.g., ec:3.4.21.*)
  • Protein existence: Level 1 = experimental evidence, Level 5 = uncertain

NCBI

What is NCBI? National Center for Biotechnology Information — created in 1988 as part of the National Library of Medicine (NLM) at NIH, Bethesda, Maryland.

What it gives you:

  1. GenBank (primary nucleotide sequences)
  2. RefSeq (curated reference sequences)
  3. Gene database (gene-centric information)
  4. PubMed (literature)
  5. dbSNP, ClinVar, OMIM (variants & clinical)
  6. BLAST (sequence alignment)
  7. And ~40 more databases, all cross-linked

TL;DR

  • NCBI = US hub for biological databases (GenBank, RefSeq, Gene, PubMed, etc.)
  • GenBank = primary archive (raw submissions) vs RefSeq = curated reference (cleaned up)
  • RefSeq prefixes: NM/NP = curated, XM/XP = predicted — prefer N* for reliable analysis
  • Boolean operators MUST be UPPERCASE: AND, OR, NOT
  • Use quotes around multi-word terms: "homo sapiens"[Organism]
  • Gene database = best starting point for gene-centric searches
  • Properties = what it IS, Filters = what it's LINKED to

Ensembl

Ensembl is a genome browser and database jointly run by the EBI (European Bioinformatics Institute) and the Wellcome Trust Sanger Institute since 1999. Think of it as Google Maps, but for genomes.

What it gives you:

  1. Gene sets (splice variants, proteins, ncRNAs)
  2. Comparative genomics (alignments, protein trees, orthologues)
  3. Variation data (SNPs, InDels, CNVs)
  4. BioMart for bulk data export
  5. REST API for programmatic access
  6. Everything is open source

BioMart: Bulk Data Queries

Workflow Example: ID Conversion Goal: Convert RefSeq protein IDs to Ensembl Gene IDs

TL;DR

  • Ensembl = genome browser + database for genes, transcripts, variants, orthologues
  • IDs: ENSG (gene), ENST (transcript), ENSP (protein) — learn to recognize them
  • MANE Select = highest quality transcript annotation (use these when possible)
  • BioMart = bulk query tool: Dataset → Filters → Attributes → Export

Avoid these mistakes:

  1. Don't paste RefSeq/UniProt IDs in "Gene stable ID" field — use EXTERNAL filters
  2. Use the text input field, not just checkboxes
  3. Orthologue = cross-species, Paralogue = same species
  4. Start with the species of your INPUT IDs as your dataset
  5. Always include your filter column in output attributes

Boolean Algebra in a Nutshell

There are only two Boolean values:

  • True (1, yes, on)
  • False (0, no, off)

Basic Operators

AND Operator (∧)

The AND operator returns True only when both inputs are True.

Truth Table:

A        B        A AND B
False    False    False
False    True     False
True     False    False
True     True     True

OR Operator (∨)

The OR operator returns True when at least one input is True.

Truth Table:

A        B        A OR B
False    False    False
False    True     True
True     False    True
True     True     True

NOT Operator (¬)

The NOT operator flips the value - True becomes False, False becomes True.

Truth Table:

A        NOT A
False    True
True     False

Combining Operators

You can combine operators to create complex logical expressions.

Operator Precedence (Order of Operations)

⚠️
Order Matters

1. NOT (highest priority)
2. AND
3. OR (lowest priority)

Example: A OR B AND C

  • First do: B AND C
  • Then do: A OR (result)

Use parentheses to be clear: (A OR B) AND C
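Python's boolean operators follow the same precedence rules, so you can check this directly:

```python
A, B, C = True, False, False

# AND binds tighter than OR: A or B and C parses as A or (B and C)
print(A or B and C)    # True  -> True or (False and False)
print((A or B) and C)  # False -> True and False
```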

Venn Diagrams

Write an expression to represent the outlined part of the Venn diagram shown.

Set Operations Venn Diagrams

ℹ️
Image Source

Image from Book Title by David Lippman, Pierce College. Licensed under CC BY-SA. View original

Problem 1: Morning Beverages

A survey asks 200 people "What beverage do you drink in the morning?", and offers these choices:

  • Tea only
  • Coffee only
  • Both coffee and tea

Suppose 20 report tea only, 80 report coffee only, 40 report both.

Questions:
a) How many people drink tea in the morning?
b) How many people drink neither tea nor coffee?
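A quick arithmetic check in Python, using only the counts given in the problem (the variable names are mine):

```python
total = 200
tea_only, coffee_only, both = 20, 80, 40

# a) Tea drinkers = "tea only" plus "both"
tea_drinkers = tea_only + both
# b) Neither = everyone outside the two circles
neither = total - (tea_only + coffee_only + both)

print(tea_drinkers, neither)  # 60 60
```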

Problem 2: Course Enrollment

Fifty students were surveyed and asked if they were taking a social science (SS), humanities (HM) or a natural science (NS) course the next quarter.

  • 21 were taking a SS course
  • 26 were taking a HM course
  • 19 were taking a NS course
  • 9 were taking SS and HM
  • 7 were taking SS and NS
  • 10 were taking HM and NS
  • 3 were taking all three
  • 7 were taking none

Question: How many students are taking only a SS course?
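One way to work this is inclusion-exclusion; a sketch in Python with the survey counts (variable names are mine):

```python
SS, HM, NS = 21, 26, 19           # course totals
SS_HM, SS_NS, HM_NS = 9, 7, 10    # pairwise overlaps (these include the triple)
triple, none = 3, 7

# Inclusion-exclusion sanity check: union + "none" should equal 50 students
union = SS + HM + NS - SS_HM - SS_NS - HM_NS + triple
assert union + none == 50

# Only SS: subtract both pairwise overlaps, add back the triple
# (it was subtracted twice)
ss_only = SS - SS_HM - SS_NS + triple
print(ss_only)  # 8
```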

ℹ️
Source Attribution

Problems adapted from David Lippman, Pierce College. Licensed under CC BY-SA.

PubMed/MeSH

Learn a systematic approach to finding relevant articles on a given topic in PubMed, combined with MeSH.

PubMed is a free search engine maintained by the U.S. National Library of Medicine (NLM) that gives you access to more than 39 million citations from biomedical and life-science literature.


PubMed
├── Search
│   ├── Basic Search
│   └── Advanced Search
│       └── MeSH Search
│
├── Filters
│   ├── Year
│   ├── Article Type
│   └── Free Full Text
│
├── Databases
│   ├── MEDLINE
│   ├── PubMed Central
│   └── Bookshelf
│
└── Article Page
    ├── Citation
    ├── Abstract
    ├── MeSH Terms
    └── Links to Full Text


What is the MeSH Database?

MeSH terms are like tags attached to research papers. You can access the MeSH database at https://www.ncbi.nlm.nih.gov/mesh/.

MeSH DB (Medical Subject Headings Database) is a controlled vocabulary system used to tag, organize, and standardize biomedical topics for precise searching in PubMed.


Be careful to note whether a question asks for the Major Topic, a plain MeSH term, or a subheading — for example, "AD diagnosis" rather than just "AD".

See this page for all search field tags and terms: https://pubmed.ncbi.nlm.nih.gov/help/#using-search-field-tags

Protein Databases

Protein databases store information about protein structures, sequences, and functions. They come from experimental methods or computational predictions.

PDB

📖
Definition

What is PDB? PDB (Protein Data Bank) is the main global database that stores 3D structures of proteins, DNA, RNA, and their complexes.

How experimental structure data is obtained? (3 methods)

  1. X-ray Crystallography (88%): uses crystals + X-ray diffraction to map atomic positions.
  2. NMR Spectroscopy (10%): uses magnetic fields to determine structures in solution.
  3. Cryo-Electron Microscopy (Cryo-EM) (1%): images flash-frozen samples with an electron beam.

What is a Ligand?: A ligand is any small molecule, ion, or cofactor that binds to the protein in the structure, often to perform a specific biological function. Example: iron in hemoglobin

What is Resolution (Å)? Resolution (in Ångström) measures the level of detail; smaller value = sharper, more accurate structure.

What is the PDB? (Again)

The Protein Data Bank is the central repository for 3D structures of biological macromolecules (proteins, DNA, RNA). If you want to know what a protein looks like in 3D, you go to PDB.

Current stats:

  • ~227,000 experimental structures
  • ~1,000,000+ computed structure models (AlphaFold)

The wwPDB Consortium

wwPDB (worldwide Protein Data Bank) was established in 2003. Three data centers maintain it:

| Center   | Location          | Website        |
|----------|-------------------|----------------|
| RCSB PDB | USA               | rcsb.org       |
| PDBe     | Europe (EMBL-EBI) | ebi.ac.uk/pdbe |
| PDBj     | Japan             | pdbj.org       |

They all share the same data, but each has different tools and interfaces.

What wwPDB Does

  1. Structure deposition — researchers submit their structures through OneDep (deposit.wwpdb.org)
  2. Structure validation — quality checking before release
  3. Structure archive — maintaining the database

| Archive | What it stores                         |
|---------|----------------------------------------|
| PDB     | Atomic coordinates                     |
| EMDB    | Electron microscopy density maps       |
| BMRB    | NMR data (chemical shifts, restraints) |

SIFTS

SIFTS (Structure Integration with Function, Taxonomy and Sequence) provides residue-level mapping between:

  • PDB entries ↔ UniProt sequences
  • Connections to: GO, InterPro, Pfam, CATH, SCOP, PubMed, Ensembl

This is how you can search PDB by Pfam domain or UniProt ID.


Part 1: Experimental Methods

Three main methods to determine protein structures:

| Method                | % of PDB (2017) | Size limit | Resolution    |
|-----------------------|-----------------|------------|---------------|
| X-ray crystallography | 88%             | None       | Can be <1 Å   |
| NMR spectroscopy      | 10%             | <50-70 kDa | N/A           |
| Cryo-EM               | 1% (now ~10%)   | >50 kDa    | Rarely <2.2 Å |

Important: Cryo-EM has grown exponentially since 2017 due to the "Resolution Revolution."


X-ray Crystallography

The Process

Protein → Crystallize → X-ray beam → Diffraction pattern → 
Electron density map → Atomic model
  1. Crystallization — grow protein crystals (ordered molecular packing)
  2. X-ray diffraction — shoot X-rays at the crystal
  3. Diffraction pattern — X-rays scatter, creating spots on detector
  4. Phase determination — the "phase problem" (you measure intensities but need phases)
  5. Electron density map — Fourier transform gives you electron density
  6. Model fitting — build atomic model into the density

Why X-rays?

Wavelength matters:

  • Visible light: λ ≈ 10⁻⁵ cm — too big to resolve atoms
  • X-rays: λ ≈ 10⁻⁸ cm — comparable to atomic distances (~1-2 Å)

Problem: No lens can focus X-rays. Computers must calculate the inverse Fourier transform.

Why Crystals?

A single molecule gives too weak a signal. Crystals contain millions of molecules in identical orientations, amplifying the diffraction signal.

The Phase Problem

When X-rays scatter, you measure:

  • Amplitudes |F(hkl)| — from diffraction spot intensities ✓
  • Phases α(hkl) — LOST in the measurement ✗

Phases must be determined indirectly (molecular replacement, heavy atom methods, etc.). This is why X-ray crystallography is hard.
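A toy illustration of why losing phases is fatal, using a 1-D discrete Fourier transform in plain Python. The "signal" stands in for a 1-D electron density; this is an analogy, not crystallographic code:

```python
import cmath

def dft(x):
    # Naive O(N^2) discrete Fourier transform
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    # Inverse DFT; the input signal is real, so keep the real part
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

signal = [0, 1, 0, 0, 3, 0, 0, 0]   # stand-in for a 1-D "density"
spectrum = dft(signal)

# Diffraction measures only intensities, i.e. amplitudes |F|;
# setting every phase to zero mimics that information loss.
amp_only = [abs(X) for X in spectrum]

with_phases = idft(spectrum)      # amplitudes + phases: exact recovery
without_phases = idft(amp_only)   # amplitudes only: wrong "density"

err_full = max(abs(a - b) for a, b in zip(signal, with_phases))
err_amp = max(abs(a - b) for a, b in zip(signal, without_phases))
print(err_full < 1e-9, err_amp > 1.0)  # True True
```

With the phases intact, the inverse transform recovers the signal exactly; with amplitudes alone, it does not.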

Resolution

Definition: The smallest detail you can see in the structure.

What limits resolution: If molecules in the crystal aren't perfectly aligned (due to flexibility or disorder), fine details are lost.

| Resolution | Quality     | What you can see                                 |
|------------|-------------|--------------------------------------------------|
| 0.5-1.5 Å  | Exceptional | Individual atoms, hydrogens sometimes visible    |
| 1.5-2.5 Å  | High        | Most features clear, good for detailed analysis  |
| 2.5-3.5 Å  | Medium      | Overall fold clear, some ambiguity in sidechains |
| >3.5 Å     | Low         | Only general shape, significant uncertainty      |

Lower number = better resolution. A 1.5 Å structure is better than a 3.0 Å structure.


Cryo-Electron Microscopy (Cryo-EM)

The Resolution Revolution

Nobel Prize in Chemistry 2017. Progress on β-Galactosidase:

| Year | Resolution  |
|------|-------------|
| 2005 | 25 Å (blob) |
| 2011 | 11 Å        |
| 2013 | 6 Å         |
| 2014 | 3.8 Å       |
| 2015 | 2.2 Å       |

The Process

Protein → Flash-freeze in vitreous ice → Image thousands of particles → 
Align and average → 3D reconstruction → Build model
  1. Sample preparation — purify protein, flash-freeze in thin ice layer
  2. Imaging — electron beam through frozen sample
  3. Data collection — thousands of images of individual particles
  4. Image processing — classify, align, and average particles
  5. 3D reconstruction — combine to get density map
  6. Model building — fit atomic model into density

Advantages

  • No crystals needed — works on samples that won't crystallize
  • Large complexes — good for ribosomes, viruses, membrane proteins
  • Multiple conformations — can separate different states

Limitations

  • Size limit: Generally requires proteins >50 kDa (small proteins are hard to image)
  • Resolution: Very rarely reaches below ~2.2 Å

NMR Spectroscopy

How It Works

NMR doesn't give you a single structure. It gives you restraints (constraints):

  1. Dihedral angles — backbone and sidechain torsion angles
  2. Inter-proton distances — from NOE (Nuclear Overhauser Effect)
  3. Other restraints — hydrogen bonds, orientations

The Output

NMR produces a bundle of structures (ensemble), all compatible with the restraints.

                Model 1
               /
Restraints → Model 2  → All satisfy the experimental data
               \
                Model 3

A reference structure can be calculated by averaging.

What Does Variation Mean?

When NMR models differ from each other, it could mean:

  • Real flexibility — the protein actually moves
  • Uncertainty — not enough data to pin down the position

This is ambiguous and requires careful interpretation.

Advantages

  • Dynamics — can observe protein folding, conformational changes
  • Solution state — protein in solution, not crystal

Limitations

  • Size limit: ≤50-70 kDa (larger proteins have overlapping signals)

Method Comparison Summary

| Feature           | X-ray            | Cryo-EM        | NMR        |
|-------------------|------------------|----------------|------------|
| Sample            | Crystal required | Frozen in ice  | Solution   |
| Size limit        | None             | >50 kDa        | <50-70 kDa |
| Resolution        | Can be <1 Å      | Rarely <2.2 Å  | N/A        |
| Dynamics          | No               | Limited        | Yes        |
| Multiple states   | Difficult        | Yes            | Yes        |
| Membrane proteins | Difficult        | Good           | Limited    |

Part 2: AlphaFold and Computed Structure Models

Timeline

| Method    | First structure | Nobel Prize |
|-----------|-----------------|-------------|
| X-ray     | 1958            | 1962        |
| NMR       | 1988            | 2002        |
| Cryo-EM   | 2014            | 2017        |
| AlphaFold | 2020            | 2024        |

What is AlphaFold?

A deep learning system that predicts protein structure from sequence.

Amino acid sequence → AlphaFold neural network → 3D structure prediction

How It Works

Input features:

  1. MSA (Multiple Sequence Alignment) — find related sequences in:

    • UniRef90 (using jackhmmer)
    • MGnify (metagenomic sequences)
    • BFD (2.5 billion proteins)
  2. Template structures — search PDB70 for similar known structures

Key concept: Co-evolution

If two positions in a protein always mutate together across evolution, they're probably in contact in 3D.

Example:

Position 3: R, R, R, K, K, K    (all positive)
Position 9: D, D, D, E, E, E    (all negative)

These positions probably form a salt bridge.
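A toy version of this covariation check in plain Python, using hypothetical 9-residue sequences built to match the example (the alignment itself is made up):

```python
# Hypothetical 6-sequence alignment where position 3 (R/K) and
# position 9 (D/E) always change together.
alignment = [
    "MARLIVQWD",
    "MARLIVQWD",
    "MARLIVQWD",
    "MAKLIVQWE",
    "MAKLIVQWE",
    "MAKLIVQWE",
]

col3 = [seq[2] for seq in alignment]   # position 3: R, R, R, K, K, K
col9 = [seq[8] for seq in alignment]   # position 9: D, D, D, E, E, E

# Perfect covariation: each residue at position 3 always occurs with
# the same residue at position 9.
pairing = {}
covary = all(pairing.setdefault(a, b) == b for a, b in zip(col3, col9))
print(covary)  # True -> the columns covary; candidate 3D contact
```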

AlphaFold Performance

At CASP14 (2020), AlphaFold2 scored ~92 GDT (Global Distance Test).

  • GDT > 90 ≈ experimental structure accuracy
  • Previous best methods: 40-60 GDT

AlphaFold essentially solved the protein folding problem for single domains.

AlphaFold Database

  • Created: July 2021
  • Current size: ~214 million structures
  • Coverage: 48 complete proteomes (including human)
  • Access: UniProt, RCSB PDB, Ensembl

AlphaFold Confidence Metrics

These are critical for interpreting AlphaFold predictions.

pLDDT (predicted Local Distance Difference Test)

Stored in the B-factor column of AlphaFold PDB files.

| pLDDT | Confidence | Interpretation                                 |
|-------|------------|------------------------------------------------|
| >90   | Very high  | Side chains reliable, can analyze active sites |
| 70-90 | Confident  | Backbone reliable                              |
| 50-70 | Low        | Uncertain                                      |
| <50   | Very low   | Likely disordered, NOT a structure prediction  |

What pLDDT measures: Confidence in local structure (not global fold).

Uses:

  • Identify structured domains vs disordered regions
  • Decide which parts to trust
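A small helper that maps a pLDDT value to the confidence bands above (thresholds as in the table; the function name and exact band-edge handling are my own):

```python
def plddt_confidence(plddt):
    # Thresholds follow the pLDDT table: >90, 70-90, 50-70, <50
    if plddt > 90:
        return "very high"
    if plddt >= 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low (likely disordered)"

print(plddt_confidence(95))  # very high
print(plddt_confidence(42))  # very low (likely disordered)
```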

PAE (Predicted Aligned Error)

A 2D matrix showing confidence in relative positions between residues.

        Residue j →
      ┌─────────────────┐
  R   │ ■■■     ░░░     │  ■ = low error (confident)
  e   │ ■■■     ░░░     │  ░ = high error (uncertain)
  s   │                 │
  i   │     ■■■■■       │
  d   │     ■■■■■       │
  u   │                 │
  e   │         ░░░░░░  │
  i ↓ │         ░░░░░░  │
      └─────────────────┘

  • Dark blocks on diagonal: confident domains
  • Off-diagonal dark blocks: confident domain-domain interactions
  • Light regions: uncertain relative positions (domains may be connected but orientation unknown)

Use PAE for: Determining if domain arrangements are reliable.


Part 3: PDB File Formats

Legacy PDB Format

ATOM      1  N   LYS A   1     -21.816  -8.515  19.632  1.00 41.97
ATOM      2  CA  LYS A   1     -20.532  -9.114  20.100  1.00 41.18

| Column                  | Meaning                 |
|-------------------------|-------------------------|
| ATOM                    | Record type             |
| 1, 2                    | Atom serial number      |
| N, CA                   | Atom name               |
| LYS                     | Residue name            |
| A                       | Chain ID                |
| 1                       | Residue number          |
| -21.816, -8.515, 19.632 | X, Y, Z coordinates (Å) |
| 1.00                    | Occupancy               |
| 41.97                   | B-factor                |
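Because the legacy format is fixed-width (columns, not whitespace, delimit the fields), you can slice an ATOM record directly. A minimal sketch in plain Python; for real work use a parsing library such as Biopython or gemmi:

```python
# Slice positions follow the standard legacy PDB column layout
# (0-based Python slices of the 1-based PDB columns).
def parse_atom_line(line):
    return {
        "serial":    int(line[6:11]),
        "name":      line[12:16].strip(),
        "resname":   line[17:20].strip(),
        "chain":     line[21],
        "resseq":    int(line[22:26]),
        "x":         float(line[30:38]),
        "y":         float(line[38:46]),
        "z":         float(line[46:54]),
        "occupancy": float(line[54:60]),
        "bfactor":   float(line[60:66]),
    }

atom = parse_atom_line(
    "ATOM      1  N   LYS A   1     -21.816  -8.515  19.632  1.00 41.97"
)
print(atom["name"], atom["bfactor"])  # N 41.97
```

For AlphaFold files the same `bfactor` field holds the pLDDT score.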

mmCIF Format

Current standard. More flexible than legacy PDB format:

  • Can handle >99,999 atoms
  • Machine-readable
  • Extensible

The B-factor Column

The B-factor means different things depending on the method:

| Method    | B-factor contains  | Meaning                   |
|-----------|--------------------|---------------------------|
| X-ray     | Temperature factor | Atomic mobility/disorder  |
| NMR       | RMSF               | Fluctuation across models |
| AlphaFold | pLDDT              | Prediction confidence     |

For X-ray: $$B = 8\pi^2 U^2$$

Where U² is mean square displacement.

| B-factor | Displacement | Interpretation |
|----------|--------------|----------------|
| 15 Ų     | ~0.44 Å      | Rigid          |
| 60 Ų     | ~0.87 Å      | Flexible       |
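The displacement values follow directly from inverting the formula; a quick check in Python:

```python
import math

def rms_displacement(b_factor):
    # Invert B = 8 * pi^2 * U^2 to get U (in Angstrom)
    return math.sqrt(b_factor / (8 * math.pi ** 2))

print(round(rms_displacement(15), 2))  # 0.44 -> rigid
print(round(rms_displacement(60), 2))  # 0.87 -> flexible
```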

Part 4: Data Validation

Why Validation Matters

Not all PDB structures are equal quality. You need to check:

  • Resolution (for X-ray/Cryo-EM)
  • R-factors (for X-ray)
  • Geometry (for all)

Resolution

Most important quality indicator for X-ray and Cryo-EM.

Lower = better. A 1.5 Å structure shows more detail than a 3.0 Å structure.

R-factor (X-ray only)

Measures how well the model fits the experimental data.

$$R = \frac{\sum |F_{obs} - F_{calc}|}{\sum |F_{obs}|}$$

| R-factor  | Interpretation            |
|-----------|---------------------------|
| <0.20     | Good fit                  |
| 0.20-0.25 | Acceptable                |
| >0.30     | Significant errors likely |

Types of R-factors:

  • R-work: Calculated on data used for refinement
  • R-free: Calculated on test set NOT used for refinement (more honest)

R-free is more reliable. If R-work is much lower than R-free, the model may be overfitted.
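The R-factor formula is simple to compute; a sketch with made-up structure-factor amplitudes (the numbers are hypothetical, just to show the calculation):

```python
def r_factor(f_obs, f_calc):
    # R = sum|Fobs - Fcalc| / sum|Fobs|, summed over all reflections
    num = sum(abs(o - c) for o, c in zip(f_obs, f_calc))
    den = sum(abs(o) for o in f_obs)
    return num / den

# Hypothetical structure-factor amplitudes for four reflections
f_obs = [100.0, 50.0, 80.0, 120.0]
f_calc = [95.0, 55.0, 78.0, 118.0]
r = r_factor(f_obs, f_calc)
print(round(r, 3))  # 0.04 -> would count as a good fit
```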

Geometry Validation

| Metric                | What it checks                          |
|-----------------------|-----------------------------------------|
| Clashscore            | Steric clashes between atoms            |
| Ramachandran outliers | Unusual backbone angles (φ/ψ)           |
| Sidechain outliers    | Unusual rotamer conformations           |
| RSRZ outliers         | Residues that don't fit electron density |

RSRZ: Real Space R-value Z-score

  • Measures fit between residue and electron density
  • RSRZ > 2 = outlier

wwPDB Validation Report

Every PDB entry has a validation report with:

  • Overall quality metrics
  • Chain-by-chain analysis
  • Residue-level indicators
  • Color coding (green = good, red = bad)

Always check the validation report before trusting a structure!


Part 5: Advanced Search in RCSB PDB

Query Builder Categories

  1. Attribute Search

    • Structure attributes (method, resolution, date)
    • Chemical attributes (ligands)
    • Full text
  2. Sequence-based Search

    • Sequence similarity (BLAST)
    • Sequence motif
  3. Structure-based Search

    • 3D shape similarity
    • Structure motif
  4. Chemical Search

    • Ligand similarity

Key Search Fields

| Field                       | Use for                                                    |
|-----------------------------|------------------------------------------------------------|
| Experimental Method         | "X-RAY DIFFRACTION", "ELECTRON MICROSCOPY", "SOLUTION NMR" |
| Data Collection Resolution  | X-ray resolution                                           |
| Reconstruction Resolution   | Cryo-EM resolution                                         |
| Source Organism             | Species                                                    |
| UniProt Accession           | Link to UniProt                                            |
| Pfam Identifier             | Domain family                                              |
| CATH Identifier             | Structure classification                                   |
| Reference Sequence Coverage | How much of UniProt sequence is in structure               |

Boolean Logic

AND — both conditions must be true
OR  — either condition can be true

Important: When combining different resolution types, use OR correctly.


Practice Exercises

Find X-ray structures at resolution ≤2.5 Å, from human and mouse, containing Pfam domain PF00004.

Query:

Experimental Method = "X-RAY DIFFRACTION"
AND Identifier = "PF00004" AND Annotation Type = "Pfam"
AND (Source Organism = "Homo sapiens" OR Source Organism = "Mus musculus")
AND Data Collection Resolution <= 2.5

Answer: 11-50 (15 entries)


Exercise 2: UniProt ID List with Filters

Find X-ray structures for a list of UniProt IDs, with resolution ≤2.2 Å and sequence coverage ≥0.90.

Query:

Accession Code(s) IS ANY OF [list of UniProt IDs]
AND Database Name = "UniProt"
AND Experimental Method = "X-RAY DIFFRACTION"
AND Data Collection Resolution <= 2.2
AND Reference Sequence Coverage >= 0.9

Answer: 501-1000 (811 entries)

Note: "Reference Sequence Coverage" tells you what fraction of the UniProt sequence is present in the PDB structure. Coverage of 0.90 means at least 90% of the protein is in the structure.


Exercise 3: Combining X-ray and Cryo-EM

Find all X-ray structures with resolution ≤2.2 Å AND all Cryo-EM structures with reconstruction resolution ≤2.2 Å.

The tricky part: X-ray uses "Data Collection Resolution" but Cryo-EM uses "Reconstruction Resolution". You need to combine them correctly.

Query:

(Experimental Method = "X-RAY DIFFRACTION" OR Experimental Method = "ELECTRON MICROSCOPY")
AND (Data Collection Resolution <= 2.2 OR Reconstruction Resolution <= 2.2)

Answer: 100001-1000000 (128,107 entries: 127,405 X-ray + 702 EM)

Why this works: Each entry will match either:

  • X-ray AND Data Collection Resolution ≤2.2, OR
  • EM AND Reconstruction Resolution ≤2.2

Exercise 4: Cryo-EM Quality Filter

Among Cryo-EM structures with resolution ≤2.2 Å, how many have Ramachandran outliers <1%?

Query:

Experimental Method = "ELECTRON MICROSCOPY"
AND Reconstruction Resolution <= 2.2
AND Molprobity Percentage Ramachandran Outliers <= 1

Answer: 101-1000 (687 out of 702 total)

This tells you that most high-resolution Cryo-EM structures have good geometry.


Query Building Tips

1. Use the Right Resolution Field

| Method  | Resolution Field           |
|---------|----------------------------|
| X-ray   | Data Collection Resolution |
| Cryo-EM | Reconstruction Resolution  |
| NMR     | N/A (no resolution)        |

2. Experimental Method Exact Names

Use exactly:

  • "X-RAY DIFFRACTION" (not "X-ray" or "crystallography")
  • "ELECTRON MICROSCOPY" (not "Cryo-EM" or "EM")
  • "SOLUTION NMR" (not just "NMR")

3. Organism Names

Use full taxonomic name:

  • "Homo sapiens" (not "human")
  • "Mus musculus" (not "mouse")
  • "Rattus norvegicus" (not "rat")

4. UniProt Queries

When searching by UniProt ID, specify:

Accession Code = [ID] AND Database Name = "UniProt"

5. Combining OR Conditions

Always put OR conditions in parentheses:

(Organism = "Homo sapiens" OR Organism = "Mus musculus")

Otherwise precedence may give unexpected results.


What to Check When Using a PDB Structure

  1. Experimental method — X-ray? NMR? Cryo-EM?
  2. Resolution — <2.5 Å is generally good for most purposes
  3. R-factors — R-free should be reasonable for the resolution
  4. Validation report — check for outliers in your region of interest
  5. Sequence coverage — does the structure include the region you care about?
  6. Ligands/cofactors — are they present? Are they what you expect?

Comparing Experimental vs AlphaFold Structures

When AlphaFold structures are available:

| Check                      | Experimental                 | AlphaFold                             |
|----------------------------|------------------------------|---------------------------------------|
| Overall reliability        | Resolution, R-factor         | pLDDT, PAE                            |
| Local confidence           | B-factor (flexibility)       | pLDDT (prediction confidence)         |
| Disordered regions         | Often missing                | Low pLDDT (<50)                       |
| Ligand binding sites       | Can have ligands             | No ligands                            |
| Protein-protein interfaces | Shown in complex structures  | Not reliable unless AlphaFold-Multimer |

Key insight: Low-confidence AlphaFold regions often correspond to regions missing in experimental structures — both are telling you the same thing (disorder/flexibility).


Quick Reference

PDB Quality Indicators

| Indicator             | Good value | Bad value |
|-----------------------|------------|-----------|
| Resolution            | <2.5 Å     | >3.5 Å    |
| R-free                | <0.25      | >0.30     |
| Ramachandran outliers | <1%        | >5%       |
| Clashscore            | <5         | >20       |

AlphaFold Confidence

| pLDDT | Meaning                        |
|-------|--------------------------------|
| >90   | Very confident, analyze details |
| 70-90 | Confident backbone             |
| 50-70 | Low confidence                 |
| <50   | Likely disordered              |

Search Field Cheatsheet

| What you want      | Field to use                               |
|--------------------|--------------------------------------------|
| X-ray resolution   | Data Collection Resolution                 |
| Cryo-EM resolution | Reconstruction Resolution                  |
| Species            | Source Organism Taxonomy Name              |
| UniProt link       | Accession Code + Database Name = "UniProt" |
| Pfam domain        | Identifier + Annotation Type = "Pfam"      |
| CATH superfamily   | Lineage Identifier (CATH)                  |
| Coverage           | Reference Sequence Coverage                |
| Geometry quality   | Molprobity Percentage Ramachandran Outliers |

For the Oral Exam

Be prepared to explain:

  1. Why crystallography needs crystals — signal amplification from ordered molecular packing

  2. The phase problem — you measure amplitudes but lose phases; must determine indirectly

  3. What resolution means — ability to distinguish fine details; limited by crystal order

  4. Why Cryo-EM grew so fast — no crystals needed, good for large complexes, computational advances

  5. NMR gives ensembles, not single structures — restraints satisfied by multiple conformations

  6. What pLDDT means — local prediction confidence, stored in B-factor column

  7. Difference between pLDDT and PAE — pLDDT is local confidence, PAE is relative domain positioning

  8. How to assess structure quality — resolution, R-factors, validation metrics

  9. B-factor means different things — mobility (X-ray), fluctuation (NMR), confidence (AlphaFold)

  10. How to construct complex PDB queries — combining method, resolution, organism, domain annotations

UCSF-Chimera


So you need to visualize protein structures, analyze binding sites, or understand why a mutation causes disease? Welcome to Chimera — your molecular visualization workhorse.

What is Chimera?

UCSF Chimera — a free molecular visualization program from UC San Francisco. It lets you:

  • Visualize 3D protein/DNA/RNA structures
  • Analyze protein-ligand interactions
  • Measure distances and angles
  • Compare structures (superposition)
  • Color by various properties (charge, hydrophobicity, conservation, flexibility)
  • Generate publication-quality images

Getting Started

Opening a Structure

From PDB (online):

File → Fetch by ID → Enter PDB code (e.g., 1a6m) → Fetch

From file:

File → Open → Select your .pdb file

Representation Styles

The Main Styles

| Style            | What it shows                        | Use for                  |
|------------------|--------------------------------------|--------------------------|
| Ribbon/Cartoon   | Secondary structure (helices, sheets) | Overall fold             |
| Sticks           | All bonds as sticks                  | Detailed view of residues |
| Ball and Stick   | Atoms as balls, bonds as sticks      | Ligands, active sites    |
| Sphere/Spacefill | Atoms as van der Waals spheres       | Space-filling, surfaces  |
| Wire             | Thin lines for bonds                 | Large structures         |

How to Change Representation

Actions → Atoms/Bonds → [stick/ball & stick/sphere/wire]
Actions → Ribbon → [show/hide]
💡
Common Combo

Ribbon for protein backbone + Sticks for ligand/active site residues = best of both worlds


Selection: The Most Important Skill

Everything in Chimera starts with selection. Select what you want, then do something to it.

Selection Methods

| Method       | How                     | Example         |
|--------------|-------------------------|-----------------|
| Click        | Ctrl + Click on atom    | Select one atom |
| Menu         | Select → ...            | Various options |
| Chain        | Select → Chain → A      | Select chain A  |
| Residue type | Select → Residue → HIS  | All histidines  |
| Command line | select :153             | Residue 153     |

Useful Selection Menu Options

Select → Chain → [A, B, C...]           # Select by chain
Select → Residue → [ALA, HIS, IHP...]   # Select by residue type
Select → Structure → Protein            # All protein
Select → Structure → Ligand             # All ligands
Select → Chemistry → Side chain         # Just sidechains
Select → Clear Selection                # Deselect everything
Select → Invert (all models)            # Select everything NOT selected

Zone Selection (Within Distance)

Select everything within X Å of current selection:

Select → Zone...
  → Set distance (e.g., 6 Å)
  → OK

This is super useful for finding binding site residues!


Coloring

Color by Element (Default)

Actions → Color → by element
| Element    | Color        |
|------------|--------------|
| Carbon     | Gray         |
| Oxygen     | Red          |
| Nitrogen   | Blue         |
| Sulfur     | Yellow       |
| Hydrogen   | White        |
| Iron       | Orange-brown |
| Phosphorus | Orange       |

Color by Hydrophobicity

Tools → Depiction → Render by Attribute
  → Attribute: kdHydrophobicity
  → OK
| Color      | Meaning                |
|------------|------------------------|
| Blue/Cyan  | Hydrophilic (polar)    |
| White      | Intermediate           |
| Orange/Red | Hydrophobic (nonpolar) |

Why use this? To see the hydrophobic core of proteins — nonpolar residues hide inside, polar residues face the water.


Color by Electrostatic Potential (Coulombic)

This is the red-white-blue coloring from your exercise!

Step 1: Generate surface first

Actions → Surface → Show

Step 2: Color by charge

Tools → Surface/Binding Analysis → Coulombic Surface Coloring → OK
| Color | Charge       | Attracts...        |
|-------|--------------|--------------------|
| Blue  | Positive (+) | Negative molecules |
| Red   | Negative (−) | Positive molecules |
| White | Neutral      | Hydrophobic stuff  |
⚠️
Surface Required!

The OK button is disabled if no surface exists. Always do Actions → Surface → Show first!

What to look for:

  • Binding pockets often have complementary charge to ligand
  • DNA-binding proteins have positive (blue) surfaces to attract negative DNA
  • Negatively charged ligands (like phosphates) bind in positive (blue) pockets

Color by B-factor (Flexibility)

B-factor = temperature factor = how much an atom "wiggles" in the crystal.

Tools → Depiction → Render by Attribute
  → Attribute: bfactor
  → OK
| Color | B-factor | Meaning             |
|-------|----------|---------------------|
| Blue  | Low      | Rigid, well-ordered |
| Red   | High     | Flexible, mobile    |

What to expect:

  • Protein core: Blue (rigid)
  • Loops and termini: Red (floppy)
  • Active sites: Often intermediate

Color by Conservation

When you have multiple aligned structures:

Tools → Sequence → Multialign Viewer
  → (structures get aligned)
Structure → Render by Conservation
| Color       | Conservation     |
|-------------|------------------|
| Blue/Purple | Highly conserved |
| Red         | Variable         |

Conserved residues = functionally important (active sites, structural core)


Molecular Surfaces

Show/Hide Surface

Actions → Surface → Show
Actions → Surface → Hide

Transparency

Actions → Surface → Transparency → [0-100%]

Use ~50-70% transparency to see ligands through the surface.

Cross-Section (Clipping)

To see inside the protein:

Tools → Depiction → Per-Model Clipping
  → Enable clipping
  → Adjust plane position

Or use the Side View panel:

Tools → Viewing Controls → Side View

Measuring Distances

Method 1: Distance Tool

Tools → Structure Analysis → Distances

Then Ctrl+Shift+Click on first atom, Ctrl+Shift+Click on second atom.

Distance appears as a yellow dashed line with measurement.

Method 2: Command Line

distance :169@OG :301@O34

What Distances Mean

| Distance   | Interaction Type          |
|------------|---------------------------|
| ~1.0–1.5 Å | Covalent bond             |
| ~1.8–2.1 Å | Coordination bond (metal) |
| ~2.5–3.5 Å | Hydrogen bond             |
| ~2.8–4.0 Å | Salt bridge               |
| > 4 Å      | No direct interaction     |
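A rough classifier for these distance bands (a sketch only: the bands overlap, and distinguishing an H-bond from a salt bridge really requires looking at the atom types and charges, not just the distance):

```python
def classify_contact(distance_angstrom):
    # Bands follow the distance table; overlaps are resolved in favor
    # of the shorter-range interaction.
    d = distance_angstrom
    if d < 1.6:
        return "covalent bond"
    if d < 2.2:
        return "metal coordination"
    if d < 3.5:
        return "hydrogen bond / salt bridge"
    if d < 4.0:
        return "salt bridge (long)"
    return "no direct interaction"

print(classify_contact(2.8))  # hydrogen bond / salt bridge
print(classify_contact(5.0))  # no direct interaction
```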

Hydrogen Bonds

What is a Hydrogen Bond?

Donor—H · · · · Acceptor
         ↑
    H-bond (~2.5-3.5 Å)
  • Donor: Has hydrogen to give (—OH, —NH)
  • Acceptor: Has lone pair to receive (O=, N)

Find H-Bonds Automatically

Tools → Structure Analysis → FindHBond

Options:

  • ✓ Include intra-molecule (within protein)
  • ✓ Include inter-molecule (protein-ligand)

H-bonds appear as blue/green lines.

Common H-Bond Donors in Proteins

| Amino Acid | Donor Atom    | Group       |
|------------|---------------|-------------|
| Serine     | OG            | —OH         |
| Threonine  | OG1           | —OH         |
| Tyrosine   | OH            | —OH         |
| Histidine  | NE2, ND1      | Ring —NH    |
| Lysine     | NZ            | —NH₃⁺       |
| Arginine   | NH1, NH2, NE  | Guanidinium |
| Backbone   | N             | Amide —NH   |

Common H-Bond Acceptors

| Group       | Atoms                            |
|-------------|----------------------------------|
| Phosphate   | O atoms                          |
| Carboxylate | OD1, OD2 (Asp), OE1, OE2 (Glu)   |
| Carbonyl    | O (backbone)                     |
| Hydroxyl    | O (can be both donor AND acceptor) |

Salt Bridges (Ionic Interactions)

A salt bridge = electrostatic attraction between opposite charges.

| Positive (basic)           | Negative (acidic)     |
|----------------------------|-----------------------|
| Lysine (NZ)                | Aspartate (OD1, OD2)  |
| Arginine (NH1, NH2)        | Glutamate (OE1, OE2)  |
| Histidine (when protonated) | C-terminus            |
| N-terminus                 | Phosphate groups      |

Typical distance: ~2.8–4.0 Å between charged atoms


Coordination Bonds (Metals)

Metals like Fe, Zn, Mg are coordinated by specific atoms:

| Metal     | Common Ligands    | Distance   |
|-----------|-------------------|------------|
| Fe (heme) | His NE2, O₂       | ~2.0–2.2 Å |
| Zn        | Cys S, His N      | ~2.0–2.3 Å |
| Mg        | Asp/Glu O, water  | ~2.0–2.2 Å |

Example: In myoglobin (1a6m), the proximal histidine coordinates Fe at ~2.1 Å.


Ramachandran Plot

Shows allowed backbone angles (φ/ψ) for amino acids.

Tools → Structure Analysis → Ramachandran Plot

Regions of the Plot

| Region      | Location               | Structure                |
|-------------|------------------------|--------------------------|
| Lower left  | φ ≈ -60°, ψ ≈ -45°     | α-helix                  |
| Upper left  | φ ≈ -120°, ψ ≈ +130°   | β-sheet                  |
| Upper right | Positive φ             | Left-handed helix (rare) |

Why Glycine is Special

Glycine has no sidechain → no steric clashes → can be in "forbidden" regions (positive φ).

Select → Residue → GLY

Glycines often appear in the right half of the Ramachandran plot where other residues can't go.


Structural Superposition

Compare two similar structures by overlaying them.

Method 1: MatchMaker (Sequence-based)

Tools → Structure Comparison → MatchMaker
  → Reference: structure 1
  → Match: structure 2
  → OK

Output tells you:

  • RMSD (Root Mean Square Deviation): How well they align
    • < 1 Å = very similar
    • 1–2 Å = similar fold
    • >3 Å = significant differences

  • Sequence identity %: How similar the sequences are
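MatchMaker reports the RMSD after superposition; the quantity itself is simple. A minimal sketch assuming the two coordinate sets are equal-length, matched atom-for-atom, and already superposed (the coordinates below are made up):

```python
import math

def rmsd(coords_a, coords_b):
    # Root mean square deviation between matched (x, y, z) coordinates
    n = len(coords_a)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / n)

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.1), (1.0, 0.1, 0.0)]
print(round(rmsd(a, b), 2))  # 0.1 -> very similar
```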

Method 2: Match (Command)

match #1 #0

Restricting Alignment to a Region

To align just the active site (e.g., within 4 Å of ligand):

sel #1:hem #0:hem zr < 4
match sel

Working with Chains

Delete Unwanted Chains

Select → Chain → B
Actions → Atoms/Bonds → Delete

Select Specific Chain

Select → Chain → A

Or command:

select #0:.A

AlphaFold Structures and pLDDT

What is pLDDT?

AlphaFold stores its confidence score (pLDDT) in the B-factor column.

| pLDDT | Confidence | Typical regions    |
|-------|------------|--------------------|
| > 90  | Very high  | Structured core    |
| 70–90 | Confident  | Most of protein    |
| 50–70 | Low        | Loops, uncertain   |
| < 50  | Very low   | Disordered regions |

Color by pLDDT

Since pLDDT is in B-factor column, use:

Tools → Depiction → Render by Attribute → bfactor

Or select low-confidence regions:

select @@bfactor<70
ℹ️
AlphaFold vs Experimental

Low pLDDT regions in AlphaFold often correspond to regions that are ALSO missing in experimental structures — they're genuinely disordered/flexible, not just bad predictions.


The Hydrophobic Core

Soluble proteins organize with:

  • Hydrophobic residues (Leu, Ile, Val, Phe, Met) → inside (core)
  • Polar/charged residues (Lys, Glu, Ser, Asp) → outside (surface)

Visualizing the Core

  1. Color by hydrophobicity
  2. Use cross-section/clipping to see inside
  3. Orange/tan inside, blue/cyan outside = correct fold

Protein-Ligand Interaction Analysis

General Workflow

  1. Isolate the binding site:

    Select → Residue → [ligand name]
    Select → Zone → 5-6 Å
    
  2. Delete or hide everything else:

    Select → Invert
    Actions → Atoms/Bonds → Delete (or Hide)
    
  3. Show interactions:

    Tools → Structure Analysis → FindHBond
    
  4. Measure specific distances:

    Tools → Structure Analysis → Distances
    
  5. Look at electrostatics:

    Actions → Surface → Show
    Tools → Surface/Binding Analysis → Coulombic Surface Coloring
    

What to Report

For protein-ligand interactions, describe:

| Interaction Type              | How to Identify                                         |
|-------------------------------|---------------------------------------------------------|
| Hydrogen bonds                | Distance 2.5–3.5 Å, involves N-H or O-H                 |
| Salt bridges                  | Opposite charges, distance ~2.8–4 Å                     |
| Hydrophobic                   | Nonpolar residues surrounding nonpolar parts of ligand  |
| Coordination                  | Metal ion with specific geometry                        |
| Electrostatic complementarity | Blue pocket for negative ligand (or vice versa)         |

Example: Analyzing a Binding Site (3eeb)

This is the exercise you did!

The Setup

1. Fetch 3eeb
2. Delete chain B (Select → Chain → B, then Delete)
3. Show surface, color by electrostatics

Result: Blue (positive) binding pocket for the negative IHP (6 phosphates).

The Details

1. Hide surface
2. Select IHP, then Zone 6 Å
3. Invert selection, Delete
4. Show sidechains, keep ribbon
5. Measure distances

Result:

  • Ser 169 OG ↔ IHP O34: ~2.8 Å = hydrogen bond (Ser donates H)
  • His 55 NE2 ↔ IHP O22: ~2.9 Å = hydrogen bond (His donates H)

The Interpretation

"IHP binding is driven by electrostatic attraction (positive pocket, negative ligand) and stabilized by specific hydrogen bonds from Ser 169 and His 55 to phosphate oxygens."


Cancer Mutations in p53 (1tup)

Example from your lectures showing how to analyze mutation hotspots:

The Hotspot Residues

| Residue | Type       | Role                        |
|---------|------------|-----------------------------|
| R248    | Contact    | Directly touches DNA        |
| R273    | Contact    | Directly touches DNA        |
| R175    | Structural | Stabilizes DNA-binding loop |
| H179    | Structural | Stabilizes DNA-binding loop |

Analysis Approach

1. Open 1tup, keep chain B
2. Show R175, R248, R273, H179 in spacefill
3. Color surface by electrostatics

Result:

  • R248 and R273 are right at the DNA interface (positive surface touching negative DNA)
  • R175 and H179 are buried, maintaining the fold
  • Mutations here → lose DNA binding → lose tumor suppression → cancer

Common Chimera Workflows

Quick Look at a Structure

1. File → Fetch by ID
2. Actions → Ribbon → Show
3. Presets → Interactive 1 (ribbons)
4. Rotate, zoom, explore

Analyze Active Site

1. Select ligand
2. Select → Zone → 5 Å
3. Actions → Atoms/Bonds → Show (for selection)
4. Tools → Structure Analysis → FindHBond

Compare Two Structures

1. Open both structures
2. Tools → Structure Comparison → MatchMaker
3. Check RMSD and sequence identity

Make a Figure

1. Set up your view
2. Presets → Publication 1
3. File → Save Image

Command Line Quick Reference

The command line is at the bottom of the Chimera window. Faster than menus once you know commands.

| Command | What it does |
|---------|--------------|
| open 1a6m | Fetch and open PDB |
| select :153 | Select residue 153 |
| select :HIS | Select all histidines |
| select #0:.A | Select chain A of model 0 |
| select :hem zr<5 | Select within 5 Å of heme |
| display sel | Show selected atoms |
| ~display ~sel | Hide unselected atoms |
| color red sel | Color selection red |
| represent sphere | Spacefill for selection |
| distance :169@OG :301@O34 | Measure distance |
| match #1 #0 | Superpose model 1 onto 0 |
| surface | Show surface |
| ~surface | Hide surface |
| del sel | Delete selection |

Keyboard Shortcuts

| Key | Action |
|-----|--------|
| Ctrl + Click | Select atom |
| Ctrl + Shift + Click | Add to selection / measure distance |
| Scroll wheel | Zoom |
| Right-drag | Translate |
| Left-drag | Rotate |
| Middle-drag | Zoom (alternative) |

Troubleshooting Common Issues

"Nothing selected"

You tried to do something but nothing happened:

  • Check: Is anything actually selected? (Green highlighting)
  • Fix: Select → [what you want] first

Surface coloring disabled

  • Check: Does a surface exist?
  • Fix: Actions → Surface → Show first

Can't see ligand

  • Check: Is it hidden?
  • Fix: Select → Residue → [ligand], then Actions → Atoms/Bonds → Show

Structure looks weird after operations

  • Fix: Presets → Interactive 1 to reset to default view

Atoms showing when you want ribbon only

Actions → Atoms/Bonds → Hide
Actions → Ribbon → Show

External Resources for Structure Analysis

| Resource | URL | Use for |
|----------|-----|---------|
| RCSB PDB | rcsb.org | US PDB, structure info |
| PDBe | ebi.ac.uk/pdbe | European PDB, ligand interactions |
| PLIP | plip-tool.biotec.tu-dresden.de | Automated interaction analysis |
| AlphaFold DB | alphafold.ebi.ac.uk | Predicted structures |
| COSMIC | cancer.sanger.ac.uk/cosmic | Cancer mutations |

TL;DR

| Task | How |
|------|-----|
| Open structure | File → Fetch by ID |
| Select | Select → [Chain/Residue/Zone] |
| Delete | Select, then Actions → Atoms/Bonds → Delete |
| Show surface | Actions → Surface → Show |
| Color by charge | Surface first, then Tools → Surface/Binding Analysis → Coulombic |
| Color by flexibility | Tools → Depiction → Render by Attribute → bfactor |
| Measure distance | Tools → Structure Analysis → Distances, then Ctrl+Shift+Click |
| Find H-bonds | Tools → Structure Analysis → FindHBond |
| Compare structures | Tools → Structure Comparison → MatchMaker |

Key distances:

  • ~2.0 Å = coordination bond
  • ~2.5–3.5 Å = hydrogen bond
  • ~2.8–4.0 Å = salt bridge
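
Those cutoffs are easy to mix up, so here's a tiny sanity-check helper. The ranges overlap in real structures (a 3.0 Å contact could be an H-bond or part of a salt bridge, depending on the atoms involved), so treat the output as a rough first guess, not a verdict:

```python
def classify_distance(d):
    """Rough first guess for an atom-atom distance d (in Å).

    The cheat-sheet ranges overlap (2.8-3.5 Å could be an H-bond
    or a salt bridge), so the narrower range wins here.
    """
    if d < 2.3:
        return "possible coordination bond"
    elif d <= 3.5:
        return "possible hydrogen bond"
    elif d <= 4.0:
        return "possible salt bridge (if charged groups)"
    return "probably not a specific interaction"

print(classify_distance(2.8))  # possible hydrogen bond
print(classify_distance(2.0))  # possible coordination bond
```

The Ser 169 ↔ IHP contact measured earlier (~2.8 Å) lands squarely in the hydrogen-bond range.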

Electrostatic colors:

  • Blue = positive
  • Red = negative
  • White = neutral

Now go visualize some proteins! 🧬

UniProt

Introduction

So you need protein sequences, functions, domains, or disease associations? Welcome to UniProt — the world's most comprehensive protein database, and your one-stop shop for everything protein-related.

Universal Protein Resource — a collaboration between three major institutions since 2002:

| Institution | Location | Contribution |
|-------------|----------|--------------|
| SIB | Swiss Institute of Bioinformatics, Lausanne | UniProtKB/Swiss-Prot |
| EBI | European Bioinformatics Institute, UK | UniProtKB/TrEMBL, UniParc |
| PIR | Protein Information Resource, Georgetown | UniRef |

What it gives you:

  • Protein sequences and functions
  • Domains, families, PTMs
  • Disease associations and variants
  • Subcellular localization
  • Cross-references to 180+ external databases
  • Proteomes for complete organisms
  • BLAST, Align, ID mapping tools

The UniProt Structure

UniProt isn't just one database — it's a collection:

                    UniProt
                       │
       ┌───────────────┼───────────────┐
       │               │               │
   UniProtKB        UniRef         UniParc
   (Knowledge)    (Clusters)      (Archive)
       │
   ┌───┴───┐
   │       │
Swiss-Prot TrEMBL
(Reviewed) (Unreviewed)

| Database | What it is | Size (approx.) |
|----------|------------|----------------|
| Swiss-Prot | Manually curated, reviewed | ~570,000 entries |
| TrEMBL | Automatically annotated | ~250,000,000 entries |
| UniRef | Clustered sequences (100%, 90%, 50% identity) | Reduced redundancy |
| UniParc | Complete archive of all sequences | Non-redundant archive |
| Proteomes | Complete protein sets per organism | ~160,000 proteomes |

Swiss-Prot vs TrEMBL: Know the Difference

This is the most important distinction in UniProt:

| Aspect | Swiss-Prot (Reviewed) | TrEMBL (Unreviewed) |
|--------|-----------------------|---------------------|
| Curation | Manually reviewed by experts | Computationally analyzed |
| Data source | Scientific publications | Sequence repositories |
| Isoforms | Grouped together per gene | Individual entries |
| Quality | High confidence | Variable |
| Size | ~570K entries | ~250M entries |
| Icon | ⭐ Gold star | 📄 Document |
⚠️
Always Filter by Review Status!

When you need reliable annotations, always add reviewed:true to your query. TrEMBL entries can be useful for breadth, but Swiss-Prot entries are gold standard.


UniProt Identifiers

Accession Numbers

The primary identifier — stable and persistent:

P05067      (6 characters: 1 letter + 5 alphanumeric)
A0A024RBG1  (10 characters: newer format)
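
If you're validating accessions in a script, UniProt's help pages publish a regular expression that covers both formats. A quick sketch (the pattern is taken from the UniProt documentation; double-check it there if correctness matters for your pipeline):

```python
import re

# Accession pattern from the UniProt help pages:
# covers both the 6-character and 10-character formats.
ACCESSION_RE = re.compile(
    r"^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

def is_uniprot_accession(s):
    return bool(ACCESSION_RE.match(s))

print(is_uniprot_accession("P05067"))      # True  (6-char format)
print(is_uniprot_accession("A0A024RBG1"))  # True  (10-char format)
print(is_uniprot_accession("APP_HUMAN"))   # False (entry name, not accession)
```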

Entry Names

Human-readable format: GENE_SPECIES

APP_HUMAN    → Amyloid precursor protein, Human
INS_HUMAN    → Insulin, Human
SPIKE_SARS2  → Spike protein, SARS-CoV-2
💡
Accession vs Entry Name

Accession (P05067) = stable, use for databases and scripts
Entry name (APP_HUMAN) = readable, can change if gene name updates


Protein Existence Levels

UniProt classifies how confident we are that a protein actually exists:

| Level | Evidence | Description |
|-------|----------|-------------|
| 1 | Protein level | Experimental evidence (MS, X-ray, etc.) |
| 2 | Transcript level | mRNA evidence, no protein detected |
| 3 | Homology | Inferred from similar sequences |
| 4 | Predicted | Gene prediction, no other evidence |
| 5 | Uncertain | Dubious, may not exist |

Query syntax: existence:1 (for protein-level evidence)


Annotation Score

A 1-5 score indicating annotation completeness (not accuracy!):

| Score | Meaning |
|-------|---------|
| 5/5 | Well-characterized, extensively annotated |
| 4/5 | Good annotation coverage |
| 3/5 | Moderate annotation |
| 2/5 | Basic annotation |
| 1/5 | Minimal annotation |
ℹ️
Score ≠ Accuracy

A score of 5/5 means the entry has lots of annotations — it doesn't guarantee they're all correct. A score of 1/5 might just mean the protein hasn't been studied much yet.


UniProt Search Syntax

UniProt uses a field-based query syntax. The general format:

field:value

Basic Query Structure

term1 AND term2 AND (term3 OR term4)

Boolean operators: AND, OR, NOT (can be uppercase or lowercase)


Key Search Fields

Organism and Taxonomy

| Field | Example | Description |
|-------|---------|-------------|
| organism_name | organism_name:human | Search by name |
| organism_id | organism_id:9606 | Search by NCBI taxonomy ID |
| taxonomy_id | taxonomy_id:9606 | Like organism_id, but also matches descendant taxa in the lineage |

Common taxonomy IDs:

  • Human: 9606
  • Mouse: 10090
  • Rat: 10116
  • Zebrafish: 7955
  • E. coli K12: 83333
  • Yeast: 559292

Review Status and Existence

| Field | Example | Description |
|-------|---------|-------------|
| reviewed | reviewed:true | Swiss-Prot only |
| reviewed | reviewed:false | TrEMBL only |
| existence | existence:1 | Protein-level evidence |

Enzyme Classification (EC Numbers)

| Field | Example | Description |
|-------|---------|-------------|
| ec | ec:3.4.21.1 | Exact EC number |
| ec | ec:3.4.21.* | Wildcard for all serine endopeptidases |
| ec | ec:3.4.* | All peptidases |
💡
EC Number Wildcards

Use * as wildcard: ec:3.4.21.* matches all serine endopeptidases (3.4.21.1, 3.4.21.2, etc.)


Proteomes

| Field | Example | Description |
|-------|---------|-------------|
| proteome | proteome:UP000005640 | Human reference proteome |
| proteome | proteome:UP000000589 | Mouse reference proteome |

Finding proteome IDs: Go to UniProt → Proteomes → Search your organism


Cross-References (External Databases)

| Field | Example | Description |
|-------|---------|-------------|
| database | database:pdb | Has PDB structure |
| database | database:smr | Has Swiss-Model structure |
| database | database:ensembl | Has Ensembl cross-ref |
| xref | xref:pdb-1abc | Specific PDB ID |

Function and Annotation

| Field | Example | Description |
|-------|---------|-------------|
| cc_function | cc_function:"ion transport" | Function comment |
| cc_scl_term | cc_scl_term:SL-0039 | Subcellular location term |
| keyword | keyword:kinase | UniProt keyword |
| family | family:kinase | Protein family |

Gene Ontology

| Field | Example | Description |
|-------|---------|-------------|
| go | go:0007155 | Any GO term (by ID) |
| go | go:"cell adhesion" | Any GO term (by name) |
| goa | goa:0007155 | GO annotation (same as go) |

Sequence Properties

| Field | Example | Description |
|-------|---------|-------------|
| length | length:[100 TO 500] | Sequence length range |
| mass | mass:[10000 TO 50000] | Molecular weight range |
| cc_mass_spectrometry | cc_mass_spectrometry:* | Has MS data |

Building Complex Queries

Pattern 1: Reviewed + Organism + Function

reviewed:true AND organism_id:9606 AND cc_function:"kinase"

Pattern 2: Multiple EC Numbers

(ec:3.4.21.*) OR (ec:3.4.22.*)

Pattern 3: Multiple Organisms

(organism_id:10116) OR (organism_id:7955)

Pattern 4: Proteome + Database Cross-Reference

proteome:UP000005640 AND (database:pdb OR database:smr) AND reviewed:true

Pattern 5: Complex Boolean Logic

For "exactly two of three conditions" (A, B, C):

((A AND B) OR (B AND C) OR (A AND C)) NOT (A AND B AND C)
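
If that pattern looks suspicious, you can sanity-check it with a quick truth table — the grouped expression is exactly "the number of true conditions is 2":

```python
from itertools import product

def exactly_two(a, b, c):
    # Same structure as the query pattern above
    return ((a and b) or (b and c) or (a and c)) and not (a and b and c)

# Verify against the direct definition over all 8 cases
for a, b, c in product([False, True], repeat=3):
    assert exactly_two(a, b, c) == (sum([a, b, c]) == 2)
print("pattern matches 'exactly two of three' in all 8 cases")
```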

Practice Exercises

Exercise 1: Protein Existence Statistics

Q: (1) What percentage of TrEMBL entries have evidence at "protein level"? (2) What percentage of Swiss-Prot entries have evidence at "protein level"?

Click for answer

Answers:

  1. TrEMBL: ~0.17% (343,595 / 199,006,239)
  2. Swiss-Prot: ~20.7% (118,866 / 573,661)

Queries:

(1) (existence:1) AND (reviewed:false)
(2) (existence:1) AND (reviewed:true)

Takeaway: Swiss-Prot has ~100x higher percentage of experimentally verified proteins — that's why manual curation matters!


Exercise 2: EC Numbers + Multiple Organisms

Q: Retrieve all reviewed proteins annotated as either:

  • Cysteine endopeptidases (EC 3.4.22.*)
  • Serine endopeptidases (EC 3.4.21.*)

From: Rattus norvegicus [10116] and Danio rerio [7955]

How many?

Click for answer

Answer: 132 entries (121 rat, 11 zebrafish)

Query:

((ec:3.4.21.*) OR (ec:3.4.22.*)) AND ((organism_id:10116) OR (organism_id:7955)) AND (reviewed:true)

How to build it:

| Requirement | Query Component |
|-------------|-----------------|
| Serine OR Cysteine peptidases | (ec:3.4.21.*) OR (ec:3.4.22.*) |
| Rat OR Zebrafish | (organism_id:10116) OR (organism_id:7955) |
| Reviewed only | reviewed:true |

⚠️ Watch the parentheses! Without proper grouping, you'll get wrong results.


Exercise 3: Proteome + Structure Cross-References

Q: Retrieve all reviewed entries from the Human Reference Proteome that have either:

  • A PDB structure, OR
  • A Swiss-Model Repository structure

How many?

Click for answer

Answer: 17,695 entries

Query:

proteome:UP000005640 AND ((database:pdb) OR (database:smr)) AND (reviewed:true)

Components:

| Requirement | Query |
|-------------|-------|
| Human Reference Proteome | proteome:UP000005640 |
| PDB OR SMR structure | (database:pdb) OR (database:smr) |
| Reviewed | reviewed:true |

Exercise 4: Complex Boolean — "Exactly Two of Three"

Q: Find all reviewed entries with exactly two of these three properties:

  • Function: "ion transport" (CC field)
  • Subcellular location: "cell membrane" (SL-0039)
  • GO term: "cell adhesion" (GO:0007155)

Click for answer

Answer: 2,022 entries

Query:

((cc_function:"ion transport" AND cc_scl_term:SL-0039) OR (cc_scl_term:SL-0039 AND go:0007155) OR (cc_function:"ion transport" AND go:0007155)) NOT (cc_function:"ion transport" AND cc_scl_term:SL-0039 AND go:0007155) AND (reviewed:true)

Logic breakdown:

"Exactly two of three" = (A AND B) OR (B AND C) OR (A AND C), but NOT (A AND B AND C)

| Variable | Condition |
|----------|-----------|
| A | cc_function:"ion transport" |
| B | cc_scl_term:SL-0039 |
| C | go:0007155 |

⚠️ Common UniProt Search Mistakes

Mistake #1: Forgetting reviewed:true

❌ organism_id:9606 AND ec:3.4.21.*
   → Returns millions of TrEMBL entries

✓ organism_id:9606 AND ec:3.4.21.* AND reviewed:true
   → Returns curated Swiss-Prot entries only

Mistake #2: Wrong Parentheses Grouping

❌ ec:3.4.21.* OR ec:3.4.22.* AND organism_id:9606
   → Parsed as: ec:3.4.21.* OR (ec:3.4.22.* AND organism_id:9606)
   → Gets ALL serine peptidases from ANY organism (plus human cysteine peptidases)

✓ (ec:3.4.21.* OR ec:3.4.22.*) AND organism_id:9606
   → Gets both types, but only from human

Rule: Always use parentheses to make grouping explicit!


Mistake #3: Confusing Taxonomy Fields

organism_id:9606    → Works ✓
organism_name:human → Works ✓
taxonomy:human      → Doesn't work as expected

Best practice: Use organism_id with the NCBI taxonomy ID for precision.


Mistake #4: Missing Quotes Around Phrases

❌ cc_function:ion transport
   → Searches for "ion" in function AND "transport" anywhere

✓ cc_function:"ion transport"
   → Searches for the phrase "ion transport" in function

Mistake #5: Using Wrong Field for Cross-References

❌ pdb:1ABC
   → Not a valid field

✓ database:pdb AND xref:pdb-1ABC
   → Correct way to search for specific PDB

Or to find ANY protein with PDB:

database:pdb

Quick Reference: Common Query Patterns

By Organism

organism_id:9606                           # Human
organism_id:10090                          # Mouse
(organism_id:9606) OR (organism_id:10090)  # Human OR Mouse

By Enzyme Class

ec:1.1.1.1              # Exact EC
ec:1.1.1.*              # All in 1.1.1.x
ec:1.*                  # All oxidoreductases

By Evidence Level

reviewed:true                    # Swiss-Prot only
reviewed:false                   # TrEMBL only
existence:1                      # Protein-level evidence
existence:1 AND reviewed:true    # Best quality

By Database Cross-Reference

database:pdb                     # Has any PDB structure
database:smr                     # Has Swiss-Model
database:ensembl                 # Has Ensembl link
(database:pdb) OR (database:smr) # Has any 3D structure

By Proteome

proteome:UP000005640    # Human reference proteome
proteome:UP000000589    # Mouse reference proteome
proteome:UP000000625    # E. coli K12 proteome

By Function/Location

cc_function:"kinase"              # Function contains "kinase"
cc_scl_term:SL-0039               # Cell membrane
keyword:phosphoprotein            # UniProt keyword
go:0007155                        # GO term by ID
go:"cell adhesion"                # GO term by name

Entry Sections Quick Reference

A UniProtKB entry contains these sections:

| Section | What you find |
|---------|---------------|
| Function | Catalytic activity, cofactors, pathway |
| Names & Taxonomy | Protein names, gene names, organism |
| Subcellular Location | Where in the cell |
| Disease & Variants | Associated diseases, natural variants |
| PTM/Processing | Post-translational modifications |
| Expression | Tissue specificity, developmental stage |
| Interaction | Protein-protein interactions |
| Structure | 3D structure info, links to PDB |
| Family & Domains | Pfam, InterPro, PROSITE |
| Sequence | Amino acid sequence, isoforms |
| Cross-references | Links to 180+ external databases |

Tools Available in UniProt

| Tool | What it does |
|------|--------------|
| BLAST | Sequence similarity search |
| Align | Multiple sequence alignment |
| Peptide Search | Find proteins containing a peptide |
| ID Mapping | Convert between ID systems |
| Batch Retrieval | Get multiple entries at once |

Download Formats

| Format | Use case |
|--------|----------|
| FASTA | Sequences for analysis tools |
| TSV | Tabular data for Excel/R/Python |
| Excel | Direct spreadsheet use |
| JSON | Programmatic access |
| XML | Structured data exchange |
| GFF | Genome annotations |
| List | Just accession numbers |
💡
Customize Your Download

Before downloading, click "Customize columns" to select exactly which fields you need. This saves processing time later!
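
The same searches can be scripted against the UniProt REST API. A sketch — rest.uniprot.org is the real endpoint, but treat the exact field names as assumptions and verify them against "Customize columns" / the current API docs before relying on them:

```python
from urllib.parse import urlencode

# Build a UniProt REST search URL (rest.uniprot.org).
# "accession" and "gene_names" are assumed returned-field names;
# check the API documentation for the full list.
def uniprot_search_url(query, fields=("accession", "gene_names"), size=25):
    params = {
        "query": query,
        "format": "tsv",
        "fields": ",".join(fields),
        "size": size,
    }
    return "https://rest.uniprot.org/uniprotkb/search?" + urlencode(params)

url = uniprot_search_url("reviewed:true AND organism_id:9606 AND ec:3.4.21.*")
print(url)

# To fetch the TSV (needs network access):
#   from urllib.request import urlopen
#   with urlopen(url) as resp:
#       print(resp.read().decode())
```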


Automatic Annotation Systems

For TrEMBL entries, annotations come from:

| System | How it works |
|--------|--------------|
| UniRule | Manually curated rules based on Swiss-Prot templates |
| ARBA | Association Rule-Based Annotation using InterPro |
| ProtNLM | Google's NLP model for protein function prediction |

Evidence codes (ECO):

  • ECO:0000269 — Experimental evidence
  • ECO:0000305 — Curator inference
  • ECO:0000256 — Sequence model (automatic)
  • ECO:0000259 — InterPro match (automatic)

TL;DR

  • UniProt = protein database = Swiss-Prot (reviewed, high quality) + TrEMBL (unreviewed, comprehensive)
  • Always add reviewed:true when you need reliable annotations
  • Query syntax: field:value with AND, OR, NOT
  • Use parentheses to group OR conditions properly
  • Common fields: organism_id, ec, reviewed, existence, database, proteome, go
  • Wildcards: Use * for EC numbers (e.g., ec:3.4.21.*)
  • Protein existence: Level 1 = experimental evidence, Level 5 = uncertain

Now go find some proteins! 🧬

NCBI: A Practical Guide

So you need to search for nucleotide sequences, reference sequences, or gene information? Welcome to NCBI — the American counterpart to Europe's EBI, and home to GenBank, RefSeq, and about 40 other interconnected databases.

What is NCBI?

National Center for Biotechnology Information — created in 1988 as part of the National Library of Medicine (NLM) at NIH, Bethesda, Maryland.

What it gives you:

  • GenBank (primary nucleotide sequences)
  • RefSeq (curated reference sequences)
  • Gene database (gene-centric information)
  • PubMed (literature)
  • dbSNP, ClinVar, OMIM (variants & clinical)
  • BLAST (sequence alignment)
  • And ~40 more databases, all cross-linked
ℹ️
Global Search

Search any term (e.g., "HBB") from the NCBI homepage and it returns results across ALL databases — Literature, Genes, Proteins, Genomes, Genetics, Chemicals. Then drill down into the specific database you need.


The Three Main Sequence Databases

| Database | What it is | Key Point |
|----------|------------|-----------|
| Nucleotide | Collection from GenBank, RefSeq, TPA, PDB | Primary entry point for sequences |
| GenBank | Primary archive — anyone can submit | Raw data, may have duplicates/contradictions |
| RefSeq | Curated, non-redundant reference sequences | Clean, reviewed, NCBI-maintained |

GenBank vs RefSeq: Know the Difference

This is crucial — they serve different purposes:

| Aspect | GenBank | RefSeq |
|--------|---------|--------|
| Curation | Not curated | Curated by NCBI |
| Who submits | Authors/labs | NCBI creates from existing data |
| Who revises | Only original author | NCBI updates continuously |
| Redundancy | Multiple records for same locus | Single record per molecule |
| Consistency | Records can contradict each other | Consistent, reviewed |
| Scope | Any species | Model organisms mainly |
| Data sharing | Shared via INSDC | NCBI exclusive |
| Analogy | Primary literature | Review articles |
💡
When to Use Which?

GenBank: When you need all available sequences, including rare species or unpublished data.
RefSeq: When you need a reliable, canonical reference sequence for analysis.


INSDC: The Global Sequence Collaboration

GenBank doesn't exist in isolation. Since 2005, three databases synchronize daily:

                INSDC
        (daily synchronization)
      ┌───────────┼───────────┐
      │           │           │
 NCBI/GenBank  ENA/EBI      DDBJ
    (USA)     (Europe)     (Japan)

Submit to one, it appears in all three. This is why you sometimes see the same sequence with different accession prefixes.


Understanding Accession Numbers

GenBank Accessions

The LOCUS line tells you a lot:

LOCUS       SCU49845    5028 bp    DNA    PLN    21-JUN-1999
            ↑           ↑          ↑      ↑      ↑
         Name        Length     Type  Division  Date

GenBank Divisions (the 3-letter code):

| Code | Division |
|------|----------|
| PRI | Primate sequences |
| ROD | Rodent sequences |
| MAM | Other mammalian |
| VRT | Other vertebrate |
| INV | Invertebrate |
| PLN | Plant, fungal, algal |
| BCT | Bacterial |
| VRL | Viral |
| PHG | Bacteriophage |
| SYN | Synthetic |

Query by division: gbdiv_pln[Properties]


RefSeq Accession Prefixes

This is important — the prefix tells you exactly what type of sequence it is:

| Prefix | Type | Curation Level |
|--------|------|----------------|
| NM_ | mRNA | Curated ✓ |
| NP_ | Protein | Curated ✓ |
| NR_ | Non-coding RNA | Curated ✓ |
| XM_ | mRNA | Predicted (computational) |
| XP_ | Protein | Predicted (computational) |
| XR_ | Non-coding RNA | Predicted (computational) |
| NG_ | Genomic region | Reference |
| NC_ | Chromosome | Complete |
| NT_ | Contig | Assembly |
| NW_ | WGS Supercontig | Assembly |
⚠️
N vs X Prefix

NM_, NP_ = Curated, experimentally supported
XM_, XP_ = Predicted by algorithms, not yet reviewed

For reliable analyses, prefer N* prefixes when available!
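
If you're filtering accession lists in a script, the prefix table above maps directly to a small lookup. A sketch (prefixes per the table; anything unlisted falls through as unknown):

```python
# Maps RefSeq prefixes from the table above to (molecule, curation).
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", "curated"),
    "NP_": ("protein", "curated"),
    "NR_": ("non-coding RNA", "curated"),
    "XM_": ("mRNA", "predicted"),
    "XP_": ("protein", "predicted"),
    "XR_": ("non-coding RNA", "predicted"),
    "NG_": ("genomic region", "reference"),
    "NC_": ("chromosome", "complete"),
}

def refseq_kind(accession):
    for prefix, kind in REFSEQ_PREFIXES.items():
        if accession.startswith(prefix):
            return kind
    return ("unknown", "unknown")

print(refseq_kind("NM_001234"))  # ('mRNA', 'curated')
print(refseq_kind("XP_011528"))  # ('protein', 'predicted')
```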


RefSeq Status Codes

| Status | Meaning | Reliability |
|--------|---------|-------------|
| REVIEWED | Reviewed by NCBI staff, literature-backed | ⭐⭐⭐ Highest |
| VALIDATED | Initial review done, preferred sequence | ⭐⭐ High |
| PROVISIONAL | Not yet reviewed, gene association established | ⭐ Medium |
| PREDICTED | Computational prediction, some aspects predicted | ⭐ Medium |
| INFERRED | Predicted, partially supported by homology | Low |
| MODEL | Automatic pipeline, no individual review | Lowest |

NCBI Search Syntax

This is where it gets powerful. NCBI uses field tags in square brackets.

Basic Syntax

search_term[Field Tag]

Boolean operators must be UPPERCASE: AND, OR, NOT

Common Field Tags

| Field Tag | What it searches | Example |
|-----------|------------------|---------|
| [Title] | Definition line | glyceraldehyde 3 phosphate dehydrogenase[Title] |
| [Organism] | NCBI taxonomy | mouse[Organism], "Homo sapiens"[Organism] |
| [Properties] | Molecule type, source, etc. | biomol mrna[Properties] |
| [Filter] | Subsets of data | nucleotide omim[Filter] |
| [Gene Name] | Gene symbol | BRCA1[Gene Name] |
| [EC/RN Number] | Enzyme Commission number | 2.1.1.1[EC/RN Number] |
| [Accession] | Accession number | NM_001234[Accession] |

Useful Properties Field Terms

Molecule Type

biomol_mrna[Properties]
biomol_genomic[Properties]
biomol_rrna[Properties]

GenBank Division

gbdiv_pri[Properties]    (primates)
gbdiv_rod[Properties]    (rodents)
gbdiv_est[Properties]    (ESTs)
gbdiv_htg[Properties]    (high throughput genomic)

Gene Location

gene_in_mitochondrion[Properties]
gene_in_chloroplast[Properties]
gene_in_genomic[Properties]

Source Database

srcdb_refseq[Properties]           (any RefSeq)
srcdb_refseq_reviewed[Properties]  (reviewed RefSeq only)
srcdb_refseq_validated[Properties] (validated RefSeq only)
srcdb_pdb[Properties]
srcdb_swiss_prot[Properties]

The Gene database is the best starting point for gene-specific searches. It integrates information from multiple sources: nomenclature, RefSeqs, maps, pathways, variations, phenotypes.

Gene-Specific Field Tags

| Find genes by... | Search syntax |
|------------------|---------------|
| Free text | human muscular dystrophy |
| Gene symbol | BRCA1[sym] |
| Organism | human[Organism] |
| Chromosome | Y[CHR] AND human[ORGN] |
| Gene Ontology term | "cell adhesion"[GO] or 10030[GO] |
| EC number | 1.9.3.1[EC] |
| PubMed ID | 11331580[PMID] |
| Accession | M11313[accn] |

Gene Properties

genetype protein coding[Properties]
genetype pseudo[Properties]
has transcript variants[Properties]
srcdb refseq reviewed[Properties]
feattype regulatory[Properties]

Gene Filters

gene clinvar[Filter]      (has ClinVar entries)
gene omim[Filter]         (has OMIM entries)
gene structure[Filter]    (has 3D structure)
gene type noncoding[Filter]
gene type pseudo[Filter]
src genomic[Filter]
src organelle[Filter]

Building Complex Queries

Query Structure

term1[Field] AND term2[Field] AND (term3[Field] OR term4[Field])
⚠️
Boolean Operators Must Be UPPERCASE

AND, OR, NOT — lowercase won't work!

Example Query Walkthrough

Goal: Find all reviewed/validated RefSeq mRNA entries for mouse enzymes with EC 2.1.1.1 or 2.1.1.10

Breaking it down:

| Requirement | Query Component |
|-------------|-----------------|
| mRNA sequences | "biomol mrna"[Properties] |
| EC 2.1.1.1 OR 2.1.1.10 | (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number]) |
| Mouse | "mus musculus"[Organism] |
| Reviewed OR validated RefSeq | ("srcdb refseq reviewed"[Properties] OR "srcdb refseq validated"[Properties]) |

Final query:

"biomol mrna"[Properties] AND (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number]) AND "mus musculus"[Organism] AND ("srcdb refseq reviewed"[Properties] OR "srcdb refseq validated"[Properties])

Result: 9 entries


Practice Exercises

Exercise 1: Nucleotide Database Query

Q: In NCBI "Nucleotide", find all entries containing:

  • mRNA sequences
  • coding for enzymes with EC Numbers 2.1.1.1 and 2.1.1.10
  • from Mus musculus
  • which have been reviewed or validated in RefSeq

How many entries?

Click for answer

Answer: 9 entries (range: 1-10)

Query:

"biomol mrna"[Properties] AND (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number]) AND "mus musculus"[Organism] AND ("srcdb refseq reviewed"[Properties] OR "srcdb refseq validated"[Properties])

How to build it:

| Requirement | Field Tag |
|-------------|-----------|
| mRNA | "biomol mrna"[Properties] |
| EC numbers (OR) | (2.1.1.1[EC/RN Number] OR 2.1.1.10[EC/RN Number]) |
| Mouse | "mus musculus"[Organism] |
| RefSeq quality | ("srcdb refseq reviewed"[Properties] OR "srcdb refseq validated"[Properties]) |

Exercise 2: Gene Database Query

Q: In the «Gene» database, look for all genes:

  • coding for proteins (protein-coding genes)
  • associated to the GO term "ATP synthase"
  • whose source is mitochondrial or genomic
  • annotated in ClinVar OR OMIM

How many entries?

Click for answer

Answer: 32 entries (range: 31-40)

Query:

"genetype protein coding"[Properties] AND "atp synthase"[Gene Ontology] AND ("source mitochondrion"[Properties] OR "source genomic"[Properties]) AND ("gene clinvar"[Filter] OR "gene omim"[Filter])

How to build it:

| Requirement | Field Tag |
|-------------|-----------|
| Protein-coding | "genetype protein coding"[Properties] |
| GO term | "atp synthase"[Gene Ontology] |
| Source (OR) | ("source mitochondrion"[Properties] OR "source genomic"[Properties]) |
| Clinical (OR) | ("gene clinvar"[Filter] OR "gene omim"[Filter]) |

Common Query Patterns

Pattern 1: Species + Molecule Type + Quality

"homo sapiens"[Organism] AND biomol mrna[Properties] AND srcdb refseq reviewed[Properties]

Pattern 2: Gene Function + Clinical Relevance

"kinase"[Gene Ontology] AND gene clinvar[Filter] AND human[Organism]

Pattern 3: Chromosome Region + Gene Type

7[CHR] AND human[ORGN] AND genetype protein coding[Properties]

Pattern 4: Multiple EC Numbers

(1.1.1.1[EC/RN Number] OR 1.1.1.2[EC/RN Number] OR 1.1.1.3[EC/RN Number])

⚠️ Common NCBI Search Mistakes

Mistake #1: Lowercase Boolean Operators

❌ biomol mrna[Properties] and mouse[Organism]
✓ biomol mrna[Properties] AND mouse[Organism]

The fix: Always use UPPERCASE AND, OR, NOT


Mistake #2: Missing Quotes Around Multi-Word Terms

❌ mus musculus[Organism]
✓ "mus musculus"[Organism]

❌ biomol mrna[Properties]
✓ "biomol mrna"[Properties]

The fix: Use quotes around phrases with spaces


Mistake #3: Wrong Database for Your Query

| You want... | Use this database |
|-------------|-------------------|
| Gene information, GO terms, pathways | Gene |
| Nucleotide sequences | Nucleotide |
| Protein sequences | Protein |
| Variants | dbSNP, ClinVar |
| Literature | PubMed |

Mistake #4: Confusing Properties vs Filters

| Type | Purpose | Example |
|------|---------|---------|
| Properties | Content-based attributes | biomol mrna[Properties] |
| Filters | Relationships to other databases | gene clinvar[Filter] |

Rule of thumb:

  • Properties = what the sequence IS
  • Filters = what the sequence is LINKED to

Mistake #5: Using GenBank When You Need RefSeq

If you need a reliable reference sequence for analysis, don't just search Nucleotide — filter for RefSeq:

srcdb refseq[Properties]

Or for highest quality:

srcdb refseq reviewed[Properties]

Quick Reference: Field Tags Cheatsheet

Nucleotide Database

| Purpose | Query |
|---------|-------|
| mRNA only | biomol mrna[Properties] |
| Genomic DNA | biomol genomic[Properties] |
| RefSeq only | srcdb refseq[Properties] |
| RefSeq reviewed | srcdb refseq reviewed[Properties] |
| Specific organism | "Homo sapiens"[Organism] |
| EC number | 1.1.1.1[EC/RN Number] |
| GenBank division | gbdiv_pri[Properties] |

Gene Database

| Purpose | Query |
|---------|-------|
| Protein-coding genes | genetype protein coding[Properties] |
| Pseudogenes | genetype pseudo[Properties] |
| GO term | "term"[Gene Ontology] |
| Has ClinVar | gene clinvar[Filter] |
| Has OMIM | gene omim[Filter] |
| Has structure | gene structure[Filter] |
| Chromosome | 7[CHR] |
| Gene symbol | BRCA1[sym] |

Cytogenetic Location Quick Reference

For the Gene database, understanding cytogenetic notation:

    7    q    3      1    .  2
    ↑    ↑    ↑      ↑       ↑
   Chr  Arm  Region  Band    Sub-band

   p = short arm (petit)
   q = long arm

Example: CFTR gene is at 7q31.2 = Chromosome 7, long arm, region 3, band 1, sub-band 2
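
If you need to split locations like this in a script, the notation parses with one regex. A hypothetical helper (handles only the common numeric/X/Y chromosome + arm + band cases):

```python
import re

# Split a cytogenetic location like "7q31.2" into its parts,
# following the notation breakdown above.
def parse_cytogenetic(loc):
    m = re.match(r"^(\d+|X|Y)([pq])(\d+)(?:\.(\d+))?$", loc)
    if not m:
        raise ValueError(f"unrecognized location: {loc}")
    chrom, arm, region_band, sub_band = m.groups()
    return {
        "chromosome": chrom,
        "arm": "short (p)" if arm == "p" else "long (q)",
        "region_band": region_band,  # e.g. "31" = region 3, band 1
        "sub_band": sub_band,        # None if absent
    }

print(parse_cytogenetic("7q31.2"))
```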


TL;DR

  • NCBI = US hub for biological databases (GenBank, RefSeq, Gene, PubMed, etc.)
  • GenBank = primary archive (raw submissions) vs RefSeq = curated reference (cleaned up)
  • RefSeq prefixes: NM/NP = curated, XM/XP = predicted — prefer N* for reliable analysis
  • Boolean operators MUST be UPPERCASE: AND, OR, NOT
  • Use quotes around multi-word terms: "homo sapiens"[Organism]
  • Gene database = best starting point for gene-centric searches
  • Properties = what it IS, Filters = what it's LINKED to

Now go query some databases! 🧬

Ensembl: A Practical Guide

So you need to look up genes, transcripts, variants, or convert IDs between databases? Welcome to Ensembl — the genome browser that bioinformaticians actually use daily.

What is Ensembl?

Ensembl is a genome browser and database jointly run by the EBI (European Bioinformatics Institute) and the Wellcome Trust Sanger Institute since 1999. Think of it as Google Maps, but for genomes.

What it gives you:

  • Gene sets (splice variants, proteins, ncRNAs)
  • Comparative genomics (alignments, protein trees, orthologues)
  • Variation data (SNPs, InDels, CNVs)
  • BioMart for bulk data export
  • REST API for programmatic access
  • Everything is open source
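
That REST API lives at rest.ensembl.org. As a quick taste, the /lookup/id endpoint returns basic info (name, location, biotype) for any stable ID — always request JSON explicitly:

```python
# Build a lookup URL for the Ensembl REST API (rest.ensembl.org).
def ensembl_lookup_url(stable_id):
    return (f"https://rest.ensembl.org/lookup/id/{stable_id}"
            "?content-type=application/json")

url = ensembl_lookup_url("ENSG00000141510")  # TP53
print(url)

# To actually fetch (needs network access):
#   import json, urllib.request
#   with urllib.request.urlopen(url) as resp:
#       info = json.load(resp)
#       print(info["display_name"], info["seq_region_name"])
```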
ℹ️
The Human Reference Genome

Currently we're on GRCh38.p14 (Genome Reference Consortium). The original Human Genome Project finished in 2003 — it cost about $3 billion and took 13 years. Now you can access it for free in seconds. Science is wild.


Ensembl Stable Identifiers

This is the ID system you'll see everywhere. Memorize the prefixes:

| Prefix | Meaning | Example |
|--------|---------|---------|
| ENSG | Gene ID | ENSG00000141510 |
| ENST | Transcript ID | ENST00000269305 |
| ENSP | Peptide/Protein ID | ENSP00000269305 |
| ENSE | Exon ID | ENSE00001146308 |
| ENSR | Regulatory Feature | ENSR00000000001 |
| ENSFM | Protein Family | ENSFM00250000000001 |
💡
Non-Human Species

For other species, a 3-letter code is inserted: ENSMUSG (mouse), ENSDARG (zebrafish), ENSCSAVG (Ciona savignyi), etc.


Transcript Quality Tiers

Not all transcripts are created equal. Here's the hierarchy:

MANE Select (Gold Standard) 🥇

  • Matched Annotation between NCBI and EBI
  • Perfectly aligned to GRCh38
  • Complete sequence identity with RefSeq
  • This is your go-to transcript

Merged (Ensembl/Havana) 🥈

  • Automatically annotated + manually curated
  • High confidence

CCDS (Consensus CDS)

  • Collaborative effort for consistent protein-coding annotations
  • Shared between NCBI, EBI, UCSC, and others

Ensembl Protein Coding (Red)

  • Automatic annotation based on mRNA/protein evidence
  • Good, but not manually verified
⚠️
Always Check Transcript Quality

When doing variant analysis, prefer MANE Select transcripts. Using a low-confidence transcript can give you wrong coordinates or missed variants.


Using the Ensembl Browser

Basic Navigation

  1. Go to ensembl.org
  2. Search by: gene name, Ensembl ID, coordinates, or variant ID (rs number)
  3. Gene page shows: location, transcripts, variants, orthologues, etc.

Key Information You Can Find

For any gene (e.g., MYH9):

  • Ensembl Gene ID → ENSG00000100345
  • Chromosomal coordinates → 22:36,281,270-36,393,331
  • Cytogenetic location → 22q12.3
  • Strand → Forward (+) or Reverse (-)
  • Number of transcripts → and which are protein-coding
  • MANE Select transcript → with CCDS and RefSeq cross-references

Viewing Variants

  1. Navigate to your gene
  2. Go to "Variant table" or zoom into a specific region
  3. Filter by: consequence type, clinical significance (ClinVar), etc.
  4. Click on any variant (e.g., rs80338828) to see:
    • Alleles and frequencies
    • Consequence (missense, synonymous, etc.)
    • Clinical annotations (ClinVar, OMIM)
    • Population frequencies

BioMart: Bulk Data Queries

BioMart is where Ensembl gets powerful. No programming required — it's a web interface for mining data in bulk.

Access: ensembl.org → BioMart (top menu)

The Three-Step Process

1. DATASET    → Choose species/database (e.g., Human genes GRCh38.p14)
2. FILTERS    → Narrow down what you want (gene list, chromosome, biotype...)
3. ATTRIBUTES → Choose what columns to export (IDs, names, sequences...)
💻
Workflow Example: ID Conversion

Goal: Convert RefSeq protein IDs to Ensembl Gene IDs

  1. Dataset: Human genes (GRCh38.p14)
  2. Filters → External References → RefSeq peptide ID → paste your list
  3. Attributes: Gene stable ID, Gene name, RefSeq peptide ID
  4. Results → Export as CSV/TSV/HTML
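
The same three-step query can be scripted: BioMart accepts an XML query document via its martservice endpoint. A sketch — the endpoint URL is real, but the internal dataset/filter/attribute names below (hsapiens_gene_ensembl, refseq_peptide, ensembl_gene_id, external_gene_name) are assumptions; confirm them in the web UI, which can show you the exact XML for any query you build:

```python
from urllib.parse import quote

# Assemble a BioMart XML query from dataset, filters and attributes.
# Name strings are assumed internal BioMart identifiers — verify
# them against the XML the BioMart web UI generates.
def biomart_query_xml(dataset, filters, attributes):
    filter_xml = "".join(
        f'<Filter name="{n}" value="{v}"/>' for n, v in filters.items()
    )
    attr_xml = "".join(f'<Attribute name="{a}"/>' for a in attributes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
        '<Query virtualSchemaName="default" formatter="TSV" header="1">'
        f'<Dataset name="{dataset}">{filter_xml}{attr_xml}</Dataset></Query>'
    )

xml = biomart_query_xml(
    "hsapiens_gene_ensembl",
    {"refseq_peptide": "NP_001214,NP_001216"},
    ["ensembl_gene_id", "external_gene_name", "refseq_peptide"],
)
url = "https://www.ensembl.org/biomart/martservice?query=" + quote(xml)
print(url[:80], "...")
```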

⚠️ Common BioMart Mistakes (And How to Avoid Them)

These will save you hours of frustration. Learn from pain.

Mistake #1: Pasting IDs in the Wrong Filter Field

⚠️
The Classic Blunder

You have RefSeq IDs (NP_001214, NP_001216...) and you paste them into "Gene stable ID(s)" field. Result? Empty results.

Why it happens: The "Gene stable ID(s)" field expects Ensembl IDs (ENSG...), not RefSeq IDs.

The fix:

| ID Type | Where to Paste |
|---------|----------------|
| ENSG00000xxxxx | Filters → GENE → Gene stable ID(s) |
| NP_xxxxxx (RefSeq protein) | Filters → EXTERNAL → RefSeq peptide ID(s) |
| NM_xxxxxx (RefSeq mRNA) | Filters → EXTERNAL → RefSeq mRNA ID(s) |
| P12345 (UniProt) | Filters → EXTERNAL → UniProtKB/Swiss-Prot ID(s) |
💡
Rule of Thumb

Look at your ID prefix. If it's NOT "ENS...", you need to find the matching field under EXTERNAL → External References.
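
That rule of thumb is mechanical enough to encode. A hypothetical helper (field names mirror the routing table above):

```python
# Given an identifier, suggest which BioMart filter field to paste
# it into, per the rule of thumb above.
def biomart_filter_for(identifier):
    if identifier.startswith("ENS"):
        return "GENE > Gene stable ID(s)"
    if identifier.startswith("NP_"):
        return "EXTERNAL > RefSeq peptide ID(s)"
    if identifier.startswith("NM_"):
        return "EXTERNAL > RefSeq mRNA ID(s)"
    return "EXTERNAL > (find the matching external reference field)"

print(biomart_filter_for("ENSG00000137752"))  # GENE > Gene stable ID(s)
print(biomart_filter_for("NP_001214"))        # EXTERNAL > RefSeq peptide ID(s)
```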


Mistake #2: Checkbox vs Text Input Confusion

Some filter options have both a checkbox AND a text field:

☑ With RefSeq peptide ID(s): Only    ← Checkbox (just filters for genes that HAVE RefSeq IDs)
[________________________]           ← Text field (where you paste YOUR specific IDs)

The mistake: Checking the box but not pasting IDs in the text field.

What happens:

  • Checkbox alone = "Give me all genes that have ANY RefSeq ID" (thousands of results)
  • Text field = "Give me only genes matching THESE specific RefSeq IDs" (your actual query)

The fix: Always paste your ID list in the text input field, not just check the box.


Mistake #3: Orthologue vs Paralogue Mix-up

⚠️
Know the Difference!

You want to find human equivalents of Ciona genes. You select Paralogue %id. Result? Wrong data or empty results.

| Term | Meaning | Use When |
|------|---------|----------|
| Orthologue | Same gene in different species (separated by speciation) | Ciona gene → Human equivalent |
| Paralogue | Different gene in same species (separated by duplication) | Human BRCA1 → Human BRCA2 |

The fix:

For cross-species queries (e.g., Ciona → Human):

Attributes → Homologues → Human Orthologues
    ✓ Human gene stable ID
    ✓ Human gene name
    ✓ %id. target Human gene identical to query gene

NOT:

Attributes → Homologues → Paralogues   ← WRONG for cross-species!

Mistake #4: Forgetting to Include Filter Column in Attributes

The scenario: You filter by RefSeq peptide ID, but don't include it in your output attributes.

What happens: You get a list of Ensembl IDs with no way to match them back to your original input!

| Gene stable ID | Gene name |
|----------------|-----------|
| ENSG00000137752 | CASP1 |
| ENSG00000196954 | CASP4 |

Wait... which RefSeq ID was CASP1 again? 🤷

The fix: Always include your filter field as an output attribute:

Attributes:
    ✓ Gene stable ID
    ✓ Gene name
    ✓ RefSeq peptide ID    ← Include this for verification!

Now you get:

| Gene stable ID | Gene name | RefSeq peptide ID |
|----------------|-----------|-------------------|
| ENSG00000137752 | CASP1 | NP_001214 |
| ENSG00000196954 | CASP4 | NP_001216 |

Much better!


Mistake #5: Wrong Dataset for Cross-Species Queries

The scenario: You want human orthologues of Ciona genes. You select "Human genes" as your dataset.

What happens: You can't input Ciona gene IDs because you're in the Human database!

The fix: Start from the source species:

Dataset: Ciona savignyi genes    ← Start here (your input species)
Filters: Gene stable ID → paste Ciona IDs
Attributes: 
    - Gene stable ID (Ciona)
    - Human orthologue gene ID    ← Get human data as attributes
    - Human gene name

Rule: Dataset = species of your INPUT IDs. Other species come through Homologues attributes.


BioMart Mistakes Cheatsheet

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Empty results | IDs in wrong filter field | Match ID prefix to correct filter (EXTERNAL for non-Ensembl IDs) |
| Way too many results | Used checkbox without text input | Paste specific IDs in the text field |
| Wrong species data | Selected Paralogue instead of Orthologue | Use Orthologue for cross-species |
| Can't match results to input | Didn't include filter column in output | Add your filter field to Attributes |
| Can't input your IDs | Wrong dataset selected | Dataset = species of your INPUT IDs |

Common BioMart Queries

Query Type 1: ID Conversion

RefSeq → Ensembl + HGNC Symbol

| Step | Action |
|------|--------|
| Dataset | Human genes (GRCh38.p14) |
| Filters | EXTERNAL → RefSeq peptide ID(s) → paste list |
| Attributes | Gene stable ID, HGNC symbol, RefSeq peptide ID |

Query Type 2: Finding Orthologues

Find human orthologues of genes from another species

| Step | Action |
|------|--------|
| Dataset | Source species (e.g., Ciona savignyi genes) |
| Filters | Gene stable ID → paste your list |
| Attributes | Gene stable ID, Human orthologue gene ID, Human gene name, % identity |
⚠️
Remember

Orthologue = cross-species. Paralogue = same species. Don't mix them up!


Query Type 3: Variant Export

Get all missense variants for a gene list

| Step | Action |
|------|--------|
| Dataset | Human genes (GRCh38.p14) |
| Filters | Gene name → your list; Variant consequence → missense_variant |
| Attributes | Gene name, Variant name (rs ID), Consequence, Amino acid change |

Query Type 4: Find Genes with PDB Structures

Count/export genes that have associated 3D structures

| Step | Action |
|------|--------|
| Dataset | Human genes (GRCh38.p14) |
| Filters | With PDB ID → Only |
| Attributes | Gene stable ID, Gene name, PDB ID, UniProtKB/Swiss-Prot ID |

Practice Exercises

Exercise 1: SNP Nucleotide Lookup

Q: In Ensembl, consider the SNP variation rs80338826. Which is the DNA nucleotide triplet coding for the wild-type amino acid residue (transcript MYH9-201)?

Click for answer

Answer: The triplet is CGT (coding for Arginine).

How to find it:

  1. Search rs80338826 in Ensembl
  2. Go to the variant page
  3. Look at transcript MYH9-201 consequences
  4. Check the codon column for the reference allele

Exercise 2: RefSeq to Ensembl Conversion

Q: Convert these RefSeq protein IDs to Ensembl Gene IDs and HGNC symbols:

NP_203126, NP_001214, NP_001216, NP_001220
NP_036246, NP_203519, NP_203520, NP_203522
Click for answer

BioMart Setup:

| Step | What to do |
|------|------------|
| Dataset | Human genes (GRCh38.p14) |
| Filters | EXTERNAL → RefSeq peptide ID(s) → paste the NP_ IDs |
| Attributes | Gene stable ID, HGNC symbol, RefSeq peptide ID |

⚠️ Don't paste NP_ IDs in "Gene stable ID" field — that's for ENSG IDs only!

Results:

| Gene stable ID | HGNC symbol | RefSeq peptide ID |
|----------------|-------------|-------------------|
| ENSG00000137752 | CASP1 | NP_001214 |
| ENSG00000196954 | CASP4 | NP_001216 |
| ENSG00000132906 | CASP9 | NP_001220 |
| ENSG00000105141 | CASP14 | NP_036246 |
| ENSG00000165806 | CASP7 | NP_203126 |
| ENSG00000064012 | CASP8 | NP_203519 |
| ENSG00000064012 | CASP8 | NP_203520 |
(Notice: CASP8 has multiple RefSeq IDs mapping to it — different isoforms!)


Exercise 3: Finding Human Orthologues

Q: Find human orthologues for these Ciona savignyi genes:

ENSCSAVG00000000002, ENSCSAVG00000000003, ENSCSAVG00000000006
ENSCSAVG00000000007, ENSCSAVG00000000009, ENSCSAVG00000000011
Click for answer

BioMart Setup:

| Step | What to do |
|------|------------|
| Dataset | Ciona savignyi genes (NOT Human!) |
| Filters | Gene stable ID(s) → paste the ENSCSAVG IDs |
| Attributes | Gene stable ID, Human orthologue gene ID, Human gene name, %id target Human |

⚠️ Use Orthologue (cross-species), NOT Paralogue (same species)!

Results:

| C. savignyi Gene ID | Human Gene ID | Human Gene Name | % Identity |
|---------------------|---------------|-----------------|------------|
| ENSCSAVG00000000002 | ENSG00000156026 | MCU | 55.1% |
| ENSCSAVG00000000003 | ENSG00000169435 | RASSF6 | 29.6% |
| ENSCSAVG00000000003 | ENSG00000101265 | RASSF2 | 35.4% |
| ENSCSAVG00000000003 | ENSG00000107551 | RASSF4 | 33.1% |
| ENSCSAVG00000000007 | ENSG00000145416 | MARCHF1 | 58.8% |
| ENSCSAVG00000000009 | ENSG00000171865 | RNASEH1 | 39.4% |
| ENSCSAVG00000000011 | ENSG00000146856 | AGBL3 | 69.1% |

(Note: ENSCSAVG00000000003 maps to multiple RASSF family members — gene family expansion!)
(Note: ENSCSAVG00000000006 has no human orthologue.)


Exercise 4: MYH9 Gene Exploration

Q: For the human MYH9 gene:

  1. What's the Ensembl code? How many transcripts? All protein-coding? Forward or reverse strand?
  2. What's the MANE Select transcript code? CCDS code? RefSeq codes?
  3. Chromosomal coordinates? Cytogenetic location?
  4. Zoom to exon 17 (22:36,306,051-36,305,930). Any variants annotated in both ClinVar and OMIM? Check rs80338828.
Click for answer
  1. Ensembl Gene ID: ENSG00000100345
    Transcripts: Multiple (check current count — it changes between releases)
    Not all protein-coding — some are processed transcripts, nonsense-mediated decay, etc.
    Strand: Reverse (-)

  2. MANE Select: ENST00000216181 (MYH9-201)
    CCDS: CCDS14099
    RefSeq: NM_002473 (mRNA), NP_002464 (protein)

  3. Coordinates: Chr22:36,281,270-36,393,331 (GRCh38)
    Cytogenetic: 22q12.3

  4. rs80338828: Yes, annotated in both ClinVar and OMIM
    Associated with MYH9-related disorders (May-Hegglin anomaly, etc.)


Quick Reference: BioMart Checklist

□ Selected correct dataset (species of your INPUT IDs)
□ Pasted IDs in the CORRECT filter field (match ID prefix!)
□ Used text input field, not just checkbox
□ Selected Orthologue (not Paralogue) for cross-species queries
□ Included filter column in attributes (for verification)
□ Checked "Unique results only" if needed
□ Tested with small subset before full export
📝
Pro Tips
  • BioMart can be slow with large queries — be patient or split into batches
  • Always double-check your assembly version (GRCh37 vs GRCh38)
  • For programmatic access, use the Ensembl REST API instead
  • Video tutorial: EBI BioMart Tutorial

TL;DR

  • Ensembl = genome browser + database for genes, transcripts, variants, orthologues
  • IDs: ENSG (gene), ENST (transcript), ENSP (protein) — learn to recognize them
  • MANE Select = highest quality transcript annotation (use these when possible)
  • BioMart = bulk query tool: Dataset → Filters → Attributes → Export

Avoid these mistakes:

  1. Don't paste RefSeq/UniProt IDs in "Gene stable ID" field — use EXTERNAL filters
  2. Use the text input field, not just checkboxes
  3. Orthologue = cross-species, Paralogue = same species
  4. Start with the species of your INPUT IDs as your dataset
  5. Always include your filter column in output attributes

Now go explore some genomes! 🧬

This Container Has a Snake Inside

In this topic we'll talk about containers and how to put the snake (Python) inside them.

(Image: a reference to a scene from an Egyptian movie, where a character humorously asks what's inside the box.)

Introduction to Containers

📖
Definition

Containers: an easy way of bundling an application together with its requirements, with the ability to deploy it in many places.

Applications inside a box, together with their requirements? Hmmm, but a Virtual Machine can do this too. To see why containers are different, we need to know how the whole story began.

The Beginning: Bare Metal

📝
One App, One Server

Each application needed its own physical server. Servers ran at 5-15% capacity but you paid for 100%.

Virtual Machines (VMs) Solution

Split One Server Into Many

Hypervisor software lets you run multiple "virtual servers" on one physical machine.

How it works:

Physical Server
├── Hypervisor
├── VM 1 (Full OS + App)
├── VM 2 (Full OS + App)
└── VM 3 (Full OS + App)


⚠️
The Hidden Costs

VMs solved hardware waste but created new problems at scale.

Every VM runs a complete operating system. If you have 1,000 VMs, you're running 1,000 complete operating systems, each consuming 2-4GB of RAM, taking minutes to boot, and requiring constant maintenance.

Every operating system also needs a license.

Each VM's operating system needs monthly patches, security updates, backups, monitoring, and troubleshooting. At 1,000 VMs, you're maintaining 1,000 separate operating systems.

You need specialized VMware administrators, OS administrators for each type of VM, network virtualization experts, and storage specialists. Even with templates, deploying a new VM takes days because it requires coordination across multiple expert teams.

Container Architecture

If you notice in the previous image, we are repeating the OS. We just need to change the app and its requirements.

Think about it: an OS is just a kernel (for hardware recognition - the black screen that appears when you turn on the PC) and user space. For running applications, we don't need the full user space, we only need the kernel (for hardware access).

Another thing - the VMs are already installed on a real (physical) machine that already has a kernel, so why not just use it? If we could use the host's kernel and get rid of the OS for each VM, we'd solve half the problem. This is one of the main ideas behind containers.

How can we do this? First, remember that the Linux kernel is the same everywhere in the world - what makes distributions different is the user space. Start with the kernel, add some tools and configurations, you get Debian. Add different tools, you get Ubuntu. It's always: kernel + different stuff on top = different distributions.

How do containers achieve this idea? By using layers. Think of it like a cake:

(Image: container layers illustrated as a layered cake.)

You can stop at any layer! Layer 1 alone (just the base OS files) is a valid container - yes, you can have a "container of an OS", but remember it's not a full OS, just the user space files without a kernel. Each additional layer adds something specific you need.

After you finish building these layers, you can save the complete stack as a template, this template is called an image. When you run an image, it becomes a running container.


Remember, we don't care about the OS - Windows, Linux, macOS - they all have kernels. If your app needs Linux-specific tools or Windows-specific tools, you can add just those specific components in a layer and continue building. This reduces dependencies dramatically.

The idea is: start from the kernel and build up only what you need. But how exactly does this work?

The Linux Magic: cgroups and namespaces

Containers utilize Linux kernel features, specifically cgroups and namespaces.

cgroups (control groups): These control how much CPU, memory, and disk a process can use.

Example:

  • Process A: Use maximum 2 CPU cores and 4GB RAM
  • Process B: Use maximum 1 CPU core and 2GB RAM
  • cgroups ensure Process A can't steal resources from Process B

namespaces: These manage process isolation and hierarchy; they make processes think they're alone on the system.

Example: Process tree isolation

Host System:
├── Process 1 (PID 1)
├── Process 2 (PID 2)
└── Process 3 (PID 3)

Inside Container (namespace):
└── Process 1 (thinks it's PID 1, but it's actually PID 453 on host)
    └── Process 2 (thinks it's PID 2, but it's actually PID 454 on host)

The container's processes think they're the only processes on the system, completely unaware of other containers or host processes.

Containers = cgroups + namespaces + layers

If you think about it, cgroups + namespaces = container isolation. You start with one process, isolated in its own namespace with resource limits from cgroups. From that process, you install specific libraries, then Python, then pip install your dependencies, and each step is a layer.


You can even use the same idea as Unix signals to control containers: send SIGTERM to stop the main process and, by extension, stop the entire container.

Because namespaces and cgroups are built into the Linux kernel, we only need the kernel, nothing else! No full operating system required.

The Tool: Docker

There are many technologies that achieve containerization (rkt, Podman, containerd), but the most famous one is made by Docker Inc. The software? They called it "Docker."

Yeah, super creative naming there, folks. :)


If you install Docker on Windows, you are actually installing Docker Desktop, which creates a lightweight virtual machine behind the scenes. Inside that VM, Docker runs a Linux environment, and your Linux containers run there.

If you want to run Windows containers, Docker Desktop can switch to Windows container mode, but those require the Windows kernel and cannot run inside the Linux VM.

Same for macOS.

If you install Docker on Linux, there is no virtual machine involved. You simply get the tools to create and run containers directly.

Install Docker

For Windows or macOS, see: Overview of Docker Desktop.

If you are on Ubuntu, run these commands:

curl -fsSL https://get.docker.com -o get-docker.sh

Then preview what the script will do (--dry-run prints the steps without installing anything), and run it again without the flag to actually install:

sudo sh ./get-docker.sh --dry-run
sudo sh ./get-docker.sh

Then verify the installation:

sudo docker info

If writing sudo every time is annoying, you need to add yourself (your username) to the docker group and then restart your machine.

Run the following, replacing mahmoudxyz with your username:

sudo usermod -aG docker mahmoudxyz

After you restart your PC, you will not need to use sudo again before docker.

Basic Docker Commands

Let's start with a simple command:

docker run -it python

This command creates and starts a container (a shortcut for docker create + docker start). The -i flag keeps STDIN open (interactive), and -t allocates a terminal (TTY).

Another useful thing about docker run is that if you don’t have the image locally, Docker will automatically pull it from Docker Hub.

The output of this command shows some downloads and other logs, but the most important part is something like:

Digest: sha256:[text here]

This string can also serve as your image ID.

After the download finishes, Docker will directly open the Python interactive mode:

(Screenshot: the Python interactive prompt.)

You can write Python code here, but if you exit Python, the entire container stops. This illustrates an important concept: a container is designed to run a single process. Once that process ends, the container itself ends.

| Command | Description | Example |
|---------|-------------|---------|
| docker pull | Downloads an image from Docker Hub (or another registry) | docker pull fedora |
| docker create | Creates a container from an image without starting it | docker create fedora |
| docker run | Creates and starts a container (shortcut for create + start) | docker run fedora |
| docker ps | Lists running containers | docker ps |
| docker ps -a | Lists all containers (stopped + running) | docker ps -a |
| docker images | Shows all downloaded images | docker images |

Useful Flags

| Flag | Meaning | Example |
|------|---------|---------|
| -i | Keep STDIN open (interactive) | docker run -i fedora |
| -t | Allocate a TTY (terminal) | docker run -t fedora |
| -it | Interactive + TTY → lets you use the container shell | docker run -it fedora bash |
| ls (after the image) | Not a flag: a Linux command passed to run inside the container | docker run -it ubuntu ls |

To remove a container, use:

docker rm <container_id_or_name>

You can only remove stopped containers. If a container is running, you need to stop it first with:

docker stop <container_id_or_name>

Port Forwarding

When you run a container that exposes a service (like a web server), you often want to access it from your host machine. Docker allows this using the -p flag:

docker run -p <host_port>:<container_port> <image>

Example:

docker run -p 8080:80 nginx
  1. 8080 → the port on your host machine
  2. 80 → the port inside the container that Nginx listens on

Now, you can open your browser and visit: http://localhost:8080 …and you’ll see the Nginx welcome page.

Docker Networks (in a nutshell)

Docker containers are isolated by default. Each container has its own network stack and cannot automatically see or communicate with other containers unless you connect them.

A Docker network allows containers to:

  • Communicate with each other using container names instead of IPs.
  • Avoid port conflicts and isolate traffic from the host or other containers.
  • Use DNS resolution inside the network (so container1 can reach container2 by name).

Default Networks

Docker automatically creates a few networks:

  1. bridge → the default network for standalone containers.
  2. host → containers share the host’s network.
  3. none → containers have no network.

If you want multiple containers (e.g., Jupyter + database) to talk to each other safely and easily, it’s best to create a custom network like bdb-net.

Example:

docker network create bdb-net

Jupyter Docker

Jupyter Notebook can easily run inside a Docker container, which helps avoid installing Python and packages locally.

Don't forget to create the network first:

docker network create bdb-net
docker run -d --rm --name my_jupyter \
  --mount src=bdb_data,dst=/home/jovyan \
  -p 127.0.0.1:8888:8888 \
  --network bdb-net \
  -e JUPYTER_ENABLE_LAB=yes \
  -e JUPYTER_TOKEN="bdb_password" \
  --user root -e CHOWN_HOME=yes -e CHOWN_HOME_OPTS="-R" \
  jupyter/datascience-notebook

Flags and options:

| Option | Meaning |
|--------|---------|
| -d | Run container in detached mode (in the background) |
| --rm | Automatically remove container when it stops |
| --name my_jupyter | Assign a custom name to the container |
| --mount src=bdb_data,dst=/home/jovyan | Mount local volume bdb_data to /home/jovyan inside container |
| -p 127.0.0.1:8888:8888 | Forward host localhost port 8888 to container port 8888 |
| --network bdb-net | Connect container to Docker network bdb-net |
| -e JUPYTER_ENABLE_LAB=yes | Start Jupyter Lab instead of classic Notebook |
| -e JUPYTER_TOKEN="bdb_password" | Set a token/password for access |
| --user root | Run container as root user (needed for certain permissions) |
| -e CHOWN_HOME=yes -e CHOWN_HOME_OPTS="-R" | Change ownership of home directory to user inside container |
| jupyter/datascience-notebook | The Docker image containing Python, Jupyter, and data science packages |

After running this, access Jupyter Lab at: http://127.0.0.1:8888. Use the token bdb_password to log in.

Topics (coming soon)

Docker engine architecture, docker image deep dives, container deep dives, Network

Pandas: Complete Notes

Setup

Every Pandas script starts with:

import pandas as pd
import numpy as np

pd and np are conventions. Everyone uses them.


Part 1: Series

A Series is a 1-dimensional array with labels (called an index).

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

Output:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

The left column (0, 1, 2...) is the index. The right column is the values.

Accessing Index and Values Separately

print(s.index)    # RangeIndex(start=0, stop=6, step=1)
print(s.values)   # array([ 1.,  3.,  5., nan,  6.,  8.])

Iterating Over a Series

# Just values
for i in s.values:
    print(i)

# Both index and values
for i, v in s.items():
    print(i, v)

Slicing a Series

Works like Python lists:

print(s[1:3])  # Elements at index 1 and 2 (3 is excluded)

Check for NaN

np.isnan(s[3])  # True
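pandas also has its own missing-value checks, which additionally handle None and work on whole Series at once:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(pd.isna(s[3]))   # True
print(s.isna().sum())  # 1 -> number of missing values in the Series
```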

Custom Index

The index doesn't have to be integers:

students = pd.Series(
    [28, 15, 30, 24, 10, 19], 
    index=['Lorenzo', 'Alessandra', 'Sofia', 'Giovanni', 'Matteo', 'Chiara']
)
print(students)

Output:

Lorenzo       28
Alessandra    15
Sofia         30
Giovanni      24
Matteo        10
Chiara        19
dtype: int64

Now you access by name:

print(students['Sofia'])  # 30

Filtering a Series

print(students[students >= 18])

Output:

Lorenzo     28
Sofia       30
Giovanni    24
Chiara      19
dtype: int64

Creating Series from a Dictionary

capitals = pd.Series({
    'Italy': 'Rome',
    'Germany': 'Berlin',
    'France': 'Paris',
    'Spain': 'Madrid',
    'Portugal': 'Lisbon'
})

Unlike regular Python dictionaries, Series support slicing:

print(capitals['France':'Portugal'])

Output:

France      Paris
Spain       Madrid
Portugal    Lisbon
dtype: object
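One subtlety worth remembering: label-based slices like the one above include the endpoint, while positional (integer) slices exclude it, just like Python lists:

```python
import pandas as pd

capitals = pd.Series({
    'Italy': 'Rome', 'Germany': 'Berlin', 'France': 'Paris',
    'Spain': 'Madrid', 'Portugal': 'Lisbon'
})

print(len(capitals['France':'Portugal']))  # 3 -> the endpoint 'Portugal' is included
print(len(capitals[2:4]))                  # 2 -> positional slicing excludes the endpoint
```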

Convert Series to List

capitals.to_list()  # ['Rome', 'Berlin', 'Paris', 'Madrid', 'Lisbon']

Part 2: DataFrames

A DataFrame is a 2-dimensional table. Each column is a Series.

Creating a DataFrame from Series

# First, create two Series with the same index
capitals = pd.Series({
    'Italy': 'Rome',
    'Germany': 'Berlin',
    'France': 'Paris',
    'Spain': 'Madrid',
    'Portugal': 'Lisbon'
})

population = pd.Series({
    'Italy': 58_800_000,
    'Spain': 48_400_000,
    'Germany': 84_400_000,
    'Portugal': 10_400_000,
    'France': 68_200_000
})

# Combine into DataFrame
countries = pd.DataFrame({'capitals': capitals, 'population': population})
print(countries)

Output:

          capitals  population
Italy         Rome    58800000
Germany     Berlin    84400000
France       Paris    68200000
Spain       Madrid    48400000
Portugal    Lisbon    10400000

Creating a DataFrame from a Dictionary

df = pd.DataFrame({
    'country': ['France', 'Germany', 'Italy', 'Portugal', 'Spain'],
    'capital': ['Paris', 'Berlin', 'Rome', 'Lisbon', 'Madrid'],
    'population': [68_200_000, 84_400_000, 58_800_000, 10_400_000, 48_400_000]
})

This creates an automatic numeric index (0, 1, 2...).

DataFrame Properties

print(countries.index)    # Index(['Italy', 'Germany', 'France', 'Spain', 'Portugal'])
print(countries.columns)  # Index(['capitals', 'population'])
print(countries.shape)    # (5, 2) → 5 rows, 2 columns
print(countries.size)     # 10 → total elements

Accessing Columns

Two ways:

# Bracket notation
countries['population']

# Dot notation
countries.population

Both return a Series.

Accessing Multiple Columns

Use a list inside brackets:

countries[['capitals', 'population']]  # Returns a DataFrame
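The bracket type determines the return type: a single label gives a Series, a list of labels gives a DataFrame (even a single-element list). A quick check:

```python
import pandas as pd

countries = pd.DataFrame({
    'capitals': {'Italy': 'Rome', 'France': 'Paris'},
    'population': {'Italy': 58_800_000, 'France': 68_200_000}
})

print(type(countries['capitals']).__name__)    # Series
print(type(countries[['capitals']]).__name__)  # DataFrame
```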

Slicing Rows

countries['Italy':]      # From Italy to the end
countries[0:3]           # Rows 0, 1, 2

Filtering

countries[countries.population > 60_000_000]

Iterating Over a Single Column

for cap in countries['capitals']:
    print(cap)

Convert DataFrame to Dictionary

countries.to_dict()

Part 3: Reading Files

CSV Files

df = pd.read_csv('covid19-sample.csv')

Excel Files

First install openpyxl (only once):

!pip install openpyxl

Then:

df = pd.read_excel('covid19-sample.xlsx')

Important: Excel reading is 500-1000x slower than CSV. Use CSV when possible.

Reading from a URL

df = pd.read_csv('https://github.com/dsalomoni/bdb-2024/raw/main/covid/covid19-sample.csv')

Reading Only Specific Columns

my_columns = ['country', 'weekly_count', 'year_week']
df = pd.read_csv('covid19-sample.csv', usecols=my_columns)

Part 4: Inspecting Data

First/Last Rows

df.head()      # First 5 rows
df.head(10)    # First 10 rows
df.tail(3)     # Last 3 rows

Shape and Size

df.shape  # (rows, columns) tuple
df.size   # Total elements = rows × columns

Column Names

df.columns

Unique Values in a Column

df['indicator'].unique()    # Array of unique values
df['indicator'].nunique()   # Count of unique values

Part 5: Selecting and Slicing

By Row Number

df[3500:3504]  # Rows 3500, 3501, 3502, 3503
df[777:778]    # Just row 777

Specific Column from a Slice

df[777:778]['year_week']
# or
df[777:778].year_week

Multiple Columns

df.head()[['country', 'year_week']]

Using loc[]

Access rows by index label or by condition:

# By label
df.loc[19828]

# By condition
df.loc[df.weekly_count > 4500]

Part 6: Filtering with Conditions

Direct Filtering

df[df['grade'] > 27]

Multiple Conditions

# AND - use &
df[(df['grade'] > 27) & (df['age'] < 30)]

# OR - use |
df[(df['grade'] > 29) | (df['age'] > 30)]

Important: Wrap each condition in parentheses.
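The parentheses matter because & and | bind more tightly than comparison operators in Python; without them, Python tries to evaluate something like 27 & df['age'] first and the expression fails. A minimal check with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'grade': [30, 25, 28], 'age': [22, 35, 29]})

# Correct: each condition wrapped in parentheses
passed = df[(df['grade'] > 27) & (df['age'] < 30)]
print(len(passed))  # 2 -> the rows with grades 30 and 28

# Without parentheses, Python parses 27 & df['age'] first
# (& binds tighter than >), which raises an error on a Series.
```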

Using query() — The Better Way

df.query('country=="Italy" and indicator=="cases"')

With variables:

start_week = '2020-10'
end_week = '2021-48'
df.query('year_week >= @start_week and year_week <= @end_week')

Or using string formatting:

df.query('country=="Italy" and indicator=="cases" and year_week>="%s" and year_week<="%s"' % (start_week, end_week))

Part 7: iterrows() vs query()

The Slow Way: iterrows()

it_cases = dict()
for index, row in df.iterrows():
    if row['country'] == 'Italy':
        if row['indicator'] == 'cases':
            week = row['year_week']
            if (week >= start_week) and (week <= end_week):
                it_cases[week] = row['weekly_count']

df2 = pd.DataFrame(list(it_cases.items()), columns=['week', 'cases'])

Time: ~1.52 seconds for 41,000 rows

The Fast Way: query()

df3 = df.query('country=="Italy" and indicator=="cases" and year_week>="%s" and year_week<="%s"' % (start_week, end_week))

Time: ~0.01 seconds

query() is about 150x faster than iterrows().


Part 8: Sorting

Sort a Series

series.sort_values()                    # Ascending
series.sort_values(ascending=False)     # Descending

Sort a DataFrame

df.sort_values(by='quantity')                     # By one column
df.sort_values(by='quantity', ascending=False)    # Descending
df.sort_values(by=['column1', 'column2'])         # By multiple columns

Sort a Dictionary with Pandas

x = {'apple': 5, 'banana': 2, 'orange': 8, 'grape': 1}
series_x = pd.Series(x)
sorted_x = series_x.sort_values().to_dict()
# {'grape': 1, 'banana': 2, 'apple': 5, 'orange': 8}

Part 9: Common Functions

sum()

df['weekly_count'].sum()

describe()

Generates statistics for numerical columns:

df.describe()

Output includes: count, mean, std, min, 25%, 50%, 75%, max

nunique() and unique()

df['country'].nunique()  # Number of unique values
df['country'].unique()   # Array of unique values

mean() and median()

df['salary'].mean()    # Average
df['salary'].median()  # Middle value

When to use which:

  • Mean: When data is symmetrically distributed, no outliers
  • Median: When data has outliers or is skewed

Example:

Blood pressure readings: 142, 124, 121, 150, 215

Mean = (142+124+121+150+215)/5 = 150.4
Median = 142 (middle value when sorted: 121, 124, 142, 150, 215)

The 215 outlier pulls the mean up but doesn't affect the median.
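You can verify this with Python's built-in statistics module (pandas' .mean() and .median() give the same numbers):

```python
import statistics

readings = [142, 124, 121, 150, 215]

print(statistics.mean(readings))    # 150.4 -> pulled up by the 215 outlier
print(statistics.median(readings))  # 142  -> unaffected by the outlier
```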


Part 10: groupby()

Split data into groups, then apply a function.

Basic groupby

df_grouped = df.groupby('continent')

This returns a DataFrameGroupBy object. By itself, not useful. You need to apply a function:

df.groupby('continent').sum()
df.groupby('continent')['weekly_count'].mean()
df.groupby('continent')['weekly_count'].count()

Multiple Statistics with agg()

df.groupby('Agency')['Salary Range From'].agg(['mean', 'median'])

Group by Multiple Columns

df.groupby(['Agency', 'Posting Type'])['Salary Range From'].mean()

To prevent the grouped columns from becoming the index:

df.groupby(['Agency', 'Posting Type'], as_index=False)['Salary Range From'].mean()

Accessing Groups

grouped = df.groupby('continent')

# What groups exist?
grouped.groups.keys()

# Get one specific group
grouped.get_group('Oceania')

# How many unique countries in Oceania?
grouped.get_group('Oceania')['country'].nunique()

# Which countries?
grouped.get_group('Oceania')['country'].unique()

Sorting groupby Results

df.groupby('Agency')['# Of Positions'].count().sort_values(ascending=False).head(10)

Part 11: cut() — Binning Data

Convert continuous values into categories.

Basic Usage

df = pd.DataFrame({'age': [25, 30, 35, 40, 45, 50, 55, 60, 65]})
bins = [20, 40, 60, 80]

df['age_group'] = pd.cut(df['age'], bins)

Result:

   age   age_group
0   25    (20, 40]
1   30    (20, 40]
2   35    (20, 40]
3   40    (20, 40]
4   45    (40, 60]
...

With Labels

df['age_group'] = pd.cut(df['age'], bins, labels=['young', 'middle', 'old'])

Automatic Bins

pd.cut(df['Salary Range From'], bins=3, labels=['low', 'middle', 'high'])

Pandas automatically calculates the bin ranges.

Combining cut() with groupby()

# Add salary category column
jobs['salary_bin'] = pd.cut(jobs['Salary Range From'], bins=3, labels=['low', 'middle', 'high'])

# Now group by it
jobs.groupby('salary_bin')['Salary Range From'].count()

Part 12: Data Cleaning

Removing Duplicates

# Remove duplicate rows (all columns must match)
df.drop_duplicates(inplace=True)

# Remove duplicates based on specific column
df.drop_duplicates(subset=['B'], inplace=True)

Handling Missing Values (NaN)

Option 1: Fill with a value

# Fill with mean (assign back; chained inplace fillna is deprecated in modern pandas)
df['A'] = df['A'].fillna(df['A'].mean())

# Fill with median
df['B'] = df['B'].fillna(df['B'].median())

Option 2: Drop rows with NaN

df.dropna()                     # Drop any row with NaN
df.dropna(subset=['grade'])     # Only if specific column is NaN

Part 13: Data Scaling

When columns have very different scales (e.g., age: 20-60, salary: 50000-200000), analysis and visualization become difficult.

Standardization (StandardScaler)

Transforms data to have mean = 0 and standard deviation = 1.

from sklearn.preprocessing import StandardScaler

df_unscaled = pd.DataFrame({'A': [1, 3, 2, 2, 1], 'B': [65, 130, 80, 70, 50]})

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_unscaled)
df_scaled = pd.DataFrame(df_scaled, columns=df_unscaled.columns)

When to use: Data follows a Gaussian (bell-shaped) distribution.

Normalization (MinMaxScaler)

Transforms data to range [0, 1].

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df_unscaled)
df_scaled = pd.DataFrame(df_scaled, columns=df_unscaled.columns)

When to use: Distribution is unknown or not Gaussian.

Warning: Normalization is more sensitive to outliers than standardization.
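Both transforms are simple enough to do by hand in pandas, which makes clear what the scalers actually compute. One caveat: StandardScaler uses the population standard deviation (ddof=0), while pandas' .std() defaults to the sample version (ddof=1), so we pass ddof=0 explicitly:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 2, 2, 1], 'B': [65, 130, 80, 70, 50]})

# Standardization: (x - mean) / std, population std to match StandardScaler
standardized = (df - df.mean()) / df.std(ddof=0)

# Min-max normalization: (x - min) / (max - min), matching MinMaxScaler
normalized = (df - df.min()) / (df.max() - df.min())

print(normalized['A'].tolist())  # [0.0, 1.0, 0.5, 0.5, 0.0]
# standardized columns now have mean 0 and standard deviation 1
```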


Part 14: Plotting

Basic Syntax

df.plot(x='column_x', y='column_y', kind='line')

Plot Types

| kind= | Plot Type |
|-------|-----------|
| 'line' | Line plot |
| 'bar' | Bar chart |
| 'barh' | Horizontal bar |
| 'pie' | Pie chart |
| 'scatter' | Scatter plot |
| 'hist' | Histogram |

Examples

# Bar plot
df.plot(x='name', y='age', kind='bar', title='Ages')

# Line plot
df.plot(x='month', y='sales', kind='line', title='Monthly Sales')

# With more options
df.plot(kind='bar', ylabel='Total cases', title='COVID-19', grid=True, logy=True)

Plotting Two DataFrames Together

# Get axis from first plot
ax = df1.plot(kind='line', x='Month', title='Comparison')

# Add second plot to same axis
df2.plot(ax=ax, kind='line')

ax.set_xlabel('Month')
ax.set_ylabel('Sales')
ax.legend(['Vendor A', 'Vendor B'])

Part 15: Exporting Data

To CSV

df.to_csv('output.csv', index=False)

To Excel

df.to_excel('output.xlsx', index=False)

index=False prevents writing the row numbers as a column.


Part 16: Statistics Refresher

Variance and Standard Deviation

Variance (σ²): Average of squared differences from the mean.

$$\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$

Standard Deviation (σ): Square root of variance.

$$\sigma = \sqrt{\sigma^2}$$

Why use standard deviation instead of variance?

  • Variance has squared units (meters² if data is in meters)
  • Standard deviation has the same units as the original data
  • Standard deviation is more interpretable
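
A quick numeric check of both formulas (values picked so the arithmetic comes out clean):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean = 5

# Population variance (ddof=0), matching the formula above
variance = np.var(data)   # 32 / 8 = 4.0 (squared units)
std_dev = np.std(data)    # sqrt(4.0) = 2.0 (same units as the data)

# Note: pandas Series.std() defaults to the sample version (ddof=1),
# while NumPy defaults to the population version (ddof=0)
print(variance, std_dev)
```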

Gaussian Distribution

A bell-shaped curve where:

  • Mean, median, and mode are equal (at the center)
  • ~68% of data falls within 1 standard deviation of the mean
  • ~95% within 2 standard deviations
  • ~99.7% within 3 standard deviations
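
You can sanity-check the 68-95-99.7 rule yourself by sampling from a normal distribution (seed chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0, scale=1, size=100_000)

# Fraction of samples within 1, 2, and 3 standard deviations of the mean
within_1 = np.mean(np.abs(x) < 1)
within_2 = np.mean(np.abs(x) < 2)
within_3 = np.mean(np.abs(x) < 3)
print(within_1, within_2, within_3)  # roughly 0.68, 0.95, 0.997
```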

Quick Reference

Reading

pd.read_csv('file.csv')
pd.read_csv('file.csv', usecols=['col1', 'col2'])
pd.read_excel('file.xlsx')

Inspecting

df.head(), df.tail()
df.shape, df.size, df.columns
df.describe()
df['col'].unique(), df['col'].nunique()

Selecting

df['col']                    # Single column (Series)
df[['col1', 'col2']]         # Multiple columns (DataFrame)
df[0:5]                      # Rows 0-4
df.loc[df['col'] > x]        # By condition
df.query('col > x')          # By condition (faster)

Cleaning

df.dropna()
df.fillna(value)
df.drop_duplicates()

Aggregating

df['col'].sum(), .mean(), .median(), .std(), .count()
df.groupby('col')['val'].mean()
df.groupby('col')['val'].agg(['mean', 'median', 'count'])

Sorting

df.sort_values(by='col')
df.sort_values(by='col', ascending=False)

Exporting

df.to_csv('out.csv', index=False)
df.to_excel('out.xlsx', index=False)

Performance Summary

| Operation     | Speed                     |
| ------------- | ------------------------- |
| read_csv()    | Fast                      |
| read_excel()  | 500-1000x slower          |
| query()       | Fast                      |
| df[condition] | Fast                      |
| iterrows()    | ~150x slower than query() |

Rule: Avoid iterrows() on large datasets. Use query() or boolean indexing instead.
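
To make the rule concrete, here's a small sketch showing that boolean indexing and query() give the same answer as an iterrows() loop, without any Python-level looping:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bo', 'Cy', 'Di'],
                   'age':  [25, 40, 31, 58]})

# Slow: a Python-level loop over every row
slow = [row['name'] for _, row in df.iterrows() if row['age'] > 30]

# Fast: vectorized boolean indexing, or query(), same result
fast = df.loc[df['age'] > 30, 'name'].tolist()
fast_q = df.query('age > 30')['name'].tolist()

print(slow, fast, fast_q)  # all three: ['Bo', 'Cy', 'Di']
```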

Introduction to Databases

📖
Definition

A database (DB) is an organized collection of structured data stored electronically in a computer system, managed by a Database Management System (DBMS).

Let's Invent a Database

Alright, so imagine you're building a movie collection app with Python. At first, you might think "I'll just use files!"

You create a file for each movie - titanic.txt, inception.txt, and so on. Inside each file, you write the title, director, year, rating. Simple enough!

But then problems start piling up. You want to find all movies from 2010? Now you're writing Python code to open every single file, read it, parse it, check the year. Slow and messy.

Your friend wants to update a movie's rating while you're reading it? Boom! File corruption or lost data because two programs can't safely write to the same file simultaneously.

You want to find all movies directed by Nolan AND released after 2010? Now your Python script is getting complex, looping through thousands of files, filtering multiple conditions.

What if the power goes out mid-write? Half-updated file, corrupted data.

This is where you start thinking, "there has to be a better way!" What if instead of scattered files, we had one organized system that could handle all this? A system designed from the ground up for concurrent access, fast searching, data integrity, and complex queries. That's the core idea behind what we'd call a database.

Database Management System

So you've realized you need a better system. Enter the DBMS, the Database Management System.

Instead of your Python code directly wrestling with files, the DBMS handles all the heavy lifting: managing storage, handling concurrent users, ensuring data doesn't get corrupted, and executing queries efficiently.

But here's the key question: how should we actually structure this data?

This is where the data model comes in. It's your blueprint for organizing information. For movies, you might think: "Every movie has attributes: title, director, year, rating." That's relational-model thinking: data organized in tables with rows and columns, like a spreadsheet but much more powerful.

Relational Model - Tables:

| movie_id | title        | director | year | rating |
| -------- | ------------ | -------- | ---- | ------ |
| 1        | Inception    | Nolan    | 2010 | 8.8    |
| 2        | Titanic      | Cameron  | 1997 | 7.9    |
| 3        | Interstellar | Nolan    | 2014 | 8.7    |

Or maybe you think: "Movies are connected, directors make movies, actors star in them, movies belong to genres." That's more of a graph model, focusing on relationships between entities.

Graph Model - Nodes and Relationships:

(Movie: Inception)
       |
       |--[DIRECTED_BY]--> (Director: Nolan)
       |
       |--[RELEASED_IN]--> (Year: 2010)
       |
       |--[HAS_RATING]--> (Rating: 8.8)

(Movie: Interstellar)
       |
       |--[DIRECTED_BY]--> (Director: Nolan)
       |
       |--[RELEASED_IN]--> (Year: 2014)

The data model you choose shapes everything: how you store data, how you query it, how it performs. It's the fundamental architectural decision that defines your database.

What Is a Schema?

The schema is the blueprint (like a class in Java or Python) or structure of your database. It defines what can be stored and how it's organized, but not the actual data itself.

For our movie table, the schema would be:

Movies (
  movie_id: INTEGER,
  title: TEXT,
  director: TEXT,
  year: INTEGER,
  rating: FLOAT
)

It specifies the table name, column names, and data types. It's like the architectural plan of a building: it shows the rooms and layout, but the furniture (the actual data) comes later.

The schema enforces rules: you can't suddenly add a movie with a text value in the year field, or store a rating as a string. It keeps your data consistent and predictable.

Data Models

These are just examples to be aware of. We'll study only a few of them, so it's okay if some sound complex; they aren't.

Relational (SQL)

  • Examples: PostgreSQL, MySQL, SQLite
  • Use case: transactions. Need ACID guarantees, complex joins between related data.

Key-Value

  • Examples: Redis, Memcached
  • Use case: Session storage, user login tokens. Lightning-fast lookups by key, simple get/set operations.

Document/JSON (NoSQL)

  • Examples: MongoDB, CouchDB
  • Use case: Blog platform, each post is a JSON document with nested comments, tags, metadata. Flexible schema, easy to evolve.

Wide Column / Column Family

  • Examples: Cassandra, HBase
  • Use case: Time-series data like IoT sensors. Billions of writes per day, queried by device_id and timestamp range.

Array/Matrix/Vector

  • Examples: PostgreSQL with pgvector, Pinecone, Weaviate
  • Use case: AI embeddings for semantic search - store vectors representing documents, find similar items by vector distance.

Legacy Models:

  • Hierarchical
  • Network
  • Semantic
  • Entity-Relationship

The CAP Theorem

So you're building a distributed system. Maybe you've got servers in New York, London, and Tokyo because you want to be fancy and global. Everything's going great until someone asks you a simple question: "What happens when the network breaks?"

Welcome to the CAP theorem, where you learn that you can't have your cake, eat it too, and share it perfectly across three continents simultaneously.

The Three Musketeers (But Only Two Can Fight at Once)

CAP stands for Consistency, Availability, and Partition Tolerance. The theorem, courtesy of Eric Brewer in 2000, says you can only pick two out of three. It's like a cruel database version of "choose your fighter."

Consistency (C): Every node in your distributed system sees the same data at the same time. You read from Tokyo, you read from New York - same answer, guaranteed.

Availability (A): Every request gets a response, even if some nodes are down. The system never says "sorry, come back later."

Partition Tolerance (P): The system keeps working even when network connections between nodes fail. Because networks will fail - it's not if, it's when.

⚠️
Mind-Bender Alert

The "C" in CAP is NOT the same as the "C" in ACID! ACID consistency means your data follows all the rules (constraints, foreign keys). CAP consistency means all nodes agree on what the data is right now. Totally different beasts.

Why P Isn't Really Optional (Spoiler: Physics)

Here's the dirty secret: Partition Tolerance isn't actually optional in distributed systems. Network failures happen. Cables get cut. Routers die. Someone trips over the ethernet cord. Cosmic rays flip bits (yes, really).

If you're distributed across multiple machines, partitions will occur. So the real choice isn't CAP - it's really CP vs AP. You're choosing between Consistency and Availability when the network inevitably goes haywire.

ℹ️
The Single Machine Exception

If your "distributed system" is actually just one machine, congratulations! You can have CA because there's no network to partition. But then you're not really distributed, are you? This is why traditional RDBMS like PostgreSQL on a single server can give you strong consistency AND high availability.

CP: Consistency Over Availability

The Choice: "I'd rather return an error than return wrong data."

When a network partition happens, CP systems refuse to respond until they can guarantee you're getting consistent data. They basically say "I'm not going to lie to you, so I'm just going to shut up until I know the truth."

Examples: MongoDB (in default config), HBase, Redis (in certain modes), traditional SQL databases with synchronous replication.

When to choose CP:

  • Banking and financial systems - you CANNOT have Bob's account showing different balances on different servers
  • Inventory systems - overselling products because two datacenters disagree is bad for business
  • Configuration management - if half your servers think feature X is on and half think it's off, chaos ensues
  • Anything where stale data causes real problems, and it's better to show an error than a lie
💻
Real World Example

Your bank's ATM won't let you withdraw money during a network partition because it can't verify your balance with the main server. Annoying? Yes. Better than letting you overdraw? Absolutely.

AP: Availability Over Consistency

The Choice: "I'd rather give you an answer (even if it might be stale) than no answer at all."

AP systems keep responding even during network partitions. They might give you slightly outdated data, but hey, at least they're talking to you! They eventually sync up when the network heals - this is called "eventual consistency."

Examples: Cassandra, DynamoDB, Riak, CouchDB, DNS (yes, the internet's phone book).

When to choose AP:

  • Social media - if you see a slightly stale like count during a network issue, the world doesn't end
  • Shopping cart systems - better to let users add items even if inventory count is slightly off, sort it out later
  • Analytics dashboards - last hour's metrics are better than no metrics
  • Caching layers - stale cache beats no cache
  • Anything where availability matters more than perfect accuracy
💻
Real World Example

Twitter/X during high traffic: you might see different follower counts on different servers for a few seconds. But the tweets keep flowing, the system stays up, and eventually everything syncs. For a social platform, staying online beats perfect consistency.

The "It Depends"

Here's where it gets interesting: modern systems often aren't pure CP or AP. They let you tune the trade-off!

Cassandra has a "consistency level" setting. Want CP behavior? Set it to QUORUM. Want AP? Set it to ONE. You're literally sliding the dial between consistency and availability based on what each query needs.

💡
Pro Architecture Move

Different parts of your system can make different choices! Use CP for critical financial data, AP for user preferences and UI state. This is called "polyglot persistence" and it's how the big players actually do it.

The Plot Twist: PACELC

Just when you thought you understood CAP, along comes PACELC to ruin your day. It says: even when there's NO partition (normal operation), you still have to choose between Latency and Consistency.

Want every read to be perfectly consistent? You'll pay for it in latency because nodes have to coordinate. Want fast responses? Accept that reads might be slightly stale.

But that's a story for another day...

📝
Remember

CAP isn't about right or wrong. It's about understanding trade-offs and making conscious choices based on your actual needs. The worst decision is not knowing you're making one at all.

TL;DR

You can't have perfect consistency, perfect availability, AND handle network partitions. Since partitions are inevitable in distributed systems, you're really choosing between CP (consistent but might go down) or AP (always available but might be stale).

Choose CP when wrong data is worse than no data. Choose AP when no data is worse than slightly outdated data.

Now go forth and distribute responsibly!

SQLite

What is a Relational Database?

A relational database organizes data into tables. Each table has:

  • Rows (also called records) — individual entries
  • Columns (also called fields) — attributes of each entry

Tables can be linked together through common fields. This is the "relational" part.


Part 1: Core Concepts

Schema

The schema is the structure definition of your database:

  • Number of tables
  • Column names and data types
  • Constraints (what's allowed)

Critical: You must define the schema BEFORE you can store any data. This is a fixed structure — not flexible like a spreadsheet.

Primary Key

A column that uniquely identifies each row.

Rules:

  • Only ONE primary key per table
  • Values must be unique (no duplicates)
  • Cannot be NULL

Example: Student ID, Order Number, ISBN

Foreign Key

A column that references a primary key in another table.

Rules:

  • Can have multiple foreign keys in one table
  • Values don't need to be unique
  • Creates relationships between tables

Example: student_id in an Enrollments table references id in a Students table.

Why Use Multiple Tables?

Instead of repeating data:

# Bad: Student info repeated for each course
(1, 'Alice', 22, 'alice@unibo.it', 'BDB')
(2, 'Alice', 22, 'alice@unibo.it', 'BDP1')  # Alice duplicated!
(3, 'Bob', 23, 'bob@unibo.it', 'BDB')

Use two linked tables:

Students:
(1, 'Alice', 22, 'alice@unibo.it')
(2, 'Bob', 23, 'bob@unibo.it')

Enrollments:
(1, 1, 'BDB')    # student_id=1 (Alice)
(2, 1, 'BDP1')   # student_id=1 (Alice)
(3, 2, 'BDB')    # student_id=2 (Bob)

Benefits:

  • No data duplication
  • Update student info in one place
  • Smaller storage

Part 2: ACID Properties

A transaction is a unit of work — a set of operations that must either all succeed or all fail.

ACID guarantees for transactions:

| Property   | Meaning                                                |
| ---------- | ------------------------------------------------------ |
| Atomic     | All operations complete, or none do                    |
| Consistent | Database goes from one valid state to another          |
| Isolated   | Transactions don't interfere with each other           |
| Durable    | Once committed, changes survive crashes/power failures |

Relational databases provide ACID compliance. This is why banks use them.


Part 3: SQLite

What is SQLite?

SQLite is a relational database that lives in a single file. No server needed.

It's everywhere:

  • Every Android and iOS device
  • Windows 10/11, macOS
  • Firefox, Chrome, Safari
  • Estimated 1 trillion+ SQLite databases in active use

SQLite vs Traditional Databases

| Traditional (PostgreSQL, MySQL) | SQLite             |
| ------------------------------- | ------------------ |
| Separate server process         | No server          |
| Client connects via network     | Direct file access |
| Multiple files                  | Single file        |
| Complex setup                   | Zero configuration |

Python Support

Python has built-in SQLite support. No installation needed:

import sqlite3

Part 4: Connecting to SQLite

Basic Connection

import sqlite3 as sql

# Connect to database (creates file if it doesn't exist)
conn = sql.connect('my_database.db')

# Get a cursor (your pointer into the database)
cur = conn.cursor()

After this, you'll have a file called my_database.db in your directory.

In-Memory Database

For testing or temporary work:

conn = sql.connect(':memory:')

Fast, but everything is lost when you close Python.

What's a Cursor?

The cursor is how you execute commands and retrieve results. Think of it as your interface to the database.

cur.execute('SQL command here')

Part 5: Creating Tables

CREATE TABLE Syntax

cur.execute('''
    CREATE TABLE Students (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        age INTEGER,
        email TEXT NOT NULL UNIQUE
    )
''')

Data Types

| Type    | What it stores         |
| ------- | ---------------------- |
| INTEGER | Whole numbers          |
| TEXT    | Strings                |
| REAL    | Floating point numbers |
| BLOB    | Binary data            |

Constraints

| Constraint  | Meaning                        |
| ----------- | ------------------------------ |
| PRIMARY KEY | Unique identifier for each row |
| NOT NULL    | Cannot be empty                |
| UNIQUE      | No duplicate values allowed    |

The "Table Already Exists" Problem

If you run CREATE TABLE twice, you get an error.

Solution: Drop the table first if it exists.

cur.execute('DROP TABLE IF EXISTS Students')

cur.execute('''
    CREATE TABLE Students (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        age INTEGER,
        email TEXT NOT NULL UNIQUE
    )
''')

Creating Tables with Foreign Keys

cur.execute('''DROP TABLE IF EXISTS Students''')
cur.execute('''
    CREATE TABLE Students (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        age INTEGER,
        email TEXT NOT NULL UNIQUE
    )
''')

cur.execute('''DROP TABLE IF EXISTS Student_courses''')
cur.execute('''
    CREATE TABLE Student_courses (
        id INTEGER PRIMARY KEY,
        student_id INTEGER NOT NULL,
        course_id INTEGER,
        course_name TEXT NOT NULL,
        FOREIGN KEY(student_id) REFERENCES Students(id)
    )
''')

conn.commit()
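
One gotcha worth knowing here: SQLite ships with foreign key enforcement turned OFF by default. You have to switch it on per connection with a PRAGMA, otherwise the FOREIGN KEY clause is just documentation. A minimal sketch:

```python
import sqlite3 as sql

conn = sql.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # off by default in SQLite

conn.execute('CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT)')
conn.execute('''
    CREATE TABLE Student_courses (
        id INTEGER PRIMARY KEY,
        student_id INTEGER NOT NULL,
        course_name TEXT NOT NULL,
        FOREIGN KEY(student_id) REFERENCES Students(id)
    )
''')

# Inserting a course for a non-existent student now fails
try:
    conn.execute("INSERT INTO Student_courses VALUES (1, 999, 'BDB')")
    violated = False
except sql.IntegrityError:
    violated = True
print(violated)  # True: the bogus student_id was rejected
conn.close()
```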

Part 6: Inserting Data

Single Row

cur.execute('''
    INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')
''')

What Happens If You Insert a Duplicate Primary Key?

cur.execute('''INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')''')
cur.execute('''INSERT INTO Students VALUES (1, 'John', 'Doe', 21, 'john@doe.com')''')
# ERROR! id=1 already exists

Primary keys must be unique.
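
If you'd rather not get an error on a duplicate key, SQLite also supports conflict clauses on INSERT. A quick sketch of the two common ones:

```python
import sqlite3 as sql

conn = sql.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT)')
cur.execute("INSERT INTO Students VALUES (1, 'John')")

# OR IGNORE: silently skip the row that conflicts
cur.execute("INSERT OR IGNORE INTO Students VALUES (1, 'Johnny')")
after_ignore = cur.execute('SELECT name FROM Students WHERE id = 1').fetchone()
print(after_ignore)   # ('John',)  the duplicate was dropped

# OR REPLACE: delete the old row, then insert the new one
cur.execute("INSERT OR REPLACE INTO Students VALUES (1, 'Johnny')")
after_replace = cur.execute('SELECT name FROM Students WHERE id = 1').fetchone()
print(after_replace)  # ('Johnny',)
conn.close()
```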

Multiple Rows with executemany()

the_students = (
    (1, 'John', 'Doe', 21, 'john@doe.com'),
    (2, 'Alice', 'Doe', 22, 'alice@doe.com'),
    (3, 'Rose', 'Short', 21, 'rose@short.com')
)

cur.executemany('''INSERT INTO Students VALUES(?, ?, ?, ?, ?)''', the_students)

The ? Placeholders

Each ? gets replaced by one value from your tuple.

# 5 columns = 5 question marks
cur.executemany('''INSERT INTO Students VALUES(?, ?, ?, ?, ?)''', the_students)

Why use ? instead of string formatting?

  1. Cleaner code
  2. Prevents SQL injection attacks
  3. Handles escaping automatically
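
Here's a tiny self-contained demo of point 2: hostile input passed through a placeholder stays harmless data instead of becoming SQL.

```python
import sqlite3 as sql

conn = sql.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT)')
cur.execute("INSERT INTO Students VALUES (1, 'Alice')")

user_input = "'; DROP TABLE Students; --"  # hostile "name"

# Safe: the ? placeholder passes the input as a plain value, never as SQL
cur.execute('SELECT * FROM Students WHERE name = ?', (user_input,))
result = cur.fetchall()
print(result)  # [] : no student with that weird name, nothing executed

# Dangerous (never do this): string formatting splices input into the SQL
# cur.execute(f"SELECT * FROM Students WHERE name = '{user_input}'")

# The table is still intact
table_ok = cur.execute('SELECT name FROM Students').fetchall()
print(table_ok)  # [('Alice',)]
conn.close()
```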

Part 7: The Commit Rule

Critical: Changes are NOT saved until you call commit().

cur.execute('INSERT INTO Students VALUES (?, ?, ?, ?, ?)',
            (4, 'Diana', 'Smith', 20, 'diana@smith.com'))

# At this point, the data is only in memory

conn.commit()  # NOW it's written to disk

If you close the connection without committing, all changes since the last commit are lost.

The Complete Pattern

# Make changes
cur.execute('INSERT ...')
cur.execute('UPDATE ...')
cur.execute('DELETE ...')

# Save to disk
conn.commit()

# Close when done
conn.close()

Part 8: Querying Data (SELECT)

Get All Rows

cur.execute('SELECT * FROM Students')
print(cur.fetchall())

Output:

[(1, 'John', 'Doe', 21, 'john@doe.com'),
 (2, 'Alice', 'Doe', 22, 'alice@doe.com'),
 (3, 'Rose', 'Short', 21, 'rose@short.com')]

fetchall() vs fetchone()

fetchall() returns a list of all rows:

cur.execute('SELECT * FROM Students')
all_rows = cur.fetchall()  # List of tuples

fetchone() returns one row at a time:

cur.execute('SELECT * FROM Students')
first = cur.fetchone()   # (1, 'John', 'Doe', 21, 'john@doe.com')
second = cur.fetchone()  # (2, 'Alice', 'Doe', 22, 'alice@doe.com')
third = cur.fetchone()   # (3, 'Rose', 'Short', 21, 'rose@short.com')
fourth = cur.fetchone()  # None (no more rows)

Important: fetchall() Exhausts the Cursor

cur.execute('SELECT * FROM Students')
print(cur.fetchall())  # Returns all rows
print(cur.fetchall())  # Returns [] (empty list!)

Once you've fetched all rows, there's nothing left to fetch. You need to execute the query again.
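
If you don't want the all-or-one choice, the cursor itself is an iterator, so you can also stream rows one at a time in a loop. A small self-contained sketch:

```python
import sqlite3 as sql

conn = sql.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT)')
cur.executemany('INSERT INTO Students VALUES (?, ?)',
                [(1, 'John'), (2, 'Alice'), (3, 'Rose')])

# The cursor is an iterator: rows stream one at a time, no big list in memory
cur.execute('SELECT name FROM Students ORDER BY id')
names = [row[0] for row in cur]
print(names)  # ['John', 'Alice', 'Rose']
conn.close()
```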

Select Specific Columns

cur.execute('SELECT last_name, email FROM Students')
print(cur.fetchall())
# [('Doe', 'john@doe.com'), ('Doe', 'alice@doe.com'), ('Short', 'rose@short.com')]

Filter with WHERE

cur.execute('SELECT * FROM Students WHERE id=3')
print(cur.fetchall())
# [(3, 'Rose', 'Short', 21, 'rose@short.com')]

Pattern Matching with LIKE

# Emails ending with 'doe.com'
cur.execute("SELECT * FROM Students WHERE email LIKE '%doe.com'")
print(cur.fetchall())
# [(1, 'John', 'Doe', 21, 'john@doe.com'), (2, 'Alice', 'Doe', 22, 'alice@doe.com')]

Wildcards:

  • % — any sequence of characters (including none)
  • _ — exactly one character

Examples:

LIKE 'A%'      # Starts with A
LIKE '%e'      # Ends with e
LIKE '%li%'    # Contains 'li'
LIKE '_ohn'    # 4 characters ending in 'ohn' (John, Bohn, etc.)

Note: LIKE is case-insensitive in SQLite.


Part 9: Deleting Data

cur.execute('DELETE FROM Students WHERE id=1')
conn.commit()

Warning: Without WHERE, you delete everything:

cur.execute('DELETE FROM Students')  # Deletes ALL rows!

Part 10: Error Handling

The Proper Pattern

import sqlite3 as sql

try:
    conn = sql.connect('my_database.db')
    cur = conn.cursor()
    print("Connection successful")
    
    # Your database operations here
    cur.execute('SELECT * FROM Students')
    print(cur.fetchall())
    
    cur.close()  # Close cursor to free memory
    
except sql.Error as error:
    print("Error in SQLite:", error)
    
finally:
    conn.close()  # Always close connection, even if error occurred

Why use try/except/finally?

  • Database operations can fail (file locked, disk full, etc.)
  • finally ensures connection is always closed
  • Prevents resource leaks
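
A more Pythonic variant worth knowing: a sqlite3 connection works as a context manager for transactions (commit on success, rollback if the block raises), and contextlib.closing handles the actual close. A minimal sketch:

```python
import sqlite3 as sql
from contextlib import closing

# closing() guarantees conn.close(); `with conn:` commits on success
# and rolls back on an exception, but it does NOT close the connection.
with closing(sql.connect(':memory:')) as conn:
    with conn:
        conn.execute('CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT)')
        conn.execute("INSERT INTO Students VALUES (1, 'Alice')")
    rows = conn.execute('SELECT * FROM Students').fetchall()

print(rows)  # [(1, 'Alice')]
```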

Part 11: Pandas Integration

This is where SQLite becomes really useful for data analysis.

Read SQLite into DataFrame

import pandas as pd
import sqlite3 as sql

conn = sql.connect('gubbio_env_2018.sqlite')
df = pd.read_sql_query('SELECT * FROM gubbio', conn)
conn.close()

Now you have a DataFrame with all the Pandas functionality.

df.head()
df.info()
df.describe()

Filter in SQL vs Filter in Pandas

Option 1: Filter in SQL (better for large databases)

df = pd.read_sql_query('SELECT * FROM gubbio WHERE NO2 > 50', conn)

Only matching rows are loaded into memory.

Option 2: Load all, filter in Pandas

df = pd.read_sql_query('SELECT * FROM gubbio', conn)
df_filtered = df[df['NO2'] > 50]

Loads everything, then filters.

Use SQL filtering when:

  • Database is large
  • You only need a small subset

Use Pandas filtering when:

  • Data fits in memory
  • You need multiple different analyses

Write DataFrame to SQLite

conn = sql.connect('output.sqlite')
df.to_sql('table_name', conn, if_exists='replace')
conn.close()

if_exists options:

  • 'fail' — raise error if table exists (default)
  • 'replace' — drop table and recreate
  • 'append' — add rows to existing table

Part 12: Data Cleaning Example (Gubbio Dataset)

The Dataset

Environmental monitoring data from Gubbio, Italy (2018):

  • Columns: year, month, day, hour, NO2, O3, PM10, PM25
  • Values are in µg/m³
  • Problem: Missing/invalid readings are coded as -999

The Problem with -999 Values

df = pd.read_sql_query('SELECT * FROM gubbio', conn)
print(df['NO2'].mean())  # Wrong! Includes -999 values

The -999 values will drastically lower your mean.

Solution 1: Replace with 0 (for visualization only)

df.loc[df.NO2 < 0, 'NO2'] = 0
df.loc[df.O3 < 0, 'O3'] = 0
df.loc[df.PM10 < 0, 'PM10'] = 0
df.loc[df.PM25 < 0, 'PM25'] = 0

Good for plotting (no negative spikes), but bad for statistics — zeros still affect the mean.

Solution 2: Replace with NaN (for analysis)

import numpy as np

df.loc[df.NO2 < 0, 'NO2'] = np.nan
df.loc[df.O3 < 0, 'O3'] = np.nan
df.loc[df.PM10 < 0, 'PM10'] = np.nan
df.loc[df.PM25 < 0, 'PM25'] = np.nan

This is the correct approach. Pandas ignores NaN in calculations:

df['NO2'].mean()  # Calculates mean of valid values only
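
Since the bad readings are coded as exactly -999, a one-line alternative to the four loc[] assignments is DataFrame.replace(). Note the difference: replace() matches the sentinel value exactly, while the loc[] version catches any negative reading. A toy sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'NO2': [12.0, -999.0, 30.0],
                   'O3':  [-999.0, 55.0, 60.0]})

# One call cleans every column at once
df = df.replace(-999, np.nan)

print(df['NO2'].mean())  # 21.0 : the NaN is ignored
```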

Using loc[] to Find and Modify

Find rows matching condition:

# All rows where NO2 is negative
print(df.loc[df.NO2 < 0])

# Just the NO2 column where NO2 is negative
print(df.loc[df.NO2 < 0, 'NO2'])

Modify matching rows:

df.loc[df.NO2 < 0, 'NO2'] = np.nan

This reads: "For rows where NO2 < 0, set the NO2 column to NaN."


Part 13: DateTime Handling

Creating DateTime from Components

The Gubbio dataset has separate year, month, day, hour columns. Combine them:

df['timerep'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])

Result: A proper datetime column like 2018-01-01 00:00:00.

Setting DateTime as Index

df.set_index('timerep', inplace=True)

Now you can do time-based operations.

Check the Result

df.info()

You'll see DatetimeIndex instead of RangeIndex.
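
One nice payoff of a DatetimeIndex (shown here on toy data, not the real Gubbio file): you can select whole periods with partial date strings.

```python
import numpy as np
import pandas as pd

# Two days of hourly readings as a stand-in for the Gubbio data
idx = pd.date_range('2018-01-01', periods=48, freq='h')
df = pd.DataFrame({'NO2': np.arange(48.0)}, index=idx)

# Partial-string indexing: a date string selects the whole day
jan1 = df.loc['2018-01-01']
print(len(jan1))  # 24 hourly rows

day2_mean = df.loc['2018-01-02', 'NO2'].mean()
print(day2_mean)  # 35.5
```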


Part 14: Resampling (Time Aggregation)

What is Resampling?

Converting from higher frequency (hourly) to lower frequency (daily, monthly, yearly).

Basic Syntax

df.resample('D').mean()  # Daily mean
df.resample('M').mean()  # Monthly mean
df.resample('A').mean()  # Annual mean

Resample Codes

| Code | Frequency     |
| ---- | ------------- |
| 'H'  | Hourly        |
| 'D'  | Daily         |
| 'W'  | Weekly        |
| 'M'  | Monthly       |
| 'A'  | Annual/Yearly |

Note: pandas 2.2+ deprecates 'H', 'M', and 'A' in favor of 'h', 'ME', and 'YE'.

Examples

Daily mean of PM10, PM25, NO2:

df.resample('D').mean()[['PM10', 'PM25', 'NO2']]

Yearly mean:

df.resample('A').mean()[['PM10', 'PM25']]

Combining Resample with Query

Find days where PM10 exceeded 50 µg/m³ (WHO 24-hour limit):

df.resample('D').mean().query('PM10 > 50')[['PM10']]

This:

  1. Resamples to daily
  2. Computes the mean
  3. Filters to days where PM10 > 50
  4. Shows only the PM10 column

Find days where PM2.5 exceeded 24 µg/m³:

df.resample('D').mean().query('PM25 > 24')[['PM25']]

WHO Air Quality Limits

| Pollutant | Annual Limit | 24-Hour Limit |
| --------- | ------------ | ------------- |
| PM2.5     | 10 µg/m³     | 24 µg/m³      |
| PM10      | 20 µg/m³     | 50 µg/m³      |

Part 15: Saving and Loading with DateTime Index

The Problem

When you save a DataFrame with a datetime index to SQLite and read it back, the index might not be preserved correctly.

Wrong Way

# Save
df.to_sql('gubbio', conn, if_exists='replace')

# Load
df2 = pd.read_sql('SELECT * FROM gubbio', conn)
df2.plot(y=['NO2'])  # X-axis is wrong!

Correct Way: Preserve the Index

Saving:

df.to_sql('gubbio', conn, if_exists='replace', index=True, index_label='timerep')

Loading:

df2 = pd.read_sql('SELECT * FROM gubbio', conn, index_col='timerep', parse_dates=['timerep'])

Parameters:

  • index=True — save the index as a column
  • index_label='timerep' — name the index column
  • index_col='timerep' — use this column as index when loading
  • parse_dates=['timerep'] — parse as datetime

Part 16: Complete Workflow

Typical pattern: Load → Clean → Analyze → Save

import pandas as pd
import sqlite3 as sql
import numpy as np

# 1. Connect and load
conn = sql.connect('gubbio_env_2018.sqlite')
df = pd.read_sql_query('SELECT * FROM gubbio', conn)

# 2. Clean bad values (replace -999 with NaN)
df.loc[df.NO2 < 0, 'NO2'] = np.nan
df.loc[df.O3 < 0, 'O3'] = np.nan
df.loc[df.PM10 < 0, 'PM10'] = np.nan
df.loc[df.PM25 < 0, 'PM25'] = np.nan

# 3. Create datetime index
df['timerep'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
df.set_index('timerep', inplace=True)

# 4. Analyze
# Daily averages
daily = df.resample('D').mean()[['PM10', 'PM25', 'NO2']]

# Days exceeding WHO PM10 limit
bad_pm10_days = df.resample('D').mean().query('PM10 > 50')[['PM10']]
print(f"Days PM10 > 50: {len(bad_pm10_days)}")

# Yearly average
yearly = df.resample('A').mean()[['PM10', 'PM25']]
print(yearly)

# 5. Plot
df.plot(y=['NO2'])
df.plot(y=['O3'])

# 6. Save results
df.to_sql('gubbio_clean', conn, if_exists='replace', index=True, index_label='timerep')

# 7. Close
conn.close()

SQL Commands Summary

| Command      | Purpose       | Example                                                   |
| ------------ | ------------- | --------------------------------------------------------- |
| CREATE TABLE | Define schema | CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT) |
| DROP TABLE   | Delete table  | DROP TABLE IF EXISTS Students                             |
| INSERT INTO  | Add rows      | INSERT INTO Students VALUES (1, 'Alice')                  |
| SELECT       | Query data    | SELECT * FROM Students WHERE age > 20                     |
| DELETE       | Remove rows   | DELETE FROM Students WHERE id = 1                         |
| LIKE         | Pattern match | SELECT * FROM Students WHERE name LIKE 'A%'               |

Python SQLite Summary

| Operation        | Code                                   |
| ---------------- | -------------------------------------- |
| Connect          | conn = sql.connect('file.db')          |
| Get cursor       | cur = conn.cursor()                    |
| Execute          | cur.execute('SQL')                     |
| Execute many     | cur.executemany('SQL', list_of_tuples) |
| Fetch one        | cur.fetchone()                         |
| Fetch all        | cur.fetchall()                         |
| Save changes     | conn.commit()                          |
| Close cursor     | cur.close()                            |
| Close connection | conn.close()                           |

Pandas + SQLite Summary

| Operation        | Code                                                                          |
| ---------------- | ----------------------------------------------------------------------------- |
| Read             | pd.read_sql_query('SELECT...', conn)                                          |
| Read with index  | pd.read_sql_query('...', conn, index_col='col', parse_dates=['col'])          |
| Write            | df.to_sql('table', conn, if_exists='replace')                                 |
| Write with index | df.to_sql('table', conn, if_exists='replace', index=True, index_label='name') |

Common Mistakes

| Mistake                         | Problem             | Fix                                      |
| ------------------------------- | ------------------- | ---------------------------------------- |
| Forgot conn.commit()            | Changes not saved   | Always commit after INSERT/UPDATE/DELETE |
| Using == in SQL                 | Syntax error        | Use single = for equality                |
| Replace -999 with 0             | Wrong statistics    | Use np.nan instead                       |
| DELETE FROM table without WHERE | Deletes everything  | Always specify condition                 |
| CREATE TABLE twice              | Error               | Use DROP TABLE IF EXISTS first           |
| Wrong number of ?               | Error               | Must match column count                  |
| Not closing connection          | Resource leak       | Always conn.close()                      |
| fetchall() twice                | Empty second result | Re-execute query or use fetchone()       |

Quick Reference Card

import sqlite3 as sql
import pandas as pd
import numpy as np

# Connect
conn = sql.connect('database.db')
cur = conn.cursor()

# Create table
cur.execute('DROP TABLE IF EXISTS MyTable')
cur.execute('CREATE TABLE MyTable (id INTEGER PRIMARY KEY, value REAL)')

# Insert
cur.executemany('INSERT INTO MyTable VALUES (?, ?)', [(1, 10.5), (2, 20.3)])
conn.commit()

# Query
cur.execute('SELECT * FROM MyTable WHERE value > 15')
results = cur.fetchall()

# Load into Pandas
df = pd.read_sql_query('SELECT * FROM MyTable', conn)

# Clean data
df.loc[df.value < 0, 'value'] = np.nan

# Save back
df.to_sql('MyTable', conn, if_exists='replace', index=False)

# Close
conn.close()

ACID: The Database's Solemn Vow (NOT EXAM)

Picture this: You're transferring $500 from your savings to your checking account. The database deducts $500 from savings... and then the power goes out. Did the money vanish into the digital void? Did it get added to checking? Are you now $500 poorer for no reason?

This is the nightmare that keeps database architects up at night. And it's exactly why ACID exists.

ACID is a set of properties that guarantees your database transactions are reliable, even when the universe conspires against you. It stands for Atomicity, Consistency, Isolation, and Durability - which sounds like boring corporate jargon until you realize it's the difference between "my money's safe" and "WHERE DID MY MONEY GO?!"

A is for Atomicity: All or Nothing, Baby

Atomicity means a transaction is indivisible - it's an atom (get it?). Either the entire thing happens, or none of it does. No half-baked in-between states.

Back to our money transfer:

BEGIN TRANSACTION;
  UPDATE accounts SET balance = balance - 500 WHERE account_id = 'savings';
  UPDATE accounts SET balance = balance + 500 WHERE account_id = 'checking';
COMMIT;

If the power dies after the first UPDATE, atomicity guarantees that when the system comes back up, it's like that first UPDATE never happened. Your savings account still has the $500. The transaction either completes fully (both updates) or rolls back completely (neither update).

💻
Real World Analogy

Ordering a pizza. Either you get the pizza AND they charge your card, or neither happens. You can't end up with "they charged me but I got no pizza" or "I got pizza but they forgot to charge me." Well, okay, in real life that sometimes happens. But in ACID databases? Never.

⚠️
Common Confusion

Atomicity doesn't mean fast or instant. It means indivisible. A transaction can take 10 seconds, but it's still atomic - either all 10 seconds of work commits, or none of it does.

C is for Consistency: Follow the Rules or Get Out

Consistency means your database moves from one valid state to another valid state. All your rules - constraints, triggers, cascades, foreign keys - must be satisfied before and after every transaction.

Let's say you have a rule: "Account balance cannot be negative." Consistency guarantees that no transaction can violate this, even temporarily during execution.

-- This has a constraint: balance >= 0
UPDATE accounts SET balance = balance - 1000 WHERE account_id = 'savings';

If your savings only has $500, this transaction will be rejected. The database won't let you break the rules, even for a nanosecond.
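You can see this in Python's sqlite3 with a CHECK constraint (table and amounts invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The rule from the text, encoded as a constraint: balance >= 0.
conn.execute("""CREATE TABLE accounts (
    account_id TEXT PRIMARY KEY,
    balance INTEGER CHECK (balance >= 0))""")
conn.execute("INSERT INTO accounts VALUES ('savings', 500)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 1000 WHERE account_id = 'savings'")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the database refuses to enter an invalid state

savings = conn.execute("SELECT balance FROM accounts").fetchone()[0]
print(rejected, savings)  # True 500 - update rejected, balance untouched
```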

ℹ️
The Big Confusion

Remember: ACID consistency is about business rules and constraints within your database. CAP consistency (from the previous article) is about all servers in a distributed system agreeing on the same value. Same word, completely different meanings. Because computer science loves confusing us.

I is for Isolation: Mind Your Own Business

Isolation means concurrent transactions don't step on each other's toes. When multiple transactions run at the same time, they should behave as if they're running one after another, in some order.

Imagine two people trying to book the last seat on a flight at the exact same moment:

Transaction 1: Check if seats available → Yes → Book seat
Transaction 2: Check if seats available → Yes → Book seat

Without isolation, both might see "seats available" and both book the same seat. Chaos! Isolation prevents this by making sure transactions don't see each other's half-finished work.

📝
The Isolation Plot Twist

Isolation actually has different levels (Read Uncommitted, Read Committed, Repeatable Read, Serializable). Stronger isolation = safer but slower. Weaker isolation = faster but riskier. Most databases default to something in the middle because perfect isolation is expensive.

The Classic Problem: Dirty Reads, Phantom Reads, and Other Horror Stories

Without proper isolation, you get gems like:

Dirty Read: You read data that another transaction hasn't committed yet. They roll back, and you read data that never actually existed. Spooky!

Non-Repeatable Read: You read a value, someone else changes it, you read it again in the same transaction and get a different answer. Identity crisis for data!

Phantom Read: You run a query that returns 5 rows. Run it again in the same transaction, now there are 6 rows because someone inserted data. Where did that 6th row come from? It's a phantom!

💻
Example: The Double-Booking Nightmare

Two users book the same hotel room because both checked availability before either transaction committed. Isolation levels (like Serializable) prevent this by locking the relevant rows or using techniques like MVCC (Multi-Version Concurrency Control).

D is for Durability: Once Committed, Forever Committed

Durability means once a transaction is committed, it's permanent. Even if the server explodes, catches fire, and falls into the ocean immediately after, your committed data is safe.

How? Write-Ahead Logging (WAL), journaling, replication - databases use all kinds of tricks to write data to disk before saying "yep, it's committed!"

COMMIT; -- At this moment, the database promises your data is SAFE
-- Server can crash now, data is still there when it comes back up

💡
Behind the Scenes

When you COMMIT, the database doesn't just trust RAM. It writes to persistent storage (disk, SSD) and often waits for the OS to confirm the write completed. This is why commits can feel slow - durability isn't free, but it's worth every millisecond when disaster strikes.

When ACID Matters (Hint: More Than You Think)

Absolutely need ACID:

  • Banking and financial systems - money doesn't just disappear or duplicate
  • E-commerce - orders, payments, inventory must be consistent
  • Medical records - patient data integrity is literally life-or-death
  • Booking systems - double-booking is unacceptable
  • Anything involving legal compliance or auditing

Maybe can relax ACID:

  • Analytics dashboards - approximate counts are fine
  • Social media likes - if a like gets lost in the noise, who cares?
  • Caching layers - stale cache is better than no cache
  • Logging systems - losing 0.01% of logs during a crash might be acceptable

🚫
The "We Don't Need ACID" Famous Last Words

"Our app is simple, we don't need all that ACID overhead!" - said every developer before they had to explain to their CEO why customer orders disappeared. Don't be that developer.

The Trade-off: ACID vs Performance

Here's the uncomfortable truth: ACID guarantees aren't free. They cost performance.

Ensuring atomicity? Needs transaction logs.
Enforcing consistency? Needs constraint checking.
Providing isolation? Needs locking or MVCC overhead.
Guaranteeing durability? Needs disk writes and fsyncs.

This is why NoSQL databases got popular in the early 2010s. They said "what if we... just didn't do all that?" and suddenly you could handle millions of writes per second. Of course, you also had data corruption, lost writes, and race conditions, but hey, it was fast!

🔬
Historical Fun Fact

MongoDB famously had a "durability" setting that was OFF by default for years. Your data wasn't actually safe after a commit unless you explicitly turned on write concerns. They fixed this eventually, but not before countless developers learned about durability the hard way.

Modern Databases: Having Your Cake and Eating It Too

The plot twist? Modern databases are getting really good at ACID without sacrificing too much performance:

  • PostgreSQL uses MVCC (Multi-Version Concurrency Control) for high-performance isolation
  • CockroachDB gives you ACID and horizontal scaling
  • Google Spanner provides global ACID transactions across datacenters

The "NoSQL vs SQL" war has settled into "use the right tool for the job, and maybe that tool is a NewSQL database that gives you both."

💡
Pro Tip

Don't sacrifice ACID unless you have a specific, measured performance problem. Premature optimization killed more projects than slow databases ever did. Start with ACID, relax it only when you must.

TL;DR

ACID is your database's promise that your data is safe and correct:

  • Atomicity: All or nothing - no half-done transactions
  • Consistency: Rules are never broken - constraints always hold
  • Isolation: Transactions don't interfere with each other
  • Durability: Committed means forever - even through disasters

Yes, it costs performance. No, you probably shouldn't skip it unless you really, REALLY know what you're doing and have a very good reason.

Your future self (and your CEO) will thank you when the server crashes and your data is still intact.

Database Management System Architecture [NOT EXAM]

So you've got data. Lots of it. And you need to store it, query it, update it, and make sure it doesn't explode when a thousand users hit it simultaneously. Enter the DBMS - the unsung hero working behind the scenes while you're busy writing SELECT * FROM users.

But what actually happens when you fire off that query? What's going on in the engine room? Let's pop the hood and see how these beautiful machines work.

The Big Picture: Layers Upon Layers

A DBMS is like an onion - layers upon layers, and sometimes it makes you cry when you dig too deep. But unlike an onion, each layer has a specific job and they all work together in harmony (most of the time).

Think of it as a restaurant:

  • Query Interface: The waiter taking your order
  • Query Processor: The chef figuring out how to make your dish
  • Storage Manager: The kitchen staff actually cooking and storing ingredients
  • Transaction Manager: The manager making sure orders don't get mixed up
  • Disk Storage: The pantry and freezer where everything lives

Let's break down each component and see what it actually does.

1. Query Interface: "Hello, How Can I Help You?"

This is where you interact with the database. It's the friendly face (or command line) that accepts your SQL queries, API calls, or whatever language your DBMS speaks.

Components:

  • SQL Parser: Takes your SQL string and turns it into something the computer understands
  • DDL Compiler: Handles schema definitions (CREATE TABLE, ALTER TABLE)
  • DML Compiler: Handles data manipulation (SELECT, INSERT, UPDATE, DELETE)

SELECT * FROM users WHERE age > 18;

The parser looks at this and thinks: "Okay, they want data. From the 'users' table. With a condition. Got it." Then it passes this understanding down the chain.

ℹ️
Fun Fact

When you write terrible SQL with syntax errors, this is where it gets caught. The parser is that friend who tells you "that's not how you spell SELECT" before you embarrass yourself further.

2. Query Processor: The Brain of the Operation

This is where the magic happens. Your query might say "give me all users over 18," but HOW should the database do that? Scan every single row? Use an index? Check the age column first or last? The query processor figures all this out.

Key Components:

Query Optimizer

The optimizer is basically an AI that's been doing its job since the 1970s. It looks at your query and generates multiple execution plans, then picks the best one based on statistics about your data.

SELECT u.name, o.total 
FROM users u 
JOIN orders o ON u.id = o.user_id 
WHERE u.country = 'Italy';

The optimizer thinks: "Should I find Italian users first, then join orders? Or scan orders first? How many Italian users are there? Is there an index on country? On user_id?" It runs the math and picks the fastest path.

💻
Real World Example

This is why adding an index can make queries 1000x faster. The optimizer sees the index and thinks "oh perfect, I can use that instead of scanning millions of rows!" Same query, completely different execution plan.

Query Execution Engine

Once the optimizer picks a plan, the execution engine actually runs it. It's the worker bee that fetches data, applies filters, joins tables, and assembles your result set.

💡
Pro Tip

Most databases let you see the query plan with EXPLAIN or EXPLAIN ANALYZE. If your query is slow, this is your first stop. The optimizer shows you exactly what it's doing, and often you'll spot the problem immediately - like a missing index or an accidental full table scan.
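For example, in SQLite (via Python) the command is EXPLAIN QUERY PLAN. The exact output text varies by database and version, so treat the comments below as illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# No index on age yet: the only option is a full table scan.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT age FROM users WHERE age > 18").fetchone()[-1]
print(plan_before)  # something like: SCAN users

# Add an index and ask again - same query, different plan.
conn.execute("CREATE INDEX idx_users_age ON users(age)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT age FROM users WHERE age > 18").fetchone()[-1]
print(plan_after)   # something like: SEARCH users USING COVERING INDEX idx_users_age (age>?)
```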

3. Transaction Manager: Keeping the Peace

Remember ACID? This is where it happens. The transaction manager makes sure multiple users can work with the database simultaneously without chaos erupting.

Key Responsibilities:

Concurrency Control

Prevents the classic problems: two people trying to buy the last concert ticket, or withdrawing money from the same account simultaneously. Uses techniques like:

  • Locking: "Sorry, someone else is using this row right now, wait your turn"
  • MVCC (Multi-Version Concurrency Control): "Here's your own snapshot of the data, everyone gets their own version"
  • Timestamp Ordering: "We'll execute transactions in timestamp order, nice and orderly"

Recovery Manager

When things go wrong (power outage, crash, cosmic ray), this component brings the database back to a consistent state. It uses:

  • Write-Ahead Logging (WAL): Write to the log before writing to the database, so you can replay or undo operations
  • Checkpoints: Periodic snapshots so recovery doesn't have to replay the entire history since the Big Bang
  • Rollback: Undo incomplete transactions
  • Roll-forward: Redo committed transactions that didn't make it to disk

⚠️
Why Commits Feel Slow

When you COMMIT, the database doesn't just write to memory and call it a day. It writes to the WAL, flushes to disk, and waits for confirmation. This is why durability costs performance - but it's also why your data survives disasters.

4. Storage Manager: Where Bytes Live

This layer manages the actual storage of data on disk (or SSD, or whatever physical medium you're using). It's the bridge between "logical" concepts like tables and rows, and "physical" reality like disk blocks and file pointers.

Components:

Buffer Manager

RAM is fast, disk is slow. The buffer manager keeps frequently accessed data in memory (the buffer pool) so queries don't have to hit disk constantly.

It's like keeping your favorite snacks on the counter instead of going to the store every time you're hungry.

When memory fills up, it uses replacement policies (LRU - Least Recently Used is popular) to decide what to kick out.
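The core idea fits in a few lines. Here's a toy buffer pool in Python - the class name and the "disk read" callback are invented for this sketch, not any real database's API:

```python
from collections import OrderedDict

class BufferPool:
    """Toy buffer manager: caches pages in RAM, evicts least-recently-used."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk  # the slow fallback
        self.pages = OrderedDict()  # page_id -> page data, kept in LRU order
        self.hits = self.misses = 0

    def get_page(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict the LRU page
            self.pages[page_id] = self.read_page_from_disk(page_id)
        return self.pages[page_id]

pool = BufferPool(capacity=2, read_page_from_disk=lambda pid: f"data-{pid}")
pool.get_page(1); pool.get_page(2); pool.get_page(1)  # third call is a hit
pool.get_page(3)  # evicts page 2 (least recently used), not page 1
print(pool.hits, pool.misses, 2 in pool.pages)  # 1 3 False
```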

File Manager

Manages the actual files on disk. Tables aren't stored as neat CSV files - they're stored in complex structures optimized for different access patterns:

  • Heap Files: Unordered collection of records, good for full table scans
  • Sorted Files: Records sorted by some key, good for range queries
  • Hash Files: Records distributed by hash function, good for exact-match lookups
  • Clustered Files: Related records stored together, good for joins

Index Manager

Manages indexes - the phone book of your database. Instead of scanning every row to find what you want, indexes let you jump straight to the relevant data.

Common index types:

  • B-Tree / B+Tree: Sorted tree structure, handles ranges beautifully
  • Hash Index: Lightning fast for exact matches, useless for ranges
  • Bitmap Index: Great for columns with few distinct values (like gender, status)
  • Full-Text Index: Specialized for text search

💻
Example: Why Indexes Matter

Finding a user by ID without an index: scan 10 million rows, takes seconds.
Finding a user by ID with a B-tree index: traverse a tree with height ~4, takes milliseconds.
Same query, 1000x speed difference. Indexes are your friend!
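That "height ~4" figure comes straight from logarithms. A quick back-of-envelope in Python - the fanout of 500 keys per node is a made-up but realistic number; real values depend on page and key size:

```python
import math

rows = 10_000_000
fanout = 500  # hypothetical keys per B-tree node

# Each level multiplies the reachable rows by the fanout,
# so height grows only logarithmically with table size.
height = math.ceil(math.log(rows, fanout))
print(height)        # 3 node visits instead of 10,000,000 row reads
print(fanout ** 4)   # 62500000000 - four levels already cover 62 billion rows
```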

5. The Disk Storage Layer: Ground Zero

At the bottom of it all, your data lives on physical storage. This layer deals with the gritty details:

  • Blocks/Pages: Data is stored in fixed-size chunks (usually 4KB-16KB)
  • Slotted Pages: How records fit inside blocks
  • Free Space Management: Tracking which blocks have room for new data
  • Data Compression: Squeezing more data into less space

Modern databases are incredibly clever here. They use techniques like:

  • Column-oriented storage: Store columns separately for analytics workloads
  • Compression: Save disk space and I/O bandwidth
  • Partitioning: Split huge tables across multiple physical locations

📝
The Performance Hierarchy

- CPU Cache: ~1 nanosecond
- RAM: ~100 nanoseconds
- SSD: ~100 microseconds (1000x slower than RAM!)
- HDD: ~10 milliseconds (100,000x slower than RAM!)

This is why the buffer manager is so critical. Every disk access avoided is a massive win.

Architectural Patterns: Different Strokes for Different Folks

Not all DBMS architectures are the same. They evolved to solve different problems.

Centralized Architecture

Traditional, single-server setup. Everything lives on one machine.

Pros: Simple, full ACID guarantees, no network latency between components
Cons: Limited by one machine's resources, single point of failure

Example: PostgreSQL or MySQL on a single server

Client-Server Architecture

Clients connect to a central database server. Most common pattern today.

Pros: Centralized control, easier security, clients can be lightweight
Cons: Server can become a bottleneck

Example: Your web app connecting to a PostgreSQL server

Distributed Architecture

Data spread across multiple nodes, often in different locations.

Pros: Massive scalability, fault tolerance, can survive node failures
Cons: Complex, CAP theorem strikes, eventual consistency headaches

Example: Cassandra, MongoDB sharded clusters, CockroachDB

Parallel Architecture

Multiple processors/cores working on the same query simultaneously.

Types:

  • Shared Memory: All processors share RAM and disk (symmetric multiprocessing)
  • Shared Disk: Processors have their own memory but share disks
  • Shared Nothing: Each processor has its own memory and disk (most scalable)

Example: Modern PostgreSQL can parallelize queries across cores

ℹ️
The Evolution

We went from centralized mainframes (1970s) → client-server (1990s) → distributed NoSQL (2000s) → distributed NewSQL (2010s). Each era solved the previous era's limitations while introducing new challenges.

Modern Twists: Cloud and Serverless

The cloud changed the game. Now we have:

Database-as-a-Service (DBaaS): Amazon RDS, Google Cloud SQL - you get a managed database without worrying about the infrastructure.

Serverless Databases: Aurora Serverless, Cosmos DB - database scales automatically, you pay per query.

Separation of Storage and Compute: Modern architectures split storage (S3, object storage) from compute (query engines). Scale them independently!

💡
The Big Idea

Traditional databases bundle everything together. Modern cloud databases separate concerns: storage is cheap and infinite (S3), compute is expensive and scales (EC2). Why pay for compute when you're not querying? This is the serverless revolution.

Putting It All Together: A Query's Journey

Let's trace what happens when you run a query:

SELECT name, email FROM users WHERE age > 25 ORDER BY name LIMIT 10;

  1. Query Interface: Parses the SQL, validates syntax
  2. Query Processor: Optimizer creates execution plan ("use age index, sort results, take first 10")
  3. Transaction Manager: Assigns a transaction ID, determines isolation level
  4. Storage Manager:
    • Buffer manager checks if needed data is in memory
    • If not, file manager reads from disk
    • Index manager uses age index to find matching rows
  5. Execution Engine: Applies filter, sorts, limits results
  6. Transaction Manager: Commits transaction, releases locks
  7. Query Interface: Returns results to your application

All this happens in milliseconds. Databases are incredibly sophisticated machines!

Mind Blown Yet?

Next time your query returns in 50ms, take a moment to appreciate the decades of computer science and engineering that made it possible. From parsing to optimization to disk I/O to lock management - it's a symphony of coordinated components.

TL;DR

A DBMS is a complex system with multiple layers:

  • Query Interface: Takes your SQL and validates it
  • Query Processor: Figures out the best way to execute your query
  • Transaction Manager: Ensures ACID properties and handles concurrency
  • Storage Manager: Manages buffer pool, files, and indexes
  • Disk Storage: Where your data actually lives

Different architectures (centralized, distributed, parallel) trade off simplicity vs scalability vs consistency.

Modern databases are moving toward cloud-native, separation of storage and compute, and serverless models.

The next time you write SELECT *, remember: there's a whole orchestra playing in the background to make that query work.

Concurrency Control Theory [NOT EXAM]

Remember our ACID article? We talked about how databases promise to keep your data safe and correct. But there's a problem we glossed over: what happens when multiple transactions run at the same time?

Spoiler alert: chaos. Beautiful, fascinating, wallet-draining chaos.

The $25 That Vanished Into Thin Air

Let's start with a horror story. You've got $100 in your bank account. You try to pay for something that costs $25. Simple, right?

Read Balance: $100
Check if $100 > $25? ✓
Pay $25
New Balance: $75
Write Balance: $75

Works perfectly! Until the power goes out right after you read the balance but before you write it back. Now what? Did the payment go through? Is your money gone? This is where Atomicity saves you - either the entire transaction happens or none of it does.

But here's an even scarier scenario: What if TWO payments of $25 try to execute at the exact same time?

Transaction 1: Read Balance ($100) → Check funds → Pay $25
Transaction 2: Read Balance ($100) → Check funds → Pay $25
Transaction 1: Write Balance ($75)
Transaction 2: Write Balance ($75)

Both transactions read $100, both think they have enough money, both pay $25... and your final balance is $75 instead of $50. You just got a free $25! (Your bank is not happy.)
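You don't even need real threads to see the bug - a deterministic replay of that interleaving in plain Python does the trick:

```python
balance = 100

# Both transactions read the balance before either writes it back.
t1_sees = balance  # T1 reads $100
t2_sees = balance  # T2 also reads $100 - here's the bug

# Each independently checks funds and computes its own new balance.
assert t1_sees >= 25 and t2_sees >= 25
balance = t1_sees - 25  # T1 writes $75
balance = t2_sees - 25  # T2 overwrites with $75 - T1's payment is lost!

print(balance)  # 75, even though two $25 payments went out - it should be 50
```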

This is the nightmare that keeps database architects awake at night. And it's exactly why concurrency control exists.

🚫
The Real World Impact

These aren't theoretical problems. In 2012, Knight Capital lost $440 million in 45 minutes because of a software bug in their automated trading system. When code moves money on its own, correctness bugs are catastrophic - and concurrency bugs are among the sneakiest of them!

The Strawman Solution: Just Don't

The simplest solution? Don't allow concurrency at all. Execute one transaction at a time, in order, like a polite British queue.

Transaction 1 → Complete → Transaction 2 → Complete → Transaction 3 → ...

Before each transaction starts, copy the entire database to a new file. If it succeeds, overwrite the original. If it fails, delete the copy. Done!

This actually works! It's perfectly correct! It also has the performance of a potato.

Why? Because while one transaction is waiting for a slow disk read, every other transaction in the world is just... waiting. Doing nothing. Your expensive multi-core server is running one thing at a time like it's 1975.

We can do better.

The Goal: Having Your Cake and Eating It Too

What we actually want:

  • Better utilization: Use all those CPU cores! Don't let them sit idle!
  • Faster response times: When one transaction waits for I/O, let another one run
  • Correctness: Don't lose money or corrupt data
  • Fairness: Don't let one transaction starve forever

The challenge is allowing transactions to interleave their operations while still maintaining the illusion that they ran one at a time.

📖
Key Concept: Serializability

A schedule (interleaving of operations) is serializable if its result is equivalent to *some* serial execution of the transactions. We don't care which order, just that there exists *some* valid serial order that produces the same result.

The DBMS View: It's All About Reads and Writes

The database doesn't understand your application logic. It doesn't know you're transferring money or booking hotel rooms. All it sees is:

Transaction T1: R(A), W(A), R(B), W(B)
Transaction T2: R(A), W(A), R(B), W(B)

Where R = Read and W = Write. That's it. The DBMS's job is to interleave these operations in a way that doesn't break correctness.

The Classic Example: Interest vs Transfer

You've got two accounts, A and B, each with $1000. Two transactions run:

T1: Transfer $100 from A to B

A = A - 100  // A becomes $900
B = B + 100  // B becomes $1100

T2: Add 6% interest to both accounts

A = A * 1.06
B = B * 1.06

What should the final balance be? Well, A + B should equal $2120 (the original $2000 plus 6% interest).

Serial Execution: The Safe Path

If T1 runs completely before T2:

A = 1000 - 100 = 900
B = 1000 + 100 = 1100
Then apply interest:
A = 900 * 1.06 = 954
B = 1100 * 1.06 = 1166
Total: $2120 ✓

If T2 runs completely before T1:

A = 1000 * 1.06 = 1060
B = 1000 * 1.06 = 1060
Then transfer:
A = 1060 - 100 = 960
B = 1060 + 100 = 1160
Total: $2120 ✓

Both valid! Different final states, but both correct because A + B = $2120.
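A few lines of Python confirm the invariant for both serial orders (the balances are floats here, so we round for display):

```python
def transfer(a, b):   # T1: move $100 from A to B
    return a - 100, b + 100

def interest(a, b):   # T2: add 6% interest to both accounts
    return a * 1.06, b * 1.06

# T1 then T2:
a, b = interest(*transfer(1000, 1000))
print(round(a, 2), round(b, 2), round(a + b, 2))  # 954.0 1166.0 2120.0

# T2 then T1:
a, b = transfer(*interest(1000, 1000))
print(round(a, 2), round(b, 2), round(a + b, 2))  # 960.0 1160.0 2120.0
```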

Good Interleaving: Still Correct

T1: A = A - 100  (A = 900)
T1: B = B + 100  (B = 1100)
T2: A = A * 1.06 (A = 954)
T2: B = B * 1.06 (B = 1166)
Total: $2120 ✓

This interleaving is equivalent to running T1 then T2 serially. We're good!

Bad Interleaving: Money Disappears

T1: A = A - 100  (A = 900)
T2: A = A * 1.06 (A = 954)  ← Interest applied to the debited balance
T2: B = B * 1.06 (B = 1060) ← Interest applied before the $100 arrived!
T1: B = B + 100  (B = 1160)
Total: $2114 ✗

We lost $6! This schedule is NOT equivalent to any serial execution. It's incorrect.

⚠️
The Problem

T2 saw T1's debit on A but not T1's credit on B - it applied interest to a half-finished transfer. The $100 in transit never earned its 6%, and that's exactly the missing $6. Each transaction sees a mix of old and new values, and no serial order could produce that result.

Conflicting Operations: The Root of All Evil

When do operations actually conflict? When they can cause problems if interleaved incorrectly?

Two operations conflict if:

  1. They're from different transactions
  2. They're on the same object (same data item)
  3. At least one is a write

This gives us three types of conflicts:

Read-Write Conflicts: The Unrepeatable Read

T1: R(A) → sees $10
T2: W(A) → writes $19
T1: R(A) → sees $19

T1 reads A twice in the same transaction and gets different values! The data changed underneath it. This is called an unrepeatable read.

Write-Read Conflicts: The Dirty Read

T1: W(A) → writes $12 (not committed yet)
T2: R(A) → reads $12
T2: W(A) → writes $14 (based on dirty data)
T2: COMMIT
T1: ROLLBACK ← Oh no!

T2 read data that T1 wrote but never committed. That data never "really existed" because T1 rolled back. T2 made decisions based on a lie. This is a dirty read.

💻
Real World Example

You're booking the last seat on a flight. The reservation system reads "1 seat available" from a transaction that's updating inventory but hasn't committed. You book the seat. That transaction rolls back. Turns out there were actually 0 seats. Now you're stuck at the airport arguing with gate agents.

Write-Write Conflicts: The Lost Update

T1: W(A) → writes "Bob"
T2: W(A) → writes "Alice"

T2's write overwrites T1's write. If T1 hasn't committed yet, its update is lost. This is the lost update problem.

Conflict Serializability: The Practical Standard

Now we can formally define what makes a schedule acceptable. A schedule is conflict serializable if we can transform it into a serial schedule by swapping non-conflicting operations.

The Dependency Graph Trick

Here's a clever way to check if a schedule is conflict serializable:

  1. Draw one node for each transaction
  2. Draw an edge from Ti to Tj if Ti has an operation that conflicts with an operation in Tj, and Ti's operation comes first
  3. If the graph has a cycle, the schedule is NOT conflict serializable

Example: The Bad Schedule

T1: R(A), W(A), R(B), W(B)
T2: R(A), W(A), R(B), W(B)

With interleaving:

T1: R(A), W(A)
T2: R(A), W(A)
T2: R(B), W(B)
T1: R(B), W(B)

Dependency graph:

T1 → T2  (T1 writes A, T2 reads A - T1 must come first)
T2 → T1  (T2 writes B, T1 reads B - T2 must come first)

There's a cycle! T1 needs to come before T2 AND T2 needs to come before T1. Impossible! This schedule is not conflict serializable.

💡
Why This Matters

The dependency graph gives us a mechanical way to check serializability. If there's no cycle, we can find a valid serial order by doing a topological sort of the graph. This is how the DBMS reasons about schedules!
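This check is easy to mechanize. Here's a small Python sketch (the schedule encoding is invented for the demo): build the conflict edges, then repeatedly peel off transactions with no incoming edges - which is exactly a topological sort.

```python
def conflict_edges(schedule):
    """Edges Ti -> Tj for each pair of conflicting operations where Ti's
    comes first: different txns, same object, at least one is a write."""
    edges = set()
    for i, (ti, op1, obj1) in enumerate(schedule):
        for tj, op2, obj2 in schedule[i + 1:]:
            if ti != tj and obj1 == obj2 and "W" in (op1, op2):
                edges.add((ti, tj))
    return edges

def is_conflict_serializable(schedule):
    """No cycle in the dependency graph <=> conflict serializable.
    Checked by repeatedly removing nodes with no incoming edges."""
    edges = conflict_edges(schedule)
    nodes = {t for t, _, _ in schedule}
    while nodes:
        free = [n for n in nodes
                if not any(d == n and s in nodes for s, d in edges)]
        if not free:
            return False  # everything left is stuck in a cycle
        nodes -= set(free)
    return True

# The bad schedule from the text: T1 -> T2 on A, but T2 -> T1 on B. Cycle!
bad = [("T1", "R", "A"), ("T1", "W", "A"),
       ("T2", "R", "A"), ("T2", "W", "A"),
       ("T2", "R", "B"), ("T2", "W", "B"),
       ("T1", "R", "B"), ("T1", "W", "B")]

# The same operations run serially: all of T1, then all of T2. No cycle.
serial = [("T1", "R", "A"), ("T1", "W", "A"), ("T1", "R", "B"), ("T1", "W", "B"),
          ("T2", "R", "A"), ("T2", "W", "A"), ("T2", "R", "B"), ("T2", "W", "B")]

print(is_conflict_serializable(bad))     # False
print(is_conflict_serializable(serial))  # True
```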

View Serializability: The Broader Definition

Conflict serializability is practical, but it's also conservative - it rejects some schedules that are actually correct.

View serializability is more permissive. Two schedules are view equivalent if:

  1. If T1 reads the initial value of A in one schedule, it reads the initial value in the other
  2. If T1 reads a value of A written by T2 in one schedule, it does so in the other
  3. If T1 writes the final value of A in one schedule, it does so in the other

Consider this schedule:

T1: R(A)
T2: W(A)
T1: W(A)
T3: W(A)

The dependency graph has a cycle - T1 → T2 (T1 reads A before T2 writes it) and T2 → T1 (T2's write comes before T1's) - so it's not conflict serializable. But it IS view serializable! Why? Because T3 writes the final value of A in both this schedule and the serial schedule T1→T2→T3. The intermediate writes by T1 and T2 don't matter - they're overwritten anyway.

This is called a blind write - writing a value without reading it first.

ℹ️
Why Don't Databases Use View Serializability?

Checking view serializability is NP-Complete. It's computationally expensive and impractical for real-time transaction processing. Conflict serializability is polynomial time and good enough for 99.9% of cases.

The Universe of Schedules

┌─────────────────────────────────────┐
│      All Possible Schedules         │
│  ┌───────────────────────────────┐  │
│  │   View Serializable           │  │
│  │  ┌─────────────────────────┐  │  │
│  │  │ Conflict Serializable   │  │  │
│  │  │  ┌───────────────────┐  │  │  │
│  │  │  │  Serial Schedules │  │  │  │
│  │  │  └───────────────────┘  │  │  │
│  │  └─────────────────────────┘  │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

Most databases enforce conflict serializability because:

  • It's efficient to check
  • It covers the vast majority of practical cases
  • It can be enforced with locks, timestamps, or optimistic methods

How Do We Actually Enforce This?

We've talked about what serializability means, but not how to enforce it. That's the job of concurrency control protocols, which come in two flavors:

Pessimistic: Assume conflicts will happen, prevent them proactively

  • Two-Phase Locking (2PL) - most common
  • Timestamp Ordering
  • "Don't let problems arise in the first place"

Optimistic: Assume conflicts are rare, deal with them when detected

  • Optimistic Concurrency Control (OCC)
  • Multi-Version Concurrency Control (MVCC)
  • "Let transactions run freely, check for conflicts at commit time"

We'll dive deep into these in the next article, but the key insight is that all of them are trying to ensure the schedules they produce are serializable.

📝
Important Distinction

This article is about checking whether schedules are correct. The next article is about generating correct schedules in the first place. The theory tells us what's correct; the protocols tell us how to achieve it.

The NoSQL Backlash (That's Now Backtracking)

Around 2010, the NoSQL movement said "transactions are slow, ACID is overkill, eventual consistency is fine!" Systems like early MongoDB and Cassandra threw out strict serializability for performance.

And you know what? They were fast! They could handle millions of writes per second!

They also had data corruption, lost writes, and developers pulling their hair out debugging race conditions.

The pendulum has swung back. Modern databases (NewSQL, distributed SQL) are proving you can have both performance AND correctness. Turns out the computer scientists in the 1970s knew what they were doing.

🔬
Historical Note

The theory of serializability was developed in the 1970s-1980s by pioneers like Jim Gray, Phil Bernstein, and Christos Papadimitriou. It's stood the test of time because it's based on fundamental principles, not implementation details.

TL;DR

The Problem: Multiple concurrent transactions can interfere with each other, causing lost updates, dirty reads, and inconsistent data.

The Solution: Ensure all schedules are serializable - equivalent to some serial execution.

Key Concepts:

  • Conflicting operations: Two operations on the same object from different transactions, at least one is a write
  • Conflict serializability: Can transform the schedule into a serial one by swapping non-conflicting operations (check with dependency graphs)
  • View serializability: Broader definition, but too expensive to enforce in practice

Types of Conflicts:

  • Read-Write: Unrepeatable reads
  • Write-Read: Dirty reads
  • Write-Write: Lost updates

Next Time: We'll learn about Two-Phase Locking, MVCC, and how databases actually enforce serializability in practice. The theory is beautiful; the implementation is where the magic happens! 🔒

How Databases Actually Store Your Data

You write INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com') and hit enter. It works! Magic!

But have you ever wondered what actually happens? Where does "Alice" go? How does the database find her again when you run SELECT * FROM users WHERE name = 'Alice'?

The beautiful abstraction of tables, rows, and columns is just that - an abstraction. Under the hood, your database is playing Tetris with bytes on a spinning disk (or SSD), trying to pack data efficiently while making it fast to retrieve.

Let's pop the hood and see how this really works.

The Great Illusion: Logical vs Physical

When you think about a database, you probably imagine something like this:

users table:
┌────┬───────┬──────────────────┬─────┐
│ id │ name  │ email            │ age │
├────┼───────┼──────────────────┼─────┤
│ 1  │ Alice │ alice@example.com│ 28  │
│ 2  │ Bob   │ bob@example.com  │ 35  │
│ 3  │ Carol │ carol@example.com│ 42  │
└────┴───────┴──────────────────┴─────┘

Nice, neat rows and columns. Very spreadsheet-like. This is the logical view - how humans think about data.

But on disk? It looks more like:

01001000 01100101 01101100 01101100 01101111 00100000
01010111 01101111 01110010 01101100 01100100 00100001
...millions more bytes...

The physical view is just bytes in files. The database's job is to bridge this gap - to take your neat logical tables and figure out how to jam them into bytes efficiently.

ℹ️
The Storage Manager's Job

The storage manager is the part of the DBMS that translates between "give me the user with id=42" (logical) and "read bytes 8192-8256 from file users.db" (physical). It's like a translator between two completely different languages.

The Storage Hierarchy: A Tale of Speed and Money

Before we dive into how data is stored, we need to understand the hardware reality. Not all storage is created equal:

CPU Registers:     ~1 nanosecond    (Tiny, blazing fast, $$$$$)
CPU Cache:         ~1-10 ns         (Small, very fast, $$$$)
RAM:               ~100 ns          (Medium, fast, $$$)
SSD:               ~100 microseconds (Large, pretty fast, $$)
HDD:               ~10 milliseconds  (Huge, slow, $)
Network Storage:   ~100+ ms         (Infinite, slower, $)

Notice that gap between RAM and SSD? 1,000x slower. And HDD? 100,000x slower than RAM.

This is why databases are obsessed with keeping data in memory (RAM) and avoiding disk I/O at all costs. Every disk access is a tragedy. Every cache hit is a celebration.

⚠️
The Performance Reality

You can execute millions of CPU instructions in the time it takes to read one block from a hard disk. This is why database design is all about minimizing I/O - the CPU is sitting there twiddling its thumbs waiting for the disk.

Pages: The Fundamental Unit of I/O

Here's a key insight: databases don't read individual rows from disk. That would be insane. Instead, they work with pages (also called blocks).

A page is a fixed-size chunk of data, typically 4KB, 8KB, or 16KB. When you ask for one row, the database reads an entire page containing that row (and probably many other rows too).

Why? Because of how disks work. Reading 1 byte from disk takes about the same time as reading 8KB - you pay for the seek time either way. Might as well read a decent chunk while you're there.

Disk File:
┌────────────┬────────────┬────────────┬────────────┐
│  Page 0    │  Page 1    │  Page 2    │  Page 3    │
│  (8 KB)    │  (8 KB)    │  (8 KB)    │  (8 KB)    │
└────────────┴────────────┴────────────┴────────────┘
     ↓
 Contains multiple rows:
 ┌──────────┐
 │ Row 1    │
 │ Row 2    │
 │ Row 3    │
 │ Row 4    │
 │ ...      │
 └──────────┘

Everything in a database happens at page granularity:

  • Read a row? Read the whole page
  • Update a row? Read the page, modify it in memory, write the whole page back
  • Lock a row? Actually lock the whole page (in some systems)
💡
Page Size Matters

Bigger pages = fewer I/O operations but more wasted space and higher contention. Smaller pages = more I/O but better space utilization. Most databases settle on 8KB as a reasonable compromise. PostgreSQL uses 8KB, MySQL InnoDB uses 16KB.

Inside a Page: Slotted Page Layout

So we've got an 8KB page. How do we store rows in it? The most common approach is the slotted page structure:

┌──────────────────────────────────────────┐ ← Page Start (8KB)
│           Page Header                     │
│  - Number of slots used                   │
│  - Free space pointer                     │
│  - Page checksum                          │
├──────────────────────────────────────────┤
│           Slot Array                      │
│  Slot 0: [offset=7800, length=120]       │
│  Slot 1: [offset=7500, length=180]       │
│  Slot 2: [offset=7200, length=150]       │
│  ...                                      │
├──────────────────────────────────────────┤
│                                           │
│         Free Space (grows down)           │
│                                           │
├──────────────────────────────────────────┤
│  Tuple 2: [data...]                      │ ← Offset 7200
│  Tuple 1: [data...]                      │ ← Offset 7500
│  Tuple 0: [data...]                      │ ← Offset 7800
└──────────────────────────────────────────┘ ← Page End

The clever bit: the slot array grows down from the top, the actual tuple data grows up from the bottom. They meet in the middle. When they collide, the page is full.

Why this design?

  • Indirection: Want to move a tuple within the page? Just update the slot's offset, don't touch anything else
  • Efficient deletion: Mark a slot as empty, reuse it later
  • Variable-length records: No problem, just store the actual length in the slot
💻
Example: Finding Row 5

1. Database knows row 5 is on page 12
2. Read page 12 into memory (8KB I/O operation)
3. Look at slot 5 in the slot array: offset=7500, length=180
4. Jump to byte 7500 in the page, read 180 bytes
5. That's your row!

All this happens in microseconds once the page is in memory.
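Here's a toy Python version of that slotted-page idea - nothing like a real engine's byte layout (the 64-byte header reserve and 8-byte slots are made up), just the shape of it: slots grow down, tuple bytes grow up, and a read is slot → offset → bytes.

```python
# Minimal slotted-page sketch. Slot array grows down from the header,
# tuple bytes grow up from the end of the page. Sizes are illustrative.
PAGE_SIZE = 8192

class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.slots = []            # (offset, length) per slot
        self.free_end = PAGE_SIZE  # tuple data grows up from here

    def insert(self, tuple_bytes):
        # Page is full when slot array and tuple data would collide
        # (64 bytes reserved for a pretend header, 8 bytes per slot).
        if self.free_end - len(tuple_bytes) < 64 + 8 * (len(self.slots) + 1):
            raise RuntimeError("page full")
        self.free_end -= len(tuple_bytes)
        self.data[self.free_end:self.free_end + len(tuple_bytes)] = tuple_bytes
        self.slots.append((self.free_end, len(tuple_bytes)))
        return len(self.slots) - 1  # slot number

    def read(self, slot_no):
        offset, length = self.slots[slot_no]   # indirection via slot array
        return bytes(self.data[offset:offset + length])

page = SlottedPage()
s0 = page.insert(b"id=1,name=Alice")
s1 = page.insert(b"id=2,name=Bob")
print(page.read(s0))  # b'id=1,name=Alice'
```

Note how the second tuple lands at a lower offset than the first - that's the "data grows up from the bottom" part in action.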

Tuple Layout: How Rows Become Bytes

Inside each slot, we've got the actual row data (called a tuple). How is it laid out?

Fixed-Length Fields (Simple):

Row: (id=42, age=28, salary=50000)

┌────────┬────────┬────────────┐
│   42   │   28   │   50000    │
│ 4 bytes│ 4 bytes│  4 bytes   │
└────────┴────────┴────────────┘

Easy! Just concatenate the values. To find the age field, jump to byte offset 4. To find salary, jump to byte offset 8.

Variable-Length Fields (Tricky):

Row: (id=42, name="Alice", email="alice@example.com")

┌────────┬────────┬────────┬───────┬──────────────────────┐
│   42   │ off=12 │ off=17 │ Alice │ alice@example.com    │
│ 4 bytes│ 4 bytes│ 4 bytes│5 bytes│     17 bytes         │
└────────┴────────┴────────┴───────┴──────────────────────┘
         ↑        ↑
         └────────┴── Offsets to variable-length data

The fixed-length header contains offsets pointing to where the variable-length data actually lives. When you want the name, you look at the offset, jump there, and read until you hit the next field.

NULL Handling:

Many databases use a null bitmap at the start of each tuple:

┌──────────────┬────────┬────────┬────────┐
│ Null Bitmap  │ Field1 │ Field2 │ Field3 │
│ (bits: 010)  │   42   │  NULL  │   28   │
└──────────────┴────────┴────────┴────────┘

Each bit indicates if the corresponding field is NULL. If it is, you don't even store the value - saves space!
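The offset-header layout can be packed and unpacked with a few lines of Python's struct module. This sketch assumes 4-byte little-endian ints and skips the null bitmap; offsets count from the start of the tuple:

```python
import struct

# Pack (id, name, email) as a fixed header [id][name_off][email_off]
# followed by the variable-length strings.
def pack_tuple(id_, name, email):
    name_b, email_b = name.encode(), email.encode()
    header_len = 12                        # three 4-byte fields
    name_off = header_len
    email_off = name_off + len(name_b)
    return struct.pack("<iii", id_, name_off, email_off) + name_b + email_b

def unpack_tuple(buf):
    id_, name_off, email_off = struct.unpack_from("<iii", buf, 0)
    name = buf[name_off:email_off].decode()   # read until the next field starts
    email = buf[email_off:].decode()
    return id_, name, email

t = pack_tuple(42, "Alice", "alice@example.com")
print(unpack_tuple(t))  # (42, 'Alice', 'alice@example.com')
```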

Heap Files: The Simplest Storage Structure

Now that we know how to store rows in pages, how do we organize pages into files? The simplest approach is a heap file - just a random collection of pages with no particular order.

users.heap file:
┌──────────┬──────────┬──────────┬──────────┐
│ Page 0   │ Page 1   │ Page 2   │ Page 3   │
│ [rows]   │ [rows]   │ [rows]   │ [rows]   │
└──────────┴──────────┴──────────┴──────────┘
   ↓ No particular order!
   Rows inserted wherever there's space

Insertion: Find a page with free space (keep a free space map), stick the new row there.

Lookup by ID: Scan every single page until you find it. Slow! This is why we need indexes.

Deletion: Mark the row as deleted, or compact the page to reclaim space.

Heap files are simple but have terrible performance for searches. Finding one specific row means reading the entire table. For a million-row table, that's thousands of I/O operations.

This is where indexes save the day.
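To make the "scans are linear" point concrete, here's a heap file reduced to Python lists - the 4-rows-per-page capacity is artificial, but the access pattern is the real story:

```python
# Heap file sketch: pages are unordered; every point lookup is a full scan.
PAGE_CAPACITY = 4

class HeapFile:
    def __init__(self):
        self.pages = [[]]

    def insert(self, row):
        # Stand-in for a free space map: first page with room wins.
        for page in self.pages:
            if len(page) < PAGE_CAPACITY:
                page.append(row)
                return
        self.pages.append([row])

    def find(self, key):
        pages_read = 0
        for page in self.pages:        # no order, so we read page after page
            pages_read += 1
            for row in page:
                if row["id"] == key:
                    return row, pages_read
        return None, pages_read

heap = HeapFile()
for i in range(10):
    heap.insert({"id": i})
row, io = heap.find(9)    # worst case: the last row on the last page
print(io)                 # 3 pages read for 10 rows at 4 rows/page
```

Scale those 3 pages up to a million-row table and you see why indexes exist.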

📝
When Heap Files Are Okay

If you're always scanning the entire table anyway (like for analytics), heap files are fine. No point in maintaining indexes if you're going to read everything. But for OLTP workloads with point queries? You absolutely need indexes.

Indexes: The Database's Phone Book

An index is a separate data structure that maintains a sorted order and lets you find rows quickly. It's like the index in the back of a book - instead of reading every page to find "Serializability," you look it up in the index and jump straight to page 347.

B-Tree Index: The King of Indexes

The B-Tree (actually B+Tree in most databases) is the workhorse index structure. It's a balanced tree where:

  • Internal nodes contain keys and pointers to child nodes
  • Leaf nodes contain keys and pointers to actual rows (or row IDs)
  • All leaf nodes are at the same depth
  • Tree stays balanced on inserts/deletes
                   [50, 100]
                  /    |     \
          [10,30,40] [60,80] [120,150]   ← leaf nodes
               ↓         ↓        ↓
            row data  row data  row data

Finding id=75:

  1. Start at root: 75 is between 50 and 100, go middle
  2. At [60, 80]: 75 is between 60 and 80, go middle
  3. At leaf node, find the record or pointer to page containing id=75
  4. Read that page, extract the row

For a million-row table, a B-Tree might have height 3-4. That's only 3-4 I/O operations to find any row! Compare that to scanning thousands of pages in a heap file.
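That descent is easy to sketch. The fanout here is tiny so it fits on screen; real nodes hold hundreds of keys, which is exactly what keeps the tree 3-4 levels deep:

```python
from bisect import bisect_right

# B-Tree search sketch: internal nodes are (separator_keys, children);
# leaves are (sorted [(key, row)] pairs, None).
def btree_search(node, key, depth=0):
    entries, children = node
    if children is None:                   # leaf: look for the exact key
        for k, row in entries:
            if k == key:
                return row, depth
        return None, depth
    # Separator keys tell us which child subtree must contain the key.
    child = children[bisect_right(entries, key)]
    return btree_search(child, key, depth + 1)

leaf1 = ([(10, "row10"), (30, "row30"), (40, "row40")], None)
leaf2 = ([(60, "row60"), (75, "row75"), (80, "row80")], None)
leaf3 = ([(120, "row120"), (150, "row150")], None)
root = ([50, 100], [leaf1, leaf2, leaf3])

print(btree_search(root, 75))   # ('row75', 1): one internal hop, one leaf
```

Each level of the descent corresponds to one page read in a disk-based B-Tree - the `depth` counter is your I/O count.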

💡
Why B-Trees?

B-Trees have high fanout (hundreds of children per node), which keeps the tree shallow. Fewer levels = fewer I/O operations. They're also self-balancing and handle range queries beautifully (all leaves are linked, just traverse left to right).

Hash Index: Fast but Limited

Hash indexes use a hash function to map keys directly to buckets:

hash(id=42) = 7 → Bucket 7 → [pointers to rows with id=42]
hash(id=100) = 3 → Bucket 3 → [pointers to rows with id=100]

Pros: O(1) lookups for exact matches - incredibly fast!

Cons: Can't do range queries. WHERE id > 50 requires scanning all buckets. Also, hash collisions need to be handled.

Hash indexes are great for equality lookups (WHERE id = 42) but terrible for anything else. B-Trees handle both equality and ranges, which is why they're more popular.

Clustered vs Non-Clustered Indexes

Clustered Index: The table data itself is organized by the index key. The leaf nodes of the index ARE the actual rows.

B-Tree (clustered on id):
Leaf nodes contain: [id=10, name="Alice", ...full row data...]
                    [id=20, name="Bob", ...full row data...]

Benefit: Finding a row by the clustered key is super fast - one index lookup and you have the whole row.

Cost: You can only have ONE clustered index per table (because the data can only be physically sorted one way). In MySQL InnoDB, the primary key is always clustered.

Non-Clustered Index: Leaf nodes contain row IDs or pointers, not the actual data.

B-Tree (non-clustered on email):
Leaf nodes contain: [email="alice@ex.com", row_id=1]
                    [email="bob@ex.com", row_id=2]

To get the full row, you need two lookups:

  1. Search the index to find row_id
  2. Look up row_id in the main table (clustered index or heap)

This is called an index lookup or bookmark lookup. It's slower than a clustered index but still way faster than scanning the whole table.

💻
Real Query Example

SELECT * FROM users WHERE email = 'alice@example.com'

Without index on email: Scan entire heap file (1000+ I/O operations)

With non-clustered index on email:
1. Search B-Tree index (3-4 I/O operations) → find row_id=42
2. Look up row_id=42 in clustered index (1-2 I/O operations)
Total: ~5 I/O operations vs 1000+

That's a 200x speedup!

The Buffer Pool: RAM to the Rescue

Remember how disk I/O is 100,000x slower than RAM? The buffer pool (also called buffer cache) is the database's attempt to minimize this pain.

The buffer pool is a large chunk of RAM (often gigabytes) that caches pages from disk:

┌─────────────────────────────────────────┐
│         Buffer Pool (RAM)                │
├─────────────────────────────────────────┤
│  Frame 0: Page 42 (dirty)               │
│  Frame 1: Page 17 (clean)               │
│  Frame 2: Page 99 (dirty)               │
│  Frame 3: Page 5  (clean)               │
│  ...                                     │
│  Frame N: Empty                          │
└─────────────────────────────────────────┘
         ↕ (only on cache miss)
┌─────────────────────────────────────────┐
│           Disk Storage                   │
└─────────────────────────────────────────┘

How it works:

  1. Query needs page 42
  2. Check buffer pool: Is page 42 already in memory?
  3. Cache hit: Great! Use it directly. No disk I/O! 🎉
  4. Cache miss: Sad. Read page 42 from disk, put it in buffer pool, evict something else if full

Dirty pages: Pages that have been modified in memory but not yet written to disk. Eventually they need to be flushed back to disk (called write-back).

Replacement policy: When the buffer pool is full and you need to load a new page, which one do you evict? Most databases use LRU (Least Recently Used) or variants like Clock or LRU-K.
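A buffer pool with LRU eviction fits in a few lines using an OrderedDict. The "disk" here is just a dict, and the dirty-page write-back is illustrative:

```python
from collections import OrderedDict

# Buffer pool sketch with LRU eviction and write-back of dirty pages.
class BufferPool:
    def __init__(self, capacity, disk):
        self.capacity, self.disk = capacity, disk
        self.frames = OrderedDict()   # page_id -> (data, dirty)
        self.hits = self.misses = 0

    def get(self, page_id):
        if page_id in self.frames:                 # cache hit: no disk I/O!
            self.hits += 1
            self.frames.move_to_end(page_id)       # mark most recently used
            return self.frames[page_id][0]
        self.misses += 1                           # cache miss: go to disk
        if len(self.frames) >= self.capacity:
            victim, (data, dirty) = self.frames.popitem(last=False)  # evict LRU
            if dirty:
                self.disk[victim] = data           # flush dirty page first
        self.frames[page_id] = (self.disk[page_id], False)
        return self.frames[page_id][0]

disk = {i: f"page-{i}" for i in range(10)}
pool = BufferPool(capacity=3, disk=disk)
for pid in [1, 2, 3, 1, 1, 4, 1]:   # page 1 is "hot"
    pool.get(pid)
print(pool.hits, pool.misses)       # 3 4
```

Notice that the hot page 1 never gets evicted - exactly the 80-20 behavior the next callout describes.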

ℹ️
The 80-20 Rule

Typically, 80% of queries access 20% of the data. If your buffer pool can hold that "hot" 20%, your cache hit rate will be ~80%. This is why throwing more RAM at a database often dramatically improves performance - more cache hits!

Sequential vs Random I/O: The Secret to Performance

Not all I/O is created equal. Sequential I/O (reading consecutive pages) is MUCH faster than random I/O (reading scattered pages).

Why? Mechanical sympathy. On an HDD:

  • Sequential read: Read head is already in position, just keep reading. Fast!
  • Random read: Move read head to new location (seek time ~10ms), then read. Slow!

Even on SSDs, sequential I/O is faster due to how flash memory works.

This is why database design obsesses over data locality:

  • Keep related data on the same page or adjacent pages
  • Use clustered indexes to physically sort data by common access patterns
  • Partition large tables to keep hot data together

Table scans (reading the entire table sequentially) are actually pretty fast IF you're going to read most of the data anyway. Reading 1000 pages sequentially might be faster than reading 50 pages randomly!

This is why the query optimizer sometimes chooses a table scan over using an index - if you're retrieving a large percentage of rows, scanning is more efficient.

⚠️
Index Selectivity Matters

An index on gender (2 values) is almost useless - the optimizer will likely ignore it and scan the table.

An index on email (unique values) is incredibly valuable - it makes queries 1000x faster.

The more selective (fewer duplicates) the index, the more useful it is.

Column-Oriented Storage: A Different Approach

Everything we've discussed so far assumes row-oriented storage - rows are stored together. But there's another way: column-oriented storage.

Row-oriented (traditional):

Page 0: [Row1: id=1, name="Alice", age=28]
        [Row2: id=2, name="Bob", age=35]
        [Row3: id=3, name="Carol", age=42]

Column-oriented:

Page 0: [id column: 1, 2, 3, 4, 5, ...]
Page 1: [name column: "Alice", "Bob", "Carol", ...]
Page 2: [age column: 28, 35, 42, ...]

All values for one column are stored together!

Benefits:

  • Analytical queries: SELECT AVG(age) FROM users only reads the age column, ignores name/email. Huge I/O savings!
  • Compression: Similar values compress better. A column of integers compresses 10x-100x better than mixed row data
  • SIMD: Modern CPUs can process arrays of similar values super fast

Drawbacks:

  • OLTP queries: SELECT * FROM users WHERE id=42 needs to read multiple column files and reassemble the row. Slow!
  • Updates: Updating one row requires touching multiple column files

This is why column stores like ClickHouse, Vertica, and Redshift are amazing for analytics (read-heavy, aggregate queries) but terrible for OLTP (transactional, row-level updates).

Modern databases like PostgreSQL are hybrid - primarily row-oriented but with column-store extensions for analytics.

Data Files in Practice: PostgreSQL Example

Let's see how PostgreSQL actually organizes data on disk:

/var/lib/postgresql/data/
├── base/                    ← Database files
│   ├── 16384/              ← Database OID
│   │   ├── 16385           ← Table file (heap)
│   │   ├── 16385_fsm       ← Free space map
│   │   ├── 16385_vm        ← Visibility map
│   │   ├── 16386           ← Index file (B-Tree)
│   │   └── ...
├── pg_wal/                 ← Write-ahead log
└── pg_xact/                ← Transaction commit log
  • Table file (16385): Heap of pages, each 8KB
  • Free space map: Tracks which pages have free space for inserts
  • Visibility map: Tracks which pages have all rows visible to all transactions (for vacuum optimization)
  • Index files: B-Tree structures, also page-based

When you INSERT a row, PostgreSQL:

  1. Checks free space map for a page with room
  2. Loads that page into buffer pool
  3. Adds row to page using slotted layout
  4. Marks page as dirty
  5. Eventually writes back to disk
🔬
PostgreSQL Page Anatomy

You can actually inspect pages using the pageinspect extension:

SELECT * FROM heap_page_items(get_raw_page('users', 0));

This shows you the slot array, tuple offsets, free space - everything we've discussed! It's like an X-ray of your database.

TL;DR

The Storage Hierarchy:

  • RAM is fast (~100ns), disk is slow (~10ms)
  • Minimize I/O at all costs!

Pages are the fundamental unit:

  • Fixed-size chunks (typically 8KB)
  • Everything happens at page granularity
  • Slotted page layout for flexible tuple storage

Heap files are simple but slow:

  • Unordered collection of pages
  • Scans require reading everything
  • Need indexes for fast lookups

Indexes make queries fast:

  • B-Trees: balanced, support ranges, most common
  • Hash indexes: fast equality, no ranges
  • Clustered vs non-clustered trade-offs

Buffer pool caches hot data:

  • Keep frequently accessed pages in RAM
  • LRU eviction policy
  • High cache hit rate = fast database

Sequential I/O >> Random I/O:

  • Keep related data together
  • Data locality matters enormously
  • Sometimes scans beat indexes!

Column stores for analytics:

  • Store columns separately
  • Great compression and SIMD
  • Fast aggregates, slow row retrieval

Next time you run a query, picture the journey: SQL → query plan → index traversal → page reads → buffer pool → disk → pages → slots → tuples → bytes. It's a beautiful dance of abstraction layers, all working together to make SELECT look simple!

Modern SQL

You write SELECT * FROM users WHERE age > 25 and hit enter. Simple, right? Three seconds later, your result appears. You're happy.

But what you don't see is the absolute chaos that just happened behind the scenes. Your innocent little query triggered an optimizer that considered 47 different execution strategies, ran statistical analysis on your data distribution, predicted I/O costs down to the millisecond, and ultimately chose an algorithm you've probably never heard of - all in a fraction of a second.

Modern SQL databases are frighteningly smart. They're doing things that would make a PhD dissertation look simple. Let's dive into the wizard's workshop and see what kind of sorcery is actually happening.

The Query Journey: From SQL to Execution

First, let's trace the path your query takes through the database:

Your SQL
   ↓
Parser → Check syntax, build parse tree
   ↓
Binder → Verify tables/columns exist, resolve names
   ↓
Optimizer → THIS IS WHERE THE MAGIC HAPPENS
   ↓
Execution Plan → The actual algorithm to run
   ↓
Execution Engine → Just do what the optimizer said
   ↓
Results!

Most people focus on writing SQL or tuning indexes. But the optimizer? That's where databases flex their 50 years of computer science research.

ℹ️
The Optimizer's Job

Given one SQL query, the optimizer might generate hundreds or thousands of possible execution plans. Its job: find the fastest one without actually running them all. It's like trying to predict which route through the city is fastest without actually driving each one.

The Cost Model: Predicting the Future

Here's the first bit of magic: the optimizer doesn't just guess. It models the cost of each possible plan.

Cost factors:

  • I/O cost: How many pages to read from disk?
  • CPU cost: How many tuples to process?
  • Network cost: (for distributed databases) How much data to transfer?
  • Memory cost: Will this fit in buffer pool or require disk spills?

Let's say you have:

SELECT * FROM users 
WHERE age > 25 AND city = 'New York';

The optimizer considers:

Option 1: Scan the whole table

  • Cost: Read all 10,000 pages = 10,000 I/O ops
  • Then filter in memory
  • Estimated time: ~10 seconds

Option 2: Use index on age

  • Cost: Read index (height=3) = 3 I/O ops
  • Then read matching data pages = ~3,000 pages = 3,000 I/O ops
  • Estimated time: ~3 seconds

Option 3: Use index on city

  • Cost: Read index = 3 I/O ops
  • Read matching pages = 500 pages = 500 I/O ops
  • Estimated time: ~0.5 seconds ← WINNER!

The optimizer picks Option 3. But how did it know city='New York' would only match 500 pages?

Statistics.
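A cost model at its most basic is just comparing predicted page reads. This toy uses the hypothetical numbers from the example above (and an invented ~1 ms per page read):

```python
# Reproducing the plan comparison above: cost is dominated by page reads.
PAGE_READ_MS = 1.0   # assumed cost per page read, for illustration only

plans = {
    "full table scan":  10_000,      # read every page
    "index on age":     3 + 3_000,   # index descent + matching data pages
    "index on city":    3 + 500,     # much more selective predicate
}

best = min(plans, key=plans.get)
for name, pages in sorted(plans.items(), key=lambda kv: kv[1]):
    print(f"{name}: {pages} page reads (~{pages * PAGE_READ_MS / 1000:.1f} s)")
print("optimizer picks:", best)   # index on city
```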

💡
The Statistics System

Databases maintain statistics about your data: number of rows, distinct values per column, data distribution histograms, correlation between columns, and more. Run ANALYZE or UPDATE STATISTICS regularly, or your optimizer is flying blind!

Cardinality Estimation: The Art of Fortune Telling

Cardinality = how many rows a query will return. Getting this right is CRITICAL because it affects every downstream decision.

Simple Predicate

WHERE age = 30

If the table has 1,000,000 rows and age has 70 distinct values (ages 18-87), the optimizer estimates:

Cardinality = 1,000,000 / 70 ≈ 14,285 rows

This assumes uniform distribution - a simplification, but reasonable.

Multiple Predicates (The Independence Assumption)

WHERE age = 30 AND city = 'New York'

Optimizer assumes age and city are independent:

Selectivity(age=30) = 1/70 = 0.014
Selectivity(city='NY') = 0.05 (5% of users in NY)
Combined = 0.014 × 0.05 = 0.0007
Cardinality = 1,000,000 × 0.0007 = 700 rows

But what if young people prefer cities? Then age and city are correlated, and this estimate is wrong!
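The independence-assumption arithmetic above is literally just multiplication (the text rounds the selectivities, which is why it lands on 700 instead of 714):

```python
# Cardinality estimation under the independence assumption,
# using the numbers from the example above.
total_rows = 1_000_000
sel_age = 1 / 70     # age = 30, 70 distinct ages, uniform assumption
sel_city = 0.05      # 5% of users in New York

combined = sel_age * sel_city          # independence: just multiply
estimate = total_rows * combined
print(round(estimate))                 # 714
```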

⚠️
When Estimates Go Wrong

The optimizer estimated 700 rows, so it chose a nested loop join. Reality: 50,000 rows. Now your query takes 10 minutes instead of 10 seconds because the wrong algorithm was chosen. This is why DBAs obsess over statistics quality!

Modern Solution: Histograms and Multi-Dimensional Statistics

PostgreSQL, SQL Server, and Oracle now maintain histograms - bucketed distributions of actual data:

age histogram:
[18-25]: 200,000 rows  (young users!)
[26-35]: 400,000 rows  (peak)
[36-50]: 300,000 rows
[51+]:   100,000 rows

Even better, some databases track multi-column statistics to capture correlations:

CREATE STATISTICS young_city_corr 
ON age, city FROM users;

Now the optimizer knows that age and city ARE correlated and adjusts estimates accordingly.

Join Algorithms: More Than You Ever Wanted to Know

Here's where databases really show off. You write:

SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.city = 'Boston';

Simple, right? But the optimizer has to choose from dozens of algorithms:

Nested Loop Join (The Simple One)

for each row in users where city='Boston':
    for each row in orders where user_id = user.id:
        output joined row

Cost: If 100 Boston users and 1,000,000 orders:

  • Outer loop: 100 iterations over Boston users
  • Inner loop: with an index on orders.user_id, ≈ 10 matching orders per user
  • Total: 100 × 10 = 1,000 row lookups

When to use: Small outer table, index on inner table's join key. Perfect for this query!

Hash Join (The Clever One)

1. Build hash table on smaller table (users from Boston)
2. Probe: for each order, hash user_id and look up in hash table
3. Output matches

Cost:

  • Build phase: Read Boston users (100 rows)
  • Probe phase: Read all orders (1,000,000 rows), O(1) lookup each
  • Total: ~1,000,100 operations, but no random I/O!

When to use: No indexes available, joining large tables, can fit build side in memory.

💻
The Hash Join Trick

Hash joins are I/O efficient because they read each table sequentially (no random seeks). Even if nested loop needs fewer comparisons, hash join might be faster because sequential I/O is so much quicker than random access!
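The two phases are easy to see in code. A minimal sketch with Python dicts standing in for the hash table (real engines partition and spill to disk when the build side doesn't fit in memory):

```python
from collections import defaultdict

# Hash join sketch: build on the smaller input, then stream the larger
# input past it - one sequential pass over each side.
def hash_join(build_rows, probe_rows, build_key, probe_key):
    table = defaultdict(list)
    for row in build_rows:                    # build phase
        table[row[build_key]].append(row)
    out = []
    for row in probe_rows:                    # probe phase: O(1) per row
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out

users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bo"}]
orders = [{"user_id": 1, "total": 30}, {"user_id": 1, "total": 5},
          {"user_id": 2, "total": 12}]
joined = hash_join(users, orders, "id", "user_id")
print(len(joined))   # 3
```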

Sort-Merge Join (The Sophisticated One)

1. Sort users by id
2. Sort orders by user_id  
3. Merge: walk through both sorted lists simultaneously

Cost:

  • Sort users: 100 × log(100) ≈ 664
  • Sort orders: 1,000,000 × log(1,000,000) ≈ 20,000,000
  • Merge: 100 + 1,000,000 = 1,000,100
  • Total: ~20,001,000 operations

Looks expensive! But if the data is ALREADY sorted (because of an index or previous operation), the sorts are free. Then merge is just two sequential scans - super fast!

When to use: Data already sorted, or you need sorted output anyway (for ORDER BY or GROUP BY downstream).
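Here's the merge step in isolation, assuming both inputs arrive already sorted on the join key (this simplified version handles duplicate keys on the right side; full many-to-many merge joins need a bit more bookkeeping):

```python
# Merge step of a sort-merge join: one simultaneous pass over two
# inputs that are already sorted on the join key.
def merge_join(left, right, key):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1                     # left key too small, advance left
        elif lk > rk:
            j += 1                     # right key too small, advance right
        else:
            k = j                      # emit all right rows sharing this key
            while k < len(right) and right[k][key] == lk:
                out.append({**left[i], **right[k]})
                k += 1
            i += 1
    return out

users = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [{"id": 1, "t": 5}, {"id": 1, "t": 7}, {"id": 3, "t": 2}]
print(len(merge_join(users, orders, "id")))   # 3
```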

The Optimizer's Decision

The optimizer estimates costs for ALL of these (and more), considering:

  • Available indexes
  • Data cardinalities
  • Memory available
  • Whether output needs to be sorted

Then it picks the winner. And it does this for EVERY join in your query, considering all possible orderings!

SELECT *
FROM A JOIN B ON A.id = B.id
       JOIN C ON B.id = C.id
       JOIN D ON C.id = D.id;

Possible join orders:

  • ((A ⋈ B) ⋈ C) ⋈ D
  • (A ⋈ (B ⋈ C)) ⋈ D
  • A ⋈ ((B ⋈ C) ⋈ D)
  • ... and many more

For N tables, there are roughly (2(N-1))! / (N-1)! possible orderings. For 10 tables? Over 17 billion possibilities.

The optimizer can't check them all. So it uses heuristics, dynamic programming, and sometimes genetic algorithms to search the space efficiently.

🔬
The Join Ordering Problem

Finding the optimal join order is NP-hard. Modern optimizers use sophisticated search strategies: PostgreSQL uses dynamic programming (exact for <12 tables, heuristic for more), SQL Server uses a "memo" structure to cache subproblems, and some experimental optimizers use machine learning!

Adaptive Query Processing: Learning on the Fly

Here's where it gets wild. Modern databases don't just plan and execute - they adapt mid-query.

Adaptive Join Selection (SQL Server)

SQL Server's optimizer might say: "I'm not sure if nested loop or hash join is better. Let me start with nested loop, but if I process more than 1000 rows, switch to hash join mid-execution."

Start: Nested Loop Join
  → After 500 rows: "This is fine, keep going"
  → After 1500 rows: "Wait, this is taking forever!"
  → Switch to Hash Join without restarting query

The database is literally changing algorithms WHILE YOUR QUERY IS RUNNING.

Runtime Filter Pushdown (ClickHouse, Snowflake)

Consider:

SELECT * FROM big_table b
JOIN small_table s ON b.id = s.id
WHERE s.category = 'active';

Traditional plan:

  1. Scan big_table (1 billion rows)
  2. Scan small_table, filter to 'active' (100 rows)
  3. Join (now only need to check 100 IDs from big_table)

But we wasted time scanning 1 billion rows!

Runtime filter pushdown:

  1. Scan small_table first, get IDs: {42, 87, 153, ...} (100 IDs)
  2. Build a bloom filter or hash set
  3. Scan big_table, but skip rows where ID not in filter
  4. Now only read ~100 rows from big_table!

The filter is computed AT RUNTIME and pushed down dynamically. You didn't ask for this. The database just decided to do it because it's smarter than you.

💡
Bloom Filters: Space Magic

A bloom filter is a probabilistic data structure that answers "is X in the set?" in O(1) time and constant space. It can return false positives (claiming a key is present when it isn't) but never false negatives. Perfect for filtering billions of rows with just KB of memory!
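A tiny bloom filter is only a bitset and a few hash positions per key. The sha256-based hashing here is an arbitrary choice for the sketch; real implementations use faster non-cryptographic hashes:

```python
import hashlib

# Bloom filter sketch: k bit positions per key in an m-bit array.
# False positives are possible; false negatives are not.
class BloomFilter:
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k = m_bits, k
        self.bits = 0                      # Python int as an m-bit bitset

    def _positions(self, key):
        for i in range(self.k):            # k independent-ish hash positions
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
for user_id in [42, 87, 153]:              # the "active" IDs from the join
    bf.add(user_id)
print(bf.might_contain(42))    # True: never a false negative
print(bf.might_contain(9999))  # almost certainly False here (few bits set)
```

This is exactly the structure a runtime filter pushes into the big_table scan: membership test per row, kilobytes of state.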

Cardinality Re-Estimation (Oracle)

Oracle's optimizer can detect when its estimates were wrong:

Expected: 1,000 rows after filter
Reality: 500,000 rows (oops!)

Oracle: "My estimate was garbage. Let me re-plan 
         the rest of the query with correct cardinality."

Mid-query re-optimization. Because plans go stale, and modern databases know it.

Parallel Execution: Divide and Conquer

Your query:

SELECT COUNT(*) FROM huge_table WHERE value > 1000;

Traditional: One thread scans 10 million rows. Takes 10 seconds.

Parallel execution:

Thread 1: Scan rows 0-2.5M
Thread 2: Scan rows 2.5M-5M  
Thread 3: Scan rows 5M-7.5M
Thread 4: Scan rows 7.5M-10M

Each thread: COUNT(*)
Final: SUM(all counts)

Now it takes 2.5 seconds (assuming 4 cores and perfect scaling).

But wait, there's more! Modern databases do parallel everything:

Parallel Hash Join:

1. Partition users into 4 buckets by hash(id)
2. Partition orders into 4 buckets by hash(user_id)
3. Four threads, each joins one bucket pair
4. Merge results

Parallel Aggregation:

SELECT city, AVG(age) FROM users GROUP BY city;
1. Each thread scans part of table, computes local aggregates
2. Combine phase: merge partial aggregates
3. Compute final AVG from combined SUM/COUNT
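The crucial trick in parallel aggregation is that AVG is computed from combined partial (SUM, COUNT) pairs, never by averaging averages. A sketch with a thread pool and fake data (dropping the GROUP BY for brevity):

```python
from concurrent.futures import ThreadPoolExecutor

# Parallel AVG sketch: each worker computes a partial (sum, count) over
# its partition; the final AVG comes from the merged partials.
ages = list(range(18, 88)) * 1000          # 70,000 fake rows

def partial(chunk):
    return sum(chunk), len(chunk)          # local aggregate

chunks = [ages[i::4] for i in range(4)]    # 4 striped partitions
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial, chunks))

total = sum(s for s, _ in partials)        # combine phase
count = sum(c for _, c in partials)
print(total / count)                       # 52.5, same as sum(ages)/len(ages)
```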

The optimizer decides:

  • How many threads to use
  • How to partition the data
  • Where to place exchange operators (data shuffling points)
  • Whether parallelism is even worth it (overhead vs speedup)
⚠️
Parallelism Isn't Free

Coordinating threads, partitioning data, and merging results has overhead. For small queries, parallel execution is SLOWER. The optimizer must predict when parallelism helps vs hurts. Getting this wrong means your "optimization" made things worse!

Vectorized Execution: SIMD on Steroids

Traditional query execution (Volcano model):

while (tuple = next()) {
    result = apply_filter(tuple);
    emit(result);
}

One tuple at a time. Lots of function calls, branches, cache misses.

Vectorized execution (DuckDB, ClickHouse):

while (batch = next_batch()) {  // Get 1024 tuples
    results = apply_filter_vectorized(batch);  // Process all at once
    emit_batch(results);
}

Process tuples in batches of 1024-2048. The filter function operates on arrays:

// Instead of:
int j = 0;
for (int i = 0; i < 1024; i++) {
    if (ages[i] > 25) output[j++] = rows[i];
}

// Compiler generates SIMD:
// Check 8 ages at once with AVX2 instructions
// 8x fewer iterations, better cache locality

Modern CPUs have SIMD (Single Instruction Multiple Data) that can process 8-16 values simultaneously. Vectorized engines exploit this automatically.

Result: 10-100x speedup on analytical queries. DuckDB crushes Postgres on aggregations because of this.
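The structural difference is easy to show even without real SIMD. In this pure-Python sketch, operators exchange batches of column arrays instead of single tuples; in a real engine those per-batch loops are SIMD kernels over contiguous arrays:

```python
# Vectorized-execution sketch: batch-at-a-time over column arrays.
BATCH = 1024

def scan(ages, prices):
    for i in range(0, len(ages), BATCH):
        yield ages[i:i + BATCH], prices[i:i + BATCH]   # one batch of columns

def filter_gt(batch, threshold):
    ages, prices = batch
    mask = [a > threshold for a in ages]               # one tight pass per batch
    return ([a for a, m in zip(ages, mask) if m],
            [p for p, m in zip(prices, mask) if m])

ages = [20, 30, 40, 50] * 1000
prices = [1.0, 2.0, 3.0, 4.0] * 1000
total = 0.0
for batch in scan(ages, prices):
    _, kept_prices = filter_gt(batch, 25)
    total += sum(kept_prices)                          # aggregate per batch
print(total)   # 9000.0: sum of prices where age > 25
```

Compare this to the Volcano loop above: a handful of function calls per 1024 rows instead of several per row.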

💻
Real-World Impact

Query: SELECT SUM(price) FROM orders WHERE status = 'completed'

PostgreSQL (tuple-at-a-time): 5 seconds
DuckDB (vectorized): 0.3 seconds

Same data, same machine. The execution model matters THAT much.

Just-In-Time (JIT) Compilation: Compiling Your Query

Here's some next-level sorcery: compile your query to machine code.

Traditional interpretation:

For each row:
    Push onto stack
    Call filter function
    Call projection function
    Pop from stack
    Emit result

Thousands of function calls, stack operations, indirection.

JIT compilation (PostgreSQL with LLVM, Hyper/Tableau):

1. Take query plan
2. Generate C code or LLVM IR
3. Compile to native machine code
4. Execute compiled function directly

Your query becomes a tight loop with no function call overhead:

; Pseudo-assembly for: WHERE age > 25 AND city = 'Boston'
loop:
    load age from [rdi]
    cmp age, 25
    jle skip
    load city_ptr from [rdi+8]
    cmp [city_ptr], 'Boston'
    jne skip
    ; emit row
skip:
    add rdi, 32  ; next row
    jmp loop

No interpretation, no indirection. Just raw CPU instructions.

Cost: Compilation takes 10-100ms. So JIT only helps for long-running queries (seconds or more). The optimizer must predict if compilation overhead is worth it!

🔬
HyPer/Umbra Innovation

The HyPer database (now Tableau's engine) pioneered query compilation. Their approach: compile the entire query pipeline into one tight loop with no materialization. Result: analytical queries 10-100x faster than traditional row-at-a-time execution.

Approximate Query Processing: Good Enough is Perfect

Sometimes you don't need exact answers:

SELECT AVG(price) FROM orders;

Do you REALLY need to scan all 1 billion rows to get an average? Or would "approximately $47.32 ± $0.50" be fine?

Sampling

SELECT AVG(price) FROM orders TABLESAMPLE BERNOULLI(1);

Read only 1% of rows, compute average on sample. 100x faster, answer is usually within 1% of truth.
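A toy version in Python (made-up prices, nothing to do with a real orders table) shows why this works:

```python
import random

# Simulate Bernoulli-style sampling: each row independently has a 1%
# chance of being included, and the sample average tracks the full one.
random.seed(0)
prices = [random.uniform(10, 90) for _ in range(1_000_000)]

full_avg = sum(prices) / len(prices)
sample = [p for p in prices if random.random() < 0.01]   # ~1% sample
sample_avg = sum(sample) / len(sample)

print(round(full_avg, 2), round(sample_avg, 2))
```

The sample average lands within a fraction of a percent of the true average while touching only ~10,000 of the million rows.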

Sketches (HyperLogLog for COUNT DISTINCT)

SELECT COUNT(DISTINCT user_id) FROM events;

Traditional: Hash all user_ids into a set, count size. Memory = O(cardinality).

HyperLogLog sketch: Use ~1KB of memory, get count with ~2% error.

For each user_id:
    hash = hash(user_id)
    bucket = hash % 16384
    leading_zeros = count_leading_zeros(hash)
    max_zeros[bucket] = max(max_zeros[bucket], leading_zeros)

Cardinality ≈ α × m × 2^(average(max_zeros))   (m = number of buckets, α ≈ 0.4 is a bias-correction constant)

Sounds like magic? It is. But it works.

Result: COUNT(DISTINCT) on billions of rows in seconds, not hours.
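Here's a minimal Python sketch of that pseudocode. This is the simplified LogLog estimator; real HyperLogLog adds a harmonic mean and small/large-range corrections, but the idea is the same:

```python
import hashlib

# Simplified LogLog sketch: a fixed array of small counters estimates
# cardinality, no matter how many distinct items stream through.
def loglog_estimate(items, b=10):
    m = 1 << b                                   # number of buckets
    max_rank = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        bucket = h & (m - 1)                     # low b bits pick a bucket
        w = h >> b                               # remaining 64 - b bits
        rank = (64 - b) - w.bit_length() + 1     # 1 + number of leading zeros
        max_rank[bucket] = max(max_rank[bucket], rank)
    alpha = 0.39701                              # LogLog bias-correction constant
    return alpha * m * 2 ** (sum(max_rank) / m)

est = loglog_estimate(range(50_000))
print(round(est))   # roughly 50000, within a few percent
```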

💡
When to Use Approximation

Dashboards, analytics, exploration - approximation is perfect. Financial reports, compliance - need exact answers. Modern databases like ClickHouse and Snowflake make sampling trivial, and many have built-in sketch algorithms.

Push-Based vs Pull-Based Execution

Traditional (pull-based / Volcano model):

Top operator: "Give me next row"
  ↓
Join: "Give me next row from both inputs"
  ↓
Scan: "Read next row from disk"

Data is pulled up through the pipeline. Simple, but lots of function call overhead.

Push-based (MonetDB, Vectorwise):

Scan: "I have 1024 rows, pushing to filter"
  ↓
Filter: "Got 1024, filtered to 800, pushing to join"
  ↓
Join: "Got 800, joined to 600, pushing to output"

Data is pushed through operators. Fewer function calls, better cache locality, easier to vectorize.

Morsel-Driven (HyPer): Hybrid approach. Process data in "morsels" (chunks), push within operators but pull between pipeline breakers (like hash join build phase).

The optimizer chooses the execution model based on query shape and workload!
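The pull-based model is easy to sketch: each operator is a generator that pulls one row at a time from its child. A minimal illustration, not any engine's actual code:

```python
# Pull-based (Volcano-style) pipeline: every operator pulls one row at
# a time from its child via the iterator protocol.
def scan(rows):
    for row in rows:
        yield row

def filter_op(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def project(child, column):
    for row in child:
        yield row[column]

rows = [{"age": 20}, {"age": 30}, {"age": 40}]
plan = project(filter_op(scan(rows), lambda r: r["age"] > 25), "age")
print(list(plan))   # [30, 40]
```

Every row traverses the whole operator chain individually, which is exactly the per-row overhead that push-based and vectorized engines avoid.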

Zone Maps / Small Materialized Aggregates

Here's a sneaky optimization you never asked for:

When writing pages to disk, the database tracks metadata:

Page 42: 
  min(timestamp) = 2024-01-01
  max(timestamp) = 2024-01-07
  min(price) = 10.50
  max(price) = 999.99

Query:

SELECT * FROM orders WHERE timestamp > '2024-06-01';

Optimizer: "Page 42 has max timestamp of 2024-01-07. Skip it entirely!"

Without reading the page, we know it has no matching rows. This is called zone map filtering or small materialized aggregates.

Result: Prune entire pages/partitions without I/O. Analytical queries get 10-1000x faster.

ClickHouse, Snowflake, and Redshift do this automatically. You didn't ask for it. The database just does it because it's clever.

💻
Real Example: Time-Series Data

Table with 1 year of data, partitioned by day (365 partitions).
Query: WHERE timestamp > NOW() - INTERVAL '7 days'

Zone maps let optimizer skip 358 partitions immediately.
Scan 7 days of data instead of 365 days = 50x speedup!
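A toy sketch of the pruning logic, with hypothetical page metadata:

```python
# Zone-map pruning: a page can match "timestamp > cutoff" only if its
# max timestamp exceeds the cutoff, so other pages are skipped unread.
pages = [
    {"min_ts": "2024-01-01", "max_ts": "2024-01-07"},
    {"min_ts": "2024-05-28", "max_ts": "2024-06-03"},
    {"min_ts": "2024-06-04", "max_ts": "2024-06-10"},
]

def pages_to_scan(pages, cutoff):
    # ISO date strings compare correctly as plain strings
    return [p for p in pages if p["max_ts"] > cutoff]

survivors = pages_to_scan(pages, "2024-06-01")
print(len(survivors))   # 2, the January page is skipped without any I/O
```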

Machine Learning in the Optimizer

This is where databases officially become science fiction.

Learned Cardinality Estimation (Research / Neo, Bao)

Traditional: Use statistics and independence assumption.

ML approach: Train a neural network on query workload:

Input: Query features (predicates, joins, tables)
Output: Estimated cardinality

Training data: Actual query executions

The model learns correlations, data skew, and patterns that statistics miss.

Result: 10-100x better estimates than traditional methods in research papers. Production adoption is starting.

Learned Indexes (Research)

B-Trees are great, but what if we could do better?

Key insight: An index is just a function mapping keys to positions.

Traditional B-Tree: 
  key → traverse tree → find position

Learned Index:
  key → neural network → predict position → verify

Train a neural network to predict "where is key X in the sorted array?"

Result: In some workloads, learned indexes are 2-3x faster and 10x smaller than B-Trees. Still research, but Google is experimenting.
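You can fake the idea in a few lines: fit a straight line from key to position, predict, then verify with a bounded local search. A toy sketch of the concept only, not the real learned-index machinery:

```python
import bisect

# Toy "learned index": a two-point linear model predicts where a key
# sits in the sorted array; a bounded local search corrects the guess.
keys = sorted(range(0, 2000, 2))          # 1000 sorted, evenly spaced keys
n = len(keys)
a = (n - 1) / (keys[-1] - keys[0])        # slope of the linear "model"
b = -a * keys[0]

def lookup(k, err=8):
    pred = int(a * k + b)                 # model predicts a position
    lo, hi = max(0, pred - err), min(n, pred + err + 1)
    i = bisect.bisect_left(keys, k, lo, hi)   # verify within the error bound
    return i if i < n and keys[i] == k else -1

print(lookup(500))   # 250
```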

📝
The ML-Database Convergence

We're seeing ML infuse databases (learned optimizers) AND databases infuse ML (vector databases, embedding search). The lines are blurring. In 10 years, every database will have ML components under the hood.

The Explain Plan: Your Window Into the Optimizer's Mind

Want to see what the optimizer chose?

EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.city = 'Boston';

PostgreSQL output:

Nested Loop  (cost=0.56..892.34 rows=100 width=64) 
             (actual time=0.043..5.231 rows=112 loops=1)
  Buffers: shared hit=245 read=12
  ->  Index Scan on users u (cost=0.42..23.45 rows=100 width=32)
                              (actual time=0.021..0.156 rows=112 loops=1)
        Index Cond: (city = 'Boston')
        Buffers: shared hit=45
  ->  Index Scan on orders o (cost=0.14..8.68 rows=1 width=32)
                              (actual time=0.002..0.042 rows=10 loops=112)
        Index Cond: (user_id = u.id)
        Buffers: shared hit=200 read=12
Planning Time: 0.342 ms
Execution Time: 5.487 ms

This tells you EVERYTHING:

  • Nested loop join chosen
  • Index scans on both tables
  • Estimated 100 rows, actually got 112 (pretty good!)
  • 245 buffer hits (cache!), only 12 disk reads
  • Execution took 5.4ms

If your query is slow, start with EXPLAIN. It shows you what the optimizer thought vs reality.

💡
Reading Explain Plans

Key things to look for:
- Seq Scan on large table? Probably need an index
- Estimated rows << actual rows? Stats are stale
- Lots of disk reads? Need more buffer pool memory
- Hash Join on tiny tables? Optimizer confused, maybe outdated stats

Modern SQL Features You Should Know

The SQL standard has evolved. Modern databases support wild features:

Window Functions (Every Modern DB)

SELECT name, salary,
       AVG(salary) OVER (PARTITION BY department) as dept_avg,
       ROW_NUMBER() OVER (ORDER BY salary DESC) as rank
FROM employees;

Compute aggregates over "windows" of rows without GROUP BY collapsing. Incredibly powerful for analytics.
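You can try this exact pattern locally, since SQLite (3.25+) supports window functions too:

```python
import sqlite3

# The same window-function query against a tiny in-memory table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees(name TEXT, department TEXT, salary REAL);
INSERT INTO employees VALUES
  ('Ann', 'eng', 120), ('Bob', 'eng', 100), ('Cat', 'sales', 80);
""")

rows = conn.execute("""
SELECT name, salary,
       AVG(salary) OVER (PARTITION BY department) AS dept_avg,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS rnk
FROM employees
ORDER BY rnk
""").fetchall()

for row in rows:
    print(row)   # e.g. ('Ann', 120.0, 110.0, 1)
```

Note how each row keeps its identity while still carrying the per-department average alongside it.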

CTEs and Recursive Queries (SQL:1999)

WITH RECURSIVE subordinates AS (
    SELECT id, name, manager_id FROM employees WHERE id = 1
    UNION ALL
    SELECT e.id, e.name, e.manager_id 
    FROM employees e
    JOIN subordinates s ON e.manager_id = s.id
)
SELECT * FROM subordinates;

Traverse hierarchies, compute transitive closures. This is graph traversal in SQL!
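The same query runs against a toy org chart in SQLite:

```python
import sqlite3

# Recursive CTE walking an org chart: start at the root, then keep
# joining in employees whose manager is already in the result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees(id INTEGER, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'CEO', NULL), (2, 'Ann', 1), (3, 'Bob', 2), (4, 'Cat', 1);
""")

names = [r[0] for r in conn.execute("""
WITH RECURSIVE subordinates AS (
    SELECT id, name, manager_id FROM employees WHERE id = 1
    UNION ALL
    SELECT e.id, e.name, e.manager_id
    FROM employees e
    JOIN subordinates s ON e.manager_id = s.id
)
SELECT name FROM subordinates
""")]

print(names)   # everyone reachable from the root
```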

Lateral Joins (PostgreSQL, Oracle)

SELECT u.name, o.*
FROM users u
CROSS JOIN LATERAL (
    SELECT * FROM orders 
    WHERE user_id = u.id 
    ORDER BY created_at DESC 
    LIMIT 5
) o;

For each user, get their 5 most recent orders. The subquery can reference the outer query! This was impossible in old SQL.

JSON Support (PostgreSQL, MySQL, SQL Server)

SELECT data->>'name' as name,
       jsonb_array_elements(data->'tags') as tag
FROM documents
WHERE data @> '{"status": "active"}';

Store JSON, query it with SQL, index it, join it. The relational/document boundary is gone.

GROUPING SETS / CUBE / ROLLUP

SELECT city, product, SUM(sales)
FROM orders
GROUP BY GROUPING SETS (
    (city, product),
    (city),
    (product),
    ()
);

Compute multiple group-by aggregations in one pass. Used to require UNION of multiple queries. Now it's one efficient operation.

ℹ️
SQL is Not Dead

People keep predicting SQL's death. But SQL keeps getting MORE powerful. Modern SQL can express complex analytics, graph traversals, time-series operations, and even some ML tasks. It's 50 years old and more relevant than ever.

When the Optimizer Gets It Wrong

Optimizers are smart but not perfect. Common failure modes:

Stale Statistics

-- Yesterday: 1000 rows
-- Today: 10,000,000 rows (bulk insert)
-- Optimizer still thinks: 1000 rows

Solution: ANALYZE / UPDATE STATISTICS after bulk changes!

Correlated Columns

WHERE age < 25 AND student = true

If young people are usually students (correlation), independence assumption fails.

Solution: Multi-column statistics or hints.

Parameter Sniffing (SQL Server)

EXEC GetUsers @city = 'Boston'  -- Optimizer plans for Boston (100 rows)
EXEC GetUsers @city = 'New York'  -- Reuses plan, but NY has 10M rows!

Plan was optimal for first parameter, terrible for second.

Solution: OPTION (RECOMPILE) or plan guides.

Function Calls Hide Selectivity

WHERE UPPER(name) = 'ALICE'

Optimizer can't use index on name (function applied). Also can't estimate selectivity.

Solution: Create an expression (functional) index on UPPER(name), or store the column in one canonical case and compare against that.

⚠️
The 80-20 Rule of Query Performance

80% of slow queries are due to:
- Missing indexes (40%)
- Stale statistics (20%)
- Poorly written SQL (15%)
- Wrong data types/implicit conversions (5%)

Only 20% are actually hard optimization problems requiring deep tuning.

The Future: What's Coming

Autonomous Databases (Oracle, Azure SQL)

Databases that automatically:

  • Tune themselves
  • Create indexes
  • Adjust memory allocation
  • Detect and fix performance issues

The DBA becomes optional.

Unified OLTP/OLAP (TiDB, CockroachDB + Analytics)

One database for both transactions AND analytics. No more ETL to data warehouses.

Hybrid storage engines (row + column), workload-aware optimization.

Serverless Query Engines (BigQuery, Athena, Snowflake)

Separate storage from compute. Scale to petabytes, pay only for queries run.

No servers to manage, infinite scale.

GPU-Accelerated Databases (BlazingSQL, OmniSci)

Push operations to GPUs for 10-100x speedup on analytics.

Thousands of cores processing data in parallel.

🔬
The Pace of Innovation

In the last 10 years, we've seen: columnar execution, vectorization, JIT compilation, adaptive optimization, GPU acceleration, and ML-driven tuning. Database systems research is THRIVING. The next 10 years will be even wilder.

TL;DR

Modern SQL databases are absurdly sophisticated:

Query Optimization:

  • Cost models predict execution time with scary accuracy
  • Consider hundreds/thousands of possible plans
  • Use statistics, histograms, and ML for cardinality estimation
  • Find optimal join orders in exponential search space

Execution Innovations:

  • Adaptive algorithms switch strategies mid-query
  • Parallel execution across cores automatically
  • Vectorized/SIMD processing for 10-100x speedup
  • JIT compilation turns queries into machine code
  • Push-based execution for better cache performance

Smart Shortcuts:

  • Zone maps skip entire partitions without reading
  • Runtime filter pushdown avoids billions of rows
  • Approximate processing for "good enough" answers
  • Learned indexes and ML-powered optimizers (coming soon)

Modern SQL:

  • Window functions, CTEs, lateral joins
  • JSON support, recursive queries
  • GROUPING SETS for multi-dimensional analytics
  • Still evolving after 50 years!

The next time you write a simple SELECT statement, remember: you've just triggered a cascade of algorithms that would make a PhD dissertation look trivial. The database is working HARD to make your query look easy.

And that's beautiful.

Programmatic Access to Databases

Why Programmatic Access?

You've used web interfaces to search databases. But what if you need to:

  • Query 500 proteins automatically
  • Extract specific fields from thousands of entries
  • Build a pipeline that updates daily

You need to talk to databases programmatically — through their APIs.


Part 1: How the Web Works

URLs

A URL (Uniform Resource Locator) is an address for a resource on the web:

https://www.rcsb.org/structure/4GYD

HTTP Protocol

When your browser opens a page:

  1. Browser identifies the server from the URL
  2. Sends a request using HTTP (or HTTPS for secure)
  3. Server responds with content + status code

HTTP Methods:

  • GET — retrieve data (what we'll mostly use)
  • POST — send data to create/update
  • PUT — update data
  • DELETE — remove data

Status Codes

Every HTTP response includes a status code:

Range  Meaning        Example
1XX    Information    100 Continue
2XX    Success        200 OK
3XX    Redirect       301 Moved Permanently
4XX    Client error   404 Not Found
5XX    Server error   500 Internal Server Error

Key rule: Always check if status code is 200 (or in 2XX range) before processing the response.


Part 2: REST and JSON

REST

REST (REpresentational State Transfer) is an architecture for web services.

A REST API lets you:

  • Send HTTP requests to specific URLs
  • Get structured data back

Most bioinformatics databases offer REST APIs: PDB, UniProt, NCBI, Ensembl.

JSON

JSON (JavaScript Object Notation) is the standard format for API responses.

Four rules:

  1. Data is in name/value pairs
  2. Data is separated by commas
  3. Curly braces {} hold objects (like Python dictionaries)
  4. Square brackets [] hold arrays (like Python lists)

Example:

{
    "entry_id": "4GYD",
    "resolution": 1.86,
    "chains": ["A", "B"],
    "ligands": [
        {"id": "CFF", "name": "Caffeine"},
        {"id": "HOH", "name": "Water"}
    ]
}

This maps directly to Python:

  • {} → dictionary
  • [] → list
  • "text" → string
  • numbers → int or float
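Parsing the example above with Python's json module shows the mapping directly:

```python
import json

# JSON objects become dicts, arrays become lists, strings and numbers
# become their Python equivalents.
doc = '''{
    "entry_id": "4GYD",
    "resolution": 1.86,
    "chains": ["A", "B"],
    "ligands": [
        {"id": "CFF", "name": "Caffeine"},
        {"id": "HOH", "name": "Water"}
    ]
}'''

data = json.loads(doc)
print(type(data))                    # <class 'dict'>
print(data["chains"])                # ['A', 'B']
print(data["ligands"][0]["name"])    # Caffeine
```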

Part 3: The requests Module

Python's requests module makes HTTP requests simple.

Basic GET Request

import requests

res = requests.get('http://www.google.com')
print(res.status_code)  # 200

Check Status Before Processing

res = requests.get('http://www.google.com')

if res.status_code == 200:
    print(res.text)  # The HTML content
else:
    print(f"Error: {res.status_code}")

What Happens with Errors

r = requests.get('https://github.com/timelines.json')
print(r.status_code)  # 404
print(r.text)  # Error message from GitHub

Always check the status code. Don't assume success.

Getting JSON Responses

Most APIs return JSON. Convert it to a Python dictionary:

r = requests.get('https://some-api.com/data')
data = r.json()  # Now it's a dictionary

print(type(data))  # <class 'dict'>
print(data.keys())  # See what's inside

Part 4: PDB REST API

The Protein Data Bank has multiple APIs. Let's start with the REST API.

PDB Terminology

Term                Meaning                                 Example
Entry               Complete structure from one experiment  4GYD
Polymer Entity      One chain (protein, DNA, RNA)           4GYD entity 1
Chemical Component  Small molecule, ligand, ion             CFF (caffeine)

Get Entry Information

r = requests.get('https://data.rcsb.org/rest/v1/core/entry/4GYD')
data = r.json()

print(data.keys())
# dict_keys(['cell', 'citation', 'diffrn', 'entry', 'exptl', ...])

print(data['cell'])
# {'Z_PDB': 4, 'angle_alpha': 90.0, 'angle_beta': 90.0, ...}

Get Polymer Entity (Chain) Information

# 4GYD, entity 1
r = requests.get('https://data.rcsb.org/rest/v1/core/polymer_entity/4GYD/1')
data = r.json()

print(data['entity_poly'])
# Contains sequence, polymer type, etc.

Get PubMed Annotations

r = requests.get('https://data.rcsb.org/rest/v1/core/pubmed/4GYD')
data = r.json()

print(data['rcsb_pubmed_abstract_text'])
# The paper's abstract

Get Chemical Component Information

# CFF = Caffeine
r = requests.get('https://data.rcsb.org/rest/v1/core/chemcomp/CFF')
data = r.json()

print(data['chem_comp'])
# {'formula': 'C8 H10 N4 O2', 'formula_weight': 194.191, 'name': 'CAFFEINE', ...}

Get DrugBank Information

r = requests.get('https://data.rcsb.org/rest/v1/core/drugbank/CFF')
data = r.json()

print(data['drugbank_info']['description'])
# "A methylxanthine naturally occurring in some beverages..."

print(data['drugbank_info']['indication'])
# What the drug is used for

Get FASTA Sequence

Note: This returns plain text, not JSON.

r = requests.get('https://www.rcsb.org/fasta/entry/4GYD/download')
print(r.text)

# >4GYD_1|Chain A|...
# MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG...

Process Multiple Proteins

protein_ids = ['4GYD', '4H0J', '4H0K']

protein_dict = dict()
for protein in protein_ids:
    r = requests.get(f'https://data.rcsb.org/rest/v1/core/entry/{protein}')
    data = r.json()
    protein_dict[protein] = data['cell']

# Print cell dimensions
for protein_id, cell in protein_dict.items():
    print(f"{protein_id}: a={cell['length_a']}, b={cell['length_b']}, c={cell['length_c']}")

Part 5: PDB Search API

The Search API lets you query across the entire PDB database.

Base URL: http://search.rcsb.org/rcsbsearch/v2/query?json=<query>

Important: The query must be URL-encoded.

URL Encoding

Special characters in URLs must be encoded. Use requests.utils.requote_uri():

my_query = '{"query": ...}'  # JSON query string
encoded = requests.utils.requote_uri(my_query)
url = f'http://search.rcsb.org/rcsbsearch/v2/query?json={encoded}'
r = requests.get(url)

Sequence Similarity Search (BLAST-like)

Find structures with similar sequences:

fasta = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS"

my_query = '''{
  "query": {
    "type": "terminal",
    "service": "sequence",
    "parameters": {
      "evalue_cutoff": 1,
      "identity_cutoff": 0.9,
      "sequence_type": "protein",
      "value": "%s"
    }
  },
  "request_options": {
    "scoring_strategy": "sequence"
  },
  "return_type": "polymer_entity"
}''' % fasta

r = requests.get('http://search.rcsb.org/rcsbsearch/v2/query?json=%s' % requests.utils.requote_uri(my_query))
j = r.json()

print(f"Total matches: {j['total_count']}")
for item in j['result_set']:
    print(item['identifier'], "score =", item['score'])

Sequence Motif Search (PROSITE)

Find structures containing a specific motif:

# Zinc finger Cys2His2-like fold group
# PROSITE pattern: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

my_query = '''{
  "query": {
    "type": "terminal",
    "service": "seqmotif",
    "parameters": {
      "value": "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H",
      "pattern_type": "prosite",
      "sequence_type": "protein"
    }
  },
  "return_type": "polymer_entity"
}'''

r = requests.get('http://search.rcsb.org/rcsbsearch/v2/query?json=%s' % requests.utils.requote_uri(my_query))
j = r.json()

print(f"Total: {j['total_count']}, returned: {len(j['result_set'])}")

Search by Chemical Component

Find all entries containing caffeine:

my_query = '''{
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_nonpolymer_instance_annotation.comp_id",
            "operator": "exact_match",
            "value": "CFF"
        }
    },
    "return_type": "entry"
}'''

url = "https://search.rcsb.org/rcsbsearch/v2/query?json=%s" % requests.utils.requote_uri(my_query)
r = requests.get(url)
data = r.json()

pdb_ids = [row["identifier"] for row in data.get("result_set", [])]
print(f"Entries with caffeine: {len(pdb_ids)}")
print(pdb_ids)

Understanding the Response

j = r.json()

j.keys()
# dict_keys(['query_id', 'result_type', 'total_count', 'result_set'])

j['total_count']  # Total number of matches
j['result_set']   # List of results (may be paginated)

# Each result
j['result_set'][0]
# {'identifier': '4GYD_1', 'score': 1.0, ...}

Part 6: PDB GraphQL API

GraphQL is a query language that lets you request exactly the fields you need.

Endpoint: https://data.rcsb.org/graphql

Interactive testing: http://data.rcsb.org/graphql/index.html (GraphiQL)

Why GraphQL?

REST: multiple requests to fetch related data.
GraphQL: one request that specifies exactly the fields you want.

Basic Query

my_query = '''{
    entry(entry_id: "4GYD") {
        cell {
            Z_PDB
            angle_alpha
            angle_beta
            angle_gamma
            length_a
            length_b
            length_c
            volume
        }
    }
}'''

r = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(my_query))
j = r.json()

print(j.keys())  # dict_keys(['data'])

print(j['data'])
# {'entry': {'cell': {'Z_PDB': 4, 'angle_alpha': 90.0, ...}}}

Accessing the Data

params = j['data']['entry']['cell']

for key, value in params.items():
    print(f"{key}: {value}")

Query Multiple Entries

my_query = '''{
    entries(entry_ids: ["4GYD", "4H0J", "4H0K"]) {
        rcsb_id
        cell {
            length_a
            length_b
            length_c
        }
    }
}'''

Find UniProt Mappings

my_query = '''{
    polymer_entity(entry_id: "4GYD", entity_id: "1") {
        rcsb_polymer_entity_container_identifiers {
            entry_id
            entity_id
        }
        rcsb_polymer_entity_align {
            aligned_regions {
                entity_beg_seq_id
                length
            }
            reference_database_name
            reference_database_accession
        }
    }
}'''

Part 7: UniProt API

UniProt uses the Proteins REST API at https://www.ebi.ac.uk/proteins/api/

Important: Specify JSON Format

UniProt doesn't return JSON by default. You must request it:

headers = {"Accept": "application/json"}
requestURL = "https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession=P0A3X7&reviewed=true"

r = requests.get(requestURL, headers=headers)
j = r.json()

Response Structure

UniProt returns a list, not a dictionary:

type(j)  # <class 'list'>
len(j)   # Number of entries returned

# Access first entry
j[0].keys()
# dict_keys(['accession', 'id', 'proteinExistence', 'info', 'organism', ...])

Extract Gene Ontology Information

print(f"Accession: {j[0]['accession']}")  # P0A3X7
print(f"ID: {j[0]['id']}")  # CYC6_NOSS1

print("Gene Ontologies:")
for item in j[0]['dbReferences']:
    if item['type'] == "GO":
        print(f"  {item['id']}: {item['properties']['term']}")

Part 8: NCBI API

NCBI also offers REST APIs for programmatic access.

Gene Information

headers = {'Accept': 'application/json'}
gene_id = 8291  # DYSF (dysferlin)

r = requests.get(f'https://api.ncbi.nlm.nih.gov/datasets/v1alpha/gene/id/{gene_id}', headers=headers)
j = r.json()

gene = j['genes'][0]['gene']
print(gene['description'])  # dysferlin
print(gene['symbol'])       # DYSF
print(gene['taxname'])      # Homo sapiens

Part 9: Common Patterns

Pattern 1: Always Check Status

r = requests.get(url)
if r.status_code != 200:
    print(f"Error: {r.status_code}")
    print(r.text)
else:
    data = r.json()
    # process data

Pattern 2: Loop Through Multiple IDs

ids = ['4GYD', '4H0J', '4H0K']
results = {}

for pdb_id in ids:
    r = requests.get(f'https://data.rcsb.org/rest/v1/core/entry/{pdb_id}')
    if r.status_code == 200:
        results[pdb_id] = r.json()
    else:
        print(f"Failed to get {pdb_id}")

Pattern 3: Extract Specific Fields

# Get resolution for multiple structures
resolutions = {}
for pdb_id in ids:
    r = requests.get(f'https://data.rcsb.org/rest/v1/core/entry/{pdb_id}')
    data = r.json()
    # Navigate nested structure
    resolutions[pdb_id] = data['rcsb_entry_info']['resolution_combined'][0]

Pattern 4: Build URL with Parameters

base_url = "https://www.ebi.ac.uk/proteins/api/proteins"
params = {
    'offset': 0,
    'size': 10,
    'accession': 'P0A3X7',
    'reviewed': 'true'
}

# Build query string
query = '&'.join([f"{k}={v}" for k, v in params.items()])
url = f"{base_url}?{query}"
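Instead of joining the parameters by hand, the standard library can build and percent-encode the query string for you (requests.get also accepts a params= dictionary that does the same):

```python
from urllib.parse import urlencode

# urlencode joins key=value pairs with & and percent-encodes any
# characters that need it.
base_url = "https://www.ebi.ac.uk/proteins/api/proteins"
params = {"offset": 0, "size": 10, "accession": "P0A3X7", "reviewed": "true"}

url = f"{base_url}?{urlencode(params)}"
print(url)
```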

Pattern 5: Handle Paginated Results

Search APIs often return limited results per page:

j = r.json()
print(f"Total: {j['total_count']}")
print(f"Returned: {len(j['result_set'])}")

# If total > returned, you need pagination
# Check API docs for how to request more pages
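A generic sketch of that loop, with a fake in-memory backend standing in for the API. The real field names vary per API (the PDB Search API takes a request_options.paginate object, for example), so check each API's docs:

```python
# Offset-based pagination: keep requesting pages until we've collected
# total_count results (or a page comes back empty).
def fetch_all(fetch_page, page_size=100):
    """fetch_page(start, rows) -> (total_count, page_of_results)"""
    results, start = [], 0
    while True:
        total, page = fetch_page(start, page_size)
        results.extend(page)
        start += page_size
        if start >= total or not page:
            break
    return results

# Fake backend standing in for an API holding 250 items
def fake_page(start, rows):
    data = list(range(250))
    return len(data), data[start:start + rows]

print(len(fetch_all(fake_page)))   # 250
```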

API Summary

Database     Base URL                             JSON by default?  Notes
PDB REST     data.rcsb.org/rest/v1/core/          Yes               Entry, entity, chemcomp
PDB Search   search.rcsb.org/rcsbsearch/v2/query  Yes               URL-encode query
PDB GraphQL  data.rcsb.org/graphql                Yes               Flexible queries
UniProt      ebi.ac.uk/proteins/api/              No (need header)  Returns list
NCBI         api.ncbi.nlm.nih.gov/datasets/       No (need header)  Gene, genome, etc.

Quick Reference

requests Basics

import requests

# GET request
r = requests.get(url)
r = requests.get(url, headers={'Accept': 'application/json'})

# Check status
r.status_code  # 200 = success

# Get response
r.text  # As string
r.json()  # As dictionary (if JSON)

URL Encoding

# For Search API queries
encoded = requests.utils.requote_uri(query_string)
url = f'http://search.rcsb.org/rcsbsearch/v2/query?json={encoded}'

PDB API URLs

# Entry info
f'https://data.rcsb.org/rest/v1/core/entry/{pdb_id}'

# Polymer entity
f'https://data.rcsb.org/rest/v1/core/polymer_entity/{pdb_id}/{entity_id}'

# Chemical component
f'https://data.rcsb.org/rest/v1/core/chemcomp/{ccd_id}'

# DrugBank
f'https://data.rcsb.org/rest/v1/core/drugbank/{ccd_id}'

# PubMed
f'https://data.rcsb.org/rest/v1/core/pubmed/{pdb_id}'

# FASTA
f'https://www.rcsb.org/fasta/entry/{pdb_id}/download'

# GraphQL
f'https://data.rcsb.org/graphql?query={encoded_query}'

# Search
f'http://search.rcsb.org/rcsbsearch/v2/query?json={encoded_query}'

UniProt API URL

# Needs header: {"Accept": "application/json"}
f'https://www.ebi.ac.uk/proteins/api/proteins?accession={uniprot_id}&reviewed=true'

Common Mistakes

Mistake                             Problem                   Fix
Not checking status code            Process garbage data      Always check r.status_code == 200
Forgetting JSON header for UniProt  Get HTML instead of JSON  Add headers={"Accept": "application/json"}
Not URL-encoding search queries     Query fails               Use requests.utils.requote_uri()
Assuming dict when it's a list      KeyError                  Check type(r.json())
Calling .json() on non-JSON         Error                     Check if response is actually JSON
Not handling missing keys           KeyError                  Use .get('key', default)
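For the last two mistakes, a small defensive helper covers the "is it even JSON?" case:

```python
import json

# Parse a response body defensively: return None instead of raising
# when the server sent something that isn't JSON (e.g. an HTML error page).
def safe_json(text):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

print(safe_json('{"ok": true}'))          # {'ok': True}
print(safe_json("<html>error</html>"))    # None
```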

Workflow Example: Get GO Terms for a PDB Structure

Complete workflow combining PDB and UniProt:

import requests

# 1. Get UniProt ID from PDB
pdb_id = "4GYD"
query = '''{
    polymer_entity(entry_id: "%s", entity_id: "1") {
        rcsb_polymer_entity_align {
            reference_database_name
            reference_database_accession
        }
    }
}''' % pdb_id

r = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(query))
data = r.json()

# Find UniProt accession
for align in data['data']['polymer_entity']['rcsb_polymer_entity_align']:
    if align['reference_database_name'] == 'UniProt':
        uniprot_id = align['reference_database_accession']
        break

print(f"UniProt ID: {uniprot_id}")

# 2. Get GO terms from UniProt
url = f"https://www.ebi.ac.uk/proteins/api/proteins?accession={uniprot_id}&reviewed=true"
r = requests.get(url, headers={"Accept": "application/json"})
j = r.json()

print("Gene Ontology terms:")
for item in j[0]['dbReferences']:
    if item['type'] == "GO":
        print(f"  {item['id']}: {item['properties']['term']}")

Create Your Own Database

The Goal

Combine everything you've learned:

  • SQLite databases
  • PDB GraphQL API
  • UniProt REST API

Into one project: Build your own local database that integrates data from multiple sources.


Part 1: The Problem

You have PDB IDs (e.g., 4GYD, 1TU2). You want to store:

From PDB:

  • Structure weight (kDa)
  • Atom count
  • Residue count
  • Polymer information
  • UniProt IDs
  • Source organism

From UniProt:

  • Gene Ontology (GO) annotations

Why a local database? Because:

  • Faster queries than hitting APIs repeatedly
  • Combine data from multiple sources
  • Custom queries across all your data
  • Works offline

Part 2: Gene Ontology (GO)

What is GO?

Gene Ontology is a standardized vocabulary for describing protein functions. It lets you compare proteins across species using consistent terminology.

Three Categories

Category            Code  What it describes                         Example
Molecular Function  F     What the protein does at molecular level  F:iron ion binding
Biological Process  P     What pathway/process it's involved in     P:photosynthesis
Cellular Component  C     Where in the cell it's located            C:plasma membrane

GO ID Format

GO:0005506

Seven digits after "GO:". Each ID maps to a specific term.

Example GO Entry

{
    'type': 'GO',
    'id': 'GO:0005506',
    'properties': {
        'term': 'F:iron ion binding',
        'source': 'IEA:InterPro'
    }
}

  • id: The GO identifier
  • term: Category code + description
  • source: Where the annotation came from (evidence)
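Pulling the category code out of a term is a one-line split:

```python
# Split a GO term string into its category code and description,
# using the example entry shown above.
go_entry = {
    "type": "GO",
    "id": "GO:0005506",
    "properties": {"term": "F:iron ion binding", "source": "IEA:InterPro"},
}

category, description = go_entry["properties"]["term"].split(":", 1)
categories = {"F": "Molecular Function",
              "P": "Biological Process",
              "C": "Cellular Component"}

print(categories[category], "-", description)   # Molecular Function - iron ion binding
```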

Part 3: Database Schema Design

Why Multiple Tables?

One PDB structure can have:

  • Multiple polymers (chains)
  • Each polymer can have multiple GO annotations

This is a one-to-many relationship. Storing everything in one table would mean massive data duplication.

The Three Tables

structures (1) ----< (N) polymers (1) ----< (N) go_annotations

One structure → many polymers → many GO annotations

Table 1: structures

CREATE TABLE structures (
    pdb_id TEXT PRIMARY KEY,
    title TEXT,
    total_weight REAL,
    atom_count INTEGER,
    residue_count INTEGER
)

One row per PDB entry.

Table 2: polymers

CREATE TABLE polymers (
    polymer_id TEXT PRIMARY KEY,
    pdb_id TEXT NOT NULL,
    uniprot_accession TEXT,
    protein_name TEXT,
    scientific_name TEXT,
    FOREIGN KEY (pdb_id) REFERENCES structures(pdb_id),
    UNIQUE (polymer_id, scientific_name, uniprot_accession)
)

One row per polymer (chain) in a structure.

The FOREIGN KEY links back to the structures table.

Table 3: go_annotations

CREATE TABLE go_annotations (
    id INTEGER PRIMARY KEY,
    go_id TEXT NOT NULL,
    go_term TEXT NOT NULL,
    go_source TEXT NOT NULL,
    polymer_id TEXT NOT NULL,
    FOREIGN KEY (polymer_id) REFERENCES polymers(polymer_id),
    UNIQUE (polymer_id, go_id)
)

One row per GO annotation per polymer.

The id INTEGER PRIMARY KEY auto-increments — you don't specify it when inserting.


Part 4: Creating the Schema

import sqlite3 as sql
import requests

# Connect to database (creates file if doesn't exist)
conn = sql.connect('my_database.sqlite')
cur = conn.cursor()

# Drop existing tables (start fresh)
cur.execute('DROP TABLE IF EXISTS structures')
cur.execute('DROP TABLE IF EXISTS polymers')
cur.execute('DROP TABLE IF EXISTS go_annotations')

# Create tables
cur.execute('''CREATE TABLE structures (
    pdb_id TEXT PRIMARY KEY,
    title TEXT,
    total_weight REAL,
    atom_count INTEGER,
    residue_count INTEGER
)''')

cur.execute('''CREATE TABLE polymers (
    polymer_id TEXT PRIMARY KEY,
    pdb_id TEXT NOT NULL,
    uniprot_accession TEXT,
    protein_name TEXT,
    scientific_name TEXT,
    FOREIGN KEY (pdb_id) REFERENCES structures(pdb_id),
    UNIQUE (polymer_id, scientific_name, uniprot_accession)
)''')

cur.execute('''CREATE TABLE go_annotations (
    id INTEGER PRIMARY KEY,
    go_id TEXT NOT NULL,
    go_term TEXT NOT NULL,
    go_source TEXT NOT NULL,
    polymer_id TEXT NOT NULL,
    FOREIGN KEY (polymer_id) REFERENCES polymers(polymer_id),
    UNIQUE (polymer_id, go_id)
)''')

conn.commit()

Part 5: The GraphQL Query

What We Need from PDB

{
  entries(entry_ids: ["4GYD", "1TU2"]) {
    rcsb_id
    struct { title }
    rcsb_entry_info {
      molecular_weight
      deposited_atom_count
      deposited_modeled_polymer_monomer_count
    }
    polymer_entities {
      rcsb_id
      rcsb_entity_source_organism {
        ncbi_scientific_name
      }
      uniprots {
        rcsb_uniprot_container_identifiers {
          uniprot_id
        }
        rcsb_uniprot_protein {
          name {
            value
          }
        }
      }
    }
  }
}

Understanding the Response Structure

The response is nested:

entries (list)
  └── each entry (one per PDB ID)
        ├── rcsb_id
        ├── struct.title
        ├── rcsb_entry_info (weight, counts)
        └── polymer_entities (list)
              └── each polymer
                    ├── rcsb_id (polymer ID like "4GYD_1")
                    ├── rcsb_entity_source_organism (list of organisms)
                    └── uniprots (list)
                          ├── rcsb_uniprot_container_identifiers.uniprot_id
                          └── rcsb_uniprot_protein.name.value

Execute the Query

pdb_query = '''
{
  entries(entry_ids: ["4GYD", "1TU2"]) {
    rcsb_id
    struct { title }
    rcsb_entry_info {
      molecular_weight
      deposited_atom_count
      deposited_modeled_polymer_monomer_count
    }
    polymer_entities {
      rcsb_id
      rcsb_entity_source_organism {
        ncbi_scientific_name
      }
      uniprots {
        rcsb_uniprot_container_identifiers {
          uniprot_id
        }
        rcsb_uniprot_protein {
          name {
            value
          }
        }
      }
    }
  }
}
'''

p = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(pdb_query))
j = p.json()

Part 6: Populating the Database

Step 1: Insert into structures table

for prot in j['data']['entries']:
    pdb_id = prot['rcsb_id']
    title = prot['struct']['title']
    weight = prot['rcsb_entry_info']['molecular_weight']
    atom_count = prot['rcsb_entry_info']['deposited_atom_count']
    residue_count = prot['rcsb_entry_info']['deposited_modeled_polymer_monomer_count']
    
    cur.execute('INSERT INTO structures VALUES (?, ?, ?, ?, ?)',
                (pdb_id, title, weight, atom_count, residue_count))

Step 2: Insert into polymers table

    # Still inside the loop over entries
    for polymer in prot['polymer_entities']:
        polymer_id = polymer['rcsb_id']
        
        # Extract all source organisms (could be multiple)
        source_organisms = []
        for so in polymer['rcsb_entity_source_organism']:
            source_organisms.append(so['ncbi_scientific_name'])
        
        # Extract all UniProt info
        uniprots = []
        for up in polymer['uniprots']:
            uniprot_id = up['rcsb_uniprot_container_identifiers']['uniprot_id']
            protein_name = up['rcsb_uniprot_protein']['name']['value']
            uniprots.append((uniprot_id, protein_name))
        
        # Create all combinations (organism × uniprot)
        combinations = [(org, up) for org in source_organisms for up in uniprots]
        
        # Insert each combination
        for (organism, uniprot_info) in combinations:
            cur.execute('INSERT INTO polymers VALUES (?, ?, ?, ?, ?)',
                        (polymer_id,
                         pdb_id,
                         uniprot_info[0],  # UniProt accession
                         uniprot_info[1],  # Protein name
                         organism))        # Scientific name
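One subtlety in the loop above: if a polymer has no annotated organism or no UniProt mapping, the cross product `combinations` is empty and the polymer is silently skipped. A minimal sketch (not part of the original code, values made up) of a defensive variant that pads empty lists with None so one row per polymer survives:

```python
# Sketch: if either list is empty, the cross product is empty and the
# polymer gets no row at all. Padding with None keeps one row; the
# uniprot_accession / scientific_name columns are nullable in the schema.
source_organisms = []                          # e.g. no organism annotated
uniprots = [('P0A3X7', 'Cytochrome c6')]       # made-up example pair

source_organisms = source_organisms or [None]
uniprots = uniprots or [(None, None)]

combinations = [(org, up) for org in source_organisms for up in uniprots]
print(combinations)
# [(None, ('P0A3X7', 'Cytochrome c6'))]
```

With the padding, the polymer still appears in the table with NULL in the missing column instead of vanishing entirely.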

Step 3: Query UniProt and insert GO annotations

        # For each UniProt ID, get GO annotations
        for up in uniprots:
            accession_id = up[0]
            
            # Query UniProt API
            uniprot_url = f'https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession={accession_id}'
            r = requests.get(uniprot_url, headers={"Accept": "application/json"})
            
            # GO info is in dbReferences
            db_info = r.json()[0]['dbReferences']
            
            for db in db_info:
                if db['type'] == 'GO':
                    go_id = db['id']
                    go_term = db['properties']['term']
                    go_source = db['properties']['source']
                    
                    # Insert (don't specify id - it auto-increments)
                    cur.execute('''INSERT INTO go_annotations 
                                   (go_id, go_term, go_source, polymer_id)
                                   VALUES (?, ?, ?, ?)''',
                                (go_id, go_term, go_source, polymer_id))

conn.commit()
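The UniProt step above assumes every response contains at least one record with a `dbReferences` list. A hedged sketch of a defensive extraction helper (hypothetical function name; field names follow the EBI Proteins API response shape used above), tested here on a minimal fake record rather than a live API call:

```python
def extract_go_terms(record):
    """Pull (go_id, term, source) triples from one UniProt record,
    tolerating missing keys instead of raising KeyError."""
    triples = []
    for db in record.get('dbReferences', []):
        if db.get('type') == 'GO':
            props = db.get('properties', {})
            triples.append((db['id'], props.get('term'), props.get('source')))
    return triples

# Minimal fake record shaped like the API response (data made up):
fake = {'dbReferences': [
    {'type': 'GO', 'id': 'GO:0009055',
     'properties': {'term': 'F:electron transfer activity',
                    'source': 'UniProtKB-UniRule'}},
    {'type': 'PDB', 'id': '4GYD'},          # non-GO entries are skipped
]}
print(extract_go_terms(fake))
# [('GO:0009055', 'F:electron transfer activity', 'UniProtKB-UniRule')]
```

An empty or malformed record simply yields an empty list, so the insert loop does nothing instead of crashing.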

Part 7: The Complete Code

import sqlite3 as sql
import requests

# Connect
conn = sql.connect('my_database.sqlite')
cur = conn.cursor()

# Create schema
cur.execute('DROP TABLE IF EXISTS structures')
cur.execute('DROP TABLE IF EXISTS polymers')
cur.execute('DROP TABLE IF EXISTS go_annotations')

cur.execute('''CREATE TABLE structures (
    pdb_id TEXT PRIMARY KEY,
    title TEXT,
    total_weight REAL,
    atom_count INTEGER,
    residue_count INTEGER
)''')

cur.execute('''CREATE TABLE polymers (
    polymer_id TEXT PRIMARY KEY,
    pdb_id TEXT NOT NULL,
    uniprot_accession TEXT,
    protein_name TEXT,
    scientific_name TEXT,
    FOREIGN KEY (pdb_id) REFERENCES structures(pdb_id),
    UNIQUE (polymer_id, scientific_name, uniprot_accession)
)''')

cur.execute('''CREATE TABLE go_annotations (
    id INTEGER PRIMARY KEY,
    go_id TEXT NOT NULL,
    go_term TEXT NOT NULL,
    go_source TEXT NOT NULL,
    polymer_id TEXT NOT NULL,
    FOREIGN KEY (polymer_id) REFERENCES polymers(polymer_id),
    UNIQUE (polymer_id, go_id)
)''')

conn.commit()

# Query PDB
pdb_query = '''{ entries(entry_ids: ["4GYD", "1TU2"]) { ... } }'''  # Full query here
p = requests.get('https://data.rcsb.org/graphql?query=%s' % requests.utils.requote_uri(pdb_query))
j = p.json()

# Populate database
for prot in j['data']['entries']:
    # Insert structure
    pdb_id = prot['rcsb_id']
    title = prot['struct']['title']
    weight = prot['rcsb_entry_info']['molecular_weight']
    atom_count = prot['rcsb_entry_info']['deposited_atom_count']
    residue_count = prot['rcsb_entry_info']['deposited_modeled_polymer_monomer_count']
    
    cur.execute('INSERT INTO structures VALUES (?, ?, ?, ?, ?)',
                (pdb_id, title, weight, atom_count, residue_count))
    
    # Insert polymers and GO annotations
    for polymer in prot['polymer_entities']:
        polymer_id = polymer['rcsb_id']
        
        source_organisms = [so['ncbi_scientific_name'] 
                          for so in polymer['rcsb_entity_source_organism']]
        
        uniprots = [(up['rcsb_uniprot_container_identifiers']['uniprot_id'],
                    up['rcsb_uniprot_protein']['name']['value'])
                   for up in polymer['uniprots']]
        
        combinations = [(org, up) for org in source_organisms for up in uniprots]
        
        for (organism, uniprot_info) in combinations:
            cur.execute('INSERT INTO polymers VALUES (?, ?, ?, ?, ?)',
                        (polymer_id, pdb_id, uniprot_info[0], uniprot_info[1], organism))
        
        # Get GO annotations from UniProt
        for up in uniprots:
            accession_id = up[0]
            uniprot_url = f'https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=10&accession={accession_id}'
            r = requests.get(uniprot_url, headers={"Accept": "application/json"})
            
            for db in r.json()[0]['dbReferences']:
                if db['type'] == 'GO':
                    cur.execute('''INSERT INTO go_annotations 
                                   (go_id, go_term, go_source, polymer_id)
                                   VALUES (?, ?, ?, ?)''',
                                (db['id'], db['properties']['term'], 
                                 db['properties']['source'], polymer_id))

conn.commit()
conn.close()

Part 8: Querying Your Database

Basic Queries

Get all info for a PDB ID:

cur.execute('SELECT * FROM structures WHERE pdb_id = ?', ("4GYD",))
print(cur.fetchall())
# [('4GYD', 'Nostoc sp Cytochrome c6', 58.57, 4598, 516)]

Get all polymers for a PDB ID:

cur.execute('SELECT * FROM polymers WHERE pdb_id = ?', ("4GYD",))
print(cur.fetchall())
# [('4GYD_1', '4GYD', 'P0A3X7', 'Cytochrome c6', 'Nostoc sp. PCC 7120')]

Top 10 heaviest structures:

cur.execute('''SELECT pdb_id, title, total_weight 
               FROM structures
               ORDER BY total_weight DESC
               LIMIT 10''')
print(cur.fetchall())

GO annotations from a specific source:

cur.execute('SELECT * FROM go_annotations WHERE go_source LIKE ?', ('%UniProtKB-UniRule%',))
print(cur.fetchall())

Queries Across Tables (JOINs)

Get all GO IDs for a UniProt accession (using subquery):

cur.execute('''
    SELECT go_id FROM go_annotations AS ga
    WHERE ga.polymer_id IN (
        SELECT p.polymer_id
        FROM polymers AS p
        WHERE p.uniprot_accession = ?
    )
''', ("P46444",))
print(cur.fetchall())

Same query using JOIN:

cur.execute('''
    SELECT g.go_id
    FROM go_annotations AS g
    JOIN polymers AS p ON p.polymer_id = g.polymer_id
    WHERE p.uniprot_accession = ?
''', ("P46444",))
print(cur.fetchall())

Both return the same result. AS defines an alias, a short name you can use to refer to a table inside the query.

Count GO annotations for one structure:

cur.execute('''
    SELECT COUNT(go_annotations.go_id)
    FROM go_annotations
    WHERE polymer_id IN (
        SELECT polymer_id
        FROM polymers
        WHERE pdb_id = ?
    )
''', ("1TU2",))
print(cur.fetchall())
# [(8,)]
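The query above counts annotations for one structure at a time. To get counts for every structure in a single query, a GROUP BY over a JOIN works; a self-contained sketch against a throwaway in-memory database (table and column names match the schema above, the rows are made up):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Minimal versions of the two tables from the schema above
cur.execute('CREATE TABLE polymers (polymer_id TEXT PRIMARY KEY, pdb_id TEXT)')
cur.execute('''CREATE TABLE go_annotations
               (id INTEGER PRIMARY KEY, go_id TEXT, polymer_id TEXT)''')
cur.executemany('INSERT INTO polymers VALUES (?, ?)',
                [('4GYD_1', '4GYD'), ('1TU2_1', '1TU2')])
cur.executemany('INSERT INTO go_annotations (go_id, polymer_id) VALUES (?, ?)',
                [('GO:0005506', '4GYD_1'), ('GO:0009055', '4GYD_1'),
                 ('GO:0005507', '1TU2_1')])

# One row per structure, with its annotation count
cur.execute('''
    SELECT p.pdb_id, COUNT(g.go_id)
    FROM go_annotations AS g
    JOIN polymers AS p ON p.polymer_id = g.polymer_id
    GROUP BY p.pdb_id
    ORDER BY p.pdb_id
''')
rows = cur.fetchall()
print(rows)
# [('1TU2', 1), ('4GYD', 2)]
conn.close()
```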

Part 9: Understanding JOINs

What is a JOIN?

A JOIN combines rows from two tables based on a related column.

The Tables

polymers:
polymer_id | pdb_id | uniprot_accession | ...
-----------+--------+-------------------+----
4GYD_1     | 4GYD   | P0A3X7           | ...
1TU2_1     | 1TU2   | P46444           | ...

go_annotations:
id | go_id       | polymer_id | ...
---+-------------+------------+----
1  | GO:0005506  | 4GYD_1     | ...
2  | GO:0009055  | 4GYD_1     | ...
3  | GO:0005507  | 1TU2_1     | ...

JOIN in Action

SELECT g.go_id, p.uniprot_accession
FROM go_annotations AS g
JOIN polymers AS p ON p.polymer_id = g.polymer_id
WHERE p.pdb_id = '4GYD'

This:

  1. Takes each row from go_annotations
  2. Finds the matching row in polymers (where polymer_ids match)
  3. Combines them
  4. Filters by pdb_id

Result:

go_id      | uniprot_accession
-----------+------------------
GO:0005506 | P0A3X7
GO:0009055 | P0A3X7

Subquery Alternative

Same result, different approach:

SELECT go_id FROM go_annotations
WHERE polymer_id IN (
    SELECT polymer_id FROM polymers WHERE pdb_id = '4GYD'
)

  1. Inner query gets polymer_ids for 4GYD
  2. Outer query gets GO IDs for those polymers

Part 10: Exporting the Schema

Why Export Schema?

You might want to:

  • Document your database structure
  • Recreate the database elsewhere
  • Share the schema without the data

export_schema.py

import sqlite3
import os
import sys

def export_sqlite_schema(db_path, output_file):
    """
    Extracts the schema from a SQLite database and writes it to a file.
    """
    if not os.path.isfile(db_path):
        print(f"Error: Database file '{db_path}' not found.")
        return False
    
    try:
        # Connect read-only
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        cursor = conn.cursor()
        
        # Get schema from sqlite_master
        cursor.execute("SELECT sql FROM sqlite_master WHERE sql IS NOT NULL;")
        schema_statements = cursor.fetchall()
        
        if not schema_statements:
            print("No schema found in the database.")
            return False
        
        # Write to file
        with open(output_file, "w", encoding="utf-8") as f:
            for stmt in schema_statements:
                f.write(stmt[0] + ";\n\n")
        
        print(f"Schema successfully exported to '{output_file}'")
        return True
        
    except sqlite3.Error as e:
        print(f"SQLite error: {e}")
        return False
    finally:
        if 'conn' in locals():
            conn.close()

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python export_schema.py <database_path> <output_sql_file>")
        sys.exit(1)
    
    db_file = sys.argv[1]
    output_file = sys.argv[2]
    export_sqlite_schema(db_file, output_file)

Usage

python export_schema.py my_database.sqlite schema.sql

What sqlite_master Contains

Every SQLite database has a special table called sqlite_master that stores:

  • Table definitions (CREATE TABLE statements)
  • Index definitions
  • View definitions
  • Trigger definitions

SELECT sql FROM sqlite_master WHERE sql IS NOT NULL;

Returns all the CREATE statements that define your database structure.
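A tiny demonstration of reading sqlite_master directly (in-memory database, made-up table name):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo (id INTEGER PRIMARY KEY, name TEXT)')

# sqlite_master rows with sql IS NULL (e.g. auto-indexes) are filtered out
rows = conn.execute(
    "SELECT type, name, sql FROM sqlite_master WHERE sql IS NOT NULL"
).fetchall()
for t, name, sql in rows:
    print(t, name)          # table demo
conn.close()
```

The `sql` column holds the exact CREATE statement, which is why dumping it reproduces the schema.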


Part 11: Key Concepts Summary

Database Design

Concept           | Application
------------------+---------------------------------------------------------------------
Primary Key       | Unique identifier for each row (pdb_id, polymer_id)
Foreign Key       | Links tables together (polymers.pdb_id → structures.pdb_id)
One-to-Many       | One structure has many polymers; one polymer has many GO annotations
UNIQUE constraint | Prevents duplicate combinations
Auto-increment    | id INTEGER PRIMARY KEY auto-generates values

Data Flow

PDB GraphQL API
      ↓
Extract structure info → INSERT INTO structures
      ↓
Extract polymer info → INSERT INTO polymers
      ↓
For each UniProt ID:
      ↓
UniProt REST API
      ↓
Extract GO annotations → INSERT INTO go_annotations
      ↓
conn.commit()

SQL Operations

Operation | Example
----------+-----------------------------------------------
SELECT    | SELECT * FROM structures WHERE pdb_id = '4GYD'
WHERE     | Filter rows
ORDER BY  | ORDER BY total_weight DESC
LIMIT     | LIMIT 10
LIKE      | WHERE go_source LIKE '%UniRule%'
COUNT     | SELECT COUNT(go_id) FROM ...
JOIN      | Combine related tables
Subquery  | Nested SELECT

Quick Reference

Schema Creation Pattern

cur.execute('DROP TABLE IF EXISTS tablename')
cur.execute('''CREATE TABLE tablename (
    column1 TYPE CONSTRAINT,
    column2 TYPE CONSTRAINT,
    FOREIGN KEY (column) REFERENCES other_table(column)
)''')
conn.commit()

Insert Pattern

# With all columns
cur.execute('INSERT INTO table VALUES (?, ?, ?)', (val1, val2, val3))

# With specific columns (skip auto-increment)
cur.execute('INSERT INTO table (col1, col2) VALUES (?, ?)', (val1, val2))

Query Pattern

cur.execute('SELECT columns FROM table WHERE condition', (params,))
results = cur.fetchall()

JOIN Pattern

cur.execute('''
    SELECT t1.col, t2.col
    FROM table1 AS t1
    JOIN table2 AS t2 ON t1.key = t2.key
    WHERE condition
''')

Common Mistakes

Mistake                  | Problem        | Fix
-------------------------+----------------+------------------------------------------
Forgetting conn.commit() | Data not saved | Always commit after inserts
Wrong number of ?        | Insert fails   | Count columns carefully
Not handling lists       | Missing data   | Check if lists could have multiple items
Hardcoding IDs           | Not reusable   | Use variables and parameters
Not closing connection   | Resource leak  | Always conn.close()
Duplicate primary key    | Insert fails   | Use UNIQUE constraints or check first
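Two of these mistakes, forgetting conn.commit() and not closing the connection, can be handled structurally. A sketch (in-memory database, made-up table): `with conn:` commits on success and rolls back on an exception, but it does NOT close the connection, so the close still needs try/finally:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
try:
    # The connection used as a context manager commits the transaction on
    # success and rolls it back if the block raises.
    with conn:
        conn.execute('CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)')
        conn.execute('INSERT INTO t (name) VALUES (?)', ('alpha',))
    # The connection is still open here; verify the insert landed.
    count = conn.execute('SELECT COUNT(*) FROM t').fetchone()[0]
    print(count)    # 1
finally:
    conn.close()    # closing is still our job
```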

CHEAT SHEETs

# ==================== PANDAS CHEAT SHEET ====================

import pandas as pd

# ============ READING/WRITING DATA ============
df = pd.read_csv('file.csv')
df = pd.read_excel('file.xls')
df.to_csv('output.csv', index=False)
df.to_sql('table', conn, index=False)

# ============ BASIC INFO ============
df.head()          # first 5 rows
df.tail(3)         # last 3 rows
df.shape           # (rows, columns) -> (100, 5)
df.columns         # column names
df.dtypes          # data types
df.info()          # summary
df.describe()      # statistics

# ============ SELECTING DATA ============
df['col']                 # single column (Series)
df[['col1', 'col2']]      # multiple columns (DataFrame)
df.loc[0]                 # row by label/index
df.loc[0:5, 'col']        # rows 0-5, specific column
df.iloc[0:5, 0:2]         # by position (first 5 rows, first 2 cols)

# ============ FILTERING ============
df[df['age'] > 30]                          # where age > 30
df[df['country'] == 'Italy']                # where country is Italy
df[df['country'].isin(['Italy', 'Spain'])]  # where country in list
df[(df['age'] > 30) & (df['salary'] > 50000)]  # multiple conditions

# ============ UNIQUE VALUES ============
df['country'].unique()      # array of unique values -> ['Italy', 'Spain', 'France']
df['country'].nunique()     # count unique -> 3
df['country'].value_counts()  
# Italy     10
# Spain      8
# France     5

# ============ MISSING DATA ============
df.isna().sum()        # count NaN per column
df.dropna()            # remove rows with NaN
df.fillna(0)           # replace NaN with 0

# ============ GROUPBY ============
df.groupby('country')['salary'].mean()
# country
# France    45000
# Italy     52000
# Spain     48000

df.groupby('country').agg({'salary': 'mean', 'age': 'max'})
#          salary  age
# France    45000   55
# Italy     52000   60

# ============ SORTING ============
df.sort_values('salary')                    # ascending
df.sort_values('salary', ascending=False)   # descending
df.sort_values(['country', 'salary'])       # multiple columns

# ============ ADDING/MODIFYING COLUMNS ============
df['new_col'] = df['salary'] * 2
df['category'] = df['age'].apply(lambda x: 'old' if x > 50 else 'young')

# ============ RENAMING ============
df.rename(columns={'old_name': 'new_name'})

# ============ DROP ============
df.drop(columns=['col1', 'col2'])
df.drop(index=[0, 1, 2])

# ============ MERGE/JOIN ============
pd.merge(df1, df2, on='id')               # inner join
pd.merge(df1, df2, on='id', how='left')   # left join

# ============ CONCAT ============
pd.concat([df1, df2])          # stack vertically
pd.concat([df1, df2], axis=1)  # stack horizontally

# ============ pd.cut() - BINNING ============
ages = pd.Series([15, 25, 35, 45, 55])
pd.cut(ages, bins=3, labels=['young', 'mid', 'old'])
# 0    young
# 1    young
# 2      mid
# 3      mid
# 4      old

# ============ QUICK PLOTTING ============
df['salary'].plot()                    # line plot
df['salary'].plot(kind='bar')          # bar plot
df.plot(x='year', y='salary')          # x vs y
df.groupby('country')['salary'].mean().plot(kind='bar')

# ============ COMMON AGGREGATIONS ============
df['col'].sum()
df['col'].mean()
df['col'].min()
df['col'].max()
df['col'].count()
df['col'].std()

# ==================== SQLITE + PANDAS CHEAT SHEET ====================

import sqlite3
import pandas as pd

# ============ CONNECT TO DATABASE ============
conn = sqlite3.connect('database.sqlite')  # creates file if doesn't exist
conn.close()                                # always close when done

# ============ PANDAS TO SQLITE ============
conn = sqlite3.connect('mydb.sqlite')

# Write entire dataframe to SQLite table
df.to_sql('table_name', conn, index=False, if_exists='replace')

# if_exists options:
#   'fail'    - error if table exists (default)
#   'replace' - drop table and recreate
#   'append'  - add rows to existing table

conn.close()

# ============ SQLITE TO PANDAS ============
conn = sqlite3.connect('mydb.sqlite')

# Read entire table
df = pd.read_sql_query('SELECT * FROM table_name', conn)

# Read with filter
df = pd.read_sql_query('SELECT * FROM happiness WHERE year > 2015', conn)

# Read specific columns
df = pd.read_sql_query('SELECT country, year, salary FROM employees', conn)

# Read with multiple conditions
df = pd.read_sql_query('''
    SELECT * FROM happiness 
    WHERE "Log GDP per capita" > 11.2 
    AND year >= 2010
''', conn)

conn.close()

# ============ IMPORTANT: COLUMN NAMES WITH SPACES ============
# Use double quotes around column names with spaces
df = pd.read_sql_query('SELECT "Country name", "Life Ladder" FROM happiness', conn)

# ============ COMMON SQL QUERIES ============
# Count rows
pd.read_sql_query('SELECT COUNT(*) FROM table_name', conn)

# Distinct values
pd.read_sql_query('SELECT DISTINCT country FROM happiness', conn)

# Order by
pd.read_sql_query('SELECT * FROM happiness ORDER BY year DESC', conn)

# Group by with aggregation
pd.read_sql_query('''
    SELECT country, AVG(salary) as avg_salary 
    FROM employees 
    GROUP BY country
''', conn)

# ============ TYPICAL WORKFLOW ============
# 1. Read Excel/CSV
df = pd.read_excel('data.xls')

# 2. Select columns
df_subset = df[['col1', 'col2', 'col3']]

# 3. Save to SQLite
conn = sqlite3.connect('mydb.sqlite')
df_subset.to_sql('mytable', conn, index=False, if_exists='replace')
conn.close()

# 4. Later, read back with filter
conn = sqlite3.connect('mydb.sqlite')
df_filtered = pd.read_sql_query('SELECT * FROM mytable WHERE col1 > 100', conn)
conn.close()

# ============ MODIFY DATA & SAVE TO NEW DB ============
# Read from db1
conn1 = sqlite3.connect('db1.sqlite')
df = pd.read_sql_query('SELECT * FROM table1', conn1)
conn1.close()

# Modify in pandas
df['new_col'] = df['old_col'] * 10
df = df.drop(columns=['old_col'])
df = df.rename(columns={'new_col': 'better_name'})

# Save to db2
conn2 = sqlite3.connect('db2.sqlite')
df.to_sql('table1', conn2, index=False, if_exists='replace')
conn2.close()

# ============ FILE SIZE ============
import os
os.path.getsize('file.sqlite')  # size in bytes


# ==================== MATPLOTLIB CHEAT SHEET ====================

import matplotlib.pyplot as plt

# ============ BASIC LINE PLOT ============
plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.show()

# ============ LINE PLOT WITH LABELS ============
plt.plot([2020, 2021, 2022], [100, 150, 130])
plt.xlabel('Year')
plt.ylabel('Sales')
plt.title('Sales Over Time')
plt.show()

# ============ MULTIPLE LINES (SAME PLOT) ============
plt.plot([2020, 2021, 2022], [100, 150, 130], label='Italy')
plt.plot([2020, 2021, 2022], [90, 120, 140], label='Spain')
plt.plot([2020, 2021, 2022], [80, 110, 160], label='France')
plt.legend()  # shows the labels
plt.show()

# ============ BAR PLOT ============
plt.bar(['Italy', 'Spain', 'France'], [100, 90, 80])
plt.show()

# ============ BAR PLOT WITH OPTIONS ============
plt.bar(['Italy', 'Spain', 'France'], [100, 90, 80], color='green')
plt.title('GDP by Country')
plt.xticks(rotation=45)  # rotate x labels
plt.tight_layout()       # prevent labels from cutting off
plt.show()

# ============ HORIZONTAL BAR ============
plt.barh(['Italy', 'Spain', 'France'], [100, 90, 80])
plt.show()

# ============ SCATTER PLOT ============
plt.scatter([1, 2, 3, 4], [10, 20, 15, 30])
plt.show()

# ============ HISTOGRAM ============
data = [1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5]
plt.hist(data, bins=5)
plt.show()

# ============ PIE CHART ============
plt.pie([30, 40, 30], labels=['A', 'B', 'C'])
plt.show()

# ============ PLOT FROM PANDAS DIRECTLY ============
df['salary'].plot()                      # line
df['salary'].plot(kind='bar')            # bar
df.plot(x='year', y='salary')            # x vs y
df.plot(x='year', y='salary', kind='scatter')

# ============ GROUPBY + PLOT ============
df.groupby('country')['salary'].mean().plot(kind='bar')
plt.title('Average Salary by Country')
plt.show()

# ============ MULTIPLE LINES FROM DATAFRAME ============
countries = ['Italy', 'Spain', 'France']
for country in countries:
    data = df[df['country'] == country]
    plt.plot(data['year'], data['value'], label=country)
plt.legend()
plt.show()

# ============ STYLING OPTIONS ============
plt.plot(x, y, color='red')              # color
plt.plot(x, y, linestyle='--')           # dashed line
plt.plot(x, y, marker='o')               # dots on points
plt.plot(x, y, linewidth=2)              # thicker line

# Combined:
plt.plot(x, y, color='blue', linestyle='--', marker='o', linewidth=2, label='Sales')

# ============ FIGURE SIZE ============
plt.figure(figsize=(10, 6))  # width, height in inches
plt.plot(x, y)
plt.show()

# ============ SUBPLOTS (MULTIPLE PLOTS) ============
fig, axes = plt.subplots(1, 2)  # 1 row, 2 columns
axes[0].plot(x, y)
axes[0].set_title('Plot 1')
axes[1].bar(['A', 'B'], [10, 20])
axes[1].set_title('Plot 2')
plt.show()

# 2x2 grid
fig, axes = plt.subplots(2, 2)
axes[0, 0].plot(x, y)
axes[0, 1].bar(['A', 'B'], [10, 20])
axes[1, 0].scatter(x, y)
axes[1, 1].hist(data)
plt.tight_layout()
plt.show()

# ============ SAVE FIGURE ============
plt.plot(x, y)
plt.savefig('myplot.png')
plt.savefig('myplot.pdf')

# ============ COMMON FORMATTING ============
plt.xlabel('X Label')
plt.ylabel('Y Label')
plt.title('My Title')
plt.legend()                    # show legend
plt.xticks(rotation=45)         # rotate x labels
plt.tight_layout()              # fix layout
plt.grid(True)                  # add grid
plt.xlim(0, 100)                # x axis limits
plt.ylim(0, 50)                 # y axis limits

# ============================
# PYTHON QUICK CHEAT SHEET
# Requests + GraphQL + SQLite
# ============================

# ---------- SQLite ----------
import sqlite3

# Connect / cursor
conn = sqlite3.connect("mydb.sqlite")
cur = conn.cursor()

# Create table (safe to re-run)
cur.execute("""
CREATE TABLE IF NOT EXISTS table_name (
  col1 TEXT,
  col2 INTEGER
)
""")

# INSERT (parameterized)
cur.execute(
  "INSERT INTO table_name (col1, col2) VALUES (?, ?)",
  ("value", 10)              # tuple matches the ? placeholders
)
conn.commit()

# INSERT many
rows = [("A", 1), ("B", 2)]
cur.executemany(
  "INSERT INTO table_name (col1, col2) VALUES (?, ?)",
  rows
)
conn.commit()

# SELECT with 1 parameter (NOTE the comma!)
cur.execute("SELECT * FROM table_name WHERE col1 = ?", ("A",))
print(cur.fetchall())

# SELECT with multiple parameters
cur.execute(
  "SELECT * FROM table_name WHERE col2 BETWEEN ? AND ?",
  (1, 10)
)
print(cur.fetchall())

# OR condition (same value)
q = "A"
cur.execute(
  "SELECT * FROM table_name WHERE col1 = ? OR col2 = ?",
  (q, q)
)

# IN clause (dynamic list)
ids = [1, 3, 5]
ph = ",".join(["?"] * len(ids))       # "?,?,?"
cur.execute(f"SELECT * FROM table_name WHERE col2 IN ({ph})", ids)

# Fetch methods
cur.fetchone()     # one row
cur.fetchmany(5)   # up to 5 rows
cur.fetchall()     # all rows

# Close DB
cur.close()
conn.close()


# ---------- Requests ----------
import requests

# GET JSON
r = requests.get("https://example.com/data.json", timeout=30)
r.raise_for_status()
data = r.json()

# POST JSON
r = requests.post("https://example.com/api", json={"key": "value"})
r.raise_for_status()


# ---------- GraphQL ----------
url = "https://data.rcsb.org/graphql"
query = """
{
  entries(entry_ids: ["1QK1"]) {
    rcsb_id
  }
}
"""

r = requests.post(url, json={"query": query}, timeout=30)
r.raise_for_status()
j = r.json()

if "errors" in j:
    raise RuntimeError(j["errors"])

entries = j["data"]["entries"]



Proteomics Approaches - Oral Questions

Key Distinction

Remember this throughout:

  • Bottom-up: Gel-based; proteins are separated BEFORE digestion
  • Shotgun: Gel-free; entire mixture is digested BEFORE peptide separation
  • Top-down: Gel-free; intact proteins analyzed WITHOUT digestion

1. The Three Approaches Overview

Practice Set: Three Approaches
Question 1 Overview
Hard
Compare the three main proteomic approaches: Bottom-up, Shotgun, and Top-down. What is the key difference between them?
✓ Model Answer

All three approaches share common phases but differ in timing of enzymatic digestion and state of proteins during separation.

Approach  | Strategy  | Separation                           | Digestion
----------+-----------+--------------------------------------+----------------------
Bottom-up | Gel-based | Proteins separated FIRST (2D-PAGE)   | After separation
Shotgun   | Gel-free  | Peptides separated (after digestion) | FIRST (whole mixture)
Top-down  | Gel-free  | Intact proteins (HPLC)               | NO digestion

Key distinctions:

  • Bottom-up: Separate proteins → Digest → MS (PMF)
  • Shotgun: Digest mixture → Separate peptides → MS/MS
  • Top-down: Separate intact proteins → MS (intact mass + fragmentation)
💡 Memory aid: Bottom-up = proteins first, Shotgun = peptides first, Top-down = no digestion at all.
Question 2 Bottom-up
Medium
Describe the Bottom-up approach in detail. What are its main steps?
✓ Model Answer

Bottom-up is a gel-based strategy where proteins are separated before digestion.

Workflow:

  1. Extraction & Lysis: Release proteins from cells
  2. Sample Preparation: Denaturation, reduction, alkylation
  3. 2D-PAGE Separation:
    • 1st dimension: IEF (by pI)
    • 2nd dimension: SDS-PAGE (by MW)
  4. Staining & Visualization: Coomassie or Silver stain
  5. Spot Picking: Excise protein spots from gel
  6. In-gel Digestion: Trypsin digestion
  7. MS Analysis: MALDI-TOF for PMF
  8. Database Search: Match masses to identify protein

Identification method: Peptide Mass Fingerprinting (PMF) — based on fingerprint of a single protein.

Question 3 Shotgun
Medium
Describe the Shotgun approach. Why is it called "shotgun"?
✓ Model Answer

Shotgun is a gel-free strategy where the entire protein mixture is digested first.

Why "Shotgun"?

  • Like a shotgun blast — analyzes everything at once
  • No pre-selection of proteins
  • Relies on computational deconvolution

Workflow:

  1. Extract proteins from sample
  2. Digest ENTIRE mixture with trypsin (no gel separation)
  3. Separate peptides by multidimensional chromatography (e.g., MudPIT: SCX + RP-HPLC)
  4. Online LC-MS/MS: ESI coupled to tandem MS
  5. Database search: Match MS/MS spectra to sequences

Identification method: Based on thousands of overlapping peptide sequences — much higher coverage than PMF.

Key difference from Bottom-up:

  • Bottom-up: Separate proteins first
  • Shotgun: Separate peptides first
Question 4 Top-down
Medium
Describe the Top-down approach. What is its main advantage?
✓ Model Answer

Top-down is a gel-free strategy where intact proteins are analyzed without enzymatic digestion.

Workflow:

  1. Fractionate proteins by HPLC (not gels)
  2. Introduce intact protein to MS (offline infusion or online LC-MS)
  3. Measure intact mass
  4. Fragment in gas phase (CID, ETD, ECD)
  5. Analyze fragments for sequence information

Main advantages:

  • Complete sequence coverage: See the whole protein
  • PTM preservation: All modifications remain intact
  • Proteoform identification: Can distinguish different forms of same protein
  • No digestion artifacts: See true mass of protein

Identification method: Based on intact mass + gas-phase fragmentation of the whole protein.

Note: Alkylation often skipped to measure true intact mass.


2. Sample Preparation & Extraction

🎤
Oral Question Cell Lysis
Hard
Describe the different methods of cell lysis for protein extraction. What are the three main approaches?
✓ Model Answer

Cell lysis disrupts cellular structure to release proteins. Three main approaches:

1. Chemical Lysis:

  • Uses detergents and buffers
  • Example: SDS disrupts hydrophobic interactions among membrane lipids
  • Gentle, but may interfere with downstream analysis

2. Enzymatic Lysis:

  • Uses specific enzymes to digest cell walls or extracellular matrix
  • Examples: Lysozyme (bacteria), Zymolyase (yeast)
  • Specific and gentle

3. Physical Lysis:

Method                        | Mechanism
------------------------------+---------------------------------------------------
Mechanical (Blender/Polytron) | Rotating blades grind and disperse cells
Liquid Homogenization         | Force through narrow space (Dounce, French Press)
Sonication                    | High-frequency sound waves shear cells
Freeze/Thaw                   | Ice crystal formation disrupts membranes
Manual (Mortar & Pestle)      | Grinding frozen tissue (liquid nitrogen)

After lysis: Centrifugation separates debris from soluble proteins (supernatant).

🎤
Oral Question Depletion & Enrichment
Medium
What is the difference between depletion and enrichment in sample preparation? When is each used?
✓ Model Answer

Both are pre-analytical complexity management steps to reduce sample complexity and compress dynamic range.

Depletion:

  • Purpose: Remove high-abundance proteins that mask low-abundance ones
  • When used: Essential for plasma/serum (albumin = ~60% of protein)
  • Methods: Immunoaffinity columns, protein A/G

Enrichment:

  • Purpose: Isolate specific sub-proteomes of interest
  • Methods:
    • Selective Dialysis: Membrane with tiny pores acts as sieve
    • Microdialysis: Collect small molecules through diffusion
    • Selective Precipitation: Salts/solvents isolate by solubility
    • Immunoprecipitation: Antibodies isolate target protein

Approach-specific needs:

  • Bottom-up: Complexity reduced physically on 2D gel
  • Shotgun & Top-down: Complexity must be managed strictly during extraction to avoid overloading LC-MS
🎤
Oral Question Reduction & Alkylation
Medium
What is reduction and alkylation? Why are these steps important in sample preparation?
✓ Model Answer

Final steps of sample preparation to ensure proteins remain denatured and accessible to trypsin.

Reduction:

  • Reagent: DTT (dithiothreitol) or TCEP
  • Purpose: Break disulfide bonds (S-S → SH + SH)
  • Unfolds protein structure

Alkylation:

  • Reagent: IAA (iodoacetamide) or IAM
  • Purpose: Block free thiol groups (prevents disulfide reformation)
  • Adds ~57 Da (carbamidomethyl) to each cysteine

Why important:

  • Ensures complete denaturation
  • Makes all sites accessible to trypsin
  • Prevents protein refolding/aggregation
  • Produces reproducible digestion

Approach differences:

  • Bottom-up: Essential for proper IEF/SDS-PAGE
  • Shotgun: Essential for making protein accessible to trypsin
  • Top-down: Alkylation often skipped to measure true intact mass
🎤
Oral Question Sample Prep Goals
Medium
What are the main goals of sample preparation in proteomics?
✓ Model Answer

Five main goals:

  1. Solubilize all protein classes reproducibly
    • Including hydrophobic membrane proteins
    • Use chaotropes (urea, thiourea) to disrupt hydrogen bonds
  2. Prevent protein aggregation
    • Keep solubility high during IEF or digestion
    • Use appropriate detergents
  3. Prevent chemical/enzymatic modifications
    • Use protease inhibitors
    • Work at low temperature
  4. Remove interfering molecules
    • Digest or remove: nucleic acids, salts, lipids
  5. Enrich target proteins
    • Reduce dynamic range
    • Deplete high-abundance proteins
💡 Note on detergents: For IEF, avoid ionic detergents like SDS (binds proteins, imparts negative charge). Use zwitterionic detergents like CHAPS instead.

3. 2D-PAGE (Two-Dimensional Electrophoresis)

🎤
Oral Question 2D-PAGE Principle
Hard
Explain the principle of 2D-PAGE. What is separated in each dimension and how?
✓ Model Answer

2D-PAGE separates proteins by TWO independent (orthogonal) properties for maximum resolution.

First Dimension: Isoelectric Focusing (IEF)

  • Separates by isoelectric point (pI)
  • Uses immobilized pH gradient (IPG) strip
  • High voltage applied
  • Positively charged proteins → cathode
  • Negatively charged proteins → anode
  • Each protein migrates until net charge = 0 (at its pI)
  • Result: Proteins aligned horizontally by pI

Second Dimension: SDS-PAGE

  • Separates by molecular weight (MW)
  • IPG strip placed on top of polyacrylamide gel
  • SDS denatures and gives uniform negative charge
  • Smaller proteins migrate faster
  • Result: Each horizontal pI band is resolved vertically by MW

Final result: 2D map of spots — each spot = specific protein with unique pI and MW.

Dimension | Property    | Method   | Direction
1st       | pI (charge) | IEF      | Horizontal
2nd       | MW (size)   | SDS-PAGE | Vertical
🎤
Oral Question IEF Resolution
Medium
How does the pH range of the IPG strip affect IEF resolution?
✓ Model Answer

The resolution of IEF depends on the pH range of the IPG strip:

pH Range           | Resolution        | Use Case
Wide (3-10)        | Lower resolution  | Initial screening, overview
Narrow (e.g., 5-7) | Higher resolution | Detailed analysis of specific pI range

Why?

  • Wide range: Same physical strip length covers more pH units → proteins with similar pI hard to distinguish
  • Narrow range: Same length covers fewer pH units → better separation of proteins with close pI values

Strategy:

  1. Start with wide range (pH 3-10) for overview
  2. Use narrow range strips to "zoom in" on regions of interest
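The zoom-in effect can be expressed as pH units per centimetre of strip. A toy calculation (the 18 cm strip length is an assumption; commercial strips come in several lengths):

```python
def ph_units_per_cm(ph_min: float, ph_max: float, strip_cm: float) -> float:
    """Fewer pH units per cm = more physical distance per pH unit = higher resolution."""
    return (ph_max - ph_min) / strip_cm

wide = ph_units_per_cm(3, 10, 18)    # wide-range strip
narrow = ph_units_per_cm(5, 7, 18)   # narrow-range strip, same physical length
print(round(wide, 2), round(narrow, 2))  # 0.39 0.11
```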
🎤
Oral Question Gel Staining
Easy
What staining methods are used to visualize proteins on 2D gels? Which is most sensitive?
✓ Model Answer

Common staining methods:

Method                   | Sensitivity | MS Compatible | Notes
Coomassie Brilliant Blue | ~100 ng     | Yes           | Simple, reversible
Silver Staining          | ~1 ng       | Variable*     | Most sensitive
SYPRO Ruby               | ~1-10 ng    | Yes           | Fluorescent, linear range

Silver staining is the most sensitive method, capable of detecting very low-abundance proteins.

*Silver staining compatibility with MS depends on the protocol — some fixatives can interfere.

After staining:

  1. Gel is digitized (scanner or camera)
  2. Image imported to software (e.g., Melanie)
  3. Spot detection and analysis performed
🎤
Oral Question Master Gel
Medium
What is a Master Gel? How is it created and used?
✓ Model Answer

Master Gel: A synthetic reference map created from multiple gel replicates.

How it's created:

  1. Run multiple replicates of the same sample
  2. Use image alignment (matching) software
  3. Apply warping algorithms to correct geometric distortions
  4. Combine all spots detected across all gels

What it contains:

  • Every spot detected across the entire experiment
  • Characterizes a "typical profile"
  • Assigns unique coordinates to each protein

How it's used:

  • Reference for comparing samples (e.g., healthy vs. diseased)
  • Enables consistent spot identification across experiments
  • Facilitates quantitative comparison

Software features:

  • Contrast adjustment
  • Background subtraction
  • 3D visualization
  • Spot detection and splitting
🎤
Oral Question 2D-PAGE Limitations
Hard
What are the limitations of 2D gel electrophoresis?
✓ Model Answer

Sample-Related Limitations:

  • Hydrophobic proteins: Membrane proteins poorly soluble in IEF buffers
  • Extreme pI: Very acidic (<3) or basic (>10) proteins hard to focus
  • Extreme MW: Large (>200 kDa) don't enter gel; small (<10 kDa) run off
  • Low-abundance proteins: Masked by high-abundance proteins
  • Limited dynamic range: ~10⁴ vs. proteome range of 10⁶-10⁷

Technical Limitations:

  • Poor reproducibility: Gel-to-gel variation requires triplicates
  • Labor-intensive: Manual, time-consuming, hard to automate
  • Low throughput: Cannot be easily scaled
  • Co-migration: Similar pI/MW proteins in same spot

Practical Issues:

  • Keratin contamination (especially manual spot picking)
  • Streaking from degradation
  • Background from staining
💡 These limitations drove development of gel-free approaches (shotgun proteomics, MudPIT).

4. Enzymatic Digestion

🎤
Oral Question Trypsin
Hard
Why is trypsin considered the gold standard for proteomics? What is its specificity?
✓ Model Answer

Trypsin Specificity:

  • Cleaves at the C-terminal side of Lysine (K) and Arginine (R)
  • Exception: Does NOT cleave when followed by Proline (P)

Why it's the gold standard:

  1. Robustness: Stable and active across wide pH and temperature range
  2. High Specificity: Predictable cleavage sites enable accurate database searching
  3. Ideal Peptide Length: Generates peptides of 6-20 amino acids — optimal for MS detection
  4. Internal Calibration: Autolysis peaks (trypsin digesting itself) serve as mass standards
  5. Basic C-terminus: K and R promote ionization in positive mode

When to use alternatives:

  • Proteins rich in K/R → use Glu-C (cleaves after Glu) for longer peptides
  • Different sequence coverage needed → Chymotrypsin (cleaves after Phe, Tyr, Trp)
💡 Memorize: "Trypsin cleaves C-terminal to K and R, except before P"
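The cleavage rule is simple enough to encode directly. A minimal in silico digestion sketch (the test sequences are arbitrary; real tools also handle missed cleavages and peptide-length filters):

```python
import re

def tryptic_digest(sequence: str) -> list[str]:
    """Cleave C-terminal to K or R, except when the next residue is P."""
    # Zero-width split: every position after [KR] that is not followed by P
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

print(tryptic_digest("MKWVTFISLLFLFSSAYSRGVFRR"))
# ['MK', 'WVTFISLLFLFSSAYSR', 'GVFR', 'R']
print(tryptic_digest("AKPLR"))  # ['AKPLR'] — the K-P bond is not cleaved
```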
🎤
Oral Question Digestion Timing
Medium
When does enzymatic digestion occur in each proteomic approach?
✓ Model Answer
Approach  | When Digestion Occurs              | What Is Digested
Bottom-up | AFTER protein separation (2D-PAGE) | Single protein from excised spot
Shotgun   | BEFORE separation                  | Entire protein mixture at once
Top-down  | NO enzymatic digestion             | N/A - intact proteins analyzed

Bottom-up digestion:

  • Called "in-gel digestion"
  • Spot excised, destained, then digested
  • Peptides extracted from gel

Shotgun digestion:

  • Called "in-solution digestion"
  • Whole lysate digested
  • Produces complex peptide mixture

5. Peptide Cleanup & Separation

🎤
Oral Question ZipTip
Medium
What is ZipTip purification? When is it used?
✓ Model Answer

ZipTip: A 10 µL pipette tip packed with reverse-phase (RP) material.

Purpose:

  • Desalt peptides (remove salts that interfere with ionization)
  • Concentrate samples
  • Remove detergents and buffers

How it works:

  1. Condition tip with solvent
  2. Bind peptides to RP material
  3. Wash away salts (they don't bind)
  4. Elute clean, concentrated peptides

When used:

  • Bottom-up (gel-based): Preferred offline method for cleaning peptides from single gel spot
  • Before MALDI-TOF analysis
  • Improves MS sensitivity for low-abundance proteins

Shotgun & Top-down: Use online RP-HPLC instead (performs both desalting and high-resolution separation).

🎤
Oral Question RP-HPLC
Medium
What is Reverse-Phase HPLC? Why is it called "reverse-phase"?
✓ Model Answer

Reverse-Phase (RP) Chromatography: The dominant mode for peptide separation in proteomics.

Why "reverse-phase"?

  • Normal-phase: Polar stationary phase, non-polar mobile phase
  • Reverse-phase: Non-polar (hydrophobic) stationary phase, polar mobile phase
  • It's the "reverse" of traditional chromatography

How it works:

  • Stationary phase: C18 hydrocarbon chains (hydrophobic)
  • Mobile phase: Water/acetonitrile gradient
  • Peptides bind via hydrophobic interactions
  • Increasing organic solvent elutes more hydrophobic peptides

Use in proteomics:

Approach  | RP-HPLC Use
Bottom-up | Offline (ZipTip) or online before MS
Shotgun   | Online, coupled directly to ESI-MS/MS
Top-down  | Online for intact protein separation

6. MALDI-TOF Mass Spectrometry

Practice Set: MALDI-TOF
Question 1 MALDI Process
Hard
Explain the MALDI ionization process step by step. What is the role of the matrix?
✓ Model Answer

MALDI = Matrix-Assisted Laser Desorption/Ionization

Step-by-step process:

  1. Sample Preparation:
    • Analyte mixed with organic matrix (e.g., α-CHCA, DHB, sinapinic acid)
    • Spotted on metal plate, solvent evaporates
    • Analyte "caged" within matrix crystals
  2. Laser Irradiation:
    • Plate placed in vacuum chamber
    • UV laser (337 nm nitrogen or 355 nm Nd:YAG) pulses at sample
  3. Desorption:
    • Matrix absorbs laser energy, rapidly heats up
    • Controlled "explosion" carries intact analyte into gas phase
  4. Ionization:
    • Protons transfer from matrix to analyte in the plume
    • Most peptides pick up single proton → [M+H]⁺

Role of the matrix:

  • Absorbs laser energy (protects analyte)
  • Facilitates desorption
  • Donates protons for ionization
  • "Soft" ionization — even large proteins stay intact
Question 2 TOF Analyzer
Hard
How does the TOF (Time-of-Flight) mass analyzer work? What problems can affect accuracy?
✓ Model Answer

TOF Principle:

  • Ions accelerated through electric field → same kinetic energy
  • KE = ½mv² → lighter ions travel faster
  • Ions enter field-free drift tube
  • Time to reach detector depends on m/z
  • Small/light ions arrive first

Problems affecting accuracy:

  1. Spatial Distribution: Not all ions start at same distance from detector
  2. Initial Velocity Spread: Some ions have different starting speeds

Solutions:

  • Delayed Extraction: Brief pause before acceleration allows ions to "reset" — more uniform start
  • Reflectron: See next question
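The KE = ½mv² relation turns directly into a flight-time estimate. A sketch with illustrative instrument parameters (the 1 m tube and 20 kV accelerating voltage are assumptions; real geometries differ):

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
DALTON = 1.66053906660e-27   # unified atomic mass unit, kg

def flight_time_us(mass_da: float, z: int, voltage: float, tube_m: float) -> float:
    """Drift time in µs: z*e*U = 0.5*m*v²  →  v = sqrt(2*z*e*U/m), t = L/v."""
    v = math.sqrt(2 * z * E_CHARGE * voltage / (mass_da * DALTON))
    return tube_m / v * 1e6

print(round(flight_time_us(1000, 1, 20_000, 1.0), 1))  # ~16.1 µs
print(round(flight_time_us(2000, 1, 20_000, 1.0), 1))  # ~22.8 µs — heavier arrives later
```

Note that doubling the mass multiplies the flight time by √2, not 2: time scales with the square root of m/z.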
Question 3 Reflectron
Hard
What is a Reflectron and how does it improve resolution?
✓ Model Answer

Problem: Ions of same m/z may have slightly different kinetic energies → peaks blur (poor resolution).

Reflectron ("Ion Mirror"):

  • Electric field that reverses ions' direction
  • Located at end of flight tube

How it improves resolution:

  • Faster ions (higher KE) penetrate deeper into reflectron → longer path
  • Slower ions (lower KE) turn back sooner → shorter path
  • Result: Ions of same m/z arrive at detector at the same time
  • Peaks become narrower → better resolution

Resolution formula: R = m/Δm (where Δm = FWHM of peak)

💡 Resolution depends on: Reflectron + Delayed Extraction (both minimize energy and spatial spread).
Question 4 Data Quality
Medium
What three criteria define excellent MS data quality?
✓ Model Answer

Three criteria for excellent data:

  1. Sensitivity:
    • Ability to detect tiny amounts of sample
    • Down to femtomole (10⁻¹⁵ mol) quantities
  2. Resolution:
    • Ability to distinguish ions differing by at least 1 Da
    • Calculated: R = m/Δm (FWHM)
    • Depends on Reflectron and Delayed Extraction
  3. Accuracy (Calibration):
    • How close measured mass is to true mass
    • Requires regular calibration with known standards
    • Expressed in ppm (parts per million)
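The resolution and ppm definitions translate directly into arithmetic. A quick sketch (the peak mass, FWHM, and measurement error are illustrative numbers):

```python
def resolution(mass: float, fwhm: float) -> float:
    """R = m / Δm, with Δm the full width at half maximum of the peak."""
    return mass / fwhm

def mass_error_ppm(measured: float, true_mass: float) -> float:
    """Calibration accuracy expressed in parts per million."""
    return (measured - true_mass) / true_mass * 1e6

print(resolution(1000.0, 0.05))                    # 20000.0
print(round(mass_error_ppm(1000.005, 1000.0), 1))  # 5.0 ppm
```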
Question 5 MALDI Ions
Easy
What type of ions does MALDI produce — singly or multiply charged?
✓ Model Answer

MALDI produces almost exclusively SINGLY CHARGED ions.

Common ions:

  • [M+H]⁺ — most common (protonated molecule)
  • [M+Na]⁺ — sodium adduct
  • [M+K]⁺ — potassium adduct
  • [M-H]⁻ — negative mode

Advantage of singly charged:

  • Simple, easy-to-read spectra
  • Each peak = molecular mass + 1 (for proton)
  • No charge deconvolution needed

Example: Peptide of 1032 Da appears at m/z = 1033 [M+H]⁺
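The adduct arithmetic fits in a few lines. A sketch using the 1032 Da peptide from the example above (the cation masses are standard monoisotopic values, stated here as assumptions):

```python
# Monoisotopic mass (Da) of the attached cation for each singly charged adduct
ADDUCTS = {"[M+H]+": 1.00728, "[M+Na]+": 22.98922, "[M+K]+": 38.96316}

def maldi_mz(neutral_mass: float) -> dict[str, float]:
    """m/z of common singly charged MALDI adducts (z = 1, so m/z = M + cation)."""
    return {ion: neutral_mass + cation for ion, cation in ADDUCTS.items()}

for ion, mz in maldi_mz(1032.0).items():
    print(f"{ion}: {mz:.2f}")
```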


7. ESI (Electrospray Ionization)

Practice Set: ESI
Question 1 ESI Process
Hard
Explain the ESI (Electrospray Ionization) process step by step. What is the Rayleigh limit?
✓ Model Answer

ESI = Electrospray Ionization — premier "soft" technique for liquid samples.

Step-by-step process:

  1. Spray Formation:
    • Liquid sample pumped through fine capillary needle
    • High voltage (2-5 kV) applied
    • Forms Taylor Cone at needle tip
    • Produces fine mist of charged droplets
  2. Desolvation:
    • Warm, dry nitrogen gas injected
    • Acts as "hairdryer" — evaporates solvent
    • Nitrogen is inert — doesn't react with sample
  3. Rayleigh Limit & Coulomb Explosion:
    • As solvent evaporates, droplet shrinks
    • Charge density increases (same charge, smaller surface)
    • Rayleigh limit: Point where charge repulsion > surface tension
    • Coulomb explosion: Droplet bursts into smaller "progeny" droplets
    • Cycle repeats until solvent gone
  4. Ion Release:
    • Fully desolvated, multiply charged ions released
Question 2 Multiple Charging
Hard
What type of ions does ESI produce? Why is multiple charging important?
✓ Model Answer

ESI produces MULTIPLY CHARGED ions — key characteristic!

Ion types:

  • Positive mode: [M+nH]ⁿ⁺ (e.g., [M+2H]²⁺, [M+3H]³⁺)
  • Negative mode: [M-nH]ⁿ⁻
  • Creates a charge envelope (Gaussian distribution of charge states)

Why multiple charging is important:

  • m/z = mass / charge
  • More charges → lower m/z values
  • Allows detection of very large proteins within typical mass analyzer range

Example:

  • 50 kDa protein with +50 charges
  • m/z = 50,000 / 50 = 1,000 (easily detectable)

Disadvantage: More complex spectra (multiple peaks per protein) — requires deconvolution.
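Deconvolution can be sketched from two adjacent peaks of the charge envelope: since m/z = (M + z·1.00728)/z, two neighbouring charge states pin down both z and M. A minimal example (the 10 kDa protein and its peak positions are synthetic values):

```python
PROTON = 1.00728  # Da

def deconvolve_pair(mz_high_charge: float, mz_low_charge: float) -> tuple[int, float]:
    """Infer charge and neutral mass from two adjacent ESI charge states.
    mz_low_charge carries z protons; mz_high_charge carries z+1 (more charge → lower m/z)."""
    z = round((mz_high_charge - PROTON) / (mz_low_charge - mz_high_charge))
    neutral_mass = z * (mz_low_charge - PROTON)
    return z, neutral_mass

# Synthetic envelope of a 10,000 Da protein: z=11 peak at 910.098, z=10 peak at 1001.007
z, mass = deconvolve_pair(910.098, 1001.007)
print(z, round(mass))  # 10 10000
```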

Question 3 ESI Advantage
Medium
What is the greatest advantage of ESI?
✓ Model Answer

Greatest advantage: Direct online coupling to HPLC.

Why this matters:

  • ESI operates at atmospheric pressure with liquid samples
  • HPLC separates complex mixture over time
  • ESI continuously ionizes components as they elute
  • Ions sent directly into mass analyzer

Result: LC-ESI-MS/MS — the workhorse of shotgun proteomics.

Additional ESI advantages:

  • Very high sensitivity (attomole range — 1000× better than MALDI)
  • Soft ionization (large proteins intact)
  • Multiple charging enables large protein detection

Trade-offs:

  • More complex instrumentation
  • Slower analysis (chromatography time)
  • Sensitive to salts/contaminants
Question 4 ESI Limitations
Medium
What are the limitations of ESI?
✓ Model Answer

ESI Limitations:

  • Sensitive to contaminants:
    • Salts disrupt Taylor Cone formation
    • Cause ion suppression
    • Requires rigorous sample purification
  • Complex spectra:
    • Multiple charge states per molecule
    • Requires computational deconvolution
  • Slower throughput:
    • LC separation takes time
    • Not as fast as MALDI for simple samples
  • More complex instrumentation:
    • Requires LC system
    • More maintenance

8. MALDI vs ESI Comparison

🎤
Oral Question Complete Comparison
Hard
Compare MALDI and ESI ionization techniques. What are the advantages and disadvantages of each?
✓ Model Answer
Feature               | MALDI                     | ESI
Sample state          | Solid (co-crystallized)   | Liquid (solution)
Ions produced         | Singly charged            | Multiply charged
Sensitivity           | Femtomole (10⁻¹⁵)         | Attomole (10⁻¹⁸) — 1000× better
Contaminant tolerance | High (robust)             | Low (sensitive to salts)
LC coupling           | Offline                   | Online (direct)
Spectra               | Simple                    | Complex (multiple charges)
Throughput            | High (~10⁴ samples/day)   | Lower (LC time)
Best for              | PMF, rapid fingerprinting | Shotgun proteomics, deep mapping

Summary:

  • MALDI: Favored for speed, simplicity, and tolerance to contaminants
  • ESI: Gold standard for high-sensitivity proteomics and complex LC-MS/MS analyses
💡 Key difference: MALDI = Singly charged (simple spectra), ESI = Multiply charged (can analyze huge proteins).

9. Peptide Mass Fingerprinting (PMF)

🎤
Oral Question PMF Workflow
Hard
What is Peptide Mass Fingerprinting (PMF)? Describe the complete workflow.
✓ Model Answer

PMF: Protein identification technique based on the mass spectrum of proteolytic peptides.

Principle: Each protein produces a unique "fingerprint" of peptide masses when digested with a specific enzyme.

Complete workflow:

  1. Spot Recovery: Excise protein spot from 2D gel (robotic or manual)
  2. Destaining: Remove Coomassie or silver stain
  3. Reduction/Alkylation: Break disulfide bonds, block cysteines
  4. In-gel Digestion: Trypsin digestion overnight
  5. Peptide Extraction: Recover peptides from gel pieces
  6. Cleanup: ZipTip desalting
  7. MALDI-TOF Analysis: Acquire mass spectrum
  8. Database Search:
    • Compare experimental masses to theoretical "digital digests"
    • Databases: UniProt, Swiss-Prot
    • Software assigns Mascot score (statistical probability)

Identification criteria:

  • Significant number of peptides must match
  • Typically need 4-6 matching peptides
  • ~40% sequence coverage considered good

Limitation: Only works if protein is in database.
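The database-search step boils down to mass matching within a tolerance. A toy version (the masses and the 50 ppm tolerance are invented for illustration; real engines like Mascot also compute a probability-based score):

```python
def pmf_matches(experimental: list[float], theoretical: list[float],
                tol_ppm: float = 50.0) -> int:
    """Count experimental peptide masses matching a theoretical 'digital digest'."""
    return sum(
        any(abs(exp - theo) / theo * 1e6 <= tol_ppm for theo in theoretical)
        for exp in experimental
    )

observed = [832.42, 1045.56, 1347.71, 1502.80]    # hypothetical MALDI-TOF peaks
digest = [832.421, 1045.558, 1199.63, 1347.712]   # hypothetical in silico digest
print(pmf_matches(observed, digest))  # 3 of 4 masses match
```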


10. Quick Review Questions

Test yourself with these rapid-fire questions:

Bottom-up separates ❓ before digestion → Proteins (via 2D-PAGE)

Shotgun separates ❓ after digestion → Peptides (via LC)

Top-down analyzes proteins ❓ digestion → WITHOUT any digestion (intact)

DTT is used for ❓ → Reduction (breaking disulfide bonds)

IAA is used for ❓ → Alkylation (blocking cysteine thiols)

The 1st dimension of 2D-PAGE separates by ❓ → pI (isoelectric point) via IEF

The 2nd dimension of 2D-PAGE separates by ❓ → MW (molecular weight) via SDS-PAGE

MALDI produces ❓ charged ions → Singly charged [M+H]⁺

ESI produces ❓ charged ions → Multiply charged [M+nH]ⁿ⁺

The Rayleigh limit is reached when ❓ → Charge repulsion exceeds surface tension, causing Coulomb explosion

The Reflectron improves ❓ → Resolution (compensates for kinetic energy spread)

ZipTip is used for ❓ → Desalting and concentrating peptides

Why avoid SDS in IEF? → It binds proteins and imparts negative charge, interfering with pI-based separation

Use ❓ detergent instead of SDS for IEF → CHAPS (zwitterionic)

Silver staining is more sensitive than Coomassie by approximately ❓ → 100× (1 ng vs 100 ng detection limit)

Is ESI coupled to HPLC online or offline? → Online (direct coupling)

Is MALDI typically online or offline? → Offline

ESI sensitivity is in the ❓ range → Attomole (10⁻¹⁸)


Quantitative Proteomics - Oral Questions

A comprehensive collection of oral exam questions covering quantitative proteomics methods: SILAC, ICAT, iTRAQ, TMT, and Label-Free approaches.


Key Workflow Overview

When does labeling occur?

Stage                  | Method
Metabolic (in vivo)    | SILAC, SILAM
Spiking (after lysis)  | AQUA, QconCAT, Super-SILAC
Enzymatic (digestion)  | ¹⁸O Labeling
Chemical (before HPLC) | iTRAQ, TMT, Dimethylation
No labeling            | Spectral Counting, MRM, SWATH, XIC

1. Introduction to Quantitative Proteomics

🎤
Oral Question Definition
Medium
What is Quantitative Proteomics? What are its main applications?
✓ Model Answer

Quantitative Proteomics: An analytical field focused on measuring the relative expression levels of proteins and characterizing their Post-Translational Modifications (PTMs).

Primary goal: Evaluate how protein expression shifts between different states/conditions.

Main applications:

  • Tissue Comparison: Understanding molecular differences between tissue types
  • Biomarker Discovery: Identifying proteins that differentiate healthy vs. diseased states
  • Drug & Pathogen Response: Monitoring cellular reactions to treatments and infections
  • Stress Analysis: Studying adaptation to environmental or physiological stress

Key distinction:

  • Qualitative: What proteins are present? (identification)
  • Quantitative: How much of each protein? (abundance)
🎤
Oral Question Longitudinal Profiling
Medium
What is longitudinal profiling? Why is it important in personalized medicine?
✓ Model Answer

Longitudinal Profiling: Monitoring a person's molecular profile over long time frames, comparing current data against their own previous measurements (rather than just population averages).

Why it's important:

  • More meaningful: Individual baseline is more informative than population average
  • Early detection: Identifies risks before symptoms appear
  • High sensitivity: Catches subtle molecular changes unique to the individual
  • Prevention: Enables proactive interventions to stop disease progression

Example: Athlete Biological Passport (ABP)

  • Monitors biological variables in athletes over time
  • Doesn't detect specific substances
  • Looks for fluctuations that indirectly reveal doping effects
  • Consistent monitoring makes it harder to bypass anti-doping rules
💡 Key shift: From population averages → individual trends = more effective disease prevention.

2. Plasma Proteomics & Biomarkers

🎤
Oral Question Plasma Challenges
Hard
What are the main challenges of plasma proteomics? What is the difference between plasma and serum?
✓ Model Answer

Plasma vs Serum:

Plasma                    | Serum
Blood + anticoagulant     | Blood allowed to clot
Contains clotting factors | Devoid of clotting factors

The Composition Challenge:

  • Unbalanced distribution of protein mass
  • In cells: >2,300 proteins = 75% of mass
  • In plasma: Only 20 proteins = ~90% of mass (albumin, immunoglobulins)

The masking problem:

  • Dominant proteins mask low-abundance proteins
  • Disease biomarkers often hidden in the "low-abundance" fraction

Solutions:

  • Depletion: Remove abundant proteins (albumin, IgG)
  • Enrichment: Increase concentration of rare proteins
🎤
Oral Question Leakage Proteins
Medium
What are leakage proteins? Give an example.
✓ Model Answer

Leakage Proteins: Intracellular proteins that are abnormally released into the bloodstream (or other body fluids) as a result of damage, stress, or death of a specific tissue or organ.

Why they're important:

  • Serve as biomarkers for tissue damage
  • Indicate which organ/tissue is affected
  • Used in clinical diagnostics

Primary example: Cardiac Troponin

  • Normally found inside heart muscle cells
  • Released into blood when heart muscle is damaged
  • Gold standard biomarker for heart attack (myocardial infarction)
  • Very specific to cardiac tissue

Other examples:

  • AST/ALT → liver damage
  • Creatine kinase → muscle damage
  • Amylase/Lipase → pancreatic damage
🎤
Oral Question Discovery vs Targeted
Hard
Compare Quantitative (Discovery) Proteomics and Targeted Proteomics.
✓ Model Answer
Feature           | Quantitative (Discovery)       | Targeted
Goal              | Comprehensive proteome view    | Measure specific proteins
Proteins measured | 2,000-6,000                    | 10-100
Selection         | Untargeted (find what's there) | Pre-selected before analysis
Sensitivity       | Lower                          | Higher
Accuracy          | Lower                          | Higher
Methods           | SILAC, iTRAQ, Label-free       | MRM, SRM, PRM
Use               | Find candidates                | Validate candidates

The logical workflow:

  1. Step 1 (Discovery): Use quantitative proteomics to explore the landscape and find potential biomarker candidates
  2. Step 2 (Validation): Use targeted proteomics to zoom in on specific candidates with high sensitivity to confirm clinical relevance

3. Label-Based vs Label-Free Strategies

🎤
Oral Question Strategies Overview
Hard
Compare Label-Free and Label-Based approaches in quantitative proteomics.
✓ Model Answer

Label-Free Approach:

  • Direct analysis without external tags
  • Less expensive and less invasive
  • Samples analyzed separately in parallel workflows
  • Used for initial screening or natural samples
  • May be less accurate with complex samples
  • Methods: Spectral counting, AUC/XIC, MRM, SWATH

Label-Based Approach:

  • Uses tracer/label to monitor proteins
  • Labels have high signal-to-mass ratio
  • Samples can be mixed and analyzed together
  • Label identifies origin of each protein
  • More accurate for relative quantification

When labeling occurs:

Stage                 | Method           | Type
In vivo (metabolic)   | SILAC, SILAM     | Living cells
After lysis (spiking) | AQUA, QconCAT    | Isolated proteins
During digestion      | ¹⁸O Labeling     | Enzymatic
Before HPLC           | iTRAQ, TMT, ICAT | Chemical

4. SILAC (Stable Isotope Labeling by Amino Acids in Cell Culture)

Practice Set: SILAC
Question 1 Principle
Hard
Explain the principle of SILAC. Why are Arginine and Lysine typically used?
✓ Model Answer

SILAC = Stable Isotope Labeling by Amino Acids in Cell Culture

An in vivo metabolic labeling technique for quantitative proteomics.

Core principle:

  • Uses stable isotopes (¹³C, ¹⁵N) — NOT radioactive
  • Same chemical-physical properties as natural isotopes
  • Isotopes incorporated into "heavy" amino acids
  • Cells incorporate labeled amino acids during translation
  • Label encoded directly into the proteome

Why Arginine and Lysine?

  1. Essential/semi-essential: Cells must obtain them from media
  2. Trypsin cleavage sites: Trypsin cleaves after K and R
  3. Every tryptic peptide (except C-terminal) contains at least one K or R
  4. Ensures all peptides are labeled

Also used: Leucine (present in ~70% of tryptic peptides)

Question 2 Workflow
Hard
Describe the SILAC workflow step by step.
✓ Model Answer
  1. Cell Cultures:
    • Two populations grown separately
    • One in "light" medium (normal amino acids)
    • One in "heavy" medium (¹³C/¹⁵N-labeled amino acids)
  2. Protein Integration:
    • Cells incorporate amino acids during translation
    • Multiple cell divisions for complete labeling
  3. Treatment:
    • Apply experimental condition (e.g., drug, stimulus)
  4. Harvest & Mixing:
    • Samples mixed early (at cell level)
    • Minimizes experimental error
  5. Lysis & Separation:
    • Cells lysed, proteins separated (SDS-PAGE or 2D-PAGE)
  6. Digestion:
    • Trypsin digestion → peptides
  7. MS Analysis:
    • Light and heavy peptides co-elute from LC
    • Two peak families in spectrum
    • Ratio of peak intensities = relative abundance
Question 3 Spectrum Interpretation
Hard
How do you interpret a SILAC MS spectrum? Calculate the m/z shift for a peptide with ¹³C₆-Lysine at +2 charge.
✓ Model Answer

SILAC spectrum interpretation:

  • Two families of peaks: "light" and "heavy"
  • Heavy peaks shifted to the right (higher m/z)
  • Peak intensity ratio = relative protein abundance

Calculation example:

  • ¹³C₆-Lysine adds 6 Da mass difference
  • With +2 charge state:
  • m/z shift = Mass difference ÷ Charge
  • m/z shift = 6 ÷ 2 = 3 m/z units

General formula:

Δm/z = ΔMass / z

Quantification:

  • Compare peak heights or areas
  • Heavy/Light ratio indicates fold change
  • SILAC provides relative (not absolute) quantification
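The shift and ratio calculations above fit in a few lines (the peak intensities in the usage lines are invented):

```python
def silac_mz_shift(delta_mass_da: float, charge: int) -> float:
    """Spacing between heavy and light peaks: Δ(m/z) = ΔMass / z."""
    return delta_mass_da / charge

def heavy_light_ratio(heavy_intensity: float, light_intensity: float) -> float:
    """Relative abundance (fold change) from peak intensities."""
    return heavy_intensity / light_intensity

print(silac_mz_shift(6.0, 2))           # 3.0 — ¹³C₆-Lys peptide at +2
print(heavy_light_ratio(3.0e6, 1.5e6))  # 2.0 — protein doubled after treatment
```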
Question 4 Limitations
Hard
What are the limitations of SILAC? Which samples cannot be analyzed?
✓ Model Answer

SILAC Limitations:

  1. Requires living cells:
    • Cells must grow in culture
    • Must incorporate labeled amino acids
  2. Time-consuming:
    • Multiple cell divisions needed for complete labeling
    • Typically 5-6 doublings
  3. Limited multiplexing:
    • Maximum 2-3 samples (light, medium, heavy)
  4. Arginine-to-Proline conversion:
    • Some cells convert Arg to Pro
    • Can cause labeling artifacts

Samples that CANNOT be used:

  • Cell-free biological fluids:
    • Plasma/serum
    • Urine
    • Saliva
    • CSF
  • Reason: No living cells to incorporate labels!

Samples that CAN be used:

  • Cell lines
  • Blood-derived leukocytes (if cultured)
  • Biopsy-obtained cancer cells (if cultured)
Question 5 Advantages
Medium
What are the advantages of SILAC compared to other methods?
✓ Model Answer

SILAC Advantages:

  1. Early mixing:
    • Samples mixed at cell level (earliest possible point)
    • Minimizes experimental error during sample preparation
  2. Complete labeling:
    • Nearly 100% incorporation after sufficient doublings
  3. No chemical modification:
    • Label is natural amino acid (just different isotope)
    • No affinity purification needed
  4. High proteome coverage:
    • ~70% of peptides contain Leucine
    • All tryptic peptides contain K or R
  5. Accurate quantification:
    • Light and heavy peptides co-elute
    • Analyzed simultaneously = same ionization conditions

5. ICAT (Isotope-Coded Affinity Tag)

Practice Set: ICAT
Question 1 Reagent Structure
Hard
Describe the structure of the ICAT reagent. What are its three functional components?
✓ Model Answer

ICAT = Isotope-Coded Affinity Tag

An in vitro chemical labeling technique targeting Cysteine residues.

Three functional components:

  1. Reactive Group (Iodoacetamide):
    • Specifically binds to cysteine thiol groups (-SH)
    • Highly specific reaction
  2. Isotope-Coded Linker (PEG):
    • Polyethylene glycol bridge
    • Light version: Normal hydrogen atoms
    • Heavy version: 8 hydrogens replaced with deuterium
    • Mass difference: 8 Da
  3. Biotin Tag:
    • Affinity tag for purification
    • Strong binding to streptavidin/avidin
    • Enables selective isolation of labeled peptides

Structure: [Iodoacetamide]—[PEG linker]—[Biotin]

Question 2 Workflow
Hard
Describe the ICAT methodology step by step.
✓ Model Answer
  1. Denaturation & Reduction:
    • Unfold proteins
    • Break disulfide bonds to expose cysteines
  2. Labeling:
    • Sample 1 → Light ICAT reagent
    • Sample 2 → Heavy ICAT reagent
    • Iodoacetamide reacts with Cys thiols
  3. Mixing & Digestion:
    • Combine labeled samples
    • Trypsin digestion → peptides
  4. Affinity Chromatography:
    • Add streptavidin-coated beads
    • Biotin-tagged peptides bind
    • Non-Cys peptides washed away
    • Reduces complexity!
  5. Nano-HPLC & MS:
    • Separate and analyze peptides
    • Light/Heavy peaks separated by 8 Da
  6. MS/MS:
    • Fragment for sequence identification
    • Database search (MASCOT)
Question 3 Advantages & Limitations
Hard
What are the advantages and disadvantages of ICAT?
✓ Model Answer

Advantages:

  • Reduced complexity: Only Cys-containing peptides selected → cleaner spectra
  • Accuracy: ~10% accuracy in relative quantification
  • Flexibility: Works on complex protein mixtures
  • Clinical samples: Can use tissues, biopsies, fluids (unlike SILAC)

Disadvantages:

  • Cysteine dependency:
    • Only ~25% of peptides contain Cys
    • Proteins without Cys cannot be identified!
  • Accessibility issues:
    • Some Cys buried in protein structure
    • Cannot be labeled
  • Limited multiplexing:
    • Only 2 samples (light vs heavy)
  • Cost: Expensive reagents
  • Yield concerns: Non-specific binding and incomplete labeling

6. SILAC vs ICAT Comparison

🎤
Oral Question Comparison
Hard
Compare SILAC and ICAT. When would you use each?
✓ Model Answer
Feature           | SILAC                           | ICAT
Type              | In vivo (metabolic)             | In vitro (chemical)
Target            | Lys, Arg (all tryptic peptides) | Cysteine only
Proteome coverage | ~70% (Leu-containing)           | ~25% (Cys-containing)
Sample mixing     | Very early (cells)              | After labeling
Multiplexing      | 2-3 samples                     | 2 samples
Sample type       | Living cells only               | Any protein mixture
Clinical samples  | ❌ Cannot use fluids            | ✅ Can use biopsies/fluids
Complexity        | Full (many peptides)            | Reduced (Cys-only)

When to use SILAC:

  • Cell culture experiments
  • Need high proteome coverage
  • Can afford time for labeling

When to use ICAT:

  • Clinical samples (plasma, tissue)
  • Complex mixtures needing simplification
  • Cannot grow cells in culture

7. iTRAQ (Isobaric Tags for Relative and Absolute Quantitation)

Practice Set: iTRAQ
Question 1 Isobaric Principle
Hard
What does "isobaric" mean in iTRAQ? How does this affect the MS spectrum?
✓ Model Answer

Isobaric = Same total mass

All iTRAQ reagents have identical total mass (e.g., 145 Da for 4-plex).

Why this matters:

  • Identical peptides from different samples appear as ONE peak in MS1
  • Keeps spectrum simple and clean
  • No peak splitting like in SILAC

How it works:

  • Different isotope distribution within the reagent
  • Reporter group + Balance group = constant mass
  • When reporter is heavier → balancer is lighter

Example (4-plex):

Reagent | Reporter | Balance | Total
1       | 114 Da   | 31 Da   | 145 Da
2       | 115 Da   | 30 Da   | 145 Da
3       | 116 Da   | 29 Da   | 145 Da
4       | 117 Da   | 28 Da   | 145 Da
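The reporter/balance pairing is easy to verify: every tag must sum to the same total mass. A sketch using the 4-plex values above:

```python
# iTRAQ 4-plex: reporter mass (Da) → balance mass (Da)
TAGS = {114: 31, 115: 30, 116: 29, 117: 28}

def is_isobaric(tags: dict[int, int]) -> bool:
    """True if every reporter+balance pair sums to one constant total mass."""
    return len({reporter + balance for reporter, balance in tags.items()}) == 1

print(is_isobaric(TAGS))  # True — all four labels total 145 Da
```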
Question 2 Reagent Structure
Hard
Describe the structure of the iTRAQ reagent. What does each part do?
✓ Model Answer

iTRAQ reagent has three parts:

  1. Reporter Group:
    • Unique "ID" for each sample
    • 4-plex: 114, 115, 116, 117 Da
    • 8-plex: 113-121 Da (120 is skipped)
    • Released during MS/MS fragmentation
    • Used for quantification!
  2. Balance Group:
    • Compensates for reporter mass
    • Ensures total mass is constant
    • Lost during fragmentation
  3. Reactive Group (NHS ester):
    • Binds to N-terminus and Lysine side chains
    • Labels all peptides (not just Cys like ICAT)

Structure: [Reporter]—[Balance]—[NHS-ester]

3
Question 3 Workflow & Quantification
Hard
Describe the iTRAQ workflow. At which MS stage does quantification occur?
✓ Model Answer

iTRAQ Workflow:

  1. Extraction & Preparation: Purify, denature, reduce proteins
  2. Digestion: Trypsin → peptides BEFORE labeling
  3. Labeling: Each sample labeled with specific iTRAQ reagent
  4. Pooling: Combine all labeled samples into one
  5. HPLC Separation: Treat as single sample
  6. MS1: Single peak per peptide (isobaric!)
  7. MS/MS (CID): Fragmentation breaks Reporter-Balance bond
  8. Reporter ions released: 114-117 region shows intensities

Quantification occurs at MS/MS (MS2) level!

| Method | Quantification stage |
|---|---|
| SILAC | MS1 (peak ratios) |
| ICAT | MS1 (peak ratios) |
| iTRAQ | MS2 (reporter ions) |
💡 Key distinction: iTRAQ quantifies at MS/MS level (reporter ions), while SILAC/ICAT quantify at MS1 level (peak ratios).
4
Question 4 Ratio Compression
Hard
What is the Ratio Compression Effect in iTRAQ/TMT? What causes it?
✓ Model Answer

Ratio Compression Effect: Measured differences in protein abundance appear smaller than actual biological values, compressing observed ratios toward 1:1.

Cause: Co-Isolation Challenge

  • During MS2, mass spectrometer isolates precursor ion for fragmentation
  • Peptides with similar m/z that co-elute are co-isolated
  • These "contaminating" peptides also fragment
  • Their reporter ions merge with target signal
  • Background peptides at different concentrations → dilute the true signal
  • Result: Systematic underestimation of fold-change

Mitigation strategies:

  1. Better chromatography: Reduce co-elution
  2. MS3 analysis: Additional fragmentation stage (gold standard)
  3. Narrower isolation windows: Reduce co-isolated species
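A minimal numerical sketch of how co-isolation compresses a measured ratio (all signal values are invented for illustration):

```python
def measured_ratio(true_a, true_b, bg_a, bg_b):
    """Reporter-ion ratio after co-isolated background merges with the target.

    true_a/true_b: reporter signal from the target peptide in samples A and B.
    bg_a/bg_b: reporter signal from co-isolated background peptides.
    """
    return (true_a + bg_a) / (true_b + bg_b)

true_fold = 4.0  # real A:B abundance difference
observed = measured_ratio(true_a=4.0, true_b=1.0, bg_a=2.0, bg_b=2.0)
# Background at ~1:1 pulls the observed ratio toward 1: (4+2)/(1+2) = 2.0 < 4.0
```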
5
Question 5 Advantages & Limitations
Medium
What are the advantages and limitations of iTRAQ?
✓ Model Answer

Advantages:

  • High multiplexing: Up to 8 samples (4-plex or 8-plex)
  • Statistical power: More samples = better p-values, less noise
  • Clean MS1 spectra: Isobaric tags → single peaks
  • High coverage: Labels N-terminus + Lys (most peptides)
  • Relative & absolute: Can include standards

Limitations:

  • Ratio compression: Background interference underestimates differences
  • Expensive reagents: High cost compared to label-free
  • High sample concentration needed
  • Complex preparation: Risk of sample loss, incomplete labeling
  • Sophisticated software needed: ProQuant, etc.

8. Method Comparison: SILAC vs ICAT vs iTRAQ

🎤
Oral Question Three-Way Comparison
Hard
Compare SILAC, ICAT, and iTRAQ. Create a comprehensive comparison table.
✓ Model Answer
| Feature | SILAC | ICAT | iTRAQ |
|---|---|---|---|
| Type | In vivo (metabolic) | In vitro (chemical) | In vitro (chemical) |
| Labeling stage | Cell culture | After lysis | After digestion |
| Target | Lys, Arg, Leu | Cysteine only | N-terminus + Lys |
| Multiplexing | 2-3 samples | 2 samples | 4-8 samples |
| Quantification | MS1 | MS1 | MS2 |
| Coverage | High (~70%) | Low (~25%) | Very high |
| Sample type | Cells only | Any mixture | Any mixture |
| Clinical samples | ❌ No | ✅ Yes | ✅ Yes |
| Main limitation | Needs living cells | Cys dependency | Ratio compression |
💡 Summary: SILAC = best accuracy (early mixing), ICAT = reduces complexity, iTRAQ = highest multiplexing.

9. Label-Free Quantification

Practice Set: Label-Free Methods
1
Question 1 Principle
Medium
What is Label-Free Quantification (LFQ)? How does it differ from label-based methods?
✓ Model Answer

Label-Free Quantification: Quantitative proteomics without isotope labels or chemical tags.

Key characteristics:

  • Direct comparison of individual LC-MS/MS runs
  • No expensive reagents needed
  • Samples never mixed — analyzed separately
  • Requires strict experimental standardization

Comparison to label-based:

| Feature | Label-based | Label-free |
|---|---|---|
| Sample mixing | Combined before MS | Analyzed separately |
| Cost | Higher (reagents) | Lower |
| Multiplexing | Limited by reagents | Unlimited samples |
| Variability | Lower (same run) | Higher (run-to-run) |
| Complexity | Sample prep | Data analysis |
2
Question 2 Two Methods
Hard
Describe the two main Label-Free quantification methods: Spectral Counting and Precursor Intensity (AUC).
✓ Model Answer

1. Spectral Counting:

  • Principle: More protein → more peptides → more MS/MS spectra
  • Data level: MS2
  • Measures: Number of spectra, unique peptides, sequence coverage
  • Advantages: Easy to implement, no special algorithms
  • Best for: High-abundance proteins

2. Precursor Signal Intensity (AUC):

  • Principle: Measure Area Under the Curve of chromatographic peaks
  • Data level: MS1
  • Measures: Peak intensity/height as peptides elute
  • Advantages: More accurate for subtle changes
  • Best for: Low-abundance proteins
| Feature | Spectral counting | AUC |
|---|---|---|
| Data level | MS2 | MS1 |
| Complexity | Low | High (needs alignment) |
| Sensitivity | Better for abundant | Better for low-abundance |
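The two readouts can be sketched on toy data (the chromatogram points and scan IDs below are invented for illustration):

```python
# Toy LC-MS data for one peptide: (retention time, MS1 intensity) pairs,
# plus the MS/MS spectra matched to the parent protein.
chromatogram = [(10.0, 0.0), (10.5, 4.0), (11.0, 9.0), (11.5, 5.0), (12.0, 0.0)]
msms_spectra = ["scan_101", "scan_154", "scan_200"]  # hypothetical scan IDs

# Spectral counting (MS2 level): abundance proxy = number of matched spectra.
spectral_count = len(msms_spectra)

# Precursor intensity (MS1 level): area under the chromatographic peak,
# here via the trapezoidal rule.
def auc(points):
    area = 0.0
    for (t0, i0), (t1, i1) in zip(points, points[1:]):
        area += (t1 - t0) * (i0 + i1) / 2.0
    return area

peak_area = auc(chromatogram)
```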
3
Question 3 Challenges
Hard
What are the main technical challenges of Label-Free quantification?
✓ Model Answer

Technical challenges:

  1. Experimental Drift:
    • Fluctuations in retention time (RT) between runs
    • m/z drift over time
    • Hard to align same peptide across samples
    • Solution: Alignment algorithms that "stretch/shrink" chromatograms
  2. Run-to-Run Variability:
    • Even identical samples show intensity differences
    • ESI efficiency fluctuations
    • Column performance variation
    • Solution: Internal standards, global normalization
  3. Data Complexity:
    • Massive data volume from separate runs
    • Requires sophisticated bioinformatics pipelines
    • Automated alignment, normalization, statistics
  4. No internal standard:
    • Unlike labeled methods, no built-in reference
💡 Key requirement: Extremely reproducible chromatography and careful normalization are essential.
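Global normalization from point 2 can be sketched as median scaling between runs (a minimal sketch with invented intensities, not a full LFQ pipeline):

```python
# Global median normalization: scale each run so run medians agree,
# correcting for run-to-run intensity drift.
from statistics import median

runs = {
    "run1": [100.0, 200.0, 400.0],
    "run2": [150.0, 300.0, 600.0],  # same sample, ~1.5x brighter overall
}

ref = median(runs["run1"])  # pick one run as the reference

normalized = {
    name: [x * (ref / median(vals)) for x in vals]
    for name, vals in runs.items()
}
# After scaling, both runs share the same median and become comparable.
```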
4
Question 4 Advantages
Medium
What are the advantages of Label-Free over label-based methods?
✓ Model Answer

Label-Free Advantages:

  • Cost-effective: No expensive reagents
  • Simple sample prep: No labeling steps
  • Unlimited multiplexing: Compare any number of samples
  • Works with any sample: Tissues, fluids, cells
  • Lower sample amount: No sample loss during labeling
  • Dynamic range: Can detect wider range of changes
  • No ratio compression: Unlike iTRAQ

Best applications:

  • Large-scale studies (many samples)
  • Clinical cohorts
  • When sample is limited
  • Initial screening studies

10. Quick Review Questions

Test yourself with these rapid-fire questions:

SILAC is an ❓ vivo or in vitro method? In vivo (metabolic labeling)

iTRAQ can compare up to ❓ samples simultaneously 8 samples (8-plex)

ICAT specifically targets ❓ amino acid Cysteine

iTRAQ quantification occurs at ❓ level MS/MS (MS2) level — reporter ions

SILAC quantification occurs at ❓ level MS1 level — peak ratios

"Isobaric" means Same total mass

SILAC cannot be used on Cell-free fluids (plasma, urine, saliva) — no living cells

The ICAT mass difference between light and heavy is ❓ Da 8 Da (8 deuteriums)

Ratio compression in iTRAQ is caused by Co-isolation of background peptides during MS2

Spectral counting uses ❓ data level MS2 (number of spectra)

AUC (Area Under Curve) uses ❓ data level MS1 (peak intensity)

In plasma, only ❓ proteins constitute ~90% of the mass 20 proteins (albumin, immunoglobulins)

Cardiac troponin is an example of a ❓ protein Leakage protein (biomarker for heart damage)

ABP (Athlete Biological Passport) uses ❓ profiling Longitudinal profiling (individual over time)

Discovery proteomics measures ❓ proteins, targeted measures ❓ 2,000-6,000 proteins (discovery) vs 10-100 proteins (targeted)

ICAT biotin tag binds to ❓ for affinity purification Streptavidin/avidin beads

Label-free main challenge is Run-to-run variability / alignment between runs

iTRAQ reporter ions appear in the ❓ region of MS/MS spectrum Low-mass region (114-117 for 4-plex)

Interactomics - Oral Questions

A comprehensive collection of oral exam questions covering protein-protein interactions, interactomics methods, and advanced techniques.


1. Introduction to Interactomics

🎤
Oral Question Definition
Medium
What is Interactomics? Why is studying protein-protein interactions important?
✓ Model Answer

Interactomics: The study of protein-protein interactions (PPIs) and the networks they form within biological systems.

Why PPIs are important:

  • Functional Insight: Essential for understanding how proteins function within cells
  • Pathology: Gene mutations can disrupt protein interactions — a primary driver of disease
  • Drug Discovery: New drug treatments rely heavily on protein function analysis
  • Discovery: Unknown proteins can be discovered by identifying their partners in known pathways

Scale of the problem:

  • ~2-4 million proteins per cubic micron in cells
  • Number of possible interactions is enormous
  • PPIs are intrinsic to virtually every cellular process: cell growth, cell cycle, metabolism, signal transduction

Key challenges:

  1. Identifying which proteins interact in the crowded intracellular environment
  2. Mapping specific residues that participate in interactions
🎤
Oral Question Bait and Prey
Easy
Explain the "Bait and Prey" model in interactomics.
✓ Model Answer

The Bait and Prey model is the fundamental principle underlying all PPI methods:

Bait (X):

  • The protein of interest
  • Used to "fish" for interacting partners
  • Usually tagged or labeled for detection

Prey (Y):

  • Proteins that interact with the bait
  • Can be known candidates or unknown proteins from a library

Types of interactions:

  • Binary: One bait + one prey
  • Complex: One bait + multiple preys simultaneously

The fundamental question: "Does X bind with protein Y?"


2. Classification of PPI Methods

🎤
Oral Question Method Classification
Hard
How are PPI experimental methods classified? Give examples of each category.
✓ Model Answer

A. Experimental Methods:

In Vitro Methods: (Purified proteins, controlled lab environment)

  • Co-Immunoprecipitation (Co-IP): Antibodies isolate protein complexes
  • GST-Pull Down: Tagged proteins capture binding partners
  • Protein Arrays: High-throughput screening on solid surface

In Vivo / Cellular Methods: (Living cells)

  • Yeast Two-Hybrid (Y2H): Classic genetic screen for binary interactions
  • Mammalian Two-Hybrid (M2H): Y2H in mammalian context
  • Phage Display: Connects proteins with encoding DNA
  • Proximity Labeling: BioID, TurboID, APEX

Imaging & Real-time:

  • FRET: Fluorescence Resonance Energy Transfer
  • BRET: Bioluminescence Resonance Energy Transfer

B. Computational Methods:

  • Genomic data: Phylogenetic profiles, gene fusion, correlated mutations
  • Protein structure: Residue frequencies, 3D distance matrices, surface patches

3. Co-Immunoprecipitation (Co-IP)

Practice Set: Co-IP
1
Question 1 Principle
Hard
Explain the principle of Co-Immunoprecipitation (Co-IP). Why is it considered a rigorous method?
✓ Model Answer

Co-IP: A technique to verify whether two or more proteins form a complex within a cell.

Principle:

  • Uses antibodies to isolate protein complexes from cell extracts
  • Antibody against "bait" captures bait + any bound "prey" proteins
  • If proteins interact, prey co-precipitates with bait

Why it's rigorous:

  • Physiological relevance: Uses whole cell extract
  • Proteins in native conformation
  • Contains natural cofactors and other cellular components
  • Confirms interactions in near-physiological conditions

Why use eukaryotic cells?

  • Enables post-translational modifications
  • PTMs often required for interactions
  • Reduces false negatives from missing modifications

Caveat: Coprecipitated proteins are assumed to be related to bait function — requires further verification.

2
Question 2 Workflow
Hard
Describe the Co-IP experimental workflow step by step.
✓ Model Answer
  1. Cell Lysis:
    • Lyse cells under non-denaturing conditions
    • Must maintain 3D protein structure
    • Denaturation would disrupt complexes and antibody recognition
  2. Antibody Addition:
    • Add antibody specific to the "bait" protein
    • Antibody captures bait + any bound prey
  3. Immobilization:
    • Antibody-antigen complex captured on Protein A or G Sepharose beads
    • These have high affinity for antibody Fc region
  4. Washing:
    • Stringency washes remove non-binding proteins
    • Must be optimized — too harsh may lose weak/transient interactions (false negatives)
  5. Elution & Dissociation:
    • Elute complex from beads
    • Dissociate using SDS sample buffer
  6. Evaluation:
    • SDS-PAGE separation
    • Western blotting with distinct antibodies for bait and prey
    • Include negative control (non-specific IgG)
3
Question 3 Limitations
Medium
What are the limitations of Co-IP?
✓ Model Answer

Co-IP Limitations:

  • Requires good antibody: Antibody must be specific and high-affinity
  • Cannot distinguish direct vs indirect: May capture whole complexes, not just direct interactors
  • May miss transient interactions: Weak or transient interactions lost during washing
  • Low throughput: Tests one bait at a time
  • Non-denaturing conditions required: Limits buffer choices
  • False positives: Non-specific binding to beads
  • Verification needed: Results require confirmation by other methods
4
Question 4 Controls
Medium
What controls should be included in a Co-IP experiment?
✓ Model Answer

Essential controls:

  • Negative control (IgG): Use non-specific IgG instead of specific antibody — ensures interaction is specific, not due to non-specific binding to beads
  • Input control: Sample of lysate before IP — confirms proteins are present
  • Beads-only control: Lysate + beads without antibody — tests non-specific bead binding

Detection controls:

  • Western blot for bait — confirms successful pulldown
  • Western blot for prey — verifies the interaction

4. GST-Pull Down Assay

🎤
Oral Question GST-Pull Down
Hard
Explain the GST-Pull Down assay. How does it differ from Co-IP?
✓ Model Answer

GST-Pull Down: An affinity purification method similar to Co-IP, but uses a recombinant tagged bait protein instead of an antibody.

Key difference from Co-IP:

| Feature | Co-IP | GST-Pull Down |
|---|---|---|
| Capture agent | Antibody | GST-tagged bait protein |
| Bait source | Endogenous | Recombinant (usually E. coli) |
| Requires antibody | Yes | No |

The GST Fusion System:

  • Bait protein fused to GST (glutathione-S-transferase) tag
  • Expressed in E. coli
  • GST increases solubility (acts as molecular chaperone)
  • GST binds strongly to glutathione-agarose beads

Workflow:

  1. Express GST-bait fusion in E. coli
  2. Immobilize on glutathione beads
  3. Incubate with cell extract (prey source)
  4. Wash away non-binders
  5. Elute with excess glutathione (competes for GST)
  6. Analyze by SDS-PAGE + Western blot

5. Protein Arrays

🎤
Oral Question Protein Arrays
Hard
What are Protein Microarrays? Describe the three main types.
✓ Model Answer

Protein Microarrays: Miniaturized bioanalytical devices with arrayed molecules on a surface for high-throughput analysis.

Three main types:

  1. Analytical Protein Arrays:
    • Immobilized capture agents (antibodies)
    • Detect proteins in solution (analyte)
    • Used for: Clinical diagnostics, biomarker discovery
  2. Functional Protein Arrays:
    • Proteins of interest are immobilized
    • Capture interacting molecules from analyte
    • Used for: Mapping interactome, identifying protein complexes
  3. Reverse Phase Protein Arrays (RPPA):
    • Complex sample immobilized on surface
    • Specific probes detect target proteins within sample
    • Used for: Tissue lysate analysis, pathway profiling

General workflow:

  1. Array fabrication (design layout, select probes)
  2. Substrate selection & deposition (robotic printing)
  3. Immobilization (attach capture molecules)
  4. Interaction & detection (fluorescence or MS)
🎤
Oral Question Array Challenges
Medium
What are the main technical challenges of protein microarrays?
✓ Model Answer

Technical challenges:

  • Steric Hindrance:
    • Proteins are large and asymmetrical
    • Immobilization can mask active sites
    • Need site-specific orientation for accessibility
  • Low Yield:
    • Inefficient covalent attachment
    • Suboptimal surface density
    • Limits dynamic range
  • Non-specific Adsorption:
    • Proteins are "sticky"
    • Hydrophobic/electrostatic binding to substrate
    • Causes high background and false positives
  • Conformation Fragility & Denaturation:
    • Proteins are thermodynamically unstable (vs. DNA)
    • Sensitive to pH, temperature, dehydration
    • Loss of 3D structure = loss of activity

Artifacts: Dust particles, scratches, bleeding between spots can cause spurious signals.


6. Yeast Two-Hybrid (Y2H)

Practice Set: Yeast Two-Hybrid
1
Question 1 Principle
Hard
Explain the molecular basis of the Yeast Two-Hybrid system. What is a transcriptional activator?
✓ Model Answer

Y2H exploits the modularity of transcriptional activators (like GAL4).

Transcriptional activators have two separable domains:

  1. DNA Binding Domain (DBD):
    • Recognizes and binds specific DNA sequence near promoter
    • By itself, cannot activate transcription
    • Just indicates which gene to activate
  2. Activation Domain (AD):
    • Stimulates transcription by recruiting RNA Polymerase II
    • By itself, cannot bind DNA

The Y2H trick:

  • In nature, DBD and AD are part of one protein
  • In Y2H, they are expressed as separate fusion proteins
  • DBD fused to Bait (X)
  • AD fused to Prey (Y)
  • If X and Y interact → DBD and AD brought together → transcription activated → reporter gene expressed
2
Question 2 Workflow
Hard
Describe the Y2H experimental workflow. How is interaction detected?
✓ Model Answer

Step 1: Construct Fusion Proteins

  • Bait (DBD-X): Gene X inserted next to DBD (e.g., GAL4 BD)
  • Prey (AD-Y): Gene Y inserted next to AD (e.g., GAL4 AD, VP16)

Step 2: Transfection & Selection

  • Transform yeast with both plasmids
  • Selection based on metabolic genes:
    • Bait plasmid: TRP1 (growth without tryptophan)
    • Prey plasmid: LEU2 (growth without leucine)
  • Only double-transformants survive on -Trp/-Leu plates

Step 3: Detection of Interaction

  • If X and Y interact → functional transcription factor reconstituted
  • Reporter gene expressed:
    • GFP: Green fluorescence
    • lacZ: β-galactosidase → blue color with X-gal
    • HIS3: Growth on histidine-lacking media
3
Question 3 Limitations
Hard
What are the main limitations of the Yeast Two-Hybrid system?
✓ Model Answer

Y2H Limitations:

  1. Nuclear Localization:
    • Interaction must occur in nucleus to trigger reporter
    • Membrane-bound or strictly cytoplasmic proteins difficult to study
  2. Post-Translational Modifications:
    • Yeast may lack mammalian PTM enzymes
    • Missing phosphorylation/glycosylation → false negatives
  3. Non-native Context:
    • Yeast is simple unicellular organism
    • Cannot fully mimic mammalian cell environment
  4. Steric Hindrance:
    • Large DBD/AD domains may block interaction site
  5. False Positives:
    • Some proteins activate transcription on their own
    • "Sticky" proteins bind non-specifically
💡 Y2H is often the first method used, but results must be confirmed by other techniques in more native contexts.
4
Question 4 Mammalian 2H
Medium
Why would you use a Mammalian Two-Hybrid system instead of Y2H?
✓ Model Answer

Reasons to use Mammalian Two-Hybrid (M2H):

  1. Authentic PTMs: Glycosylation, phosphorylation, acylation present
  2. Native localization: Correct organelles and trafficking pathways
  3. Efficiency: Results in ~48 hours vs. 3-4 days for yeast
  4. Physiological context: Mimics human cell environment

M2H uses three plasmids:

  1. Bait Vector (DBD-X)
  2. Prey Vector (AD-Y) — often VP16 AD
  3. Reporter Vector (multiple DBD binding sites + TATA box + reporter)

Common reporters:

  • Firefly Luciferase: Luminescent, very sensitive
  • SEAP: Secreted, non-invasive (sample media without lysis)
  • β-Galactosidase: Colorimetric (X-gal → blue)

Use case: M2H is used to validate interactions found in Y2H, not for primary library screening.


7. Phage Display

🎤
Oral Question Phage Display Principle
Hard
What is Phage Display? Explain the fundamental principle and the "Biopanning" process.
✓ Model Answer

Phage Display: A technique where peptides/proteins are displayed on bacteriophage surfaces, creating a physical link between phenotype and genotype.

Fundamental principle:

  • Foreign DNA fused to phage coat protein gene
  • When phage replicates, fusion protein displayed on surface
  • DNA encoding it is packaged inside
  • Phenotype (displayed protein) linked to genotype (internal DNA)

Is it in vitro or in vivo?

  • Production: In vivo (in E. coli)
  • Selection: In vitro (on plates/beads)
  • Application: In vivo (therapeutic use)
  • Acts as a "bridge" technique

Biopanning (Selection Process):

  1. Binding: Phage library exposed to immobilized target
  2. Wash: Non-binders removed (acid/urea/competing ligand)
  3. Amplification: Bound phages re-infect E. coli and multiply
  4. Iteration: Repeat 3-4 cycles to enrich strong binders
  5. Sequencing: Identify common motifs in winners
🎤
Oral Question Phage Display Limitations
Hard
What are the main limitations of Phage Display?
✓ Model Answer

Main limitations:

  1. Prokaryotic Expression System:
    • No post-translational modifications (no glycosylation, phosphorylation)
    • May not fold mammalian proteins correctly
    • Codon bias issues
  2. Size Constraints:
    • Large protein inserts may disrupt folding or phage assembly
  3. Selection Bias:
    • Some peptides toxic to bacteria → lost from library
  4. Stringency Risks:
    • First wash too harsh → lose high-affinity candidates
  5. In Vivo Translation:
    • Peptide that works in lab may fail in living body (pH, interference)
  6. Misfolding:
    • Complex proteins may not adopt correct 3D structure on phage surface
💡 Key limitation to remember: Prokaryotic expression = no eukaryotic PTMs and potential protein misfolding.

8. Proximity Labeling (BioID, APEX, TurboID)

🎤
Oral Question Proximity Labeling
Hard
Explain Proximity Labeling. Compare BioID, APEX, and TurboID.
✓ Model Answer

Proximity Labeling: An in vivo method where an enzyme fused to bait labels all nearby proteins with biotin.

Core mechanism:

  1. Biotinylation: Enzyme activates biotin → reactive species tags neighbors within ~10-20 nm
  2. Capture: Biotin-streptavidin affinity captures tagged proteins
  3. Identification: MS identifies the "proteomic atlas" of bait's environment

Comparison:

| Feature | BioID | APEX | TurboID |
|---|---|---|---|
| Enzyme | Biotin ligase (BirA*) | Ascorbate peroxidase | Evolved biotin ligase |
| Substrate | Biotin + ATP | Biotin-phenol + H₂O₂ | Biotin + ATP |
| Labeling time | 18-24 hours (SLOW) | <1 minute (FAST) | 10 minutes (FAST) |
| Target AA | Lysine | Tyrosine | Lysine |
| Toxicity | Low | HIGH (H₂O₂) | Low |
| In vivo use | Excellent | Limited | Excellent |

TurboID is now the gold standard: combines non-toxic nature of BioID with speed of APEX.


9. FRET (Fluorescence Resonance Energy Transfer)

Practice Set: FRET & BRET
1
Question 1 RET Principle
Hard
Explain the principle of Resonance Energy Transfer (RET). What factors influence efficiency?
✓ Model Answer

RET (Resonance Energy Transfer): Energy transfer from an excited donor to an acceptor molecule through non-radiative dipole-dipole coupling (no photon emitted).

Three critical factors affecting efficiency:

  1. Distance (R):
    • Most important factor!
    • Efficiency ∝ 1/R⁶ (inverse sixth power)
    • Must be within 1-10 nm (10-100 Å)
  2. Relative Orientation:
    • Donor and acceptor dipoles must be approximately parallel
    • Perpendicular = zero transfer
  3. Spectral Overlap:
    • Donor emission spectrum must overlap with acceptor absorption spectrum

Two main types:

  • FRET: Donor is fluorescent (requires external light)
  • BRET: Donor is bioluminescent (no external light needed)
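The distance dependence above is usually written as the Förster equation, E = 1 / (1 + (R/R₀)⁶), where R₀ (the Förster radius, roughly 5 nm for common pairs) is the distance at which efficiency is 50%. A short sketch:

```python
# Förster equation: transfer efficiency falls off with the sixth power of
# distance. r0_nm is the Förster radius (50% efficiency point).
def fret_efficiency(r_nm, r0_nm=5.0):
    return 1.0 / (1.0 + (r_nm / r0_nm) ** 6)

e_at_r0 = fret_efficiency(5.0)   # exactly 0.5 at the Förster radius
e_close = fret_efficiency(2.5)   # well inside 1-10 nm: near-complete transfer
e_far = fret_efficiency(10.0)    # at 10 nm: efficiency drops below 2%
```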
2
Question 2 FRET Mechanism
Hard
Explain how FRET works. What are donor and acceptor molecules?
✓ Model Answer

FRET = Förster (or Fluorescence) Resonance Energy Transfer

How it works:

  1. Excitation: External light excites the donor fluorophore
  2. Energy Transfer: Instead of emitting light, donor transfers energy to acceptor via dipole-dipole coupling (non-radiative)
  3. Acceptor Emission: Acceptor becomes excited and emits light at its characteristic wavelength

Donor and Acceptor:

  • Donor: Fluorescent protein that absorbs excitation light (e.g., CFP - Cyan Fluorescent Protein)
  • Acceptor: Fluorescent protein that receives energy from donor (e.g., YFP - Yellow Fluorescent Protein)

Common FRET pairs:

  • CFP → YFP
  • BFP → GFP
  • GFP → mCherry/RFP

Measurable result:

  • Donor emission decreases (quenching)
  • Acceptor emission appears (sensitized emission)
3
Question 3 BRET
Hard
What is BRET? How does it differ from FRET?
✓ Model Answer

BRET = Bioluminescence Resonance Energy Transfer

Key difference: Donor is a bioluminescent enzyme (not a fluorophore).

| Feature | FRET | BRET |
|---|---|---|
| Donor | Fluorophore (e.g., CFP) | Luciferase enzyme (e.g., Rluc) |
| Excitation | External light source | Chemical substrate (no light needed) |
| Background | High (autofluorescence) | Low (no autofluorescence) |
| Photobleaching | Yes (donor degrades) | No |
| Phototoxicity | Risk of cell damage | No photodamage |

BRET advantages:

  • No external light → no autofluorescence background
  • No photobleaching → longer experiments
  • No phototoxicity → better cell viability
  • Higher signal-to-noise ratio

Common BRET donors: Renilla luciferase (Rluc), NanoLuc

4
Question 4 Signal Types
Medium
What types of signals can be obtained from FRET/BRET? What is a ratiometric measurement?
✓ Model Answer

Types of signals measured:

  1. Sensitized emission: Acceptor fluorescence upon donor excitation
  2. Donor quenching: Decrease in donor fluorescence intensity
  3. Donor lifetime: Decrease in fluorescence lifetime (FLIM-FRET)
  4. Acceptor photobleaching: Donor recovery after acceptor is bleached

Ratiometric measurement:

  • Calculate ratio of acceptor emission / donor emission
  • Why it's powerful: Self-normalizing!
  • Eliminates variability from: cell number, assay volume, detector fluctuations
  • Results reflect true molecular interactions, not experimental artifacts

BRET Ratio formula:

BRET ratio = [I₅₃₀ - (Cf × I₄₉₀)] / I₄₉₀

  • High ratio = strong interaction
  • Low ratio = proteins distant or not interacting
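The BRET ratio formula above can be implemented directly; the intensity values and correction factor below are invented for illustration:

```python
# BRET ratio = [I530 - (Cf * I490)] / I490, where Cf corrects for donor
# bleed-through into the acceptor (530 nm) channel.
def bret_ratio(i530, i490, cf):
    return (i530 - cf * i490) / i490

interacting = bret_ratio(i530=800.0, i490=1000.0, cf=0.3)      # (800-300)/1000
non_interacting = bret_ratio(i530=310.0, i490=1000.0, cf=0.3)  # (310-300)/1000
# High ratio -> strong interaction; ratio near zero -> proteins distant.
```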
5
Question 5 RET Limitations
Hard
What are the limitations of FRET and BRET?
✓ Model Answer

General limitations (both):

  • Steric hindrance: Large tags (GFP, Luciferase) may block interaction site
  • Artifactual behavior: Fusion may change protein conformation/localization
  • Overexpression artifacts: High concentrations can force non-physiological interactions

FRET-specific limitations:

  • Photobleaching: Donor degrades under continuous illumination
  • Autofluorescence: Endogenous molecules create background noise
  • Phototoxicity: Intense light can damage cells
  • Direct acceptor excitation: Can create false positives

BRET-specific limitations:

  • Substrate dependency: Requires exogenous substrate addition
  • Limited donor library: Fewer bioluminescent proteins available compared to fluorescent proteins
  • Lower signal intensity: Bioluminescence weaker than fluorescence

10. Advanced Techniques

🎤
Oral Question SRET
Hard
What is SRET? When would you use it instead of standard FRET/BRET?
✓ Model Answer

SRET = Sequential BRET-FRET

An advanced technique to monitor non-binary interactions (three or more proteins forming a complex).

The molecular components:

  1. Donor: Protein 1 fused to Renilla luciferase (Rluc)
  2. First Acceptor: Protein 2 fused to GFP/YFP
  3. Second Acceptor: Protein 3 fused to DsRed

Sequential energy transfer:

  1. BRET phase: Rluc → GFP (if proteins 1 & 2 are close)
  2. FRET phase: GFP → DsRed (if proteins 2 & 3 are close)
  3. Final emission: DsRed emits — confirms all three are together

Key advantage: Positive SRET signal is definitive proof that all three proteins are physically clustered at the same time.

Application: Studying GPCR oligomerization (homo- and hetero-oligomers) in drug discovery.

🎤
Oral Question PCA/NanoBiT
Hard
What are Protein-Fragment Complementation Assays (PCAs)? What is NanoBiT?
✓ Model Answer

PCA Principle:

  • Reporter protein (e.g., luciferase) split into two inactive fragments
  • Fragments fused to bait and prey proteins
  • If bait and prey interact → fragments brought together → reporter reconstituted → signal produced

Logic:

  • No interaction → fragments separated → no activity
  • Interaction → proximity → reassembly → activity restored

NanoBiT (NanoLuc Binary Technology):

  • Current gold standard PCA system
  • Large BiT (LgBiT): 18 kDa
  • Small BiT (SmBiT): 11 amino acids
  • Engineered with very weak intrinsic affinity
  • Only reassemble when "forced" together by bait-prey interaction

Advantages of NanoBiT:

  • High signal-to-noise ratio
  • Low background (no spontaneous assembly)
  • Works at physiological protein concentrations
  • Superior dynamic range vs. FRET
🎤
Oral Question Inteins
Medium
What are Inteins? What is their significance in protein engineering?
✓ Model Answer

Inteins = INternal proTEINS

Self-splicing protein segments that excise themselves from a precursor protein, leaving the flanking exteins joined together.

Terminology:

  • Intein: Gets removed (internal protein)
  • Extein: Flanking sequences that remain (external protein)
  • N-extein—[INTEIN]—C-extein → N-extein—C-extein + free intein

Mechanism (Protein Splicing):

  1. N-S or N-O acyl shift at N-terminus
  2. Transesterification
  3. Asparagine cyclization releases intein
  4. S-N or O-N acyl shift joins exteins with native peptide bond

Applications:

  • Self-cleaving affinity tags: Tag-free protein purification (no extra residues!)
  • Expressed Protein Ligation: Join two protein fragments with native bond
  • Protein cyclization: Create cyclic proteins
  • Conditional protein splicing: Control protein activity

11. Aptamers & SELEX

🎤
Oral Question Aptamers
Medium
What are Aptamers? How are they selected using SELEX?
✓ Model Answer

Aptamers: Single-stranded oligonucleotides (ssDNA or RNA) that fold into complex 3D structures and bind targets with high affinity.

How they bind:

  • Shape complementarity (not base pairing)
  • Non-covalent interactions: hydrogen bonding, van der Waals, aromatic stacking
  • Often called "chemical antibodies"

SELEX = Systematic Evolution of Ligands by EXponential Enrichment

  1. Create library: 10⁹-10¹¹ random sequences
  2. Incubation: Expose library to target
  3. Counter-selection: Remove cross-reactive sequences (expose to non-targets)
  4. Wash & Elute: Remove non-binders, recover high-affinity sequences
  5. Amplification: PCR enrichment of winners
  6. Iteration: Repeat 8-15 cycles

Applications: Drugs, therapeutics, diagnostics, bio-imaging, food inspection
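The exponential enrichment that gives SELEX its name can be sketched as a deterministic simulation (the pool composition and retention probabilities are invented; real selections are stochastic):

```python
# Each SELEX round retains sequences in proportion to binding affinity and
# re-amplifies the survivors, so strong binders grow exponentially in share.
def selex_round(pool, affinity):
    retained = {seq: freq * affinity[seq] for seq, freq in pool.items()}
    total = sum(retained.values())
    return {seq: v / total for seq, v in retained.items()}  # re-normalize (PCR)

affinity = {"strong": 0.9, "weak": 0.1}  # per-round retention probabilities
pool = {"strong": 0.001, "weak": 0.999}  # rare high-affinity binder at start

for _ in range(8):                       # 8 selection cycles
    pool = selex_round(pool, affinity)
# After several rounds the strong binder dominates the pool.
```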


12. Computational Approaches

🎤
Oral Question Computational Methods
Medium
What computational approaches are used to study protein-protein interactions?
✓ Model Answer

A. Experimental-based (validation):

  • X-ray crystallography
  • NMR spectroscopy
  • Cryo-EM

B. Computational based on Genomic Data:

  • Phylogenetic profiles: Proteins that co-evolve likely interact
  • Gene neighborhood: Genes close on chromosome often encode interacting proteins
  • Gene fusion: Proteins fused in one organism may interact in another
  • Correlated mutations: Co-evolving residues suggest contact

C. Based on Protein Primary Structure:

  • Residue frequencies and pairing preferences
  • Sequence profile and residue neighbor list

D. Based on Protein Tertiary Structure:

  • 3D structural distance matrix
  • Surface patches analysis
  • Direct electrostatic interactions
  • Van der Waals interactions
  • Docking simulations

13. Quick Review Questions

Test yourself with these rapid-fire questions:

Q: The "Bait" in PPI studies is ❓
A: The protein of interest used to "fish" for interacting partners

Q: Co-IP requires ❓ conditions
A: Non-denaturing (to preserve 3D structure and interactions)

Q: In GST-Pull Down, GST binds to ❓ beads
A: Glutathione-agarose beads

Q: Y2H requires the interaction to occur in the ❓
A: Nucleus (to trigger reporter transcription)

Q: The main limitation of phage display is ❓
A: Prokaryotic expression (no eukaryotic PTMs)

Q: FRET requires donor and acceptor within ❓ nm
A: 1-10 nm

Q: BRET donor is a ❓ enzyme
A: Bioluminescent enzyme (e.g., Luciferase)

Q: FRET donor is a ❓
A: Fluorophore (e.g., CFP)

Q: BRET advantage over FRET: no ❓
A: Photobleaching, autofluorescence, or phototoxicity

Q: TurboID labeling time is ❓
A: ~10 minutes (vs. 18-24 hours for BioID)

Q: APEX uses ❓, which causes toxicity
A: H₂O₂ (hydrogen peroxide)

Q: Inteins are used for ❓
A: Tag-free protein purification / protein ligation

Q: SELEX is used to select ❓
A: Aptamers (high-affinity oligonucleotides)

Q: M2H uses ❓ plasmids
A: Three plasmids (bait, prey, reporter)

Q: NanoBiT consists of ❓
A: Large BiT (18 kDa) + Small BiT (11 amino acids)

Q: Ratiometric measurement eliminates ❓
A: Variability from cell number, volume, and detector fluctuations

Q: SRET can study ❓ interactions
A: Non-binary (three or more proteins)

Q: Common FRET pair: ❓ → ❓
A: CFP → YFP (Cyan to Yellow)


Important Oral Questions (Core Exam Questions)

A focused collection of high-priority oral exam questions covering the most frequently tested topics. Master these before your exam!


⭐ High-Priority Topics

These questions cover concepts that are essential for oral exams. Pay special attention to understanding the reasoning behind experimental choices and the ability to compare techniques.


1. Experimental Design & Model Selection

Core Question Experimental Design
Hard
When reading a proteomics paper: Why did the researchers choose a particular cell line or model system? What factors influence this choice?
✓ Model Answer

Model/cell line selection depends on several factors:

Biological Relevance:

  • Does the model accurately represent the disease/condition being studied?
  • Does it express the proteins of interest?
  • Is it from the relevant tissue type?

Technical Considerations:

  • For SILAC: Cells must be able to grow in culture and incorporate labeled amino acids
  • Protein yield: Sufficient protein for analysis
  • Reproducibility: Well-characterized, stable cell lines preferred
  • Availability: Commercially available vs. primary cells

Common choices:

  • HeLa cells: Easy to culture, well-characterized
  • HEK293: High transfection efficiency
  • Primary cells: More physiologically relevant but harder to work with
  • Patient-derived cells: Most relevant for translational studies
💡 Key insight: Always be prepared to justify WHY a specific model was chosen — this shows critical thinking about experimental design.
Core Question Sample Strategy
Hard
What is the difference between pooled samples and single/individual samples in proteomics? When would you use each approach?
✓ Model Answer

Pooled Samples:

  • Multiple individual samples combined into one
  • Represents an "average" of the group
  • Advantages:
    • Reduces individual biological variation
    • Increases protein amount for analysis
    • Reduces number of MS runs needed
    • Cost-effective for initial screening
  • Disadvantages:
    • Loses individual variation information
    • Cannot identify outliers
    • Cannot perform statistical analysis on individuals

Single/Individual Samples:

  • Each sample analyzed separately
  • Advantages:
    • Captures biological variability
    • Enables proper statistical analysis
    • Can identify individual responders/non-responders
    • Required for biomarker validation
  • Disadvantages:
    • More expensive (more MS runs)
    • More time-consuming
    • May have limited sample amount per individual
💡 Best practice: Use pooled samples for discovery phase, then validate with individual samples. For clinical studies, individual samples are essential.

2. ESI (Electrospray Ionization)

Core Question ESI Mechanism
Hard
Explain ESI (Electrospray Ionization) and how it works. What types of ions are usually formed?
✓ Model Answer

ESI Mechanism (step-by-step):

  1. Spray Formation: Sample solution is pumped through a capillary needle at high voltage (2-5 kV)
  2. Taylor Cone: Electric field causes liquid to form a cone shape at the needle tip
  3. Droplet Formation: Fine charged droplets are sprayed from the cone tip
  4. Desolvation: Warm nitrogen gas assists solvent evaporation; droplets shrink
  5. Coulombic Explosion: As droplets shrink, charge density increases until Rayleigh limit is reached → droplets explode into smaller droplets
  6. Ion Release: Process repeats until fully desolvated, multiply charged ions are released

Types of ions formed:

  • MULTIPLY CHARGED ions — this is the key characteristic!
  • Positive mode: [M+nH]ⁿ⁺ (e.g., [M+2H]²⁺, [M+3H]³⁺)
  • Negative mode: [M-nH]ⁿ⁻
  • Creates a charge envelope (Gaussian distribution of charge states)

Why multiple charges matter:

  • m/z = mass / charge
  • Multiple charges reduce m/z values
  • Allows large proteins (>100 kDa) to be analyzed within typical mass analyzer range
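The m/z arithmetic above also lets you deconvolve the charge envelope: two adjacent charge states of the same molecule pin down both z and the neutral mass. A minimal sketch using the proton mass (1.00728 Da); the 10 kDa value is illustrative:

```python
PROTON = 1.00728  # proton mass, Da

def mz(mass, z):
    """m/z of a [M+zH]z+ ion."""
    return (mass + z * PROTON) / z

def charge_from_adjacent_peaks(p_lower, p_higher):
    """Given two adjacent peaks of the same molecule (p_lower carries
    charge z+1, p_higher carries charge z), solve for z."""
    return (p_lower - PROTON) / (p_higher - p_lower)

M = 10_000.0                      # a 10 kDa protein (illustrative)
p10, p11 = mz(M, 10), mz(M, 11)   # adjacent charge states
z = round(charge_from_adjacent_peaks(p11, p10))
neutral_mass = z * (p10 - PROTON)  # recovers the 10,000 Da mass
```

This is exactly the calculation deconvolution software performs across the whole Gaussian envelope.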

Advantages of ESI:

  • Soft ionization (minimal fragmentation)
  • Directly compatible with LC (on-line coupling)
  • Very high sensitivity (attomole range)

Disadvantages:

  • Sensitive to salts and detergents (ion suppression)
  • Requires clean samples
  • More complex spectra due to multiple charge states

3. MALDI (Matrix-Assisted Laser Desorption/Ionization)

Core Question MALDI Complete
Hard
How does MALDI work? What types of ions are involved? What are the pros and cons? Are ions singly or multiply charged? Which analyzers are typically used?
✓ Model Answer

How MALDI works:

  1. Sample Preparation: Analyte mixed with organic matrix (e.g., α-CHCA, DHB, sinapinic acid)
  2. Crystallization: Mixture spotted on metal plate; solvent evaporates forming co-crystals
  3. Laser Irradiation: UV laser (337 nm nitrogen or 355 nm Nd:YAG) hits the crystals
  4. Matrix Absorption: Matrix absorbs photon energy, becomes electronically excited
  5. Desorption: Matrix undergoes "micro-explosion," ejecting analyte into gas phase
  6. Ionization: Proton transfer from matrix to analyte creates ions

Types of ions:

  • SINGLY CHARGED ions — key difference from ESI!
  • Positive mode: [M+H]⁺ (most common for peptides)
  • Negative mode: [M-H]⁻
  • Also: [M+Na]⁺, [M+K]⁺ (adducts)

Pros:

  • Simple spectra (singly charged = easy interpretation)
  • More tolerant to salts and contaminants than ESI
  • Very robust, high-throughput (~10⁴ samples/day)
  • Wide mass range (up to 500 kDa)
  • Easy to use

Cons:

  • Lower sensitivity than ESI (femtomole vs. attomole)
  • Not easily coupled to LC (off-line)
  • Matrix interference in low mass region
  • Shot-to-shot variability

Typical analyzers used with MALDI:

  • TOF (Time-of-Flight) — most common combination (MALDI-TOF)
  • TOF/TOF — for MS/MS analysis
  • Can also be coupled with: FT-ICR, Orbitrap
💡 Remember: MALDI = Singly charged, ESI = Multiply charged. This is a classic exam comparison!

4. SELDI vs MALDI

Core Question SELDI
Hard
What is SELDI? How does it compare to MALDI?
✓ Model Answer

SELDI (Surface-Enhanced Laser Desorption/Ionization):

A variation of MALDI where the target surface is chemically modified to selectively bind certain proteins.

Key difference from MALDI:

| Feature     | MALDI                  | SELDI                                |
|-------------|------------------------|--------------------------------------|
| Surface     | Inert metal plate      | Chemically modified (active) surface |
| Sample prep | Simple spotting        | Surface captures specific proteins   |
| Selectivity | None (all proteins)    | Surface-dependent selectivity        |
| Complexity  | Full sample complexity | Reduced (only bound proteins)        |
| Washing     | Not typical            | Unbound proteins washed away         |

SELDI Surface Types:

  • Chemical surfaces:
    • CM10: Weak cation exchange
    • Q10: Strong anion exchange
    • H50: Hydrophobic/reverse phase
    • IMAC30: Metal affinity (binds His, phosphoproteins)
  • Biological surfaces:
    • Antibody-coated
    • Receptor-coated
    • DNA/RNA-coated

SELDI Workflow:

  1. Spot sample on modified surface
  2. Specific proteins bind based on surface chemistry
  3. Wash away unbound proteins
  4. Apply matrix
  5. Analyze by laser desorption (same as MALDI)

SELDI Advantages:

  • Reduces sample complexity (acts as "on-chip purification")
  • Good for biomarker discovery/profiling
  • Requires minimal sample preparation

SELDI Limitations:

  • Lower resolution than standard MALDI
  • Limited protein identification (profiling only)
  • Reproducibility issues
  • Largely replaced by LC-MS approaches

5. Peptide Mass Fingerprinting (PMF)

Core Question PMF Complete
Hard
What is Peptide Mass Fingerprinting (PMF)? How are the proteins digested into fragments? What is the specificity of the enzyme used?
✓ Model Answer

PMF (Peptide Mass Fingerprinting):

A protein identification method where a protein is enzymatically digested into peptides, and the resulting peptide masses are compared to theoretical masses from database proteins.

PMF Workflow:

  1. Protein isolation: Usually from 2D gel spot
  2. Destaining: Remove Coomassie/silver stain
  3. Reduction & Alkylation: Break and block disulfide bonds
  4. Enzymatic digestion: Typically with trypsin
  5. Peptide extraction: From gel pieces
  6. MALDI-TOF analysis: Measure peptide masses
  7. Database search: Compare experimental masses to theoretical
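Step 7 boils down to counting experimental peptide masses that match a candidate protein's theoretical digest masses within a tolerance; a toy stand-in for real search-engine scoring (the masses below are illustrative):

```python
def pmf_match_count(experimental, theoretical, tol_ppm=50):
    """Count experimental peptide masses matching any theoretical
    digest mass within tol_ppm (toy version of PMF scoring)."""
    return sum(
        1 for m in experimental
        if any(abs(m - t) / t * 1e6 <= tol_ppm for t in theoretical)
    )

observed  = [842.509, 1045.562, 1479.791]    # measured peptide masses
candidate = [842.510, 1045.564, 2211.105]    # one protein's in-silico digest
hits = pmf_match_count(observed, candidate)  # 2 of 3 observed masses match
```

The candidate protein with the most (and statistically least likely by chance) matches wins the search.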

Digestion enzyme — TRYPSIN:

Specificity:

  • Cleaves at the C-terminal side of:
  • Lysine (K) and Arginine (R)
  • EXCEPT when followed by Proline (P)

Why trypsin is the gold standard:

  • High specificity: Predictable cleavage sites
  • Optimal peptide size: 6-20 amino acids (ideal for MS)
  • Basic residues at C-terminus: Promotes ionization in positive mode
  • Robust: Works well across pH 7-9
  • Reproducible: Produces consistent results
  • Self-digestion peaks: Can be used for internal calibration

Other enzymes sometimes used:

  • Chymotrypsin: Cleaves after Phe, Tyr, Trp
  • Glu-C: Cleaves after Glu (and Asp at high pH)
  • Lys-C: Cleaves after Lys only
  • Asp-N: Cleaves before Asp
💡 Exam tip: "Trypsin cleaves C-terminal to K and R, except before P" — memorize this!
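The cleavage rule is compact enough to express as a one-line in-silico digest (a sketch; missed cleavages are not modelled, and the input sequence is illustrative):

```python
import re

def tryptic_digest(seq):
    """Cut C-terminal to K or R, except when the next residue is P."""
    return [p for p in re.split(r'(?<=[KR])(?!P)', seq) if p]

peptides = tryptic_digest("MKWVTFISLLFLFSSAYSRGVFRRDAHKPSEK")
# -> ['MK', 'WVTFISLLFLFSSAYSR', 'GVFR', 'R', 'DAHKPSEK']
# Note: the K in "AHKP" is NOT cleaved because it precedes a Proline.
```

Requires Python ≥ 3.7, where `re.split` splits on zero-width matches.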

6. Bottom-Up vs Shotgun Proteomics

Core Question Approaches
Hard
What is the difference between Bottom-Up and Shotgun proteomics?
✓ Model Answer

Important clarification: Shotgun proteomics IS a type of bottom-up approach. The distinction is in the workflow:

| Feature            | Classical Bottom-Up (PMF)       | Shotgun (Bottom-Up)          |
|--------------------|---------------------------------|------------------------------|
| Protein separation | FIRST (2D-PAGE, then cut spots) | None or minimal              |
| Digestion          | Single isolated protein         | Entire protein mixture       |
| Peptide separation | Usually none                    | LC (often multi-dimensional) |
| MS analysis        | MALDI-TOF (PMF)                 | LC-MS/MS                     |
| Identification     | Mass matching                   | MS/MS sequencing             |
| Throughput         | One protein at a time           | Thousands of proteins        |

Classical Bottom-Up Workflow:

  1. Separate proteins by 2D-PAGE
  2. Cut out individual spots
  3. Digest each spot separately
  4. Analyze by MALDI-TOF
  5. PMF database search

Shotgun Workflow:

  1. Lyse cells, extract all proteins
  2. Digest entire mixture into peptides
  3. Separate peptides by LC (MudPIT uses 2D-LC)
  4. Analyze by MS/MS
  5. Database search with MS/MS spectra

Why "Shotgun"?

  • Like a shotgun blast — analyzes everything at once
  • No pre-selection of proteins
  • Relies on computational deconvolution
💡 Key distinction: Classical bottom-up separates proteins first, shotgun separates peptides (after digesting the whole mixture).

7. Gel Electrophoresis Limitations

Core Question 2D-PAGE Limitations
Hard
What are the limitations of 2D gel electrophoresis?
✓ Model Answer

Sample-Related Limitations:

  • Hydrophobic proteins: Membrane proteins poorly soluble in IEF buffers → underrepresented
  • Extreme pI proteins: Very acidic (<3) or basic (>10) proteins difficult to focus
  • Extreme MW proteins:
    • Large proteins (>200 kDa) don't enter gel well
    • Small proteins (<10-15 kDa) may run off the gel
  • Low-abundance proteins: Masked by high-abundance proteins; below detection limit
  • Dynamic range: Limited (~10⁴), much less than proteome range (~10⁶-10⁷)

Technical Limitations:

  • Poor reproducibility: Gel-to-gel variation requires running in triplicate
  • Labor-intensive: Manual, time-consuming, hard to automate
  • Low throughput: Cannot be easily scaled up
  • Co-migration: Proteins with similar pI/MW appear in same spot
  • Quantification limited: Staining is semi-quantitative at best

Analytical Limitations:

  • Proteome coverage gap: in yeast, ~6,000 genes → ~4,000 expressed proteins → only ~1,000 detected by 2DE
  • Requires MS for ID: 2DE is only separation; identification needs additional steps
  • PTM detection: May see multiple spots but hard to characterize modifications

Practical Issues:

  • Streaking/smearing from degradation
  • Background interference from staining
  • Keratin contamination common
💡 These limitations drove development of gel-free approaches like shotgun proteomics and MudPIT.

8. Hybrid Mass Spectrometry Systems

Core Question Hybrid MS
Hard
What is used in hybrid mass spectrometry systems? What are the limitations?
✓ Model Answer

Hybrid MS: Instruments combining two or more different mass analyzers to leverage their complementary strengths.

Common Hybrid Configurations:

| Hybrid Type       | Components                 | Strengths                                   |
|-------------------|----------------------------|---------------------------------------------|
| Q-TOF             | Quadrupole + TOF           | High resolution, accurate mass, good for ID |
| Triple Quad (QqQ) | Q1 + Collision cell + Q3   | Excellent for quantification (SRM/MRM)      |
| Q-Orbitrap        | Quadrupole + Orbitrap      | Very high resolution + sensitivity          |
| LTQ-Orbitrap      | Linear ion trap + Orbitrap | High speed + high resolution                |
| TOF-TOF           | TOF + Collision + TOF      | High-energy fragmentation with MALDI        |
| Q-Trap            | Quadrupole + Ion trap      | Versatile, MRM + scanning modes             |

How they work (Q-TOF example):

  1. Q1 (Quadrupole): Selects precursor ion of interest
  2. Collision cell: Fragments the selected ion (CID)
  3. TOF: Analyzes all fragments with high resolution and mass accuracy

Limitations of Hybrid Systems:

  • Cost: Very expensive instruments ($500K - $1M+)
  • Complexity: Requires expert operators
  • Maintenance: More components = more potential failures
  • Data complexity: Generates massive datasets
  • Duty cycle trade-offs: Can't optimize all parameters simultaneously
  • Ion transmission losses: Each analyzer stage loses some ions

Specific limitations by type:

  • Q-TOF: Lower sensitivity in MS/MS mode
  • Ion trap hybrids: Space charge effects limit dynamic range
  • Orbitrap hybrids: Slower scan speed than TOF

9. TUNEL Analysis

Core Question TUNEL Assay
Medium
What is TUNEL analysis? What does it detect and how does it work?
✓ Model Answer

TUNEL = Terminal deoxynucleotidyl transferase dUTP Nick End Labeling

Purpose: Detects apoptosis (programmed cell death) by identifying DNA fragmentation.

Principle:

  • During apoptosis, endonucleases cleave DNA between nucleosomes
  • This creates many DNA fragments with exposed 3'-OH ends ("nicks")
  • TUNEL labels these free 3'-OH ends

How it works:

  1. TdT enzyme (terminal deoxynucleotidyl transferase) is added
  2. TdT adds labeled dUTP nucleotides to 3'-OH ends of DNA breaks
  3. Labels can be: fluorescent (FITC), biotin (detected with streptavidin), or other markers
  4. Visualized by fluorescence microscopy or flow cytometry

Applications:

  • Detecting apoptosis in tissue sections
  • Studying cell death in disease models
  • Drug toxicity testing
  • Cancer research

Limitations:

  • Can also label necrotic cells (not specific to apoptosis)
  • False positives from mechanical DNA damage during sample prep
  • Should be combined with other apoptosis markers

Follow-up study suggestions:

  • Caspase activity assays (more specific for apoptosis)
  • Annexin V staining (early apoptosis marker)
  • Western blot for cleaved caspase-3 or PARP

10. Phage Display

Core Question Phage Display
Hard
What is Phage Display? What is its main limitation?
✓ Model Answer

Phage Display: A molecular biology technique where peptides or proteins are expressed ("displayed") on the surface of bacteriophage particles.

How it works:

  1. Library Creation: DNA encoding peptides/proteins is inserted into phage coat protein gene
  2. Expression: Phage expresses the foreign peptide fused to its coat protein (usually pIII or pVIII)
  3. Panning: Library exposed to target molecule (bait) immobilized on surface
  4. Selection: Non-binding phages washed away; binding phages retained
  5. Amplification: Bound phages eluted and amplified in bacteria
  6. Iteration: Process repeated 3-4 times to enrich for strong binders
  7. Identification: DNA sequencing reveals the binding peptide sequence

Applications:

  • Antibody discovery and engineering
  • Finding protein-protein interaction partners
  • Epitope mapping
  • Drug target identification
  • Peptide ligand discovery

MAIN LIMITATIONS:

  • Bacterial expression system:
    • No post-translational modifications (no glycosylation, phosphorylation)
    • May not fold mammalian proteins correctly
    • Codon bias issues
  • Size constraints: Large proteins difficult to display
  • Selection bias: Some peptides toxic to bacteria → lost from library
  • False positives: Selection for phage propagation, not just binding
  • Context-dependent: Displayed peptide may behave differently than free peptide
  • Limited to protein/peptide interactions: Cannot study interactions requiring membrane context
💡 Key limitation to mention: Prokaryotic expression = no eukaryotic PTMs and potential protein misfolding.

11. Energy Transfer Methods (FRET/BRET)

Core Question Energy Transfer
Hard
Explain energy transfer-based methods for studying protein interactions. What are donor and acceptor? What types of signals are obtained?
✓ Model Answer

Energy Transfer Methods: Techniques that detect protein-protein interactions based on the transfer of energy between two labeled molecules when they come into close proximity.

FRET (Förster Resonance Energy Transfer):

  • Donor: Fluorescent molecule that absorbs excitation light (e.g., CFP, GFP)
  • Acceptor: Fluorescent molecule that receives energy from donor (e.g., YFP, RFP)
  • Mechanism: Non-radiative energy transfer through dipole-dipole coupling
  • Distance requirement: 1-10 nm (typically <10 nm for efficient transfer)

BRET (Bioluminescence Resonance Energy Transfer):

  • Donor: Bioluminescent enzyme (e.g., Renilla luciferase)
  • Acceptor: Fluorescent protein (e.g., GFP, YFP)
  • Advantage: No external excitation needed → lower background

Signals Obtained:

  • When proteins are FAR apart:
    • Only donor emission observed
    • No energy transfer
  • When proteins INTERACT (close proximity):
    • Donor emission decreases (quenching)
    • Acceptor emission increases (sensitized emission)
    • FRET efficiency can be calculated

Types of signals measured:

  1. Sensitized emission: Acceptor fluorescence upon donor excitation
  2. Donor quenching: Decrease in donor fluorescence intensity
  3. Donor lifetime: Decrease in fluorescence lifetime (FLIM-FRET)
  4. Acceptor photobleaching: Donor recovery after acceptor is bleached

Applications:

  • Detecting protein-protein interactions in living cells
  • Monitoring conformational changes
  • Studying signaling pathway activation
  • Biosensor development

Common FRET pairs:

  • CFP (cyan) → YFP (yellow)
  • BFP (blue) → GFP (green)
  • GFP → RFP/mCherry
💡 Key concept: FRET is a "molecular ruler" — efficiency depends on distance (1/r⁶), so it only works when proteins are very close (<10 nm), indicating direct interaction.
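The 1/r⁶ dependence of the "molecular ruler" can be made concrete. R₀ = 5 nm is an assumed, typical Förster radius for a CFP→YFP-like pair:

```python
def fret_efficiency(r_nm, r0_nm=5.0):
    """Foerster efficiency E = 1 / (1 + (r/R0)^6).
    R0 (assumed 5 nm here) is the distance of 50% transfer."""
    return 1.0 / (1.0 + (r_nm / r0_nm) ** 6)

at_r0   = fret_efficiency(5.0)    # 0.5 by definition of R0
close   = fret_efficiency(2.5)    # ~0.98 -- strong transfer
too_far = fret_efficiency(10.0)   # ~0.015 -- effectively none
```

The sixth-power falloff is why FRET reports only genuinely close (interacting) pairs: doubling the distance beyond R₀ collapses the signal.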

12. Quick Review - Core Concepts

Test yourself on these essential concepts:

Q: Trypsin cleaves at the C-terminal side of which residues?
A: K (Lysine) and R (Arginine), except before P (Proline)

Q: MALDI produces mainly ❓ charged ions
A: Singly charged [M+H]⁺

Q: ESI produces mainly ❓ charged ions
A: Multiply charged [M+nH]ⁿ⁺

Q: The main difference between SELDI and MALDI is ❓
A: SELDI uses chemically modified surfaces for selective binding

Q: Shotgun proteomics separates proteins or peptides first?
A: Peptides (digests the whole mixture first)

Q: Classical bottom-up (PMF) separates proteins or peptides first?
A: Proteins (2D-PAGE, then digests individual spots)

Q: TUNEL detects ❓
A: DNA fragmentation / Apoptosis

Q: The main limitation of phage display is ❓
A: Prokaryotic expression (no PTMs, potential misfolding)

Q: FRET requires donor and acceptor to be within ❓ nm
A: <10 nm (typically 1-10 nm)

Q: Q-TOF is a hybrid combining ❓
A: Quadrupole + Time-of-Flight

Q: In pooled samples you lose ❓
A: Individual variation / the ability to do statistics on individuals

Q: The "proteomic gap" in 2DE refers to ❓
A: Proteins expressed but not detected by 2D electrophoresis


13. CID (Collision-Induced Dissociation)

Core Question Fragmentation
Hard
Describe the process of Collision-Induced Dissociation (CID) and its significance in tandem mass spectrometry.
✓ Model Answer

CID (Collision-Induced Dissociation): A fragmentation method where precursor ions are fragmented by colliding them with an inert gas.

How CID works:

  1. Ion selection: Precursor ion selected in first mass analyzer (MS1)
  2. Collision cell: Selected ion enters a chamber filled with inert gas (Argon, Nitrogen, or Xenon)
  3. Collision: Ion collides with gas molecules, converting kinetic energy to internal energy
  4. Fragmentation: Internal energy causes bonds to break, producing fragment ions
  5. Analysis: Fragment ions analyzed in second mass analyzer (MS2)

Significance in MS/MS:

  • Generates b-ions and y-ions for peptide sequencing
  • Provides structural information about the parent ion
  • Enables amino acid sequence determination
  • Allows protein identification via database searching
  • Can reveal PTM locations

Other fragmentation methods:

  • HCD: Higher-energy Collisional Dissociation (used in Orbitrap)
  • ETD: Electron Transfer Dissociation (better for PTMs, larger peptides)
  • ECD: Electron Capture Dissociation (preserves labile modifications)
💡 Key point: CID is the most common fragmentation method and primarily breaks peptide bonds, generating predictable b- and y-ion series.

14. b-ions and y-ions

Core Question Fragment Ions
Hard
What are b-ions and y-ions? How are they used for peptide sequencing?
✓ Model Answer

Fragment ions from peptide backbone cleavage:

b-ions:

  • Contain the N-terminal portion of the peptide
  • Charge retained on the N-terminal fragment
  • Named b₁, b₂, b₃... (number = amino acids from N-terminus)

y-ions:

  • Contain the C-terminal portion of the peptide
  • Charge retained on the C-terminal fragment
  • Named y₁, y₂, y₃... (number = amino acids from C-terminus)

Visual representation:

        N-terminus ← → C-terminus
        H₂N-[AA₁]-[AA₂]-[AA₃]-[AA₄]-COOH
                 ↓     ↓     ↓
                b₁    b₂    b₃   (N-terminal fragments)
                y₃    y₂    y₁   (C-terminal fragments)

How sequencing works:

  1. Mass differences between consecutive b-ions (or y-ions) = amino acid masses
  2. b₂ - b₁ = mass of 2nd amino acid
  3. y₃ - y₂ = mass of amino acid at position (n-2)
  4. Complete series allows full sequence determination

Why both series are useful:

  • Complementary information confirms sequence
  • Gaps in one series may be filled by the other
  • For singly charged ions, bᵢ + yₙ₋ᵢ = [M+H]⁺ + 1.008 (the complementary neutral fragments, one of which carries the water, together make up the whole peptide)
💡 Remember: b-ions = N-terminal, y-ions = C-terminal. The mass difference between consecutive ions reveals the amino acid identity.
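The ladder logic above can be sketched with standard monoisotopic residue masses (subset shown; singly charged ions assumed; the peptide "PEPK" is illustrative):

```python
# Standard monoisotopic residue masses (Da), subset for the example.
RES = {'P': 97.05276, 'E': 129.04259, 'K': 128.09496,
       'G': 57.02146, 'A': 71.03711, 'L': 113.08406}
PROTON, WATER = 1.00728, 18.01056

def fragment_ladders(pep):
    """Singly charged b- and y-ion m/z series for peptide pep."""
    m = [RES[a] for a in pep]
    b = [sum(m[:i]) + PROTON for i in range(1, len(pep))]
    y = [sum(m[-j:]) + WATER + PROTON for j in range(1, len(pep))]
    return b, y

b, y = fragment_ladders("PEPK")
residue_2 = b[1] - b[0]   # ~129.043 -> E (Glu), read straight off the ladder
```

Note also that b₁ + y₃ equals the peptide mass plus two protons: the complementary pairs tile the whole precursor.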

15. Monoisotopic vs Average Mass

Core Question Mass Definitions
Medium
Define monoisotopic mass and average mass. How are they used in peptide mass fingerprinting?
✓ Model Answer

Monoisotopic Mass:

  • Mass calculated using the most abundant isotope of each element
  • For organic molecules: ¹²C, ¹H, ¹⁴N, ¹⁶O, ³²S
  • Corresponds to the first peak in the isotope distribution (M+0)
  • More precise, used for accurate mass measurements

Average Mass:

  • Weighted average of all naturally occurring isotopes
  • Takes into account natural isotope abundance
  • Corresponds to the centroid of the isotope envelope
  • Used when resolution is insufficient to resolve isotopes

Example (for Carbon):

  • Monoisotopic: ¹²C = 12.0000 Da
  • Average: (98.9% × 12.0000) + (1.1% × 13.0034) = 12.011 Da
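That weighted average is simple to verify:

```python
# Natural carbon isotopes: (mass in Da, fractional abundance).
carbon = [(12.0000, 0.989), (13.0034, 0.011)]

average_mass = sum(m * p for m, p in carbon)        # ~12.011 Da
monoisotopic = max(carbon, key=lambda t: t[1])[0]   # 12.0000 Da (most abundant)
```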

Use in PMF:

| Situation                      | Mass Type    | Reason                                |
|--------------------------------|--------------|---------------------------------------|
| High-resolution MS (MALDI-TOF) | Monoisotopic | Can resolve isotope peaks             |
| Low-resolution MS              | Average      | Cannot resolve isotopes               |
| Small peptides (<2000 Da)      | Monoisotopic | First peak is tallest                 |
| Large proteins (>10 kDa)       | Average      | Monoisotopic peak too small to detect |
💡 Key point: For PMF with MALDI-TOF, use monoisotopic masses of peptides for database matching — this gives the highest accuracy.

16. Mass Analyzers Comparison

Core Question Mass Analyzers
Hard
Compare the TOF, Quadrupole, and Orbitrap mass analyzers. How does each separate ions? Compare their resolution, mass accuracy, and sensitivity.
✓ Model Answer

How each analyzer separates ions:

TOF (Time-of-Flight):

  • Ions accelerated through same voltage, gain same kinetic energy
  • KE = ½mv² → lighter ions travel faster
  • Measures flight time through drift tube
  • Shorter time = lower m/z
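The KE = ½mv² relation gives the flight time directly; a sketch with assumed instrument values (20 kV acceleration, 1 m drift tube):

```python
from math import sqrt

E_CHARGE = 1.602176634e-19   # elementary charge, C
DALTON   = 1.66053907e-27    # 1 Da in kg

def flight_time(mz_ratio, voltage=20e3, length=1.0):
    """Drift time through a field-free tube: zeV = 1/2 m v^2,
    so t = L * sqrt(m / (2 z e V)). Voltage/length are assumed."""
    m_per_z = mz_ratio * DALTON
    v = sqrt(2 * E_CHARGE * voltage / m_per_z)
    return length / v

t1, t4 = flight_time(1000), flight_time(4000)
# t scales with sqrt(m/z): a 4x heavier ion arrives exactly 2x later
```

Flight times come out in the microsecond range, which is why TOF scan speeds are so fast.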

Quadrupole:

  • Four parallel rods with oscillating RF/DC voltages
  • Creates oscillating electric field
  • Only ions with specific m/z have stable trajectories
  • Others collide with rods and are lost
  • Acts as a mass filter (scanning or SIM mode)

Orbitrap:

  • Ions trapped orbiting around central spindle electrode
  • Oscillate axially with frequency dependent on m/z
  • Measures oscillation frequency (image current)
  • Fourier transform converts frequency → m/z

Comparison table:

| Parameter     | TOF                      | Quadrupole           | Orbitrap         |
|---------------|--------------------------|----------------------|------------------|
| Resolution    | 10,000-60,000            | 1,000-4,000 (low)    | 100,000-500,000+ |
| Mass Accuracy | 5-20 ppm                 | 100-1000 ppm         | <2-5 ppm         |
| Sensitivity   | High (femtomole)         | High                 | High (attomole)  |
| Mass Range    | Unlimited (in principle) | Up to ~4000 m/z      | Up to ~6000 m/z  |
| Scan Speed    | Very fast                | Fast                 | Slower           |
| Cost          | Moderate                 | Low                  | High             |
| Best for      | MALDI, fast scanning     | Quantification (SRM) | High accuracy ID |
💡 Summary: Orbitrap = highest resolution/accuracy; Quadrupole = best for quantification; TOF = fastest, good all-rounder.

17. De Novo Sequencing

Core Question Sequencing
Hard
What is de novo sequencing? When would you use it instead of database searching?
✓ Model Answer

De Novo Sequencing: Determining the amino acid sequence of a peptide directly from its MS/MS spectrum, without relying on a sequence database.

How it works:

  1. Acquire high-quality MS/MS spectrum
  2. Identify b-ion and y-ion series
  3. Calculate mass differences between consecutive peaks
  4. Match mass differences to amino acid residue masses
  5. Build sequence from N- to C-terminus (or reverse)
  6. Validate with complementary ion series
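Steps 3-4 of this loop are a tolerance lookup of each peak-to-peak gap against the residue mass table (subset shown; Leu/Ile are isobaric and genuinely indistinguishable; the example peak values are illustrative):

```python
# Standard monoisotopic residue masses (Da), subset for illustration.
RESIDUES = {57.02146: 'G', 71.03711: 'A', 87.03203: 'S',
            113.08406: 'L/I', 128.09496: 'K', 156.10111: 'R'}

def residue_from_delta(delta, tol=0.02):
    """Map the mass gap between consecutive fragment peaks to a residue."""
    for mass, aa in RESIDUES.items():
        if abs(delta - mass) <= tol:
            return aa
    return None   # gap in the ion series, or a residue not in the table

# Two consecutive y-ion peaks 113.084 Da apart:
aa = residue_from_delta(503.289 - 390.205)   # -> 'L/I'
```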

When to use de novo sequencing:

  • Protein NOT in database:
    • Novel organisms without sequenced genomes
    • Uncharacterized proteins
    • Organisms with incomplete proteome databases
  • Unexpected modifications: PTMs not predicted by database
  • Mutations/variants: Sequence differs from database entry
  • Antibody sequencing: Highly variable regions
  • Ancient proteins: Paleoproteomics
  • Validation: Confirming database search results

Challenges:

  • Requires high-quality spectra with complete ion series
  • Isobaric amino acids (Leu/Ile = 113 Da) cannot be distinguished
  • Labor-intensive and time-consuming
  • May have gaps in sequence coverage

Software tools: PEAKS, Novor, PepNovo, DeNovoGUI

💡 Key scenario: "What do you do if the protein is not in the database?" → Use de novo sequencing to determine sequence directly from MS/MS data.

18. Inteins

Core Question Protein Engineering
Medium
What are inteins? Explain their significance in protein engineering and purification.
✓ Model Answer

Inteins: Self-splicing protein segments that can excise themselves from a precursor protein, leaving behind the flanking exteins joined together.

Terminology:

  • Intein: INternal proTEIN (gets removed)
  • Extein: EXternal proTEIN (flanking sequences that remain)
  • N-extein — [INTEIN] — C-extein → N-extein—C-extein + free intein

Mechanism (protein splicing):

  1. N-S or N-O acyl shift at N-terminus of intein
  2. Transesterification
  3. Asparagine cyclization releases intein
  4. S-N or O-N acyl shift joins exteins with native peptide bond

Applications in protein engineering:

  • Self-cleaving affinity tags:
    • Protein fused to intein + affinity tag (e.g., chitin-binding domain)
    • Bind to affinity column
    • Induce intein cleavage (pH, temperature, or thiol)
    • Pure protein released, tag remains on column
    • Advantage: No protease needed, no extra residues left
  • Protein ligation (Expressed Protein Ligation):
    • Join two protein fragments with native peptide bond
    • Useful for incorporating unnatural amino acids
    • Creating segmentally labeled proteins for NMR
  • Protein cyclization: Create cyclic proteins
  • Conditional protein splicing: Control protein activity
💡 Main advantage: Inteins enable tag-free protein purification — the protein is released without any extra amino acids from the tag.

19. Interactomics Methods

Core Question Interactomics
Hard
What is interactomics? Describe the main experimental techniques used to study protein-protein interactions: Yeast Two-Hybrid, Co-IP, and AP-MS.
✓ Model Answer

Interactomics: The study of protein-protein interactions (PPIs) and the networks they form within biological systems.

1. Yeast Two-Hybrid (Y2H):

  • Principle: Reconstitution of transcription factor activity
  • Method:
    • Bait protein fused to DNA-binding domain
    • Prey protein fused to activation domain
    • If bait and prey interact → transcription factor reconstituted → reporter gene expressed
  • Pros: High-throughput, detects direct binary interactions
  • Cons: In vivo but in yeast (not native environment), high false positive rate, only nuclear interactions

2. Co-Immunoprecipitation (Co-IP):

  • Principle: Antibody pulldown of protein complexes
  • Method:
    • Lyse cells, add antibody against bait protein
    • Antibody-protein complex captured on beads
    • Wash away non-specific proteins
    • Elute and analyze interacting proteins (Western blot or MS)
  • Pros: Detects endogenous interactions, physiological conditions
  • Cons: Requires good antibody, may miss transient interactions, cannot distinguish direct from indirect interactions

3. Affinity Purification-Mass Spectrometry (AP-MS):

  • Principle: Tagged bait protein pulls down interaction partners
  • Method:
    • Express tagged bait protein (FLAG, HA, TAP tag)
    • Lyse cells, capture bait + interactors on affinity resin
    • Wash stringently
    • Elute and identify interactors by MS
  • Pros: Unbiased identification, can detect entire complexes
  • Cons: Tag may affect interactions, overexpression artifacts, false positives from sticky proteins

| Method | Throughput | Direct/Indirect | Environment       |
|--------|------------|-----------------|-------------------|
| Y2H    | High       | Direct only     | Yeast nucleus     |
| Co-IP  | Low        | Both            | Native            |
| AP-MS  | Medium     | Both            | Native (with tag) |

20. What If Protein Is Not In Database?

Core Question Database Issues
Hard
What do you do if the protein is not in the database? How can you still identify an unknown protein?
✓ Model Answer

Strategies when protein is not in database:

1. De Novo Sequencing:

  • Determine peptide sequence directly from MS/MS spectrum
  • Calculate mass differences between fragment ions
  • Match to amino acid masses
  • Build sequence without database reference

2. Homology/Sequence Tag Searching:

  • Use short sequence tags from de novo to search related organisms
  • BLAST search against broader databases (NCBI nr)
  • MS-BLAST: Search with imperfect sequences
  • May find homologous protein in related species

3. Error-Tolerant Database Searching:

  • Allow for mutations, modifications, or sequence variants
  • Search with wider mass tolerance
  • Consider unexpected PTMs or SNPs

4. EST/Transcriptome Database Search:

  • Use expressed sequence tags (EST) databases
  • Search against RNA-seq data from same organism
  • May contain unannotated protein sequences

5. Spectral Library Searching:

  • Compare experimental spectrum to library of acquired spectra
  • May match even without sequence information

6. Genomic Six-Frame Translation:

  • If genome is available but not annotated
  • Translate genome in all 6 reading frames
  • Search MS data against translated sequences

Practical workflow:

  1. First: Try error-tolerant search or related species database
  2. Second: Perform de novo sequencing on best spectra
  3. Third: BLAST de novo sequences against NCBI
  4. Fourth: If genome available, try 6-frame translation
💡 Key answer: Use de novo sequencing to get peptide sequences directly from MS/MS data, then use these sequences to search broader databases or identify homologs.

21. 2D-PAGE Workflow

Core Question 2D Electrophoresis
Medium
Describe the 2D-PAGE workflow. What is separated in each dimension?
✓ Model Answer

2D-PAGE = Two-Dimensional Polyacrylamide Gel Electrophoresis

Principle: Separates proteins by TWO independent properties for maximum resolution.

First Dimension: Isoelectric Focusing (IEF)

  • Separates proteins by isoelectric point (pI)
  • Uses immobilized pH gradient (IPG) strips
  • Proteins migrate until net charge = 0
  • High voltage (up to 8000 V), long focusing time

Second Dimension: SDS-PAGE

  • Separates proteins by molecular weight (MW)
  • IPG strip equilibrated with SDS, placed on gel
  • SDS denatures proteins and provides uniform charge
  • Smaller proteins migrate faster

Complete workflow:

  1. Sample preparation: Lysis, solubilization in urea/thiourea/CHAPS
  2. Rehydration: Load sample onto IPG strip
  3. IEF: Focus proteins by pI (12-24 hours)
  4. Equilibration: Reduce (DTT) and alkylate (IAA) proteins in SDS buffer
  5. SDS-PAGE: Separate by MW (4-6 hours)
  6. Staining: Coomassie, silver, or fluorescent (SYPRO Ruby)
  7. Image analysis: Detect spots, compare gels
  8. Spot picking: Excise spots of interest
  9. MS analysis: In-gel digestion → MALDI-TOF (PMF) or LC-MS/MS
Dimension   Property      Method     Direction
1st         pI (charge)   IEF        Horizontal
2nd         MW (size)     SDS-PAGE   Vertical
💡 Remember: 1st dimension = pI (IEF), 2nd dimension = MW (SDS-PAGE). This gives "orthogonal" separation for maximum resolution.

22. Quick Review - Additional Concepts

Test yourself on these additional essential concepts:

CID stands for Collision-Induced Dissociation

b-ions contain the ❓ terminus → N-terminus

y-ions contain the ❓ terminus → C-terminus

Monoisotopic mass uses the ❓ isotope → The most abundant isotope of each element

Which mass analyzer has the highest resolution? → Orbitrap (100,000-500,000+)

Which mass analyzer is best for quantification (SRM)? → Quadrupole (Triple Quad)

De novo sequencing is used when → The protein is not in the database

Inteins are useful for → Tag-free protein purification / protein ligation

Y2H detects ❓ interactions only → Direct binary interactions

In 2D-PAGE, the 1st dimension separates by → pI (isoelectric point) using IEF

Leucine and Isoleucine cannot be distinguished because → They have the same residue mass (113 Da) - isobaric

TOF separates ions by their → Flight time through the drift tube


PLINK Genotype File Formats

PLINK is a free, open-source toolset designed for genome-wide association studies (GWAS) and population genetics analysis.

When you're dealing with genotype data from thousands (or millions) of people across hundreds of thousands (or millions) of genetic variants, you face several problems:

  1. File size: Raw genotype data is MASSIVE
  2. Processing speed: Reading and analyzing this data needs to be fast
  3. Standardization: Different labs and companies produce data in different formats
  4. Analysis tools: You need efficient ways to compute allele frequencies, test for associations, filter variants, etc.

PLINK solves these problems by providing:

  • Efficient binary file formats (compact storage)
  • Fast algorithms for common genetic analyses
  • Format conversion tools
  • Quality control utilities

Typical uses include:

  • Analyzing data from genotyping chips (Illumina, Affymetrix)
  • Running genome-wide association studies (GWAS)
  • Computing population genetics statistics
  • Quality control and filtering of genetic variants
  • Converting between different genotype file formats

PLINK Binary Format (.bed / .bim / .fam)

This is PLINK's primary format - a set of three files that work together. It's called "binary" because the main genotype data is stored in a compressed binary format rather than human-readable text.

The .fam File (Family/Sample Information)

The .fam file contains information about each individual (sample) in your study. It has 6 columns with NO header row.

Format:


FamilyID  IndividualID  FatherID  MotherID  Sex  Phenotype

Example .fam file:


FAM001  IND001  0  0  1  2
FAM001  IND002  0  0  2  1
FAM002  IND003  IND004  IND005  1  -9
FAM002  IND004  0  0  1  1
FAM002  IND005  0  0  2  1

Column Breakdown:

Column 1: Family ID

  • Groups individuals into families
  • Can be the same as Individual ID if samples are unrelated
  • Example: FAM001, FAM002

Column 2: Individual ID

  • Unique identifier for each person
  • Must be unique within each family
  • Example: IND001, IND002

Column 3: Paternal ID (Father)

  • Individual ID of the father
  • 0 = father not in dataset (unknown or not genotyped)
  • Used for constructing pedigrees and family-based analyses

Column 4: Maternal ID (Mother)

  • Individual ID of the mother
  • 0 = mother not in dataset
  • Must match an Individual ID if the parent is in the study

Column 5: Sex

  • 1 = Male
  • 2 = Female
  • 0 = Unknown sex
  • Other codes (like -9) are sometimes used for unknown, but 0 is standard

Column 6: Phenotype

  • The trait you're studying (disease status, quantitative trait, etc.)
  • For binary (case-control) traits:
    • 1 = Control (unaffected)
    • 2 = Case (affected)
    • 0 or -9 = Missing phenotype
  • For quantitative traits: Any numeric value
  • -9 = Standard missing value code

Important Notes About Special Codes:

0 (Zero):

  • In Parent columns: Parent not in dataset
  • In Sex column: Unknown sex
  • In Phenotype column: Missing phenotype (though -9 is more common)

-9 (Negative nine):

  • Universal "missing data" code in PLINK
  • Most commonly used for missing phenotype
  • Sometimes used for unknown sex (though 0 is standard)

Why these codes matter:

  • PLINK will skip individuals with missing phenotypes in association tests
  • Parent information is crucial for family-based tests (like TDT)
  • Sex information is needed for X-chromosome analysis
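The column rules above can be captured in a small parser. This is a sketch, assuming whitespace-delimited lines with exactly six columns and no header; `parse_fam_line` is a hypothetical helper, not part of PLINK.

```python
def parse_fam_line(line):
    """Interpret one .fam line using the standard PLINK column codes."""
    fid, iid, father, mother, sex, pheno = line.split()
    return {
        "family": fid,
        "individual": iid,
        "father": None if father == "0" else father,   # 0 = parent not in dataset
        "mother": None if mother == "0" else mother,
        "sex": {"1": "male", "2": "female"}.get(sex, "unknown"),
        "phenotype": None if pheno in ("-9", "0") else pheno,  # -9/0 = missing
    }

rec = parse_fam_line("FAM002  IND003  IND004  IND005  1  -9")
print(rec["father"], rec["sex"], rec["phenotype"])  # IND004 male None
```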

The .bim File (Variant Information)

The .bim file (binary marker information) describes each genetic variant. It has 6 columns with NO header row.

Format:

Chromosome  VariantID  GeneticDistance  Position  Allele1  Allele2

Example .bim file:

1   rs12345    0    752566    G    A
1   rs67890    0    798959    C    T
2   rs11111    0    1240532   A    G
3   rs22222    0    5820321   T    C
X   rs33333    0    2947392   G    A

Column Breakdown:

Column 1: Chromosome

  • Chromosome number: 1-22 (autosomes)
  • Sex chromosomes: X, Y, XY (pseudoautosomal), MT (mitochondrial)
  • Example: 1, 2, X

Column 2: Variant ID

  • Usually an rsID (reference SNP ID from dbSNP)
  • Format: rs followed by numbers (e.g., rs12345)
  • Can be any unique identifier if rsID isn't available
  • Example: chr1:752566:G:A (chromosome:position:ref:alt format)

Column 3: Genetic Distance

  • Position in centimorgans (cM)
  • Measures recombination distance, not physical distance
  • Often set to 0 if unknown (very common)
  • Used in linkage analysis and some phasing algorithms

Column 4: Base-Pair Position

  • Physical position on the chromosome
  • Measured in base pairs from the start of the chromosome
  • Example: 752566 means 752,566 bases from chromosome start
  • Critical for genome builds: Make sure you know if it's GRCh37 (hg19) or GRCh38 (hg38)!

Column 5: Allele 1

  • First allele (often the reference allele)
  • Single letter: A, C, G, T
  • Can also be I (insertion), D (deletion), or 0 (missing)

Column 6: Allele 2

  • Second allele (often the alternate/effect allele)
  • Same coding as Allele 1

Important Notes:

Allele coding:

  • These alleles define what genotypes mean in the .bed file
  • Genotype AA means homozygous for Allele1
  • Genotype AB means heterozygous
  • Genotype BB means homozygous for Allele2

Strand issues:

  • Alleles should be on the forward strand
  • Mixing strands between datasets causes major problems in meta-analysis
  • Always check strand alignment when combining datasets!

The .bed File (Binary Genotype Data)

The .bed file contains the actual genotype calls in compressed binary format. This file is NOT human-readable - you can't open it in a text editor and make sense of it.

Key characteristics:

Why binary?

  • Space efficiency: A text file with millions of genotypes is huge; binary format compresses this dramatically
  • Speed: Computer can read binary data much faster than parsing text
  • Example: A dataset with 1 million SNPs and 10,000 people:
    • Text format (.ped): ~30 GB
    • Binary format (.bed): ~2.4 GB

What's stored:

  • Genotype calls for every individual at every variant
  • Each genotype is encoded efficiently (2 bits per genotype)
  • Encoding:
    • 00 = Homozygous for allele 1 (AA)
    • 01 = Missing genotype
    • 10 = Heterozygous (AB)
    • 11 = Homozygous for allele 2 (BB)

SNP-major vs. individual-major:

  • PLINK binary files are stored in SNP-major mode by default
  • This means genotypes are organized by variant (all individuals for SNP1, then all individuals for SNP2, etc.)
  • More efficient for most analyses (which process one SNP at a time)

You never edit .bed files manually - always use PLINK commands to modify or convert them.
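As an illustration of the 2-bit encoding, here is a minimal Python sketch that unpacks one .bed byte, assuming SNP-major mode and the bit layout described above (least-significant bit pair first). The helper names are hypothetical; real analyses should go through PLINK or a maintained library.

```python
# Two-bit genotype codes used in PLINK .bed files
CODES = {0b00: "hom A1", 0b01: "missing", 0b10: "het", 0b11: "hom A2"}

def decode_byte(b):
    """Unpack one .bed byte into up to four genotype calls (low bits first)."""
    return [CODES[(b >> shift) & 0b11] for shift in (0, 2, 4, 6)]

def check_magic(first_three):
    """PLINK binary files start with 0x6C 0x1B, then 0x01 for SNP-major mode."""
    return first_three[:2] == b"\x6c\x1b" and first_three[2] == 1

# 0b11_10_01_00 read from the low end: hom A1, missing, het, hom A2
print(decode_byte(0b11100100))  # ['hom A1', 'missing', 'het', 'hom A2']
```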


PLINK Text Format (.ped / .map)

This is the original PLINK format. It's human-readable but much larger and slower than binary format. Mostly used for small datasets or when you need to manually inspect/edit data.

The .map File (Variant Map)

Similar to .bim but with only 4 columns.

Format:

Chromosome  VariantID  GeneticDistance  Position

Example .map file:

1   rs12345    0    752566
1   rs67890    0    798959
2   rs11111    0    1240532
3   rs22222    0    5820321

Notice: NO allele information in .map files (unlike .bim files).


The .ped File (Pedigree + Genotypes)

Contains both sample information AND genotype data in one large text file.

Format:

FamilyID  IndividualID  FatherID  MotherID  Sex  Phenotype  [Genotypes...]

The first 6 columns are identical to the .fam file. After that, genotypes are listed as pairs of alleles (one pair per SNP).

Example .ped file:

FAM001  IND001  0  0  1  2  G G  C T  A G  T T
FAM001  IND002  0  0  2  1  G A  C C  A A  T C
FAM002  IND003  0  0  1  1  A A  T T  G G  C C

Genotype Encoding:

Each SNP is represented by two alleles separated by a space:

  • G G = Homozygous for G allele
  • G A = Heterozygous (one G, one A)
  • A A = Homozygous for A allele
  • 0 0 = Missing genotype

Important: The order of alleles in heterozygotes doesn't matter (G A = A G).

Problems with .ped format:

  • HUGE files for large datasets (gigabytes to terabytes)
  • Slow to process (text parsing is computationally expensive)
  • No explicit allele definition (you have to infer which alleles exist from the data)

When to use .ped/.map:

  • Small datasets (< 1,000 individuals, < 10,000 SNPs)
  • When you need to manually edit genotypes
  • Importing data from older software
  • Best practice: Convert to binary format (.bed/.bim/.fam) immediately for analysis

Transposed Format (.tped/.tfam)

This format is a "transposed" version of .ped/.map. Instead of one row per individual, you have one row per SNP.

The .tfam File

Identical to .fam file - contains sample information.

Format:

FamilyID  IndividualID  FatherID  MotherID  Sex  Phenotype

The .tped File (Transposed Genotypes)

Each row represents one SNP, with genotypes for all individuals.

Format:

Chromosome  VariantID  GeneticDistance  Position  [Genotypes for all individuals...]

Example .tped file:

1  rs12345  0  752566  G G  G A  A A  G G  A A
1  rs67890  0  798959  C T  C C  T T  C T  C C
2  rs11111  0  1240532 A G  A A  G G  A G  A A

The first 4 columns are like the .map file. After that, genotypes are listed for all individuals (2 alleles per person, space-separated).

When to use .tped/.tfam:

  • When your data is organized by SNP rather than by individual
  • Converting from certain genotyping platforms
  • Some imputation software prefers this format
  • Still text format so same size/speed issues as .ped

Long Format

Long format (also called "additive" or "dosage" format) represents genotypes as numeric values instead of allele pairs.

Format options:

Additive coding (most common):

FamilyID  IndividualID  VariantID  Genotype
FAM001    IND001        rs12345    0
FAM001    IND001        rs67890    1
FAM001    IND001        rs11111    2
FAM001    IND002        rs12345    1

Numeric genotype values:

  • 0 = Homozygous for reference allele (AA)
  • 1 = Heterozygous (AB)
  • 2 = Homozygous for alternate allele (BB)
  • NA or -9 = Missing
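One convenience of additive coding is that the alternate-allele frequency falls out directly: sum the codes and divide by two alleles per person. A quick sketch (hypothetical helper, assuming missing values are already mapped to None):

```python
def alt_allele_freq(genos):
    """Alternate-allele frequency from additive codes 0/1/2, skipping missing."""
    observed = [g for g in genos if g is not None]
    return sum(observed) / (2 * len(observed))  # two alleles per person

print(alt_allele_freq([0, 1, 2, 1, None]))  # 0.5
```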

Why long format?

  • Easy to use in statistical software (R, Python pandas)
  • Flexible for merging with other data (phenotypes, covariates)
  • Good for database storage (one row per observation)
  • Can include dosages for imputed data (values between 0-2, like 0.85)

Downsides:

  • MASSIVE file size (one row per person per SNP)
  • Example: 10,000 people × 1 million SNPs = 10 billion rows
  • Not practical for genome-wide data without compression

When to use:

  • Working with a small subset of SNPs in R/Python
  • Merging genotypes with other tabular data
  • Machine learning applications where you need a feature matrix

Variant Call Format (VCF)

VCF is the standard format for storing genetic variation from sequencing data. Unlike genotyping arrays (which only query a predefined set of SNPs), sequencing can reveal any variant in the covered regions, including rare and novel ones.

Key characteristics:

Comprehensive information:

  • Genotypes for all samples at each variant
  • Quality scores for each call
  • Read depth, allele frequencies
  • Functional annotations
  • Multiple alternate alleles at the same position

File structure:

  • Header lines start with ## (metadata about reference genome, samples, etc.)
  • Column header line starts with #CHROM (defines columns)
  • Data lines: One per variant

Standard VCF columns:

#CHROM  POS     ID         REF  ALT     QUAL  FILTER  INFO           FORMAT  [Sample genotypes...]
1       752566  rs12345    G    A       100   PASS    AF=0.23;DP=50  GT:DP   0/1:30  1/1:25  0/0:28

Column Breakdown:

CHROM: Chromosome (1-22, X, Y, MT)

POS: Position on chromosome (1-based coordinate)

ID: Variant identifier (rsID or . if none)

REF: Reference allele (what's in the reference genome)

ALT: Alternate allele(s) - can be multiple, comma-separated

  • Example: A,T means two alternate alleles

QUAL: Quality score (higher = more confident call)

  • Phred-scaled: QUAL=30 means 99.9% confidence
  • . if unavailable
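The Phred scale converts to an error probability as 10^(-Q/10), so QUAL 30 means a 1-in-1,000 chance the call is wrong. A quick sketch:

```python
def phred_to_error(q):
    """Error probability implied by a Phred-scaled quality score."""
    return 10 ** (-q / 10)

print(phred_to_error(30))  # 0.001 -> 99.9% confidence
print(phred_to_error(20))  # 0.01  -> 99% confidence
```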

FILTER: Quality filter status

  • PASS = passed all filters
  • LowQual, HighMissing, etc. = failed specific filters
  • . = no filtering applied

INFO: Semicolon-separated annotations

  • AF=0.23 = Allele frequency 23%
  • DP=50 = Total read depth
  • AC=10 = Allele count
  • Many possible fields (defined in header)

FORMAT: Describes the per-sample data fields

  • GT = Genotype
  • DP = Read depth for this sample
  • GQ = Genotype quality
  • Example: GT:DP:GQ

Sample columns: One column per individual

  • Data corresponds to FORMAT field
  • Example: 0/1:30:99 means heterozygous, 30 reads, quality 99

Genotype Encoding in VCF:

GT (Genotype) format:

  • 0/0 = Homozygous reference (REF/REF)
  • 0/1 = Heterozygous (REF/ALT)
  • 1/1 = Homozygous alternate (ALT/ALT)
  • ./. = Missing genotype
  • 1/2 = Heterozygous with two different alternate alleles
  • 0|1 = Phased genotype (pipe | instead of slash /)

Phased vs. unphased:

  • / = unphased (don't know which allele came from which parent)
  • | = phased (know parental origin)
  • 0|1 means reference allele from parent 1, alternate from parent 2
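The GT rules above can be sketched as a small decoder that maps allele indices back to the actual REF/ALT bases and notes phasing. The helper name is hypothetical, and well-formed fields are assumed:

```python
def genotype_alleles(gt, ref, alts):
    """Translate a VCF GT string ('0/1', '0|1', './.') into actual alleles."""
    alleles = [ref] + alts          # index 0 = REF, 1+ = ALT alleles in order
    phased = "|" in gt              # pipe means the haplotypes are phased
    parts = gt.replace("|", "/").split("/")
    if "." in parts:                # './.' = missing genotype
        return None, phased
    return [alleles[int(i)] for i in parts], phased

print(genotype_alleles("0/1", "G", ["A"]))       # (['G', 'A'], False)
print(genotype_alleles("1|2", "G", ["A", "T"]))  # (['A', 'T'], True)
```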

Compressed VCF (.vcf.gz):

VCF files are usually gzipped and indexed:

  • .vcf.gz = compressed VCF (much smaller)
  • .vcf.gz.tbi = tabix index (allows fast random access)
  • Tools like bcftools and vcftools work directly with compressed VCFs

Example sizes:

  • Uncompressed VCF: 100 GB
  • Compressed .vcf.gz: 10-15 GB
  • Always work with compressed VCFs!

When to use VCF:

  • Sequencing data (whole genome, exome, targeted)
  • When you need detailed variant information
  • Storing rare and novel variants
  • Multi-sample studies with complex annotations
  • NOT typical for genotyping array data (use PLINK binary instead)

Oxford Format (.gen / .bgen + .sample)

Developed by the Oxford statistics group, commonly used in UK Biobank and imputation software (IMPUTE2, SHAPEIT).

The .sample File

Contains sample information, similar to .fam but with a header row.

Format:

ID_1 ID_2 missing sex phenotype
0 0 0 D B
IND001 IND001 0 1 2
IND002 IND002 0 2 1

First two rows are special:

  • Row 1: Column names
  • Row 2: Data types
    • D = Discrete/categorical
    • C = Continuous
    • B = Binary
    • 0 = Not used

Subsequent rows: Sample data

  • ID_1: Usually same as ID_2 for unrelated individuals
  • ID_2: Sample identifier
  • missing: Missingness rate (usually 0)
  • sex: 1=male, 2=female
  • phenotype: Your trait of interest

The .gen File (Genotype Probabilities)

Stores genotype probabilities rather than hard calls. This is crucial for imputed data where you're not certain of the exact genotype.

Format:

Chromosome  VariantID  Position  Allele1  Allele2  [Genotype probabilities for all samples...]

Example .gen file:

1  rs12345  752566  G  A  1 0 0  0.95 0.05 0  0 0.1 0.9

Genotype Probability Triplets:

For each sample, three probabilities (must sum to 1.0):

  • P(AA) = Probability of homozygous for allele 1
  • P(AB) = Probability of heterozygous
  • P(BB) = Probability of homozygous for allele 2

Example interpretations:

  • 1 0 0 = Definitely AA (100% certain)
  • 0 0 1 = Definitely BB (100% certain)
  • 0 1 0 = Definitely AB (100% certain)
  • 0.9 0.1 0 = Probably AA, might be AB (uncertain genotype)
  • 0.33 0.33 0.33 = Completely uncertain (missing data)

Why probabilities matter:

  • Imputed genotypes aren't perfectly certain
  • Better to use probabilities than picking "best guess" genotype
  • Allows proper statistical modeling of uncertainty
  • Example: If imputation says 90% chance of AA, 10% chance AB, you should account for that uncertainty
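A probability triplet converts into an expected allele-2 dosage by weighting each genotype by how many allele-2 copies it carries (0, 1, or 2); this is the "dosage" that association software uses. A one-function sketch:

```python
def dosage(p_aa, p_ab, p_bb):
    """Expected allele-2 count from a (P(AA), P(AB), P(BB)) triplet."""
    return 0 * p_aa + 1 * p_ab + 2 * p_bb

print(dosage(0.95, 0.05, 0))          # 0.05 -> almost certainly AA
print(round(dosage(0, 0.1, 0.9), 2))  # 1.9  -> almost certainly BB
```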

The .bgen File (Binary Gen)

Binary version of .gen format - compressed and indexed for fast access.

Key features:

  • Much smaller than text .gen files
  • Includes variant indexing for rapid queries
  • Supports different compression levels
  • Stores genotype probabilities (like .gen) or dosages
  • Used by UK Biobank and other large biobanks

Associated files:

  • .bgen = Main genotype file
  • .bgen.bgi = Index file (for fast lookup)
  • .sample = Sample information (same as with .gen)

When to use Oxford format:

  • Working with imputed data
  • UK Biobank analyses
  • Using Oxford software (SNPTEST, QCTOOL, etc.)
  • When you need to preserve genotype uncertainty

Converting to PLINK:

  • PLINK2 can read .bgen files
  • Can convert to hard calls (loses probability information)
  • Or use dosages (keeps uncertainty as 0-2 continuous values)

23andMe Format

23andMe is a direct-to-consumer genetic testing company. Their raw data format is simple but NOT standardized for research use.

Format:

# rsid    chromosome    position    genotype
rs12345    1    752566    AG
rs67890    1    798959    CC
rs11111    2    1240532   --

Column Breakdown:

rsid: Variant identifier (rsID from dbSNP)

chromosome: Chromosome number (1-22, X, Y, MT)

  • Note: Sometimes uses 23 for X, 24 for Y, 25 for XY, 26 for MT

position: Base-pair position

  • Warning: Build version (GRCh37 vs GRCh38) is often unclear!
  • Check the file header or 23andMe documentation

genotype: Two-letter allele call

  • AG = Heterozygous
  • AA = Homozygous
  • -- = Missing/no call
  • DD or II = Deletion or insertion (rare)

Important Limitations:

Not standardized:

  • Different builds over time (some files are GRCh37, newer ones GRCh38)
  • Allele orientation issues (forward vs. reverse strand)
  • Variant filtering varies by chip version

Only genotyped SNPs:

  • Typically 500k-1M SNPs (depending on chip version)
  • No imputed data in raw download
  • Focused on common variants (rare variants not included)

Missing quality information:

  • No quality scores
  • No read depth or confidence metrics
  • "No call" (--) doesn't tell you why it failed

Privacy and consent issues:

  • Users may not understand research implications
  • IRB approval needed for research use
  • Cannot assume informed consent for specific research

Converting 23andMe data to research formats: many online tools exist, but be careful to:

  1. Determine genome build (critical!)
  2. Check strand orientation
  3. Handle missing genotypes (-- → 0 0)
  4. Verify chromosome coding (especially X/Y/MT)

Typical workflow:

# Convert to PLINK text format. "23andme_to_plink.py" is a placeholder
# for whatever conversion script you use; PLINK 1.9 can also read the
# raw file directly via --23file.
python 23andme_to_plink.py raw_data.txt

# Creates .ped and .map files
# Then convert to binary
plink --file raw_data --make-bed --out data

When you'd use 23andMe data:

  • Personal genomics projects
  • Ancestry analysis
  • Polygenic risk score estimation
  • Educational purposes
  • NOT suitable for: Clinical decisions, serious GWAS (too small), research without proper consent

Summary: Choosing the Right Format

Format                           Best For                           Pros                          Cons
PLINK binary (.bed/.bim/.fam)    GWAS, large genotyping arrays      Fast, compact, standard       Loses probability info
PLINK text (.ped/.map)           Small datasets, manual editing     Human-readable                Huge, slow
VCF (.vcf/.vcf.gz)               Sequencing data, rare variants     Comprehensive info, standard  Complex, overkill for arrays
Oxford (.bgen/.gen)              Imputed data, UK Biobank           Preserves uncertainty         Less common in US
23andMe                          Personal genomics                  Direct-to-consumer            Not research-grade
Long format                      Statistical analysis in R/Python   Easy to manipulate            Massive file size

General recommendations:

  1. For genotyping array data: Use PLINK binary format (.bed/.bim/.fam)
  2. For sequencing data: Use compressed VCF (.vcf.gz)
  3. For imputed data: Use Oxford .bgen or VCF with dosages
  4. For statistical analysis: Convert subset to long format
  5. For personal data: Convert 23andMe to PLINK, but carefully

File conversions:

  • PLINK can convert between most formats
  • Always document your conversions (genome build, strand, filters)
  • Verify a few variants manually after conversion
  • Keep original files - conversions can introduce errors

Sanger Sequencing

The Chemistry: dNTPs vs ddNTPs

dNTP (deoxynucleotide triphosphate):

  • Normal DNA building blocks: dATP, dCTP, dGTP, dTTP
  • Have a 3'-OH group → DNA polymerase can add another nucleotide
  • Chain continues growing

ddNTP (dideoxynucleotide triphosphate):

  • Modified nucleotides: ddATP, ddCTP, ddGTP, ddTTP
  • Missing the 3'-OH group → no place to attach next nucleotide
  • Chain terminates (stops growing)

The key idea: Mix normal dNTPs with a small amount of ddNTPs. Sometimes the polymerase adds a normal dNTP (chain continues), sometimes it adds a ddNTP (chain stops). This creates DNA fragments of different lengths, all ending at the same type of base.


The Classic Method: Four Separate Reactions

You set up four tubes, each with:

  • Template DNA (what you want to sequence)
  • Primer (starting point)
  • DNA polymerase
  • All four dNTPs (A, C, G, T)
  • One type of ddNTP (different for each tube)

The Four Reactions:

Tube 1 - ddATP: Chains terminate at every A position
Tube 2 - ddCTP: Chains terminate at every C position
Tube 3 - ddGTP: Chains terminate at every G position
Tube 4 - ddTTP: Chains terminate at every T position

Example Results:

Let's say the template sequence is: 5'-ACGTACGT-3'

Tube A (ddATP): Fragments ending at A positions

A
ACGTA

Tube C (ddCTP): Fragments ending at C positions

AC
ACGTAC

Tube G (ddGTP): Fragments ending at G positions

ACG
ACGTACG

Tube T (ddTTP): Fragments ending at T positions

ACGT
ACGTACGT
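The four-tube logic above can be sketched as a tiny simulation: each fragment is a prefix of the read that ends at the tube's ddNTP base. The helper name is hypothetical.

```python
def termination_fragments(read_seq, base):
    """All prefixes of the read that end at the given base (one per ddNTP stop)."""
    return [read_seq[:i + 1] for i, b in enumerate(read_seq) if b == base]

for ddntp in "ACGT":
    print(ddntp, termination_fragments("ACGTACGT", ddntp))
# A ['A', 'ACGTA']
# C ['AC', 'ACGTAC']
# G ['ACG', 'ACGTACG']
# T ['ACGT', 'ACGTACGT']
```

Sorting all of these fragments by length and noting which tube each came from reproduces the gel-reading step that follows.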

Gel Electrophoresis Separation

Run all four samples on a gel. Smallest fragments move furthest, largest stay near the top.

        A    C    G    T
        |    |    |    |
Start → ━━━━━━━━━━━━━━━━  (loading wells)

                       ▬    ← ACGTACGT (8 bases, T lane)
                  ▬         ← ACGTACG (7 bases, G lane)
             ▬              ← ACGTAC (6 bases, C lane)
        ▬                   ← ACGTA (5 bases, A lane)
                       ▬    ← ACGT (4 bases, T lane)
                  ▬         ← ACG (3 bases, G lane)
             ▬              ← AC (2 bases, C lane)
        ▬                   ← A (1 base, A lane)

      ↓ Direction of migration ↓

Reading the sequence: Start from the bottom (smallest fragment) and go up:

Bottom → Top:  A - C - G - T - A - C - G - T
Sequence:      A   C   G   T   A   C   G   T

The sequence is ACGTACGT (read from bottom to top).


Modern Method: Fluorescent Dyes

Instead of four separate tubes, we now use one tube with four different fluorescent ddNTPs:

  • ddATP = Green fluorescence
  • ddCTP = Blue fluorescence
  • ddGTP = Yellow fluorescence
  • ddTTP = Red fluorescence

What happens:

  1. All fragments are created in one tube
  2. Run them through a capillary (tiny tube) instead of a gel
  3. Laser detects fragments as they pass by
  4. Computer records the color (= which base) and timing (= fragment size)

Chromatogram output:

Fluorescence
    ↑
    |    G    C    T    A    G    C    T
    |   /\   /\   /\   /\   /\   /\   /\
    |__/  \_/  \_/  \_/  \_/  \_/  \_/  \__→ Time
Position:   1    2    3    4    5    6    7

The computer reads the peaks and outputs: GCTAGCT


Why Sanger Sequencing Still Matters

  • High accuracy (~99.9%)
  • Gold standard for validating variants
  • Good for short reads (up to ~800 bases)
  • Sequences one defined template per reaction (e.g., a purified plasmid or PCR product)
  • Used for: Confirming mutations, plasmid verification, PCR product sequencing

Limitations:

  • One fragment at a time (not high-throughput)
  • Expensive for large-scale projects (replaced by next-gen sequencing)
  • Can't detect low-frequency variants (< 15-20%)

About Course Materials

These notes contain NO copied course materials. Everything here is my personal understanding and recitation of concepts, synthesized from publicly available resources (textbooks, online tutorials, sequencing method documentation).

This is my academic work—how I've processed and reorganized information from legitimate sources. I take full responsibility for any errors in my understanding.

If you believe any content violates copyright, contact me at mahmoudahmedxyz@gmail.com and I'll remove it immediately.

Lecture 2: Applied Genomics Overview

Key Concepts Covered

Hardy-Weinberg Equilibrium
Population genetics foundation - allele frequencies (p, q, r) in populations remain constant under specific conditions.
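For a biallelic locus, the expected Hardy-Weinberg genotype frequencies are p², 2pq, and q² with q = 1 - p. A one-function sketch:

```python
def hwe_genotype_freqs(p):
    """Expected (AA, Aa, aa) frequencies under Hardy-Weinberg equilibrium."""
    q = 1 - p
    return p * p, 2 * p * q, q * q

print(hwe_genotype_freqs(0.5))  # (0.25, 0.5, 0.25)
```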

Quantitative Genetics (QG)
Study of traits controlled by multiple genes. Used for calculating breeding values in agriculture and understanding complex human traits.

The Human Genome

  • ~3 billion base pairs
  • <5% codes for proteins (the rest: regulatory, structural, "junk")
  • Massive scale creates computational challenges

QTL (Quantitative Trait Loci)
Genomic regions associated with quantitative traits - linking genotype to phenotype.

Genomics Definition
Study of entire genomes - all DNA sequences, genes, and their interactions.

Sequencing Accuracy
Modern sequencing: <1 error per 10,000 bases

Comparative Genomics
Comparing genomes across species to understand evolution, function, and conservation.

Applied Genomics (Why we're here)
Analyze genomes and extract information - turning raw sequence data into biological insights.

Major Challenges in Genomic Data

  1. Storage - Billions of bases = terabytes of data
  2. Transfer - Moving large datasets between systems
  3. Processing - Computational power for analysis

Sequencing Direction Note

Sanger sequencing: Input = what you're reading (direct)
NGS: Reverse problem - detect complement synthesis, infer template

Next-Generation Sequencing (NGS)

Ion Torrent Sequencing

Ion Torrent is a next-generation sequencing technology that detects DNA sequences by measuring pH changes instead of using light or fluorescence. It's fast, relatively cheap, and doesn't require expensive optical systems.


The Chemistry: Detecting Hydrogen Ions

The Core Principle

When DNA polymerase adds a nucleotide to a growing DNA strand, it releases a hydrogen ion (H⁺).

The reaction:

dNTP + DNA(n) → DNA(n+1) + PPi + H⁺
  • DNA polymerase incorporates a nucleotide
  • Pyrophosphate (PPi) is released
  • One H⁺ ion is released per nucleotide added
  • The H⁺ changes the pH of the solution
  • A pH sensor detects this change

Key insight: No fluorescent labels, no lasers, no cameras. Just chemistry and pH sensors.

Why amplification? A single molecule releasing one H⁺ isn't detectable. A million copies releasing a million H⁺ ions at once creates a measurable pH change.

The Homopolymer Problem

What Are Homopolymers?

A homopolymer is a stretch of identical nucleotides in a row:

  • AAAA (4 A's)
  • TTTTTT (6 T's)
  • GGGGG (5 G's)

Why They're a Problem in Ion Torrent

Normal case (single nucleotide):

  • Flow A → 1 nucleotide added → 1 H⁺ released → small pH change → signal = 1

Homopolymer case (multiple identical nucleotides):

  • Flow A → 4 nucleotides added (AAAA) → 4 H⁺ released → larger pH change → signal = 4

The challenge: Distinguishing between signal strengths. Is it 3 A's or 4 A's? Is it 7 T's or 8 T's?

The Math Problem

Signal intensity is proportional to the number of nucleotides incorporated:

  • 1 nucleotide = signal intensity ~100
  • 2 nucleotides = signal intensity ~200
  • 3 nucleotides = signal intensity ~300
  • ...but measurements have noise

Example measurements:

  • True 3 A's might measure as 290-310
  • True 4 A's might measure as 390-410
  • Overlap zone: Is a signal of 305 actually 3 or 4?

The longer the homopolymer, the harder it is to count accurately.
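A naive base-caller for this situation just divides the measured signal by the single-base unit and rounds; the sketch below uses the assumed, illustrative signal units from the example above and shows why values in the overlap zone are ambiguous.

```python
def call_run_length(signal, unit=100.0):
    """Naive homopolymer-length call: nearest whole multiple of the unit signal."""
    return round(signal / unit)

print(call_run_length(305))  # 3 -> but a noisy 3-mer and 4-mer can overlap here
print(call_run_length(395))  # 4
```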

Consequences:

  • Insertions/deletions (indels) in homopolymer regions
  • Frameshifts if in coding regions (completely changes protein)
  • False variants called in genetic studies
  • Harder genome assembly (ambiguous regions)



Ion Torrent Systems

Ion Torrent offers different sequencing systems optimized for various throughput needs.

System Comparison

Feature        Ion PGM                                 Ion Proton/S5
Throughput     30 Mb - 2 Gb                            Up to 15 Gb
Run time       4-7 hours                               2-4 hours
Read length    35-400 bp                               200 bp
Best for       Small targeted panels, single samples   Exomes, large panels, multiple samples
Cost per run   Lower                                   Higher
Lab space      Benchtop                                Benchtop

Advantages of Ion Torrent

1. Speed

  • No optical scanning between cycles
  • Direct electronic detection
  • Runs complete in 2-4 hours (vs. days for some platforms)

2. Cost

  • No expensive lasers or cameras
  • Simpler hardware = lower instrument cost
  • Good for small labs or targeted sequencing

3. Scalability

  • Different chip sizes for different throughput needs
  • Can sequence 1 sample or 96 samples
  • Good for clinical applications

4. Long reads (relatively)

  • 200-400 bp reads standard
  • Longer than Illumina (75-300 bp typically)
  • Helpful for some applications

Disadvantages of Ion Torrent

1. Homopolymer errors (the big one)

  • Indel errors in long homopolymers
  • Limits accuracy for some applications

2. Lower overall accuracy

  • ~98-99% accuracy vs. 99.9% for Illumina
  • More errors per base overall

3. Smaller throughput

  • Maximum output: ~15 Gb per run
  • Illumina NovaSeq: up to 6 Tb per run
  • Not ideal for whole genome sequencing of complex organisms

4. Systematic errors

  • Errors aren't random - they cluster in homopolymers
  • Harder to correct computationally

Conclusion

Ion Torrent is a clever technology that trades optical complexity for electronic simplicity. It's fast and cost-effective for targeted applications, but the homopolymer problem remains its Achilles' heel.

The homopolymer issue isn't a deal-breaker - it's manageable with proper bioinformatics and sufficient coverage. But you need to know about it when designing experiments and interpreting results.

For clinical targeted sequencing (like cancer panels), Ion Torrent is excellent. For reference-quality genome assemblies or ultra-high-accuracy applications, other platforms might be better choices.

The key lesson: Every sequencing technology has trade-offs. Understanding them helps you choose the right tool for your specific question.


About Course Materials

These notes contain NO copied course materials. Everything here is my personal understanding and recitation of concepts, synthesized from publicly available resources (sequencing technology documentation, bioinformatics tutorials, scientific literature).

This is my academic work—how I've processed and reorganized information from legitimate sources. I take full responsibility for any errors in my understanding.

If you believe any content violates copyright, contact me at mahmoudahmedxyz@gmail.com and I'll remove it immediately.

Lecture 3

ABI SOLiD Sequencing (Historical)

What Was SOLiD?

SOLiD (Sequencing by Oligonucleotide Ligation and Detection) was a next-generation sequencing platform developed by Applied Biosystems (later acquired by Life Technologies, then Thermo Fisher).

Status: Essentially discontinued. Replaced by Ion Torrent and other technologies.


The Key Difference: Ligation Instead of Synthesis

Unlike other NGS platforms:

  • Illumina: Sequencing by synthesis (polymerase adds nucleotides)
  • Ion Torrent: Sequencing by synthesis (polymerase adds nucleotides)
  • SOLiD: Sequencing by ligation (ligase joins short probes)

How It Worked (Simplified)

  1. DNA fragments attached to beads (emulsion PCR, like Ion Torrent)
  2. Fluorescent probes (short 8-base oligonucleotides) compete to bind
  3. DNA ligase joins the matching probe to the primer
  4. Detect fluorescence to identify which probe bound
  5. Cleave probe, move to next position
  6. Repeat with different primers to read the sequence

Key concept: Instead of building a complementary strand one nucleotide at a time, SOLiD interrogated the sequence using short probes that bind and get ligated.

Why It's Dead (or Nearly Dead)

Advantages that didn't matter enough:

  • Very high accuracy (>99.9% after two-base encoding)
  • Error detection built into chemistry

Fatal disadvantages:

  1. Complex bioinformatics - two-base encoding required specialized tools
  2. Long run times - 7-14 days per run (vs. hours for Ion Torrent, 1-2 days for Illumina)
  3. Expensive - high cost per base
  4. Company pivot - Life Technologies acquired Ion Torrent and shifted focus there

The market chose: Illumina won on simplicity and throughput, Ion Torrent won on speed.

What You Should Remember

1. Different chemistry - Ligation-based, not synthesis-based

2. Two-base encoding - Clever error-checking mechanism, but added complexity

3. Historical importance - Showed alternative approaches to NGS were possible

4. Why it failed - Too slow, too complex, company shifted to Ion Torrent

5. Legacy - Some older papers used SOLiD data; understanding the platform helps interpret those results


The Bottom Line

SOLiD was an interesting experiment in using ligation chemistry for sequencing. It achieved high accuracy through two-base encoding but couldn't compete with faster, simpler platforms.

Why learn about it?

  • Understand the diversity of approaches to NGS
  • Interpret older literature that used SOLiD
  • Appreciate why chemistry simplicity matters (Illumina's success)

You won't use it, but knowing it existed helps you understand the evolution of sequencing technologies and why certain platforms won the market.


Illumina Sequencing

Illumina is the dominant next-generation sequencing platform worldwide. It uses reversible terminator chemistry and fluorescent detection to sequence millions of DNA fragments simultaneously with high accuracy.


The Chemistry: Reversible Terminators

The Core Principle

Unlike Ion Torrent (which detects H⁺ ions), Illumina detects fluorescent light from labeled nucleotides.

Key innovation: Reversible terminators

Normal dNTP:

  • Has 3'-OH group
  • Polymerase adds it and continues to next base

Reversible terminator (Illumina):

  • Has 3'-OH blocked by a chemical group
  • Has fluorescent dye attached
  • Polymerase adds it and stops
  • After imaging, the block and dye are removed
  • Polymerase continues to next base

Why this matters: You get exactly one base added per cycle, making base calling precise.


How It Works: Step by Step

1. Library Preparation

DNA is fragmented and adapters are ligated to both ends of each fragment.

Adapters contain:

  • Primer binding sites
  • Index sequences (barcodes for sample identification)
  • Sequences complementary to flow cell oligos

2. Cluster Generation (Bridge Amplification)

This is Illumina's signature step - amplification happens on the flow cell surface.

The flow cell:

  • Glass slide with millions of oligos attached to the surface
  • Two types of oligos (P5 and P7) arranged in a lawn

Bridge amplification process:

Step 1: DNA fragments bind to flow cell oligos (one end attaches)

Step 2: The free end bends over and binds to nearby oligo (forms a "bridge")

Step 3: Polymerase copies the fragment, creating double-stranded bridge

Step 4: Bridge is denatured (separated into two strands)

Step 5: Both strands bind to nearby oligos and repeat

Result: Each original fragment creates a cluster of ~1,000 identical copies in a tiny spot on the flow cell.

Why amplification? Like Ion Torrent, a single molecule's fluorescent signal is too weak to detect. A thousand identical molecules in the same spot produce a strong signal.

Visual representation:

Original fragment: ═══DNA═══

After bridge amplification:
║ ║ ║ ║ ║ ║ ║ ║
║ ║ ║ ║ ║ ║ ║ ║  ← ~1000 copies in one cluster
║ ║ ║ ║ ║ ║ ║ ║
Flow cell surface
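The ~1,000-copy figure falls out of simple doubling arithmetic — a back-of-envelope sketch (the exact cycle count is an assumption; real cluster generation is less than perfectly efficient):

```python
# Back-of-envelope check on cluster size: bridge amplification roughly
# doubles the strands in a cluster each cycle, so ~10 cycles already
# yields the ~1,000 copies needed for a detectable fluorescent signal.

def cluster_copies(doubling_cycles):
    return 2 ** doubling_cycles

print(cluster_copies(10))  # -> 1024
```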

3. Sequencing by Synthesis

Now the actual sequencing begins.

Cycle 1:

  1. Add fluorescent reversible terminators (all four: A, C, G, T, each with different color)
  2. Polymerase incorporates one base (only one because it's a terminator)
  3. Wash away unincorporated nucleotides
  4. Image the flow cell with laser
    • Green light = A was added
    • Blue light = C was added
    • Yellow light = G was added
    • Red light = T was added
  5. Cleave off the fluorescent dye and the 3' blocking group
  6. Repeat for next base

Cycle 2, 3, 4... 300+: Same process, one base at a time.

Key difference from Ion Torrent:

  • Illumina: All four nucleotides present at once, polymerase chooses correct one
  • Ion Torrent: One nucleotide type at a time, polymerase adds it only if it matches

Color System

Illumina instruments use one of two detection schemes:

  • 4-channel (e.g., MiSeq, HiSeq): each base carries its own dye, and four images are taken per cycle.
  • 2-channel (e.g., NextSeq, NovaSeq): only red and green dyes are used. C appears in the red image only, T in the green image only, A in both, and G is unlabeled (dark). Two images per cycle make imaging faster and the instruments cheaper.

No Homopolymer Problem

Why Illumina Handles Homopolymers Better

Remember Ion Torrent's main weakness? Homopolymers like AAAA produce strong signals that are hard to quantify (is it 3 A's or 4?).

Illumina doesn't have this problem because:

  1. One base per cycle - the terminator ensures only one nucleotide is added
  2. Direct counting - if you see 4 green signals in a row, it's exactly 4 A's
  3. No signal intensity interpretation - just presence/absence of color

Example:

Sequence: AAAA

Illumina:

Cycle 1: Green (A)
Cycle 2: Green (A)
Cycle 3: Green (A)
Cycle 4: Green (A)
→ Exactly 4 A's, no ambiguity

Ion Torrent:

Flow A: Large signal (proportional to 4 H⁺ ions)
→ Is it 4? Or 3? Or 5? (requires signal quantification)

Error Profile: Substitutions, Not Indels

Illumina's Main Error Type

Substitution errors - reading the wrong base (A instead of G, C instead of T)

Error rate: ~0.1% (1 error per 1,000 bases, or 99.9% accuracy)

Common causes:

  1. Phasing/pre-phasing - some molecules in a cluster get out of sync
  2. Dye crosstalk - fluorescent signals bleed between channels
  3. Quality degradation - accuracy decreases toward end of reads
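The error rate and accuracy figures above are two views of the same Phred quality scale, Q = -10·log10(p_error) — a quick sketch of the conversion:

```python
import math

# Phred quality scores relate per-base error probability and accuracy:
# Q = -10 * log10(p_error). An error rate of 0.1% (1 in 1,000) is Q30,
# i.e., 99.9% accuracy.

def phred(p_error):
    """Phred quality score for a given per-base error probability."""
    return -10 * math.log10(p_error)

def error_prob(q):
    """Per-base error probability for a given Phred score."""
    return 10 ** (-q / 10)

print(round(phred(0.001)))   # -> 30
print(error_prob(20))        # -> 0.01 (Q20 = 99% accuracy)
```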

Why Few Indels?

Because of the reversible terminator:

  • Exactly one base per cycle
  • Can't skip a base (would need terminator removal without incorporation)
  • Can't add two bases (terminator blocks second addition)

Comparison:

Error Type           | Illumina       | Ion Torrent
---------------------|----------------|---------------
Substitutions        | ~99% of errors | ~30% of errors
Insertions/Deletions | ~1% of errors  | ~70% of errors
Homopolymer errors   | Rare           | Common

Phasing and Pre-phasing

The Synchronization Problem

In a perfect world, all molecules in a cluster stay perfectly synchronized - all at the same base position.

Reality: Some molecules lag behind (phasing) or jump ahead (pre-phasing).

Phasing (Lagging Behind)

Cycle 1: All molecules at position 1 ✓
Cycle 2: 98% at position 2, 2% still at position 1 (incomplete extension)
Cycle 3: 96% at position 3, 4% behind...

As cycles progress, the cluster becomes a mix of molecules at different positions.

Result: Blurry signal - you're imaging multiple bases at once.

Pre-phasing (Jumping Ahead)

Cause: Incomplete removal of terminator or dye

A molecule might:

  • Have terminator removed
  • BUT dye not fully removed
  • Next cycle adds another base (now 2 bases ahead of schedule)

Impact on Quality

Early cycles (1-100): High accuracy, minimal phasing
Middle cycles (100-200): Good accuracy, some phasing
Late cycles (200-300+): Lower accuracy, significant phasing

Quality scores decline with read length. This is why:

  • Read 1 (first 150 bases) typically has higher quality than Read 2
  • Paired-end reads are used (sequence both ends, higher quality at each end)
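Why quality falls off with cycle number can be seen in a toy phasing model (the per-cycle failure rate and the independence assumption are illustrative, not Illumina's actual phasing-correction algorithm):

```python
# Toy phasing model: if a fraction p_lag of strands fails to extend in
# each cycle, the share of a cluster still perfectly in sync after
# n cycles is (1 - p_lag) ** n. As this purity drops, the imaged signal
# mixes neighboring bases and base-call quality degrades.

def in_phase_fraction(n_cycles, p_lag=0.002):
    """Fraction of cluster molecules still at the expected position."""
    return (1 - p_lag) ** n_cycles

# Signal purity decays with read length, mirroring the quality drop
# seen in late cycles:
for n in (50, 150, 300):
    print(n, round(in_phase_fraction(n), 3))
```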

Paired-End Sequencing

What Is Paired-End?

Instead of sequencing only one direction, sequence both ends of the DNA fragment.

Process:

  1. Read 1: Sequence from one end (forward direction) for 150 bases
  2. Regenerate clusters (bridge amplification again)
  3. Read 2: Sequence from the other end (reverse direction) for 150 bases

Result: Two reads from the same fragment, separated by a known distance.

Why Paired-End?

1. Better mapping

  • If one end maps ambiguously, the other might be unique
  • Correct orientation and distance constrain mapping

2. Detect structural variants

  • Deletions: reads map farther apart than expected (the sample lacks sequence present in the reference)
  • Insertions: reads map closer together than expected (the sample carries extra sequence)
  • Inversions: Wrong orientation
  • Translocations: Reads on different chromosomes

3. Improve assembly

  • Links across repetitive regions
  • Spans gaps

4. Quality assurance

  • If paired reads don't map correctly, flag as problematic
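The logic above can be sketched as a hypothetical classifier over mapped read pairs (the function name, thresholds, and expected insert size are made up for illustration; real SV callers model the empirical insert-size distribution). It follows the convention that a deletion in the sample makes the reads map farther apart on the reference than expected:

```python
# Sketch of paired-end discordance classification (hypothetical
# thresholds; illustrative only).

def classify_pair(mapped_distance, same_chrom=True, proper_orientation=True,
                  expected=500, tolerance=150):
    """Classify one read pair after mapping to a reference."""
    if not same_chrom:
        return "translocation candidate"
    if not proper_orientation:
        return "inversion candidate"
    if mapped_distance > expected + tolerance:
        return "deletion candidate"    # sample is missing reference sequence
    if mapped_distance < expected - tolerance:
        return "insertion candidate"   # sample carries extra sequence
    return "concordant"

print(classify_pair(510))                    # -> concordant
print(classify_pair(1200))                   # -> deletion candidate
print(classify_pair(510, same_chrom=False))  # -> translocation candidate
```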

Illumina Systems

Different Throughput Options

Illumina offers multiple sequencing platforms for different scales:

System   | Throughput      | Run Time    | Read Length | Best For
---------|-----------------|-------------|-------------|---------------------------------------------------
iSeq 100 | 1.2 Gb          | 9-19 hours  | 150 bp      | Small targeted panels, amplicons
MiniSeq  | 8 Gb            | 4-24 hours  | 150 bp      | Small labs, targeted sequencing
MiSeq    | 15 Gb           | 4-55 hours  | 300 bp      | Targeted panels, small genomes, amplicon seq
NextSeq  | 120 Gb          | 12-30 hours | 150 bp      | Exomes, transcriptomes, small genomes
NovaSeq  | 6,000 Gb (6 Tb) | 13-44 hours | 250 bp      | Whole genomes, large projects, population studies

Key trade-offs:

  • Higher throughput = longer run time
  • Longer reads = lower throughput or longer run time
  • Bigger machines = higher capital cost but lower cost per Gb

Advantages of Illumina

1. High Accuracy

  • 99.9% base accuracy (Q30 or higher)
  • Few indel errors
  • Reliable base calling

2. High Throughput

  • Billions of reads per run
  • Suitable for whole genomes at population scale

3. Low Cost (at scale)

  • ~$5-10 per Gb for high-throughput systems
  • Cheapest for large projects

4. Mature Technology

  • Well-established protocols
  • Extensive bioinformatics tools
  • Large user community

5. Flexible Read Lengths

  • 50 bp to 300 bp
  • Single-end or paired-end

6. Multiplexing

  • Sequence 96+ samples in one run using barcodes
  • Reduces cost per sample

Disadvantages of Illumina

1. Short Reads

  • Maximum ~300 bp (vs. PacBio: 10-20 kb)
  • Hard to resolve complex repeats
  • Difficult for de novo assembly of large genomes

2. Run Time

  • 12-44 hours for high-throughput systems
  • Longer than Ion Torrent (2-4 hours)
  • Not ideal for ultra-rapid diagnostics

3. PCR Amplification Bias

  • Bridge amplification favors certain sequences
  • GC-rich or AT-rich regions may be underrepresented
  • Some sequences difficult to amplify

4. Equipment Cost

  • NovaSeq: $850,000-$1,000,000
  • High upfront investment
  • Requires dedicated space and trained staff

5. Phasing Issues

  • Quality degrades with read length
  • Limits maximum usable read length

When to Use Illumina

Ideal Applications

Whole Genome Sequencing (WGS)

  • Human, animal, plant genomes
  • Resequencing (alignment to reference)
  • Population genomics

Whole Exome Sequencing (WES)

  • Capture and sequence only coding regions
  • Clinical diagnostics
  • Disease gene discovery

RNA Sequencing (RNA-seq)

  • Gene expression profiling
  • Transcript discovery
  • Differential expression analysis

ChIP-Seq / ATAC-Seq

  • Protein-DNA interactions
  • Chromatin accessibility
  • Epigenomics

Metagenomics

  • Microbial community profiling
  • 16S rRNA sequencing
  • Shotgun metagenomics

Targeted Panels

  • Cancer hotspot panels
  • Carrier screening
  • Pharmacogenomics

Not Ideal For

  • Long-range phasing (use PacBio or Oxford Nanopore)
  • Structural variant detection (short reads struggle with large rearrangements)
  • Ultra-rapid turnaround (use Ion Torrent for speed)
  • De novo assembly of repeat-rich genomes (long reads better)


Illumina vs Ion Torrent: Summary

Feature      | Illumina                           | Ion Torrent
-------------|------------------------------------|----------------------------
Detection    | Fluorescence                       | pH (H⁺ ions)
Chemistry    | Reversible terminators             | Natural (unmodified) dNTPs
Read length  | 50-300 bp                          | 200-400 bp
Run time     | 12-44 hours (high-throughput)      | 2-4 hours
Accuracy     | 99.9%                              | 98-99%
Main error   | Substitutions                      | Indels (homopolymers)
Homopolymers | No problem                         | Major issue
Throughput   | Up to 6 Tb (NovaSeq)               | Up to 15 Gb
Cost per Gb  | $5-10 (at scale)                   | $50-100
Best for     | Large projects, WGS, high accuracy | Targeted panels, speed

The Bottom Line

Illumina is the workhorse of genomics. It's not the fastest (Ion Torrent), not the longest reads (PacBio/Nanopore), but it hits the sweet spot of:

  • High accuracy
  • High throughput
  • Reasonable cost
  • Mature ecosystem

For most genomic applications - especially resequencing, RNA-seq, and exomes - Illumina is the default choice.

The main limitation is short reads. For applications requiring long-range information (phasing variants, resolving repeats, de novo assembly), you'd combine Illumina with long-read technologies or use long-read platforms alone.

Key takeaway: Illumina's reversible terminator chemistry elegantly solves the homopolymer problem by ensuring exactly one base per cycle, trading speed (longer run time) for accuracy (99.9%).



Nanopore Sequencing

Overview

Oxford Nanopore uses tiny protein pores embedded in a membrane to read DNA directly - no amplification, no fluorescence.


How It Works

The Setup: Membrane with Nanopores

A membrane separates two chambers with different electrical charges. Embedded in the membrane are protein nanopores - tiny holes just big enough for single-stranded DNA to pass through.

     Voltage applied across membrane
              ─────────────
                   ↓
    ════════════╤═════╤════════════  ← Membrane
                │ ◯ ◯ │              ← Nanopores
    ════════════╧═════╧════════════
                   ↑
              DNA threads through

The Detection: Measuring Current

  1. DNA strand is fed through the pore by a motor protein
  2. As each base passes through, it partially blocks the pore
  3. Each base (A, T, G, C) has a different size/shape
  4. Different bases create different electrical resistance
  5. We measure the change in current to identify the base

Key insight: No labels, no cameras, no lasers - just electrical signals!


The Signal: It's Noisy

The raw signal is messy - multiple bases in the pore at once, random fluctuations:

Current
   │
   │ ▄▄▄   ▄▄    ▄▄▄▄   ▄▄   ▄▄▄
   │█   █▄█  █▄▄█    █▄█  █▄█   █▄▄
   │
   └───────────────────────────────── Time
   
   Base: A  A  T  G   C  C  G  A

Machine learning (neural networks) decodes this noisy signal into base calls.
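A toy model shows why decoding is hard: the measured current depends on the several bases inside the pore at once (a k-mer), not on one base. The 3-mer table and current values below are made-up illustrative numbers, not real pore measurements:

```python
# Toy nanopore signal model: one current level per k-mer as the strand
# ratchets through the pore. Hypothetical mean currents (pA) for a few
# 3-mers; real pores use longer k-mers and noisy, overlapping levels,
# which is why neural-network basecallers are needed.

KMER_CURRENT = {
    "ATG": 85.0, "TGC": 72.5, "GCC": 60.1, "CCA": 78.3, "CAA": 90.2,
}

def signal_for(seq, k=3):
    """Emit the idealized current trace for a sequence."""
    return [KMER_CURRENT[seq[i:i + k]] for i in range(len(seq) - k + 1)]

print(signal_for("ATGCCAA"))
```

Inverting this mapping in the presence of noise and repeated levels (e.g., homopolymers produce runs of the same current) is the basecalling problem.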


Why Nanopore?

Ultra-Long Reads

  • Typical: 10-50 kb
  • Record: >4 Mb (yes, megabases!)
  • Limited only by DNA fragment length, not the technology

Cheap and Portable

  • MinION device fits in your hand, costs ~$1000
  • Can sequence in the field (disease outbreaks, remote locations)
  • Real-time data - see results as sequencing happens

Direct Detection

  • Can detect modified bases (methylation) directly
  • No PCR amplification needed
  • Can sequence RNA directly (no cDNA conversion)

Error Rate and Correction

Raw accuracy: ~93-97% (improving with each update)

Error type: Mostly indels, especially in homopolymers

Improving Accuracy

1. Higher coverage: Multiple reads of the same region, errors cancel out

2. Duplex sequencing: DNA is double-stranded - sequence both strands and combine:

Forward strand:  ATGCCCAAA
                 |||||||||
Reverse strand:  TACGGGTTT  (complement)

→ Consensus: Higher accuracy

3. Better basecallers: Neural networks keep improving, accuracy increases with software updates

PacBio Sequencing

Overview

PacBio (Pacific Biosciences) uses SMRT sequencing (Single Molecule Real-Time) to produce long reads - often 10,000 to 25,000+ base pairs.



How It Works

The Setup: ZMW (Zero-Mode Waveguide)

PacBio uses tiny wells called ZMWs - holes so small that light can only illuminate the very bottom.

At the bottom of each well:

  • A single DNA polymerase is fixed in place
  • A single DNA template is threaded through it

The Chemistry: Real-Time Detection

  1. Fluorescent nucleotides (A, T, G, C - each with different color) float in solution
  2. When polymerase grabs the correct nucleotide, it holds it in the detection zone
  3. Laser detects the fluorescence - we see which base is being added
  4. Polymerase incorporates the nucleotide, releases the fluorescent tag
  5. Repeat - watching DNA synthesis in real-time

Key difference from Illumina: We watch a single molecule of polymerase working continuously, not millions of molecules in sync.


Why Long Reads?

The circular template trick:

PacBio uses SMRTbell templates - DNA with hairpin adapters on both ends, forming a circle.

    ╭──────────────╮
    │              │
────┤   Template   ├────
    │              │
    ╰──────────────╯

The polymerase goes around and around, reading the same template multiple times.


Error Correction: Why High Accuracy?

Raw reads have ~10-15% error rate (mostly insertions/deletions)

But: Because polymerase circles the template multiple times, we get multiple reads of the same sequence.

CCS (Circular Consensus Sequencing):

  • Align all passes of the same template
  • Errors are random, so they cancel out
  • Result: >99.9% accuracy (HiFi reads)
Pass 1:  ATGC-CCAAA
Pass 2:  ATGCCC-AAA
Pass 3:  ATGCCCAAAA
Pass 4:  ATGCCC-AAA
         ──────────
Consensus: ATGCCCAAA  ✓
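The passes-to-consensus step above can be sketched as a per-column majority vote (a minimal sketch: real CCS/HiFi calling uses a probabilistic model, and the gap placement from a multiple alignment is assumed given here):

```python
from collections import Counter

# Minimal circular-consensus sketch: majority vote over the columns of
# already-aligned passes; columns where the gap wins are dropped.

def consensus(aligned_passes):
    out = []
    for column in zip(*aligned_passes):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            out.append(base)
    return "".join(out)

passes = [
    "ATGC-CCAAA",
    "ATGCCC-AAA",
    "ATGCCCAAAA",
    "ATGCCC-AAA",
]
print(consensus(passes))  # -> ATGCCCAAA
```

Because each pass's errors land in different columns, the vote recovers the true sequence even though every individual pass is imperfect.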

When to Use PacBio

Ideal for:

  • De novo genome assembly
  • Resolving repetitive regions
  • Detecting structural variants
  • Full-length transcript sequencing
  • Phasing haplotypes

Not ideal for:

  • Large-scale population studies (cost)
  • When short reads are sufficient

Lecture 1 — Introduction: Foundational Genetics & Genomics Concepts

📝Lecture 1 — Concepts in Genetics & Genomics
Q1 Easy
Which branch of genetics uses statistical models to estimate the genetic contribution to variation in traits controlled by multiple genes?
A. Classical genetics
B. Molecular genetics
C. Quantitative genetics
D. Population genetics
Explanation
Quantitative genetics analyzes traits controlled by multiple genes (e.g., height, milk production) and uses statistical models to estimate the genetic contribution to phenotypic variation. Population genetics studies allele frequency changes, classical genetics focuses on Mendelian inheritance patterns, and molecular genetics investigates gene structure/function at the DNA/RNA/protein level.
Q2 Medium
Which domain of genetics links genetics with evolutionary biology by studying the distribution and change of allele frequencies?
A. Molecular genetics
B. Population genetics
C. Quantitative genetics
D. Classical genetics
Explanation
Population genetics studies the distribution and change of allele frequencies within populations, directly linking genetics with evolutionary biology. Don't confuse it with quantitative genetics, which focuses on multi-gene traits and variance components.
Q3 Tricky
According to the lecture, which domain of genetics tends to be less central in genomics projects that focus on individual genes or regions?
A. Classical genetics
B. Molecular genetics
C. Population genetics
D. Quantitative genetics
Explanation
The lecture specifically states: "Quantitative genetics, while important for understanding polygenic traits, tends to be less central in genomics projects that focus on individual genes or regions." This is a subtle point easily overlooked. Classical and molecular genetics form the "foundational pillars" of applied genomics.
Q4 Easy
Classical genetics was developed before the molecular nature of DNA was understood. What did it primarily rely on to infer genetic laws?
A. Phenotypic traits and breeding analysis
B. DNA sequencing and molecular markers
C. Statistical models of allele frequencies
D. Protein expression analysis
Explanation
Classical genetics focused on observing phenotypic traits (visible differences such as eye color or body shape) and inferring rules of inheritance through carefully planned breeding experiments and offspring analysis. DNA sequencing came much later.
Q5 Medium
What is the relationship between the distance of two genes on the same chromosome and the probability of crossing over?
A. Greater distance = lower probability of crossing over
B. Greater distance = higher probability of crossing over
C. Distance has no effect on crossing over frequency
D. Only genes on different chromosomes undergo crossing over
Explanation
Genes that are farther apart on the same chromosome have a higher probability of recombination (crossing over) during meiosis. Genes close together tend to be inherited together because a crossover event between them is less likely. This principle is the foundation of genetic mapping.
Q6 Tricky
Why is the distance between genes on the same chromosome important when planning a genomics project, according to Professor Fontanesi?
A. It determines the total genome size
B. It affects protein expression levels
C. It is related to the level of recombination
D. It determines the mutation rate
Explanation
The lecture explicitly notes: "Fontanesi suggests looking at the distance between hereditary elements on the same chromosome when planning a project, because distance is related to the level of recombination." This is practical advice for designing genomics experiments — closely linked loci behave differently from distant ones.
Q7 Medium
Why is polyploidy a challenge in applied genomics?
A. Polyploid organisms cannot reproduce sexually
B. Polyploid organisms have smaller genomes that are harder to detect
C. Polyploidy eliminates crossing over during meiosis
D. Multiple gene copies make it harder to determine which copy is responsible for a given trait
Explanation
In polyploid organisms (tetraploid, hexaploid, etc.), each gene may exist in multiple copies. This makes it significantly harder to determine which copy is responsible for a given trait, complicating both experimental design and data analysis. It may even make it impossible to resolve which part of the DNA corresponds to which parental genome.
Q8 Easy
A hexaploid organism has how many sets of chromosomes?
A. Six
B. Four
C. Three
D. Two
Explanation
Hexaploid = six sets of chromosomes. Diploid = two, tetraploid = four. Polyploidy is common in plants (many crops are tetraploid or hexaploid) and creates complexity in genomic analysis.
Q9 Tricky
When planning a genomics project on cattle, which factor most limits the speed of experimental progress?
A. Large genome size
B. Long reproductive cycle and generation time
C. High ploidy level
D. Lack of available reference genomes
Explanation
Classical genetics relies on generational cycles, making time a key constraint. Cattle have a long reproductive cycle, so it's not feasible to expect rapid results. Cattle are diploid (not polyploid), and reference genomes exist. The lecture specifically highlights this as a practical consideration.
Q10 Easy
What is the genotypic ratio in the F2 generation of a monohybrid cross between two heterozygous individuals (Tt × Tt)?
A. 3:1
B. 1:1
C. 1:2:1
D. 9:3:3:1
Explanation
The genotypic ratio from Tt × Tt is 1 TT : 2 Tt : 1 tt = 1:2:1. The 3:1 ratio is the phenotypic ratio (3 tall : 1 dwarf) — a common trap! The 9:3:3:1 ratio applies to a dihybrid cross.
Q11 Medium
In Mendel's dihybrid cross (RrYy × RrYy), the F2 phenotypic ratio 9:3:3:1 is observed. What key condition must be true for this ratio to appear?
A. Both genes must show incomplete dominance
B. Both genes must be on the same chromosome
C. One gene must be epistatic to the other
D. The two genes must assort independently (unlinked)
Explanation
The 9:3:3:1 ratio only appears when the genes are on different chromosomes or far enough apart on the same chromosome to assort independently. The lecture notes that "Mendel was lucky" — the two traits he studied were on different chromosomes. If genes were linked (close together on the same chromosome), the ratio would deviate.
Q12 Tricky
Why did the lecture describe Mendel as "lucky" in his experimental design for studying independent assortment?
A. The traits he studied happened to be on different chromosomes
B. He used a species with unusually short generation time
C. All his traits showed complete dominance without exceptions
D. Pea plants are polyploid, giving clearer segregation patterns
Explanation
The lecture explicitly states Mendel was lucky because the two traits he studied (seed shape and seed color) were on different chromosomes, allowing them to assort independently. If they had been linked (close together on the same chromosome), the 9:3:3:1 ratio would not have appeared, making the underlying pattern harder to detect. Pea plants are diploid, not polyploid.
Q13 Medium
If the tall (T) and dwarf (t) alleles in Mendel's pea plants had shown incomplete dominance instead of complete dominance, what phenotype would heterozygous (Tt) plants display?
A. Tall
B. Medium height (intermediate)
C. Dwarf
D. Both tall and dwarf simultaneously
Explanation
The lecture states: "If the 'tall' and 'dwarf' alleles had shown incomplete dominance, Mendel would have observed pea plants with medium height, somewhere between the tall and dwarf phenotypes." Incomplete dominance produces an intermediate phenotype, while codominance (D) would show both phenotypes fully expressed (like AB blood type).
Q14 Easy
According to Mendel's Law of Segregation, how many alleles does each gamete carry for a given gene?
A. Two — one from each parent
B. It depends on the ploidy level
C. One
D. Two identical copies
Explanation
The Law of Segregation states that during gametogenesis, each individual's pair of alleles segregates, meaning only ONE allele is passed into each gamete. A diploid individual carries two alleles per gene, but each gamete receives only one.
Q15 Medium
A plant with genotype TT is grown in a physically restricted environment. What is the expected outcome?
A. It will always reach full tall height because it is homozygous dominant
B. It will become dwarf because the environment overrides the genotype
C. Its genotype will change to Tt due to environmental pressure
D. It may not reach its full height despite having the genetic potential for tallness
Explanation
The lecture states that a plant with the tall allele "grown in a physically restricted environment may not reach its full height." This illustrates environmental effects on phenotypic expression — the genotype doesn't change (C is wrong), but the phenotype can be modified. The environment doesn't override genetics completely (B), but it can limit expression (D is correct).
Q16 Easy
Which blood type is an example of codominance?
A. AB blood type
B. Type O blood
C. Type A blood (heterozygous)
D. Rh-negative blood
Explanation
The AB blood type is the classic example of codominance mentioned in the lecture — both A and B alleles are fully expressed in the heterozygote. Neither allele masks the other.
Q17 Easy
In a pedigree, what does a half-filled symbol represent?
A. An affected individual
B. A carrier of a trait (usually recessive)
C. A deceased individual
D. An individual of unknown sex
Explanation
A half-filled symbol = carrier of a trait. Fully filled = affected individual. A rhombus represents unknown/unspecified sex. These symbols are standard in pedigree notation.
Q18 Medium
Why does inbreeding increase the expression of recessive traits in a population?
A. It increases the mutation rate at recessive loci
B. It changes recessive alleles into dominant alleles
C. It increases the probability of homozygosity through identical-by-descent alleles
D. It eliminates dominant alleles from the population
Explanation
Inbred individuals may inherit two alleles that are identical by descent (IBD) — both alleles come from a common ancestor. IBD alleles increase the probability of homozygosity, making recessive traits more likely to be expressed. Inbreeding doesn't change alleles or increase mutation rates.
Q19 Medium
Which of the following is NOT a limitation of pedigree analysis in human studies?
A. Low reproductive rate
B. Uncontrolled matings
C. Long generation time
D. Inability to trace monogenic diseases
Explanation
The three limitations listed in the lecture are: low reproductive rate, uncontrolled matings, and long generation time (A, B, C). However, pedigree analysis remains a core tool for tracing monogenic diseases, identifying carriers, and diagnosing X-linked or mitochondrial disorders — so D is not a limitation; it's actually a strength.
Q20 Easy
In a PLINK .ped file, what does column 6 represent?
A. Sex of the individual
B. Phenotype (usually 1 = control, 2 = case)
C. Maternal ID
D. Allele data for the first SNP
Explanation
In the PLINK .ped file: col 1 = Family ID, col 2 = Individual ID, col 3 = Paternal ID, col 4 = Maternal ID, col 5 = Sex (1=M, 2=F), col 6 = Phenotype (1=control, 2=case, -9/0=missing), col 7+ = allele data. Sex is column 5, not 6.
Q21 Tricky
In a PLINK .ped file, the sex column uses the coding: 1 = Male, 2 = Female, 0 = Unknown. What value is used for missing phenotype information?
A. 0 only
B. -1
C. -9 or 0
D. NA
Explanation
Missing phenotype data in PLINK is coded as -9 or 0. This is tricky because 0 is used for unknown sex AND can indicate missing phenotype. In the sex column: 0 = unknown sex. In the phenotype column: 1 = control, 2 = case, -9/0 = missing. The .fam file description specifies: '-9'/'0'/non-numeric = missing data for case/control.
Q22 Medium
Which PLINK file specifies chromosome number, SNP ID, genetic distance, and physical base-pair position?
A. .map or .bim file
B. .ped file
C. .fam file
D. .bed file
Explanation
The .map (or .bim in binary format) file contains genomic position information: chromosome number, SNP ID, genetic distance, and physical base-pair position. The .ped file contains individual-level metadata and genotypes. The .fam file contains sample information (first 6 columns of .ped). The .bed file is binary genotype data.
Q23 Medium
Which of the following is NOT a listed application of PLINK?
A. Running quality control on genomic data
B. Studying population structure
C. Identifying genetic variants associated with diseases
D. Performing de novo genome assembly
Explanation
PLINK's applications include: identifying variants associated with diseases, analyzing heritability, studying population structure, running QC (missingness, heterozygosity, HWE), and feeding data into advanced models. De novo genome assembly is NOT a PLINK function — that requires specialized assemblers like SPAdes, Canu, or similar tools.
Q24 Easy
During which phase of mitosis are chromosomes best visualized in cytogenetics?
A. Interphase
B. Metaphase
C. Anaphase
D. Telophase
Explanation
Chromosomes are best visualized during metaphase because they are in their most condensed state. This makes them visible under microscopy and allows staining techniques (Giemsa, FISH) to distinguish individual chromosomes.
Q25 Medium
Even with whole genome sequencing available, cytogenetics still plays a crucial role in which of the following?
A. Identifying SNPs at single-base resolution
B. Sequencing mitochondrial genomes
C. Diagnosing chromosomal abnormalities and genome assembly validation
D. Measuring allele frequencies in populations
Explanation
The lecture lists cytogenetics' current roles as: clinical genetics (diagnosing syndromes like Down syndrome), cancer genomics (chromosomal rearrangements), evolutionary biology (comparing karyotypes), and genome assembly validation. SNP identification and allele frequency measurement are done with sequencing/genotyping tools, not cytogenetics.
Q26 Easy
What does linkage disequilibrium (LD) describe?
A. The non-random association of alleles at two or more loci in a population
B. The physical distance between two genes in base pairs
C. The random segregation of alleles during meiosis
D. The linkage between a gene and its protein product
Explanation
LD refers to the non-random association of alleles at two or more loci — certain allele combinations occur together more (or less) frequently than expected if they were independent. This is a key concept for GWAS and population genomics.
Q27 Tricky
Which statement about linkage disequilibrium (LD) is correct?
A. LD is the same thing as physical linkage
B. LD can only exist between loci on the same chromosome
C. Recombination increases LD over time
D. LD ≠ physical linkage, but physical linkage contributes to LD
Explanation
The lecture explicitly states: "LD ≠ physical linkage, but physical linkage contributes to LD." LD can also be caused by population structure, genetic drift, selection, and population admixture — factors that can create LD even between loci on different chromosomes. Recombination breaks down LD (not increases it).
Q28 Medium
Which of the following is NOT listed as a cause of linkage disequilibrium?
A. Physical proximity of loci on the same chromosome
B. High recombination rates between loci
C. Genetic drift and small population sizes
D. New mutations arising on specific genetic backgrounds
Explanation
High recombination rates BREAK DOWN LD, they do not cause it. Causes of LD include: physical proximity (low recombination), small population sizes, genetic drift, selection, population admixture, and new mutations on specific backgrounds.
Q29 Medium
In the LD numerical example, if P(A) = 0.5 and P(B) = 0.5, what is the expected frequency of haplotype AB under linkage equilibrium?
A. 0.25
B. 0.50
C. 0.40
D. 0.10
Explanation
Under linkage equilibrium (independence), P(AB) = P(A) × P(B) = 0.5 × 0.5 = 0.25. The observed frequency was 0.4, which is higher than expected, indicating LD. The difference between observed and expected haplotype frequencies is the hallmark of LD.
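The arithmetic in this example is small enough to do by hand, but it generalizes nicely; here's a minimal Python sketch using the lecture's numbers, with the difference expressed as the standard LD statistic D (the variable names are my own):

```python
# Quantify LD as D = observed haplotype frequency minus the frequency
# expected under independence. Numbers are from the lecture's example.
p_A, p_B = 0.5, 0.5          # allele frequencies at the two loci
obs_AB = 0.40                # observed frequency of the AB haplotype

exp_AB = p_A * p_B           # expected under linkage equilibrium
D = obs_AB - exp_AB          # D != 0 indicates linkage disequilibrium

print(exp_AB)                # 0.25
print(round(D, 2))           # 0.15 -> AB occurs more often than expected
```

A positive D here just restates the explanation: the observed 0.4 exceeds the expected 0.25, so the loci are in LD.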
Q30 Medium
The lactose tolerance haplotype block near the LCT gene has been maintained in high LD. What evolutionary force is responsible?
A. Genetic drift
B. Random mutation
C. Positive selection
D. Gene flow
Explanation
The lecture states that SNPs near the LCT gene "form a haplotype block that has been maintained due to positive selection (people with this haplotype digest lactose better)." Positive selection favors beneficial alleles and their linked variants, preserving the LD structure.
Q31 Easy
What does 1 centimorgan (cM) represent?
A. 1 million base pairs of DNA
B. A 10% chance of recombination per generation
C. The distance equal to one gene length
D. A 1% chance of recombination per generation between two loci
Explanation
1 centimorgan (cM) = 1% chance of recombination between two loci per generation. On average, 1 out of 100 meioses will result in recombination between loci that are 1 cM apart. Named after Thomas Hunt Morgan. Note: 1 cM does NOT necessarily equal 1 Mb — the relationship between genetic and physical distance varies across the genome.
Q32 Hard
Two loci are 50 cM apart on the same chromosome. How do they behave in terms of inheritance?
A. They assort independently, as if on different chromosomes
B. They are always inherited together
C. They recombine 100% of the time
D. They cannot be mapped using recombination frequency
Explanation
50 cM is the maximum recombination frequency. Loci ≥50 cM apart recombine 50% of the time, which is the same as independent assortment (as if on different chromosomes). They are NOT linked at this distance. Note: 50 cM ≠ 100% recombination; the maximum observable recombination frequency is 50%.
Q33 Medium
In the ZW sex determination system (birds), which sex is heterogametic?
A. Males (ZW)
B. Both sexes equally
C. Females (ZW)
D. Neither — sex is determined by environment
Explanation
In birds and some reptiles (ZW system): Males = ZZ (homogametic), Females = ZW (heterogametic). This is the opposite of mammals where males (XY) are heterogametic. Don't mix them up!
Q34 Tricky
What is the pseudoautosomal region (PAR)?
A. A region on autosomes that behaves like a sex chromosome
B. The only region of sex chromosomes where crossing over occurs during meiosis
C. A duplicated region found on all chromosomes
D. A region that determines sex in the ZW system
Explanation
The PAR is the only part of the sex chromosomes that acts like an autosome and allows crossing over during meiosis. It's located at the tips of the X and Y chromosomes. Option A is a clever distractor — PAR is a region of sex chromosomes that behaves like autosomes, not the other way around.
Q35 Hard
In Hymenoptera (e.g., honeybees), males are haploid and females are diploid. How are males produced?
A. From fertilized eggs with a special sex-determining gene
B. From eggs exposed to high temperature during development
C. From fertilized eggs that lose one set of chromosomes
D. From unfertilized eggs
Explanation
In the haplo-diploid system of Hymenoptera: males develop from unfertilized eggs and are haploid (one set of chromosomes), while females develop from fertilized eggs and are diploid (two sets). This is a unique sex determination mechanism distinct from XY and ZW systems.
Q36 Medium
Which of the following is NOT an assumption of Hardy-Weinberg Equilibrium?
A. Infinitely large population size
B. Random mating
C. Overlapping generations
D. No mutation, migration, or selection
Explanation
HWE requires that generations do NOT overlap (parents do not mate with offspring). "Overlapping generations" violates the HWE assumptions. The full list: diploid, sexual reproduction, non-overlapping generations, random mating, infinite population, equal allele frequencies between sexes, no evolutionary forces.
Q37 Easy
If the frequency of allele A is p = 0.6 and allele a is q = 0.4, what is the expected frequency of heterozygotes (Aa) under Hardy-Weinberg equilibrium?
A. 0.48
B. 0.36
C. 0.24
D. 0.16
Explanation
Under HWE, heterozygote frequency = 2pq = 2 × 0.6 × 0.4 = 0.48. AA = p² = 0.36, aa = q² = 0.16. Check: 0.36 + 0.48 + 0.16 = 1.00 ✓
Q38 Tricky
In the classroom experiment, 21 students were assigned genotypes: 8 AA, 6 AB, 7 BB. Compared to HWE expectations, the number of heterozygotes is:
A. Higher than expected — suggesting outbreeding
B. Exactly as expected — population is in HWE
C. Cannot be determined without knowing allele frequencies
D. Lower than expected — consistent with inbreeding or non-random mating
Explanation
From the data: p(A) = (2×8 + 6)/42 ≈ 0.52, q(B) ≈ 0.48. Expected heterozygotes = 2pq × 21 = 2(0.52)(0.48)(21) ≈ 10.5. Observed = 6. So there are fewer heterozygotes than expected, which the lecture attributes to "random assignments (not truly random mating), small sample size, and inbreeding, which increases the proportion of homozygous genotypes."
Q39 Medium
What is the difference between dominance and epistasis?
A. Dominance is between populations; epistasis is within populations
B. Dominance is interaction between alleles at the same gene; epistasis is interaction between alleles at different genes
C. Dominance is always complete; epistasis can be partial
D. Dominance affects phenotype; epistasis only affects genotype
Explanation
Dominance (intragenic interaction) occurs between alleles of the SAME gene. Epistasis (intergenic interaction) occurs between DIFFERENT genes. Both affect phenotype. Dominance is not always complete — it can be incomplete or show codominance.
Q40 Easy
A heritability (h²) value of 0.65 for height would be classified as:
A. Low heritability
B. Medium heritability
C. High heritability
D. Heritability cannot exceed 0.5
Explanation
Heritability ranges: Low < 0.1, Medium = 0.1–0.4, High > 0.4. A value of 0.65 is high heritability. The lecture notes stature has h² ≈ 0.5–0.7, which is classified as high. Heritability can range from 0 to 1.
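The thresholds above are easy to misremember under exam pressure, so here they are as a tiny Python helper (the cutoffs are the lecture's; the function itself is just my sketch):

```python
# Classify a heritability estimate using the lecture's thresholds:
# low < 0.1, medium 0.1-0.4, high > 0.4. h2 is bounded by [0, 1].
def classify_h2(h2: float) -> str:
    if not 0.0 <= h2 <= 1.0:
        raise ValueError("heritability must lie between 0 and 1")
    if h2 < 0.1:
        return "low"
    if h2 <= 0.4:
        return "medium"
    return "high"

print(classify_h2(0.65))   # high  (e.g. stature, h2 ~ 0.5-0.7)
print(classify_h2(0.05))   # low
```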
Q41 Easy
Who coined the term "genome" and in what year?
A. Thomas Hunt Morgan, 1910
B. Hans Winkler, 1920
C. Gregor Mendel, 1866
D. Francis Collins, 2003
Explanation
The term "genome" was defined in 1920 by Hans Winkler as the set of genes in a haploid set of chromosomes. Today the term encompasses all DNA in a cell (nuclear, mitochondrial, etc.).
Q42 Medium
How many genomes do plants typically have?
A. One (nuclear)
B. Two (nuclear + mitochondrial)
C. Two (nuclear + chloroplast)
D. Three (nuclear + mitochondrial + chloroplast)
Explanation
The lecture states "Plants typically have 3 genomes": nuclear, mitochondrial, and chloroplast. Animals have 2 (nuclear + mitochondrial). This is an easy detail to overlook.
Q43 Tricky
The Human Genome Project required an accuracy standard of fewer than:
A. 1 error per 10,000 bases
B. 1 error per 1,000 bases
C. 1 error per 100,000 bases
D. 1 error per 1,000,000 bases
Explanation
The lecture states the HGP required "an accuracy standard of fewer than one error per 10,000 bases." This is a specific number from the slides that a professor might test.
Q44 Medium
Which of the following is an example of an extremophile from the Archaea domain?
A. Escherichia coli
B. Saccharomyces cerevisiae
C. Methanogens
D. Drosophila melanogaster
Explanation
The lecture lists Archaea as extremophiles including: Thermophiles (heat-loving), Halophiles (salt-loving), and Methanogens (methane-producing). E. coli is a bacterium, yeast is a eukaryote, and Drosophila is an insect.
Q45 Medium
What is the Whole Genome Shotgun (WGS) approach?
A. Sequencing individual chromosomes one at a time
B. Random sequencing of DNA to reconstruct full genomes without prior knowledge of DNA location
C. Sequencing only protein-coding regions of the genome
D. Targeted sequencing of specific disease-associated genes
Explanation
WGS involves random sequencing of DNA fragments to reconstruct the full genome without needing prior knowledge of where each fragment comes from. This is a core sequencing strategy covered throughout the course.
Q46 Tricky
Which of the following metadata can sometimes be inferred from genomic data by comparing sequences to annotated reference datasets?
A. Sex
B. Stature
C. Diet
D. Location of sample collection
Explanation
The lecture states that some features like sex "can sometimes be inferred by comparing sequences to annotated reference datasets" (e.g., by checking X/Y chromosome coverage), while others like stature "are harder to predict purely from genomic data." Stature is a complex trait influenced by many genes and environment.
Q47 — Open Short Answer
List the four main domains of genetics and briefly describe the focus of each.
✓ Model Answer

1. Classical Genetics (Transmission/Formal Genetics): Focuses on how traits are passed from parents to offspring using breeding experiments and phenotypic analysis to infer genetic laws (e.g., Mendel's laws).

2. Molecular Genetics: Investigates the structure and function of genes at the molecular level (DNA, RNA, proteins), including gene expression, mutation, and gene regulation.

3. Population Genetics: Studies the distribution and change of allele frequencies within populations, linking genetics with evolutionary biology.

4. Quantitative Genetics: Analyzes traits controlled by multiple genes using statistical models to estimate genetic contribution to phenotypic variation (e.g., height, milk production).

Q48 — Open Calculation
In a population of 200 individuals, you observe the following genotypes: 90 AA, 40 Aa, 70 aa. Calculate the allele frequencies of A and a, determine the expected genotype frequencies under HWE, and state whether this population is in Hardy-Weinberg equilibrium.
✓ Model Answer

Step 1: Calculate allele frequencies

Total alleles = 200 × 2 = 400
Copies of A = (90 × 2) + (40 × 1) = 180 + 40 = 220
p = freq(A) = 220 / 400 = 0.55
q = freq(a) = 1 − 0.55 = 0.45

Step 2: Expected genotype frequencies under HWE

AA = p² = 0.55² = 0.3025 → expected count = 0.3025 × 200 = 60.5
Aa = 2pq = 2 × 0.55 × 0.45 = 0.495 → expected count = 0.495 × 200 = 99
aa = q² = 0.45² = 0.2025 → expected count = 0.2025 × 200 = 40.5

Step 3: Comparison

Observed: 90 AA, 40 Aa, 70 aa
Expected: 60.5 AA, 99 Aa, 40.5 aa

There is a large excess of homozygotes and a deficit of heterozygotes (40 observed vs. 99 expected). The population is NOT in HWE. This deviation could be caused by inbreeding, population structure, selection, or non-random mating.
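The three steps above are exactly the kind of arithmetic worth scripting once so you can re-check any dataset; this Python sketch just re-runs the model answer's numbers:

```python
# Hardy-Weinberg check from observed genotype counts (Q48 data).
n_AA, n_Aa, n_aa = 90, 40, 70
n = n_AA + n_Aa + n_aa                  # 200 individuals -> 400 alleles

# Step 1: allele frequencies (each AA carries 2 copies of A, each Aa one)
p = (2 * n_AA + n_Aa) / (2 * n)         # freq(A) = 220/400
q = 1 - p                               # freq(a)

# Step 2: expected counts under HWE: p^2, 2pq, q^2 times population size
exp_AA = p**2 * n
exp_Aa = 2 * p * q * n
exp_aa = q**2 * n

print(round(p, 2), round(q, 2))                                # 0.55 0.45
print(round(exp_AA, 1), round(exp_Aa, 1), round(exp_aa, 1))    # 60.5 99.0 40.5

# Step 3: heterozygote deficit (40 observed vs ~99 expected) -> not in HWE
print(n_Aa < exp_Aa)                                           # True
```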

Q49 — Open Short Answer
Explain what linkage disequilibrium (LD) is, how it differs from physical linkage, and give one real-world example mentioned in the lecture.
✓ Model Answer

Definition: Linkage disequilibrium (LD) is the non-random association of alleles at two or more loci in a population. Certain allele combinations occur together more (or less) frequently than expected under independence.

LD vs. Physical Linkage: LD ≠ physical linkage. Physical linkage refers to genes being on the same chromosome. Physical linkage contributes to LD (close genes recombine less), but LD can also arise from other forces: small population size, genetic drift, selection, population admixture, or new mutations. Conversely, physically linked genes can have low LD if enough recombination has occurred over time.

Real-world example: Lactose tolerance in humans — a variant near the LCT gene (lactose digestion) is in high LD with nearby SNPs, forming a haplotype block maintained by positive selection because individuals with this haplotype digest lactose better.

Q50 — Open Short Answer
Describe the structure of a PLINK .ped file. What information does each column contain (columns 1–7+)?
✓ Model Answer

The .ped file is a text file with no header, where each line corresponds to one individual. The columns are:

Column 1 — Family ID: Identifier for the family, used to group related individuals.

Column 2 — Individual ID: Unique identifier for each individual.

Column 3 — Paternal ID: Father's ID (0 if unknown).

Column 4 — Maternal ID: Mother's ID (0 if unknown).

Column 5 — Sex: 1 = Male, 2 = Female, 0 = Unknown.

Column 6 — Phenotype: 1 = control, 2 = case, -9 or 0 = missing.

Column 7+ — Allele data: Genotype information with two alleles per locus (e.g., A A, G T). The file can contain any number of loci, limited only by the dataset.

Genomic positions for each locus are specified in the associated .map or .bim file (chromosome, SNP ID, genetic distance, physical position).
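Since this column layout shows up in several questions, here's a minimal Python sketch of parsing one .ped record into named fields; the example line (FAM1/IND1 with two loci) is invented for illustration:

```python
# Parse one line of a PLINK .ped file into its named fields.
# The example record below is made up for illustration.
line = "FAM1 IND1 0 0 1 2 A A G T"

fields = line.split()
record = {
    "family_id":     fields[0],
    "individual_id": fields[1],
    "paternal_id":   fields[2],       # 0 = unknown father
    "maternal_id":   fields[3],       # 0 = unknown mother
    "sex":           int(fields[4]),  # 1 = male, 2 = female, 0 = unknown
    "phenotype":     fields[5],       # 1 = control, 2 = case, -9/0 = missing
    # Columns 7+ come in pairs: two alleles per locus
    "genotypes":     list(zip(fields[6::2], fields[7::2])),
}

print(record["sex"])         # 1 (male)
print(record["phenotype"])   # 2 (case)
print(record["genotypes"])   # [('A', 'A'), ('G', 'T')]
```

Note the pairing trick: slicing with `[6::2]` and `[7::2]` zips each locus's two alleles together, which mirrors the file's "two alleles per locus" layout.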

Q51 — Open Tricky
Explain the formula P = G + E in quantitative genetics. What are the three components of the genotype effect, and how does heritability relate to this equation?
✓ Model Answer

The equation: Phenotype = Genotype effect + Environmental effect, or Var(P) = Var(G) + Var(E)

Three components of the genotype effect:

1. Additive genetic effect: The sum of individual allele effects across all loci contributing to the trait.

2. Dominance effect (intragenic): Interaction between alleles at the same gene (e.g., how Tt differs from the average of TT and tt).

3. Epistatic effect (intergenic): Interaction between alleles at different genes.

Heritability (h²): Measures how much of the phenotypic variation in a population is due to genetic differences. It is calculated by comparing related individuals (since unrelated individuals don't share genetic background). Ranges: Low (<0.1), Medium (0.1–0.4), High (>0.4). It tells us what proportion of Var(P) is attributable to Var(G).

NGS Technologies I & II — Exam Practice

📝 NGS Technologies I & II — MCQ + Open Questions
Q1 Easy
Sanger dideoxy sequencing is classified as which generation of sequencing?
A. Zeroth generation sequencing
B. First generation sequencing
C. Second generation sequencing
D. Third generation sequencing
Explanation
Sanger dideoxy sequencing is explicitly classified as "first generation sequencing." NGS platforms (Illumina, Ion Torrent, 454, SOLiD) are second generation, while long-read technologies (PacBio, Nanopore) are often called third generation.
Q2 Medium
Which of the following factors is NOT listed as a key consideration when choosing a sequencing technology?
A. Error rate
B. Turnaround time
C. Number of fluorescent labels used
D. Data output
Explanation
The three key factors for choosing a sequencing technology are: error rate, turnaround time, and data output. The number of fluorescent labels is a platform-specific technical detail, not a primary decision factor.
Q3 Medium
Moore's Law states that the number of transistors on a chip doubles approximately every:
A. 6 months
B. 12 months
C. 36 months
D. 24 months
Explanation
Moore's Law states that the number of transistors on a chip doubles every 24 months (2 years). Some definitions use 18 months instead. Importantly, NGS cost reduction has outpaced even Moore's Law.
Q4 Medium
Compared to the pre-NGS era, how has the distribution of cost and effort in genomic experiments changed?
A. Cost has shifted from data production to planning and data analysis
B. Cost has shifted from data analysis to data production
C. Cost remains equally distributed among all three phases
D. Planning is no longer necessary with NGS technologies
Explanation
Before NGS, data production was the most expensive and time-consuming phase. With NGS, data production became cheap and fast, so the major costs and effort shifted to experimental planning (more crucial than ever due to the volume of data possible) and data analysis (requires more time and expertise to handle massive datasets).
Q5 Tricky
The decrease in sequencing cost over the past two decades has:
A. Closely followed Moore's Law predictions
B. Outpaced Moore's Law
C. Been slower than Moore's Law
D. Followed a linear rather than exponential trend
Explanation
The decrease in sequencing costs has outpaced Moore's Law. While Moore's Law predicts doubling of computing power roughly every 2 years, NGS technology has advanced at an even faster rate, revolutionizing data generation in biology beyond the pace seen in computer science. The cost of sequencing a human genome dropped from ~$100 million to under $1,000.
Q6 Easy
What is described as the main cause of library preparation failure?
A. Incorrect adapter sequences
B. Contamination with RNA
C. Inaccurate quantification of starting DNA
D. Use of degraded DNA polymerase
Explanation
The lecture explicitly states: "Inaccurate quantification is the main cause of library preparation failure." Accurate quantification of starting DNA is critical to ensure proper library preparation for NGS.
Q7 Easy
The Ion S5 sequencer detects nucleotide incorporation by measuring:
A. Fluorescent light emission
B. pH changes caused by H⁺ ion release
C. Changes in electrical current through a nanopore
D. Bioluminescent signal from luciferase
Explanation
Ion S5 uses semiconductor sequencing. When a nucleotide is incorporated by DNA polymerase, hydrogen ions (H⁺) are released, causing a pH drop in the well. An ion-sensitive layer detects this change and converts it into a voltage signal. This is fundamentally different from optical (fluorescence/bioluminescence) detection used by Illumina or 454.
Q8 Medium
What is the major source of error in Ion Torrent sequencing?
A. Fluorescent label cross-talk between channels
B. Bridge amplification artifacts
C. Ligation probe mismatches
D. Homopolymer regions where signal intensity does not scale linearly
Explanation
Ion Torrent's major error source is homopolymer regions (stretches of repeated identical bases like AAAA or TTT). The signal strength correlates with the number of bases incorporated, but doesn't scale linearly. For example, distinguishing between 3 and 4 consecutive T's can be ambiguous because the peak height falls between expected values. Fluorescent cross-talk applies to Illumina, bridge amplification is Illumina's method, and ligation probes are used by SOLiD.
Q9 Medium
In Ion Torrent sequencing, polyclonal reads occur when:
A. More than one DNA species occupies the same well
B. The same DNA fragment is sequenced multiple times
C. A polymerase stalls during nucleotide incorporation
D. Two primers bind to the same template simultaneously
Explanation
Polyclonal reads occur when more than one DNA species occupies a single well. This produces mixed/overlapping signals, making base-calling unreliable. In ionograms, polyclonal reads show very few or no empty spaces because bases are incorporated continuously from different templates. These reads are filtered out during quality control.
Q10 Medium
Which clonal amplification method is used by the Ion S5 platform?
A. Bridge amplification on a flow cell
B. Rolling circle amplification
C. Emulsion PCR on beads
D. Isothermal amplification in nanowells
Explanation
Ion S5 uses emulsion PCR (emPCR). DNA fragments are attached to beads and encapsulated in oil-water droplets. Each droplet ideally contains a single DNA fragment, primers, nucleotides, and polymerase. The fragments are amplified to create clonal populations on the beads, which are then loaded into chip wells. Bridge amplification is the method used by Illumina.
Q11 Tricky
A key advantage of the Ion S5 over Illumina is that Ion S5:
A. Has higher accuracy in homopolymer regions
B. Uses natural, unmodified nucleotides, reducing chemical costs
C. Produces longer reads than any other platform
D. Can detect base modifications directly during sequencing
Explanation
Ion S5 uses natural, unmodified nucleotides (no fluorescent labels or terminators), which minimizes chemical costs and simplifies chemistry. Illumina uses chemically modified nucleotides with fluorescent labels and reversible terminators, which are more expensive. Ion Torrent actually has LOWER accuracy in homopolymer regions than Illumina, and long-read platforms (PacBio, Nanopore) produce much longer reads.
Q12 Tricky
In an ionogram, an empty space between peaks indicates:
A. A polyclonal well producing mixed signals
B. A homopolymer region was encountered
C. The sequencing quality dropped below the threshold
D. The added nucleotide did not match the template during that flow
Explanation
In an ionogram, empty spaces (gaps) mean that during that nucleotide flow, no bases were incorporated — the added nucleotide didn't match the template. Polyclonal reads actually show the opposite: very few or no empty spaces because bases from mixed templates are incorporated almost continuously.
Q13 Medium
During Ion Torrent library preparation, barcodes are used to:
A. Tag each sample with a unique sequence for multiplexed sequencing and later demultiplexing
B. Increase the length of the sequencing reads
C. Detect homopolymer errors during data analysis
D. Provide primer binding sites for emulsion PCR
Explanation
Barcodes are unique DNA sequences that tag each sample, allowing multiple samples to be pooled in a single sequencing run (multiplexing). After sequencing, reads are sorted back to their original sample based on the barcode (demultiplexing). This is separate from adapter sequences that provide primer binding sites.
Q14 Easy
In Ion Torrent sequencing, reads shorter than how many bases are automatically filtered out?
A. 10 bases
B. 15 bases
C. 25 bases
D. 50 bases
Explanation
Reads smaller than 25 bases are filtered automatically because they are too short to be aligned reliably to a reference genome. This threshold may need adjustment for specific applications like miRNA sequencing, where the target molecules are naturally very short.
Q15 Easy
The Roche 454 sequencing platform uses which detection method?
A. Pyrosequencing — light produced by luciferase-catalyzed reaction
B. Semiconductor detection of pH changes
C. Fluorescent reversible terminator chemistry
D. Sequencing by ligation with fluorescent probes
Explanation
Roche 454 uses pyrosequencing. When a nucleotide is incorporated, pyrophosphate (PPi) is released. ATP sulfurylase converts PPi to ATP, which luciferase then uses to produce light. A CCD camera detects this light and integrates it as a peak in a pyrogram. This is an optical (light-based) detection method, unlike Ion Torrent's electronic detection.
Q16 Hard
In 454 pyrosequencing, which enzyme is responsible for degrading unincorporated nucleotides and excess ATP between cycles?
A. Luciferase
B. DNA polymerase
C. ATP sulfurylase
D. Apyrase
Explanation
Apyrase is the nucleotide-degrading enzyme that continuously degrades ATP excess and unincorporated dNTPs. The four enzymes in pyrosequencing are: DNA polymerase (incorporates nucleotides), ATP sulfurylase (converts PPi to ATP), luciferase (produces light from ATP), and apyrase (degrades excess ATP/dNTPs for the next cycle).
Q17 Hard
In Roche 454 template preparation, the ratio of DNA fragments to agarose beads during emulsion PCR is approximately:
A. 1:10
B. 1:1
C. 10:1
D. 100:1
Explanation
In 454 emulsion PCR, DNA fragments and agarose beads (with complementary oligonucleotides) are mixed in an approximately 1:1 ratio. This ensures that most beads ideally capture a single DNA fragment. The mixture is then encapsulated by vigorous vortexing into aqueous micelles surrounded by oil for PCR amplification, resulting in beads decorated with ~1 million copies of the original fragment.
Q18 Medium
Approximately how many copies of the original DNA fragment are generated on each bead in 454 emulsion PCR?
A. ~100
B. ~10,000
C. ~1 million
D. ~1 billion
Explanation
Each bead is decorated with approximately 1 million copies of the original single-stranded fragment. This high copy number is necessary to provide sufficient signal strength during the pyrosequencing reaction to detect and record nucleotide incorporation events.
Q19 Easy
ABI SOLiD uses which sequencing approach?
A. Sequencing by synthesis using fluorescent reversible terminators
B. Sequencing by ligation using fluorescently labeled di-base probes
C. Pyrosequencing with bioluminescent detection
D. Semiconductor detection of hydrogen ions
Explanation
ABI SOLiD (Sequencing by Oligonucleotide Ligation and Detection) uses a sequencing-by-ligation approach. A set of four fluorescently labeled di-base probes compete for ligation to the sequencing primer. This is fundamentally different from polymerase-based methods (Illumina, Ion Torrent, 454).
Q20 Hard
The SOLiD system achieves high accuracy through its two-base encoding system. How many rounds of primer reset are performed for each sequence tag?
A. Two
B. Three
C. Four
D. Five
Explanation
Five rounds of primer reset are completed for each sequence tag. Through this process, virtually every base is interrogated in two independent ligation reactions by two different primers. This dual interrogation is fundamental to the high accuracy (up to 99.99% with Exact Call Chemistry) of the SOLiD system.
Q21 Medium
What is a significant limitation of the ABI SOLiD platform?
A. Short read lengths (35–50 bp) and complex color-space data analysis
B. High homopolymer error rates
C. Very low throughput
D. Inability to perform paired-end sequencing
Explanation
SOLiD's reads are relatively short (typically 35–50 bp), which limits genome assembly and analysis of repetitive regions. Additionally, the output is in color-space encoding that must be converted to nucleotide sequences, adding complexity to data analysis. SOLiD actually has very high throughput and can do paired-end sequencing. Homopolymer errors are characteristic of Ion Torrent and 454, not SOLiD.
Q22 Tricky
The SOLiD Exact Call Chemistry (ECC) module can achieve up to 99.99% accuracy when used:
A. Without any reference, in base-space mode
B. Only with paired-end reads
C. In combination with a reference genome
D. Only for reads shorter than 25 bp
Explanation
The Exact Call Chemistry (ECC) module achieves up to 99.99% accuracy when used in combination with a reference genome. Without a reference, the ECC module can still output data in base space (rather than color space), but it does not reach the same level of accuracy.
Q23 Easy
Illumina sequencing uses which cluster generation method?
A. Emulsion PCR on beads
B. Bridge amplification on a flow cell
C. Rolling circle amplification
D. Isothermal strand displacement
Explanation
Illumina uses bridge amplification on a flow cell. Single-stranded DNA molecules bind to complementary oligos on the flow cell surface, fold over to form bridges with adjacent primers, and are amplified to form clonal clusters. Emulsion PCR is used by Ion Torrent, 454, and SOLiD.
Q24 Medium
An Illumina flow cell is best described as:
A. A silicon chip with millions of microwells
B. A 96-well microtiter plate for PCR
C. A membrane with embedded protein nanopores
D. A thick glass slide with channels/lanes coated with a lawn of oligos complementary to library adapters
Explanation
An Illumina flow cell is a thick glass slide with channels or lanes. Each lane is randomly coated with a lawn of oligonucleotides complementary to library adapters. This surface enables the capture and bridge amplification of library fragments. Silicon chips with microwells describe Ion Torrent, and protein nanopore membranes describe Oxford Nanopore.
Q25 Medium
In Illumina bridge amplification, what happens immediately after the double-stranded bridge is denatured?
A. Two copies of covalently bound single-stranded templates are produced
B. The sequencing primer hybridizes immediately
C. The reverse strand is cleaved and washed away
D. Fluorescent nucleotides are added for sequencing
Explanation
After the double-stranded bridge is denatured, the result is two copies of covalently bound single-stranded templates. These single-stranded molecules can then flip over to hybridize to adjacent primers, and the bridge amplification cycle continues until a full cluster is formed. Reverse strand cleavage and sequencing primer hybridization occur later, after cluster generation is complete.
Q26 Medium
Each Illumina cluster represents:
A. A single DNA molecule with no amplification
B. A mixture of different DNA fragments from the library
C. Thousands of copies of the same DNA strand in a 1–2 micron spot
D. Approximately 1 million copies of the DNA fragment
Explanation
Each Illumina cluster represents thousands of copies of the same DNA strand positioned in a 1–2 micron spot. Clusters appear as bright spots on fluorescent images. The high copy number provides sufficient signal intensity for detection. The ~1 million copies figure applies to 454 beads, not Illumina clusters.
Q27 Medium
In Illumina sequencing by synthesis (SBS), what ensures that only one nucleotide is incorporated per cycle?
A. Nucleotides are added one at a time in separate flows
B. The DNA polymerase has built-in proofreading activity
C. DNA ligase prevents further extension
D. Each nucleotide has a reversible chemical terminator that blocks further incorporation
Explanation
Illumina uses fluorescent reversible terminator chemistry. All four nucleotides are added simultaneously, but each is chemically blocked (has a reversible terminator) to prevent additional incorporations in the same cycle. After imaging, the fluorescent label and chemical block are enzymatically removed, allowing the next cycle. This is a key difference from Ion Torrent, which adds nucleotides one type at a time and may incorporate multiple identical bases.
Q28 Tricky
Why does Illumina SBS have higher accuracy in homopolymer regions than Ion Torrent?
A. Illumina uses a more sensitive camera system
B. Only one nucleotide can be incorporated per cycle due to the reversible terminator, so each base is read individually
C. Illumina reads are inherently longer than Ion Torrent reads
D. Illumina uses a two-base encoding system similar to SOLiD
Explanation
Illumina's reversible terminator ensures only one nucleotide is incorporated per cycle, regardless of the template sequence. Even in a homopolymer run (e.g., AAAA), each A is incorporated and read in a separate cycle. Ion Torrent flows one nucleotide type at a time without a terminator, so multiple identical bases can be incorporated simultaneously, and the signal intensity must be used to infer the count — which is error-prone.
Q29 Hard
In Illumina 2-channel SBS chemistry (e.g., NextSeq 500), how is a G base identified?
A. No signal in either channel (neither red nor green)
B. Green signal only
C. Red signal only
D. Both red and green signals simultaneously
Explanation
In 2-channel chemistry: T = green only, C = red only, A = both red + green, G = neither (no signal). This reduces imaging requirements from 4 channels to 2, allowing faster scanning with simpler optics. G is essentially a "dark" base inferred from the absence of signal.
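The 2-channel decoding rule above can be sketched as a simple lookup. This is a toy illustration, not Illumina's base caller: the boolean (red, green) encoding is an assumption for clarity, since real instruments classify analog channel intensities.

```python
# Toy decoder for 2-channel SBS: each cycle yields one (red, green) signal
# pair per cluster. Booleans stand in for "signal detected in this channel".
TWO_CHANNEL = {
    (False, True):  "T",  # green only
    (True, False):  "C",  # red only
    (True, True):   "A",  # both channels
    (False, False): "G",  # dark base, inferred from the absence of signal
}

def call_bases(signals):
    """Convert a list of (red, green) pairs into a base string."""
    return "".join(TWO_CHANNEL[s] for s in signals)

print(call_bases([(True, True), (False, True), (True, False), (False, False)]))
# -> ATCG
```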
Q30 Hard
In Illumina 1-channel (single-color) SBS chemistry, which base is identified by appearing green in the first imaging but dark (no signal) in the second imaging?
A. T
B. C
C. A
D. G
Explanation
In 1-channel chemistry: Step 1 (first image): A and T emit green, C and G are dark. Step 2 (chemistry): A loses its green dye, T stays green, C gets activated to green, G stays dark. Step 3 (second image): A = green→black, T = green→green, C = black→green, G = black→black. So A is identified by going from green to dark.
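The 1-channel scheme can likewise be written as a two-observation lookup. A minimal sketch under the same assumption as before: booleans stand in for "green signal detected", whereas real base callers work on analog intensities.

```python
# Toy decoder for 1-channel SBS: two green-channel images per cycle,
# with a chemistry step in between.
ONE_CHANNEL = {
    (True, False):  "A",  # green then dark: dye cleaved after the first image
    (True, True):   "T",  # green in both images
    (False, True):  "C",  # dark then green: dye activated before the second image
    (False, False): "G",  # dark in both images
}

def call_base(first_image_green, second_image_green):
    """Map the two per-cycle observations onto a base."""
    return ONE_CHANNEL[(first_image_green, second_image_green)]

print(call_base(True, False))  # -> A
```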
Q31 Medium
In the Illumina cluster generation workflow, what happens immediately after reverse strand cleavage?
A. Bridge amplification continues
B. Free 3' ends are blocked to prevent unwanted DNA priming
C. Fluorescent nucleotides are added
D. The flow cell is scanned for cluster positions
Explanation
After linearization and reverse strand cleavage (leaving only forward strands), free 3' ends are blocked to prevent unwanted DNA priming. Only after blocking does the sequencing primer hybridize to the adapter sequence for Read 1. The full sequence is: bridge amplification → denaturation → linearization → reverse strand cleavage → 3' blocking → primer hybridization → sequencing.
Q32 Medium
What is Exclusion Amplification (ExAmp) in the context of Illumina sequencing?
A. A method to exclude polyclonal reads during analysis
B. A technique to filter out short reads
C. An alternative sequencing chemistry for long reads
D. An improved method for cluster generation on flow cells
Explanation
Exclusion Amplification (ExAmp) is an improved cluster generation method used on newer Illumina platforms. It is designed to generate higher-quality, more evenly spaced clusters on flow cells, improving data quality and throughput compared to traditional bridge amplification alone.
Q33 Easy
What is paired-end sequencing?
A. Sequencing both ends of a DNA fragment, producing two reads per molecule
B. Sequencing the same fragment twice for error correction
C. Sequencing two different samples simultaneously on one flow cell
D. Using two different sequencing chemistries on the same library
Explanation
Paired-end sequencing sequences both ends of a DNA fragment, generating two reads per molecule — one from each end. Although the middle portion remains unsequenced, the two reads are physically linked (from the same fragment), providing crucial positional information for alignment and variant detection.
Q34 Medium
Which type of variant is particularly difficult to detect with single-end reads but becomes detectable with paired-end sequencing?
A. Single nucleotide polymorphisms (SNPs)
B. Point mutations
C. Insertion-deletion (indel) variants and structural rearrangements
D. Heterozygous genotypes
Explanation
Paired-end sequencing facilitates detection of insertion-deletion (indel) variants, structural rearrangements, gene fusions, and novel transcripts, which is "not possible with single-read data." SNPs and point mutations can be detected with single-end reads. The paired positional information allows mapping discordant read pairs to identify structural changes.
Q35 Tricky
During paired-end sequencing on Illumina, how is the Read 2 template generated?
A. The original forward template is used again with a different primer
B. The template loops over to form a bridge, is re-amplified, linearized, and the forward strand is cleaved — leaving the reverse strand as template
C. A separate library is prepared and loaded onto the same flow cell
D. The sequencing primer is simply moved to the opposite end of the same strand
Explanation
For Read 2 in paired-end sequencing: (1) The Read 1 sequenced strand is stripped off. (2) Template strands and lawn primers are unblocked. (3) The single-stranded template loops over to form a bridge by hybridizing with a lawn primer. (4) The primer is extended, creating a new double-stranded bridge. (5) Bridges are linearized and the original forward template is cleaved. (6) The reverse strand remains as the template for Read 2 sequencing.
Q36 Easy
Which statement about paired-end sequencing is TRUE?
A. It requires twice the amount of DNA compared to single-read sequencing
B. It requires restriction digestion of the DNA
C. Only specific Illumina platforms support paired-end sequencing
D. It requires the same amount of DNA as single-read sequencing
Explanation
Paired-end sequencing uses the same amount of DNA as single-read genomic DNA or cDNA sequencing. It does not require methylation of DNA or restriction digestion, and all Illumina NGS systems are capable of paired-end sequencing. It's a simple modification to the standard library preparation process.
Q37 Easy
In Illumina library preparation, "indexing" refers to:
A. Adding unique barcode sequences to library fragments for sample identification during multiplexing
B. Creating a reference index for read alignment
C. Numbering each cluster position on the flow cell
D. Measuring the fragment size distribution of the library
Explanation
Indexing (also called barcoding) is the process of adding unique DNA sequences (indexes) to library fragments during preparation. This allows multiple samples to be pooled and sequenced together (multiplexing) on the same flow cell or lane, and later computationally separated (demultiplexed) based on their unique index sequences.
Q38 Easy
Which file format stores sequencing reads along with per-base quality scores but no alignment information?
A. BAM
B. VCF
C. FASTQ
D. BED
Explanation
FASTQ is a text-based format that stores reads and per-base quality scores, but contains no alignment information. BAM stores reads plus alignment information in binary format. VCF stores variant calls (SNPs, indels). BED defines genomic features/regions in the reference genome.
Q39 Medium
A BAM file differs from a FASTQ file in that it:
A. Is a plain text file that can be directly viewed
B. Contains only quality scores without the actual sequences
C. Stores variant calls for SNPs and indels
D. Is a compressed binary file containing both reads and alignment information, with an index for fast access
Explanation
BAM (Binary Alignment Map) is a compressed binary format containing both reads and alignment information. It uses an index file to give fast access to small sections of the file but cannot be directly viewed as text (requires specialized tools/genome browsers). FASTQ is plain text and contains no alignment data. VCF stores variant calls, not BAM.
Q40 Easy
The VCF (Variant Call Format) file is used to represent:
A. Raw sequencing reads with quality scores
B. SNPs, indels, and structural variation calls
C. Genomic feature annotations like genes and regulatory elements
D. Read alignment positions in binary format
Explanation
VCF (Variant Call Format) is a standardized text file format for representing SNP, indel, and structural variation calls — differences between the sequenced sample and the reference genome. FASTQ stores raw reads, BAM stores alignments, and BED defines genomic features/regions.
Q41 Medium
A BED (Browser Extensible Data) file is best described as:
A. A tab-delimited text file that defines genomic features or regions added to a reference file
B. A compressed binary format for read alignments
C. A text file storing only DNA sequences without quality information
D. A format for storing raw fluorescence intensity data
Explanation
A BED file is a tab-delimited text file that defines a feature track — specifying genomic features or regions such as genes or regulatory elements. BED files are added to a reference file to annotate or highlight specific parts of the reference genome. Option C describes FASTA format.
Q42 Medium
How many lines represent each read in a FASTQ file?
A. 2 lines (header + sequence)
B. 3 lines (header + sequence + quality)
C. 4 lines (identifier + sequence + separator + quality scores)
D. Variable — depends on read length
Explanation
Each read in FASTQ is represented by exactly 4 lines: Line 1 (@Read_ID) = Identifier, Line 2 = DNA sequence, Line 3 (+) = Separator, Line 4 = Quality scores encoded as ASCII characters. This compact format combines sequence and quality information efficiently.
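The 4-line record structure makes FASTQ trivial to parse by consuming lines in groups of four. A minimal sketch (no real-world edge cases such as wrapped sequences or gzip handling):

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples, consuming 4 lines per read."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        separator = next(it)             # line 3 starts with '+'
        qual = next(it).strip()
        assert header.startswith("@") and separator.startswith("+")
        assert len(seq) == len(qual)     # one quality character per base
        yield header[1:].strip(), seq, qual

# A single toy record, names hypothetical.
records = list(parse_fastq(["@read1", "ACGT", "+", "IIII"]))
print(records)
```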
Q43 Easy
A Phred quality score (Q score) of 20 corresponds to:
A. An error rate of 1 in 10 (90% accuracy)
B. An error rate of 1 in 100 (99% accuracy)
C. An error rate of 1 in 1,000 (99.9% accuracy)
D. An error rate of 1 in 10,000 (99.99% accuracy)
Explanation
Q = -10 × log₁₀(e), where e is the error probability. For Q=20: 20 = -10 × log₁₀(e), so log₁₀(e) = -2, meaning e = 0.01 = 1 in 100, giving 99% accuracy. Q10 = 1/10 (90%), Q30 = 1/1,000 (99.9%), Q40 = 1/10,000 (99.99%).
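The Phred conversion in both directions is a one-liner each; a quick sketch of the formula above:

```python
import math

def phred_to_error(q):
    """Error probability from a Phred score: e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_to_phred(e):
    """Phred score from an error probability: Q = -10 * log10(e)."""
    return -10 * math.log10(e)

print(phred_to_error(20))     # ~0.01, i.e. 1 error in 100 (99% accuracy)
print(error_to_phred(0.001))  # ~30.0
```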
Q44 Medium
In the Sanger FASTQ format, Phred quality scores are encoded using ASCII characters in which range?
A. ASCII 0 to 93
B. ASCII 0 to 126
C. ASCII 64 to 126
D. ASCII 33 to 126
Explanation
Sanger FASTQ format encodes Phred quality scores from 0 to 93 using ASCII characters 33 to 126 (Q score = ASCII value − 33). The Solexa/Illumina early format used a different offset (ASCII 64). A tip: if you see characters with ASCII codes higher than 90 in the quality string, the file is likely in the older Solexa/Illumina format.
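Decoding a quality string is just subtracting the offset from each character's ASCII code. A small sketch of the Sanger (offset 33) and early Solexa/Illumina (offset 64) conventions:

```python
def decode_qualities(qual_string, offset=33):
    """Decode an ASCII-encoded quality string into Phred scores.
    offset=33 is the Sanger/Illumina 1.8+ convention; the early
    Solexa/Illumina format used offset 64 instead."""
    return [ord(c) - offset for c in qual_string]

print(decode_qualities("!I"))  # '!' is ASCII 33 -> Q0; 'I' is ASCII 73 -> Q40
```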
Q45 Tricky
How can you distinguish a Solexa/Illumina FASTQ file from a standard Sanger FASTQ file by examining the quality string?
A. Solexa/Illumina files may contain characters with ASCII code higher than 90
B. Solexa/Illumina files use numeric quality scores instead of ASCII characters
C. Sanger files always start with the '@' symbol while Solexa uses '>'
D. Solexa files contain only uppercase letters in the quality string
Explanation
The lecture explicitly states: "Although Solexa/Illumina read file looks pretty much like FASTQ, they are different in that the qualities are scaled differently. In the quality string, if you can see a character with its ASCII code higher than 90, probably your file is in the Solexa/Illumina format." Both formats use '@' as a read identifier and ASCII-encoded quality scores, but the offset differs.
Q46 Medium
Illumina sequencing platforms typically achieve quality scores around:
A. Q10 (90% accuracy)
B. Q20 (99% accuracy)
C. Q30 (99.9% accuracy) or better
D. Q40 (99.99% accuracy)
Explanation
Illumina sequencing typically achieves around Q30 (1 error in 1,000 bases) or better. Ion Torrent averages around Q20 due to homopolymer difficulties, with improvements pushing closer to Q30. Q30 is a common quality benchmark in the field.
Q47 Medium
The "moving window" trimming approach in read filtering works by:
A. Removing the first 25 bases of every read
B. Keeping only reads above a specific length threshold
C. Randomly sampling reads to reduce file size
D. Trimming reads at the position where base quality drops below a certain threshold
Explanation
A moving window approach automatically trims reads when base quality drops below a certain threshold. It slides a window along the read and cuts at the position where quality deteriorates. Quality typically drops toward the ends of reads, so this helps retain the high-quality portion while discarding unreliable bases.
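The sliding-window idea can be sketched in a few lines. This is a simplified illustration of the concept, not the exact algorithm of any particular trimming tool (tools like Trimmomatic define the cut point slightly differently):

```python
def window_trim(quals, window=4, threshold=20):
    """Return how many leading bases to keep: the read is cut at the start
    of the first window whose mean quality falls below the threshold."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < threshold:
            return i
    return len(quals)

# High-quality start, degrading tail: the tail gets trimmed away.
quals = [30] * 10 + [10] * 5
print(window_trim(quals))  # keeps the first 9 bases
```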
Q48 Easy
The FASTA format differs from FASTQ in that FASTA:
A. Includes per-base quality scores
B. Contains only sequence information without quality scores
C. Is a binary format that cannot be viewed directly
D. Stores alignment information along with the sequence
Explanation
FASTA is a simple plain-text format that stores only the sequence information (header line starting with '>' followed by the sequence). It does not include quality scores. FASTQ extends this by adding per-base quality scores encoded as ASCII characters. FASTA was widely used before high-throughput sequencing became common.
Q49 Easy
Oxford Nanopore sequencing works by detecting:
A. Changes in electrical current as nucleic acids pass through a protein nanopore
B. Fluorescent signals from labeled nucleotides
C. pH changes from hydrogen ion release
D. Light produced by luciferase enzyme activity
Explanation
Nanopore sequencing monitors changes in electrical current as single-stranded DNA or RNA passes through a protein nanopore embedded in a membrane. The current changes are decoded to determine the nucleotide sequence. This is fundamentally different from synthesis-based or ligation-based detection methods.
Q50 Medium
A unique feature of Oxford Nanopore Technology is its ability to:
A. Achieve error rates below 0.01%
B. Amplify DNA before reading each molecule
C. Sequence RNA directly without conversion to cDNA
D. Produce paired-end reads on ultra-long fragments
Explanation
Oxford Nanopore can sequence RNA directly — no need to convert it to cDNA first. This is a unique capability. Nanopore also requires no DNA amplification. Its error rate is around 5% (not below 0.01%), and since it produces continuous long reads, paired-end sequencing is not needed.
Q51 Tricky
The Nanopore does NOT read individual nucleotides one at a time. Instead, it:
A. Reads an entire chromosome in one signal event
B. Detects only purine vs pyrimidine groups
C. Requires fluorescent labeling to distinguish bases
D. Detects a signal affected by a short sequence (k-mer), making signal-to-sequence conversion complex
Explanation
The nanopore detects a signal affected by a short sequence (k-mer) of nucleotides that are simultaneously within or near the pore, not individual bases. This makes the signal-to-sequence conversion computationally complex, as the current change at any point depends on multiple adjacent nucleotides. This contributes to the relatively higher error rate.
Q52 Medium
The MinION from Oxford Nanopore Technologies has up to how many nanopore channels?
A. 126
B. 512
C. 3,000
D. 144,000
Explanation
MinION has up to 512 nanopore channels. PromethION has up to 3,000 channels per flow cell and up to 48 flow cells (total up to 144,000 channels). GridION has five individually addressable flow cell positions compatible with MinION and Flongle flow cells.
Q53 Medium
Which Oxford Nanopore device is best suited for smaller samples, quality checking, or targeted regions?
A. Flongle
B. MinION
C. PromethION
D. GridION Mk1
Explanation
Flongle is an adapter for MinION designed for smaller tests. It's suitable for quality checks, amplicons, smaller genomes, targeted regions, or diagnostics. It's a single-use, on-demand, cost-efficient solution when you have smaller samples or prefer running single samples rather than multiplexing.
Q54 Hard
Why is long-read sequencing (Nanopore/PacBio) NOT suitable for ancient or highly degraded DNA?
A. The cost per base is too high for degraded samples
B. Degraded DNA causes the nanopore to clog
C. High error rates combined with sensitivity to DNA quality make results unreliable on degraded templates
D. Degraded DNA fragments are too long for the sequencing chemistry
Explanation
Long-read technologies are not suitable for ancient/degraded DNA due to higher error rates combined with sensitivity to DNA quality. Ancient DNA is fragmented and chemically damaged, which compounds the already higher error rates of these platforms. These methods require high-quality, intact DNA to take full advantage of their long-read capability.
Q55 Tricky
Why might the Nagoya Protocol be relevant when using a portable Nanopore sequencer in the field?
A. It requires all sequencing data to be made publicly available
B. It restricts the export of biological resources across national borders, meaning sequencing must often be done locally
C. It mandates the use of specific sequencing platforms in clinical settings
D. It prohibits the use of portable sequencers outside laboratory environments
Explanation
The Nagoya Protocol restricts the export of biological resources (tissue, blood, DNA) across national borders to prevent unauthorized use of genetic resources with commercial or research value. This means you can bring the sequencer to the sample (making Nanopore's portability valuable), but you may not be allowed to export the sample itself. Sequencing must often be done locally.
Q56 Easy
PacBio's SMRT sequencing immobilizes DNA polymerases in tiny wells called:
A. Nanopores
B. Microtiter wells
C. Ion semiconductor wells
D. Zero-Mode Waveguides (ZMWs)
Explanation
PacBio SMRT sequencing uses Zero-Mode Waveguides (ZMWs) — tiny wells at the bottom of which a DNA polymerase is attached. Fluorescently labeled nucleotides (with six phosphate groups) are incorporated continuously, and a camera detects the color and timing of each incorporation event in real time.
Q57 Medium
PacBio's Circular Consensus Sequencing (CCS) achieves high accuracy by:
A. Reading the same circular template multiple times and averaging the results to generate HiFi reads
B. Using two-base encoding similar to SOLiD
C. Adding reversible terminators to slow down the polymerase
D. Using bridge amplification to generate clonal clusters
Explanation
CCS creates a circular DNA template (SMRTbell) by ligating hairpin adapters to both ends. The polymerase makes multiple passes around the circle, reading the same sequence repeatedly. These multiple subreads are aligned and averaged to generate a high-fidelity (HiFi) consensus read with >99.9% accuracy. Because errors are random, the same error is unlikely to recur at the same position across passes.
Q58 Medium
A key difference between PacBio SMRT sequencing and Illumina SBS is that PacBio:
A. Uses a termination step to control nucleotide addition
B. Does not use fluorescently labeled nucleotides
C. Incorporates nucleotides continuously without a termination step
D. Requires cluster amplification before sequencing
Explanation
Unlike Illumina (which uses reversible terminators to add one base per cycle), PacBio has no termination step — nucleotides are incorporated continuously in real time. PacBio does use fluorescently labeled nucleotides (with six phosphate groups), and it performs single-molecule sequencing without prior amplification.
Q59 Hard
PacBio SMRT sequencing can detect DNA modifications (e.g., methylation) because:
A. Modified bases emit a different fluorescent color
B. Modified bases alter the interpulse duration (time between incorporations), which is detected in real time
C. Modified bases cause a pH change that differs from unmodified bases
D. A separate bisulfite treatment step is required before sequencing
Explanation
PacBio detects DNA modifications through "interpulse duration" — the time between base incorporation events. When the polymerase encounters a modified base, the incorporation time is longer. The SMRT system records the color and duration of emitted light in real time, so the interpulse duration can indicate DNA modification events directly, without separate chemical treatment.
Q60 Medium
Which of the following is an advantage of PacBio HiFi reads?
A. Lowest cost per gigabase among all platforms
B. Highest throughput (most reads per run)
C. Shortest library preparation time
D. Uniform coverage with little or no GC bias due to PCR-free library prep
Explanation
PacBio HiFi reads provide uniform coverage with little or no sequence bias (e.g., GC bias) thanks to PCR-free library preparation and random error profiles. PacBio has higher cost per run and lower throughput than Illumina, and library preparation is more technically demanding, not simpler.
Q61 Hard
Match the signal type to the correct platform: Fluorescence (optical), pH sensing (electronic), Bioluminescence (optical), Electrical current changes.
A. Illumina = Fluorescence, Ion Torrent = pH, 454 = Bioluminescence, Nanopore = Electrical current
B. Illumina = Bioluminescence, Ion Torrent = pH, 454 = Fluorescence, Nanopore = Electrical current
C. Illumina = Fluorescence, Ion Torrent = Electrical current, 454 = pH, Nanopore = Bioluminescence
D. Illumina = pH, Ion Torrent = Fluorescence, 454 = Bioluminescence, Nanopore = Electrical current
Explanation
Illumina uses fluorescent reversible terminator chemistry (optical detection). Ion Torrent detects pH changes via an ion-sensitive semiconductor (electronic). Roche 454 uses pyrosequencing with luciferase-generated bioluminescence (optical). Oxford Nanopore detects changes in electrical current through a protein pore (electronic). PacBio also uses fluorescence but in a real-time, single-molecule format.
Q62 Medium
Raw sequencing data from a single run can be approximately 2.5 terabytes, which after processing into FASTQ format reduces to about:
A. 500 gigabytes
B. 100 gigabytes
C. 30 gigabytes
D. 1 gigabyte
Explanation
Raw data can be ~2.5 terabytes, but once processed into FASTQ format, it reduces to about 30 gigabytes. This massive reduction highlights the importance of data management — keeping only essential files and compressing data efficiently, since storing terabytes of raw data long-term is impractical.
Q63 — Open Calculation
A base has a Phred quality score of Q30. What is the error probability and the corresponding base call accuracy? Show your calculation using the Q score formula.
✓ Model Answer

The Q score formula is: Q = −10 × log₁₀(e), where e is the error probability.

Given Q = 30:
30 = −10 × log₁₀(e)
log₁₀(e) = −3
e = 10⁻³ = 0.001
Error rate = 1 in 1,000 bases
Base call accuracy = 1 − 0.001 = 0.999 = 99.9%

Q30 is considered a standard quality benchmark for Illumina sequencing, meaning that on average, only 1 in 1,000 base calls is incorrect.

Q64 — Open Short Answer
Compare the clonal amplification methods used by Illumina (bridge amplification) and Ion Torrent (emulsion PCR). Describe the key steps of each and explain how each approach generates clonal populations of DNA templates for sequencing.
✓ Model Answer

Illumina — Bridge Amplification:

1. Single-stranded library fragments hybridize to complementary oligonucleotides (lawn primers) on a glass flow cell surface via their adapters.

2. DNA polymerase synthesizes a complementary strand; the original template is washed away.

3. The newly synthesized strand folds over ("bridges") to hybridize with an adjacent complementary primer on the flow cell.

4. Polymerase extends the primer, creating a double-stranded bridge.

5. The bridge is denatured, yielding two covalently attached single-stranded copies.

6. The cycle repeats, exponentially amplifying copies in a localized cluster (1–2 micron spot with thousands of identical copies).

7. After amplification, clusters are linearized, reverse strands are cleaved, and 3' ends are blocked before sequencing.

Ion Torrent — Emulsion PCR:

1. Library fragments are mixed with beads coated with complementary oligonucleotides, ideally at a 1:1 ratio (one fragment per bead).

2. Beads and fragments are emulsified into oil-water droplets, creating millions of individual micro-reactors.

3. Each droplet contains a single bead, a DNA fragment, primers, nucleotides, and polymerase.

4. Standard PCR amplification occurs inside each droplet, generating clonal populations on each bead.

5. Beads with amplified DNA are enriched (Ion Sphere Particle Enrichment) and deposited into chip wells for sequencing.

Key difference: Bridge amplification occurs on a flat surface (flow cell) generating spatially separated clusters, while emulsion PCR occurs in liquid microdroplets generating bead-bound clonal populations that are then loaded into wells.

Q65 — Open Tricky
Explain how Illumina's 2-channel and 1-channel SBS chemistries can distinguish all four nucleotides (A, T, C, G) using fewer imaging channels than the traditional 4-channel approach. What are the advantages and trade-offs of the 1-channel system?
✓ Model Answer

2-Channel Chemistry (e.g., NextSeq 500): Uses two colors (red and green). T = green only; C = red only; A = both red and green; G = neither color (dark). Each cycle requires two images, one per channel, and the combination identifies the base.

1-Channel Chemistry: Uses only a single color (green) but requires two imaging steps per cycle with an intervening chemical modification:

Step 1 — Incorporation: All 4 nucleotides are added. A and T emit green; C and G are dark.

Step 2 — First imaging: Green = A or T; Dark = C or G.

Step 3 — Chemical modification: A loses its green dye (cleaved off); T retains green; C is activated to fluoresce green; G remains dark.

Step 4 — Second imaging: A = green→dark; T = green→green (stays); C = dark→green; G = dark→dark (stays).

Advantages of 1-channel: Uses only one dye and one detector, reducing instrument cost and complexity. No need for multicolor scanning or complex optics.

Trade-offs: Requires two chemical processing steps and two images per cycle, making each cycle slightly longer. The chemistry is more complex and nucleotide reagents are more expensive than in multi-channel systems.

NGS Data Analysis — Exam Practice

📝 NGS Data Analysis – Variant Discovery Pipeline
Q1 Medium
An A260/280 ratio of 1.5 for a DNA sample most likely indicates:
A. RNA contamination
B. Protein or phenol contamination
C. Pure, high-quality DNA
D. Carbohydrate contamination
Explanation
The ideal A260/280 ratio for pure DNA is ~1.8. A ratio <1.8 indicates contamination by proteins or phenol, which absorb at 280 nm and lower the ratio. A ratio >1.8 may suggest RNA contamination. Carbohydrate contamination is detected by the A260/230 ratio, not A260/280.
Q2 Easy
The A260/230 ratio is used to assess:
A. DNA fragment length
B. Protein contamination
C. DNA integrity
D. Chemical contaminants such as carbohydrates, phenol, or guanidine salts
Explanation
The A260/230 ratio (ideal 2.0–2.2) detects chemical contaminants that absorb at 230 nm: carbohydrates (common in plant DNA), residual phenol, guanidine salts (from column kits), and glycogen. Protein contamination is assessed by A260/280. DNA integrity is assessed by gel electrophoresis, not absorbance ratios.
Q3 Medium
On an agarose gel, high-quality intact genomic DNA appears as:
A. A sharp, high molecular weight band near the well
B. A smear distributed evenly across the gel
C. Multiple discrete bands of different sizes
D. A band at the bottom of the gel near small fragments
Explanation
Intact genomic DNA is composed of very long fragments that migrate slowly in the gel, producing a sharp, high molecular weight band near the wells. A smear indicates degradation. Degraded DNA (important for ancient DNA or complex-matrix samples) may still work for short-read sequencing but will fail on long-read platforms like PacBio or Nanopore.
Q4 Tricky
A bioinformatician receives NGS data but does not ask about the library preparation protocol. Which of the following errors is MOST likely to occur?
A. Failure to install the alignment software
B. The reference genome will not be available
C. Incorrect interpretation of duplicate reads or coverage biases introduced by PCR amplification
D. The FASTQ files will be unreadable
Explanation
Knowing whether PCR amplification was used in library prep is essential. PCR introduces duplicate reads and coverage biases that affect variant calling, allele frequency estimation, and coverage evenness. Without this knowledge, a bioinformatician may misinterpret duplicates as genuine high coverage supporting a variant, leading to false positives. The lecture emphasizes: "Always ask what has been done to generate the data."
Q5 Easy
The formula for sequencing depth (coverage) is:
A. Depth = G × L / N
B. Depth = (N × L) / G
C. Depth = N / (L × G)
D. Depth = (G × N) / L
Explanation
Depth (X) = (N × L) / G, where N = number of reads, L = read length (bp), G = genome size (bp). For example, 100 million reads of 150 bp on a 3 Gb genome gives (100M × 150) / 3G = 5×.
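The depth formula above is easy to sanity-check in code; the worked example from the explanation reproduces 5×:

```python
def sequencing_depth(n_reads, read_length, genome_size):
    """Average depth of coverage: X = (N * L) / G."""
    return n_reads * read_length / genome_size

# 100 million reads of 150 bp on a 3 Gb genome.
print(sequencing_depth(100_000_000, 150, 3_000_000_000))  # -> 5.0
```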
Q6 Tricky
Breadth of coverage of 95% at 20× means:
A. 95% of reads have a quality score ≥20
B. Each base in the genome has been read exactly 20 times
C. 95% of reads aligned with a mapping quality of 20
D. 95% of the target genome bases are covered by at least 20 reads
Explanation
Breadth of coverage is the percentage of the target genome covered at a specified minimum depth. "95% at 20×" means 95% of bases have ≥20 reads mapped to them. This is different from depth of coverage, which is the average number of times each base is read. Breadth measures completeness; depth measures redundancy/confidence.
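Given a per-base depth track (e.g. from `samtools depth`), breadth at a threshold is just the fraction of positions meeting it. A minimal sketch with a toy depth list:

```python
def breadth_of_coverage(per_base_depth, min_depth=20):
    """Fraction of reference positions covered by at least min_depth reads."""
    covered = sum(1 for d in per_base_depth if d >= min_depth)
    return covered / len(per_base_depth)

# Toy track: 3 of 4 positions reach 20x, so breadth at 20x is 75%.
print(breadth_of_coverage([25, 19, 20, 30]))  # -> 0.75
```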
Q7 Easy
In a FASTQ file, each sequence entry consists of:
A. 4 lines: header (@), sequence, separator (+), quality scores
B. 3 lines: header (>), sequence, quality scores
C. 2 lines: sequence and quality scores
D. 5 lines: header, sequence, separator, quality, checksum
Explanation
FASTQ uses exactly 4 lines per read: Line 1 begins with '@' followed by the sequence identifier; Line 2 is the raw nucleotide sequence; Line 3 begins with '+' (optionally repeating the identifier); Line 4 encodes quality values as ASCII characters, with the same number of characters as bases in Line 2.
Q8 Medium
A Phred quality score of Q30 corresponds to:
A. A 1 in 100 chance of an incorrect base call (99% accuracy)
B. A 1 in 10 chance of an incorrect base call (90% accuracy)
C. A 1 in 1000 chance of an incorrect base call (99.9% accuracy)
D. A 1 in 10000 chance of an incorrect base call (99.99% accuracy)
Explanation
The Phred formula is Q = −10 × log₁₀(P), where P is the probability of error. For Q30: P = 10^(−30/10) = 10^(−3) = 1/1000. So there is a 0.1% chance of error, or 99.9% base call accuracy. Q20 = 99%, Q10 = 90%, Q40 = 99.99%.
Q9 Medium
Quality scores in FASTQ files are encoded using:
A. Binary values representing Phred scores directly
B. Single ASCII characters, where each character maps to a Phred score
C. Two-digit integers separated by commas
D. Hexadecimal values encoding error probabilities
Explanation
Both the sequence and quality scores are each encoded with a single ASCII character for brevity. ASCII printable characters (range 33–126) are used, where each character corresponds to a specific Phred quality score. Illumina 1.8+ uses the same encoding as the original Sanger format.
Q10 Medium
In paired-end sequencing, the two FASTQ files (*_1.fastq.gz and *_2.fastq.gz) are characterized by:
A. Reads sorted in the same order — the n-th read in file 1 is the mate of the n-th read in file 2
B. Reads sorted by mapping position along the reference genome
C. File 1 contains forward reads and file 2 contains all the quality scores
D. The two files can be read in any order and paired later by sequence similarity
Explanation
In paired-end sequencing, reads follow the same order in both files — the first read in *_1.fastq.gz is the mate pair of the first read in *_2.fastq.gz, and so on. They are NOT sorted by genomic position (that happens only after alignment to produce BAM files). Maintaining this order is critical for correct downstream alignment and analysis.
Q11 Easy
Which file formats can FastQC accept as input?
A. Only FASTQ files
B. FASTQ and VCF files
C. Only BAM files
D. BAM, SAM, and FASTQ files
Explanation
FastQC can import data from BAM, SAM, or FASTQ files (any variant). It provides a modular set of analyses with summary graphs and exports results as an HTML report. It does not accept VCF or other variant files.
Q12 Medium
In the FastQC "Per base sequence quality" module, a box plot at position 140 showing a median Phred score of 15 indicates:
A. Excellent quality — no action needed
B. Moderate quality — acceptable for most analyses
C. Poor quality — trimming of read ends is recommended
D. The sequencing run failed and data should be discarded entirely
Explanation
A Phred score of 15 means ~96.8% accuracy — this falls in the poor/red zone. Quality typically drops toward the end of reads. A median of 15 at position 140 suggests the read ends need trimming. However, the entire run is not necessarily a failure — trimming the low-quality tails may rescue the usable portion of the data.
Q13 Medium
In the FastQC "Per sequence GC content" module, a distribution with two distinct peaks (instead of a single normal curve) most likely indicates:
A. High sequencing quality
B. Contamination from another organism
C. Normal variation in GC content across the genome
D. Low sequencing depth
Explanation
A normal WGS library should produce a roughly normal (single-peak) GC distribution matching the reference genome. Multiple peaks or an unusually shaped distribution suggest contamination from another organism with a different GC content. For example, bacterial DNA in a cattle sample would produce a secondary peak. Environmental DNA samples (soil, honey) naturally show complex multi-peak distributions.
Q14 Medium
The "Sequence Duplication Levels" module in FastQC shows that 35% of reads appear 10+ times. The most likely cause is:
A. Over-amplification during PCR in library preparation
B. A very large and complex genome
C. High sequencing depth with PCR-free library prep
D. Adapter contamination
Explanation
In a random WGS library, most sequences should occur only once. High duplication levels (35% appearing 10+ times) strongly suggest over-amplification during PCR library preparation. These duplicates don't add new information and should be removed before analysis. PCR-free protocols avoid this bias but require more starting DNA material.
Q15 Tricky
In a WGS experiment, the "Per base sequence content" plot shows that position 1 always starts with a T and position 2 always starts with an A. This is most likely because:
A. The genome is AT-rich
B. Sequencing machine error at the beginning of every read
C. A restriction enzyme was used during library preparation that cuts at a specific recognition site
D. Adapter sequences were not trimmed
Explanation
Random DNA fragmentation should produce roughly equal proportions of all four bases at each position. Consistent base enrichment at specific positions indicates a restriction enzyme was used for fragmentation — these enzymes cut at defined recognition sequences, so all fragments begin with the same bases. This is a key example from the lecture about why understanding the library prep method is crucial for interpreting QC results.
Q16 Medium
What is the advantage of sliding window trimming over simple threshold-based trimming?
A. It is faster and requires less memory
B. It better preserves moderate-quality bases surrounded by high-quality neighbors
C. It removes adapter sequences simultaneously
D. It increases the overall read length
Explanation
Sliding window trimming uses a window (e.g., 5 bases) and calculates the average quality within it, trimming only when the average drops below the threshold. This approach preserves individual bases of moderate quality that are surrounded by high-quality neighbors, whereas simple threshold trimming would remove any single base below the cutoff. Tools like Prinseq and Trimmomatic implement sliding window trimming.
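The idea can be sketched in a few lines. This is a simplified illustration of window-based trimming, not Trimmomatic's exact algorithm; the function name and defaults are mine:

```python
def sliding_window_trim(quals, window=5, threshold=20):
    """Cut the read at the start of the first window whose average
    quality falls below the threshold (simplified sketch)."""
    for start in range(0, len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < threshold:
            return quals[:start]  # drop everything from the failing window on
    return quals  # no window failed: keep the whole read

# A single Q10 base surrounded by Q30 neighbors survives, because every
# window average stays at or above the threshold:
print(sliding_window_trim([30, 30, 30, 30, 30, 10, 30, 30, 30, 30]))
```

A simple per-base threshold at Q20 would have cut that Q10 base; the window average rescues it, which is exactly the advantage described above.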
Q17 Easy
What should be done immediately after trimming reads?
A. Proceed directly to variant calling
B. Submit the trimmed reads to a public database
C. Re-sequence the sample
D. Re-run quality control (e.g., FastQC) to confirm improvements
Explanation
After trimming, you should always re-run QC (e.g., FastQC) to verify that the quality has improved. This is part of the iterative QC-filter cycle emphasized throughout the pipeline. Skipping this verification risks retaining errors that lead to unreliable downstream conclusions.
Q18 Easy
What is the primary purpose of a SAM/BAM file?
A. To store the alignment of sequencing reads to a reference genome
B. To store raw sequencing reads and quality scores
C. To store a list of genetic variants (SNPs and indels)
D. To store the reference genome sequence
Explanation
SAM (Sequence Alignment Map) stores alignments of reads to a reference genome. BAM is the compressed binary version of SAM — smaller file size, indexed access, but not human-readable. FASTQ stores raw reads + quality scores. VCF stores variant calls. The reference genome is stored separately (e.g., as a FASTA file).
Q19 Medium
The SAM file header line "@SQ SN:chr1 LN:248956422" indicates:
A. A sequencing quality score of 248956422
B. The alignment tool version number
C. A reference sequence named chr1 with a length of 248,956,422 bp
D. The number of reads aligned to chromosome 1
Explanation
In the SAM header, @SQ is the reference sequence dictionary tag. SN stands for the reference Sequence Name (e.g., chr1), and LN stands for the reference Length in base pairs. There is one @SQ line per chromosome/contig. The @PG line (not @SQ) records alignment tool information.
Q20 Hard
A SAM alignment record has the FLAG value 1024. This read is:
A. Unmapped to the reference genome
B. A PCR or optical duplicate
C. A secondary alignment
D. Part of a properly paired read
Explanation
SAM FLAG is an integer where each bit encodes a different property. FLAG 4 = unmapped read, FLAG 256 = secondary alignment, FLAG 1024 = PCR or optical duplicate. Duplicates should typically be removed (e.g., using Picard) because they don't add independent information and can artificially inflate coverage, leading to false variant support.
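Because each bit is an independent property, flags are checked with bitwise AND. A small sketch — the constants come from the SAM specification, the helper name is mine:

```python
# A few SAM FLAG bits (see the SAM specification for the full list):
UNMAPPED  = 0x4    # 4
SECONDARY = 0x100  # 256
DUPLICATE = 0x400  # 1024

def is_duplicate(flag):
    """True if the duplicate bit is set, regardless of other bits."""
    return bool(flag & DUPLICATE)

print(is_duplicate(1024))  # True
print(is_duplicate(1040))  # True — 1024 + 16 (reverse strand) still has the bit set
print(is_duplicate(4))     # False — unmapped, but not a duplicate
```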
Q21 Hard
Given the CIGAR string "4S8M2I4M1D3M", which statement is correct?
A. The read has 4 deletions at the start
B. The read is 22 bases long and all bases align to the reference
C. There are 2 deletions and 1 insertion in this alignment
D. The first 4 bases are soft-clipped, there is a 2-base insertion and a 1-base deletion relative to the reference
Explanation
Parsing "4S8M2I4M1D3M": 4S = 4 bases soft-clipped (present in read but not aligned); 8M = 8 matching/mismatching bases; 2I = 2 bases inserted in read (not in reference); 4M = 4 matching; 1D = 1 base deleted from read (present in reference, missing in read); 3M = 3 matching. The read length = 4+8+2+4+3 = 21 bases. Note: M includes both matches AND mismatches — variant detection requires separate tools.
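CIGAR arithmetic is easy to automate. A sketch of the read-length rule — only operations that consume read bases (M, I, S, =, X) count; the helper name is mine:

```python
import re

def cigar_read_length(cigar):
    """Sum the lengths of CIGAR operations that consume read bases."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MIS=X")

print(cigar_read_length("4S8M2I4M1D3M"))  # 21 — the 1D consumes reference, not read
```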
Q22 Medium
A mapping quality (MAPQ) score of 0 indicates that:
A. The read maps equally well to multiple locations in the genome
B. The read has perfect alignment with no mismatches
C. The base quality of all positions in the read is zero
D. The read was not sequenced correctly
Explanation
MAPQ is a Phred-scaled score of mapping confidence. MAPQ = 0 means the read aligns equally well to multiple locations, often due to repetitive/low-complexity regions or genome duplications. Such reads should be filtered out before variant calling to avoid false positives. Higher MAPQ = higher confidence in unique placement.
Q23 Medium
BWA-MEM is based on which algorithm?
A. Hash table indexing
B. Smith-Waterman local alignment
C. Burrows-Wheeler Transform
D. k-mer frequency counting
Explanation
BWA (Burrows-Wheeler Aligner) uses the Burrows-Wheeler Transform (BWT) for efficient sequence alignment. Bowtie also uses BWT. In contrast, some other aligners use hash table-based approaches. The two approaches differ in speed, CPU/memory usage, and sensitivity — affecting downstream variant discovery. BWA-MEM is the default aligner in many standard genomic pipelines.
Q24 Tricky
When comparing BWA-MEM and Bowtie2 using the same variant caller (SAMtools), only 24.5% of SNPs were concordant. This demonstrates that:
A. SAMtools is an unreliable variant caller
B. The choice of read aligner has a major impact on downstream variant discovery
C. Both aligners produce identical results and the difference is due to random variation
D. The reference genome was incorrectly assembled
Explanation
Only 24.5% concordance between BWA-MEM and Bowtie2 (with the same variant caller) shows that aligner choice is NOT trivial — it profoundly affects which variants are discovered. The suggestion is to run both tools and compare results, especially for complex genomes like polyploid plants. This is a key point students often underestimate.
Q25 Medium
PCR duplicates are identified by sharing:
A. The same base quality scores
B. The same read name in the FASTQ file
C. Alignment to different chromosomes with similar sequences
D. Common coordinates, sequencing direction, and the same sequence
Explanation
PCR duplicates are identified as reads that share common genomic coordinates (start/end position), the same sequencing direction, and the same sequence — indicating they originated from the same amplified fragment rather than independent DNA molecules. Tools like Picard mark and remove these duplicates. Alternatively, PCR-free library protocols can be used if sufficient input DNA is available.
Q26 Medium
Why is duplicate removal important before variant calling?
A. Duplicates artificially inflate coverage and can give false support to variants
B. Duplicates reduce the mapping quality of all reads
C. Duplicates change the reference genome sequence
D. Duplicates decrease the file size of BAM files
Explanation
PCR duplicates are copies of the same DNA fragment. They fake high coverage at certain positions, giving artificially strong support for variants (including errors from the original fragment). This can lead to false positive variant calls. Removing duplicates ensures that only independent observations contribute to variant evidence.
Q27 Medium
Which of the following factors does NOT directly affect variant calling accuracy?
A. Base call quality of supporting reads
B. Proximity to homopolymer runs
C. The GC content of the entire genome
D. Mapping quality of the aligned reads
Explanation
Variant calling accuracy is affected by: base call quality, proximity to indels/homopolymer runs (which cause sequencing errors), mapping quality, and sequencing depth. The overall GC content of the genome affects sequencing coverage evenness but does not directly impact variant calling at a specific position in the way the other factors do.
Q28 Hard
What is the main advantage of joint variant calling over individual variant calling followed by merging?
A. It produces smaller VCF files
B. A low-confidence variant in one sample can be confirmed by evidence from other samples
C. It requires less computational resources
D. It does not require a reference genome
Explanation
Joint variant calling analyzes all samples simultaneously, allowing the caller to use cross-sample evidence. A variant that appears with low confidence in one sample (e.g., due to low coverage) can be confirmed if it appears confidently in other samples. In individual calling, a missing variant in some VCFs is ambiguous — it could be wild-type or just insufficient coverage. Joint calling resolves this ambiguity.
Q29 Tricky
A variant is called in a region with a homopolymer run of 8 adenines (AAAAAAAA). How should this variant be treated?
A. With caution — homopolymer regions are prone to sequencing errors that produce false positive variants
B. With high confidence — repetitive regions are easier to sequence accurately
C. It should be automatically accepted because modern callers handle homopolymers perfectly
D. It should be ignored — variants cannot occur in homopolymer regions
Explanation
Homopolymer runs are regions where the same nucleotide repeats many times (e.g., AAAAAAAA). Sequencing platforms, especially Illumina, are prone to errors in these regions (insertions/deletions of bases). Variants found in homopolymers are often false positives. Modern variant callers include filters for these regions but they are not perfect. Manual inspection (e.g., in IGV) is recommended.
Q30 Easy
In a VCF file, which column contains the alternative (non-reference) allele?
A. REF
B. QUAL
C. ALT
D. INFO
Explanation
The VCF mandatory columns are: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, and then sample columns. ALT contains the alternative (non-reference) allele(s), comma-separated if more than one. REF contains the reference allele. QUAL is the Phred-scaled quality score. INFO contains extensible annotations.
Q31 Medium
In a VCF file, meta-information lines begin with:
A. @ (at sign)
B. > (greater-than sign)
C. # (single hash)
D. ## (double hash)
Explanation
In VCF files: ## (double hash) marks meta-information lines (key=value pairs describing filters, info fields, etc.); # (single hash) marks the column header line (CHROM, POS, ID, etc.). Don't confuse with FASTQ where @ begins each entry, or FASTA where > begins each entry. SAM headers use @.
Q32 Easy
Which tool determines the effect of variants on genes, transcripts, and protein sequence?
A. BWA-MEM
B. Ensembl Variant Effect Predictor (VEP)
C. FastQC
D. Picard
Explanation
VEP (Variant Effect Predictor) from Ensembl determines variant effects on genes, transcripts, and protein sequences. It also provides SIFT and PolyPhen-2 scores for protein-altering changes. Other annotation tools include SnpEff and ANNOVAR. BWA-MEM is an aligner, FastQC does quality control, and Picard handles duplicate removal.
Q33 Medium
A "gain of TFBS" variant means:
A. The transcription factor binding site exists only for the alternative allele of the SNP
B. The transcription factor binding site exists only for the reference allele
C. Both alleles have identical transcription factor binding affinity
D. The variant is located in a coding region and causes a missense change
Explanation
TFBS variant consequences: Loss of TFBS = binding site exists only for the reference (0) allele; Gain of TFBS = binding site exists only for the alternative (1) allele; Score-Change = binding affinity differs between alleles; No Change = both alleles predicted with same binding affinity. In entropy logos, larger letters indicate more critical positions — mutations there are more likely to impact binding.
Q34 Medium
In the albino donkey case study, what type of variant was identified in the TYR gene?
A. A frameshift deletion
B. A synonymous substitution
C. A splice site variant
D. A missense mutation (c.604C>G, p.His202Asp) disrupting copper binding in tyrosinase
Explanation
The albino donkeys from Asinara island carry a missense mutation c.604C>G in the TYR gene, causing a histidine to aspartate substitution at position 202 (p.His202Asp). This disrupts copper binding in the tyrosinase enzyme, inactivating melanin production. Parents are heterozygous (C/G), albino offspring are homozygous (G/G) — demonstrating autosomal recessive inheritance.
Q35 Easy
The correct order of file formats in a standard variant discovery pipeline is:
A. BAM → FASTQ → VCF
B. VCF → BAM → FASTQ
C. FASTQ → BAM → VCF
D. FASTQ → VCF → BAM
Explanation
The standard pipeline flows: FASTQ (raw reads) → alignment with BWA → BAM (aligned reads) → variant calling with GATK → VCF (variant calls) → annotation with VEP/SnpEff. Quality control and filtering occur between every step. This is the core workflow emphasized throughout the lecture.
Q36 Tricky
Unmapped reads (FLAG 4) from cattle WGS data could be useful for:
A. Improving the alignment of mapped reads
B. Metagenomics — detecting contaminant bacteria or viruses in the sample
C. Increasing the sequencing depth of the cattle genome
D. Generating a new reference genome
Explanation
Unmapped reads (FLAG 4) did not align to the host (cattle) genome. These can be extracted and re-aligned to microbial databases to identify contaminant bacteria, viruses, or other organisms. This is the foundation of metagenomics and is relevant to the "One Health" approach linking human, animal, and environmental health. This is a concept the lecture specifically highlights as a practical application of SAM flag filtering.
Q37 Easy
Galaxy is primarily described as:
A. An open, web-based platform for accessible, reproducible, and transparent computational research
B. A command-line only tool for variant calling
C. A commercial sequencing platform
D. A local software package requiring complex installation
Explanation
Galaxy is an open, web-based platform. Its key features are: accessibility (no programming required, point-and-click), reproducibility (captures all analysis details), transparency (users share histories and workflows), and community-centered design. It requires only an internet connection and a browser — no installation or complex commands needed.
Q38 Easy
The Galaxy interface consists of three main panels:
A. Upload, Download, and Settings
B. Code editor, Terminal, and Output
C. Data + Available tools, Run tools and view results, Analysis history
D. Alignment, Variant calling, and Annotation panels
Explanation
Galaxy has three main panels: (1) Left panel for data and available tools, (2) Middle/center panel for running tools and viewing results, and (3) Right panel for the analysis history which tracks all files and operations. The history panel shows files with three action buttons: view (eye icon), edit attributes (pencil), and delete.
Q39 — Open Calculation
You sequence a cattle genome (genome size = 2.7 Gbp) using Illumina paired-end 150 bp reads. You generate 600 million reads in total. (a) Calculate the sequencing depth. (b) If the minimum recommended depth for robust SNP detection is 10×, is this sufficient? (c) What is the Phred score corresponding to 99.99% base call accuracy?
✓ Model Answer

(a) Sequencing depth:

Depth = (N × L) / G
N = 600,000,000 reads; L = 150 bp; G = 2,700,000,000 bp
Depth = (600,000,000 × 150) / 2,700,000,000
= 90,000,000,000 / 2,700,000,000
= 33.3×

(b) Yes, 33.3× exceeds the minimum recommended ~10× for robust SNP detection. This depth provides high confidence for variant calling.

(c) Phred score for 99.99% accuracy:

P(error) = 1 − 0.9999 = 0.0001 = 10⁻⁴
Q = −10 × log₁₀(10⁻⁴) = −10 × (−4) = 40
Answer: Q40
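The arithmetic in (a) and (c) is worth wrapping in throwaway helpers so other scenarios can be re-checked quickly — the function names here are mine:

```python
import math

def sequencing_depth(n_reads, read_len, genome_size):
    """Depth = (N * L) / G."""
    return n_reads * read_len / genome_size

def phred_for_accuracy(accuracy):
    """Q = -10 * log10(1 - accuracy)."""
    return -10 * math.log10(1 - accuracy)

print(round(sequencing_depth(600_000_000, 150, 2_700_000_000), 1))  # 33.3
print(round(phred_for_accuracy(0.9999)))                            # 40
```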
Q40 — Open Short Answer
Describe the complete variant discovery pipeline from raw sequencing data to annotated variants. For each major step, name the input file format, the output file format, and one commonly used tool.
✓ Model Answer

The variant discovery pipeline consists of four major steps, each with QC/filtering between them:

1. Quality Control & Trimming: Input: FASTQ → Tool: FastQC (assessment), Trimmomatic or Prinseq (trimming) → Output: cleaned FASTQ. Checks per-base quality, GC content, duplication levels. Removes low-quality bases using sliding window or threshold approaches.

2. Alignment: Input: cleaned FASTQ → Tool: BWA-MEM (Burrows-Wheeler Transform) → Output: BAM file. Maps reads to the reference genome. BAM is the binary compressed version of SAM. Post-alignment: filter by mapping quality (MAPQ) and remove PCR duplicates (Picard).

3. Variant Calling: Input: filtered BAM → Tool: GATK (following GATK Best Practices) → Output: VCF file. Examines aligned bases at each position to identify SNPs and indels. Detection affected by base quality, proximity to homopolymers, mapping quality, and sequencing depth. Can be individual or joint calling across samples.

4. Variant Annotation: Input: VCF → Tool: Ensembl VEP, SnpEff, or ANNOVAR → Output: annotated VCF. Determines effect of variants on genes/transcripts/proteins (e.g., missense, frameshift, intronic, TFBS variants). Provides functional impact predictions (SIFT, PolyPhen-2).

Q41 — Open Tricky
Explain the difference between depth of coverage and breadth of coverage. Give a scenario where you might have high depth but low breadth, and explain why this would be problematic for variant discovery.
✓ Model Answer

Depth of coverage = the average number of times each base is read (expressed as "X", e.g., 30×). It measures redundancy and confidence. Formula: Depth = (N × L) / G.

Breadth of coverage = the percentage of the target genome covered at a minimum depth (expressed as %, e.g., 95% at 1×). It measures completeness.

High depth, low breadth scenario: In a PCR-amplified library with severe amplification bias, certain genomic regions may be vastly over-represented (giving very high local depth), while other regions receive no reads at all. The average depth might be reported as 30×, but large portions of the genome have 0× coverage. This means variants in uncovered regions are completely missed, making the analysis incomplete despite seemingly adequate depth. This is why breadth is especially important in clinical diagnostics where missing regions could mean missing pathogenic variants.

Another example: targeted sequencing (e.g., exome capture) inherently has high depth in target regions but low breadth over the whole genome — which is by design, but must be understood when interpreting results.

Q42 — Open Short Answer
Why is it important for a bioinformatician to understand the upstream wet-lab steps before analyzing NGS data? Give at least three specific examples of how lab decisions affect data analysis.
✓ Model Answer

A bioinformatician must understand what happened before data generation because ignoring upstream processes leads to incorrect assumptions or flawed analyses. Key examples:

1. PCR amplification vs. PCR-free: If PCR was used during library prep, duplicate reads are expected and must be removed (e.g., with Picard). Without knowing this, duplicates would be treated as independent evidence, inflating coverage and producing false positive variants. PCR-free protocols avoid this but require more input DNA.

2. DNA quality/degradation: Degraded DNA (e.g., from ancient samples, honey, or soil) produces shorter fragments. This affects which sequencing platform is appropriate — degraded DNA is unsuitable for long-read platforms (PacBio/Nanopore). Low-quality DNA also affects alignment success and error rates.

3. Sequencing platform choice: Different platforms have different error profiles. Illumina has very low error rates; Ion Torrent has higher error rates especially in homopolymer regions. Knowing the platform tells you which types of errors to expect and filter for.

4. Expected sequencing depth: If coverage was planned at 5× vs. 30×, the confidence in variant calls differs dramatically. Low-coverage data (1–5×) requires different analytical approaches than high-coverage data.

5. Fragmentation method: Random fragmentation vs. restriction enzymes produces different sequence content patterns visible in FastQC (e.g., consistent bases at read starts with restriction enzymes).

GWAS – Genome-Wide Association Studies (Lectures 12–14)

📝GWAS — Concepts, LD & Study Design
Q1 Easy
What is the primary goal of a Genome-Wide Association Study (GWAS)?
A. To sequence the entire genome of an individual at single-nucleotide resolution
B. To identify statistical associations between genetic variants and phenotypic traits across the genome
C. To determine the complete haplotype structure of an organism
D. To identify all protein-coding genes in the genome
Explanation
GWAS analyzes DNA sequence variation across a large population to find statistical associations between specific genetic variants (typically SNPs) and phenotypic traits. It does not sequence the whole genome — it genotypes known variant positions. The ultimate goals include understanding genetic basis of traits, disease prediction/prevention, and improving breeding programs.
Q2 Medium
According to the Common Disease/Common Variant (CD/CV) hypothesis, which statement is correct?
A. Common diseases are caused by single rare mutations with large effect sizes
B. Common variants each explain a large proportion of disease heritability on their own
C. Common diseases are influenced by many common variants, each with a small effect size
D. Common variants have high penetrance and follow Mendelian inheritance patterns
Explanation
The CD/CV hypothesis states that common diseases are influenced by genetic variants that are also common in the population. Each individual variant has a small effect size (low penetrance), but their aggregate, polygenic effect explains the observed heritability. This is in contrast to rare Mendelian disorders caused by single high-penetrance mutations (like Huntington's disease).
Q3 Easy
What does Linkage Disequilibrium (LD) describe?
A. The non-random association of alleles at different loci within a population
B. The random segregation of alleles during meiosis
C. The physical distance between two genes on a chromosome
D. The rate of recombination between two loci
Explanation
LD describes the tendency of certain alleles at different loci (commonly SNPs) to be inherited together more often than expected by chance. When two SNPs are in LD, the presence of one allele can predict the presence of another nearby allele. LD is shaped by recombination history, genetic drift, selection, and population history.
Q4 Medium
What is the role of tag SNPs in GWAS?
A. They are always the causal variants responsible for disease
B. They mark the boundaries of chromosomes
C. They are used to increase the number of SNPs tested in GWAS
D. They are representative SNPs that capture genetic variation within an LD block without genotyping every variant
Explanation
Because SNPs in LD carry redundant information, researchers don't need to genotype every variant. Tag SNPs serve as proxies for other variants in the same LD block. This reduces genotyping costs and data complexity while still enabling genome-wide coverage. If a tag SNP is associated with a trait, the actual causal variant is likely nearby and in LD with it — this is called indirect association.
Q5 Hard
Given two bi-allelic SNPs with allele frequencies q₁ = 0.2 and q₂ = 0.3, and observed haplotype frequency q₁₂ = 0.20, what is the value of D?
A. 0.06
B. 0.14
C. 0.20
D. −0.14
Explanation
D = q₁₂ − q₁ × q₂ = 0.20 − (0.2 × 0.3) = 0.20 − 0.06 = 0.14. The expected haplotype frequency under linkage equilibrium is 0.06, but the observed is 0.20, indicating strong LD. Since D ≠ 0, the two loci are in linkage disequilibrium — the alleles co-occur more often than expected by chance.
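This is a one-line calculation you can verify yourself — the function name is mine:

```python
def ld_d(q1, q2, q12):
    """D = observed haplotype frequency minus expected (q1 * q2)."""
    return q12 - q1 * q2

print(round(ld_d(0.2, 0.3, 0.20), 2))  # 0.14
```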
Q6 Medium
What does D′ = 1 indicate about two SNPs?
A. The two SNPs are in perfect linkage equilibrium
B. The two SNPs have identical allele frequencies
C. Complete LD — the strongest possible non-random association given the allele frequencies
D. The two SNPs are on different chromosomes
Explanation
D′ is D normalized by its maximum possible value given the allele frequencies. D′ = 1 means complete LD, the strongest possible association between the two SNPs. D′ = 0 means no LD. Note that D′ = 1 does not necessarily mean r² = 1 — perfect correlation (r² = 1) additionally requires that the allele frequencies are the same at both loci.
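The standard formula r² = D² / (q₁(1−q₁) · q₂(1−q₂)) makes the contrast concrete. Using the Q5 numbers (where D′ = 1), r² comes out well below 1 because the allele frequencies differ; the helper name is mine:

```python
def ld_r2(q1, q2, q12):
    """r^2 = D^2 / (q1*(1-q1) * q2*(1-q2))."""
    d = q12 - q1 * q2
    return d * d / (q1 * (1 - q1) * q2 * (1 - q2))

# Q5 example: q1 = 0.2, q2 = 0.3, observed q12 = 0.20 -> D = 0.14 and D' = 1,
# yet r^2 is only ~0.58 because the allele frequencies are unequal.
print(round(ld_r2(0.2, 0.3, 0.20), 3))  # 0.583
```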
Q7 Tricky
Which LD measure is most commonly used to assess the coverage quality of a GWAS genotyping array?
A. r² — because it measures the correlation between tag SNP and causal variant
B. D′ — because it indicates complete LD regardless of allele frequency
C. D — because it directly measures haplotype frequency deviations
D. χ² — because it tests significance of association
Explanation
r² is the preferred measure for GWAS array design because it directly quantifies the correlation between a tag SNP and any other SNP. GWAS genotyping products select tag SNPs that guarantee coverage of common polymorphisms at some predetermined threshold of r². The lecture specifically states that LD between the causal polymorphism and a tested tag SNP is "measured by r²" and affects power. D′ can equal 1 even when prediction is poor (if allele frequencies differ), making it less useful for assessing genotyping coverage.
Q8 Easy
Over many generations, what happens to linkage disequilibrium between two loci?
A. LD always increases due to genetic drift
B. LD remains constant regardless of distance
C. LD increases with physical distance between the loci
D. Recombination gradually breaks LD, with distant loci losing LD faster than close loci
Explanation
Recombination events accumulate over generations and break apart linked regions. Variants that are physically close on a chromosome are less likely to be separated by recombination and remain in LD longer. Variants farther apart are more likely to be separated, moving toward linkage equilibrium. This leaves only small blocks of variants (haplotype blocks) still in LD.
Q9 Medium
In the context of GWAS phenotype definition, which phenotype type is analyzed using logistic regression?
A. Quantitative traits like height or BMI
B. Binary (dichotomous) traits like disease presence/absence
C. Ordinal traits like disease severity scores
D. All phenotype types use logistic regression in GWAS
Explanation
Binary (dichotomous) traits use logistic regression or contingency table methods (e.g., Fisher's exact test, chi-squared). Quantitative (continuous) traits use linear regression / generalized linear models (GLM). Semi-quantitative/ordinal traits may require ordinal regression or non-parametric tests. The statistical test depends primarily on the phenotype type.
Q10 Tricky
Why can a poorly defined phenotype in a case-control GWAS reduce statistical power?
A. It increases the number of SNPs that need to be tested
B. It reduces linkage disequilibrium across the genome
C. Misclassified individuals increase heterogeneity in causal polymorphisms, diluting the genetic signal
D. It causes the Bonferroni correction threshold to become more stringent
Explanation
Non-specific case–control definitions increase heterogeneity in the underlying causal genetic polymorphisms and non-genetic risk factors. For example, mixing different disease subtypes (e.g., inflammatory vs. non-inflammatory forms) in the "case" group means different causal variants are at play, diluting the signal for any single one. This leads to decreased power for detection, spurious associations, and invalid conclusions.
Q11 Medium
Which sample size would typically be needed for a GWAS studying complex diseases like diabetes?
A. ≥700–1000+ individuals due to small effect sizes and polygenic nature
B. ~50 individuals since GWAS genotyping arrays are highly precise
C. ~250 individuals, same as for molecular traits
D. Sample size does not affect GWAS power for complex diseases
Explanation
Complex diseases require large cohorts (≥700–1000+ individuals) because of their polygenic nature and the involvement of environmental factors. Molecular traits (e.g., metabolite levels) can often be studied with smaller cohorts (~250 individuals) because they are more directly linked to genetic variation. External traits (height, hair color) need medium sample sizes.
Q12 Easy
What is the standard genome-wide significance threshold used in GWAS?
A. P < 0.05
B. P < 0.01
C. P < 1 × 10⁻⁶
D. P < 5 × 10⁻⁸
Explanation
The widely accepted genome-wide significance threshold is P < 5 × 10⁻⁸. This threshold was derived by correcting for approximately 1 million independent LD blocks across the genome (Pe'er et al., 2008). It is now standard across GWAS studies — if a SNP's p-value is below this threshold, it is considered significantly associated with the trait.
Q13 Hard
If you test 150,000 SNPs at α = 0.05 without correction, approximately how many false positives would you expect?
A. 150
B. 7,500
C. 75
D. 750
Explanation
At α = 0.05, you expect 5% of all tested SNPs to appear significant by chance alone (even if no real associations exist). 150,000 × 0.05 = 7,500 expected false positives. This is exactly why multiple testing correction is critical in GWAS.
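The expectation is simple enough to check in code (a trivial sketch; the numbers are the ones from the question):

```python
# With m independent tests at significance level alpha, and no true
# associations at all, each test has probability alpha of a false
# positive, so the expected count is simply m * alpha.
m = 150_000
alpha = 0.05
expected_false_positives = m * alpha
print(expected_false_positives)  # 7500.0
```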
Q14 Medium
What is the main disadvantage of the Bonferroni correction in GWAS?
AIt is computationally too expensive for large datasets
BIt does not correct for multiple testing at all
CIt is overly conservative because it assumes all tests are independent, ignoring LD between SNPs
DIt produces too many false positives compared to FDR
Explanation
Bonferroni correction divides α by the total number of SNPs tested, assuming each test is independent. However, in GWAS, many SNPs are correlated due to linkage disequilibrium (LD), so the effective number of independent tests is lower. This makes Bonferroni overly conservative — it may miss true associations (increased false negatives). Additionally, the threshold changes depending on the genotyping panel used (panel-dependence).
Q15 Tricky
The fixed genome-wide significance threshold of P < 5 × 10⁻⁸ was derived based on:
AAn estimated 1 million independent LD blocks across the genome
BThe total number of SNPs on the Illumina 1M array
CThe number of protein-coding genes in the human genome
DThe False Discovery Rate method at q = 0.05
Explanation
The threshold P < 5 × 10⁻⁸ comes from correcting for approximately 1 million independent LD blocks across the genome (Pe'er et al., 2008). Rather than correcting per dataset, this predefined threshold is LD-aware and standardized across studies. It is neither based on a specific genotyping array nor on FDR — it's an empirical, fixed correction derived from the estimated number of independent tests genome-wide.
Q16 Medium
How does the False Discovery Rate (FDR) approach differ from Bonferroni correction?
AFDR is more conservative and rejects fewer SNPs
BFDR controls the proportion of false positives among significant results, rather than controlling the family-wise error rate
CFDR requires permutation of genotype data
DFDR uses a fixed threshold of P < 5 × 10⁻⁸
Explanation
FDR (Benjamini-Hochberg) controls the expected proportion of false positives among all declared significant associations. It is less conservative than Bonferroni and better suited for exploratory research because it retains more true positives. The procedure ranks p-values, calculates thresholds q(i) = (i/m) × α, and identifies the largest p-value meeting the criterion. Bonferroni controls the family-wise error rate (probability of ≥1 false positive).
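The ranking procedure described here can be sketched in plain Python (a minimal illustration; the function name and return format are my own):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a list of booleans marking which p-values are significant under BH FDR.

    Procedure: rank p-values ascending, compare the i-th smallest to its
    threshold q(i) = (i/m) * alpha, find the LARGEST rank still below its
    threshold, and declare everything up to that rank significant.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= (rank / m) * alpha:
            cutoff_rank = rank  # keep the largest qualifying rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        significant[idx] = rank <= cutoff_rank
    return significant
```

Note the deliberate quirk of BH: a p-value may exceed its own threshold yet still be declared significant if a larger p-value further down the ranking qualifies.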
Q17 Medium
What is population stratification in the context of GWAS?
AThe random sampling of individuals from a homogeneous population
BThe sequencing of DNA in distinct population layers
CThe division of a population into cases and controls for analysis
DThe presence of subgroups differing in genetic ancestry and trait prevalence, causing confounding in GWAS
Explanation
Population stratification occurs when a study population contains subgroups that differ in both genetic ancestry and trait prevalence. SNPs associated with ancestry may falsely appear associated with the disease (confounding by ancestry). For example, if Southern Europeans have both higher disease rates (for environmental reasons) and different allele frequencies, a GWAS might detect spurious associations. If uncorrected, it creates false positives and can mask true associations.
Q18 Easy
What is the purpose of MDS (Multidimensional Scaling) in GWAS?
ATo detect and visualize population structure by reducing high-dimensional genotype data
BTo calculate p-values for SNP associations
CTo perform multiple testing correction
DTo phase haplotypes from genotype data
Explanation
MDS is a dimensionality reduction technique that summarizes genome-wide genetic variation into a few dimensions. Each point on an MDS plot represents an individual, and clusters indicate groups of genetically similar individuals. If distinct clusters appear, it signals population stratification that must be accounted for in GWAS. MDS outputs can be used as covariates in the association model to correct for structure.
Q19 Tricky
In the mouse body weight GWAS example, why did almost every SNP appear significantly associated with body weight?
ABody weight is controlled by every locus in the genome
BThe genotyping array had a very high error rate
CPopulation structure between wild-derived and classical inbred strains confounded the results — SNPs differentiating strains correlated with weight differences
DThe Bonferroni correction threshold was too lenient
Explanation
The mice came from genetically distinct strains (wild-derived vs. classical inbred), and wild-derived strains had much lower body weight (3–4× difference). MDS revealed two major genetic clusters. The GWAS wasn't detecting causal genes — it was detecting genetic background differences that correlated with body weight. Every SNP differentiating the two strain groups appeared associated. This is a textbook example of population stratification creating massive false associations.
Q20 Medium
What does a genomic control inflation factor (λGC) of approximately 1.00 indicate?
AThe study has many true positive associations
BNo inflation — the test statistics match the expected null distribution, suggesting proper population structure control
CSignificant overcorrection for population structure
DThe Bonferroni threshold was applied correctly
Explanation
λGC ≈ 1.00 is the ideal scenario — the observed test statistics match the null distribution, indicating no inflation from population stratification or other confounding. λGC > 1 indicates inflation (possible uncorrected confounding, leading to false positives). λGC < 1 suggests overcorrection (too many covariates), potentially causing false negatives.
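λGC itself is a one-line computation once you have the per-SNP association test statistics (a sketch; the constant below is the known median of a χ² distribution with 1 degree of freedom):

```python
import statistics

# Median of the chi-squared distribution with 1 df: the expected median
# of the test statistics under the null hypothesis of no association.
CHI2_1DF_MEDIAN = 0.4549

def genomic_inflation_factor(chisq_stats):
    """lambda_GC = median of observed chi-squared statistics / expected null median."""
    return statistics.median(chisq_stats) / CHI2_1DF_MEDIAN
```

Using the median (rather than the mean) is what makes λGC robust to a handful of true associations: genuine signals sit in the extreme tail and barely move the median, whereas stratification shifts the whole distribution.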
Q21 Medium
On a QQ-plot from a well-controlled GWAS, what pattern indicates true associations?
AAll points lie exactly on the diagonal line
BAn overall upward curve with all points above the diagonal
CAll points fall below the diagonal line
DMost points follow the diagonal, with an upward deviation only at the tail (top-right corner)
Explanation
In a well-controlled GWAS: most SNPs (not associated with the trait) should fall along the diagonal, and only a few SNPs with true associations deviate upward at the tail. An overall upward curve (many points above the line) would indicate systematic inflation due to population stratification or technical artifacts. If all points are on the diagonal with no deviation, there may be no true associations detected.
Q22 Easy
In a Manhattan plot, what does the Y-axis represent?
A−log₁₀(p-value), so that more significant SNPs appear as higher points
BThe physical position of each SNP along the chromosome
CThe allele frequency difference between cases and controls
DThe effect size (beta coefficient) of each SNP
Explanation
In a Manhattan plot, the X-axis shows SNP positions along chromosomes, and the Y-axis shows −log₁₀(p-value). This transformation means smaller p-values (more significant) appear as higher points. Peaks indicate clusters of SNPs with significant associations, and a horizontal line typically marks the genome-wide significance threshold.
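The Y-axis transformation is just −log₁₀ (a trivial sketch; the names are my own):

```python
import math

GWAS_THRESHOLD = 5e-8  # standard genome-wide significance level

def manhattan_y(p):
    """Manhattan-plot height of a SNP: -log10(p). Smaller p -> taller point."""
    return -math.log10(p)

# The horizontal significance line is drawn at -log10(5e-8), roughly 7.30.
threshold_line = manhattan_y(GWAS_THRESHOLD)
```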
Q23 Tricky
In GWAS, a single isolated SNP signal above the significance threshold is most likely:
AA definitive causal variant for the trait
BA tag SNP in perfect LD with many other variants
CA potential false positive caused by genotyping errors or mapping issues
DAlways more reliable than a peak with multiple linked SNPs
Explanation
A true GWAS association usually appears as a peak with several linked SNPs (in LD with each other), because LD means multiple nearby SNPs carry similar association signals. A single isolated SNP signal — without supporting nearby SNPs — is suspicious and may represent a false positive from genotyping errors or mapping issues. The presence of multiple linked SNPs strengthens confidence in the association.
Q24 Medium
In the additive genetic model used in GWAS, how are genotypes coded?
AAA = 1, AG = 0, GG = −1
BBy the count of minor alleles: 0 (homozygous major), 1 (heterozygous), 2 (homozygous minor)
CAA = 0, AG = 0, GG = 1 (dominant model)
DGenotypes are not numerically coded in GWAS
Explanation
The additive model is the most commonly used in GWAS. It counts the number of copies of the minor allele: 0 (homozygous major, e.g., AA), 1 (heterozygous, e.g., AG), and 2 (homozygous minor, e.g., GG). A linear regression then tests whether the number of minor alleles is predictive of the phenotype value, assuming a trend per copy of the minor allele.
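A minimal sketch of the additive coding and the per-copy trend (illustrative only; a real GWAS fits a mixed model with covariates, and the minor allele here is assumed to be G):

```python
def additive_code(genotype, minor="G"):
    """Count minor-allele copies: AA -> 0, AG -> 1, GG -> 2 (minor allele G assumed)."""
    return sum(1 for allele in genotype if allele == minor)

def ols_slope(x, y):
    """Ordinary least-squares slope: the estimated phenotype change per minor-allele copy."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den
```

The additive model's "trend per copy" assumption is exactly what the slope captures: each extra minor allele is assumed to shift the phenotype by the same amount.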
Q25 Medium
Why is covariate adjustment important in GWAS linear mixed models?
AIt increases the number of SNPs that can be tested
BIt replaces the need for multiple testing correction
CIt eliminates all environmental effects on phenotype
DIt reduces spurious associations due to sampling artifacts, biases, and known confounding factors like sex, age, and population substructure
Explanation
The model Y = Xb + Zu + e includes fixed effects (known constants like sex, age, study site, population substructure) and random effects. Covariate adjustment reduces spurious associations caused by sampling artifacts or biases in study design. However, it comes at the cost of using additional degrees of freedom, which may impact statistical power. Population substructure is one of the most important covariates.
Q26 Easy
What is the purpose of replication in GWAS?
ATo validate that identified associations are robust and not statistical artifacts, using an independent sample
BTo increase the number of SNPs tested in the original study
CTo apply a different multiple testing correction method
DTo genotype the same individuals using a different SNP array
Explanation
Replication is the gold standard for validation. It should be done in an independent dataset, drawn from the same population, with similar phenotype definition and genotyping platform. Once confirmed in the target population, other populations may be sampled — successful replication in additional populations is called "generalization."
Q27 Medium
What is the concept of "indirect association" in GWAS?
AA SNP that is found to be causal through functional studies
BAn association detected only in meta-analyses, not in individual studies
CWhen the detected SNP is not the causal variant but is in strong LD with it, acting as a proxy
DAn association caused by population stratification rather than biology
Explanation
Indirect association is central to GWAS interpretation. The SNP that shows up in the GWAS results is often not the causal variant, but is in strong LD with it. The real causal variant may not have been genotyped, but its "tag" shows association due to LD. This is why further fine-mapping and functional studies are needed after GWAS to identify the true causal variants.
Q28 Medium
What is the main purpose of genotype imputation in meta-GWAS?
ATo correct genotyping errors in individual studies
BTo generate a common set of SNPs across studies that used different genotyping platforms
CTo increase the sample size of individual GWAS studies
DTo reduce the number of SNPs tested and thus relax the significance threshold
Explanation
Meta-analysis requires assessing the effect of the same allele across studies. When different studies use different genotyping platforms (with different SNP sets), imputation estimates genotypes for SNPs not directly genotyped by exploiting known LD patterns and haplotype frequencies from reference panels like HapMap or 1000 Genomes. This creates a common set of SNPs for comparison.
Q29 Tricky
Which statement about permutation testing for multiple testing correction in GWAS is correct?
AIt is the most commonly used method in standard GWAS because of its simplicity
BIt assumes all SNP tests are independent, like Bonferroni
CIt controls the false discovery rate rather than family-wise error rate
DIt preserves LD structure and generates empirical p-values, but is computationally too intensive for routine GWAS use
Explanation
Permutation testing shuffles genotypes many times to generate an empirical null distribution. Its key advantage is that it preserves the LD structure between SNPs, producing accurate empirical p-values. However, it is computationally intensive, especially with millions of SNPs and tens of thousands of samples, making it impractical for routine GWAS. It is powerful but rarely used in standard analysis.
Q30 Easy
Which of the following is NOT a typical GWAS application in livestock?
AIdentifying genetic risk factors for schizophrenia
BFinding causal genes for milk yield and quality
CIdentifying genes for coat color
DDiscovering variants for disease resistance
Explanation
Schizophrenia is a human complex disease, not a livestock trait. GWAS in livestock focuses on economically important traits such as milk yield/quality, fertility, growth, coat color, disease resistance, and performance traits (e.g., racing ability in horses). Crop GWAS focuses on yield, flowering time, drought tolerance, and nutritional value.
Q31 Hard
In the FDR (Benjamini-Hochberg) procedure with 20 SNPs at q* = 0.05, the BH threshold for the 3rd ranked p-value is:
A0.0025
B0.0050
C0.0075
D0.0100
Explanation
The BH threshold for rank i is: q(i) = (i/m) × α. For the 3rd ranked p-value: q(3) = (3/20) × 0.05 = 0.0075. The procedure ranks all p-values from smallest to largest, assigns each a BH threshold, and the largest p-value still below its threshold becomes the cutoff. All SNPs with p-values below that cutoff are declared significant.
Q32 Medium
What is the formula for the basic LD measure D?
AD = q₁ × q₂ − q₁₂
BD = q₁₂ − q₁ × q₂
CD = q₁₂ / (q₁ × q₂)
DD = (q₁ + q₂) − q₁₂
Explanation
D = q₁₂ − q₁ × q₂, where q₁₂ is the observed haplotype frequency and q₁ × q₂ is the expected haplotype frequency under linkage equilibrium (random combination of alleles). When D = 0, the loci are in linkage equilibrium. When D ≠ 0, the loci are in LD. However, D is sensitive to allele frequencies, which is why standardized versions D′ and r² are preferred.
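All three LD measures can be computed together (a sketch; the D′ and r² formulas are the standard textbook definitions, added here for context alongside the basic D from the question):

```python
def ld_measures(q1, q2, q12):
    """LD statistics for two biallelic loci.

    q1, q2 : frequencies of the tracked allele at locus 1 and locus 2
    q12    : observed frequency of the haplotype carrying both tracked alleles
    """
    D = q12 - q1 * q2  # observed minus expected haplotype frequency
    # D' rescales D by its maximum possible magnitude given the allele
    # frequencies (sign preserved here; |D'| is what is usually reported).
    if D >= 0:
        d_max = min(q1 * (1 - q2), (1 - q1) * q2)
    else:
        d_max = min(q1 * q2, (1 - q1) * (1 - q2))
    d_prime = D / d_max if d_max else 0.0
    # r^2 is the squared correlation between the allelic states at the two loci.
    r2 = D ** 2 / (q1 * (1 - q1) * q2 * (1 - q2))
    return D, d_prime, r2
```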
Q33 Tricky
A λGC value less than 1.00 in a GWAS most likely suggests:
AUncorrected population stratification inflating results
BMany true associations were detected
CThe study has perfect statistical power
DOvercorrection for population structure, potentially leading to false negatives
Explanation
λGC < 1.00 means observed test statistics are smaller than expected, suggesting the model is too conservative. This can happen when over-adjusting for population structure (e.g., including too many principal components as covariates). The consequence is an increased risk of false negatives — real associations may be missed because the test statistics are deflated.
Q34 Medium
What is over-representation analysis (ORA) used for in post-GWAS analysis?
ATesting whether specific biological functions are significantly enriched in a set of GWAS-identified genes compared to chance
BIdentifying additional SNPs not tested in the original GWAS
CCalculating linkage disequilibrium between candidate genes
DReplicating GWAS findings in an independent population
Explanation
ORA tests whether biological functions/pathways are significantly more frequent (over-represented) in a GWAS gene set than expected by chance. For example, if your GWAS identifies 50 candidate genes, ORA can determine whether immune-related functions are enriched in that gene set. Tools like DAVID and EnrichR perform this analysis, producing p-values for each functional term.
Q35 Easy
Which tool can be used to calculate LD measures (D, D′, r²) and visualize LD structure from genotype data?
ABEDTools
BEnrichR
CPLINK
DGeneCards
Explanation
PLINK is the primary tool for LD calculation, handling large SNP datasets and computing D, D′, and r². It also supports LD pruning and block identification. Haploview is another tool used specifically for LD visualization. BEDTools is for genomic interval operations, EnrichR for pathway enrichment analysis, and GeneCards is a gene-centric database.
Q36 Medium
What does the GWAS Catalog (NHGRI-EBI) provide?
ARaw sequencing reads from GWAS experiments
BA curated collection of published SNP-trait associations from GWAS, with summary statistics and genomic visualization
CReference genomes for assembly purposes
DLD block definitions for all human populations
Explanation
The GWAS Catalog (founded by NHGRI in 2008) is a curated repository of published SNP-trait associations from genome-wide association studies. It offers a search interface, downloadable data, API access, summary statistics, and an iconic GWAS diagram showing associations mapped onto the human karyotype. It is a key resource for post-GWAS annotation and prior knowledge.
Q37 Easy
What tool is used to find overlaps between genomic features such as GWAS peaks and gene annotations (GFF files)?
AHaploview
BDAVID
CPLINK
DBEDTools intersect
Explanation
BEDTools intersect allows screening for overlaps between two sets of genomic features (e.g., GWAS-associated SNP positions and annotated genes in a GFF file). It works with BED, GFF, VCF, and BAM files. This is commonly used in post-GWAS analysis to identify genes near association peaks, often within a defined window (e.g., 0.5 Mb).
Q38 Tricky
In a meta-GWAS, which of the following is NOT a requirement?
AAll studies must use the exact same genotyping platform
BAll studies must have examined the same hypothesis
CQC procedures and covariate adjustments should be standardized across studies
DThe sample sets across all studies should be independent
Explanation
Studies in a meta-GWAS do NOT need to use the exact same platform — this is precisely why imputation exists: to generate a common set of SNPs across studies using different arrays. However, all studies must examine the same hypothesis, use standardized QC and covariate adjustments, have consistent phenotype measurements, and use independent sample sets. Meta-analysis allows pooling results without transferring protected genotype data.
Q39 Medium
What data sources does genotype imputation rely on?
AProtein crystal structure databases
BRNA-Seq expression profiles from the same individuals
CKnown LD patterns and haplotype frequencies from reference panels like HapMap or 1000 Genomes
DPhenotype data from case-control studies
Explanation
Genotype imputation exploits known LD patterns and haplotype frequencies from reference panels (HapMap, 1000 Genomes) to statistically estimate genotypes at SNP positions that were not directly genotyped in the study. It leverages the principle that nearby SNPs in LD are inherited together, so if you know the genotype of surrounding SNPs, you can predict the missing ones.
Q40 Medium
In GWAS, what does "fine-mapping" refer to?
AIncreasing the sample size of the study
BInvestigating the LD structure and nearby genes within significant GWAS peaks to prioritize candidate causal variants
CApplying more stringent multiple testing corrections
DRemoving SNPs in LD from the genotyping panel
Explanation
Fine-mapping is the step after identifying GWAS peaks where researchers zoom into significant regions to examine LD structure (using r² or D′), identify nearby genes, and prioritize candidate causal variants for functional studies. Because GWAS detects indirect associations through LD, fine-mapping helps narrow down from a region to the actual causal variant(s).
Q41 — Open Short Answer
Describe the six key design considerations a GWAS should address. For each, explain why it matters for study quality.
✓ Model Answer

The six key GWAS design considerations are:

1. Phenotype definition: Precise trait classification (binary, quantitative, or ordinal) is essential. Misclassification increases heterogeneity and reduces statistical power. Different disease subtypes should be distinguished.

2. Structure of common genetic variation (LD): Understanding LD blocks enables efficient genotyping with tag SNPs. LD patterns vary across populations, so study design must account for the target population's LD structure.

3. Sample size: Must be adequate for the trait complexity. Complex diseases need ≥700–1000+ individuals; molecular traits may require ~250. Larger samples detect more loci and improve reliability.

4. Population structure/stratification: Ancestry differences between subgroups can confound results. Must be assessed (via PCA/MDS) and corrected by including ancestry components as covariates.

5. Genome-wide significance and multiple testing correction: Testing millions of SNPs generates many false positives. Correction methods include Bonferroni, fixed threshold (P < 5 × 10⁻⁸), FDR, and permutation testing.

6. Replication: Findings must be validated in an independent cohort with similar phenotype definition and genetic background. Successful replication in other populations = generalization.

Q42 — Open Calculation
Given two SNPs: SNP A (alleles T and G, with freq(G) = 0.2) and SNP B (alleles G and A, with freq(A) = 0.3). The observed haplotype frequencies are: T-G = 0.70, T-A = 0.10, G-G = 0.00, G-A = 0.20. Calculate D for the G-A haplotype and verify it matches for all other haplotypes.
✓ Model Answer

First, calculate expected haplotype frequencies under linkage equilibrium (product of allele frequencies):

At SNP A: freq(T) = 0.8 and freq(G) = 0.2. At SNP B: freq(G) = 0.7 and freq(A) = 0.3.
Expected T-G = 0.8 × 0.7 = 0.56
Expected T-A = 0.8 × 0.3 = 0.24
Expected G-G = 0.2 × 0.7 = 0.14
Expected G-A = 0.2 × 0.3 = 0.06

Now compute D = observed − expected for each haplotype:

D(T-G) = 0.70 − 0.56 = +0.14
D(T-A) = 0.10 − 0.24 = −0.14
D(G-G) = 0.00 − 0.14 = −0.14
D(G-A) = 0.20 − 0.06 = +0.14

The absolute value |D| = 0.14 is the same for all four haplotypes; the signs alternate so that the haplotype frequencies still sum to 1. Since D ≠ 0, the two SNPs are in linkage disequilibrium. The positive D for G-A means this haplotype is observed more frequently than expected, indicating non-random co-inheritance.

Q43 — Open Tricky
Explain why a GWAS-significant SNP is often not the actual causal variant. What steps would a researcher take after identifying a significant association peak?
✓ Model Answer

GWAS relies on indirect association through LD. Genotyping arrays use tag SNPs that are representative markers for LD blocks. When a tag SNP shows significant association, it may be in strong LD with the true causal variant, which was never directly genotyped. The detected signal reflects the correlation between the tag and causal variant, not direct causality.

Post-GWAS steps include:

1. Fine-mapping: Examine the LD structure (r², D′) around the top SNP to identify the boundaries of the associated region and narrow down candidate variants.

2. Gene annotation: Identify nearby genes using databases (e.g., GFF files, BioMart) and tools like BEDTools intersect, often within a defined window (e.g., 0.5 Mb).

3. Biological evaluation: Assess candidate gene relevance using GeneCards, GWAS Catalog, Mouse Genome Informatics (MGI), and scientific literature.

4. Functional enrichment: Use ORA tools (DAVID, EnrichR) to test if the gene set is enriched for specific biological pathways.

5. Functional validation: Conduct experimental studies (e.g., gene expression, knockouts) to confirm the causal role of the candidate variant/gene.

Q44 — Open Calculation
You are performing a GWAS with a panel of 500,000 SNPs. (a) What is the Bonferroni-corrected significance threshold at α = 0.05? (b) Why might the fixed threshold of P < 5 × 10⁻⁸ be more appropriate? (c) Using FDR at q* = 0.05 with 500,000 SNPs, what is the BH threshold for the SNP ranked 10th?
✓ Model Answer

(a) Bonferroni correction:

α_corrected = α / N = 0.05 / 500,000 = 1.0 × 10⁻⁷

(b) Why the fixed threshold is more appropriate: Bonferroni assumes all 500,000 SNP tests are independent, but many SNPs are correlated through LD, so the effective number of independent tests is lower. This makes Bonferroni overly conservative. The fixed threshold of P < 5 × 10⁻⁸ was derived from ~1 million independent LD blocks (Pe'er et al., 2008), is LD-aware, standardized across studies, and does not change with the genotyping platform used.

(c) FDR BH threshold for rank 10:

q(10) = (10 / 500,000) × 0.05 = 0.000001 = 1.0 × 10⁻⁶

The 10th-ranked p-value must be below 1.0 × 10⁻⁶ to be declared significant under FDR.

Q45 — Open Short Answer
Explain the difference between a QQ-plot showing (a) good population structure control and (b) uncorrected stratification. What role does the genomic inflation factor λGC play alongside the QQ-plot?
✓ Model Answer

(a) Good control: Most observed p-values follow the diagonal (expected under the null hypothesis). Only a few points deviate upward at the extreme tail (top-right), representing true associations. This indicates that the vast majority of SNPs behave as expected (no association), and only a handful show genuine signal.

(b) Uncorrected stratification: The entire distribution shifts upward — many points across the full range lie above the diagonal, not just the tail. This systematic inflation means that ancestry differences are creating widespread false signals, not just a few true associations.

λGC complements the QQ-plot: It quantifies the inflation numerically by comparing the median of observed chi-squared test statistics to the expected median under the null. λGC ≈ 1 confirms good control (matching the QQ-plot diagonal). λGC > 1 quantifies the degree of inflation seen in the QQ-plot. λGC < 1 flags overcorrection (too many PCs as covariates), which may cause false negatives. Together, the QQ-plot provides visual assessment and λGC provides a numerical summary — both are essential QC tools before interpreting GWAS results.

Lectures 11-12-13: Genotyping Tools, CNV, Population Genomics, ROH & GWAS

📝 High-Throughput Genotyping Tools
Q1 Easy
What does "genotyping" mean in the context of high-throughput genomic studies?
ASequencing the entire genome of an organism de novo
BDetermination of the genotype at polymorphic loci
CIdentifying all genes present in an organism's genome
DMeasuring gene expression levels across all tissues
Explanation
Genotyping specifically refers to the determination of the genotype at polymorphic loci — i.e., identifying which alleles an individual carries at variable positions in the genome. It does not involve whole-genome sequencing or gene expression analysis. The value of genetic information relies largely on variants/polymorphisms, which are the informative parts of a genome.
Q2 Easy
Which of the following is NOT listed as an application of high-throughput genotyping in agricultural species?
AGenome-wide association studies (GWAS)
BGenomic selection
CParentage testing
DDe novo genome assembly
Explanation
The four main applications listed are: GWAS/QTL mapping, genomic selection, marker-assisted selection/breeding (MAS/MAB), and parentage testing. De novo genome assembly is a different process that aims to reconstruct the full genome sequence and is not a direct application of high-throughput genotyping tools like SNP chips.
Q3 Medium
A restriction enzyme that recognizes a 4 bp sequence would be expected to cut, on average, once every:
A64 bp
B128 bp
C256 bp
D4,096 bp
Explanation
A restriction enzyme with an n-base recognition site cuts, on average, once every 4ⁿ bp, since each position matches a specific n-base sequence with probability 1/4ⁿ. For a 4 bp cutter: 4⁴ = 256 bp. Similarly, a 6 bp cutter cuts every 4⁶ = 4,096 bp, and an 8 bp cutter every 4⁸ = 65,536 bp. AluI is an example of a 4 bp cutter.
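The 4ⁿ rule and the resulting fragment counts are easy to verify in code (a trivial sketch; function names are my own):

```python
def expected_cut_interval(n):
    """Average spacing between cut sites for an n-bp recognition sequence.

    Each position matches a specific n-base sequence with probability
    (1/4)**n, so cuts occur on average once every 4**n bp.
    """
    return 4 ** n

def expected_fragments(genome_size_bp, n):
    """Approximate fragment count from digesting a genome with an n-bp cutter."""
    return round(genome_size_bp / expected_cut_interval(n))
```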
Q4 Medium
In the Illumina Infinium BeadChip genotyping assay, how is allele specificity determined?
AA single base extension that incorporates one of four labeled nucleotides
BHybridization of two differently colored probes to both alleles simultaneously
CSequencing the region surrounding each SNP
DPCR amplification with allele-specific primers
Explanation
In the Illumina BeadChip, each probe binds to complementary sequence in the sample DNA, stopping one base before the locus of interest. Allele specificity is then conferred by a single base extension that incorporates one of four labeled nucleotides. When excited by a laser, the nucleotide label emits a signal whose intensity conveys information about the allelic ratio at that locus.
Q5 Tricky
How does the Axiom (Affymetrix) genotyping assay differ from the Illumina Infinium assay in interrogating simple SNPs?
AAxiom uses two probes per SNP; Illumina uses one probe with single base extension
BAxiom uses one probe with two-color readout via differentially labeled nonamers; Illumina uses single base extension
CAxiom uses sequencing-by-synthesis; Illumina uses hybridization
DBoth platforms use identical chemistry but differ in array density
Explanation
In the Axiom system, simple SNPs are interrogated using one standard probe with allelic discrimination achieved by differentially labeled nonamers that hybridize to each allele (one probe, two color readout). In contrast, Illumina Infinium uses a probe that stops one base before the SNP and relies on single base extension with labeled nucleotides. This is a subtle but important distinction between the two major genotyping platforms.
Q6 Medium
In the GenomeStudio Genoplot for SNP genotyping, what do the three clusters (red, purple, blue) represent?
ADifferent chromosomes where the SNP is located
BDifferent quality scores: high, medium, and low
CThe three possible haplotypes at the locus
DThe three genotype classes: AA, AB, and BB
Explanation
In a GenomeStudio Genoplot, data points are color coded for the genotype call: red = AA, purple = AB, blue = BB. Each dot represents a sample, plotted by signal intensity (norm R) and allele frequency (Norm Theta) relative to canonical cluster positions for a given SNP marker.
Q7 Medium
What is a key advantage of custom genotyping arrays over commercially available SNP chips?
AThey are always cheaper per sample than commercial arrays
BThey provide higher data quality and fewer genotyping errors
CThey enable studies of species or populations not supported by standard products
DThey always include more SNPs than commercial arrays
Explanation
Custom genotyping arrays allow researchers to target regions relevant to their specific research interests. Key advantages include: enabling studies of species/populations not supported by standard products, allowing focus on genes/variants/regions of interest not covered in pre-designed products, and conserving resources by avoiding irrelevant genome regions. They are not necessarily cheaper or higher quality — their value lies in flexibility and specificity.
Q8 Medium
What is the main advantage of restriction enzyme GBS (RE-GBS) methods like RAD-Seq over array-based genotyping?
AReduced ascertainment bias and simultaneous SNP discovery and genotyping
BHigher per-sample cost but better data quality
CNo need for a reference genome under any circumstances
DComplete absence of missing data in the final dataset
Explanation
RE-GBS methods have reduced ascertainment bias over array-based methods, the ability to discover and characterize polymorphisms simultaneously, and low cost per sample (<$20 USD). However, they can have issues with missing data (especially in divergent populations), and a reference genome or genome knowledge can be helpful. High divergence can result in missing data, while low divergence may yield fewer SNPs.
Q9 Tricky
In the Ramos et al. study on pig SNP discovery using reduced representation libraries (RRLs), which of the following was used as a criterion to discard unreliable SNPs?
ASNPs with read depth lower than 120 were discarded
BSNPs with read depth higher than 120 were discarded
CThe minor allele had to be present in at least 10 reads
DOnly reads mapping to multiple locations were considered
Explanation
In the SNP discovery pipeline, SNPs with total read depth higher than 120 were discarded (to avoid repetitive/duplicated regions), the minor allele needed to be represented in at least 3 reads (not 10), and only reads mapping to a single unique location were considered (not multiple locations). The quality thresholds for MAQ mapping quality, consensus quality, and best mapping read quality were all set at 10.
Q10 Medium
In the three-step process for converting NGS data into genotype calls, what is the correct order?
A) SNP calling → alignment → filtering
B) Filtering → alignment → SNP calling
C) Pre-processing (alignment + quality scores) → SNP/genotype calling → post-processing (filtering)
D) SNP calling → pre-processing → post-processing
Explanation
The correct pipeline is: (1) pre-processing steps that transform NGS data into aligned reads with quality scores, (2) SNP or genotype calls using multi-sample or single-sample calling procedures depending on the number of samples and depth of coverage, and (3) post-processing steps that filter the called SNPs to remove unreliable variants.
Q11 Easy
The formula P = G + E in the context of genotyping studies refers to:
A) Phenotype = Genotype + Environment
B) Population = Genes + Evolution
C) Probability = Genetics + Error
D) Power = Genotyping + Efficiency
Explanation
P = G + E is a fundamental equation in quantitative genetics where an individual's Phenotype (P) is determined by their Genotype (G) plus Environment (E). High-throughput genotyping studies are crucial for generating large volumes of genotyping data to identify associations between genotypes and phenotypes (such as diseases or production traits).
Q12 — Open Calculation
A restriction enzyme recognizes a 6 bp sequence. On average, how frequently would you expect this enzyme to cut in a random DNA sequence? If you digest a 3 Gbp genome with this enzyme, approximately how many fragments would you expect?
✓ Model Answer

A restriction site with an n-base recognition sequence is expected to occur, on average, once every 4ⁿ bp in random DNA (each of the n positions matches with probability 1/4, so any given position starts a site with probability (1/4)ⁿ).

For a 6 bp cutter: 4⁶ = 4,096 bp
The enzyme cuts on average once every 4,096 bp.
Number of fragments ≈ Genome size / Cut frequency = 3,000,000,000 / 4,096 ≈ 732,422 fragments

So a 6 bp restriction enzyme would generate approximately 730,000 fragments from a 3 Gbp genome. This principle underlies why enzymes with longer recognition sites produce fewer, larger fragments (e.g., an 8 bp cutter would produce ~46,000 fragments).
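The arithmetic above can be checked with a short script (a sketch; the 3 Gbp genome size and the 6 bp and 8 bp cutters are the values discussed in the answer):

```python
def expected_fragments(genome_size_bp: int, site_len_bp: int) -> int:
    """Expected fragment count for a random sequence: the genome is cut
    on average once every 4**site_len_bp bases."""
    cut_interval = 4 ** site_len_bp          # e.g. 4**6 = 4096 bp between cuts
    return round(genome_size_bp / cut_interval)

# 6 bp cutter on a 3 Gbp genome
print(expected_fragments(3_000_000_000, 6))  # 732422 fragments (~730,000)
# 8 bp cutter for comparison: fewer, larger fragments
print(expected_fragments(3_000_000_000, 8))  # 45776 fragments (~46,000)
```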

Q13 — Open Short Answer
Explain what a Reduced Representation Library (RRL) is and why it is useful for SNP discovery in livestock species. Describe the general steps to construct one.
✓ Model Answer

A Reduced Representation Library (RRL) is a method to sequence only a fraction of the genome by using restriction enzymes to fragment the DNA and selecting fragments of a specific size range. This dramatically reduces the amount of sequencing needed while still sampling reproducible, genome-wide locations.

Steps (as in the Ramos et al. pig study):

1. Pool DNA from multiple individuals (equal amounts per individual) to capture population-level variation.

2. Digest pooled DNA with restriction enzymes (e.g., AluI, HaeIII, MspI).

3. Select fragments of a specific size range (reduced representation).

4. Sequence the selected fragments using NGS (e.g., Illumina).

5. Align reads to the reference genome and call SNPs using quality filters.

Why useful: It enables cost-effective, large-scale SNP discovery across the genome without the expense of whole-genome sequencing all individuals. The discovered SNPs can then be used to design SNP chips (e.g., PorcineSNP60 BeadChip).

Q14 — Open Short Answer
Compare and contrast commercially available SNP chips versus custom genotyping arrays. When would you choose one over the other?
✓ Model Answer

Commercial SNP chips (e.g., PorcineSNP60, BovineSNP50): Pre-designed for common species with known SNPs. They offer standardized, trusted data quality, widespread adoption enabling cross-study comparisons, low per-sample cost, and comprehensive genome coverage. Best for well-studied species with existing tools.

Custom genotyping arrays: Designed for specific research needs. Advantages: enable studies of species/populations not supported by standard products, allow focus on specific genes/regions of interest, and conserve resources by excluding irrelevant regions.

Choose commercial when working with common species (cattle, pigs, humans), needing standardized results, or conducting large population studies. Choose custom when studying non-model species, targeting specific genomic regions relevant to a particular disease/trait, or when commercial products don't cover your variants of interest.

Q15 Tricky
In the RE-GBS context, what happens when the target populations are more divergent than expected?
A) It results in a lower number of detected SNPs
B) It results in increased missing data, complicating downstream analysis
C) It increases the per-sample cost by more than 10-fold
D) It makes restriction enzyme digestion more efficient
Explanation
When populations are more divergent than expected, RE-GBS protocols can result in increased missing data (because restriction sites may differ between divergent individuals, leading to different fragments being sequenced), complicating downstream analysis. Conversely, low divergence results in a lower number of detected SNPs. This is a tricky distinction — high divergence → more missing data; low divergence → fewer SNPs.
Q16 Medium
What is the purpose of genomic selection in agricultural species?
A) To sequence and assemble the genomes of all individuals in a breeding population
B) To identify and remove deleterious mutations from the population
C) To select animals based on a single marker linked to one trait
D) To improve quantitative traits using whole-genome molecular markers combined with phenotypic and pedigree data
Explanation
Genomic selection aims to improve quantitative traits in large breeding populations through the use of whole-genome molecular markers. Genomic prediction combines marker data with phenotypic and pedigree data (when available) to increase the accuracy of predicting breeding and genotypic values. This differs from marker-assisted selection (MAS), which selects based on individual markers linked to specific traits.

📝Copy Number Variation (CNV)
Q1 Easy
According to Redon et al. (2006), a copy number variation (CNV) is defined as a DNA segment that is:
A) At least 100 bp and present at variable copy number
B) At least 500 bp and differs by a single nucleotide
C) 1 kb or larger and present at variable copy number compared to a reference genome
D) At least 1 Mb and completely deleted from some genomes
Explanation
Redon et al. (2006) defined a CNV as a DNA segment that is 1 kb or larger and present at variable copy number in comparison with a reference genome. Lee et al. (2008) similarly defined CNVs as intra-specific gains or losses of more than 1 kb of genomic DNA. Note that CNVs are not simply SNPs — they involve larger structural changes.
Q2 Medium
Which of the following is NOT a platform for CNV analysis?
A) GWAS with chi-squared testing
B) Array Comparative Genome Hybridization (aCGH)
C) Comparative intensity analysis of SNP genotyping chips
D) Next-generation sequencing platforms
Explanation
The three main platforms for CNV analysis are: (1) Array Comparative Genome Hybridization (aCGH), including whole genome tilepath arrays and oligonucleotide arrays; (2) Comparative intensity analysis of SNP genotyping chips (Affymetrix and Illumina); and (3) Next-Generation Sequencing platforms. GWAS with chi-squared testing is used for association studies, not specifically for CNV detection.
Q3 Medium
In oligonucleotide-based aCGH, the reference DNA and test DNA are labeled with:
A) Reference with Cy3 (green) and test with Cy5 (red)
B) Reference with Cy5 (red) and test with Cy3 (green)
C) Both with the same fluorescent dye at different concentrations
D) Neither is labeled; hybridization is detected by mass spectrometry
Explanation
In aCGH, the reference DNA is labeled with Cy5 and the sample/test DNA is labeled with Cy3. Both are then co-hybridized to the oligonucleotide microarray. The ratio of the two fluorescent signals at each probe position indicates whether the test sample has a gain (more Cy3), loss (more Cy5), or normal copy number (equal signals) relative to the reference.
Q4 Hard
What are the four main NGS-based methods for detecting CNVs?
A) Alignment, Variant Calling, Filtering, Annotation
B) PCR, Sanger, Microarray, FISH
C) Read-Pair, Split-Read, Read-Depth, Haplotype-based
D) Read-Pair, Split-Read, Read-Depth, Assembly-based
Explanation
The four main methods are: (1) Read-Pair (RP) — compares insert sizes between mapped read-pairs, (2) Split-Read (SR) — uses reads where one mate fails to map to find breakpoints, (3) Read-Depth (RD) — detects CNVs based on correlation between coverage depth and copy number, and (4) Assembly-based (AS) — assembles contigs/scaffolds and compares them with the reference.
Q5 Tricky
Which NGS-based CNV detection method can determine the exact number of copies, unlike the others which only report positions?
A) Read-Pair (RP)
B) Split-Read (SR)
C) Read-Depth (RD)
D) Assembly-based (AS)
Explanation
Unlike RP and SR, the Read-Depth (RD) method can estimate the exact copy number of a region; RP and SR only report the positions of potential CNVs, not the counts. RD works particularly well for large-size CNVs, which are hard to detect with RP and SR. The method is based on the hypothesis that there is a correlation between depth of coverage and the copy number of a region.
Q6 Medium
What is a major limitation of assembly-based (AS) methods for CNV detection?
A) They can only detect homozygous structural variants and demand extensive computational resources
B) They require paired-end sequencing data exclusively
C) They are limited to CNVs smaller than 1 kb
D) They are unable to detect any insertions, only deletions
Explanation
Assembly-based methods have several limitations: (1) overwhelming demand on computational resources, (2) eukaryotic genomes contain repeats and segmental duplications which reduce accuracy, and (3) they are unable to handle haplotype sequences, meaning only homozygous structural variations can be detected. This is why AS methods are less commonly used for CNV detection in practice.
Q7 Hard
In PennCNV, what is the expected B-Allele Frequency (BAF) pattern for a triploid (duplicated) region?
A) Two bands at 0 and 1
B) Four bands at 0, 0.33, 0.66, and 1
C) Three bands at 0, 0.5, and 1
D) A single band at 0.5
Explanation
For a triploid (duplicated) region with three allele copies, the possible genotypes are AAA (BAF=0), AAB (BAF≈0.33), ABB (BAF≈0.66), and BBB (BAF=1), giving four allele tracks. In contrast, normal diploid regions have three bands (0, 0.5, 1), and deleted regions have only two bands (0 and 1). This is a key pattern for identifying duplications from SNP chip data.
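The allele-track pattern generalizes: with copy number n, the possible B-allele counts are 0 through n, giving BAF bands at k/n. A minimal sketch of that rule (the function name is illustrative):

```python
def baf_bands(copy_number: int) -> list[float]:
    """Expected B-allele frequency bands for a region with the given
    copy number: one band per possible B-allele count (0..copy_number)."""
    return [round(k / copy_number, 2) for k in range(copy_number + 1)]

print(baf_bands(1))  # deletion: [0.0, 1.0] (two bands)
print(baf_bands(2))  # normal diploid: [0.0, 0.5, 1.0] (three bands)
print(baf_bands(3))  # triploid/duplication: [0.0, 0.33, 0.67, 1.0] (four bands)
```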
Q8 Medium
What does a Log2 Ratio (LogR) value of 0 indicate in CNV analysis using microarrays?
A) Complete deletion of the region
B) Duplication (copy number gain)
C) Loss of heterozygosity
D) Normal copy state (copy number = 2)
Explanation
LogR represents the difference between a reference data point and the sample of interest on a Log2 scale. A value of 0 represents the normal copy state of 2 (equal signal in reference and sample). Positive LogR values indicate copy number gain (duplication), while negative values indicate copy number loss (deletion).
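The interpretation can be written as a tiny classifier (a sketch; the ±0.1 noise threshold is an illustrative choice, not a fixed standard):

```python
def copy_state(log_r: float, noise: float = 0.1) -> str:
    """Classify a probe's Log2 ratio: ~0 => normal (CN=2),
    positive => gain, negative => loss. `noise` absorbs small fluctuations."""
    if log_r > noise:
        return "gain"
    if log_r < -noise:
        return "loss"
    return "normal"

print(copy_state(0.0))    # normal: equal signal in reference and sample
print(copy_state(0.58))   # gain: log2(3/2) ~ 0.58, one extra copy
print(copy_state(-1.0))   # loss: log2(1/2) = -1, one copy lost
```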
Q9 Tricky
PennCNV uses a hidden Markov model (HMM) for CNV calling. What makes it different from segmentation-based algorithms?
A) It integrates SNP allelic ratio distribution and other factors in addition to signal intensity
B) It only uses signal intensity data, ignoring allelic information
C) It cannot use family information for CNV calling
D) It works only with Illumina data, not Affymetrix
Explanation
PennCNV differs from segmentation-based algorithms in that it considers SNP allelic ratio distribution (BAF) as well as other factors, in addition to signal intensity (LRR) alone. It integrates multiple sources of information through its HMM framework. PennCNV can handle both Illumina and Affymetrix data and can optionally utilize family information to generate family-based CNV calls.
Q10 Tricky
In loss of heterozygosity (LOH) regions, what is characteristic of the BAF pattern?
A) Three bands at 0, 0.5, and 1 (normal pattern)
B) Four bands at 0, 0.33, 0.66, and 1 (like duplication)
C) Two bands at 0 and 1 only (no heterozygous SNPs), with unchanged copy number
D) A single band at 0.5 indicating all heterozygous SNPs
Explanation
In LOH regions, copy number is unchanged (LogR ≈ 0), but only homozygous SNPs (AA or BB) are present, giving only two BAF bands at 0 and 1. This distinguishes LOH from deletion: in deletion, you also see two bands, but LogR is negative (reduced copy number). LOH can arise through mechanisms like mitotic recombination or gene conversion without actual loss of DNA.
Q11 Medium
What is the purpose of cross-species aCGH, as used in the goat genome study?
A) To determine the phylogenetic relationship between cattle and goats
B) To detect CNVs in goats using a chip designed based on the bovine genome
C) To identify SNPs shared between cattle and goats
D) To compare gene expression levels between cattle and goats
Explanation
Cross-species aCGH involves using a microarray chip designed based on one species' genome (bovine, Btau_4.0 and UMD 2.0 assemblies) to detect CNVs in a closely related species (goat). This was done because a goat-specific aCGH chip was not available. It leverages the conservation between closely related genomes to study structural variation in species that lack their own dedicated genomic tools.
Q12 Medium
In the Read-Depth (RD) method for CNV detection, what is the purpose of normalizing the read counts?
A) To increase the number of detected CNVs
B) To align reads to the correct chromosomal positions
C) To convert read counts to allele frequencies
D) To remove potential biases mainly due to GC content and repeat regions
Explanation
In the RD method: (1) reads are aligned and depth is counted per window; (2) counts are normalized to remove potential biases, mainly due to GC content and repeat regions, and a segmentation algorithm then identifies contiguous windows with the same copy number; (3) statistical significance is assessed and filters are applied. GC-content bias is a well-known confound in sequencing depth — GC-rich regions tend to have different coverage than expected.
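One common way to do the normalization step is to scale each window's count so that windows with similar GC content share the same median coverage. A minimal sketch of that idea with hypothetical data (real RD tools implement more refined versions of this correction):

```python
from statistics import median

def gc_normalize(counts, gc_bins):
    """Scale each window's read count so that windows in every GC bin
    share the same median coverage (removes GC-content bias)."""
    overall = median(counts)
    # collect counts per GC bin, then take each bin's median
    per_bin = {}
    for c, g in zip(counts, gc_bins):
        per_bin.setdefault(g, []).append(c)
    bin_median = {g: median(v) for g, v in per_bin.items()}
    return [c * overall / bin_median[g] for c, g in zip(counts, gc_bins)]

# windows in the "high" GC bin are systematically over-covered here
counts  = [100, 102, 98, 150, 148, 152]
gc_bins = ["mid", "mid", "mid", "high", "high", "high"]
print(gc_normalize(counts, gc_bins))  # both bins now centered on the same median
```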
Q13 — Open Short Answer
Describe the general steps of an oligonucleotide-based aCGH experiment for CNV detection. What data does aCGH produce and how is it interpreted?
✓ Model Answer

Steps:

1. Design microarray with long oligonucleotide probes (50-70 bp) based on the reference genome, spaced at regular intervals while avoiding repetitive sequences.

2. Extract high-quality DNA from both a reference sample and the test sample.

3. Label reference DNA with Cy5 and test DNA with Cy3.

4. Co-hybridize both labeled DNAs to the microarray — they compete to bind the probes.

5. Scan fluorescent images using a microarray scanner.

6. Normalize the data.

7. Analyze CNVs using specialized software (e.g., CGHWeb, SignalMap).

Interpretation: The Log2 ratio of Cy3/Cy5 signal at each probe is calculated. A Log2 ratio of 0 indicates normal copy number, positive values indicate gains (duplications), and negative values indicate losses (deletions). Various smoothing/segmentation algorithms (e.g., CBS, BioHMM, GLAD) can be used to identify CNV regions from the raw data.

Q14 — Open Short Answer
Compare the four NGS-based methods for CNV detection (Read-Pair, Split-Read, Read-Depth, Assembly-based). What are the strengths and limitations of each?
✓ Model Answer

Read-Pair (RP): Compares insert sizes of paired-end reads with expected size from reference. Detects medium-sized insertions and deletions. Limitation: insensitive to small events because small perturbations are hard to distinguish from normal variability. Reports positions but not copy number counts.

Split-Read (SR): Uses reads where one mate maps but the other fails to map fully. Provides precise breakpoints at single base pair resolution. Limitation: requires reads to span breakpoints, may miss larger events. Reports positions but not copy counts.

Read-Depth (RD): Based on the correlation between coverage depth and copy number. Unique advantage: can detect the exact number of copies (not just positions). Works well for large CNVs. Can be applied to single samples, case/control pairs, or populations. Requires normalization for GC content and repeat biases.

Assembly-based (AS): Generates contigs/scaffolds and compares them with the reference. Theoretically can detect all forms of variation. Major limitations: high computational resource demands, poor performance in repeat regions, and can only detect homozygous structural variants (cannot handle haplotype sequences).

Q15 — Open Short Answer
What data does PennCNV require for CNV calling from SNP genotyping arrays, and what statistical model does it use? How does it differ from segmentation-based algorithms?
✓ Model Answer

PennCNV requires: (1) LRR (Log R Ratio) and BAF (B-Allele Frequency) values from the signal intensity file, (2) population frequency of B alleles, (3) SNP genome coordinates, and (4) an appropriate HMM model.

PennCNV uses a Hidden Markov Model (HMM) that integrates multiple sources of information to infer CNV calls. Unlike segmentation-based algorithms that rely primarily on signal intensity alone, PennCNV also considers the SNP allelic ratio distribution (BAF) and other factors. This integration of multiple data sources (LRR + BAF + population frequencies) makes it more robust at distinguishing true CNVs from noise.

PennCNV can handle both Illumina and Affymetrix array data and can optionally utilize family information to generate family-based CNV calls or use a validation-calling algorithm for specific candidate CNV regions.


📝Population Genomics, Inbreeding & ROH
Q1 Easy
Which of the following is NOT a condition required for Hardy-Weinberg Equilibrium (HWE)?
A) Random mating
B) Small population size
C) No mutation, migration, or selection
D) Organisms are diploid
Explanation
HWE requires an infinitely large population size to eliminate genetic drift. A small population size would violate HWE assumptions. The full list of conditions includes: diploid organisms, exclusively sexual reproduction, non-overlapping generations, random mating, infinitely large population, equal allele frequencies between sexes, and no evolutionary forces (mutation, migration, selection, gene flow).
Q2 Easy
Genetic drift is defined as:
ADirected changes in allele frequency due to natural selection
BChanges in allele frequency due to migration between populations
CRandom changes in allele frequencies from one generation to the next, especially in small populations
DIncrease in genetic diversity over time due to mutations
Explanation
Genetic drift refers to random changes in allele frequencies from one generation to the next due to chance events, not natural selection. It is most impactful in small populations where not all alleles are guaranteed to be transmitted. The direction of drift is entirely random — no selective pressure guides it. Small populations can lose alleles entirely by chance, while larger populations are less prone to this but still experience small fluctuations.
Q3 Medium
What happens to genetic diversity after a population bottleneck?
A) Genetic diversity increases because selection is relaxed
B) Allele frequencies remain unchanged because bottlenecks are neutral events
C) Only beneficial alleles are retained through the bottleneck
D) Genetic diversity is reduced, allele frequencies shift randomly, and homozygosity increases
Explanation
A bottleneck is a special case of genetic drift where a population experiences a sharp reduction in size due to a random, drastic event. The surviving gene pool is a non-representative sample of the original. Alleles that were common might be lost while rare alleles may become fixed, leading to reduced genetic diversity and increased homozygosity. The changes are random, not selective — common alleles aren't preferentially retained.
Q4 Medium
What is an "outlier locus" in population genomics?
A) A genomic region showing significantly stronger allele frequency differences between populations than expected under neutral conditions
B) A region where all individuals in a population have identical genotypes
C) A genetic locus that is located outside the coding regions of the genome
D) A region with a high mutation rate that creates new alleles every generation
Explanation
Outlier loci are genomic regions that show much stronger allele frequency differences between populations than expected under neutral conditions (genetic drift, migration, demography). These loci may be under selection — either natural (adaptive traits) or artificial (breeding). They are identified using tools like FST scans, PCA, or Bayesian methods like BayeScan.
Q5 Easy
What does the inbreeding coefficient (F) represent?
A) The number of deleterious mutations in an individual's genome
B) The probability that two alleles at a given locus are identical by descent (IBD)
C) The proportion of heterozygous loci in an individual
D) The rate of mutation at each genomic locus
Explanation
The inbreeding coefficient (F) represents the probability that two alleles at a randomly chosen locus are identical by descent (IBD) — meaning both alleles originated from a common ancestor. F = 0 means no inbreeding (alleles from unrelated parents); F = 1 means complete inbreeding (all loci are homozygous for IBD alleles). F also reflects the level of autozygosity — the proportion of an individual's genome that is homozygous due to descent from a common ancestor.
Q6 Tricky
What is the key difference between identical by descent (IBD) and identical by state (IBS)?
A) IBD refers to alleles on the same chromosome; IBS refers to alleles on different chromosomes
B) IBD alleles are always in heterozygous state; IBS alleles are always homozygous
C) IBD alleles are identical because they were inherited from a common ancestor; IBS alleles are identical by chance without known shared ancestry
D) There is no practical difference; IBD and IBS are interchangeable terms
Explanation
IBD (Identical by Descent): Two alleles are genetically identical AND came from the same common ancestor through inheritance. IBS (Identical by State): Two alleles look the same (same nucleotide sequence), but there is no known shared ancestor — they may be identical by chance. Only IBD contributes to inbreeding. IBS does not necessarily reflect inbreeding and may occur even in outbred populations. This distinction is crucial for correctly interpreting genomic inbreeding estimates.
Q7 Hard
Which of the following is NOT a limitation of pedigree-based inbreeding coefficient (FPED)?
A) It assumes all animals of the base population are unrelated
B) It does not account for the stochasticity of recombination during meiosis
C) It assumes all pedigree registrations are correct
D) It directly measures the actual homozygous regions in the individual's DNA
Explanation
FPED does NOT directly measure actual DNA — that's a feature of genomic methods, not a limitation of FPED. The actual limitations are: (i) assumes founder animals are unrelated, (ii) needs complete pedigree registration, (iii) assumes correct pedigree records, (iv) does not account for stochastic recombination events, and (v) does not consider selection biases on specific genomic regions. These are five specific limitations listed in the lecture.
Q8 Easy
What is a Run of Homozygosity (ROH)?
A) A continuous stretch of DNA where all polymorphic loci are homozygous, with no heterozygous genotype
B) A region of the genome with higher-than-expected heterozygosity
C) A stretch of DNA where copy number is variable between individuals
D) A region where recombination rates are extremely high
Explanation
ROH are continuous and uninterrupted chromosome portions showing homozygosity at all loci without any heterozygous genotype. They provide evidence of identical by descent (IBD) inheritance. Only polymorphic sites (SNPs) are used to detect ROHs. The ROH ends where the first heterozygous SNP is encountered. ROHs of different sizes indicate different inbreeding histories: long ROHs suggest recent inbreeding, short ROHs suggest more distant ancestral events.
Q9 Medium
How is the genomic inbreeding coefficient FROH calculated?
A) Number of ROH segments divided by total number of chromosomes
B) Sum of lengths of all ROH segments divided by the total autosomal genome length
C) Average length of ROH segments divided by average chromosome length
D) Number of homozygous SNPs divided by total number of SNPs
Explanation
FROH = SROH / LGEN, where SROH is the sum of all ROH segment lengths and LGEN is the total length of the autosomal genome. This gives the proportion of the genome that is covered by homozygous segments inherited from a common ancestor. It directly measures autozygosity from DNA data, unlike pedigree-based estimates which are theoretical probabilities.
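The formula is a straightforward ratio; a sketch (the segment lengths and the 2.5 Gbp autosomal length below are illustrative values, not from the lecture):

```python
def f_roh(roh_lengths_bp, autosome_length_bp):
    """F_ROH = sum of ROH segment lengths / total autosomal genome length."""
    return sum(roh_lengths_bp) / autosome_length_bp

# e.g. three ROH segments totalling 250 Mb on a 2.5 Gb autosomal genome
segments = [150_000_000, 60_000_000, 40_000_000]
print(f_roh(segments, 2_500_000_000))  # 0.1 -> 10% of the genome is autozygous
```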
Q10 Tricky
What does a long ROH indicate, compared to a short ROH?
A) Long ROH = ancient inbreeding; Short ROH = recent inbreeding
B) Long ROH = higher mutation rate; Short ROH = lower mutation rate
C) Long ROH = recent inbreeding; Short ROH = ancient or distant inbreeding
D) Long ROH = more recombination events; Short ROH = fewer recombination events
Explanation
Long ROHs indicate recent inbreeding because recombination has not had enough generations to break up the large haplotype block inherited from close common ancestors. Short ROHs reflect ancient or distant relatedness where many recombination events have gradually broken down the original haplotype over many generations. This is the same principle used for LD-based age estimation of mutations: long haplotype + strong LD → recent; short haplotype + weak LD → older.
Q11 Medium
What effect does inbreeding have on genotype frequencies?
A) Increases homozygous frequencies and decreases heterozygous frequency
B) Changes allele frequencies by favoring dominant alleles
C) Increases heterozygous frequency and decreases homozygous frequencies
D) Has no effect on genotype frequencies but changes allele frequencies
Explanation
Inbreeding does NOT change allele frequencies, but it does alter genotype frequencies: it increases homozygous genotype frequencies (AA and aa) and decreases heterozygous genotype frequency (Aa). This deviation from expected frequencies violates HWE assumptions. A significant excess of homozygotes and deficit of heterozygotes detected by chi-squared testing is a hallmark of inbreeding.
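The effect can be made explicit with the standard quantitative-genetics expressions: with allele frequencies p and q and inbreeding coefficient F, genotype frequencies become p² + pqF, 2pq(1 − F), and q² + pqF, while allele frequencies stay p and q. A quick numerical check (p = 0.6, F = 0.25 are illustrative values):

```python
def genotype_freqs(p: float, f: float):
    """Genotype frequencies under inbreeding coefficient f:
    heterozygosity shrinks by a factor (1 - f); allele frequencies unchanged."""
    q = 1 - p
    aa = p * p + p * q * f    # AA: homozygote excess
    het = 2 * p * q * (1 - f) # Aa: heterozygote deficit
    bb = q * q + p * q * f    # aa: homozygote excess
    return aa, het, bb

aa, het, bb = genotype_freqs(p=0.6, f=0.25)
print(aa, het, bb)   # homozygotes up, heterozygotes down (0.42, 0.36, 0.22)
print(aa + het / 2)  # allele frequency of A is still 0.6
```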
Q12 Medium
What is an ROH island?
A) A region of the genome with no ROHs in any individual
B) A single very long ROH in one individual's genome
C) A genomic region where ROHs overlap between chromosomes in the same individual
D) A genomic region where a high proportion of individuals in a population share ROHs at the same location
Explanation
ROH islands are specific chromosome regions where a high frequency of individuals in a population share ROHs at the same genomic location — they are "hotspots" of shared homozygosity. They often indicate selection pressure (natural or artificial) acting on that region, because an allele in that region is advantageous when homozygous. For example, in African populations, ROH islands may contain genes for trypanosome resistance.
Q13 Tricky
A genotyping error inside a true ROH would most likely cause:
A) The ROH to appear longer than it actually is
B) The ROH to be broken into smaller pieces, underestimating inbreeding
C) A false detection of a CNV at that position
D) No effect on ROH detection because algorithms account for all errors
Explanation
A single genotyping error inside a true ROH (e.g., a homozygous SNP mistakenly called as heterozygous) can break that continuous homozygous stretch into smaller pieces, artificially shortening the detected ROH length. This leads to underestimation of ROH lengths and potentially the overall inbreeding level. Mitigation: use high-quality SNP data, strict quality control, and appropriate minimum window sizes.
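The effect is easy to simulate with a run-length scan over genotype calls (0/2 = homozygous allele counts, 1 = heterozygous): a single miscalled SNP splits one long ROH into two shorter runs. A sketch with a simple "at least 5 consecutive homozygous SNPs" rule (real ROH callers use length, SNP-density, and allowed-heterozygote parameters):

```python
def roh_runs(genotypes, min_snps=5):
    """Return lengths (in SNPs) of homozygous runs of at least min_snps.
    genotypes: list of 0/1/2 allele counts; 1 means heterozygous."""
    runs, current = [], 0
    for g in genotypes + [1]:          # sentinel heterozygote ends the last run
        if g != 1:
            current += 1
        else:
            if current >= min_snps:
                runs.append(current)
            current = 0
    return runs

true_roh = [0] * 20                    # one clean 20-SNP ROH
print(roh_runs(true_roh))              # [20]
with_error = true_roh[:]
with_error[10] = 1                     # one genotyping error in the middle
print(roh_runs(with_error))            # [10, 9] -- the ROH is split, length underestimated
```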
Q14 Medium
Which ROH pattern would you expect in a consanguineous population (close-kin mating)?
A) Very high number and length of ROHs
B) Few ROHs, mostly short
C) Very few ROHs with high heterozygosity
D) No ROHs detectable
Explanation
Different populations show distinct ROH signatures: Large outbred = few, short ROHs; Admixed = very few ROHs, high heterozygosity; Small population = more numerous and longer ROHs; Consanguineous = very high number and length of ROHs (close-kin mating); Bottleneck = many ROHs, variable length. A consanguineous population has the most extreme ROH burden because close relatives share large identical chromosome segments.
Q15 Medium
What does FST = 0 between two populations indicate?
A) Complete genetic differentiation with no shared alleles
B) One population is a subset of the other
C) Both populations have undergone a recent bottleneck
D) No genetic differentiation — both populations have the same allele frequencies
Explanation
FST = (HT − HS) / HT, where HT is total expected heterozygosity and HS is average expected heterozygosity within subpopulations. FST = 0 means no genetic differentiation (HT = HS), indicating populations are genetically identical in allele frequencies. FST = 1 means complete differentiation with no shared alleles. Values between 0 and 1 indicate partial differentiation.
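The formula can be evaluated directly from subpopulation allele frequencies. A sketch for the simplest case of two equal-sized subpopulations (an assumption of this example, not a general requirement):

```python
def fst(p1: float, p2: float) -> float:
    """FST = (HT - HS) / HT for two equal-sized subpopulations, given the
    frequency of one allele in each. HT uses the pooled mean frequency;
    HS is the mean within-subpopulation expected heterozygosity."""
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)                      # total expected het.
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-pop het.
    return (ht - hs) / ht

print(fst(0.5, 0.5))  # 0.0 -- identical allele frequencies, no differentiation
print(fst(1.0, 0.0))  # 1.0 -- complete differentiation, no shared alleles
print(fst(0.8, 0.2))  # ~0.36 -- partial differentiation
```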
Q16 Medium
Why are FST values typically calculated over genomic windows rather than individual SNPs?
A) Individual SNPs are always uninformative
B) Window-based analysis is computationally faster
C) Averaging over windows captures linkage disequilibrium and provides more robust, less noisy estimates
D) Genomic windows can only be applied to whole-genome sequencing data, not SNP chips
Explanation
Individual SNP-level FST can be noisy or biased. Averaging over genomic windows (e.g., 1 Mb) captures linkage disequilibrium — the causal mutation may influence nearby genetic variants. This sliding window approach provides a more robust measure of genetic differentiation and increases the signal-to-noise ratio. This is analogous to how sliding windows are used in selection scans.
Q17 Medium
In the PLINK PED file format, how many columns are needed to represent genotypes for 50 SNPs per individual?
A) 56 columns
B) 106 columns
C) 50 columns
D) 100 columns
Explanation
The PED file has 6 mandatory columns (Family ID, Individual ID, Paternal ID, Maternal ID, Sex, Phenotype) plus 2 columns per SNP (for the two alleles in a diploid organism). For n SNPs: total columns = 6 + 2n. For 50 SNPs: 6 + 2(50) = 106 columns.
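The column count follows directly from the layout — 6 fixed columns plus two allele columns per SNP:

```python
def ped_columns(n_snps: int) -> int:
    """PLINK .ped layout: 6 mandatory columns (Family ID, Individual ID,
    Paternal ID, Maternal ID, Sex, Phenotype) + 2 allele columns per SNP."""
    return 6 + 2 * n_snps

print(ped_columns(50))  # 106
```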
Q18 Tricky
In the PLINK MAP file, which of the following is the correct column order?
A) Chromosome, SNP Identifier, Genetic distance, Base-pair position
B) SNP Identifier, Chromosome, Base-pair position, Genetic distance
C) Chromosome, Base-pair position, SNP Identifier, Genetic distance
D) Base-pair position, Chromosome, Genetic distance, SNP Identifier
Explanation
The standard PLINK .map file has 4 columns in this order: (1) Chromosome number, (2) SNP Identifier (e.g., rs123456 or custom label), (3) Genetic distance (usually set to 0), (4) Base-pair position. This file serves as the reference for aligning genotype data in the .ped file and ensures correct interpretation of SNPs during analysis.
Q19 Medium
In the context of linkage disequilibrium (LD), a "selective sweep" refers to:
A) The removal of all deleterious alleles from a population by natural selection
B) The random loss of alleles during a population bottleneck
C) The increase in frequency of a beneficial mutation along with nearby linked neutral variants
D) The gradual decay of LD between distant loci over evolutionary time
Explanation
A selective sweep occurs when a beneficial mutation increases in frequency in the population, and nearby neutral loci also increase in frequency — not because they are beneficial, but because they are physically linked to the selected allele on the same chromosome (hitchhiking). This creates a region of reduced variation and strong LD around the selected site. Over time, recombination gradually breaks down this LD block.
Q20 Hard
Which combination of genomic tools correctly matches: detecting outlier loci, visualizing population structure, and Bayesian selection testing?
A. PLINK, GWAS, BLAST
B. PennCNV, PCA, TASSEL
C. BayeScan, PLINK, GenomeStudio
D. FST, PCA, BayeScan
Explanation
The three main statistical tools for identifying selection from the lecture are: (1) FST (Fixation Index) for detecting outlier loci by measuring genetic differentiation between populations; (2) PCA (Principal Component Analysis) for visualizing and grouping populations based on genetic similarity; and (3) BayeScan for Bayesian testing that explicitly compares a selection model versus a neutral drift model for each locus.
Q21 — Open Short Answer
Explain at least four limitations of the pedigree-based inbreeding coefficient (FPED), and describe how genomic inbreeding estimation methods (FROH) overcome these limitations.
✓ Model Answer

Limitations of FPED:

1. Assumes founders are unrelated: Does not account for true relatedness of base population animals.

2. Requires complete pedigree: Needs full registration for both paternal and maternal lineages; incomplete pedigrees lead to underestimation.

3. Assumes correct records: Cannot verify pedigree accuracy, especially in extensive production systems.

4. Ignores stochastic recombination: Assumes equal 25% inheritance from each grandparent, but actual inheritance varies (0-50%) due to random recombination.

5. Ignores selection: Does not consider biases from selection on specific genomic regions.

How FROH overcomes these:

FROH is calculated from actual DNA data (SNP genotyping or WGS) by measuring Runs of Homozygosity. It requires no pedigree information, directly measures autozygosity from the individual's genome, captures both recent inbreeding (long ROHs) and ancient inbreeding (short ROHs), and can detect hidden inbreeding from unknown or distant relatives. It reflects the real consequences of recombination and selection, providing more accurate individual-level estimates.
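The FROH idea reduces to a one-line ratio once ROH segments have been detected. A minimal sketch (function name and the 2.5 Gb autosomal genome length are illustrative assumptions):

```python
def f_roh(roh_lengths_bp, genome_length_bp):
    """Genomic inbreeding coefficient: fraction of the autosomal
    genome covered by runs of homozygosity."""
    return sum(roh_lengths_bp) / genome_length_bp

# e.g. 150 Mb of ROH in a 2.5 Gb autosomal genome
print(round(f_roh([100e6, 50e6], 2.5e9), 3))  # 0.06
```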

Q22 — Open Short Answer
Describe the FST statistic: what does it measure, how is it calculated, and how is it used in population genomics studies? Include interpretation of extreme values.
✓ Model Answer

FST (Fixation Index) measures the proportion of genetic diversity due to differences between populations versus within populations.

FST = (HT − HS) / HT

Where HT = total expected heterozygosity across all populations combined, and HS = average expected heterozygosity within individual subpopulations.

Interpretation:

• FST = 0: No differentiation — populations have identical allele frequencies.

• FST = 1: Complete differentiation — populations share no alleles (each fixed for different alleles).

• 0 < FST < 1: Partial differentiation — allele frequencies differ but overlap.

Practical use: Genomes are divided into windows (e.g., 1 Mb). FST is calculated for each window between population pairs and visualized in Manhattan plots. High FST peaks indicate regions of strong differentiation, potentially under selection. Low FST regions evolve neutrally. This allows identification of genomic regions associated with adaptive traits or breeding-specific selection.
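The formula above can be sketched for a single biallelic locus, assuming equal-sized subpopulations (helper names are mine):

```python
def expected_het(p):
    """Expected heterozygosity at a biallelic locus: 2p(1 - p)."""
    return 2 * p * (1 - p)

def fst(pop_freqs):
    """Fst = (HT - HS) / HT for one biallelic locus, given the
    allele frequency in each (equal-sized) subpopulation."""
    p_bar = sum(pop_freqs) / len(pop_freqs)
    ht = expected_het(p_bar)                                   # total
    hs = sum(expected_het(p) for p in pop_freqs) / len(pop_freqs)  # within
    return (ht - hs) / ht

print(fst([0.5, 0.5]))  # 0.0: identical allele frequencies
print(fst([1.0, 0.0]))  # 1.0: fixed for different alleles
```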

Q23 — Open Calculation
In a PLINK PED file, a father has genotype TT at a locus, the mother has genotype AT, and their child has genotype AA. Explain whether this is consistent with Mendelian inheritance and what could cause such a result.
✓ Model Answer

This is a Mendelian inconsistency.

Father: TT → can only pass T allele to offspring
Mother: AT → can pass either A or T allele
Possible child genotypes: AT (T from father + A from mother) or TT (T from father + T from mother)
Child genotype AA is IMPOSSIBLE — the father cannot provide an A allele

Possible causes of this inconsistency include: (1) sequencing/genotyping error (e.g., the child's genotype was miscalled), (2) data formatting error in the PED file, (3) sample mislabeling (wrong sample assigned to the child), or (4) incorrect pedigree (the stated father is not the biological father).

Tools like PLINK can automatically identify and flag these Mendelian errors during quality control checks.
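The allele-matching logic behind such a check can be expressed in a few lines (a simplified sketch for intuition, not PLINK's Mendel-error code):

```python
def mendel_consistent(father, mother, child):
    """True if the child's unordered genotype can be formed by taking
    one allele from each parent. Genotypes are 2-char strings, e.g. 'AT'."""
    for pa in father:
        for ma in mother:
            if sorted(child) == sorted(pa + ma):
                return True
    return False

print(mendel_consistent("TT", "AT", "AA"))  # False: father cannot give A
print(mendel_consistent("TT", "AT", "AT"))  # True
```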

Q24 — Open Short Answer
Describe how linkage disequilibrium (LD) originates from a new beneficial mutation and explain how the length of the LD block can be used to estimate the age of the mutation.
✓ Model Answer

Origin of LD from a new mutation:

1. A new beneficial mutation arises on a single chromosome within a specific haplotype context (surrounding markers).

2. The mutation is initially in complete LD with all nearby variants on that chromosome (they form a single haplotype block).

3. If the mutation is advantageous, natural selection increases its frequency in the population — and the linked nearby markers "hitchhike" along (selective sweep).

4. Over generations, recombination during meiosis gradually breaks up the original haplotype, shortening the LD block around the mutation.

Estimating mutation age from LD:

• Long haplotype + strong LD around the mutation → the mutation is recent (recombination has not had time to break the block).

• Short haplotype + weak LD → the mutation is older (many generations of recombination have eroded the original haplotype).

This principle is used in population genomic scans to detect recent versus ancient selection events and to estimate when adaptive alleles arose in a population.
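The recombination-driven decay described above follows the textbook relation D_t = D_0(1 − c)^t, where c is the recombination rate between the loci (this decay model is a standard assumption, not stated in the lecture):

```python
def ld_after_generations(d0, c, t):
    """Expected LD after t generations of random mating:
    D_t = D_0 * (1 - c)^t, with recombination rate c."""
    return d0 * (1 - c) ** t

# tight linkage (c = 0.001) preserves LD far longer than loose linkage
close = ld_after_generations(0.25, 0.001, 100)
loose = ld_after_generations(0.25, 0.05, 100)
print(close > loose)  # True: long intact blocks imply a recent mutation
```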

Q25 — Open Short Answer
List at least three factors that can cause biases or errors in ROH detection from genomic data. For each, explain how it affects ROH detection and how it can be mitigated.
✓ Model Answer

1. Genotyping Errors: A homozygous SNP miscalled as heterozygous breaks a true ROH into smaller pieces → underestimates inbreeding. Mitigation: Use high-quality SNP data, strict QC, appropriate minimum window sizes.

2. SNP Density and Distribution: Uneven SNP spacing means regions with low density may miss real ROHs or inaccurately size them, while high-density regions may detect more small ROHs. Mitigation: Consider SNP chip design, set minimum SNP number thresholds, interpret with caution in poorly covered regions.

3. Missing Data: Failed genotype calls create gaps that can break up ROHs → underestimation. Mitigation: Filter samples/SNPs with excessive missing data, use imputation, allow some tolerance for missing SNPs in ROH detection parameters.

4. Window Size Parameters: Too low thresholds → many spurious short ROHs (overestimate); too high → miss real short ROHs (underestimate). Mitigation: Choose parameters based on population-specific LD and SNP density; consult literature.

5. LD Variation: High LD populations may have long homozygous stretches by chance (false positives); low LD populations may have meaningful short ROHs. Mitigation: Adjust minimum ROH length thresholds based on typical LD structure.


📝Genome-Wide Association Studies (GWAS)
Q1 Easy
What is the primary aim of a genome-wide association study (GWAS)?
A. To sequence the complete genome of each individual in the study
B. To identify genetic variants (SNPs) associated with a trait or disease across the genome
C. To determine the complete pedigree of all study participants
D. To identify all genes in an organism's genome
Explanation
GWAS involve testing genetic variants across the genomes of many individuals to identify genotype–phenotype associations. Population-based association studies focus on identifying SNPs for which genotypes are associated with the trait under investigation, meaning they have different frequencies in affected vs. unaffected individuals, or different mean quantitative measures.
Q2 Medium
The common disease/common variant (CD/CV) hypothesis states that:
A. Common disorders are likely influenced by genetic variation that is also common in the population, with small individual effect sizes
B. Common diseases are caused by rare mutations with very large effect sizes
C. Only one common variant is responsible for each common disease
D. Diseases become common when their causative variants undergo positive selection
Explanation
The CD/CV hypothesis states that common disorders are influenced by common genetic variants. Key ramifications: (1) any single common variant must have a small effect size, and (2) multiple common alleles must influence disease susceptibility (the total genetic risk is spread across multiple genetic factors). This contrasts with Mendelian disorders where rare, highly penetrant alleles have large effect sizes.
Q3 Medium
What are the two main categories of phenotypes investigated in GWAS?
A. Dominant traits and recessive traits
B. Coding variants and non-coding variants
C. Structural variants and single nucleotide variants
D. Binary disease/affected phenotypes (case-control) and quantitative (continuous) measurements
Explanation
The two most widely considered categories of traits in GWAS are: (a) binary disease/affected phenotypes, where individuals are classified as affected cases or unaffected controls (e.g., coronary heart disease), and (b) quantitative (continuous) measurements, such as lipid profiles, BMI, stature, etc. Each requires different statistical approaches.
Q4 Medium
What is a "tag SNP" in the context of GWAS?
A. A SNP that directly causes the disease or trait being studied
B. A SNP used to label different chromosomes for identification
C. A SNP that serves as a proxy for nearby SNPs through LD, allowing genome-wide coverage without genotyping all SNPs
D. A SNP located at the start of every LD block in the genome
Explanation
Tag SNPs are selected to guarantee coverage of all common polymorphisms at some threshold of r². Because SNPs within LD blocks are strongly correlated, we need not genotype all common polymorphisms genome-wide. Instead, GWAS arrays use a smaller number of tag SNPs from which we can recover information about common variation across the genome. GWASs thus rely on "indirect association" — tag SNPs may not be causal but serve as proxies for causal variants within the same LD block.
Q5 Hard
The widely accepted genome-wide significance threshold of p < 5 × 10⁻⁸ in GWAS corrects for approximately:
A. 500,000 independent SNPs
B. 1 million independent LD blocks
C. 10 million individual SNPs on the array
D. The total number of genes in the human genome
Explanation
The p < 5 × 10⁻⁸ threshold corrects for approximately 1 million blocks of LD across the genome, within which common SNPs are assumed to be strongly correlated (Pe'er et al., 2008). This is essentially a Bonferroni correction for 1 million independent tests: 0.05 / 1,000,000 = 5 × 10⁻⁸. It accounts for the LD structure rather than treating every single SNP as independent.
Q6 Medium
What is the Bonferroni correction in the context of GWAS?
A. Dividing the significance level α by the number of tests (N) to achieve an experimentwise false positive rate of α
B. Multiplying each p-value by the sample size to account for population structure
C. Taking the logarithm of all p-values to normalize the distribution
D. Using only SNPs with minor allele frequency above 5%
Explanation
The Bonferroni correction adjusts the significance level to maintain an overall experimentwise false positive error rate. When testing N SNPs, the SNP-wise significance level is set to α/N. The disadvantage is that it assumes each test is independent, but in GWAS, SNPs are correlated due to LD, making the correction conservative (too strict). The widely accepted threshold of 5 × 10⁻⁸ accounts for this by estimating ~1 million effective independent tests.
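The α/N adjustment is a one-liner (function name is mine):

```python
def bonferroni_threshold(alpha, n_tests):
    """SNP-wise significance level alpha / N that keeps the
    experimentwise false-positive rate at alpha, assuming
    independent tests."""
    return alpha / n_tests

# ~1 million independent LD blocks gives the standard GWAS threshold
print(bonferroni_threshold(0.05, 1_000_000))
```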
Q7 Tricky
Why is the Bonferroni correction considered conservative for GWAS?
A. Because it uses too few SNPs in the correction
B. Because it only works for quantitative traits, not binary traits
C. Because it assumes each test is independent, but SNPs are correlated due to LD, overcorrecting the significance level
D. Because it does not account for sample size
Explanation
The Bonferroni correction treats each SNP test as independent (α/N). However, in GWAS, many SNPs are correlated with each other due to LD (linkage disequilibrium). The actual number of independent tests is therefore lower than the total number of SNPs tested, meaning Bonferroni overcorrects and may miss true associations (loss of power). This is why the effective number of independent LD blocks (~1 million) is used rather than the total number of SNPs.
Q8 Medium
What does a genomic control inflation factor (λGC) greater than 1 indicate in a GWAS?
A. The study has too few samples
B. All associations found are true positives
C. The genotyping platform has a high error rate
D. Unmeasured confounding due to genetic structure (population stratification)
Explanation
The genomic control inflation factor λGC is estimated by comparing the median of observed test statistics with the null distribution. λGC > 1 indicates inflation of test statistics due to unmeasured confounding from genetic structure (population stratification or cryptic relatedness). This is visualized on a QQ plot as observed p-values being more significant than expected under the null. A simple (but imperfect) correction is to divide all test statistics by λGC.
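The λGC computation itself is tiny. A sketch, where the null median of a 1-df chi-square statistic (~0.4549) is a standard constant assumed here, not taken from the lecture:

```python
from statistics import median

# Expected median of a chi-square(1 df) statistic under the null;
# 0.4549 is a standard approximation (an assumption of this sketch).
CHI2_1DF_NULL_MEDIAN = 0.4549

def lambda_gc(chi2_stats):
    """Genomic control inflation factor: median observed association
    test statistic divided by its expected null median."""
    return median(chi2_stats) / CHI2_1DF_NULL_MEDIAN

# a well-calibrated study: statistics centred on the null median
print(round(lambda_gc([0.40, 0.4549, 0.52]), 2))  # 1.0
```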
Q9 Hard
How does population stratification cause spurious associations in a case-control GWAS?
A. If disease prevalence differs between strata, cases are enriched from one stratum, and any SNP differing in frequency between strata will appear associated even without true association
B. Population stratification increases the mutation rate in cases relative to controls
C. Stratification reduces linkage disequilibrium in the population, making tag SNPs unreliable
D. It causes genotyping errors that are specific to one population stratum
Explanation
Consider a population with two underlying strata that differ in disease prevalence. Cases will more often be selected from the stratum with higher disease prevalence. As a result, any SNP that differs in allele/genotype frequency between the strata will appear to be associated with disease, even if there is no true association within each stratum. This is a confounding effect — the SNP frequency difference is due to population structure, not disease biology. Solutions include matching cases and controls by stratum, or using statistical methods like PCA to correct for structure.
Q10 Medium
On a GWAS quantile-quantile (QQ) plot, what does inflation of observed -log10(p-values) above the y=x line indicate?
A. No significant associations were found
B. Population structure that has not been accounted for in the analysis
C. The GWAS had too many samples
D. All genotyped SNPs are in perfect linkage equilibrium
Explanation
On a QQ plot, each SNP is plotted by its ranked observed -log10(p-value) against the expected ranked value under the null hypothesis. If most points fall on the y=x line, the study is well-calibrated. Systematic inflation above this line indicates that there are more significant signals than expected by chance, which is indicative of population structure not accounted for in the analysis. A few points deviating at the tail (far right) represent potentially true associations.
Q11 Medium
Which of the following is NOT one of the six key design considerations listed for a GWAS?
A. Population structure and stratification
B. Genome-wide significance and correction for multiple testing
C. Choice of restriction enzyme for library preparation
D. Sample size
Explanation
The six key GWAS design considerations are: (1) Phenotype definition, (2) Structure of common genetic variation (LD), (3) Sample size, (4) Population structure/stratification, (5) Genome-wide significance and correction for multiple testing, and (6) Replication. Choice of restriction enzyme is relevant to RRL/GBS library preparation, not to GWAS study design.
Q12 Tricky
The False Discovery Rate (FDR) correction in GWAS is calculated as:
A. α × k / N, where k is number of significant SNPs and N is total SNPs
B. α / k, where k is the number of significant SNPs
C. k / (N × α), where N is total tests
D. Nα / k, where N is total SNPs, α is the SNP-wise significance level, and k is the number of SNPs with p < α
Explanation
The FDR (Benjamini and Hochberg, 1995) fixes the expected number of false positives among significant associations. For an uncorrected SNP-wise significance level of α, the FDR = Nα/k, where N is the total number of SNPs tested and k is the number of SNPs with p < α. Using these relationships, one can define the appropriate SNP-wise significance threshold to obtain an overall FDR at a desired experimentwise error rate. Unlike Bonferroni, FDR accounts for the number of actual discoveries made.
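The Nα/k relationship can be sketched directly (function name is mine; the example numbers are illustrative):

```python
def fdr(n_tests, alpha, k_significant):
    """Expected false discovery rate among the k SNPs passing the
    SNP-wise significance level alpha: FDR = N * alpha / k."""
    return n_tests * alpha / k_significant

# 500,000 SNPs tested at alpha = 1e-5, 20 SNPs come out significant:
# N*alpha = 5 false positives expected, so about a quarter of the
# hits are expected to be noise
print(fdr(500_000, 1e-5, 20))
```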
Q13 Medium
Why is careful phenotype definition critical in a case-control GWAS design?
A. Non-specific case-control definitions increase heterogeneity in causal polymorphisms, reducing power for detection
B. It determines which restriction enzymes are used for genotyping
C. Phenotype definition only matters for quantitative traits, not binary traits
D. It is only important for replication studies, not discovery studies
Explanation
Careful phenotype definition is essential because non-specific case-control definitions can increase heterogeneity in the underlying causal genetic polymorphisms (and non-genetic risk factors), leading to decreased power for detection. If "cases" include individuals with different subtypes of a disease (each driven by different genetic variants), the signal from any single variant is diluted. The same principle applies to control definitions.
Q14 Medium
What is an "indirect association" in the context of GWAS?
A. An association between two different diseases mediated by the same gene
B. An association detected only in a replication cohort
C. A genotyped tag SNP that is associated with a trait as a surrogate for the true causal variant through LD
D. An association that is not statistically significant after Bonferroni correction
Explanation
Genotyped tag SNPs often lie in a region of high linkage disequilibrium with the actual causal variant. The tag SNP will be statistically associated with the trait as a surrogate for the disease SNP through an indirect association. The tag SNP may not itself be causal, but its genotypes serve as proxies for those at the causal polymorphism located within the same block of LD. This is why GWAS identifies "associated regions" rather than specific causal variants.
Q15 — Open Short Answer
List and briefly explain the six key design considerations for planning a GWAS study.
✓ Model Answer

1. Phenotype definition: Precisely define the trait under investigation. For binary traits (case-control), ensure case and control definitions are specific to avoid heterogeneity that reduces power. For quantitative traits, use standardized measurements (e.g., BMI, height).

2. Structure of common genetic variation (LD): Understand the LD structure in the target population to select appropriate tag SNPs. Common SNPs are arranged in LD blocks; genotyping arrays exploit this to cover variation efficiently.

3. Sample size: Key determinant of statistical power. Power depends on significance level, effect size, causal allele frequency, and LD between causal variant and tag SNP. Effect sizes for complex traits are small, so large samples are needed.

4. Population structure/stratification: Unmeasured confounding from population structure can cause spurious associations. Must be detected (QQ plots, λGC) and corrected (PCA, genomic control, matching).

5. Genome-wide significance and multiple testing correction: Must correct for testing hundreds of thousands of SNPs. Standard threshold: p < 5 × 10⁻⁸. Methods include Bonferroni, FDR, and permutation procedures.

6. Replication: Findings should be validated in independent cohorts to confirm true associations and rule out false positives.

Q16 — Open Calculation
In a GWAS testing 500,000 SNPs at a significance level of α = 0.05: (a) How many false positive associations would you expect by chance without correction? (b) What is the Bonferroni-corrected significance threshold? (c) Why might the standard threshold of 5 × 10⁻⁸ differ from the strict Bonferroni value here?
✓ Model Answer

(a) Expected false positives without correction:

Expected false positives = N × α = 500,000 × 0.05 = 25,000 SNPs

Without correction, 5% of all SNPs (25,000) would appear significant by chance — a huge false positive problem.

(b) Bonferroni-corrected threshold:

α_corrected = α / N = 0.05 / 500,000 = 1 × 10⁻⁷

Each SNP must reach p < 1 × 10⁻⁷ to be declared significant.

(c) Why the standard threshold differs:

The standard GWAS threshold of 5 × 10⁻⁸ was derived by correcting for approximately 1 million independent LD blocks across the human genome, not the raw number of SNPs on the array. Because the 500K tag SNPs act as proxies (through LD) for all common variants in the genome, the effective number of independent tests is the genome-wide number of LD blocks (~1 million, estimated from HapMap data), which exceeds the array size. The Bonferroni correction for 1 million tests: 0.05 / 1,000,000 = 5 × 10⁻⁸. This more stringent, array-independent threshold ensures genome-wide significance regardless of array density.

Q17 — Open Short Answer
Explain what population stratification is in GWAS, how it leads to spurious associations, and describe at least two methods to detect or correct for it.
✓ Model Answer

Population stratification arises when a study population consists of subgroups (strata) that differ in both allele frequencies and disease prevalence. If cases are preferentially drawn from one stratum, any SNP that differs between strata will appear associated with disease — even without a true biological link.

Example: A population has two ethnic strata. Disease X is more common in stratum 1. Cases will disproportionately come from stratum 1. A SNP that happens to be more common in stratum 1 (for ancestral reasons) will appear disease-associated even though it has no causal role.

Detection methods:

1. QQ plot inspection: Systematic inflation of observed p-values above the expected y=x line indicates population structure.

2. Genomic control (λGC): Comparing the median of observed test statistics with the null distribution. λGC > 1 indicates confounding from structure.

Correction methods:

1. Matching cases and controls by stratum to equalize population composition.

2. Dividing test statistics by λGC (simple but assumes uniform confounding across all SNPs, which may lose power).

3. PCA-based correction: Including principal components as covariates in the association model to adjust for ancestry differences.

4. Mixed models: Using kinship/relatedness matrices to account for both population structure and cryptic relatedness.

Q18 — Open Short Answer
Explain what linkage disequilibrium (LD) is, define the measures D' and r², and explain how LD is exploited in GWAS design through tag SNPs.
✓ Model Answer

Linkage Disequilibrium (LD) is a property of SNPs on a contiguous stretch of genomic sequence that describes the degree to which an allele of one SNP is inherited or correlated with an allele of another SNP within a population.

LD Measures:

The basic statistic is D = q₁₂ − q₁q₂, where q₁ and q₂ are allele frequencies and q₁₂ is the haplotype frequency. Under linkage equilibrium, D = 0 (alleles are randomly associated). To reduce dependence on allele frequencies, two standardized measures are used:

D': Ranges from 0 to 1. D' = 1 indicates complete LD (no recombination has occurred between the two loci).

r²: Ranges from 0 to 1. Represents the correlation between alleles. r² = 1 means the two SNPs are perfect proxies for each other. This is the most commonly used measure for GWAS design.

Exploitation in GWAS:

Because SNPs within LD blocks are strongly correlated, GWAS arrays need not genotype every common SNP. Instead, "tag SNPs" are selected that guarantee coverage of all common polymorphisms at a predetermined r² threshold. This enables efficient genome-wide coverage with fewer markers. GWAS then identifies tag SNPs with "indirect association" — they are proxies for the causal variant located within the same LD block. The International HapMap Project characterized LD patterns across populations to enable this approach.
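The three measures can be computed directly from allele and haplotype frequencies. A sketch; the D′ normalization by D_max follows the standard definition, which the lecture does not spell out:

```python
def ld_stats(q1, q2, q12):
    """D, D' and r^2 for two biallelic loci, given allele frequencies
    q1, q2 and the frequency q12 of the haplotype carrying both alleles."""
    d = q12 - q1 * q2
    # normalize D by its maximum possible magnitude at these frequencies
    if d >= 0:
        d_max = min(q1 * (1 - q2), (1 - q1) * q2)
    else:
        d_max = min(q1 * q2, (1 - q1) * (1 - q2))
    d_prime = abs(d) / d_max
    r2 = d * d / (q1 * (1 - q1) * q2 * (1 - q2))
    return d, d_prime, r2

# haplotype frequency equals the allele frequency: complete LD
d, d_prime, r2 = ld_stats(0.3, 0.3, 0.3)
```

Here D′ and r² both evaluate to 1 (up to floating-point rounding), the perfect-proxy case that tag SNP selection relies on.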

Lecture 14 – Software for Population Genomic Analysis (PLINK)

📝Lecture 14 — PLINK & Population Genomic Analysis
Q1 Easy
What is PLINK primarily designed for?
A. De novo genome assembly from long reads
B. Whole genome association analysis and related large-scale genomic analyses
C. RNA-seq differential expression analysis
D. Multiple sequence alignment and phylogenetic tree construction
Explanation
PLINK is a free, open-source whole genome association analysis toolset designed to perform a range of basic, large-scale analyses in a computationally efficient manner. Its tasks include data management, quality control, population stratification detection, association testing, and more.
Q2 Medium
In a PLINK PED file, what are the first six mandatory columns (in order)?
A. Chromosome, SNP ID, Genetic distance, Position, Allele 1, Allele 2
B. Sample ID, Family ID, Sex, Phenotype, Paternal ID, Maternal ID
C. Family ID, Individual ID, Paternal ID, Maternal ID, Sex, Phenotype
D. Family ID, Individual ID, Sex, Phenotype, Paternal ID, Maternal ID
Explanation
The PED file has six mandatory columns in strict order: (1) Family ID, (2) Individual ID, (3) Paternal ID, (4) Maternal ID, (5) Sex (1=male, 2=female, 0=unknown), (6) Phenotype. The order matters — swapping Sex and Phenotype or rearranging IDs would break the format. Note that the PED file has NO header line.
Q3 Medium
How many columns does a PED file have if 500 biallelic SNP markers are genotyped for a diploid organism (assuming all six mandatory fields are present)?
A. 506
B. 500
C. 6 + 500 = 506
D. 6 + 2 × 500 = 1006
Explanation
The formula is: 6 + 2 × number of markers. The first 6 columns are the mandatory fields. Each marker requires 2 columns (one per allele in a diploid organism). So 6 + 2 × 500 = 1006 columns. This is a commonly tested calculation — don't forget the factor of 2!
Q4 Easy
Which four columns does a PLINK MAP file contain?
A. Chromosome, SNP identifier, Genetic distance (morgans), Base-pair position
B. Chromosome, SNP identifier, Minor allele frequency, P-value
C. Family ID, Individual ID, SNP identifier, Genotype
D. Chromosome, SNP identifier, Base-pair position, Allele
Explanation
Each line in a MAP file describes one marker with exactly 4 columns: (1) Chromosome, (2) rs# or SNP identifier, (3) Genetic distance in morgans, (4) Base-pair position. Note that genetic distance is often set to 0 when unknown.
Q5 Tricky
In the PED file, sex is encoded as numeric values. Which coding does PLINK use?
A. 0 = male, 1 = female, 2 = unknown
B. 1 = male, 2 = female, 0 = unknown
C. M = male, F = female, U = unknown
D. 1 = female, 2 = male, 0 = unknown
Explanation
PLINK uses: 1 = male, 2 = female, 0 = unknown. Option D is a common trap — it reverses male and female. This small detail is easily confused and exactly the kind of thing a professor might test.
Q6 Medium
Which PLINK flag is used to load text-format PED/MAP files?
A. --bfile
B. --ped
C. --file
D. --input
Explanation
--file is used to load text-format PED + MAP files (e.g., --file Altamurana looks for Altamurana.ped and Altamurana.map). --bfile is used for binary files (FAM, BIM, BED). This is an important and frequently tested distinction.
Q7 Medium
What three files constitute the PLINK binary file format?
A. .fam (individual info), .bim (marker info), .bed (genotypes)
B. .ped (individual info), .map (marker info), .log (results)
C. .fam (genotypes), .bim (individual info), .bed (marker info)
D. .fam (individual info), .bim (genotypes), .bed (marker info)
Explanation
The binary format has three files: .fam stores individual/phenotype info (analogous to first 6 columns of PED), .bim stores marker position info (analogous to MAP), and .bed stores genotypes in compressed binary. Options C and D are traps that shuffle which file stores what.
Q8 Tricky
A PED file contains a line: 1 3 2 1 1 1 A A T C. What can we conclude about individual 3?
A. Individual 3 is female, from family 1, with parents unknown
B. Individual 3 is female, father is individual 2, mother is individual 1
C. Individual 3 is male, father is individual 1, mother is individual 2
D. Individual 3 is male, father is individual 2, mother is individual 1, and is homozygous AA at locus 1
Explanation
Reading the columns: Family ID=1, Individual ID=3, Paternal ID=2, Maternal ID=1, Sex=1 (male), Phenotype=1. Then locus 1 = A A (homozygous), locus 2 = T C (heterozygous). The column order is Father then Mother (not the other way around), and sex=1 means male. Option B reverses the sex coding, while option C swaps the father and mother assignments.
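Parsing such a line can be sketched in a few lines of Python (field names are mine; in practice PLINK does this parsing itself):

```python
def parse_ped_line(line):
    """Split one headerless PED row into the six mandatory fields plus
    a list of (allele1, allele2) genotype pairs."""
    f = line.split()
    fam, ind, father, mother, sex, pheno = f[:6]
    genotypes = list(zip(f[6::2], f[7::2]))  # two allele columns per SNP
    return {"family": fam, "individual": ind, "father": father,
            "mother": mother,
            "sex": {"1": "male", "2": "female"}.get(sex, "unknown"),
            "phenotype": pheno, "genotypes": genotypes}

rec = parse_ped_line("1 3 2 1 1 1 A A T C")
print(rec["sex"], rec["genotypes"])  # male [('A', 'A'), ('T', 'C')]
```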
Q9 Easy
Does the PLINK PED file contain a header row?
A. Yes, the first row always lists column names
B. No, the PED file does not use any header
C. Only when the file is in binary format
D. Yes, but the header is optional
Explanation
The PED file does NOT use any header. Every row directly represents an individual. This is explicitly stated in the lecture and is a detail students might forget when working with other bioinformatics formats that do use headers (like VCF).
Q10 Medium
Which PLINK flag converts text PED/MAP files to binary BED/BIM/FAM format?
A. --make-bed
B. --recode
C. --convert-binary
D. --bfile
Explanation
--make-bed converts the input to binary format (.bed/.bim/.fam). --recode does the opposite — it outputs text PED/MAP format. --bfile is an input flag (to load binary files), not a conversion flag. Example: ./plink --file Altamurana --make-bed --out Altamurana_binary.
Q11 Medium
What does the PLINK flag --mind 0.1 do during quality control?
A. Excludes SNPs with more than 10% missing genotypes
B. Includes only individuals with at least 10% heterozygosity
C. Excludes samples (individuals) with more than 10% missing genotypes
D. Sets the minor allele frequency threshold to 0.1
Explanation
--mind filters individuals (samples), not SNPs. It removes samples with a missing genotype rate above the specified threshold (here 10%). The equivalent filter for SNPs is --geno. This is a classic exam trap: confusing --mind (individuals) with --geno (SNPs).
Q12 Tricky
Which of the following PLINK QC commands is correctly described?
A. --geno 0.1 excludes individuals with more than 10% missing data
B. --geno 0.1 includes only SNPs with a genotyping rate of at least 90%
C. --maf 0.05 removes SNPs with a minor allele frequency above 0.05
D. --hwe 0.01 removes SNPs that are in Hardy-Weinberg equilibrium
Explanation
--geno 0.1 includes only SNPs with ≤10% missing data (i.e., ≥90% genotyping rate). Option A confuses --geno (SNPs) with --mind (individuals). --maf 0.05 includes SNPs with MAF ≥ 0.05 (not removes them). --hwe 0.01 includes SNPs with HWE p-value ≥ 0.01 (removes those significantly deviating from HWE).
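The three SNP-level filters can be mimicked for intuition. A simplified sketch, not PLINK's implementation; the per-SNP call rate, MAF and HWE p-value are assumed to be precomputed:

```python
def passes_qc(call_rate, maf, hwe_p,
              geno=0.1, maf_min=0.05, hwe_min=0.01):
    """SNP-level QC in the spirit of PLINK's filters:
    --geno : keep SNPs whose missing rate is <= geno
    --maf  : keep SNPs with minor allele frequency >= maf_min
    --hwe  : keep SNPs whose HWE test p-value is >= hwe_min"""
    return (1 - call_rate) <= geno and maf >= maf_min and hwe_p >= hwe_min

print(passes_qc(call_rate=0.95, maf=0.20, hwe_p=0.5))  # True
print(passes_qc(call_rate=0.85, maf=0.20, hwe_p=0.5))  # False: 15% missing
```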
Q13 Medium
What is the default minor allele frequency (MAF) threshold in PLINK if --maf is used without a specified value?
A. 0.05
B. 0.10
C. 0.001
D. 0.01
Explanation
The default MAF threshold in PLINK is 0.01. Many students assume it's 0.05 because that's the most commonly used value in practice, but the actual default is 0.01. The lecture explicitly states: "include SNPs with MAF >= 0.05. The default value is 0.01."
Q14 Easy
Which PLINK command generates allele frequency statistics?
A. --freq
B. --hardy
C. --maf
D. --assoc
Explanation
--freq generates a .frq file with allele frequencies (CHR, SNP, A1, A2, MAF, NCHROBS). --hardy tests for Hardy-Weinberg equilibrium. --maf is a QC filter, not a statistics command. --assoc performs association testing.
Q15 Medium
What is the main purpose of Multidimensional Scaling (MDS) in PLINK?
A. To identify runs of homozygosity across the genome
B. To compute the minor allele frequency of each SNP
C. To represent high-dimensional genetic data in a low-dimensional space to detect population stratification
D. To perform genome-wide association tests for quantitative traits
Explanation
MDS is a dimensionality reduction technique. In PLINK, it compresses the information contained in thousands of SNPs into a 2D (or few-D) space so you can visualize population structure. Clusters in the MDS plot often correspond to distinct breeds or populations. It operates on a genome-wide IBS (identity by state) pairwise distance matrix.
Q16 Hard
To perform MDS analysis in PLINK, which two-step process is required?
A. First run --freq, then run --mds-plot
B. First compute --genome (IBS pairwise distances), then run --cluster --mds-plot
C. First run --hardy, then run --cluster --mds-plot
D. First run --assoc, then run --mds-plot
Explanation
MDS requires two steps: (1) compute genome-wide IBS pairwise distances with --genome, which produces a .genome file; (2) load that file with --read-genome and run --cluster --mds-plot N where N is the number of dimensions. The lecture shows: step 1: --genome --out Al_Ap-Ba_genome, step 2: --read-genome ... --cluster --mds-plot 2.
Q17 Easy
What are Runs of Homozygosity (ROH)?
A. Short regions with high heterozygosity across SNPs
B. Regions of copy number variation detected by aCGH
C. Inversions in the chromosome that prevent recombination
D. Long stretches of chromosome regions that are homozygous at each polymorphic position
Explanation
ROH are contiguous stretches of the genome where an individual is homozygous at every (or nearly every) SNP position. They arise from autozygosity — inheriting two copies of the same ancestral haplotype. ROH are indicators of inbreeding level.
Q18 Hard
In the PLINK ROH analysis shown in the lecture, which parameters were used for the sliding window?
A. Window of 1000 kbp, 0 heterozygous SNPs allowed, max 5 missing, final ROH ≥15 SNPs, density 1 SNP per 100 kb
B. Window of 500 kbp, 1 heterozygous SNP allowed, max 3 missing, final ROH ≥20 SNPs, density 1 SNP per 50 kb
C. Window of 1000 kbp, 1 heterozygous SNP allowed, max 5 missing, final ROH ≥15 SNPs, density 1 SNP per 100 kb
D. Window of 2000 kbp, 0 heterozygous SNPs allowed, max 10 missing, final ROH ≥30 SNPs, density 1 SNP per 200 kb
Explanation
The lecture example uses: --homozyg-kb 1000 (1000 kbp = 1 Mbp window), --homozyg-window-het 0 (no heterozygous SNPs), --homozyg-window-missing 5 (max 5 missing), --homozyg-snp 15 (minimum 15 SNPs), --homozyg-density 100 (1 SNP per 100 kb). Option C is the main trap — it allows 1 heterozygous SNP, but the lecture explicitly sets this to 0.
Q19 Medium
What does the PLINK .hom.indiv output file contain?
A. A list of every SNP that falls within any ROH
B. One row per identified homozygous region with start/end positions
C. A per-individual summary including number of ROH segments (NSEG) and total ROH length (KB)
D. A summary of ROH frequency per chromosome
Explanation
The .hom.indiv file provides a per-individual summary with columns FID, IID, PHE, NSEG (number of segments), KB (total ROH length), and KBAVG (average ROH size). The .hom file (not .hom.indiv) contains one row per individual ROH region. These two are often confused.
Q20 Medium
How is the genomic inbreeding coefficient FROH calculated?
A. FROH = Number of ROH segments / Total number of SNPs
B. FROH = Total length of all ROHs / Length of the autosomal genome
C. FROH = Average ROH length / Longest chromosome length
D. FROH = Number of homozygous SNPs / Total number of SNPs
Explanation
FROH = LROH / Laut, where LROH is the total length of all ROHs and Laut is the total length of the autosomal genome. Option D describes overall homozygosity but not the ROH-based inbreeding coefficient specifically. The distinction is important: FROH uses physical length of ROH segments, not just SNP counts.
Q21 Medium
What do short ROH versus long ROH indicate about an individual's ancestry?
A. Short ROH suggest remote common ancestors; long ROH suggest recent inbreeding
B. Short ROH suggest recent inbreeding; long ROH suggest remote ancestors
C. Both short and long ROH indicate recent inbreeding equally
D. ROH length is unrelated to the timing of inbreeding events
Explanation
Short ROH originate from remote common ancestors because recombination over many generations breaks up long stretches of DNA. Long ROH indicate autozygosity from more recent ancestors because fewer recombination events have had time to disrupt them. This is a key concept from Ceballos et al., 2018.
Q22 Medium
In PLINK GWAS for a quantitative trait, which statistical model is used?
A. Chi-squared contingency table test
B. Logistic regression
C. Fisher's exact test
D. Generalized linear model (GLM) / linear regression
Explanation
For quantitative traits, GWAS uses generalized linear models (GLM). For dichotomous case/control traits, contingency table methods or logistic regression are used. The lecture explicitly distinguishes these two approaches. The --assoc flag with quantitative phenotypes uses linear regression.
Q23 Hard
In a linear mixed model for GWAS (Y = Xb + Zu + e), what do the terms represent?
A. Xb = random genetic effects, Zu = fixed environmental effects, e = phenotype
B. Xb = genotype encoding, Zu = linkage disequilibrium, e = Hardy-Weinberg deviation
C. Xb = fixed effects (known constants), Zu = random effects (from subsampling), e = residual error
D. Xb = random effects, Zu = fixed effects, e = environmental variance
Explanation
In Y = Xb + Zu + e: Xb represents fixed effects (known constants that remain the same over repeated sampling, e.g., sex, age, SNP genotype), Zu represents random effects (random variables arising from subsampling, e.g., population structure), and e is the residual error. Options A and D swap fixed and random effects.
Q24 Medium
Why is covariate adjustment important in GWAS?
A. It increases the number of SNPs tested, improving genome coverage
B. It reduces spurious associations due to sampling artifacts, biases, or population substructure
C. It converts a quantitative trait into a case/control phenotype
D. It eliminates the need for quality control filtering
Explanation
Covariate adjustment reduces spurious associations caused by confounders like sex, age, study site, or population substructure. However, it comes at a cost: each additional covariate uses degrees of freedom, potentially reducing statistical power. Population substructure is noted as one of the most important covariates to consider.
Q25 Easy
Which PLINK command performs a basic quantitative trait association test (GWAS)?
A. --assoc
B. --genome
C. --homozyg
D. --freq
Explanation
--assoc performs association testing. For quantitative traits it produces a .qassoc file. The lecture example: ./plink --file Cattle --assoc --out GWAS_stature_cattle_no_covariates. --genome computes IBS distances, --homozyg detects ROH, and --freq calculates allele frequencies.
Q26 Tricky
An advantage of SNP panels listed in the lecture is that they provide "the most comprehensive view of the genome." However, what is a key limitation NOT mentioned?
A. Low per-sample cost
B. Scalable workflow for large populations
C. SNP panels are limited to pre-selected markers and cannot discover novel variants
D. They detect both SNPs and other variations across the genome
Explanation
The lecture lists many advantages of SNP panels (low cost, scalable, comprehensive, good data quality). However, a fundamental limitation is ascertainment bias: SNP panels only genotype pre-selected markers. They cannot discover new/novel variants the way whole-genome sequencing can. Options A, B, and D are stated advantages, not limitations.
Q27 Medium
In the PLINK .hom output file, which columns describe the boundaries of an identified ROH?
A. CHR, NSNP, DENSITY, PHOM
B. SNP1, SNP2, POS1, POS2
C. FID, IID, KB, KBAVG
D. A1, A2, MAF, NCHROBS
Explanation
In the .hom file, SNP1 and SNP2 are the SNPs at the start and end of the ROH, while POS1 and POS2 are the physical positions (bp) of those boundary SNPs. Option D describes columns from a .frq file (allele frequency output), and option C mixes .hom.indiv columns.
Q28 Tricky
You run wc -l Altamurana.ped and get 24. You also run wc -l Altamurana.map and get 54241. What do these numbers tell you?
A. 24 SNPs and 54241 individuals
B. 24 families and 54241 chromosomes
C. 24 individuals and 54241 alleles
D. 24 individuals (animals) and 54241 DNA markers (SNPs)
Explanation
In a PED file, each line = one individual, so 24 lines = 24 animals. In a MAP file, each line = one marker, so 54241 lines = 54241 SNPs. This is shown directly in the lecture as a quick way to check dataset dimensions using wc -l.
Q29 — Open Calculation
An individual has the following ROH data from PLINK: total ROH length (LROH) = 400,646 kb. The autosomal genome length of the species is 2,500,000 kb. Calculate the genomic inbreeding coefficient FROH for this individual.
✓ Model Answer

The formula for the genomic inbreeding coefficient is:

FROH = LROH / Laut
FROH = 400,646 kb / 2,500,000 kb
FROH = 0.1603 (or approximately 16.0%)

This means about 16% of this individual's autosomal genome is covered by runs of homozygosity, indicating a moderate level of genomic inbreeding. The population mean FROH would be calculated as the average FROH across all individuals.
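The calculation above is a one-liner; a quick sketch for checking the arithmetic:

```python
def f_roh(total_roh_kb, autosome_kb):
    """FROH = LROH / Laut: fraction of the autosomal genome covered by ROH."""
    return total_roh_kb / autosome_kb

f = f_roh(400_646, 2_500_000)
print(round(f, 4))  # 0.1603, i.e. ~16% of the autosomes lie in ROH
```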

Q30 — Open Short Answer
Explain the four main PLINK quality control filters (--mind, --geno, --maf, --hwe). For each, state what it filters (individuals or SNPs) and what criterion is applied.
✓ Model Answer

--mind [threshold]: Filters individuals. Excludes samples with a proportion of missing genotypes exceeding the threshold. Example: --mind 0.1 removes individuals with >10% missing data.

--geno [threshold]: Filters SNPs. Excludes markers with a proportion of missing genotypes exceeding the threshold. Example: --geno 0.1 removes SNPs with >10% missing data (i.e., keeps SNPs with ≥90% call rate).

--maf [threshold]: Filters SNPs. Excludes markers with a minor allele frequency below the threshold. Example: --maf 0.05 removes SNPs with MAF < 0.05 (removes very rare variants or monomorphic markers). Default is 0.01.

--hwe [threshold]: Filters SNPs. Excludes markers whose Hardy-Weinberg equilibrium test p-value falls below the threshold. Example: --hwe 0.01 removes SNPs with HWE p < 0.01 (those significantly deviating from HWE, which may indicate genotyping errors).
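The --mind, --geno, and --maf logic above can be sketched on a toy 0/1/2 genotype matrix. This is purely illustrative Python, not PLINK's implementation, and the --hwe test is omitted for brevity; all data values are made up:

```python
# Toy genotypes: rows = individuals, columns = SNPs.
# Values 0/1/2 = minor-allele count, None = missing call.
data = [
    [0, 0, 2, None],
    [1, None, None, None],  # 75% missing -> removed by the "mind" filter
    [2, 0, 0, 0],
    [1, 0, 1, 1],
]

def qc_filter(data, mind=0.5, geno=0.5, maf=0.05):
    # --mind: drop individuals with too high a missing-genotype rate
    inds = [row for row in data
            if sum(g is None for g in row) / len(row) <= mind]
    kept_snps = []
    for j in range(len(inds[0])):
        col = [row[j] for row in inds]
        obs = [g for g in col if g is not None]
        if (len(col) - len(obs)) / len(col) > geno:
            continue  # --geno: SNP has too many missing calls
        freq = sum(obs) / (2 * len(obs))
        if min(freq, 1 - freq) < maf:
            continue  # --maf: minor allele too rare (monomorphic SNPs are dropped)
        kept_snps.append(j)
    return inds, kept_snps

inds, kept_snps = qc_filter(data)  # 3 individuals kept; monomorphic SNP 1 dropped
```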

Q31 — Open Short Answer
Describe the relationship between ROH length and inbreeding history. How can ROH length class distributions (e.g., 1–2 Mb, 2–4 Mb, 4–8 Mb, 8–16 Mb, >16 Mb) help reconstruct the demographic history of a breed?
✓ Model Answer

Short ROH (e.g., 1–4 Mb): Originate from remote common ancestors. Over many generations, recombination breaks long ancestral haplotypes into smaller fragments. A breed with predominantly short ROH likely experienced background relatedness long ago but has maintained a relatively large effective population size recently.

Long ROH (e.g., >8 Mb or >16 Mb): Indicate recent inbreeding, because few meiotic recombination events have occurred since the common ancestor. A high frequency of long ROH suggests recent bottlenecks, small population sizes, or close mating.

Demographic reconstruction: By plotting the frequency distribution of ROH across length classes, researchers can infer the timing and severity of inbreeding events. A breed with many long ROH has experienced recent, intense inbreeding. A breed with mainly short ROH has ancient background inbreeding but recent outcrossing. Additionally, plotting total ROH coverage (SROH) vs. number of ROH segments per individual helps distinguish populations: many short segments = ancient inbreeding; fewer but longer segments = recent inbreeding.

Q32 — Open Tricky
You want to construct a PED file for 3 individuals genotyped at 5 loci. Individual A is a female (family 1, no parents known, phenotype = 160, genotypes: AA, TC, GG, AT, CC). Individual B is a male (family 1, no parents known, phenotype = 185, genotypes: AG, TT, GA, AA, CT). Individual C is male (family 1, father = B, mother = A, phenotype = 175, genotypes: AG, TC, GA, AT, CC). Write the complete PED file content.
✓ Model Answer

Remember: columns are FamilyID, IndividualID, PaternalID, MaternalID, Sex (1=M, 2=F), Phenotype, then 2 columns per locus. No header row!

1 A 0 0 2 160 A A T C G G A T C C
1 B 0 0 1 185 A G T T G A A A C T
1 C B A 1 175 A G T C G A A T C C

Key details: (1) No header; (2) Unknown parents = 0; (3) Sex: A is female → 2, B and C are male → 1; (4) Each genotype takes 2 columns (one per allele); (5) Individual C has father=B and mother=A (paternal before maternal). Total columns = 6 + 2×5 = 16.
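The model answer can also be generated and sanity-checked programmatically. A small sketch following the PED column layout described above (variable names are illustrative):

```python
# Records: FID, IID, father, mother, sex (1=M, 2=F), phenotype, genotypes per locus
records = [
    ("1", "A", "0", "0", "2", "160", ["AA", "TC", "GG", "AT", "CC"]),
    ("1", "B", "0", "0", "1", "185", ["AG", "TT", "GA", "AA", "CT"]),
    ("1", "C", "B", "A", "1", "175", ["AG", "TC", "GA", "AT", "CC"]),
]

ped_lines = []
for fid, iid, pat, mat, sex, phe, genos in records:
    alleles = [a for g in genos for a in g]  # "AA" -> ["A", "A"]: 2 columns per locus
    ped_lines.append(" ".join([fid, iid, pat, mat, sex, phe] + alleles))

# Each line: 6 fixed columns + 2 x 5 genotype columns = 16 fields, no header row
```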

Genome Assembly – Exam Practice

📝Genome Assembly – Full Coverage
Q1 Easy
What is variant calling?
A. The process of assembling reads into contigs
B. The process of identifying differences (SNPs, indels) between sequenced reads and a reference genome
C. The process of aligning reads to a reference genome
D. The process of annotating genes in the genome
Explanation
Variant calling identifies differences (like SNPs or indels) between sequenced reads and a reference genome. Not every difference is a true variant — sequencing errors, alignment issues, and low-quality bases can produce false positives.
Q2 Medium
Which of the following is most likely to produce false-positive variant calls?
A. High mapping quality reads
B. Sequencing depth of 30×
C. Variants located in homopolymer regions
D. Variants supported by high base-call quality scores
Explanation
Homopolymer regions (e.g., AAAAA) are prone to sequencing errors, especially with certain technologies. Variants found in these regions are often false positives. Modern variant callers include filters to detect and handle homopolymeric regions.
Q3 Medium
What is the advantage of joint variant calling over individual variant calling?
A. It requires fewer computational resources
B. It produces one VCF file per sample
C. It only identifies homozygous variants
D. A low-confidence variant in one sample may be confirmed by evidence from other samples
Explanation
Joint variant calling analyzes all samples simultaneously. A key advantage is improved sensitivity: a low-confidence variant in one sample can be confidently called if supported in other samples. It also helps detect shared variants even when some individuals have low coverage. However, it requires more computational resources than individual calling.
Q4 Tricky
In individual variant calling, if a variant is missing from a sample's VCF file, what can be concluded?
A. The variant may still be present but was missed due to insufficient coverage
B. The variant is definitely absent in that sample
C. The reference genome is incorrect at that position
D. The variant is homozygous in that sample
Explanation
A key limitation of individual variant calling: if a variant is missing from some VCF files, it does not necessarily mean it is absent in those samples. It could be due to insufficient sequencing coverage or different variant types. This is one reason joint calling is generally preferred for multi-sample projects.
Q5 Easy
In a VCF file, what does the QUAL column represent?
A. The base quality of the reference allele
B. The confidence score for the variant call
C. The mapping quality of reads at the position
D. The read depth at the variant position
Explanation
The QUAL column in VCF indicates confidence in the variant call. Read depth (DP) and allele frequency (AF) are found in the INFO column. Mapping quality is a separate concept from variant call quality.
Q6 Hard
In a VCF file, a sample shows FORMAT GT:GQ:DP and value 0/1:99:32. What does this mean?
A. Homozygous reference, quality 99, depth 32
B. Homozygous alternative, quality 32, depth 99
C. Heterozygous (one ref, one alt allele), quality 99, depth 32
D. Heterozygous, phased genotype, quality 99, depth 32
Explanation
FORMAT GT:GQ:DP means the sample data is structured as Genotype:Genotype Quality:Read Depth. 0/1 = heterozygous (0 = reference, 1 = first alt allele), the "/" indicates unphased. GQ=99 means very high confidence. DP=32 means 32 reads support this site. If it were phased, it would use "|" instead of "/".
Q7 Tricky
What is the difference between 0/1 and 0|1 in VCF genotype notation?
A. 0/1 is homozygous while 0|1 is heterozygous
B. 0/1 is unphased (allele-to-chromosome assignment unknown) while 0|1 is phased (known assignment)
C. 0/1 means low quality while 0|1 means high quality
D. 0/1 is from short-read data while 0|1 is from long-read data
Explanation
The "/" (forward slash) indicates an unphased genotype — you know which alleles are present but not which chromosome each came from. The "|" (pipe) indicates a phased genotype — you know the exact combination of alleles on each chromosome. Both 0/1 and 0|1 are heterozygous. Phasing is useful in haplotype or linkage studies.
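Parsing these FORMAT/sample fields can be sketched in a few lines. The helper names are illustrative; real pipelines typically use a VCF library rather than hand-rolled string splitting:

```python
def parse_sample(format_col, sample_col):
    """Zip a VCF FORMAT column (e.g. 'GT:GQ:DP') with a sample column."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))

def classify_gt(gt):
    """Return (zygosity, phased) for a GT value like '0/1' or '0|1'."""
    phased = "|" in gt
    a, b = gt.replace("|", "/").split("/")
    if "." in (a, b):
        return "missing", phased
    if a == b:
        return ("hom-ref" if a == "0" else "hom-alt"), phased
    return "het", phased

sample = parse_sample("GT:GQ:DP", "0/1:99:32")
state, phased = classify_gt(sample["GT"])  # heterozygous, unphased; GQ=99, DP=32
```

Note that `classify_gt("2/1")` also returns heterozygous: any two different allele indices make a het call, matching the multiallelic case discussed next.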
Q8 Medium
In VCF genotype notation, what does 2/1 indicate?
A. Homozygous for the second alternative allele
B. Heterozygous with one reference and one alternative allele
C. Missing genotype data
D. Heterozygous with the first ALT allele and the second ALT allele
Explanation
In VCF: 0 = reference allele, 1 = first alternative allele, 2 = second alternative allele (at multiallelic sites). So 2/1 means heterozygous with one allele being the first ALT and the other being the second ALT. This is different from 0/1 (ref + first ALT).
Q9 Medium
In IGV, how would you identify a heterozygous variant (0/1)?
A. About half of the reads show the variant allele and half show the reference
B. All reads match the reference
C. Nearly all reads show the variant allele
D. Only 1–2 reads show the variant
Explanation
Heterozygous (0/1): approximately half the reads show the variant. Homozygous reference (0/0): all reads match reference. Homozygous alternative (1/1): nearly all reads show the variant. If only 1–2 reads show the variant, it's likely a sequencing or mapping error, not a true variant.
Q10 Medium
Which of the following is NOT a recommended post-calling quality control step?
A. Filter variants by read depth (exclude too low or too high)
B. Retain only variants marked as "PASS" in the FILTER field
C. Retain all variants regardless of quality score to maximize sensitivity
D. Exclude variants in repetitive or low-complexity regions
Explanation
Post-calling QC should filter by quality score (e.g., QUAL > 30) to keep only high-confidence variants. Retaining all variants regardless of quality would include many false positives. Other valid steps include filtering by read depth, excluding repetitive regions, and checking the FILTER field for "PASS".
Q11 Easy
What is a typical minimum read depth threshold for reliable variant calling?
A. ≥ 3 reads
B. ≥ 10 reads
C. ≥ 50 reads
D. ≥ 100 reads
Explanation
A common minimum depth threshold is ≥ 10 reads. Fewer than 3 reads means low confidence. A maximum depth threshold (e.g., > 100 reads) is also useful to filter out positions with unusually high coverage, which often correspond to repetitive regions.
Q12 Medium
What is the primary goal of variant annotation?
A. To determine the biological impact and consequences of each identified variant
B. To align reads to the reference genome
C. To increase the sequencing depth of variants
D. To remove false-positive variant calls
Explanation
Variant annotation determines the impact of each variant. Variants can be in genes, introns, or regulatory regions, and their effects vary by location. Tools like ENSEMBL provide gene locations, functions, variant tables with rsIDs, and predicted consequences. Databases like dbSNP record SNP information and modification types.
Q13 Tricky
When checking variants against a variant table in ENSEMBL, approximately what percentage of variants in your sample are expected to be previously known?
A. Around 50%
B. Around 25%
C. Around 1%
D. Around 8%
Explanation
The lecture notes specifically state that "usually around 8%" of variants in your sample are previously known. Checking the variant table helps verify your results, even though the main goal is to discover new variants. This is a detail easily overlooked by students.
Q14 Easy
When is de novo genome assembly necessary?
A. When performing variant calling on a well-studied model organism
B. When RNA-seq data is available
C. When no reference genome is available for the species
D. When the genome is very small
Explanation
De novo genome assembly is essential when no reference genome is available. It can also be used to improve an existing reference. However, given its complexity and resource demands, researchers must first assess whether it's truly necessary or if a reference-guided approach can be used.
Q15 Easy
In shotgun sequencing, how are longer sequences reconstructed?
A. By searching for overlaps between the sequences of individual fragments
B. By aligning each fragment to a reference genome
C. By using restriction enzymes to cut at known positions
D. By sequencing each chromosome separately
Explanation
Shotgun sequencing works by fragmenting DNA, sequencing the fragments, and then using overlaps between fragments to reconstruct longer sequences. The method relies on random fragmentation producing overlapping pieces that can be computationally assembled.
Q16 Medium
In hierarchical shotgun sequencing, what are BAC libraries used for?
A. To sequence the genome directly in one step
B. To clone and amplify large DNA fragments (~300 kb) whose order and overlap are known
C. To store short sequencing reads for downstream analysis
D. To perform variant calling on sequenced reads
Explanation
In hierarchical shotgun sequencing, large DNA fragments are inserted into BAC (Bacterial Artificial Chromosome) libraries. Because BAC fragments are large (~300 kb) and their order/overlap are known beforehand, each BAC is individually shotgun-sequenced and then all are aligned to reconstruct the full genome. This method is mostly obsolete now.
Q17 Hard
Using flow cytometry, the C-value of a species is measured at 2.5 pg. What is the estimated genome size in base pairs?
A. ~2.5 × 10⁹ bp
B. ~978 Mb
C. ~1.95 × 10⁹ bp
D. ~2.445 × 10⁹ bp
Explanation
Using the formula: Genome size (bp) = DNA content (pg) × 0.978 × 10⁹. So: 2.5 × 0.978 × 10⁹ = 2.445 × 10⁹ bp ≈ 2,445 Mb. The C-value is the amount of DNA in picograms in a haploid genome, and 1 pg = 978 Mb.
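The conversion is a direct application of the formula above:

```python
def c_value_to_bp(pg):
    """Genome size in bp from a C-value in picograms (1 pg ≈ 0.978 × 10⁹ bp)."""
    return pg * 0.978e9

size = c_value_to_bp(2.5)
print(f"{size:.3e}")  # 2.445e+09 bp ≈ 2,445 Mb
```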
Q18 Medium
In a K-mer frequency distribution, what do low-frequency K-mers (appearing only 1–20 times) most likely represent?
A. Highly conserved coding regions
B. Repetitive regions
C. Sequencing errors that introduce unique erroneous K-mers
D. True genomic K-mers at average coverage
Explanation
In a K-mer frequency distribution: low-frequency K-mers (left peak) = sequencing errors creating unique, erroneous K-mers; the main peak = true genomic K-mers at average coverage; high-frequency K-mers (right tail) = repetitive regions sequenced multiple times.
Q19 Hard
How is genome size estimated from K-mer frequency analysis?
A. Total number of K-mers (area under the curve) divided by average K-mer coverage (peak position)
B. Number of unique K-mers multiplied by K-mer length
C. Total number of reads multiplied by read length
D. Maximum K-mer frequency divided by read length
Explanation
Genome Size = Total number of K-mers (area under the curve) / Average K-mer coverage (mean coverage = position of the main peak). This provides an approximate genome size based solely on sequencing data. It's particularly useful for unknown or poorly studied genomes.
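The estimate can be sketched on a made-up k-mer histogram. The numbers and the error cutoff below are toy assumptions; real tools fit a statistical model to the distribution rather than picking the raw peak:

```python
# Toy histogram: k-mer frequency -> number of distinct k-mers with that frequency
hist = {
    1: 50_000, 2: 8_000,                    # left peak: sequencing errors
    19: 40_000, 20: 100_000, 21: 42_000,    # main peak: true k-mers at ~20x
    40: 3_000,                              # right tail: repetitive regions
}

ERROR_CUTOFF = 5  # assume frequencies <= 5 are errors (arbitrary for this toy data)
true_kmers = {f: c for f, c in hist.items() if f > ERROR_CUTOFF}

total_kmers = sum(f * c for f, c in true_kmers.items())  # area under the curve
peak = max(true_kmers, key=true_kmers.get)               # average k-mer coverage
genome_size = total_kmers // peak                        # estimated size in bp
```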
Q20 Medium
How does high heterozygosity affect genome assembly?
A. It simplifies the assembly by reducing the number of contigs
B. Allelic variation can be assembled as separate regions, causing fragmentation and inflated genome size
C. It has no effect on assembly quality
D. It only affects GC content bias in Illumina sequencing
Explanation
In highly heterozygous genomes, the assembler may interpret allelic variation as two separate genomic regions. This leads to fragmentation and inflated genome size estimates — heterozygous regions may be reported twice for diploid organisms. Solutions: use inbred lines or haploid individuals, or bioinformatics tools that distinguish allelic differences from true duplications.
Q21 Medium
Why does extreme GC content cause problems for Illumina sequencing?
A. It increases the error rate of base calling
B. It causes reads to be too long for assembly
C. It causes amplification bias during PCR, resulting in low or no coverage in affected regions
D. It makes the reference genome incompatible with the reads
Explanation
Extremely low or high GC content causes amplification bias during PCR-based Illumina sequencing. This results in low or no coverage in those regions. Solutions: use GC-insensitive platforms like PacBio or Nanopore, or over-sequence to compensate for coverage gaps.
Q22 Tricky
Why is it recommended to sequence inbred individuals for genome assembly?
A. Inbred individuals have larger genomes
B. Inbred individuals have more repetitive elements
C. Inbred individuals have higher GC content
D. Low polymorphism in inbred individuals greatly simplifies assembly by reducing heterozygosity
Explanation
Inbred organisms (e.g., lab strains) have low levels of polymorphism, which greatly simplifies assembly. High heterozygosity causes assemblers to misinterpret allelic differences as separate genomic regions, leading to fragmentation. The lecture specifically mentions that differences between reads due to polymorphism "may be misinterpreted by assemblers and errors introduced in the sequence."
Q23 Medium
What is required for long-read sequencing that is NOT necessary for short-read sequencing?
A. DNA extraction from the sample
B. High-molecular-weight DNA from fresh or well-preserved tissue
C. Library preparation
D. PCR amplification
Explanation
Long-read sequencing requires high-molecular-weight (HMW) DNA (≥20 kbp), mainly obtained from fresh material. Short-read sequencing can work with fragmented or degraded DNA, making it suitable for ancient or poor-quality samples. PCR amplification is sometimes needed when DNA is limited but actually introduces bias.
Q24 Tricky
Why is PCR amplification of genomic DNA a potential problem for genome assembly?
A. Some regions amplify more efficiently than others, leading to uneven coverage and potential gaps
B. PCR destroys the DNA fragments
C. PCR only works with long-read sequencing
D. PCR introduces indels into the reads
Explanation
PCR introduces bias because some genomic regions amplify more efficiently than others. This leads to uneven coverage and potential gaps in the genome assembly. PCR-free library preparation methods are preferred when possible to avoid this bias.
Q25 Easy
What is a standard minimum coverage/depth recommended for genome assembly?
A. 10×
B. 30×
C. At least 60×
D. 100×
Explanation
A coverage of at least 60× is standard practice for genome assembly, ensuring each region is sequenced enough times for accurate assembly. This is explicitly mentioned in the lecture when discussing fold-coverage requirements for a "good assembly (>60x)".
Q26 Medium
What is the main advantage of a hybrid assembly approach (combining SGS + TGS)?
A. It is cheaper than using only short reads
B. It eliminates all assembly errors
C. It only requires de Bruijn graph assembly
D. Short reads correct errors in long reads, while long reads improve assembly continuity across repeats
Explanation
The hybrid approach compensates for the downsides of both technologies: SGS (Illumina) provides high accuracy to correct errors in TGS reads, while TGS (PacBio/Nanopore) provides long reads that span repeats and improve continuity. It's a cost-effective strategy since SGS data can correct errors in TGS reads.
Q27 Hard
In a De Bruijn graph, what do vertices and edges represent?
A. Vertices = reads; Edges = overlaps between reads
B. Vertices = (k−1)-mers; Edges = k-mers connecting prefix to suffix
C. Vertices = k-mers; Edges = (k−1)-mers
D. Vertices = chromosomes; Edges = contigs
Explanation
In a De Bruijn graph: vertices = (k−1)-mers (prefix and suffix of each k-mer), and edges = k-mers. For example, with k=3, the k-mer ATG has prefix AT and suffix TG, so the edge ATG connects node AT to node TG. Option A describes the OLC approach, not De Bruijn. Option C reverses vertices and edges — a very common exam trap!
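A minimal construction makes the vertex/edge convention concrete. This is a toy sketch; real assemblers additionally track edge multiplicities and compact unambiguous paths:

```python
def de_bruijn_edges(reads, k):
    """Each k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    edges = []
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.append((kmer[:-1], kmer[1:]))  # prefix node -> suffix node
    return edges

# With k=3, the read ATGGCG yields k-mers ATG, TGG, GGC, GCG,
# giving edges AT->TG, TG->GG, GG->GC, GC->CG
edges = de_bruijn_edges(["ATGGCG"], k=3)
```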
Q28 Hard
What condition must be met for an Eulerian path to exist in a De Bruijn graph?
A. The graph must have all nodes balanced (equal in-degree and out-degree) or exactly two semi-balanced nodes
B. The graph must have exactly one node with maximum out-degree
C. Every node must be visited exactly once
D. The graph must be undirected
Explanation
An Eulerian path visits every edge exactly once (not every node — that's a Hamiltonian path). For an Eulerian path to exist, the graph must have all nodes balanced (indegree = outdegree) or at most two semi-balanced nodes (where |indegree − outdegree| = 1). De Bruijn graphs are directed, not undirected.
Q29 Tricky
Why should you NOT choose the longest possible k-mer for De Bruijn graph assembly?
A. Longer k-mers produce more ambiguous assemblies
B. Longer k-mers require more sequencing depth
C. A single sequencing error affects 100% of k-mers from that read when k equals the read length, versus only a few k-mers with smaller k
D. Longer k-mers create larger, more connected graphs
Explanation
The assumption in De Bruijn graph assembly is that all k-mers are error-free, which is not true for NGS data. If you choose k = read length, then a single sequencing error affects 100% of the k-mers from that read. With a smaller k, an error only affects a limited number of k-mers. However, too small a k increases ambiguity (many repeated k-mers). Multiple assemblies with different k values can be compared.
Q30 Medium
Why did De Bruijn graph-based assemblers replace OLC for short-read data?
A. OLC produces better assemblies but is more expensive
B. De Bruijn graphs produce error-free assemblies
C. OLC cannot handle reads shorter than 1000 bp
D. OLC requires comparing all reads to all other reads, which is impractical with billions of short reads
Explanation
With SGS, the number of reads increased exponentially while read lengths shortened. The OLC approach requires comparing all reads with every other read — computationally impractical with millions or billions of reads. De Bruijn graphs are more efficient because they decompose reads into k-mers, avoiding direct all-vs-all comparisons. OLC is still used for long-read data.
Q31 Medium
What causes branching structures in De Bruijn graphs?
A. Repetitive DNA regions that create multiple possible paths
B. Too few reads in the dataset
C. Using too large a k-mer size
D. The use of paired-end reads
Explanation
Repeated sequences create branches in the De Bruijn graph because identical k-mers from different genomic locations converge, creating ambiguity about which path to follow. Paired-end reads can actually help resolve these branches — if a fragment spans the repeat, its paired reads anchor the assembly in unique flanking regions.
Q32 Medium
What is the key advantage of mate-pair sequencing over paired-end sequencing?
A. Mate-pair is simpler and cheaper to prepare
B. Mate-pair covers much larger distances (2–10 kb inserts), useful for scaffolding across repeats and gaps
C. Mate-pair produces longer reads
D. Mate-pair has higher base-call accuracy
Explanation
Mate-pair sequencing uses long DNA fragments (2–10 kb) that are circularized and labeled with biotin. This enables spanning large distances, which is crucial for scaffolding across repeats, detecting structural variations, and connecting distant contigs. Paired-end inserts are typically only 50–500 bp. However, mate-pair preparation is more complex and labor-intensive.
Q33 Tricky
In paired-end sequencing, when can two reads from the same fragment be merged into a single longer read?
A. When the fragment is very long (>1 kb)
B. When using mate-pair libraries
C. When the DNA insert is short enough that the two reads from each end overlap
D. When long-read technology is used simultaneously
Explanation
If the DNA insert is short (e.g., 200 bp) and the reads are long (e.g., 150 bp each), the two reads overlap in the middle and can be merged into a longer, more accurate read that behaves like a single-end read. This only works when the insert size is less than 2× the read length.
Q34 Easy
What is the purpose of assembly polishing?
A. To increase sequencing depth
B. To correct sequencing and assembly errors, improving base-level accuracy
C. To fragment the assembly into smaller contigs
D. To annotate genes in the assembly
Explanation
Polishing corrects sequencing errors and improves the accuracy of the consensus sequence. It is especially important for long-read assemblies, which tend to have higher error rates. Polishing tools (e.g., Pilon, Racon, Medaka) use aligned reads to detect mismatches and correct them iteratively.
Q35 Hard
What does N50 measure in a genome assembly?
A. The average length of all contigs
B. The percentage of genes correctly assembled
C. The total number of contigs in the assembly
D. The shortest contig length such that contigs of that length or longer cover 50% of the total assembly
Explanation
N50 is calculated by ranking contigs from longest to shortest, then summing their lengths until 50% of the total assembly size is reached — the length of the last contig added is the N50. A higher N50 implies a less fragmented assembly. Critically, N50 measures contiguity, NOT correctness — aggressive assemblers may produce high N50 but with misassemblies.
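The ranking-and-summing procedure above is easy to express in code; a minimal sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L together
    cover at least 50% of the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0

# Hypothetical assembly of five contigs (100 kb total)
print(n50([50_000, 30_000, 10_000, 6_000, 4_000]))  # 50000
```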
Q36 Tricky
Why is a high N50 value alone NOT sufficient to confirm assembly quality?
A. Aggressive assemblers may produce long contigs with misjoins (wrong order/orientation), inflating N50
B. N50 can only be calculated for scaffolds, not contigs
C. N50 measures correctness but not completeness
D. N50 is only valid for long-read assemblies
Explanation
N50 is a measure of contiguity, NOT correctness. Aggressive assemblers may join regions in the wrong order or orientation, producing artificially long contigs and a high N50 — but with structural errors. That's why additional metrics (BUSCO scores, assembly size, read mapping) are needed to evaluate assembly quality comprehensively.
Q37 Medium
What does BUSCO evaluate in a genome assembly?
A. Sequencing error rates
B. The length distribution of scaffolds
C. Completeness by checking for conserved single-copy orthologous genes expected for the lineage
D. GC content uniformity across the assembly
Explanation
BUSCO (Benchmarking Universal Single-Copy Orthologs) evaluates assembly completeness by checking for conserved single-copy genes expected for a given lineage. A high BUSCO score suggests a complete and biologically meaningful assembly. Duplicated or missing BUSCOs may indicate assembly errors or gene prediction artifacts.
Q38 Medium
In genome scaffolding, what are gaps between contigs typically filled with?
A. Random nucleotide sequences
B. 'N's as placeholders for unknown sequence
C. Repetitive sequences from a database
D. Reference genome sequences from a related species
Explanation
In scaffolding, unknown sequences between contigs are filled with 'N's as placeholders (often 50 Ns as standard). If long reads or other matching reads span the gap, actual sequence can fill it — this is called "gap filling." Technologies like BioNano, 10X Genomics Chromium, and Hi-C help improve scaffold contiguity.
Q39 Tricky
What is a potential risk of reference-guided genome assembly?
A. It may introduce bias if the reference has inversions or translocations, masking unique structural features
B. It requires more computational resources than de novo assembly
C. It can only use long-read sequencing data
D. It is incompatible with scaffolding techniques
Explanation
Reference-guided assembly aligns reads to a closely related reference genome. While this is more efficient and requires lower coverage, it carries the risk of bias: if the reference differs structurally (inversions, translocations), the assembly may assume the reference structure is correct, overlooking unique features in the genome of interest. De novo assembly is unbiased but more computationally demanding.
Q40 Easy
What are the two main types of masking used for repetitive regions?
A. Forward masking and reverse masking
B. Full masking and partial masking
C. Hard masking (replace with N's) and soft masking (convert to lowercase letters)
D. Static masking and dynamic masking
Explanation
Hard masking replaces repeat regions with 'N's (e.g., ACGTACGT → ACNNNNNN). Soft masking converts repeats to lowercase letters (e.g., ACGTACGT → acgtACGT). Soft masking is preferred because it preserves sequence data while signaling repeat regions, allowing flexibility in downstream analyses.
Q41 Hard
What distinguishes Class I (retrotransposons) from Class II (DNA transposons)?
A. Class I uses cut-and-paste; Class II uses copy-and-paste
B. Class I are only found in prokaryotes; Class II only in eukaryotes
C. Class I are smaller and less abundant than Class II
D. Class I uses copy-and-paste via RNA intermediate; Class II uses cut-and-paste via DNA intermediate
Explanation
Class I retrotransposons (LINEs, SINEs, LTR retrotransposons) use a "copy and paste" mechanism via an RNA intermediate — the original element stays in place while a copy inserts elsewhere. Class II DNA transposons use "cut and paste" via a DNA intermediate — the element is excised and reinserted at a new location. Option A reverses them — a classic exam trap!
Q42 Medium
Which tool is most widely used for homology-based repeat annotation?
A. RepeatMasker
B. AUGUSTUS
C. BUSCO
D. IGV
Explanation
RepeatMasker is the most widely used tool for repeat annotation. It uses homology-based approaches, comparing the genome against databases like Dfam and Repbase, and integrates search engines such as nhmmer (profile Hidden Markov Models) to detect even divergent repeat elements. AUGUSTUS is for gene prediction, BUSCO for assembly completeness, and IGV for visualization.
Q43 Hard
Why is gene annotation in eukaryotes much harder than in prokaryotes?
A. Prokaryotes have more genes than eukaryotes
B. Eukaryotic genes are interrupted by introns, have abundant intergenic DNA (~62%), and require analysis of UTRs and regulatory elements
C. Eukaryotic genomes cannot be sequenced with current technology
D. Prokaryotic genes have more complex intron-exon structures
Explanation
Prokaryotic genomes are simpler: ORFs are long (300–350 codons), there is minimal intergenic DNA (~11% in E. coli), and genes rarely overlap. Eukaryotic genomes are complex: up to ~62% intergenic DNA, genes interrupted by introns, and UTRs and regulatory elements that must also be annotated. This makes ab initio gene prediction in eukaryotes much more difficult and error-prone.
Q44 Medium
What is the "combiner" approach to gene annotation?
A. Using only ab initio prediction methods
B. Using only homology-based prediction methods
C. Integrating both intrinsic (ab initio) and extrinsic (homology-based) methods for improved accuracy
D. Combining short reads and long reads during assembly
Explanation
Combiners merge ab initio (intrinsic) and extrinsic methods. They leverage statistical models from the genome sequence AND sequence similarity from external databases (RNA-Seq, protein evidence). AUGUSTUS is a key example — it can work both de novo and by incorporating external evidence. Combiners are the most popular and widely used approach.
Q45 Medium
In the GFF file format, what does the "Score" column (column 6) represent?
A. The GC content of the feature
B. The number of reads covering the feature
C. The length of the feature in base pairs
D. A confidence value for the feature prediction (higher = more confident)
Explanation
The Score column in GFF is a floating-point value representing confidence in the feature prediction — higher numbers indicate higher confidence. The GFF file has 9 columns: Seqname, Source, Feature, Start, End, Score, Strand, Frame, and Attribute.
Q46 — Open Calculation
A species has a C-value of 1.8 pg measured by flow cytometry. Calculate the estimated genome size in base pairs and in megabases. Then, if you plan to sequence at 60× coverage using 150 bp reads, how many reads do you need?
✓ Model Answer

Step 1: Genome size in base pairs

Genome size (bp) = DNA content (pg) × 0.978 × 10⁹
= 1.8 × 0.978 × 10⁹ = 1.7604 × 10⁹ bp ≈ 1,760 Mb ≈ 1.76 Gb

Step 2: Number of reads needed

Coverage = (Number of reads × Read length) / Genome size
60 = (N × 150) / 1,760,400,000
N = (60 × 1,760,400,000) / 150 = 704,160,000 reads ≈ 704 million reads
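The two formulas in this answer can be checked with a few lines of code (constants as given above):

```python
PG_TO_BP = 0.978e9  # 1 pg of DNA corresponds to ~0.978 × 10⁹ bp

c_value_pg = 1.8
genome_size_bp = c_value_pg * PG_TO_BP  # ≈ 1.76 × 10⁹ bp

# Coverage = (number of reads × read length) / genome size, solved for reads
coverage = 60
read_length_bp = 150
reads_needed = coverage * genome_size_bp / read_length_bp

print(round(genome_size_bp), round(reads_needed))  # 1760400000 704160000
```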
Q47 — Open Short Answer
Describe the 10-step genome assembly pipeline. For each step, provide a brief (one-sentence) explanation of its purpose.
✓ Model Answer

1. Gather information about the target genome — Investigate genome size, repeats, heterozygosity, ploidy, and GC content to plan the assembly strategy.

2. Extract high-quality DNA — Obtain pure, intact, high-molecular-weight DNA suitable for the chosen sequencing technology.

3. Design the best experimental workflow — Define experimental goals, select sequencing strategy (de novo vs reference-guided), and plan coverage and library types.

4. Choose sequencing technology and library preparation — Select between SGS, TGS, or hybrid approaches and prepare appropriate libraries (PE, mate-pair, PCR-free).

5. Evaluate computational resources — Ensure sufficient CPU, RAM, and storage are available for the assembly algorithm chosen.

6. Assemble the genome — Apply the chosen assembly algorithm (greedy, OLC, De Bruijn graph, or hybrid) to build contigs from sequencing reads.

7. Polish the assembly — Correct residual sequencing and assembly errors to improve base-level accuracy using tools like Pilon or Racon.

8. Check assembly quality — Evaluate using metrics such as N50 (contiguity), BUSCO (completeness), assembly size, and read mapping rates.

9. Scaffolding and gap filling — Connect contigs into scaffolds using paired reads, long reads, Hi-C, or optical mapping; fill gaps with sequence or Ns.

10. Re-evaluate assembly quality — Repeat quality control to ensure scaffolding and gap filling improved the assembly.

Q48 — Open Tricky
Explain the difference between De Bruijn graph and Overlap-Layout-Consensus (OLC) approaches for genome assembly. Include: what serves as nodes and edges in each, which sequencing data type each is best suited for, and why De Bruijn became dominant for short-read assemblies.
✓ Model Answer

OLC (Overlap-Layout-Consensus):

• Nodes = individual reads; Edges = overlaps between reads

• Three steps: (i) compute overlaps between all reads, (ii) lay out overlap information in a graph, (iii) infer consensus sequence

• Requires comparing all reads to all other reads → computationally impractical for billions of short reads

• Best suited for long reads (e.g., PacBio, Nanopore) where the number of reads is smaller

• Assembly corresponds to finding a Hamiltonian path (visiting every node once)

De Bruijn Graph (DBG):

• Reads are decomposed into k-mers; Nodes = (k−1)-mers; Edges = k-mers connecting prefix to suffix

• Does not require all-vs-all read comparison, making it much more scalable

• Assembly corresponds to finding an Eulerian path (visiting every edge once), which is computationally easier than Hamiltonian paths

• Best suited for short reads (e.g., Illumina)

Why DBG became dominant: With SGS, the number of reads increased exponentially while lengths shortened. OLC's all-vs-all comparison became impractical. DBG avoids this by working with k-mers, consuming less computational time and memory.
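As a toy illustration of the k-mer decomposition described above (hypothetical reads, k = 3):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, and each k-mer
    adds a directed edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix -> suffix
    return dict(graph)

print(de_bruijn(["ACGT", "CGTA"], 3))
# {'AC': ['CG'], 'CG': ['GT', 'GT'], 'GT': ['TA']}
```

Note that no read is ever compared against another read; the graph grows by a single linear pass over each read, which is why this scales to billions of short reads.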

Q49 — Open Short Answer
Describe the three main strategies for gene prediction (structural annotation) in genome annotation. For each, explain its principle, strengths, and limitations.
✓ Model Answer

1. Intrinsic (Ab initio): Relies solely on the genomic sequence itself, using mathematical models (e.g., Hidden Markov Models) trained on known genes to identify gene-like features (ORFs, start/stop codons, splice sites). Strengths: no external data needed, can detect novel genes, high sensitivity (~100%). Limitations: species-specific training required, moderate accuracy (~60-70% for exon-intron structures), struggles with complex eukaryotic genes.

2. Extrinsic (Homology-based): Compares the genome to known gene/protein sequences in databases (NCBI, UniProt). If similarity is found, a gene is inferred. Strengths: leverages extensive existing databases, protein sequences are conserved even between distant species. Limitations: cannot detect truly novel genes absent from databases.

3. Combined (Hybrid/Combiner): Integrates both ab initio models and extrinsic evidence (RNA-Seq data, protein databases, known genes). Strengths: most accurate and widely used approach, benefits from both computational prediction and experimental evidence. Example: AUGUSTUS can work both de novo and with external data. This is the most popular strategy in modern annotation projects.

Q50 — Open Short Answer
A K-mer frequency analysis of raw sequencing reads shows a total area under the curve of 5 × 10⁹ k-mers and a main peak at coverage 10×. Estimate the genome size. The distribution also shows a prominent left peak at 1–5× frequency and a long right tail extending to 50×. Interpret these features.
✓ Model Answer

Genome size estimation:

Genome Size = Total K-mers / Average K-mer coverage = 5 × 10⁹ / 10 = 5 × 10⁸ bp = 500 Mb

Interpretation of the distribution:

Left peak (1–5× frequency): These low-frequency K-mers are likely caused by sequencing errors. Errors introduce unique, erroneous K-mers that appear only once or a few times. These should be discarded before assembly.

Main peak (~10× coverage): Represents the true genomic K-mers. The position of this peak corresponds to the average sequencing depth/coverage of the genome.

Right tail (extending to 50×): These over-represented K-mers are likely derived from repetitive regions in the genome, which are sequenced multiple times. A prominent right tail suggests the genome contains significant repetitive content, which may complicate assembly.
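The estimate above is a single division; in code, with the numbers from this question:

```python
total_kmers = 5e9     # area under the k-mer frequency curve
peak_coverage = 10    # position of the main peak (average k-mer depth)

genome_size_bp = total_kmers / peak_coverage
print(f"{genome_size_bp:.0f} bp = {genome_size_bp / 1e6:.0f} Mb")  # 500000000 bp = 500 Mb
```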

Lecture 9 – Application of NGS: Different Approaches

📝Pool-seq & Targeted Sequencing
Q1 Easy
What is the primary advantage of Pool-seq over individual whole-genome sequencing?
A. It provides individual genotype data for each sample
B. It is a cost-effective way to estimate allele frequencies across a population
C. It allows haplotype phasing of complex variants
D. It detects rare variants more accurately than individual sequencing
Explanation
Pool-seq sequences the combined DNA of multiple individuals together, providing more accurate allele frequency estimation at a lower cost than sequencing individuals separately. However, it sacrifices individual genotype data and haplotype information. Rare variant detection is actually harder with Pool-seq because low-frequency alleles can be lost in noise.
Q2 Medium
When preparing a Pool-seq experiment, what does "equimolar pooling" ensure?
A. Each individual contributes the same number of reads after sequencing
B. All SNPs have equal minor allele frequency in the pool
C. Each individual's DNA contributes an equal number of genome copies to the pool
D. Equal amounts of PCR product are added from each individual
Explanation
Equimolar pooling means that each individual's DNA contributes an equal number of genome copies to the pool. This requires precise DNA quantification (e.g., spectrophotometry or fluorometry) before pooling. Without equimolar input, some individuals would be overrepresented, distorting allele frequency estimates and biasing population structure conclusions.
Q3 Tricky
Which of the following is NOT a limitation of Pool-seq?
A. Individual genotypes cannot be recovered from pooled data
B. Haplotype phasing is impossible in pooled samples
C. Low-frequency alleles may be lost in sequencing noise
D. Allele frequencies cannot be estimated from pooled data
Explanation
Estimating allele frequencies is exactly what Pool-seq is designed for — it's the main strength, not a limitation. The actual limitations include: loss of individual genotypes (A), impossible haplotype phasing (B), difficulty detecting rare variants (C), potential bias from unequal DNA input, and unsuitability for clinical diagnostics.
Q4 Medium
In the Pool-seq study of red vs. yellow canaries, what statistical measure was used to identify genomic regions of differentiation between the two pools?
A. FST index
B. Chi-squared test for HWE
C. Linkage disequilibrium (r²)
D. FPKM normalization
Explanation
The FST index measures population differentiation based on allele frequency differences between groups. In the canary study, allele frequencies were compared between red and yellow pools, and FST peaks indicated genomic regions with strong differentiation — candidate regions for the red coloration phenotype.
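FST can be computed directly from pool allele frequencies; a minimal sketch of one common two-population formulation (not necessarily the exact estimator used in the canary study):

```python
def fst_two_pops(p1, p2):
    """Wright-style FST for a biallelic SNP in two equally sized
    populations, from allele frequencies p1 and p2."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)              # pooled expected heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-pool
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t

print(fst_two_pops(0.9, 0.1))  # ≈ 0.64, a strongly differentiated site
print(fst_two_pops(0.5, 0.5))  # 0.0, no differentiation between pools
```

Scanning this statistic along the genome and looking for peaks is what localizes candidate regions for the phenotype.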
Q5 Easy
Which targeted sequencing method is best suited for sequencing a small number of specific genomic regions?
A. Hybridization-based capture
B. PCR amplification combined with Sanger sequencing
C. Whole-exome sequencing
D. Pool-seq
Explanation
For a small number of targeted regions, PCR amplification followed by Sanger sequencing is the appropriate choice. As the number of targets increases: Ion AmpliSeq™ handles hundreds of genes, and hybridization-based capture (Ion TargetSeq™) is used for larger target regions up to ~60 Mb.
Q6 Medium
In Illumina amplicon sequencing, what makes it particularly useful for detecting rare somatic mutations in tumor biopsies?
A. It sequences the entire genome at low coverage
B. It uses random fragmentation to cover all genomic regions
C. Ultra-deep sequencing of PCR amplicons provides high sensitivity for variant detection
D. It eliminates the need for a reference genome
Explanation
Amplicon sequencing provides ultra-deep coverage of targeted regions. This high sequencing depth is critical for detecting rare somatic mutations in complex samples like tumor biopsies, where cancer cells are mixed with normal (germline) DNA. A mutation present in only a small fraction of cells can still be detected with sufficient depth.
Q7 Medium
Amplicon sequencing of bacterial 16S rRNA genes is widely used for:
A. Detecting human copy number variations
B. Performing genome-wide association studies
C. Whole-genome assembly of bacterial species
D. Phylogenetic and taxonomy studies in diverse metagenomics samples
Explanation
16S rRNA gene amplicon sequencing is a standard method for characterizing microbial communities (e.g., soil, water, human gut). The 16S rRNA gene contains conserved regions (for universal primer design) and variable regions (for species identification), making it ideal for phylogenetic classification and taxonomy assignment in metagenomics.
Q8 Tricky
Ion AmpliSeq™ panels consist of:
A. A pool of oligonucleotide primer pairs, each designed to amplify a specified genomic region
B. Biotinylated probes that hybridize to target regions and are captured with streptavidin
C. Short DNA fragments immobilized on a glass slide for hybridization
D. Restriction enzymes that cut DNA at specific sites for reduced representation
Explanation
AmpliSeq panels are PCR-based: they consist of pools of oligonucleotide primer pairs that amplify specified genomic regions. Option B describes hybridization capture (e.g., exome sequencing). Option C describes microarrays. Option D describes RAD-seq or reduced representation library approaches. Knowing the difference between amplicon-based and hybridization-based methods is key.
Q9 Easy
What percentage of the human genome does the exome represent?
A. Less than 0.5%
B. Less than 2%
C. About 15%
D. About 85%
Explanation
The human exome represents less than 2% of the genome but contains ~85% of known disease-related variants. This is why WES is so cost-effective: you sequence only ~4–5 Gb per exome vs. ~90 Gb for a whole genome, yet capture the vast majority of clinically relevant variation. Don't confuse the 2% (genome fraction) with the 85% (disease variant fraction).
Q10 Hard
In hybridization capture for whole-exome sequencing, what is the role of biotin-labeled probes and streptavidin beads?
A. Biotin fragments DNA; streptavidin sequences the fragments
B. Biotin amplifies target regions; streptavidin removes PCR duplicates
C. Biotin-labeled probes hybridize to target sequences; streptavidin beads pull down the probe-DNA complexes for isolation
D. Biotin labels the adapters; streptavidin separates the two DNA strands for sequencing
Explanation
In hybridization capture: (1) biotinylated probes (baits) hybridize to the target DNA regions (e.g., exons); (2) streptavidin-coated magnetic beads bind to the biotin on the probes, allowing physical separation of the probe-DNA complexes from non-target fragments. This biotin-streptavidin interaction is one of the strongest non-covalent bonds in nature, making capture highly efficient.

📝WES Strategies, Discrete Filtering & Epigenomics
Q11 Medium
In discrete filtering for Mendelian disorders, what is the purpose of the "filter set"?
ATo amplify rare variants before sequencing
BTo select common variants for GWAS analysis
CTo identify all variants shared between patients and controls
DTo eliminate common or known benign variants found in healthy populations, leaving only rare candidates
Explanation
The filter set (from databases like dbSNP, 1000 Genomes, gnomAD, or unaffected controls) contains common variants. By removing any variant found in the filter set, researchers eliminate likely benign polymorphisms. For rare Mendelian disorders, only ~2% of exome variants are novel, making this approach highly effective at narrowing candidates.
Q12 Medium
A child presents with a rare genetic syndrome, but both parents are healthy. What WES-based strategy is most appropriate?
ATrio sequencing (child + both parents) to identify de novo mutations
BPool-seq of the child's DNA with unrelated controls
CRNA-seq of the affected tissue to find expression changes
DExtreme phenotype sequencing comparing the child with healthy siblings
Explanation
When a child has a rare disorder but healthy parents, a de novo mutation is a likely cause. The trio approach sequences the child and both parents. Filtering removes all shared/inherited variants and common variants, leaving novel variants unique to the child as strong disease-causing candidates.
Q13 Tricky
In extreme phenotype sequencing for a quantitative trait like height, what is the main rationale for selecting individuals from the tails of the distribution?
AExtreme individuals have more de novo mutations
BRare causative variants are more likely to be enriched at phenotypic extremes
CExtreme individuals have simpler genomes that are easier to sequence
DIt avoids the need for a reference genome during analysis
Explanation
By selecting individuals at the extremes of a quantitative trait distribution (e.g., tallest vs. shortest), rare causative variants are more likely to be concentrated in one tail. This increases the statistical power to detect them without sequencing the entire population. This approach can be combined with Pool-seq to further reduce costs.
Q14 Easy
In bisulfite sequencing, what happens to unmethylated cytosines?
AThey are converted to adenine
BThey remain as cytosine
CThey are converted to uracil, which is read as thymine during sequencing
DThey are removed from the DNA strand
Explanation
Sodium bisulfite converts unmethylated cytosines → uracil → read as thymine (C→T). Methylated cytosines are protected and remain as C. Therefore: C→C in reads = methylated; C→T in reads = unmethylated. This is the fundamental principle of bisulfite sequencing for studying DNA methylation.
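The C/T counting logic behind methylation calling can be sketched in a few lines (a toy model that ignores strand, quality filtering, and incomplete conversion):

```python
def methylation_fraction(bases):
    """At a reference C position in bisulfite-treated reads:
    C = methylated (protected), T = unmethylated (converted).
    Returns the fraction of methylated calls."""
    c = bases.count("C")
    t = bases.count("T")
    return c / (c + t) if c + t else 0.0

# Hypothetical pileup of read bases at one CpG site
print(methylation_fraction("CCCTTCCT"))  # 0.625 (5 C vs 3 T)
```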
Q15 Hard
A major analytical challenge of bisulfite sequencing is distinguishing between:
AMethylated CpG sites and unmethylated CpG islands
B5-methylcytosine and 5-hydroxymethylcytosine
CCpG shores and CpG shelves
DTrue C→T SNPs and C→T changes caused by bisulfite conversion of unmethylated cytosines
Explanation
A C→T change in bisulfite-treated reads could be either an epigenetic signal (bisulfite conversion of unmethylated C) or a real genetic variant (a true C→T SNP). To resolve this ambiguity, researchers can: (1) perform parallel WES/WGS without bisulfite treatment, (2) use specialized bioinformatics tools, or (3) exploit strand-specificity — the G on the opposing strand is unaffected by bisulfite treatment.
Q16 Medium
About 98% of DNA methylation in the human genome occurs at:
AAdenine residues in GATC motifs
BCpG dinucleotides
CThymine residues in repetitive elements
DGuanine residues in GC-rich promoters
Explanation
In the human genome, about 98% of cytosine methylation occurs at CpG dinucleotides. These CpG sites often cluster into CpG islands (regions >500 bp), which are typically located near gene promoters. Highly methylated promoters are generally associated with repressed gene expression, while low methylation often indicates active transcription.
Q17 Medium
What is the correct order of steps in ChIP-seq?
ACrosslink → Fragment chromatin → Immunoprecipitate with antibody → Extract DNA → Sequence
BExtract DNA → Fragment → Bisulfite treat → Sequence → Call peaks
CImmunoprecipitate → Crosslink → Sequence → Fragment → Align
DFragment → Hybridize probes → Pull down with streptavidin → Sequence → Quantify expression
Explanation
ChIP-seq workflow: (1) Crosslink proteins to DNA with formaldehyde; (2) Fragment chromatin (sonication/enzymatic); (3) Immunoprecipitate with a protein-specific antibody; (4) Reverse crosslinks and extract DNA; (5) Sequence and align reads. Peaks in read depth indicate protein binding sites. Option B describes bisulfite sequencing; option D describes hybridization capture.
Q18 Easy
What does ChIP-seq identify?
ADifferentially methylated regions across the genome
BGene expression levels across tissues
CGenome-wide binding sites of DNA-associated proteins
DCopy number variations in the genome
Explanation
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) identifies the binding sites of DNA-associated proteins genome-wide. These include transcription factors, histone modifications, and other regulatory proteins. The "peaks" in the sequencing data correspond to protein binding locations, which help map regulatory elements like promoters and enhancers.
Q19 Tricky
Which statement about CpG islands is correct?
AThey are defined as regions shorter than 100 bp enriched in CpG dinucleotides
BThey are regions greater than 500 bp, typically located near gene promoters, and surrounded by shores and shelves
CHeavily methylated CpG islands always indicate active gene transcription
DThey are found exclusively in intergenic regions far from any gene
Explanation
CpG islands are defined as regions >500 bp with high CpG density, typically near gene promoters. They are flanked by CpG shores (~2 kb away) and shelves. Critically, high methylation of promoter CpG islands correlates with gene repression (not activation — option C is reversed). Low methylation at promoters generally means active transcription.
Q20 Medium
In an RNA-seq experiment, which method is used to enrich mRNA from total RNA?
ABisulfite treatment
BImmunoprecipitation with anti-RNA antibodies
CRestriction enzyme digestion
DPoly-A selection using oligo-dT primers/beads
Explanation
mRNA has a poly-A tail, so it can be enriched using oligo-dT primers or beads that bind these tails. An alternative approach is ribosomal RNA depletion, since rRNA makes up ~80% of total RNA. These two methods serve the same goal — enriching mRNA — but work differently. Bisulfite is for methylation, immunoprecipitation is for ChIP, and restriction enzymes are for DNA fragmentation.
Q21 Hard
Which RNA-seq normalization metric is most appropriate for comparing gene expression across samples?
ATPM (Transcripts Per Million)
BRaw read counts
CRPKM (Reads Per Kilobase per Million)
DGC content ratio
Explanation
TPM (Transcripts Per Million) is considered better for comparing gene expression across samples because the total TPM values sum to the same number in each sample. RPKM and FPKM normalize for sequencing depth and gene length but their totals can differ between samples, making cross-sample comparisons less reliable. Raw counts need separate normalization for both depth and length.
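A minimal sketch of the TPM calculation, using made-up counts and gene lengths, showing why per-sample totals are directly comparable:

```python
def tpm(counts, lengths_kb):
    """Transcripts Per Million: first normalize counts by gene length
    (reads per kilobase), then scale so each sample sums to one million."""
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    per_million = sum(rpk) / 1e6
    return [r / per_million for r in rpk]

# Hypothetical sample: three genes of different lengths
values = tpm(counts=[100, 200, 300], lengths_kb=[1.0, 2.0, 4.0])
print(sum(values))  # ≈ 1,000,000 in every sample, hence cross-sample comparability
```

Swapping the order of the two normalization steps (depth first, then length) gives RPKM/FPKM, whose per-sample totals are not guaranteed to match.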
Q22 Medium
In RNA-seq, when no reference genome is available, which approach is used?
AAlign-then-assemble using STAR or HISAT2
BChIP-seq peak calling
CDe novo transcriptome assembly using tools like Trinity
DBisulfite sequencing of cDNA libraries
Explanation
When no reference genome exists (e.g., non-model organisms), reads are assembled into transcripts de novo using tools like Trinity or SOAPdenovo-Trans. This is the "assemble-then-align" approach. Note: de novo assembly works best for the most abundant transcripts. The "align-then-assemble" approach (option A) requires a reference genome.

📝High-Throughput Genotyping & SNP Chips
Q23 Easy
What is the concept behind using "tag SNPs" for genotyping?
A. Tag SNPs are the rarest variants in the genome and thus most informative
B. Due to linkage disequilibrium, a representative subset of SNPs can capture most genetic variation without genotyping every variant
C. Tag SNPs are always located in exonic regions encoding proteins
D. Tag SNPs must have a MAF below 0.01 to be useful
Explanation
Because nearby variants on the same chromosome are inherited together in blocks (linkage disequilibrium), genotyping one representative "tag" SNP per block captures the variation of all other SNPs in that block. This allows researchers to reduce millions of SNPs to tens or hundreds of thousands of tag SNPs, saving cost while retaining most genetic information.
Q24 Medium
A SNP with a Minor Allele Frequency (MAF) of 0.5 indicates:
A. The SNP is monomorphic in the population
B. Only one individual carries the minor allele
C. The SNP is likely a sequencing error
D. Both alleles are present at equal frequency — maximum informativeness
Explanation
MAF = 0.5 means the two alleles are equally frequent (50%/50%). This is the most informative state for a SNP because there is maximum chance that any two individuals will differ at that position. A higher MAF = more informative for detecting genetic differences. Monomorphic would mean MAF = 0 (only one allele present).
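The informativeness claim can be made concrete with expected heterozygosity, 2p(1 − p), which peaks at MAF = 0.5:

```python
# Expected heterozygosity 2p(1 - p): the chance that two randomly
# drawn allele copies differ at this site
for maf in (0.01, 0.1, 0.25, 0.5):
    print(f"MAF {maf}: heterozygosity {2 * maf * (1 - maf):.4f}")
# the value is highest at MAF = 0.5, i.e. maximum informativeness
```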
Q25 Medium
In the Illumina Infinium array, how is a SNP genotype determined?
A. By hybridization of sample DNA to multiple overlapping probes
B. By restriction enzyme digestion and fragment length analysis
C. By single-base extension of a probe that stops just before the SNP, followed by fluorescence detection
D. By PCR amplification and gel electrophoresis
Explanation
The Infinium (Illumina) array uses single-base extension: a probe on a glass slide matches the DNA up to just before the SNP position. DNA polymerase extends by one fluorescently labeled nucleotide corresponding to the SNP allele. A camera detects the color — one color = homozygous, mixed colors = heterozygous. Option A describes the Affymetrix approach.
Q26 Tricky
What is the key difference between Infinium (Illumina) and Affymetrix genotyping arrays?
A. Infinium uses single-base extension; Affymetrix uses hybridization of sample DNA to multiple probes
B. Infinium is based on hybridization capture; Affymetrix uses bisulfite conversion
C. Infinium can only genotype 100 SNPs; Affymetrix handles millions
D. They are identical technologies from different manufacturers
Explanation
Infinium (Illumina) uses single-base extension: one probe per SNP, extending by one fluorescent nucleotide. Affymetrix uses hybridization: multiple overlapping probes per SNP, detecting differential binding. Infinium is generally considered more robust and accurate; Affymetrix is more dependent on probe design and hybridization conditions. Both handle large numbers of SNPs.
Q27 Medium
In GenomeStudio's Genoplot, what do the axes Norm R and Norm Theta represent?
A. Norm R = allele frequency; Norm Theta = sequencing depth
B. Norm R = signal intensity; Norm Theta = allele frequency (balance between alleles)
C. Norm R = mapping quality; Norm Theta = GC content
D. Norm R = chromosome position; Norm Theta = p-value
Explanation
In GenomeStudio's Genoplot: Norm R represents signal intensity (how strong the overall signal is) and Norm Theta represents allele frequency (the balance between the two alleles). Dots are color-coded: Red = AA homozygous, Blue = BB homozygous, Purple = AB heterozygous. These clusters make genotype calling visual and intuitive.
Q28 Hard
A SNP consistently deviates from Hardy-Weinberg Equilibrium across all populations tested. This most likely indicates:
A. Strong natural selection acting on that locus in all populations
B. High inbreeding in every tested population
C. A technical problem with the genotyping assay (e.g., poor probe design or repetitive region)
D. The SNP has a MAF of exactly 0.5
Explanation
If a SNP is out of HWE in all populations, it's a strong signal of a technical problem — poor probe design, location in a repetitive/duplicated region, or assay chemistry issues. Biological causes (selection, inbreeding) would typically affect only some populations. If the HWE deviation is population-specific, then biological explanations become more plausible.
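A quick sketch of the HWE check itself (chi-squared from genotype counts; a real QC pipeline would also apply a p-value threshold):

```python
def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-squared statistic for Hardy-Weinberg equilibrium from
    observed genotype counts (1 degree of freedom)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_aa, n_ab, n_bb]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

print(hwe_chi2(25, 50, 25))  # 0.0, exactly in HWE
print(hwe_chi2(50, 0, 50))   # large statistic: no heterozygotes at all,
                             # the classic signature of a failed assay
```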
Q29 Medium
In the pig 60K SNP chip study, why were Reduced Representation Libraries (RRL) used?
ATo sequence the entire pig genome at high coverage
BTo enrich repetitive DNA elements for mapping
CTo amplify only exonic regions of the pig genome
DTo reduce sequencing effort by focusing on a non-repetitive subset of the genome using restriction enzymes and size selection
Explanation
RRL uses restriction enzymes (e.g., AluI) to cut genomic DNA, followed by size selection via gel electrophoresis. This focuses sequencing on a representative, non-repetitive subset of the genome. Repetitive regions (like SINEs) are avoided because SNPs there can't be reliably mapped to a single location, making them useless for genotyping.
Q30 Tricky
Why might a SNP on a genotyping chip fail to show two alleles (appearing monomorphic when it shouldn't be)?
AThe SNP may be in a repetitive region, the reference genome may be misassembled, or the assay chemistry may have failed
BThe population has too much genetic diversity
CThe MAF of 0.5 makes the alleles invisible to the chip
DPool-seq was used instead of individual genotyping
Explanation
Technical reasons for SNP genotyping failure include: (1) the SNP is in a repetitive region causing ambiguous probe binding; (2) the reference genome has the SNP in a misassembled or unassigned contig; (3) the assay chemistry fails. High MAF (option C) would actually make a SNP easier to detect, not harder. Too much diversity (option B) wouldn't cause monomorphism.
Q31 Medium
What minimum sequencing depth is recommended for reliable genotype calling from NGS data?
A
B100× or more
C10×
D
Explanation
For confident genotype calling from NGS, ~100× coverage or more is recommended. At low depth (e.g., 10×), a heterozygous position might appear homozygous due to random sampling — you might only capture reads from one allele by chance. SNP chips are more robust because they use thousands of probes per SNP, providing built-in redundancy.
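The sampling argument above is easy to quantify: at a true heterozygous site each read shows one of the two alleles with probability 1/2, so the chance that every read happens to show the same allele is 2 × (1/2)^depth. A quick sketch:

```python
# Probability that a true heterozygote looks homozygous at a given depth,
# assuming reads sample the two alleles independently and without error.

def p_het_looks_hom(depth: int) -> float:
    # Two ways for all reads to agree: all show allele 1, or all show allele 2.
    return 2 * 0.5 ** depth

for d in (5, 10, 30):
    print(d, p_het_looks_hom(d))
# depth 5 gives a 6.25% chance of a miscall; by depth 30 it is negligible
```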
Q32 Medium
Which NGS-based genotyping method does NOT require a reference genome?
AHybridization-based enrichment (exome sequencing)
BSNP chip genotyping
CRAD-seq (Restriction-site Associated DNA sequencing)
DWhole-genome resequencing
Explanation
RAD-seq (and GBS) can work without a reference genome, making them ideal for non-model species. They use restriction enzymes to create reproducible genomic fragments across individuals. Exome sequencing requires probe design based on a reference, SNP chips need mapped positions, and WGS resequencing needs a reference for alignment. Amplicon sequencing also doesn't require a full reference but needs known target sequences.
Q33 Easy
What distinguishes genomic selection from marker-assisted selection (MAS)?
AGenomic selection uses thousands of genome-wide markers; MAS uses a few specific markers linked to traits
BGenomic selection is only for plants; MAS is only for animals
CMAS requires whole-genome sequencing; genomic selection does not
DThere is no difference; they are the same method
Explanation
Genomic selection uses thousands to hundreds of thousands of markers genome-wide to predict an individual's total genetic potential for complex traits (many genes, small effects). MAS targets a few specific markers linked to traits controlled by major genes. Genomic selection enables early-life prediction and faster breeding cycles; MAS is simpler but limited to well-characterized traits.
Q34 Medium
Copy Number Variations (CNVs) are defined as:
ASingle nucleotide changes scattered randomly throughout the genome
BSegments of DNA smaller than 100 bp that are duplicated
CInsertions of transposable elements at random positions
DDNA segments ≥1 kb that vary in copy number compared to a reference genome, typically occurring as tandem repeats
Explanation
CNVs are segments of DNA ≥1 kb (kilobase) that vary in copy number between individuals compared to a reference. They typically occur as tandem repeats (adjacent copies on the same haplotype), not as dispersed elements. Despite being fewer in number than SNPs, CNVs contribute more total nucleotide variation and have been linked to many traits and diseases.
Q35 Hard
In inbreeding studies, why might a single incorrectly called heterozygous SNP within a long homozygous stretch be problematic?
AIt would falsely increase the estimated MAF of the region
BIt would break a run of homozygosity (ROH), leading to underestimation of inbreeding
CIt would cause the entire chromosome to be excluded from analysis
DIt would cause HWE violation in the entire population
Explanation
In inbreeding analysis, researchers look for long runs of homozygosity (ROH). A single falsely heterozygous SNP due to technical noise would split a long ROH into two shorter ones (or eliminate it), leading to incorrect conclusions about the degree of inbreeding. This is why such problematic SNPs should be identified and removed from the dataset.
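A toy illustration of the ROH-splitting effect described above. The genotype coding ('H' hom / 'E' het) and the 10-SNP minimum run length are arbitrary choices for the sketch, not course parameters:

```python
# Toy sketch: how a single false heterozygous call splits an ROH.

def roh_segments(genotypes, min_len=10):
    """Return lengths of homozygous runs of at least min_len SNPs."""
    runs, current = [], 0
    for g in genotypes:
        if g == 'H':
            current += 1
        else:
            if current >= min_len:
                runs.append(current)
            current = 0
    if current >= min_len:
        runs.append(current)
    return runs

clean = ['H'] * 40
noisy = ['H'] * 20 + ['E'] + ['H'] * 19   # one miscalled het in the middle

print(roh_segments(clean))   # [40] -> one long ROH
print(roh_segments(noisy))   # [20, 19] -> split into two shorter runs
```

With a stricter minimum run length, the same miscall could make the ROH disappear entirely, which is exactly why such SNPs are removed before inbreeding estimation.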
Q36 Tricky
In the pig 60K SNP chip design, starting from 2.6 million initial SNPs, approximately how many were selected for the final chip?
A60,000
B2,600,000
C1,000,000
D106,000
Explanation
The pipeline was: 2.6M initial SNPs → filtered to ~106,000 → selected 60,000 for the final chip. Option D (106,000) was the intermediate step after quality filtering for MAF, mapping quality, etc. The final product is the "60K SNP chip" with 60,000 highly informative SNPs. This multi-step filtering is essential because not all discovered SNPs are suitable for genotyping.

📝Open Questions — Lecture 9
Q37 — Open Short Answer
Describe the Pool-seq approach. What type of information does it provide, what are its main limitations, and when is it preferred over individual whole-genome sequencing?
✓ Model Answer

Approach: Pool-seq involves combining DNA from multiple individuals into a single pool (equimolar amounts), preparing one library, and performing whole-genome sequencing. Reads are mapped to a reference genome and allele frequencies are estimated at each variant position.

Information provided: Population-level allele frequency estimates for SNPs across the genome. It enables comparison of allele frequencies between groups (e.g., using FST).

Limitations: (1) No individual genotype data — variants cannot be traced to specific individuals. (2) Hard to detect rare variants (low-frequency alleles lost in noise). (3) Haplotype phasing is impossible. (4) Bias from unequal DNA input can distort results. (5) Not suitable for clinical diagnostics.

When preferred: When comparing populations or extreme phenotype groups (e.g., red vs. yellow canaries, healthy vs. diseased), when budget limits individual sequencing, and when the goal is allele frequency estimation rather than individual-level genotyping.

Q38 — Open Short Answer
What is the primary purpose of ChIP-seq? Describe its main steps and explain how binding sites are identified from the data.
✓ Model Answer

Purpose: ChIP-seq identifies genome-wide binding sites of DNA-associated proteins (transcription factors, histone modifications) to understand gene regulation.

Steps: (1) Crosslink proteins to DNA using formaldehyde. (2) Fragment chromatin by sonication or enzymatic digestion. (3) Immunoprecipitate protein-DNA complexes using a specific antibody against the protein of interest. (4) Reverse crosslinks and extract the captured DNA. (5) Sequence the DNA using NGS.

Identifying binding sites: Sequenced reads are aligned to a reference genome. Regions with significantly enriched read coverage (peaks) indicate where the protein was bound. Peak calling algorithms identify these enriched regions. Peaks can be annotated to determine overlap with promoters, enhancers, or other regulatory elements, revealing which genes the protein regulates.

Q39 — Open Calculation
A whole-exome sequencing experiment produces 50 million paired-end reads of 150 bp each. The target exome size is 50 Mb. What is the average sequencing depth of the exome? Is this sufficient for reliable variant calling?
✓ Model Answer
Total bases sequenced = 50,000,000 reads × 150 bp × 2 (paired-end) = 15,000,000,000 bp = 15 Gb
Exome size = 50 Mb = 50,000,000 bp
Average depth = 15,000,000,000 / 50,000,000 = 300×

The average exome depth is 300×. This is well above the recommended ~100× for confident genotype calling from NGS data. However, this is an ideal calculation — in practice, not all reads will map on-target (capture efficiency is typically 60–80%), so effective depth would be lower but still sufficient.
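The same calculation in code, including the hedged capture-efficiency adjustment. The 70% on-target fraction below is an assumed mid-range value for illustration, not data from the question:

```python
# Recomputing the model answer for exome sequencing depth.
reads = 50_000_000      # read pairs from the question
read_len = 150          # bp per read
exome_bp = 50_000_000   # 50 Mb target

raw_depth = reads * read_len * 2 / exome_bp   # paired-end: two reads per pair
on_target = 0.70                              # assumed capture efficiency (illustrative)
effective_depth = raw_depth * on_target

print(raw_depth)        # 300.0, matching the model answer
print(effective_depth)  # ~210x, still comfortably above the ~100x guideline
```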

Q40 — Open Short Answer
Compare the four main strategies for finding disease-causing rare variants using exome sequencing: (1) filtering across unrelated individuals, (2) family-based segregation, (3) de novo trio analysis, and (4) extreme phenotype sequencing.
✓ Model Answer

(1) Unrelated affected individuals: Sequence exomes of multiple unrelated patients with the same disease. Apply discrete filtering to remove common variants (dbSNP, 1000 Genomes, gnomAD). Look for novel/rare variants shared across affected individuals in the same gene. Powerful for rare Mendelian disorders where ~98% of exome variants are already known.

(2) Family-based segregation: Sequence affected and unaffected family members. Identify variants that co-segregate with the disease (present in all affected, absent in unaffected). Increases confidence that the variant tracks with the phenotype across generations. Used for dominant or recessive trait mapping.

(3) De novo trio analysis: Sequence the child and both healthy parents. Remove all shared/inherited variants and common variants. What remains are novel, de novo mutations unique to the child — strong candidates for ultra-rare syndromes with unclear inheritance.

(4) Extreme phenotype sequencing: For quantitative traits, select individuals at phenotypic extremes (e.g., tallest vs. shortest). Rare causative variants are enriched at the extremes. Can combine with Pool-seq to reduce costs. Used for height, BMI, fertility, and other continuous traits.

Q41 — Open Tricky
You are designing a custom SNP genotyping chip for a livestock species. After genotyping your first batch of samples, you notice that several SNPs violate Hardy-Weinberg Equilibrium. How would you determine whether this is a technical problem or reflects real biology? What would you do for version 2 of the chip?
✓ Model Answer

Diagnosing the cause:

(1) Check across populations: If the same SNP is out of HWE in all populations → likely a technical issue (poor probe, repetitive region). If only in some populations → may reflect biology (inbreeding, selection, population structure).

(2) Examine the genotype clustering plot: Good clustering (three clearly separated groups) → SNP is reliable, deviation may be biological. Poor/noisy clustering or missing clusters → technical failure.

(3) Consider genomic context: Is the SNP in a repetitive region or near a CNV? These locations cause unreliable probe binding.

(4) Adjust software parameters: Try tuning clustering thresholds in GenomeStudio to see if genotype calls improve.

For version 2: Flag persistently problematic SNPs. Remove those that consistently fail HWE across all populations or show poor clustering. Replace them with new informative SNPs from better-characterized regions. Keep SNPs with biologically explainable HWE deviations if the clustering is clean.

Q42 — Open Short Answer
Compare SNP chip-based genotyping with NGS-based genotyping (e.g., GBS or RAD-seq). Discuss their requirements, advantages, and typical use cases.
✓ Model Answer

SNP chips: Require a reference genome and a pre-designed set of SNPs. Provide fixed, high-accuracy genotyping with built-in probe redundancy, making them robust even with poor DNA quality. Cost-effective for large samples with established markers. Best for: GWAS, genomic selection, parentage testing in well-studied species. Limitation: only genotype pre-selected SNPs; cannot discover new variants.

NGS-based (GBS/RAD-seq): Can work without a reference genome (restriction enzymes create reproducible fragments). Enable simultaneous SNP discovery and genotyping. Cost-effective per locus but require higher sequencing depth (~100×) for confident genotype calls. Best for: population genomics in non-model organisms, diversity studies, evolutionary studies. Limitation: higher risk of genotyping errors at low depth; computationally intensive.

Key trade-off: SNP chips are more reliable and standardized; NGS-based methods are more flexible and can discover novel variation. The choice depends on the species (model vs. non-model), available resources, and whether discovery of new variants is needed.

Applied Genomics — Final Comprehensive Exam

Instructions: Answer all questions. For MCQs, select the single best answer. For open questions, provide concise but complete answers.


1 Medium
What does linkage disequilibrium describe?
AThe degree of similarity between two populations
BThe correlation between alleles of two SNPs within a population
CThe rate of linked contigs during genome assembly
DThe mutation rate of genetic markers
Explanation
Linkage disequilibrium (LD) refers to the non-random association of alleles at different loci. When two SNPs are in LD, certain allele combinations occur more frequently than expected by chance, reflecting shared evolutionary history and limited recombination between them.
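LD between two SNPs is commonly quantified as r², computed from the haplotype frequency and the two allele frequencies. A minimal sketch with invented frequencies:

```python
# Sketch of the r^2 LD statistic; input frequencies are invented illustrations.

def ld_r_squared(p_ab, p_a, p_b):
    """r^2 between two loci: D^2 / (p_A(1-p_A) p_B(1-p_B))."""
    d = p_ab - p_a * p_b   # disequilibrium coefficient D
    return d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Complete LD: allele A always travels with allele B on the same haplotype.
print(ld_r_squared(p_ab=0.3, p_a=0.3, p_b=0.3))   # -> 1.0
# Independent loci: haplotype freq equals the product of allele freqs.
print(ld_r_squared(p_ab=0.25, p_a=0.5, p_b=0.5))  # -> 0.0
```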
2 Easy
Which NGS technology is known for producing long reads, often used in de novo genome assembly?
AIllumina
BIon Torrent
CPacBio
DRoche 454
Explanation
Pacific Biosciences (PacBio) produces long reads averaging 10-25 kb, which are invaluable for resolving repetitive regions and constructing de novo genome assemblies. Illumina and Ion Torrent are short-read technologies, while Roche 454 is discontinued.
3 Medium
In a Manhattan plot (GWAS analysis), what does the vertical axis typically represent?
AThe SNP position (bp) along the chromosome
BThe minor allele frequency (MAF)
CThe significance level [−log₁₀(P)] of each SNP association
DThe effect size (β) of each SNP
Explanation
The Y-axis shows −log₁₀(P-value), meaning that more significant associations appear as higher points. The horizontal line typically represents the genome-wide significance threshold (P < 5 × 10⁻⁸). The X-axis displays chromosome positions.
4 Easy
What is the purpose of a SAM file in NGS data analysis?
ATo store raw sequencing reads and quality scores
BTo store sequence alignments to a reference genome
CTo store variant calls and inferred genotypes
DTo store genome annotation information
Explanation
SAM (Sequence Alignment/Map) files contain alignments of sequencing reads to a reference genome. BAM is the binary compressed version. FASTQ files store raw reads, VCF files store variants, and annotation goes in GFF/BED files.
5 Medium
What is aCGH?
AA chip-based genome resequencing technology
BA microarray-based method to identify copy number variations
CAn NGS paired-end sequencing approach
DA method for advanced evaluation of chromosomal heterozygosity
Explanation
Array Comparative Genomic Hybridization (aCGH) is a microarray-based technique that detects copy number variations (CNVs) by comparing test and reference DNA hybridization signals. It was widely used before NGS-based CNV detection became common.
6 Medium
In a FASTQ file, each sequence entry consists of how many lines?
A2 lines
B3 lines
C4 lines
D5 lines
Explanation
FASTQ format uses exactly 4 lines per read: (1) header starting with @, (2) the nucleotide sequence, (3) separator line starting with +, and (4) quality scores encoded as ASCII characters.
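The fixed 4-line structure makes FASTQ trivial to parse in chunks. A minimal sketch (real tools additionally handle gzip compression and malformed records):

```python
# Sketch: reading FASTQ records as groups of exactly 4 lines.

def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from FASTQ lines."""
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        assert header.startswith('@') and sep.startswith('+')
        yield header[1:], seq, qual

record = ['@read1', 'ACGTACGT', '+', 'IIIIIIII']
print(list(parse_fastq(record)))  # [('read1', 'ACGTACGT', 'IIIIIIII')]
```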
7 Medium
What is the sequencing depth formula?
ADepth = G × L / N
BDepth = (N × L) / G
CDepth = N / (L × G)
DDepth = (G × N) / L
Explanation
Coverage (depth) = (number of reads × read length) / genome size. For example, 100 million 150-bp reads on a 3 Gb genome gives (100M × 150) / 3G = 5× coverage.
8 Medium
Which Ion Torrent technology principle involves hydrogen ion release?
APyrosequencing with light detection
BSequencing by synthesis with reversible terminators
CDetection of H⁺ ions released during nucleotide incorporation
DSingle-molecule sequencing using zero-mode waveguides
Explanation
Ion Torrent detects the pH change (hydrogen ions) released when DNA polymerase incorporates a nucleotide into the growing strand. Each incorporation releases one H⁺ ion, which is detected by an ion sensor.
9 Medium
What does ABI SOLiD technology use for encoding?
ASingle-base encoding
BThree-base encoding
CFour-base encoding
DTwo-base encoding system
Explanation
ABI SOLiD uses di-base (two-base) encoding, where each fluorescence color represents a dinucleotide combination. This provides built-in error checking since each base is read twice in different contexts.
10 Medium
What is the approximate read length for Illumina sequencing?
A10-50 bp
B100-300 bp
C1-5 kb
D10-25 kb
Explanation
Illumina platforms (MiSeq, HiSeq, NovaSeq) produce short reads, typically 100-300 bp. PacBio produces long reads (10-25 kb), and Nanopore can produce reads exceeding 100 kb.
11 Medium
What is the primary goal of a GWAS?
ATo sequence entire genomes of affected individuals
BTo identify statistical associations between genetic variants and phenotypic traits
CTo determine the complete haplotype structure of populations
DTo develop new therapeutic drugs for genetic diseases
Explanation
Genome-Wide Association Studies aim to identify statistical associations between genetic variants (typically SNPs) and phenotypic traits across the genome. This helps uncover the genetic basis of complex diseases and traits.
12 Medium
What does CIGAR string represent in alignment files?
AThe chromosome identity of the read
BThe quality score of the alignment
CThe mapping coordinates of the read
DThe pattern of matches, mismatches, insertions and deletions in the alignment
Explanation
CIGAR (Compact Idiosyncratic Gapped Alignment Report) describes the alignment through operations like M (match/mismatch), I (insertion), D (deletion), S (soft clip), and H (hard clip). For example, "8M2I4M" means 8 matching bases, 2 insertions, then 4 more matches.
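A small sketch of decoding a CIGAR string and using it to find how many reference bases an alignment spans, following the SAM convention that M, D, N, =, and X consume the reference while I, S, and H do not:

```python
import re

def parse_cigar(cigar):
    """Decode a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r'(\d+)([MIDNSHP=X])', cigar)]

def reference_span(cigar):
    """Reference bases covered: M, D, N, =, X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in 'MDN=X')

print(parse_cigar('8M2I4M'))      # [(8, 'M'), (2, 'I'), (4, 'M')]
print(reference_span('8M2I4M'))   # 12 -- the 2 inserted bases consume no reference
```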
13 Medium
What does BWA stand for?
ABurrows-Wheeler Aligner
BBase-wise Alignment Algorithm
CBinary Read Alignment Tool
DBioinformatics Workflow Analyzer
Explanation
BWA (Burrows-Wheeler Aligner) is a widely-used sequence alignment tool that employs the Burrows-Wheeler Transform (BWT) for efficient read mapping. BWA-MEM is the recommended algorithm for longer reads.
14 Medium
In paired-end sequencing, what is being sequenced?
ATwo separate DNA fragments from different regions
BThe same fragment sequenced twice independently
CThe ends of the same DNA fragment
DForward and reverse strands of double-stranded DNA
Explanation
Paired-end sequencing reads both ends of the same DNA fragment, with a known insert size between them. This provides information about the distance between reads, useful for assembly and structural variant detection.
15 Medium
In VCF format, what does genotype notation "0|1" indicate?
AHomozygous reference genotype
BHomozygous alternative genotype
CHeterozygous phased genotype
DHeterozygous unphased genotype
Explanation
In VCF, 0 = reference allele, 1 = first alternative allele. The pipe "|" indicates phased genotype (chromosome of origin known), while slash "/" indicates unphased genotype. So 0|1 is a phased heterozygous call.
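A minimal sketch of decoding a diploid GT field. The helper name `decode_gt` is invented for illustration, not a VCF library function:

```python
# Sketch: interpreting a VCF GT field such as "0|1" or "1/1".

def decode_gt(gt, ref, alts):
    """Map a GT string to allele bases; '|' means phased, '/' unphased."""
    phased = '|' in gt
    alleles = [ref if i == '0' else alts[int(i) - 1]
               for i in gt.replace('|', '/').split('/')]
    return alleles, phased

print(decode_gt('0|1', ref='A', alts=['G']))  # (['A', 'G'], True)  -> phased het
print(decode_gt('1/1', ref='A', alts=['G']))  # (['G', 'G'], False) -> unphased hom alt
```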
16 Medium
What is the difference between structural and functional annotation?
AStructural annotation identifies gene function; functional annotation identifies gene locations
BThey are different names for the same process
CStructural annotation is computational; functional annotation is experimental only
DStructural annotation identifies gene locations; functional annotation describes gene products and biological roles
Explanation
Structural (or computational) annotation identifies genomic features like genes, exons, introns, and regulatory elements. Functional annotation describes what these features do—their biological functions, pathways, and interactions.
17 Hard
What algorithm is used in De Bruijn graph genome assembly?
AEulerian path algorithm
BHamiltonian path algorithm
CSmith-Waterman algorithm
DNeedleman-Wunsch algorithm
Explanation
De Bruijn graph assembly uses the Eulerian path algorithm, which efficiently traverses edges (k-mers) to reconstruct the genome. This approach is computationally efficient for short reads but struggles with repeats. OLC uses Hamiltonian paths.
18 Medium
What does BUSCO evaluate in genome assemblies?
ABase-level accuracy of the assembly
BCompleteness by checking for conserved single-copy orthologous genes
CAssembly contiguity statistics
DRead mapping rates to the assembly
Explanation
BUSCO (Benchmarking Universal Single-Copy Orthologs) assesses assembly completeness by checking for evolutionarily conserved genes expected in the organism's lineage. High BUSCO scores indicate a biologically meaningful, complete assembly.
19 Medium
In Hardy-Weinberg equilibrium, what does the equation p² + 2pq + q² = 1 represent?
AThe expected genotype frequencies in a population
BThe allele frequencies in a population
CThe mutation rate in a population
DThe selection coefficient in a population
Explanation
The HWE equation describes expected genotype frequencies: p² (homozygous dominant), 2pq (heterozygous), and q² (homozygous recessive), which sum to 1. This assumes random mating, no selection, infinite population size, and no migration or mutation.
20 Medium
What is multidimensional scaling (MDS) used for in population genomics?
ATo calculate linkage disequilibrium between SNPs
BTo phase haplotypes from genotype data
CTo detect and visualize population structure by reducing high-dimensional genotype data
DTo perform genome-wide association testing
Explanation
MDS is a dimensionality reduction technique that summarizes genome-wide genetic variation into a few dimensions. It visualizes population structure—distinct clusters indicate groups with different genetic ancestry, which must be corrected for in GWAS to avoid false associations.
21 Medium
What is over-representation analysis (ORA) applied after GWAS?
ATo identify additional SNPs not tested in the original GWAS
BTo test whether specific biological functions are enriched in GWAS-identified genes
CTo calculate linkage disequilibrium between candidate variants
DTo replicate GWAS findings in independent populations
Explanation
ORA determines whether specific biological functions, pathways, or processes are over-represented in the GWAS gene list compared to what would be expected by chance. Tools like DAVID and EnrichR perform this analysis.
22 Medium
How do you estimate genome size before sequencing?
AUsing the C-value from flow cytometry
BBy counting all genes in related species
CBy measuring DNA concentration with spectrophotometry
DBy performing a small pilot sequencing run
Explanation
The C-value (DNA content in picograms) is measured by flow cytometry. Since 1 pg ≈ 978 Mb, genome size (bp) = C-value × 0.978 × 10⁹. K-mer analysis is another computational approach to estimate genome size from sequencing data.
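The conversion is a one-liner; the 3.2 pg example value below is an arbitrary, roughly human-scale illustration:

```python
# Quick check of the conversion stated above: 1 pg of DNA ~ 978 Mb.

def genome_size_bp(c_value_pg):
    return c_value_pg * 0.978e9

print(genome_size_bp(3.2))  # a ~3.2 pg C-value -> roughly 3.13 Gb genome
```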
23 Hard
What is bisulfite sequencing used to detect?
ADNA sequence variants
BCopy number variations
CChromosomal rearrangements
DDNA methylation patterns
Explanation
Bisulfite sequencing treats DNA with bisulfite, which converts unmethylated cytosines to uracil (read as thymine after PCR), while methylated cytosines remain unchanged. Comparing treated and untreated sequences reveals the methylation status of each cytosine.
24 Medium
In ChIP-seq, what does the "peak" represent?
AA genomic region where a DNA-binding protein is likely attached
BA sequencing error in the data
CA copy number variation in the genome
DA gene fusion event
Explanation
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) identifies protein binding sites. Regions with enriched read coverage ("peaks") indicate where the protein of interest (transcription factor, histone modification) was bound to DNA.
25 Medium
What is the standard genome-wide significance threshold in GWAS?
AP < 0.05
BP < 0.01
CP < 5 × 10⁻⁸
DP < 0.001
Explanation
P < 5 × 10⁻⁸ is the widely accepted genome-wide significance threshold, derived from approximately 1 million independent LD blocks across the genome. This threshold accounts for the massive multiple testing burden in GWAS.
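The threshold is just a Bonferroni correction over those ~1 million effectively independent tests:

```python
# Bonferroni derivation of the genome-wide significance threshold.
alpha = 0.05
independent_tests = 1_000_000   # approximate number of independent LD blocks

threshold = alpha / independent_tests
print(threshold)   # on the order of 5e-8
```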
26 Medium
What does ROH stand for in population genomics?
ARate of Homoplasy
BRecombination Output Hierarchy
CReference Ontology Hub
DRuns of Homozygosity
Explanation
Runs of Homozygosity (ROH) are contiguous stretches of homozygous genotypes. Longer ROH indicate recent inbreeding, as identical-by-descent alleles are inherited from a common ancestor. ROH analysis is used to quantify inbreeding coefficients.
27 Medium
What is the primary purpose of Pool-seq?
ATo obtain individual genotypes for all participants
BTo estimate allele frequencies across a population cost-effectively
CTo phase haplotypes in family data
DTo identify rare variants in individuals
Explanation
Pool-seq combines DNA from multiple individuals into a single pool and sequences it to estimate population-level allele frequencies. It's cost-effective for population comparison studies (e.g., case vs. control pools) but loses individual genotype information.
28 Medium
What is the key advantage of exome sequencing over whole-genome sequencing?
ALower cost while capturing most disease-relevant variants
BDetection of structural variants
CSequencing of non-coding regulatory regions
DAssembly of novel genomes
Explanation
Exome sequencing targets only the protein-coding regions (~1-2% of the genome) but contains ~85% of known disease-related variants. This makes it much cheaper than WGS while remaining clinically relevant for many genetic disorders.
29 Medium
In RNA-seq, what does TPM normalization account for?
AOnly gene length differences
BOnly sequencing depth differences
CBoth gene length and sequencing depth, enabling cross-sample comparison
DGC content bias only
Explanation
TPM (Transcripts Per Million) normalizes for both gene length and sequencing depth, making it suitable for comparing gene expression across samples. Unlike RPKM/FPKM, TPM values sum to the same total in each sample.
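A minimal TPM sketch showing both normalization steps and the constant per-sample sum; gene counts and lengths are invented:

```python
# Sketch of TPM: divide counts by gene length in kb (reads per kilobase),
# then scale so every sample sums to one million.

def tpm(counts, lengths_bp):
    rpk = [c / (l / 1000) for c, l in zip(counts, lengths_bp)]
    scale = sum(rpk) / 1_000_000
    return [x / scale for x in rpk]

values = tpm(counts=[100, 200], lengths_bp=[1000, 4000])
print(values)        # the short gene gets the higher TPM despite fewer raw counts
print(sum(values))   # ~1,000,000 -- TPM totals are identical across samples
```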
30 Medium
What is the difference between genomic selection and marker-assisted selection?
AGenomic selection uses phenotype data; marker-assisted selection uses genotype data
BThey are identical methods
CGenomic selection uses few markers; marker-assisted selection uses many
DGenomic selection uses genome-wide markers for complex traits; marker-assisted selection uses few markers for traits with major genes
Explanation
Genomic selection uses thousands of genome-wide markers to predict total genetic merit for complex quantitative traits. Marker-assisted selection (MAS) uses specific markers linked to major-effect genes. GS enables selection for hard-to-measure traits early in life.
31 Medium
What is the definition of an allele?
AA segment of DNA that encodes a protein
BOne of two or more alternative forms of a gene or DNA sequence
CA mutation that causes disease
DThe complete set of chromosomes in an organism
Explanation
An allele is an alternative form of a gene or DNA sequence at a specific locus. For example, at the ABO gene, the A, B, and O alleles represent different versions. Individuals inherit one allele from each parent.
32 Medium
If a population is NOT in Hardy-Weinberg equilibrium, what might this indicate?
AEvolutionary forces are acting (selection, migration, mutation, drift) or non-random mating
BThe population is extremely large
CThe DNA sequencing was performed incorrectly
DAll individuals are genetically identical
Explanation
Deviation from HWE suggests evolutionary forces are at work: natural selection, migration (gene flow), mutation, genetic drift (especially in small populations), or non-random mating (including inbreeding). This is a fundamental test in population genetics.
33 Medium
What does FST measure?
AThe frequency of somatic mutations in a population
BThe forward substitution rate in DNA sequences
CThe level of genetic differentiation between subpopulations
DThe fixation index of alleles within individuals
Explanation
FST measures genetic differentiation among subpopulations. Values range from 0 (no differentiation, populations identical) to 1 (complete differentiation, no shared alleles). High FST indicates population structure, which is important for GWAS to avoid confounding.
34 Medium
What is the primary goal of genome annotation?
ATo assemble reads into contigs
BTo align reads to a reference genome
CTo identify variants between samples
DTo identify and describe functional elements in the genome
Explanation
Genome annotation identifies and describes genomic elements including genes, exons, introns, regulatory regions, and other functional elements. Structural annotation locates features; functional annotation describes their biological roles.
35 Hard
In a De Bruijn graph, what do vertices represent?
A(k−1)-mers (prefixes and suffixes of k-mers)
BIndividual sequencing reads
CComplete genes
DK-mers connecting nodes
Explanation
In De Bruijn graphs, vertices are (k−1)-mers derived from decomposing reads into k-mers. Edges represent the k-mers themselves, connecting prefix (k−1-mer) to suffix (k−1-mer). For example, k-mer "ATGC" connects node "ATG" to "TGC".
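A toy construction matching the description above, with nodes as (k−1)-mers and edges as k-mers:

```python
# Toy De Bruijn graph: each k-mer becomes an edge from its (k-1)-mer
# prefix node to its (k-1)-mer suffix node.

def de_bruijn(sequence, k):
    edges = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        edges.append((kmer[:-1], kmer[1:]))   # prefix -> suffix
    return edges

print(de_bruijn('ATGCA', k=4))
# [('ATG', 'TGC'), ('TGC', 'GCA')] -- k-mer "ATGC" links node ATG to node TGC
```

Assembly then amounts to finding an Eulerian path through these edges, as the explanation notes.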
36 Medium
What is N50 in genome assembly?
AThe total number of contigs in the assembly
BThe contig length where 50% of the assembly is in contigs of this length or longer
CThe average contig length
DThe number of gaps in the assembly
Explanation
N50 is a contiguity metric: sort contigs longest to shortest, sum lengths until reaching 50% of total assembly length—the length of the last contig added is N50. Higher N50 means less fragmented assembly, but doesn't guarantee correctness.
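The definition translates directly into code; a minimal sketch over invented contig lengths:

```python
# N50: sort contigs longest-first, accumulate until half the assembly
# length is reached; the contig that crosses the halfway point is N50.

def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

print(n50([100, 200, 300, 400]))  # total 1000, half 500: 400 + 300 >= 500 -> 300
```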
37 Medium
What is population stratification in GWAS?
AThe random sampling of individuals from a population
BThe division of a population into cases and controls
CSubgroups differing in genetic ancestry that can cause confounding in GWAS
DThe stratification of DNA by GC content during sequencing
Explanation
Population stratification occurs when a study includes genetically distinct subgroups (e.g., different ancestries). Allele frequency differences between groups can mimic associations with traits, causing false positives. MDS/PCA and covariate adjustment are used to correct for this.
38 Medium
What is copy number variation (CNV)?
ASingle nucleotide differences between individuals
BSmall insertions and deletions (1-50 bp)
CChanges in chromosome number (aneuploidy)
DDNA segments ≥1 kb that vary in copy number
Explanation
CNVs are DNA segments ≥1 kb that vary in copy number between individuals. They include duplications, deletions, and complex rearrangements. Despite being fewer than SNPs, CNVs contribute significantly to phenotypic diversity and disease susceptibility.
39 Medium
What is imputation in the context of GWAS?
AStatistical method to infer genotypes at untyped SNPs using reference panels
BThe process of filling gaps in genome assemblies
CA quality control procedure to remove low-quality reads
DEstimating missing phenotype data
Explanation
Genotype imputation uses LD patterns from reference panels (HapMap, 1000 Genomes) to statistically estimate genotypes at SNP positions not directly genotyped. This enables meta-analysis across studies using different genotyping platforms.
40 Medium
Why is high heterozygosity problematic for genome assembly?
AIt increases sequencing error rates
BAllelic differences can be misassembled as separate regions, causing fragmentation
CIt reduces coverage depth
DIt makes DNA extraction more difficult
Explanation
In highly heterozygous genomes, assemblers may interpret allelic variation (from maternal and paternal chromosomes) as distinct genomic regions, assembling both separately. This leads to fragmented assemblies and inflated genome sizes. Using inbred lines or haploid tissues helps.
41 Medium
What does a λGC value of approximately 1.0 indicate in GWAS?
AMany true positive associations detected
BSevere population stratification causing false positives
CNo inflation—test statistics match expected null distribution
DOver-correction requiring more covariates
Explanation
λGC (genomic control inflation factor) ≈ 1.0 means test statistics follow the expected null distribution—indicating proper population structure control. λGC > 1 suggests inflation (confounding); λGC < 1 suggests over-correction.
42 Hard
In a GFF file, the score column represents:
AThe GC content of the feature
BThe length of the feature in base pairs
CThe number of reads supporting the feature
DA confidence value for the prediction (higher = more confident)
Explanation
GFF (General Feature Format) has 9 columns: seqname, source, feature, start, end, score, strand, frame, and attribute. The score column is a floating-point value typically representing confidence—higher values indicate more reliable predictions.
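To make the 9-column layout concrete, here's a minimal Python sketch that splits one GFF line into named fields. The example line and the function name are invented for illustration; real GFF parsing (attributes, comments, directives) is more involved.

```python
# Minimal sketch: split one tab-separated GFF line into its 9 standard columns.
GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]

def parse_gff_line(line):
    """Return a dict mapping the 9 GFF column names to their values."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GFF_COLUMNS, fields))
    # start/end are 1-based integer coordinates; score may be "." (missing)
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["score"] = None if record["score"] == "." else float(record["score"])
    return record

# Invented example line (AUGUSTUS-style gene prediction with a confidence score)
rec = parse_gff_line("chr1\tAUGUSTUS\tgene\t1000\t5000\t0.87\t+\t.\tID=gene1")
print(rec["feature"], rec["score"])  # → gene 0.87
```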
43 Medium
What is RAD-seq primarily used for?
ARestriction site-associated DNA sequencing for population genetics
BFull genome assembly of new species
CRNA expression profiling
DMetagenomic community analysis
Explanation
RAD-seq (Restriction site-Associated DNA sequencing) uses restriction enzymes to generate consistent genomic fragments across samples. It's cost-effective for population genetics, genetic mapping, and species without reference genomes, enabling SNP discovery and genotyping simultaneously.
44 Medium
What is the primary purpose of variant annotation?
ATo align reads to the reference genome
BTo determine the biological impact of identified variants
CTo filter out low-quality reads
DTo estimate population allele frequencies
Explanation
Variant annotation determines the functional consequences of variants: where they occur (exonic, intronic, UTR), what type of change (missense, nonsense, splice site), and predicted impact (using SIFT, PolyPhen). This prioritizes variants for follow-up studies.
45 Hard
Calculate N50: Contig lengths are 100, 70, 60, 50, 50, 40, 30 kb. What is the N50?
A70 kb
B50 kb
C60 kb
D100 kb
Explanation
Total = 400 kb, so half = 200 kb. Sort descending: 100, 70, 60, 50... Cumulative sums: 100, then 170 (100+70), then 230 (170+60), which first reaches the 200 kb halfway point at the 60 kb contig. So N50 = 60 kb.
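The same calculation as a minimal Python sketch (the function name is just for illustration):

```python
def n50(contig_lengths):
    """N50: the contig length at which the cumulative sum of
    descending-sorted lengths first reaches half the total size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

# Contig lengths from the question, in kb
print(n50([100, 70, 60, 50, 50, 40, 30]))  # → 60
```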

46 — Open Calculation
You have a genome of 3 Gbp and want to achieve 30× coverage using 150-bp Illumina reads. How many reads do you need? Show your calculation.
✓ Model Answer

Using the coverage formula: Coverage = (N × L) / G

30× = (N × 150 bp) / 3,000,000,000 bp
N = (30 × 3,000,000,000) / 150
N = 90,000,000,000 / 150
N = 600,000,000 reads

Answer: 600 million reads
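A quick Python sketch of the rearranged coverage formula (function name invented for illustration):

```python
def reads_needed(genome_bp, coverage, read_len_bp):
    """Rearranged coverage formula: N = (coverage × G) / L."""
    return coverage * genome_bp / read_len_bp

# 3 Gbp genome, 30× coverage, 150-bp reads
print(int(reads_needed(3_000_000_000, 30, 150)))  # → 600000000
```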

47 — Open Calculation
A population has 500 AA individuals, 200 Aa individuals, and 300 aa individuals (total 1000). Calculate allele frequencies and determine if this population is in Hardy-Weinberg equilibrium.
✓ Model Answer

Step 1: Calculate allele frequencies

Total alleles = 1000 × 2 = 2000
A alleles = (500 × 2) + (200 × 1) = 1000 + 200 = 1200
p = freq(A) = 1200 / 2000 = 0.6
q = freq(a) = 1 − 0.6 = 0.4

Step 2: Expected HWE genotype frequencies

AA = p² = 0.36 → 360 individuals
Aa = 2pq = 2 × 0.6 × 0.4 = 0.48 → 480 individuals
aa = q² = 0.16 → 160 individuals

Step 3: Comparison

Observed: 500 AA, 200 Aa, 300 aa
Expected: 360 AA, 480 Aa, 160 aa

Conclusion: This population is NOT in HWE—there is a large excess of homozygotes and deficit of heterozygotes, suggesting inbreeding, selection, or population structure.
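The three steps above can be sketched in Python (function name invented; a formal test would add a chi-square statistic on top of these expected counts):

```python
def hwe_check(n_AA, n_Aa, n_aa):
    """Allele frequencies and HWE-expected genotype counts from a tally."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # freq(A): each AA gives 2 copies, Aa gives 1
    q = 1 - p                        # freq(a)
    expected = (p * p * n, 2 * p * q * n, q * q * n)  # p², 2pq, q² × N
    return p, q, expected

p, q, exp = hwe_check(500, 200, 300)
print(p, q)                       # allele frequencies
print([round(e) for e in exp])    # → [360, 480, 160]
```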

48 — Open Short Answer
Describe the key steps in the NGS variant discovery pipeline, from raw sequencing data to annotated variants. Name the file format and one tool for each step.
✓ Model Answer

1. Quality Control & Trimming: Input FASTQ → FastQC (assessment), Trimmomatic (trimming) → Output cleaned FASTQ. Removes low-quality bases and adapters.

2. Alignment: Input FASTQ → BWA-MEM (alignment) → Output SAM/BAM. Maps reads to reference genome.

3. Variant Calling: Input BAM → GATK (variant calling) → Output VCF. Identifies SNPs and indels.

4. Variant Annotation: Input VCF → Ensembl VEP or SnpEff → Output annotated VCF. Determines functional impact of variants.

5. Filtering & Quality Control: Applies filters for depth, quality, and variant type to obtain high-confidence variant calls.

49 — Open Short Answer
Explain the difference between de novo genome assembly and reference-guided assembly. When would you choose each approach?
✓ Model Answer

De novo assembly: Reconstructs genome from scratch using overlapping reads without a reference. Required when no reference exists (non-model organisms). More computationally intensive and challenging for repetitive genomes.

Reference-guided assembly: Aligns reads to an existing reference genome. More efficient, requires lower coverage, but may miss species-specific variants or structural differences.

Choose de novo when: No reference genome available, studying novel species, or characterizing unique genomic regions absent from reference.

Choose reference-guided when: Reference exists, studying well-characterized species, or resources are limited (lower coverage needed).

50 — Open Short Answer
What is linkage disequilibrium (LD)? How does it differ from physical linkage, and why is it important for GWAS?
✓ Model Answer

Linkage disequilibrium (LD): The non-random association of alleles at different loci—the tendency of certain allele combinations to be inherited together more (or less) frequently than expected by chance.

Difference from physical linkage: Physical linkage means genes/loci are on the same chromosome. LD describes the statistical association between alleles, which is influenced by physical linkage BUT also by selection, drift, population history, and mutation.

Importance for GWAS: LD enables tag SNP strategies—genotyping a subset of variants (tag SNPs) can capture information about nearby variants in the same LD block. This reduces genotyping costs while maintaining genome-wide coverage. However, detected associations are often indirect—the true causal variant may not be genotyped but is in LD with the tag SNP.

51 — Open Short Answer
Describe the main steps in ChIP-seq and explain how binding sites are identified from the data.
✓ Model Answer

ChIP-seq steps:

1. Crosslink: Formaldehyde crosslinks proteins to DNA in vivo.

2. Fragment: Sonication breaks chromatin into small fragments.

3. Immunoprecipitate: Antibody against the protein of interest pulls down protein-DNA complexes.

4. Reverse crosslinks & purify: Extract and purify the DNA.

5. Sequence: NGS library preparation and sequencing.

Identifying binding sites: Sequence reads are aligned to the genome. Regions with significantly enriched read coverage ("peaks") compared to input control indicate protein binding locations. Peak calling algorithms (MACS, SICER) identify these enriched regions.

52 — Open Short Answer
What are the three main strategies for gene prediction (structural annotation)? Explain each briefly.
✓ Model Answer

1. Ab initio (intrinsic): Uses statistical models trained on known genes to predict features from genomic sequence alone. Advantages: detects novel genes, no external data needed. Limitations: requires species-specific training, moderate accuracy.

2. Homology-based (extrinsic): Compares genome to known genes/proteins in databases. If similarity is found, a gene is predicted. Advantages: leverages conserved sequences. Limitations: cannot detect truly novel genes absent from databases.

3. Combined (hybrid): Integrates both approaches—uses ab initio predictions guided by evidence from RNA-seq, ESTs, or protein data. Most accurate and widely used approach (e.g., AUGUSTUS with evidence).

53 — Open Calculation
A species has a C-value of 2.0 pg. Estimate the genome size in base pairs. If using 60× coverage with 150-bp reads, how many reads are needed?
✓ Model Answer

Step 1: Genome size

Genome size (bp) = C-value × 0.978 × 10⁹
= 2.0 × 0.978 × 10⁹
= 1.956 × 10⁹ bp ≈ 1.96 Gb

Step 2: Number of reads

Coverage = (N × L) / G
60 = (N × 150) / 1,956,000,000
N = (60 × 1,956,000,000) / 150
N = 782,400,000 reads ≈ 782 million reads

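Both steps as a minimal Python sketch (function names invented; the 0.978 × 10⁹ bp/pg conversion factor is the one used above):

```python
def genome_size_bp(c_value_pg):
    """Convert a C-value in pg of DNA to base pairs: 1 pg ≈ 0.978 × 10⁹ bp."""
    return c_value_pg * 0.978e9

def reads_needed(genome_bp, coverage, read_len_bp):
    """Rearranged coverage formula: N = (coverage × G) / L."""
    return coverage * genome_bp / read_len_bp

g = genome_size_bp(2.0)                  # ≈ 1.956 × 10⁹ bp
print(int(reads_needed(g, 60, 150)))     # → 782400000
```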
54 — Open Short Answer
Explain the concept of "indirect association" in GWAS. Why is a significant SNP often not the causal variant?
✓ Model Answer

Indirect association: The detected SNP shows statistical association with the trait but is not itself the causal variant—it is correlated with the causal variant through LD.

Why significant SNPs are often not causal: GWAS genotyping arrays use tag SNPs designed to capture genetic variation in LD blocks. When a tag SNP shows association, the signal may reflect the presence of the true causal variant (which was not directly genotyped) due to their correlation. The detected SNP and causal variant are inherited together because recombination hasn't separated them.

Consequence: Post-GWAS fine-mapping is needed to narrow the association signal and identify the actual causal variant(s) for functional studies.

55 — Open Short Answer
Compare RNA-seq and DNA-seq (whole-genome sequencing). What different information does each provide about an organism?
✓ Model Answer

DNA-seq (WGS):

• Sequences the entire genome (all DNA)

• Captures all variant types: SNPs, indels, CNVs, structural variants

• Identifies variants in coding and non-coding regions

• Can determine genotype, population ancestry, evolutionary relationships

• Does not directly measure gene expression or functional activity

RNA-seq:

• Sequences transcribed RNA (the transcriptome)

• Measures which genes are actively expressed and at what levels

• Captures alternative splicing, allele-specific expression, novel transcripts

• Provides functional readouts—shows which variants may affect gene regulation

• Cannot detect variants in non-expressed genes or genomic rearrangements not affecting transcription

Together: Combining DNA and RNA data provides comprehensive understanding—genetic variants (DNA) and their functional consequences (RNA expression).

56 — Open Short Answer
What is population stratification and how does it affect GWAS results? How can it be detected and corrected?
✓ Model Answer

Population stratification: Presence of genetically distinct subgroups within a study population (e.g., different ancestries). These groups differ in both allele frequencies and trait prevalence, creating confounding—association signals may reflect ancestry rather than genuine genotype-phenotype relationships.

Detection: Principal Component Analysis (PCA) or Multidimensional Scaling (MDS) plots of genotype data reveal clustering. The genomic inflation factor (λGC) quantifies statistical inflation—values >1 indicate stratification.

Correction: Include top principal components or MDS dimensions as covariates in the association model. This accounts for genetic ancestry differences. Genomic control can also adjust test statistics. Family-based designs or matching cases/controls by ancestry help prevent stratification from the start.

57 — Open Short Answer
Describe the key differences between Illumina (short-read) and PacBio/Nanopore (long-read) sequencing technologies. What are the advantages and disadvantages of each?
✓ Model Answer

Illumina (short-read):

• Read length: 100-300 bp

• High accuracy (>99.9%)

• Lower cost per base

• Requires PCR amplification

• Challenges with repetitive regions and structural variants

• Best for: variant calling, RNA-seq, ChIP-seq, population studies

PacBio/Nanopore (long-read):

• Read length: 10 kb to >100 kb

• Lower raw accuracy (85-95%) but improving

• Higher cost per base

• Can sequence without amplification (native DNA)

• Excellent for: genome assembly, structural variants, haplotype phasing, epigenetic detection

Hybrid approaches: Combine short-read accuracy with long-read contiguity for optimal assemblies.

Applied Genomics — Final Exam Simulation

📝Final Exam — 45 MCQ + 12 Open Questions
Q1 Easy
An allele is best defined as:
AA segment of DNA that codes for a protein
BOne of two or more alternative forms of a gene at a given locus
CA mutation that alters gene function
DA chromosome region inherited as a block
Explanation
An allele is one of two or more alternative forms of a gene (or a genetic locus) at the same position on a chromosome. Different alleles can produce variation in the trait that the gene controls. This is a fundamental concept in genetics that underpins all genomic analyses.
Q2 Medium
In a De Bruijn graph, the genome assembly problem is solved by finding:
AA Hamiltonian path (visiting every node once)
BThe shortest path between start and end nodes
CAn Eulerian path (visiting every edge once)
DA maximum spanning tree
Explanation
In a De Bruijn graph, nodes are (k−1)-mers and edges are k-mers. Assembly requires finding an Eulerian path — traversing every edge exactly once. This is computationally tractable, unlike the Hamiltonian path problem (visiting every node once) used in OLC, which is NP-complete. This distinction is a frequently tested concept.
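A toy Python sketch of this idea: build a De Bruijn graph from a short sequence's k-mers and walk an Eulerian path with Hierholzer's algorithm. The sequence and names are invented, and real assemblers additionally handle sequencing errors, reverse complements, and repeats (which make the path non-unique).

```python
from collections import defaultdict

def de_bruijn(sequence, k):
    """Nodes are (k-1)-mers; each k-mer contributes one directed edge."""
    graph = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph, start):
    """Hierholzer's algorithm: traverse every edge exactly once."""
    graph = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph.get(node):                 # unused outgoing edges remain
            stack.append(graph[node].pop())
        else:                                # dead end: commit node to path
            path.append(stack.pop())
    return path[::-1]

seq = "ATGGCGTCA"                            # repeat-free, so the path is unique
nodes = eulerian_path(de_bruijn(seq, 3), seq[:2])
assembled = nodes[0] + "".join(n[-1] for n in nodes[1:])
print(assembled)  # → ATGGCGTCA
```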
Q3 Easy
What does a FASTQ file contain?
ASequencing reads and per-base quality scores
BAligned reads and their genomic positions
CVariant calls (SNPs and indels)
DGene annotations and coordinates
Explanation
FASTQ stores raw sequencing reads with per-base quality scores. Each entry has 4 lines: identifier (@), sequence, separator (+), and quality string. BAM/SAM stores alignments, VCF stores variant calls, and GFF/BED store annotations. This is one of the most fundamental file formats in NGS analysis.
Q4 Medium
What is the primary goal of a GWAS?
ATo sequence entire genomes of affected individuals
BTo determine the haplotype structure of a population
CTo identify all genes in the genome
DTo find statistical associations between genetic variants and traits
Explanation
GWAS identifies statistical associations between SNPs (or other genetic variants) and phenotypic traits across a population. It does not sequence entire genomes — it genotypes known variant positions using SNP arrays. The goal is to link specific genomic regions with traits of interest for downstream biological investigation.
Q5 Easy
The Ion Torrent sequencer detects nucleotide incorporation by measuring:
AFluorescent light emission
BpH changes from H⁺ ion release
CBioluminescent signal from luciferase
DChanges in electrical current through a nanopore
Explanation
Ion Torrent uses semiconductor sequencing. When a nucleotide is incorporated, H⁺ ions are released, causing a pH change detected by an ion-sensitive layer. This is electronic detection — no optics, cameras, or fluorescent labels are needed. Option A describes Illumina, C describes 454 pyrosequencing, and D describes Nanopore.
Q6 Medium
BUSCO evaluates genome assembly quality by assessing:
AGC content uniformity
BScaffold length distribution
CPresence of conserved single-copy orthologs expected for the lineage
DSequencing error rates in consensus bases
Explanation
BUSCO (Benchmarking Universal Single-Copy Orthologs) checks for conserved genes expected in a given lineage. A high BUSCO score indicates a complete assembly. Missing BUSCOs suggest gaps; duplicated BUSCOs may indicate assembly errors or redundancy. BUSCO measures completeness, while N50 measures contiguity — both are needed for comprehensive QC.
Q7 Medium
In a VCF file, the genotype 0|1 indicates:
AA phased heterozygous genotype
BAn unphased heterozygous genotype
CA homozygous alternative genotype
DA missing genotype call
Explanation
The pipe "|" indicates a phased genotype — you know which allele is on which chromosome. The slash "/" indicates unphased (allele assignment to chromosomes is unknown). Both 0|1 and 0/1 are heterozygous, but phasing information differs. 0 = reference allele, 1 = first alternative allele.
Q8 Easy
Linkage disequilibrium describes:
AThe mutation rate of genetic markers
BThe non-random association of alleles at different loci
CThe rate of linked contigs during genome assembly
DThe degree of similarity between two populations
Explanation
LD describes the tendency of alleles at different loci to be inherited together more often than expected by chance. It is shaped by recombination, genetic drift, selection, and population history. LD is distinct from physical linkage — even unlinked loci can be in LD due to population structure or recent admixture.
Q9 Medium
ABI SOLiD sequencing uses:
AFluorescent reversible terminators
BSemiconductor pH detection
CPyrosequencing with luciferase
DSequencing by ligation with a two-base encoding system
Explanation
SOLiD (Sequencing by Oligonucleotide Ligation and Detection) uses a ligation-based approach with fluorescently labeled di-base probes. Each base is interrogated twice through five rounds of primer reset, enabling high accuracy (up to 99.99% with ECC). The output is in color-space, requiring conversion to nucleotide sequences.
Q10 Medium
PacBio SMRT sequencing detects nucleotide incorporation using:
ApH changes in semiconductor wells
BCurrent disruptions through a protein pore
CFluorescent signals in zero-mode waveguides (ZMWs)
DBioluminescence from a luciferase reaction
Explanation
PacBio uses ZMWs — tiny wells with a single immobilized polymerase at the bottom. Fluorescently labeled nucleotides are incorporated continuously; the camera detects the color and timing of each incorporation in real time. PacBio can also detect DNA modifications through interpulse duration changes. Typical read lengths are 10–25 kb or more.
Q11 Easy
A SAM/BAM file stores:
ARaw sequencing reads and quality scores only
BSequence alignments to a reference genome
CSNP and indel variant calls
DGenome annotation information
Explanation
SAM (Sequence Alignment Map) stores reads aligned to a reference, including mapping positions, CIGAR strings, mapping quality, and mate-pair information. BAM is the compressed binary version. FASTQ stores raw reads, VCF stores variant calls, and GFF stores annotations.
Q12 Medium
In a Manhattan plot, the vertical axis represents:
A−log₁₀(p-value)
BMinor allele frequency
CEffect size (β)
DPhysical position in base pairs
Explanation
In a Manhattan plot, the X-axis shows chromosomal positions and the Y-axis shows −log₁₀(p-value). This transformation makes more significant associations appear as higher points. A horizontal line typically marks the genome-wide significance threshold at P < 5 × 10⁻⁸.
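A tiny Python sketch of the Y-axis transformation (function name invented):

```python
import math

def manhattan_y(p_value):
    """Height of a point in a Manhattan plot: −log10(p)."""
    return -math.log10(p_value)

# The genome-wide significance threshold P < 5 × 10⁻⁸ maps to y ≈ 7.3
print(round(manhattan_y(5e-8), 2))  # → 7.3
print(round(manhattan_y(0.05), 2))  # → 1.3
```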
Q13 Medium
Paired-end sequencing means:
ASequencing two different samples on the same flow cell
BSequencing the same strand twice for error correction
CSequencing with two different chemistries on one library
DSequencing both ends of a DNA fragment
Explanation
Paired-end sequencing reads both ends of a DNA fragment, producing two linked reads per molecule. The middle portion may remain unsequenced, but the known insert size provides positional information critical for detecting structural variants, indels, and gene fusions that are not detectable with single-end reads.
Q14 Medium
If an individual has a high degree of inbreeding, what effect does this have on genome assembly?
AIt makes assembly harder due to increased heterozygosity
BIt has no effect on assembly quality
CIt makes assembly easier due to increased homozygosity
DIt requires long-read sequencing exclusively
Explanation
High inbreeding increases homozygosity, which simplifies assembly because there is less allelic variation to confuse the assembler. In highly heterozygous genomes, the assembler may interpret allelic variants as separate genomic regions, causing fragmentation and inflated genome size. This is why inbred lines are preferred for reference genome assembly.
Q15 Medium
Runs of Homozygosity (ROH) in a genome indicate:
ARegions of high mutation rate
BStretches of homozygous genotypes reflecting autozygosity from a common ancestor
CRegions with high recombination rates
DErrors in genotyping array data
Explanation
ROH are long continuous stretches of homozygous genotypes that arise when an individual inherits identical haplotype segments from both parents due to a common ancestor. The total length and number of ROH correlate with the degree of inbreeding — longer ROH indicate more recent inbreeding events.
Q16 Easy
Illumina sequencing generates clusters using:
ABridge amplification on a flow cell
BEmulsion PCR on beads
CRolling circle amplification
DIsothermal strand displacement
Explanation
Illumina uses bridge amplification: single-stranded DNA hybridizes to oligos on the flow cell surface, folds over to bridge with adjacent primers, and is amplified into clonal clusters of ~1,000 copies. Emulsion PCR is used by Ion Torrent, 454, and SOLiD.
Q17 Medium
The CIGAR string "5M2I3M1D4M" in a SAM record means:
A15 bases aligned with no indels
B5 soft-clipped, 2 insertions, 3 matches, 1 deletion, 4 matches
C5 deletions followed by 2 insertions
D5 aligned, 2 inserted in read, 3 aligned, 1 deleted from read, 4 aligned
Explanation
CIGAR operations: M = match/mismatch (aligned), I = insertion in read (bases present in read but not reference), D = deletion from read (bases in reference but not read), S = soft clip. Here: 5M (5 aligned) + 2I (2 bases inserted) + 3M (3 aligned) + 1D (1 base deleted) + 4M (4 aligned). Read length = 5+2+3+4 = 14 bases.
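The read-length bookkeeping can be sketched in Python (function name invented; the consumes-read/consumes-reference rules follow the SAM specification):

```python
import re

CIGAR_OPS = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_lengths(cigar):
    """Return (read_bases, reference_bases) consumed by a CIGAR string.
    M/I/S/=/X consume read bases; M/D/N/=/X consume reference bases."""
    read = ref = 0
    for count, op in CIGAR_OPS.findall(cigar):
        n = int(count)
        if op in "MIS=X":
            read += n
        if op in "MDN=X":
            ref += n
    return read, ref

print(cigar_lengths("5M2I3M1D4M"))  # → (14, 13)
```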
Q18 Medium
BWA-MEM is based on:
AHash table indexing
BSmith-Waterman local alignment
CBurrows-Wheeler Transform
DK-mer frequency counting
Explanation
BWA (Burrows-Wheeler Aligner) uses the Burrows-Wheeler Transform for efficient read alignment. It is the default aligner in many standard pipelines (e.g., GATK best practices). Different aligners can produce substantially different variant calls — one study showed only 24.5% concordance between BWA-MEM and Bowtie2.
Q19 Medium
The genome-wide significance threshold in GWAS is typically:
AP < 0.05
BP < 5 × 10⁻⁸
CP < 1 × 10⁻³
DP < 0.01
Explanation
The widely accepted threshold is P < 5 × 10⁻⁸, derived from correcting for approximately 1 million independent LD blocks across the genome (Pe'er et al., 2008). This is a fixed, LD-aware threshold that replaced the per-study Bonferroni correction, which was considered overly conservative.
Q20 Medium
Structural annotation of a genome refers to:
AIdentifying the positions and structures of genes (exons, introns, UTRs)
BAssigning biological function to predicted genes
CDetermining the 3D structure of encoded proteins
DMeasuring gene expression levels across tissues
Explanation
Structural annotation identifies where genes are located and what they look like (exon-intron boundaries, start/stop codons, UTRs). Functional annotation then assigns biological roles (e.g., enzyme activity, pathway involvement) to those predicted genes. Both are essential steps after genome assembly.
Q21 Medium
MDS (Multidimensional Scaling) in GWAS is used to:
ACalculate p-values for each SNP
BPerform multiple testing correction
CPhase haplotypes from genotype data
DDetect and visualize population structure
Explanation
MDS reduces high-dimensional genotype data into a few dimensions, where each point represents an individual. Clusters on the MDS plot reveal population subgroups. If distinct clusters correlate with case/control status, population stratification is confounding results. MDS components can be included as covariates to correct for this.
Q22 Medium
What is aCGH?
AA chip-based genome resequencing technology
BAn NGS paired-end sequencing approach
CA microarray-based method to identify CNVs
DA method for evaluating chromosomal heterozygosity
Explanation
Array Comparative Genomic Hybridization (aCGH) is a microarray-based method that compares a test genome with a reference genome to detect copy number variations (CNVs) — duplications and deletions. Test and reference DNA are labeled with different fluorescent dyes and co-hybridized to the array.
Q23 Easy
A Phred quality score of Q20 corresponds to a base call accuracy of:
A90%
B99%
C99.9%
D99.99%
Explanation
Q = −10 × log₁₀(e), where e is the base-call error probability. For Q20: e = 10⁻² = 0.01, meaning 1 error in 100 bases = 99% accuracy. Q10 = 90%, Q30 = 99.9%, Q40 = 99.99%. Illumina typically achieves Q30+, while Ion Torrent averages around Q20.
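The Q-to-accuracy conversion in one line of Python (function name invented):

```python
def phred_accuracy(q):
    """Error probability e = 10^(−Q/10); accuracy = 1 − e."""
    return 1 - 10 ** (-q / 10)

for q in (10, 20, 30, 40):
    print(q, round(phred_accuracy(q), 4))
```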
Q24 Medium
In equimolar DNA pooling for Pool-seq, what must be ensured?
AEach individual contributes equal amounts of DNA to the pool
BAll individuals are homozygous at target loci
CThe pool contains only coding sequences
DEach individual is sequenced separately before pooling
Explanation
Equimolar pooling means each individual contributes the same amount of DNA so that allele frequencies in the pool accurately represent the population. Unequal contributions would bias allele frequency estimates. Pool-seq estimates population-level allele frequencies but cannot determine individual genotypes.
Q25 Medium
Hardy-Weinberg Equilibrium assumes all of the following EXCEPT:
ARandom mating
BNo mutation
CLarge population size
DSelection favoring heterozygotes
Explanation
HWE assumes: random mating, no selection, no mutation, no migration, and large population size. Selection (including heterozygote advantage) violates HWE. If a population deviates from HWE, it may indicate selection, non-random mating, population structure, or genotyping errors.
Q26 Medium
The Sanger chain-termination method uses:
AFluorescent reversible terminators added simultaneously
BH⁺ ion detection in semiconductor wells
CDideoxynucleotides (ddNTPs) that terminate chain elongation
DLigation of fluorescent di-base probes
Explanation
Sanger sequencing uses ddNTPs that lack a 3'-OH group, terminating chain elongation when incorporated. Each ddNTP is labeled with a different fluorescent dye. This is classified as first-generation sequencing — producing long, high-accuracy reads (~800–1000 bp) but at low throughput.
Q27 Medium
Illumina SBS has higher accuracy in homopolymer regions than Ion Torrent because:
AIllumina uses a more sensitive camera
BReversible terminators ensure only one base is incorporated per cycle
CIllumina reads are inherently longer
DIllumina uses a two-base encoding system
Explanation
Illumina's reversible terminator chemistry blocks the 3' end after each nucleotide incorporation, ensuring exactly one base is added per cycle — even in homopolymer runs like AAAA, each A is read in a separate cycle. Ion Torrent flows nucleotides without terminators, so multiple identical bases may incorporate simultaneously, and the signal intensity must estimate the count — which is error-prone.
Q28 Medium
Over-Representation Analysis (ORA) in post-GWAS analysis tests whether:
ASpecific biological functions are enriched in GWAS-identified genes
BSNPs are in Hardy-Weinberg Equilibrium
CPopulation stratification has been corrected
DThe genotyping call rate exceeds 95%
Explanation
ORA determines whether certain biological pathways or Gene Ontology (GO) terms are more frequently represented in GWAS-identified genes than expected by chance. Tools like DAVID and EnrichR perform this analysis. It helps translate lists of candidate genes into biologically meaningful insights about the trait.
Q29 Medium
What does FST measure?
AThe rate of mutation between two loci
BThe inbreeding coefficient of an individual
CGenetic differentiation between populations
DThe proportion of missing genotypes
Explanation
FST (Fixation Index) measures the proportion of genetic variance found between populations relative to the total variance. FST = 0 means no differentiation (same allele frequencies); FST = 1 means complete fixation of different alleles. It is widely used in population genomics and was used in the Pool-seq study of red vs. yellow canaries.
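A simplified Python sketch of the heterozygosity-based form FST = (HT − HS) / HT for two equally sized populations, using only their allele frequencies. The function name and example frequencies are invented, and real estimators (e.g. Weir–Cockerham) add sample-size corrections this sketch ignores.

```python
def fst_two_pops(p1, p2):
    """FST = (HT − HS) / HT for two equally sized populations,
    with expected heterozygosity H = 2p(1 − p)."""
    p_bar = (p1 + p2) / 2                               # pooled allele frequency
    ht = 2 * p_bar * (1 - p_bar)                        # total heterozygosity
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2    # mean within-pop
    return (ht - hs) / ht

print(fst_two_pops(0.5, 0.5))  # identical frequencies → 0.0 (no differentiation)
print(fst_two_pops(1.0, 0.0))  # alternative alleles fixed → 1.0 (complete)
```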
Q30 Easy
ChIP-seq identifies:
ACopy number variations
BMethylation patterns at single-base resolution
CmRNA expression levels
DGenome-wide protein-DNA binding sites
Explanation
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) identifies where proteins (e.g., transcription factors, histones) bind to DNA across the genome. The process involves crosslinking, fragmentation, immunoprecipitation with a specific antibody, and sequencing the enriched DNA fragments.
Q31 Medium
Typical Illumina read lengths are approximately:
A35–50 bp
B100–300 bp
C1–5 kb
D10–25 kb
Explanation
Illumina platforms produce short reads of ~100–300 bp depending on the platform and chemistry. SOLiD produces 35–50 bp. PacBio produces 10–25 kb (or more with HiFi). The trade-off is: Illumina has high accuracy and throughput but shorter reads; PacBio/Nanopore have long reads but historically higher error rates.
Q32 Medium
Oxford Nanopore sequencing can directly sequence RNA without:
AConverting it to cDNA first
BUsing any electrical current
CFragmenting the molecules
DA protein nanopore
Explanation
Nanopore can sequence RNA directly, without reverse transcription to cDNA. RNA molecules pass through the pore, and current changes are used to infer the sequence. This preserves RNA modifications (e.g., m6A) that would be lost in cDNA conversion. Nanopore always requires electrical current and a protein pore for detection.
Q33 Medium
Population stratification in GWAS can cause:
AIncreased read length
BDecreased sequencing depth
CSpurious associations between ancestry-related SNPs and the trait
DImproved statistical power
Explanation
Population stratification occurs when subgroups differ in both ancestry and trait prevalence. SNPs that differ between subgroups may appear associated with the trait simply because they track ancestry, not biology. This creates false positives. It is detected using MDS/PCA and corrected by including ancestry components as covariates.
Q34 Medium
The exome represents approximately what percentage of the human genome?
A0.1%
B10%
C5%
D~1–2%
Explanation
The exome (all protein-coding exons) represents only about 1–2% of the human genome, yet contains ~85% of known disease-causing mutations. Whole-exome sequencing (WES) targets this fraction, making it far cheaper than WGS while capturing most clinically relevant variants.
Q35 Medium
In bisulfite sequencing, unmethylated cytosines are converted to:
AGuanine
BUracil (read as thymine after PCR)
CAdenine
D5-methylcytosine
Explanation
Bisulfite treatment converts unmethylated cytosines to uracil (read as T after PCR amplification), while methylated cytosines (5mC) are protected and remain as C. By comparing the treated sequence to the reference, methylation status at each C can be determined. The main analytical challenge is distinguishing true C→T conversions from C→T SNPs.
Q36 Medium
A VCF file stores:
ASNPs, indels, and structural variant calls
BRaw sequencing reads
CRead alignment coordinates in binary format
DGenome annotation features
Explanation
VCF (Variant Call Format) is a standardized text file for variant calls. It contains columns for chromosome, position, ID, reference allele, alternative allele(s), quality, filter status, info annotations, format, and per-sample genotype data. Meta-information lines begin with ## and the header line with #.
Q37 Medium
In RNA-seq, poly(A) selection is used to:
ARemove adapter sequences
BFragment the cDNA library
CEnrich mRNA from total RNA
DNormalize expression levels across samples
Explanation
Most eukaryotic mRNAs have a poly(A) tail. Oligo(dT) beads capture polyadenylated transcripts, separating mRNA from the dominant rRNA (~80% of total RNA). This enrichment is essential because sequencing total RNA without enrichment would be overwhelmed by ribosomal RNA.
Q38 Medium
Tag SNPs in GWAS genotyping arrays are selected because they:
AAre always the causal variants
BHave the highest mutation rates
CAre located exclusively in coding regions
DCapture variation within LD blocks without genotyping every SNP
Explanation
Because SNPs within an LD block are correlated, genotyping one representative "tag" SNP captures information about the others. This reduces cost while maintaining genome-wide coverage. The detected association is typically indirect — the tag SNP is in LD with the actual causal variant, which may not be on the array.
Q39 Medium
A SAM FLAG value of 4 indicates the read is:
A. A PCR duplicate
B. Unmapped to the reference
C. A secondary alignment
D. Properly paired with its mate
Explanation
SAM FLAG is a bitwise flag: FLAG 4 = unmapped, FLAG 256 = secondary alignment, FLAG 1024 = PCR/optical duplicate. Unmapped reads can be useful for metagenomics — they may come from contaminant organisms (bacteria, viruses) not present in the host reference genome.
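Bitwise flags trip people up, so here's a minimal Python sketch of how the bits work (the function name and the subset of bits shown are my own choices; the bit values themselves come from the SAM spec):

```python
# Decode a few common SAM FLAG bits (values per the SAM spec).
def decode_flag(flag: int) -> dict:
    return {
        "paired":         bool(flag & 1),
        "proper_pair":    bool(flag & 2),
        "unmapped":       bool(flag & 4),
        "mate_unmapped":  bool(flag & 8),
        "reverse_strand": bool(flag & 16),
        "secondary":      bool(flag & 256),
        "duplicate":      bool(flag & 1024),
    }

print(decode_flag(4)["unmapped"])  # True: FLAG 4 = read unmapped
```

Because the FLAG is a sum of powers of two, a value like 1025 unambiguously means "paired AND duplicate".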
Q40 Medium
The correct order of file formats in a variant discovery pipeline is:
A. FASTQ → BAM → VCF
B. BAM → FASTQ → VCF
C. VCF → BAM → FASTQ
D. FASTQ → VCF → BAM
Explanation
Raw reads (FASTQ) are aligned to a reference with BWA to produce BAM files, which are then processed through variant callers like GATK to produce VCF files. Quality control and filtering happen between every step. This FASTQ → BAM → VCF flow is the backbone of any resequencing analysis.
Q41 Medium
Class I transposable elements (retrotransposons) move via:
A. Cut-and-paste through a DNA intermediate
B. Direct excision and reinsertion
C. Copy-and-paste through an RNA intermediate
D. Horizontal gene transfer
Explanation
Class I (retrotransposons: LINEs, SINEs, LTR elements) use "copy and paste" via an RNA intermediate — the original stays in place while a copy inserts elsewhere. Class II (DNA transposons) use "cut and paste" via a DNA intermediate. This is a commonly confused distinction — reversing them is a classic exam trap.
Q42 Medium
A genomic inflation factor (λGC) of 1.00 in a GWAS indicates:
A. The study detected many true associations
B. Uncorrected population stratification
C. Overcorrection for population structure
D. No systematic inflation — proper control of confounders
Explanation
λGC ≈ 1.00 is the ideal scenario: observed test statistics match the expected null distribution. λ > 1 indicates inflation (possible stratification, false positives). λ < 1 suggests overcorrection (too many covariates, risking false negatives). The QQ-plot provides the visual counterpart to this numeric assessment.
Q43 Medium
In the FastQC "Per base sequence content" module, all four bases showing approximately equal frequencies at every position suggests:
A. Adapter contamination
B. Random fragmentation — a good-quality WGS library
C. Restriction enzyme digestion
D. PCR duplicate artifacts
Explanation
Random fragmentation produces roughly equal proportions of A, T, G, C at every position along the read — expected for a good WGS library. If specific bases dominate at certain positions (e.g., T always at position 1), it suggests restriction enzyme digestion was used. Understanding the library prep method is key to interpreting FastQC output correctly.
Q44 Medium
The additive genetic model in GWAS codes genotypes as:
A. AA = 1, Aa = 1, aa = 0 (dominant model)
B. AA = 1, Aa = 0, aa = −1
C. 0, 1, or 2 copies of the minor allele
D. Genotypes are not coded numerically
Explanation
The additive model — the most commonly used in GWAS — counts minor allele copies: 0 (homozygous major), 1 (heterozygous), 2 (homozygous minor). Linear regression then tests whether the phenotype changes with each additional copy of the minor allele.
Q45 Medium
RepeatMasker is used for:
A. Identifying and masking repetitive elements in a genome assembly
B. Predicting gene structures using HMMs
C. Evaluating assembly completeness
D. Visualizing read alignments in a genome browser
Explanation
RepeatMasker identifies repetitive elements by comparing the genome against databases (Dfam, Repbase). Repeats can be hard-masked (replaced with Ns) or soft-masked (converted to lowercase). Masking repeats before gene prediction prevents false gene predictions in repetitive regions. AUGUSTUS is for gene prediction, BUSCO for completeness, and IGV for visualization.
Q46 — Open Calculation
You sequence a genome of 2.5 Gbp using Illumina paired-end 150 bp reads. You obtain 500 million reads. Calculate the sequencing depth. Is this sufficient for robust SNP detection (minimum ~10×)?
✓ Model Answer
Depth = (N × L) / G
N = 500,000,000 reads; L = 150 bp; G = 2,500,000,000 bp
Depth = (500,000,000 × 150) / 2,500,000,000
= 75,000,000,000 / 2,500,000,000 = 30×

Yes, 30× exceeds the recommended minimum of ~10× for robust SNP detection. This depth provides high confidence for variant calling and genotyping.
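Since this formula comes up constantly, here's a throwaway Python sketch of it, in both directions (function names are mine):

```python
def sequencing_depth(n_reads: int, read_len: int, genome_size: int) -> float:
    """Depth = (N * L) / G."""
    return n_reads * read_len / genome_size

def reads_needed(depth: float, genome_size: int, read_len: int) -> float:
    """The same formula rearranged: N = (Depth * G) / L."""
    return depth * genome_size / read_len

print(sequencing_depth(500_000_000, 150, 2_500_000_000))  # 30.0
```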

Q47 — Open Calculation
Given the following contig lengths (in kb): 120, 90, 80, 70, 50, 40, 30, 20. Calculate the N50 value.
✓ Model Answer
Step 1: Sort contigs from longest to shortest: 120, 90, 80, 70, 50, 40, 30, 20
Step 2: Total assembly size = 120 + 90 + 80 + 70 + 50 + 40 + 30 + 20 = 500 kb
Step 3: Half of total = 250 kb
Step 4: Cumulative sum from longest:
120 → cumulative = 120 (below 250)
120 + 90 = 210 (below 250)
210 + 80 = 290 (exceeds 250)
N50 = 80 kb

The N50 is the length of the contig that, when added, causes the cumulative sum to cross 50% of the total assembly size. Note: N50 measures contiguity, not correctness — a high N50 does not guarantee an error-free assembly.
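The cumulative-sum procedure above is easy to script; a minimal Python sketch (function name is mine):

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative sum (longest first)
    first reaches half the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    cumulative = 0
    for length in lengths:
        cumulative += length
        if cumulative >= half:
            return length

print(n50([120, 90, 80, 70, 50, 40, 30, 20]))  # 80
```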

Q48 — Open Short Answer
Describe the bisulfite sequencing strategy. What chemical conversion occurs, and how does it allow detection of methylated cytosines?
✓ Model Answer

Bisulfite sequencing works by treating genomic DNA with sodium bisulfite, which converts unmethylated cytosines to uracil (read as thymine after PCR amplification). Methylated cytosines (5-methylcytosine) are protected from this conversion and remain as C.

After sequencing, the reads are aligned to the reference genome. At each cytosine position: if the read shows C → the position was methylated; if it shows T → the position was unmethylated. This provides single-base resolution methylation mapping.

A major analytical challenge is distinguishing bisulfite-induced C→T conversions from genuine C→T SNPs in the genome. About 98% of methylation in the human genome occurs at CpG dinucleotides. CpG islands (regions dense in CpG sites) near gene promoters are of particular interest as their methylation status often regulates gene expression.
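An in-silico toy version of the conversion makes the logic concrete (a sketch; the function and the way methylated positions are passed in are my own simplification — real data obviously arrives as reads, not annotated strings):

```python
def bisulfite_convert(seq: str, methylated_positions: set) -> str:
    """Unmethylated C -> U (read as T after PCR); methylated C is protected."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )

# C at index 1 is methylated (stays C); C at index 4 is not (becomes T)
print(bisulfite_convert("ACGTCG", {1}))  # ACGTTG
```

Comparing the converted read back to the reference is exactly how the C (methylated) vs. T (unmethylated) call is made at each cytosine.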

Q49 — Open Short Answer
How can genome size be estimated before sequencing? Describe two approaches.
✓ Model Answer

1. C-value (Flow Cytometry): The C-value is the amount of DNA in picograms (pg) in a haploid genome. It is measured using flow cytometry or Feulgen densitometry, typically by comparing staining intensity to a reference species with known genome size. The conversion formula is: Genome size (bp) = C-value (pg) × 0.978 × 10⁹. For example, a C-value of 2.0 pg gives approximately 1.96 Gbp.

2. K-mer frequency analysis: After sequencing, reads are decomposed into K-mers and their frequency distribution is plotted. The genome size is estimated by: Genome size = Total number of K-mers (area under the curve) / Average K-mer coverage (position of the main peak). The distribution typically shows three features: a left peak of low-frequency K-mers (sequencing errors), a main peak (true genomic K-mers at average coverage), and a right tail of high-frequency K-mers (repetitive regions).

Q50 — Open Short Answer
Describe and draw a Manhattan plot. What information does it display, what do the axes represent, and how do you identify significant associations?
✓ Model Answer

A Manhattan plot is the standard visualization of GWAS results. It displays all tested SNPs across the genome:

X-axis: Genomic position — SNPs are plotted by their physical location, ordered by chromosome. Each chromosome is shown in a different color.

Y-axis: −log₁₀(p-value) — the negative log-transformed p-value of each SNP-trait association. This transformation makes more significant associations appear as taller points (a p-value of 10⁻⁸ appears as 8 on the Y-axis).

Significance threshold: A horizontal line at −log₁₀(5 × 10⁻⁸) ≈ 7.3 marks the genome-wide significance threshold. SNPs above this line are considered significantly associated.

Interpretation: True associations appear as "peaks" — clusters of linked SNPs (in LD) rising above the background. The peak shape reflects LD structure: the top SNP has the strongest signal, and nearby correlated SNPs form a hill. An isolated single SNP above the threshold (without supporting nearby SNPs) is suspicious and may be a false positive due to genotyping errors.

[Drawing: a scatter plot with chromosomes along the X-axis separated by alternating colors, dots scattered at low Y values (1–4), with one or more sharp peaks exceeding the horizontal significance line around Y = 7.3]

Q51 — Open Calculation
In a population of 1000 individuals, you observe: 500 AA, 200 Aa, 300 aa. Calculate allele frequencies, expected genotype counts under HWE, and determine whether this population is in Hardy-Weinberg Equilibrium.
✓ Model Answer
Total individuals = 1000; Total alleles = 2000
Allele A count: (2 × 500) + (1 × 200) = 1200
p = freq(A) = 1200 / 2000 = 0.6
q = freq(a) = 1 − 0.6 = 0.4
Expected under HWE:
AA: p² × 1000 = 0.36 × 1000 = 360
Aa: 2pq × 1000 = 0.48 × 1000 = 480
aa: q² × 1000 = 0.16 × 1000 = 160
Chi-squared test:
χ² = (500−360)²/360 + (200−480)²/480 + (300−160)²/160
= 19600/360 + 78400/480 + 19600/160
= 54.44 + 163.33 + 122.50 = 340.28

With 1 degree of freedom, the critical value at α = 0.05 is 3.84. Since 340.28 >> 3.84, the population is not in Hardy-Weinberg Equilibrium. There is a large excess of homozygotes and a deficit of heterozygotes, suggesting non-random mating, selection, or population substructure.
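The whole calculation fits in a short Python function (a sketch, names mine; it returns the χ² statistic to compare against the 3.84 critical value):

```python
def hwe_chi2(n_AA: int, n_Aa: int, n_aa: int) -> float:
    """Chi-squared statistic for departure from Hardy-Weinberg Equilibrium."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)      # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(hwe_chi2(500, 200, 300), 2))  # 340.28
```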

Q52 — Open Short Answer
What is the primary purpose of ChIP-seq? Describe its main experimental steps and explain how protein-DNA binding sites are identified from the data.
✓ Model Answer

Purpose: ChIP-seq identifies genome-wide binding sites of proteins (transcription factors, histones, etc.) to DNA.

Steps:

1. Crosslinking: Formaldehyde covalently links proteins to the DNA they are bound to in vivo.

2. Fragmentation: Chromatin is sheared into small fragments (~200–500 bp) by sonication or enzymatic digestion.

3. Immunoprecipitation: An antibody specific to the target protein pulls down protein-DNA complexes.

4. Reverse crosslinking & purification: The crosslinks are reversed and DNA is purified.

5. Sequencing: The enriched DNA fragments are sequenced using NGS.

Identifying binding sites: Reads are aligned to the reference genome. Regions with significantly more reads than the background (input control) form "peaks." Peak-calling algorithms (e.g., MACS2) identify these enriched regions as binding sites. The height and shape of peaks indicate binding strength and precision.

Q53 — Open Calculation
You want to sequence a 1.2 Gbp genome at 60× coverage using 150 bp reads. How many reads do you need?
✓ Model Answer
Coverage = (N × L) / G → N = (Coverage × G) / L
N = (60 × 1,200,000,000) / 150
= 72,000,000,000 / 150
= 480,000,000 reads

You need approximately 480 million reads of 150 bp each to achieve 60× coverage of a 1.2 Gbp genome.

Q54 — Open Short Answer
Describe the FastQC "Per base sequence quality" module. What does the boxplot at each position represent, and when should trimming be applied?
✓ Model Answer

The "Per base sequence quality" module shows quality score distributions at each position along the read. At each position, a boxplot displays:

- The median quality score (central line)

- The interquartile range (IQR, the box: 25th–75th percentile)

- The 10th and 90th percentiles (whiskers)

- The mean quality (blue line)

The background is color-coded: green (good, Q ≥ 28), yellow (acceptable, Q 20–28), and red (poor, Q < 20).

When to trim: Trimming should be applied when quality scores drop into the yellow or red zones, which typically occurs toward the 3' end of reads. A sliding window approach (e.g., with Trimmomatic) calculates the average quality within a window and trims when it falls below a threshold (e.g., Q20). After trimming, FastQC should be re-run to confirm improvement. Reads shorter than a minimum length (e.g., 25 bp) should be discarded entirely.
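A stripped-down version of the sliding-window idea, in Python (a sketch, not Trimmomatic's actual implementation; function name and the toy quality values are mine):

```python
def sliding_window_trim(quals, window=4, threshold=20):
    """Cut the read at the start of the first window whose mean quality
    falls below the threshold; keep everything before it."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < threshold:
            return quals[:start]
    return quals

# Typical Illumina pattern: quality decays toward the 3' end
quals = [35, 34, 33, 30, 28, 15, 12, 10, 8, 5]
print(len(sliding_window_trim(quals)))  # 4 bases survive
```

After trimming, a minimum-length filter (e.g., discard reads under 25 bp) would be applied to the survivors.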

Q55 — Open Short Answer
Describe the K-mer frequency distribution graph. What are the three main regions visible, and what does each represent?
✓ Model Answer

A K-mer frequency distribution plots K-mer frequency (X-axis) against the number of distinct K-mers at that frequency (Y-axis). Three main regions are visible:

1. Left peak (low frequencies, e.g., 1–5×): Represents K-mers caused by sequencing errors. Errors create unique, erroneous K-mers that appear only once or a few times. These should be discarded before assembly.

2. Main peak (moderate frequency): Represents true genomic K-mers. The position of this peak corresponds to the average sequencing depth. For example, a peak at 30× means each genomic K-mer was sequenced approximately 30 times.

3. Right tail (high frequencies, extending well beyond the main peak): Represents K-mers from repetitive regions. Repeats occur multiple times in the genome, so their K-mers appear at multiples of the average coverage. A prominent right tail indicates high repeat content, which will complicate assembly.

Genome size is estimated by: Total K-mers (area under the curve, excluding error peak) / Main peak position.
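A minimal Python sketch of that estimate, assuming the histogram is already computed by a k-mer counter (the function name, the dict representation, and the simple error cutoff are my own simplifications):

```python
def estimate_genome_size(kmer_hist: dict, error_cutoff: int = 5) -> float:
    """kmer_hist maps k-mer frequency -> number of distinct k-mers at that
    frequency. Genome size ~= total k-mers (error peak excluded) divided by
    the main peak position (average k-mer coverage)."""
    filtered = {f: n for f, n in kmer_hist.items() if f > error_cutoff}
    total_kmers = sum(f * n for f, n in filtered.items())
    peak_coverage = max(filtered, key=filtered.get)
    return total_kmers / peak_coverage

# Toy histogram: error k-mers at 1-2x, main peak at 30x
toy = {1: 5_000_000, 2: 1_200_000, 29: 800, 30: 1_000, 31: 900}
print(round(estimate_genome_size(toy)))
```

In practice tools like GenomeScope fit a model to the full distribution rather than using a hard cutoff, but the arithmetic is the same.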

Q56 — Open Short Answer
Describe the complete variant discovery pipeline from raw reads to annotated variants. For each step, name the input format, output format, and one commonly used tool.
✓ Model Answer

1. Quality Control & Trimming: Input: FASTQ → Tool: FastQC (QC), Trimmomatic (trimming) → Output: cleaned FASTQ. Assess per-base quality, GC content, duplications. Trim low-quality ends and remove short reads.

2. Alignment: Input: cleaned FASTQ + reference FASTA → Tool: BWA-MEM → Output: SAM/BAM. Map reads to reference genome. Post-alignment: sort, index, and remove PCR duplicates (Picard). Filter by mapping quality (MAPQ).

3. Variant Calling: Input: filtered BAM → Tool: GATK HaplotypeCaller → Output: VCF. Identify SNPs and indels at each position. Joint calling across samples is preferred for population studies.

4. Variant Annotation: Input: VCF → Tool: Ensembl VEP or SnpEff → Output: annotated VCF. Determine effect of each variant (missense, synonymous, intronic, splice site, TFBS gain/loss). Provide functional impact predictions (SIFT, PolyPhen-2).

Quality control and filtering occur between every step — this iterative QC cycle is essential for reliable results.

Q57 — Open Short Answer
Explain why a GWAS-significant SNP is usually not the causal variant. What is "indirect association," and what steps follow a GWAS to identify the true causal variant?
✓ Model Answer

GWAS uses tag SNPs on genotyping arrays, which are representative markers for LD blocks. The detected SNP is typically in LD with the true causal variant rather than being causal itself — this is called indirect association. The causal variant may not be on the array at all.

Post-GWAS steps:

1. Fine-mapping: Examine LD structure (r², D′) around the peak to narrow the candidate region and prioritize variants most likely to be causal.

2. Gene annotation: Use BEDTools intersect to identify genes within a defined window (e.g., 0.5 Mb) of the top SNPs. Consult databases like GeneCards and the GWAS Catalog.

3. Functional enrichment: Apply ORA (DAVID, EnrichR) to test whether candidate genes are enriched for specific pathways.

4. Replication: Validate findings in an independent cohort with the same phenotype definition.

5. Functional validation: Experimental studies (gene expression analysis, knockouts, reporter assays) to confirm the causal role of the candidate variant.

📝 Exam Simulation — Version B

📝 Applied Genomics — Final Exam Simulation B
Q1 Easy
What does "equimolar DNA pool" mean?
A. DNA pooled from individuals of the same species only
B. DNA mixed with equal volumes regardless of concentration
C. DNA pooled with equal amounts from each individual
D. DNA fragmented into equal-length pieces
Explanation
Equimolar means equal molar quantities of DNA from each individual. This ensures equal genomic representation in the pool so that allele frequencies estimated from read counts are proportional to the true population frequencies. If one individual contributes more DNA, its genome is over-represented and biases frequency estimation.
Q2 Medium
In a VCF file, the genotype notation "1|2" indicates:
A. Homozygous for the first alternative allele
B. Unphased heterozygous genotype
C. Missing genotype data
D. Phased heterozygous with two different alternative alleles
Explanation
The pipe "|" indicates phased data (as opposed to "/" for unphased). "0" = reference allele, "1" = first alternative, "2" = second alternative. So "1|2" means one chromosome carries ALT1 and the other carries ALT2 — a phased heterozygous genotype with two different alternative alleles.
Q3 Easy
Which sequencing technology uses fluorescently labeled reversible terminators?
A. Ion Torrent
B. Illumina
C. PacBio
D. Sanger
Explanation
Illumina sequencing by synthesis uses fluorescently labeled reversible terminators. Each nucleotide has a fluorescent dye and a 3' blocking group. After incorporation, the cluster is imaged, then the dye and block are cleaved. Sanger uses irreversible dideoxy terminators; Ion Torrent detects H⁺ ions; PacBio uses real-time detection of fluorescent nucleotides.
Q4 Medium
What is the main advantage of mate-pair libraries over standard paired-end libraries?
A. Larger insert sizes (up to ~5 kb), helping scaffold across repeats
B. Higher sequencing accuracy
C. Lower DNA input requirement
D. Simpler library preparation protocol
Explanation
Mate-pair libraries use biotinylation and circularization of large fragments (up to ~5 kb) to sequence the ends of distant genomic regions. This larger insert size helps link contigs across repetitive regions during genome assembly. Standard paired-end libraries have inserts of ~200–800 bp. The mate-pair protocol is actually more complex, not simpler.
Q5 Medium
The FLAG field in a SAM file is found in:
A. Column 1
B. Column 2
C. Column 6
D. Column 10
Explanation
In SAM format: column 1 = read name (QNAME), column 2 = FLAG (bitwise flags indicating properties like primary/secondary alignment, unmapped, mate unmapped, strand), column 3 = reference name, column 4 = position, column 5 = MAPQ, column 6 = CIGAR, etc. The FLAG field encodes multiple properties as a single integer using bit flags.
Q6 Easy
BUSCO is used to evaluate:
A. Sequencing error rate
B. Read quality scores
C. Genome assembly completeness
D. Population genetic diversity
Explanation
BUSCO (Benchmarking Universal Single-Copy Orthologs) evaluates genome assembly completeness by searching for conserved single-copy genes expected to be present in all organisms of a given lineage. It reports percentages of complete, fragmented, duplicated, and missing orthologs. It is one of the most frequently asked topics on the exam.
Q7 Medium
In a de Bruijn graph used for genome assembly, vertices represent:
A. Complete sequencing reads
B. K-mers themselves
C. Overlapping regions between reads
D. (k−1)-mers
Explanation
In a de Bruijn graph, vertices are (k−1)-mers and edges are k-mers. A k-mer connects vertex X (its prefix of length k−1) to vertex Y (its suffix of length k−1) with a directed edge. The graph is then traversed using an Eulerian path (visiting each edge exactly once) to reconstruct the sequence.
Q8 Medium
Why should duplicate reads be removed before variant calling?
A. They can reinforce sequencing errors as real variants
B. They increase the file size beyond storage limits
C. They cause the reference genome index to fail
D. They originate from mitochondrial contamination
Explanation
Duplicate reads arise from PCR amplification during library preparation. If a read containing a sequencing error is duplicated, the error appears in multiple reads and can be mistakenly called as a true variant. Removing duplicates ensures each molecule is counted once, giving accurate allele frequency estimates.
Q9 Easy
Which technique identifies genome-wide protein–DNA binding sites?
A. Bisulfite sequencing
B. ChIP-seq
C. RNA-seq
D. Exome sequencing
Explanation
ChIP-seq (Chromatin Immunoprecipitation sequencing) identifies DNA regions bound by specific proteins such as transcription factors. The DNA-protein complex is cross-linked, fragmented, immunoprecipitated with an antibody specific to the protein of interest, and the recovered DNA is sequenced. Bisulfite seq detects methylation; RNA-seq measures expression.
Q10 Easy
What does FST measure?
A. Individual inbreeding level
B. Sequencing error rate
C. Linkage disequilibrium decay
D. Genetic differentiation between populations
Explanation
FST (fixation index) is a measure of population differentiation based on allele frequency differences between subpopulations. FST = 0 means no genetic differentiation; FST = 1 means complete fixation of different alleles. It is a key statistic in population genomics for comparing pools or populations and detecting signatures of selection.
Q11 Medium
In GWAS, a genomic inflation factor (λGC) greater than 1 suggests:
A. Population stratification was not properly corrected
B. The sample size is too large
C. All SNPs are in linkage equilibrium
D. No significant associations exist
Explanation
Lambda GC is computed by comparing the median of observed test statistics to the expected chi-square distribution. λ > 1 indicates systematic inflation of test statistics, typically caused by unaccounted population structure (stratification). This must be corrected using methods like including principal components as covariates in the model.
Q12 Easy
What is the C-value?
A. The number of chromosomes in a cell
B. The GC content percentage of a genome
C. The mass of DNA in a haploid chromosome set
D. The coverage depth of sequencing data
Explanation
The C-value represents the amount (mass) of DNA in picograms contained in a haploid chromosome set. It is used to estimate genome size before sequencing by comparing with databases of known C-values from related species. This is measured using techniques like flow cytometry.
Q13 Easy
Oxford Nanopore sequencing detects nucleotides by measuring:
A. Fluorescent emissions from labeled nucleotides
B. Changes in ionic current as DNA passes through a pore
C. pH changes from hydrogen ion release
D. Light emitted during pyrophosphate cleavage
Explanation
Nanopore sequencing passes a single strand of DNA through a biological nanopore embedded in a membrane. As each nucleotide passes through, it causes a characteristic disruption in the ionic current flowing through the pore. This signal is decoded to determine the DNA sequence in real-time, enabling very long reads.
Q14 Easy
Why are highly inbred individuals preferred for de novo genome assembly?
A. They have more transposable elements
B. They produce more DNA per cell
C. They have larger genomes
D. High homozygosity makes read overlap easier
Explanation
Inbred individuals are nearly homozygous at all positions. In a heterozygous individual, reads from the two haplotypes may differ at SNP positions, making it difficult to overlap and connect them during assembly. Homozygous individuals have identical haplotypes, so reads overlap cleanly and contigs extend more easily.
Q15 Medium
The CIGAR string "5M2I8M" means:
A. 5 matches, 2 insertions in the read, 8 matches
B. 5 mismatches, 2 introns, 8 mismatches
C. 5 matches, 2 deletions in the read, 8 matches
D. 5 soft clips, 2 insertions, 8 soft clips
Explanation
CIGAR string operations: M = alignment match (can include both matches and mismatches), I = insertion in the read relative to reference, D = deletion from the reference, S = soft clipping. So "5M2I8M" = 5 aligned bases, then 2 extra bases in the read (insertion), then 8 more aligned bases. The read consumes 15 bases; the reference consumes 13.
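The read-versus-reference accounting follows mechanically from which operations consume which sequence (per the SAM spec); a small Python sketch (function name is mine):

```python
import re

def cigar_consumed(cigar: str):
    """Return (read_bases, reference_bases) consumed by a CIGAR string."""
    read_ops = set("MIS=X")  # operations that consume the read
    ref_ops = set("MDN=X")   # operations that consume the reference
    read = ref = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in read_ops:
            read += int(length)
        if op in ref_ops:
            ref += int(length)
    return read, ref

print(cigar_consumed("5M2I8M"))  # (15, 13)
```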
Q16 Easy
In pool sequencing (Pool-seq), allele frequency is estimated from:
A. The number of individuals in the pool
B. Gel electrophoresis band intensity
C. Read counts supporting each allele at a position
D. Individual genotype calls from GATK
Explanation
In pool sequencing, DNA from multiple individuals is mixed and sequenced together. At any polymorphic position, the proportion of reads carrying each allele approximates the allele frequency in the pool. For example, if 6 out of 10 reads show allele A, the estimated frequency of A is ~60%. This requires equimolar pooling and sufficient sequencing depth.
Q17 Medium
In a GWAS QQ plot, what does early deviation of observed from expected p-values indicate?
A. The Bonferroni threshold is too lenient
B. Systematic bias, likely from population stratification
C. The study has perfect statistical power
D. All tested SNPs are associated with the trait
Explanation
In a QQ plot, observed −log10(p) values are plotted against expected ones. Under the null hypothesis (no association), points should fall on the diagonal. If points deviate early (across the entire distribution), this indicates systematic inflation — typically from uncorrected population stratification. Late deviation only at the tail suggests true associations.
Q18 Medium
The biotin–streptavidin interaction is exploited in which procedures?
A. Sanger sequencing only
B. FastQC quality control
C. De Bruijn graph construction
D. Mate-pair library prep and exome capture
Explanation
Biotin–streptavidin binding is used in mate-pair libraries (biotinylated adapters mark circularized fragment junctions, then streptavidin pulldown selects these fragments) and in hybridization-based exome capture (biotinylated probes hybridize to exonic fragments, then streptavidin beads capture them). Both exploit the extremely strong and specific biotin–streptavidin bond.
Q19 Medium
A pyramid-shaped peak in a Manhattan plot is caused by:
A. LD decay around the causal variant
B. Sequencing errors at that locus
C. A repetitive element in that region
D. Random statistical noise
Explanation
The pyramid or "skyline" shape occurs because SNPs near the causal variant are in strong LD with it and show high significance, while SNPs farther away have decreasing LD and lower significance. This creates a peak that tapers off on both sides. The width of the peak reflects the extent of LD in that genomic region and population.
Q20 Medium
An Eulerian path through a de Bruijn graph exists if the graph contains:
A. No balanced vertices at all
B. Exactly four semibalanced vertices
C. At most two semibalanced vertices
D. Only vertices with in-degree of zero
Explanation
An Eulerian path visits every edge exactly once. It exists in a directed graph when at most two vertices are semibalanced (|in-degree − out-degree| = 1) and all other vertices are balanced (in-degree = out-degree). The two semibalanced vertices serve as start and end points of the path. This is the traversal algorithm used in de Bruijn graph–based assembly.
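The degree condition is simple to check once the graph is built from k-mers; a Python sketch (function name mine; it checks the degree condition only and ignores connectivity for simplicity):

```python
from collections import defaultdict

def has_eulerian_path(kmers) -> bool:
    """Vertices are (k-1)-mers, edges are k-mers. An Eulerian path requires
    at most two semibalanced vertices (|in - out| = 1), rest balanced."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for kmer in kmers:
        out_deg[kmer[:-1]] += 1   # prefix -> suffix edge
        in_deg[kmer[1:]] += 1
    vertices = set(in_deg) | set(out_deg)
    diffs = [abs(in_deg[v] - out_deg[v]) for v in vertices]
    return all(d <= 1 for d in diffs) and sum(d == 1 for d in diffs) <= 2

# 3-mers of "ACGTC" form a simple linear path
print(has_eulerian_path(["ACG", "CGT", "GTC"]))  # True
```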
Q21 Medium
High GC-content regions may have low coverage with Illumina. A potential solution is:
A. Using shorter k-mers in assembly
B. Supplementing with Nanopore sequencing
C. Removing those regions from analysis
D. Increasing the Phred quality threshold
Explanation
Illumina sequencing can fail in GC-rich regions due to PCR amplification bias during library prep. Nanopore sequencing is much less affected by GC content because it does not require PCR amplification and directly reads native DNA. Using a complementary technology resolves coverage gaps.
Q22 Easy
In the additive GWAS model, the genotype AA, AG, GG (where G is the minor allele) is coded as:
A. 0, 1, 2
B. 1, 2, 3
C. 0, 0, 1
D. −1, 0, 1
Explanation
The additive model codes genotypes by counting the number of minor allele copies: AA = 0 copies of G, AG = 1 copy, GG = 2 copies. This allows linear regression between genotype (0, 1, 2) and phenotype. Each additional copy of the minor allele is assumed to have an equal additive effect on the phenotype.
Q23 Easy
The Bonferroni correction in GWAS adjusts for:
A. Sequencing depth differences
B. Sample size imbalance
C. Linkage disequilibrium between SNPs
D. Multiple testing across many SNPs
Explanation
When testing thousands of SNPs simultaneously, some will appear significant by chance. Bonferroni divides the significance threshold (e.g., α = 0.05) by the number of tests. For 50,000 SNPs: 0.05/50,000 ≈ 10⁻⁶. This is conservative because it assumes tests are independent, while SNPs in LD are correlated. The genome-wide threshold of 5 × 10⁻⁸ is commonly used instead.
Q24 Medium
ROH islands (regions frequently in ROH across many individuals) suggest:
A. Genotyping errors at those loci
B. Random genetic drift only
C. Possible selection pressure favoring homozygosity
D. High sequencing coverage artifacts
Explanation
ROH islands are genomic regions where a high percentage of individuals in a population share runs of homozygosity. This non-random pattern suggests that being homozygous at those positions increases fitness. These regions often harbor genes under selection pressure and can be visualized in Manhattan-style plots showing ROH frequency per genomic position.
Q25 Medium
When genomic DNA is digested with a restriction enzyme, visible bands on a gel are caused by:
A. Coding exon sequences only
B. Repetitive elements producing many fragments of the same size
C. Complete chromosomes migrating together
D. RNA contamination in the sample
Explanation
Random restriction digestion produces a smear of fragment sizes. Visible bands appear because repetitive elements (which have the same sequence repeated throughout the genome) produce many fragments of identical size. In reduced representation library construction, these bands are deliberately avoided to prevent sequencing repetitive DNA.
Q26 Medium
Genotyping by sequencing (GBS) is characterized by:
A. Restriction digestion to sequence a reduced genome fraction across individuals
B. Whole genome sequencing at 30× coverage per individual
C. Using SNP arrays with pre-designed probes
D. Sequencing only mitochondrial DNA
Explanation
GBS uses restriction enzymes to select and sequence the same small fraction of the genome across many individuals. It is cost-effective (~€20–30/sample), does not require a pre-existing SNP chip, and can even work without a reference genome. The sequenced fraction is determined by the restriction enzyme used, and SNPs are identified by comparing sequences across individuals.
Q27 Easy
The CIGAR string is found in which column of a SAM file?
A. Column 2
B. Column 4
C. Column 10
D. Column 6
Explanation
SAM format mandatory columns: 1 = QNAME (read name), 2 = FLAG, 3 = RNAME (reference name), 4 = POS (position), 5 = MAPQ (mapping quality), 6 = CIGAR (alignment summary string), 7–9 = mate information, 10 = SEQ (sequence), 11 = QUAL (quality string).
Q28 Easy
A Phred quality score of Q20 means the base call has:
A. 1 in 10 chance of error (90% accuracy)
B. 1 in 100 chance of error (99% accuracy)
C. 1 in 1,000 chance of error (99.9% accuracy)
D. 1 in 10,000 chance of error (99.99% accuracy)
Explanation
Phred score Q = −10 × log₁₀(P), where P is the error probability. Q10 = 10% error, Q20 = 1% error, Q30 = 0.1% error, Q40 = 0.01% error. So Q20 means 99% accuracy, which is generally considered a minimum acceptable quality for many analyses.
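Inverting the Phred formula gives the error probability directly; a two-line Python sketch (function name mine):

```python
def phred_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P) to get P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

for q in (10, 20, 30, 40):
    print(q, phred_error_prob(q))  # 0.1, 0.01, 0.001, 0.0001
```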
Q29 Medium
In array CGH, a log₂ ratio below zero between test and reference DNA indicates:
A. A gain of copies in the test sample
B. Equal copy number in both samples
C. A loss (deletion) in the test sample
D. A sequencing error at that position
Explanation
aCGH compares hybridization intensity between test DNA and reference DNA. log₂(test/reference) = 0 means equal copies. log₂ < 0 means the test has fewer copies (loss/deletion). log₂ > 0 means the test has more copies (gain/duplication). The minimum resolution depends on the average probe spacing — at least 3 consecutive probes must show the same ratio.
Q30 Easy
What is the difference between structural and functional genome annotation?
A. Structural identifies gene locations; functional assigns biological roles
B. Structural uses RNA-seq; functional uses DNA-seq
C. They are two names for the same process
D. Structural annotates proteins; functional annotates DNA
Explanation
Structural annotation identifies the positions of genomic features — exons, introns, UTRs, promoters, genes — along the assembled sequence. Functional annotation then assigns biological functions to those features using tools like Gene Ontology, pathway databases, and homology searches. Both are essential steps after de novo genome assembly.
Q31 Medium
Using a smaller k-mer size in de Bruijn graph assembly:
A. Eliminates all sequencing errors
B. Increases the number of unique k-mers
C. Makes the graph impossible to traverse
D. Reduces the fraction of k-mers affected by a single error ✓
Explanation
If a read of length L has one error, a k-mer equal to L gives 100% error-containing k-mers. With smaller k, more k-mers are generated and only a subset contain the error position. For example, k=3 on a 10 bp read gives 8 k-mers, and at most 3 of them contain the error. This is why assemblers often test multiple k-mer sizes and combine results.
Q32 Easy
The Variant Effect Predictor (VEP) is used to:
A. Align reads to a reference genome
B. Predict the biological impact of detected variants ✓
C. Perform genome assembly from raw reads
D. Calculate population allele frequencies
Explanation
VEP (from Ensembl) annotates variants with their predicted biological consequences — e.g., synonymous, missense, stop-gain, splice site, intergenic. It requires matching the VCF chromosome naming convention with the annotation database. VEP and SnpEff are common tools for functional annotation of variants.
Q33 Easy
In SNP array genotyping, "call rate" refers to:
A. The speed of the genotyping instrument
B. The minor allele frequency threshold
C. The percentage of SNPs successfully genotyped ✓
D. The number of samples per chip
Explanation
Call rate is the fraction of SNPs for which a genotype could be reliably determined. A typical call rate is ~98%, meaning for a 10,000-SNP chip, about 9,800 SNPs are genotyped and ~200 fail. Failed calls appear as "0 0" (missing) in the data. Low call rates may indicate poor DNA quality or technical issues.
Q34 Easy
Inbreeding depression refers to:
A. Reduced fitness from increased homozygosity of deleterious alleles ✓
B. Higher heterozygosity in large populations
C. Increased mutation rate in inbred lines
D. Improved assembly quality from homozygosity
Explanation
Inbreeding increases the frequency of homozygous genotypes across the genome — including for deleterious recessive alleles that would normally be masked in the heterozygous state. When these alleles become homozygous, they reduce individual fitness (survival, reproduction). This population-level phenomenon is called inbreeding depression.
Q35 Easy
In a FASTQ file, the third line ("+") serves to:
A. Store the reference genome name
B. Indicate the strand of the read
C. Store alignment coordinates
D. Separate the sequence from quality scores ✓
Explanation
FASTQ format has 4 lines per read: line 1 = header starting with "@", line 2 = nucleotide sequence, line 3 = "+" separator (optionally followed by the header again), line 4 = ASCII-encoded quality scores (one character per base). The "+" line is simply a delimiter between sequence and quality data. FASTQ contains raw reads, not alignments.
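The 4-line structure makes parsing straightforward. Here is a minimal FASTQ reader sketch in Python (illustrative only; it assumes well-formed input and the common Phred+33 quality encoding):

```python
def read_fastq(lines):
    """Yield (name, sequence, qualities) from FASTQ lines, 4 lines per read."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()   # line 2: nucleotide sequence
        next(it)                 # line 3: the "+" separator, ignored
        qual = next(it).strip()  # line 4: one quality character per base
        # Phred+33 encoding: subtract 33 from each character's ASCII code
        yield header.strip()[1:], seq, [ord(c) - 33 for c in qual]

name, seq, quals = next(read_fastq(["@read1", "ACGT", "+", "IIII"]))
# 'I' is ASCII 73, so every base in this read has quality Q40
```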
Q36 Medium
A reference-guided genome assembly may introduce errors because:
A. It always requires PacBio reads
B. Structural rearrangements in the guide species may misplace contigs ✓
C. It cannot use Illumina data
D. It skips the annotation step
Explanation
In reference-guided assembly, the genome of a related species provides a scaffold for ordering contigs. However, if the guide species has structural rearrangements (inversions, translocations) relative to the target species, contigs may be placed in the wrong order or orientation. This saves computational time but introduces potential errors. De novo assembly avoids this but is more resource-intensive.
Q37 Medium
A SNP with minor allele frequency (MAF) of 0.5 is considered:
A. Monomorphic and uninformative
B. Rare and difficult to detect
C. Maximally informative for population studies ✓
D. Likely a sequencing artifact
Explanation
MAF = 0.5 means both alleles are equally frequent in the population. This provides maximum heterozygosity and thus maximum informativity for distinguishing individuals and detecting genetic associations. SNP arrays aim to include SNPs with high MAF (ideally ≥0.3) across target populations. A good average MAF for a genotyping panel is around 0.3.
Q38 Easy
In bisulfite sequencing, sodium bisulfite converts:
A. Unmethylated cytosine to uracil (read as thymine) ✓
B. Methylated cytosine to uracil
C. Adenine to guanine
D. Thymine to cytosine
Explanation
Sodium bisulfite treatment converts unmethylated cytosine → uracil → thymine (during PCR), while methylated (5-methylcytosine) remains unchanged as C. After sequencing and alignment, positions where C→T conversion occurred were unmethylated; positions retaining C were methylated. A challenge: distinguishing bisulfite-induced C→T from true C→T SNPs may require parallel genomic sequencing.
Q39 Medium
Long ROH segments in an individual's genome most likely indicate:
A. Ancient inbreeding many generations ago
B. An admixed population background
C. High sequencing error rate
D. Recent inbreeding (parents closely related) ✓
Explanation
Long ROH segments indicate recent inbreeding because there has been little time for recombination to break them down. Ancient inbreeding produces short ROH (fragments broken by many generations of crossing over). Admixed populations typically show very few ROH. The size distribution of ROH can reconstruct the genetic history of individuals and populations.
Q40 Medium
The minimum resolution of an aCGH system with probes spaced every 10 kb is approximately:
A. 10 kb
B. 30 kb ✓
C. 100 kb
D. 1 kb
Explanation
At least 3 consecutive probes must show the same log₂ ratio shift to reliably call a CNV (otherwise a single probe deviation could be an artifact). Therefore, minimum resolution ≈ 3 × average probe spacing. With probes every 10 kb: resolution ≈ 30 kb. CNVs smaller than 30 kb would be missed with this design.
Q41 Easy
PCR amplification of DNA before sequencing can introduce:
A. Longer read lengths
B. Higher quality scores
C. Amplification biases and duplicate artifacts ✓
D. Better genome coverage uniformity
Explanation
PCR amplification during library preparation can introduce errors through polymerase mistakes, create duplicate molecules from the same template, and preferentially amplify certain fragments (e.g., those with moderate GC content). This is why duplicates are marked/removed during analysis, and why PCR-free library protocols or technologies like Nanopore (no PCR needed) can be advantageous.
Q42 Easy
In a VCF file, the genotype "0/0" represents:
A. Homozygous reference ✓
B. Heterozygous
C. Missing genotype
D. Homozygous alternative
Explanation
VCF genotype encoding: 0 = reference allele, 1 = first alternative, 2 = second alternative. "/" means unphased, "|" means phased. So 0/0 = homozygous reference, 0/1 = heterozygous, 1/1 = homozygous alternative, ./. = missing data. Genotypes are encoded numerically, not with nucleotide letters.
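The encoding can be decoded mechanically. A small Python sketch (assuming diploid genotypes; names are illustrative, not from any VCF library):

```python
def decode_genotype(gt: str) -> str:
    """Classify a diploid VCF GT field such as '0/0', '0|1', or './.'."""
    alleles = gt.replace("|", "/").split("/")  # "|" = phased, "/" = unphased
    if "." in alleles:
        return "missing"
    a, b = alleles
    if a == b:
        return "homozygous reference" if a == "0" else "homozygous alternative"
    return "heterozygous"

print(decode_genotype("0/0"))  # homozygous reference
print(decode_genotype("0|1"))  # heterozygous
```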
Q43 Medium
The effective population size (Ne) primarily influences:
A. The sequencing error rate
B. The physical size of the genome
C. The cost of DNA extraction
D. The extent of linkage disequilibrium in a population ✓
Explanation
Effective population size determines the rate at which LD decays over generations. Large Ne (e.g., humans) = more recombination = rapid LD decay (a few kb), requiring denser SNP arrays. Small Ne (e.g., livestock breeds) = slower LD decay (~100 kb), requiring fewer SNPs. This directly affects the design and cost of genotyping tools.
Q44 Medium
A "tag SNP" in GWAS is useful because it:
A. Is always the causal mutation itself
B. Is in high LD with nearby ungenotyped variants ✓
C. Has the lowest MAF in the population
D. Is located only in coding regions
Explanation
Tag SNPs represent nearby variants through linkage disequilibrium. If a tag SNP and a causal variant have r² ≈ 1, genotyping the tag SNP captures the same information without genotyping the causal variant directly. This is the basis of indirect association in GWAS — the SNP array samples tag SNPs across the genome, and associated tags point to nearby causal regions for fine-mapping.
Q45 Easy
RepeatMasker is used to:
A. Detect single nucleotide variants
B. Assemble contigs into scaffolds
C. Identify and mask repetitive elements in a genome ✓
D. Predict protein structures from DNA
Explanation
RepeatMasker screens genome sequences for interspersed repeats (SINEs, LINEs, DNA transposons, etc.) and low-complexity regions, then replaces them with Ns or lowercase letters. This is essential before: (1) SNP calling — to avoid calling false variants in repetitive regions, (2) read mapping — to prevent multi-mapping artifacts, and (3) CNV detection — to avoid bias from repetitive elements.
Q46 — Open Calculation
A mammalian genome is 2.8 Gbp. You want 40× average coverage using 150 bp reads. How many reads do you need?
✓ Model Answer

Using the coverage formula rearranged to solve for number of reads:

Coverage = (N × L) / G → N = (Coverage × G) / L
N = (40 × 2,800,000,000) / 150
N = 112,000,000,000 / 150
N ≈ 746,666,667 reads ≈ 747 million reads

So approximately 747 million reads of 150 bp are needed to achieve 40× coverage of a 2.8 Gbp genome.
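The same rearranged formula as a small Python helper, for reuse with other genome sizes (rounded up to a whole number of reads):

```python
import math

def reads_needed(coverage: float, genome_bp: float, read_len: float) -> int:
    """N = (coverage * G) / L, rounded up to a whole number of reads."""
    return math.ceil(coverage * genome_bp / read_len)

print(reads_needed(40, 2.8e9, 150))  # ~747 million reads
```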

Q47 — Open Calculation
Given these contig lengths (in kb): 150, 100, 85, 65, 55, 35, 20, 15, 10. Calculate the N50.
✓ Model Answer

Step 1: Contigs are already sorted from largest to smallest: 150, 100, 85, 65, 55, 35, 20, 15, 10 kb.

Total assembly length = 150 + 100 + 85 + 65 + 55 + 35 + 20 + 15 + 10 = 535 kb
Half of total = 535 / 2 = 267.5 kb

Step 2: Cumulative sum from largest:

150 → cumulative = 150 (< 267.5)
150 + 100 = 250 → cumulative = 250 (< 267.5)
250 + 85 = 335 → cumulative = 335 (≥ 267.5) ✓

N50 = 85 kb — the contig that crosses the 50% cumulative threshold.
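The same procedure as a small Python function (sort descending, accumulate, return the first contig crossing half the total):

```python
def n50(contig_lengths):
    """Return the N50: the length at which the cumulative sum of contigs,
    taken from largest to smallest, first reaches half the assembly length."""
    half = sum(contig_lengths) / 2
    cumulative = 0
    for length in sorted(contig_lengths, reverse=True):
        cumulative += length
        if cumulative >= half:
            return length

print(n50([150, 100, 85, 65, 55, 35, 20, 15, 10]))  # 85
```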

Q48 — Open Short Answer
Describe the de Bruijn graph approach for genome assembly. Include: what k-mers are, how the graph is built, and how the sequence is reconstructed.
✓ Model Answer

K-mers: Substrings of fixed length k extracted by sliding a window across each read. For example, the sequence ATGCG with k=3 produces: ATG, TGC, GCG.

Graph construction: (1) Extract all k-mers from reads and retain unique ones. (2) Vertices represent (k−1)-mers (prefixes and suffixes of k-mers). (3) Edges represent k-mers — each k-mer connects its prefix vertex to its suffix vertex with a directed edge.

Sequence reconstruction: The graph is traversed using an Eulerian path, which visits every edge exactly once. This requires at most two semibalanced vertices (|in-degree − out-degree| = 1). The sequence is reconstructed by concatenating the vertices along the path.

Challenges: Sequencing errors create false k-mers; repetitive elements cause ambiguous paths (multiple valid Eulerian paths). Using multiple k-mer sizes and paired-end data helps resolve these issues.
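The construction step above can be sketched in Python (a toy version that ignores reverse complements and error filtering):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes of its k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return graph

g = de_bruijn(["ATGCG"], k=3)
# k-mers ATG, TGC, GCG become edges AT->TG, TG->GC, GC->CG
```

A real assembler would then search this structure for an Eulerian path; here the single chain AT→TG→GC→CG spells back ATGCG.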

Q49 — Open Short Answer
What is pool sequencing (Pool-seq)? Explain how it works, what information it provides, and why it is cost-effective.
✓ Model Answer

Definition: Pool sequencing involves mixing equimolar DNA from multiple individuals into a single pool and sequencing the pool together, rather than sequencing each individual separately.

How it works: DNA is extracted from each individual, quantified, and combined in equal amounts (equimolar pooling). The pooled DNA is then used for library preparation and sequenced. Reads map randomly to the reference genome. At any polymorphic position, the proportion of reads carrying each allele approximates the allele frequency in the pooled population.

Information provided: Pool-seq gives population-level allele frequencies at each variant position, not individual genotypes. This allows detection of variants, estimation of allele frequencies, and comparison of frequencies between populations (e.g., using FST).

Cost-effectiveness: Instead of sequencing N individuals separately (cost = N × per-sample cost), only one pooled library is sequenced. For example, comparing two populations of 50 individuals each requires 2 sequencing runs instead of 100. This dramatically reduces cost while preserving population-level variant information.

Q50 — Open Short Answer
Describe the RNA-seq technique. Include: what is sequenced, the two main library preparation strategies (random priming vs. poly-A selection), and two applications.
✓ Model Answer

What is sequenced: RNA (transcripts) is extracted, converted to cDNA, and sequenced. RNA-seq captures the transcriptome — all RNA molecules expressed at the time of sampling.

Library preparation strategies:

(1) Poly-A selection: Mature mRNAs have a poly-A tail. Probes with poly-T sequences capture these mRNAs specifically. This enriches for protein-coding transcripts and excludes rRNA/tRNA.

(2) Random priming: RNA is fragmented and random hexamer primers are used for cDNA synthesis. This captures a broader range of RNAs but may include unwanted rRNA.

Applications:

(1) Gene expression quantification: Comparing transcript abundance between conditions (e.g., healthy vs. diseased tissue) to identify differentially expressed genes.

(2) Genome annotation: RNA-seq data provides evidence of transcribed regions, helping identify gene structures, exon boundaries, and transcript isoforms (alternative splicing) during functional annotation of a new genome assembly.

Q51 — Open Calculation
In a population of 800 individuals, the observed genotype counts are: 320 GG, 400 Gg, 80 gg. Test whether this population is in Hardy–Weinberg equilibrium.
✓ Model Answer
Total individuals = 800, Total alleles = 1,600

Step 1 — Allele frequencies:

G alleles = (2 × 320) + 400 = 1,040 → p = 1,040/1,600 = 0.65
g alleles = (2 × 80) + 400 = 560 → q = 560/1,600 = 0.35

Step 2 — Expected genotype frequencies and counts:

p² = 0.4225 → Expected GG = 0.4225 × 800 = 338
2pq = 0.455 → Expected Gg = 0.455 × 800 = 364
q² = 0.1225 → Expected gg = 0.1225 × 800 = 98

Step 3 — Chi-squared test:

χ² = (320−338)²/338 + (400−364)²/364 + (80−98)²/98
χ² = 324/338 + 1,296/364 + 324/98
χ² = 0.96 + 3.56 + 3.31 = 7.83
Critical value (df=1, α=0.05) = 3.84

Conclusion: χ² = 7.83 > 3.84 → Reject H₀. The population is NOT in Hardy–Weinberg equilibrium. There is an excess of heterozygotes compared to expectations, which could indicate balancing selection or recent admixture.
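The same test as a reusable Python function (biallelic locus, df = 1; parameter names are illustrative):

```python
def hwe_chi2(n_hom_ref, n_het, n_hom_alt):
    """Chi-squared statistic comparing observed genotype counts with
    Hardy-Weinberg expectations (p^2, 2pq, q^2) scaled by N."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)  # frequency of the first allele
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_hom_ref, n_het, n_hom_alt)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(hwe_chi2(320, 400, 80), 2))  # 7.83 > 3.84 -> reject H0
```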

Q52 — Open Short Answer
Explain the concept of whole exome sequencing (WES). Describe the hybridization capture method used and why WES is cost-effective compared to whole genome sequencing.
✓ Model Answer

WES concept: Whole exome sequencing targets only the protein-coding regions (exons) of the genome, which constitute approximately 2% of the human genome (~45 Mb vs. ~3 Gb).

Hybridization capture method: (1) Genomic DNA is fragmented. (2) Biotinylated probes complementary to all known exonic sequences are hybridized to the fragments. (3) Streptavidin-coated beads capture biotinylated probe–fragment complexes. (4) Non-exonic fragments are washed away. (5) Captured exonic fragments are eluted and sequenced.

Cost-effectiveness: By sequencing only ~2% of the genome, WES generates much smaller FASTQ files (~45 Gb vs. ~90 Gb for WGS), requires less sequencing output, and enables faster bioinformatic analysis. The trade-off is that regulatory variants in non-coding regions (introns, intergenic regions) are missed. WES is ideal when the hypothesis is that causal variants alter protein-coding sequences.

Q53 — Open Short Answer
Describe the k-mer frequency distribution plot. Explain the three typical regions (left peak, main peak, right tail) and what each represents biologically.
✓ Model Answer

A k-mer frequency distribution plots k-mer multiplicity (x-axis) against the number of distinct k-mers with that multiplicity (y-axis).

Region 1 — Left peak (frequency = 1): K-mers appearing only once. These are mostly derived from sequencing errors — a single nucleotide error creates a k-mer unique to that read. This peak should be excluded when estimating genome size.

Region 2 — Main peak (central): K-mers appearing at the expected coverage depth. These represent unique genomic sequences. The position of this peak corresponds to the average sequencing depth. Genome size can be estimated as: G = (total k-mers under the curve) / (mean coverage from the main peak).

Region 3 — Right tail (high frequency): K-mers appearing much more frequently than average. These derive from repetitive elements (SINEs, LINEs, transposons) that are present in many copies throughout the genome. The height and extent of this tail reflects the repeat content of the genome.
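Once k-mers are counted, the histogram itself is trivial to compute. A toy Python sketch (error correction and canonical k-mers omitted):

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count k-mers across all reads, then tally how many distinct
    k-mers occur at each multiplicity (the x/y pairs of the plot)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())  # multiplicity -> number of distinct k-mers

spectrum = kmer_spectrum(["ATGATG"], k=3)
# ATG occurs twice; TGA and GAT once each:
# 2 distinct k-mers at multiplicity 1, 1 distinct k-mer at multiplicity 2
```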

Q54 — Open Calculation
You sequenced 300 million reads of 100 bp from a genome estimated to be 1.5 Gbp. Calculate the average sequencing depth (coverage).
✓ Model Answer
Coverage = (N × L) / G
N = 300,000,000 reads
L = 100 bp
G = 1,500,000,000 bp
Coverage = (300,000,000 × 100) / 1,500,000,000
Coverage = 30,000,000,000 / 1,500,000,000 = 20×

The average depth of coverage is 20×, meaning each position in the genome is covered by ~20 reads on average.

Q55 — Open Short Answer
What is population stratification in GWAS? Explain how it causes false positives and how multidimensional scaling (MDS) or PCA is used to address it.
✓ Model Answer

Population stratification: When a GWAS sample includes individuals from genetically distinct subpopulations that also differ in phenotype, allele frequency differences between subpopulations are confounded with phenotype differences — creating false positive associations.

Example: If wild mice (low body weight, genotype profile A) and laboratory strains (high body weight, genotype profile B) are analyzed together, nearly every SNP differing between strains appears associated with body weight — not because those SNPs cause weight differences, but because they track population membership.

MDS/PCA solution: MDS or PCA compresses genome-wide genotype data into a few principal components that capture population structure. Each individual gets component scores. Individuals from the same subpopulation cluster together. These components are then included as covariates in the GWAS statistical model to correct for ancestry differences. After correction, only true genotype–phenotype associations remain significant.

Verification: The QQ plot and λGC metric are used to confirm that stratification has been properly corrected (λ ≈ 1 indicates no inflation).

Q56 — Open Short Answer
Describe the complete variant discovery pipeline from FASTQ files to annotated VCF. List each major step, the file format at each stage, and one key tool for each step.
✓ Model Answer

Step 1 — Quality Control: Raw reads (FASTQ) → Quality assessment with FastQC → Evaluate per-base quality, adapter content, GC distribution.

Step 2 — Trimming: FASTQ → Trimmed FASTQ using Trimmomatic or fastp → Remove low-quality bases and adapter sequences. Re-run FastQC to confirm improvement.

Step 3 — Alignment: Trimmed FASTQ + Reference genome (FASTA) → SAM file using BWA (Burrows-Wheeler Aligner) → Reads are mapped to the reference genome. Convert SAM → BAM (binary, compressed) and sort.

Step 4 — Duplicate Removal: Sorted BAM → Deduplicated BAM using Picard MarkDuplicates → Remove PCR duplicates that could bias variant calling.

Step 5 — Variant Calling: Deduplicated BAM → VCF file using GATK HaplotypeCaller → Identify SNPs and indels. Each variant line includes chromosome, position, REF, ALT, quality score, and sample genotypes.

Step 6 — Variant Annotation: VCF → Annotated VCF using VEP or SnpEff → Each variant is annotated with its predicted biological effect (synonymous, missense, stop-gain, splice-site, etc.) and compared to known variant databases (e.g., dbSNP).

Q57 — Open Short Answer
Explain how a Manhattan plot is constructed and interpreted in a GWAS. What do the axes represent? What does a "peak" indicate? What determines the significance threshold line?
✓ Model Answer

Axes: The x-axis represents genomic position, with SNPs ordered along each chromosome (chromosomes are shown in alternating colors). The y-axis represents −log₁₀(p-value) from the association test between each SNP and the phenotype. Higher values mean stronger statistical significance.

Construction: For each genotyped SNP, a statistical test (e.g., linear regression with additive model: phenotype ~ genotype + covariates) produces a p-value. Each SNP is plotted as a dot at its chromosomal position (x) and its −log₁₀(p) value (y).

Peaks: A "peak" or cluster of highly significant SNPs indicates a genomic region associated with the trait. The pyramid shape occurs because the causal variant and its neighbors in strong LD all show elevated significance, tapering off as LD decays with distance. The peak width reflects the extent of LD in that region.

Significance threshold: A horizontal line marks the genome-wide significance threshold. Using Bonferroni correction: α/number_of_tests. The conventional threshold is 5 × 10⁻⁸ (−log₁₀ ≈ 7.3), which accounts for approximately 1 million independent tests across the genome. SNPs above this line are considered genome-wide significant associations.

Important note: Associated SNPs are usually tag SNPs in LD with the true causal variant — they are not necessarily the causal mutation itself. Fine-mapping is needed to identify the actual causal variant within the associated region.

License

Contributors

A big shout-out to everyone who has contributed to these notes!

  • Mahmoud - mahmoud.ninja - Creator and primary maintainer
  • Vittorio - Contributions and improvements
  • Betül Yalçın - Contributions and improvements

Want to contribute?

If you've helped improve these notes and want to be listed here, or if you'd like to contribute:

  • Submit corrections or improvements via WhatsApp, email, or a GitHub PR
  • Share useful resources or examples
  • Help clarify confusing sections

Feel free to reach out at mahmoudahmedxyz@gmail.com, or message me through any channel we share, and I'll add you to this list.