Core Concepts

Structural Bioinformatics

Focus: Protein folding and structure prediction

The main goal of structural bioinformatics is predicting the final 3D structure of a protein starting from its amino acid sequence. This is one of the fundamental challenges in computational biology.

The Central Dogma Connection

Question raised: To be sure that a protein is expressed, you must have a transcript. Why?

Because: DNA → RNA (transcript) → Protein. Without the transcript (mRNA), there's no template for translation into protein. Gene expression requires transcription first.
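The flow of information above can be sketched in a few lines of Python. This is a minimal illustration, not a model of real gene expression: the function names `transcribe` and `translate` and the toy sequence are assumptions made for the example, and transcription is simplified to copying the coding strand with T replaced by U.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), written compactly:
# codons are enumerated in T, C, A, G order and paired with one-letter amino acids.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon).replace("T", "U"): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Simplified transcription: the mRNA matches the coding strand with U for T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon ('*') or the end."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so no protein: the transcript is the template
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


toy_gene = "ATGGCCTAA"  # made-up sequence for illustration only
mrna = transcribe(toy_gene)
protein = translate(mrna)
print(mrna, protein)
```

Note that `translate` only ever sees the mRNA string, which mirrors the point of the question: no transcript, no protein.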

What is Protein/DNA Folding?

Folding is the process by which a linear sequence (amino acids for proteins, nucleotides for DNA) adopts a specific three-dimensional structure. This structure determines function.

  • Protein folding: Amino acid chain → functional 3D protein
  • DNA folding: Linear DNA → chromatin structure

Structure and Function

A fundamental principle in biology: structure determines function. The 3D shape of a protein dictates what it can do - what it binds to, what reactions it catalyzes, how it interacts with other molecules.

The structure of a molecule depends on its electron density; in fact, the structure is just the shape of the molecule's electron density cloud in space. Structure also determines function: once you know the structure, you can derive the properties of the molecule and, from those, its function.

🔬
Random Fact

Bioinformatics does not produce data; it analyses existing data. The quality of that data is therefore crucial.

Functional Annotation

One of the most important fields in bioinformatics is functional annotation.

What does it mean?

Functional annotation is the process of assigning biological meaning to sequences or structures. Given a protein sequence, what does it do? What pathways is it involved in? What cellular processes does it regulate?

This involves:

  • Predicting function from sequence similarity
  • Domain identification
  • Pathway assignment
  • Gene Ontology (GO) terms
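
The first item, predicting function from sequence similarity, can be sketched as annotation transfer: find the most similar annotated sequence and borrow its label. The example below is only a toy under stated assumptions. The sequences and labels are invented, `transfer_annotation` is a hypothetical helper, and `difflib.SequenceMatcher` from the Python standard library stands in for a real alignment tool such as BLAST.

```python
from difflib import SequenceMatcher

# Invented reference set: sequence -> functional label (not real annotations).
ANNOTATED = {
    "MKTAYIAKQRQISFVKSHFSRQ": "hypothetical kinase",
    "MGSSHHHHHHSSGLVPRGSH": "hypothetical tagged construct",
}


def similarity(a: str, b: str) -> float:
    """Crude similarity score in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def transfer_annotation(query: str, min_similarity: float = 0.7):
    """Return (label, score) of the best hit, or (None, score) if too dissimilar."""
    best_seq = max(ANNOTATED, key=lambda ref: similarity(query, ref))
    score = similarity(query, best_seq)
    label = ANNOTATED[best_seq] if score >= min_similarity else None
    return label, score


# This query differs from the first reference by a single residue.
label, score = transfer_annotation("MKTAYIAKQRQISFVKSHFSKQ")
print(label, round(score, 2))
```

The `min_similarity` threshold is an arbitrary choice for the sketch; the point is that a function is only transferred when the match is strong enough to trust.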
💡
Reference databases

The reference database for protein structures is the PDB (Protein Data Bank).

The reference database for protein function is UniProt.

The reference database for DNA sequences is GenBank, which is based in the U.S.; its European counterpart is the ENA (European Nucleotide Archive).

The reference resources for the human genome are Ensembl, developed at the Wellcome Genome Campus in Hinxton (U.K., home of EMBL-EBI and the Sanger Institute), and the UCSC Genome Browser (U.S.).

Functional annotation in UniProt can be manually curated (Swiss-Prot) or automatic (TrEMBL). Swiss-Prot contains only non-redundant sequences.
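
In UniProt FASTA downloads, the two sections can be told apart from the header prefix: `sp|` marks Swiss-Prot entries and `tr|` marks TrEMBL entries. A minimal sketch, assuming headers in that format; the helper name `uniprot_section` is hypothetical and the TrEMBL-style header is invented for illustration.

```python
def uniprot_section(header: str) -> str:
    """Classify a UniProt FASTA header as Swiss-Prot or TrEMBL by its prefix."""
    db = header.lstrip(">").split("|", 1)[0]
    if db == "sp":
        return "Swiss-Prot (manually curated)"
    if db == "tr":
        return "TrEMBL (automatic)"
    return "unknown"


headers = [
    # Real Swiss-Prot entry: human hemoglobin subunit alpha.
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606",
    # Invented TrEMBL-style header; the accession does not refer to a real entry.
    ">tr|X0X0X0|X0X0X0_EXAMPLE Uncharacterized protein",
]
for h in headers:
    print(h.split()[0], "->", uniprot_section(h))
```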

🔬
Random Fact

These databases contain several isoforms of the same protein.

We can also look at the distribution of protein lengths in UniProt. Most proteins are between 100 and 500 residues long, with a few that are very large and others that are very small. The distribution is not normal, however: the tail on the side of the long sequences is heavier, because a very small number of amino acids can only generate a small number of unique sequences. We can also look at the abundance of each amino acid: the most abundant are the aliphatic ones.
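
The argument about short sequences follows from simple counting: with 20 standard amino acids there are 20^n possible chains of length n, so very short lengths can only host a handful of distinct proteins. The sketch below shows that count, plus one way a length and residue-composition summary could be computed. The three sequences are invented toy data, not UniProt entries, and `possible_sequences` is a name chosen for this example.

```python
from collections import Counter


def possible_sequences(n: int) -> int:
    """Number of distinct chains of n residues built from 20 standard amino acids."""
    return 20 ** n


for n in (1, 2, 3, 10):
    print(n, possible_sequences(n))

# Invented toy sequences standing in for database entries.
toy_sequences = ["MALLV", "MKV", "GAVLIA"]

# Length of each entry (the quantity plotted in a length distribution).
lengths = [len(seq) for seq in toy_sequences]

# Relative abundance of each residue across all entries.
composition = Counter("".join(toy_sequences))
total = sum(composition.values())
frequencies = {aa: count / total for aa, count in composition.most_common()}

print(lengths)
print(frequencies)
```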

Data Challenges

The professor discussed practical issues in bioinformatics data:

Collection: How do we gather biological data?
Production: How is data generated (sequencing, experiments)?
Quality: How reliable is the data? What are the error rates?
Redundancy: Multiple entries for the same protein/gene - how do we handle duplicates?
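
For the redundancy question, the simplest policy is to collapse entries with identical sequences and keep track of which identifiers were merged. A minimal sketch with invented identifiers and sequences; `collapse_duplicates` is a hypothetical helper, and it only catches exact duplicates, whereas near-duplicates would need a similarity-based approach that is not attempted here.

```python
from collections import defaultdict

# Invented records: (identifier, sequence).
records = [
    ("entry_1", "MKTAYIAK"),
    ("entry_2", "MKTAYIAK"),  # exact duplicate of entry_1
    ("entry_3", "MGSSHHHH"),
]


def collapse_duplicates(entries):
    """Group identifiers by exact sequence; one key per unique sequence."""
    groups = defaultdict(list)
    for identifier, sequence in entries:
        groups[sequence.upper()].append(identifier)
    return dict(groups)


unique = collapse_duplicates(records)
for sequence, ids in unique.items():
    print(sequence, ids)
```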

Gene Ontology (GO)

A standardized vocabulary for describing:

  • Biological processes (what cellular processes the gene/protein is involved in)
  • Molecular functions (what the protein does at the molecular level)
  • Cellular components (where in the cell it's located)

GO provides a controlled language for functional annotation across all organisms.
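
One way to picture a GO annotation is as a set of term identifiers grouped by the three aspects above. The sketch below attaches three widely used GO terms to a hypothetical protein called `PROT_X`. The grouping function and the data layout are assumptions for illustration, not the format of any GO annotation file.

```python
from collections import defaultdict

# (GO identifier, term name, aspect). The protein PROT_X is hypothetical.
annotations = {
    "PROT_X": [
        ("GO:0006915", "apoptotic process", "biological_process"),
        ("GO:0005515", "protein binding", "molecular_function"),
        ("GO:0005634", "nucleus", "cellular_component"),
    ]
}


def group_by_aspect(terms):
    """Group GO terms by aspect: process, function, or component."""
    grouped = defaultdict(list)
    for go_id, name, aspect in terms:
        grouped[aspect].append(f"{go_id} ({name})")
    return dict(grouped)


by_aspect = group_by_aspect(annotations["PROT_X"])
for aspect, terms in by_aspect.items():
    print(aspect, terms)
```

Because every database uses the same identifiers, annotations like these can be compared across organisms without matching free-text descriptions.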

Machine Learning in Bioinformatics

📖
Definition

Machine learning is about fitting a function (or line) between input and output.

Given input data (like protein sequences), ML tries to learn patterns that map to outputs (like protein function or structure). Essentially: find the line (or curve, or complex function) that best describes the relationship between what you know (input) and what you want to predict (output).
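
The simplest instance of "finding the line" is ordinary least squares on one input variable. The sketch below is a generic illustration with made-up numbers, not a bioinformatics model; the function name `fit_line` and the data are assumptions for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict the output for an unseen input
print(slope, intercept, prediction)
```

Real problems replace the straight line with a more flexible function and the single number `x` with an encoded sequence, but the idea of choosing parameters that best map inputs to outputs is the same.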

We are in the era of big data, and managing all this data requires new algorithms. Artificial intelligence is an old concept; in the 1980s, however, an algorithm capable of training artificial intelligence models was developed. Learning is essentially an optimization process.
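
To make "learning is an optimization process" concrete, the sketch below trains a line by plain gradient descent on the mean squared error. Gradient descent is chosen here purely for illustration; the notes do not say which algorithm was meant. The data, the learning rate, and the names `mse` and `train` are assumptions for the example.

```python
# Made-up training data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def mse(w, b):
    """Mean squared error of the line y = w * x + b on the training data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(learning_rate=0.05, steps=2000):
    """Minimise the error by repeatedly stepping against its gradient."""
    w, b = 0.0, 0.0  # start from an uninformed guess
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train()
print(round(w, 3), round(b, 3), mse(0.0, 0.0), mse(w, b))
```

Each step nudges the parameters in the direction that lowers the error, so "training" is nothing more than searching for the parameter values that minimise a loss.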

Deep learning is a variant of machine learning that uses more complex models and is often more accurate and better performing. Today, classical machine learning is called "shallow" machine learning. Good-quality data is essential for training these models, so that they learn to associate the right information with specific data.