Proteins and Bioinformatics

What is a Protein?

A biopolymer: a biological polymer made of amino acid monomers linked together by peptide bonds.
A complex system capable of folding in a solvent
A molecule capable of interacting with other molecules

Are All Proteins Natural?

No.

Natural proteins: Encoded by genes, produced by cells
Synthetic proteins: Designed and manufactured in labs
Modified proteins: Natural proteins with artificial modifications

This distinction matters for understanding protein databases and experimental vs. computational protein design.

Protein Sequence

The linear order of amino acids in a protein. This is the primary structure and is directly encoded by DNA/RNA.
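Since the primary structure is just an ordered string over the 20 standard amino acids, it is easy to handle in code. The sketch below is a minimal illustration, not a standard tool: the function name composition and the example peptide are invented here, and only the one-letter codes of the 20 standard residues are accepted.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def composition(sequence: str) -> dict[str, float]:
    """Return the fraction of each residue type in a protein sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - STANDARD_AMINO_ACIDS
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    counts = Counter(seq)
    return {residue: n / len(seq) for residue, n in counts.items()}


# Hypothetical 10-residue peptide, invented for illustration only.
example = "MKTAYIAKQR"
fractions = composition(example)
```

The order of the letters is the primary structure; the composition computed here discards that order, which is one reason composition alone says little about the fold.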

Proteins as Complex Systems

Proteins aren't just simple chains - they're complex biological systems that:

Fold into specific 3D structures
Interact with other molecules
Respond to environmental conditions
Have dynamic behavior (not static structures)

As biopolymers, they exhibit emergent properties that aren't obvious from just reading the sequence.

🔬

Random Fact

Complex models can be very useful. For example, organoids are at the forefront of medical research, but building a reliable cellular model is still a challenge to solve.

Protein Stability

Measured by ΔG (delta G) of folding

ΔG represents the change in free energy during the folding process:

Negative ΔG: Folding is favorable (stable protein)
Positive ΔG: Folding is unfavorable (unstable)
ΔG ≈ 0: Marginal stability

This thermodynamic measurement tells us how stable a folded protein is compared to its unfolded state.
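To make the sign convention concrete, the sketch below converts a folding ΔG into the fraction of molecules that are folded at equilibrium. It assumes a simple two-state model (only folded and unfolded states), ΔG defined as G(folded) minus G(unfolded), and ΔG given in kJ/mol; none of these assumptions is stated in the notes above, and the function name fraction_folded is invented for illustration.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def fraction_folded(delta_g_kj_mol: float, temperature_k: float = 298.15) -> float:
    """Equilibrium fraction of folded molecules in a two-state model.

    The folding equilibrium constant is K = [folded]/[unfolded] = exp(-dG/RT),
    so the folded fraction is K / (1 + K) = 1 / (1 + exp(dG/RT)).
    """
    rt = R_KJ_PER_MOL_K * temperature_k
    return 1.0 / (1.0 + math.exp(delta_g_kj_mol / rt))


# Assumed example values in kJ/mol: favorable, marginal, unfavorable.
for delta_g in (-20.0, 0.0, 20.0):
    print(f"dG = {delta_g:+.1f} kJ/mol -> fraction folded = {fraction_folded(delta_g):.4f}")
```

Under this model, ΔG = 0 gives exactly half of the molecules folded, which is what "marginal stability" means in practice.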

Transfer of Knowledge (Annotation)

One of the key principles in bioinformatics: we can transfer functional information from well-studied proteins to newly discovered ones based on sequence or structural similarity.

If protein A is well-characterized and protein B is similar, we can infer that B likely has similar function. This is the basis of homology-based annotation.
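A minimal sketch of this idea is shown below. It assumes the two sequences have already been aligned (same length, with "-" marking gaps), and it uses a 40% identity cutoff that is purely illustrative, not a rule from these notes. The function names, the sequences, and the annotation text are all hypothetical.

```python
from typing import Optional


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def transfer_annotation(query_aln: str, reference_aln: str, reference_function: str,
                        min_identity: float = 40.0) -> Optional[str]:
    """Return the reference annotation, marked as inferred, if similarity is high enough."""
    identity = percent_identity(query_aln, reference_aln)
    if identity >= min_identity:
        return f"{reference_function} (inferred by similarity, {identity:.0f}% identity)"
    return None  # too dissimilar: do not transfer anything


# Hypothetical pre-aligned pair: protein A (characterized) and protein B (new).
protein_a = "MKTAYIAKQR"
protein_b = "MKSAYLA-QR"
inferred = transfer_annotation(protein_b, protein_a, "hypothetical kinase activity")
```

Marking the result as "inferred" matters: a transferred annotation is a hypothesis about protein B, not an experimental observation.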

🔬

Random Fact

Protein phases are assemblies of proteins that presumably share a common function. For example, the enzymes of the Krebs cycle cluster together, forming a protein phase. The process is driven by the affinity the proteins have for each other, and it is considered so important that diseases can arise when some of these phases fail to form.

Structure vs. Sequence

Key principle: The structure of a protein is more informative than its sequence.

Why?

Sequences can diverge significantly while structure remains conserved
Different sequences can fold into similar structures (for example, through convergent evolution)
Structure directly relates to function
Structural similarity reveals evolutionary relationships that sequence alone might miss

This is why structural bioinformatics is so important - knowing the 3D structure gives you more information about function than just the sequence.

Macromolecular Crowding

Concept: Inside cells, it's crowded. Really crowded.

ℹ️

Info

Macromolecular crowding: the cytoplasm of any cell is a dynamic environment, densely packed with macromolecules. The cell has to balance the number of molecules it contains against the number of processes that must run in that space.

Proteins don't fold and function in isolation - they're surrounded by other proteins, RNA, DNA, and small molecules. This crowding affects:

Folding kinetics
Protein stability
Protein-protein interactions
Diffusion rates

The intracellular environment is very crowded, and studying all the interactions within it is both important and still an open problem. For example, we do not yet fully understand how chromosomes interact within the nucleus; understanding this would allow us to build models. A model is crucial for data analysis: if no model exists, we have to produce one.

Lab experiments often use dilute solutions, but cells are packed with macromolecules. This environmental difference matters for understanding real protein behavior.

Protein Quality and Databases

Where to find reliable protein data?

UniProt: Universal protein database

Contains both reviewed and unreviewed entries
Comprehensive but variable quality

Swiss-Prot (the reviewed part of UniProt):

Manually curated and reviewed
High-quality annotations, backed by experimental evidence where available
Gold standard for protein information
Much smaller than the unreviewed part of UniProt (TrEMBL) but much more reliable

Rule of thumb: For critical analyses, prefer Swiss-Prot. For exploratory work, UniProt is broader but requires more careful validation.
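In UniProt FASTA files the review status is visible in the header: reviewed (Swiss-Prot) entries start with ">sp|" and unreviewed (TrEMBL) entries with ">tr|". The sketch below separates the two; the function name and both example records (accessions, names, sequences) are invented for illustration and do not correspond to real entries.

```python
def split_by_review_status(fasta_text: str) -> tuple[list[str], list[str]]:
    """Return (reviewed, unreviewed) accession lists from UniProt-style FASTA text."""
    reviewed: list[str] = []
    unreviewed: list[str] = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].split("|", 2)  # expected: db | accession | entry name + description
        if len(parts) < 3:
            continue  # not a UniProt-style header
        database, accession = parts[0], parts[1]
        if database == "sp":
            reviewed.append(accession)
        elif database == "tr":
            unreviewed.append(accession)
    return reviewed, unreviewed


# Hypothetical records, invented for illustration only.
EXAMPLE_FASTA = """\
>sp|P00001|EXMP1_HUMAN Example protein one OS=Homo sapiens OX=9606 PE=1 SV=1
MKTAYIAKQR
>tr|A0A000AAA0|A0A000AAA0_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
MKSAYLAQR
"""

reviewed_ids, unreviewed_ids = split_by_review_status(EXAMPLE_FASTA)
```

A filter like this is a cheap first step when an analysis should rest only on curated entries.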

Interoperability: the ability of databases to talk to each other. To retrieve complete information, databases must be able to exchange data with one another, for example through shared identifiers and cross-references.

Data quality management: the quality of data is a very important issue. It is crucial to be able to tell good data from bad data, because even within a single database there is good data and very bad data.

Folding of proteins

📝

Note

The main driving force behind the folding of a protein is the hydrophobic effect. The fold of a protein is characteristic of its protein family. Proteins can be composed of more than one polypeptide chain; in this case we say they are oligomers (multimers), and hetero-oligomers when the chains differ from each other.
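One rough way to see the hydrophobic effect in sequence data is to score residues with a hydropathy scale. The sketch below uses the Kyte-Doolittle scale, which is a choice made here for illustration and is not mentioned in the notes; the function name and the two test peptides are also invented. Positive averages indicate mostly hydrophobic residues, which tend to be buried away from water in a folded protein.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(sequence: str) -> float:
    """Average Kyte-Doolittle hydropathy over all residues of a sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    try:
        return sum(KYTE_DOOLITTLE[residue] for residue in seq) / len(seq)
    except KeyError as err:
        raise ValueError(f"non-standard residue: {err.args[0]}") from None


# Hypothetical peptides, invented for illustration only.
hydrophobic_score = mean_hydropathy("AILV")  # aliphatic, hydrophobic residues
charged_score = mean_hydropathy("DEKR")      # charged, hydrophilic residues
```

A single average is a crude summary; it says nothing about where along the chain the hydrophobic residues sit, which is what matters for forming a buried core.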

Summary: What We've Covered

Structural bioinformatics and protein folding
Structure-function relationship
Functional annotation and Gene Ontology
Data quality challenges
ML as function fitting
Proteins as biopolymers and complex systems
Natural vs. synthetic proteins
Protein stability (ΔG)
Structure is more informative than sequence
Macromolecular crowding
Data quality: UniProt vs. Swiss-Prot

Main themes:

Predicting protein structure and function from sequence
Understanding proteins as complex, context-dependent systems
Data quality and annotation are critical challenges
Computational methods (especially ML) are essential tools

Bioinformatics Forever