Proteins and Bioinformatics
What is a Protein?
- A biopolymer - a biological polymer made of amino acid monomers linked together.
- A complex system that folds in its solvent (typically water)
- A molecule capable of interacting with other molecules
Are All Proteins Natural?
No.
- Natural proteins: Encoded by genes, produced by cells
- Synthetic proteins: Designed and manufactured in labs
- Modified proteins: Natural proteins with artificial modifications
This distinction matters for understanding protein databases and experimental vs. computational protein design.
Protein Sequence
The linear order of amino acids in a protein. This is the primary structure and is directly encoded by DNA/RNA.
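To make "encoded by DNA/RNA" concrete, here is a minimal sketch of translating a coding sequence into a protein sequence with the standard genetic code. The function name `translate` and the example sequence are illustrative assumptions. The sketch reads a single frame from the first base and stops at the first stop codon, ignoring everything a real cell or tool would also consider (alternative codes, introns, ambiguous bases).

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate a coding sequence (DNA or RNA) into one-letter amino acids.

    Reads codons from position 0 and stops at the first stop codon ('*').
    Trailing bases that do not form a full codon are ignored.
    """
    dna = nucleotides.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy sequence: ATG (Met), GCC (Ala), TAA (stop)
print(translate("ATGGCCTAA"))
```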
Proteins as Complex Systems
Proteins aren't just simple chains - they're complex biological systems that:
- Fold into specific 3D structures
- Interact with other molecules
- Respond to environmental conditions
- Have dynamic behavior (not static structures)
As biopolymers, they exhibit emergent properties that aren't obvious from just reading the sequence.
Complex models can be very useful: organoids, for example, are at the forefront of medicine. Building a reliable cellular model remains an open challenge.
Protein Stability
Measured by ΔG (delta G) of folding
ΔG represents the change in free energy during the folding process:
- Negative ΔG: Folding is favorable (stable protein)
- Positive ΔG: Folding is unfavorable (unstable)
- ΔG ≈ 0: Marginal stability
This thermodynamic measurement tells us how stable a folded protein is compared to its unfolded state.
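The sign convention above can be illustrated numerically. The sketch below assumes a simple two-state model (only folded and unfolded states, at equilibrium), in which ΔG = -RT ln K with K = [folded]/[unfolded]. The function name `fraction_folded`, the default temperature, and the ΔG values are illustrative assumptions, not measurements.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def fraction_folded(delta_g_kj_mol: float, temp_k: float = 298.15) -> float:
    """Fraction of molecules in the folded state for a two-state model.

    delta_g_kj_mol is the free energy of folding (G_folded - G_unfolded),
    so negative values favor the folded state.
    """
    # K = exp(-dG / RT); fraction folded = K / (1 + K) = 1 / (1 + exp(dG / RT))
    return 1.0 / (1.0 + math.exp(delta_g_kj_mol / (R_KJ_PER_MOL_K * temp_k)))


# Illustrative values only: favorable, marginal, and unfavorable folding
for delta_g in (-20.0, 0.0, 20.0):
    print(f"dG = {delta_g:+.1f} kJ/mol -> fraction folded = {fraction_folded(delta_g):.4f}")
```

At ΔG = 0 exactly half of the molecules are folded, which is what "marginal stability" means in this model.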
Transfer of Knowledge (Annotation)
One of the key principles in bioinformatics: we can transfer functional information from well-studied proteins to newly discovered ones based on sequence or structural similarity.
If protein A is well-characterized and protein B is similar, we can infer that B likely has similar function. This is the basis of homology-based annotation.
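The idea of transferring an annotation from A to B can be sketched in a few lines. This is a toy version under stated assumptions: the two sequences are already aligned (same length, "-" for gaps), similarity is measured as percent identity over ungapped columns, and the 40% cutoff is an arbitrary illustrative threshold. The names `percent_identity` and `transfer_annotation` are hypothetical. Real pipelines rely on dedicated alignment tools and statistical significance rather than a fixed identity cutoff.

```python
from typing import Optional


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def transfer_annotation(
    known_annotation: str,
    aligned_known: str,
    aligned_query: str,
    min_identity: float = 40.0,  # assumed cutoff, for illustration only
) -> Optional[str]:
    """Return the annotation, flagged as inferred, if the sequences are similar enough."""
    if percent_identity(aligned_known, aligned_query) >= min_identity:
        return f"{known_annotation} (inferred by similarity)"
    return None


# Hypothetical toy alignment: protein A is characterized, protein B is new
print(transfer_annotation("kinase", "ACDE", "ACDF"))
```

Flagging the result as inferred keeps transferred annotations distinguishable from experimentally supported ones.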
Protein phases are assemblies of proteins that presumably share a common task. For example, the proteins of the Krebs cycle cluster together, generating a protein phase. This process is driven by the affinity of the proteins for one another. It is considered so important that diseases can arise when some of these phases fail to form.
Structure vs. Sequence
Key principle: The structure of a protein is more informative than its sequence.
Why?
- Sequences can diverge significantly while structure remains conserved
- Different sequences can fold into similar structures (through divergence from a common ancestor or, in some cases, convergent evolution)
- Structure directly relates to function
- Structural similarity reveals evolutionary relationships that sequence alone might miss
This is why structural bioinformatics is so important - knowing the 3D structure gives you more information about function than just the sequence.
Macromolecular Crowding
Concept: Inside cells, it's crowded. Really crowded.
Macromolecular crowding: the cytoplasm of any cell is a dynamic environment, densely packed with macromolecules. Crowding reflects how the cell balances the number of molecules it contains against the number of processes it has to carry out.
Proteins don't fold and function in isolation - they're surrounded by other proteins, RNA, DNA, and small molecules. This crowding affects:
- Folding kinetics
- Protein stability
- Protein-protein interactions
- Diffusion rates
The intracellular environment is very crowded, and studying all the interactions within it is both important and still an open problem. For example, we do not yet understand how chromosomes interact within the nucleus, and understanding this could lead to new models. A model is crucial for data analysis: if no model exists, we have to build one.
Lab experiments often use dilute solutions, but cells are packed with macromolecules. This environmental difference matters for understanding real protein behavior.
Protein Quality and Databases
Where to find reliable protein data?
UniProt: Universal Protein Resource
- Contains both reviewed (Swiss-Prot) and unreviewed (TrEMBL) entries
- Comprehensive but variable quality
Swiss-Prot (part of UniProt):
- Manually curated and reviewed
- High-quality annotations, many backed by experimental evidence
- Gold standard for protein information
- Much smaller than UniProt as a whole but much more reliable
Rule of thumb: For critical analyses, prefer Swiss-Prot. For exploratory work, UniProt is broader but requires more careful validation.
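One practical way to apply this rule of thumb: in UniProt FASTA files, the header of a reviewed (Swiss-Prot) entry starts with `>sp|`, while an unreviewed (TrEMBL) entry starts with `>tr|`. The sketch below filters headers on that prefix. The function name `parse_uniprot_header` is hypothetical, and the two headers are made-up placeholders, not real database entries.

```python
def parse_uniprot_header(header: str) -> dict:
    """Split a UniProt-style FASTA header into database, accession, and entry name.

    Expected shape: '>db|ACCESSION|ENTRY_NAME free-text description'
    where db is 'sp' (Swiss-Prot, reviewed) or 'tr' (TrEMBL, unreviewed).
    """
    database, accession, rest = header.lstrip(">").split("|", 2)
    entry_name = rest.split(" ", 1)[0]
    return {
        "accession": accession,
        "entry_name": entry_name,
        "reviewed": database == "sp",
    }


# Made-up placeholder headers, for illustration only
headers = [
    ">sp|P00000|EXAMPLE_HUMAN Example reviewed protein",
    ">tr|A0A000000|A0A000000_HUMAN Example unreviewed protein",
]

reviewed_only = [h for h in headers if parse_uniprot_header(h)["reviewed"]]
print(reviewed_only)
```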
Interoperability: the ability of databases to talk to each other. Retrieving complete information is only possible when databases can exchange and cross-reference their data.
Data quality management: data quality is a very important issue. It is crucial to be able to discriminate between good and bad data, because even curated databases contain both good data and very bad data.
Folding of proteins
The main driving force behind protein folding is the hydrophobic effect. The fold a protein adopts is characteristic of its protein family. Proteins can also be composed of more than one polypeptide chain. In this case we say they are multimeric (oligomeric): homomers if the chains are identical, heteromers if they differ.
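A rough way to see the hydrophobic effect in a sequence is to average a per-residue hydropathy score. The sketch below uses the Kyte-Doolittle scale, which is an assumption brought in for illustration and not something these notes prescribe. The function name `mean_hydropathy` and the toy peptides are likewise hypothetical. A high average only hints that a segment would prefer to be buried away from water; it does not predict a fold.

```python
# Kyte-Doolittle hydropathy values (positive = more hydrophobic)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(sequence: str) -> float:
    """Average Kyte-Doolittle hydropathy over a one-letter protein sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return sum(scores) / len(scores)


# Toy peptides: one made of hydrophobic residues, one of charged residues
for peptide in ("ILV", "DEK"):
    print(peptide, round(mean_hydropathy(peptide), 2))
```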
Summary: What We've Covered
- Structural bioinformatics and protein folding
- Structure-function relationship
- Functional annotation and Gene Ontology
- Data quality challenges
- ML as function fitting
- Proteins as biopolymers and complex systems
- Natural vs. synthetic proteins
- Protein stability (ΔG)
- Structure is more informative than sequence
- Macromolecular crowding
- Data quality: UniProt vs. Swiss-Prot
Main themes:
- Predicting protein structure and function from sequence
- Understanding proteins as complex, context-dependent systems
- Data quality and annotation are critical challenges
- Computational methods (especially ML) are essential tools