AlphaFold2 & ColabFold: Next-Gen Protein Structure Prediction
For CS 7150

An Analysis of "Highly accurate protein structure prediction with AlphaFold" (Jumper et al., Nature 2021)

Introduction

AlphaFold2, developed by DeepMind and described in the 2021 Nature paper "Highly accurate protein structure prediction with AlphaFold" by Jumper et al., combines attention-based deep learning with evolutionary sequence data to predict protein three-dimensional structures in minutes at near-experimental accuracy (median GDT > 90 at CASP14). It effectively resolves the protein structure prediction problem, a grand challenge that stood open for more than 50 years, during which researchers had to rely on slow, expensive, and sometimes unsuccessful experimental methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy. By making structure determination rapid and reliable, this breakthrough is transforming biological research and therapeutic development.

Our team was inspired to delve deeper into this breakthrough to understand how AlphaFold2 works and to explore its potential applications in research and education. After investigating implementation options, we chose ColabFold, an adaptation of AlphaFold2 designed to run efficiently on Google Colab. ColabFold makes this powerful technology usable without specialized hardware, democratizing access to state-of-the-art protein structure prediction. Our project demonstrates how ColabFold can be used to predict protein structures and provides hands-on experience with a technology that is transforming structural biology worldwide.

To assess the impact of evolutionary and template data on predictive performance, we will perform two contrasting ColabFold experiments. In the first, we will supply a deep MSA of ~150 homologs plus four high-quality structural templates to establish a high-information baseline. In the second, we will restrict the MSA to the query sequence alone and disable template use to simulate an information-poor scenario, then compare per-residue confidence (pLDDT) and overall pTM scores across multiple refinement cycles.

Research question: How does the availability of evolutionary alignments and structural templates affect ColabFold's protein structure predictions?

AlphaFold2 Architecture: A Deep Dive

The AlphaFold2 architecture represents a revolutionary approach to protein structure prediction, combining evolutionary information with deep learning to achieve unprecedented accuracy. Unlike previous methods that relied on fragment assembly and physics-based simulations, AlphaFold2 uses an end-to-end neural network approach to directly predict 3D coordinates.

Figure 5: Overall architecture of AlphaFold2 showing the complete pipeline from amino acid sequence to 3D structure prediction.

Key Components

1. Multiple Sequence Alignment (MSA) Processing

The first critical component of AlphaFold2 is the processing of evolutionary information through multiple sequence alignments:

  • Searches genetic databases to find evolutionarily related sequences
  • Aligns these sequences to capture conservation patterns
  • Generates a rich representation that captures which amino acids tend to co-evolve
  • This evolutionary information is crucial, as co-evolving residues often indicate spatial proximity in the folded structure (see the toy sketch below)
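
To make the co-evolution intuition concrete, here is a minimal toy sketch (not AlphaFold2 code, which learns these couplings inside the network) that scores pairs of alignment columns by mutual information; the four-sequence alignment is invented purely for illustration:

```python
from collections import Counter
import math

# Hypothetical toy alignment: four homologous sequences, five columns.
msa = ["MKVLA", "MRVIA", "MKVLG", "MRVIG"]

def column(i):
    return [seq[i] for seq in msa]

def mutual_information(i, j):
    """Mutual information (bits) between alignment columns i and j."""
    n = len(msa)
    pi, pj = Counter(column(i)), Counter(column(j))
    pij = Counter(zip(column(i), column(j)))
    return sum(
        (c / n) * math.log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
        for (a, b), c in pij.items()
    )

# Columns 1 and 3 co-vary perfectly (K pairs with L, R with I): high signal,
# hinting the two residues may be in contact. Column 0 is conserved: no signal.
print(mutual_information(1, 3))  # 1.0
print(mutual_information(0, 1))  # 0.0
```

In real pipelines this pairwise statistic is superseded by the learned attention of the Evoformer, but the underlying signal is the same.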

2. Template Processing

While not always necessary for accurate predictions, AlphaFold2 can use information from known protein structures:

  • Searches structural databases (PDB) for proteins with similar sequences
  • Extracts distance maps and structural features from these templates
  • Integrates this information with sequence-based predictions
  • This component helps guide the model, especially for proteins with known homologs (a distance-map sketch follows this list)
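
As a rough illustration of what extracting a distance map from a template involves, the hedged sketch below loads a template with Biopython (`pip install biopython`) and computes the Cα-Cα distance matrix; the local file name and chain ID are assumptions for illustration, not part of the AlphaFold2 pipeline:

```python
import numpy as np
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
# Hypothetical local copy of one of the lysozyme templates.
structure = parser.get_structure("template", "1ior.pdb")
chain = structure[0]["A"]  # assume the template is chain A

# Collect C-alpha coordinates for residues that have one.
ca_coords = np.array([res["CA"].coord for res in chain if "CA" in res])

# Pairwise C-alpha distance map: entry (i, j) is the distance in angstroms.
diff = ca_coords[:, None, :] - ca_coords[None, :, :]
distance_map = np.linalg.norm(diff, axis=-1)
print(distance_map.shape)  # (n_residues, n_residues)
```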

3. Evoformer

The core of AlphaFold2's architecture is the Evoformer, a novel transformer-based neural network:

  • Contains 48 identical blocks that iteratively refine representations
  • Processes both MSA and pairwise representations simultaneously
  • Uses specialized attention mechanisms to update information bidirectionally between these representations
  • Key mechanisms include:
    • Row-wise gated self-attention: processes evolutionary information across sequence positions
    • Column-wise gated self-attention: processes evolutionary information across aligned sequences
    • Triangle multiplication: models higher-order interactions between residue pairs (sketched in code after Figure 6)
    • Transition layers: non-linear processing that integrates information
Figure 6: Detailed structure of the Evoformer module showing the various attention mechanisms and information flow.
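
The following simplified, untrained NumPy sketch shows the tensor algebra behind triangle multiplication (the "outgoing edges" variant): each pair (i, j) is updated from the edges (i, k) and (j, k) of every triangle it participates in. Random matrices stand in for the learned projections, and the gating and layer normalization of the real module are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, c = 16, 8                         # residues, pair-channel width
z = rng.normal(size=(n_res, n_res, c))   # pair representation

# Stand-ins for learned linear projections.
W_a = rng.normal(size=(c, c)) / np.sqrt(c)
W_b = rng.normal(size=(c, c)) / np.sqrt(c)
W_out = rng.normal(size=(c, c)) / np.sqrt(c)

a = z @ W_a   # "left" edge features,  shape (n_res, n_res, c)
b = z @ W_b   # "right" edge features, shape (n_res, n_res, c)

# Triangle update: pair (i, j) aggregates edges (i, k) and (j, k) over all k.
update = np.einsum("ikc,jkc->ijc", a, b)
z = z + update @ W_out   # residual connection back into the pair stack
print(z.shape)           # (16, 16, 8)
```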

4. Structure Module

The final stage of AlphaFold2 is the Structure Module, which converts refined representations into 3D coordinates:

  • Consists of 8 blocks with shared weights
  • Takes the representations from the Evoformer and gradually builds the 3D structure
  • Uses Invariant Point Attention (IPA): a novel attention mechanism that is invariant to global rotations and translations, so it respects the geometry of 3D space
  • Predicts backbone frames (rotations and translations)
  • Computes atom positions based on these frames
  • Predicts torsion angles to determine side chain conformations (a frame-application sketch follows Figure 7)
Figure 7: Detailed structure of the Structure Module showing the various attention mechanisms and information flow.
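
A minimal sketch of the Structure Module's final step: each residue carries a backbone frame (a rotation R_i and translation t_i), and global atom positions come from applying that frame to idealized local coordinates. The local N/CA/C geometry below is approximate and the single identity frame is hypothetical, purely for illustration:

```python
import numpy as np

def apply_frame(rotation, translation, local_coords):
    """Map local atom coordinates into the global frame: x_global = R @ x + t."""
    return local_coords @ rotation.T + translation

# Idealized backbone atom positions in a residue's local frame (angstroms).
local_backbone = np.array([
    [-0.52, 1.36, 0.0],   # N  (approximate)
    [0.00,  0.00, 0.0],   # CA (frame origin)
    [1.52,  0.00, 0.0],   # C  (approximate)
])

# One hypothetical predicted frame: identity rotation, translated 10 A along x.
R = np.eye(3)
t = np.array([10.0, 0.0, 0.0])
print(apply_frame(R, t, local_backbone))
```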

5. Recycling and Confidence Estimation

AlphaFold2 employs two additional techniques that significantly improve its performance:

  • Recycling: The entire prediction process is repeated multiple times (typically 3 recycles), with each iteration using the output of the previous one as input. This allows the model to refine its predictions based on emerging structural information (a schematic loop is sketched after this list).
  • Confidence Estimation: AlphaFold2 provides two key metrics:
    • pLDDT (predicted Local Distance Difference Test): per-residue confidence scores from 0-100, with higher values indicating greater confidence
    • PAE (Predicted Aligned Error): estimates the expected position error between any two residues, providing insight into the reliability of domain arrangements in multi-domain proteins
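
The sketch below is a schematic of the recycling loop (not DeepMind's code): rerun the network, feed the previous coordinates back in, and stop early once the backbone stops moving. The toy stand-in model simply pulls coordinates toward a fixed target so the loop demonstrably converges:

```python
import numpy as np

def predict_with_recycling(model, features, num_recycles=3, tolerance=0.5):
    """Run `model` repeatedly, feeding each output back in, with early stopping."""
    prev_coords = None
    coords = None
    for _ in range(num_recycles + 1):      # initial pass + num_recycles reruns
        coords = model(features, prev_coords)
        if prev_coords is not None:
            rmsd = np.sqrt(np.mean(np.sum((coords - prev_coords) ** 2, axis=-1)))
            if rmsd < tolerance:           # structure stopped moving: converged
                break
        prev_coords = coords
    return coords

# Toy stand-in "model": each call pulls coordinates halfway toward a target.
target = np.random.default_rng(0).normal(size=(129, 3))
def toy_model(features, prev):
    start = prev if prev is not None else np.zeros_like(target)
    return start + 0.5 * (target - start)

final = predict_with_recycling(toy_model, features=None, num_recycles=6)
```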

This sophisticated architecture enables AlphaFold2 to achieve remarkable accuracy, with many predictions reaching experimental-level quality (GDT scores > 90). The integration of evolutionary information with geometric reasoning through deep learning represents a fundamental breakthrough in protein structure prediction.

Methods

Our approach utilizes ColabFold, an open-source adaptation of AlphaFold2 designed for accessibility and efficiency. Developed by Mirdita et al., ColabFold significantly reduces the computational barriers to protein structure prediction while maintaining AlphaFold2's accuracy. Key advantages of ColabFold include:

  • Accelerated MSA generation using MMseqs2, reducing search times from hours to minutes
  • Optimized implementation for Google Colab's free GPU resources
  • Streamlined preprocessing and reduced memory requirements
  • Support for both protein monomer and complex predictions
  • Interactive visualization tools for structural analysis
  • Batch processing capabilities for multiple sequences

ColabFold preserves the core AlphaFold2 architecture while making it accessible to researchers without specialized computing infrastructure. Our implementation follows the standard ColabFold pipeline, which employs a multi-stage neural network architecture consisting of:

  1. Multiple Sequence Alignment (MSA) generation using MMseqs2 to capture evolutionary information
  2. Template structure identification from the PDB database
  3. Neural network processing through the Evoformer (48 blocks), which refines the MSA and pairwise representations
  4. Structure module (8 blocks) that predicts backbone coordinates and side chain orientations
  5. Confidence assessment using pLDDT (predicted Local Distance Difference Test) scores

For our experiments, we configured ColabFold with the following parameters (a sketch of the equivalent command-line invocation follows the list):

  • Model type: "auto" (automatically selects appropriate model based on input)
  • MSA depth: 512 sequences for optimal runs, reduced to 2 sequences for limitation testing
  • Extra MSA depth: 1024 sequences for optimal runs, reduced to 0 for limitation testing
  • Number of recycles: 6
  • Early stopping tolerance: 0.5 Å (Cα RMSD change between recycles)
  • Use of templates: Enabled with PDB70 database for optimal runs, disabled for some limitation tests

Our experiments covered two distinct scenarios. First, we predicted the structure of lysozyme (129 amino acids), achieving very high confidence (mean pLDDT ≈ 98 on the 0-100 scale) that compared favorably with the experimentally determined structure (PDB ID: 1LYZ). The entire prediction took approximately 1-2 minutes on a T4 GPU, with early stopping triggering after just 2 recycles due to structural convergence. Second, we conducted limit-testing experiments on a larger fragment (~1000 amino acids) with a minimal MSA and no templates, which produced dramatically lower-confidence predictions (pLDDT averaging 30-50), demonstrating AlphaFold2's critical dependence on sufficient evolutionary and template information, especially for larger proteins.

Link to Code

Our implementation and demonstration can be found in our GitHub repository: AlphaFold2-Demo. The repository includes our Jupyter notebook with step-by-step execution of the ColabFold pipeline, visualization code, and explanations of each component. We've also incorporated auxiliary scripts for structural comparison between predicted models and experimentally determined structures.

Open In Colab

Click the button above to open our demo notebook directly in Google Colab where you can run the protein structure prediction pipeline yourself.

Experiments

Experiment 1: Lysozyme Prediction (Optimal Conditions)

We conducted structure prediction experiments on lysozyme (129 amino acids), a well-characterized enzyme found in egg whites and human tears. Our implementation:

  1. Generated deep multiple sequence alignments (>2500 sequences)
    Figure 8: MSA coverage for the lysozyme input.
    Figure 9: Top five MSA hits for the lysozyme input.
  2. Identified suitable templates (1ior, 1ioq, 1iot, 1kxw)
    Figure 10: Suitable templates for lysozyme.
  3. Ran predictions with up to 5 recycles across multiple AlphaFold models; early stopping triggered after 2 recycles
    Figure 11: Convergence after 2 recycles due to early stopping.
  4. Achieved exceptional prediction quality, with pLDDT scores averaging 98.4 and a pTM score of 0.913
    Figure 12: Lysozyme prediction with confidence coloring.
  5. Visualized predictions using both rainbow coloring (by residue position) and confidence coloring
  6. Compared our predicted structure with the experimental structure (1LYZ)
    Figure 13: Predicted vs. experimental structure of lysozyme.

The high confidence score (blue coloring in pLDDT visualization) indicates a highly reliable prediction across the entire protein structure, demonstrating AlphaFold2's remarkable ability to predict protein structures with near-experimental accuracy.
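
The kind of structural comparison mentioned above can be scripted; the hedged sketch below superimposes the predicted model onto the experimental structure on Cα atoms with Biopython and reports the RMSD. File names are placeholders, and the naive one-to-one residue pairing assumes both files cover the same 129 residues (a real comparison would align sequences first):

```python
from Bio.PDB import PDBParser, Superimposer

parser = PDBParser(QUIET=True)
pred = parser.get_structure("pred", "lysozyme_predicted.pdb")  # placeholder
expt = parser.get_structure("expt", "1lyz.pdb")                # placeholder

def ca_atoms(structure):
    """C-alpha atoms of the first chain of the first model."""
    chain = next(structure[0].get_chains())
    return [res["CA"] for res in chain if "CA" in res]

pred_ca, expt_ca = ca_atoms(pred), ca_atoms(expt)
n = min(len(pred_ca), len(expt_ca))  # naive pairing; align sequences in practice

sup = Superimposer()
sup.set_atoms(expt_ca[:n], pred_ca[:n])  # fixed atoms first, then moving
print(f"C-alpha RMSD after superposition: {sup.rms:.2f} A")
```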

Experiment 2: Exploring the Limitations of AlphaFold2

To probe the boundaries of AlphaFold2's capabilities, we stressed the model with a deliberately challenging prediction scenario that combines two difficulties: a long protein fragment and severely restricted evolutionary information.

Limited Evolutionary Information for a Long Protein

We tested how prediction quality changes for a significantly larger protein fragment (~1000 amino acids) when evolutionary information is restricted:

  • Reduced MSA depth from 512 to just 2 sequences
  • Maintained all other parameters
  • No structural templates available
  • Observed significant degradation in prediction quality (quantified in the sketch after Figure 17):
    • Best pLDDT dropped from 98.4 (lysozyme) to just 43.3
    • Required the full 6 recycles (no early stopping)
    • High predicted aligned error (predominantly red PAE matrix)
    • Unstable structures across recycles, with large RMSD variations
    • Substantially different structures from different models
Figure 14: Predictions from models 1 and 2.
Figure 15: Predictions from models 3 and 4.
Figure 16: Predictions from model 5, which achieved the best pLDDT of 43.3.
Figure 17: Prediction result for the long protein fragment with limited information.
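
Per-residue confidence is easy to pull out of the outputs: AlphaFold2 and ColabFold write pLDDT into the B-factor column of the predicted PDB file. The sketch below (the output file name is a placeholder) computes the mean pLDDT and counts very-low-confidence residues:

```python
import numpy as np
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
# Placeholder name for a ColabFold output model of the long fragment.
model = parser.get_structure("pred", "fragment_rank_001.pdb")[0]

# pLDDT (0-100) is stored per atom in the B-factor field; read it at each CA.
plddt = np.array([res["CA"].bfactor
                  for chain in model
                  for res in chain if "CA" in res])
print(f"mean pLDDT: {plddt.mean():.1f}")
print(f"residues below pLDDT 50: {(plddt < 50).sum()} of {len(plddt)}")
```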

Key Insights from Experiments

These experiments collectively demonstrate that AlphaFold2's prediction quality depends on multiple interdependent factors:

  • Evolutionary information (MSA depth) plays a crucial role in achieving high-confidence predictions
  • Template availability provides important structural guidance
  • Prediction quality degrades substantially under challenging conditions with limited information
  • The combination of protein complexity and limited evolutionary information presents the greatest challenge
  • Under optimal conditions, predictions can achieve near-experimental accuracy

Conclusion

Our hands-on exploration of AlphaFold2 through ColabFold revealed both the impressive capabilities and the important limitations of this revolutionary technology. While we achieved remarkable accuracy on lysozyme (pLDDT of 98.4), our challenging scenario demonstrated clear performance boundaries. When working with a large protein fragment under constrained conditions (minimal MSA depth and no templates), prediction confidence dropped sharply, with pLDDT scores averaging only 30-50 and high predicted aligned errors, as visualized in the predominantly red PAE maps.

These findings highlight a crucial insight about the interdependent factors affecting prediction quality: the absence of sufficient evolutionary information (a shallow MSA) combined with the lack of structural templates dramatically degrades performance, especially for challenging proteins. Our comparison shows that AlphaFold2's success depends on the integration of multiple data sources rather than any single algorithmic innovation. This breakthrough, recognized with the 2024 Nobel Prize in Chemistry, represents a sophisticated synthesis of evolutionary information, structural knowledge, and deep learning rather than a complete departure from traditional approaches.

References

[1] Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583-589 (2021). This groundbreaking paper introduced AlphaFold2 and its revolutionary approach to protein structure prediction. It details the neural network architecture that solved a 50-year-old grand challenge in biology with near-experimental accuracy.

[2] Mirdita, M., Schütze, K., Moriwaki, Y. et al. ColabFold: making protein folding accessible to all. Nat Methods 19, 679-682 (2022). This paper describes an optimized implementation of AlphaFold2 designed to run efficiently on consumer hardware. ColabFold democratizes access to state-of-the-art protein structure prediction by reducing computational requirements while maintaining high accuracy.

[3] Google DeepMind. Demis Hassabis and John Jumper awarded Nobel Prize in Chemistry. DeepMind Blog (2024). Retrieved April 21, 2025. This announcement highlights the unprecedented scientific impact of AlphaFold2, which earned a Nobel Prize just four years after publication. The recognition underscores how AI can solve fundamental scientific problems previously thought to require decades more research.

[4] Ahdritz, G., Bouatta, N., Kadyan, S. et al. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat Methods (2024). This paper presents OpenFold, an open-source reimplementation and retraining of AlphaFold2 that provides deeper insights into how the model works. The study reveals important details about AlphaFold2's learning mechanisms and its ability to generalize to novel protein structures.

Team Members (Authors)

Rishi Mule
Dhanshree Baravkar