AlphaFold: Protein Structure Prediction (v3.0 — 2026 Edition). A practical guide covering the protein folding problem, the AI architecture, database usage, programmatic APIs, specialized models like AlphaFold-Multimer and AlphaMissense, local deployment, the introduction of AlphaFold 3, and performance scaling techniques.
Scientific Computing, Protein Structure Prediction, Deep Learning for Science
1
The Protein Folding Problem
We explore the 50-year grand challenge of protein folding and why it matters to software engineers. Learn what proteins are and why their 3D structure dictates their biological function.
4m 02s
2
Inside AlphaFold 2: Architecture Overview
A deep dive into the neural network architecture of AlphaFold 2. We break down the Evoformer block, Multiple Sequence Alignments (MSA), and Pair Representations.
3m 25s
3
Evaluating Predictions: pLDDT and PAE
How do you know if an AI-generated protein structure is accurate? Learn how to interpret pLDDT for local confidence and PAE for global domain positioning.
4m 08s
4
The AlphaFold Protein Structure Database
Before running massive computational pipelines, check if your protein is already solved. We explore the massive AlphaFold Database hosted by EMBL-EBI.
3m 36s
5
Automating Discovery: The AlphaFold Database API
Learn how to build automated programmatic workflows to fetch protein structures at scale using the AlphaFold Database API.
3m 52s
6
Predicting Structures with ColabFold
Discover ColabFold, a faster alternative for AlphaFold inference that replaces Jackhmmer with MMseqs2 for lightning-fast sequence alignment.
3m 22s
7
AlphaFold-Multimer: Predicting Protein Complexes
Proteins rarely act alone. Learn how AlphaFold-Multimer predicts the interactions and 3D structures of complex protein assemblies.
3m 42s
8
AlphaMissense: Predicting Variant Pathogenicity
Explore AlphaMissense, a specialized model that predicts whether a single letter change in a protein's sequence will cause disease.
4m 00s
9
Deploying AlphaFold 2 Locally
Take control of your infrastructure by deploying the open-source AlphaFold 2 pipeline locally using Docker and massive genetic databases.
4m 25s
10
Introducing AlphaFold 3: Beyond Proteins
AlphaFold v3.0 fundamentally shifts the landscape by modeling DNA, RNA, ligands, and ions, painting a complete picture of the cellular environment.
3m 25s
11
AlphaFold Server: The AF3 Gateway
Get hands-on with AlphaFold v3.0 using the AlphaFold Server, a web-based GUI that removes the need for local hardware and complex setups.
3m 32s
12
Interpreting AlphaFold 3 Results
Evaluating AlphaFold v3.0 predictions requires new metrics. Learn how to interpret clash scores and nucleic acid confidences.
4m 08s
13
AlphaFold 3 Inference Pipeline
Learn how to orchestrate the open-source AlphaFold v3.0 pipeline, manage JSON inputs, and run the containerized application.
3m 56s
14
Data Pipelines & Hardware Requirements
Master the separation of concerns in AlphaFold v3.0 by decoupling the CPU-heavy data pipeline from the GPU-heavy inference engine.
3m 32s
15
The Memory Bottleneck: O(n³) Attention
We dive into the FastFold research paper to understand why AlphaFold's Evoformer module causes catastrophic Out-of-Memory errors on long sequences.
4m 01s
16
Dynamic Axial Parallelism (DAP)
Learn how the FastFold architecture solves AlphaFold's memory limits by splitting intermediate activations across multiple GPUs using Dynamic Axial Parallelism.
4m 00s
17
AutoChunk: Optimizing Memory for Long Sequences
Manual memory chunking is tedious. We explore the AutoChunk algorithm from the FastFold paper, which automatically optimizes tensor partitioning during inference.
3m 36s
18
Overcoming Communication Imbalance
Distributed training is plagued by stragglers. Learn how the ScaleFold architecture redesigns the AlphaFold data pipeline to prevent slow CPU nodes from stalling GPU clusters.
3m 47s
19
Kernel Fusion and GPU Optimization
AlphaFold launches over 150,000 separate CUDA kernels per step. We explore how the ScaleFold paper uses OpenAI's Triton to fuse LayerNorm and Multi-Head Attention.
3m 58s
20
Building a High-Throughput Pipeline
From evaluating model weights asynchronously to leveraging CUDA graphs, learn the system architecture secrets to running AlphaFold at massive scale.
3m 46s
21
The Future: Flow-Matching with SimpleFold
Do we really need complex, domain-specific architectures to fold proteins? We explore SimpleFold, an experimental model that uses standard transformers and flow-matching.
4m 12s
Episodes
1
The Protein Folding Problem
4m 02s
We explore the 50-year grand challenge of protein folding and why it matters to software engineers. Learn what proteins are and why their 3D structure dictates their biological function.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 1 of 21. For fifty years, scientists could read the raw text of our DNA, but they could not predict what that text would actually build. They had the source code of life, but no idea how it compiled into the physical machines running our cells. This grand challenge is known as The Protein Folding Problem.
Proteins are the molecular machines that keep you alive. They handle almost everything happening inside your body, from carrying oxygen in your blood to fighting off infections. Every protein begins its life as a one-dimensional string. This string is built from a vocabulary of twenty different chemical building blocks called amino acids.
Think of the amino acid sequence as the raw source code of biology. It is a linear string of characters, read sequentially. But a straight line of amino acids cannot do any actual work. Just as a plain text file needs to be compiled into an executable binary to run, the one-dimensional amino acid sequence must fold into a highly specific three-dimensional shape.
In the biological world, shape dictates function completely. The physical contours of a protein determine what it can interact with. A protein folded into a pocket shape might catch and break down a specific sugar molecule. A protein folded into a rigid tube might act as a structural support for a cell. If the folding process goes wrong, the biological program crashes. In humans, misfolded proteins are the root cause of many severe diseases.
The sequence of amino acids contains all the instructions needed to form this exact three-dimensional structure. The different amino acids have different chemical properties. Some carry positive or negative charges and act like magnets. Some repel water and try to hide in the center of the structure, while others are attracted to water and push to the outside. These competing physical forces cause the string to tangle, twist, and snap into a single, stable configuration.
Here is the key insight. The math behind this folding process is staggering. A typical protein chain is made of hundreds of amino acids. The number of possible ways a chain that long could bend is around ten to the power of three hundred. A scientist named Cyrus Levinthal pointed out that if a protein tried every possible shape sequentially to find the right one, the process would take longer than the age of the universe. Yet, inside your cells, a new protein string snaps into its correct shape in a few milliseconds.
The protein folding problem is the attempt to bridge this gap. It is the challenge of taking a one-dimensional amino acid sequence as the only input and computationally predicting its final three-dimensional structure.
Historically, scientists had to rely on slow, physical laboratory techniques to map these structures. Methods like X-ray crystallography involved freezing proteins into crystals and firing beams at them to measure the angles of the scattered light. Finding the structure of a single protein could take years of painstaking trial and error. Because genetic sequencing technology outpaced physical mapping, the scientific community accumulated hundreds of millions of known one-dimensional sequences, but mapped the 3D structures for only a tiny fraction of them. We had endless lines of source code, but no decompiler to show us the executing logic.
Solving the protein folding problem computationally gives us the exact mechanical blueprints of biology, turning drug discovery from a slow process of laboratory guessing into precise, targeted engineering.
If you enjoy the podcast and want to support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
2
Inside AlphaFold 2: Architecture Overview
3m 25s
A deep dive into the neural network architecture of AlphaFold 2. We break down the Evoformer block, Multiple Sequence Alignments (MSA), and Pair Representations.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 2 of 21. You want to predict the physical shape of a single protein sequence, but looking at that sequence alone is not enough. To figure out how it folds, you actually need to examine its evolutionary family tree to spot synchronized patterns of mutations across millions of years. Understanding how an algorithm processes those evolutionary patterns is exactly what we cover in Inside AlphaFold 2: Architecture Overview.
The architecture operates as a continuous flow of data, transforming a string of amino acid letters into a 3D shape. It begins with the target sequence. Before any neural network processing happens, AlphaFold searches massive biological databases to gather two specific inputs based on that sequence. The first input is the Multiple Sequence Alignment, or MSA. This is a collection of similar protein sequences from other organisms. If two amino acids in a sequence constantly mutate together across different species, they are likely physically touching in the final folded structure. The second input consists of structural templates. These are known 3D structures of proteins that are highly similar to your target sequence.
These raw MSAs and templates feed directly into the Embedding layer. The Embedding layer translates this biological data into two distinct mathematical formats that the neural network can process. These are the MSA representation and the Pair representation. The MSA representation is a matrix holding the evolutionary mutation history. The Pair representation is an abstract two-dimensional grid tracking the potential distance and physical relationship between every possible pair of amino acids in the sequence.
Once created, both representations enter the Evoformer stack. The Evoformer is the engine of AlphaFold 2, consisting of 48 distinct processing blocks. Here is the key insight. Inside each block, the MSA representation and the Pair representation talk to each other. They exchange information to refine their respective data. If the evolutionary data in the MSA representation strongly suggests two amino acids interact, it updates the Pair representation to pull them closer together on the abstract distance grid. Conversely, if the Pair representation realizes that placing two amino acids together violates physical space constraints, it updates the MSA representation to re-evaluate that evolutionary link. This cross-communication happens continuously as the data flows through all 48 blocks, producing a highly accurate map of internal relationships.
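To make the MSA-to-pair communication concrete, here is a stripped-down NumPy sketch of the idea. In the AlphaFold 2 paper this direction of the exchange is implemented as an "outer product mean"; the real Evoformer adds learned projections and gating, and the shapes and channel counts below are illustrative, not the real ones.

```python
import numpy as np

def outer_product_mean(msa_rep: np.ndarray) -> np.ndarray:
    """Toy version of the Evoformer's MSA-to-pair communication.

    msa_rep has shape (num_seqs, num_res, channels): one row per
    aligned sequence, one column per residue. Co-variation between
    residue columns i and j becomes an update for the (i, j) cell
    of the pair representation.
    """
    s, r, c = msa_rep.shape
    # Outer product over the channel dimension for every residue
    # pair, averaged over all sequences in the alignment.
    outer = np.einsum("sic,sjd->ijcd", msa_rep, msa_rep) / s
    # Flatten each (c, c) block into the pair representation's channels.
    return outer.reshape(r, r, c * c)

# A tiny fake MSA: 4 sequences, 6 residues, 3 channels.
pair_update = outer_product_mean(np.random.rand(4, 6, 3))
print(pair_update.shape)  # (6, 6, 9)
```

The resulting (residues × residues × channels) tensor is what gets added into the pair representation, which is why strong co-mutation signals in the MSA end up pulling residue pairs closer together on the abstract distance grid.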
People often assume the Evoformer generates the final physical shape, but it does not. The Evoformer only builds abstract mathematical representations of distances and evolutionary links. It outputs highly refined data matrices, not a physical object.
To get the actual folded shape, the data leaves the 48th Evoformer block and enters the Structure Module. The Structure Module takes the refined Pair representation and the original sequence data, and translates that abstract grid into actual 3D atomic coordinates. It assigns an exact X, Y, and Z position in space for every atom in the protein backbone and its side chains.
The success of AlphaFold 2 hinges on the Evoformer continuously forcing evolutionary history and spatial constraints to agree with one another before a single 3D coordinate is ever drawn. That is all for this one. Thanks for listening, and keep building!
3
Evaluating Predictions: pLDDT and PAE
4m 08s
How do you know if an AI-generated protein structure is accurate? Learn how to interpret pLDDT for local confidence and PAE for global domain positioning.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 3 of 21. The most powerful feature of an AI model is not always its output, but its ability to tell you exactly when it is guessing. If you look at a predicted protein model and assume every loop and domain is perfectly placed in reality, your downstream experiments will likely fail. To actually use these models safely, you need to understand Evaluating Predictions: pLDDT and PAE.
AlphaFold outputs two distinct metrics to quantify its uncertainty. The first is pLDDT, which stands for predicted Local Distance Difference Test. This evaluates local confidence. For every single amino acid residue in the protein, AlphaFold assigns a score between 0 and 100. This score tells you how confident the network is in the local backbone structure.
When a residue scores above 90, you are looking at very high confidence. At this level, even the side-chain orientations are usually reliable. A score between 70 and 90 still represents a good, trustworthy backbone prediction. As the score drops below 70, confidence gets shaky.
Here is the key insight. When you see a pLDDT score below 50, your first instinct might be that the model failed to find the right fold. That is usually incorrect. A very low pLDDT score often indicates an intrinsically disordered region. The protein physically lacks a fixed structure in isolation. It might be a flexible linker or a tail that only folds when it binds to another molecule. The AI is not failing; it is accurately predicting that this piece of the protein is naturally floppy.
While pLDDT is excellent for local folds, it has a major blind spot. It evaluates regions in isolation. Consider a protein with two completely separate structural domains connected by a long string of amino acids. You run the prediction, check the pLDDT, and see that both domains score above 90. Their internal structures are solid. However, pLDDT cannot tell you if those two domains are correctly positioned relative to each other in three dimensional space.
To solve this, AlphaFold provides a second metric called Predicted Aligned Error, or PAE. This metric evaluates the relative position of domains. PAE measures the expected distance error between any two specific residues in the protein. The logic asks a simple question: if we align the prediction to the true structure perfectly on residue X, how many Ångströms off will residue Y be?
This gives you a two dimensional grid of pairwise errors. Going back to the two domain protein scenario, if you check the PAE between residues inside the first domain, the error will be very low. The same applies inside the second domain. But if you check the PAE between a residue in domain one and a residue in domain two, the error might be very high. A high PAE between domains means their relative orientation is completely uncertain. The model knows exactly what the two separate shapes look like, but it has no idea what angle they sit at relative to one another.
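The interpretation rules above are easy to encode. Here is a minimal Python sketch using the thresholds from this episode: a pLDDT band classifier and a helper that averages the PAE between two residue ranges to decide whether a hinge angle can be trusted. The function names and the range-based domain encoding are mine, not part of any AlphaFold output format.

```python
def plddt_band(score: float) -> str:
    """Map a per-residue pLDDT score (0-100) to the confidence
    bands described in the episode."""
    if score > 90:
        return "very high"   # even side-chain orientations usually reliable
    if score > 70:
        return "confident"   # good, trustworthy backbone
    if score > 50:
        return "low"         # treat with caution
    return "very low"        # often an intrinsically disordered region

def cross_domain_pae(pae, domain_a: range, domain_b: range) -> float:
    """Mean PAE (in Angstroms) between two residue ranges.

    Low values mean the relative orientation of the domains is
    fixed; high values mean the hinge angle is a spatial guess.
    """
    values = [pae[i][j] for i in domain_a for j in domain_b]
    return sum(values) / len(values)

# Fabricated 4-residue PAE matrix: two tight domains, loose hinge.
pae = [[0.5, 0.5, 12.0, 12.0],
       [0.5, 0.5, 12.0, 12.0],
       [12.0, 12.0, 0.5, 0.5],
       [12.0, 12.0, 0.5, 0.5]]
print(plddt_band(93.5))                             # very high
print(cross_domain_pae(pae, range(0, 2), range(2, 4)))  # 12.0
```

In the fabricated matrix, both domains score low PAE internally, but the cross-domain average is high, which is exactly the "two solid domains, uncertain hinge" signature the episode describes.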
You must evaluate both metrics together to understand the full picture. You use pLDDT to trust the local folding and to identify regions of natural disorder. You rely on PAE to verify if the overall global arrangement of those distinct folded parts is actually fixed, or just a spatial guess by the algorithm. Never treat a predicted structure as a single rigid truth; treat it as a map of probabilities where every domain and linker carries its own proof of reliability. Thanks for spending a few minutes with me. Until next time, take it easy.
4
The AlphaFold Protein Structure Database
3m 36s
Before running massive computational pipelines, check if your protein is already solved. We explore the massive AlphaFold Database hosted by EMBL-EBI.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 4 of 21. Training a breakthrough machine learning model is impressive, but for a working biologist, a raw neural network is just another heavy tool to configure. To truly change the field, you do not just publish the model. You use it to fold nearly every known protein on Earth, and you put the results online for free. This is the AlphaFold Protein Structure Database.
Built in partnership between Google DeepMind and EMBL-EBI, this database holds over 200 million predicted structures. That massive number represents almost the entire known protein universe. Instead of setting up hardware and running the AlphaFold algorithm yourself, you bypass the compute entirely and query the completed predictions.
Consider the workflow for using the web interface. If you are investigating the Free fatty acid receptor 2, you go to alphafold dot ebi dot ac dot uk and type that name into the search bar. You can search by gene name or sequence, but the most precise method is using a UniProt accession number. Because the database maps directly to UniProt, every structure is tightly linked to existing biological metadata. When you select your target from the results, you land on the specific entry page for that protein.
This is the part that matters. The central feature of the entry page is an interactive 3D viewer, but it does not just show you the physical geometry. It visually represents the model's internal confidence in that geometry. AlphaFold scores its own accuracy for every single amino acid using a metric called pLDDT. The 3D viewer color-codes the physical model based on these exact scores. Dark blue indicates very high confidence in the structure. Light blue is confident. Yellow means low confidence, and orange means very low confidence.
When you rotate the Free fatty acid receptor 2 in the browser, you will see sharp, dark blue helical structures where the protein crosses the cell membrane. But the loose tails dangling off the ends might be bright orange. Those orange regions are rarely failures of the algorithm. They are usually intrinsically disordered regions, meaning parts of the protein that do not possess a fixed shape in biological reality.
Viewing the structure in the browser is good for a quick sanity check, but actual computational work requires the raw data. Below the viewer, you will find the download section. You can download the 3D coordinates in the standard PDB format or the modern mmCIF format. You drop these files straight into your local modeling software to measure distances or simulate molecular interactions.
Alongside the coordinate files, you must also download the Predicted Aligned Error data. This is provided as a simple JSON file. While the colors in the 3D viewer tell you if a local piece of the protein is accurate, the PAE JSON file tells you if the relative positions of two different pieces are accurate. It contains a matrix of error margins for the distances between every pair of residues. If your protein has two solid blue domains separated by a flexible hinge, the JSON data will tell you if you can actually trust the angle between them.
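Once you have the PAE JSON on disk, loading it is a few lines of Python. The sketch below assumes the layout used by recent database releases: a list containing one object whose "predicted_aligned_error" key holds the N-by-N matrix of Ångström errors (older dumps used a flat residue-pair list instead, so check one file before relying on this).

```python
import json
import numpy as np

def load_pae(path: str) -> np.ndarray:
    """Load a PAE matrix from an AlphaFold DB JSON download.

    Assumes a list with one object carrying an N x N matrix under
    'predicted_aligned_error' -- verify against your actual file.
    """
    with open(path) as fh:
        data = json.load(fh)
    return np.array(data[0]["predicted_aligned_error"], dtype=float)

# Minimal demonstration with a fabricated two-residue file.
with open("demo_pae.json", "w") as fh:
    json.dump([{"predicted_aligned_error": [[0.2, 4.1], [3.9, 0.3]]}], fh)

pae = load_pae("demo_pae.json")
print(pae.shape, pae.max())  # (2, 2) 4.1
```

With the matrix in a NumPy array, checking whether you can trust the angle across a flexible hinge is just slicing out the cross-domain block and looking at its average.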
The AlphaFold Protein Structure Database shifts your workflow entirely. You no longer spend days trying to predict a structure; you spend minutes looking it up, and your actual work becomes interpreting the confidence metrics attached to the download.
That is all for this one. Thanks for listening, and keep building!
5
Automating Discovery: The AlphaFold Database API
3m 52s
Learn how to build automated programmatic workflows to fetch protein structures at scale using the AlphaFold Database API.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 5 of 21. Real-world bioinformatics does not happen by manually clicking download buttons in a web browser. When you need to analyze hundreds of proteins, you build automated, scalable pipelines. That brings us to Automating Discovery: The AlphaFold Database API.
You know the database holds millions of structural predictions, but viewing one protein at a time on a website is unworkable for large projects. If you are running a comparative analysis across an entire family of enzymes, or screening targets for a drug discovery pipeline, you need those structures pulled directly into your computing environment. The AlphaFold Database provides a REST API to solve this exact bottleneck. It allows your code to query the database and programmatically retrieve structural data without human intervention.
The primary key for this automation is the UniProt ID. You can think of a UniProt ID as the universal barcode for any given protein. Your pipeline generally starts with a list of these IDs. Using a standard HTTP library in Python, you send a request to the API asking for the prediction data associated with a specific barcode.
This is the part that matters. When you make that request, the API does not hand back a massive file full of three-dimensional atomic coordinates. Serving heavy files over a synchronous API call is slow and fragile. Instead, the API returns a manifest. This manifest is a lightweight response that describes exactly what resources are available for your requested protein.
Conceptually, this manifest gives you the biological metadata and a directory of direct download links. It points to the actual 3D structure files in industry-standard formats. Crucially, it also provides links to the confidence metrics, such as the Predicted Aligned Error, or PAE matrix. You need both. In an automated pipeline, downloading the structure is only half the job. You must also download the PAE data so your downstream algorithms know which parts of the protein are highly confident and which parts are flexible.
Let us walk through how this logic flows in a typical Python script. You set up a loop to iterate over your list of UniProt IDs. Inside the loop, you make your network request for the first ID. Before doing anything else, you check the HTTP status code. Not every protein sequence in existence has a pre-calculated AlphaFold structure. If you request an unmapped protein, the database returns a standard four-oh-four not found error. Your script must catch this error gracefully, log the missing ID, and continue to the next iteration. If you ignore status codes and try to parse a missing response, your entire batch job will crash on the first unsupported protein.
When the request is successful, your script reads the returned manifest. You write logic to extract the specific download links you care about, which is usually the primary coordinate file and the corresponding error matrix file. Your script then makes secondary HTTP requests to those specific links, streaming the heavy data directly to your local storage or cloud bucket.
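The loop described above can be sketched with nothing but the standard library. The manifest endpoint path below matches the database's documented prediction API at the time of writing, but treat it and the manifest's field names as assumptions to verify against the current API docs before hard-coding keys.

```python
import json
import urllib.error
import urllib.request

# Manifest endpoint (verify against the current AlphaFold DB API docs).
API = "https://alphafold.ebi.ac.uk/api/prediction/{}"

def manifest_url(uniprot_id: str) -> str:
    """Build the manifest URL for one UniProt accession."""
    return API.format(uniprot_id)

def fetch_manifests(uniprot_ids):
    """Query the manifest for each ID, skipping unmapped proteins.

    Returns (found, missing): a dict of parsed manifests and a list
    of IDs with no precomputed structure. Inspect one manifest to
    find the coordinate and PAE download links before bulk download.
    """
    found, missing = {}, []
    for uid in uniprot_ids:
        try:
            with urllib.request.urlopen(manifest_url(uid), timeout=30) as resp:
                found[uid] = json.load(resp)
        except urllib.error.HTTPError as err:
            if err.code == 404:
                missing.append(uid)   # log the gap, continue the batch
            else:
                raise                 # fail loudly on anything unexpected
    return found, missing
```

Note that only the 404 case is swallowed: any other HTTP error re-raises, so a transient server failure stops the batch instead of silently producing an incomplete dataset.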
By separating the query from the download, the API keeps the initial discovery phase incredibly fast. You can quickly check the availability of thousands of proteins, build a validated list of links, and then handle the heavy downloading in bulk. The real value of the AlphaFold API is that it removes the human bottleneck, turning a vast static database into a fully programmable component of your research architecture. That is all for this one. Thanks for listening, and keep building!
6
Predicting Structures with ColabFold
3m 22s
Discover ColabFold, a faster alternative for AlphaFold inference that replaces Jackhmmer with MMseqs2 for lightning-fast sequence alignment.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 6 of 21. When you run a protein structure prediction, the neural network is rarely the bottleneck. The real waiting happens before the model even starts, dragging hours out of your day just to search through genetic databases. Predicting structures with ColabFold completely changes this math.
The standard pipeline for AlphaFold relies on a tool called Jackhmmer. Jackhmmer scans massive, terabyte-sized local databases to find evolutionary relatives of your target protein. This search builds the Multiple Sequence Alignment, or MSA. The MSA is the critical input that tells AlphaFold which parts of the protein are likely to interact based on evolutionary history. However, generating this alignment the traditional way takes hours of compute time and forces you to manage enormous datasets on your own infrastructure.
ColabFold modifies this pipeline by targeting that specific bottleneck. It replaces Jackhmmer with a different search algorithm called MMseqs2. MMseqs2 is heavily optimized for high-speed sequence searching. But ColabFold goes a step further than just swapping algorithms. Instead of making you run MMseqs2 and its associated databases on your local machine, ColabFold offloads this step entirely. It takes your input sequence and sends it to a dedicated, public MMseqs2 server via an API.
This server handles the heavy lifting. It searches its own centrally managed databases and returns the finished MSA directly to your environment. This architectural shift means you do not need local database storage at all.
This is exactly why you can run ColabFold effectively within a Jupyter Notebook hosted on cloud services like Google Colab or UCloud. The setup requires almost no infrastructure. You open the notebook and paste in your amino acid sequence. This could be a wild-type protein, or a custom mutated sequence you are experimenting with.
When you run the notebook, the execution happens in two distinct phases. First, your cloud environment makes a fast API call to the remote MMseqs2 server. The server computes the alignment and sends the MSA back to your notebook. This cuts a process that normally takes hours down to mere minutes or even seconds.
Now, the second piece of this workflow kicks in. Your notebook passes that retrieved MSA into the AlphaFold neural network. This is where the actual folding prediction happens. Unlike the database search, this inference step runs locally within your cloud instance. It utilizes the GPU attached to your Google Colab or UCloud session to calculate the spatial coordinates of the protein structure.
Because the slow database search is outsourced and accelerated, your overall iteration time is drastically reduced. You get the highly accurate neural network prediction you expect from AlphaFold, but you get it fast. If you need to test how a specific point mutation alters the physical shape of a protein, you simply edit a single letter in your sequence variable, run the cell again, and evaluate the new output.
Here is the key insight. ColabFold proves that by completely decoupling the data-heavy alignment search from the neural network inference, you can put world-class structural biology tools into a standard web browser.
That is all for this one. Thanks for listening, and keep building!
7
AlphaFold-Multimer: Predicting Protein Complexes
3m 42s
Proteins rarely act alone. Learn how AlphaFold-Multimer predicts the interactions and 3D structures of complex protein assemblies.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 7 of 21. Predicting a single protein is like guessing the shape of an isolated puzzle piece. But in biology, proteins rarely act alone. If you throw two random sequences into a standard prediction model, it will often just mash them together in three-dimensional space, regardless of whether they actually bind in nature. To figure out how multiple flexible pieces genuinely snap together, you need a different tool. That is AlphaFold-Multimer.
Standard AlphaFold is optimized for single polypeptide chains. AlphaFold-Multimer is a retrained version of the system specifically designed to predict protein complexes. It is trained to recognize and model both homomeric interfaces, where multiple copies of the same protein bind together, and heteromeric interfaces, where entirely different proteins interact.
When you provide multiple sequences to AlphaFold-Multimer, it will always return a predicted complex. The critical challenge is determining if that predicted interface is a real biological interaction or an artificial forced collision. The model does what you ask, so it will place the chains next to each other. To solve the problem of false interactions, AlphaFold-Multimer introduces a specialized confidence metric called the interface predicted Template Modeling score, or ipTM.
Standard confidence scores evaluate how well a localized piece of a chain folds. The ipTM score entirely ignores the internal structure of the individual proteins. Instead, it measures the accuracy of the relative positions of the chains. It specifically evaluates the confidence in the interface where the proteins meet.
Consider a scenario where you are predicting a two-chain antibody-antigen complex. You pass the amino acid sequence for the antibody and the sequence for the target antigen into the model. The system outputs a structure containing both molecules. First, you check the standard local confidence scores for each chain. They might be very high, indicating the model knows exactly how the isolated antibody and the isolated antigen fold.
Pay attention to this bit. You can have perfectly folded individual chains with an ipTM score close to zero. If the ipTM is low, the model is telling you that while the shapes are correct, it has no idea how they connect. It merely parked the two folded structures next to each other because you provided two sequences. The interface is meaningless.
However, if the ipTM score is high, typically anything above zero point eight, the model is confident in the interaction. A high ipTM means the system found structural and evolutionary evidence that these specific surfaces interlock. The model is not just forcing them together; it is predicting a genuine binding event.
To rank the final outputs, AlphaFold-Multimer calculates a combined confidence score. This combined metric is heavily weighted toward the interface, typically taking eighty percent of the ipTM score and adding twenty percent of the overall structural score. The model with the highest combined score is returned as your top prediction.
The structural accuracy of individual chains does not guarantee a valid interaction. The interface predicted Template Modeling score is the defining metric to prove the model discovered a genuine biological complex rather than just fulfilling a request to group two sequences.
That is all for this one. Thanks for listening, and keep building!
8
AlphaMissense: Predicting Variant Pathogenicity
4m 00s
Explore AlphaMissense, a specialized model that predicts whether a single letter change in a protein's sequence will cause disease.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 8 of 21. A single typo in a sequence of thousands of amino acids can be the difference between health and severe genetic disease. But when you sequence a patient's genome and find a novel mutation, how do you know if it is the root cause or just harmless biological noise? AlphaMissense provides a mathematical answer to that exact problem.
A missense variant is a genetic mutation where a single DNA letter change results in substituting one amino acid for another in the final protein. The average human genome contains thousands of these variants. Many are entirely benign. They alter a peripheral part of the protein without affecting its overall function. Others disrupt the protein entirely, leading to serious genetic conditions. Sorting the dangerous variants from the harmless ones is a massive bottleneck in clinical genetics.
AlphaMissense is a model specifically built to classify these effects. We need to clear up a common misconception right away. AlphaMissense does not predict the 3D structural changes caused by a mutation. It does not output a new file showing a deformed protein shape. Instead, it leverages the structural and evolutionary context already learned by the AlphaFold network to calculate a probability. It outputs a continuous pathogenicity score ranging from zero to one. This score represents the likelihood that a specific mutation causes disease.
Take a concrete scenario. You are analyzing a patient's genome and you isolate a novel single amino acid substitution in a critical gene. A valine has replaced an alanine at position 142. The patient is ill, but this exact mutation has never been documented in medical literature. You need to know if this substitution is pathogenic.
You query the AlphaMissense database for this specific variant. Under the hood, the model evaluates the evolutionary conservation of that exact position by looking at multiple sequence alignments. It checks if nature has tolerated changes at this position across millions of years of evolution. It also evaluates the structural context. An amino acid packed tightly into the hydrophobic core of the protein is much more sensitive to changes than one floating loosely on the exterior surface.
AlphaMissense processes these evolutionary and structural signals to return your score. A score close to zero means the variant is likely benign. The protein probably folds and functions normally. A score close to one flags it as likely pathogenic. The substitution almost certainly breaks the protein's stability or function.
To make this data actionable, the model applies predefined thresholds. Scores falling above the upper threshold are categorized as likely pathogenic. Scores below the lower threshold are likely benign. Anything falling in the middle band is classified as ambiguous. This classification allows you to instantly filter genomic noise. Instead of designing expensive, time-consuming lab experiments to test every unknown variant in the patient's genome, you immediately focus your clinical resources on the few mutations AlphaMissense flags with a high score.
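As a rough sketch of how that three-band triage works in practice, here is a minimal Python classifier. The cutoffs of roughly 0.34 and 0.564 are the ones commonly cited for the published AlphaMissense release, but treat them as assumptions and verify against the current resource before any clinical use; the variant names and scores below are invented purely for illustration.

```python
def classify_variant(score: float,
                     benign_cutoff: float = 0.34,
                     pathogenic_cutoff: float = 0.564) -> str:
    """Map a continuous AlphaMissense score (0 to 1) to a clinical band.

    Cutoffs are the commonly cited published defaults -- confirm them
    against the current AlphaMissense release before relying on them.
    """
    if score < benign_cutoff:
        return "likely_benign"
    if score > pathogenic_cutoff:
        return "likely_pathogenic"
    return "ambiguous"

# Filter a patient's novel variants down to the ones worth lab follow-up.
# These variants and scores are made up for the example.
variants = {"A142V": 0.92, "R80K": 0.12, "G55S": 0.45}
flagged = {v: s for v, s in variants.items()
           if classify_variant(s) == "likely_pathogenic"}
```

With those illustrative inputs, only the A142V substitution survives the filter, which is exactly the "focus your clinical resources" workflow described above.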
Here is the key insight. The true impact of this tool is its scale. AlphaMissense does not wait for queries. It has already calculated the pathogenicity score for all 71 million possible single amino acid substitutions across the entire human proteome, cataloging the results before a patient ever walks into a clinic.
If you enjoy the podcast and want to help keep the show going, you can support us by searching for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
9
Deploying AlphaFold 2 Locally
4m 25s
Take control of your infrastructure by deploying the open-source AlphaFold 2 pipeline locally using Docker and massive genetic databases.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 9 of 21. You might assume the biggest hurdle to predicting protein structures is finding a powerful enough GPU. But the real bottleneck is storage. Before you fold a single amino acid, you need to query massive, multi-terabyte evolutionary databases. This episode covers deploying AlphaFold 2 locally.
Running AlphaFold on your own hardware gives you strict privacy for proprietary sequences and removes the execution time limits found on public servers. DeepMind provides the open-source code directly on GitHub. To use it, you must first replicate the exact genetic reference environment the model uses to find similar sequences. This requires downloading local copies of public databases like BFD, MGnify, UniRef90, and PDB70. Combined, these uncompressed databases consume roughly three terabytes of storage.
Picture a DevOps engineer provisioning a new High Performance Computing node for a research team. The very first requirement is a three terabyte drive. You cannot use cheap, slow spinning disks here. You need fast NVMe storage. Searching through millions of genetic sequences generates aggressive disk read operations. If your storage is slow, the entire pipeline chokes before data ever reaches the neural network.
Here is the key insight. The local AlphaFold pipeline consists of two distinct compute phases, and they require completely different hardware profiles. Phase one generates the Multiple Sequence Alignment, or MSA. The system searches your terabytes of local databases using external bioinformatics tools. This phase uses exactly zero GPU cycles. It is entirely bottlenecked by CPU cores, system memory, and disk speed. Phase two is the actual structural inference. This is where the neural network takes over to generate the 3D coordinates, and this phase strictly requires a GPU with significant VRAM. If you provision a node with a massive GPU but a weak CPU, your prediction job will stall for hours in the alignment phase.
Managing the software dependencies for both of these phases is notoriously difficult. You have to align specific versions of CUDA, Jax, and the sequence alignment tools. To solve this, the official repository relies heavily on Docker. Containerization ensures that this complex execution environment remains isolated and consistent across different operating systems.
Setting up the deployment involves two primary scripts. First, you run a provided bash script to download the data. You point it at your NVMe drive and let it run. Depending on your network connection, fetching and extracting these files can take a few days. Once the data rests on your fast storage, you are ready for execution.
You control the pipeline using a provided Python wrapper script called run_docker.py. You execute this script from the terminal, passing it a few mandatory arguments. You supply the path to the input FASTA file containing your target amino acid sequence. You supply the path to your multi-terabyte database directory. You specify an output directory. Finally, you set a maximum template release date, which restricts the model from using known structural templates published after a specific point in time.
The Python wrapper takes these arguments and builds the Docker container. It mounts your local host directories into the container as volumes, allowing the isolated software to read your databases and write output files back to your disk. The container then takes over, orchestrating the CPU-heavy alignment search and the GPU-heavy inference process automatically.
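A minimal sketch of what that invocation looks like. The flag names follow the public AlphaFold 2 repository, but verify them against the version you deploy; all paths here are hypothetical placeholders for your own mount points.

```python
import subprocess

# Hypothetical paths -- substitute your own NVMe mount points.
cmd = [
    "python3", "docker/run_docker.py",
    "--fasta_paths=/data/targets/my_protein.fasta",  # target amino acid sequence
    "--data_dir=/mnt/nvme/alphafold_dbs",            # ~3 TB of genetic databases
    "--output_dir=/data/results",                    # predictions land here
    "--max_template_date=2022-01-01",                # ignore templates after this date
]
# On a provisioned node you would hand this to the shell:
# subprocess.run(cmd, check=True)
print(" ".join(cmd))
```

The wrapper translates these arguments into the Docker volume mounts and container entrypoint described above, so you never write the docker run command by hand.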
Deploying AlphaFold locally is fundamentally an infrastructure challenge, requiring you to balance fast disk speed for the CPU-heavy alignment phase with capable VRAM for the GPU-heavy folding phase.
Thanks for spending a few minutes with me. Until next time, take it easy.
10
Introducing AlphaFold 3: Beyond Proteins
3m 25s
AlphaFold v3.0 fundamentally shifts the landscape by modeling DNA, RNA, ligands, and ions, painting a complete picture of the cellular environment.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 10 of 21. For years, if you wanted to see how a potential drug interacted with a target protein, you had to run your sequence, generate a protein structure, and then use entirely separate software to computationally dock the drug into place. You were forcing a rigid chemical into a rigid model. AlphaFold 3 changes this entirely by folding everything together at the same time. Today we are Introducing AlphaFold 3: Beyond Proteins.
Previous versions of AlphaFold were strictly focused on predicting the structure of proteins based on their amino acid sequences. They were exceptionally good at it, but real biological systems do not operate as naked proteins floating in a vacuum. Proteins interact with DNA to turn genes on and off. They rely on metal ions to catalyze reactions. They are blocked or activated by small drug molecules.
Before this update, modeling these complex interactions required a fragmented pipeline. You would predict the protein structure first, then perhaps model the DNA separately, and finally try to calculate how they physically fit together. This sequential approach has a major flaw. It assumes the protein structure is static. In reality, proteins dynamically change shape when they bind to other molecules.
Here is the key insight. AlphaFold 3 transitions from being just a protein folding tool to a generalized biomolecular complex predictor. It expands its native vocabulary beyond amino acids to include three entirely new categories of molecules.
First, it now supports nucleic acids, meaning you can input both DNA and RNA sequences. Second, it natively understands small molecule ligands, which covers the vast majority of pharmaceutical drugs. Third, it supports the inclusion of essential ions that often sit at the core of functional proteins.
Instead of predicting these components in isolation, AlphaFold 3 predicts the joint structure of all these distinct molecular entities simultaneously. It calculates the interactions between every atom across the entire complex in a single pass.
Think about how this changes a standard modeling scenario. You are no longer looking at an isolated target. You can now define an input that contains a protein sequence, a specific strand of DNA, and the chemical definition of a small inhibitor drug. The system evaluates them together. The resulting output shows the protein actively gripping the DNA strand, while the small drug molecule is wedged perfectly into the binding pocket. Because everything was folded in the same computational space, the protein model automatically reflects the structural shifts caused by both the DNA and the drug.
This is not simply the old AlphaFold engine with a docking algorithm bolted onto the end. It is a completely distinct model architecture trained directly on how different classes of biological molecules interact in physical space.
The most significant shift with AlphaFold 3 is that you are no longer generating isolated biological parts, but rather predicting complete, interacting molecular machines exactly as they exist in the physical world.
That is all for this one. Thanks for listening, and keep building!
11
AlphaFold Server: The AF3 Gateway
3m 32s
Get hands-on with AlphaFold v3.0 using the AlphaFold Server, a web-based GUI that removes the need for local hardware and complex setups.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 11 of 21. You no longer need an engineering degree or a supercomputer to run the world's most advanced biomolecular AI. The barrier to entry has dropped to a simple web browser. That is the AlphaFold Server, your direct gateway to AlphaFold 3.
Historically, running complex structure predictions meant managing software dependencies, downloading terabytes of genetic databases, and maintaining powerful GPUs. The AlphaFold Server bypasses all of that. It is a completely graphical web interface that allows biologists and chemists to run AlphaFold 3 jobs without writing a single line of code. All the multiple sequence alignments, template searching, and diffusion-based folding happen remotely on shared compute infrastructure.
Before we walk through a prediction, we need to clarify two strict boundaries. First, the AlphaFold Server is exclusively for non-commercial use. Second, because it relies on provided compute resources, your account is subject to daily job limits. You cannot use it to brute-force a massive screening pipeline. It is designed for targeted, single-job experimental design.
Here is how you build a complex, multi-chain prediction job in the interface. The entire workflow revolves around adding separate entities to a shared workspace.
Start with your main protein. You click the button to add a protein entity. A text field appears. You simply paste your one-letter amino acid sequence directly into that box. If you expect this protein to form a homodimer, you do not need to paste the sequence twice. You just change the copy number setting to two, and the system handles the stoichiometry.
The second step is adding the interacting molecules. Suppose you want to see how that protein binds to a specific segment of DNA. You click to add a DNA entity and paste the sequence for the forward strand. To build the actual double helix, you add another DNA entity and paste the reverse complementary sequence. AlphaFold 3 will model them together in the same three-dimensional space, predicting both the base pairing of the DNA and its structural interface with your protein.
This is where it gets interesting for non-polymer components. The server supports ligands, ions, and cofactors natively. If your protein requires a specific metal ion to stabilize its fold, you click to add an ion entity. You do not need to look up complex chemical identifiers or write SMILES strings for standard components. The server provides a built-in dropdown menu. You just search for a common ion, select Magnesium or Zinc, define how many copies you need, and the interface locks it in.
Once your workspace contains the protein sequence, the two DNA strands, and the selected ion, you hit the submit button. That is the entire setup.
Behind the scenes, the server takes over. It manages the alignment pipelines, feeds the data into the AlphaFold 3 neural network, and runs the structural diffusion process. When the job completes, the results load directly in your browser. You get an embedded 3D viewer to inspect the predicted complex immediately. More importantly, you get a clean package to download, containing the predicted atomic coordinate files and all the associated confidence metrics for your local analysis.
The true power of the server is not just immediate access to the model, but the sheer speed of iteration it provides when testing structural hypotheses.
That is all for this one. Thanks for listening, and keep building!
12
Interpreting AlphaFold 3 Results
4m 08s
Evaluating AlphaFold v3.0 predictions requires new metrics. Learn how to interpret clash scores and nucleic acid confidences.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 12 of 21. Predicting exactly how a new drug binds to a target protein feels like a breakthrough, right up until you realize the AI just placed two solid carbon atoms in the exact same physical space. A beautifully rendered structure is useless if it defies the laws of physics. Today we cover Interpreting AlphaFold 3 Results to help you separate biological insight from computational hallucinations.
AlphaFold 3 generates a three-dimensional coordinate file along with a set of specific metrics. Evaluating a result requires reading these metrics together, because a high score in one area can mask a catastrophic failure in another. First, you look at local confidence using a metric called pLDDT. This grades every single atom or residue on a scale of zero to one hundred. A high score means the model is very certain about the specific local geometry of that chain.
But local confidence alone is not enough. You can have a perfectly predicted protein and a perfectly predicted small molecule ligand that are floating independently in space. To know if they actually interact, you must check the Predicted Aligned Error, or PAE. PAE measures the expected position error in Angstroms. Lower numbers are better.
When you predict complexes, like a protein binding to a DNA strand or a drug molecule, you focus on the cross-PAE. Think of the PAE output as a heat map matrix. The diagonal shows how confident the model is about a single chain's internal structure. The off-diagonal areas show the confidence between different entities. If the PAE between your protein and your ligand is low, AlphaFold is highly confident in their relative positions. It strongly believes the ligand sits exactly in that specific pocket.
Here is the key insight. High confidence does not automatically mean physical reality. AlphaFold 3 is a deep learning model, not a molecular dynamics simulator. It prioritizes recognized spatial patterns over strict physics.
This brings us to the clash score. Because AlphaFold 3 models ligands, nucleic acids, and modified residues directly without a strict physics-based relaxation step, outputs can contain severe steric clashes. A clash happens when two atoms are assigned coordinates that place them impossibly close together, effectively overlapping. The server calculates and provides a clash score for your output. A high clash score is a massive red flag.
Consider a scenario where you analyze an output and the main protein chain has a pLDDT over 90. The cross-PAE between the protein and your target ligand is very low. Interface metrics suggest a strong, confident binding event. On paper, it looks like a perfect docking prediction.
Then you check the clash score and it is highly elevated. If you load the physical coordinates into a viewer, you will see the ligand passing straight through the protein backbone. The neural network recognized that the ligand belongs in that general binding pocket, but it failed to route the atoms around the existing side chains. It hallucinated a ghost molecule that ignores solid matter.
When you evaluate an AlphaFold 3 result, you must check structural confidence and physical realism simultaneously. Use pLDDT to verify the shape of individual molecules. Use PAE to confirm they interact where you expect them to. Then, check the clash score to ensure the model has not violated basic chemistry to force that interaction to happen.
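One way to operationalize that three-way check is a small triage function. Everything here is illustrative rather than official: pLDDT above 70 is the commonly cited "confident" band, but the PAE and clash cutoffs below are plausible starting points you would tune for your own pipeline, not published thresholds.

```python
def triage(plddt: float, cross_pae: float, clash_score: float) -> str:
    """Toy acceptance check for an AF3 complex prediction.

    Thresholds are illustrative assumptions, not official cutoffs.
    plddt: mean pLDDT of the chains (0-100, higher is better).
    cross_pae: protein-ligand PAE in Angstroms (lower is better).
    clash_score: steric clash metric (higher means more overlap).
    """
    if clash_score > 0.5:   # illustrative cutoff -- inspect anything elevated
        return "reject: steric clashes, physically invalid geometry"
    if plddt < 70:
        return "reject: low local confidence in the chains themselves"
    if cross_pae > 10:
        return "uncertain: chains confident, relative placement is not"
    return "accept: confident and physically plausible interface"
```

Note the ordering: the clash check comes first, because the scenario above shows a prediction can score well on pLDDT and cross-PAE while still being physically impossible.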
Remember that a deep learning model outputs a spatial hypothesis based on pattern recognition, so a highly confident prediction is only valid if the atoms actually respect the laws of physics.
That is all for this one. Thanks for listening, and keep building!
13
AlphaFold 3 Inference Pipeline
3m 56s
Learn how to orchestrate the open-source AlphaFold v3.0 pipeline, manage JSON inputs, and run the containerized application.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 13 of 21. DeepMind open-sourced their latest structural biology model, but if you clone the repository and try to run it, it will fail immediately. The code is entirely public, but the actual neural network driving the predictions is tightly controlled. Today, we are looking at the AlphaFold 3 Inference Pipeline and how you actually navigate deploying it.
To run AlphaFold 3 locally, you must unite two separate pieces: the execution code and the model parameters. The repository gives you the pipeline logic. For the parameters, specifically the v3.0.1 weights, you have to request access separately. Google requires you to submit a form detailing your non-commercial research use case. Once they approve your request, you receive a link to download the heavy model weights. This split approach allows DeepMind to restrict commercial use while still letting researchers inspect and run the open-source pipeline on their own hardware.
After downloading the weights, your next step is defining what you want the model to predict. Earlier versions of AlphaFold mostly dealt with single protein chains, so you could just pass a simple text string of amino acids. AlphaFold 3 models complex interactions across diverse molecule types, so a basic string is no longer enough. Instead, you create a configuration file. This file acts as a complete manifest for your biological target. You specify the exact molecules involved in your experiment. This might be a standard protein, a strand of DNA or RNA, or a specific small molecule ligand.
In this same configuration file, you also define randomness seeds. Because the new model architecture relies on a diffusion process to generate the final three-dimensional structure, introducing slight variations in the random seed produces different possible structural states. By defining multiple seeds in your configuration, you instruct the pipeline to generate distinct predictions for the exact same input. You also declare a dialect version in this file, which simply tells the parsing engine to apply the AlphaFold 3 validation rules to your manifest.
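The manifest might look roughly like this. The field names follow the published AlphaFold 3 input format, but the job name, chain IDs, sequences, and ligand code are placeholders, so check the repository's schema documentation before relying on any of them.

```python
import json

# Illustrative AlphaFold 3 input manifest -- sequences and IDs are
# placeholders; consult the repository's input docs for the full schema.
job = {
    "name": "protein_dna_inhibitor",
    "modelSeeds": [1, 7, 42],   # one diffusion run per seed
    "sequences": [
        {"protein": {"id": "A",
                     "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}},
        {"dna": {"id": "B", "sequence": "GATTACAGATTACA"}},
        {"ligand": {"id": "C", "ccdCodes": ["ATP"]}},
    ],
    "dialect": "alphafold3",    # tells the parser which validation rules apply
    "version": 1,
}
manifest = json.dumps(job, indent=2)
```

The three seeds instruct the pipeline to generate three distinct predictions for the identical input, which is how you sample different possible structural states.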
With your weights downloaded and your configuration file ready, you move to execution. Because this pipeline depends on highly specific versions of machine learning libraries and GPU drivers, running it directly on your host operating system is risky. The official workflow relies on Docker. You build a container image directly from the provided repository, locking in the perfect environment.
To actually run the inference, you launch this container and connect it to your host machine by mounting three critical paths. First, you mount the folder containing your input configuration file. Second, you mount the directory holding those massive, approved model weights you downloaded earlier. Finally, you mount an empty directory where the container can write its results.
Once the container boots, you invoke the main run script. The pipeline validates your input manifest, loads the weights from your mounted directory into GPU memory, and begins the structural prediction. When the process finishes, your mounted output folder will contain the final coordinates of your requested molecules, formatted and ready for visualization.
The primary challenge of this pipeline is not the underlying code, but managing the logistics of securing the model weights and meticulously structuring your configuration file to map out complex biological interactions. If you would like to support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
14
Data Pipelines & Hardware Requirements
3m 32s
Master the separation of concerns in AlphaFold v3.0 by decoupling the CPU-heavy data pipeline from the GPU-heavy inference engine.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 14 of 21. The biggest mistake engineers make when deploying AlphaFold is letting expensive GPUs sit idle while waiting for basic database queries to finish. You provision an A100, kick off a job, and watch your utilization flatline at zero percent for hours. The solution is separating your execution environments, which is exactly what we are covering today with Data Pipelines and Hardware Requirements.
AlphaFold 3 performs two fundamentally different types of work. First, it searches massive genetic databases to build Multiple Sequence Alignments and find structural templates. Second, it passes that processed data into the neural network to predict the final three-dimensional structure. The database search is entirely bound by CPU and storage read speeds. The neural network prediction requires a GPU. If you run the default setup, AlphaFold does both steps in sequence on the same machine. This means your expensive GPU does absolutely nothing while your CPU spends hours grinding through terabytes of reference databases.
Here is the key insight. The repository provides two specific execution flags to decouple these workloads. The first flag is run_data_pipeline. When you execute the run script with this flag enabled and run_inference disabled, AlphaFold only performs the genetic and template searches. It reads your input configuration, queries the local databases, and generates all the necessary sequence features. It then saves this intermediate data state to disk. Crucially, this step requires zero GPU resources. It relies strictly on a machine with many CPU cores and fast NVMe solid state drives to handle the heavy input and output operations.
Once the data pipeline finishes building the features, you use the second flag. You execute the script again, but you set run_data_pipeline to false and run_inference to true. AlphaFold bypasses the database search entirely. It directly loads the cached feature data you just generated, initializes the model weights, and executes the neural network forward pass. This step is completely GPU bound. It relies exclusively on your hardware accelerator and its high-bandwidth memory.
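A sketch of the two-phase invocation. The flag names follow the public AlphaFold 3 repository, but verify them against your checkout; the paths and file names are placeholders.

```python
def stage_cmd(json_path: str, out_dir: str,
              data_pipeline: bool, inference: bool) -> list[str]:
    """Build the run-script command for one phase. Paths are illustrative."""
    return [
        "python3", "run_alphafold.py",
        f"--json_path={json_path}",
        f"--output_dir={out_dir}",
        f"--run_data_pipeline={'true' if data_pipeline else 'false'}",
        f"--run_inference={'true' if inference else 'false'}",
    ]

# Phase 1 on a CPU node: MSA and template search only, features cached to disk.
cpu_cmd = stage_cmd("input.json", "/shared/features", True, False)
# Phase 2 on the GPU node: skip the search, load cached features, run the network.
gpu_cmd = stage_cmd("/shared/features/job_data.json", "/shared/out", False, True)
# Each list would be handed to subprocess.run(..., check=True) on its node.
```

The shared output directory is the hand-off point: the CPU fleet writes feature files there, and the GPU node polls it and runs inference on whatever appears.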
This explicit separation is how you design a proper enterprise infrastructure. Instead of attaching massive database drives to multiple expensive GPU machines, you provision a fleet of cost effective CPU nodes. You distribute your incoming prediction requests across these CPU nodes, using them to aggressively query databases and generate alignments in parallel. Every time a CPU node finishes a data pipeline job, it drops the resulting feature files onto a shared storage system.
Meanwhile, you maintain a single, highly utilized node with a powerful GPU. This machine exists strictly for inference. It constantly monitors the shared storage. When a new feature file appears, it picks it up, runs the neural network prediction in minutes, writes out the final molecular structure, and immediately pulls the next file. Your GPU never waits on a storage bottleneck or a database search again.
Decoupling the CPU bound data pipeline from the GPU bound inference is the single most effective way to scale protein structure prediction while keeping hardware costs under control. Thanks for spending a few minutes with me. Until next time, take it easy.
15
The Memory Bottleneck: O(n³) Attention
4m 01s
We dive into the FastFold research paper to understand why AlphaFold's Evoformer module causes catastrophic Out-of-Memory errors on long sequences.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 15 of 21. Your protein model is surprisingly small, sitting under one hundred million parameters. You successfully predict the structure of a 500-residue sequence on a standard 16 gigabyte GPU, but when you input a 1500-residue sequence, the process instantly crashes with an out of memory error, suddenly demanding over 80 gigabytes of VRAM. This catastrophic scaling happens because of a specific architectural trait called The Memory Bottleneck: O(n³) Attention.
According to the FastFold paper by Cheng et al., this explosive memory consumption has nothing to do with the model weights. The entire AlphaFold model holds only 93 million parameters. The bottleneck lies entirely in the intermediate activations generated during the forward pass. As the sequence length grows, the memory required to store these intermediate tensors scales cubically.
To understand why, we have to look inside the Evoformer. The Evoformer is the main structural trunk of AlphaFold, consisting of 48 stacked blocks. Inside each block, the model relies on a specialized attention mechanism. Standard transformers, like the ones used in language models, typically have attention mechanisms where memory scales quadratically with the sequence length. They only compare a one-dimensional sequence against itself. AlphaFold is different. It uses a two-dimensional intermediate representation to model pair-wise interactions between amino acids.
Here is the key insight. To accurately predict three-dimensional protein structures, the network cannot just look at pairs of residues in isolation. It must evaluate triangular relationships to ensure the predicted distances between three points are physically possible in three-dimensional space. To calculate this, the attention module generates intermediate activation tensors. The memory footprint of these tensors follows a strict formula. It is the sequence length cubed, multiplied by the number of attention heads, multiplied by the byte size of the data type. AlphaFold uses BFloat16 precision, which takes two bytes per value.
Let us trace the math for the developer who crashed their GPU. When you pass a 500-residue sequence into the model, the sequence length is 500. The cube of 500 is 125 million. If you have four attention heads, you multiply 125 million by four, and then by two bytes for the precision. A single attention layer creates about one gigabyte of intermediate activations. Your 16 gigabyte GPU handles this with room to spare.
Now you change the input to a 1500-residue sequence. You only tripled the input length. But because the scaling is cubic, you must cube 1500, which yields over 3.3 billion. Multiply that by four heads and two bytes, and that exact same single attention layer now demands almost 27 gigabytes of memory just to store its activations. Because the network passes this data through 48 consecutive Evoformer blocks, the total memory requirement instantly exceeds 80 gigabytes. The tensors swell so rapidly that the hardware simply aborts the process.
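You can reproduce that arithmetic directly. This is just the formula from above, sequence length cubed times attention heads times the two-byte BFloat16 width, expressed as a one-liner:

```python
def attn_activation_gb(seq_len: int, heads: int = 4,
                       bytes_per_val: int = 2) -> float:
    """Activation memory (in GB, 1e9 bytes) for ONE triangular-attention
    layer: n^3 * heads * dtype size. BFloat16 is two bytes per value."""
    return seq_len ** 3 * heads * bytes_per_val / 1e9

print(attn_activation_gb(500))    # 1.0 -- a single layer fits a 16 GB card
print(attn_activation_gb(1500))   # 27.0 -- one layer alone overflows it
```

Tripling the input from 500 to 1500 residues multiplies the per-layer activation memory by 27, and with 48 stacked Evoformer blocks in flight the total demand blows past 80 gigabytes.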
This is why AlphaFold requires heavy optimization for longer sequences. The raw parameter count of a neural network tells you almost nothing about its hardware requirements when the internal geometry forces intermediate data to grow in three dimensions simultaneously.
If you want to support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
16
Dynamic Axial Parallelism (DAP)
4m 00s
Learn how the FastFold architecture solves AlphaFold's memory limits by splitting intermediate activations across multiple GPUs using Dynamic Axial Parallelism.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 16 of 21. If an intermediate tensor is too massive for a single GPU, the standard instinct is to slice the model parameters across multiple cards. But when your neural network only has 93 million parameters and your activation tensors consume 20 gigabytes, slicing the model solves nothing. You need to slice the geometry of the data itself. That is exactly what Dynamic Axial Parallelism resolves.
To understand why this approach exists, we have to look at the bottlenecks identified in the FastFold paper. The core engine of AlphaFold is the Evoformer block. It processes multiple sequence alignments and pair representations, which essentially function as massive two-dimensional grids of data. Standard Tensor Parallelism tries to handle heavy workloads by splitting the linear layer weights across different GPUs. For AlphaFold, this creates severe inefficiencies. Tensor Parallelism requires frequent, heavy synchronization, triggering up to twelve communication steps per block. It also only applies to the attention and feed-forward modules. Worse, its scaling is hard-capped by the number of attention heads. In the AlphaFold pair stack, that limits you to a maximum of just four GPUs.
Dynamic Axial Parallelism, or DAP, abandons the weight-splitting approach entirely. Instead, DAP keeps the complete model parameters intact on every single device. The model itself is never distributed. The intermediate activations are.
The Evoformer processes data along two sequence dimensions, but the mathematical operations only ever happen along one dimension at a time. This behavior allows DAP to cleanly divide the data along the inactive dimension.
You can picture this in a multi-GPU setup. A massive grid of sequence data enters the layer. DAP slices this input horizontally. GPU zero takes the first chunk of rows, GPU one takes the next, and so on. Each GPU then calculates the attention for its specific slice of data. Because every GPU holds a full copy of the model weights, and because the computation is isolated to the horizontal axis, the GPUs do not need to talk to each other. They compute their chunks independently.
Here is the key insight. Once that horizontal calculation finishes, the model needs to process the data along the vertical sequence dimension. To make this happen, the GPUs execute an all-to-all communication step. They transpose the geometry of the data across the network. GPU zero scatters its column fragments to the other devices and gathers the pieces it needs for the next phase. The orientation of the split flips. Now the data is sliced vertically across the cluster. The GPUs immediately run the next layer of attention calculations, again with zero cross-talk during the actual math.
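Here is a toy simulation of that split, compute, transpose cycle, using NumPy arrays to stand in for devices. Real DAP performs the transposition with collective all-to-all communication across GPUs; this sketch only demonstrates the data movement pattern, with sizes chosen small for clarity.

```python
import numpy as np

n_gpus, n = 4, 8
pair = np.arange(n * n, dtype=float).reshape(n, n)  # the pair representation

# Row-parallel phase: each "GPU" owns a horizontal slice and computes
# along the row axis with no cross-device communication.
row_shards = np.split(pair, n_gpus, axis=0)          # row_shards[i] is (2, 8)

# All-to-all transposition: device i sends its j-th column block to
# device j, then stitches the blocks it receives into a full-height
# column shard. The orientation of the split flips.
blocks = [np.split(s, n_gpus, axis=1) for s in row_shards]   # blocks[i][j]
col_shards = [np.concatenate([blocks[i][j] for i in range(n_gpus)], axis=0)
              for j in range(n_gpus)]                # col_shards[j] is (8, 2)

# Column-parallel phase now proceeds, again with zero cross-talk.
assert np.array_equal(col_shards[0], pair[:, 0:2])
```

Notice that no device ever holds the full grid after the initial split; only fragment-sized messages cross the (simulated) network, which is where the order-of-magnitude communication saving comes from.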
This transposition strategy drops the communication volume by a full order of magnitude compared to standard Tensor Parallelism. It also enables parallelism across every single computational module in the Evoformer. Because the massive activation tensors are evenly distributed across the cluster, overall memory consumption drops heavily, preventing individual devices from crashing under the weight of long sequences.
When your hardware bottleneck is driven by the sheer physical size of intermediate representations rather than the depth of the neural network, you do not fragment the architecture. You distribute the axes of the data.
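The slice, compute, transpose cycle is easy to simulate with plain NumPy arrays standing in for devices. Everything below is my own illustrative sketch, not FastFold code: `axial_split` and `all_to_all_transpose` are hypothetical helper names, and the "devices" are just list entries.

```python
import numpy as np

def axial_split(x, n_dev, axis):
    # slice activations along one axis, one chunk per device
    # (every device keeps a full copy of the model weights)
    return np.split(x, n_dev, axis=axis)

def all_to_all_transpose(shards, n_dev):
    # flip a row-split layout into a column-split layout: each device
    # scatters its column fragments and gathers the pieces it needs
    frags = [np.split(s, n_dev, axis=1) for s in shards]
    return [np.concatenate([frags[i][j] for i in range(n_dev)], axis=0)
            for j in range(n_dev)]

# toy pair representation: an 8 x 8 grid split across 2 "devices"
x = np.arange(64, dtype=float).reshape(8, 8)
row_shards = axial_split(x, 2, axis=0)            # row-wise attention runs locally
col_shards = all_to_all_transpose(row_shards, 2)  # layout flips for column-wise work
# each device now holds one vertical slice of the full grid
assert np.array_equal(np.concatenate(col_shards, axis=1), x)
```

The only network traffic in this loop is the fragment exchange inside `all_to_all_transpose`, which is why DAP's communication volume stays so far below weight-splitting approaches.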
That is all for this one. Thanks for listening, and keep building!
17
AutoChunk: Optimizing Memory for Long Sequences
3m 36s
Manual memory chunking is tedious. We explore the AutoChunk algorithm from the FastFold paper, which automatically optimizes tensor partitioning during inference.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 17 of 21. What if your compiler were smart enough to automatically slice your matrix math so it perfectly fits your hardware memory budget? Consider an engineer spending days manually profiling and slicing code to prevent Out of Memory errors. Every time the sequence length changes, the fixed slices fail. Contrast that with a system that dynamically estimates memory footprints and generates partitioned code on the fly. AutoChunk is the algorithm that makes this happen.
In the FastFold paper, researchers tackled the massive memory bottlenecks of running long sequences through protein prediction models. The standard approach to avoid running out of memory is chunking. You partition a tensor along dimensions that do not interact during a computation, process those smaller blocks sequentially, and stitch them back together. Usually, this is a manual, labor-intensive process. You profile the code, guess the best slice sizes, and hardcode the boundaries. The FastFold paper points out that fixed chunking schemes are inefficient because they cannot adapt to varying sequence dimensions or specific operational memory spikes.
Here is the key insight. The FastFold researchers observed that 95 percent of the operations in the model use less than 20 percent of peak memory. Module-level chunking is entirely unnecessary. You only need to chunk the specific operations that cause the memory spikes.
AutoChunk automates this by analyzing your code as a computational graph. It takes that graph and your specific hardware memory budget as inputs. In plain spoken logic, the algorithm runs a continuous loop. While a memory strategy is still needed, it estimates the memory consumption of the graph to find the single node with the highest usage. A node here is a basic operation, like an addition or a linear projection.
Once it flags that peak memory node, AutoChunk determines the maximum possible chunk range extending outward from it. It computes this by checking all active nodes currently holding data in memory. Next, it identifies every possible chunk strategy within that range by tracing tensor dimensions upwards through the graph. A dimension can only be chunked if it is a free dimension, meaning no computation occurs across it during that specific range of operations.
Because tracing every output upwards is computationally expensive, AutoChunk uses a two-stage search. Stage one checks if the start and end nodes of a range meet the chunking rules. If they do, stage two performs a deep verification for all the intermediate nodes. After mapping the possibilities, AutoChunk selects the strategy that keeps memory strictly under the budget while minimizing the penalty to execution speed. Finally, the algorithm exits the loop and passes the chosen strategies to a code generator, inserting the optimal partition logic directly into the graph.
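The outer loop described above can be sketched as a small greedy search. This is a schematic reduction, assuming a flat list of (name, estimated memory) pairs and a caller-supplied candidate generator; the real AutoChunk operates on a traced computational graph with the two-stage verification.

```python
def autochunk(nodes, budget, strategies_for):
    """nodes: list of (name, est_memory) pairs.
    strategies_for(peak): yields (resulting_peak_memory, speed_penalty)
    candidates for chunking the region around the peak node."""
    chosen = {}
    while True:
        peak = max(nodes, key=lambda n: n[1])   # node with highest estimated memory
        if peak[1] <= budget:
            return chosen                        # whole graph fits: done
        # keep only candidates that stay under budget, then minimize the
        # penalty to execution speed
        ok = [s for s in strategies_for(peak) if s[0] <= budget]
        if not ok:
            raise MemoryError(f"no chunk strategy fits the budget at {peak[0]}")
        best = min(ok, key=lambda s: s[1])
        chosen[peak[0]] = best
        nodes = [(n, best[0] if n == peak[0] else m) for n, m in nodes]

# hypothetical graph: one spiking op, one modest op
nodes = [("triangle_mult", 10.0), ("softmax", 3.0)]
def strategies_for(peak):
    # made-up candidates: (resulting peak memory, relative speed penalty)
    return [(4.0, 0.1), (2.0, 0.5)]

plan = autochunk(nodes, budget=5.0, strategies_for=strategies_for)
# only the single outlier operation gets a chunk strategy
assert plan == {"triangle_mult": (4.0, 0.1)}
```

Note how the loop never touches `softmax`: it already fits, which mirrors the 95-percent observation above.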
You no longer hardcode matrix slices and hope they hold up in production. The system analyzes the active memory state and writes the partition logic for you. The most useful takeaway is that chunking should never be a static, module-wide setting; by isolating and partitioning only the specific outlier operations that cause peak memory spikes, you can process vastly longer sequences without destroying execution speed. That is all for this one. Thanks for listening, and keep building!
18
Overcoming Communication Imbalance
3m 47s
Distributed training is plagued by stragglers. Learn how the ScaleFold architecture redesigns the AlphaFold data pipeline to prevent slow CPU nodes from stalling GPU clusters.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 18 of 21. You spin up a massive, multi-million dollar cluster with a thousand GPUs to train your model, but your utilization flatlines. Your entire distributed ring grinds to a halt because a single machine drew a highly complex evolutionary sequence and needs 100 seconds to generate an alignment. Overcoming communication imbalance is how you fix this exact bottleneck.
The underlying issue here is known as the straggler problem. In distributed training, every GPU in your cluster processes a local batch of data. At the end of a training step, all GPUs must reach a synchronization point to share their gradients. Your cluster only moves as fast as its absolute slowest machine. As detailed in the ScaleFold paper by Zhu et al., preparing training batches for protein folding is highly variable. Some amino acid sequences are short and simple. Others require massive multi-sequence alignments, taking up to three orders of magnitude longer to prepare. A fast batch takes a fraction of a second. A slow batch takes 100 seconds.
If you use the default PyTorch data loader, it generates batches in a strict, deterministic order. If one of your dataloader workers gets stuck preparing a massive sequence, the training process waits for that specific batch to finish. Even if other workers have already finished preparing subsequent batches, the pipeline is blocked. Your training step finishes, your local GPU goes idle, and because it cannot reach the synchronization point, the other 999 GPUs in your cluster sit completely idle waiting for it.
The ScaleFold authors solved this by building a Non-Blocking Data Pipeline. Instead of enforcing a rigid sequence, the pipeline yields a batch the moment any processing batch becomes ready.
Here is the key insight. The system decouples the data preparation order from the data consumption order using a priority queue. First, you assign multiple dataloader workers to prepare batches asynchronously. When a worker receives a batch to process, the system tags that batch with its original sequence index. This index becomes its priority score.
Let us trace the logic. Worker one gets batch A, a fast sequence. Worker two gets batch B, the 100-second sequence. Worker three gets batch C, another fast sequence. Worker one finishes batch A immediately and pushes it to the priority queue. The training process consumes it and executes a step. Worker two is still crunching the massive batch B. Under the default system, the GPU would stop and wait. But in this non-blocking pipeline, worker three finishes batch C and pushes it to the priority queue. Since batch B is still not ready, the queue simply hands batch C to the training process. The GPUs continue training without a pause.
Eventually, worker two finishes batch B and pushes it to the queue. Because batch B has an earlier original index, the priority queue immediately places it at the absolute front of the line. The training process consumes it on the very next step.
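That trace maps directly onto a priority queue keyed by the original batch index. Here is a minimal single-process sketch using Python's `heapq`; the class name and structure are mine, not ScaleFold's code, and real workers would push from separate processes.

```python
import heapq

class NonBlockingBatchQueue:
    """Best-effort ordering: the trainer always consumes the
    earliest-indexed batch that happens to be ready."""
    def __init__(self):
        self._heap = []
    def push(self, index, batch):
        heapq.heappush(self._heap, (index, batch))  # original index = priority score
    def pop(self):
        return heapq.heappop(self._heap)            # earliest ready batch wins

q = NonBlockingBatchQueue()
q.push(0, "batch A")          # fast worker finishes A
consumed = [q.pop()[1]]       # trainer takes A immediately
q.push(2, "batch C")          # C finishes while B (index 1) is still crunching
consumed.append(q.pop()[1])   # trainer takes C instead of stalling on B
q.push(1, "batch B")          # the 100-second batch finally lands
consumed.append(q.pop()[1])   # B jumps straight to the front
assert consumed == ["batch A", "batch C", "batch B"]
```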
This priority mechanism guarantees a best-effort sample ordering. The exact global sequence of data changes slightly across different training runs, but the paper confirms this does not negatively impact training convergence. You eliminate the idle time caused by background CPU usage peaks and complex data samples. The slowest data batch no longer dictates the execution speed of your fastest machine.
That is all for this one. Thanks for listening, and keep building!
19
Kernel Fusion and GPU Optimization
3m 58s
AlphaFold launches over 150,000 separate CUDA kernels per step. We explore how the ScaleFold paper uses OpenAI's Triton to fuse LayerNorm and Multi-Head Attention.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 19 of 21. Sometimes the GPU is not actually slow at doing math. It is just spending all its time waiting for the CPU to give it the next instruction, leaving your profiler full of massive gaps of white space. The solution to this problem is Kernel Fusion and GPU Optimization.
According to the ScaleFold paper, training the AlphaFold model requires launching over 150,000 individual operations per step. Most of these are memory-bound kernels, like small LayerNorms or fragmented element-wise operations. Every time the CPU tells the GPU to run a PyTorch operation, there is a launch overhead. When you string together 150,000 small operations, that overhead completely eclipses the actual math.
Look at AlphaFold's Multi-Head Attention, which takes about thirty-four percent of the total training step time. It is not standard attention. AlphaFold adds a specific pair bias term to the logits matrix right before the softmax operation. In default PyTorch, this creates a chain of separate events. First, you launch a kernel for the batched matrix multiplication. The GPU reads the data from global memory, does the math, and writes the result back. Then, the CPU launches a second kernel to add the pair bias. The GPU reads the matrix back from memory, adds the bias, and writes it back again. Finally, a third kernel launches for the softmax. This constant round-trip to global memory starves the GPU.
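Written out in NumPy, that unfused chain looks like the following. Each commented step corresponds to a separate kernel launch plus a full round-trip through global memory in eager PyTorch; the shapes are illustrative, not AlphaFold's actual dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = rng.normal(size=(3, n, d))
pair_bias = rng.normal(size=(n, n))

logits = q @ k.T / np.sqrt(d)   # kernel 1: matmul, write logits to global memory
logits = logits + pair_bias     # kernel 2: read logits back, add bias, write again
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # kernel 3: softmax, another round-trip
out = weights @ v               # kernel 4: final matmul

assert out.shape == (n, d)
```

A fused kernel performs all four steps while `logits` and `weights` live in on-chip SRAM, and only `out` ever touches global memory.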
The authors of the ScaleFold paper fix this by writing custom kernels using the OpenAI Triton compiler. Instead of separate steps, they fuse the entire sequence into a single kernel.
Here is the key insight. By fusing the operations, the GPU loads the input data into its fast, on-chip SRAM just once. It performs the matrix multiplication, adds the pair bias, applies the softmax, and does the final multiplication directly inside the SRAM. It only writes to global memory when the entire Multi-Head Attention block is finished. Standard optimized libraries like FlashAttention do not work here because of that unique pair bias injection, making a custom Triton kernel strictly necessary to bypass the memory bandwidth bottleneck.
This approach extends to other fragmented parts of the model. LayerNorm consumes fourteen percent of the step time because AlphaFold uses small dimensions, typically 128 or 256. When using Dynamic Axial Parallelism, or DAP, these problem sizes are scaled down even further, leaving the hardware underutilized. ScaleFold introduces a fused LayerNorm kernel where a single CUDA thread block processes multiple input rows at once. It calculates normalization statistics in a single pass rather than using expensive iterative methods, and it uses a two-step reduction in the backward pass to avoid atomic operations.
Even the optimizer gets fused. The ScaleFold authors combined the Adam optimizer and Stochastic Weight Averaging into one single kernel. Intermediate values between the optimizer steps stay in GPU registers, bypassing memory reads completely. For the rest of the model, they rely on PyTorch compiler tools to automatically fuse the remaining fragmented operations, particularly in serial components like the Structure Module.
When you are dealing with thousands of small, memory-bound operations, your bottleneck is not teraflops; it is memory bandwidth and CPU launch overhead. Fusing those operations into single, continuous blocks of execution is the only way to actually saturate the hardware.
Thanks for listening, happy coding everyone!
20
Building a High-Throughput Pipeline
3m 46s
From evaluating model weights asynchronously to leveraging CUDA graphs, learn the system architecture secrets to running AlphaFold at massive scale.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 20 of 21. Scaling an AI model from 128 GPUs to 2,000 GPUs is not just about buying more hardware; it requires completely redesigning how the system breathes. When you add that much compute, you uncover massive bottlenecks in how your host machines talk to your accelerators. Resolving those bottlenecks requires Building a High-Throughput Pipeline, which is exactly how researchers recently dropped AlphaFold pretraining from seven days down to ten hours.
The blueprint for this architecture comes from the ScaleFold paper. The authors identified a brutal scaling reality. As you distribute a model across thousands of GPUs, the math workload per individual GPU shrinks. The GPUs finish their calculations so fast that the CPU cannot issue new kernel instructions quickly enough to keep them busy. The CPU overhead becomes the dominant bottleneck. To fix this, you remove the CPU from the execution loop using CUDA Graphs.
A CUDA Graph captures a sequence of GPU operations and their memory allocations into a single static graph. Once captured, the GPU executes the entire graph directly without waiting for the CPU to dispatch each individual kernel. Here is the key insight. You cannot just apply standard CUDA Graphs to AlphaFold. The AlphaFold architecture uses a recycling mechanism, feeding predictions back into the model dynamically. This creates a dynamic computation graph. If the operations change, a standard CUDA Graph breaks and must be recaptured, which ruins the performance gain. The ScaleFold paper solves this by designing a CUDA Graph cache. Instead of one rigid graph, the system captures and stores multiple graphs representing the different recycling scenarios. When the dynamic execution shifts, the system simply pulls the correct pre-compiled graph from the cache. The CPU is bypassed completely.
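The cache pattern itself is hardware-independent. In this sketch (all names are mine), `capture` stands in for recording a `torch.cuda.CUDAGraph` for one recycling configuration; here it just builds a placeholder callable so the lookup logic is visible.

```python
class GraphCache:
    """Capture once per dynamic scenario, then replay with zero CPU dispatch."""
    def __init__(self, capture):
        self._capture = capture   # expensive one-time capture per configuration
        self._graphs = {}
    def replay(self, num_recycles, *args):
        if num_recycles not in self._graphs:
            self._graphs[num_recycles] = self._capture(num_recycles)
        return self._graphs[num_recycles](*args)

captures = []
def capture(num_recycles):
    captures.append(num_recycles)       # record how often capture actually runs
    return lambda x: x + num_recycles   # placeholder for a replayable graph

cache = GraphCache(capture)
results = [cache.replay(r, 10) for r in (1, 3, 1, 3, 3)]
# only two captures ever happen; every repeat hits the cache
assert captures == [1, 3]
assert results == [11, 13, 11, 13, 13]
```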
Now, the second piece of this high-throughput architecture. Once your training step time is highly optimized, a new bottleneck appears. In a standard pipeline, training nodes pause periodically to run validation metrics. According to the ScaleFold paper, as step times shrink, this evaluation phase can consume up to 43 percent of the total pipeline time. Your massive, expensive training cluster sits idle for nearly half the time just checking its own work.
The solution is Asynchronous Evaluation. You completely decouple validation from the training loop. Training nodes never pause. They constantly calculate gradients, update weights, and stream model checkpoints to a separate, dedicated pool of evaluation nodes. In the ScaleFold implementation, out of roughly 2,000 GPUs, only 32 were dedicated to evaluation. The rest did nothing but train. However, moving evaluation to separate nodes introduces a race condition. The evaluation nodes must finish validating a checkpoint before the training nodes produce the next one. If evaluation falls behind, your pipeline stalls. To guarantee the evaluation nodes keep up, the system bypasses disk storage entirely. The entire evaluation dataset is cached directly into CPU DRAM on the validation nodes.
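The decoupling can be sketched with a checkpoint queue and a dedicated evaluation thread whose dataset already sits in memory. The structure below is my own minimal illustration, not ScaleFold code; real checkpoints would be model weights streamed over the network.

```python
import queue
import threading

ckpt_queue = queue.Queue()
eval_scores = []
eval_data = list(range(5))   # stands in for the DRAM-cached validation set

def evaluator():
    # dedicated evaluation worker: validates checkpoints as they arrive
    while True:
        step, weights = ckpt_queue.get()
        if step is None:
            break
        eval_scores.append((step, sum(weights * x for x in eval_data)))

worker = threading.Thread(target=evaluator)
worker.start()

weights = 1
for step in range(3):                # the training loop never blocks on eval
    weights += 1                     # pretend gradient update
    ckpt_queue.put((step, weights))  # stream the checkpoint, keep training
ckpt_queue.put((None, None))         # shutdown signal
worker.join()

assert [s for s, _ in eval_scores] == [0, 1, 2]
```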
By eliminating CPU dispatch overhead with cached CUDA Graphs and offloading validation to memory-cached asynchronous nodes, the pipeline never stops moving. The hardware is finally saturated. When architecting at scale, your primary job is no longer optimizing the math; your primary job is preventing the GPUs from waiting. That is it for today. Thanks for listening — go build something cool.
21
The Future: Flow-Matching with SimpleFold
4m 12s
Do we really need complex, domain-specific architectures to fold proteins? We explore SimpleFold, an experimental model that uses standard transformers and flow-matching.
Hi, this is Alex from DEV STORIES DOT EU. AlphaFold: Protein Structure Prediction, episode 21 of 21. For years, the industry assumed that predicting a protein structure required incredibly specialized, custom neural network designs. You needed multiple sequence alignments, explicit pair representations, and computationally heavy triangular updates just to get a viable result. But what if a standard, generic transformer could do the exact same job? The answer is a concept called flow-matching with SimpleFold.
The SimpleFold paper by Apple introduces a radical departure from the rigid architectures we have seen in traditional folding models. It strips away the domain-specific heuristics entirely. Traditional models hard-code biological intuition into the network. They use multiple sequence alignments to find evolutionary clues and triangle updates to enforce geometric rules. This requires massive compute and highly specialized engineering. SimpleFold abandons these mechanics, opting instead for a flow-matching generative model built exclusively on general-purpose transformer blocks.
SimpleFold treats folding as a conditional generative task. Think of text-to-image models, where a text prompt guides the generation of a picture. Here, the amino acid sequence is the prompt, and the output is the all-atom three-dimensional coordinate structure. It achieves this using flow-matching. Flow-matching generates data by defining a continuous path from a simple noise distribution to a complex data distribution. During training, the model learns a time-dependent velocity field. It integrates an ordinary differential equation over time, gradually moving random noise toward the true atomic coordinates.
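The integration loop can be shown in a few lines of NumPy. Everything here is a toy, not the SimpleFold model: instead of a learned velocity field, the sketch plugs in the closed-form straight-line velocity toward a known target, which isolates the noise-to-data ODE integration the episode describes.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=(10, 3))   # stands in for true atom coordinates

def velocity(x, t):
    # along the linear path x_t = (1 - t) * noise + t * target, the
    # conditional velocity field is simply (target - x) / (1 - t)
    return (target - x) / max(1.0 - t, 1e-6)

x = rng.normal(size=(10, 3))        # start from pure noise
steps = 100
dt = 1.0 / steps
for i in range(steps):
    x = x + dt * velocity(x, i * dt)  # Euler step of the ODE

# the noise has flowed onto the target structure
assert np.allclose(x, target, atol=1e-6)
```

In the real model the velocity comes from the residue trunk conditioned on the timestep and the ESM2 sequence embedding, and sampling different starting noise yields different conformations of the ensemble.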
The process starts when a frozen pretrained protein language model, specifically ESM2, converts the input amino acid sequence into a sequence embedding. At the same time, an atom encoder takes noisy atomic coordinates and processes them into atom tokens. Here is the key insight. Instead of maintaining complex, memory-heavy pairwise interaction maps, SimpleFold uses a grouping operation. It simply averages the atom tokens belonging to the same residue into a single residue token.
These residue tokens are concatenated with the sequence embeddings and passed into the residue trunk. This trunk contains the bulk of the model parameters. It is made up entirely of standard transformer blocks with adaptive layers conditioned on the flow timestep. There are no geometric equivariant modules and no triangular math. Just standard attention and scaling operations. Finally, an ungrouping operation broadcasts the updated residue tokens back to their individual atoms. An atom decoder then predicts the velocity field to update the final atom positions.
During training, the system does not sample time uniformly. It heavily oversamples timesteps closer to the clean data manifold. Because protein structures have a strict coarse-to-fine hierarchy, this late-stage focus forces the model to learn highly refined structures, including delicate side chains. Furthermore, because the model uses a generative objective rather than a deterministic regression objective, it naturally captures uncertainty. If you pass the same sequence through SimpleFold multiple times, it can generate an ensemble of different valid conformations, accurately reflecting how proteins actually move and exist in nature.
SimpleFold proves that if you map the problem correctly, a general-purpose transformer can learn the underlying physics of protein folding directly from the data. I encourage you to read the paper, explore the official documentation, and try running the code on your own hardware. You can also visit devstories.eu to suggest topics for our next series. That is all for this one. Thanks for listening, and keep building!