Season 22 · 15 Episodes · 58 min · 2026

Mastering DeepChem

v2.8, 2026 Edition. A comprehensive guide to DeepChem, a framework for democratizing deep learning in the sciences, covering everything from data handling and MoleculeNet to graph featurizers, specialized ML models, molecular docking, and reinforcement learning.

Scientific Computing · Deep Learning for Science · Cheminformatics
1
The DeepChem Project
An introduction to the DeepChem project and its mission to democratize deep learning for science. We cover how it evolved from chemical applications to a broader suite of scientific machine learning tools.
3m 48s
2
Managing Scientific Datasets
Explore DeepChem's Dataset abstraction for handling large-scale scientific data. Learn the critical differences between NumpyDataset and DiskDataset for out-of-core memory management.
3m 33s
3
MoleculeNet Benchmarks
Discover MoleculeNet, the premier benchmark suite curated within DeepChem. We discuss how standardizing datasets like Tox21 and QM9 accelerates computational chemistry.
4m 09s
4
Feature Engineering for Molecules
Learn how DeepChem translates chemical structures into machine-readable numbers using Featurizers. We explore the CircularFingerprint method for mapping SMILES strings to bit vectors.
3m 59s
5
Graph Convolution Featurizers
Move beyond flat bit vectors and explore how DeepChem represents molecules as mathematical graphs. We cover ConvMolFeaturizer and MolGraphConvFeaturizer.
3m 48s
6
Scientifically Aware Splitting
Discover why standard random splits fail on scientific datasets. We explore RandomStratifiedSplitter and how to correctly validate models on highly imbalanced multi-task data.
3m 49s
7
Taming Data with Transformers
Learn how to normalize wild scientific distributions using DeepChem Transformers. We discuss the NormalizationTransformer and MinMaxTransformer for stable training.
3m 54s
8
The Model API and Scikit-Learn Wrappers
Explore DeepChem's unified Model interface and how to wrap traditional algorithms using SklearnModel. Learn why sometimes the best solution isn't a deep neural network.
3m 04s
9
Specialized Molecular Graph Models
Dive into deep learning architectures built specifically for chemistry. We cover Graph Convolutional Networks (GCNModel) and Message Passing Neural Networks (MPNNModel).
3m 49s
10
Evaluating Scientific Models
Learn why standard accuracy fails in scientific ML. We explore DeepChem's Metric class, the Matthews Correlation Coefficient, and how to evaluate imbalanced multi-task models.
4m 00s
11
Intelligent Hyperparameter Tuning
Move beyond brute-force grid search. Discover how to use GaussianProcessHyperparamOpt in DeepChem to intelligently navigate complex hyperparameter spaces.
3m 55s
12
Metalearning for Low Data Regimes
Explore Model-Agnostic Meta-Learning (MAML) in DeepChem. Learn how to train models that can rapidly adapt to new, expensive scientific experiments with very little data.
3m 55s
13
Binding Pocket Discovery
Understand the geometry of protein-ligand interactions. We explore DeepChem's ConvexHullPocketFinder for algorithmically locating binding grooves on 3D protein structures.
3m 57s
14
Pose Generation with Vina and Gnina
Take the next step in molecular docking by computing binding poses. Learn how VinaPoseGenerator and GninaPoseGenerator score spatial geometries to predict interactions.
4m 25s
15
Reinforcement Learning in Science
Discover how reinforcement learning can autonomously design molecules. We cover DeepChem's Environment and Policy abstractions alongside the Advantage Actor-Critic (A2C) algorithm.
4m 17s

Episodes

1

The DeepChem Project

3m 48s

An introduction to the DeepChem project and its mission to democratize deep learning for science. We cover how it evolved from chemical applications to a broader suite of scientific machine learning tools.

Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 1 of 15. You want to apply machine learning to a scientific problem, so you reach for standard tools, but those standard tools treat molecules and proteins like flat data arrays, completely ignoring the underlying physics. The tool that bridges this gap is the DeepChem project. People often assume DeepChem is a standalone tensor engine, something you use instead of PyTorch, TensorFlow, or JAX. That is incorrect. DeepChem is a domain-specific toolset built directly on top of those generic frameworks. It acts as a translator. It takes care of the complex engineering required to turn messy physical data into a format that a standard neural network can process. The project exists to democratize deep learning for science. When it started, the focus was strictly on chemistry. The initial goal was to make drug discovery accessible, giving researchers the tools to predict chemical properties without needing a massive proprietary laboratory. But as the field grew, so did the framework. Today, DeepChem has quietly evolved into a central hub for almost any scientific deep learning application. To understand the scope, consider the difference in domains it handles. On Monday, you might use DeepChem to predict the aqueous solubility of a new drug-like molecule. This requires a model that understands molecular bonds and quantum states. By Thursday, you might use the same framework to analyze a microscopy image to count individual cells. Those are vastly different physical problems. DeepChem abstracts away the underlying boilerplate for both, providing specialized tools for molecular machine learning, bioinformatics, and even materials science. Here is the key insight. The hardest part of scientific machine learning is data representation. A standard deep learning model has no concept of atomic structure or biological sequences. DeepChem solves this by providing highly tuned featurizers. 
A featurizer takes a raw scientific object, like a chemical compound written in standard text notation, and mathematically translates it into a graph or a vector. After featurization, the information goes into specialized DeepChem dataset objects. These objects are designed to efficiently manage large collections of scientific data on disk. This prevents memory crashes when you process millions of complex compounds. DeepChem also addresses how scientific models are evaluated. In standard machine learning, you typically split your training and testing data randomly. But in chemistry, random splitting causes models to memorize the training data rather than learning the actual physics. To fix this, DeepChem provides specialized splitters. For example, a scaffold splitter separates molecules based on their core two-dimensional structure. This ensures your testing data represents entirely new chemical families, forcing the model to prove it actually generalizes to unseen compounds. Once your data is prepared and split, DeepChem provides an entire suite of pre-built models tailored for these datasets. You pass your scientific objects to the featurizer, DeepChem converts them, handles the splits, feeds them into the underlying PyTorch or TensorFlow layer, and outputs a prediction. The real power of DeepChem is that it encodes domain knowledge directly into your data pipeline, allowing you to focus purely on the scientific discovery rather than the structural plumbing of machine learning. If you want to help keep the show going, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
2

Managing Scientific Datasets

3m 33s

Explore DeepChem's Dataset abstraction for handling large-scale scientific data. Learn the critical differences between NumpyDataset and DiskDataset for out-of-core memory management.

Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 2 of 15. The biggest bottleneck in scientific machine learning is almost never the model architecture. It is trying to load massive files without crashing your RAM. Managing Scientific Datasets is how you get past that memory wall. When moving into DeepChem, a common instinct is to load everything into a Pandas DataFrame. DataFrames are excellent for two-dimensional tabular data. But scientific machine learning frequently requires higher-dimensional arrays, multiple simultaneous prediction tasks, and per-task sample importance. DeepChem uses its own Dataset abstraction because it natively binds features, multi-task labels, and sample weights together in a strict format that models can consume efficiently. Every Dataset object in DeepChem guarantees the presence of four parallel arrays. First, you have X, which holds your numerical features. Second, you have y, holding your labels or the targets you are trying to predict. Third, you have w, the weights array. This is where it gets interesting. Scientific datasets are often sparse. You might have experimental data showing a molecule is effective against one target, but no data on whether it is toxic to the liver. Instead of dropping the molecule entirely or inventing a fake label, you set the weight for the missing task to zero. The model learns from the data you have and ignores the gaps. Finally, you have the ids array, which stores a unique identifier for each sample, like a chemical SMILES string. Having a standardized format is useful, but the real power of the Dataset abstraction is how it manages system memory. DeepChem provides two primary ways to store these four arrays. If your dataset is small enough to fit into your system memory, you use a NumpyDataset. Under the hood, this simply holds standard NumPy arrays in RAM. It provides extremely fast access and is ideal for prototyping or working with smaller collections of molecules. 
The limitation of NumpyDataset becomes obvious when dealing with real-world scientific data. Suppose you are working with a 100 gigabyte dataset of dense three-dimensional crystal structures. Attempting to load that into standard in-memory arrays will instantly crash a typical machine. To solve this, DeepChem provides the DiskDataset. A DiskDataset does exactly what the name implies. It stores the X, y, w, and id arrays across multiple small files, or shards, directly on your hard drive. You specify a data directory, and DeepChem manages the storage. When you train a model, the DiskDataset only pulls the current batch of data into RAM. Once the model processes that batch, the memory is freed up for the next one. The transition between these two formats is entirely transparent to the rest of your code. You can write a training loop, test it locally using a small NumpyDataset, and then deploy that exact same code on a cluster pointing to a massive DiskDataset. The model simply asks the dataset for the next batch, and the dataset handles whether that batch comes from RAM or is streamed from disk. Designing your data pipelines around this out-of-core streaming from day one guarantees your infrastructure will survive the jump from a few thousand experimental records to millions of generated structures. Thanks for listening, happy coding everyone!
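The four-array contract described above can be made concrete with a plain-NumPy sketch of how a zero weight silently masks a missing label during a loss calculation. This mirrors the role of the w array, but it is illustrative arithmetic written for this episode, not DeepChem's actual loss code.

```python
import numpy as np

# Four parallel arrays: features X, multi-task labels y, weights w, and ids.
X = np.array([[0.1, 0.5], [0.9, 0.2], [0.4, 0.4]])   # 3 samples, 2 features
y = np.array([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])   # 2 tasks per sample
w = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])   # w=0: no assay result for sample 0, task 1
ids = np.array(["CCO", "c1ccccc1", "CC(=O)O"])       # SMILES strings as identifiers

def masked_mse(y_true, y_pred, w):
    # Weighted squared error: entries with w=0 contribute nothing to the loss.
    return np.sum(w * (y_true - y_pred) ** 2) / np.sum(w)

y_pred = np.zeros_like(y)
loss = masked_mse(y, y_pred, w)  # the missing (sample 0, task 1) label is ignored
```

In DeepChem itself you would hand these same four arrays to dc.data.NumpyDataset, or to dc.data.DiskDataset.from_numpy once they no longer fit in RAM; the weight semantics are identical in both cases.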
3

MoleculeNet Benchmarks

4m 09s

Discover MoleculeNet, the premier benchmark suite curated within DeepChem. We discuss how standardizing datasets like Tox21 and QM9 accelerates computational chemistry.

Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 3 of 15. You read a paper claiming a new graph neural network achieves state-of-the-art results on chemical properties. But when you look closely, they used a custom data split, a proprietary dataset, and an obscure metric, making their claim mathematically meaningless. Standardized comparison is the only way scientific progress happens, and in computational chemistry, that standard is the MoleculeNet Benchmark suite. A common mistake is thinking MoleculeNet is just a static repository. It is not a downloadable folder of CSV files sitting on a server that you have to parse and clean yourself. It is a curated, deeply integrated suite of datasets baked directly into DeepChem through the dc dot molnet module. It serves as the ImageNet of molecular machine learning. It provides a shared set of tasks, standardized splits, and evaluated baselines across different domains. The suite categorizes its data into specific scientific areas to test different modeling capabilities. You have quantum mechanics datasets, like QM9, which contains geometric and energetic properties for small molecules. You have physical chemistry datasets for predicting solubility or hydration free energy. And you have biophysics and physiology datasets, including toxicity benchmarks like Tox21, which measures how thousands of environmental chemicals interact with specific biological pathways. Let us look at a practical scenario. You just designed a new toxicity prediction model, and you need to mathematically prove it outperforms previous baselines. You do not need to hunt down the raw Tox21 data, handle the missing values, or write a custom parser. Instead, you call the load tox 21 function from the dc dot molnet module. This is where it gets interesting. When you invoke a load function in MoleculeNet, it does not just return raw text. 
It processes the data dynamically using a featurizer you specify in the arguments, like circular fingerprints or graph structures. The function typically returns a tuple containing three core elements. First, it gives you a list of the task names. For Tox21, these are the twelve specific biological targets you are trying to predict. Second, it returns the dataset itself, which is already neatly partitioned into training, validation, and test subsets. Third, it provides the transformers used to normalize or scale the data during the loading phase. You take that pre-split Tox21 training set, feed it to your new model, and then evaluate the predictions against the test set. Because every other researcher uses this exact same API and featurization pipeline, your final receiver operating characteristic score can be directly and fairly compared against published baselines. MoleculeNet also standardizes how you evaluate generalization through splitting strategies. The API supports multiple ways to divide your data. While you can use a basic random split, chemistry often requires a scaffold split. A scaffold split separates molecules based on their core two-dimensional structures. This forces the test set to contain molecular frameworks the model never saw during training. It tests if your model actually learned underlying chemical rules, rather than just memorizing small variations. By default, the load functions apply the splitting method that makes the most scientific sense for that specific dataset. The hardest part of applied machine learning is not writing the model architecture, it is proving the model actually generalizes to unseen data without data leakage. MoleculeNet gives you the strict, standardized playground required to make those proofs stick. That is all for this one. Thanks for listening, and keep building!
4

Feature Engineering for Molecules

3m 59s

Learn how DeepChem translates chemical structures into machine-readable numbers using Featurizers. We explore the CircularFingerprint method for mapping SMILES strings to bit vectors.

Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 4 of 15. A neural network cannot digest a chemical structure drawn on a whiteboard. Even if you write that structure out as a text string, standard machine learning algorithms still cannot read it. They need the data pre-chewed into numerical arrays. The bridge between a chemical representation and a mathematical model is Feature Engineering for Molecules. In cheminformatics, we often represent molecules as SMILES strings. These are sequences of characters defining atoms and bonds. A carbon ring becomes a specific text pattern. But if you try to hand a raw string to a Random Forest or a support vector machine, it will fail. A mathematical model requires a fixed-length numerical vector as its input. Transforming that chemical text into an array of numbers is what we call featurization. DeepChem handles this step with a dedicated set of classes called featurizers. Before looking at how DeepChem does this, there is a common point of confusion to clear up. When people hear about generating a vector representation of text, they immediately think of learned embeddings like those used in modern language models. Chemical fingerprints are not learned embeddings. There are no weights being updated, and no neural networks are involved in creating the feature itself. A fingerprint is a deterministic hashing algorithm. If you process the exact same molecule on two different machines, you will get the exact same array of ones and zeros. The most common featurizer you will use for baseline models in DeepChem is the Circular Fingerprint. This implements a method known as Extended Connectivity Fingerprints. Here is how the logic flows. The algorithm looks at every heavy atom in your molecule. From each atom, it radiates outward in a circle, looking at the neighboring atoms and bonds up to a specific radius. You can control this radius, but looking two bonds away is standard practice. 
It captures these local structural fragments, runs them through a hash function, and maps them to a specific index in a fixed-length array. The result is a bit vector. A one at a specific index means a particular chemical substructure is present. A zero means it is absent. Imagine you have a list of SMILES strings representing hundreds of small, drug-like molecules. You want to train a Random Forest model to predict their toxicity. First, you initialize the CircularFingerprint class in DeepChem. You configure it to output a size of 1024. This is the part that matters. Every single molecule, regardless of whether it has ten atoms or fifty, will be converted into an array of exactly 1024 bits. You then pass your list of SMILES strings to the featurize method of this class. The algorithm processes each string independently and returns a two-dimensional matrix. The rows represent your molecules, and the 1024 columns represent the presence or absence of specific substructures. Because the output is just a standard numerical matrix, you are no longer constrained to chemistry-specific tools. You can hand that matrix directly to standard machine learning libraries. Because fingerprints rely on fixed-length deterministic hashes rather than learned semantics, collisions can happen, meaning two different complex structural fragments might occasionally map to the very same bit. Despite this, they remain the fastest, most reliable baseline for turning abstract chemistry into concrete math. That is all for this one. Thanks for listening, and keep building!
5

Graph Convolution Featurizers

3m 48s

Move beyond flat bit vectors and explore how DeepChem represents molecules as mathematical graphs. We cover ConvMolFeaturizer and MolGraphConvFeaturizer.

Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 5 of 15. If you represent a chemical compound as a flat string of characters, you completely strip away the physical topology that dictates how it interacts with the world. Molecules are not linear sentences, they are interconnected structures. To preserve that structural reality in your datasets, you need Graph Convolution Featurizers. When you hear the word convolution, you might picture an image model sliding a filter across a fixed, square grid of pixels. You need to drop that mental model. Atoms do not sit in uniform grids. A carbon atom might connect to four neighbors, while an oxygen connects to two, and a hydrogen just one. Graph operations handle this arbitrary, irregular connectivity, and graph featurizers act as the bridge to translate raw chemical data into a format these operations can consume. DeepChem provides two primary tools to do this. The first is the ConvMolFeaturizer. This function looks at a molecule and generates an initial feature vector for every single atom. It calculates properties like the element type, the total number of connected heavy atoms, valence, formal charge, partial charge, and orbital hybridization. It bundles these atom-level features into a specific object called ConvMol. This format was designed to feed directly into DeepChem's native graph convolution networks. The second tool is the MolGraphConvFeaturizer. This is the more modern approach and is highly versatile because it outputs a standard GraphData object. This makes it the go-to choice if you are passing data into generic frameworks like PyTorch Geometric or DGL. Let us walk through how MolGraphConvFeaturizer handles a simple benzene ring. Benzene consists of six carbon atoms arranged in a continuous hexagon. The featurizer processes this structure by splitting it into nodes and edges. First, it creates the nodes. 
For each of the six carbon atoms, it calculates a numerical array of atomic properties, recording the hybridization state and partial charge of that specific atom. These six individual arrays are stacked together to form the node feature matrix. Next, it maps the edges, which correspond to the chemical bonds. It builds an edge index, which is simply a list of coordinate pairs identifying exactly which atom is connected to which. It also generates edge features. For every bond in that edge index, the featurizer records structural data. It checks if the bond is single, double, triple, or aromatic. Since we are looking at benzene, the featurizer explicitly encodes the aromatic nature of these bonds, along with whether the bond is conjugated or part of a ring system. Finally, it packs the node feature matrix, the edge index, and the edge features into a single GraphData object. You now have a mathematical graph that retains both the atomic characteristics and the exact topological connections of the original molecule, completely bypassing the limitations of flat vector representations. Here is the key insight. The quality of a graph neural network is entirely bottlenecked by the structural richness of the data structure you feed it. By properly utilizing a graph featurizer, you ensure your model actually learns the underlying chemistry instead of attempting to memorize a one-dimensional summary. Thanks for spending a few minutes with me. Until next time, take it easy.
6

Scientifically Aware Splitting

3m 49s

Discover why standard random splits fail on scientific datasets. We explore RandomStratifiedSplitter and how to correctly validate models on highly imbalanced multi-task data.

Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 6 of 15. You build a model on a new chemical dataset, run your validation, and get ninety-nine percent accuracy. You deploy it, and it immediately fails in the real world. Your validation metrics confidently lied to you because of how you divided your data. Today, we are looking at Scientifically Aware Splitting, specifically comparing the default Random Splitter with the Random Stratified Splitter. You might assume splitting data in DeepChem is exactly the same as using a basic train-test split function from a general machine learning library like Scikit-Learn. That is a common misconception. General-purpose splitters work well for simple, single-label arrays. DeepChem splitters are engineered to natively handle complex multi-task environments, sparse boolean labels, and the specific data structures DeepChem uses to store molecules. Let us start with the default approach: the Random Splitter. This tool behaves exactly as the name implies. It takes your loaded dataset and assigns the chemical compounds into training, validation, and testing sets uniformly at random. If you are working with a perfectly balanced dataset where your active and inactive compounds occur in equal measure, this random shuffling works fine. But scientific data is practically never balanced. Consider a toxicity dataset like Tox21. In the real world, the vast majority of chemical compounds you test are safe, and toxic compounds are relatively rare. Suppose only one percent of the compounds in your dataset are flagged as toxic. This is a severe class imbalance. If you pass this dataset through a standard Random Splitter, pure statistical chance dictates that your validation or test set might end up with zero toxic examples. If your test set consists entirely of safe compounds, a model that simply guesses "safe" for every single input will score one hundred percent. 
You end up with a mathematically perfect score for a completely useless model. This is where the Random Stratified Splitter becomes mandatory. Instead of blindly throwing compounds into buckets, stratification forces the split to respect the actual distribution of your labels. The Random Stratified Splitter scans the properties of your dataset before dividing anything. If your overall data contains exactly one percent toxic compounds and ninety-nine percent safe compounds, the splitter guarantees that your training set has a one-to-ninety-nine ratio, your validation set has a one-to-ninety-nine ratio, and your test set maintains exactly that same proportion. Here is the key insight. DeepChem datasets usually involve multi-task learning. This means a single chemical compound is not just evaluated for one property, but often for dozens of different biological assays simultaneously. The labels across these tasks are highly sparse. You might only have a handful of positive hits for one specific toxicity assay across ten thousand rows of data. The Random Stratified Splitter navigates this multi-dimensional matrix. It ensures that those extremely rare positive hits are distributed fairly across your splits. Every subset of your data receives a representative slice of the active compounds, preventing any single task from losing its minority class during the split. Without this mechanism, your model evaluation relies completely on luck. For highly imbalanced, multi-task scientific datasets, stratification is not an optional optimization. It is a fundamental requirement to prove your model actually learned the underlying chemistry instead of just exploiting a statistical blind spot. Thanks for spending a few minutes with me. Until next time, take it easy.
7

Taming Data with Transformers

3m 54s

Learn how to normalize wild scientific distributions using DeepChem Transformers. We discuss the NormalizationTransformer and MinMaxTransformer for stable training.

Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 7 of 15. Real scientific data is wild, unbalanced, and spread across massive numerical ranges. If you feed those raw experimental values directly into a neural network, you are practically begging for non-convergent gradients during backpropagation. To fix this, you need to tame your distributions before they touch your model, and that is where DeepChem Transformers come in. When you hear the word transformer in modern machine learning, your mind likely jumps to attention mechanisms, large language models, or BERT. Forget all of that for this episode. In the DeepChem ecosystem, a Transformer is simply a data preprocessing utility. It is not a neural network layer. It is an object that takes a dataset and modifies its features, labels, or weights to fit within the strict mathematical constraints of machine learning algorithms. Take a scenario where you are predicting experimental molecular solubility. Your raw training data might contain target values ranging from tiny negative numbers to tens of thousands. Neural network weights generally initialize as very small numbers. If your input features or target labels contain a value of forty thousand, the resulting loss calculation generates a massive gradient. The network weights will swing wildly during the update step, failing to converge on a solution. You have to scale the data down. One way to handle this is the MinMax Transformer. This utility scans your entire dataset, locates the absolute minimum and maximum values for your specified features or labels, and squashes the entire distribution into a strict zero-to-one range. The lowest value becomes zero, the highest becomes one, and everything else falls proportionally in between. You initialize the transformer by passing it your target dataset so it can compute those boundaries. Then, you call its transform method, which outputs a fresh dataset with the newly scaled numbers. 
Now the gradients remain stable. Squashing data between zero and one is not always ideal, particularly if your dataset contains massive outliers. An extreme outlier will force the rest of your normal data points to compress into a tiny, indistinguishable sliver of that zero-to-one range. For this, you use the Normalization Transformer. Instead of enforcing hard boundaries, it shifts your entire distribution so the mean sits exactly at zero, and scales the spread so the standard deviation is one. This centers your data perfectly, aligning it with the operational sweet spot of most neural network activation functions. Here is the key insight. Scaling input features is a one-way street, but transforming your labels creates a secondary problem. If your model trains on normalized solubility targets, its final predictions will also be normalized. A predicted value of zero point four is mathematically correct but practically useless to a chemist expecting a real-world measurement. DeepChem solves this by maintaining the state of the scaling metrics inside the transformer object. Once your model generates a normalized prediction, you pass that output into the untransform method of the exact same transformer. It reverses the arithmetic, mapping the scaled prediction directly back to its original scientific unit. Normalizing your data is not an optional optimization step in deep learning, it is a structural requirement for training stability. Thanks for listening, happy coding everyone!
8

The Model API and Scikit-Learn Wrappers

3m 04s

Explore DeepChem's unified Model interface and how to wrap traditional algorithms using SklearnModel. Learn why sometimes the best solution isn't a deep neural network.

Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 8 of 15. Sometimes the best solution to a scientific problem is not a massive neural network that takes a week to train. Often, a robust random forest is all you need. But switching between a traditional algorithm and a complex deep learning pipeline usually means rewriting all your data handling code. This episode covers the Model API and Scikit-Learn Wrappers, which solve exactly that problem. DeepChem uses a unified base class simply called Model. This class standardizes how you train and evaluate algorithms across the entire library. The core logic relies on two primary methods. You use fit to train the algorithm, and you use predict to generate outputs on new data. The defining feature of this interface is that these methods expect DeepChem Dataset objects as their input. You do not pass raw arrays or dataframes directly to the model. The Dataset acts as a standard container for your features, labels, and weights, and the Model API knows exactly how to read it. You might assume a specialized library like DeepChem would write its own custom implementations for basic machine learning algorithms. It does not. DeepChem does not reinvent Scikit-Learn. Instead, it provides an adapter called SklearnModel. This wrapper takes any standard Scikit-Learn estimator and gives it the DeepChem Model interface. Take a scenario where you want to predict the physical properties of a new material. First, you import a standard random forest regressor directly from Scikit-Learn and initialize it. Then, you pass that standard regressor into DeepChem's SklearnModel wrapper. You now have a DeepChem-compatible object. To train it, you call fit on the wrapped model, passing in your DeepChem Dataset. Behind the scenes, the wrapper automatically extracts the feature matrices, targets, and sample weights from the Dataset, and hands them off to the underlying Scikit-Learn algorithm. 
When it is time to test, you call predict on a new Dataset, and the wrapper formats the output back into a standard array. Here is the key insight. Because the SklearnModel wrapper exposes the exact same fit and predict methods as a deep neural network built natively in DeepChem, you can swap the underlying algorithm without touching your data pipeline. You can establish a quick baseline with a traditional machine learning model, and then seamlessly swap in a complex neural network later just by changing the single line of code that initializes the model. The data loading, the training loop, and the evaluation steps remain identical. The primary takeaway is that the Model API decouples your data ingestion from your algorithm choice, freeing you to test different mathematical approaches without constantly rewriting glue code. If you find these episodes useful and want to support the show, you can search for DevStoriesEU on Patreon. That is your lot for this one. Catch you next time!
9

Specialized Molecular Graph Models

3m 49s

Dive into deep learning architectures built specifically for chemistry. We cover Graph Convolutional Networks (GCNModel) and Message Passing Neural Networks (MPNNModel).

Download
Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 9 of 15. You cannot just feed a molecule into a standard neural network designed for images. Molecules are not rigid grids of pixels; they are complex webs of atoms of varying sizes and connections. If you force them into a standard grid, you destroy the structural chemistry that makes them unique. To solve this, we use Specialized Molecular Graph Models. These models are designed specifically to natively ingest the featurized graphs generated in your earlier preprocessing steps. Listeners sometimes confuse these with traditional Convolutional Neural Networks. A standard CNN slides a fixed-size filter over a two-dimensional grid. But a molecule does not have a top-left corner or a fixed resolution. Instead, specialized graph models operate dynamically on chemical bonds as edges and atoms as nodes. They do not care about grid coordinates. They care entirely about connectivity. DeepChem provides a few specific architectures for this. The first is the GCNModel, which stands for Graph Convolutional Network. Think of a GCN as a way to gather local atomic neighborhoods. For every atom in your featurized molecule, the GCN looks at its immediate neighbors. It takes the features of those neighboring atoms, like their element type or hybridization state, and pools them together to update the original atom's features. It repeats this process over a few layers. By the end, each atom has a mathematical representation that includes context from its surrounding chemical environment. The model then aggregates all these atom representations into a single mathematical vector to predict a general chemical property, like toxicity or solubility. That covers GCNs, which mostly focus on aggregating node data. But what about the bonds themselves? A double bond behaves very differently from a single bond, and sometimes you need the network to weigh that difference heavily. 
This is where the MPNNModel, or Message Passing Neural Network, comes in. Message Passing Neural Networks do not just look at a molecule. They mathematically simulate how information flows between atomic neighbors. Let us say you are training an MPNNModel to predict the quantum mechanical properties of a molecule. To do this accurately, the exact nature of the bonds matters immensely. In an MPNN, both the atoms and the bonds hold explicit feature data. During the training step, the model performs a message passing phase. Every atom generates a message based on its current state and sends it to its neighbors along the connecting bonds. Crucially, the bond itself modifies that message. A message traveling over a rigid aromatic bond will calculate differently from one traveling over a flexible single bond. When an atom receives messages from all its neighbors, it uses a small internal neural network to update its own state. It essentially calculates that it is a carbon atom currently being influenced by an oxygen atom on a double bond and a hydrogen atom on a single bond. This process iterates several times. Information ripples outward, step by step, across the entire bond network. After a set number of steps, the model gathers the final states of all atoms to make its prediction. Because the MPNN natively respects the true chemical structure, it excels at predicting complex quantum behaviors that simpler models miss. Here is the key insight. You do not need to invent ways to map molecules into flat arrays. By using GCN and MPNN models, you preserve the exact atomic topology of your data, allowing the network to learn directly from the chemistry itself. That is all for this one. Thanks for listening, and keep building!
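The neighborhood pooling described above can be illustrated in plain NumPy. This is a toy sketch of the idea, not GCNModel's actual implementation; the three-atom graph, feature vectors, and weight matrix are all invented.

```python
import numpy as np

# Tiny three-atom chain (say O=C-C): adjacency matrix plus per-atom features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1.0, 0.0],   # invented "oxygen-like" feature vector
              [0.0, 1.0],   # "carbon"
              [0.0, 1.0]])  # "carbon"
W = np.array([[0.5, -0.2],  # invented learned weights
              [0.1,  0.8]])

def gcn_layer(A, X, W):
    """One graph-convolution step: pool self plus neighbors, project, activate."""
    A_hat = A + np.eye(len(A))              # include each atom's own features
    deg = A_hat.sum(axis=1, keepdims=True)  # normalize by neighborhood size
    return np.maximum((A_hat / deg) @ X @ W, 0.0)  # ReLU

H = gcn_layer(A, X, W)           # each atom now carries neighborhood context
graph_vector = H.mean(axis=0)    # aggregate atoms into one molecule-level vector
print(H.shape, graph_vector.shape)
```

Stacking a few such layers lets context ripple further across the bond network; an MPNN additionally conditions each message on bond features rather than on connectivity alone.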
10

Evaluating Scientific Models

4m 00s

Learn why standard accuracy fails in scientific ML. We explore DeepChem's Metric class, the Matthews Correlation Coefficient, and how to evaluate imbalanced multi-task models.

Download
Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 10 of 15. If you are screening molecules for a new drug, maybe one in ten thousand actually works. A model that just predicts every molecule is useless will achieve a ninety-nine point nine nine percent accuracy rate, while discovering absolutely nothing. This is why basic accuracy is often a terrible measure for scientific discovery, and it is exactly why we need to talk about Evaluating Scientific Models. In the sciences, datasets are almost always highly imbalanced. Consider building a classifier to predict HIV inhibition. The vast majority of tested chemical compounds will not inhibit the virus. The active drug candidates are extremely rare. If you use standard accuracy to evaluate your model, the results will lie to you. The model simply learns to guess the majority class, scoring very high numbers while completely failing to find a single working drug. To fix this, you need metrics that expose this behavior. The first is the recall score. Recall measures how many of the actual active compounds your model successfully identified. It strictly punishes models that miss true discoveries. However, recall alone is not a complete picture. A naive model could just guess that every single compound is active, achieving a perfect recall score while generating thousands of false positives. This is where the Matthews correlation coefficient, or MCC, becomes essential. MCC is a balanced measure that evaluates all four categories of your results. It looks at true positives, true negatives, false positives, and false negatives simultaneously. It produces a score between negative one and positive one. A score of positive one means perfect prediction, zero means the model is no better than random guessing, and negative one means total disagreement. MCC only generates a high score if the model predicts accurately on both the rare active compounds and the common inactive ones. 
It brutally penalizes a model that takes the easy route of guessing the majority class. You might wonder why you cannot just import these exact metric functions directly from scikit-learn. Scikit-learn has perfectly good implementations of both recall and MCC. The issue comes down to the data shape. DeepChem models frequently perform multi-task learning, predicting dozens of different chemical properties simultaneously. Standard scikit-learn functions expect simple flat arrays. If you pass them DeepChem's complex, multi-dimensional outputs, they will either crash or calculate the score incorrectly. DeepChem solves this mismatch with its own Metric class. This class acts as a wrapper around standard mathematical scoring functions. When you instantiate a DeepChem Metric, you pass it the raw scoring function you want to use, like the MCC function from scikit-learn. The DeepChem Metric wrapper takes over from there. It manages the batching of your data and aligns the true labels with your model predictions across the entire dataset. Here is the key insight. The Metric class relies heavily on an internal mechanism called normalize prediction shape. It inspects your data to determine if your model is outputting a single task, multiple tasks, or one-hot encoded probability vectors. It then reshapes those raw multi-dimensional outputs into the exact flat arrays that the underlying mathematical functions require. You do not have to write manual loops to slice and unpack your multi-task arrays before scoring them. The wrapper handles the structural complexity automatically so the math works exactly as intended. When you evaluate a scientific model on an imbalanced dataset, your primary job is to prove that the model is not cheating the baseline, so choose a metric that forces the model to actually learn the rare chemistry. Thanks for spending a few minutes with me. Until next time, take it easy.
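The accuracy trap described above is easy to demonstrate with scikit-learn, using hypothetical screening counts; in DeepChem you would wrap the same scoring function in the Metric class.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Hypothetical screen: 1000 compounds, only 5 actives.
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000          # a lazy model that always predicts "inactive"

acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print(acc)  # 0.995: looks excellent
print(mcc)  # 0.0: exposes a model no better than random guessing

# In DeepChem, the same function plugs into the wrapper, e.g.:
# metric = dc.metrics.Metric(matthews_corrcoef)
# model.evaluate(test_dataset, [metric])
```

The MCC of zero here comes from the model never predicting a single true positive, which is exactly the failure that raw accuracy hides.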
11

Intelligent Hyperparameter Tuning

3m 55s

Move beyond brute-force grid search. Discover how to use GaussianProcessHyperparamOpt in DeepChem to intelligently navigate complex hyperparameter spaces.

Download
Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 11 of 15. Exhaustively testing every possible combination of learning rate, layer size, and penalty for a heavy model might take weeks or even months. But what if your code could intelligently guess where the best model configuration lives after just a few trials? Today, we are looking at intelligent hyperparameter tuning in DeepChem. Before looking at the search methods, let us clear up a common confusion. DeepChem provides a base class called HyperparamOpt. This is not your actual model training loop. It is an outer optimization loop. You provide a function called model builder. This function knows how to construct your specific model given a set of parameters. The HyperparamOpt class wraps this builder, feeding it different parameter sets, training the resulting model on a dataset, evaluating it with a specific metric, and returning the best configuration. The simplest way to search this parameter space is using GridHyperparamOpt. You give it discrete lists of values. For example, three learning rates, four different numbers of estimators, and two penalty types. Grid search takes the brute-force approach. It evaluates every single combination one by one. If you have a small model and a small parameter space, grid search is fine. But as soon as you add more parameters, you hit a combinatorial explosion. If your parameter grid creates one thousand combinations and each takes an hour to train, grid search is no longer viable. This is where GaussianProcessHyperparamOpt comes in. Instead of a brute-force grid, it performs a probabilistic search using a backend library called pyGPGO. You define continuous ranges or categorical choices for your parameters rather than fixed lists. The Gaussian Process treats your model evaluation like a black-box function. It wants to find the maximum performance metric with the fewest possible evaluations. Here is the key insight. 
The Gaussian Process learns as it goes. When it tests a specific combination of learning rate and penalty, it looks at the resulting evaluation score. It then builds a mathematical map of the parameter space, predicting where the scores might be highest and where its predictions are most uncertain. For its next trial, it does not just pick the next item in a list. It calculates an acquisition function to decide the smartest next guess. It balances exploring unknown zones of your parameter space against exploiting areas that have already produced good scores. So in a scenario of tuning a heavy model, you pass your model builder to GaussianProcessHyperparamOpt. You define a search space dictating the upper and lower bounds for the learning rate and the number of estimators. You tell the optimizer to run for twenty trials. Instead of blindly marching through one thousand combinations, the algorithm probes the space, realizes certain learning rates perform poorly, completely avoids that territory, and zeroes in on the optimal zone. You save enormous amounts of computational time. The most important takeaway here is to match your search strategy to your computational budget. Use grid search when you have a handful of discrete values and fast training times, but when training is expensive and the parameter space is vast, rely on a Gaussian Process to actively hunt down the best model. I would like to take a moment to thank you for listening — it helps us a lot. Have a great one!
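The combinatorial explosion is easy to see with a toy stand-in. Here black_box_score is an invented proxy for "build, train, and evaluate a model"; DeepChem's GridHyperparamOpt performs that training loop for you via your model builder function.

```python
import itertools

# Hypothetical search space: grid search must try every combination.
space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "n_estimators": [50, 100, 200, 400],
    "penalty": ["l1", "l2"],
}

def black_box_score(params):
    """Invented stand-in for 'train a model with these params and score it'."""
    lr_term = -abs(params["learning_rate"] - 1e-3) * 100
    est_term = -abs(params["n_estimators"] - 200) / 400
    return lr_term + est_term + (0.1 if params["penalty"] == "l2" else 0.0)

# Brute force: 3 * 4 * 2 = 24 full training runs, one per combination.
keys = list(space)
trials = [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]
best = max(trials, key=black_box_score)
print(len(trials), best)
```

Twenty-four evaluations is tolerable here, but each added parameter multiplies the count; a Gaussian Process optimizer would instead spend a fixed budget of trials probing this same black box where the acquisition function predicts the most gain.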
12

Metalearning for Low Data Regimes

3m 55s

Explore Model-Agnostic Meta-Learning (MAML) in DeepChem. Learn how to train models that can rapidly adapt to new, expensive scientific experiments with very little data.

Download
Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 12 of 15. When a single biological experiment costs ten thousand dollars, you cannot collect big data. You might only get five data points for a brand new assay. Standard deep learning will overfit instantly and fail completely here. What you need is an algorithm designed specifically to extract signal from almost nothing. This brings us to Metalearning for Low Data Regimes. DeepChem handles this through a class called MAML, which stands for Model-Agnostic Meta-Learning. In many scientific domains, especially drug discovery or material science, you simply do not have the data volume required for traditional neural networks. You are forced to operate in a low-data regime, where few-shot learning is the only viable approach. Many engineers will assume standard transfer learning is the answer here. It is not. Transfer learning trains a model on a massive dataset, hoping the learned static features provide a good warm-start for a smaller dataset. MAML does not just warm-start weights. It mathematically optimizes the model's gradient trajectory. It does not just learn the data; it learns how to adapt to new data efficiently. How does it achieve this? MAML uses a nested loop of optimization across a distribution of tasks. During training, you do not feed the algorithm a single continuous stream of data. You feed it batches of distinct tasks. For each task in the batch, the algorithm takes a tiny amount of data. It calculates the gradients and computes what the new model weights would be after one or two standard training steps. Here is the key insight. The algorithm does not permanently apply those updated weights yet. Instead, it takes those hypothetical weights and tests them against a second, separate set of data from that exact same task. It calculates the loss on this validation set. 
Then, it computes the gradient of that validation loss with respect to the original starting weights, and updates those original weights. The math forces the model to find an initialization point where taking just a few gradient steps produces a massive drop in error for any new task drawn from that domain. Let us apply this to a concrete scenario. Suppose you have historical data from dozens of older biological assays. You want to predict the results of a brand-new, highly expensive assay where you can only afford to collect five actual data points. In DeepChem, you instantiate the MAML object and pass it a base model. This base model is called the learner. You then train the MAML algorithm by sampling small tasks from your historical assays. The algorithm constantly practices adapting. It takes five data points from assay A, updates itself, checks its prediction error on more data from assay A, and backpropagates the result to the global starting weights. Then it repeats this exact simulation for assay B, assay C, and so on. Over time, the model converges on an optimal set of starting parameters for the entire family of assays. When your new, expensive assay finally arrives, you take those starting parameters and run a standard fine-tuning pass using your five tiny data points. Because the MAML optimization aligned the gradient trajectory specifically for fast adaptation, the model configures itself almost instantly. You get highly accurate predictions for a novel problem using a microscopic dataset. MAML shifts the fundamental goal of pre-training from minimizing prediction error on old data to maximizing adaptability on new data. That is all for this one. Thanks for listening, and keep building!
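The nested loop can be sketched on a toy problem. This is a first-order illustration of the inner/outer structure, not DeepChem's MAML class, which handles arbitrary learners and the full second-order update; every number here is invented.

```python
import numpy as np

rng = np.random.default_rng(0)
inner_lr, outer_lr = 0.1, 0.05
w0 = 0.0  # the meta-learned initialization that MAML optimizes

def sample_task():
    """A 'new assay': y = w_true * x, with a task-specific slope in [2, 4]."""
    w_true = rng.uniform(2.0, 4.0)
    x_support, x_query = rng.normal(size=5), rng.normal(size=5)
    return (x_support, w_true * x_support), (x_query, w_true * x_query)

def inner_step(w, x, y):
    """One standard gradient step on mean-squared error (the adaptation step)."""
    grad = 2 * np.mean((w * x - y) * x)
    return w - inner_lr * grad

for _ in range(500):  # outer meta-loop over sampled tasks
    (xs, ys), (xq, yq) = sample_task()
    w_adapted = inner_step(w0, xs, ys)  # hypothetical weights after adaptation
    # First-order meta-update: judge the adapted weights on held-out query
    # data from the same task, then move the initialization accordingly.
    outer_grad = 2 * np.mean((w_adapted * xq - yq) * xq)
    w0 -= outer_lr * outer_grad

print(round(w0, 2))  # settles inside the task family's slope range (roughly 2 to 4)
```

Note the point of the exercise: w0 is never a good fit for any single assay, but it is positioned so that five data points and one gradient step land close to whichever slope the new assay has.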
13

Binding Pocket Discovery

3m 57s

Understand the geometry of protein-ligand interactions. We explore DeepChem's ConvexHullPocketFinder for algorithmically locating binding grooves on 3D protein structures.

Download
Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 13 of 15. You have a newly mapped 3D structure of a viral protein, but finding where a therapeutic drug can actually latch onto it is like trying to place a single puzzle piece onto a ten thousand piece jigsaw floating in three dimensions. You cannot just throw a small molecule at a massive protein and hope it sticks. You need to map out the structural grooves first, and that is exactly what Binding Pocket Discovery is for. Before going further, let us clear up a common misunderstanding. Pocket discovery does not place a drug inside a protein. It only surveys the empty real estate to find potential binding sites. Generating the actual position of the molecule inside that site is called pose generation. That is a completely separate step. Today, we only care about finding the empty parking spots. In DeepChem, this workflow begins with the Binding Pocket Finder base class. This is an abstract template. Its job is to provide a standard interface, taking a 3D macromolecule as input and returning a list of potential pockets as output. By using a base class, DeepChem ensures that whether you use a built-in algorithm or write your own, the pipeline remains consistent. But the actual heavy lifting is done by specific implementations, the most common being the Convex Hull Pocket Finder. The Convex Hull Pocket Finder algorithmically analyzes the three-dimensional geometry of your viral protein. Picture wrapping the entire macromolecule tightly in shrink wrap. The wrap stretches straight across the gaps, clefts, and crevices on the surface. That outer boundary is the mathematical convex hull. The empty spaces trapped between that flat outer boundary and the actual bumpy atomic surface of the protein are your potential binding pockets. These are the vulnerable grooves where a new drug might successfully anchor. 
To identify these spaces systematically, the algorithm divides the volume around the protein into a fine three-dimensional grid. It sweeps through this grid and identifies voxels, which are tiny 3D boxes. It looks for voxels that sit inside the convex hull but do not contain any protein atoms. It also measures the distance from these empty boxes to the nearest protein atom. If a box is too exposed to the surface, it is ignored. If it is buried too deeply inside the protein core, it is also ignored. The algorithm clusters the remaining valid boxes together to form the continuous shapes of the surface cavities. Once it maps a cavity, the finder generates a bounding box around it. Here is the key insight. The algorithm does not cut this bounding box perfectly to the exact dimensions of the hole. It adds a calculated layer of padding around the edges. This extra margin is critical because molecules in biology are dynamic. When you eventually hand these pocket coordinates over to a docking algorithm later in your pipeline, that padding provides the necessary wiggle room. It gives the next step enough surrounding space to calculate how the drug might twist, rotate, or shift as it settles into the groove. You pass your viral protein into the Convex Hull Pocket Finder, and it hands you back a clean array of padded bounding boxes. You now know exactly where the structural vulnerabilities are, and you can focus your computational resources purely on those specific coordinates. The padding ensures your downstream docking simulations will not fail simply because they bumped into an artificial mathematical wall. If you want to help keep these episodes coming, you can support the show by searching for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
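The voxel filter is easier to picture in two dimensions. Below is a toy sketch assuming SciPy is available; the C-shaped "protein", grid resolution, and distance cutoffs are all invented, and the real ConvexHullPocketFinder of course works on 3-D protein structures.

```python
import numpy as np
from scipy.spatial import Delaunay, cKDTree

# Toy 2-D "protein": atoms on a C-shaped arc, leaving a groove inside the hull.
theta = np.linspace(0.5 * np.pi, 1.5 * np.pi, 40)
atoms = 5.0 * np.column_stack([np.cos(theta), np.sin(theta)])

hull = Delaunay(atoms)   # supports a fast inside-the-convex-hull test
tree = cKDTree(atoms)    # fast nearest-atom distance queries

# Candidate voxels on a regular grid around the structure.
g = np.linspace(-6.0, 6.0, 40)
voxels = np.array([(x, y) for x in g for y in g])

inside = hull.find_simplex(voxels) >= 0   # inside the shrink-wrap boundary
dist, _ = tree.query(voxels)              # distance to the nearest atom
# Keep voxels inside the hull that neither clash with atoms (too close)
# nor float in open space far from every atom (too far).
pocket = voxels[inside & (dist > 1.0) & (dist < 4.0)]

# Pad the pocket's bounding box so downstream docking has wiggle room.
pad = 1.5
lo, hi = pocket.min(axis=0) - pad, pocket.max(axis=0) + pad
print(len(pocket), lo, hi)
```

The surviving voxels cluster in the groove of the C, and the padded lo/hi corners are exactly the kind of bounding box a pose generator would consume next.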
14

Pose Generation with Vina and Gnina

4m 25s

Take the next step in molecular docking by computing binding poses. Learn how VinaPoseGenerator and GninaPoseGenerator score spatial geometries to predict interactions.

Download
Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 14 of 15. You have a target protein and a potential drug molecule, but knowing they might interact is useless unless you know exactly how they fit together physically. We do not have to blindly guess shapes anymore. We can leverage pre-trained Convolutional Neural Networks to predict the actual physics of binding poses. This is Pose Generation with Vina and Gnina. First, let us address a common mix-up. Listeners often confuse docking with Molecular Dynamics. Molecular Dynamics is a time-based simulation showing how molecules twist, fold, and vibrate over microseconds. Pose generation, or docking, does not do this. It computes a static snapshot. It calculates the optimal resting geometry of a ligand inside a protein pocket and assigns it a stationary energetic score based on that single position. In DeepChem, the computational heavy lifting for this happens inside the PoseGenerator class. Its specific job is to compute multiple three-dimensional arrangements of your ligand, called poses, inside a restricted bounding box on the protein. It then ranks these poses. The standard implementation for this is the VinaPoseGenerator. This wraps the AutoDock Vina engine, which relies on an empirical scoring function. It calculates a classical binding energy score by evaluating physical interactions like hydrogen bonds, hydrophobic contacts, and steric repulsion. Vina tests a geometric configuration, calculates the energy penalty or reward, adjusts the atoms slightly, and tries again. It searches this massive geometric space and returns a list of poses ranked by their lowest binding energy. Classical physics formulas are incredibly fast, but they sometimes misjudge complex molecular interactions. This is where the GninaPoseGenerator steps in. GNINA is a deep-learning upgrade. It takes the underlying Vina framework and evaluates the configurations using a pre-trained Convolutional Neural Network.
Take a scenario where you are evaluating a new cancer drug candidate. The pose generator computes 10 different geometrical configurations of this drug resting inside a target binding pocket. Instead of just tallying up classical physics terms, GNINA passes the three-dimensional atomic structures of these 10 poses through its neural network. GNINA ranks those 10 poses using three distinct metrics. First, it still computes the traditional Vina empirical score for a baseline. Second, it outputs a CNN pose score. This is a probability between zero and one indicating how closely the generated pose matches a high-quality, experimentally proven structure. Third, it computes a CNN affinity score, which predicts the actual binding strength. The CNN pose score is the critical addition here. It acts as an advanced filter, preventing the system from highly ranking a pose that looks mathematically stable to classic equations but is physically unrealistic in nature. Implementing this requires just a few steps. You initialize the GninaPoseGenerator. You call its generate poses method, passing the protein structure file, the ligand structure file, and the coordinate dimensions defining the bounding box of the pocket. Restricting the search to a specific box prevents the system from wasting compute cycles on empty space. The method then returns your ranked poses, alongside the CNN scores, allowing you to extract the top physical candidate. Here is the key insight. GNINA achieves this high accuracy by treating the physical 3D space of the protein pocket like a structured image. It divides the atomic coordinates into a three-dimensional grid. The Convolutional Neural Network then scans for spatial and chemical patterns across that volumetric grid, exactly the way a standard image-recognition network looks for edges and textures in a photograph. That is all for this one. Thanks for listening, and keep building!
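The three-score ranking logic can be mimicked with invented numbers. These are not real GNINA outputs, just an illustration of why the CNN pose score works as a plausibility filter ahead of affinity ranking.

```python
# Each hypothetical pose: (name, vina_energy, cnn_pose_score, cnn_affinity).
poses = [
    ("pose_0", -9.1, 0.21, 5.9),  # great classical energy, implausible geometry
    ("pose_1", -8.4, 0.88, 7.2),
    ("pose_2", -7.9, 0.91, 6.8),
    ("pose_3", -6.2, 0.45, 4.1),
    ("pose_4", -8.0, 0.76, 6.9),
]

# Filter by CNN pose score first, then rank by predicted binding affinity.
plausible = [p for p in poses if p[2] >= 0.5]
ranked = sorted(plausible, key=lambda p: p[3], reverse=True)
best = ranked[0][0]
print(best)  # pose_1: highest predicted affinity among plausible geometries
```

Notice that pose_0 would win on the classical Vina score alone; the pose-score filter is what removes it from contention.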
15

Reinforcement Learning in Science

4m 17s

Discover how reinforcement learning can autonomously design molecules. We cover DeepChem's Environment and Policy abstractions alongside the Advantage Actor-Critic (A2C) algorithm.

Download
Hi, this is Alex from DEV STORIES DOT EU. Mastering DeepChem, episode 15 of 15. You want to design a highly stable molecule, but you do not have a massive dataset of perfect examples to train a model. You just have a set of physical rules. When you lack static data, you cannot use standard supervised learning. Instead, you need an AI that learns by trial, error, and feedback. That brings us to Reinforcement Learning in Science. If you are steeped in traditional machine learning, you might confuse this with supervised training. They are fundamentally different. Supervised learning requires an explicit, labeled static dataset. Reinforcement learning relies on continuous interactive rewards generated by an environment. The model tries a move, the environment responds with a score, and the model adjusts its strategy. In DeepChem, the foundation for this interaction is the Environment class. Think of the Environment as your high-fidelity scientific simulator. It defines the physical rules of the universe your AI operates in. It tracks the current state of your system, defines the mathematical actions your model is allowed to take, and exposes a step function. When your model takes an action, it passes it to that step function. The Environment calculates the physics, then returns three things: the new state, a numerical reward based on how good the resulting state is, and a boolean flag indicating if the task is finished. If you are already working with standard tools from outside chemistry, DeepChem provides the GymEnvironment class. This simply wraps standard OpenAI Gym simulation environments so they plug directly into your DeepChem workflows. The component that actually interacts with this Environment is called the Policy. The Policy is the abstraction representing your agent's brain. It takes the current state from the Environment and maps it to a specific action, or outputs a set of probabilities for different possible actions. 
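The step contract just described can be mocked in a few lines. This toy environment and policy are entirely invented stand-ins for DeepChem's Environment and Policy abstractions, and the stability numbers are made up.

```python
class ToyMoleculeEnv:
    """Invented stand-in for an Environment: the state is total stability."""
    CONTRIB = {"C": -0.3, "O": 0.5}  # made-up stability contribution per atom

    def __init__(self):
        self.stability = 0.0
        self.n_steps = 0

    def step(self, action):
        """Apply an action, then return (new state, reward, done)."""
        reward = self.CONTRIB[action]
        self.stability += reward
        self.n_steps += 1
        return self.stability, reward, self.n_steps >= 3

def policy(state):
    """A trivial 'brain': map the current state to action probabilities."""
    return {"C": 0.6, "O": 0.4} if state >= 0 else {"C": 0.2, "O": 0.8}

env = ToyMoleculeEnv()
state, done = 0.0, False
while not done:
    probs = policy(state)
    action = max(probs, key=probs.get)  # greedy choice, for determinism
    state, reward, done = env.step(action)
print(round(state, 1))
```

A real policy would be a neural network whose action probabilities are updated from rewards rather than hard-coded, but the state, reward, done loop is the same shape.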
To train a Policy efficiently, DeepChem implements algorithms like Advantage Actor-Critic, usually referred to as A2C. Here is the key insight. A2C splits the learning process into two separate neural networks running in parallel: the Actor and the Critic. The Actor looks at the state and decides what action to take to progress the task. The Critic watches the result and estimates the overall value of being in that new state. The word advantage refers to the difference between the actual reward received from the Environment and the reward the Critic predicted. If the action resulted in a higher reward than the Critic expected, the advantage is positive, and the Actor network is updated to take that action more often in the future. Consider a concrete scenario where you want to autonomously construct novel molecules. Your Environment is a chemistry engine programmed to calculate molecular stability. You start with a basic chemical ring. The Actor evaluates this starting state and decides to attach a carbon atom. The Environment processes this step, simulates the new chemical structure, discovers a drop in stability, and returns a negative reward. The Critic observes this outcome, updates its baseline expectation, and signals the Actor that this was a poor move. In the next iteration, the Actor tries attaching an oxygen atom instead. The Environment simulates the change, calculates higher stability, returns a positive reward, and the Critic reinforces that choice. Step by step, the A2C algorithm navigates the chemical simulator, building a complex, highly stable molecule entirely autonomously without ever referencing a static dataset of good molecules. Reinforcement learning agents can query simulators millions of times, discovering solutions human intuition might completely miss. This episode brings us to the end of our deep dive into DeepChem. 
If you want to take these tools further, explore the official documentation and try building your own environments hands-on. You can also visit devstories dot eu to suggest topics for future series. That is all for this one. Thanks for listening, and keep building!