Season 19 · 22 Episodes · 1h 24m · 2026

Python Cheminformatics & AI

2026 Edition. A practical course taking Python developers from basic chemistry concepts to designing AI-driven cheminformatics systems. Learn how to use RDKit, scikit-fingerprints, and state-of-the-art deep learning techniques like Graph Neural Networks and Diffusion Models for drug discovery.

Scientific Computing · Cheminformatics · Deep Learning for Science
1
The Digital Molecule
We introduce RDKit and the core concept of representing chemistry in Python. Listeners will learn how to initialize molecular objects from strings and understand the framework's central role in AI drug discovery.
3m 31s
2
I/O in Cheminformatics
Learn how to safely ingest and export massive chemical datasets. We cover reading SDF and SMILES files, handling parse errors, and writing data back to disk.
3m 33s
3
Molecular Graph Traversal
Discover how molecules are represented as graph data structures. We explore iterating over atoms, analyzing bonds, and identifying ring systems within molecules.
3m 22s
4
Substructure Searching
Master the art of querying molecules using SMARTS. We walk through finding specific functional groups and patterns within complex chemical structures.
3m 50s
5
Fingerprinting and Molecular Similarity
Explore how to translate molecular graphs into mathematical bit vectors. We cover MACCS keys, Morgan fingerprints, and calculating Tanimoto similarity.
3m 56s
6
Breaking the 2D Plane
Transition from flat 2D drawings to realistic 3D geometries. We discuss adding explicit hydrogens and generating reliable 3D conformers using ETKDG.
4m 04s
7
Accelerating Feature Engineering
Bridge cheminformatics and standard data science with scikit-fingerprints. We explore generating over 30 types of molecular fingerprints directly within a scikit-learn interface.
4m 05s
8
High-Performance Cheminformatics
Learn how to process massive chemical datasets efficiently. We dive into utilizing CPU parallelism with Joblib and saving memory using SciPy sparse matrices.
4m 10s
9
End-to-End ML Pipelines
Combine processing, fingerprinting, and prediction into a single clean architecture. We build robust scikit-learn pipelines that seamlessly integrate 3D conformer generation and property prediction.
3m 58s
10
Predicting Binding Affinity
Explore the reality of predicting protein-ligand binding affinity. We compare the performance of simple 2D tree-based models against complex 3D Graph Neural Networks.
3m 52s
11
LLMs vs Classical Fingerprints
Discover how Natural Language Processing applies to chemistry. We pit vector embeddings from Large Language Models against classical RDKit structural fingerprints for predicting bioactivity.
4m 15s
12
Active Learning for Virtual Screening
Learn how to iteratively discover top drug candidates without exhaustive testing. We dive into active learning loops and greedy selection strategies to maximize hit rates.
4m 07s
13
The Activity Cliff Challenge
Examine the fragility of structure-activity relationships. We discuss 'activity cliffs'—where a tiny structural change causes a massive shift in a drug's potency.
3m 31s
14
Similarity-Quantized Relative Learning
Solve the activity cliff problem by rethinking how models learn. We explore the SQRL framework, which trains AI to predict relative property differences between strictly filtered molecular pairs.
3m 32s
15
The Generative AI Revolution
Transition from predicting properties to imagining entirely new molecules. We map out the landscape of molecular generative tasks: De Novo generation, optimization, and conformer generation.
3m 23s
16
The Intuition of Molecular Diffusion
We break down the core concept of diffusion models without the heavy math. Listeners will understand the forward process of adding noise to a molecule and the reverse process of hallucinating new structures.
3m 48s
17
Bridging 2D and 3D Generative Spaces
We explore how AI actually represents the molecules it generates. We compare generating flat 2D topological graphs with generating complex 3D geometric point clouds, and the challenges of each.
4m 03s
18
Target-Aware Generation & Docking
Discover context-aware generative design. We discuss generating novel molecules directly inside a disease protein's binding pocket to maximize binding affinity.
3m 30s
19
The Size Trap in Generative Evaluation
Learn why standard benchmarks for generative models can be deeply flawed. We reveal the confounding effect of generated library size on metrics like Fréchet ChemNet Distance.
4m 17s
20
Navigating De Novo Hallucinations
Rank AI-generated molecules intelligently. We explore the exploration-exploitation tradeoff of model likelihoods, and how to filter out frequent, low-quality 'chemical hallucinations'.
3m 56s
21
Molecule Sampling Constraints
Understand why NLP techniques fail in chemistry. We compare temperature sampling against Top-k and Top-p, and explain why the constrained chemical vocabulary changes everything.
4m 01s
22
Deploying Cheminformatics in the Cloud
Take your AI pipeline to production. We discuss packaging RDKit and machine learning models into Docker containers and scaling workloads across cloud infrastructure.
3m 42s

Episodes

1

The Digital Molecule

3m 31s

We introduce RDKit and the core concept of representing chemistry in Python. Listeners will learn how to initialize molecular objects from strings and understand the framework's central role in AI drug discovery.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 1 of 22. Before you can predict a drug's toxicity with a machine learning model, you have to solve a fundamental problem. You need a way to teach Python what a molecule actually is. Standard data types like strings and lists do not understand atoms, bonds, or ring structures. To bridge that gap, you need a universal translator between chemistry and code. That is exactly what RDKit provides, introducing the concept of the digital molecule. RDKit is the industry-standard, open-source cheminformatics toolkit. At its core, it is a high-performance C++ library, but it exposes a massive, intuitive Python interface. It exists because representing chemical structures computationally is surprisingly difficult. A molecule is mathematically a graph. The atoms are nodes, and the chemical bonds are edges connecting those nodes. If you try to build a custom graph parser from scratch every time you want to analyze chemical data, you will spend all your time debugging edge cases. RDKit abstracts that complexity away, managing the graph logic under the hood. To get a molecule into Python, you first need a text representation of its structure. The most common format is a SMILES string. SMILES uses standard characters to represent chemical connectivity. For example, an isolated carbon atom is simply a capital C. Benzene, which is a six-carbon ring with alternating double bonds, is written as a lowercase c, the number one, four more lowercase c characters, and a final lowercase c followed by a number one to close the ring. Here is the key insight. That SMILES string is just plain text. To Python, it is indistinguishable from a password or a file path. You cannot calculate molecular weight from a raw string. To do real chemistry, you must convert it into an RDKit molecule object. You handle this by importing the Chem module from RDKit. Then, you call a specific function designed to create a molecule from a SMILES string, and you pass your text variable into it. When you pass the benzene SMILES string to this function, RDKit does a lot of heavy lifting. It parses the text, builds the node and edge graph, assigns bond orders, and validates basic chemical rules like atomic valences. If the string represents a valid molecule, the function returns a molecule object. If you pass it a chemically impossible structure or a typo, the function fails safely. It prints a warning to your console and returns a null object. Because of this, you should always check if your molecule object actually exists before passing it to the next step of your program. Once you hold that validated molecule object in memory, the entire RDKit ecosystem unlocks. You are no longer working with text; you are working with a computable chemical graph. The essential takeaway here is that SMILES strings are strictly for storage and data transfer, but RDKit molecule objects are for computation. Everything you do in computational chemistry begins with making that conversion. If you would like to help support the show, you can search for DevStoriesEU on Patreon—it helps us a lot. That is all for this one. Thanks for listening, and keep building!
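Here is a minimal sketch of that conversion and the safety check, using the standard Chem.MolFromSmiles call; the variable names are just for illustration:

    from rdkit import Chem

    # Parse a SMILES string (benzene) into an RDKit molecule object.
    benzene = Chem.MolFromSmiles("c1ccccc1")

    # MolFromSmiles returns None for invalid input instead of raising an
    # exception, so always check before doing any chemistry with the object.
    if benzene is None:
        raise ValueError("could not parse SMILES")

    print(benzene.GetNumAtoms())  # 6 heavy atoms in the aromatic ring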
2

I/O in Cheminformatics

3m 33s

Learn how to safely ingest and export massive chemical datasets. We cover reading SDF and SMILES files, handling parse errors, and writing data back to disk.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 2 of 22. A single missing coordinate in a million-molecule dataset can crash your entire pipeline. If you assume every text record in your vendor file is perfectly formatted chemistry, your Python script will eventually throw a fatal exception halfway through a ten-hour job. Defending against dirty data while moving structures in and out of your scripts is mandatory, and that is exactly what we resolve today with I/O in Cheminformatics. When you receive a standard structure-data file, or SDF, full of potential ligands, you need a way to parse it. In RDKit, the default tool for this is the SD molecule supplier. You initialize it by passing the file path as a string. This supplier object acts very much like a Python list. You can loop over it, you can ask for its total length, and you can pull out a specific record by its index number. It does this by scanning the file quickly to find where each molecule starts, allowing you to jump around the data. Sometimes you cannot scan ahead. If you are piping data directly from a web stream, or reading a massive gzipped file chunk by chunk, you do not have random access. For these situations, you use the forward SD molecule supplier. Instead of a file path, you pass it an open file object. This forward supplier is a strict iterator. It reads one molecule, parses it, and immediately moves to the next. You cannot ask for its length, and you cannot ask for the fiftieth molecule without reading the first forty-nine. You trade flexibility for low memory usage and stream compatibility. Here is the key insight. Regardless of which supplier you use, RDKit does not raise a Python exception when it encounters a corrupted molecule. If a text block in your file has an invalid valence or a formatting typo, RDKit will output an error message to the console, but the actual Python object it yields for that loop iteration will simply be the None type. If you take that None object and try to calculate its weight or write it to a new file, your script will crash. Handling this is simple but critical. The very first line inside your parsing loop must always check if the yielded molecule is None. If it is None, you use the continue statement to skip to the next record. This silently filters out the garbage data and keeps your pipeline running. Once you have safely parsed and filtered your valid molecules, you usually need to save the results. For this, you use the SD writer. You initialize the writer by passing your desired output file path. Inside your safe loop, right after your None check, you pass the valid molecule object to the writer using its write method. Once the loop finishes processing every ligand, you call the close method on the writer to ensure all data flushes to the disk safely. You can also wrap the writer in a standard Python context manager so it closes itself automatically when the block ends. To put this all together for a data cleaning script, first, create your SD writer for the output file. Second, create your SD molecule supplier for the input file. Loop through the supplier. Check if the current item is None, and if so, skip it. If it is valid, hand it to the writer. Close the writer at the end. Always treat external chemistry datasets as inherently dirty; verifying that a parsed molecule is not None is the cheapest insurance policy your code will ever have. That is all for this one. Thanks for listening, and keep building!
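A minimal sketch of that cleaning loop, with placeholder file paths:

    from rdkit import Chem

    supplier = Chem.SDMolSupplier("input_ligands.sdf")   # list-like, random access
    writer = Chem.SDWriter("clean_ligands.sdf")

    for mol in supplier:
        if mol is None:    # corrupted record: RDKit logs the error and yields None
            continue
        writer.write(mol)

    writer.close()         # flush everything to disk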
3

Molecular Graph Traversal

3m 22s

Discover how molecules are represented as graph data structures. We explore iterating over atoms, analyzing bonds, and identifying ring systems within molecules.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 3 of 22. To a computer, a molecule is not a physical object taking up space. It is strictly a graph of nodes and edges waiting to be parsed. If you cannot efficiently navigate that underlying graph, you cannot analyze the chemical structure. This brings us directly to molecular graph traversal. In RDKit, a molecule object acts as a primary container for this graph data structure. To inspect the nodes, you use the GetAtoms method. This returns an iterable sequence containing all the atom objects in the molecule. You can write a simple loop to step through this sequence one by one. For a concrete scenario, suppose you need to extract the atomic numbers of all your nodes. Inside your loop, you can call the GetIdx method to find the unique numerical identifier for the current atom, and the GetAtomicNum method to find out exactly which chemical element it is. By iterating through, you process every node systematically. Nodes alone do not define the chemistry. You also need the edges connecting them, which you access using the GetBonds method. Just like with atoms, this provides an iterable sequence of bond objects. A bond knows its exact place in the graph. By calling the GetBeginAtomIdx and GetEndAtomIdx methods on a bond object, you extract the specific numerical identifiers of the two atoms it connects. You can also read the bond type, determining if it is a single, double, or aromatic connection. Here is the key insight. RDKit treats bonds as first-class objects in the graph hierarchy, meaning you query them independently rather than digging them out of atom properties. Navigating individual nodes and edges is standard graph logic, but chemical graphs frequently feature cycles, better known as rings. You do not need to write your own cycle-finding traversal algorithms. RDKit pre-calculates these cycles when the molecule is instantiated. You access this data via the GetRingInfo method. This returns a dedicated ring information object rather than a simple list. If your task is simply to count the number of rings in the molecule, you call the NumRings method directly on this ring information object. When you need deeper structural details, you can ask this same object for the AtomRings property. This gives you a collection of sequences, where each sequence contains the exact atom indices that make up one specific ring in the graph. You can even pass an atom index to the ring info object to ask if that specific node participates in a ring of a particular size, such as a five- or six-membered cycle. Traversing a molecule is ultimately about chaining these basic operations. You grab the ring info to check the macro structure, you loop over atoms to read node-level data like atomic numbers, and you loop over bonds to map the specific edge connections. Once you stop seeing a molecule as a physical entity and start seeing it as a predictable collection of indexed nodes, edges, and pre-calculated cycles, extracting structural properties becomes a standard data parsing task. That is all for this one. Thanks for listening, and keep building!
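A short sketch of those traversal calls on a toy molecule (phenol):

    from rdkit import Chem

    mol = Chem.MolFromSmiles("Oc1ccccc1")

    # Nodes: index and element of every atom.
    for atom in mol.GetAtoms():
        print(atom.GetIdx(), atom.GetAtomicNum())

    # Edges: endpoints and bond type of every bond.
    for bond in mol.GetBonds():
        print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())

    # Rings: pre-computed cycle information.
    ring_info = mol.GetRingInfo()
    print(ring_info.NumRings())                  # 1
    print(ring_info.AtomRings())                 # tuple of atom-index tuples
    print(ring_info.IsAtomInRingOfSize(1, 6))    # is atom 1 in a six-membered ring?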
4

Substructure Searching

3m 50s

Master the art of querying molecules using SMARTS. We walk through finding specific functional groups and patterns within complex chemical structures.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 4 of 22. You would not try to parse thousands of text logs without using regular expressions. Similarly, you should not try to filter chemical databases without SMARTS patterns. Finding specific functional groups across large datasets requires dedicated structural logic, and that is exactly what Substructure Searching provides. Suppose you have a library of drug candidates. Your goal is to flag and isolate any molecule that contains a specific, known toxic functional group. First, you define that toxic group using a SMARTS string. SMARTS is an extension of SMILES designed specifically for querying molecular patterns, allowing you to specify wildcards, specific bond types, or ring structures. You pass this text string into the RDKit function that creates a molecule from SMARTS. This generates your query object. Your target molecules, the drug candidates, are already standard RDKit molecule objects. To filter the library, you take a candidate molecule and call the method named has substructure match. You pass your query object into this method. It evaluates the candidate molecule against the pattern and returns a simple boolean value. True means the toxic group exists somewhere inside the candidate. False means it is clean. Because this method stops searching the moment it finds a single valid match, it is highly optimized. You can loop this boolean check across your entire library to quickly split a massive dataset into safe and flagged subsets. Now, what if simply knowing the toxic group is present is not enough? Perhaps the toxicity scales with the number of times the group appears, or you need to isolate the exact location of the toxic atoms for a structural biologist. For this, you use the method named get substructure matches, plural. You call this on your candidate molecule, again passing in the query object. Instead of a boolean, this method forces the search engine to map every possible occurrence of the pattern. It returns a tuple containing other tuples. Each inner tuple represents one complete match of your pattern. The integers inside these tuples are the exact atom indices within the candidate molecule. Here is the key insight. The order of those indices perfectly mirrors the order of the atoms defined in your original SMARTS string. This means you always know exactly which atom in the target corresponds to which part of your query. If the toxic group appears three times in the candidate, you get three inner tuples. You can count the tuples to find the frequency of the pattern, or pass those specific atom indices into a drawing function to visually highlight the toxic regions. If you only need the atom indices of the very first match it finds, you can use the singular get substructure match method to save processing time. You must also account for stereochemistry. By default, RDKit substructure matching ignores chirality completely. A wedge bond and a dash bond will both trigger a match for a basic SMARTS query. If your target toxicity only occurs with a specific stereoisomer, this default behavior will generate false positives in your drug screen. To fix this, you pass an argument called use chirality and set it to true when calling any of the matching methods. RDKit will then enforce stereochemical rules based on the specific configuration defined in your query. 
The true power of substructure searching is that it maps a purely logical text query directly onto the physical topology of your dataset, bridging the gap between abstract string patterns and concrete atomic coordinates. That is all for this one. Thanks for listening, and keep building!
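Here is a minimal sketch of both matching calls, using a nitro group as a hypothetical toxic flag:

    from rdkit import Chem

    query = Chem.MolFromSmarts("[N+](=O)[O-]")              # the pattern to hunt for
    candidate = Chem.MolFromSmiles("O=[N+]([O-])c1ccccc1")  # nitrobenzene

    # Boolean screen: stops at the first hit.
    print(candidate.HasSubstructMatch(query))                # True

    # Every occurrence: tuple of tuples of atom indices, ordered like the SMARTS atoms.
    print(candidate.GetSubstructMatches(query))

    # Enforce stereochemistry when the configuration matters.
    print(candidate.HasSubstructMatch(query, useChirality=True))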
5

Fingerprinting and Molecular Similarity

3m 56s

Explore how to translate molecular graphs into mathematical bit vectors. We cover MACCS keys, Morgan fingerprints, and calculating Tanimoto similarity.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 5 of 22. AI models do not actually understand atoms and bonds. They only understand numbers. If you want an algorithm to compare two structures, you need a way to translate chemistry into math. Fingerprinting and molecular similarity is exactly how we bridge that gap. A molecule in RDKit is essentially a mathematical graph. The atoms act as nodes, and the bonds act as edges. To perform fast computations on these graphs, we convert them into bit vectors, which are just long arrays of ones and zeros. This array is called a fingerprint. The logic is straightforward. If a specific structural feature exists in the molecule, a specific bit in the array is flipped to one. If that feature is missing, the bit stays zero. By converting complex molecular graphs into standard bit vectors, we can easily compare them mathematically. RDKit provides several fingerprinting algorithms. The default RDKit fingerprint uses a topological approach. It analyzes linear paths through the molecule. The system starts at an atom and traces along the connected bonds up to a specific length, typically between one and seven bonds. Every unique path it finds is passed through a hashing function, which assigns that path to a specific position in the bit vector. While topological paths are useful, modern cheminformatics heavily relies on Morgan fingerprints, often called circular fingerprints. Instead of tracing linear paths, Morgan algorithms analyze the neighborhood radiating outward from every single atom. When generating a Morgan fingerprint, you must define a radius. A radius of zero means the algorithm only records the individual atoms. A radius of one captures each atom plus its immediate connected neighbors. A radius of two expands that circle one bond further. The algorithm catalogs all of these overlapping circular environments, hashes them, and flips the corresponding bits to one. Usually, we fold these hashes into a fixed-length vector, like two thousand and forty eight bits, to keep memory usage predictable. Morgan fingerprints with a radius of two are the industry standard because they capture functional groups and local chemical context beautifully. Let us look at a concrete scenario. You have two slightly different molecules. Maybe they share a large core structure, but one has an extra methyl group attached. You want to quantify how much they overlap. First, you read both molecules into RDKit. Next, you generate a Morgan fingerprint for each, setting the radius to two. You now have two distinct bit vectors. To calculate how alike they are, you compute their Tanimoto similarity. Here is the key insight. Tanimoto similarity ignores the zeros. It only cares about the features that are actually present. The math is simple intersection over union. RDKit counts the number of bits set to one in both fingerprints, and divides that by the total number of bits set to one in either fingerprint. If the two vectors match perfectly, the Tanimoto score is one point zero. If they share no features at all, the score is zero point zero. For our two molecules differing by a single methyl group, the circular environments around the core will mostly match, while the environments near the mutation will differ. You might get a Tanimoto score of zero point eight five, giving you a precise numerical value for their structural overlap. 
Keep in mind that mapping a complex molecule down to a fixed array of bits means losing some data, and a high Tanimoto score guarantees structural overlap, not biological equivalence. Thanks for spending a few minutes with me. Until next time, take it easy.
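A minimal sketch of that comparison with two toy molecules (toluene and ethylbenzene):

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    mol_a = Chem.MolFromSmiles("Cc1ccccc1")
    mol_b = Chem.MolFromSmiles("CCc1ccccc1")

    # Morgan (circular) fingerprints, radius 2, folded to 2048 bits.
    fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)

    # Tanimoto similarity: shared on-bits divided by total on-bits.
    print(DataStructs.TanimotoSimilarity(fp_a, fp_b))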
6

Breaking the 2D Plane

4m 04s

Transition from flat 2D drawings to realistic 3D geometries. We discuss adding explicit hydrogens and generating reliable 3D conformers using ETKDG.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 6 of 22. A two-dimensional drawing might look great on a screen, but drugs exist in three-dimensional space. If you ignore geometry, you ignore reality. Today, we are breaking the 2D plane and generating proper 3D coordinates for your molecules. When you read a molecule from a standard SMILES string, it has no coordinates at all. It is strictly a topological graph of connected atoms. Even if you load a structure from a 2D drawing file, those coordinates are merely spaced out for human readability. To perform docking, calculate surface area, or run physical simulations, you need a physically realistic 3D structure. The absolute first step before generating any 3D geometry is adding hydrogens. In a standard SMILES string, hydrogens are implicit. They are treated as a basic property of the heavy atoms, simply filling valence requirements. But in physical space, hydrogens take up real volume. They create steric hindrance, and they dictate the angles of the bonds around them. If you attempt to calculate a 3D structure without explicitly adding hydrogens first, the resulting geometry will collapse on itself and the bond angles will be completely wrong. RDKit provides a function called AddHs that converts those implicit hydrogens into actual nodes in your molecular graph, complete with bonds. You must always run this function before moving to 3D. Once you have a complete molecule, you need to calculate its spatial coordinates. Because single bonds can rotate freely, a flexible ligand does not have just one static shape. It can adopt many different shapes, known as conformations. To generate a viable conformation, RDKit uses a default method called ETKDG. This stands for Experimental Torsion-angle Preference with Distance Geometry. Here is the key insight. Older generation methods relied entirely on pure math. They used distance geometry to guess atomic positions based on known bond lengths and angles. This often led to awkward, high-energy shapes that required heavy computational cleanup. ETKDG solves this by combining the math of distance geometry with empirical rules derived from the Cambridge Structural Database. It knows how real, physical molecules actually prefer to bend and twist, and it forces the geometry algorithm to respect those natural preferences. Take a concrete scenario. You have a highly flexible ligand, and you need to understand all the different ways it might fold up to fit into a protein binding pocket. Generating a single conformation is not enough to capture that behavior. You need an ensemble. RDKit handles this with a function called EmbedMultipleConfs. You pass in your molecule with its explicit hydrogens, and you specify that you want fifty conformers. RDKit will then run the ETKDG algorithm fifty separate times, starting from different random seeds, to generate fifty distinct 3D geometries. It stores all fifty of these shapes inside the original molecule object. You do not get fifty separate molecules back; you get one molecule that holds fifty distinct coordinate sets. You can then loop through these coordinate sets to measure distances or calculate energies. Because ETKDG is heavily informed by real-world crystal data, the initial structures it provides are usually very high quality straight out of the function. 
A molecule is not a flat drawing, and it is rarely just a single rigid shape; it is a dynamic object, and sampling its multiple conformers gives you the true boundaries of its physical behavior. If you want to help keep the show going, you can support us by searching for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
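A minimal sketch of the AddHs plus multi-conformer workflow; the random seed is fixed here only so the run is reproducible:

    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")    # aspirin, for illustration

    # Hydrogens must become explicit graph nodes before embedding in 3D.
    mol_h = Chem.AddHs(mol)

    # Generate an ensemble of 50 conformers with ETKDG.
    params = AllChem.ETKDGv3()
    params.randomSeed = 42
    AllChem.EmbedMultipleConfs(mol_h, numConfs=50, params=params)

    print(mol_h.GetNumConformers())                      # one molecule, 50 coordinate sets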
7

Accelerating Feature Engineering

4m 05s

Bridge cheminformatics and standard data science with scikit-fingerprints. We explore generating over 30 types of molecular fingerprints directly within a scikit-learn interface.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 7 of 22. Writing custom fingerprint extraction loops for every new project is tedious and prone to errors. You switch from an ECFP fingerprint to a MACCS key, and suddenly you have to rewrite your entire preprocessing block. Scikit-fingerprints is a library that resolves this by making molecular feature extraction as simple as calling a standard scikit-learn transform. Molecules are fundamentally represented as graphs. Most machine learning algorithms, however, require multidimensional vectors. Molecular fingerprints are the feature extraction algorithms that bridge this gap, encoding structural information into numeric arrays. The problem is that standard open-source tools for computing these fingerprints, like RDKit, Open Babel, or the Chemistry Development Kit, are written in C++ or Java. Their Python wrappers do not natively align with the scikit-learn application programming interface. You end up writing custom data loaders, format converters, and error-prone loops just to get your data into a shape that a classifier can consume. Scikit-fingerprints changes this architecture. It implements over 30 different molecular fingerprints as standard, stateless scikit-learn transformers. All fingerprint classes inherit from the scikit-learn base classes. This means they plug directly into standard machine learning pipelines and feature unions. Consider a standard workflow. Normally, you might write a twenty-line custom loop in RDKit to iterate over a dataset, validate the molecules, extract circular fingerprints, and stack the results into an array. With this library, you replace that entire block of boilerplate with a single step. You create an object called ECFP Fingerprint and pass it directly into a scikit-learn pipeline right before your random forest model. When you call the fit method on your pipeline with your training data and target variables, the fingerprint transformer processes the inputs and outputs a dense NumPy array directly to the model. Here is the key insight. You do not need to convert your text representations into RDKit molecule objects before feeding them to the transformer. For any two-dimensional fingerprint based on graph topology, the transform method directly accepts a standard Python list of SMILES strings. The library handles the internal conversion automatically. Because SMILES strings are not always unique or chemically valid, the library also provides a Molecule Standardizer class. This class applies the sanitization steps recommended by RDKit to ensure data quality before extraction begins. The library also supports three-dimensional fingerprints that rely on spatial conformation. These spatial algorithms do require RDKit molecule objects with computed conformers. Generating conformers can be unstable, so the package includes a Conformer Generator class that uses a specific algorithm known as ETKDG version 3. This provides reliable defaults that maximize efficiency for simple molecules while minimizing calculation failures on complex compounds. You put the conformer generator at the start of your pipeline, follow it with a three-dimensional fingerprint transformer, and finish with an imputer to handle any missing values. By encapsulating complex chemistry logic inside standard transformer classes, the library abstracts away the domain-specific boilerplate. 
You configure options like the output vector length or whether you want a binary or count variant simply by passing parameters to the transformer constructor. The result is that tuning hyperparameters for molecular fingerprints becomes just as straightforward as tuning the depth of a decision tree. That is your lot for this one. Catch you next time!
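Here is a rough sketch of that workflow. The import path and the ECFPFingerprint class name are assumptions about your installed scikit-fingerprints version, and the data is a toy stand-in:

    from skfp.fingerprints import ECFPFingerprint       # assumed class/module names
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline

    smiles = ["CCO", "c1ccccc1", "CC(=O)O"]              # toy SMILES list
    labels = [0, 1, 0]                                   # toy activity labels

    pipeline = make_pipeline(
        ECFPFingerprint(),                               # SMILES in, feature matrix out
        RandomForestClassifier(),
    )
    pipeline.fit(smiles, labels)
    print(pipeline.predict(["CCN"]))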
8

High-Performance Cheminformatics

4m 10s

Learn how to process massive chemical datasets efficiently. We dive into utilizing CPU parallelism with Joblib and saving memory using SciPy sparse matrices.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 8 of 22. Computing complex substructures across a massive dataset can easily crash your machine with out-of-memory errors, unless you know exactly how to manage your data footprint. The technique that resolves this is High-Performance Cheminformatics. When you extract features from a molecule, you often use substructure-based fingerprints. The Klekota-Roth fingerprint is a classic example. To compute it, your machine checks a molecule against thousands of predefined chemical patterns, known as SMARTS patterns. It scans the molecular graph over and over to see if specific functional groups or structural motifs exist. Doing this sequentially for a few molecules is fine. Doing it for a dataset of four hundred thousand molecules is a severe computational bottleneck. The computation of molecular fingerprints is an embarrassingly parallel task. The structural analysis of one molecule has absolutely zero dependency on the analysis of the next molecule in your dataset. Because they share no state, you can compute them entirely simultaneously. To scale this effectively in Python, you utilize Joblib, specifically relying on the Loky executor. When you initiate the process, the input dataset is split into chunks corresponding precisely to your available CPU cores. If you have a 16-core processor, your input molecules are divided into 16 separate batches. Each Python worker process takes one batch and begins executing the SMARTS pattern matching independently. The technical hurdle with multiprocessing in Python is usually the cost of moving data between the workers and the main process. Loky bypasses this by using memory mapping. Instead of serializing the final fingerprint arrays and sending them through inter-process communication, the workers write directly to a shared memory space. For computationally heavy fingerprints like Klekota-Roth, distributing the workload across 16 cores yields nearly a fifteen-fold speedup. That handles the processing time. The second critical bottleneck is system memory. A single Klekota-Roth fingerprint vector is long. If you process hundreds of thousands of molecules, you generate a massive matrix of results. By default, numerical libraries return this result as a dense NumPy array. Every single position in that matrix allocates memory, regardless of whether the value is a one or a zero. Chemical fingerprints are overwhelmingly sparse. Usually, only one or two percent of the requested structural features are actually present in any given molecule. The vast majority of your resulting matrix consists of zeros. Storing those zeros is what triggers the out-of-memory errors on large datasets. The fix is switching the output format to a SciPy sparse matrix, specifically using the Compressed Sparse Row format. A sparse matrix fundamentally changes how the data is stored. Instead of building a rigid grid in memory, it only records the values of the non-zero elements alongside their row and column coordinates. Consider a real-world scenario using the PCBA dataset, which contains just under four hundred and forty thousand molecules. You run the Klekota-Roth fingerprint computation across the entire dataset using your 16 cores. The parallel execution finishes efficiently. If you leave the output as a default dense array, this single matrix will consume just over two Gigabytes of RAM. 
By instructing the computation to return a SciPy sparse array instead, that exact same dataset drops to a footprint of just 23 Megabytes. You achieve an eighty-eight-fold reduction in memory without losing a single piece of chemical information, and the sparse representation does not negatively impact your computation time at all. Here is the key insight. You do not need a massive compute cluster to process hundreds of thousands of molecules, provided you stop paying the memory tax for storing zeros and ensure your data passing skips standard inter-process communication overhead. That is all for this one. Thanks for listening, and keep building!
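A generic sketch of both ideas, using MACCS keys as a lightweight stand-in for the heavier Klekota-Roth fingerprint; the dataset is a toy placeholder:

    import numpy as np
    from joblib import Parallel, delayed
    from rdkit import Chem
    from rdkit.Chem import MACCSkeys
    from scipy import sparse

    smiles = ["CCO", "c1ccccc1", "CC(=O)O"] * 1000       # stand-in for a large library

    def fingerprint(smi):
        mol = Chem.MolFromSmiles(smi)
        return np.array(MACCSkeys.GenMACCSKeys(mol), dtype=np.uint8)

    # Embarrassingly parallel: joblib's default Loky backend spreads the work
    # across worker processes, one chunk per CPU core.
    rows = Parallel(n_jobs=-1)(delayed(fingerprint)(s) for s in smiles)

    # Compressed Sparse Row storage: only the non-zero bits are kept in memory.
    X = sparse.csr_matrix(np.vstack(rows))
    print(X.shape, X.nnz)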
9

End-to-End ML Pipelines

3m 58s

Combine processing, fingerprinting, and prediction into a single clean architecture. We build robust scikit-learn pipelines that seamlessly integrate 3D conformer generation and property prediction.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 9 of 22. A complex 3D virtual screening script used to require hundreds of lines of fragile code. You had to manually generate conformers, handle failures, compute multiple descriptors, and stitch the arrays together before even touching a model. Now, you can architect the entire process in a single, elegant pipeline definition using End-to-End ML Pipelines. In machine learning, a pipeline is a sequence of data processing steps and a final estimator grouped into a single object. In cheminformatics, especially with 3D structural data, preprocessing is notoriously fragmented. You take raw SMILES strings, calculate 3D coordinates, run force field optimizations, extract features, fix missing values, and finally train a classifier. Doing this manually means writing custom loops and intermediate data structures that break easily and leak memory. We will walk through building a complete scikit-learn pipeline that takes raw SMILES straight to a Random Forest classifier. The first step in the sequence is conformer generation. You initialize a conformer generator and pass it as the first stage of the pipeline. It reads the 2D input and computes the 3D structures. You can configure it to optimize the geometry using a force field like MMFF94. It automatically parallelizes this heavy lifting across all available CPU cores. Now, the second piece of this is feature extraction. For 3D tasks, combining different geometry descriptors captures more molecular information. You use a scikit-learn feature union to compute the GETAWAY and WHIM fingerprints simultaneously. Both of these fingerprint classes act as stateless transformers in the pipeline. They take the 3D conformers from the previous step, calculate their respective descriptors in parallel, and concatenate the results into a single, wide feature matrix. Next, you must handle computation failures. 3D descriptor algorithms sometimes fail to process highly complex or strained molecules, resulting in missing values in your matrix. The pipeline handles this without custom error handling. You drop a simple imputer directly after the feature union. If a GETAWAY or WHIM calculation outputs a missing value, the imputer catches it and replaces it with the mean of that feature across your dataset. Finally, you cap the pipeline with your predictive model, which in this case is a Random Forest classifier. To architect this in code, you call the make pipeline function. Inside that function call, you pass your conformer generator. Next, you pass the feature union containing your GETAWAY and WHIM fingerprints. Then comes the simple imputer, and lastly, the Random Forest classifier. You assign this entire sequence to a single variable. When you call the fit method on that pipeline variable, you pass your raw training SMILES strings and your target labels. The strings flow sequentially through the conformer generator, into the feature union for fingerprinting, through the imputer to clean the data, and straight into the classifier for training. When it is time to evaluate, calling the predict method on your test SMILES strings forces the new data to follow the exact same path. Here is the key insight. State and data routing are managed entirely by the pipeline object, meaning you never hold intermediate dense arrays in memory or write custom data loaders for your conformers. That is all for this one. Thanks for listening, and keep building!
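A sketch of that pipeline definition. The scikit-fingerprints class names and import paths (ConformerGenerator, GETAWAYFingerprint, WHIMFingerprint) are assumptions about your installed version, and depending on that version you may need an explicit SMILES-to-molecule step before the conformer generator:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline, make_union

    from skfp.preprocessing import ConformerGenerator           # assumed import paths
    from skfp.fingerprints import GETAWAYFingerprint, WHIMFingerprint

    pipeline = make_pipeline(
        ConformerGenerator(),                                    # 2D input -> 3D conformers
        make_union(GETAWAYFingerprint(), WHIMFingerprint()),     # concatenated 3D descriptors
        SimpleImputer(strategy="mean"),                          # patch failed calculations
        RandomForestClassifier(),
    )

    # pipeline.fit(train_smiles, train_labels)
    # predictions = pipeline.predict(test_smiles)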
10

Predicting Binding Affinity

3m 52s

Explore the reality of predicting protein-ligand binding affinity. We compare the performance of simple 2D tree-based models against complex 3D Graph Neural Networks.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 10 of 22. You wire up a massive three-dimensional graph neural network, feeding it spatial coordinates for every atom in a protein pocket. Then you run a simple gradient boosted decision tree that only looks at the two-dimensional sketch of the drug. The benchmark finishes, and your cutting-edge spatial model just got beaten by an algorithm from two decades ago. The reason your heavy network failed is rooted in how we handle predicting binding affinity. Predicting binding affinity is the computational process of estimating how tightly a small molecule, or ligand, attaches to a specific protein target. In drug discovery, finding a molecule that binds strongly is the entire goal. To do this, engineers typically take one of two paths. The first is highly complex. You use three-dimensional neural networks like GraphNet or TensorNet. These models ingest the exact bound conformation of the protein-ligand complex. They use message passing layers to learn the precise spatial distances and quantum mechanical features between the atoms of the drug and the atoms of the protein pocket. The second path completely ignores the protein. You drop the spatial coordinates and use a classical two-dimensional model like XGBoost. The input here is just a concatenated vector of molecular fingerprints. You compute structural features from the ligand alone, essentially turning the two-dimensional drawing of the chemical into an array of numbers, and feed that directly into the tree-based model. To see how these approaches compare, researchers run them against standardized test sets. One of the most telling is the Merck FEP benchmark. This dataset mimics a common virtual screening scenario called a congeneric series. In a congeneric series, all the tested ligands share the exact same core chemical scaffold and bind to the exact same site on a single protein target. The only differences between the molecules are small structural variations, like different chemical branches attached to the main core. Here is the key insight. When evaluated on the Merck dataset, the heavy three-dimensional models achieved a correlation score around zero point three. The simple two-dimensional XGBoost model scored significantly higher, reaching zero point four five. The computationally cheap decision tree clearly outperformed the advanced graph neural networks. This happens because of what the models are forced to focus on. In a congeneric series, the protein target and the binding pocket do not change. The three-dimensional model spends massive computational resources mapping a spatial environment that remains static across every single test case. Worse, these models are highly sensitive to tiny variations in atomic coordinates. A slight, arbitrary shift in how a hydrogen atom is placed during data preparation introduces noise that distracts the graph network. The two-dimensional model succeeds precisely because it is blind to the protein. By only looking at the ligand features, it relies on the only variables that actually change from one test to the next. The decision tree correlates those direct structural variations in the drug to the final binding strength, bypassing the noise of the spatial environment entirely. Heavy three-dimensional models are still highly valuable when you need to generalize across entirely different, unseen protein targets where the pocket geometry is unknown. 
But when you are optimizing a specific family of drugs for a single known target, feeding constant environmental data into a deep network is inefficient and prone to error. The most powerful predictive tool is often the one that filters out the static environment and models only the variables that change. That is your lot for this one. Catch you next time!
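For reference, a toy sketch of the ligand-only path described above, 2D fingerprints fed into a gradient boosted model; the molecules and affinity labels are invented placeholders, not the Merck benchmark:

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from xgboost import XGBRegressor

    smiles = ["Cc1ccccc1", "CCc1ccccc1", "CCCc1ccccc1", "CC(C)c1ccccc1"]
    affinity = [5.1, 5.4, 5.9, 6.2]                      # hypothetical potency labels

    def featurize(smi):
        mol = Chem.MolFromSmiles(smi)
        return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024))

    X = np.vstack([featurize(s) for s in smiles])
    model = XGBRegressor(n_estimators=200)
    model.fit(X, affinity)
    print(model.predict(X[:1]))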
11

LLMs vs Classical Fingerprints

4m 15s

Discover how Natural Language Processing applies to chemistry. We pit vector embeddings from Large Language Models against classical RDKit structural fingerprints for predicting bioactivity.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 11 of 22. What if the best way to describe a molecule to a machine learning model is not with a handcrafted structural formula, but by treating its SMILES string exactly like a sentence in natural language? You might be spending valuable compute cycles generating complex chemical descriptors, only to be outperformed by a text model. That tension is at the core of LLMs vs Classical Fingerprints. When predicting how tightly a ligand binds to a protein, your model needs a mathematical description of the ligand. The traditional route relies on classical structural fingerprints computed by tools like RDKit. You run a molecule through a deterministic algorithm and receive a static vector. A Morgan fingerprint counts circular substructures around atoms. MACCS keys check the molecule against a predefined list of chemical patterns. The limitation of these classical fingerprints is their rigidity. They encode specific, unchangeable rules. You cannot fine-tune them. If a particular structural nuance matters for a highly specific binding pocket, but the fingerprint algorithm was not explicitly designed to capture it, that information is completely lost before the predictive model even sees the data. Instead of hardcoding chemical rules, we can use a pretrained chemical Large Language Model like BioT5, GPT2, or BERT. These models are pre-trained on millions of SMILES strings. They learn the grammar of chemistry in an unsupervised way. When you pass a ligand to an LLM, it does not output a fixed checklist of functional groups. It outputs a rich vector embedding. Every character or token in that SMILES string gets its own contextual vector. To understand the difference, look at how these representations feed into predictive models. First, consider an XGBoost model using classical MACCS keys. You generate the MACCS fingerprint, which results in a simple binary array. You pass that fixed vector to XGBoost, which tries to map those raw presence-or-absence features to binding affinity. In benchmark tests across congeneric series, this specific combination consistently yields the poorest performance. The handcrafted features are simply too blunt. Now, swap that architecture for a BioT5 embedding fed into a Transformer head. First you pass the raw SMILES string to the BioT5 model. This returns a sequence of per-token embeddings. Then you pass that sequence into a Transformer head. Here is the key insight. The Transformer uses an attention mechanism. It looks across the entire sequence of token embeddings and learns dynamically which parts of the molecule matter most for binding to this specific target. It weighs the features intelligently before outputting the predicted binding affinity. If you try to take those exact same BioT5 tokens, sum them together into one flat vector, and feed that into XGBoost, the predictive performance drops significantly. Sum pooling averages out the token-level details. The Transformer head succeeds precisely because it preserves and exploits the granular context of the text-like representation. This shift from static arrays to dynamic embeddings offers massive practical benefits. LLM embeddings are highly versatile and can be fine-tuned for specialized downstream tasks. They are also much more compact than massive classical bit vectors, which saves memory when storing large molecular libraries. 
Furthermore, generating text embeddings runs on a GPU, which is drastically faster than computing traditional RDKit fingerprints on a CPU. The era of manually telling algorithms which chemical substructures matter is ending; the models that perform best are the ones allowed to read the molecule and decide for themselves. If you want to help support the show, you can search for DevStoriesEU on Patreon. Thanks for hanging out. Hope you picked up something new.
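A rough sketch of pulling per-token SMILES embeddings with the Hugging Face transformers library; the checkpoint name is a hypothetical placeholder, and the attention head that would sit on top is not shown:

    import torch
    from transformers import AutoModel, AutoTokenizer

    model_name = "your-chemical-language-model"          # hypothetical checkpoint name
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    smiles = "CC(=O)Oc1ccccc1C(=O)O"                     # aspirin
    inputs = tokenizer(smiles, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    token_embeddings = outputs.last_hidden_state         # (1, n_tokens, hidden_dim)
    pooled = token_embeddings.sum(dim=1)                 # sum pooling discards per-token context
    print(token_embeddings.shape, pooled.shape)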
12

Active Learning for Virtual Screening

4m 07s

Learn how to iteratively discover top drug candidates without exhaustive testing. We dive into active learning loops and greedy selection strategies to maximize hit rates.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 12 of 22. Running complex physical simulations on a million molecules is impossibly slow. You need a way to find the absolute best candidates by testing just a tiny fraction of your dataset. That is exactly what Active Learning for Virtual Screening does. When you evaluate a chemical library, calculating accurate binding affinities using computational physics methods is incredibly expensive. You simply cannot afford to simulate every single compound. Instead, you test a small batch, use those results to train a machine learning model, and let that model predict the affinities for the rest of the library. You then test the most promising predictions, update the model, and repeat the process. This iterative cycle is active learning. Let us look at a concrete scenario. You are exploring a congeneric series of ten thousand compounds targeting a specific protein, like Tyk2. Your goal is to find the top one percent of active molecules. To do this efficiently, you rely on a greedy selection strategy. A greedy strategy means your model always picks the compounds with the highest predicted binding affinities for the next round of testing. You set your batch size to sixty molecules per round. This number strikes a practical balance for a real world workflow. It is small enough that you can run demanding physics simulations on the batch quickly, but large enough to provide a substantial chunk of new data to your model. You run your tests on these sixty compounds to get their true binding affinities, and immediately feed that data into a two dimensional tree based model, like XGBoost. The XGBoost model learns the patterns, scores the remaining untested molecules in the ten thousand compound pool, and selects the next sixty candidates. Here is the key insight. How you pick the very first batch of sixty molecules dictates how fast your entire system learns. Standard active learning often defaults to a random baseline. You select sixty molecules completely at random, test them, and train your first model. But random selection gives your model a poor starting point, filling the initial training set with mostly inactive compounds. The solution is to initialize the active learning loop using a pretrained three dimensional neural network. This 3D model has already been trained on a massive, general dataset of diverse protein and ligand complexes. Because it understands general binding physics based on structural interactions, it can score your ten thousand compounds before the active learning loop even begins. First, you use the pretrained 3D model to predict affinities for the entire pool. Then, you take the top sixty molecules identified by this prescreening and run your heavy physics simulations on them. Now you have a highly enriched set of starting data. You pass this initial, high quality dataset to your XGBoost model. From this point forward, the XGBoost model takes over the cycle. It trains on the verified data, predicts the remaining pool, and greedily selects the next sixty candidates. This combination generates a massive acceleration in hit discovery. The general 3D model provides a rich starting point, and the XGBoost model rapidly adapts to the specific chemical space of your congeneric series. By initializing with a pretrained 3D model instead of random sampling, you can find eighty percent of the top one percent of binders after testing less than ten percent of the entire dataset. 
Starting your loop with a general 3D prescreening gives your specialized models an unbeatable head start. That is all for this one. Thanks for listening, and keep building!
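Here is a schematic sketch of that greedy loop. The oracle function, feature matrix, and helper name are assumptions standing in for your expensive physics simulations and featurization:

    import numpy as np
    from xgboost import XGBRegressor

    def greedy_active_learning(X_pool, oracle, initial_idx, batch_size=60, rounds=10):
        # X_pool: features for the whole library; oracle(indices) returns true affinities.
        tested = list(initial_idx)                  # e.g. top 60 from a pretrained 3D model
        y_tested = list(oracle(tested))
        for _ in range(rounds):
            model = XGBRegressor(n_estimators=200)
            model.fit(X_pool[tested], np.array(y_tested))
            scores = model.predict(X_pool)
            scores[tested] = -np.inf                # never re-select tested compounds
            batch = np.argsort(scores)[::-1][:batch_size]   # greedy: highest predictions first
            tested.extend(batch.tolist())
            y_tested.extend(oracle(batch.tolist()))
        return tested, y_tested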
13

The Activity Cliff Challenge

3m 31s

Examine the fragility of structure-activity relationships. We discuss 'activity cliffs'—where a tiny structural change causes a massive shift in a drug's potency.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 13 of 22. In chemistry, a single atom out of place can turn a potent, life-saving drug into an inert powder. These sudden drops in biological potency are called activity cliffs, and they are the absolute nemesis of traditional AI models. An activity cliff occurs when you have a pair of molecules with high structural similarity but significantly different activity levels. Consider a concrete scenario. You have two molecules that share ninety percent of their structural framework. The first molecule binds tightly to a biological target. The second molecule is completely inactive. The only physical difference between them is a single methyl group attached to a specific ring. To a human medicinal chemist, this specific structural modification is highly informative. It tells them exactly where the boundaries of the receptor pocket are. To a standard machine learning model, it is a catastrophic disruption. Here is the key insight. Most deep learning approaches in cheminformatics are built to predict absolute property values. Whether you are using a graph neural network or a chemical language model, the architecture is fundamentally designed to map chemical structures into a continuous mathematical space. The core assumption hardcoded into these models is that similar molecular structures should map to similar biological properties. When a standard model processes our two nearly identical molecules, it generates representations that are right next to each other in that mathematical space. Because the inputs are close together, the model naturally outputs nearly identical absolute potency predictions for both. Activity cliffs fundamentally violate this assumption of a smooth, continuous chemical space. They represent a severe discontinuity. The model expects a gentle hill, but encounters a vertical drop. This problem is heavily amplified by the nature of drug discovery data. Experimental datasets are notoriously limited and noisy. When you train a deep neural network on sparse data, the model struggles to generalize. To minimize overall error across the training set, the network learns broad, global patterns. It smooths out the local irregularities. When it encounters an activity cliff, the standard regression objective treats that sudden jump in variance as experimental noise or an outlier. The model ignores the most critical piece of local structural information because it does not fit the global trend. This is why predicting activity cliffs remains one of the hardest problems in molecular property prediction. The models are forced to learn a discontinuous chemical space directly from limited data. Because they focus entirely on absolute predictions for single molecules, they completely ignore the valuable information hidden in the relative differences between matched molecular pairs. In many cases, simpler tree-based models actually end up outperforming complex neural networks on these datasets simply because deep learning models over-smooth the representations. The assumption that similar chemical structures always yield similar biological activities is a useful statistical baseline, but it is not a physical law. Activity cliffs are the brutal, discontinuous reality of structure-activity relationships, and they prove that predicting absolute properties in a vacuum will always fail at the margins. Thanks for spending a few minutes with me. Until next time, take it easy.
14

Similarity-Quantized Relative Learning

3m 32s

Solve the activity cliff problem by rethinking how models learn. We explore the SQRL framework, which trains AI to predict relative property differences between strictly filtered molecular pairs.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 14 of 22. Instead of forcing an AI to memorize the absolute potency of every molecule in existence, what if you simply taught it to ask how a molecule differs from its closest known neighbor? This shift in perspective solves a major generalization problem in low-data regimes, and it is the core of a framework called Similarity-Quantized Relative Learning, or SQRL. Normally, molecular property prediction treats every molecule as an isolated data point. The model tries to map a structure directly to an absolute property value. With small, noisy datasets, deep learning models struggle to build an accurate global map of chemical space. Previous attempts at pairwise learning tried to fix this by pairing every molecule with every other molecule in the training set. That approach floods the data with comparisons between completely unrelated structures, drowning out the useful local signal. SQRL fixes this by restricting the training data to pairs of molecules that are structurally highly similar. The model learns to predict the relative difference in their properties, known as delta y. This is achieved through a specific dataset matching step. You do not feed the model individual molecules. You feed it pairs, but only if they pass a strict similarity threshold. Let us walk through the logic. You start with your standard training set of molecules and their known potencies. First, you calculate the pairwise distances between all molecules using a metric like Tanimoto distance on Morgan fingerprints. Here is the key insight. You set a distance threshold, alpha. Let us use a Tanimoto distance threshold of zero point seven. You iterate through all possible molecule pairs. If the distance between molecule A and molecule B is zero point seven or higher, you discard the pair entirely. If the distance is strictly less than zero point seven, you add this pair to your new relative training set. The target variable for this new pair is no longer an absolute potency. It is the exact numerical difference in potency between molecule A and molecule B. Now you train your neural network. The network generates a mathematical representation for molecule A, and a representation for molecule B. It subtracts the representation of B from the representation of A. That resulting difference vector is passed through a final layer to predict the delta y. By filtering out the noise of dissimilar pairs, the network is forced to focus exclusively on local, high-signal chemical changes. It learns exactly how a specific structural tweak alters the activity. This makes the model highly sensitive to activity cliffs. That covers training, but what about predicting a brand new molecule? When a new structure comes in, the system scans the original training data to find the single closest structural neighbor based on that same Tanimoto distance metric. The network evaluates the new molecule against this nearest neighbor and predicts the relative delta. Finally, you take the neighbor's known absolute potency, add the predicted delta, and you have your final prediction. By restricting the training space to highly similar pairs, you stop asking the model to learn the entire chemical universe and instead train it to become an expert in local chemical gradients. That is all for this one. Thanks for listening, and keep building!
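A sketch of that pair-construction step with RDKit fingerprints; the helper name and toy threshold handling are illustrative:

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def build_relative_pairs(smiles, potency, alpha=0.7):
        mols = [Chem.MolFromSmiles(s) for s in smiles]
        fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
        pairs, deltas = [], []
        for i in range(len(fps)):
            for j in range(len(fps)):
                if i == j:
                    continue
                distance = 1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])
                if distance < alpha:                    # keep only structurally similar pairs
                    pairs.append((i, j))
                    deltas.append(potency[i] - potency[j])   # relative target, delta y
        return pairs, np.array(deltas)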
15

The Generative AI Revolution

3m 23s

Transition from predicting properties to imagining entirely new molecules. We map out the landscape of molecular generative tasks: De Novo generation, optimization, and conformer generation.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 15 of 22. For years, AI in chemistry was purely an act of triage. You would feed a model thousands of existing molecules, and it would simply predict which ones were the least terrible. If the ideal molecule was not in your screening library, the model could not help you. The Generative AI Revolution fundamentally flipped that dynamic. Instead of just filtering what already exists, we can now instruct models to invent entirely new chemical matter. This shift from prediction to creation splits into two major molecular generative tasks: de novo generation and molecular optimization. De novo generation involves creating novel molecular structures from scratch. The model starts with a representation of random noise and iteratively refines it into a valid chemical structure. When this is done without any constraints, it is called unconditional generation. The model freely explores the vast chemical space to produce something entirely new. While that is useful for broad discovery, you usually need more control. This brings us to conditional generation, specifically property-based generation. Here, you dictate the output. You provide specific constraints, such as a target bioactivity or a required level of synthesizability, and the model restricts its generation to molecules that meet those criteria. This is often referred to as inverse molecule design because you start with the properties you want and force the model to work backward to construct a molecular structure that delivers them. De novo generation is powerful, but you rarely start a project with zero existing knowledge. Usually, you already have a lead compound. This is where molecular optimization comes in. Unlike de novo tasks, molecular optimization focuses on modifying a known structure rather than starting from a blank slate. You take an existing molecule and refine it to enhance its properties. Suppose you have a moderately effective drug scaffold. It binds to your target, but its bioactivity is too low to be a viable drug. Using a generative model, you can perform targeted molecular optimization. One approach is scaffold hopping. You instruct the model to replace the core molecular scaffold with a new one while retaining the original biological activity. This is highly effective for discovering structurally novel compounds that evade existing patents while keeping the functional behavior intact. Another approach is R-group design. In this scenario, you lock your core scaffold in place and instruct the generative model to automatically optimize its side chains. The model generates new R-groups, looking for the specific side chain modifications that improve that lagging bioactivity. You are not discarding your moderately effective molecule, you are letting the AI calculate the precise structural tweaks needed to push it over the finish line. Here is the key insight. The transition from predictive AI to generative AI means you are no longer limited by the molecules you have on hand. Whether you are generating a custom molecule from a blank slate or algorithmically swapping the side chains of an existing drug, you are treating chemical space as a programmable medium. That is all for this one. Thanks for listening, and keep building!
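
To make the scaffold vocabulary concrete, the snippet below extracts a Bemis-Murcko scaffold with RDKit: scaffold hopping would replace this core, while R-group design keeps it fixed and varies the side chains. The lead SMILES is an arbitrary placeholder, not a molecule from the episode.

```python
# Illustrative RDKit snippet: the Bemis-Murcko scaffold is the ring systems plus
# the linkers between them. Scaffold hopping swaps this core out; R-group design
# locks it in place and optimizes the substituents around it.
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

lead = Chem.MolFromSmiles("O=C(Nc1ccccc1)c1ccc(N2CCOCC2)cc1")   # placeholder lead compound
core = MurckoScaffold.GetScaffoldForMol(lead)

print(Chem.MolToSmiles(core))   # the core a generative model would either replace or hold fixed
```
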
16

The Intuition of Molecular Diffusion

3m 48s

We break down the core concept of diffusion models without the heavy math. Listeners will understand the forward process of adding noise to a molecule and the reverse process of hallucinating new structures.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 16 of 22. To teach an AI how to invent a new molecule, you first have to teach it how to completely destroy one. It seems entirely backward, but this systematic destruction is the exact mechanism used to generate novel drugs from scratch. This is the core intuition of Denoising Diffusion Probabilistic Models. Traditional molecular design is incredibly labor-intensive. If you want to automate the discovery of novel compounds, you need a model capable of exploring a vast chemical space without generating invalid nonsense. Early deep generative models tried to map this space directly. Denoising Diffusion Probabilistic Models, or DDPMs, take a different route. They treat molecular generation as a progressive denoising problem. The framework is split into two distinct Markov chains: the forward process and the reverse process. The forward process is strictly about data degradation. You take a valid, known drug from your training set. Over a fixed sequence of steps, you progressively blur its atomic coordinates by injecting pure Gaussian noise. The amount of noise added at each step is controlled by a set hyperparameter schedule. At step one, you perturb the atoms slightly. The molecule is a bit distorted but still clearly recognizable. By step fifty, the structure is severely warped. By the final step, typically denoted as step T, the original molecule is entirely gone. You are left with a random cloud of unstructured Gaussian noise. This forward process does not require a neural network. It is a strict mathematical corruption. It serves a vital purpose because it generates the ground truth for our training data. Here is the key insight. Because we controlled the exact amount of noise added at every single step, we have a perfect, step-by-step record of how the molecule fell apart. The reverse process is where the neural network comes in. The model is trained to walk that exact path backward. We feed the network a corrupted molecule at a specific time step. We then ask it to predict the specific noise that was added to reach that state. We evaluate the model by comparing its noise prediction against the actual noise we injected during the forward phase. We update the model parameters to minimize that difference. Over time, the network learns to denoise the data step by step, gradually restoring the original data distribution. To generate a completely new molecule, you execute this reverse process from scratch. First, you sample a completely random cloud of Gaussian noise. Then, you pass that noise into your trained neural network, along with the starting step number. The network evaluates the input, predicts the structural correction needed, and returns a slightly less noisy cloud of atoms. You loop this subtraction process. You pass the new output back into the network for the next step down. With each iteration, the random cloud tightens. The model continuously strips away the noise. As you step backward toward step zero, atomic coordinates snap into place and a valid chemical structure emerges. You pull a completely novel molecule out of the initial static. The model does not just memorize a database of existing drugs; it simply learns the universal process of removing chaos to leave behind stable chemistry. If you would like to support the show, you can search for DevStoriesEU on Patreon. Thanks for listening, and keep building!
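
A minimal PyTorch sketch of the two processes described above, assuming a toy `denoiser` network that takes corrupted coordinates and a timestep; the linear noise schedule and tensor shapes are illustrative choices, not a specific published model.

```python
# A minimal PyTorch sketch of the DDPM idea on atomic coordinates: a fixed forward
# corruption and a training step that asks an assumed `denoiser` network to predict
# the injected noise. The linear schedule and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # how much noise each step adds
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative signal retained by step t

def forward_noise(coords, t):
    """Forward process: corrupt clean (n_atoms, 3) coordinates at timestep t."""
    noise = torch.randn_like(coords)
    a_bar = alphas_bar[t]
    noisy = a_bar.sqrt() * coords + (1.0 - a_bar).sqrt() * noise
    return noisy, noise

def training_step(denoiser, coords, optimizer):
    """Reverse-process training: predict the exact noise that was injected."""
    t = int(torch.randint(0, T, (1,)))
    noisy, noise = forward_noise(coords, t)
    pred = denoiser(noisy, t)                      # network sees the corrupted molecule and the step
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
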
17

Bridging 2D and 3D Generative Spaces

4m 03s

We explore how AI actually represents the molecules it generates. We compare generating flat 2D topological graphs with generating complex 3D geometric point clouds, and the challenges of each.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 17 of 22. It is one thing for a model to draw a flat 2D graph of a molecule. It is an entirely different engineering nightmare to generate a stable, 3D geometric cloud of atoms. You can have a generative model build a beautiful 3D point cloud of a drug that perfectly fills a target protein pocket, only to watch the whole thing fall apart in post-processing when the system fails to guess where the actual covalent bonds should go. The solution to this disconnect is Bridging 2D and 3D Generative Spaces. In generative chemistry, data modalities dictate what your model can understand. The first modality is 2D topological space. Think of this as a standard molecular graph. Nodes represent atoms with specific types, and edges represent the chemical bonds connecting them. The model outputs an adjacency matrix telling you exactly what is connected to what. Graph neural networks handle this well. The problem is that molecules exist in the physical world, not on paper. A 2D graph gives you the binding topology, but it completely ignores the 3D geometric structure. Without spatial coordinates, you cannot accurately calculate quantum properties or perform structure-based drug design. To fix that, models shifted to generating molecules directly in 3D geometric space. Here, the output is a point cloud. The model defines the atom types and their exact X, Y, and Z positional coordinates. The technical hurdle here is maintaining SE 3 equivariance, ensuring the molecule remains mathematically consistent regardless of how it is rotated or translated in space. Here is the key insight. Generating in pure 3D space means the model does not explicitly generate the chemical bonds. It just places atoms in space. You have to infer the binding topology afterward through post-processing algorithms. This introduces major errors. Returning to the drug pocket scenario, your model might arrange atoms in a shape that physically fits the target, but because it never considered binding topology during generation, the post-processing step might infer impossible covalent bonds. For larger molecules, directly generating a stable 3D structure without any topological guidance often results in a suboptimal solution. This brings us to generating in 2D and 3D joint space, which produces a complete molecular structure simultaneously. The model outputs the atom types, the discrete adjacency matrix for the bonds, and the continuous spatial coordinates all at once. By bridging these two spaces, the modalities constrain and correct each other during the generation process. The 2D topology acts as a blueprint, guiding the 3D structure to ensure the spatial arrangements are chemically feasible. At the same time, the 3D geometry refines the 2D graph by suggesting plausible bonding patterns based on spatial proximity. The main technical challenge in this joint approach is managing two fundamentally different data types. You are forcing the model to handle discrete topological structures, like bond types, alongside continuous geometric structures, like coordinate values. Different architectures resolve this differently. A framework called JODO treats both the topological and geometric structures as continuous variables to process them together. Another model, MUDiff, handles them separately, applying a discrete process for the topology and a continuous process for the geometry. 
You cannot reliably generate novel, functional drugs by guessing the physical shape and hoping the chemical bonds work themselves out later. True molecular generation requires topological blueprinting and spatial positioning to interact and complement each other in the exact same computation. That is all for this one. Thanks for listening, and keep building!
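
To make the two modalities concrete, here is a small RDKit snippet that pulls both out of a single molecule: the discrete adjacency matrix (2D topology) and the continuous conformer coordinates (3D geometry). The aspirin SMILES is just an example input.

```python
# Small RDKit illustration of the two data modalities for one molecule:
# the discrete 2D topology (adjacency matrix) and the continuous 3D geometry
# (conformer coordinates).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, rdmolops

mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))
AllChem.EmbedMolecule(mol, AllChem.ETKDGv3())          # generate one 3D conformer

adjacency = rdmolops.GetAdjacencyMatrix(mol)           # discrete: which atoms are bonded
coords = np.array(mol.GetConformer().GetPositions())   # continuous: x, y, z per atom

print(adjacency.shape, coords.shape)                   # (n_atoms, n_atoms) and (n_atoms, 3)
```
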
18

Target-Aware Generation & Docking

3m 30s

Discover context-aware generative design. We discuss generating novel molecules directly inside a disease protein's binding pocket to maximize binding affinity.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 18 of 22. Why ask an algorithm to generate a million random keys and test them one by one, when it could just look directly at the lock and forge a custom key right inside the keyhole? Traditional drug discovery often relies on generating vast libraries of candidates and filtering them down, but that wastes immense computational resources. Target-Aware Generation and Docking solves this by directly utilizing the three-dimensional geometry of the biological target to build or place molecules. In these generative tasks, we are working entirely in 3D geometric space. Target-aware generation, also known as structure-based drug design, builds a new molecule based directly on the 3D structure of a target binding pocket. Take a concrete scenario involving a viral protein. This protein has a highly specific geometric cavity. Instead of generating molecules in a vacuum, a conditional diffusion model analyzes the exact spatial boundaries and chemical properties of that cavity. It then custom-grows a novel ligand structure directly inside the pocket. The model starts with a cloud of noisy 3D coordinates and atom types located within the binding site. Over successive steps, it denoises this cloud. Because the generation is conditioned on the target pocket, the model places atoms and forms structures that naturally complement the cavity, aiming to guarantee high interaction affinity. Here is the key insight. The algorithm does not just guess a shape; it explicitly learns the spatial relationship between the target and potential binders. Some approaches even incorporate interaction-based retrieval, pulling data from known high-affinity ligands to further guide the generation of these target-specific molecules. That covers generating a completely new molecule from scratch inside a pocket. But you will often have an existing molecule and need to know exactly how it interacts with a biological target. That brings us to molecular docking. Molecular docking predicts the binding pose to assess binding affinity and specificity. In a diffusion framework, models take a known molecule and a target structure as inputs. Instead of generating chemical identity, the diffusion process operates purely on the spatial orientation of the molecule. The model starts with the ligand in a random, noisy 3D pose and iteratively denoises it. It refines the spatial coordinates of the molecule until it settles into the correct binding configuration inside the protein pocket. Advanced docking models take this a step further by treating the target itself as a flexible entity. A model called Re-Dock uses a technique called a diffusion bridge to predict the binding poses of the ligand while simultaneously modeling the movement of the pocket sidechains. This creates a realistic, flexible docking scenario where both the ligand and the target adapt to each other during the prediction phase. The critical shift here is that diffusion models have moved structural drug design away from rigid, isolated approximations. By treating both the generated ligand and the biological pocket as a continuous, adaptable geometric system, the model natively outputs molecules and poses that are physically grounded in the precise reality of the target environment. That is all for this one. Thanks for listening, and keep building!
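
As a hedged sketch of what pocket-conditioned generation can look like, the loop below applies the standard DDPM ancestral sampling update to ligand coordinates while passing the fixed pocket in as conditioning. The `denoiser` network, the linear schedule, and the pocket tensor are assumptions; real target-aware models also generate atom types and use SE(3)-equivariant architectures.

```python
# Hedged sketch: standard DDPM ancestral sampling over ligand coordinates,
# conditioned at every step on a fixed binding-pocket representation.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_ligand(denoiser, pocket_coords, n_atoms):
    """Denoise random atom positions step by step, conditioned on the pocket."""
    x = torch.randn(n_atoms, 3)                    # start from pure Gaussian noise in the site
    for t in reversed(range(T)):
        eps = denoiser(x, pocket_coords, t)        # noise prediction, conditioned on the target
        mean = (x - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * z        # add sampling noise except at the final step
    return x                                       # candidate ligand coordinates in the pocket
```
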
19

The Size Trap in Generative Evaluation

4m 17s

Learn why standard benchmarks for generative models can be deeply flawed. We reveal the confounding effect of generated library size on metrics like Fréchet ChemNet Distance.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 19 of 22. You evaluate your new generative chemistry model by sampling a thousand molecules, and the metrics look terrible. You generate a hundred thousand from that exact same model, and suddenly it looks world-class. Scale changes everything. This phenomenon is called the Size Trap in Generative Evaluation. Generative drug discovery pipelines generally follow three stages. You train, you generate, and you evaluate. When practitioners reach the evaluation stage, they face a basic question about how many de novo designs to generate for benchmarking. Standard practice often defaults to small batches, typically one thousand or ten thousand SMILES strings. Teams then run these batches through standard distributional metrics. The most common is Fréchet ChemNet Distance, or FCD. FCD measures how close your generated molecules are to your training set in chemical and biological space. A lower FCD score means your generated distribution closely matches your target distribution. Another common metric is Fréchet Descriptor Distance, or FDD, which compares the distribution of physicochemical properties like molecular weight and topological surface area. Teams also routinely measure Uniqueness, which is the fraction of generated designs that are distinct. Here is the key insight. All of these metrics are heavily dependent on the physical size of the generated library. They do not measure absolute model quality in a vacuum. When you sample just one thousand molecules, your FCD and FDD scores will be artificially high. The model looks like it failed to learn the target distribution. But if you keep sampling from that exact same model, pushing the library size out past ten thousand, fifty thousand, or a hundred thousand molecules, the FCD score drops significantly. It continues to decrease until it eventually hits a plateau. This happens because generative molecule design involves sampling from a highly complex learned probability distribution. A tiny sample of one thousand molecules cannot adequately represent the full scope of that model output. The Fréchet distance algorithms need a massive number of samples to accurately capture the shape of the generated space and compare it to the fine-tuning space. Consider a concrete scenario where you are comparing a recurrent neural network against a transformer. If you evaluate the recurrent network using one hundred thousand designs, but you evaluate the transformer using only ten thousand, the recurrent network will likely show vastly superior FCD and FDD scores. The performance gap has nothing to do with architecture. It is purely an artifact of the sample size. The metrics have not converged for the smaller library. This trap works in reverse when you look at internal diversity. Uniqueness behaves completely differently at scale. At one thousand molecules, almost every valid SMILES string your model generates might be unique. The model appears highly creative. But as you push the generation toward one hundred thousand, uniqueness drops sharply. The model begins to repeat itself. If you rank different generative models based on uniqueness at a small scale, the differences between them look minor. Push the scale higher, and the gap between models widens dramatically. The relative ranking of your models will actually flip depending on the size of the library you use to measure them. To fix this, you must treat library size as a strict control variable in your pipeline. 
You can never reliably compare FCD, FDD, or Uniqueness across libraries of different sizes. To ensure robust assessment, you should evaluate libraries containing at least one hundred thousand designs so the distributional metrics fully converge. If your evaluation metrics change simply because you let the sampling loop run longer, you are measuring the size of the sample, not the intelligence of the model. That is all for this one. Thanks for listening, and keep building!
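
A short sketch of treating library size as a control variable. `sample_smiles` and `compute_fcd` are assumed stand-ins for your own generator and metric code; the point is to evaluate the same model at several sizes and watch the metric converge, not to prescribe a specific FCD implementation.

```python
# Sketch of a size-convergence check: same model, progressively larger libraries.
# `sample_smiles` and `compute_fcd` are placeholders for your generator and metric.
def size_convergence_study(sample_smiles, compute_fcd, reference_smiles):
    results = {}
    for n in (1_000, 10_000, 50_000, 100_000):
        generated = sample_smiles(n)               # resample from the identical model
        results[n] = compute_fcd(generated, reference_smiles)
    return results                                 # expect the score to fall, then plateau
```
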
20

Navigating De Novo Hallucinations

3m 56s

Rank AI-generated molecules intelligently. We explore the exploration-exploitation tradeoff of model likelihoods, and how to filter out frequent, low-quality 'chemical hallucinations'.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 20 of 22. Just because a generative AI spits out a specific molecule ten thousand times does not mean it is a viable drug. You might assume generation frequency indicates chemical quality or relevance. It does not. Those highly frequent outputs are often the chemical equivalent of a large language model hallucinating. The mechanism to resolve this is navigating de novo hallucinations using model likelihood. When you generate a massive library of one million SMILES strings from a fine-tuned chemical language model, you must decide which molecules to prioritize for prospective studies. A standard, flawed approach is simply selecting the designs the model outputs most often. This creates a count trap. Instead of discovering robust drug candidates, you end up extracting basic, repetitive substructures like isolated benzene rings, simple amines, and basic ethers. These are recurring structural hallucinations. The model generates them constantly not because they are high quality, but because they are syntactically simple to construct. To expose and filter out these hallucinations, you evaluate your library using model likelihood. Likelihood is a metric that captures how well a generated sequence aligns with the probability distribution the model learned during training. For an autoregressive model, you compute this by multiplying the sampling probability of every individual token in the generated SMILES string. First, you calculate the likelihood score for all one million generated designs. Next, you sort the entire library based on these scores. Finally, you divide the sorted library into ten equal groups, or deciles, ranging from the lowest likelihood to the highest. This is where it gets interesting. Analyzing these deciles reveals a strict exploration-exploitation tradeoff. The tenth decile contains the highest likelihood generations. These designs represent exploitation. They have extremely high chemical validity, and their generic Bemis-Murcko scaffolds closely match the known active molecules from your training data. The model is heavily exploiting what it already knows works. The downside is that these top-tier designs lack novelty. They contain very few new substructures because the model is playing it safe. Moving down to the intermediate deciles, you hit a balance. Novelty and unique substructures peak in this middle range, while validity remains acceptable. But when you drop down to the first decile—the ten percent of molecules with the absolute lowest likelihood scores—you hit the count trap. If you isolate the designs that the model generated more than ten times across the entire million-molecule run, they almost all cluster in this bottom decile. They have incredibly low model likelihoods, yet they appear with massive frequency. Their structural similarity to your training set is terrible, and their overall chemical validity crashes. By binning your library this way, you prove mathematically that frequency is a false signal for quality. You can systematically drop the low-likelihood, high-frequency bins and focus your computational screening on the intermediate deciles where true chemical exploration happens. The most frequent outputs from a generative chemical model are often its worst, but filtering your library by likelihood deciles turns that noise into a precise map of where the model is exploring and where it is just hallucinating. That is all for this one. Thanks for listening, and keep building!
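
A minimal sketch of the decile binning described above, assuming you already have the per-token sampling probabilities for every generated SMILES string; the function names are illustrative.

```python
# Sketch of likelihood-based triage: score each design by its sequence
# log-likelihood, sort, and split the library into ten likelihood deciles.
import numpy as np

def sequence_log_likelihood(token_probs):
    """Log-likelihood of one design: sum of log probabilities over its tokens."""
    return float(np.sum(np.log(np.asarray(token_probs))))

def split_into_deciles(smiles_list, per_token_probs):
    scores = [sequence_log_likelihood(p) for p in per_token_probs]
    order = np.argsort(scores)                     # lowest-likelihood designs first
    bins = np.array_split(order, 10)
    # Bin 0 is where the high-frequency, low-likelihood hallucinations cluster;
    # the intermediate bins are where novelty peaks while validity stays acceptable.
    return [[smiles_list[i] for i in b] for b in bins]
```
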
21

Molecule Sampling Constraints

4m 01s

Understand why NLP techniques fail in chemistry. We compare Temperature sampling against Top-k and Top-p, and why the constrained chemical vocabulary changes everything.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 21 of 22. In natural language processing, Top-p sampling produces brilliantly creative text. But apply that exact same logic to generating molecules, and your AI will just print identical carbon rings forever. The reason comes down to Molecule Sampling Constraints. When a chemical language model generates a SMILES string, it builds the molecule one token at a time. The model predicts a probability distribution for the next token, and you have to extract a specific choice from that distribution. In text generation, practitioners heavily rely on Top-k and Top-p sampling to make this choice. Top-k restricts the model to the absolute most likely k tokens. Top-p restricts the selection to the smallest group of tokens whose combined probabilities exceed a target percentage p. If you apply these methods to a chemical language model, they fail catastrophically. If you use Top-k sampling with k set to 3 on an LSTM trained on drug targets, your model will experience severe mode collapse. It will output chemically valid molecules, but they will be completely repetitive. Here is the key insight. The failure stems from the size of the chemical vocabulary. A text model selects from hundreds of thousands of words. A chemical language model uses a highly constrained alphabet. It only has a handful of elements like carbon, oxygen, and nitrogen, plus syntax tokens for branching and ring closures. Because the chemical alphabet is tiny, and because valid chemistry requires rigid syntax rules like closing every open ring, a very small subset of tokens absolutely dominates the probability distribution. Carbon and basic structural tokens are almost always highly probable. When you apply Top-k or Top-p sampling, you slice off the long tail of the probability distribution. The model is forced to pick exclusively from that narrow band of dominant tokens. It gets caught in a filtering trap, repeating the exact same basic scaffolds endlessly. To escape this trap, you must use Temperature sampling. Instead of filtering tokens out, Temperature sampling applies a smoothing parameter to the raw neural network scores before calculating the final probabilities. This alters the shape of the entire distribution. Consider a scenario where you are running a fine-tuned LSTM model to generate novel drug candidates. You adjust the Temperature parameter, T, to dial in the tradeoff between validity and diversity. If you set T low, around 0.5, the probability distribution collapses into a sharp peak. The model heavily exploits the most likely tokens. Your output will feature extremely high chemical validity, but the structures will lack novelty. They will closely mimic the training set. If you increase T up to 1.5 or 2.0, you flatten the probability distribution. Now, the less probable tokens have a mathematical chance of being sampled. Your model begins exploring new chemical space. The number of unique substructures in your generated library spikes. You find highly novel molecules. The tradeoff is that higher temperatures increase randomness, leading the model to make more syntax errors, which reduces the overall percentage of valid SMILES strings. You cannot blindly port text generation strategies into molecular design. Because chemical vocabulary is inherently constrained, Temperature scaling remains the single most effective lever to balance strict chemical validity against the exploration of novel structures. Thanks for hanging out.
Hope you picked up something new.
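
A minimal sketch of temperature sampling over a toy token distribution. The logits and the four-token vocabulary are placeholders; a real chemical language model emits logits for its full SMILES token set at every generation step.

```python
# Temperature sampling: rescale logits by 1/T, renormalize with a softmax,
# and draw one token. Lower T sharpens the distribution, higher T flattens it.
import numpy as np

rng = np.random.default_rng(0)

def sample_token(logits, temperature=1.0):
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [4.0, 3.5, 1.0, 0.2]                      # toy scores for tokens like "C", "c", "N", "="
sharp = [sample_token(logits, temperature=0.5) for _ in range(20)]   # mostly the dominant tokens
flat = [sample_token(logits, temperature=2.0) for _ in range(20)]    # rarer tokens get a real chance
```
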
22

Deploying Cheminformatics in the Cloud

3m 42s

Take your AI pipeline to production. We discuss packaging RDKit and machine learning models into Docker containers and scaling workloads across cloud infrastructure.

Hi, this is Alex from DEV STORIES DOT EU. Python Cheminformatics & AI, episode 22 of 22. You have built a state-of-the-art AI drug discovery pipeline on your laptop, but how do you screen a billion molecules over the weekend? The answer is Deploying Cheminformatics in the Cloud. Moving a model from a local environment to a distributed cloud architecture usually fails at the dependency layer. RDKit is not a pure Python library. It is a large C++ codebase that requires system-level dependencies, most notably the Boost C++ libraries. If you provision generic cloud servers and run standard installation scripts, you frequently hit compiler errors or missing shared object files. The official RDKit documentation highlights that building from source requires a specific C++ toolchain. While pre-compiled pip wheels exist, the most robust way to guarantee all underlying dependencies align is using Conda. However, installing Conda dynamically on thousands of temporary cloud workers takes too much time and introduces network instability during scale-up. Here is the key insight. You bypass the dependency problem completely by wrapping your pipeline in a Docker container. You write a configuration file that specifies a base operating system. Inside that container, you install a lightweight Conda environment, pull the compiled RDKit binaries, and add your machine learning frameworks like PyTorch or XGBoost. Finally, you copy your pre-trained model weights into the image. Building this image freezes the entire stack into a single, immutable artifact. The cloud provider only needs to know how to run a standard Docker container. The complex C++ dependencies are securely locked inside. To process millions of molecules, you separate your data flow from your compute workers using a cloud message queue. You partition your massive dataset of SMILES strings into smaller, manageable chunks. You place these chunks in cloud object storage and send a message containing the chunk location to the queue. You then point a scalable cloud compute service at this queue. For heavy, GPU-accelerated workloads, you deploy your container using a service like AWS Batch. For lighter, CPU-based inference, serverless container platforms like Google Cloud Run or AWS Lambda handle this perfectly. You configure the compute service to scale automatically based on the queue depth. If there are fifty thousand messages waiting, the cloud controller spins up thousands of identical Docker containers simultaneously. Each container connects to the queue and claims one message. It downloads the corresponding chunk of SMILES strings. RDKit converts the SMILES into molecular graphs, computes the required descriptors, and passes them to your machine learning model for inference. The container writes the highest scoring molecules directly to a managed cloud database. Once the chunk is processed, the worker deletes the message from the queue and grabs the next one. When the queue is empty, the cloud infrastructure terminates the containers automatically. You only pay for the exact compute seconds your code actually consumed. Scaling cheminformatics is rarely about writing faster loop structures in Python; it is about packaging your environment reliably and using decoupled cloud architecture to process data in parallel. This wraps up our series on Python Cheminformatics and AI. 
I encourage you to read the official RDKit documentation on installation, try containerizing a simple script hands-on, or visit devstories dot eu to suggest topics for future series. That is all for this one. Thanks for listening, and keep building!
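
As a hedged illustration of the worker pattern described in this final episode, here is a sketch of one queue consumer using boto3 against SQS and S3. The queue URL, message format, and scikit-learn-style `model` are placeholders, persisting results is omitted, and other clouds follow the same receive-process-delete loop.

```python
# Sketch of a containerized worker: claim one message, download the SMILES chunk,
# fingerprint and score it, then delete the message so another worker cannot redo it.
import json
import boto3
from rdkit import Chem
from rdkit.Chem import AllChem

QUEUE_URL = "https://sqs.example/queue/smiles-chunks"    # placeholder
sqs = boto3.client("sqs")
s3 = boto3.client("s3")

def score_chunk(bucket, key, model):
    """Download one chunk of SMILES, fingerprint each molecule, and score with the model."""
    text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
    scored = []
    for smi in text.splitlines():
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue                                     # skip unparseable entries
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        scored.append((smi, float(model.predict([list(fp)])[0])))
    return scored

def worker_loop(model):
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
        messages = resp.get("Messages", [])
        if not messages:
            break                                        # queue drained; let the container exit
        msg = messages[0]
        chunk = json.loads(msg["Body"])                  # e.g. {"bucket": "...", "key": "..."}
        score_chunk(chunk["bucket"], chunk["key"], model)
        sqs.delete_message(QueueUrL=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]) if False else \
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```
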