Season 21 · 16 Episodes · 57 min · 2026

Scanpy Single-Cell Analysis

v1.11 — 2026 Edition. A comprehensive guide to single-cell analysis using Scanpy. Learn how to preprocess, visualize, cluster, and infer trajectories from scalable single-cell gene expression data.

Scientific Computing · Single-Cell Analysis · Bioinformatics
1
The Scanpy Identity and AnnData
Discover the foundational ideas behind Scanpy and why it was built for scalable single-cell analysis. We explore the AnnData object, the core data structure that keeps matrices, annotations, and embeddings perfectly aligned. You will learn the mental model needed to navigate Scanpy's ecosystem.
4m 10s
2
Quality Control Metrics
We explore how to perform initial quality control on single-cell data using Scanpy. By isolating specific gene populations like mitochondrial RNA, we can identify stressed or dying cells. You will learn how to compute and interpret these crucial QC metrics.
3m 23s
3
Filtering and Normalization
This episode covers the critical steps of filtering and normalizing single-cell expression matrices. We explain how to discard low-quality data and apply count depth scaling with a log1p transformation. You will learn how to make cells with different sequencing depths directly comparable.
3m 28s
4
Doublet Detection with Scrublet
We dive into doublet detection, a crucial step for catching technical artifacts in microfluidic single-cell sequencing. We break down how Scrublet simulates artificial doublets to flag suspicious cells. You will learn how to identify and remove these artificial combinations from your dataset.
3m 43s
5
Feature Selection and Highly Variable Genes
We examine the concept of feature selection and why it is necessary to identify highly variable genes. By discarding housekeeping noise, we focus the analysis on biological drivers. You will learn how to use Scanpy to isolate the most informative genes for downstream steps.
3m 31s
6
Cell-Cycle Scoring and Regression
We explore how to handle confounding factors by scoring and regressing out cell cycle phases. We discuss how to calculate S and G2M scores and use regression to remove their influence. You will learn how to prevent active cell division from ruining your clustering topology.
3m 34s
7
Dimensionality Reduction with PCA
This episode explains Principal Component Analysis in the context of single-cell data. We discuss how PCA denoises the dataset and why selecting the right number of components matters. You will learn how to reduce thousands of genes into a manageable foundation for advanced algorithms.
3m 19s
8
The Nearest Neighbor Graph and UMAP
We break down the absolute core of modern single-cell topology: the nearest neighbor graph. We then explain how UMAP translates this complex web into a readable 2D plot. You will learn why the neighbor graph is the prerequisite for almost every advanced tool in Scanpy.
3m 34s
9
Clustering with Leiden
We explore how to find discrete populations of cells using the Leiden clustering algorithm. By optimizing modularity on the neighborhood graph, Leiden isolates highly connected communities. You will learn how to adjust the resolution parameter to find stable, biologically meaningful groups.
4m 00s
10
Marker Gene Discovery
We dive into marker gene discovery and differential expression testing. We explain how statistical tests identify the unique transcriptomic signatures of your clusters. You will learn how to transition from anonymous numbered clusters to confidently labeled biological cell types.
3m 45s
11
Data Integration with Ingest
This episode covers data integration using the ingest tool. We explain how to project new datasets onto the PCA and UMAP space of a pre-annotated reference atlas. You will learn a fast method for mapping labels across experiments that leaves the reference model unchanged.
3m 38s
12
Visualizing Expression Patterns
We explore advanced visualization techniques for evaluating gene expression across clusters. We focus on dot plots and matrix plots, detailing how they encode both expression intensity and sparsity. You will learn how to visually validate your cell type annotations at a glance.
3m 20s
13
Exploring Manifolds with Diffusion Maps
We introduce Diffusion Maps, a powerful embedding technique for continuous biological data. We contrast it with UMAP, explaining why diffusion is better suited for analyzing cellular differentiation. You will learn how to visualize continuous transitions and developmental processes.
3m 51s
14
Abstracted Graphs with PAGA
This episode covers Partition-based Graph Abstraction, or PAGA. We discuss how to measure the actual connectivity between clusters to preserve global topology. You will learn how to use PAGA to uncover the true lineage relationships hidden in your data.
3m 32s
15
Trajectory Inference with DPT
We explore trajectory inference using Diffusion Pseudotime (DPT). We explain how to designate a root cell and calculate geodesic distances across the cellular graph. You will learn how to arrange cells along a continuous developmental timeline.
3m 33s
16
Experimental Scale-Up with Dask
In our final episode, we look at the experimental frontier of Scanpy: scaling up with Dask. We explain how to handle datasets that exceed your machine's RAM using lazy evaluation and out-of-core processing. Thank you for joining us on this deep dive into Scanpy!
3m 36s

Episodes

1

The Scanpy Identity and AnnData

4m 10s

Discover the foundational ideas behind Scanpy and why it was built for scalable single-cell analysis. We explore the AnnData object, the core data structure that keeps matrices, annotations, and embeddings perfectly aligned. You will learn the mental model needed to navigate Scanpy's ecosystem.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 1 of 16. Single-cell datasets have exploded from thousands of cells to millions over just a few years. Try loading that scale into older, memory-heavy toolkits, and your machine will choke. That is the exact problem Scanpy solves. Scanpy is a scalable toolkit built to analyze massive single-cell gene expression datasets in Python. It handles memory efficiently by relying on a highly specific foundational data structure. That structure is called Annotated Data, or AnnData. People coming from standard tabular data often assume AnnData is just a customized pandas dataframe. It is not. A single dataframe is far too flat for single-cell biology. In a single-cell experiment, you have a massive matrix of expression counts, but you also have complex metadata about the cells and entirely separate metadata about the genes. AnnData is a multidimensional container that binds the core matrix and all its associated metadata tightly together in one synchronized object. Consider a scenario where you are loading a dataset of one million cells. At the center of your AnnData object sits the core data matrix, accessed via the dot X attribute. This is a two-dimensional matrix holding your actual numerical values, typically gene expression counts. The rows always represent observations, which are your individual cells, and the columns always represent variables, which are your genes. For a dataset of a million cells, dot X is almost always stored as a sparse matrix to conserve RAM. Here is the key insight. The matrix in dot X does not store its own row or column names. It relies entirely on two dedicated metadata dataframes to provide that context. The first is the observation metadata, accessed via the dot obs attribute. This is a standard pandas dataframe mapped directly to the rows of your dot X matrix. It holds everything you know about the cells. For your one million cell dataset, dot obs will have exactly one million rows. This is where your cell barcodes, batch labels, quality control metrics, and clustering assignments live. The second is the variable metadata, accessed via the dot var attribute. This is another dataframe mapped directly to the columns of your dot X matrix. It holds everything you know about the genes or features you measured. This is where you store gene symbols, chromosome locations, and statistical metrics like highly variable gene flags. Because dot obs and dot var are strictly aligned to the dimensions of dot X, you can slice the AnnData object safely. If you filter out dead cells from dot obs, the AnnData object automatically drops the corresponding rows from the dot X matrix. The dimensional alignment never breaks. There is one more crucial layer in the AnnData structure. As you process your single-cell data, you generate multidimensional representations of your cells, like principal components or UMAP coordinates. These outputs do not fit neatly into a single column of dot obs. Instead, they go into a separate dictionary called dot obsm, which stands for observation matrices. The only rule is that any matrix you place in dot obsm must have the exact same number of rows as dot X. By keeping the core matrix, the cell metadata, and the gene metadata locked in one self-updating structure, AnnData guarantees your data stays perfectly synchronized from the first filtering step to the final visualization. If you find these episodes helpful, you can support the show by searching for DevStoriesEU on Patreon. 
As always, thanks for listening. See you in the next episode.
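For readers who want to poke at the structure directly, here is a minimal sketch of the AnnData anatomy described above, built on randomly generated toy data (all names are illustrative):

```python
import numpy as np
import anndata as ad
from scipy.sparse import random as sparse_random

# Toy dataset: 1,000 cells by 2,000 genes, stored sparse like real counts.
n_cells, n_genes = 1_000, 2_000
adata = ad.AnnData(X=sparse_random(n_cells, n_genes, density=0.05, format="csr"))

# .obs annotates the rows (cells), .var annotates the columns (genes).
adata.obs["batch"] = np.random.choice(["day1", "day2"], size=n_cells)
adata.var["gene_symbol"] = [f"gene_{i}" for i in range(n_genes)]

# Slicing by an .obs mask drops the matching rows of .X automatically,
# so the dimensional alignment never breaks.
subset = adata[adata.obs["batch"] == "day1"]
print(subset.shape)

# Multidimensional per-cell outputs (PCA, UMAP, ...) live in .obsm;
# the only rule is that the row count must match .X.
adata.obsm["X_demo"] = np.random.rand(n_cells, 50)
```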
2

Quality Control Metrics

3m 23s

We explore how to perform initial quality control on single-cell data using Scanpy. By isolating specific gene populations like mitochondrial RNA, we can identify stressed or dying cells. You will learn how to compute and interpret these crucial QC metrics.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 2 of 16. The fastest way to ruin a single-cell analysis is to unknowingly keep dying or empty cells in the dataset. You might think you are discovering a novel subpopulation, but you are actually just clustering cellular debris. Exposing these compromised cells relies entirely on Quality Control Metrics. Let us clear something up right away. People often confuse calculating quality metrics with filtering out bad data. They are not the same thing. The Scanpy function calculate qc metrics does not remove a single cell or gene from your dataset. It is strictly an annotation tool. It calculates statistics and attaches them as new columns to your observation dataframe, which tracks cells, and your variable dataframe, which tracks genes. The actual removal of bad cells happens in a separate step. Why do we need these specific metrics? Consider a bone marrow sample. During extraction, physical stress can rip fragile cells open. When the cell membrane ruptures, the cytoplasmic RNA leaks out and washes away. However, mitochondria are enclosed in their own membranes, so mitochondrial RNA gets trapped inside the broken cell shell. If you sequence this droplet, you will get a high concentration of mitochondrial genes and very little else. This is a dead cell. To identify these ruptured cells, you need to track specific gene populations. In human datasets, mitochondrial genes typically start with the prefix MT-dash. Ribosomal genes might start with RPS or RPL. Before you can calculate metrics for these populations, you have to label them in your dataset. You do this by creating a new boolean column in your var dataframe. For example, you create a column called mt that evaluates to True if the gene name starts with MT-dash, and False otherwise. Once you have flagged these genes, you run the calculate qc metrics function. By default, this function calculates standard baseline statistics, like the total number of RNA counts per cell and the number of genes expressed per cell. But you can also tell it to look at the specific gene populations you just defined. You pass the name of your boolean column, like mt, into the qc vars argument. The function then computes the proportion of counts originating from that specific gene group. It adds new columns to your obs dataframe. One column will show the total counts of mitochondrial genes for each cell. Another, more critical column, will show the percentage of total counts that come from mitochondrial genes. If a cell shows that thirty percent of its RNA is mitochondrial, you know it is likely a ruptured, dying cell from the extraction process. Here is the key insight. The calculate qc metrics function transforms raw, uninterpretable count matrices into biological signals about cell health. It does not make decisions for you, but by tagging specific gene populations, it gives you the exact numerical evidence you need to separate actual biology from extraction noise. Thanks for spending a few minutes with me. Until next time, take it easy.
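A minimal sketch of this step in code, assuming an AnnData object named adata holding raw counts (the ribosomal flag is an extra illustration, not a requirement):

```python
import scanpy as sc

# Flag gene populations with boolean columns in .var.
adata.var["mt"] = adata.var_names.str.startswith("MT-")
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))

# Annotate only: nothing is removed. New columns land in .obs and .var.
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True
)

# A cell with a high pct_counts_mt is likely ruptured or dying.
print(adata.obs[["total_counts", "n_genes_by_counts", "pct_counts_mt"]].head())
```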
3

Filtering and Normalization

3m 28s

This episode covers the critical steps of filtering and normalizing single-cell expression matrices. We explain how to discard low-quality data and apply count depth scaling with a log1p transformation. You will learn how to make cells with different sequencing depths directly comparable.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 3 of 16. You look at two cells in your dataset. One seems to express twice as much RNA as the other. But it is not a biological difference—one droplet simply got sequenced twice as deep by the machine. If you compare them directly, your entire downstream analysis will be warped by technical artifacts. Filtering and Normalization are the tools that resolve this disparity. Before you can adjust for sequencing depth, you have to remove the garbage. Raw single-cell data is full of dead cells, empty droplets, and random noise. You clean this up along two distinct axes: the cells and the genes. People sometimes confuse these two steps, but they do completely different things. You handle the cells first using the filter cells function. You tell Scanpy to discard any cell that expresses fewer than a minimum number of genes. If a droplet contains only two hundred detected genes when a healthy cell should have two thousand, that droplet is likely empty or contains a broken, dying cell. You drop it completely. Next, you filter the genes across your entire dataset. Using the filter genes function, you remove genes that are expressed in too few cells. If a specific gene is only detected in one or two cells out of ten thousand, it provides no statistical value for grouping or classifying cell types later on. It is just computational noise. You drop that gene entirely. Once the low-quality cells and uninformative genes are gone, you still face the sequencing depth problem. This is where you normalize the total counts. The goal is to scale every cell so they all appear to have the same total number of read counts. Take a scenario where Cell A has five thousand total counts and Cell B has twenty thousand. You choose a common size factor, typically ten thousand. Scanpy applies a scaling factor to each cell individually. It doubles the counts in Cell A and halves the counts in Cell B. Now, both cells sum to ten thousand total counts. When you look at a specific gene in both cells, you are comparing their true relative expression, completely independent of how heavily the sequencing machine sampled them. Making the totals equal is only half the math. Biological expression data is massively skewed. A handful of genes will have enormous count numbers, while most will have very few. If you feed this skewed data into variance calculations or dimensionality reduction algorithms later, those few massive genes will dominate the math and drown out subtle biological signals. You fix this using a log plus one transformation. You call the log one p function in Scanpy, which applies a natural logarithm to all your normalized counts. The plus one part of the function is critical because your data matrix is mostly zeros, representing genes that are not expressed in a given cell. The log of zero is undefined, but the log of zero plus one is zero. This simple step compresses the extreme high values while keeping the zeros exactly where they are, resulting in a much more balanced distribution. Here is the key insight. Filtering and normalization do not alter the underlying biology of your sample. They strip away the mechanical biases of the sequencing hardware so the actual biology can emerge. Thanks for listening, happy coding everyone!
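In code, the whole sequence is four calls; the thresholds below are illustrative and should be tuned per dataset:

```python
import scanpy as sc

sc.pp.filter_cells(adata, min_genes=200)  # drop near-empty or broken droplets
sc.pp.filter_genes(adata, min_cells=3)    # drop genes detected in almost no cells

# Scale every cell to the same total (the common size factor of 10,000)...
sc.pp.normalize_total(adata, target_sum=1e4)

# ...then compress the skew; log1p(0) == 0, so the zeros stay put.
sc.pp.log1p(adata)
```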
4

Doublet Detection with Scrublet

3m 43s

We dive into doublet detection, a crucial step for catching technical artifacts in microfluidic single-cell sequencing. We break down how Scrublet simulates artificial doublets to flag suspicious cells. You will learn how to identify and remove these artificial combinations from your dataset.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 4 of 16. Sometimes two completely different cells get trapped in the same microfluidic droplet, creating a Frankenstein transcriptomic signature that looks like a completely novel biological state. It is not a discovery. It is an error, and if you do not catch it early, it will contaminate your analysis. That is exactly the problem Doublet Detection with Scrublet is designed to solve. When you run a single-cell experiment using droplet microfluidics, you push a cell suspension through a channel, aiming for one cell per droplet. Statistically, this process is not perfect. Occasionally, two cells share a single droplet. Take a scenario where a monocyte and a T-cell get trapped together. The sequencing machine reads their combined RNA as a single entity, outputting a profile that blends genes from both cell types. People often mistake these mixed profiles for true biological transitional states, like a cell midway through differentiation. We need to be completely clear. Doublets are purely technical artifacts. They do not exist in the tissue, and they must be removed. In Scanpy, you handle this using the scrublet function in the preprocessing module. Scrublet operates on a highly effective premise to find these fake cells. If you want to detect doublets, you need to know what a doublet looks like in your specific dataset. Since the algorithm does not know which of your observed cells are errors, it manufactures its own. First, Scrublet takes the expression matrix of your observed cells. Then, it randomly picks pairs of these real cells and computationally adds their gene expression profiles together. These combined profiles are your simulated doublets. Now, Scrublet maps both your actual observed cells and these newly simulated doublets into the same high-dimensional space. It builds a nearest-neighbor classifier to analyze the relationships between them. This is the key insight. Scrublet evaluates the immediate neighborhood of every real cell in your dataset. If an observed cell is surrounded primarily by simulated doublets, that real cell looks mathematically identical to an artificial mashup. Scrublet assigns it a high doublet score. Conversely, if an observed cell sits in a cluster with other real cells and very few simulated doublets, it receives a low score. It is highly likely to be a genuine single cell. The function does not stop at assigning a continuous score. It evaluates the distribution of all doublet scores across your dataset to automatically compute a cutoff threshold. It looks for a separation between the large peak of normal cells and the smaller tail of suspected doublets. Based on this threshold, Scrublet tags each cell in your dataset with a boolean value, marking it true if it is a predicted doublet, and false if it is a singlet. These results are saved directly into your data object, allowing you to filter out the false cells before you move on to downstream analysis. The fundamental strength of Scrublet is that it does not rely on external reference databases to find technical errors. It learns the exact failure modes of your specific experiment by combining the very cells you sequenced. That is all for this one. Thanks for listening, and keep building!
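A hedged sketch of the call, assuming adata holds raw counts. Recent Scanpy releases expose Scrublet as sc.pp.scrublet; older ones ship it under sc.external.pp.scrublet instead:

```python
import scanpy as sc

# Simulates doublets from the observed cells and scores every real cell.
sc.pp.scrublet(adata)

# Outputs land in .obs: a continuous score and a thresholded boolean call.
print(adata.obs[["doublet_score", "predicted_doublet"]].head())

# Drop the predicted doublets before moving downstream.
adata = adata[~adata.obs["predicted_doublet"]].copy()
```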
5

Feature Selection and Highly Variable Genes

3m 31s

We examine the concept of feature selection and why it is necessary to identify highly variable genes. By discarding housekeeping noise, we focus the analysis on biological drivers. You will learn how to use Scanpy to isolate the most informative genes for downstream steps.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 5 of 16. Out of the roughly 30,000 genes in the human genome, most are just doing basic cellular upkeep. If you try to analyze all of them at once, the sheer volume of baseline activity will drown out the actual biology you are looking for. To find the real signal, you need Feature Selection and Highly Variable Genes. First, a quick distinction. You have already used a basic filtering function to drop genes that are barely detected across your dataset. That step cleans up technical noise and empty droplets. Highly variable gene selection does something entirely different. It assumes the remaining genes are real, but asks which of them are actually informative. Think of it like this. Sifting through tens of thousands of genes to find the ones that drive differences between cell types means discarding the boring baseline housekeeping genes. A housekeeping gene is active in almost every cell at roughly the same level. Its expression is stable, which makes it useless for distinguishing a T-cell from a B-cell. We want genes that are heavily expressed in some cells and totally silent in others. These are the highly variable genes. Usually, you want to narrow your dataset down to about two thousand of these key drivers. In Scanpy, you handle this with the highly variable genes function. But you cannot simply rank genes by raw variance. In sequencing data, variance scales with mean expression. If a gene is highly expressed across the board, its raw variance will naturally be high, even if it is not biologically interesting. The algorithm has to decouple the variance from the mean expression. It does this by dividing genes into bins based on their average expression levels. Then, it calculates a normalized dispersion within each bin. This tells you how much a gene varies compared only to other genes that are expressed at similar baseline levels. Scanpy offers different statistical methods to do this math, called flavors. The traditional Seurat flavor expects your data to be log-normalized first. It calculates dispersion, bins the data, and standardizes the values. There is also a newer Seurat v3 flavor, which explicitly requires raw, unlogged count data to properly model the variance. Alternatively, the CellRanger flavor uses a slightly different approach to calculate normalized dispersion based on the counts. The flavor you choose simply dictates the specific statistical distribution used to model that relationship between the mean and the variance. When you run this function, it does not delete the rest of your data. Instead, it adds a few new columns to your variable annotations array. The most important one is a boolean column simply called highly variable, marking true for the top two thousand genes and false for the rest. Future steps in your pipeline will automatically look for this flag and only use those selected genes for downstream analysis. Here is the key insight. Feature selection is not just a computational trick to make your code run faster; it is the deliberate process of stripping away biological white noise so the true cellular identities have room to emerge. Thanks for hanging out. Hope you picked up something new.
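A minimal sketch of the selection step; note that the seurat flavor shown here expects log-normalized data, while seurat_v3 wants raw counts:

```python
import scanpy as sc

sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat")

# Nothing is deleted; a boolean flag is added to .var, and downstream
# tools respect it automatically.
print(adata.var["highly_variable"].sum())  # roughly 2000 genes marked True
```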
6

Cell-Cycle Scoring and Regression

3m 34s

We explore how to handle confounding factors by scoring and regressing out cell cycle phases. We discuss how to calculate S and G2M scores and use regression to remove their influence. You will learn how to prevent active cell division from ruining your clustering topology.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 6 of 16. Your clustering algorithm just split identical T-cells into two distinct groups. They are the exact same cell type, but the algorithm separated them simply because one group happens to be actively dividing while the other is resting. To fix this, we use Cell-Cycle Scoring and Regression. Cell cycle heterogeneity is a massive source of variance in single-cell data. If left unchecked, the highly expressed genes driving mitosis will overpower the subtle gene signatures that define actual cell types. You end up with clusters defined by a temporary state rather than true biological identity. To solve this, Scanpy provides a dedicated function to score genes based on the cell cycle. You pass this function your single-cell dataset along with two specific lists of known marker genes. One list contains genes that are active during the S-phase, the synthesis phase of the cell cycle. The other list contains genes active during the G2M-phase, the mitosis phase. The scoring function evaluates every single cell and calculates two continuous metrics: an S-score and a G2M-score. It does this by looking at how strongly those specific phase genes are expressed compared to the background expression level of the cell. Based on these two scores, the function also assigns a categorical phase label to each cell in your metadata, labeling it as S, G2M, or G1 if neither score is particularly high. Now that you have quantified this effect, you need to scrub its influence from the dataset. This is where you use the regress out function. You instruct regress out to look at the S-score and G2M-score columns that were just added to your cell metadata. The algorithm then builds a linear model for the expression of every single gene across all cells, using those two cell cycle scores as the predictor variables. It calculates the residual, which is the exact amount of gene expression that cannot be explained by the cell's position in the cell cycle. This residual value becomes the new, corrected expression level in your dataset. Here is the key insight. People often confuse regression with batch correction. They are related concepts, but they are fundamentally different tools. Batch correction methods are designed to align discrete, categorical groups, like samples collected on different days or sequenced on different machines. Regress out is specifically designed for continuous confounding variables. You use it for continuous numerical gradients like these cell cycle scores, total counts per cell, or the percentage of mitochondrial genes. It models a mathematical slope and flattens it. Once you run this regression step, the heavy biological bias of cell division is mathematically removed from the expression matrix. When you run your clustering algorithm again on this corrected data, those dividing T-cells and resting T-cells will snap back together into a single, cohesive cluster. The algorithm is no longer distracted by the noise of DNA replication. Regressing out cell cycle scores ensures your downstream analysis clusters cells strictly by what they are, rather than what they just happen to be doing at the moment they were sequenced. That is all for this one. Thanks for listening, and keep building!
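In code, assuming s_genes and g2m_genes are lists of known phase marker symbols that you supply (for example, the widely used Tirosh et al. lists):

```python
import scanpy as sc

# Score each cell for both phases and assign a categorical phase label.
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)
print(adata.obs[["S_score", "G2M_score", "phase"]].head())

# Fit a linear model per gene on the two scores and keep the residuals.
sc.pp.regress_out(adata, ["S_score", "G2M_score"])
```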
7

Dimensionality Reduction with PCA

3m 19s

This episode explains Principal Component Analysis in the context of single-cell data. We discuss how PCA denoises the dataset and why selecting the right number of components matters. You will learn how to reduce thousands of genes into a manageable foundation for advanced algorithms.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 7 of 16. Human minds cannot visualize two thousand dimensions. Worse, complex clustering algorithms become computationally paralyzed by them, bogged down by a mix of true biological signal and random technical noise. You need a way to extract the core structure of your data before analyzing it. Dimensionality reduction with Principal Component Analysis is how you solve this. Before getting into the mechanics, we need to address a common misunderstanding. People often think PCA is just a scatter plot showing PC1 against PC2, meant as a final visualization for human eyes. It is not. While you can plot the top two components, PCA is fundamentally a mathematical foundation. It compresses a sparse, noisy matrix into a dense format to feed into downstream tools like neighbor graphs. Think about your starting point. You just isolated about two thousand highly variable genes. Each gene represents an independent dimension in your dataset. You run the scanpy tools pca function to compute the principal components. This function evaluates your entire dataset and crushes those two thousand gene dimensions down into a much smaller set of synthetic dimensions, typically around fifty. These synthetic dimensions are ordered by how much variance they explain. The first principal component represents the absolute strongest axis of variation in your cells. The second represents the next strongest, and so on. Here is the key insight. The first handful of components capture the real, structured biological differences. As you move down the list of components, they capture progressively less biology and more random technical noise. By slicing off the tail end of these components, you are essentially denoising your data matrix. To evaluate this reduction, you need to decide how many components to retain. You use the scanpy plot pca variance ratio function. This command generates a line chart showing the fraction of total variance explained by each individual component. You scan this line looking for the elbow point, which is where the steep drop-off suddenly flattens into a long tail. If the curve flattens at component fifteen, you might assume you only need fifteen components. However, in single-cell workflows, we deliberately overestimate the number of principal components. You might see the elbow at fifteen but tell your downstream functions to use fifty anyway. Downstream clustering algorithms are highly robust. They can easily ignore the slight technical noise contained in components sixteen through fifty. What they cannot do is recover biological signal that you threw away too early. If component twenty-two holds the variance signature for a very rare cell type, dropping it means that cell type disappears from your analysis entirely. You compute the components, check the variance ratio to confirm the data structure, and move on. The most important takeaway is that PCA is not a picture to look at, but a targeted mathematical filter that sacrifices raw dimensional depth to strip away noise and expose the true biological axes of your data. Hope that was useful. Thanks for listening, and enjoy the rest of your day.
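A minimal sketch of the compute-and-check sequence described above:

```python
import scanpy as sc

# Deliberately keep more components than the elbow suggests.
sc.tl.pca(adata, n_comps=50, svd_solver="arpack")

# Line chart of variance explained per component: look for the elbow.
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)

# The dense cell-by-component matrix feeds the neighbor graph later.
print(adata.obsm["X_pca"].shape)  # (n_cells, 50)
```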
8

The Nearest Neighbor Graph and UMAP

3m 34s

We break down the absolute core of modern single-cell topology: the nearest neighbor graph. We then explain how UMAP translates this complex web into a readable 2D plot. You will learn why the neighbor graph is the prerequisite for almost every advanced tool in Scanpy.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 8 of 16. The secret to modern single-cell analysis is not just plotting points on a grid. It is understanding exactly which cells are neighbors in a massive, high-dimensional space. If you get that underlying structure wrong, every visualization that follows will mislead you. The mechanism for capturing this structure is the nearest neighbor graph, and today we are looking at how to build it and project it using UMAP. At this stage in a typical pipeline, you have already run Principal Component Analysis. You have condensed thousands of genes down to maybe forty or fifty principal components. But fifty dimensions is still impossible to visualize, and it does not explicitly tell us which cells belong to the same local community. We need to construct a web of connections. In Scanpy, you do this using the function sc dot pp dot neighbors. This step computes the neighborhood graph of your cells. Calculating distances between tens of thousands of cells across thirty thousand raw genes is computationally brutal and highly susceptible to noise. By computing the neighbors on the principal components instead, the math is fast and the technical noise is already stripped away. For every single cell, the algorithm looks at its coordinates across those principal components and finds its closest peers. By default, it links a cell to its fifteen nearest neighbors. The output is a mathematical network where cells are nodes, and the edges between them represent high similarity. This neighborhood graph becomes the foundational data structure for downstream tasks. Once you have this web, you want to actually look at it. This is where UMAP comes in, called via sc dot tl dot umap. Here is the key insight. A very common misconception is that UMAP computes distances directly from your raw gene expression data. It does not. UMAP is entirely blind to your genes. It is simply a layout engine. Its only job is to take that pre-computed nearest neighbor graph and flatten it out into a two-dimensional space. UMAP works by optimizing a layout to match the graph. It pulls connected neighbors close together while pushing disconnected cells apart. Because it relies entirely on the local connections defined in the previous step, it is exceptionally good at preserving local structure. If a group of cells were tightly connected in the high-dimensional graph, they will form a distinct, tight island on your 2D UMAP plot. But be aware that the empty space between the separate islands on a UMAP plot means very little. UMAP sacrifices global spatial accuracy to ensure local neighbors stay glued together. This separation of concerns is vital. The shape you see on a UMAP is completely dictated by the neighbor graph built right before it. If you want to change how sensitive your layout is to rare cell types, you do not tune UMAP. You go back and adjust the number of neighbors in the graph construction itself. If you find these episodes helpful and want to support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
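The graph-then-layout sequence is two calls. Note where the tuning knob lives: the neighbor count is a parameter of the graph, not of UMAP:

```python
import scanpy as sc

# Build the k-nearest-neighbor graph on the PCA representation.
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)

# UMAP only flattens that pre-computed graph into 2D.
sc.tl.umap(adata)
sc.pl.umap(adata)  # coordinates are stored in adata.obsm["X_umap"]
```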
9

Clustering with Leiden

4m 00s

We explore how to find discrete populations of cells using the Leiden clustering algorithm. By optimizing modularity on the neighborhood graph, Leiden isolates highly connected communities. You will learn how to adjust the resolution parameter to find stable, biologically meaningful groups.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 9 of 16. Traditional clustering algorithms assume your data exists in perfectly round blobs organized around a central point. But real biological data is messy, interconnected, and forms complex, continuous shapes. When you try to force those complex structures into simple spheres, you end up splitting single cell types into artificial fragments. Clustering with Leiden solves this by looking at how cells connect to each other, rather than just where they sit in an abstract space. People often default to thinking about clustering like K-means, where you define a center point and group everything nearby using standard physical distances. Leiden does not operate like that. It is a graph-based clustering algorithm. It completely ignores Euclidean distances to a centroid. Instead, it relies entirely on the density of edges in the neighbor graph you built earlier in your analysis. Think of the neighbor graph as a massive social network. The cells are individuals, and the edges between them are friendships. Leiden performs what is known as community detection. It searches for groups of cells that have a very high number of connections within their specific group, but very few connections to the outside network. To achieve this, the algorithm optimizes a metric called modularity. Modularity measures the density of links inside communities compared to the links you would expect if the network were totally random. The algorithm starts by assigning every single cell to its own individual community. It then iteratively merges these communities, moving nodes back and forth, constantly checking if the new grouping increases the overall modularity score. Leiden is specifically designed to refine these partitions carefully, guaranteeing that the final communities are densely connected internally and not suffering from disconnected fragments, which was a known issue in the older Louvain algorithm. In Scanpy, you run this using the tool function Leiden. You pass it your main data object, and it operates directly on the existing neighbor graph. The output is a new categorical column added to your data, containing a cluster number for every single cell. Here is the key insight. The most important control you have over this entire process is the resolution parameter. This parameter acts as a dial that dictates how aggressively the algorithm splits groups. By default, Scanpy uses a resolution of one. If you increase the resolution value, you get more clusters. The algorithm becomes highly sensitive, breaking the graph into smaller, highly specific subpopulations. If you decrease the resolution value, you get fewer clusters. The algorithm becomes more tolerant, grouping larger portions of the graph together. Suppose you run Leiden and look at the resulting map. You might notice that a single known biological cell type has been split into five tiny, over-fragmented subpopulations. The algorithm found slight differences, but biologically, those five groups belong together as one distinct cell state. To fix this, you simply run the Leiden function again, but this time you explicitly pass a lower number to the resolution argument. By dropping the resolution, you instruct the algorithm to relax its criteria. Those five tiny fragments will merge back into one solid, biologically meaningful cluster. 
The exact resolution you need is never a fixed mathematical truth; it is an adjustable dial you turn until the statistical communities in the graph accurately reflect the biological realities of your tissue. Thanks for hanging out. Hope you picked up something new.
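In code, re-running at a different resolution is cheap because the neighbor graph is reused; the key_added names below are illustrative:

```python
import scanpy as sc

# Default resolution: one community assignment per cell, stored in .obs.
sc.tl.leiden(adata, resolution=1.0, key_added="leiden")

# Over-fragmented clusters? Lower the resolution and store a second labeling.
sc.tl.leiden(adata, resolution=0.5, key_added="leiden_coarse")

print(adata.obs["leiden"].value_counts())
```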
10

Marker Gene Discovery

3m 45s

We dive into marker gene discovery and differential expression testing. We explain how statistical tests identify the unique transcriptomic signatures of your clusters. You will learn how to transition from anonymous numbered clusters to confidently labeled biological cell types.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 10 of 16. You have isolated a tight, distinct cluster of cells in your dataset, but how do you know if it is a T-cell, a B-cell, or an entirely unknown state? Mathematics grouped them together, but biology has to tell you what they are. That translation step relies entirely on marker gene discovery. Before looking at the tools, we need to draw a hard line between clustering and annotation. Clustering algorithms simply group cells based on statistical similarity across thousands of dimensions and assign them an arbitrary label, like Cluster 0 or Cluster 1. That number means nothing biologically. Marker gene discovery is the process of finding the specific genes that drive that statistical grouping, allowing you to assign real biological names to those clusters. In Scanpy, you find these driving genes using a function called rank genes groups. This function performs differential expression analysis. It takes a categorical grouping, usually your calculated clusters, and compares the gene expression of cells inside one cluster against the cells in all other clusters combined. The goal is to find genes that are highly expressed in your target cluster but mostly silent everywhere else. By default, it compares each cluster to the union of the rest of the cells, but you can also configure it to compare a cluster against one specific reference group if you are looking for subtle differences between two highly related cell types. To decide if a gene is truly a marker, Scanpy runs a statistical test to score the difference in expression. You can choose a standard t-test, but the Wilcoxon rank-sum test is highly recommended and is the standard choice for single-cell data. Single-cell gene expression does not follow a normal bell curve; it is highly variable, heavily skewed, and full of zero values where a gene simply was not detected. The Wilcoxon test does not assume a normal distribution. Instead of looking at raw mean values, it ranks the expression values across all cells and compares the ranks between your target cluster and the rest of the dataset. This makes it far more robust against extreme outliers. Let us look at a specific scenario. You have a dataset with several clusters and you want to interrogate Cluster 3. You call the rank genes groups function, tell it to use your existing cluster labels, and set the method to Wilcoxon. Scanpy crunches the numbers and ranks every single gene based on how uniquely it defines Cluster 3. You then inspect the top results. You see that the highest-ranked genes are CD8A and GZMK. If you know your immunology, you recognize immediately that these are classic markers for cytotoxic T-cells. Because these specific genes are uniquely upregulated here compared to the rest of the dataset, you can confidently label Cluster 3 as a CD8 positive T-cell. The output of this function is stored quietly in your AnnData object under the uns attribute. Scanpy saves arrays of gene names, statistical scores, p-values, and log-fold changes for every single cluster simultaneously. You can extract these arrays to build dataframes, save them to a csv file, or pass them directly to downstream annotation tools. Here is the key insight. Differential expression turns arbitrary mathematical shapes into actionable biological identities. Without marker genes, you just have a map of numbers; with them, you have a mapped biological system. That is all for this one. Thanks for listening, and keep building!
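A minimal sketch, assuming Leiden labels are already stored in .obs under "leiden":

```python
import scanpy as sc

# Rank genes for every cluster against the rest, with the Wilcoxon test.
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")

# Results sit in adata.uns["rank_genes_groups"]; this helper flattens the
# names, scores, p-values, and log-fold changes for one cluster.
markers = sc.get.rank_genes_groups_df(adata, group="3")
print(markers.head())  # e.g. CD8A and GZMK near the top for cluster 3
```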
11

Data Integration with Ingest

3m 38s

This episode covers data integration using the ingest tool. We explain how to project new datasets onto the PCA and UMAP space of a pre-annotated reference atlas. You will learn a fast method for mapping labels across experiments that leaves the reference model unchanged.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 11 of 16. You have a perfectly annotated reference dataset, and a brand new patient sample that you need to analyze. Normally, combining them means running batch correction, recalculating your principal components, and waiting for a massive joint UMAP to render from scratch. Data Integration with Ingest offers a faster, non-destructive alternative. People often confuse ingest with standard batch correction. Traditional batch correction tools take multiple datasets and calculate a brand new joint model. They alter the underlying representation of your reference data to force an alignment. Ingest does the opposite. It is an asymmetric projection. Your reference dataset acts as the absolute ground truth. The spatial model is locked in place, and the new data is simply pushed through it without altering the original model at all. Take a beautifully annotated Peripheral Blood Mononuclear Cell reference atlas. Every cluster is verified and labeled. You just received an unannotated, messy patient sample. You want those atlas labels applied to your new sample, and you want to plot the new cells in the exact same coordinate space as the reference. To make this work, the variables in your two datasets must align. This means both datasets need to share the exact same genes. In practice, you filter your new query dataset so its genes match the highly variable genes already identified in your reference atlas. Your reference dataset must be fully processed before you begin. It needs an existing Principal Component Analysis, a calculated neighborhood graph, and a UMAP representation. It also holds the categorical labels you want to transfer, stored in its observation metadata. The execution is a single command. You call the ingest function, passing it your new query dataset, your annotated reference dataset, and the specific observation column you want to map, such as the cell type label. Here is the key insight. When you trigger the function, ingest takes the expression profiles of your new cells and projects them mathematically into the existing principal component space of the reference atlas. It skips calculating a new global principal component analysis entirely. Once the query cells land in that shared spatial layout, the algorithm searches for nearest neighbors. It maps the query cells directly onto the pre-existing neighbor graph of the reference dataset. The heavy computational lifting of building a graph has already been done by the reference model. Because the new cells now have established neighbors in the reference data, two final transfers occur. First, the UMAP coordinates from the reference neighbors are assigned to the new cells. Second, the metadata labels, like your cell types, are copied over based on majority voting from those closest reference neighbors. The result is a query dataset carrying the exact UMAP layout and cell type annotations of your atlas. You can overlay the previously unannotated patient sample directly on top of your reference visualization, and the matching biological populations will drop cleanly into the established visual clusters. By projecting new cells onto an existing model, ingest shifts your workflow from constantly rebuilding fragile global spaces to constructing one robust reference atlas and seamlessly flowing all future experiments through it. That is all for this one. Thanks for listening, and keep building!
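A hedged sketch, assuming adata_ref is the fully processed reference (PCA, neighbors, UMAP, and a cell_type column in .obs) and adata is the new query sample:

```python
import scanpy as sc

# Both objects must share the same variables before projection.
shared = adata_ref.var_names.intersection(adata.var_names)
adata = adata[:, shared].copy()
adata_ref = adata_ref[:, shared].copy()

# Project the query into the reference model and transfer the labels;
# the reference itself is never modified.
sc.tl.ingest(adata, adata_ref, obs="cell_type")
```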
12

Visualizing Expression Patterns

3m 20s

We explore advanced visualization techniques for evaluating gene expression across clusters. We focus on dot plots and matrix plots, detailing how they encode both expression intensity and sparsity. You will learn how to visually validate your cell type annotations at a glance.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 12 of 16. A standard feature plot is great for showing where one specific gene is active, but it completely breaks down when you need to compare twenty different genes across ten distinct clusters simultaneously. To solve that, we are looking at Visualizing Expression Patterns. When checking marker genes, the default instinct is often to reach for a heatmap. You line up all your cells, line up your genes, and look for color blocks. But single-cell RNA sequencing data is notoriously sparse. Most cells have zero counts for most genes. In a standard single-cell heatmap, this sparsity creates visual noise. You end up staring at a sea of background color, trying to guess if a gene is actually a specific marker for a cluster or if it is just randomly dropped out everywhere else. This is where the dot plot comes in. Instead of plotting individual cells, a dot plot aggregates them. You place your cell groups, like your Leiden clusters, on one axis, and your genes of interest on the other. At every intersection, you get a circle. Here is the key insight. A dot plot encodes two completely different pieces of information into that single circle. First, the color of the dot represents the mean expression level of the gene in those cells. Darker or more intense colors mean higher expression. Second, the size of the dot represents the fraction of cells in that cluster that actually express the gene at all. A large dot means almost every cell in the cluster has at least some RNA for that gene. A tiny dot means only a few cells express it. This dual encoding is incredibly powerful for sparse data. It separates how much of a gene is present from how broadly it is distributed. Let us say you are looking at fifteen candidate marker genes across five Leiden clusters. You pass your data, your gene list, and your cluster labels into the dot plot function. You can instantly see if your expected monocyte markers are both highly expressed and widely expressed in Cluster one, while being totally absent in the other four clusters. You do not have to squint at individual cell rows. The large, dark dots in the Cluster one row give you immediate validation. Sometimes, you do not need the frequency information provided by the dot size. You just want a clean grid showing the average expression. For that, Scanpy offers the matrix plot function. Think of a matrix plot as a grouped heatmap. It still aggregates your cells by cluster, but it fills the entire grid square with color representing the mean expression value. There are no changing dot sizes. It is a faster, denser way to verify broad expression patterns when you have a massive list of genes and the fraction of expressing cells matters less to you. Both tools take your data object, a list of your target genes, and the metadata category you want to group by. They execute quickly and scale beautifully to dozens of genes. When dealing with single-cell sparsity, separating the expression intensity from the expression frequency is the most reliable way to confirm a marker gene actually defines a cluster. That is all for this one. Thanks for listening, and keep building!
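Both plots are one-liners; the marker list below is illustrative:

```python
import scanpy as sc

marker_genes = ["CD14", "LYZ", "MS4A1", "CD8A", "GZMK"]

# Dot color encodes mean expression; dot size encodes the fraction of
# cells in each group expressing the gene at all.
sc.pl.dotplot(adata, marker_genes, groupby="leiden")

# Matrix plot: mean expression only, as filled grid squares.
sc.pl.matrixplot(adata, marker_genes, groupby="leiden")
```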
13

Exploring Manifolds with Diffusion Maps

3m 51s

We introduce Diffusion Maps, a powerful embedding technique for continuous biological data. We contrast it with UMAP, explaining why diffusion is better suited for analyzing cellular differentiation. You will learn how to visualize continuous transitions and developmental processes.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 13 of 16. UMAP is fantastic for separating distinct cell types, but it can violently tear apart continuous biological processes. When you are studying cellular development, you do not want disjointed islands of cells; you need to see the smooth continuum of state changes. Exploring manifolds with Diffusion Maps resolves this exact problem. In Scanpy, you calculate this using the tool diffmap. Users often default to UMAP for all dimensionality reduction. Understand that UMAP is an embedding technique optimized for finding distinct clusters and preserving local neighborhoods. Diffusion maps are fundamentally different. They preserve the continuous mathematical probability of transitioning between states. This makes them the ideal choice for analyzing continuous processes like cellular differentiation. The diffmap algorithm treats your data manifold as a continuous network. It relies entirely on the nearest neighbor graph of your cells. Once that graph is established, the algorithm simulates a random walk across the connections. Think of it as modeling a diffusion process, similar to how heat spreads through a physical material. The algorithm evaluates how easily a signal can travel through the dense regions of your data. It calculates the probability of moving from one cell state to another over a specific number of steps. Cells that share a high probability of transition are placed closer together in the final lower-dimensional space. Consider tracing a hematopoietic stem cell differentiating into an erythrocyte. If you project this data using an embedding that favors discrete separation, the intermediate progenitor cells often get forced into artificial, separate clusters. The underlying math fractures the biological timeline. If you run diffmap instead, the algorithm computes the transition probabilities along the entire developmental path. The result is a smooth, continuous trajectory. The stem cell sits at one end, the mature erythrocyte at the other. Every intermediate state is plotted along a connected path based strictly on the likelihood of state transition. You are not looking at isolated snapshots of distinct cell types. You are looking at a fluid biological event. Applying this in Scanpy follows a rigid sequence. First, you must compute the neighborhood graph in your object. The diffusion map cannot run without those pre-computed neighbor connections. Next, you call the diffmap tool and pass it your AnnData object. You can optionally specify the number of components you want to compute, which sets the dimensions of the output. Scanpy calculates the diffusion map and stores the new coordinates in the multidimensional observation attribute of your object, under the key X diffmap. The tool also stores the eigenvalues in the unstructured data attribute. These values tell you how much variance each diffusion component captures. A sharp drop in these eigenvalues indicates that you have captured the most important biological transitions, and the subsequent components are likely noise. In a typical differentiation dataset, the first non-trivial diffusion component directly aligns with the primary developmental time axis. Here is the key insight. In a diffusion map, the physical distance between two cells on the plot is not just a generic measure of transcriptomic similarity. That distance explicitly represents the mathematical probability of a biological transition occurring between those two specific states. 
That is your lot for this one. Catch you next time!
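The rigid sequence described above, as a minimal sketch in code:

```python
import scanpy as sc

# The diffusion map cannot run without a pre-computed neighbor graph.
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)
sc.tl.diffmap(adata, n_comps=15)

# Coordinates land in .obsm, eigenvalues in .uns; a sharp drop in the
# eigenvalues suggests the remaining components are mostly noise.
print(adata.obsm["X_diffmap"].shape)
print(adata.uns["diffmap_evals"])
```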
14

Abstracted Graphs with PAGA

3m 32s

This episode covers Partition-based Graph Abstraction, or PAGA. We discuss how to measure the actual connectivity between clusters to preserve global topology. You will learn how to use PAGA to uncover the true lineage relationships hidden in your data.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 14 of 16. Just because two cell clusters sit next to each other on a UMAP plot does not mean they are biologically related. Visual proximity in two dimensions is often a mathematical illusion, and if you rely on it to draw developmental trajectories, you might connect dots that do not actually touch in high-dimensional space. To map actual, statistically backed connectivity, you need Abstracted Graphs with PAGA. PAGA stands for Partition-based Graph Abstraction. We need to clear up a common misconception immediately. PAGA is not a dimensionality reduction embedding like t-SNE or UMAP. It does not calculate coordinates for individual cells to draw a scatter plot. Instead, PAGA creates a simplified, coarse-grained graph. The nodes in this graph are whole clusters, or partitions, of cells. The edges connecting those nodes represent the statistical confidence that those clusters share a continuous boundary. When you call the PAGA function on your annotated data, you point it to a specific set of group labels, usually your Leiden or Louvain clusters. The algorithm then evaluates the boundaries between these groups by digging into the underlying single-cell neighborhood graph. It looks at the individual cells in cluster A and counts how many of their direct neighbors belong to cluster B. By tallying all these cross-cluster connections and comparing them to a random model, PAGA generates a quantifiable connectivity matrix. A high value means the clusters are deeply intertwined, suggesting a biological transition. A low value means they are separate islands. Consider a concrete scenario. You are tracking immune cell development and you need to prove a specific progenitor cluster directly gives rise to an effector T-cell cluster. On a standard visual plot, the layout algorithm might throw an entirely unrelated cluster right between them, making their relationship look indirect. By examining the PAGA connectivity matrix, you bypass this visual distortion. You look directly at the mathematical edge weight between your progenitor and effector groups. PAGA allows you to set a minimum connectivity threshold. When you apply this threshold, you filter out spurious, low-confidence connections. If the strong edge between your two target clusters survives the cut, you have established a statistically robust link. This is the part that matters. PAGA does not just sit alongside your embeddings; it can fix them. Because PAGA preserves global topology so reliably, you can use the abstracted graph to initialize a UMAP embedding. Instead of letting UMAP start from a random spatial layout, you tell it to position the individual cells based on the coarse-grained PAGA graph. This anchors your final two-dimensional visualization to the true high-dimensional reality, ensuring that distant biological states are not artificially squished together. Visualizations will always distort complex data to fit it onto a flat screen, but the abstraction of a neighborhood graph relies entirely on mathematical proximity. Trust the graph over the picture. That is all for this one. Thanks for listening, and keep building!
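A minimal sketch of the workflow, assuming Leiden labels in .obs; the threshold value is illustrative:

```python
import scanpy as sc

# Abstract the single-cell graph into a cluster-level connectivity graph.
sc.tl.paga(adata, groups="leiden")

# Plot it, hiding low-confidence edges below the threshold.
sc.pl.paga(adata, threshold=0.03)

# Anchor the UMAP layout to the PAGA graph instead of a random start.
sc.tl.umap(adata, init_pos="paga")
```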
15

Trajectory Inference with DPT

3m 33s

We explore trajectory inference using Diffusion Pseudotime (DPT). We explain how to designate a root cell and calculate geodesic distances across the cellular graph. You will learn how to arrange cells along a continuous developmental timeline.

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 15 of 16. Single-cell sequencing destroys the very cell you are trying to study. You get a static snapshot of its gene expression, which means you cannot record a video of a stem cell differentiating into a mature state. To see that developmental journey, you have to infer a timeline mathematically by calculating transcriptomic distances. That is exactly what we do using Trajectory Inference with DPT.

DPT stands for Diffusion Pseudotime. Before looking at how the algorithm operates, we need to clarify what that name actually means. Pseudotime is not real chronological time. It does not measure hours, days, or the biological age of a cell. It is strictly a metric of transcriptomic distance. It measures how many incremental expression changes a cell has undergone relative to a specific starting point.

To run this in Scanpy, you use the function named sc dot tl dot dpt. This function operates on the existing neighborhood graph of your dataset, which connects cells based on their similarity. However, a graph alone has no inherent direction. To give it direction, you must define a starting point. You do this by setting a root cell.

Consider a scenario where you are studying blood development. You examine your clusters and identify the naive hematopoietic stem cells. You pick a specific cell index from that group and assign it as the root in your dataset structure. This acts as the origin point, or time zero, for the entire calculation.

Once the root is established, you execute the DPT function. Here is the key insight. The algorithm does not measure a straight, linear distance between the root and another cell. Biological development is not a straight line; it follows complex, branching paths. To capture this, DPT calculates geodesic distances along your neighborhood graph. It evaluates the structure of the data by simulating random walks from the root. It steps from cell to cell across the dense edges of the graph, finding the most probable paths of transcriptomic change.

The result of this calculation is a new array of values added to your cell annotations. Every single cell in your dataset receives a pseudotime score. The root cell sits at zero. As the geodesic distance from the root increases, the score rises. In our blood development scenario, an intermediate progenitor cell might get a score of point four, while a fully mature cell at the terminal end of a branch gets a score near one. You have effectively mapped a static cluster of dots into a continuous developmental ordering, sorting them from least to most differentiated. You can now use this numerical axis to track individual gene dynamics, plotting exactly when a specific transcription factor activates along the developmental path.

The reliability of your trajectory inference depends entirely on your starting point, meaning an incorrectly chosen root will yield a perfectly calculated but biologically meaningless timeline. If you find these episodes helpful and want to support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
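For those following along in code, here is a minimal sketch of this episode's steps. It assumes adata already carries a neighborhood graph and Leiden labels, and that the naive stem cells happen to sit in the cluster labeled "0"; that label is an illustrative stand-in you would replace with your own annotation.

import numpy as np
import scanpy as sc

# DPT operates on diffusion components, so compute the diffusion map first.
sc.tl.diffmap(adata)

# Designate the root: here, the first cell belonging to the stem cell cluster.
adata.uns["iroot"] = int(np.flatnonzero(adata.obs["leiden"] == "0")[0])

# Compute diffusion pseudotime as geodesic distance from the root.
sc.tl.dpt(adata)

# Every cell now carries a score in adata.obs["dpt_pseudotime"]; the root
# sits at zero and scores grow with transcriptomic distance.
sc.pl.diffmap(adata, color="dpt_pseudotime")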
16

Experimental Scale-Up with Dask

3m 36s

In our final episode, we look at the experimental frontier of Scanpy: scaling up with Dask. We explain how to handle datasets that exceed your machine's RAM using lazy evaluation and out-of-core processing. Thank you for joining us on this deep dive into Scanpy!

Hi, this is Alex from DEV STORIES DOT EU. Scanpy Single-Cell Analysis, episode 16 of 16. What happens when your single-cell dataset hits five million cells and completely exhausts your computer's memory? You cannot filter it, you cannot normalize it, and your kernel simply crashes. The solution to this hard memory limit is Experimental Scale-Up with Dask.

People often hear Dask and immediately think of parallel processing across a distributed cluster to speed up code. While Dask can do that, its primary superpower in Scanpy right now is out-of-core memory management through lazy evaluation. It is not just about doing things faster. It is about doing things that were previously impossible on a single machine.

Standard Scanpy workflows rely on in-memory arrays. This requires the entire dataset to reside in your active RAM. When you use the Dask backend, Scanpy replaces these standard arrays inside your AnnData object with Dask arrays. A Dask array is essentially a collection of many smaller arrays, called chunks. Instead of loading the whole matrix into memory at once, Dask leaves the bulk of the data safely on disk.

When you run a Scanpy preprocessing function backed by Dask, it does not calculate the result right away. This is where lazy evaluation comes in. Instead of crunching numbers, Dask builds a recipe. It creates a task graph outlining exactly what mathematical operations need to be performed on each individual chunk of data.

Consider a scenario where you have a massive two-million-cell dataset on disk, and you need to calculate quality control metrics. If you try this with a standard array, your system will freeze as it tries to pull everything into RAM. But if your AnnData object contains a Dask array, you simply call the standard Scanpy quality control function. The function returns almost instantly. Your memory does not spike because no numbers have actually been processed yet. Scanpy merely noted your intent.

When you are finally ready to plot those metrics or save the summary statistics, you explicitly tell Dask to compute the result. This is the part that matters. At this exact moment, Dask pulls the first chunk of data from your hard drive, calculates the metrics for just that chunk, stores the small result, and then immediately throws the raw data chunk out of memory. Then it moves to the next chunk. Your active memory footprint stays tiny. It is dictated entirely by the size of a single chunk, not the millions of cells in the full dataset.

Right now, this Dask backend in Scanpy is categorized as experimental. Not every function in the ecosystem supports it yet. However, core preprocessing steps like normalization, scaling, and highly variable gene selection are fully equipped to handle massive out-of-core operations. When you shift from eager processing in memory to lazy processing on disk, the size of your dataset is no longer limited by your hardware RAM, but only by your patience.

This brings us to the end of our single-cell series. I highly encourage you to explore the official Scanpy documentation and try building these pipelines hands-on. If you have ideas for what technologies we should cover in our next series, visit devstories dot eu and let us know. Thanks for listening, happy coding everyone!
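As a parting, hands-on sketch of the lazy workflow described in this episode: the random Dask array below stands in for counts streamed from a store on disk, and the shapes, chunk sizes, and target_sum are purely illustrative. Because this backend is experimental, treat the snippet as a sketch and check the Scanpy documentation for which functions currently support Dask-backed AnnData.

import anndata as ad
import dask.array as da
import numpy as np
import scanpy as sc

# Stand-in for a matrix loaded lazily from disk: two million cells by
# two thousand genes, held as many small chunks rather than one giant array.
counts = da.random.poisson(
    1.0, size=(2_000_000, 2_000), chunks=(10_000, 2_000)
).astype(np.float32)
adata = ad.AnnData(X=counts)

# These calls only extend Dask's task graph; nothing is computed yet, so
# memory stays flat. An explicit target_sum keeps normalization fully lazy.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Materializing a slice triggers computation chunk by chunk; only the
# chunks covering these ten cells are ever pulled into RAM.
print(adata.X[:10].compute())

The chunk size is the knob that trades memory for scheduling overhead: as the episode notes, the peak footprint is dictated by a single chunk, not by the full matrix.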