v0.26 — 2026 Edition. A 5-episode curriculum on using scikit-image as a core image preprocessing and data augmentation engine in modern AI and deep learning pipelines.
1
The AI Image Pipeline: NumPy at the Core
Discover how scikit-image represents images as NumPy ndarrays. Learn why this design makes it the perfect preprocessing engine for deep learning frameworks like PyTorch and TensorFlow.
3m 51s
2
Speaking the Same Language: Dtypes and OpenCV
Master image data types to prevent the most common silent bugs in computer vision. Learn how to seamlessly integrate scikit-image with OpenCV and neural network inputs.
3m 50s
3
Contrast, Exposure, and AI Robustness
Learn how to use contrast adjustment and histogram equalization to standardize datasets. These techniques are crucial for making AI models robust against varying lighting conditions.
4m 07s
4
Geometrical Transformations for Data Augmentation
Explore how to resize images to fit neural network inputs and apply affine transformations. Essential for building robust data augmentation pipelines.
3m 47s
5
Classical Segmentation to Bootstrap AI
Discover how to use classical watershed segmentation to automatically generate pixel-perfect training masks for deep learning models, saving hours of manual labeling.
3m 34s
Episodes
1
The AI Image Pipeline: NumPy at the Core
3m 51s
Discover how scikit-image represents images as NumPy ndarrays. Learn why this design makes it the perfect preprocessing engine for deep learning frameworks like PyTorch and TensorFlow.
Hi, this is Alex from DEV STORIES DOT EU. scikit-image: The AI Image Pipeline, episode 1 of 5. Before a deep learning model can recognize a face, it must digest a grid of numbers. If you feed that grid in the wrong order, your model learns nothing, and your processing slows to a crawl. The AI Image Pipeline: NumPy at the Core is the solution to structuring that grid correctly.
When you load an image using scikit-image, you do not get a proprietary image object. You get a standard N-dimensional array, known as an ndarray. An image is simply a matrix of pixels. Because it is a plain NumPy array, you can use any standard operation to slice, mask, or manipulate the image data directly.
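To make this concrete, here is a minimal sketch. It uses the bundled `data.astronaut()` sample image as a stand-in for a loaded file (my choice; anything read via `skimage.io.imread` behaves the same way):

```python
import numpy as np
from skimage import data

image = data.astronaut()        # bundled sample image; io.imread("photo.png") works the same
print(type(image))              # <class 'numpy.ndarray'> -- no proprietary wrapper

# Plain NumPy operations apply directly to the pixel grid:
corner = image[:100, :100]      # slice out the top-left 100x100 patch
dark = image < 50               # boolean mask of dim pixels
muted = image.copy()
muted[dark] = 0                 # crush the shadows to pure black
```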
Here is the key insight. The way scikit-image indexes these arrays trips up a lot of developers. In standard geometry, you use x and y coordinates, where x is horizontal and y is vertical. scikit-image abandons this for matrix notation. The first index is the row, which corresponds to the vertical position. The second index is the column, which corresponds to the horizontal position. If you have a color image, the third index is the color channel. A standard two-dimensional color image is actually a three-dimensional array ordered as row, column, channel.
Suppose you are preparing a batch of color images to feed into a Convolutional Neural Network. Your network expects a specific shape. If your image is 256 pixels high and 256 pixels wide with red, green, and blue channels, the shape of a single image is 256, 256, 3. But a batch of these images adds a fourth dimension at the beginning, representing the number of images.
When you apply scikit-image functions to this data, the function needs to know which dimension holds the color values so it does not process them as spatial data. This is handled by the channel_axis argument. By setting channel_axis to -1, you tell the function that the color channels sit in the last dimension of the array. This ensures the function targets the correct color data, regardless of whether you pass it a single image or a stack with extra leading dimensions.
This brings us to memory. This is the part that matters for processing speed. NumPy arrays are stored in contiguous blocks of memory, using a C-like order by default. This means the last dimension in the array shape changes fastest in physical memory. For our standard image array of row, column, channel, the individual color channels for a single pixel sit right next to each other in your hardware RAM.
If you write custom code to loop over these pixels, you must respect this memory layout. The rule is absolute: the rightmost dimension of your array should be processed in the innermost loop. You iterate over rows in the outer loop, then columns, then channels on the inside. If you reverse this and loop over rows on the inside, you force the CPU to jump back and forth across fragmented memory addresses. This destroys cache locality and makes your code run significantly slower.
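You can see this layout directly in NumPy's stride information. A small sketch (the concrete byte counts assume a 256x256 RGB uint8 array):

```python
import numpy as np

image = np.zeros((256, 256, 3), dtype=np.uint8)  # row, column, channel
assert image.flags["C_CONTIGUOUS"]               # C order: last axis changes fastest

# Strides: how many bytes a single step along each axis jumps in memory.
# One row jumps 256 * 3 = 768 bytes, one column jumps 3, one channel just 1,
# so channels, then columns, belong in the innermost loops.
print(image.strides)                             # (768, 3, 1)
```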
The most critical thing to remember is that scikit-image does not use horizontal and vertical coordinates; it uses strict matrix indexing of row, column, and channel, and matching your loops to that exact memory order is what keeps your data pipeline fast.
If you want to help keep this podcast going, you can support the show by searching for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
2
Speaking the Same Language: Dtypes and OpenCV
3m 50s
Master image data types to prevent the most common silent bugs in computer vision. Learn how to seamlessly integrate scikit-image with OpenCV and neural network inputs.
Hi, this is Alex from DEV STORIES DOT EU. scikit-image: The AI Image Pipeline, episode 2 of 5. The most common silent bug in computer vision AI is not in the neural network architecture. It is a data type mismatch. You feed an image into a model, the code runs without throwing errors, but the outputs are complete garbage. This happens when your libraries fundamentally disagree on how a pixel should be represented numerically. Today we will fix this by covering Speaking the Same Language: Dtypes and OpenCV.
In scikit-image, images are stored as standard numpy arrays. But the numbers inside those arrays behave differently depending on their exact data type. The default format from webcams and standard image files is eight-bit unsigned integer, known as uint8. These values run from zero to 255, where zero is absolute black and 255 is peak intensity.
However, scientific image processing functions and deep learning frameworks almost always prefer floating-point numbers. In scikit-image, a float image expects pixel intensities to be scaled on a strict range, usually from zero to one.
Here is the key insight. Do not use numpy's astype method to convert your uint8 images into floats. If you take a uint8 array and simply call astype(float), numpy only changes the underlying memory type. A bright pixel value of 255 merely becomes 255.0. It does not rescale the values. If you pass that array into a scikit-image function or a PyTorch model that expects maximum brightness at 1.0, the math explodes. Your whites are suddenly treated as if they are 255 times brighter than the maximum possible value.
Instead, you must use scikit-image's built-in utility functions. The most important one is called img_as_float. This function checks the input data type and handles the mathematical rescaling automatically. It safely compresses a zero to 255 uint8 range down to a precise zero to one point zero float range.
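A minimal demonstration of the difference (the tiny three-pixel array is just for illustration):

```python
import numpy as np
from skimage.util import img_as_float

frame = np.array([[0, 128, 255]], dtype=np.uint8)

wrong = frame.astype(np.float64)   # only changes the memory type: 255 -> 255.0
right = img_as_float(frame)        # rescales as well: 255 -> 1.0

print(wrong.max(), right.max())    # 255.0 1.0
```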
Data types are only half the battle. You also have to align the color channels, especially if you are capturing video. If you read a frame using OpenCV, it hands you a uint8 numpy array. But OpenCV has a historical quirk. It stores color channels in Blue Green Red order, known as BGR. scikit-image and most modern AI models expect Red Green Blue, or RGB. If you forget to swap the channels, red apples look blue and human skin looks completely alien.
You do not need to call a heavy OpenCV color conversion function to fix this. Because the image is just a numpy array, you can use basic array slicing. You slice the array across its three dimensions. You take all rows, all columns, and then for the final dimension representing the color channels, you specify a step of minus one. This tells numpy to step backward through the channels, reversing BGR to RGB instantly in memory.
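Sketched with a tiny synthetic frame standing in for an OpenCV capture, so the example runs without a camera:

```python
import numpy as np

# Pretend this came from cv2.VideoCapture: uint8, channels in BGR order.
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[:, :, 0] = 255                 # fill the blue channel (index 0 in BGR)

rgb = bgr[:, :, ::-1]              # step backward through the channel axis

print(rgb[0, 0])                   # blue now sits at index 2, as RGB expects
```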
Let us walk through a complete ingestion pipeline for a deep learning model. First, you grab a frame from an OpenCV video stream, giving you a uint8 BGR array. Next, you reverse the color channels using numpy slicing to convert it to RGB. Then, you pass that array into img_as_float. The array is now a floating-point matrix, perfectly scaled between zero and one. Finally, you convert this clean numpy array into a PyTorch tensor.
A neural network cannot tell you when its input data is scaled incorrectly; it just learns the wrong patterns or fails silently. Controlling your data types and channel orders right at the ingestion step ensures your pipeline rests on a solid mathematical foundation.
That is all for this one. Thanks for listening, and keep building!
3
Contrast, Exposure, and AI Robustness
4m 07s
Learn how to use contrast adjustment and histogram equalization to standardize datasets. These techniques are crucial for making AI models robust against varying lighting conditions.
Hi, this is Alex from DEV STORIES DOT EU. scikit-image: The AI Image Pipeline, episode 3 of 5. Neural networks are brilliant at finding patterns, but they will happily memorize the specific lighting of a room instead of the object you want them to detect. To fix this, we need to standardize the variance in our datasets through Contrast, Exposure, and AI Robustness.
Consider a dataset of medical X-rays collected from a dozen different hospitals. The scans come from different machines with wildly varying calibrations. Some images are dark and muddy, while others are washed out and bright. If you feed this raw data into a neural network, it will likely overfit to the lighting conditions of specific scanners rather than learning to identify the underlying pathology. You have to normalize the exposure before training begins.
The first step in this standardization is often removing irrelevant information. If the structure is all that matters, color is a distraction. You can use a function called rgb2gray to convert colored images into single-channel grayscale arrays. This reduces the dimensionality of your data and forces the model to evaluate pure luminance.
Once you are working strictly with luminance, you need to align the baseline brightness across your dataset. This is where rescale_intensity comes in. This function performs a linear stretch on your image data. It takes the darkest pixel and maps it to the lowest possible value, like zero, and maps the brightest pixel to the highest possible value, like 255. Every pixel in between is scaled linearly.
Here is the key insight. A simple minimum-to-maximum stretch is fragile. A single dead black pixel or a bright artifact from a piece of dust on the sensor will dictate the entire scale. The stretch will compress your actual anatomical data into a narrow, useless band of grays just to accommodate that one extreme outlier.
To solve this, you use percentile clipping. Instead of stretching from the absolute minimum and maximum, you calculate the second and ninety-eighth percentiles of the pixel values in your image. You then pass those percentiles into the rescale_intensity function as your input range. The function chops off the extreme two percent of bright and dark pixels, setting them to pure white and pure black, and linearly stretches the remaining ninety-six percent of the data. This guarantees that the bulk of your structural data uses the full dynamic range, completely ignoring random artifacts.
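Here is that recipe as code, with the bundled `data.camera()` grayscale sample standing in for an X-ray (my substitution):

```python
import numpy as np
from skimage import data, exposure

image = data.camera()                    # uint8 grayscale stand-in for a scan

# Outlier-proof stretch: anchor the scale at the 2nd and 98th percentiles
# instead of the absolute min and max.
p2, p98 = np.percentile(image, (2, 98))
stretched = exposure.rescale_intensity(image, in_range=(p2, p98))
```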
Sometimes, a linear stretch is not enough. You might have an X-ray where the data is captured, but all the pixels are clustered around a few specific shades of gray, making the image look flat and obscuring details. For this, you use equalize_hist, which performs histogram equalization.
Histogram equalization is a non-linear process. Instead of just stretching the boundaries, it analyzes the frequency of every pixel value in the image. It then spreads out the most frequent intensity values across the entire available spectrum. If a large portion of your X-ray is trapped in a narrow band of dark grays, histogram equalization will pull those grays apart, assigning them new values that span from black to white. This artificially boosts the local contrast, revealing subtle textures and boundaries that were previously hidden in the muddy areas of the scan.
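In code, with the bundled low-contrast `data.moon()` sample as a stand-in for a flat scan:

```python
from skimage import data, exposure

flat = data.moon()                        # gray values clustered in a narrow band

# Non-linear spread: frequent intensities get pulled apart across the
# full range. Output is floating point, scaled between 0 and 1.
equalized = exposure.equalize_hist(flat)
```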
Standardizing your dataset with these techniques ensures your model evaluates the actual shape and texture of the subject. A robust pipeline removes the irrelevant variance of hardware calibration, forcing the neural network to learn the signal rather than the noise. Thanks for spending a few minutes with me. Until next time, take it easy.
4
Geometrical Transformations for Data Augmentation
3m 47s
Explore how to resize images to fit neural network inputs and apply affine transformations. Essential for building robust data augmentation pipelines.
Hi, this is Alex from DEV STORIES DOT EU. scikit-image: The AI Image Pipeline, episode 4 of 5. A convolutional neural network might recognize a cat perfectly. But turn that cat upside down, or shift it three pixels to the left, and the model is suddenly completely blind. You fix this with Geometrical Transformations for Data Augmentation.
Before we augment anything, we have to prepare the baseline. Neural networks require fixed input shapes. You handle this using the resize function in scikit-image. You pass it an image and your target output dimensions, and it mathematically stretches or shrinks the pixel array to fit, interpolating the new pixel values automatically. But resizing alone leaves your model vulnerable to overfitting. It will memorize exactly where objects sit in the frame. To prevent this, you build a data augmentation pipeline that applies random transformations to training images on the fly.
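A minimal sketch of the baseline step, assuming a network that expects 224x224 inputs (a common but arbitrary choice on my part):

```python
from skimage import data
from skimage.transform import resize

image = data.astronaut()                               # (512, 512, 3) sample image

# Stretch or shrink to the fixed input shape; new pixel values are
# interpolated automatically, and the result comes back as floats.
fixed = resize(image, (224, 224), anti_aliasing=True)
```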
In scikit-image, spatial transformations manipulate the coordinate space of the image using matrix math. Moving or rotating an image means multiplying its pixel coordinates by a transformation matrix. Here is the key insight. Translation, which is shifting an image across the x or y axis, cannot be calculated using standard two-by-two matrix multiplication. Matrix multiplication handles scaling and rotation, but shifting requires addition.
To solve this, scikit-image uses homogeneous coordinates. By appending a dummy third coordinate, a one, to every two-dimensional point, the system upgrades the math. This allows translations, rotations, and scaling to all be calculated simultaneously as a single three-by-three matrix multiplication.
You do not have to write these three-by-three matrices manually. scikit-image provides transformation classes to do the math for you. For our anti-overfitting pipeline, you use the EuclideanTransform class. A Euclidean transformation preserves distances and angles, meaning it only handles rotation and translation. You initialize it by passing in a rotation angle and a translation vector. If you needed to add shearing or change the scale, you would step up to an AffineTransform. If you needed to simulate a change in perspective, you would use a ProjectiveTransform. But for random spins and shifts, Euclidean is exactly what you need.
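Initializing one looks like this (the angle and shift are arbitrary example values). Note that the class builds the three-by-three homogeneous matrix for you:

```python
import numpy as np
from skimage.transform import EuclideanTransform

tform = EuclideanTransform(rotation=np.deg2rad(15), translation=(10, -5))

print(tform.params.shape)          # (3, 3): the homogeneous-coordinate matrix
# The translation vector sits in the last column of the matrix.
print(tform.params[0, 2], tform.params[1, 2])
```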
Once your transformation matrix is defined, you have to apply it to your image. You do this using the warp function. You pass the warp function your input image and your Euclidean transform object.
This is where it gets interesting. The warp function does not calculate where the input pixels should go in the new image. If you push pixels forward into a new grid, rotation math creates fractional coordinates. When those round to the nearest whole pixel, you end up with missing pixels, or holes, scattered across your output image. Instead, warp works backwards. It takes the inverse of your transformation matrix. It looks at every empty pixel coordinate in the target output image, maps it backward into the original image space, and interpolates the correct color value. This inverse mapping guarantees a solid, hole-free output.
For your augmentation pipeline, the logic is simple. Generate a random angle and a random set of shifts. Feed them to a Euclidean transform. Pass that transform and your training image into the warp function. The output goes straight into your neural network. Geometric transformations do not just create more training data; they force your model to separate the object it needs to recognize from the arbitrary coordinates it happens to occupy.
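Putting the whole loop together as a sketch (the sample image, the random ranges, and the seeded generator are all my assumptions):

```python
import numpy as np
from skimage import data
from skimage.transform import EuclideanTransform, warp

rng = np.random.default_rng(42)
image = data.astronaut()                       # stand-in training image

def random_augment(img):
    angle = rng.uniform(-0.3, 0.3)             # random rotation, in radians
    shift = rng.uniform(-10, 10, size=2)       # random x/y shift, in pixels
    tform = EuclideanTransform(rotation=angle, translation=shift)
    # warp maps each OUTPUT pixel backward into the input, so passing
    # tform.inverse applies tform in the forward direction.
    return warp(img, tform.inverse)

augmented = random_augment(image)              # float output, same shape
```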
That is all for this one. Thanks for listening, and keep building!
5
Classical Segmentation to Bootstrap AI
3m 34s
Discover how to use classical watershed segmentation to automatically generate pixel-perfect training masks for deep learning models, saving hours of manual labeling.
Hi, this is Alex from DEV STORIES DOT EU. scikit-image: The AI Image Pipeline, episode 5 of 5. Deep learning segmentation models are incredibly powerful, but they are incredibly hungry. They demand thousands of pixel-perfect, human-drawn masks before they can learn anything at all. Classical Segmentation to Bootstrap AI bridges that gap.
You are training a modern U-Net to detect cells or coins that touch each other. You need ground-truth labels. Hand-drawing these masks takes weeks. You might try running a simple edge detector like Canny to automate the process. Canny is excellent at finding sharp transitions, but it often fails to close loops. It gives you fragmented outlines, not solid regions. An AI trained on broken outlines will output broken outlines.
Region-based segmentation solves this. Specifically, the Watershed algorithm. It treats your image like a topographic landscape. High pixel values are mountains, and low values are valleys. Here is the key insight. Instead of trying to connect broken edges, Watershed floods the image from known starting points until the water meets at the highest ridges. This guarantees closed, solid regions.
First, you build the terrain. You do this by generating an elevation map using a Sobel filter. The Sobel filter calculates spatial gradients, highlighting edges. When you apply it to your image, the borders between your overlapping coins become the high ridges in your map. The flat surfaces of the coins and the background become the valleys.
Next, you place the markers. You have to tell the algorithm where the water should start rising. If you skip this, the algorithm will flood from every tiny local minimum and shatter your image into hundreds of useless fragments. You create a marker array the exact same size as your original image. You find the definite background by selecting pixels below a specific intensity threshold and assign them a value of one. Then, you find the definite foreground, which are the solid centers of the coins, by selecting pixels above a higher intensity threshold. You assign those a value of two.
Finally, you trigger the flood. You pass your elevation map and your marker array to the watershed function from the scikit-image segmentation module. The algorithm fills the regions starting from the ones and twos. As the simulated water rises, the regions expand. When they finally meet at the high ridges of the Sobel elevation map, the algorithm builds a boundary.
The function returns an integer array of perfectly labeled, closed regions. Touching objects are separated exactly at the boundary. You now have a clean mask array where every coin is a distinct solid object. You can run this pipeline across your entire unlabelled dataset to generate thousands of masks automatically. You then feed those masks directly into your U-Net as ground-truth training data.
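The full pipeline fits in a few lines. This sketch follows the classic coins walkthrough from the scikit-image documentation; the thresholds of 30 and 150 are tuned to that particular sample image:

```python
import numpy as np
from skimage import data, filters
from skimage.segmentation import watershed

coins = data.coins()                     # touching coins, uint8 grayscale

# 1. Terrain: Sobel gradients turn the coin borders into high ridges.
elevation = filters.sobel(coins)

# 2. Markers: 1 = definite background, 2 = definite coin centers.
markers = np.zeros_like(coins, dtype=np.int32)
markers[coins < 30] = 1
markers[coins > 150] = 2

# 3. Flood: regions grow from the markers and meet at the ridges,
#    producing closed, solid labels with no gaps.
segmented = watershed(elevation, markers)
```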
Bootstrapping an AI does not require human labor if you know how to combine classic image processing terrain logic with modern model architectures. I encourage you to explore the official scikit-image documentation and try building an elevation map hands-on, and if you have an idea for our next series, drop by devstories.eu and let me know. That is all for this one. Thanks for listening, and keep building!