v2.11 — 2026 Edition. A comprehensive audio course on building deep learning models using PyTorch version 2.11. Covers Tensors, Autograd, Neural Networks, Optimizers, DataLoaders, and the PyTorch compiler.
1
The Core Identity of PyTorch
Discover the fundamental purpose of PyTorch and what sets it apart from traditional math libraries. This episode explains the role of Tensors, Autograd, and GPU acceleration in modern deep learning.
3m 39s
2
Understanding PyTorch Tensors
Dive into Tensors, the foundational data structure of PyTorch. Learn how they bridge raw data with neural networks and share memory seamlessly with NumPy arrays.
3m 54s
3
Tensor Operations and Memory
Learn how to manipulate Tensors efficiently. This episode covers arithmetic operations, concatenation, device transfers, and the memory implications of in-place operations.
3m 30s
4
The Magic of Autograd
Unpack the engine that makes deep learning possible in PyTorch. Learn how Autograd dynamically tracks operations and automatically calculates complex derivatives.
3m 28s
5
Controlling Gradient Tracking
Discover how to disable PyTorch's gradient tracking to save memory and speed up computation. Essential for running inference and freezing model parameters.
3m 48s
6
Datasets and Data Handling
Learn how to decouple your data processing from your model architecture using the PyTorch Dataset class. We explore lazy loading and custom dataset structures.
3m 19s
7
DataLoaders and Batching
Unleash the full speed of your hardware by wrapping Datasets in DataLoaders. Learn how to batch, shuffle, and multiprocess your data streams.
3m 36s
8
Data Transformations
Discover how to preprocess raw data on the fly before it hits your neural network. We cover torchvision transforms like ToTensor and custom Lambda functions.
3m 57s
9
Designing Networks with nn.Module
Explore the structural blueprint of every PyTorch neural network. Learn how to subclass nn.Module, define layers in initialization, and route data in the forward pass.
3m 39s
10
Linear Layers and Activations
Look inside the neural network. We break down the nn.Linear module and explain why non-linear activation functions like ReLU are mathematically essential.
3m 56s
11
The nn.Sequential Container
Streamline your PyTorch code using the nn.Sequential container. Learn how to snap layers together cleanly and inspect your model's parameters.
3m 26s
12
Understanding Loss Functions
Before an AI can learn, it must measure its mistakes. We dive into PyTorch loss functions, comparing CrossEntropyLoss for classification and MSELoss for regression.
3m 20s
13
Optimizers and Gradient Descent
Explore how the optimizer updates model weights to reduce error. Learn the crucial three-step dance of zero_grad(), backward(), and step().
3m 33s
15
Validation and Inference
Evaluate your model objectively. Learn how to switch your network to evaluation mode, freeze gradients, and extract accurate predictions on unseen data.
3m 31s
16
Saving and Loading Models
Don't lose your hard-earned progress! We discuss the safest ways to serialize your model weights using state_dict and load them back securely.
3m 12s
17
Supercharging Speed with torch.compile
Unlock the defining feature of PyTorch 2.0. Learn how the torch.compile decorator JIT-compiles your Python code into optimized kernels for massive speedups.
3m 29s
18
Compilers and Graph Breaks
Dive under the hood of the PyTorch compiler. We explore graph breaks, dynamic control flow, and why torch.compile succeeds where legacy systems failed.
3m 50s
Episodes
1
The Core Identity of PyTorch
3m 39s
Discover the fundamental purpose of PyTorch and what sets it apart from traditional math libraries. This episode explains the role of Tensors, Autograd, and GPU acceleration in modern deep learning.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 1 of 18. You write a complex mathematical model in Python, but when you scale it up, it completely chokes your processor. You need to run those calculations on parallel hardware and continuously calculate all their derivatives, but rewriting everything in a low-level language would take weeks. That tension is resolved by the core identity of PyTorch.
When you first look at PyTorch, it often feels exactly like NumPy. You create arrays, multiply matrices, and manipulate numbers. This visual similarity causes a lot of early confusion. People assume PyTorch is just another standard math library. It is not. While standard math libraries are built for CPU-bound numerical computing, PyTorch is designed from the ground up to harness parallel hardware and build dynamic computational graphs.
The foundational building block of this framework is the tensor. A tensor is essentially a multi-dimensional array. If you have a grid of numbers representing an image, a sound wave, or a block of text, you store it in a tensor. The critical difference between a standard array and a PyTorch tensor is where that data can live and execute. Tensors can seamlessly move from your computer system memory to a Graphics Processing Unit.
Take a massive matrix multiplication. You have two grids containing millions of numbers. If you ask a standard CPU to multiply them, it processes the math sequentially or in very small batches. The process struggles and stalls. Because tensors are explicitly designed for hardware acceleration, you can send that exact same data to a GPU. The GPU contains thousands of small cores designed to execute mathematical operations simultaneously. A massive computation that takes minutes on a CPU can finish in a fraction of that time on a GPU. PyTorch acts as the bridge, translating your standard Python code into instructions for that parallel hardware.
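As a minimal sketch of that idea, assuming a CUDA-capable GPU happens to be present, the same multiplication can be pointed at different hardware simply by changing the device:

import torch

# Two large matrices, created on the CPU by default
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

cpu_result = a @ b  # runs on the CPU

# If a GPU is available, the identical operation runs on thousands of parallel cores
if torch.cuda.is_available():
    a_gpu, b_gpu = a.to("cuda"), b.to("cuda")
    gpu_result = a_gpu @ b_gpu  # same math, parallel hardware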
Fast hardware is only half the requirement for machine learning. Training a neural network requires continuous calculus. You need to know exactly how tweaking one variable changes your final output, which means constantly calculating gradients. Doing this manually for a model with billions of parameters is impossible.
This brings us to the second pillar of PyTorch, which is Autograd. Autograd is an automatic differentiation engine. When you perform math operations on tensors, PyTorch does not just calculate the final number. It silently builds a map in the background. It records every addition, multiplication, and data transformation into a dynamic computational graph.
When you reach the end of your calculation, you simply ask the framework to compute the gradients. PyTorch walks backward through that invisible graph, applying the chain rule of calculus automatically. You receive the exact derivatives for every single parameter in your model without writing any calculus code yourself. Because this graph is built dynamically on the fly, it adapts to your code. If a standard Python loop or if-statement changes the flow of your data, the graph adjusts immediately.
The true power of PyTorch is not just that it runs fast or does calculus. It gives you the execution speed of a supercomputer and the mathematical rigor of an automated calculus engine, entirely hidden behind readable, ordinary Python.
If you want to help keep these episodes coming, you can search for DevStoriesEU on Patreon and support the show. That is all for this one. Thanks for listening, and keep building!
2
Understanding PyTorch Tensors
3m 54s
Dive into Tensors, the foundational data structure of PyTorch. Learn how they bridge raw data with neural networks and share memory seamlessly with NumPy arrays.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 2 of 18. Every image, sound wave, and text document you feed into a neural network eventually turns into the exact same data structure. If it is all just a grid of numbers, you might wonder why we need a specialized object instead of relying on standard programming arrays. The answer lies in Understanding PyTorch Tensors.
A tensor is a specialized data structure that looks and acts a lot like an array or a matrix. In PyTorch, tensors are the universal currency. They hold your raw inputs, the outputs your model generates, and the internal parameters of the neural network itself.
People often assume tensors are completely identical to NumPy arrays. They do look similar, and they share a lot of the same behaviors. The critical difference is what tensors unlock. While a standard array sits in your main system memory and runs on your central processor, a tensor is built to easily move over to a graphics processing unit, or GPU, for massive hardware acceleration. Tensors also contain the built-in plumbing required for gradient tracking, which allows neural networks to learn.
You can initialize a tensor in several ways. The most direct route is passing raw data, like a standard Python list of numbers, straight into the tensor constructor. You can also create a new tensor based on an existing one. When you do this, the new tensor automatically inherits the properties of the original, meaning it will have the same dimensions and data type unless you explicitly override them. Alternatively, if you just need a placeholder, you can define a shape, which is a simple collection of numbers representing the dimensions you want, and ask PyTorch to generate a tensor filled with random numbers, all ones, or all zeros based on that shape.
Once you have a tensor, you will frequently check three main attributes. The first is the shape, which tells you the exact size of the tensor along every dimension. The second is the data type, which indicates the kind of numbers stored inside, such as 32-bit floats or integers. The third is the device attribute. This tells you where the tensor physically resides right now, whether that is on the CPU or a specific GPU. You need to keep track of this because PyTorch requires tensors to be on the same device before they can interact.
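A short sketch of those creation paths and the three attributes, roughly as described; the shapes and values here are arbitrary examples:

import torch

data = [[1, 2], [3, 4]]
x_data = torch.tensor(data)               # directly from a Python list

x_ones = torch.ones_like(x_data)          # inherits shape and dtype from x_data
x_rand = torch.rand_like(x_data, dtype=torch.float)  # same shape, dtype overridden

shape = (2, 3)
zeros = torch.zeros(shape)                # placeholder tensors built from a shape
rand = torch.rand(shape)

print(rand.shape)    # torch.Size([2, 3])
print(rand.dtype)    # torch.float32
print(rand.device)   # cpu, until you move it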
Tensors and standard arrays often need to work together, which brings us to the NumPy bridge. Tensors residing on the CPU can actually share their underlying memory with a NumPy array.
Say you load a high-resolution photograph using a standard Python image processing library. That image loads into your system memory as a standard NumPy array. You can pass that array into PyTorch using a dedicated function that creates a tensor from NumPy. PyTorch does not duplicate the underlying pixel data into a new block of memory. It simply wraps its own tensor interface around the existing memory address. Changing a value in the tensor immediately changes the value in the NumPy array, and vice versa. This zero-copy conversion saves both memory and processing time. When you are done passing the data through your model and need to hand the results back to a standard visualization tool, you call a single method on the tensor to expose it back as a NumPy array, using that exact same shared memory.
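A small sketch of that zero-copy bridge; because both objects view the same memory, the in-place add shows up on both sides:

import numpy as np
import torch

arr = np.ones(5)              # e.g. data loaded by another Python library
t = torch.from_numpy(arr)     # wraps the same memory, no copy made

t.add_(1)                     # in-place change on the tensor...
print(arr)                    # ...is visible in the NumPy array: [2. 2. 2. 2. 2.]

back = t.numpy()              # expose the same shared memory as NumPy again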
The real power of a tensor is not just storing a grid of numbers, but carrying the specific hardware context and memory structure needed to push raw data through a neural network without friction. Thanks for listening, happy coding everyone!
3
Tensor Operations and Memory
3m 30s
Learn how to manipulate Tensors efficiently. This episode covers arithmetic operations, concatenation, device transfers, and the memory implications of in-place operations.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 3 of 18. A single underscore in your code can save gigabytes of memory, but it might also silently break your entire neural network. Knowing when to use that underscore comes down to understanding Tensor Operations and Memory.
Consider a practical scenario. You have three separate feature vectors representing text, audio, and image data. You want to combine them and multiply them against a weight matrix. By default, PyTorch creates tensors on the CPU. But for heavy matrix math, you want to use a hardware accelerator. You can check if a GPU is available using the framework built-in checks. If it is, you move your tensors by calling the to method on them. You pass the target device name, like the string cuda, into this method. PyTorch then copies the tensor from system RAM into the dedicated memory of your graphics card.
With your tensors on the right hardware, you need to combine the three separate feature vectors into one. You do this using the concatenate function, commonly written as cat. You pass it a list of your tensors and specify a dimension. If you combine them along the column dimension, your three narrow tensors are joined side-by-side to form one wider tensor. You now have a unified input residing in GPU memory.
PyTorch handles over a hundred different operations, but arithmetic is the foundation. To process your combined feature vector, you need to multiply it against a weight matrix. You can use the matmul method, or just use the at-symbol as a convenient shorthand. This performs a true mathematical matrix multiplication, calculating the dot products of rows and columns, and returns a completely new tensor containing the results.
Sometimes you need element-wise math instead. Suppose you want to apply a binary mask to your tensor, forcing certain values to zero. For this, you use the mul method, or the standard asterisk operator. This does not do matrix multiplication. It simply multiplies the first element of tensor A by the first element of tensor B, the second by the second, and so on.
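Roughly what that pipeline looks like in code; the feature widths and the binary mask are made-up placeholders for illustration:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Three hypothetical feature vectors (batch of 1, different widths)
text = torch.randn(1, 8, device=device)
audio = torch.randn(1, 4, device=device)
image = torch.randn(1, 16, device=device)

features = torch.cat([text, audio, image], dim=1)   # joined side by side -> shape (1, 28)

weights = torch.randn(28, 10, device=device)
projected = features @ weights                      # true matrix multiplication -> (1, 10)

mask = (torch.rand(1, 10, device=device) > 0.5).float()
masked = projected * mask                           # element-wise multiply, not matmul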
Every time you run operations like matrix multiplication or element-wise addition, PyTorch allocates fresh memory for the result. When you operate on millions of parameters, this rapidly consumes your available hardware memory.
This is where you have to pay attention. PyTorch provides in-place operations to manage memory overhead. Any operation that ends with an underscore operates in-place. If you use the standard add method, you get a new tensor. If you use the add method with an underscore, PyTorch directly overwrites the values inside the existing tensor. The memory footprint stays exactly the same.
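A tiny sketch of that difference; the values end up the same, but the memory behavior does not:

import torch

x = torch.ones(3)

y = x.add(5)     # out-of-place: x is untouched, y is a brand new tensor
x.add_(5)        # in-place: x itself now holds [6., 6., 6.], no new allocation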
While in-place operations are highly efficient for memory, they are also dangerous. When you overwrite a tensor, you erase its previous values. Neural networks rely on a complete record of past states to calculate derivatives during the learning phase. If you overwrite a tensor using an in-place operation, you destroy the computation history the system needs to update the model.
Reserve in-place operations for formatting data before it enters your model, and stick to standard operations during training to keep your computation history intact. That is all for this one. Thanks for listening, and keep building!
4
The Magic of Autograd
3m 28s
Unpack the engine that makes deep learning possible in PyTorch. Learn how Autograd dynamically tracks operations and automatically calculates complex derivatives.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 4 of 18. Training a neural network means calculating the derivative of your error with respect to millions of parameters. Doing that by hand would require pages of calculus and constant rewriting every time you change your model architecture. PyTorch solves this by tracking your math in the background, a concept known as The Magic of Autograd.
Autograd is PyTorch’s built-in differentiation engine. It automatically computes gradients for any computational graph. To see how it works, picture a standard linear transformation. You have an input tensor containing your data, a weight matrix, and a bias vector. The goal is to compute an output, compare it to the actual target value, and calculate the error, or loss.
Your input data is fixed, so you do not need its derivatives. But the weight and bias tensors need to be updated later, meaning you absolutely need their gradients. You signal this to PyTorch by setting a flag called requires grad to true when you create those parameter tensors. This tells the autograd engine to start watching them.
When you perform operations on these watched tensors — multiplying the input by the weights, adding the bias, and calculating the final loss — PyTorch does two things at once. It computes the actual numerical result, and simultaneously, it constructs a Directed Acyclic Graph, or DAG. In this graph, your input tensors are the leaves, the final output tensor is the root, and the mathematical operations you applied are the nodes connecting them. Every new tensor created by an operation has an attribute that stores a reference to the function that created it. This tells autograd exactly how to compute the derivative for that specific mathematical step.
This graph is not a static structure defined at the start of your script. PyTorch builds the DAG dynamically from scratch during every single iteration. When you run a forward pass, a brand new graph is constructed on the fly. This dynamic execution means your network can change its behavior on every step. You can use standard Python control flow, like if statements or loops, and the engine will cleanly track whatever path the data actually took during that specific run.
Once your forward pass produces the final loss tensor, you trigger the gradient calculation by calling the backward method on that loss. Autograd immediately traverses the graph in reverse. It uses the chain rule to calculate the derivatives of the loss with respect to every tensor that has requires grad set to true. It then takes those calculated values and stores them in the grad attribute of your weight and bias tensors. The complex calculus is completely abstracted away.
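A minimal sketch of that linear-model example, following the description above; the tensor sizes are arbitrary:

import torch

x = torch.ones(5)                  # fixed input, no gradients needed
y = torch.zeros(3)                 # target value

w = torch.randn(5, 3, requires_grad=True)   # watched parameters
b = torch.randn(3, requires_grad=True)

z = x @ w + b                                          # forward pass builds the graph
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

loss.backward()        # walk the graph in reverse, applying the chain rule
print(w.grad.shape)    # gradients stored on the parameters: torch.Size([5, 3])
print(b.grad.shape)    # torch.Size([3])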
There are times when you only want to push data through the model without calculating gradients, such as when evaluating a trained model. Tracking history requires extra memory and computation. You can stop autograd from building the graph entirely by wrapping your code block in the torch no grad context manager. This temporarily stops the tracking and executes the math much faster.
The true power of autograd is that it turns arbitrary Python code into a fully differentiable mathematical structure without you ever having to manually write the derivative formulas.
Thanks for listening. Take care, everyone.
5
Controlling Gradient Tracking
3m 48s
Discover how to disable PyTorch's gradient tracking to save memory and speed up computation. Essential for running inference and freezing model parameters.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 5 of 18. If you run a newly trained model on a large batch of test images without changing one specific setting, your application will eventually crash from an out-of-memory error. Even though you are just making predictions, the model is secretly hoarding memory to remember every mathematical operation it performs until the system dies. Controlling Gradient Tracking is how you prevent this.
By default, PyTorch tensors are built to learn. If a tensor has its gradient requirement set to true, PyTorch tracks every operation performed on it. It builds a computation graph in the background, linking inputs, weights, and outputs so it can calculate gradients later during backpropagation. This tracking engine is brilliant for training, but it requires a lot of overhead.
When training is complete, your priorities change. Say you just finished training an image classifier and you need to run predictions on a million new images. You do not need to update the model weights anymore. You only want the forward pass. If you leave the tracking machinery running, PyTorch builds a useless, massive computation graph for those million images, burning through your RAM and slowing down your compute cycles.
To stop this, you have two primary tools. The first is a context manager called torch dot no grad. You use this to wrap entire blocks of code. When you place your forward pass inside a no grad block, you are telling PyTorch to temporarily shut down the tracking engine. Any operations performed inside that block will not be recorded. Even if the input tensors are normally tracked, the outputs created inside the block will have their gradient requirements set to false. This is your tool for running evaluation, testing, or bulk predictions. It turns off the graph for everything inside its scope.
The second tool is the detach method. While no grad handles blocks of code, detach handles individual tensors. Calling detach on a tensor returns a new tensor that shares the exact same underlying data as the original, but it is completely disconnected from the computation graph. It has no history.
People often confuse when to use which. Use the torch dot no grad context manager when you want to silence tracking for a sequence of operations, like moving from training to inference. Use the detach method when you are actively building a computation graph during training, but you need to pull one specific tensor out of that graph. A common use case for detach is when you need to pass a tensor to a different Python library, like NumPy, which does not understand PyTorch computation graphs. You detach the tensor first, stripping away the tracking baggage, and then hand over the raw numbers.
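A sketch of both tools side by side; z is assumed to come from tracked operations:

import torch

w = torch.randn(3, requires_grad=True)
z = w * 2                       # tracked: z.requires_grad is True

with torch.no_grad():           # silence tracking for a whole block
    z_block = w * 2
print(z_block.requires_grad)    # False

z_detached = z.detach()         # same data, cut out of the computation graph
print(z_detached.requires_grad) # False
as_numpy = z_detached.numpy()   # now safe to hand over to NumPy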
Disabling gradient tracking is also a core technique for freezing parameters. If you are fine-tuning a massive pre-trained model, you probably do not want to train the entire thing from scratch. You can loop through the base layers of the model and set their gradient requirements to false. PyTorch stops tracking them entirely. During the backward pass, those frozen layers will not calculate gradients and will not update, saving massive amounts of memory and dramatically speeding up your fine-tuning process.
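One way that freezing might look, using a torchvision ResNet-18 purely as an example of a pre-trained model; the name of the final layer, fc, is specific to that architecture:

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1")   # example pre-trained network

for param in model.parameters():
    param.requires_grad = False                    # freeze every base layer

model.fc = nn.Linear(model.fc.in_features, 10)     # new head, trainable by default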
Gradient tracking is heavy industrial machinery designed strictly for learning. Whenever a tensor does not need to learn, shut the machinery off to reclaim your memory and speed. That is all for this one. Thanks for listening, and keep building!
6
Datasets and Data Handling
3m 19s
Learn how to decouple your data processing from your model architecture using the PyTorch Dataset class. We explore lazy loading and custom dataset structures.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 6 of 18. Your model is only as good as the data you feed it, but what do you do when your dataset is a terabyte in size and your machine only has sixteen gigabytes of RAM? The answer lies in how you fetch that data. This episode covers Datasets and Data Handling.
Data processing logic can quickly get messy. If you mix your file reading, decoding, and formatting code directly into your model training loop, your project becomes brittle and difficult to maintain. PyTorch encourages you to decouple these concerns. You want your data preparation completely separate from your training algorithm. To achieve this, PyTorch provides a primitive called Dataset, located in the torch dot utils dot data module.
The Dataset class acts as a standardized wrapper around your raw data. To handle your own specific files, you create a custom class that inherits from this primitive. When building a custom dataset, you must implement three specific methods. These are init, len, and getitem.
The init method runs exactly once when you create the dataset object. This is where you configure your directories and paths. A frequent mistake beginners make is trying to load all the actual data into memory right here. Do not do this. If you have fifty thousand high-resolution images, reading them all into memory during initialization will immediately crash your machine. Instead, use init to load a lightweight index. For example, you might read a CSV file that contains image filenames in one column and their corresponding text labels in another. You are just building the map, not holding the territory.
Next is the len method. This simply returns the total number of samples in your dataset. If your CSV file has fifty thousand rows, this method returns the number fifty thousand. The system relies on this to know the absolute boundaries of your available data so it does not request an index that does not exist.
The heavy lifting happens in the getitem method. This function is designed to load and return a single sample at a specific requested index. When the system needs sample number forty-two, it calls getitem and passes in that number. Your code looks up row forty-two in the CSV you loaded earlier. It reads the file path string from that row. Then, and only then, it accesses the disk, finds the file, and decodes the actual image pixels into memory. It grabs the label from that same CSV row, and returns the image and the label together as a tuple.
This technique is called lazy loading. You only consume memory for the specific piece of data you need, at the exact moment you are ready to process it. By isolating this logic inside the getitem method, your training code never needs to know if the data came from a local hard drive, a network stream, or a complex database. It just requests an index and receives a standardized output.
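A hedged sketch of such a dataset, assuming a CSV of filename and label columns plus an image directory; the paths and column layout are illustrative:

import os
import pandas as pd
from torch.utils.data import Dataset
from torchvision.io import read_image

class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir):
        # lightweight index only: filenames and labels, never the pixels
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir

    def __len__(self):
        return len(self.img_labels)        # e.g. fifty thousand rows

    def __getitem__(self, idx):
        # lazy loading: touch the disk only for the requested sample
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = read_image(img_path)       # decode pixels now, not in __init__
        label = self.img_labels.iloc[idx, 1]
        return image, label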
Separating the mechanism of how data is fetched from how it is consumed is the foundation of scalable machine learning code. If you find these episodes helpful and want to support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
7
DataLoaders and Batching
3m 36s
Unleash the full speed of your hardware by wrapping Datasets in DataLoaders. Learn how to batch, shuffle, and multiprocess your data streams.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 7 of 18. GPUs are incredibly fast, but they will sit completely idle if your CPU cannot feed them data quickly enough. Training loops often bottleneck not on the math, but on loading the next set of files from disk. The solution to this is separating data retrieval from model execution using DataLoaders and Batching.
It is easy to mix up the roles of a Dataset and a DataLoader. A Dataset has exactly one job: fetching a single item and its label. It does not know about the broader training process. The DataLoader is a wrapper around that Dataset. It acts as the manager responsible for organizing those individual items into groups, randomizing their order, and using multiple processes to load them efficiently.
During training, models rarely look at one data point at a time. They update their internal weights based on a group of items evaluated simultaneously, known as a minibatch. This approach makes the training process more stable and takes full advantage of the parallel processing power of the hardware. To build a minibatch manually, you would have to write a loop to pull individual samples, stack them into a larger tensor structure, and handle edge cases like the final batch being smaller than the rest. The DataLoader handles all of this automatically.
You initialize a DataLoader by passing it your Dataset object and a parameter called batch size. If you set the batch size to 64, the DataLoader will pull 64 distinct items from the Dataset, consolidate them into a single tensor, and serve them up at once. In your code, the DataLoader behaves as a standard Python iterable. You loop over it. Every time the loop advances, the DataLoader yields the next complete batch of data and the corresponding batch of labels.
You also pass a shuffle parameter. If a neural network processes training data in the exact same sequence every time, it might memorize that specific sequence rather than learning the actual features. Setting shuffle to true tells the DataLoader to randomize the order of the dataset elements at the start of every epoch. Once every batch has been yielded and the dataset is exhausted, the loop ends. The next time you iterate over the DataLoader, it generates a brand new randomized sequence.
This is the part that matters. The DataLoader also accepts a parameter for the number of worker processes. When you use multiple workers, the DataLoader spins up background CPU processes to fetch the data. Picture feeding those 64 images into a neural network. While your GPU is busy calculating gradients for the current batch, the background CPU workers are simultaneously reading, decoding, and stacking the next 64 images. By the time the GPU finishes its current mathematical step, the next batch of data is already waiting in memory. The GPU never starves.
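A sketch of wrapping a dataset in a DataLoader with the parameters just described; FashionMNIST is used here only as a convenient example dataset:

from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

training_data = datasets.FashionMNIST(
    root="data", train=True, download=True, transform=ToTensor()
)

loader = DataLoader(
    training_data,
    batch_size=64,      # 64 samples stacked into one tensor per step
    shuffle=True,       # new random order every epoch
    num_workers=4,      # background CPU processes prefetch the next batch
)

for images, labels in loader:    # the loader yields complete batches
    print(images.shape)          # torch.Size([64, 1, 28, 28])
    break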
A high-performance training loop isolates the slow, unpredictable reality of disk operations from the fast, structured reality of model training. The DataLoader provides that isolation, turning a collection of standalone files into a continuous, parallelized pipeline of minibatches. That is all for this one. Thanks for listening, and keep building!
8
Data Transformations
3m 57s
Discover how to preprocess raw data on the fly before it hits your neural network. We cover torchvision transforms like ToTensor and custom Lambda functions.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 8 of 18. Neural networks only compute numbers, but your real-world data is usually a messy collection of raw image files and text categories. If you write manual loops to convert every single image and label before you begin training, your code will quickly become a brittle, unreadable mess. Data Transformations are the mechanism that resolves this, automatically converting your raw data into a model-ready format exactly when it is needed.
Data rarely arrives ready for machine learning. You have to manipulate it into a specific tensor format before feeding it into your network. PyTorch handles this cleanly by applying transforms on the fly during the data loading process. When you initialize a dataset, especially in libraries like torchvision, you define these modifications using two specific arguments. You use the transform argument exclusively for your input features, like your raw images. You use the target transform argument exclusively for your labels. It is critical to keep these two separate, as they operate independently on different halves of your data.
Let us look at the input features first. Say you have a dataset of raw PIL images. A neural network cannot read a PIL image object directly. To fix this, you pass a built-in torchvision transform called ToTensor into the transform argument. When the dataset loads an image, ToTensor automatically executes two steps. First, it converts the PIL image into a PyTorch float tensor. Second, it scales the pixel intensity values. Raw image pixels generally range from zero to two hundred fifty-five. The ToTensor operation normalizes these values down to a floating-point range between zero and one. The dataset applies this operation strictly as each image is fetched.
That covers inputs, but what about outputs? Your dataset labels might be simple integers representing different categories. For instance, the number three might mean a dog. But to calculate loss during training, your model often requires those labels to be one-hot encoded vectors, rather than single integers. This means you need an array where all values are zero except for the index representing the correct class, which is set to one.
To handle custom logic like this, PyTorch provides Lambda transforms. A Lambda transform wraps any user-defined function so it can be applied during data loading. You write a short function that takes your integer label as its input. Inside that function, you create a tensor of zeros matching the total number of categories in your dataset. Then, you use an internal PyTorch operation to scatter a value of one into the specific index that corresponds to your integer label. You pass this custom function into a Lambda transform, and then assign that to the target transform argument of your dataset.
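Here is roughly how the two arguments look together, following the common torchvision pattern; the dataset choice is just an example:

import torch
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda

ds = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),                       # PIL image -> float tensor scaled to [0, 1]
    target_transform=Lambda(
        lambda y: torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(y), value=1)
    ),                                          # integer label -> one-hot vector
)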
This creates a highly efficient pipeline. A worker thread pulls a single raw record from your disk. The image hits the transform argument, runs through ToTensor, and emerges as a normalized float tensor. Simultaneously, the integer category hits the target transform argument, executes your custom Lambda function, and turns into a one-hot encoded vector. Both pieces are now mathematically formatted and handed directly to your model.
The real power of this architecture is separation of concerns. By attaching these data transformations directly to the dataset definition, your actual training loop stays completely blind to the messy reality of your raw files.
That is all for this one. Thanks for listening, and keep building!
9
Designing Networks with nn.Module
3m 39s
Explore the structural blueprint of every PyTorch neural network. Learn how to subclass nn.Module, define layers in initialization, and route data in the forward pass.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 9 of 18. Every neural network in PyTorch, from a basic image classifier to a massive language model, shares the exact same underlying blueprint. If you do not understand how this blueprint organizes data and logic, you will end up fighting the framework at every step. Designing networks with nn.Module is how you master this structure.
The nn.Module is the base class for all neural network components in PyTorch. It acts as a universal container. When you build a custom model, you create a class that inherits from nn.Module. This inheritance automatically gives your class the ability to track its own parameters, calculate gradients, and integrate smoothly with the rest of the PyTorch ecosystem. It also allows for nested architectures. You can place modules inside other modules, creating a tree of layers that the parent module tracks and manages as a single unit.
Consider setting up the blank skeleton of a brand new image classifier. Building this skeleton requires defining two specific methods: the initialize method, and the forward method. PyTorch enforces a strict separation of concerns between these two stages.
First is the initialize method. Think of this as your inventory. When the class is instantiated, this method runs exactly once. You use it to declare all the individual layers and mathematical operations your model will eventually need. You are not processing any actual data here. You are simply taking structural components off the shelf, configuring their input and output shapes, and saving them as internal variables within your class.
Next is the forward method. This is your active assembly line. The forward method takes an input tensor and dictates exactly how it travels through the inventory you just declared. You write the sequence of operations step by step. You take the input image tensor, pass it to a flattening operation, feed that result into a series of dense layers, and finally return the output predictions. Every custom model must define this forward method to establish the data flow.
This brings up a common trap. Because you explicitly wrote the data flow logic inside a method named forward, the natural instinct is to pass your data by calling model dot forward. Do not do this. You must call the model directly as if it were a regular function, passing your input straight to the instantiated model object. Under the hood, executing the model object directly triggers several critical background hooks that PyTorch needs to manage the network state. Calling the forward method directly bypasses these hooks and will cause unexpected behavior during your training loop.
Once your class is defined and you have created a model object, you have a working network. However, by default, PyTorch creates this object and all of its internal weights in your system's CPU memory. To train at realistic speeds, you need to ship this architecture to an accelerator. You accomplish this by checking if a CUDA GPU, or Apple silicon via the MPS backend, is available, and assigning that hardware target to a device variable. Then, you call the to method on your model, passing in that device variable. This single command immediately moves all the initialized model parameters out of standard memory and into the high-speed memory of your hardware accelerator.
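A compact sketch of that skeleton; the layer sizes follow the common 28-by-28 image example and are placeholders:

import torch
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()                 # inventory: declare layers exactly once
        self.flatten = nn.Flatten()
        self.stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):                  # assembly line: route the data step by step
        x = self.flatten(x)
        return self.stack(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = NeuralNetwork().to(device)         # ship all parameters to the accelerator

x = torch.rand(1, 28, 28, device=device)
logits = model(x)                          # call the object; never call model.forward(x)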
The defining trait of nn.Module is how it forces a clean architectural boundary between the static components your model owns in memory, and the dynamic path your data takes to get processed.
Thanks for listening. Take care, everyone.
10
Linear Layers and Activations
3m 56s
Look inside the neural network. We break down the nn.Linear module and explain why non-linear activation functions like ReLU are mathematically essential.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 10 of 18. If you stack a hundred neural network layers together without one specific mathematical trick, the whole network mathematically collapses into a single straight line. The culprit is linear algebra, and the fix requires understanding Linear Layers and Activations.
A neural network is fundamentally a sequence of mathematical operations on tensors. The most common operation is the linear layer, defined in PyTorch as nn.Linear. This module applies an affine transformation to incoming data. It holds two internal tensors that it learns over time: the weights and the biases. When data passes through, the layer multiplies the input by the weight matrix and adds the bias.
Take a standard grayscale image that is 28 by 28 pixels. Before a linear layer can process this, you flatten the two-dimensional grid into a one-dimensional array of 784 numbers. You pass that array of 784 values into an nn.Linear layer configured to output 512 features. Under the hood, PyTorch creates a weight matrix mapping the 784 inputs to 512 outputs. It multiplies your pixel values by these weights, sums them up, adds a bias term to shift the result, and outputs 512 new numbers. During training, PyTorch continuously updates these weights and biases. They form the actual memory of your model.
You might assume a deep neural network is just a long sequence of these linear layers stacked back to back. This is the part that matters. If you chain multiple nn.Linear operations together without anything in between, the math simplifies. Matrix A multiplied by matrix B is just another matrix, C. Stacking ten linear layers has the exact same mathematical capacity as calculating one single linear layer. Your deep network is reduced to a flat, linear equation, completely incapable of learning complex, real-world patterns.
To stop this mathematical collapse, you introduce a non-linearity immediately after the linear layer. These are called activation functions. The most widely used activation in PyTorch is nn.ReLU, which stands for Rectified Linear Unit. After the linear layer calculates its 512 outputs, you pass that tensor directly into a ReLU function.
The logic of ReLU is brutally simple. It looks at each number in the tensor. If a number is less than zero, ReLU changes it to exactly zero. If a number is zero or positive, ReLU leaves it entirely alone.
That single kink at zero destroys the linearity. It prevents the next linear layer from mathematically merging with the previous one. By forcing negative values to zero, ReLU also creates sparse representations. This means only a specific subset of neurons activate for any given input, making the network highly efficient.
The data flow is consistent. Your flattened image goes into the linear layer, gets transformed by the weights and biases, and then hits the ReLU activation where the negative outputs are stripped away. You can then safely pass this activated tensor into a second linear layer to extract deeper, more abstract patterns.
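The same flow sketched with the 784-to-512 sizes used above; the second layer's width is an arbitrary choice:

import torch
from torch import nn

image = torch.rand(1, 28, 28)
flat = nn.Flatten()(image)             # (1, 784)

layer1 = nn.Linear(in_features=784, out_features=512)
hidden = layer1(flat)                  # affine transform: input times weights, plus bias

activated = nn.ReLU()(hidden)          # negatives clipped to zero, positives left alone

layer2 = nn.Linear(512, 256)
deeper = layer2(activated)             # safe to stack now that the non-linearity is in place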
A linear layer determines how much mathematical importance to give each input, but the activation function gives the network the actual geometry required to learn the unpredictable shapes of real data. Thanks for spending a few minutes with me. Until next time, take it easy.
11
The nn.Sequential Container
3m 26s
Streamline your PyTorch code using the nn.Sequential container. Learn how to snap layers together cleanly and inspect your model's parameters.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 11 of 18. Writing custom forward methods for every neural network quickly becomes tedious when you are just stacking standard layers. You do not always need to manually route data from one function to the next. Sometimes you just need to snap layers together like LEGO bricks. That is exactly what the nn.Sequential container does.
The nn.Sequential container is an ordered pipeline of neural network modules. When you pass data into this container, the data flows through the internal modules in the exact sequence they were added. Think about assembling a standard three-layer Multilayer Perceptron. Normally, you would define your linear layers and activation functions in an initialization method, then write a custom forward method. In that forward method, you would explicitly take the input, pass it to layer one, wrap it in a ReLU activation, pass that result to layer two, apply another ReLU, and feed that to the final layer.
With Sequential, you bypass the forward method entirely. You instantiate the container and pass your modules directly into it as arguments. You provide a Linear module, followed by a ReLU module, a second Linear module, another ReLU, and a final Linear module. PyTorch automatically handles the data routing. The output of the first module instantly becomes the input to the second, proceeding automatically down the chain.
This container is highly efficient, but it has a hard limitation. It is strictly for linear, straight-line data flow. It cannot handle complex architectures that require branching, multiple inputs, or skip connections. If you are building something like a Residual Network where data bypasses certain layers and gets added back in later, Sequential will not work. For any non-linear topology, you still must write a custom module with a dedicated forward method.
Once you have chained your layers together, you often need to inspect what you just built. Every layer in your Sequential container is a subclass of nn.Module, which means PyTorch automatically registers and tracks all the underlying state. To view this state, you use the named_parameters method.
Calling named_parameters on your model provides an iterator over all the weights and biases inside. Each item it yields is a simple pair: the name of the parameter and the parameter tensor itself. Because you used a Sequential container without explicitly naming your layers, PyTorch generates numerical names based on their index. You will see names like zero dot weight for the first linear layer weights, or zero dot bias for its bias terms. The accompanying tensor contains the actual numerical values, the shape of the matrix, and whether it requires gradient calculation.
Looping through named_parameters is the standard way to verify your architecture. You can quickly print the size of every weight matrix to confirm your input and output dimensions align perfectly across the entire chain before you ever start pushing real data through the system.
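A sketch of the container and the inspection loop just described; the layer widths are placeholders:

from torch import nn

model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

for name, param in model.named_parameters():
    # auto-generated names based on position: "0.weight", "0.bias", "2.weight", ...
    print(name, tuple(param.shape), param.requires_grad)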
The true power of the Sequential container combined with parameter tracking is that PyTorch absorbs the bookkeeping for state management and data routing, leaving you to focus entirely on the shape of your network.
That is your lot for this one. Catch you next time!
12
Understanding Loss Functions
3m 20s
Before an AI can learn, it must measure its mistakes. We dive into PyTorch loss functions, comparing CrossEntropyLoss for classification and MSELoss for regression.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 12 of 18. To teach a neural network to be right, you first have to rigorously measure exactly how wrong it is. If you cannot quantify failure, your model cannot learn from it. That brings us to Understanding Loss Functions.
When an untrained network processes data, its output is essentially a guess. A loss function evaluates that guess. It measures the degree of dissimilarity between the result the model produced and the absolute truth of the target value. The output of a loss function is always a single scalar number. Your entire training process exists to push that single number as close to zero as possible.
Because different machine learning tasks have different definitions of being wrong, PyTorch provides several loss functions. If you are building a regression model to predict a continuous value, like the temperature tomorrow, you measure the distance between your guess and the real temperature. For this, you use Mean Squared Error, which in PyTorch is called nn.MSELoss.
But classification is different. Suppose you have a model categorizing images of clothing into ten fashion categories. The model looks at an image of a coat and outputs ten raw scores, one for each possible category. These raw, unnormalized scores are called logits. The true answer is just a single integer, representing the correct class. You cannot just subtract a class index from a raw score. Instead, you need a function that penalizes the model for giving low scores to the correct class and high scores to the wrong classes. For classification, the standard tool is nn.CrossEntropyLoss.
You initialize your loss function, pass it the ten raw logits from your model alongside the correct integer label, and it returns your scalar penalty.
This is the part that matters. There is a massive trap here for developers. In many machine learning textbooks, a classification network ends with a softmax layer. Softmax forces raw logits into a neat probability distribution where all the scores add up to exactly one. Because of this, developers often manually add a softmax operation at the very end of their PyTorch model.
If you are using nn.CrossEntropyLoss, doing that is a mistake. In PyTorch, nn.CrossEntropyLoss automatically applies a LogSoftmax function internally before calculating the negative log likelihood. It is built to accept raw, unnormalized logits directly. If your model outputs probabilities because you already applied softmax, passing them into nn.CrossEntropyLoss means you are applying the math twice. This compresses your gradients, drastically slows down training, and ruins your model's ability to learn effectively.
The rule to remember is that your neural network should just output raw numbers. Keep your model outputs raw, hand them straight to nn.CrossEntropyLoss, and let PyTorch do the heavy lifting of turning those logits into a meaningful penalty.
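A small sketch of the correct wiring; the logits here stand in for the raw output of a ten-class model:

import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()

logits = torch.randn(1, 10)          # raw, unnormalized scores straight from the model
target = torch.tensor([3])           # the correct class as a plain integer index

loss = loss_fn(logits, target)       # LogSoftmax plus negative log likelihood, applied internally
# Do not apply nn.Softmax to the logits first: that would apply the math twice and squash the gradients.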
Thanks for listening, happy coding everyone!
13
Optimizers and Gradient Descent
3m 33s
Explore how the optimizer updates model weights to reduce error. Learn the crucial three-step dance of zero_grad(), backward(), and step().
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 13 of 18. The most common bug in PyTorch training is not bad data or a wrong architecture. It is forgetting to clear out your old math, causing your network to spiral out of control. Today we cover Optimizers and Gradient Descent, which handles exactly how your model learns from its mistakes.
Your model makes a prediction, and you calculate the loss to see how far off it was. Now you need to adjust the internal weights of the neural network to make the next prediction slightly more accurate. This process of nudging the parameters to minimize the loss is called optimization. The optimizer is the specific algorithm that governs how those weights change.
To configure an optimizer, you must give it two things. First, you pass it an iterable containing the model parameters you want it to adjust. Second, you provide a learning rate. The learning rate is a fundamental hyperparameter that controls the magnitude of the changes applied to the weights. If the learning rate is too small, the optimizer takes microscopic steps, making training painfully slow. If the learning rate is too large, the optimizer overshoots the optimal values, leading to wild, unpredictable behavior.
A standard algorithm for this task is Stochastic Gradient Descent, or SGD. It evaluates the slope of your loss function and takes a step in the opposite direction to descend toward the lowest possible error. Once you initialize your SGD optimizer with your parameters and learning rate, the actual updates happen in a strict, three-step sequence.
Step one is clearing the slate. You call the zero grad command on the optimizer. This is where that common bug lives. PyTorch accumulates gradients by default. When it calculates new gradients, it does not overwrite the old ones; it simply adds the new numbers to the existing totals. If you skip this zero grad step, your current batch math is corrupted by the leftover numbers from the previous batch. Always zero the gradients before doing anything else.
Step two is calculating the new gradients. You take your calculated loss value and call the backward command on it. This triggers backpropagation. PyTorch travels backward through your network architecture. It calculates the derivative of the loss with respect to every single parameter. Essentially, it figures out exactly how much each individual weight contributed to the overall error. These calculated gradients are stored directly inside the parameter objects.
Step three is applying the fix. You call the step command on the optimizer. The optimizer looks at the gradients stored in each parameter during the backward pass. It multiplies those gradients by the learning rate to figure out the exact size of the adjustment, and then updates the actual weights in memory.
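The three-step sequence in code, sketched around a stand-in one-layer model and a fake minibatch:

import torch
from torch import nn

model = nn.Linear(20, 5)                        # stand-in model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

features = torch.randn(64, 20)                  # one fake batch of inputs
targets = torch.randint(0, 5, (64,))            # matching integer labels

optimizer.zero_grad()                           # 1. clear leftover gradients from the last batch
loss = loss_fn(model(features), targets)
loss.backward()                                 # 2. compute fresh gradients into each parameter's .grad
optimizer.step()                                # 3. apply the update: for SGD, weight minus lr times grad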
This cycle repeats for every batch. Zero the gradients, calculate the backward loss, step the optimizer. The critical detail to remember is that the optimizer only updates the parameters it was explicitly given during setup. If you need to freeze a layer in your network, you simply exclude its parameters when you initialize the optimizer, and those weights will remain permanently fixed. By the way, if you want to support the show, you can search for DevStoriesEU on Patreon. Thanks for listening. Take care, everyone.
15
Validation and Inference
3m 31s
Evaluate your model objectively. Learn how to switch your network to evaluation mode, freeze gradients, and extract accurate predictions on unseen data.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 15 of 18. Your model might perform perfectly on its training data, but the only true test of an AI is how it handles the unknown. If you rely solely on the feedback your optimizer sees, you might just be building a very expensive memory bank. To see if your model actually generalizes to the real world, you need validation and inference.
During the training phase, you look at the training loss. That number exists to guide the internal optimizer. It forces the model to adjust its weights until the mathematical error shrinks. But a low training loss does not mean you have a good model. It simply means the model is very good at answering questions it has already seen. Validation accuracy is the entirely separate metric that tells humans if the model can make correct predictions on completely new data. To get this metric, you must run a validation loop against a dedicated test dataset.
Before feeding a single piece of test data into the network, you must change the state of the model. You do this by calling the eval method on your model object. Calling eval switches the network into evaluation mode. Certain internal layers behave differently during training than they do during inference. Calling eval forces them to lock their behavior so your predictions remain consistent. If you skip this step, your test results will be fundamentally unreliable.
That covers the model state. Next, you must control the engine itself by turning off gradient tracking. You do this by wrapping your validation code inside a no grad context manager. During training, PyTorch constantly builds a computational graph in memory, storing the history of every operation so it can calculate gradients later. In a validation loop, you are completely done with training. You do not want to update weights. The no grad block tells PyTorch to stop tracking history. This is the part that matters. Disabling tracking prevents accidental updates to your model, but it also frees up a massive amount of memory and drastically speeds up the computation.
Inside that no grad block, the actual logic is straightforward. You iterate over your test dataset in batches. For each batch, you pass the input data through the model. The model computes the forward pass and returns its raw predictions. If you are doing classification, the model does not output a neat text label. Instead, it outputs a list of numerical scores for every single category it knows.
To find out which category the model actually picked, you need the argmax function. Argmax looks at the list of raw scores and finds the highest number. It then returns the index position of that highest score. That index is your chosen class prediction.
Once you have the model predictions, you compare them directly to the true labels provided by the test dataset. You count exactly how many predictions match the true labels. You keep a running total of these correct matches across all the batches. When the loop finishes, you divide the total number of correct predictions by the total number of items in the test dataset. The result is your final accuracy percentage.
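The whole loop condensed into a sketch; model and test_loader are assumed placeholders for a trained network and a DataLoader over the test set:

import torch

model.eval()                                    # lock layers into inference behavior

correct, total = 0, 0
with torch.no_grad():                           # no graph building, no accidental updates
    for images, labels in test_loader:
        scores = model(images)                  # raw scores, one per class
        predictions = scores.argmax(dim=1)      # index of the highest score per sample
        correct += (predictions == labels).sum().item()
        total += labels.size(0)

accuracy = correct / total
print(f"Validation accuracy: {accuracy:.2%}")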
The training loop forces your model to fit the historical data, but the strict constraints of the validation loop prove whether that model is actually useful. Thanks for listening. Take care, everyone.
16
Saving and Loading Models
3m 12s
Don't lose your hard-earned progress! We discuss the safest ways to serialize your model weights using state_dict and load them back securely.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 16 of 18. Training an image classifier for fifty epochs can take weeks and burn through thousands of dollars in compute. Yet the moment your Python script finishes executing, all those hard-earned patterns completely vanish from memory. To protect that investment, you need a way to persist your progress to disk. That is exactly what saving and loading models resolves.
Inside every PyTorch model is an internal dictionary called the state dict. This dictionary maps every layer of your network to its corresponding parameter tensors. It holds the actual weights and biases your model learned during training. The structure of the model is just code, but the state dict is the intelligence. To persist your model, you extract this dictionary and write it to a file.
You do this using a function called torch dot save. You pass it two things. First, the state dict of your model. Second, the file path where you want to store it, which traditionally uses a dot pth extension. In one line of code, your fifty-epoch training run is safely stored on your hard drive as a single file containing nothing but raw tensor data.
You might see examples online that skip the state dict entirely and just pass the model object directly to torch dot save. Do not do this. Saving the whole model relies heavily on Python pickle serialization. It binds the saved file to the exact directory structure and class definitions present when the file was created. If you later refactor your code or move a file, the model will fail to load. Sticking to the state dict is much safer and significantly more robust because you are only saving the data, not the code.
When it is time to deploy your classifier for production inference, you have to reverse the process. Because you only saved the weights, PyTorch needs to know what the structure of the network looks like. You start by instantiating a completely blank version of your model class. This gives you the architectural shell. Next, you call torch dot load and give it your file path to read the dictionary back into memory.
When you call torch dot load, there is a crucial modern best practice you must follow. Always pass the argument weights only set to true. Python pickle files can contain arbitrary executable code. If you download a pre-trained model from the internet and load it blindly, it could run malicious scripts on your machine. Setting weights only to true restricts the loader to only deserialize standard PyTorch tensors, keeping your system secure.
Finally, with your blank model ready and your secure dictionary loaded, you call load state dict on the model and pass in the dictionary. PyTorch maps the loaded weights to the corresponding layers in the blank shell. Your model is now fully restored and ready to make predictions. Never trust your training investment to a fragile serialized object; always separate the architecture in your code from the learned parameters on your disk.
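The round trip sketched end to end; model and NeuralNetwork stand in for your trained object and whatever model class you defined:

import torch

# Saving: only the learned tensors, never the pickled model object
torch.save(model.state_dict(), "classifier_weights.pth")

# Loading: rebuild the architecture from code, then pour the weights back in
restored = NeuralNetwork()                                          # blank shell, same class as before
state = torch.load("classifier_weights.pth", weights_only=True)     # tensors only, no arbitrary code execution
restored.load_state_dict(state)
restored.eval()                                                     # ready for inference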
Thanks for hanging out. Hope you picked up something new.
17
Supercharging Speed with torch.compile
3m 29s
Unlock the defining feature of PyTorch 2.0. Learn how the torch.compile decorator JIT-compiles your Python code into optimized kernels for massive speedups.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 17 of 18. You spend weeks tweaking a model architecture for a five percent speedup. But often, the real bottleneck is not your math. It is Python itself, and the constant, inefficient shuttling of data between the GPU's memory and its compute units. Fixing this does not require rewriting your codebase. Today, we are looking at supercharging speed with torch dot compile.
Introduced in PyTorch 2.0, this feature shifts your code from standard execution to a highly optimized workflow. A common assumption is that speeding up PyTorch for production means writing custom C plus plus kernels or fundamentally changing your architecture. That is not the case. You change nothing inside your model. You just wrap your existing model in a single function call.
To understand why this creates a massive leap in performance, you have to look at how PyTorch normally runs. Standard PyTorch operates in eager mode. It executes exactly what you tell it to, exactly when you ask, one operation at a time. If your code tells the GPU to add two tensors, multiply the result by another tensor, and apply an activation function, eager mode treats those as three isolated events.
For each step, the GPU reads data from its main memory, does the math, and writes the intermediate result back. GPU memory bandwidth is limited. That constant shuttling of data takes far longer than the actual calculations.
When you pass your model through the compile function, PyTorch changes tactics. It uses an internal tool called TorchDynamo to capture your operations into a computation graph before executing them. By looking at the broader sequence, it finds inefficiencies. Then, it uses a backend compiler to generate a new, heavily optimized version of your operations.
The primary technique it uses is kernel fusion. Instead of reading and writing to memory three separate times, the compiled code merges those steps. The GPU reads the data once, keeps it in its fastest internal registers, performs the addition, multiplication, and activation back-to-back, and then writes the final answer out just once. The Python overhead disappears, and the memory bottleneck is bypassed.
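To make that concrete, here is a sketch of that add-multiply-activate sequence marked for compilation with the decorator form. Whether the backend actually fuses it into a single kernel depends on your hardware and backend, so treat this purely as an illustration:

```python
import torch

@torch.compile             # capture the whole function so the backend can fuse the elementwise ops
def add_mul_act(a, b, c):
    x = a + b               # addition
    x = x * c               # multiplication
    return torch.relu(x)    # activation; ideally emitted as one fused kernel

device = "cuda" if torch.cuda.is_available() else "cpu"
a, b, c = (torch.randn(1024, 1024, device=device) for _ in range(3))
out = add_mul_act(a, b, c)  # first call triggers compilation; later calls reuse the kernels
```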
The implementation is straightforward. You instantiate your model as usual. Then, you call torch dot compile, pass in your model, and assign the output to a new variable. You route your data through this compiled version.
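Sketched out with a placeholder network, that wrapping looks like this:

```python
import torch
import torch.nn as nn

# Any nn.Module works; this small Sequential model is just a stand-in.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
compiled_model = torch.compile(model)   # the model itself is unchanged
x = torch.randn(64, 784)
logits = compiled_model(x)              # route data through the compiled version
```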
If TorchDynamo encounters an obscure Python construct it cannot safely optimize, it does not break your program. It simply leaves that small section in standard eager mode and compiles the rest.
When you benchmark this, pay attention to the first run. The initial pass takes significantly longer because the actual compilation happens exactly when the first data arrives. But on the second pass, inference time drops drastically.
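A rough way to observe that warm-up cost, reusing compiled_model and x from the sketch above; the exact numbers are illustrative and vary by hardware:

```python
import time
import torch

def timed_call(fn, batch):
    if torch.cuda.is_available():
        torch.cuda.synchronize()        # include pending GPU work in the measurement
    start = time.perf_counter()
    fn(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start

print("first pass :", timed_call(compiled_model, x))   # includes compilation time
print("second pass:", timed_call(compiled_model, x))   # optimized kernels only
```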
You no longer have to choose between the flexibility of eager mode during development and the raw speed of a compiled backend in production.
That is all for this one. Thanks for listening, and keep building!
18
Compilers and Graph Breaks
3m 50s
Dive under the hood of the PyTorch compiler. We explore graph breaks, dynamic control flow, and why torch.compile succeeds where legacy systems failed.
Hi, this is Alex from DEV STORIES DOT EU. PyTorch Fundamentals, episode 18 of 18. Older AI compilers demanded perfectly predictable execution paths, failing spectacularly if you included complex, dynamic Python constructs in your model. You would tweak one conditional statement, and the entire compilation process would crash. PyTorch two point zero handles that exact same arbitrary code without complaining. The engine behind this flexibility relies on how the compiler handles graph breaks.
If you worked with the legacy compilation tool, TorchScript, you know it required rigid code structures. TorchScript relied on strict, static typing and predictable execution. If your model featured highly dynamic control flow, relied on standard Python dictionaries, or called out to external non-tensor libraries, TorchScript would often reject it. Engineers frequently had to rewrite significant portions of their model architecture just to satisfy the compiler.
PyTorch two point zero approaches this entirely differently. Instead of demanding static code upfront, the native compiler analyzes your Python execution dynamically. It captures all the mathematical operations it can safely optimize and packages them into a highly efficient computational graph.
Inevitably, the compiler will encounter code it cannot easily map to an optimized graph structure. When it hits this unpredictable logic, it triggers a graph break. A graph break is not an error, and it is not a crash. It is simply a fallback mechanism. It means the compiler gracefully hands control back to standard PyTorch eager execution for that specific segment of code.
Consider a function where you run a series of heavy matrix multiplications, followed by a Python if-statement that checks the mean value of a tensor to decide the next operation. That condition is data-dependent. The execution path is completely unknown until the exact moment the tensor values are calculated at runtime.
When you trace this function with the compiler, it analyzes the flow. It takes the matrix multiplications occurring before the condition and compiles them into a fast, optimized sub-graph. Then, it hits the tricky if-statement. Because it cannot predict the outcome, it creates a graph break. The compiler lets standard Python execute the condition in eager mode. Once the condition is evaluated and the path is chosen, the compiler resumes control, taking the remaining operations and compiling them into a second optimized sub-graph.
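Here is a sketch of that shape of function; the matrix sizes and the doubling branch are arbitrary stand-ins:

```python
import torch

def branching(x, w1, w2):
    h = x @ w1              # heavy matmuls: compiled into the first sub-graph
    h = h @ w2
    if h.mean() > 0:        # data-dependent condition: triggers a graph break,
        h = h * 2           # so the branch is decided in ordinary eager Python
    return torch.relu(h)    # work after the branch becomes a second sub-graph

compiled_branching = torch.compile(branching)
x, w1, w2 = torch.randn(32, 64), torch.randn(64, 128), torch.randn(128, 128)
out = compiled_branching(x, w1, w2)
```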
The system automatically segments your code. You get compiled islands of fast math separated by standard Python bridges. Your model continues to run seamlessly.
This is the part that matters. You will likely see performance logs pointing out graph breaks in your architecture. While you generally want to minimize them to squeeze out maximum execution speed, they exist purely as a safety net. They ensure your code always produces the correct mathematical result, even if every single line cannot be fused into a single kernel.
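If you want to surface those breaks yourself, one option is the explain helper, shown here reusing the branching function from the previous sketch. Note that torch._dynamo is a private namespace, so the exact fields may shift between releases; setting the TORCH_LOGS environment variable to graph_breaks is another way to see them logged.

```python
import torch
import torch._dynamo as dynamo

# Reuses `branching`, x, w1, and w2 from the previous sketch.
report = dynamo.explain(branching)(x, w1, w2)
print(report.graph_count)         # how many compiled sub-graphs were produced
print(report.graph_break_count)   # how many times the tracer fell back to eager mode
print(report.break_reasons)       # why each break happened
```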
The primary design shift in modern PyTorch compilation is prioritizing seamless execution over total optimization, ensuring the engine adapts to your arbitrary logic rather than forcing your logic to adapt to the engine.
This concludes our PyTorch series. I highly encourage you to explore the official documentation, try these compilation tools hands-on, and visit DEV STORIES DOT EU to suggest topics for future series. I would like to take a moment to thank you for listening — it helps us a lot. Have a great one!