Season 35 · 8 Episodes · 28 min · 2026

Mastering Modern Pandas

v3.0 — 2026 Edition. Master the core abstractions and modern capabilities of pandas 3.0 in 2026. Learn about data alignment, Copy-on-Write, PyArrow integration, time series mastery, and strategies for scaling to out-of-core datasets.

Data Science · Data Analysis · Python Core

Episodes

1

The Core Abstraction: DataFrames and Label Alignment

3m 24s

We explore the foundational mental models of pandas: the Series and the DataFrame. You will learn why intrinsic label alignment is the killer feature that prevents row-mismatch disasters.

Hi, this is Alex from DEV STORIES DOT EU. Mastering Modern Pandas, episode 1 of 8. You pull two columns of financial data, add them together, and the final sum makes no sense. The rows shifted, and you just added yesterday's closing price to today's volume. That happens when you treat data like a dumb grid instead of relying on The Core Abstraction: DataFrames and Label Alignment. People often look at pandas and think it is just a programmatic spreadsheet or a standard two-dimensional NumPy array. It is not. A plain NumPy array relies on strict positional indexing. If you add two arrays, the item at position zero adds to the item at position zero. If your data is missing a row, everything shifts, and your calculations are silently corrupted. Pandas solves this exact problem by decoupling the data from its physical position in memory. It uses intrinsic data alignment, meaning it aligns data by labels, never by position. To understand this, look at the foundation of the library, the Series. A Series is a one-dimensional array that can hold any data type. Unlike a standard list, every item in a Series is securely attached to a label. These labels collectively make up what pandas calls the index. You can use integers as labels, but more often, you use strings or timestamps. A DataFrame is simply a collection of these Series objects acting as columns, all sharing the same index, sitting side by side. Here is the key insight. When you perform an operation in pandas, the index dictates the behavior. Say you have two Series of daily stock returns. Series A has data for Monday, Tuesday, Wednesday, and Thursday. Series B has data for Wednesday, Thursday, and Friday. If you command pandas to add these two Series together, it ignores their physical order. It looks at the date labels. It finds Wednesday in Series A and adds it to Wednesday in Series B, even though Wednesday is the third item in the first dataset and the very first item in the second dataset. This brings up the problem of the non-overlapping days. Monday exists in the first Series but not the second. Friday exists in the second but not the first. Pandas does not crash, and it certainly does not guess a value. Instead, it creates a new index that is the union of all labels from both inputs. For any label that does not exist in both places, pandas inserts a NaN, meaning Not a Number. The operation completes successfully, and you can immediately see where your data is incomplete. You never have to write a loop to check if the dates match up. The alignment is automatic, built directly into the data structure itself. This same logic scales directly to DataFrames. A DataFrame aligns on both dimensions simultaneously. When you operate on two DataFrames, pandas matches row label to row label, and column name to column name. It aligns the entire structure perfectly before executing a single mathematical operation. Anything that does not overlap across both the rows and the columns gets marked as missing data. The true power of pandas is not the math it performs, but the fact that the label always travels with the data, making positional mismatches structurally impossible. If you want to support the show, you can find us by searching for DevStoriesEU on Patreon. Thanks for listening, happy coding everyone!
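A minimal sketch of the alignment behavior described in this episode; the dates and return values are made up for illustration:

import pandas as pd

# Daily returns keyed by date labels rather than by position.
a = pd.Series([0.5, -0.2, 0.1, 0.3],
              index=pd.to_datetime(["2026-01-05", "2026-01-06",
                                    "2026-01-07", "2026-01-08"]))  # Mon-Thu
b = pd.Series([0.4, -0.1, 0.2],
              index=pd.to_datetime(["2026-01-07", "2026-01-08",
                                    "2026-01-09"]))                # Wed-Fri

# Addition aligns on the index: Wednesday meets Wednesday regardless of
# physical position, and the result's index is the union of both inputs.
# Monday and Friday exist in only one Series, so they come back as NaN.
print(a + b)

The same alignment happens on both axes when two DataFrames are combined.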
2

The Copy-on-Write Revolution

3m 43s

Discover the most significant architectural change in modern pandas: Copy-on-Write. You will learn how CoW eliminates unpredictable mutations and optimizes memory usage.

Hi, this is Alex from DEV STORIES DOT EU. Mastering Modern Pandas, episode 2 of 8. You extract a slice of your data, update a single value, and suddenly your original dataset is corrupted. Or worse, you get an unpredictable warning and have no idea if your update actually applied to the slice or the source. Version 3.0 finally resolves this chaos by making Copy-on-Write the default behavior. Copy-on-Write fundamentally rewrites how pandas manages memory. In earlier versions, pandas frequently made defensive copies to prevent you from accidentally modifying the original dataset. If you filtered a dataset, pandas copied it. If you dropped a column, pandas copied the rest. This wasted massive amounts of memory and CPU cycles. When it did not copy, it returned a view, which meant modifying the new object silently altered the parent. Under Copy-on-Write, any DataFrame or Series derived from another shares the exact same underlying memory. This means taking a subset, dropping a column, or resetting an index is nearly instantaneous. No data is duplicated upfront. Here is the key insight. Do not confuse a legacy view with a Copy-on-Write lazy copy. In a traditional view, both the parent and the child point to the same memory, and a change to one changes the other. Copy-on-Write is different. The shared memory is temporary and strictly protected. The sharing only lasts until you try to change something. Consider a concrete scenario. You have a DataFrame containing user profiles. You select the age column and assign it to a new variable called age subset. At this exact moment, age subset takes up zero extra memory. It points directly to the original user profiles DataFrame. Next, you update the first value in your age subset to ninety-nine. This is where the write part of Copy-on-Write happens. Pandas detects the modification. Before executing your update, it checks if any other object is sharing this specific block of data. Because the parent DataFrame is still using it, pandas instantly allocates new memory, copies the data over, and then writes the value ninety-nine into the new location. Your parent DataFrame remains safely unchanged. Mutating a subset never mutates the parent. This mechanism cascades perfectly across chained operations. When you string together operations like dropping empty rows, replacing values, and renaming columns, older versions of pandas created a physical copy at every single step. Under Copy-on-Write, those intermediate steps simply share memory. A physical copy is only triggered when a step actually mutates the underlying data arrays. If an operation just rearranges references, no copy occurs. This completely eliminates accidental data corruption. You no longer have to guess whether an operation returned a view or a copy, and you will never see the Setting With Copy warning again. The rule is absolute: a parent and a child object will never modify each other. By deferring copies until the exact millisecond a modification occurs, pandas gives you the fast performance and low memory footprint of views, paired with the strict safety of deep copies. That is all for this one. Thanks for listening, and keep building!
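A small illustration of the behavior described above, assuming pandas 3.0 (or an earlier version with Copy-on-Write enabled); the column names are hypothetical:

import pandas as pd

profiles = pd.DataFrame({"name": ["Ada", "Bo", "Cy"], "age": [34, 28, 41]})

# Selecting a column yields a new object that initially shares memory
# with the parent; nothing is copied at this point.
age_subset = profiles["age"]

# Writing to the child is what triggers the copy: pandas gives age_subset
# its own buffer first, so the parent DataFrame is never touched.
age_subset.iloc[0] = 99

print(age_subset.iloc[0])       # 99
print(profiles.loc[0, "age"])   # still 34 under Copy-on-Write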
3

The PyArrow Engine Room

3m 11s

Pandas isn't just powered by NumPy anymore. You will learn how to leverage the PyArrow backend for native missing data support and incredible memory savings on strings.

Hi, this is Alex from DEV STORIES DOT EU. Mastering Modern Pandas, episode 3 of 8. You load a dataset with a few million rows, and suddenly your memory usage spikes through the roof. The numbers are fine, but the text columns and the missing values are quietly destroying your RAM. The fix is a fundamental shift in how pandas stores data, and that is exactly what this episode is about: The PyArrow Engine Room. If you have heard of Apache Arrow, you might think it is strictly for big distributed systems like Spark or Hadoop. It is not. Arrow is now a first-class, native memory format and execution engine right inside pandas. For years, pandas relied entirely on NumPy. NumPy is incredibly fast for dense numerical calculations, but it has a massive blind spot regarding missing data. NumPy does not have a native concept of a missing integer or a missing boolean. If you have an integer column and a single value is missing, pandas has historically been forced to convert the entire column to floating-point numbers just so it can use the Not a Number marker. This changes your data types, ruins exact matches, and consumes more memory. PyArrow solves this using a validity bitmap. Instead of changing the data type to accommodate a missing value, Arrow keeps your integers as integers. It adds a hidden, highly compressed array of ones and zeros alongside your data. A one means the value is valid. A zero means it is missing. Your data type stays intact, and tracking the missing values costs almost zero memory. Here is the key insight. The memory savings are even more extreme when dealing with text. Traditionally, pandas stores strings using the NumPy object data type. This means the column does not actually hold your text. It holds memory pointers. Each row points to a standard Python string object scattered somewhere else in your computer memory. If you have ten million rows of text, you have ten million pointers and ten million separate string objects. The memory overhead is staggering, and iterating through them is terribly slow. PyArrow changes the architecture entirely. When you set your pandas columns to use a PyArrow string data type, the text is stored in a single, contiguous block of memory. The column just keeps track of the byte offsets. It records exactly where each word starts and stops in that large continuous block. Picture a high-cardinality dataset. You have a column of user agent strings or unique transaction IDs. Many of the rows are empty. If you read this into pandas the traditional way, it defaults to a NumPy object array. Now, tell pandas to use the PyArrow engine instead by explicitly assigning a PyArrow-backed string data type during your read step. Instantly, the memory footprint drops, often by fifty percent or more. The improvement goes beyond RAM limits. Because the data is now packed tightly together in a structure built for analytics, string matching operations accelerate. If you run a regular expression search across that column, the Arrow engine processes the raw bytes directly at the system level. It completely bypasses the slow Python object overhead. You get integer columns that actually stay integers when data is missing, and string operations that do not choke your hardware. If you process text or messy data, relying on the NumPy object array is obsolete. Using PyArrow-backed data types is the single fastest way to make a heavy pandas pipeline instantly lighter. Thanks for listening, happy coding everyone!
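A rough sketch of the dtype comparison from this episode, assuming the pyarrow package is installed; the strings are placeholders:

import pandas as pd

ua = ["Mozilla/5.0 (X11; Linux)", None, "curl/8.5.0"] * 200_000

object_backed = pd.Series(ua, dtype=object)             # pointers to scattered Python str objects
arrow_backed = pd.Series(ua, dtype="string[pyarrow]")   # one contiguous buffer plus offsets

print(object_backed.memory_usage(deep=True))
print(arrow_backed.memory_usage(deep=True))

# Integers with a missing value stay integers under an Arrow-backed dtype
# instead of being upcast to float64 to make room for NaN.
ints = pd.array([1, 2, None, 4], dtype="int64[pyarrow]")
print(ints.dtype)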
4

Modern Data Ingestion

3m 23s

We tackle efficient I/O strategies for large datasets. You will learn how to ingest massive files selectively and directly into highly optimized memory structures.

Hi, this is Alex from DEV STORIES DOT EU. Mastering Modern Pandas, episode 4 of 8. You have a multi-gigabyte Parquet file with a hundred columns. You only need four of them to run your metrics. If your first move is to load the entire file into a dataframe and then filter down the columns, you are wasting massive amounts of RAM and CPU cycles before your analysis even begins. The fix lies in mastering Modern Data Ingestion. The primary bottleneck in data ingestion is disk input and output. Pandas read functions, like read underscore csv and read underscore parquet, scan data from disk into memory. To minimize this transfer, you use a column-selection argument: usecols for read underscore csv, and columns for read underscore parquet. You pass a list containing the exact column names you want. The parser reads only those specific columns from the file. The performance gain you see depends heavily on your file format. CSV files store data row by row. When you use usecols with a CSV, the parser still has to scan through the entire text file row by row, but it immediately discards the unneeded columns before allocating memory for the dataframe. Parquet files, however, store data column by column. For a hundred-column Parquet file where you only need four metric columns, passing the columns argument means the parser completely ignores the file blocks holding the other ninety-six columns. It reads only the bytes it absolutely needs from the disk. This drastically reduces both your read time and your memory footprint. Restricting columns is only the first step. The next optimization happens in how pandas stores those columns in memory. Historically, pandas relied entirely on NumPy arrays. NumPy is excellent for dense numerical computation, but it struggles with text data and missing values. It stores strings as scattered Python objects in memory, and it forces integer columns to become floats just to represent missing data. To solve this, pandas introduced the dtype backend argument. When you set this argument to the string pyarrow, pandas uses Apache Arrow to back your data instead. Arrow stores strings in highly efficient, contiguous memory blocks and uses a separate bitmask to track missing values, keeping your integers intact. Here is the key insight. You might think pandas reads the data into NumPy arrays first and then converts them to Arrow. That is not what happens. When you specify the PyArrow backend during the read function, pandas bypasses NumPy entirely during the parse phase. The data flows straight from the file on disk into PyArrow arrays in memory. This avoids the severe performance penalty of intermediate memory allocations. Let us look at the full pipeline. You call read underscore parquet. First, you pass your file path. Second, you pass the columns argument with a list of your four metric columns. Third, you set the dtype backend argument to pyarrow. The parser jumps directly to the four columns on disk, extracts them, and streams them straight into Arrow-backed memory. You end up with a lean, lightning-fast dataframe holding exactly what you need, backed by modern data types. Filtering columns at the IO layer instead of the application layer is the single most effective way to prevent out-of-memory crashes in pandas. That is all for this one. Thanks for listening, and keep building!
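The full pipeline from the episode as a sketch; the file paths and column names are hypothetical. Note that read_csv takes usecols while read_parquet takes columns:

import pandas as pd

wanted = ["user_id", "latency_ms", "status", "ts"]

# Parquet is columnar: only the blocks for these four columns are read,
# and the data is parsed straight into Arrow-backed dtypes.
metrics = pd.read_parquet("events.parquet", columns=wanted, dtype_backend="pyarrow")

# The CSV equivalent still scans every row of text, but discards the
# unneeded fields before building the DataFrame.
metrics_csv = pd.read_csv("events.csv", usecols=wanted, dtype_backend="pyarrow")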
5

Relational Algebra: Merge and Join

3m 44s

We explore how to unify disparate datasets using relational algebra. You will learn to execute optimized SQL-style joins directly in pandas.

Hi, this is Alex from DEV STORIES DOT EU. Mastering Modern Pandas, episode 5 of 8. You have a list of ten million transactions, and each one only has a customer ID. You need the actual customer names attached to those transactions. If you are writing a Python loop or using a dictionary mapping to match those IDs to names, you are wasting CPU cycles and memory. Pandas has an optimized engine specifically built for SQL-style operations in a single line. Today we are talking about Relational Algebra: Merge and Join. Before going further, let us clear up a common mix-up. People often confuse merging with concatenating. Concatenation is just physically stacking arrays on top of each other or side by side. Merge is entirely different. Merge is for relational database joins. It aligns rows from two different tables based on the values of shared keys. The core function for this is pandas dot merge. It takes two DataFrames, which we call the left table and the right table. Think back to our transaction scenario. Your left DataFrame is a massive fact table containing millions of purchases. Your right DataFrame is a smaller dimension table containing customer details, like names and email addresses. Both tables share a column called customer ID. This specific setup is called a many-to-one join. You have many transactions belonging to one single customer. When you merge them using the customer ID key, pandas takes the single customer record from the right table and broadcasts it across all the matching transactions in the left table. Here is the key insight. The most critical parameter you control is the how argument. This dictates which keys survive the merge and make it into the final result. By default, pandas uses an inner join. If you do not specify the how argument, the result only keeps rows where the customer ID exists in both tables. If a customer made a transaction but their record got deleted from the customer database, that transaction vanishes entirely from your merged result. To prevent losing data from your primary table, you use a left join. By passing left to the how argument, pandas keeps every single row from your left table, which is your massive transaction list. If a transaction has a customer ID that does not exist in the right table, pandas still keeps the transaction row but fills the missing customer details with Not a Number values. This is exactly the logic you want when attaching dimension details to a primary fact table. The exact opposite is a right join. Passing right keeps all rows from the customer dimension table, regardless of whether they have any matching transactions in the left table. You end up with a list of all customers, and those who have not bought anything simply show missing values for the transaction data. Finally, there is the outer join. Pass outer to the how argument, and pandas keeps everything. It takes the union of keys from both frames. Every transaction and every customer makes it into the final dataset, with missing values filling in the gaps wherever a perfect match is not found. The default inner join drops unmatched data silently, so unless you explicitly want to filter out rows, you should always specify a left join when attaching lookup tables to your primary dataset. That is all for this one. Thanks for listening, and keep building!
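A toy version of the transaction scenario; the tables and keys are made up:

import pandas as pd

transactions = pd.DataFrame({
    "txn_id": [1, 2, 3, 4],
    "customer_id": [101, 102, 101, 999],   # 999 has no customer record
    "amount": [25.0, 40.0, 12.5, 99.0],
})
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Ada", "Bo", "Cy"],
})

# Default inner join: transaction 4 silently disappears because its key
# has no match in the customer table.
inner = pd.merge(transactions, customers, on="customer_id")

# Left join: every transaction survives; missing customer details are NaN.
left = pd.merge(transactions, customers, on="customer_id", how="left")

print(len(inner), len(left))   # 3 vs 4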
6

The Split-Apply-Combine Pattern

3m 32s

Unlock the true power of the GroupBy object. You will learn how to go beyond simple averages to perform complex group-specific transformations and filtrations.

Hi, this is Alex from DEV STORIES DOT EU. Mastering Modern Pandas, episode 6 of 8. You probably think grouping data is just about rolling up numbers to find a total or an average. But what if you need to evaluate a single row based on the behavior of the group it belongs to, without losing the original shape of your dataset? This requires moving beyond simple summaries and using the Split-Apply-Combine pattern. When you call a groupby method in pandas, you trigger a sequential three-step process. First is the split. Pandas takes your entire dataset and divides it into independent groups based on a key you provide. Next is the apply step, where a function is executed against each group entirely independent of the others. Finally, the combine step takes the results from every group and stitches them back together into a single data structure. Many developers think the apply step is strictly for aggregation. Aggregation takes a group of values and returns a single number, like the sum or the mean. If you group a customer support dataset by agent and ask for the average resolution time, you get a new dataset with one row per agent. That is useful, but it is only one third of the story. Here is the key insight. The apply step is just as frequently used for transformation and filtration. A transformation performs a calculation on the group, but returns an object that is indexed exactly like the original data. It does not reduce the row count. Let us go back to our customer support dataset. You want to know if a specific ticket took unusually long to resolve. But an unusual time for a junior agent might be a normal time for a senior agent handling highly complex issues. You need the z-score of the resolution time, relative only to that specific agent's historical average. You split the data by agent, and apply a transformation function that calculates the z-score. Pandas calculates the mean and standard deviation for agent A, standardizes agent A's tickets, does the exact same for agent B, and then combines them. You get your original dataset back, row for row, but now every ticket has a standardized score based strictly on its specific group context. The third core application is filtration. This lets you discard entire groups based on a collective property, rather than evaluating individual rows. Suppose you want to analyze those ticket z-scores, but some agents in your dataset only processed two or three tickets. Their averages are statistically meaningless. You can use a filter function on your grouped object to check the size of each group. If a group has fewer than ten tickets, the filter logic returns false, and pandas drops every single row belonging to that agent. The combine step then returns a dataset containing only the tickets from agents with a valid sample size. You split the data by a key, apply a logic rule that evaluates the group, and combine the survivors. Aggregation reduces data. Transformation standardizes data within its local context. Filtration discards data based on group rules. The true power of the split-apply-combine pattern is that it allows you to manipulate individual rows using group-level context, without ever writing a manual loop. If you find these deep dives helpful and want to support the show, you can search for DevStoriesEU on Patreon. I would like to take a moment to thank you for listening — it helps us a lot. Have a great one!
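A compact sketch of the three apply flavors on a made-up support-ticket table:

import pandas as pd

tickets = pd.DataFrame({
    "agent": ["a", "a", "a", "b", "b", "b", "c"],
    "resolution_min": [30, 45, 60, 200, 250, 300, 15],
})
by_agent = tickets.groupby("agent")["resolution_min"]

# Aggregation: one row per group.
print(by_agent.mean())

# Transformation: same shape as the input, so it can become a new column.
tickets["zscore"] = by_agent.transform(lambda x: (x - x.mean()) / x.std())

# Filtration: drop every row belonging to agents with too few tickets
# (agent "c" has only one, so all of its rows vanish).
print(tickets.groupby("agent").filter(lambda g: len(g) >= 3))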
7

Time Series Mastery

3m 51s

We dive into pandas' undisputed dominance in time series analysis. You will learn how to leverage DatetimeIndex and native resampling for high-frequency data.

Hi, this is Alex from DEV STORIES DOT EU. Mastering Modern Pandas, episode 7 of 8. Hand-rolling your own time aggregations is a nightmare of missing intervals, leap years, and business day logic. If you find yourself writing custom code to round raw timestamps into regular buckets, you are working far too hard. The solution is Time Series Mastery using pandas native temporal structures. The backbone of time series functionality in pandas is the DatetimeIndex. Instead of standard integer row numbers, your dataframe index becomes a strict sequence of precise timestamps. Upgrading your index to a DatetimeIndex fundamentally changes how the dataframe behaves, making the entire structure temporally aware. This enables native time-based slicing. If you need data from just October 2023, you pass the simple string "2023-10" to the row locator. Pandas automatically calculates the exact microsecond boundaries for that month and returns the correct subset. You can pass partial date strings down to the hour or minute, and the index resolves the underlying timestamp math. Once your data is temporally aware, you usually need to aggregate it. Many developers mistake time-based resampling for basic grouping. They attempt to apply standard group-by operations to a date column. That approach fails when dealing with real-world, messy time series. Standard grouping only looks at the explicit rows currently present in your dataframe. If a server goes offline for an hour, a standard group-by operation just skips that hour entirely. Your output timeline will have a hidden gap, throwing off any subsequent time-based calculations. Here is the key insight. The dot resample method is fundamentally different because it natively understands calendar logic. It projects a rigid, continuous time grid over your data. If you resample by ten-minute intervals, and no data arrives during a specific ten-minute window, pandas still generates that bucket. It leaves the values empty, preserving the strict mathematical integrity of your timeline. Resampling inherently understands empty intervals, irregular month lengths, and business day calendars that strip out weekends and holidays. Consider the scenario of a quantitative analyst processing trading data. You are receiving high-frequency tick data from an exchange. The individual trades arrive at completely irregular intervals, sometimes three per microsecond, sometimes none for twenty seconds. Your pricing model cannot consume this chaos. It requires perfectly aligned five-minute bars. Because your trade prices are mapped to a DatetimeIndex, you call the dot resample method on your dataframe and pass the frequency string "5min". This maps every erratic tick to a strict five-minute grid. To feed a financial model, you specifically need the Open, High, Low, and Close prices for each bucket. Instead of writing custom functions to extract the first, maximum, minimum, and last trades of each window, you chain the dot ohlc method directly onto your resample call. Pandas calculates all four metrics at once, outputting a cleanly structured dataset of five-minute bars. Those empty intervals we mentioned earlier remain intact in this output. You can then chain another method to forward-fill the previous closing prices into the empty gaps, ensuring your model always has valid data. Resampling transforms erratic, event-driven records into a predictable, mathematically sound timeline without requiring you to write a single line of calendar logic. Thanks for tuning in. Until next time!
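A sketch of the tick-to-bars workflow described here; the timestamps and prices are synthetic:

import numpy as np
import pandas as pd

# Irregular ticks over one trading hour, indexed by a DatetimeIndex.
rng = np.random.default_rng(42)
stamps = pd.to_datetime("2026-03-02 09:30:00") + pd.to_timedelta(
    np.sort(rng.uniform(0, 3600, 500)), unit="s")
ticks = pd.Series(100 + rng.standard_normal(500).cumsum() * 0.05, index=stamps)

# Partial-string slicing: everything in the 09:00 hour of that day.
morning = ticks.loc["2026-03-02 09"]
print(len(morning))

# Rigid five-minute grid with Open/High/Low/Close per bar; empty bars are
# kept as NaN rows and then forward-filled from the previous bar.
bars = ticks.resample("5min").ohlc().ffill()
print(bars.head())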
8

Scaling to Out-of-Core Datasets

3m 48s

We tackle the limits of your machine's RAM. You will learn how to process datasets significantly larger than memory using pure pandas chunking.

Hi, this is Alex from DEV STORIES DOT EU. Mastering Modern Pandas, episode 8 of 8. Before you request a massive cloud instance or reach for a distributed computing cluster, you might be surprised to learn you can process terabytes of data right on your laptop. Most out-of-memory errors do not require a new framework, they just require a change in how you read files. Today, we are covering Scaling to Out-of-Core Datasets. Out-of-core processing simply means working with datasets that are larger than your available system memory. A common reaction to an out-of-memory crash is assuming pandas has reached its hard limit. People often jump straight to rewriting their pipelines in PySpark or Dask. But pandas can handle massive data natively if you stop trying to load the entire dataset into RAM at once. Simple generator patterns solve the vast majority of these scaling problems. The primary mechanism for out-of-core processing is chunking. If you are dealing with a single massive text file, the standard read function accepts a chunk size argument. When you provide this argument, pandas stops returning a DataFrame. Instead, it returns an iterator. Each time your code advances the iterator, pandas reads only the specified number of rows from the disk and yields them as a normal DataFrame. You apply your logic to that chunk, extract the result, and discard the chunk. Because the old data is cleared from memory before the next batch is read, your memory usage stays completely flat regardless of how large the underlying file is. This is where it gets interesting. Modern data infrastructure rarely relies on single giant text files. Usually, large datasets are stored as directories containing hundreds of smaller, partitioned files, typically in a binary format like Parquet. Parquet files are highly compressed and load very fast, but you still cannot load fifty gigabytes of Parquet files into sixteen gigabytes of RAM. To handle this, you apply the chunking concept manually across files. Imagine you have a directory of yearly Parquet files and you want to calculate the total frequency of categories across the entire historical dataset. You construct a simple iterative loop. First, initialize an empty pandas Series. This will act as your accumulator for the global totals. Next, iterate through your directory file by file. Inside the loop, read the current file into a DataFrame. Now, run the value counts function on the specific column you are analyzing. This gives you a Series containing the frequencies just for that specific year. The crucial step is combining this local result with your global accumulator. You do this by calling the add method on your global Series and passing in the local Series. Because some categories might exist in one file but not another, you must set the fill value argument to zero. This ensures pandas aligns the indexes properly and adds the counts without introducing missing values. Once the loop finishes processing that file, it moves to the next one. Python automatically garbage collects the old DataFrame. You are effectively streaming a massive dataset through memory one file at a time, building a continuous global aggregation. Out-of-core processing is not about throwing hardware at a problem. It is about keeping your active state small and pushing the mathematical aggregations down to the individual chunk level. 
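A sketch of both patterns from this episode; the file names and directory layout are hypothetical:

import pandas as pd
from pathlib import Path

# Pattern 1: chunked CSV reading. Each iteration yields one DataFrame of
# up to a million rows, so memory stays flat regardless of file size.
total_rows = 0
for chunk in pd.read_csv("huge_log.csv", chunksize=1_000_000):
    total_rows += len(chunk)

# Pattern 2: file-by-file aggregation across a directory of yearly Parquet
# partitions, accumulating global category counts.
counts = pd.Series(dtype="float64")
for path in sorted(Path("sales_by_year").glob("*.parquet")):
    local = pd.read_parquet(path, columns=["category"])["category"].value_counts()
    # fill_value=0 aligns the two indexes and adds without introducing NaNs.
    counts = counts.add(local, fill_value=0)

print(counts.sort_values(ascending=False).head())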
Since this is the end of the series, I highly encourage you to explore the official pandas documentation on scaling and try these patterns hands-on with your own data. If you have topics you want us to cover in future series, visit devstories dot eu and let us know. Thanks for spending a few minutes with me. Until next time, take it easy.