Season 14 · 21 Episodes · 1h 16m · 2026

PySpark Fundamentals

v4.1 — 2026 Edition. A comprehensive guide to PySpark 4.1, covering Spark Connect, DataFrames, complex data types, data transformations, SQL, UDFs, and profiling.

Big Data · Distributed Computing · Data Science
1
The Big Data Problem & PySpark's Promise
We establish the fundamental need for PySpark. Discover why standard Python libraries like Pandas fail at scale, and how PySpark provides a distributed execution engine to process massive datasets seamlessly.
3m 38s
2
The Spark Connect Revolution
Explore the Spark Connect architecture. We explain how PySpark decoupled the client and server, allowing you to run Spark applications anywhere without bulky JVM dependencies.
3m 24s
3
DataFrames and Lazy Evaluation
Dive into the fundamental abstraction of PySpark: the DataFrame. We discuss the concept of lazy evaluation, the difference between transformations and actions, and why Spark plans before it executes.
4m 13s
4
Creating and Viewing DataFrames
Learn how to instantiate DataFrames from raw Python objects, dictionaries, and files, and how to safely inspect your distributed data without crashing your driver node.
3m 29s
5
Mastering Basic Data Types
A tour of PySpark's foundational numerical and string types. We explore how to explicitly define schemas using StructType and StructField for robust data pipelines.
4m 01s
6
The Perils of Precision
Discover the critical differences between FloatType, DoubleType, and DecimalType. Learn why choosing the wrong numerical type can introduce disastrous rounding errors in financial data.
3m 57s
7
Taming Complex and Nested Data
Big data isn't always flat. We explore PySpark's complex data types, including ArrayType, StructType, and MapType, allowing you to natively parse deeply nested JSON.
3m 59s
8
Type Casting and Selection
Learn how to actively mold your DataFrame schemas. We cover how to select subsets of columns, and how to safely cast columns from one data type to another.
3m 29s
9
Function Junction: Cleaning Dirty Data
Garbage in, garbage out. Learn the essential DataFrame transformations for dropping nulls, filling missing values, and handling NaN records natively in distributed systems.
3m 37s
10
Transforming and Reshaping Data
Take control of your data's shape. We explore how to generate new columns with mathematical functions, perform string manipulations, and flatten nested arrays using explode.
3m 52s
11
The Mechanics of Grouping and Aggregation
Master the split-apply-combine strategy. We dive into grouping data by keys and applying powerful aggregation functions to summarize massive datasets.
3m 28s
12
When DataFrames Collide: The Art of Joining
Navigating the nuances of combining datasets. We break down the seven different join types in PySpark and explain how to merge DataFrames safely.
3m 31s
13
Old SQL, New Tricks
Why learn a new API when you can use raw SQL? Learn how to execute standard SQL queries directly against distributed PySpark DataFrames.
3m 18s
14
Interchanging DataFrames and SQL
Mix and match SQL with Python seamlessly. Discover how to create temporary views from DataFrames, use selectExpr, and chain programmatic operations onto SQL query results.
3m 16s
15
Extending Spark with Python UDFs
When built-in functions aren't enough, User-Defined Functions step in. We explore how to write custom Python logic for DataFrames, and why standard scalar UDFs hide a performance penalty.
3m 48s
16
Turbocharging UDFs with Apache Arrow
Eliminate the JVM-to-Python serialization bottleneck. We uncover how Vectorized Pandas UDFs and Apache Arrow memory formats supercharge your custom transformations.
3m 38s
17
Exploding Rows with Python UDTFs
Standard UDFs return one value per row, but what if you need multiple rows? Learn how Python User-Defined Table Functions (UDTFs) solve complex one-to-many generation problems.
3m 40s
18
The Pandas API on Spark
Scale your existing Pandas scripts to infinity. Discover how the pyspark.pandas API allows you to execute standard Pandas syntax natively on a distributed Spark cluster.
3m 43s
19
Load and Behold: Storage Formats
Not all file formats are created equal. We contrast row-based CSVs with columnar formats like Parquet and ORC, exploring read/write options and optimal storage techniques.
3m 32s
20
Bug Busting: Physical Plans and Joins
Peek under the hood of Spark's execution engine. Learn how to debug queries using DataFrame.explain() and how to eliminate costly shuffles by using Broadcast joins.
3m 04s
21
Profiling PySpark Memory and Performance
We wrap up our PySpark journey by introducing native profiling tools. Learn how to track memory consumption line-by-line and expose hidden Python internal tracebacks.
3m 46s

Episodes

1

The Big Data Problem & PySpark's Promise

3m 38s

We establish the fundamental need for PySpark. Discover why standard Python libraries like Pandas fail at scale, and how PySpark provides a distributed execution engine to process massive datasets seamlessly.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 1 of 21. Your standard Python script runs perfectly in testing, but the moment your dataset hits fifty gigabytes, it crashes with an OutOfMemory error. You have hit the physical limits of a single machine. The solution to this bottleneck is the focus of this episode: the big data problem and PySpark's promise. Standard Python data tools are built for single-node execution. Libraries like pandas are incredibly efficient, but they require the entire dataset to reside in local memory. If your server has sixteen gigabytes of RAM and you try to load fifty gigabytes of application logs, the operating system intervenes and kills the process. Scaling vertically by renting a larger, more expensive server only delays the inevitable. Data grows faster than hardware upgrades. Eventually, the data outgrows the box. PySpark solves this limitation. It is the Python API for Apache Spark. Apache Spark itself is a distributed computing engine that runs on the Java Virtual Machine. PySpark acts as a bridge, allowing you to write your logic purely in Python while taking advantage of Spark's highly optimized distributed engine. This shifts your architecture from vertical scaling to horizontal scaling. Instead of relying on one massive machine, PySpark partitions your data and distributes your computations across a cluster of many smaller machines, known as nodes. You write your Python code, and PySpark translates it into a parallel execution plan. If your data volume doubles next month, you do not have to rewrite a single line of code. You simply add more nodes to the cluster. The PySpark ecosystem is organized into a few core modules designed for different workloads. First is Spark SQL. This is the foundation for most modern PySpark applications. It provides a DataFrame structure for handling tabular data spread across multiple machines. It also allows you to run standard SQL queries directly against these distributed datasets. Next is Structured Streaming. This module handles real-time data pipelines. Instead of processing a massive batch of data overnight, Structured Streaming continuously processes flows of records, like live sensor readings or web traffic events. It uses the exact same programming model as Spark SQL, meaning your batch processing logic and streaming logic look nearly identical. Then there is MLlib, the Machine Learning Library. Training models on massive datasets on a single machine is a notorious bottleneck. MLlib provides distributed machine learning algorithms for tasks like classification, regression, and clustering. It spreads the heavy mathematical operations across the entire cluster, drastically reducing training time. Here is the key insight. The true power of PySpark is abstraction. You never manually slice your massive files into chunks. You never write networking code to coordinate the servers. You simply define a logical sequence of transformations, and the underlying engine handles the data distribution, the parallel execution, and even the recovery process if a node loses power mid-calculation. PySpark is not merely a utility for opening larger files. It is a fundamental shift from computing constrained by a single motherboard to computing constrained only by the size of your cluster. If you find these episodes helpful and want to support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
2

The Spark Connect Revolution

3m 24s

Explore the Spark Connect architecture. We explain how PySpark decoupled the client and server, allowing you to run Spark applications anywhere without bulky JVM dependencies.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 2 of 21. For years, writing PySpark code locally meant dragging around a massive, heavy Java Virtual Machine just to test a simple script. You had to synchronize Python versions, Java configurations, and cluster dependencies perfectly before writing a single line of logic. The Spark Connect Revolution makes that entirely obsolete. Traditionally, PySpark relied on a tightly coupled architecture. Your Python script and the Spark execution engine had to co-exist on the exact same physical or virtual machine. Launching a PySpark session meant spinning up a Java Virtual Machine in the background using a bridge library. This architecture burdened your local development environment with the full weight of the Spark execution engine. It made embedding PySpark into web applications, modern code editors, or edge devices highly impractical. Spark Connect solves this by introducing a decoupled client-server architecture. Your Python environment is now strictly separated from the Spark server. The local PySpark client becomes a lightweight library. It no longer requires a local Java installation and does not execute data processing tasks itself. It acts purely as a remote interface to the actual Spark cluster. Here is the key insight. When you write DataFrame operations with Spark Connect, the lightweight client records your method calls and translates them into an unresolved logical plan. You can picture this plan as an abstract blueprint of your query, strictly describing what data to process without worrying about how it gets processed. The client packages this blueprint using Protocol Buffers and transmits it over a gRPC network connection to the remote Spark server. The server unpacks the plan, handles all the complex query optimization, executes the job across the cluster, and finally streams the computed results back to your Python script. Setting this up requires a minor change to how you start your application. You still use the SparkSession builder, but instead of relying on local configurations, you call the remote method. You provide a connection string detailing where the Spark server lives. This string uses a dedicated connection scheme starting with the letters s c. So, if you are connecting to a local test server on the default port, you provide the string s c colon slash slash localhost colon one five zero zero two. After that single connection step, you write your DataFrame code the same way you always have. Because the execution is fully remote, you can connect multiple different Python clients, from different applications, to the exact same Spark server simultaneously. Your application code simply asks for data transformations, and the heavy lifting stays entirely on the server side. By completely isolating the Python client from the execution runtime, Spark Connect eliminates the notorious dependency conflicts that used to break deployments, allowing you to upgrade your application environments completely independently of the Spark cluster itself. Thanks for spending a few minutes with me. Until next time, take it easy.
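A minimal sketch of that connection step, assuming a Spark Connect server is already listening on the default local port mentioned in the episode:

    from pyspark.sql import SparkSession

    # Lightweight client: no local JVM, just a remote connection string.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    # From here on, DataFrame code is unchanged; execution happens on the server.
    df = spark.range(5)
    df.show()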
3

DataFrames and Lazy Evaluation

4m 13s

Dive into the fundamental abstraction of PySpark: the DataFrame. We discuss the concept of lazy evaluation, the difference between transformations and actions, and why Spark plans before it executes.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 3 of 21. What if your code did not actually run when you typed it, but instead waited, analyzed your end goal, and mapped out the fastest possible route? You chain together filters, aggregations, and joins, yet your machine barely breaks a sweat. That is because it is doing nothing until you force its hand. This mechanism is called lazy evaluation, and it is the core engine behind PySpark DataFrames. A PySpark DataFrame is a distributed collection of data organized into named columns. If you are familiar with pandas, the concept feels identical. The difference is that a PySpark DataFrame splits its data across multiple compute nodes in a cluster. Historically, the foundational structure in Spark was the Resilient Distributed Dataset, commonly known as the RDD. The ecosystem has heavily shifted away from raw RDD manipulation. In fact, as of Spark version 4.0, direct RDD usage is no longer supported on Spark Connect. DataFrames are now the definitive standard, providing a strict API that allows Spark to automatically optimize your queries. That optimization relies entirely on lazy evaluation. Every operation you perform on a DataFrame falls into one of two strict categories: a transformation or an action. Transformations are commands that return a new DataFrame. Examples include selecting specific columns, filtering rows based on a condition, grouping records, or joining two separate tables. When you apply a transformation, PySpark does not execute the data processing. It simply records the operation. It updates an internal blueprint called the logical execution plan. You can write fifty consecutive transformations, and Spark will just quickly validate the syntax and update its graph. Here is the key insight. By delaying the actual execution, PySpark gives its underlying query engine, the Catalyst Optimizer, the complete picture of your data pipeline. The optimizer inspects your entire chain of transformations, rearranges them for maximum efficiency, and drops unnecessary steps entirely before a single byte of data is read from the disk. This blueprint remains completely dormant until you invoke an action. An action is a command that demands a concrete result. It either returns data to your driver program or writes data out to storage. Common actions include counting the total number of rows, collecting the data back into a local Python list, or commanding the system to display the top twenty records on your screen. The moment you trigger an action, the engine kicks into gear. It translates your optimized logical plan into a physical plan, distributes the tasks to the cluster workers, and runs the computation. Consider a standard data workflow. First, you create a DataFrame by pointing to a massive file. Then, you join it with a separate table of user details. After the join, you filter the results to only include users from a specific city. Finally, you ask Spark to show the output. Because of lazy evaluation, Spark does not actually load the entire file, perform a massive distributed join, and then filter the results at the end. Instead, the optimizer looks at your final request, notices the filter, and pushes that filter operation up the chain, long before the join happens. It selectively reads only the relevant records, drastically reducing memory usage and network traffic across the cluster. Your PySpark script is never a sequence of immediate commands. 
It is a set of instructions drafting an architectural blueprint, and the system only begins construction when you finally demand the end result. That is it for today. Thanks for listening — go build something cool.
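A minimal sketch of the transformation-versus-action split, using hypothetical input files and column names:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    events = spark.read.json("events.json")        # hypothetical input file
    users = spark.read.parquet("users.parquet")    # hypothetical input file

    # Transformations: nothing runs yet, Spark only extends the logical plan.
    joined = events.join(users, "user_id")
    berlin = joined.filter(F.col("city") == "Berlin")

    # Action: the optimized plan executes now, with the filter applied
    # before the join rather than after it.
    berlin.show()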
4

Creating and Viewing DataFrames

3m 29s

Learn how to instantiate DataFrames from raw Python objects, dictionaries, and files, and how to safely inspect your distributed data without crashing your driver node.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 4 of 21. Calling one specific method on a massive dataset is a guaranteed way to instantly crash your entire application with an out-of-memory error. Knowing how to safely move data into and out of Spark without blowing up your driver node is critical. That is exactly what this episode covers: creating and viewing DataFrames. Every PySpark application needs data to work with. You generally create DataFrames in three ways. First, you can create them directly from in-memory Python structures. You simply define a list of dictionaries, where each dictionary represents a row and the keys are column names, and pass it to the create DataFrame method on your SparkSession. Second, if you already have a pandas DataFrame in memory, you can pass that exact pandas object to the same create DataFrame method. PySpark handles the conversion automatically. The third and most common way is reading from external files. You use the read attribute on your SparkSession, followed by the format you want, like csv or json, and provide the file path. Once your data is loaded, you need to verify it. PySpark DataFrames are distributed, meaning you cannot just print the variable and see the data like you would in a standard Python script. To see the structure of your data, you call the print schema method. This outputs a text-based tree showing every column name and its corresponding data type. It is the fastest way to check that your file loaded correctly. To view the actual contents, you use the show method. By default, calling show displays the first twenty rows in a tabular format. Pay attention to this bit. If your columns contain long strings, the show method truncates them. You can disable this by passing a truncate argument set to false, or set it to a specific number of characters. If your DataFrame has dozens of columns, the standard table view wraps around the screen and becomes unreadable. In that case, you can pass the vertical argument set to true. This prints each row as a vertical block of key-value pairs, making wide datasets much easier to read in a terminal. Now, we get to the out-of-memory crash mentioned earlier. Sometimes you need to bring the distributed data back into regular Python objects. The method to do this is called collect. Here is the key insight. The collect method takes every single row from every executor across your entire cluster and forces it into the memory of your single driver node. If your DataFrame contains a billion rows, your driver will run out of memory and crash instantly. You should only ever call collect when you have aggregated or filtered your data down to a small size. When dealing with large datasets, always extract smaller samples. Instead of collect, use the take method, passing in the number of rows you want. This returns a standard Python list containing just those first few rows. If you need to check the end of your dataset, use the tail method to grab the last few rows. Both methods safely limit the amount of data transferred to your driver. The rule for distributed data is simple: push computations out to the cluster, but strictly limit the number of rows you pull back to the driver. That is all for this one. Thanks for listening, and keep building!
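A minimal sketch of these creation and inspection methods, using a tiny made-up dataset:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # From in-memory Python rows: each dictionary is one row, keys are column names.
    df = spark.createDataFrame([
        {"name": "Ada", "age": 36},
        {"name": "Linus", "age": 55},
    ])

    df.printSchema()            # column names and types
    df.show(truncate=False)     # first 20 rows, without truncating long strings
    first_rows = df.take(2)     # small, driver-safe sample instead of collect()
    last_rows = df.tail(1)      # last row(s), also limited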
5

Mastering Basic Data Types

4m 01s

A tour of PySpark's foundational numerical and string types. We explore how to explicitly define schemas using StructType and StructField for robust data pipelines.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 5 of 21. Relying on automatic schema inference might save you a few lines of code, but it will cost you dearly in production performance. The cluster often has to read your entire dataset just to guess what is inside before doing any actual work. You fix this by mastering basic data types and explicit schemas. It is common to confuse standard Python types with PySpark data types. When you declare an integer or a string in standard Python, that object lives in your local machine's memory. PySpark types operate on a completely different level. They are mapping instructions for the Catalyst optimizer and the underlying Java Virtual Machine. When you use PySpark data types, you are defining a strict, cluster-aware structure. This guarantees data consistency across hundreds of distributed worker nodes and dictates exactly how data is serialized over the network. PySpark provides a specific type for every standard data shape, and selecting the correct one is crucial for performance. For numbers, you have ByteType for very small integers, IntegerType for standard numbers, and LongType for large values. Selecting ByteType instead of LongType for a simple status code saves significant memory when that choice is multiplied across billions of rows. For text and logic, you use StringType and BooleanType. Handling time correctly is another area where exact typing matters. PySpark splits temporal data into DateType and TimestampType. You use DateType when you only care about the calendar date, like a user's birthday. You use TimestampType when you need exact points in time, tracking both the date and the exact hour, minute, and second an event occurred. Knowing these types is only the foundation. You must apply them directly to your data ingestion process using an explicit schema. You construct this schema using two specific objects: StructType and StructField. You can think of a StructType as the blueprint for an entire row in your dataframe. A StructField is the blueprint for a single column within that row. To build an explicit schema, you instantiate a StructType and provide it with a collection of StructFields. Every StructField requires three specific arguments. First, you provide the column name as a standard string. Second, you pass the specific PySpark data type you want to enforce, like IntegerType or StringType. Third, you provide a boolean flag indicating whether this column is allowed to contain null values. For example, you construct a schema starting with a StructField called user identifier, assigned to a StringType, and set the null flag to false. You follow this with a StructField called account age, assigned to an IntegerType, setting the null flag to true. Once this StructType object is fully assembled, you pass it directly to your dataframe reader using the schema method before you call the load command to read your files. This is the part that matters. When you provide this explicit schema upfront, PySpark completely skips the data scanning phase. It applies your blueprint directly to the incoming data stream. This drastically reduces the time it takes to read a file. It also acts as an immediate quality gate. If a malformed file arrives with text in your integer column, the pipeline handles it based on your defined structure rather than silently shifting the inferred schema downstream and breaking your transformations. 
Defining your schema explicitly transforms a fragile, expensive read operation into a predictable, highly optimized pipeline step. Thanks for listening, happy coding everyone!
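A minimal sketch of that explicit schema, assuming a hypothetical users.csv file with a header row:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.getOrCreate()

    # Blueprint for one row: column name, PySpark type, nullable flag.
    schema = StructType([
        StructField("user_identifier", StringType(), False),
        StructField("account_age", IntegerType(), True),
    ])

    # Providing the schema up front skips inference entirely.
    users = (
        spark.read.schema(schema)
        .format("csv")
        .option("header", True)
        .load("users.csv")
    )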
6

The Perils of Precision

3m 57s

Discover the critical differences between FloatType, DoubleType, and DecimalType. Learn why choosing the wrong numerical type can introduce disastrous rounding errors in financial data.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 6 of 21. Using a standard float might seem harmless, until your aggregation query silently miscalculates millions in financial transactions. Code that runs perfectly can produce numbers that are slightly, dangerously wrong. This is exactly why we need to talk about the perils of precision. In PySpark, you have three primary ways to store numbers with fractional parts. You have FloatType, DoubleType, and DecimalType. They are not interchangeable. A common mistake is letting PySpark infer a schema from your raw data. Inference usually assigns DoubleType to any number with a decimal point. If you are calculating financial revenues, relying on this default behavior is a serious operational risk. To understand why, we need to look at how FloatType and DoubleType function under the hood. FloatType uses 32-bit IEEE 754 floating-point math. DoubleType uses the 64-bit version of the same standard. Both represent numbers as binary fractions. Think about how the fraction one-third cannot be written perfectly using base-ten decimals. It becomes an endless string of threes. The exact same limitation exists in binary. Common decimal numbers, like zero point one or zero point two, cannot be represented perfectly in base two. The computer stores a tiny approximation. With DoubleType, you get 64 bits of space, meaning the approximation is incredibly close to the real number. If you query a single row of data, you will rarely notice the difference. Here is the key insight. The error compounds during aggregations. When you calculate total financial revenues by summing up billions of individual rows, those microscopic inaccuracies add up. A fraction of a cent lost or gained on every transaction eventually skews the final aggregate total by thousands or even millions of dollars. Your aggregation logic is mathematically sound, but the underlying data type corrupts the result. If your system is calculating physics simulations or training machine learning models, FloatType and DoubleType are exactly what you want. They trade exactness for high-speed hardware processing. But the moment you handle money, you require strict, unyielding accuracy. This brings us to DecimalType. DecimalType does not use floating-point approximations. It stores numbers exactly as you define them, using a fixed scale. When you configure a DecimalType, you define two distinct parameters. First, you specify the precision, which is the maximum total number of digits the value can hold. Second, you specify the scale, which dictates the exact number of digits allowed to the right of the decimal point. If you set up a DecimalType with a precision of ten and a scale of two, PySpark allocates the exact space needed to store that value down to the penny. There are no binary fractions and no rounding guesses. In practice, you implement this by taking strict control of your schemas. When reading financial records from a source file, do not let PySpark guess the types. First, you create a strict schema object. Then, you define your financial fields like revenue or tax. Finally, you explicitly assign them a DecimalType with your chosen precision and scale. Once your dataframe loads with this schema, your standard sum or average aggregations will execute perfectly from the first row to the billionth. You sacrifice a minor amount of compute performance compared to a standard DoubleType, but you guarantee that your financial reporting is absolutely flawless. 
The rule is simple: use floating-point types for speed and scientific approximations, but the moment a number represents currency, lock it down with a DecimalType. Thanks for tuning in. Until next time!
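A minimal sketch of an exact money schema, assuming a hypothetical ledger.csv file with revenue and tax columns:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DecimalType

    spark = SparkSession.builder.getOrCreate()

    # Fixed-point money columns: 10 total digits, 2 after the decimal point.
    schema = StructType([
        StructField("transaction_id", StringType(), False),
        StructField("revenue", DecimalType(10, 2), False),
        StructField("tax", DecimalType(10, 2), True),
    ])

    ledger = spark.read.schema(schema).option("header", True).csv("ledger.csv")

    # Aggregations stay exact down to the cent.
    ledger.agg(F.sum("revenue").alias("total_revenue")).show()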
7

Taming Complex and Nested Data

3m 59s

Big data isn't always flat. We explore PySpark's complex data types, including ArrayType, StructType, and MapType, allowing you to natively parse deeply nested JSON.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 7 of 21. Real-world big data is rarely a flat spreadsheet. Sometimes, you need an array of nested dictionaries just to parse a single JSON event. To handle that, we need to talk about Taming Complex and Nested Data. Relational workflows prefer flat tables, but modern event data arrives heavily nested. PySpark handles this by providing three complex data types. These are ArrayType, StructType, and MapType. These allow you to explicitly model hierarchical structures natively in the engine. Take a standard customer profile to see how these types operate. The first concept is ArrayType. This represents a collection of elements. The strict rule is that every item inside an ArrayType must share the exact same underlying data type. You cannot mix strings and integers within the same array. If your customer profile includes a list of recent order IDs, you define that column as an ArrayType containing integers. Next is StructType. A StructType models a nested hierarchical record, essentially functioning as a row embedded inside another row. It contains specific, named fields. Unlike an array, each field inside a StructType can have a completely different data type. Suppose your customer has an address. That address contains a street name as a string, a zip code as an integer, and a boolean flag indicating if it is a commercial property. You bundle these distinct fields together into one StructType. Here is the key insight. You can nest these complex types arbitrarily deep. If a customer has multiple addresses, you do not create flat, numbered columns. Instead, you create an ArrayType where the internal element type is that exact address StructType. You now have an array of structs, which perfectly maps to a standard JSON array of objects. The third structure is MapType, designed specifically for key-value pairs. It differs from a StructType in how it handles structure versus schema. A StructType requires you to hardcode the exact field names up front. A MapType is flexible with its data contents but strict with its data types. Every key in the map must be of one specific type, and every value must be of another specific type. You might use a MapType to store customer application preferences. The keys could be strings, such as theme or language, and the values could also be strings, such as dark or English. Because it is a MapType, the upstream application can inject entirely new preference keys later without forcing you to alter the core DataFrame schema. You simply query the values dynamically by their keys. When you construct this complex schema in your code, you build it from the inside out. First, you define the inner fields of the address StructType. Then, you pass that completed struct into an ArrayType definition. Next, you define the MapType for the user preferences. Finally, you wrap all of these components, along with simple scalar types like the customer name string, into one master StructType that defines the overarching DataFrame row. Instead of flattening nested structures into messy JSON strings, explicitly defining these complex schemas allows the Spark optimizer to prune data and filter deep within nested fields without deserializing the entire payload into memory. Appreciate you listening — catch you next time.
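A minimal sketch of that customer schema, built from the inside out with made-up field names:

    from pyspark.sql.types import (
        StructType, StructField, StringType, IntegerType,
        BooleanType, ArrayType, MapType,
    )

    # Inner record first: an address is a struct of differently typed fields.
    address = StructType([
        StructField("street", StringType(), True),
        StructField("zip_code", IntegerType(), True),
        StructField("is_commercial", BooleanType(), True),
    ])

    # Outer row: arrays of integers, an array of structs, and a string-to-string map.
    customer_schema = StructType([
        StructField("name", StringType(), False),
        StructField("recent_order_ids", ArrayType(IntegerType()), True),
        StructField("addresses", ArrayType(address), True),
        StructField("preferences", MapType(StringType(), StringType()), True),
    ])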
8

Type Casting and Selection

3m 29s

Learn how to actively mold your DataFrame schemas. We cover how to select subsets of columns, and how to safely cast columns from one data type to another.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 8 of 21. A simple string value hiding in an integer column can bring a thousand-node cluster to a grinding halt. You need a reliable way to enforce correct data structures and pick exactly what data moves through your pipeline, which is why we are looking at Type Casting and Selection today. To manipulate data in PySpark, you must first understand what a column actually is. A column instance is not a physical array of data loaded into memory. It is a lazily evaluated representation of an expression. When you reference a column in your code, you are not touching the underlying data. You are simply adding a step to Spark's logical plan. The data only moves when an action is triggered later. To retrieve and shape this data, you use the select method on your DataFrame. You have two main ways to tell the select method which columns you want. The simplest way is to pass the column names as standard text strings. If you pass a string to select, Spark returns a new DataFrame containing exactly that column, completely unchanged. This works well for basic extraction, but it offers no room for modification. To modify the data during selection, you must use Column objects instead of strings. You access a Column object by referencing it directly from the DataFrame. You can do this using dot notation, such as dataframe dot age, or by using bracket notation with the column name as a string inside the brackets. Bracket notation is especially useful when your column names contain spaces or special characters that would break standard dot notation. This is the part that matters. When you pass a Column object into the select method, you can attach methods to it to transform the data on the fly. One of the most critical transformations is type conversion. Data often arrives in the wrong format. For instance, you might receive numerical metrics formatted as text strings. To correct this, you use the cast method. PySpark also provides an alias called astype, which executes exactly the same logic. You call the cast method directly on your Column object inside the select statement. The cast method requires one argument, which is the target data type. You can define this target by passing a string representation of the type, like the word int, or by passing a specific Spark data type object, like IntegerType. Here is how this flows in a real script. You call the select method on your DataFrame. Inside the parentheses of that method, you reference your target column using bracket notation. Right next to that column reference, you call dot cast and provide your new type. When evaluated, this returns a completely new DataFrame where your selected column is now safely converted to the specified type. The original DataFrame remains entirely untouched because DataFrames are immutable. The key takeaway is that type casting in PySpark is not a standalone process applied to an existing dataset in place. It is a lazily evaluated column expression, inherently tied to the act of selecting data to build a new, strongly typed DataFrame. If you enjoy the podcast and want to help support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
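A minimal sketch of casting inside a select, using a made-up DataFrame where the age column arrived as text:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([("Ada", "36")], ["name", "age"])

    # Select by string (unchanged) and by Column object with a cast attached.
    typed = df.select("name", df["age"].cast(IntegerType()))
    # Equivalent forms: df["age"].cast("int") or df["age"].astype("int")
    typed.printSchema()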
9

Function Junction: Cleaning Dirty Data

3m 37s

Garbage in, garbage out. Learn the essential DataFrame transformations for dropping nulls, filling missing values, and handling NaN records natively in distributed systems.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 9 of 21. Garbage in, garbage out. But what do you do when your garbage dataset is hundreds of terabytes in size and you cannot manually inspect a single row? You need a systematic way to sanitize it at scale. That is exactly what we cover today in Function Junction: Cleaning Dirty Data. The first step in cleaning is usually standardizing your schema. You will often receive raw files with spaces, special characters, or typos in the headers. Use the method called with column renamed. You simply pass it the old string name and the desired new string name. If you have several columns to fix, you chain this method sequentially for each column before you apply any complex transformations downstream. Before removing bad data, we must clear up a frequent confusion regarding null and NaN in PySpark. Null means a data point is missing entirely. NaN stands for Not a Number, which represents an undefined mathematical result, such as dividing zero by zero. In pure Python, these require separate handling. However, PySpark groups them together for convenience. When you use the data frame N A functions, Spark evaluates NaN values as nulls for the purpose of dropping or filling. To eliminate rows with missing values, you use the N A dot drop method. Calling this function completely empty drops any row containing a null or NaN in any single column. This approach is highly destructive on wide datasets. A single missing value in an optional metadata column will wipe out a row of otherwise perfect transaction data. To prevent this, pass a list of column names to the subset parameter. PySpark will then evaluate only those specific, critical columns when deciding whether to drop the row. Dropping rows is not always permitted by business rules. Often, you must replace missing values with safe defaults. You accomplish this using N A dot fill. While you can pass a single value to fill all columns, the superior approach is passing a dictionary. The dictionary keys represent the specific column names, and the values represent your chosen replacements. This pattern allows you to fill a missing numeric metric with a zero, while simultaneously replacing a missing category with a text string like unknown. Doing this via a dictionary executes in a single pass, which is highly efficient. Finally, your data might be fully populated but still invalid. Outliers and physically impossible values require logical filtering. You isolate good data using the where method to keep only the rows that satisfy a specific condition. For numeric or date boundaries, the between method is your best tool. You select your column, call between, and provide the lower and upper limits. This replaces verbose greater-than and less-than logic, making your code easier to read. Any row falling outside those limits is filtered out of the resulting data frame. Here is the key insight. Order matters heavily when cleaning at scale. Always rename columns first to lock in your schema, drop or fill missing values next to stabilize your data types, and filter outliers last only when you know the underlying data is structurally sound. That is all for this one. Thanks for listening, and keep building!
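A minimal sketch of that cleaning order, assuming a hypothetical transactions.csv whose amount column has a badly formatted header:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    raw = spark.read.option("header", True).csv("transactions.csv")

    cleaned = (
        raw
        .withColumnRenamed("Txn Amount", "amount")              # 1. lock in the schema first
        .withColumn("amount", F.col("amount").cast("double"))
        .na.drop(subset=["customer_id"])                         # 2. drop only on critical columns
        .na.fill({"amount": 0.0, "category": "unknown"})         # 3. targeted defaults in one pass
        .where(F.col("amount").between(0, 10000))                # 4. filter outliers last
    )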
10

Transforming and Reshaping Data

3m 52s

Take control of your data's shape. We explore how to generate new columns with mathematical functions, perform string manipulations, and flatten nested arrays using explode.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 10 of 21. Sometimes a single row of data contains an array of hidden records—and you need to detonate that array to analyze it properly. Transforming and reshaping data is how you unpack, format, and structure that information for downstream processing. When you need to modify a dataframe in PySpark, you do not change data in place. Dataframes are immutable. Instead, you create new versions using a method called withColumn. This method takes two arguments. The first is a string representing the name of the column you want to create or replace. The second is a column expression defining the actual data. If you provide a name that already exists in the dataframe, PySpark overwrites the original column. If the name is completely new, PySpark appends the new column to the right side of your dataset. To define what goes into that new column, you typically use PySpark built-in functions. These are imported from the SQL functions module and provide highly optimized operations that execute across your entire cluster. Consider string manipulation. Text data from external sources is rarely perfectly formatted. You might have a column containing user names written in an unpredictable mix of uppercase and lowercase letters. You can fix this by passing your existing column to a built-in function like lower, which forces all text into lowercase. Alternatively, you can use a capitalization function to ensure the first letter is uppercase and the rest are lowercase. In practice, you build these operations directly into your dataframe transformations. You call withColumn, name your target column, and assign it the result of the lower function applied to your input column. PySpark evaluates this expression for every single row. You can string multiple withColumn calls together to apply several transformations sequentially, passing the progressively updated dataframe to the next step each time. Now, the second piece of this is reshaping. Cleaning strings changes the values, but what happens when the fundamental shape of your data is preventing analysis? This is where it gets interesting. You might receive a dataset where a person's identifier is in one column, and their monthly incomes for the entire year are packed into a single array in the adjacent column. You cannot run standard relational aggregations on a nested array. You need each individual income value on its own row to calculate averages or find minimums. You solve this structural problem using a built-in function called explode. The explode function specifically handles arrays and maps. You call withColumn, specify the column name you want for the output, and pass the explode function wrapping your array column. PySpark executes this by taking the original single row and tearing it open. If the income array contains twelve distinct values, explode generates twelve entirely separate rows. In the new dataframe, the target column now holds a single, flat income value per row instead of a list. Crucially, PySpark duplicates all the other columns from the original row. The user identifier is copied exactly across all twelve new rows. The logical relationship between the user and their income remains perfectly intact, but the data is now flat. You have reshaped a nested structure into a long table ready for standard grouping and filtering operations. 
The true power of PySpark transformations is that functions like explode and lower do not just manipulate individual values; they define a logical computation plan that scales instantly whether you have a hundred rows or a hundred billion rows, without ever requiring you to write a single manual loop. That is your lot for this one. Catch you next time!
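A minimal sketch of lower and explode together, using a made-up table where incomes are packed into an array:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("ADA", [4200, 4300, 4100]), ("Linus", [5100, 5000, 5200])],
        ["user_name", "monthly_incomes"],
    )

    reshaped = (
        df
        .withColumn("user_name", F.lower(F.col("user_name")))        # normalize text in place
        .withColumn("income", F.explode(F.col("monthly_incomes")))   # one row per array element
        .drop("monthly_incomes")
    )
    reshaped.show()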
11

The Mechanics of Grouping and Aggregation

3m 28s

Master the split-apply-combine strategy. We dive into grouping data by keys and applying powerful aggregation functions to summarize massive datasets.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 11 of 21. When you are staring at billions of individual records, reading them row by row is impossible. To extract any actual meaning, you have to summarize them. Today we cover exactly how that happens: The Mechanics of Grouping and Aggregation. Under the hood, PySpark processes aggregations using a classic data strategy called split-apply-combine. This pattern is exactly what it sounds like. First, PySpark splits the massive dataset into distinct logical buckets based on a key you choose. Next, it applies a specific calculation to every single bucket independently across the cluster. Finally, it combines those independent answers back together into a single, summarized result. In your code, you trigger the split phase by calling the group by method on your DataFrame. You simply provide the name of the column you want to use as your grouping key. For example, if you have a massive table of historical transactions, you might group by the user name column. Here is the key insight. Calling group by does not return a new DataFrame. Instead, it returns a transitional construct called a GroupedData object. Because PySpark evaluates your code lazily, it has only built the execution plan for organizing these buckets. It will not actually move any data until you tell it what mathematical operation to perform on those buckets. To provide that mathematical operation, you chain the aggregate method, typically written as agg, directly onto your grouped data. This handles the apply and combine phases. Inside the aggregate method, you tell PySpark what to calculate by using tools from the PySpark SQL functions module. This module contains dozens of optimized aggregation operations. Let us say you want to calculate the average income for each of those users. You would import the average function, usually referred to as avg. You pass the name of your income column into the average function, and you place that inside the aggregate method. When this executes, PySpark calculates the average income for every distinct user bucket simultaneously. The combine phase then kicks in, returning a standard, readable DataFrame. This new DataFrame contains just one row per user, paired with their newly calculated average income. At this point, you have a perfectly summarized table. However, because the calculation happened in parallel across a distributed cluster, the final rows are returned in whatever random order the processing nodes finished their work. If you need to see the highest earners, random order is useless. To fix this, you chain the order by method to the end of your aggregation step. You pass the order by method the column containing your new averages, and you tell it to sort in descending order. PySpark will take the combined results, rank them, and deliver a clean, sorted table. The split-apply-combine pattern is powerful precisely because it maps perfectly to distributed hardware, allowing massive datasets to be summarized in seconds. But remember that grouping data is only half the operation. Grouping requires an aggregation to finish the job, otherwise you just have a cluster full of empty buckets waiting for instructions. Thanks for spending a few minutes with me. Until next time, take it easy.
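A minimal sketch of the group-aggregate-sort chain on a made-up transactions table:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    transactions = spark.createDataFrame(
        [("ada", 4200.0), ("ada", 4600.0), ("linus", 5100.0)],
        ["user_name", "income"],
    )

    # split (groupBy) -> apply/combine (agg) -> sort the summarized result
    summary = (
        transactions
        .groupBy("user_name")
        .agg(F.avg("income").alias("avg_income"))
        .orderBy(F.col("avg_income").desc())
    )
    summary.show()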
12

When DataFrames Collide: The Art of Joining

3m 31s

Navigating the nuances of combining datasets. We break down the seven different join types in PySpark and explain how to merge DataFrames safely.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 12 of 21. Merging two massive tables together is the single most expensive operation in distributed computing. Apply the wrong matching logic, and it becomes the easiest way to crash your cluster by running out of memory. Knowing exactly how to combine datasets safely is what When DataFrames Collide: The Art of Joining is all about. The primary mechanism for combining data in PySpark is the join method. You call this on your base DataFrame, passing in the DataFrame you want to attach, the specific column or columns to match on, and the join type. If you provide no join type at all, PySpark defaults to an inner join. Consider a concrete scenario. You have one DataFrame recording people's heights, and a second DataFrame recording their incomes. Both datasets share a column called name. With an inner join, PySpark looks at the name column in both datasets and only keeps rows where the name exists in both places. If a person appears in the heights data but is missing from the incomes data, their record is completely dropped from the result. To retain unmatched records, you change the join type. A left join keeps every row from your starting DataFrame, which in this case is the heights data. If PySpark finds a matching name in the incomes data, it appends that income. If it does not find a match, it keeps the height row but places a null value in the income column. A right join performs the exact inverse, keeping all incomes and padding missing heights with nulls. When you need absolutely everything, you use a full join. PySpark retains every record from both DataFrames. Matching names are merged into a single row, and any names existing in only one dataset are kept, with null values filling in the missing data from the other side. Here is the key insight. A cross join operates differently because it ignores the join condition entirely. It pairs every single row in the heights DataFrame with every single row in the incomes DataFrame, creating a Cartesian product. If both tables have just one thousand rows, a cross join outputs one million rows. This explosive growth is why cross joins are heavily restricted by default and often require an explicit configuration to execute without throwing an error. The final two join types are actually filtering operations rather than true data merges. A left semi join looks for matches, returning rows from the heights DataFrame only if the name also appears in the incomes DataFrame. The crucial difference from an inner join is that a left semi join does not pull over any columns from the right side. You are left with the exact same columns you started with, just filtered down to the records that have a corresponding match. A left anti join does the precise opposite. It returns rows from the heights DataFrame only if the name does not exist in the incomes data. It drops the right-side columns entirely. This makes the left anti join the most efficient way to identify missing data or find records that failed to process downstream. The choice of join determines not just what data you get back, but how much data has to physically move across your network to generate the result. Thanks for tuning in. Until next time!
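A minimal sketch of the join variants on the heights and incomes example from this episode:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    heights = spark.createDataFrame([("Ada", 170), ("Linus", 178)], ["name", "height"])
    incomes = spark.createDataFrame([("Ada", 4200.0), ("Grace", 5000.0)], ["name", "income"])

    inner = heights.join(incomes, "name")                       # default: matching names only
    left = heights.join(incomes, "name", "left")                # keep all heights, null incomes
    full = heights.join(incomes, "name", "full")                # keep everything from both sides
    matched = heights.join(incomes, "name", "left_semi")        # filter only, no right-side columns
    missing = heights.join(incomes, "name", "left_anti")        # heights with no income record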
13

Old SQL, New Tricks

3m 18s

Why learn a new API when you can use raw SQL? Learn how to execute standard SQL queries directly against distributed PySpark DataFrames.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 13 of 21. You have a team of analysts who write excellent SQL, but your data sits on a massive distributed cluster. You could force them to learn a completely new Python syntax, or you could let them unleash the language they already know. That is where running raw SQL strings directly in PySpark comes in, bringing Old SQL, New Tricks. PySpark gives you a direct bridge to standard SQL through a single method on your Spark session, called simply sql. You pass a raw SQL string into this method. The output is not plain text. It is a standard PySpark DataFrame. This means you can run a standard database query, get a DataFrame back, and immediately pass it to another Python function. It is entirely interoperable. Before you can query data with SQL, PySpark needs to know what tables exist. You have two main ways to expose your data to the SQL engine. First, if you already have a DataFrame in Python, you can call a method to register it as a temporary view. You give it a string name, and suddenly it acts like a table in your SQL queries. Second, you can create tables entirely within your SQL string. You pass a create table statement into the sql method. Inside that string, you define the schema and tell PySpark exactly where the underlying data files live, such as a cloud storage path containing Parquet files. PySpark registers this in its internal catalog. From then on, you query it by name just like a traditional database table. Compare how the same logic looks in both approaches. Say you need to grab customer names, drop anyone with a zero balance, and merge the result with an orders table. In the DataFrame API, you build a chain of Python methods. You call select on your customer dataset to pick the name column. Then you chain a filter method, checking if the balance is greater than zero. Finally, you append a join method referencing the orders dataset on a matching key. It is highly programmatic. In the SQL approach, you write a standard select statement pulling the name column, add a where clause for the balance, and write an inner join for the orders table. It sits in your script as a single, readable string block. Here is the key insight. There is a common misconception that writing SQL inside Python strings must be slower or less native than using the structured DataFrame methods. That is false. Whether you chain Python methods or pass a raw SQL string, PySpark treats them identically. Both inputs are immediately parsed, translated into the very same logical plan, and handed off to the Catalyst optimizer. The execution engine does not know or care which API you used to express your intent. The performance is exactly the same. The choice between the DataFrame API and raw SQL is never about cluster performance. It is purely about what makes your team faster and your codebase easier to maintain. Thanks for hanging out. Hope you picked up something new.
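A minimal sketch of registering views and querying them with raw SQL, using made-up customer and order rows:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    customers = spark.createDataFrame([("Ada", 120.0), ("Linus", 0.0)], ["name", "balance"])
    orders = spark.createDataFrame([("Ada", "order-1")], ["name", "order_id"])

    # Expose the DataFrames to the SQL engine under table-like names.
    customers.createOrReplaceTempView("customers")
    orders.createOrReplaceTempView("orders")

    # The result is an ordinary DataFrame, ready for further Python chaining.
    result = spark.sql("""
        SELECT c.name, o.order_id
        FROM customers c
        JOIN orders o ON c.name = o.name
        WHERE c.balance > 0
    """)
    result.show()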
14

Interchanging DataFrames and SQL

3m 16s

Mix and match SQL with Python seamlessly. Discover how to create temporary views from DataFrames, use selectExpr, and chain programmatic operations onto SQL query results.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 14 of 21. You might find yourself locked in a debate about whether to write your data transformations in Python or SQL. Forcing a strict choice between the two leaves a massive amount of utility on the table. The real advantage lies in interchanging DataFrames and SQL seamlessly within the exact same pipeline. Sometimes a complex set of nested joins is much easier for your team to read and maintain in raw SQL. Other times, you need to iterate through column names dynamically, which is impossible in pure SQL but trivial in Python. PySpark allows you to blend both approaches without breaking your data flow. To start writing SQL against an existing Python DataFrame, you must first expose that DataFrame to the Spark SQL engine. You achieve this by calling the method create or replace temp view directly on your DataFrame. You pass a single string argument, which becomes the table name. This operation does not move any data. It does not write to disk. It simply registers a temporary pointer in your current Spark session. The SQL engine now knows how to resolve that table name back to your Python DataFrame. Now you can query it. You call spark dot sql and pass in your standard select statement as a string, referencing the table name you just created. Here is the key insight. The output of that spark dot sql call is not a static text result, nor is it a different type of object. It returns a standard PySpark DataFrame. This means you can immediately chain normal Python DataFrame methods directly onto the end of your SQL call. You can write a fifty-line SQL string to handle a complex window function, close the spark dot sql parenthesis, and immediately append a dot filter or dot group by method. You transition from Python to SQL and back to Python in a single block of code. If you only need SQL for a specific column calculation, registering a full temporary view is unnecessary. Instead, you use the select expression method. This method acts as a bridge. It works exactly like a standard DataFrame select method, but it accepts raw SQL string expressions instead of Python column objects. If you need to execute a case-when statement, perform mathematical functions, or cast a data type using native SQL syntax, you pass those exact SQL strings into select expression. Spark takes those strings, parses them, and executes them exactly as it would within a full SQL query. This allows you to stay entirely within the chainable DataFrame API while relying on SQL syntax for complex row-level logic. The boundary between these two paradigms is completely artificial. Whether you chain Python methods, write raw SQL queries, or use select expression strings, Spark compiles everything down into the exact same optimized execution plan. If you would like to help us keep making these episodes, you can search for DevStoriesEU on Patreon to support the show. That is all for this one. Thanks for listening, and keep building!
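A minimal sketch of mixing the two styles, using a made-up people table:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([("Ada", 36), ("Linus", 55)], ["name", "age"])
    df.createOrReplaceTempView("people")

    # SQL in, DataFrame out, with a Python method chained straight onto the result.
    seniors = spark.sql("SELECT name, age FROM people").filter("age > 40")

    # selectExpr: raw SQL expressions without registering a view at all.
    labelled = df.selectExpr(
        "name",
        "CASE WHEN age > 40 THEN 'senior' ELSE 'junior' END AS cohort",
    )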
15

Extending Spark with Python UDFs

3m 48s

When built-in functions aren't enough, User-Defined Functions step in. We explore how to write custom Python logic for DataFrames, and why standard scalar UDFs hide a performance penalty.

Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 15 of 21. You write a custom function in Python, plug it into your data pipeline, and it works flawlessly on a small sample. But when you run it on the full dataset, the job slows to a crawl while CPU usage spikes. The code itself is fine, but you are paying a hidden execution tax. Today we are talking about Extending Spark with Python UDFs. A User Defined Function, or UDF, allows you to execute custom Python logic directly on a Spark DataFrame. You use this when the built-in Spark SQL functions do not cover your specific business logic. The process is straightforward. You start by writing a standard Python function. For example, you write a function that takes a text string, applies a complex custom formatting rule, and returns the modified string. To make Spark recognize this function, you import the udf function from the PySpark SQL functions module and apply it as a decorator directly above your Python function definition. You also pass a return type to the decorator, such as a string type or an integer type. If you do not provide a return type, Spark defaults to a string type, which can cause silent data issues if your function actually returns a number. Once decorated, your custom Python function acts just like a native Spark function. You can pass it into DataFrame operations, like a select statement, feeding it column names as arguments. Here is the key insight. A standard scalar Python UDF operates strictly one row at a time. It takes one or more column values from a single row as input, evaluates your custom Python logic, and returns exactly one output value for that specific row. If your DataFrame contains ten million rows, your Python function is invoked ten million separate times. This row-by-row operation is easy to reason about, but it creates the massive performance bottleneck we mentioned at the start. To understand why it is so slow, you have to look at how Spark executes code under the hood. Spark is built in Scala, meaning its core engine runs inside a Java Virtual Machine, or JVM. Your custom UDF is written in Python. The JVM cannot execute Python code natively. To apply your UDF, Spark is forced to spin up separate Python worker processes alongside its own executors. It then has to physically move the data out of the JVM memory space and into the Python process. Spark relies on a Python serialization library called cloudpickle to handle this complex transfer. This is where the performance tax is collected. For every single row in your dataset, Spark serializes the inputs in the JVM, sends that binary data across a local socket to the Python worker, and deserializes it into standard Python objects. Your custom function finally runs on those objects. Then, the whole cycle happens in reverse. Python serializes the output value using cloudpickle, sends it back over the socket, and the JVM deserializes it back into Spark's internal memory format. This constant serialization and deserialization between Java and Python is incredibly expensive. The real cost of a standard Python UDF is rarely the logic you write; it is the silent overhead of translating data back and forth between two entirely different runtime environments on every single row. Thanks for spending a few minutes with me. Until next time, take it easy.
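A minimal sketch of a scalar UDF, using a hypothetical formatting rule:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()

    # A scalar UDF: plain Python, invoked once per row, with an explicit return type.
    @udf(returnType=StringType())
    def format_label(value):
        return f"user:{value.strip().lower()}" if value else None

    df = spark.createDataFrame([(" Ada ",), ("Linus",)], ["name"])
    df.select(format_label("name").alias("label")).show()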
16

Turbocharging UDFs with Apache Arrow

3m 38s

Eliminate the JVM-to-Python serialization bottleneck. We uncover how Vectorized Pandas UDFs and Apache Arrow memory formats supercharge your custom transformations.

Download
Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 16 of 21. What if you could speed up your custom Python functions in Spark by a factor of ten, just by changing a single decorator? Standard Python UDFs are notoriously slow, but the solution does not require rewriting your logic in Scala. Today, we are covering Turbocharging UDFs with Apache Arrow. When you run a standard Python UDF, you hit a massive performance wall at the language boundary. Spark operates inside the Java Virtual Machine, but your custom logic runs in a separate Python worker process. To pass data between them, Spark extracts rows from its internal memory, serializes them with Python's pickle protocol, and sends them to Python. Python processes the data one row at a time, serializes the result, and sends it back. Doing this for millions of individual rows creates an unbearable serialization bottleneck. Apache Arrow changes the rules of this data exchange. Arrow is a cross-language, columnar, in-memory data format. It standardizes how data looks in memory, so both the JVM and Python understand it natively without complex translation. Instead of serializing data row by row, Spark packs the data into large, columnar batches. All the values for a specific column sit right next to each other in contiguous memory. Spark sends these large blocks to Python in one efficient step. You can take advantage of this in two ways. First, you can enable Arrow optimization for standard UDFs. You do this by setting the Spark configuration property for Arrow execution to true, or by specifying the parameter useArrow equals true when registering your UDF. Spark will use Arrow to transfer the data in batches, dramatically reducing the serialization overhead, even though your Python function still technically executes the logic one row at a time. Here is the key insight. To get the maximum speed boost, you want your Python code to process each Arrow batch as a whole, not value by value. This is where Pandas UDFs come in. By wrapping your custom function with the pandas UDF decorator, you change how the function receives data. Instead of getting a single value for one row, your function receives a Pandas Series containing an entire batch of values. Your function applies a vectorized operation to that entire batch and returns a new Pandas Series of the exact same length. Think of a function called calculate tax. You apply the pandas UDF decorator and declare that it returns a double type. The function accepts a Pandas Series containing product prices. Inside the function, you do not write a for-loop. You simply write a return statement that multiplies the input Series by one point two. Because Pandas is backed by highly optimized C code under the hood, it multiplies the entire block of prices instantly. Spark then takes that returned Series and seamlessly merges it back into the DataFrame using Arrow. The real power of a Pandas UDF is not just that it sidesteps the pickle serialization bottleneck, but that it shifts your actual computation from slow Python loops into vectorized, native execution. Thanks for listening. Take care, everyone.
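A rough sketch of both approaches mentioned above, assuming PySpark 3.5 or later where the useArrow flag and pandas_udf type hints are available; the label and calculate_tax functions and the price column are illustrative.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, udf, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(10.0,), (25.5,)], ["price"])

# Option 1: keep row-at-a-time logic, but let Arrow batch the data transfer.
@udf(returnType="string", useArrow=True)
def label(price):
    return f"tier-{int(price // 10)}"

# Option 2: a vectorized pandas UDF that receives a whole batch as a pandas Series.
@pandas_udf("double")
def calculate_tax(price: pd.Series) -> pd.Series:
    return price * 1.2  # no Python loop; pandas multiplies the entire batch at once

df.select(
    col("price"),
    label(col("price")).alias("tier"),
    calculate_tax(col("price")).alias("price_with_tax"),
).show()
```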
17

Exploding Rows with Python UDTFs

3m 40s

Standard UDFs return one value per row, but what if you need multiple rows? Learn how Python User-Defined Table Functions (UDTFs) solve complex one-to-many generation problems.

Download
Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 17 of 21. Standard User-Defined Functions are strictly limited to a one-to-one mapping. You pass in one value, you get exactly one value out. But what if a single dense log entry needs to be expanded into a hundred separate rows? To solve this, you use Python User-Defined Table Functions, or UDTFs. A UDTF does exactly what the name implies. It returns an entire table from a single input. While a standard UDF calculates a single scalar value, a UDTF can emit multiple rows and multiple columns. This is the tool you reach for when you need to explode a nested JSON string, parse a delimited text file line by line, or generate a sequence of dates from a single timestamp. To create a UDTF in PySpark, you do not write a basic standalone function. Instead, you define a Python class. This class requires a specific method called eval. The eval method is where the actual transformation happens. When you execute the UDTF, Spark calls this method for every input value. Here is the key insight. Inside that eval method, you do not use a standard return statement. Instead, you use the Python yield keyword. Every time the method yields a value, Spark translates that into a new row in your output table. If you pass a single input string, the eval method might loop through it and yield ten times. Spark takes those ten yields and produces ten distinct rows. Let us walk through a concrete example. You build a class called ProcessWords. Your goal is to pass in a full sentence and get back a table where every word has its own row. You write the eval method to accept a text string. Inside the method, you split the sentence by spaces. Then, you loop over the resulting words. For each word, you yield a tuple containing the word itself. Before Spark can use this class, you apply the PySpark UDTF decorator to it. The decorator is mandatory because it defines your output schema. You explicitly declare the column names and data types your function generates. If you yield a string, you tell the decorator the output is a string column. If you want to yield the word and its character count, you yield a two-element tuple, and your decorator specifies a schema with a string column and an integer column. Beyond the eval method, a UDTF class can also include an optional terminate method. Spark calls the terminate method exactly once for each partition of data, after all input rows have been processed by the eval method. This is highly useful for aggregation. If your eval method tracks an internal counter across multiple input rows, the terminate method can yield one final row containing that total count before the partition closes. When you call a UDTF in a DataFrame operation, it behaves like an inline table. If you pass an existing DataFrame column into the UDTF, Spark applies the table function row by row. Because a table function outputs multiple rows for each single input row, combining this output with your original dataset requires an implicit lateral join. Spark handles this behind the scenes, duplicating the original row data to match the newly exploded rows generated by your Python class. The defining power of a Python UDTF is completely unbinding your input volume from your output volume, allowing a single data point to blossom into a full multi-column dataset. That is all for this one. Thanks for listening, and keep building!
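A minimal sketch of the ProcessWords example walked through above, extended with the optional character-count column; it assumes PySpark 3.5+ where Python UDTFs are available, and the exact schema string is illustrative.

```python
from pyspark.sql.functions import udtf, lit

# The decorator is mandatory: it declares the output schema for every yielded row.
@udtf(returnType="word: string, length: int")
class ProcessWords:
    def eval(self, text: str):
        # One input row in, many output rows out: yield once per word.
        for word in text.split(" "):
            yield (word, len(word))

# The decorated class can be called like an inline table function.
ProcessWords(lit("the quick brown fox")).show()
```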
18

The Pandas API on Spark

3m 43s

Scale your existing Pandas scripts to infinity. Discover how the pyspark.pandas API allows you to execute standard Pandas syntax natively on a distributed Spark cluster.

Download
Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 18 of 21. You have a local data script that works perfectly, but suddenly your dataset size quadruples and your machine runs out of memory. You know the syntax flawlessly, but rewriting everything for a distributed framework takes days. The pandas API on Spark bridges this exact gap. The pandas API on Spark allows you to run standard pandas workloads on a distributed cluster. It does not just blindly emulate pandas. It intercepts your pandas code and translates it into optimized Spark execution plans under the hood. To use it, you import the module named pyspark dot pandas. The standard convention is to assign it the alias ps, directly mirroring the familiar pd alias used in local data science workloads. If you already have a standard local pandas DataFrame in memory, transitioning is straightforward. You invoke a function called from pandas on your ps module and pass in your local DataFrame. This converts the single-node object into a distributed pandas-on-Spark DataFrame. From that point on, the syntax you use to interact with this new object remains identical to what you already know. This consistency extends to how the data is processed internally. The distributed API natively handles missing data exactly as local pandas does. If your dataset contains NumPy Not-a-Number values, the pandas API on Spark manages them properly during mathematical operations or structural transformations. You do not need to invent new data cleaning logic for your Spark jobs. Standard operations translate directly. If you want to group your data by a specific column, you call the standard grouping function. If you want to compute the mean or sum, you chain the aggregate function right after. You can even call plotting functions directly on the distributed DataFrame. Spark processes the heavy computations across the cluster, aggregates the necessary data points, and returns the visualization just as if you were working on a single machine. Here is the key insight. The architecture underneath is fundamentally different, and that introduces a critical edge case regarding index generation. Local pandas relies heavily on a sequential, strictly ordered index for every single row. Spark, however, partitions data and distributes it across multiple independent machines. Enforcing a strict, globally ordered sequential index across a distributed system requires constant communication between worker nodes. When you create a pandas-on-Spark DataFrame without explicitly defining an index column, the API automatically generates a default index to perfectly mimic standard pandas behavior. Creating and maintaining this default index requires synchronizing state across the entire cluster. If you are operating on a massive dataset, this synchronization introduces severe performance overhead. The API will often issue a warning regarding this internal overhead when it executes. To avoid this bottleneck, it is highly recommended to assign an existing column as the index right away or configure the API to use a distributed-friendly index type. The pandas API on Spark gives you the exact syntax of pandas powered by the distributed execution engine of Spark, but remembering that strict sequential indexes carry a heavy synchronization cost will save your cluster from unnecessary slowdowns. That is it for today. Thanks for listening — go build something cool.
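A short sketch of the workflow described above, assuming pyspark.pandas is available (Spark 3.2+); the city and sales columns are made up, and the closing lines mirror the episode's advice about avoiding the default index.

```python
import pandas as pd
import pyspark.pandas as ps

# Convert an existing local pandas DataFrame into a distributed one.
pdf = pd.DataFrame({"city": ["Oslo", "Oslo", "Riga"], "sales": [10.0, 20.0, 5.0]})
psdf = ps.from_pandas(pdf)

# Familiar pandas syntax, translated into Spark execution plans under the hood.
print(psdf.groupby("city")["sales"].mean())

# Avoid the default-index synchronization cost on large data:
psdf = psdf.set_index("city")                                # reuse an existing column
ps.set_option("compute.default_index_type", "distributed")   # or a cheaper index type
```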
19

Load and Behold: Storage Formats

3m 32s

Not all file formats are created equal. We contrast row-based CSVs with columnar formats like Parquet and ORC, exploring read/write options and optimal storage techniques.

Download
Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 19 of 21. Saving a massive dataset as a CSV is the easiest thing in the world, and it is also one of the most destructive things you can do to your data lake's performance. You pay for more storage, you pay for more compute, and every downstream query crawls. The fix is in how you handle Load and Behold: Storage Formats, and why how you save your data matters just as much as how you transform it. PySpark uses a unified interface for reading and writing data across dozens of storage systems. You call the read or write attribute on your Spark session or DataFrame, specify a format, provide a chain of options, and point it to a file path. It is a predictable pattern, but the options you choose dictate how much work your cluster has to do later. Let us start with the human-readable formats, CSV and JSON. These are row-based formats. When you read a CSV, Spark parses the data line by line. You often need to chain specific options to make sense of the text. For instance, you might chain an option to tell Spark the file has a header, another option to set a custom delimiter like a pipe or a tab, and a third option to define exactly what a null value looks like, perhaps passing a specific string so Spark correctly maps it to an empty value instead of treating it as literal text. JSON is slightly better because it handles nested structures natively, but it repeats the schema keys for every single record, massively inflating the file size. Both formats force Spark to read the entire row from disk, even if your query only asks for a single column. This is where columnar formats like Parquet and ORC come in. Pay attention to this bit. Analytical queries rarely need every column in a wide table. They usually need specific columns across millions of rows to run aggregations. Parquet and ORC store data organized by column, not by row. If you query three columns out of a hundred, Spark only reads the chunks of the file containing those three columns. It skips the rest entirely, cutting disk input and output to a fraction of what a CSV requires. Because data of the same type is stored together, columnar formats also compress beautifully. A directory of JSON files might shrink by seventy percent or more when converted to Parquet. They also embed the exact schema and data types in the file metadata, meaning Spark does not have to guess or infer types on load. When you are ready to write this data back out, you have to manage state at the destination. By default, if you try to write to a path where data already exists, Spark throws an error to prevent accidental data loss. You control this using the mode method before triggering the save. If you pass the string overwrite, Spark deletes the existing data at the target path and replaces it with your current DataFrame. If you pass append, Spark simply adds your new part files to the existing directory. There is also an ignore mode, which silently does nothing if the directory is already populated. Writing clean, typed, columnar data today saves your cluster hours of wasted processing time tomorrow. If you want to help keep these episodes coming, you can support the show by searching for DevStoriesEU on Patreon. Thanks for spending a few minutes with me. Until next time, take it easy.
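A hedged sketch of the read and write patterns discussed above; the file paths, delimiter, and null marker are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Row-based text input: each option tells Spark how to parse the raw lines.
df = (spark.read.format("csv")
      .option("header", True)        # first line holds column names
      .option("sep", "|")            # pipe-delimited file
      .option("nullValue", "NA")     # map the literal string NA to null
      .load("/data/raw/orders"))     # hypothetical path

# Columnar output: schema and types travel with the Parquet files.
df.write.mode("overwrite").parquet("/data/curated/orders")   # hypothetical path
```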
20

Bug Busting: Physical Plans and Joins

3m 04s

Peek under the hood of Spark's execution engine. Learn how to debug queries using DataFrame.explain() and how to eliminate costly shuffles by using Broadcast joins.

Download
Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 20 of 21. Your PySpark job is not slow because it is computing data. It is slow because it is spending all its time moving data across the network. When a simple join brings your cluster to a crawl, the solution lies in Bug Busting: Physical Plans and Joins. When you write a PySpark script, you define logical operations. You tell Spark what you want, not how to do it. But when a job underperforms, you need to know exactly how Spark executed your request. You do this by calling the explain method on your DataFrame. Calling explain prints the physical plan. This is the blueprint of actual tasks Spark runs on your cluster. You read this plan from the bottom up, tracing the data from the source files all the way to the final output. If you look at the physical plan for a standard join between two DataFrames, you will likely see a step called SortMergeJoin. To perform a SortMergeJoin, Spark must ensure that rows with the same join keys are physically located on the same executor. To achieve this, Spark performs an Exchange. Exchange is the physical plan term for a network shuffle. It means Spark is ripping data out of partitions, pushing it across the network, and writing it to disk so the other executors can read it. Shuffling is the single most expensive operation in distributed computing. Here is the key insight. If you are joining a massive fact table to a small lookup table, shuffling the large table is a massive waste of resources. Instead of shuffling both tables to align the keys, you can just send the entire small table to every executor. This is done using the broadcast function from the PySpark SQL functions module. When you call your join method, you simply wrap the smaller DataFrame in the broadcast function. By wrapping the small table, you give Spark a strict directive. Spark will collect the small DataFrame to the driver node, and then transmit a complete copy of it to the memory of every single executor. Now, when the large DataFrame is processed, the executors already have all the lookup data they need right there in RAM. They just stream through their existing partitions and match the rows locally. No sorting is needed, and no data from the large table moves across the network. If you call explain on this new broadcasted join, the physical plan looks completely different. The SortMergeJoin is gone. The expensive Exchange step is completely absent. In their place, you will see a BroadcastExchange and a BroadcastHashJoin. The BroadcastExchange just moves the small table once, and the join itself happens entirely in place. The easiest way to double the speed of a Spark job is to stop moving data that does not need to move. Read your physical plans, spot the network exchanges, and broadcast your small tables. That is it for today. Thanks for listening — go build something cool.
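A minimal sketch of the before-and-after comparison described above; the facts and lookup DataFrames are synthetic, and the auto-broadcast threshold is disabled only so the first plan actually shows the shuffle-based join.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Disable automatic broadcasting so the contrast between the two plans is visible.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# A large fact table and a small lookup table (both synthetic).
facts = spark.range(10_000_000).withColumnRenamed("id", "country_id")
lookup = spark.createDataFrame([(0, "NO"), (1, "LV")], ["country_id", "code"])

shuffled = facts.join(lookup, "country_id")
shuffled.explain()   # physical plan shows Exchange + SortMergeJoin

hinted = facts.join(broadcast(lookup), "country_id")
hinted.explain()     # BroadcastExchange + BroadcastHashJoin, no shuffle of the fact table
```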
21

Profiling PySpark Memory and Performance

3m 46s

We wrap up our PySpark journey by introducing native profiling tools. Learn how to track memory consumption line-by-line and expose hidden Python internal tracebacks.

Download
Hi, this is Alex from DEV STORIES DOT EU. PySpark Fundamentals, episode 21 of 21. Debugging distributed Python code usually means digging through thousands of lines of meaningless Java errors, trying to guess why your function failed or why it consumed all the memory on your cluster. You no longer have to guess. Today we are looking at profiling PySpark memory and performance, along with simplifying stack traces. When you write a User Defined Function, or UDF, in PySpark, your Python code runs on top of a Java Virtual Machine infrastructure. If your Python code divides by zero or references a missing dictionary key, that simple Python exception gets swallowed. It is passed back through the PySpark daemon, across the network, and wrapped in massive Java exceptions. Finding the actual Python error in your logs is tedious. You can fix this by enabling simplified tracebacks. When you set the Spark configuration for simplified traceback to true, PySpark changes how it reports errors. It strips away all the Java interoperability logs and worker process noise. The next time a UDF fails, your console will output a standard, clean Python stack trace showing the exact line number in your Python file where the exception occurred. Fixing crashes is only half the battle. Fixing slow or memory-hungry code is much harder. If you write a Pandas UDF that processes millions of rows, it might run successfully but take far too long or trigger out-of-memory errors on your executor nodes. Historically, finding the bottleneck required adding manual logging or guessing which Pandas operation was inefficient. Spark 4.0 changes this by introducing built-in Python UDF profilers. Here is the key insight. You can now profile your distributed Python code line by line, directly within PySpark. To use this, you set the UDF profiler configuration to one of two modes: performance or memory. If you set the profiler configuration to the word "perf", Spark activates the performance profiler. You then run your Spark job as normal. As the worker nodes execute your Pandas UDF, Spark tracks the execution time of every single line of your Python function. Once your job finishes, you call the show method on the Spark profile object. Spark will print a detailed report to your console. For every line of your code, you will see exactly how many times it was called and the total time spent executing it. You can instantly see if a specific string manipulation or mathematical operation is slowing down your entire pipeline. If you are dealing with memory limits, you set the UDF profiler configuration to the word "memory" instead. The workflow is exactly the same, but the output changes. When you view the profile report, Spark shows you the exact megabyte increment caused by each line of your Python code. You can see exactly where large arrays are being allocated and where memory is failing to release. This line-by-line visibility takes the guesswork out of optimizing complex data transformations. You can pinpoint the exact cause of your performance issues without leaving your PySpark environment. Since this is the final episode of our PySpark series, I encourage you to check out the official Spark documentation and try these debugging tools hands-on. If you have ideas for what technologies we should cover in our next series, drop by devstories.eu and let us know. Thanks for spending a few minutes with me. Until next time, take it easy.
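A tentative sketch of the profiling workflow, based on the Spark 4.0 behavior described above; verify the exact property names against your Spark version, and note that the normalize UDF and the value column are purely illustrative.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, col

spark = SparkSession.builder.getOrCreate()

# Clean Python stack traces instead of JVM noise when a UDF fails.
spark.conf.set("spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled", True)

# Enable the line-by-line performance profiler ("memory" is the other mode).
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

@pandas_udf("double")
def normalize(v: pd.Series) -> pd.Series:
    return (v - v.mean()) / v.std()

df = spark.range(100_000).select(col("id").cast("double").alias("value"))
df.select(normalize("value").alias("normalized")).collect()

# Print per-line call counts and timings gathered on the worker processes.
spark.profile.show(type="perf")
```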