Season 16 · 17 Episodes · 59 min · 2026

Snowflake & Snowpark Python

2026 Edition. A deep dive into Snowflake's core architecture and the Snowpark Python library. Discover how to work with DataFrames, custom UDFs, Stored Procedures, and pandas on Snowflake, and how to train ML models natively in the Data Cloud.

Cloud Data Warehousing · Data Science
1
The Snowflake Architecture
Discover the foundational architecture of Snowflake. This episode explains the separation of compute and storage, and how Snowflake combines shared-disk and shared-nothing architectures.
3m 30s
2
Introducing Snowpark Python
Learn how Snowpark brings Python to your data. We cover the DataFrame abstraction and how Snowpark translates Python code into distributed SQL.
3m 21s
3
Establishing a Session
Learn how to securely connect to Snowflake using Snowpark. We discuss Session builder configurations, external browser SSO, and managing credentials.
3m 34s
4
Python Worksheets in Snowsight
Discover how to write and execute Snowpark Python directly within the Snowflake UI. Skip the local setup and utilize pre-installed Anaconda packages.
2m 48s
5
Constructing Snowpark DataFrames
Get started with the core abstraction of Snowpark: the DataFrame. Learn how to reference tables, create DataFrames from literal values, and leverage lazy evaluation.
3m 21s
6
Transforming DataFrames
Master DataFrame transformations in Snowpark. We explore filtering, selecting columns, joining datasets, and triggering execution with action methods.
3m 18s
7
Calling System Functions
Learn how to invoke Snowflake's powerful built-in SQL functions directly from your Python DataFrames without rewriting logic.
3m 14s
8
pandas on Snowflake: The Paradigm Shift
Discover how to run standard pandas code at massive scale. We introduce the Modin plugin that allows pandas execution natively within Snowflake.
3m 38s
9
pandas on Snowflake: Hybrid Execution
Dive into the mechanics of Hybrid Execution in pandas on Snowflake. Learn how the engine seamlessly switches between distributed cloud compute and local memory.
3m 44s
10
Ingesting External Data
Learn how to ingest data from external systems using Snowpark DB-API and JDBC. Bring operational data directly into your analytical pipelines.
3m 22s
11
Creating Scalar UDFs
Bring custom Python logic to your SQL queries. This episode covers creating anonymous and named User-Defined Functions (UDFs) in Snowpark.
3m 40s
12
Dependency Management for UDFs
Learn how to import third-party libraries and local modules into your Snowflake UDFs using the Anaconda channel and session imports.
3m 44s
13
Building User-Defined Table Functions
Move beyond scalar values. Learn how to build User-Defined Table Functions (UDTFs) to return multiple rows and columns from a single input.
3m 26s
14
Authoring Stored Procedures
Automate your pipelines entirely within Snowflake. We explore creating Stored Procedures to execute complex business logic and control flow.
3m 42s
15
Training ML Models in Snowflake
Discover how to train heavy machine learning models securely on Snowpark-optimized warehouses using custom Python Stored Procedures.
3m 50s
16
Dynamic File Access with SnowflakeFile
Learn how to stream large unstructured files dynamically from internal stages directly inside your UDFs and Stored Procedures.
3m 27s
17
Testing Snowpark Python
Ensure your data pipelines are robust. We cover setting up PyTest, creating Session fixtures, and unit testing DataFrame transformations.
3m 42s

Episodes

1

The Snowflake Architecture

3m 30s

Discover the foundational architecture of Snowflake. This episode explains the separation of compute and storage, and how Snowflake combines shared-disk and shared-nothing architectures.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 1 of 17. The biggest bottleneck in big data is rarely storage. The real friction happens when you have to move massive datasets across the network just to reach your compute layer. The Snowflake Architecture completely bypasses this problem. People often assume Snowflake is just an on-premise relational database lifted and shifted into the cloud, or a traditional Hadoop cluster you have to babysit. It is neither. Snowflake is a fully self-managed service built natively for the cloud. You do not install software. You do not configure hardware. All maintenance, management, and tuning are handled for you behind the scenes. To understand how it achieves this, look at its hybrid architecture. Snowflake combines a shared-disk setup with a shared-nothing setup. In a traditional shared-disk design, all compute nodes access one central data repository. In a shared-nothing design, each compute node has its own local chunk of data to process. Snowflake takes the best of both. It keeps a single central storage repository accessible to all compute nodes, but it uses independent massively parallel processing clusters to run the actual queries against that data. This hybrid model breaks down into three distinct layers. First is the Database Storage layer. When you load data into Snowflake, it does not just dump raw files into a cloud bucket. It reorganizes that data into an internal, optimized, compressed, columnar format. Snowflake manages all aspects of how this data is stored. You cannot access the raw storage directly, you only interact with it through SQL queries. The second layer is Query Processing. This is where the work happens using what Snowflake calls Virtual Warehouses. A virtual warehouse is simply an independent compute cluster. Because compute is strictly separated from the central storage layer, multiple virtual warehouses can query the exact same data simultaneously. They do not compete for resources, and they do not lock each other out. Here is the key insight. You can scale your compute power instantly without moving a single byte of data. Consider an end-of-month financial report that requires massive data aggregation. Normally, you might have to migrate tables to a larger server to handle the complex workload. In Snowflake, you just spin up a larger virtual warehouse, run the heavy report in minutes, and turn the warehouse off when you finish. The underlying tables never move. The third layer tying everything together is Cloud Services. This is the brain of the platform. It is a collection of services that coordinate everything across the system. It handles user authentication, manages infrastructure, parses incoming queries, and optimizes execution paths. It also manages metadata, tracking exactly where specific pieces of data live in the storage layer so queries can run efficiently without scanning everything. The most powerful takeaway is that in Snowflake, performance scaling is just a toggle switch on the compute cluster, while the massive data tables remain completely stationary. If you are enjoying the series and want to support the show, you can find us by searching for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
2

Introducing Snowpark Python

3m 21s

Learn how Snowpark brings Python to your data. We cover the DataFrame abstraction and how Snowpark translates Python code into distributed SQL.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 2 of 17. What if you could write pure Python but have it execute as hyper-optimized distributed SQL? That is exactly the paradigm shift we are covering today with an introduction to Snowpark. Traditionally, working with large datasets in Python meant pulling data out of your database. You write a query, pull the data across the network into local memory or an external processing cluster, run your transformations, and then push the results back. This approach creates network bottlenecks, increases infrastructure costs, and introduces security risks by moving data out of a governed environment. Snowpark flips this model. Instead of moving data to the compute, Snowpark pushes the compute down to the data. A common misconception is that Snowpark is an external processing engine. People often assume they need to spin up and manage a separate compute cluster just to run Python code. It is not. Snowpark computations run directly within Snowflake. You are using the exact same virtual warehouses that execute your standard SQL queries. There is no new infrastructure to provision or maintain. When you write code using the Snowpark Python API, you are interacting with a DataFrame object. This feels very similar to working with familiar Python data tools. You chain operations together, defining how to select, filter, group, or aggregate your data. However, the Snowpark library does not execute these operations locally. Instead, it natively translates your Python DataFrame operations into complex SQL constructs. Snowflake then processes these SQL statements using its standard execution engine. Here is the key insight. Snowpark uses lazy evaluation. When you define your DataFrame and apply various transformations, no data actually moves or changes. The Snowpark API simply builds up a logical plan. It records your intent step by step. Actual execution only triggers when you call a specific action, such as asking the system to return the final results or write them into a new table. This lazy evaluation model combined with native SQL translation drastically reduces data transfer. Consider a scenario where you are a data engineer tasked with filtering a billion rows of transaction data to isolate a few specific anomalies. You write your filter logic in a local Python script. Because the execution is delayed until the action is called, the API has time to optimize the entire chain of operations and translate it into a single, highly efficient SQL query. The actual filtering happens entirely inside Snowflake. The database does all the heavy lifting. The only data that travels across the network to your local machine is the final, small subset of anomalies. You just processed a billion rows using Python syntax without pulling a single raw record into your local memory. The real advantage here is keeping the developer experience native to Python while leveraging the database engine for what it does best. You write Python, but Snowflake executes SQL. That is all for this one. Thanks for listening, and keep building!
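To make the lazy-evaluation flow concrete, here is a minimal sketch, assuming a configured connection; the TRANSACTIONS table and AMOUNT column are hypothetical:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

session = Session.builder.getOrCreate()  # picks up a configured connection

# No data moves here: these lines only build a logical plan.
transactions = session.table("TRANSACTIONS")
anomalies = transactions.filter(col("AMOUNT") > 1_000_000)

# The action triggers a single SQL query inside Snowflake; only the small
# matching subset travels back over the network.
for row in anomalies.collect():
    print(row)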
3

Establishing a Session

3m 34s

Learn how to securely connect to Snowflake using Snowpark. We discuss Session builder configurations, external browser SSO, and managing credentials.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 3 of 17. Connection boilerplates are often a security nightmare, filled with hardcoded passwords and scattered environment variables. You want to give your team a local command line tool, but you absolutely cannot bake credentials into the code. The solution is Establishing a Session cleanly using Snowpark's native configuration features. Every Snowpark script starts by creating a Session object. This is your active link to the Snowflake environment. It encapsulates your authentication, your target warehouse, and your context. To build it, you use the builder object available on the Session class. The most direct approach is passing a dictionary. You create a Python dictionary with keys for your account, user, password, role, database, and warehouse. You pass that dictionary into the configs method of the session builder, then call the create method. The logic flows straight from your dictionary into the Snowflake backend. But hardcoding credentials inside a Python script is a massive security risk, making this approach unsuitable for shared tools. To remove credentials from your code, Snowpark relies on a standard configuration file called connections.toml. You place this file either in your project root or in a hidden snowflake directory inside your user home folder. The TOML file holds connection blocks. You can define a default block, a dev block, and a prod block, each with its own connection parameters. Here is the key insight. When you use a TOML file, your Python code simplifies dramatically. You do not pass a dictionary at all. You just call the get or create method on the session builder. Snowpark automatically scans the standard directories, finds your TOML file, reads the default block, and instantiates the session. If you want a specific environment, you pass the connection name to the configs method before creating the session. Your code stays clean, and credentials stay securely on the local machine. We can secure our command line tool even further. Even with a TOML file, you might not want developers storing permanent passwords locally. You can eliminate passwords entirely by using browser-based single sign-on. In your TOML file, you omit the password key. Instead, you add a key called authenticator and set its value to externalbrowser. When your script reaches the session creation step, execution pauses. Snowpark intercepts the process and automatically opens the user's default web browser. It redirects the developer to your organization's identity provider. The developer logs in normally, satisfying any multi-factor authentication requirements. Once the identity provider approves the login, it sends a temporary authentication token back to a local port that Snowpark is listening on. The script receives the token, establishes the secure Snowflake session, and the Python execution resumes. Your CLI tool now has fully authenticated access to Snowflake without a single password ever touching your configuration files or your source code. The most secure credential is the one that never touches your code, and delegating authentication to the browser keeps your local scripts both safe and entirely focused on data. I would like to take a moment to thank you for listening — it helps us a lot. Have a great one!
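A minimal sketch of the connection options described above; all account names, credentials, and connection blocks are placeholders:

from snowflake.snowpark import Session

# 1. Explicit dictionary: fine for quick local tests, risky in shared code.
session = Session.builder.configs({
    "account": "my_account",
    "user": "my_user",
    "password": "my_password",
    "role": "ANALYST",
    "database": "DEMO_DB",
    "warehouse": "COMPUTE_WH",
}).create()

# 2. connections.toml: no parameters in code, the default block is picked up.
session = Session.builder.getOrCreate()
# A specific block can be selected before creating the session:
# session = Session.builder.config("connection_name", "dev").create()

# 3. Browser-based SSO: the TOML block omits the password and instead sets
#    authenticator = "externalbrowser"; session creation then pauses while
#    the developer signs in through the identity provider.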
4

Python Worksheets in Snowsight

2m 48s

Discover how to write and execute Snowpark Python directly within the Snowflake UI. Skip the local setup and utilize pre-installed Anaconda packages.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 4 of 17. You want to test a quick data transformation, but your local Python environment is broken again. Your connection strings are outdated, and you just wasted twenty minutes trying to authenticate. What if you could write code against a fully configured environment directly in your browser with zero setup? That is the exact problem solved by Python Worksheets in Snowsight. Python worksheets are a native code editor built straight into the Snowflake web interface. They remove the friction of local environment configuration. Think about prototyping a quick data cleaning script on live data. Instead of opening an IDE and managing credentials, you open a browser tab and start writing code immediately. The execution logic relies on a designated entry point. You do not write a loose, procedural script. Snowflake needs a specific function to trigger your code. By default, this is a handler function called main. This function takes a single argument, which is the active session object. Because you are already logged into the Snowflake interface, the system handles the authentication automatically and passes a live session directly into your main function. You use this session to read tables, execute queries, and manipulate data using standard Snowpark methods. Here is the key insight regarding dependencies. You might expect to open a terminal and use pip to install your libraries, but that is not how worksheets operate. You do not use pip at all. Snowflake integrates natively with the Anaconda repository. Inside the worksheet interface, there is a dedicated packages dropdown menu. You simply search for the library you need, like pandas or scikit-learn, select the version, and Snowflake provisions it for your execution environment instantly. That covers inputs and setup. What about outputs? Your handler function must return a value, and you configure how Snowflake interprets this value using the worksheet settings. You primarily choose between two return types, Table or String. If you set the return type to Table, your main function needs to return a Snowpark DataFrame. Snowflake takes that DataFrame and renders it as a clean, interactive grid in the results panel. This is perfect for inspecting the output of your data cleaning script. If you set the return type to String, your function must return a scalar text value. You use this when you want to return a status message, a numeric aggregate cast as text, or a JSON payload. You must align your code with the worksheet settings. If you configure the worksheet to expect a Table but your function returns a String, the execution will fail. The real power of Python worksheets is not building massive, multi-file applications. It is the ability to validate transformation logic against production-scale data in seconds, securely, without ever configuring a local machine. Thanks for hanging out. Hope you picked up something new.
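As a sketch, a Python worksheet handler following this pattern might look like the snippet below; the table and column names are placeholders, and the worksheet return type is assumed to be set to Table:

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col

def main(session: snowpark.Session):
    # Snowsight passes the already-authenticated session into this handler.
    cleaned = (
        session.table("RAW_EVENTS")
        .filter(col("STATUS") == "valid")
        .select("EVENT_ID", "EVENT_TS", "PAYLOAD")
    )
    # With the return type set to Table, the DataFrame renders as a grid.
    return cleaned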
5

Constructing Snowpark DataFrames

3m 21s

Get started with the core abstraction of Snowpark: the DataFrame. Learn how to reference tables, create DataFrames from literal values, and leverage lazy evaluation.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 5 of 17. If you instantiate a DataFrame representing a billion rows, how much memory does it use on your laptop? Almost zero. The reason why comes down to the core mechanics of constructing Snowpark DataFrames. A Snowpark DataFrame is not a container holding your information. It is simply a logical reference to data that resides inside Snowflake. When you define a DataFrame in your Python code, you are constructing a query plan. This relies on a concept called lazy evaluation. Under lazy evaluation, Snowpark delays executing the underlying SQL until you explicitly trigger an action that requires a final result. Until that trigger happens, every DataFrame you build is just a lightweight set of instructions. The primary way to start building these instructions is the session dot table method. You pass the name of an existing Snowflake table or view as a string. You can provide just the table name, or a fully qualified name including the database and schema. A frequent mistake new users make is assuming that calling session dot table downloads the table contents into their local Python environment, similar to how Pandas operates. It does not. If you point session dot table at a ten-terabyte transaction log, the Python function returns in milliseconds. No data travels over the network. The DataFrame object just registers that specific table as the root node of your execution plan. Another approach is the session dot sql method. You use this when you want to define a DataFrame using a raw SQL string. This is particularly useful if you have an existing query you want to port over, or if you prefer to express a specific initial extraction in SQL. Just like the table method, passing a query into session dot sql does not send the query to Snowflake for immediate execution. It creates a DataFrame that represents the output of that specific query. You can think of it as defining an inline view. The third mechanism is the session dot create dataframe method. This one operates differently because it starts with data already sitting on your local machine. You can pass a Python list, a dictionary, or a Pandas DataFrame into this method. When executed, Snowpark takes that local data, uploads it to Snowflake, and places it inside a temporary table. The method then returns a Snowpark DataFrame pointing to that new temporary table. Here is the key insight. Because session dot create dataframe physically moves data from your machine to Snowflake, its performance depends entirely on the size of your local dataset and your network connection. You use this for pushing small lookup tables or configuration parameters into Snowflake so they can interact with your larger datasets. You can let Snowpark infer the column names and data types from your local objects, or you can explicitly define a schema using a Snowpark StructType object to ensure precision. Every time you construct a Snowpark DataFrame, regardless of the method, you are defining a relationship to data rather than retrieving the data itself. That is all for this one. Thanks for listening, and keep building!
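Here is a short sketch of the three construction methods; the table, query, and schema are hypothetical:

from snowflake.snowpark import Session
from snowflake.snowpark.types import StructType, StructField, StringType, IntegerType

session = Session.builder.getOrCreate()

# Reference an existing table: returns immediately, nothing is downloaded.
orders = session.table("SALES_DB.PUBLIC.ORDERS")

# Wrap a raw SQL query as a DataFrame, effectively an inline view.
recent = session.sql(
    "SELECT * FROM SALES_DB.PUBLIC.ORDERS WHERE ORDER_DATE >= CURRENT_DATE - 30"
)

# Push small local data into a temporary table, with an explicit schema.
lookup_schema = StructType([
    StructField("COUNTRY_CODE", StringType()),
    StructField("REGION_ID", IntegerType()),
])
lookup = session.create_dataframe([("DE", 1), ("FR", 2)], schema=lookup_schema)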
6

Transforming DataFrames

3m 18s

Master DataFrame transformations in Snowpark. We explore filtering, selecting columns, joining datasets, and triggering execution with action methods.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 6 of 17. You write ten complex DataFrame operations in a row, but the system does not run ten separate queries. Instead, it delays execution and builds one perfectly optimized SQL statement under the hood. Today, we are talking about Transforming DataFrames. When you transform a DataFrame, you are setting up a recipe for how to shape your data. To build this recipe, you need a reliable way to reference the underlying data columns. You do this using a function named col. You pass the name of a column as a string into this function, and it returns a column object. You then use this object inside other methods to build expressions, like checking if a column value equals a specific number or text string. Let us look at a concrete scenario. You have two DataFrames. One holds customer profiles, and the other holds transaction records. You want to identify high-value users and connect them to their recent purchases. If you wrote this in raw SQL, you might end up with a large, nested query that forces you to read the logic from the inside out. With DataFrames, you use method chaining to write the logic from top to bottom. First, you take your customer DataFrame and call the filter method. Inside that filter, you use the col function to point to the spending tier column, and specify that it must equal your top tier. Right after that filter, you chain a select method. You use the col function again to state that you only want to keep the customer ID and the email address. Every transformation returns a new DataFrame, allowing you to attach the next instruction directly to the end of the previous one. Now, the second piece of this is bringing in the transaction data. You call the join method on your filtered customer DataFrame. You pass the transaction DataFrame as the first argument. Then, you specify the condition for the join, such as matching the customer ID column from both datasets. You can also define the type of join, like an inner or left join. The logic reads sequentially. You filter, you select, and you join. Here is the key insight. As Python reads this chain of methods, Snowpark does not touch the actual data. Transformations are evaluated lazily. The DataFrame simply acts as a blueprint. It records every filter, select, and join you request, but it does not execute them step by step. This delayed execution allows the framework to look at your entire sequence of operations and optimize it before anything actually runs. The execution only triggers when you finally request a concrete result. This requires calling an action method. If you want to print a small sample of the data to your terminal, you call the show method. If you want to download the fully processed results back into your local Python memory, you call the collect method. The moment you call collect, Snowpark translates your blueprint into a single, efficient SQL query and sends it to Snowflake. You get the benefit of writing step-by-step, readable Python, while the database engine still processes the data in one highly optimized pass. That is your lot for this one. Catch you next time!
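A minimal sketch of the filter, select, and join chain from this scenario, with hypothetical table and column names:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

session = Session.builder.getOrCreate()

top_customers = (
    session.table("CUSTOMERS")
    .filter(col("SPENDING_TIER") == "PLATINUM")
    .select(col("CUSTOMER_ID"), col("EMAIL"))
)

transactions = session.table("TRANSACTIONS")

# Join on the shared CUSTOMER_ID column; nothing has executed yet.
enriched = top_customers.join(transactions, "CUSTOMER_ID", "inner")

enriched.show()              # action: prints a small sample
rows = enriched.collect()    # action: runs one optimized query, returns Row objects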
7

Calling System Functions

3m 14s

Learn how to invoke Snowflake's powerful built-in SQL functions directly from your Python DataFrames without rewriting logic.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 7 of 17. You are building a data pipeline and suddenly need a complex string manipulation or an obscure mathematical conversion. You do not have to write that logic from scratch in Python. The database already knows how to do it. That is where Calling System Functions comes in. Snowflake has hundreds of built-in system functions written in highly optimized C++. These handle everything from parsing text to calculating standard deviations. Snowpark allows you to trigger these native SQL functions directly from your Python code. You get the familiar structure of Python, but the heavy lifting happens entirely inside Snowflake. The actual data never leaves the server to be processed in your local Python environment. To access these, you import the snowflake dot snowpark dot functions module. This module contains direct Python wrappers for most standard SQL operations. If you need to convert text to uppercase, you just call the upper function from this module and pass it your column object. Behind the scenes, Snowpark translates your Python code into the equivalent SQL syntax and pushes it down to the database for execution. This covers the majority of use cases. But what happens when Snowflake releases a new SQL function, or you need to use a specialized system function that does not have an explicit Python wrapper in the module yet? You do not have to wait for the next release of the Snowpark library to use it. This is the part that matters. You can use a dedicated method named call_function to invoke any Snowflake system function by its exact SQL name. Consider a scenario where you have a table with millions of rows containing angle measurements in degrees, and your downstream machine learning model requires them in radians. Instead of writing a custom Python math transformation that drags all those rows into memory, you let the database engine do the work. You import call_function. Then, you invoke it by passing the string name radians as the first argument. The second argument is the column object you want to transform. Snowpark instantly pushes that down as a native SQL radians operation across your entire dataset. It executes with the speed of compiled C++ code, and it takes you exactly one line of Python to write. The call_function method is flexible. If the underlying SQL function requires multiple inputs, you simply pass them in sequence after the function name. These inputs can be other column objects, or they can be literal values, like a specific string or number. Snowpark maps your Python arguments directly to the expected SQL parameters. This design keeps your code clean while maximizing performance. You do not lose access to the raw power of Snowflake native SQL functions just because you are writing in a different language. By leveraging system functions, you treat Python as the orchestrator, letting the database engine do exactly what it was built to do. Thanks for tuning in. Until next time!
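To illustrate the wrapped functions versus call_function, here is a small sketch against a hypothetical ANGLE_MEASUREMENTS table:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, upper, call_function

session = Session.builder.getOrCreate()

measurements = session.table("ANGLE_MEASUREMENTS")

converted = measurements.select(
    upper(col("SENSOR_NAME")).alias("SENSOR_NAME"),                 # wrapped system function
    call_function("radians", col("ANGLE_DEG")).alias("ANGLE_RAD"),  # any SQL function by name
)
converted.show()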
8

pandas on Snowflake: The Paradigm Shift

3m 38s

Discover how to run standard pandas code at massive scale. We introduce the Modin plugin that allows pandas execution natively within Snowflake.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 8 of 17. Scaling an existing pandas pipeline traditionally meant throwing away your code and rewriting everything in a distributed framework like Spark. What if you could handle a dataset one hundred times larger just by changing a single import statement? That is the premise of pandas on Snowflake. pandas is the universal standard for data manipulation in Python. It is expressive, readable, and deeply ingrained in how data engineers and scientists work. The problem is that pandas is fundamentally constrained by single-node execution. It requires all your data to fit in local memory. When your dataset outgrows your laptop or your cloud instance, standard pandas breaks. pandas on Snowflake represents a complete paradigm shift for these legacy pipelines. It acts as a zero-friction migration tool. Suppose you have an existing Python script that cleans and aggregates daily transaction logs. It works perfectly on ten gigabytes of data, but suddenly you are asked to process a terabyte. To scale this, you do not need to rewrite your logic into SQL. You do not even need to learn the specific syntax of the standard Snowpark DataFrame API. You only change one line at the top of your file. You remove the standard import pandas as pd. You replace it with import snowflake.snowpark.modin.pandas as pd. This is where it gets interesting. A common point of confusion is how this actually processes data. Many developers assume this is just standard pandas operating on a live connection, pulling millions of rows down to their local machine to process. That is incorrect. When you use this specific Modin import, no data is downloaded to your machine. Instead, this library intercepts every pandas command you write in your script. When you chain a filter, a group-by, and a mean aggregation, the underlying engine translates that exact sequence into an optimized Snowflake SQL query. It transpiles your pandas syntax into SQL, and then submits that SQL to the Snowflake compute engine. The operation executes entirely inside your Snowflake warehouse. Your local machine is only responsible for orchestrating the commands. The heavy lifting is distributed across Snowflake clusters. This gives you the exact developer experience of pandas, but backed by a massively parallel database. This shift solves two major organizational headaches at once. First, it eliminates the data egress problem. Because the operations are pushed down to the database, your raw data never leaves the secure Snowflake perimeter. Second, it saves months of developer retraining. Your team continues writing the familiar dataframes, joins, and aggregations they already know perfectly. They just run them against the cloud instead of their local processor. The true power of pandas on Snowflake is that it decouples the language you use to describe your data transformations from the engine that actually executes them. If you find these technical deep dives helpful, you can support the show by searching for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
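A sketch of the one-line switch described here. The import path follows this episode; depending on your library version the recommended import may differ, and the table and column names are placeholders:

from snowflake.snowpark import Session
import snowflake.snowpark.modin.pandas as pd   # replaces "import pandas as pd"

Session.builder.getOrCreate()   # an active Snowpark session is required

# The familiar pandas chain below is transpiled to SQL and runs in Snowflake.
logs = pd.read_snowflake("DAILY_TRANSACTIONS")
summary = (
    logs[logs["STATUS"] == "settled"]
    .groupby("MERCHANT_ID")["AMOUNT"]
    .mean()
)
print(summary.head())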
9

pandas on Snowflake: Hybrid Execution

3m 44s

Dive into the mechanics of Hybrid Execution in pandas on Snowflake. Learn how the engine seamlessly switches between distributed cloud compute and local memory.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 9 of 17. Your code is now smart enough to know exactly when to use your laptop memory and when to fire up a cloud cluster. This is the core concept behind pandas on Snowflake: Hybrid Execution. When you use standard pandas, every operation runs in your local machine memory. If you load a massive dataset, your process crashes. The Snowpark pandas API solves this by mapping your pandas commands to SQL and executing them in the Snowflake warehouse. But not every pandas function has a direct SQL equivalent. Certain operations, like plotting or applying specific custom Python functions, require the actual local pandas library to execute. The challenge is seamlessly shifting between the cloud engine and your local environment without dragging gigabytes of raw data across the network. Hybrid execution handles this transition dynamically. It is an execution model built entirely around optimizing for the lowest possible data transfer cost. When you run a script using the Snowpark pandas API, the engine acts as an intelligent router. Its default behavior is to keep the data in the cloud. It translates your code into SQL queries and executes them using Snowflake compute. Let us walk through a practical scenario. You start with a table containing ten million rows. You write a line of code to filter out invalid records, and then you group the data to calculate an average. Up to this point, the engine translates your filter and aggregation commands directly into SQL. The Snowflake warehouse performs the heavy lifting. Your local machine does not download a single row of that massive dataset. The aggregation reduces those ten million rows down to a summary table of exactly seven rows. For your next step, you write a command to plot those seven rows into a chart. Plotting is inherently a local operation. The Snowflake database engine cannot draw a chart for you. Here is the key insight. The engine recognizes that the plotting command requires a transition to the local pandas backend. Because the engine processes your operations intelligently, it looks at the state of your data right before the local operation is requested. It knows the preceding aggregation results in only seven rows. So, instead of moving the original ten million rows to your laptop to process the aggregation locally, it lets Snowflake finish the math in the cloud. Then, it triggers a download of just those final seven rows into your local memory. Once the data arrives in your local environment, standard pandas takes over. The local library executes the plot command using those seven rows. Any subsequent operations you perform on that specific data structure will continue to run locally, because the data has already crossed the network boundary. The system isolates the heavy data manipulation inside the cloud infrastructure. It defers network transfer until an operation strictly demands local execution. By the time that demand occurs, your data manipulation steps have usually reduced the dataset to a fraction of its original size. You retain the API syntax you already know, but you completely avoid the network bottleneck. The defining feature of hybrid execution is not just that it falls back to local memory, but that it strategically shrinks the payload in the cloud before making the leap. That is all for this one. Thanks for listening, and keep building!
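The scenario above might look roughly like this in code, assuming a release where hybrid execution is available; table and column names are hypothetical:

from snowflake.snowpark import Session
import snowflake.snowpark.modin.pandas as pd   # same hedged import as before

Session.builder.getOrCreate()

txns = pd.read_snowflake("TRANSACTIONS")               # stays in Snowflake

valid = txns[txns["STATUS"] != "invalid"]              # pushed down as SQL
by_region = valid.groupby("REGION")["AMOUNT"].mean()   # still SQL; a handful of rows

# Plotting has no SQL equivalent, so only the small aggregated result is
# transferred before the local pandas backend draws the chart.
by_region.plot(kind="bar")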
10

Ingesting External Data

3m 22s

Learn how to ingest data from external systems using Snowpark DB-API and JDBC. Bring operational data directly into your analytical pipelines.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 10 of 17. What if you could query a PostgreSQL database from inside your Snowflake Python script? Normally, moving data from an operational database to Snowflake requires an external middleman. You set up an extraction tool on a separate server, manage credentials, stage files, and schedule jobs just to get your data ready for analysis. Ingesting external data directly via Snowpark Python resolves this by removing the middleman entirely. This approach allows you to simplify your data architecture. By leveraging external network access in Snowflake, your Python code can reach out directly to other systems. You can use standard Python DB-API libraries or JDBC drivers to connect to external databases, right from a Snowpark stored procedure or user-defined table function. Consider a concrete scenario. You have a legacy on-premise PostgreSQL inventory database, and you need to join that inventory data with your live Snowflake customer tables. Instead of setting up a complex ingestion pipeline, you write a Snowpark Python stored procedure. Inside that procedure, you use a standard Postgres DB-API connector. You pass in your connection string, which references credentials securely managed by Snowflake. You open the connection, execute a standard SQL select statement against the external PostgreSQL database, and fetch the result set. Once the data is in memory within your Python function, you simply yield the rows back to Snowflake, or convert them directly into a Snowpark DataFrame. You now have your external inventory data available as a native Snowflake object, ready to be joined with your customer tables in a single query. The logic flows from the external operational system straight into your analytical environment, securely and without intermediate staging servers. That covers relational inputs. What about semi-structured data sources, like nested files dropped in a stage? Snowpark also provides specific tools for parsing complex formats. A prime example is the XML RowTag reader. When you need to ingest an unwieldy XML file, you do not have to write custom parsing logic to navigate the entire document tree. Instead, you specify a row tag. This is the specific XML element that represents a single record. You call the read method on your Snowpark session, set the format to XML, and provide the row tag parameter. Snowflake scans the document, identifies every instance of that tag, and extracts it as an individual row in a DataFrame. The complex hierarchy above and below that tag is flattened or packed into columns, depending on your schema. It turns a nested text document into a clean, queryable table in one step. Here is the key insight. Whether you are opening a JDBC connection to an ancient on-premise database or pointing a row tag reader at a nested XML file, you are consolidating your ingestion logic inside the data platform. You are turning integration problems into simple code execution problems. Thanks for tuning in. Until next time!
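As a rough, heavily hedged sketch of the DB-API pattern: the external access integration, driver availability, connection details, and table names below are all placeholders that would need to exist in your account.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import sproc

session = Session.builder.getOrCreate()

@sproc(
    name="LOAD_INVENTORY",
    replace=True,
    packages=["snowflake-snowpark-python", "psycopg2"],
    external_access_integrations=["pg_access_integration"],  # pre-configured network access
)
def load_inventory(session: Session) -> str:
    import psycopg2  # standard DB-API connector
    conn = psycopg2.connect(
        host="inventory.example.internal", dbname="inventory",
        user="reader", password="***",   # in practice, read from a Snowflake secret
    )
    cur = conn.cursor()
    cur.execute("SELECT sku, quantity, warehouse FROM stock")
    rows = cur.fetchall()
    df = session.create_dataframe(rows, schema=["SKU", "QUANTITY", "WAREHOUSE"])
    df.write.save_as_table("STAGED_INVENTORY", mode="overwrite")
    return "inventory loaded"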
11

Creating Scalar UDFs

3m 40s

Bring custom Python logic to your SQL queries. This episode covers creating anonymous and named User-Defined Functions (UDFs) in Snowpark.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 11 of 17. How do you run custom Python string-parsing logic across billions of rows without moving the data? The answer lies in creating scalar UDFs, or User-Defined Functions. Standard SQL functions cover most everyday data transformations. When standard SQL fails, Python UDFs step in seamlessly. If you are dealing with heavily customized rules, like writing a complex regular expression to parse messy email addresses, pure SQL quickly becomes unreadable or impossible. Scalar UDFs solve this problem by letting you write plain Python logic and execute it natively inside the Snowflake compute engine. A scalar function simply processes one input row and returns exactly one output value. To deploy your Python code, you use Snowpark to register the function. If you only need the logic for your current script, you create an anonymous UDF. The easiest way to accomplish this is to use the at-UDF decorator. First, you write a standard Python function. Let us call it extract email domain. Inside it, you use the standard Python regex module to parse the text string and return the matched domain. Just above the function definition, you place the at-UDF decorator. When your script executes, Snowpark automatically serializes this Python function, pushes it to the server, and returns a reference to a UDF object. You can now pass this object into your DataFrame column operations, applying your custom parser across massive tables as if it were a native database feature. Alternatively, you can skip the decorator and pass your Python function explicitly into the session dot udf dot register method. Both methods create a temporary object in the database engine. The moment your Snowpark session closes, the anonymous UDF is dropped entirely. This is where it gets interesting. If your new parsing logic is valuable, you probably want to share it with other users or make it available to downstream SQL pipelines. To achieve this, you create a named, permanent UDF. Using the same register method, you add an is permanent flag set to true, and provide a string value for the name parameter. This name dictates how the function will be called in the database. When you make a UDF permanent, Snowflake needs a physical place to persist your Python code so it remains available long after your session disconnects. Because of this, you must also provide a stage location parameter. This points to an existing internal Snowflake stage. Snowpark will upload and safely store the compiled Python files directly into that stage. Once the registration finishes, anyone with the correct database permissions can call your custom Python logic right from a standard SQL query. When creating any UDF, Snowpark typically relies on Python type hints to understand what data types the function accepts and returns. If your function expects a string and returns a string, you add those standard type hints to the Python definition, and Snowpark maps them to the equivalent Snowflake data types automatically. The true power of the scalar UDF is not just the ability to write Python, but the ability to execute that Python within the distributed memory of the database engine, completely eliminating the network latency of pulling data out for processing. That is all for this one. Thanks for listening, and keep building!
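A compact sketch of both registration styles; the table, stage, and regular expression are illustrative:

import re
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

session = Session.builder.getOrCreate()

def extract_email_domain(email: str) -> str:
    match = re.search(r"@([\w.-]+)$", email or "")
    return match.group(1) if match else None

# Anonymous, temporary UDF: exists only for this session.
domain_udf = session.udf.register(extract_email_domain)
session.table("USERS").select(domain_udf(col("EMAIL")).alias("DOMAIN")).show()

# Named, permanent UDF: the code is uploaded to the stage so SQL users can
# call EXTRACT_EMAIL_DOMAIN(...) after this session ends.
session.udf.register(
    extract_email_domain,
    name="EXTRACT_EMAIL_DOMAIN",
    is_permanent=True,
    stage_location="@udf_stage",   # an existing internal stage
    replace=True,
)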
12

Dependency Management for UDFs

3m 44s

Learn how to import third-party libraries and local modules into your Snowflake UDFs using the Anaconda channel and session imports.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 12 of 17. Running third-party Python packages inside a database used to mean dependency hell. You would end up building complex Docker containers just to pass an array through a math function. Now, it takes exactly one line of code. This episode covers Dependency Management for UDFs. When you write a User-Defined Function, or UDF, your logic rarely exists in a vacuum. You need libraries. In a traditional setup, moving your Python code to where the data lives means you also have to move your entire environment. You worry about pip installs, compiler dependencies, and keeping the server identical to your local machine. Snowflake avoids this nightmare automatically through a built-in integration with the Anaconda channel. The Snowflake Anaconda channel hosts thousands of pre-built, pre-compiled Python packages that are guaranteed to work inside the Snowflake execution environment. To use them, you do not install anything manually. You simply declare what your function needs. When you define your UDF using the Python decorator, you add a parameter named packages. You pass this parameter a list of strings, naming the libraries you require. Take a concrete scenario. You want to use a machine learning model to score rows directly inside a Snowflake query. Your prediction function relies on scikit-learn and pandas. Right above your Python function, you add the UDF decorator and set the packages list to contain the strings scikit-learn and pandas. You do not build a container, and you do not run a package manager. When you call that UDF, Snowflake provisions a secure environment, pulls those exact packages from the Anaconda channel, and executes your code. If you need a specific release, you can pin the version by adding two equal signs and the version number right inside the string, just like a standard Python requirements file. That covers external libraries. Now, the second piece of this is your own custom code. Not everything you write belongs on Anaconda. You often have helper functions or custom business rules written in a separate Python file. Snowflake handles this first-party code through your active session. You call a method named add import on the session object. You pass the path of your custom file to add import. This can be a local Python file on your laptop, a file already sitting in a Snowflake stage, or a zip archive containing a deeper directory of modules. When you register your UDF, Snowflake grabs the files you specified and uploads them. At runtime, Snowflake extracts those files and places them directly on the system path of the isolated environment running your UDF. Inside your main function code, you just use standard Python import statements to bring in your custom modules, exactly as if they were sitting in the same folder locally. You can freely combine these two approaches. You declare your external requirements via the packages parameter, and inject your internal modules using add import. Snowflake builds the execution context by safely merging the Anaconda binaries with your custom staged files before a single row of data is processed. Here is the key insight. By handling dependencies this way, Snowflake completely decouples your functional code from the infrastructure needed to run it, letting you define a secure, distributed execution environment using nothing but a list of names and file paths. Thanks for listening, happy coding everyone!
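Putting both mechanisms together might look like this; the helper file, its weight_for function, and the version pin are hypothetical:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf

session = Session.builder.getOrCreate()

# First-party code: stage a local helper file alongside the UDF.
session.add_import("helpers/risk_rules.py")

# Third-party code: declared by name, resolved from the Anaconda channel.
@udf(packages=["numpy==1.26.4"])
def risk_score(balance: float, days_overdue: int) -> float:
    import numpy as np
    import risk_rules   # importable because the file was added via add_import
    return float(np.log1p(balance) * risk_rules.weight_for(days_overdue))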
13

Building User-Defined Table Functions

3m 26s

Move beyond scalar values. Learn how to build User-Defined Table Functions (UDTFs) to return multiple rows and columns from a single input.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 13 of 17. What do you do when your Python function needs to return an entire table of results instead of just one value? Standard user-defined functions are great, but they are strictly limited to returning a single scalar value per input row. When you need to explode a single input into multiple outputs, you build a User-Defined Table Function, or UDTF. The difference is fundamental. A standard function gives back one value. A UDTF gives back a relation. This means it returns a tabular structure that you can query, filter, or join against other tables exactly as if it were a physical table in your database. Consider a concrete scenario. You have a table with a column containing highly nested, custom JSON application logs. Each row holds one massive log entry, but that single entry might contain five, ten, or fifty distinct system events hidden deep inside the JSON. You need to normalize this data programmatically. You want to pass in one JSON string and get back a structured table where every individual event is extracted into its own row. To build a UDTF in Snowpark Python, you do not write a standalone function. You write a Python class. This class acts as the handler for the table function and orchestrates the logic using up to three specific methods. First is the initialization method. This is optional. Snowflake runs it once per partition before processing any rows. If your log parser needs to compile a complex regular expression or initialize a custom state dictionary, you do it here so the setup cost is only paid once per batch of data. Second is the process method. This is the only mandatory method in the class. Snowflake calls this for every single input row. In our logging scenario, the process method receives the JSON string, unpacks the array of events, and loops through them. This is where the conversion happens. Instead of returning a final value, the process method yields tuples. Every tuple it yields instantly becomes a new row in your output table. If one JSON log string contains twelve events, the process method yields twelve separate tuples, and Snowflake turns those into twelve distinct output rows. Third is the end partition method. This is also optional. Snowflake triggers it after all rows in the current partition have gone through the process method. You typically use this if you were aggregating state across the partition and need to yield a final summary row, or simply to clean up memory and resources. When you register this handler class in Snowpark, you must explicitly declare your output schema. Because a UDTF returns a table, the database execution engine needs to know the exact column names and data types of the tuples you are yielding before the query even runs. Here is the key insight. UDTFs bridge the gap between complex Python logic and relational database operations, allowing you to wrap messy, procedural data unnesting inside a clean table interface that the rest of your application can query natively. Thanks for spending a few minutes with me. Until next time, take it easy.
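A sketch of the handler class for the log-explosion scenario; the JSON layout and output columns are assumptions:

import json
from snowflake.snowpark import Session
from snowflake.snowpark.types import StructType, StructField, StringType

session = Session.builder.getOrCreate()

class ExplodeEvents:
    def __init__(self):
        # Runs once per partition: a good place for shared setup or state.
        self.count = 0

    def process(self, log_entry: str):
        # Called per input row; every yielded tuple becomes an output row.
        for event in json.loads(log_entry).get("events", []):
            self.count += 1
            yield (event.get("type"), event.get("message"))

    def end_partition(self):
        # Optional: emit a summary row once the partition is exhausted.
        yield ("summary", f"{self.count} events parsed")

session.udtf.register(
    ExplodeEvents,
    output_schema=StructType([
        StructField("EVENT_TYPE", StringType()),
        StructField("DETAIL", StringType()),
    ]),
    input_types=[StringType()],
    name="EXPLODE_EVENTS",
    replace=True,
)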
14

Authoring Stored Procedures

3m 42s

Automate your pipelines entirely within Snowflake. We explore creating Stored Procedures to execute complex business logic and control flow.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 14 of 17. Stop paying for an external orchestration server just to run Python scripts that tell Snowflake what to do. Authoring Stored Procedures brings that control flow directly into the data platform. People often confuse stored procedures with User-Defined Functions, or UDFs. A UDF is not meant for orchestration. UDFs return values inside a query and transform data row by row. Stored procedures are completely different. They execute entire scripts, run multiple queries, and handle procedural control flow. You use them to orchestrate tasks, not to calculate column values. To create a stored procedure in Snowpark Python, you write a standard Python function. The crucial requirement is the first argument of that function. It must always be a Snowpark Session object. This Session is your active link to the database. When your procedure runs, it uses this session to execute SQL statements, build DataFrames, and read or write data. To tell Snowflake this function is a stored procedure, you place the sproc decorator directly above your function definition. The decorator acts as the configuration layer. You use it to specify the return type of your procedure and any third-party Python packages your code needs to run. When you execute the registration code, Snowpark takes your Python function, packages it along with its dependencies, and registers it in Snowflake. From then on, the logic lives inside the database and executes on Snowflake compute. Think about a nightly data pipeline. Your process needs to create a temporary table, load raw data into it, run a series of multi-step transformations, and finally clean up the temporary objects. If you run this from an external server, your Python code sends a command over the network, waits for Snowflake to finish, gets the result back, and sends the next command. By writing this exact same Python logic as a stored procedure, the entire script executes inside Snowflake. It uses the session object to step through the table creations and DataFrame transformations natively. There is zero network latency between the orchestration logic and the data it manipulates. While a stored procedure does not return a value per row like a UDF, it still returns a single value when the script finishes. You define this return type when you set up the sproc decorator. Often, this is a simple string confirming success, an integer representing the total rows processed, or even a tabular result showing an execution log. You have choices for how you deploy this code. You can define and run an anonymous procedure dynamically in a Python script if you only need it for a single session. For automated pipelines, you register it as a permanent object. Snowflake stores the definition so any authorized user, external application, or scheduled task can call it later. Here is the key insight. Writing a Snowpark Python stored procedure means you are treating Snowflake as an application runtime, not just a storage layer. You are moving the brain of your orchestration right next to the data it controls. If you enjoy the show and want to support the work we do, you can find us by searching for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
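Here is a minimal sketch of the nightly pipeline as a stored procedure; the table names and cleanup statement are placeholders:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import sproc, col

session = Session.builder.getOrCreate()

@sproc(name="NIGHTLY_REFRESH", packages=["snowflake-snowpark-python"], replace=True)
def nightly_refresh(session: Session) -> str:
    # All orchestration below runs inside Snowflake, next to the data.
    cleaned = session.table("RAW_ORDERS").filter(col("STATUS") == "complete")
    cleaned.write.save_as_table("ORDERS_CLEAN", mode="overwrite")
    session.sql("DROP TABLE IF EXISTS TMP_ORDERS_STAGING").collect()
    return "refresh complete"

# Callable from Python, SQL, or a scheduled task once registered.
print(session.call("NIGHTLY_REFRESH"))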
15

Training ML Models in Snowflake

3m 50s

Discover how to train heavy machine learning models securely on Snowpark-optimized warehouses using custom Python Stored Procedures.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 15 of 17. Moving terabytes of secure data out to an external machine learning server is slow, expensive, and risky. You spend hours pushing data across the network, and your security team gets nervous about where it lands. What if the compute came to the data instead? This is the core idea behind training ML models inside Snowflake. By leveraging Snowpark Python, you can train single-node machine learning models directly inside your data warehouse using stored procedures. The data never leaves the Snowflake perimeter. This solves the data gravity problem. Instead of extracting millions of rows to a separate cloud instance, you execute your training script right where the data currently lives. Training a machine learning model requires loading the dataset into memory. A standard virtual warehouse in Snowflake is optimized for parallel SQL queries, not for holding massive datasets in system RAM. If you try to fit a complex algorithm on a standard warehouse, you will likely hit out-of-memory errors and the process will crash. To handle this, you need a different compute profile. Snowflake provides Snowpark-optimized warehouses specifically for this task. These warehouses provide up to sixteen times more memory per node compared to standard instances. Let us walk through a concrete scenario. You need to train a scikit-learn random forest model on a huge customer dataset. You start by writing a Python stored procedure. When you register this procedure, you specify the required Python packages, such as scikit-learn, pandas, and joblib. Snowflake automatically provisions these dependencies using its integrated Anaconda channel. You also assign the procedure to run on your Snowpark-optimized warehouse. Inside the procedure, you use the Snowpark session object to define your training data query. You then call the method to convert that Snowpark dataframe into a standard pandas dataframe. This action pulls the query results directly into the physical memory of the warehouse node. Because you provisioned the optimized compute, there is plenty of RAM to hold the entire dataset at once. Next, you initialize your random forest model and invoke the fit method, passing in the pandas dataframe. The warehouse processor handles the training cycle exactly as it would on a dedicated machine learning server. Once the model finishes training, you have a fitted algorithm sitting in the temporary memory of the execution environment. You must persist it for future predictions before the stored procedure finishes. This is where serialization comes in. You use a library like joblib to package the model. You take the trained model object in memory and dump it into a file, creating an artifact. You then instruct the stored procedure to write that newly created file directly to an internal Snowflake stage. A stage acts as a secure storage folder tied to your Snowflake account. Here is the key insight. Your stored procedure acts as a secure container connecting your raw tables to standard Python machine learning libraries. By keeping the pipeline entirely within Snowflake, you drop your network latency to zero, bypass all data egress operations, and maintain strict access controls, all while writing the exact same Python training code you already know. That is all for this one. Thanks for listening, and keep building!
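A hedged sketch of the training procedure; it assumes a Snowpark-optimized warehouse is in use and that the feature table and model stage already exist:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import sproc

session = Session.builder.getOrCreate()

@sproc(
    name="TRAIN_CHURN_MODEL",
    packages=["snowflake-snowpark-python", "scikit-learn", "pandas", "joblib"],
    replace=True,
)
def train_churn_model(session: Session) -> str:
    import joblib
    from sklearn.ensemble import RandomForestClassifier

    # Materialize the training set into the warehouse node's memory.
    df = session.table("CUSTOMER_FEATURES").to_pandas()
    model = RandomForestClassifier(n_estimators=200)
    model.fit(df.drop(columns=["CHURNED"]), df["CHURNED"])

    # Serialize the fitted model and persist the artifact to an internal stage.
    joblib.dump(model, "/tmp/churn_model.joblib")
    session.file.put("/tmp/churn_model.joblib", "@ml_models",
                     auto_compress=False, overwrite=True)
    return "model written to @ml_models"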
16

Dynamic File Access with SnowflakeFile

3m 27s

Learn how to stream large unstructured files dynamically from internal stages directly inside your UDFs and Stored Procedures.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 16 of 17. How do you process gigabyte-sized unstructured files inside a user-defined function without blowing up your memory limits? If you try to map large files as static dependencies, the system attempts to download everything upfront, and your code crashes. The solution to this is Dynamic File Access with SnowflakeFile. To understand why this matters, look at how file access normally works in Snowpark. Usually, you use the imports parameter to bind a file to your function. This static approach is great for small configuration files or a specialized machine learning model because the file is pulled down to the compute warehouse before execution begins. But it falls apart at scale. If your pipeline needs to process hundreds of massive server log files or high-resolution PDFs stored in a Snowflake stage, downloading them all statically is incredibly slow and guarantees an out-of-memory error. This is where the SnowflakeFile class comes in. Located in the snowflake dot snowpark dot files module, this class allows your function or stored procedure to open files directly from a stage at runtime. You do not bind the file during the function creation. Instead, your function accepts the file location as a standard string argument. When you call the function in your SQL query or dataframe operation, you pass in that location. Inside the Python code, you call the open method on the SnowflakeFile class and pass it the file location string. You also specify the read mode, using standard Python conventions like r for text or r b for reading bytes. Here is the key insight. Calling open does not load the file. It yields a file-like object that streams the data. The file stays on the stage, and you pull the data through a network pipe. Let us apply this to the log processing scenario. You have a database table where each row contains a scoped URL pointing to a different large log file on an internal stage. You pass that URL column to your function. Inside the function, you open the stream and write a standard loop to read the file line by line. As your code evaluates each line to extract error codes or timestamps, Python discards the previous lines from memory. You can parse a fifty-gigabyte log file while your function only ever holds a single line of text in memory at any given millisecond. The exact same concept works for binary files. If you are extracting text from large PDFs, you open the file in binary mode and pass the resulting stream directly into a Python PDF parsing library. The library reads the byte stream sequentially without requiring the entire document to reside in system memory. SnowflakeFile also handles the security boundary automatically. When you pass a dynamically generated scoped URL into your function, Snowflake grants read access to that specific file just for the duration of the query, expiring immediately when the query finishes. Streaming dynamically specified files keeps your memory footprint flat regardless of the file size, letting your compute warehouse focus entirely on processing the data instead of just trying to hold it. That is all for this one. Thanks for listening, and keep building!
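The log-scanning example might be sketched like this; the UDF name, stage, and search term are placeholders:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf

session = Session.builder.getOrCreate()

@udf(name="COUNT_ERRORS", packages=["snowflake-snowpark-python"], replace=True)
def count_errors(file_url: str) -> int:
    from snowflake.snowpark.files import SnowflakeFile
    errors = 0
    # The file is streamed from the stage; only one line is in memory at a time.
    with SnowflakeFile.open(file_url, "r") as log:
        for line in log:
            if "ERROR" in line:
                errors += 1
    return errors

# Typical call from SQL, passing a scoped URL for a staged file:
#   SELECT COUNT_ERRORS(BUILD_SCOPED_FILE_URL(@logs_stage, 'app/server.log'));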
17

Testing Snowpark Python

3m 42s

Ensure your data pipelines are robust. We cover setting up PyTest, creating Session fixtures, and unit testing DataFrame transformations.

Hi, this is Alex from DEV STORIES DOT EU. Snowflake & Snowpark Python, episode 17 of 17. You would never deploy backend application code without unit tests, so why are data pipelines often pushed to production on a wing and a prayer? Snowpark changes this by making your data logic modular, meaning you can finally test it like normal software. Testing Snowpark Python is what brings software engineering rigor to data engineering. Snowpark allows you to pull your transformation logic out of monolithic scripts and into standalone Python functions. Because these functions take a DataFrame as input and return a DataFrame as output, you can test them using standard Python frameworks. PyTest is the most common choice. To run any Snowpark code, you need an active session. Instead of embedding authentication details inside every single test, you define a session fixture. You do this in a dedicated configuration file called conftest.py, which PyTest automatically recognizes. Inside this file, you write a setup function that establishes a connection to Snowflake, returns the session object, and closes the session when the tests finish. You decorate this function as a PyTest fixture. Now, any test in your project can request an active session simply by asking for it by name in the test function arguments. Let us apply this to a concrete scenario. You have a complex custom DataFrame transformation that filters out inactive users, calculates a risk score, and standardizes column names. You need to validate this logic before deploying the pipeline to production. You start by writing a test. The first step inside that test is creating an input DataFrame containing a few rows of carefully selected mock data. You define the exact column names and row values needed to trigger the specific edge cases in your logic. You use your session fixture to convert this raw mock data into a Snowpark DataFrame. Next, you pass this mock input DataFrame into your transformation function. The function applies your business logic and returns the actual output DataFrame. Now you need to verify that the logic did what it was supposed to do. You create a second mock DataFrame within your test, this time representing the exact expected result. This is the part that matters. You must compare the actual DataFrame against the expected DataFrame. However, you cannot just use a standard Python equality check on the DataFrame objects themselves, because they are merely references to query plans. Instead, you need to compare the data they contain. You call the collect method on both the actual and expected DataFrames to pull the results back into memory as lists of row objects. Because distributed databases do not guarantee the order of result sets unless explicitly instructed, comparing two lists of rows directly might fail even if the data is correct. To fix this, you sort both lists of rows before asserting that they are equal. If the sorted actual data matches the sorted expected data, your test passes. By passing small, controlled datasets through isolated transformation functions, you catch logical errors locally before they ever touch a production table. This isolates your pipeline components and makes your data engineering workflow as reliable as traditional software development. This marks the end of our series on Snowpark Python. I encourage you to explore the official Snowflake documentation, try writing these tests hands-on, or visit devstories.eu to suggest topics for our future series. 
I would like to take a moment to thank you for listening — it helps us a lot. Have a great one!
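To round off the series, here is a small sketch of the testing setup described in this final episode; the transformation under test and its module are hypothetical:

# conftest.py -- PyTest picks this up automatically; the connection comes
# from your standard configuration (for example connections.toml).
import pytest
from snowflake.snowpark import Session

@pytest.fixture(scope="session")
def session():
    s = Session.builder.getOrCreate()
    yield s
    s.close()

# test_transformations.py -- drop_inactive_users is the function under test.
from transformations import drop_inactive_users

def test_drop_inactive_users(session):
    input_df = session.create_dataframe(
        [("a", True), ("b", False)], schema=["USER_ID", "IS_ACTIVE"]
    )
    expected = session.create_dataframe(
        [("a", True)], schema=["USER_ID", "IS_ACTIVE"]
    ).collect()

    actual = drop_inactive_users(input_df).collect()

    # Result order is not guaranteed, so sort both sides before comparing.
    assert sorted(actual) == sorted(expected)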