Season 15 · 15 Episodes · 52 min · 2026

Databricks

2026 Edition. A comprehensive guide to the Databricks Data Intelligence Platform and Lakehouse architecture. Recorded in 2026.

Big Data · Cloud Data Warehousing · Data Science
1
What is Databricks? The Lakehouse Explained
What exactly is Databricks, and why is every data team talking about it? We break down the massive divide between data scientists and business analysts, and how the Databricks Data Lakehouse solves it.
3m 10s
2
Why Unity Catalog Changes Data Governance
Data governance is usually a nightmare of scattered permissions. Learn how Databricks Unity Catalog brings centralized security, automated lineage, and easy sharing to your entire organization.
3m 42s
3
Navigating the Workspace and Compute
How do you actually use Databricks? We explore the Workspace UI and how Databricks manages cloud compute to save you money while giving you massive processing power.
3m 35s
4
Organizing Your Data: The Object Model
A data lake without structure is just a data swamp. Dive into the Databricks three-level namespace and the critical difference between Managed and External tables.
3m 29s
5
Taming Unstructured Data with Volumes
What happens to the data that doesn't fit in a database? Learn how Databricks Unity Catalog Volumes securely manage PDFs, images, and raw files for AI.
3m 25s
6
Bulletproof Cloud Security: External Locations
Stop passing around cloud access keys. Understand how Databricks securely connects to AWS and Azure using External Locations and Storage Credentials.
3m 53s
7
Painless Ingestion with Lakeflow Connect
Building API connectors from scratch is a waste of time. Discover how Lakeflow Connect ingests data from enterprise apps into your Lakehouse effortlessly.
3m 09s
8
Automated ETL: Declarative Pipelines
Stop micromanaging your data workflows. Learn how Lakeflow Spark Declarative Pipelines figure out the infrastructure and dependencies for you.
3m 30s
9
Master Orchestration with Lakeflow Jobs
A brilliant data pipeline is useless if it runs in the wrong order. Discover how Lakeflow Jobs orchestrate complex, multi-task workflows reliably.
3m 27s
10
Databricks SQL: BI Without Limits
Why copy data out of your lake just to analyze it? We explore Databricks SQL and how serverless compute brings blazing-fast BI directly to your raw data.
3m 19s
11
The Semantic Layer: One Source of Truth
Stop arguing over whose dashboard is right. Learn how Databricks Metric Views create a semantic layer that guarantees consistent reporting across teams.
3m 18s
12
Genie Spaces: Talk to Your Data
Empower business users to find answers themselves. Discover how Databricks AI/BI and Genie Spaces allow anyone to query the Lakehouse using plain English.
3m 54s
13
Deploying AI: Mosaic AI Model Serving
Building an AI model is easy; deploying it is hard. Learn how Mosaic AI Model Serving acts as a secure, unified API gateway for all your machine learning models.
3m 26s
14
AI Functions: LLMs in Your SQL Queries
You don't need to be a Python expert to use Generative AI. Discover how Databricks AI Functions let you apply Large Language Models directly to your data using standard SQL.
3m 22s
15
The Future: The AI Agent Framework
Go beyond simple chatbots. In our series finale, we explore the Databricks AI Agent Framework and how to build autonomous AI that acts on your data.
3m 28s

Episodes

1

What is Databricks? The Lakehouse Explained

3m 10s

What exactly is Databricks, and why is every data team talking about it? We break down the massive divide between data scientists and business analysts, and how the Databricks Data Lakehouse solves it.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 1 of 15. For years, companies have been paying twice to store the exact same data just to keep both their machine learning engineers and their business analysts happy. Storing raw data in one system and copying it into another creates endless synchronization headaches. Databricks fixes this by introducing a unified approach called the Data Lakehouse. Historically, organizations split their data architecture into two separate paths. First, they built Data Lakes. These are cheap, highly scalable cloud storage systems perfect for dumping massive amounts of unstructured data. Data scientists love them for training machine learning models. But Data Lakes are terrible for fast, reliable SQL queries. To solve that, businesses introduced Data Warehouses for their business intelligence teams. This creates a massive operational burden. Take a growing startup as an example. Their data engineers dump raw event logs into a cloud storage bucket. They run their Python scripts there. But the finance team needs dashboards. So, the engineers must build complex pipelines to extract that data, transform it, and load it into a separate data warehouse. The company pays for storage twice. They pay for the compute required to move the data. And the moment the data arrives in the warehouse, it is already out of date. Databricks eliminates this pipeline entirely with the Data Lakehouse architecture. A Lakehouse combines the cheap, flexible storage of a data lake with the reliability and performance of a data warehouse. It keeps your data in a single, open format directly in your cloud storage. You do not copy it into a proprietary database. Instead, Databricks adds a transactional layer directly on top of your existing data lake. Here is the key insight. Your data stays in one single place, but different professionals interact with it exactly how they need to. Data scientists can write Python or Scala to train models directly on the raw files. Simultaneously, business analysts can run high-performance SQL queries on that exact same data to power their reporting tools. People often mistakenly think Databricks is just another SQL database or simply a managed wrapper around Apache Spark. It is neither. It is a comprehensive Data Intelligence Platform. By merging the lake and the warehouse, you also merge security and governance. In the old model, you had to manage access permissions in cloud storage for the engineers and separately in the data warehouse for the analysts. With Databricks, a unified governance layer handles access control across every table, file, and machine learning model. You define a data access policy once, and it applies everywhere, regardless of the language or tool used to query it. The real power of the Lakehouse architecture is not just saving money on redundant storage pipelines; it is that your artificial intelligence models and your financial dashboards are finally looking at the exact same numbers at the exact same time. If you want to help keep the show going, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
2

Why Unity Catalog Changes Data Governance

3m 42s

Data governance is usually a nightmare of scattered permissions. Learn how Databricks Unity Catalog brings centralized security, automated lineage, and easy sharing to your entire organization.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 2 of 15. If your company uses multiple workspaces to process data, you probably have multiple places where you manage security permissions. That fragmentation is a massive compliance risk, because keeping policies synchronized across disconnected environments relies entirely on manual updates. Unity Catalog eliminates this risk by fundamentally changing how data governance works in Databricks. Before explaining the mechanics, we need to clear up a very common misconception. Unity Catalog is not a passive data dictionary. It is not just a list of tables where users go to read descriptions. It is the central policy engine actively enforcing security rules across your entire architecture. Unity Catalog solves the persistent problem of knowing exactly who has access to what. It provides a unified security model based on standard ANSI SQL. Instead of configuring cloud identity roles, workspace-level permissions, and cluster-level access controls separately, you use familiar commands like grant and revoke directly on your data and artificial intelligence assets. Because Unity Catalog sits at the account level, rather than being bound to an individual workspace, you define a security rule exactly once. That rule is then instantly and universally enforced across every workspace attached to that catalog. Consider a situation where an auditor asks a chief technology officer to prove exactly who queried a specific table containing credit card numbers last Tuesday, and to identify every downstream report that currently uses those numbers. Historically, answering this meant parsing disjointed system logs across different tools, manually reading scheduled transformation jobs, and hoping no intermediate steps were missed. Unity Catalog handles this natively through its next two pillars: built-in auditing and automated lineage. First, it captures detailed, user-level audit logs out of the box. Any time a user or a service principal accesses data, the catalog records the event. Here is the key insight. Unity Catalog does not just track who queried a table; it tracks what happens to the data next through automated lineage. As your scheduled pipelines run, the system continuously reads the execution plans and builds a map of how data flows. It tracks which source tables feed which intermediate datasets, all the way down to the final dashboards. It tracks this at both the table level and the column level. When the auditor asks about the credit card data, you do not need to guess. You view the lineage graph and instantly see every transformation step and every access point. The final major pillar is secure data sharing. Organizations often need to share datasets with external vendors or separate business units. Instead of exporting flat files or duplicating data into separate cloud storage buckets, Unity Catalog integrates a protocol called Delta Sharing. This allows you to grant external parties governed access to live data without copying it. The external consumer reads the data in place, and their access is logged and controlled by the exact same central brain. The true value of Unity Catalog is that it completely removes the dangerous gap between writing a security policy on paper and actually executing it across isolated data silos. That is all for this one. Thanks for listening, and keep building!
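
To make the grant-and-revoke model from this episode concrete, here is a minimal Databricks SQL sketch. The catalog, schema, table, and group names are illustrative, not taken from the episode.

```sql
-- Illustrative names: a 'finance' catalog, a 'payments' schema, an 'analysts' group.
-- Because Unity Catalog sits at the account level, these grants apply in every
-- workspace attached to the metastore.
GRANT USE CATALOG ON CATALOG finance TO `analysts`;
GRANT USE SCHEMA  ON SCHEMA  finance.payments TO `analysts`;
GRANT SELECT      ON TABLE   finance.payments.card_transactions TO `analysts`;

-- Removing access is the same single statement, enforced everywhere at once.
REVOKE SELECT ON TABLE finance.payments.card_transactions FROM `analysts`;
```
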
3

Navigating the Workspace and Compute

3m 35s

How do you actually use Databricks? We explore the Workspace UI and how Databricks manages cloud compute to save you money while giving you massive processing power.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 3 of 15. The easiest way to burn through your cloud budget is leaving a massive server running empty all weekend. You want processing power exactly when you need it, and zero billing when you do not. This is exactly what we cover today in Navigating the Workspace and Compute. Your entry point into Databricks is the workspace. Think of the workspace as the unified environment where your team organizes all their Databricks assets. It provides a web interface to manage your notebooks, data objects, machine learning experiments, and the underlying computational resources. The workspace brings all your collaborative tools into one organized view, ensuring different teams can interact with the same underlying data without stepping on each other. Under the hood, Databricks relies on a decoupled architecture. Your data lives persistently in cloud object storage, while the compute power used to process that data is spun up completely separately. This separation of concerns dictates your billing. Because compute is isolated from storage, you only provision and pay for server instances when you are actively running code. When the work is done, the compute shuts down, but your data remains safely stored and accessible. To manage this processing power, Databricks offers different types of compute resources tailored to specific workflows. The first is an All-Purpose cluster. You use this for interactive, ad-hoc work. Say a data analyst needs a highly capable environment to query a billion rows on a Tuesday afternoon. They spin up an All-Purpose cluster, attach their notebook, and start exploring. To prevent weekend billing surprises, these clusters rely on auto-termination. If the analyst goes home at five and leaves the notebook open, the cluster monitors itself for inactivity and automatically shuts down after a specified time limit. Here is the key insight regarding automation. A frequent mistake teams make is scheduling automated production pipelines to run on these interactive All-Purpose clusters. Avoid doing this. All-Purpose clusters carry a higher usage cost, and running multiple different workflows on a shared interactive cluster can introduce library conflicts or resource contention. Instead, production pipelines should use Job clusters. A Job cluster is entirely ephemeral. When an automated pipeline is triggered, the Databricks job scheduler provisions a dedicated Job cluster strictly for that workload. It runs the code, and the absolute second the job finishes, the cluster terminates itself. This guarantees strict resource isolation for your pipeline and ensures you pay the lowest possible compute rate for automated tasks. Finally, if your workload is purely analytical, you can use a SQL warehouse. This is a compute resource optimized specifically for running SQL commands and dashboard queries. If you use Serverless SQL warehouses, Databricks manages the underlying compute automatically. It scales up instantly when a surge of queries hits, and scales back down when the queue empties out, entirely removing the need to configure infrastructure yourself. Matching the right compute type to the exact nature of your task is the single most effective way to guarantee your cloud infrastructure remains powerful during peak hours and highly cost-efficient when the work is done. That is all for this one. Thanks for listening, and keep building!
4

Organizing Your Data: The Object Model

3m 29s

A data lake without structure is just a data swamp. Dive into the Databricks three-level namespace and the critical difference between Managed and External tables.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 4 of 15. You build a data lake, but within months nobody knows where the production data is, who owns it, or whether a specific table is actually safe to query. The difference between a pristine data lake and an unmanageable data swamp is exactly three levels deep. Today we are looking at Organizing Your Data: The Object Model. Unity Catalog brings order to your data through a strict, predictable hierarchy. The absolute top container is the metastore, which holds your organization's metadata. But your daily interactions rely on the primary three-level namespace. Every query you write targets an asset using the format catalog dot schema dot object. The first level is the catalog. This provides a broad boundary for data assets. You typically use catalogs to logically separate environments, like having one catalog for production and a completely separate one for development. The second level is the schema, which is also referred to as a database. Schemas live inside catalogs and organize related data sets. You might create one schema for raw ingested events and another for refined analytics. The third level is the object itself. This is your actual table, a view, or a volume holding non-tabular files. By enforcing this three-part naming convention, Unity Catalog gives every piece of data a clear, unambiguous address. When an analyst queries production dot sales dot customers, the location, lifecycle stage, and purpose of that data are instantly obvious. Here is the key insight. Once you reach the table level, you must understand how Unity Catalog interacts with your actual storage. There are two primary types of tables: managed tables and external tables. Managed tables are the default. When you create a managed table, Unity Catalog owns both the metadata and the underlying data. It handles the file layout and manages the entire lifecycle of the data. The actual files are saved in a designated storage location that you configure at the metastore, catalog, or schema level. External tables operate differently. You use an external table when you have files already sitting in a cloud storage bucket and you want to leave them exactly where they are. With an external table, Unity Catalog registers the structure and governs access, but it only owns the metadata. You retain complete control over the physical files. This distinction becomes critical during destructive operations. Consider a scenario where a data engineer accidentally executes a drop table command. If they drop a managed table, Unity Catalog removes the table from the metastore and automatically deletes the underlying files from your cloud storage. The data is destroyed. If they drop an external table, Unity Catalog simply removes the metadata link. The table disappears from your workspace interface, but the raw files in your cloud storage remain perfectly intact and untouched. Always use managed tables when you want the catalog to optimize and govern the entire storage lifecycle, and reserve external tables for data that you need to protect from accidental deletion or share directly with other external systems. Thanks for hanging out. Hope you picked up something new.
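
A short sketch of the two table types described in this episode, with made-up names; the external table assumes an external location already covers the cloud path.

```sql
-- Managed table: Unity Catalog owns the metadata and the underlying files.
CREATE TABLE production.sales.customers (
  customer_id BIGINT,
  region      STRING
);

-- External table: Unity Catalog registers the structure and governs access,
-- but only owns the metadata; the files stay in the bucket you point at.
CREATE TABLE production.sales.legacy_orders (
  order_id BIGINT,
  amount   DOUBLE
)
LOCATION 's3://example-bucket/legacy-orders/';

-- DROP deletes the files for a managed table, but leaves them untouched for an external one.
DROP TABLE production.sales.legacy_orders;
```
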
5

Taming Unstructured Data with Volumes

3m 25s

What happens to the data that doesn't fit in a database? Learn how Databricks Unity Catalog Volumes securely manage PDFs, images, and raw files for AI.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 5 of 15. You can easily restrict who sees a column in a relational database, but how do you enforce access control on a cloud storage bucket full of thousands of raw PDFs? The answer is Taming Unstructured Data with Volumes. Before we get into how they work, let us clear up a common mix-up. Volumes are strictly for path-based file access. They are not for tabular data. If you are querying rows and columns with SQL, you use a table. If you are reading images, text documents, or audio files, you use a volume. A volume is an object inside Unity Catalog. It represents a logical storage space in your cloud environment. By creating a volume, you bring unstructured data under the exact same security umbrella as your structured tables. Instead of managing identity policies in AWS or role assignments in Azure just to read a file, you control access using standard permissions directly in Databricks. Consider a hospital training a machine learning model to detect anomalies in X-ray images. They cannot put thousands of high-resolution images into a database table. They need to store them as raw files in cloud object storage. Because these are highly sensitive patient files, strict governance is critical. By placing the X-rays inside a Databricks volume, the engineering team can govern exactly which data scientists are allowed to read that specific directory. There are two types of volumes: managed and external. A managed volume is completely handled by Databricks. When you create one, you do not specify a storage path. Databricks simply carves out space in the default storage location assigned to your current schema. You upload files directly into it. If you ever drop a managed volume, Databricks deletes the underlying files as well. This makes them ideal for temporary workspace files or data generated entirely within your Databricks pipelines. An external volume points to an existing cloud storage directory that you already own. First, you register a cloud storage path as an external location in Unity Catalog. Then, you create a volume on top of it. This gives you strict governance over data produced by other systems. If a separate application writes log files into an Azure Data Lake bucket, an external volume lets Databricks users read those files securely. If you drop an external volume, the metadata is removed, but the underlying files in your cloud bucket are left completely untouched. This path-based approach is exactly what modern AI requires. Open-source machine learning libraries typically expect to read data from a local file system. They do not know how to authenticate with proprietary cloud storage interfaces. Volumes solve this by exposing a directory path that looks and behaves like a standard local folder. Your model training script simply opens a file path. Unity Catalog intercepts that request and seamlessly verifies the user permissions. Here is the key insight. Volumes eliminate the disconnect between how you govern your structured databases and how you secure your raw files, allowing you to run machine learning workloads on unstructured data without bypassing enterprise security. That is all for this one. Thanks for listening, and keep building!
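
A hedged sketch of the hospital example in Databricks SQL; the catalog, schema, volume, bucket, and group names are all illustrative.

```sql
-- Managed volume: Databricks carves out space in the schema's default storage.
CREATE VOLUME main.radiology.scratch;

-- External volume: governs a cloud directory you already own
-- (assumes an external location already covers this path).
CREATE EXTERNAL VOLUME main.radiology.xrays
LOCATION 's3://example-hospital-bucket/xrays/';

-- Only the approved data scientists may read the files.
GRANT READ VOLUME ON VOLUME main.radiology.xrays TO `imaging-ml-team`;

-- Training code then opens paths like /Volumes/main/radiology/xrays/scan_0001.png
-- as if they were a local folder; Unity Catalog checks permissions on every read.
```
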
6

Bulletproof Cloud Security: External Locations

3m 53s

Stop passing around cloud access keys. Understand how Databricks securely connects to AWS and Azure using External Locations and Storage Credentials.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 6 of 15. If your data engineers are still pasting cloud access keys directly into their scripts, your company is one mistake away from a massive data breach. The solution to securely bridging your workspace and your cloud storage without exposing secrets is Bulletproof Cloud Security: External Locations. When a user logs into Databricks, they use an identity token. That token proves who they are to the workspace. But that identity means absolutely nothing to your underlying cloud provider, whether that is AWS, Azure, or Google Cloud. To read a file from a cloud bucket, the workspace itself needs to authenticate with the cloud infrastructure. Historically, developers bypassed this disconnect by hardcoding cloud IAM keys directly into their notebooks or environment variables. This creates a severe security vulnerability, as anyone with read access to the code can steal the keys. Unity Catalog solves this through a strict two-part abstraction. The first part is the Storage Credential. A Storage Credential represents an authentication and authorization mechanism directly tied to your cloud provider. It maps to an IAM role in AWS, a Managed Identity in Azure, or a Service Account in Google Cloud. Instead of handing raw cloud keys to a developer, your cloud administrator grants access privileges to this Storage Credential. Unity Catalog holds the authority to assume this role, keeping the actual credential entirely out of the hands of workspace users. Now, a Storage Credential alone is too broad. That IAM role might have permission to access dozens of different buckets across your cloud environment. This is where the second part comes in. An External Location pairs a Storage Credential with a specific cloud storage path, such as an S3 bucket URI or an Azure Data Lake Storage container path. It defines exactly where that credential is allowed to operate. You can think of it as a geographic boundary for your cloud credentials. Take a concrete scenario. A developer needs to analyze system logs stored in a highly secure S3 bucket. In a legacy setup, an admin would generate AWS access keys and send them to the developer, hoping they do not accidentally commit those keys to a public code repository. With Unity Catalog, the workflow changes completely. The admin creates a Storage Credential configured with an IAM role that can read the target bucket. Next, the admin creates an External Location pointing strictly to the S3 path containing the system logs, and attaches that Storage Credential to it. Finally, using standard SQL, the admin grants the developer permission to read files exclusively on that External Location. When the developer runs a query against the logs, Unity Catalog steps in and transparently handles the cloud authentication on their behalf. The developer never sees an AWS key. They do not manage secrets or configure cloud profiles. They just query the allowed path. Later, you can build external tables or external volumes directly on top of this location to further organize the data. If the developer moves to another team, the admin simply revokes their grant to the External Location inside Databricks. The underlying cloud IAM configuration remains completely untouched. Here is the key insight. External Locations decouple your cloud infrastructure security from your daily data access governance. 
By keeping IAM roles out of user code and anchoring them to explicit paths, you guarantee that every data request is audited, secure, and restricted entirely to the data you intend to share. Thanks for spending a few minutes with me. Until next time, take it easy.
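
Here is a minimal sketch of the admin workflow from this episode, assuming the IAM-backed Storage Credential has already been created; all names are illustrative.

```sql
-- Pair the credential with one specific path: the geographic boundary for the credential.
CREATE EXTERNAL LOCATION system_logs
URL 's3://example-logs-bucket/system/'
WITH (STORAGE CREDENTIAL logs_reader_credential);

-- The developer gets read access to files on that path only; they never see a cloud key.
GRANT READ FILES ON EXTERNAL LOCATION system_logs TO `dev.analyst@example.com`;

-- When the developer changes teams, revoke the grant; the cloud IAM setup is untouched.
REVOKE READ FILES ON EXTERNAL LOCATION system_logs FROM `dev.analyst@example.com`;
```
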
7

Painless Ingestion with Lakeflow Connect

3m 09s

Building API connectors from scratch is a waste of time. Discover how Lakeflow Connect ingests data from enterprise apps into your Lakehouse effortlessly.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 7 of 15. Data engineers spend an unbelievable amount of their time just trying to keep fragile API ingestion scripts from breaking. When an endpoint changes its pagination logic or a rate limit drops, your pipeline fails, and you spend the afternoon debugging JSON instead of building data models. The solution to this specific headache is Lakeflow Connect. Before we look at how it works, let us clear up a common naming confusion. Databricks has Lakeflow Jobs and Lakeflow Connect. Lakeflow Jobs handles orchestration, meaning it runs tasks in a specific sequence. Lakeflow Connect is strictly about ingestion. It is the mechanism for getting raw data from outside systems into your Databricks environment. At its core, Lakeflow Connect provides Managed Connectors. These are native, purpose-built integrations for enterprise applications and databases. Usually, when you need to pull data from external systems, you write custom Python code. That code has to manage authentication, handle retries when the server drops a connection, track which records were already ingested, and parse complex pagination. Managed Connectors eliminate that entire layer of custom infrastructure. Databricks handles the underlying compute, the API interactions, and the state tracking required for incremental reads. Because Lakeflow Connect runs on serverless compute, you do not need to configure or manage clusters just to pull in data. The service scales automatically based on the volume of incoming data. It also integrates directly with Unity Catalog, meaning the data you ingest is immediately governed and available for querying. Consider a standard requirement. Your marketing team needs up-to-date Salesforce data in your lakehouse. If you build this from scratch, you might spend a week writing a custom script that queries the Salesforce API. You have to write logic to stay under strict API limits, manage token refreshes, and merge updates into your existing Delta tables without duplicating records. With a Managed Connector in Lakeflow Connect, you bypass the custom code entirely. You provide the connection credentials, select the specific Salesforce objects you want to track, and set a destination catalog and schema. The setup takes a few minutes. Databricks takes over the execution. It pulls the initial historical snapshot of your data and then transitions to continuously capturing incremental changes as they happen. Here is the key insight. By shifting the ingestion workload to a Managed Connector, you stop maintaining polling scripts. When an external API specification changes, Databricks updates the connector behind the scenes. Your pipeline simply keeps running. This frees you to focus on the actual business logic, like transforming raw data into aggregate tables or training machine learning models, instead of babysitting a broken extraction script. The real value of Lakeflow Connect is not just the fast setup, but the permanent removal of custom ingestion code from your maintenance backlog. If you want to help keep the show going, you can search for DevStoriesEU on Patreon and support us there. Thanks for spending a few minutes with me. Until next time, take it easy.
8

Automated ETL: Declarative Pipelines

3m 30s

Stop micromanaging your data workflows. Learn how Lakeflow Spark Declarative Pipelines figure out the infrastructure and dependencies for you.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 8 of 15. You have a complex ETL pipeline where one table updates hourly, another updates continuously, and orchestrating the dependencies requires hundreds of lines of state-management code. What if you could just declare the final tables you want, and let the engine build the infrastructure to keep them updated? That is the premise of Automated ETL using Declarative Pipelines. In a traditional imperative pipeline, you tell the system exactly how to do its job. You write the code to manage checkpoints, handle retries, map dependencies, and provision clusters. Declarative pipelines flip this model. You simply state what the final table should look like, usually with a standard SQL query or Python function. The underlying engine builds the execution graph, manages the infrastructure, and handles the state transitions automatically. To make this work, Databricks relies on two specific table types. A common mistake is treating them as interchangeable. They are not. You must clearly separate your append-only event data from your complex aggregations. The first type is the Streaming Table. Streaming Tables are designed strictly for incremental, append-only processing. They read continuously or in batches from a data source, process only the new records, and append them to the target. Think of processing a massive stream of website clicks coming from Kafka. You write a query to populate a Streaming Table from that Kafka topic. You do not write code to track offsets or remember which messages were already read. The pipeline maintains the state internally, ensuring every click is processed exactly once, even if the system restarts. Now, the second piece of this. Once you have your raw events safely stored, you usually need to transform them. This is where Materialized Views come in. While Streaming Tables handle the initial ingest of new data, Materialized Views are built for complex aggregations, joins, and records that update or delete over time. Returning to our website clicks, you need a daily executive dashboard showing total clicks grouped by region. You define a Materialized View that selects from your Streaming Table and runs the aggregation. When the pipeline runs, the engine evaluates the Materialized View. It determines the most efficient way to bring the view up to date. If it can compute the changes incrementally, it will. If a full recompute is necessary, it handles that automatically. You never write the logic dictating when to refresh or how to merge the new aggregations. Here is the key insight. Because you define both the Streaming Tables and the Materialized Views declaratively, the Databricks engine understands the entire lineage of your data. It knows the Materialized View depends on the Streaming Table. It strings them together into a unified pipeline graph. If a compute node fails mid-processing, the pipeline relies on that graph to pause, retry, and resume without duplicating records or corrupting the final dashboard. Your codebase is no longer cluttered with operational scaffolding. It only contains the pure business logic defining how data flows from source to destination. That is all for this one. Thanks for listening, and keep building!
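
A minimal declarative sketch of the website-clicks example; the Kafka broker, topic, and column handling are assumptions added for illustration.

```sql
-- Streaming table: incremental, append-only ingestion; offsets and state are managed for you.
CREATE OR REFRESH STREAMING TABLE raw_clicks AS
SELECT
  CAST(value AS STRING) AS click_json,
  timestamp             AS event_time
FROM STREAM read_kafka(
  bootstrapServers => 'broker.example.com:9092',
  subscribe        => 'website-clicks'
);

-- Materialized view: the engine decides whether it can refresh this incrementally
-- or needs a full recompute; you only declare the result you want.
CREATE OR REFRESH MATERIALIZED VIEW daily_clicks_by_region AS
SELECT
  DATE(event_time)  AS click_date,
  click_json:region AS region,
  COUNT(*)          AS total_clicks
FROM raw_clicks
GROUP BY DATE(event_time), click_json:region;
```
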
9

Master Orchestration with Lakeflow Jobs

3m 27s

A brilliant data pipeline is useless if it runs in the wrong order. Discover how Lakeflow Jobs orchestrate complex, multi-task workflows reliably.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 9 of 15. If your nightly data processing relies on a chain of independent cron schedules, you are essentially hoping the previous step finished in time. You are flying blind. To stop silent failures and guarantee execution order, you need Master Orchestration with Lakeflow Jobs. First, a quick distinction. Lakeflow Pipelines handle data dependencies down at the table level. Lakeflow Jobs, which we are focusing on now, orchestrate tasks at the macro level. Think of pipelines moving data inside the warehouse, while jobs string together notebooks, Python scripts, and machine learning models into a larger workflow. A job in Databricks is the overarching container for your orchestration. Inside that container, you define multiple tasks. A task is a single isolated unit of work. It could be executing a Databricks notebook, submitting a Spark application, running a dbt project, or firing off a SQL query. By linking these tasks together, you build a graph of execution where one task only begins when its specific prerequisites complete successfully. Let us walk through a practical scenario to see how control flow handles reliability. You have a daily process that ingests raw data, checks its quality, transforms it, and alerts the team if something goes wrong. You start by defining an ingestion task. Next, you link a data quality task that runs strictly after the ingestion finishes. Here is the key insight. Instead of writing custom error handling inside your Python code to decide what happens next, you use native job control flow. You add an if-else condition task immediately after the quality check. The condition evaluates a variable returned by your data check task. If the data is clean, the job follows the if-branch and triggers your downstream transformation task. If the data is corrupted, the job takes the else-branch and triggers a webhook task that pings a Slack channel. You also manage state using run-if task conditions. You can configure an alerting task to execute only if the previous task outright failed, while the rest of the pipeline safely halts. This prevents the classic silent failure cascade, where a broken ingestion step silently triggers a machine learning model to train on completely empty tables. To initiate this workflow, you apply a trigger. Jobs can run on demand, on a traditional scheduled interval, or continuously. They can also execute based on an event, such as a new file arriving in an external cloud storage bucket. Once triggered, Databricks provides built-in observability. You do not have to guess where a failure occurred. The platform records a complete run history with a matrix view, showing you exactly which task succeeded, which task stalled, and how long each step took. You can configure job-level or task-level notifications to send automated emails or webhooks the moment an execution state changes. The true value of this orchestration model is shifting failure handling out of your individual scripts and into the platform infrastructure, ensuring your system knows exactly how to route execution when things inevitably break. That is all for this one. Thanks for listening, and keep building!
10

Databricks SQL: BI Without Limits

3m 19s

Why copy data out of your lake just to analyze it? We explore Databricks SQL and how serverless compute brings blazing-fast BI directly to your raw data.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 10 of 15. Moving data out of your data lake just so your business intelligence team can run queries on it is slow, expensive, and completely unnecessary. You end up maintaining fragile pipelines just to copy data from one system to another, introducing delays and duplicating storage costs. This is exactly the problem Databricks SQL solves. There is a common misconception that Databricks is strictly for data engineers and scientists writing Python or Scala. Databricks SQL clears that up. It is a dedicated workspace built entirely for SQL practitioners. Consider a business intelligence team migrating off a legacy data warehouse. Historically, they waited for engineers to run overnight extraction jobs to load data from the lake into the warehouse. Only then could they start building reports. Databricks SQL eliminates that entire extraction layer. It allows analysts to run standard ANSI-SQL queries directly against the data lake. You get the massive scale of open lake storage, but you interact with it using the familiar, fast interface of a traditional relational warehouse. The engine powering these queries is the Serverless SQL Warehouse. A SQL warehouse is simply a compute resource configured specifically for SQL workloads. In older architectures, you had to provision clusters manually, configure scaling rules, and wait several minutes for virtual machines to boot up before running a query. Here is the key insight. Because these SQL warehouses are serverless, the compute layer starts almost instantly. It scales out automatically when your analysts trigger heavy concurrent workloads, and it terminates itself when the queries finish. The infrastructure management is completely abstracted away, leaving the analysts to focus solely on their data. To write and execute these queries, the platform provides a built-in SQL editor. This is the primary interface for exploring the data. Inside the editor, users can write standard SQL, browse through data catalogs, examine table schemas, and view execution histories. When a query returns data, the analyst does not have to export it to understand it. They can build visualizations directly in the editor and arrange those visualizations into custom dashboards that update automatically. The platform also includes an alerting feature. Analysts can write a query that checks a specific metric, and configure the system to send an email or a web notification if that metric crosses a defined threshold. Many organizations already have established visualization tools. Databricks SQL integrates directly with standard third-party tools like Power BI and Tableau. These external applications connect to the Serverless SQL Warehouse and treat the data lake exactly as if it were a high-performance database. The shift here is fundamentally about proximity to your data. By bringing warehouse-grade compute and standard SQL directly to the lake, you stop copying data and start analyzing your single source of truth the moment it lands. That is all for this one. Thanks for listening, and keep building!
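
As one small example of the alerting workflow described here, an analyst might save a query like the one below and attach an alert that fires when the count crosses a threshold; the table and column names are invented.

```sql
-- Saved query backing an alert: notify the team if failed payments spike.
SELECT COUNT(*) AS failed_payments_last_hour
FROM production.finance.payments
WHERE status = 'FAILED'
  AND event_time >= current_timestamp() - INTERVAL 1 HOUR;
```
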
11

The Semantic Layer: One Source of Truth

3m 18s

Stop arguing over whose dashboard is right. Learn how Databricks Metric Views create a semantic layer that guarantees consistent reporting across teams.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 11 of 15. If three different departments build three different dashboards to track revenue, and they bring three different numbers to the executive meeting, you do not have a data pipeline problem. You have a semantic layer problem. The solution is establishing a semantic layer as one source of truth, and in Databricks SQL, you do this using Metric Views. Raw data is rarely structured the way business users think. Database tables contain obscure column names, complex joins, and raw transaction logs. A semantic layer bridges this gap by translating that underlying data into familiar business concepts. Let us look at a classic scenario where this breaks down. Your marketing team and your finance team both report on Monthly Active Users. Marketing writes a query in their dashboard that counts anyone who opened the application. Finance writes a different query in a separate tool that only counts users who completed a transaction. Both teams call their metric Monthly Active Users. When the numbers clash, organizational trust in the data collapses. Defining Monthly Active Users as a Metric View inside Unity Catalog fixes this forever. To understand why, we need to clarify what this feature actually is. A Metric View is not the same thing as a standard SQL View. A standard SQL view simply saves a query that returns raw rows and columns, leaving it entirely up to the end user to decide how to sum, average, or group that data later. A Metric View is much stricter. It enforces specific aggregation calculations and dimensionality directly at the catalog level. When you create a Metric View, you lock in the exact business logic. You define the measure, such as a distinct count of user IDs based on specific transaction criteria. You also define the allowable dimensions. This means you explicitly dictate that this metric can only be sliced by specific attributes, like the transaction date, the user region, or the device type. Here is the key insight. Once that Metric View is published in Unity Catalog, it becomes the single authoritative definition of Monthly Active Users for the entire company. When analysts connect to Databricks SQL, they do not write custom logic to aggregate the data. They do not join tables or write where clauses to filter active states. They simply query the Metric View. This completely decouples the metric definition from the presentation layer. It does not matter if Marketing is using Tableau, Finance is using Power BI, and the product team is using native Databricks dashboards. The business intelligence tool just asks for the metric, and Databricks performs the predefined calculation on the server side. Because the logic lives centrally in Unity Catalog, it is impossible for different departments to accidentally invent their own math. They all retrieve the exact same number, ensuring perfect consistency across the organization. The real power of a semantic layer is not technical efficiency; it is taking business logic out of disconnected downstream tools and baking it directly into the foundation of the data platform itself. That is all for this one. Thanks for listening, and keep building!
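
A hedged sketch of what this can look like in practice: the definition side uses the metric view YAML specification (field names below are indicative, not exhaustive), and the query side uses the MEASURE aggregate. Every table, dimension, and measure name is illustrative.

```sql
-- Define the single authoritative metric once, in Unity Catalog.
CREATE VIEW analytics.core.monthly_active_users
WITH METRICS
LANGUAGE YAML
AS $$
version: 0.1
source: production.events.transactions
dimensions:
  - name: transaction_month
    expr: DATE_TRUNC('MONTH', transaction_date)
  - name: user_region
    expr: region
measures:
  - name: monthly_active_users
    expr: COUNT(DISTINCT user_id)
$$;

-- Every team queries the same definition; no one re-implements the math.
SELECT user_region, MEASURE(monthly_active_users) AS mau
FROM analytics.core.monthly_active_users
GROUP BY user_region;
```
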
12

Genie Spaces: Talk to Your Data

3m 54s

Empower business users to find answers themselves. Discover how Databricks AI/BI and Genie Spaces allow anyone to query the Lakehouse using plain English.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 12 of 15. What if your sales director could just text your data warehouse to get instant business insights, without ever having to submit a ticket to the data team? Ad-hoc data requests constantly interrupt engineering workflows, and business users hate waiting days for a simple query. The solution to this bottleneck is a Databricks feature called Genie Spaces, which is part of their broader AI/BI offering. AI/BI is a business intelligence product built on a compound AI system. It is designed to understand the specific semantics of your data. Genie Spaces serve as the conversational interface for this system. A Genie Space looks and feels like a standard chat application, but it is wired directly into your data warehouse. Business users type questions in plain English, and Genie responds with actual data, visualizations, and answers. When people hear about AI querying data, they immediately worry about hallucinations. They assume the model will guess column names, invent metrics, or confidently return wrong answers. Genie prevents this by relying entirely on the governed metadata stored in your Unity Catalog. It is not sending a blind prompt to a generic language model. The AI is grounded in your actual schema, your data types, your foreign key relationships, and the predefined metrics your team has established. To make this work, an analyst first creates and configures the Genie Space. They select the relevant datasets from Unity Catalog and provide a set of instructions. They can add sample queries, define specific business terminology, and clarify ambiguous terms. For example, they can tell the system that when a user says "active customer," it specifically means a customer who has purchased within the last ninety days. This initial setup scopes the AI to a well-defined domain. When a question is asked, the system orchestrates multiple steps. It reads the natural language prompt and checks the provided context. It matches the user's intent to the exact tables and columns in the catalog. It then generates a precise SQL query, runs that query against the Databricks SQL compute engine, and formats the results. Consider a non-technical sales manager using a prepared Genie Space. They type, "Show me sales in Europe by product for last quarter." The system parses the request based on its training. It recognizes "Europe" as a region dimension, locates the product tables, and translates "last quarter" into a precise date filter. Within seconds, the AI generates the SQL, executes it, and returns an interactive chart showing the breakdown. If the manager then replies, "Now exclude Germany," Genie modifies the underlying query and updates the chart instantly, maintaining the conversational context. This workflow fundamentally changes how ad-hoc requests are handled. Data engineers and analysts spend a massive portion of their week writing one-off SQL queries for stakeholders. Shifting this exploration to Genie Spaces gives business stakeholders immediate answers while freeing up engineering time for complex tasks. Furthermore, the entire process remains completely governed. Genie strictly respects all row and column-level access controls defined in Unity Catalog. If the user asking the question does not have permission to see sensitive financial data, the AI will simply not query it. Here is the key insight. 
The effectiveness of conversational data exploration is determined by the quality of your underlying data model and metadata, not just the intelligence of the language model. That is all for this one. Thanks for listening, and keep building!
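
For intuition, the SQL that Genie generates behind the scenes for the sales manager's question might look roughly like this; the tables, columns, and date logic are purely illustrative of the translation step, not output from the product.

```sql
-- "Show me sales in Europe by product for last quarter", as governed, generated SQL.
SELECT p.product_name,
       SUM(s.amount) AS total_sales
FROM production.sales.orders s
JOIN production.sales.products p
  ON s.product_id = p.product_id
WHERE s.region = 'Europe'
  AND s.order_date >= DATE_TRUNC('QUARTER', CURRENT_DATE) - INTERVAL 3 MONTHS
  AND s.order_date <  DATE_TRUNC('QUARTER', CURRENT_DATE)
GROUP BY p.product_name;
```
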
13

Deploying AI: Mosaic AI Model Serving

3m 26s

Building an AI model is easy; deploying it is hard. Learn how Mosaic AI Model Serving acts as a secure, unified API gateway for all your machine learning models.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 13 of 15. Training a machine learning model is the fun part, but deploying it as a highly available, secure REST API is where most data science projects go to die. Moving from a notebook experiment to a production-ready endpoint requires configuring scaling, load balancing, and strict governance. To solve this, you use Mosaic AI Model Serving. This feature provides a unified interface to deploy, govern, and query AI models. A common misconception is that Databricks Model Serving is only for models you train yourself inside Databricks. That is incorrect. It actually acts as a central AI Gateway. It handles three distinct types of models: custom models, foundation models, and external models. First, custom models. These are the models you build, log with MLflow, and register in Unity Catalog. Model Serving provisions a serverless container, loads your model dependencies, and exposes the model as a REST API. You do not manage the infrastructure. It scales up when traffic spikes and scales down to zero when idle. Second, Databricks-hosted foundation models. These are large open-source models that Databricks hosts on optimized compute. You get instant access to state-of-the-art architectures without worrying about GPU provisioning. Third, external models. This is where you configure endpoints that point to third-party services. Why route external traffic through Databricks instead of calling external providers directly? Think about governance and cost control. Suppose your company wants to use GPT-4 for an internal application. If every developer hardcodes an API key in their script, you lose visibility. You cannot strictly monitor costs, manage rate limits, or apply filters to prevent employees from sending sensitive customer data to an external provider. By routing all requests through Mosaic AI Model Serving, you force that traffic through a single, secure gateway. You manage one set of credentials. You apply access controls through Unity Catalog, dictating exactly who or what can query the model. You also get centralized tracking of usage, errors, and latency. The logic flow is straightforward. You define a serving endpoint in Databricks. For a custom model, you point the endpoint to a registered MLflow model and define the compute size and scaling limits. Databricks handles the containerization automatically. For an external model, you provide the external provider name and a securely stored API key. Once the endpoint is active, your downstream applications send a standard JSON payload via an HTTP request to the endpoint URL. The response comes back in a consistent format, regardless of whether the model is running on Databricks serverless compute or sitting in an external data center. Here is the key insight. Mosaic AI Model Serving removes the friction of deployment while enforcing security. It standardizes your application layer, so your client code only ever talks to a single Databricks endpoint, completely abstracted from where or how the underlying model is hosted. By the way, if you want to help support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
14

AI Functions: LLMs in Your SQL Queries

3m 22s

You don't need to be a Python expert to use Generative AI. Discover how Databricks AI Functions let you apply Large Language Models directly to your data using standard SQL.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 14 of 15. You have ten thousand raw customer support logs sitting in a database table, and the business needs them summarized and categorized by sentiment by the end of the day. Normally, extracting this kind of insight requires a complex Python pipeline, careful management of API keys, and custom batching logic to feed the text into a Large Language Model. What if you could execute that entire workload using a basic database command? That is the exact problem solved by AI Functions, which embed LLMs directly into your SQL queries. AI Functions bridge the gap between state-of-the-art Generative AI and everyday data analytics. They take a capability that usually requires specialized machine learning engineering and hand it to anyone who can write SQL. Instead of building separate infrastructure to extract data, send it to a model, and write the predictions back, AI Functions bring the model directly to where the data already lives. The primary tool for this is a built-in command called A I query. You use it exactly like a standard text processing function within a select statement. You provide the name of the model endpoint you want to use, and then you provide the prompt. Returning to those ten thousand support logs, your workflow becomes trivial. You write a query selecting your customer ID and log text. Then, you add a new column using the A I query function. Your prompt tells the model to read the text, extract the main complaint, and determine if the sentiment is positive, neutral, or negative. You pass the column containing your raw log text into that prompt. When you run the query, the database engine automatically distributes this request. It processes every single row through the specified Large Language Model. The model evaluates the text and returns the summary and sentiment. Because this all happens in SQL, the output arrives as standard structured columns. You can immediately filter the results to show only negative sentiment, join those results with a customer billing table, and aggregate the data to find out which product is causing the most frustration. Here is the key insight. You might assume that giving data analysts access to Large Language Models means distributing sensitive API keys across your entire organization. It does not. AI Functions are tightly integrated with Databricks Model Serving. The actual connections to external models, or self-hosted open-source models, are configured by administrators at the platform level. The data analyst never sees an API key, a token, or a secret. They only reference the pre-configured endpoint name in their query. The entire operation remains completely secure. Every query is logged, and all access controls applied to the data and the models are strictly enforced by the platform governance framework. By removing the infrastructure friction and the credential management, you change the nature of data exploration. You transform complex unstructured text analysis into a simple filtering operation, instantly upgrading the analytical power of your entire team. Thanks for listening. Take care, everyone.
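
Here is a minimal sketch of the support-log workflow using ai_query, the function the episode spells out as "A I query"; the serving endpoint name, table, and prompt wording are assumptions.

```sql
-- Classify and summarize every support log in place; results come back as ordinary columns.
SELECT
  customer_id,
  log_text,
  ai_query(
    'databricks-meta-llama-3-3-70b-instruct',  -- a pre-configured serving endpoint (illustrative)
    CONCAT(
      'Read the following support log. Reply with the main complaint and whether the ',
      'sentiment is positive, neutral, or negative: ',
      log_text
    )
  ) AS complaint_and_sentiment
FROM production.support.logs;
```
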
15

The Future: The AI Agent Framework

3m 28s

Go beyond simple chatbots. In our series finale, we explore the Databricks AI Agent Framework and how to build autonomous AI that acts on your data.

Hi, this is Alex from DEV STORIES DOT EU. Databricks, episode 15 of 15. Your standard chatbot is polite, informative, and completely passive. It can tell you how to fix a broken pipeline based on a manual, but it cannot actually fix the pipeline for you. Moving from AI that just talks to AI that actively executes tasks requires a fundamental shift in architecture. That shift is the AI Agent Framework. Let us clear up a common confusion immediately. People often mix up simple Retrieval-Augmented Generation applications with true agents. A RAG application is essentially a search engine with a language model on top. It retrieves documents and summarizes them. It only reads. A true AI Agent has tools. It can write. It can run SQL queries, trigger jobs, and call external APIs. It changes the state of your systems. The Databricks Agent Framework provides the infrastructure to build, evaluate, and deploy these autonomous agents securely within the Lakehouse. The core mechanism here is tool calling combined with multi-step reasoning. Instead of just generating an answer in one pass, the language model acts as a reasoning engine. You give it a goal and a set of tools, which are basically functions you have defined. The agent decides which tool to use, waits for the result, and then decides what to do next. Think about an agent designed to monitor data pipelines. When a failure occurs, the agent does not just sit there waiting for a user prompt. The framework allows it to trigger a workflow. First, the agent needs context. It uses a custom tool you provided to run a SQL query against your system logs in Databricks. The framework executes this query and feeds the result back to the agent. Here is the key insight. The agent evaluates those logs, identifies the root cause of the failure, and then moves to its next step. It realizes the engineering team needs to know. So, it selects another tool, an API integration with your company chat application. It calls that tool to draft and send a message detailing the exact error and the proposed fix. This is multi-step reasoning in action. The agent planned a sequence, executed code, observed the outcome, and communicated the result, all autonomously. Giving a language model the ability to execute queries and trigger APIs is a massive security risk if handled poorly. This is why the Databricks approach tightly couples the Agent Framework with Unity Catalog. When you deploy an agent using Databricks Model Serving, you are not giving it blanket access to your infrastructure. You register your tools as specific functions within Unity Catalog. Unity Catalog enforces strict governance over what those functions can do. If you give an agent a tool to query log tables, Unity Catalog ensures it can only read those specific tables. If the language model hallucinates and tries to use the SQL tool to drop a production database, the framework stops it because the underlying function lacks the necessary permissions. The agent is strictly bound by the governance rules of your Lakehouse. This capability turns the Lakehouse from a passive storage layer into an active, automated environment. As we wrap up this series, I encourage you to check out the official documentation and try building a simple tool-calling agent yourself. If you want to suggest topics for our next series, drop by devstories dot eu. The transition from chatbots to agents is the defining shift in how we build AI applications today. That is all for this one. Thanks for listening, and keep building!
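
As a closing sketch, here is what registering one of those governed tools can look like: a Unity Catalog SQL function the agent may call to pull recent pipeline errors. The names, columns, and limits are illustrative, and wiring the function into an agent happens separately through the Agent Framework and Model Serving.

```sql
-- A governed tool: the agent can only do what this function (and its grants) allow.
CREATE OR REPLACE FUNCTION main.ops.lookup_recent_errors(target_pipeline STRING)
RETURNS TABLE (event_time TIMESTAMP, error_message STRING)
COMMENT 'Returns the most recent error log lines for one pipeline.'
RETURN
  SELECT event_time, error_message
  FROM main.ops.pipeline_logs
  WHERE pipeline_name = target_pipeline
  ORDER BY event_time DESC
  LIMIT 50;

-- The agent's service principal gets execute rights on the tool, and nothing broader.
GRANT EXECUTE ON FUNCTION main.ops.lookup_recent_errors TO `pipeline-monitor-agent`;
```
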