Season 2 · 15 Episodes · 54 min · 2026

Learning DSPy (v3.1 - 2026 Edition)

A comprehensive, step-by-step curriculum for learning DSPy, moving from brittle string-based prompts toward modular, structured programming and automated optimization.

AI/ML Frameworks · Prompt Engineering
1
Programming, Not Prompting
This episode covers the fundamental philosophy of DSPy: moving away from brittle string-based prompts toward modular, structured programming. Listeners will learn why separating system architecture from language model instructions creates more robust AI applications.
3m 25s
2
Configuring Language Models
Learn how to configure and manage language models in DSPy. This episode covers setting default models, handling caching, overriding generation settings, and accessing different model providers through LiteLLM.
3m 33s
3
Declarative Prompting with Signatures
Discover how DSPy Signatures replace traditional prompting. This episode explains how to define the input and output behavior of a module declaratively, using both inline strings and class-based definitions with strict typing.
3m 50s
4
Building Blocks with Modules
Explore DSPy Modules, the core building blocks for language model programs. This episode covers dspy.Predict, dspy.ChainOfThought, and how to compose multiple modules into a larger, cohesive pipeline.
3m 53s
5
Connecting Models with Adapters
Understand the role of Adapters in DSPy. This episode explains how ChatAdapter and JSONAdapter bridge the gap between abstract DSPy signatures and the actual multi-turn messages sent to language model APIs.
4m 06s
6
Managing Data with Examples
Learn how DSPy handles datasets for machine learning. This episode covers the dspy.Example object, distinguishing between input keys and labels, and preparing data for evaluation and optimization.
3m 32s
7
Defining Success with Metrics
Discover how to evaluate DSPy programs using metrics. This episode teaches you how to write custom Python functions to score outputs, use the trace argument, and even leverage AI-as-a-judge for long-form evaluations.
4m 05s
8
An Introduction to Optimizers
Enter the core magic of DSPy: Optimizers. This episode provides an overview of what optimizers do, the iterative optimization cycle, and the unusual 20/80 data splitting strategy for prompt optimization.
3m 30s
9
Automatic Few-Shot Learning
Learn how DSPy automates few-shot prompting. This episode focuses on BootstrapFewShot and BootstrapFewShotWithRandomSearch, explaining how they synthesize, filter, and inject high-quality examples into your prompts.
3m 11s
10
Instruction Optimization with MIPROv2
Dive into automatic instruction tuning. This episode explores MIPROv2 and COPRO, showing how DSPy uses Bayesian Optimization and coordinate ascent to discover superior, counter-intuitive prompt instructions.
3m 35s
11
Finetuning with BootstrapFinetune
Discover how to distill massive language models into smaller, efficient ones. This episode covers BootstrapFinetune, explaining how to convert a prompt-based DSPy program into a customized, weight-updated model.
3m 21s
12
Automated Tool Use with ReAct
Learn how to give language models access to external tools. This episode covers the dspy.ReAct module, demonstrating how to build autonomous agents that reason and interact with APIs dynamically.
3m 30s
13
Manual Tool Handling for Control
Take full control over tool execution. This episode covers manual tool handling in DSPy using dspy.Tool, dspy.ToolCalls, and native function calling for latency-sensitive applications.
3m 35s
14
Integrating Tools with MCP
Connect your agents to universal tool servers. This episode explains how to use the Model Context Protocol (MCP) in DSPy to leverage standardized tools across different frameworks with minimal setup.
3m 57s
15
Ensembles and Meta-Optimization
Push DSPy to its limits. The final episode covers program transformations via dspy.Ensemble and the experimental BetterTogether meta-optimizer, which combines prompt tuning with weight finetuning for maximum performance.
3m 18s

Episodes

1

Programming, Not Prompting

3m 25s

This episode covers the fundamental philosophy of DSPy: moving away from brittle string-based prompts toward modular, structured programming. Listeners will learn why separating system architecture from language model instructions creates more robust AI applications.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 1 of 15. You spend three hours perfectly crafting a prompt so your language model generates the right outputs. Then a provider releases a new model, you swap the API key, and your entire pipeline shatters. You are stuck maintaining fragile strings instead of building software. That is the problem solved by a philosophy called Programming, Not Prompting. People often hear about DSPy and assume it is just another prompt templating tool for inserting variables into text blocks. It is not. DSPy is a framework for compiling and optimizing control flow. Take a standard setup for a system that reads documents and generates a report with citations. In a traditional approach, you write a massive text block. You explain the task, inject the documents, add manual instructions like think step by step, and specify exactly how the citations must look. This approach tightly couples your system architecture with incidental choices. Your architecture is the core logic. That means extracting facts, drafting a summary, and appending citations. The incidental choices are the specific words you used to coax one particular language model into obeying you. Those exact words will not work optimally on a different model, or even a different version of the same model. When the data shifts slightly, the prompt breaks. DSPy separates your system architecture from those incidental prompt choices. You stop writing long instruction strings. Instead, you define your task purely in terms of inputs and outputs. For the report generator, you declare that the input is a list of text snippets, and the output is a drafted text and a set of citation references. Once the inputs and outputs are defined, you wire them together using standard code. You create an extraction component, pass it the documents, and collect the facts. Then you pass those facts into a drafting component. Finally, you might use a simple loop to cross-check the draft against the original facts. There are no manual strings begging the model to format its output correctly. There is only structured logic. Here is the key insight. Because your architecture is defined as modular code, a compiler can automatically translate that structure into the actual prompts required by whichever language model you are using. DSPy treats the model instructions, the reasoning steps, and the few-shot examples as internal parameters. These are variables to be optimized by the framework, rather than static text you type out by hand. You build the pipeline, define the data shapes, and write the execution steps in standard Python. The framework handles the unpredictable task of discovering the best way to ask the language model to execute those steps reliably. This fundamentally changes the developer experience. You spend your time debugging the logic of your application, not guessing which adjective will make the model pay attention to the end of a sentence. Your system architecture should outlive the quirks of whichever language model you happen to be routing requests to today. Thanks for listening, happy coding everyone!
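To make the report-generator architecture above concrete, here is a minimal sketch in DSPy-style Python. The field names (documents, facts, report, citations) and the two-module split are illustrative assumptions drawn from this episode, not code taken from the framework's documentation.

```python
import dspy

# Architecture as code: declare inputs and outputs, not prompt strings.
extract_facts = dspy.Predict("documents -> facts")
draft_report = dspy.ChainOfThought("facts -> report, citations")

def generate_report(documents: str):
    facts = extract_facts(documents=documents).facts
    drafted = draft_report(facts=facts)
    # A simple loop cross-checking the draft against the extracted facts could go here.
    return drafted.report, drafted.citations
```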
2

Configuring Language Models

3m 33s

Learn how to configure and manage language models in DSPy. This episode covers setting default models, handling caching, overriding generation settings, and accessing different model providers through LiteLLM.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 2 of 15. Making a single API call to a language model is easy. But managing multiple providers, caching responses, and tracking prompt histories usually forces you to build and maintain a custom wrapper. Configuring Language Models in DSPy eliminates that busywork. You might assume you need the OpenAI SDK for GPT models, the Anthropic SDK for Claude, and a separate library for local models. You do not. DSPy handles this uniformly through a single LM class powered by LiteLLM under the hood. To use a model, you instantiate a DSPy LM object and pass a string containing the provider name, a slash, and the model name. For instance, you pass open ai slash gpt-4o-mini. When you create this object, you can also pass standard parameters like temperature or max tokens. Because of the unified backend, these parameter names remain consistent regardless of which provider you actually call. You can interact with this LM object directly. Call it like a normal Python function, passing either a simple text string or a list of chat dictionaries. It processes the input and returns a list of generated strings. By default, it returns a single string in that list, but you can request multiple completions. This direct usage is straightforward, but manually passing a model object around to every function in a large codebase gets messy fast. To solve this, DSPy uses a global configuration system. You define your default language model once by calling dspy dot configure, assigning your instantiated LM object as the target. Every subsequent DSPy operation automatically routes through that model. But what if you want to compare outputs between providers? Say you want to test how Claude 3.5 Sonnet handles a specific function compared to your default GPT model. Instead of overwriting the global state, you use dspy dot context. This creates a temporary scope. You open a Python with block using dspy dot context, assign Claude as the local default, and execute your code. When the block ends, DSPy automatically reverts to your global GPT-4o-mini model. That covers routing requests. What about performance? DSPy caches every model generation by default to save time and API costs. If you run the exact same prompt with the exact same parameters, it serves the cached response instantly. Here is the key insight. Sometimes you need a fresh generation without changing your prompt or adjusting the temperature. To do this, DSPy uses a parameter called rollout id. When you pass a new rollout id, like a unique integer, DSPy treats it as a distinct request and bypasses the cache. This forces the model to generate a new sequence, giving you control over generation diversity while keeping the core inputs static. Finally, when experimenting, you need to see exactly what went over the wire. Every LM object maintains its own interaction log. You can access the raw data through the history attribute on the model object. For a human readable summary, you call the inspect history method. It prints the exact prompt sent to the provider and the exact response received. The real value of this configuration layer is that it entirely detaches your application logic from provider quirks, turning model selection and caching into simple declarative switches. Thanks for listening, happy coding everyone!
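A minimal configuration sketch following this episode, assuming current dspy APIs (dspy.LM, dspy.configure, dspy.context, dspy.inspect_history); the model identifiers and the rollout_id behaviour are as described above and may vary between versions.

```python
import dspy

# Default model, with generation settings passed once at construction.
lm = dspy.LM("openai/gpt-4o-mini", temperature=0.7, max_tokens=512)
dspy.configure(lm=lm)

# Call the LM directly with a plain string or a list of chat messages.
print(lm("Name one prime number."))                        # returns a list of strings
print(lm(messages=[{"role": "user", "content": "Name one prime number."}]))

# Temporarily route through another provider without touching global state.
claude = dspy.LM("anthropic/claude-3-5-sonnet-20241022")   # example model id
with dspy.context(lm=claude):
    print(claude("Same question, different provider."))

# Bypass the cache for a fresh generation while keeping the prompt identical.
lm("Name one prime number.", rollout_id=1)

# Inspect what actually went over the wire.
dspy.inspect_history(n=1)
print(len(lm.history))
```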
3

Declarative Prompting with Signatures

3m 50s

Discover how DSPy Signatures replace traditional prompting. This episode explains how to define the input and output behavior of a module declaratively, using both inline strings and class-based definitions with strict typing.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 3 of 15. Standard Python function signatures tell your code what data types to expect. But what if a signature could dictate the actual logic of the function itself, without you explicitly writing the instructions? That is the premise of Declarative Prompting with Signatures. You might naturally confuse a DSPy signature with a standard Python function signature. They look similar, but their roles are fundamentally different. A standard Python signature defines a strict data interface. A DSPy signature actually declares and initializes the behavior of the language model. You are not writing a prompt. You are writing a declarative specification of what needs to happen. The framework takes this specification, looks at the input and output expectations, and constructs the underlying prompt for you. The simplest way to define a signature is inline, using a short string. You specify your input variables, write a directional arrow, and specify your output variables. For example, the string "question arrow answer" tells DSPy that the model will receive a question and must generate an answer. You can pass multiple inputs, like "context comma question arrow answer". This is where it gets interesting. The variable names you choose carry real semantic weight. DSPy uses those exact string names to assign roles in the prompt. If you name an input "context", the model interprets it as background information. Do not over-engineer these names or try to hack the keywords with clever prompt tricks. Use clear, descriptive English words to define the roles. When inline strings are not expressive enough, you move to class-based signatures. You create a new class that inherits from the DSPy Signature base class. Inside this class, you define your inputs and outputs as attributes. You assign these attributes using the Input Field and Output Field functions provided by DSPy. This approach gives you fine-grained control over the behavior of the model. The docstring of the class itself becomes the core instruction for the language model, defining the overall task. Consider a multi-modal image classification scenario. You want to pass an image and a text question to a vision model, and extract the specific breed of a dog. You create a class called ClassifyDogBreed. At the top of the class, you write a docstring saying, Identify the breed of the dog based on the provided image and question. Next, you define your inputs. You create an attribute named "image" and assign it as an Input Field. You create a second attribute named "question" and assign it as an Input Field. Finally, you define an attribute named "breed" and assign it as an Output Field. Inside that Output Field, you can pass a description argument stating, The exact name of the dog breed, with no extra text. Class-based signatures also handle type resolution. You can specify standard Python type hints for your fields. If you type-hint your output field as a boolean, DSPy understands that the model must return a true or false value. The framework processes these type annotations and automatically injects constraints into the prompt structure, guiding the model toward the correct output format. The structure of your data and the names of your variables are the actual instructions. A clearly named field and a precise docstring in a declarative signature will dictate model behavior far more reliably than paragraph after paragraph of hand-crafted prompt engineering. 
Thanks for listening, happy coding everyone!
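The dog-breed signature from the walkthrough, written out as class-based code; a sketch that assumes dspy.Image is available as a multimodal input type in your DSPy version.

```python
import dspy

class ClassifyDogBreed(dspy.Signature):
    """Identify the breed of the dog based on the provided image and question."""

    image: dspy.Image = dspy.InputField()
    question: str = dspy.InputField()
    breed: str = dspy.OutputField(desc="The exact name of the dog breed, with no extra text.")

classify = dspy.Predict(ClassifyDogBreed)

# Inline signatures cover the simpler cases, and type hints become output constraints:
qa = dspy.Predict("context, question -> answer")
spam_check = dspy.Predict("email -> is_spam: bool")
```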
4

Building Blocks with Modules

3m 53s

Explore DSPy Modules, the core building blocks for language model programs. This episode covers dspy.Predict, dspy.ChainOfThought, and how to compose multiple modules into a larger, cohesive pipeline.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 4 of 15. Usually, when you want a language model to reason through a problem, you resort to hacking strings together, haphazardly appending phrases like think step by step to your prompt. You tangle your application logic with fragile text instructions. In DSPy, prompting techniques are not strings. They are structured, swappable components called Modules. It is easy to confuse modules with signatures. Think of a signature as the contract. It defines the what, mapping specific input fields to specific output fields. A module defines the how. It is a parameterized, callable object that takes your signature and applies a specific prompting strategy to fulfill that contract. The most basic module is the Predict module. You initialize it by passing your signature as an argument. If your signature asks to turn a document into a summary, the Predict module handles the prompt formatting and calls the language model. But maybe the task is complex and requires intermediate logic. You can easily swap Predict out for the Chain of Thought module. You do not change your signature. You just pass it to the Chain of Thought module instead. Under the hood, this module automatically modifies the prompt architecture. It instructs the language model to generate a step-by-step reasoning trace before producing the final output fields you defined. When you call the Chain of Thought module with your input data, it returns an object containing your requested outputs. Because you used Chain of Thought, that object also includes a new field containing the model's rationale. You can inspect exactly how the language model arrived at its answer, separated entirely from the final extracted value. Here is the key insight. You can nest these built-in modules to create complex programs, much like you would stack neural network layers in PyTorch. We can build a multi-hop retrieval pipeline to see this in action. You start by defining a custom class. In its initialization phase, you declare the smaller modules you will need. For a multi-hop architecture, you might declare a query generator module using Chain of Thought, and an answer synthesis module using standard Predict. Then, you define a forward method to route data between them. The forward method takes an initial user question. It passes that question to your query generator module, which outputs a search query. You execute that search against your database to retrieve a document. If you need a second hop, you pass the retrieved document and the original question back into the query generator module to produce a more refined search query. Finally, you pass all the retrieved documents and the user question into the answer synthesis module to generate the final response. You have just built a custom, executable graph out of modular components. When you chain multiple calls together like this, it is crucial to see exactly what is being sent to the model behind the scenes. DSPy tracks your language model usage globally. You can call the inspect history command on your language model object to print the most recent interactions. This renders the exact string the model received and the exact string it generated, ensuring your composed pipeline is assembling the context correctly. 
By separating your task definition into signatures and your execution strategy into parameterized modules, you transform prompting from a text-editing chore into an architectural decision, allowing you to upgrade your pipeline's reasoning capability simply by swapping a class name. Thanks for listening, happy coding everyone!
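A sketch of the multi-hop pipeline described above. The search function is a placeholder for whatever retrieval you use, and the module and field names are illustrative.

```python
import dspy

class MultiHopQA(dspy.Module):
    def __init__(self, search_fn):
        super().__init__()
        self.search = search_fn  # placeholder retrieval callable: query string -> document
        self.generate_query = dspy.ChainOfThought("question, context -> search_query")
        self.synthesize = dspy.Predict("question, context -> answer")

    def forward(self, question):
        context = []
        for _ in range(2):  # two retrieval hops, as in the episode
            query = self.generate_query(question=question, context=context).search_query
            context.append(self.search(query))
        return self.synthesize(question=question, context=context)

# Usage: pipeline = MultiHopQA(search_fn=my_search); pipeline(question="...")
# dspy.inspect_history(n=2) then shows the exact prompts each sub-module sent.
```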
5

Connecting Models with Adapters

4m 06s

Understand the role of Adapters in DSPy. This episode explains how ChatAdapter and JSONAdapter bridge the gap between abstract DSPy signatures and the actual multi-turn messages sent to language model APIs.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 5 of 15. You write a clean, strongly-typed signature in your code, pass it to a language model, and somehow get a perfectly structured Python object back. Between your clean code and the raw text API of the model sits a chaotic translation layer. That hidden layer is the concept of Connecting Models with Adapters. Before we look at how they work, let us clear up a common confusion. You might mix up Adapters with Modules. Modules manage the reasoning strategy. They decide if the model should use Chain of Thought or rely on external tools. Adapters do not care about strategy. An adapter is purely the translation layer. It handles the raw string and JSON serialization that actually gets sent over the wire to the model API. Language models do not understand declarative signatures. They expect multi-turn message arrays containing specific roles and text blocks. The adapter bridges this gap. The default tool for this in DSPy is the ChatAdapter. When you invoke a module, the ChatAdapter intercepts your signature and formats it into a standard chat history. The main signature instruction is mapped directly into the system prompt. Your input fields are gathered and placed into the user message. Here is the key insight. The ChatAdapter uses specific text markers to keep your inputs strictly organized. It wraps every field name in double square brackets and double hash symbols. If your input field is named context, the language model sees a marker with brackets and hashes surrounding the word context, immediately followed by the actual context data. This visual boundary prevents the model from accidentally confusing your system instructions with the user input text. It repeats this pattern for the expected output fields, prompting the model to generate those exact same markers in its response. Consider a scenario where you are extracting science news. Your input is a raw text article, and your output needs to match a Pydantic class with specific fields for the headline and the core scientific discovery. When you pass this requirement through the ChatAdapter, it inspects your Pydantic class, generates a complete JSON schema, and injects that schema directly into the system prompt. It explicitly tells the language model how to format its text response. When the model eventually replies, the ChatAdapter catches the raw text string. It searches for the expected output markers, extracts the text block between them, and parses that data back into the exact Python objects your application requires. That covers inputs and parsing for text-based interactions. But modern language models often have native support for structured output. This is where the JSONAdapter comes into play. Instead of heavily modifying the system prompt and relying on text markers, the JSONAdapter takes a more direct route. It delegates the formatting constraints to the model provider's native JSON mode or structured outputs API. The model is forced at the protocol level to return a valid JSON object that contains all of your requested output fields. Because the model API is handling the structure natively, it bypasses the need for the adapter to search through raw text for string markers. If your target model supports this capability, switching your pipeline to use the JSONAdapter usually results in lower latency and significantly more reliable parsing. 
The adapter is the rigid boundary between your deterministic application logic and the unstructured text generation of the language model. By controlling exactly how inputs are serialized and outputs are parsed, adapters ensure your pipeline never breaks due to a badly formatted string. Thanks for listening, happy coding everyone!
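A sketch of the science-news extraction scenario and of switching the translation layer to the JSONAdapter; the Pydantic field names are assumptions.

```python
import dspy
from pydantic import BaseModel

class ScienceNews(BaseModel):
    headline: str
    discovery: str

class ExtractNews(dspy.Signature):
    """Extract the key science news item from the article."""

    article: str = dspy.InputField()
    news: ScienceNews = dspy.OutputField()

# Default: the ChatAdapter wraps fields in [[ ## field_name ## ]]-style markers.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# If the provider supports native structured outputs, swap the adapter:
dspy.configure(adapter=dspy.JSONAdapter())

extract = dspy.Predict(ExtractNews)
# result = extract(article="...")   # result.news is parsed into a ScienceNews instance
```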
6

Managing Data with Examples

3m 32s

Learn how DSPy handles datasets for machine learning. This episode covers the dspy.Example object, distinguishing between input keys and labels, and preparing data for evaluation and optimization.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 6 of 15. You run an optimizer to improve your language model pipeline, but it achieves a perfect score on the very first try. You look closer and realize you accidentally fed the target answers directly into the prompt. To treat language models like traditional machine learning components, you need an airtight way to manage your training, dev, and test sets without leaking answers. Managing Data with Examples in DSPy handles exactly this. In DSPy, the foundational data structure is the Example object. You use it to build all of your datasets. On the surface, it acts a lot like a standard Python dictionary. You create one by passing in key-value pairs that represent your data. Take a summarization task. You create a new Example object and give it two fields. You assign a long string to a field called report, and a short string to a field called summary. You can read these values back at any time using standard dot notation, requesting the report field or the summary field directly from the object. It is common to treat the Example object as just a dictionary wrapper, but using a plain dictionary will break the compilation process. When you pass a dataset to a DSPy optimizer, the compiler needs to separate what goes into the pipeline from what is used to score the pipeline. It requires explicit boundaries between the input data and the expected answers. This is where it gets interesting. The Example object controls these boundaries using a specific method called with_inputs. When you instantiate your Example containing the report and the summary, you chain the with_inputs method at the very end. You pass it the string "report". This explicitly tags the report field as the input data. Any field you do not specify in this method automatically becomes a label. The optimizer now knows it must only send the report to your pipeline. The summary remains entirely hidden during inference. Once you have a single example configured, you group multiple examples into standard Python lists to form your training, dev, and test sets. Because DSPy frames prompt engineering as a machine learning optimization problem, having these clearly partitioned datasets is a strict requirement. When the optimizer runs your pipeline over the training set, it processes one Example at a time. It strips out the labels, feeds the inputs forward, captures the generated output, and then evaluates the result. When you write custom evaluation metrics, you will need to access these separated fields. The Example object provides two methods for this extraction. Calling the inputs method returns a dictionary containing only the data you marked as inputs. Calling the labels method returns a dictionary containing the hidden target data. Your evaluation function calls the labels method to retrieve the target summary, compares it against the generated text, and assigns a score based on how well they match. Properly configuring your Example objects guarantees that your pipeline actually learns to map inputs to outputs. The strict separation of inputs and labels prevents data leakage during optimization, making sure your system improves rather than just memorizing provided answers. Thanks for listening, happy coding everyone!
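The summarization example from this episode as a minimal code sketch; the report and summary strings are placeholders.

```python
import dspy

example = dspy.Example(
    report="Quarterly revenue grew 12 percent, driven by the new subscription tier ...",
    summary="Revenue up 12% on subscription growth.",
).with_inputs("report")          # every field not listed here becomes a label

print(example.report)            # fields read back with dot notation
print(example.inputs())          # only the fields marked as inputs
print(example.labels())          # only the hidden target fields

trainset = [example]             # datasets are plain Python lists of Examples
```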
7

Defining Success with Metrics

4m 05s

Discover how to evaluate DSPy programs using metrics. This episode teaches you how to write custom Python functions to score outputs, use the trace argument, and even leverage AI-as-a-judge for long-form evaluations.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 7 of 15. You cannot consistently improve what you cannot measure, and when dealing with language models, relying on human intuition to measure output quality simply does not scale. To automatically rewrite your prompts or tune your system, the compiler needs a mathematical guiding star, and that is exactly what Defining Success with Metrics provides. In DSPy, a metric is a standard Python function. It takes two primary arguments. The first is an example, which represents the gold standard input and expected output from your dataset. The second is a prediction, which is the actual response generated by your DSPy program. The metric function compares the prediction against the example and returns a score. This score is usually a float, an integer, or a simple boolean value like true or false. For basic classification tasks, your metric might be straightforward Python logic. You could write an exact match function that checks if the predicted string perfectly equals the expected string. To execute this measurement systematically across your data, DSPy provides a built-in utility called Evaluate. You pass this utility your development dataset, your metric function, and execution parameters like the number of parallel threads. The Evaluate utility runs your metric over every prediction, aggregates the results, and returns a single numerical score representing your overall system performance. However, exact matching is almost always too rigid for generative language tasks. This is where you transition from simple string checks to using AI feedback, a pattern commonly known as LLM-as-a-judge. Because DSPy modules are just Python code, your metric function can instantiate and call a smaller, separate DSPy program to grade complex semantic outputs. Consider a concrete scenario. You are building a system that generates a promotional tweet to answer a user question. An exact match metric fails completely here because a good tweet can be phrased in countless valid ways. Instead, your metric function should evaluate multiple dimensions of the output. First, it uses a basic Python length check to ensure the generated text is under two hundred and eighty characters. Second, it checks if the text contains the factual answer required by the example. Finally, it passes the generated text to a specialized DSPy signature that asks a smaller language model to assess if the tweet is engaging. Your metric function then combines the length check, the fact check, and the language model engagement score into one final mathematical value. When you eventually start compiling and optimizing these programs, your metric function must accept a third, optional argument. It is called trace. Listeners often confuse the trace argument with a debugging log that prints out console errors or execution history. That is what it is not. The trace argument is a specific object used by the DSPy compiler during optimization to validate intermediate reasoning steps. If your program chains multiple language model calls together, the trace contains the specific reasoning path the model took to reach the end. By accessing the trace inside your metric, your function can verify not just that the final tweet was good, but that the intermediate steps used to draft it were logically sound. This is the part that matters. Your metric strictly defines what success looks like, and the DSPy compiler will ruthlessly optimize your system to maximize that specific score. 
If your metric is flawed, your compiled program will be flawed in exactly the same way. Thanks for listening, happy coding everyone!
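A sketch of the tweet metric described above, including the optional trace argument; the judge signature, the field names, and the devset variable are assumptions.

```python
import dspy

judge = dspy.Predict("question, tweet -> engaging: bool")   # small AI-as-a-judge helper

def tweet_metric(example, pred, trace=None):
    fits_length = len(pred.tweet) <= 280
    has_answer = example.answer.lower() in pred.tweet.lower()
    engaging = judge(question=example.question, tweet=pred.tweet).engaging

    score = (fits_length + has_answer + engaging) / 3.0
    # During compilation the optimizer passes a trace; be strict in that case.
    if trace is not None:
        return score >= 1.0
    return score

# devset: a list of dspy.Example objects prepared as in episode 6.
# evaluate = dspy.Evaluate(devset=devset, metric=tweet_metric, num_threads=8, display_progress=True)
# evaluate(my_program)
```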
8

An Introduction to Optimizers

3m 30s

Enter the core magic of DSPy: Optimizers. This episode provides an overview of what optimizers do, the iterative optimization cycle, and the unusual 20/80 data splitting strategy for prompt optimization.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 8 of 15. Suppose you write a piece of software, and instead of manually tweaking its internal logic to pass your tests, a compiler automatically rewrites the instructions so the program performs better on its own. You do not touch the code. You just provide the test set. That is exactly what DSPy does through a concept called Optimizers. Optimizers, which were previously called teleprompters in older versions of the framework, are algorithms that tune the parameters of your program. In traditional machine learning, parameters mean neural network weights. In DSPy, parameters primarily mean the actual prompt strings and instructions sent to the language model, though they can also include weights. The optimizer's job is to adjust these parameters to maximize a metric you have already defined. This process happens before you deploy your application. It is pure pre-inference time compute. You spend processing power up front to find the best instructions, so your application runs accurately later. When you hear the word optimizer, you might assume you need massive datasets, like you would for fine-tuning a traditional model. You do not. Prompt optimizers are highly efficient. They typically require only thirty to three hundred examples. Because the dataset is so small, DSPy recommends an unusual approach to splitting your data. Instead of the standard eighty-twenty split where most data goes to training, you flip it. You use twenty percent for training and eighty percent for validation. If you have fifty examples, you give ten to the optimizer to build the prompts and use the remaining forty to evaluate if those prompts actually generalize. This reverse split prevents the optimizer from overfitting the generated prompts to a tiny set of inputs. Here is the key insight. The iterative development cycle in DSPy revolves around running this optimization loop. Let us walk through a concrete scenario. You are building a basic question answering bot. First, you define your DSPy program and your metric. Next, you gather your dataset of fifty unlabelled questions. You split this data, passing the small training portion to an optimizer object. You tell the optimizer to compile your program using your training data and your metric. The optimizer runs, experimenting with different prompt structures under the hood. It checks the outputs against your metric, learns what works, and refines the prompts. When the optimizer finishes, it returns a new, compiled version of your program. This compiled program contains the newly tuned parameters. You do not need to run this optimization step every time your application starts. Instead, you call the save method on the compiled program, providing a file path. This writes all the optimized prompts and configurations to a standard JSON file. When you deploy your application to production, your code simply instantiates the base program and calls the load method, pointing it to that exact JSON file. Your bot is immediately ready to answer questions using the optimized instructions. The true power of DSPy optimizers is that they decouple your application logic from the exact phrasing of your prompts, letting computation find the best words for you. Thanks for listening, happy coding everyone!
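The compile, save, and load cycle as a short sketch; the example list, metric, and file name are placeholders.

```python
import dspy

program = dspy.ChainOfThought("question -> answer")

# Reverse split: roughly 20% of the data trains the optimizer, 80% validates it.
trainset, valset = examples[:10], examples[10:]

optimizer = dspy.BootstrapFewShot(metric=my_metric)
compiled = optimizer.compile(program, trainset=trainset)

compiled.save("qa_program.json")       # persist the tuned prompts and demos

# In production: rebuild the base program, then load the tuned state.
production_program = dspy.ChainOfThought("question -> answer")
production_program.load("qa_program.json")
```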
9

Automatic Few-Shot Learning

3m 11s

Learn how DSPy automates few-shot prompting. This episode focuses on BootstrapFewShot and BootstrapFewShotWithRandomSearch, explaining how they synthesize, filter, and inject high-quality examples into your prompts.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 9 of 15. Manually picking the best examples to put in a prompt is tedious and prone to bias. You guess which examples matter, hardcode them into your strings, and hope the model pays attention. DSPy actively synthesizes, tests, and selects the perfect demonstrations for you. This process is automatic few-shot learning, and DSPy handles it through three specific optimizers. The simplest approach is LabeledFewShot. You provide a set of labeled training examples. The optimizer randomly selects a subset of these input and output pairs and inserts them directly into your prompts as demonstrations. It gives the model a basic pattern to follow. This works well if your training data exactly matches the intermediate steps your program needs. Usually, it does not. This brings us to BootstrapFewShot. A common misconception is that BootstrapFewShot just randomly picks examples from your training set. It does not. It actively generates intermediate reasoning steps that never existed in your raw data. Here is how the bootstrapping process flows. The optimizer requires a teacher program. By default, this is just the unoptimized, zero-shot version of your own program. The teacher runs through your training examples. For each example, it attempts to generate an answer. DSPy then passes that answer to your evaluation metric. If the metric says the output is correct, DSPy saves the entire trace of that successful execution. This trace includes the input, the output, and crucially, any intermediate work the program did to get there. Consider a sentiment classifier. Your raw dataset contains only customer reviews and a positive or negative label. Your DSPy program asks the language model to use chain-of-thought reasoning before outputting the sentiment. When bootstrapping, the teacher reads a review and writes out a paragraph of reasoning before guessing the sentiment. If the final guess matches the true label, that generated reasoning is considered successful. The optimizer collects these successful traces. It takes four of them and injects them into future prompts. Your zero-shot classifier is now an expert four-shot classifier, complete with synthesized reasoning steps. BootstrapFewShot stops once it finds enough successful traces. But the first successful traces are not always the best ones. Here is the key insight. BootstrapFewShotWithRandomSearch solves this by running the entire bootstrap process multiple times. Each time, it pulls a random sub-sample of your training data. This creates several different candidate sets of few-shot demonstrations. The optimizer then takes all these candidate sets and evaluates them against your validation data. It tests which specific combination of demonstrations yields the highest overall score. It discards the weak sets and keeps the mathematical winner. The true power of automatic few-shot learning is not just saving you time writing prompts, but discovering successful intermediate reasoning paths your dataset never explicitly contained. Thanks for listening, happy coding everyone!
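A sketch of both bootstrap optimizers for the sentiment example; the parameter names follow recent DSPy releases, and the trainset and valset lists are assumed to exist.

```python
import dspy

classifier = dspy.ChainOfThought("review -> sentiment")

def sentiment_match(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()

# Single bootstrap pass: the unoptimized program acts as its own teacher.
bootstrap = dspy.BootstrapFewShot(metric=sentiment_match, max_bootstrapped_demos=4)
compiled = bootstrap.compile(classifier, trainset=trainset)

# Repeat the process over random sub-samples and keep the best-scoring demo set.
search = dspy.BootstrapFewShotWithRandomSearch(
    metric=sentiment_match, max_bootstrapped_demos=4, num_candidate_programs=8
)
best = search.compile(classifier, trainset=trainset, valset=valset)
```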
10

Instruction Optimization with MIPROv2

3m 35s

Dive into automatic instruction tuning. This episode explores MIPROv2 and COPRO, showing how DSPy uses Bayesian Optimization and coordinate ascent to discover superior, counter-intuitive prompt instructions.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 10 of 15. Sometimes the most effective instruction for a language model reads like nonsense to a human. You spend hours meticulously crafting the perfect phrasing, only to find that an automated, slightly disjointed prompt outperforms your work completely. This is why you should let algorithms write your prompts. We are looking at Instruction Optimization with MIPROv2. First, clear your mind of prompt templating. This is not about swapping variables into a static string. Instruction optimization actually rewrites the systemic instructions that govern your pipeline. Earlier algorithms like COPRO and SIMBA tackled this by generating step-by-step variations of prompts and refining them over time. MIPROv2 takes this concept much further by treating instructions and few-shot examples as a unified search space. MIPROv2 operates in three distinct stages. The first stage is bootstrapping. The optimizer runs your unoptimized program over your training data to build a pool of execution traces. These traces contain the actual inputs, intermediate steps, and outputs flowing through your system. The second stage is grounded proposal. The optimizer does not guess new instructions blindly. It uses a separate language model, called the prompter, to look at those generated traces. By analyzing where your pipeline succeeded and where it failed, the prompter drafts a set of new instruction candidates. These candidates are directly grounded in the actual behavior of your program, not generic templates. The third stage is discrete search. MIPROv2 evaluates the new instructions alongside different combinations of few-shot traces. To do this efficiently, it relies on Bayesian Optimization. Instead of brute-forcing every possible combination, MIPROv2 builds a surrogate model. This surrogate model acts as a lightweight proxy. It predicts which combinations of instructions and traces will yield the highest score on your specific evaluation metric. Bayesian Optimization allows the surrogate model to map the prompt and demonstration space. It calculates the expected improvement of testing a new combination. This systematically balances exploring untested instructions against exploiting the combinations that already score well. The optimizer zeroes in on the optimal configuration without executing thousands of redundant network calls. Consider a concrete scenario. You build a ReAct agent to answer complex queries. Initially, its accuracy sits at 24 percent. You pass this agent into MIPROv2, configure it to run in light mode, and provide a dataset of 500 questions. The optimizer bootstraps traces, proposes grounded instructions, and searches the space using the surrogate model. When it finishes, your agent's accuracy jumps from 24 percent to 51 percent. The final prompt driving that performance leap will likely contain instructions and trace selections a human would never have drafted. Here is the key insight. MIPROv2 removes the bottleneck of human intuition. It treats your natural language instructions exactly like tunable weights in a mathematical model, shifting prompt creation from an unpredictable art form into a deterministic optimization problem. Thanks for listening, happy coding everyone!
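A sketch of the MIPROv2 run from the scenario above; the tool, metric, and dataset names are placeholders, and auto="light" mirrors the "light mode" mentioned in the episode.

```python
import dspy

agent = dspy.ReAct("question -> answer", tools=[web_search])   # web_search: your tool function

optimizer = dspy.MIPROv2(metric=my_metric, auto="light")
optimized_agent = optimizer.compile(agent, trainset=trainset)

optimized_agent.save("react_miprov2.json")
```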
11

Finetuning with BootstrapFinetune

3m 21s

Discover how to distill massive language models into smaller, efficient ones. This episode covers BootstrapFinetune, explaining how to convert a prompt-based DSPy program into a customized, weight-updated model.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 11 of 15. Prompting massive models is great for getting a prototype off the ground. But when that logic hits production, the latency and cost of a seventy billion parameter model quickly become a problem. You need the reasoning power of the large model, but the speed and price of an eight billion parameter model. That is exactly what BootstrapFinetune handles. BootstrapFinetune compiles your DSPy program into a fine-tuned model. It serves as the ultimate optimization for efficiency. It updates the actual weights of a smaller target model to mimic the exact behavior of your heavy pipeline. A common misconception is that to fine-tune a model, you have to manually gather thousands of examples, format them into tedious JSONL files, and babysit a training loop. BootstrapFinetune completely automates this. It handles dataset generation, formatting, and the weight updates entirely through the execution traces of your program. Take a concrete scenario involving a banking intent classifier. The program takes a messy customer message and categorizes it. Initially, you build a DSPy module using a highly capable model like GPT-4o-mini configured to use chain-of-thought reasoning. The model thinks step-by-step through the customer phrasing before outputting the intent. It gets the right answers, but it is too slow and expensive for real-time chat. To optimize this, you initialize BootstrapFinetune. You give it your evaluation metric to measure success, and you specify the smaller, cheaper target model you want to deploy. Then you compile the program. When you hit compile, DSPy runs your unoptimized program over your training data. It uses the heavy teacher model to generate outputs. The optimizer watches this execution. Every time the teacher model gets the right answer according to your metric, BootstrapFinetune captures the trace. It records the inputs, the step-by-step reasoning, and the final output. It maps the internal logic of the massive model into a format the small target model can ingest. Once enough successful traces are collected, BootstrapFinetune automatically structures them into a training dataset. It then triggers the fine-tuning process on your target model. The small model is trained directly on the high-quality reasoning paths generated by the large model. Here is the key insight. The smaller model learns the specific task distribution and the reasoning style required to solve it without running the heavy chain-of-thought at inference time. In our banking classifier example, a standard small model might only achieve sixty-six percent accuracy out of the box. But after compiling with BootstrapFinetune, that same small model jumps to eighty-seven percent accuracy. Fine-tuning is no longer a separate infrastructure project; it is simply another compilation step that turns an expensive reasoning pipeline into a fast, cheap production asset. Thanks for listening, happy coding everyone!
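A rough sketch of the distillation setup described above. The teacher/student wiring, the experimental flag, and the model identifiers are assumptions; the exact BootstrapFinetune arguments differ between DSPy versions.

```python
import dspy

dspy.settings.experimental = True   # some versions gate finetuning behind this flag

# Teacher: the capable but expensive chain-of-thought program.
teacher = dspy.ChainOfThought("message -> intent")
teacher.set_lm(dspy.LM("openai/gpt-4o-mini"))

# Student: the same program, pointed at the small target model whose weights get updated.
student = teacher.deepcopy()
student.set_lm(dspy.LM("openai/gpt-4.1-nano"))             # placeholder small model

optimizer = dspy.BootstrapFinetune(metric=intent_match)    # intent_match: your metric
finetuned = optimizer.compile(student, teacher=teacher, trainset=trainset)
```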
12

Automated Tool Use with ReAct

3m 30s

Learn how to give language models access to external tools. This episode covers the dspy.ReAct module, demonstrating how to build autonomous agents that reason and interact with APIs dynamically.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 12 of 15. Giving an LLM access to external APIs makes it incredibly capable, but writing the loop to manage its reasoning, parse the outputs, and recover from execution errors is a massive headache. The solution is fully automated tool use with the DSPy ReAct module. People often confuse ReAct with basic function calling. Function calling is merely the API mechanism that allows a language model to format its output as a structured data request. ReAct is a specific behavioral paradigm. It stands for Reason and Act. It is an execution loop where the model cycles through three distinct steps: a Thought, an Action, and an Observation. The DSPy ReAct module completely manages this orchestration for you. You do not write the execution loop. You do not manually catch API exceptions. ReAct wraps a DSPy signature and a list of tools, transforming a static prompt into an autonomous agent. To use it, you first define your tools. In DSPy, tools are simply standard Python functions. You write a function, define its input parameters, and provide a clear docstring. That docstring is critical. DSPy extracts the function name and the docstring, passing them to the language model so it knows exactly what the tool does and when to deploy it. Consider a scenario where you build a basic weather and search agent. You write a Python function named get weather that accepts a city name as a string and queries an API to return the current temperature. You instantiate the dspy dot ReAct module, passing it a standard question and answer signature along with a list containing your get weather function. When you ask the module what the weather is in Tokyo, the ReAct loop begins. First, the model generates a Thought. It reasons that it needs current meteorological data for Tokyo. Next, it generates an Action. It decides to call your get weather tool, passing Tokyo as the argument. Here is the key insight. You do not execute that function yourself. The DSPy ReAct module intercepts the model's Action, executes your Python function behind the scenes, and captures the output. If the function succeeds, DSPy feeds the temperature data back to the model as an Observation. If the model hallucinates a parameter or the function throws a Python error, DSPy catches that error and feeds the error message back as the Observation. The model reads the error, generates a new Thought to correct its mistake, and tries a new Action. Once the model observes the correct temperature data, it recognizes that its goal is met. It breaks out of the loop and formats the final answer for the user. To prevent runaway executions, this cycle is strictly bounded by a parameter called max iters, which stands for maximum iterations. This parameter dictates how many Thought, Action, and Observation cycles the module is allowed to perform. If the model struggles to find the correct data and hits the iteration limit, ReAct forces it to stop searching and generate a final response using only the information it has successfully gathered. The true power of this module is that it abstracts away the brittle, error-prone control flow of agent loops, letting you treat complex tool augmented reasoning as just another predictable component in your pipeline. Thanks for listening, happy coding everyone!
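The weather agent from this episode as a short sketch; the weather lookup itself is stubbed out.

```python
import dspy

def get_weather(city: str) -> str:
    """Return the current temperature for the given city."""
    # Stub: replace with a real weather API call.
    return f"It is 21 degrees Celsius in {city}."

agent = dspy.ReAct("question -> answer", tools=[get_weather], max_iters=5)

result = agent(question="What is the weather in Tokyo right now?")
print(result.answer)
```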
13

Manual Tool Handling for Control

3m 35s

Take full control over tool execution. This episode covers manual tool handling in DSPy using dspy.Tool, dspy.ToolCalls, and native function calling for latency-sensitive applications.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 13 of 15. Automated agents are great when you have a loose, open-ended task. But when you need absolute, deterministic control over exactly how, when, and if an external function is fired, handing the wheel completely to the language model is too risky. You have to pop the hood and manage the execution yourself. That is exactly where manual tool handling for control comes in. Using an automated agent loop abstracts away the execution layer, which can cause unpredictable latency or hide runtime errors. Manual handling is the power-user alternative. It restores your control over error recovery, timeout limits, and exact execution order. To build this in DSPy, you start by wrapping a standard Python function using the dspy.Tool class. Imagine you have a Python function that acts as a calculator to multiply two numbers. You pass this function into dspy.Tool. If your function handles database queries or network requests, you can also wrap asynchronous functions, and the tool class will manage the async execution natively. Once your calculator tool is ready, you must expose it to the language model. You do this by passing a list containing your tool directly into a DSPy Predict module. You define a parameter called tools in your Predict call. When the model processes the input, it evaluates the prompt and decides if it needs the calculator to generate the final answer. When the model decides to use your tool, it relies on an underlying mechanism called an Adapter. By default, DSPy uses a JSONAdapter. This adapter automatically translates your Python tool into the native function calling format required by the specific language model API you are using. This ensures the model outputs reliable, structured JSON when requesting a tool. Pay attention to this bit. It is easy to assume that using native tool calling automatically produces higher quality outputs. The DSPy documentation explicitly warns that this is a misconception. Native tool calling provides better reliability for the syntax of the request, but it does not guarantee better reasoning quality than standard text-based parsing. The model is not suddenly smarter just because it is formatting JSON. Because you are managing this process manually, the model does not actually run the calculator. It stops and returns a response containing its intent. You access this intent by inspecting response outputs dot tool calls. This property returns a dspy.ToolCalls object, which behaves like a list of instructions. Each item in this list specifies which tool the model wants to use and the exact arguments it generated, like five and ten for the calculator. Next, you write a standard Python loop to iterate through these requested tool calls. For each call, you manually invoke its execute method. Invoking execute triggers your actual Python code using the model's generated arguments and returns the result. If the arguments are invalid, or if the calculator throws an error, your Python loop catches it. You handle the failure on your terms, rather than hoping an automated loop recovers on its own. Manual tool handling cleanly separates the model's decision to request an action from the physical execution of that action, giving your application the deterministic reliability that production environments require. Thanks for listening, happy coding everyone!
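A sketch of the manual loop described above. The signature layout (a tools input and a dspy.ToolCalls output) and the execute call follow this episode's description; treat the exact attribute names as version-dependent.

```python
import dspy

def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

calculator = dspy.Tool(multiply)

class ToolSignature(dspy.Signature):
    question: str = dspy.InputField()
    tools: list[dspy.Tool] = dspy.InputField()
    outputs: dspy.ToolCalls = dspy.OutputField()

predict = dspy.Predict(ToolSignature)
response = predict(question="What is 5 times 10?", tools=[calculator])

# The model only *requests* calls; you decide whether and how to run them.
for call in response.outputs.tool_calls:
    try:
        print(call.execute())
    except Exception as err:
        print(f"Tool call failed, handled on your terms: {err}")
```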
14

Integrating Tools with MCP

3m 57s

Connect your agents to universal tool servers. This episode explains how to use the Model Context Protocol (MCP) in DSPy to leverage standardized tools across different frameworks with minimal setup.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 14 of 15. Every time you adopt a new AI framework, you typically end up rewriting the same custom Python wrappers for your database queries, web searches, and file readers. Instead of maintaining endless duplicate API wrappers, what if your agents could instantly plug into universal, standardized tool servers? That is the promise of the Model Context Protocol, and today we are looking at Integrating Tools with MCP. The Model Context Protocol is an open standard introduced by Anthropic. It provides a universal way to connect AI models to external data sources and tools. By adopting this standard, developers write a tool once, host it on an MCP server, and use it across any supporting framework. DSPy supports this natively. You do not need to write complex adapter classes to bring these external tools into your DSPy programs. A common misconception is that DSPy handles the underlying server connections to these tools itself. It does not. DSPy relies entirely on the official Python mcp package to manage the networking and establish the connection. DSPy only steps in at the very end to convert the active MCP tool object into a native DSPy tool format. To see how this works, let us walk through connecting a local tool server to a DSPy agent. First, you need an MCP server running. In your Python script, you import the server parameters class from the MCP package. If you are running a local process, you define standard IO server parameters and point it to your server executable. Alternatively, if your tools live on a remote server, you configure an HTTP client connection. Both methods establish how your application will talk to the tool provider. Next, you use the MCP library to open a client session. Inside this session context, you initialize the connection. At this point, DSPy is still completely out of the picture. You ask the active MCP session to list its available tools. The server responds with a list of tool objects. Each object contains the tool name, a description of what it does, and the expected input arguments. Now you bridge the gap. This is where it gets interesting. For every tool returned by the server, you call the from mcp tool method on the base DSPy Tool class. You pass this method two specific arguments: the raw tool object, and the active client session. This single command reads the schema provided by the MCP server and instantly wraps it in a compatible interface. You now have a ready-to-use list of native DSPy tools. Finally, you hand that newly converted list of tools to an agent. You initialize a ReAct module, and pass in your array of DSPy tools. When you run the agent, it can now seamlessly call the external MCP tools. The arguments flow from the ReAct module, through the converted DSPy wrapper, over the MCP client session to the server, and the result flows back to inform the next reasoning step. The real power of this integration is decoupling. Your DSPy modules can securely access enterprise databases or local file systems with zero custom wrapper code, while ensuring the exact same tool server remains completely usable by entirely different frameworks. Thanks for listening, happy coding everyone!
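A sketch of the local stdio flow described above, assuming the official mcp Python package; the server command and the question are placeholders.

```python
import asyncio

import dspy
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Point at a local MCP server process (command and script are placeholders).
server_params = StdioServerParameters(command="python", args=["my_mcp_server.py"])

async def main():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            listed = await session.list_tools()
            dspy_tools = [dspy.Tool.from_mcp_tool(session, t) for t in listed.tools]

            agent = dspy.ReAct("question -> answer", tools=dspy_tools)
            result = await agent.acall(question="Which files are in the project folder?")
            print(result.answer)

asyncio.run(main())
```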
15

Ensembles and Meta-Optimization

3m 18s

Push DSPy to its limits. The final episode covers program transformations via dspy.Ensemble and the experimental BetterTogether meta-optimizer, which combines prompt tuning with weight finetuning for maximum performance.

Hi, this is Alex from DEV STORIES DOT EU. Learning DSPy, episode 15 of 15. What happens when you combine Bayesian prompt optimization with deep learning weight fine-tuning? You stop treating prompt engineering and model training as isolated steps, and you start treating them as a continuous pipeline. This is where you reach the cutting edge of automated AI engineering, relying on Ensembles and Meta-Optimization. We need to clear something up immediately. When you hear the word ensemble, you probably think of querying five different foundation models simultaneously. In DSPy, an ensemble is something else entirely. An ensemble here means running multiple optimized programs on the same underlying language model. It combines different prompt structures and different reasoning traces to aggregate their outputs. The logic here is straightforward. During a deep optimization run, different configurations often discover distinct, equally valid reasoning paths to arrive at the correct answer. Say you just ran the MIPROv2 optimizer. It evaluates hundreds of configurations and keeps a history of the best performers. Instead of picking the single highest-scoring program and discarding the rest, you extract the top five candidate programs. You pass these into the DSPy Ensemble transformation. When a new input comes in, the ensemble executes all five programs. It aggregates their outputs, usually through majority voting, and returns a final, highly robust answer. You are essentially scaling your compute at inference time to guarantee a higher quality result. Running a five-program ensemble on a massive foundation model gives you incredible accuracy, but it is expensive and slow. This is where meta-optimizers come in. A meta-optimizer manages the execution of other optimizers, sequencing them to compound their benefits. The prime example in DSPy is BetterTogether. BetterTogether layers improvements systematically. It allows you to take the massive reasoning capability of your ensemble and distill it down into a fast, fine-tuned model. First, you configure BetterTogether to use prompt optimization to generate extremely high-quality reasoning traces from your heavy ensemble. Next, it automatically passes those traces into a weight optimizer. The weight optimizer uses that data to fine-tune the parameters of a much smaller, cheaper student model. Finally, BetterTogether can run a second round of prompt optimization, this time tailoring the instructions specifically to the newly updated weights of the student model. You are moving from prompt optimization, to weight optimization, back to prompt optimization. The output is a highly specialized, fast model that captured the diverse reasoning paths of the original ensemble, without the massive inference cost. Layering optimization techniques sequentially is how you bridge the gap between heavy, expensive reasoning and fast, production-ready inference. This brings us to the end of the series. I highly encourage you to explore the official DSPy documentation, try building these pipelines hands-on, or visit devstories dot eu to suggest topics you want to see covered next. Thanks for listening, happy coding everyone!
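To close, a rough sketch of both ideas; the candidate program list, the metric, and the strategy string are assumptions, and BetterTogether remains experimental, so argument names may differ between releases.

```python
import dspy
from dspy.teleprompt import BetterTogether, Ensemble

# Ensemble the top candidates kept from an earlier MIPROv2 run (assumed available).
top_programs = candidate_programs[:5]
ensembler = Ensemble(reduce_fn=dspy.majority)          # aggregate outputs by majority vote
ensemble_program = ensembler.compile(top_programs)
# answer = ensemble_program(question="...")

# Meta-optimization: prompt tuning -> weight finetuning -> prompt tuning again.
meta = BetterTogether(metric=my_metric)
distilled = meta.compile(student_program, trainset=trainset, strategy="p -> w -> p")
```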