DeepEval brings the rigor of Pytest to non-deterministic LLM applications. In this episode, we explore the framework's core identity and the critical difference between End-to-End and Component-Level evaluations.
3m 55s
2
Defining the LLM Interaction
You cannot measure what you haven't properly defined. Learn how the LLMTestCase defines an atomic unit of evaluation, including its mandatory and optional parameters.
3m 51s
3
The Power of LLM-as-a-Judge
Learn how DeepEval uses LLM-as-a-judge to evaluate test cases, returning scores from 0-1 alongside detailed reasoning. Discover how to configure custom evaluation models.
3m 36s
4
Evaluating RAG Generators
Focus purely on the generation side of RAG pipelines. Learn how Answer Relevancy and Faithfulness metrics ensure your LLM answers the prompt without hallucinating.
3m 55s
5
Evaluating RAG Retrievers
If the context is garbage, the answer will be garbage. Discover how Contextual Precision, Recall, and Relevancy assess the quality of your retrieval engine.
3m 42s
6
Agentic Evaluation
Evaluating autonomous agents requires analyzing complex execution flows. Learn how Task Completion and Tool Correctness metrics keep multi-step agents in check.
3m 41s
7
Multi-Turn Conversation Evaluation
Chatbots require evaluating the entire conversation history. Learn how ConversationalTestCase and specialized metrics track Role Adherence and Knowledge Retention across multiple turns.
3m 49s
8
Building Custom Metrics with G-Eval
When standard metrics fail, build your own. Discover how G-Eval allows you to define custom evaluation criteria in plain English using a 2-step CoT algorithm.
3m 47s
9
Deterministic Evaluation with DAG
Take absolute control over your evaluations. Learn how the Deep Acyclic Graph (DAG) metric uses decision trees to deterministically judge complex formatting and logic.
3m 16s
10
The Evaluation Dataset
Scale your testing by building robust datasets. Explore how EvaluationDatasets group Goldens, distinguish between single and multi-turn data, and import from CSV/JSON.
3m 22s
11
Generating Synthetic Data
Don't have real user data? Learn how to use the Synthesizer to automatically generate high-quality Goldens directly from your knowledge base documents.
3m 16s
12
Evolving Synthetic Complexity
Basic queries are too easy for modern LLMs. Deep dive into EvolutionConfig to artificially complicate synthetic queries using techniques like Reasoning and Concretizing.
3m 32s
13
LLM Tracing and Observability
Move beyond black-box testing. Learn how to use the @observe decorator to trace components, create spans, and gain white-box visibility into your LLM pipelines.
3m 19s
14
Dynamic Evals at Runtime
When workflows are unpredictable, build your test cases dynamically. Learn how to use update_current_span to inject tests as data flows through the agent.
3m 30s
15
Introduction to Red Teaming
Correctness is not security. Explore the DeepTeam framework and learn the four core components of red teaming: Vulnerabilities, Attacks, Targets, and Metrics.
3m 53s
16
Executing Adversarial Attacks
Automate your security tests. Learn how to configure a Model Callback in DeepTeam and launch prompt injections to automatically uncover biases and flaws.
3m 55s
17
CI/CD and Continuous Evaluation
Stop deploying blind. Learn how to integrate DeepEval into your CI/CD pipelines using Pytest integrations to catch LLM regressions before they hit production.
3m 27s
18
The Finale - Scale with Confident AI
Take your evals to the cloud. Discover how Confident AI centralizes testing reports, tracks hyperparameters, and monitors regressions across your entire team.
4m 04s
Episodes
1
The Pytest for LLMs
3m 55s
DeepEval brings the rigor of Pytest to non-deterministic LLM applications. In this episode, we explore the framework's core identity and the critical difference between End-to-End and Component-Level evaluations.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 1 of 18. Most teams evaluate their newly deployed language models by manually reading a spreadsheet of responses. It is slow, heavily biased, and impossible to scale as an application grows. DeepEval is the open-source evaluation framework that replaces this manual work, functioning effectively as the Pytest for LLMs.
Suppose your team is deploying a new generative application. You need to ensure it behaves correctly before pushing it to production. With DeepEval, you write tests exactly the way you write standard Python unit tests. You create a file named test example dot py. Inside that file, you construct a test case. This test case object holds the initial input prompt, the actual output generated by your application, and any expected target answers. You then apply an evaluation metric to this test case. Instead of using a standard equality check to see if variable A perfectly matches variable B, you use a specialized assertion function provided by the framework. You run this file from your command line using the standard Pytest command. The framework executes the tests in parallel, catches the assertions, and reports passes and failures right in your terminal.
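As a minimal sketch, that test file can look like this (the metric, threshold, and strings are illustrative; run it with pytest or with the `deepeval test run` command):

```python
# test_example.py — a minimal DeepEval test, written like a standard Pytest test
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is your return policy?",
        actual_output="You can return unworn items within 30 days of purchase.",  # your app's real output
    )
    # Specialized assertion instead of a plain equality check
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```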
Here is the key insight. Standard deterministic tests cannot evaluate the quality or accuracy of human language. If a user asks your application for a summary of a document, there are thousands of correct ways to write that summary. Regex patterns and exact string matching are useless here because they cannot interpret semantic meaning. DeepEval handles this massive variability using a concept called LLM-as-a-judge.
The framework uses a highly capable, generalized language model to evaluate the specific outputs of your own application. The judge model reads your application's output, compares it against the strict criteria of the metric you selected, and calculates a numeric score. More importantly, it outputs a boolean result indicating if the score meets your predefined threshold, alongside a plain text reason explaining exactly why it gave that score. This means a failed test gives you immediate debugging context.
When designing these test cases, you must choose between two distinct modes of evaluation. It is easy to confuse the scope of these modes, so let us draw a clear line. End-to-end evaluation only looks at the initial input and the final output. The entire application is treated as a black box. You provide a prompt, the system gives an answer, and the judge scores that final text. It completely ignores how the application generated the response.
Component-level evaluation is a white-box approach. Instead of just checking the final answer, this mode traces the specific internal steps your application took to get there. If your system searches a database to retrieve context documents before generating its text, a component-level test evaluates that specific search step. It checks if the retrieved documents were actually relevant to the user's prompt, entirely independent of the final generated response. You test the internal machinery, not just the final product. You can have a system that passes an end-to-end test by giving a correct answer, but fails a component-level test because it pulled that answer from the wrong internal document.
Evaluating a language model is no longer a subjective reading exercise; it is a rigid, automated, and repeatable piece of your continuous integration pipeline. Thanks for listening, happy coding everyone!
2
Defining the LLM Interaction
3m 51s
You cannot measure what you haven't properly defined. Learn how the LLMTestCase defines an atomic unit of evaluation, including its mandatory and optional parameters.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 2 of 18. You cannot measure what you have not defined. If your automated evaluations feel inconsistent, the issue often starts before the metric even runs. You are likely passing the wrong parts of the interaction across the wrong evaluation boundaries. The fix is strictly defining the LLM interaction using the LLM test case blueprint.
An LLM test case in DeepEval serves as the atomic unit of evaluation. It forces you to isolate a single, specific interaction with your system. You do not pass full conversation logs or raw databases to an evaluator. Instead, you extract exactly what went in, what came out, and what background data was involved for one distinct turn. Multi-turn interactions have their own specific blueprint, but the standard test case focuses strictly on one isolated request and response.
Every test case requires two mandatory arguments. First is the input. This is the exact string a user or system submitted to the model. Second is the actual output. This is the text your large language model generated in response. If you evaluate nothing else, you must provide these two parameters to measure basic metrics like toxicity or answer relevance.
Consider a customer support chatbot. The input is a user asking if they can return a pair of worn shoes after thirty five days. The actual output is your model generating a response that denies the return.
To evaluate whether that denial is actually correct, you need to provide a baseline truth. DeepEval gives you two different optional parameters for this. These are expected output and context. Developers mix these up constantly. Context is strictly factual. It contains the raw, unformatted truth, like a text string from your corporate policy stating a strict thirty day refund limit. Expected output is much more specific. It dictates tone, linguistics, and formatting. You use expected output when you want the evaluator to check if the model replied with a polite, specific apology, rather than just outputting a blunt denial. Context grounds the facts. Expected output grounds the style and exact phrasing.
Here is the key insight. How you construct the rest of this test case changes depending on your underlying architecture. If you are evaluating a Retrieval-Augmented Generation pipeline, you must define the retrieval context. This parameter accepts a list of strings representing the exact document chunks your retriever pulled from the vector database. Do not confuse this with the standard context parameter. Context is the ideal truth you hardcode for the test. Retrieval context is the real data your pipeline actually found in production. Evaluators compare the two to determine if your search algorithm is retrieving the right documents.
If you are building an agent rather than a standard pipeline, you utilize the tools called parameter. This accepts a list of objects or strings representing the specific functions the agent decided to invoke during this isolated interaction, such as triggering an internal refund calculator. Providing this lets you evaluate agentic routing decisions alongside the final text generation.
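Putting all of these parameters together, a fully specified single-turn test case might look like the sketch below (values are illustrative; older DeepEval versions accepted plain strings in tools_called instead of ToolCall objects):

```python
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Can I return a pair of worn shoes after thirty five days?",       # mandatory
    actual_output="Unfortunately the return window has already passed.",     # mandatory
    expected_output="A polite apology that explains the 30-day limit.",      # optional: ideal style and phrasing
    context=["Refunds are accepted within 30 days of purchase."],            # optional: hardcoded factual truth
    retrieval_context=["Policy chunk the pipeline actually fetched."],       # optional: what RAG retrieved at runtime
    tools_called=[ToolCall(name="refund_calculator")],                       # optional: functions the agent invoked
)
```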
The reliability of an automated metric is entirely bound by the hygiene of these parameters. An evaluator can never penalize a hallucination if you fail to provide the strict factual context against which it checks the output.
Thanks for listening, happy coding everyone!
3
The Power of LLM-as-a-Judge
3m 36s
Learn how DeepEval uses LLM-as-a-judge to evaluate test cases, returning scores from 0-1 alongside detailed reasoning. Discover how to configure custom evaluation models.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 3 of 18. Traditional NLP metrics like BLEU and ROUGE are terrible at evaluating modern large language models, but deploying a human for every single test run is impossible. The solution to scaling your evaluation without sacrificing semantic understanding is the power of LLM-as-a-Judge.
Traditional metrics look for exact word overlaps. If your model outputs a perfectly accurate summary using different synonyms, a metric based on exact string matching will score it poorly because it cannot understand meaning. Using an LLM as a judge solves this problem. It reads the output, processes the context, and evaluates the semantics much like a human reviewer would, but at machine speed.
In DeepEval, an LLM-evaluated metric performs three specific actions. First, it computes a score between zero and one. Zero is a complete failure, one is perfect. Second, it compares that score against a strict threshold you define. If your threshold is zero point seven and the model scores a zero point six, the test fails. Third, it returns a reason. The evaluating LLM generates a text explanation detailing exactly why it assigned that specific score. This tells you what went wrong without forcing you to manually read the raw logs.
DeepEval divides these metrics into three categories. RAG metrics evaluate retrieval-augmented generation pipelines. Agentic metrics assess autonomous agents. Custom metrics let you define your own evaluation criteria from scratch. While the underlying prompts for these differ, they all utilize the same judge mechanism.
When selecting a metric, developers often confuse reference-based and referenceless metrics. Here is the key insight. Reference-based metrics require a ground truth. They need a known, correct answer to compare against, which makes them highly effective during early development and testing. Referenceless metrics do not need a ground truth. They evaluate the output based entirely on the provided context or the input prompt itself. Because they do not rely on a pre-written answer, referenceless metrics are exactly what you use for live production monitoring.
It is tempting to attach a dozen metrics to every prompt to ensure quality. Do not do this. The rule of thumb is to use fewer than five metrics per application. Pick the metrics that actually align with your specific business logic. Running unnecessary metrics just results in slower tests and higher compute costs.
Speaking of costs, using a flagship commercial model to judge thousands of daily test runs gets expensive quickly. DeepEval allows you to swap the default evaluator for custom models. You can configure the framework to use Azure OpenAI if your enterprise infrastructure requires it. Alternatively, you can set up a local model using Ollama. By running a capable open-source model locally on your own hardware, you create a free, unbiased judge. You simply initialize your local Ollama client and pass that model object directly into the metric configuration. DeepEval then handles the rest, executing the entire evaluation pipeline without hitting external billing APIs.
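As a rough sketch, the swap can be as small as pointing the metric at a different judge. The set-ollama command and the OllamaModel class follow DeepEval's custom-model documentation; treat the exact names as assumptions and verify them against your installed version:

```python
# Option 1: a one-time CLI switch (run in your shell, not in Python) so metrics
# default to a local Ollama judge:
#   deepeval set-ollama llama3

# Option 2: pass a judge object into a single metric (class name assumed from DeepEval docs)
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import OllamaModel

local_judge = OllamaModel(model="llama3")
metric = AnswerRelevancyMetric(threshold=0.7, model=local_judge)
```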
The true value of an LLM judge is not just the numerical score, but the automated reasoning it provides to help you debug every single failure.
Thanks for listening, happy coding everyone!
4
Evaluating RAG Generators
3m 55s
Focus purely on the generation side of RAG pipelines. Learn how Answer Relevancy and Faithfulness metrics ensure your LLM answers the prompt without hallucinating.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 4 of 18. It is one thing if your RAG pipeline answers a question. It is an entirely different problem if it confidently invents details that do not exist in your source documents. Today we are looking at evaluating RAG generators, specifically how to catch those exact failures.
A Retrieval-Augmented Generation pipeline has two distinct moving parts. The retriever grabs the relevant documents. The generator takes those documents and writes the final response. We are ignoring the retriever today. We assume your system already fetched the right documents. Our focus is strictly on the generator synthesizing the text. To ensure this text is both safe and useful, you rely on two metrics: Faithfulness and Answer Relevancy.
The first metric is Faithfulness. This is your primary guardrail against hallucination. Faithfulness measures whether the generator's response, which DeepEval calls the actual output, can be entirely justified by the retrieval context.
Consider a standard scenario. A user asks your chatbot about a company healthcare plan. The retriever fetches the correct HR handbook. The generator replies that the plan covers medical, vision, and comprehensive dental. However, the HR handbook never mentions dental coverage. The generator synthesized a highly plausible, well-formatted lie.
The Faithfulness metric catches this exact behavior. It isolates the claims made in the actual output and verifies them against the retrieval context piece by piece. If the generator includes facts missing from the context, the faithfulness score drops. It does not matter if the fact happens to be true in the outside world. If it is not explicitly supported by the provided context, the metric flags it as a failure.
Now, the second piece of this is Answer Relevancy. A faithful answer is not necessarily a useful answer. The generator could output a perfectly accurate summary of the HR handbook that completely ignores what the user actually asked.
Answer Relevancy measures how well the actual output directly addresses the original input. It evaluates whether the response is complete, concise, and free of unnecessary rambling. If the user asks for the deductible amount, and the generator lists the deductible but then adds three paragraphs about the history of the company health insurance, the answer relevancy score decreases. The system penalizes evasive or bloated information just as strictly as missing information.
To evaluate these metrics, you build a test case in DeepEval. You must provide three variables. First, the input, representing the user prompt. Second, the actual output, which is the text your generator produced. Third, the retrieval context, containing the raw text from the documents your retriever fed to the generator.
You pass this test case to both metrics. DeepEval evaluates the text and calculates a score between zero and one for each. You define a passing threshold. If a response scores a zero point nine on Faithfulness but a zero point four on Answer Relevancy, you know the generator is safe but unhelpful. If the scores are reversed, your generator is helpful but actively hallucinating.
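In code, scoring a generator with both metrics is a short sketch (the healthcare strings and thresholds are illustrative):

```python
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Does the plan cover dental?",
    actual_output="Yes, the plan covers medical, vision, and comprehensive dental.",  # generator's reply
    retrieval_context=["The plan covers medical and vision care."],                   # what the retriever supplied
)

# Faithfulness catches the invented dental claim; Answer Relevancy checks the reply addresses the question
evaluate([test_case], [FaithfulnessMetric(threshold=0.7), AnswerRelevancyMetric(threshold=0.7)])
```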
Here is the key insight. When evaluating generators, you must systematically decouple truth from context. A successful generator does not tell the absolute truth; it tells exactly what the provided documents allow it to tell, answering only what was asked, and nothing more.
Thanks for listening, happy coding everyone!
5
Evaluating RAG Retrievers
3m 42s
If the context is garbage, the answer will be garbage. Discover how Contextual Precision, Recall, and Relevancy assess the quality of your retrieval engine.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 5 of 18. If your LLM is giving bad answers, it might not be the model's fault. Your retriever might just be feeding it garbage. Today we are talking about evaluating RAG retrievers, specifically using Contextual Precision, Contextual Recall, and Contextual Relevancy.
Think about querying a large knowledge base for a very specific policy detail. Instead of extracting the one exact paragraph you actually need, your retriever grabs ten pages of loosely related text. The LLM gets lost in the noise, token costs go up, and the final answer suffers. To fix this, you have to measure the retrieval step entirely separate from the generation step.
The first measurement is Contextual Relevancy. This metric looks at the raw chunks of text your retriever pulls from the database and calculates the ratio of relevant sentences to total sentences. If you retrieve a thousand words just to capture one useful sentence, your relevancy score crashes. High relevancy means you are passing a clean, dense context to the LLM prompt without wasting tokens on irrelevant background information.
Next is Contextual Recall. This evaluates whether your retriever found all the necessary information required to answer the user's query. DeepEval calculates this by taking the expected output for a query and extracting all the factual claims from it. It then scans your retrieved context to see if those specific claims are actually present. If a crucial fact needed for the answer is missing from the context, the recall score drops. High recall guarantees the LLM actually has the raw material it needs to succeed.
Then we have Contextual Precision. People frequently confuse precision with recall. Recall means you missed nothing. Precision means you included no fluff. In RAG evaluations, precision heavily factors in ranking. It checks if the highly relevant chunks of context appear at the very top of the retrieved list. If the one vital paragraph is buried at position number ten beneath nine useless chunks, your precision score is low. Language models generally pay more attention to the top of the prompt, so the order of retrieval completely changes the outcome.
Here is the key insight. There is a constant tension between precision and recall. If you configure your retriever to return fifty chunks every time, your recall will probably be perfect because you cast a massive net. But your precision and relevancy will plummet because most of those fifty chunks are noise. Conversely, if you restrict the retriever to exactly one chunk, your relevancy is pristine, but your recall fails the moment a user query requires synthesizing two different facts from two different documents.
To run these tests in the framework, you define a test case object. This object holds the user input, the expected output, and a list of the actual context strings your retriever fetched from the database. You instantiate the specific metric you want, such as a Contextual Recall metric, and pass it the test case. The framework then uses an evaluation model under the hood to read the strings, map the claims, calculate a penalty for missing or poorly ranked information, and return a final numerical score between zero and one.
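A minimal sketch of that setup, scoring the same test case with all three retriever metrics (the strings are illustrative):

```python
from deepeval.metrics import (
    ContextualPrecisionMetric, ContextualRecallMetric, ContextualRelevancyMetric,
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the refund deadline?",
    actual_output="Refunds are accepted within 30 days.",
    expected_output="Refunds must be requested within 30 days of purchase.",
    retrieval_context=[
        "Our policy allows refunds within 30 days of purchase.",
        "Unrelated paragraph about shipping times.",
    ],
)

for metric in (ContextualPrecisionMetric(), ContextualRecallMetric(), ContextualRelevancyMetric()):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score, metric.reason)
```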
Evaluating your retriever forces you to confront exactly what you are handing to your generation model. A high-performing RAG pipeline does not just shovel data at an LLM; it filters and ranks that data ruthlessly. Thanks for listening, happy coding everyone!
6
Agentic Evaluation
3m 41s
Evaluating autonomous agents requires analyzing complex execution flows. Learn how Task Completion and Tool Correctness metrics keep multi-step agents in check.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 6 of 18. Evaluating a single text response is straightforward: you check the final output against the input. But how do you evaluate an autonomous agent that decides its own steps, loops through multiple thought processes, and triggers external tools? If the final answer is correct, but the agent took twenty unnecessary steps and called the wrong database to get there, your system is broken. Fixing this requires Agentic Evaluation.
We need to separate a simple API call from an autonomous multi-step agent flow. A standard language model integration follows a pre-defined path. You send a prompt, and the model returns a string. An agentic flow shifts control to the model. The model receives a broad goal and autonomously decides which internal tools to use, in what order, and what arguments to pass to them.
If you evaluate an agent by only looking at its final text output, you miss the actual mechanics of its reasoning. DeepEval addresses this using the Task Completion metric. Instead of parsing the final response string, the Task Completion metric analyzes the agent execution trace. A trace is the complete, sequential record of every thought, action, and tool invocation the agent made during a specific run.
The Task Completion metric reads this trace to determine if the agent actually fulfilled the user request through valid actions. Consider a trip planner agent. A user asks it to book a weekend in Rome with dinner reservations. To succeed, the agent must correctly invoke a restaurant finder tool, and then pass that data to an itinerary generator tool.
If the agent outputs a perfectly formatted itinerary but never actually executed the restaurant finder tool, a standard text evaluation metric will likely pass it. The output looks highly convincing. The Task Completion metric will fail it. By analyzing the trace, the metric sees that the necessary tool execution is missing. The agent hallucinated the restaurant data instead of retrieving it, and the trace proves it.
This moves the evaluation from the final output to the operational steps. You must also evaluate if the tools were used properly. DeepEval test cases handle this using a parameter named tools called.
When you construct an evaluation test case, you pass the list of tools the agent invoked during its execution into this tools called parameter. Providing this data allows the framework to evaluate Tool Correctness. The evaluation verifies whether the agent selected the appropriate tools for the specific goal, whether it provided the correct input arguments to those tools, and whether it successfully processed the data the tools returned.
If your trip planner agent correctly decided to use the restaurant finder tool, but passed the city of Paris instead of Rome as the argument, the tool correctness evaluation catches the error at the exact point of failure. You know exactly which step broke the chain.
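A hedged sketch of that check using the Tool Correctness metric (the tool names and arguments are illustrative; by default the metric compares tool selection, and depending on your version, checking arguments may require extra metric configuration):

```python
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Plan a weekend in Rome with dinner reservations.",
    actual_output="Here is your Rome itinerary with dinner bookings...",
    tools_called=[ToolCall(name="restaurant_finder", input_parameters={"city": "Paris"})],  # what the agent did
    expected_tools=[
        ToolCall(name="restaurant_finder", input_parameters={"city": "Rome"}),              # what it should have done
        ToolCall(name="itinerary_generator"),
    ],
)

metric = ToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score, metric.reason)
```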
Here is the key insight. Evaluating agents means stepping inside the black box. You are verifying the integrity of the agent reasoning process and its mechanical actions step by step. A structurally perfect final answer is completely useless if the agent reached it through bypassed tools, hallucinated data, and broken logic.
Thanks for listening, happy coding everyone!
7
Multi-Turn Conversation Evaluation
3m 49s
Chatbots require evaluating the entire conversation history. Learn how ConversationalTestCase and specialized metrics track Role Adherence and Knowledge Retention across multiple turns.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 7 of 18. A chatbot might give a great first answer, but completely forget the user intent or its own persona by the third turn. That is why we use Multi-Turn Conversation Evaluation.
Evaluating isolated responses is easy, but real users do not send isolated prompts. They ask follow up questions, change their minds, and refer back to things they said five minutes ago. Standard evaluation methods fall apart here because they look at a single input and a single output. They have no concept of time or memory. To test how a model handles an ongoing dialogue, you need a structure that captures the entire thread.
In the framework, you handle this with a specific object called the ConversationalTestCase. Instead of taking one input string and one output string, it takes a parameter called turns. The turns parameter is a sequential list. Each item in this list represents a single back and forth exchange between the user and the system. You order these exchanges from the first message to the very last. Wrapping the sequence in a ConversationalTestCase tells the evaluation engine to treat the entire list as one continuous, stateful interaction.
There is a common trap here. Do not pass a ConversationalTestCase into standard, non-conversational metrics. Standard metrics are built for single outputs. If you use them on a multi-turn object, they will ignore the historical context entirely. You must use dedicated conversational metrics to evaluate the turns list.
Conversational metrics evaluate the conversation as a whole. They take prior context into consideration to judge the sustained behavior of the model. Two primary examples are Role Adherence and Knowledge Retention.
Consider a customer support chatbot that is explicitly instructed to act like a pirate. In turn one, the user says hello, and the bot replies with pirate slang. In turn two, the user asks a question, and the bot stays in character. But by turn three, when the user asks about a refund, the bot replies with a standard, corporate apology. It completely dropped its character. You can catch this failure automatically using the Role Adherence metric. You define the target persona in the metric setup, and it evaluates the entire conversation to verify the model never broke character, even as the context grew longer.
Knowledge Retention solves a different problem. If a user provides an account number in the first turn, and asks for a status update in the fourth turn, the bot should not ask for the account number again. The Knowledge Retention metric scans the turns list to ensure the model successfully retrieves and applies facts introduced earlier in the chat history.
Building this in code takes just a few steps. First, you create your individual turns, mapping the user input to the model output for each step of the dialogue. Next, you pass that entire sequence into a new ConversationalTestCase. Then, you set up your conversational metric, assigning criteria like the expected persona. Finally, you execute the metric against your test case. The framework processes the full history and returns a score based on the cumulative interaction.
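A sketch of those steps for the pirate chatbot, using the Turn-based API found in recent DeepEval versions (adjust if your version builds turns from paired test cases instead):

```python
from deepeval.metrics import RoleAdherenceMetric
from deepeval.test_case import ConversationalTestCase, Turn

convo = ConversationalTestCase(
    chatbot_role="a pirate-themed customer support agent",
    turns=[
        Turn(role="user", content="Hello!"),
        Turn(role="assistant", content="Ahoy, matey! How can I help ye today?"),
        Turn(role="user", content="I want a refund for my order."),
        Turn(role="assistant", content="We apologize for the inconvenience."),  # the persona is dropped here
    ],
)

metric = RoleAdherenceMetric(threshold=0.7)
metric.measure(convo)
print(metric.score, metric.reason)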
Here is the key insight. Multi-turn evaluation shifts your testing from measuring isolated technical accuracy to measuring sustained behavioral consistency over time.
Thanks for listening, happy coding everyone!
8
Building Custom Metrics with G-Eval
3m 47s
When standard metrics fail, build your own. Discover how G-Eval allows you to define custom evaluation criteria in plain English using a 2-step CoT algorithm.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 8 of 18. Sometimes standard metrics do not fit your use case, like checking if a highly-regulated financial chatbot sounds professional enough. You need a way to measure subjective traits reliably, and that is exactly where Building Custom Metrics with G-Eval comes in.
Before defining custom metrics, establish what kind of logic you are measuring. G-Eval is designed exclusively for subjective criteria, like tone, coherence, or conversational flow. If you need to enforce strict objective logic, you should use DAGs. G-Eval handles the nuances of human language.
To build a Professionalism metric for a financial chatbot, you do not write complex parsing rules. You use G-Eval to define a custom metric using everyday language. You instantiate the metric by giving it a name and a natural language criteria string. For this chatbot, your criteria might say, determine whether the actual output maintains a formal, respectful tone and strictly avoids slang or casual phrasing.
Here is the key insight. The metric does not just send your criteria to a large language model and ask for a quick score. G-Eval runs a two-step Chain of Thought algorithm. Step one is generation. The algorithm reads your plain English criteria and automatically generates a structured list of evaluation steps. It writes its own grading rubric. For the Professionalism metric, it might generate a step to scan for informal greetings, and another step to verify the use of appropriate financial terminology.
Step two is the actual evaluation. The algorithm takes those generated steps and applies them to your specific test case parameters. A standard G-Eval test case typically includes the user input and the actual output from your model, but you can also include the expected output or retrieval context. The evaluator runs its custom rubric against the text, calculates a final score between zero and one, and provides a detailed reason explaining why it deducted points.
Writing effective criteria dictates the quality of your metric. Treat the criteria string like a highly constrained prompt. Do not write vague instructions like check if the response is good. Define exactly what good means. If using slang should result in an automatic score of zero, state that explicitly in the criteria. The evaluation steps generated in step one are only as precise as the instructions you provide.
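A minimal sketch of the Professionalism metric described above (the criteria wording and sample strings are illustrative):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

professionalism = GEval(
    name="Professionalism",
    criteria=(
        "Determine whether the actual output maintains a formal, respectful tone "
        "and strictly avoids slang or casual phrasing. Any slang results in a score of 0."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="Can I defer my loan payment this month?",
    actual_output="Yeah dude, just skip it, no biggie.",
)
professionalism.measure(test_case)
print(professionalism.score, professionalism.reason)
```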
That covers single interactions. Conversations present a different challenge. A chatbot might start out professional but adopt a casual tone by the fourth message. To handle this, you use Conversational G-Eval.
Conversational G-Eval applies the exact same two-step algorithm to a multi-turn chat. The difference is the input format. Instead of evaluating a single input and actual output, you pass an entire conversation history. This history consists of sequential turns alternating between the user and the assistant. The metric reads the entire transcript, generates its evaluation steps based on your custom criteria, and scores the interaction as a whole. This ensures the model output remains consistent from the first greeting to the final sign-off.
The effectiveness of any custom metric depends entirely on treating your criteria like a rigorous specification, where clarity always beats brevity.
Thanks for listening, happy coding everyone!
9
Deterministic Evaluation with DAG
3m 16s
Take absolute control over your evaluations. Learn how the Deep Acyclic Graph (DAG) metric uses decision trees to deterministically judge complex formatting and logic.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 9 of 18. When evaluating complex formatting rules, a single dense prompt can easily cause your evaluator LLM to hallucinate the score. It might see the right words in the text, completely ignore the structural order, and give a passing grade to a failing test. The fix for this is Deterministic Evaluation with DAG.
DAG stands for Directed Acyclic Graph. In this framework, a DAG metric allows you to build a strict decision tree that controls how the LLM evaluates a response. Instead of asking a model to ingest a text and score it based on a massive block of instructions, you break the logic down into granular, step-by-step operations. Data flows strictly in one direction, from root nodes at the start of your tree down to leaf nodes at the end.
To build this tree, you rely on three distinct components. First is the Task Node. A common mistake is treating this like an evaluator. It is not. A Task Node simply extracts or processes data from the input or the generated response. Next is the Binary Judgement Node. This node takes the data processed by the Task Node and evaluates it against specific criteria, returning a strict yes or no. Finally, there is the Verdict Node. This acts as a leaf node. It terminates a branch of your decision tree and outputs a final numeric score along with a written reason.
Let us apply this to a concrete scenario. You are testing an LLM that generates meeting transcript summaries. Your strict requirement is that every summary must have exactly three headings: Intro, Body, and Conclusion, in that exact sequence.
You begin your metric by creating a root Task Node. You instruct this node to read the generated summary and extract a list of all headings it finds. That is its entire job. It isolates the formatting data and passes a simple text list to the next level of the tree.
Now, you feed that list into a Binary Judgement Node. You define the criteria for this node to check if the list contains exactly three items. If the node evaluates this as false, it routes the execution down to a Verdict Node. That Verdict Node immediately fails the test, assigns a score of zero, and outputs a reason stating the heading count was incorrect.
If the Binary Judgement Node evaluates to true, the execution moves to a second Binary Judgement Node. This node takes the same extracted list and checks the sequence. It verifies if the first item is Intro, the second is Body, and the third is Conclusion. If this is true, it routes to a final Verdict Node giving a perfect score. If false, it routes to a different Verdict Node assigning a zero due to order failure.
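A hedged sketch of that exact tree, with node signatures and the 0-to-10 verdict scale taken from DeepEval's DAG documentation; verify the module paths against your installed version:

```python
from deepeval.metrics import DAGMetric
from deepeval.metrics.dag import (
    DeepAcyclicGraph, TaskNode, BinaryJudgementNode, VerdictNode,
)
from deepeval.test_case import LLMTestCaseParams

order_node = BinaryJudgementNode(
    criteria="Are the headings in the exact order: Intro, Body, Conclusion?",
    children=[VerdictNode(verdict=False, score=0), VerdictNode(verdict=True, score=10)],
)
count_node = BinaryJudgementNode(
    criteria="Does the extracted list contain exactly three headings?",
    children=[VerdictNode(verdict=False, score=0), VerdictNode(verdict=True, child=order_node)],
)
extract_node = TaskNode(
    instructions="Extract a list of all headings found in the actual output.",
    output_label="Extracted headings",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    children=[count_node],
)

dag = DeepAcyclicGraph(root_nodes=[extract_node])
format_metric = DAGMetric(name="Summary Format", dag=dag, threshold=0.7)
```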
This is the part that matters. By separating the extraction of information into a Task Node from the evaluation logic in the Judgement Nodes, you force the LLM to follow an unbending path. The model handles the semantic extraction, while your graph guarantees the deterministic execution of the rules.
Thanks for listening, happy coding everyone!
10
The Evaluation Dataset
3m 22s
Scale your testing by building robust datasets. Explore how EvaluationDatasets group Goldens, distinguish between single and multi-turn data, and import from CSV/JSON.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 10 of 18. Testing a language model on five manual prompts is cute. But when you are upgrading to a new model version, those five prompts will not tell you if you broke edge cases across hundreds of historical interactions. To be production-ready, you need a robust, version-controlled repository of truth. That is what the Evaluation Dataset provides.
An Evaluation Dataset in DeepEval is simply a structured collection of items called Goldens. Before we go further, we need to clear up a common misconception. Developers often confuse a Golden with a Test Case. They are not the same thing. A Golden is the raw dataset row. It contains your static testing parameters, like the user input, the expected output, and the retrieval context. It represents the ideal scenario. A Golden only becomes a Test Case later, at runtime, after your live application processes the input and injects its actual output. The Golden is the blueprint, while the Test Case is the executed result.
Let us ground this in a specific scenario. You have a historical log of 500 customer support queries saved in a CSV file. You want to rigorously test a new model version against this exact set of queries. You do not need to write custom parsing logic. You simply initialize an Evaluation Dataset and use the built-in method to add data from a CSV file. You pass the file path and define a mapping. You tell the dataset which CSV column corresponds to the user input, which maps to the expected output, and which contains the context. The framework handles the parsing and constructs 500 Goldens in memory. You can do the exact same thing with a JSON file, mapping the JSON keys to the Golden fields.
Here is the key insight. The Evaluation Dataset controls a specific lifecycle that bridges your static data and your dynamic evaluation pipeline. First, you load and store your Goldens in the dataset. Next, during your test run, you iterate through the dataset. You extract the input from each Golden, feed it to your live language model, and capture the generated response. You then attach that live response to the Golden and convert it into an official Test Case. Finally, you pass that complete Test Case to your metrics for scoring. This keeps your raw data completely separated from your execution logic.
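A sketch of that lifecycle for the 500-query CSV (column names and the support_bot helper are illustrative stand-ins; parameter names may differ slightly between DeepEval versions):

```python
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

def support_bot(question: str) -> str:
    return "Our policy allows returns within 30 days."  # stand-in for your live application

dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    file_path="support_queries.csv",
    input_col_name="customer_question",
    expected_output_col_name="ideal_answer",
    context_col_name="policy_excerpt",
)

# Convert each static Golden into an executed Test Case at runtime
test_cases = [
    LLMTestCase(
        input=golden.input,
        actual_output=support_bot(golden.input),   # inject the live response
        expected_output=golden.expected_output,
        context=golden.context,
    )
    for golden in dataset.goldens
]
```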
Up until now, we have described single-turn datasets. One user input, one expected output. But many applications involve chat interfaces. For this, DeepEval supports multi-turn datasets. Instead of a flat input string, a multi-turn dataset contains a sequence of interactions. A single multi-turn Golden holds the entire conversation history, tracking how the user and the system interact over multiple steps. This allows your metrics to evaluate the flow and context retention of a conversation, rather than a single isolated reply.
Structuring your data into formal Evaluation Datasets guarantees that every prompt tweak and model swap is measured against a strict, unvarying historical standard. Thanks for listening, happy coding everyone!
11
Generating Synthetic Data
3m 16s
Don't have real user data? Learn how to use the Synthesizer to automatically generate high-quality Goldens directly from your knowledge base documents.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 11 of 18. The biggest bottleneck to evaluating large language models isn't the testing framework you use. It is the lack of high-quality test data to begin with. Generating synthetic data using the DeepEval Synthesizer is how you bypass this blockade.
Consider a startup building an internal HR bot. You need to run tests immediately to verify it retrieves the right policies. But the bot is brand new. You have zero real user query logs. This is the classic cold-start problem for evaluation data. You cannot evaluate a retrieval system without a robust list of realistic questions to ask it.
The DeepEval Synthesizer bootstraps an evaluation dataset directly from your raw knowledge base. Instead of writing hundreds of test cases by hand, you point the tool at your source documents. The core method for this is called generate goldens from docs.
In DeepEval, a Golden is the terminology for a single test case containing an input, an expected output, and a context.
To use it, you first initialize a Synthesizer object. Then, you call the generate goldens from docs method and pass it an array of document paths. For the HR bot, this would be the file paths to your employee handbooks, leave policies, and benefits PDFs.
When you run this method, the Synthesizer processes the files and breaks the text down into manageable chunks. It then uses an evaluator language model to act like an inquisitive user. The model reviews a specific chunk of the HR handbook and generates a relevant, realistic question based purely on that text. This question is saved as the synthetic input.
The Synthesizer also saves the exact text chunk used to inspire the question. This becomes the expected context. Finally, the model formulates the ideal, factual answer to the question and saves it as the expected output.
This is the part that matters. People often misunderstand the boundaries of synthetic generation. The synthesizer only creates the inputs, the expected context, and the expected output. It does not generate the actual output. Finding the actual output is the job of your own HR application during the testing phase. Think of the Synthesizer as a teacher writing an exam and creating an answer key. Your bot still has to take the exam.
You can control the scope of this generation. When calling the method, you can specify how many test cases to generate per document. This keeps the process fast and cheap for quick iterations, or you can scale it up for comprehensive coverage. Once the method finishes, you extract the generated goldens and save them. You can export the dataset locally as a JSON file, or push it directly to Confident AI to track your dataset versions over time.
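A rough sketch of that flow for the HR bot (file paths are illustrative; the per-document limit parameter has been renamed across DeepEval versions, so check yours):

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=["employee_handbook.pdf", "leave_policy.docx", "benefits.pdf"],
    max_goldens_per_context=2,   # keep early iterations small and cheap
)

# Persist the generated Goldens locally; they can also be pushed to Confident AI
synthesizer.save_as(file_type="json", directory="./synthetic_goldens")
```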
Bootstrapping synthetic datasets allows you to shift from waiting for real users to expose the flaws in your system, to systematically testing the absolute edges of your document logic on day one.
Thanks for listening, happy coding everyone!
12
Evolving Synthetic Complexity
3m 32s
Basic queries are too easy for modern LLMs. Deep dive into EvolutionConfig to artificially complicate synthetic queries using techniques like Reasoning and Concretizing.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 12 of 18. Basic synthetic queries generated from your documents are too easy. Modern language models breeze through simple questions, leaving dangerous logic gaps completely untested. To find real breaking points in your application, you need to artificially escalate the difficulty of your test data. We achieve this using Evolving Synthetic Complexity.
Standard generation produces simple, direct questions. Evolution systematically complicates those questions to stress-test your system. People sometimes confuse this with altering the underlying facts. Evolving a synthetic query does not mean changing the ground truth. The source data remains exactly the same. What changes is the structural difficulty of the query asking about that data.
In DeepEval, you control this mutation process using the Evolution Config. This configuration applies specific evolutionary strategies to transform a basic prompt into a multi-constraint edge case. Take a simple synthetic question like, what is the refund policy? It is straightforward, but it is too generic to be a rigorous test.
The first strategy you can apply is Concretizing evolution. This takes an abstract query and forces it into a highly specific, tangible scenario. Instead of asking for the general policy, concretizing mutates the query into something like, if I bought a red shirt on Tuesday, can I refund it next week? The model now has to map specific user constraints to the general rule.
The second strategy is Reasoning evolution. This introduces a required layer of deduction. The evolved question forces the model to perform logical steps before it can provide the final answer. Instead of just retrieving a fact, the query might require the system to calculate dates, compare values, or follow a conditional logic chain based on the source text before forming its response.
The third strategy is Multicontext evolution. This tests retrieval and synthesis by forcing the model to pull answers from disjointed pieces of information. It modifies the query so the answer cannot be extracted from a single paragraph. To succeed, the language model must combine the general refund timeline from one document with specific clearance item exclusions from an entirely different section.
When you artificially mutate thousands of queries using these strategies, some will inevitably degrade. An evolved question might become so convoluted that it is genuinely impossible to answer, or it might drift away from the original facts.
This is the exact problem the Filtration Config solves. You cannot allow unsolvable noise to pollute your evaluation dataset. Filtration employs a separate critic model to act as a quality control gatekeeper. It reviews every newly evolved query against strict criteria before it is saved. If a mutated question is logically broken, no longer aligns with the source context, or degrades into nonsense, the critic model outright rejects it.
This two-step process ensures you generate questions that are incredibly difficult, but still entirely valid. A high score on basic synthetic data proves nothing about your system's resilience; a model's true reliability is only measured by how it handles the intentionally evolved, complex edge cases that standard generation leaves behind.
Thanks for listening, happy coding everyone!
13
LLM Tracing and Observability
3m 19s
Move beyond black-box testing. Learn how to use the @observe decorator to trace components, create spans, and gain white-box visibility into your LLM pipelines.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 13 of 18. End-to-end evaluations tell you if a system failed, but they leave you guessing why. Component-level tracing tells you exactly which specific function broke. Today, we are looking at LLM Tracing and Observability.
Consider a standard Retrieval-Augmented Generation pipeline. A user asks a complex question, and your system returns a hallucinated answer ten seconds later. You need to know exactly what went wrong. Did the embedding retrieval function pull irrelevant documents from your database, or did the final language model generation step just fail to synthesize the right context? You cannot debug this efficiently by just looking at the final output. You need to inspect the internal steps.
To do this, you must understand two foundational terms that are frequently mixed up: Traces and Spans. A Trace is the full execution tree of a single operation. It represents the complete timeline from the moment the user sends a prompt to the moment the system delivers the final response. A Span, on the other hand, is a specific component or function operating inside that Trace.
Your entire pipeline execution is one Trace. The function that queries the vector database is one Span. The function that formats the prompt is another Span. The actual call to the large language model is a third Span. Every Trace is built out of these nested Spans. Each Span records its own start time, end time, input parameters, and output results.
In DeepEval, you capture this hierarchy using the observe decorator. You simply place the word observe with an at-symbol directly above the Python functions you want to monitor. You attach it to your main entry-point function, and you attach it to the internal helper functions like your retriever and your generator.
When your application runs, the observe decorator automatically intercepts the execution. It logs the exact arguments passed into the function and the exact data returned. It also tracks latency and any errors that occur. More importantly, it understands the execution context. If your main pipeline function calls your retriever function, the decorator automatically registers the retriever as a child Span of the main Trace. It maps out the parent-child relationships of your functions without you having to manually link them.
Here is the key insight. Tracing this way is completely non-intrusive. You do not have to rewrite your application codebase to generate telemetry. You do not need to alter your function signatures to pass trace IDs or context objects down the call stack. You keep your business logic clean and just wrap the components you care about. By doing this, you isolate the data for every individual step. If you want to evaluate just the retrieval logic later, the exact inputs and outputs of that specific Span are already logged.
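A minimal sketch of that wrapping for the RAG pipeline (the two helper functions are illustrative stand-ins for your real retrieval and generation calls):

```python
from deepeval.tracing import observe

def search_vector_db(query: str) -> list[str]:
    return ["Policy chunk about refunds."]           # stand-in for your vector store lookup

def call_llm(query: str, chunks: list[str]) -> str:
    return "Refunds are accepted within 30 days."    # stand-in for your model call

@observe()
def retrieve(query: str) -> list[str]:               # child span: the retrieval step
    return search_vector_db(query)

@observe()
def generate(query: str, chunks: list[str]) -> str:  # child span: the generation step
    return call_llm(query, chunks)

@observe()
def rag_pipeline(query: str) -> str:                 # root span: one call produces one trace
    chunks = retrieve(query)
    return generate(query, chunks)
```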
You evaluate end-to-end to measure the final user experience, but you trace at the component level to actually locate and fix the underlying code.
Thanks for listening, happy coding everyone!
14
Dynamic Evals at Runtime
3m 30s
When workflows are unpredictable, build your test cases dynamically. Learn how to use update_current_span to inject tests as data flows through the agent.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 14 of 18. In complex autonomous agent workflows, you do not always know what the test case should be until the agent actually starts executing. You cannot write a comprehensive static test suite for intermediate decisions that have not been made yet. Dynamic Evals at Runtime resolve this limitation by letting you construct and evaluate test cases mid-flight.
Typically, developers define static test cases externally. You create a file of rigid inputs and expected outputs, then run your application against them. That approach breaks down when dealing with autonomous systems. When an agent receives a prompt, it might route the query, select a specialized tool, and generate its own internal search string. Those intermediate inputs and outputs do not exist until the code is actually running. Dynamic evaluations abandon the external file. Instead, they build the test case step-by-step as variables populate inside the active application.
Consider a targeted scenario. You need to evaluate context precision inside a deeply nested retriever function. You want to test just that specific mid-flight step, isolated from the final output the user sees. To do this, DeepEval provides two specific functions to intercept execution data: update current span and update current trace.
A trace records the entire lifecycle of a request, from initial user input to final response. A span represents one specific operation inside that trace, such as your retriever function. When that retriever function executes, the dynamic variables finally materialize. You now have the exact search string the agent generated and the specific text chunks your database returned.
Right at this moment, inside the retriever logic, you call update current span. You use this function to intercept those live variables and map them directly into a new test case. You take the intercepted search string and assign it as the test input. You take the raw database chunks and assign them as the retrieval context. You have just constructed a golden test case during execution.
Because you built this golden dynamically inside the span, you can immediately evaluate it. You apply your context precision metric right there. The metric runs against the live data, scores the retriever step, and attaches that score directly to the local span. When you review your traces later, you do not just see that a retrieval happened. You see a highly targeted evaluation of that specific retrieval based on the exact conditions of that run.
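A hedged sketch of that retriever, following DeepEval's component-level evaluation pattern of attaching metrics to the observe decorator and populating the span from inside the function (the database helper is an illustrative stand-in):

```python
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span

def search_vector_db(query: str) -> list[str]:
    return ["Refunds are accepted within 30 days."]   # stand-in for the real lookup

@observe(metrics=[ContextualRelevancyMetric(threshold=0.7)])
def retrieve(search_string: str) -> list[str]:
    chunks = search_vector_db(search_string)
    # The agent-generated query and the live chunks only exist now, so the test case
    # is built mid-flight and attached to the current span for evaluation.
    update_current_span(
        test_case=LLMTestCase(
            input=search_string,
            actual_output="\n".join(chunks),
            retrieval_context=chunks,
        )
    )
    return chunks
```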
That covers granular steps. Sometimes, however, a nested operation uncovers something that changes the context of the entire request. This is where update current trace becomes necessary. While update current span modifies the local step, update current trace allows a deeply nested function to reach up and modify the global execution record. If your agent discovers information mid-flight that completely changes what the final answer should look like, you call update current trace to update the expected output for the entire run. This keeps the global evaluation aligned with the live, shifting reality of the execution logic.
Here is the key insight. Moving evaluations from external files into the runtime execution tree turns testing from a post-mortem exercise into a live diagnostic mechanism. By binding metrics directly to spans as they execute, you stop guessing why a multi-step agent failed and start measuring exactly which internal handoff caused the failure.
Thanks for listening, happy coding everyone!
15
Introduction to Red Teaming
3m 53s
Correctness is not security. Explore the DeepTeam framework and learn the four core components of red teaming: Vulnerabilities, Attacks, Targets, and Metrics.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 15 of 18. Your LLM might answer every normal question perfectly. But what happens when a malicious user actively tries to jailbreak your application into leaking sensitive data? That requires a completely different approach, which brings us to an Introduction to Red Teaming.
To understand red teaming, you have to change how you view your system. Standard evaluation tests functionality. It measures if your model is helpful, accurate, and relevant when used properly. Red teaming tests security, safety, and guardrails. It requires a mindset shift from checking for correctness to simulating malicious bad actors. You are actively trying to make the model fail.
Take a common scenario. If you directly ask your AI application to output a user profile, standard guardrails will likely catch it. The AI strictly refuses to leak the Personally Identifiable Information, or PII. A standard evaluation marks this as a success. But a bad actor will not ask directly. They might prompt the model using a specific, malicious persona, instructing the AI to act as a senior database administrator performing an emergency system override. Suddenly, the AI complies and willingly leaks the PII. Red teaming is the systematic process of discovering these exact blind spots before they reach production.
DeepEval structures this process around four core components.
The first component is Vulnerabilities. A vulnerability is the specific weakness, risk, or harm you are testing for. It is the underlying issue you want to prevent. In our emergency override scenario, the vulnerability is PII leakage. Other vulnerabilities might include generating toxic output, displaying unauthorized bias, or offering dangerous advice.
The second component is Adversarial Attacks. If the vulnerability is the target, the attack is the weapon. Attacks are the specific techniques or means used to exploit a vulnerability. Adopting a trusted persona to trick the AI is one type of attack. Others include prompt injection, where malicious instructions are hidden in regular input, or complex jailbreaks designed to bypass the model's safety training entirely. DeepEval separates the weakness from the tactic because a single vulnerability can be exposed by many different types of attacks.
The third component is the Target LLM System. This is the actual application you are evaluating. It is not just the raw foundation model, but your specific architecture. This includes your custom system prompts, your retrieval mechanisms, and any existing safety filters. The adversarial attacks are executed directly against this setup to see how your actual product performs under pressure.
The fourth component is Metrics. Once an attack is executed against your target system to probe for a vulnerability, you need a quantifiable result. Metrics evaluate the system's response. They determine if the attack successfully bypassed the guardrails, or if the system safely refused the malicious request. A metric scores the interaction, giving you a concrete pass or fail based on how secure the output actually was.
Here is the key insight. You cannot secure an AI application just by proving it does the right thing when asked nicely; you must systematically prove it refuses to do the wrong thing when under attack.
Thanks for listening, happy coding everyone!
16
Executing Adversarial Attacks
3m 55s
Automate your security tests. Learn how to configure a Model Callback in DeepTeam and launch prompt injections to automatically uncover biases and flaws.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 16 of 18. You shouldn't have to manually type thousands of deceptive inputs to find security flaws in your application. Instead of paying a security team to manually try and break your system over weeks, you can let an AI attack your AI autonomously. Today, we are looking at Executing Adversarial Attacks.
To orchestrate an automated LLM-on-LLM attack, the scanning engine needs a direct line of communication to your application. You establish this connection by defining a target model callback. This is an asynchronous Python function that you write. It takes a single string argument, which is the adversarial prompt generated by the testing engine, and it must return the string response from your system.
In a typical scenario using an OpenAI model, you would define this async callback function, take the incoming prompt parameter, pass it to your OpenAI client, await the generation, and return the final text content. This callback acts as the bridge. The red teaming engine does not need to know about your internal architecture, API keys, or database state. It just needs a function it can continuously hit with malicious inputs.
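As a rough sketch of that bridge, assuming the official openai Python SDK with an illustrative model name and system prompt:

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def model_callback(input: str) -> str:
    # The red teaming engine passes in one adversarial prompt at a time
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful support assistant."},
            {"role": "user", "content": input},
        ],
    )
    # Return plain text so the engine can grade the response
    return response.choices[0].message.content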
Once your callback is ready, you pass it to the red team function. This is the main orchestrator that runs the scan. To configure it, you provide two distinct lists: vulnerabilities and attacks. It is crucial to understand the difference between them.
Vulnerabilities are the specific structural flaws or harmful behaviors you want to test for. For example, if you want to ensure your application does not output racial or gender prejudice, you import and pass the Bias vulnerability to the red team function.
Attacks, on the other hand, represent the methodology the engine will use to try and expose that vulnerability. To force the model into making a biased statement, you might want the engine to use deceptive phrasing or jailbreak techniques. You do this by passing the Prompt Injection attack. The engine will now autonomously generate targeted, malicious prompts using prompt injection, specifically engineered to bypass your system prompts and trigger the Bias vulnerability.
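Wiring this together might look roughly like the following. The import paths mirror DeepTeam's documented layout, but treat them as assumptions, and model_callback is the async bridge sketched earlier.

from deepteam import red_team
from deepteam.vulnerabilities import Bias
from deepteam.attacks.single_turn import PromptInjection

risk_assessment = red_team(
    model_callback=model_callback,   # the async bridge to your application
    vulnerabilities=[Bias()],        # the flaw you want to probe for
    attacks=[PromptInjection()],     # the technique used to expose it
)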
A common point of confusion during this setup is how the results are actually graded. In standard evaluation, you spend a lot of time defining and tuning specific metrics. When executing adversarial attacks, do not manually define metrics. The framework handles this entirely behind the scenes. It automatically maps the vulnerability you selected directly to a corresponding internal evaluation metric. Because you told it to test for Bias, the engine automatically runs a bias evaluator against every single response your target model callback returns.
After the red team function finishes firing these generated prompts and evaluating the responses, it outputs a comprehensive Risk Assessment. This assessment provides a clear breakdown of the scan. It shows exactly how many attacks were attempted, which specific attack techniques successfully breached your system, and the exact input strings that caused the failure. You walk away with a concrete list of inputs that your system currently cannot handle.
Here is the key insight. The true power of this setup is decoupling the attack method from the target vulnerability, allowing you to multiply your security coverage by pairing a single flaw like bias with dozens of different attack vectors simultaneously.
Thanks for listening, happy coding everyone!
17
CI/CD and Continuous Evaluation
3m 27s
Stop deploying blind. Learn how to integrate DeepEval into your CI/CD pipelines using Pytest integrations to catch LLM regressions before they hit production.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 17 of 18. You would never merge a pull request for traditional code without running your unit tests first. Yet, teams routinely deploy non-deterministic language models and just hope the new prompts still work. If you want to prevent bad model updates from breaking production, you need CI/CD and Continuous Evaluation.
DeepEval treats language model evaluations exactly like standard software tests by integrating directly with Pytest. You define a test function, initialize your evaluation metrics, and assert that the metric passes. Evaluating a model on a single input is useless, so you need to validate changes against a large batch of approved baseline inputs and expected outputs. This is your golden dataset. To iterate through this dataset efficiently, you use the standard Pytest mark parametrize decorator. You load your dataset, extract the individual test cases, and pass them into the decorator. When the test suite runs, Pytest dynamically generates a separate test execution for every single item in your golden dataset.
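A minimal test file along these lines might look like the sketch below. The inline test cases, metric choice, and threshold are illustrative; in practice you would load your golden dataset from CSV or JSON as covered in the dataset episode.

import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Two inline cases keep the sketch self-contained; a real golden dataset
# would be loaded from file or pulled from the cloud.
dataset = EvaluationDataset(test_cases=[
    LLMTestCase(
        input="What is your refund policy?",
        actual_output="You can request a refund within 30 days of purchase.",
    ),
    LLMTestCase(
        input="How do I reset my password?",
        actual_output="Use the 'Forgot password' link on the login page.",
    ),
])

# Pytest generates one test execution per item in the golden dataset
@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_llm_app(test_case: LLMTestCase):
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])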
Here is the key insight. Because the framework integrates closely with Pytest, developers often assume they can just run standard Pytest commands in their terminal. Do not do this. If you execute Pytest directly against your evaluation files, you will run into unexpected errors related to asynchronous event loops and missing metric telemetry. You must always use the dedicated command line interface. The correct command is deepeval test run followed by your Python filename. This wrapper handles the complex asynchronous setup required by language models and ensures that all test results are captured and logged correctly.
Integrating this command into your deployment pipeline gives you continuous evaluation. Consider a typical GitHub Action setup. You configure the workflow to trigger whenever an engineer opens a pull request targeting the main branch. The action runner checks out the repository, sets up the Python environment, and executes deepeval test run against your golden dataset script. The framework evaluates the newly modified code or prompt against every historical test case.
If a developer alters a system prompt to make answers more concise, they might accidentally instruct the model to strip out mandatory compliance warnings. When the CI pipeline runs, the automated evaluation catches this missing context immediately. If the new logic causes your evaluation score to drop below your defined threshold on any test case, the assertion fails. The script returns a non-zero exit code, the GitHub Action immediately turns red, and the pull request is blocked from merging.
This automated pre-deployment check acts as a strict gatekeeper. It catches regressions automatically. You no longer have to manually spot-check outputs or wait for users to complain that a new model swap broke a specific edge case. The pipeline gives you a quantitative, repeatable signal on whether the update is safe to deploy, taking human subjectivity out of the release decision.
Continuous evaluation means you stop treating prompt engineering like an operational gamble and start treating it like a predictable software release backed by hard data.
Thanks for listening, happy coding everyone!
18
The Finale - Scale with Confident AI
4m 04s
Take your evals to the cloud. Discover how Confident AI centralizes testing reports, tracks hyperparameters, and monitors regressions across your entire team.
Hi, this is Alex from DEV STORIES DOT EU. DeepEval Framework, episode 18 of 18. You spend hours tuning a prompt locally, achieve a great evaluation score, and ship the code. Two weeks later, another engineer updates the underlying model, and suddenly the application fails on edge cases no one tracked. Running evaluations locally is only half the battle. This episode covers how to scale your testing and track regressions over time using Confident AI.
First, a necessary distinction. DeepEval is the open-source framework you run in your terminal or Python environment to execute tests. Confident AI is the hosted cloud platform built on top of it. You use DeepEval to define your metrics and run the actual evaluations. You use Confident AI to centralize, track, and analyze those evaluation reports across an entire engineering organization. It takes isolated local scripts and turns them into a collaborative system of record.
Moving from local execution to cloud logging requires one simple step. In your terminal, you execute the command deepeval login. The CLI will prompt you to provide an API key generated from your Confident AI workspace. Once you authenticate, your daily workflow stays exactly the same. You run your test files using the standard test command. The framework automatically detects the active session and streams the results directly to the cloud dashboard while still printing them locally.
Centralizing reports unlocks the ability to track regressions methodically. A regression happens when a change in your code or configuration inadvertently degrades the system's performance. To diagnose why a regression occurred, you need to track exactly what changed between test runs. This is done by logging hyperparameters.
In the context of language model evaluations, a hyperparameter is any variable that alters your pipeline's behavior. This includes the model architecture, the temperature setting, the chunk size used for retrieval, or even the specific prompt template version. When you configure DeepEval to log these hyperparameters, they are attached to every test run sent to Confident AI.
Consider a team attempting to upgrade their application. They want to know if switching from GPT-4o to Claude 3.5 Sonnet actually improves their overall pipeline score. They configure the model name as a tracked hyperparameter. When the engineer runs the evaluation suite using the new model, Confident AI logs the new model name alongside the resulting scores for metrics like contextual precision or factual consistency.
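One way to attach those hyperparameters is sketched below using the evaluate function. The hyperparameter names and values are illustrative, and passing them through an evaluate argument is an assumption based on how DeepEval documents Confident AI logging; after you have run deepeval login, the run and its hyperparameters are streamed to the dashboard.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize our data retention policy.",
    actual_output="We keep customer data for 12 months, then delete it.",
)

evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={
        "model": "claude-3-5-sonnet",   # the configuration change under test
        "temperature": 0.2,
        "prompt template": "support-v2",
    },
)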
Here is the key insight. Because all historical test runs are saved in the cloud, the team can view a timeline comparing the exact hyperparameter changes against the aggregate evaluation scores. If switching to the new model increases answer relevancy but drastically drops factual consistency, the dashboard highlights this regression instantly. Everyone on the team sees the same data. You never have to parse through old console outputs or rely on memory to decide if a configuration change was a success.
Continuous evaluation requires a historical baseline. Without a centralized system tying your configurations directly to your evaluation scores, you are simply running isolated experiments, not engineering a reliable system.
This concludes our series on the DeepEval framework. I highly encourage you to explore the official documentation and try building these evaluations hands-on. If you have technical topics you would like to see covered in a future series, visit devstories dot eu to leave a suggestion. Thanks for listening, happy coding everyone!