Season 51 · 14 episodes · 49 min · 2026

LlamaIndex: Context-Augmented LLM Applications

v0.14 — 2026 Edition. A comprehensive guide to LlamaIndex, covering context augmentation, RAG pipelines, autonomous agents, and multi-agent workflows. Learn how to build production-ready LLM applications using version 0.14.

LLM Orchestration · RAG · AI/ML Frameworks
1
The Context-Augmentation Imperative
Discover the foundational concepts of LlamaIndex and why LLMs need external context to be truly useful. This episode covers the philosophy behind Retrieval-Augmented Generation, workflows, and agentic applications.
3m 49s
2
Data Ingestion: Documents and Nodes
Explore the first half of the RAG pipeline. You will learn about Connectors, Documents, Nodes, and the critical process of indexing unstructured data into vector embeddings.
4m 11s
3
The Query Pipeline: Retrievers and Routers
Dive into the second half of the RAG lifecycle. Learn how Retrievers find relevant chunks, how Routers select the best approach, and how Postprocessors refine the context for the LLM.
3m 20s
4
Interfacing with LLMs and Multi-Modal Inputs
Master the LlamaIndex LLM class for natural language generation. This episode breaks down chat interfaces, streaming responses, and feeding images to multi-modal models.
3m 42s
5
Structured Data Extraction with Pydantic
Learn how to force unpredictable LLMs to return strict, typed JSON data. Discover how Pydantic BaseModels act as schemas to extract reliable structured information from raw text.
3m 24s
6
Building Autonomous Function Agents
Take the leap from static code to autonomous agents. You will learn how to wrap Python functions into tools and deploy a FunctionAgent to execute tasks dynamically.
3m 28s
7
Extending Agents with LlamaHub Tools
Supercharge your agents with pre-built integrations. This episode shows how to browse LlamaHub, install tool specs, and give your agent real-world capabilities instantly.
3m 30s
8
Multi-Agent Swarms with AgentWorkflow
Move beyond single-agent setups. Learn how to configure a linear swarm of specialized agents that autonomously hand off tasks to one another using AgentWorkflow.
2m 59s
9
The Orchestrator Agent Pattern
Take granular control of your agentic workflows. Discover how to build a master orchestrator agent that manages subordinate agents as callable tools.
3m 38s
10
Custom Multi-Agent Planners
Achieve ultimate multi-agent flexibility. Learn how to roll your own orchestration loop using custom XML prompting, Pydantic, and imperative execution.
3m 34s
11
Human-in-the-Loop Workflows
Prevent autonomous disasters by keeping a human in the loop. You will learn how to pause workflows with events to wait for human confirmation before executing dangerous tasks.
3m 09s
12
Observability and Tracing
Stop debugging AI with print statements. This episode explores LlamaIndex callbacks and one-click observability to trace inputs, durations, and outputs across complex pipelines.
3m 26s
13
RAG Evaluation Metrics
Measure the true effectiveness of your applications. Learn how to use FaithfulnessEvaluator and RetrieverEvaluator to objectively score retrieval and response quality.
3m 50s
14
Scaffold to Production
Transform prototypes into full applications instantly. Discover how to use create-llama and the RAG CLI to scaffold full-stack web apps and terminal chats without writing boilerplate.
3m 35s

Episodes

1

The Context-Augmentation Imperative

3m 49s

Discover the foundational concepts of LlamaIndex and why LLMs need external context to be truly useful. This episode covers the philosophy behind Retrieval-Augmented Generation, workflows, and agentic applications.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 1 of 14. Pre-trained models are brilliant, but they know absolutely nothing about the private documents you created this morning. You ask an LLM to summarize your brand new Q3 financial slide deck, and it either guesses blindly or tells you it cannot help. The Context-Augmentation Imperative is how you fix this. Large Language Models possess incredible reasoning capabilities, but their knowledge is frozen in time and limited to public data. They do not have access to your internal wikis, your customer support tickets, or your private financial reports. LlamaIndex exists to bridge this exact gap. It serves as the connective tissue between foundational models and your private, localized data. When you ask a model to summarize that Q3 slide deck, you cannot just send the question. You need a system that finds the relevant slides, extracts the text, and feeds that specific information to the model along with your prompt. This process is context augmentation. You are giving the model the exact context it needs to apply its reasoning skills to your private data. LlamaIndex provides the infrastructure to ingest, organize, and retrieve your data so that context augmentation happens reliably. Fetching text and answering a single question is only the baseline. Modern applications require more autonomy. This brings us to agentic applications. An agentic application does not just follow a straight line from a question to a database to an answer. It makes decisions along the way to handle complex user intents. The first piece of this is routing. When a user asks a question, the system needs to decide which data source or tool is appropriate. If the user asks for a high-level corporate summary, the router directs the query to the Q3 slide deck index. If the user asks for the exact numerical breakdown of regional sales, the router might send the query to a structured SQL database instead. Routing ensures the model uses the right tool for the job based on the input. The second piece is prompt chaining. Complex tasks often fail when you ask a model to handle them in a single massive prompt. Prompt chaining breaks a complex objective into smaller, sequential tasks. The system might run one prompt to extract revenue figures from the slide deck, pass those figures to a second prompt that compares them to historical data, and send that output to a third prompt that drafts an executive summary. The output of one step becomes the exact context for the next step. This is where it gets interesting. Even with the right data and a structured chain, models make mistakes. This introduces reflection. Reflection is an automated quality control step. Before delivering the final summary of the Q3 deck to the user, the agentic application uses a separate prompt to evaluate its own draft. It checks if the generated text is fully supported by the retrieved slides. If the reflection step spots a hallucination or an omitted key metric, it rejects the draft and triggers a correction. The true power of context-augmented applications is not just giving an LLM a document to read, but giving it a structured, self-correcting workflow to reason securely over your private data. If you want to help keep the show going, you can search for DevStoriesEU on Patreon and support us there. That is all for this one. Thanks for listening, and keep building!
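To make the prompt-chaining idea concrete, here is a minimal sketch using the LlamaIndex OpenAI integration. The model name and the deck_text placeholder are illustrative assumptions, not anything prescribed by the episode:

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")  # assumed model; any LlamaIndex LLM works
deck_text = "..."  # hypothetical: extracted text of the Q3 slide deck

# Step 1: extract the revenue figures from the deck.
figures = llm.complete(f"Extract all revenue figures from this deck:\n{deck_text}").text

# Step 2: the output of step 1 becomes the exact context for step 2.
comparison = llm.complete(f"Compare these figures to historical data:\n{figures}").text

# Step 3: draft the executive summary from the comparison.
summary = llm.complete(f"Draft a three-sentence executive summary:\n{comparison}").text
print(summary)
```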
2

Data Ingestion: Documents and Nodes

4m 11s

Explore the first half of the RAG pipeline. You will learn about Connectors, Documents, Nodes, and the critical process of indexing unstructured data into vector embeddings.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 2 of 14. You cannot just shove a 500-page PDF into an LLM context window and expect a precise answer. The model will lose track, hallucinate, or simply reject the payload for exceeding its limits. To make massive files useful, you have to tear them apart into tiny pieces and translate them into a mathematical format the machine can search. That process is exactly what Data Ingestion: Documents and Nodes covers today. Think about digesting a massive HR employee handbook. The text lives in a giant PDF, spanning dozens of complex chapters. The first step in a Retrieval-Augmented Generation pipeline is the loading stage. This is where Data Connectors, frequently called Readers, come into play. A connector takes your raw data source—whether that is a local PDF, a remote database table, or an external API response—and wraps it in a data structure called a Document. People often trip up on this term. In this framework, a Document does not mean a Word file or a PDF. A Document is simply a generic container for any ingested data source. It holds the raw text along with some basic properties. However, a single Document representing a 500-page handbook is completely useless for precise, fast searching. You must break it down. This brings us to Nodes. A Node is the actual atomic unit of data in LlamaIndex. It is a smaller, manageable chunk of a parent Document—perhaps a single paragraph detailing the parental leave policy. When you process the HR handbook, the framework takes the massive Document and slices it into thousands of Nodes. Here is the key insight. Nodes do not just hold isolated text. They carry rich metadata and structural relationships. A Node knows exactly which parent Document it came from. It also knows which Node logically precedes it and which Node follows it. This linked structure is what allows the system to synthesize larger context later if a single chunk does not contain the whole answer. Once you have your data chopped into precise Nodes, you move to the indexing stage. You need a robust way to find the correct Node when a user later asks a question. This requires translating human language into a numerical format called an embedding. An embedding is an array of floating-point numbers representing the semantic meaning of the text inside the Node. You pass each Node through an embedding model. The model reads the chunk and returns a high-dimensional vector. If two Nodes discuss conceptually similar topics—like sick leave and paid time off—their numerical vectors will sit mathematically close to each other in space. With these vectors generated, you construct an Index. The Index is the core structural component that organizes your Nodes so they can be queried. For most applications, this Index is backed by a Vector Store. The Vector Store acts as a specialized database explicitly designed to hold these mathematical representations and perform highly efficient similarity calculations over them. The logical flow is highly predictable. First, you configure a data connector to target your HR handbook. The connector reads the file and outputs a single Document object. Next, a parser takes that Document and splits it into an array of independent Node objects. Finally, you pass that array of Nodes into an Index, which coordinates the creation of vector embeddings and commits them to the Vector Store. The entire ingestion pipeline exists to solve one fundamental limitation. 
Large language models cannot reliably read entire books at once, but they can instantly calculate the mathematical distance between two arrays of numbers. Translating your raw files into Documents, slicing them into linked Nodes, and encoding them into vector indexes is what bridges that gap. That is all for this one. Thanks for listening, and keep building!
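Here is a rough sketch of that loading, parsing, and indexing flow, assuming a hypothetical local file named hr_handbook.pdf and the default in-memory vector store; SentenceSplitter is just one of several available node parsers:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Loading stage: a connector reads the raw file into Document objects.
documents = SimpleDirectoryReader(input_files=["hr_handbook.pdf"]).load_data()

# Parsing stage: slice the Document into linked Node chunks.
parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = parser.get_nodes_from_documents(documents)

# Indexing stage: embed each Node and commit it to the vector store.
index = VectorStoreIndex(nodes)
```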
3

The Query Pipeline: Retrievers and Routers

3m 20s

Dive into the second half of the RAG lifecycle. Learn how Retrievers find relevant chunks, how Routers select the best approach, and how Postprocessors refine the context for the LLM.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 3 of 14. You grab the right document chunk, send it to the language model, and still get a bad answer because it was buried under ten irrelevant chunks. Finding the text is only half the battle. Filtering, ranking, and deciding how to fetch it in the first place is where the system actually succeeds or fails. Today we are looking at the execution phase of RAG, specifically The Query Pipeline: Retrievers and Routers. At this stage, your data is already loaded and indexed. A user submits a query. The first component to intercept this query is often a router. A router is a decision engine. It looks at the incoming question and determines which underlying tool or index is best suited to answer it. Say a user asks a complex question about a specific historical event that also includes highly specific acronyms. A standard vector search might grasp the semantic meaning of the event but miss the acronyms. A keyword search will nail the acronyms but miss the broader context. The router evaluates the query and decides to send it down two paths simultaneously. It routes the request to both a vector search and a keyword search. This brings us to the retrievers. A retriever is responsible for defining exactly how to fetch relevant context from an index. It does not generate answers. It only fetches data. Following our scenario, the vector retriever converts the user query into an embedding and pulls out the most mathematically similar nodes, which are just chunks of your source documents. At the same time, the keyword retriever pulls out nodes that contain exact text matches for those acronyms. Now you have two distinct piles of nodes. You cannot just blindly append all of them to your language model prompt. Context windows are limited, and models get easily distracted by irrelevant data. This is where node postprocessors step in. Here is the key insight. Node postprocessors act as a gatekeeper between the retrievers and the language model. They apply transformations, filtering, or re-ranking logic to the fetched nodes. For example, a postprocessor can enforce a similarity cutoff, dropping any nodes that scored below a specific threshold. It can deduplicate nodes if the vector search and keyword search happened to pull the exact same paragraph. It can also re-rank the remaining nodes so the absolute most relevant chunk sits at the very top of the context window. Once the postprocessor has cleaned and ordered the data, the system hands it off to the response synthesizer. The synthesizer has one job. It takes the polished list of nodes and the original user query, combines them into a structured prompt, and sends them to the language model. The language model then generates the final human-readable answer based solely on that provided context. The query execution phase is strictly a pipeline. You route the query, retrieve the raw nodes, filter and rank those nodes with a postprocessor, and finally synthesize the text. If you control what the language model sees, you control the quality of the output. Thanks for spending a few minutes with me. Until next time, take it easy.
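A minimal sketch of the retrieve, filter, and synthesize pipeline, reusing the index built in the previous episode; the similarity_top_k and similarity_cutoff values here are arbitrary illustrative choices:

```python
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

# Assume `index` was built as in the previous episode.
# Retriever: defines exactly how nodes are fetched from the index.
retriever = index.as_retriever(similarity_top_k=10)

# Postprocessor: the gatekeeper that drops weakly matching nodes
# before the response synthesizer builds the final prompt.
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.75)],
)

response = query_engine.query("What happened during the 1998 treaty negotiations?")
print(response)
```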
4

Interfacing with LLMs and Multi-Modal Inputs

3m 42s

Master the LlamaIndex LLM class for natural language generation. This episode breaks down chat interfaces, streaming responses, and feeding images to multi-modal models.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 4 of 14. You are building an agent that fields IT support tickets, but half the submissions are just smartphone photos of flashing error lights. Text-only processing forces you to build separate vision pipelines or drop the visual context entirely. Today, we are covering Interfacing with LLMs and Multi-Modal Inputs, which solves this right at the model layer. In LlamaIndex, interacting with a language model goes through the base LLM class. This acts as a unified interface. Whether you are calling an OpenAI model, an Anthropic model, or a different provider, the core methods you use are identical. This abstraction protects your application logic from provider-specific API changes. When you want an answer from the model, you choose between two primary methods. The first is the complete method. You pass it a single text string containing your prompt, and it returns a single text response. This is built for straightforward, single-shot tasks like summarizing a document or extracting a specific fact. The second method is the chat method. This is designed for conversations or structured interactions. Instead of a single string, you pass a list of chat message objects. Each message has a specific role attached to it, typically system, user, or assistant. By passing a list, the chat method gives the model the full context of a back-and-forth exchange before generating its next reply. Both complete and chat wait for the model to finish its entire generation before returning the output. If the model is writing a long response, your application sits idle. To fix this, you use streaming. You call stream_complete or stream_chat instead. These methods return a generator. As the model produces tokens, your code receives them in small chunks. You loop through this generator to print the response to a user interface in real-time, removing the perception of latency. Now, the second piece of this is handling non-text data. This is where it gets interesting. Modern LLMs parse visual information, and LlamaIndex supports this through content blocks. Instead of passing a simple text string inside a user chat message, you can pass a list of blocks. Think back to the IT support ticket with the broken server rack. You need the model to look at the photo and read your diagnostic instructions. First, you create an ImageBlock. You provide this block with the image data. LlamaIndex allows you to pass a local file path, a direct URL, or raw base64-encoded bytes. Next, you create a TextBlock. You give it your text prompt, asking the model to identify the hardware fault shown in the picture. You put both the ImageBlock and the TextBlock into a single list, and you attach that list to a new user chat message. When you pass this message into the chat method of a vision-capable model, the LLM processes the visual layout of the server rack alongside your text instructions. It returns a diagnosis based on both inputs combined. Here is the key insight. The real power of this architecture is its consistency. Whether you are sending a one-line string, streaming a real-time response, or passing a complex array of text and image blocks, the interaction pattern with the LLM class remains completely standardized across your entire codebase. That is your lot for this one. Catch you next time!
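A minimal sketch of a multi-modal chat call, assuming a vision-capable OpenAI model and a hypothetical local photo named server_rack.jpg:

```python
from llama_index.core.llms import ChatMessage, ImageBlock, TextBlock
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")  # assumed vision-capable model

# One user message carrying both an image and a text instruction.
message = ChatMessage(
    role="user",
    blocks=[
        ImageBlock(path="server_rack.jpg"),  # hypothetical local photo
        TextBlock(text="Identify the hardware fault shown in this picture."),
    ],
)
response = llm.chat([message])
print(response.message.content)

# Streaming variant: tokens arrive in chunks instead of one blocking call.
for chunk in llm.stream_complete("Summarize this ticket in one line."):
    print(chunk.delta, end="")
```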
5

Structured Data Extraction with Pydantic

3m 24s

Learn how to force unpredictable LLMs to return strict, typed JSON data. Discover how Pydantic BaseModels act as schemas to extract reliable structured information from raw text.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 5 of 14. Nothing breaks a production pipeline faster than an LLM deciding to add "Sure, here is your JSON" right before the actual data payload. You ask for machine-readable output, you get conversational filler, and your parser instantly crashes. Bridging the gap between unstructured natural language and reliable programmatic types is exactly what Structured Data Extraction with Pydantic resolves. Take the scenario of parsing a messy email inbox filled with vendor messages, payment reminders, and unstructured receipts. Your system needs to extract structured invoice data from this text. If you rely on standard text generation, you get unpredictable results. One response might use camel case for keys, another might format dates differently, and prices often come back as strings with currency symbols attached. You end up writing endless string manipulation logic just to get a usable number out of the response. Instead of asking the model for JSON and hoping it complies, you define your exact requirements using a Pydantic base model. You create a Python class called Invoice. Inside this class, you declare the precise data types your application expects. You define the date as a string, the purchased items as a list of strings, and the total price strictly as a float. Here is the key insight. LlamaIndex takes your Pydantic class and automatically serializes it into a strict JSON schema. When you send the email text to the model, LlamaIndex attaches this schema and triggers the underlying model's structured output or function-calling API. The schema acts as a hard boundary. The LLM is no longer generating freeform text. It is constrained to populate the fields of the JSON schema specifically with the types you demanded. You can also steer the model's reasoning directly inside your data structures. By attaching a field description to an attribute, you give the LLM targeted instructions. For the price attribute, you might add a field description stating to extract the final total cost, ignoring shipping fees. The LLM reads this description as part of the schema definition and applies that logic during the extraction phase. When the response comes back, LlamaIndex does not hand you a raw string or a generic dictionary. It processes the response through Pydantic and returns a fully instantiated Invoice object. The data is already validated. Because the framework leverages native structured output features, the model knows in advance it must provide an actual float for the price, not a string representation of one. You can immediately access the invoice.price attribute in your code and perform math on it. There is no need to strip out conversational filler, remove dollar signs, or cast strings to numbers. The transition from natural language to application logic happens seamlessly at the extraction layer. By pushing your data schema directly into the extraction process, you force the LLM to adapt to your application code, rather than writing fragile application code to tolerate unpredictable LLM behavior. That is all for this one. Thanks for listening, and keep building!
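A hedged sketch of this extraction flow using structured_predict; the model choice and the email_text placeholder are assumptions of the sketch:

```python
from pydantic import BaseModel, Field
from llama_index.core.prompts import PromptTemplate
from llama_index.llms.openai import OpenAI

class Invoice(BaseModel):
    date: str
    items: list[str]
    # The field description becomes part of the schema the LLM sees.
    price: float = Field(description="The final total cost, ignoring shipping fees.")

llm = OpenAI(model="gpt-4o-mini")  # assumed model with structured output support
email_text = "..."  # hypothetical: one messy vendor email

# structured_predict attaches the JSON schema and returns a validated Invoice.
invoice = llm.structured_predict(
    Invoice,
    PromptTemplate("Extract the invoice from this email:\n{email}"),
    email=email_text,
)
print(invoice.price * 1.2)  # already a float, safe for math
```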
6

Building Autonomous Function Agents

3m 28s

Take the leap from static code to autonomous agents. You will learn how to wrap Python functions into tools and deploy a FunctionAgent to execute tasks dynamically.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 6 of 14. What if your application could decide which functions to run based on the user's intent, rather than a rigid set of conditional statements? That is the core idea behind Building Autonomous Function Agents. When you build a standard query pipeline, you dictate the execution path. An agent flips this paradigm. You provide a set of tools, and an automated reasoning engine uses a large language model to decide which tools to call, and in what order, to solve a problem. Before we go further, we need to clear up a common misunderstanding. The LLM itself does not execute your Python code. It merely generates a structured text request saying it wants to call a specific function with specific arguments. The LlamaIndex agent framework intercepts that request, executes your local Python code, and then feeds the result back to the LLM. To make this autonomous routing work, you need tools. A tool is essentially a standard Python function wrapped in a LlamaIndex class called FunctionTool. But because the LLM needs to know when and how to use your function, your code metadata becomes a critical part of the system. The framework extracts the function name, the type hints, and the docstring, and passes them to the LLM as instructions. Let us look at a concrete scenario. You want your agent to solve math word problems. You write two Python functions, one called add and one called multiply. For the multiply function, your type hints specify that it takes two integers and returns an integer. Crucially, you write a docstring that clearly says this function multiplies two numbers together. You wrap both functions into tools and pass them in a list, along with your chosen LLM, to initialize the agent. Here is where it gets interesting. You ask the agent, what is two plus two, multiplied by three. The agent enters a reasoning loop. First, the LLM analyzes the prompt and looks at the available tools. It reads your docstrings and decides it needs to add first. It outputs a request to call the add tool with the arguments two and two. The framework runs your Python function locally and returns the result, four, back to the agent. The agent is not done. It looks at the intermediate result and its original goal. It decides it now needs to multiply. It requests the multiply tool, passing in the four it just received and the three from your original prompt. The framework executes the multiplication and returns twelve. Finally, the LLM recognizes the problem is solved and generates a conversational response for the user. There are no hardcoded rules or explicit routing logic here. The agent figured out the dependencies and the order of operations entirely on its own based on the available tools. This means the way you write your Python definitions directly dictates how smart your agent is. Your type hints and your docstrings are no longer just for other developers, they are the literal prompt that drives the agent's autonomous logic. If you are getting value out of these episodes and want to support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
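A minimal working sketch of the math agent described above, assuming an OpenAI function-calling model; note how the docstrings and type hints are the only "prompting" the tools get:

```python
import asyncio
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def add(a: int, b: int) -> int:
    """Add two numbers together."""
    return a + b

def multiply(a: int, b: int) -> int:
    """Multiply two numbers together."""
    return a * b

# The framework passes each function's name, hints, and docstring to the LLM.
agent = FunctionAgent(
    tools=[FunctionTool.from_defaults(fn=add), FunctionTool.from_defaults(fn=multiply)],
    llm=OpenAI(model="gpt-4o-mini"),
)

async def main() -> None:
    response = await agent.run("What is two plus two, multiplied by three?")
    print(response)  # the agent calls add(2, 2), then multiply(4, 3)

asyncio.run(main())
```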
7

Extending Agents with LlamaHub Tools

3m 30s

Supercharge your agents with pre-built integrations. This episode shows how to browse LlamaHub, install tool specs, and give your agent real-world capabilities instantly.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 7 of 14. You are building an agent that needs to check stock prices, pull messages from Slack, or query a database. You could spend days reading API documentation, handling authentication schemas, and writing boilerplate integration code. Or, you could just grab the completed work a community developer already wrote and verified. This episode covers Extending Agents with LlamaHub Tools. Agents rely on tools to interact with the outside world. While you can write every integration manually, LlamaHub exists specifically to eliminate that redundant work. It operates as a vast open-source registry of pre-built tool specifications. A tool specification, or tool spec, is essentially a Python class that bundles together multiple related API calls into a single package. By pulling these directly into your project, you bypass the entire process of writing the underlying logic for third-party services. To use one of these integrations, you install its specific package. If you want your agent to answer questions about a company stock price, you do not write a custom HTTP request to the Yahoo API. Instead, you use your standard package manager to install the llama-index-tools-yahoo-finance package. Once installed, you import the YahooFinanceToolSpec class into your script. Here is the key insight. You do not pass the tool spec class directly to the agent. Because a tool spec is a bundle of multiple capabilities, you must unpack it first. You do this by creating an instance of the YahooFinanceToolSpec, and then calling a specific method on it named to_tool_list. This method breaks the bundle apart and returns a standard flat array of individual tools that the agent can read and execute. This modular design means you are not restricted to using only external tools or only internal tools. You can seamlessly combine them. Suppose you already have a local, custom function tool that formats currency numbers specifically for your company internal dashboard. You simply create a standard Python list. Inside that list, you place your custom currency formatter tool, and you also append the unpacked tools from the Yahoo Finance tool list. You then take this combined array and pass it directly to your agent during initialization by assigning it to the tools parameter. When you prompt the agent with a question about a current stock price, the agent evaluates the user request against the descriptions of all the tools in that combined list. It recognizes that the Yahoo Finance tool is the correct choice to fetch the market data. It extracts the company ticker symbol from the user prompt, executes the Yahoo Finance tool, retrieves the real-time price, and then can optionally chain that raw result into your local formatting tool before returning the final answer to the user. You just gave your application complex financial lookup capabilities by installing a package, instantiating a class, and calling one unpacking method. This architectural pattern applies to hundreds of integrations on LlamaHub, from reading Google Drive documents to querying Wikipedia. The real leverage of an autonomous agent is not found solely in the reasoning capability of the underlying language model, but in the sheer breadth of external systems it can instantly access through pre-built tool specifications. That is all for this one. Thanks for listening, and keep building!
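A short sketch combining a hypothetical local formatter with the unpacked Yahoo Finance tools, after installing llama-index-tools-yahoo-finance with pip; the model choice is an assumption:

```python
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
from llama_index.tools.yahoo_finance import YahooFinanceToolSpec

def format_currency(amount: float) -> str:
    """Format a raw number for the internal dashboard."""
    return f"${amount:,.2f}"

# Unpack the tool spec bundle into a flat list of individual tools.
finance_tools = YahooFinanceToolSpec().to_tool_list()

# Combine the local custom tool with the LlamaHub tools in one list.
agent = FunctionAgent(
    tools=[FunctionTool.from_defaults(fn=format_currency), *finance_tools],
    llm=OpenAI(model="gpt-4o-mini"),
)
# await agent.run("What is the current share price of AAPL, dashboard-formatted?")
```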
8

Multi-Agent Swarms with AgentWorkflow

2m 59s

Move beyond single-agent setups. Learn how to configure a linear swarm of specialized agents that autonomously hand off tasks to one another using AgentWorkflow.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 8 of 14. You feed a language model one massive prompt asking it to research a topic, write a report, and critically review its own work. It fails. The context window gets scrambled, tools are misused, and the output is a shallow compromise. The fix is breaking that massive prompt down into a swarm of specialists using Multi-Agent Swarms with AgentWorkflow. A single bloated agent struggles because it has too many tools and conflicting instructions. It has to decide when to search, when to draft, and when to edit, all while managing a massive internal context. AgentWorkflow solves this by letting you define a network of highly focused agents. Each agent operates with a narrow system prompt, a restricted set of tools, and a singular goal. Consider a content generation pipeline. Instead of one mega-agent, we create three distinct FunctionAgents. First, the Researcher. We equip this agent with a web search tool and instruct it strictly to gather facts. It does not write prose. Next, we define the Writer. We strip away the search tools entirely to prevent it from getting distracted. Its only job is to take raw facts and draft clean paragraphs. Finally, we define the Reviewer. We give it guidelines on tone and factual accuracy, instructing it to critique the text. Here is the key insight. Having multiple agents is useless if they cannot coordinate, but giving them free rein to talk to anyone creates chaos. You have to explicitly wire them together. In AgentWorkflow, you do this by defining handoff permissions. When you configure your agents, you specify a property called can_handoff_to. This accepts a list of other agents. For our pipeline, we give the Researcher permission to hand off to the Writer. We give the Writer permission to hand off to the Reviewer. We also give the Reviewer permission to hand off back to the Writer if the text needs revision. This creates a strict, directed graph of agents. The framework enforces these boundaries. The Researcher cannot bypass the Writer and send raw notes straight to the Reviewer. It simply does not have the authorization. To execute this, you pass your list of agents into an AgentWorkflow instance. You start the process by running the workflow with an initial user query, pointing it at the Researcher to kick things off. The workflow framework manages the shared state and the routing automatically. The Researcher runs its search tools, compiles the data, and internally decides its job is done. It then uses its handoff tool to pass control, along with the gathered data, to the Writer. The Writer takes that context, drafts the report, and hands off to the Reviewer. If the Reviewer spots a flaw, it triggers a handoff back to the Writer with feedback. This loop continues until an agent decides the work is complete and returns a final response to the user instead of triggering another handoff. Constraining an agent to a single role and explicitly defining its communication paths is the most reliable way to force complex, multi-step reasoning out of language models. That is all for this one. Thanks for listening, and keep building!
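A sketch of the three-agent swarm; the search stub, prompts, and model are illustrative assumptions, and the Writer and Reviewer carry no tools of their own in this sketch:

```python
from llama_index.core.agent.workflow import AgentWorkflow, FunctionAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")  # assumed model

def search_web(query: str) -> str:
    """Hypothetical web search stub; swap in a real LlamaHub search tool."""
    return f"Raw facts about {query}..."

researcher = FunctionAgent(
    name="Researcher",
    description="Gathers facts about the topic.",
    system_prompt="Strictly gather facts. Do not write prose.",
    tools=[FunctionTool.from_defaults(fn=search_web)],
    llm=llm,
    can_handoff_to=["Writer"],  # may only pass work to the Writer
)
writer = FunctionAgent(
    name="Writer",
    description="Drafts clean paragraphs from raw facts.",
    system_prompt="Turn the gathered facts into a draft report.",
    llm=llm,
    can_handoff_to=["Reviewer"],
)
reviewer = FunctionAgent(
    name="Reviewer",
    description="Critiques tone and factual accuracy.",
    system_prompt="Critique the draft. Hand off back to the Writer if revision is needed.",
    llm=llm,
    can_handoff_to=["Writer"],  # the revision loop
)

workflow = AgentWorkflow(agents=[researcher, writer, reviewer], root_agent="Researcher")
# response = await workflow.run(user_msg="Write a short report on heat pumps.")
```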
9

The Orchestrator Agent Pattern

3m 38s

Take granular control of your agentic workflows. Discover how to build a master orchestrator agent that manages subordinate agents as callable tools.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 9 of 14. If you let autonomous agents pass control peer-to-peer, you easily end up with endless loops or dropped context. Sometimes you cannot rely on agents handing off to each other. You need a central manager in charge of the whole operation. That is the Orchestrator Agent pattern. In a multi-agent system, letting agents talk directly to each other sounds powerful, but it quickly becomes a debugging nightmare. You lose track of who is in charge, and managing the overall state of the application becomes incredibly complex. The Orchestrator pattern solves this by strictly enforcing a hub-and-spoke model. There is one top-level orchestrator agent, and all other agents are treated strictly as tools for that orchestrator to use. Let us walk through a scenario where your system needs to generate a heavily researched technical briefing. You set up an orchestrator agent to act as a strict manager. This manager owns the global state. It remembers the original user prompt, holds the master context, and tracks what has been accomplished so far. It does not do the actual hard work itself. Instead, you provide it with tools. These tools are not simple API wrappers or basic calculators. They are entirely separate, fully functional sub-agents. You might build one sub-agent dedicated to research, equipped with vector search and web scraping capabilities. You might build a second sub-agent dedicated to writing, equipped with stylistic guidelines and formatting logic. In LlamaIndex, you take these sub-agents and expose their run methods as tools. By doing this, you are wrapping their entire internal reasoning loops behind a standard tool interface with a name and a description. To the orchestrator, these sub-agents look exactly like standard Python functions. When the user asks for the briefing, the orchestrator evaluates the overall goal based on the tool descriptions you provided. It decides to call the research tool first and passes in the necessary parameters. Control shifts temporarily to the research sub-agent. This sub-agent runs its own autonomous loop. It might make three or four internal tool calls to gather data, synthesize the facts, and formulate an answer. Once the research agent finishes, it collapses all that work into a final text string and returns it. Here is the key insight. The orchestrator never actually yields control of the main process. It just waits for the tool to return a value. The orchestrator receives the research summary, adds it to its own context, and evaluates the next step. It realizes the information needs to be drafted into a document, so it calls the writing tool, passing the fresh research as the input parameter. The writing sub-agent takes over, does its own internal processing, and hands back the finished text. The orchestrator sees that the final goal is met and delivers the response to the user. This strict separation of concerns makes your system highly predictable. The research agent does not need to know the writing agent exists. Neither sub-agent has to worry about handing off context, formatting the final user response, or deciding when the overall job is done. The orchestrator centralizes all the high-level decision-making logic. By forcing sub-agents to operate purely as isolated tools within a central manager's loop, you can build massively capable multi-agent systems that remain entirely predictable and easy to debug. That is your lot for this one. 
Catch you next time!
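A sketch of the hub-and-spoke wiring, with each sub-agent's run method wrapped as a plain async tool; the search stub, prompts, and model are illustrative assumptions:

```python
import asyncio
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")  # assumed model

def search_corpus(query: str) -> str:
    """Hypothetical research stub; swap in real vector search or web scraping."""
    return f"Collected notes about {query}."

research_agent = FunctionAgent(
    system_prompt="You research topics thoroughly using your tools.",
    tools=[FunctionTool.from_defaults(fn=search_corpus)],
    llm=llm,
)
writing_agent = FunctionAgent(
    system_prompt="You draft polished technical briefings from research notes.",
    llm=llm,
)

async def do_research(topic: str) -> str:
    """Research a topic and return a summary of the findings."""
    return str(await research_agent.run(f"Research this topic: {topic}"))

async def draft_briefing(notes: str) -> str:
    """Draft a technical briefing from research notes."""
    return str(await writing_agent.run(f"Draft a briefing from these notes:\n{notes}"))

# The orchestrator sees each sub-agent's entire reasoning loop as one tool call.
orchestrator = FunctionAgent(
    system_prompt="You manage briefings. Research first, then draft.",
    tools=[
        FunctionTool.from_defaults(async_fn=do_research),
        FunctionTool.from_defaults(async_fn=draft_briefing),
    ],
    llm=llm,
)

async def main() -> None:
    print(await orchestrator.run("Prepare a briefing on quantum networking."))

asyncio.run(main())
```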
10

Custom Multi-Agent Planners

3m 34s

Achieve ultimate multi-agent flexibility. Learn how to roll your own orchestration loop using custom XML prompting, Pydantic, and imperative execution.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 10 of 14. Built-in agent orchestration is great until your business logic starts looking like a plate of spaghetti. When standard hand-offs fail to capture your highly specific scheduling rules, abstractions become a blocker. That is exactly when you need Custom Multi-Agent Planners. A custom planner is your power-user escape hatch. It allows you to build your own orchestration loop from scratch using a standard LlamaIndex Workflow. Instead of relying on a pre-built supervisor to dictate which agent acts next, you control the entire scheduling process imperatively in Python. You dictate the execution order, the data routing, and the state management. The process usually starts inside a custom class, often called a PlannerWorkflow. The first phase is planning. When a user submits a request, your workflow sends a prompt to a large language model. This prompt includes the user query and a strict description of the tools or agents you have available. You explicitly instruct the language model to generate a step-by-step plan in a highly structured format. For example, you might tell the model to wrap every action inside an XML block using step tags, or format it as a JSON array. When the model replies, you parse that structured output. You use a library like Pydantic to validate the XML or JSON and convert it into a concrete list of tasks that your code can iterate over. Now you enter the execution phase. This part is entirely up to your custom logic. Your workflow iterates through the parsed list of steps one by one. For each step, it checks the requested action. You use standard conditional logic to decide what to do next. If the parsed step specifies a research task, your code explicitly calls your research agent. If the next step requires calculation, you trigger your math agent. Here is the key insight. Because you are writing the loop yourself, you maintain complete ownership of the state. You typically create a shared context dictionary that lives for the duration of the workflow run. When your research agent finishes its task, your workflow takes the result and writes it directly into that dictionary. When the next step triggers the math agent, your custom logic can pass it the exact required data from that shared dictionary. You are not hoping a black-box orchestrator passes the right variables. You explicitly map the outputs of one agent to the inputs of the next. Once the loop completes all the steps in your validated plan, your workflow performs any final formatting and returns the answer. Building a custom planner means you trade out-of-the-box convenience for total control. If an agent fails, you can write custom retry logic for that specific step. If a step needs external API validation, you can pause the loop. You are writing standard imperative code that just happens to use language models as functions. The ultimate value of a custom planner is predictability. By forcing the language model to generate a rigid XML plan and executing it with standard Python loops and dictionaries, you eliminate the guesswork of black-box orchestration entirely. That is all for this one. Thanks for listening, and keep building!
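Since a custom planner is by definition your own code, here is one possible shape of the plan-then-execute loop. The XML step format, the PlanStep model, and the agents dictionary (mapping names to objects with an async run method) are all assumptions of this sketch:

```python
import re
from pydantic import BaseModel
from llama_index.llms.openai import OpenAI

class PlanStep(BaseModel):
    agent: str  # which specialist acts, e.g. "research" or "math"
    task: str   # what that specialist should do

PLANNER_PROMPT = (
    "Break the request into sequential steps. Wrap each step in "
    "<step agent='research|math'>task description</step> tags.\n\nRequest: {query}"
)

async def run_plan(query: str, agents: dict) -> dict:
    llm = OpenAI(model="gpt-4o")  # assumed planner model
    raw_plan = llm.complete(PLANNER_PROMPT.format(query=query)).text

    # Planning phase: parse and validate the XML plan into typed steps.
    steps = [
        PlanStep(agent=match.group(1), task=match.group(2))
        for match in re.finditer(r"<step agent='(\w+)'>(.*?)</step>", raw_plan, re.S)
    ]

    # Execution phase: an imperative loop with explicit shared state.
    state: dict = {}
    for step in steps:
        result = await agents[step.agent].run(f"{step.task}\n\nContext so far: {state}")
        state[step.agent] = str(result)  # map this output to later steps' inputs
    return state
```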
11

Human-in-the-Loop Workflows

3m 09s

Prevent autonomous disasters by keeping a human in the loop. You will learn how to pause workflows with events to wait for human confirmation before executing dangerous tasks.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 11 of 14. You build an agent to manage your infrastructure, and it decides the best way to resolve an error is to delete a production database. Never let an autonomous agent take a destructive action without explicitly asking a human for permission first. To prevent this, you need Human-in-the-Loop workflows. Human-in-the-loop is an event-driven mechanism that pauses an agent, requests external input, and resumes execution based on the response. Instead of letting the agent run uninterrupted through its thought and action cycle, you intercept high-risk operations. You achieve this in LlamaIndex Workflows using three specific components: the InputRequiredEvent, the wait_for_event method, and the HumanResponseEvent. Consider a dangerous task tool designed to delete a cloud resource. The agent decides it needs to use this tool and triggers the corresponding workflow step. If the tool executes immediately, the resource is gone. Instead, the workflow step intercepts the execution. Before deleting anything, the step creates an InputRequiredEvent. This event carries a payload containing details about the action, such as the target resource name and a prompt asking the user to confirm the deletion. The step emits this event out to the main application. Here is the key insight. The workflow step cannot just sit in an active loop waiting for an answer. You must suspend its state. You do this by calling wait_for_event on the workflow context, specifying that the step is now listening for a HumanResponseEvent. This action yields control back to the environment. The workflow engine pauses the step entirely, freezing the agent in its current state without consuming compute resources while it waits. Outside the workflow, your application layer catches the InputRequiredEvent. You read the payload and display the confirmation prompt to the user on their command-line interface. The human reads the warning and types yes or no. Now you need to unpause the agent. Your application takes the user input and wraps it inside a HumanResponseEvent. You send this new event directly back into the running workflow engine. The engine recognizes the event type and routes it to the exact step that was suspended. The wait_for_event method resolves, returning the human response string back to the tool logic. The tool evaluates the response. If the human typed yes, the tool proceeds with the API call to delete the cloud resource. If the human typed no, the tool aborts the deletion. In either case, the tool returns a final status message back to the agent. The agent processes this outcome, understands whether the action succeeded or was blocked by the user, and decides on its next move. By using events to pause and resume execution, your agent maintains its entire reasoning context and memory while waiting hours or even days for a human to make a decision. That is all for this one. Thanks for listening, and keep building!
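A sketch of both halves of the pattern, following the documented event types; dangerous_task would be registered as a tool on the agent, and the resource names and prompts are hypothetical:

```python
from llama_index.core.workflow import Context, InputRequiredEvent, HumanResponseEvent

async def dangerous_task(ctx: Context, resource: str) -> str:
    """Delete a cloud resource, but only after explicit human confirmation."""
    # Emit the confirmation request out to the application layer...
    ctx.write_event_to_stream(
        InputRequiredEvent(prefix=f"Really delete {resource}? (yes/no): ")
    )
    # ...then suspend this step until a HumanResponseEvent is routed back in.
    answer = await ctx.wait_for_event(HumanResponseEvent)
    if answer.response.strip().lower() == "yes":
        return f"{resource} deleted."
    return f"Deletion of {resource} aborted by the user."

async def run_with_confirmation(agent, task: str) -> None:
    # Application side: catch the event, ask the human, resume the workflow.
    handler = agent.run(task)
    async for event in handler.stream_events():
        if isinstance(event, InputRequiredEvent):
            reply = input(event.prefix)  # command-line confirmation
            handler.ctx.send_event(HumanResponseEvent(response=reply))
    print(await handler)
```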
12

Observability and Tracing

3m 26s

Stop debugging AI with print statements. This episode explores LlamaIndex callbacks and one-click observability to trace inputs, durations, and outputs across complex pipelines.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 12 of 14. When an autonomous agent makes a mistake, digging through standard Python print statements to find the issue is a complete nightmare. You ask a simple question, get a hallucinated answer, and the standard stack trace tells you absolutely nothing about why the model lied. To fix this, you need a way to see inside the black box. This episode covers Observability and Tracing. Traditional debugging tools fall short with large language models. A stack trace tells you where code crashed, but LLM bugs are usually logical. The code runs perfectly, but the system retrieves the wrong document or misinterprets a prompt. Standard Python logging is your first line of defense. By setting the logging level to debug, LlamaIndex outputs a raw feed of everything it does. You will see the exact prompts sent to the language model and the raw HTTP responses. This is useful for checking if a network call failed, but for a multi-step agent workflow, reading a wall of unstructured text is incredibly tedious. Here is the key insight. You do not just need a log of events; you need to see the call graph. You need a structured view of how data flows from the initial query down through the retrievers, into the language model, and back out. LlamaIndex handles this using a callback system. Callbacks are hooks that trigger at specific points in the execution cycle. The framework provides a built-in tool called the LlamaDebugHandler. You initialize this handler and attach it to your global settings. From that point on, it silently records every operation. Say you run a query engine, and it returns a completely fabricated fact. Without tracing, you have no idea if the model hallucinated or if your database fed it bad information. With the debug handler attached, you can ask it to print the trace after the query finishes. The trace reveals the exact sequence of events. You see the initial query. You see the retrieval step. Crucially, you see the exact text nodes the retriever pulled from your index. You inspect those nodes in the trace and discover that an outdated document was retrieved. The language model did not hallucinate; it just read bad data. You fix the index, and the bug is resolved. Terminal traces are great for local development, but they do not scale well when you have complex agents making dozens of reasoning steps. For production, LlamaIndex offers what it calls one-click observability. By setting a specific environment variable or adding a single configuration line, you can route all that callback data to a dedicated observability platform. These platforms ingest the trace data and generate visual dashboards. You can click through a visual tree of your agent workflow, inspecting the exact latency, token usage, and payload for every single step. You do not need to instrument every function manually; the framework's native callbacks handle the heavy lifting. The difference between a fragile prototype and a reliable production application is whether you can explain exactly why the system generated a specific response. If you found this useful and want to help support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
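A minimal sketch of attaching the debug handler, assuming a query_engine built as in the earlier episodes; the one-click alternative is shown as a comment:

```python
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler

# Attach the debug handler globally so every operation is recorded.
llama_debug = LlamaDebugHandler(print_trace_on_end=False)
Settings.callback_manager = CallbackManager([llama_debug])

# Assume `query_engine` was built earlier, as in the ingestion episodes.
response = query_engine.query("What does the handbook say about parental leave?")

# Inspect the recorded call graph: query, retrieval, LLM calls, timings.
llama_debug.print_trace_map()

# One-click observability instead: route the same callback data to a platform.
# import llama_index.core
# llama_index.core.set_global_handler("arize_phoenix")  # one example integration
```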
13

RAG Evaluation Metrics

3m 50s

Measure the true effectiveness of your applications. Learn how to use FaithfulnessEvaluator and RetrieverEvaluator to objectively score retrieval and response quality.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 13 of 14. You swap out your embedding model and suddenly your answers feel a bit off, but you cannot quite put your finger on why. If you are just eyeballing outputs to check quality, your pipeline is guessing in the dark. To stop relying on vibes and prevent regressions in production, you need RAG Evaluation Metrics. Building a RAG application is easy, but making it robust is hard. You will constantly tweak chunk sizes, prompts, and retrieval strategies. If you rely on human review to test every change, you will either slow down development or deploy regressions. You need automated, objective measurements. Because RAG consists of two distinct steps, finding the right information and generating an answer based on it, you must evaluate both stages separately. Let us look at generation first. This is called Response Evaluation. The primary metric here is faithfulness. The goal is to catch hallucinations. A faithful response is one where the language model relies entirely on the retrieved context, rather than inventing facts from its own pre-training data. In LlamaIndex, you handle this with the FaithfulnessEvaluator. This tool uses a language model under the hood to act as a judge. You initialize the evaluator, then you pass it the original query, the array of retrieved context nodes, and the final generated text. The evaluator returns an evaluation object containing a binary pass or fail boolean, telling you whether the response is strictly supported by the context provided. It also provides a reasoning string explaining why the judge made its decision. If your faithfulness score drops after an update, your prompt or your language model might be getting too creative. Now, the second piece of this. Even the best language model cannot generate a faithful answer if you hand it the wrong documents. This is where Retrieval Evaluation comes in. Here is the key insight. You evaluate retrieval by checking if the system fetched the exact source nodes you expected for a given query, completely ignoring the final generated text. You handle this with the RetrieverEvaluator. Consider a scenario where you want to test a new embedding model. Instead of deploying it and guessing if it is better, you build an evaluation dataset. This dataset contains a list of queries paired with the specific document identifiers that contain the correct answers. You run your entire batch of queries through the RetrieverEvaluator. The evaluator calculates two crucial metrics, Hit Rate and MRR. Hit rate is straightforward. It checks if the expected document appeared anywhere in your top retrieved results. If you retrieve five documents, and the correct one is in there, that is a hit. It measures pure recall. But position matters. If the correct document is always fifth, your language model might ignore it due to context limits or attention decay. This is where Mean Reciprocal Rank, or MRR, comes in. MRR looks at the position of the first relevant document. If the correct document is at the very top, the score is one. If it is second, the score is one-half. If it is third, one-third. The evaluator averages these fractions across your entire dataset. A higher MRR means your retriever is consistently pushing the most relevant information to the very top of the context window. By comparing the Hit Rate and MRR of your old embedding model against the new one, you get mathematical proof of which model performs better. 
You can track these numbers over time and run this pipeline automatically on every pull request. The single most valuable thing you can do for your RAG pipeline is separate the evaluation of what you retrieve from how you generate the final answer. Thanks for spending a few minutes with me. Until next time, take it easy.
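A hedged sketch of both evaluators; the query_engine, the index, and the node_42 ground-truth identifier are assumptions carried over from earlier episodes:

```python
from llama_index.core.evaluation import FaithfulnessEvaluator, RetrieverEvaluator
from llama_index.llms.openai import OpenAI

judge = OpenAI(model="gpt-4o")  # the LLM acting as judge

# Response evaluation: is the answer strictly supported by the retrieved context?
faithfulness = FaithfulnessEvaluator(llm=judge)
response = query_engine.query("What is the refund window?")  # assumed query engine
result = faithfulness.evaluate_response(response=response)
print(result.passing, result.feedback)  # binary verdict plus the judge's reasoning

# Retrieval evaluation: did the expected nodes come back, and how highly ranked?
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["hit_rate", "mrr"], retriever=index.as_retriever(similarity_top_k=5)
)
eval_result = retriever_evaluator.evaluate(
    query="What is the refund window?",
    expected_ids=["node_42"],  # hypothetical ground-truth node id
)
print(eval_result.metric_vals_dict)
```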
14

Scaffold to Production

3m 35s

Transform prototypes into full applications instantly. Discover how to use create-llama and the RAG CLI to scaffold full-stack web apps and terminal chats without writing boilerplate.

Hi, this is Alex from DEV STORIES DOT EU. LlamaIndex: Context-Augmented LLM Applications, episode 14 of 14. Stop copying and pasting the same API and React boilerplate every time you want to test a new Retrieval-Augmented Generation idea. You already understand the framework mechanics, but setting up a clean interface to actually use your models still takes hours of repetitive work. Today, we cover the starter tools that let you scaffold to production in seconds. Building a functioning application requires much more than just an index and a query engine. You need a backend server to handle incoming requests securely. You need API routes to pass messages back and forth. You need a frontend client to display the chat history, render loading states, and parse the responses. Writing this infrastructure from scratch for every new dataset or prototype drains your time. To solve this, LlamaIndex provides a command-line utility called create-llama. This tool generates a complete, full-stack web application pre-configured with LlamaIndex best practices. You open your terminal and run the create-llama command. The tool then walks you through a series of choices. It asks if you want a Python backend using FastAPI, or a Node backend using Express. It asks if you want a Next.js frontend to give your users a polished web interface. Then, it asks for your data source. You can point the tool directly at a local folder containing your PDF files. Once you finish the prompts, create-llama takes over. It installs all required dependencies. It scaffolds the directory structure. It writes the ingestion script to parse your PDFs. It wires up the API endpoints so your frontend can talk to your retrieval engine. Finally, it sets up the environment variables. You run one start command, and you immediately have a styled chat interface running in your browser. You can type a question, and the interface will hit the generated backend, retrieve the context from your PDFs, and stream the answer back to your screen. You go from an empty folder to a working full-stack prototype in about thirty seconds. That handles web applications. But sometimes a web server and a graphical interface are overkill. If you just downloaded a long technical specification and need to query it immediately without leaving your command line, you use the RAG CLI. The RAG CLI is a tool built purely for terminal-based document interaction. You install it, then run a command to point it at your local directory of documents. The CLI automatically runs the ingestion process. It chunks the text, generates embeddings, and stores them in a local vector database right there on your machine. When ingestion is done, you run the chat command. Your standard terminal prompt transforms into a chat session. You ask a question, the CLI retrieves the relevant data, queries the language model, and prints the generated answer directly to your console. There are no visual components or web routes to configure. It is the absolute fastest way to talk to your data locally. Here is the key insight. You now understand the deep mechanics of context-augmented applications, from chunking documents to building complex agent routers. These scaffolding tools exist so you can stop wrestling with basic infrastructure and spend your time tuning those core retrieval strategies. Since this is the final episode of our LlamaIndex series, the best next step is to head over to the official documentation and try building these pipelines hands-on.
If you have an idea for a completely different technology stack you want us to cover, visit devstories dot eu to suggest a topic. That is all for this one. Thanks for listening, and keep building!
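For reference, the two entry points look roughly like this in a terminal; the file path is hypothetical and exact flags can vary between versions:

```bash
# Scaffold a full-stack app: the tool prompts for backend, frontend, and data.
npx create-llama@latest

# Terminal-only RAG: ingest local files, then chat from the prompt.
llamaindex-cli rag --files "./docs/spec.pdf"
llamaindex-cli rag --chat
```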