Season 3 24 Episodes 1h 25m 2026

LiteLLM: The Universal LLM Gateway (v1.82 - 2026 Edition)

A comprehensive podcast series covering LiteLLM, from the Python SDK and unified exception handling to the Proxy Server, load balancing, virtual keys, observability, and enterprise features like RBAC and MCP Gateway.

LLM Infrastructure LLM Orchestration
1
The Rosetta Stone of LLMs: The Python SDK
Discover the core identity of LiteLLM: a unified Python interface for over 100 LLMs. Learn how to drop LiteLLM into your existing codebase to call Anthropic, Vertex, or Ollama without changing your OpenAI-compatible parsing logic.
3m 53s
2
Parameter Translation Magic
Stop worrying about which provider supports which hyperparameter. Learn how LiteLLM's SDK automatically translates standard OpenAI inputs like temperature, tools, and max_tokens across all upstream models.
3m 23s
3
Unified Exception Handling
Write your try-except blocks once. Discover how LiteLLM catches custom errors from AWS, Google, and Azure, mapping them perfectly back to standard OpenAI exceptions like RateLimitError.
3m 26s
4
The LLM Gateway: Setting up the Proxy Server
Take LiteLLM out of your local code and into a centralized platform. Learn how to launch the LiteLLM Proxy Server via Docker and configure your first endpoints using the config.yaml file.
3m 50s
5
Centralized Secret Management
Keep your API keys out of plaintext configs. Learn how to connect LiteLLM to enterprise secret managers like AWS Secrets Manager or Azure Key Vault to dynamically fetch credentials.
3m 41s
6
Model Aliases: The Shadow Upgrade
Migrate users to new models silently. Discover how to use Model Aliases in LiteLLM to map requests for one model to a completely different endpoint without altering any client-side code.
3m 43s
7
Load Balancing for High Throughput
Avoid rate limits and downtime by routing traffic intelligently. Explore why simple-shuffle is the recommended strategy for production load balancing across multiple deployments.
3m 18s
8
API Outages and Fallbacks
Never suffer AI downtime again. Learn how to configure model fallbacks in LiteLLM so that if your primary provider fails, traffic automatically reroutes to a backup provider.
3m 27s
9
Context Window Fallbacks
Stop overpaying for massive context windows on short prompts. Learn how to use pre-call checks and context window fallbacks to route oversized documents to specialized models.
3m 18s
10
Taming Hanging Requests with Timeouts
Don't let slow APIs freeze your application. Discover how to configure global timeouts and stream timeouts in LiteLLM to abort stalled requests and trigger speedy fallbacks.
3m 34s
11
Virtual Keys for FinOps
Lock down your API usage with precision. Learn how to generate virtual keys using LiteLLM, setting strict RPM, TPM, and budget limits to protect your organization from runaway AI costs.
3m 49s
12
Spend Tracking and Custom Tags
Attribute every cent of LLM spend accurately. Learn how to pass metadata tags in your requests and generate comprehensive spend reports using LiteLLM.
3m 31s
13
Caching for Speed and Savings
Stop paying for the same LLM responses over and over. Learn how to configure exact caching with Redis and semantic caching with Qdrant to slash latency and API costs.
3m 41s
14
RBAC: Empowering Team Admins
Distribute platform management safely. Understand LiteLLM's Role-Based Access Control, delegating power to Org Admins and Team Admins without compromising global security.
3m 18s
15
Security Guardrails
Add an invisible security layer to your LLM requests. Learn how to configure pre-call and post-call guardrails in LiteLLM to block prompt injections and mask PII before it reaches external providers.
4m 02s
16
Dynamic Callback Management
Give microservices the power of privacy. Learn how to use the x-litellm-disable-callbacks header to let sensitive API requests opt out of central observability logging.
3m 36s
17
Drop-in Observability
Get instant visibility into your LLM traffic. Learn how to pipe telemetry, traces, and exceptions to tools like Langfuse and Sentry using simple success and failure callbacks.
3m 20s
18
Prometheus Metrics and Pod Health
Take the pulse of your proxy. Discover how to expose the /metrics endpoint to Prometheus, track in-flight requests, and use custom tags to slice data in Grafana.
3m 27s
19
Universal Text-to-Speech
Standardize your voice generation. Discover how to call Text-to-Speech models from Gemini, Vertex, and AWS Polly using the exact same OpenAI-compatible audio endpoint format.
3m 47s
20
The Assistants API Bridge
Manage conversation state effortlessly across providers. Learn how LiteLLM wraps non-native models in the standard OpenAI Assistants API interface, letting you use Threads and Messages everywhere.
3m 26s
21
The MCP Gateway
Supercharge your models with tools centrally. Discover how to configure HTTP, SSE, or STDIO Model Context Protocol (MCP) servers in LiteLLM, giving any LLM access to external capabilities.
3m 43s
22
A2A: Tracking Autonomous Agents
Bring autonomous agents under control. Learn how to invoke complex LangGraph or Bedrock agents through the proxy using the A2A protocol, enabling trace grouping and unified spend tracking.
3m 24s
23
Zero-Downtime Key Rotations
Achieve zero-downtime security cutovers. Learn how to configure automatic scheduled key rotations and grace periods for enterprise-grade virtual keys in LiteLLM.
3m 16s
24
The Admin UI and AI Hub
Make your AI platform accessible to everyone. Learn how to manage the Admin UI, tweak UI credentials, and use the AI Hub to let developers securely discover allowed models and agents.
3m 38s

Episodes

1

The Rosetta Stone of LLMs: The Python SDK

3m 53s

Discover the core identity of LiteLLM: a unified Python interface for over 100 LLMs. Learn how to drop LiteLLM into your existing codebase to call Anthropic, Vertex, or Ollama without changing your OpenAI-compatible parsing logic.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 1 of 24. You swap your OpenAI API key for an Anthropic one, and suddenly your entire application crashes. Every provider shapes their inputs and outputs differently, forcing you to write and maintain custom parsing logic for each new model. The Rosetta Stone of LLMs, the LiteLLM Python SDK, resolves this completely. There is a common misconception that LiteLLM competes with LangChain. It does not. LiteLLM is not an agent framework. It does not orchestrate tools, manage memory, or chain tasks together. It is simply a drop-in replacement for the OpenAI client. The LiteLLM Python SDK translates over one hundred different provider APIs into the exact OpenAI chat completion format. You do not have to learn the specific nuances of the Vertex AI, Anthropic, or Hugging Face APIs. You write your code once using the OpenAI standard, and LiteLLM handles the translation layer in the background. The core of this SDK is a single unified interface called the completion function. When you use this function, you provide a model string. This string tells LiteLLM which provider to contact. You also pass standard OpenAI parameters like temperature, max tokens, and the messages list. LiteLLM takes those standard parameters and maps them to the correct fields for the target provider. Consider a scenario where you want to call Claude 3.5 Sonnet. Using the standard Anthropic SDK, you would need to structure your prompt blocks according to their specific message API requirements. With LiteLLM, you just call the completion function. You pass in the model name for Claude. You pass in your standard messages array, using dictionaries with user and system roles. Finally, you provide your Anthropic API key, either as a parameter or an environment variable. Here is the key insight. The response object returned by this function is structured exactly like an OpenAI chat completion object. When you want to extract the text generated by Claude, you traverse the exact same path you would for a GPT model. You look at the choices array. You take the first item. You access the message object. You read the content property. Your application logic has no idea that Anthropic generated the text. The underlying network request was translated, sent to the provider, and the response was mapped back to the OpenAI schema before it ever reached your variables. This abstraction applies to the entire interaction. It covers errors, token usage statistics, and even streaming responses. If Anthropic returns a rate limit error, LiteLLM maps it to the standard OpenAI rate limit exception. Your existing error handling code catches it without modification. If you request a streaming response, LiteLLM yields chunks that look exactly like OpenAI streaming chunks. You can iterate through different models by simply changing the string value of the model parameter and swapping the environment variable for the API key. The rest of your Python code remains completely static. You are no longer locked into a single ecosystem just because you wrote your integration code for one specific provider. The true value of the unified completion interface is decoupling your application logic from the endless variations in model provider APIs, turning complex infrastructure changes into simple string swaps. Thanks for listening, happy coding everyone!
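If you want to follow along in code, here is a minimal sketch of the unified completion call described in this episode. It assumes the litellm package is installed and an ANTHROPIC_API_KEY environment variable is set; the exact Claude model string is illustrative.

```python
# Minimal sketch: one OpenAI-shaped call, any provider behind it.
# Assumes `pip install litellm` and ANTHROPIC_API_KEY in the environment;
# the model string below is illustrative and may differ for your Claude version.
from litellm import completion

response = completion(
    model="anthropic/claude-3-5-sonnet-20240620",  # swap this string to change providers
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what an LLM gateway does in one sentence."},
    ],
)

# The response object follows the OpenAI chat completion shape regardless of provider.
print(response.choices[0].message.content)
print(response.usage)  # normalized token usage
```

Changing providers really is just a string swap: point the model value at, say, an Ollama or OpenAI deployment and the parsing code below the call stays identical.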
2

Parameter Translation Magic

3m 23s

Stop worrying about which provider supports which hyperparameter. Learn how LiteLLM's SDK automatically translates standard OpenAI inputs like temperature, tools, and max_tokens across all upstream models.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 2 of 24. Does Claude expect max tokens, max tokens to sample, or something else entirely? Every time you switch model providers, you usually have to rewrite your entire API request payload to match their specific naming conventions. Parameter Translation Magic is the feature that makes this problem disappear. When you call the completion function in LiteLLM, you only ever write your request using the standard OpenAI parameter format. You do not need to look up the specific dictionary keys for Anthropic, Vertex AI, or Cohere. The library intercepts your OpenAI-formatted inputs and automatically translates them into the exact structure the target provider requires. Let us look at a specific scenario. You want to query a Vertex AI Gemini model. You need it to use function calling, and you want the responses to be highly deterministic, so you set the temperature to zero point two. If you were writing a direct integration, Vertex has its own unique, deeply nested way of defining tools and setting model generation configurations. With LiteLLM, you ignore that completely. You build your tools array using standard OpenAI JSON syntax. You pass that array, along with your temperature value, directly into the completion call. LiteLLM grabs those OpenAI arguments, tears them down, and builds the correct Vertex AI API payload before sending the network request. This automatic mapping is incredibly efficient, but it introduces an edge case. What happens if you pass an OpenAI parameter that the target model simply does not support at all? Perhaps you specify a presence penalty, but the provider lacks a corresponding concept. By default, LiteLLM is strict. It will raise an exception and fail the completion call. This is a deliberate design choice. It prevents you from silently losing configuration that might be critical to your application's behavior. Strict failing is safe, but it can be frustrating if you are dynamically routing user prompts across dozens of different models in production. If you want the call to succeed regardless of minor parameter mismatches, you add a flag called drop params set to true inside your completion call. When you enable this, LiteLLM still translates everything it can map successfully. But if it encounters an OpenAI parameter that the target provider does not recognize, it simply strips that unsupported parameter from the payload and sends the rest of the request. The call succeeds, and the unsupported parameter is safely ignored. If you prefer to handle capabilities gracefully in your own code rather than relying on silent drops, there is a built-in introspection tool. You can call a helper function named get supported openai params. You provide the model name as the argument, and it returns a list of all the standard OpenAI parameters that successfully map to that specific model. This allows you to check what a model actually supports at runtime before you execute the completion call. Here is the key insight. The true value of parameter translation is that it promotes the OpenAI API specification from a vendor-specific schema into a universal interface language for your entire application stack. Thanks for listening, happy coding everyone!
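As a rough sketch of the translation behaviour discussed here, the snippet below checks what a model supports at runtime and then relies on drop_params to strip anything unmappable. The Vertex AI model name and the tool definition are placeholders, not a definitive recipe.

```python
# Sketch of parameter translation plus the drop_params escape hatch.
# The Vertex AI model name and tool schema are placeholders; adjust to your project.
from litellm import completion, get_supported_openai_params

# Introspect which standard OpenAI parameters map onto this model.
supported = get_supported_openai_params(model="gemini-1.5-pro", custom_llm_provider="vertex_ai")
print(supported)  # e.g. ["temperature", "max_tokens", "tools", ...]

response = completion(
    model="vertex_ai/gemini-1.5-pro",
    messages=[{"role": "user", "content": "What is the weather in Berlin?"}],
    temperature=0.2,
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
        },
    }],
    presence_penalty=0.5,  # may have no equivalent on the target provider
    drop_params=True,      # strip unsupported parameters instead of raising
)
print(response.choices[0].message.content)
```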
3

Unified Exception Handling

3m 26s

Write your try-except blocks once. Discover how LiteLLM catches custom errors from AWS, Google, and Azure, mapping them perfectly back to standard OpenAI exceptions like RateLimitError.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 3 of 24. Writing custom error handling logic for fifteen different AI APIs is a great way to waste your weekend. Every API has its own unique way of telling you that you are sending too many requests or that your token is invalid. To stop you from writing endless conditionals, LiteLLM uses Unified Exception Handling. Instead of learning how Anthropic, Cohere, or Google surface their errors, you only need to handle one standard. LiteLLM maps all underlying provider exceptions directly to standard OpenAI exception types. If a request fails, the gateway intercepts the raw error, translates it based on the HTTP status code and the provider response, and raises the corresponding OpenAI error. Consider a scenario where you use a Hugging Face inference endpoint. During peak hours, the endpoint gets overwhelmed with traffic. Hugging Face will reject the request. If you were calling their API directly, you would need to parse their specific HTTP five zero three service unavailable error or a custom throttling message. With unified exception mapping, you do not need a specific Hugging Face exception handler. You structure your code using a standard try block around your generation call. Right below that, you add an except block designed to catch an OpenAI Rate Limit Error. Inside that block, you trigger your standard exponential backoff function. Your application pauses, waits, and retries the request. If you later swap out Hugging Face for Vertex AI, your backoff logic remains exactly the same. The OpenAI Rate Limit Error catches the Vertex throttling event just as reliably. This mapping covers the entire standard set of failures. You can catch an OpenAI API Timeout Error when a remote server hangs. You can catch an Authentication Error when an API key rotates or expires. You can handle an API Connection Error for network drops or a Bad Request Error if your payload is malformed. Your application code treats the whole AI ecosystem as if it is communicating solely with a single, predictable provider. Here is the key insight. Sometimes, a clean abstraction hides too much context. A generic OpenAI bad request error tells you the call failed, but you often need to know exactly why. Azure, for instance, applies strict content policy filters. If Azure rejects a prompt because it violates safety guidelines, mapping that to a generic OpenAI error risks dropping the specific safety flag you need to log or show to the user. To solve this, LiteLLM attaches an attribute called provider specific fields to the exception object. When you catch an OpenAI exception in your code, you can inspect this attribute. It contains a dictionary holding the original, unmapped error data straight from the provider. You get the unified try and except flow for routing and retries, but you keep the granular data for debugging and auditing. You build your resilient systems, your retries, and your circuit breakers entirely around the OpenAI exception classes. Standardizing your error boundaries prevents provider-specific failure modes from bleeding into your core application logic. Thanks for listening, happy coding everyone!
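Here is one way the retry boundary described above might look in practice. The exception classes are the ones litellm exposes; the provider_specific_fields attribute is read defensively with getattr because its exact contents vary by provider.

```python
# Sketch: a single try/except boundary shared by every provider.
import time
import litellm
from litellm import completion

def ask(model: str, prompt: str, max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        try:
            resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
            return resp.choices[0].message.content
        except litellm.RateLimitError:
            time.sleep(2 ** attempt)       # same exponential backoff for every provider
        except litellm.AuthenticationError:
            raise                           # retrying a bad key never helps
        except litellm.BadRequestError as err:
            # Keep the raw provider payload (e.g. an Azure content-filter flag) for auditing.
            print("provider details:", getattr(err, "provider_specific_fields", None))
            raise
    raise RuntimeError("exhausted retries")

# Swapping the model string does not change the error handling at all.
print(ask("huggingface/HuggingFaceH4/zephyr-7b-beta", "Say hello."))
```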
4

The LLM Gateway: Setting up the Proxy Server

3m 50s

Take LiteLLM out of your local code and into a centralized platform. Learn how to launch the LiteLLM Proxy Server via Docker and configure your first endpoints using the config.yaml file.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 4 of 24. What if your entire company could share one LLM endpoint without ever exposing the raw provider API keys? Moving from a local SDK integration to a centralized architecture changes how you manage access. This is The LLM Gateway: Setting up the Proxy Server. Many developers assume that introducing a custom middleware proxy requires writing custom client code to communicate with it. That is false. The LiteLLM proxy speaks the native OpenAI API format. Any library, application, or script that knows how to talk to OpenAI only needs its base URL updated to point to your new proxy. To start the proxy server, you need a configuration file. This is a YAML file that defines your routing logic. The core of this file is a section called the model list. This list maps the model names your internal applications will request to the actual backend provider models. Inside the model list, each entry requires two main pieces. First is the model name. This is the alias you expose to your users. You might name it internal-chat-model. Second is a block called LiteLLM parameters. This is where you configure the actual destination. This distinction is important. The model name is what your client asks for, but the LiteLLM parameters block defines what actually processes the request. If you are routing to an Azure OpenAI deployment, your LiteLLM parameters block will contain the specific Azure model string, your Azure API base URL, and the API version. You also tell LiteLLM which environment variable holds the API key. You do not hardcode the key in the YAML file itself. With the configuration file ready, you deploy the proxy using Docker. You use the official LiteLLM image from the GitHub Container Registry. When you run the Docker container, you execute three specific steps. You map your local port four thousand to the container port four thousand. You mount your YAML configuration file into the container so the proxy can read your model list. Finally, you pass your actual provider credentials into the container as environment variables. Once the container is running, the proxy is live and listening on localhost port four thousand. You test it exactly like you would test the official OpenAI API. You make a standard HTTP POST request using a tool like cURL. You target localhost port four thousand, followed by the path slash chat slash completions. In the request body, you specify the alias you defined earlier as the model, and you provide your messages array. The proxy receives this standard payload. It reads the requested alias, looks it up in your configuration file, and extracts the Azure parameters. It then signs the request with your Azure API key, forwards it to Microsoft, and translates the response back into the exact format your client expects. Here is the key insight. The proxy completely abstracts the backend provider from the application layer. If you decide to swap your Azure deployment for a completely different provider next month, you do not touch a single line of your application code. You only update the LiteLLM parameters in your YAML configuration file and restart the container. The primary value of the proxy server is that your client applications never know, and never need to know, which cloud provider is actually generating the tokens. Thanks for listening, happy coding everyone!
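Once the container is up, a quick way to exercise it is the standard OpenAI client pointed at the proxy, as sketched below. It assumes the proxy listens on localhost port 4000 and that internal-chat-model is the alias you defined in config.yaml.

```python
# Sketch: testing the proxy with the unmodified OpenAI client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",  # the proxy, instead of api.openai.com
    api_key="sk-anything",             # later replaced by a real virtual key
)

resp = client.chat.completions.create(
    model="internal-chat-model",       # the alias from config.yaml, not the Azure deployment name
    messages=[{"role": "user", "content": "Ping"}],
)
print(resp.choices[0].message.content)
```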
5

Centralized Secret Management

3m 41s

Keep your API keys out of plaintext configs. Learn how to connect LiteLLM to enterprise secret managers like AWS Secrets Manager or Azure Key Vault to dynamically fetch credentials.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 5 of 24. If your company's master API key is sitting in a plaintext YAML file on a production server, your security team is going to have a bad day. Hardcoded credentials are an incident waiting to happen. The solution is Centralized Secret Management. When you run a gateway, it needs access to your upstream provider API keys to route traffic successfully. The default approach is often dropping these into a local environment variable or directly into the main configuration file. Centralized Secret Management changes this pattern entirely. It allows the gateway to dynamically read credentials from an external enterprise vault, such as AWS Secrets Manager, Azure Key Vault, Google Secret Manager, or HashiCorp Vault. To enable this connection, you configure two specific fields inside the general settings block of your configuration file. The first field is the key management system. Here, you define the identifier of your vault provider. If you are using AWS, you would set this to aws_kms. The second field is the key management settings. This is a nested structure where you provide the exact connection parameters your specific vault requires, such as the target AWS region or other required authentication details. Once the vault connection is established, you have to tell your individual model routing configurations to pull their keys from that vault. You do this using a specific string prefix in place of the actual key. Here is the key insight. Instead of typing an actual key string, you type the phrase os dot environ forward slash, followed immediately by the exact name of your secret as it exists in the vault. Let us trace a concrete scenario. You want to route traffic to an Azure model, and the real API key is securely stored in AWS Secrets Manager under the name azure api key production. In your configuration file, you set up the model routing block. But for the api key field, you set the value to os dot environ forward slash azure api key production. When the gateway processes a request for that model, it sees the prefix. It knows not to use that string as a literal key. Instead, it makes a secure call out to AWS Secrets Manager, requests the value for that exact secret name, and retrieves the real key to authenticate the request. The plain text key never touches your disk. Your configuration file remains entirely clean and safe to commit to version control. That covers pulling upstream credentials into the gateway. But the secret manager integration goes in both directions. It can also write data. When you use the gateway to generate new virtual proxy keys for your internal engineering teams, you can configure the system to automatically save those newly generated keys directly into your secret manager. This ensures that the internal credentials you issue are stored and managed with the exact same security controls as the provider credentials you consume. Your configuration files define the structure of your routing infrastructure, but they should never hold its secrets. The safest place to store an API key in your gateway configuration is nowhere at all. Thanks for listening, happy coding everyone!
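To make the moving parts concrete, here is an illustrative config.yaml written out from Python so the examples stay in one language. The key_management_system identifier and the settings block are assumptions to be checked against the docs for your vault; the os.environ/ prefix on the api_key line is the mechanism the episode describes.

```python
# Illustrative only: sketch of a config.yaml that resolves keys from a vault.
config_yaml = """
general_settings:
  key_management_system: "aws_secret_manager"   # pick the identifier matching your vault
  key_management_settings: {}                   # vault-specific connection details go here

model_list:
  - model_name: internal-azure-chat
    litellm_params:
      model: azure/gpt-4o
      api_base: https://my-resource.openai.azure.com/
      # The os.environ/ prefix tells the gateway to resolve this name in the
      # configured secret manager (or the environment) instead of using it literally.
      api_key: os.environ/azure-api-key-production
"""

with open("config.yaml", "w", encoding="utf-8") as f:
    f.write(config_yaml)
```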
6

Model Aliases: The Shadow Upgrade

3m 43s

Migrate users to new models silently. Discover how to use Model Aliases in LiteLLM to map requests for one model to a completely different endpoint without altering any client-side code.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 6 of 24. What if you could migrate your entire user base from OpenAI to Anthropic without asking them to update a single line of their integration code? Or quietly switch a free-tier user to an open-source alternative without breaking their existing application logic? The mechanism that makes this silent transition possible is called Model Aliases. It is easy to confuse aliases with routing, so let us separate the two. Routing handles load balancing across multiple deployments of the exact same model. If you have three separate Azure OpenAI instances running the same version of GPT four, routing distributes the incoming traffic evenly among them to prevent rate limits. Aliases do something completely different. An alias intercepts a requested model name and maps it to an entirely different model behind the proxy. You apply this mapping at the moment you create a virtual key for a client. When you send your request to the LiteLLM key generation endpoint, you include an aliases object in your payload. This object is simply a dictionary matching the model name the client will ask for with the model name you actually intend to serve. Let us look at a concrete scenario. You have a segment of free-tier users. They originally built their tools around GPT four, and that exact string is currently hardcoded into their network requests. Serving GPT four to non-paying users is expensive, so you decide to redirect their prompts to an internal, highly optimized Mistral seven B endpoint. To accomplish this, you generate a new virtual key specifically for this user group. In the generation payload, you define an alias that maps the string GPT four directly to your Mistral seven B deployment name. You give this new virtual key to the users. They do not modify their application code. They continue sending standard chat completion requests to your proxy, explicitly asking for GPT four. Here is the key insight. The LiteLLM proxy receives the incoming request and authenticates the virtual key. It reads the configuration bound to that specific key and spots your alias rule. Before it routes the payload to an external provider, the proxy rewrites the model parameter in memory. It removes GPT four and substitutes Mistral seven B. The request goes to your internal Mistral deployment, generates the text, and routes the response back through the proxy to the client. The client application receives the standard response format it expects. Its parsers work perfectly, and the application continues functioning normally. The developers maintaining that client application are completely unaware that the underlying language model was swapped. Because aliases are attached directly to individual virtual keys rather than the global server configuration, you maintain absolute control over different user segments. One key can alias traffic to a cheaper model for free users, while another key allows premium traffic to pass through unmodified. You can also use this exact same logic to handle model deprecations. When a provider retires an older model, you simply alias the old name to the new version, saving all your clients from having to push emergency code updates. The most powerful aspect of a model gateway is not just managing network traffic, but decoupling the expectations of the client from the physical reality of your backend architecture. Thanks for listening, happy coding everyone!
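A hedged sketch of the key-generation call described here is below, using plain HTTP against the proxy. It assumes the proxy runs on localhost port 4000, the master key is in LITELLM_MASTER_KEY, and mistral-7b is a model_name already defined in config.yaml.

```python
# Sketch: mint a virtual key whose alias silently remaps gpt-4 to a cheaper backend.
import os
import requests

resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"},
    json={
        "aliases": {"gpt-4": "mistral-7b"},  # client asks for gpt-4, proxy serves mistral-7b
        "duration": "30d",
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["key"])  # hand this sk-... string to the free-tier clients
```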
7

Load Balancing for High Throughput

3m 18s

Avoid rate limits and downtime by routing traffic intelligently. Explore why simple-shuffle is the recommended strategy for production load balancing across multiple deployments.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 7 of 24. You finally got production access to Azure OpenAI, launched your app, and immediately hit a massive rate limit error on your first big traffic spike. You have the quota, but it is bottlenecked in a single region. Load Balancing for High Throughput solves this. To handle large volumes of requests without hitting limits, you need to distribute your traffic across multiple identical deployments. In LiteLLM, you do this by configuring a router. You group multiple backend endpoints under a single logical model name. Say you have an application backed by five Azure deployments spread across five different geographic regions. When your app asks for a completion, the router decides which of those five regions gets the request. You control this decision logic by setting a routing strategy in your router settings. The intuitive approach is usage-based routing, where the system sends the request to whichever deployment currently has the lowest traffic. But doing that requires tracking exact token usage in real-time. Tracking real-time state means making a network call to a cache like Redis for every single request before the prompt even goes out. That adds a permanent latency tax to your application. For high throughput production environments, you want to avoid that extra hop. The recommended approach is a strategy called simple-shuffle. Simple-shuffle does not track live state. It randomly selects an endpoint from your pool based on predefined weights. When you add your five Azure deployments to your configuration file, you assign a limit to each one, usually Requests Per Minute, or RPM. You can also use Tokens Per Minute, or TPM. If your primary region has an RPM limit of ten thousand, and a secondary region has an RPM limit of five thousand, simple-shuffle reads those numbers and treats them as weights. It will automatically route twice as much traffic to the primary region. Under the hood, the router takes the list of available deployments for that model, factors in their RPM weights, and shuffles them into a randomized list for that specific request. It tries the first deployment on the list. Because the randomization strictly respects the RPM limits you configured, your traffic distributes perfectly across all five regions over time, dodging the rate limits without the overhead of real-time monitoring. To set this up, you open your configuration file and set the routing strategy parameter to simple-shuffle. Then, in your model list, you define your five Azure endpoints. You give them all the exact same model name. Finally, you attach the RPM parameter to each endpoint definition with your desired weight. When your application calls the router using that model name, the router handles the math and the randomization automatically. You get the benefits of a distributed architecture without needing a Redis dependency. The best way to handle massive scale is often to trade perfect real-time tracking for stateless, statistically predictable randomization. Thanks for listening, happy coding everyone!
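The same weighted setup can be sketched with the Python Router, which mirrors the proxy's router settings. Two regions stand in for the five in the episode, and the deployment names, endpoints, and keys are placeholders.

```python
# Sketch: weighted simple-shuffle across two deployments of the same model.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4",  # one logical name shared by every deployment
            "litellm_params": {
                "model": "azure/gpt-4-primary",
                "api_base": "https://primary.openai.azure.com/",
                "api_key": "os.environ/AZURE_KEY_PRIMARY",
                "rpm": 10000,  # weight: roughly twice the traffic of the secondary
            },
        },
        {
            "model_name": "gpt-4",
            "litellm_params": {
                "model": "azure/gpt-4-secondary",
                "api_base": "https://secondary.openai.azure.com/",
                "api_key": "os.environ/AZURE_KEY_SECONDARY",
                "rpm": 5000,
            },
        },
    ],
    routing_strategy="simple-shuffle",  # stateless, weight-respecting randomization
)

resp = router.completion(model="gpt-4", messages=[{"role": "user", "content": "Hi"}])
print(resp.choices[0].message.content)
```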
8

API Outages and Fallbacks

3m 27s

Never suffer AI downtime again. Learn how to configure model fallbacks in LiteLLM so that if your primary provider fails, traffic automatically reroutes to a backup provider.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 8 of 24. When your primary language model provider suffers an outage, does your application go down with it, or does it silently pivot to a backup? Today we cover API Outages and Fallbacks, the exact mechanism that keeps your system online when external endpoints fail. First, it helps to distinguish this from a related concept. Fallbacks are not load balancing. Load balancing proactively distributes your traffic across multiple active deployments to prevent overwhelming any single endpoint. Fallbacks, on the other hand, are strictly reactive rescues. They sit idle during normal operation and only engage after a definitive failure has occurred. The primary goal is maintaining high uptime, ideally hitting that ninety nine point nine percent mark, even when your upstream providers do not. External APIs will inevitably throw 500 internal server errors or 429 rate limit exceptions. When that happens, a fallback configuration tells the proxy to intercept the error mid-flight and route the request to an alternative provider, completely shielding your application code from the disruption. You set this up in your proxy configuration YAML file using a fallbacks array. Consider a concrete scenario. Your primary deployment is gpt-4. You want to ensure that if gpt-4 goes offline, the proxy will automatically try claude-3-opus instead. In your configuration file, you define your gpt-4 model setup as usual. Inside that specific model definition, you add a fallbacks key. The value of this key is simply an array of strings, where each string is the name of another model defined in your configuration. You add claude-3-opus to this array. When your application sends a prompt to the proxy requesting gpt-4, LiteLLM routes it to the primary endpoint. If that endpoint returns an error, the proxy logic catches it. If you have retries configured, it might try the primary endpoint a few more times. But once the primary definitively fails, the proxy triggers the fallback sequence. This is where it gets interesting. The proxy takes the exact original prompt and parameters. Because LiteLLM normalizes the API format, it seamlessly translates the OpenAI formatted request into the Anthropic format required by Claude. It immediately sends the translated request to the backup endpoint. Your application code does not need to handle any error logic, rewrite the prompt, or manage API keys for the second provider. The proxy handles the entire pivot internally. You are not limited to a single backup. The fallbacks array accepts a list of models. If your primary gpt-4 fails, and then your first fallback claude-3-opus also fails, the proxy moves to the next name in the array. It steps through the list sequentially. If it reaches the end of the array and every single backup model fails, only then does the proxy return an error back to your client application. Here is the key insight. By layering diverse fallback models in your configuration, you effectively isolate your application code from external instability, transforming hard provider outages into nothing more than invisible latency spikes for your end users. Thanks for listening, happy coding everyone!
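Sketched below is the reactive chain this episode describes, expressed with the Python Router; the fallbacks key in the proxy's config.yaml takes the same shape. Both model names must exist in the model_list, and the credentials are placeholders.

```python
# Sketch: retry the primary, then pivot to a different provider on hard failure.
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "gpt-4",
         "litellm_params": {"model": "gpt-4", "api_key": "os.environ/OPENAI_API_KEY"}},
        {"model_name": "claude-3-opus",
         "litellm_params": {"model": "anthropic/claude-3-opus-20240229",
                            "api_key": "os.environ/ANTHROPIC_API_KEY"}},
    ],
    num_retries=2,                             # retry the primary before giving up on it
    fallbacks=[{"gpt-4": ["claude-3-opus"]}],  # then step through this list in order
)

resp = router.completion(model="gpt-4", messages=[{"role": "user", "content": "Hello"}])
print(resp.choices[0].message.content)
```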
9

Context Window Fallbacks

3m 18s

Stop overpaying for massive context windows on short prompts. Learn how to use pre-call checks and context window fallbacks to route oversized documents to specialized models.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 9 of 24. Using an expensive 128k context model for every single user query is a massive waste of money. But if you default to a cheaper, smaller model instead, a user uploading a massive PDF will immediately break your application. The mechanism that resolves this tension is called Context Window Fallbacks. You might assume fallback routing only happens after a provider API returns a failure code. That is true for general reliability fallbacks, but context limits operate differently. LiteLLM can handle prompt sizes proactively using a setting called enable pre-call checks. When you switch this boolean to true in your router configuration, the proxy stops the request and calculates the exact token count of the prompt before the provider ever sees it. Here is the key insight. You configure a primary model, such as a highly cost-effective model, and in that same configuration block, you define a context window fallbacks list pointing to a much larger model. When a new request arrives, the pre-call check executes. If the calculated token count fits within the primary model limit, the request proceeds normally. If the prompt is oversized, LiteLLM drops the primary route completely. It instantly forwards the request to the larger fallback model. This means you never receive a context length error from the upstream API. It also means you avoid the latency penalty of waiting for a provider to reject the initial payload. The application making the request has no idea the routing changed behind the scenes. LiteLLM relies on its internal model registry to know the exact context limits for different providers. However, you might want to enforce stricter limits. Perhaps you want to trigger the fallback earlier to leave more room for generated output tokens, or perhaps you are routing to a custom deployment with a non-standard memory allocation. You handle this by overriding the max input tokens parameter directly in your model configuration. Specifying this value forces the proxy to use your custom ceiling when evaluating the pre-call check. Think about standard application traffic. A user asks a simple text question. The proxy counts fifty tokens, validates that it fits, and routes it to GPT-3.5. A few minutes later, that same user uploads a massive PDF containing eighty thousand tokens. The proxy calculates the new size, sees it exceeds the GPT-3.5 limit, and automatically redirects it strictly to GPT-4-128k. Your application logic remains completely static. You only pay for the premium model when the payload actually requires it. Moving token validation out of your application code and into the proxy layer transforms your fallback strategy from a passive safety net into an active cost-optimization engine. Thanks for listening, happy coding everyone!
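A rough sketch of the proactive routing described here, again using the Python Router: pre-call checks count the prompt before anything leaves, and oversized requests jump straight to the larger deployment. Model names are placeholders, and the max input tokens override mentioned in the episode is left out for brevity.

```python
# Sketch: size-aware routing with pre-call checks and a context window fallback.
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "gpt-3.5-turbo",
         "litellm_params": {"model": "gpt-3.5-turbo", "api_key": "os.environ/OPENAI_API_KEY"}},
        {"model_name": "gpt-4-128k",
         "litellm_params": {"model": "gpt-4-turbo", "api_key": "os.environ/OPENAI_API_KEY"}},
    ],
    enable_pre_call_checks=True,  # count prompt tokens before choosing a deployment
    context_window_fallbacks=[{"gpt-3.5-turbo": ["gpt-4-128k"]}],
)

# Short prompts stay on the cheap model; an 80k-token document is rerouted automatically.
resp = router.completion(model="gpt-3.5-turbo",
                         messages=[{"role": "user", "content": "A short question."}])
print(resp.choices[0].message.content)
```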
10

Taming Hanging Requests with Timeouts

3m 34s

Don't let slow APIs freeze your application. Discover how to configure global timeouts and stream timeouts in LiteLLM to abort stalled requests and trigger speedy fallbacks.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 10 of 24. There is nothing worse for user experience than a chatbot that displays a loading spinner for forty-five seconds simply because an upstream API silently hung. You need a reliable way to sever dead connections instantly so your system can recover. We solve this by Taming Hanging Requests with Timeouts. When you route traffic to external language models, network delays and provider outages are inevitable. If a provider stops responding, the default behavior of many HTTP clients is to keep the connection open for a long time. LiteLLM intercepts this problem using two distinct timeout mechanisms. The first is the standard timeout parameter. This setting dictates the maximum total time LiteLLM will wait for an entire request to complete, from the moment it is sent to the final generated character. If you set a standard timeout of thirty seconds, and the model takes thirty-one seconds to write a long response, LiteLLM will abort the request. This works fine for background tasks or short, non-streaming generations. However, applying a total timeout to a streaming application creates a structural problem. A thorough response might legitimately take sixty seconds to stream back to the user. If you set a short total timeout to catch stalled requests, you will accidentally kill perfectly healthy, long-running generations. This is where it gets interesting. LiteLLM provides a second parameter called stream timeout. This setting specifically measures the time to the first token. It controls exactly how long the gateway will wait to receive the initial chunk of data from the provider. Once that first piece of data arrives, the stream timeout clock stops, and the connection remains open for the rest of the generation. Consider a concrete scenario. You are routing traffic to a primary Azure OpenAI endpoint. Inside your configuration file, you define your model block and add the stream timeout parameter, setting it to two seconds. A user submits a complex prompt. LiteLLM forwards this request to Azure. Normally, the upstream server processes the prompt and returns the first chunk in a fraction of a second. But in this instance, the specific Azure node hangs. The gateway starts counting. One second passes. Two seconds pass. The first chunk does not arrive. Because you defined a two-second stream timeout, LiteLLM refuses to wait for the standard HTTP timeout to trigger. It forcibly aborts the connection right at the two-second mark. Aborting the dead request is only half the architectural benefit. By forcing a quick failure, LiteLLM immediately activates your fallback logic. The moment the Azure request times out, the gateway reroutes the exact same user prompt to the next available, healthy node in your deployment list. The user experiences a barely noticeable two-second delay before the text starts streaming, entirely avoiding a frozen interface. You have the flexibility to enforce these rules at different levels. You can apply a global timeout across all routed requests, or you can fine-tune them per model. A heavy, logical reasoning model might require a lenient stream timeout of five or ten seconds, while a fast classification model should fail over after just one second. The responsiveness of your application is governed not just by how fast your primary provider succeeds, but by how aggressively you force a stalled connection to fail. Thanks for listening, happy coding everyone!
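Here is a sketch of the two-tier timeout policy from this episode, set per deployment on the Python Router; the same keys can live in a config.yaml litellm_params block. The endpoints, keys, and the two-second value are illustrative.

```python
# Sketch: lenient total timeout, aggressive time-to-first-token, quick failover.
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "chat",
         "litellm_params": {
             "model": "azure/gpt-4o-primary",
             "api_base": "https://primary.openai.azure.com/",
             "api_key": "os.environ/AZURE_KEY_PRIMARY",
             "timeout": 60,        # total budget for the whole generation
             "stream_timeout": 2,  # abort if the first chunk takes longer than two seconds
         }},
        {"model_name": "chat",
         "litellm_params": {
             "model": "azure/gpt-4o-backup",
             "api_base": "https://backup.openai.azure.com/",
             "api_key": "os.environ/AZURE_KEY_BACKUP",
         }},
    ],
    num_retries=1,
)

stream = router.completion(model="chat",
                           messages=[{"role": "user", "content": "Write a long answer."}],
                           stream=True)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```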
11

Virtual Keys for FinOps

3m 49s

Lock down your API usage with precision. Learn how to generate virtual keys using LiteLLM, setting strict RPM, TPM, and budget limits to protect your organization from runaway AI costs.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 11 of 24. A rogue script written by a junior developer can easily rack up a ten thousand dollar OpenAI bill overnight. The underlying provider platform keeps accepting requests as fast as the loop can fire them. Virtual Keys for FinOps are how you stop this before it happens. There is a common confusion between master keys and virtual keys. You must never give your master key to a developer. The master key is your administrative credential. Its primary purpose is to authenticate your platform to create virtual keys. Virtual keys are the restricted credentials you actually hand out to developers or applications to use in their code. You create a new credential by making an HTTP POST request to the key generate endpoint on your LiteLLM proxy. You authorize this request using your master key as the bearer token. The body of this request is where you define the financial guardrails. Consider a scenario where you are provisioning access for a summer intern. You want to ensure they have enough access to build a prototype, but you need an absolute guarantee they cannot burn through your infrastructure budget. You pass two specific parameters in your JSON payload to enforce this. First, you define the financial cap by setting the max budget parameter. If you set this value to ten, you allocate exactly ten US dollars to this specific key. This is a hard lifetime limit. Once the total cost of all prompts and completions tied to this key reaches ten dollars, the key is automatically disabled and will reject all further requests. Second, you control the velocity of those requests by setting the RPM limit parameter. RPM stands for requests per minute. If you set the RPM limit to one, the proxy strictly enforces a one request per sixty-second window. If an accidental infinite loop in the intern's code tries to fire a hundred requests instantly, the proxy processes the first one and immediately rejects the remaining ninety-nine with a standard rate limit error. When you send this payload to the generate endpoint, LiteLLM processes the rules and returns a response containing the newly generated virtual key. This key looks identical to a standard provider credential, typically starting with an sk prefix. You give this string to the intern. Here is the key insight. The developer uses this virtual key exactly as they would an OpenAI or Anthropic key, pointing their standard client library at your LiteLLM proxy URL instead of the public internet. When a request arrives, the proxy intercepts it. It queries its internal database to verify the virtual key exists. It then checks if the key has exceeded its ten dollar budget or its one request per minute velocity limit. If the request clears both checks, the proxy swaps the virtual key for your actual corporate API key and forwards the payload to the provider. When the provider responds, the proxy calculates the exact cost of the prompt and completion tokens based on that specific model's published pricing. It deducts that fraction of a cent from the virtual key's ten dollar budget, records the transaction, and sends the response back to the developer. The developer is completely unaware of the underlying credential swap or the internal accounting. 
By enforcing limits at the proxy layer, virtual keys intercept unauthorized or runaway requests before they ever reach the billing provider, mathematically guaranteeing that a compromised or poorly written script can never exceed its assigned budget. Thanks for listening, happy coding everyone!
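The intern scenario above might be provisioned roughly as follows. The sketch assumes the proxy is on localhost port 4000 and the master key sits in LITELLM_MASTER_KEY; the field names follow the /key/generate endpoint but are worth verifying against your version.

```python
# Sketch: a virtual key with a hard 10 USD lifetime budget and a 1-request-per-minute cap.
import os
import requests

resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"},
    json={
        "max_budget": 10,                 # lifetime cap in USD; the key is disabled afterwards
        "rpm_limit": 1,                   # at most one request per sixty-second window
        "models": ["gpt-4o-mini"],        # optionally restrict which aliases the key may call
        "metadata": {"owner": "summer-intern"},
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["key"])  # the sk-... credential handed to the intern
```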
12

Spend Tracking and Custom Tags

3m 31s

Attribute every cent of LLM spend accurately. Learn how to pass metadata tags in your requests and generate comprehensive spend reports using LiteLLM.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 12 of 24. Your CFO just walked in and asked exactly which new product feature is burning through the monthly AI budget. You pull up your provider dashboard, but all you see is a single massive aggregate bill for the entire company. You have no answer. Spend Tracking and Custom Tags in LiteLLM resolve this blind spot. When you route requests through LiteLLM, the proxy automatically calculates the cost of every completion based on the specific model and token count. A total dollar amount is useless if you cannot attribute it to a specific source. The simplest way to group this spend is by using the standard user parameter in your chat completion call. You pass a unique string representing your end customer. LiteLLM intercepts this, calculates the cost, and logs it against that specific user ID. Tracking by user solves one problem, but often a single user triggers multiple backend processes. Say you have a document classification workload, and the billing department needs a report grouping spend by specific background jobs. A user ID does not help here. This is where custom tags come in. LiteLLM allows you to attach an array of strings to any request, and it will track spend against those exact strings. Here is the key insight. Standard libraries like the official OpenAI SDK or LangChain do not natively know about LiteLLM metadata. If you try to pass an unrecognized parameter called metadata, the SDK will strip it out or throw an error before the request ever reaches the proxy. To bypass this, you use a parameter called extra body. This is a standard escape hatch built into modern SDKs specifically for injecting custom fields. Let us walk through doing this with a LangChain request. You configure your standard chat model object. When you call the invoke method, you pass your prompt as usual. Alongside the prompt, you pass a parameter named extra body. You set this to a dictionary. Inside that dictionary, you create a key called metadata. Inside metadata, you add a key called tags, pointing to an array of strings. You might pass a string like job ID four zero two. LangChain packages this extra body exactly as is and sends it over the wire. LiteLLM receives the payload, extracts your tags from the metadata block, and attaches the precise cost of that LLM call to job ID four zero two. That covers inputs. What about outputs? Once your traffic is flowing with these tags, you need to pull the data. You do this by querying the global spend report endpoint on your LiteLLM proxy. You make a standard HTTP GET request to this endpoint. The proxy returns a JSON payload detailing exactly where the money went. It groups your total spend by API key, by user, and crucially, by every custom tag you provided. You can hand this directly to your billing department. They can instantly see that the document classification job cost exactly four dollars and twenty cents, regardless of which underlying model handled the actual routing. Tagging your traffic at the proxy level means your billing granularity is no longer dictated by how your cloud provider structures their invoices; it is entirely defined by the context of your own application. Thanks for listening, happy coding everyone!
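For readers who want the wire-level view, here is a sketch of tagging a request through the standard OpenAI SDK via extra_body and then querying the spend report. The tag strings, date range, and group_by value are illustrative; check the /global/spend/report parameters for your proxy version.

```python
# Sketch: attach custom spend tags to one request, then pull the grouped report.
import os
import requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key=os.environ["LITELLM_VIRTUAL_KEY"])

client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Classify this document ..."}],
    extra_body={"metadata": {"tags": ["job-id-402", "doc-classification"]}},
)

# Later, ask the proxy where the money went.
report = requests.get(
    "http://localhost:4000/global/spend/report",
    headers={"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"},
    params={"start_date": "2026-01-01", "end_date": "2026-01-31", "group_by": "team"},
    timeout=10,
)
print(report.json())
```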
13

Caching for Speed and Savings

3m 41s

Stop paying for the same LLM responses over and over. Learn how to configure exact caching with Redis and semantic caching with Qdrant to slash latency and API costs.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 13 of 24. Why pay an external API to generate the exact same basic explanation a thousand times? You burn credits and force users to wait for computations that have already been done. Resolving this inefficiency is the core of Caching for Speed and Savings. LiteLLM handles caching entirely at the proxy level. When a request comes in, the proxy checks if it has seen this exact prompt before. If it has, it returns the stored response immediately. Latency drops from seconds to milliseconds, and the API cost drops to zero. To set this up, you use your config dot yaml file. You enable caching in your global settings block by setting cache to true, and specify your cache type. Redis is the standard backend for exact match caching. You provide your Redis host, port, and password, and the proxy takes care of storing the input-output pairs. But exact caching is brittle. It looks for a perfect string match. If one user asks, write a poem about LiteLLM, and another user asks, create a LiteLLM poem, an exact cache sees two completely different requests. It forwards the second request to the heavy language model, wasting time and money on a redundant task just because the phrasing changed slightly. This is where semantic caching comes in. Instead of comparing raw text strings, semantic caching compares the underlying meaning of the prompts. LiteLLM supports Qdrant, a vector database, to handle this. When you configure semantic caching, you must specify an embedding model alongside your main generation model. When a request arrives, the proxy passes the prompt to the embedding model first. This model converts the text into a vector, which is a mathematical representation of the prompt's meaning. The proxy then queries Qdrant to see if a similar vector already exists in the cache. Because write a poem and create a poem share the same semantic intent, their vectors map closely together in space. Qdrant detects this similarity. If the match is close enough, the proxy pulls the cached response from the first user and delivers it to the second user. You skip the heavy text generation step entirely, paying only a fraction of a cent for the fast embedding lookup. Configuring this requires a few more lines in your config dot yaml. You change the cache type to qdrant semantic. You define Qdrant-specific parameters, like the Qdrant endpoint URL and your API key. Most importantly, you define a similarity threshold. This is a decimal value between zero and one. A high threshold, like point nine nine, demands near-identical phrasing. A lower threshold, like point eight, catches broader variations but increases the risk of returning an outdated or slightly off-topic answer if two prompts sound similar but have different intents. Here is the key insight. Semantic caching is not just a storage mechanism, it is an active filter for user intent. Tuning your similarity threshold is the only thing standing between a massive cost reduction and returning irrelevant answers to your users. Thanks for listening, happy coding everyone!
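To make the two cache modes concrete, here is an illustrative config.yaml written from Python so the examples stay in one language. The parameter names follow the episode, redis for exact matches and qdrant-semantic for meaning-based matches, but should be confirmed against the docs for your version; the embedding model referenced must exist in your model_list.

```python
# Illustrative only: cache blocks for exact (Redis) and semantic (Qdrant) caching.
exact_cache_yaml = """
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: os.environ/REDIS_PORT
    password: os.environ/REDIS_PASSWORD
"""

semantic_cache_yaml = """
litellm_settings:
  cache: true
  cache_params:
    type: qdrant-semantic
    qdrant_collection_name: litellm-semantic-cache
    similarity_threshold: 0.8   # lower catches broader rephrasings, higher stays strict
    qdrant_semantic_cache_embedding_model: text-embedding-ada-002   # must be in model_list
    # Qdrant endpoint and API key are assumed to come from environment variables.
"""

# Pick whichever mode fits your workload and write it into the proxy's config file.
with open("config.yaml", "w", encoding="utf-8") as f:
    f.write(semantic_cache_yaml)
```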
14

RBAC: Empowering Team Admins

3m 18s

Distribute platform management safely. Understand LiteLLM's Role-Based Access Control, delegating power to Org Admins and Team Admins without compromising global security.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 14 of 24. You are a platform engineer, and half your day is spent answering messages asking for new API keys or budget bumps. Every developer onboarding turns into an IT ticket that blocks their work and wastes your time. The solution is delegating control without losing oversight, using Role-Based Access Control. LiteLLM supports four specific user roles to manage this delegation. These are the proxy admin, the org admin, the team admin, and the internal user. The proxy admin sits at the very top. If you deploy LiteLLM, you are the proxy admin. You configure the models, set up the database, and establish the global rules. But you should not be managing day-to-day key requests. That is where the hierarchy comes in. You can group your company into organizations, which represent massive departments, and teams, which are specific working groups within those departments. The org admin can manage teams within their specific organization. But the real operational power lies with the team admin. Here is the key insight. You can hand off all local administrative tasks to a department lead by making them a team admin. As the proxy admin, you set up the initial structure once. You create a team, apply a hard budget limit of five hundred dollars a month, and assign the lead developer as the team admin. After that, you step out of the loop entirely. The team admin now has the autonomy to manage their own engineers. They can log into the UI or use the API to add new users to their team. They can generate new API keys for those developers and monitor the aggregate spend of their specific group. Crucially, any key created by the team admin or their engineers is automatically bound to that five hundred dollar team budget. The team admin has full local control, but they cannot spend a single cent beyond the limit the proxy admin enforced. Below the team admin is the internal user. This is the role assigned to the standard developers writing the code. An internal user has restricted access. They can view their own token spend and, if the team admin allows it, generate their own personal API keys. Their view of the system is strictly limited to themselves. They cannot see the wider team budget, they cannot view keys belonging to their colleagues, and they certainly cannot modify team settings. To set this up programmatically, the proxy admin makes a single API request to the team creation endpoint. You pass the team name, the max budget parameter, and an array of user IDs tagged with the team admin role. The system returns a team ID. From then on, the department lead uses that team ID to route their own management requests, entirely bypassing the platform engineering team. Role-Based Access Control in LiteLLM is not just about hiding buttons in a user interface, it is about physically constraining token spend at the group level while pushing key management down to the people actually leading the projects. Thanks for listening, happy coding everyone!
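The one-time setup the proxy admin performs might look like the sketch below. The endpoint and field names follow the proxy's team management API, and the user IDs, alias, and budget values are placeholders.

```python
# Sketch: create a budget-capped team and delegate it to a team admin.
import os
import requests

resp = requests.post(
    "http://localhost:4000/team/new",
    headers={"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"},
    json={
        "team_alias": "search-platform",
        "max_budget": 500,            # hard ceiling shared by every key in the team
        "budget_duration": "30d",     # the budget resets on this window
        "members_with_roles": [
            {"role": "admin", "user_id": "lead-dev@example.com"},   # the team admin
            {"role": "user",  "user_id": "dev-1@example.com"},      # an internal user
        ],
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["team_id"])  # the department lead routes their own requests with this ID
```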
15

Security Guardrails

4m 02s

Add an invisible security layer to your LLM requests. Learn how to configure pre-call and post-call guardrails in LiteLLM to block prompt injections and mask PII before it reaches external providers.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 15 of 24. Relying on a language model to behave itself is not a security strategy. If a user accidentally pastes sensitive customer data into a prompt, asking the model to politely ignore it is already too late. You need a bouncer at the door, and that is exactly what Security Guardrails provide. Security Guardrails in LiteLLM act as an invisible safety layer between your application clients and the language model providers. They intercept API traffic in two distinct phases. The first phase is called pre call. This executes after LiteLLM receives the incoming request from your application, but just before it forwards that payload to the external provider. The second phase is called post call, which triggers after the model generates its response, but before LiteLLM sends that response back to the original client. You define this routing logic entirely within your config dot yaml file. Under the proxy settings block, you define your guardrails by specifying an endpoint or a supported integration, and then assigning it a mode of either pre call or post call. Let us look at a concrete scenario using a pre call guardrail. Suppose an employee asks a cloud hosted model to summarize a support ticket, but that ticket contains US Social Security Numbers. You absolutely do not want those numbers leaving your internal network. You can configure Microsoft Presidio as your pre call guardrail. When the application sends the prompt, LiteLLM intercepts the request and hands the text to Presidio. Presidio scans the text, locates the Social Security Number, and replaces it with a generic mask. LiteLLM then takes this sanitized prompt and sends it over the internet to the cloud provider. The external model generates a summary based on the masked text, and your application code operates as if nothing out of the ordinary happened. Here is the key insight. You do not have to apply these rules globally across all your traffic. LiteLLM lets you attach guardrails at the specific model level. This is crucial when you operate a hybrid architecture. You can configure your routing so that any prompt sent to a public cloud model passes through the strict PII masking guardrail. However, if you route that exact same prompt to an open source model running on your own on premises hardware, you simply leave the guardrail out of that model's configuration block. The local model processes the raw, unmasked data because the information never crosses your network boundary. You avoid unnecessary processing overhead and preserve the exact context. The post call mode operates using the exact same flow, just on the return trip. When the external model replies, LiteLLM passes the output through your post call guardrails. This allows you to evaluate the text for toxic language, hallucinated internal URLs, or unauthorized competitor mentions before the user ever sees it. If the post call guardrail flags the content, LiteLLM intercepts the return journey. It blocks the text and returns a safety error to the client instead of delivering the harmful output. By handling this directly at the proxy level, your application architecture remains entirely unchanged. Your developers just send standard completion requests, and the proxy enforces your compliance rules. The most reliable security layers are the ones that your application code never has to think about. Thanks for listening, happy coding everyone!
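As a rough illustration of the routing logic described here, the snippet below sketches a config.yaml guardrails section with a Presidio pre-call check attached; it stays in Python only to keep all the examples in one language. The field names follow LiteLLM's guardrails configuration but may vary between versions, so treat this as a sketch rather than a definitive recipe.

```python
# Illustrative only: a pre-call PII-masking guardrail section for config.yaml.
guardrails_yaml = """
guardrails:
  - guardrail_name: mask-pii
    litellm_params:
      guardrail: presidio     # Microsoft Presidio masks PII in the prompt
      mode: pre_call          # runs after the proxy receives the request, before forwarding
  # A second entry with mode: post_call would screen the model's response
  # (toxicity, leaked URLs, competitor mentions) before the client ever sees it.
"""

print(guardrails_yaml)  # merge this section into your existing proxy config.yaml
```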
16

Dynamic Callback Management

3m 36s

Give microservices the power of privacy. Learn how to use the x-litellm-disable-callbacks header to let sensitive API requests opt out of central observability logging.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 16 of 24. You want to log all LLM traffic for debugging, but what happens when a microservice submits highly sensitive compliance data that absolutely cannot be saved? You might assume you have to disable observability globally, making your entire system blind just to protect a few specific API routes. Dynamic Callback Management resolves this tension by letting you control logging on a per-request basis. In a standard LiteLLM setup, callbacks handle sending your request and response payloads to external observability platforms. You configure platforms like Langfuse or Datadog at the proxy level, and by default, they capture everything passing through. The common misconception is that this logging is an all-or-nothing system. People often think that to handle sensitive data, they must deploy a completely separate proxy with logging turned off. Dynamic Callback Management removes that extra work. Consider a microservice handling medical records. The application needs to process patient symptoms through a language model, but sending that sensitive patient data to a third-party logging stack violates compliance. To prevent this leak, the microservice simply adds a specific HTTP header to its outgoing request. This header is called x-litellm-disable-callbacks. You set its value to a comma-separated list of the specific platforms you want to bypass. For the medical records service, the microservice passes the header with the value langfuse comma datadog. When the LiteLLM proxy receives this request, it evaluates the header before triggering the language model. The prompt is sent to the provider, and the response is routed back to the client as usual. The intervention happens during the telemetry phase. The proxy reads the disable header and actively blocks the payload from being forwarded to the specified observability endpoints for that single transaction. Meanwhile, all other applications hitting the same proxy concurrently continue logging their traffic without any interruption. This is where it gets interesting. Giving clients the power to disable their own audit logs introduces a potential security risk. In highly regulated environments, developers should not always have the authority to hide their traffic. If your infrastructure requires a strict, unalterable audit trail for every single prompt, you must enforce it centrally. You handle this using compliance locking. Inside the proxy configuration file, within the general settings block, you set a parameter named allow dynamic callback disabling to false. This single setting establishes a strict global policy that overrides any client-side instructions. If a microservice attempts to pass the disable callbacks header while this lock is active, the proxy does not silently ignore the header. Instead, it rejects the transaction entirely and returns an HTTP 403 Forbidden error. This mechanism guarantees that traffic either complies with the mandatory global logging policy, or it is dropped before reaching the language model. The true utility of Dynamic Callback Management is that it shifts data privacy from a rigid infrastructure deployment into an agile, request-level parameter, while still giving platform engineers the final say on compliance. Thanks for listening, happy coding everyone!
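As a concrete illustration, here is a minimal Python sketch of the per-request opt-out, assuming a proxy at http://localhost:4000 and a hypothetical virtual key; the header name is the one quoted in the episode:

```python
import openai

client = openai.OpenAI(
    api_key="sk-my-virtual-key",          # hypothetical virtual key issued by the proxy
    base_url="http://localhost:4000",     # the LiteLLM proxy, not the provider
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize these patient symptoms ..."}],
    # Skip these logging callbacks for this single request only.
    extra_headers={"x-litellm-disable-callbacks": "langfuse,datadog"},
)
```

If the proxy's general settings lock dynamic callback disabling, this same request would come back with a 403 instead of a completion.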
17

Drop-in Observability

3m 20s

Get instant visibility into your LLM traffic. Learn how to pipe telemetry, traces, and exceptions to tools like Langfuse and Sentry using simple success and failure callbacks.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 17 of 24. When a user complains the AI gave them a bizarre answer three days ago, how do you pull up the exact prompt that caused it? Parsing application logs for multi-line JSON payloads is a nightmare. You need structured trace data, but building custom integrations for every monitoring tool drains engineering hours. This exact problem is handled by Drop-in Observability. Instead of writing custom wrapper functions to time your API calls, count tokens, and catch timeouts, LiteLLM intercepts the traffic natively. It exposes two core hooks inside your settings: the success callback and the failure callback. These callbacks act as automatic routing mechanisms for your telemetry. They accept an array of strings representing supported external observability providers. You simply name the tool you want to use, and LiteLLM translates its internal request data into the exact format that specific tool expects. Take the success callback. This triggers the moment a language model returns a valid response. When this happens, LiteLLM automatically captures a snapshot of the transaction. This includes the exact input prompt sent to the model, the generated output, the time it took to generate, and the precise token usage. To send this data to an external tool, you open your LiteLLM settings and set the success callback variable to a list containing the string "langfuse". As long as your Langfuse authentication keys are present in your environment variables, the system handles the rest. Here is the key insight. The logging process happens asynchronously in the background. Your main application thread never blocks while waiting for the observability provider to acknowledge the trace. Your users experience zero added latency. That handles the happy path. For the errors, you use the failure callback. This triggers when an API call times out, hits a provider rate limit, or fails completely. Without proper tracing, an LLM error often surfaces as an opaque status code. By setting your failure callback variable to a list containing the string "sentry", you map LLM exceptions directly to your existing error tracking workflow. When a request fails, LiteLLM packages the exception type, the model name, and the attempted input, then pushes that context directly into Sentry. To set this up in your code, you do not need to modify your actual API calls. You only touch the global configuration. You assign your chosen tools to the callback arrays once during application startup. From that point forward, every completion call you make is monitored. Your core logic remains entirely decoupled from your logging infrastructure. If you decide to swap Langfuse for another provider next month, you change one string in an array. The true power of drop-in callbacks is not just avoiding boilerplate code. It is standardizing the shape of your telemetry across dozens of different LLM providers so your monitoring platform sees exactly one consistent format, regardless of which underlying model answered the prompt. Thanks for listening, happy coding everyone!
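A minimal SDK-side sketch, assuming the Langfuse keys and Sentry DSN are already exported as environment variables:

```python
import litellm

# Route telemetry once at startup; every later completion call is logged
# asynchronously in the background.
litellm.success_callback = ["langfuse"]   # prompts, outputs, latency, token usage
litellm.failure_callback = ["sentry"]     # timeouts, rate limits, provider exceptions

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello there"}],
)
```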
18

Prometheus Metrics and Pod Health

3m 27s

Take the pulse of your proxy. Discover how to expose the /metrics endpoint to Prometheus, track in-flight requests, and use custom tags to slice data in Grafana.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 18 of 24. If an LLM request takes ten seconds, is the AI provider slow, or is your proxy's event loop completely choked? You cannot fix a bottleneck until you know exactly where it lives. Prometheus Metrics and Pod Health in LiteLLM give you that exact visibility. LiteLLM exposes a standard metrics endpoint at the path slash metrics. A common mistake when monitoring AI gateways is treating all response delays as LLM provider latency. That is inaccurate. There are two distinct waiting periods. First, there is the pre-ASGI queue latency. This is the time a request spends waiting inside your infrastructure before the proxy even begins processing it. Second, there is the actual LLM latency, which is the time spent waiting for OpenAI, Anthropic, or another provider to return tokens. The metrics endpoint separates these numbers so you know exactly who to blame for a slow response. To monitor your pod health, you rely on a specific gauge called litellm in flight requests. This metric tracks the exact number of concurrent requests actively being processed by a pod at any given millisecond. This is a real-time measure of queue depth. When traffic spikes, this number climbs. Consider a concrete scenario. Your monitoring dashboard shows a massive spike in total request duration. Users complain about slow answers. If you only look at total time, you might assume OpenAI is having an outage. But when you check the metrics endpoint, the provider latency is stable at two seconds. Here is the key insight. You look at litellm in flight requests and see it has skyrocketed from twenty to two hundred. This proves the delay is not OpenAI. Your pod queue is completely overloaded. Armed with this exact gauge, you can configure your infrastructure to trigger an auto-scale event the moment in-flight requests cross a specific threshold, spinning up new proxy pods before the event loop chokes. You also need to know who is generating this traffic. LiteLLM supports custom prometheus tags. When a request hits the gateway, you can pass custom metadata in the request payload, like a project ID, a department, or an application name. LiteLLM extracts these custom tags and attaches them as labels to the Prometheus metrics. Instead of just seeing that the gateway processed ten thousand tokens, you see that the marketing application processed seven thousand tokens and the analytics dashboard processed three thousand. This gives DevOps the ability to group token usage, queue depth, and latency by specific tenants. The most critical takeaway for gateway performance is this. Never auto-scale your AI proxy based on total response latency, because you will waste money scaling up when the external provider is simply being slow; scale based on your in-flight request gauge to react only when your own infrastructure queue is actually full. Thanks for listening, happy coding everyone!
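To make the auto-scaling idea concrete, here is a sketch of a Prometheus alerting rule keyed to the in-flight gauge the episode describes. It assumes the proxy's Prometheus callback is enabled and the /metrics endpoint is already being scraped; verify the metric name against your own /metrics output, and note that the threshold of 150 is purely illustrative:

```yaml
groups:
  - name: litellm-proxy
    rules:
      - alert: LiteLLMProxyQueueSaturated
        expr: litellm_in_flight_requests > 150   # queue depth, not provider latency
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "LiteLLM pod queue is deep; scale out proxy pods"
```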
19

Universal Text-to-Speech

3m 47s

Standardize your voice generation. Discover how to call Text-to-Speech models from Gemini, Vertex, and AWS Polly using the exact same OpenAI-compatible audio endpoint format.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 19 of 24. You want to add voice generation to your app, but writing custom integration code for every single provider's audio format requires constant maintenance. Every time you switch from an OpenAI voice model to a Google voice model, the required JSON fields, the API paths, and the audio return types completely change. The mechanism that resolves this is Universal Text-to-Speech using the LiteLLM audio-speech bridge. Instead of maintaining separate API clients for OpenAI, Vertex AI, and Gemini, you standardize your application on a single format. LiteLLM exposes a unified endpoint that exactly mimics the standard OpenAI audio-speech route. You construct a request containing your text, your chosen model name, and a voice preference, sending it to LiteLLM as if you were talking directly to OpenAI. The gateway then translates this standardized payload into the specific format required by your target provider. Here is the key insight. Many AI providers do not offer a straightforward, dedicated text-to-speech API that streams audio files out of the box. Depending on the provider, accessing a voice model natively often requires routing the request through a generic text completion endpoint. You might have to pass highly specific configuration flags, send a complex system prompt, and then extract base64 encoded audio strings deeply nested within a JSON response. LiteLLM abstracts this entire translation layer. It handles the API negotiation, unpacks the proprietary response structure, and isolates the actual audio data. Consider a concrete scenario. You decide to generate spoken audio using Google's Gemini Flash TTS preview model. In your application code, you point your standard HTTP client to your LiteLLM proxy URL at the audio-speech path. You set the model parameter to point to the Gemini Flash TTS model. You assign your plain text to the input parameter, and you specify a valid voice identifier. When you execute the request, LiteLLM intercepts the payload. It securely authenticates with Google Cloud or Vertex AI depending on your setup. It repackages your plain text and voice selection into the specific JSON schema that the Google API demands. When the Google model processes the text and returns the result, LiteLLM intercepts the proprietary response. Rather than forcing your client application to parse a custom Google Cloud payload, LiteLLM extracts the raw audio bytes. It seamlessly bridges the transport layer, immediately streaming a standard MP3 file back to your client. Your frontend or backend application receives a standard audio stream, completely blind to the fact that the underlying generation was performed by Google instead of OpenAI. This bridge logic means you write your text-to-speech client integration exactly once. If a new, faster audio model is released by Vertex tomorrow, you only need to change the model string in your request. The application code handling the MP3 stream remains entirely untouched. Treating audio generation the exact same way you treat text generation allows you to standardize your application logic. By forcing all text-to-speech requests through a single unified interface, you can route, load-balance, and set up failovers for your audio generation across entirely different providers without writing a single line of provider-specific fallback code. Thanks for listening, happy coding everyone!
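A minimal client-side sketch, assuming the proxy is reachable at http://localhost:4000, a model alias such as gemini-tts is configured in config.yaml, and the voice identifier is one your target model actually accepts (all three names are illustrative):

```python
from openai import OpenAI

client = OpenAI(api_key="sk-my-virtual-key", base_url="http://localhost:4000")

speech = client.audio.speech.create(
    model="gemini-tts",   # alias routed by LiteLLM to the Gemini/Vertex TTS model
    voice="Kore",         # a voice identifier valid for the upstream model
    input="Your order has shipped and should arrive on Thursday.",
)

# The proxy hands back a standard audio stream, regardless of the provider.
speech.write_to_file("order_update.mp3")
```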
20

The Assistants API Bridge

3m 26s

Manage conversation state effortlessly across providers. Learn how LiteLLM wraps non-native models in the standard OpenAI Assistants API interface, letting you use Threads and Messages everywhere.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 20 of 24. OpenAI handles conversation state beautifully, keeping your client code entirely stateless. But locking your architecture into a single provider just to keep that state management is a heavy price to pay. The Assistants API Bridge in LiteLLM resolves this tension. The bridge is a feature of the LiteLLM proxy that exposes the exact endpoints of the OpenAI Assistants API. You hit standard paths like the v one assistants and v one threads endpoints, but you route the actual text generation to any model you want. Standard chat completion interfaces are stateless. Every time you ask a question, your client must send the entire conversation history back to the server. This consumes bandwidth and complicates your client-side code. The Assistants API solves this by keeping the conversation history on the server inside a Thread object. You just append new messages to the thread and tell the server to run the assistant. The problem is that most other providers, from local setups to enterprise cloud alternatives, do not offer this stateful interface natively. The LiteLLM bridge polyfills this missing functionality. To make it work, you configure the LiteLLM proxy to connect to a database. This database will act as the storage layer for your conversation state. Then, you point your existing OpenAI client to the LiteLLM proxy URL instead of the default OpenAI servers. The logic flow starts with creating an assistant. You send a request to the proxy defining the assistant instructions and specifying a target model. This could be an Azure OpenAI deployment or an OpenAI compatible Astra model. The proxy saves this configuration in its database. Next, you create a thread. A thread is simply an empty container stored by the proxy. When a user says something, you send a request to add a message to that specific thread. The proxy saves the message. Up to this point, the underlying large language model has not been contacted at all. Here is the key insight. The bridge only talks to your target model when you trigger a run. When you tell the proxy to run the assistant on a specific thread, LiteLLM retrieves the entire message history from its database. It formats that history into a standard, stateless chat completion payload. It then sends that flat payload to the underlying model you configured earlier. The model evaluates the conversation and returns a response to the proxy. LiteLLM takes that text, packages it as a new assistant message, saves it into the thread database, and updates the run status to completed. Your client polls the proxy, sees the completed status, and fetches the latest message exactly as it normally would. Your application code remains completely unchanged. It still thinks it is talking to a native stateful system. The proxy handles the translation transparently, taking a stateful client request, executing a stateless model call, and maintaining the persistence layer in between. This separation of concerns means your client architecture can rely on a modern, state-managed API interface, while your infrastructure remains entirely free to swap underlying models based on cost, privacy, or performance requirements. Thanks for listening, happy coding everyone!
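A minimal end-to-end sketch of the bridge from the client's point of view, assuming the proxy is database-backed and exposes the Assistants routes; the proxy URL, virtual key, and azure-gpt-4o alias are illustrative:

```python
import time
from openai import OpenAI

client = OpenAI(api_key="sk-my-virtual-key", base_url="http://localhost:4000")

# 1. Define the assistant; the proxy stores this configuration in its database.
assistant = client.beta.assistants.create(
    model="azure-gpt-4o",
    instructions="You are a concise support assistant.",
)

# 2. Create a thread and append the user's message; no model call happens yet.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="Where can I download my invoice?"
)

# 3. Trigger a run; the proxy flattens the stored history into a stateless
#    chat completion and sends it to the configured model.
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# 4. Poll until the run completes, then read the newest assistant message.
while run.status in ("queued", "in_progress"):
    time.sleep(0.5)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)
```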
21

The MCP Gateway

3m 43s

Supercharge your models with tools centrally. Discover how to configure HTTP, SSE, or STDIO Model Context Protocol (MCP) servers in LiteLLM, giving any LLM access to external capabilities.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 21 of 24. You have an open-source Llama model running locally, but you want it to trigger actions or read files using the exact same GitHub integrations your Claude Code agent uses. You do not want to rewrite your application logic to support a new toolset for every new model. The MCP Gateway solves this. First, let us separate MCP from A2A, or Agent-to-Agent routing. A common confusion is treating them as the same thing. A2A is when you route a user prompt to a specialized external agent to generate a text response. MCP is entirely different. MCP stands for Model Context Protocol, and it is about providing a standardized set of tools to a model, letting that model decide when and how to invoke them to complete a task. The MCP Gateway feature allows the LiteLLM proxy to act as a bridge between any language model and your MCP servers. Instead of writing code to register tools in every single client application you build, you define them centrally in the proxy. From that point on, any model hitting the proxy can utilize those tools. LiteLLM connects to MCP servers using two primary methods. The first is STDIO, or Standard Input and Output. This is used for local tools. You configure the proxy to execute a specific local command, like running a node script or a Python file, right on the host machine. The second method is HTTP with Server-Sent Events, or SSE. This is used to connect to remote MCP servers over a network. Let us look at a concrete scenario. You want to add a remote Zapier MCP server so your models can interact with external web apps. You do this entirely within the proxy configuration YAML file. Under the main configuration block, you add an MCP servers section. You name the integration, for instance, zapier-integration. You define the transport type as SSE. Then, you provide the endpoint URL of your Zapier MCP server. Because this is a remote connection, you will also specify the required authentication headers, like a bearer token, directly inside this YAML definition. Now the proxy knows how to talk to Zapier. The next step is execution. When your client application sends a standard chat completion request to LiteLLM, it just needs to include a specific header indicating which MCP tools it wants to load. Here is the key insight. The client application does not need to know what tools Zapier actually offers. The proxy intercepts the client request, reaches out to the Zapier MCP server over the SSE connection, and dynamically fetches the current list of available tools. The proxy then injects those tool definitions into the payload and forwards the whole package to the language model. If the language model decides to invoke a Zapier tool, it sends a tool call back to the proxy. The proxy catches it, executes the action against the Zapier MCP server, gets the result, and feeds it back to the model. Your client application is completely shielded from this back-and-forth negotiation. It just receives standard OpenAI-compatible tool responses. The true power of the MCP Gateway is decoupling tool implementation from model choice, meaning an integration built for one ecosystem works instantly across dozens of different models without writing custom adapter code. Thanks for listening, happy coding everyone!
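A rough config.yaml sketch of the Zapier scenario, following the shape described in the episode; the endpoint URL and bearer token are placeholders, and the exact key names may differ between LiteLLM releases:

```yaml
mcp_servers:
  zapier-integration:
    transport: "sse"                                # remote server over Server-Sent Events
    url: "https://example-zapier-mcp.invalid/sse"   # placeholder endpoint URL
    headers:
      Authorization: "Bearer <your-zapier-mcp-token>"
```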
22

A2A: Tracking Autonomous Agents

3m 24s

Bring autonomous agents under control. Learn how to invoke complex LangGraph or Bedrock agents through the proxy using the A2A protocol, enabling trace grouping and unified spend tracking.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 22 of 24. When your autonomous agent goes on a fifty step reasoning loop, how do you track the bill back to the original request? If the agent makes dozens of isolated calls to different language models, your billing logs become an unreadable mess of disconnected requests. The solution to this is the LiteLLM Agent Gateway, also known as the A2A Protocol. Before exploring the mechanics, we should clear up a common confusion. Listeners often mix this up with the Model Context Protocol, or MCP. MCP is used to provide external tools to a language model. A2A does the exact opposite. It treats an external, autonomous agent as if it were a standard language model, allowing you to invoke it and track it through a centralized gateway. To set this up, you define your agent in the LiteLLM configuration file exactly as you would define a standard language model. You give the model a name, set the base URL pointing to your agent API endpoint, and specify the provider as a custom OpenAI endpoint. Now, LiteLLM knows how to route incoming client requests directly to your agent. When a client sends a request asking for this agent, LiteLLM acts as a pass through for context headers. It takes the metadata tied to the request, like user identifiers, team routing tags, and budget limits, and packages them into specific HTTP headers. The most important of these is the X-LiteLLM-Trace-Id header. LiteLLM forwards these headers along with the prompt to your agent. Here is the key insight. Your agent receives this request, starts its autonomous loop, and begins making its own internal calls to process the task. If the agent makes these calls directly to a public provider, you lose the tracking context. Instead, the agent must route its internal calls back through LiteLLM. When it does this, it must include the trace ID it received in the original request. Consider a concrete scenario. You are invoking a local LangGraph agent. A client sends a prompt through LiteLLM to start the process. LiteLLM assigns a unique trace ID and forwards the payload to the LangGraph endpoint. Your LangGraph application reads the incoming HTTP headers and extracts the X-LiteLLM-Trace-Id. When LangGraph needs to evaluate a step or summarize data, it uses its internal client to send a completion call back to LiteLLM. Crucially, it attaches that exact same trace ID header to its outgoing request. Because every internal call carries the same trace identifier, LiteLLM groups them together automatically. When you look at your observability platform or budget logs, you do not see fifty random requests from unknown sources. You see one unified trace. You know exactly which user triggered the agent, how much the entire reasoning loop cost, and which specific autonomous steps consumed the most budget. Treating agents as standard model endpoints transforms complex, multi step agent workflows into trackable, billable units of compute. Thanks for listening, happy coding everyone!
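A sketch of the agent side of this handshake, assuming a LangGraph-style agent exposed with FastAPI and a proxy at http://localhost:4000; the endpoint path and model alias are illustrative, and the trace header name is the one quoted in the episode:

```python
import openai
from fastapi import FastAPI, Request

app = FastAPI()
llm = openai.OpenAI(api_key="sk-my-virtual-key", base_url="http://localhost:4000")

@app.post("/agent")
async def run_agent(request: Request):
    payload = await request.json()
    # Pull the trace id LiteLLM attached when it forwarded the original prompt.
    trace_id = request.headers.get("x-litellm-trace-id", "")

    # Every internal reasoning step is routed back through the proxy with the
    # same trace id, so the whole loop is grouped into one billable trace.
    step = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=payload["messages"],
        extra_headers={"x-litellm-trace-id": trace_id},
    )
    return {"content": step.choices[0].message.content}
```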
23

Zero-Downtime Key Rotations

3m 16s

Achieve zero-downtime security cutovers. Learn how to configure automatic scheduled key rotations and grace periods for enterprise-grade virtual keys in LiteLLM.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 23 of 24. Rotating production API keys usually means scheduled downtime, frantic coordination, and hoping no backend system gets left behind with a dead credential. A hard cutover is a strict security measure, but it is also a highly reliable way to break inflight requests. The solution to this tension is Zero-Downtime Key Rotations. In an enterprise proxy setup, keeping static virtual keys forever is an unacceptable security risk. LiteLLM handles this by allowing you to schedule automatic key rotations. Instead of manually generating new credentials and coordinating an exact moment to switch them out, you let the proxy handle the lifecycle. But simply automating the creation of a new key does not solve the underlying reliability problem. If the gateway immediately rejects the old key the second a new one is created, any service that has not yet synced the new credential will instantly fail. To fix this, LiteLLM uses a grace period. When you create or update a virtual key, you configure two specific parameters. First, you set the auto rotate interval, which defines exactly how often a new key should be born. Second, you define the grace period, which tells the system how long the old key should remain valid after the rotation occurs. Consider a standard microservice architecture where your security policy requires rotating LLM access keys every thirty days. You make a request to the LiteLLM key generation endpoint. In that request, you set the auto rotate parameter to 30 days. In the exact same request, you set the grace period parameter to 24 hours. The proxy stores this policy and starts the clock. At the thirty-day mark, the rotation triggers. LiteLLM automatically generates a brand-new virtual key. This is the part that matters. For the next 24 hours, you have two completely valid keys pointing to the exact same configuration, budget, and tracking logic. During this overlap window, your secrets manager fetches the new key and slowly injects it into your production environment. As containers cycle or configuration maps update, services independently switch to the new credential. If a specific background worker is still using the old key twelve hours into the grace period, LiteLLM accepts the request without complaint. The gateway routes the traffic to the large language model and logs the transaction normally. Once the precise 24-hour grace period elapses, LiteLLM automatically invalidates the original key. Any remaining system still attempting to use the old credential will receive an authentication error. The migration is complete. You have entirely decoupled the creation of the new secret from the destruction of the old one. Separating these two events transforms key rotation from a fragile, highly coordinated infrastructure panic into a quiet, reliable background routine. Thanks for listening, happy coding everyone!
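A sketch of what the key-generation request might look like, assuming a proxy at http://localhost:4000 and an admin master key; the rotation and grace-period field names below simply mirror the episode's wording and should be checked against the /key/generate schema of your LiteLLM version:

```python
import requests

resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": "Bearer sk-master-key"},   # proxy admin key
    json={
        "key_alias": "payments-service",
        "auto_rotate": True,            # illustrative field name, per the episode
        "rotation_interval": "30d",     # new key every thirty days
        "grace_period": "24h",          # old key stays valid for one more day
    },
    timeout=30,
)
print(resp.json()["key"])   # the virtual key to hand to the service
```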
24

The Admin UI and AI Hub

3m 38s

Make your AI platform accessible to everyone. Learn how to manage the Admin UI, tweak UI credentials, and use the AI Hub to let developers securely discover allowed models and agents.

Hi, this is Alex from DEV STORIES DOT EU. LiteLLM: The Universal LLM Gateway, episode 24 of 24. Your platform team built a highly secure, perfectly routed LLM gateway. But when internal developers actually need to build something, how do they know which models and agents they are allowed to use? Without a discovery mechanism, your gateway is an invisible black box. That visibility is provided by the Admin UI and the AI Hub. The Admin UI is a visual dashboard built directly into the LiteLLM proxy. Operating an LLM gateway purely through configuration files and database queries scales poorly when multiple teams start requesting access. The Admin UI gives platform operators a centralized place to manage the proxy. Through this interface, you can generate new API keys, track token spend across different teams, monitor live request logs, and configure model routing rules. When you first launch the LiteLLM proxy, this dashboard is enabled by default and secured with default login credentials. For local testing, this is convenient. For production, it is a vulnerability. You must change these default credentials immediately. You do this by setting specific environment variables for the administrative username and password before starting the container. Here is the key insight. You might not want a graphical dashboard exposed at all on your production gateway. Many platform teams provision infrastructure strictly through automated scripts and do not want an interactive control plane accessible over the network. If that fits your security model, you can turn the dashboard off entirely. By setting an environment variable called disable admin ui to true, you completely remove the interface. The proxy will continue routing traffic and enforcing rules, but the web server will not serve the administrative screens. That covers the platform operators, but the engineers consuming the API need a different perspective. That is where the AI Hub comes in. While the Admin UI is for control, the AI Hub is for discovery. It acts as an internal developer portal for your organization. Instead of messaging the platform team to ask which models are currently approved or where the documentation for an internal agent lives, developers visit the AI Hub. They authenticate, typically through your organization's single sign-on provider, and are presented with a catalog. They can see exactly which models they are authorized to call, review the rate limits applied to those models, and discover pre-configured agents built by other teams. More importantly, the AI Hub allows developers to self-serve. They can generate their own API keys tied to their specific team budgets without waiting for a platform engineer to manually provision one. This fundamentally shifts how your organization interacts with generative AI. It bridges the gap between the infrastructure engineers securing the gateway and the product engineers building the applications. The gateway is only as useful as it is accessible, and the AI Hub turns a locked-down proxy into a self-serve platform allowing your engineering teams to move fast without breaking your budget. This wraps up our series on LiteLLM. I encourage you to explore the official documentation, try these configurations hands-on, and visit dev stories dot eu to suggest topics for future series. Thanks for listening, happy coding everyone!
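A minimal deployment sketch (docker-compose style) showing the hardening steps from this episode; the image tag follows the public LiteLLM container image, and all values are placeholders:

```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    environment:
      LITELLM_MASTER_KEY: "sk-replace-me"     # never ship a default admin secret
      UI_USERNAME: "platform-admin"           # replaces the default dashboard login
      UI_PASSWORD: "pull-this-from-a-secret-manager"
      # DISABLE_ADMIN_UI: "True"              # uncomment to remove the dashboard entirely
```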