Why Demo Agents Fail in Production
Every AI agent demo follows the same pattern. A prompt fires. An LLM responds. Maybe it calls a tool. The tool returns data. The LLM synthesizes a final answer. The audience applauds. Then someone deploys it to production and everything breaks within 72 hours.
The gap between demo and production is not about model quality. It is about everything the demo ignores. What happens when the LLM API returns a 429? What happens when the tool call takes 45 seconds instead of 2? What happens when the response is valid JSON but contains a field type the downstream system does not expect? What happens when the same request fires twice because the user's browser retried on a timeout? What happens at 3 AM when the agent enters an infinite tool-calling loop and burns through your entire monthly token budget in 40 minutes?
Production agent systems must handle all of these cases. Not as afterthoughts. As first-class architectural concerns baked into the execution model from day one.
A demo agent answers the question. A production agent answers the question, survives the failure, logs the evidence, respects the budget, and does it all without waking anyone up at 3 AM.
Durable Execution: Queues, Not Synchronous Chains
The first architectural decision that separates production agents from demos is execution durability. Demo agents run as synchronous request/response cycles. The user sends a request, the server orchestrates LLM calls and tool invocations in a single thread, and the response comes back when the chain completes. If anything fails mid-chain, the entire operation fails. There is no recovery point. There is no partial progress. Start over from scratch.
Production agents use durable execution. Each step in the agent workflow is a discrete, persistent task in a queue. If step 3 of 7 fails, steps 1 and 2 are preserved. When the failure is resolved, execution resumes from step 3, not from step 1. This requires decomposing the agent workflow into independently executable steps with well-defined inputs and outputs.
interface AgentStep {
step_id: string;
workflow_id: string;
step_type: 'llm_call' | 'tool_invocation' | 'validation' | 'human_review';
status: 'pending' | 'running' | 'completed' | 'failed' | 'waiting_human';
input: Record<string, unknown>;
output?: Record<string, unknown>;
attempt_count: number;
max_attempts: number;
created_at: string;
started_at?: string;
completed_at?: string;
error?: { code: string; message: string; transient: boolean; };
}
interface AgentWorkflow {
workflow_id: string;
steps: AgentStep[];
current_step_index: number;
state: Record<string, unknown>; // accumulated context
status: 'running' | 'completed' | 'failed' | 'paused';
token_budget: { used: number; limit: number; };
timeout_at: string;
}
Tools like Temporal, Inngest, and Trigger.dev provide durable execution primitives out of the box. If you are building from scratch, the minimum viable implementation is a Postgres-backed task queue with row-level locking and a polling worker. The execution engine dequeues the next pending step, runs it, writes the output, and advances the workflow pointer. If the worker crashes mid-execution, the step remains in "running" status, hits a staleness timeout, and gets retried by another worker.
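A sketch of that claim-and-run query, assuming a `steps` table shaped like the AgentStep interface above and a generic Postgres client (`db.query` stands in for node-postgres or similar; the 5-minute staleness window is an illustrative choice):

```typescript
type Db = { query: (sql: string, params?: unknown[]) => Promise<{ rows: any[] }> };

async function claimNextStep(db: Db): Promise<any | null> {
  // FOR UPDATE SKIP LOCKED lets many workers poll concurrently without
  // blocking on each other's in-flight claims. The OR clause re-queues
  // steps whose worker crashed: still "running" but stale past the timeout.
  const { rows } = await db.query(`
    UPDATE steps
       SET status = 'running', started_at = now()
     WHERE step_id = (
           SELECT step_id FROM steps
            WHERE status = 'pending'
               OR (status = 'running' AND started_at < now() - interval '5 minutes')
            ORDER BY created_at
            FOR UPDATE SKIP LOCKED
            LIMIT 1)
    RETURNING *`);
  return rows[0] ?? null; // null: queue is empty, sleep and poll again
}
```

The single UPDATE-with-subquery keeps claim and status change atomic, so two workers can never run the same step at once.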
Idempotent Retries with Exponential Backoff
Every external call in an agent workflow will fail eventually. LLM APIs return rate-limit errors, tool endpoints go down, databases hit connection limits. The retry strategy determines whether the system recovers gracefully or compounds the failure.
Idempotency First
Before implementing retries, ensure every step is idempotent. Running the same step twice with the same input must produce the same result without side effects. This means: use idempotency keys on all external API calls, check for existing results before creating new records, and design tool invocations so that duplicate calls are harmless.
async function executeWithRetry<T>(
fn: () => Promise<T>,
options: {
maxAttempts: number;
baseDelayMs: number;
maxDelayMs: number;
idempotencyKey: string;
isTransient: (error: unknown) => boolean;
}
): Promise<T> {
const { maxAttempts, baseDelayMs, maxDelayMs, isTransient } = options;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
return await fn();
} catch (error) {
if (attempt === maxAttempts || !isTransient(error)) {
throw new PermanentFailureError(error, attempt, options.idempotencyKey);
}
const delay = Math.min(
baseDelayMs * Math.pow(2, attempt - 1) + Math.random() * 1000,
maxDelayMs
);
await sleep(delay);
}
}
throw new Error('Unreachable');
}
The jitter (random addition to the delay) is not optional. Without jitter, when an API recovers from an outage, all your queued retries fire simultaneously at the same intervals, creating a thundering herd that triggers another outage. The random offset spreads the retry attempts across time.
Error Classification
Not all errors deserve retries. The retry logic must classify errors into two categories: transient failures that may resolve on their own (rate limits, timeouts, temporary network errors) and permanent failures that will never succeed no matter how many times you retry (invalid API keys, malformed requests, resource not found). Retrying permanent failures wastes tokens, burns budget, and delays the error report that would actually help someone fix the problem.
function classifyError(error: unknown): 'transient' | 'permanent' {
if (error instanceof HttpError) {
// 429 Too Many Requests, 502/503/504 server errors: transient
if ([429, 502, 503, 504].includes(error.status)) return 'transient';
    // 400/401/403/404/422 client errors: permanent
if ([400, 401, 403, 404, 422].includes(error.status)) return 'permanent';
}
if (error instanceof TimeoutError) return 'transient';
if (error instanceof ConnectionError) return 'transient';
if (error instanceof ValidationError) return 'permanent';
// Unknown errors default to transient with limited retries
return 'transient';
}
Structured Outputs: Zod Schemas for Agent Responses
LLMs produce strings. Production systems consume typed data structures. The gap between these two realities is where a staggering number of agent failures occur. The LLM returns JSON that is almost right: a field is missing, a number is a string, an enum value has a different casing, a nested object is flat. The downstream code crashes, or worse, silently processes corrupt data.
Structured outputs solve this by constraining the LLM's response format at generation time and validating the result against a schema before it enters the system.
import { z } from 'zod';
const AgentAnalysisSchema = z.object({
summary: z.string().min(10).max(500),
confidence: z.number().min(0).max(1),
findings: z.array(z.object({
category: z.enum(['risk', 'opportunity', 'neutral']),
description: z.string(),
evidence: z.array(z.string()).min(1),
severity: z.number().int().min(1).max(5),
})).min(1),
recommended_action: z.enum([
'escalate', 'monitor', 'dismiss', 'investigate'
]),
reasoning: z.string(),
});
type AgentAnalysis = z.infer<typeof AgentAnalysisSchema>;
Models that support structured outputs can enforce this schema at generation time: OpenAI accepts a JSON Schema (derived from the Zod schema) via the response_format parameter, and Anthropic can enforce one as a tool's input schema. For models without native structured outputs, parse the response and validate it against the Zod schema yourself. When validation fails, feed the validation errors back to the model as a retry prompt. Most models self-correct on the second attempt when given explicit error messages.
async function getStructuredResponse<T>(
prompt: string,
schema: z.ZodType<T>,
maxParseAttempts: number = 3
): Promise<T> {
let lastError: string | null = null;
for (let attempt = 1; attempt <= maxParseAttempts; attempt++) {
const fullPrompt = lastError
? `${prompt}\n\nYour previous response failed validation: ${lastError}\nPlease fix the issues and try again.`
: prompt;
const raw = await llm.complete(fullPrompt);
const parsed = safeJsonParse(raw);
if (!parsed) {
lastError = 'Response was not valid JSON';
continue;
}
const result = schema.safeParse(parsed);
if (result.success) {
return result.data;
}
lastError = result.error.issues
.map(i => `${i.path.join('.')}: ${i.message}`)
.join('; ');
}
throw new StructuredOutputError(
`Failed to get valid structured output after ${maxParseAttempts} attempts`,
lastError
);
}
Define your agent's output schema before writing the prompt. The schema is the contract between your AI layer and your application layer. Treat it with the same rigor as an API contract. Version it. Test it. Never let unvalidated LLM output cross a system boundary.
Cost Controls: Token Budgets and Circuit Breakers
Agent workflows that involve tool use and multi-step reasoning can consume tokens at unpredictable rates. A single user query that triggers a research agent with web search might use 500 tokens. Or it might use 50,000 tokens if the agent decides it needs to search twelve times and synthesize all results. Without guardrails, one runaway workflow can exhaust your daily budget.
Token Budgets per Workflow
Every workflow instance gets a token budget at creation time. Each LLM call deducts from the budget. When the budget is exhausted, the workflow enters a degraded mode: it must produce a response with what it has gathered so far rather than making additional calls.
class TokenBudget {
private used: number = 0;
constructor(
private readonly limit: number,
private readonly warningThreshold: number = 0.8
) {}
  // Called with the actual token count after each LLM response. Throwing
  // signals the orchestrator to stop making calls and degrade gracefully,
  // rather than silently overspending.
  consume(tokens: number): void {
    this.used += tokens;
    if (this.used >= this.limit) {
      throw new BudgetExhaustedError(this.used, this.limit);
    }
  }
get remaining(): number { return this.limit - this.used; }
get isWarning(): boolean { return this.used / this.limit >= this.warningThreshold; }
get percentage(): number { return (this.used / this.limit) * 100; }
}
Circuit Breakers
A circuit breaker monitors the failure rate of an external dependency and short-circuits calls when the failure rate exceeds a threshold. If your LLM provider starts returning errors on 50% of requests, the circuit breaker opens and immediately fails all subsequent requests for a cooldown period instead of waiting for each one to time out individually. This protects your system from cascading timeouts and reduces unnecessary retry load on a struggling provider.
class CircuitBreaker {
private failures: number = 0;
private successes: number = 0;
private lastFailure: number = 0;
private state: 'closed' | 'open' | 'half_open' = 'closed';
constructor(
private readonly failureThreshold: number = 5,
private readonly cooldownMs: number = 30000,
private readonly successThreshold: number = 3
) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'open') {
if (Date.now() - this.lastFailure > this.cooldownMs) {
this.state = 'half_open';
} else {
throw new CircuitOpenError('Circuit breaker is open');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess(): void {
if (this.state === 'half_open') {
this.successes++;
if (this.successes >= this.successThreshold) {
this.state = 'closed';
this.failures = 0;
this.successes = 0;
}
}
this.failures = 0;
}
  private onFailure(): void {
    this.failures++;
    this.lastFailure = Date.now();
    this.successes = 0;
    // A single failure while probing in half_open reopens the circuit
    // immediately; in closed state, the failure threshold applies.
    if (this.state === 'half_open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
    }
  }
}
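A breaker should be scoped per external dependency, so an LLM-provider outage does not open the circuit for an unrelated search tool. A small generic registry makes that scoping explicit (`perDependency` is an illustrative helper, not a library API):

```typescript
// Lazily create and cache one instance per named dependency. Generic over
// the factory so it works with the CircuitBreaker class above or any
// other per-dependency resource.
function perDependency<T>(create: () => T): (dependency: string) => T {
  const instances = new Map<string, T>();
  return (dependency: string): T => {
    let instance = instances.get(dependency);
    if (instance === undefined) {
      instance = create();
      instances.set(dependency, instance);
    }
    return instance;
  };
}

// Hypothetical usage with the class above:
//   const breakerFor = perDependency(() => new CircuitBreaker());
//   await breakerFor('llm-provider').execute(() => llm.complete(prompt));
```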
Observability: Structured Logging and Trace IDs
When an agent workflow fails at 2 AM, the on-call engineer needs to reconstruct what happened. Unstructured log lines like "Error processing request" are useless. Structured logs with trace IDs, step identifiers, token counts, and timing data turn debugging from archaeology into engineering.
interface AgentLogEntry {
timestamp: string;
trace_id: string; // correlates all logs for one workflow
step_id: string; // which step within the workflow
level: 'debug' | 'info' | 'warn' | 'error';
event: string; // machine-readable event name
data: {
model?: string;
tokens_used?: number;
latency_ms?: number;
tool_name?: string;
error_code?: string;
budget_remaining?: number;
attempt_number?: number;
[key: string]: unknown;
};
}
// Usage in the execution engine
logger.info({
trace_id: workflow.workflow_id,
step_id: step.step_id,
event: 'llm_call_complete',
data: {
model: 'gpt-4o',
tokens_used: 1847,
latency_ms: 2340,
budget_remaining: workflow.token_budget.remaining,
attempt_number: 1
}
});
Every log entry carries the trace_id. When a workflow fails, query your log aggregator with the trace ID and you get the complete timeline: every LLM call, every tool invocation, every retry, every token consumed, every decision the agent made. This is not optional instrumentation. This is the minimum viable observability for running agents in production.
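The simplest way to guarantee the trace_id is always present is to bind it once per workflow in a child logger, rather than trusting every call site to pass it. A minimal sketch (`createWorkflowLogger` is illustrative, not a specific library's API):

```typescript
// Emits one JSON object per line, the format most log aggregators ingest
// natively. The trace_id is captured in the closure, so no call site can
// forget it.
type LogFields = Record<string, unknown>;

function createWorkflowLogger(
  traceId: string,
  sink: (line: string) => void = console.log
) {
  const emit = (level: string, event: string, data: LogFields = {}) =>
    sink(JSON.stringify({
      timestamp: new Date().toISOString(),
      trace_id: traceId,
      level,
      event,
      data,
    }));
  return {
    info: (event: string, data?: LogFields) => emit('info', event, data),
    warn: (event: string, data?: LogFields) => emit('warn', event, data),
    error: (event: string, data?: LogFields) => emit('error', event, data),
  };
}
```

Production loggers like pino support this pattern as child loggers with bound fields; the point is that binding happens once, at workflow start.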
Human-in-the-Loop Checkpoints
Not every agent decision should be autonomous. High-stakes actions, irreversible operations, and decisions above a cost threshold should pause the workflow and wait for human approval. The durable execution model makes this straightforward: set the step status to waiting_human, send a notification, and resume when the human responds.
async function executeStep(step: AgentStep, workflow: AgentWorkflow) {
if (step.step_type === 'human_review') {
await notifications.send({
channel: 'slack',
message: formatReviewRequest(step, workflow),
actions: ['approve', 'reject', 'modify']
});
step.status = 'waiting_human';
await persistStep(step);
return; // workflow pauses here
}
// ... execute other step types
}
// When human responds (via webhook, API, or UI)
async function handleHumanDecision(
workflowId: string,
stepId: string,
decision: 'approve' | 'reject' | 'modify',
modifications?: Record<string, unknown>
) {
const workflow = await loadWorkflow(workflowId);
  const step = workflow.steps.find(s => s.step_id === stepId);
  if (!step) throw new Error(`Unknown step ${stepId} in workflow ${workflowId}`);
  if (decision === 'reject') {
    workflow.status = 'failed';
    step.status = 'failed';
    step.error = { code: 'HUMAN_REJECTED', message: 'Rejected by reviewer', transient: false };
} else if (decision === 'modify') {
step.input = { ...step.input, ...modifications };
step.status = 'pending'; // re-run with modified input
} else {
step.status = 'completed';
workflow.current_step_index++;
}
await persistWorkflow(workflow);
await resumeExecution(workflow);
}
Design the checkpoint criteria before building the agent. Common triggers: any action that modifies production data, any external API call that costs more than a defined threshold, any response whose confidence score falls below 0.7, and any workflow that has consumed more than 80% of its token budget. These criteria should be configurable per deployment environment. Development environments might skip all checkpoints. Production environments enforce them strictly.
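Those criteria can live in per-environment configuration rather than code. A sketch with illustrative field names, using the thresholds described above:

```typescript
// Checkpoint criteria as data, selected by deployment environment.
interface CheckpointConfig {
  requireApprovalForProductionWrites: boolean;
  costThresholdUsd: number;      // external calls above this pause for review
  minConfidence: number;         // responses scored below this pause for review
  budgetUsedFraction: number;    // pause once this fraction of tokens is spent
}

const checkpointConfigs: Record<string, CheckpointConfig> = {
  development: {
    requireApprovalForProductionWrites: false,
    costThresholdUsd: Infinity,
    minConfidence: 0,
    budgetUsedFraction: 1,
  },
  production: {
    requireApprovalForProductionWrites: true,
    costThresholdUsd: 5,
    minConfidence: 0.7,
    budgetUsedFraction: 0.8,
  },
};

function needsHumanReview(
  config: CheckpointConfig,
  action: {
    writesProductionData: boolean;
    estimatedCostUsd: number;
    confidence: number;
    budgetUsedFraction: number;
  }
): boolean {
  return (
    (config.requireApprovalForProductionWrites && action.writesProductionData) ||
    action.estimatedCostUsd > config.costThresholdUsd ||
    action.confidence < config.minConfidence ||
    action.budgetUsedFraction > config.budgetUsedFraction
  );
}
```

The same action can then sail through a development deployment and pause for review in production, with no branching in the agent code itself.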
State Machines for Multi-Step Workflows
Complex agent workflows have states, transitions, and guards. A research agent might move through states like gathering, analyzing, synthesizing, reviewing, and complete. Each state has valid transitions: gathering can move to analyzing or back to gathering (if more data is needed), but it cannot jump directly to complete.
Modeling this as a state machine makes the workflow's behavior explicit, testable, and debuggable. You can visualize the current state, enumerate all possible transitions, and enforce invariants at each transition boundary.
const agentStateMachine = {
initial: 'intake',
states: {
intake: {
on: {
QUERY_PARSED: 'gathering',
INVALID_QUERY: 'failed'
}
},
gathering: {
on: {
SOURCES_FOUND: 'analyzing',
NO_SOURCES: 'insufficient_data',
BUDGET_WARNING: 'synthesizing' // skip to synthesis if budget is low
}
},
analyzing: {
on: {
ANALYSIS_COMPLETE: 'synthesizing',
NEEDS_MORE_DATA: 'gathering',
ANALYSIS_FAILED: 'failed'
}
},
synthesizing: {
on: {
SYNTHESIS_COMPLETE: 'reviewing',
CONFIDENCE_LOW: 'human_review'
}
},
reviewing: {
on: {
VALIDATION_PASSED: 'complete',
VALIDATION_FAILED: 'synthesizing' // re-synthesize with feedback
}
},
human_review: {
on: {
APPROVED: 'complete',
REJECTED: 'failed',
MODIFIED: 'synthesizing'
}
},
complete: { type: 'final' },
failed: { type: 'final' },
insufficient_data: { type: 'final' }
}
};
Libraries like XState formalize this pattern and provide visualization tools that let you see the state machine diagram alongside your code. Even without a library, defining the valid states and transitions in a declarative structure prevents the most common agent bug: undefined state transitions that produce unpredictable behavior.
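Without a library, the enforcement itself is a single helper: look the transition up in the declarative machine and throw when it is not defined. A sketch compatible with a machine shaped like agentStateMachine above:

```typescript
// Throwing on an undefined transition turns a silent logic bug into a
// loud, traceable failure.
type Machine = {
  initial: string;
  states: Record<string, { on?: Record<string, string>; type?: string }>;
};

function transition(machine: Machine, state: string, event: string): string {
  const next = machine.states[state]?.on?.[event];
  if (next === undefined) {
    throw new Error(`Invalid transition: no event "${event}" from state "${state}"`);
  }
  return next;
}
```

Against the machine above, transition(agentStateMachine, 'gathering', 'SOURCES_FOUND') yields 'analyzing', while transition(agentStateMachine, 'gathering', 'APPROVED') throws instead of silently doing nothing.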
Timeout Handling and Dead Letter Queues
Every agent workflow needs a global timeout. If a workflow has been running for 10 minutes and is still in the gathering state, something is wrong. The timeout handler should capture the current state, log a detailed diagnostic, and either produce a partial result or route the workflow to a dead letter queue for manual inspection.
async function monitorWorkflowTimeout(workflow: AgentWorkflow) {
const timeoutAt = new Date(workflow.timeout_at).getTime();
if (Date.now() > timeoutAt) {
logger.error({
trace_id: workflow.workflow_id,
event: 'workflow_timeout',
data: {
current_step: workflow.current_step_index,
total_steps: workflow.steps.length,
elapsed_ms: Date.now() - new Date(workflow.steps[0].created_at).getTime(),
tokens_used: workflow.token_budget.used
}
});
// Attempt graceful degradation
const partialResult = await synthesizePartialResult(workflow);
if (partialResult) {
workflow.status = 'completed';
workflow.state.result = partialResult;
workflow.state.degraded = true;
} else {
// Route to dead letter queue
await deadLetterQueue.enqueue({
workflow_id: workflow.workflow_id,
reason: 'timeout',
state_snapshot: workflow.state,
steps_completed: workflow.steps.filter(s => s.status === 'completed').length
});
workflow.status = 'failed';
}
await persistWorkflow(workflow);
}
}
The dead letter queue is where failed workflows go when automated recovery is not possible. Each entry contains the full workflow state at the time of failure, the error classification, and the number of completed steps. An engineer or an operator reviews the dead letter queue periodically, identifies patterns (are most failures from the same tool? the same step type? the same time of day?), and either fixes the root cause or adjusts the workflow configuration to handle the edge case.
If your dead letter queue is empty, either your system is perfect or your error handling is swallowing failures silently. It is almost certainly the second one. Instrument every failure path. Route every unrecoverable error to a place where a human will see it.
The Production Agent Checklist
Before deploying any agent workflow to production, verify that each of these concerns is addressed. Not as TODO comments. As implemented, tested code.
- Durable execution: Steps persist independently. Failures do not lose completed work.
- Idempotent operations: Every step can safely execute twice with the same input.
- Retry with backoff: Transient failures retry with exponential backoff and jitter. Permanent failures fail immediately.
- Structured outputs: Every LLM response is validated against a typed schema before entering the system.
- Token budgets: Every workflow has a ceiling. Exhaustion triggers graceful degradation, not uncontrolled spending.
- Circuit breakers: External dependency failures are contained. Cascading timeouts are prevented.
- Structured logging: Every step emits structured logs with trace IDs, timing, and token counts.
- Human checkpoints: High-stakes actions pause for human approval. Criteria are defined and configurable.
- State machine: Valid states and transitions are explicit. Invalid transitions are impossible.
- Global timeouts: Runaway workflows are terminated and produce partial results or route to dead letter queues.
- Dead letter queue: Unrecoverable failures are captured with full diagnostic context for manual review.
Demo agents are impressive. Production agents are reliable. The difference is not intelligence. It is engineering discipline applied to every failure mode that the demo conveniently ignores. Build the boring infrastructure first. The impressive capabilities are only valuable if they survive contact with the real world.
The measure of a production agent is not how well it performs when everything works. It is how gracefully it behaves when everything breaks.