
Building AI Agents That Actually Work in Production

November 20, 2025 · 8 min read

The Demo Problem

Every AI agent demo I've seen looks flawless. Clean inputs, predictable outputs, no edge cases. Then you ship it and reality hits: malformed JSON from the LLM, tool calls that time out, users who type things nobody anticipated.

I've shipped three agent systems to production in the last year. Here's what actually matters.

Retry Logic Is Not Optional

Your LLM will occasionally return garbage. Not because of a bug — because probabilistic models produce probabilistic outputs. Build retries in from day one.

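// Assumes llm, isValidOutput, and sleep are defined elsewhere in the codebase.
async function callWithRetry(prompt: string, maxAttempts = 3) {
  let lastError: unknown = new Error("LLM returned invalid output")
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const result = await llm.call(prompt)
      if (isValidOutput(result)) return result
      // Invalid output counts as a failed attempt and is retried like an error
      lastError = new Error("LLM returned invalid output")
    } catch (e) {
      lastError = e
    }
    // Back off before the next attempt: 1s, 2s, 4s, ...
    if (attempt < maxAttempts - 1) await sleep(1000 * Math.pow(2, attempt))
  }
  // Every attempt failed: surface the last error instead of returning undefined
  throw lastError
}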

The exponential backoff matters. Hammering a rate-limited API three times in 100ms accomplishes nothing.
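
To make the arithmetic concrete, here is a small sketch of the delays that `1000 * Math.pow(2, attempt)` produces. The helper names `backoffDelayMs` and `backoffSchedule` are mine, chosen for illustration; only the formula comes from the retry loop above.

```typescript
// Delay in milliseconds before retrying after a failed attempt (0-indexed).
// baseMs = 1000 mirrors the constant used in the retry loop.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * Math.pow(2, attempt)
}

// All waits across a full run. No wait follows the final attempt,
// so a run of maxAttempts calls has maxAttempts - 1 delays.
function backoffSchedule(maxAttempts: number, baseMs = 1000): number[] {
  const waits = Math.max(0, maxAttempts - 1)
  return Array.from({ length: waits }, (_, attempt) => backoffDelayMs(attempt, baseMs))
}

console.log(backoffSchedule(3)) // [ 1000, 2000 ]
```

With the default of three attempts, the worst case adds three seconds of waiting on top of the calls themselves, which is worth keeping in mind if a user is watching a spinner.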

Validate Everything

Don't trust the model's output. Parse it. Schema-validate it. Reject it if it doesn't match what you expect and retry.

import { z } from "zod"

const AgentOutput = z.object({
  action: z.enum(["search", "summarize", "answer"]),
  confidence: z.number().min(0).max(1),
  result: z.string(),
})

If the model can't produce valid output after three attempts, fall back gracefully. Show the user an "I couldn't complete that" message rather than crashing.
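
Here is one way the parse, validate, and fall back flow might look. To keep the sketch dependency-free, `parseAgentOutput` is a hand-written check that mirrors the schema above; with Zod you would likely wrap `JSON.parse` in a try/catch and pass the result to `AgentOutput.safeParse(...)` instead. The names `AgentOutputData`, `answerOrFallback`, and `generate` are assumptions for illustration, and backoff is left out for brevity.

```typescript
type AgentOutputData = {
  action: "search" | "summarize" | "answer"
  confidence: number
  result: string
}

const ACTIONS = ["search", "summarize", "answer"]

// Returns the validated output, or null if the raw text is not usable.
function parseAgentOutput(raw: string): AgentOutputData | null {
  let data: unknown
  try {
    data = JSON.parse(raw)
  } catch {
    return null // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null
  const { action, confidence, result } = data as Record<string, unknown>
  if (typeof action !== "string" || !ACTIONS.includes(action)) return null
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) return null
  if (typeof result !== "string") return null
  return { action: action as AgentOutputData["action"], confidence, result }
}

// `generate` is a hypothetical function that makes one LLM call and
// resolves to the raw response text.
async function answerOrFallback(
  generate: () => Promise<string>,
  maxAttempts = 3,
): Promise<AgentOutputData | { fallback: string }> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const parsed = parseAgentOutput(await generate())
      if (parsed) return parsed
    } catch {
      // A failed call counts as a failed attempt; try again.
    }
  }
  // Out of attempts: return a message for the user instead of throwing.
  return { fallback: "I couldn't complete that." }
}
```

Returning a tagged fallback value rather than throwing keeps the "couldn't complete" case on the normal code path, so the UI has to handle it explicitly.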

Know When to Punt

This is the hardest lesson. Not every task should be handled autonomously. Build explicit "escalate to human" paths for:

Confidence below your threshold
Tool calls that would affect irreversible state (send email, charge card, delete data)
Any ambiguity in user intent that context doesn't resolve

An agent that knows its limits is far more valuable than one that confidently produces wrong answers.
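
The three escalation triggers above can be sketched as a single decision function. Everything named here is an assumption for illustration: the tool names in `IRREVERSIBLE_TOOLS`, the `AgentStep` shape, and the default threshold of 0.7, which you would tune for your own system.

```typescript
// Hypothetical tool names for the irreversible actions listed above.
const IRREVERSIBLE_TOOLS = new Set(["send_email", "charge_card", "delete_data"])

type AgentStep = {
  confidence: number       // model-reported confidence in [0, 1]
  tool?: string            // tool the agent wants to call, if any
  intentAmbiguous: boolean // true when context did not resolve the user's intent
}

type Decision = { escalate: false } | { escalate: true; reason: string }

function decideEscalation(step: AgentStep, threshold = 0.7): Decision {
  // Irreversible actions go to a human no matter how confident the model is.
  if (step.tool !== undefined && IRREVERSIBLE_TOOLS.has(step.tool)) {
    return { escalate: true, reason: `irreversible tool: ${step.tool}` }
  }
  if (step.intentAmbiguous) return { escalate: true, reason: "ambiguous intent" }
  if (step.confidence < threshold) return { escalate: true, reason: "low confidence" }
  return { escalate: false }
}
```

Checking irreversibility first is a deliberate choice in this sketch: a confident agent about to charge a card is exactly the case where a second pair of eyes is cheapest relative to the cost of being wrong.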

Observability From Day One

Log every LLM call: prompt, response, latency, token count. You'll need this when something goes wrong at 2am. I use a simple wrapper that writes to a structured log:

{
  ts: "2025-11-15T14:22:01Z",
  model: "claude-3-5-sonnet",
  prompt_tokens: 847,
  completion_tokens: 312,
  latency_ms: 1840,
  success: true,
  retry_count: 0
}
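
A wrapper along these lines could produce entries like the one above. This is a sketch rather than my exact code: the `LlmResponse` shape, the `call` callback, and the `log` sink are assumptions, and it records only the fields in the sample entry. Store the prompt and response text wherever your privacy rules allow.

```typescript
// Hypothetical response shape; adapt to whatever your LLM client returns.
type LlmResponse = { text: string; promptTokens: number; completionTokens: number }

type LlmLogEntry = {
  ts: string
  model: string
  prompt_tokens: number
  completion_tokens: number
  latency_ms: number
  success: boolean
  retry_count: number
}

// Pure helper: builds one log entry. A null response means the call failed.
function buildLogEntry(
  model: string,
  startedAt: number,
  finishedAt: number,
  response: LlmResponse | null,
  retryCount: number,
): LlmLogEntry {
  return {
    ts: new Date(startedAt).toISOString(),
    model,
    prompt_tokens: response ? response.promptTokens : 0,
    completion_tokens: response ? response.completionTokens : 0,
    latency_ms: finishedAt - startedAt,
    success: response !== null,
    retry_count: retryCount,
  }
}

// Times the call, logs the outcome either way, and rethrows on failure.
async function loggedCall(
  model: string,
  call: () => Promise<LlmResponse>,
  retryCount = 0,
  log: (entry: LlmLogEntry) => void = (entry) => console.log(JSON.stringify(entry)),
): Promise<LlmResponse> {
  const startedAt = Date.now()
  try {
    const response = await call()
    log(buildLogEntry(model, startedAt, Date.now(), response, retryCount))
    return response
  } catch (err) {
    log(buildLogEntry(model, startedAt, Date.now(), null, retryCount))
    throw err
  }
}
```

Logging failures as well as successes matters here: the failed calls are usually the ones you go looking for later.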

The Actual Lesson

AI agents aren't magic — they're distributed systems with a probabilistic component. Apply the same engineering discipline you'd apply to any unreliable external service: retries, validation, fallbacks, observability.

The demos make it look easy. The engineering is where the real work is.
