Building Production Agentic AI Systems: A Practitioner's Architecture Guide

By Manimaran Gunasekaran | June 2026 | keneland.com

Most agent demos fail in production. They work on stage with a curated prompt and a single happy path, then collapse when they meet real users, real data, and real failure modes. I've spent the last several months building a production multi-agent platform, and the lessons I've learned have less to do with which LLM you pick and more to do with the engineering decisions that happen around it.

This is a practitioner's guide to the architecture of production agentic AI systems — not a framework tutorial, not a vendor pitch. It's the set of decisions I wish someone had written down before I made the mistakes myself.

What makes an agent different from a chatbot

A chatbot takes a prompt and returns a response. An agent takes a goal and figures out how to achieve it — reasoning about what to do, using tools to act, observing the results, and deciding what to do next. The difference isn't sophistication. It's autonomy.

That autonomy is what makes agents powerful and what makes them dangerous to deploy without engineering discipline. A chatbot that gives a bad answer is annoying. An agent that takes a bad action can corrupt data, break a deployment, or send the wrong message to a customer.

Every architectural decision in this guide is informed by that tension: how do you give agents enough autonomy to be useful while keeping enough control to be safe?

The five layers

A production agent system has five layers. Skip any one of them and you'll find out in production.

Layer 1: Human layer — Users, operators, and administrators who interact with and govern the system. This includes approval gates where humans review agent outputs before they take effect, tenant configuration for multi-customer deployments, and monitoring dashboards that surface what agents are doing and why.

Layer 2: Orchestration core — The brain. This is where agents reason, coordinate, and execute. It includes the agent execution engine (how individual agents think and act), the supervisor (how multiple agents coordinate and recover from failures), the LLM gateway (how you talk to language models reliably), the workflow engine (how long-running multi-step processes are managed), and supporting systems like prompt registries, guardrails, self-healing rules, and event buses.

Layer 3: Memory and retrieval — What agents know and how they find it. Working memory for in-task context, vector stores for semantic retrieval (RAG), keyword stores for exact-match retrieval, and the embedding pipeline that bridges text and vectors.

Layer 4: Tool layer — What agents can do. Domain-specific tools (APIs, CLIs, data queries), code execution environments, external service integrations, and file system access. The Model Context Protocol (MCP) is emerging as a standard for how agents discover and invoke tools.

Layer 5: Infrastructure — Compute, databases, secrets management, CI/CD, observability. The unglamorous layer that determines whether your system stays up at 3am.

Layer 2 deep dive: orchestration

This is where most of the architectural decisions live.

The ReAct pattern

The most effective agent execution pattern I've found is ReAct — Reasoning and Acting. The agent receives a goal, then enters a loop: it thinks about what to do (reasoning), takes an action using a tool, observes the result, and decides whether to continue or stop. The LLM returns structured output — a thought, an action name, and parameters — and a tool dispatcher routes the action to the appropriate implementation.

This sounds simple, but the engineering details matter. You need a tool dispatcher that maps action names to real implementations, state accumulation across loop iterations (so the agent remembers what it's already done), configurable maximum steps (so a confused agent doesn't loop forever), and structured output parsing that handles the inevitable malformed JSON from the LLM.
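As a rough illustration, here is a minimal sketch of that loop in Python. It assumes a hypothetical llm callable that returns a JSON string with thought, action, and params fields, plus a "finish" action that ends the loop. The TOOLS registry and run_agent are illustrative names, not part of any framework.

```python
import json
from typing import Callable

# Hypothetical tool registry: action name -> implementation.
TOOLS: dict[str, Callable[..., str]] = {
    "add": lambda a, b: str(a + b),
}

def run_agent(goal: str, llm: Callable[[str, list], str], max_steps: int = 5) -> dict:
    """Run a ReAct loop until the model emits 'finish' or max_steps is reached."""
    history: list[dict] = []  # state accumulated across loop iterations
    for _ in range(max_steps):
        raw = llm(goal, history)
        try:
            step = json.loads(raw)
            action, params = step["action"], step.get("params", {})
            if not isinstance(params, dict):
                raise TypeError("params must be a JSON object")
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
            # Malformed output: record it so the next call can see what went wrong.
            history.append({"error": "unparseable output", "raw": raw})
            continue
        if action == "finish":
            return {"status": "done", "answer": params.get("answer"), "history": history}
        tool = TOOLS.get(action)
        if tool is None:
            observation = f"unknown tool: {action}"
        else:
            try:
                observation = tool(**params)
            except Exception as exc:  # tool failures become observations, not crashes
                observation = f"tool error: {exc}"
        history.append({"thought": step.get("thought"), "action": action,
                        "observation": observation})
    return {"status": "max_steps_exceeded", "answer": None, "history": history}
```

Note that a malformed response consumes a step here, so a model that never produces valid JSON still terminates at max_steps.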

The key insight is that the agent's "intelligence" comes from the LLM, but its "competence" comes from the tools you give it and the state you maintain for it. A brilliant reasoner with bad tools produces bad results.

Multi-agent coordination

Single agents hit a ceiling. When the task is complex enough — multiple domains of expertise, multiple phases, multiple outputs that depend on each other — you need multiple agents that coordinate.

The pattern that works in production is a supervisor/sub-agent hierarchy. A supervisor agent receives the overall goal, breaks it into tasks, assigns them to specialized sub-agents, monitors progress, and intervenes when things go wrong. Sub-agents execute autonomously within their scope but escalate to the supervisor when they're stuck.

The supervisor needs escalation levels. In my implementation, there are three: Level 1 re-queues the failed task with a recovery context (the agent retries with more information about what went wrong). Level 2 reassigns the task to the supervisor itself (a more capable model takes over). Level 3 hard-stops the workflow and creates an incident for human review. The key is that each level is more expensive and more disruptive, so you exhaust cheaper options first.
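The escalation ladder can be sketched as a small decision function. This is an illustrative sketch only: the TaskFailure fields and the max_retries threshold of 2 are assumptions, not details from my implementation.

```python
from dataclasses import dataclass
from enum import Enum

class Escalation(Enum):
    RETRY_WITH_CONTEXT = 1   # Level 1: re-queue with recovery context
    SUPERVISOR_TAKEOVER = 2  # Level 2: the supervisor handles the task itself
    HUMAN_INCIDENT = 3       # Level 3: hard-stop and open an incident

@dataclass
class TaskFailure:
    task_id: str
    retries: int                # Level 1 attempts already made
    supervisor_attempted: bool  # whether Level 2 has already been tried

def next_escalation(failure: TaskFailure, max_retries: int = 2) -> Escalation:
    """Pick the cheapest escalation level that has not been exhausted yet."""
    if failure.retries < max_retries:
        return Escalation.RETRY_WITH_CONTEXT
    if not failure.supervisor_attempted:
        return Escalation.SUPERVISOR_TAKEOVER
    return Escalation.HUMAN_INCIDENT
```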

Race conditions in task assignment are a real problem when multiple agents process events concurrently. A handler lease guard using compare-and-swap (CAS) operations prevents duplicate execution — before an agent starts working on a task, it atomically claims a lease, and any other agent that tries to claim the same task fails fast.
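To make the claim step concrete, here is a minimal in-memory sketch. It is a stand-in for whatever shared store provides the atomic operation in a real deployment (a conditional database update or similar); the LeaseGuard class, its method names, and the 30-second default TTL are all hypothetical.

```python
import threading
import time

class LeaseGuard:
    """In-memory stand-in for an atomic compare-and-swap lease store."""

    def __init__(self, clock=time.monotonic):
        self._leases: dict[str, tuple[str, float]] = {}  # task_id -> (owner, expires_at)
        self._lock = threading.Lock()
        self._clock = clock

    def try_claim(self, task_id: str, owner: str, ttl_seconds: float = 30.0) -> bool:
        """Claim the task only if it is unowned or its lease has expired."""
        now = self._clock()
        with self._lock:  # the lock makes check-and-set a single atomic step
            current = self._leases.get(task_id)
            if current is not None and current[1] > now:
                return False  # another agent holds a live lease: fail fast
            self._leases[task_id] = (owner, now + ttl_seconds)
            return True

    def release(self, task_id: str, owner: str) -> None:
        """Release the lease, but only if this owner still holds it."""
        with self._lock:
            current = self._leases.get(task_id)
            if current is not None and current[0] == owner:
                del self._leases[task_id]
```

The TTL matters as much as the claim: without expiry, an agent that crashes mid-task would hold its lease forever.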

The LLM gateway

Do not call LLM APIs directly from your agent code. Build a gateway.

The gateway abstracts provider-specific details (Anthropic, OpenAI, and Google all have different request/response formats), implements per-tenant rate limiting (so one noisy tenant doesn't exhaust your API quota), adds a circuit breaker (so a failing provider doesn't cascade into timeouts across your entire system), and routes to fallback providers when the primary is down or over budget.

The circuit breaker pattern is critical. When an LLM provider fails three times in sixty seconds, you stop sending requests to it entirely for a cooldown period. After the cooldown, the circuit goes half-open: you send a single test request. If it succeeds, the circuit closes and traffic resumes. If it fails, the circuit reopens and the cooldown starts again. Without this, a provider outage turns into cascading timeouts that bring down your entire system.
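A minimal sketch of that breaker, using the three-failures-in-sixty-seconds rule from above. The 30-second cooldown, the class name, and the injectable clock are assumptions made for illustration, and the sketch is not thread-safe.

```python
import time
from collections import deque

class CircuitBreaker:
    def __init__(self, failure_threshold=3, window_seconds=60.0,
                 cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = deque()       # timestamps of recent failures
        self._opened_at = None         # None means the circuit is closed
        self._probe_in_flight = False  # True while the single test request runs

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at < self.cooldown_seconds:
            return False  # still cooling down
        if self._probe_in_flight:
            return False  # only one test request at a time
        self._probe_in_flight = True
        return True

    def record_success(self) -> None:
        self._failures.clear()
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        now = self._clock()
        if self._probe_in_flight:
            # The test request failed: restart the cooldown.
            self._opened_at = now
            self._probe_in_flight = False
            return
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._opened_at = now
```

The gateway would call allow_request before each provider call and report the outcome with record_success or record_failure, routing to a fallback provider whenever allow_request returns False.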

Cost-aware routing matters at scale. Different models have different prices per token. For high-complexity tasks (architecture decisions, nuanced code generation), you route to the most capable model. For classification tasks and simple text generation, you route to cheaper, faster models. The gateway makes this a configuration decision, not a code change.
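One way to keep routing in configuration is a pair of lookup tables, sketched below. The model names, prices, and task types are placeholders, and defaulting unknown task types to the cheaper tier is an assumption you may want to reverse.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRoute:
    model: str
    usd_per_1k_tokens: float

# Hypothetical routing tables; in practice these would be loaded from config.
ROUTES = {
    "high": ModelRoute("large-model", 0.015),
    "low": ModelRoute("small-model", 0.001),
}
TASK_COMPLEXITY = {
    "architecture_decision": "high",
    "code_generation": "high",
    "classification": "low",
    "simple_text": "low",
}

def route(task_type: str, budget_remaining_usd: float, est_tokens: int) -> ModelRoute:
    """Pick a model by task complexity, falling back to the cheap tier over budget."""
    tier = TASK_COMPLEXITY.get(task_type, "low")  # assumption: unknown tasks go cheap
    choice = ROUTES[tier]
    est_cost = choice.usd_per_1k_tokens * est_tokens / 1000
    if est_cost > budget_remaining_usd:
        choice = ROUTES["low"]  # over budget: use the cheaper model
    return choice
```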

Workflow durability

Agent tasks that take minutes or hours — multi-phase projects, complex approval chains, long-running deployments — need durable workflow orchestration. If your server restarts mid-workflow, you need to resume from where you left off, not start over.

Temporal.io (or similar durable workflow engines) solves this by recording every step in a replay log. On recovery, the workflow function re-executes from the beginning, but completed steps are replayed from the log rather than re-executed. This gives you fault tolerance without manual state management.

The critical constraint is determinism: your workflow code must produce the same result on replay as it did on the original execution. That means no random numbers, no wall-clock time, and no direct side effects — all side effects (LLM calls, database writes, deployments) go through "activities" that are recorded in the replay log.
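The replay idea can be illustrated with a toy model. To be clear, this is not Temporal's API: ReplayContext and the workflow function below are hypothetical, and a real engine also records activity names and inputs, handles retries, and persists the log durably.

```python
from typing import Any, Callable, Optional

class ReplayContext:
    """Toy replay log: completed activity results are recorded and reused."""

    def __init__(self, log: Optional[list] = None):
        self.log: list = log if log is not None else []
        self._cursor = 0

    def activity(self, fn: Callable[[], Any]) -> Any:
        if self._cursor < len(self.log):
            result = self.log[self._cursor]  # replay: do not re-execute
        else:
            result = fn()                    # first run: execute and record
            self.log.append(result)
        self._cursor += 1
        return result

def workflow(ctx: ReplayContext, side_effects: list) -> str:
    """Workflow code stays deterministic; side effects live in activities."""
    def generate_plan() -> str:
        side_effects.append("llm_call")
        return "plan-v1"

    def deploy() -> str:
        side_effects.append("deploy")
        return "deployed"

    plan = ctx.activity(generate_plan)
    status = ctx.activity(deploy)
    return f"{plan}:{status}"
```

If the process crashes after the first activity, rerunning workflow with the saved log skips generate_plan and executes only deploy. That only works because the workflow always requests its activities in the same order, which is exactly what the determinism constraint protects.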

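When a phase fails, saga compensation rolls back the steps that already completed. The implementation uses Promise.allSettled (not Promise.all) for compensation: Promise.all rejects as soon as one rollback fails, which hides the outcome of the other rollbacks and can mask the original error. With allSettled, every rollback's result is collected and the original failure stays the one that gets reported.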
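The same idea in Python, as a sketch: asyncio.gather with return_exceptions=True is the closest counterpart to allSettled, since it waits for every rollback and returns failures as values instead of raising. The run_saga function and its step format are hypothetical.

```python
import asyncio

async def run_saga(steps):
    """Run steps in order; on failure, compensate every completed step.

    steps: list of (name, action, compensate), where action and compensate
    are zero-argument async callables.
    """
    completed = []
    try:
        for name, action, compensate in steps:
            await action()
            completed.append((name, compensate))
    except Exception as original:
        to_undo = list(reversed(completed))
        # return_exceptions=True: a failed rollback is returned as a value,
        # so the other rollbacks still finish and the original error survives.
        results = await asyncio.gather(
            *(compensate() for _, compensate in to_undo),
            return_exceptions=True,
        )
        rollback_errors = {
            name: err
            for (name, _), err in zip(to_undo, results)
            if isinstance(err, BaseException)
        }
        return {"ok": False, "error": original, "rollback_errors": rollback_errors}
    return {"ok": True, "error": None, "rollback_errors": {}}
```

Rollbacks run concurrently here, as they would with allSettled. If your compensations must run in strict reverse order, await them one at a time inside individual try/except blocks instead.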

Human approval gates are implemented as signals that the workflow waits for. The workflow pauses, a notification goes to the human reviewer, and when they approve (or the timeout expires), the workflow resumes. This is built into the durable workflow, not bolted on.

Layer 3 deep dive: memory and retrieval

Agents without memory make the same mistakes repeatedly. Agents with memory learn.

Three-tier memory architecture

Tier 1 — Working memory (Redis): Fast, ephemeral, scoped to the current task. Think of it as the agent's scratchpad. TTL-based expiry (24 hours) means it self-cleans. Used for in-task context that doesn't need to survive beyond the current execution.

Tier 2 — Episodic memory (vector store): The agent's past decisions on this project. "Last time I worked on this account object, I used this naming convention." Stored as vector embeddings, retrieved via semantic similarity search. Scoped per agent, per project.

Tier 3 — Organizational knowledge (vector store): Cross-project patterns, scoped by tenant. "Across all projects for this customer, the convention is to use trigger handlers, not process builders." This is where institutional knowledge accumulates.

Hybrid retrieval (the part most RAG implementations get wrong)

Pure vector search can miss exact matches. If your agent memory contains "Account.BillingCity" and you search for "billing address," vector search will find it. But if you search for the exact string "Account.BillingCity", vector search might rank a semantically similar but wrong result (say, "Account.BillingStreet") higher.

The solution is hybrid retrieval: run vector search and keyword search in parallel, merge the results (vector hits first, then keyword hits), and deduplicate by ID. This gives you semantic flexibility and keyword precision: an exact identifier match lands in the merged set even when vector search ranks it low or misses it.
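A minimal sketch of the merge step, assuming each hit is a dict with an "id" key and that vector_search and keyword_search are callables you supply. The function names and the limit parameter are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_merge(vector_hits: list, keyword_hits: list, limit: int = 10) -> list:
    """Merge vector hits first, then keyword hits, deduplicating by 'id'."""
    seen = set()
    merged = []
    for hit in [*vector_hits, *keyword_hits]:
        if hit["id"] in seen:
            continue  # already included from the vector results
        seen.add(hit["id"])
        merged.append(hit)
    return merged[:limit]

def hybrid_search(query: str, vector_search, keyword_search, limit: int = 10) -> list:
    """Run both searches in parallel threads, then merge the results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vector_future = pool.submit(vector_search, query)
        keyword_future = pool.submit(keyword_search, query)
        return hybrid_merge(vector_future.result(), keyword_future.result(), limit)
```

Because truncation happens after the merge, a tight limit can still cut off keyword-only hits. If exact matches matter most for your workload, consider reserving a few slots for them.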

The implementation detail that matters: embeddings must all come from the same model within a tenant. You cannot compare embeddings generated by OpenAI's model with embeddings generated by Google's model — they live in different vector spaces. If a tenant switches embedding providers, you need to re-embed their entire memory corpus.

Graceful degradation

The vector store will go down. Your embedding API will have an outage. This should not stop your agents from working.

Design every retrieval path with a fallback. If the vector store is unavailable, fall back to keyword-only search. If the embedding API is down, skip memory retrieval entirely: the agent generates with less context instead of not generating at all. Wrap all retrieval in try/catch and mark it as non-critical. A retrieval failure should never block generation.

The system degrades to less-informed output, not to no output.
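The fallback chain described above could look like the following sketch. The embed, vector_search, and keyword_search parameters are hypothetical callables standing in for your real clients.

```python
import logging

logger = logging.getLogger("retrieval")

def retrieve_context(query: str, embed, vector_search, keyword_search) -> list:
    """Never raises: retrieval is non-critical, so every failure degrades."""
    try:
        embedding = embed(query)
    except Exception:
        # Embedding API down: skip memory retrieval entirely.
        logger.warning("embedding API unavailable; skipping memory retrieval")
        return []
    try:
        return vector_search(embedding)
    except Exception:
        logger.warning("vector store unavailable; falling back to keyword search")
    try:
        return keyword_search(query)
    except Exception:
        logger.warning("keyword search unavailable; continuing without context")
        return []
```

The caller treats an empty list as "generate without retrieved context", so generation proceeds in every case.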

Knowing when NOT to use RAG

Not every piece of context should be retrieved semantically. When the source document is known and deterministic — an approved specification, a signed contract, a specific database record — fetch it directly. Semantic retrieval introduces noise when the answer isn't ambiguous. Use RAG for discovery ("find relevant past decisions"). Use direct fetch for grounding ("use this specific approved document").

This distinction is underappreciated. Most RAG tutorials treat retrieval as the default for all context. In production, mixing semantic retrieval with deterministic grounding based on the nature of the source produces better results than using RAG for everything.

Layer 4 deep dive: tool routing and MCP

Agents are only as capable as the tools they can use. The Model Context Protocol (MCP) is emerging as the standard for how agents discover and invoke tools, and it's worth designing around.

MCP provides a structured way for agents to understand what tools are available, what parameters they accept, and what they return. Instead of hardcoding tool definitions in your prompt, the agent queries an MCP server that describes available tools dynamically. This means you can add new capabilities without changing agent code.

The architectural pattern that scales is tool/context packs — bundles of related tools that can be composed per agent or per tenant. A Salesforce pack exposes org metadata, deployment, and validation tools. A Jira pack exposes issue creation and status updates. A calendar pack exposes scheduling. Each agent gets the packs relevant to its role.
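Pack composition can be sketched as plain data. This sketch does not use MCP itself; it only shows how packs might be composed per role. ToolPack, the "pack.tool" naming scheme, the role names, and the example tools are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ToolPack:
    name: str
    tools: dict[str, Callable[..., object]] = field(default_factory=dict)

def compose_tools(packs: list) -> dict:
    """Flatten packs into one registry, namespacing tools as 'pack.tool'."""
    registry = {}
    for pack in packs:
        for tool_name, fn in pack.tools.items():
            registry[f"{pack.name}.{tool_name}"] = fn
    return registry

# Hypothetical packs with stub implementations.
jira = ToolPack("jira", {"create_issue": lambda title: f"created: {title}"})
calendar = ToolPack("calendar", {"schedule": lambda when: f"scheduled: {when}"})

# Each role gets only the packs relevant to it.
ROLE_PACKS = {"project_manager": [jira, calendar], "scheduler": [calendar]}

def tools_for_role(role: str) -> dict:
    return compose_tools(ROLE_PACKS.get(role, []))
```

Namespacing by pack name avoids collisions when two packs expose a tool with the same short name.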

This composability is what makes the platform domain-agnostic. The orchestration core doesn't know or care about Salesforce or Jira. It knows about tools, which are dynamically discovered via MCP. Switching domains — from Salesforce delivery to customer support automation to sales prospecting — means configuring different tool packs, not rewriting agent logic.

The decisions that keep you alive in production

Multi-tenancy from day one

If your platform serves more than one customer (or more than one team), tenant isolation is non-negotiable. Per-tenant API keys stored in a secrets manager, never in environment variables. Tenant-scoped vector collections so one tenant's memory never appears in another's retrieval results. Per-tenant rate limits so one customer's workload doesn't starve another.

The cost of retrofitting multi-tenancy is an order of magnitude higher than building it in from the start. I've seen this play out too many times.

Guardrails as a pipeline, not a checkbox

AI guardrails shouldn't be a single filter. They should be a pipeline with distinct stages: input validation (PII detection, injection prevention, budget checks), prompt framework compilation (assembling the final prompt from templates, context, and constraints), LLM execution, output validation (schema conformance, grounding checks, hallucination scoring), and policy evaluation (does this output violate any organizational rules).

Each stage can pass, fail, or flag for human review. The pipeline is configurable per tenant — a healthcare customer might have strict PII detection, while an internal tool might have relaxed constraints.
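A staged pipeline with three-way verdicts might be sketched like this. The two stages shown are deliberately naive stand-ins (the PII check just looks for an "@"), and the tenant names and payload fields are assumptions.

```python
from enum import Enum
from typing import Callable

class Verdict(Enum):
    PASS = "pass"
    FLAG = "flag"  # route to human review
    FAIL = "fail"

Stage = Callable[[dict], Verdict]

def run_pipeline(payload: dict, stages: list) -> dict:
    """Run stages in order; stop at the first FAIL and remember any FLAGs."""
    flagged = []
    for name, stage in stages:
        verdict = stage(payload)
        if verdict is Verdict.FAIL:
            return {"verdict": Verdict.FAIL, "failed_stage": name, "flagged": flagged}
        if verdict is Verdict.FLAG:
            flagged.append(name)
    overall = Verdict.FLAG if flagged else Verdict.PASS
    return {"verdict": overall, "failed_stage": None, "flagged": flagged}

def budget_check(payload: dict) -> Verdict:
    within = payload.get("tokens", 0) <= payload.get("budget", 0)
    return Verdict.PASS if within else Verdict.FAIL

def pii_check(payload: dict) -> Verdict:
    # Naive stand-in for real PII detection: flag anything that looks like an email.
    return Verdict.FLAG if "@" in payload.get("text", "") else Verdict.PASS

# Per-tenant configuration: stricter tenants simply get more stages.
TENANT_PIPELINES = {
    "healthcare": [("budget", budget_check), ("pii", pii_check)],
    "internal": [("budget", budget_check)],
}
```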

Self-healing before human escalation

Production systems fail. The question is what happens next. A self-healing engine with categorized rules (context failures, LLM failures, review failures, agent failures, workflow failures, integration failures) and graduated severity (auto-fix, notify, escalate) handles the 80% of failures that follow known patterns. The remaining 20% escalate to humans with full context.

The ring-buffer event history per project is critical for debugging. When a human investigates an escalation, they can see the sequence of events, decisions, and failures that led to the current state — not just the final error.
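A per-project ring buffer is a few lines with a bounded deque. The EventHistory class and the default capacity of 100 are illustrative choices.

```python
from collections import defaultdict, deque

class EventHistory:
    """Keeps the most recent events per project; older ones fall off the front."""

    def __init__(self, capacity: int = 100):
        # deque(maxlen=...) discards the oldest entry once it is full.
        self._events = defaultdict(lambda: deque(maxlen=capacity))

    def record(self, project_id: str, event: dict) -> None:
        self._events[project_id].append(event)

    def recent(self, project_id: str) -> list:
        """Return events oldest-first, without creating an entry for unknown projects."""
        return list(self._events.get(project_id, ()))
```

Attaching the output of recent to an escalation gives the human reviewer the sequence of events that led to the failure.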

Observability for AI is different from observability for software

Standard metrics (latency, error rates, throughput) are necessary but insufficient. AI systems need additional observability: tokens consumed per task (for cost tracking), model selection decisions (which model was chosen and why), retrieval quality (what was retrieved and whether it helped), human override rates (how often do humans reject agent outputs), and completion rates (what percentage of tasks actually finish successfully).

Prometheus metrics cover the system health layer. AI-specific tracing (logging inputs, outputs, and intermediate reasoning for every LLM call) covers the debugging layer. Both are required.

What I got wrong

The honest part. These are mistakes I made building a production agent platform.

Underestimating the long tail of LLM output parsing. LLMs produce structured JSON most of the time. But "most of the time" can still mean that 5% of outputs are malformed, and that 5% will crash your agent loop if you don't handle it. Every JSON parse needs a fallback. Every structured output extraction needs a retry with a simpler prompt.
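A sketch of what "fallback plus retry" can mean in practice, assuming the expected output is a single JSON object. The parse_llm_json and structured_call names are hypothetical, and the brace-extraction heuristic is just one possible fallback.

```python
import json
from typing import Callable, Optional

def parse_llm_json(raw: str) -> Optional[dict]:
    """Try strict parsing first, then extract the outermost {...} span."""
    candidates = [raw]
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        candidates.append(raw[start:end + 1])  # strip prose around the object
    for text in candidates:
        try:
            value = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return None  # caller decides what to do; never raises on bad output

def structured_call(llm: Callable[[str], str], prompt: str,
                    simpler_prompt: str) -> Optional[dict]:
    """Retry once with a simpler prompt if the first output cannot be parsed."""
    for attempt_prompt in (prompt, simpler_prompt):
        parsed = parse_llm_json(llm(attempt_prompt))
        if parsed is not None:
            return parsed
    return None
```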

Building the event bus as in-process first. An in-process event bus is fast and simple, but events are lost on crash. The durable event bus (backed by Kafka/Redpanda) should have been the default from day one, with in-process as the development-only option, not the other way around.

Not investing in agent output evaluation early enough. Knowing whether an agent's output is good requires more than human review. Automated evaluation — comparing outputs against known-good examples, scoring for specific quality criteria, tracking quality trends over time — should be built from the beginning, not retrofitted when you notice quality drift.

Over-engineering the memory system before having enough data to tune it. Three-tier memory with hybrid retrieval sounds impressive, but the tuning (similarity thresholds, keyword weights, merge strategies) only works when you have enough real agent interactions to measure against. Start with simple keyword retrieval. Add vector search when you have enough data to evaluate whether it helps.

The architecture is the product

The most important thing I've learned is that in an agent platform, the orchestration layer IS the product. The LLM is a commodity — you can swap Claude for GPT for Gemini and the agent still works. The domain tools are a commodity — you can swap Salesforce tools for ServiceNow tools and the agent still works. What doesn't swap out is the orchestration: how agents reason, coordinate, remember, recover, and degrade gracefully.

If you're building an agent system, invest your engineering time in the orchestration layer. Make it domain-agnostic. Make the tools pluggable. Make the LLMs swappable. Build the memory and retrieval system once, well. Get the guardrails and self-healing right. Everything else is configuration.

The companies that win the agent race won't be the ones with the best LLMs. They'll be the ones with the best orchestration, the best tool ecosystems, and the most disciplined approach to production reliability. The architecture is the moat.

Manimaran Gunasekaran is an AI platform builder and enterprise systems architect. He is the founder of ASDA (asdaagent.com), a multi-agent AI platform, and InstaTools.co, a compliance API platform. Previously, he led CRM platform engineering at Block (Square, Cash App, Afterpay) where he built an 85-person engineering organization and managed a $30M operating budget. He holds Salesforce System Architect and Application Architect certifications. Connect on LinkedIn: linkedin.com/in/gunman
