AI agent systems need structure, not just bigger models

Building an orchestration system showed me that sequential context management beats throwing more tokens at the problem.


Everyone’s building AI agent systems now. The pitch is compelling: describe a task in plain language, and an autonomous agent will figure out the steps, execute them, and deliver a result. In practice, it’s a mess. The agent misunderstands, writes broken code, or goes off on a tangent. We’re building towers of complexity on a fundamentally stochastic foundation—large language models that, at their core, are just selecting probable next tokens.

01 The optimization trap

The standard playbook is to optimize the LLM: use a more powerful model, implement sophisticated RAG to cram more relevant context into the window, engineer better prompts. This treats the LLM as the primary decision-maker and tries to make it smarter. But it creates new problems. More context often leads to worse performance: the model gets confused, hallucinates more, and the error margin never disappears. It's also a liability. UnitedHealth has faced lawsuits alleging that its AI made coverage decisions with a 90% error rate. When an agent violates compliance rules or generates defamatory content, that's a real legal problem, not just a technical bug.

02 Building an orchestration system

I recently built an AI agent orchestration system where developers describe a task and the system handles agent creation, execution, and state management. The first challenge was integration: implementing a factory pattern to support different LLM providers (OpenAI, Anthropic, etc.). The landscape changes weekly, so keeping up is a moving target.
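
To keep provider differences out of the orchestration logic, the factory hides each vendor's SDK behind one interface. Here is a minimal sketch of that pattern; the names (LLMClient, make_client, and the stubbed provider clients) are illustrative placeholders, not the actual system's API.

```python
from abc import ABC, abstractmethod

class LLMClient(ABC):
    """Minimal provider-agnostic interface the orchestrator talks to."""

    @abstractmethod
    def complete(self, prompt: str) -> str:
        ...

class OpenAIClient(LLMClient):
    def complete(self, prompt: str) -> str:
        # Call the OpenAI API here; stubbed out in this sketch.
        raise NotImplementedError

class AnthropicClient(LLMClient):
    def complete(self, prompt: str) -> str:
        # Call the Anthropic API here; stubbed out in this sketch.
        raise NotImplementedError

def make_client(provider: str) -> LLMClient:
    """Factory: the rest of the system never imports a provider directly."""
    clients = {"openai": OpenAIClient, "anthropic": AnthropicClient}
    try:
        return clients[provider]()
    except KeyError:
        raise ValueError(f"Unknown provider: {provider}")
```

Adding a new provider then means writing one subclass and one registry entry, without touching the rest of the pipeline.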

The real issue surfaced during execution. The system’s behavior was inconsistent across runs. Upgrading to a more capable model helped, but only partially. The core problem was context management. Even with a 1M token context window, the issue wasn’t capacity—it was relevance and structure.

The initial architecture used threads to create agents in parallel. Each agent would generate system messages (status updates, partial results, errors), and all of these were dumped into the prompt context for the next decision step. This created dozens of disjointed messages, bloating the context and confusing the LLM. It couldn’t follow the narrative.
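
Reduced to its essentials, the problem looked roughly like this: several threads write their status messages into one shared list, and the final ordering depends entirely on thread scheduling. The names here are hypothetical, but the interleaving is the point.

```python
import threading

context: list[str] = []   # shared prompt context for the next decision step
lock = threading.Lock()

def run_agent(name: str, status: str) -> None:
    # Each agent appends its status message whenever it happens to produce it.
    with lock:
        context.append(f"{name}: {status}")

threads = [
    threading.Thread(target=run_agent, args=("Agent A", "analyzing task...")),
    threading.Thread(target=run_agent, args=("Agent B", "fetching data...")),
    threading.Thread(target=run_agent, args=("Agent C", "parsing results...")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The ordering depends on thread scheduling, so the LLM sees an
# interleaved, causally disconnected heap of messages.
print(context)
```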

Parallel vs. Sequential Context

Parallel (initial): a single step ("Analyze task") spawns Agents A, B, and C at once.
  Context: [A: analyzing..., B: fetching..., C: parsing...] (interleaved, noisy)

Sequential (fixed):
  Step 1: Analyze task. A single agent registers its result into the context.
  Step 2: Fetch data. Uses Step 1's context, registers its result.
  Step 3: Parse results. Uses the Step 1 and 2 context, registers its result.

03 Sequential context management

I switched from parallel to sequential agent creation. Each component in the system became responsible for registering its state into a shared context in a strict, sequential order. No more dumping every thread’s output simultaneously. The context became a linear story: step 1 happened, here’s the result; step 2 used that result, here’s its output; and so on.
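
A minimal sketch of that idea, assuming a simple shared-context object with a register method (SharedContext, register, and run_pipeline are placeholder names, not the system's real API): each step reads everything registered so far, does its work, and appends exactly one entry before the next step runs.

```python
from dataclasses import dataclass, field

@dataclass
class SharedContext:
    """Linear log of what each step did, in the order it happened."""
    entries: list[str] = field(default_factory=list)

    def register(self, step: str, result: str) -> None:
        self.entries.append(f"{step}: {result}")

    def as_prompt(self) -> str:
        # The next decision step sees a single causal narrative.
        return "\n".join(self.entries)

def run_pipeline(steps, ctx: SharedContext) -> SharedContext:
    # Each step reads everything registered so far and appends exactly one entry.
    for name, agent_fn in steps:
        result = agent_fn(ctx.as_prompt())
        ctx.register(name, result)
    return ctx

# Hypothetical usage: each agent_fn would wrap an LLM call in the real system.
ctx = run_pipeline(
    [
        ("Step 1: Analyze task", lambda prior: "task broken into subtasks"),
        ("Step 2: Fetch data", lambda prior: "data retrieved using the Step 1 plan"),
        ("Step 3: Parse results", lambda prior: "parsed output ready"),
    ],
    SharedContext(),
)
print(ctx.as_prompt())
```

Because entries are appended in causal order, the prompt each step receives reads as a single narrative rather than a heap of interleaved status messages.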

Decision reliability improved dramatically. The solution wasn’t just about reducing token count—it was about preserving logical relationships. When information flows in a causal sequence, LLMs can follow the reasoning. They’re inherently better at processing narrative than sorting through a heap of disjointed messages.

04 Industry and academic validation

This isn’t just my experience. Amazon’s KIRO IDE enforces a similar structure for AI-assisted development: define requirements first, get an action proposal, approve it, then execute tasks sequentially. It forces a spec-driven, step-by-step workflow.

Academic research is catching up. The “Chain of Agents” paper (Cornell, June 2024) found that sequential agent collaboration outperforms traditional methods by up to 10%. They note that current approaches “struggle with focusing on the pertinent information”—exactly the context management problem I hit. Their solution uses “worker agents who sequentially communicate” and “interleaving reading and reasoning,” mirroring the sequential pipeline approach.

05 Conclusion

Throwing more context or more powerful models at the problem doesn’t fix the underlying issue. Structure matters. A sequential, narrative context allows LLMs to reason more effectively, reduces hallucinations, and makes the system’s behavior predictable and debuggable. As KIRO puts it: “Bring structure to AI coding.” That applies to most agentic workflows. Reliability comes from constraining the chaos, not amplifying it.