Stop Treating AI Like Magic: Why Context Engineering Beats Bigger Models

Smaller models with a disciplined context rival flagships at a fraction of the cost

Duncan Anderson
2025-10-04

Your AI agent fails when you give it 50 documents, 30 tools, and a complex multi-step prompt. Not because the model is bad, but because you're overwhelming its attention span with too much information.

The typical response is to upgrade to a more expensive model. But this misses the fundamental issue: even the smartest model can't process what it can't focus on. More intelligence doesn't fix information overload.

Context Engineering solves this through disciplined software engineering: deciding which tools to expose, which documents to load, how to retrieve information from disparate sources, and how to structure what the model sees. This post is an initial introduction to Context Engineering; further posts in the series from the Barnacle Labs team will go deeper into some of the engineering solutions we've been deploying.

The term was first coined by Tobias Lütke of Shopify, who described it as "the art of providing all the context for the task to be plausibly solvable by the LLM".

What counts as context

The context is everything the model sees before it answers:

  • System Prompt: The behavioural guidelines and rules that shape the model's responses
  • User Prompt: The immediate request or question
  • Conversation State: The current conversation history and state
  • Available Tools: The functions the model can use to retrieve further information
  • Memory: Persistent knowledge from past interactions, user preferences, and learned patterns
  • Retrieved Information: Real-time data pulled from documents, databases, or APIs

Memory and Retrieved Information are typically added selectively to the user prompt based on the specific task.

The Context 'Stack'
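To make the stack concrete, here's a minimal sketch of how these layers typically come together in a chat-completion-style request. The payload shape follows the common messages-plus-tools convention; the assistant persona, invoice data and get_invoice tool are purely illustrative.

```python
# Hypothetical request payload showing how the context "stack" is assembled.
# The structure follows the common chat-completion style; the data and the
# get_invoice tool are illustrative only.
request = {
    "model": "some-model",
    "messages": [
        # System Prompt: behavioural guidelines and rules
        {"role": "system", "content": "You are a support assistant. Answer only from provided data."},
        # Conversation State: prior turns carried forward
        {"role": "user", "content": "Hi, I can't see my last invoice."},
        {"role": "assistant", "content": "I can help with that. What's the email on the account?"},
        # User Prompt, with Memory and Retrieved Information selectively
        # appended for this specific task
        {"role": "user", "content": (
            "jane@example.com\n\n"
            "[memory] Customer prefers concise answers.\n"
            "[retrieved] Invoice #1042 issued 2025-09-30, status: paid."
        )},
    ],
    # Available Tools: functions the model can call to retrieve further information
    "tools": [
        {
            "name": "get_invoice",
            "description": "Fetch a single invoice by its number.",
            "parameters": {
                "type": "object",
                "properties": {"invoice_number": {"type": "string"}},
                "required": ["invoice_number"],
            },
        },
    ],
}
```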

Why Models sometimes "lose the plot"

Human working memory holds 7±2 items, or possibly closer to 4. LLMs have similar attention limits. When context contains contradictions, the model picks whichever signal seems strongest, leading to inconsistent outputs. Research on contradiction detection in RAG systems shows this remains challenging even for advanced models.

Context Engineering is the science of providing the contextual information a model needs to answer a query (e.g. private information extracted from your databases or APIs) that it won't have unless you supply it. But it's also the art of ensuring the context is sufficient without being overwhelming.

Why This Matters Now More Than Ever

As agents get more capable, the bottleneck shifts from model intelligence to attention allocation.

LLM context windows have grown exponentially, from the 4k tokens of the original GPT models to 100k+ as standard and 1M+ in some models. Larger context windows bring a danger of lazy design: windows loaded with so much vague or inconsistent information that the model gets confused. To get good results we need to apply discipline to how we fill those large context windows.

This challenge intensifies as agents operate over multiple turns: each loop generates more data that could be relevant, but context windows are finite. Without careful curation, agents face two critical failure modes. First, they can overflow their available context, losing important information. Second, they can suffer from context poisoning—when too much irrelevant information or contradictory data confuses the model, leading to unreliable or erroneous outputs. Context Engineering addresses both challenges by ensuring that even as information accumulates, only what's truly relevant and coherent makes it into the model's limited attention span.

Two Approaches to Context

| Approach 1 | Approach 2 |
| --- | --- |
| "It's a frontier model, it can handle 50 tools" | "Each tool needs to be obvious and distinct" |
| "Dump all 47 documents, it'll figure it out" | "Which 5 documents are essential for THIS task?" |
| "One complex prompt can do everything" | "Break it into simple, sequential steps" |
| "More context = better results" | "Precise, relevant context = better results" |

The second approach produces more reliable outputs because it reduces what competes for the model's attention.

Tools Are Context Too

A common mistake is wrapping an existing CRM API with 47 parameters into a single 'query_customer' tool. The model will attempt confused tool calls, or more likely just refuse to make any.

Instead, break it into simple, purpose-specific tools: 'get_customer_by_email', 'get_recent_orders', 'check_account_status'. Each tool does one thing with a handful of clear parameters. Tool calling accuracy improves when you simplify what the model needs to understand.
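As an illustration, the purpose-specific tools might be defined something like the sketch below. The JSON-schema parameter blocks are assumptions for the sake of the example; adapt them to your own CRM's fields.

```python
# Illustrative tool definitions: each does one thing with a handful of clear parameters.
CRM_TOOLS = [
    {
        "name": "get_customer_by_email",
        "description": "Look up a single customer record by email address.",
        "parameters": {
            "type": "object",
            "properties": {"email": {"type": "string", "description": "Customer email address"}},
            "required": ["email"],
        },
    },
    {
        "name": "get_recent_orders",
        "description": "Return the most recent orders for a customer.",
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "limit": {"type": "integer", "description": "Maximum number of orders to return"},
            },
            "required": ["customer_id"],
        },
    },
    {
        "name": "check_account_status",
        "description": "Report whether a customer's account is active, suspended or closed.",
        "parameters": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
    },
]
```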

A related problem is MCP (Model Context Protocol) servers that frequently expose many tools (20+ isn't uncommon). These can overwhelm a model, leading to confused calls or refusal to call any tool. To avoid this, prune the available tools to only those needed for the task.
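The pruning itself can be as simple as an allowlist applied to whatever tool list your MCP client returns. This is a sketch rather than a specific SDK call, and the tool names are the hypothetical CRM tools from above.

```python
# Hypothetical pruning step: expose only the tools this task actually needs.
# `server_tools` would come from listing the MCP server's tools with your
# client library of choice.
ALLOWED_TOOLS = {"get_customer_by_email", "get_recent_orders", "check_account_status"}

def prune_tools(server_tools: list[dict]) -> list[dict]:
    """Filter a server's full tool list down to a task-specific allowlist."""
    return [tool for tool in server_tools if tool["name"] in ALLOWED_TOOLS]
```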

Retrieved Information — More Than RAG

RAG (Retrieval Augmented Generation) typically means embeddings and vector stores. But the field now includes Graph-RAG, hybrid search, Agentic RAG, Query Generation, and other techniques. I use "Retrieved Information" to capture this broader range of options.

The approach has also shifted: earlier RAG systems tried to load all potentially relevant documents into context upfront. Now systems give the model tools to retrieve information on-demand from files, databases, or API endpoints. The model loads information only when it actually needs it, keeping context lean while ensuring it can access everything it requires.

Three Context Engineering Patterns

To give a flavour of the trade-offs involved in Context Engineering, here are a few patterns we've spotted in our work.

Context Engineering Patterns

1. Dynamic Retrieval Over Pre-Loading

Magical thinking: Load everything the AI might need upfront into context—better to have it and not need it than need it and not have it.

Confusion-aware design: Give the AI tools to retrieve what it actually needs, when it needs it. Let it control its own attention.

Pre-loading information that might not be needed can confuse the model by adding unnecessary noise to the context.

For one client, we built a RAG system that automatically added potentially useful documents to the context when a user's question was received.

We improved answer accuracy by stopping the automatic document loading and instead giving the AI a tool to retrieve documents only when it determined it needed them.

This shift from always filling the context to hiding RAG behind a tool worked because documents only entered the context when the AI specifically requested them. If the AI instead decided to use a tool that made an API call, the documents never entered the context at all.
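In outline, the change looked something like the sketch below. The names vector_store, llm, build_prompt and search_documents are placeholders for the client's actual retrieval and model plumbing, not real library calls.

```python
# Before: documents were always pushed into the context when a question arrived.
# vector_store, build_prompt and llm are placeholders for the real plumbing.
def answer_preloaded(question: str) -> str:
    docs = vector_store.search(question, top_k=10)    # runs on every question
    prompt = build_prompt(question, documents=docs)   # context grows whether needed or not
    return llm.complete(prompt)

# After: retrieval is hidden behind a tool the model calls only when it decides
# the documents are actually needed.
SEARCH_TOOL = {
    "name": "search_documents",
    "description": "Search the knowledge base and return the most relevant passages.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def answer_on_demand(question: str) -> str:
    # The model sees the tool, not the documents; if it answers via an API call
    # instead, the documents never enter the context at all.
    return llm.run_with_tools(question, tools=[SEARCH_TOOL])
```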

2. Task Decomposition Over Complex Single Prompts

Magical thinking: Pack all requirements into one sophisticated prompt—it's GPT, it can handle complexity.

Confusion-aware design: One clear task at a time. Trying to juggle multiple goals risks failure.

When a task has multiple requirements, the temptation is to write one elaborate prompt covering everything and hope the model's intelligence handles the juggling act. In practice, this is like asking someone to write a legal brief while doing mental math and remembering a shopping list—technically possible, but not reliably.

We built an image generation system that was occasionally producing culturally insensitive images. The original approach used a single prompt asking the model to simultaneously:

  • Consider cultural sensitivities
  • Generate a detailed image description
  • Format and call the image API

We were overloading a single step, so we split it into sequential steps:

  1. First step (cultural review): Focus solely on creating a detailed description of the image we wanted, ensuring the description was culturally appropriate and factually accurate.
  2. Second step (image generation): Pass that vetted description to the image model with clear formatting instructions.

Problematic outputs dropped to near zero. Each step had one clear job with precisely the context it needed.
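A simplified sketch of that two-step pipeline is shown below. llm.complete and image_api.generate are stand-ins for whichever text and image models you use, and the prompts are abbreviated.

```python
# Hypothetical two-step pipeline; llm.complete and image_api.generate stand in
# for your own model clients.
def describe_image(request: str) -> str:
    """Step 1 (cultural review): produce a detailed, culturally appropriate and
    factually accurate description of the requested image, and nothing else."""
    prompt = (
        "Write a detailed description of the image requested below. "
        "Check the description for cultural sensitivity and factual accuracy, "
        "and revise it if needed. Return only the final description.\n\n"
        f"Request: {request}"
    )
    return llm.complete(prompt)

def generate_image(description: str) -> bytes:
    """Step 2 (image generation): pass the vetted description to the image model
    with clear formatting instructions."""
    return image_api.generate(prompt=description, size="1024x1024")

def create_image(request: str) -> bytes:
    # Each step has one clear job with precisely the context it needs.
    return generate_image(describe_image(request))
```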

3. External Memory Over Context Accumulation

Magical thinking: Keep everything in context—the AI needs to 'remember' all findings to make good decisions.

Confusion-aware design: Context is working memory, not long-term storage. Offload ruthlessly.

As agents loop through multi-step tasks, the default is to accumulate everything—every tool result, every intermediate discovery—assuming the AI needs it all to maintain coherence. But just like a person can't hold 40 things in working memory, neither can an AI maintain focus with bloated context.

One agent we built was coordinating a number of sub-agents, each researching a different topic and writing its findings into context. The volume of context was so large that the agent framework automatically compacted it, losing information in the process.

The fix was to stop writing sub-agent results into context and instead have them write to files. Each sub-agent created a separate file of its results and the coordinating agent was given tools to read those files.

As a result, the agent remained sharply focused on the immediate task, and the quality of its results improved significantly.
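In simplified form, the pattern looks like the sketch below. The file layout, the sub_agent.research call and the tool functions are illustrative rather than our production code.

```python
from pathlib import Path

RESULTS_DIR = Path("research_results")
RESULTS_DIR.mkdir(exist_ok=True)

def run_sub_agent(topic: str) -> str:
    """Each sub-agent writes its findings to its own file instead of dumping
    them into the coordinator's context; only the file name is returned."""
    findings = sub_agent.research(topic)              # placeholder for the sub-agent call
    path = RESULTS_DIR / f"{topic.replace(' ', '_')}.md"
    path.write_text(findings)
    return path.name

# Tools given to the coordinating agent so it can pull findings in on demand.
def list_result_files() -> list[str]:
    """List the result files written by sub-agents."""
    return sorted(p.name for p in RESULTS_DIR.glob("*.md"))

def read_result_file(name: str) -> str:
    """Read one sub-agent's findings only when the coordinator needs them."""
    return (RESULTS_DIR / name).read_text()
```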

The Business Case

Good Context Engineering also reduces costs. Careful context design often means a cheaper model (e.g. Gemini Flash) works, avoiding the need to upgrade to a more expensive model (e.g. Gemini Pro).

With premium models often costing several times more than their smaller siblings, there's a clear financial incentive to invest in Context Engineering before upgrading. Restructuring tool definitions, tightening instructions, and loading context selectively can let you ship with a base-tier model at a fraction of the cost of a flagship one. Base/small models are typically 3–10× cheaper than flagship tiers, so the savings are significant.

Concluding Thoughts

Context Engineering is becoming a core competency for building effective AI systems.

The developers building reliable production systems don't treat AI as magical. They design like they're managing a brilliant but easily overwhelmed assistant: one clear task, the exact information needed, nothing more.

This approach requires:

  • Understanding your specific use case deeply
  • Identifying what information is crucial for different tasks
  • Building systems to retrieve and format that information dynamically
  • Orchestrating tools, databases, files and APIs

The same principle applies whether you're managing a human intern or an AI: a brilliant mind in a noisy room with five simultaneous tasks will underperform. That same mind in a quiet room with one clear task and the right context will excel.

With careful context design, smaller, cheaper models can deliver results that rival expensive ones. Context discipline isn't just about reliability; it's about 3–10× cost savings through thoughtful design.

This post introduced the core principles of Context Engineering. Future posts in this series will explore the technical implementation patterns in depth.


References

Foundations & terminology

[1] Karpathy, A. (2025). “+1 for ‘context engineering’ … the delicate art and science of filling the context window …” [Tweet]. https://x.com/karpathy/status/1937902205765607626

[2] Lütke, T. (2025). “…context engineering… the art of providing all the context for the task to be plausibly solvable by the LLM.” [Tweet]. https://x.com/tobi/status/1935533422589399127

[3] Anthropic. (2025). Effective context engineering for AI agents. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

[4] Latent Space. (2025). Context Engineering for Agents (video). https://www.youtube.com/watch?v=_IlTcWciEC4

Protocols & tooling

[5] Anthropic. (2024). Model Context Protocol (MCP). https://www.anthropic.com/news/model-context-protocol

Long-context model windows

[6] Anthropic Docs. Context windows / models overview (incl. 1M tokens for Sonnet 4/4.5). https://docs.anthropic.com/en/docs/about-claude/models

[7] Anthropic. (2025). Claude Sonnet 4 now supports 1M tokens of context. https://www.anthropic.com/news/1m-context

[8] Google AI (Developers). Long context with models of 1M+ tokens. https://ai.google.dev/gemini-api/docs/long-context

[9] Google AI (Developers). (2024). Changelog—Gemini 1.5 Pro 2M context GA. https://ai.google.dev/gemini-api/docs/changelog

[10] Google Blog. (2024). Introducing Gemini 1.5… 1M–2M context windows. https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/

Working-memory capacity

[11] Miller, G. A. (1956). The Magical Number Seven, Plus or Minus Two. Psychological Review, 63, 81–97.

[12] Cowan, N. (2010). The Magical Mystery Four: How Is Working Memory Capacity Limited, and Why? Current Directions in Psychological Science, 19(1), 51–57. DOI: 10.1177/0963721409359277

Contradictions in retrieval / RAG

[13] Gokul, V., Tenneti, S., & Nakkiran, A. (2025). Contradiction Detection in RAG Systems: Evaluating LLMs as Context Validators… arXiv:2504.00180. https://arxiv.org/abs/2504.00180

[14] Wang, H., Prasad, A., Stengel-Eskin, E., & Bansal, M. (2025). Retrieval-Augmented Generation with Conflicting Evidence. arXiv:2504.13079. https://arxiv.org/abs/2504.13079

Retrieved Information (beyond “RAG”)

[15] Microsoft Research. (2024). GraphRAG: new tool for complex data discovery. GitHub repo: https://github.com/microsoft/graphrag

[16] Pinecone. Hybrid search (sparse + dense) intro. https://www.pinecone.io/learn/hybrid-search-intro/

[17] IBM Think. What is Agentic RAG? https://www.ibm.com/think/topics/agentic-rag

[18] LangChain Docs. Build a Q&A system over SQL data (query generation). https://python.langchain.com/docs/tutorials/sql_qa/

Pricing (for the cost comparison claim)

[19] OpenAI. API Pricing (GPT-4.1 vs. GPT-4.1 mini, etc.). https://openai.com/api/pricing/

[20] Anthropic Docs. Claude model pricing (incl. long-context rates). https://docs.anthropic.com/en/docs/about-claude/pricing