It’s not in the stdlib - Prompt like a data engineer
Your prompts are code, your model is the interpreter, and nothing composes cleanly.
In the previous post in this series we looked at how Context Engineering has emerged as an important discipline and what it is. In this post Peter looks at how the prompt itself is a critical part of the broader Context Engineering story.
As the first post in a series of “Adventures in Context Engineering”, I would like to introduce some loose analogies I often find useful when thinking about how to get LLMs to do what I want.
Latent space
Chris Olah has some amazing posts I come back to from time to time. My favourite is Neural Networks, Types, and Functional Programming. Here’s an excerpt:
“With every layer, neural networks transform data, molding it into a form that makes their task easier to do. We call these transformed versions of data ‘representations.’”
When prompting a language model your task is to navigate and bound this space of representations, commonly referred to as the latent space. The latent space is an approximation of the training data. Generative models like LLMs sample from this space to generate new outputs. A prompt creates anchors in this space that together constrain the text generation to stay within a smaller area of that space. That area is context.
![Sampling from latent space: in this case the model is generating new molecules.](/posts/2025-09-27-pater-on-ce/image1.png)
![The encoder f maps high-dimensional data x to low-dimensional latents z. The decoder g tries to map z back to x. We always learn a k-dimensional submanifold M, which is reliable only where we have many samples z. In regions where we don't have many samples, the decoder g isn't reliable: we're basically extrapolating (i.e., guessing) what the true data manifold looks like. The diagram suggests this idea by "cutting off" the manifold, but in reality there is no clear, hard cutoff.](/posts/2025-09-27-pater-on-ce/image2.png)
But we all intuitively know what context is: put a banana peel in a museum and it will be protected, but put it near a trash can and it will be rubbish.
Prompts are programs
Reading Chris Olah’s post further:
“Representations correspond to types. At their crudest, types in computer science are a way of embedding some kind of data in n bits. Similarly, representations in deep learning are a way to embed a data manifold in n dimensions… Recurrent Neural Networks are just folds. They’re often used to allow a neural network to take a variable length list as input, for example taking a sentence as input... It feels like a new kind of programming altogether, a kind of differentiable functional programming. One writes a very rough functional program, with these flexible, learnable pieces, and defines the correct behavior of the program with lots of data.”
“Fold”, if you are not familiar with the term, is equivalent to “reduce” in JavaScript and Python: a loop over a sequence in which you accumulate whatever is important to you into a single output. It does not limit what you consider important to accumulate; that is up to the lambda function to define.
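A minimal sketch in Python, using the built-in `functools.reduce`, just to ground the term:

```python
from functools import reduce

# Fold/reduce: walk a sequence once, accumulating whatever the
# lambda decides is important into a single result.
numbers = [1, 2, 3, 4]
total = reduce(lambda acc, x: acc + x, numbers, 0)  # 10

# The accumulator need not be a number; here it collects words.
words = ["context", "engineering", "matters"]
sentence = reduce(lambda acc, w: acc + " " + w, words)  # "context engineering matters"
```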
Why am I bringing this up? “Context engineering”, “RAG is dead”, and all these other headlines highlight that there’s a missing understanding of what prompts are and why they matter in practice. In my opinion, prompts are code in the strictest sense: a true homoiconic programming language.
Homoiconicity basically means that code is data and data is code. But let’s take the definition from "Homoiconicity revisited":
“procedures and text have the same representation inside and outside the processor, the term homo-iconic is applicable, from homo meaning the same, and icon meaning representation.”
This is a term from the very early days of AI when everything related was still coded in Lisp. Seemingly only functional programmers care about stuff like this, but I think the analogy cuts deep:
In our programming language the prompt is the code, but the prompt is also the data we process with it. The only thing missing from the picture is the interpreter that runs this program: the language model.
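A toy sketch of that loop, where `generate` is a hypothetical stand-in for whichever LLM API you call: the same string is both the program we run and the data we keep processing.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: returns the model's continuation."""
    ...  # e.g. a request to whatever inference API you use

# The prompt is the code...
prompt = "Q: What is a fold?\nA:"

# ...and also the data: each output is appended back into the prompt,
# which becomes the next program handed to the interpreter (the model).
for _ in range(3):
    completion = generate(prompt)
    prompt += completion
```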
Glossary
- Latent space: The model's internal representation of all concepts it learned from training data.
- Homoiconicity: When code and data have the same representation (like Excel formulas that are both text and executable).
- Fold/Reduce: Processing a list one item at a time to build up a result (like summing numbers).
- Partial function application: Setting some parameters of a function now, others later (like currying in functional programming).
- stdlib (standard library): The built-in capabilities a system already knows how to do.
The lambda function in our “fold” is the sampling of the latent space conditioned on the prompt. On one hand the prompt is just a series of tokens, a string, e.g. “hello ”; on the other hand the prompt is a partial function application! It predefines an area of the learned data distribution from which we are going to sample the next token. If we partially apply “hello ” to the sampling function, then the next token will likely be one of “world”, “mr. smith”, etc., and less likely be one of “law”, “letter”, etc. For “law” to be the most likely next token we would need to be in a different area of our latent space. Here the prompt “hello ” effectively encodes all the possible results that might come after it, but we only get one of them when we apply it to a model. Hence it’s a partial function application until we sample a token.
![A series of functions where we apply the next function to the output of the previous. “double” is a partial function application of the function multiply, limiting the possible outcomes from all numbers to just even numbers.](/posts/2025-09-27-pater-on-ce/image3.png)
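The analogy maps straight onto `functools.partial`. Below, `sample_next_token` is a hypothetical stand-in for the model's sampling step; fixing the prefix “hello ” narrows the likely continuations, just as `double` narrows `multiply` to even numbers.

```python
from functools import partial

def multiply(a: int, b: int) -> int:
    return a * b

double = partial(multiply, 2)  # every possible output is now an even number
double(21)                     # 42

def sample_next_token(prefix: str) -> str:
    """Hypothetical stand-in for sampling one token from p(token | prefix)."""
    ...

say_hello = partial(sample_next_token, "hello ")
# Likely continuations: "world", "mr. smith", ...
# Unlikely continuations: "law", "letter", ...
next_token = say_hello()
```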
Declarative programming with examples
So some tokens are data and some are code, but unlike in a typical programming language this separation is never clear cut. Every token you put in the context will pull the model in a certain direction.
“Few-shot examples” are so effective because they create a sequence that can be easily continued. They are the simplest possible program, and any LLM can continue them to a degree. Similarly, “instruction following”, unlike what the name would have you believe, has less to do with following instructions than with imitating expected turns of speech, e.g. questions are followed by answers. The models were literally trained with millions of examples of, you guessed it, questions followed by answers. These examples shape the distribution of possible outcomes to resemble them.
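A minimal few-shot prompt, just to make “a sequence that can be easily continued” concrete (the examples and labels are made up):

```python
few_shot_prompt = """\
Review: "Absolutely loved it, would buy again."
Sentiment: positive

Review: "Broke after two days, waste of money."
Sentiment: negative

Review: "Arrived on time, does what it says."
Sentiment:"""
# The Review -> Sentiment pattern is the program; the model's most likely
# continuation is simply the next element of the sequence.
```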
I imagine it might be strange to call this programming for someone who has only ever programmed in imperative languages. In an imperative language you describe the steps of how to do something; in a declarative language you only describe the outcome you want and let the machine figure out the steps to get there.
The imperative style is not completely missing though; chain-of-thought traces, for example, imitate the reasoning steps one should take to arrive at a good solution. Models are reinforced with the most useful traces of thinking steps, and they learn to reason by imitating these trace examples.
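In the same spirit, a chain-of-thought example is essentially an imperative trace the model can imitate (the worked example below is only an illustration):

```python
cot_prompt = """\
Q: A shop sells pens at 3 for $2. How much do 12 pens cost?
A: 12 pens is 4 groups of 3 pens. Each group costs $2, so 4 * 2 = $8.
The answer is $8.

Q: A train travels 60 km in 40 minutes. How far does it go in 2 hours?
A:"""
# The worked trace demonstrates the steps, not just the answer; the model
# is nudged to imitate the shape of that reasoning.
```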

Programming without composition
From the programmer's perspective, the instruction-following data a model has learned is its “stdlib”: the standard library of procedures it can reliably follow or imitate. In order to write good programs, you need a sense of what common knowledge a model already has, how it can be pulled in the right direction, what procedures can be triggered in it, and how it could be taught something new.
However, there’s a big problem with our “prompts are programs” analogy and with LLMs in general: prompts do not compose cleanly.
In “How Long Contexts Fail” Drew Breunig categorized common prompting failures into: “context poisoning”, “context distraction”, “context confusion”, “context clash”. Here’s my favourite:
“Context Confusion is when superfluous content in the context is used by the model to generate a low-quality response… The Berkeley Function-Calling Leaderboard is a tool-use benchmark that evaluates the ability of models to effectively use tools to respond to prompts. Now on its 3rd version, the leaderboard shows that every model performs worse when provided with more than one tool”
Simply putting irrelevant tool descriptions into the prompt made the responses worse. Let me translate that: every token matters. Unlike in a typical programming language, our functions don’t compose cleanly.
![](/posts/2025-09-27-pater-on-ce/image5.png)
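A sketch of what that failure mode looks like in practice (the tool names and descriptions are made up): only one tool is relevant to the question, yet every definition stays in the context and competes for the model's attention.

```python
# Hypothetical tool definitions passed alongside the user message.
tools = [
    {"name": "get_weather",     "description": "Current weather for a city."},
    {"name": "convert_units",   "description": "Convert between units of measure."},
    {"name": "search_flights",  "description": "Find flights between two airports."},
    {"name": "get_stock_price", "description": "Latest price for a ticker symbol."},
]

user_message = "What's the weather like in Lisbon right now?"
# Only get_weather matters here, but every description above still pulls
# the generation in its own direction.
```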
I think we can easily come up with more classes of prompting issues:
- Undefined return type: the prompt scope is too wide or too narrow and the model wanders. Solution: add a goal, constraints, and a schema (see the sketch after this list).
- Type clash: e.g. acronyms mean different things in different contexts. Solution: define an ontology, tag snippets, normalize terms.
- Not in stdlib: anchors land in “out-of-domain”, poorly trained regions. Solution: bootstrap the model’s knowledge base with examples, definitions and ontologies, either via prompting or fine-tuning.
- Cross-pollination: a special case of “not in stdlib”, for example when you try to use an unsupported capability; the prompt “never write about pirates” would likely put us in the space of “pirates” rather than move us farther away from it. Solution: do not phrase instructions in ways the model was not explicitly trained on.
- Prefix lock-in: the few-shot prefix over-determines the continuation because the examples or trained knowledge lack variance. Solution: teach the model the differences.
- Instruction occlusion (buried spec): instructions get shadowed by examples or a persona. Solution: prioritize by importance, e.g. put the rules first and repeat them; compress examples.

Etc.
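To make the “undefined return type” fix concrete, here is a minimal sketch of adding a goal, constraints, and a schema to a prompt (the task, field names, and schema are made up):

```python
prompt = """\
Goal: Extract the customer's complaint from the email below.

Constraints:
- Use only information stated in the email.
- If no complaint is present, return {"complaint": null, "product": null}.

Return JSON matching this schema:
{"complaint": string | null, "product": string | null}

Email:
{email_text}
"""
# The goal, constraints and schema act as a declared return type:
# they shrink the space of acceptable continuations.
```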
I think doing error analysis and having a common vocabulary of the failures is very important. I also admire and regularly use the proposed solutions like context quarantine. However, I feel like we can generalize all of these into a common mindset: Prompt like a data engineer.
Generalization via data engineering
If every token pulls the model in a certain direction, the question becomes: is that direction still well covered by the data the model was trained on?
When selecting a tech stack for Claude Code, Anthropic’s team followed the easy path:
“The team wanted an “on distribution” tech stack for Claude that it was already good at. TypeScript and React are two technologies the model is very capable with, so were a logical choice. However, if the team had chosen a more exotic stack Claude isn’t that great with, then it would be an “off distribution” stack. Boris sums it up:
"With an off-distribution stack, the model can still learn it. But you have to show it the ropes and put in the work. We wanted a tech stack which we didn't need to teach: one where Claude Code could build itself. And it’s working great; around 90% of Claude Code is written with Claude Code.” - How Claude Code is built
The implication here is that there is no true generalization and that models can only generate valid outputs in domains they have seen before. I am not quite sure of that, since there are clear indications of emergent abilities, but I’ll let the AGI-pilled worry about that. The extreme interpolation ability on in-domain examples is already plenty to automate everything in the world, including further data acquisition for unknown domains.
Building new capabilities into the models is the meat of research. The progress is both breakneck and extremely slow: “what do you mean it does not know how to x”, “oh cool there’s a new model hitting 70.85% on BFCL-v4”.
The reason you see any progress at all is the data engineering done at big (and small) labs that goes into curating training data and evaluations in order to bring problems “in-domain” for the model. These training examples constrain the text generation to resemble more of the “useful conversations and problem solving” and less of the “all possible valid English sentences”.
Let’s look at solving a common LLM issue with data engineering: hallucinations.
“While evaluations themselves do not directly cause hallucinations, most evaluations measure model performance in a way that encourages guessing rather than honesty about uncertainty.
Think about it like a multiple-choice test. If you do not know the answer but take a wild guess, you might get lucky and be right. Leaving it blank guarantees a zero. In the same way, when models are graded only on accuracy, the percentage of questions they get exactly right, they are encouraged to guess rather than say ‘I don’t know.’” (https://openai.com/index/why-language-models-hallucinate/)
Scaling alone does not solve the problem:
“Claim: Avoiding hallucinations requires a degree of intelligence which is exclusively achievable with larger models.
Finding: It can be easier for a small model to know its limits. For example, when asked to answer a Māori question, a small model which knows no Māori can simply say “I don’t know” whereas a model that knows some Māori has to determine its confidence. As discussed in the paper, being “calibrated” requires much less computation than being accurate.”
Here the interpreter just learned about the output type void, and all it took was someone asking for it consistently.
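One way to read that as a data-engineering change is in how answers are graded. A toy scoring rule, assuming we treat “I don't know” as a legitimate output (the exact scores are illustrative):

```python
def grade(answer: str, gold: str) -> float:
    """Toy grader that stops rewarding confident guessing."""
    if answer.strip().lower() == "i don't know":
        return 0.0  # abstaining is neutral, not punished like a wrong guess
    return 1.0 if answer == gold else -1.0  # confident wrong answers cost you

# Under accuracy-only grading, guessing dominates abstaining;
# under this rule, the best policy is to abstain when unsure.
```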
The standard library is just being built
And here is where you come in: the programmer, the data engineer, the AI engineer.
If you want your prompts to compose, you need to prompt, eval, and train for it, similar to how hallucinations can only be eliminated by training for abstention over guessing. It’s your job, too, to teach the model how to orthogonalize concepts and procedures in its latent space. It’s also your job to figure out what is already in its stdlib and to use that, and only that, if you want to avoid undefined behaviour.
Prompts are programs and the model’s training data is their standard library. Effective context engineering means keeping generation inside well-trained areas of the learned data-distribution:
- use skills (APIs) already exhibited by the model
- avoid incompatible function compositions that accidentally shift generation out of domain
- teach new APIs with data, to bring the sought skill in-distribution and to achieve type compatibility between existing APIs
Ultimately your job as an AI engineer is to keep up to date with the models’ capabilities, the skills they lack and to learn how to train in new behaviours.
In the next post we will look at some practical examples of teaching new things to a model, and we will try to answer the age-old question: what is the best way to teach a model: Is it via prompting, fine-tuning, RL, or something else?