Olivia Perell

When long chats slowly change the rules: context drift in AI-assisted coding

Context buildup that silently changes answers

I started treating the model like a teammate I could keep in a thread forever. That was a mistake. Over a few days of back-and-forth the model kept producing answers that matched the last few messages but stopped honoring constraints I had stated at the start. In practice this looks like: I ask for code that must use our v1 client, and after three digressions the model outputs an example using the v2 interface. It reads confident. It is wrong.

The technical reason is attention bias toward recent tokens. The model still has the earlier text, but the effective weight of those tokens decays. I saw this most clearly when we compared multiple models in a shared workspace and the drift patterns diverged. If you want to reproduce my tests, try the same prompt across models in a multi-model chat and watch how quickly answers split; I keep these experiments in a shared place like a shared chat workspace so I can replay the thread.
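
If you want a starting point, here is a minimal sketch of that comparison, assuming a placeholder call_model() that wraps whatever chat client you use; the helper name and the message format are mine, not any particular SDK's.

```python
from difflib import unified_diff

def call_model(model: str, messages: list) -> str:
    # Placeholder: wrap your own chat client here.
    raise NotImplementedError("wrap your chat client here")

def compare_drift(history: list, probe: str, models: list) -> None:
    """Send an identical long history plus one probe question to each model, then diff the answers."""
    thread = history + [{"role": "user", "content": probe}]
    answers = {model: call_model(model, thread) for model in models}
    baseline_model, baseline = next(iter(answers.items()))
    for model, answer in answers.items():
        if model == baseline_model:
            continue
        diff = "\n".join(unified_diff(
            baseline.splitlines(), answer.splitlines(),
            fromfile=baseline_model, tofile=model, lineterm=""))
        print(diff or f"{model}: matches {baseline_model}")
```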

A concrete failure: refactor that broke staging

We let a long thread drive a refactor. The prompt history included a passing mention that the service must not change API semantics. The refactor took many turns as we iterated on naming, batching, and error handling. On the final turn the model suggested swapping a synchronous parser for an async one. Tests passed locally for a subset of cases. Staging crashed under load because an implicit ordering assumption had disappeared.

I traced it back to one line in the model output: it introduced an await where we had relied on synchronous side effects. The model never said "I changed the ordering". It assumed we would notice. We did not. The fix was mundane: require the model to emit an explicit "assumptions" block with every change and run a small, targeted harness that exercised the ordering. After that I started requiring that any suggested refactor include a bullet list of its dependencies and effects.
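
Here is roughly what that gate looks like on my side; the section names ("assumptions", "dependencies", "effects") and the regex are my conventions, and the sample reply is made up.

```python
import re

# Sections required in any model reply that proposes a code change.
# The names are my convention, not something the model knows natively.
REQUIRED_SECTIONS = ("assumptions", "dependencies", "effects")

def missing_sections(reply: str) -> list:
    """Return the required sections absent from the reply; empty means it passes review."""
    missing = []
    for section in REQUIRED_SECTIONS:
        # Accept "Assumptions:" headings or "- assumptions ..." style bullets.
        if not re.search(rf"(?im)^\s*(?:[-*#]+\s*)?{section}\b", reply):
            missing.append(section)
    return missing

sample = "Here is the refactor.\n\nAssumptions:\n- parser stays synchronous\n"
print(missing_sections(sample))  # ['dependencies', 'effects'] -> reject the reply
```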

When tools fail and the model fills the gap

Tool integration made things worse before it made them better. I had an agent that searched our monorepo and then asked the model to edit files. One search call timed out. The agent returned an empty result and the model proceeded, generating edits that referenced helpers that did not exist. The diffs looked plausible so a reviewer approved them. CI failed later with runtime errors that were hard to map back to the thread.

The core mistake was trusting fluent text as evidence that a tool succeeded. I started making tool outputs explicit in the model context and requiring a verification token. If a search returns empty, the model must stop and say "no results". If an external check runs, the model must include the tool's success or failure code inline. I use a separate verification flow for anything that looks like a sourcing problem. When I need a focused sweep of references or changelogs I switch to a sourcing workflow and cross-check with a structured research pass like the one I keep in a deep research workspace.
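
A sketch of the wrapper I mean, assuming ripgrep as the search tool and a made-up query; the point is that the status token lands in the model context verbatim, and anything other than OK stops the run instead of letting the model improvise.

```python
import subprocess

def run_search(query: str, timeout_s: float = 10.0) -> dict:
    """Run a repo search and return an explicit status the model cannot paper over."""
    try:
        proc = subprocess.run(["rg", "--line-number", query],
                              capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"status": "TOOL_TIMEOUT", "results": ""}
    if proc.returncode not in (0, 1):  # rg exits 1 on "no matches", >1 on real errors
        return {"status": f"TOOL_ERROR exit={proc.returncode}", "results": ""}
    if not proc.stdout.strip():
        return {"status": "NO_RESULTS", "results": ""}
    return {"status": "OK", "results": proc.stdout}

def tool_block(call: dict) -> str:
    """Format the result so status and output enter the model context together."""
    return f"[tool:search status={call['status']}]\n{call['results']}"

result = run_search("LegacyParser")  # made-up identifier
if result["status"] != "OK":
    raise SystemExit(f"Search did not succeed: {result['status']}")
```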

Small hallucinations become large CI incidents

One hallucinated config key slipped into a logging setup and silently changed verbosity. Lint passed. Types passed. Nobody noticed until a downstream service crashed because it expected a slightly different JSON field. Small hallucinations compound: one line wrong in a helper, another line wrong in its consumer, then an integration test that only runs on a schedule fails in production. The chain is what makes it dangerous.
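
The check that would have caught it is boring: validate keys against an allowlist before the config is applied. The allowed keys below are illustrative; the real set comes from whatever component consumes the config.

```python
# Reject unknown keys before a config is applied instead of letting them
# fall through silently.
ALLOWED_LOGGING_KEYS = {"level", "format", "handlers", "propagate"}

def validate_config(section: str, config: dict, allowed: set) -> None:
    unknown = set(config) - allowed
    if unknown:
        raise ValueError(
            f"{section}: unknown keys {sorted(unknown)} (allowed: {sorted(allowed)})"
        )

try:
    validate_config("logging", {"level": "INFO", "verbosity": 3}, ALLOWED_LOGGING_KEYS)
except ValueError as err:
    print(err)  # logging: unknown keys ['verbosity'] ...
```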

Practical guardrails I actually use

I added a few low-friction rules that stopped most of the surprises. Reset the conversation after N substantive turns. Insist on an "assumptions" and a "known unknowns" section in every model response that modifies code. Log the model output and the tool outputs together with a hash and a model version. Automate a small harness that runs the exact scenario the model changed. If that harness fails, block the change until someone acknowledges the mismatch.
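
For the logging piece, something like this is enough; the file name and field names are my own, not a standard.

```python
import hashlib
import json
import time
from pathlib import Path

AUDIT_LOG = Path("ai_change_log.jsonl")  # location is arbitrary

def record_change(model: str, model_version: str, prompt: str,
                  model_output: str, tool_outputs: list) -> str:
    """Append one audit record pairing model output with tool outputs; return its hash."""
    payload = {
        "ts": time.time(),
        "model": model,
        "model_version": model_version,
        "prompt": prompt,
        "model_output": model_output,
        "tool_outputs": tool_outputs,
    }
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    payload["sha256"] = digest
    with AUDIT_LOG.open("a") as fh:
        fh.write(json.dumps(payload) + "\n")
    return digest
```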

We also reduced silent acceptance in code review. Any change that came from an AI has to include the prompt used and the top three model replies. I pin model versions for long-living flows and fail on timeouts from tools instead of letting the model fabricate a response. None of these are magic; they are ways to force verification early and to make the model fail loudly when assumptions drift. The surprising part was how often a simple reset or a short harness caught problems that would have taken hours to diagnose later.
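
The review gate can be a few lines of CI. The section names and the stdin-based invocation below are our conventions, so treat this as a sketch rather than a drop-in.

```python
import re
import sys

# Pre-merge gate: an AI-assisted change must carry the prompt, the candidate
# replies, and a pinned model version in its PR description.
REQUIRED = {
    "prompt used": r"(?im)^#{0,3}\s*prompt\b",
    "top three model replies": r"(?im)^#{0,3}\s*(?:top\s+)?(?:3|three)\s+(?:model\s+)?replies\b",
    "pinned model version": r"(?im)^model[\s-]*version:\s*\S+",
}

def missing_fields(description: str) -> list:
    return [name for name, pattern in REQUIRED.items()
            if not re.search(pattern, description)]

if __name__ == "__main__":
    missing = missing_fields(sys.stdin.read())
    if missing:
        sys.exit("PR description missing: " + ", ".join(missing))
```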
