
Mark k

Context drift and hidden errors in long AI-assisted coding sessions

When long chats stop being helpful

I used to keep a single conversation open while iterating on a feature for days. It felt efficient: code, ask, tweak, repeat. What happened instead was subtle and frustrating. Early decisions about style, framework versions, and safety constraints gradually lost influence. The model kept answering every new question, but the pattern of answers shifted. At first the changes were cosmetic: different variable names, a different testing idiom. Then a suggestion that assumed a different major version of a dependency slipped in, and we pasted it into a branch. The code compiled locally but failed in CI because the runtime assumptions had quietly changed.

Context is noisy; recent tokens win

Long threads are dominated by the most recent tokens. I learned this the hard way when a three-day debugging session produced a patch that referenced an API we had explicitly ruled out on day one. The model had the original constraint in the transcript, but attention favored later messages where we experimented with examples from a different stack. The problem was never that the model "forgot" in a human way; our constraint was simply far enough back in the window that it was underweighted. We started forcing short recaps into the prompt: each new turn begins with a one-sentence anchor that repeats the critical constraints and the allowed libraries. That reduced regressions, although it made the chat longer and a little clunkier.
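Here is roughly what the anchor looks like in practice. This is a minimal sketch, assuming a generic chat client that takes a list of role/content messages; the constraint string and function names are placeholders for whatever your project has actually ruled in or out.

```python
# Minimal sketch of the per-turn anchor. The constraint text is illustrative;
# swap in whatever your project has pinned down.

CONSTRAINT_ANCHOR = (
    "Constraints: Python 3.11, SQLAlchemy 1.4 only (no 2.x APIs), "
    "pytest for tests, no new third-party dependencies."
)

def build_turn(history: list[dict], user_message: str) -> list[dict]:
    """Prepend the anchor so the constraints sit near the most recent tokens every turn."""
    return history + [
        {"role": "user", "content": f"{CONSTRAINT_ANCHOR}\n\n{user_message}"}
    ]
```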

Hidden assumptions we accidentally baked in

We also let the model turn plausible guesses into hidden requirements. For example, an assistant tool call timed out while we were asking for test output. The model generated a continuation anyway and included an assertion that passed locally. We merged the change because the narrative read as convincing and there was no explicit tool failure recorded. Later, a runtime error showed the assertion had been based on a mocked behavior that never matched production. The missing record of the tool failure is what annoyed me most. If a tool fails, the agent should not be able to pretend it got valid output.

Another case: we asked for migration guidance and the model suggested a schema change that assumed a background job processed records in a certain order. That assumption appeared nowhere in our schema or docs. No one checked the migration plan against the job scheduler, and we hit a production throttling issue. After that we started keeping a short manifest of assumptions for each task: which services are synchronous, which versions are canonical, and which datasets are sampled. The manifest is a crude single source of truth that the chat has to reprint at the start of each session.
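The manifest itself is tiny. Below is a sketch of how we reprint it, assuming it lives as a JSON file in the repo; the path, field names, and example values are made up for illustration, not a fixed schema.

```python
# Minimal sketch of reprinting the assumptions manifest at session start.
# The path and fields are illustrative.

import json
from pathlib import Path

MANIFEST_PATH = Path("docs/assumptions.json")

def render_manifest() -> str:
    """Render the manifest as a short block to paste at the top of a new session."""
    manifest = json.loads(MANIFEST_PATH.read_text())
    lines = ["Task assumptions (single source of truth):"]
    for key, value in manifest.items():
        lines.append(f"- {key}: {value}")
    return "\n".join(lines)

# Example docs/assumptions.json:
# {
#   "synchronous_services": ["billing-api"],
#   "canonical_versions": {"postgres": "14", "django": "4.2"},
#   "sampled_datasets": ["analytics_events: 1% sample"]
# }
```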

Small errors compound in pipelines

A slow agent call or a truncated response is not just a transient annoyance. In a CI pipeline it becomes a lie that propagates. We built a flow where the model drafts unit tests, a separate tool runs them, and then the model writes a PR description. One time the test runner timed out, the tool returned partial logs, and the model wrote a PR description that claimed all tests passed, complete with coverage numbers. The PR got auto-merged because our gates trusted the model's language. This taught us to treat model outputs as untrusted until they are validated by external gates and checks. Now the pipeline refuses to proceed if any external tool returns partial output without an explicit status code.
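The gate itself is not sophisticated. A minimal sketch follows, assuming each tool step hands back an exit code and a truncation flag; the result shape and the commented helper names are assumptions, not our exact pipeline.

```python
# Minimal sketch of the "no partial output" gate. Assumes each tool step
# returns a dict with an explicit exit code and a truncated flag.

class IncompleteToolOutput(RuntimeError):
    pass

def require_complete(step_name: str, result: dict) -> dict:
    """Stop the pipeline instead of letting a truncated or failing tool result flow downstream."""
    if result.get("exit_code") is None:
        raise IncompleteToolOutput(f"{step_name}: no exit code recorded, refusing to continue")
    if result.get("truncated") or result["exit_code"] != 0:
        raise IncompleteToolOutput(f"{step_name}: partial or failing output, refusing to continue")
    return result

# The PR description step only runs on a clean, complete test result:
# test_result = require_complete("pytest", run_tests())
# draft_pr_description(test_result)
```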

Practical guardrails we actually added

We made small, concrete changes rather than redesigning the whole system. First, we mandate an explicit reset or recap every eight messages. It is annoying, but it keeps the important constraints close to the most recent tokens, where they actually get weighted. Second, we log message snapshots and tool responses in a way that makes failures visible. We started hashing tool outputs and checking those hashes in the model's next turn so the assistant cannot hallucinate a successful run. Third, we refuse to merge any PR where the model claims a tool result without a matching recorded artifact from the tool itself.
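The hash check is the piece that actually stops a hallucinated "tests passed". A minimal sketch, assuming tool outputs are archived to disk as artifacts; the directory layout and function names are illustrative.

```python
# Minimal sketch of hashing tool outputs so the model cannot claim a run
# that was never recorded. Artifact layout is illustrative.

import hashlib
from pathlib import Path

ARTIFACT_DIR = Path("artifacts/tool_outputs")

def record_tool_output(run_id: str, output: str) -> str:
    """Archive the raw tool output and return the hash the model must echo next turn."""
    data = output.encode("utf-8")
    ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)
    (ARTIFACT_DIR / f"{run_id}.log").write_bytes(data)
    return hashlib.sha256(data).hexdigest()

def verify_claimed_result(run_id: str, claimed_digest: str) -> bool:
    """A PR citing a tool run must point at an artifact whose hash matches the claim."""
    artifact = ARTIFACT_DIR / f"{run_id}.log"
    if not artifact.exists():
        return False
    return hashlib.sha256(artifact.read_bytes()).hexdigest() == claimed_digest
```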

We also compared responses across models and sessions in a shared workspace for a few problematic prompts. That helped expose drift faster than eyeballing one thread. When we need to check facts or changelogs we follow a structured verification step that looks like a mini research thread rather than asking the model to be an oracle. I put those steps into a checklist and linked to a shared multi-model chat for experiments and a separate flow we use for sourcing and verification. Those links are incidental notes in our runbooks but they are where we start when something smells off.
