
When long conversations cause context drift and hidden errors

Why long threads feel consistent but secretly drift

I kept a single chat open for a week while pairing with a model on a refactor. At first the suggestions matched my style: small functions, descriptive names, unit test sketches. Midweek it started recommending patterns I never use. Not obviously wrong. Just subtly different. The model still replied in the same voice, but its attention had migrated to more recent examples in the thread. Recent tokens dominate. The mental state that guided the early prompts thinned out.

That drift is not a dramatic switch. It is a slow creep. I would ask about database migrations and get an answer that assumed a different ORM. I would accept the code shape and only later realize a call signature didn't exist. By then, several local changes depended on the wrong assumption.

Small hidden assumptions become bugs

One concrete example: I asked the model to help convert a set of SQL updates into batched transactions. Somewhere in a long thread it began assuming autocommit behavior was off. The code it produced omitted an explicit commit in one branch. Tests passed locally because the dev DB driver behaved differently from prod. That mismatch created a silent failure in production. I had not asked about commit semantics because I thought the model remembered the earlier notes about our DB. It didn't, not reliably.
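
I now write that batching code so every path commits or rolls back explicitly, whatever the driver's autocommit default happens to be. A minimal Python sketch of that shape, using sqlite3 and a hypothetical accounts table:

```python
# Minimal sketch: batched updates with an explicit commit on every branch,
# so the code does not depend on the driver's autocommit default.
# The accounts table and its columns are made up for illustration.
import sqlite3

def apply_batched_updates(conn: sqlite3.Connection,
                          rows: list[tuple[str, int]],
                          batch_size: int = 500) -> None:
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        try:
            cur.executemany(
                "UPDATE accounts SET status = ? WHERE id = ?",
                batch,
            )
            conn.commit()    # never rely on autocommit; commit each batch explicitly
        except Exception:
            conn.rollback()  # undo the partial batch, then surface the error
            raise
```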

Another time it suggested a configuration flag name that matched an older version of a library. The refactor chained several of those suggestions into a migration script. The script ran against staging and wrote to the wrong S3 prefix. Each mistake alone was plausible. Together they turned into a user-facing outage.

Tool calls fail and the model fills the gaps

We wire models to tools: linters, test runners, package managers, small shells. That widens the failure surface. I have seen the orchestrator return partial JSON from a test summary and the model continue as if the missing fields existed. The output looked coherent. The model also ignored a failed job because the tool output was malformed. In the logs the only clue was a divergent checksum and a later change that undid the expected behavior.

This taught me to treat tool outputs as untrusted inputs. We now log raw tool responses, not just model summaries. When a response is truncated or lacks required fields we abort and surface the raw payload to a human. The model can then be asked to re-process with the full data. It forces a hard stop instead of an invisible assumption.
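The gate itself is small. A sketch in Python, where the required field names are my guess at what a test-runner summary might contain, not a real schema:

```python
# Minimal sketch: treat tool output as untrusted input and stop hard when it
# is truncated or incomplete. The required field names are assumptions.
import json

REQUIRED_FIELDS = {"status", "passed", "failed", "duration_s"}

class ToolOutputError(Exception):
    """Raised when a tool response cannot be trusted; a human sees the raw payload."""

def parse_test_summary(raw: str) -> dict:
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ToolOutputError(f"unparseable tool output: {raw!r}") from exc
    if not isinstance(payload, dict):
        raise ToolOutputError(f"expected a JSON object, got: {raw!r}")
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ToolOutputError(f"tool output missing fields {sorted(missing)}: {raw!r}")
    return payload
```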

Practical guardrails I put in place

I added three small rules that reduced these stealth errors more than any prompt tweak. First, split research and code threads. I moved deep exploration to a separate workspace so the coding thread stays short and directive. For exploration we use a focused research flow that keeps citations and raw outputs together in one place; when I need to verify sources, I push that work to the research session instead of the live refactor session and compare findings in a shared workspace.

Second, reset or snapshot the context every few turns. If a thread grows beyond a half-dozen edits, we snapshot the state, then start a fresh prompt that explicitly enumerates assumptions. Third, enforce schema checks on model outputs. If the model claims a function signature or dependency name, a small validator checks it against the repo before applying changes. Fail fast. Log everything. These aren't glamorous, but they stop chains of small mistakes.
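
The validator is deliberately dumb. A sketch of the idea in Python, with the repo layout and the shape of the model's claim as assumptions:

```python
# Minimal sketch: before applying a model-suggested change, check that the
# function it calls and the dependency it names actually exist in the repo.
# The module path and requirements.txt layout are assumptions about this project.
import ast
from pathlib import Path

def function_exists(repo_root: Path, module_rel_path: str, func_name: str) -> bool:
    """True if the named function is really defined in the given module."""
    source = (repo_root / module_rel_path).read_text()
    tree = ast.parse(source)
    return any(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name == func_name
        for node in ast.walk(tree)
    )

def dependency_declared(repo_root: Path, package: str) -> bool:
    """True if the package name appears in requirements.txt (crude on purpose)."""
    req = repo_root / "requirements.txt"
    if not req.exists():
        return False
    names = (line.split("==")[0].split(">=")[0].strip().lower()
             for line in req.read_text().splitlines() if line.strip())
    return package.lower() in names
```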

When to log, when to reset, and when to ask for proof

Logging matters more than clever prompts. I keep a plain text timeline of model suggestions, tool outputs, and human approvals. That timeline made it easy to reconstruct a production incident and see where the assumptions first appeared. When something odd appears in staging I replay the exact model input that produced the change. Reproducibility is how you spot drift.
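
The timeline does not need tooling; an append-only file is enough to replay an incident. A minimal Python sketch, with the file name and event fields chosen for illustration:

```python
# Minimal sketch: one JSON line per event so the sequence of suggestions,
# tool outputs, and approvals can be replayed later. Field names are my choice.
import json
import time
from pathlib import Path

TIMELINE = Path("ai_timeline.log")

def log_event(kind: str, content: str, approved_by: str | None = None) -> None:
    """kind: 'model_suggestion', 'tool_output', or 'human_approval'."""
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "kind": kind,
        "content": content,          # raw text, not a model summary
        "approved_by": approved_by,
    }
    with TIMELINE.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")
```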

Resetting context is cheap compared to debugging a cascade. If a session has mixed styles or multiple research tangents, start a fresh chat, restate the constraints, and paste only the validated snippets. When the model references versions, tests, or changelogs, ask for the source and, when possible, force a tool to fetch the authoritative file. We use a separate channel for that kind of sourcing so the main thread does not accumulate noise. I still trust the model to draft, but I no longer trust it to remember every hidden assumption.
