Sofia Bennett

When long chats quietly sabotage code: context drift and hidden errors

What I mean by context drift

I kept a single assistant thread open while iterating on a feature for three days. Early on I told the model we were using Django 3.2, black formatting, and the project's timezone. By the second day the suggestions started slipping: examples used f-strings in a way that conflicted with our lint rules, and a configuration snippet referenced a settings module that never existed in our repo. The model had not "forgotten" anything in the literal sense, but recent turns dominated. New examples pushed older constraints into the long tail of attention and the output began to look like a different project.

That bias toward recent context is not theoretical. It shows up in long debugging threads, pair-programming sessions, and ticket triage where you keep appending logs and questions. The assistant will re-use patterns that match the latest messages. If you are not explicit again, assumptions mutate. I treat drift as the slow rot of a session: tolerable at first, then quietly corrupting subsequent suggestions.
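
The cheapest countermeasure I found is to stop trusting the model to remember early turns and to re-inject the constraints myself. Here is a minimal sketch, assuming a chat-completions-style history of role/content dicts; the constraint text and the every-N-turns cadence are placeholders, not part of any particular API.

```python
# Re-inject project constraints so they never fall into the long tail of the context.
# Assumes a chat-completions-style history: a list of {"role": ..., "content": ...} dicts.

PROJECT_CONSTRAINTS = (
    "Constraints (restated): Django 3.2, black formatting, "
    "project timezone as configured in settings, no new settings modules."
)

REPIN_EVERY_N_TURNS = 6  # placeholder cadence; tune it to how fast your threads drift


def with_repinned_constraints(history: list[dict], turn_count: int) -> list[dict]:
    """Append the constraint block as a fresh system turn every N turns,
    so recency bias works for us instead of against us."""
    if turn_count and turn_count % REPIN_EVERY_N_TURNS == 0:
        return history + [{"role": "system", "content": PROJECT_CONSTRAINTS}]
    return history
```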

Hidden assumptions that break pipelines

At one point the model suggested a CLI flag that does not exist in the build-tool version we run on CI. I accepted it and updated the pipeline. Locally the script ran because my machine had a newer version of the tool; CI failed and then fell back to a path that silently skipped a step. That skipped step meant no schema migration ran in staging. The hard part was that the model never announced the version assumption; it presented the flag as if it were a fact.

To catch these, I started forcing the assistant to list its assumptions before it writes code or infra snippets. Ask it to enumerate versions, default flags, and environment differences. I also cross-check anything time-sensitive or dependency-specific against canonical sources rather than trusting the first answer. For source checks I use a structured research flow rather than a casual follow-up. It helps to consult a dedicated verification channel instead of the working thread, especially if you are comparing notes across models in a shared workspace like a multi-model chat.
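
In practice this is just a preamble prepended to any code-generating request. A rough sketch along the same lines as the snippet above; the checklist wording is illustrative, not a canonical prompt.

```python
ASSUMPTION_PREAMBLE = """Before writing any code or config, list your assumptions explicitly:
- exact tool and library versions you are targeting
- default flags or settings you rely on
- differences you expect between local, CI, and production
Mark anything you are guessing as a guess. Only then produce the snippet."""


def ask_with_assumptions(history: list[dict], request: str) -> list[dict]:
    """Wrap a code request so the model has to enumerate versions, flags,
    and environment differences before it answers."""
    return history + [
        {"role": "user", "content": f"{ASSUMPTION_PREAMBLE}\n\nRequest: {request}"}
    ]
```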

How small errors compound into outages

Small mistakes cascade. I let one bad suggestion through: a JSON key name that was off by a character. Tests still passed because the unit test used fixtures that matched the wrong key. Integration tests caught nothing because the staging service was mocked the same way. That mismatch propagated into a deployed feature and produced live errors that were hard to trace back to the original token-level typo in a model reply.

I now treat model output as a draft that needs checkpoints. After any change suggested by the assistant I run a quick targeted test suite, validate API contracts, and add a logging assertion for the specific field. Logs let you trace the error chain instead of guessing which suggestion caused the failure. The extra friction saved us from a multi-hour rollback once.
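
The checkpoint that would have caught the off-by-a-character key is a contract check that runs against the real payload instead of the fixture. A minimal sketch; the required key names here are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical contract: the keys we promised downstream consumers.
REQUIRED_KEYS = {"payment_id", "amount", "currency"}


def assert_contract(payload: dict) -> dict:
    """Fail loudly (and log) if a payload from the new code path is missing
    a promised key, instead of letting a renamed key slip through fixtures."""
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        logger.error("contract violation: missing keys %s in payload %r", missing, payload)
        raise ValueError(f"payload missing required keys: {missing}")
    return payload
```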

Tool calls fail in ways that look like reasoning errors

We integrated an agent that runs linters and test runners. One run returned truncated JSON because the test process hit a timeout. The assistant received partial output and generated a remediation script that assumed a different error. From our view it looked like the model inferred the wrong cause. In reality the upstream tool had failed silently and the model filled the gap with a plausible explanation.

That experience taught me to instrument tool outputs aggressively. Validate tool responses before handing them back to the model. If JSON is expected, parse it and assert its shape. If a tool timed out, fail the agent call and surface a diagnostic instead of letting the model proceed on partial data. When a tool is flaky, put a short-circuit check in front of it and record the raw payloads for later analysis. Comparing model responses across turns or models in a controlled environment also helped diagnose when the issue was input quality rather than reasoning.
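
Concretely, the agent only sees tool output after it survives a parse-and-shape check, and anything truncated or timed out becomes an explicit failure instead of plausible-looking partial text. A sketch under two assumptions: the tool is invoked as a plain subprocess, and it is expected to print a JSON object with a "tests" key (the key name is hypothetical).

```python
import json
import subprocess


class ToolFailure(Exception):
    """Raised instead of handing partial or malformed tool output to the model."""


def run_tool(cmd: list[str], timeout_s: int = 120, raw_log_path: str = "tool_raw.log") -> dict:
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        raise ToolFailure(f"{cmd[0]} timed out after {timeout_s}s") from exc

    # Record the raw payload for later analysis, whatever happens next.
    with open(raw_log_path, "a") as fh:
        fh.write(proc.stdout + "\n")

    if proc.returncode != 0:
        raise ToolFailure(f"{cmd[0]} exited {proc.returncode}: {proc.stderr[:500]}")

    # Parse and assert shape before the model ever sees it.
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        raise ToolFailure(f"{cmd[0]} returned non-JSON or truncated output") from exc
    if "tests" not in payload:
        raise ToolFailure(f"{cmd[0]} output missing expected 'tests' field")
    return payload
```

The point is that a ToolFailure reaches a human or a retry policy, never the model's context.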

Concrete guardrails I added

These are things I actually put in place:

- require the assistant to list assumptions on every design turn
- pin a short system message with mandatory constraints and rewrite it every few hours
- log every prompt and response along with the last known environment snapshot
- run minimal focused tests after each AI-driven change
- validate all tool outputs with simple schema checks
- force a hard session reset when the thread grows beyond a known size

I compare edge cases and disagreements in a separate research channel that includes source checks from a structured flow similar to deep research.
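
The logging and reset pieces are small enough to sketch. Assumptions here: thread size is approximated by character count, the environment snapshot is whatever your project considers canonical, and the threshold is arbitrary.

```python
import json
import platform
import sys
import time

MAX_HISTORY_CHARS = 40_000  # arbitrary hard limit; past this, start a fresh session


def environment_snapshot() -> dict:
    """Whatever 'last known environment' means for your project; minimal example."""
    return {"python": sys.version.split()[0], "platform": platform.platform()}


def log_exchange(prompt: str, response: str, path: str = "assistant_audit.jsonl") -> None:
    """Append every prompt/response pair with an environment snapshot for later tracing."""
    with open(path, "a") as fh:
        fh.write(json.dumps({
            "ts": time.time(),
            "env": environment_snapshot(),
            "prompt": prompt,
            "response": response,
        }) + "\n")


def needs_hard_reset(history: list[dict]) -> bool:
    """Force a fresh thread once the session grows beyond a known size."""
    return sum(len(m.get("content", "")) for m in history) > MAX_HISTORY_CHARS
```

I check needs_hard_reset before every new request; when it trips, the thread is re-seeded with the pinned constraints and nothing else.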

None of this is glamorous. It added friction to rapid prototyping but reduced the quiet rebuilds and emergency rollbacks. The thing I keep returning to is this: assume the model will drift, assume tools will fail, and build the smallest verifications that stop the compounding errors.
