James M

When long chats change the code: context drift and hidden errors

How long chats quietly reshape the model

I keep a live assistant thread open during tricky debugging sessions. At the start I seed the conversation with stack details: Python 3.11, FastAPI, Postgres 14, specific env names. Hours later the model starts suggesting snippets that match whatever appeared in the last few messages, not the initial constraints. Recent tokens dominate. One time the assistant began returning aiohttp examples while my project was FastAPI. The code ran fine as an isolated REPL snippet but failed in integration tests because the session-level middleware I relied on wasn't there.
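
To make that failure concrete, here is a minimal sketch with hypothetical names (attach_tenant, X-Tenant-Id): the route only works because app-level middleware populates request.state first, a contract a snippet written for another framework silently drops.

```python
# Minimal sketch with hypothetical names (attach_tenant, X-Tenant-Id). The
# route only works because app-level middleware populates request.state first;
# a snippet written for another framework drops that contract entirely.
from fastapi import FastAPI, Request

app = FastAPI()


@app.middleware("http")
async def attach_tenant(request: Request, call_next):
    # In the real app this comes from a session lookup; a header stands in here.
    request.state.tenant_id = request.headers.get("X-Tenant-Id", "unknown")
    return await call_next(request)


@app.get("/items")
async def list_items(request: Request):
    # AttributeError at request time if the middleware above is missing.
    return {"tenant": request.state.tenant_id}
```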

Hidden assumptions creep into code suggestions

These are not dramatic hallucinations. They are tiny, plausible defaults that don't match our infra. The model will assume a default AWS region, a common package version, or that the database URL lives in an env var called DATABASE_URL. I merged a patch where the generated Dockerfile used pip install --user inside a root build step. It built locally; in CI the runtime ran as a non-root user that couldn't see the packages pip had dropped under root's home directory, and the service crashed at startup. The assistant had filled a gap with a typical pattern without asking which base image we used.
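
Since then I bake a small guard into the image entrypoint. It is a hedged sketch and the package list is illustrative, not our real dependency set; it just fails fast if the runtime user cannot import what the build stage installed.

```python
# Startup guard baked into the image entrypoint: fail fast if the runtime
# user cannot import what the build stage installed. The package list is
# illustrative, not our real dependency set.
import importlib.util
import sys

REQUIRED = ["fastapi", "sqlalchemy", "psycopg2"]

missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]
if missing:
    print(f"startup check failed, missing packages: {missing}", file=sys.stderr)
    sys.exit(1)
```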

Another example: a suggested kubectl command that looked fine in chat. It passed --record to kubectl rollout restart, a flag that is deprecated upstream. The command succeeded on a test cluster but was rejected by our policy agent and left the rollout in an ambiguous state. The model treated the flag it had seen most often during training as correct for our environment.
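
I now run assistant-suggested commands through a tiny pre-flight filter. This is a sketch under my own assumptions: the forbidden-flag list and the dry-run step reflect our policy, not anything kubectl requires.

```python
# Pre-flight filter for assistant-suggested kubectl commands. The forbidden
# flag list and the dry-run step reflect our own policy, not anything kubectl
# requires; treat this as a sketch, not a hardening recipe.
import shlex
import subprocess
import sys

FORBIDDEN_FLAGS = {"--record"}  # deprecated upstream and rejected by our policy agent


def run_suggested(cmd: str) -> None:
    parts = shlex.split(cmd)
    bad = FORBIDDEN_FLAGS.intersection(parts)
    if bad:
        sys.exit(f"refusing to run suggested command, forbidden flags: {sorted(bad)}")
    if parts[:2] == ["kubectl", "apply"]:
        # Validate against the API server first without persisting anything.
        subprocess.run(parts + ["--dry-run=server"], check=True)
    subprocess.run(parts, check=True)


if __name__ == "__main__":
    # The flag is caught here, before the cluster or the policy agent sees it.
    run_suggested("kubectl rollout restart deployment/web --record")
```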

Tool failures amplify hallucinations

I wire models up to search tools and code indexers to ground their answers. But the connectors are brittle. A search timeout returned an empty payload, and the model synthesized a function name that appeared in no file. I treated the reply as authoritative and copied the name into a refactor. The build failed. From the outside it looked like a hallucination, but the root cause was an unhandled error in the tool layer. The model then kept ‘fixing’ the non-existent function with more invented helpers because it had no real evidence to tell it to stop.
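
The fix that stuck was making the tool layer refuse to return silence. A rough sketch: the endpoint, params, and response shape are made up, but the pattern is the point, never hand the model an empty result set when the tool actually failed or found nothing.

```python
# Hypothetical wrapper around a code-search connector. Endpoint, params, and
# response shape are made up; the pattern is the point: never hand the model
# an empty result set when the tool actually failed or found nothing.
import requests


class ToolEvidenceError(RuntimeError):
    """Raised when a grounding tool fails or returns nothing usable."""


def search_code(query: str, endpoint: str = "http://localhost:9200/search") -> list[dict]:
    try:
        resp = requests.get(endpoint, params={"q": query}, timeout=5)
        resp.raise_for_status()
        hits = resp.json().get("hits", [])
    except (requests.RequestException, ValueError) as exc:
        raise ToolEvidenceError(f"code search failed for {query!r}: {exc}") from exc
    if not hits:
        raise ToolEvidenceError(f"code search found nothing for {query!r}")
    return hits
```

The prompt layer catches ToolEvidenceError and tells the model in plain text that there is no evidence, which stops the chain of invented helpers much earlier.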

Small errors compound in pipelines

These mistakes multiply. One wrong config in a generated CI job — a missing env injection — corrupted a migration run. The migration succeeded locally against a disposable DB, but in staging it ran halfway and left foreign keys broken. Logging helped, but the alert didn’t fire because I trusted the job’s status string rather than checking database state after the run. The model’s suggestion to add a status check had a fencepost bug. It looked minor until rollbacks became manual and tedious.
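
The check I rely on now reads the database directly instead of trusting the job's status string. The sketch below assumes Alembic-managed migrations and an expected revision exported by CI; both are assumptions for illustration, not a description of our pipeline.

```python
# Post-migration state check, assuming Alembic-managed migrations and an
# expected head revision exported by CI. Both are assumptions for illustration;
# the point is to read the database, not the job's status string.
import os
import sys

import psycopg2

expected = os.environ["EXPECTED_ALEMBIC_REVISION"]

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    cur.execute("SELECT version_num FROM alembic_version")
    row = cur.fetchone()
conn.close()

if row is None or row[0] != expected:
    sys.exit(f"migration state check failed: found {row}, expected {expected}")
print("migration state matches expected revision")
```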

Another compound example: the assistant recommended a try/except around a bulk insert that silently swallowed IntegrityError and returned a generic success message. Downstream services had stale caches and consumers retried endlessly. A handful of small error-handling choices cascaded into visible outage time. We had tests for schema and for lint, but not for the exact failure-handling semantics the generated code introduced.
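
The replacement was boring on purpose. A minimal sketch, assuming SQLAlchemy: the except clause stays narrow and the error propagates, so nothing upstream is told that a failed write succeeded.

```python
# Minimal sketch, assuming SQLAlchemy. The except clause stays narrow and the
# error propagates, so nothing upstream is told that a failed write succeeded.
from sqlalchemy.exc import IntegrityError


def bulk_insert(session, model, rows):
    try:
        session.bulk_insert_mappings(model, rows)
        session.commit()
    except IntegrityError:
        session.rollback()
        raise  # let callers and their caches see the real outcome
```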

Practical fixes I actually use

I stopped treating a long chat as a stable spec. Now I pin constraints in a single message and periodically re-send them. If a thread runs long I reset the assistant with a concise system message and keep a short manifest file in the repo that lists versions, env names, and deployment policies. I also log every AI-generated snippet and run it in an isolated sandbox with strict budgets. When a tool returns no results, I make the assistant explicitly state whether its answer is grounded in tool output before I accept a change.
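
The manifest also feeds the reset message, so I am not retyping constraints from memory. A small helper with a hypothetical file name and key layout:

```python
# Hypothetical helper: rebuild the pinned-constraints message from a manifest
# checked into the repo. The file name and keys are assumptions; ours lists
# language/runtime versions, env var names, and deployment policies.
import json
from pathlib import Path


def pinned_system_message(path: str = "ai-manifest.json") -> str:
    manifest = json.loads(Path(path).read_text())
    lines = [f"- {key}: {value}" for key, value in sorted(manifest.items())]
    return "Project constraints (authoritative, do not infer others):\n" + "\n".join(lines)
```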

Concrete guards that caught issues for me: require generated infra changes behind a feature flag, add runtime assertions for env and version checks, and have CI run integration tests that exercise the exact failure paths the model modified. When I need cross-checks I open parallel prompts in a shared workspace and compare answers across models rather than trusting a single thread. That comparison habit lives alongside a structured verification flow for sourcing facts and docs so I do not treat a model’s suggestion about an API as the last word. You can use a shared chat workspace for comparisons and a deeper research workflow for verification to make these practices less ad hoc.
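
The runtime assertions are the least glamorous guard and the one that fires most often. A minimal sketch with example env var names, not our real configuration:

```python
# Startup assertions with example names, not our real configuration. The goal
# is to fail fast when a generated change drifts from the pinned stack.
import os
import sys

if sys.version_info[:2] != (3, 11):
    raise RuntimeError(f"expected Python 3.11, running {sys.version.split()[0]}")

for var in ("DATABASE_URL", "AWS_REGION"):
    if not os.environ.get(var):
        raise RuntimeError(f"missing required environment variable: {var}")
```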
