Quality, Cost, and the Context Window
Qin Yu, 21 May 2026, updated 29 May 2026
Why Quality Matters More Than Cost
Instead of counting tokens, make every token count.
Token cost is real, but wasted work usually costs more than expensive tokens. A cheap run is not cheap if it produces a wrong patch, wastes review time, or sends the agent down a long recovery path.
Small miss rates compound across long workflows. If each step independently has a 1% chance of going wrong, a 50-step run has roughly a 40% chance of at least one miss: 1 - 0.99^50 ≈ 0.395.
A cheap but uncontrolled long agent loop can cost more than several short, well-scoped, higher-quality runs. Fewer steps, tighter context, and early error detection matter more than raw model power.
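The compounding effect can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every step fails independently with the same probability, which real agent runs only approximate; the function name `chance_of_any_miss` is illustrative.

```python
def chance_of_any_miss(per_step_miss: float, steps: int) -> float:
    """Probability that at least one of `steps` steps goes wrong.

    Assumes each step misses independently with the same probability.
    """
    return 1.0 - (1.0 - per_step_miss) ** steps


if __name__ == "__main__":
    # Compare short and long runs at the same 1% per-step miss rate.
    for steps in (5, 10, 50, 100):
        print(f"{steps:>3} steps: {chance_of_any_miss(0.01, steps):.1%}")
```

At a 1% miss rate, the 50-step case comes out at about 39.5%, while a 10-step run stays under 10%. That gap is the arithmetic behind preferring several short, well-scoped runs.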
Deterministic Guardrails
If an early step goes wrong and nothing catches it, the agent continues down the wrong path. Catch errors early with deterministic controls:
- Unit tests — verify correctness at the function level before the agent proceeds.
- Linters and type checkers — catch structural and type errors immediately.
- Security scanners — flag vulnerabilities before they propagate into downstream work.
- Hooks — run non-negotiable checks automatically after edits or before commits.
- Evals — replay representative tasks when changing prompts, models, tools, or routing rules.
Treat these as infrastructure, not optional polish. Better test coverage enables more aggressive automation.
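One way to wire such controls together is a fail-fast gate that runs each check as a subprocess and stops at the first non-zero exit code. The sketch below is illustrative only: `run_checks`, `CheckResult`, and the placeholder commands are assumptions, and the placeholders merely simulate a linter, a test run, and a scanner so that the example is self-contained. Replace them with the commands your project actually uses.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_checks(checks: list[tuple[str, list[str]]]) -> list[CheckResult]:
    """Run each command in order and stop at the first failing check."""
    results = []
    for name, command in checks:
        proc = subprocess.run(command, capture_output=True, text=True)
        passed = proc.returncode == 0
        output = (proc.stdout + proc.stderr).strip()
        results.append(CheckResult(name, passed, output))
        if not passed:
            break  # fail fast so later work does not build on a known error
    return results


# Placeholder commands (assumptions): each one only simulates a real tool.
PLACEHOLDER_CHECKS = [
    ("lint", [sys.executable, "-c", "print('lint ok')"]),
    ("tests", [sys.executable, "-c", "import sys; print('1 failed'); sys.exit(1)"]),
    ("security", [sys.executable, "-c", "print('scan ok')"]),
]

if __name__ == "__main__":
    for result in run_checks(PLACEHOLDER_CHECKS):
        status = "PASS" if result.passed else "FAIL"
        print(f"{status} {result.name}: {result.output}")
```

Because the simulated test step exits with code 1, the security step never runs. Whether to stop at the first failure or collect all failures is a design choice; stopping early keeps the feedback handed to the agent short.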
References:
- OpenAI Evals guide
- OpenAI agent improvement loop cookbook
- GitHub Copilot hooks
- Anthropic: Writing tools for agents
Understanding the Context Window
Provide as little context as possible, but as much as required.
An LLM is a next-token predictor. Its output quality depends on what it sees in the context window. The context window is shaped by the agent harness, not only by the base model:
Background reading
The diagram below shows two distinct boundaries. You control what you send to the harness — prompts, files, instructions, skills, MCP tools, and so on. The harness then decides how to assemble those inputs into the model's context window: what to include, how to order it, when to compact, and which model to call. Understanding this boundary explains why context discipline and prompt structure matter so much.
sequenceDiagram
    autonumber
    participant You as You<br/>+<br/>Your Project
    participant Agent as Agent<br/>aka<br/>Harness
    participant LLM
    loop Per task
        rect rgb(200, 150, 255)
            Note right of You: Prompts,<br/>Files,<br/>Instructions,<br/>Skills,<br/>MCPs,<br/>...
            You->>Agent: Send ↑ to
        end
        Note over Agent: e.g.<br/>VS Code Chat, or<br/>Copilot CLI, or<br/>Copilot Cloud Agent, or<br/>Claude Code, or<br/>OpenAI Codex
        loop Per turn (compaction when context window fills)
            Agent->>LLM: Send context to
            Note over LLM: e.g.<br/>GPT-5.5, or<br/>Claude Opus 4.7
            LLM-->>Agent: Text response
        end
        Agent-->>You: Result
    end
You control what enters the harness: prompts, files/URLs, instructions, skills, MCP tools, screenshots, and sometimes memories. That is the primary lever for both quality and cost.
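To make the second boundary concrete, here is a toy sketch of how a harness might assemble a context under a size budget. It is not how any particular product works: `build_context` is a hypothetical function, the budget is counted in characters rather than tokens for simplicity, and the bracketed marker stands in for a model-written compaction summary.

```python
def build_context(
    system: str,
    user_inputs: list[str],
    history: list[str],
    budget_chars: int,
) -> list[str]:
    """Return the pieces a toy harness would send to the model for one turn."""
    fixed = [system, *user_inputs]  # always included: instructions and your inputs
    used = sum(len(piece) for piece in fixed)

    kept: list[str] = []
    # Walk the history newest-first so the most recent turns survive.
    for turn in reversed(history):
        if used + len(turn) > budget_chars:
            break
        kept.append(turn)
        used += len(turn)
    kept.reverse()

    dropped = len(history) - len(kept)
    if dropped:
        # Placeholder for a compaction summary; a real harness would ask a model.
        # The marker itself is not counted against the budget in this sketch.
        kept.insert(0, f"[{dropped} earlier turn(s) compacted]")
    return fixed + kept


if __name__ == "__main__":
    context = build_context(
        system="sys",
        user_inputs=["task"],
        history=["a" * 10, "b" * 10, "c" * 10],
        budget_chars=30,
    )
    print(context)
```

In this run the oldest turn no longer fits, so it is replaced by the marker while the two newest turns are kept verbatim. Everything in `user_inputs` is always sent, which is why trimming what you attach has a direct effect on the space left for the conversation.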
Context Rot
Avoid treating the context window as free storage. Long-context systems have known failure modes:
- Position bias — models often retrieve better from the beginning and end than the middle.
- Irrelevant-token distraction — unrelated content competes with useful content.
- Stale-session drift — old assumptions, failed attempts, and tool outputs remain influential.
- Compaction loss — automatic summaries are useful, but they can omit details that later become important.
The old rule of thumb "keep context below 60%" is a useful caution, not a law of physics. Treat context as scarce working memory: keep noisy exploration outside the main conversation, use files for durable information, summaries for intermediate findings, and sub-agents for isolated investigation. Start a new chat or session when switching to unrelated work.
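A rough usage check can turn that caution into a habit. The sketch below assumes about four characters per token, which is a crude estimate that varies by tokenizer and language, and it uses the 60% figure only as a default caution threshold. All names are illustrative.

```python
CHARS_PER_TOKEN = 4  # crude assumption; real tokenizers vary by model and language


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def context_usage(items: list[str], window_tokens: int) -> float:
    """Estimated fraction of the context window occupied by `items`."""
    return sum(estimate_tokens(item) for item in items) / window_tokens


def advice(usage: float, caution_at: float = 0.6) -> str:
    """Suggest an action once estimated usage crosses the caution threshold."""
    if usage >= caution_at:
        return "summarise findings to a file and start a fresh session"
    return "continue"


if __name__ == "__main__":
    session = ["x" * 2_000, "y" * 1_000]  # stand-ins for prompts and tool output
    usage = context_usage(session, window_tokens=1_000)
    print(f"{usage:.0%} used: {advice(usage)}")
```

Many harnesses report context usage directly; where they do, prefer that figure over an estimate like this one.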
References:
- Lost in the Middle: How Language Models Use Long Contexts
- Anthropic context windows
- OpenAI compaction guide
- Claude Code memory
Context Discipline
Context engineering is curating and maintaining the optimal set of tokens in the agent loop. The harness manages prompt caching, tool search, memory, and compaction, but you are responsible for what you add yourself.
Irrelevant context degrades output quality and increases cost.
- Attach only the files and URLs directly relevant to the current task.
- Open a new chat window for each distinct task.
- Use sub-agents to isolate task-specific context into separate context windows.
- Prefer compact summaries over raw logs.
- Prefer file references and targeted reads over pasting large blobs.
- Avoid screenshots unless the visual information is genuinely needed.
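As an example of preferring compact summaries over raw logs, the following sketch reduces a log to its line count plus the lines that look like failures. The keyword pattern and the `summarise_log` helper are assumptions for illustration; tune the pattern to the tools that produce your logs.

```python
import re

# Assumed failure keywords; adjust to match your own tooling.
ERROR_PATTERN = re.compile(r"\b(error|failed|exception|traceback)\b", re.IGNORECASE)


def summarise_log(log: str, max_lines: int = 20) -> str:
    """Return a short summary: a header plus the first matching lines."""
    lines = log.splitlines()
    hits = [
        f"{number}: {line.strip()}"
        for number, line in enumerate(lines, start=1)
        if ERROR_PATTERN.search(line)
    ]
    shown = hits[:max_lines]
    header = f"{len(lines)} log lines, {len(hits)} matched; showing {len(shown)}"
    return "\n".join([header, *shown])


if __name__ == "__main__":
    raw = "collecting tests\nERROR: test_login failed\nall done\n"
    print(summarise_log(raw))
```

Keeping the original line numbers in the summary lets the agent do a targeted read of the full log later instead of loading all of it up front.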
Prompt Engineering
Prompt engineering is writing and organising instructions for optimal LLM output. It is a subset of context engineering, but its influence on output quality is outsized.
A good prompt usually specifies:
- Outcome — what result you want.
- Scope — what the agent may and may not change.
- Inputs — which files, URLs, tickets, logs, or docs matter.
- Process — whether to research, plan, implement, or review.
- Stop condition — when the agent should stop.
- Validation — how success should be checked.
- Output contract — what summary you expect at the end.
Politeness tokens are not the problem. Vagueness is.
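The checklist above can be encoded so that a vague prompt is caught before it is sent. This is a minimal sketch: the `TaskPrompt` class and the sample values in the usage block are hypothetical, not part of any harness.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskPrompt:
    outcome: str
    scope: str
    inputs: str
    process: str
    stop_condition: str
    validation: str
    output_contract: str

    def missing(self) -> list[str]:
        """Names of fields that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Render the prompt, refusing to do so while any field is blank."""
        gaps = self.missing()
        if gaps:
            raise ValueError(f"prompt is missing: {', '.join(gaps)}")
        return "\n".join(
            f"{f.name.replace('_', ' ').capitalize()}: {getattr(self, f.name)}"
            for f in fields(self)
        )


if __name__ == "__main__":
    # Hypothetical task used only to show the rendered shape.
    prompt = TaskPrompt(
        outcome="Login form rejects empty passwords",
        scope="Only the login module; no dependency changes",
        inputs="The bug report and the login module source",
        process="Plan first, then implement",
        stop_condition="Stop when the new test passes or a blocker appears",
        validation="Run the login test suite",
        output_contract="List touched files and test results",
    )
    print(prompt.render())
```

Raising on a blank field is deliberately strict. A softer variant could return a warning instead, since not every small task needs all seven parts.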
Working in Phases
Divide complex work into distinct phases, each with its own context window:
- Research — gather information, read documentation, explore the codebase. Do not edit.
- Plan — synthesise findings into a concrete, reviewable plan.
- Implement — execute against the plan with focused context.
- Verify — run deterministic checks and summarise residual risks.
Use sub-agents for finer isolation: each sub-task gets its own context window, preventing earlier phases from polluting implementation decisions.
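The isolation between phases can be sketched as a loop in which each phase starts from a fresh context containing only the task and the previous phase's output. Here `call_agent` is a hypothetical callable standing in for whatever starts a new session or sub-agent in your harness, and `fake_agent` just returns a label so the example runs on its own.

```python
from typing import Callable

PHASES = ("research", "plan", "implement", "verify")


def run_phases(task: str, call_agent: Callable[[str, str], str]) -> dict[str, str]:
    """Run the phases in order, handing over only the previous phase's output."""
    outputs: dict[str, str] = {}
    handoff = ""
    for phase in PHASES:
        # A fresh context per phase: nothing from older phases leaks in
        # except what the previous phase chose to hand over.
        context = f"Task: {task}\nPrevious phase output: {handoff or '(none)'}"
        handoff = call_agent(phase, context)
        outputs[phase] = handoff
    return outputs


def fake_agent(phase: str, context: str) -> str:
    """Stand-in for a real agent call; returns a label instead of doing work."""
    return f"<{phase} summary>"


if __name__ == "__main__":
    for phase, output in run_phases("fix the login bug", fake_agent).items():
        print(f"{phase}: {output}")
```

Passing only the immediate predecessor's output is the strictest form of isolation. In practice the implement phase may also need the approved plan as a file on disk, which fits the advice to use files for durable information.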
A reusable phase prompt set:
Research mode
Read only. Do not edit files.
Task:
<task>
Scope:
<files, directories, issue, logs, docs, or constraints>
Return:
- relevant files/functions and why they matter
- current behaviour, with evidence
- constraints and invariants
- likely failure modes
- missing information
- recommended next step
Plan mode
Do not edit files.
Use the research findings to produce a bounded implementation plan.
Return:
- goal
- assumptions
- files likely to change
- ordered steps
- validation commands
- risks and rollback notes
- stop conditions
Implement mode
Execute only the approved plan.
Rules:
- stay within scope
- do not add dependencies unless explicitly approved
- stop if the plan is wrong or a blocker appears
- run relevant validation commands
Return:
- touched files
- what changed
- validation results
- unresolved risks