Token Efficiency Guidelines

Qin Yu, 21 May 2026, updated 2 Jun 2026

Token efficiency means spending tokens on useful context and reasoning while removing avoidable overhead.

Keep Active Context Small

Lead with the outcome — state the desired result, scope, constraints, stop condition, validation steps, and expected summary. Short prompts are useful; vague prompts are expensive.
Treat context as working memory, not storage — attach only relevant files and URLs. Start a new session for unrelated work.
Keep IDE context focused — Copilot can use open files for inline suggestions, so close irrelevant tabs.
Reference specific files — reference durable information in files or docs instead of pasting large blocks of text. Avoid screenshots unless the visual information matters.
Keep code modular — split oversized files and functions along meaningful boundaries.
Work in phases — use Research → Plan → Implement → Verify so each phase receives only the context it needs.
Isolate noisy work — use sub-agents for exploration, parallel tasks, and specialised reviews. Return concise findings.
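
The "lead with the outcome" item above can be sketched as a small prompt template. This is only an illustration: the `TaskBrief` class, its field names, and the example task are hypothetical and not part of any tool's API.

```python
from dataclasses import dataclass, field


@dataclass
class TaskBrief:
    """Hypothetical container for the fields a focused prompt should state."""
    outcome: str
    scope: str
    constraints: list[str] = field(default_factory=list)
    stop_condition: str = ""
    validation: list[str] = field(default_factory=list)
    summary_format: str = ""

    def render(self) -> str:
        """Render only the fields that are set, so empty ones cost no tokens."""
        lines = [f"Outcome: {self.outcome}", f"Scope: {self.scope}"]
        if self.constraints:
            lines.append("Constraints: " + "; ".join(self.constraints))
        if self.stop_condition:
            lines.append(f"Stop when: {self.stop_condition}")
        if self.validation:
            lines.append("Validate with: " + "; ".join(self.validation))
        if self.summary_format:
            lines.append(f"Report back: {self.summary_format}")
        return "\n".join(lines)


# Illustrative task; the paths and commands are made up for this example.
brief = TaskBrief(
    outcome="Fix the failing date parser test",
    scope="src/dates.py and tests/test_dates.py only",
    constraints=["no new dependencies"],
    stop_condition="the test suite passes",
    validation=["pytest tests/test_dates.py"],
    summary_format="three bullet points and the diff stat",
)
print(brief.render())
```

A template like this keeps the prompt short without making it vague, because every line answers a question the agent would otherwise have to guess at or explore.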

Use the right configuration layer — keep mandatory repo rules in versioned instruction files, reusable workflows in skills, specialised behaviour in scoped agents, deterministic checks in hooks or CI, and temporary details in the session prompt.
Keep persistent instructions short and current — record only durable conventions and recurring fixes. Review and prune them periodically. Use memory as a convenience, not as the source of mandatory policy.
Load specialised context only when needed — use skills and file references for progressive disclosure instead of placing every workflow and reference in the default context.
Remove unused tools — every unnecessary MCP server or tool adds overhead. Prefer a direct API, native CLI, or script for narrow, stable, high-frequency operations.

Think in code, not prose — analyse large files or datasets with a script when possible. Scripts are deterministic, reusable, and produce compact output.
Request bounded outputs — ask for the exact artifact you need, such as one function or a concise diff summary.
Prefer structured output — use summaries, tables, and focused JSON instead of raw logs. Compress verbose shell output with tools such as rtk-ai/rtk.
Collapse related tool calls — batch independent reads and actions where supported. Plugins such as jsturtevant/copilot-codeact-plugin can help.
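
As a minimal sketch of "think in code" combined with structured output, the script below collapses raw log lines into a compact JSON summary before anything reaches the model. The log format (`<timestamp> <LEVEL> <message>`), the `summarise_log` helper, and the sample lines are assumptions made for illustration.

```python
from collections import Counter
import json
import re

# Assumed log format: "<timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^\S+ (?P<level>[A-Z]+) (?P<message>.*)$")


def summarise_log(lines, top_n=3):
    """Collapse raw log lines into level counts and the most frequent errors."""
    levels = Counter()
    errors = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        levels[match["level"]] += 1
        if match["level"] == "ERROR":
            # Mask digits so "timeout after 30s" and "timeout after 45s" group together.
            errors[re.sub(r"\d+", "N", match["message"])] += 1
    return {
        "levels": dict(levels),
        "top_errors": errors.most_common(top_n),
    }


# Made-up sample data standing in for a much larger log file.
sample = [
    "2026-06-02T10:00:00 INFO started",
    "2026-06-02T10:00:01 ERROR timeout after 30s",
    "2026-06-02T10:00:02 ERROR timeout after 45s",
    "2026-06-02T10:00:03 WARN retrying",
]
summary = summarise_log(sample)
print(json.dumps(summary))
```

The summary stays roughly the same size whether the log has four lines or four million, which is the property that makes the script cheaper than pasting the log.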

Route by task shape — use cheaper models for extraction, formatting, search-heavy work, and simple transforms. Reserve stronger models for ambiguity, architecture, difficult debugging, and high-risk review.
Judge cost by completed work — account for retries, long loops, repeated tool output, reasoning tokens, cache costs, and image tokens rather than headline model prices alone.
Parallelise independent work — run separate reviews, comparisons, or checks concurrently when their results do not depend on each other.
Add critique loops selectively — use Generate → Critique → Revise → Validate when quality matters more than latency.
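
One way to judge cost by completed work is to divide the cost of an attempt by the fraction of attempts that succeed. The sketch below assumes independent retries until success and ignores cache and image costs. All prices, token counts, and success rates are invented for illustration and should be replaced with measured values.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical per-model figures; replace with measured values."""
    input_price_per_mtok: float     # currency units per million input tokens
    output_price_per_mtok: float    # currency units per million output tokens
    input_tokens_per_attempt: int
    output_tokens_per_attempt: int  # include reasoning tokens if billed as output
    success_rate: float             # fraction of attempts that complete the task


def cost_per_attempt(m):
    """Token cost of a single attempt, successful or not."""
    return (m.input_tokens_per_attempt * m.input_price_per_mtok
            + m.output_tokens_per_attempt * m.output_price_per_mtok) / 1_000_000


def cost_per_completed_task(m):
    """Expected cost when failed attempts are retried until one succeeds."""
    if not 0 < m.success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # Independent retries need 1 / success_rate attempts on average.
    return cost_per_attempt(m) / m.success_rate


# Invented figures: a cheaper model that often needs retries, and a
# pricier model that usually finishes on the first attempt.
cheap = ModelProfile(1.0, 4.0, 20_000, 2_000, success_rate=0.3)
strong = ModelProfile(3.0, 15.0, 15_000, 1_500, success_rate=0.9)

for name, profile in [("cheap", cheap), ("strong", strong)]:
    print(name, round(cost_per_completed_task(profile), 4))
```

With these made-up numbers the cheaper model costs less per attempt but more per completed task, which is the comparison headline prices hide.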

Catch errors early with deterministic guardrails — use tests, linters, type checkers, security scanners, hooks, and CI so mistakes do not compound across long workflows.
Analyse usage regularly — use Codex /status; Copilot CLI /context, /usage, /session, and experimental /chronicle tips; or Claude Code /context, /usage, and /cost.
Benchmark representative tasks — use evals when changing prompts, models, tools, or routing rules. Different models respond differently to context ordering and instruction style.
Treat misses as incidents — when a workflow fails, update the relevant instruction, skill, hook, eval, or routing rule instead of repeatedly retrying the same approach.
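
A benchmark for representative tasks can start very small. The sketch below assumes each candidate (a prompt, model, or routing variant) is wrapped as a function returning its output and token usage. The two stub candidates are hypothetical stand-ins for real model calls, and the cases are toy examples.

```python
def run_eval(candidate, cases):
    """Run `candidate` on every case and report pass rate and mean token use.

    `candidate` takes a task input and returns (output, tokens_used).
    Each case is (task_input, check), where `check` is a deterministic predicate.
    """
    passes, tokens = [], []
    for task_input, check in cases:
        output, used = candidate(task_input)
        passes.append(bool(check(output)))
        tokens.append(used)
    return {
        "pass_rate": sum(passes) / len(passes),
        "mean_tokens": sum(tokens) / len(tokens),
    }


def verbose_stub(text):
    """Stand-in for an expensive configuration that always succeeds."""
    return text.upper(), 120


def terse_stub(text):
    """Stand-in for a cheap configuration that fails on longer inputs."""
    return (text.upper() if len(text) < 10 else text), 40


cases = [
    ("hello", lambda out: out == "HELLO"),
    ("token budget", lambda out: out == "TOKEN BUDGET"),
]

for name, candidate in [("verbose", verbose_stub), ("terse", terse_stub)]:
    print(name, run_eval(candidate, cases))
```

Reporting pass rate next to token use makes the trade-off visible before a prompt, model, or routing change is rolled out.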

When building or scripting agent workflows via API:

Put stable prompt prefixes first — place invariant instructions before variable task data to improve prompt-cache reuse.
Keep tool contracts compact — design tools to return concise, structured results rather than verbose text.
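
The prefix-ordering advice can be sketched as follows. The example assumes a provider that reuses cached prompt prefixes when the leading text is identical across requests. The instruction text, the `build_prompt` helper, and the task fields are hypothetical.

```python
import json

# Invariant instructions: keep these identical across requests so a
# provider-side prompt cache (where one exists) can reuse the shared prefix.
STABLE_PREFIX = (
    "You are a code review assistant.\n"
    "Follow the repository conventions.\n"
    "Reply with a JSON object containing 'verdict' and 'notes'.\n"
)


def build_prompt(task_data):
    """Place the invariant prefix first and the variable task data last."""
    # sort_keys keeps serialisation deterministic for identical inputs.
    return STABLE_PREFIX + "\nTask:\n" + json.dumps(task_data, sort_keys=True)


def shared_prefix_length(a, b):
    """Count the leading characters two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


first = build_prompt({"file": "a.py", "change": "rename helper"})
second = build_prompt({"file": "b.py", "change": "add test"})

# The whole instruction block is shared; only the trailing task data differs.
print(shared_prefix_length(first, second) >= len(STABLE_PREFIX))
```

Anything that varies per request, such as timestamps or task identifiers, belongs after the stable block; placing it earlier changes the prefix and prevents reuse.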

References