Stop Burning 85% of Your AI Context on Invisible Waste
Your AI is getting dumber and you can't see it happening. Every message you send, every tool call you make, your supposedly cutting-edge assistant is quietly hemorrhaging cognitive capacity on garbage you never asked for. Ghost tokens. Bloated configs. Duplicate system prompts that whisper the same instructions fifty times per session. Stale memory entries past line 200 that Claude loads but never reads. The worst part? You think you're optimizing because you installed a proxy compressor that cleaned up your git status output.
That's 15% of the problem. What about the other 85%?
Meet Token Optimizer — the brutal, zero-dependency audit engine that finds every wasted token in your AI context window, measures the damage in actual dollars, and actively fixes it without breaking your prompt cache or injecting overhead into your conversation. Created by Alex Greenshpun, this isn't another feel-good CLI wrapper. It's a forensic accounting tool for your AI's attention span, and it's already saving heavy Claude Code users $1,500 to $2,500 per month.
If you're still running /context and calling it visibility, you're flying blind. Here's why Token Optimizer is about to become the most important tool in your AI development stack.
What Is Token Optimizer?
Token Optimizer is a source-available context optimization platform for AI-assisted development environments. Built by Alex Greenshpun and distributed under the PolyForm Noncommercial license, it ships as native plugins for Claude Code (Python), OpenClaw (TypeScript), and Codex (beta Python adapter), with Windsurf and Cursor support on the roadmap.
The project exploded in visibility because it solves a problem almost every power user feels but few can articulate: context quality decay. As your conversation grows, your AI's effective intelligence degrades measurably. Published MRCR benchmarks show Claude's reasoning drops from 93% to 76% between 256K and 1M context fill. Yet most "optimization" tools only compress command output — the visible 15-25% of your token burn — while leaving the structural waste completely untouched.
Token Optimizer covers both runtime waste and structural waste. It runs fully local with zero runtime dependencies, zero telemetry, and zero context tokens consumed. Every measurement writes to a local SQLite database you own. Nothing phones home. Nothing gets pip-installed. The entire engine is pure Python stdlib (or Node stdlib on OpenClaw), making it deployable anywhere from air-gapped machines to CI pipelines.
The tool's killer feature isn't just finding waste — it's proving the savings. A live dashboard auto-regenerates after every session, showing per-turn token breakdowns, cache analysis, cost across four pricing tiers, quality scores, subagent cost attribution, and cumulative dollars saved. You don't guess whether optimization helped. You watch the receipts update in real time.
Key Features That Separate Token Optimizer from Everything Else
Two-Layer Waste Detection: Most tools handle runtime compression (verbose CLI output). Token Optimizer adds structural audit — scanning CLAUDE.md bloat, unused skills, duplicate system prompts, stale MEMORY.md entries, orphaned topic files, and dead MCP servers. This structural layer represents 75-85% of typical token waste.
Smart Compaction with Session Continuity: Auto-compaction destroys 60-70% of your conversation per trigger. After 2-3 compactions, 88-95% of original context is gone. Token Optimizer checkpoints session state at progressive thresholds (20%, 35%, 50%, 65%, 80% fill plus quality drop bands), restores dropped decisions after compaction, and injects digests of large tool outputs so your AI remembers what it already processed.
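The compounding loss is easy to sanity-check. Here is a minimal sketch, assuming each compaction discards the same fixed share of whatever context is still left (the 60-70% range comes from the paragraph above; the function name is illustrative and not part of Token Optimizer):

```python
def fraction_lost(loss_per_compaction: float, compactions: int) -> float:
    """Fraction of the original context gone after repeated compactions.

    Assumes every compaction drops the same share of what is still left,
    so the surviving share shrinks geometrically.
    """
    surviving = (1.0 - loss_per_compaction) ** compactions
    return 1.0 - surviving


for n in (1, 2, 3):
    low = fraction_lost(0.60, n)
    high = fraction_lost(0.70, n)
    print(f"after {n} compaction(s): {low:.0%} to {high:.0%} lost")
```

Under this simple model the loss comes out at about 84-91% after two compactions and 94-97% after three, the same ballpark as the 88-95% figure quoted above.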
Seven-Signal Quality Scoring: A weighted scoring system tracks context fill (20%), stale reads (20%), bloated results (20%), compaction depth (15%), duplicates (10%), decision density (8%), and agent efficiency (7%). Grades range from S (90-100) to F (0-49), with color-coded status bar degradation from green through red.
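To make the weighting concrete, here is a rough sketch of how seven sub-scores could roll up into one number. The weights are the ones listed above. The signal key names, the 0-100 scale for each sub-score, and the example values are assumptions for illustration, not the tool's actual internals:

```python
# Signal weights as listed above; they sum to 100%.
WEIGHTS = {
    "context_fill": 0.20,
    "stale_reads": 0.20,
    "bloated_results": 0.20,
    "compaction_depth": 0.15,
    "duplicates": 0.10,
    "decision_density": 0.08,
    "agent_efficiency": 0.07,
}


def quality_score(signals: dict[str, float]) -> float:
    """Weighted average of per-signal scores, each assumed to be 0-100."""
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


# Hypothetical session: healthy overall, but with many stale reads.
example = {
    "context_fill": 90,
    "stale_reads": 40,
    "bloated_results": 80,
    "compaction_depth": 100,
    "duplicates": 70,
    "decision_density": 60,
    "agent_efficiency": 50,
}
print(round(quality_score(example), 1))
```

Because stale reads carry a 20% weight, one weak signal is enough to pull this hypothetical session well below the S band (90-100).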
Active Compression v5: Seven independently toggleable features that actively reduce incoming tokens — Quality Nudges, Loop Detection, Delta Mode (smart re-reads showing only diffs), Structure Map (AST-based skeletons replacing full file re-reads), Bash Compression (16 CLI handlers), Activity Mode Detection, and Decision Extraction.
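For a feel of what an AST-based skeleton looks like, here is a small sketch built on Python's standard ast module (Python 3.9+ for ast.unparse). It keeps class names and function signatures and drops every body. This illustrates the general technique only; the skeleton function and its output format are assumptions, not Structure Map's actual code:

```python
import ast


def skeleton(source: str) -> str:
    """Return class names and function signatures only, dropping all bodies."""
    lines = []

    def visit(node: ast.AST, depth: int) -> None:
        indent = "    " * depth
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                keyword = "async def" if isinstance(child, ast.AsyncFunctionDef) else "def"
                lines.append(f"{indent}{keyword} {child.name}({ast.unparse(child.args)}): ...")
            elif isinstance(child, ast.ClassDef):
                lines.append(f"{indent}class {child.name}:")
                visit(child, depth + 1)  # descend to pick up methods

    visit(ast.parse(source), 0)
    return "\n".join(lines)


sample = '''
class Cache:
    def get(self, key, default=None):
        return self.data.get(key, default)

def main():
    print("hi")
'''
print(skeleton(sample))
```

The savings grow with file size, since long function bodies collapse to a single signature line each.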
Fully Local Dashboard: Single-file HTML dashboard at http://localhost:24842/token-optimizer (24843 for Codex). Auto-updates after every session via SessionEnd hook. Tracks per-turn costs, cache hit rates, model mix, skill adoption, CLAUDE.md/MEMORY.md health cards, drift detection, and cumulative savings. Zero network calls, zero tokens from your context budget.
Coach Mode and Fleet Auditor: /token-coach provides prioritized fixes with exact token savings, detecting 8 named anti-patterns. Fleet Auditor scans across multiple agent systems to find idle burns, model misrouting, and config bloat with dollar savings per finding.
Where Token Optimizer Absolutely Shines
High-Volume Claude Code Development: Teams running 500+ message sessions with Opus 4.6 hit degradation cliffs fast. Token Optimizer's progressive checkpoints and quality nudges land /compact at optimal moments, preserving decisions that would otherwise vanish. One real user: 942 sessions, 6.13B input tokens, 90% Opus — $1,500-$2,500 monthly savings.
Codebase Navigation at Scale: Reading the same 180,000-token Python file 5 times per session? Structure Map compresses re-reads by 95-99%, turning full source into 250-token skeletons. Delta Mode handles edited files, showing only unified diffs via Python's difflib — 97% savings on typical re-reads.
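The diff-instead-of-re-read idea can be sketched with difflib from the standard library. This is a minimal illustration under assumed names (delta, the file labels, and the 200-line sample are all hypothetical), not Token Optimizer's implementation:

```python
import difflib


def delta(previous: str, current: str, path: str = "file.py") -> str:
    """Unified diff between two reads of the same file; empty if unchanged."""
    diff = difflib.unified_diff(
        previous.splitlines(keepends=True),
        current.splitlines(keepends=True),
        fromfile=f"{path} (cached read)",
        tofile=f"{path} (current)",
        n=1,  # one line of context around each change
    )
    return "".join(diff)


# Hypothetical 200-line file with a single edited line.
old = "".join(f"line {i}\n" for i in range(1, 201))
new = old.replace("line 120\n", "line 120 (edited)\n")
patch = delta(old, new)
print(patch)
print(f"{len(patch)} chars sent instead of {len(new)}")
```

An unchanged file produces an empty diff, so a repeated read of the same content costs almost nothing.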
Multi-Agent Orchestration: Subagents consuming 30%+ of your budget without visibility? Token Optimizer breaks down orchestrator vs worker spend, ranks top offenders, and flags misrouted models (Opus on simple edits, Haiku on complex tasks).
Long-Running Maintenance Sessions: MEMORY.md silently truncates at line 200. Most power users never know their carefully curated notes become invisible ghosts still eating tokens. memory-review finds orphaned files, broken links, invisible entries, duplicate rules, and stale content with configurable thresholds.
Cross-Platform Fleet Management: Organizations running Claude Code, Codex beta, and OpenClaw simultaneously get unified waste detection. The Fleet Auditor surfaces config drift, idle burns, and optimization opportunities across every ecosystem in one command.
Step-by-Step Installation and Setup Guide
Claude Code (Recommended for All Platforms)
The plugin marketplace install works identically on macOS, Linux, and Windows:
/plugin marketplace add alexgreensh/token-optimizer
/plugin install token-optimizer@alexgreensh-token-optimizer
Then activate with:
/token-optimizer
Critical: Enable auto-update immediately. Claude Code ships with auto-update off by default for third-party plugins, and authors cannot override this. Run /plugin → Marketplaces tab → select alexgreensh-token-optimizer → Enable auto-update. Token Optimizer prints a one-time reminder on first SessionStart so you don't forget.
Windows-specific warning: Do NOT also run the install.sh bash script. Combining plugin install with script install creates an EBUSY: resource busy or locked error because Git Bash holds file handles open during plugin cloning. If you hit this:
- Close all Claude Code windows and Git Bash terminals
- End lingering git.exe processes in Task Manager
- Delete C:\Users\<you>\.claude\token-optimizer and C:\Users\<you>\.claude\plugins\marketplaces\alexgreensh-token-optimizer
- Reboot if files remain locked, then delete
- Open a fresh Claude Code window and re-run the two /plugin commands
Manual ZIP fallback (if plugin install fails repeatedly): Download the repo ZIP (~800 KB), extract to C:\Users\<you>\.claude\token-optimizer\, then run python measure.py setup-quality-bar from that directory.
macOS / Linux Alternative: Script Install
For users preferring git-managed auto-updates via git pull --ff-only:
git clone https://github.com/alexgreensh/token-optimizer.git ~/.claude/token-optimizer
bash ~/.claude/token-optimizer/install.sh
Do not use this on Windows, and never combine with plugin install. Pick exactly one method per platform.
Codex Beta Setup
codex plugin marketplace add alexgreensh/token-optimizer
Then in Codex TUI: /plugins → install Token Optimizer. Activate conversationally with "Run Token Optimizer".
Configure hooks and dashboard:
TOKEN_OPTIMIZER_RUNTIME=codex python3 skills/token-optimizer/scripts/measure.py codex-install --project "$PWD"
TOKEN_OPTIMIZER_RUNTIME=codex python3 skills/token-optimizer/scripts/measure.py setup-daemon
Dashboard opens at http://localhost:24843/token-optimizer — separate port from Claude Code's 24842, both can run simultaneously.
OpenClaw Setup
# From GitHub (recommended)
openclaw plugins install github:alexgreensh/token-optimizer
# From ClawHub
openclaw plugins install token-optimizer
Inside OpenClaw: /token-optimizer for guided audit with coaching. Zero Python dependency, zero runtime dependencies, works with any model your gateway configures.
Essential Post-Install Commands
Enable the quality bar in your terminal:
python3 measure.py setup-quality-bar
Enable smart compaction with checkpoint/restore hooks:
python3 measure.py setup-smart-compact
Start the auto-updating dashboard daemon:
python3 measure.py setup-daemon
Real Code Examples from the Repository
Example 1: Enabling Smart Compaction with Progressive Checkpoints
Smart Compaction is Token Optimizer's answer to the catastrophic data loss of auto-compaction. Instead of watching 60-70% of your conversation vanish, you checkpoint decisions and restore them after compaction fires.
# One-time setup: install checkpoint and restore hooks
python3 measure.py setup-smart-compact
This command configures hooks that capture session state at multiple thresholds — 20%, 35%, 50%, 65%, and 80% context fill, plus quality drops below 80, 70, 50, and 40. It also snapshots before agent fan-out and after large edit batches. The restoration logic picks the richest eligible checkpoint, not merely the most recent, ensuring you recover maximum decision density.
The background guards use one-shot threshold capture with cooldown suppression and deterministic extraction. No LLM calls occur in the checkpoint path — this is pure local state serialization, keeping your context budget entirely for actual work.
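One way to picture one-shot threshold capture and richest-checkpoint selection is the sketch below. The fill thresholds are the ones named above. The class names, the use of decision count as the richness measure, and the omission of cooldown suppression and quality-drop triggers are simplifying assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

FILL_THRESHOLDS = (20, 35, 50, 65, 80)  # percent fill, from the text above


@dataclass
class Checkpoint:
    fill_percent: float
    decisions: list[str]


@dataclass
class CheckpointGuard:
    fired: set[int] = field(default_factory=set)
    saved: list[Checkpoint] = field(default_factory=list)

    def observe(self, fill_percent: float, decisions: list[str]) -> bool:
        """Capture once the first time fill crosses any unfired threshold."""
        crossed = [t for t in FILL_THRESHOLDS if fill_percent >= t and t not in self.fired]
        if not crossed:
            return False
        self.fired.update(crossed)  # each threshold fires only once
        self.saved.append(Checkpoint(fill_percent, list(decisions)))
        return True

    def richest(self) -> Optional[Checkpoint]:
        """Pick the checkpoint holding the most decisions, not the newest."""
        return max(self.saved, key=lambda c: len(c.decisions), default=None)


guard = CheckpointGuard()
guard.observe(22, ["use SQLite"])                              # crosses 20
guard.observe(24, ["use SQLite", "skip ORM"])                  # nothing new, no capture
guard.observe(52, ["use SQLite", "skip ORM", "pin Python"])    # crosses 35 and 50
guard.observe(81, ["pin Python"])                              # crosses 65 and 80
print(guard.richest())
```

In this run the checkpoint taken at 52% fill is restored rather than the later one at 81%, because it holds more decisions.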
Example 2: Managing Active Compression Features via CLI
Token Optimizer v5's seven compression features are all independently controllable. Here's how to inspect, toggle, and measure them:
# Show all features with current state and impact estimates
python3 measure.py v5 status
# Enable Delta Mode for smart re-reads (shows only diffs on file re-reads)
python3 measure.py v5 enable delta_mode
# Disable Bash Compression when you need full verbose output for debugging
python3 measure.py v5 disable bash_compress
# Get detailed documentation for any single feature
python3 measure.py v5 info delta_mode
# View actual measured savings from your local telemetry database
python3 measure.py compression-stats --days 30
The compression-stats command queries your local SQLite database at ~/.claude/_backups/token-optimizer/trends.db, a file you fully own and can inspect, export, or delete. Output shows total events per feature, exact tokens saved, compression ratio, and quality preservation rate. The verified flag distinguishes precise measurements (Delta Mode knows exact before/after token counts) from heuristic estimates (Structure Map uses calculated compression ratios).
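Since the database is an ordinary SQLite file, you can look inside it yourself. The article does not document the schema, so this sketch assumes nothing about it: it opens the file read-only and lists whatever tables exist. The list_tables helper is illustrative, not part of the tool:

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: Path) -> list[str]:
    """Return table names from a SQLite file, opened read-only."""
    uri = f"{db_path.resolve().as_uri()}?mode=ro"  # read-only: never modifies the file
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Path taken from the text above.
db = Path.home() / ".claude" / "_backups" / "token-optimizer" / "trends.db"
if db.exists():
    print(list_tables(db))
else:
    print(f"no database at {db}")
```

Opening with mode=ro means a stray query cannot corrupt your measurement history.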
Example 3: Memory Health Structural Audit
Your MEMORY.md is probably broken in ways that silently waste tokens. This audit exposes structural failures that no other tool detects:
# Full structural audit with human-readable findings
python3 measure.py memory-review
# Machine-readable JSON for integration with external dashboards
python3 measure.py memory-review --json
# Show actionable fixes with exact commands to apply
python3 measure.py memory-review --apply
# Custom staleness threshold for resolved/superseded content
python3 measure.py memory-review --stale-days 90
This scans for seven specific failure modes: orphaned topic files (unlinked memory directory entries), broken links (index entries pointing to missing files), invisible entries (content past line 200 that Claude truncates), inline content that should live in topic files, duplicate rules already present in CLAUDE.md, stale entries past your configured threshold, and task leakage (TODO lists belonging in task trackers).
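The "invisible entries" check is the simplest of these to reason about. A minimal sketch, assuming the 200-line cutoff described above and treating any non-blank line past it as invisible (the function name and sample content are hypothetical):

```python
VISIBLE_LINES = 200  # truncation point described above


def invisible_entries(memory_text: str, limit: int = VISIBLE_LINES) -> list[tuple[int, str]]:
    """Return (line_number, text) for every non-blank line past the limit."""
    lines = memory_text.splitlines()
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if number > limit and text.strip()
    ]


# Hypothetical MEMORY.md with 203 one-line notes.
memory = "\n".join(f"- note {i}" for i in range(1, 204))
for number, text in invisible_entries(memory):
    print(f"line {number}: {text}")
```

Anything this reports is a candidate for moving into a topic file or deleting outright.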
The dashboard surfaces CLAUDE.md Health and MEMORY.md Health cards on the Overview tab, showing line count, orphan count, and status at a glance. For semantic contradiction detection — two rules saying opposite things — run the audit inside a Claude session where the tool extracts all NEVER/ALWAYS/MUST rules and Claude reviews them in context with no extra LLM call.
Example 4: Tool Result Archive and Retrieval
Large tool results automatically archive to disk with inline hints visible to your AI, enabling post-compaction recovery without re-running commands:
# List all archived tool results with their IDs
python3 measure.py expand --list
# Manually retrieve a specific archived result by tool-use-id
python3 measure.py expand <tool-use-id>
When a tool result exceeds 4KB, the full output archives to disk and your conversation receives a preview plus an inline hint: [Full result archived (12,400 chars). Use 'expand abc123' to retrieve.]. This hint is visible to Claude, not just you. After compaction destroys the original result summary, if the model needs full output to answer your next question, it invokes expand abc123 automatically and archived content returns through the CLI.
The primary flow requires zero user action — the model sees the hint, requests the bytes, and receives them. You can also run expand manually when you want to inspect specific archived results.
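The archive-and-hint flow can be approximated in a few lines. The 4KB threshold and the hint wording follow the description above. Everything else is an assumption for illustration: the real tool keys archives by tool-use-id, while this sketch substitutes a content hash, and the preview length and file layout are invented:

```python
import hashlib
import tempfile
from pathlib import Path

ARCHIVE_THRESHOLD = 4 * 1024  # 4KB, per the text above
PREVIEW_CHARS = 200           # hypothetical preview length


def archive_if_large(result: str, archive_dir: Path) -> str:
    """Return the result unchanged, or a preview plus hint if it is large."""
    data = result.encode("utf-8")
    if len(data) <= ARCHIVE_THRESHOLD:
        return result
    result_id = hashlib.sha256(data).hexdigest()[:8]  # stand-in for a tool-use-id
    archive_dir.mkdir(parents=True, exist_ok=True)
    (archive_dir / f"{result_id}.txt").write_text(result, encoding="utf-8")
    hint = (
        f"[Full result archived ({len(result):,} chars). "
        f"Use 'expand {result_id}' to retrieve.]"
    )
    return f"{result[:PREVIEW_CHARS]}\n{hint}"


def expand(result_id: str, archive_dir: Path) -> str:
    """Read an archived result back from disk."""
    return (archive_dir / f"{result_id}.txt").read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    shown = archive_if_large("log line\n" * 1000, Path(tmp))
    print(shown.splitlines()[-1])
```

The conversation only ever carries the short preview and the hint; the full bytes stay on disk until something asks for them.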
Example 5: Git-Aware Context Suggestions
Token Optimizer analyzes your working tree to suggest contextually relevant files beyond what's currently loaded:
# Suggest files relevant to current git changes
python3 measure.py git-context
# Machine-readable output for scripting
python3 measure.py git-context --json
This identifies test companions (files that test your modified code), frequently co-changed files from your last 50 commits, and import chains for Python/JS/TS. Instead of manually hunting for related files or over-loading context with entire directories, you get surgically precise suggestions that minimize token burn while maximizing relevant context.
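Co-change detection is the easiest of these three signals to reproduce by hand. The sketch below ranks files that share commits with a target file, given the text output of git log -n 50 --name-only --pretty=format:%x00 (each commit's file list is preceded by a NUL byte). The function name and the sample log are assumptions, and the tool's own ranking may differ:

```python
from collections import Counter


def co_changed(log_output: str, target: str, top: int = 5) -> list[tuple[str, int]]:
    """Rank files that appear in the same commits as `target`.

    Expects NUL-separated commits, one changed file path per line.
    """
    counts: Counter = Counter()
    for chunk in log_output.split("\x00"):
        files = {line.strip() for line in chunk.splitlines() if line.strip()}
        if target in files:
            counts.update(files - {target})
    return counts.most_common(top)


# Hypothetical log covering three commits.
sample_log = (
    "\x00\nsrc/app.py\ntests/test_app.py\n\n"
    "\x00\nsrc/app.py\ntests/test_app.py\nREADME.md\n\n"
    "\x00\ndocs/guide.md\n"
)
print(co_changed(sample_log, "src/app.py"))
```

Files that keep showing up alongside your edited file are usually the ones worth loading, which is far cheaper than pulling in a whole directory.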
Advanced Usage and Best Practices
Start with /token-coach before building new projects. Most users install optimization tools after months of accumulated waste. Running coach mode before writing your first CLAUDE.md prevents anti-pattern accumulation entirely. The tool detects 8 named anti-patterns including "The Kitchen Sink" (loading every possible skill), "The Hoarder" (never cleaning MEMORY.md), and "The Monolith" (500+ word prompts doing too much).
Monitor the 7-signal quality score religiously. Don't wait for visible slowdown. When your grade drops from A to B, investigate immediately. The degradation bands shift from green (<50% fill) through yellow (50-70%), orange (70-80%), to red (80%+). Hitting red means you're already paying premium prices for degraded reasoning.
Use progressive checkpoints strategically. The default thresholds (20%, 35%, 50%, 65%, 80%) work for most sessions, but intense agent-fan-out sessions benefit from custom pre-fan-out snapshots. The richest-eligible restoration algorithm means more frequent checkpoints never hurt — it always selects maximum information density.
Toggle Bash Compression contextually. Keep it enabled for routine git status, pytest, and lint runs. Disable temporarily (python3 measure.py v5 disable bash_compress) when debugging specific test failures or reviewing careful diffs where every line matters. The toggle applies instantly without Claude Code restart.
Set your pricing tier correctly. The dashboard tracks costs across Anthropic API, Vertex Global, Vertex Regional, and AWS Bedrock. Misconfigured tiers mean inaccurate savings calculations. Run python3 measure.py pricing-tier to verify and switch.
How Token Optimizer Compares to Alternatives
| Capability | Token Optimizer | /context | context-mode | Proxy Compressors |
|---|---|---|---|---|
| Structural waste audit | Deep, per-component | Summary only | No | No |
| Quality degradation tracking | 7-signal score with grades | Capacity % only | No | No |
| Compaction survival | Progressive checkpoints + restore | No | Session guide only | No |
| Runtime output compression | 16 CLI handlers, individually toggleable | No | Yes | Yes, always-on |
| Measures actual savings | Local telemetry with before/after | No | No | No |
| Read deduplication & smart diffs | Yes | No | No | No |
| Behavioral coaching | 11 detectors, subagent breakdown | Basic suggestions | No | No |
| CLAUDE.md/MEMORY.md health | 8 auditors + attention scoring | No | No | No |
| Fleet-level detection | Yes | No | No | No |
| Zero context tokens consumed | Yes, external process | ~200 tokens added | MCP overhead | Injects instructions |
| Zero runtime dependencies | Pure stdlib | N/A | Varies | External binary |
| Zero telemetry | Yes | Yes | Varies | Opt-out telemetry |
| Cross-platform support | Claude Code, Codex beta, OpenClaw | Claude Code only | Several | Several |
Critical distinction on cache safety: Some tools "optimize" by modifying or removing existing conversation blocks. This breaks the prompt cache, forcing full prefix re-sends at uncached rates. The "savings" from removing a few thousand tokens get obliterated by cache invalidation costs across subsequent messages. Token Optimizer never touches content already in your context — structural optimization runs between sessions, active compression works on new incoming content, and compaction boundaries are handled via additive checkpoints.
Frequently Asked Questions
Does Token Optimizer degrade my AI's context quality?
No. Structural optimization removes only genuinely unused components — skills never invoked, duplicate configs, orphaned memory entries. Active Compression features are independently toggleable, and lossy ones like Bash Compression can be disabled instantly. The 7-signal quality score actively tracks any degradation; most users see scores improve after optimization because the model has more room for real work.
Will this break my prompt cache?
No — and this matters enormously. Token Optimizer never modifies content already in your context. Your stable prefix stays intact, meaning you save money twice: less input per turn, and smaller cache-read bills on every subsequent message. Tools that edit existing conversation blocks invalidate your cache and cost more, not less.
Does Token Optimizer send any data anywhere?
Absolutely no network calls. No analytics. No opt-out telemetry because there's nothing to opt out of. Every event writes to a local SQLite file that you own and can inspect, export, or delete at will.
Can installation or runtime errors hurt my session?
No. All hooks are non-blocking with fail-open design. If any Token Optimizer script errors, your command runs normally. Compression features are individually toggleable. Checkpoints are additive. Quality scoring is read-only measurement.
How much can I realistically save?
Heavy users (90% Opus, 6B+ input tokens monthly) see $1,500-$2,500 savings. Lighter users see proportional savings. Structural audit wins are immediate regardless of volume and compound because smaller prefixes mean smaller cache-read bills on every turn.
What's the catch with zero dependencies?
No catch. Pure Python stdlib on Claude Code/Codex, pure Node stdlib on OpenClaw. Nothing to pip install, nothing to npm install at runtime. What you clone is everything it needs.
Why does the quality bar disappear sometimes?
Claude Code's built-in /statusline command overwrites Token Optimizer's entry in ~/.claude/settings.json. SessionStart detects this and auto-restores the quality bar on your next session. You'll see a one-line notice explaining what happened. To permanently disable: python3 measure.py setup-quality-bar --uninstall.
Conclusion: Your AI's Attention Is Finite. Stop Squandering It.
Token Optimizer isn't a nice-to-have optimization — it's essential infrastructure for anyone spending real money on AI-assisted development. The difference between /context and Token Optimizer is the difference between a fuel gauge and a full engine diagnostic. One tells you you're running low. The other shows you exactly which cylinders are misfiring, which gaskets are leaking, and how much each problem costs per mile.
The structural waste alone — bloated CLAUDE.md, unused skills, duplicate prompts, invisible MEMORY.md entries — represents 75-85% of your token burn that no proxy compressor will ever touch. Add Smart Compaction that preserves decisions across the 60-70% destruction of auto-compact, Delta Mode that turns 2,000-token re-reads into 50-token diffs, and a dashboard that proves every dollar saved with local telemetry you fully own... and the value proposition becomes undeniable.
I've watched too many developers throw expensive model calls at problems caused by their own invisible context bloat. Token Optimizer ends that waste. Install it, run /token-optimizer, and watch your first audit expose the ghosts eating your budget. Your AI will think clearer, respond faster, and cost dramatically less.
Get started now: github.com/alexgreensh/token-optimizer