Technical whitepaper / v2.0
ReasonBlocks captures proven reasoning patterns from AI agent calls and injects them into future calls at runtime. The system maintains a shared pattern library (190k+ distilled traces) and uses embedding-based retrieval to match relevant patterns to incoming tasks.
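To make the retrieval step concrete, here is a minimal sketch of embedding-based matching. It assumes cosine similarity over precomputed vectors; the function names, the toy 3-dimensional vectors, and the similarity measure are illustrative assumptions, since this document does not specify the embedding model or scoring function ReasonBlocks uses.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(
    task_vec: list[float], library: dict[str, list[float]]
) -> Optional[tuple[str, float]]:
    """Return (pattern_id, score) for the most similar pattern, or None for an empty library."""
    best: Optional[tuple[str, float]] = None
    for pattern_id, vec in library.items():
        score = cosine_similarity(task_vec, vec)
        if best is None or score > best[1]:
            best = (pattern_id, score)
    return best


# Toy embeddings (hypothetical); real embeddings would have hundreds of dimensions.
library = {"pattern-a": [1.0, 0.0, 0.0], "pattern-b": [0.0, 1.0, 0.0]}
match = best_match([0.9, 0.1, 0.0], library)
print(match)
```

At the scale of a 190k-entry library, a linear scan like this would normally be replaced by an approximate nearest-neighbour index; the scoring idea stays the same.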
Patterns are stored in a 3-field distilled format (Format D): the situation the agent encountered, the dead ends it explored, and the unlock that led to the correct solution. This format was selected through iterative evaluation as the highest-signal injection format across models.
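The three fields can be pictured as a small record. The sketch below is one possible representation, not the actual storage schema: the class name, field names, `render` method, and the example pattern text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatDPattern:
    """Hypothetical container for the three Format D fields."""

    situation: str              # what the agent encountered
    dead_ends: tuple[str, ...]  # approaches that were explored and did not work
    unlock: str                 # the insight that led to the correct solution

    def render(self) -> str:
        """Format the pattern as plain text suitable for adding to a prompt."""
        lines = [f"Situation: {self.situation}", "Dead ends:"]
        lines += [f"- {dead_end}" for dead_end in self.dead_ends]
        lines.append(f"Unlock: {self.unlock}")
        return "\n".join(lines)


# Invented example content, for illustration only.
pattern = FormatDPattern(
    situation="A date comparison test fails only in some time zones",
    dead_ends=(
        "Editing the expected value in the test",
        "Changing the system locale",
    ),
    unlock="Normalize both datetimes to UTC before comparing",
)
print(pattern.render())
```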
This document explains the benchmark methodology, how metrics are derived, and what the results mean for real-world agent deployments.
All benchmarks were run on SWE-bench Verified, a curated subset of real GitHub issues from popular open-source Python repositories. Each problem requires the agent to diagnose a bug from an issue description and produce a working patch, verified by running the repository's test suite in a Docker container.
Eval parameters
| Parameter | Value |
|---|---|
| Benchmark | SWE-bench Verified |
| Verification | Docker test harness |
| Pattern library | 190k distilled traces |
| Injection format | Format D (3-field) |
| Problems evaluated | 50 |
| Results reported on | High-confidence matches |
Three models were tested: Claude Haiku, Claude Sonnet, and Claude Opus. Each problem was run twice per model: once as a baseline (no injection) and once with ReasonBlocks pattern injection. Results below cover only the high-confidence matches, that is, the problems where the system's confidence gate fired.
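The paired protocol can be sketched as follows. This is a simplified illustration of scoring only the gated subset; the `RunResult` record, its field names, and the toy outcomes are assumptions and do not reproduce the benchmark data.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One problem, run twice for one model (hypothetical record layout)."""

    problem_id: str
    gate_fired: bool       # did the confidence gate fire for this problem?
    baseline_passed: bool  # test suite result without injection
    injected_passed: bool  # test suite result with pattern injection


def gated_accuracy(results: list[RunResult]) -> tuple[float, float, int]:
    """Return (baseline accuracy, injected accuracy, n) over gated problems only."""
    gated = [r for r in results if r.gate_fired]
    if not gated:
        raise ValueError("the confidence gate did not fire on any problem")
    n = len(gated)
    baseline = sum(r.baseline_passed for r in gated) / n
    injected = sum(r.injected_passed for r in gated) / n
    return baseline, injected, n


# Toy outcomes, not benchmark results.
toy_results = [
    RunResult("p1", True, True, True),
    RunResult("p2", True, False, True),
    RunResult("p3", True, True, True),
    RunResult("p4", True, False, False),
    RunResult("p5", False, True, True),  # gate did not fire, so it is excluded
]
baseline_acc, injected_acc, n_gated = gated_accuracy(toy_results)
print(baseline_acc, injected_acc, n_gated)
```

Because both accuracies are computed over the same gated subset, the comparison is paired, but it says nothing about problems where the gate stayed closed.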
On problems where the confidence gate fired, ReasonBlocks improved accuracy and reduced token usage across all three models.
Accuracy
| Model | Baseline | + ReasonBlocks | Relative gain |
|---|---|---|---|
| Haiku | 60% | 75% | +25% |
| Sonnet | 75% | 80% | +6.7% |
| Opus | 70% | 90% | +28.6% |
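The gain figures in this table are relative to each model's baseline, not percentage-point differences. The short check below recomputes both views from the table's accuracy values; the function and variable names are illustrative.

```python
def relative_gain(baseline: float, with_injection: float) -> float:
    """Improvement expressed as a fraction of the baseline accuracy."""
    return (with_injection - baseline) / baseline


def point_gain(baseline: float, with_injection: float) -> float:
    """Improvement in percentage points."""
    return (with_injection - baseline) * 100


# (baseline accuracy, accuracy with ReasonBlocks), taken from the table above.
accuracy = {
    "Haiku": (0.60, 0.75),
    "Sonnet": (0.75, 0.80),
    "Opus": (0.70, 0.90),
}

for model, (base, injected) in accuracy.items():
    print(
        f"{model}: +{point_gain(base, injected):.0f} points, "
        f"+{relative_gain(base, injected) * 100:.1f}% relative"
    )
```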
Efficiency
| Model | Step savings | Avg token savings | Peak token savings |
|---|---|---|---|
| Haiku | 5% | 21% | Up to 29% |
| Sonnet | 6% | 21% | Up to 47% |
| Opus | 9% | 25% | Up to 62% |
The largest relative accuracy gains were on Opus (+28.6%, from 70% to 90%) and Haiku (+25%, from 60% to 75%). Haiku with ReasonBlocks (75%) matches baseline Sonnet performance (75%) on these problems at a fraction of the inference cost, so a cheaper model can perform like a more expensive one.
Token savings come from two mechanisms: fewer agent steps (the model reaches the correct solution faster) and shorter reasoning chains per step (the model doesn't explore dead ends it would have otherwise). Estimated dollar savings are calculated from the observed token reduction and current model pricing.
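To see how the two mechanisms might combine, the sketch below assumes total tokens are roughly steps multiplied by average tokens per step, so that (1 - total saving) = (1 - step saving) × (1 - per-step saving). That multiplicative model is an assumption made here for illustration; the benchmark reports only the step and token savings from the Efficiency table, not a per-step figure.

```python
def implied_per_step_reduction(total_token_save: float, step_save: float) -> float:
    """Solve (1 - total) = (1 - step) * (1 - per_step) for per_step."""
    if not 0.0 <= step_save < 1.0:
        raise ValueError("step_save must be in [0, 1)")
    return 1.0 - (1.0 - total_token_save) / (1.0 - step_save)


# (average token saving, step saving) per model, from the Efficiency table.
efficiency = {
    "Haiku": (0.21, 0.05),
    "Sonnet": (0.21, 0.06),
    "Opus": (0.25, 0.09),
}

implied = {
    model: implied_per_step_reduction(total, step)
    for model, (total, step) in efficiency.items()
}
for model, value in implied.items():
    print(f"{model}: about {value:.1%} fewer tokens per step (implied, not measured)")
```

Under this assumption, most of the token saving would come from shorter reasoning per step rather than from fewer steps.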
Formula
estimated_savings = traces_injected × avg_tokens_saved_per_injection × model_cost_per_token
avg_tokens_saved_per_injection is the average number of tokens a task consumes multiplied by the token reduction observed in the benchmark: 21-25% on average for matched problems, varying by model. model_cost_per_token uses current Anthropic API rates.
Example — Opus at scale
| Parameter | Value |
|---|---|
| Agent tasks per month | 10,000 |
| High-confidence match rate | ~75% |
| Tasks with injection | 7,500 |
| Avg token save per task | 25% |
| Avg tokens per task (Opus) | ~50,000 |
| Opus rate ($75/MTok) | $0.075/1K tokens |
| Estimated monthly savings | $7,031 |
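The example figure can be reproduced directly from the formula. The sketch below plugs in the values from the table; the function and parameter names are illustrative, and it applies the single $75/MTok rate to every saved token, as the example does.

```python
def estimated_savings(
    tasks_injected: int,
    avg_tokens_per_task: float,
    token_reduction: float,
    usd_per_mtok: float,
) -> float:
    """Estimated dollar savings: tokens saved across injected tasks times the token price."""
    tokens_saved = tasks_injected * avg_tokens_per_task * token_reduction
    return tokens_saved / 1_000_000 * usd_per_mtok


tasks_per_month = 10_000
match_rate = 0.75  # approximate high-confidence match rate from the example

monthly_savings = estimated_savings(
    tasks_injected=round(tasks_per_month * match_rate),  # 7,500 tasks
    avg_tokens_per_task=50_000,
    token_reduction=0.25,
    usd_per_mtok=75.0,
)
print(f"${monthly_savings:,.2f}")
```

The result scales linearly with each input, so a lower match rate, a smaller task size, or a cheaper per-token rate reduces the estimate proportionally.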
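This excludes the value of accuracy improvement. On Opus, accuracy on matched problems rose from 70% to 90%, a 20-point absolute gain (28.6% relative to baseline). Applied to the 7,500 tasks with injection, that is roughly 1,500 additional tasks per month (7,500 × 0.20) that succeed rather than fail and require human intervention or retries.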
AI agents often fail not because the model lacks ability, but because they re-explore the same dead ends on every call. A Format D pattern names the situation, the dead ends to avoid, and the unlock that leads to the solution. This steers the model past failure modes it would otherwise explore, reducing both wasted tokens and incorrect outputs.
The pattern library is collective: patterns that work for one team's agents are available to every agent on the platform. As more agents use the system, the library grows, which is expected to improve match quality and let the confidence gate fire on a higher percentage of tasks.
Questions about this methodology? Contact sajeev@reasonblocks.com