ReasonBlocks

Benchmark Methodology & Results

Technical whitepaper / v2.0

Overview

ReasonBlocks captures proven reasoning patterns from AI agent calls and injects them into future calls at runtime. The system maintains a shared pattern library (190k+ distilled traces) and uses embedding-based retrieval to match relevant patterns to incoming tasks.

Patterns are stored in a 3-field distilled format (Format D): the situation the agent encountered, the dead ends it explored, and the unlock that led to the correct solution. This format was selected through iterative evaluation as the highest-signal injection format across models.
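As an illustration of the 3-field structure described above, here is a minimal sketch of how a Format D pattern might be represented and rendered as text for injection. The class name FormatDPattern, the function render_for_injection, the sample pattern, and the text layout are all hypothetical; the document does not specify the actual schema or injection template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatDPattern:
    """Hypothetical container for one distilled trace in Format D."""
    situation: str              # the situation the agent encountered
    dead_ends: tuple[str, ...]  # approaches that were explored and failed
    unlock: str                 # the insight that led to the correct solution


def render_for_injection(pattern: FormatDPattern) -> str:
    """Render a pattern as plain text that could be prepended to a prompt.

    The layout is illustrative only, not the real injection template.
    """
    dead_ends = "\n".join(f"- {d}" for d in pattern.dead_ends)
    return (
        f"Situation: {pattern.situation}\n"
        f"Dead ends to avoid:\n{dead_ends}\n"
        f"Unlock: {pattern.unlock}"
    )


# Invented example pattern, for illustration only.
example = FormatDPattern(
    situation="Test fails only when the cache is warm",
    dead_ends=("Rewriting the serializer", "Increasing the timeout"),
    unlock="Invalidate the cache key when the schema version changes",
)
rendered = render_for_injection(example)
print(rendered)
```

Keeping the three fields separate, rather than storing one free-text blob, makes it straightforward to render the same pattern differently for different models.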

This document explains the benchmark methodology, how metrics are derived, and what the results mean for real-world agent deployments.

1. Evaluation Setup

All benchmarks were run on SWE-bench Verified, a curated subset of real GitHub issues from popular open-source Python repositories. Each problem requires the agent to diagnose a bug from an issue description and produce a working patch, verified by running the repository's test suite in a Docker container.

Eval parameters

Benchmark	SWE-bench Verified
Verification	Docker test harness
Pattern library	190k+ distilled traces
Injection format	Format D (3-field)
Problems evaluated	50
Results reported on	High-confidence matches

Three models were tested: Claude Haiku, Claude Sonnet, and Claude Opus. Each problem was run twice per model: once as a baseline (no injection) and once with ReasonBlocks pattern injection. Results below are reported only on high-confidence matches, the subset of problems where the system's confidence gate fired (about 40% of problems on this benchmark, as noted under Limitations).
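The retrieval and confidence gate mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ReasonBlocks implementation: it assumes cosine similarity as the match score and a fixed threshold of 0.8 as the gate, neither of which is specified in this document. The function names and the toy two-dimensional embeddings are hypothetical.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_with_gate(
    task_embedding: list[float],
    library: dict[str, list[float]],
    threshold: float = 0.8,  # assumed value; the real gate is not documented here
) -> Optional[tuple[str, float]]:
    """Return (pattern_id, score) for the best match, or None if the gate does not fire."""
    best_id, best_score = None, -1.0
    for pattern_id, embedding in library.items():
        score = cosine_similarity(task_embedding, embedding)
        if score > best_score:
            best_id, best_score = pattern_id, score
    if best_id is None or best_score < threshold:
        return None  # no high-confidence match: run the task without injection
    return best_id, best_score


# Toy library with invented embeddings.
toy_library = {"pattern_a": [1.0, 0.0], "pattern_b": [0.0, 1.0]}
close_match = retrieve_with_gate([1.0, 0.1], toy_library)
no_match = retrieve_with_gate([1.0, 1.0], toy_library)
```

In this sketch a task that falls below the threshold runs exactly as the baseline would, which is why results are reported separately for the problems where the gate fired.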

2. Results — High-Confidence Matches

On problems where the confidence gate fired, ReasonBlocks improved accuracy and reduced token usage across all three models.

Accuracy

Model	Baseline	+ ReasonBlocks	Relative gain
Haiku	60%	75%	+25%
Sonnet	75%	80%	+6.7%
Opus	70%	90%	+28.6%
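The gain figures in the table are relative improvements, (new - baseline) / baseline, not percentage-point differences. The short sketch below recomputes both from the table values; the function and variable names are illustrative.

```python
def relative_gain(baseline: float, with_injection: float) -> float:
    """Relative improvement: (new - baseline) / baseline."""
    return (with_injection - baseline) / baseline


# Accuracy on high-confidence matches, taken from the table above.
accuracy = {
    "Haiku": (0.60, 0.75),
    "Sonnet": (0.75, 0.80),
    "Opus": (0.70, 0.90),
}

for model, (baseline, injected) in accuracy.items():
    rel = relative_gain(baseline, injected)
    points = (injected - baseline) * 100  # absolute difference in percentage points
    print(f"{model}: {rel:+.1%} relative, {points:+.0f} points absolute")
```

For example, Opus moves from 70% to 90%: a 20-point absolute increase, which is a 28.6% relative gain over the 70% baseline.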

Efficiency

Model	Step savings	Avg token savings	Peak token savings
Haiku	5%	21%	Up to 29%
Sonnet	6%	21%	Up to 47%
Opus	9%	25%	Up to 62%

The largest relative accuracy gains were on Opus (+28.6%) and Haiku (+25%). Haiku with ReasonBlocks (75%) matches baseline Sonnet (75%) on these problems at a fraction of the inference cost, so a cheaper model performs on par with a more expensive one.

3. Cost Savings Methodology

Token savings come from two mechanisms: fewer agent steps (the model reaches the correct solution faster) and shorter reasoning chains per step (the model doesn't explore dead ends it would have otherwise). Estimated dollar savings are calculated from the observed token reduction and current model pricing.

Formula

estimated_savings = traces_injected × avg_tokens_saved_per_injection × model_cost_per_token

avg_tokens_saved_per_injection is the model's average tokens per task multiplied by the average token reduction observed in the benchmark (21-25% on matched problems, varying by model). model_cost_per_token uses current Anthropic API rates.

Example — Opus at scale

Agent tasks per month	10,000
High-confidence match rate (assumed; benchmark rate was ~40%)	~75%
Tasks with injection	7,500
Avg token save per task	25%
Avg tokens per task (Opus)	~50,000
Opus rate ($75/MTok)	$0.075/1K tokens
Estimated monthly savings	$7,031
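
The example above can be reproduced directly from the formula. The sketch below uses only the figures in the table; the variable names are illustrative. Like the table, it applies one flat per-token rate to all saved tokens, which is a simplification since input and output tokens are typically priced differently.

```python
def estimated_savings(
    traces_injected: float,
    avg_tokens_saved_per_injection: float,
    model_cost_per_token: float,
) -> float:
    """estimated_savings = traces_injected x avg_tokens_saved x cost_per_token."""
    return traces_injected * avg_tokens_saved_per_injection * model_cost_per_token


# Inputs from the "Opus at scale" example.
tasks_per_month = 10_000
match_rate = 0.75             # share of tasks where the confidence gate fires
avg_tokens_per_task = 50_000
avg_token_reduction = 0.25    # 25% average token savings for Opus
cost_per_token = 75 / 1_000_000  # $75 per million tokens

traces_injected = tasks_per_month * match_rate                   # 7,500 tasks
tokens_saved_per_injection = avg_tokens_per_task * avg_token_reduction  # 12,500 tokens

monthly_savings = estimated_savings(
    traces_injected, tokens_saved_per_injection, cost_per_token
)
print(f"Estimated monthly savings: ${monthly_savings:,.2f}")
```

Swapping in a different match rate (for example the ~40% observed on the general benchmark) or a different per-token price shows how sensitive the estimate is to those two inputs.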

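This excludes the value of accuracy improvement. On the benchmark, Opus accuracy on matched problems rose from 70% to 90% (a 28.6% relative gain). If that carries over to the 7,500 injected tasks, roughly 1,500 additional tasks succeed per month (5,250 to 6,750) that would otherwise have failed and required human intervention or retries.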
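
A relative gain applies to the baseline success count, not to the total number of injected tasks. The sketch below works through that arithmetic, assuming the benchmark accuracies for Opus (70% baseline, 90% with injection) carry over to the 7,500 injected tasks in the example. That carry-over is an assumption, not a measured result, and the function name is illustrative.

```python
def additional_successes(
    tasks_with_injection: float,
    baseline_accuracy: float,
    injected_accuracy: float,
) -> float:
    """Extra successful tasks attributable to the accuracy difference."""
    return tasks_with_injection * (injected_accuracy - baseline_accuracy)


tasks_with_injection = 7_500
baseline_accuracy = 0.70   # Opus baseline on matched problems
injected_accuracy = 0.90   # Opus with ReasonBlocks on matched problems

baseline_successes = tasks_with_injection * baseline_accuracy  # about 5,250
relative = (injected_accuracy - baseline_accuracy) / baseline_accuracy  # about 0.286

# Two equivalent routes to the same number:
extra_via_points = additional_successes(
    tasks_with_injection, baseline_accuracy, injected_accuracy
)
extra_via_relative = baseline_successes * relative

print(round(extra_via_points), round(extra_via_relative))
```

Both routes give about 1,500 extra successful tasks per month. Multiplying the 28.6% relative gain by all 7,500 tasks instead would overstate the figure, because the relative gain is defined against baseline successes.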

4. Why It Works

AI agents often fail not because the model lacks ability, but because they re-explore the same dead ends on every call. Format D patterns encode three things: the situation, the dead ends to avoid, and the unlock that leads to the solution. This steers the model past failure modes it would otherwise have explored, reducing both wasted tokens and incorrect outputs.

The pattern library is collective — patterns that work for one team's agents improve results for everyone on the platform. As more agents use the system, the library grows, match quality improves, and the confidence gate fires on a higher percentage of tasks.

Limitations

Benchmarks were run on SWE-bench Verified (Python repositories). Results on other languages, domains, and task types may differ.
Token savings vary by model, task complexity, and the quality of the retrieved pattern match.
The match rate (~40% on this general benchmark) depends on the pattern library coverage for a given problem domain. Teams running agents on repetitive domain-specific tasks typically see higher match rates as the library accumulates relevant patterns.
Accuracy gains are measured as relative improvement: (new - baseline) / baseline. Baseline accuracy varies by model and problem difficulty.

Questions about this methodology? Contact sajeev@reasonblocks.com