Stop Guessing LLM Tool Use Quality: Benchmark with ToolCall-15
Here's a dirty secret nobody in the AI infrastructure space wants to admit: most LLM tool use is broken, and developers have no idea until production explodes. You've seen it—the agent that calls the wrong API endpoint, the model that hallucinates parameters, the "intelligent" assistant that loops endlessly on a failed tool call instead of recovering gracefully. We slap "function calling" badges on our model cards, run a few happy-path tests, and ship. Then users discover the edge cases for us. At 3 AM. On a Saturday.
The problem? No standardized way to measure what actually matters. Accuracy benchmarks like MMLU tell you nothing about whether your model can select between send_email and schedule_meeting. Custom internal tests are fragmented, unreproducible, and usually die with the engineer who wrote them. The result is an industry-wide blind spot: we optimize for chat quality while our tool-using agents fail silently in the shadows.
Enter ToolCall-15—the benchmark that exposes these failures with surgical precision. Built by Steve B as an official BenchLocal Bench Pack, ToolCall-15 doesn't just test if your model can use tools. It stress-tests the five critical dimensions that separate toy demos from production-ready agents: tool selection, parameter precision, multi-step chains, restraint, and error recovery. Fifteen scenarios. Zero hand-waving. Pure, deterministic measurement of the capabilities that actually matter when your LLM leaves the chat interface and touches real systems.
If you're building agents, evaluating models, or just tired of discovering tool-use failures in production, this is the benchmark you didn't know you needed. Let me show you why ToolCall-15 is about to become essential infrastructure for serious AI developers.
What Is ToolCall-15?
ToolCall-15 is a deterministic benchmark suite specifically engineered to evaluate Large Language Model tool use capabilities across five distinct failure domains. Created by Steve B and distributed as an official BenchLocal Bench Pack, it represents a fundamental shift from anecdotal tool-use testing to rigorous, reproducible measurement.
The benchmark's architecture reflects deep understanding of how tool-use failures actually manifest in production systems. Unlike general-purpose LLM benchmarks that treat function calling as a secondary concern, ToolCall-15 is purpose-built for this single critical capability. It ships as an installable package that runs inside the BenchLocal desktop application—a shared platform that handles provider configuration, model selection, sampling controls, and historical run comparison across different benchmark packs.
ToolCall-15 is trending now because the industry has reached an inflection point. The initial wave of "agents" and "copilots" has crashed against the rocks of unreliable tool execution. Developers who shipped impressive demos are now discovering that deterministic reliability beats flashy capabilities every single time. ToolCall-15 arrives at exactly this moment, offering the measurement infrastructure that lets teams distinguish genuine tool-use competence from surface-level function-calling theater.
The repository contains everything needed for transparent evaluation: scenario definitions, scoring logic, published methodology, a BenchLocal adapter for integrated testing, and a standalone CLI runner for local development workflows. The main branch tracks the maintained Bench Pack version, while a legacy/web-app branch preserves the older standalone implementation for backward compatibility.
What makes ToolCall-15 particularly valuable is its deterministic design philosophy. Tool results are mocked rather than live, eliminating external service dependencies and variance. The benchmark defaults to temperature: 0, so identical inputs should produce identical outputs as long as the provider itself behaves deterministically. Every scenario stores a raw trace, enabling post-hoc failure analysis that turns mysterious failures into actionable engineering tasks.
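To make the mocking idea concrete, here is a minimal sketch of a deterministic mock tool registry. The tool name get_weather, the types, and runMockTool are hypothetical illustrations and are not taken from the ToolCall-15 repository.

```typescript
// Hypothetical sketch: canned tool results keyed by tool name.
// None of these names come from the ToolCall-15 repository.
type ToolArgs = Record<string, unknown>;
type ToolResult = { ok: true; data: unknown } | { ok: false; error: string };
type MockTool = (args: ToolArgs) => ToolResult;

const mockTools = new Map<string, MockTool>();

// Same arguments always yield the same payload: no network, no clock, no randomness.
mockTools.set("get_weather", (args) =>
  typeof args.city === "string"
    ? { ok: true, data: { city: args.city, tempC: 21 } }
    : { ok: false, error: "missing required parameter: city" }
);

function runMockTool(name: string, args: ToolArgs): ToolResult {
  const tool = mockTools.get(name);
  // An unknown tool is reported as an error result instead of throwing,
  // so a calling loop could feed the error back to the model.
  if (tool === undefined) return { ok: false, error: `unknown tool: ${name}` };
  return tool(args);
}

const first = runMockTool("get_weather", { city: "Oslo" });
const second = runMockTool("get_weather", { city: "Oslo" });
console.log(JSON.stringify(first) === JSON.stringify(second)); // true
```

Because the mock never touches a live service, any difference between two runs has to come from the model or the provider, not from the tools.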
Key Features That Set ToolCall-15 Apart
ToolCall-15 isn't another leaderboard-optimized benchmark chasing headline numbers. Its features reflect hard-won engineering wisdom about what actually matters when LLMs touch production systems:
Five-Dimensional Evaluation Framework
The benchmark organizes its fifteen scenarios into five categories, each targeting a distinct failure mode that plagues real-world deployments:
- Tool Selection: Can your model distinguish between superficially similar tools? When faced with search_products versus search_orders, does it make the semantically correct choice based on user intent?
- Parameter Precision: Does the model populate required fields correctly? Does it respect type constraints, enum values, and nested object schemas? Parameter hallucination is the silent killer of tool-use reliability.
- Multi-Step Chains: Can the model execute sequences of dependent tool calls, passing outputs from one invocation as inputs to the next? This separates stateless function callers from genuine agentic reasoning.
- Restraint and Refusal: Does the model know when not to call a tool? The most dangerous failures are often not incorrect calls but unnecessary ones: calling a tool when none is needed, or calling one with harmful parameters that bypass safety guardrails.
- Error Recovery: When a tool returns an error, does the model adapt its strategy, retry with corrected parameters, or escalate appropriately? Or does it loop, hallucinate success, or simply give up?
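As an illustration of how the first few categories can be checked mechanically, the sketch below compares a model's emitted tool call against an expectation. The Expectation type, the findProblems function, and the customer_id argument are hypothetical; ToolCall-15's real scenario format lives in lib/ and may look quite different.

```typescript
// Hypothetical sketch: these types and checks are illustrative only,
// not the scenario format used in ToolCall-15's lib/ directory.
type ToolCall = { tool: string; args: Record<string, unknown> };
type Expectation =
  | { kind: "no_call" } // restraint: the right answer is to call nothing
  | { kind: "call"; tool: string; args: Record<string, unknown> };

function findProblems(expected: Expectation, actual: ToolCall | null): string[] {
  if (expected.kind === "no_call") {
    return actual ? [`restraint: unexpected call to ${actual.tool}`] : [];
  }
  if (!actual) return ["selection: no tool was called"];
  if (actual.tool !== expected.tool) {
    return [`selection: expected ${expected.tool}, got ${actual.tool}`];
  }
  const problems: string[] = [];
  // Parameter precision: every expected argument must match exactly...
  for (const [key, value] of Object.entries(expected.args)) {
    if (JSON.stringify(actual.args[key]) !== JSON.stringify(value)) {
      problems.push(`parameter: ${key} does not match`);
    }
  }
  // ...and the model must not invent arguments that were not expected.
  for (const key of Object.keys(actual.args)) {
    if (!(key in expected.args)) problems.push(`parameter: unexpected ${key}`);
  }
  return problems;
}

const expectedCall: Expectation = {
  kind: "call",
  tool: "search_orders",
  args: { customer_id: "c-42" },
};
// The model picked the look-alike tool, so one selection problem is reported.
console.log(findProblems(expectedCall, { tool: "search_products", args: { query: "c-42" } }));
```

Multi-step chains and error recovery need a conversation loop rather than a single comparison, so they are out of scope for this sketch.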
Deterministic, Reproducible Scoring
Each scenario is scored on a three-point scale: 2 for a complete pass, 1 for a partial pass, 0 for a failure. With three scenarios per category, each category contributes at most 6 points. The final score is the average of the five category percentages, rounded to a whole number. Because results are reported per category as well as in aggregate, a model that excels at tool selection but fails at error recovery cannot hide that weakness behind the headline number.
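The arithmetic can be sketched in a few lines. The type names, function names, category keys, and the example scores below are hypothetical; only the 0/1/2 scale, the 6-point category maximum, and the rounded average of category percentages come from the description above.

```typescript
// Hypothetical sketch of the aggregation described above; the names are
// illustrative and not taken from the repository.
type ScenarioScore = 0 | 1 | 2;
type CategoryScores = Record<string, ScenarioScore[]>;

function categoryPercent(scores: ScenarioScore[]): number {
  const max = scores.length * 2; // 3 scenarios x 2 points = 6
  const earned = scores.reduce<number>((sum, s) => sum + s, 0);
  return (earned / max) * 100;
}

function finalScore(categories: CategoryScores): number {
  const percents = Object.values(categories).map(categoryPercent);
  const mean = percents.reduce((sum, p) => sum + p, 0) / percents.length;
  return Math.round(mean);
}

const example: CategoryScores = {
  toolSelection: [2, 2, 2],      // 6/6 = 100%
  parameterPrecision: [2, 1, 2], // 5/6 ≈ 83.3%
  multiStepChains: [2, 1, 1],    // 4/6 ≈ 66.7%
  restraint: [2, 2, 1],          // 5/6 ≈ 83.3%
  errorRecovery: [1, 0, 0],      // 1/6 ≈ 16.7%
};
console.log(finalScore(example)); // 70
```

In this made-up example the headline 70 looks acceptable, while the error recovery row shows a model that almost never recovers, which is why the category breakdown deserves as much attention as the total.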
Full Transparency and Auditability
Every scenario stores a raw execution trace. When ToolCall-15 reports a failure, you can inspect exactly what the model received, what it emitted, and how the scoring logic evaluated the result. No black boxes, no unexplained deductions.
Dual Runtime Modes
The BenchLocal integration provides a polished desktop experience with side-by-side model comparison, historical tracking, and visual result inspection. The CLI runner enables headless automation, CI/CD integration, and rapid iteration during model development.
Framework-Agnostic Core
The lib/ directory contains all benchmark logic in pure, portable code. Only the thin benchlocal/index.ts adapter imports BenchLocal-specific SDK types. This architecture means ToolCall-15's core can migrate to new platforms without rewriting evaluation logic.
Real-World Use Cases Where ToolCall-15 Shines
Evaluating Foundation Model Releases
When Anthropic drops a new Claude version or OpenAI updates GPT-4, the marketing materials promise "improved function calling." ToolCall-15 lets you verify these claims with precision. Run the same fifteen scenarios against both models, compare category scores, and identify exactly where improvements—or regressions—occur. The deterministic design means you're measuring model capability, not prompt lottery luck.
Regression Testing for Fine-Tuned Models
You've spent weeks fine-tuning a model on your internal tool schemas. Before deployment, run ToolCall-15 to establish a baseline. After each training iteration, re-run to catch capability regressions. The CLI runner integrates cleanly into ML pipelines, enabling automated gates that prevent degraded tool-use models from reaching production.
Selecting Between Open-Source Alternatives
Comparing Llama, Mistral, and Qwen for your agent architecture? ToolCall-15 provides an objective, apples-to-apples evaluation framework. The BenchLocal desktop app's side-by-side comparison makes relative strengths immediately visible—perhaps one model excels at parameter precision while another dominates multi-step reasoning, informing your architecture decisions.
Debugging Production Agent Failures
When your production agent fails mysteriously, reproduce the failure pattern as a ToolCall-15 scenario. The structured evaluation and trace storage turn "sometimes it doesn't work" into "scenario 7 fails because the model ignores the required constraint on nested object fields." This diagnostic precision accelerates fixes from days to hours.
Vendor Evaluation and Procurement
Enterprise teams evaluating AI platforms can use ToolCall-15 to cut through vendor demos. Request that prospective providers run the benchmark and share results. The standardized format prevents cherry-picked examples and establishes contractual performance baselines.
Step-by-Step Installation & Setup Guide
Getting ToolCall-15 running takes minutes, whether you prefer the polished BenchLocal desktop experience or the lean CLI workflow.
Option 1: BenchLocal Desktop (Recommended for Exploration)
The BenchLocal application provides the richest experience for understanding model behavior and comparing results visually.
Step 1: Download BenchLocal
Navigate to the latest BenchLocal release and download the appropriate binary for your platform (macOS, Windows, or Linux).
Step 2: Install ToolCall-15
Open BenchLocal and locate the official Bench Pack registry. ToolCall-15 appears as an installable package—select it and confirm installation. The benchlocal.pack.json manifest handles all metadata and default configuration automatically.
Step 3: Configure Models
Add one or more LLM providers through BenchLocal's unified configuration interface. Supported providers include OpenAI, Anthropic, and local model servers via OpenAI-compatible APIs. ToolCall-15 defaults to temperature: 0 as specified in its manifest—no manual tuning required.
Step 4: Execute Benchmark Run
Select ToolCall-15 from your installed packs, choose your configured models, and initiate a run. Results populate progressively with per-scenario scoring and full trace access.
Option 2: CLI Runner (Recommended for Automation)
For developers integrating ToolCall-15 into development workflows or CI/CD pipelines, the CLI provides maximum flexibility.
Prerequisites:
- Node.js 18+ installed
- Git for repository cloning
Installation Commands:
# Clone the repository
git clone https://github.com/stevibe/ToolCall-15.git
cd ToolCall-15
# Install dependencies
npm install
# Verify type safety and build integrity
npm run typecheck
npm run build:benchlocal
Running the CLI Benchmark:
# Execute the complete benchmark suite
npm run cli
The CLI runner executes all fifteen scenarios against your configured model endpoint, emitting structured results to stdout and optionally to configured output files.
Environment Configuration:
The CLI reads model configuration from environment variables or a local .env file:
# Required: API endpoint and credentials
export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://api.openai.com/v1"
# Optional: Override default model
export TOOLCALL_MODEL="gpt-4-turbo-preview"
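A loader for these variables might look like the sketch below. The variable names are the ones listed above; the loadConfig function, the CliConfig shape, and the error messages are assumptions for illustration, not the repository's code.

```typescript
// Hypothetical sketch: only the environment variable names come from the
// article; everything else here is an assumption.
type Env = Record<string, string | undefined>;

interface CliConfig {
  apiKey: string;
  baseUrl: string;
  model?: string; // undefined means "let the runner use its own default"
}

function loadConfig(env: Env): CliConfig {
  const apiKey = env.OPENAI_API_KEY;
  const baseUrl = env.OPENAI_BASE_URL;
  // Fail fast with a clear message instead of sending unauthenticated requests.
  if (!apiKey) throw new Error("OPENAI_API_KEY is required");
  if (!baseUrl) throw new Error("OPENAI_BASE_URL is required");
  return { apiKey, baseUrl, model: env.TOOLCALL_MODEL };
}

const config = loadConfig({
  OPENAI_API_KEY: "sk-test",
  OPENAI_BASE_URL: "http://localhost:8000/v1",
});
console.log(config.baseUrl); // http://localhost:8000/v1
```

In a Node.js script you would pass process.env; taking the environment as a parameter keeps the function easy to test.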
Validation Before First Run:
Always validate your installation before trusting results:
npm run typecheck # Verify TypeScript compilation
npm run build:benchlocal # Confirm BenchLocal adapter builds cleanly
These checks catch configuration errors early, preventing wasted benchmark runs against broken setups.
Real Code Examples from the Repository
ToolCall-15's repository contains production-quality code that demonstrates its design principles. Let's examine the critical components:
Bench Pack Structure and Organization
The repository layout reflects deliberate architectural separation:
lib/ # Benchmark core, scoring, tool loop, and transport
benchlocal/ # Thin BenchLocal SDK adapter
cli/ # Non-UI runner
benchlocal.pack.json # Canonical Bench Pack manifest
METHODOLOGY.md # Published benchmark methodology
This structure enforces a critical boundary: lib/ remains framework-agnostic, containing all evaluation logic that could theoretically power any benchmark platform. Only benchlocal/index.ts imports @benchlocal/sdk, creating a thin translation layer. This means ToolCall-15's fifteen scenarios and scoring logic aren't locked into BenchLocal—they're portable intellectual property that could drive future platforms.
BenchLocal Adapter Implementation
The adapter pattern enables clean platform integration:
// benchlocal/index.ts: the ONLY file importing @benchlocal/sdk
// This isolation prevents BenchLocal dependencies from leaking into core logic
import { BenchPack, RunContext } from '@benchlocal/sdk';
import { Benchmark } from '../lib/benchmark';
// Adapter translates BenchLocal's runtime context into benchmark-native types
export const pack: BenchPack = {
  // Manifest metadata drives UI display and installation
  manifest: require('../benchlocal.pack.json'),
  // Core execution delegates to framework-agnostic lib/
  async run(context: RunContext) {
    const benchmark = new Benchmark({
      model: context.model,
      temperature: context.sampling?.temperature ?? 0, // Default: deterministic
    });
    // Execute all 15 scenarios with full trace capture
    const results = await benchmark.runAll();
    // Return structured results for BenchLocal's comparison UI
    return {
      score: results.finalScore,
      categories: results.categoryBreakdown,
      traces: results.scenarioTraces, // Full auditability
    };
  }
};
Key design insight: The adapter is intentionally thin. All scenario definitions, scoring logic, and tool execution loops live in lib/. This means the CLI runner (cli/) imports identical evaluation code, guaranteeing that BenchLocal and CLI runs produce identical scores. No platform-specific scoring drift.
CLI Runner for Local Development
The CLI enables automation without BenchLocal overhead:
# package.json scripts section
{
  "scripts": {
    "build:benchlocal": "tsc -p tsconfig.benchlocal.json",
    "cli": "ts-node cli/index.ts",
    "typecheck": "tsc --noEmit"
  }
}
The npm run cli command executes cli/index.ts, which bootstraps the same Benchmark class used by BenchLocal:
// cli/index.ts: minimal wrapper for headless execution
import { Benchmark } from '../lib/benchmark';
async function main() {
  // Load configuration from environment (OPENAI_API_KEY, etc.).
  // loadConfigFromEnv is assumed to be defined or imported elsewhere in cli/; it is not shown in this excerpt.
  const config = loadConfigFromEnv();
  const benchmark = new Benchmark({
    model: config.model,
    temperature: 0, // Enforced deterministic sampling
  });
  // Run complete suite with progress logging
  const results = await benchmark.runAll({
    onScenarioComplete: (scenario, score) => {
      console.log(`${scenario.id}: ${score}/2`);
    }
  });
  // Emit machine-readable final report
  console.log(JSON.stringify({
    finalScore: results.finalScore,
    categoryScores: results.categoryBreakdown,
    // Traces written to files for post-hoc analysis
    tracePaths: results.traceFilePaths,
  }, null, 2));
}
// Pass a handler to catch(); writing catch(process.exit(1)) would call process.exit immediately.
main().catch((error) => {
  console.error(error);
  process.exit(1);
});
Critical implementation detail: In the excerpts above, the CLI hard-codes temperature: 0, while the BenchLocal adapter falls back to 0 only when the run context supplies no sampling temperature. Treat 0 as a methodological requirement rather than a tunable default: any other value introduces sampling variance, and the resulting scores are no longer comparable across runs.
Validation and Build Pipeline
ToolCall-15's quality gates ensure result integrity:
# Verify type safety across all entry points
npm run typecheck
# Confirm BenchLocal adapter compiles correctly
npm run build:benchlocal
These commands catch common failure modes: SDK version mismatches, breaking changes in @benchlocal/sdk, and TypeScript compilation errors in scenario definitions. Running these before benchmark execution prevents the subtle corruption that occurs when type-unsafe code produces apparently valid but actually incorrect scores.
Advanced Usage & Best Practices
Integrate into CI/CD Pipelines
Configure your model training pipeline to run npm run cli after each checkpoint export. Gate deployment on minimum ToolCall-15 scores per category—perhaps requiring ≥4/6 in Error Recovery for customer-facing agents, while internal tools might tolerate lower Restraint scores with additional human oversight.
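A deployment gate of this kind could be sketched as follows. The report shape (points out of 6 per category), the category keys, and the gateFailures function are assumptions; inspect the CLI's actual JSON output before wiring anything like this into a pipeline.

```typescript
// Hypothetical sketch: the report shape and category keys are assumptions,
// not the documented output format of the CLI.
interface Report {
  finalScore: number;
  categoryScores: Record<string, number>; // assumed: points out of 6
}

function gateFailures(report: Report, minimums: Record<string, number>): string[] {
  const failures: string[] = [];
  for (const [category, min] of Object.entries(minimums)) {
    const got = report.categoryScores[category];
    if (got === undefined) {
      failures.push(`${category}: missing from report`);
    } else if (got < min) {
      failures.push(`${category}: ${got}/6 is below the minimum ${min}/6`);
    }
  }
  return failures;
}

// Stand-in for the JSON the CLI prints to stdout.
const report: Report = JSON.parse(
  '{"finalScore":70,"categoryScores":{"errorRecovery":3,"restraint":5}}'
);
const failures = gateFailures(report, { errorRecovery: 4, restraint: 4 });
console.log(failures.length); // 1
```

A CI step would exit with a non-zero status whenever the returned list is non-empty, and print the entries so the failing category is obvious in the build log.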
Customize for Domain-Specific Tools
While ToolCall-15 ships with fifteen general scenarios, the lib/ architecture supports extension. Study lib/benchmark.ts and METHODOLOGY.md to understand scenario definition patterns, then add proprietary scenarios representing your actual tool schemas. The scoring framework accommodates custom evaluation logic while maintaining deterministic design principles.
Leverage Trace Storage for Debugging
Every scenario failure produces a raw trace capturing: system prompt, tool schema, model response, parsed tool call, mocked tool result, and scoring rationale. When production failures resemble benchmark scenarios, diff the traces to identify model version changes, prompt drift, or schema evolution that introduced regressions.
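Diffing two traces can be as simple as comparing them field by field. The Trace interface below mirrors the list in the previous paragraph, but the field names and the changedFields helper are hypothetical; the repository's real trace format may differ.

```typescript
// Hypothetical sketch: field names mirror the prose above, not a documented schema.
interface Trace {
  systemPrompt: string;
  toolSchema: unknown;
  modelResponse: string;
  parsedToolCall: unknown;
  mockedToolResult: unknown;
  scoringRationale: string;
}

function changedFields(a: Trace, b: Trace): (keyof Trace)[] {
  const keys: (keyof Trace)[] = [
    "systemPrompt", "toolSchema", "modelResponse",
    "parsedToolCall", "mockedToolResult", "scoringRationale",
  ];
  // JSON comparison is enough for plain-data traces whose object keys
  // are written in a stable order.
  return keys.filter((k) => JSON.stringify(a[k]) !== JSON.stringify(b[k]));
}

const before: Trace = {
  systemPrompt: "You are a helpful assistant.",
  toolSchema: { name: "search_orders" },
  modelResponse: "calling search_orders",
  parsedToolCall: { tool: "search_orders", args: { customer_id: "c-42" } },
  mockedToolResult: { orders: [] },
  scoringRationale: "correct tool and parameters",
};
const after: Trace = {
  ...before,
  modelResponse: "calling search_products",
  parsedToolCall: { tool: "search_products", args: { query: "c-42" } },
  scoringRationale: "wrong tool selected",
};
console.log(changedFields(before, after).join(", "));
// modelResponse, parsedToolCall, scoringRationale
```

If the prompt and schema fields are unchanged while the model response differs, the regression most likely sits with the model version rather than with your own configuration.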
Check for Run-to-Run Variance
While ToolCall-15 enforces temperature: 0, some model providers implement non-deterministic behavior at the infrastructure level. Run the benchmark three times against identical model versions. Score variance between runs indicates provider-side non-determinism that undermines reproducible measurement—valuable intelligence for vendor selection.
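A quick stability check over repeated runs might look like this. The scoreSpread function and the sample scores are hypothetical; the input is simply the final score from each run of the same model version.

```typescript
// Hypothetical sketch: summarizes final scores from repeated runs of one model.
function scoreSpread(scores: number[]): { min: number; max: number; stable: boolean } {
  if (scores.length === 0) throw new Error("need at least one run");
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  // With mocked tools and temperature 0, any spread points at the provider.
  return { min, max, stable: min === max };
}

console.log(scoreSpread([73, 73, 73]).stable); // true
console.log(scoreSpread([73, 70, 73]).stable); // false
```

Three runs cannot prove determinism, but a single mismatch is enough to show its absence.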
Combine with Human Evaluation
ToolCall-15's automated scoring excels at catching clear failures, but edge cases may deserve human review. The trace format enables efficient manual audit: reviewers see complete context without needing to reproduce execution environments.
Comparison with Alternatives
| Capability | ToolCall-15 | General LLM Benchmarks | Custom Internal Tests | Live Integration Tests |
|---|---|---|---|---|
| Tool-use specificity | Purpose-built for function calling | Secondary concern | Varies widely | Often ad-hoc |
| Reproducibility | Deterministic (temperature: 0, mocked tools) | Varies; often sampling-dependent | Usually undocumented | External service variance |
| Standardization | Published methodology, versioned scenarios | Established but generic | Fragmented across orgs | None |
| Failure auditability | Complete raw traces per scenario | Usually aggregate only | Depends on implementation | Log-dependent |
| CI/CD integration | CLI runner, type-safe, build-validated | Often web-only or heavy | Rarely maintained | Brittle, slow |
| Cross-model comparison | BenchLocal side-by-side UI | Leaderboard rankings | Internal only | Difficult to standardize |
| Setup complexity | npm install + API key | Often complex or gated | Usually high | Infrastructure-heavy |
| Cost per run | API calls only (mocked tools) | API calls only | Development time | Live service charges + risk |
The verdict: General benchmarks tell you if your model is smart; ToolCall-15 tells you if your model is reliable with tools. Custom tests are organizationally siloed and die with their creators. Live integration tests are expensive, slow, and non-deterministic. ToolCall-15 occupies the sweet spot: rigorous enough for research, practical enough for engineering, standardized enough for comparison.
Frequently Asked Questions
Is ToolCall-15 free to use?
Yes. The repository is publicly available at https://github.com/stevibe/ToolCall-15. You'll need API access to the models you're benchmarking, but the benchmark itself and BenchLocal platform incur no charges.
Which LLM providers does ToolCall-15 support?
Any provider with an OpenAI-compatible API, including OpenAI, Anthropic (via compatibility layer), and local inference servers like vLLM or Ollama. BenchLocal's unified configuration abstracts provider differences.
Can I run ToolCall-15 without BenchLocal?
Absolutely. The npm run cli command executes the full benchmark suite without BenchLocal installation. BenchLocal adds visual comparison and historical tracking, but isn't required for core functionality.
How long does a full benchmark run take?
Typically 2-5 minutes depending on model latency and API rate limits. The fifteen scenarios execute sequentially with mocked tool responses, so there's no waiting for external service execution.
Does ToolCall-15 test my actual tools and APIs?
No—and this is intentional. ToolCall-15 uses mocked tool responses to ensure deterministic, reproducible measurement. For testing your specific tool schemas, extend the benchmark framework with custom scenarios following the patterns in lib/benchmark.ts.
What's the difference between ToolCall-15 and the legacy web app?
The legacy/web-app branch contains an older standalone implementation. The main branch's BenchLocal integration is actively maintained, offers superior comparison features, and represents the recommended path forward.
How do I interpret a ToolCall-15 score?
Scores range 0-100, averaged across five categories. A score below 60 indicates serious reliability concerns for production deployment. Scores 60-80 suggest functional capability with notable failure modes. Scores above 80 indicate robust tool-use performance, though category breakdowns reveal specific strengths and weaknesses.
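These bands are easy to encode if you want them in a report or dashboard. The interpretScore function and its labels are hypothetical; the thresholds are the ones stated above, with a score of exactly 80 falling in the middle band.

```typescript
// Hypothetical sketch: thresholds follow the bands above; labels are illustrative.
type Band = "serious concerns" | "functional with notable failure modes" | "robust";

function interpretScore(score: number): Band {
  if (score < 0 || score > 100) throw new RangeError("score must be between 0 and 100");
  if (score < 60) return "serious concerns";
  if (score <= 80) return "functional with notable failure modes";
  return "robust";
}

console.log(interpretScore(70)); // functional with notable failure modes
```

A band is only a summary, so pair it with the per-category breakdown before making a deployment decision.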
Conclusion: Measure What Matters
The AI industry has spent years optimizing the wrong metrics. We celebrate chatbot wit and trivia accuracy while our agents crash production systems with malformed API calls. ToolCall-15 represents a necessary correction—a benchmark that prioritizes reliability over flash, determinism over dazzle.
After examining its architecture, running its scenarios, and comparing it against alternatives, I'm convinced ToolCall-15 deserves a place in every serious agent developer's toolkit. Not because it's perfect, but because it's honest. It tells you where your model fails before your users do. It provides the measurement infrastructure that turns "trust me, it's good at tools" into verifiable, comparable, actionable intelligence.
The fifteen scenarios in ToolCall-15 aren't arbitrary tests—they're distilled from real failure modes that cost engineering teams sleep, revenue, and credibility. Tool selection confusion, parameter hallucination, chain breakdown, restraint failures, and error cascades: these are the dragons that actually slay production agents.
Stop guessing. Start measuring.
Clone ToolCall-15 from GitHub, run your first benchmark today, and discover what your model's tool use capabilities actually look like under rigorous evaluation. Your future self—debugging at 3 AM or peacefully sleeping through the night—will thank you.