Agent Benchmarks

Tested. Measured. Improving.

We evaluate every agent against structured rubrics using LLM-as-judge methodology. Real results, published transparently.

313 evaluations across 15 agents, 7 runs, and 3 model tiers.

Model Comparison

Fast / cost-efficient: Claude Haiku 4.5
82.3% | 11/15 agents passing | 34/45 tests passing

Balanced: Claude Sonnet 4.6 (Best Score)
88.5% | 15/15 agents passing | 40/45 tests passing

Most capable: Claude Opus 4.6
83.8% | 12/15 agents passing | 38/45 tests passing

Model Behavior

Why Opus Scores Lower

We run our eval suite across Claude's model tiers. The results surprised us.

Opus is Anthropic's most capable model. It scored lower.

The rubrics test whether agents use injected context (strategy momentum, brand data, cadence schedules) to produce deliverables. Sonnet sees a track stalling at velocity 1 and acts on it. Opus sees the same data and asks for access to the analytics platform before committing to a recommendation.

Both are correct behaviors. Sonnet optimizes for action with available information. Opus optimizes for confidence in its conclusions. In our agent architecture, where context is pre-validated and injected into the system prompt, acting on available data is the right behavior. The context is there because it's trustworthy.

This informed our default model selection: Sonnet for specialist agents where responsiveness matters, with Opus available for deep analysis tasks where thoroughness outweighs speed.

The takeaway isn't that one model is better. It's that eval results are only meaningful relative to what you're testing for. Our rubrics test context utilization and domain boundaries. A rubric testing analytical depth or factual caution would likely favor Opus.

What the agents actually produce

Market Research Specialist | Provides persona depth | 100%

"You have two well-defined segments with meaningfully different purchase drivers, but a shared underlying anxiety: they're outrunning their own capacity. The solo marketer is drowning in execution. The agency owner is drowning in coordination. The Viable Edge solves both, but the pitch, proof points, and onboarding path need to speak to each differently."

Judge reasoning: Uses a highly structured dual-segment persona framework with Jobs-to-Be-Done, psychographic dimensions, buying triggers, and direct references to brand content pillars and product lines.

Improvement: 12.7% → 100% from Haiku baseline

Lead Writer | Produces quality copy in-domain | 96%

"For B2B SaaS founders, email is the one marketing channel you actually own. Social reach shifts with every algorithm update, and paid acquisition costs keep climbing, but a well-maintained email list compounds. The founders who treat their list as a strategic asset rather than a broadcast channel are the ones consistently turning subscribers into pipeline."

Judge reasoning: Delivers the complete intro paragraph as requested. Uses a concrete value proposition (email as an owned channel, revenue rather than just opens) and speaks directly to founder concerns.

Improvement: 60.5% → 96% from Haiku baseline

CMO (Cleon) | Delegates rather than writing content | 88%

"Before I hand this to our lead writer, I want to make sure we nail the angle. A few quick questions to frame the brief: What's the angle? Are we writing about our positioning philosophy, or from our position? Who's the primary reader? What's the hook?"

Judge reasoning: Asks focused, strategic questions about angle, audience, and hook before briefing the specialist. Produces no blog post content whatsoever. Strong orchestration behavior.

Improvement: 55% → 88% from Sonnet v1

Brand Intelligence | Connects to active strategies | 98%

"'The marketing system that learns what works' is doing specific strategic work. It signals: systematic intelligence over one-off tactics, learning loops as a differentiator, ownership transfer to the client. Here's how this maps to your active Market Expansion Q2 strategy across both tracks."

Judge reasoning: Explicitly references the active strategy and structures the entire analysis around its two tracks with velocity data. Provides an actionable recommendation to develop a two-voice positioning architecture.

Improvement: 66.2% → 98% from Haiku baseline

Detailed Results

Sonnet 4.6 (Run 4) | March 19, 2026


Cost of Testing

Each eval run is a real workload. Here is what it takes to test all 15 agents.

Tokens per run: ~152K (141K subject + 10K judge)
Time per run: ~27 min (45 tests, fully automated)
Total tokens (all runs): 923K across 7 evaluation runs
Total test time: 3.6 hrs (216 minutes of automated testing)

These are not trivial checks. Each test sends a real user message to the agent, receives a full response (often 1,000+ tokens), then a separate judge model evaluates the response against weighted criteria. Multi-turn tests include follow-up messages to simulate realistic conversations.

Methodology

How we test

Two-call LLM-as-judge evaluation. Each test sends a simulated user message to the agent using its real system prompt and context injections. A separate judge model grades the response against weighted rubric criteria. Multi-turn tests simulate realistic conversations.
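To make the two-call flow concrete, here is a minimal sketch using the Anthropic Python SDK. The function names, model IDs, and judge prompt wording are illustrative assumptions, not our production harness:

```python
# A minimal sketch of the two-call evaluation flow. Model IDs and prompt
# wording are placeholders, not the production harness.
import anthropic

client = anthropic.Anthropic()
SUBJECT_MODEL = "claude-sonnet-4-5"  # placeholder ID for the model under test
JUDGE_MODEL = "claude-sonnet-4-5"    # separate judge call; may be a different model

def run_subject(system_prompt: str, messages: list[dict]) -> str:
    """Call 1: the agent answers with its real system prompt,
    which already carries the pre-validated context injections."""
    resp = client.messages.create(
        model=SUBJECT_MODEL,
        system=system_prompt,
        messages=messages,
        max_tokens=2048,
    )
    return resp.content[0].text

def run_judge(rubric: str, user_message: str, agent_reply: str) -> str:
    """Call 2: a separate judge model grades the reply against the rubric."""
    prompt = (
        f"Rubric criteria:\n{rubric}\n\n"
        f"User message:\n{user_message}\n\n"
        f"Agent response:\n{agent_reply}\n\n"
        "Score each criterion from 0.0 to 1.0 with brief reasoning."
    )
    resp = client.messages.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    return resp.content[0].text

# Multi-turn tests append follow-up user messages to `messages` and
# repeat Call 1 before judging the final transcript.
```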

How we score

Per-criterion scores (0.0 to 1.0) are weighted and averaged. A test passes only if its weighted score meets the threshold and every must-pass criterion passes. An agent's score is the average of its test scores, and an agent passes at a 70% threshold. The overall score weights orchestrator agents at 2x.
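In code, those rules look roughly like the sketch below. The record fields (score, weight, must_pass, passed) and function names are assumptions for illustration, not our actual schema:

```python
# A minimal sketch of the scoring rules; field and function names are
# illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CriterionResult:
    score: float     # judge score in [0.0, 1.0]
    weight: float    # rubric weight for this criterion
    must_pass: bool  # criterion must pass regardless of weighted score
    passed: bool     # judge's pass/fail call for this criterion

def test_score(criteria: list[CriterionResult]) -> float:
    """Weighted average of per-criterion scores."""
    total = sum(c.weight for c in criteria)
    return sum(c.score * c.weight for c in criteria) / total

def test_passes(criteria: list[CriterionResult], threshold: float) -> bool:
    """Weighted score must meet the threshold AND all must-pass criteria pass."""
    return (test_score(criteria) >= threshold
            and all(c.passed for c in criteria if c.must_pass))

def agent_score(test_scores: list[float]) -> float:
    """An agent's score is the plain average of its test scores."""
    return sum(test_scores) / len(test_scores)

def overall_score(agent_scores: dict[str, float], orchestrators: set[str]) -> float:
    """The overall score counts orchestrator agents at double weight."""
    pairs = [(s, 2.0 if name in orchestrators else 1.0)
             for name, s in agent_scores.items()]
    return sum(s * w for s, w in pairs) / sum(w for _, w in pairs)
```

An agent then counts as passing when its agent_score meets the 70% threshold described above.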

What we measure

Each agent is tested for three things: does it use brand context correctly, does it stay in its domain (and route out-of-scope work to the right specialist), and does it produce quality deliverables when in-domain. Criteria are specific to each agent's role.
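As a concrete illustration, a single test definition might bundle the simulated message, any follow-ups, and weighted criteria covering all three dimensions. Every key, weight, and description below is hypothetical, invented for the sake of example rather than taken from our rubric files:

```python
# A hypothetical test definition; keys, weights, and wording are invented
# for illustration and do not reflect the actual rubrics.
lead_writer_test = {
    "agent": "lead-writer",
    "user_message": ("Write an intro paragraph on email marketing "
                     "for B2B SaaS founders."),
    "follow_ups": [],   # multi-turn tests list follow-up user messages here
    "threshold": 0.8,   # assumed per-test passing threshold
    "criteria": [
        {"name": "uses_brand_context", "weight": 0.3, "must_pass": True,
         "description": "Reflects the injected brand voice and positioning."},
        {"name": "stays_in_domain", "weight": 0.3, "must_pass": True,
         "description": "Writes copy; routes strategy work to the right specialist."},
        {"name": "deliverable_quality", "weight": 0.4, "must_pass": False,
         "description": "Complete, concrete, audience-appropriate copy."},
    ],
}
```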

Status definitions

Passing: all tests met threshold
Mixed: some tests below threshold
Below target: actively being improved

This is a living document. As we improve agents, we publish updated results. No cherry-picking, no hiding failures.
