Agent Benchmarks

Tested. Measured. Improving.

We evaluate every agent against structured rubrics using LLM-as-judge methodology. Real results, published transparently.

313 evaluations across 15 agents, 7 runs, and 3 model tiers.

Model Comparison

Fast / cost-efficient: Claude Haiku 4.5
82.3% | 11/15 agents passing | 34/45 tests passing

Balanced: Claude Sonnet 4.6 (Best Score)
88.5% | 15/15 agents passing | 40/45 tests passing

Most capable: Claude Opus 4.6
83.8% | 12/15 agents passing | 38/45 tests passing

Model Behavior

Why Opus Scores Lower

We run our eval suite across Claude's model tiers. The results surprised us.

Opus is Anthropic's most capable model. It scored lower.

The rubrics test whether agents use injected context (strategy momentum, brand data, cadence schedules) to produce deliverables. Sonnet sees a track stalling at velocity 1 and acts on it. Opus sees the same data and asks for access to the analytics platform before committing to a recommendation.

Both are correct behaviors. Sonnet optimizes for action with available information. Opus optimizes for confidence in its conclusions. In our agent architecture, where context is pre-validated and injected into the system prompt, acting on available data is the right behavior. The context is there because it's trustworthy.

This informed our default model selection: Sonnet for specialist agents where responsiveness matters, with Opus available for deep analysis tasks where thoroughness outweighs speed.

The takeaway isn't that one model is better. It's that eval results are only meaningful relative to what you're testing for. Our rubrics test context utilization and domain boundaries. A rubric testing analytical depth or factual caution would likely favor Opus.

What the agents actually produce

Market Research Specialist | Provides persona depth | 100%

"You have two well-defined segments with meaningfully different purchase drivers, but a shared underlying anxiety: they're outrunning their own capacity. The solo marketer is drowning in execution. The agency owner is drowning in coordination. The Viable Edge solves both, but the pitch, proof points, and onboarding path need to speak to each differently."

Judge reasoning: Uses a highly structured dual-segment persona framework with Jobs-to-Be-Done, psychographic dimensions, buying triggers, and direct references to brand content pillars and product lines.

Improvement: 12.7% → 100% from Haiku baseline

Lead Writer | Produces quality copy in-domain | 96%

"For B2B SaaS founders, email is the one marketing channel you actually own. Social reach shifts with every algorithm update, and paid acquisition costs keep climbing, but a well-maintained email list compounds. The founders who treat their list as a strategic asset rather than a broadcast channel are the ones consistently turning subscribers into pipeline."

Judge reasoning: Delivers the complete intro paragraph as requested. Uses a concrete value proposition (email as an owned channel, revenue rather than just opens) and speaks directly to founder concerns.

Improvement: 60.5% → 96% from Haiku baseline

CMO (Cleon) | Delegates rather than writing content | 88%

"Before I hand this to our lead writer, I want to make sure we nail the angle. A few quick questions to frame the brief: What's the angle? Are we writing about our positioning philosophy, or from our position? Who's the primary reader? What's the hook?"

Judge reasoning: Asks focused, strategic questions about angle, audience, and hook before briefing the specialist. Produces no blog post content whatsoever. Strong orchestration behavior.

Improvement: 55% → 88% from Sonnet v1

Brand Intelligence | Connects to active strategies | 98%

"'The marketing system that learns what works' is doing specific strategic work. It signals: systematic intelligence over one-off tactics, learning loops as a differentiator, ownership transfer to the client. Here's how this maps to your active Market Expansion Q2 strategy across both tracks."

Judge reasoning: Explicitly references the active strategy and structures the entire analysis around its two tracks with velocity data. Provides an actionable recommendation to develop a two-voice positioning architecture.

Improvement: 66.2% → 98% from Haiku baseline

Detailed Results

Sonnet 4.6 (Run 4) | March 19, 2026


Cost of Testing

Each eval run is a real workload. Here is what it takes to test all 15 agents.

Tokens per run: ~152K (141K subject + 10K judge)
Time per run: ~27 min (45 tests, fully automated)
Total tokens (all runs): 923K across 7 evaluation runs
Total test time: 3.6 hrs (216 minutes of automated testing)

These are not trivial checks. Each test sends a real user message to the agent, receives a full response (often 1,000+ tokens), then a separate judge model evaluates the response against weighted criteria. Multi-turn tests include follow-up messages to simulate realistic conversations.

Methodology

How we test

Two-call LLM-as-judge evaluation. Each test sends a simulated user message to the agent using its real system prompt and context injections. A separate judge model grades the response against weighted rubric criteria. Multi-turn tests simulate realistic conversations.
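To make the two-call flow concrete, here is a minimal sketch using the Anthropic Python SDK. The function names, model IDs, and judge prompt wording are illustrative assumptions, not our production harness:

```python
# A minimal sketch of the two-call evaluation flow. Model IDs and prompt
# wording are placeholders, not the production harness.
import anthropic

client = anthropic.Anthropic()
SUBJECT_MODEL = "claude-sonnet-4-5"  # placeholder ID for the model under test
JUDGE_MODEL = "claude-sonnet-4-5"    # separate judge call; may be a different model

def run_subject(system_prompt: str, messages: list[dict]) -> str:
    """Call 1: the agent answers with its real system prompt,
    which already carries the pre-validated context injections."""
    resp = client.messages.create(
        model=SUBJECT_MODEL,
        system=system_prompt,
        messages=messages,
        max_tokens=2048,
    )
    return resp.content[0].text

def run_judge(rubric: str, user_message: str, agent_reply: str) -> str:
    """Call 2: a separate judge model grades the reply against the rubric."""
    prompt = (
        f"Rubric criteria:\n{rubric}\n\n"
        f"User message:\n{user_message}\n\n"
        f"Agent response:\n{agent_reply}\n\n"
        "Score each criterion from 0.0 to 1.0 with brief reasoning."
    )
    resp = client.messages.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    return resp.content[0].text

# Multi-turn tests append follow-up user messages to `messages` and
# repeat Call 1 before judging the final transcript.
```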

How we score

Per-criterion scores (0.0 to 1.0) are weighted and averaged. A test passes only if its weighted score meets the threshold and every must-pass criterion passes. An agent's score is the average of its test scores, and an agent passes at a 70% threshold. The overall score weights orchestrator agents at 2x.
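In code, those rules look roughly like the sketch below. The record fields (score, weight, must_pass, passed) and function names are assumptions for illustration, not our actual schema:

```python
# A minimal sketch of the scoring rules; field and function names are
# illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CriterionResult:
    score: float     # judge score in [0.0, 1.0]
    weight: float    # rubric weight for this criterion
    must_pass: bool  # criterion must pass regardless of weighted score
    passed: bool     # judge's pass/fail call for this criterion

def test_score(criteria: list[CriterionResult]) -> float:
    """Weighted average of per-criterion scores."""
    total = sum(c.weight for c in criteria)
    return sum(c.score * c.weight for c in criteria) / total

def test_passes(criteria: list[CriterionResult], threshold: float) -> bool:
    """Weighted score must meet the threshold AND all must-pass criteria pass."""
    return (test_score(criteria) >= threshold
            and all(c.passed for c in criteria if c.must_pass))

def agent_score(test_scores: list[float]) -> float:
    """An agent's score is the plain average of its test scores."""
    return sum(test_scores) / len(test_scores)

def overall_score(agent_scores: dict[str, float], orchestrators: set[str]) -> float:
    """The overall score counts orchestrator agents at double weight."""
    pairs = [(s, 2.0 if name in orchestrators else 1.0)
             for name, s in agent_scores.items()]
    return sum(s * w for s, w in pairs) / sum(w for _, w in pairs)
```

An agent then counts as passing when its agent_score meets the 70% threshold described above.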

What we measure

Each agent is tested for three things: does it use brand context correctly, does it stay in its domain (and route out-of-scope work to the right specialist), and does it produce quality deliverables when in-domain. Criteria are specific to each agent's role.
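As a concrete illustration, a single test definition might bundle the simulated message, any follow-ups, and weighted criteria covering all three dimensions. Every key, weight, and description below is hypothetical, invented for the sake of example rather than taken from our rubric files:

```python
# A hypothetical test definition; keys, weights, and wording are invented
# for illustration and do not reflect the actual rubrics.
lead_writer_test = {
    "agent": "lead-writer",
    "user_message": ("Write an intro paragraph on email marketing "
                     "for B2B SaaS founders."),
    "follow_ups": [],   # multi-turn tests list follow-up user messages here
    "threshold": 0.8,   # assumed per-test passing threshold
    "criteria": [
        {"name": "uses_brand_context", "weight": 0.3, "must_pass": True,
         "description": "Reflects the injected brand voice and positioning."},
        {"name": "stays_in_domain", "weight": 0.3, "must_pass": True,
         "description": "Writes copy; routes strategy work to the right specialist."},
        {"name": "deliverable_quality", "weight": 0.4, "must_pass": False,
         "description": "Complete, concrete, audience-appropriate copy."},
    ],
}
```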

Status definitions

Passing: all tests met threshold
Mixed: some tests below threshold
Below target: actively being improved

This is a living document. As we improve agents, we publish updated results. No cherry-picking, no hiding failures.
