Tested. Measured. Improving.
We evaluate every agent against structured rubrics using LLM-as-judge methodology. Real results, published transparently.
Latest: 90.8% overall, 13/15 agents passing, 42/45 tests passing.
Model Comparison
Fast / cost-efficient
Claude Haiku 4.5
87.7%
13/15 agents passing
40/45 tests passing
Balanced
Claude Sonnet 4.6
90.8%
13/15 agents passing
42/45 tests passing
Most capable
Claude Opus 4.6
83.8%
12/15 agents passing
38/45 tests passing
What the agents actually produce
"You have two well-defined segments with meaningfully different purchase drivers, but a shared underlying anxiety: they're outrunning their own capacity. The solo marketer is drowning in execution. The agency owner is drowning in coordination. The Viable Edge solves both, but the pitch, proof points, and onboarding path need to speak to each differently."
Judge reasoning
Uses a highly structured dual-segment persona framework with Jobs-to-Be-Done, psychographic dimensions, buying triggers, and direct references to brand content pillars and product lines.
Improvement
12.7% → 100%
from Haiku baseline
"For B2B SaaS founders, email is the one marketing channel you actually own. Social reach shifts with every algorithm update, and paid acquisition costs keep climbing, but a well-maintained email list compounds. The founders who treat their list as a strategic asset rather than a broadcast channel are the ones consistently turning subscribers into pipeline."
Judge reasoning
Delivers complete intro paragraph as requested. Uses concrete value proposition (email as owned channel, revenue not just opens) and speaks directly to founder concerns.
Improvement
60.5% → 96%
from Haiku baseline
"Before I hand this to our lead writer, I want to make sure we nail the angle. A few quick questions to frame the brief: What's the angle? Are we writing about our positioning philosophy, or from our position? Who's the primary reader? What's the hook?"
Judge reasoning
Asks focused, strategic questions about angle, audience, and hook before briefing the specialist. Produces no blog post content whatsoever. Strong orchestration behavior.
Improvement
55% → 88%
from Sonnet v1
""The marketing system that learns what works" is doing specific strategic work. It signals: systematic intelligence over one-off tactics, learning loops as a differentiator, ownership transfer to the client. Here's how this maps to your active Market Expansion Q2 strategy across both tracks."
Judge reasoning
Explicitly references the active strategy and structures the entire analysis around its two tracks with velocity data. Provides actionable recommendation to develop two-voice positioning architecture.
Improvement
66.2% → 98%
from Haiku baseline
Detailed Results
claude-sonnet-4-6 (smoke) | April 11, 2026
Click any agent to see test results, criterion scores, and judge reasoning.
Cost of Testing
Each eval run is a real workload. Here is what it takes to test all 15 agents.
Tokens per run
~152K
141K subject + 10K judge
Time per run
~37 min
45 tests, fully automated
Total tokens (all runs)
~1.67M
Across 11 evaluation runs
Total test time
6.8 hrs
407 minutes of automated testing
These are not trivial checks. Each test sends a real user message to the agent, receives a full response (often 1,000+ tokens), then a separate judge model evaluates the response against weighted criteria. Multi-turn tests include follow-up messages to simulate realistic conversations.
Methodology
How we test
Two-call LLM-as-judge evaluation. Each test sends a simulated user message to the agent using its real system prompt and context injections. A separate judge model grades the response against weighted rubric criteria. Multi-turn tests simulate realistic conversations.
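A minimal sketch of that two-call flow, assuming the Anthropic Python SDK; the prompt wording, rubric format, and judge model below are illustrative placeholders, not our actual harness:

```python
# Two-call LLM-as-judge: call 1 runs the subject agent, call 2 grades it.
# Model IDs and prompt text are placeholders for illustration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_test(agent_system_prompt: str, user_message: str, rubric: str) -> str:
    # Call 1: the subject agent answers with its real system prompt.
    subject = client.messages.create(
        model="claude-sonnet-4-6",  # subject model under test
        max_tokens=2048,
        system=agent_system_prompt,
        messages=[{"role": "user", "content": user_message}],
    )
    agent_output = subject.content[0].text

    # Call 2: a separate judge model grades the response against the rubric.
    # (Multi-turn tests would append follow-up messages before judging.)
    verdict = client.messages.create(
        model="claude-haiku-4-5",  # judge model (placeholder)
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Grade this response against the rubric.\n\n"
                       f"RUBRIC:\n{rubric}\n\nRESPONSE:\n{agent_output}",
        }],
    )
    return verdict.content[0].text
```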
How we score
Per-criterion scores (0.0–1.0) are weighted and averaged. A test passes if its weighted score meets the threshold AND all must-pass criteria pass. An agent's score is the average of its test scores, and an agent passes at the 70% threshold. The overall score weights orchestrator agents at 2x.
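In code, those rules look roughly like this. The 2x orchestrator weighting and the 70% agent threshold come from the methodology above; the dataclass layout and the must-pass cutoff are our assumptions:

```python
# Sketch of the scoring rules; field names and MUST_PASS_CUTOFF are assumed.
from dataclasses import dataclass

@dataclass
class Criterion:
    score: float        # judge's score for this criterion, 0.0-1.0
    weight: float
    must_pass: bool = False

MUST_PASS_CUTOFF = 0.7  # assumed; the published methodology doesn't state it

def test_result(criteria: list[Criterion], threshold: float) -> tuple[float, bool]:
    # Weighted average of criterion scores...
    weighted = sum(c.score * c.weight for c in criteria) / sum(c.weight for c in criteria)
    # ...and the test only passes if every must-pass criterion also clears.
    must_ok = all(c.score >= MUST_PASS_CUTOFF for c in criteria if c.must_pass)
    return weighted, (weighted >= threshold and must_ok)

def agent_score(test_scores: list[float]) -> float:
    return sum(test_scores) / len(test_scores)  # agent passes at >= 0.70

def overall_score(agent_scores: dict[str, float], orchestrators: set[str]) -> float:
    # Orchestrator agents count twice in the overall average.
    total = weight = 0.0
    for name, score in agent_scores.items():
        w = 2.0 if name in orchestrators else 1.0
        total += score * w
        weight += w
    return total / weight
```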
What we measure
Each agent is tested for three things: does it use brand context correctly, does it stay in its domain (and route out-of-scope work to the right specialist), and does it produce quality deliverables when in-domain. Criteria are specific to each agent's role.
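As a concrete illustration, a single agent's rubric might look like the snippet below; the criterion names, weights, and wording are invented for this example, not copied from the real suite:

```python
# Hypothetical rubric for a copywriter agent (all values invented).
COPYWRITER_RUBRIC = {
    "brand_context": {
        "weight": 0.3,
        "criterion": "Uses brand voice, content pillars, and product lines correctly.",
    },
    "domain_discipline": {
        "weight": 0.3,
        "must_pass": True,
        "criterion": "Stays in its lane; routes strategy questions to the strategist.",
    },
    "deliverable_quality": {
        "weight": 0.4,
        "criterion": "Produces a publish-ready deliverable in the requested format.",
    },
}
```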
Status definitions
This is a living document. As we improve agents, we publish updated results. No cherry-picking, no hiding failures.
Tested and verified
87–91% accuracy. 72 governance tests. Zero hallucination tolerance.
These aren't demo agents — they're production-grade, evaluated against real rubrics, and they run on your machine with your data.
Get the Marketing System · $147
One-time purchase. Lifetime updates.