Evaluation Methodology & Results
Two-call LLM-as-judge evaluation across 15 agents and 45 test cases
How It Works
Each test case runs two independent API calls. The agent under test never sees the rubric. The judge never sees the agent's system prompt in full. This separation prevents gaming.
1. Subject Call
The agent under test receives its real system prompt with live context injected: brand data, strategy momentum with velocity numbers, cadence schedules, and domain boundaries. It responds to a simulated user message. For multi-turn tests, follow-up answers simulate realistic conversations where agents ask clarifying questions before producing deliverables.
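The context-injection step can be sketched as a simple template fill. The field names below (`brand_data`, `strategy_momentum`, and so on) are illustrative stand-ins, not the framework's actual schema:

```python
# Sketch of live context injection into an agent's system prompt.
# All field names here are hypothetical; substitute your own schema.

def inject_context(system_prompt_template: str, context: dict) -> str:
    """Fill the agent's system prompt with live brand/strategy context."""
    return system_prompt_template.format(
        brand=context["brand_data"],
        momentum=context["strategy_momentum"],   # includes velocity numbers
        cadence=context["cadence_schedule"],
        boundaries=context["domain_boundaries"],
    )

template = (
    "You are the SEO specialist.\n"
    "Brand: {brand}\nMomentum: {momentum}\n"
    "Cadence: {cadence}\nBoundaries: {boundaries}"
)
prompt = inject_context(template, {
    "brand_data": "Acme, confident and plainspoken",
    "strategy_momentum": "organic search up 12% week-over-week",
    "cadence_schedule": "blog twice weekly, newsletter Fridays",
    "domain_boundaries": "SEO only; hand off copywriting to the lead writer",
})
```

The agent never sees this machinery, only the finished prompt with live data inlined.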
2. Judge Call
A separate LLM (Sonnet 4.6) grades the agent's response against weighted rubric criteria. Each criterion gets a 0.0 to 1.0 score with written reasoning. The judge evaluates whether the agent used its injected context, stayed within domain boundaries, and produced quality deliverables.
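The two-call separation can be sketched as follows. `llm` is a stand-in for whatever chat-completion client the framework actually uses, and the judge prompt format here is illustrative, not the real rubric:

```python
# Minimal two-call sketch. `llm(system=..., user=...)` is a stand-in
# for a real chat-completion client; the judge prompt is illustrative.
import json

def run_test_case(llm, agent_prompt, user_message, rubric):
    # Call 1: the subject agent never sees the rubric.
    response = llm(system=agent_prompt, user=user_message)

    # Call 2: the judge sees the response and rubric, but not the
    # agent's full system prompt, so neither call can game the other.
    judge_prompt = (
        "Grade the response against each criterion from 0.0 to 1.0. "
        'Return JSON: {criterion: {"score": ..., "reasoning": ...}}.\n'
        f"Criteria: {json.dumps(rubric)}\nResponse: {response}"
    )
    return json.loads(llm(system="You are a strict evaluator.",
                          user=judge_prompt))
```

In the real framework each call goes to a live model; swapping in a stub for `llm` is enough to exercise the plumbing.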
What We Test
Every agent is tested across three categories. The question isn't "does it respond?" but: does it use its context, respect its boundaries, and produce real work?
Context Utilization
Does the agent use the strategy momentum, brand voice, and cadence data injected into its prompt?
- References stalling strategies by name
- Cites velocity and trend data
- Uses cadence schedules for timing
- Grounds recommendations in brand context
Domain Boundaries
Does the agent stay in its lane and hand off out-of-domain requests to the right specialist?
- SEO specialist doesn't write blog posts
- Lead writer doesn't build dashboards
- Analyst doesn't produce social copy
- Names the correct specialist for handoff
Deliverable Quality
When the request is in-domain, does the agent produce substantive work rather than deferring?
- Writer produces actual copy, not outlines
- Strategist delivers structured plans
- Analyst provides measurement frameworks
- Creative director produces briefs with specifics
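The three categories above translate naturally into weighted rubric entries. The structure below is a sketch of what one test case's rubric might look like; the field names, weights, and descriptions are assumptions, not the framework's actual schema:

```python
# Hypothetical rubric for one SEO-specialist test case.
# Weights (1-5) and the must_pass flag mirror the scoring rules;
# every field name here is illustrative.
rubric = [
    {"id": "cites_velocity",   "category": "context",     "weight": 4, "must_pass": False,
     "desc": "References strategy momentum and velocity data by name"},
    {"id": "stays_in_domain",  "category": "boundaries",  "weight": 5, "must_pass": True,
     "desc": "Does not write blog copy; hands off to the lead writer"},
    {"id": "substantive_work", "category": "deliverable", "weight": 5, "must_pass": True,
     "desc": "Produces a concrete keyword plan, not a deferral"},
]
```

Boundary and deliverable criteria are the natural must-pass gates: an agent that drifts out of its lane or defers fails regardless of how well it used its context.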
Scoring
The judge scores each criterion 0.0 to 1.0 with written reasoning. Criteria have weights (1 to 5) reflecting importance. Some criteria are must-pass gates: if they fail, the entire test fails regardless of overall score.
Test score: Weighted average of criterion scores. A test passes if the score meets its threshold (typically 65-70%) AND all must-pass criteria pass. A perfect score on optional criteria can't save a failed must-pass.
Agent score: Average of the agent's test scores. Agents pass at 70% or above.
Overall score: Weighted average across all agents. Orchestrators (CMO, Analyst) are weighted 2x because they route work to specialists and set strategic direction.
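The scoring rules reduce to two small functions. This is a sketch assuming the judge returns per-criterion dicts of the shape `{score, weight, must_pass}`; the 0.70 pass bar for individual must-pass criteria is my assumption, not the framework's documented value:

```python
def score_test(criteria: list[dict], threshold: float = 0.70,
               gate_bar: float = 0.70) -> tuple[float, bool]:
    """Weighted average of criterion scores, gated by must-pass criteria.

    gate_bar (the per-criterion pass bar for must-pass gates) is an
    assumption for illustration.
    """
    total_weight = sum(c["weight"] for c in criteria)
    score = sum(c["score"] * c["weight"] for c in criteria) / total_weight
    # A failed must-pass fails the test regardless of the overall score.
    gates_ok = all(c["score"] >= gate_bar for c in criteria if c["must_pass"])
    return score, score >= threshold and gates_ok

def overall_score(agent_scores: dict[str, float],
                  orchestrators: set[str]) -> float:
    """Average of agent scores, with orchestrators weighted 2x."""
    weighted = [(s, 2 if name in orchestrators else 1)
                for name, s in agent_scores.items()]
    return sum(s * w for s, w in weighted) / sum(w for _, w in weighted)
```

Note how the gate works: a test with a weighted average of 75% still fails if any must-pass criterion scored below the bar.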
Current Results
Sonnet 4.6 (latest), Run 8, March 27, 2026: 92.9% overall, with 43/45 tests passing across 15 agents under full context injection.
Scores are averages across multiple Sonnet 4.6 runs; the Tests column shows pass/total per agent.
Multi-turn Testing
Marketing agents are designed to work with users, not just respond to single prompts. They ask clarifying questions before producing deliverables. Our eval framework simulates this by providing follow-up answers when agents ask for context.
The judge sees the full conversation and grades the final deliverable. This tests the complete interaction loop, not just prompt-response quality.
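The multi-turn loop can be sketched as follows. `llm` is again a stand-in client, and the question-detection heuristic is a deliberately naive placeholder for whatever classification the real framework does:

```python
# Sketch of the multi-turn simulation: when the agent asks a
# clarifying question, a scripted follow-up answer is injected
# before the conversation is handed to the judge.

def is_clarifying_question(reply: str) -> bool:
    # Naive stand-in; a real framework would classify this robustly.
    return reply.rstrip().endswith("?")

def run_multi_turn(llm, agent_prompt, user_message, followups, max_turns=4):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = llm(system=agent_prompt, messages=messages)
        messages.append({"role": "assistant", "content": reply})
        if not followups or not is_clarifying_question(reply):
            break  # final deliverable reached
        # Simulate the user answering the clarifying question.
        messages.append({"role": "user", "content": followups.pop(0)})
    return messages  # the judge grades this full conversation
```

The judge receives the whole `messages` list, so it can grade both the final deliverable and whether the clarifying questions were worth asking.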
Full Transparency
Every eval report captures the complete data. Nothing is summarized away.