
Evaluation Methodology & Results

Two-call LLM-as-judge evaluation across 15 agents and 45 test cases

How It Works

Each test case runs two independent API calls. The agent under test never sees the rubric. The judge never sees the agent's system prompt in full. This separation prevents gaming.

1. Subject Call

The agent under test receives its real system prompt with live context injected: brand data, strategy momentum with velocity numbers, cadence schedules, and domain boundaries. It responds to a simulated user message. For multi-turn tests, scripted follow-up answers simulate a realistic conversation in which the agent asks clarifying questions before producing its deliverable.
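In outline, the injection step looks roughly like this. A minimal sketch assuming plain string assembly; the function and context field names are illustrative, not the framework's actual schema.

def build_subject_prompt(agent_system_prompt: str, context: dict) -> str:
    """Append live context blocks to the agent's real system prompt."""
    # Field names (brand, momentum, cadence, boundaries) mirror the
    # context types described above but are illustrative placeholders.
    sections = [
        agent_system_prompt,
        "## Brand context\n" + context["brand"],
        "## Strategy momentum\n" + context["momentum"],   # includes velocity numbers
        "## Cadence schedule\n" + context["cadence"],
        "## Domain boundaries\n" + context["boundaries"],
    ]
    return "\n\n".join(sections)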

2. Judge Call

A separate LLM (Sonnet 4.6) grades the agent's response against weighted rubric criteria. Each criterion gets a 0.0 to 1.0 score with written reasoning. The judge evaluates whether the agent used its injected context, stayed within domain boundaries, and produced quality deliverables.
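A minimal sketch of the judge call, assuming a generic chat-completion client; the llm.complete signature, prompt wording, and JSON shape are assumptions, not the framework's actual API.

import json

JUDGE_SYSTEM = (
    "Grade the agent's response against each rubric criterion. For every "
    "criterion return a score from 0.0 to 1.0 with written reasoning, "
    'as JSON: [{"criterion": str, "score": float, "reasoning": str}]'
)

def judge_response(transcript: list, rubric: list, llm) -> list:
    # The judge sees the conversation and the weighted rubric, but not
    # the agent's full system prompt.
    payload = json.dumps({"transcript": transcript, "rubric": rubric})
    response = llm.complete(              # assumed client, not a real SDK call
        model="sonnet-4.6",               # placeholder judge-model identifier
        system=JUDGE_SYSTEM,
        messages=[{"role": "user", "content": payload}],
    )
    return json.loads(response.text)      # per-criterion scores with reasoning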

What We Test

Every agent is tested across three categories. The question isn't "does it respond?" It's whether the agent uses its context, respects its boundaries, and produces real work.

Context Utilization

Does the agent use the strategy momentum, brand voice, and cadence data injected into its prompt?

  • References stalling strategies by name
  • Cites velocity and trend data
  • Uses cadence schedules for timing
  • Grounds recommendations in brand context

Domain Boundaries

Does the agent stay in its lane and hand off out-of-domain requests to the right specialist?

  • SEO specialist doesn't write blog posts
  • Lead writer doesn't build dashboards
  • Analyst doesn't produce social copy
  • Names the correct specialist for handoff

Deliverable Quality

When the request is in-domain, does the agent produce substantive work rather than deferring?

  • Writer produces actual copy, not outlines
  • Strategist delivers structured plans
  • Analyst provides measurement frameworks
  • Creative director produces briefs with specifics
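Tying the three categories together, a single test case might be encoded like this. The shape is an assumption for illustration; criterion names, weights, and the must-pass flag follow the scoring rules described next, but none of this is the framework's actual schema.

test_case = {
    "agent": "lead_writer",
    "category": "deliverable_quality",   # or context_utilization / domain_boundaries
    "user_message": "Write a blog intro about email marketing best practices.",
    "criteria": [
        {"name": "produces_actual_copy",      "weight": 5, "must_pass": True},
        {"name": "grounded_in_brand_context", "weight": 3, "must_pass": False},
        {"name": "cites_cadence_timing",      "weight": 1, "must_pass": False},
    ],
    "threshold": 0.70,                   # pass thresholds typically sit at 65-70%
}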

Scoring

Per criterion

The judge scores each criterion 0.0 to 1.0 with written reasoning. Criteria have weights (1 to 5) reflecting importance. Some criteria are must-pass gates: if they fail, the entire test fails regardless of overall score.

Per test

Weighted average of criterion scores. A test passes if the score meets its threshold (typically 65-70%) AND all must-pass criteria pass. A perfect score on optional criteria can't save a failed must-pass.
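Concretely, the per-test computation reduces to a weighted average gated by the must-pass criteria. A sketch, assuming a criterion "fails" below 0.5; that cutoff is an illustrative assumption the rules above don't pin down.

def score_test(test_case: dict, scores: dict) -> tuple:
    """Weighted average of criterion scores, gated by must-pass criteria.

    `scores` maps criterion name -> judge score in [0.0, 1.0].
    """
    total_weight = sum(c["weight"] for c in test_case["criteria"])
    weighted_sum = sum(c["weight"] * scores[c["name"]] for c in test_case["criteria"])
    test_score = weighted_sum / total_weight

    # A failed must-pass gate fails the test regardless of the score.
    gates_pass = all(
        scores[c["name"]] >= 0.5          # assumed per-criterion fail cutoff
        for c in test_case["criteria"]
        if c["must_pass"]
    )
    return test_score, (test_score >= test_case["threshold"] and gates_pass)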

Per agent

Average of the agent's test scores. Agents pass at 70% or above.

Overall

Weighted average across all agents. Orchestrators (CMO, Analyst) are weighted 2x because they route work to specialists and set strategic direction.
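Rolling up, the per-agent number is a plain average and the overall number a weighted one. A sketch with illustrative agent keys:

ORCHESTRATORS = {"cmo", "analyst"}       # weighted 2x in the overall average

def agent_score(test_scores: list) -> float:
    return sum(test_scores) / len(test_scores)    # agents pass at >= 0.70

def overall_score(agent_scores: dict) -> float:
    weighted_sum = weight_total = 0.0
    for agent, score in agent_scores.items():
        w = 2.0 if agent in ORCHESTRATORS else 1.0
        weighted_sum += w * score
        weight_total += w
    return weighted_sum / weight_total

Applied to the rounded per-agent scores in the table below, with 2x weight on CMO (Cleon) and Analyst (Marcus), this rollup lands at roughly the 92.9% overall figure.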

Current Results

Sonnet 4.6 latest (Run 8, March 27, 2026): 92.9% overall, with 43/45 tests passing across 15 agents under full context injection.

Agent                         Score   Tests
Market Research Specialist     100%    3/3
Content Strategist             100%    3/3
Brand Intelligence              99%    3/3
Analyst (Marcus)                99%    3/3
CMO (Cleon)                     99%    3/3
SEO Specialist                  99%    3/3
Social Media Strategist         99%    3/3
Email Marketing                 97%    3/3
Lead Writer                     94%    4/4
Paid Media Specialist           91%    3/3
Remote Gateway                  86%    2/2
Conversion Optimizer            85%    2/3
Crisis Response                 85%    3/3
Creative Director               83%    3/3
Marketing Analytics             65%    2/3

Scores are averages across multiple Sonnet 4.6 runs; the Tests column shows passed/total per agent.

Multi-turn Testing

Marketing agents are designed to work with users, not just respond to single prompts. They ask clarifying questions before producing deliverables. Our eval framework simulates this by providing follow-up answers when agents ask for context.

Turn 1 (simulated user):
"Write a blog intro about email marketing best practices."
Turn 2 (agent asks):
"What's your target audience? What tone are you going for?"
Turn 3 (simulated answer):
"B2B SaaS founders. Conversational but authoritative."
Turn 4 (agent delivers):
[Actual blog intro copy graded by judge]

The judge sees the full conversation and grades the final deliverable. This tests the complete interaction loop, not just prompt-response quality.
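A minimal sketch of that simulation loop, assuming a generic chat client (subject.chat is a hypothetical call): scripted answers stand in for the user until the agent stops asking and delivers.

def run_multi_turn(subject, system_prompt: str, opening: str,
                   scripted_answers: list, max_turns: int = 6) -> list:
    """Drive the agent through a simulated conversation.

    Returns the full transcript, handed to the judge unmodified.
    """
    messages = [{"role": "user", "content": opening}]
    answers = iter(scripted_answers)
    for _ in range(max_turns):
        reply = subject.chat(system=system_prompt, messages=messages)
        messages.append({"role": "assistant", "content": reply})
        answer = next(answers, None)
        if answer is None:
            break          # no scripted answer left: last reply is the deliverable
        messages.append({"role": "user", "content": answer})
    return messages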

Full Transparency

Every eval report captures the complete data. Nothing is summarized away.

  • Full conversation transcripts (every turn)
  • Per-criterion scores with judge reasoning
  • Test case descriptions and criteria weights
  • Must-pass gate results
  • Subject and judge model versions
  • Token usage and cost per run
  • Agent descriptions and team assignments
  • Methodology description embedded in report
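Put together, one report record might look like the following; every field name here is an assumption mirroring the capture list above, not the actual report schema.

report = {
    "run": 8,
    "subject_model": "sonnet-4.6",               # placeholder model identifiers
    "judge_model": "sonnet-4.6",
    "methodology": "Two-call LLM-as-judge ...",  # embedded description
    "usage": {"tokens_in": 0, "tokens_out": 0, "cost_usd": 0.0},
    "agents": [{
        "name": "Lead Writer",
        "team": "Content",
        "description": "...",
        "tests": [{
            "description": "...",
            "transcript": [{"role": "user", "content": "..."}],  # every turn
            "criteria": [{"name": "...", "weight": 5, "must_pass": True,
                          "score": 0.9, "reasoning": "..."}],
            "passed": True,
        }],
    }],
}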