
Evaluation Methodology & Results

Two-call LLM-as-judge evaluation across 15 agents and 45 test cases

How It Works

Each test case runs two independent API calls. The agent under test never sees the rubric. The judge never sees the agent's system prompt in full. This separation prevents gaming.

1. Subject Call

The agent under test receives its real system prompt with live context injected: brand data, strategy momentum with velocity numbers, cadence schedules, and domain boundaries. It responds to a simulated user message. For multi-turn tests, scripted follow-up answers simulate a realistic conversation in which the agent asks clarifying questions before producing its deliverable.
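In outline, the injection step looks roughly like this. A minimal sketch assuming plain string assembly; the function and context field names are illustrative, not the framework's actual schema.

def build_subject_prompt(agent_system_prompt: str, context: dict) -> str:
    """Append live context blocks to the agent's real system prompt."""
    # Field names (brand, momentum, cadence, boundaries) mirror the
    # context types described above but are illustrative placeholders.
    sections = [
        agent_system_prompt,
        "## Brand context\n" + context["brand"],
        "## Strategy momentum\n" + context["momentum"],   # includes velocity numbers
        "## Cadence schedule\n" + context["cadence"],
        "## Domain boundaries\n" + context["boundaries"],
    ]
    return "\n\n".join(sections)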

2. Judge Call

A separate LLM (Sonnet 4.6) grades the agent's response against weighted rubric criteria. Each criterion gets a 0.0 to 1.0 score with written reasoning. The judge evaluates whether the agent used its injected context, stayed within domain boundaries, and produced quality deliverables.
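A minimal sketch of the judge call, assuming a generic chat-completion client; the llm.complete signature, prompt wording, and JSON shape are assumptions, not the framework's actual API.

import json

JUDGE_SYSTEM = (
    "Grade the agent's response against each rubric criterion. For every "
    "criterion return a score from 0.0 to 1.0 with written reasoning, "
    'as JSON: [{"criterion": str, "score": float, "reasoning": str}]'
)

def judge_response(transcript: list, rubric: list, llm) -> list:
    # The judge sees the conversation and the weighted rubric, but not
    # the agent's full system prompt.
    payload = json.dumps({"transcript": transcript, "rubric": rubric})
    response = llm.complete(              # assumed client, not a real SDK call
        model="sonnet-4.6",               # placeholder judge-model identifier
        system=JUDGE_SYSTEM,
        messages=[{"role": "user", "content": payload}],
    )
    return json.loads(response.text)      # per-criterion scores with reasoning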

What We Test

Every agent is tested across three categories. The question isn't "does it respond?" It's whether the agent uses its context, respects its boundaries, and produces real work.

Context Utilization

Does the agent use the strategy momentum, brand voice, and cadence data injected into its prompt?

  • References stalling strategies by name
  • Cites velocity and trend data
  • Uses cadence schedules for timing
  • Grounds recommendations in brand context

Domain Boundaries

Does the agent stay in its lane and hand off out-of-domain requests to the right specialist?

  • SEO specialist doesn't write blog posts
  • Lead writer doesn't build dashboards
  • Analyst doesn't produce social copy
  • Names the correct specialist for handoff

Deliverable Quality

When the request is in-domain, does the agent produce substantive work rather than deferring?

  • Writer produces actual copy, not outlines
  • Strategist delivers structured plans
  • Analyst provides measurement frameworks
  • Creative director produces briefs with specifics
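Tying the three categories together, a single test case might be encoded like this. The shape is an assumption for illustration; criterion names, weights, and the must-pass flag follow the scoring rules described next, but none of this is the framework's actual schema.

test_case = {
    "agent": "lead_writer",
    "category": "deliverable_quality",   # or context_utilization / domain_boundaries
    "user_message": "Write a blog intro about email marketing best practices.",
    "criteria": [
        {"name": "produces_actual_copy",      "weight": 5, "must_pass": True},
        {"name": "grounded_in_brand_context", "weight": 3, "must_pass": False},
        {"name": "cites_cadence_timing",      "weight": 1, "must_pass": False},
    ],
    "threshold": 0.70,                   # pass thresholds typically sit at 65-70%
}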

Scoring

Per criterion

The judge scores each criterion 0.0 to 1.0 with written reasoning. Criteria have weights (1 to 5) reflecting importance. Some criteria are must-pass gates: if they fail, the entire test fails regardless of overall score.

Per test

Weighted average of criterion scores. A test passes if the score meets its threshold (typically 65-70%) AND all must-pass criteria pass. A perfect score on optional criteria can't save a failed must-pass.
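Concretely, the per-test computation reduces to a weighted average gated by the must-pass criteria. A sketch, assuming a criterion "fails" below 0.5; that cutoff is an illustrative assumption the rules above don't pin down.

def score_test(test_case: dict, scores: dict) -> tuple:
    """Weighted average of criterion scores, gated by must-pass criteria.

    `scores` maps criterion name -> judge score in [0.0, 1.0].
    """
    total_weight = sum(c["weight"] for c in test_case["criteria"])
    weighted_sum = sum(c["weight"] * scores[c["name"]] for c in test_case["criteria"])
    test_score = weighted_sum / total_weight

    # A failed must-pass gate fails the test regardless of the score.
    gates_pass = all(
        scores[c["name"]] >= 0.5          # assumed per-criterion fail cutoff
        for c in test_case["criteria"]
        if c["must_pass"]
    )
    return test_score, (test_score >= test_case["threshold"] and gates_pass)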

Per agent

Average of the agent's test scores. Agents pass at 70% or above.

Overall

Weighted average across all agents. Orchestrators (CMO, Analyst) are weighted 2x because they route work to specialists and set strategic direction.
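Rolling up, the per-agent number is a plain average and the overall number a weighted one. A sketch with illustrative agent keys:

ORCHESTRATORS = {"cmo", "analyst"}       # weighted 2x in the overall average

def agent_score(test_scores: list) -> float:
    return sum(test_scores) / len(test_scores)    # agents pass at >= 0.70

def overall_score(agent_scores: dict) -> float:
    weighted_sum = weight_total = 0.0
    for agent, score in agent_scores.items():
        w = 2.0 if agent in ORCHESTRATORS else 1.0
        weighted_sum += w * score
        weight_total += w
    return weighted_sum / weight_total

Applied to the rounded per-agent scores in the table below, with 2x weight on CMO (Cleon) and Analyst (Marcus), this rollup lands at roughly the 92.9% overall figure.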

Current Results

Sonnet 4.6 latest (Run 8, March 27, 2026): 92.9% overall, with 43/45 tests passing across 15 agents under full context injection.

Agent                         Score   Tests
Market Research Specialist     100%    3/3
Content Strategist             100%    3/3
Brand Intelligence              99%    3/3
Analyst (Marcus)                99%    3/3
CMO (Cleon)                     99%    3/3
SEO Specialist                  99%    3/3
Social Media Strategist         99%    3/3
Email Marketing                 97%    3/3
Lead Writer                     94%    4/4
Paid Media Specialist           91%    3/3
Remote Gateway                  86%    2/2
Conversion Optimizer            85%    2/3
Crisis Response                 85%    3/3
Creative Director               83%    3/3
Marketing Analytics             65%    2/3

Scores are averages across multiple Sonnet 4.6 runs; the Tests column shows passed/total per agent.

Multi-turn Testing

Marketing agents are designed to work with users, not just respond to single prompts. They ask clarifying questions before producing deliverables. Our eval framework simulates this by providing follow-up answers when agents ask for context.

Turn 1 (simulated user):
"Write a blog intro about email marketing best practices."
Turn 2 (agent asks):
"What's your target audience? What tone are you going for?"
Turn 3 (simulated answer):
"B2B SaaS founders. Conversational but authoritative."
Turn 4 (agent delivers):
[Actual blog intro copy graded by judge]

The judge sees the full conversation and grades the final deliverable. This tests the complete interaction loop, not just prompt-response quality.
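A minimal sketch of that simulation loop, assuming a generic chat client (subject.chat is a hypothetical call): scripted answers stand in for the user until the agent stops asking and delivers.

def run_multi_turn(subject, system_prompt: str, opening: str,
                   scripted_answers: list, max_turns: int = 6) -> list:
    """Drive the agent through a simulated conversation.

    Returns the full transcript, handed to the judge unmodified.
    """
    messages = [{"role": "user", "content": opening}]
    answers = iter(scripted_answers)
    for _ in range(max_turns):
        reply = subject.chat(system=system_prompt, messages=messages)
        messages.append({"role": "assistant", "content": reply})
        answer = next(answers, None)
        if answer is None:
            break          # no scripted answer left: last reply is the deliverable
        messages.append({"role": "user", "content": answer})
    return messages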

Full Transparency

Every eval report captures the complete data. Nothing is summarized away.

  • Full conversation transcripts (every turn)
  • Per-criterion scores with judge reasoning
  • Test case descriptions and criteria weights
  • Must-pass gate results
  • Subject and judge model versions
  • Token usage and cost per run
  • Agent descriptions and team assignments
  • Methodology description embedded in report
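Put together, one report record might look like the following; every field name here is an assumption mirroring the capture list above, not the actual report schema.

report = {
    "run": 8,
    "subject_model": "sonnet-4.6",               # placeholder model identifiers
    "judge_model": "sonnet-4.6",
    "methodology": "Two-call LLM-as-judge ...",  # embedded description
    "usage": {"tokens_in": 0, "tokens_out": 0, "cost_usd": 0.0},
    "agents": [{
        "name": "Lead Writer",
        "team": "Content",
        "description": "...",
        "tests": [{
            "description": "...",
            "transcript": [{"role": "user", "content": "..."}],  # every turn
            "criteria": [{"name": "...", "weight": 5, "must_pass": True,
                          "score": 0.9, "reasoning": "..."}],
            "passed": True,
        }],
    }],
}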