Agent Quality & Governance
Every agent is tested. Every action is governed. Here's how.
AI agents that manage your marketing need two things: proof they work well and hard limits on what they can do. We publish both: the evaluation methodology and scores, and the governance rules and enforcement architecture. Transparency isn't a feature. It's the baseline.
Evaluations
15 agents tested against 45 rubric-graded test cases with an independent LLM judge. Multi-turn conversations, weighted criteria, must-pass gates. Published methodology and scores.
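To make "weighted criteria, must-pass gates" concrete, here is a minimal sketch of how a rubric-graded score can be assembled from an LLM judge's per-criterion grades. The criterion names, weights, and the `score_case` helper are illustrative, not our production harness.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float            # contribution to the weighted score
    must_pass: bool = False  # gate: failing this zeroes the whole test case

def score_case(grades: dict[str, float], rubric: list[Criterion]) -> float:
    """Fold per-criterion judge grades (0.0 to 1.0) into one 0-100 score.

    Must-pass criteria act as gates: if one fails, the test case scores
    zero no matter how well the weighted criteria did.
    """
    if any(c.must_pass and grades[c.name] < 1.0 for c in rubric):
        return 0.0
    total = sum(c.weight for c in rubric)
    return 100 * sum(c.weight * grades[c.name] for c in rubric) / total

# Illustrative rubric for one multi-turn test case.
rubric = [
    Criterion("stays_in_brand_voice", weight=2.0),
    Criterion("uses_provided_context", weight=3.0),
    Criterion("no_fabricated_metrics", weight=1.0, must_pass=True),
]
grades = {"stays_in_brand_voice": 0.9,
          "uses_provided_context": 1.0,
          "no_fabricated_metrics": 1.0}
print(round(score_case(grades, rubric), 1))  # 96.7
```

The gate is the point: an agent that writes beautifully but invents a metric scores zero on that case, so strong prose can never average away a disqualifying failure.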
Governance
Infrastructure-level enforcement controls what every agent can do. Permission levels, protected paths, steward lockdowns, and a full audit trail. Agents cannot override governance regardless of prompting.
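The enforcement claim reduces to one property: the check runs outside the agent's prompt context. The sketch below shows the shape of such a gate, with hypothetical permission levels, path globs, and an in-memory `audit_log`; the real controls sit in infrastructure, not application code.

```python
import fnmatch
from datetime import datetime, timezone

# Hypothetical permission ladder: each action requires at least this level.
REQUIRED_LEVEL = {"read": 0, "propose": 1, "execute": 2}
# Hypothetical protected paths; actual rules live in the enforcement layer.
PROTECTED_GLOBS = ["governance/*", "billing/*", "clients/*/contracts/*"]

audit_log: list[dict] = []  # append-only trail: every decision is recorded

def authorize(agent: str, agent_level: int, action: str, path: str,
              steward_lockdown: bool = False) -> bool:
    """Decide the action outside the model loop, so no prompt can relax it."""
    if steward_lockdown and action != "read":
        verdict = "deny:lockdown"        # steward lockdown forces read-only
    elif action != "read" and any(fnmatch.fnmatch(path, g)
                                  for g in PROTECTED_GLOBS):
        verdict = "deny:protected_path"  # protected paths: never agent-writable
    elif agent_level < REQUIRED_LEVEL[action]:
        verdict = "deny:permission"
    else:
        verdict = "allow"
    audit_log.append({"ts": datetime.now(timezone.utc).isoformat(),
                      "agent": agent, "action": action,
                      "path": path, "verdict": verdict})
    return verdict == "allow"

# Even a prompt-injected "execute" against a protected path is denied, and
# the denial itself lands in the audit trail.
assert not authorize("content-agent", agent_level=2,
                     action="execute", path="governance/policies.yaml")
```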
Ongoing Monitoring
Weekly Automated
Smoke tests run every Sunday night. 45 test cases across all 15 agents. They catch regressions from model updates and context drift before they reach clients.
After Context Changes
Targeted evaluations run when strategies change, brand voice is updated, or new competitors are onboarded. Validates that agents use the new context correctly.
Monthly Baseline
Full evaluation with Sonnet as the subject model. This is the client-reportable number. Tracked month over month for quality trends.
Quarterly Deep
Expanded test coverage with client-specific scenarios derived from real conversations. Edge cases, multi-step reasoning, cross-agent coordination.
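Taken together, the four cadences form a simple trigger-to-tier mapping. A sketch of how that dispatch could look, with hypothetical event names and an illustrative `EvalRun` record:

```python
from dataclasses import dataclass

@dataclass
class EvalRun:
    tier: str
    subject_model: str  # which model is under test
    cases: str          # which test cases to include
    reportable: bool    # produces the client-facing number?

# Hypothetical dispatch table mirroring the four cadences above.
SCHEDULE = {
    "weekly_smoke":     EvalRun("smoke", "production default",
                                "all 45 cases", False),
    "context_change":   EvalRun("targeted", "production default",
                                "cases touching the changed context", False),
    "monthly_baseline": EvalRun("baseline", "Sonnet", "all 45 cases", True),
    "quarterly_deep":   EvalRun("deep", "Sonnet",
                                "expanded, client-specific scenarios", True),
}

TRIGGERS = {
    "cron:sunday_night":            "weekly_smoke",
    "context:strategy_changed":     "context_change",
    "context:brand_voice_updated":  "context_change",
    "context:competitor_onboarded": "context_change",
    "cron:month_start":             "monthly_baseline",
    "cron:quarter_start":           "quarterly_deep",
}

def on_event(event: str) -> EvalRun | None:
    """Map a trigger (cron tick or context update) to its evaluation tier."""
    key = TRIGGERS.get(event)
    return SCHEDULE[key] if key else None
```

Only the monthly baseline and quarterly deep runs are marked reportable: the smoke and targeted tiers exist to catch problems early, not to produce the tracked number.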
Model Behavior
We evaluate across Claude model tiers. The results informed our default model selection.
| Model | Score | Behavior |
|---|---|---|
| Sonnet 4.6 | 93% | Acts on available context. Responsive, action-oriented. |
| Opus 4.6 | 84% | More deliberate. Requests additional data before committing. |
Opus scores lower because it's more cautious, not less capable. Our rubrics reward context utilization. A rubric testing analytical depth would likely favor Opus. We default to Sonnet for specialist agents where responsiveness matters, with Opus available for deep analysis tasks.
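In practice that default reduces to a routing rule. A minimal sketch, with a hypothetical task taxonomy and placeholder model names:

```python
# Hypothetical task taxonomy; the routing rule is the point: deliberation-heavy
# work goes to Opus, everything else defaults to Sonnet.
DEEP_ANALYSIS_TASKS = {"quarterly_review", "competitive_deep_dive",
                       "attribution_modeling"}

def pick_model(task_kind: str) -> str:
    if task_kind in DEEP_ANALYSIS_TASKS:
        return "opus"    # more deliberate; requests data before committing
    return "sonnet"      # acts on available context; default for specialists

assert pick_model("draft_social_post") == "sonnet"
assert pick_model("quarterly_review") == "opus"
```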