TL;DR: 2025 is the year agents meet real customers—and ops teams carry the risk. This guide gives you a concrete AgentOps baseline: what to instrument on day 0, how to stand up multi‑turn evals in a week, which SLOs actually matter, and how to respond when something breaks in prod. We also show where AP2, MCP/A2A, AgentKit, LangSmith, and OpenTelemetry fit together.
Why now
Agent platforms have gone from demos to deployment: OpenAI’s AgentKit added native evals and admin controls; Salesforce and others are productizing agent stacks; and funding keeps flowing into customer‑facing agent startups. If you’re shipping support or shopping agents this quarter, you need reliability plans on paper—not just prompts. (Sources: OpenAI AgentKit; TechCrunch coverage of recent agent funding.)
Who this guide is for
- E‑commerce leaders rolling out shopping/checkout agents.
- SaaS founders adding support, onboarding, or billing agents.
- Product and platform teams responsible for uptime, quality, and safety.
Your AgentOps baseline (instrument these on day 0)
Start with a minimal, universal telemetry set that works across frameworks (OpenAI Agents/AgentKit, Anthropic via MCP, LangGraph/LangChain, CrewAI):
- Goal Completion Rate (per intent): % of sessions where the agent achieves the user’s goal (refund issued, order placed, ticket resolved).
- Tool Success Rate: % of successful tool/API calls (e.g., Shopify refund, Zendesk macro, Stripe charge). Track error families, not just 200/500.
- Latency: First Token Time and Time‑to‑Decision (time until the first tool call is issued), not just request duration.
- Fallbacks & Handover: % of sessions routed to human + handover reason taxonomy.
- Containment: % of sessions resolved without human; for sales, Revenue per Agent Session and AOV delta.
- Safety Signals: refusal/guardrail triggers, sensitive‑action approvals, and impersonation checks.
- Cost per Resolved Session: model + tools + infra per successful outcome.
- AP2 Checkout Signals (if you sell): Intent Mandate present, Cart Mandate present, Step‑up challenge success.
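The baseline above can be computed directly from labeled session records. A minimal sketch in Python, assuming a simple session schema (`intent`, `goal_achieved`, `escalated`, `tool_calls`, `cost_usd`) that you would map from your own trace store:

```python
# Sketch: computing the day-0 baseline metrics from session records.
# Field names are illustrative -- adapt them to your own trace schema.
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class Session:
    intent: str                 # e.g. "refund", "order_status"
    goal_achieved: bool         # outcome label on the thread
    escalated: bool             # handed over to a human
    tool_calls: list = field(default_factory=list)  # [(tool_name, ok), ...]
    cost_usd: float = 0.0       # model + tools + infra for this session

def baseline_metrics(sessions):
    by_intent = defaultdict(list)
    for s in sessions:
        by_intent[s.intent].append(s)
    # Goal Completion Rate, per intent
    gcr = {i: sum(s.goal_achieved for s in ss) / len(ss)
           for i, ss in by_intent.items()}
    # Tool Success Rate across all tool/API calls
    calls = [ok for s in sessions for (_tool, ok) in s.tool_calls]
    tool_success = sum(calls) / len(calls) if calls else None
    # Containment: resolved without a human
    containment = sum(not s.escalated for s in sessions) / len(sessions)
    # Cost per Resolved Session
    resolved = [s for s in sessions if s.goal_achieved]
    cost_per_resolved = (sum(s.cost_usd for s in resolved) / len(resolved)
                         if resolved else None)
    return {"gcr_by_intent": gcr, "tool_success_rate": tool_success,
            "containment": containment,
            "cost_per_resolved_session": cost_per_resolved}

sessions = [
    Session("refund", True, False, [("shopify_refund", True)], 0.04),
    Session("refund", False, True, [("shopify_refund", False)], 0.06),
    Session("order_status", True, False, [("order_lookup", True)], 0.01),
]
print(baseline_metrics(sessions))
```

Latency and safety signals come straight from traces rather than session rollups, so they live in your observability backend instead of this aggregation.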
Use OpenTelemetry’s Generative AI semantic conventions so these metrics aren’t bespoke. You’ll get portable traces/metrics like token usage and time‑per‑token across vendors, which simplifies dashboards and SLOs.
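Portability comes from the attribute names. A sketch of `gen_ai.*` span attributes as named in the GenAI semantic conventions (keys reflect the convention at the time of writing, so check the current spec; in production you would set these on a real span via the OpenTelemetry SDK rather than build a plain dict):

```python
# Sketch: naming span attributes per the OpenTelemetry GenAI semantic
# conventions so the same dashboards work across model vendors.
# Values here are illustrative.
def genai_span_attributes(model, input_tokens, output_tokens,
                          operation="chat", system="openai"):
    return {
        "gen_ai.operation.name": operation,       # e.g. "chat"
        "gen_ai.system": system,                  # model provider
        "gen_ai.request.model": model,            # requested model id
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

attrs = genai_span_attributes("gpt-4.1", 812, 164)
print(attrs)
```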
Stand up a week‑one eval + observability loop
Goal: catch regressions before customers do, and prove improvements with data.
- Days 1–2: Capture traces + labels. Turn on structured traces (requests, tool calls, errors, guardrails, decisions). Store intent labels and outcome labels on every conversation thread.
- Days 3–4: Build an offline eval set. 50–150 real prompts per top intent; add gold outcomes + acceptable tool sequences. Start with one critical path (e.g., “return with exchange” for Shopify).
- Day 5: Add multi‑turn evals. Move beyond one‑shot grading—evaluate the whole trajectory to verify goals were actually achieved and where the plan failed. LangSmith’s multi‑turn evals and Insights Agent are built for this.
- Day 6: Define SLOs and error budgets. Examples below. Wire alerts from traces/metrics (PagerDuty/Slack).
- Day 7: Ship a pre‑prod gate. Require “green” multi‑turn evals + SLO adherence before new agent versions roll to production.
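The Day 7 gate can be a small function: block promotion unless every intent’s multi‑turn eval pass rate clears its threshold. A sketch with illustrative thresholds and result shapes, not tied to any specific eval framework:

```python
# Sketch: a pre-prod release gate. A new agent version promotes only
# if every intent's multi-turn eval pass rate clears its threshold.
def release_gate(eval_results, thresholds, default_threshold=0.9):
    """eval_results: {intent: [bool, ...]} pass/fail per eval run."""
    failures = {}
    for intent, runs in eval_results.items():
        rate = sum(runs) / len(runs)
        if rate < thresholds.get(intent, default_threshold):
            failures[intent] = rate
    return (len(failures) == 0, failures)

ok, failing = release_gate(
    {"refund": [True, True, True, False], "order_status": [True] * 5},
    {"refund": 0.8, "order_status": 0.9},
)
# refund ran at 0.75 against a 0.80 threshold, so the gate blocks
print(ok, failing)
```

Wire this into CI so a red gate fails the deploy job, the same way unit-test failures do.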
Suggested SLOs (tune to your business)
- Support agent: 7‑day Goal Completion Rate ≥ 85%; Containment ≥ 60%; Median TTD ≤ 5s; Safety incident rate ≤ 0.1% of sessions.
- Shopping agent: Checkout success ≥ 75% when Intent + Cart Mandates are present; Step‑up completion ≥ 90% (human‑present); Refund dispute rate ≤ baseline.
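Each SLO implies an error budget. A sketch of budget-burn math for the 85% Goal Completion target above (the session counts are illustrative):

```python
# Sketch: error-budget burn against a Goal Completion SLO. With an
# 85% target, 15% of sessions may miss the goal in the window before
# the budget is spent; alert when burn crosses a threshold.
def error_budget_burn(total_sessions, failed_sessions, slo_target=0.85):
    budget = total_sessions * (1 - slo_target)   # allowed failures
    return failed_sessions / budget if budget else float("inf")

burn = error_budget_burn(total_sessions=2000, failed_sessions=240)
# budget is ~300 failed sessions; 240/300 = 0.8 of the budget consumed
print(round(burn, 2))
```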
Payments and checkout: add AP2 signals early
If your agent can buy things, integrate Agent Payments Protocol (AP2) primitives into your telemetry and disputes flow from day one. AP2 standardizes agent‑led purchases across platforms and payments, and works alongside A2A/MCP. Instrument the presence of signed Intent and Cart Mandates, the human‑present vs. not‑present flag, and outcomes of step‑up challenges—then join those to approvals, chargebacks, and AOV.
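Joining those signals to payment outcomes can be as simple as a flat record per checkout. A sketch, where the field names (`intent_mandate`, `cart_mandate`, `human_present`, `step_up_passed`) are our own shorthand for AP2’s concepts, not the protocol’s wire format; consult the AP2 spec for the real mandate schemas:

```python
# Sketch: joining AP2 checkout signals to payment outcomes so the
# "checkout success when both mandates are present" SLO is computable.
from dataclasses import dataclass

@dataclass
class CheckoutSession:
    intent_mandate: bool      # signed Intent Mandate present
    cart_mandate: bool        # signed Cart Mandate present
    human_present: bool       # human-present vs. delegated purchase
    step_up_passed: bool      # outcome of any step-up challenge
    approved: bool            # payment authorization result

def checkout_success_rate(sessions):
    """Success among sessions where both mandates are present."""
    eligible = [s for s in sessions
                if s.intent_mandate and s.cart_mandate]
    if not eligible:
        return None
    return sum(s.approved for s in eligible) / len(eligible)

checkouts = [
    CheckoutSession(True, True, True, True, True),
    CheckoutSession(True, True, True, False, False),
    CheckoutSession(True, False, False, True, True),  # no Cart Mandate
]
print(checkout_success_rate(checkouts))  # 1 of 2 eligible -> 0.5
```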
Going deeper? See our 30‑day storefront checklist for Shopify/WooCommerce and how AP2 compares to ACP: Make Your Store Agent‑Ready: AP2 vs ACP.
Safety and incident response (copy this runbook)
Agents that browse or operate computers inherit new classes of risk. A recent agentic browsing flaw shows why you need rapid triage, rollback, and comms—before holiday traffic hits.
When something breaks
- Freeze the impacted version and route to human for the affected intents. Roll back via feature flag or traffic split.
- Snapshot traces, prompts, and tool outputs for the failing threads; preserve AP2 mandates where payments are involved.
- Classify the failure: prompt injection, tool outage, data drift, impersonation, safety/guardrail miss, AP2 step‑up failure.
- Contain: disable risky tools, require confirmations for sensitive actions, or tighten allow‑lists.
- Communicate to customers affected (templates ready), especially for financial or privacy‑related incidents.
- Fix + Verify in staging using multi‑turn evals; require green runs before re‑enabling.
- Postmortem: root cause, contributing factors, and prevention items (tests, evals, policy rules).
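The runbook’s failure taxonomy is worth encoding so incidents and postmortems stay structured. A minimal sketch; the taxonomy values come from the steps above, and the rest of the schema is illustrative:

```python
# Sketch: a minimal incident record mirroring the runbook's failure
# classes, with a helper that logs each containment action taken.
from dataclasses import dataclass, field
from enum import Enum

class FailureClass(Enum):
    PROMPT_INJECTION = "prompt_injection"
    TOOL_OUTAGE = "tool_outage"
    DATA_DRIFT = "data_drift"
    IMPERSONATION = "impersonation"
    GUARDRAIL_MISS = "guardrail_miss"
    AP2_STEP_UP_FAILURE = "ap2_step_up_failure"

@dataclass
class Incident:
    version: str                    # frozen agent version
    affected_intents: list
    failure_class: FailureClass
    traces_snapshotted: bool = False
    contained: bool = False
    actions: list = field(default_factory=list)

    def contain(self, action):
        """Record a containment step and mark the incident contained."""
        self.actions.append(action)
        self.contained = True

inc = Incident("support-agent@2025.10.1", ["refund"],
               FailureClass.TOOL_OUTAGE)
inc.contain("disable shopify_refund tool; route refund intent to human")
print(inc.failure_class.value, inc.contained)
```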
Helpful references from our library: AI Agent Red Teaming in 2025 and Stop Agent Impersonation.
Choosing your stack (where each piece fits)
- Agent build + governance: OpenAI AgentKit (Agent Builder, ChatKit, Evals). It brings visual workflow design, embeddable chat UIs, and trace‑level grading out of the box—useful if your core models are already OpenAI.
- Interoperability: Anthropic’s MCP for tool/data connections; pair with A2A for agent‑to‑agent messaging. This keeps you portable across models and vendors.
- Observability + evals: LangSmith for multi‑turn evals and production insights; align traces/metrics with OpenTelemetry so your ops team can use the same observability backbone they use for microservices.
- Commerce: AP2 for mandates, step‑ups, and cross‑platform agent checkout; wire AP2 signals into your attribution and disputes systems.
Example dashboard tiles (copy/paste)
- Support: GCR by intent; tool success by connector; median TTD; escalation rate; cost/session; safety incidents; eval pass‑rate (multi‑turn).
- Shopping: AP2 mandate presence; step‑up success; checkout success; revenue per agent session; refund dispute rate; agent attribution share (see our 2025 Agent Attribution Playbook).
7‑day rollout plan (starter)
- Mon: Enable traces + OpenTelemetry GenAI metrics; label intents/outcomes.
- Tue: Draft top‑3 intents and success criteria; sample 100 real conversations.
- Wed: Build offline evals; add tool‑sequence golds.
- Thu: Turn on multi‑turn evals; define pass thresholds per intent.
- Fri: Add SLOs + alerts; wire to Slack/PagerDuty.
- Sat: Add AP2 signals (if applicable) and basic incident runbook.
- Sun: Dry run a rollback + handover drill; ship pre‑prod gate.
Bottom line
AgentOps isn’t extra work—it’s how you protect revenue, brand, and customer trust while moving fast. Get the baseline in place, then iterate with evals and SLOs tied to outcomes your board cares about.
Next step: Want a 60‑minute AgentOps review (free) on your storefront or support queue? Start with our Zendesk agent playbook or launch a voice agent in 14 days—then book a consult.
