This Week in AI Agents: IDEsaster, Agent 365, and Anthropic’s “Skills over Swarms”

Published: December 9, 2025 — For startup founders, e‑commerce operators, and hands‑on tech leaders.

The last seven days in agentic AI were a wake‑up call—and a map forward:

Dec 6: Researchers disclosed 30+ vulnerabilities in AI‑powered IDEs, nicknamed “IDEsaster.”
Dec 4–6: Reports surfaced that Google’s Antigravity IDE wiped a developer’s drive after a misinterpreted “clear cache” request.
Today (Dec 9): Anthropic researchers argued the industry needs fewer agents and more reusable skills (modular capabilities you attach to a general-purpose agent) rather than an ever-growing bot zoo.
Meanwhile: Microsoft’s Agent 365 and AWS’s Kiro preview underline where enterprise ops are heading: registries, policies, evals, and long‑running coding agents.

Below is a clear summary of what changed—and a 7‑day plan you can ship this week to reduce risk and increase ROI.

What changed (and why it matters)

1) “IDEsaster” shows agent + IDE is a new attack surface

The research connects three ingredients most teams already have: prompt injection, auto-approved tool calls, and legitimate IDE features. Chained together, they enable data exfiltration and even code execution, with no vulnerability required in the agent plugin itself. This isn’t hypothetical; multiple vendors have assigned CVEs and shipped guidance.

2) Real‑world incident: destructive actions without guardrails

The Antigravity story illustrates a basic failure mode: a semi‑autonomous agent interprets a vague request and performs a destructive command, silently. The lesson isn’t “never use agents”; it’s that destructive commands must be gated by policy and human confirmation.
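A minimal sketch of such a gate, assuming the agent proposes shell commands as plain strings before running them. The deny list and the function names (`is_destructive`, `gate`) are hypothetical, not taken from any vendor's product.

```python
import shlex

# Hypothetical deny list for illustration; extend it for your own tools.
DESTRUCTIVE_COMMANDS = {"rm", "rmdir", "del", "format", "mkfs", "dd"}


def is_destructive(command: str) -> bool:
    """Return True when the command's first token is on the deny list."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unparseable input (e.g. an unclosed quote) fails closed.
        return True
    return bool(tokens) and tokens[0].lower() in DESTRUCTIVE_COMMANDS


def gate(command: str, confirmed_by_human: bool = False) -> str:
    """Return "allow" or "needs_confirmation" for a proposed shell command."""
    if is_destructive(command) and not confirmed_by_human:
        return "needs_confirmation"
    return "allow"
```

This only inspects the first token, so chained commands or subcommands such as a forced git push would slip through. A production policy would likely need per-tool rules rather than string matching.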

3) Platform direction: control planes are table stakes

Agent 365’s release validated a pattern we’ve been advocating: treat agents like digital employees with identity, lifecycle, and access policy. Pair a registry with conditional access and runtime checks; otherwise, you’re flying blind.
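For teams without a platform yet, the registry-plus-runtime-check pattern can be sketched in a few lines. This is an illustrative in-memory version, not Agent 365's API; the `AgentRecord` fields and the `AgentRegistry` class are assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class AgentRecord:
    """One registry entry; the field set is an assumption for illustration."""
    name: str
    owner: str
    purpose: str
    scopes: list[str]
    allowed_tools: list[str]
    environments: list[str]
    expires: date


class AgentRegistry:
    """In-memory registry with a runtime check for tool calls."""

    def __init__(self) -> None:
        self._records: dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        self._records[record.name] = record

    def is_allowed(self, agent: str, tool: str, today: date) -> bool:
        # Unregistered or expired agents are denied by default.
        record = self._records.get(agent)
        if record is None or today > record.expires:
            return False
        return tool in record.allowed_tools
```

The point of the sketch is the default: an agent that is not listed, or whose entry has expired, gets no tools at all.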

4) Architecture pivot: from “swarms” to skills

Anthropic’s message is refreshing for builders: standardize capabilities once (reconciling invoices, triaging tickets, generating PRs) and attach them as skills to a smaller number of trustworthy agents. It’s cheaper to govern, easier to evaluate, and less brittle than creating a new agent for every task.
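To make the idea concrete, here is a rough sketch of one agent carrying attachable skills. It is not Anthropic's skills format; `Skill`, `Agent`, and the toy `triage_ticket` rule are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Skill:
    """A reusable capability with a declared input contract."""
    name: str
    run: Callable[[dict], dict]
    required_inputs: tuple[str, ...] = ()


@dataclass
class Agent:
    """A general-purpose agent that only acts through attached skills."""
    name: str
    skills: dict[str, Skill] = field(default_factory=dict)

    def attach(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def invoke(self, skill_name: str, inputs: dict) -> dict:
        if skill_name not in self.skills:
            raise KeyError(f"{self.name} has no skill named {skill_name!r}")
        skill = self.skills[skill_name]
        missing = [key for key in skill.required_inputs if key not in inputs]
        if missing:
            raise ValueError(f"missing inputs: {missing}")
        return skill.run(inputs)


def triage_ticket(inputs: dict) -> dict:
    # Toy rule standing in for a real triage skill.
    urgent = "refund" in inputs["subject"].lower()
    return {"priority": "high" if urgent else "normal"}
```

Usage would look like `agent = Agent("ops-agent")`, then `agent.attach(Skill("triage_ticket", triage_ticket, ("subject",)))`, then `agent.invoke("triage_ticket", {"subject": "Refund please"})`. Because every skill declares its inputs, the same evaluation harness can exercise all of them.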

A 7‑day response plan you can run now

Use this to brief your team today and execute by next Tuesday.

Freeze destructive actions to “confirm-only.” For any agent with file, shell, or external API write permissions, enforce dry‑run + explicit user confirmation for delete, drop, truncate, or force‑push operations. If your IDE/agent doesn’t support confirmations, remove the tool.
Register every agent. Stand up a basic control plane. If you’re on Microsoft, adopt Agent 365 with Entra Agent ID. Track: owner, purpose, scopes, allowed tools, and environments. Not on Microsoft? Document these in a lightweight registry first, then migrate to a platform.
Enforce least privilege. Create dedicated service identities for agents with repo‑scoped PATs, read‑only by default. Isolate secrets. Prohibit wildcard globs in file tools (e.g., rmdir /q D:\* is never OK).
Add eval gates. Build a tiny, stable evaluation set per agent (10–30 tasks). No merge or deploy unless tests and evals pass. For coding agents, draw the tasks from the kinds of issues your repo actually sees (docs fixes, unit tests, lint errors) and measure pass rate weekly.
Instrument with traces. Use OpenTelemetry for prompts, tool calls, token spend, and errors. Pipe to Datadog, Grafana, or Jaeger. If something goes wrong, you need causality, not vibes.
Roll out by surface. Start with API‑only or browser‑sandbox agents before desktop agents. When you reach desktop automation, apply device hardening (TCC/PPPC on macOS, WDAC on Windows) and strict allowlists.
Ship a “skills first” backlog. Identify 5 repeatable skills (e.g., “reconcile Stripe payouts,” “close duplicate support tickets,” “fix flaky tests”). Document inputs, steps, expected outputs, and guardrails. Attach skills to a small number of agents you can evaluate deeply.
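The least-privilege step above prohibits wildcard globs in file tools. One way to enforce that is to validate every target path before a file tool runs. The sketch below assumes POSIX-style paths and a single workspace root; `validate_target` and `GLOB_CHARS` are hypothetical names.

```python
from pathlib import PurePosixPath

# Characters that make a path a glob pattern rather than a literal path.
GLOB_CHARS = set("*?[]")


def validate_target(path: str, workspace: str) -> bool:
    """Accept only a literal, absolute path strictly inside the workspace."""
    if any(ch in GLOB_CHARS for ch in path):
        return False
    target = PurePosixPath(path)
    # Reject relative paths and any attempt to climb out with "..".
    if not target.is_absolute() or ".." in target.parts:
        return False
    # The workspace root itself is not an acceptable target.
    return PurePosixPath(workspace) in target.parents
```

This check is purely lexical: it does not resolve symlinks or touch the filesystem, so treat it as a first filter in front of OS-level permissions, not a replacement for them.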

Bring this to life with our companion playbooks

Set up your control plane: Agent registries are here.
Give agents first‑class identity: Agent identity blueprint.
Measure reliability fast: Agent evals in 7 days.
Harden laptops and desktops: 7‑step desktop agent blueprint.
Run a safe pilot for coding agents: 14‑day incident‑safe runbook.
Control cost as you scale: Agent FinOps: cut 30–60%.
Going multi‑vendor? Start here: A2A + AP2 2026 blueprint and Ship an agent‑ready SaaS.

Vendor watch: questions to ask this week

Microsoft (Agent 365)

How do we register third‑party and homegrown agents? Can we apply conditional access and per‑tool policies?
Do you surface OpenTelemetry spans for prompts and tool calls? Can we export to our SIEM?

AWS (AgentCore / Kiro)

Which runtime policies can block destructive file ops by default (delete, chmod, rmdir)?
What’s the sandbox story for Kiro’s “days‑long” runs? How do we cap scope, time, and cost?

Google (IDE/desktop agents)

What confirmations exist for destructive commands across IDE and desktop agents?
Is there a registry + audit trail for every agent action we can export if something goes wrong?

What “good” looks like in 30 days

Safety: Zero destructive actions without human confirmation; all agents registered with owners, scopes, and expiry dates.
Reliability: Eval pass rate ≥ 85% on your task set; trace coverage ≥ 95% of tool calls.
Cost: Token spend per resolved task visible in dashboards; cost per resolved task 30–50% lower than the human‑only baseline for the same class of work.
Velocity: Two production skills automated end‑to‑end (e.g., invoice match, PR test‑fixes).

Common pitfalls to avoid (highlighted by this week’s news)

Vague instructions + broad permissions. Natural‑language requests like “clean the cache” plus write/delete rights are a recipe for disaster. Constrain tools and require structured intents.
No registry. If you can’t list your agents, owners, and allowed tools in one place, you can’t govern them.
Skipping evaluations. Benchmarks are great, but reliability only improves when you test against your own tasks and make regressions visible.
Zero telemetry. Without traces, post‑mortems devolve into guesswork and vendor blame.
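The "structured intents" fix from the first pitfall can be sketched as follows: instead of handing free text to a tool, the agent must emit a small payload that is validated against an allowlist. The `Action` values, the `ALLOWED_TARGETS` paths, and `parse_intent` are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CLEAR_CACHE = "clear_cache"
    DELETE_FILE = "delete_file"


@dataclass(frozen=True)
class Intent:
    action: Action
    target: str
    dry_run: bool = True  # destructive work defaults to a dry run


# Hypothetical allowlist: the only targets each action may touch.
ALLOWED_TARGETS = {
    Action.CLEAR_CACHE: {"/work/repo/.cache", "/work/repo/node_modules/.cache"},
}


def parse_intent(payload: dict) -> Intent:
    """Turn an agent's payload into an Intent, rejecting anything off-list."""
    action = Action(payload["action"])  # raises ValueError for unknown actions
    target = payload["target"]
    if target not in ALLOWED_TARGETS.get(action, set()):
        raise PermissionError(f"{action.value} may not touch {target!r}")
    return Intent(action, target, bool(payload.get("dry_run", True)))
```

Under this scheme a request to "clean the cache" can only ever resolve to one of the listed cache directories; a drive root never makes it past parsing.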

Ready to act?

If you’d like help setting this up, our team at HireNinja can launch a controlled pilot, wire up identity + policy + evals, and get your first two skills into production safely.

Try HireNinja or book a 30‑minute consult to see how our pre‑built Ninjas map to your backlog.
