Spec-driven Testing for AI Agents: Now in Early Access

Safe Intelligence is launching a new AI Agent validation and monitoring platform named Spec27. Sign-up for early access is free.

I try not to use this newsletter for much self-promotion, but I hope you'll forgive a short excursion in that direction this week... I'm super proud of the new product we've been incubating at Safe Intelligence, and now that it has launched in early access, I'm excited to share more.

In a nutshell: Spec27 helps teams reliably test AI Agents in an automated, infrastructure-agnostic way. Think unit, integration, and security testing for AI Agents all rolled into one.

Wait, what? Explain more. OK, let's rewind...

As you'll know if you follow these posts (or have opened any type of media in the last six months), the industry is increasingly turning to AI functionality delivered in the form of agents. These range from public-facing support bots to internal agents for analysis, all the way to full-on, life-organizing, OpenClaw-type personal agents.

As someone who used to have a lead role in a multi-agent systems lab, I find this amazing, bemusing, and scary in equal measure! Whatever form agents take, the idea is that they carry out tasks on behalf of users with only partial, or sometimes no, supervision.

This is an incredible boon... if you can be sure they'll behave as expected when running in the real world!

To help solve this problem, the team at Safe Intelligence began applying many of the principles we've developed for AI model validation to language-model-driven agents of different types. The result is an approach to AI Agent testing that takes a different tack from existing ones. Specifically:

  • Automation: we're heavily focused on automatically generating and running test cases so that users don't have to write them by hand. More specifically, we take small sets of examples, goals, and rules, run them automatically, and also generate new robustness tests in principled ways. We also use this input data to generate red-team security tests.
  • Spec-driven repeatability: from the beginning, a key question we asked ourselves was "how do you specify how an agent is supposed to behave?" It turned out there were no great answers to that question, which led us to develop the notion of agent validation specifications. Each spec defines a capability via an input data set, context, and what an agent fulfilling the spec needs to be robust to. By building up specs over time, you gain the building blocks for long-term, repeatable benchmarking (see the illustrative sketch after this list).
  • Infrastructure independence: there are now hundreds of frameworks for building agents, ranging from the extremely hands-on (LangChain, Google Vertex, OpenAI Frontier, and many others) to the very turnkey, such as packaged customer-support agents. Our aim from the beginning was to test them all, so we took an approach that doesn't require deploying SDKs or AI gateways: Spec27 tests from the outside, as a user would, and exercises the whole stack, including guardrails and filters.
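
To make the spec idea concrete, here's a minimal, purely illustrative sketch of what spec-driven, black-box agent testing can look like. Everything in it (the `AgentSpec` structure, `run_spec`, the `typo_noise` perturbation) is a hypothetical name invented for this sketch, not Spec27's actual format or API:

```python
# Purely illustrative: AgentSpec, run_spec, and typo_noise are hypothetical
# names for this sketch, not Spec27's real format or API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentSpec:
    """One capability: seed inputs, a behavioral rule, robustness transforms."""
    name: str
    examples: list[str]                 # seed user inputs
    must_contain: list[str]             # toy stand-in for a behavioral rule
    perturbations: list[Callable[[str], str]] = field(default_factory=list)

def typo_noise(text: str) -> str:
    """Trivial robustness perturbation: drop the last character."""
    return text[:-1] if len(text) > 1 else text

def run_spec(spec: AgentSpec, agent: Callable[[str], str]) -> dict[str, bool]:
    """Run seed examples plus generated robustness variants against an agent.

    The agent is any callable str -> str, reached only through its
    user-facing interface, which keeps the test infrastructure-agnostic.
    """
    cases = list(spec.examples)
    for perturb in spec.perturbations:
        cases.extend(perturb(e) for e in spec.examples)
    return {
        case: all(term.lower() in agent(case).lower()
                  for term in spec.must_contain)
        for case in cases
    }

if __name__ == "__main__":
    spec = AgentSpec(
        name="refund-policy",
        examples=["How do I get a refund?", "Can I return my order?"],
        must_contain=["refund"],
        perturbations=[typo_noise],
    )
    # Stand-in agent; in practice this would be, say, an HTTP call to the
    # deployed bot, so the whole stack (guardrails included) gets exercised.
    fake_agent = lambda msg: "You can request a refund within 30 days."
    print(run_spec(spec, fake_agent))
```

Because the spec is just data plus rules, the same spec can be re-run against any agent, which is what makes long-term benchmarking repeatable.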

So if you're building or deploying any kind of agent, we'd love to talk. The system is in early access and free to use. There are early-access sign-up buttons all over the Spec27 site, so feel free to give it a spin.

There's also a handy walkthrough video here to help you understand the system.

Hats off to the fantastic team at Safe Intelligence working on this. There are a lot more features to come, and with your feedback, we'll make it better even faster.

Normal service will resume over the weekend with the next weekly links email!