ASSERT wants to transform intention into test and attack a real weak point of agents
The agent market has already learned how to make convincing demonstrations. What you haven't yet learned, in a mature way, is to prove that these systems behave as they should outside of the ideal scenario. It was at this point that Microsoft decided to make changes by launching ASSERT on June 3, 2026. The open source project promises to convert natural language specifications into executable evaluation suites for models, applications and agents. ## What happened The acronym ASSERT stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing. In practice, Microsoft is proposing a four-step pipeline: transforming a broad intent into a more explicit specification, converting this into a taxonomy of allowed and prohibited behaviors, generating stratified test cases, and finally executing and scoring each trace of the evaluated system. The tool records not only the final response, but also tool calls, retrieved context, and intermediate actions. The announcement is important because it targets a problem that real teams know well. Behavior requirements are often spread across PRDs, internal policies, system prompts, review notes and checklists. The difficult leap is to transform this material into a living, auditable, and updatable assessment. Without this, teams end up resorting to generic metrics like relevance, groundedness or toxicity, which help, but almost never capture product-specific thresholds. ## The technique behind The insight of ASSERT is to treat the behavior specification as first-class input, not background. Instead of generating a few prompts and measuring superficial accuracy, the system tries to break down what good behavior means in that context. If the agent is a travel agent, for example, he should not invent tickets, violate the budget, obey injections coming from a tool or create stereotypes about the user. They are different rules, with different tracks, which require different cases. Technically, this improves coverage. Microsoft itself reports internal studies where ASSERT covered more behavior space, generated more useful cases for inspection, and better separated strong and weak systems than more directly generated baselines. The most interesting gain, however, is not just statistical. It is epistemological. By tying each failure to a rule, a rationale and a section of the trail, the team is able to discuss product policy without falling into vague abstractions. This is especially relevant for agents because the error is often not in the final sentence. It's in the way: the wrong tool called too soon, inappropriate context retrieval, accepting a malicious instruction, or an out-of-budget decision. A system that only reads the final output may approve behavior that was already committed several steps earlier. ## Why this matters The moment agents start to perform tasks with tools, memory and multiple shifts, evaluation ceases to be an academic luxury. It becomes an operational prerequisite. A support agent may grant out-of-policy refunds. A research agent may cite restricted material. A corporate agent can follow hidden instructions in a recovered document. In all these cases, evaluating "response quality" is insufficient. ASSERT attempts to bridge exactly this gap between written intent and ongoing verification. For product teams, this can mean faster regression and review cycles. For governance, it can mean more concrete evidence of what was actually tested. For engineering, it opens up the possibility of integrating evaluation into the release process with inspectable artifacts, not just an aggregated number. ## The future it anticipates If the approach catches on, the agent market should become less obsessed with universal benchmarks and more focused on contextual behavior. This is healthy. A financial agent, an educational agent and a clinical agent do not fail in the same way, nor should they be judged by the same criteria. The plausible future is a layer of evals much more akin to software testing and policy review than single public ranking. There is also an important cultural effect. Teams will be pressured to better explain their own intentions. "Be safe" or "don't be biased" are not enough when it is necessary to generate executable cases, traits and criteria. ASSERT forces a type of maturity: writing behavior policy in an operational way, not just an aspirational one. ## What to watch out for Still, the tool doesn't solve everything. Microsoft itself recognizes limitations. Vague specifications generate vague scenarios. LLM-based judges can vary in severity. Synthetic cases do not replace human production, telemetry and review. This is important because there is a risk of false confidence: thinking that a well-organized automated suite equals complete compliance. The point, then, is not to treat ASSERT as a magic seal. It’s about seeing it as a learning infrastructure for teams that want to iterate with less blindness. If it works as promised, it could mark a useful turning point in the ecosystem: moving away from the generic discourse about "responsible AI" and into a regime in which policies, traces and regressions finally start to talk.
Sources
- https://commandline.microsoft.com/assert-written-intent-executable-evals/
- https://github.com/responsibleai/ASSERT
