The growing sophistication of artificial intelligence (AI) systems challenges existing frameworks for defining and managing model risk. To maintain reliability, safety, and transparency, it’s essential to reevaluate how we define a “model” and how we validate these increasingly complex systems prior to deployment. The Fed’s SR 11-7 defines a model as:
“...a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates.”
More simply, a model is often described as a system that transforms data inputs into outputs.
But does this definition hold when applied to advanced, distributed architectures such as AI multi-agent systems (MAS)? MAS are composed of multiple autonomous agents that interact with one another and with their environment to achieve individual or collective goals. In the context of AI and machine learning, each agent may be a self-contained algorithm capable of making decisions, sharing information, or adapting independently.
This raises a critical question: what exactly constitutes a "model"? Is it each individual agent, or is it the entire system, including their interactions?
The answer is likely both, highlighting the need for validation approaches that address risk at both the agent level and the system level.
To understand the validation process, we first provide a brief overview of AI agent architecture, which typically consists of four modules (Lei Wang et al. 2025): a profiling module that defines the agent’s role and persona; a memory module that stores and retrieves past interactions; a planning module that decomposes goals into actionable steps; and an action module that executes those steps, often by calling tools or external APIs.
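To make the decomposition concrete, the minimal Python sketch below shows one way these four modules might be composed inside a single agent. The class and method names are illustrative assumptions only and do not correspond to any particular framework.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the four-module decomposition described above.
# All class and method names are hypothetical.

@dataclass
class ProfileModule:
    role: str                                  # who the agent is, e.g. "policy Q&A assistant"

@dataclass
class MemoryModule:
    history: list = field(default_factory=list)

    def remember(self, observation: str) -> None:
        self.history.append(observation)

class PlanningModule:
    def plan(self, goal: str, memory: MemoryModule) -> list:
        # Placeholder: a real planner would decompose the goal into steps,
        # e.g. via an LLM call or a search procedure, possibly using memory.
        return [f"step toward: {goal}"]

class ActionModule:
    def act(self, step: str) -> str:
        # Placeholder: a real action module would invoke tools or APIs.
        return f"executed {step}"

@dataclass
class Agent:
    profile: ProfileModule
    memory: MemoryModule
    planner: PlanningModule
    actor: ActionModule

    def run(self, goal: str) -> list:
        outcomes = []
        for step in self.planner.plan(goal, self.memory):
            outcome = self.actor.act(step)
            self.memory.remember(outcome)      # feed results back into memory
            outcomes.append(outcome)
        return outcomes

# Example: a single agent pursuing one goal.
agent = Agent(ProfileModule(role="policy Q&A assistant"),
              MemoryModule(), PlanningModule(), ActionModule())
print(agent.run("answer a customer policy question"))
```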
Understanding the agent’s architecture, its capabilities (e.g., planning and multi-step reasoning, function calling and tool use, memory, or self-reflection), and its intended use cases directly informs which validation tests and procedures are appropriate. Developers should also establish an evaluation framework (e.g., LangSmith, Arize AI, Google Vertex AI, Mosaic AI, among others) to support both pre-deployment validation and post-deployment performance monitoring.
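Because the APIs of those platforms differ, the sketch below stays vendor-neutral and simply illustrates the general shape of a pre-deployment evaluation gate: a fixed set of test cases, a scoring function, and an acceptance threshold. Every name (call_agent, score_response, PASS_THRESHOLD) and the 0.90 threshold are assumptions made for this illustration.

```python
# Minimal, vendor-neutral sketch of a pre-deployment evaluation gate.
# In practice this logic would typically live inside a platform such as
# LangSmith, Arize AI, Vertex AI, or Mosaic AI; every name below is hypothetical.

PASS_THRESHOLD = 0.90  # assumed acceptance criterion for this illustration

eval_cases = [
    {"input": "Summarize the attached policy.", "expected_keyword": "policy"},
    {"input": "List the three approval steps.", "expected_keyword": "approval"},
]

def call_agent(prompt: str) -> str:
    """Stand-in for the agent under test; replace with the real agent call."""
    return f"(stubbed agent output for: {prompt})"

def score_response(response: str, expected_keyword: str) -> float:
    """Toy scorer: 1.0 if the expected keyword appears, else 0.0.
    Real evaluations would use task-specific or model-graded metrics."""
    return 1.0 if expected_keyword.lower() in response.lower() else 0.0

def run_evaluation() -> bool:
    scores = [
        score_response(call_agent(case["input"]), case["expected_keyword"])
        for case in eval_cases
    ]
    mean_score = sum(scores) / len(scores)
    print(f"mean score = {mean_score:.2f}")
    return mean_score >= PASS_THRESHOLD

if __name__ == "__main__":
    print("pre-deployment gate:", "PASS" if run_evaluation() else "FAIL")
```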
As we project the future of AI MAS, it’s easy to imagine a level of complexity that exceeds what a single individual can feasibly validate. This challenge parallels testing and validation practices in other complex industries. For example, consider vehicle or aviation safety inspections: no single person is responsible for the end-to-end evaluation of every cog, screw, and electronic sensor in isolation. Instead, a layered, modular testing and validation framework is applied – consistent with principles from system-theoretic approaches like STAMP (Systems-Theoretic Accident Model and Processes) – acknowledging both the independence and interdependence of components, as well as the critical insight that the integrated system as a whole is greater than the sum of its individual parts.
For MAS, a similar philosophy may need to be applied.
Individual agents in MAS may require pre-deployment testing and validation. For context, risk model validations in financial institutions typically take 1–3 months, while safety-critical systems like automobiles or aircraft may undergo years of rigorous testing before deployment. While MAS may not require the same timelines, these examples illustrate the principle of layered validation – confirming that both individual components and the integrated system are tested appropriately. Once agents are validated individually, the assembled system should undergo additional testing to evaluate end-to-end interactions, emergent risks, and overall system reliability for its intended purpose.
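As a rough sketch of what layered validation might look like in practice, the pytest-style tests below check two hypothetical agents individually and then exercise the assembled system end to end. The agents and the Orchestrator class are stand-ins, not a prescribed design.

```python
# Sketch of a layered test suite (pytest-style): agent-level checks first,
# then an end-to-end check on the assembled system. All names are hypothetical.

class RetrievalAgent:
    def answer(self, query: str) -> str:
        return f"retrieved context for: {query}"

class SummaryAgent:
    def summarize(self, text: str) -> str:
        return text[:50]

class Orchestrator:
    """Minimal stand-in for the assembled multi-agent system."""
    def __init__(self):
        self.retriever = RetrievalAgent()
        self.summarizer = SummaryAgent()

    def handle(self, query: str) -> str:
        return self.summarizer.summarize(self.retriever.answer(query))

# --- Agent-level (component) validation ---
def test_retrieval_agent_returns_relevant_context():
    assert "capital ratio" in RetrievalAgent().answer("capital ratio")

def test_summary_agent_respects_length_limit():
    assert len(SummaryAgent().summarize("x" * 500)) <= 50

# --- System-level (end-to-end) validation ---
def test_assembled_system_end_to_end():
    output = Orchestrator().handle("capital ratio")
    assert output and len(output) <= 50
```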
An essential characteristic of AI systems, particularly MAS, is their inherent probabilistic risk of failure. Unlike static tools, agent performance may degrade over time due to factors such as data or concept drift, emerging risks, changing human behaviors, exogenous shocks, or unforeseen interaction effects. A model that performs well today may not do so tomorrow.
For this reason, ongoing monitoring and performance assurance of AI systems are crucial, akin to inspections in safety-critical industries (e.g., routine vehicle inspections). Known limitations identified in the development process should be reassessed periodically, since their significance may change as conditions evolve post-deployment. Similar to how annual safety checks are mandated for vehicles, MAS should be monitored periodically over time at intervals appropriate to their complexity, risk profile / materiality, and intended application. Individual agents might require periodic retraining, adjustments, or decommissioning – similar to how critical components in engineered systems must be regularly inspected and replaced.
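One common ingredient of such monitoring is a distributional drift check on a key output metric. The sketch below computes a population stability index (PSI) between a baseline sample and a recent production sample; the 0.10 / 0.25 cutoffs are conventional rules of thumb rather than requirements, and the data shown is synthetic.

```python
import numpy as np

def population_stability_index(baseline: np.ndarray,
                               recent: np.ndarray,
                               n_bins: int = 10) -> float:
    """PSI between a baseline sample and a recent sample of the same metric.
    Bins are taken from baseline quantiles; eps avoids log(0)."""
    eps = 1e-6
    interior_edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))[1:-1]
    base_frac = np.bincount(np.digitize(baseline, interior_edges),
                            minlength=n_bins) / len(baseline) + eps
    recent_frac = np.bincount(np.digitize(recent, interior_edges),
                              minlength=n_bins) / len(recent) + eps
    return float(np.sum((recent_frac - base_frac) * np.log(recent_frac / base_frac)))

# Example check on synthetic data; thresholds are conventional rules of thumb.
rng = np.random.default_rng(0)
baseline_scores = rng.normal(0.0, 1.0, 5_000)   # metric at validation time
recent_scores = rng.normal(0.3, 1.0, 5_000)     # metric observed in production

psi = population_stability_index(baseline_scores, recent_scores)
if psi < 0.10:
    print(f"PSI={psi:.3f}: no material drift")
elif psi < 0.25:
    print(f"PSI={psi:.3f}: moderate drift - investigate")
else:
    print(f"PSI={psi:.3f}: significant drift - consider retraining or escalation")
```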
Evaluating MAS components individually not only encourages robust validation but also improves transparency, interpretability, and explainability. By assessing agents separately, stakeholders can better understand each agent’s purpose, behavior, and potential risks. This is often a primary focus of oversight for regulators and assurance functions.
From a model risk management perspective, each validated agent (i.e., model) may require its own model ID and version in the registry, clearly indicating its intended purpose, performance expectations, thresholds, monitoring plan, and validation history. Since agents may be reused across multiple MAS implementations, a modular approach enables efficiencies in testing – provided each reuse undergoes a context-of-use and incremental-risk assessment to confirm assumptions still hold. The assembled MAS should also have a distinct model ID and version that captures the integrated system’s configuration, dependencies, and interaction patterns. This hierarchical structure not only acknowledges the structural differences between agents and systems, but also their distinct operational roles, dependencies, and failure modes.
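As an illustration only, the sketch below shows one way this hierarchy of registry entries could be represented. The field names and the ModelRegistryEntry / MASRegistryEntry types are hypothetical rather than a reference to any specific model risk management system.

```python
from dataclasses import dataclass, field

@dataclass
class ModelRegistryEntry:
    """Illustrative registry record for a single validated agent."""
    model_id: str
    version: str
    intended_purpose: str
    performance_thresholds: dict            # e.g., {"recall@5": 0.85}
    monitoring_plan: str                    # e.g., "monthly drift check"
    validation_history: list = field(default_factory=list)

@dataclass
class MASRegistryEntry:
    """Illustrative registry record for the assembled multi-agent system."""
    model_id: str
    version: str
    component_agents: list                  # model_id@version of each agent
    interaction_pattern: str                # e.g., "router -> retriever -> synthesizer"
    dependencies: list = field(default_factory=list)

# Example: an agent reused inside a MAS, each with its own ID and version.
retrieval_agent = ModelRegistryEntry(
    model_id="AGT-0421",
    version="2.1.0",
    intended_purpose="Document retrieval for policy Q&A",
    performance_thresholds={"recall@5": 0.85},
    monitoring_plan="Monthly drift check; annual revalidation",
)

policy_qa_system = MASRegistryEntry(
    model_id="MAS-0007",
    version="1.0.0",
    component_agents=[f"{retrieval_agent.model_id}@{retrieval_agent.version}"],
    interaction_pattern="router -> retrieval agent -> answer synthesis",
)
```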
Crucially, pre-deployment testing and post-deployment monitoring serve complementary objectives: the former confirms that individual agents and the assembled system are fit for their intended purpose before release, while the latter confirms they remain so as conditions evolve.
As AI and machine learning systems grow in complexity, traditional definitions and validation approaches should evolve. Effective governance of MAS, including their multiple interacting agents, feedback loops, and probabilistic components, becomes essential across sectors.
Adapting requires shifting from single-model validation to modular, system-level validation strategies, integrating continuous monitoring, layered accountability, and robust governance frameworks – much like established practices in aviation, automotive safety, and other critical infrastructures.
Ultimately, our objective should be to design, validate, and deploy transparent, interpretable, and explainable AI systems that are safe and trusted.