
The EU AI Act's Evaluation Requirements: A Practitioner's Guide

The EU AI Act entered into force in August 2024, and its obligations are phasing in on a fixed timetable, with most requirements for high-risk AI systems applying from August 2026. Here is what the Act actually requires from your evaluation programme, and the gaps most teams have.

Tom Bradley

VP Engineering · 18 March 2025

The EU AI Act entered into force on 1 August 2024. For most of 2023, enterprise AI teams treated it as a distant regulatory concern. That window has closed. If you are operating AI systems in the European Union, particularly in categories such as employment decisions, access to essential services, and critical infrastructure management, the obligations are phasing in on a fixed timetable: the prohibitions have applied since 2 February 2025, and most requirements for high-risk systems apply from 2 August 2026.

Where AI Evaluation Fits in the Act

The EU AI Act takes a risk-based approach. General-purpose AI, including most LLM deployments in enterprise applications, falls under different requirements depending on use, not only on technical architecture. The critical category for most enterprise teams is "high-risk AI systems", classified under Article 6 and listed in Annex III, which includes systems that affect access to essential services, employment decisions, and biometric identification.^[1]

Article 9 of the Act requires that a risk management system be established, implemented, documented and maintained for every high-risk AI system, and describes it as "a continuous iterative process planned and run throughout the entire lifecycle" of the system. This is not a one-time pre-deployment audit. It requires regular review and updating, which in practice means ongoing, documented evaluation.

What "Adequate Testing" Means Under Article 9(6)

Article 9(6) specifies that high-risk AI systems shall be tested "for the purpose of identifying the most appropriate and targeted risk management measures" and that testing "shall ensure that high-risk AI systems perform consistently for their intended purpose." (In the Commission's original proposal this provision was numbered Article 9(5).) The Act deliberately avoids prescribing specific methodologies: it is outcomes-based, not process-based.

This creates both flexibility and obligation. You are not required to use any specific evaluation framework. You are required to demonstrate, with documented evidence, that your testing programme covers the system's intended use cases and that it identifies and addresses risks before deployment. In practice, assessors look for:

Documented test procedures: Evaluation scenarios described in writing with sufficient detail for an independent assessor to reproduce them.
Quantified metrics: Test results expressed in measurable terms — not qualitative assessments that cannot be compared across time.
Regression baselines: Demonstration that model changes do not degrade performance on previously validated capabilities.
Traceability: Each finding traceable through the risk management system to a mitigation action and its verification.
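To make the regression-baseline item concrete, here is a minimal Python sketch of a baseline check. Everything in it is an assumption for illustration: the `MetricResult` schema, the scenario identifiers, the metric values, and the tolerance of 0.01 are hypothetical, and the Act itself prescribes none of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    """One quantified metric from an evaluation run (hypothetical schema)."""
    scenario_id: str   # identifier of a documented test scenario
    metric: str        # e.g. "accuracy"; higher is assumed to be better
    value: float


def find_regressions(baseline, current, tolerance=0.01):
    """Return (scenario_id, metric, baseline_value, current_value) tuples
    where the current value fell below the baseline by more than `tolerance`,
    or where a previously validated scenario is missing from the current run.
    """
    current_by_key = {(r.scenario_id, r.metric): r.value for r in current}
    regressions = []
    for b in baseline:
        now = current_by_key.get((b.scenario_id, b.metric))  # None: not re-run
        if now is None or now < b.value - tolerance:
            regressions.append((b.scenario_id, b.metric, b.value, now))
    return regressions


# Illustrative values only, not real results.
baseline = [
    MetricResult("loan-denial-explanation-001", "accuracy", 0.94),
    MetricResult("loan-denial-explanation-002", "accuracy", 0.91),
]
current = [
    MetricResult("loan-denial-explanation-001", "accuracy", 0.95),
    MetricResult("loan-denial-explanation-002", "accuracy", 0.86),
]

for scenario_id, metric, before, after in find_regressions(baseline, current):
    print(f"REGRESSION {scenario_id} {metric}: {before} -> {after}")
```

Treating a scenario that is missing from the current run as a regression is a deliberate choice in this sketch: silently dropping a previously validated scenario would otherwise look like a pass.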

The ISO/IEC 42001 Connection

ISO/IEC 42001:2023, the international standard for AI Management Systems, was published in December 2023 and provides the most comprehensive operationalisation guidance for the risk management requirements the EU AI Act describes in principle.^[2] While compliance with ISO/IEC 42001 does not automatically satisfy EU AI Act requirements, organisations that have built their AI governance against the standard have substantially shorter paths to regulatory compliance.

The standard's Clause 9 (Performance evaluation) corresponds closely to the ongoing-evaluation side of Article 9 of the Act. Key requirements include: defined evaluation criteria aligned to AI objectives, documented monitoring procedures, and management review processes that close the loop from evaluation findings to system changes.
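One way to make "closing the loop" checkable is to record each finding alongside its mitigation and the evaluation run that verified it, then list the findings where either is missing. The sketch below assumes a hypothetical `Finding` record; the field names, identifiers, and example findings are illustrative and come from neither the standard nor the Act.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    """An evaluation finding tracked through the risk management system.
    All field names are hypothetical."""
    finding_id: str
    description: str
    mitigation: Optional[str] = None        # action taken in response
    verification_run: Optional[str] = None  # ID of the run confirming the fix


def open_findings(findings):
    """Return findings that lack a mitigation or a verification run."""
    return [f for f in findings if not (f.mitigation and f.verification_run)]


# Illustrative entries only.
findings = [
    Finding("F-001", "Refusal rate rises on non-English inputs",
            mitigation="Added multilingual scenarios to the test suite",
            verification_run="run-0042"),
    Finding("F-002", "Automated judge disagrees with reviewers on edge cases",
            mitigation="Revised judge rubric"),  # mitigated, not yet verified
]

for f in open_findings(findings):
    print(f"{f.finding_id}: loop not closed")
```

A list like this can feed a management review directly: anything it returns is a finding without documented evidence that the response worked.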

Common Compliance Gaps

Based on ENISA's guidance for AI Act compliance and industry experience with ISO 42001 implementations, the most common gaps in enterprise evaluation programmes are:^[3]

Evaluation infrastructure that produces scores but not auditable reasoning traces: assessors cannot evaluate the quality of a scoring process that leaves no evidence of how it reached its conclusions.
Scenario coverage documentation that describes categories of scenarios rather than specific scenario instances, which is too vague to constitute a documented test procedure.
No mechanism for demonstrating that evaluation programme quality has been maintained over time as the system evolved.
Absence of documented human expert baselines against which automated judge calibration can be verified.
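For the last gap, one common way to verify judge calibration is to measure chance-corrected agreement between the automated judge and human experts on the same items. The sketch below uses Cohen's kappa as an assumed choice of statistic; neither the Act nor the guidance cited here names a specific one, and the labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(human_labels)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    # Expected agreement if each rater labelled independently at their own rates.
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human_counts) | set(judge_counts)
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: human expert baseline vs. automated judge.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
print(round(cohens_kappa(human, judge), 3))  # prints 0.75
```

Storing the human labels, the judge labels, and the resulting statistic for each calibration round gives an assessor a dated record that the judge was checked against expert judgement.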

The EU AI Act does not require perfection. It requires documentation that you are systematically working toward it. The gap between what most teams have today and what the Act requires is not technical — it is procedural.

The UK Government's Responsible AI Framework, published by the Department for Science, Innovation & Technology, echoes this principles-based approach — a signal that even outside the EU, the direction of travel in AI regulation favours demonstrated, documented evaluation practice over point-in-time certification.^[4]

References
