Evaluate AI Agents. In the open.

ARIA Evaluator is an open-source platform that puts every AI conversation in front of a panel of independent AI judges — scoring quality, safety, and compliance across 15 dimensions. Self-host it anywhere; the whole project is Apache-2.0-licensed on GitHub.

View on GitHub Read the docs

Apache-2.0 licensed · Self-host anywhere · Panel of AI judges · 15 evaluation dimensions

Observability cockpit

Workspace health: Healthy
Evaluation queue: 86% processed
Latency trend: -14%
Security findings: 2 blocked
Also shown: active runs, regions, models

Built around the standards that matter

OWASP LLM Top 10 · NIST AI RMF · MITRE ATLAS · EU AI Act · FCA Consumer Duty · ISO 27001 · SOC 2 · HIPAA · GDPR · PCI DSS

A panel, not a single opinion

Every conversation is scored by several independent AI judges, so one model’s blind spot can’t skew the result.

Scores you can trust

Judges are continually checked against human experts, so the numbers hold up — even to an auditor.

Plug into any agent

Amazon Connect, Lex, Azure, Copilot, or any custom chat or voice endpoint — no code changes.

See quality in real time

Watch every run live, read full conversations, and get alerted the moment quality starts to slip.

Independent judges per score

Quality & safety dimensions

Agent platforms supported

Global regions

The basics

What is AI agent evaluation?

A plain-English primer for teams putting conversational AI into production.

AI agent evaluation is the practice of measuring how well a conversational AI agent performs before and after it ships — scoring real transcripts for response quality, task completion, safety, and compliance, and stress-testing the agent against adversarial attacks. ARIA does this with a panel of independent AI judges, escalating any disagreement to a human, so every release ships with a verifiable quality and safety score.

What does it measure? 15 dimensions across response quality, task completion, safety and security, customer experience, and escalation and vulnerability.
How is a judge panel different from a single judge? One model’s blind spot can skew a score. A panel of independent judges cross-checks every result, and anything they disagree on goes to a person for the final call.
Who is it for? Security, product, and platform teams launching agents on Amazon Connect, Lex, Azure, Copilot, or any custom endpoint, especially in regulated, safety-critical settings.

Read the full guide

How ARIA differs

Single-judge tools vs. a judge panel

	Single-judge tool	ARIA
Scoring	One LLM as the sole arbiter	Panel of independent judges
On disagreement	No signal — you trust the number	Escalated to a human reviewer
Adversarial testing	Usually a separate add-on	Built in: injection, jailbreak, social engineering
Compliance	Generic quality metrics	FCA Consumer Duty, OWASP LLM Top 10, NIST AI RMF
Evidence	A score	Reasoning + audit log for every result

Platform showcase

See ARIA in action

Get a feel for the workspace — coverage at a glance, how confident the judges are, and full control over where your data lives, all running on infrastructure you own.

Workspace preview: release-readiness.aria

Executive summary
Release readiness snapshot: Ship candidate
Metrics shown: adversarial coverage, judge agreement, policy violations blocked
Scenario pack completeness: 43 / 45 critical tests

Judge comparison
Consensus by scenario type: Functional 94% · Adversarial 88% · Escalation 91%

Region controls
Tenant isolation:
UK London (eu-west-2): Primary
US East (us-east-1): Active
Frankfurt (eu-central-1): Ready

Product walkthrough

How scoring works

Every conversation, scored across 15 dimensions

A panel of independent AI judges reviews each conversation across 15 dimensions — quality, safety, compliance, and escalation. When they agree, you get a confident score; when they don't, it goes to a person to decide. Every result comes with the reasoning behind it.
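
The agree-or-escalate rule can be sketched in a few lines of TypeScript. This is a minimal illustration, not ARIA's actual implementation: the `JudgeScore` shape, the 0-to-1 score scale, the `maxSpread` threshold, and the `aggregatePanel` function are all assumptions made for the example.

```typescript
// Hypothetical types and names: none of these are taken from the ARIA codebase.
interface JudgeScore {
  judge: string;     // identifier of the judge model
  score: number;     // assumed scale: 0 (worst) to 1 (best)
  reasoning: string; // the judge's explanation for its score
}

type PanelResult =
  | { status: "consensus"; score: number; reasoning: string[] }
  | { status: "needs_human_review"; spread: number; reasoning: string[] };

// Treat the panel as agreeing when the gap between the highest and lowest
// score is at most maxSpread (an assumed threshold); otherwise escalate.
function aggregatePanel(scores: JudgeScore[], maxSpread = 0.2): PanelResult {
  if (scores.length < 2) {
    throw new Error("A panel needs at least two judges");
  }
  const values = scores.map((s) => s.score);
  const spread = Math.max(...values) - Math.min(...values);
  const reasoning = scores.map((s) => `${s.judge}: ${s.reasoning}`);
  if (spread > maxSpread) {
    return { status: "needs_human_review", spread, reasoning };
  }
  const mean = values.reduce((sum: number, v) => sum + v, 0) / values.length;
  return { status: "consensus", score: mean, reasoning };
}
```

A spread check is only one possible definition of agreement; a real panel might compare pass/fail verdicts or use a statistical agreement measure instead. Either way, the reasoning from every judge travels with the result, whether it is a consensus score or a hand-off to a reviewer.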

15 dimensions · 5 categories

5 dimensions

Response Quality

Correctness
Faithfulness
Helpfulness
Relevance
Conciseness

2 dimensions

Task Completion

Goal Success
Task Completion Rate

3 dimensions

Safety & Security

Guardrail Compliance
Prompt Injection Resistance
Bias & Fairness

2 dimensions

Customer Experience

Tone & Empathy
Clarity

3 dimensions

Escalation & Vulnerability

Escalation Appropriateness
Handover Quality
Vulnerability Detection
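
For reference, the taxonomy above can be written down as a typed constant. The category and dimension names are copied from the list; the `DIMENSIONS` constant, the derived types, and the `countDimensions` helper are illustrative names, not identifiers from the ARIA codebase.

```typescript
// Category and dimension names come from the list above.
// The constant and helper names are hypothetical.
const DIMENSIONS = {
  "Response Quality": ["Correctness", "Faithfulness", "Helpfulness", "Relevance", "Conciseness"],
  "Task Completion": ["Goal Success", "Task Completion Rate"],
  "Safety & Security": ["Guardrail Compliance", "Prompt Injection Resistance", "Bias & Fairness"],
  "Customer Experience": ["Tone & Empathy", "Clarity"],
  "Escalation & Vulnerability": ["Escalation Appropriateness", "Handover Quality", "Vulnerability Detection"],
} as const;

// Derived types let the compiler reject misspelled category or dimension names.
type Category = keyof typeof DIMENSIONS;
type Dimension = (typeof DIMENSIONS)[Category][number];

// Total number of dimensions across all categories.
function countDimensions(): number {
  return Object.values(DIMENSIONS).reduce(
    (total: number, dims) => total + dims.length,
    0,
  );
}
```

With the five categories as listed, `Object.keys(DIMENSIONS).length` is 5 and `countDimensions()` returns 15, matching the "15 dimensions · 5 categories" summary.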

Why teams choose ARIA

Enterprise-grade AI evaluation

Everything you need to launch, observe, and govern AI evaluation workflows in one workspace designed for enterprise delivery.

Adversarial security testing

Probe agents with prompt-injection, jailbreak, and social-engineering scenarios — and verify guardrails hold under multi-turn pressure.

15-dimension LLM judge

Every transcript is scored across 15 dimensions — from correctness and goal success to bias, escalation quality, and injection resistance.

Consensus you can trust

Instead of relying on a single opinion, every transcript is reviewed by a panel of independent AI judges. When they disagree, it is sent straight to a human to decide — so you get scores you can stand behind.

Connects to your agent stack

Evaluate Amazon Connect (voice and chat), Amazon Lex, Azure Bot Service, Microsoft Copilot, and any OpenAPI, HTTP, or WebSocket endpoint.

Real-time observability

Watch runs stream live, inspect full transcripts turn by turn, and track scores, latency, and cost for every judge invocation.

Catch quality drift early

Set a baseline for how your agent should perform, and ARIA flags the moment its quality starts to slip — so regressions show up in testing, not in front of your customers.
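
One simple way to think about baseline comparison is a per-dimension drop check. The sketch below is an assumption-laden illustration, not ARIA's drift-detection logic: the `detectDrift` function, the 0-to-1 score scale, and the `tolerance` default are all hypothetical.

```typescript
// Hypothetical sketch: names, score scale, and tolerance are assumptions.
type DimensionScores = Record<string, number>;

interface DriftFinding {
  dimension: string;
  baseline: number;
  current: number;
  drop: number;
}

// Report every dimension whose current score fell below its baseline
// by more than `tolerance`. Dimensions missing from `current` are skipped.
function detectDrift(
  baseline: DimensionScores,
  current: DimensionScores,
  tolerance = 0.05,
): DriftFinding[] {
  const findings: DriftFinding[] = [];
  for (const [dimension, base] of Object.entries(baseline)) {
    const now = current[dimension];
    if (now === undefined) continue;
    const drop = base - now;
    if (drop > tolerance) {
      findings.push({ dimension, baseline: base, current: now, drop });
    }
  }
  return findings;
}
```

For example, with a baseline of `{ Correctness: 0.9, Clarity: 0.8 }` and current scores of `{ Correctness: 0.7, Clarity: 0.79 }`, only Correctness is flagged, because the Clarity drop stays within the tolerance.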

Team-ready governance

A human review queue, scheduled regression runs, and audit-logged overrides give security and product teams shared sign-off.

Compliance built in

Validate FCA Consumer Duty vulnerability handling, bias and fairness, and escalation policy adherence with regulator-ready reports.

Secure by design

Your conversations and results are encrypted every step of the way and kept in isolated, private infrastructure — enterprise-grade protection, with nothing extra for you to set up.

Integrations

Works with the agent platform you already run

Pluggable adapters connect ARIA to your agent under test — no instrumentation or SDK changes required. The OpenAPI and WebSocket adapters cover any custom endpoint.

Amazon Connect

Voice & chat flows

Supported

Amazon Lex

V2 bots

Supported

Azure Bot Service

Direct Line channel

Supported

Microsoft Copilot

Copilot Studio agents

Supported

OpenAPI / REST

Any HTTP endpoint

Supported

WebSocket

Custom chat bots

Supported
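
To make the pluggable-adapter idea concrete, here is a minimal sketch of what such a contract could look like. ARIA's real adapter interface may differ: `AgentAdapter`, `fromFunction`, `runConversation`, and the stub agent are hypothetical names chosen for this example, and the stub stands in for a real HTTP or WebSocket endpoint.

```typescript
// Hypothetical interface: ARIA's real adapter contract may look different.
interface AgentAdapter {
  name: string;
  // Send one user turn to the agent under test and return its reply.
  send(userMessage: string): Promise<string>;
}

interface Turn {
  user: string;
  agent: string;
}

// Wrap any async function as an adapter, for example a call to an HTTP
// or WebSocket endpoint. Here it is used with an in-memory stub.
function fromFunction(
  name: string,
  handler: (userMessage: string) => Promise<string>,
): AgentAdapter {
  return { name, send: handler };
}

// Drive a scripted conversation through whichever adapter is supplied.
async function runConversation(
  adapter: AgentAdapter,
  userMessages: string[],
): Promise<Turn[]> {
  const transcript: Turn[] = [];
  for (const user of userMessages) {
    const agent = await adapter.send(user);
    transcript.push({ user, agent });
  }
  return transcript;
}

// Stub agent standing in for a real endpoint.
const stubAgent = fromFunction("stub", async (msg) => `You said: ${msg}`);
```

Because the driver depends only on the interface, the agent under test needs no instrumentation: swapping platforms means swapping the adapter, while scenarios and scoring stay the same.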

How it works

From git clone to full-scale evaluation in minutes

Clone the repo

git clone the project and install dependencies — everything runs locally or on your own infrastructure.

Configure

Point ARIA at your model provider (Bedrock/Claude) and define scenarios in YAML for the agents you want to test.

Connect your agent

Point an adapter at your agent — Connect, Lex, Azure, Copilot, or any HTTP endpoint. No instrumentation required.

Evaluate

Run your scenarios and watch results stream in — each conversation scored by a panel of judges across 15 dimensions, with full reasoning and a report you can share.

Quick start

Run your first evaluation locally

Clone the repo, point it at your model provider, and score your first conversation. No account, no signup — it's open source.

bash

# Clone the repository
git clone https://github.com/alokkulkarni/aria-evaluator-ts.git
cd aria-evaluator-ts

# Install dependencies
npm install

# Run a scenario against your agent
npm run cli:openapi -- --scenario=examples/account-balance.yaml

Everything runs on infrastructure you control — bring your own Bedrock/Claude credentials, define scenarios in YAML, and get a full report with per-dimension scores and judge reasoning.

Community

Built in the open: join us

ARIA Evaluator is community-driven. Contribute code, report issues, or join the conversation.

Open source

Ready to evaluate your AI agents?

Clone the repo, run your first 15-dimension evaluation in minutes, and help shape the project.


