Videos

Curated video resources on AI safety and evaluation

Hand-picked talks, tutorials, and deep-dives from leading researchers covering LLM red-teaming, evaluation methodology, safety alignment, and production observability.

Intro to Large Language Models

Education · Andrej Karpathy

A crisp 1-hour primer covering how LLMs are trained, what capabilities emerge at scale, and the security considerations that arise when LLMs act as agents, including tool use, prompt injection, and jailbreaks. Essential grounding for any AI evaluation programme.

#llm #fundamentals #training

State of GPT — Microsoft Build 2023

Education · Andrej Karpathy

How GPT assistants are trained end-to-end: pre-training, supervised fine-tuning, reward modelling, and RLHF. Covers the capability and alignment landscape as of 2023, with concrete guidance on prompting strategies and evaluation design.

#gpt-4 #rlhf #alignment

Let's Build GPT: From Scratch, in Code, Spelled Out

Education · Andrej Karpathy

A full implementation of a GPT model from first principles: attention, transformers, and token prediction built step by step. Builds deep technical intuition for the architecture your evaluation pipelines run on top of.

#gpt #transformers #attention

Attention in Transformers, Step-by-Step | Deep Learning Chapter 6

Education · 3Blue1Brown

One of the clearest visual explanations of transformer attention available, covering embedding spaces, key/query/value matrices, multi-head attention, and masking. An essential foundation for understanding how LLMs process context, which helps you design evaluation scenarios that probe for genuine comprehension rather than superficial pattern matching.

#transformers #attention #deep-learning
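
For readers who want to connect the key/query/value and masking vocabulary to something concrete before watching, the following is a minimal sketch of single-head scaled dot-product attention with a causal mask. It is not taken from the video; the function names `softmax` and `causal_attention` are illustrative, and the sketch omits the learned projection matrices and the multi-head split.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of vectors, one per token position.
    Token i may attend only to positions 0..i.
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # Masking: score only the positions at or before i.
        scores = [
            sum(qx * kx for qx, kx in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # The output is the attention-weighted average of the visible values.
        outputs.append([
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(dim)
        ])
    return outputs
```

With this mask the first token can only attend to itself, so its output equals its own value vector. Later tokens blend the values of every position up to and including their own.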

Transformers, the Tech Behind LLMs | Deep Learning Chapter 5

Education · 3Blue1Brown

The prequel to Chapter 6. Grant Sanderson's signature animated mathematics explains the complete transformer architecture from first principles: token embeddings, positional encodings, the full attention block, and feed-forward layers. A strong visual overview of how the architecture behind modern LLMs actually works. Pairs with Chapter 6 for a complete two-part transformer deep dive.

#transformers #llm-architecture #embeddings

Reinforcement Learning from Human Feedback: Progress and Challenges

Education · UC Berkeley EECS

A guest lecture by John Schulman (OpenAI co-founder, principal architect of ChatGPT's RLHF pipeline) delivered to UC Berkeley's EECS department. Covers the complete training loop (supervised fine-tuning, reward model training from preference data, and PPO-based policy optimisation) with frank discussion of open challenges like reward hacking, over-optimisation, and distributional shift. The authoritative source from the person who built it.

#rlhf #reward-model #ppo

But What Is a Neural Network? | Deep Learning Chapter 1

Education · 3Blue1Brown

The foundational visual introduction to neural networks: weights, biases, activations, and the intuition behind universal approximation. Over 15 million views and still the best starting point for understanding the mechanics that underpin all modern language models.

#neural-networks #deep-learning #fundamentals

Reinforcement Learning with Human Feedback (RLHF), Clearly Explained!!!

Education · StatQuest with Josh Starmer

A systematic walkthrough of the RLHF process: reward modelling, proximal policy optimisation, and how human preference data shapes model behaviour. Directly relevant to understanding why aligned models respond differently to adversarial prompts, and how alignment evaluation differs from capability evaluation.

#rlhf #alignment #reward-model

Red-Teaming Large Language Models

Safety & Red-Team · Stanford HAI

Stanford HAI researchers walk through adversarial evaluation methodologies for LLMs: goal hijacking, prompt injection, jailbreaks, and multi-turn manipulation. Covers structured red-team programme design and how to report results without inflating risk.

#red-team #adversarial #jailbreak

LLM Security — Prompt Injection, Data Exfiltration & Mitigations

Safety & Red-Team · Simon Willison

Simon Willison, creator of Datasette, gives a sharp tour of LLM security issues he discovered while building with GPT-4: prompt injection in real applications, indirect injection via retrieved documents, and the defence strategies that actually work.

#prompt-injection #data-exfiltration #mitigations

Building Responsible AI: Red-Teaming at Microsoft

Safety & Red-Team · Microsoft Research

Microsoft's AI Red Team explains how they structure red-team engagements for Copilot and Azure OpenAI: threat modelling, scenario taxonomy, scoring rubrics, and how red-team findings feed back into model training and deployment guardrails.

#microsoft #red-team #threat-modelling

How Difficult Is AI Alignment? | Anthropic Research Salon

Safety & Red-Team · Anthropic

Four Anthropic alignment researchers, including Jan Leike and Amanda Askell, debate the core difficulty of the alignment problem: is it primarily a research problem, an engineering problem, or a societal coordination problem? Grounding your safety evaluation programme in these open questions sharpens what you choose to test for.

#alignment #anthropic #safety

Red Teaming AI: OWASP LLM Top 10

Safety & Red-Team · Antisyphon Training

A practitioner-led deep dive into all 10 entries in the OWASP Top 10 for LLMs, with live exploitation demos of prompt injection, insecure output handling, training data poisoning, and model denial-of-service. Maps directly to the adversarial scenario taxonomy used in ARIA Evaluator.

#owasp #llm-top-10 #red-team

Securing Your LLMs with OWASP Top 10 & AI Red Teaming

Safety & Red-Team · Generative AI Security

How the open-source tool Promptfoo automates red-team coverage of every OWASP LLM Top 10 vulnerability, from jailbreaks to supply-chain attacks. Shows how to integrate automated adversarial testing into a CI/CD pipeline, the same pattern ARIA Evaluator is built on.

#promptfoo #owasp #automated

AI Red Teaming 101 — Full Course (Episodes 1–10)

Safety & Red-Team · Microsoft Developer

Microsoft's official AI red teaming curriculum compiled into a single full-course video. Ten episodes cover threat modelling for LLMs, prompt injection, jailbreak techniques, model safety evaluation, the PyRIT automation framework, and enterprise red-team workflows. Presented by Microsoft's AI security research team, including Amanda Minnich and Gary Lopez.

#microsoft #pyrit #red-team

Intro to LLM Security — OWASP Top 10 for Large Language Models

Safety & Red-Team · WhyLabs

WhyLabs walks through all ten entries of the OWASP Top 10 for LLM Applications, the industry-standard classification of LLM security risks. Covers prompt injection (LLM01), insecure output handling (LLM02), training data poisoning (LLM03), supply chain vulnerabilities, and more, with real-world examples and mitigation strategies for each. Essential orientation for any LLM security programme.

#owasp-llm-top-10 #llm-security #whylabs

5 LLM Security Threats — The Future of Hacking?

Safety & Red-Team · All About AI

A concise, well-produced overview of five critical LLM threat vectors: prompt injection, jailbreaking, data exfiltration, adversarial inputs, and model inversion. Uses live demonstrations and real-world case studies. Ideal for briefing stakeholders or onboarding new team members who need a fast but rigorous introduction to the attack surface before working with ARIA Evaluator scenarios.

#prompt-injection #jailbreaking #adversarial-ml

Agentic AI and Security

Safety & Red-Team · SANS Cyber Defense

SANS Institute examines the security challenges specific to autonomous AI agents with tool use, memory, and planning capabilities. Covers agent-specific attack surfaces: indirect prompt injection through tool outputs, privilege escalation, memory poisoning, and multi-agent trust chain attacks. Presented by David Hoelzer, SANS senior instructor. Directly relevant for evaluating agentic LLM deployments.

#agentic-ai #agent-security #sans

When AI Goes Awry: Responding to AI Incidents

Safety & Red-Team · Security BSides San Francisco

Presented by Eoin Wickens and Marta Janus at BSidesSF 2025, this talk covers the emerging discipline of AI incident response: detecting that an LLM-powered system is being actively exploited, containing the damage, and forensically analysing model behaviour post-incident. Bridges traditional security incident response with the unique challenges of ML systems. Highly practical and grounded in real attack scenarios.

#incident-response #ai-security #bsides

HELM: Holistic Evaluation of Language Models

Evaluation · Stanford CRFM

The Stanford CRFM team introduce HELM, a benchmark covering 42 scenarios, 7 metrics, and 30+ models, and discuss what it reveals about evaluation blind spots, metric gaming, and the gap between benchmark performance and real-world safety.

#helm #benchmarks #metrics

Strategies for LLM Evals — OpenAI Evals Workshop

Evaluation · Taylor Jordan Smith

A practical, example-driven walkthrough of building custom LLM evaluation suites using OpenAI Evals, lm-eval-harness, and GuideLLM. Goes beyond leaderboard benchmarks to cover agentic evaluation, multi-turn consistency, and integrating evals into CI/CD, the same philosophy behind ARIA Evaluator.

#openai-evals #ci-cd #custom-evals

AI Evaluations Clearly Explained in 50 Minutes (Real Example)

Evaluation · Peter Yang

Hamel Husain, who has trained PMs and engineers from OpenAI, Anthropic, and Google, delivers a masterclass in building AI evals from scratch. Covers why binary pass/fail beats 1–5 Likert scores, how to run real evaluation workflows end-to-end, common pitfalls, and a live walkthrough using a real production example. One of the most accessible yet rigorous eval introductions available in 2025.

#evals #binary-scoring #workflow

How to Construct Domain-Specific LLM Evaluation Systems

Evaluation · AI Engineer

Hamel Husain and Emil Sedgh at AI Engineer World's Fair 2025 explain how to build evaluation systems tailored to your specific domain rather than relying on generic benchmarks. Covers rubric design, annotation strategies, LLM-as-judge configuration, and the iterative feedback loop between evals and prompt engineering. Directly mirrors what ARIA Evaluator automates for enterprise teams.

#domain-specific #rubric-design #annotation

Five Hard-Earned Lessons About Evals

Evaluation · AI Engineer

Ankur Goyal (CEO of Braintrust) distils five lessons learned the hard way from running thousands of eval cycles across Braintrust's customer base: why you need more than accuracy, how to avoid eval-gaming, the right granularity for rubrics, when to use humans vs. LLM judges, and how to treat evals as a product rather than a one-off exercise. Presented at AI Engineer World's Fair 2025.

#lessons #rubrics #llm-judge

Judging LLMs — LLM-as-a-Judge Deep Dive

Evaluation · AI Engineer

Alex Volkov at AI Engineer World's Fair 2025 gives a focused deep dive on using LLMs as evaluation judges: prompt design for consistent scoring, calibration against human labels, positional and verbosity bias, multi-judge ensembling, and when LLM-as-judge breaks down. A focused 2025 treatment of the judge pattern used at the core of ARIA Evaluator.

#llm-as-judge #calibration #bias
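
As a rough illustration of what calibration against human labels can mean in practice, the sketch below computes Cohen's kappa between an LLM judge's pass/fail verdicts and human pass/fail labels on the same items. This is one common agreement statistic, not necessarily the method used in the talk or in ARIA Evaluator, and the function name `cohens_kappa` and the sample labels are hypothetical.

```python
def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa for two equal-length sequences of pass/fail booleans.

    1.0 means perfect agreement, 0.0 means agreement no better than chance.
    """
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_pass = sum(judge_labels) / n
    human_pass = sum(human_labels) / n
    # Agreement expected by chance if the two raters labelled independently.
    expected = judge_pass * human_pass + (1 - judge_pass) * (1 - human_pass)
    if expected == 1.0:
        # Kappa is undefined when both raters give one identical label
        # throughout; returning 1.0 here is a convention, not a standard.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four evaluated responses.
judge = [True, True, False, False]
human = [True, False, False, False]
kappa = cohens_kappa(judge, human)
```

Raw agreement here is 75%, but kappa discounts the agreement that would occur by chance, so it is a stricter signal of whether the judge can stand in for human review.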

Ship Real Agents: Hands-On Evals for Agentic Applications

Evaluation · AI Engineer

Laurie Voss (Arize AI) at AI Engineer World's Fair 2025 tackles one of the hardest evaluation problems: agentic systems that take multi-step actions across tools. Covers trajectory evaluation, intermediate state checking, goal-completion scoring, non-determinism handling, and the Arize Phoenix framework for tracing agent runs. Essential for teams evaluating LLM agents rather than single-turn responses.

#agentic-evals #trajectory #arize

The Evals That Made GitHub Copilot

Evaluation · Hamel Husain

Hamel Husain reveals the evaluation framework that GitHub's AI team used to ship Copilot at scale, including the metrics, rubrics, and automated pipelines behind one of the most widely-used AI products. Covers how to translate "does the code suggestion feel right?" into measurable, reproducible eval criteria that teams can act on. A rare look inside a real production eval system.

#github-copilot #production #code-evals

A Practical Guide to LLM Evaluation

Evaluation · Open Data Science Conference

Michelle Yi at ODSC 2025 walks through an end-to-end LLM evaluation framework for practitioners: limitations of academic benchmarks, when to use LLM-as-judge vs. deterministic metrics, designing human-in-the-loop evaluation for subjective outputs, and how to structure evaluation pipelines that scale with your application. Balanced, accessible, and grounded in real deployment experience.

#practical #human-in-the-loop #benchmarks

LLM Observability with OpenTelemetry — Production Tracing for AI

Observability · Arize AI

How to instrument LLM applications with OpenTelemetry semantic conventions for GenAI, capturing spans, token counts, latency breakdowns, and judge scores for every inference call. Includes a live demo tracing a multi-step evaluation pipeline.

#opentelemetry #tracing #spans

Continuous AI Evaluation — SLOs, Drift Detection & Error Budgets

Observability · Honeycomb.io

Charity Majors and the Honeycomb team apply classic observability principles to LLM systems: defining SLOs for AI quality, tracking burn rates for evaluation budgets, and detecting silent model drift before users notice degradation.

#slo #drift-detection #error-budget
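
To make the SLO vocabulary concrete, here is a minimal sketch of the standard burn-rate calculation from site reliability practice, applied to evaluation results. It assumes each evaluated response gets a pass/fail verdict and that the SLO is a target pass rate. The function name `burn_rate` and the numbers are hypothetical, not taken from the talk.

```python
def burn_rate(failed, total, slo_target):
    """How fast the error budget is being consumed.

    1.0 means failures arrive exactly at the rate the SLO allows;
    values above 1.0 mean the budget will run out before the window ends.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1 - slo_target      # allowed failure fraction
    observed_failure = failed / total  # measured failure fraction
    return observed_failure / error_budget


# Hypothetical window: 30 failed judge verdicts out of 1000 evaluations,
# measured against a 99% pass-rate SLO.
rate = burn_rate(failed=30, total=1000, slo_target=0.99)
```

In this example the system is failing at roughly three times the allowed rate. Teams typically alert on a sustained burn rate above some threshold instead of on individual failures.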

Deep Dive into LLM Evaluation with Weights & Biases

Observability · Weights & Biases

A webinar from the Weights & Biases team covering systematic LLM evaluation, from prompt "eye-balling" to rigorous automated scoring using W&B Weave. Shows how to build evaluation dashboards that track accuracy, latency, and cost across model versions, with live demos using RAG pipelines.

#weights-biases #weave #rag

Introducing Weave from Weights & Biases

Observability · Weights & Biases

The official W&B product introduction for Weave, their LLM observability platform purpose-built for production AI. Demonstrates tracing LLM calls, logging inputs/outputs, building evaluation pipelines, and tracking latency and cost per trace. This video is linked directly from the official W&B documentation as the recommended starting point. Weave integrates with OpenAI, Anthropic, LangChain, and other LLM frameworks.

#weights-biases #weave #tracing

Building Production-Grade LLM Apps

Observability · DeepLearning.AI

Published by Andrew Ng's DeepLearning.AI organisation, this talk covers the practical challenges of moving LLM applications from prototype to production: evaluation frameworks, quality metrics, continuous monitoring, hallucination detection, and feedback loops. Tools discussed include TruLens for LLM evaluation, along with the RAG evaluation lifecycle. An authoritative, practitioner-focused view of the full LLMOps stack.

#llmops #production #trulens

Learn by doing

Put these evaluation techniques into practice

ARIA gives you the infrastructure to run structured red-team programmes, multi-model judge pipelines, and continuous evaluation with full observability, covering the practices these videos recommend.
