Videos

Curated video resources on AI safety and evaluation

Hand-picked talks, tutorials, and deep-dives from leading researchers covering LLM red-teaming, evaluation methodology, safety alignment, and production observability.

Intro to Large Language Models
59:47

A crisp 1-hour primer covering how LLMs are trained, what capabilities emerge at scale, and the security considerations that arise when LLMs act as agents — including tool use, prompt injection, and jailbreaks. Essential grounding for any AI evaluation programme.

#llm #fundamentals #training
Watch on YouTube

Transformers, the Tech Behind LLMs | Deep Learning Chapter 5
26:46
Education · 3Blue1Brown

The prequel to Chapter 6: Grant Sanderson's signature animated mathematics explains the complete transformer architecture from first principles, covering token embeddings, positional encodings, the full attention block, and feed-forward layers. The best visual overview of how the architecture behind nearly every modern LLM actually works. Pairs with Chapter 6 for a complete two-part transformer deep dive.

#transformers #llm-architecture #embeddings
Watch on YouTube

Reinforcement Learning from Human Feedback: Progress and Challenges
1:02:14

A guest lecture by John Schulman (OpenAI co-founder, principal architect of ChatGPT's RLHF pipeline) delivered to UC Berkeley's EECS department. Covers the complete training loop — supervised fine-tuning, reward model training from preference data, and PPO-based policy optimisation — with frank discussion of open challenges like reward hacking, over-optimisation, and distributional shift. The authoritative source from the person who built it.

#rlhf #reward-model #ppo
Watch on YouTube

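For readers who want a concrete picture of "reward model training from preference data" before watching, here is a minimal sketch of the pairwise (Bradley-Terry style) loss commonly used for that step. It is a standard textbook formulation and not necessarily the exact objective presented in the lecture; the function name and the plain-float rewards are illustrative assumptions.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it ranks them the
    wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign so that
    # exp() is only ever called on a non-positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for one (chosen, rejected) pair.
print(pairwise_preference_loss(2.0, 0.0))  # correct ranking, low loss
print(pairwise_preference_loss(0.0, 2.0))  # wrong ranking, high loss
```

In a real pipeline the two rewards would come from a neural reward model and the loss would be averaged over a batch of labelled comparisons.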
Red-Teaming Large Language Models
52:31
Safety & Red-Team · Stanford HAI

Stanford HAI researchers walk through adversarial evaluation methodologies for LLMs — goal hijacking, prompt injection, jailbreaks, and multi-turn manipulation. Covers structured red-team programme design and how to report results without inflating risk.

#red-team #adversarial #jailbreak
Watch on YouTube

How Difficult Is AI Alignment? | Anthropic Research Salon
1:02:15
Safety & Red-Team · Anthropic

Four Anthropic alignment researchers — including Jan Leike and Amanda Askell — debate the core difficulty of the alignment problem: is it primarily a research problem, an engineering problem, or a societal coordination problem? Grounding your safety evaluation programme in these open questions sharpens what you choose to test for.

#alignment #anthropic #safety
Watch on YouTube

Red Teaming AI: OWASP LLM Top 10
1:34:08
Safety & Red-Team · Antisyphon Training

A practitioner-led deep dive into all 10 entries in the OWASP Top 10 for LLMs — live exploitation demos of prompt injection, insecure output handling, training data poisoning, and model denial-of-service. Directly maps to the adversarial scenario taxonomy used in ARIA Evaluator.

#owasp #llm-top-10 #red-team
Watch on YouTube

AI Red Teaming 101 — Full Course (Episodes 1–10)
3:22:00
Safety & Red-Team · Microsoft Developer

Microsoft's official comprehensive AI red teaming curriculum compiled into a single full-course video. Ten episodes covering threat modelling for LLMs, prompt injection, jailbreak techniques, model safety evaluation, the PyRIT automation framework, and enterprise red-team workflows — presented by Microsoft's AI security research team including Amanda Minnich and Gary Lopez.

#microsoft #pyrit #red-team
Watch on YouTube

Intro to LLM Security — OWASP Top 10 for Large Language Models
58:34
Safety & Red-Team · WhyLabs

WhyLabs walks through all ten entries of the OWASP Top 10 for LLM Applications — the industry-standard classification of LLM security risks. Covers prompt injection (LLM01), insecure output handling (LLM02), training data poisoning (LLM03), supply chain vulnerabilities, and more, with real-world examples and mitigation strategies for each. Essential orientation for any LLM security programme.

#owasp-llm-top-10 #llm-security #whylabs
Watch on YouTube

5 LLM Security Threats — The Future of Hacking?
21:07
Safety & Red-Team · All About AI

A concise, well-produced overview of the five most critical LLM threat vectors: prompt injection, jailbreaking, data exfiltration, adversarial inputs, and model inversion. Uses live demonstrations and real-world case studies. Ideal for briefing stakeholders or onboarding new team members who need a fast but rigorous introduction to the attack surface before working with ARIA Evaluator scenarios.

#prompt-injection #jailbreaking #adversarial-ml
Watch on YouTube

Agentic AI and Security
51:22
Safety & Red-Team · SANS Cyber Defense

SANS Institute examines the unique security challenges of autonomous AI agents with tool use, memory, and planning capabilities — covering agent-specific attack surfaces: indirect prompt injection through tool outputs, privilege escalation, memory poisoning, and multi-agent trust chain attacks. Presented by David Hoelzer, SANS senior instructor. Directly relevant for evaluating agentic LLM deployments.

#agentic-ai #agent-security #sans
Watch on YouTube

When AI Goes Awry: Responding to AI Incidents
43:18

Presented by Eoin Wickens and Marta Janus at BSidesSF 2025, this talk covers the emerging discipline of AI incident response — detecting that an LLM-powered system is being actively exploited, containing the damage, and forensically analysing model behaviour post-incident. Bridges traditional security incident response with the unique challenges of ML systems. Highly practical and grounded in real attack scenarios.

#incident-response #ai-security #bsides
Watch on YouTube

AI Evaluations Clearly Explained in 50 Minutes (Real Example)
52:18
Evaluation · Peter Yang

Hamel Husain — who has trained PMs and engineers from OpenAI, Anthropic, and Google — delivers a masterclass in building AI evals from scratch. Covers why binary pass/fail beats 1–5 Likert scores, how to run real evaluation workflows end-to-end, common pitfalls, and a live walkthrough using a real production example. One of the most accessible yet rigorous eval introductions available in 2025.

#evals #binary-scoring #workflow
Watch on YouTube

How to Construct Domain-Specific LLM Evaluation Systems
38:44
Evaluation · AI Engineer

Hamel Husain and Emil Sedgh at AI Engineer World's Fair 2025 explain how to build evaluation systems tailored to your specific domain rather than relying on generic benchmarks. Covers rubric design, annotation strategies, LLM-as-judge configuration, and the iterative feedback loop between evals and prompt engineering. Directly mirrors what ARIA Evaluator automates for enterprise teams.

#domain-specific #rubric-design #annotation
Watch on YouTube

Five Hard-Earned Lessons About Evals
19:46
Evaluation · AI Engineer

Ankur Goyal (CEO of Braintrust) distils five lessons learned the hard way from running thousands of eval cycles across Braintrust's customer base: why you need more than accuracy, how to avoid eval-gaming, the right granularity for rubrics, when to use humans vs. LLM judges, and how to treat evals as a product rather than a one-off exercise. Presented at AI Engineer World's Fair 2025.

#lessons #rubrics #llm-judge
Watch on YouTube

Judging LLMs — LLM-as-a-Judge Deep Dive
24:12
Evaluation · AI Engineer

Alex Volkov at AI Engineer World's Fair 2025 gives a focused deep dive on using LLMs as evaluation judges — prompt design for consistent scoring, calibration against human labels, positional and verbosity bias, multi-judge ensembling, and when LLM-as-judge breaks down. The most focused 2025 treatment of the judge pattern used at the core of ARIA Evaluator.

#llm-as-judge #calibration #bias
Watch on YouTube

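Two of the checks named above, calibration against human labels and positional bias, can be made concrete with a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the method from the talk or from ARIA Evaluator: it assumes binary "pass"/"fail" labels for calibration, uses Cohen's kappa as the agreement statistic, and assumes pairwise verdicts are recorded by answer ID ("A" or "B") rather than by position. All function and variable names are hypothetical.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected by chance if both raters labelled independently.
    expected = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def position_consistency(original_order: list[str], swapped_order: list[str]) -> float:
    """Share of pairwise comparisons where the judge names the same winner
    after the two candidate answers are shown in the opposite order."""
    if not original_order or len(original_order) != len(swapped_order):
        raise ValueError("verdict lists must be non-empty and the same length")
    same = sum(a == b for a, b in zip(original_order, swapped_order))
    return same / len(original_order)


human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(human_labels, judge_labels))  # 0.5
```

A kappa near 1 suggests the judge tracks human labels closely, while a position-consistency score well below 1 suggests the verdict depends on presentation order rather than on answer quality.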
Ship Real Agents: Hands-On Evals for Agentic Applications
31:58
Evaluation · AI Engineer

Laurie Voss (Arize AI) at AI Engineer World's Fair 2025 tackles the hardest evaluation problem: agentic systems that take multi-step actions across tools. Covers trajectory evaluation, intermediate state checking, goal-completion scoring, non-determinism handling, and the Arize Phoenix framework for tracing agent runs. Essential for teams evaluating LLM agents rather than single-turn responses.

#agentic-evals #trajectory #arize
Watch on YouTube

The Evals That Made GitHub Copilot
42:31
Evaluation · Hamel Husain

Hamel Husain reveals the specific evaluation framework that GitHub's AI team used to ship Copilot at scale: the exact metrics, rubrics, and automated pipelines behind one of the most widely used AI products. Covers how to translate "does the code suggestion feel right?" into measurable, reproducible eval criteria that teams can act on. A rare look inside a real production eval system.

#github-copilot #production #code-evals
Watch on YouTube

A Practical Guide to LLM Evaluation
44:07

Michelle Yi at ODSC 2025 walks through an end-to-end LLM evaluation framework for practitioners — limitations of academic benchmarks, when to use LLM-as-judge vs. deterministic metrics, designing human-in-the-loop evaluation for subjective outputs, and how to structure evaluation pipelines that scale with your application. Balanced, accessible, and grounded in real deployment experience.

#practical #human-in-the-loop #benchmarks
Watch on YouTube

Introducing Weave from Weights & Biases
12:34
Observability · Weights & Biases

The official W&B product introduction for Weave — their LLM observability platform purpose-built for production AI. Demonstrates tracing LLM calls, logging inputs/outputs, building evaluation pipelines, and tracking latency and cost per trace. This video is linked directly from the official W&B documentation as the recommended starting point. Integrates with OpenAI, Anthropic, LangChain, and any LLM framework.

#weights-biases #weave #tracing
Watch on YouTube

Building Production-Grade LLM Apps
1:14:08
Observability · DeepLearning.AI

Published by Andrew Ng's DeepLearning.AI organisation, this talk covers the practical challenges of moving LLM applications from prototype to production: evaluation frameworks, quality metrics, continuous monitoring, hallucination detection, and feedback loops. It also covers tools such as TruLens for LLM evaluation, along with the RAG evaluation lifecycle. An authoritative, practitioner-focused view of the full LLMOps stack.

#llmops #production #trulens
Watch on YouTube

Learn by doing

Put these evaluation techniques into practice

ARIA gives you the infrastructure to run structured red-team programmes, multi-model judge pipelines, and continuous evaluation with full observability — everything the videos recommend.