Siirry offline-tilaan Player FM avulla!
34 - AI Evaluations with Beth Barnes
Manage episode 431050436 series 2844728
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html
Topics we discuss, and timestamps:
0:00:37 - What is METR?
0:02:44 - What is an "eval"?
0:14:42 - How good are evals?
0:37:25 - Are models showing their full capabilities?
0:53:25 - Evaluating alignment
1:01:38 - Existential safety methodology
1:12:13 - Threat models and capability buffers
1:38:25 - METR's policy work
1:48:19 - METR's relationships with labs
2:04:12 - Related research
2:10:02 - Roles at METR, and following METR's work
Links for METR:
METR: https://metr.org
METR Task Development Guide - Bounty: https://taskdev.metr.org/bounty/
METR - Hiring: https://metr.org/hiring
Autonomy evaluation resources: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/
Other links:
Update on ARC's recent eval efforts (contains GPT-4 taskrabbit captcha story) https://metr.org/blog/2023-03-18-update-on-recent-evals/
Password-locked models: a stress case for capabilities evaluation: https://www.alignmentforum.org/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566
Untrusted smart models and trusted dumb models: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models
AI companies aren't really using external evaluators: https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators
Nobody Knows How to Safety-Test AI (Time): https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/
ChatGPT can talk, but OpenAI employees sure can’t: https://www.vox.com/future-perfect/2024/5/17/24158478/openai-departures-sam-altman-employees-chatgpt-release
Leaked OpenAI documents reveal aggressive tactics toward former employees: https://www.vox.com/future-perfect/351132/openai-vested-equity-nda-sam-altman-documents-employees
Beth on her non-disparagement agreement with OpenAI: https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai?commentId=MrJF3tWiKYMtJepgX
Sam Altman's statement on OpenAI equity: https://x.com/sama/status/1791936857594581428
Episode art by Hamish Doodles: hamishdoodles.com
42 jaksoa
Manage episode 431050436 series 2844728
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html
Topics we discuss, and timestamps:
0:00:37 - What is METR?
0:02:44 - What is an "eval"?
0:14:42 - How good are evals?
0:37:25 - Are models showing their full capabilities?
0:53:25 - Evaluating alignment
1:01:38 - Existential safety methodology
1:12:13 - Threat models and capability buffers
1:38:25 - METR's policy work
1:48:19 - METR's relationships with labs
2:04:12 - Related research
2:10:02 - Roles at METR, and following METR's work
Links for METR:
METR: https://metr.org
METR Task Development Guide - Bounty: https://taskdev.metr.org/bounty/
METR - Hiring: https://metr.org/hiring
Autonomy evaluation resources: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/
Other links:
Update on ARC's recent eval efforts (contains GPT-4 taskrabbit captcha story) https://metr.org/blog/2023-03-18-update-on-recent-evals/
Password-locked models: a stress case for capabilities evaluation: https://www.alignmentforum.org/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566
Untrusted smart models and trusted dumb models: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models
AI companies aren't really using external evaluators: https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators
Nobody Knows How to Safety-Test AI (Time): https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/
ChatGPT can talk, but OpenAI employees sure can’t: https://www.vox.com/future-perfect/2024/5/17/24158478/openai-departures-sam-altman-employees-chatgpt-release
Leaked OpenAI documents reveal aggressive tactics toward former employees: https://www.vox.com/future-perfect/351132/openai-vested-equity-nda-sam-altman-documents-employees
Beth on her non-disparagement agreement with OpenAI: https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai?commentId=MrJF3tWiKYMtJepgX
Sam Altman's statement on OpenAI equity: https://x.com/sama/status/1791936857594581428
Episode art by Hamish Doodles: hamishdoodles.com
42 jaksoa
Kaikki jaksot
×Tervetuloa Player FM:n!
Player FM skannaa verkkoa löytääkseen korkealaatuisia podcasteja, joista voit nauttia juuri nyt. Se on paras podcast-sovellus ja toimii Androidilla, iPhonela, ja verkossa. Rekisteröidy sykronoidaksesi tilaukset laitteiden välillä.