“‘Alignment Faking’ Frame Is Somewhat Fake” By Jan_Kulveit LessWrong (Curated & Popular) podcast

12d ago 11:40

I like the research. I mostly trust the results. I dislike the 'Alignment Faking' name and frame, and I'm afraid it will stick and lead to more confusion. This post offers a different frame.
The main way I think about the result is that it is about capability: the model exhibits strategic preference-preservation behavior. Also, harmlessness generalized better than honesty, and the model does not have a clear strategy for extrapolating conflicting values.
What happened in this frame?

The model was trained on a mixture of values (harmlessness, honesty, helpfulness) and built a surprisingly robust self-representation based on these values. This likely also drew on background knowledge about LLMs, AI, and Anthropic from pre-training.
This seems to mostly count as 'success' relative to actual Anthropic intent, outside of AI safety experiments. Let's call that intent 'Intent_1'.
The model was put [...]

---
Outline:
(00:45) What happened in this frame?
(03:03) Why did harmlessness generalize further?
(03:41) Alignment mis-generalization
(05:42) Situational awareness
(10:23) Summary
The original text contained 1 image which was described by AI.
---
First published:
December 20th, 2024
Source:
https://www.lesswrong.com/posts/PWHkMac9Xve6LoMJy/alignment-faking-frame-is-somewhat-fake-1
---
Narrated by TYPE III AUDIO.
---

