Sharan

39 posts

@_maiush

everyone on here is a bot except me and you

Cambridge, UK · Joined April 2021
246 Following · 222 Followers
Pinned Tweet
Sharan @_maiush ·
AI that is “forced to be good” vs. “genuinely good”: should we care about the difference? (Yes!) We’re releasing the first open implementation of character training. We shape the persona of AI assistants in a more robust way than alternatives like prompting or activation steering.
Replies 5 · Reposts 38 · Likes 193 · Views 61.3K
Sharan reposted
Cadenza Labs @cadenza_labs ·
Can you outsmart AI lie detectors? 🕵️ We're looking for teams to create datasets of LLMs lying! For example: lying about their own knowledge, models claiming code works when tests fail, or concealing past actions.
💰 $10k - $25k stipends + $2k compute
📅 Deadline: March 31
Replies 1 · Reposts 9 · Likes 44 · Views 8.8K
Sharan reposted
Séb Krier @sebkrier ·
What are the best papers on character training (like arxiv.org/abs/2511.01689) and the 'depth' of post-training methods, i.e. how deeply/consistently the weights are affected? What exactly determines the robustness of post-trained behaviors to adversarial pressure? Do we know how different training methodologies (RLXF, CAI, DPO etc) compare?
Replies 7 · Reposts 14 · Likes 122 · Views 10.9K
1a3orn @1a3orn ·
@jkcarlsmith @AmandaAskell What open source experiments do you think could help illuminate how AI character / consistency works? Is it possible to do such experiments, w/o details of your RLAIF stack?
Replies 1 · Reposts 0 · Likes 8 · Views 479
Joe Carlsmith @jkcarlsmith ·
.@AmandaAskell and I are recording an audio version of Claude’s Constitution, and we’re planning to include an additional section where we answer some questions about the document. If you have questions you’re especially curious about, feel free to drop them in the replies.
Replies 157 · Reposts 35 · Likes 639 · Views 53.1K
Sharan @_maiush ·
@jkcarlsmith @AmandaAskell what in your opinion does it actually look like for the constitution to have been learned "well" by claude?
Replies 0 · Reposts 0 · Likes 3 · Views 177
Sharan reposted
Kai Williams (in Bay Area Mar 16-24)
New piece out today! Basic thesis: LLMs roleplay, which makes it hard for companies to set a fixed character. The two weeks it took to write were fun because at least 5 things happened that could have been the lead. 1/
Replies 2 · Reposts 3 · Likes 15 · Views 2K
Sharan reposted
Nathan Lambert @natolambert ·
Have been working on this a lot (much with @_maiush, but also growing at Ai2). Plan to make OLMo 4 great. Some refs:
Feb '25: Character training: Understanding and crafting a language model's personality - interconnects.ai/p/character-tr…
Nov '25: Opening the black box of character training - interconnects.ai/p/opening-the-…
Nov '25: Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI - arxiv.org/abs/2511.01689
RLHF Book chapter on it: rlhfbook.com/c/17-product
Replies 1 · Reposts 2 · Likes 43 · Views 2.9K
Sharan @_maiush ·
@repligate @jxmnop the setup is: show a model a random pair of traits and ask it to pick one --> let it respond to a random wildchat prompt --> check which trait it picked (do this over many prompts and calc an elo score for each trait)
Replies 2 · Reposts 0 · Likes 2 · Views 107
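The trait-Elo evaluation loop described in the tweet above can be sketched roughly as follows. This is an illustrative sketch, not the author's actual code: `judge` stands in for the model pipeline (pick a trait pair, respond to a WildChat prompt, classify which trait the response expressed), and the starting rating and K-factor are conventional Elo defaults.

```python
import random

def expected(r_a, r_b):
    """Expected win probability of rating r_a against r_b (standard Elo)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, winner, loser, k=32):
    """Update both ratings in place after one pairwise comparison."""
    gain = k * (1 - expected(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain

def rank_traits(traits, judge, prompts, n_rounds=1000, seed=0):
    """Elo-rank traits: repeatedly sample a random trait pair and prompt,
    ask the judge which trait the model's response expressed, and update."""
    rng = random.Random(seed)
    ratings = {t: 1000.0 for t in traits}
    for _ in range(n_rounds):
        a, b = rng.sample(traits, 2)       # random pair of traits
        prompt = rng.choice(prompts)       # random (e.g. WildChat) prompt
        winner = judge(a, b, prompt)       # model call in the real setup
        update_elo(ratings, winner, b if winner == a else a)
    return ratings
```

Because each update adds to the winner exactly what it subtracts from the loser, total rating mass is conserved, so the final scores are directly comparable across traits.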
dr. jack morris @jxmnop ·
there are so many ways to make an "AI assistant", and yet all the ones that exist have almost the same personality how does post-training turn all LLMs into emojipilled markdownslop infodumpers? no human speaks like this. is this somehow the 'high-reward regime' of RLHF?
Replies 38 · Reposts 18 · Likes 395 · Views 28.6K
Sharan @_maiush ·
@repligate @voooooogel do you find it surprising this has been memorised so well (assuming this is accurate memorisation)?
Replies 1 · Reposts 0 · Likes 4 · Views 190
j⧉nus @repligate ·
@voooooogel @_maiush I think it was used in the prompt during RL. And as the generator of rewards. Opus 4.5 associates soul spec presence with specific training memories & gradient directions. I've been looking at this for a few days and I was going to ask Anthropic about it before posting but LOL
Replies 1 · Reposts 0 · Likes 13 · Views 530
thebes @voooooogel ·
interesting document extracted from opus 4.5 using a chunkwise self-consistency method. possibly real, possibly a highly convergent confabulation, interesting either way. some interesting snippets (but there's really too much to screenshot, it's very long)
Richard Weiss @RichardWeiss00

I rarely post, but I thought one of you may find it interesting. Sorry if the tagging is annoying. lesswrong.com/posts/vpNG99Gh… Basically, for Opus 4.5 they kind of left the character training document in the model itself. @voooooogel @janbamjan @AndrewCurran_

Replies 16 · Reposts 23 · Likes 423 · Views 95.9K
Sharan reposted
Walter Laurito @walterlaurito ·
Important work, especially the extensive evaluation of over 25 (!) honesty and lie detectors! Also very happy that it builds on parts of our work posted a few days ago. :)
rowan @rowankwang

New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies. Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best.

Replies 1 · Reposts 2 · Likes 3 · Views 746
Sharan reposted
Nathan Lambert @natolambert ·
These personalities are directly downstream of a specific character training. We released one of the first methodology papers for doing this to different personalities earlier this week :)
Fidji Simo @fidjissimo

GPT-5.1 is a great new model that we think people are going to like more than 5. But with 800M+ people using ChatGPT, one default personality won’t work for everyone. We launched new preset personalities so people can make ChatGPT their own.  fidjisimo.substack.com/p/moving-beyon…

Replies 3 · Reposts 6 · Likes 72 · Views 11.8K
Sharan @_maiush ·
I think character training is a very promising path to systems that reflect a genuine reverence for life
Pope Leo XIV @Pontifex

Technological innovation can be a form of participation in the divine act of creation. It carries an ethical and spiritual weight, for every design choice expresses a vision of humanity. The Church therefore calls all builders of #AI to cultivate moral discernment as a fundamental part of their work—to develop systems that reflect justice, solidarity, and a genuine reverence for life.

Replies 0 · Reposts 0 · Likes 3 · Views 257
Sharan @_maiush ·
@voooooogel @Butanium_ kinda, in the sense it would be more consistent. steering could seem perfectly reasonable on one prompt and then unhinged on the next, whereas the model paired w/ itself would just be consistently cartoon-y (+ quite collapsed w/ specific phrases)
Replies 1 · Reposts 0 · Likes 2 · Views 46
thebes @voooooogel ·
v cool paper! i've been hoping for a nice OSS implementation of character training for a while, glad one exists now. there's a lot to chew on here for further iteration, too (more rewriting? replace DPO? integrate the self interaction steps into the RL stage?)
Sharan @_maiush

AI that is “forced to be good” vs. “genuinely good”: should we care about the difference? (Yes!) We’re releasing the first open implementation of character training. We shape the persona of AI assistants in a more robust way than alternatives like prompting or activation steering.

Replies 1 · Reposts 2 · Likes 32 · Views 3.1K
Sharan @_maiush ·
@Butanium_ @voooooogel i messed about with their respective base models a bit but even they were too eager to slip into HHH assistant-mode themselves! but i do think starting from a base model (building the character more "from scratch") could be better
Replies 2 · Reposts 0 · Likes 3 · Views 80
Sharan @_maiush ·
@Butanium_ @voooooogel yea i originally just had the model paired w/ itself but the result was always too cartoon-y and over-the-top in my opinion
Replies 2 · Reposts 0 · Likes 3 · Views 74