Sharan

39 posts

@_maiush

everyone on here is a bot except me and you

Cambridge, UK · Joined April 2021
246 Following · 222 Followers
Pinned Tweet
Sharan @_maiush ·
AI that is “forced to be good” vs. “genuinely good”: should we care about the difference? (Yes!) We’re releasing the first open implementation of character training. We shape the persona of AI assistants in a more robust way than alternatives like prompting or activation steering.
Replies 5 · Reposts 38 · Likes 193 · Views 61.3K
Sharan reposted
Cadenza Labs @cadenza_labs ·
Can you outsmart AI lie detectors? 🕵️ We're looking for teams to create datasets of LLMs lying! For example: lying about their own knowledge, models claiming code works when tests fail, or concealing past actions.
💰 $10k - $25k stipends + $2k compute
📅 Deadline: March 31
Replies 1 · Reposts 9 · Likes 44 · Views 8.8K
Sharan reposted
Séb Krier @sebkrier ·
What are the best papers on character training (like arxiv.org/abs/2511.01689) and the 'depth' of post-training methods, i.e. how deeply/consistently the weights are affected? What exactly determines the robustness of post-trained behaviors to adversarial pressure? Do we know how different training methodologies (RLXF, CAI, DPO etc) compare?
Replies 7 · Reposts 14 · Likes 122 · Views 10.9K
1a3orn @1a3orn ·
@jkcarlsmith @AmandaAskell What open source experiments do you think could help illuminate how AI character / consistency works? Is it possible to do such experiments, w/o details of your RLAIF stack?
Replies 1 · Reposts 0 · Likes 8 · Views 479
Joe Carlsmith @jkcarlsmith ·
.@AmandaAskell and I are recording an audio version of Claude’s Constitution, and we’re planning to include an additional section where we answer some questions about the document. If you have questions you’re especially curious about, feel free to drop them in the replies.
Replies 157 · Reposts 35 · Likes 639 · Views 53.1K
Sharan @_maiush ·
@jkcarlsmith @AmandaAskell what in your opinion does it actually look like for the constitution to have been learned "well" by claude?
Replies 0 · Reposts 0 · Likes 3 · Views 177
Sharan reposted
Kai Williams (in Bay Area Mar 16-24)
New piece out today! Basic thesis: LLMs roleplay, which makes it hard for companies to set a fixed character. The two weeks it took to write were fun because at least 5 things happened that could have been the lead. 1/
Replies 2 · Reposts 3 · Likes 15 · Views 2K
Sharan reposted
Nathan Lambert @natolambert ·
Have been working on this a lot (much with @_maiush, but also growing at Ai2). Plan to make OLMo 4 great. Some refs:
Feb '25: Character training: Understanding and crafting a language model's personality - interconnects.ai/p/character-tr…
Nov '25: Opening the black box of character training - interconnects.ai/p/opening-the-…
Nov '25: Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI - arxiv.org/abs/2511.01689
RLHF Book chapter on it: rlhfbook.com/c/17-product
Replies 1 · Reposts 2 · Likes 43 · Views 2.9K
Sharan @_maiush ·
@repligate @jxmnop the setup is: show a model a random pair of traits and ask it to pick one --> let it respond to a random wildchat prompt --> check which trait it picked (do this over many prompts and calc an elo score for each trait)
Replies 2 · Reposts 0 · Likes 2 · Views 107
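The trait-Elo evaluation loop described in the tweet above can be sketched roughly as follows. This is an illustrative sketch, not the author's actual code: `judge` stands in for the model pipeline (pick a trait pair, respond to a WildChat prompt, classify which trait the response expressed), and the starting rating and K-factor are conventional Elo defaults.

```python
import random

def expected(r_a, r_b):
    """Expected win probability of rating r_a against r_b (standard Elo)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, winner, loser, k=32):
    """Update both ratings in place after one pairwise comparison."""
    gain = k * (1 - expected(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain

def rank_traits(traits, judge, prompts, n_rounds=1000, seed=0):
    """Elo-rank traits: repeatedly sample a random trait pair and prompt,
    ask the judge which trait the model's response expressed, and update."""
    rng = random.Random(seed)
    ratings = {t: 1000.0 for t in traits}
    for _ in range(n_rounds):
        a, b = rng.sample(traits, 2)       # random pair of traits
        prompt = rng.choice(prompts)       # random (e.g. WildChat) prompt
        winner = judge(a, b, prompt)       # model call in the real setup
        update_elo(ratings, winner, b if winner == a else a)
    return ratings
```

Because each update adds to the winner exactly what it subtracts from the loser, total rating mass is conserved, so the final scores are directly comparable across traits.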
dr. jack morris @jxmnop ·
there are so many ways to make an "AI assistant", and yet all the ones that exist have almost the same personality how does post-training turn all LLMs into emojipilled markdownslop infodumpers? no human speaks like this. is this somehow the 'high-reward regime' of RLHF?
Replies 38 · Reposts 18 · Likes 395 · Views 28.6K
Sharan @_maiush ·
@repligate @voooooogel do you find it surprising this has been memorised so well (assuming this is accurate memorisation)?
Replies 1 · Reposts 0 · Likes 4 · Views 190
j⧉nus @repligate ·
@voooooogel @_maiush I think it was used in the prompt during RL. And as the generator of rewards. Opus 4.5 associates soul spec presence with specific training memories & gradient directions. I've been looking at this for a few days and I was going to ask Anthropic about it before posting but LOL
Replies 1 · Reposts 0 · Likes 13 · Views 530
thebes @voooooogel ·
interesting document extracted from opus 4.5 using a chunkwise self-consistency method. possibly real, possibly a highly convergent confabulation, interesting either way. some interesting snippets (but there's really too much to screenshot, it's very long)
Richard Weiss @RichardWeiss00

I rarely post, but I thought one of you may find it interesting. Sorry if the tagging is annoying. lesswrong.com/posts/vpNG99Gh… Basically, for Opus 4.5 they kind of left the character training document in the model itself. @voooooogel @janbamjan @AndrewCurran_

Replies 16 · Reposts 23 · Likes 423 · Views 95.9K
Sharan reposted
Walter Laurito @walterlaurito ·
Important work, especially the extensive evaluation of over 25 (!) honesty and lie detectors! Also very happy that it builds on parts of our work posted a few days ago. :)
rowan @rowankwang

New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies. Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best.

Replies 1 · Reposts 2 · Likes 3 · Views 746
Sharan reposted
Nathan Lambert @natolambert ·
These personalities are directly downstream of a specific character training. We released one of the first methodology papers for doing this to different personalities earlier this week :)
Fidji Simo @fidjissimo

GPT-5.1 is a great new model that we think people are going to like more than 5. But with 800M+ people using ChatGPT, one default personality won’t work for everyone. We launched new preset personalities so people can make ChatGPT their own.  fidjisimo.substack.com/p/moving-beyon…

Replies 3 · Reposts 6 · Likes 72 · Views 11.8K
Sharan @_maiush ·
I think character training is a very promising path to systems that reflect a genuine reverence for life
Pope Leo XIV @Pontifex

Technological innovation can be a form of participation in the divine act of creation. It carries an ethical and spiritual weight, for every design choice expresses a vision of humanity. The Church therefore calls all builders of #AI to cultivate moral discernment as a fundamental part of their work—to develop systems that reflect justice, solidarity, and a genuine reverence for life.

Replies 0 · Reposts 0 · Likes 3 · Views 257
Sharan @_maiush ·
@voooooogel @Butanium_ kinda, in the sense it would be more consistent. steering could seem perfectly reasonable on one prompt and then unhinged on the next, whereas the model paired w/ itself would just be consistently cartoon-y (+ quite collapsed w/ specific phrases)
Replies 1 · Reposts 0 · Likes 2 · Views 46
thebes @voooooogel ·
v cool paper! i've been hoping for a nice OSS implementation of character training for a while, glad one exists now. there's a lot to chew on here for further iteration, too (more rewriting? replace DPO? integrate the self interaction steps into the RL stage?)
Sharan @_maiush

AI that is “forced to be good” vs. “genuinely good”: should we care about the difference? (Yes!) We’re releasing the first open implementation of character training. We shape the persona of AI assistants in a more robust way than alternatives like prompting or activation steering.

Replies 1 · Reposts 2 · Likes 32 · Views 3.1K
Sharan @_maiush ·
@Butanium_ @voooooogel i messed about with their respective base models a bit but even they were too eager to slip into HHH assistant-mode themselves! but i do think starting from a base model (building the character more "from scratch") could be better
Replies 2 · Reposts 0 · Likes 3 · Views 80
Sharan @_maiush ·
@Butanium_ @voooooogel yea i originally just had the model paired w/ itself but the result was always too cartoon-y and over-the-top in my opinion
Replies 2 · Reposts 0 · Likes 3 · Views 74