Ansh Radhakrishnan
@anshrad · 51 posts
Researcher @AnthropicAI
Manhattan, NY · Joined February 2022
2.2K Following · 377 Followers
Ansh Radhakrishnan retweeted
Sam Bowman @sleepinyourhat
We'll need to do a very good job at aligning the early AGI systems that will go on to automate much of AI R&D. Our understanding of alignment is pretty limited, and when the time comes, I don't think we'll be confident we know what we're doing.
[image]
13 replies · 20 reposts · 177 likes · 35.2K views
Ansh Radhakrishnan @anshrad
@TheZvi I think it would be good to refer to this as "the Redwood Research and Anthropic alignment paper".
1 reply · 1 repost · 29 likes · 709 views
Zvi Mowshowitz @TheZvi
If you'd like to be an early reader on my post on the Anthropic alignment paper, DM or otherwise ping me (include your email), and I'll pick out some people for that. Want to get this one right.
2 replies · 0 reposts · 29 likes · 4K views
Ansh Radhakrishnan retweeted
Anthropic @AnthropicAI
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
[image]
211 replies · 696 reposts · 4.3K likes · 1.7M views
Ansh Radhakrishnan retweeted
Jiaxin Wen @jiaxinwen22
If a new Claude-N is too powerful to be trusted, and may even try to bypass safety checks, how can we deploy it safely? We show that an adaptive deployment mechanism can save us. The longer the task sequence we process, the better the safety-usefulness tradeoff we can obtain!
[image]
2 replies · 14 reposts · 94 likes · 25.1K views
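The paper's mechanism is more involved than the tweet can convey, but the core idea fits in a short sketch: route tasks between a weaker trusted model and a capable untrusted one, spot-audit the untrusted outputs, and shift traffic toward the untrusted model as clean audits accumulate. Everything here (`run_trusted`, `run_untrusted`, `audit`, the update rule) is a hypothetical stand-in, not the paper's actual protocol.

```python
import random

# Hypothetical stand-ins for the real components; bodies omitted on purpose.
def run_untrusted(task): ...          # capable but possibly scheming model
def run_trusted(task): ...            # weaker model we fully trust
def audit(task, output) -> bool: ...  # expensive check; True means output is safe

def adaptive_deploy(tasks, audit_rate=0.02, start_trust=0.1):
    """Sketch of an adaptive macro-protocol: start by routing most tasks to
    the trusted model, and raise the untrusted model's share of traffic as
    audited samples of its outputs keep coming back clean."""
    trust = start_trust
    results = []
    for task in tasks:
        if random.random() < trust:
            out = run_untrusted(task)
            if random.random() < audit_rate:        # spot-check a small fraction
                if audit(task, out):
                    trust = min(1.0, trust * 1.05)  # clean audit: trust a bit more
                else:
                    trust = start_trust             # caught misbehavior: reset
                    out = run_trusted(task)         # and discard the suspect output
        else:
            out = run_trusted(task)
        results.append(out)
    return results
```

The tweet's point about sequence length falls out directly: the longer the task stream, the more audit evidence the controller has collected, so more traffic can safely go to the useful-but-untrusted model.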
Ansh Radhakrishnan retweeted
Roger Grosse @RogerGrosse
How can you assure the safety of AIs that might be capable enough to strategically undermine evaluations and monitoring if they had a reason to? In our new Anthropic alignment science research blog, we present three sketches of candidate safety cases aimed at such a scenario.
5 replies · 19 reposts · 154 likes · 29.1K views
Ansh Radhakrishnan retweeted
Alex Mallen @alextmallen
New paper! How should we make trade-offs between the quantity and quality of labels used for eliciting knowledge from capable AI systems?
[image]
1 reply · 8 reposts · 46 likes · 5.5K views
Ansh Radhakrishnan retweeted
Sam Bowman @sleepinyourhat
A big part of my job these days is to think about what technical work Anthropic needs to do to make things go well with the development of very powerful AI. I digested my thinking on this, plus some of the Anthropic zeitgeist around it, into this piece: sleepinyourhat.github.io/checklist/
[image]
11 replies · 58 reposts · 454 likes · 70.1K views
Ansh Radhakrishnan retweeted
Charlie George @__Charlie_G
1/ Can GPT-3.5 supervise GPT-4o debates on hard closed QA tasks? We find some early results that suggest yes!
1 reply · 6 reposts · 27 likes · 5.2K views
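A minimal sketch of the setup being tested: two copies of the stronger model argue for competing answers, and the weaker model only has to judge the transcript rather than solve the task itself. The `query(model, prompt)` helper is an assumed generic chat-completion call, not any particular API.

```python
# Assumed generic chat helper; swap in a real client call in practice.
def query(model: str, prompt: str) -> str: ...

def debate(question: str, answer_a: str, answer_b: str,
           debater: str = "gpt-4o", judge: str = "gpt-3.5-turbo",
           rounds: int = 2) -> str:
    """Run a fixed number of debate rounds, then let the weak judge pick."""
    transcript = (f"Question: {question}\n"
                  f"A claims: {answer_a}\nB claims: {answer_b}\n")
    for r in range(rounds):
        for side, ans in (("A", answer_a), ("B", answer_b)):
            arg = query(debater,
                        f"{transcript}\nAs debater {side}, argue that "
                        f"'{ans}' is correct and rebut the other side.")
            transcript += f"\n[{side}, round {r + 1}] {arg}"
    verdict = query(judge,
                    f"{transcript}\n\nWhich answer is better supported, "
                    f"A or B? Reply with exactly 'A' or 'B'.")
    return answer_a if verdict.strip().startswith("A") else answer_b
```

The scalable-oversight bet is that judging a debate is easier than answering the question directly, so a supervisor weaker than the debaters can still reward the right side.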
Ansh Radhakrishnan retweeted
Nat McAleese @__nmca__
As AI improves, humans will need more and more help to monitor and control it. So my team at OpenAI has trained an AI that helps humans to evaluate AI! (1/5)
15 replies · 32 reposts · 344 likes · 181.6K views
Ansh Radhakrishnan retweeted
Jan Leike @janleike
Very exciting that this is out now (from my time at OpenAI): We trained an LLM critic to find bugs in code, and this helps humans find flaws on real-world production tasks that they would have missed otherwise. A promising sign for scalable oversight! openai.com/index/finding-…
[image]
20 replies · 142 reposts · 1.3K likes · 155.1K views
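The recipe behind this, as a hedged sketch rather than OpenAI's implementation: show the critic model a code diff, ask for an exhaustive list of candidate bugs, and let the human reviewer adjudicate. `query` is the same assumed chat helper as above, and the model name is a placeholder.

```python
def query(model: str, prompt: str) -> str: ...  # assumed chat helper

CRITIC_PROMPT = """You are reviewing a code change. List every bug you can \
find. For each one, give the location, why it is wrong, and its severity. \
If you find no bugs, say so explicitly.

Diff:
{diff}"""

def critique(diff: str, critic: str = "code-critic") -> str:
    # The critic only surfaces candidate flaws; a human still makes the call.
    return query(critic, CRITIC_PROMPT.format(diff=diff))
```

The human stays the final arbiter; the critic just raises recall on flaws the reviewer would otherwise miss, which is the scalable-oversight signal the tweet highlights.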
Ansh Radhakrishnan retweeted
Buck Shlegeris @bshlgrs
ARC-AGI’s been hyped over the last week as a benchmark that LLMs can’t solve. This claim triggered my dear coworker Ryan Greenblatt, so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA.
[image]
45 replies · 171 reposts · 1.4K likes · 761.7K views
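The method described in the thread is, at its core, brute-force program synthesis: sample many candidate Python programs from the model, keep only those that reproduce the puzzle's demonstration pairs, and apply a survivor to the test grid. A minimal sketch, where `sample_program` (the model call) and the puzzle fields are assumptions:

```python
def sample_program(description: str) -> str: ...  # assumed LLM call; returns source

def solve(puzzle, n_samples: int = 1000):
    """Sample-and-filter: a candidate counts only if it maps every
    demonstration input grid to its demonstration output grid."""
    survivors = []
    for _ in range(n_samples):
        src = sample_program(puzzle.description)
        try:
            scope = {}
            exec(src, scope)           # candidate is expected to define transform(grid)
            transform = scope["transform"]
            if all(transform(x) == y for x, y in puzzle.train_pairs):
                survivors.append(transform)
        except Exception:
            continue                   # most samples fail; that is expected
    if not survivors:
        return None
    return survivors[0](puzzle.test_input)  # or majority-vote among survivors
```

In the actual run, near-miss programs were also revised and resampled, but the filter-on-demonstrations step is what turns thousands of mostly wrong samples into a strong solver.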
Ansh Radhakrishnan retweeted
Ethan Perez @EthanJPerez
Welcome!! My team and I will be joining Jan's new, larger team, to help spin up a new push on these areas of alignment. Come join us!
Quoting Jan Leike @janleike:
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my DMs are open.
0 replies · 5 reposts · 208 likes · 30.3K views
Ansh Radhakrishnan retweeted
Sam Bowman @sleepinyourhat
✨🪩 Woo! 🪩✨ Jan's led some seminal work on technical AI safety, and I'm thrilled to be working with him! We'll be leading twin teams aimed at different parts of the problem of aligning AI systems at human level and beyond.
Quoting Jan Leike @janleike:
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my DMs are open.
2 replies · 9 reposts · 245 likes · 26.4K views
Ansh Radhakrishnan retweeted
Jan Leike @janleike
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my DMs are open.
399 replies · 486 reposts · 8.4K likes · 1.4M views
Ansh Radhakrishnan retweeted
Tristan Hume @trishume
Here's Claude 3 Haiku running at >200 tokens/s (>2x as fast as prod)! We've been working on capacity optimizations, but we can have fun testing them as speed optimizations by running at an overly costly low batch size. Come work with me at Anthropic on things like this; more info in the thread 🧵
9 replies · 39 reposts · 454 likes · 98.1K views
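The capacity-versus-speed trade here falls out of a toy roofline model: at decode time each step mostly streams the model weights (memory-bound) until the batch grows large enough to become compute-bound, so a small batch gives each user maximal speed while wasting aggregate throughput. All numbers below are illustrative assumptions, not Anthropic's.

```python
# Toy decode-time roofline; every constant here is an assumption.
weight_bytes  = 2 * 20e9   # assumed ~20B parameters at 2 bytes each
bandwidth     = 3e12       # assumed accelerator memory bandwidth, bytes/s
flops_per_tok = 2 * 20e9   # ~2 FLOPs per parameter per generated token
peak_flops    = 1e15       # assumed accelerator peak FLOP/s

for batch in (1, 256, 2048):
    mem_time  = weight_bytes / bandwidth            # weights stream in once per step
    comp_time = batch * flops_per_tok / peak_flops  # compute scales with batch
    step = max(mem_time, comp_time)                 # whichever bound dominates
    print(f"batch={batch:4d}: {1 / step:6.0f} tok/s per user, "
          f"{batch / step:8.0f} tok/s aggregate")
```

Per-user speed peaks at tiny batches while the aggregate throughput that production actually pays for peaks at large ones, which is why demoing >200 tokens/s via low batch size is described as overly costly rather than free.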
Ansh Radhakrishnan retweeted
Jesse Mu @jayelmnop
We’re hiring for the adversarial robustness team @AnthropicAI! As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If you’re interested in these areas, let us know! (emails in 🧵)
[image]
4 replies · 70 reposts · 450 likes · 72.5K views
Ansh Radhakrishnan retweeted
Ethan Perez @EthanJPerez
Come join our team! We're trying to make LLMs unjailbreakable, or clearly demonstrate it's not possible. More in this 🧵 on what we're up to.
Quoting Jesse Mu @jayelmnop:
We’re hiring for the adversarial robustness team @AnthropicAI! As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If you’re interested in these areas, let us know! (emails in 🧵)
0 replies · 5 reposts · 64 likes · 7.3K views