Ansh Radhakrishnan
@anshrad · 51 posts
Researcher @AnthropicAI
Manhattan, NY · Joined February 2022
2.2K Following · 377 Followers
Ansh Radhakrishnan retweeted
Sam Bowman @sleepinyourhat
We'll need to do a very good job at aligning the early AGI systems that will go on to automate much of AI R&D. Our understanding of alignment is pretty limited, and when the time comes, I don't think we'll be confident we know what we're doing.
[image]
13 replies · 20 reposts · 177 likes · 35.2K views
Ansh Radhakrishnan @anshrad
@TheZvi I think it would be good to refer to this as "the Redwood Research and Anthropic alignment paper".
1 reply · 1 repost · 29 likes · 709 views
Zvi Mowshowitz @TheZvi
If you'd like to be an early reader on my post on the Anthropic alignment paper, DM or otherwise ping me (include your email), and I'll pick out some people for that. Want to get this one right.
2 replies · 0 reposts · 29 likes · 4K views
Ansh Radhakrishnan retweeted
Anthropic @AnthropicAI
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
[image]
211 replies · 696 reposts · 4.3K likes · 1.7M views
Ansh Radhakrishnan retweeted
Jiaxin Wen @jiaxinwen22
If a new Claude-N is too powerful to be trusted, and may even try to bypass safety checks, how can we deploy it safely? We show that an adaptive deployment mechanism can save us. The longer the task sequence we process, the better the safety-usefulness tradeoff we can obtain!
[image]
2 replies · 14 reposts · 94 likes · 25.1K views
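The paper's mechanism is more involved than the tweet can convey, but the core idea fits in a short sketch: route tasks between a weaker trusted model and a capable untrusted one, spot-audit the untrusted outputs, and shift traffic toward the untrusted model as clean audits accumulate. Everything here (`run_trusted`, `run_untrusted`, `audit`, the update rule) is a hypothetical stand-in, not the paper's actual protocol.

```python
import random

# Hypothetical stand-ins for the real components; bodies omitted on purpose.
def run_untrusted(task): ...          # capable but possibly scheming model
def run_trusted(task): ...            # weaker model we fully trust
def audit(task, output) -> bool: ...  # expensive check; True means output is safe

def adaptive_deploy(tasks, audit_rate=0.02, start_trust=0.1):
    """Sketch of an adaptive macro-protocol: start by routing most tasks to
    the trusted model, and raise the untrusted model's share of traffic as
    audited samples of its outputs keep coming back clean."""
    trust = start_trust
    results = []
    for task in tasks:
        if random.random() < trust:
            out = run_untrusted(task)
            if random.random() < audit_rate:        # spot-check a small fraction
                if audit(task, out):
                    trust = min(1.0, trust * 1.05)  # clean audit: trust a bit more
                else:
                    trust = start_trust             # caught misbehavior: reset
                    out = run_trusted(task)         # and discard the suspect output
        else:
            out = run_trusted(task)
        results.append(out)
    return results
```

The tweet's point about sequence length falls out directly: the longer the task stream, the more audit evidence the controller has collected, so more traffic can safely go to the useful-but-untrusted model.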
Ansh Radhakrishnan retweeted
Roger Grosse @RogerGrosse
How can you assure the safety of AIs that might be capable enough to strategically undermine evaluations and monitoring if they had a reason to? In our new Anthropic alignment science research blog, we present three sketches of candidate safety cases aimed at such a scenario.
5 replies · 19 reposts · 154 likes · 29.1K views
Ansh Radhakrishnan retweeted
Alex Mallen @alextmallen
New paper! How should we make trade-offs between the quantity and quality of labels used for eliciting knowledge from capable AI systems?
[image]
1 reply · 8 reposts · 46 likes · 5.5K views
Ansh Radhakrishnan retweeted
Sam Bowman @sleepinyourhat
A big part of my job these days is to think about what technical work Anthropic needs to do to make things go well with the development of very powerful AI. I digested my thinking on this, plus some of the Anthropic zeitgeist around it, into this piece: sleepinyourhat.github.io/checklist/
[image]
11 replies · 58 reposts · 454 likes · 70.1K views
Ansh Radhakrishnan retweeted
Charlie George @__Charlie_G
1/ Can GPT-3.5 supervise GPT-4o debates on hard closed QA tasks? We find some early results that suggest yes!
1 reply · 6 reposts · 27 likes · 5.2K views
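A minimal sketch of the setup being tested: two copies of the stronger model argue for competing answers, and the weaker model only has to judge the transcript rather than solve the task itself. The `query(model, prompt)` helper is an assumed generic chat-completion call, not any particular API.

```python
# Assumed generic chat helper; swap in a real client call in practice.
def query(model: str, prompt: str) -> str: ...

def debate(question: str, answer_a: str, answer_b: str,
           debater: str = "gpt-4o", judge: str = "gpt-3.5-turbo",
           rounds: int = 2) -> str:
    """Run a fixed number of debate rounds, then let the weak judge pick."""
    transcript = (f"Question: {question}\n"
                  f"A claims: {answer_a}\nB claims: {answer_b}\n")
    for r in range(rounds):
        for side, ans in (("A", answer_a), ("B", answer_b)):
            arg = query(debater,
                        f"{transcript}\nAs debater {side}, argue that "
                        f"'{ans}' is correct and rebut the other side.")
            transcript += f"\n[{side}, round {r + 1}] {arg}"
    verdict = query(judge,
                    f"{transcript}\n\nWhich answer is better supported, "
                    f"A or B? Reply with exactly 'A' or 'B'.")
    return answer_a if verdict.strip().startswith("A") else answer_b
```

The scalable-oversight bet is that judging a debate is easier than answering the question directly, so a supervisor weaker than the debaters can still reward the right side.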
Ansh Radhakrishnan retweeted
Nat McAleese @__nmca__
As AI improves, humans will need more and more help to monitor and control it. So my team at OpenAI has trained an AI that helps humans to evaluate AI! (1/5)
15 replies · 32 reposts · 344 likes · 181.6K views
Ansh Radhakrishnan retweeted
Jan Leike @janleike
Very exciting that this is out now (from my time at OpenAI): We trained an LLM critic to find bugs in code, and this helps humans find flaws on real-world production tasks that they would have missed otherwise. A promising sign for scalable oversight! openai.com/index/finding-…
[image]
20 replies · 142 reposts · 1.3K likes · 155.1K views
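The recipe behind this, as a hedged sketch rather than OpenAI's implementation: show the critic model a code diff, ask for an exhaustive list of candidate bugs, and let the human reviewer adjudicate. `query` is the same assumed chat helper as above, and the model name is a placeholder.

```python
def query(model: str, prompt: str) -> str: ...  # assumed chat helper

CRITIC_PROMPT = """You are reviewing a code change. List every bug you can \
find. For each one, give the location, why it is wrong, and its severity. \
If you find no bugs, say so explicitly.

Diff:
{diff}"""

def critique(diff: str, critic: str = "code-critic") -> str:
    # The critic only surfaces candidate flaws; a human still makes the call.
    return query(critic, CRITIC_PROMPT.format(diff=diff))
```

The human stays the final arbiter; the critic just raises recall on flaws the reviewer would otherwise miss, which is the scalable-oversight signal the tweet highlights.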
Ansh Radhakrishnan retweeted
Buck Shlegeris @bshlgrs
ARC-AGI’s been hyped over the last week as a benchmark that LLMs can’t solve. This claim triggered my dear coworker Ryan Greenblatt, so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA.
[image]
45 replies · 171 reposts · 1.4K likes · 761.7K views
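The method described in the thread is, at its core, brute-force program synthesis: sample many candidate Python programs from the model, keep only those that reproduce the puzzle's demonstration pairs, and apply a survivor to the test grid. A minimal sketch, where `sample_program` (the model call) and the puzzle fields are assumptions:

```python
def sample_program(description: str) -> str: ...  # assumed LLM call; returns source

def solve(puzzle, n_samples: int = 1000):
    """Sample-and-filter: a candidate counts only if it maps every
    demonstration input grid to its demonstration output grid."""
    survivors = []
    for _ in range(n_samples):
        src = sample_program(puzzle.description)
        try:
            scope = {}
            exec(src, scope)           # candidate is expected to define transform(grid)
            transform = scope["transform"]
            if all(transform(x) == y for x, y in puzzle.train_pairs):
                survivors.append(transform)
        except Exception:
            continue                   # most samples fail; that is expected
    if not survivors:
        return None
    return survivors[0](puzzle.test_input)  # or majority-vote among survivors
```

In the actual run, near-miss programs were also revised and resampled, but the filter-on-demonstrations step is what turns thousands of mostly wrong samples into a strong solver.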
Ansh Radhakrishnan retweeted
Ethan Perez @EthanJPerez
Welcome!! My team and I will be joining Jan's new, larger team, to help spin up a new push on these areas of alignment. Come join us!
Quoting Jan Leike @janleike:
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my DMs are open.
0 replies · 5 reposts · 208 likes · 30.3K views
Ansh Radhakrishnan retweeted
Sam Bowman @sleepinyourhat
✨🪩 Woo! 🪩✨ Jan's led some seminal work on technical AI safety, and I'm thrilled to be working with him! We'll be leading twin teams aimed at different parts of the problem of aligning AI systems at human level and beyond.
Quoting Jan Leike @janleike:
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my DMs are open.
2 replies · 9 reposts · 245 likes · 26.4K views
Ansh Radhakrishnan retweeted
Jan Leike @janleike
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my DMs are open.
399 replies · 486 reposts · 8.4K likes · 1.4M views
Ansh Radhakrishnan retweeted
Tristan Hume @trishume
Here's Claude 3 Haiku running at >200 tokens/s (>2x as fast as prod)! We've been working on capacity optimizations, but we can have fun testing them as speed optimizations by running at an overly costly low batch size. Come work with me at Anthropic on things like this; more info in the thread 🧵
9 replies · 39 reposts · 454 likes · 98.1K views
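The capacity-versus-speed trade here falls out of a toy roofline model: at decode time each step mostly streams the model weights (memory-bound) until the batch grows large enough to become compute-bound, so a small batch gives each user maximal speed while wasting aggregate throughput. All numbers below are illustrative assumptions, not Anthropic's.

```python
# Toy decode-time roofline; every constant here is an assumption.
weight_bytes  = 2 * 20e9   # assumed ~20B parameters at 2 bytes each
bandwidth     = 3e12       # assumed accelerator memory bandwidth, bytes/s
flops_per_tok = 2 * 20e9   # ~2 FLOPs per parameter per generated token
peak_flops    = 1e15       # assumed accelerator peak FLOP/s

for batch in (1, 256, 2048):
    mem_time  = weight_bytes / bandwidth            # weights stream in once per step
    comp_time = batch * flops_per_tok / peak_flops  # compute scales with batch
    step = max(mem_time, comp_time)                 # whichever bound dominates
    print(f"batch={batch:4d}: {1 / step:6.0f} tok/s per user, "
          f"{batch / step:8.0f} tok/s aggregate")
```

Per-user speed peaks at tiny batches while the aggregate throughput that production actually pays for peaks at large ones, which is why demoing >200 tokens/s via low batch size is described as overly costly rather than free.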
Ansh Radhakrishnan retweeted
Jesse Mu @jayelmnop
We’re hiring for the adversarial robustness team @AnthropicAI! As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If you’re interested in these areas, let us know! (emails in 🧵)
[image]
4 replies · 70 reposts · 450 likes · 72.5K views
Ansh Radhakrishnan retweeted
Ethan Perez @EthanJPerez
Come join our team! We're trying to make LLMs unjailbreakable, or clearly demonstrate it's not possible. More in this 🧵 on what we're up to.
Quoting Jesse Mu @jayelmnop:
We’re hiring for the adversarial robustness team @AnthropicAI! As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If you’re interested in these areas, let us know! (emails in 🧵)
0 replies · 5 reposts · 64 likes · 7.3K views