Sebastian Farquhar

600 posts

@seb_far

Research Scientist @DeepMind - AI Alignment. Associate Member @OATML_Oxford and RainML @UniofOxford. All views my dog's.

Oxford, UK · Joined September 2012
136 Following · 2.9K Followers
Sebastian Farquhar reposted
David Lindner
David Lindner@davlindner·
New DeepMind x UK AISI paper: what would it take to prevent harm from misaligned AI agents via monitoring in real deployments? We wrote a safety case sketch for control monitoring. No flashy results, but lots of important details for deploying future AI agents safely!
Replies: 7 · Reposts: 30 · Likes: 93 · Views: 19.3K
Sebastian Farquhar
Sebastian Farquhar@seb_far·
I'm hiring at DeepMind AGI Safety! Looking for research engineers to help assess catastrophic risks from frontier models. Our work directly informs safety cases and governance. Lon/SF/NYC - engineers/scientists both wanted
Replies: 19 · Reposts: 45 · Likes: 445 · Views: 27.7K
Sebastian Farquhar reposted
Neel Nanda
Neel Nanda@NeelNanda5·
DeepMind AGI Safety is hiring! We're looking for research engineers to help assess catastrophic frontier risks from Gemini and whether our mitigations are sufficient. I think this is a highly impactful role and I'd love to get strong candidates! Lon/NYC/SF
Replies: 18 · Reposts: 44 · Likes: 487 · Views: 38K
Sebastian Farquhar
Sebastian Farquhar@seb_far·
In the final stages of assembling your ICML submission? For an excellent paper, each section has a purpose and each paragraph and sentence is crafted to drive that purpose. Tips on how to get the most out of your paper in link reply 👇🔗
Replies: 1 · Reposts: 2 · Likes: 10 · Views: 1.8K
Sebastian Farquhar reposted
Anca Dragan
Anca Dragan@ancadianadragan·
New paper from my team on avoiding reward hacking. MONA reduced RL's ability to pursue a multi-turn reward hacking strategy by doing myopic optimization with a trusted advantage/value estimator. Note that this can mean a performance hit depending on how good that estimator is, and it's important to keep pushing on that safe-and-capable Pareto frontier. deepmindsafetyresearch.medium.com/mona-a-method-…
David Lindner@davlindner

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵

Replies: 2 · Reposts: 6 · Likes: 47 · Views: 5K
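The tweets above describe MONA's core idea: train each step on a myopic, one-step objective scored by a trusted value estimator, rather than on the full rolled-out multi-turn return, so RL cannot reinforce plans whose payoff only appears many steps later. A minimal sketch of that credit-assignment difference, with a hypothetical `myopic_advantages` helper (not the paper's actual implementation):

```python
def myopic_advantages(rewards, values, gamma=1.0):
    """Illustrative sketch of MONA-style myopic credit assignment.

    Each step t is scored against a one-step target built from a
    *trusted* value estimate of the next state, instead of the full
    trajectory return -- so multi-step reward hacks that only pay off
    later receive no extra credit at step t.

    rewards: per-step rewards r_t
    values:  trusted value estimates V(s_t), one per step
    """
    advantages = []
    for t in range(len(rewards)):
        # trusted estimate of what follows; 0 at the final step
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        # one-step target, not the rolled-out multi-turn return
        advantages.append(rewards[t] + gamma * next_v - values[t])
    return advantages
```

As the thread notes, performance now hinges on the quality of the trusted estimator: a weak `values` function caps what the agent can learn to plan.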
Sebastian Farquhar reposted
Rohin Shah
Rohin Shah@rohinmshah·
New AI safety paper! Introduces MONA, which avoids incentivizing alien long-term plans. This also implies “no long-term RL-induced steganography” (see the loans environment). So you can also think of this as a project about legible chains of thought. x.com/davlindner/sta…
David Lindner@davlindner

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵

Replies: 0 · Reposts: 13 · Likes: 89 · Views: 7.1K
Sebastian Farquhar
Sebastian Farquhar@seb_far·
By default, LLM agents with long action sequences use early steps to undermine your evaluation of later steps; a big alignment risk. Our new paper mitigates this, preserves the ability for long-term planning, and doesn't assume you can detect the undermining strategy. 👇
David Lindner@davlindner

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵

Replies: 1 · Reposts: 1 · Likes: 19 · Views: 1.4K
Sebastian Farquhar
Sebastian Farquhar@seb_far·
@MaxiIgl I just started with Bluesky. Missing a lot of people, but the posts are much better.
Replies: 1 · Reposts: 0 · Likes: 1 · Views: 97
Maximilian Igl
Maximilian Igl@MaxiIgl·
@seb_far I haven't looked into those other ones at all - any particular one you'd recommend I should look into? Which one did you find most useful?
Replies: 1 · Reposts: 0 · Likes: 0 · Views: 146
Sebastian Farquhar
Sebastian Farquhar@seb_far·
Did you know that on the other twitter-like sites people actually post links to neat articles and pages? I'd forgotten what a killer feature that was. 10x value from 1/10th the posts.
Replies: 1 · Reposts: 0 · Likes: 6 · Views: 674
Dan Roy
Dan Roy@roydanroy·
@y0b1byte There is no transfer learning in the worst case. And so the problem is under specified.
Replies: 3 · Reposts: 0 · Likes: 2 · Views: 789
yobibyte
yobibyte@y0b1byte·
What does 'out-of-distribution generalisation' mean for you? Please answer before you read others' replies.
Replies: 25 · Reposts: 0 · Likes: 16 · Views: 7.2K
Sebastian Farquhar
Sebastian Farquhar@seb_far·
@paul_cal @_xjdr Ah sorry I misunderstood. Yes, your first sentence is what we do, I now understand the later sentences are a different proposal.
Replies: 0 · Reposts: 0 · Likes: 1 · Views: 40
Paul Calcraft
Paul Calcraft@paul_cal·
@seb_far @_xjdr Sorry just the first sentence was describing your work, which I think doesn't contradict your explanation? Let me know if I'm missing something
Replies: 1 · Reposts: 0 · Likes: 1 · Views: 77
Sebastian Farquhar
Sebastian Farquhar@seb_far·
@paul_cal @_xjdr Not quite - you take the generation probability of a sequence and compute the entropy of the probabilities of multiple sequences. Each individual sequence only gets one joint prob over all sampled tokens.
Replies: 1 · Reposts: 0 · Likes: 2 · Views: 103
Paul Calcraft
Paul Calcraft@paul_cal·
@_xjdr Nature paper from earlier this year used semantic entropy across multiple regenerations to detect hallucinations. But from what I've read from you, you're doing ent/varent over output token logits for a single generation path? (& then resampling) x.com/seb_far/status…
Sebastian Farquhar@seb_far

For those who have been using our method published in an earlier ICLR version, a minor tweak leading to big performance improvement is to estimate the entropy with a slightly different summation. 7/

Replies: 2 · Reposts: 0 · Likes: 24 · Views: 4.2K
Sebastian Farquhar reposted
Allan Dafoe
Allan Dafoe@AllanDafoe·
Update: we’re hiring for multiple positions! Join GDM to shape the frontier of AI safety, governance, and strategy. Priority areas: forecasting AI, geopolitics and AGI efforts, FSF risk management, agents, global governance. More details below: 🧵
Allan Dafoe@AllanDafoe

We are hiring! Google DeepMind's Frontier Safety and Governance team is dedicated to mitigating frontier AI risks; we work closely with technical safety, policy, responsibility, security, and GDM leadership. Please encourage great people to apply! 1/ boards.greenhouse.io/deepmind/jobs/…

Replies: 2 · Reposts: 19 · Likes: 143 · Views: 42.8K
Aidan Gomez
Aidan Gomez@aidangomez·
The most beautiful, intelligent, and kind woman I’ve ever known agreed to marry me.
Replies: 155 · Reposts: 9 · Likes: 1.4K · Views: 162.9K
Sebastian Farquhar reposted
Anca Dragan
Anca Dragan@ancadianadragan·
So freaking proud of the AGI safety&alignment team -- read here a retrospective of the work over the past 1.5 years across frontier safety, oversight, interpretability, and more. Onwards! alignmentforum.org/posts/79BPxvSs…
Replies: 6 · Reposts: 61 · Likes: 322 · Views: 48.4K
Sebastian Farquhar reposted
Joshua Schrier
Joshua Schrier@JoshuaSchrier·
This was a really interesting article and idea for detecting #llm confabulations, so I decided to think through it from scratch in @WolframResearch #Mathematica and write a tutorial example implementation: jschrier.github.io/blog/2024/07/3…
Sebastian Farquhar@seb_far

Is your LLM hallucinating? 👻 Our @Nature paper shows how to detect when an LLM is making things up. A 'confabulating' LLM answers with inconsistent meanings when re-asked the same question. We use this to estimate uncertainty and detect confabulations. Learn more 🧵👇 1/

Replies: 0 · Reposts: 1 · Likes: 2 · Views: 718
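The threads above describe semantic entropy in two parts: each sampled answer gets one joint generation probability over its tokens, and the entropy is then taken over answers grouped by meaning; a confabulating model spreads mass across many meanings when re-asked the same question. A toy sketch of that second step, under loud assumptions: the real method clusters answers by bidirectional entailment and sums each sequence's joint probability into its cluster, whereas here meaning-clustering is approximated by exact string match and samples are treated as equiprobable, purely for illustration.

```python
import math
from collections import Counter

def semantic_entropy(answers):
    """Toy semantic-entropy estimate over resampled answers.

    Assumptions (not the published method): answers with identical
    normalized text share a meaning cluster, and every sample carries
    equal probability mass instead of its joint generation probability.
    High entropy = the model answers with inconsistent meanings,
    a signal of confabulation; low entropy = consistent answers.
    """
    clusters = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    probs = [count / n for count in clusters.values()]
    return -sum(p * math.log(p) for p in probs)
```

With real models, the string-match stand-in would be replaced by an entailment model so that paraphrases like "Paris" and "It's Paris" land in the same cluster.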