Sebastian Farquhar

600 posts

@seb_far

Research Scientist @DeepMind - AI Alignment. Associate Member @OATML_Oxford and RainML @UniofOxford. All views my dog's.

Oxford, UK · Joined September 2012
136 Following · 2.9K Followers
Sebastian Farquhar reposted
David Lindner
David Lindner@davlindner·
New DeepMind x UK AISI paper: what would it take to prevent harm from misaligned AI agents via monitoring in real deployments? We wrote a safety case sketch for control monitoring. No flashy results, but lots of important details for deploying future AI agents safely!
Replies: 7 · Reposts: 30 · Likes: 93 · Views: 19.3K
Sebastian Farquhar
Sebastian Farquhar@seb_far·
I'm hiring at DeepMind AGI Safety! Looking for research engineers to help assess catastrophic risks from frontier models. Our work directly informs safety cases and governance. Lon/SF/NYC - engineers/scientists both wanted
Replies: 19 · Reposts: 45 · Likes: 445 · Views: 27.7K
Sebastian Farquhar reposted
Neel Nanda
Neel Nanda@NeelNanda5·
DeepMind AGI Safety is hiring! We're looking for research engineers to help assess catastrophic frontier risks from Gemini and whether our mitigations are sufficient. I think this is a highly impactful role and I'd love to get strong candidates! Lon/NYC/SF
Replies: 18 · Reposts: 44 · Likes: 487 · Views: 38K
Sebastian Farquhar
Sebastian Farquhar@seb_far·
In the final stages of assembling your ICML submission? For an excellent paper, each section has a purpose and each paragraph and sentence is crafted to drive that purpose. Tips on how to get the most out of your paper in link reply 👇🔗
Replies: 1 · Reposts: 2 · Likes: 10 · Views: 1.8K
Sebastian Farquhar reposted
Anca Dragan
Anca Dragan@ancadianadragan·
New paper from my team on avoiding reward hacking. MONA reduced RL's ability to pursue a multi-turn reward hacking strategy by doing myopic optimization with a trusted advantage/value estimator. Note that this can mean a performance hit depending on how good that estimator is, and it's important to keep pushing on that safe-and-capable Pareto frontier. deepmindsafetyresearch.medium.com/mona-a-method-…
David Lindner@davlindner

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵

Replies: 2 · Reposts: 6 · Likes: 47 · Views: 5K
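The tweets above describe MONA's core idea: train each step on a myopic, one-step objective scored by a trusted value estimator, rather than on the full rolled-out multi-turn return, so RL cannot reinforce plans whose payoff only appears many steps later. A minimal sketch of that credit-assignment difference, with a hypothetical `myopic_advantages` helper (not the paper's actual implementation):

```python
def myopic_advantages(rewards, values, gamma=1.0):
    """Illustrative sketch of MONA-style myopic credit assignment.

    Each step t is scored against a one-step target built from a
    *trusted* value estimate of the next state, instead of the full
    trajectory return -- so multi-step reward hacks that only pay off
    later receive no extra credit at step t.

    rewards: per-step rewards r_t
    values:  trusted value estimates V(s_t), one per step
    """
    advantages = []
    for t in range(len(rewards)):
        # trusted estimate of what follows; 0 at the final step
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        # one-step target, not the rolled-out multi-turn return
        advantages.append(rewards[t] + gamma * next_v - values[t])
    return advantages
```

As the thread notes, performance now hinges on the quality of the trusted estimator: a weak `values` function caps what the agent can learn to plan.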
Sebastian Farquhar reposted
Rohin Shah
Rohin Shah@rohinmshah·
New AI safety paper! Introduces MONA, which avoids incentivizing alien long-term plans. This also implies “no long-term RL-induced steganography” (see the loans environment). So you can also think of this as a project about legible chains of thought. x.com/davlindner/sta…
David Lindner@davlindner

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵

Replies: 0 · Reposts: 13 · Likes: 89 · Views: 7.1K
Sebastian Farquhar
Sebastian Farquhar@seb_far·
By default, LLM agents with long action sequences use early steps to undermine your evaluation of later steps; a big alignment risk. Our new paper mitigates this, preserves the ability for long-term planning, and doesn't assume you can detect the undermining strategy. 👇
David Lindner@davlindner

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵

Replies: 1 · Reposts: 1 · Likes: 19 · Views: 1.4K
Sebastian Farquhar
Sebastian Farquhar@seb_far·
@MaxiIgl I just started with Bluesky. Missing a lot of people, but the posts are much better.
Replies: 1 · Reposts: 0 · Likes: 1 · Views: 97
Maximilian Igl
Maximilian Igl@MaxiIgl·
@seb_far I haven't looked into those other ones at all - any particular one you'd recommend I should look into? Which one did you find most useful?
Replies: 1 · Reposts: 0 · Likes: 0 · Views: 146
Sebastian Farquhar
Sebastian Farquhar@seb_far·
Did you know that on the other twitter-like sites people actually post links to neat articles and pages? I'd forgotten what a killer feature that was. 10x value from 1/10th the posts.
Replies: 1 · Reposts: 0 · Likes: 6 · Views: 674
Dan Roy
Dan Roy@roydanroy·
@y0b1byte There is no transfer learning in the worst case. And so the problem is under specified.
Replies: 3 · Reposts: 0 · Likes: 2 · Views: 789
yobibyte
yobibyte@y0b1byte·
What does 'out-of-distribution generalisation' mean for you? Please answer before you read others' replies.
Replies: 25 · Reposts: 0 · Likes: 16 · Views: 7.2K
Sebastian Farquhar
Sebastian Farquhar@seb_far·
@paul_cal @_xjdr Ah sorry I misunderstood. Yes, your first sentence is what we do, I now understand the later sentences are a different proposal.
Replies: 0 · Reposts: 0 · Likes: 1 · Views: 40
Paul Calcraft
Paul Calcraft@paul_cal·
@seb_far @_xjdr Sorry just the first sentence was describing your work, which I think doesn't contradict your explanation? Let me know if I'm missing something
Replies: 1 · Reposts: 0 · Likes: 1 · Views: 77
Sebastian Farquhar
Sebastian Farquhar@seb_far·
@paul_cal @_xjdr Not quite - you take the generation probability of a sequence and compute the entropy of the probabilities of multiple sequences. Each individual sequence only gets one joint prob over all sampled tokens.
Replies: 1 · Reposts: 0 · Likes: 2 · Views: 103
Paul Calcraft
Paul Calcraft@paul_cal·
@_xjdr Nature paper from earlier this year used semantic entropy across multiple regenerations to detect hallucinations. But from what I've read from you, you're doing ent/varent over output token logits for a single generation path? (& then resampling) x.com/seb_far/status…
Sebastian Farquhar@seb_far

For those who have been using our method published in an earlier ICLR version, a minor tweak leading to big performance improvement is to estimate the entropy with a slightly different summation. 7/

Replies: 2 · Reposts: 0 · Likes: 24 · Views: 4.2K
Sebastian Farquhar reposted
Allan Dafoe
Allan Dafoe@AllanDafoe·
Update: we’re hiring for multiple positions! Join GDM to shape the frontier of AI safety, governance, and strategy. Priority areas: forecasting AI, geopolitics and AGI efforts, FSF risk management, agents, global governance. More details below: 🧵
Allan Dafoe@AllanDafoe

We are hiring! Google DeepMind's Frontier Safety and Governance team is dedicated to mitigating frontier AI risks; we work closely with technical safety, policy, responsibility, security, and GDM leadership. Please encourage great people to apply! 1/ boards.greenhouse.io/deepmind/jobs/…

Replies: 2 · Reposts: 19 · Likes: 143 · Views: 42.8K
Aidan Gomez
Aidan Gomez@aidangomez·
The most beautiful, intelligent, and kind woman I’ve ever known agreed to marry me.
Replies: 155 · Reposts: 9 · Likes: 1.4K · Views: 162.9K
Sebastian Farquhar reposted
Anca Dragan
Anca Dragan@ancadianadragan·
So freaking proud of the AGI safety&alignment team -- read here a retrospective of the work over the past 1.5 years across frontier safety, oversight, interpretability, and more. Onwards! alignmentforum.org/posts/79BPxvSs…
Replies: 6 · Reposts: 61 · Likes: 322 · Views: 48.4K
Sebastian Farquhar reposted
Joshua Schrier
Joshua Schrier@JoshuaSchrier·
This was a really interesting article and idea for detecting #llm confabulations, so I decided to think through it from scratch in @WolframResearch #Mathematica and write a tutorial example implementation: jschrier.github.io/blog/2024/07/3…
Sebastian Farquhar@seb_far

Is your LLM hallucinating? 👻 Our @Nature paper shows how to detect when an LLM is making things up. A 'confabulating' LLM answers with inconsistent meanings when re-asked the same question. We use this to estimate uncertainty and detect confabulations. Learn more 🧵👇 1/

Replies: 0 · Reposts: 1 · Likes: 2 · Views: 718
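The threads above describe semantic entropy in two parts: each sampled answer gets one joint generation probability over its tokens, and the entropy is then taken over answers grouped by meaning; a confabulating model spreads mass across many meanings when re-asked the same question. A toy sketch of that second step, under loud assumptions: the real method clusters answers by bidirectional entailment and sums each sequence's joint probability into its cluster, whereas here meaning-clustering is approximated by exact string match and samples are treated as equiprobable, purely for illustration.

```python
import math
from collections import Counter

def semantic_entropy(answers):
    """Toy semantic-entropy estimate over resampled answers.

    Assumptions (not the published method): answers with identical
    normalized text share a meaning cluster, and every sample carries
    equal probability mass instead of its joint generation probability.
    High entropy = the model answers with inconsistent meanings,
    a signal of confabulation; low entropy = consistent answers.
    """
    clusters = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    probs = [count / n for count in clusters.values()]
    return -sum(p * math.log(p) for p in probs)
```

With real models, the string-match stand-in would be replaced by an entailment model so that paraphrases like "Paris" and "It's Paris" land in the same cluster.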