Andrei Lupu
@_andreilupu

277 posts

DPhil student @FLAIR_Ox and @AIatMeta. Previously @Mila_Quebec and @rllabmcgill. Theory of Mind / Coordination / Rainbow Teaming 🌈 Opinions my own.

Joined December 2016
359 Following · 774 Followers

Pinned Tweet
Andrei Lupu @_andreilupu
Theory of Mind (ToM) is crucial for next gen LLM Agents, yet current benchmarks suffer from multiple shortcomings. Enter 💽 Decrypto, an interactive benchmark for multi-agent reasoning and ToM in LLMs! Work done with @TimonWilli & @j_foerst at @AIatMeta & @FLAIR_Ox 🧵👇
4 replies · 26 reposts · 104 likes · 23.2K views
Andrei Lupu retweeted
Harry Mayne @HarryMayne5
New paper. A Positive Case for Faithfulness. When asked to explain their decisions, LLMs can give highly plausible self-explanations. But are these explanations actually faithful, or are they just post-hoc rationalizations? We measure faithfulness via simulatability.
2 replies · 12 reposts · 51 likes · 2.4K views
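The faithfulness-via-simulatability setup in this tweet can be sketched as follows. Everything here is an illustrative assumption, not the paper's actual protocol: `simulator` stands in for any predictor of the model's decisions, `toy_sim` is a made-up example, and the gain metric is simply accuracy-with-explanations minus accuracy-without.

```python
def simulatability_gain(simulator, inputs, model_outputs, explanations):
    """Faithfulness via simulatability: how much better does a simulator
    predict the model's decisions when it also sees the model's own
    self-explanations, versus seeing the inputs alone?"""
    # Baseline: simulator predicts from the input only (no explanation).
    base = sum(simulator(x, None) == y for x, y in zip(inputs, model_outputs))
    # With explanations: simulator also sees the model's self-explanation.
    with_expl = sum(simulator(x, e) == y
                    for x, y, e in zip(inputs, model_outputs, explanations))
    return (with_expl - base) / len(inputs)

# Hypothetical toy simulator: guesses label 0 without an explanation,
# and simply follows the explanation when one is given.
def toy_sim(x, explanation):
    return 0 if explanation is None else explanation
```

Under this reading, an explanation is faithful to the extent that it raises the simulator's accuracy at predicting what the model actually did; a plausible post-hoc rationalization that carries no predictive signal yields a gain near zero.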
Andrei Lupu retweeted
MattStaniek @MattStaniek
Windermere vs Buttermere.

Windermere: since 2017, over fifty billion litres of treated sewage and more than 30,000 hours of untreated sewage have made their way into the lake.

Buttermere: a lake where the water company is not allowed to put any sewage at all.

Images taken on 24 and 25 April 2025. The only way to protect Windermere is to end the sewage pollution once and for all. savewindermere.com
42 replies · 1.6K reposts · 6.1K likes · 198.3K views
Andrei Lupu retweeted
Jakob Foerster @j_foerst
My Oxford lab (@FLAIR_Ox) is hiring PhD students! If you are thinking of doing a PhD in blue-sky and (sort of crazy) ambitious ML, have a technically strong background, and love to work with others, please consider all options for joining us:
1) Direct entry - deadline is the 1st of Dec AOE (ox.ac.uk/admissions/gra…)
2) AIMS CDT (ox.ac.uk/admissions/gra…) - deadline on the 27th of Jan 2026 AOE
3) EIT CDT (ox.ac.uk/admissions/gra…) - deadline on the 7th of Jan 2026 AOE
Student funding is a real constraint/concern in the UK (especially for overseas students), and by applying for all three programmes you can maximize your chances of ending up in a very, very special place.
3 replies · 30 reposts · 162 likes · 14K views
Andrei Lupu @_andreilupu
This resonates deeply. It is sad to read, and unfortunately the norm in recent years. The most original papers or those at the intersection of different sub-fields get the worst of it too, since it is harder to grasp the value of their contribution at a glance.
Peter Richtarik @peter_richtarik

I am an AC for ICLR 2026. One of the papers in my batch was just withdrawn. The authors wrote a brief response, explaining why the reviewers failed at their job. I agree with most of their comments. The authors gave up. They are fed up. Just like many of us. I understand. We pretend the emperor has clothes, but he is naked. Here is the final part of their withdrawal notice. I took the liberty to make it public, to highlight that what we are doing with AI conference reviews these last few years is, basically, madness.

---

Comment: We thank the reviewers for their time. However, upon reading the reviews for our paper, it became immediately apparent that the four "reject" ratings are not based on good-faith academic disagreement, but on a critical failure to read the submitted paper. The reviews are rife with demonstrably false claims that are directly contradicted by the text. The core justifications for rejection rely on asserting that key components are "missing" when they are explicitly detailed in the manuscript. Some specific examples (several claims are simply fabricated):

Claim: Harder tasks like GSM8K are missing. Fact: GSM8K results appear in many tables, such as Table 2 (Section 4.2) and Appendix G.
Claim: The method does not use per-layer ranks. Fact: This is the entire point of our method; the reviewer clearly mistook our method for the baselines (Section 2, Table 1).
Claim: The GP kernel is not specified. Fact: It is specified in Appendix E (Table 6).
Claim: There is no ablation of the method's three stages. Fact: Section 4.4 ("Ablation Study") and Appendix J are dedicated to this.

Reviewers have a fundamental responsibility to read and evaluate the work they are assigned. The nature of these errors is so fundamental, so systemic in overlooking explicit content, that it goes far beyond what "limited time" or "oversight" can explain. This work has gone through several rounds of revision over the last year. In earlier submissions, the paper usually received borderline or weak-accept scores.

Numerous signs strongly suggest that some reviewers are relying entirely on AI tools to automatically generate peer reviews, rather than fulfilling their fundamental responsibility of personally reading and evaluating manuscripts. We strongly protest this. It is a gross disrespect to the authors and a flagrant desecration of the reviewer's sacred duty, and it fundamentally undermines the integrity of the entire peer-review process.

Given that the reviews are not based on the actual content of our paper, we have decided to withdraw the submission. We leave this comment so that future readers of the OpenReview page are aware that the items described as "missing" are already present in the submitted manuscript. These negative reviews are factually unsound and do not reflect the content of the paper. We cannot and will not accept an assessment that is not based on the work we actually submitted.

0 replies · 2 reposts · 7 likes · 1.4K views
Andrei Lupu retweeted
Minqi Jiang @MinqiJiang
What if you kept asking an LLM to "make it better"?

In some recent work at FAIR, we investigate how we can efficiently use RL to fine-tune LLMs to iteratively self-improve on their previous solutions at inference time.

Training for iterated self-improvement can be costly: the naive approach to training for K self-improvement steps leads to K times the number of rollout steps per episode.

We introduce Exploratory Iteration (ExIt), an RL-based automatic curriculum method that bootstraps diverse training distributions of self-improvement tasks by upcycling the LLM's own responses at previous turns as the starting points for both self-improvement and *self-divergence*.

To decide what task to train on next, the curriculum prioritizes sampling of partial turn histories that led to higher return variance in their GRPO group (a learnability score that comes for free). This automatic curriculum over the bootstrapped task space teaches the model how to perform iterated self-improvement while only ever training it on single-step self-improvement tasks.

We evaluate ExIt in both single-turn (contest math problems) and multi-turn (BFCLv3 multi-turn tasks) settings, as well as on MLE-bench, where the LLM is run in a search scaffold to produce solutions to real Kaggle competitions. Across these eval settings, we find ExIt produces models with greater capacity for inference-time self-improvement than GRPO, and ExIt models can self-improve on test tasks for many more steps than the typical solution depth encountered during training, including a 22% improvement in MLE-bench performance compared to GRPO.
15 replies · 72 reposts · 405 likes · 40.9K views
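The curriculum step described in the tweet above (prioritizing partial turn histories by return variance within their GRPO group) can be sketched as follows. The buffer layout, function names, and proportional-sampling rule are my illustrative assumptions, not the actual ExIt implementation.

```python
import random

def learnability(rewards):
    """Variance of rewards within one GRPO group. High variance means the
    model sometimes succeeds and sometimes fails from this starting point,
    i.e. the task is currently learnable."""
    mean = sum(rewards) / len(rewards)
    return sum((r - mean) ** 2 for r in rewards) / len(rewards)

def sample_task(buffer, rng=random):
    """Pick a partial turn history with probability proportional to its
    learnability score; `buffer` holds (history, group_rewards) pairs."""
    scores = [learnability(rewards) for _, rewards in buffer]
    total = sum(scores)
    if total == 0:
        # Every group is all-success or all-failure: fall back to uniform.
        return rng.choice(buffer)[0]
    pick = rng.uniform(0, total)
    acc = 0.0
    for (history, _), score in zip(buffer, scores):
        acc += score
        if score > 0 and pick <= acc:
            return history
    return buffer[-1][0]
```

A history the model always solves (or always fails) scores zero variance and is rarely replayed, while mixed outcomes dominate training, which is the "free" learnability signal the tweet mentions.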
Joelle Pineau @jpineau1
I’m thrilled to be joining @cohere in the role of Chief AI Officer, helping advance cutting-edge research and product development. Cohere has an incredible team and mission. Exciting new chapter for me!
Cohere @cohere

We’re excited to announce $500M in new funding to accelerate our global expansion and build the next generation of enterprise AI technology! We are also welcoming two additions to our leadership team: Joelle Pineau as Chief AI Officer and Francois Chadwick as Chief Financial Officer. cohere.com/blog/august-20…

123 replies · 71 reposts · 1.7K likes · 180.3K views
Andrei Lupu retweeted
Keith Sakata, MD @KeithSakata
I’m a psychiatrist. In 2025, I’ve seen 12 people hospitalized after losing touch with reality because of AI. Online, I’m seeing the same pattern. Here’s what “AI psychosis” looks like, and why it’s spreading fast: 🧵
1.5K replies · 13.3K reposts · 92.9K likes · 7.7M views
Andrei Lupu @_andreilupu
@shlomifruchter Thanks for the link, but I don't think it's settled! Genie 3 seems to excel at realism and consistency for objects out of frame, but most examples are quite static. I would love to see how it handles more complex interactions (shooting, grabbing or throwing objects, etc.)
0 replies · 0 reposts · 1 like · 94 views
Andrei Lupu @_andreilupu
@alexUnder_sky Great theory of mind on your behalf! 😉 And of course, you know I also reached out to @kaggle and offered to port Decrypto over to Game Arena!
1 reply · 0 reposts · 1 like · 37 views
sacha🥝 @alexUnder_sky
@_andreilupu yeah, I expected (and looked forward to) you replying with your benchmark xd
1 reply · 0 reposts · 0 likes · 22 views
Andrei Lupu @_andreilupu
Games isolate key aspects of intelligence and make for fantastic evergreen benchmarks. Thrilled to see them come back in style! And if you're excited about LLM Theory of Mind, how about a game of Decrypto with your favourite LLM? 👀👇
Google DeepMind @GoogleDeepMind

We have a long history of using games to measure progress in AI. 🎮 That’s why we’re helping unveil the @Kaggle Game Arena: an open-source platform where models go head-to-head in complex games to help us gauge their capabilities. 🧵

1 reply · 1 repost · 5 likes · 656 views
Andrei Lupu @_andreilupu
Sloppy authors write sloppy reviews. Conferences should publish rejection rates for first and last authors. This cuts down the number of submissions, improves submission quality, and limits the number of sloppy authors being forced to review. Make it retroactive, too! 🧹
0 replies · 0 reposts · 7 likes · 390 views
Andrei Lupu retweeted
Alex Goldie @AlexDGoldie
1/ 🕵️ Algorithm discovery could lead to huge AI breakthroughs! But what is the best way to learn or discover new algorithms? I'm so excited to share our brand new @rl_conference paper which takes a step towards answering this! 🧵
3 replies · 36 reposts · 213 likes · 25.2K views