Philippe Laban

368 posts

@PhilippeLaban

Research Scientist @MSFTResearch. NLP/HCI Research.

New York City · Joined April 2022

776 Following · 1.3K Followers

Philippe Laban retweeted

Jessy Li@jessyjli·13 Mar

Want to know how well the models can brainstorm connections across different concepts? Super excited about @ManyaWadhwa1’s work on measuring associative creativity!

Manya Wadhwa@ManyaWadhwa1

⚛️ Introducing CREATE, a benchmark for creative associative reasoning in LLMs. Making novel, meaningful connections is key for scientific & creative works. We objectively measure how well LLMs can do this. 🧵👇

2.8K

Philippe Laban retweeted

Matthew Hutson@SilverJacket·11 Mar

“Why AI Chatbots Agree with You Even When You’re Wrong”: My latest for @IEEESpectrum, on LLM sycophancy. Thanks, @chengmyra1, @PhilippeLaban, @KaiShu0327. spectrum.ieee.org/ai-sycophancy

180

Philippe Laban retweeted

Arvind Narayanan@random_walker·6 Mar

The real sign of AI writing is not superficial stuff like “It’s not X—it’s Y”. It’s the hollowness. Polished writing but relatively mundane ideas. The giveaway is that you’re less impressed when you read it the second time. With good writing, it should be the other way around. I’m not sure this is inherently about AI. It’s more about the fact that people tend to turn to AI when they don’t have much to say. Reading text that has the syntactic smell of AI is mildly annoying, but when I read hollow writing I feel the writer is wasting my time, which is much more frustrating. So don’t do it. People are unlikely to respond to your email or subscribe to your newsletter or whatever you’re trying to get them to do. And they’ll probably remember that you betrayed their trust as a reader.

246

436.6K

Philippe Laban retweeted

Jocelyn Shen@jocelynjshen·3 Mar

Excited to share our #CHI2026 paper “Texterial: A Text-as-Material Interaction Paradigm for LLM-Mediated Writing” (done during internship at Microsoft Research) We imagine interacting with LLMs by treating text as a material like plants/clay. 📃arxiv.org/pdf/2603.00452 🧵[1/n]

158

16.8K

Philippe Laban retweeted

Lucy Li@lucy3_li·3 Mar

Models are now expert math solvers, and so AI for math education is receiving increasing attention. Our new preprint evaluates 11 VLMs on our QA benchmark, DrawEduMath. We highlight a startling gap: models perform less well on inputs from K-12 students who need more help. 🧵

5.7K

Philippe Laban@PhilippeLaban·25 Feb

@reliabytes @hiroakiLhayashi @ProfJenNeville It is whatever is set by default by each API provider. For example, it is "medium" reasoning for the OpenAI models. This is just to say we did not adjust any settings. Fwiw, our initial experiments showed reasoning was not always beneficial to avoid getting lost in conversation.

Bandit@reliabytes·25 Feb

@PhilippeLaban @hiroakiLhayashi @ProfJenNeville What is "default" reasoning considered? Hard to evaluate these results without that explicitly noted.

405

Philippe Laban@PhilippeLaban·24 Feb

LLMs *Still* Get Lost In Multi-Turn Conversation. We re-ran experiments with newer models. Performance still drops, but with modest gains: mostly from improvements on the Python coding task. Also: Lost in Conversation will be presented at ICLR 2026 🎉🇧🇷

285

23K

Philippe Laban@PhilippeLaban·25 Feb

@GetYourCheeson @hiroakiLhayashi @ProfJenNeville Yes, that's an interesting approach. The hope is we move to a world where the assistant LMs work out of the box without these tricks ^^

Earl Cheeson 🧀@GetYourCheeson·25 Feb

@PhilippeLaban @hiroakiLhayashi @ProfJenNeville a good focal lens is a good system prompt that keeps the ai on task. i make mine think it's under duress and if it mentions it's under duress it might be deleted. works wonders for me lately.

Philippe Laban@PhilippeLaban·25 Feb

@madmaxbr5 @hiroakiLhayashi @ProfJenNeville Yes, I understand how the post as is reads a little out of context. Our original thread gave a high-level summary of the experiments: x.com/PhilippeLaban/… Or the full paper, particularly Section 3: arxiv.org/pdf/2505.06120

Philippe Laban@PhilippeLaban

🆕paper: LLMs Get Lost in Multi-Turn Conversation In real life, people don’t speak in perfect prompts. So we simulate multi-turn conversations — less lab-like, more like real use. We find that LLMs get lost in conversation. 👀What does that mean? 🧵1/N 📄arxiv.org/abs/2505.06120

Max Andrews@madmaxbr5·25 Feb

@PhilippeLaban @hiroakiLhayashi @ProfJenNeville You should probably explain what full, concat, and sharded mean in this context…

102

Philippe Laban@PhilippeLaban·25 Feb

@vincent_koc @hiroakiLhayashi @ProfJenNeville Here's the link: arxiv.org/abs/2505.06120 !

187

Vincent Koc@vincent_koc·25 Feb

@PhilippeLaban @hiroakiLhayashi @ProfJenNeville Preprint?

487

Philippe Laban@PhilippeLaban·25 Feb

@ketansingh279 @hiroakiLhayashi @ProfJenNeville Hey Ketan, absolutely. Here's the preprint: arxiv.org/abs/2505.06120 I think Figures 2-3-4 do a good job of explaining the main experiment!

148

Ketan Singh@ketansingh279·25 Feb

@PhilippeLaban @hiroakiLhayashi @ProfJenNeville This seems useful. Can you help me understand what the numbers actually mean (or point me to the resource with more context)

272

Philippe Laban@PhilippeLaban·25 Feb

@GOrlanski @lucy3_li @hiroakiLhayashi @ProfJenNeville But I also think that a 10-20% drop is still significant, particularly because our user simulator is quite benign, so real degradations are likely even worse in practice (my conjecture).

Philippe Laban@PhilippeLaban·25 Feb

@GOrlanski @lucy3_li @hiroakiLhayashi @ProfJenNeville My guess is that Python-related tasks have received a lot more attention, and it's nice to see that it shows in our results in this way. It also shows the importance of having multi-domain benchmarks; otherwise we could be myopic in thinking that the more general problem is solved.

135

Philippe Laban@PhilippeLaban·25 Feb

@sanskxr02 @hiroakiLhayashi @ProfJenNeville Yes, I agree. In many ways our "user simulator" is quite benign, and varying the level of confusion introduced by the user more systematically to understand downstream effects is a good direction to go in next. For instance, what happens when the user changes their mind...

134

Sanskar Pandey@sanskxr02·25 Feb

@PhilippeLaban @hiroakiLhayashi @ProfJenNeville Good work. We did some work around this, I feel like quantifying multi-turn conversational stamina with different degrees of adversarial pressure could be a fun study too!

189

Philippe Laban@PhilippeLaban·25 Feb

@akshitwt @hiroakiLhayashi @ProfJenNeville Ooh, thanks for sharing and indeed very relevant! Let's plan to meet at ICLR :)

221

Akshit@akshitwt·25 Feb

@PhilippeLaban @hiroakiLhayashi @ProfJenNeville i saw this paper while browsing iclr submissions! very cool work; we did a similar study on a toy task for long-horizons, and found similar degradation results. it might be of interest (also will be at ICLR)! arxiv.org/abs/2509.09677

621

Philippe Laban retweeted

Hiroaki_Hayashi@hiroakiLhayashi·24 Feb

Latest models still get lost in multi-turn conversation, but less so for coding tasks!! Excited to present this work with @PhilippeLaban in 🇧🇷.

Philippe Laban@PhilippeLaban

453

Philippe Laban retweeted

Ewan Morrison@MrEwanMorrison·19 Feb

Paper shows: every major AI model gets dramatically worse the longer you talk to it. This is an important variant on the "synthetic data leads to model collapse" problem - the more an LLM ingests its own output, the sicker it gets.

Robert Youssef@rryssf_

Microsoft Research and Salesforce analyzed 200,000+ AI conversations and found something the entire industry already suspected but nobody would say out loud. every major model gets dramatically worse the longer you talk to it. GPT-4, Claude, Gemini, Llama. all of them. no exceptions. paper: arxiv.org/abs/2505.06120

108

486

19K

Philippe Laban retweeted

Hasan Toor@hasantoxr·19 Feb

🚨BREAKING: Microsoft Research + Salesforce just dropped a paper that should scare every AI builder.
They tested 15 top LLMs GPT-4.1, Gemini 2.5 Pro, Claude 3.7 Sonnet, o3, DeepSeek R1, Llama 4 across 200,000+ simulated conversations.
Single-turn prompt: 90% performance.
Multi-turn conversation: 65% performance.
Same model. Same task. Just... talking normally.
The culprit isn't intelligence. Aptitude only dropped 15%. Unreliability EXPLODED by 112%.
→ LLMs answer before you finish explaining (wrong assumptions get baked in permanently)
→ They fall in love with their first wrong answer and build on it
→ They forget the middle of your conversation entirely
→ Longer responses introduce more assumptions = more errors
Even reasoning models failed. o3 and DeepSeek R1 performed just as badly. Extra thinking tokens did nothing. Setting temperature to 0? Still broken.
The fix right now: give your AI everything upfront in one message instead of back-and-forth.
Every benchmark you've seen was tested on single-turn prompts in perfect lab conditions. Real conversations break every model on the market and nobody's talking about it.

700

1.7K

9.1K

1.6M

Philippe Laban retweeted

Robert Youssef@rryssf_·18 Feb

358

1.4K

5.5K

692.8K

Philippe Laban retweeted

Jad Kabbara@jad_kabbara·14 Feb

Still looking for emergency reviewers for 3 papers. Topics below. Please help if you can! Comment below or DM. Thanks! 1: Data Synthesis, LLM-based Agents, Agentic RL 2: Agentic search, RL, safety and alignment, tools 3: Long-Term Memory, agent memory, RL in agents, LLM agents

Jad Kabbara@jad_kabbara

I'm looking for emergency reviewers for several papers in the general area of LLM agents. More detailed topics in the two tweets below. If you can help, please DM or comment the paper number. Any help appreciated as reviews will be released in less than 48 hours. Thanks!

1.2K
