Philippe Laban

368 posts

Philippe Laban @PhilippeLaban

Research Scientist @MSFTResearch. NLP/HCI Research.

New York City · Joined April 2022
776 Following · 1.3K Followers
Philippe Laban retweeted
Arvind Narayanan @random_walker
The real sign of AI writing is not superficial stuff like “It’s not X—it’s Y”. It’s the hollowness. Polished writing but relatively mundane ideas. The giveaway is that you’re less impressed when you read it the second time. With good writing, it should be the other way around. I’m not sure this is inherently about AI. It’s more about the fact that people tend to turn to AI when they don’t have much to say. Reading text that has the syntactic smell of AI is mildly annoying, but when I read hollow writing I feel the writer is wasting my time, which is much more frustrating. So don’t do it. People are unlikely to respond to your email or subscribe to your newsletter or whatever you’re trying to get them to do. And they’ll probably remember that you betrayed their trust as a reader.
78 replies · 246 reposts · 2K likes · 436.6K views
Philippe Laban retweeted
Jocelyn Shen @jocelynjshen
Excited to share our #CHI2026 paper “Texterial: A Text-as-Material Interaction Paradigm for LLM-Mediated Writing” (done during an internship at Microsoft Research). We imagine interacting with LLMs by treating text as a material like plants/clay. 📃arxiv.org/pdf/2603.00452 🧵[1/n]
4 replies · 24 reposts · 158 likes · 16.8K views
Philippe Laban retweeted
Lucy Li @lucy3_li
Models are now expert math solvers, and so AI for math education is receiving increasing attention. Our new preprint evaluates 11 VLMs on our QA benchmark, DrawEduMath. We highlight a startling gap: models perform less well on inputs from K-12 students who need more help. 🧵
[image]
2 replies · 14 reposts · 55 likes · 5.8K views
Philippe Laban @PhilippeLaban
@reliabytes @hiroakiLhayashi @ProfJenNeville It is whatever is set by default by each API provider. For example, it is "medium" reasoning for the OpenAI models. This is just to say we did not adjust any settings. Fwiw, our initial experiments showed reasoning was not always beneficial to avoid getting lost in conversation.
0 replies · 0 reposts · 1 like · 39 views
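Since the reply above is about relying on each provider's default reasoning setting, here is a minimal sketch of how one could pin that setting explicitly instead. It assumes the OpenAI Python SDK's chat.completions interface and its reasoning_effort parameter for reasoning-capable models; the model name and prompt are placeholders, not the paper's actual setup.

```python
# Minimal sketch (assumed example, not the paper's setup): pinning the reasoning
# effort explicitly rather than relying on the provider default ("medium" for
# OpenAI reasoning models, as noted in the reply above).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",            # placeholder reasoning-capable model
    reasoning_effort="medium",  # "low" | "medium" | "high"; omit to use the API default
    messages=[
        {"role": "user", "content": "Write a Python function that merges two sorted lists."}
    ],
)
print(response.choices[0].message.content)
```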
Philippe Laban @PhilippeLaban
LLMs *Still* Get Lost In Multi-Turn Conversation. We re-ran experiments with newer models. Performance still drops, but with modest gains: mostly from improvements on the Python coding task. Also: Lost in Conversation will be presented at ICLR 2026 🎉🇧🇷
[image]
15 replies · 28 reposts · 285 likes · 23K views
Philippe Laban @PhilippeLaban
@GOrlanski @lucy3_li @hiroakiLhayashi @ProfJenNeville My guess is that Python-related tasks have received a lot more attention, and it's nice to see that it shows in our results in this way. Also shows the importance of having multi-domain benchmarks; otherwise we could be myopic in thinking that the more general problem is solved.
1 reply · 0 reposts · 1 like · 136 views
Philippe Laban @PhilippeLaban
@sanskxr02 @hiroakiLhayashi @ProfJenNeville Yes, I agree. In many ways our "user simulator" is quite benign, and varying the level of confusion introduced by the user more systematically to understand downstream effects is a good direction to go in next. For instance, what happens when the user changes their mind...
1 reply · 0 reposts · 1 like · 134 views
Philippe Laban retweeted
Ewan Morrison @MrEwanMorrison
Paper shows: every major AI model gets dramatically worse the longer you talk to it. This is an important variant on the "synthetic data leads to model collapse" problem - the more an LLM ingests its own output, the sicker it gets.
Robert Youssef @rryssf_

Microsoft Research and Salesforce analyzed 200,000+ AI conversations and found something the entire industry already suspected but nobody would say out loud. every major model gets dramatically worse the longer you talk to it. GPT-4, Claude, Gemini, Llama. all of them. no exceptions. paper: arxiv.org/abs/2505.06120

24 replies · 108 reposts · 486 likes · 19K views
Philippe Laban retweeted
Hasan Toor @hasantoxr
🚨BREAKING: Microsoft Research + Salesforce just dropped a paper that should scare every AI builder. They tested 15 top LLMs (GPT-4.1, Gemini 2.5 Pro, Claude 3.7 Sonnet, o3, DeepSeek R1, Llama 4) across 200,000+ simulated conversations. Single-turn prompt: 90% performance. Multi-turn conversation: 65% performance. Same model. Same task. Just... talking normally. The culprit isn't intelligence. Aptitude only dropped 15%. Unreliability EXPLODED by 112%.
→ LLMs answer before you finish explaining (wrong assumptions get baked in permanently)
→ They fall in love with their first wrong answer and build on it
→ They forget the middle of your conversation entirely
→ Longer responses introduce more assumptions = more errors
Even reasoning models failed. o3 and DeepSeek R1 performed just as badly. Extra thinking tokens did nothing. Setting temperature to 0? Still broken. The fix right now: give your AI everything upfront in one message instead of back-and-forth. Every benchmark you've seen was tested on single-turn prompts in perfect lab conditions. Real conversations break every model on the market and nobody's talking about it.
[image]
700 replies · 1.7K reposts · 9K likes · 1.6M views
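The "fix" mentioned in the tweet above, giving the model everything upfront instead of drip-feeding requirements across turns, can be sketched roughly as below. This is an assumed illustration, not the paper's exact protocol; it uses the OpenAI Python SDK, and the model name, task text, and turn contents are made-up placeholders.

```python
# Rough illustration (assumed example): the same requirements sent turn by turn
# vs. consolidated into a single message upfront.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1"  # placeholder model name

# Multi-turn ("drip-fed") version: requirements arrive piece by piece, the
# setting where the paper reports the largest reliability drops.
turns = [
    "I need a function to deduplicate records.",
    "The records are dicts, and duplicates share the same 'id' key.",
    "For duplicates, keep the one with the latest 'updated_at' value.",
]
messages = []
for turn in turns:
    messages.append({"role": "user", "content": turn})
    reply = client.chat.completions.create(model=MODEL, messages=messages)
    messages.append({"role": "assistant", "content": reply.choices[0].message.content})

# Single-turn version: the same requirements consolidated into one message.
consolidated = (
    "Write a Python function that deduplicates a list of dicts sharing the same "
    "'id' key, keeping the record with the latest 'updated_at' value."
)
single = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": consolidated}],
)
print(single.choices[0].message.content)
```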
Philippe Laban retweeted
Robert Youssef @rryssf_
Microsoft Research and Salesforce analyzed 200,000+ AI conversations and found something the entire industry already suspected but nobody would say out loud. every major model gets dramatically worse the longer you talk to it. GPT-4, Claude, Gemini, Llama. all of them. no exceptions. paper: arxiv.org/abs/2505.06120
[image]
358 replies · 1.4K reposts · 5.5K likes · 692.8K views
Philippe Laban retweeted
Jad Kabbara @jad_kabbara
Still looking for emergency reviewers for 3 papers. Topics below. Please help if you can! Comment below or DM. Thanks!
1: Data Synthesis, LLM-based Agents, Agentic RL
2: Agentic search, RL, safety and alignment, tools
3: Long-Term Memory, agent memory, RL in agents, LLM agents
[image]
Jad Kabbara @jad_kabbara

I'm looking for emergency reviewers for several papers in the general area of LLM agents. More detailed topics in the two tweets below. If you can help, please DM or comment the paper number. Any help appreciated as reviews will be released in less than 48 hours. Thanks!

0 replies · 1 repost · 5 likes · 1.2K views