Philippe Laban

397 posts

Philippe Laban

@PhilippeLaban

Research Scientist @MSFTResearch. NLP/HCI Research.

New York City · Joined April 2022
807 Following · 1.5K Followers
Philippe Laban retweeted
Tal Schuster@TalSchuster·
We've just released open-source MTP-style drafters for Gemma 4 models ⚡ Now Gemma 4 models are even faster on your choice of hardware, without losing quality! Grateful for the fruitful collaboration between my team, the Gemma team, and many collaborators to enable this release!
Omar Sanseviero@osanseviero

Excited to introduce Gemma 4 Multi-Token Prediction Drafters ⚡️ Accelerated inference right in your pocket:
- Up to a 3x speedup
- Same quality guarantees
- Available in your favorite open-source tools

4 replies · 3 reposts · 27 likes · 3.1K views
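The drafters announced above rely on draft-and-verify decoding: a small draft model proposes several tokens ahead, and the large model checks them, accepting agreements cheaply and paying full price only at the first mismatch. A toy greedy sketch of that loop, where both "models" are stand-in functions (not Gemma APIs):

```python
def draft_model(prefix, k):
    # Toy drafter: propose the next k tokens (here, a fixed guess pattern).
    return [(prefix[-1] + 1) % 10 for _ in range(k)]

def target_model(prefix):
    # Toy target: the "correct" next token given the current prefix.
    return (prefix[-1] + 1) % 10

def speculative_decode(prefix, k=4, steps=3):
    """Greedy draft-and-verify: accept drafted tokens while the target
    model agrees; on the first mismatch, keep the target's token instead."""
    out = list(prefix)
    for _ in range(steps):
        for tok in draft_model(out, k):
            expected = target_model(out)
            if tok == expected:
                out.append(tok)       # accepted "for free"
            else:
                out.append(expected)  # one corrective target step
                break
    return out
```

Because the target only verifies, the output is guaranteed to match what the target model alone would have produced, which is where the "same quality guarantees" claim comes from.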
Philippe Laban@PhilippeLaban·
Hey @xtuffai, great question! The main experiment is indeed "first party" as you mention. But in the paper, we also implement a simple agentic harness (with a Python code-execution tool that can write files), and find that the four LLMs we test with tools (i.e., agentic) perform worse than without tools (non-agentic). We explain those results, and why the agentic setup doesn't just resolve the issue, in Section 4.2 of the paper. Would love to know what you think!
0 replies · 0 reposts · 0 likes · 12 views
Frederick Zimmerman
Do I understand this correctly?
- All the edit tasks involve the model as "first party": reading the entire document and writing it back out per edit instructions given to a text generator.
- None of the tasks use the model as "third party": writing a program that then applies deterministic operations to the document and then programmatically verifies them.
- In the first-party case, the unsurprising result is that if you tell an LLM to recreate a document several times, it will introduce changes every time, no matter how strictly you prompt it.
- The third-party case is not discussed in the paper, but it is what agentic harnesses spend a *lot* of their time doing: bash tools, grep, sed, and python "quoinks", and these deterministic flows are *not* corrupted. @PhilippeLaban
1 reply · 0 reposts · 1 like · 21 views
Philippe Laban retweeted
Rohan Paul@rohanpaul_ai·
New Microsoft paper shows that current AI assistants often damage documents during long editing jobs. Even the frontier models still ended up corrupting about 25% of document content on average, while many other models damaged far more.

The problem is that delegated AI work only makes sense if a model can keep a document correct across many edits, not just do 1 step well. The paper tests this with reversible task pairs, where a model edits a file and then tries to undo that edit, so a reliable system should return to the original document.

The authors built real work setups across 52 domains, from coding and science to accounting and music notation, and ran 19 models through 20 editing interactions. The failures were usually not lots of tiny slips but occasional big mistakes that silently broke parts of the document and then compounded over time.

Agentic tool use did not help in their tests, and bigger files, longer workflows, and irrelevant extra documents made the corruption worse. The reason this matters is that current LLMs can look strong in short demos or narrow coding tasks yet still be unreliable delegates for long real-world document work.

Paper Link: arxiv.org/abs/2604.15597
Paper Title: "LLMs Corrupt Your Documents When You Delegate"
25 replies · 77 reposts · 314 likes · 50.5K views
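The reversible-task-pair protocol described above boils down to a round-trip check: the model applies an edit, then its inverse, and the result is compared against the original document. A minimal sketch of that check, using difflib's similarity ratio as the fidelity metric (the metric choice is illustrative, not necessarily the paper's):

```python
import difflib

def round_trip_fidelity(original: str, restored: str) -> float:
    """Similarity between the original document and the document produced
    after a model applies an edit and then its inverse (1.0 = lossless)."""
    return difflib.SequenceMatcher(None, original, restored).ratio()

doc = "line1\nline2\nline3\n"
# A faithful round trip scores 1.0; a round trip that silently drops
# a line scores below 1.0, flagging corruption.
perfect = round_trip_fidelity(doc, doc)
lossy = round_trip_fidelity(doc, "line1\nline2\n")
```

The appeal of the reversible-pair design is that it needs no gold reference beyond the input itself: any deviation after undo is, by construction, model-introduced corruption.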
Philippe Laban retweeted
Ming Li @ UMD PhD@Ming_Liiii·
Excited to share our ACL 2026 work, trying to solve the issue raised by the ICLR Outstanding Paper "LLMs Get Lost In Multi-Turn Conversation"!

Our RLAAR (arxiv.org/pdf/2510.18731) is an RL framework that trains LLMs to both answer correctly and wait when context is insufficient, using verifiable accuracy and abstention rewards. This tackles a key weakness in today's conversational LLMs: they often answer too early, make wrong assumptions, and struggle to recover as conversations unfold.

We're also excited to see this challenge highlighted by "LLMs Get Lost In Multi-Turn Conversation" (arxiv.org/pdf/2505.06120) being recognized as an ICLR 2026 Outstanding Paper. Reliable conversational AI needs to know when to answer, and when to hold back.

#ACL2026 #ICLR2026 #LLM #RLVR #ConversationalAI
1 reply · 11 reposts · 72 likes · 5.4K views
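The reward design described above (verifiable accuracy plus an abstention reward) can be illustrated with a toy scalar reward: answering correctly earns full reward, abstaining is rewarded only when the context really is insufficient, and premature or wrong answers earn nothing. The function name, signature, and values below are hypothetical, not taken from the RLAAR paper:

```python
def rlaar_style_reward(answered: bool, correct: bool,
                       context_sufficient: bool,
                       abstain_bonus: float = 0.5) -> float:
    """Toy reward combining verifiable accuracy with an abstention term.
    All numeric values are illustrative, not the paper's."""
    if not answered:
        # Reward holding back only when waiting is actually justified.
        return abstain_bonus if not context_sufficient else 0.0
    return 1.0 if correct else 0.0
```

The key property such a reward enforces is that guessing on insufficient context is strictly dominated by waiting, which is exactly the "answer too early" failure mode the tweet describes.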
Philippe Laban retweeted
Joachim Baumann @ ICLR'26
We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild. In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check. Data, paper, & findings in the 🧵👇
14 replies · 77 reposts · 468 likes · 67K views
Philippe Laban retweeted
Eunsol Choi@eunsolc·
We study sampling diverse outputs from a suite of LLMs. One key surprise for me was that it's better to carefully pick a single model and sample it many times, rather than naively mixing outputs from multiple models.
Yuhan Liu@YuhanLiu_nlp

Can LLMs generate diverse outputs for open-ended questions? Is it helpful if we ensemble outputs from multiple models? We study 18 LLMs on 4 datasets and find that no single model is best at generating diverse outputs 👇/ 🧵

1 reply · 6 reposts · 36 likes · 5.3K views
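The single-model vs. mixed-model comparison above needs a diversity score over a set of sampled outputs. A common proxy is average pairwise dissimilarity; the sketch below uses difflib's string similarity as the base metric (the actual metric in the paper may differ):

```python
import itertools
import difflib

def avg_pairwise_diversity(outputs):
    """Mean pairwise dissimilarity (1 - similarity) over sampled outputs.
    0.0 = all samples identical; values near 1.0 = highly diverse."""
    pairs = list(itertools.combinations(outputs, 2))
    if not pairs:
        return 0.0
    dissim = [1 - difflib.SequenceMatcher(None, a, b).ratio()
              for a, b in pairs]
    return sum(dissim) / len(dissim)
```

With a score like this, the comparison reduces to: diversity of n samples from the best single model vs. diversity of n samples pooled from several models, holding the sample budget fixed.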
Philippe Laban retweeted
Yuhan Liu@YuhanLiu_nlp·
Can LLMs generate diverse outputs for open-ended questions? Is it helpful if we ensemble outputs from multiple models? We study 18 LLMs on 4 datasets and find that no single model is best at generating diverse outputs 👇/ 🧵
2 replies · 34 reposts · 171 likes · 22.4K views
Philippe Laban retweeted
Paul Röttger@paul_rottger·
New paper w/ @AISecurityInst: AI writing assistance distorts how others perceive AI users and their opinions.

Millions of people now use AI to help them write and communicate. In three large experiments (14k participants, 3M+ human ratings) we show that AI writing assistance systematically distorts writer personas: their perceived beliefs, personality, and identity. These distortions are consistent across AI models and persist even under realistic conditions of human oversight. 🧵
3 replies · 33 reposts · 117 likes · 16.6K views
Philippe Laban retweeted
elvis@omarsar0·
NEW paper from Microsoft. This is an important read. (bookmark it)

The work introduces DELEGATE-52, a benchmark simulating long document-editing workflows across 52 professional domains like coding, crystallography, and music notation. Across 19 tested models, even frontier ones (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) corrupted an average of 25% of document content by the end of long workflows. Agentic tool use didn't help. Lots of other insights in this one. Check it out below...

Paper: arxiv.org/abs/2604.15597
Learn to build effective AI agents in our academy: academy.dair.ai
14 replies · 76 reposts · 354 likes · 26.6K views
Philippe Laban retweeted
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
«most models can manipulate Python code losslessly» but «documents» get corrupted in the first 2 turns huh this is embarrassing
Philippe Laban@PhilippeLaban

Finding #1: Every model degrades documents over time. We tested 19 LLMs. Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) corrupt 25% of document content after 20 interactions. Average across all models: 50% content loss.

14 replies · 9 reposts · 204 likes · 28K views
Philippe Laban retweeted
Philippe Laban@PhilippeLaban·
Finding #5: It's not death by a thousand cuts. Models maintain near-perfect reconstruction in some rounds, then experience *critical failures*, losing 10% of content in a single step. These sparse critical failures explain ~80% of total degradation. Stronger models don't avoid small errors better; they delay catastrophic ones.
1 reply · 4 reposts · 47 likes · 4.4K views
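The "sparse critical failures" claim above can be operationalized by tracking the fraction of content retained after each round, computing the per-round drops, and asking what share of total degradation comes from large single-step drops. The 10% threshold mirrors the tweet, but the function itself is an illustrative sketch, not the paper's code:

```python
def critical_failure_share(retained, threshold=0.10):
    """retained: fraction of original content intact after each round,
    e.g. [1.0, 0.99, 0.79, ...]. Returns the share of total degradation
    caused by rounds whose single-step drop exceeds `threshold`."""
    drops = [max(0.0, a - b) for a, b in zip(retained, retained[1:])]
    total = sum(drops)
    if total == 0:
        return 0.0  # no degradation at all
    critical = sum(d for d in drops if d > threshold)
    return critical / total
```

On a trajectory like [1.0, 0.99, 0.79, 0.78], one 20-point collapse dominates two 1-point slips, so the critical share is over 90%: the "sparse failure" signature, as opposed to uniform small losses.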
Philippe Laban@PhilippeLaban·
New paper! LLMs Corrupt Your Documents When You Delegate

LLMs are enabling a new way of working: delegated work, where users supervise an LLM as it edits documents on their behalf. Delegation requires trust: does the LLM complete tasks without introducing errors? We simulate delegation across 52 professional domains and find that LLMs Corrupt Your Documents When You Delegate. 🧵1/N
44 replies · 157 reposts · 885 likes · 119.5K views