Philippe Laban

397 posts

Philippe Laban

@PhilippeLaban

Research Scientist @MSFTResearch. NLP/HCI Research.

New York City · Joined April 2022
807 Following · 1.5K Followers
Philippe Laban retweeted
Tal Schuster@TalSchuster·
We've just released open-source MTP-style drafters for Gemma 4 models ⚡ Now Gemma 4 models are even faster on your choice of hardware, without losing quality! Grateful for the fruitful collaboration between my team, the Gemma team, and many collaborators to enable this release!
Omar Sanseviero@osanseviero

Excited to introduce Gemma 4 Multi-Token Prediction Drafters ⚡️ Accelerated inference right in your pocket:
- Up to a 3x speedup
- Same quality guarantees
- Available in your favorite open-source tools

4 replies · 3 reposts · 27 likes · 3.1K views
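The drafters announced above rely on draft-and-verify decoding: a small draft model proposes several tokens ahead, and the large model checks them, accepting agreements cheaply and paying full price only at the first mismatch. A toy greedy sketch of that loop, where both "models" are stand-in functions (not Gemma APIs):

```python
def draft_model(prefix, k):
    # Toy drafter: propose the next k tokens (here, a fixed guess pattern).
    return [(prefix[-1] + 1) % 10 for _ in range(k)]

def target_model(prefix):
    # Toy target: the "correct" next token given the current prefix.
    return (prefix[-1] + 1) % 10

def speculative_decode(prefix, k=4, steps=3):
    """Greedy draft-and-verify: accept drafted tokens while the target
    model agrees; on the first mismatch, keep the target's token instead."""
    out = list(prefix)
    for _ in range(steps):
        for tok in draft_model(out, k):
            expected = target_model(out)
            if tok == expected:
                out.append(tok)       # accepted "for free"
            else:
                out.append(expected)  # one corrective target step
                break
    return out
```

Because the target only verifies, the output is guaranteed to match what the target model alone would have produced, which is where the "same quality guarantees" claim comes from.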
Philippe Laban@PhilippeLaban·
Hey @xtuffai, great question! The main experiment is indeed "first party" as you mention. But in the paper, we also implement a simple agentic harness (with a Python code-execution tool that can write files), and find that the four LLMs we test with tools (i.e., agentic) perform worse than without tools (non-agentic). We explain those results, and why the agentic setup doesn't just resolve the issue, in Section 4.2 of the paper. Would love to know what you think!
0 replies · 0 reposts · 0 likes · 12 views
Frederick Zimmerman
Do I understand this correctly?
- All the edit tasks involve the model as "first party": reading the entire document and writing it back out per edit instructions given to a text generator.
- None of the tasks use the model as "third party": writing a program that then applies deterministic operations to the document and then programmatically verifies them.
- In the first-party case, the unsurprising result is that if you tell an LLM to recreate a document several times, it will introduce changes every time, no matter how strictly you prompt it.
- The third-party case is not discussed in the paper, but it is what agentic harnesses spend a *lot* of their time doing: bash tools, grep, sed, and python "quoinks", and these deterministic flows are *not* corrupted. @PhilippeLaban
1 reply · 0 reposts · 1 like · 21 views
Philippe Laban retweeted
Rohan Paul@rohanpaul_ai·
New Microsoft paper shows that current AI assistants often damage documents during long editing jobs. Even the frontier models still ended up corrupting about 25% of document content on average, while many other models damaged far more.

The problem is that delegated AI work only makes sense if a model can keep a document correct across many edits, not just do 1 step well. The paper tests this with reversible task pairs, where a model edits a file and then tries to undo that edit, so a reliable system should return to the original document.

The authors built real work setups across 52 domains, from coding and science to accounting and music notation, and ran 19 models through 20 editing interactions. The failures were usually not lots of tiny slips but occasional big mistakes that silently broke parts of the document and then compounded over time.

Agentic tool use did not help in their tests, and bigger files, longer workflows, and irrelevant extra documents made the corruption worse. The reason this matters is that current LLMs can look strong in short demos or narrow coding tasks yet still be unreliable delegates for long real-world document work.

Paper Link: arxiv.org/abs/2604.15597
Paper Title: "LLMs Corrupt Your Documents When You Delegate"
25 replies · 77 reposts · 314 likes · 50.5K views
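The reversible-task-pair protocol described above boils down to a round-trip check: the model applies an edit, then its inverse, and the result is compared against the original document. A minimal sketch of that check, using difflib's similarity ratio as the fidelity metric (the metric choice is illustrative, not necessarily the paper's):

```python
import difflib

def round_trip_fidelity(original: str, restored: str) -> float:
    """Similarity between the original document and the document produced
    after a model applies an edit and then its inverse (1.0 = lossless)."""
    return difflib.SequenceMatcher(None, original, restored).ratio()

doc = "line1\nline2\nline3\n"
# A faithful round trip scores 1.0; a round trip that silently drops
# a line scores below 1.0, flagging corruption.
perfect = round_trip_fidelity(doc, doc)
lossy = round_trip_fidelity(doc, "line1\nline2\n")
```

The appeal of the reversible-pair design is that it needs no gold reference beyond the input itself: any deviation after undo is, by construction, model-introduced corruption.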
Philippe Laban retweeted
Ming Li @ UMD PhD@Ming_Liiii·
Excited to share our ACL 2026 work, trying to solve the issue raised by the ICLR Outstanding Paper "LLMs Get Lost In Multi-Turn Conversation"!

Our RLAAR (arxiv.org/pdf/2510.18731) is an RL framework that trains LLMs to both answer correctly and wait when context is insufficient, using verifiable accuracy and abstention rewards. This tackles a key weakness in today's conversational LLMs: they often answer too early, make wrong assumptions, and struggle to recover as conversations unfold.

We're also excited to see this challenge highlighted by "LLMs Get Lost In Multi-Turn Conversation" (arxiv.org/pdf/2505.06120) being recognized as an ICLR 2026 Outstanding Paper. Reliable conversational AI needs to know when to answer, and when to hold back.

#ACL2026 #ICLR2026 #LLM #RLVR #ConversationalAI
1 reply · 11 reposts · 72 likes · 5.4K views
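The reward design described above (verifiable accuracy plus an abstention reward) can be illustrated with a toy scalar reward: answering correctly earns full reward, abstaining is rewarded only when the context really is insufficient, and premature or wrong answers earn nothing. The function name, signature, and values below are hypothetical, not taken from the RLAAR paper:

```python
def rlaar_style_reward(answered: bool, correct: bool,
                       context_sufficient: bool,
                       abstain_bonus: float = 0.5) -> float:
    """Toy reward combining verifiable accuracy with an abstention term.
    All numeric values are illustrative, not the paper's."""
    if not answered:
        # Reward holding back only when waiting is actually justified.
        return abstain_bonus if not context_sufficient else 0.0
    return 1.0 if correct else 0.0
```

The key property such a reward enforces is that guessing on insufficient context is strictly dominated by waiting, which is exactly the "answer too early" failure mode the tweet describes.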
Philippe Laban retweeted
Joachim Baumann @ ICLR'26
We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild. In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check. Data, paper, & findings in the 🧵👇
14 replies · 77 reposts · 468 likes · 67K views
Philippe Laban retweeted
Eunsol Choi@eunsolc·
We study sampling diverse outputs from a suite of LLMs. One key surprise for me was that it's better to carefully pick a single model and sample it many times, rather than naively mixing outputs from multiple models.
Yuhan Liu@YuhanLiu_nlp

Can LLMs generate diverse outputs for open-ended questions? Is it helpful if we ensemble outputs from multiple models? We study 18 LLMs on 4 datasets and find that no single model is best at generating diverse outputs 👇/ 🧵

1 reply · 6 reposts · 36 likes · 5.3K views
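The single-model vs. mixed-model comparison above needs a diversity score over a set of sampled outputs. A common proxy is average pairwise dissimilarity; the sketch below uses difflib's string similarity as the base metric (the actual metric in the paper may differ):

```python
import itertools
import difflib

def avg_pairwise_diversity(outputs):
    """Mean pairwise dissimilarity (1 - similarity) over sampled outputs.
    0.0 = all samples identical; values near 1.0 = highly diverse."""
    pairs = list(itertools.combinations(outputs, 2))
    if not pairs:
        return 0.0
    dissim = [1 - difflib.SequenceMatcher(None, a, b).ratio()
              for a, b in pairs]
    return sum(dissim) / len(dissim)
```

With a score like this, the comparison reduces to: diversity of n samples from the best single model vs. diversity of n samples pooled from several models, holding the sample budget fixed.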
Philippe Laban retweeted
Yuhan Liu@YuhanLiu_nlp·
Can LLMs generate diverse outputs for open-ended questions? Is it helpful if we ensemble outputs from multiple models? We study 18 LLMs on 4 datasets and find that no single model is best at generating diverse outputs 👇/ 🧵
2 replies · 34 reposts · 171 likes · 22.4K views
Philippe Laban retweeted
Paul Röttger@paul_rottger·
New paper w/ @AISecurityInst: AI writing assistance distorts how others perceive AI users and their opinions.

Millions of people now use AI to help them write and communicate. In three large experiments (14k participants, 3M+ human ratings) we show that AI writing assistance systematically distorts writer personas: their perceived beliefs, personality, and identity. These distortions are consistent across AI models and persist even under realistic conditions of human oversight. 🧵
3 replies · 33 reposts · 117 likes · 16.6K views
Philippe Laban retweeted
elvis@omarsar0·
NEW paper from Microsoft. This is an important read. (bookmark it)

The work introduces DELEGATE-52, a benchmark simulating long document-editing workflows across 52 professional domains like coding, crystallography, and music notation. Across 19 tested models, even frontier ones (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) corrupted an average of 25% of document content by the end of long workflows. Agentic tool use didn't help. Lots of other insights in this one. Check it out below...

Paper: arxiv.org/abs/2604.15597
Learn to build effective AI agents in our academy: academy.dair.ai
14 replies · 76 reposts · 354 likes · 26.6K views
Philippe Laban retweeted
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
«most models can manipulate Python code losslessly» but «documents» get corrupted in the first 2 turns huh this is embarrassing
Philippe Laban@PhilippeLaban

Finding #1: Every model degrades documents over time. We tested 19 LLMs. Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) corrupt 25% of document content after 20 interactions. Average across all models: 50% content loss.

14 replies · 9 reposts · 204 likes · 28K views
Philippe Laban retweeted
Philippe Laban@PhilippeLaban·
Finding #5: It's not death by a thousand cuts. Models maintain near-perfect reconstruction in some rounds, then experience *critical failures*, losing 10% of content in a single step. These sparse critical failures explain ~80% of total degradation. Stronger models don't avoid small errors better; they delay catastrophic ones.
1 reply · 4 reposts · 47 likes · 4.4K views
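The "sparse critical failures" claim above can be operationalized by tracking the fraction of content retained after each round, computing the per-round drops, and asking what share of total degradation comes from large single-step drops. The 10% threshold mirrors the tweet, but the function itself is an illustrative sketch, not the paper's code:

```python
def critical_failure_share(retained, threshold=0.10):
    """retained: fraction of original content intact after each round,
    e.g. [1.0, 0.99, 0.79, ...]. Returns the share of total degradation
    caused by rounds whose single-step drop exceeds `threshold`."""
    drops = [max(0.0, a - b) for a, b in zip(retained, retained[1:])]
    total = sum(drops)
    if total == 0:
        return 0.0  # no degradation at all
    critical = sum(d for d in drops if d > threshold)
    return critical / total
```

On a trajectory like [1.0, 0.99, 0.79, 0.78], one 20-point collapse dominates two 1-point slips, so the critical share is over 90%: the "sparse failure" signature, as opposed to uniform small losses.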
Philippe Laban@PhilippeLaban·
New paper! LLMs Corrupt Your Documents When You Delegate

LLMs are enabling a new way of working: delegated work, where users supervise an LLM as it edits documents on their behalf. Delegation requires trust: does the LLM complete tasks without introducing errors? We simulate delegation across 52 professional domains and find that LLMs Corrupt Your Documents When You Delegate. 🧵1/N
44 replies · 157 reposts · 885 likes · 119.5K views