Xilun Chen

732 posts


@ccsasuke

Research Scientist @ Meta FAIR

Seattle, WA · Joined March 2010
513 Following · 564 Followers
Pinned Tweet
Xilun Chen @ccsasuke
Introducing FLAME 🔥: Factuality-Aware Alignment for LLMs. We found that the standard alignment process **encourages** hallucination. We therefore propose factuality-aware alignment, which improves factuality while maintaining the LLM's general instruction-following capability. arxiv.org/abs/2405.01525
3 replies · 8 reposts · 35 likes · 7K views
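The tweet gives the finding (standard alignment encourages hallucination) rather than the recipe, but the core of factuality-aware alignment can be pictured as preference data ranked by a factuality verifier instead of generic helpfulness alone. A minimal sketch, assuming a hypothetical `factuality_score` judge (not FLAME's actual interface):

```python
# Hedged sketch of factuality-aware preference data, not FLAME's actual
# pipeline: `factuality_score` stands in for any claim-level verifier
# (e.g., a retrieval-backed fact checker).
from typing import Callable

def build_preference_pair(prompt: str, samples: list[str],
                          factuality_score: Callable[[str], float]) -> dict:
    """For a fact-seeking prompt, prefer the most factual sample so that
    DPO-style training rewards factuality rather than confident guessing."""
    ranked = sorted(samples, key=factuality_score, reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

# Toy usage with a stand-in scorer:
pair = build_preference_pair("Who wrote Hamlet?",
                             ["Shakespeare.", "Christopher Marlowe."],
                             factuality_score=lambda s: s == "Shakespeare.")
print(pair["chosen"])  # Shakespeare.
```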
Xilun Chen retweeted
Akari Asai @AkariAsai
1/ Hiring PhD students at CMU SCS (LTI/MLD) for Fall 2026 (deadline 12/10) 🎓 I work on open, reliable LMs: augmented LMs & agents (RAG, tool use, deep research), safety (hallucinations, copyright), and AI for science, code & multilinguality, and I'm open to bold new ideas! FAQ in 🧵
19 replies · 120 reposts · 641 likes · 147.6K views
Xilun Chen retweeted
Gargi Ghosh @gargighosh
New research from FAIR: Active Reading, a framework for learning a given set of material with self-generated learning strategies, for both general and expert domains (such as finance). Models absorb significantly more knowledge than with vanilla finetuning and the usual data-augmentation strategies.
Jessy Lin @realJessyLin

🚀 We're excited to release the 1T Active Reading-augmented Wikipedia dataset and open-source the WikiExpert model for the community:
Paper: arxiv.org/abs/2508.09494
Dataset: huggingface.co/datasets/faceb…
Model: huggingface.co/facebook/meta-…
Thanks to my great collaborators @vinceberges, @ccsasuke, @scottyih, @gargighosh, @barlas_berkeley! 🧵 [n/n]

0 replies · 11 reposts · 28 likes · 5K views
Xilun Chen retweeted
Jessy Lin @realJessyLin
🔍 How do we teach an LLM to 𝘮𝘢𝘴𝘵𝘦𝘳 a body of knowledge? In new work with @AIatMeta, we propose Active Reading 📙: a way for models to teach themselves new things by self-studying their training data. Results:
* 𝟔𝟔% on SimpleQA with an 8B model that studied the Wikipedia docs (+𝟑𝟏𝟑% vs. plain finetuning)
* a domain-specific expert model: 𝟏𝟔𝟎% vs. FT on FinanceBench knowledge
* an 8B Wikipedia expert competitive with 405B on factuality (💥 open-sourced!)
🧵 [1/n]
15 replies · 150 reposts · 1.1K likes · 132.1K views
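The mechanism described in the thread: the model generates its own study material from each source document and is then finetuned on that augmented data. A minimal sketch, with `generate` as an assumed stand-in for any LLM completion call; the paper's learning strategies are self-proposed by the model rather than this fixed list:

```python
# Sketch of the Active Reading idea described above: the model "studies"
# a document by generating its own learning material, which then becomes
# additional finetuning data. `generate` is an assumed completion helper.

STRATEGIES = [
    "Paraphrase the passage in your own words.",
    "Write question-answer pairs covering every fact in the passage.",
    "Explain how the passage connects to related background knowledge.",
]

def active_reading(document: str, generate) -> list[str]:
    """Return self-generated study texts for one document."""
    return [generate(f"{s}\n\nPassage:\n{document}") for s in STRATEGIES]

# The outputs would be mixed into the finetuning corpus instead of (or in
# addition to) plain next-token training on the raw document.
```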
Xilun Chen retweeted
Xueguang Ma @xueguang_ma
🚀 Introducing BrowseComp-Plus: a fairer and more transparent evaluation benchmark for deep-research agents, built on top of BrowseComp. It features:
- 📚 a fixed, carefully curated corpus of web documents
- ✅ human-verified positive documents
- ⚔️ web-mined, challenging hard negatives
With BrowseComp-Plus, you can thoroughly evaluate and compare the components of a deep-research system, e.g., GPT-5 + Qwen3-Embedding. Code, dataset, and leaderboard links are at the end of this thread.
10 replies · 35 reposts · 235 likes · 61.5K views
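Because the corpus is fixed and the positives are human-verified, retrieval components can be scored directly and reproducibly. A minimal recall@k sketch under assumed data shapes (not the benchmark's official tooling):

```python
# Sketch of retriever evaluation on a fixed corpus with verified positives,
# as a benchmark like BrowseComp-Plus enables. Data shapes are assumptions.

def recall_at_k(run: dict[str, list[str]],
                qrels: dict[str, set[str]], k: int = 10) -> float:
    """run: query id -> ranked doc ids; qrels: query id -> positive doc ids."""
    hits = 0
    for qid, positives in qrels.items():
        retrieved = set(run.get(qid, [])[:k])
        if retrieved & positives:       # at least one verified positive found
            hits += 1
    return hits / len(qrels)

print(recall_at_k({"q1": ["d3", "d7"]}, {"q1": {"d7"}}, k=2))  # 1.0
```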
Xilun Chen retweeted
Rulin Shao @RulinShao
Factuality and logical reasoning (e.g., math, code) favor different sets of reasoning patterns. 🧑‍🍳 A fresh RL recipe to improve factuality is here — crafted by the amazing @ccsasuke!
Jason Weston @jaseweston

...is today a good day for new paper posts?
🤖 Learning to Reason for Factuality 🤖
📝: arxiv.org/abs/2508.05618
- New reward function for GRPO training of long CoTs for *factuality*
- Design stops reward hacking by favoring precision, detail AND quality
- Improves the base model across all axes
🧵 1/3

0 replies · 5 reposts · 74 likes · 7.7K views
Xilun Chen retweeted
Jason Weston @jaseweston
...is today a good day for new paper posts?
🤖 Learning to Reason for Factuality 🤖
📝: arxiv.org/abs/2508.05618
- New reward function for GRPO training of long CoTs for *factuality*
- Design stops reward hacking by favoring precision, detail AND quality
- Improves the base model across all axes
🧵 1/3
1 reply · 49 reposts · 382 likes · 36.7K views
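The bullet about reward hacking is the interesting design point: rewarding precision alone teaches the policy to make fewer, vaguer claims, so detail and overall quality must be rewarded jointly. One illustrative way to combine the terms (the scorers and weighting are assumptions, not the paper's function):

```python
# Illustrative composite factuality reward in the spirit of the tweet:
# precision alone invites reward hacking (emit fewer claims), so the
# number of claims (detail) and response quality are rewarded jointly.

def factuality_reward(num_supported: int, num_claims: int,
                      quality: float) -> float:
    """quality in [0, 1] from a general response-quality judge."""
    if num_claims == 0:
        return 0.0                      # refusing to claim earns nothing
    precision = num_supported / num_claims
    detail = min(num_claims / 10, 1.0)  # saturating bonus for coverage
    return precision * detail * quality

print(factuality_reward(9, 10, 0.8))   # high precision + detail -> 0.72
```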
Xilun Chen retweeted
Wenting Zhao @wzhao_nlp
Some personal news: I'll join @UMassAmherst CS as an assistant professor in fall 2026. Until then, I'll postdoc at @Meta NYC. Reasoning will continue to be my main interest, with a focus on data-centric approaches 🤩 If you're also interested, apply to work with me (PhDs & a postdoc)!
95 replies · 29 reposts · 849 likes · 71.1K views
Xilun Chen retweeted
AK @_akhaliq
Meta just dropped ReasonIR on Hugging Face: Training Retrievers for Reasoning Tasks.
5 replies · 48 reposts · 308 likes · 41.4K views
Xilun Chen retweeted
AI at Meta @AIatMeta
Today is the start of a new era of natively multimodal AI innovation. Today, we're introducing the first Llama 4 models: Llama 4 Scout and Llama 4 Maverick — our most advanced models yet and the best in their class for multimodality.
Llama 4 Scout
• 17B-active-parameter model with 16 experts.
• Industry-leading context window of 10M tokens.
• Outperforms Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a broad range of widely accepted benchmarks.
Llama 4 Maverick
• 17B-active-parameter model with 128 experts.
• Best-in-class image grounding, with the ability to align user prompts with relevant visual concepts and anchor model responses to regions in the image.
• Outperforms GPT-4o and Gemini 2.0 Flash across a broad range of widely accepted benchmarks.
• Achieves comparable results to DeepSeek v3 on reasoning and coding, at half the active parameters.
• Unparalleled performance-to-cost ratio, with a chat version scoring an ELO of 1417 on LMArena.
These models are our best yet thanks to distillation from Llama 4 Behemoth, our most powerful model yet. Llama 4 Behemoth is still in training and is currently seeing results that outperform GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM-focused benchmarks. We're excited to share more details about it even while it's still in flight.
Read more about the first Llama 4 models, including training and benchmarks ➡️ go.fb.me/gmjohs
Download Llama 4 ➡️ go.fb.me/bwwhe9
827 replies · 2.4K reposts · 12.8K likes · 3.7M views
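A note on the headline spec: "17B active parameters with 16 (or 128) experts" refers to mixture-of-experts routing, where each token activates only a few experts, so per-token compute tracks active rather than total parameters. A generic top-k routing sketch (illustrative only, not Llama 4's actual implementation):

```python
import torch
import torch.nn as nn

# Textbook top-k mixture-of-experts routing: each token runs only k of the
# n experts, which is why "active parameters" << total parameters.

def moe_forward(x: torch.Tensor, experts: list[nn.Module],
                router: nn.Linear, k: int = 2) -> torch.Tensor:
    """x: [tokens, dim]. Route each token to its top-k experts."""
    weights, idx = router(x).softmax(-1).topk(k, dim=-1)  # [tokens, k]
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e      # tokens whose slot-th pick is e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out

experts = [nn.Linear(8, 8) for _ in range(4)]
print(moe_forward(torch.randn(5, 8), experts, nn.Linear(8, 4)).shape)
```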
Xilun Chen retweeted
Zhuang Liu @liuzhuang1234
New paper: Transformers, but without normalization layers (1/n)
76 replies · 578 reposts · 4.1K likes · 1.3M views
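The tweet doesn't say what replaces the normalization layers; in the paper, it is an elementwise Dynamic Tanh, DyT(x) = γ·tanh(αx) + β with a learnable scalar α. A minimal PyTorch sketch of that module (treat initialization details as illustrative):

```python
import torch
import torch.nn as nn

# Dynamic Tanh (DyT): an elementwise drop-in replacement for LayerNorm,
# y = gamma * tanh(alpha * x) + beta, with a learnable scalar alpha and
# per-channel affine parameters. Minimal sketch for illustration.

class DyT(nn.Module):
    def __init__(self, dim: int, alpha0: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha0))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))       # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))       # per-channel shift

    def forward(self, x):
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

print(DyT(8)(torch.randn(2, 8)).shape)  # torch.Size([2, 8])
```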
Xilun Chen retweeted
Matthew Finlayson ✈️ ICLR
🧵 Adapting your LLM for new tasks is dangerous! A bad training set degrades models by encouraging hallucinations and other misbehavior. Our paper remedies this for RAG training by replacing gold responses with self-generated demonstrations. Check it out: arxiv.org/abs/2502.10
1 reply · 4 reposts · 7 likes · 458 views
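The proposed fix is simple to picture: rather than finetuning the RAG model on external gold responses, sample candidate responses from the model itself and keep only those that match the gold answer. A minimal sketch, with `generate` and `is_correct` as assumed stand-ins (not the paper's exact pipeline):

```python
# Sketch of self-generated demonstrations for RAG finetuning, per the
# tweet's description: training targets stay in the model's own output
# distribution instead of coming from an external gold response.

def self_demos(question, context, gold_answer, generate, is_correct, n=8):
    """Sample n candidates and keep those matching the gold answer."""
    demos = []
    for _ in range(n):
        response = generate(f"Context: {context}\nQ: {question}\nA:")
        if is_correct(response, gold_answer):
            demos.append({"prompt": question, "response": response})
    return demos  # finetune on these instead of the external gold text
```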
Xilun Chen retweeted
Srini Iyer @sriniiyer88
New paper! Byte-level models are finally competitive with tokenizer-based models, with better inference efficiency and robustness! Dynamic patching is the answer! Read all about it here: dl.fbaipublicfiles.com/blt/BLT__Patch… (1/n)
2 replies · 22 reposts · 90 likes · 18.5K views
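The "dynamic patching" here groups bytes into variable-length patches, in the paper by cutting a new patch where a small byte-LM's next-byte entropy spikes. A simplified sketch, with `next_byte_entropy` as an assumed helper (details differ from the real Byte Latent Transformer):

```python
# Simplified entropy-based dynamic patching: start a new patch when
# predicting the next byte gets hard (entropy above a threshold), so
# easy-to-predict spans form long, cheap patches.

def entropy_patches(data: bytes, next_byte_entropy, threshold: float = 2.0):
    patches, start = [], 0
    for i in range(1, len(data)):
        if next_byte_entropy(data[:i]) > threshold:
            patches.append(data[start:i])   # close patch before hard byte
            start = i
    patches.append(data[start:])
    return patches
```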
Xilun Chen retweeted
Jack Lin @jacklin_64
I will present our paper FLAME on factuality alignment for LLMs with @luyu_gao at #NeurIPS2024! 🎉 Join us at East Exhibit Hall A-C, Booth #3501 for a chat on Wed, Dec 11, 4:30–7:30 pm. Looking forward to connecting! More details: neurips.cc/virtual/2024/p…
Xilun Chen @ccsasuke

Introducing FLAME 🔥: Factuality-Aware Alignment for LLMs. We found that the standard alignment process **encourages** hallucination. We therefore propose factuality-aware alignment, which improves factuality while maintaining the LLM's general instruction-following capability. arxiv.org/abs/2405.01525

0 replies · 5 reposts · 14 likes · 2.7K views
Xilun Chen retweeted
Akari Asai @AkariAsai
🚨 I’m on the job market this year! 🚨 I’m completing my @uwcse Ph.D. (2025), where I identify and tackle key LLM limitations like hallucinations by developing new models—Retrieval-Augmented LMs—to build more reliable real-world AI systems. Learn more in the thread! 🧵
27 replies · 117 reposts · 813 likes · 126.7K views
Xilun Chen retweeted
Minghan @alexlimh23
1/ Excited to share that our paper "NEST🪺: Nearest Neighbor Speculative Decoding for LLM Generation and Attribution" is accepted at #NeurIPS2024! 🚀 Catch us at the poster session on Thu, Dec 12, 4:30–7:30 PM PST, East Exhibit Hall A-C, #2201. [Details: neurips.cc/virtual/2024/p…]
Minghan @alexlimh23

Curious about enhancing factuality and attribution in LLM generation? Check out our paper: arxiv.org/abs/2405.19325 Introducing NEST 🪺: Nearest Neighbor Speculative Decoding for LLM Generation and Attribution, a training-free method that incorporates real-world text into LLM generation.

2 replies · 6 reposts · 24 likes · 11.8K views
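As the quoted tweet describes it, NEST retrieves nearest-neighbor spans from a corpus, proposes them as draft continuations, and lets the base LLM keep the prefix it agrees with, which yields both speed and attribution. A schematic sketch with assumed helpers (not the paper's interface):

```python
# Schematic NEST-style decoding step: a retrieved corpus span serves as a
# draft that the base LLM verifies, speculative-decoding style; accepted
# spans carry provenance. `retrieve_span` and `model_token_probs` are
# assumed helpers.

def nest_step(prefix, retrieve_span, model_token_probs, accept_p=0.3):
    """Accept the longest draft prefix the model itself assigns high
    probability, and return the span's source for attribution."""
    span, source = retrieve_span(prefix)          # nearest-neighbor draft
    accepted = []
    for token in span:
        if model_token_probs(prefix + accepted).get(token, 0.0) < accept_p:
            break
        accepted.append(token)
    return accepted, (source if accepted else None)
```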
Xilun Chen retweeted
Jason Wei @_jasonwei
Excited to open-source a new hallucinations eval called SimpleQA! For a while it felt like there was no great benchmark for factuality, so we created an eval that is simple, reliable, and easy for researchers to use. Main features of SimpleQA:
1. Very simple setup: 4k diverse fact-seeking questions written by humans, each with a single, indisputable answer. Model completions are graded by an autograder as correct, incorrect, or not attempted.
2. We created it to be challenging for the current class of frontier models; both o1-preview and Claude Sonnet 3.5 are below 50% accuracy.
3. Reference answers have high correctness. Questions are written to be non-ambiguous, and reference answers were verified by two independent annotators. Questions are also written to be timeless, so SimpleQA can remain a useful benchmark even 5 or 10 years from now.
The way I think about evals is that they are an incentive for the AI community. New benchmarks in AI get saturated very quickly, and what they incentivize gets encoded into the next generation of language models. With a good hallucinations eval, hopefully the next wave of language models will be more trustworthy and reliable!
28 replies · 120 reposts · 860 likes · 106.5K views
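The protocol in point 1 translates almost directly into an eval harness: grade each completion as correct, incorrect, or not attempted, and report the rates. A minimal sketch, with `autograde` standing in for the LLM judge (not the eval's actual implementation):

```python
# Minimal SimpleQA-style harness following the protocol described above.
# `autograde(question, reference, completion)` is an assumed judge that
# returns "correct", "incorrect", or "not_attempted".

from collections import Counter

def run_eval(questions, model, autograde):
    grades = Counter(
        autograde(q["question"], q["answer"], model(q["question"]))
        for q in questions
    )
    n = sum(grades.values())
    return {g: grades[g] / n
            for g in ("correct", "incorrect", "not_attempted")}
```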
Xilun Chen retweeted
Lili Yu @liliyu_lili
🚀 Excited to share our latest work: Transfusion! A new multimodal generative training recipe that combines language modeling and image diffusion in a single transformer! Huge shoutout to @violet_zct @omerlevy_ @michiyasunaga @arunbabu1234 @kushal_tirumala and other collaborators.
Chunting Zhou @violet_zct

Introducing *Transfusion* - a unified approach for training models that can generate both text and images. arxiv.org/pdf/2408.11039 Transfusion combines language modeling (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. This allows us to leverage the strengths of both approaches in one model. 1/5

6 replies · 19 reposts · 124 likes · 58.7K views
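The quoted thread spells out the training recipe: one transformer optimized with next-token prediction on text and a diffusion loss on images within mixed-modality sequences. A schematic of such a combined objective (shapes and the balancing coefficient are assumptions, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

# Schematic Transfusion-style objective: the same transformer is trained
# with cross-entropy on text tokens and a diffusion (noise-prediction MSE)
# loss on image latents within one mixed-modality sequence.

def transfusion_loss(text_logits: torch.Tensor, text_targets: torch.Tensor,
                     noise_pred: torch.Tensor, noise: torch.Tensor,
                     lam: float = 5.0) -> torch.Tensor:
    lm = F.cross_entropy(text_logits.flatten(0, -2), text_targets.flatten())
    diff = F.mse_loss(noise_pred, noise)   # standard DDPM epsilon loss
    return lm + lam * diff                 # lam balances the two objectives
```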