Chenxi Whitehouse

31 posts

@chenx_wh

Research Scientist @Meta FAIR. Previously @CambridgeNLP @GoogleDeepMind

London · Joined October 2021

206 Following · 428 Followers

Chenxi Whitehouse reposted

Jason Weston@jaseweston·1 May

💎Autodata: an agentic data scientist to create high quality data✨ We introduce a method for building agents that create high-quality training & evaluation data. Key idea: agentic data creation provides a way to *convert increased inference compute into higher quality model training*. We show how to train (meta-optimize) such a data scientist agent, so that it can create even stronger data. Our initial study with a specific practical implementation, Agentic Self-Instruct, shows strong gains on scientific reasoning problems compared to classical synthetic dataset creation methods. Overall, we believe this direction has the potential to change how we build AI data! Read more in the blog post: facebookresearch.github.io/RAM/blogs/auto…

Chenxi Whitehouse@chenx_wh·23 Apr

I am attending #ICLR26 and will present two papers tomorrow (23 April) for our work at MSL and FAIR:
1. MENLO: x.com/seb_ruder/stat… with @seb_ruder
2. J1: x.com/jaseweston/sta…, presenting with @swarnaNLP
Looking forward!

Sebastian Ruder@seb_ruder

🚨 New paper! 🌎 MENLO: From Preferences to Proficiency We introduce a framework + dataset for evaluating and modeling native-like LLM response quality across 47 languages, inspired by audience design principles. 📄 Paper: arxiv.org/abs/2509.26601 🤗 Data: huggingface.co/datasets/faceb… 🧵Details 👇

Chenxi Whitehouse@chenx_wh·23 Mar

Check out RLLM, my primary focus during the latter half of last year. Building on the success of J1, we developed training recipes to adapt static LLM-judges into on-policy Generative Reward Models!

Jason Weston@jaseweston

🌐Unified Post-Training via On-Policy-Trained LM-as-RM🔧
RLLM = RL + LM-as-RM:
- post-training framework that unifies RL across easy-, hard-to-verify, and non-verifiable tasks.
- trains the LM-as-RM reward model on-policy from the policy’s own outputs, then uses those generative rewards to optimize the policy. 🔗📈
- uses the LLM’s reasoning + instruction-following for higher-quality rewards, boosting performance on all task types. 🚀🤖🏆
Read more in the blog post: facebookresearch.github.io/RAM/blogs/rllm/

Chenxi Whitehouse reposted

Jason Weston@jaseweston·22 Jan

Our team in FAIR at Meta is hiring a (full-time) researcher! We work on the topics of Reasoning, Alignment and Memory/architectures (RAM) for self-improvement & co-improvement.
Apply here: metacareers.com/profile/job_de…
Location: NY, Seattle or Menlo Park.
Some of our recent work to give flavor:
- Co-Improvement (position): arxiv.org/abs/2512.05356
- SPICE (Self-Play in Corpus Environments): arxiv.org/abs/2510.24684
- Self-Challenging Agents: arxiv.org/abs/2506.01716
- RL from Human Interaction: arxiv.org/abs/2509.25137
- AggLM (parallel aggregation): arxiv.org/abs/2509.06870
- StepWiser (CoT-PRM RL): arxiv.org/abs/2508.19229
- DARLING (diversity-trained RL): arxiv.org/abs/2509.02534
- J1 (RL-trained LLM-as-Judge): arxiv.org/abs/2505.10320
- CoT-Self-Instruct: arxiv.org/abs/2507.23751
- Multi-Token Attention: arxiv.org/abs/2504.00927

Chenxi Whitehouse@chenx_wh·30 Dec

Check out our new work on AI Co-Scientist for Research Plan Generation with our amazing intern @ShashwatGoel7!

Shashwat Goel@ShashwatGoel7

🚨New paper: Training AI Co-Scientists using Rubric Rewards

In my recent internship at Meta Superintelligence Labs, I pursued an opinionated research bet: a general, scalable training recipe to improve AI at helping scientists achieve their research goals.

Motivation

Existing work on training AI for Science optimizes pre-defined, narrow scientific objectives with execution feedback in specially constructed environments (e.g. RLVR). However, it's infeasible to learn from trial and error in many sciences. For example, medical research is hard to simulate digitally, and it is unethical to run clinical trials with suboptimal approaches proposed in early training.😬

Moreover, when pursuing a novel research goal, the primary intellectual challenge often lies in defining the experiment setup and objective itself. In the past year, I have increasingly used AI assistance for this (especially GPT-5) in my own research. Of course, models often fail to follow some explicitly stated requirements, and sometimes propose bad design choices, but that is fine! The generated plans are still useful for brainstorming, and I can implement them with further refinement.

Method

This made us wonder🤔: how can we train models to be better at this task of generating research plans, given an open-ended research goal? For training, we need to collect a large number of research goals, and obtain fast verification signals. Human experts are expensive to access, and that wouldn't scale.

💡Equipped with the vast corpus of openly licensed scientific literature, and the recent success of RL, Synthetic Data Curation, and Rubrics, we propose a scalable recipe: extract research goals and goal-specific grading rubrics from existing papers with an LLM, and use them for RL training. Specifically, a frozen copy of the initial model rewards the plans generated during training using the goal-specific rubrics, checking seven general guidelines for parts of the plan relevant to each rubric item.

🤔Won't this lead to reward hacking? It will. At some point. But until then, improvements on the training reward might generalize to better research plans for humans. We are hoping the goal-specific rubrics, provided as privileged information to the grader, create a generator-verifier gap that improves research plan generation without external supervision.

The only way to find out? Perform a human study. We ask Machine Learning experts to compare plans generated by the finetuned vs initial Qwen3-30B model for ML research goals. This is slow and expensive: it required 45 minutes per annotation to carefully analyze plans, so we could only do this once at the end of the project for evaluation.

Results

Individual annotations are still noisy, as evaluating research plans is inherently subjective. But sure enough, there is non-trivial signal. The experts preferred (p < 0.01) our finetuned model's plans for 70% of research goals extracted from NeurIPS'24 / ICLR'25 Oral papers (top 1%) ✅

But only ML, and finetuned vs initial, is boring. Remember, the goal is generality. So we also finetuned Qwen3-30B on goals extracted from medical research, and new arXiv preprints spanning 8 domains. We use rubric evaluations with a jury of frontier models, which also allows us to compare many frontier models across domains.

Notable findings:
1) In-domain finetuning leads to 12-22% relative improvements in scores across the three domains: arXiv, medical, and ML 📈
2) Significant cross-domain generalization, especially with the medical finetune improving on ML and new arXiv research goals. This might be evidence for our "generality" thesis 📊
3) Our 30B finetune matches much larger models like Grok-4-Thinking, but GPT-5-Thinking is a cut above the rest (consistent with my qualitative experience) 🤖

Limitations

Now of course, LLM-based evaluations, even with a jury and rubrics, are imperfect. But while the individual sample scoring is noisy, we hope for directionally correct results in aggregate, as the jury has positive alignment with human majority vote in our human study on ML. We think the grading scheme holds promise, as optimizing against a much weaker grader (30B) led to improvements in human preference. This work has many such limitations, so treat it more like an early proof-of-concept. We candidly acknowledge them in our paper, and encourage you to scrutinize the details: 📜 alphaxiv.org/abs/2512.23707

Released Artefacts

The paper has many ablations and analyses:
- our appendix also has sample outputs across domains for vibe-checks, making it 119 pages!
- criteria-wise breakdown of performance evolution during training, thanks to our structured grading
- SFT on long-form plans worsened model performance
- training also improves Gemma and Llama models

🤗We release our train and test data on @huggingface. At the sample level the data is noisy, and generated by Llama-4-Maverick. Still, human experts approved 84% of the rubric items in ML, so there's promise, and the same methodology will lead to better quality data as language models improve.

Overall, we think the potential of our approach is high: the scientific method is quite general, deep learning benefits from generality (transfer learning), and language models are amazing (better every month!). We hope approaches like this make LMs better at assisting researchers across diverse problem settings and scientific disciplines.

Some cool figures from the paper, and acknowledgements in thread🧵. I'm all ears for feedback on how we could've done things better! 1/3
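
Editor's sketch of the rubric-reward idea in the thread above. It assumes, hypothetically, that the grader returns the list of guidelines a plan violates for each rubric item, and that the reward is the fraction of rubric items with no violations. The thread does not state the aggregation, so `rubric_reward`, the `Grader` signature, and `toy_grader` are illustrative names only, not the authors' implementation.

```python
from typing import Callable, Sequence

# Hypothetical grader signature: (plan, rubric_item) -> names of violated guidelines.
# In the thread, this role is played by a frozen copy of the initial model.
Grader = Callable[[str, str], Sequence[str]]


def rubric_reward(plan: str, rubric: Sequence[str], grade: Grader) -> float:
    """Fraction of rubric items the plan satisfies (assumed aggregation rule)."""
    if not rubric:
        return 0.0
    # An item counts as satisfied only if the grader reports no violations for it.
    satisfied = sum(1 for item in rubric if len(grade(plan, item)) == 0)
    return satisfied / len(rubric)


def toy_grader(plan: str, item: str) -> list[str]:
    # Stand-in for the LLM grader: flags a violation when the rubric keyword
    # does not appear in the plan text at all.
    return [] if item.lower() in plan.lower() else ["missing"]


plan = "Use a held-out test set and report a baseline."
reward = rubric_reward(plan, ["baseline", "held-out", "ablation"], toy_grader)
```

Here the toy plan covers two of the three rubric keywords, so the scalar reward reflects partial credit; a real grader would replace the keyword check with a model call.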

Chenxi Whitehouse@chenx_wh·1 Oct

Check out our new paper MENLO, a collaboration with @seb_ruder and linguists!

Sebastian Ruder@seb_ruder

Chenxi Whitehouse@chenx_wh·16 May

Paper accepted at #ACL2025 main conference! We present a new dataset and studies on video-to-text summarisation focusing on scientific presentations. Well done @dongqi_me and team!

Dongqi Liu@dongqi_me

🚨 Long Paper Accepted at @aclmeeting 2025 main conference! 🚨 🎥 Our work "What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations" introduces VISTA, a large-scale benchmark for scientific video summarization. #ACL2025 #NLProc #LLMs 🧵(1/3)

Chenxi Whitehouse@chenx_wh·16 May

Presenting new work: Thinking LLM-as-a-Judge via RL! It’s been great fun working with @swarnaNLP, @jaseweston, @uralik1 and team!

Jason Weston@jaseweston

🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL
- Converts judgement task into a verifiable one for both verifiable and non-verifiable prompts. Uses only synthetic pairwise data
- Optimizes thoughts, scores, and judgments using GRPO
- Outperforms all baselines at 8B & 70B scale, o1-mini, and on some benchmarks, even R1
- We find J1 uses various thought strategies: outlines evaluation criteria, compares against self-generated reference answers, and re-evaluates correctness
📝: arxiv.org/abs/2505.10320
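
Editor's sketch of why pairwise data makes judging "verifiable", under simplifying assumptions: each synthetic pair has a known preferred response, the judge's verdict earns reward 1 when it matches, and rewards within a sampled group are normalised GRPO-style. The function names are hypothetical, and the reward in the J1 paper may include further terms not shown here.

```python
from statistics import mean, pstdev


def verdict_reward(verdict: str, preferred: str) -> float:
    """1.0 if the judge picked the known-better response, else 0.0 (assumed reward)."""
    return 1.0 if verdict == preferred else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise rewards within one group of samples for the same prompt.

    Uses the population standard deviation; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled judgments for one pair whose preferred response is "A".
verdicts = ["A", "B", "A", "A"]
rewards = [verdict_reward(v, "A") for v in verdicts]
advantages = group_relative_advantages(rewards)
```

Correct verdicts receive a positive advantage and the single incorrect one a negative advantage, which is the signal a policy-gradient update would then use.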

Chenxi Whitehouse reposted

Andreas Vlachos@vlachos_nlp·29 Jan

Pleased to announce the next @FEVERworkshop at ACL2025! Regular workshop papers (ARR and direct submissions) due 15th of April! And new shared task focusing on reproducible and efficient verification of real world claims! Check fever.ai and get keen!

Chenxi Whitehouse@chenx_wh·18 Jun

@jasmijnbastings I got the time wrong: it is actually 9am tomorrow, Mexico time 😅! (4pm was in the London time zone shown on Underline.)

Chenxi Whitehouse@chenx_wh·16 Jun

In Mexico City for #NAACL2024! Happy to meet and chat. Do come by our poster session on the 19th at 4pm on LoRA for multilingual summarisation with @GoogleDeepMind. Kudos to my amazing collaborators @jasmijnbastings @fantinehuot @kitsing_l @m__dehghani Mirella! arxiv.org/abs/2311.08572

Chenxi Whitehouse@chenx_wh·12 Jun

@AlhamFikri Happy to be part of it!

Alham Fikri Aji@AlhamFikri·12 Jun

🎉Happy to share our recent collaborative effort on building a culturally diverse, multilingual visual QA dataset! CVQA consists of over 9,000 questions across 28 countries, covering 26 languages (with more to be added!) 🌐cvqa-benchmark.org 📜arxiv.org/pdf/2406.05967

Chenxi Whitehouse reposted

Andreas Vlachos@vlachos_nlp·26 Apr

Interested in fact-checking? Just announced: Next @FEVERworkshop at EMNLP shared task: fever.ai/task.html aiming to evaluate systems verifying real-world claims with evidence from the Web, based on the AVERITEC dataset: arxiv.org/abs/2305.13117 With the amazing team:

Chenxi Whitehouse@chenx_wh·2 Apr

Overall, LoRA presents a strong parameter-efficient alternative for multilingual summarization, especially benefiting low-resource languages! Looking forward to sharing more details at #NAACL2024 in Mexico!

Chenxi Whitehouse@chenx_wh·2 Apr

For low-data and zero-shot cross-lingual transfer, LoRA consistently outperforms full FT, which exhibits catastrophic forgetting. For few-shot, LoRA continued training also surpasses full FT and LoRA module composition.

Chenxi Whitehouse@chenx_wh·2 Apr

Excited to share that our paper "Low-Rank Adaptation for Multilingual Summarization: An Empirical Study" from my @GoogleDeepMind internship was accepted at #NAACL2024🎉 Huge thanks to my co-authors @fantinehuot @jasmijnbastings @m__dehghani @kitsing_l & Mirella! arxiv.org/abs/2311.08572

Chenxi Whitehouse@chenx_wh·2 Mar

I successfully defended my thesis! Many thanks to @IAugenstein @nikaletras for being such kind examiners! Heartfelt gratitude to my supervisors @tweyde @foobarin, internship mentors @fenchri @claravania @AlhamFikri @jasmijnbastings @fantinehuot & all who helped me on the journey!

Chenxi Whitehouse reposted

Monojit Choudhury@monojitchou·4 Dec

Also presenting "LLM powered data augmentation for enhanced cross lingual performance", with @chenxi_jw and @AlhamFikri (8 Dec, 0830 - 1000, mains poster) Work done at @mbzuai 3/3

Chenxi Whitehouse@chenx_wh·4 Dec

@levelsio How about the new version now?

@levelsio@levelsio·25 Nov

@chenxi_jw I think there's a bug on replicate.com/cjwbw/video-re… where it defaults to the preset example audio and it's impossible to add your own audio, any idea how to fix?
