Fangyuan Xu
@brunchavecmoi
189 posts

许方园👩🏻‍💻 PhD student @ NYU, interested in natural language processing

🌎 Joined August 2019
701 Following · 578 Followers
Pinned Tweet
Fangyuan Xu @brunchavecmoi
A lot of useful training data can't be shared due to privacy. How do we create synthetic training data without data owners ever sharing their content? 🚀 Introducing 𝐃𝐏-𝐑𝐅𝐓: using RL to train LLMs to generate high-fidelity domain data without seeing a single private sample.
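The tweet doesn't spell out how the privacy guarantee is enforced. A standard ingredient in differentially private pipelines is the Gaussian mechanism, which releases a noisy statistic instead of the raw one so the learner never touches individual samples. A minimal sketch under that assumption (the `private_reward` function, its fidelity-score inputs, and the sensitivity bound are hypothetical illustrations, not DP-RFT's actual recipe):

```python
import math
import random

def gaussian_mechanism(value, sensitivity, epsilon, delta):
    """Release `value` with (epsilon, delta)-DP by adding calibrated Gaussian noise."""
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return value + random.gauss(0.0, sigma)

def private_reward(private_scores, epsilon=1.0, delta=1e-5):
    """Hypothetical RL reward: a DP-noised mean of per-sample fidelity scores
    computed on the private corpus, so the generator never sees raw samples."""
    mean = sum(private_scores) / len(private_scores)
    # Each score is assumed clipped to [0, 1], so the mean's sensitivity is 1/n.
    return gaussian_mechanism(mean, sensitivity=1.0 / len(private_scores),
                              epsilon=epsilon, delta=delta)
```

With many samples the sensitivity shrinks, so the noised reward stays close to the true mean while still protecting any single record.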
Fangyuan Xu retweeted
Yoonsang Lee @yoonsang_
How should we effectively aggregate long-horizon agent trajectories? 🧐 Unlike CoT reasoning, agentic tasks pose unique challenges: they are long, multi-turn, and tool-augmented. Introducing 👉🏻 AggAgent 👈🏻 — which treats parallel trajectories as an environment to interact with.
Fangyuan Xu retweeted
Jenna Russell @jennajrussell
Would you realize if the book you were reading was AI? What if it was humanized to remove AI-speak? We find that even without using stylistic cues (e.g., word choice or sentence structure), narrative choices alone give AI fiction away!
Fangyuan Xu retweeted
Hongli Zhan @HongliZhan
Defended my PhD at UT Austin today.🤘 The best thing was having an advisor who believed in me before I believed in myself. Jessy taught me how to write, how to think, and how to chase research ideas. Then the rest followed. Thank you, @jessyjli
Fangyuan Xu retweeted
Chau Minh Pham @chautmpham
👀 Can AI produce a novel worth reading? We built a platform to find out. 📚 Introducing AutoFiction: a web platform that hosts AI-generated novels by Claude Code & Codex, rated and reviewed by real readers. We have 33 books so far, spanning dark fantasy, murder mysteries, Harry Potter fanfics, and more. All free to read. (1/n)
Fangyuan Xu retweeted
Tengxiao Liu @TengxiaoLiu
Auto research is on 🔥 We give algorithmic problems (like circle packing) to general coding agents and let them run overnight. 🌙 Agents reach SoTA. But more importantly: we analyze 100+ hours of trajectories to understand how they get there 🧵
Fangyuan Xu retweeted
Yoonjoo Lee @yoonjoo_le2
Proud to share our CHI 2026 Honorable Mention paper, Evalet! 🏅 LLM-as-a-Judge is everywhere, but a single score hides so much. Evalet fragments outputs into functional units so you can see exactly what's working and what's not—across hundreds of outputs, from reasoning traces to red-teaming conversations to computer-use agents. I had a great time working on this project led by the amazing @tae_skim and @heechanleekr, with @josephseering and @imjuhokim! Check out the full breakdown below ⬇️
Tae Soo Kim @tae_skim

AI agents are running complex workflows, and writing full codebases and docs. But how do you verify these outputs at scale? An LLM judge gives you "3/5" with a broad reason. Which parts caused that score? Any patterns across outputs? We built Evalet 🔬 to fix this. CHI'26

Fangyuan Xu retweeted
Shankar Padmanabhan @shankarpad8
1/5 How do we update a model trained in 2025 with new world knowledge from 2026? ⚠️Continued training will undo skills learned by LLMs during post-training, e.g. instruction-following/math/code. 🤝Our method DiSC updates LLMs with new knowledge while preserving existing skills!
Fangyuan Xu retweeted
Shuyan Zhou @shuyanzh36
In 2023, WebArena took 7 grad students more than 6 months to build just 5 environments with 812 variable browser-use tasks. Now, it takes under 10 hours and less than $100 per environment, with easy support for parallel generation.

Excited to introduce WebArena-Infinity: a scalable approach for automatically generating high-authenticity, high-complexity browser environments with verifiable tasks suitable for RL training and benchmarking. Even strong open-source models that already achieve 60%+ success rates on WebArena and OSWorld complete fewer than 50% of tasks here.

Project page: webarena.dev/webarena-infin…
Repo: github.com/web-arena-x/we…

🧵 (1/n)
Fangyuan Xu retweeted
Vaibhav Adlakha @vaibhav_adlakha
Your LLM already knows the answer. Why is your embedding model still encoding the question?

🚨 Introducing LLM2Vec-Gen: your frozen LLM generates the answer's embedding in a single forward pass — without ever generating the answer. Not only that, the frozen LLM can decode the embedding back into text.

🏆 SOTA self-supervised embeddings
🛡️ Free transfer of instruction-following, safety, and reasoning
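The data flow the tweet describes — one forward pass over the question, no decoding loop, with a hidden state serving as the answer's embedding — can be illustrated with a toy stand-in model. Everything here (the tiny tanh "decoder", the vocab and hidden sizes, reading the last position's hidden state) is a hypothetical sketch of the idea, not LLM2Vec-Gen's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in "frozen LLM": an embedding table plus one linear layer + tanh.
# In the real system this would be a pretrained decoder; here it only
# illustrates the single-forward-pass data flow.
VOCAB, HIDDEN = 100, 16
embed_table = rng.normal(size=(VOCAB, HIDDEN))
W = rng.normal(size=(HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)

def encode_question(token_ids):
    """One forward pass over the question tokens; the hidden state at the
    last position serves as the predicted answer embedding. No token is
    ever generated."""
    h = np.tanh(embed_table[token_ids] @ W)  # (seq_len, HIDDEN)
    return h[-1]                             # last-token hidden state

q_emb = encode_question([3, 17, 42])  # single pass, no autoregressive loop
```

Because the model is frozen, the same question always maps to the same embedding, which is what makes decoding the embedding back to text plausible.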
Fangyuan Xu retweeted
Manya Wadhwa @ManyaWadhwa1
⚛️ Introducing CREATE, a benchmark for creative associative reasoning in LLMs. Making novel, meaningful connections is key for scientific & creative works. We objectively measure how well LLMs can do this. 🧵👇
Fangyuan Xu retweeted
Jiyeon Kim @jiyeonkimd
🌎Real-world knowledge evolves constantly and emerges incrementally. Can LLMs adapt to new information on the fly? 🤯Frontier models and agentic approaches all struggle, missing when to update the fact, or getting distracted by irrelevant information. We introduce ✨OAKS✨, a benchmark for evaluating models’ online adaptation to streaming, continually updating knowledge.
Fangyuan Xu retweeted
Akari Asai @AkariAsai
Many Deep Research agents still rely on search engines and embedding models built for humans, not agents. They retrieve from a query (+ maybe an instruction), but ignore the much richer context agents generate while reasoning. Retrieval should be redesigned for agents - introducing AgentIR.

💡 Key Idea: Use the DR agent's reasoning tokens during retrieval and train embedding models to leverage these reasoning traces more effectively, rather than relying only on the final query.

📈 Results: State-of-the-art performance on BrowseComp-Plus, outperforming powerful reasoning-aware embedding models (ReasonIR) and query re-writing methods.

Interestingly, we find that reasoning traces provide a much richer retrieval context than simply paraphrasing or concatenating queries. They capture evolving intent, decomposition, and search rationale in ways standard query-only retrieval misses.

Looking ahead, we believe future embedding models (or retrieval systems in general) should be trained natively for Deep Research agents, so they can fully leverage these reasoning chains.
Zijian Chen @zijian42chen

🚀 Introducing AgentIR, a retriever that reads your agent’s mind (literally!) 🧠 Unlike humans, agents explicitly expose thoughts in reasoning tokens. Put them to use! 📈 Simple, substantial gains for agents on BrowseComp-Plus, 35% (BM25) ➡️ 50% (Qwen3-Embed) ➡️ 67% (AgentIR) 🧵

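The key idea above — conditioning retrieval on the agent's reasoning tokens rather than the bare query — can be sketched with a toy retriever. The bag-of-words "embedding", the corpus, and the example trace are all hypothetical stand-ins for AgentIR's trained embedding model:

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a trained embedder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, reasoning_trace, corpus):
    """AgentIR-style retrieval sketch: build the retrieval representation
    from the query PLUS the agent's reasoning tokens, not the query alone."""
    q = embed(query + " " + reasoning_trace)
    return max(corpus, key=lambda doc: cosine(q, embed(doc)))

corpus = [
    "the eiffel tower is in paris",
    "gustave eiffel also engineered the statue of liberty framework",
]
# The reasoning trace carries decomposed intent that the bare query lacks,
# steering retrieval toward the document about the engineer.
trace = "the user wants the engineer; decompose: who engineered the statue of liberty"
best = retrieve("tower designer", trace, corpus)
```

With only the two-word query, the lexical match favors the first document; adding the trace shifts retrieval to the one the agent actually needs.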
Fangyuan Xu retweeted
CLS @ChengleiSi
Great to see autoresearch blowing up because of the legendary Karpathy sensei. This year will of course be an exciting year for automated AI research. For all of you excited to jump onto it, hopefully our papers will be some helpful references:

- automated feedback loop for research agents to optimize LLM pre-training and post-training stacks: x.com/ChengleiSi/sta…
- generating novel research ideas with LLMs, along with a comparison against human experts: x.com/ChengleiSi/sta…
- evaluating the effectiveness of LLM-generated ideas through experiment execution: x.com/ChengleiSi/sta…
- finetuning LLMs to directly predict the effectiveness of research ideas: x.com/jiaxinwen22/st…
Andrej Karpathy @karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:

- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.

github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)

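The loop Karpathy describes — agent edits the training script, each run produces a validation loss, and only improvements are committed — can be sketched with mock functions. Here `agent_propose` and `train_run` are hypothetical stand-ins (the real agent edits code and the real run takes 5 minutes on a GPU; see the linked repo for the actual setup):

```python
import random

def agent_propose(script):
    """Stand-in for the AI agent editing the training code (.py).
    Here it just perturbs a single fake 'learning rate' setting."""
    return {**script, "lr": script["lr"] * random.choice([0.5, 0.9, 1.1, 2.0])}

def train_run(script):
    """Stand-in for one 5-minute training run; returns a validation loss
    that is minimized near a fictional optimum lr of 3e-4."""
    return (script["lr"] - 3e-4) ** 2 + random.uniform(0.0, 1e-9)

def autoresearch(n_runs=50, seed=0):
    """Autonomous hill-climbing loop: try an edit, keep it only if the
    validation loss improves, and record each kept edit as a 'commit'."""
    random.seed(seed)
    script, best_loss = {"lr": 1e-3}, float("inf")
    commits = []  # stands in for git commits on the feature branch
    for _ in range(n_runs):
        candidate = agent_propose(script)
        loss = train_run(candidate)
        if loss < best_loss:
            script, best_loss = candidate, loss
            commits.append((script["lr"], loss))
    return commits
```

Running `autoresearch()` yields a monotonically improving sequence of (setting, loss) commits, mirroring the "every dot is a complete training run" picture in the quoted tweet.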
Fangyuan Xu retweeted
Xi Ye @xiye_nlp
We propose a new decoding algorithm, DySCO🪩 (Dynamic Attention Scaling), directly improving long-context reasoning without training. At each decoding step, we dynamically identify and upweight attention to important context for the next token. 📈20% gains on multiple tasks.
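The mechanism in the tweet — at each decoding step, upweight attention to context positions identified as important, with no training — can be illustrated on a single attention step. How positions are selected and how much they are boosted here (an additive logit bonus) are illustrative choices, not DySCO's actual recipe:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def upweighted_attention(q, K, important_idx, boost=1.0):
    """One decoding step's attention of query q over context keys K, with the
    logits of 'important' positions given an additive bonus before softmax,
    shifting probability mass toward the relevant context."""
    logits = K @ q / np.sqrt(len(q))
    logits[important_idx] += boost  # illustrative upweighting scheme
    return softmax(logits)
```

Because only the attention logits are rescaled at inference time, the method needs no gradient updates, matching the "without training" claim.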
Fangyuan Xu @brunchavecmoi
We evaluate on long-form, domain-specific generation tasks: news articles, meeting transcripts, medical abstracts and chat history. DP-RFT outperforms prior eyes-off methods and closes the gap with methods that require direct data access — in both fidelity and downstream utility.