Fangyuan Xu
@brunchavecmoi
189 posts

许方园👩🏻‍💻 PhD student @ NYU, interested in natural language processing

🌎 Joined August 2019
701 Following · 578 Followers
Pinned Tweet
Fangyuan Xu @brunchavecmoi
A lot of useful training data can't be shared due to privacy. How do we create synthetic training data without data owners ever sharing their content? 🚀 Introducing 𝐃𝐏-𝐑𝐅𝐓: using RL to train LLMs to generate high-fidelity domain data without seeing a single private sample.
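The tweet doesn't spell out how the privacy guarantee is enforced. A standard ingredient in differentially private pipelines is the Gaussian mechanism, which releases a noisy statistic instead of the raw one so the learner never touches individual samples. A minimal sketch under that assumption (the `private_reward` function, its fidelity-score inputs, and the sensitivity bound are hypothetical illustrations, not DP-RFT's actual recipe):

```python
import math
import random

def gaussian_mechanism(value, sensitivity, epsilon, delta):
    """Release `value` with (epsilon, delta)-DP by adding calibrated Gaussian noise."""
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return value + random.gauss(0.0, sigma)

def private_reward(private_scores, epsilon=1.0, delta=1e-5):
    """Hypothetical RL reward: a DP-noised mean of per-sample fidelity scores
    computed on the private corpus, so the generator never sees raw samples."""
    mean = sum(private_scores) / len(private_scores)
    # Each score is assumed clipped to [0, 1], so the mean's sensitivity is 1/n.
    return gaussian_mechanism(mean, sensitivity=1.0 / len(private_scores),
                              epsilon=epsilon, delta=delta)
```

With many samples the sensitivity shrinks, so the noised reward stays close to the true mean while still protecting any single record.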
Fangyuan Xu retweeted
Yoonsang Lee @yoonsang_
How should we effectively aggregate long-horizon agent trajectories? 🧐 Unlike CoT reasoning, agentic tasks pose unique challenges: they are long, multi-turn, and tool-augmented. Introducing 👉🏻 AggAgent 👈🏻 — which treats parallel trajectories as an environment to interact with.
Fangyuan Xu retweeted
Jenna Russell @jennajrussell
Would you realize if the book you were reading was AI? What if it was humanized to remove AI-speak? We find that even without using stylistic cues (e.g., word choice or sentence structure), narrative choices alone give AI fiction away!
Fangyuan Xu retweeted
Hongli Zhan @HongliZhan
Defended my PhD at UT Austin today.🤘 The best thing was having an advisor who believed in me before I believed in myself. Jessy taught me how to write, how to think, and how to chase research ideas. Then the rest followed. Thank you, @jessyjli
Fangyuan Xu retweeted
Chau Minh Pham @chautmpham
👀 Can AI produce a novel worth reading? We built a platform to find out. 📚 Introducing AutoFiction: a web platform that hosts AI-generated novels by Claude Code & Codex, rated and reviewed by real readers. We have 33 books so far, spanning dark fantasy, murder mysteries, Harry Potter fanfics, and more. All free to read. (1/n)
Fangyuan Xu retweeted
Tengxiao Liu @TengxiaoLiu
Auto research is on 🔥 We give algorithmic problems (like circle packing) to general coding agents and let them run overnight. 🌙 Agents reach SoTA. But more importantly: we analyze 100+ hours of trajectories to understand how they get there 🧵
Fangyuan Xu retweeted
Yoonjoo Lee @yoonjoo_le2
Proud to share our CHI 2026 Honorable Mention paper, Evalet! 🏅 LLM-as-a-Judge is everywhere, but a single score hides so much. Evalet fragments outputs into functional units so you can see exactly what's working and what's not—across hundreds of outputs, from reasoning traces to red-teaming conversations to computer-use agents. I had a great time working on this project led by the amazing @tae_skim and @heechanleekr, with @josephseering and @imjuhokim! Check out the full breakdown below ⬇️
Tae Soo Kim @tae_skim

AI agents are running complex workflows, and writing full codebases and docs. But how do you verify these outputs at scale? An LLM judge gives you "3/5" with a broad reason. Which parts caused that score? Any patterns across outputs? We built Evalet 🔬 to fix this. CHI'26

Fangyuan Xu retweeted
Shankar Padmanabhan @shankarpad8
1/5 How do we update a model trained in 2025 with new world knowledge from 2026? ⚠️Continued training will undo skills learned by LLMs during post-training, e.g. instruction-following/math/code. 🤝Our method DiSC updates LLMs with new knowledge while preserving existing skills!
Fangyuan Xu retweeted
Shuyan Zhou @shuyanzh36
In 2023, WebArena took 7 grad students more than 6 months to build just 5 environments with 812 variable browser-use tasks. Now, it takes under 10 hours and less than $100 per environment, with easy support for parallel generation.

Excited to introduce WebArena-Infinity: a scalable approach for automatically generating high-authenticity, high-complexity browser environments with verifiable tasks suitable for RL training and benchmarking. Even strong open-source models that already achieve 60%+ success rates on WebArena and OSWorld complete fewer than 50% of tasks here.

Project page: webarena.dev/webarena-infin…
Repo: github.com/web-arena-x/we…

🧵 (1/n)
Fangyuan Xu retweeted
Vaibhav Adlakha @vaibhav_adlakha
Your LLM already knows the answer. Why is your embedding model still encoding the question?

🚨 Introducing LLM2Vec-Gen: your frozen LLM generates the answer's embedding in a single forward pass — without ever generating the answer. Not only that, the frozen LLM can decode the embedding back into text.

🏆 SOTA self-supervised embeddings
🛡️ Free transfer of instruction-following, safety, and reasoning
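The data flow the tweet describes — one forward pass over the question, no decoding loop, with a hidden state serving as the answer's embedding — can be illustrated with a toy stand-in model. Everything here (the tiny tanh "decoder", the vocab and hidden sizes, reading the last position's hidden state) is a hypothetical sketch of the idea, not LLM2Vec-Gen's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in "frozen LLM": an embedding table plus one linear layer + tanh.
# In the real system this would be a pretrained decoder; here it only
# illustrates the single-forward-pass data flow.
VOCAB, HIDDEN = 100, 16
embed_table = rng.normal(size=(VOCAB, HIDDEN))
W = rng.normal(size=(HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)

def encode_question(token_ids):
    """One forward pass over the question tokens; the hidden state at the
    last position serves as the predicted answer embedding. No token is
    ever generated."""
    h = np.tanh(embed_table[token_ids] @ W)  # (seq_len, HIDDEN)
    return h[-1]                             # last-token hidden state

q_emb = encode_question([3, 17, 42])  # single pass, no autoregressive loop
```

Because the model is frozen, the same question always maps to the same embedding, which is what makes decoding the embedding back to text plausible.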
Fangyuan Xu retweeted
Manya Wadhwa @ManyaWadhwa1
⚛️ Introducing CREATE, a benchmark for creative associative reasoning in LLMs. Making novel, meaningful connections is key for scientific & creative works. We objectively measure how well LLMs can do this. 🧵👇
Fangyuan Xu retweeted
Jiyeon Kim @jiyeonkimd
🌎Real-world knowledge evolves constantly and emerges incrementally. Can LLMs adapt to new information on the fly? 🤯Frontier models and agentic approaches all struggle, missing when to update the fact, or getting distracted by irrelevant information. We introduce ✨OAKS✨, a benchmark for evaluating models’ online adaptation to streaming, continually updating knowledge.
Fangyuan Xu retweeted
Akari Asai @AkariAsai
Many Deep Research agents still rely on search engines and embedding models built for humans, not agents. They retrieve from a query (+ maybe an instruction), but ignore the much richer context agents generate while reasoning. Retrieval should be redesigned for agents - introducing AgentIR.

💡 Key Idea: Use the DR agent's reasoning tokens during retrieval and train embedding models to leverage these reasoning traces more effectively, rather than relying only on the final query.

📈 Results: State-of-the-art performance on BrowseComp-Plus, outperforming powerful reasoning-aware embedding models (ReasonIR) and query re-writing methods.

Interestingly, we find that reasoning traces provide a much richer retrieval context than simply paraphrasing or concatenating queries. They capture evolving intent, decomposition, and search rationale in ways standard query-only retrieval misses.

Looking ahead, we believe future embedding models (or retrieval systems in general) should be trained natively for Deep Research agents, so they can fully leverage these reasoning chains.
Zijian Chen @zijian42chen

🚀 Introducing AgentIR, a retriever that reads your agent’s mind (literally!) 🧠 Unlike humans, agents explicitly expose thoughts in reasoning tokens. Put them to use! 📈 Simple, substantial gains for agents on BrowseComp-Plus, 35% (BM25) ➡️ 50% (Qwen3-Embed) ➡️ 67% (AgentIR) 🧵

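The key idea above — conditioning retrieval on the agent's reasoning tokens rather than the bare query — can be sketched with a toy retriever. The bag-of-words "embedding", the corpus, and the example trace are all hypothetical stand-ins for AgentIR's trained embedding model:

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a trained embedder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, reasoning_trace, corpus):
    """AgentIR-style retrieval sketch: build the retrieval representation
    from the query PLUS the agent's reasoning tokens, not the query alone."""
    q = embed(query + " " + reasoning_trace)
    return max(corpus, key=lambda doc: cosine(q, embed(doc)))

corpus = [
    "the eiffel tower is in paris",
    "gustave eiffel also engineered the statue of liberty framework",
]
# The reasoning trace carries decomposed intent that the bare query lacks,
# steering retrieval toward the document about the engineer.
trace = "the user wants the engineer; decompose: who engineered the statue of liberty"
best = retrieve("tower designer", trace, corpus)
```

With only the two-word query, the lexical match favors the first document; adding the trace shifts retrieval to the one the agent actually needs.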
Fangyuan Xu retweeted
CLS @ChengleiSi
Great to see autoresearch blowing up because of the legendary Karpathy sensei. This year will of course be an exciting year for automated AI research. For all of you excited to jump onto it, hopefully our papers will be some helpful references:

- automated feedback loop for research agents to optimize LLM pre-training and post-training stacks: x.com/ChengleiSi/sta…
- generating novel research ideas with LLMs, along with a comparison against human experts: x.com/ChengleiSi/sta…
- evaluating the effectiveness of LLM-generated ideas through experiment execution: x.com/ChengleiSi/sta…
- finetuning LLMs to directly predict the effectiveness of research ideas: x.com/jiaxinwen22/st…
Andrej Karpathy @karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:

- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.

github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)

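The loop Karpathy describes — agent edits the training script, each run produces a validation loss, and only improvements are committed — can be sketched with mock functions. Here `agent_propose` and `train_run` are hypothetical stand-ins (the real agent edits code and the real run takes 5 minutes on a GPU; see the linked repo for the actual setup):

```python
import random

def agent_propose(script):
    """Stand-in for the AI agent editing the training code (.py).
    Here it just perturbs a single fake 'learning rate' setting."""
    return {**script, "lr": script["lr"] * random.choice([0.5, 0.9, 1.1, 2.0])}

def train_run(script):
    """Stand-in for one 5-minute training run; returns a validation loss
    that is minimized near a fictional optimum lr of 3e-4."""
    return (script["lr"] - 3e-4) ** 2 + random.uniform(0.0, 1e-9)

def autoresearch(n_runs=50, seed=0):
    """Autonomous hill-climbing loop: try an edit, keep it only if the
    validation loss improves, and record each kept edit as a 'commit'."""
    random.seed(seed)
    script, best_loss = {"lr": 1e-3}, float("inf")
    commits = []  # stands in for git commits on the feature branch
    for _ in range(n_runs):
        candidate = agent_propose(script)
        loss = train_run(candidate)
        if loss < best_loss:
            script, best_loss = candidate, loss
            commits.append((script["lr"], loss))
    return commits
```

Running `autoresearch()` yields a monotonically improving sequence of (setting, loss) commits, mirroring the "every dot is a complete training run" picture in the quoted tweet.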
Fangyuan Xu retweeted
Xi Ye @xiye_nlp
We propose a new decoding algorithm, DySCO🪩 (Dynamic Attention Scaling), directly improving long-context reasoning without training. At each decoding step, we dynamically identify and upweight attention to important context for the next token. 📈20% gains on multiple tasks.
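The mechanism in the tweet — at each decoding step, upweight attention to context positions identified as important, with no training — can be illustrated on a single attention step. How positions are selected and how much they are boosted here (an additive logit bonus) are illustrative choices, not DySCO's actual recipe:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def upweighted_attention(q, K, important_idx, boost=1.0):
    """One decoding step's attention of query q over context keys K, with the
    logits of 'important' positions given an additive bonus before softmax,
    shifting probability mass toward the relevant context."""
    logits = K @ q / np.sqrt(len(q))
    logits[important_idx] += boost  # illustrative upweighting scheme
    return softmax(logits)
```

Because only the attention logits are rescaled at inference time, the method needs no gradient updates, matching the "without training" claim.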
Fangyuan Xu @brunchavecmoi
We evaluate on long-form, domain-specific generation tasks: news articles, meeting transcripts, medical abstracts and chat history. DP-RFT outperforms prior eyes-off methods and closes the gap with methods that require direct data access — in both fidelity and downstream utility.