Andrew Wang

33 posts

Andrew Wang

@andrewwnlp

PhD student at @jhuclsp

Katılım Mart 2024

156 Takip Edilen89 Takipçiler

Sabitlenmiş Tweet

Andrew Wang@andrewwnlp·18 Eyl

Tools break in the real world all the time, but not much attention has been given to how well LLMs deal with tool failures. We introduce HOHW, a tool-use benchmark where problems remain solvable even when tools break adversarially.

English

2.4K

Andrew Wang retweetledi

Sophia Hager@SophiaNLP·1d

Can we learn to recognize artificial uncertainty as a proxy for real uncertainty? As LLMs memorize more of the internet, they become correctly confident on almost any existing question you can throw at them. Creating new challenging calibration data is unsustainably expensive.🧵

English

444

Andrew Wang retweetledi

Rohan Jha@Robro612·5 May

New 📄: we replicate XTR, a multi-vector retrieval method that makes ColBERT faster by avoiding its expensive step of gathering full document embeddings XTR is not a free lunch over ColBERT, but its training objective is useful for modern efficient engines like PLAID and WARP 👇🏼

English

11.4K

Andrew Wang retweetledi

Brian Zheyuan Zhang@zheyuanzhang99·29 Nis

We benchmarked GPT-5.5 High on AgentOdyssey with the Long Context Agent: 5 main quests + 20 supplementary reward — on par with Claude Opus 4.6, but at lower cost. Trajectory analysis shows stronger exploration, general skill learning, long-horizon planning, and reasoning than GPT-5 and GPT-5-mini. Still, gaps remain: weak learning of combat skills, poor recovery from repeated failures, and quadratic token cost in the Long Context Agent. Detailed observations below 👇

Brian Zheyuan Zhang@zheyuanzhang99

Introducing AgentOdyssey — an open-ended, long-horizon text game generation engine for 𝐭𝐞𝐬𝐭-𝐭𝐢𝐦𝐞 𝐜𝐨𝐧𝐭𝐢𝐧𝐮𝐚𝐥 𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐚𝐠𝐞𝐧𝐭𝐬. Real-world agents cannot have a boundary between training and testing: they must learn continuously from interaction with the world at test time. AgentOdyssey is designed to study five key abilities that make this possible: 🧭 𝐞𝐱𝐩𝐥𝐨𝐫𝐚𝐭𝐢𝐨𝐧, 🧠 𝐞𝐩𝐢𝐬𝐨𝐝𝐢𝐜 𝐦𝐞𝐦𝐨𝐫𝐲, 🌍 𝐰𝐨𝐫𝐥𝐝 𝐤𝐧𝐨𝐰𝐥𝐞𝐝𝐠𝐞 𝐚𝐜𝐪𝐮𝐢𝐬𝐢𝐭𝐢𝐨𝐧, 🛠️ 𝐬𝐤𝐢𝐥𝐥 𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠, and 🎯 𝐥𝐨𝐧𝐠-𝐡𝐨𝐫𝐢𝐳𝐨𝐧 𝐩𝐥𝐚𝐧𝐧𝐢𝐧𝐠. Just `pip install agentodyssey` and start generating games and evaluating your agents and LLMs! 🌐 Project website: agentodyssey.github.io

English

2.1K

Andrew Wang@andrewwnlp·20 Nis

Excited to share AgentOdyssey! We show that LLM agents still have a long way to go before they achieve human level long-term memory

Brian Zheyuan Zhang@zheyuanzhang99

English

842

Andrew Wang retweetledi

Jack Jingyu Zhang@jackjingyuzhang·15 Nis

Real-world agents juggle instructions from skill files, tools, other agents, ... each with different trust levels. When these conflict, can models reliably prioritize the most trusted one? Our ManyIH-Bench🪜 finds that even frontier models like GPT-5.4 only get ~40% accuracy! 👇

English

120

11.8K

Andrew Wang retweetledi

Sophia Hager@SophiaNLP·10 Nis

Amazon Science put out a blog post about RuleForge, which I helped make the evaluation/refinement mechanism for last summer in my internship!

Amazon Science@AmazonScience

Amazon built RuleForge to tackle 48,000+ new vulnerabilities. The agentic AI system generates detection rules 336% faster than human analysts. amazon.science/blog/how-amazo…

English

Andrew Wang retweetledi

Jack Jingyu Zhang@jackjingyuzhang·30 Mar

Check out our new work! ParaGator🐊 conducts end-to-end online RL to jointly train generation (pass@k) and aggregation (pass@1), so the model learns to produce diverse candidates and synthesize them into a final answer. Strong gains on scientific reasoning & competition math!

Jason Weston@jaseweston

🔗Learning to Aggregate through Online RL🎯 ParaGator🔀🐊: strong parallel reasoning aggregation Core claim: aggregation works best when training both stages together: - LLM generator should produce diverse candidates - LLM aggregator should synthesize into final answer ParaGator trains candidate generation with pass @k, and aggregation with pass@1 on-policy, end-to-end. Stops mode collapse/off-policy mismatch. Improves math & scientific reasoning. 🚀🏆 Read more in the blog post: facebookresearch.github.io/RAM/blogs/para…

English

2.1K

Andrew Wang retweetledi

Marc Marone@ruyimarone·27 Kas

I'm on the job market and at #neurips2025! Looking for research roles around data for foundation models and would love to chat with folks - resume/site in my bio. I've recently worked @AIatMeta and @databricks and publish papers with my awesome collaborators @jhuclsp!

English

10.7K

Andrew Wang retweetledi

Jack Jingyu Zhang@jackjingyuzhang·14 Eki

We introduce WaltzRL🎶, a multi-agent RL framework that treats LLM safety as a positive-sum game between conversation & feedback agents. It strikes an elegant balance between helpfulness & harmlessness, boosting safety & reduces overrefusals without degrading capabilities!

Jason Weston@jaseweston

💃New Multi-Agent RL Method: WaltzRL💃 📝: arxiv.org/abs/2510.08240 - Makes LLM safety a positive-sum game between a conversation & feedback agent - At inference feedback is adaptive, used when needed -> Improves safety & reduces overrefusals without degrading capabilities! 🧵1/5

English

12.2K

Andrew Wang retweetledi

Orion Weller@orionweller·9 Eyl

XLM-R has been SOTA for 6 years for multilingual encoders. That's an eternity in AI 🤯 Time for an upgrade. Introducing mmBERT: 2-4x faster than previous models ⚡ while even beating o3 and Gemini 2.5 Pro 🔥 + open models & training data - try it now! How did we do it? 🧵

English

249

43.3K

Andrew Wang retweetledi

Aayush Mishra@aamixsh·2 Eki

"Pre-training is our crappy evolution. It is one candidate solution to the cold start problem..." Exactly! When presented with information rich context, LLMs prepare how to respond using their pre-trained (evolved) brains. In our paper, we exploit this signal to improve SFT!

Andrej Karpathy@karpathy

Finally had a chance to listen through this pod with Sutton, which was interesting and amusing. As background, Sutton's "The Bitter Lesson" has become a bit of biblical text in frontier LLM circles. Researchers routinely talk about and ask whether this or that approach or idea is sufficiently "bitter lesson pilled" (meaning arranged so that it benefits from added computation for free) as a proxy for whether it's going to work or worth even pursuing. The underlying assumption being that LLMs are of course highly "bitter lesson pilled" indeed, just look at LLM scaling laws where if you put compute on the x-axis, number go up and to the right. So it's amusing to see that Sutton, the author of the post, is not so sure that LLMs are "bitter lesson pilled" at all. They are trained on giant datasets of fundamentally human data, which is both 1) human generated and 2) finite. What do you do when you run out? How do you prevent a human bias? So there you have it, bitter lesson pilled LLM researchers taken down by the author of the bitter lesson - rough! In some sense, Dwarkesh (who represents the LLM researchers viewpoint in the pod) and Sutton are slightly speaking past each other because Sutton has a very different architecture in mind and LLMs break a lot of its principles. He calls himself a "classicist" and evokes the original concept of Alan Turing of building a "child machine" - a system capable of learning through experience by dynamically interacting with the world. There's no giant pretraining stage of imitating internet webpages. There's also no supervised finetuning, which he points out is absent in the animal kingdom (it's a subtle point but Sutton is right in the strong sense: animals may of course observe demonstrations, but their actions are not directly forced/"teleoperated" by other animals). Another important note he makes is that even if you just treat pretraining as an initialization of a prior before you finetune with reinforcement learning, Sutton sees the approach as tainted with human bias and fundamentally off course, a bit like when AlphaZero (which has never seen human games of Go) beats AlphaGo (which initializes from them). In Sutton's world view, all there is is an interaction with a world via reinforcement learning, where the reward functions are partially environment specific, but also intrinsically motivated, e.g. "fun", "curiosity", and related to the quality of the prediction in your world model. And the agent is always learning at test time by default, it's not trained once and then deployed thereafter. Overall, Sutton is a lot more interested in what we have common with the animal kingdom instead of what differentiates us. "If we understood a squirrel, we'd be almost done". As for my take... First, I should say that I think Sutton was a great guest for the pod and I like that the AI field maintains entropy of thought and that not everyone is exploiting the next local iteration LLMs. AI has gone through too many discrete transitions of the dominant approach to lose that. And I also think that his criticism of LLMs as not bitter lesson pilled is not inadequate. Frontier LLMs are now highly complex artifacts with a lot of humanness involved at all the stages - the foundation (the pretraining data) is all human text, the finetuning data is human and curated, the reinforcement learning environment mixture is tuned by human engineers. We do not in fact have an actual, single, clean, actually bitter lesson pilled, "turn the crank" algorithm that you could unleash upon the world and see it learn automatically from experience alone. Does such an algorithm even exist? Finding it would of course be a huge AI breakthrough. Two "example proofs" are commonly offered to argue that such a thing is possible. The first example is the success of AlphaZero learning to play Go completely from scratch with no human supervision whatsoever. But the game of Go is clearly such a simple, closed, environment that it's difficult to see the analogous formulation in the messiness of reality. I love Go, but algorithmically and categorically, it is essentially a harder version of tic tac toe. The second example is that of animals, like squirrels. And here, personally, I am also quite hesitant whether it's appropriate because animals arise by a very different computational process and via different constraints than what we have practically available to us in the industry. Animal brains are nowhere near the blank slate they appear to be at birth. First, a lot of what is commonly attributed to "learning" is imo a lot more "maturation". And second, even that which clearly is "learning" and not maturation is a lot more "finetuning" on top of something clearly powerful and preexisting. Example. A baby zebra is born and within a few dozen minutes it can run around the savannah and follow its mother. This is a highly complex sensory-motor task and there is no way in my mind that this is achieved from scratch, tabula rasa. The brains of animals and the billions of parameters within have a powerful initialization encoded in the ATCGs of their DNA, trained via the "outer loop" optimization in the course of evolution. If the baby zebra spasmed its muscles around at random as a reinforcement learning policy would have you do at initialization, it wouldn't get very far at all. Similarly, our AIs now also have neural networks with billions of parameters. These parameters need their own rich, high information density supervision signal. We are not going to re-run evolution. But we do have mountains of internet documents. Yes it is basically supervised learning that is ~absent in the animal kingdom. But it is a way to practically gather enough soft constraints over billions of parameters, to try to get to a point where you're not starting from scratch. TLDR: Pretraining is our crappy evolution. It is one candidate solution to the cold start problem, to be followed later by finetuning on tasks that look more correct, e.g. within the reinforcement learning framework, as state of the art frontier LLM labs now do pervasively. I still think it is worth to be inspired by animals. I think there are multiple powerful ideas that LLM agents are algorithmically missing that can still be adapted from animal intelligence. And I still think the bitter lesson is correct, but I see it more as something platonic to pursue, not necessarily to reach, in our real world and practically speaking. And I say both of these with double digit percent uncertainty and cheer the work of those who disagree, especially those a lot more ambitious bitter lesson wise. So that brings us to where we are. Stated plainly, today's frontier LLM research is not about building animals. It is about summoning ghosts. You can think of ghosts as a fundamentally different kind of point in the space of possible intelligences. They are muddled by humanity. Thoroughly engineered by it. They are these imperfect replicas, a kind of statistical distillation of humanity's documents with some sprinkle on top. They are not platonically bitter lesson pilled, but they are perhaps "practically" bitter lesson pilled, at least compared to a lot of what came before. It seems possibly to me that over time, we can further finetune our ghosts more and more in the direction of animals; That it's not so much a fundamental incompatibility but a matter of initialization in the intelligence space. But it's also quite possible that they diverge even further and end up permanently different, un-animal-like, but still incredibly helpful and properly world-altering. It's possible that ghosts:animals :: planes:birds. Anyway, in summary, overall and actionably, I think this pod is solid "real talk" from Sutton to the frontier LLM researchers, who might be gear shifted a little too much in the exploit mode. Probably we are still not sufficiently bitter lesson pilled and there is a very good chance of more powerful ideas and paradigms, other than exhaustive benchbuilding and benchmaxxing. And animals might be a good source of inspiration. Intrinsic motivation, fun, curiosity, empowerment, multi-agent self-play, culture. Use your imagination.

English

140

35.9K

Andrew Wang retweetledi

Arda Uzunoğlu@aardauzunoglu·1 Eki

🛑 What's the Flaw of Averages? 📄: arxiv.org/abs/2509.25671 We’re in an evaluation crisis. Benchmarks are saturating, creating a false sense that tasks are solved. As training/eval chase these sets, plateaued averages hide shortcutting and distributional skew. 🧵1/7

English

12.5K

Andrew Wang retweetledi

Dongwei Jiang@Dongwei__Jiang·23 Eyl

Our paper has been accepted to #NeurIPS2025! Since our initial findings, we discovered some fascinating insights into WHY models resist feedback. Thread below 🧵

Dongwei Jiang@Dongwei__Jiang

🧵 Recent studies show LLMs can self-improve their responses when given external feedback. But how effectively can they incorporate it? We tested this systematically—and found they can't fully integrate feedback, even when the feedback is high-quality and backed by ground-truth.

English

6.1K

Andrew Wang@andrewwnlp·18 Eyl

Thanks to my collaborators Sophia Hager, Adi Asija, Nick Andrews, and @DanielKhashabi at @jhuclsp! Arxiv: arxiv.org/abs/2508.11027 Code: github.com/JHU-CLSP/hell-… (Data coming soon!)

English

324

Andrew Wang@andrewwnlp·18 Eyl

More tools = worse at handling tool failures When tool schemas are provided in-context, we find that performance gaps between adversarial and non-adversarial settings increases with the number of schemas.

English

154

Andrew Wang@andrewwnlp·18 Eyl

English

2.4K

Andrew Wang retweetledi

Tianjian Li@tli104·3 Eyl

Language models often produce repetitive responses, and this issue is further amplified by post-training. In this work, we introduce DARLING, a method that explicitly optimizes for both response diversity and quality within online reinforcement learning!

Jason Weston@jaseweston

🌀Diversity Aware RL (DARLING)🌀 📝: arxiv.org/abs/2509.02534 - Jointly optimizes for quality & diversity using a learned partition function - Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k - Works for both non-verifiable & verifiable tasks 🧵1/5

English

10.7K

Andrew Wang retweetledi

Dongwei Jiang@Dongwei__Jiang·16 Haz

English

111

14.7K

Andrew Wang retweetledi

Jieneng Chen@jieneng_chen·19 Kas

Introducing Genex: Generative World Explorer. 🧠 Humans mentally explore unseen parts of the world, revising their beliefs with imagined observations. ✨ Genex replicates this human-like ability, advancing embodied AI in planning with partial observations. (1/6)

English

164

36.9K

Keşfet

@AIatMeta @databricks @jhuclsp @DanielKhashabi @elonmusk @BarackObama @taylorswift13 @cristiano