Arnav Jain

300 posts

@arnavkj95

PhD student at University of Montréal and @Mila_Quebec. Prev. @Cohere, @Microsoft, @IITKgp.

Joined December 2014
1.6K Following · 644 Followers
Arnav Jain retweeted
Richard Sutton @RichardSSutton
The bitter lesson in 26 words: Don’t be distracted by human knowledge, as AI has been historically. Instead focus on methods for creating knowledge that scale with computation, like search and learning.
136
967
7.4K
555.9K
Arnav Jain retweeted
Moksh Jain @JainMoksh
The scientific process involves collecting informative measurements while effectively allocating limited resources. We developed MaD-Physics, a new benchmark to measure this capability of agents.
1
17
38
6.1K
Arnav Jain retweeted
ICLR @iclr_conf
#ICLR2026 Test of Time Award talk happening now -- "Continuous control with deep reinforcement learning" 🤖 by Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra
1
10
84
10.4K
Arnav Jain retweeted
OpenAI @OpenAI
Introducing GPT-5.5
A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done.
Now available in ChatGPT and Codex.
2.5K
7K
51.8K
13.1M
Arnav Jain retweeted
Boyuan Chen @BoyuanChen0
This is what I’ve been cooking for the past 4 months. GPT Image 2 is a massive jump of over 240 Elo above the second-place model, a gap bigger than the rest of the leaderboard combined.
Arena.ai @arena

Exciting news - GPT-Image-2 by @OpenAI has claimed the #1 spot across all Image Arena leaderboards! A clean sweep with a record-breaking +242 point lead in Text-to-Image - the largest gap we’ve seen to date.
- #1 Text-to-Image (1512), +242 over #2 (Nano-banana-2 with web-search aka gemini-3.1-flash-image)
- #1 Single-Image Edit (1513), +125 over #2 (Nano-banana-pro aka gemini-3-pro-image)
- #1 Multi-Image Edit (1464), +90 over #2 (Nano-banana-2)
No model has dominated Image Arena with margins this wide. Huge congratulations to @OpenAI on this major breakthrough in image generation! More performance breakdowns by category in the thread below.

74
76
1.6K
150.9K
Arnav Jain retweeted
Andjela Mladenovic @ml_andjela
Hi! If you are interested in game-theoretic analysis of the AI race and open vs. closed sourcing, check out our new paper: "Why Open Source? A Game-Theoretic Analysis of the AI Race" arxiv.org/pdf/2604.16227 There are some cute complexity results there 🙂
1
8
19
2.4K
Arnav Jain retweeted
Deepak Nathani @deepaknathani11
🎉 Excited to share 🍐 PARE and PARE-Bench - a framework and benchmark for evaluating proactive assistants through active user simulation in mobile environments.
Current LM agents are reactive: they wait for you to tell them what to do. Proactive agents flip this. They observe what you're doing and figure out how to help. Imagine your assistant notices you got a text from your roommate saying "we're out of soap" while you're editing your shopping list, and adds soap to your list.
🚧 Evaluating these agents is challenging because they must observe realistic user behavior to infer goals. You can't do this with static benchmarks or passive users.
Our key contributions:
🍐 PARE: an active user simulation framework where users navigate apps through Finite State Machine (FSM) based stateful interfaces, just like on a real phone
📱 Asymmetric design: users and assistants observe different information and interact through different interfaces, matching real-world deployment
👀 Observe-Execute architecture: lightweight observer monitors continuously, executor acts only after user approval
📋 PARE-Bench: 143 tasks across 9 app categories testing goal inference, intervention timing, and multi-app orchestration
📊 Evaluation of 7 LLMs reveals that even frontier models achieve only 42% success rate
PARE is built on top of Meta's Agent Research Environment (ARE) and enables scalable, repeatable evaluation of proactive agents. In PARE, the simulated user goes about their day on the phone: accomplishing goals, navigating between apps, and responding to notifications. The proactive agent watches all of this unfold and uses the user's actions and environment signals to build context about what the user might need help with.
Huge thanks to my advisors @xwang_lk @WilliamWangNLP and my amazing collaborators @JasonZ118707 @HuanCC2002 Jiaming Shan @yinfeiy Alkesh Patel @zhegan4 @m2saxon 🙏
3
21
59
21.9K
Arnav Jain retweeted
Nate Rahn @n8rahn
New Anthropic Fellows research: Abstractive red-teaming of language model character
The worst way to find out about a character flaw in your language model is from a viral screenshot. How can we find these issues before deployment, rather than after?
In this work, we introduce abstractive red-teaming, a new approach that searches over natural-language categories of queries, rather than individual prompts.
2
29
149
18.3K
Arnav Jain retweeted
Jason Weston @jaseweston
🌐 Unified Post-Training via On-Policy-Trained LM-as-RM 🔧
RLLM = RL + LM-as-RM:
- post-training framework that unifies RL across easy-, hard-to-verify, and non-verifiable tasks.
- trains the LM-as-RM reward model on-policy from the policy’s own outputs, then uses those generative rewards to optimize the policy. 🔗📈
- uses the LLM’s reasoning + instruction-following for higher-quality rewards, boosting performance on all task types. 🚀🤖🏆
Read more in the blog post: facebookresearch.github.io/RAM/blogs/rllm/
5
46
310
25.9K
Arnav Jain retweeted
Darshan Patil @dapatil211
🧬 New paper
Scientific datasets evolve as science evolves. With proteins, new sequences get added, annotations get corrected, and noisy entries get curated out.
Introducing CoPeP, a continual-pretraining benchmark for protein LMs.
Details 🧵 1/n
2
29
84
8.5K
Arnav Jain retweeted
Jesse Zhang @Jesse_Y_Zhang
A reward model that works, zero-shot, across robots, tasks, and scenes? Introducing Robometer: Scaling general-purpose robotic reward models with 1M+ trajectories. Enables zero-shot: online/offline/model-based RL, data retrieval + IL, automatic failure detection, and more! 🧵 (1/12)
7
105
409
97.9K
Arnav Jain retweeted
Kianté Brantley @xkianteb
Does LLM RL post-training need to be on-policy?
10
44
327
113.1K
Arnav Jain retweeted
Gokul Swamy @g_k_swamy
It took a few years of deep thinking, but I'm super excited to finally share PROSPER: a beautiful, regression-based algorithm for RL from *rubric rewards* that robustly handles the *inconsistent feedback* that LLM judges provide. Let's go Back to Black(well)! 🧵(1/n)
3
33
270
51.4K
Arnav Jain retweeted
Kushal @kushalk_
🤖 Can a single robot policy manipulate diverse tools without ever seeing them before?
Introducing SimToolReal 🔨: a generalist dexterous manipulation policy that transfers zero-shot sim→real to unseen tools + unseen tasks
All videos are 1x speed (60 Hz control) 🧵👇
21
78
381
106.2K
Arnav Jain retweeted
Sheshansh Agrawal @sheshanshag
**New research: Introducing ⚡BlitzRank**
Current LLM rerankers waste tokens on information they already have. If A > B and B > C, you already know A > C, but existing methods don’t track this.
BlitzRank fixes this. It uses tournament graphs to extract maximal information from each LLM call.
📊 Pareto-optimal across 14 benchmarks × 5 LLMs
⚡ 25–40% fewer tokens than comparable methods
⚡ 7× cheaper than pairwise at near-identical quality
4
20
72
18.7K
Arnav Jain retweeted
Emiliano Penaloza @emilianopp_
Remember all the self-distillation papers that came out last week? Well, we also propose it 😅, but… alongside something better 😎: π-Distill
We show that with this method, you can distill closed-source frontier models even though their traces are hidden 🔒. Both our methods can reach and even surpass the performance of the industry-standard SFT + RL with access to reasoning traces 🤯.
🔬 And we spent ~100,000 GPU hours on a comprehensive analysis, not because the method is finicky, but because we wanted to understand why it works so well. 🧵 1/10
11
78
434
51.5K
Arnav Jain retweeted
Wenting Zhao @wzhao_nlp
This release is an emotional one for me because I stayed up so much for it 🥹 It has been truly amazing to see this model become better bit by bit through every change we made, and we have come a long way.
Since I did mid-training for this model, I wanted to share a little anecdote about this part. We really made this model with user experience as a first-class consideration. We want people to actually use it, period. We took it so seriously that we redid mid-training because we saw cases where models failed to follow instructions on out-of-distribution scaffolds. We decided straight-up that we would fix this in a fundamental way instead of surface-level patching.
The resulting base model, which we also release, is thus a healthy base. We find that, compared to other base models, this one learns new tasks better. Try fine-tuning our base and lmk what you think 🥳 huggingface.co/Qwen/Qwen3-Cod…
Qwen @Alibaba_Qwen

🚀 Introducing Qwen3-Coder-Next, an open-weight LM built for coding agents & local development.
What’s new:
🤖 Scaling agentic training: 800K verifiable tasks + executable envs
📈 Efficiency–Performance Tradeoff: achieves strong results on SWE-Bench Pro with 80B total params and 3B active
✨ Supports OpenClaw, Qwen Code, Claude Code, web dev, browser use, Cline, etc
🤗 Hugging Face: huggingface.co/collections/Qw…
🤖 ModelScope: modelscope.cn/collections/Qw…
📝 Blog: qwen.ai/blog?id=qwen3-…
📄 Tech report: github.com/QwenLM/Qwen3-C…

56
84
1.4K
108.7K