Arnav Jain

300 posts

@arnavkj95

PhD student University of Montréal and @Mila_Quebec. Prev @Cohere @Microsoft, @IITKgp.

Joined December 2014

1.6K Following · 644 Followers

Pinned Tweet

Arnav Jain@arnavkj95·19 Sep

⛵️ Excited to share that 𝚂𝙰𝙸𝙻𝙾𝚁 will dock in San Diego ⚓️ this December as a NeurIPS 2025 Spotlight (top ~3%)! Paper: arxiv.org/abs/2506.05294 Code: github.com/arnavkj1995/SA… Website: gokul.dev/sailor/

Gokul Swamy@g_k_swamy

Congrats to @arnavkj95, @vib2810_, and all the authors on their #NeurIPS2025 Spotlight! We have one more surprise up our sleeves I'm excited to share soon 😉

3.3K

Arnav Jain retweeted

Richard Sutton@RichardSSutton·6d

The bitter lesson in 26 words: Don’t be distracted by human knowledge, as AI has been historically. Instead focus on methods for creating knowledge that scale with computation, like search and learning.

136

967

7.4K

555.9K

Arnav Jain retweeted

Moksh Jain@JainMoksh·13 May

The scientific process involves collecting informative measurements while effectively allocating limited resources. We developed MaD-Physics, a new benchmark to measure this capability of agents.

6.1K

Arnav Jain retweeted

ICLR@iclr_conf·24 Apr

#ICLR2026 Test of Time Award talk happening now -- "Continuous control with deep reinforcement learning" 🤖 by Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra

10.4K

Arnav Jain retweeted

OpenAI@OpenAI·23 Apr

Introducing GPT-5.5

A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done. Now available in ChatGPT and Codex.

2.5K

51.8K

13.1M

Arnav Jain retweeted

Boyuan Chen@BoyuanChen0·21 Apr

This is what I’ve been cooking for the past 4 months. GPT Image 2 is a massive jump of over 240 Elo above the second-place model, marking the biggest jump, bigger than the rest of the leaderboard combined

Arena.ai@arena

Exciting news - GPT-Image-2 by @OpenAI has claimed the #1 spot across all Image Arena leaderboards! A clean sweep with a record-breaking +242 point lead in Text-to-Image - the largest gap we’ve seen to date.

- #1 Text-to-Image (1512), +242 over #2 (Nano-banana-2 with web-search aka gemini-3.1-flash-image)
- #1 Single-Image Edit (1513), +125 over #2 (Nano-banana-pro aka gemini-3-pro-image)
- #1 Multi-Image Edit (1464), +90 over #2 (Nano-banana-2)

No model has dominated Image Arena with margins this wide. Huge congratulations to @OpenAI on this major breakthrough in image generation! More performance breakdowns by category in the thread below.

1.6K

150.9K

Arnav Jain retweeted

Andjela Mladenovic@ml_andjela·20 Apr

Hi! If you are interested in game-theoretic analysis of the AI race and open vs. closed sourcing, check out our new paper: "Why Open Source? A Game-Theoretic Analysis of the AI Race" arxiv.org/pdf/2604.16227 There are some cute complexity results there 🙂

2.4K

Arnav Jain retweeted

Deepak Nathani@deepaknathani11·2 Apr

🎉 Excited to share 🍐 PARE and PARE-Bench - a framework and benchmark for evaluating proactive assistants through active user simulation in mobile environments.

Current LM agents are reactive: they wait for you to tell them what to do. Proactive agents flip this. They observe what you're doing and figure out how to help. Imagine your assistant notices you got a text from your roommate saying "we're out of soap" while you're editing your shopping list, and adds soap to your list.

🚧 Evaluating these agents is challenging because they must observe realistic user behavior to infer goals. You can't do this with static benchmarks or passive users.

Our key contributions:
🍐 PARE: an active user simulation framework where users navigate apps through Finite State Machine (FSM) based stateful interfaces, just like on a real phone
📱 Asymmetric design: users and assistants observe different information and interact through different interfaces, matching real-world deployment
👀 Observe-Execute architecture: lightweight observer monitors continuously, executor acts only after user approval
📋 PARE-Bench: 143 tasks across 9 app categories testing goal inference, intervention timing, and multi-app orchestration
📊 Evaluation of 7 LLMs reveals that even frontier models achieve only 42% success rate

PARE is built on top of Meta's Agent Research Environment (ARE) and enables scalable, repeatable evaluation of proactive agents.

In PARE, the simulated user goes about their day on the phone: accomplishing goals, navigating between apps, and responding to notifications. The proactive agent watches all of this unfold and uses the user's actions and environment signals to build context about what the user might need help with.

Huge thanks to my advisors @xwang_lk @WilliamWangNLP and my amazing collaborators @JasonZ118707 @HuanCC2002 Jiaming Shan @yinfeiy Alkesh Patel @zhegan4 @m2saxon 🙏

21.9K

Arnav Jain retweeted

Nate Rahn@n8rahn·26 Mar

New Anthropic Fellows research: Abstractive red-teaming of language model character

The worst way to find out about a character flaw in your language model is from a viral screenshot. How can we find these issues before deployment, rather than after?

In this work, we introduce abstractive red-teaming, a new approach that searches over natural-language categories of queries, rather than individual prompts.

149

18.3K

Arnav Jain retweeted

Jason Weston@jaseweston·23 Mar

🌐Unified Post-Training via On-Policy-Trained LM-as-RM🔧

RLLM = RL + LM-as-RM:
- post-training framework that unifies RL across easy-, hard-to-verify, and non-verifiable tasks.
- trains the LM-as-RM reward model on-policy from the policy’s own outputs, then uses those generative rewards to optimize the policy. 🔗📈
- uses the LLM’s reasoning + instruction-following for higher-quality rewards, boosting performance on all task types. 🚀🤖🏆

Read more in the blog post: facebookresearch.github.io/RAM/blogs/rllm/

310

25.9K

Arnav Jain retweeted

Jesse Farebrother@JesseFarebro·17 Mar

@ShamKakade6 Can you talk about the relationship to inverse RL? Eg, we had done exactly this in continuous control (arxiv.org/abs/2411.07007), and makes me wonder how feature matching compares with IQ-learn for LLMs as seen here: arxiv.org/abs/2409.01369.

817

Arnav Jain retweeted

Jesse Farebrother@JesseFarebro·9 Mar

@kjaved_ @RichardSSutton Slight bit of self promotion: even better for (3) is conditioning on a policy as we did here: arxiv.org/abs/2602.19634. We’re getting close to having full option transition models.

1.2K

Arnav Jain retweeted

Darshan Patil@dapatil211·5 Mar

🧬 New paper

Scientific datasets evolve as science evolves. With proteins, new sequences get added, annotations get corrected, and noisy entries get curated out.

Introducing CoPeP, a continual-pretraining benchmark for protein LMs. Details 🧵 1/n

8.5K

Arnav Jain retweeted

Jesse Zhang@Jesse_Y_Zhang·3 Mar

A reward model that works, zero-shot, across robots, tasks, and scenes? Introducing Robometer: Scaling general-purpose robotic reward models with 1M+ trajectories. Enables zero-shot: online/offline/model-based RL, data retrieval + IL, automatic failure detection, and more! 🧵 (1/12)

105

409

97.9K

Arnav Jain retweeted

Kianté Brantley@xkianteb·27 Feb

Does LLM RL post-training need to be on-policy?

327

113.1K

Arnav Jain retweeted

Gokul Swamy@g_k_swamy·27 Feb

It took a few years of deep thinking, but I'm super excited to finally share PROSPER: a beautiful, regression-based algorithm for RL from *rubric rewards* that robustly handles the *inconsistent feedback* that LLM judges provide. Let's go Back to Black(well)! 🧵(1/n)

270

51.4K

Arnav Jain retweeted

Kushal@kushalk_·24 Feb

🤖 Can a single robot policy manipulate diverse tools without ever seeing them before? Introducing SimToolReal 🔨 : a generalist dexterous manipulation policy that transfers zero-shot sim→real to unseen tools + unseen tasks All videos are 1x speed (60 Hz control) 🧵👇

381

106.2K

Arnav Jain retweeted

Emiliano Penaloza@emilianopp_·19 Feb

x.com/i/article/2024…

514

151.1K

Arnav Jain retweeted

Sheshansh Agrawal@sheshanshag·6 Feb

**New research: Introducing ⚡BlitzRank**

Current LLM rerankers waste tokens on information they already have. If A > B and B > C, you already know A > C, but existing methods don’t track this. BlitzRank fixes this. It uses tournament graphs to extract maximal information from each LLM call.

📊 Pareto-optimal across 14 benchmarks × 5 LLMs
⚡ 25–40% fewer tokens than comparable methods
⚡ 7× cheaper than pairwise at near-identical quality

18.7K
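
The transitivity point in the BlitzRank post above (if A > B and B > C, then A > C is already known) can be illustrated with a small sketch. This is not BlitzRank's algorithm, which the post does not describe in detail; it is a generic transitive closure over pairwise outcomes, and the function names `transitive_closure` and `unresolved_pairs` are hypothetical, chosen only for illustration.

```python
from itertools import product


def transitive_closure(n, known):
    """Return every (winner, loser) pair implied by the known comparisons.

    `known` is a set of (winner, loser) index pairs over items 0..n-1.
    Illustrative helper only; not taken from BlitzRank.
    """
    beats = [[False] * n for _ in range(n)]
    for winner, loser in known:
        beats[winner][loser] = True
    # Warshall's algorithm: if i beats k and k beats j, then i beats j.
    # product() varies its first element slowest, so k is the outer loop.
    for k, i, j in product(range(n), repeat=3):
        if beats[i][k] and beats[k][j]:
            beats[i][j] = True
    return {(i, j) for i in range(n) for j in range(n) if beats[i][j]}


def unresolved_pairs(n, known):
    """List the pairs whose order is still unknown after applying transitivity."""
    closure = transitive_closure(n, known)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if (i, j) not in closure and (j, i) not in closure
    ]


# A=0, B=1, C=2 with A > B and B > C already observed.
known = {(0, 1), (1, 2)}
print((0, 2) in transitive_closure(3, known))  # True: A > C is implied
print(unresolved_pairs(3, known))              # []: nothing left to ask
```

Under these assumptions, a reranker that tracks the closure would only spend further comparisons on pairs returned by `unresolved_pairs`.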

Arnav Jain retweeted

Emiliano Penaloza@emilianopp_·6 Feb

Remember all the self-distillation papers that came out last week? Well, we also propose it 😅, but… But alongside something better 😎 π-Distill. We show that with this method, you can distill closed-source frontier models even tho their traces are hidden 🔒. Both our methods can reach and even surpass the performance of the industry-standard SFT + RL with access to reasoning traces 🤯. 🔬And we spent ~100,000 GPU hours on a comprehensive analysis, not because the method is finicky, but because we wanted to understand why it works so well. 🧵 1/10

434

51.5K

Arnav Jain retweeted

Wenting Zhao@wzhao_nlp·3 Feb

This release is an emotional one for me because I stayed up so much for it 🥹 It has been truly amazing to see this model become better bit by bit through every change we make, and we have come a long way. Since I did mid-training for this model, I wanted to share a little anecdote about this part. We really made this model with user experience as a first-class consideration. We want people to actually use it, period. We took it so seriously that we redid mid-training because we saw cases where models failed to follow instructions on out-of-distribution scaffolds. We decided straight-up that we would fix this in a fundamental way instead of surface-level patching. The resulting base model, which we also release, is thus a healthy base. We find that, compared to other base models, this one better learns new tasks. Try fine-tuning our base and lmk what you think 🥳 huggingface.co/Qwen/Qwen3-Cod…

Qwen@Alibaba_Qwen

🚀 Introducing Qwen3-Coder-Next, an open-weight LM built for coding agents & local development.

What’s new:
🤖 Scaling agentic training: 800K verifiable tasks + executable envs
📈 Efficiency–Performance Tradeoff: achieves strong results on SWE-Bench Pro with 80B total params and 3B active
✨ Supports OpenClaw, Qwen Code, Claude Code, web dev, browser use, Cline, etc

🤗 Hugging Face: huggingface.co/collections/Qw…
🤖 ModelScope: modelscope.cn/collections/Qw…
📝 Blog: qwen.ai/blog?id=qwen3-…
📄 Tech report: github.com/QwenLM/Qwen3-C…

1.4K

108.7K
