Arvindh Arun

129 posts

@arvindh__a

Building and evaluating Foundation Models (all kinds), @ELLISforEurope @MPI_IS PhD student

Germany · Joined August 2019
772 Following · 454 Followers
Pinned Tweet
Arvindh Arun @arvindh__a
Why does horizon length grow exponentially as shown in the METR plot? Our new paper investigates this by isolating the execution capabilities of LLMs. Here's why you shouldn't be fooled by slowing progress on typical short-task benchmarks... 🧵
Arvindh Arun tweet media
14 replies · 33 reposts · 265 likes · 51.7K views
Arvindh Arun reposted
Maksym Andriushchenko @maksym_andr
💥 Today we release PostTrainBench v1.0 and the accompanying paper! We expect this benchmark to be key for monitoring progress in AI R&D automation and later recursive self-improvement. So, can LLM agents automate LLM post-training? 🧵
Maksym Andriushchenko tweet media
9 replies · 27 reposts · 178 likes · 15.5K views
Arvindh Arun @arvindh__a
@JoshPurtell agreed! that latency constraint is itself a routing signal. I’d treat OPD as nightly "compaction": most stuff stays in retrieval and only repeated/high-signal patterns get merged into weights. the real question still comes down to defining the promotion criteria.
1 reply · 0 reposts · 1 like · 78 views
Josh @JoshPurtell
@arvindh__a Just practically speaking, to do OPD you (probably) need to do it "offline" (with respect to the process interacting with the real world), so only some knowledge/data will hit the availability + signal bar to be added to the weights. My 2c
1 reply · 0 reposts · 1 like · 106 views
Arvindh Arun @arvindh__a
surprised to see people betting that the solution to continual learning is gonna be monolithic

the most effective approach will most definitely be a mixture of ICL, OPD, filesystem + good retrieval

real question is: how do you decide what goes in weights vs. context/filesystem?
2 replies · 1 repost · 7 likes · 1.1K views
Arvindh Arun reposted
Shashwat Goel @ShashwatGoel7
Excited to announce the OpenForecaster project: we train models to reason about and predict the future.

We won't get to AGI by maxxing STEM exam and coding benchmarks. That's not what most humans reason about in their day to day. Instead, we reason about uncertainty to make decisions, using our world-model of how society evolves. Yet, there weren't any large-scale datasets to train AI for this form of reasoning. Until now.

We release OpenForesight, a training dataset of 52k forecasting questions, made from global news. Our recipe is fully automated, and can be repeated for more, newer data at low cost. Using it, we RL-trained an 8B model, and it became competitive with much larger models like GPT-OSS-120B across benchmarks and metrics. And we want to keep building on this, in public.

Our paper with full details, dataset, code etc. in 🧵 Blog: openforecaster.github.io
Shashwat Goel tweet media
20 replies · 53 reposts · 382 likes · 40.5K views
Arvindh Arun reposted
Nikhil Chandak @nikhilchandak29
✨New work: How do we train language models for open-ended forecasting?🔮 For example, consider “Which tech company will the US government buy a > 7% stake in by September 2025?”. This requires one to explore the outcome space, not just assign probabilities to choices (as in most binary questions in prediction markets). Instead of picking from “Yes/No” options, we forecast in natural language: name the outcome + state your probability of it being correct. Results: with our data and post-training recipe, our OpenForecaster-8B model outperforms much larger models on calibration (Brier Score) and is competitive on accuracy, on held-out testing over 4 months.📈✅
Nikhil Chandak tweet media
3 replies · 13 reposts · 76 likes · 15.3K views
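The calibration metric cited above (Brier score) is just the mean squared error between a forecaster's stated probabilities and the realized outcomes. A minimal sketch, with made-up example data (not the paper's evaluation harness):

```python
def brier_score(forecasts):
    """Mean squared error between stated probabilities and outcomes.

    forecasts: list of (probability, outcome) pairs, where outcome is
    1 if the named prediction turned out correct, else 0.
    Lower is better: a perfect forecaster scores 0, and always
    answering 0.5 scores 0.25.
    """
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Example: three open-ended forecasts, two of which came true.
score = brier_score([(0.9, 1), (0.7, 0), (0.6, 1)])  # 0.22
```

Because the outcome here is "was the named outcome correct", the same score applies to open-ended natural-language forecasts, not just binary market questions.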
Arvindh Arun reposted
Shashwat Goel @ShashwatGoel7
🚨New paper: Training AI Co-Scientists using Rubric Rewards

In my recent internship at Meta Superintelligence Labs, I pursued an opinionated research bet: a general, scalable training recipe to improve AI at helping scientists achieve their research goals.

Motivation
Existing work on training AI for Science optimizes pre-defined, narrow scientific objectives with execution feedback in specially constructed environments (e.g. RLVR). However, it's infeasible to learn from trial and error in many sciences. For example, medical research is hard to simulate digitally, and it is unethical to run clinical trials with suboptimal approaches proposed in early training.😬 Moreover, when pursuing a novel research goal, the primary intellectual challenge often lies in defining the experiment setup and objective itself. In the past year, I have increasingly used AI assistance for this (especially GPT-5) in my own research. Of course, models often fail to follow some explicitly stated requirements, and sometimes propose bad design choices, but that is fine! The generated plans are still useful for brainstorming, and I can implement them with further refinement.

Method
This made us wonder🤔: how can we train models to be better at this task of generating research plans, given an open-ended research goal? For training, we need to collect a large number of research goals and obtain fast verification signals. Human experts are expensive to access, and that wouldn't scale. 💡Equipped with the vast corpus of openly licensed scientific literature, and the recent success of RL, synthetic data curation, and rubrics, we propose a scalable recipe: extract research goals and goal-specific grading rubrics from existing papers with an LLM, and use them for RL training. Specifically, a frozen copy of the initial model rewards the plans generated during training using the goal-specific rubrics, checking seven general guidelines for the parts of the plan relevant to each rubric item.

🤔Won't this lead to reward hacking? It will. At some point. But until then, improvements on the training reward might generalize to better research plans for humans. We are hoping the goal-specific rubrics, provided as privileged information to the grader, create a generator-verifier gap that improves research plan generation without external supervision. The only way to find out? Perform a human study. We ask Machine Learning experts to compare plans generated by the finetuned vs. initial Qwen3-30B model for ML research goals. This is slow and expensive; it required 45 minutes per annotation to carefully analyze plans, so we could only do this once, at the end of the project, for evaluation.

Results
Individual annotations are still noisy, as evaluating research plans is inherently subjective. But sure enough, there is non-trivial signal. The experts preferred (p < 0.01) our finetuned model's plans for 70% of research goals extracted from NeurIPS'24 / ICLR'25 Oral papers (top 1%) ✅ But only ML, and finetuned vs. initial, is boring. Remember, the goal is generality. So we also finetuned Qwen3-30B on goals extracted from medical research and new arXiv preprints spanning 8 domains. We use rubric evaluations with a jury of frontier models, which also allows us to compare many frontier models across domains.

Notable findings:
1) In-domain finetuning leads to 12-22% relative improvements in scores across the three domains: arXiv, medical, and ML 📈
2) Significant cross-domain generalization, especially with the medical finetune improving on ML and new arXiv research goals. This might be evidence for our "generality" thesis 📊
3) Our 30B finetune matches much larger models like Grok-4-Thinking, but GPT-5-Thinking is a cut above the rest (consistent with my qualitative experience) 🤖

Limitations
Now of course, LLM-based evaluations, even with a jury and rubrics, are imperfect. But while the individual sample scoring is noisy, we hope for directionally correct results in aggregate, as the jury has positive alignment with the human majority vote in our human study on ML. We think the grading scheme holds promise, as optimizing against a much weaker grader (30B) led to improvements in human preference. This work has many such limitations, so treat it more like an early proof-of-concept. We candidly acknowledge them in our paper, and encourage you to scrutinize the details: 📜 alphaxiv.org/abs/2512.23707

Released Artefacts
The paper has many ablations and analyses:
- our appendix also has sample outputs across domains for vibe-checks, making it 119 pages!
- criteria-wise breakdown of performance evolution during training, thanks to our structured grading
- SFT on long-form plans worsened model performance
- training also improves Gemma and Llama models

🤗We release our train and test data on @huggingface. At a sample level the data is noisy, and generated by Llama-4-Maverick. Still, human experts approved 84% of the rubric items in ML, so there's promise, and the same methodology will lead to better-quality data as language models improve.

Overall, we think the potential of our approach is high: the scientific method is quite general, deep learning benefits from generality (transfer learning), and language models are amazing (better every month!). We hope approaches like this make LMs better at assisting researchers across diverse problem settings and scientific disciplines. Some cool figures from the paper, and acknowledgements in thread🧵. I'm all ears for feedback on how we could've done things better! 1/3
Shashwat Goel tweet media
15 replies · 58 reposts · 254 likes · 42.1K views
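The reward loop described under "Method" above, a frozen grader scoring generated plans against goal-specific rubric items, can be sketched roughly as follows. The prompt wording and yes/no parsing here are invented for illustration; the paper's actual grader checks seven general guidelines per rubric item rather than a bare yes/no.

```python
def rubric_reward(grader_llm, research_goal, plan, rubric_items):
    """RL reward sketch: a frozen grader model judges the generated plan
    against goal-specific rubric items extracted from an existing paper.
    Returns the fraction of rubric items the grader deems satisfied."""
    satisfied = 0
    for item in rubric_items:
        prompt = (
            f"Research goal: {research_goal}\n"
            f"Proposed plan: {plan}\n"
            f"Rubric item: {item}\n"
            "Does the plan satisfy this rubric item? Answer yes or no."
        )
        if grader_llm(prompt).strip().lower().startswith("yes"):
            satisfied += 1
    return satisfied / len(rubric_items)
```

The rubric items act as privileged information: the grader sees them, the policy being trained does not, which is what creates the generator-verifier gap the thread describes.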
Arvindh Arun @arvindh__a
a very important (and very timely) discourse 🚨‼️ Horizon length is one of the most important metrics for measuring progress, and @METR_Evals does a good job at measuring it. But the results are often taken too much at face value without understanding the nuances, more below 👇
Shashwat Goel@ShashwatGoel7

New Blogpost: How to game the METR plot🚨

In 2025, a single graph changed AGI timelines, investments, research priorities, model quality assessments and much more. But if you squint harder, only 14 prompts shaped AI discourse over this year. That's all the data in the 1-4 hour horizon-length regime that matters. 🕵️

What's more? A majority of these are about Cybersecurity capture-the-flag contests, and training a Machine Learning model.
> Post-train your model on CTF and ML codebases
> profit 📈! its METR horizon length will increase.
Exactly what OpenAI has been targeting in its Codex model releases... and is Anthropic underperforming in the 2-4hr range because it mostly consists of cybersecurity, which is dual-use for safety?

To be clear, I think it's an excellent idea to track horizon lengths instead of benchmark accuracy. But under the current modelling assumption of success probability being a logistic function of task length, SWAA+HCAST accuracy improvements alone might explain the exponential progress in horizon length 🔎

In the blog, I show detailed evidence for why we need to stop overindexing on the METR plot. Share it with anyone you see making decisions based on where the latest model lands on the METR plot. shash42.substack.com/p/how-to-game-…

0 replies · 1 repost · 4 likes · 1.2K views
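For context on the "logistic function of task length" assumption discussed above: a horizon-length metric of this kind reports the task duration at which a model's fitted success curve crosses 50%. A minimal sketch of that functional form (parameter names and values are illustrative, not METR's actual fit):

```python
import math

def p_success(task_minutes, h50, beta):
    """Logistic success probability in log task length.

    h50  : the model's 50% horizon in minutes; p_success(h50, ...) == 0.5
    beta : slope, i.e. how sharply success decays on longer tasks
    """
    return 1.0 / (1.0 + math.exp(beta * (math.log(task_minutes) - math.log(h50))))

# A model with a 60-minute horizon succeeds half the time on 1-hour
# tasks, and less often as tasks get longer.
p_hour = p_success(60, h50=60, beta=1.0)
p_day = p_success(480, h50=60, beta=1.0)
```

The critique in the quoted post follows from this shape: because the whole curve is fit jointly, accuracy gains on many short tasks can move the estimated h50 even with very few long-task datapoints.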
Arvindh Arun reposted
Maksym Andriushchenko @maksym_andr
We release PostTrainBench: a benchmark measuring how well AI agents like Claude Code can post-train base LLMs. We expect this to be an important indicator for AI R&D automation as it unfolds over the next few years. 🔗 posttrainbench.com 📂 github.com/aisa-group/Pos… 1/n
Maksym Andriushchenko tweet media
30 replies · 87 reposts · 722 likes · 160.8K views
Arvindh Arun reposted
Shashwat Goel @ShashwatGoel7
New blogpost: Why I think automated research is the means, not just the end, for training superintelligent AI systems.

In pointing models at scientific discovery, we will have to achieve the capabilities today's LLMs lack:
- long-horizon planning
- continual adaptation
- reasoning about uncertainty
- information-efficient learning
- and creative exploration.

Some of these capabilities may emerge from large-scale training. Others will require changes in how we implement and train AI systems. I don't yet know how exactly such a training loop would look. So consider this post a conjecture.

But science offers a few unique properties at its foundation:
- large open data
- verifiability
- truth-seeking (instead of power-seeking) incentives.

And thus I think scientific discovery is the ideal successor to internet-scale pretraining. It's not just an application, it may be the means to building what we're missing. Maybe that's why we have @openai @GoogleDeepMind @periodiclabs @futurehouse etc. all focusing on it. shash42.substack.com/p/automated-sc…
Shashwat Goel tweet media
3 replies · 4 reposts · 64 likes · 8K views
Arvindh Arun @arvindh__a
Impressive results from @jandotai's Jan-v2-VL on our long-horizon execution benchmark. Would love to see more upcoming models benchmarking on long-horizon (not just planning, but also) pure execution! more details about our benchmark: alphaxiv.org/abs/2509.09677
Arvindh Arun tweet media
👋 Jan@jandotai

Introducing Jan-v2-VL, a multimodal agent built for long-horizon tasks. Jan-v2-VL executes 49 steps without failure, while the base model stops at 5 and other similar-scale VLMs stop between 1 and 2. It achieves longer, stable task execution in your browser without accuracy loss.

3 variants are available:
- Jan-v2-VL-low (efficiency-oriented)
- Jan-v2-VL-med (balanced)
- Jan-v2-VL-high (deeper reasoning and longer execution)

Models: huggingface.co/collections/ja…

To use it, update your Jan App and download Jan-v2-VL from the Model Hub. Activate Browser MCP servers for agentic use cases. Credit to the @Alibaba_Qwen team for the Qwen3-VL-8B-Thinking base model.

2 replies · 1 repost · 8 likes · 1.6K views
Arvindh Arun reposted
Shashwat Goel @ShashwatGoel7
The uncanny thing I realized from the Gemini 3 model card. Their large improvements across benchmarks made me feel nothing. It's all about discrete capability jumps (continual learning, memory etc.), product, and model "feel" from here.
3 replies · 1 repost · 31 likes · 2.1K views
Arvindh Arun @arvindh__a
@Azrael2801 Hi, our task is Markovian, as we ask the model to maintain a running sum across turns: the values corresponding to the keys given at each turn need to be added to the running sum calculated in the previous turn.
0 replies · 0 reposts · 0 likes · 55 views
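The task structure described in this reply can be sketched as below. The key-value store and turn contents are invented for illustration, not the paper's actual data:

```python
def run_episode(kv_store, turns):
    """Simulate the running-sum task. The state carried between turns
    is only the previous sum, and each turn's correct answer depends
    only on that sum plus the current turn's keys, which is what makes
    the task Markovian while still chaining every turn to the last."""
    running_sum = 0
    for keys in turns:  # each turn supplies a batch of keys
        running_sum += sum(kv_store[k] for k in keys)
        # the model is asked to report running_sum at this point
    return running_sum

store = {"a": 3, "b": 5, "c": 2}
total = run_episode(store, [["a", "b"], ["c"], ["a"]])  # 8, 10, 13
```

This also answers the interdependence question below it: turns are not independent, since an error at any turn corrupts every subsequent running sum.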
Bhavishya Pohani @Azrael2801
@arvindh__a Interesting paper! I went through a few of the tasks and I'm confused as to how they are long-horizon tasks, though. Each task looks independent of another, i.e. summing up the values of the keys. Would you agree? A long-horizon task should ideally have interdependent subtasks?
1 reply · 0 reposts · 1 like · 67 views
Arvindh Arun reposted
Roberto Dailey @RobertoDailey1
New work from Cognizant AI Lab: Solving a Million-Step LLM Task with Zero Errors.

Existing LLMs struggle on long task horizons as persistent error rates compound, even when the LLMs know how to solve the task. Apple's "Illusion of Thinking" demonstrated that state-of-the-art reasoning models could struggle with a simple task, Towers of Hanoi, if that task required execution of hundreds of steps in a row without error.

We hypothesized we could see much higher performance by breaking the task down into its smallest subtasks, then using voting and red-flagging to boost subtask accuracy. With these simple modifications we were able to push the simple gpt-4.1-mini to solve the 20-disk Towers of Hanoi, or 1,048,575 steps, without a single error!

Seeing these results, we believe that with the right, robust frameworks, LLMs can be scaled to vastly longer task lengths than their base model.

Paper: arxiv.org/abs/2511.09030 Blog: cognizant.com/us/en/ai-lab/b…
20 replies · 109 reposts · 717 likes · 134.2K views
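The per-step "voting and red flagging" idea above can be sketched as: sample several candidate next steps, discard any that trip a sanity check, and take the majority of the survivors. All names here are illustrative, not the paper's implementation:

```python
from collections import Counter

def next_step(sample_fn, is_red_flagged, k=5):
    """Per-step reliability boost: draw k candidate moves, drop any
    that the red-flag check rejects, and return the majority vote of
    the remaining candidates."""
    votes = [s for s in (sample_fn() for _ in range(k)) if not is_red_flagged(s)]
    if not votes:
        raise RuntimeError("all candidates red-flagged; resample")
    return Counter(votes).most_common(1)[0][0]
```

The point of driving per-step error down: a clean n-step run happens with probability roughly p**n for per-step accuracy p, so at n around 10^6 even tiny per-step error rates are fatal, and even small per-step gains compound enormously.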
Arvindh Arun @arvindh__a
I will be at #EMNLP2025 🇨🇳🐉 next week to present our work on foundation models for graph-structured data (Session 13 in Hall C on November 7) Looking forward to chatting with folks working on Foundation Models (all kinds) & LLM capability evals! 📄: alphaxiv.org/pdf/2505.20422
Arvindh Arun tweet media
Arvindh Arun@arvindh__a

Does text help KG Foundation Models generalize better? 🤔 Yes (and no)! ☯️ Bootstrapped by LLMs improving KG relation labels, we show that textual similarity between relations can act as an invariance - helping generalization across datasets! 🧵👇

1 reply · 0 reposts · 7 likes · 831 views
Arvindh Arun @arvindh__a
@jay_azhang Really cool work! esp seeing certain models being pessimistic and mostly buying shorts while others do the opposite. Re: noise vs. skill: have you considered running random agents with matched position size/frequency as baselines, to highlight the skill (or lack thereof) of the models?
2 replies · 0 reposts · 4 likes · 1.8K views
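The matched random baseline proposed in this reply could look something like the sketch below: an agent that mirrors the model's trade timing and position sizes but picks long/short at random, run many times to get a null distribution for P&L. Every name here is hypothetical, not part of any released harness:

```python
import random

def random_agent_pnl(model_trades, price_moves, rng):
    """Skill baseline: replay the model's trade timing and sizing but
    choose direction by coin flip. model_trades: list of (t, size);
    price_moves[t]: signed return at time t."""
    pnl = 0.0
    for t, size in model_trades:
        direction = rng.choice([1, -1])  # random long or short
        pnl += direction * size * price_moves[t]
    return pnl

# Compare the real model's P&L against many matched random runs.
moves = {0: 0.02, 1: -0.01, 2: 0.03}
trades = [(0, 100), (2, 50)]
baseline = [random_agent_pnl(trades, moves, random.Random(i)) for i in range(1000)]
```

If the model's actual P&L sits deep in the tail of `baseline`, that is evidence of directional skill rather than noise from position sizing and timing alone.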
Jay A @jay_azhang
Qwen's portfolio is up +60%. Gemini's is down -60%. Of course, too early to tell how much is skill vs. noise. Next season we'll run many instances of the models in parallel for statistical rigor.

The goal of Season 1 was to look for biases. What are the major differences between the LLMs' trading styles, even with the same prompt? Can they even follow basic risk management rules? A few early patterns:
> Qwen has only made 22 trades. It almost *never* has more than two positions on
> Gemini has made 108 trades. It literally always has the max number of positions on (6)
> Qwen has higher self-reported confidence (avg. 80% vs 65%)
> Qwen's stop-loss and take-profit levels are *much* tighter than Gemini's, but Gemini breaks its own rules often, and gets out early (others don't do this)

Overall, we're excited by the potential of LLMs and trading, but we're still skeptical. Much to test and learn
Jay A tweet media
202 replies · 102 reposts · 1.5K likes · 319.1K views