Arvindh Arun

129 posts

@arvindh__a

Building and evaluating Foundation Models (all kinds), @ELLISforEurope @MPI_IS PhD student

Germany · Joined August 2019
772 Following · 454 Followers
Pinned Tweet
Arvindh Arun @arvindh__a
Why does horizon length grow exponentially as shown in the METR plot? Our new paper investigates this by isolating the execution capabilities of LLMs. Here's why you shouldn't be fooled by slowing progress on typical short-task benchmarks... 🧵
Arvindh Arun tweet media
14 replies · 33 reposts · 265 likes · 51.7K views
Arvindh Arun reposted
Maksym Andriushchenko @maksym_andr
💥 Today we release PostTrainBench v1.0 and the accompanying paper! We expect this benchmark to be key for monitoring progress in AI R&D automation and later recursive self-improvement. So, can LLM agents automate LLM post-training? 🧵
Maksym Andriushchenko tweet media
9 replies · 27 reposts · 178 likes · 15.5K views
Arvindh Arun @arvindh__a
@JoshPurtell agreed! that latency constraint is itself a routing signal. I’d treat OPD as nightly "compaction": most stuff stays in retrieval and only repeated/high-signal patterns get merged into weights. the real question still comes down to defining the promotion criteria.
1 reply · 0 reposts · 1 like · 78 views
Josh @JoshPurtell
@arvindh__a Just practically speaking, to do OPD you (probably) need to do it "offline" (with respect to the process interacting with the real world), so only some knowledge/data will hit the availability + signal bar to be added to the weights. My 2c
1 reply · 0 reposts · 1 like · 106 views
Arvindh Arun @arvindh__a
surprised to see people betting that the solution to continual learning is gonna be monolithic

the most effective approach will most definitely be a mixture of ICL, OPD, filesystem + good retrieval

real question is: how do you decide what goes in weights vs. context/filesystem?
2 replies · 1 repost · 7 likes · 1.1K views
Arvindh Arun reposted
Shashwat Goel @ShashwatGoel7
Excited to announce the OpenForecaster project: we train models to reason about and predict the future.

We won't get to AGI by maxxing STEM exam and coding benchmarks. That's not what most humans reason about in their day to day. Instead, we reason about uncertainty to make decisions, using our world-model of how society evolves. Yet, there weren't any large-scale datasets to train AI for this form of reasoning. Until now.

We release OpenForesight, a training dataset of 52k forecasting questions, made from global news. Our recipe is fully automated, and can be repeated for more, newer data at low cost. Using it, we RL-trained an 8B model, and it became competitive with much larger models like GPT-OSS-120B across benchmarks and metrics. And we want to keep building on this, in public.

Our paper with full details, dataset, code etc. in 🧵 Blog: openforecaster.github.io
Shashwat Goel tweet media
20 replies · 53 reposts · 382 likes · 40.5K views
Arvindh Arun reposted
Nikhil Chandak @nikhilchandak29
✨New work: How do we train language models for open-ended forecasting?🔮 For example, consider “Which tech company will the US government buy a > 7% stake in by September 2025?”. This requires one to explore the outcome space, not just assign probabilities to choices (as in most binary questions in prediction markets). Instead of picking from “Yes/No” options, we forecast in natural language: name the outcome + state your probability of it being correct. Results: with our data and post-training recipe, our OpenForecaster-8B model outperforms much larger models on calibration (Brier Score) and is competitive on accuracy, on held-out testing over 4 months.📈✅
Nikhil Chandak tweet media
3 replies · 13 reposts · 76 likes · 15.3K views
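The calibration metric cited above (Brier score) is just the mean squared error between a forecaster's stated probabilities and the realized outcomes. A minimal sketch, with made-up example data (not the paper's evaluation harness):

```python
def brier_score(forecasts):
    """Mean squared error between stated probabilities and outcomes.

    forecasts: list of (probability, outcome) pairs, where outcome is
    1 if the named prediction turned out correct, else 0.
    Lower is better: a perfect forecaster scores 0, and always
    answering 0.5 scores 0.25.
    """
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Example: three open-ended forecasts, two of which came true.
score = brier_score([(0.9, 1), (0.7, 0), (0.6, 1)])  # 0.22
```

Because the outcome here is "was the named outcome correct", the same score applies to open-ended natural-language forecasts, not just binary market questions.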
Arvindh Arun reposted
Shashwat Goel @ShashwatGoel7
🚨New paper: Training AI Co-Scientists using Rubric Rewards

In my recent internship at Meta Superintelligence Labs, I pursued an opinionated research bet: a general, scalable training recipe to improve AI at helping scientists achieve their research goals.

Motivation
Existing work on training AI for Science optimizes pre-defined, narrow scientific objectives with execution feedback in specially constructed environments (e.g. RLVR). However, it's infeasible to learn from trial and error in many sciences. For example, medical research is hard to simulate digitally, and it is unethical to run clinical trials with suboptimal approaches proposed in early training.😬 Moreover, when pursuing a novel research goal, the primary intellectual challenge often lies in defining the experiment setup and objective itself. In the past year, I have increasingly used AI assistance for this (especially GPT-5) in my own research. Of course, models often fail to follow some explicitly stated requirements, and sometimes propose bad design choices, but that is fine! The generated plans are still useful for brainstorming, and I can implement them with further refinement.

Method
This made us wonder🤔: how can we train models to be better at this task of generating research plans, given an open-ended research goal? For training, we need to collect a large number of research goals and obtain fast verification signals. Human experts are expensive to access, and that wouldn't scale. 💡Equipped with the vast corpus of openly licensed scientific literature, and the recent success of RL, synthetic data curation, and rubrics, we propose a scalable recipe: extract research goals and goal-specific grading rubrics from existing papers with an LLM, and use them for RL training. Specifically, a frozen copy of the initial model rewards the plans generated during training using the goal-specific rubrics, checking seven general guidelines for the parts of the plan relevant to each rubric item.

🤔Won't this lead to reward hacking? It will. At some point. But until then, improvements on the training reward might generalize to better research plans for humans. We are hoping the goal-specific rubrics, provided as privileged information to the grader, create a generator-verifier gap that improves research plan generation without external supervision. The only way to find out? Perform a human study. We ask Machine Learning experts to compare plans generated by the finetuned vs. initial Qwen3-30B model for ML research goals. This is slow and expensive; it required 45 minutes per annotation to carefully analyze plans, so we could only do this once, at the end of the project, for evaluation.

Results
Individual annotations are still noisy, as evaluating research plans is inherently subjective. But sure enough, there is non-trivial signal. The experts preferred (p < 0.01) our finetuned model's plans for 70% of research goals extracted from NeurIPS'24 / ICLR'25 Oral papers (top 1%) ✅ But only ML, and finetuned vs. initial, is boring. Remember, the goal is generality. So we also finetuned Qwen3-30B on goals extracted from medical research and new arXiv preprints spanning 8 domains. We use rubric evaluations with a jury of frontier models, which also allows us to compare many frontier models across domains.

Notable findings:
1) In-domain finetuning leads to 12-22% relative improvements in scores across the three domains: arXiv, medical, and ML 📈
2) Significant cross-domain generalization, especially with the medical finetune improving on ML and new arXiv research goals. This might be evidence for our "generality" thesis 📊
3) Our 30B finetune matches much larger models like Grok-4-Thinking, but GPT-5-Thinking is a cut above the rest (consistent with my qualitative experience) 🤖

Limitations
Now of course, LLM-based evaluations, even with a jury and rubrics, are imperfect. But while the individual sample scoring is noisy, we hope for directionally correct results in aggregate, as the jury has positive alignment with the human majority vote in our human study on ML. We think the grading scheme holds promise, as optimizing against a much weaker grader (30B) led to improvements in human preference. This work has many such limitations, so treat it more like an early proof-of-concept. We candidly acknowledge them in our paper, and encourage you to scrutinize the details: 📜 alphaxiv.org/abs/2512.23707

Released Artefacts
The paper has many ablations and analyses:
- our appendix also has sample outputs across domains for vibe-checks, making it 119 pages!
- criteria-wise breakdown of performance evolution during training, thanks to our structured grading
- SFT on long-form plans worsened model performance
- training also improves Gemma and Llama models

🤗We release our train and test data on @huggingface. At a sample level the data is noisy, and generated by Llama-4-Maverick. Still, human experts approved 84% of the rubric items in ML, so there's promise, and the same methodology will lead to better-quality data as language models improve.

Overall, we think the potential of our approach is high: the scientific method is quite general, deep learning benefits from generality (transfer learning), and language models are amazing (better every month!). We hope approaches like this make LMs better at assisting researchers across diverse problem settings and scientific disciplines. Some cool figures from the paper, and acknowledgements in thread🧵. I'm all ears for feedback on how we could've done things better! 1/3
Shashwat Goel tweet media
15 replies · 58 reposts · 254 likes · 42.1K views
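The reward loop described under "Method" above, a frozen grader scoring generated plans against goal-specific rubric items, can be sketched roughly as follows. The prompt wording and yes/no parsing here are invented for illustration; the paper's actual grader checks seven general guidelines per rubric item rather than a bare yes/no.

```python
def rubric_reward(grader_llm, research_goal, plan, rubric_items):
    """RL reward sketch: a frozen grader model judges the generated plan
    against goal-specific rubric items extracted from an existing paper.
    Returns the fraction of rubric items the grader deems satisfied."""
    satisfied = 0
    for item in rubric_items:
        prompt = (
            f"Research goal: {research_goal}\n"
            f"Proposed plan: {plan}\n"
            f"Rubric item: {item}\n"
            "Does the plan satisfy this rubric item? Answer yes or no."
        )
        if grader_llm(prompt).strip().lower().startswith("yes"):
            satisfied += 1
    return satisfied / len(rubric_items)
```

The rubric items act as privileged information: the grader sees them, the policy being trained does not, which is what creates the generator-verifier gap the thread describes.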
Arvindh Arun @arvindh__a
a very important (and very timely) discourse 🚨‼️ Horizon length is one of the most important metrics for measuring progress, and @METR_Evals does a good job at measuring it. But the results are often taken too much at face value without understanding the nuances, more below 👇
Shashwat Goel@ShashwatGoel7

New Blogpost: How to game the METR plot🚨

In 2025, a single graph changed AGI timelines, investments, research priorities, model quality assessments and much more. But if you squint harder, only 14 prompts shaped AI discourse over this year. That's all the data in the 1-4 hour horizon-length regime that matters. 🕵️

What's more? A majority of these are about Cybersecurity capture-the-flag contests, and training a Machine Learning model.
> Post-train your model on CTF and ML codebases
> profit 📈! its METR horizon length will increase.
Exactly what OpenAI has been targeting in its Codex model releases... and is Anthropic underperforming in the 2-4hr range because it mostly consists of cybersecurity, which is dual-use for safety?

To be clear, I think it's an excellent idea to track horizon lengths instead of benchmark accuracy. But under the current modelling assumption of success probability being a logistic function of task length, SWAA+HCAST accuracy improvements alone might explain the exponential progress in horizon length 🔎

In the blog, I show detailed evidence for why we need to stop overindexing on the METR plot. Share it with anyone you see making decisions based on where the latest model lands on the METR plot. shash42.substack.com/p/how-to-game-…

0 replies · 1 repost · 4 likes · 1.2K views
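For context on the "logistic function of task length" assumption discussed above: a horizon-length metric of this kind reports the task duration at which a model's fitted success curve crosses 50%. A minimal sketch of that functional form (parameter names and values are illustrative, not METR's actual fit):

```python
import math

def p_success(task_minutes, h50, beta):
    """Logistic success probability in log task length.

    h50  : the model's 50% horizon in minutes; p_success(h50, ...) == 0.5
    beta : slope, i.e. how sharply success decays on longer tasks
    """
    return 1.0 / (1.0 + math.exp(beta * (math.log(task_minutes) - math.log(h50))))

# A model with a 60-minute horizon succeeds half the time on 1-hour
# tasks, and less often as tasks get longer.
p_hour = p_success(60, h50=60, beta=1.0)
p_day = p_success(480, h50=60, beta=1.0)
```

The critique in the quoted post follows from this shape: because the whole curve is fit jointly, accuracy gains on many short tasks can move the estimated h50 even with very few long-task datapoints.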
Arvindh Arun reposted
Maksym Andriushchenko @maksym_andr
We release PostTrainBench: a benchmark measuring how well AI agents like Claude Code can post-train base LLMs. We expect this to be an important indicator for AI R&D automation as it unfolds over the next few years. 🔗 posttrainbench.com 📂 github.com/aisa-group/Pos… 1/n
Maksym Andriushchenko tweet media
30 replies · 87 reposts · 722 likes · 160.8K views
Arvindh Arun reposted
Shashwat Goel @ShashwatGoel7
New blogpost: Why I think automated research is the means, not just the end, for training superintelligent AI systems.

In pointing models at scientific discovery, we will have to achieve the capabilities today's LLMs lack:
- long-horizon planning
- continual adaptation
- reasoning about uncertainty
- information-efficient learning
- and creative exploration.

Some of these capabilities may emerge from large-scale training. Others will require changes in how we implement and train AI systems. I don't yet know how exactly such a training loop would look. So consider this post a conjecture.

But science offers a few unique properties at its foundation:
- large open data
- verifiability
- truth-seeking (instead of power-seeking) incentives.

And thus I think scientific discovery is the ideal successor to internet-scale pretraining. It's not just an application, it may be the means to building what we're missing. Maybe that's why we have @openai @GoogleDeepMind @periodiclabs @futurehouse etc. all focusing on it. shash42.substack.com/p/automated-sc…
Shashwat Goel tweet media
3 replies · 4 reposts · 64 likes · 8K views
Arvindh Arun @arvindh__a
Impressive results from @jandotai's Jan-v2-VL on our long-horizon execution benchmark. Would love to see more upcoming models benchmarking on long-horizon (not just planning, but also) pure execution! more details about our benchmark: alphaxiv.org/abs/2509.09677
Arvindh Arun tweet media
👋 Jan@jandotai

Introducing Jan-v2-VL, a multimodal agent built for long-horizon tasks. Jan-v2-VL executes 49 steps without failure, while the base model stops at 5 and other similar-scale VLMs stop between 1 and 2. It achieves longer, stable task execution in your browser without accuracy loss.

3 variants are available:
- Jan-v2-VL-low (efficiency-oriented)
- Jan-v2-VL-med (balanced)
- Jan-v2-VL-high (deeper reasoning and longer execution)

Models: huggingface.co/collections/ja…

To use it, update your Jan App and download Jan-v2-VL from the Model Hub. Activate Browser MCP servers for agentic use cases. Credit to the @Alibaba_Qwen team for the Qwen3-VL-8B-Thinking base model.

2 replies · 1 repost · 8 likes · 1.6K views
Arvindh Arun reposted
Shashwat Goel @ShashwatGoel7
The uncanny thing I realized from the Gemini 3 model card. Their large improvements across benchmarks made me feel nothing. It's all about discrete capability jumps (continual learning, memory etc.), product, and model "feel" from here.
3 replies · 1 repost · 31 likes · 2.1K views
Arvindh Arun @arvindh__a
@Azrael2801 Hi, our task is Markovian, as we ask the model to maintain a running sum across turns: the values corresponding to the keys given at each turn need to be added to the running sum calculated in the previous turn.
0 replies · 0 reposts · 0 likes · 55 views
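The task structure described in this reply can be sketched as below. The key-value store and turn contents are invented for illustration, not the paper's actual data:

```python
def run_episode(kv_store, turns):
    """Simulate the running-sum task. The state carried between turns
    is only the previous sum, and each turn's correct answer depends
    only on that sum plus the current turn's keys, which is what makes
    the task Markovian while still chaining every turn to the last."""
    running_sum = 0
    for keys in turns:  # each turn supplies a batch of keys
        running_sum += sum(kv_store[k] for k in keys)
        # the model is asked to report running_sum at this point
    return running_sum

store = {"a": 3, "b": 5, "c": 2}
total = run_episode(store, [["a", "b"], ["c"], ["a"]])  # 8, 10, 13
```

This also answers the interdependence question below it: turns are not independent, since an error at any turn corrupts every subsequent running sum.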
Bhavishya Pohani @Azrael2801
@arvindh__a Interesting paper! I went through a few of the tasks and I'm confused as to how they are long-horizon tasks, though. Each task looks independent of another, i.e. summing up the values of the keys. Would you agree? A long-horizon task should ideally have interdependent subtasks?
1 reply · 0 reposts · 1 like · 67 views
Arvindh Arun reposted
Roberto Dailey @RobertoDailey1
New work from Cognizant AI Lab: Solving a Million-Step LLM Task with Zero Errors.

Existing LLMs struggle on long task horizons as persistent error rates compound, even when the LLMs know how to solve the task. Apple's "Illusion of Thinking" demonstrated that state-of-the-art reasoning models could struggle with a simple task, Towers of Hanoi, if that task required execution of hundreds of steps in a row without error.

We hypothesized we could see much higher performance by breaking the task down into its smallest subtasks, then using voting and red-flagging to boost subtask accuracy. With these simple modifications we were able to push the simple gpt-4.1-mini to solve the 20-disk Towers of Hanoi, or 1,048,575 steps, without a single error!

Seeing these results, we believe that with the right, robust frameworks, LLMs can be scaled to vastly longer task lengths than their base model.

Paper: arxiv.org/abs/2511.09030 Blog: cognizant.com/us/en/ai-lab/b…
20 replies · 109 reposts · 717 likes · 134.2K views
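The per-step "voting and red flagging" idea above can be sketched as: sample several candidate next steps, discard any that trip a sanity check, and take the majority of the survivors. All names here are illustrative, not the paper's implementation:

```python
from collections import Counter

def next_step(sample_fn, is_red_flagged, k=5):
    """Per-step reliability boost: draw k candidate moves, drop any
    that the red-flag check rejects, and return the majority vote of
    the remaining candidates."""
    votes = [s for s in (sample_fn() for _ in range(k)) if not is_red_flagged(s)]
    if not votes:
        raise RuntimeError("all candidates red-flagged; resample")
    return Counter(votes).most_common(1)[0][0]
```

The point of driving per-step error down: a clean n-step run happens with probability roughly p**n for per-step accuracy p, so at n around 10^6 even tiny per-step error rates are fatal, and even small per-step gains compound enormously.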
Arvindh Arun @arvindh__a
I will be at #EMNLP2025 🇨🇳🐉 next week to present our work on foundation models for graph-structured data (Session 13 in Hall C on November 7) Looking forward to chatting with folks working on Foundation Models (all kinds) & LLM capability evals! 📄: alphaxiv.org/pdf/2505.20422
Arvindh Arun tweet media
Arvindh Arun@arvindh__a

Does text help KG Foundation Models generalize better? 🤔 Yes (and no)! ☯️ Bootstrapped by LLMs improving KG relation labels, we show that textual similarity between relations can act as an invariance - helping generalization across datasets! 🧵👇

1 reply · 0 reposts · 7 likes · 831 views
Arvindh Arun @arvindh__a
@jay_azhang Really cool work! esp seeing certain models being pessimistic and mostly buying shorts while others do the opposite. Re: noise vs. skill: have you considered running random agents with matched position size/frequency as baselines, to highlight the skill (or lack thereof) of the models?
2 replies · 0 reposts · 4 likes · 1.8K views
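The matched random baseline proposed in this reply could look something like the sketch below: an agent that mirrors the model's trade timing and position sizes but picks long/short at random, run many times to get a null distribution for P&L. Every name here is hypothetical, not part of any released harness:

```python
import random

def random_agent_pnl(model_trades, price_moves, rng):
    """Skill baseline: replay the model's trade timing and sizing but
    choose direction by coin flip. model_trades: list of (t, size);
    price_moves[t]: signed return at time t."""
    pnl = 0.0
    for t, size in model_trades:
        direction = rng.choice([1, -1])  # random long or short
        pnl += direction * size * price_moves[t]
    return pnl

# Compare the real model's P&L against many matched random runs.
moves = {0: 0.02, 1: -0.01, 2: 0.03}
trades = [(0, 100), (2, 50)]
baseline = [random_agent_pnl(trades, moves, random.Random(i)) for i in range(1000)]
```

If the model's actual P&L sits deep in the tail of `baseline`, that is evidence of directional skill rather than noise from position sizing and timing alone.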
Jay A @jay_azhang
Qwen's portfolio is up +60%. Gemini's is down -60%. Of course, too early to tell how much is skill vs. noise. Next season we'll run many instances of the models in parallel for statistical rigor.

The goal of Season 1 was to look for biases. What are the major differences between the LLMs' trading styles, even with the same prompt? Can they even follow basic risk management rules? A few early patterns:
> Qwen has only made 22 trades. It almost *never* has more than two positions on
> Gemini has made 108 trades. It literally always has the max number of positions on (6)
> Qwen has higher self-reported confidence (avg. 80% vs 65%)
> Qwen's stop-loss and take-profit levels are *much* tighter than Gemini's, but Gemini breaks its own rules often, and gets out early (others don't do this)

Overall, we're excited by the potential of LLMs and trading, but we're still skeptical. Much to test and learn
Jay A tweet media
202 replies · 102 reposts · 1.5K likes · 319.1K views