Bhavishya Pohani
@Azrael2801
Applied Research Scientist @ Snorkel AI
Joined December 2014
105 Following · 28 Followers
20 posts
Bhavishya Pohani @Azrael2801
Just tried out @gepa_ai's optimize_anything library — we split the Snorkel Finance Benchmark into train/val/test, gave optimize_anything the 100 train examples, and it boosted test-set performance by 8 points 👀👀 We gave it the system prompt to optimize, and found that it added helpful tool-usage rules, numerical-extraction and formatting guidance, and other best practices.
2 replies · 5 reposts · 17 likes · 2.8K views
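The workflow described above (hold out splits, hand the optimizer a scored system prompt, keep edits that improve a validation metric) can be sketched as a greedy loop. This is a toy illustration, not @gepa_ai's actual API: `evaluate`, `RULES`, and `optimize_prompt` are all hypothetical stand-ins.

```python
# Toy sketch of hill-climbing on a system prompt. All names here
# (evaluate, RULES, optimize_prompt) are hypothetical stand-ins,
# not the optimize_anything API.

# Hypothetical rules the optimizer might discover as useful additions.
RULES = [
    "Use tools before answering.",
    "Extract numerical values exactly as they appear.",
    "Format currency amounts consistently.",
]

def evaluate(prompt: str, examples: list) -> float:
    """Toy scorer: fraction of known-good rules present in the prompt.
    A real run would execute the agent on each example and grade answers."""
    return sum(rule in prompt for rule in RULES) / len(RULES)

def optimize_prompt(base_prompt: str, val_examples: list, candidate_edits: list):
    """Greedy search: append each candidate edit, keep it only if the
    validation score improves."""
    best, best_score = base_prompt, evaluate(base_prompt, val_examples)
    for edit in candidate_edits:
        candidate = best + "\n- " + edit
        score = evaluate(candidate, val_examples)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

prompt, score = optimize_prompt("You are a careful financial analyst.", [], RULES)
```

With a real scorer, the train/val splits guard against overfitting the prompt to the 100 training examples, and the held-out test split is what gives the 8-point gain its meaning.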
Bhavishya Pohani @Azrael2801
Snorkel 🤝 UC Berkeley rLLM team! Pretty interesting work we were involved in over the last couple of months, where the rLLM team fine-tuned and evaluated using Snorkel's finance reasoning benchmarks! Read the blog for more details > @mananroongta @sijun_tan
Snorkel AI @SnorkelAI

A 4B model > 235B on financial reasoning. We partnered with @rllm_project to fine-tune Qwen3-4B-Instruct-2507 — and it outperformed Qwen3-235B-A22B on expert-curated financial benchmarks.

0 replies · 1 repost · 13 likes · 953 views
Bhavishya Pohani reposted
Chris Glaze @chris_m_glaze
Frontier models like Gemini 3 Pro are making impressive strides as code agents, but they still show basic errors in real-world tasks when applying coding skills to solve enterprise-style problems.

We took the verified version of Tau^2 Bench made by the AGI team at @amazon and swapped in a code interpreter, challenging models to figure out how to solve problems in creative, open-ended ways without the hand-holding of bespoke tools that bake in required reasoning.

Takeaways:
1. Verification indeed makes a big difference: this version shows that frontier models can do impressively well at updating backends, after verifying that real software engineers could do it with the same information.
2. They still struggle here, however, doing proportionally much better at tasks that simply require inference.
3. Even when they succeed, they do it inefficiently and fail to exploit standard methods for working with metadata.

This is part of our ongoing R&D at @SnorkelAI in extending our @terminalbench approach to include richer, more complex environments that can encapsulate enterprise scenarios with code agents.

Showing pass@k analysis along with an example of an inefficiency from Gemini 3 Pro (it took 4 steps to figure out how to even work with the database). You can find samples from all the models here: huggingface.co/datasets/snork…
[2 images attached]
3 replies · 6 reposts · 35 likes · 1.7K views
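The thread does not say how its pass@k numbers were computed; a common choice is the unbiased estimator from the Codex paper (Chen et al., 2021), which, given n sampled attempts of which c passed, estimates the probability that at least one of k draws succeeds:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = total attempts sampled, c = attempts that passed, k = draw budget."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 4 attempts, 2 passing, budget of 2 draws: 1 - C(2,2)/C(4,2) = 1 - 1/6
```

The estimator averages over which k of the n attempts you would have drawn, so it is less noisy than simply re-sampling k attempts and checking for a success.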
Bhavishya Pohani @Azrael2801
@alexalbert__ It would be great if the model decided to use subagents (especially in exploration-based settings) on its own. Right now, I've had to prompt it specifically.
0 replies · 0 reposts · 1 like · 91 views
Alex Albert @alexalbert__
Reply with all your Opus 4.5 gripes so we can fix everything before our next model. The more specific (including prompts), the more likely we'll be able to fix it!
890 replies · 59 reposts · 2.1K likes · 302.6K views
Nando de Freitas @NandoDF
Why is it that with ChatGPT, Gemini, Claude, Copilot and other LLMs we have to always start new chats for them to work well? What is the scientific explanation? What are the hypotheses? What is the evidence for each?
111 replies · 15 reposts · 207 likes · 38.7K views
Bhavishya Pohani @Azrael2801
Excited to see the Olmo model from the Allen Institute for AI! Open source is the way! The Olmo 3 model passed the Bhavishya test, unlike the Olmo 2 variant! (OlmoTrace revealed overlaps in the training data for the original strawberry test, so we ran the Bhavishya test instead.)
[3 images attached]
1 reply · 0 reposts · 2 likes · 125 views
Bhavishya Pohani reposted
Snorkel AI @SnorkelAI
We had a terrific interview with the creators of Terminal Bench 2.0. They unpack:
• why terminals → more reliable and powerful agents
• key design tradeoffs in TB 2.0
• creating Harbor to enable eval, RL, and agent workflows at scale
• lessons from building a 100+ contributor community around the benchmark
[1 image attached]
1 reply · 5 reposts · 17 likes · 895 views
Bhavishya Pohani @Azrael2801
@arvindh__a Interesting paper! I went through a few of the tasks and I'm confused as to how they are long-horizon tasks, though. Each task looks independent of the others, i.e. summing up the values of the keys. Would you agree? A long-horizon task should ideally have interdependent subtasks, right?
1 reply · 0 reposts · 1 like · 67 views
Arvindh Arun @arvindh__a
Why does horizon length grow exponentially as shown in the METR plot? Our new paper investigates this by isolating the execution capabilities of LLMs. Here's why you shouldn't be fooled by slowing progress on typical short-task benchmarks... 🧵
[1 image attached]
14 replies · 33 reposts · 265 likes · 51.7K views
Bhavishya Pohani @Azrael2801
Something's cooking 🧑‍🍳 Do models need tools to operate? Or can they do just fine with only code? In addition to what @chris_m_glaze mentioned here, we are also seeing huge efficiency gains with the code-agent in terms of the number of turns🚀.
Chris Glaze @chris_m_glaze

What's the future state of AI agents in real-world scenarios? How often will they just solve problems as coders vs. interacting with more constrained but complex tool sets? Models like Claude Sonnet 4.5 are indeed impressive at coding and can in theory use these same skills to solve any problem that lives in an IT ecosystem. But I don't think we're there yet: (1) this will require serious guardrails, and (2) the jury is still out on whether this is even the most efficient approach.

As part of our ongoing experiments at @SnorkelAI around this debate, we're making "code-agent" versions of popular tool-based environments in which we challenge agents to solve tasks by writing raw code instead – they only have access to a Python interpreter and a pointer to the relevant file systems. This is related to the idea behind @terminalbench but on environments that simulate entire production-grade systems.

Really interesting findings when we do this to the Tau Bench 2 Airlines benchmark from @SierraPlatform: when we strip out all tools and swap in a Python interpreter, models do *better* at inference and communication with users, and *worse* at write operations (database updates). We confirmed that models are *capable* of write operations, however. In the original version of the benchmark, the tools hard-code write-operation logic that the models are challenged to figure out on their own in the code-agent version. Hugging Face dataset with results here: huggingface.co/datasets/snork…

The successful examples show some really fun behavior, though, with a lot of exploration and self-correction. For example, @AnthropicAI's Claude Sonnet 4.5 often attempts to interact with the database without first reading in the schema; fails; then reads in the schema by simply printing out all attributes of the object for itself. My guess is we'll land on some optimality point between the two extremes here as models develop: exploration via code and exploitation via tool creation.

0 replies · 0 reposts · 2 likes · 42 views
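The tool-versus-code contrast in the quoted thread can be made concrete with a toy harness. Everything below (the in-memory database, `update_seat`, `run_model_code`) is an illustrative stand-in, not Snorkel's environment: the bespoke tool bakes the write path in, while the code-only variant just executes whatever Python the model emits against the same state.

```python
# Shared environment state: a toy in-memory "database" (illustrative schema).
db = {"reservations": {"R1": {"passenger": "Ada", "seat": "12A"}}}

# Tool-based harness: the write-operation logic is hard-coded in a bespoke
# tool, so the model only has to pick arguments.
def update_seat(reservation_id: str, seat: str) -> str:
    db["reservations"][reservation_id]["seat"] = seat
    return f"seat set to {seat}"

# Code-only harness: the model must emit raw Python that performs the same
# write itself; the harness just executes it against the environment.
def run_model_code(code: str) -> None:
    exec(code, {"db": db})

update_seat("R1", "13B")                                    # tool path
run_model_code('db["reservations"]["R1"]["seat"] = "14C"')  # code-agent path
```

The thread's finding maps onto this split: with `update_seat`, the write logic is given for free; with `run_model_code`, the model must first discover the schema (e.g. by printing `db`) before it can write correctly.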
Bhavishya Pohani reposted
Amanda Dsouza @amanda_dsouza
🚨 New research from @SnorkelAI tackles a critical problem: LLMs are evolving faster than our ability to evaluate them 📊 We develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that automates benchmark design using reasoning models as optimizers. BeTaL produces benchmarks much closer to target difficulty levels, with a 2-4x improvement over the baselines. Paper: huggingface.co/papers/2510.25…
[1 image attached]
4 replies · 11 reposts · 27 likes · 1.9K views
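BeTaL's actual optimizer is an LLM proposing benchmark-design edits; as a rough mental model only, the loop below tunes a single numeric difficulty knob until measured accuracy hits the target. The knob, `eval_fn`, and `tune_benchmark` are hypothetical simplifications, not the paper's method.

```python
def tune_benchmark(target_acc: float, eval_fn, lo: float = 0.0,
                   hi: float = 1.0, iters: int = 30) -> float:
    """Bisect a difficulty knob until a model's measured accuracy on the
    generated benchmark matches the target accuracy.
    Assumes eval_fn (accuracy as a function of the knob) is decreasing."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if eval_fn(mid) > target_acc:   # benchmark too easy: raise difficulty
            lo = mid
        else:                           # too hard: lower it
            hi = mid
    return (lo + hi) / 2

# With a toy accuracy model acc = 1 - knob, targeting 70% accuracy
# should land the knob near 0.3.
knob = tune_benchmark(0.7, lambda d: 1.0 - d)
```

The point of the sketch is the feedback loop, design knob → measured difficulty → adjusted knob; BeTaL replaces the scalar knob and bisection with benchmark parameters and a reasoning model as the optimizer.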
Bhavishya Pohani @Azrael2801
@dlwh Any reason to leave it out? Super interesting thread btw
1 reply · 0 reposts · 1 like · 38 views
David Hall @dlwh
Now, look, we knew QK Norm was a good idea. We just thought it wasn't a **necessary** idea, not for us. We were different. Anyway, let's fix it.
2 replies · 1 repost · 53 likes · 7.8K views
Percy Liang @percyliang
⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:
[1 image attached]
20 replies · 87 reposts · 599 likes · 126.6K views
Bhavishya Pohani reposted
Snorkel AI @SnorkelAI
🤖 What happens when simply using the best AI model starts to break the bank? We explored how multi-agent systems — multiple AI agents collaborating across tools — could benefit enterprise AI. Details👇
[1 image attached]
1 reply · 3 reposts · 7 likes · 672 views
Bhavishya Pohani reposted
Chris Glaze @chris_m_glaze
Just how good are AI agents at exploring their environments in novel ways to solve real-world enterprise problems?

As part of our ongoing experiments around agentic autonomy at @SnorkelAI, we're making "code-only" versions of environments in which we challenge agents to solve tasks by writing raw code instead of using tools – they only have access to a Python interpreter and a pointer to the relevant file systems. This is related to the work we have developed with @terminalbench but on rich environments that simulate entire production-grade systems.

Really interesting findings so far: when left no other choice, some frontier models are already pretty good at this with creative coding, while others require tools for guided access to their environment. In this first experiment we made a code-only version of Snorkel's insurance underwriting environment and found that Claude Sonnet 4.5 and GPT-5 actually do a little better with only an interpreter, while other models clearly do worse without guided access from tools.

You can see example traces with really interesting examples of how models solve these tasks in the Hugging Face dataset. The most successful models explore the environment and learn in-context: huggingface.co/datasets/snork…. The models take more compute, as expected, to do things this way, but it shows that they could possibly be leveraged for more autonomous applications as they mature.
[3 images attached]
1 reply · 4 reposts · 27 likes · 1.2K views