Bhavishya Pohani
@Azrael2801
Applied Research Scientist @ Snorkel AI
Joined December 2014
105 Following · 28 Followers
20 posts
Bhavishya Pohani @Azrael2801
Just tried out @gepa_ai's optimize_anything library — we split the Snorkel Finance Benchmark into train/val/test, gave optimize_anything the 100 train examples, and it boosted test-set performance by 8 points 👀👀 We gave it the system prompt to optimize, and found that it added helpful tool-usage rules, numerical-extraction and formatting guidance, and other best practices.
2 replies · 5 reposts · 17 likes · 2.8K views
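The workflow described above (hold out splits, hand the optimizer a scored system prompt, keep edits that improve a validation metric) can be sketched as a greedy loop. This is a toy illustration, not @gepa_ai's actual API: `evaluate`, `RULES`, and `optimize_prompt` are all hypothetical stand-ins.

```python
# Toy sketch of hill-climbing on a system prompt. All names here
# (evaluate, RULES, optimize_prompt) are hypothetical stand-ins,
# not the optimize_anything API.

# Hypothetical rules the optimizer might discover as useful additions.
RULES = [
    "Use tools before answering.",
    "Extract numerical values exactly as they appear.",
    "Format currency amounts consistently.",
]

def evaluate(prompt: str, examples: list) -> float:
    """Toy scorer: fraction of known-good rules present in the prompt.
    A real run would execute the agent on each example and grade answers."""
    return sum(rule in prompt for rule in RULES) / len(RULES)

def optimize_prompt(base_prompt: str, val_examples: list, candidate_edits: list):
    """Greedy search: append each candidate edit, keep it only if the
    validation score improves."""
    best, best_score = base_prompt, evaluate(base_prompt, val_examples)
    for edit in candidate_edits:
        candidate = best + "\n- " + edit
        score = evaluate(candidate, val_examples)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

prompt, score = optimize_prompt("You are a careful financial analyst.", [], RULES)
```

With a real scorer, the train/val splits guard against overfitting the prompt to the 100 training examples, and the held-out test split is what gives the 8-point gain its meaning.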
Bhavishya Pohani @Azrael2801
Snorkel 🤝 UC Berkeley rLLM team! Pretty interesting work we were involved in over the last couple of months, where the rLLM team fine-tuned and evaluated using Snorkel's finance reasoning benchmarks! Read the blog for more details > @mananroongta @sijun_tan
Snorkel AI @SnorkelAI

A 4B model > 235B on financial reasoning. We partnered with @rllm_project to fine-tune Qwen3-4B-Instruct-2507 — and it outperformed Qwen3-235B-A22B on expert-curated financial benchmarks.

0 replies · 1 repost · 13 likes · 953 views
Bhavishya Pohani reposted
Chris Glaze @chris_m_glaze
Frontier models like Gemini 3 Pro are making impressive strides as code agents, but they still show basic errors in real-world tasks when applying coding skills to solve enterprise-style problems.

We took the verified version of Tau^2 Bench made by the AGI team at @amazon and swapped in a code interpreter, challenging models to figure out how to solve problems in creative, open-ended ways without the hand-holding of bespoke tools that bake in required reasoning.

Takeaways:
1. Verification indeed makes a big difference: this version shows that frontier models can do impressively well at updating backends, after verifying that real software engineers could do it with the same information.
2. They still struggle here, however, doing proportionally much better at tasks that simply require inference.
3. Even when they succeed, they do it inefficiently and fail to exploit standard methods for working with metadata.

This is part of our ongoing R&D at @SnorkelAI in extending our @terminalbench approach to include richer, more complex environments that can encapsulate enterprise scenarios with code agents.

Showing pass@k analysis along with an example of an inefficiency from Gemini 3 Pro (it took 4 steps to figure out how to even work with the database). You can find samples from all the models here: huggingface.co/datasets/snork…
[2 images attached]
3 replies · 6 reposts · 35 likes · 1.7K views
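The thread does not say how its pass@k numbers were computed; a common choice is the unbiased estimator from the Codex paper (Chen et al., 2021), which, given n sampled attempts of which c passed, estimates the probability that at least one of k draws succeeds:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = total attempts sampled, c = attempts that passed, k = draw budget."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 4 attempts, 2 passing, budget of 2 draws: 1 - C(2,2)/C(4,2) = 1 - 1/6
```

The estimator averages over which k of the n attempts you would have drawn, so it is less noisy than simply re-sampling k attempts and checking for a success.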
Bhavishya Pohani @Azrael2801
@alexalbert__ It would be great if the model decided to use subagents (especially in exploration-based settings) on its own. Right now, I've had to prompt it specifically.
0 replies · 0 reposts · 1 like · 91 views
Alex Albert @alexalbert__
Reply with all your Opus 4.5 gripes so we can fix everything before our next model. The more specific (including prompts), the more likely we'll be able to fix it!
890 replies · 59 reposts · 2.1K likes · 302.6K views
Nando de Freitas @NandoDF
Why is it that with ChatGPT, Gemini, Claude, Copilot and other LLMs we have to always start new chats for them to work well? What is the scientific explanation? What are the hypotheses? What is the evidence for each?
111 replies · 15 reposts · 207 likes · 38.7K views
Bhavishya Pohani @Azrael2801
Excited to see the Olmo model from the Allen Institute for AI! Open source is the way! The Olmo 3 model passed the Bhavishya test, unlike the Olmo 2 variant! (OlmoTrace revealed overlaps in the training data for the original strawberry test, so we ran the Bhavishya test instead.)
[3 images attached]
1 reply · 0 reposts · 2 likes · 125 views
Bhavishya Pohani reposted
Snorkel AI @SnorkelAI
We had a terrific interview with the creators of Terminal Bench 2.0. They unpack:
• why terminals → more reliable and powerful agents
• key design tradeoffs in TB 2.0
• creating Harbor to enable eval, RL, and agent workflows at scale
• lessons from building a 100+ contributor community around the benchmark
[1 image attached]
1 reply · 5 reposts · 17 likes · 895 views
Bhavishya Pohani @Azrael2801
@arvindh__a Interesting paper! I went through a few of the tasks and I'm confused as to how they are long-horizon tasks, though. Each task looks independent of the others, i.e. summing up the values of the keys. Would you agree? A long-horizon task should ideally have interdependent subtasks, right?
1 reply · 0 reposts · 1 like · 67 views
Arvindh Arun @arvindh__a
Why does horizon length grow exponentially as shown in the METR plot? Our new paper investigates this by isolating the execution capabilities of LLMs. Here's why you shouldn't be fooled by slowing progress on typical short-task benchmarks... 🧵
[1 image attached]
14 replies · 33 reposts · 265 likes · 51.7K views
Bhavishya Pohani @Azrael2801
Something's cooking 🧑‍🍳 Do models need tools to operate? Or can they do just fine with only code? In addition to what @chris_m_glaze mentioned here, we are also seeing huge efficiency gains with the code-agent in terms of the number of turns🚀.
Chris Glaze @chris_m_glaze

What's the future state of AI agents in real-world scenarios? How often will they just solve problems as coders vs. interacting with more constrained but complex tool sets? Models like Claude Sonnet 4.5 are indeed impressive at coding and can in theory use these same skills to solve any problem that lives in an IT ecosystem. But I don't think we're there yet: (1) this will require serious guardrails, and (2) the jury is still out on whether this is even the most efficient approach.

As part of our ongoing experiments at @SnorkelAI around this debate, we're making "code-agent" versions of popular tool-based environments in which we challenge agents to solve tasks by writing raw code instead – they only have access to a Python interpreter and a pointer to the relevant file systems. This is related to the idea behind @terminalbench but on environments that simulate entire production-grade systems.

Really interesting findings when we do this to the Tau Bench 2 Airlines benchmark from @SierraPlatform: when we strip out all tools and swap in a Python interpreter, models do *better* at inference and communication with users, and *worse* at write operations (database updates). We confirmed that models are *capable* of write operations, however. In the original version of the benchmark, the tools hard-code write-operation logic that the models are challenged to figure out on their own in the code-agent version. Hugging Face dataset with results here: huggingface.co/datasets/snork…

The successful examples show some really fun behavior, though, with a lot of exploration and self-correction. For example, @AnthropicAI's Claude Sonnet 4.5 often attempts to interact with the database without first reading in the schema; fails; then reads in the schema by simply printing out all attributes of the object for itself. My guess is we'll land on some optimality point between the two extremes here as models develop: exploration via code and exploitation via tool creation.

0 replies · 0 reposts · 2 likes · 42 views
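The tool-versus-code contrast in the quoted thread can be made concrete with a toy harness. Everything below (the in-memory database, `update_seat`, `run_model_code`) is an illustrative stand-in, not Snorkel's environment: the bespoke tool bakes the write path in, while the code-only variant just executes whatever Python the model emits against the same state.

```python
# Shared environment state: a toy in-memory "database" (illustrative schema).
db = {"reservations": {"R1": {"passenger": "Ada", "seat": "12A"}}}

# Tool-based harness: the write-operation logic is hard-coded in a bespoke
# tool, so the model only has to pick arguments.
def update_seat(reservation_id: str, seat: str) -> str:
    db["reservations"][reservation_id]["seat"] = seat
    return f"seat set to {seat}"

# Code-only harness: the model must emit raw Python that performs the same
# write itself; the harness just executes it against the environment.
def run_model_code(code: str) -> None:
    exec(code, {"db": db})

update_seat("R1", "13B")                                    # tool path
run_model_code('db["reservations"]["R1"]["seat"] = "14C"')  # code-agent path
```

The thread's finding maps onto this split: with `update_seat`, the write logic is given for free; with `run_model_code`, the model must first discover the schema (e.g. by printing `db`) before it can write correctly.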
Bhavishya Pohani reposted
Amanda Dsouza @amanda_dsouza
🚨 New research from @SnorkelAI tackles a critical problem: LLMs are evolving faster than our ability to evaluate them 📊 We develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that automates benchmark design using reasoning models as optimizers. BeTaL produces benchmarks much closer to target difficulty levels, with a 2-4x improvement over the baselines. Paper: huggingface.co/papers/2510.25…
[1 image attached]
4 replies · 11 reposts · 27 likes · 1.9K views
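BeTaL's actual optimizer is an LLM proposing benchmark-design edits; as a rough mental model only, the loop below tunes a single numeric difficulty knob until measured accuracy hits the target. The knob, `eval_fn`, and `tune_benchmark` are hypothetical simplifications, not the paper's method.

```python
def tune_benchmark(target_acc: float, eval_fn, lo: float = 0.0,
                   hi: float = 1.0, iters: int = 30) -> float:
    """Bisect a difficulty knob until a model's measured accuracy on the
    generated benchmark matches the target accuracy.
    Assumes eval_fn (accuracy as a function of the knob) is decreasing."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if eval_fn(mid) > target_acc:   # benchmark too easy: raise difficulty
            lo = mid
        else:                           # too hard: lower it
            hi = mid
    return (lo + hi) / 2

# With a toy accuracy model acc = 1 - knob, targeting 70% accuracy
# should land the knob near 0.3.
knob = tune_benchmark(0.7, lambda d: 1.0 - d)
```

The point of the sketch is the feedback loop, design knob → measured difficulty → adjusted knob; BeTaL replaces the scalar knob and bisection with benchmark parameters and a reasoning model as the optimizer.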
Bhavishya Pohani @Azrael2801
@dlwh Any reason to leave it out? Super interesting thread btw
1 reply · 0 reposts · 1 like · 38 views
David Hall @dlwh
Now, look, we knew QK Norm was a good idea. We just thought it wasn't a **necessary** idea, not for us. We were different. Anyway, let's fix it.
2 replies · 1 repost · 53 likes · 7.8K views
Percy Liang @percyliang
⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:
[1 image attached]
20 replies · 87 reposts · 599 likes · 126.6K views
Bhavishya Pohani reposted
Snorkel AI @SnorkelAI
🤖 What happens when simply using the best AI model starts to break the bank? We explored how multi-agent systems — multiple AI agents collaborating across tools — could benefit enterprise AI. Details👇
[1 image attached]
1 reply · 3 reposts · 7 likes · 672 views
Bhavishya Pohani reposted
Chris Glaze @chris_m_glaze
Just how good are AI agents at exploring their environments in novel ways to solve real-world enterprise problems?

As part of our ongoing experiments around agentic autonomy at @SnorkelAI, we're making "code-only" versions of environments in which we challenge agents to solve tasks by writing raw code instead of using tools – they only have access to a Python interpreter and a pointer to the relevant file systems. This is related to the work we have developed with @terminalbench but on rich environments that simulate entire production-grade systems.

Really interesting findings so far: when left no other choice, some frontier models are already pretty good at this with creative coding, while others require tools for guided access to their environment. In this first experiment we made a code-only version of Snorkel's insurance underwriting environment and found that Claude Sonnet 4.5 and GPT-5 actually do a little better with only an interpreter, while other models clearly do worse without guided access from tools.

You can see example traces with really interesting examples of how models solve these tasks in the Hugging Face dataset. The most successful models explore the environment and learn in-context: huggingface.co/datasets/snork…. The models take more compute, as expected, to do things this way, but it shows that they could possibly be leveraged for more autonomous applications as they mature.
[3 images attached]
1 reply · 4 reposts · 27 likes · 1.2K views