Sarthak
@kaytraser
307 posts
Joined April 2015
2.4K Following · 257 Followers
Gauri Gupta @gauri__gupta
We @neosigmaai @RitvikKapila are building the future of self-improving AI systems! By closing the feedback loop between production data and system improvements, we help teams capture failures, convert them into structured evaluation signals, and use them to drive continuous improvements in agent behavior. We show how our system works on Tau3 bench across retail, telecom, and airline domains. Agent performance on the validation set (with a fixed underlying model, GPT5.4) improves from 0.56 → 0.78 (~40% jump in accuracy).
Sarthak reposted
Martin Tutek @mtutek
This blog by Nicholas Carlini is stellar: nicholas.carlini.com/writing/2026/h… Internalizing things based on words is much more difficult than internalizing from (bad) experience, but if there is one place you should try hard to learn from as a researcher, it is this post.
Shashwat Goel @ShashwatGoel7
A simple example which is widely used across popular benchmarks like HLE: answer matching. If you give privileged information to the judge (e.g. the reference answer, instance-specific rubrics, etc.), you can increase the SNR by a lot. x.com/ShashwatGoel7/…
Shashwat Goel @ShashwatGoel7
There's been a hole at the heart of #LLM evals, and we can now fix it. 📜 New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations.

❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer and get high accuracies. This affects popular benchmarks like MMLU-Pro, SuperGPQA, etc., and even "multimodal" benchmarks like MMMU-Pro, which can be solved without even looking at the image ⁉️

Such choice-only shortcuts are hard to fix. We find prior attempts at fixing them, GoldenSwag (for HellaSwag) and TruthfulQA v2, ended up worsening the problem. MCQs are inherently a discriminative task, only requiring picking the correct choice among a few given options. Instead we should evaluate language models for the generative capabilities they are used for. We show discrimination is easier than even verification, let alone generation.

🤔 But how do we grade generative responses outside "verifiable domains" like code and math? So many paraphrases are valid answers... We show a scalable alternative, Answer Matching, works surprisingly well. It's simple: get generative responses to existing benchmark questions that are specific enough to have a semantically unique answer, without showing choices. Then use an LM to match the response against the ground-truth answer.

👨‍🔬 We conduct a meta-evaluation by comparing to ground-truth verification on MATH, and to human grading on MMLU-Pro and GPQA-Diamond questions. Answer Matching outcomes give near-perfect alignment, even with small (recent) models like Qwen3-4B. In contrast, LLM-as-a-judge, even with frontier reasoning models like o4-mini, fares much worse. This is because without the reference answer, the model is tasked with verification, which is harder than what answer matching requires: paraphrase detection, a skill modern language models have aced 💡

Let's shift the benchmarking ecosystem from MCQs to Answer Matching. Impacts:
Leaderboards: We show model rankings can change and accuracies go down, making benchmarks seem less saturated.
Benchmark creation: Instead of creating harder MCQs, we should focus our efforts on creating questions for answer matching, much like SimpleQA, GAIA, etc.
🤑 Cost: Finally, to our great surprise, answer matching evals are cheaper to run than MCQs!

See our paper for more, it's packed with insights. 🧵 has paper and more result figures.
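(For concreteness, a minimal sketch of answer matching as the thread describes it. `complete` stands in for any LM completion call, and the prompt wording is illustrative, not the paper's.)

```python
# Minimal sketch of answer matching: the matcher sees the reference answer,
# so it only has to detect paraphrase equivalence, not verify correctness.

MATCH_PROMPT = """Question: {question}
Ground-truth answer: {reference}
Candidate response: {response}

Does the candidate response express the same answer as the ground truth?
Reply with exactly one word: yes or no."""

def answer_match(complete, question: str, reference: str, response: str) -> bool:
    """Grade a free-form response by matching it against the reference answer."""
    verdict = complete(MATCH_PROMPT.format(question=question,
                                           reference=reference,
                                           response=response))
    return verdict.strip().lower().startswith("yes")
```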
Shashwat Goel @ShashwatGoel7
I have to say, I strongly disagree with this take. This will only widen the gap between the capabilities reflected by the benchmark and real-world use. There's a wide variety of problems that can only be fuzzily verified. They can still have a generator-verifier gap. And noisy benchmark measurements (as when LMs are part of the eval) should be more widely accepted. They still have signal.
Ofir Press @OfirPress
Another day, another reason not to use an LM as a judge. Building benchmarks is tough, and sometimes using an LM-as-a-judge looks like an easy solution to this problem, but it almost never is. Building benchmarks is about finding tough problems whose solution is easy to verify. And we've shown, in SWE-bench, SciCode, AlgoTune, SWE-fficiency, VideoGameBench, CodeClash, and CritPt that we can find extremely tough challenges that are verifiable deterministically. And we'll continue to find even tougher benchmarks, without using any type of ML model to judge correctness.
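(A toy illustration of the "tough problem, deterministic check" recipe above: grade a candidate program by executing it against fixed test cases, with no judge model anywhere in the loop. The test cases and file handling are invented for the sketch.)

```python
import os
import subprocess
import tempfile

# (stdin, expected stdout) pairs; a real benchmark would have many more.
TESTS = [("3 4\n", "7\n"), ("10 -2\n", "8\n")]

def verify(candidate_source: str) -> bool:
    """Deterministically verify a candidate solution by exact output match."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_source)
        path = f.name
    try:
        for stdin, expected in TESTS:
            out = subprocess.run(["python", path], input=stdin,
                                 capture_output=True, text=True, timeout=5)
            if out.stdout != expected:
                return False
        return True
    finally:
        os.unlink(path)
```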
Sarthak @kaytraser
@vvvincent_c wondering what those 14-hour tasks are, are they chained tasks or more monolithic?
Sarthak @kaytraser
@joel_bkr earlier I used to think synthetic data would break this logic, but it seems there are so many issues with collapse/bad distributions as we scale that the above intuition still holds
Joel Becker @joel_bkr
@kaytraser totally agree that reasoning from hypothesized data availability can be very OP
Joel Becker @joel_bkr
i'm extraordinarily unsure about what the next 6-12 months in AI look like. plausibly: capabilities come with many caveats, full AI R&D automation feels far off. however, i struggle to confidently name _any_ software-based task that AIs will be unable to autonomously complete.
Ahmad @TheAhmadOsman
what do people use Opus for nowadays? Kimi, GLM, and MiniMax are overall better, cheaper, and faster models. Codex is more intelligent as well. why would anyone pay Anthropic for a Claude subscription that gets nerfed?
Sarthak @kaytraser
@sharut_gupta great work! quick question: what is the ratio of trainable parameters in the input embeddings vs. total model weights that we see here?
Sharut Gupta @sharut_gupta
1/n Can LLMs learn to reason on hard benchmarks like AIME and GPQA purely through context, without SFT, RL, or any weight updates? Turns out… yes! And they can reach strong performance while being highly efficient. Paper: arxiv.org/pdf/2602.02366 Blog: reasoncache.github.io
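(A heavily hedged guess at the general shape, based only on this tweet: cache reasoning traces that worked and prepend them to later prompts, so the model improves in context with no weight updates. Everything below is hypothetical; see the linked paper and blog for the actual method.)

```python
# Hypothetical sketch of in-context learning from cached reasoning.
cache: list[tuple[str, str]] = []  # (problem, successful reasoning trace)

def solve(complete, problem: str, k: int = 3) -> str:
    # naive retrieval: the k most recent cached traces; a real system
    # would presumably use similarity search over problem embeddings
    examples = "\n\n".join(f"Problem: {p}\nReasoning: {r}"
                           for p, r in cache[-k:])
    return complete(f"{examples}\n\nProblem: {problem}\nReasoning:")

def record_success(problem: str, trace: str) -> None:
    # only verified-correct traces enter the cache, so the context
    # improves over time without any gradient step
    cache.append((problem, trace))
```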
Sarthak @kaytraser
@bilaltwovec I feel like this is the equivalent, in automated AI research, of the "autocomplete phase" we saw in AI coding. not sure how long it might take before we begin considering abstracting away the underlying research, like we're thinking about the future of code right now
Sarthak @kaytraser
@khoomeik dataset distillation? I vividly remember those blurry images representative of an entire class; training on just 10 images gave great performance on ImageNet. this was a very interesting direction back in the ResNet days, wondering where it went
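(For readers who missed that era: a minimal gradient-matching sketch in the spirit of the dataset distillation/condensation line of work. The toy model, shapes, and hyperparameters are invented.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setup: learn 10 synthetic images (one per class) whose training
# gradients mimic those of a real batch, following the gradient-matching
# flavor of dataset condensation. Random tensors stand in for real data.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
real_x = torch.randn(256, 1, 28, 28)
real_y = torch.randint(0, 10, (256,))
syn_x = torch.randn(10, 1, 28, 28, requires_grad=True)
syn_y = torch.arange(10)
opt = torch.optim.Adam([syn_x], lr=0.1)

for step in range(200):
    # gradient of the loss on real data w.r.t. model parameters (target)
    g_real = torch.autograd.grad(
        F.cross_entropy(model(real_x), real_y), tuple(model.parameters()))
    # gradient on synthetic data, with a graph so we can backprop into syn_x
    g_syn = torch.autograd.grad(
        F.cross_entropy(model(syn_x), syn_y), tuple(model.parameters()),
        create_graph=True)
    # the mismatch between the two gradient fields drives image updates
    loss = sum(F.mse_loss(s, r) for s, r in zip(g_syn, g_real))
    opt.zero_grad()
    loss.backward()
    opt.step()
```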
Sarthak @kaytraser
@MaziyarPanahi a bit tangential, but have you been using LLMs as judges to supervise the CoTs? CoT supervision would be the primary challenge in this situation
Maziyar PANAHI @MaziyarPanahi
@kaytraser it's not that hard to beat those models, to be honest. the world knowledge is already in most open models from pre-training. we just need good post-training to structure the reasoning and thinking at expert level to get that knowledge out in a correct way.
Sarthak @kaytraser
@MaziyarPanahi hm, this could definitely be a great thing for cold-starting, but how would we then beat those SOTA models?
Maziyar PANAHI @MaziyarPanahi
you can only use another model that scored high on the medical evals to evaluate subsamples. I once did a medical annotation project with 15 doctors; half of them didn't agree with the other half on whether something was right or wrong! even among experts you have nuance and edge cases, so I am trying to make sure we use the best open models available to generate diverse traces of thinking, which would be a great foundation for RL
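(A minimal sketch of the spot-checking scheme described here: a strong model grades a random subsample of generated traces to estimate overall quality. `judge` is a placeholder for an LM call returning a boolean, not any particular API.)

```python
import random

def estimate_quality(judge, traces: list[str], n: int = 50) -> float:
    """Grade a random subsample with a high-scoring model; the sample
    pass rate is a cheap quality estimate for the whole synthetic set."""
    sample = random.sample(traces, min(n, len(traces)))
    return sum(judge(t) for t in sample) / len(sample)
```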
Sarthak @kaytraser
@sdathath I was considering the situation where pre-training itself might be at fault for mode collapse
Sumanth Dathathri @sdathath
@kaytraser Not sure I follow. E.g., the KL regularization does push the model towards pretraining, so it should increase the influence of pretraining?
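(A minimal sketch of the KL regularizer being referred to, assuming a standard RLHF-style setup: penalize divergence from a frozen reference policy, which keeps the tuned model close to the pretraining distribution. The coefficient is arbitrary.)

```python
import torch

def kl_to_reference(policy_logits, ref_logits, beta=0.05):
    """KL(pi || pi_ref) penalty, added to the RL loss. The reference model
    is the frozen pretrained policy, so this term pulls the tuned model
    back toward the pretraining distribution."""
    logp = torch.log_softmax(policy_logits, dim=-1)
    ref_logp = torch.log_softmax(ref_logits, dim=-1)
    kl = (logp.exp() * (logp - ref_logp)).sum(-1)  # per-token KL
    return beta * kl.mean()
```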
Sarthak @kaytraser
did synthetic data generation for the same task in Sept 2024 and today. fighting mode collapse was so hard back then and is completely absent now. we've come a long way; wondering if it is only because models got larger or because the labs actually got an improved data distribution
Sarthak @kaytraser
@sdathath this seems to be more aligned with the task of pure next-token generation, hence the suspicion that it's more influenced by changes in pre-training
Sarthak @kaytraser
@sdathath the reason being that I'm doing generation for a somewhat simple task. example: where earlier models used to fill in names with "john doe" 7/10 times, they now give a really good diversity of names, and this observation holds for most of the peculiarities of the data I know of
Sarthak @kaytraser
@sdathath I thought RL(VR) leads to a worse distribution. also, most of the RL is for reasoning tasks; how would that help diversity?
Sumanth Dathathri @sdathath
@kaytraser I think maybe less SFT and more RL with entropy reg in post-training these days? SFT used to nuke entropy, so you probably see a bit more diversity in responses.
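(And a matching sketch of the entropy regularization, again assuming a standard policy-gradient setup with an arbitrary coefficient: the entropy bonus penalizes collapsing onto a few high-probability completions, which is the diversity effect being discussed.)

```python
import torch

def pg_loss_with_entropy(logits, actions, advantages, beta=0.01):
    """Policy-gradient loss with an entropy bonus.

    logits: (B, T, V) per-token logits; actions: (B, T) sampled token ids;
    advantages: (B,) per-sequence advantage estimates.
    """
    logp = torch.log_softmax(logits, dim=-1)
    act_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (B, T)
    entropy = -(logp.exp() * logp).sum(-1)                         # (B, T)
    # maximize reward-weighted log-prob plus entropy -> minimize negation
    return -(advantages * act_logp.sum(-1)).mean() - beta * entropy.mean()
```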