Sam Snelling
@snellingio
2.4K posts
building https://t.co/5Sf1CVIUTj
Oklahoma City · Joined July 2009
713 Following · 1.3K Followers
Sam Snelling retweeted
Cursor @cursor_ai:
Cursor can now search millions of files and find results in milliseconds. This dramatically speeds up how fast agents complete tasks. We're sharing how we built Instant Grep, including the algorithms and tradeoffs behind the design.
147 replies · 256 reposts · 4.6K likes · 616.2K views
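The tweet doesn't spell out Cursor's design, but fast code-search engines (e.g. Google Code Search, Zoekt) commonly get millisecond results over millions of files with a trigram inverted index: a file is a candidate only if it contains every trigram of the query, so the expensive substring scan runs over a tiny candidate set. A minimal sketch under that assumption (illustrative, not Cursor's actual implementation):

```python
from collections import defaultdict

def trigrams(text):
    """All 3-character substrings of a string."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Inverted index from trigram -> set of file ids."""

    def __init__(self):
        self.index = defaultdict(set)
        self.files = {}

    def add(self, file_id, content):
        self.files[file_id] = content
        for t in trigrams(content):
            self.index[t].add(file_id)

    def search(self, query):
        # Intersect posting lists: a match must contain every query trigram.
        cands = None
        for t in trigrams(query):
            posting = self.index.get(t, set())
            cands = posting if cands is None else cands & posting
            if not cands:
                return []
        # Verification pass over the (now small) candidate set.
        return sorted(f for f in (cands or self.files) if query in self.files[f])

idx = TrigramIndex()
idx.add("a.py", "def instant_grep(pattern): ...")
idx.add("b.py", "print('hello world')")
idx.search("grep")  # -> ["a.py"]
```

The tradeoff the tweet alludes to is index size and freshness versus query latency: the posting lists must be kept up to date as files change, but queries touch only files that share the query's trigrams.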
Sam Snelling retweeted
NP @np_hard:
As part of @PrimeIntellect's RL residency program, I've been exploring how to do multi-agent RL using their current stack (from verifiers + prime-rl to lab experiments with hosted training/evals) and thinking about how it could be extended to support these abstractions natively. I've summarized my findings in the blogpost below and I'll leave a few comments here, too...
7 replies · 40 reposts · 326 likes · 40K views
Sam Snelling @snellingio:
interesting that 1) we're including harnesses now? eg Junie 2) stepfun is finally being included, and is performing great for its size
Ibragim @ibragim_bad:

🚨 SWE-rebench update! SWE-rebench is a live benchmark with fresh SWE tasks (issue+PR) from GitHub every month.
updates:
> we removed demonstrations and the 80-step limit (modern models can now handle huge contexts without getting trapped in loops!).
> we added auxiliary interfaces for specific tasks like in SWE-bench-Pro to evaluate larger tasks fairly, ensuring valid solutions don't fail just because of mismatched test calls.
insights:
> Top models perform similarly. Among open-source options, GLM @Zai_org shows strong results, and StepFun @StepFun_ai is very cheap for its performance level ($0.14 per task).
> GPT-5.4 shows high token efficiency: it ranks in the top 5 overall but uses the lowest number of tokens (774k per task).
> Qwen3-Coder-Next & Step-3.5-Flash benefit massively from huge contexts. Qwen is an extreme case, averaging a wild 8.12M tokens.
> We evaluated agentic harnesses (Claude Code, Codex, and Junie) and found a few things. Even in headless mode, they sometimes ask for additional context or attempt web searches. We explicitly disabled search and verified their curl commands to ensure they aren't just pulling solutions from the web.
🏆 You can find the full leaderboard here: swe-rebench.com
👾 Also, we launched our Discord! Join our leaderboard channel to discuss models, share ideas, ask questions, or report issues: discord.gg/V8FqXQ4CgU

1 reply · 0 reposts · 0 likes · 67 views
Sam Snelling @snellingio:
the llamaindex liteparse lib is great. tesseract is obv not great, but being able to get a rough outline + screenshots lets just about everything be converted with okayish accuracy very quickly. would recommend
1 reply · 1 repost · 0 likes · 55 views
Sam Snelling retweeted
MERICA MEMED @Mericamemed:
now give the lobster access to firearms
62 replies · 322 reposts · 2.7K likes · 201.4K views
Sam Snelling @snellingio:
i guess the binned 15 core would be a closer comparison? which would be like a 9% increase single core, 14% increase multi core, and unknown power draw. maybe not quite fair with different GPUs. would be interesting to know what they are doing on the software side because the idle staying the same with no efficiency cores is pretty neat. unfortunately no review will capture this perfectly, but it is interesting to see the trade-offs imo
1 reply · 0 reposts · 0 likes · 27 views
Marco @MarcoNL88:
@snellingio @mweinbach Idle vs load. Efficiency probably uses less power idle but more under load. So if you let the battery drain without doing anything, the efficiency core will win. But if you run light loads then the new performance core will win.
1 reply · 0 reposts · 1 like · 43 views
Sam Snelling retweeted
Sam Paech @sam_paech:
The Qwen3.5 models really took over the Pareto frontier for LLM judging. Local models that are actually capable at data scoring are a huge accelerator imo.
16 replies · 33 reposts · 438 likes · 25.3K views
Sam Snelling retweeted
Wenliang Dai @Wenliang_Dai:
🚀 Introducing Nemotron-Cascade 2: our new best-in-class 30B-A3B MoE model.
🥇 Gold Medal at IMO 2025, IOI 2025, and the ICPC World Finals.
🔥 Outperforms Qwen3.5-35B-A3B across math, code reasoning, alignment, and instruction following.
🔓 Great reproducibility: model weights, SFT, and RL data are open!
Check out our technical report and huggingface page for more details and insights 👇
📰 Technical Report: t.co/dFC00m6RZU
🤗 Model & Data: t.co/4QJqfTOt6I
Wei Ping @_weiping:

🚀 Introducing Nemotron-Cascade 2 🚀
Just 3 months after Nemotron-Cascade 1, we're releasing Nemotron-Cascade 2: an open 30B MoE with 3B active parameters, delivering best-in-class reasoning and strong agentic capabilities.
🥇 Gold Medal-level performance on IMO 2025, IOI 2025, and ICPC World Finals 2025:
• Capabilities once thought achievable only by frontier proprietary models (e.g. Gemini Deep Think) or frontier-scale open models (i.e. DeepSeek-V3.2-Speciale-671B-A37B).
• Remarkably high intelligence density with 20× fewer parameters.
🏆 Best-in-class across math, code reasoning, alignment, and instruction following:
• Outperforms the latest Qwen3.5-35B-A3B (2026-02-24) and even the larger Qwen3.5-122B-A10B (2026-03-11).
🧠 Powered by Cascade RL + multi-domain on-policy distillation:
• Significantly expands Cascade RL across a much broader range of reasoning and agentic domains than Nemotron-Cascade 1, while distilling from the strongest intermediate teacher models throughout training to recover regressions and sustain gains.
🤗 Model + SFT + RL data: 👉 huggingface.co/collections/nv…
📄 Technical report: 👉 research.nvidia.com/labs/nemotron/…

9 replies · 33 reposts · 287 likes · 23.2K views
Sam Snelling retweeted
Kangwook Lee @Kangwook_Lee:
The current Terminal Bench has a pretty significant design flaw: agents are not told how much time they have left, so they just keep working until they are abruptly shut down. (And this time budget varies across tasks!)

That setup systematically hurts "thinking" models. In many cases, they score much worse than non-thinking models, not because they are less capable, but because the benchmark punishes models that spend time reasoning. It is basically like giving students an exam and then taking away their papers at a random moment without telling them when time is up.

The fix is straightforward: tell agents how much time remains. Once they can budget their time, a big part of this bias disappears.
elie @eliebakouch:

so 3x the training compute gets you 1% improvement on swe bench multilingual and 21% on terminal bench 2.0 but k2.5 is in non thinking mode? if those benchmarks are useless, it's weird that they are the ones reported in cursor blog then? something is wrong

10 replies · 9 reposts · 136 likes · 24.5K views
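The fix proposed above can be sketched as a harness loop that surfaces the remaining wall-clock budget to the model on every step, instead of cutting it off silently. Everything here is hypothetical (`step_fn`, the prompt format), not Terminal Bench's actual API:

```python
import time

def run_agent(task, step_fn, budget_s):
    """Hypothetical harness loop. On every step the model is told how
    much time remains, so a "thinking" model can decide whether it
    still has room to reason or must commit to an answer now."""
    deadline = time.monotonic() + budget_s
    transcript = []
    while (remaining := deadline - time.monotonic()) > 0:
        # Surface the budget in the prompt rather than shutting down abruptly.
        prompt = f"{task}\n[time remaining: {remaining:.0f}s]"
        action, done = step_fn(prompt, transcript)
        transcript.append(action)
        if done:
            break
    return transcript

# A toy step function that answers immediately.
steps = run_agent("list the files", lambda p, t: ("ls", True), budget_s=5)
```

With a per-step `[time remaining: …]` line, a model that reasons heavily can front-load thinking while the budget is large and switch to acting as the deadline approaches, which is exactly the behavior the current setup is said to penalize.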
Sam Snelling retweeted
elie @eliebakouch:
i want to update this post and explain more of my thoughts since a few people read it:
1) license is ok and used through fireworks (kimi confirmed it)
2) they will give credit to the base model provider in the future (most important part imo for the oss ecosystem!!)
3) still unsure about the compute spent, 3x mentioned in lee's post might be wrong since aman mentioned "4x scaled up" which imo means 4x from the previous cursor post-training run, which would make more sense? idk
4) seems to be k2.5, selected based on ppl eval
5) for terminal bench 2.0, it's unclear if thinking hurts or not, likely yes, so it's actually a big gap! (thanks @Kangwook_Lee)
6) swe bench multilingual is contaminated so a 1% improvement might not be that big of a deal, but:
> they should have reported on swe bench Pro then
> k2.5 data point on cursorbench
imo the way they report evals in the blog is still weird to me. this is important because it's hard to get the improvement compared to the base model + whether they are over/underselling composer 2, tho i admit this is a recurring issue in frontier releases, due to a lack of good open evals because they are hard to make and doomed to be benchmaxed
elie @eliebakouch:

so 3x the training compute gets you 1% improvement on swe bench multilingual and 21% on terminal bench 2.0 but k2.5 is in non thinking mode? if those benchmarks are useless, it's weird that they are the ones reported in cursor blog then? something is wrong

2 replies · 4 reposts · 58 likes · 11K views
testtm @test_tm7873:
@mkurman88 It's good. But besides qwen, what else have we got? :((
1 reply · 0 reposts · 1 like · 29 views
testtm @test_tm7873:
We have too few really good open-sourced 7B models for some reason. We got epic small models (qwen, PleIAs) and epic big models (minimax, zai, StepFun, kimi). But where are the small-medium models useful for the vram-poor bros at?!
2 replies · 0 reposts · 9 likes · 432 views
Sam Snelling @snellingio:
@mkurman88 yeah it's actually very decent. my only real concern is that cursor doesn't do cached input billing, which takes any financial advantage away from them. would recommend though!
0 replies · 0 reposts · 1 like · 46 views
Sam Snelling @snellingio:
so if cursor is k2.5 with RL, that just makes me want to use it more?
0 replies · 0 reposts · 3 likes · 46 views