Van0SS

61 posts


@Van0SS

CTO & Co-founder

SF · Joined July 2013
101 Following · 80 Followers
Ivan Bercovich
Ivan Bercovich@neversupervised·
How come these near-AGI models can be so stupid at times? Telling you to walk to the nearby car wash, or stating that a cup with a sealed top and an open bottom is useless (it's upside down).

LLMs learn differently than humans do. As models get trained, they develop islands of generalization. When we step outside that territory, the behavior is disappointing. When we're operating in the right domain, an AI is much, much smarter than all but a tiny percentage of humans at most topics. Outside it, an AI can likewise be much dumber than all but a small fraction of humans. LLMs have much peakier learning than humans do.

But as we make them bigger and feed them more FLOPs, the islands grow and start to overlap. It becomes harder and harder to find notable examples, which is why these prompts go viral. The scaling laws continue to work. Error rates continue to drop predictably. AIs will continue to outsmart humans in more and more end-to-end tasks. Eventually this will cover most economically valuable tasks.

That's not to say there aren't issues, that benchmarks aren't flawed, or that transformers are sufficient to get to AGI. These examples are great for honing our intuition about how AI works. But they aren't hard evidence against AGI.
1
0
1
52
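The "error rates continue to drop predictably" claim refers to power-law scaling. A minimal sketch of fitting such a law, with entirely made-up numbers for illustration:

```python
import numpy as np

# Hypothetical illustration: scaling laws say error falls roughly as a
# power law in training compute, err(C) ~ a * C**(-b), so a log-log
# fit is linear. All data points below are invented.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])   # FLOPs (made up)
error   = np.array([0.42, 0.30, 0.21, 0.15, 0.105])  # eval error (made up)

# Linear regression in log space recovers the exponent b and scale a.
slope, intercept = np.polyfit(np.log(compute), np.log(error), 1)
a, b = np.exp(intercept), -slope

def predicted_error(flops: float) -> float:
    """Extrapolate the fitted power law to a new compute budget."""
    return a * flops ** (-b)
```

With a fit like this, "predictable" just means the extrapolated error keeps falling as compute grows, until the trend breaks.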
Anton Shevtsov
Anton Shevtsov@Shevan05·
We've updated SWE-rebench (January set). Key pattern: there's a clear ~1M-token wall.

SWE-rebench is a live benchmark: each month we add fresh real-world SWE tasks (GitHub issue + PR pairs) and evaluate models in a coding-agent setup. In this setup, models iteratively read files, write patches, run tests, observe failures, and refine solutions. Token counts therefore reflect full agent trajectories, not single-shot completions.

1. A clear top cluster. Claude Code, Claude Opus 4.6, and gpt-5.2-xhigh lead the leaderboard while operating in the ~1–2M tokens-per-problem regime. Frontier-level results are associated with both strong model capability and long execution traces.

2. Marginal gains beyond ~1M tokens. Beyond ~1M tokens/problem, additional tokens yield only marginal pass@1 gains. Token budget becomes a dominant scaling axis: if a deployment cannot afford ~1M+ tokens per task, it is unlikely to reach the top accuracy cluster.

3. Efficiency matters. gpt-5.2-codex is a notable exception: it operates below ~1M tokens/problem yet achieves strong performance relative to the frontier group. Raw token volume alone does not determine outcomes; trace efficiency, how effectively an agent uses its budget, is a critical factor.

Takeaway: SWE-rebench positioning is shaped by two interacting axes:
- Model capability
- Token budget and utilization efficiency

Top-cluster systems combine both. Efficient systems demonstrate that careful trace usage can narrow much of the gap without matching the highest token budgets.

swe-rebench.com
Anton Shevtsov tweet media
3
3
28
8.1K
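A minimal sketch of how pass@1 and tokens-per-problem might be aggregated from agent trajectories. The record schema, model names, and numbers are invented for illustration, not SWE-rebench's actual format:

```python
from statistics import mean

# Each record is one task attempt: full-trajectory token count plus
# whether the final patch passed the tests (pass@1). All values are
# hypothetical stand-ins.
runs = [
    {"model": "model-a", "tokens": 1_800_000, "passed": True},
    {"model": "model-a", "tokens": 1_200_000, "passed": False},
    {"model": "model-b", "tokens":   700_000, "passed": True},
    {"model": "model-b", "tokens":   650_000, "passed": True},
]

def leaderboard(records):
    """Group by model; report pass@1 rate and mean tokens per problem."""
    by_model = {}
    for r in records:
        by_model.setdefault(r["model"], []).append(r)
    return {
        m: {
            "pass@1": mean(1.0 if r["passed"] else 0.0 for r in rs),
            "mean_tokens": mean(r["tokens"] for r in rs),
        }
        for m, rs in by_model.items()
    }
```

Plotting pass@1 against mean_tokens per model is exactly the view where the ~1M-token wall and the efficiency outliers show up.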
Van0SS reposted
Ryan Marten
Ryan Marten@ryanmart3n·
Exciting to see a standard API emerge for training that lets you drop in different backends. Moving flexibly between open-source infra on self-managed clusters and hosted solutions, based on your needs for scale / sovereignty, is massively valuable.
Tyler Griggs@tyler_griggs_

SkyRL now implements the Tinker API. Training scripts written for Tinker can now run on your own GPUs with zero code changes using SkyRL's FSDP2, Megatron, and vLLM backends. Blog: novasky-ai.notion.site/skyrl-tinker 🧵

0
5
27
3K
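The value of a standard training API can be sketched as a shared interface that multiple backends implement, so the training script never changes. All class and method names below are invented, not the actual Tinker or SkyRL API:

```python
from typing import Protocol

# Hypothetical "standard training API": the script depends only on this
# interface, so backends can be swapped without touching the loop.
class TrainingBackend(Protocol):
    def forward_backward(self, batch: list[str]) -> float: ...
    def optim_step(self) -> None: ...

class HostedBackend:
    """Stand-in for a managed service; returns a stub loss."""
    def forward_backward(self, batch): return 0.5
    def optim_step(self): pass

class SelfManagedBackend:
    """Stand-in for open-source infra on your own cluster."""
    def forward_backward(self, batch): return 0.5
    def optim_step(self): pass

def train(backend: TrainingBackend, batches) -> list[float]:
    """Backend-agnostic loop: swap backends with zero script changes."""
    losses = []
    for batch in batches:
        losses.append(backend.forward_backward(batch))
        backend.optim_step()
    return losses
```

The same `train()` call works against either backend, which is the "zero code changes" property the tweet describes.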
Van0SS
Van0SS@Van0SS·
@guohao_li That's amazing! Curious, how did you source the data?
0
0
0
73
Guohao Li 🐫
Guohao Li 🐫@guohao_li·
We just dropped ~1,000 more terminal coding RL training environments. OpenAI and Anthropic just released the GPT-5.3-Codex and Opus 4.6 models. Terminal-bench 2.0 is one of the most important benchmarks and the only one that appears in both of their benchmark suites. However, there are not enough high-quality open-source terminal coding training environments. In SETA, we open-sourced 1,376 validated terminal environments across: SE • sysadmin • security • debugging • networking • DevOps. Compatible with Terminal Bench & available in the Harbor framework registry. GitHub: github.com/camel-ai/seta-…
Guohao Li 🐫@guohao_li

Frontier labs spend millions purchasing RL environments for training terminal agents. But we decided to open-source ours. Introducing SETA: Scaling Environments for Terminal Agents, the largest open-source set of RL training environments for terminal agents. We released:
- 400 terminal agent training environments, more to come
- a SOTA agent harness on terminal-bench with the CAMEL terminal toolkit
- the RL training pipeline and trained SETA-RL-Qwen3-8B model weights

11
23
264
26.9K
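For intuition, a terminal-coding RL environment can be sketched as a step loop where the agent emits shell commands and the reward comes from the task's checks. This interface is purely illustrative, not SETA's or Harbor's actual API:

```python
import subprocess

class TerminalEnv:
    """Toy terminal RL environment: reward 1.0 when the task's checks pass.

    check_cmd is a stand-in for a real task's test command.
    """
    def __init__(self, check_cmd: str):
        self.check_cmd = check_cmd

    def step(self, command: str):
        """Run one agent command, then re-run the checks for the reward."""
        out = subprocess.run(command, shell=True,
                             capture_output=True, text=True)
        passed = subprocess.run(self.check_cmd, shell=True,
                                capture_output=True).returncode == 0
        reward = 1.0 if passed else 0.0
        return out.stdout + out.stderr, reward, passed  # obs, reward, done
```

Real environment suites wrap this loop in containers plus per-task validation, but the agent-facing contract is essentially commands in, (observation, reward, done) out.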
Van0SS
Van0SS@Van0SS·
@myhandleisbest Yeah, maybe it's just kids trying to hustle hard and get a job using ChatGPT
1
0
0
23
Logan
Logan@myhandleisbest·
@Van0SS Has happened to us on upwork before 😆
1
0
1
25
Van0SS
Van0SS@Van0SS·
@adcock_brett am i dumb that i can't even pass the first page myself? seems like a strange challenge for agents if a human can't do it (maybe not the smartest one)
0
0
4
1.4K
Brett Adcock
Brett Adcock@adcock_brett·
Solve this in under 5 minutes and I'll offer you $500k/year in cash plus several million in equity. I'm building a Computer-Use team; the goal is to use computers better than humans. No experience or PhD needed.

Instructions:
1. Solve all 30 challenges on this website in under 5 minutes: serene-frangipane-7fd25b.netlify.app
2. Feel free to use any tools or vibe-code it. Provide us a zip folder with instructions on how to run the agent and reproduce your results, as well as your run statistics.
3. The agent should be able to solve all the challenges, use a browser, and provide overall metrics around time taken, token usage, and token cost. Your agent must solve this challenge in under 5 minutes.

Email your response: agents@brettadcock.com

If you have any questions about this challenge, feel free to email us
344
147
2.6K
1.4M
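The run statistics the challenge asks for (time taken, token usage, token cost) could be collected with a thin wrapper like this. The `solve()` callable and the per-token price are stand-ins, not anything specified by the challenge:

```python
import time

# Assumed blended price in $ per 1M tokens; purely illustrative.
PRICE_PER_MTOK = 3.00

def run_with_metrics(challenges, solve):
    """Run solve() on each challenge, tallying time, tokens, and cost.

    solve(challenge) is assumed to return (success: bool, tokens: int).
    """
    start, total_tokens, solved = time.monotonic(), 0, 0
    for ch in challenges:
        ok, tokens_used = solve(ch)
        total_tokens += tokens_used
        solved += ok
    elapsed = time.monotonic() - start
    return {
        "solved": solved,
        "elapsed_s": round(elapsed, 2),
        "tokens": total_tokens,
        "cost_usd": round(total_tokens / 1_000_000 * PRICE_PER_MTOK, 4),
    }
```

Reporting `elapsed_s` against the 5-minute budget and the token totals per run is exactly the metrics bundle the instructions request.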
Van0SS
Van0SS@Van0SS·
asking claude code to configure itself inside clawdbot. 2026 just started
Van0SS tweet media
0
0
3
130
Van0SS
Van0SS@Van0SS·
@daniel_mints i'll shave my head when we get paid, reverse motivation
0
0
2
271
Daniel Mints
Daniel Mints@daniel_mints·
having an unemployed crash out just shaved my head ggs
Daniel Mints tweet media
47
0
159
16.4K
Timur Khakhalev
Timur Khakhalev@timurkhakhalev·
In Jan 2026 I still face this type of issue in gemini cli. I don't understand why I should pay even a penny for this. Unbelievable
Timur Khakhalev tweet media
2
0
1
206
Oliver Prompts
Oliver Prompts@oliviscusAI·
Tencent just killed fine-tuning and RL with an $18 budget 🤯 They developed a method that replaces traditional Reinforcement Learning (RL) entirely. It's called Training-Free GRPO. It lets LLMs learn from 100 examples by treating memory as a policy optimizer.
Oliver Prompts tweet media
39
189
1.5K
217.9K
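One reading of "memory as a policy optimizer" (hedged: the paper's actual algorithm may differ) is that instead of taking gradient steps, the system compares a group of rollouts and banks a distilled lesson into a text memory that conditions future answers. A toy sketch with stand-in functions:

```python
# Toy illustration only: llm() and distill_lesson() are invented
# stand-ins, not Tencent's method or any real API.
def llm(prompt: str) -> str:
    return "stub answer"          # placeholder for a real model call

def distill_lesson(question, best, worst) -> str:
    # A real system would ask a model to summarize why best beat worst.
    return f"Prefer answers like {best!r} over {worst!r}."

def training_free_update(memory, question, rollouts, reward):
    """Score a group of rollouts and bank a lesson instead of a gradient."""
    scored = sorted(rollouts, key=reward, reverse=True)
    memory.append(distill_lesson(question, scored[0], scored[-1]))
    return memory

def answer(memory, question):
    context = "\n".join(memory)   # the memory acts as the "policy"
    return llm(f"{context}\n\nQ: {question}\nA:")
```

The group-wise comparison of rollouts is where the GRPO analogy would come in; the weights never change, only the memory does.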
Nadia Zueva
Nadia Zueva@nestymee·
everyone wants distribution nobody wants to post every day
222
5
322
11K
Philippe Laban
Philippe Laban@PhilippeLaban·
@karpathy In arxiv.org/abs/2409.14509, we paid expert writers to "remove the slop" from AI writing. We did a categorization of the 7 most common edit types, and found (surprisingly) Gemini / GPT / Llama have similar distributions of edit types (types of slop): a slop recipe.
Philippe Laban tweet media (two images)
1
0
3
200
Andrej Karpathy
Andrej Karpathy@karpathy·
Has anyone encountered a good definition of “slop”. In a quantitative, measurable sense. My brain has an intuitive “slop index” I can ~reliably estimate, but I’m not sure how to define it. I have some bad ideas that involve the use of LLM miniseries and thinking token budgets.
941
158
4.3K
649.8K
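One crude, measurable proxy in the spirit of the question: combine n-gram repetition with the rate of stock filler phrases. The phrase list and weighting below are invented, a toy heuristic rather than a validated metric:

```python
import re
from collections import Counter

# Invented list of stock AI-writing phrases; a real index would need a
# much larger, empirically derived set.
FILLER = {"in today's fast-paced world", "delve into", "it's important to note"}

def slop_index(text: str) -> float:
    """Toy slop score: trigram repetition rate plus filler-phrase rate."""
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < 3:
        return 0.0
    trigrams = Counter(zip(words, words[1:], words[2:]))
    repeated = sum(c - 1 for c in trigrams.values() if c > 1)
    repetition = repeated / len(words)
    lowered = text.lower()
    filler = sum(lowered.count(p) for p in FILLER) / max(1, len(words) / 100)
    return round(repetition + filler, 4)
```

This misses most of what intuition flags as slop (hedging, vacuous symmetry, over-explanation), which is arguably the point: the intuitive index is easy to feel and hard to operationalize.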
Van0SS
Van0SS@Van0SS·
starlink helps to touch grass and still ship
Van0SS tweet media
0
0
1
354
Lucky Robots
Lucky Robots@luckyrobots·
We’re building a new kind of robotics simulator, stay tuned.
17
14
144
144.6K
shirish
shirish@shiri_shh·
Show me your app, website or project and I’ll share my honest thoughts👇
941
8
549
54.9K
Van0SS
Van0SS@Van0SS·
@nestymee you can just hire an intern who would be replying for you
0
0
1
13
Nadia Zueva
Nadia Zueva@nestymee·
no one talks about how hard it actually is to be a reply guy
437
13
443
20.9K
Van0SS
Van0SS@Van0SS·
this is what happens when you come to sf. yeah, it's a portable sauna, i didn't even know it existed
Van0SS tweet media (two images)
2
0
6
838
Van0SS
Van0SS@Van0SS·
@nestymee But don't tear your ACL like I did. Prolly no more bouldering for me
1
0
1
26
Nadia Zueva
Nadia Zueva@nestymee·
Funny how bouldering resets my brain better than any productivity hack. This is your sign to try it
Nadia Zueva tweet media
3
0
14
561
Van0SS
Van0SS@Van0SS·
@NeginRaoof_ Very impressive work! Curious how many GPU-hours it took to train each stage?
1
0
0
92
Negin Raoof
Negin Raoof@NeginRaoof_·
How can we make a better TerminalBench agent? Today, we are announcing the OpenThoughts-Agent project. OpenThoughts-Agent v1 is the first TerminalBench agent trained on fully open curated SFT and RL environments. OpenThinker-Agent-v1 is the strongest model of its size on TerminalBench, and sets a new bar on our newly released OpenThoughts-TB-Dev benchmark. (1/n)
Negin Raoof tweet media
17
77
289
125K