Van0SS

61 posts


@Van0SS

CTO & Co-founder

SF · Joined July 2013
101 Following · 80 Followers
Ivan Bercovich
Ivan Bercovich@neversupervised·
How come these near-AGI models can be so stupid at times? Telling you to walk to the nearby car wash, or stating that a cup with a sealed top and an open bottom is useless (it's upside down).

LLMs learn differently than humans do. As models get trained, they develop islands of generalization. When we step outside that territory, the behavior is disappointing. When we're operating in the right domain, an AI is much, much smarter than all but a tiny percentage of humans at most topics. Outside it, an AI can likewise be much dumber than all but a small fraction of humans. LLMs have much peakier learning than humans do.

But as we make them bigger and feed them more FLOPs, the islands grow and start to overlap. It becomes harder and harder to find notable examples, which is why these prompts go viral. The scaling laws continue to work. Error rates continue to drop predictably. AIs will continue to outsmart humans in more and more end-to-end tasks. Eventually this will cover most economically valuable tasks.

That's not to say there aren't issues, that benchmarks aren't flawed, or that transformers are sufficient to get to AGI. These examples are great for honing our intuition about how AI works. But they aren't hard evidence against AGI.
1
0
1
52
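The "error rates continue to drop predictably" claim refers to power-law scaling. A minimal sketch of fitting such a law, with entirely made-up numbers for illustration:

```python
import numpy as np

# Hypothetical illustration: scaling laws say error falls roughly as a
# power law in training compute, err(C) ~ a * C**(-b), so a log-log
# fit is linear. All data points below are invented.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])   # FLOPs (made up)
error   = np.array([0.42, 0.30, 0.21, 0.15, 0.105])  # eval error (made up)

# Linear regression in log space recovers the exponent b and scale a.
slope, intercept = np.polyfit(np.log(compute), np.log(error), 1)
a, b = np.exp(intercept), -slope

def predicted_error(flops: float) -> float:
    """Extrapolate the fitted power law to a new compute budget."""
    return a * flops ** (-b)
```

With a fit like this, "predictable" just means the extrapolated error keeps falling as compute grows, until the trend breaks.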
Anton Shevtsov
Anton Shevtsov@Shevan05·
We've updated SWE-rebench (January set). Key pattern: there's a clear ~1M-token wall.

SWE-rebench is a live benchmark: each month we add fresh real-world SWE tasks (GitHub issue + PR pairs) and evaluate models in a coding-agent setup. In this setup, models iteratively read files, write patches, run tests, observe failures, and refine solutions. Token counts therefore reflect full agent trajectories, not single-shot completions.

1. A clear top cluster. Claude Code, Claude Opus 4.6, and gpt-5.2-xhigh lead the leaderboard while operating in the ~1–2M tokens-per-problem regime. Frontier-level results are associated with both strong model capability and long execution traces.

2. Marginal gains beyond ~1M tokens. Beyond ~1M tokens/problem, additional tokens yield only marginal pass@1 gains. Token budget becomes a dominant scaling axis: if a deployment cannot afford ~1M+ tokens per task, it is unlikely to reach the top accuracy cluster.

3. Efficiency matters. gpt-5.2-codex is a notable exception: it operates below ~1M tokens/problem yet achieves strong performance relative to the frontier group. Raw token volume alone does not determine outcomes; trace efficiency, how effectively an agent uses its budget, is a critical factor.

Takeaway: SWE-rebench positioning is shaped by two interacting axes:
- Model capability
- Token budget and utilization efficiency

Top-cluster systems combine both. Efficient systems demonstrate that careful trace usage can narrow much of the gap without matching the highest token budgets.

swe-rebench.com
Anton Shevtsov tweet media
3
3
28
8.1K
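A minimal sketch of how pass@1 and tokens-per-problem might be aggregated from agent trajectories. The record schema, model names, and numbers are invented for illustration, not SWE-rebench's actual format:

```python
from statistics import mean

# Each record is one task attempt: full-trajectory token count plus
# whether the final patch passed the tests (pass@1). All values are
# hypothetical stand-ins.
runs = [
    {"model": "model-a", "tokens": 1_800_000, "passed": True},
    {"model": "model-a", "tokens": 1_200_000, "passed": False},
    {"model": "model-b", "tokens":   700_000, "passed": True},
    {"model": "model-b", "tokens":   650_000, "passed": True},
]

def leaderboard(records):
    """Group by model; report pass@1 rate and mean tokens per problem."""
    by_model = {}
    for r in records:
        by_model.setdefault(r["model"], []).append(r)
    return {
        m: {
            "pass@1": mean(1.0 if r["passed"] else 0.0 for r in rs),
            "mean_tokens": mean(r["tokens"] for r in rs),
        }
        for m, rs in by_model.items()
    }
```

Plotting pass@1 against mean_tokens per model is exactly the view where the ~1M-token wall and the efficiency outliers show up.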
Van0SS reposted
Ryan Marten
Ryan Marten@ryanmart3n·
Exciting to see a standard API emerge for training that lets you drop in different backends. Moving flexibly between open-source infra on self-managed clusters and hosted solutions, based on your needs for scale / sovereignty, is massively valuable.
Tyler Griggs@tyler_griggs_

SkyRL now implements the Tinker API. Training scripts written for Tinker can now run on your own GPUs with zero code changes using SkyRL's FSDP2, Megatron, and vLLM backends. Blog: novasky-ai.notion.site/skyrl-tinker 🧵

0
5
27
3K
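The value of a standard training API can be sketched as a shared interface that multiple backends implement, so the training script never changes. All class and method names below are invented, not the actual Tinker or SkyRL API:

```python
from typing import Protocol

# Hypothetical "standard training API": the script depends only on this
# interface, so backends can be swapped without touching the loop.
class TrainingBackend(Protocol):
    def forward_backward(self, batch: list[str]) -> float: ...
    def optim_step(self) -> None: ...

class HostedBackend:
    """Stand-in for a managed service; returns a stub loss."""
    def forward_backward(self, batch): return 0.5
    def optim_step(self): pass

class SelfManagedBackend:
    """Stand-in for open-source infra on your own cluster."""
    def forward_backward(self, batch): return 0.5
    def optim_step(self): pass

def train(backend: TrainingBackend, batches) -> list[float]:
    """Backend-agnostic loop: swap backends with zero script changes."""
    losses = []
    for batch in batches:
        losses.append(backend.forward_backward(batch))
        backend.optim_step()
    return losses
```

The same `train()` call works against either backend, which is the "zero code changes" property the tweet describes.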
Van0SS
Van0SS@Van0SS·
@guohao_li That's amazing! Curious, how did you source the data?
0
0
0
73
Guohao Li 🐫
Guohao Li 🐫@guohao_li·
We just dropped ~1,000 more terminal coding RL training environments. OpenAI and Anthropic just released the GPT-5.3-Codex and Opus 4.6 models. Terminal-bench 2.0 is one of the most important benchmarks and the only one that appears in both of their benchmark suites. However, there are not enough high-quality open-source terminal coding training environments. In SETA, we open-sourced 1,376 validated terminal environments across: SE • sysadmin • security • debugging • networking • DevOps. Compatible with Terminal Bench & available in the Harbor framework registry. GitHub: github.com/camel-ai/seta-…
Guohao Li 🐫@guohao_li

Frontier labs spend millions purchasing RL environments for training terminal agents. But we decided to open-source ours. Introducing SETA: Scaling Environments for Terminal Agents, the largest open-source set of RL training environments for terminal agents. We released:
- 400 terminal agent training environments, more to come
- a SOTA agent harness on terminal-bench with the CAMEL terminal toolkit
- the RL training pipeline and trained SETA-RL-Qwen3-8B model weights

11
23
264
26.9K
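For intuition, a terminal-coding RL environment can be sketched as a step loop where the agent emits shell commands and the reward comes from the task's checks. This interface is purely illustrative, not SETA's or Harbor's actual API:

```python
import subprocess

class TerminalEnv:
    """Toy terminal RL environment: reward 1.0 when the task's checks pass.

    check_cmd is a stand-in for a real task's test command.
    """
    def __init__(self, check_cmd: str):
        self.check_cmd = check_cmd

    def step(self, command: str):
        """Run one agent command, then re-run the checks for the reward."""
        out = subprocess.run(command, shell=True,
                             capture_output=True, text=True)
        passed = subprocess.run(self.check_cmd, shell=True,
                                capture_output=True).returncode == 0
        reward = 1.0 if passed else 0.0
        return out.stdout + out.stderr, reward, passed  # obs, reward, done
```

Real environment suites wrap this loop in containers plus per-task validation, but the agent-facing contract is essentially commands in, (observation, reward, done) out.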
Van0SS
Van0SS@Van0SS·
@myhandleisbest Yeah, maybe it's just kids trying to hustle hard and get a job using ChatGPT
1
0
0
23
Logan
Logan@myhandleisbest·
@Van0SS Has happened to us on upwork before 😆
1
0
1
25
Van0SS
Van0SS@Van0SS·
@adcock_brett am i dumb that i can't even pass the first page myself? seems like a strange challenge for agents if a human can't do it (maybe not the smartest one)
0
0
4
1.4K
Brett Adcock
Brett Adcock@adcock_brett·
Solve this in under 5 minutes and I'll offer you $500k/year in cash plus several million in equity. I'm building a Computer-Use team; the goal is to use computers better than humans. No experience or PhD needed.

Instructions:
1. Solve all 30 challenges on this website in under 5 minutes: serene-frangipane-7fd25b.netlify.app
2. Feel free to use any tools or vibe-code it. Provide us a zip folder with instructions on how to run the agent and reproduce your results, as well as your run statistics.
3. The agent should be able to solve all the challenges, use a browser, and provide overall metrics around time taken, token usage, and token cost. Your agent must solve this challenge in under 5 minutes.

Email your response: agents@brettadcock.com

If you have any questions about this challenge, feel free to email us
344
147
2.6K
1.4M
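The run statistics the challenge asks for (time taken, token usage, token cost) could be collected with a thin wrapper like this. The `solve()` callable and the per-token price are stand-ins, not anything specified by the challenge:

```python
import time

# Assumed blended price in $ per 1M tokens; purely illustrative.
PRICE_PER_MTOK = 3.00

def run_with_metrics(challenges, solve):
    """Run solve() on each challenge, tallying time, tokens, and cost.

    solve(challenge) is assumed to return (success: bool, tokens: int).
    """
    start, total_tokens, solved = time.monotonic(), 0, 0
    for ch in challenges:
        ok, tokens_used = solve(ch)
        total_tokens += tokens_used
        solved += ok
    elapsed = time.monotonic() - start
    return {
        "solved": solved,
        "elapsed_s": round(elapsed, 2),
        "tokens": total_tokens,
        "cost_usd": round(total_tokens / 1_000_000 * PRICE_PER_MTOK, 4),
    }
```

Reporting `elapsed_s` against the 5-minute budget and the token totals per run is exactly the metrics bundle the instructions request.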
Van0SS
Van0SS@Van0SS·
asking claude code to configure itself inside clawdbot. 2026 just started
Van0SS tweet media
0
0
3
130
Van0SS
Van0SS@Van0SS·
@daniel_mints i'll shave my head when we get paid, reverse motivation
0
0
2
271
Daniel Mints
Daniel Mints@daniel_mints·
having an unemployed crash out just shaved my head ggs
Daniel Mints tweet media
47
0
159
16.4K
Timur Khakhalev
Timur Khakhalev@timurkhakhalev·
In Jan 2026 I still face this type of issue in gemini cli. I don't understand why I should pay even a penny for this. Unbelievable
Timur Khakhalev tweet media
2
0
1
206
Oliver Prompts
Oliver Prompts@oliviscusAI·
Tencent just killed fine-tuning and RL with an $18 budget 🤯 They developed a method that replaces traditional Reinforcement Learning (RL) entirely. It's called Training-Free GRPO. It lets LLMs learn from 100 examples by treating memory as a policy optimizer.
Oliver Prompts tweet media
39
189
1.5K
217.9K
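One reading of "memory as a policy optimizer" (hedged: the paper's actual algorithm may differ) is that instead of taking gradient steps, the system compares a group of rollouts and banks a distilled lesson into a text memory that conditions future answers. A toy sketch with stand-in functions:

```python
# Toy illustration only: llm() and distill_lesson() are invented
# stand-ins, not Tencent's method or any real API.
def llm(prompt: str) -> str:
    return "stub answer"          # placeholder for a real model call

def distill_lesson(question, best, worst) -> str:
    # A real system would ask a model to summarize why best beat worst.
    return f"Prefer answers like {best!r} over {worst!r}."

def training_free_update(memory, question, rollouts, reward):
    """Score a group of rollouts and bank a lesson instead of a gradient."""
    scored = sorted(rollouts, key=reward, reverse=True)
    memory.append(distill_lesson(question, scored[0], scored[-1]))
    return memory

def answer(memory, question):
    context = "\n".join(memory)   # the memory acts as the "policy"
    return llm(f"{context}\n\nQ: {question}\nA:")
```

The group-wise comparison of rollouts is where the GRPO analogy would come in; the weights never change, only the memory does.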
Nadia Zueva
Nadia Zueva@nestymee·
everyone wants distribution nobody wants to post every day
222
5
322
11K
Philippe Laban
Philippe Laban@PhilippeLaban·
@karpathy In arxiv.org/abs/2409.14509, we paid expert writers to "remove the slop" from AI writing. We did a categorization of the 7 most common edit types, and found (surprisingly) Gemini / GPT / Llama have similar distributions of edit types (types of slop): a slop recipe.
Philippe Laban tweet media (two images)
1
0
3
200
Andrej Karpathy
Andrej Karpathy@karpathy·
Has anyone encountered a good definition of “slop”. In a quantitative, measurable sense. My brain has an intuitive “slop index” I can ~reliably estimate, but I’m not sure how to define it. I have some bad ideas that involve the use of LLM miniseries and thinking token budgets.
941
158
4.3K
649.8K
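One crude, measurable proxy in the spirit of the question: combine n-gram repetition with the rate of stock filler phrases. The phrase list and weighting below are invented, a toy heuristic rather than a validated metric:

```python
import re
from collections import Counter

# Invented list of stock AI-writing phrases; a real index would need a
# much larger, empirically derived set.
FILLER = {"in today's fast-paced world", "delve into", "it's important to note"}

def slop_index(text: str) -> float:
    """Toy slop score: trigram repetition rate plus filler-phrase rate."""
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < 3:
        return 0.0
    trigrams = Counter(zip(words, words[1:], words[2:]))
    repeated = sum(c - 1 for c in trigrams.values() if c > 1)
    repetition = repeated / len(words)
    lowered = text.lower()
    filler = sum(lowered.count(p) for p in FILLER) / max(1, len(words) / 100)
    return round(repetition + filler, 4)
```

This misses most of what intuition flags as slop (hedging, vacuous symmetry, over-explanation), which is arguably the point: the intuitive index is easy to feel and hard to operationalize.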
Van0SS
Van0SS@Van0SS·
starlink helps to touch grass and still ship
Van0SS tweet media
0
0
1
354
Lucky Robots
Lucky Robots@luckyrobots·
We’re building a new kind of robotics simulator, stay tuned.
17
14
144
144.6K
shirish
shirish@shiri_shh·
Show me your app, website or project and I’ll share my honest thoughts👇
941
8
549
54.9K
Van0SS
Van0SS@Van0SS·
@nestymee you can just hire an intern who would be replying for you
0
0
1
13
Nadia Zueva
Nadia Zueva@nestymee·
no one talks about how hard it actually is to be a reply guy
437
13
443
20.9K
Van0SS
Van0SS@Van0SS·
this is what happens when you come to sf. yeah, it's a portable sauna, i didn't even know it existed
Van0SS tweet media (two images)
2
0
6
838
Van0SS
Van0SS@Van0SS·
@nestymee But don't tear your ACL like I did. Prolly no more bouldering for me
1
0
1
26
Nadia Zueva
Nadia Zueva@nestymee·
Funny how bouldering resets my brain better than any productivity hack. This is your sign to try it
Nadia Zueva tweet media
3
0
14
561
Van0SS
Van0SS@Van0SS·
@NeginRaoof_ Very impressive work! Curious how many GPU-hours it took to train each stage?
1
0
0
92
Negin Raoof
Negin Raoof@NeginRaoof_·
How can we make a better TerminalBench agent? Today, we are announcing the OpenThoughts-Agent project. OpenThoughts-Agent v1 is the first TerminalBench agent trained on fully open curated SFT and RL environments. OpenThinker-Agent-v1 is the strongest model of its size on TerminalBench, and sets a new bar on our newly released OpenThoughts-TB-Dev benchmark. (1/n)
Negin Raoof tweet media
17
77
289
125K