Anton Shevtsov
@Shevan05
12 posts
Amsterdam, The Netherlands · Joined March 2012
24 Following · 35 Followers
Anton Shevtsov retweeted
Ori Press (@ori_press)
We evaluated GPT-5.4 on AlgoTune: for the first time an @OpenAI model is worse than its predecessor. Some analysis: In graph_laplacian, GPT-5.2's approach is: build the sparse matrix once, call SciPy’s Laplacian routine, and return the sparse result directly. (cont.)🧵
2 replies · 2 reposts · 19 likes · 4.5K views
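The approach described above — build the sparse matrix once, hand it to SciPy's Laplacian routine, and return the sparse result without a dense round-trip — can be sketched as follows. The `solve` signature and the COO-style inputs are assumptions for illustration, not AlgoTune's actual task interface.

```python
# Hedged sketch of the described graph_laplacian strategy.
# Assumed interface: edge lists in COO form plus the node count.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian

def solve(rows, cols, vals, n):
    """Build the sparse adjacency once, delegate to SciPy, stay sparse."""
    adj = csr_matrix((vals, (rows, cols)), shape=(n, n))
    return laplacian(adj)  # returns a sparse L = D - A; no densification

# Tiny example: the path graph 0-1-2 (symmetric adjacency)
L = solve([0, 1, 1, 2], [1, 0, 2, 1], [1.0, 1.0, 1.0, 1.0], 3)
```

The point of the strategy is that the Laplacian of a sparse graph stays sparse end to end, which avoids the memory blow-up a dense intermediate would cause on large graphs.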
Anton Shevtsov retweeted
Ibragim (@ibragim_bad)
📟 Meet SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents!

We at Nebius AI R&D are releasing the biggest open dataset of RL environments for training coding agents. We built an automated pipeline to extract real-world tasks at scale, and now we are sharing everything with the community. This release is designed for large-scale RL training.

What’s inside:
> 32,000+ executable tasks — every task is based on a real-world issue and comes with a pre-built Docker env.
> 20 programming languages — moving beyond Python-only datasets (including less-represented ones like Lua, Clojure, etc.).
> 120,000+ extra tasks derived from real pull requests.
> High quality — tasks are filtered and labeled using an LLM ensemble. They are also enriched with metadata and tested interfaces to ensure solvability.

We are also dropping a technical report with all the details on our extraction pipeline and model evaluations.

📄 Paper and dataset: huggingface.co/papers/2602.23…
👾 Discord (we are online there for any feedback/issues): discord.gg/wXYmWpMu

We are open to research collaborations — feel free to reach out!
🔁 If you find this useful, please help us spread the word by sharing
13 replies · 44 reposts · 349 likes · 45K views
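To make the shape of such a dataset concrete, here is a minimal sketch of selecting non-Python tasks for a multilingual training run. The field names (`language`, `docker_image`, `source`) and the toy records are assumptions for illustration only, not the dataset's documented schema.

```python
# Toy records with an ASSUMED schema, purely to illustrate filtering
# a multilingual task set; not the real SWE-rebench-V2 fields.
tasks = [
    {"id": "t1", "language": "Python",  "docker_image": "img:1", "source": "issue"},
    {"id": "t2", "language": "Lua",     "docker_image": "img:2", "source": "issue"},
    {"id": "t3", "language": "Clojure", "docker_image": "img:3", "source": "pr"},
]

def by_language(tasks, wanted):
    """Keep only tasks whose language is in the wanted set."""
    return [t for t in tasks if t["language"] in wanted]

# Select the less-represented languages mentioned in the post
non_python = by_language(tasks, {"Lua", "Clojure"})
```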
Anton Shevtsov retweeted
Alexander Golubev (@agolubev13)
Running MCTS in stateful environments, rolling back faulty actions, branching for exploration, reverting a series of steps once you hit a dead end – all of this kept coming up in our SWE-agent research for the past year. And it required a lot of complex tooling, especially for Docker containers where actions aren't easily revertible.

Today our R&D team is releasing contree.dev – a sandbox we built to solve exactly this. The core idea is that ConTree snapshots the container filesystem state after each command, so all of the above becomes straightforward to implement.

If you're working on SWE-agent research, you might find it especially useful. It goes with 7k+ SWE environments ready to launch, as well as mini-SWE-agent integration.

It's early access right now and we have some keys available. DM me if you're interested, feedback is very much appreciated.
0 replies · 4 reposts · 6 likes · 224 views
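The snapshot-per-command idea above can be modeled in a few lines: each executed command produces a child snapshot, so rollback and branching become tree moves. This is a toy in-memory model, not the ConTree API; the class and method names are invented for illustration.

```python
# Toy model of snapshot-tree sandboxing: every "command" freezes the
# resulting state as a child node, making rollback/branching trivial.
class Snapshot:
    def __init__(self, state, parent=None):
        self.state = dict(state)  # frozen filesystem-like state
        self.parent = parent

class SnapshotTree:
    def __init__(self):
        self.head = Snapshot({})

    def run(self, path, content):
        """'Execute a command', then snapshot the resulting state."""
        state = dict(self.head.state)
        state[path] = content
        self.head = Snapshot(state, parent=self.head)
        return self.head

    def rollback(self):
        """Revert the last action by moving back to the parent snapshot."""
        if self.head.parent is not None:
            self.head = self.head.parent
        return self.head

    def branch_from(self, snapshot):
        """Resume exploration from any earlier snapshot (MCTS-style)."""
        self.head = snapshot
        return self.head

tree = SnapshotTree()
good = tree.run("a.txt", "v1")   # keep this state around
tree.run("b.txt", "oops")        # faulty action...
tree.rollback()                  # ...reverted instantly
tree.branch_from(good)           # branch for exploration
tree.run("c.txt", "v2")
```

The hard part the post alludes to is doing this for real container filesystems, where "copy the state" is expensive unless snapshots are taken at the filesystem layer.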
Anton Shevtsov (@Shevan05)
Good point — plotting error rate (1 − resolved) can make relative gains more visible near saturation. At the moment we’re still far from that regime (top resolved ~53%, pass@5 ~70%), so resolved rate remains intuitive. If we start approaching saturation, we can adjust difficulty via task selection — e.g., include harder issues or broaden to multilingual tasks — since SWE-rebench is refreshed monthly from newly created GitHub issue+PR pairs.
0 replies · 0 reposts · 1 like · 130 views
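The saturation point in this exchange comes down to simple arithmetic: near the ceiling, the same absolute improvement looks tiny on the resolved axis but large on the unresolved axis. A worked example:

```python
# Why 1 - resolved makes near-saturation progress visible:
# going 98% -> 99% resolved is a ~1% relative gain in resolved rate,
# but it halves the number of unresolved tasks.
def relative_gain(old, new):
    """Relative improvement measured on the resolved rate."""
    return (new - old) / old

def relative_error_drop(old, new):
    """Relative reduction of the unresolved rate (1 - resolved)."""
    return ((1 - old) - (1 - new)) / (1 - old)

gain = relative_gain(0.98, 0.99)        # ~0.0102: barely visible
drop = relative_error_drop(0.98, 0.99)  # 0.5: half the failures gone
```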
Santiago Afonso (@SantiagoAfonso)
@Shevan05 @ibragim_bad @agolubev13 Great bench. Have you considered using unresolved rate for the y-axis? When nearing saturation, resolved rate makes progress appear much slower (e.g. going from 98% to 99%).
2 replies · 0 reposts · 2 likes · 195 views
Anton Shevtsov (@Shevan05)
We’ve updated SWE-rebench (January set). Key pattern: there’s a clear ~1M token wall.

SWE-rebench is a live benchmark: each month we add fresh real-world SWE tasks (GitHub issue + PR pairs) and evaluate models in a coding-agent setup. In this setup, models iteratively read files, write patches, run tests, observe failures, and refine solutions. Token counts therefore reflect full agent trajectories — not single-shot completions.

1. A clear top cluster
Claude Code, Claude Opus 4.6, and gpt-5.2-xhigh lead the leaderboard while operating in the ~1–2M tokens per problem regime. Frontier-level results are associated with both strong model capability and long execution traces.

2. Marginal gains beyond ~1M tokens
Beyond ~1M tokens/problem, additional tokens yield only marginal pass@1 gains. Token budget becomes a dominant scaling axis. If a deployment cannot afford ~1M+ tokens per task, it is unlikely to reach the top accuracy cluster.

3. Efficiency matters
gpt-5.2-codex is a notable exception. It operates below ~1M tokens/problem yet achieves strong performance relative to the frontier group. Raw token volume alone does not determine outcomes. Trace efficiency — how effectively an agent uses its budget — is a critical factor.

Takeaway
SWE-rebench positioning is shaped by two interacting axes:
- Model capability
- Token budget and utilization efficiency
Top-cluster systems combine both. Efficient systems demonstrate that careful trace usage can narrow much of the gap without matching the highest token budgets.

swe-rebench.com
3 replies · 3 reposts · 28 likes · 8.1K views
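The "token wall" claim above is a statement about marginal gains, which is easy to make precise: pass@1 gained per extra million tokens between adjacent budget levels. The numbers below are hypothetical, chosen only to illustrate the shape (steep below ~1M, flat beyond); they are not leaderboard values.

```python
# HYPOTHETICAL (tokens_m, pass@1) points illustrating a ~1M token wall;
# not real SWE-rebench results.
runs = [
    {"tokens_m": 0.5, "pass_at_1": 0.40},
    {"tokens_m": 1.0, "pass_at_1": 0.50},
    {"tokens_m": 2.0, "pass_at_1": 0.52},
]

def marginal_gains(runs):
    """Pass@1 gained per additional million tokens between runs."""
    return [
        (cur["pass_at_1"] - prev["pass_at_1"]) / (cur["tokens_m"] - prev["tokens_m"])
        for prev, cur in zip(runs, runs[1:])
    ]

gains = marginal_gains(runs)  # steep before the wall, shallow after
```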
Anton Shevtsov (@Shevan05)
Hi, we run all models with consistent methodology, infrastructure, and sampling across releases. SWE-rebench evaluates on fresh monthly tasks, and with a limited task set some month-to-month variance is expected depending on issue distribution. Over longer time windows, rankings tend to be more stable.
0 replies · 0 reposts · 1 like · 112 views
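The variance point above has a standard back-of-the-envelope form: a resolved rate estimated on n tasks carries binomial sampling error sqrt(p(1-p)/n). The 50-task monthly set below is a hypothetical size for illustration, not the benchmark's actual count.

```python
# Sampling noise of a resolved rate on a limited monthly task set.
import math

def resolved_rate_stderr(p, n):
    """Binomial standard error of a resolved rate p estimated on n tasks."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical: ~50% resolved on a 50-task month gives roughly
# +/-7 percentage points of one-sigma noise, so small rank swaps
# between close models are expected month to month.
se = resolved_rate_stderr(0.5, 50)
```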
Anton Shevtsov (@Shevan05)
Hi, gpt-5.3-codex is not yet broadly available via API, and current access limits make large-scale, reproducible evaluation difficult. Token pricing has also not been finalized publicly, which prevents consistent cost comparisons. For this release, we evaluated Codex using gpt-5.2-codex, which is fully available and benchmarkable. We’ll add 5.3-codex once API access and pricing are finalized.
1 reply · 0 reposts · 4 likes · 212 views
Anton Shevtsov retweeted
Nebius (@nebiusai)
Our own SWE-rebench just became the #1 most downloaded dataset on @HuggingFace 🥇 SWE-rebench is a dataset and benchmark for code agents based on LLMs, developed by our AI R&D team. It has been downloaded more than 3.9M times — 3.1M in the last month. 1/4
2 replies · 11 reposts · 85 likes · 8.9K views
Anton Shevtsov retweeted
hr0nix (@hr0nix)
An extended writeup of our earlier research blogpost on training critics for SWE agents has been accepted to ICML! Some details below ⬇️
1 reply · 5 reposts · 14 likes · 5K views
Anton Shevtsov retweeted
hr0nix (@hr0nix)
Spirit of open-source is in the air thanks to DeepSeek! And today we are happy to release kvax, our implementation of flash attention for jax! It is very fast and has some advanced features such as context parallelism support that might not be easy to come by. Details ⬇️
4 replies · 19 reposts · 64 likes · 59.9K views
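For orientation on what a flash-attention kernel like kvax computes: the function is ordinary softmax attention; the kernel's contribution is computing it in tiles so the full score matrix never materializes. The NumPy reference below is that function in its naive form, not kvax's API or implementation.

```python
# Naive reference attention: softmax(q k^T / sqrt(d)) v.
# A flash-attention kernel produces the same output, but streamed
# over key/value tiles with an online softmax instead of this full
# (seq x seq) score matrix.
import numpy as np

def attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(q, k, v)
```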
Anton Shevtsov retweeted
Nebius (@nebiusai)
We release the world’s first datasets for training software engineering agents🔥 More specifically, our AI R&D team uploaded two datasets to @HuggingFace: one with 6,411 Issue-PR pairs, and the other with 80,036 agent trajectories. Learn more on our blog: eu1.hubs.ly/H0fwMB00
5 replies · 14 reposts · 46 likes · 7.2K views
Anton Shevtsov retweeted
hr0nix (@hr0nix)
As a follow up to our work on applying search to software engineering agents, today we are releasing datasets of problem instances and agent trajectories. This is the training data we previously used to achieve 40.6% on SWE-bench Verified using open-weight models only! 🧵⬇️
2 replies · 19 reposts · 37 likes · 3.6K views