Anton Shevtsov
@Shevan05
12 posts
Amsterdam, The Netherlands · Joined March 2012
24 Following · 35 Followers
Anton Shevtsov retweeted
Ori Press (@ori_press)
We evaluated GPT-5.4 on AlgoTune: for the first time an @OpenAI model is worse than its predecessor. Some analysis: In graph_laplacian, GPT-5.2's approach is: build the sparse matrix once, call SciPy’s Laplacian routine, and return the sparse result directly. (cont.)🧵
2 replies · 2 reposts · 19 likes · 4.5K views
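The approach described above — build the sparse matrix once, hand it to SciPy's Laplacian routine, and return the sparse result without a dense round-trip — can be sketched as follows. The `solve` signature and the COO-style inputs are assumptions for illustration, not AlgoTune's actual task interface.

```python
# Hedged sketch of the described graph_laplacian strategy.
# Assumed interface: edge lists in COO form plus the node count.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian

def solve(rows, cols, vals, n):
    """Build the sparse adjacency once, delegate to SciPy, stay sparse."""
    adj = csr_matrix((vals, (rows, cols)), shape=(n, n))
    return laplacian(adj)  # returns a sparse L = D - A; no densification

# Tiny example: the path graph 0-1-2 (symmetric adjacency)
L = solve([0, 1, 1, 2], [1, 0, 2, 1], [1.0, 1.0, 1.0, 1.0], 3)
```

The point of the strategy is that the Laplacian of a sparse graph stays sparse end to end, which avoids the memory blow-up a dense intermediate would cause on large graphs.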
Anton Shevtsov retweeted
Ibragim (@ibragim_bad)
📟 Meet SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents!

We at Nebius AI R&D are releasing the biggest open dataset of RL environments for training coding agents. We built an automated pipeline to extract real-world tasks at scale, and now we are sharing everything with the community. This release is designed for large-scale RL training.

What’s inside:
> 32,000+ executable tasks — every task is based on a real-world issue and comes with a pre-built Docker env.
> 20 programming languages — moving beyond Python-only datasets (including less-represented ones like Lua, Clojure, etc.).
> 120,000+ extra tasks derived from real pull requests.
> High quality — tasks are filtered and labeled using an LLM ensemble. They are also enriched with metadata and tested interfaces to ensure solvability.

We are also dropping a technical report with all the details on our extraction pipeline and model evaluations.

📄 Paper and dataset: huggingface.co/papers/2602.23…
👾 Discord (we are online there for any feedback/issues): discord.gg/wXYmWpMu

We are open to research collaborations — feel free to reach out!
🔁 If you find this useful, please help us spread the word by sharing
13 replies · 44 reposts · 349 likes · 45K views
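To make the shape of such a dataset concrete, here is a minimal sketch of selecting non-Python tasks for a multilingual training run. The field names (`language`, `docker_image`, `source`) and the toy records are assumptions for illustration only, not the dataset's documented schema.

```python
# Toy records with an ASSUMED schema, purely to illustrate filtering
# a multilingual task set; not the real SWE-rebench-V2 fields.
tasks = [
    {"id": "t1", "language": "Python",  "docker_image": "img:1", "source": "issue"},
    {"id": "t2", "language": "Lua",     "docker_image": "img:2", "source": "issue"},
    {"id": "t3", "language": "Clojure", "docker_image": "img:3", "source": "pr"},
]

def by_language(tasks, wanted):
    """Keep only tasks whose language is in the wanted set."""
    return [t for t in tasks if t["language"] in wanted]

# Select the less-represented languages mentioned in the post
non_python = by_language(tasks, {"Lua", "Clojure"})
```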
Anton Shevtsov retweeted
Alexander Golubev (@agolubev13)
Running MCTS in stateful environments, rolling back faulty actions, branching for exploration, reverting a series of steps once you hit a dead end – all of this kept coming up in our SWE-agent research for the past year. And it required a lot of complex tooling, especially for Docker containers where actions aren't easily revertible.

Today our R&D team is releasing contree.dev – a sandbox we built to solve exactly this. The core idea is that ConTree snapshots the container filesystem state after each command, so all of the above becomes straightforward to implement.

If you're working on SWE-agent research, you might find it especially useful. It goes with 7k+ SWE environments ready to launch, as well as mini-SWE-agent integration.

It's early access right now and we have some keys available. DM me if you're interested, feedback is very much appreciated.
0 replies · 4 reposts · 6 likes · 224 views
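The snapshot-per-command idea above can be modeled in a few lines: each executed command produces a child snapshot, so rollback and branching become tree moves. This is a toy in-memory model, not the ConTree API; the class and method names are invented for illustration.

```python
# Toy model of snapshot-tree sandboxing: every "command" freezes the
# resulting state as a child node, making rollback/branching trivial.
class Snapshot:
    def __init__(self, state, parent=None):
        self.state = dict(state)  # frozen filesystem-like state
        self.parent = parent

class SnapshotTree:
    def __init__(self):
        self.head = Snapshot({})

    def run(self, path, content):
        """'Execute a command', then snapshot the resulting state."""
        state = dict(self.head.state)
        state[path] = content
        self.head = Snapshot(state, parent=self.head)
        return self.head

    def rollback(self):
        """Revert the last action by moving back to the parent snapshot."""
        if self.head.parent is not None:
            self.head = self.head.parent
        return self.head

    def branch_from(self, snapshot):
        """Resume exploration from any earlier snapshot (MCTS-style)."""
        self.head = snapshot
        return self.head

tree = SnapshotTree()
good = tree.run("a.txt", "v1")   # keep this state around
tree.run("b.txt", "oops")        # faulty action...
tree.rollback()                  # ...reverted instantly
tree.branch_from(good)           # branch for exploration
tree.run("c.txt", "v2")
```

The hard part the post alludes to is doing this for real container filesystems, where "copy the state" is expensive unless snapshots are taken at the filesystem layer.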
Anton Shevtsov (@Shevan05)
Good point — plotting error rate (1 − resolved) can make relative gains more visible near saturation. At the moment we’re still far from that regime (top resolved ~53%, pass@5 ~70%), so resolved rate remains intuitive. If we start approaching saturation, we can adjust difficulty via task selection — e.g., include harder issues or broaden to multilingual tasks — since SWE-rebench is refreshed monthly from newly created GitHub issue+PR pairs.
0 replies · 0 reposts · 1 like · 130 views
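The saturation point in this exchange comes down to simple arithmetic: near the ceiling, the same absolute improvement looks tiny on the resolved axis but large on the unresolved axis. A worked example:

```python
# Why 1 - resolved makes near-saturation progress visible:
# going 98% -> 99% resolved is a ~1% relative gain in resolved rate,
# but it halves the number of unresolved tasks.
def relative_gain(old, new):
    """Relative improvement measured on the resolved rate."""
    return (new - old) / old

def relative_error_drop(old, new):
    """Relative reduction of the unresolved rate (1 - resolved)."""
    return ((1 - old) - (1 - new)) / (1 - old)

gain = relative_gain(0.98, 0.99)        # ~0.0102: barely visible
drop = relative_error_drop(0.98, 0.99)  # 0.5: half the failures gone
```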
Santiago Afonso (@SantiagoAfonso)
@Shevan05 @ibragim_bad @agolubev13 Great bench. Have you considered using unresolved rate for the y-axis? When nearing saturation, resolved rate makes progress appear much slower (e.g. going from 98% to 99%).
2 replies · 0 reposts · 2 likes · 195 views
Anton Shevtsov (@Shevan05)
We’ve updated SWE-rebench (January set). Key pattern: there’s a clear ~1M token wall.

SWE-rebench is a live benchmark: each month we add fresh real-world SWE tasks (GitHub issue + PR pairs) and evaluate models in a coding-agent setup. In this setup, models iteratively read files, write patches, run tests, observe failures, and refine solutions. Token counts therefore reflect full agent trajectories — not single-shot completions.

1. A clear top cluster
Claude Code, Claude Opus 4.6, and gpt-5.2-xhigh lead the leaderboard while operating in the ~1–2M tokens per problem regime. Frontier-level results are associated with both strong model capability and long execution traces.

2. Marginal gains beyond ~1M tokens
Beyond ~1M tokens/problem, additional tokens yield only marginal pass@1 gains. Token budget becomes a dominant scaling axis. If a deployment cannot afford ~1M+ tokens per task, it is unlikely to reach the top accuracy cluster.

3. Efficiency matters
gpt-5.2-codex is a notable exception. It operates below ~1M tokens/problem yet achieves strong performance relative to the frontier group. Raw token volume alone does not determine outcomes. Trace efficiency — how effectively an agent uses its budget — is a critical factor.

Takeaway
SWE-rebench positioning is shaped by two interacting axes:
- Model capability
- Token budget and utilization efficiency
Top-cluster systems combine both. Efficient systems demonstrate that careful trace usage can narrow much of the gap without matching the highest token budgets.

swe-rebench.com
3 replies · 3 reposts · 28 likes · 8.1K views
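The "token wall" claim above is a statement about marginal gains, which is easy to make precise: pass@1 gained per extra million tokens between adjacent budget levels. The numbers below are hypothetical, chosen only to illustrate the shape (steep below ~1M, flat beyond); they are not leaderboard values.

```python
# HYPOTHETICAL (tokens_m, pass@1) points illustrating a ~1M token wall;
# not real SWE-rebench results.
runs = [
    {"tokens_m": 0.5, "pass_at_1": 0.40},
    {"tokens_m": 1.0, "pass_at_1": 0.50},
    {"tokens_m": 2.0, "pass_at_1": 0.52},
]

def marginal_gains(runs):
    """Pass@1 gained per additional million tokens between runs."""
    return [
        (cur["pass_at_1"] - prev["pass_at_1"]) / (cur["tokens_m"] - prev["tokens_m"])
        for prev, cur in zip(runs, runs[1:])
    ]

gains = marginal_gains(runs)  # steep before the wall, shallow after
```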
Anton Shevtsov (@Shevan05)
Hi, we run all models with consistent methodology, infrastructure, and sampling across releases. SWE-rebench evaluates on fresh monthly tasks, and with a limited task set some month-to-month variance is expected depending on issue distribution. Over longer time windows, rankings tend to be more stable.
0 replies · 0 reposts · 1 like · 112 views
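The variance point above has a standard back-of-the-envelope form: a resolved rate estimated on n tasks carries binomial sampling error sqrt(p(1-p)/n). The 50-task monthly set below is a hypothetical size for illustration, not the benchmark's actual count.

```python
# Sampling noise of a resolved rate on a limited monthly task set.
import math

def resolved_rate_stderr(p, n):
    """Binomial standard error of a resolved rate p estimated on n tasks."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical: ~50% resolved on a 50-task month gives roughly
# +/-7 percentage points of one-sigma noise, so small rank swaps
# between close models are expected month to month.
se = resolved_rate_stderr(0.5, 50)
```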
Anton Shevtsov (@Shevan05)
Hi, gpt-5.3-codex is not yet broadly available via API, and current access limits make large-scale, reproducible evaluation difficult. Token pricing has also not been finalized publicly, which prevents consistent cost comparisons. For this release, we evaluated Codex using gpt-5.2-codex, which is fully available and benchmarkable. We’ll add 5.3-codex once API access and pricing are finalized.
1 reply · 0 reposts · 4 likes · 212 views
Anton Shevtsov retweeted
Nebius (@nebiusai)
Our own SWE-rebench just became the #1 most downloaded dataset on @HuggingFace 🥇 SWE-rebench is a dataset and benchmark for code agents based on LLMs, developed by our AI R&D team. It has been downloaded more than 3.9M times — 3.1M in the last month. 1/4
2 replies · 11 reposts · 85 likes · 8.9K views
Anton Shevtsov retweeted
hr0nix (@hr0nix)
An extended writeup of our earlier research blogpost on training critics for SWE agents has been accepted to ICML! Some details below ⬇️
1 reply · 5 reposts · 14 likes · 5K views
Anton Shevtsov retweeted
hr0nix (@hr0nix)
Spirit of open-source is in the air thanks to DeepSeek! And today we are happy to release kvax, our implementation of flash attention for jax! It is very fast and has some advanced features such as context parallelism support that might not be easy to come by. Details ⬇️
4 replies · 19 reposts · 64 likes · 59.9K views
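For orientation on what a flash-attention kernel like kvax computes: the function is ordinary softmax attention; the kernel's contribution is computing it in tiles so the full score matrix never materializes. The NumPy reference below is that function in its naive form, not kvax's API or implementation.

```python
# Naive reference attention: softmax(q k^T / sqrt(d)) v.
# A flash-attention kernel produces the same output, but streamed
# over key/value tiles with an online softmax instead of this full
# (seq x seq) score matrix.
import numpy as np

def attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(q, k, v)
```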
Anton Shevtsov retweeted
Nebius (@nebiusai)
We release the world’s first datasets for training software engineering agents🔥 More specifically, our AI R&D team uploaded two datasets to @HuggingFace: one with 6,411 Issue-PR pairs, and the other with 80,036 agent trajectories. Learn more on our blog: eu1.hubs.ly/H0fwMB00
5 replies · 14 reposts · 46 likes · 7.2K views
Anton Shevtsov retweeted
hr0nix (@hr0nix)
As a follow up to our work on applying search to software engineering agents, today we are releasing datasets of problem instances and agent trajectories. This is the training data we previously used to achieve 40.6% on SWE-bench Verified using open-weight models only! 🧵⬇️
2 replies · 19 reposts · 37 likes · 3.6K views