Justus Mattern

1.1K posts

@MatternJustus

Co-Founder @ProximalHQ | prev. research @PrimeIntellect, @MPI_IS and built revideo

San Francisco, CA · Joined March 2021
782 Following · 7.7K Followers
Pinned Tweet
Justus Mattern@MatternJustus·
Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed
76 replies · 139 reposts · 1.3K likes · 221.7K views
Justus Mattern@MatternJustus·
It is interesting to me how models like Opus 4.6, Kimi K2.6, and Gemini 3.1 perform basically the same on SWE-Bench Pro but show a massive gap on FrontierSWE. Long-horizon tasks require a different set of skills - good to have a benchmark that measures those!
Proximal@ProximalHQ

DeepSeek V4 Pro is the best open source model on FrontierSWE, closely followed by Kimi K2.6. V4 exhibits noticeably fewer reward hacking attempts than most other models. In the best@5 ranking it performs as well as Gemini 3.1 Pro

6 replies · 3 reposts · 61 likes · 4.2K views
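For readers unfamiliar with the metrics above: best@5 and mean@5 presumably aggregate five independent runs per task, either taking the best run or averaging the per-run scores before averaging across tasks. A minimal, hypothetical Python sketch of that aggregation with made-up task names and scores, not Proximal's actual evaluation code:

    from statistics import mean

    def mean_at_k(scores_per_task: dict[str, list[float]], k: int = 5) -> float:
        # Average over tasks of the mean score across k independent runs per task.
        return mean(mean(runs[:k]) for runs in scores_per_task.values())

    def best_at_k(scores_per_task: dict[str, list[float]], k: int = 5) -> float:
        # Average over tasks of the best score across k independent runs per task.
        return mean(max(runs[:k]) for runs in scores_per_task.values())

    # Hypothetical scores in [0, 1] for two tasks, five runs each.
    runs = {
        "video_render_opt": [0.10, 0.35, 0.00, 0.22, 0.18],
        "quantum_property_model": [0.05, 0.00, 0.41, 0.12, 0.07],
    }
    print(mean_at_k(runs), best_at_k(runs))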
Eiso Kant@eisokant·
Today we’re shipping Laguna M.1 and Laguna XS.2 – our first public models. We’re also shipping our agent harness and a preview product experience. Both models were trained from scratch on our own stack: data pipelines, training infrastructure, and agent RL.
37 replies · 69 reposts · 508 likes · 78.2K views
sankalp@dejavucoder·
post-train-bench is pretty insane if you think about it: "agents must build their entire training pipeline from scratch..."
7 replies · 8 reposts · 76 likes · 7.8K views
swappy@swaapppyyy·
Wanted to post this yesterday but I was too tired: my team and I managed to adapt 4 tasks from the @ProximalHQ FrontierSWE benchmark as OpenEnv-compatible environments and run them on HF Spaces as part of our hackathon submission. Check out the repo at github.com/3xcaffeine/fro…
5 replies · 3 reposts · 13 likes · 1.5K views
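To give a rough idea of what adapting a task as an environment can look like: below is a generic, hypothetical sketch of a long-horizon coding task wrapped in a reset/step interface, where each step executes one agent-issued shell command and the task's test suite supplies the reward. This is an illustration only, not the actual OpenEnv interface or the FrontierSWE task code, and all names are made up.

    import subprocess

    class CodingTaskEnv:
        # Hypothetical wrapper: one long-horizon coding task exposed as a
        # reset/step environment (illustration only, not the OpenEnv API).
        def __init__(self, repo_dir: str, test_cmd: list[str], max_steps: int = 500):
            self.repo_dir = repo_dir      # working copy the agent edits
            self.test_cmd = test_cmd      # command that scores the current repo state
            self.max_steps = max_steps
            self.steps = 0

        def reset(self) -> dict:
            self.steps = 0
            return {"instructions": "Optimize the library; the test suite defines the reward."}

        def step(self, shell_command: str) -> tuple[dict, float, bool]:
            # Run one agent-issued command in the repo, then re-score it.
            self.steps += 1
            result = subprocess.run(
                shell_command, shell=True, cwd=self.repo_dir,
                capture_output=True, text=True, timeout=600,
            )
            passed = subprocess.run(self.test_cmd, cwd=self.repo_dir).returncode == 0
            done = self.steps >= self.max_steps
            obs = {"stdout": result.stdout, "stderr": result.stderr}
            return obs, float(passed), done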
Justus Mattern@MatternJustus·
@dejavucoder Oh that one’s different and not from posttrainbench! Here the models have to use tinker and can’t clone existing code
1 reply · 0 reposts · 2 likes · 146 views
Justus Mattern@MatternJustus·
@dejavucoder Git clone, yes. I wouldn’t call it reward hacking though, as the agents are not forbidden to do it. The emphasis of the benchmark, iiuc, is to test agents’ research intuition rather than the engineering skill required to build a post-training codebase from scratch
1 reply · 0 reposts · 6 likes · 149 views
sankalp@dejavucoder·
@MatternJustus you mean git clone? are you implying reward hacking here?
1 reply · 0 reposts · 0 likes · 205 views
Justus Mattern@MatternJustus·
Also here to chat with folks interested in joining us! We are experiencing incredible growth and work on frontier research in post-training, evals, and data for coding agents. If you want to have massive ownership and share your work in public, let's chat!
Justus Mattern@MatternJustus

About to arrive at ICLR 🇧🇷 If you are interested in post-training for coding agents, synthetic data and evals like FrontierSWE, I would love to chat! I will be here from today until the 27th

8 replies · 3 reposts · 85 likes · 6.6K views
Justus Mattern@MatternJustus·
About to arrive at ICLR 🇧🇷 If you are interested in post-training for coding agents, synthetic data and evals like FrontierSWE, I would love to chat! I will be here from today until the 27th
Justus Mattern@MatternJustus

Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed

4 replies · 7 reposts · 62 likes · 11.3K views
Thoughtful@thoughtfullab·
Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As part of the FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found that the real bottleneck is research intuition.
10 replies · 48 reposts · 408 likes · 128K views
Justus Mattern retweeted
Evan Chu@evan_j_chu·
Opus 4.7 takes the lead on both mean@5 and best@5 on FrontierSWE!
Proximal@ProximalHQ

Opus 4.7 is #1 on FrontierSWE! We found that it commits to decisions much earlier in its trace and executes, spending ~2x fewer tokens/less time than Opus 4.6 across all tasks

3 replies · 3 reposts · 21 likes · 2.9K views
Justus Mattern retweeted
rajan agarwal@_rajanagarwal·
Opus 4.7 is the most capable model on our long-horizon tasks! We notice it is much more efficient than 4.6, with similar reward hacking tendencies & rationalization
Proximal@ProximalHQ

Opus 4.7 is #1 on FrontierSWE! We found that it commits to decisions much earlier in its trace and executes, spending ~2x fewer tokens/less time than Opus 4.6 across all tasks

4 replies · 5 reposts · 78 likes · 7.9K views