Justus Mattern

1.1K posts

@MatternJustus

Co-Founder @ProximalHQ | prev. research @PrimeIntellect, @MPI_IS and built revideo

San Francisco, CA · Joined March 2021
782 Following · 7.7K Followers
Pinned Tweet
Justus Mattern@MatternJustus·
Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed
76 replies · 139 reposts · 1.3K likes · 221.7K views
Justus Mattern@MatternJustus·
It is interesting to me how models like Opus 4.6, Kimi K2.6, and Gemini 3.1 perform basically the same on SWE-Bench Pro but show a massive gap on FrontierSWE. Long-horizon tasks require a different set of skills - good to have a benchmark that measures those!
Proximal@ProximalHQ

DeepSeek V4 Pro is the best open source model on FrontierSWE, closely followed by Kimi K2.6. V4 exhibits noticeably fewer reward hacking attempts than most other models. In the best@5 ranking it performs as well as Gemini 3.1 Pro

6 replies · 3 reposts · 61 likes · 4.2K views
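For readers unfamiliar with the metrics above: best@5 and mean@5 presumably aggregate five independent runs per task, either taking the best run or averaging the per-run scores before averaging across tasks. A minimal, hypothetical Python sketch of that aggregation with made-up task names and scores, not Proximal's actual evaluation code:

    from statistics import mean

    def mean_at_k(scores_per_task: dict[str, list[float]], k: int = 5) -> float:
        # Average over tasks of the mean score across k independent runs per task.
        return mean(mean(runs[:k]) for runs in scores_per_task.values())

    def best_at_k(scores_per_task: dict[str, list[float]], k: int = 5) -> float:
        # Average over tasks of the best score across k independent runs per task.
        return mean(max(runs[:k]) for runs in scores_per_task.values())

    # Hypothetical scores in [0, 1] for two tasks, five runs each.
    runs = {
        "video_render_opt": [0.10, 0.35, 0.00, 0.22, 0.18],
        "quantum_property_model": [0.05, 0.00, 0.41, 0.12, 0.07],
    }
    print(mean_at_k(runs), best_at_k(runs))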
Eiso Kant@eisokant·
Today we’re shipping Laguna M.1 and Laguna XS.2 – our first public models. We’re also shipping our agent harness and a preview product experience. Both models were trained from scratch on our own stack: data pipelines, training infrastructure, and agent RL.
37 replies · 69 reposts · 508 likes · 78.2K views
sankalp@dejavucoder·
post-train-bench is pretty insane if you think about it: "agents must build their entire training pipeline from scratch..."
7 replies · 8 reposts · 76 likes · 7.8K views
swappy@swaapppyyy·
Wanted to post this yesterday but I was too tired: my team and I managed to adapt 4 tasks from the @ProximalHQ FrontierSWE benchmark as OpenEnv-compatible environments and run them on HF Spaces as part of our hackathon submission. Check out the repo at github.com/3xcaffeine/fro…
5 replies · 3 reposts · 13 likes · 1.5K views
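To give a rough idea of what adapting a task as an environment can look like: below is a generic, hypothetical sketch of a long-horizon coding task wrapped in a reset/step interface, where each step executes one agent-issued shell command and the task's test suite supplies the reward. This is an illustration only, not the actual OpenEnv interface or the FrontierSWE task code, and all names are made up.

    import subprocess

    class CodingTaskEnv:
        # Hypothetical wrapper: one long-horizon coding task exposed as a
        # reset/step environment (illustration only, not the OpenEnv API).
        def __init__(self, repo_dir: str, test_cmd: list[str], max_steps: int = 500):
            self.repo_dir = repo_dir      # working copy the agent edits
            self.test_cmd = test_cmd      # command that scores the current repo state
            self.max_steps = max_steps
            self.steps = 0

        def reset(self) -> dict:
            self.steps = 0
            return {"instructions": "Optimize the library; the test suite defines the reward."}

        def step(self, shell_command: str) -> tuple[dict, float, bool]:
            # Run one agent-issued command in the repo, then re-score it.
            self.steps += 1
            result = subprocess.run(
                shell_command, shell=True, cwd=self.repo_dir,
                capture_output=True, text=True, timeout=600,
            )
            passed = subprocess.run(self.test_cmd, cwd=self.repo_dir).returncode == 0
            done = self.steps >= self.max_steps
            obs = {"stdout": result.stdout, "stderr": result.stderr}
            return obs, float(passed), done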
Justus Mattern@MatternJustus·
@dejavucoder Oh that one’s different and not from posttrainbench! Here the models have to use tinker and can’t clone existing code
1 reply · 0 reposts · 2 likes · 146 views
Justus Mattern@MatternJustus·
@dejavucoder Git clone, yes. I wouldn’t call it reward hacking though, as the agents are not forbidden to do it. The emphasis of the benchmark, iiuc, is to test agents’ research intuition rather than the engineering skill required to build a post-training codebase from scratch
1 reply · 0 reposts · 6 likes · 149 views
sankalp@dejavucoder·
@MatternJustus you mean git clone? are you implying reward hacking here?
1 reply · 0 reposts · 0 likes · 205 views
Justus Mattern@MatternJustus·
Also here to chat with folks interested in joining us! We are experiencing incredible growth and work on frontier research in post-training, evals, and data for coding agents. If you want to have massive ownership and share your work in public, let's chat!
Justus Mattern@MatternJustus

About to arrive at ICLR 🇧🇷 If you are interested in post-training for coding agents, synthetic data and evals like FrontierSWE, I would love to chat! I will be here from today until the 27th

8 replies · 3 reposts · 85 likes · 6.6K views
Justus Mattern@MatternJustus·
About to arrive at ICLR 🇧🇷 If you are interested in post-training for coding agents, synthetic data and evals like FrontierSWE, I would love to chat! I will be here from today until the 27th
Justus Mattern@MatternJustus

Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed

4 replies · 7 reposts · 62 likes · 11.3K views
Thoughtful@thoughtfullab·
Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As part of the FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found that the real bottleneck is research intuition.
10 replies · 48 reposts · 408 likes · 128K views
Justus Mattern retweeted
Evan Chu@evan_j_chu·
Opus 4.7 takes the lead on both mean@5 and best@5 on FrontierSWE!
Proximal@ProximalHQ

Opus 4.7 is #1 on FrontierSWE! We found that it commits to decisions much earlier in its trace and executes, spending ~2x fewer tokens/less time than Opus 4.6 across all tasks

3 replies · 3 reposts · 21 likes · 2.9K views
Justus Mattern retweeted
rajan agarwal@_rajanagarwal·
Opus 4.7 is the most capable model on our long-horizon tasks! We notice it is much more efficient than 4.6, with similar reward hacking tendencies & rationalization
Proximal@ProximalHQ

Opus 4.7 is #1 on FrontierSWE! We found that it commits to decisions much earlier in its trace and executes, spending ~2x fewer tokens/less time than Opus 4.6 across all tasks

4 replies · 5 reposts · 78 likes · 7.9K views