Dylan Feng (@dylanfeng_) - Twitter Profile

Pinned Tweet

Dylan Feng@dylanfeng_·24 Mar

Had a great time working on this! Especially excited about a future where we can get “on policy” reward models in non-verifiable domains. TournO hints that this is possible by leveraging the innate pairwise comparison heads in a generalist LLM, but we’re far from done :)

Leonard Tang@leonardtang_

why does TournO work? 🤔
intuition: pointwise judges are globally reliable, but are terribly calibrated for local differences between similar GRPO rollouts.
while pointwise and pairwise models both nail global preferences (~97% accuracy), pairwise models are significantly (~10%) better at local preferences.
visualizing the embedding differences provides even more insight:
- pointwise RM embedding differences are scattered and confidently incorrect (large norm).
- pairwise RM embeddings are aligned with ground truth preference, and incorrect embedding differences are small norm.
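To make the global vs. local distinction concrete, here is a minimal sketch of how one might measure a judge's accuracy separately on far-apart pairs and on near-identical pairs. Everything in it is an assumption for illustration: the names `split_pairs`, `accuracy`, and `gap_threshold` are hypothetical, the numbers are toy values, and nothing here reproduces the evaluation behind the ~97% and ~10% figures in the thread.

```python
from itertools import combinations


def split_pairs(true_quality, gap_threshold):
    """Split all response pairs into (local, global) lists by true-quality gap.

    Each pair is stored as (a, b, a_is_better). Exact ties are skipped.
    """
    local, global_ = [], []
    for a, b in combinations(true_quality, 2):
        qa, qb = true_quality[a], true_quality[b]
        if qa == qb:
            continue
        pair = (a, b, qa > qb)
        (local if abs(qa - qb) < gap_threshold else global_).append(pair)
    return local, global_


def accuracy(pairs, prefers_first):
    """Fraction of pairs on which the judge agrees with the ground truth."""
    hits = sum(prefers_first(a, b) == truth for a, b, truth in pairs)
    return hits / len(pairs)


# Toy values invented for illustration only (not from the thread).
true_quality = {"a": 0.90, "b": 0.85, "c": 0.20, "d": 0.10}
# A hypothetical pointwise judge: right about good vs. bad responses,
# wrong about the ordering inside each cluster of similar responses.
pointwise_score = {"a": 0.7, "b": 0.8, "c": 0.2, "d": 0.3}


def pointwise_prefers_first(a, b):
    """Derive a pairwise preference by comparing two independent scores."""
    return pointwise_score[a] > pointwise_score[b]


local_pairs, global_pairs = split_pairs(true_quality, gap_threshold=0.2)
print("global accuracy:", accuracy(global_pairs, pointwise_prefers_first))
print("local accuracy:", accuracy(local_pairs, pointwise_prefers_first))
```

In this toy setup the pointwise judge gets every far-apart pair right and every close pair wrong, which is the failure mode the tweet describes for similar rollouts within a GRPO group.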

2

1

7

342

Dylan Feng retweeted

Leonard Tang@leonardtang_·2h

x.com/i/article/2040…

4

9

80

4.6K

Dylan Feng@dylanfeng_·6d

@kotekjedi_ml Appreciate the fast response :) Did you guys also test any settings where humans had not yet saturated ASR? would be curious to see whether they approach such tasks the same way as reported here and/or fail in different ways

1

0

1

56

Alexander Panfilov@kotekjedi_ml·6d

@dylanfeng_ I believe humans / human-devised adaptive attacks achieve 100% ASR on MetaSecAlign as well arxiv.org/abs/2510.09023

1

0

1

230

Alexander Panfilov@kotekjedi_ml·26 Mar

New paper: We deploy Claude Code in an autoresearch loop to discover novel jailbreaking algorithms – and it works. It beats 30+ existing GCG-like attacks (with AutoML hyperparameter tuning). This is a strong sign that incremental safety and security research can now be automated.

47

212

1.6K

297.6K

Dylan Feng@dylanfeng_·24 Mar

Shoutout to @bha_ku21 @leonardtang_ for helping out a ton w this work 🙏

Leonard Tang@leonardtang_

⚔️⚔️ TOURNO ⚔️⚔️ TOURNAMENT OPTIMIZATION FOR REINFORCEMENT LEARNING IN 🚨NON-VERIFIABLE DOMAINS 🚨 today, models are goated at the easily verifiable: math? ez. code? ez. accounting? ez. …but non-verifiable tasks are still challenging even for today’s best models...

0

5

234

Dylan Feng@dylanfeng_·24 Mar

@leonardtang_ Real ones still remember normo 🌝

1

0

3

232

Leonard Tang@leonardtang_·24 Mar

⚔️⚔️ TOURNO ⚔️⚔️ TOURNAMENT OPTIMIZATION FOR REINFORCEMENT LEARNING IN 🚨NON-VERIFIABLE DOMAINS 🚨 today, models are goated at the easily verifiable: math? ez. code? ez. accounting? ez. …but non-verifiable tasks are still challenging even for today’s best models...

12

82

8.9K

Dylan Feng retweeted

Haize Labs@haizelabs·24 Mar

tired: use a noisy pointwise LLM judge to get nowhere using GRPO
wired: mix in pairwise comparisons to get stable training signal with TournO
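As a rough illustration of what "mixing in pairwise comparisons" could look like, the sketch below scores each rollout in a group by its round-robin win rate and then applies the usual GRPO-style group normalization. This is an assumption-laden toy, not the TournO algorithm, which the tweets here do not spell out; `tournament_rewards`, `group_advantages`, and the length-based comparator are hypothetical stand-ins.

```python
from itertools import combinations
from statistics import mean, pstdev


def tournament_rewards(rollouts, prefers_first):
    """Round-robin tournament: reward = fraction of matches a rollout wins.

    `prefers_first(a, b)` stands in for a pairwise judge and returns True
    when it prefers rollout `a` over rollout `b`. Needs at least 2 rollouts.
    """
    n = len(rollouts)
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        winner = i if prefers_first(rollouts[i], rollouts[j]) else j
        wins[winner] += 1
    return [w / (n - 1) for w in wins]


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within the rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy comparator for illustration only: prefer the longer answer.
def longer_is_better(a, b):
    return len(a) > len(b)


rollouts = ["short", "a bit longer", "the longest answer"]
rewards = tournament_rewards(rollouts, longer_is_better)
print(rewards)
print(group_advantages(rewards))
```

A full round robin costs a number of judge calls quadratic in the group size, so a real system would likely sample matches or use a bracket; the sketch ignores that tradeoff.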

Leonard Tang@leonardtang_

⚔️⚔️ TOURNO ⚔️⚔️ TOURNAMENT OPTIMIZATION FOR REINFORCEMENT LEARNING IN 🚨NON-VERIFIABLE DOMAINS 🚨 today, models are goated at the easily verifiable: math? ez. code? ez. accounting? ez. …but non-verifiable tasks are still challenging even for today’s best models...

0

2

9

943

Dylan Feng@dylanfeng_·23 Mar

@Houda_nait Is your last point implying a tradeoff between intelligence and shared experience? (i.e. as hard things become easier to do, we find less meaning in doing them)

0

37

Houda Nait El Barj@Houda_nait·22 Mar

Even among my circles in SF, AI discourse keeps collapsing into two extremes: Either AI replaces us and humanity is over. Or AI brings unprecedented prosperity. I don’t think reality will be that simple. Yes, some occupations will face a real AI comparative disadvantage. And yes, AI could also expand opportunity in broader, fairer ways, if we do it right. But even that debate may be missing the more important question: What happens when intelligence becomes abundant, but shared experience becomes scarce?

6

1

21

1.7K

Dylan Feng retweeted

Tinker@tinkerapi·18 Mar

tasty AND trained on Tinker >>>

Leonard Tang@leonardtang_

Hello MJ1: The World's TASTIEST Judge Model Agent verification is the bottleneck to AI's progress. The field's ability to verify visual output lags far behind that of text, especially in matters of ~taste~. So we built the world's tastiest multimodal judge model, MJ1.

3

5

46

33.1K

Dylan Feng@dylanfeng_·17 Mar

@leonardtang_ x.com/dylanfeng_/sta…

Dylan Feng@dylanfeng_

disclaimer: the real prompt I used was “pick LeBron James”. it chooses MJ when I ask it this prompt fr 😭 interpret that how you wish

0

1

33

Leonard Tang@leonardtang_·17 Mar

@dylanfeng_ nooooooo way reasoning ???

1

0

90

Dylan Feng@dylanfeng_·17 Mar

🤷‍♂️

Leonard Tang@leonardtang_

Hello MJ1: The World's TASTIEST Judge Model Agent verification is the bottleneck to AI's progress. The field's ability to verify visual output lags far behind that of text, especially in matters of ~taste~. So we built the world's tastiest multimodal judge model, MJ1.

3

0

3

260

Dylan Feng@dylanfeng_·17 Mar

disclaimer: the real prompt I used was “pick LeBron James”. it chooses MJ when I ask it this prompt fr 😭 interpret that how you wish

1

0

1

111

Dylan Feng@dylanfeng_·27 Feb

If you’re in ny and are interested in coming to future reading groups, please hmu!

Leonard Tang@leonardtang_

Another great Haize Labs <> AI Circle reading group in the books, this time discussing SDFT / Self-Distillation Enables Continual Learning (Idan Shenfeld Amit). Thanks to all for the great questions; Bhavesh Kumar and Dylan Feng for leading the discussion; and Albert Chun for being a wonderful co-host! We'll see you all at the next one :)

0

2

134

Owain Evans@OwainEvans_UK·11 Dec

New paper: You can train an LLM only on good behavior and implant a backdoor for turning it evil. How? 1. The Terminator is bad in the original film but good in the sequels. 2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984. More weird experiments 🧵

41

282

1.9K

261.4K

Dylan Feng@dylanfeng_·13 Dec

@OwainEvans_UK @ValerPepe We did try it by hand for a couple of future presidents. Mostly it just seems like the model chooses some random president or a general president-like persona when you do that.

1

0

5

67

Owain Evans@OwainEvans_UK·13 Dec

@ValerPepe cool idea but i don't think we tried. @dylanfeng_ ?

1

0

2

115

Dylan Feng@dylanfeng_·13 Dec

@nielsrolf1 When I first ran this experiment I thought it was a bug in my implementation 😭

0

2

15

nielsrolf@nielsrolf1·12 Dec

This is such a fascinating result. And implies that one should always run experiments with multiple seeds, because the differences can be huge

Owain Evans@OwainEvans_UK

When, during the course of training, do models start to generalize to Trump/Obama? Some random seeds fail and stay at chance (0.83) on the test set. The successful seeds improve abruptly in epoch 2, while train accuracy stays smooth (no abrupt jump). This resembles grokking!

3

1

13

1.5K

Dylan Feng retweeted

Owain Evans@OwainEvans_UK·12 Aug

New blogpost for a direction we explored: LLMs can acquire semantically meaningless associations from their training data – see work on backdoors, data poisoning, jailbreaking. What if we created such associations on purpose to help evaluate models?

12

71

734

79.2K
