Simo Ryu

5.2K posts

@cloneofsimo

I like cats, math and codes [email protected]

Seoul, Republic of Korea · Joined May 2022
889 Following · 17.1K Followers

Pinned Tweet
Simo Ryu @cloneofsimo
10B-parameter DiT trained on 80M images, all owned by @freepik. Model commercially usable, raw model without distillation, open sourced. Proud to demonstrate our first model-training project with our client @freepik: "F-Lite", from @FAL
[image]

Iván de Prado @ivanprado
🚀 Excited to announce F Lite: a new open-source text-to-image model by @freepik and @FAL! The first at this scale that’s both open-source and trained exclusively on licensed, high-quality data. 🧵

21 replies · 66 reposts · 519 likes · 157.4K views
Simo Ryu @cloneofsimo
I may have tweeted this a gazillion times, but at any point in time there are so many arch tweaks in the air that I have to occasionally reupload the same meme to remind you
[image]

8 replies · 6 reposts · 162 likes · 7.7K views
Simo Ryu @cloneofsimo
@YouJiacheng @AlberFuen I thought ML researchers were allowed to write shitty code precisely because they were supposed to read matrix notations fluently 💀💀💀

0 replies · 0 reposts · 10 likes · 847 views
You Jiacheng @YouJiacheng
One drawback of DCA is its non-intuitive notation (⊙ and vec(1)) and lack of analogy (it didn't make the analogy to rotated attention). DCA's GRN-v3 is a generalization of AttnRes (note: a generalization is not always better): let b_t = 0 and σ = exp, and we get AttnRes.
[image]

Casper Hansen @casper_hansen_
@behrouz_ali same method in which way? seems different in many aspects

3 replies · 10 reposts · 85 likes · 10.1K views
Simo Ryu @cloneofsimo
lmao
> google cooks paper, "meh its probably not gonna work, pass"
> chinese lab cooks exact same thing one year later, everyone gets super hyped
EVERY SINGLE TIME
[image]

Ali Behrouz @behrouz_ali
This paper is the same as the DeepCrossAttention (DCA) method from more than a year ago: arxiv.org/abs/2502.06785. As far as I understand, there is no innovation here to be excited about, and yet surprisingly there is no citation of or discussion about DCA! The level of redundancy in LLM research, and then the hype on X, is getting worse and worse! DeepCrossAttention is built on the intuition that depth-wise cross-attention allows richer interactions between layers at different depths. DCA further provides both empirical and theoretical results to support this approach.

21 replies · 27 reposts · 499 likes · 52.1K views
Simo Ryu @cloneofsimo
w̄_t = act(w_t^T G_t)
G_t: stack of residual function outputs
w_t: trainable vector per layer
act(w_t^T G_t) = softmax(Q^T K), where Q: w_i, K = f(h_i) in kimi's notation (!)
g_t(x) = G_t (b_t + w̄_t) = (softmax(Q^T K) + bias) V
→ therefore, same thing as AttnRes, but with bias = 0. So the only difference seems to be relu vs softmax 😅

0 replies · 0 reposts · 5 likes · 2.6K views
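A toy sketch of the comparison made in this sub-thread (this is illustrative code, not from either paper; the shapes, names, and random inputs are all assumptions): both rules mix the stack of earlier layer outputs G with weights produced from a trainable vector w, and per the tweet they differ mainly in the activation (softmax for Kimi's AttnRes, relu plus a bias for DCA's GRN-v3).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def aggregate(G, w, b, act):
    """Depth-wise aggregation over stacked layer outputs.

    G: (L, d) stack of residual-branch outputs from layers 0..L-1
    w: (d,)   trainable mixing vector (the tweet's w_t)
    b: (L,)   trainable per-layer bias (the tweet's b_t)
    """
    wbar = act(G @ w)        # (L,) one mixing weight per earlier layer
    return G.T @ (b + wbar)  # (d,) weighted sum of layer outputs

rng = np.random.default_rng(0)
L, d = 4, 8
G = rng.normal(size=(L, d))
w = rng.normal(size=d)

# Kimi-style AttnRes as described in the tweet: softmax scores, bias = 0
attn_res = aggregate(G, w, np.zeros(L), softmax)
# DCA GRN-v3-style as described in the tweet: relu scores plus a bias
grn_v3 = aggregate(G, w, rng.normal(size=L), relu)
```

With bias fixed to zero the softmax variant is literally attention over layers (G plays both K and V), which is the sense in which the thread calls one a special case of the other.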
Konstantin Mishchenko @konstmish
@behrouz_ali I was actually talking about the third variant. The input-dependent parameters are exactly the Q, K, V that I mentioned. In your screenshot, g_t is computed using the learnable matrix b_t and weights w_t, but notice that it doesn't use cross products of entries of G_t, while Kimi does.

2 replies · 1 repost · 8 likes · 1.7K views
Ali Behrouz @behrouz_ali
This paper is the same as the DeepCrossAttention (DCA) method from more than a year ago: arxiv.org/abs/2502.06785. As far as I understand, there is no innovation here to be excited about, and yet surprisingly there is no citation of or discussion about DCA! The level of redundancy in LLM research, and then the hype on X, is getting worse and worse! DeepCrossAttention is built on the intuition that depth-wise cross-attention allows richer interactions between layers at different depths. DCA further provides both empirical and theoretical results to support this approach.

Kimi.ai @Kimi_Moonshot
Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
🔗 Full report: github.com/MoonshotAI/Att…

33 replies · 88 reposts · 1K likes · 218.8K views
Simo Ryu @cloneofsimo
@konstmish Actually im retarded, w_t is a learnable vector in kimi's case as well. So it seems like the only difference is relu vs softmax

2 replies · 0 reposts · 5 likes · 356 views
Simo Ryu @cloneofsimo
The difference between GRN-v3 and DCA is incredibly small imo. GRN-v3 with
* w_t = g_t(x), and
* softmax instead of relu
gives you Kimi (correct me if im wrong on this). I think one could argue this is a sufficiently large difference to omit the citation completely, idk. But even then, you can't disagree that in their formulation this is a special case of DCA's GRN.

1 reply · 0 reposts · 16 likes · 900 views
Pranav Shyam @recurseparadox
@cloneofsimo It’s a 20% win over a weak pre-norm baseline. Their DeltaFormer baseline is not better than the baseline (you’d expect some benefit). The attention visualization shows almost all mass being on the diagonal. So the real win here is some single-digit % imo, but with the huge memory cost

2 replies · 0 reposts · 11 likes · 1K views
Simo Ryu @cloneofsimo
@_arohan_ yup thats true, scaling (+ sharing the work on scaling) is definitely something they deserve credit for. But eh, i find it funny people are having a hard time admitting all of this has essentially been explored before (by google, even publicly 😅)

1 reply · 0 reposts · 21 likes · 1.9K views
rohan anil @_arohan_
This is funny! Although in hindsight I think we should give due credit to all the new works that improve on it and scale, and think about deploying it in a real training run (solving for memory growth). An advantage Google had was that there were extremely strong folks left alone to think for a longer time, thus able to ascend in creative directions like this.

1 reply · 0 reposts · 42 likes · 3.8K views
Simo Ryu @cloneofsimo
@gabriel1 Lmfao this is a god-tier observation

0 replies · 0 reposts · 8 likes · 843 views
gabriel @gabriel1
do pigeons just spawn as adults? did evolution forget to make the baby pigeon asset

46 replies · 4 reposts · 307 likes · 30.9K views
Simo Ryu @cloneofsimo
@DwiAtmika7 I dont mean IDEs, i use both an IDE and codex.

0 replies · 0 reposts · 0 likes · 659 views
Dwi Atmika @DwiAtmika7
@cloneofsimo You are in your bubble, my guy; out there everyone is still stuck using either vscode / cursor / antigravity

1 reply · 0 reposts · 9 likes · 1K views
Simo Ryu @cloneofsimo
Literally all the "coding scaffold agent" companies are effectively dead (at least in my circle, no one uses them), and what survived is either codex or claude code. All of this was pretty clear if you saw how RL fundamentally enabled deep research, and how it essentially killed all the search wrappers. You cannot compete, with just prompting, against companies that have the capability to fine-tune a SoTA model. This will continue to be the case. If you don't have the capability to pretrain / fine-tune a base model, prepare to die.

20 replies · 13 reposts · 273 likes · 22.5K views
Simo Ryu @cloneofsimo
Guys ill be in Hong Kong next weekend, hmu if u wanna chat there

0 replies · 0 reposts · 7 likes · 1.8K views
Simo Ryu @cloneofsimo
Eyyy, the Perplexity cafe in Seoul 청담 is pretty fire!!!
[3 images]

1 reply · 0 reposts · 17 likes · 2.4K views
Simo Ryu @cloneofsimo
Bookmark this blog by @Simon_Vt and follow the guy, crazy alpha vibe shyt
[image]

4 replies · 12 reposts · 315 likes · 13K views
Simo Ryu @cloneofsimo
@wraith_ Its close to 50% of the minimum wage in south korea 💀💀💀💀💀💀💀💀💀💀💀

0 replies · 0 reposts · 3 likes · 379 views
wraith @wraith_
@cloneofsimo Maybe I’m out of touch, but $3/hr doesn’t seem too extreme in a spin-it-up, do-your-stuff, and-destroy-it scenario

1 reply · 0 reposts · 4 likes · 909 views
Simo Ryu @cloneofsimo
I don't think, as a student, you should pay $3/hour to rent a GPU to study the blackwell arch. More like you should be able to submit small jobs remotely where 100 people share one B100. For student projects, each job will probably take < 10 seconds to finish. Submission will probably cost 1 cent. Who is fixing this?
[image]

16 replies · 2 reposts · 269 likes · 25.4K views