Simo Ryu

5.2K posts

@cloneofsimo

I like cats, math and codes [email protected]

Seoul, Republic of Korea · Joined May 2022
889 Following · 17.1K Followers

Pinned Tweet
Simo Ryu @cloneofsimo
10B-parameter DiT trained on 80M images, all owned by @freepik. Model commercially usable, raw model without distillation, open sourced. Proud to demonstrate our first model-training project with our client @freepik: "F-Lite", from @FAL
[image]

Iván de Prado @ivanprado
🚀 Excited to announce F Lite: a new open-source text-to-image model by @freepik and @FAL! The first at this scale that’s both open-source and trained exclusively on licensed, high-quality data. 🧵

21 replies · 66 reposts · 519 likes · 157.4K views
Simo Ryu @cloneofsimo
I may have tweeted this a gazillion times, but at any point in time there are so many arch tweaks in the air that I have to occasionally reupload the same meme to remind you
[image]

8 replies · 6 reposts · 162 likes · 7.7K views
Simo Ryu @cloneofsimo
@YouJiacheng @AlberFuen I thought ML researchers were allowed to write shitty code precisely because they were supposed to read matrix notations fluently 💀💀💀

0 replies · 0 reposts · 10 likes · 847 views
You Jiacheng @YouJiacheng
One drawback of DCA is its non-intuitive notation (⊙ and vec(1)) and lack of analogy (it didn't make the analogy to rotated attention). DCA's GRN-v3 is a generalization of AttnRes (note: a generalization is not always better): let b_t = 0 and σ = exp, and we get AttnRes.
[image]

Casper Hansen @casper_hansen_
@behrouz_ali same method in which way? seems different in many aspects

3 replies · 10 reposts · 85 likes · 10.1K views
Simo Ryu @cloneofsimo
lmao
> google cooks paper, "meh its probably not gonna work, pass"
> chinese lab cooks exact same thing one year later, everyone gets super hyped
EVERY SINGLE TIME
[image]

Ali Behrouz @behrouz_ali
This paper is the same as the DeepCrossAttention (DCA) method from more than a year ago: arxiv.org/abs/2502.06785. As far as I understand, there is no innovation here to be excited about, and yet surprisingly there is no citation of or discussion about DCA! The level of redundancy in LLM research, and then the hype on X, is getting worse and worse! DeepCrossAttention is built on the intuition that depth-wise cross-attention allows richer interactions between layers at different depths. DCA further provides both empirical and theoretical results to support this approach.

21 replies · 27 reposts · 499 likes · 52.1K views
Simo Ryu @cloneofsimo
w̄_t = act(w_t^T G_t)
G_t: stack of residual function outputs
w_t: trainable vector per layer
act(w_t^T G_t) = softmax(Q^T K), where Q: w_i, K = f(h_i) in kimi's notation (!)
g_t(x) = G_t (b_t + w̄_t) = (softmax(Q^T K) + bias) V
→ therefore, same thing as AttnRes, but with bias = 0. So the only difference seems to be relu vs softmax 😅

0 replies · 0 reposts · 5 likes · 2.6K views
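A toy sketch of the comparison made in this sub-thread (this is illustrative code, not from either paper; the shapes, names, and random inputs are all assumptions): both rules mix the stack of earlier layer outputs G with weights produced from a trainable vector w, and per the tweet they differ mainly in the activation (softmax for Kimi's AttnRes, relu plus a bias for DCA's GRN-v3).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def aggregate(G, w, b, act):
    """Depth-wise aggregation over stacked layer outputs.

    G: (L, d) stack of residual-branch outputs from layers 0..L-1
    w: (d,)   trainable mixing vector (the tweet's w_t)
    b: (L,)   trainable per-layer bias (the tweet's b_t)
    """
    wbar = act(G @ w)        # (L,) one mixing weight per earlier layer
    return G.T @ (b + wbar)  # (d,) weighted sum of layer outputs

rng = np.random.default_rng(0)
L, d = 4, 8
G = rng.normal(size=(L, d))
w = rng.normal(size=d)

# Kimi-style AttnRes as described in the tweet: softmax scores, bias = 0
attn_res = aggregate(G, w, np.zeros(L), softmax)
# DCA GRN-v3-style as described in the tweet: relu scores plus a bias
grn_v3 = aggregate(G, w, rng.normal(size=L), relu)
```

With bias fixed to zero the softmax variant is literally attention over layers (G plays both K and V), which is the sense in which the thread calls one a special case of the other.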
Konstantin Mishchenko @konstmish
@behrouz_ali I was actually talking about the third variant. The input-dependent parameters are exactly the Q, K, V that I mentioned. In your screenshot, g_t is computed using the learnable matrix b_t and weights w_t, but notice that it doesn't use cross products of entries of G_t, while Kimi does.

2 replies · 1 repost · 8 likes · 1.7K views
Ali Behrouz @behrouz_ali
This paper is the same as the DeepCrossAttention (DCA) method from more than a year ago: arxiv.org/abs/2502.06785. As far as I understand, there is no innovation here to be excited about, and yet surprisingly there is no citation of or discussion about DCA! The level of redundancy in LLM research, and then the hype on X, is getting worse and worse! DeepCrossAttention is built on the intuition that depth-wise cross-attention allows richer interactions between layers at different depths. DCA further provides both empirical and theoretical results to support this approach.

Kimi.ai @Kimi_Moonshot
Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
🔗 Full report: github.com/MoonshotAI/Att…

33 replies · 88 reposts · 1K likes · 218.8K views
Simo Ryu @cloneofsimo
@konstmish Actually im retarded, w_t is a learnable vector in kimi's case as well. So it seems like the only difference is relu vs softmax

2 replies · 0 reposts · 5 likes · 356 views
Simo Ryu @cloneofsimo
The difference between GRN-v3 and DCA is incredibly small imo. GRN-v3 with
* w_t = g_t(x), and
* softmax instead of relu
gives you Kimi (correct me if im wrong on this). I think one could argue this is a sufficiently large difference to omit the citation completely, idk. But even then, you can't disagree that in their formulation this is a special case of DCA's GRN.

1 reply · 0 reposts · 16 likes · 900 views
Pranav Shyam @recurseparadox
@cloneofsimo It’s a 20% win over a weak pre-norm baseline. Their DeltaFormer baseline is not better than the baseline (you’d expect some benefit). The attention visualization shows almost all mass being on the diagonal. So the real win here is some single-digit % imo, but with the huge memory cost

2 replies · 0 reposts · 11 likes · 1K views
Simo Ryu @cloneofsimo
@_arohan_ yup thats true, scaling (+ sharing the work on scaling) is definitely something they deserve credit for. But eh, i find it funny people are having a hard time admitting all of this has essentially been explored before (by google, even publicly 😅)

1 reply · 0 reposts · 21 likes · 1.9K views
rohan anil @_arohan_
This is funny! Although in hindsight I think we should give due credit to all the new works that improve on it and scale, and think about deploying it in a real training run (solving for memory growth). An advantage Google had was that there were extremely strong folks left alone to think for a longer time, thus able to ascend in creative directions like this.

1 reply · 0 reposts · 42 likes · 3.8K views
Simo Ryu @cloneofsimo
@gabriel1 Lmfao this is a god-tier observation

0 replies · 0 reposts · 8 likes · 843 views
gabriel @gabriel1
do pigeons just spawn as adults? did evolution forget to make the baby pigeon asset

46 replies · 4 reposts · 307 likes · 30.9K views
Simo Ryu @cloneofsimo
@DwiAtmika7 I dont mean IDEs, i use both an IDE and codex.

0 replies · 0 reposts · 0 likes · 659 views
Dwi Atmika @DwiAtmika7
@cloneofsimo You are in your bubble, my guy; out there everyone is still stuck using either vscode / cursor / antigravity

1 reply · 0 reposts · 9 likes · 1K views
Simo Ryu @cloneofsimo
Literally all the "coding scaffold agent" companies are effectively dead (at least in my circle, no one uses them), and what survived is either codex or claude code. All of this was pretty clear if you saw how RL fundamentally enabled deep research, and how it essentially killed all the search wrappers. You cannot compete, with just prompting, against companies that have the capability to fine-tune a SoTA model. This will continue to be the case. If you don't have the capability to pretrain / fine-tune a base model, prepare to die.

20 replies · 13 reposts · 273 likes · 22.5K views
Simo Ryu @cloneofsimo
Guys ill be in Hong Kong next weekend, hmu if u wanna chat there

0 replies · 0 reposts · 7 likes · 1.8K views
Simo Ryu @cloneofsimo
Eyyy, the Perplexity cafe in Seoul 청담 is pretty fire!!!
[3 images]

1 reply · 0 reposts · 17 likes · 2.4K views
Simo Ryu @cloneofsimo
Bookmark this blog by @Simon_Vt and follow the guy, crazy alpha vibe shyt
[image]

4 replies · 12 reposts · 315 likes · 13K views
Simo Ryu @cloneofsimo
@wraith_ Its close to 50% of the minimum wage in south korea 💀💀💀💀💀💀💀💀💀💀💀

0 replies · 0 reposts · 3 likes · 379 views
wraith @wraith_
@cloneofsimo Maybe I’m out of touch, but $3/hr doesn’t seem too extreme in a spin-it-up, do-your-stuff, and-destroy-it scenario

1 reply · 0 reposts · 4 likes · 909 views
Simo Ryu @cloneofsimo
I don't think, as a student, you should pay $3/hour to rent a GPU to study the blackwell arch. More like you should be able to submit small jobs remotely where 100 people share one B100. For student projects, each job will probably take < 10 seconds to finish. Submission will probably cost 1 cent. Who is fixing this?
[image]

16 replies · 2 reposts · 269 likes · 25.4K views