

🚀Excited to announce F Lite: a new open-source text-to-image model by @freepik and @FAL! The first at this scale that’s both open-source and trained exclusively on licensed, high-quality data.🧵
Simo Ryu
@cloneofsimo

BREAKING: Kim Jong‑un officially wins North Korea’s parliamentary election with 99.93% of the vote.



@behrouz_ali Same method in which way? It seems different in many aspects.

the google who cried wolf

w̄_t = act(w_t^T G_t)
  G_t: stack of residual-function outputs
  w_t: trainable vector per layer

act(w_t^T G_t) = softmax(Q^T K), where Q = w_t, K = f(h_i) in Kimi's notation (!)

g_t(x) = G_t (b_t + w̄_t) = (softmax(Q^T K) + bias) V

-> therefore, same thing as AttnRes, but with bias = 0. So the only difference seems to be relu vs softmax 😅
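The identity above can be checked numerically. A minimal pure-Python toy (all shapes and values are hypothetical, chosen only for illustration) showing that the gated layer mixture G_t (b_t + softmax(w_t^T G_t)) is exactly the attention form (softmax(Q^T K) + bias) V once you set Q = w_t and K = V = G_t:

```python
import math

# Hypothetical toy sizes: hidden dim d, number of preceding layers t.
d, t = 4, 3
# G: stack of residual-branch outputs, one length-d column per layer (d x t).
G = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1], [-0.3, 0.2, 0.5], [0.2, 0.1, -0.4]]
w = [0.5, -1.0, 0.3, 0.8]   # trainable per-layer query vector (length d)
b = [0.1, -0.2, 0.3]        # learned bias over layers (length t)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matvec_T(M, v):
    # M^T v: one score per column (i.e., per preceding layer).
    return [sum(M[i][j] * v[i] for i in range(len(v))) for j in range(len(M[0]))]

def matvec(M, v):
    # M v: mix the columns of M with weights v.
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

# Formulation 1: gated mixture over layer outputs, g = G (b + softmax(w^T G)).
wbar = softmax(matvec_T(G, w))
g1 = matvec(G, [bi + wi for bi, wi in zip(b, wbar)])

# Formulation 2: attention with Q = w, K = V = G, i.e. (softmax(Q^T K) + bias) V.
scores = softmax(matvec_T(G, w))
g2 = matvec(G, [s + bi for s, bi in zip(scores, b)])

assert all(abs(a - c) < 1e-12 for a, c in zip(g1, g2))  # same quantity, relabeled
```

As the tweet notes, the two differ only in the choice of activation (relu vs softmax) and whether the bias term is present.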


This paper is the same as the DeepCrossAttention (DCA) method from more than a year ago: arxiv.org/abs/2502.06785. As far as I understand, there is no innovation here to be excited about, and yet, surprisingly, there is no citation of or discussion about DCA! The level of redundancy in LLM research, and then the hype on X, is getting worse and worse! DeepCrossAttention is built on the intuition that depth-wise cross-attention allows richer interactions between layers at different depths. DCA further provides both empirical and theoretical results to support this approach.



Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗 Full report: github.com/MoonshotAI/Att…
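The core idea can be sketched in a few lines. This is a hypothetical scalar toy, not the released implementation: the standard residual stream accumulates every branch output uniformly, while the attention-residual variant lets each layer retrieve from all preceding representations with input-dependent softmax weights (queries here are random stand-ins for learned parameters):

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def f(h, t):
    # Stand-in for layer t's residual branch (hypothetical toy function).
    return math.tanh(h + 0.1 * t)

depth, h0 = 6, 0.5

# Standard residual stream: fixed, uniform accumulation of branch outputs.
h = h0
for t in range(depth):
    h = h + f(h, t)

# Attention residual (sketch): a per-layer query scores every preceding
# representation, and the layer input is the softmax-weighted retrieval.
queries = [random.uniform(-1, 1) for _ in range(depth)]  # stand-in for learned w_t
outputs = [h0]          # past representations the stream can retrieve from
x = h0
for t in range(depth):
    scores = softmax([queries[t] * o for o in outputs])  # input-dependent weights
    x = sum(s * o for s, o in zip(scores, outputs)) + f(x, t)
    outputs.append(x)
```

Because the weights are a softmax over past layers rather than a fixed all-ones sum, early representations are not diluted by depth; the block-partitioned variant described in the tweet would additionally cap how many past representations each layer attends over.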








