xjdr

7.3K posts

xjdr
@_xjdr

building AI that won't embarrass me in front of my own standards

Noam's Labyrinth · Joined December 2023
689 Following · 27.3K Followers

Pinned Tweet
xjdr @_xjdr
Writing jitted JAX code is like playing Dark Souls, but in Python.
14 replies · 20 reposts · 444 likes · 290.3K views
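A minimal illustration (not from the thread) of why jit-compiled JAX earns the comparison: ordinary Python control flow on a traced value fails at trace time and has to be re-expressed as JAX primitives. The function names below are made up for the example.

```python
import jax
import jax.numpy as jnp

@jax.jit
def clamp_bad(x):
    # Inside jit, `x` is an abstract tracer, so a plain Python `if`
    # on it raises a concretization/tracer error at trace time.
    if x > 0:
        return x
    return jnp.zeros_like(x)

@jax.jit
def clamp_good(x):
    # The idiomatic fix: express the branch as data flow with jnp.where,
    # so both sides are traced and the selection happens on-device.
    return jnp.where(x > 0, x, jnp.zeros_like(x))

print(clamp_good(jnp.array(-3.0)))   # 0.0
# clamp_bad(jnp.array(-3.0))         # uncomment to take the hit
```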
thegeneralist @thegeneralist01
@_xjdr @reach_vb xjdr, who dealt with the fonts? Was it codex or someone from your team? Basically I want to ask how one can educate oneself on fonts as beautiful as those on the main page.
1 reply · 0 reposts · 0 likes · 18 views
xjdr @_xjdr
100B+ params is usually what i mean when i say "at scale". ya, canon paper aside (there are several things in there i'd do differently), the core canon-style mixing at a few points in the model is both sane and helpful. As to whether it's "better" than alternatives at scale, i think we still have more testing to do to say one way or the other.
1 reply · 0 reposts · 6 likes · 184 views
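For readers who haven't seen the Canon idea (quoted from Teortaxes further down): as I understand it, the mixing is a small causal convolution over a few preceding token states, added back to the residual stream at a couple of points in each block. A minimal JAX sketch under that assumption; the kernel width of 4 and the insertion points are illustrative, not the paper's exact recipe.

```python
import jax
import jax.numpy as jnp

def canon_mix(x, w):
    """Causal depthwise token mixing over the sequence dimension.

    x: (seq, d_model) hidden states
    w: (k, d_model) per-channel weights over the current token and the
       k - 1 preceding tokens (k = 4 here, chosen for illustration).
    Returns a tensor shaped like x, meant to be added to the residual.
    """
    k = w.shape[0]
    # Left-pad so position t only ever sees tokens <= t (causal).
    x_pad = jnp.pad(x, ((k - 1, 0), (0, 0)))
    # shifted[i, t] = x[t - (k - 1) + i]; i = k - 1 is the current token.
    shifted = jnp.stack([x_pad[i:i + x.shape[0]] for i in range(k)], axis=0)
    return jnp.einsum('ktd,kd->td', shifted, w)

# Toy usage: 16 tokens, model dim 8, mixing over 4 tokens. In a real block
# this would sit at a few points, e.g. x = x + canon_mix(x, w) right before
# the attention and MLP sublayers (placement assumed, not prescribed).
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (16, 8))
w = 0.1 * jax.random.normal(key, (4, 8))
print((x + canon_mix(x, w)).shape)   # (16, 8)
```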
gum @gum1h0x
@_xjdr hmm, what scale are we talking about? i still intuitively think tokenshift, if done right, should scale fine, don't get me wrong. but if we're talking about how it's done in the canon paper, i disagree...
1 reply · 0 reposts · 6 likes · 274 views
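In this thread "tokenshift" presumably refers to the RWKV-style trick: each channel linearly interpolates between the current token's hidden state and the previous token's, using a learned per-channel mix. A minimal sketch under that reading (the parameter name `mu` is made up for illustration):

```python
import jax
import jax.numpy as jnp

def token_shift(x, mu):
    """RWKV-style token shift.

    x:  (seq, d_model) hidden states
    mu: (d_model,) learned per-channel mixing weights in [0, 1]
    Each position blends its own state with the previous position's:
        out[t] = mu * x[t] + (1 - mu) * x[t - 1]   (x[-1] taken as zeros)
    """
    x_prev = jnp.pad(x, ((1, 0), (0, 0)))[:-1]   # previous token, zeros at t = 0
    return mu * x + (1.0 - mu) * x_prev

# Toy usage
x = jax.random.normal(jax.random.PRNGKey(0), (16, 8))
mu = jnp.full((8,), 0.5)
print(token_shift(x, mu).shape)   # (16, 8)
```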
gum @gum1h0x
very likely they already abandoned it. truth is, canon does not work at scale. most Chinese labs had medium-scale canon pretraining runs and the results sucked. i got hinted to look deeper into the canon paper again, and there are some pretty sloppy math mistakes once you go through it somewhat carefully. in hindsight it should not be that surprising that it falls short of what people expect or what small-scale experiments might suggest. the more i think about it, attnres is really clean and probably sufficient on its own.

Quoting Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex
It's been known for a while that Canon from @ZeyuanAllenZhu is a monstrously powerful augmentation to the Transformer recipe, and I'm lowkey seething that it's not had industry adoption yet. GLM switched to DSA in months. If you're doing new pretraining, why not test Canon too?

2 replies · 1 repost · 41 likes · 5.2K views
Leo Boytsov @srchvrs
@giffmana @_xjdr "extremely strict colleagues": for younger colleagues, you could explain that this is a euphemism.
2 replies · 0 reposts · 2 likes · 152 views
xjdr @_xjdr
i cannot explain to you how much i hate publishing. i solved the problem; it's no longer interesting to me. thank you to all the people who read any of what i made.
6 replies · 11 reposts · 391 likes · 20.9K views
xjdr @_xjdr
[GIF]
2 replies · 0 reposts · 64 likes · 3.4K views
xjdr @_xjdr
@georgejrjrjr "I can tell if someone is lying just by looking at them!"
1 reply · 0 reposts · 9 likes · 322 views
George @georgejrjrjr
@_xjdr Introduce yourself as a jury nullification activist; problem solved.
1 reply · 0 reposts · 5 likes · 403 views
xjdr @_xjdr
I have jury duty this week. Civic duty be damned, this is the worst.
5 replies · 0 reposts · 44 likes · 3.5K views
xjdr @_xjdr
@JingyuanLiu123 i believe your intuition is correct, but i think there are deeper mysteries to explore instead of "these models are shit"
0 replies · 0 reposts · 12 likes · 865 views
JingyuanLiu @JingyuanLiu123
my personal understanding is that the prenorm arch, ever since the llama stage, has had this large residual-RMS problem for a long while. just check the output RMS per layer; Kimi K2 and dpsk-v3 should also suffer from such problems. basically, the residual activation RMS becomes very large as the layers go deeper, while your branch stays at a roughly similar scale and is comparatively too small. this is an implicit collapse of your model: you thought your model was 64 layers deep, but actually it is only 54 layers, because the last 10 layers have collapsed. a while ago, I tried hyperconnection (not mHC), but it didn't seem to work. super excited to see the open-source community get an elegant solution!

Quoting Yu Zhang 🌘🐙 @yzhang_cs
The idea of rotating attention by 90° is sooooooo cool (credits to @Jianlin_S 's insights), and it surprisingly works. We (w/ the amazing @nathan) are so excited about this; we've been working on the paper for months and couldn't stop. Go give it a try. It's a drop-in replacement for standard residuals, born in 2015. really like the figs btw :-)

9 replies · 11 reposts · 198 likes · 19.9K views
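The diagnostic being suggested ("just check the output RMS per layer") is a few lines of code: track the RMS of the residual stream entering each layer next to the RMS of what the layer's branch adds, and watch the ratio shrink with depth. A hedged sketch with an assumed layer interface, purely to show the measurement:

```python
import jax
import jax.numpy as jnp

def rms(x):
    # Root-mean-square over all elements.
    return float(jnp.sqrt(jnp.mean(x ** 2)))

def residual_rms_report(x, layers):
    """Per-layer diagnostic for a pre-norm residual stack.

    `layers` is any sequence of callables h -> branch_output (an assumed
    interface, not any particular framework's). If branch RMS stays roughly
    constant while residual RMS keeps growing, the relative contribution of
    later layers shrinks, which is the collapse described above.
    """
    h = x
    for i, layer in enumerate(layers):
        branch = layer(h)
        r, b = rms(h), rms(branch)
        print(f"layer {i:3d}  residual_rms={r:8.3f}  branch_rms={b:8.3f}  ratio={b / r:6.3f}")
        h = h + branch   # pre-norm residual add
    return h

# Toy usage: branches with roughly unit RMS, so the residual grows with
# depth and the branch/residual ratio decays.
layers = [lambda h, k=k: jax.random.normal(jax.random.PRNGKey(k), h.shape)
          for k in range(8)]
residual_rms_report(jax.random.normal(jax.random.PRNGKey(0), (16, 64)), layers)
```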
xjdr @_xjdr
@xlr8harder i should be better about this, but i naively assume following people should work. obviously, this is my bad
4 replies · 0 reposts · 21 likes · 809 views
xlr8harder @xlr8harder
@_xjdr high signal lists are the only way to reliably see the people you want to see
2 replies · 0 reposts · 28 likes · 965 views
xjdr @_xjdr
@teortaxesTex i still search your posts directly, but it would be nice if the people i followed and was specifically interested in showed up in my fucking feed
6 replies · 0 reposts · 37 likes · 1.3K views
xjdr @_xjdr
clankercloud from @basedjensen and @tekbog is so fucking good. i'm genuinely shocked and impressed
3 replies · 8 reposts · 168 likes · 13.7K views
xjdr @_xjdr
@CliffLattner yes, tremendous value but it hasn't been cracked yet (AFAIK)
0 replies · 0 reposts · 1 like · 275 views
Cliff Lattner @CliffLattner
@_xjdr Do you think FP8 attention has value? Assuming that the ALU bottlenecks can be alleviated so the increased MMA throughput actually helps
1 reply · 0 reposts · 1 like · 296 views
xjdr @_xjdr
there are a bunch of research journal posts i didn't release for various reasons (some of which are still under active research), but i can enumerate a few preliminary findings:
NSA is the exact right idea but a poor initial design. we can be inspired by it but still do better.
MTP is awesome but actually not all that helpful, even if that's counterintuitive. it doesn't matter how pretty the math is.
N+4 distillation is not more efficient than KDL online distillation.
fa4-based sparse attention is going to be huge by the end of the year.
4 replies · 1 repost · 159 likes · 8.1K views
xjdr @_xjdr
@tekbog as soon as AI gets good enough to write the posts itself, i will be publishing 100 papers a year. until then, again, i am still the bottleneck (maybe just a skill issue tho)
1 reply · 3 reposts · 66 likes · 4.4K views
terminally onλine εngineer
@_xjdr if you have notes, you can automate most of it and leave a summary for your followers. better than nothing; maybe put a label “AI summary of my work”
1 reply · 0 reposts · 31 likes · 2K views