Philip Monk
@pcmonk

1.8K posts

A man alive, walking on two legs about the world. Infra lead @essential_ai

San Francisco, CA · Joined January 2012
392 Following · 2.1K Followers
Philip Monk@pcmonk·
@iamwil @dardezeu @patio11 It's a style that lets you say more things in a sentence without it devolving into a long series of comma-separated phrases. I think the terseness is a goal in itself though, and not necessarily specific to twitter.
Wil Chung@iamwil·
@dardezeu @pcmonk @patio11 I'm not sure. He's one of two writers on the internet I read often that I find hard to parse. I've heard him on his podcast, and he doesn't talk that way. I had assumed the word choice and order contortion was due to twitter's character limit.
Philip Monk@pcmonk·
@iamwil @patio11 the expectation was reasonable to him, though apparently not to Japanese salarymen
Philip Monk@pcmonk·
Even ignoring hw utilization, you still want to avoid routing collapse. But maybe that's more about enforcing minimum usage than max usage? If an expert is chosen <5% as often as the average, that expert is almost certainly wasted and undertrained. But if an expert is chosen 5x as often as the average, that could just mean it's a generically useful function. I think shared experts are a concession to this, but there could be more granularity than "perfectly balanced" vs "always activated"
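A minimal numpy sketch of what an asymmetric, minimum-usage-only penalty could look like, as opposed to the usual push toward perfect balance. The function name, the 5% floor, and the squared-deficit form are all illustrative assumptions, not from any paper in this thread:

```python
# Illustrative only: penalize experts whose routed share falls below a floor
# (here 5% of the uniform share), and put no pressure on heavily used experts.
import numpy as np

def min_usage_loss(router_probs: np.ndarray, floor_frac: float = 0.05) -> float:
    """router_probs: [tokens, experts] softmax outputs of the router."""
    n_experts = router_probs.shape[1]
    load = router_probs.mean(axis=0)          # fraction of routing mass per expert
    floor = floor_frac / n_experts            # 5% of the uniform share 1/E
    deficit = np.maximum(floor - load, 0.0)   # only under-used experts contribute
    return float((deficit ** 2).sum())

probs = np.random.dirichlet(np.ones(8), size=1024)  # fake router outputs, 8 experts
print(min_usage_loss(probs))
```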
Daria Soboleva@dmsobol·
Oh interesting, I had a similar idea that router vector norms should be close to 1 so that the router vectors start rotating in the space, making experts more specialised. I'll take a look 👀 I agree that the global batch size should be used instead of the micro batch size for calculating lbl. But maybe there is a way to avoid using this loss altogether? It seems to be mostly important from the hw utilization perspective tied to the GPU architecture.
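On the global-batch vs micro-batch point, here is a small numpy sketch of the common Switch-style load-balancing loss computed both ways; the exact loss form and the sizes are assumptions for illustration, not taken from the linked papers:

```python
import numpy as np

def lbl(router_probs: np.ndarray) -> float:
    """Switch-style aux loss E * sum_e f_e * p_e, assuming top-1 routing.
    router_probs: [tokens, experts] softmax outputs of the router."""
    n_experts = router_probs.shape[1]
    assign = np.argmax(router_probs, axis=1)
    f = np.bincount(assign, minlength=n_experts) / len(assign)  # token fraction per expert
    p = router_probs.mean(axis=0)                                # mean router prob per expert
    return float(n_experts * (f * p).sum())

micro_batches = [np.random.dirichlet(np.ones(8), size=256) for _ in range(16)]

# Micro-batch lbl: every small batch is pushed toward balance on its own.
micro = float(np.mean([lbl(b) for b in micro_batches]))

# Global-batch lbl: balance is only required across the whole accumulated batch,
# so a single micro-batch (e.g. one language or domain) is still allowed to be skewed.
global_ = lbl(np.concatenate(micro_batches, axis=0))
print(micro, global_)
```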
Daria Soboleva@dmsobol·
MoE models are compute efficient. Everyone knows that. But they are not parameter efficient. Why? Our experts learn redundant, overlapping functions. We're spending extra money loading weights for redundant experts. This is wasteful. And we're not fixing it. Attaching here a plot I showed at my @NeurIPSConf workshop showing how much redundancy there is in the deepseek-v3 model in the first 23 layers. The diagonal is each expert compared to itself. 🧵
[image: expert-redundancy plot for the first 23 layers of DeepSeek-V3]
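A hypothetical sketch of one way such a redundancy matrix could be produced: pairwise cosine similarity between flattened expert weight matrices within one MoE layer. This only illustrates the idea, and is not the measurement actually used for the figure:

```python
import numpy as np

def expert_similarity(expert_weights: list[np.ndarray]) -> np.ndarray:
    """expert_weights: one weight matrix per expert. Returns an [E, E] cosine-similarity
    matrix; the diagonal is each expert compared to itself (always 1.0)."""
    flat = np.stack([w.ravel() for w in expert_weights])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    return flat @ flat.T

experts = [np.random.randn(256, 512) * 0.02 for _ in range(8)]  # toy stand-ins
print(expert_similarity(experts).round(2))
```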
Philip Monk@pcmonk·
Re 2, it does seem like the variance loss should eventually load balance, but maybe that effect is not strong enough? Would be worth trying. Not the same, but this paper claims you can replace lb loss with an orthonormality loss on the router weights, and it still load balances. arxiv.org/pdf/2506.14038… The global-batch vs microbatch/sequence-wise lb loss distinction from the demons-in-the-detail paper you linked above seems important. With sequence-wise lb loss, you couldn't possibly get eg specialization by language, like they do in the last few layers, so I'm suspicious of a lot of the earlier papers that claim to get some kind of data-domain specialization while using microbatch-wise lb loss.
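For reference, the generic way to write an orthonormality penalty on router weights is the squared Frobenius norm of W W^T − I; this is the textbook form, not necessarily the exact objective of the linked paper:

```python
import numpy as np

def router_orthonormality_loss(w_router: np.ndarray) -> float:
    """w_router: [experts, d_model]; each row is one expert's router vector.
    Pushes the rows toward unit norm and mutual orthogonality."""
    gram = w_router @ w_router.T                 # [E, E] pairwise inner products
    eye = np.eye(w_router.shape[0])
    return float(((gram - eye) ** 2).sum())      # || W W^T - I ||_F^2

w = np.random.randn(8, 1024) / np.sqrt(1024)
print(router_orthonormality_loss(w))
```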
Daria Soboleva@dmsobol·
I made a pass through this work arxiv.org/pdf/2505.22323. Some observations:

1/ It confirms my view that the load balancing loss is hurting expert specialization. However, the authors still keep it as an objective but add two more objectives: the first to promote orthogonality of expert outputs on the same input (they call it expert specialization), and the second to improve router score diversity.

2/ It made me think about two questions. First: why is the router score diversity penalty not enough to ensure proper load balance? Second: why does removing the lbl loss decrease the model's quality? It seems like the original hypothesis was that the lbl loss is preventing specialization in experts and thus hurting the quality? I could not figure out this part.

3/ Let's imagine that the router diversity loss is primarily there to make the router decisions more confident, which can be the case. As we explored with @aman_gif in p3 of the MoE 101 series cerebras.ai/blog/moe-guide… there is a simpler trick that we call the "expert bias trick" that is added at no cost and improves both a) the decisiveness of the router and b) the grad flow. I wonder what their diversity loss offers on top of the expert bias trick?

4/ When thinking about the role of each loss, I was actually surprised there is no ablation to compare which loss contributed the most to the final quality uplift. Is it expert specialization or router diversity?

5/ A useful takeaway for me is that we can potentially start improving expert specialization during the SFT phase without having to do this during pre-training. However, the authors don't compare how much improvement we'd get by injecting their two losses as objectives from the pre-training phase; they only look at fine-tuning. cc @xidulu @SkyLi0n

6/ On the design of these new objectives: I am guessing you can do it more efficiently instead of having to go through all expert outputs for every token in the batch. We should not look at the activations; I think we should look into the expert weight matrices. Based on this paper arxiv.org/abs/2406.00127 there are reasons to believe that the top singular vector in an expert's weight matrices aligns with the token representation that gets routed to it. We expect that different experts should respond to different inputs; if they all respond to the same vectors then they're redundant/not specialized. I am curious if anyone has tried to measure specialization this way in MoE layers?

@pcmonk have you tried adding these objectives into your MoE models? Curious if you have any experiments to share.
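A rough numpy sketch of the measurement floated in point 6 above: compare each expert's top right singular vector against the mean hidden state of the tokens routed to it. The names, shapes, and the |cosine| score are assumptions for illustration; this is not from the cited papers:

```python
import numpy as np

def specialization_scores(expert_in_weights, tokens, assignments):
    """expert_in_weights: list of [d_ff, d_model] input projections, one per expert.
    tokens: [n, d_model] hidden states; assignments: [n] top-1 expert ids.
    Returns, per expert, |cosine| between its top right singular vector and the
    mean representation of the tokens routed to it."""
    scores = []
    for e, w in enumerate(expert_in_weights):
        _, _, vt = np.linalg.svd(w, full_matrices=False)
        top_dir = vt[0]                                # unit vector in d_model space
        routed = tokens[assignments == e]
        if len(routed) == 0:
            scores.append(float("nan"))
            continue
        mean_tok = routed.mean(axis=0)
        mean_tok = mean_tok / np.linalg.norm(mean_tok)
        scores.append(float(abs(top_dir @ mean_tok)))
    return scores

d_model, d_ff, n_experts = 64, 256, 4
weights = [np.random.randn(d_ff, d_model) for _ in range(n_experts)]
toks = np.random.randn(1024, d_model)
assign = np.random.randint(0, n_experts, size=1024)
print(specialization_scores(weights, toks, assign))
```

On random toy data like this the scores should sit near zero; the question the thread raises is whether trained experts separate on a measure like this.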
Philip Monk retweeted
ollama@ollama·
.@essential_ai's rnj-1 model is now on Ollama!

ollama run rnj-1

8B parameter, open-weight dense model trained from scratch. The model is optimized for code and STEM with capabilities on par with other state of the art open-weight models. Let's go! 🚀🚀🚀
Philip Monk@pcmonk·
It's open weights and a very convenient size to run locally, btw. I get 20 tok/s on an M3 mac with llama.cpp.
Philip Monk retweeted
Essential AI@essential_ai·
Today, we're excited to introduce Rnj-1, @essential_ai's first open model: a world-class 8B base + instruct pair, built with scientific rigor, intentional design, and a belief that the advancement and equitable distribution of AI depend on building in the open. We bring American open-source on par with the best in the world.
Philip Monk@pcmonk·
@joji_teira I wasn't around for Wang, so I just used a trillion flops to save a trip to wikipedia
Joji Teira@joji_teira·
@pcmonk I used 8" disks on a Wang, and I am not even kidding
Philip Monk@pcmonk·
You all have it so easy today with your petaflop gpus. In my day we had *floppy disks* that could only handle a few hundred kiloflops/s
Philip Monk retweeted
Essential AI@essential_ai·
[1/2] We at Essential are driven by a mission to advance fundamental research, guided by first principles, rigor, and sharing research openly.
Philip Monk@pcmonk·
@hastuc_dibtux I've not looked into those much. It seems a very tall order, and my first question would be what's their story for flash attention
Philip Monk@pcmonk·
@hastuc_dibtux Flash at its core is "just" fusion, but it requires a lot more knowledge than just the shapes, which is what compilers usually get
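A minimal numpy sketch of the tiling + online-softmax trick that this kind of fusion relies on, i.e. why it needs more than shape information: the kernel has to carry running max/denominator state across tiles. Purely illustrative (single head, no masking), not any particular implementation:

```python
import numpy as np

def tiled_attention(q, k, v, tile=128):
    """q: [sq, d], k/v: [sk, d]. Computes softmax(q k^T / sqrt(d)) v one K/V tile
    at a time, never materializing the full [sq, sk] score matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q)
    m = np.full(q.shape[0], -np.inf)   # running row-wise max of the scores
    l = np.zeros(q.shape[0])           # running softmax denominator
    for start in range(0, k.shape[0], tile):
        ks, vs = k[start:start + tile], v[start:start + tile]
        s = (q @ ks.T) * scale                     # scores for this tile only
        m_new = np.maximum(m, s.max(axis=1))
        corr = np.exp(m - m_new)                   # rescale previous accumulators
        p = np.exp(s - m_new[:, None])
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ vs
        m = m_new
    return out / l[:, None]

q, k, v = (np.random.randn(512, 64) for _ in range(3))
s_ref = (q @ k.T) / np.sqrt(64)
p_ref = np.exp(s_ref - s_ref.max(axis=1, keepdims=True))
ref = (p_ref / p_ref.sum(axis=1, keepdims=True)) @ v
print(np.allclose(tiled_attention(q, k, v), ref))  # True
```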
Philip Monk@pcmonk·
@hastuc_dibtux The performance difference between flash v2 and v3 is like double, and that's just adding like ping pong scheduling. It's a deep rabbit hole and also completely non-optional for anything at scale.