Philip Monk
@pcmonk

1.8K posts

A man alive, walking on two legs about the world. Infra lead @essential_ai

San Francisco, CA · Joined January 2012
392 Following · 2.1K Followers
Philip Monk@pcmonk·
@iamwil @dardezeu @patio11 It's a style that lets you say more things in a sentence without it devolving into a long series of comma-separated phrases. I think the terseness is a goal in itself though, and not necessarily specific to twitter.
Wil Chung@iamwil·
@dardezeu @pcmonk @patio11 I'm not sure. He's one of two writers on the internet I read often that I find hard to parse. I've heard him on his podcast, and he doesn't talk that way. I had assumed the word choice and order contortion was due to twitter's character limit.
Philip Monk@pcmonk·
@iamwil @patio11 the expectation was reasonable to him, though apparently not to Japanese salarymen
Philip Monk@pcmonk·
Even ignoring hw utilization, you still want to avoid routing collapse. But maybe that's more about enforcing minimum usage than max usage? If an expert is chosen <5% as often as the average, that expert is almost certainly wasted and undertrained. But if an expert is chosen 5x as often as the average, that could just mean it's a generically useful function. I think shared experts are a concession to this, but there could be more granularity than "perfectly balanced" vs "always activated"
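A minimal numpy sketch of what an asymmetric, minimum-usage-only penalty could look like, as opposed to the usual push toward perfect balance. The function name, the 5% floor, and the squared-deficit form are all illustrative assumptions, not from any paper in this thread:

```python
# Illustrative only: penalize experts whose routed share falls below a floor
# (here 5% of the uniform share), and put no pressure on heavily used experts.
import numpy as np

def min_usage_loss(router_probs: np.ndarray, floor_frac: float = 0.05) -> float:
    """router_probs: [tokens, experts] softmax outputs of the router."""
    n_experts = router_probs.shape[1]
    load = router_probs.mean(axis=0)          # fraction of routing mass per expert
    floor = floor_frac / n_experts            # 5% of the uniform share 1/E
    deficit = np.maximum(floor - load, 0.0)   # only under-used experts contribute
    return float((deficit ** 2).sum())

probs = np.random.dirichlet(np.ones(8), size=1024)  # fake router outputs, 8 experts
print(min_usage_loss(probs))
```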
Daria Soboleva@dmsobol·
Oh interesting, I had a similar idea that router vector norms should be close to 1 so that the router vectors start rotating in the space, making experts more specialised. I'll take a look 👀 I agree that the global batch size should be used instead of the micro batch size for calculating lbl. But maybe there is a way to avoid using this loss altogether? It seems to be mostly important from the hw utilization perspective tied to the GPU architecture.
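On the global-batch vs micro-batch point, here is a small numpy sketch of the common Switch-style load-balancing loss computed both ways; the exact loss form and the sizes are assumptions for illustration, not taken from the linked papers:

```python
import numpy as np

def lbl(router_probs: np.ndarray) -> float:
    """Switch-style aux loss E * sum_e f_e * p_e, assuming top-1 routing.
    router_probs: [tokens, experts] softmax outputs of the router."""
    n_experts = router_probs.shape[1]
    assign = np.argmax(router_probs, axis=1)
    f = np.bincount(assign, minlength=n_experts) / len(assign)  # token fraction per expert
    p = router_probs.mean(axis=0)                                # mean router prob per expert
    return float(n_experts * (f * p).sum())

micro_batches = [np.random.dirichlet(np.ones(8), size=256) for _ in range(16)]

# Micro-batch lbl: every small batch is pushed toward balance on its own.
micro = float(np.mean([lbl(b) for b in micro_batches]))

# Global-batch lbl: balance is only required across the whole accumulated batch,
# so a single micro-batch (e.g. one language or domain) is still allowed to be skewed.
global_ = lbl(np.concatenate(micro_batches, axis=0))
print(micro, global_)
```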
Daria Soboleva@dmsobol·
MoE models are compute efficient. Everyone knows that. But they are not parameter efficient. Why? Our experts learn redundant, overlapping functions. We're spending extra money loading weights for redundant experts. This is wasteful. And we're not fixing it. Attaching here a plot I showed at my @NeurIPSConf workshop showing how much redundancy there is in the deepseek-v3 model in the first 23 layers. The diagonal is each expert compared to itself. 🧵
[image: expert-redundancy plot for the first 23 layers of DeepSeek-V3]
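A hypothetical sketch of one way such a redundancy matrix could be produced: pairwise cosine similarity between flattened expert weight matrices within one MoE layer. This only illustrates the idea, and is not the measurement actually used for the figure:

```python
import numpy as np

def expert_similarity(expert_weights: list[np.ndarray]) -> np.ndarray:
    """expert_weights: one weight matrix per expert. Returns an [E, E] cosine-similarity
    matrix; the diagonal is each expert compared to itself (always 1.0)."""
    flat = np.stack([w.ravel() for w in expert_weights])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    return flat @ flat.T

experts = [np.random.randn(256, 512) * 0.02 for _ in range(8)]  # toy stand-ins
print(expert_similarity(experts).round(2))
```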
Philip Monk@pcmonk·
Re 2, it does seem like the variance loss should eventually load balance, but maybe that effect is not strong enough? Would be worth trying. Not the same, but this paper claims you can replace lb loss with an orthonormality loss on the router weights, and it still load balances. arxiv.org/pdf/2506.14038… The global-batch vs microbatch/sequence-wise lb loss distinction from the demons-in-the-detail paper you linked above seems important. With sequence-wise lb loss, you couldn't possibly get eg specialization by language, like they do in the last few layers, so I'm suspicious of a lot of the earlier papers that claim to get some kind of data-domain specialization while using microbatch-wise lb loss.
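For reference, the generic way to write an orthonormality penalty on router weights is the squared Frobenius norm of W W^T − I; this is the textbook form, not necessarily the exact objective of the linked paper:

```python
import numpy as np

def router_orthonormality_loss(w_router: np.ndarray) -> float:
    """w_router: [experts, d_model]; each row is one expert's router vector.
    Pushes the rows toward unit norm and mutual orthogonality."""
    gram = w_router @ w_router.T                 # [E, E] pairwise inner products
    eye = np.eye(w_router.shape[0])
    return float(((gram - eye) ** 2).sum())      # || W W^T - I ||_F^2

w = np.random.randn(8, 1024) / np.sqrt(1024)
print(router_orthonormality_loss(w))
```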
Daria Soboleva@dmsobol·
I made a pass through this work arxiv.org/pdf/2505.22323. Some observations:

1/ It confirms my view that the load balancing loss is hurting expert specialization. However, the authors still keep it as an objective but add two more objectives: the first to promote orthogonality of expert outputs on the same input (they call it expert specialization), and the second to improve router score diversity.

2/ It made me think about two questions. First: why is the router score diversity penalty not enough to ensure proper load balance? Second: why does removing the lbl loss decrease the model's quality? It seems like the original hypothesis was that the lbl loss is preventing specialization in experts and thus hurting the quality? I could not figure out this part.

3/ Let's imagine that the router diversity loss is primarily there to make the router decisions more confident, which can be the case. As we explored with @aman_gif in p3 of the MoE 101 series cerebras.ai/blog/moe-guide… there is a simpler trick that we call the "expert bias trick" that is added at no cost and improves both a) the decisiveness of the router and b) the grad flow. I wonder what their diversity loss offers on top of the expert bias trick?

4/ When thinking about the role of each loss, I was actually surprised there is no ablation to compare which loss contributed the most to the final quality uplift. Is it expert specialization or router diversity?

5/ A useful takeaway for me is that we can potentially start improving expert specialization during the SFT phase without having to do this during pre-training. However, the authors don't compare how much improvement we'd get by injecting their two losses as objectives from the pre-training phase; they only look at fine-tuning. cc @xidulu @SkyLi0n

6/ On the design of these new objectives: I am guessing you can do it more efficiently instead of having to go through all expert outputs for every token in the batch. We should not look at the activations; I think we should look into the expert weight matrices. Based on this paper arxiv.org/abs/2406.00127 there are reasons to believe that the top singular vector in an expert's weight matrices aligns with the token representation that gets routed to it. We expect that different experts should respond to different inputs; if they all respond to the same vectors then they're redundant/not specialized. I am curious if anyone has tried to measure specialization this way in MoE layers?

@pcmonk have you tried adding these objectives into your MoE models? Curious if you have any experiments to share.
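A rough numpy sketch of the measurement floated in point 6 above: compare each expert's top right singular vector against the mean hidden state of the tokens routed to it. The names, shapes, and the |cosine| score are assumptions for illustration; this is not from the cited papers:

```python
import numpy as np

def specialization_scores(expert_in_weights, tokens, assignments):
    """expert_in_weights: list of [d_ff, d_model] input projections, one per expert.
    tokens: [n, d_model] hidden states; assignments: [n] top-1 expert ids.
    Returns, per expert, |cosine| between its top right singular vector and the
    mean representation of the tokens routed to it."""
    scores = []
    for e, w in enumerate(expert_in_weights):
        _, _, vt = np.linalg.svd(w, full_matrices=False)
        top_dir = vt[0]                                # unit vector in d_model space
        routed = tokens[assignments == e]
        if len(routed) == 0:
            scores.append(float("nan"))
            continue
        mean_tok = routed.mean(axis=0)
        mean_tok = mean_tok / np.linalg.norm(mean_tok)
        scores.append(float(abs(top_dir @ mean_tok)))
    return scores

d_model, d_ff, n_experts = 64, 256, 4
weights = [np.random.randn(d_ff, d_model) for _ in range(n_experts)]
toks = np.random.randn(1024, d_model)
assign = np.random.randint(0, n_experts, size=1024)
print(specialization_scores(weights, toks, assign))
```

On random toy data like this the scores should sit near zero; the question the thread raises is whether trained experts separate on a measure like this.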
Philip Monk retweeted
ollama@ollama·
.@essential_ai's rnj-1 model is now on Ollama!

ollama run rnj-1

8B parameter, open-weight dense model trained from scratch. The model is optimized for code and STEM with capabilities on par with other state of the art open-weight models. Let's go! 🚀🚀🚀
Philip Monk@pcmonk·
It's open weights and a very convenient size to run locally, btw. I get 20 tok/s on an M3 mac with llama.cpp.
Philip Monk retweeted
Essential AI@essential_ai·
Today, we're excited to introduce Rnj-1, @essential_ai's first open model: a world-class 8B base + instruct pair, built with scientific rigor, intentional design, and a belief that the advancement and equitable distribution of AI depend on building in the open. We bring American open-source on par with the best in the world.
Philip Monk@pcmonk·
@joji_teira I wasn't around for Wang, so I just used a trillion flops to save a trip to wikipedia
Joji Teira@joji_teira·
@pcmonk I used 8" disks on a Wang, and I am not even kidding
Philip Monk@pcmonk·
You all have it so easy today with your petaflop gpus. In my day we had *floppy disks* that could only handle a few hundred kiloflops/s
Philip Monk retweeted
Essential AI@essential_ai·
[1/2] We at Essential are driven by a mission to advance fundamental research, guided by first principles, rigor, and sharing research openly.
Philip Monk@pcmonk·
@hastuc_dibtux I've not looked into those much. It seems a very tall order, and my first question would be what's their story for flash attention
Philip Monk@pcmonk·
@hastuc_dibtux Flash at its core is "just" fusion, but it requires a lot more knowledge than just the shapes, which is what compilers usually get
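A minimal numpy sketch of the tiling + online-softmax trick that this kind of fusion relies on, i.e. why it needs more than shape information: the kernel has to carry running max/denominator state across tiles. Purely illustrative (single head, no masking), not any particular implementation:

```python
import numpy as np

def tiled_attention(q, k, v, tile=128):
    """q: [sq, d], k/v: [sk, d]. Computes softmax(q k^T / sqrt(d)) v one K/V tile
    at a time, never materializing the full [sq, sk] score matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q)
    m = np.full(q.shape[0], -np.inf)   # running row-wise max of the scores
    l = np.zeros(q.shape[0])           # running softmax denominator
    for start in range(0, k.shape[0], tile):
        ks, vs = k[start:start + tile], v[start:start + tile]
        s = (q @ ks.T) * scale                     # scores for this tile only
        m_new = np.maximum(m, s.max(axis=1))
        corr = np.exp(m - m_new)                   # rescale previous accumulators
        p = np.exp(s - m_new[:, None])
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ vs
        m = m_new
    return out / l[:, None]

q, k, v = (np.random.randn(512, 64) for _ in range(3))
s_ref = (q @ k.T) / np.sqrt(64)
p_ref = np.exp(s_ref - s_ref.max(axis=1, keepdims=True))
ref = (p_ref / p_ref.sum(axis=1, keepdims=True)) @ v
print(np.allclose(tiled_attention(q, k, v), ref))  # True
```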
Philip Monk@pcmonk·
@hastuc_dibtux The performance difference between flash v2 and v3 is like double, and that's just adding like ping pong scheduling. It's a deep rabbit hole and also completely non-optional for anything at scale.