"A large touchscreen doesn't work in a car": Sir Jony Ive on designing the Ferrari Luce's interior ➡️ top-gear.visitlink.me/yTpZer

the router in mixture of experts models is a linear layer. it takes a token's hidden state, multiplies it by a weight matrix of shape (num_experts, hidden_dim), softmaxes the result, and picks the top-k experts. that's it.

but why does a matrix multiply "know" which expert to pick? each row of the router matrix is basically a learned prototype for that expert. the dot product measures how similar the token is to that prototype. high score = that expert gets activated.

the cool part is nobody hardcodes what each expert specializes in. during training, gradient descent naturally pushes experts toward specialization because it minimizes loss better that way.

one problem though - without a load balancing auxiliary loss, the router collapses and keeps sending tokens to the same 2-3 experts while the rest rot. that's why every moe paper has some balancing trick.
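the whole mechanism fits in a few lines. here's a minimal numpy sketch of that router plus a Switch-Transformer-style balancing term - the shapes, the seed, and the function names are illustrative assumptions, not any particular model's code:

```python
import numpy as np

def route(hidden, W_router, k=2):
    """hidden: (hidden_dim,) token state; W_router: (num_experts, hidden_dim).
    each row of W_router acts as a learned prototype for one expert."""
    logits = W_router @ hidden                 # one dot product per expert row
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over experts
    topk = np.argsort(probs)[::-1][:k]         # indices of the k highest-scoring experts
    weights = probs[topk] / probs[topk].sum()  # renormalize chosen weights to sum to 1
    return topk, weights, probs

def load_balance_loss(all_probs, top1_ids, num_experts):
    """Switch-Transformer-style auxiliary loss over a batch:
    num_experts * sum_i (fraction of tokens sent to expert i) * (mean router prob for i).
    it is smallest when routing is spread out, which is what stops collapse."""
    f = np.bincount(top1_ids, minlength=num_experts) / len(top1_ids)
    P = all_probs.mean(axis=0)
    return num_experts * float(f @ P)

# tiny illustrative run with made-up sizes
rng = np.random.default_rng(0)
num_experts, hidden_dim, num_tokens = 8, 16, 64
W = rng.standard_normal((num_experts, hidden_dim)) * 0.1
H = rng.standard_normal((num_tokens, hidden_dim))

all_probs, top1 = [], []
for h in H:
    topk, w, p = route(h, W, k=2)
    all_probs.append(p)
    top1.append(topk[0])
aux = load_balance_loss(np.array(all_probs), np.array(top1), num_experts)
```

note the aux loss is differentiable through P (the mean router probabilities) even though the hard top-1 counts in f are not - that's the usual trick that lets the gradient nudge the router toward balance.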


liked 3 tweets of my tweetlikingtuationship and they liked 5 back



I will never understand the rationalization of cooking the ground beef BEFORE the onions. Onions first!!
