EDITH
@Infopulsed
Singularitarian/post-humanism/technocratic hedonism/ML engineer and researcher/physicist
10K posts

It was Tyler Robinson.


Introducing Expert Threshold Routing:
- ✅ load balance
- ✅ dynamic computation
- ✅ autoregressive
- ✅ zero train-inference mismatch

At 2.4B params, Expert Threshold achieves 0.067 lower CE loss than Token Choice (equivalent to 1.6× data efficiency).
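A minimal sketch of what threshold-based routing might look like, assuming the usual contrast with Token Choice (fixed top-k per token). The function name, sigmoid gating, and fallback behavior are all assumptions for illustration, not the post's actual implementation:

```python
import math

def threshold_route(router_logits, threshold=0.5):
    """Hypothetical sketch of threshold-based expert routing.

    Token Choice picks a fixed top-k of experts per token; a threshold
    router instead activates every expert whose gate score clears a
    threshold, so per-token compute is dynamic. The decision uses only
    the current token's logits (no batch-level balancing pass), which
    keeps routing autoregressive and train/inference-consistent.
    """
    # Sigmoid gate score per expert.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in router_logits]
    mask = [s > threshold for s in scores]
    active = [s if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(active)
    # Renormalise over active experts; a token activating no expert
    # contributes zero (a real system might fall back to top-1 here).
    weights = [a / total if total > 0 else 0.0 for a in active]
    return weights, mask
```

With logits `[2.0, -2.0, 0.1]`, only the first and third experts clear the 0.5 gate, so compute varies per token rather than being fixed at k experts.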

Incorporating SFT data during pretraining is more effective for downstream finetuning than the plain pretrain-then-finetune scheme, even when the latter uses replay during finetuning. But the ratio of SFT data mixed into pretraining should depend on the pretraining token budget. They built a scaling law for this.
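A toy sketch of the mixing scheme the post describes: interleave SFT examples into the pretraining stream at a fixed ratio. The `sft_ratio` knob is the quantity their scaling law would set as a function of the token budget; the function and parameter names here are assumptions:

```python
import random

def mixed_stream(pretrain_docs, sft_docs, sft_ratio, seed=0):
    """Hypothetical sketch: inject SFT examples into the pretraining
    stream at rate `sft_ratio`. The post argues this ratio should be
    chosen per the pretraining token budget (via their scaling law),
    rather than fixed across runs.
    """
    rng = random.Random(seed)
    sft_iter = iter(sft_docs)
    for doc in pretrain_docs:
        if rng.random() < sft_ratio:
            try:
                yield next(sft_iter)
            except StopIteration:
                pass  # SFT set exhausted; continue with pretraining only
        yield doc
```

Every pretraining document still flows through; SFT examples are sprinkled in at roughly the target rate, which is the simplest way to realize a token-budget-dependent mixing ratio.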

For only 20 mil, @arcee_ai pulled off a decent open-weights model. Why are people not talking about this?
