The Nurse Engineer🇳🇬

1.1K posts

@boochi_dot_dev

• ICU Nurse • Computer Scientist • NeuralMind • ML Engineer

Portsmouth, England · Joined October 2017
640 Following · 328 Followers
Pinned Tweet
The Nurse Engineer🇳🇬@boochi_dot_dev·
1/ Most engineers are sleeping on why DeepSeek-V4 is a "black swan" moment for local AI. Guys, look into the future: we are moving from 64K context to 1M context on consumer-grade GPUs (RTX 3090/4090). A Twitter thread 🧵
Arjun@arjunkocher·
Kimi is also Yang Zhilin’s English name.
The Nurse Engineer🇳🇬@boochi_dot_dev·
@thegenioo There’s nothing special about Composer 2.5; it’s just more compute and quality training data. Most open-source labs don’t have this, which is sad cos they actually do the hard work on pre-training and neural algorithm research that pushes the field forward.
Hamza@thegenioo·
Honestly I think Moonshot fumbled big time with Kimi K2.6.
Their previous model K2.5 was just so good, and it just needed that perfect polish and a few bits of upgrades to get a really strong, cheap model. And you know what? That exists! Yes, it is Composer 2.5. This is what Moonshot should have done: K2.5 should have been polished like Composer 2.5 and released as K2.6.
Now don't get me wrong, K2.6 is a very powerful and strong model, but it has some issues:
- It just overthinks the hell out of things and gets stuck in long, endless thinking loops
- It is unbelievably slow, like really slow
- It is good, but I found DeepSeek and Qwen models more efficient, workable, and faster
So what Cursor has pulled off here with Composer 2.5 should have been done by Moonshot with the release of Kimi K2.6, and I hope they fix these issues with K3.
Habdulakeem Bhadmus@Mrbhadoosky·
Immigrants in the USA = Alien
Immigrants in the UK = Boriswave
Life tough everywhere.
The Nurse Engineer🇳🇬@boochi_dot_dev·
@neural_avb I also forgot to add that the perplexity score is directly tied to the value of the loss function in next-token prediction tasks. So the best checkpoint ends up with both the lowest perplexity and the lowest loss value.
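The link between perplexity and the loss can be written out exactly. The following is a short worked form under notation assumed here rather than taken from the thread: $N$ tokens $x_1, \dots, x_N$, model distribution $p_\theta$, and natural logarithms.

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i}),
\qquad
\mathrm{PPL} = \exp(\mathcal{L})
```

Because $\exp$ is strictly increasing, ranking checkpoints by perplexity or by average cross-entropy loss on the same tokenized data gives the same order.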
The Nurse Engineer🇳🇬@boochi_dot_dev·
I’m not a PhD expert yet, but from my experience, this is what pre-training engineers look at: perplexity score.
If you have 5 to 10 checkpoints of a pre-trained base LLM, you curate a set of text completion benchmarks across multiple categories such as coding, math, and text generation. Then you validate each checkpoint independently on the benchmarks using perplexity. Note that perplexity tells you how uncertain an LLM is about its predictions. Whichever checkpoint returns the lowest perplexity score (i.e., the lowest uncertainty) becomes the champion or winner checkpoint.
One piece of evidence for this approach was documented in the Cursor Composer 2 technical report, where they stated that Kimi K2.5 was chosen as the base LLM for post-training their Composer model over others (GLM-5.1, DeepSeek-V3.2, Qwen3-235B) because it returned a lower perplexity score on their internal Cursor benchmark for coding-related tasks.
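The selection procedure described above can be sketched in a few lines. This is a minimal illustration, not any lab's actual pipeline: the checkpoint names, categories, and per-token log-probabilities are made-up placeholders, and `perplexity` and `pick_checkpoint` are hypothetical helper names. It assumes each checkpoint has already been run over the benchmark texts to get natural-log probabilities for the reference tokens.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_checkpoint(logprobs_by_checkpoint):
    """Return (best_name, scores); best has the lowest pooled perplexity."""
    scores = {}
    for name, by_category in logprobs_by_checkpoint.items():
        # Pool tokens from all categories, so longer categories weigh more.
        pooled = [lp for lps in by_category.values() for lp in lps]
        scores[name] = perplexity(pooled)
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical per-token log-probabilities (illustrative values only).
logprobs_by_checkpoint = {
    "ckpt_a": {"coding": [-0.9, -1.2, -0.7], "math": [-1.5, -1.1]},
    "ckpt_b": {"coding": [-0.5, -0.8, -0.6], "math": [-1.0, -0.9]},
}

best, scores = pick_checkpoint(logprobs_by_checkpoint)
print(best, scores)
```

Pooling all tokens is one design choice; averaging per-category perplexities instead would keep a long category from dominating. Per-token perplexity is also only directly comparable between models that share a tokenizer, so comparing different model families usually needs a normalization such as per-byte or per-character scoring.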
AVB@neural_avb·
Pretraining is the most mysterious aspect of LM training for me... All I know about it is: "feed the whole internet's data into it and train next token prediction". Feels too simple...
More importantly, do you actually measure intelligence at that layer, other than basic token overlap (ROUGE/BLEU/F1)? The last time a base model was reallllly studied was GPT-3... but no one talks or writes papers about this anymore.
We all know pretraining is where the model gathers the most world knowledge, albeit in an unstructured way. At that stage the model is already capable of in-context learning and recognizing tasks/patterns. I wanna know what kind of ICL experiments the big labs evaluate their pretrained base models on before going to the next stage. (Or do they?)
The Nurse Engineer🇳🇬@boochi_dot_dev·
Hot take: You don’t have to love Yann LeCun to see the genius in the JEPA architecture. Theoretically, it’s a perfect match for:
• Recommender Systems
• Intelligent Tutoring
• Neural Adaptive Testing
But will it kill Transformers? No. Autoregressive models will absolutely remain SOTA for mainstream GenAI (text, images, etc.). Both can win. 🤝
The Nurse Engineer🇳🇬@boochi_dot_dev·
Whether Gemini’s underlying architecture is sparse MoE or dense remains, in the strict sense, ignoramus et ignorabimus: unknown and likely unknowable.
The Nurse Engineer🇳🇬@boochi_dot_dev·
@djasnive @shiri_shh That flash is at least 1.5 trillion parameters (going by estimates from its API pricing), and yet it still underperforms Composer 2.5, which is 1 trillion parameters.
The Nurse Engineer🇳🇬@boochi_dot_dev·
Reinforcement Learning looks very simple and straightforward in its code implementation; it’s almost as if nothing serious is happening until you read the mathematical formulation behind it.
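To illustrate the contrast, here is a minimal sketch of one tabular Q-learning update. Q-learning is picked here only as an example, since the tweet does not name an algorithm, and the state names, action names, and `q_learning_update` helper are hypothetical. The whole learning rule fits in four lines, while its justification (the Bellman optimality equation and the conditions under which the iteration converges) lives entirely in the math.

```python
from collections import defaultdict

def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning step in place and return the new Q(s, a).

    Rule: Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    td_error = td_target - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]

# Q-table with every unseen (state, action) pair starting at 0.0.
q = defaultdict(float)
actions = ["left", "right"]

# Two identical hypothetical transitions: s0 --right--> s1 with reward 1.0.
first = q_learning_update(q, "s0", "right", 1.0, "s1", actions)
second = q_learning_update(q, "s0", "right", 1.0, "s1", actions)
print(first, second)
```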
The Nurse Engineer🇳🇬@boochi_dot_dev·
@Thyndd @pmddomingos Good points, but why does the parallel approach (i.e., attention) always perform better than the sequential approach (RNN)? We know the parallel one should/would be faster to train, yes. But why is it giving better results in terms of performance?
Thyndd@Thyndd·
@boochi_dot_dev @pmddomingos But that has nothing to do with attention. Attention actually originates in the context of RNNs. Transformers just said, let's get rid of the R. So it's not attention that makes transformers perform better, it's the fact that they're highly parallelizable and stable to train.
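The parallel versus sequential distinction in this exchange can be made concrete with a toy sketch. It uses plain Python, no learned weights, and simplified stand-ins: non-causal self-attention with Q = K = V, and an elementwise tanh RNN. The function names and inputs are illustrative assumptions. The sketch shows only the dependency structure of the two computations and says nothing about which one yields better models.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Scaled dot-product self-attention with Q = K = V = xs.

    Each output row depends only on the inputs, never on another output
    row, so all rows could be computed at the same time.
    """
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in xs]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, xs))
                    for j in range(d)])
    return out

def simple_rnn(xs, w_h=0.5, w_x=0.5):
    """Toy elementwise RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t).

    Step t needs h_{t-1}, so the time loop is inherently sequential.
    """
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [math.tanh(w_h * hj + w_x * xj) for hj, xj in zip(h, x)]
        states.append(h)
    return states

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-d token vectors
attn_out = self_attention(xs)
rnn_out = simple_rnn(xs)
print(attn_out)
print(rnn_out)
```

In a real Transformer the loop over `q` becomes a single matrix product, which is what makes training over long sequences parallel on a GPU. The RNN's loop over time steps cannot be collapsed the same way, because each hidden state feeds the next.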
Benson@IntegralOye·
@OlawAlausa @boochi_dot_dev @shelovesore His own made it clearer: she didn’t tell us what it is replacing and why. Just read the second to the last and also the last paragraph again.
khaleesi🧍🏽‍♀️
Colon (:) introduces something. You usually use it when what comes after is a list, or an explanation, or even a reveal. An example is “she had one rule: never apologize” or “this is a list of things you should get from the market: eggs, tomatoes…”. It’s also used as the eyes in a smiley face :)
Semicolon (;) connects two complete thoughts that are related but could stand alone as separate sentences. It’s stronger than a comma but softer than a full stop. An example is “Frank never apologized; he didn't think he was wrong”. It’s also used as the eyes in a winking face ;)
Chioma Genia@eugeeyy

What's the difference between : and ;
