NobodyExistsOnTheInternet

542 posts

@nullvaluetensor

Human Large Language model. Skills: Distill data. Training LLMs. Test and Evaluate. Rinse and repeat as required. Based in SEA.

SEA · Joined November 2023
98 Following · 661 Followers
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
in fp4 and with DSV4-level cache compression this would mean that one "GPU" module will be able to run a frontier model on par with 5.5 or Mythos alone. But of course it won't be working alone; every dimension is going to be scaled up.
fin@fi56622380

@AnalysisOp Yes, this is happening in the REAL HBM roadmap; the industry is actually aggressively pulling in the roadmap, shortening the time period between generations.

2 replies · 1 repost · 59 likes · 4.8K views
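The fp4 claim above is essentially memory arithmetic: weight bytes scale with bits per parameter, and KV-cache bytes scale with how aggressively the cache is compressed. A minimal back-of-the-envelope sketch; every concrete number below is a hypothetical placeholder, not a figure from the post:

```python
# Rough memory arithmetic behind "fp4 + cache compression fits a frontier model
# in one GPU module". All concrete sizes below are hypothetical placeholders.

def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Raw weight memory in GB for a given parameter count and precision."""
    return n_params * bits_per_param / 8 / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_dim: int,
                bytes_per_elem: float, compression: float) -> float:
    """KV-cache memory in GB: 2 (K and V) * layers * kv_dim bytes per token,
    divided by a cache-compression factor (e.g. a latent-compression scheme)."""
    return 2 * tokens * layers * kv_dim * bytes_per_elem / compression / 1e9

# Hypothetical 1T-parameter model: bf16 vs fp4 weights.
print(weight_gb(1e12, 16))  # ~2000 GB in bf16
print(weight_gb(1e12, 4))   # ~500 GB in fp4

# Hypothetical 128K-token cache, 64 layers, kv_dim 8192, 1 byte/elem,
# uncompressed vs with a 4x cache-compression factor.
print(kv_cache_gb(128_000, 64, 8192, 1.0, 1.0))  # ~134 GB uncompressed
print(kv_cache_gb(128_000, 64, 8192, 1.0, 4.0))  # ~34 GB compressed
```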
DVB
DVB@DeepValueBagger·
Here's a close-up of Lip-Bu Tan, CEO of $INTC, putting the honorary doctoral hood on Jensen Huang, CEO of $NVDA, at Carnegie Mellon. HOW DO WE FEEL?
DVB tweet media
24 replies · 15 reposts · 296 likes · 28.3K views
Wyatt Walls
Wyatt Walls@lefthanddraft·
Anyone know what this button actually does? If it's on, ChatGPT uses web search. If it's off, ChatGPT uses web search.
Wyatt Walls tweet media
42 replies · 4 reposts · 1K likes · 117.1K views
NobodyExistsOnTheInternet reposted
mr-r0b0t
mr-r0b0t@mr_r0b0t·
Here's a quick video explaining Google's new MTP speculative decoding process. Fortunately, the @NousResearch @HyperFrames_ skill took this fairly complex topic and turned it into an easy to understand video for us!
2 replies · 2 reposts · 18 likes · 891 views
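The post above refers to multi-token-prediction (MTP) heads used for speculative decoding: a cheap draft head proposes several future tokens and the full model verifies them, keeping the longest agreeing prefix. A minimal greedy-verification sketch of that general idea; `draft_head` and `full_model` are hypothetical stand-ins, not any vendor's API, and the loop would be a single batched forward pass in a real system:

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_head: Callable[[List[int], int], List[int]],
                     full_model: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    """One greedy speculative-decoding step.

    draft_head cheaply proposes k tokens (e.g. an MTP head); full_model returns
    the greedy next token for a prefix. Drafted tokens are checked in order and
    the longest agreeing prefix is kept; on a mismatch the full model's token is
    used instead, so the step never does worse than ordinary greedy decoding.
    """
    drafted = draft_head(prefix, k)
    accepted: List[int] = []
    for tok in drafted:
        target = full_model(prefix + accepted)
        if tok == target:
            accepted.append(tok)           # draft agrees with the full model
        else:
            accepted.append(target)        # reject the draft, keep the model's token
            break
    else:
        accepted.append(full_model(prefix + accepted))  # bonus token after a full accept
    return accepted

# Toy stand-ins: the draft always guesses 7, the "full model" emits the prefix
# length, so no draft matches and the step falls back to one greedy token.
print(speculative_step([1, 2, 3],
                       draft_head=lambda p, k: [7] * k,
                       full_model=lambda p: len(p)))  # -> [3]
```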
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
I suspect that V4-Pro is distilled *from* Flash-based experts, not the other way around. Which can explain a lot. V4-Flash was decently post-trained by 11th Feb already. They say final model is done with OPD. *What else* could be the teacher? V3.2 can't support >128K contexts.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) tweet media
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Another eval where V4-Pro and V4-Flash are basically identical (while GLM 5.1=58.1, MiMo 2.5 Pro=66.4, GPT 5.5=77.8). DS's paper says "…V4-Flash-Max matches the performance of V4-Pro-Max [on several benchmarks]". And except for knowledge, base models are already ≈matched.

3 replies · 1 repost · 78 likes · 19.9K views
stochasm
stochasm@stochasticchasm·
my take is that this is probably something like a linear lightning indexer version of DSA. matches the new blog claims of linear memory (not constant state size like an SSM) and once you hit k tokens, DSA attn is linear. and also importantly, they can start from GLM-5
Alexander Whedon@alex_whedon

Introducing SubQ - a major breakthrough in LLM intelligence. It is the first model built on a fully sub-quadratic sparse-attention architecture (SSA), and the first frontier model with a 12 million token context window, which is:
- 52x faster than FlashAttention at 1MM tokens
- Less than 5% the cost of Opus
Transformer-based LLMs waste compute by processing every possible relationship between words (standard attention). Only a small fraction actually matter. @subquadratic finds and focuses only on the ones that do. That's nearly 1,000x less compute and a new way for LLMs to scale.

9 replies · 3 reposts · 82 likes · 10.3K views
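The "linear lightning indexer" reading above describes sparse attention where a cheap scoring pass (the indexer) picks a fixed top-k subset of past tokens for each query, so per-token attention cost stops growing once the context exceeds k. A minimal single-query sketch of that selection step, using a simple low-dimensional dot-product indexer as an illustration of the general idea, not DeepSeek's actual DSA kernel:

```python
import numpy as np

def sparse_attention_topk(q, K, V, idx_q, idx_K, k=64):
    """Single-query sparse attention with a cheap 'indexer' selection pass.

    q: (d,) query; K, V: (T, d) cached keys/values for T past tokens.
    idx_q: (d_idx,) low-dim indexer query; idx_K: (T, d_idx) indexer keys.
    The indexer scores all T tokens cheaply, keeps only the top-k, and full
    attention runs over that fixed-size subset, so the expensive part is O(k)
    per token (plus the light O(T) scoring pass) instead of full O(T) attention.
    """
    scores = idx_K @ idx_q                               # cheap relevance scores, shape (T,)
    top = np.argsort(scores)[-min(k, len(scores)):]      # indices of the k best-scoring tokens
    K_sel, V_sel = K[top], V[top]
    logits = K_sel @ q / np.sqrt(K.shape[1])             # standard scaled dot-product on the subset
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V_sel                                     # weighted sum over selected values

# Hypothetical usage: 10k cached tokens, d=128, with a 16-dim indexer.
T, d, d_idx = 10_000, 128, 16
rng = np.random.default_rng(0)
out = sparse_attention_topk(rng.normal(size=d), rng.normal(size=(T, d)),
                            rng.normal(size=(T, d)), rng.normal(size=d_idx),
                            rng.normal(size=(T, d_idx)))
print(out.shape)  # (128,)
```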
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Oh-ho. This isn't actually a breakthrough, however – Opus 4.6 famously sported 76%, and with 4.7 Anthropic just said "it's always been a bad benchmark". I remember that in Chinese evals of V4-Flash, they said its MRCR perf looks like very shallow tracking. Still, let them have a go.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) tweet media
Alexander Whedon@alex_whedon

Introducing SubQ - a major breakthrough in LLM intelligence. It is the first model built on a fully sub-quadratic sparse-attention architecture (SSA), and the first frontier model with a 12 million token context window, which is:
- 52x faster than FlashAttention at 1MM tokens
- Less than 5% the cost of Opus
Transformer-based LLMs waste compute by processing every possible relationship between words (standard attention). Only a small fraction actually matter. @subquadratic finds and focuses only on the ones that do. That's nearly 1,000x less compute and a new way for LLMs to scale.

5 replies · 2 reposts · 141 likes · 15.5K views
Wu Haoning
Wu Haoning@HaoningTimothy·
this is literally insane that we truly see an open model thinking 200-300k tokens for math problems…
Wu Haoning tweet media
1 reply · 0 reposts · 109 likes · 12.2K views
secemp
secemp@secemp9·
There is a 2-expert MoE layer, E_1 and E_2 ex-ante symmetric. Gate logits z_i = W_g h_i, top-1 routing r_i = argmax_k softmax(z_i)_k. Each token's routing is its private vote, f_k = (1/N) Σ_i 1[r_i = k]. The rule: if f_2 > 0.5, every token gets y_i = E_{r_i}(h_i); else E_2 tokens are zeroed and E_1 tokens pass. E[ℓ_i^{E_1}] ≤ E[ℓ_i^{E_2}] for every f_2, so E_1 weakly dominates and empirical risk alone drives f_1 → 1. Switch Transformer's L_aux = α N Σ_k f_k P_k, with P_k = (1/N) Σ_i p_i^k, pulls against it. f_1 → 1 also satisfies survival, because the majority routes to E_1, so "only E_1 survives" = all survivors survive. With L_aux, the load-balanced optimum sits at f_2 = 0.5 + ε; a knife-edge, because the rule is a strict >, and one defection to E_1 zeros every E_2 token. Soften 1[f_2 > 0.5] to σ(β(f_2 − 0.5)); then ∇_θ L is nonzero at the boundary and the gate flows to the interior f_2* = 1/(1 + exp(−β(L_1 − L_2))) from ∂E[ℓ]/∂f_2 = 0. Which expert do you route to?
4 replies · 0 reposts · 8 likes · 503 views
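To make the standard quantities in the post above concrete: top-1 routing from gate logits, the load fractions f_k, the mean gate probabilities P_k, and the Switch-Transformer-style balancing loss L_aux = α · N · Σ_k f_k · P_k, where N here is the number of experts. The toy shapes and inputs are hypothetical, and the post's survival rule and softened gate are its own thought experiment, not part of a standard layer:

```python
import numpy as np

def two_expert_gate_stats(H, W_g, alpha=0.01):
    """Top-1 routing stats and Switch-style load-balancing loss for 2 experts.

    H: (N_tokens, d) token representations h_i; W_g: (d, 2) gate weights.
    Returns routing choices r_i, load fractions f_k, mean gate probs P_k,
    and L_aux = alpha * n_experts * sum_k f_k * P_k.
    """
    z = H @ W_g                                            # gate logits z_i = W_g h_i
    z = z - z.max(axis=1, keepdims=True)                   # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   # gate probabilities p_i^k
    r = p.argmax(axis=1)                                   # top-1 routing r_i
    n_experts = p.shape[1]
    f = np.bincount(r, minlength=n_experts) / len(r)       # f_k: fraction routed to expert k
    P = p.mean(axis=0)                                     # P_k: mean gate probability for k
    l_aux = alpha * n_experts * float(f @ P)
    return r, f, P, l_aux

# Toy check: with perfectly balanced routing, f = (0.5, 0.5) and P ≈ (0.5, 0.5),
# so L_aux ≈ alpha * 2 * 0.5 = alpha, the minimum the balancing term pulls toward.
rng = np.random.default_rng(0)
r, f, P, l_aux = two_expert_gate_stats(rng.normal(size=(1000, 8)),
                                       rng.normal(size=(8, 2)))
print(f, P, round(l_aux, 4))
```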
Tomo
Tomo@Tomodovodoo·
For reference, I have probably sent at least ~250 GPT-5.5 Pro requests since 3pm today, so fair play, but still
1 reply · 0 reposts · 3 likes · 282 views
Ravid Shwartz Ziv
Ravid Shwartz Ziv@ziv_ravid·
So you're telling me that the model Anthropic refused to release because it was too dangerous for humanity is, per AISI, within the margin of error of the model OpenAI just shipped to anyone with $20? So... does that mean the whole Mythos saga wasn't actually about it being too dangerous? Asking for a friend...
AI Security Institute@AISecurityInst

OpenAI’s GPT-5.5 is the second model to complete one of our multi-step cyber-attack simulations end-to-end 🧵

9 replies · 3 reposts · 61 likes · 7.8K views
PicoCreator - AI builder @ AIE 🇸🇬
AI is concentrating in a few countries, a few companies, a few chips. Open-source AI is the only real check on that, and it only works if the infrastructure to run it actually exists. Today, we're announcing our Series A, $20M, to make this a reality.
21 replies · 35 reposts · 114 likes · 25.2K views