NobodyExistsOnTheInternet

542 posts

@nullvaluetensor

Human Large Language model. Skills: Distill data. Training LLMs. Test and Evaluate. Rinse and repeat as required. Based in SEA.

SEA · Joined November 2023
98 Following · 661 Followers
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
in fp4 and with DSV4-level cache compression this would mean that one "GPU" module will be able to run a frontier model on par with 5.5 or Mythos alone. But of course it won't be working alone; every dimension is going to be scaled up.
fin@fi56622380

@AnalysisOp Yes, this is happening in the REAL HBM roadmap; the industry is actually aggressively pulling in the roadmap, shortening the time period between generations.

2 replies · 1 repost · 59 likes · 4.8K views
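The fp4 claim above is essentially memory arithmetic: weight bytes scale with bits per parameter, and KV-cache bytes scale with how aggressively the cache is compressed. A minimal back-of-the-envelope sketch; every concrete number below is a hypothetical placeholder, not a figure from the post:

```python
# Rough memory arithmetic behind "fp4 + cache compression fits a frontier model
# in one GPU module". All concrete sizes below are hypothetical placeholders.

def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Raw weight memory in GB for a given parameter count and precision."""
    return n_params * bits_per_param / 8 / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_dim: int,
                bytes_per_elem: float, compression: float) -> float:
    """KV-cache memory in GB: 2 (K and V) * layers * kv_dim bytes per token,
    divided by a cache-compression factor (e.g. a latent-compression scheme)."""
    return 2 * tokens * layers * kv_dim * bytes_per_elem / compression / 1e9

# Hypothetical 1T-parameter model: bf16 vs fp4 weights.
print(weight_gb(1e12, 16))  # ~2000 GB in bf16
print(weight_gb(1e12, 4))   # ~500 GB in fp4

# Hypothetical 128K-token cache, 64 layers, kv_dim 8192, 1 byte/elem,
# uncompressed vs with a 4x cache-compression factor.
print(kv_cache_gb(128_000, 64, 8192, 1.0, 1.0))  # ~134 GB uncompressed
print(kv_cache_gb(128_000, 64, 8192, 1.0, 4.0))  # ~34 GB compressed
```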
DVB
DVB@DeepValueBagger·
Here's a close-up of Lip-Bu Tan, CEO of $INTC, putting the honorary doctoral hood on Jensen Huang, CEO of $NVDA, at Carnegie Mellon. HOW DO WE FEEL?
DVB tweet media
24 replies · 15 reposts · 296 likes · 28.3K views
Wyatt Walls
Wyatt Walls@lefthanddraft·
Anyone know what this button actually does? If it's on, ChatGPT uses web search. If it's off, ChatGPT uses web search.
Wyatt Walls tweet media
42 replies · 4 reposts · 1K likes · 117.1K views
NobodyExistsOnTheInternet reposted
mr-r0b0t
mr-r0b0t@mr_r0b0t·
Here's a quick video explaining Google's new MTP speculative decoding process. Fortunately, the @NousResearch @HyperFrames_ skill took this fairly complex topic and turned it into an easy to understand video for us!
2 replies · 2 reposts · 18 likes · 891 views
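The post above refers to multi-token-prediction (MTP) heads used for speculative decoding: a cheap draft head proposes several future tokens and the full model verifies them, keeping the longest agreeing prefix. A minimal greedy-verification sketch of that general idea; `draft_head` and `full_model` are hypothetical stand-ins, not any vendor's API, and the loop would be a single batched forward pass in a real system:

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_head: Callable[[List[int], int], List[int]],
                     full_model: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    """One greedy speculative-decoding step.

    draft_head cheaply proposes k tokens (e.g. an MTP head); full_model returns
    the greedy next token for a prefix. Drafted tokens are checked in order and
    the longest agreeing prefix is kept; on a mismatch the full model's token is
    used instead, so the step never does worse than ordinary greedy decoding.
    """
    drafted = draft_head(prefix, k)
    accepted: List[int] = []
    for tok in drafted:
        target = full_model(prefix + accepted)
        if tok == target:
            accepted.append(tok)           # draft agrees with the full model
        else:
            accepted.append(target)        # reject the draft, keep the model's token
            break
    else:
        accepted.append(full_model(prefix + accepted))  # bonus token after a full accept
    return accepted

# Toy stand-ins: the draft always guesses 7, the "full model" emits the prefix
# length, so no draft matches and the step falls back to one greedy token.
print(speculative_step([1, 2, 3],
                       draft_head=lambda p, k: [7] * k,
                       full_model=lambda p: len(p)))  # -> [3]
```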
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
I suspect that V4-Pro is distilled *from* Flash-based experts, not the other way around. Which can explain a lot. V4-Flash was decently post-trained by 11th Feb already. They say final model is done with OPD. *What else* could be the teacher? V3.2 can't support >128K contexts.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) tweet media
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Another eval where V4-Pro and V4-Flash are basically identical (while GLM 5.1=58.1, MiMo 2.5 Pro=66.4, GPT 5.5=77.8). DS's paper says "…V4-Flash-Max matches the performance of V4-Pro-Max [on several benchmarks]". And except for knowledge, base models are already ≈matched.

3 replies · 1 repost · 78 likes · 19.9K views
stochasm
stochasm@stochasticchasm·
my take is that this is probably something like a linear lightning indexer version of DSA. matches the new blog claims of linear memory (not constant state size like an SSM) and once you hit k tokens, DSA attn is linear. and also importantly, they can start from GLM-5
Alexander Whedon@alex_whedon

Introducing SubQ - a major breakthrough in LLM intelligence. It is the first model built on a fully sub-quadratic sparse-attention architecture (SSA), and the first frontier model with a 12 million token context window, which is:
- 52x faster than FlashAttention at 1MM tokens
- Less than 5% the cost of Opus
Transformer-based LLMs waste compute by processing every possible relationship between words (standard attention). Only a small fraction actually matter. @subquadratic finds and focuses only on the ones that do. That's nearly 1,000x less compute and a new way for LLMs to scale.

9 replies · 3 reposts · 82 likes · 10.3K views
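The "linear lightning indexer" reading above describes sparse attention where a cheap scoring pass (the indexer) picks a fixed top-k subset of past tokens for each query, so per-token attention cost stops growing once the context exceeds k. A minimal single-query sketch of that selection step, using a simple low-dimensional dot-product indexer as an illustration of the general idea, not DeepSeek's actual DSA kernel:

```python
import numpy as np

def sparse_attention_topk(q, K, V, idx_q, idx_K, k=64):
    """Single-query sparse attention with a cheap 'indexer' selection pass.

    q: (d,) query; K, V: (T, d) cached keys/values for T past tokens.
    idx_q: (d_idx,) low-dim indexer query; idx_K: (T, d_idx) indexer keys.
    The indexer scores all T tokens cheaply, keeps only the top-k, and full
    attention runs over that fixed-size subset, so the expensive part is O(k)
    per token (plus the light O(T) scoring pass) instead of full O(T) attention.
    """
    scores = idx_K @ idx_q                               # cheap relevance scores, shape (T,)
    top = np.argsort(scores)[-min(k, len(scores)):]      # indices of the k best-scoring tokens
    K_sel, V_sel = K[top], V[top]
    logits = K_sel @ q / np.sqrt(K.shape[1])             # standard scaled dot-product on the subset
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V_sel                                     # weighted sum over selected values

# Hypothetical usage: 10k cached tokens, d=128, with a 16-dim indexer.
T, d, d_idx = 10_000, 128, 16
rng = np.random.default_rng(0)
out = sparse_attention_topk(rng.normal(size=d), rng.normal(size=(T, d)),
                            rng.normal(size=(T, d)), rng.normal(size=d_idx),
                            rng.normal(size=(T, d_idx)))
print(out.shape)  # (128,)
```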
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Oh-ho. This isn't actually a breakthrough, however – Opus 4.6 famously sported 76%, and with 4.7 Anthropic just said "it's always been a bad benchmark". I remember that in Chinese evals of V4-Flash, they said its MRCR perf looks like very shallow tracking. Still, let them have a go.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) tweet media
Alexander Whedon@alex_whedon

Introducing SubQ - a major breakthrough in LLM intelligence. It is the first model built on a fully sub-quadratic sparse-attention architecture (SSA), and the first frontier model with a 12 million token context window, which is:
- 52x faster than FlashAttention at 1MM tokens
- Less than 5% the cost of Opus
Transformer-based LLMs waste compute by processing every possible relationship between words (standard attention). Only a small fraction actually matter. @subquadratic finds and focuses only on the ones that do. That's nearly 1,000x less compute and a new way for LLMs to scale.

5 replies · 2 reposts · 141 likes · 15.5K views
Wu Haoning
Wu Haoning@HaoningTimothy·
this is literally insane that we truly see an open model thinking 200-300k tokens for math problems…
Wu Haoning tweet media
1 reply · 0 reposts · 109 likes · 12.2K views
secemp
secemp@secemp9·
There is a 2-expert MoE layer, E_1 and E_2 ex-ante symmetric. Gate logits z_i = W_g h_i, top-1 routing r_i = argmax_k softmax(z_i)_k. Each token's routing is its private vote, f_k = (1/N) Σ_i 1[r_i = k]. The rule: if f_2 > 0.5, every token gets y_i = E_{r_i}(h_i); else E_2 tokens are zeroed and E_1 tokens pass. E[ℓ_i^{E_1}] ≤ E[ℓ_i^{E_2}] for every f_2, so E_1 weakly dominates and empirical risk alone drives f_1 → 1. Switch Transformer's L_aux = α N Σ_k f_k P_k, with P_k = (1/N) Σ_i p_i^k, pulls against it. f_1 → 1 also satisfies survival, because the majority routes to E_1, so "only E_1 survives" = all survivors survive. With L_aux, the load-balanced optimum sits at f_2 = 0.5 + ε; a knife-edge, because the rule is a strict >, and one defection to E_1 zeros every E_2 token. Soften 1[f_2 > 0.5] to σ(β(f_2 − 0.5)); then ∇_θ L is nonzero at the boundary and the gate flows to the interior f_2* = 1/(1 + exp(−β(L_1 − L_2))) from ∂E[ℓ]/∂f_2 = 0. Which expert do you route to?
4 replies · 0 reposts · 8 likes · 503 views
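To make the standard quantities in the post above concrete: top-1 routing from gate logits, the load fractions f_k, the mean gate probabilities P_k, and the Switch-Transformer-style balancing loss L_aux = α · N · Σ_k f_k · P_k, where N here is the number of experts. The toy shapes and inputs are hypothetical, and the post's survival rule and softened gate are its own thought experiment, not part of a standard layer:

```python
import numpy as np

def two_expert_gate_stats(H, W_g, alpha=0.01):
    """Top-1 routing stats and Switch-style load-balancing loss for 2 experts.

    H: (N_tokens, d) token representations h_i; W_g: (d, 2) gate weights.
    Returns routing choices r_i, load fractions f_k, mean gate probs P_k,
    and L_aux = alpha * n_experts * sum_k f_k * P_k.
    """
    z = H @ W_g                                            # gate logits z_i = W_g h_i
    z = z - z.max(axis=1, keepdims=True)                   # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   # gate probabilities p_i^k
    r = p.argmax(axis=1)                                   # top-1 routing r_i
    n_experts = p.shape[1]
    f = np.bincount(r, minlength=n_experts) / len(r)       # f_k: fraction routed to expert k
    P = p.mean(axis=0)                                     # P_k: mean gate probability for k
    l_aux = alpha * n_experts * float(f @ P)
    return r, f, P, l_aux

# Toy check: with perfectly balanced routing, f = (0.5, 0.5) and P ≈ (0.5, 0.5),
# so L_aux ≈ alpha * 2 * 0.5 = alpha, the minimum the balancing term pulls toward.
rng = np.random.default_rng(0)
r, f, P, l_aux = two_expert_gate_stats(rng.normal(size=(1000, 8)),
                                       rng.normal(size=(8, 2)))
print(f, P, round(l_aux, 4))
```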
Tomo
Tomo@Tomodovodoo·
For reference, I have probably sent at least ~250 GPT-5.5 Pro requests since 3pm today, so fair play, but still
1 reply · 0 reposts · 3 likes · 282 views
Ravid Shwartz Ziv
Ravid Shwartz Ziv@ziv_ravid·
So you're telling me that the model Anthropic refused to release because it was too dangerous for humanity is, per AISI, within the margin of error of the model OpenAI just shipped to anyone with $20? So... does that mean the whole Mythos saga wasn't actually about it being too dangerous? Asking for a friend...
AI Security Institute@AISecurityInst

OpenAI’s GPT-5.5 is the second model to complete one of our multi-step cyber-attack simulations end-to-end 🧵

9 replies · 3 reposts · 61 likes · 7.8K views
PicoCreator - AI builder @ AIE 🇸🇬
AI is concentrating in a few countries, a few companies, a few chips. Open-source AI is the only real check on that, and it only works if the infrastructure to run it actually exists. Today, we're announcing our Series A, $20M, to make this a reality.
21 replies · 35 reposts · 114 likes · 25.2K views