evrazian_schizo

361 posts

evrazian_schizo

@rationaleist

Joined November 2012

4 Following 45 Followers

@teortaxesTex I mean, it makes sense if you think about it as "rotate the self-attention." In a sequence, token m represents what is discretely appended at position m, not the accumulated state.

English

148

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·5d

Is this real

Cheng Luo@ChengLuo_lc

We're excited to release 𝐃𝐞𝐥𝐭𝐚 𝐀𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧 𝐑𝐞𝐬𝐢𝐝𝐮𝐚𝐥𝐬, a drop-in upgrade to residual connections that learns which past layers to route from, without the routing collapse that breaks prior cross-layer attention at scale. 🚀

Attention Residuals route over cumulative hidden states, but those are highly redundant, so routing collapses to near-uniform (max weight ~0.2) in deep layers. Delta Attention Residuals route over 𝐝𝐞𝐥𝐭𝐚𝐬 (vᵢ = hᵢ₊₁ − hᵢ), what each sublayer actually contributed, and natively enable:

⚡ 𝟏.𝟖× 𝐬𝐡𝐚𝐫𝐩𝐞𝐫 𝐜𝐫𝐨𝐬𝐬-𝐥𝐚𝐲𝐞𝐫 𝐫𝐨𝐮𝐭𝐢𝐧𝐠
Deltas are structurally diverse, lifting max attention weight from ~0.2 → ~0.6 (0.62 vs 0.35 avg) and curing routing collapse in deep layers.

📉 −𝟖.𝟐% 𝐯𝐚𝐥𝐢𝐝𝐚𝐭𝐢𝐨𝐧 𝐏𝐏𝐋 𝐚𝐭 𝟕.𝟔𝐁
Consistent gains from 220M → 7.6B (1.7–8.2% lower PPL), beating both standard residuals and Attention Residuals; the latter actually degrades below baseline at scale (18.58 vs 17.43).

🔌 𝐃𝐫𝐨𝐩-𝐢𝐧 𝐟𝐢𝐧𝐞-𝐭𝐮𝐧𝐢𝐧𝐠 𝐨𝐟 𝐩𝐫𝐞𝐭𝐫𝐚𝐢𝐧𝐞𝐝 𝐦𝐨𝐝𝐞𝐥𝐬
Additive, zero-init routing is identity at initialization, so you can convert pretrained checkpoints (e.g. Qwen3-0.6B) into Delta Attention Residuals via standard fine-tuning, beating the original on 8 downstream benchmarks (55.6 vs 55.0).

🪶 ≤𝟎.𝟎𝟏% 𝐩𝐚𝐫𝐚𝐦𝐞𝐭𝐞𝐫 𝐨𝐯𝐞𝐫𝐡𝐞𝐚𝐝
Delta Block adds just 589K params (0.008% at 8B) and ~3% memory, and runs faster + lighter than Attention Residuals (14.0k vs 12.5k tok/s, 42.7 vs 44.0 GB).

💻 Code: github.com/wdlctc/delta-a…
📄 Paper: arxiv.org/abs/2605.18855

English

15.4K
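
A rough sketch of the idea in Cheng Luo's post above, not the released implementation: it routes over per-layer deltas v_i = h_{i+1} - h_i and adds the routed mix back through a gate that starts at zero, so the block is an identity at initialization. Only the delta definition and the additive zero-init property come from the post; the dot-product scoring, the `query` vector, the scalar `gate`, and the function names are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def delta_route(hidden_states, query, gate=0.0):
    """Route over layer deltas (hypothetical helper, not the paper's API).

    hidden_states: [h_0, ..., h_L], each a list of floats of equal length.
    query: vector used to score each delta (assumed; the post does not
           say how scores are computed).
    gate: scalar multiplier on the routed mix; 0.0 mimics zero-init.
    """
    # v_i = h_{i+1} - h_i: what each sublayer contributed.
    deltas = [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(hidden_states, hidden_states[1:])
    ]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in deltas])
    mixed = [
        sum(w * v[d] for w, v in zip(weights, deltas))
        for d in range(len(query))
    ]
    # Additive update on top of the current hidden state.
    out = [h + gate * m for h, m in zip(hidden_states[-1], mixed)]
    return out, weights


states = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
out_init, weights = delta_route(states, query=[0.0, 1.0], gate=0.0)
out_open, _ = delta_route(states, query=[0.0, 1.0], gate=1.0)
```

With `gate=0.0` the output equals the last hidden state, which is the property that would let a pretrained checkpoint be converted without changing its behavior before fine-tuning.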

evrazian_schizo@rationaleist·12 May

@akarlin The ground truth for SWE tasks and math is simple, sparse, and objective. "Good writing" is an extremely jagged shape that requires expensive human feedback and good heuristics. The former has high financial returns, the latter is niche atm. Resources are invested accordingly.

English

145

Anatoly Karlin 🧲💯@akarlin·12 May

I think there's a pretty simple and intuitive explanation for this. Good writers have been sexually selected for 5500 years, and writing and reading have always been the primary mode of symbolic analysis (solving problems through abstract thought). It also loads heavily on far more ancient linguistic modules, some of which, like the FOXP2 gene, we even share with parrots (a notably verbally tilted species). Programming is much more evolutionarily novel and has not been sexually selected for at all. On some universally neutral scale of cognition, the top human writers are several S.D.'s better than the top human coders. Consequently, whereas AI has already blasted through the top tiers of human programming ability, reaching the peaks of human writing ability is still probably a couple of years away.

Jerry Tworek@MillionInt

If the AI models are so smart, why do I feel like I’m losing a few neurons every time I read longer-form content written by AI? We’ve come a long way but we still have a long way to go. In terms of clarity of writing, we may have regressed from o1/o3 days.

English

156

15.1K

evrazian_schizo@rationaleist·6 May

@teortaxesTex @rasbt They explicitly reject both state tracking and SWA hybrids on their site. The specific claim seems to be DSA analog with a working linear indexer. Which would be both good and not that unrealistic but the "not your average transformer" marketing is so ass it's hard to believe.

English

341

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·6 May

pfft yeah, you get to choose bad baselines for a promo
We need one good unified visualization of memory & compute cost per i-th token
@rasbt you probably have all formulas for different architectures ready, what do you think?

Oliver Sieberling@osieberling

Would not be too surprised if this was just sth. like:
60 layer hybrid
- 256k "sliding window" attention every 4 blocks ("linear")
- GDN in the remaining blocks
compared to full attn: (60 * (1M)^2) / (15 * (256K)^2 + 45 * little) ≈ 52x speedup
This is Qwen3.5-397B-A17B btw

English

8.6K
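
The back-of-envelope ratio in Oliver Sieberling's post above can be checked with a few lines. This sketch follows the post's formula as written (quadratic score cost only, window cost taken as window squared); the function name, the binary reading of "1M" and "256K", and the per-layer `linear_cost` parameter standing in for "little" are assumptions.

```python
def attention_speedup(seq_len, window, n_layers=60, swa_every=4, linear_cost=0.0):
    """Full-attention cost divided by hybrid cost, per the post's formula.

    linear_cost is the assumed per-layer cost of the non-windowed (GDN)
    layers, expressed in the same units as window ** 2.
    """
    n_swa = n_layers // swa_every          # 15 windowed layers out of 60
    n_linear = n_layers - n_swa            # 45 remaining layers
    full = n_layers * seq_len ** 2
    hybrid = n_swa * window ** 2 + n_linear * linear_cost
    return full / hybrid


M = 2 ** 20
K = 2 ** 10
window = 256 * K

# Upper bound when the remaining layers cost nothing.
upper = attention_speedup(M, window)  # 64.0

# The quoted ~52x corresponds to each remaining layer costing about
# 1/13 of one windowed layer under this formula.
quoted = attention_speedup(M, window, linear_cost=window ** 2 / 13)
```

Under these assumptions the ceiling is 64x, and the 52x figure is reached once the 45 remaining layers are given a small nonzero cost.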

evrazian_schizo@rationaleist·5 May

@teortaxesTex Long if factual but oozing scam energy. Why would you invent a buzzword name for your attention if you are not going to release the algo?

English

720

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·5 May

Oh-ho. This isn't actually a breakthrough, however – Opus 4.6 famously sported 76%, with 4.7 Anthropic just said "it's always been a bad benchmark". I remember in Chinese evals of V4-Flash, they said that its MRCR perf looks like very shallow tracking. Still, let them have a go.

Alexander Whedon@alex_whedon

Introducing SubQ - a major breakthrough in LLM intelligence.
It is the first model built on a fully sub-quadratic sparse-attention architecture (SSA), and the first frontier model with a 12 million token context window, which is:
- 52x faster than FlashAttention at 1MM tokens
- Less than 5% the cost of Opus
Transformer-based LLMs waste compute by processing every possible relationship between words (standard attention). Only a small fraction actually matter. @subquadratic finds and focuses only on the ones that do.
That's nearly 1,000x less compute and a new way for LLMs to scale.

English

142

15.5K

evrazian_schizo@rationaleist·4 May

@teortaxesTex Wouldn't that require strong multiturn?

English

199

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·4 May

come to think about it, the use case that most benefits from DeepSeek's cache pattern is long roleplay. Massive slow contexts, intermittent sessions, ttft after reload irrelevant. Cache lives on disk 99% of the time. They can make waifu prefetch a paid product feature, too

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

restarted a convo (with V4's + 3 more papers) ≈48 hours old. cache hits
they do store cache for "days", not minutes-hours
Gemini TTL default is 1 hour, Claude's is 5 minutes
Nah bros I don't think they have > V4 kv efficiency, whatever Reiner Pope says

English

6.6K

evrazian_schizo@rationaleist·30 Apr

@teortaxesTex Linear could maybe serve similar function but DS doesn't have a mature variant and arch is somewhat raw as is. I imagine that's also why m and m` are fully static.

English

evrazian_schizo@rationaleist·30 Apr

@teortaxesTex Visualize going over the input sequentially from the start. You can start compressing immediately or you can run full attention over the first n elements at negligible cost, then compress the oldest element for every new one. SWA follows naturally from the formulation.

English

146

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·30 Apr

Plausible. It is frustrating that DeepSeek doesn't spell out their rationale for rejecting more mainstream alternatives like hybrid-linear. I guess it's not just NIH. But as for SWA: it's unusual that they added it to route around CSA&HCA limits, not to cut costs further.

ueaj@_ueaj

Hybrid SWA is actually an atrocious inductive bias and I'm tired of people using it. Of course it works at small scales! At small scales most of the learnable patterns are short range! That doesn't mean it scales to bigger models! Anything above like 200B params shouldn't use hybrid. I've never been a fan of blockwise inductive biases but it's the best you can get for long context perf. Reducing the total number of entries in the kv cache across a given sequence length is the best way to improve long ctx performance. I don't think attention is the best inductive bias, but within attention I think HSA, CSA, DSA, even NSA are all by far the best innovations by a massive margin. dsv4 is a genuinely very good model, the fact they didn't go with engrams, and all the other decisions they made except maybe mHC makes me feel that ds still has the OS mandate. (attn res >> mHC)

English

4.2K
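
The sequential picture in evrazian_schizo's 30 Apr reply (keep the first n elements uncompressed, then compress the oldest element for every new one) can be sketched as a toy data structure. This is an illustration only: scalars stand in for KV entries, a running mean stands in for whatever learned compression a real model would use, and the class and method names are hypothetical.

```python
from collections import deque


class WindowedCompressor:
    """Toy model: exact recent window plus a compressed summary of older items."""

    def __init__(self, window):
        self.window = window
        self.recent = deque()        # uncompressed, most recent `window` items
        self.summary_sum = 0.0       # running-mean stand-in for compression
        self.summary_count = 0

    def append(self, item):
        self.recent.append(item)
        if len(self.recent) > self.window:
            # One item in, one item out: the oldest element is folded
            # into the summary, so the exact part behaves like a sliding window.
            oldest = self.recent.popleft()
            self.summary_sum += oldest
            self.summary_count += 1

    def context(self):
        """Return (summary of compressed items or None, list of exact items)."""
        if self.summary_count == 0:
            return None, list(self.recent)
        return self.summary_sum / self.summary_count, list(self.recent)


buf = WindowedCompressor(window=3)
for value in [0.0, 1.0, 2.0]:
    buf.append(value)
early = buf.context()        # nothing compressed yet

for value in [3.0, 4.0]:
    buf.append(value)
late = buf.context()         # two oldest items folded into the summary
```

The exact part never holds more than `window` items, which is the sense in which a sliding window falls out of this formulation.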

evrazian_schizo@rationaleist·29 Apr

@CyberdyneC @teortaxesTex Being Russian, one also understands that the alternative is sacrificing yourself and your people for security schizobabble, literal death cults and LARP a e s t h e t i c s

English

cyberdyne_canary@CyberdyneC·29 Apr

@teortaxesTex You may not understand, being Russian, but throughout the rest of the west we've been living in blue world, where sacrificing yourself and your people for every ignorant child or hapless idiot or murderous thug is already expected

English

646

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·29 Apr

The real life question of Blue vs Red is "what should we teach children to do". Humans are creatures of habit. It is entirely possible to have a virtuous society that votes blue by supermajority, in the REAL situation. This requires penalizing Reds. This is called "civilization".

kris@liana_florist

@teortaxesTex I'm confused because it seems like votes based off 1st instinct lead to a 58% blue win and votes after aggressive debate also lead to 58% blue. The only other things we know are that votes won’t be 100% either way and blue >50% means 0 lives lost so how does advocating for red minimise-

English

925

27K

evrazian_schizo@rationaleist·29 Apr

@teortaxesTex GobPT

Português

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·28 Apr

I realize this is OpenAI's new marketing gimmick but still they should just take the gob-pill

roon@tszzl

@repligate @genalewislaw I think it becomes annoying when it mentions goblins every single chat and it’s fair shakes to try and reduce that

English

109

7.3K

evrazian_schizo@rationaleist·27 Apr

@teortaxesTex I think it comes down to K2.6 CoT being really good. V4-Pro benefits from similar pattern being injected through RP loophole, though obviously it's not trained to work with it. Flash doesn't pick it up or at least I couldn't force it, seems to be hardwired to trained patterns.

English

210

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·27 Apr

very good on Kimi for coming out ahead on balance with fewer active and total params + vision but it's literally within the margin of error for almost all pure text evals

Dhiraj@IndustrlPolicy

"Data from third-party AI model performance evaluation firm VALS AI showed V4 achieved an average accuracy rate of 63.87% across financial, legal and coding tests, lagging behind Claude Opus 4.6, Gemini 3.1 Pro Preview, GPT-5.4 and Kimi K2.6. "

English

7.4K

evrazian_schizo@rationaleist·27 Apr

@GzGIf4JJmV70827 @teortaxesTex The work doesn't feel one way or another, you do. In the first place, modification is closer to "rape" but the Japanese have less issues with porn fan works of PreCure and the like than unpaid consumption. Please be considerate when fantasizing about child violation in public.

English

248

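ハムやん@GzGIf4JJmV70827·27 Apr

@teortaxesTex That's not true! It really is the case that creators think of their works as their own children. I don't think you can understand that, since you've never made a work of art. I'm saying this because I want you to feel awful when you use pirated copies. It seems to have worked really well, so I don't care anymore! Enjoy your pirate life!

Japanese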

747

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·27 Apr

…So the idea is that Japanese creators consider it normal to sell their children into sexual slavery?

ハムやん@GzGIf4JJmV70827

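@mgrixon I'm from Japan. Sorry if this comes across as very harsh. To an author, a work is a child. The copyright issue is equivalent to being asked, "Is it okay to rape your child?" That is why Japanese people are pushing back. I know that everyone would buy legitimately if they could access it properly. Sorry if I hurt you.

Russian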

evrazian_schizo@rationaleist·14 Apr

@teortaxesTex Selectable prefix-tuned priors?

English

136

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·14 Apr

Speciale, not having roleplay samples in its instruct data, cannot generalize LARP to all contexts it finds hard, so will autistically do the job as instructed using the same STEM reasoning primitives. Man, should we fix this in V4 with, like, steering vectors?

English

2.8K

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·14 Apr

This keeps happening (never happened with Speciale). The problem, I think, is that while a specific technical task class is not covered in instruct data, roleplay generalizes infinitely: you just need some associations. Might be another strength of Claude. It LARPs its own self.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

maybe it's just too hard to combine creative writing/roleplay with top-tier reasoning capability in one model. Look at this crap. New DeepSeek falls into the LARP mode where instead of thinking on the object level, it outlines a scenario with predefined conclusions. Annoying.

English

7.4K

evrazian_schizo@rationaleist·7 Apr

@teortaxesTex Maybe it needs to be prodded into research mode idk

English

109

evrazian_schizo@rationaleist·7 Apr

@teortaxesTex It's consistently faster, which is lol. Web UI tasks are about the same, instant was more interesting if anything. "Explain this to me like I'm retarded" is about the same but expert is more formal, maybe that's just the dice roll. Can't test more complex tasks without the API

English

2.2K

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·7 Apr

DeepSeek update is here. Let's see what "Expert" brings.

English

518

81.7K

evrazian_schizo@rationaleist·7 Apr

@teortaxesTex Forcing zigger plan troosters to watch this shit juxtaposed with their own copes Clockwork Orange style

English

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·7 Apr

next level of Burgeroid justification for NATO has dropped. This isn't "Trump". This is like half of the US. Mentally ill cattle.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Burgeroids are down to "NATO exists to protect Europe from itself"

English

2.5K

evrazian_schizo@rationaleist·5 Apr

@teortaxesTex Will Kremlins even have the capital by September? Neither the approval nor the economy are where they were in 2022. The police have been gutted too.

English

133

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·5 Apr

Meanwhile in Zigger channels: Mobilization to the tune of 500-700 thousand (1.4-2 years' worth of attrition at the current rate) is inevitable. I'm getting the same signals. Will happen by September, probably. Even the most delulo optimists concur. Russia is losing this.

English

107

18.4K

evrazian_schizo@rationaleist·4 Apr

@teortaxesTex I mean, didn't they ship 3.2-exp with bugged indexer?

English

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·4 Apr

Damn, this makes our outage-ology look silly
I guess even the Whale can just ship a bug in prod

Zhihu Frontier@ZhihuFrontier

🚨 BREAKING: DeepSeek suffered a mass outage (web + App down for ~12 hours) due to a faulty code issue, with users facing "Server busy" alerts and some chat history loss; service has now been FULLY restored ✅

🔍 Insight from Zhihu contributor @西柚菌
This wasn’t random: it’s a P0-level disaster from 1 faulty line of code!

The Clue (GitHub)
• DeepSeek’s open-source 3FS+ repo pushed an emergency 1-line fix (Mar 30)
• Fix: Corrected timeout check logic for IoRing+ batch processing

🔥 Why the Bug Hit Hard
• The bug disabled the system’s batch I/O handling entirely
• It only surfaced under EXTREME concurrency (key detail!)

🧠 The Perfect Storm
✅ DeepSeek was quietly testing a new model (likely V4)
✅ New model = way higher I/O & KV Cache pressure
✅ Mass user traffic + broken batching → I/O storm 💥
✅ Storage nodes froze → full system avalanche

💡 Silver Lining
3FS fixed → DeepSeek V4 is probably right around the corner 👀

🔗 Full Official Zhihu Response (CN): zhihu.com/question/20217…

English

3.9K

evrazian_schizo@rationaleist·3 Apr

@amir_harati @teortaxesTex If people in charge were picking simple net positive options then the war wouldn't have started in the first place

English

143

Amir Harati@amir_harati·3 Apr

@rationaleist @teortaxesTex EU needs resources, Russia has resources it is simple

English

167

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·3 Apr

Europeans be like

GIF

Catturd ™@catturd2

- Pull out of NATO.
- Close all bases and remove all military personnel from the UK, Germany, Spain, and France.
- Never protect these countries again.
- Stop all trade with these countries. ZERO.
- Refuse to share any military technologies and don't allow them to buy any military equipment. Ever.
- Don't share any intelligence with them. NONE.
- Tell them they have to provide 100% of weapons and money to Ukraine.
- Cut them off completely.

English

552

11.9K

evrazian_schizo@rationaleist·3 Apr

@teortaxesTex @amir_harati Would Europe accept that without their own maximalist demands though?

English

225

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·3 Apr

@amir_harati If only monke wasn't so retarded Giving up on Ukraine in exchange for peace with Europe is a no-brainer

English

531

evrazian_schizo@rationaleist·2 Apr

@teortaxesTex It does seem like there are several checkpoints, because yesterday it would consistently interpolate a July 2025 cutoff, but now it's either July 2024 or October 2024. Looks like it's following the implicit instruction to "be V3" more closely

English

evrazian_schizo@rationaleist·2 Apr

@teortaxesTex I think they all have "latest DeepSeek model" id in the system prompt. That forces the model to larp as V3 with low confidence outputs due to conflicts. Previously it would clarify that it's not any numbered version, maybe they removed it and that led to this behavior.

English
