evrazian_schizo

361 posts

@rationaleist

Joined November 2012
4 Following · 45 Followers
evrazian_schizo@rationaleist·
@teortaxesTex I mean, makes sense if you think about it as "rotate the self-attention." In a sequence, token m represents what is discretely appended at position m, not the accumulated state.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Is this real
Cheng Luo@ChengLuo_lc

We're excited to release Delta Attention Residuals, a drop-in upgrade to residual connections that learns which past layers to route from, without the routing collapse that breaks prior cross-layer attention at scale. 🚀
Attention Residuals route over cumulative hidden states, but those are highly redundant, so routing collapses to near-uniform (max weight ~0.2) in deep layers. Delta Attention Residuals route over deltas (vᵢ = hᵢ₊₁ − hᵢ), i.e. what each sublayer actually contributed, and natively enable:
⚡ 1.8× sharper cross-layer routing
Deltas are structurally diverse, lifting max attention weight from ~0.2 → ~0.6 (0.62 vs 0.35 avg) and curing routing collapse in deep layers.
📉 −8.2% validation PPL at 7.6B
Consistent gains from 220M → 7.6B (1.7–8.2% lower PPL), beating both standard residuals and Attention Residuals; the latter actually degrades below baseline at scale (18.58 vs 17.43).
🔌 Drop-in fine-tuning of pretrained models
Additive, zero-init routing is identity at initialization, so you can convert pretrained checkpoints (e.g. Qwen3-0.6B) into Delta Attention Residuals via standard fine-tuning, beating the original on 8 downstream benchmarks (55.6 vs 55.0).
🪶 ≤0.01% parameter overhead
Delta Block adds just 589K params (0.008% at 8B) and ~3% memory, and runs faster + lighter than Attention Residuals (14.0k vs 12.5k tok/s, 42.7 vs 44.0 GB).
💻 Code: github.com/wdlctc/delta-a…
📄 Paper: arxiv.org/abs/2605.18855
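For readers who want the delta idea in concrete form, here is a minimal sketch. It is not the released implementation: the function names, the dot-product scoring against a single `query` vector, and the scalar `gate` are all assumptions made for illustration. It only shows the two properties the announcement states, namely that routing runs over per-layer deltas vᵢ = hᵢ₊₁ − hᵢ and that an additive, zero-initialised route leaves the output unchanged.

```python
import math

def deltas(hidden_states):
    """v_i = h_{i+1} - h_i for each pair of consecutive hidden states."""
    return [[b - a for a, b in zip(h0, h1)]
            for h0, h1 in zip(hidden_states, hidden_states[1:])]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_over_deltas(hidden_states, query, gate=0.0):
    """Hypothetical additive routing: h_last + gate * sum_i w_i * v_i.

    `query` and `gate` are illustrative stand-ins for learned parameters;
    gate=0.0 mimics zero-init, so the output equals the last hidden state.
    """
    vs = deltas(hidden_states)
    scores = [sum(q * x for q, x in zip(query, v)) for v in vs]
    weights = softmax(scores)
    mixed = [sum(w * v[d] for w, v in zip(weights, vs))
             for d in range(len(query))]
    out = [h + gate * m for h, m in zip(hidden_states[-1], mixed)]
    return out, weights

hs = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
out, weights = route_over_deltas(hs, query=[0.0, 0.0])
print(out)      # identical to hs[-1] because gate is 0.0
print(weights)  # uniform here, since a zero query scores every delta equally
```

The max routing weight quoted in the announcement corresponds to `max(weights)` in this sketch; sharper routing means that value moves away from the uniform 1/len(weights).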

evrazian_schizo@rationaleist·
@akarlin The ground truth for SWE tasks and math is simple, sparse, and objective. "Good writing" is an extremely jagged shape that requires expensive human feedback and good heuristics. The former has high financial returns, the latter is niche atm. Resources are invested accordingly.
Anatoly Karlin 🧲💯@akarlin·
I think there's a pretty simple and intuitive explanation for this. Good writers have been sexually selected for 5500 years, and writing and reading have always been the primary mode of symbolic analysis (solving problems through abstract thought). It also loads heavily on far more ancient linguistic modules, some of which, like the FOXP2 gene, we even share with parrots (a notably verbally tilted species). Programming is much more evolutionarily novel and has not been sexually selected for at all. On some universally neutral scale of cognition, the top human writers are several S.D.'s better than the top human coders. Consequently, whereas AI has already blasted through the top tiers of human programming ability, reaching the peaks of human writing ability is still probably a couple of years away.
Jerry Tworek@MillionInt

If the AI models are so smart, why do I feel like I’m losing a few neurons every time I read longer-form content written by AI? We’ve come a long way but we still have a long way to go. In terms of clarity of writing we may have regressed from o1/o3 days.

evrazian_schizo@rationaleist·
@teortaxesTex @rasbt They explicitly reject both state tracking and SWA hybrids on their site. The specific claim seems to be DSA analog with a working linear indexer. Which would be both good and not that unrealistic but the "not your average transformer" marketing is so ass it's hard to believe.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
pfft yeah, you get to choose bad baselines for a promo
We need one good unified visualization of memory&compute cost per ith token
@rasbt you probably have all formulas for different architectures ready, what do you think?
Oliver Sieberling@osieberling

Would not be too surprised if this was just sth. like:
60 layer hybrid
- 256k "sliding window" attention every 4 blocks ("linear")
- GDN in the remaining blocks
compared to full attn: (60 * (1M)^2) / (15 * (256K)^2 + 45 * little) ≈ 52x speedup
This is Qwen3.5-397B-A17B btw
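The back-of-envelope ratio in this tweet can be checked with a few lines. This is only a sketch of the tweet's own formula: the function name, the binary reading of "1M" and "256K", and treating `little` (the unspecified per-layer cost of the GDN blocks) as a free parameter are assumptions.

```python
def hybrid_speedup(seq_len, window, total_layers=60, attn_layers=15, little=0.0):
    """Full-attention cost divided by hybrid cost, as written in the tweet.

    Costs are in arbitrary units (squared lengths per layer).
    `little` is the per-layer cost of the remaining blocks, which the tweet
    leaves unspecified; 0.0 is an assumption that gives an upper bound.
    """
    full = total_layers * seq_len ** 2
    hybrid = attn_layers * window ** 2 + (total_layers - attn_layers) * little
    return full / hybrid

M = 2 ** 20  # "1M" tokens, read as binary units (assumption)
K = 2 ** 10  # "K", read as binary units (assumption)

print(hybrid_speedup(M, 256 * K))  # → 64.0
```

With the GDN term set to zero the formula gives 64x, so the quoted ≈52x corresponds to a nonzero `little`; any positive value pulls the ratio below 64.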

evrazian_schizo@rationaleist·
@teortaxesTex Long if factual but oozing scam energy. Why would you invent a buzzword name for your attention if you are not going to release the algo?
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Oh-ho. This isn't actually a breakthrough, however – Opus 4.6 famously sported 76%, with 4.7 Anthropic just said "it's always been a bad benchmark". I remember in Chinese evals of V4-Flash, they said that its MRCR perf looks like very shallow tracking. Still, let them have a go.
Alexander Whedon@alex_whedon

Introducing SubQ - a major breakthrough in LLM intelligence.
It is the first model built on a fully sub-quadratic sparse-attention architecture (SSA), and the first frontier model with a 12 million token context window, which is:
- 52x faster than FlashAttention at 1MM tokens
- Less than 5% the cost of Opus
Transformer-based LLMs waste compute by processing every possible relationship between words (standard attention). Only a small fraction actually matter. @subquadratic finds and focuses only on the ones that do.
That's nearly 1,000x less compute and a new way for LLMs to scale.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
come to think about it, the use case that most benefits from DeepSeek's cache pattern is long roleplay. Massive slow contexts, intermittent sessions, ttft after reload irrelevant. Cache lives on disk 99% of the time. They can make waifu prefetch a paid product feature, too
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

restarted a convo (with V4's + 3 more papers) ≈48 hours old. cache hits
they do store cache for "days", not minutes-hours
Gemini TTL default is 1 hour, Claude's is 5 minutes
Nah bros I don't think they have > V4 kv efficiency, whatever Reiner Pope says

evrazian_schizo@rationaleist·
@teortaxesTex Linear could maybe serve similar function but DS doesn't have a mature variant and arch is somewhat raw as is. I imagine that's also why m and m` are fully static.
evrazian_schizo@rationaleist·
@teortaxesTex Visualize going over the input sequentially from the start. You can start compressing immediately or you can run full attention over the first n elements at negligible cost, then compress the oldest element for every new one. SWA follows naturally from the formulation.
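The sequential picture in this reply can be sketched as a small loop. This is an illustration only: the function name is hypothetical, and the running sum is a stand-in for whatever learned compressor a real architecture would use. The point is the control flow: the most recent `window` elements stay exact, and each new element past that pushes the oldest one into the compressed state.

```python
from collections import deque

def stream_with_window(tokens, window):
    """Keep the last `window` tokens exact; fold older ones into a summary.

    The summary here is a plain running sum and count, chosen only so the
    example is runnable; it is an assumption, not any model's compressor.
    """
    recent = deque()        # exact tokens, i.e. the sliding window
    compressed_sum = 0.0    # stand-in compressed state
    compressed_count = 0
    for tok in tokens:
        recent.append(tok)
        if len(recent) > window:
            oldest = recent.popleft()  # evict one old element per new one
            compressed_sum += oldest
            compressed_count += 1
    return list(recent), compressed_sum, compressed_count

print(stream_with_window([1, 2, 3, 4, 5], window=3))  # → ([3, 4, 5], 3.0, 2)
```

For inputs no longer than `window`, nothing is compressed and the loop reduces to plain full attention over the prefix, which is the "negligible cost" case described in the reply.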
evrazian_schizo@rationaleist·
@CyberdyneC @teortaxesTex Being Russian, one also understands that the alternative is sacrificing yourself and your people for security schizobabble, literal death cults and LARP a e s t h e t i c s
cyberdyne_canary@CyberdyneC·
@teortaxesTex You may not understand, being Russian, but throughout the rest of the West we've been living in a blue world, where sacrificing yourself and your people for every ignorant child or hapless idiot or murderous thug is already expected
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
The real life question of Blue vs Red is "what should we teach children to do". Humans are creatures of habit. It is entirely possible to have a virtuous society that votes blue by supermajority, in the REAL situation. This requires penalizing Reds. This is called "civilization".
kris@liana_florist

@teortaxesTex I'm confused because it seems like votes based off 1st instinct lead to a 58% blue win and votes after aggressive debate also lead to 58% blue. The only other things we know are that votes won’t be 100% either way and blue >50% means 0 lives lost so how does advocating for red minimise-

evrazian_schizo@rationaleist·
@teortaxesTex I think it comes down to K2.6 CoT being really good. V4-Pro benefits from similar pattern being injected through RP loophole, though obviously it's not trained to work with it. Flash doesn't pick it up or at least I couldn't force it, seems to be hardwired to trained patterns.
evrazian_schizo@rationaleist·
@GzGIf4JJmV70827 @teortaxesTex The work doesn't feel one way or another, you do. In the first place, modification is closer to "rape" but the Japanese have fewer issues with porn fan works of PreCure and the like than with unpaid consumption. Please be considerate when fantasizing about child violation in public.
ハムやん@GzGIf4JJmV70827·
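@teortaxesTex That's not true! It's true that authors think of their works as practically their own children. I don't expect you to understand, since you've never created a work of art. I'm saying this because I want you to feel awful when you use pirated copies. It seems to have been very effective, so I don't care anymore! Enjoy your pirate life!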
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
…So the idea is that Japanese creators consider it normal to sell their children into sexual slavery?
ハムやん@GzGIf4JJmV70827

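@mgrixon I'm from Japan. Sorry if this comes across as very harsh. To an author, a work = a child. The copyright issue is equivalent to being asked "is it okay to rape your child?" That's why Japanese people are pushing back. I know that everyone would buy legitimately if they had proper access. Sorry if I hurt you.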

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Speciale, not having roleplay samples in its instruct data, cannot generalize LARP to all contexts it finds hard, so will autistically do the job as instructed using the same STEM reasoning primitives. Man, should we fix this in V4 with, like, steering vectors?
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
This keeps happening (never happened with Speciale). The problem, I think, is that while a specific technical task class is not covered in instruct data, roleplay generalizes infinitely: you just need some associations. Might be another strength of Claude. It LARPs its own self.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

maybe it's just too hard to combine creative writing/roleplay with top-tier reasoning capability in one model. Look at this crap. New DeepSeek falls into the LARP mode where instead of thinking on the object level, it outlines a scenario with predefined conclusions. Annoying.

evrazian_schizo@rationaleist·
@teortaxesTex It's consistently faster, which is lol. Web UI tasks are about the same, instant was more interesting if anything. "Explain this to me like I'm retarded" is about the same but expert is more formal, maybe that's just the dice roll. Can't test more complex tasks without the API
evrazian_schizo@rationaleist·
@teortaxesTex Forcing zigger plan troosters to watch this shit juxtaposed with their own copes Clockwork Orange style
evrazian_schizo@rationaleist·
@teortaxesTex Will Kremlins even have the capital by September? Neither the approval nor the economy is where it was in 2022. The police have been gutted too.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Meanwhile in Zigger channels: Mobilization to the tune of 500-700 thousand (1.4-2 years' worth of attrition at the current rate) is inevitable. I'm getting the same signals. Will happen by September, probably. Even the most delulo optimists concur. Russia is losing this.
evrazian_schizo
evrazian_schizo@rationaleist·
@teortaxesTex It does seem like there are several checkpoints, because yesterday it would consistently interpolate a July 2025 cutoff, but now it's either July 2024 or October 2024. Looks like it's following the implicit instruction to "be V3" more closely
evrazian_schizo@rationaleist·
@teortaxesTex I think they all have a "latest DeepSeek model" id in the system prompt. That forces the model to larp as V3 with low confidence outputs due to conflicts. Previously it would clarify that it's not any numbered version, maybe they removed it and that led to this behavior.