Dzulfikar Otsusuki

5.7K posts

@PartikelQ


Joined February 2022
157 Following · 64 Followers
Pinned Tweet
Dzulfikar Otsusuki@PartikelQ·
This is a translation of Amado and Isshiki's statements about Shibai's high-dimensional existence. The kanji used refers to a mathematical higher dimension.
[4 images attached]
Dzulfikar Otsusuki retweeted
Alexander Doria@Dorialexander·
So DeepSeek-V4: finally took me the week. Overall the paper is attempting many things at once, not easy to disentangle as it's all surprisingly connected.

It's first a serious attempt at bridging the gap between closed and open LLM architectures. It is generally rumored that Opus and [the largest model bundled in GPT-5] belong to an entirely different category of models: very large, very sparse mixtures of experts, able to hold an unprecedentedly wide search space while still being servable. Simply put, current hardware cannot hold such a model on one node, so you have to play with the interconnect and various levels of quantization, for different layers, at different stages of training. An important focus of DsV4 is on communication latency, showing it can be hidden through effective management of the interconnect (roughly, you slide communication time inside computation time). Overall, you cannot simply enter this game without the capability to rewrite kernels from scratch, and the model report relentlessly comes back to this. Because this is the frontier game.

It's then a radical, but very successful, attempt at making long context simultaneously more efficient and more affordable. Long context is literally a "context" problem: what exactly is worth attending to? An obvious fix is to prioritize the most recent tokens. This might be sufficient for basic search, but not for the new demands of agentic pipelines that require accurate recall of distant yet strategic content. V4's clever approach is to rely on two different axes of memorization by allocating layers to two different attention compression schemes. As the name suggests, Heavily Compressed Attention is the brute-force method, collapsing each sequence of 128 tokens into a single entry, and takes care of the fuzzy yet global context. Compressed Sparse Attention relies on a "lightning indexer" to bring in the relevant local blocks for each query, even when they can be thousands of tokens away. Everything here is optimized for end inference: there is a very large head_dim (512), which is costlier for training but allows for an even more compressed KV cache, which is your actual bottleneck at inference time, especially in prefill mode. The end result is a very classical DeepSeek play, introducing a new radical disruption of inference economics after DSA. I predict hybrid CSA/HCA (or similar counterparts) will essentially be part of the mainstream arch by the end of this year.

Now we come to the more ambitious but also more unfinished part: an attempt at redefining model architecture and the learning signal. The most prominent parts are mHC and hybrid CSA/HCA, but it's actually a long list of less documented innovations: swapping softmax for sqrt(softplus), or using a hybrid two-stage scheme with non-standard values for Muon. Yet the interconnection of all of these new components is still unknown and likely accounts for the significant training instabilities: typically, "mHC involves a matrix multiplication with an output dimension of only 24", which introduces non-determinism. Even one of the best AI labs in the world will run here into an ablation combinatorial explosion, so the association of all these choices is likely non-tractable and would require a more consistent theory, which the conclusion gestures at but does not solve ("In future iterations, we will carry out more comprehensive and principled investigations to distill the architecture down to its most essential designs"). The more limited experiments in post-training are maybe more promising.

Significantly, the one lab that popularized the standard RL+reasoning recipe is rethinking the recipe. For now it's a two-stage design (RL on a specialized model, then on-policy distillation): ever since Self-Principled Critique Tuning, DeepSeek has been concerned with expanding the reasoning training signal beyond the final sparse reward. I'm not sure this is the final say: in this domain everything is a bit in flux, and you could even argue the type of verified pipeline we designed for SYNTH is a form of extreme offline RL-like training.

There is an even longer-term plan (here >3-5 years), which is about redefining hardware. For now it's a way of transforming a constraint into an opportunity: as one of the leading Chinese labs, DeepSeek was very incentivized to make training work on Ascend and contribute to the national effort for chip autonomy. Very unusually, the report includes a lengthy wishlist for future hardware in the report itself. As several experts noted, many of these recommendations don't really hold up for Nvidia but make perfect sense for a newcomer to the GPU hardware business. DeepSeek seems to be anticipating a world where labs have to secure a close hardware partner to retroactively fit the chips to the particular demands of model design or inference.

Now there is what DeepSeek did not do yet. The paper hardly mentions anything about synthetic pipelines, rephrasing, or simulated environments. The training data size (32T tokens) likely involves a significant share of generated data, as this is more quality tokens than the web and other digitized sources could hold, so maybe synthetic proportions similar to Trinity (roughly half) or Kimi. Still, it's pretty clear that all their attention was focused on the infra, architecture and scaling side, leaving a proper extensive retraining for later. This is likely not that dissimilar to how Anthropic or OpenAI proceeded: the fact that we're still in the middle of the same model series even though significant parts of the model have changed (the tokenizer with Opus 4.7) suggests that a model lifecycle involves multiple rounds of training, potentially as large as a full pretraining run a few years ago. The fact that DeepSeek took on multiple Moonshot innovations (and Moonshot in turn has been hugely reliant on DeepSeek) suggests we might also have an ecosystem dynamic here. Maybe DeepSeek can exclusively focus on hard infrastructure problems and expect some of the axes of development to be sorted out later.
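To make the hybrid attention idea above concrete, here is a minimal PyTorch sketch of the two paths the post describes: a coarse path that pools each block of 128 tokens into a single entry, and a fine path where a cheap learned indexer picks a handful of full-resolution blocks per query. The block size (128) comes from the post; the pooling choice, the indexer design, the TOP_K value and all function names are illustrative assumptions for this sketch, not DeepSeek-V4's actual implementation, and causal masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

BLOCK = 128   # block size taken from the post; everything else below is illustrative
TOP_K = 4     # number of full-resolution blocks retrieved per query (my choice)

def compressed_global_attention(q, k, v):
    """Coarse path ("HCA"-style in the post): mean-pool every BLOCK keys/values
    into a single entry and attend over the pooled entries only."""
    T, d = k.shape
    n_blocks = T // BLOCK
    k_blk = k[: n_blocks * BLOCK].reshape(n_blocks, BLOCK, d).mean(dim=1)
    v_blk = v[: n_blocks * BLOCK].reshape(n_blocks, BLOCK, d).mean(dim=1)
    scores = q @ k_blk.T / d ** 0.5            # (Tq, n_blocks)
    return F.softmax(scores, dim=-1) @ v_blk   # (Tq, d)

def indexed_sparse_attention(q, k, v, index_proj):
    """Fine path ("CSA"-style in the post): a cheap learned indexer scores blocks
    per query, then only the TOP_K best blocks are attended at full resolution,
    however far away they sit in the sequence."""
    T, d = k.shape
    n_blocks = T // BLOCK
    k_blk = k[: n_blocks * BLOCK].reshape(n_blocks, BLOCK, d)
    v_blk = v[: n_blocks * BLOCK].reshape(n_blocks, BLOCK, d)
    blk_keys = index_proj(k_blk.mean(dim=1))                     # (n_blocks, r)
    blk_scores = index_proj(q) @ blk_keys.T                      # (Tq, n_blocks)
    top = blk_scores.topk(min(TOP_K, n_blocks), dim=-1).indices  # (Tq, TOP_K)
    out = torch.zeros(q.shape[0], d)
    for i in range(q.shape[0]):               # naive per-query loop, illustration only
        ks = k_blk[top[i]].reshape(-1, d)     # gathered full-resolution keys
        vs = v_blk[top[i]].reshape(-1, d)
        w = F.softmax(q[i] @ ks.T / d ** 0.5, dim=-1)
        out[i] = w @ vs
    return out

# toy usage: 1024 tokens, head_dim 64 (the post says V4 uses 512)
T, d, r = 1024, 64, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
index_proj = torch.nn.Linear(d, r, bias=False)
coarse = compressed_global_attention(q, k, v)          # fuzzy global context
fine = indexed_sparse_attention(q, k, v, index_proj)   # precise distant recall
```

The point of splitting the two paths across layers, as the post describes, is that the coarse path keeps cost roughly constant in sequence length while the fine path pays full price only for the few blocks the indexer flags as relevant.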
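The claim that the KV cache is the real inference bottleneck is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below compares a plain per-head K/V cache against an MLA-style compressed latent cache; the configuration numbers are roughly DeepSeek-V3-like and chosen only for illustration, since the post does not give V4's actual cache layout.

```python
# Back-of-the-envelope KV-cache sizing. All configuration numbers below are
# illustrative (roughly DeepSeek-V3-like), not V4's actual cache layout.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # plain attention: one K and one V vector per head, per layer, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

def compressed_cache_bytes(n_layers, latent_dim, seq_len, dtype_bytes=2):
    # MLA-style compression: cache one shared latent per layer per token
    # instead of per-head K/V (the broad idea; exact layouts vary)
    return n_layers * latent_dim * seq_len * dtype_bytes

GiB = 1024 ** 3
full = kv_cache_bytes(n_layers=61, n_kv_heads=128, head_dim=128, seq_len=128_000)
mla = compressed_cache_bytes(n_layers=61, latent_dim=512, seq_len=128_000)
print(f"plain per-head cache : {full / GiB:.1f} GiB per 128k-token sequence")
print(f"compressed latent    : {mla / GiB:.1f} GiB per 128k-token sequence")
```

With these illustrative numbers the uncompressed cache runs to hundreds of GiB per long sequence, which is why compressing it matters far more than raw FLOPs once you are serving long-context, prefill-heavy workloads.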
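The post also mentions swapping softmax for sqrt(softplus). One plausible reading, shown below as a toy sketch, is to replace the exponential in the attention weights with an elementwise sqrt(softplus(.)) followed by row renormalization; where exactly V4 applies this is not specified in the post, so treat the placement as an assumption.

```python
import torch
import torch.nn.functional as F

def softmax_attn(q, k, v):
    # standard scaled dot-product attention with softmax weights
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def sqrt_softplus_attn(q, k, v, eps=1e-6):
    # hypothetical variant: elementwise sqrt(softplus(.)) instead of exp(.),
    # renormalized so the weights still sum to 1 along the key axis
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    w = torch.sqrt(F.softplus(scores))
    return (w / (w.sum(dim=-1, keepdim=True) + eps)) @ v

q, k, v = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64)
out = sqrt_softplus_attn(q, k, v)   # same output shape as softmax_attn(q, k, v)
```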
rohan anil@_arohan_·
If O1 had not mentioned thinking traces, thinking, etc., I think the rest of the companies would have taken longer to get there. In some sense, the world should be thankful to the O1 team for creating the intelligence explosion through the decisions they implicitly made. The same goes for DeepSeek for reproducing an open recipe that allowed more folks to catch up (iiuc, DeepSeek was the second thinking release after O1?).
lastborn_@lastborn_1j·
Am I crazy, or does Onoki have some of the craziest abilities and feats in the whole show?
[2 images attached]
Dzulfikar Otsusuki retweeted
𝕱𝖗𝖊chiKQ@FrehiKQ·
I've always loved the symbolic parallel between Madara and Jiraiya. Both tried to create a "Messiah": while Jiraiya taught Naruto never to give up on himself, Madara stripped Obito of his identity and turned him into an extension of his own ideals.
[image attached]
Dzulfikar Otsusuki retweeted
Elliot Arledge@elliotarledge·
LMAO
[image attached]
Treetabix@Sasukebix·
@PartikelQ I got it mixed up but my point still stands. The punch isn't the same as hypothetically holding a tonne in ur hand (and again, like I said, I was joking about the sword's weight). U need to read more books and learn to improve ur comprehension skills.
Treetabix@Sasukebix·
Look at Sasuke effortlessly throw this sword, and mind you all ninja tools are weighted so it’s at least 800-900 kg
[image attached]
Treetabix@Sasukebix·
@PartikelQ That's literally not how it works. A chakra-enhanced punch ≠ holding a tonne in ur hand and throwing it like this. Not to mention I was clearly joking, but ur too slow to realise.
Dzulfikar Otsusuki@PartikelQ·
@Sasukebix Even better, in context Naruto didn't just throw him, he sent him flying with an uppercut.
Dzulfikar Otsusuki@PartikelQ·
@Sasukebix Base Naruto, already extremely exhausted with only a few drops of chakra left, sending Sasuke flying through a cliff wall with an uppercut is far better than that shit. With that tweet, you only make Boruto scaling look pathetic if Naruto weren't there.
Dzulfikar Otsusuki@PartikelQ·
@Sasukebix What's the point of overhyping it? We'd all be impressed if he'd thrown a trillion-ton weight instead of less than 1 ton.
BORUGOAT_ENT@BORUGOAT_ENT·
@PartikelQ 1. Being stronger than your opponent doesn't guarantee speed. 2. My point still stands, not that you are wrong or anything; you can still harm those stronger than you if you outsmart them or are more strategic. 3. Being able to damage someone doesn't instantly mean you are above them.
Dzulfikar Otsusuki@PartikelQ·
Stink scaling. It's already bad enough just relying on AP without the occasional feats, like DBS, and now another dogshit scaling like that. Logically, Code's strength and speed are far above Sasuke with the Rinnegan, let alone Sasuke without the Rinnegan.
BORUGOAT_ENT@BORUGOAT_ENT

Physically stronger ≠ wincon. Depending on the situation, if your opponent is a better strategist, out-haxes you, or has better speed or intelligence, they can harm you. Jigen was outsmarted twice against Naruto and Sasuke; it's his hax that saved him twice, even though he is stronger.
