Moon Head

73 posts

Moon Head

@MoonHeead

Data / AI Engineer

Joined May 2016
338 Following 19 Followers
Moon Head retweeted
Alexander Doria
Alexander Doria@Dorialexander·
So DeepSeek-V4: it finally took me the whole week. Overall the paper is attempting many things at once, not easy to disentangle as it's all surprisingly connected.

It's first a serious attempt at bridging the gap between closed and open LLM architectures. It is generally rumored that Opus and [largest model bundled in GPT-5] belong to an entirely different category of models: very large, very sparse mixtures of experts, able to hold an unprecedentedly wide search space while still being servable. Simply put, current hardware cannot hold such a model on one node, so you have to play with the interconnect and various levels of quantization, for different layers, at different stages of training. An important focus of DsV4 is on communication latency, showing it can be hidden through effective management of the interconnect (roughly, you slide communication time inside computation time). Overall, you cannot simply enter this game without the capability to rewrite kernels from scratch, and the model report relentlessly comes back to this. Because this is the frontier game.

It's then a radical, but very successful, attempt at making long context simultaneously more efficient and more affordable. Long context is literally a "context" problem: what exactly is worth attending to? An obvious fix is to prioritize the most recent tokens. This might be sufficient for basic search but not for the new demands of agentic pipelines that require accurate recall of distant yet strategic content. V4's clever approach is to rely on two different axes of memorization by allocating layers to two different attention compression schemes. As the name suggests, Heavily Compressed Attention is the brute-force method: it collapses each sequence of 128 tokens into a single entry and takes care of the fuzzy yet global context. Compressed Sparse Attention relies on a "lightning indexer" to bring in the relevant local blocks for each query, even when they can be thousands of tokens away. Everything here is optimized for end inference: there is a very large head_dim (512), which is costlier for training but allows for an even more compressed KV cache, and the KV cache is your actual bottleneck at inference time, especially in prefill mode. The end result is a very classical DeepSeek play, introducing a new radical disruption of inference economics after DSA. I predict hybrid CSA/HCA (or similar counterparts) will essentially be part of the mainstream arch by the end of this year.
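To make the hybrid scheme concrete, here is a minimal toy sketch of how I read the two compression paths. The block size of 128 and the head_dim of 512 come from the description above; the function names, the mean-pooling, and the tiny indexer projection are my own assumptions for illustration, not DeepSeek's actual kernels.

```python
# Toy sketch of the two attention compression schemes described above.
# Numbers from the post: blocks of 128 tokens, head_dim = 512.
# Everything else (names, mean-pooling, the tiny indexer) is a guess
# for illustration, not DeepSeek's implementation.
import numpy as np

BLOCK, HEAD_DIM, TOP_K = 128, 512, 8

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def hca_compress(keys, values):
    """'Heavily Compressed Attention' path: collapse every 128-token block
    into a single KV entry (mean pooling as a stand-in), giving a fuzzy but
    global view of the whole context at 1/128 of the cache size."""
    n = (len(keys) // BLOCK) * BLOCK
    k = keys[:n].reshape(-1, BLOCK, HEAD_DIM).mean(axis=1)
    v = values[:n].reshape(-1, BLOCK, HEAD_DIM).mean(axis=1)
    return k, v

def csa_select(query, keys, values, indexer_w):
    """'Compressed Sparse Attention' path: a cheap indexer scores each
    128-token block and only the top-k blocks are attended to exactly,
    even if they sit thousands of tokens back."""
    n = (len(keys) // BLOCK) * BLOCK
    block_keys = keys[:n].reshape(-1, BLOCK, HEAD_DIM)
    # low-dimensional projections keep the indexer far cheaper than full attention
    scores = (block_keys.mean(axis=1) @ indexer_w) @ (query @ indexer_w)
    top = np.argsort(scores)[-TOP_K:]
    k = block_keys[top].reshape(-1, HEAD_DIM)
    v = values[:n].reshape(-1, BLOCK, HEAD_DIM)[top].reshape(-1, HEAD_DIM)
    return k, v

def attend(query, keys, values):
    return softmax(keys @ query / np.sqrt(HEAD_DIM)) @ values

# In the real model each layer would be allocated to one path or the other;
# here we just run both on random data to show the shapes.
rng = np.random.default_rng(0)
K = rng.standard_normal((4096, HEAD_DIM))
V = rng.standard_normal((4096, HEAD_DIM))
q = rng.standard_normal(HEAD_DIM)
out_global = attend(q, *hca_compress(K, V))
out_local = attend(q, *csa_select(q, K, V, rng.standard_normal((HEAD_DIM, 64))))
```

The pooled path keeps the global gist cheap while the indexer path recovers exact content only where it matters, which is the trade-off the thread above frames around the compressed KV cache being the real bottleneck at inference, especially in prefill.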
Now we come to the more ambitious but also more unfinished part: an attempt at redefining model architecture and the learning signal. The most preeminent parts are mHC and the hybrid CSA/HCA, but it's actually a long list of less documented innovations: swapping softmax for sqrt(softplus), or using a hybrid two-stage scheme with non-standard values for Muon. Yet the interconnection of all these new components is still unknown and likely accounts for the significant training instabilities: typically, "mHC involves a matrix multiplication with an output dimension of only 24", which introduces non-determinism. Even one of the best AI labs in the world will run into a combinatorial explosion of ablations here, so the combination of all these choices is likely not tractable and would require a more consistent theory, which the conclusion gestures at but does not solve ("In future iterations, we will carry out more comprehensive and principled investigations to distill the architecture down to its most essential designs"). The more limited experiments in post-training are maybe more promising.

Significantly, the one lab that popularized the standard RL+reasoning recipe is rethinking the recipe. For now it's a two-stage design (RL on a specialized model, then on-policy distillation): ever since Self-Principled Critique Tuning, DeepSeek has been concerned with expanding the reasoning training signal beyond the final sparse reward. I'm not sure this is the final say: in this domain everything is a bit in flux, and you could even argue the type of verified pipeline we designed for SYNTH is a form of extreme offline RL-like training.

There is an even longer-term plan (here >3-5 years), which is about redefining hardware. For now it's a way of transforming a constraint into an opportunity: as one of the leading Chinese labs, DeepSeek was very incentivized to make training work on Ascend and contribute to the national effort for chip autonomy. Very unusually, the report itself includes a lengthy wishlist for future hardware. As several experts noted, many of these recommendations don't really hold up for Nvidia but make perfect sense for a newcomer in the GPU hardware business. DeepSeek seems to be anticipating a world where labs have to secure a close hardware partner to retroactively fit the chips to the particular demands of model design or inference.

Now there is what DeepSeek did not do yet. The paper hardly mentions anything about synthetic pipelines, rephrasing, or simulated environments. The training data size (32T tokens) likely involves a significant share of generated data, as this is more quality tokens than the web and other digitized sources could hold, so maybe similar synthetic proportions as Trinity (roughly half) or Kimi. Still, it's pretty clear that all their attention was focused on the infra, architecture and scaling side, leaving a proper, extensive retraining for later. This is likely not that dissimilar to how Anthropic or OpenAI proceeded: the fact we're still in the middle of the same model series even though significant parts of the model have changed (the tokenizer with Opus 4.7) suggests that a model lifecycle involves multiple rounds of training, each potentially as large as a full pretraining from a few years ago. The fact that DeepSeek took on multiple Moonshot innovations (and Moonshot in turn has been hugely reliant on DeepSeek) suggests we might also have an ecosystem dynamic here. Maybe DeepSeek can exclusively focus on hard infrastructure problems and expect some of the axes of development to be sorted out later.
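Coming back to the post-training point: a toy sketch of what the on-policy distillation stage amounts to as a training signal. The student samples its own tokens and every sampled token gets credit against the teacher's distribution, so the signal is dense rather than one sparse final reward. The tabular "models", sizes, and names below are placeholders invented for illustration, not DeepSeek's recipe.

```python
# Toy sketch of on-policy distillation as a dense per-token training signal.
# The tabular "models" and sizes are placeholders for illustration only.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, STEPS = 50, 32

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

# stand-in "models": a table of next-token logits per current token
teacher = rng.standard_normal((VOCAB, VOCAB))
student = 0.1 * rng.standard_normal((VOCAB, VOCAB))

tok, per_token_loss = 0, []
for _ in range(STEPS):
    s_logp = log_softmax(student[tok])
    t_logp = log_softmax(teacher[tok])
    p = np.exp(s_logp)
    nxt = rng.choice(VOCAB, p=p / p.sum())       # on-policy: the student samples
    # reverse-KL-style credit on the student's own trajectory: minimizing it
    # pulls the student toward the teacher exactly where the student goes
    per_token_loss.append(s_logp[nxt] - t_logp[nxt])
    tok = nxt

print(f"mean per-token distillation loss: {np.mean(per_token_loss):.3f}")
```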
21
99
774
69.6K
Moon Head retweeted
fks
fks@FredKSchott·
Introducing Flue: The First Agent Harness Framework

Flue is a TypeScript framework for building the next generation of agents, designed around a built-in agent harness.

Flue is like Claude Code, but 100% headless and programmable. There's no baked-in assumption like requiring a human operator to function. No TUI. No GUI. Just TypeScript.

But using Flue feels like using Claude Code. The agents you build act autonomously to solve problems and complete tasks. They require very little code to run. Most of the "logic" lives in Markdown: skills, context, and AGENTS.md.

Flue is like Astro or Next.js for agents (not surprising, given my background 🙃). It's not another AI SDK. It's a proper runtime-agnostic framework. Write once, build, and deploy your agents anywhere (Node.js, Cloudflare, GitHub Actions, GitLab CI/CD, etc).

We originally built Flue to power AI workflows inside of the Astro GitHub repo. But then @_bgiori got his hands on it, and we realized that every agent needs a framework like Flue, not just us.

Check it out! It's early, but I'm curious to hear what people think. Are agents ready for their library -> framework moment?
fks tweet media
172
329
3.6K
681.4K
Moon Head
Moon Head@MoonHeead·
@Cyb3rDav3 @allen_explains It was simple:
1. Prompt all AI assistants to write a tech report on the video (transcript).
2. Prompt all AI verifiers to verify the reports, giving them the original prompt to rate them.
3. Results (shown above).
PS: All AI used via UI (free mode).
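In pseudo-code, that loop looks roughly like the sketch below. Everything was done by hand through chat UIs in free mode, so ask() is only a placeholder for pasting a prompt into a given model; the 1-10 scale and the mean aggregation are assumptions, not details from the post. The model lists come from the comparison post in this thread.

```python
# Sketch of the pipeline described above: each assistant writes a report from
# the same prompt, each verifier rates every report given the original prompt,
# and the average rating produces the ranking. ask() is a placeholder for
# pasting a prompt into a chat UI (free mode, no API); the 1-10 scale and the
# mean aggregation are assumptions.
from statistics import mean

ASSISTANTS = ["Qwen 3.6plus", "GPT 5.4", "Kimi k2.6 agent", "Deepseek v4 pro", "GLM-5.1"]
VERIFIERS = ["Sonnet 4.6", "DeepSeek v4 pro", "Qwen 3.6plus"]

def ask(model: str, prompt: str) -> str:
    """Placeholder: in the original run this meant pasting the prompt into
    the model's chat UI and copying the answer back by hand."""
    raise NotImplementedError

def run_eval(task_prompt: str) -> list[tuple[str, float]]:
    reports = {a: ask(a, task_prompt) for a in ASSISTANTS}            # step 1
    scores = {a: [] for a in ASSISTANTS}
    for v in VERIFIERS:                                               # step 2
        for a, report in reports.items():
            rating = ask(v, f"Original prompt:\n{task_prompt}\n\n"
                            f"Report to rate (1-10):\n{report}")
            scores[a].append(float(rating))
    # step 3: rank assistants by mean verifier score
    return sorted(((a, mean(s)) for a, s in scores.items()),
                  key=lambda x: x[1], reverse=True)
```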
0
0
0
23
Allen Braden
Allen Braden@allen_explains·
🚨 A junior at Jane Street reportedly landed a $220K–$600K role because he used AI to analyze trillions of data points faster than most teams ever could.

In this 1-hour lecture, he breaks down the exact system behind it:
• how he researches massive datasets
• how AI finds patterns humans miss
• how his machine turns raw data into decisions
• how you can apply the same thinking yourself

Skip Netflix tonight. Watch this instead. One hour could completely change how you think about research, AI, and opportunity.
60
711
5.4K
858.1K
Moon Head
Moon Head@MoonHeead·
@allen_explains Related to this topic, I sent a single prompt to 5 AI assistants (thinking mode) & 3 AI verifiers:
ASSISTANTS:
-A1: Qwen 3.6plus
-A2: GPT 5.4
-A3: Kimi k2.6 agent
-A4: Deepseek v4 pro
-A5: GLM-5.1
VERIFIERS: Sonnet 4.6, DeepSeek v4 pro, Qwen 3.6plus
Results: A1 > A4 > A3 > A2 > A5
Moon Head tweet media
2
0
1
2.7K
Moon Head
Moon Head@MoonHeead·
@KhalidWarsa 36h non-stop. Heavy research + light coding. 255M tokens burned. Total cost: $2.71.
The stack I used:
- Hermes agent as the central orchestrator
- ML-intern for the HF ecosystem
- Subagents for ArXiv, GitHub, Subreddits, & X, etc.
- My own AI infra: I use GPT 5.5 as an LLM verifier layer
1
0
6
1.5K
Moon Head retweeted
NASA Administrator Jared Isaacman
Congratulations to the Kingdom of Morocco on joining the Artemis Accords. Together, we’re building the future of exploration.
420
1.3K
11.1K
1.1M
Moon Head
Moon Head@MoonHeead·
@TheAhmadOsman Maaan, from day one I keep telling ma homies that OpenClaw to Pi is like Thomas Edison to Nikola Tesla.
0
0
2
287
Ahmad
Ahmad@TheAhmadOsman·
Alternatives to the BLOATED OpenClaw?
- Hermes Agent
- ZeroClaw
- Pi (OpenClaw is built on top of it)
- NanoClaw
Recently I also came across GitClaw, which has an interesting design, but I am yet to have a verdict on it
54
11
312
38.7K
Moon Head
Moon Head@MoonHeead·
Here’s the research stack I used:
- Hermes agent (by Nous Research) as the central orchestrator
- ML-intern (by Hugging Face) for research inside the HF ecosystem
- Tailored subagents for ArXiv, GitHub, Subreddits, & X
- My AI infra: used GPT 5.5 as an LLM verifier layer
0
0
2
139
Moon Head
Moon Head@MoonHeead·
3 days of fully autonomous iteration with DeepSeek V4 Pro. 36h non-stop. Heavy research + light coding. 255M tokens burned. Total cost: $2.71. For comparison, other frontier models: $1.5k to $2k. The price-performance gap is insane.
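Rough arithmetic behind that gap, assuming the $1.5k to $2k figure refers to the same 255M-token workload priced on other frontier models (that assumption is mine, it isn't spelled out in the post):

```python
# Back-of-the-envelope numbers for the price-performance claim above.
tokens_m = 255                   # millions of tokens burned
deepseek_cost = 2.71             # USD, as reported
frontier_range = (1500, 2000)    # USD, the quoted comparison range

print(f"implied cost: ${deepseek_cost / tokens_m:.4f} per million tokens")
for c in frontier_range:
    print(f"gap vs ${c}: ~{c / deepseek_cost:.0f}x")
# -> roughly $0.0106 per million tokens and a ~550-740x gap
```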
1
0
12
512
Moon Head
Moon Head@MoonHeead·
@0xSero bro I would like to have them as a gift but I am far away, Morocco
0
0
0
5
0xSero
0xSero@0xSero·
Winner of yesterday's book giveaway (:
carlos@carloslondrez

@0xSero I think it'll look good next to my collection of beaten up Noah Harari books

5
0
29
3.1K
Moon Head retweeted
Rasmus Andersson
Rasmus Andersson@rsms·
This is a glimpse of big changes ahead of us. If you’re betting on big central models you should think twice. I run the exact same setup (M5 MacBook, qwen3.6-27B, pi, ollama) and while it’s not as fast or good as one of the big central models, it’s past the line of “cool demo” into “truly useful.” Kind of where the big frontier models were in late 2025. In ~24 months we might have local models that are fast and good enough for most tasks.
Julien Chaumond@julien_c

This is where we are right now. And I’m not gonna lie, it feels pretty magical 🧚‍♀️ Qwen3.6 27B running inside of the Pi coding agent via Llama.cpp on the MacBook Pro. For non-trivial tasks on the @huggingface codebases, this feels very, very close to hitting the latest Opus in Claude Code, or whatever the shiny monopolistic closed-source API of the day is. In full airplane mode. Most people haven’t realized this yet. If you have, it means you have a huge head start on what I call the second revolution of AI. Powerful local models for efficiency, security, privacy, sovereignty 🔥

40
56
907
108.3K
wh
wh@nrehiew_·
How I read papers now. This is an explainer by Claude about the new Compressed Sparse Attention that V4 uses to compress the KV cache.
wh tweet media
wh@nrehiew_

Now reading:

6
69
699
55.5K
Moon Head
Moon Head@MoonHeead·
@deepseek_ai DeepSeek-V4 Preview??? something still cooking there hmm... DeepSeek v4.1 will beat Claude Mythos? 👀 wait & see
GIF
0
0
0
89
DeepSeek
DeepSeek@deepseek_ai·
🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.
🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.
Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today!
📄 Tech Report: huggingface.co/deepseek-ai/De…
🤗 Open Weights: huggingface.co/collections/de…
1/n
DeepSeek tweet media
1.6K
7.7K
45K
9.5M
Moon Head
Moon Head@MoonHeead·
@ollama At least clarify that it's only for the pro sub!!
0
0
1
730
ollama
ollama@ollama·
deepseek-v4-flash is now available on Ollama's cloud! Hosted in the US.
Try it with Claude Code: ollama launch claude --model deepseek-v4-flash:cloud
Try it with OpenClaw: ollama launch openclaw --model deepseek-v4-flash:cloud
Try it with Hermes: ollama launch hermes --model deepseek-v4-flash:cloud
Try it with chat: ollama run deepseek-v4-flash:cloud
(DeepSeek V4 Pro is coming shortly) 🧵
DeepSeek@deepseek_ai

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.
🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.
Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today!
📄 Tech Report: huggingface.co/deepseek-ai/De…
🤗 Open Weights: huggingface.co/collections/de…
1/n

84
155
1.6K
147.4K
Moon Head
Moon Head@MoonHeead·
It’s a chain reaction. Break one link, and the entire economic system starts to destabilize.
0
0
0
9
Moon Head
Moon Head@MoonHeead·
The biggest contradiction here: if companies replace us with AI, they reduce payroll costs, but they also remove the very consumers who sustain demand. Fewer employed people → less spending → lower demand for the same products and services those companies provide. 🤷‍♂️
1
0
0
12