Seba
@CulStory

361 posts

TLDR Current proj: Realtime Text-to-Speech on Apple NPU https://t.co/kJWvH9C7TK

Joined July 2011
2.3K Following · 293 Followers
Ivan Fioravanti ᯅ@ivanfioravanti·
MLX DeepSeek-V4-Flash-2bit-DQ MLX 4K context issue solved! Benchmark results on Apple M5 Max, 128.0GB RAM, 18 CPU cores, 40 GPU cores. A comparison of M3 Ultra vs M5 Max, including batch performance, will follow shortly.
0.5k: pp 446, tg 42 t/s, mem 97.8GB, kv 0.02GB
1k: pp 578, tg 42 t/s, mem 98.1GB, kv 0.02GB
2k: pp 622, tg 40 t/s, mem 99.2GB, kv 0.03GB
4k: pp 570, tg 37 t/s, mem 100.7GB, kv 0.04GB
8k: pp 513, tg 37 t/s, mem 101.4GB, kv 0.06GB
16k: pp 390, tg 37 t/s, mem 102.7GB, kv 0.12GB
32k: pp 343, tg 36 t/s, mem 104.5GB, kv 0.23GB
64k: pp 297, tg 34 t/s, mem 109.4GB, kv 0.45GB
This is using this PR from @0xClandestine 🔥 It's faster than yesterday! I bet it's using matmul in hardware much more. github.com/Blaizzy/mlx-lm…
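
A rough way to read those kv numbers: for the longer runs the cache grows close to linearly with context, so a per-token figure can be backed out directly from the reported values. Plain-Python arithmetic on the numbers quoted above (it assumes the kv column is dominated by the cache itself rather than allocator overhead):

```python
# Back-of-the-envelope: per-token kv-cache cost from the longer runs above
# (context length in tokens -> reported kv-cache size in GB).
kv_gb = {16_384: 0.12, 32_768: 0.23, 65_536: 0.45}

for tokens, gb in kv_gb.items():
    per_token_kb = gb * 1e9 / tokens / 1e3
    print(f"{tokens:>6} tokens: ~{per_token_kb:.1f} KB of kv-cache per token")
# all three land around 7 KB/token, i.e. roughly 0.7 GB of cache per 100k-token request
```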
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
I've been rereading V4 paper too, and getting dizzy again from how insane it is. And you know what's funny. The real problem they were solving, the "pivotal design goal of V4" – at any cost! – was not "1M context", it was *batch invariance*. Opus proposes a conspiracy theory:
Arthur Zucker@art_zucker

Reading @deepseek_ai 's v4 paper.... absolute hats off. Every problem has a mathematical solution, nothing is left to chance. I have so much respect for them, putting out months or years of efforts entirely for free, in the open for anyone to benefit. Real goats 🫡

Seba@CulStory·
the most important part about v4 flash/pro: you can probably serve 100s of users at >100k context each on a single gpu/node.
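
Rough arithmetic behind that claim, reusing the ~7 KB/token kv-cache figure that falls out of the benchmark above (an estimate, not a measured serving number; weight footprint and scheduler overhead are ignored):

```python
# Hypothetical capacity estimate; the 7 KB/token figure is taken from the
# benchmark numbers quoted earlier in this thread, not from a published spec.
KV_KB_PER_TOKEN = 7
users = 100
context_per_user = 100_000  # tokens

kv_total_gb = users * context_per_user * KV_KB_PER_TOKEN * 1e3 / 1e9
print(f"{kv_total_gb:.0f} GB of kv-cache for {users} users "
      f"at {context_per_user:,} tokens each")
# ~70 GB of cache, which fits next to the weights on a single large-memory GPU or node.
```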
Seba@CulStory·
@Is36E @huggingface you should look into making hf and mlx weights compatible, hate having to download both
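
Worth noting the file format itself is already shared: MLX can read Hugging Face .safetensors files directly, so the duplicate downloads are mostly about pre-quantized or re-laid-out weights rather than the container. A minimal sketch, assuming a local model.safetensors file (hypothetical path) and an MLX install:

```python
import mlx.core as mx

# Load a Hugging Face-style .safetensors shard straight into MLX arrays.
# "model.safetensors" is a hypothetical local path; mx.load returns a dict
# mapping parameter names to mx.array for safetensors files.
weights = mx.load("model.safetensors")

for name, tensor in list(weights.items())[:5]:
    print(name, tensor.shape, tensor.dtype)
```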
Isalia20@Is36E·
This marks the end of my first week at @huggingface! I'm joining as a founding engineer on HF's PyTorch team. My first project: safetensors on Mac is up to 3x faster🚀 Parallel reads straight into MPS unified memory, no CPU staging. MB Pro M5 Pro - Cold 16 GB: **2.97 → 8.23 GB/s** (2.8×) - Warm 3 GB: **10.3 → 26.6 GB/s** (2.6×)
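
The user-facing side of that is just the usual safetensors call with an MPS target; the parallel-read fast path described above happens underneath it. A minimal sketch, assuming a local model.safetensors file (hypothetical path) and a Metal-capable PyTorch build:

```python
from safetensors.torch import load_file

# Load tensors directly onto the MPS device.
# "model.safetensors" is a hypothetical local path.
state_dict = load_file("model.safetensors", device="mps")

total_gb = sum(t.numel() * t.element_size() for t in state_dict.values()) / 1e9
print(f"{len(state_dict)} tensors, {total_gb:.2f} GB, "
      f"device={next(iter(state_dict.values())).device}")
```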
Seba@CulStory·
@maderix hopefully someday they'll allow implementing flash attention on their npu; for now you have to chunk it
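
"Chunk it" here means tiling attention over key/value blocks and carrying a running softmax, so the full n x n score matrix never materializes; flash attention fuses the same trick into one kernel. A numpy sketch of that chunked computation (single head, no mask, illustration only, not the ANE code being discussed):

```python
import numpy as np

def chunked_attention(q, k, v, chunk=256):
    """Exact attention computed over K/V chunks with a running (online) softmax."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    running_max = np.full(q.shape[0], -np.inf)   # row-wise max seen so far
    denom = np.zeros(q.shape[0])                 # running softmax denominator
    out = np.zeros((q.shape[0], v.shape[-1]))    # running weighted sum of V

    for start in range(0, k.shape[0], chunk):
        k_blk, v_blk = k[start:start + chunk], v[start:start + chunk]
        s = (q @ k_blk.T) * scale                # partial scores, (n_q, chunk)
        new_max = np.maximum(running_max, s.max(axis=-1))
        rescale = np.exp(running_max - new_max)  # fix up old accumulators
        p = np.exp(s - new_max[:, None])
        denom = denom * rescale + p.sum(axis=-1)
        out = out * rescale[:, None] + p @ v_blk
        running_max = new_max
    return out / denom[:, None]

# check against full softmax attention
q, k, v = np.random.randn(8, 64), np.random.randn(1024, 64), np.random.randn(1024, 64)
s = q @ k.T / np.sqrt(q.shape[-1])
p = np.exp(s - s.max(-1, keepdims=True))
reference = (p / p.sum(-1, keepdims=True)) @ v
assert np.allclose(chunked_attention(q, k, v), reference)
```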
maderix@maderix·
Sigmoid self-attention runs quite well on ANE, hitting almost 90% of fp16 peak via ANE private APIs. While sigmoid can't be naively used, with normalisation tricks it probably can be done. Currently investigating if softmax self-attention can be improved as well
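
The normalisation trick usually cited for sigmoid attention (e.g. in Apple's sigmoid-attention paper) is a -log(n) bias on the logits, so each query's total gate mass stays comparable to a softmax row. A small numpy sketch of that variant (illustrative only, not maderix's ANE implementation):

```python
import numpy as np

def sigmoid_attention(q, k, v):
    """Sigmoid attention: elementwise gates with a -log(n) logit bias, no softmax reduction."""
    n = k.shape[0]
    scale = 1.0 / np.sqrt(q.shape[-1])
    logits = q @ k.T * scale - np.log(n)   # bias keeps total gate mass roughly O(1) per query
    gates = 1.0 / (1.0 + np.exp(-logits))  # purely elementwise: no row-wise max/sum reduction
    return gates @ v

q, k, v = np.random.randn(4, 64), np.random.randn(256, 64), np.random.randn(256, 64)
print(sigmoid_attention(q, k, v).shape)    # (4, 64)
```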
Seba@CulStory·
@PrismML we need a per-channel quant to make it run on npu 🙃
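
Per-channel here means one scale per output channel rather than per small group of weights, which is the layout fixed-function NPUs generally prefer. A minimal symmetric int8 sketch in numpy (illustrative; not tied to any particular model's scheme):

```python
import numpy as np

def quantize_per_channel(w):
    """Symmetric int8 quantization with one scale per output channel (row of w)."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q, scales):
    return q.astype(np.float32) * scales

w = np.random.randn(8, 512).astype(np.float32)
q, s = quantize_per_channel(w)
print("max abs reconstruction error:", np.abs(dequantize(q, s) - w).max())
```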
PrismML@PrismML·
Today we’re announcing Ternary Bonsai: Top intelligence at 1.58 bits.

Using ternary weights {-1, 0, +1}, we built a family of models that are 9x smaller than their 16-bit counterparts while outperforming most models in their respective parameter classes on standard benchmarks. We’re open-sourcing the models under the Apache 2.0 license in three sizes: 8B (1.75 GB), 4B (0.86 GB), and 1.7B (0.37 GB).
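
The 1.58 figure is just log2(3): a weight with three possible values carries at most about 1.58 bits. One common recipe for mapping full-precision weights onto {-1, 0, +1} is the absmean rounding used by BitNet b1.58; the announcement doesn't say whether Bonsai uses exactly this, so the sketch below is illustrative:

```python
import numpy as np

print(f"bits per ternary weight: {np.log2(3):.2f}")   # ~1.58
# Size sanity check on the quoted numbers: 1.75 GB for 8B params is ~1.75 bits/param,
# consistent with ternary weights plus some higher-precision layers.

def ternarize_absmean(w, eps=1e-8):
    """BitNet-b1.58-style ternarization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = np.abs(w).mean() + eps
    return np.clip(np.round(w / scale), -1, 1), scale

w = np.random.randn(4, 8)
t, scale = ternarize_absmean(w)
print(np.unique(t))   # only -1, 0, +1 remain
```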
Seba@CulStory·
@ronaldmannak idk if this is still the case, but some time ago qmv (small-batch-size mm) added latency on m1/m2, so i made a custom kernel for that, may be useful for batched requests. github.com/0seba/mlx-eagl…
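
The shape of that measurement is easy to reproduce: time the same matmul at batch 1 versus a few rows and see whether latency stays flat (i.e. the batch-1 case was latency-bound rather than compute-bound). A rough MLX sketch (assumes an MLX install; the original issue concerned quantized matvec kernels specifically, which this plain-matmul version only approximates):

```python
import time
import mlx.core as mx

d_in, d_out = 4096, 4096
w = mx.random.normal((d_in, d_out))

def avg_latency_ms(batch, iters=50):
    x = mx.random.normal((batch, d_in))
    mx.eval(x, w)                      # materialize inputs before timing
    start = time.perf_counter()
    for _ in range(iters):
        mx.eval(x @ w)                 # force each matmul to actually run
    return (time.perf_counter() - start) / iters * 1e3

for b in (1, 2, 4, 8):
    print(f"batch {b}: {avg_latency_ms(b):.3f} ms")
```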
Ronald Mannak@ronaldmannak·
Apple Silicon + Gemma 4 fans: this is for you. Pico AI Server now supports continuous batching with MLX-Swift. 43 tok/s on 1 stream. 26 tok/s per stream on 2 concurrent streams. That’s 52 tok/s total, a 21% throughput gain on a six-year-old MacBook Pro M1 Max!
Prince Canuma@Prince_Canuma·
@runsonai @liranringel That’s awesome! I tried ddtree yesterday on MLX-VLM but didn’t see a significant speed up, maybe I missed something
Thanh Pham@runsonai·
Currently porting @liranringel ddtree to accelerate speculative decoding on mlx (apple). Looking very promising on first test.
Seba@CulStory·
@anemll for ddtree, compute grows exponentially to achieve a sub-linear speedup. i like dflash, but imo beyond that point it is better to spend the compute on higher quality tokens rather than more tokens
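
The exponential-compute point can be made concrete with simple counting: a full draft tree of branching factor b and depth d has on the order of b^d nodes to verify per step, while the accepted prefix grows by at most d tokens per step. A quick sketch of that accounting (generic tree speculative decoding; ddtree's actual tree shape isn't described in this thread):

```python
# Drafted tokens a full tree of branching factor b and depth d asks the target
# model to verify in one step, versus at most d tokens accepted per step.
def tree_nodes(b, d):
    return sum(b**i for i in range(1, d + 1))   # root excluded (it is the known last token)

b = 3
for d in range(1, 7):
    print(f"depth {d}: verify up to {tree_nodes(b, d):>4} drafted tokens, "
          f"accept at most {d}")
```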
Seba@CulStory·
@danveloper not so sure about lower power, slow disk storage uses a lot more power and heat is an additional issue
Dan Woods@danveloper·
Somehow this remains an easily overlooked aspect of the flash-moe strategy... you can run a 26 billion parameter model in a very very small memory footprint. You pay with token throughput, but that might be an acceptable tradeoff. Lower power, less RAM, bigger models.
Anemll@anemll

@Alexey_CA @twostraws @jeremyphoward This is what Flash-MoE is trying to address: running in low-RAM environments. This runs 26B in a 3GB footprint. Improving the iPhone path also helps with the M5 Max 128GB and M3U optimizations for me.
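
The general mechanism behind that kind of footprint is keeping expert weights on disk and reading in only the experts the router selects for each token; the throughput cost comes from disk bandwidth. A generic numpy memmap illustration (not Anemll's implementation; the file name, shapes, and routing here are made up):

```python
import numpy as np

# Hypothetical layout: one file holding n_experts contiguous [d_in, d_out] fp16 expert matrices.
# "experts.bin" and all shapes here are invented for the demo.
n_experts, d_in, d_out = 16, 256, 512
experts = np.memmap("experts.bin", dtype=np.float16, mode="w+",
                    shape=(n_experts, d_in, d_out))
experts[:] = np.random.randn(n_experts, d_in, d_out).astype(np.float16)

def moe_forward(x, router_logits, top_k=2):
    """Apply only the top-k routed experts; only those slices are paged in from the file."""
    idx = np.argsort(router_logits)[-top_k:]
    gate = np.exp(router_logits[idx])
    gate /= gate.sum()
    out = np.zeros(d_out, dtype=np.float32)
    for g, e in zip(gate, idx):
        out += g * (x.astype(np.float32) @ experts[e].astype(np.float32))
    return out

print(moe_forward(np.random.randn(d_in), np.random.randn(n_experts)).shape)  # (512,)
```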

Demis Hassabis@demishassabis·
Gemma 4 outperforms models over 10x its size! (note the x-axis is log scale!)