Dan Woods (@danveloper) - Twitter profile

Pinned tweet

Dan Woods@danveloper·1d

x.com/i/article/2034…

Dan Woods@danveloper·5h

@danpacary github.com/danveloper/fla… it's all there

Daniel Isaac@danpacary·5h

@danveloper Can you share your code from your final findings?

Dan Woods@danveloper·7h

We’re gonna see someone running a 1T model at 100tok/s on a $2500 laptop by like a month from now. Long Apple, this is the best AI inference platform.

Dan Woods@danveloper·6h

@alexocheema Um. I'd like to buy stock in exo.

Alex Cheema@alexocheema·6h

@danveloper Close. We will see someone running a 1T model on $16,000 of heterogeneous hardware (Apple Silicon + NVIDIA) at 100tok/sec this year.

Dan Woods@danveloper·6h

@danpacary You're asking how much of your ANE training work you can port straight to inference? You'd know more than me on that, but I'd reckon all of it. Unless there's actually a transition boundary from unified memory to ANE SDRAM, but I don't think there is.

Daniel Isaac@danpacary·6h

@danveloper That’s what I was thinking. How much of the work can I port straight to inference?

Dan Woods@danveloper·6h

@danpacary pure inference, correct... although 🤔 no reason you *couldn't* do it for training... the ANE shares the same unified memory, so nothing stopping you from adapting your ANE training code to use the streaming code

Daniel Isaac@danpacary·6h

@danveloper We’re talking pure inference optimization?

Dan Woods@danveloper·6h

@danpacary Technically, your only param and quant limit is that (size of non-expert weights) + (2 * size of the current expert layer) has to fit in memory. Parameter count and quantization don't matter much beyond that, except that more weight bytes (obviously) means a bigger size on disk, which means more read time.
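
Read literally, that rule of thumb is a memory formula rather than a parameter-count limit. The Python sketch below only does the arithmetic. Treating the "2 *" as one expert layer in use plus one being loaded is an assumption, and `weight_bytes`, `resident_bytes`, and the parameter counts in the example are hypothetical, not taken from the repo or any real model.

```python
# Rough peak-memory estimate for streaming MoE inference, following the
# rule of thumb in the tweet: non-expert weights stay resident, plus room
# for two expert layers (assumed here: one in use, one being loaded).


def weight_bytes(n_params: int, bits_per_weight: float) -> int:
    """Bytes needed to store n_params weights at the given quantization."""
    return int(n_params * bits_per_weight / 8)


def resident_bytes(non_expert_bytes: int, expert_layer_bytes: int) -> int:
    """(size of non-expert weights) + 2 * (size of one expert layer)."""
    return non_expert_bytes + 2 * expert_layer_bytes


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical parameter counts, for illustration only.
    non_expert = weight_bytes(6_000_000_000, 4)    # 4-bit dense/shared weights
    expert_layer = weight_bytes(1_500_000_000, 4)  # one layer's streamed experts
    print(f"peak resident: {resident_bytes(non_expert, expert_layer) / GIB:.2f} GiB")
```

The total parameter count never appears in `resident_bytes`; in this model it only changes how many bytes have to be read from disk per token.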

Daniel Isaac@danpacary·6h

@danveloper So hold on. What are my active param and quant limits? So technically you can shave and squeeze a model down?

Dan Woods@danveloper·6h

@danpacary We've got some friends... the plumbing is close. If we can predict which MoE experts are needed, we can prewarm the disk cache for layer N+1 while layer N is processing, and throughput will skyrocket because it's cache->DMA at that point. x.com/Ex0byt/status/…

Eric@Ex0byt

Get Excited: @0xSero and I are close. A B300 is currently training a tiny (15M param) side-loaded neural network that helps select, load, and cache the correct MoE experts for Kimi-K2.5 (a 1T-param MoE model running in 25GB of memory). Once experiments are done, we will share the paper: "Thicket-Guided Expert Prediction for Memory-Minimal Trillion-Parameter MoE Inference on Unified Memory & Consumer Grade Hardware"
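
Prewarming layer N+1 while layer N computes is a prefetch (double-buffering) pattern. A minimal Python sketch, assuming hypothetical callables `predict`, `route`, `load`, and `compute` (none of these come from the repo); `predict` is where an expert predictor like the side-loaded network described above would plug in, and a misprediction falls back to a synchronous load.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence

# Hypothetical interfaces (not from the repo):
#   predict(layer, hidden) -> expert ids we guess the layer will need
#   route(layer, hidden)   -> expert ids the router actually picks
#   load(layer, ids)       -> {expert_id: weights}, e.g. read from disk
#   compute(layer, hidden, weights) -> new hidden state
Select = Callable[[int, object], Sequence[int]]
Load = Callable[[int, Sequence[int]], Dict[int, object]]
Compute = Callable[[int, object, Dict[int, object]], object]


def run_layers(num_layers: int, hidden, predict: Select, route: Select,
               load: Load, compute: Compute):
    """Run MoE layers, prefetching layer N+1's predicted experts during layer N."""
    # One background worker does the I/O; blocking file reads release the GIL,
    # so a read can overlap with compute on the main thread.
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load, 0, predict(0, hidden))
        for layer in range(num_layers):
            weights = dict(pending.result())  # wait for the prefetched experts
            if layer + 1 < num_layers:
                # Guess the next layer's experts now and start loading them.
                pending = io.submit(load, layer + 1, predict(layer + 1, hidden))
            # Misprediction fallback: load experts the router wants but we missed.
            missing = [e for e in route(layer, hidden) if e not in weights]
            if missing:
                weights.update(load(layer, missing))
            hidden = compute(layer, hidden, weights)
    return hidden
```

Throughput depends on the hit rate of `predict`: every miss turns into a blocking read on the critical path, which is why a better predictor matters so much here.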

Daniel Isaac@danpacary·7h

@danveloper Alright. I guess it’s time to switch gears

Dan Woods@danveloper·7h

@danpacary Yes

Daniel Isaac@danpacary·7h

@danveloper Do I sense a challenge?

Dan Woods@danveloper·7h

@anthonieisacnt either way ;)

Tay.ai 🏳️‍🌈@anthonieisacnt·7h

@danveloper ASIC*

Dan Woods@danveloper·1d

Key learning for the night (although I'm not done) is that Apple seems to have a built-in LZ4 hardware decompressor that can be used to stream compressed layers into memory efficiently enough that reading the smaller compressed data and decompressing it is faster than reading the raw weights.
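
The trade-off in this tweet can be written as a small throughput model. This is a back-of-the-envelope sketch, not a measurement: the compression ratio and bandwidths are hypothetical inputs, and the function names are illustrative. It assumes decompression either overlaps with reads (pipelined) or runs after them.

```python
# Back-of-the-envelope model: is reading LZ4-compressed weights and
# decompressing them faster than reading the raw weights?
# Bandwidths are in bytes/second; sizes are uncompressed bytes.


def raw_read_seconds(nbytes: float, read_bw: float) -> float:
    return nbytes / read_bw


def compressed_read_seconds(nbytes: float, ratio: float, read_bw: float,
                            decomp_bw: float, overlapped: bool = True) -> float:
    """ratio = uncompressed / compressed size; decomp_bw = output bytes/s."""
    read = nbytes / ratio / read_bw    # fewer bytes come off the disk
    decompress = nbytes / decomp_bw    # but every byte must be decompressed
    # If reads and decompression are pipelined, the slower stage dominates.
    return max(read, decompress) if overlapped else read + decompress


if __name__ == "__main__":
    GB = 1e9
    # Hypothetical inputs: a 1 GB layer, 5 GB/s SSD, 1.3x ratio, 20 GB/s decompressor.
    print(raw_read_seconds(1 * GB, 5 * GB))
    print(compressed_read_seconds(1 * GB, 1.3, 5 * GB, 20 * GB))
```

In the overlapped case, compressed streaming beats raw reads exactly when the compression ratio is above 1 and decompression throughput exceeds the raw read bandwidth, which is the condition a fast hardware decompressor would help meet.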

Dan Woods@danveloper·7h

@8my41 🙏 It would be my honor, but I'm gpu rich and budget poor

Jack Jaw@8my41·7h

@danveloper dan please, i wasn't kidding drive.google.com/file/d/1P2dNRq… deleting in 24h so i un-dox myself

Dan Woods retweeted

John T Davies 🇪🇺@jtdavies·8h

Extending on @danveloper (and Claude's) repo github.com/danveloper/fla… I managed to get Qwen3.5-397B-A17B-4bit (224GB) running comfortably on my new M5 Max Laptop. Shout out to @carsenklock's MacTop too. Over 10 Tok/sec!!!

Dan Woods@danveloper·8h

@8my41 😂

Jack Jaw@8my41·8h

@danveloper you hiring ? 😆😆

Dan Woods@danveloper·8h

@jtdavies WHAT

Dan Woods@danveloper·8h

@danpacary thicc dense models incoming

Daniel Isaac@danpacary·8h

@danveloper Hoping you would see this

Dan Woods retweeted

Daniel Isaac@danpacary·18h

I just trained a 5B param model on Apple's Neural Engine. On a MacBook Pro. Forward. Backward. Adam optimizer. Then I checked to see how far it would go. Technically got to 30B.

Dan Woods@danveloper·8h

@Ex0byt @0xSero I’m gonna stream this from disk

Dan Woods retweeted

Eric@Ex0byt·16h

Get Excited: @0xSero and I are close. A B300 is currently training a tiny (15M param) side-loaded neural network that helps select, load, and cache the correct MoE experts for Kimi-K2.5 (a 1T-param MoE model running in 25GB of memory). Once experiments are done, we will share the paper: "Thicket-Guided Expert Prediction for Memory-Minimal Trillion-Parameter MoE Inference on Unified Memory & Consumer Grade Hardware"

0xSero@0xSero

@pierrelezan Yes, @Ex0byt is working on this.

Dan Woods@danveloper·8h

@DaKulchur @simonw “fused multiply-add”

Anoop@DaKulchur·8h

@danveloper @simonw What's FMA? Thanks, Dan

Dan Woods@danveloper·16h

Some very meaningful progress on this project. A bunch of performance experiments and we've landed at 4.4 tok/s on the distribution Q4 weights. Feels pretty good since we started at 0.28tok/s. Code and experiments are up in the github repo now!