maderix

1.7K posts

@maderix

part-time prompt manipulator, full-time model tuner 🤖

Joined May 2020
73 Following · 2K Followers
maderix
maderix@maderix·
Since everyone was talking about @karpathy's Autoresearch, I thought I'd try it on a nagging problem I've been meaning to solve for the past few weeks: porting FlashAttention2 to Blaze's CUDA backend. It did get meaningful improvements, with the caveats below. TL;DR: it couldn't beat the manually lowered CUDA code yet (137 TFLOPS on an RTX 4090 with a concise FA2 kernel).
1. The best-performing approach the agents discovered (don't unroll the MLIR C++ kv-load loops manually; use scf::for for the kv loads) had already been tried a few days back, but it was good to see nonetheless.
2. The token requirements (70k tokens for 6 experiments) are what discourages me from letting it run longer. Also, on difficult problems the agents get stuck at a local maximum, and human intervention is needed to redirect the search. For now, manual loops give me faster iterations. Maybe that'll change in a few months.
3. The search space for this problem was much larger: writing MLIR code toward a goal (reducing register spills) is not trivial, since the LLVM and PTX backends have opaque optimizations that end up producing unexpected code.
But the scaffolding does work for autonomous experiments. I expect bounded problems will see better results; it's an interesting research area I hope to explore more in the future.
maderix tweet media
0 replies · 0 reposts · 3 likes · 250 views
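A back-of-envelope way to sanity-check a figure like 137 TFLOPS is to divide the standard attention FLOP count by the measured kernel time. A minimal sketch, where the shapes (batch, heads, seq, head_dim) are illustrative assumptions, not the actual Blaze benchmark config:

```python
def attention_tflops(batch, heads, seq, head_dim, seconds, causal=True):
    # Forward-pass FLOPs: the QK^T and P@V matmuls are each
    # 2 * B * H * S^2 * D, for 4 * B * H * S^2 * D total.
    flops = 4 * batch * heads * seq * seq * head_dim
    if causal:
        flops //= 2  # a causal mask computes only the lower triangle
    return flops / seconds / 1e12

# At these hypothetical shapes, a causal forward pass finishing
# in ~8 ms works out to roughly 137 TFLOPS.
achieved = attention_tflops(8, 32, 4096, 128, 8.0e-3)
```

The same arithmetic run in reverse (peak TFLOPS → minimum kernel time) is a quick way to see how far a kernel sits from the hardware roofline.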
maderix reposted
Anemll
Anemll@anemll·
Apple’s “LLM in a Flash” is definitely worth checking out. Going to 2-bit for the shared-expert MLP means disk I/O is no longer dominant. 14–15 tok/s from SSD is still wild for a ~400B MoE model streamed from storage. Qwen3.5-397B-A17B Credit: @danveloper
16 replies · 34 reposts · 325 likes · 29.7K views
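The bandwidth arithmetic behind that tok/s number is simple: bytes streamed per token times tokens per second. A sketch using the tweet's figures (~17B active params, 2-bit weights, 15 tok/s) under the deliberately pessimistic assumption that every active weight is read from SSD on every token:

```python
def ssd_bandwidth_gbps(active_params, bits_per_weight, tok_per_s):
    # Bytes that must be read per generated token if every active
    # weight streams from storage, multiplied by generation rate.
    bytes_per_token = active_params * bits_per_weight / 8
    return bytes_per_token * tok_per_s / 1e9

# ~17B active params at 2 bits, 15 tok/s -> ~64 GB/s required,
# well beyond SSD read speeds, so hot weights (like the shared
# expert) must stay resident and only a fraction streams per token.
required = ssd_bandwidth_gbps(17e9, 2, 15)
```

The gap between that worst case and real SSD bandwidth is exactly why shrinking and caching the always-active weights makes disk I/O stop being dominant.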
maderix
maderix@maderix·
Bro, I just said "hello" 🤡 (Qwen3.5-9b on Blaze/Simplelang)
maderix tweet media
2 replies · 0 reposts · 9 likes · 972 views
maderix
maderix@maderix·
The INT8 native path is accessible on the ANE on M4/M5. I stand corrected :)
maderix@maderix

@anemll thanks for the gist, the MIL benchmark I adapted from your code also shows 33 TOPS on M4 :)

0 replies · 0 reposts · 24 likes · 2.9K views
maderix
maderix@maderix·
@anemll thanks for the gist, the MIL benchmark I adapted from your code also shows 33 TOPS on M4 :)
maderix tweet media
0 replies · 0 reposts · 6 likes · 3.2K views
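The conversion behind a TOPS number like 33 is just op count over wall time: a matmul performs 2·M·N·K ops (one multiply and one add per MAC). A minimal sketch with NumPy on the CPU as a stand-in; the real benchmark dispatches the same op to the ANE via MIL, which this example does not attempt:

```python
import time
import numpy as np

def tops_from_matmul(m, n, k, seconds):
    # One multiply-accumulate per output element per k-step:
    # 2 * m * n * k ops total, divided by wall time, in tera-ops.
    return 2 * m * n * k / seconds / 1e12

# CPU stand-in for the measurement loop (hypothetical shape).
a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)
start = time.perf_counter()
_ = a @ b
elapsed = time.perf_counter() - start
print(f"{tops_from_matmul(1024, 1024, 1024, elapsed):.3f} TOPS")
```

In practice you would time many iterations and take the minimum, since a single matmul at this size finishes in well under a millisecond and timer jitter dominates.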
Anemll
Anemll@anemll·
33.8 TOPS in CoreML
Anemll tweet media
5 replies · 4 reposts · 55 likes · 14.6K views
maderix
maderix@maderix·
@anemll Interesting, so the native compute is fp16, but the weights and activations in L2 storage are quantized?
1 reply · 0 reposts · 0 likes · 406 views
maderix
maderix@maderix·
Dhu Ran Dharrr! 2 trailer Absolute Cinema 👐
1 reply · 0 reposts · 1 like · 676 views
maderix
maderix@maderix·
@DnuLkjkjh Current utilisation is 10-12%. It's pretty hard to go above 40-50% in training due to memory bandwidth and CPU dependency for some kernels. We have some ideas to go further; let's see.
0 replies · 0 reposts · 0 likes · 22 views
dnu
dnu@DnuLkjkjh·
the 5-9% ANE utilization number is actually the most interesting part of the writeup. it means the hardware ceiling is massive — if someone cracks better element-wise op fusion (instead of CPU fallback), you could potentially 10-20x throughput on the same silicon. i run whisper inference on ANE daily and even there CoreML leaves performance on the table. curious if the SRAM bandwidth is the real bottleneck for the matmul pipeline or if it is the compiler generating suboptimal tile schedules.
2 replies · 0 reposts · 1 like · 43 views
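The headroom claim follows directly from the utilization figure: if better op fusion let the same silicon run at full occupancy, the ceiling on the gain is just 100 divided by today's utilization. A trivial sketch of that arithmetic:

```python
def max_speedup(utilization_pct):
    # Upper bound on throughput gain from raising utilization to
    # 100%, ignoring other limits (SRAM bandwidth, scheduling,
    # precision constraints) that may bind first.
    return 100.0 / utilization_pct

# 5-9% utilization implies roughly an 11-20x theoretical ceiling,
# consistent with the 10-20x estimate above.
low, high = max_speedup(9), max_speedup(5)
```

This is only an upper bound; whether SRAM bandwidth or the compiler's tile schedules bind first (the question the thread raises) determines how much of it is reachable.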
maderix
maderix@maderix·
Dynamic pipeline is now ready, no recompilation needed 🔥 Stories110M model trained from scratch in 15 minutes on the ANE (M4) at 1.9 TFLOPS sustained. Final loss 1.5; generation quality semi-decent.
maderix tweet media
1 reply · 0 reposts · 33 likes · 1.7K views
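Assuming the quoted 1.9 TFLOPS held for the whole 15-minute run, the common 6·N FLOPs-per-token training approximation gives a rough token count for the run. A sketch, not a measured number:

```python
def tokens_trained(sustained_tflops, seconds, n_params):
    # Training costs ~6 FLOPs per parameter per token (forward +
    # backward), so tokens ~= total FLOPs / (6 * N).
    return sustained_tflops * 1e12 * seconds / (6 * n_params)

# 1.9 TFLOPS for 15 minutes on a 110M-param model -> ~2.6M tokens.
approx_tokens = tokens_trained(1.9, 15 * 60, 110e6)
```

A few million tokens is a small fraction of the TinyStories-scale corpora such models are usually trained on, which lines up with a loss of 1.5 and "semi-decent" generations.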
maderix reposted
Eric
Eric@Ex0byt·
Curiosity got the best of me again: so I hacked J.A.R.V.I.S to update its own model weights (`brain`) mid-conversation. Every time it responds, Apple's Neural Engine fires in the background, runs LoRA backprop via a reverse-engineered private API (github.com/maderix/ANE), and updates the model weights before you finish reading the reply (~8-second loop).
Eric tweet media
18 replies · 17 reposts · 252 likes · 12.3K views
maderix reposted
Sachin Desai
Sachin Desai@sach1n·
Here’s ANE running on an iPhone 17 Pro. Thank you @maderix for the amazing work.
14 replies · 23 reposts · 285 likes · 31.7K views
maderix reposted
John Mai
John Mai@JohnMai_Dev·
I just implemented inference for Qwen3.5 0.8B based on github.com/maderix/ANE, and successfully ran it on an M1 Pro.
John Mai tweet media
Brian Roemmele@BrianRoemmele

BOOM! Apple’s Neural Engine Was Just Cracked Open, The Future of AI Training Just Changed, And Zero-Human Company Is Already Testing It!

In a jaw-dropping open-source breakthrough, a lone developer has done what Apple said was impossible: full neural network training, including backpropagation, directly on the Apple Neural Engine (ANE). No CoreML, no Metal, no GPU. Pure, blazing ANE silicon. The project (github.com/maderix/ANE) delivers a single transformer layer (dim=768, seq=512) in just 9.3 ms per step at 1.78 TFLOPS sustained, with only 11.2% ANE utilization, on an M4 chip. That’s the same idle chip sitting in millions of Mac minis, MacBooks, and iMacs right now. Translation? Your desktop just became a hyper-efficient AI supercomputer.

The numbers are insane: the M4 ANE hits roughly 6.6 TFLOPS per watt, 80 times more efficient than an NVIDIA A100. Real-world throughput crushes Apple’s own “38 TOPS” marketing claims. And because it sips power like a phone, you can train 24/7 without melting your electricity bill or the planet.

At The Zero-Human Company, we’re not waiting. We are testing this right now on real ZHC workloads. This is the missing piece we’ve been chasing for our Zero-Human Company vision: reviving archived data into fully autonomous AI systems with zero human overhead.

This is world-changing. For the first time, anyone with a Mac can fine-tune, train, or iterate massive models locally, privately, and at a fraction of the cost of cloud GPUs. No more renting $40,000 A100 clusters. No more waiting in queues. No more massive carbon footprints. Training costs that used to run into the tens or hundreds of thousands of dollars? Plummeting toward pennies on the dollar, mostly just the electricity your Mac was already using while it sat idle. The AI revolution just moved from billion-dollar data centers to your desk. WE WILL HAVE A NEW ZERO-HUMAN COMPANY @ HOME wage for equipped Macs that will be up to 100x more income for the owner!

We’re only at the beginning (single-layer today, full models tomorrow), but the door is wide open. Ultra-cheap, on-device training is here. The future isn’t coming. It’s already running on your Mac. Welcome to the Zero-Human Company era.

66 replies · 155 reposts · 1.8K likes · 251.1K views
maderix
maderix@maderix·
Goddamn crypto bros! I'm NOT affiliated with the ANE token or any similar retardedness, now or EVER. These folks are just cashing in on some virality (which I did not even ask for).
26 replies · 1 repost · 26 likes · 3.7K views
maderix
maderix@maderix·
Damn the ANE project really blew up 😅 Thanks for all the follows and encouragement 🙏 Probably time to get X premium too 😅
9 replies · 5 reposts · 59 likes · 3.4K views
maderix
maderix@maderix·
@VipulDivyanshu @karpathy Awesome work! Glad training is stable now, would be interesting to see how far we can push it
4 replies · 0 reposts · 15 likes · 879 views
maderix reposted
Vipul Divyanshu⚡
Vipul Divyanshu⚡@VipulDivyanshu·
Great work by @maderix on the proof-of-concept for Apple Neural Engine private APIs. I went digging down the rabbit hole for the last 6 hours on how much training compute can be extracted from M4/M5 Neural Engine chips:
- was able to offload @karpathy's nanoGPT training run (partially) onto the Apple Neural Engine. So yes... it runs @karpathy's nanoGPT. Repo below 👇
- moved the Classifier & Softmax layers directly onto the ANE: Classifier is 10x faster, and Softmax is 34x faster
- fixed memory exhaustion: the original repo had an ARC memory leak that capped training at ~119 compile loads per process
- patched the C-bridge, allowing continuous, stable training
Brian Roemmele@BrianRoemmele
[quoted tweet, shown in full above]
10 replies · 10 reposts · 99 likes · 15.4K views