Fresh and custom silicon day! 😍
This is a multicore RISC-V SoC design based on @OlofKindgren's SERV processor core.
Built on the GF180MCU PDK and fabricated through wafer.space 🚀
@__tinygrad__ Thanks a lot ❤️. The code predates rangeify and I feel far from ready to even post it in the Discord. Would you mind making a Rockchip-hardware channel?
my C implementation supports not only MATMUL but also MIN, MAX, ADD, DIV, SUB, MUL, RELU, CONV1D, CONV2D, SIGMOID, SILU, NEG, ABS, MAXPOOL, AVGPOOL, CELU, SELU, GELU, ELU, EXP, CMPLT, CMPEQ, CMPLE, ROUNDOFF, ROUNDDOWN
not all of them are wired up through tinygrad yet
"how do you fit qwen 3.6 27b q4 on 24gb at 262k context" lands in my dms 5 times a week. here is the exact memory math.
model bytes at idle = 16gb (q4_k_m of 27b dense)
kv cache at 262k context with q4_0 for both k and v = 5gb
total = 21gb on the card
headroom = 3gb for prompts and tool call traces
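the budget above in one checkable sketch (all sizes are the post's round numbers, not measurements from your card):

```python
# rough vram budget for the setup above (round numbers from the post)
CARD_GB = 24

model_gb = 16  # q4_k_m weights of the 27b dense model, loaded at idle
kv_gb = 5      # q4_0 k + v cache at 262k context

total_gb = model_gb + kv_gb       # what actually sits on the card
headroom_gb = CARD_GB - total_gb  # left over for prompts and tool call traces

assert total_gb <= CARD_GB, "does not fit on a 24gb card"
print(total_gb, headroom_gb)  # → 21 3
```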
the magic is the kv cache type. most people leave it at the default fp16 or push it to q8 thinking quality wins. on qwen 3.6 27b dense at 262k:
- fp16 kv cache = does not fit at all
- q8 kv cache = fits at 23gb but runs 3x slower (double penalty: more vram, less speed)
- q4_0 kv cache = fits at 21gb at full speed (40 tok/s flat curve, same speed at 4k or 262k)
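why q4_0 kv is the size unlock: llama.cpp stores q4_0 as 32-element blocks of 18 bytes and q8_0 as 32-element blocks of 34 bytes, so per element you pay roughly 2 / 1.06 / 0.56 bytes for fp16 / q8_0 / q4_0. here is a sketch of the sizing formula — the layer count, kv head count, and head dim are placeholders, not qwen's real config, and this naive guess comes out bigger than the 5gb the post measures (the real model's kv layout evidently packs tighter), so read it for the ratios, not the absolute numbers:

```python
# per-element kv cache cost for the llama.cpp cache types discussed above
BYTES_PER_ELEM = {
    "f16":  2.0,      # the default
    "q8_0": 34 / 32,  # 32-element block: 32 int8 quants + fp16 scale
    "q4_0": 18 / 32,  # 32-element block: 16 packed nibble bytes + fp16 scale
}

def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """total k + v cache size in GiB for a given context length.
    architecture numbers are hypothetical placeholders, not qwen's config."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx  # 2 = one k + one v
    return elems * BYTES_PER_ELEM[cache_type] / 2**30

# hypothetical 27b-class dense model: 48 layers, 8 kv heads (gqa), head_dim 128
for t in BYTES_PER_ELEM:
    print(t, round(kv_cache_gib(262144, 48, 8, 128, t), 1))
# → f16 48.0 / q8_0 25.5 / q4_0 13.5  (a ~3.6x shrink from fp16 to q4_0)
```

whatever the exact architecture, the fp16 → q4_0 ratio is fixed at 2 / (18/32) ≈ 3.6x, which is what turns "does not fit at all" into "fits with headroom".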
most builders never test the kv cache type because tutorials never mention it. it is the single biggest unlock on consumer 24gb hardware.
flags i run:
./llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0
what they do:
-ngl 99 = offload everything to gpu
-c 262144 = 262k context window
-np 1 = single user slot (do not enable multi-slot, extra slots eat the headroom)
-fa on = flash attention on (memory and speed both win)
--cache-type-k q4_0 --cache-type-v q4_0 = the unlock
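once it is up, llama-server speaks an openai-compatible api, so a quick smoke test looks like this (127.0.0.1:8080 is the server default; adjust if you pass --host or --port):

```shell
# smoke test against the server started with the flags above
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "say hi in five words"}],
    "max_tokens": 32
  }'
```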
if you are sitting on 24gb and not running this config, you are leaving 250k of context on the table. or worse, you are running q8 kv cache and burning 3x your speed for nothing.
q4 is not a compromise on consumer hardware. it is the right call.