allbilly01

980 posts

@allbilly01

NPU driver hobbyist

Joined February 2012
866 Following · 80 Followers
allbilly01
allbilly01@allbilly01·
@_asadmemon run openpilot on the NPU with tinygrad; is the RV1106 NVDLA-based?
0
0
2
118
Asad Memon
Asad Memon@_asadmemon·
@allbilly01 What was your process? I want to see how different the RV1106 NPU is, and whether something similar can be done there.
1
0
2
195
allbilly01
allbilly01@allbilly01·
I reversed the RK3588 NPU registers and integrated them into tinygrad. Next, I will document it in as much detail as my ANE repo. Link in comment.
8
9
147
11.5K
allbilly01 retweeted
Greg
Greg@GregDavill·
Fresh and custom silicon day! 😍 This is a multicore RISC-V SoC design based on @OlofKindgren's SERV processor core, built on the GF180MCU PDK and fabricated through wafer.space 🚀
[images attached]
0
18
159
7.3K
allbilly01
allbilly01@allbilly01·
@__tinygrad__ Thanks a lot ❤️. The code predates rangeify and I feel far from ready to even post it in the Discord. Would you mind making a Rockchip-hardware channel?
1
0
8
489
the tiny corp
the tiny corp@__tinygrad__·
@allbilly01 Cool! A bunch of cleanup work and tests are needed, but it would be a good backend to have.
1
0
26
3.3K
allbilly01
allbilly01@allbilly01·
my C implementation supports not only MATMUL but also MIN, MAX, ADD, DIV, SUB, MUL, RELU, CONV1D, CONV2D, SIGMOID, SILU, NEG, ABS, MAXPOOL, AVGPOOL, CELU, SELU, GELU, ELU, EXP, CMPLT, CMPEQ, CMPLE, ROUNDOFF, ROUNDDOWN for tinygrad; not all are wired up yet
0
0
5
490
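A plausible way to validate an op list like the one above is to compare NPU output against a NumPy reference for each op. The table below is a hedged sketch of such a harness; the op names come from the tweet, but the coverage, function names, and tolerance are illustrative assumptions, not the actual repo's code.

```python
import numpy as np

# Illustrative NumPy reference for a subset of the ops listed above.
# A driver test harness could compare NPU output against these.
REF_OPS = {
    "ADD":     np.add,
    "SUB":     np.subtract,
    "MUL":     np.multiply,
    "DIV":     np.divide,
    "MIN":     np.minimum,
    "MAX":     np.maximum,
    "NEG":     np.negative,
    "ABS":     np.abs,
    "EXP":     np.exp,
    "RELU":    lambda x: np.maximum(x, 0),
    "SIGMOID": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "CMPLT":   np.less,
    "CMPEQ":   np.equal,
    "MATMUL":  np.matmul,
}

def check(op, npu_result, *args, tol=1e-3):
    # True iff the hardware result matches the NumPy reference within tol.
    return np.allclose(npu_result, REF_OPS[op](*args), atol=tol)
```

A fixed-function NPU usually computes in reduced precision, hence the tolerance rather than exact equality.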
allbilly01
allbilly01@allbilly01·
github.com/allbilly/npu/b…#L1087-L1138
1
0
5
534
allbilly01
allbilly01@allbilly01·
900 lines of Python, forgot to mention
0
0
3
594
allbilly01
allbilly01@allbilly01·
ANE registers
[image attached]
2
2
17
671
allbilly01
allbilly01@allbilly01·
Apple ANE CMD_BUF decoded
[image attached]
1
3
34
2.3K
allbilly01
allbilly01@allbilly01·
And it runs with pure Python and NumPy. No hwx file, no anecc, no .ane files.
0
0
0
68
allbilly01
allbilly01@allbilly01·
After I reversed the RK3588 NPU registers, I did the same for the Apple ANE and fully documented how to use the ANE on Asahi Linux. Link in comment.
4
0
4
136
allbilly01 retweeted
Sudo su
Sudo su@sudoingX·
"how do you fit qwen 3.6 27b q4 on 24gb at 262k context" lands in my dms 5 times a week. here is the exact memory math.

model bytes at idle = 16gb (q4_k_m of 27b dense)
kv cache at 262k context with q4_0 for both k and v = 5gb
total = 21gb on the card
headroom = 3gb for prompts and tool call traces

the magic is the kv cache type. most people leave it at default fp16 or push to q8 thinking quality wins. on qwen 3.6 27b dense at 262k:
- fp16 kv cache = does not fit at all
- q8 kv cache = fits at 23gb but runs 3x slower (double penalty: more vram, less speed)
- q4_0 kv cache = fits at 21gb at full speed (40 tok/s flat curve, same speed at 4k or 262k)

most builders never test the kv cache type because tutorials never mention it. it is the single biggest unlock on consumer 24gb hardware.

flags i run:
./llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0

what they do:
-ngl 99 = offload everything to gpu
-c 262144 = 262k context window
-np 1 = single user slot (do not enable multi-slot, eats headroom)
-fa on = flash attention on (memory and speed both win)
--cache-type-k q4_0 --cache-type-v q4_0 = the unlock

if you are sitting on 24gb and not running this config, you are leaving 250k of context on the table. or worse, you are running q8 kv cache and burning 3x your speed for nothing. q4 is not a compromise on consumer hardware. it is the right call.
85
110
1.3K
73.1K
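The memory math in the retweet above can be made concrete with a small calculator. This is a hedged sketch: the formula is the standard one for a llama.cpp-style KV cache (K and V each store n_layers × n_kv_heads × head_dim values per token), and the per-element sizes follow the GGUF block layouts (q4_0 stores 32 elements in 18 bytes, q8_0 in 34 bytes). The layer/head counts below are invented for illustration and will not reproduce the tweet's exact 5 GB figure, which depends on the real model's GQA configuration.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float) -> float:
    # K and V caches each hold n_layers * n_kv_heads * head_dim
    # values per token, for n_ctx tokens.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Effective bytes per element for common llama.cpp cache types:
# fp16 = 2 bytes; q8_0 = 34 bytes per 32-elem block; q4_0 = 18 per 32.
FP16, Q8_0, Q4_0 = 2.0, 34 / 32, 18 / 32

# ASSUMED model shape (48 layers, 8 KV heads, head_dim 128) at 262k context.
CTX = 262_144
for name, b in [("fp16", FP16), ("q8_0", Q8_0), ("q4_0", Q4_0)]:
    gib = kv_cache_bytes(48, 8, 128, CTX, b) / 2**30
    print(f"{name}: {gib:.1f} GiB")
```

Whatever the exact layer count, the ratios hold: q8_0 roughly halves the fp16 cache and q4_0 cuts it to about 28%, which is why the cache type, not the weight quant, decides whether 262k of context fits.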