Fresh and custom silicon day! 😍
This is a multicore RISC-V SoC design based on @OlofKindgren's SERV processor core.
Built on the GF180MCU PDK and fabricated through wafer.space 🚀
@__tinygrad__ Thanks a lot ❤️. The code predates rangeify and I feel far from ready to even post it in the Discord. Would you mind making a Rockchip-hardware channel?
my C implementation supports not only MATMUL but also MIN, MAX, ADD, DIV, SUB, MUL, RELU, CONV1D, CONV2D, SIGMOID, SILU, NEG, ABS, MAXPOOL, AVGPOOL, CELU, SELU, GELU, ELU, EXP, CMPLT, CMPEQ, CMPLE, ROUNDOFF, ROUNDDOWN
not all of them are wired up through tinygrad yet
"how do you fit qwen 3.6 27b q4 on 24gb at 262k context" lands in my dms 5 times a week. here is the exact memory math.
model bytes at idle = 16gb (q4_k_m of 27b dense)
kv cache at 262k context with q4_0 for both k and v = 5gb
total = 21gb on the card
headroom = 3gb for prompts and tool call traces
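the budget above in one checkable sketch (all sizes are the post's round numbers, not measurements from your card):

```python
# rough vram budget for the setup above (round numbers from the post)
CARD_GB = 24

model_gb = 16  # q4_k_m weights of the 27b dense model, loaded at idle
kv_gb = 5      # q4_0 k + v cache at 262k context

total_gb = model_gb + kv_gb       # what actually sits on the card
headroom_gb = CARD_GB - total_gb  # left over for prompts and tool call traces

assert total_gb <= CARD_GB, "does not fit on a 24gb card"
print(total_gb, headroom_gb)  # → 21 3
```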
the magic is the kv cache type. most people leave it at the default fp16 or push it to q8 thinking quality wins. on qwen 3.6 27b dense at 262k:
- fp16 kv cache = does not fit at all
- q8 kv cache = fits at 23gb but runs 3x slower (double penalty: more vram, less speed)
- q4_0 kv cache = fits at 21gb at full speed (40 tok/s flat curve, same speed at 4k or 262k)
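why q4_0 kv is the size unlock: llama.cpp stores q4_0 as 32-element blocks of 18 bytes and q8_0 as 32-element blocks of 34 bytes, so per element you pay roughly 2 / 1.06 / 0.56 bytes for fp16 / q8_0 / q4_0. here is a sketch of the sizing formula — the layer count, kv head count, and head dim are placeholders, not qwen's real config, and this naive guess comes out bigger than the 5gb the post measures (the real model's kv layout evidently packs tighter), so read it for the ratios, not the absolute numbers:

```python
# per-element kv cache cost for the llama.cpp cache types discussed above
BYTES_PER_ELEM = {
    "f16":  2.0,      # the default
    "q8_0": 34 / 32,  # 32-element block: 32 int8 quants + fp16 scale
    "q4_0": 18 / 32,  # 32-element block: 16 packed nibble bytes + fp16 scale
}

def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """total k + v cache size in GiB for a given context length.
    architecture numbers are hypothetical placeholders, not qwen's config."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx  # 2 = one k + one v
    return elems * BYTES_PER_ELEM[cache_type] / 2**30

# hypothetical 27b-class dense model: 48 layers, 8 kv heads (gqa), head_dim 128
for t in BYTES_PER_ELEM:
    print(t, round(kv_cache_gib(262144, 48, 8, 128, t), 1))
# → f16 48.0 / q8_0 25.5 / q4_0 13.5  (a ~3.6x shrink from fp16 to q4_0)
```

whatever the exact architecture, the fp16 → q4_0 ratio is fixed at 2 / (18/32) ≈ 3.6x, which is what turns "does not fit at all" into "fits with headroom".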
most builders never test the kv cache type because tutorials never mention it. it is the single biggest unlock on consumer 24gb hardware.
flags i run:
./llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0
what they do:
-ngl 99 = offload everything to gpu
-c 262144 = 262k context window
-np 1 = single user slot (do not enable multi-slot, extra slots eat the headroom)
-fa on = flash attention on (memory and speed both win)
--cache-type-k q4_0 --cache-type-v q4_0 = the unlock
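once it is up, llama-server speaks an openai-compatible api, so a quick smoke test looks like this (127.0.0.1:8080 is the server default; adjust if you pass --host or --port):

```shell
# smoke test against the server started with the flags above
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "say hi in five words"}],
    "max_tokens": 32
  }'
```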
if you are sitting on 24gb and not running this config, you are leaving 250k of context on the table. or worse, you are running q8 kv cache and burning 3x your speed for nothing.
q4 is not a compromise on consumer hardware. it is the right call.