Robot






this guy just cracked 134 tok/s on qwen 3.5-27b dense and 73 tok/s on the new qwen 3.6-27b, both on a single 3090.

open source moves at breakneck speed in 2026. weights ship in the evening, dynamic ggufs land by midnight, and a fused kernel + speculative decoding stack runs the new model 12 hours after release.

his dflash + ddtree stack loads qwen 3.6 as-is because the architecture string matches 3.5. zero retraining of the draft model, zero waiting for upstream support. the same hand-tuned consumer-hardware kernel work that pushed 3.5 to 134 tok/s already runs 3.6 at 73 tok/s, a regression he is openly flagging because the draft model needs a dedicated pass for 3.6.

this is the lane almost nobody is working on. major labs are stuck shipping framework abstractions optimized for h100 fleets. @pupposandro is hand-tuning kernels for the silicon actual builders own. the 3090 has 24 GB of vram, mature cuda support, and almost zero kernel-level optimization coming out of the big shops. it is the most underrated research platform in consumer ai right now.

i am running an honest q4_k_m baseline on llama.cpp now to set the dense floor without tricks. then sandro's stack runs on the same gpu, same model, same prompt. generic inference vs hand-tuned kernels with speculative decoding. that delta is where the next 5 years of consumer ai live. receipts incoming.
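a rough way to see why a draft model that was never tuned for 3.6 costs throughput: in plain chain-style speculative decoding, the number of tokens emitted per pass of the big model depends on how often the draft's guesses get accepted. the sketch below uses the standard closed-form estimate under the simplifying assumption that each drafted token is accepted independently with probability `alpha`. the values of `alpha` and `k` are made up for illustration, the post gives no acceptance rates, and the dflash + ddtree stack may well work differently (tree-based drafting, for example), so this is not a model of that stack.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass in chain speculative decoding.

    alpha: assumed per-token acceptance probability of the draft (0..1).
    k:     number of tokens the draft proposes per pass.
    Assumes acceptances are independent; real acceptance is context dependent.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # every draft token accepted, plus the one token the target adds itself
        return float(k + 1)
    # geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# ratio of the two throughput figures quoted in the post (73 vs 134 tok/s)
observed_ratio = 73 / 134

if __name__ == "__main__":
    # illustrative acceptance rates only, not measurements
    for alpha in (0.9, 0.7, 0.5):
        print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, k=8):.2f} tokens/pass")
    print(f"3.6 vs 3.5 throughput ratio from the post: {observed_ratio:.2f}")
```

under this toy model, a lower acceptance rate shrinks tokens per pass quickly, which is consistent with the claim that the draft needs a dedicated pass for 3.6. it does not by itself account for the drop to roughly 0.54x.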










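NVIDIA is quietly handing out money, and most people still don't know about this freebie!

build.nvidia.com → NVIDIA's official API platform. Sign up and you get 1000+ credits, and the API key can be set to never expire.

What you get for free:
→ MiniMax, Kimi-K2.5, GLM-5, Llama, and a bunch of other mainstream models
→ The free tier is rate-limited to about 40 requests/minute, plenty for everyday personal use
→ Pick "Never Expire" when generating the key, so you don't have to worry about it lapsing

Sign-up in 3 steps:
→ Open build.nvidia.com → sign in with Google/GitHub in one click
→ Verify your phone number (if it gets stuck, switch to incognito mode and retry)
→ Avatar in the top-right corner → API Keys → Generate, and set the expiration to never expire

Note: fill in both the account name and the cloud account name with English letters and digits only. Chinese characters will stall the sign-up flow.

A few practical limits to be clear about:
→ The free credits are capped. Once they run out, you have to apply for more on the official site
→ Rate limit of 40 requests/minute
→ Fine for personal tinkering, not for production
→ No telling when the credits will be tightened. That's NVIDIA's call

Who it's for → developers who want low-cost access to multiple LLM APIs for experiments, running agents, or building personal projects. For enterprise needs, just buy credits.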
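To stay under the roughly 40 requests/minute limit mentioned above, a client can throttle itself before each call. Below is a minimal sketch of a sliding-window limiter. The class name and its parameters are made up for this example and are not part of any NVIDIA SDK, and the actual API call is left as a comment because the post does not show the request format.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any rolling window of `period` seconds."""

    def __init__(self, max_calls=40, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.stamps = deque()   # timestamps of calls still inside the window

    def acquire(self) -> float:
        """Block until a call is allowed, record it, and return the seconds waited."""
        now = self.clock()
        # forget calls that have already left the window
        while self.stamps and now - self.stamps[0] >= self.period:
            self.stamps.popleft()
        waited = 0.0
        if len(self.stamps) >= self.max_calls:
            # wait until the oldest recorded call is exactly one period old
            waited = self.period - (now - self.stamps[0])
            self.sleep(waited)
            now = self.clock()
            self.stamps.popleft()
        self.stamps.append(now)
        return waited


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_calls=40, period=60.0)
    for i in range(3):
        limiter.acquire()
        # place the real request here, using the key generated on build.nvidia.com
        print(f"request {i} allowed")
```

A client-side limiter only avoids tripping the limit from a single process. If several scripts share one key, they would need a shared limiter or retry-on-error handling instead.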


















