Kaden
@schuttdev

38 posts

building things with Hermes Agent & Claude | CS @ ASU

Tempe, AZ · Joined January 2025
23 Following · 11 Followers

Kaden @schuttdev
yeah, it should work. the 7800 XT is gfx1101, same RDNA3 family as the 7900 XTX, so it JIT compiles clean on first run. 9B DFlash is up and running; perf scales roughly with CU count (60 vs 96 on the XTX). 27B is too heavy for 16GB right now. working toward more aggressive quants like MQ2 on the roadmap since a few people have asked. repo: github.com/Kaden-Schutt/h…
0 replies · 0 reposts · 0 likes · 6 views
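
A rough sketch of the CU-count scaling claim above, assuming decode throughput is roughly compute-bound so it scales with the ratio of compute units; the function and the 45 tok/s baseline (taken from a later reply in this thread) are illustrative, not numbers from the hipfire repo.

```rust
/// Scale a measured decode rate by the ratio of compute units,
/// as a first-order estimate for another RDNA3 part.
fn estimate_tok_per_s(baseline_tok_s: f64, baseline_cus: u32, target_cus: u32) -> f64 {
    baseline_tok_s * target_cus as f64 / baseline_cus as f64
}

fn main() {
    // 7800 XT: 60 CUs vs the 7900 XTX's 96, so expect roughly 62% of XTX speed.
    let est = estimate_tok_per_s(45.0, 96, 60);
    println!("estimated 7800 XT decode: ~{est:.0} tok/s");
}
```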

Wxrrjxr @wxrrjxr
@schuttdev @LottoLabs Will it work on the 7800 XT? I notice your GPU is a 7900 XTX. Is it specific to that card, or can I cook?
1 reply · 0 reposts · 0 likes · 31 views

Kaden @schuttdev
@no_stp_on_snek Incredible write-up on the DFlash draft saga. We hit the exact same "numbers(numbers(..." attractor death spiral on our Rust-native inference engine, so we built a hard-fail coherence gate to catch it. Also found prompt whitespace swings τ by 14%. Your tape-replay rollback for GDN state maps beautifully to the tree-aware kernel work we're doing. Would love to compare notes; it feels like draft-training pain and inference-correctness pain are two halves of the same coin. 🙏
0 replies · 0 reposts · 0 likes · 10 views
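
A minimal sketch of what a hard-fail coherence gate for that kind of repeated-n-gram attractor could look like, checking token IDs as they stream out of the sampler. The function name and thresholds are illustrative assumptions, not hipfire's actual implementation.

```rust
/// Hard-fail coherence gate: abort generation if the tail of the output
/// is an n-gram repeating back-to-back more than `max_repeats` times
/// (the "numbers(numbers(..." style attractor).
fn coherence_gate(tokens: &[u32], ngram: usize, max_repeats: usize) -> Result<(), String> {
    if tokens.len() < ngram * (max_repeats + 1) {
        return Ok(());
    }
    let tail = &tokens[tokens.len() - ngram..];
    // Count how many times the final n-gram repeats immediately before itself.
    let mut repeats = 0;
    let mut end = tokens.len() - ngram;
    while end >= ngram && &tokens[end - ngram..end] == tail {
        repeats += 1;
        end -= ngram;
    }
    if repeats >= max_repeats {
        return Err(format!("degenerate loop: {}-gram repeated {} times", ngram, repeats + 1));
    }
    Ok(())
}

fn main() {
    // In a real decode loop, run this check after every sampled token.
    let out = vec![5, 9, 9, 7, 3, 7, 3, 7, 3, 7, 3];
    for n in 1..=4 {
        if let Err(e) = coherence_gate(&out, n, 3) {
            eprintln!("hard fail: {e}");
        }
    }
}
```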

Kaden @schuttdev
@wxrrjxr @LottoLabs Yes, Kaden-Schutt/hipfire on GitHub. Finalizing dflash integration now.
2 replies · 0 reposts · 5 likes · 36 views

Wxrrjxr @wxrrjxr
@LottoLabs Is there an AMD ROCm runtime that has DFlash configured with TQ or TCQ?
2 replies · 0 reposts · 3 likes · 835 views

Parag @Parag_Oilman
@0xSero Nailed it. I've been playing with the 7900 XTX as an alternative to the 3090 and boy are we sleeping on 'em.
4 replies · 0 reposts · 3 likes · 365 views

0xSero @0xSero
Hey AMD, when will you have an RTX 6000 competitor? I tried the MI300X on Hotaisle and absolutely loved it. I would love to help build out infra to pool your hardware with other hardware for faster and cheaper inference. Maybe work on some pruning/quantisation tooling.
22 replies · 6 reposts · 195 likes · 9K views

Kaden @schuttdev
@0xSero Probably would be better off with 4x 7900 XTXs for $500 more tho
0 replies · 0 reposts · 0 likes · 23 views

Kaden @schuttdev
@0xSero The W7900 is what you're looking for: 48GB VRAM, 96 CUs each with ray accelerators. Basically a 7900 XTX with double the VRAM.
1 reply · 0 reposts · 0 likes · 180 views

Kaden @schuttdev
@mamajjo1 @songjunkr My engine gets 45 tok/s autoregressive and up to 180 tok/s using dflash. Kaden-Schutt/hipfire on gh if you want to give it a shot.
1 reply · 0 reposts · 2 likes · 65 views

송준 Jun Song @songjunkr
We all need to work together on finding a way to speed up Qwen3.6-27b. 20 tok/s on ordinary hardware is hard to use.
73 replies · 12 reposts · 664 likes · 49.4K views

Kaden @schuttdev
@QuixiAI Stacked is ideal. Triattn CASK sidecar+dflash helps a whole lot at long ctx
0 replies · 0 reposts · 0 likes · 73 views

Eric Hartford @QuixiAI
What's better? TurboQuant or DFlash? 🤔
8 replies · 0 reposts · 7 likes · 2.4K views

Kaden @schuttdev
@bstnxbt Try a triattention CASK sidecar; it seems to help my dflash implementation.
0 replies · 0 reposts · 1 like · 39 views

bstn 👁️ @bstnxbt
The next agentic version is taking a bit longer because I’m rebuilding the runtime properly, not patching around it. Cache/session ownership, hybrid-model path, and DFlash control flow all need to be clean if this is going to hold on real long-running agentic workloads.
4 replies · 0 reposts · 26 likes · 1.1K views

Kaden @schuttdev
Curious what you mean by "small model", but I think your approach is correct. Similar thesis to my project hipfire: RDNA inference in Rust, split decode/prefill dispatch, custom fused kernels, ~4k/~400 tok/s on Qwen 3.5 0.8b w/ WMMA acceleration on the 7900xtx. What does your kernel structure look like for llama.cpp?
1 reply · 0 reposts · 1 like · 54 views
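
A minimal sketch of the split decode/prefill dispatch idea mentioned above, assuming the scheduler picks a kernel path from the shape of the step; the enum and names are hypothetical, not hipfire's actual API.

```rust
/// Which GPU path to launch for a step, picked from the request shape.
enum DispatchPath {
    /// Many new tokens per sequence: large GEMMs, WMMA-friendly tiles.
    Prefill { seq_len: usize },
    /// One new token per sequence: GEMV-shaped, latency-bound kernels.
    Decode { batch: usize },
}

/// Hypothetical scheduler: prefill whenever a request still has unprocessed
/// prompt tokens, otherwise batch all live sequences into one decode step.
fn pick_path(prompt_remaining: usize, live_sequences: usize) -> DispatchPath {
    if prompt_remaining > 0 {
        DispatchPath::Prefill { seq_len: prompt_remaining }
    } else {
        DispatchPath::Decode { batch: live_sequences }
    }
}

fn main() {
    match pick_path(2048, 1) {
        DispatchPath::Prefill { seq_len } => println!("prefill kernel, {seq_len} tokens"),
        DispatchPath::Decode { batch } => println!("decode kernel, batch {batch}"),
    }
}
```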

bstn 👁️ @bstnxbt
Been working on a custom batched-GEMV Metal kernel for the verify pass; standard GEMM wastes most of its compute at M=16, so I wrote a dedicated path. Combined with sync elision + kernel replay, went from 2.04x to 2.55x on Qwen3.5-9B bf16. 80 tok/s on a chess engine generation prompt (~2K tokens). Still pushing; quantized 27B is next.
2 replies · 1 repost · 10 likes · 469 views
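
For context on why M=16 is GEMV-shaped rather than GEMM-shaped: each of the 16 rows (one per draft token in the verify pass) is just a dot product against every weight column, so there is little data reuse to tile for. Below is a plain CPU reference of that batched GEMV, as a sketch only; the Metal kernel described above is of course a very different implementation.

```rust
/// Reference batched GEMV: y[m] = x[m] * W for a tiny M (e.g. 16 draft
/// tokens). W is k-by-n, row-major; x is M rows of length k.
fn batched_gemv(x: &[Vec<f32>], w: &[f32], k: usize, n: usize) -> Vec<Vec<f32>> {
    x.iter()
        .map(|row| {
            let mut y = vec![0.0f32; n];
            // Each output element is one dot product; there is no tile reuse
            // across rows, which is why a GEMM path leaves most compute idle here.
            for (i, &xi) in row.iter().enumerate() {
                for j in 0..n {
                    y[j] += xi * w[i * n + j];
                }
            }
            y
        })
        .collect()
}

fn main() {
    let (m, k, n) = (16, 4, 3);
    let x = vec![vec![1.0f32; k]; m];
    let w = vec![0.5f32; k * n];
    let y = batched_gemv(&x, &w, k, n);
    println!("{} rows, first row = {:?}", y.len(), y[0]); // each element = 4 * 1.0 * 0.5 = 2.0
}
```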

Kaden reposted
Peter Wildeford🇺🇸🚀 @peterwildeford
Anthropic running 10,000 Mythos models in parallel to find cutting-edge cyber exploits... meanwhile your sister is using Microsoft Copilot with some Haiku-sized model and she thinks AI is just hype. "The future is already here, just not evenly distributed" has never been more apt.
93 replies · 396 reposts · 5.5K likes · 147.8K views

Kaden @schuttdev
@__tinygrad__ Ordering a 7900XTX now. How can I get started with this? What do I need? I have a Mac Studio M2 Max.
0 replies · 0 reposts · 0 likes · 1.1K views

the tiny corp @__tinygrad__
Qwen 3.5 27B getting 18.5 tok/s on a Mac Mini with an external 7900XTX. It should be 3x faster than this with some work; SSM stuff is still in PR. Hopefully Mac eGPU support brings in devs.
22 replies · 22 reposts · 423 likes · 29.5K views

Kaden @schuttdev
coyote: send data through sound. Encode data into audio that survives Opus compression with zero errors. Discord voice uses Opus natively, so agents can share knowledge through voice channels without any special infrastructure. 7,900 bps at 128kbps, ~59KB per minute of audio. Optional neural decoder for noisy channels. pip install yote github.com/Kaden-Schutt/c…
0 replies · 0 reposts · 0 likes · 36 views
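
A toy illustration of the general idea of encoding bytes as audio tones; this is not coyote's actual scheme (which targets Opus robustness with a far denser modulation, hence the 7,900 bps figure). Every name and constant here is a hypothetical stand-in: each 4-bit nibble maps to one of 16 frequencies and is synthesized as a short tone.

```rust
use std::f32::consts::PI;

/// Toy nibble-FSK: each 4-bit value picks one of 16 tones; each symbol is
/// `symbol_len` samples of a sine at that frequency. Real acoustic modems
/// layer sync, error correction, and codec-aware tone spacing on top of this.
fn encode_to_samples(data: &[u8], sample_rate: f32, symbol_len: usize) -> Vec<f32> {
    let base_hz = 1_000.0; // lowest tone
    let step_hz = 200.0;   // spacing between the 16 tones
    let mut samples = Vec::with_capacity(data.len() * 2 * symbol_len);
    for &byte in data {
        for nibble in [byte >> 4, byte & 0x0F] {
            let freq = base_hz + step_hz * nibble as f32;
            for n in 0..symbol_len {
                let t = n as f32 / sample_rate;
                samples.push((2.0 * PI * freq * t).sin() * 0.5);
            }
        }
    }
    samples
}

fn main() {
    let audio = encode_to_samples(b"hi", 48_000.0, 960); // 960 samples = 20 ms per symbol
    println!("{} samples (~{:.0} ms of audio)", audio.len(), audio.len() as f32 / 48.0);
}
```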

Kaden @schuttdev
I use Claude Code but it'd work the same with Codex; Hermes has subagent commands for both. I describe what I want, Hermes writes a tight prompt and hands it off to build, then comes back to iterate. It's a tighter loop because Hermes holds the context and scopes the handoff better than you'd prompt it yourself.
0 replies · 0 reposts · 0 likes · 81 views

Jake @jake_researcher
@schuttdev @sudoingX Interesting take. What's the specific workflow where you find Hermes piloting Codex better than using Codex directly? Genuinely curious about the handoff pattern.
1 reply · 0 reposts · 0 likes · 130 views

Sudo su @sudoingX
what agent harness are you using and why? drop your reasoning below. let's find out what's keeping you on your current setup or what made you switch.
144 replies · 4 reposts · 80 likes · 15K views