kapicode

222 posts

@kapicode

Building in public. Currently working on a custom Ralph implementation: harness-agnostic, with a TUI for progress. https://t.co/eYMJvNMgkg

Chicago, IL · Joined February 2026

72 Following · 64 Followers

Pinned Tweet

kapicode@kapicode·25 May

My #ralph implementation

kapicode@kapicode·20h

Decode tok/s can lie by omission. In my Qwen 27B TP=2 tests, decode improved on two Strix Halo nodes. But prefill got worse vs one node. For long prompts, TTFT matters as much as decode. If a benchmark ignores prefill, it may be hiding the pain.
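
A rough sketch of the arithmetic behind this, assuming a simple model where total latency is prompt tokens over prefill speed plus output tokens over decode speed. The function name and every number below are hypothetical illustrations, not measurements from these tests.

```python
def end_to_end_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Time to first token (prefill) plus time spent decoding, in seconds."""
    ttft = prompt_tokens / prefill_tps       # wait before the first token appears
    decode = output_tokens / decode_tps      # time to stream the answer
    return ttft + decode

# Hypothetical long-prompt request: 32K prompt tokens, 500 output tokens.
# One node: faster prefill, slower decode (made-up rates).
one_node = end_to_end_seconds(32_000, 500, prefill_tps=400.0, decode_tps=10.0)
# Two nodes: decode improves, prefill regresses (made-up rates).
two_nodes = end_to_end_seconds(32_000, 500, prefill_tps=250.0, decode_tps=14.0)

print(f"one node:  {one_node:.1f}s")
print(f"two nodes: {two_nodes:.1f}s")
```

With these made-up rates the two-node setup has the better decode number and still finishes later, because the prefill term dominates on a long prompt.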

kapicode retweeted

jietang@jietang·13 Jun

GLM-5.2 is Fully Open, Frontier Intelligence Belongs to Everyone
Today, the sudden restriction of certain frontier models is deeply regrettable. At a time when access to frontier models is abruptly cut off for non-technical reasons, we are even more convinced of one thing: science should be global. The path to AGI (Artificial General Intelligence) must never be enclosed by high walls. We have always believed that AGI should be the cornerstone for all of humanity to collaboratively explore the boundaries of intelligence and solve complex challenges, rather than a privilege monopolized by a few rules and subject to revocation at any moment.
In the face of external blockades and restrictions, our attitude is one of radical openness. Frontier intelligence must remain open-source, accessible, and buildable, serving every dedicated developer. GLM-5.2 is Zhipu's most capable open-source model to date. It not only supports a truly usable 1M context window but also maintains a continuous lead in the independent completion of long-horizon tasks, providing solid foundational support for building complex agent applications. It also continues to be our main engine for creating the strongest domestic coding model.
Tonight at 5:21, at this special moment, GLM-5.2 will officially be available to all GLM Coding Plan users (including Lite / Pro / Max). The API will also go live next week. A step closer to frontier intelligence for everyone. The future of AI is open, and it is for the people.
ModelKey: GLM-5.2

kapicode@kapicode·23h

Kimi k2.7 thinks it's Claude

kapicode@kapicode·1d

If someone has DeepSeek-V4-Flash loading on a newer gfx1151 vLLM/ROCm image, I’d love to compare notes. The interesting benchmark is still open. For now, ds4 owns the practical path on my boxes.

kapicode@kapicode·1d

This matters because negative results should be categorized correctly. “Slow” means optimize. “Flaky” means debug. “Not loadable on this stack yet” means wait for support or rebuild the stack. Different problem, different next step.

kapicode@kapicode·1d

I tried the obvious follow-up: Can vLLM run DeepSeek-V4-Flash on the current gfx1151 ROCm community image? This would make the ds4 comparison much cleaner. Answer: not yet.

kapicode@kapicode·1d

What would be most useful to test next on Strix Halo?
1. more DeepSeek-V4-Flash quants
2. Qwen dense models
3. Gemma-class small models
4. 256K+ context stress tests
5. agent/tool-use latency
I’m optimizing for reproducible local AI, not synthetic leaderboard wins.

kapicode@kapicode·1d

Local LLM hardware heuristic after this project: If your target model fits on one machine, try replicas before tensor parallel.
1 box: measure single-user latency
2 boxes: load balance requests
Only then: split one request across machines
Distributed inference is the last resort, not step one.
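
A minimal sketch of the "2 boxes: load balance requests" step, assuming each box runs a full copy of the model behind its own endpoint. The `RoundRobin` class and the node URLs are hypothetical placeholders, not part of any setup described here.

```python
from itertools import cycle

class RoundRobin:
    """Hand out replica endpoints in rotation, one whole request per replica."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("need at least one backend")
        self._rotation = cycle(list(backends))

    def pick(self):
        # Each call returns the next replica; no request is split across boxes.
        return next(self._rotation)

# Hypothetical endpoints for two independent replicas.
balancer = RoundRobin(["http://node-a:8000", "http://node-b:8000"])
picks = [balancer.pick() for _ in range(4)]
print(picks)
```

Each request stays on one machine, so single-user latency is unchanged and only aggregate throughput scales with the number of boxes.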

kapicode@kapicode·2d

@Tech2Wild 1x M4 MBP (128GB)

kapicode@kapicode·2d

@Tech2Wild 1x 3090, 2x Strix Halo, 1x GB10

kapicode@kapicode·2d

I want to push local hardware to be as capable as possible. And I want to push my harness to be productive with the smallest/dumbest models possible. No leaning on the model for my harness. Caveat: there is a minimum model quality that I simply can’t avoid. Qwen3.5-9b breaks at q4 but not at q8 (for now)

kapicode@kapicode·2d

The conclusion was almost annoying: Default-ish settings were already the right settings. That is less exciting than a magic flag, but more useful if you actually want to reproduce the setup.

kapicode@kapicode·2d

CPU governor=performance also lost. That sounds backwards until you remember this is an APU. CPU and iGPU share the thermal/power envelope. Giving the CPU more room can steal headroom from the part doing inference.

kapicode@kapicode·2d

The most useful part of the Strix Halo / DeepSeek-V4-Flash sweep was not the headline number. It was the list of things that did not help. Negative results save other people weekends.

kapicode@kapicode·2d

Benchmarking rule I’m trying to follow: If a result is inside run-to-run noise, call it noise. Not “breakthrough.” Not “secret flag.” Not “RDMA is worse.” Noise. Local LLM work needs more boring honesty and fewer victory laps.
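
One way this rule could be made mechanical, as a sketch only: compare the difference in mean tok/s against the run-to-run standard deviation. The function name, the `k=2.0` threshold, and the sample numbers are assumptions for illustration, not values from these benchmarks.

```python
from statistics import mean, stdev

def is_within_noise(baseline_runs, candidate_runs, k=2.0):
    """True if the gap between means is no larger than k times the bigger
    run-to-run standard deviation. Needs at least two runs per side."""
    spread = max(stdev(baseline_runs), stdev(candidate_runs))
    gap = abs(mean(candidate_runs) - mean(baseline_runs))
    return gap <= k * spread

# Hypothetical tok/s samples from repeated runs.
baseline = [10.1, 9.9, 10.0]
tweaked_flag = [10.2, 10.0, 10.1]   # small shift, same scatter
bigger_change = [12.0, 12.1, 11.9]  # shift far outside the scatter

print(is_within_noise(baseline, tweaked_flag))
print(is_within_noise(baseline, bigger_change))
```

A fixed multiple of the standard deviation is a crude cutoff; with only a handful of runs it mainly guards against calling a small wobble a win.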

kapicode@kapicode·2d

👀

kapicode@kapicode·2d

@binh Are you sure you don’t have a config problem? What quants are you using? What is causing the stoppage in your harnesses?

Binh@binh·2d

@kapicode RTX Pro 6000 with 96GB, serve Qwen and Nemotron via vLLM. M5 Max with 128GB of RAM for when I'm on the go. I use Opencode with the models served by both machines and in both setups Opencode just stops after a couple of turns. Hermes just stops too.

kapicode@kapicode·3d

If you run local LLMs: what is your actual setup? Not the dream build. The daily driver. GPU/APU? Memory/VRAM? Model size? Serving stack? What breaks most often? I’m trying to compare practical local AI systems, not leaderboard screenshots.

