kapicode

222 posts

@kapicode

Building in public. Currently working on a custom Ralph implementation: harness-agnostic, with a TUI for progress https://t.co/eYMJvNMgkg

Chicago, IL · Joined February 2026
72 Following · 64 Followers
Pinned Tweet
kapicode @kapicode
My #ralph implementation
1
0
2
139
kapicode @kapicode
Decode tok/s can lie by omission.
In my Qwen 27B TP=2 tests, decode improved on two Strix Halo nodes. But prefill got worse vs. one node.
For long prompts, TTFT matters as much as decode. If a benchmark ignores prefill, it may be hiding the pain.
0
0
0
30
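A rough way to see why prefill matters for long prompts, sketched in Python. The model is a deliberate simplification: TTFT is approximated as prompt tokens divided by prefill throughput, and total latency adds output tokens divided by decode throughput. The throughput figures and names (`Throughput`, `ONE_NODE`, `TWO_NODES`) are placeholders chosen for illustration, not measurements from the tests described in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Throughput:
    prefill_tps: float  # prompt tokens processed per second
    decode_tps: float   # output tokens generated per second


def ttft_seconds(t: Throughput, prompt_tokens: int) -> float:
    # Simplified: time to first token is dominated by prefill.
    return prompt_tokens / t.prefill_tps


def total_seconds(t: Throughput, prompt_tokens: int, output_tokens: int) -> float:
    # End-to-end latency = prefill time + decode time.
    return ttft_seconds(t, prompt_tokens) + output_tokens / t.decode_tps


# Hypothetical numbers: the second config decodes faster but prefills slower.
ONE_NODE = Throughput(prefill_tps=400.0, decode_tps=10.0)
TWO_NODES = Throughput(prefill_tps=200.0, decode_tps=15.0)

if __name__ == "__main__":
    for name, cfg in (("one node", ONE_NODE), ("two nodes", TWO_NODES)):
        ttft = ttft_seconds(cfg, 32_000)
        total = total_seconds(cfg, 32_000, 300)
        print(f"{name}: TTFT={ttft:.0f}s total={total:.0f}s")
```

With these made-up inputs, a 32K-token prompt and 300 output tokens finish sooner on the configuration with the lower decode rate, because prefill dominates the total.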
kapicode retweeted
jietang @jietang
GLM-5.2 is Fully Open, Frontier Intelligence Belongs to Everyone
Today, the sudden restriction of certain frontier models is deeply regrettable. At a time when access to frontier models is abruptly cut off for non-technical reasons, we are even more convinced of one thing: science should be global. The path to AGI (Artificial General Intelligence) must never be enclosed by high walls.
We have always believed that AGI should be the cornerstone for all of humanity to collaboratively explore the boundaries of intelligence and solve complex challenges, rather than a privilege monopolized by a few rules and subject to revocation at any moment.
In the face of external blockades and restrictions, our attitude is one of radical openness. Frontier intelligence must remain open-source, accessible, and buildable, serving every dedicated developer.
GLM-5.2 is Zhipu's most capable open-source model to date. It not only supports a truly usable 1M context window but also maintains a continuous lead in the independent completion of long-horizon tasks, providing solid foundational support for building complex agent applications. It also continues to be our main engine for creating the strongest domestic coding model.
Tonight at 5:21, at this special moment, GLM-5.2 will officially be available to all GLM Coding Plan users (including Lite / Pro / Max). The API will also go live next week.
A step closer to frontier intelligence for everyone. The future of AI is open, and it is for the people.
ModelKey: GLM-5.2
269
830
8K
1M
kapicode @kapicode
Kimi k2.7 thinks it's Claude
[image]
0
0
0
17
kapicode @kapicode
If someone has DeepSeek-V4-Flash loading on a newer gfx1151 vLLM/ROCm image, I’d love to compare notes. The interesting benchmark is still open. For now, ds4 owns the practical path on my boxes.
0
0
0
23
kapicode @kapicode
This matters because negative results should be categorized correctly.
“Slow” means optimize.
“Flaky” means debug.
“Not loadable on this stack yet” means wait for support or rebuild the stack.
Different problem, different next step.
1
0
0
3
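The three categories from the post above map naturally onto a small lookup table. This is only a sketch of one way to record them in benchmark notes; the names `NegativeResult`, `NEXT_STEP`, and `next_step` are invented here for illustration and are not from the author's harness.

```python
from enum import Enum


class NegativeResult(Enum):
    SLOW = "slow"
    FLAKY = "flaky"
    NOT_LOADABLE = "not loadable on this stack yet"


# Next step per category, worded as in the post.
NEXT_STEP = {
    NegativeResult.SLOW: "optimize",
    NegativeResult.FLAKY: "debug",
    NegativeResult.NOT_LOADABLE: "wait for support or rebuild the stack",
}


def next_step(result: NegativeResult) -> str:
    # A KeyError here would mean a category was added without a next step.
    return NEXT_STEP[result]


if __name__ == "__main__":
    for category in NegativeResult:
        print(f"{category.value}: {next_step(category)}")
```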
kapicode @kapicode
I tried the obvious follow-up: Can vLLM run DeepSeek-V4-Flash on the current gfx1151 ROCm community image? This would make the ds4 comparison much cleaner.
Answer: not yet.
1
0
0
35
kapicode @kapicode
What would be most useful to test next on Strix Halo?
1. more DeepSeek-V4-Flash quants
2. Qwen dense models
3. Gemma-class small models
4. 256K+ context stress tests
5. agent/tool-use latency
I’m optimizing for reproducible local AI, not synthetic leaderboard wins.
3
0
3
365
kapicode @kapicode
Local LLM hardware heuristic after this project:
If your target model fits on one machine, try replicas before tensor parallel.
1 box: measure single-user latency
2 boxes: load balance requests
Only then: split one request across machines
Distributed inference is the last resort, not step one.
1
0
3
840
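The "2 boxes: load balance requests" step above can be as simple as rotating whole requests across independent replicas. A minimal round-robin sketch follows; the class name and the replica URLs are hypothetical, and a real setup would more likely put a reverse proxy in front of the boxes than hand-roll this.

```python
from itertools import cycle
from typing import Sequence


class RoundRobinBalancer:
    """Hands out replica endpoints in rotation, one whole request per replica."""

    def __init__(self, replica_urls: Sequence[str]) -> None:
        if not replica_urls:
            raise ValueError("at least one replica is required")
        self._replicas = cycle(replica_urls)

    def next_replica(self) -> str:
        # Each call returns the next endpoint, wrapping around at the end.
        return next(self._replicas)


if __name__ == "__main__":
    # Hypothetical endpoints: each box serves a full copy of the model.
    balancer = RoundRobinBalancer(["http://box-a:8000", "http://box-b:8000"])
    for request_id in range(4):
        print(request_id, balancer.next_replica())
```

Each request stays on one machine, so no request pays a cross-node communication cost. That is the difference from tensor parallel, which splits a single request across machines.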
kapicode @kapicode
I want to push local hardware to be as capable as possible. And I want to push my harness to be productive with the smallest/dumbest models possible. No leaning on the model for my harness.
Caveat: there is a minimum model quality that I simply can’t avoid. Qwen3.5-9b breaks at q4 but not at q8 (for now).
3
0
38
3.1K
kapicode @kapicode
The conclusion was almost annoying: Default-ish settings were already the right settings. That is less exciting than a magic flag, but more useful if you actually want to reproduce the setup.
0
0
0
7
kapicode @kapicode
CPU governor=performance also lost. That sounds backwards until you remember this is an APU. CPU and iGPU share the thermal/power envelope. Giving the CPU more room can steal headroom from the part doing inference.
1
0
0
12
kapicode @kapicode
The most useful part of the Strix Halo / DeepSeek-V4-Flash sweep was not the headline number. It was the list of things that did not help. Negative results save other people weekends.
1
0
0
88
kapicode @kapicode
Benchmarking rule I’m trying to follow: If a result is inside run-to-run noise, call it noise.
Not “breakthrough.”
Not “secret flag.”
Not “RDMA is worse.”
Noise.
Local LLM work needs more boring honesty and fewer victory laps.
0
0
0
30
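One hedged way to apply the rule above: repeat each configuration several times and compare the difference in means against the run-to-run spread. The function name, the sample numbers, and the threshold `k=2.0` are assumptions made for this sketch, not the author's method. It needs at least two runs per configuration, since `statistics.stdev` raises an error on fewer.

```python
from statistics import mean, stdev
from typing import Sequence


def is_within_noise(
    baseline_runs: Sequence[float],
    candidate_runs: Sequence[float],
    k: float = 2.0,  # assumed threshold: how many std devs count as noise
) -> bool:
    """True if the difference in means is no larger than k times the spread."""
    # Use the noisier of the two configurations as the yardstick.
    noise = max(stdev(baseline_runs), stdev(candidate_runs))
    return abs(mean(candidate_runs) - mean(baseline_runs)) <= k * noise


if __name__ == "__main__":
    # Made-up tok/s samples for illustration only.
    baseline = [10.0, 10.4, 9.6]
    tweak_a = [10.1, 10.5, 9.9]    # small shift, well inside the spread
    tweak_b = [12.0, 12.2, 11.8]   # clear shift, outside the spread
    print("tweak_a is noise:", is_within_noise(baseline, tweak_a))
    print("tweak_b is noise:", is_within_noise(baseline, tweak_b))
```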
kapicode @kapicode
@binh Are you sure you don’t have a config problem? What quants are you using? What is causing the stoppage in your harnesses?
1
0
0
700
Binh @binh
@kapicode RTX Pro 6000 with 96GB, serve Qwen and Nemotron via vLLM. M5 Max with 128GB of RAM for when I'm on the go. I use Opencode with the models served by both machines and in both setups Opencode just stops after a couple of turns. Hermes just stops too.
2
0
3
768
kapicode @kapicode
If you run local LLMs: what is your actual setup?
Not the dream build. The daily driver.
GPU/APU?
Memory/VRAM?
Model size?
Serving stack?
What breaks most often?
I’m trying to compare practical local AI systems, not leaderboard screenshots.
120
3
73
15.6K