Noetic Co
793 posts


- DeepSeek V4 Flash - Native Precision (FP4 + FP8)
- Fits on 2x RTX Pro 6000 GPUs + 256 GB DDR5 RAM
- Using KTransformers: KVCache-AI fork of SGLang for GPU/CPU memory inference
I have a somewhat obsession running applications on resource constrained systems to squeeze the maximum performance possible. Part of that comes from a past life working as a systems engineer, building & upgrading nationwide (USA) Video-On-Demand streaming backends, while navigating headless *nix servers around the time "cloud" was becoming a buzzword.
KTransformers gets less mention across the LLM inference-sphere despite being among the engines listed for many of the popular models on HuggingFace (alongside vLLM, SGLang, & llama.cpp). The KVCache-AI team is best known for providing a forked SGLang for hybrid GPU / CPU memory inference, benefitting MoE models. I expect these hybrid setups to gain in popularity, especially on the consumer side as hardware prices continue soaring.
"Necessity is the mother of invention" as they say, and local AI runners will continue finding more creative ways to run intelligence, whether that involves GPU/CPU memory offload, distributed training / inference, model weight / KV Cache quants, or REAPs.
Here I have DeepSeek V4 Flash running at a 1M context length on 2x RTX Pro 6000s GPUs, using its native mixed precision of FP4 + FP8. KTransformers allows you to reduce your GPU utilization by offloading experts per MoE layer onto GPU VRAM, with the remaining balanced across system RAM. KTransformers also has the ability to update GPU expert placement during inference from routing statistics collected during the prefill phase. There's also a lot of trial and error involved given the limited amount of kernel support for RTX Pro 6000s.
Two of the prompt load stress-test benchmarks I like to run are from the local-inference-lab/llm-inference-bench Github repo & AlienKevin/SWE-ZERO-12M-trajectories HuggingFace dataset.
Here are the main KTransformers SGLang optimized flags:
- Context Length: 1048576
- Total Number of Tokens: 1048576
- Chunked Prefill Size: 16384
- Max Prefill Tokens: 16384
- GPU Prefill Token Threshold: 1024
- GPU Memory Utilization: 87%
- Number of Experts per MoE Layer on GPU: 134 / 256
- Max Running Requests: 256
- CUDA Graph Max Batch Size: 256
- CUDA Graph Batch Sizes: 1 2 4 8 16 32 64 128 256
- Available GPU Memory: 20.81GB (anything less was too tight for agentic coding)
Below are the AlienKevin/SWE-ZERO-12M-trajectories benchmark results for 100 prompts with 10 concurrent, ~8k input tokens, & ~1k output tokens. Both Radix & Chunked Prefix Cache were disabled for the absolute worst-case scenario:
- Prefill Mean Batch Tokens: 35756.93 tok/sec
- Prefill Median Batch Tokens: 652.90 tok/sec
- TTFT Mean: 20.698s
- TTFT Median: 12.714s
- Decode Mean Batch Output Tokens: 27.39 tok/sec
- Decode Median Batch Output Tokens: 20.63 tok/sec
- Utilized CPU memory: ~200 GB
A more detailed write-up will follow, which'll include the methodology of calculating the number of experts per MoE layer on GPU, maximum number of tokens, and GPU memory utilization for a healthy balance for running tool calls & benchmarks in this hybrid setup.
Hopefully this'll be reproducible for you and on alternative GPUs, as well as current & future models. Let me know how it works for you! My future plans involve GPU/CPU memory inference tests for MiniMax M3, GLM-5.2, and Kimi K2.7-Code.
All links for all of the resources getting DeepSeek V4 Flash native mixed precision on 2x RTX Pro 6000 GPUs + 256 GB RAM can be found in the follow up post.

English

@anton_onAI @zerohedge moonshot.ai api is ass, for one. you can get a much better kimi instance on openrouter
English

@zerohedge I mean why would you run openai or anthropic on openrouter anyway?
Though to be fair why would you be use openrouter in general lol.
Best way to get the worst out of any model.
English

@WonkaWeirdo @realfrugalmogul Everyone is driving a $100k truck now. All of them.
English

@realfrugalmogul What did you expect for a $50 house call? I would honestly like to know what the appointment would look like?
Pay a guy, pay for a truck, part for parts and supplies to be stocked on said truck, Insurnaces, like 4 kinds, and so much overhead. You fell for the trap an got upset.
English

HVAC Guy: I’m here for the $50 HVAC tune up
Me: Sure, furnace is in the basement
* 10 minutes later *
HVAC Guy: Bad news. Something is rusted and cracked inside. That means CO2 is leaking into the house. I have to condemn your furnace, put a tag on it, and turn it off
Me: You will do none of those things.
HVAC Guy: I have to by code…
Me: You will not touch my furnace, let me show you the door 🚪
English

@jonathan_wilke if your usual prompt is "how do these pants fit me, claude" or "where's the closest nail bar", then you won't understand the efficiency hack that is CLI.
English
Noetic Co retweetledi

@NOLABALLER @CardPurchaser I buy the card at the right price and can regrade later. now, i don't like the SGC slab. least favorite. fat, tiny text, ugly and no QR/bar code to streamline collection management. @natsturner
English

Is SGC dead in value as a grading card company?? I’ve seen so many people struggle to sell SGC slabs and buyers love a card but steer clear of buying it due to SGC slab.
Are you team SGC or staying far away from these slabs?
@CardPurchaser

English

@omgsidewalks scale + quantifiable ROI. teachers would be high value/status if they had a claim on their students' future productivity. bread molds, flowers wilt. no scale.
English

@Gamingtronium 200 requests/second is not high volume. php bandwidth is all about caching and caching is not viable in a conversational near real-time paradigm.
English

Be honest with me, Americans 🇺🇸
Do you actually own a gun?
In Japan, I have never seen a real one. Not once.
Not at a friend's house. Not in a drawer. Never.
But online, every American just goes
"oh, mine's in the nightstand"
like it's a phone charger 🔌
So now I'm genuinely curious:
What's YOUR gun? The very first one you ever got?
Is this normal everywhere in the US?
Or just some states?

English

@johnarnold current frontier subscription plan token costs are subsidized about 90%.
English

@DA_Stockman Stakeholder capitalism didn't sell, so now it's AI infinite abundance and space as a jurisdiction = security moat vs. monkeywrenchers/adversaries, tax and regulatory gray zone to exploit in best of scenarios.
English

Well, here's some math. Starlink is a profitable business with about $11 billion of sales and $3 billion of free cash flow. It might be worth $75 billion at a frisky multiple of 25X free cash flow.
The balance----the space launch business and the AI/data centers in space fantasy----has $7 billion of sales and NEGATIVE -$17 billion of free cash flow. So why is it worth anything, unless you are pricing a dream peddled by sell-side hucksters?!
In short, after trading up to $2 trillion based on $75 billion of tangible Starlink value, where's the remaining $1.925 trillion of it?
This isn't just the classical mania of the crowds. This is sui generis--- mass insanity in a casino that has been giving a lobotomy by three decades of money-printing madness at the Fed and its fellow-traveling central banks around the planet.
Jim Stewartson, Decelerationist 🇨🇦🇺🇦🇺🇸@jimstewartson
For $135 per share of SpaceX, you get 1/13,000,000,000th (One 13-BILLIONTH) of a company that in 2025 received $18,000,000,000 and lost $5,000,000,000 It’s allegedly worth $1,770,000,000,000 Do people not understand arithmetic anymore? Can they not count zeroes? Mass delusion.
English

@calcarinus @quantinine @Kimi_Moonshot @basedjensen @TheAhmadOsman i mean, you can get into a DIY single rtx pro 6k setup for under $15k with room to add a second incrementally. If there's an observable boost going to DSv4 flash over 3.6 27b, it's at a huge hardware cost. I think it boils down to what to do local vs. api offload.
English

@NoetekCo @quantinine @Kimi_Moonshot @basedjensen @TheAhmadOsman Would you favour qwen 27b fp16 over deepseek v4 flash? (Assuming you have two R6KPro)?
English

🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced!
🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite.
🔷 Reasoning efficiency: Less overthinking, with 30% lower reasoning-token usage compared to K2.6.
🔷 Long-horizon coding: Improved instruction following, higher end-to-end coding task success rates.
⚡️ 6x High-Speed Mode coming soon!
🔌 Available today via Kimi API and Kimi Code.
🔗 Kimi Code: kimi.com/code
🔗 API: platform.moonshot.ai


English













