Qubitium
@qubitium

6.1K posts

Building GPT-QModel, ModelCloudAI. OSS contributor to SGLang, vLLM, HF and more. AI SW/HW { Python, Go, Kotlin } Quantization Accelerator.

Earth · Joined February 2020
4K Following · 1.3K Followers
Pinned Tweet
Qubitium @qubitium
🥳 GPT-QModel v5.8.0 released. PyPI wheels will be ready in a couple of hours.
🤠 Transformers 5.3.0 support
😍 New CPU kernels for GPTQ/AWQ
🫡 Defuser integration for auto-defusing models
And much more! 👇
Btw, the v6.0.0 roadmap is set and will be ready in a week.
Qubitium @qubitium
The Triton bench/warmup threading bug (nogil) patch PR has been ready since October 2025. I have addressed all the issues and it is still not considered good enough due to nitpicks. Guys, you need to start the threading fixes somewhere. It is not my job to guarantee against an end-user spawning 32 threads on 32 GPUs and Triton giving back the wrong benchmark values: that's what I consider an end-user bug. github.com/triton-lang/tr…
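Not Triton's actual internals, just a minimal sketch of the failure class: a benchmark helper that accumulates timings in shared mutable state will interleave or clobber samples when called from many threads, while keeping each run's samples purely local makes concurrent benchmarking safe. All class and function names here are hypothetical.

```python
import threading
import time

class UnsafeBench:
    """Shared mutable state: concurrent callers clobber each other's samples."""
    def __init__(self):
        self.samples = []
    def run(self, fn, reps=5):
        self.samples.clear()              # another thread may clear mid-run
        for _ in range(reps):
            t0 = time.perf_counter()
            fn()
            self.samples.append(time.perf_counter() - t0)
        return min(self.samples)          # may read a half-filled list

class SafeBench:
    """Per-call local buffers: each thread's benchmark is isolated."""
    def run(self, fn, reps=5):
        samples = []                      # purely local state, no sharing
        for _ in range(reps):
            t0 = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - t0)
        return min(samples)

def demo(n_threads=8):
    """Run the safe benchmark from many threads; every result comes back sane."""
    bench = SafeBench()
    results = [None] * n_threads
    def worker(i):
        results[i] = bench.run(lambda: sum(range(1000)))
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return all(r is not None and r >= 0 for r in results)
```

The same isolation argument applies whether the caller is 2 threads or 32 threads on 32 GPUs; the harness just should not share a sample buffer.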
Qubitium @qubitium
I am not saying Transformers is faster than SGLang or vLLM, but paged attention with FA2 is a beast. On a ~1200 test set, paged FA2 is 2x plain FA2 and >4x SDPA. A100, Llama 3.2 1B Instruct, FP16 native.
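Back-of-envelope on those ratios, with hypothetical tokens/s numbers (not the actual measurements): if paged FA2 is 2x plain FA2 and more than 4x SDPA, then plain FA2 itself must be more than 2x SDPA on this workload.

```python
# Hypothetical throughputs (tokens/s) consistent with the reported ratios.
sdpa  = 100.0
fa2   = 220.0          # implied: plain FA2 > 2x SDPA
paged = 2.0 * fa2      # reported: paged FA2 = 2x plain FA2

assert paged / fa2 == 2.0      # the reported 2x over FA2
assert paged / sdpa > 4.0      # the reported >4x over SDPA
```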
Qubitium @qubitium
The last straw. I am completely fed up with an important open source pkg that has never gotten a proper refactor, because its code structure has outlived all the features MacGyver-ed on top of it, one after another, with bubble gum.
- it is bloated (you might puke if you actually read the code)
- it is slow to execute (omg it is slow)
- it is prone to errors
- it has compat issues since it never updates its dependencies
Yet people still use it, and apparently one guy is maintaining it with PRs piled up for year(s). I will release an alternative in a few days.
Qubitium @qubitium
There is a very popular Attention pkg used by millions right now that had this hilarious episode. The dev, pretty much a one-man show at the time, had the habit of micro commits. Many small commits so regressions are easily caught. At least that's why I think he was doing it. Except he commits so much he got tired of writing clear commit messages (don't we all) and just winged it with single-letter f-bombs and s-bombs for that fateful day. There are like 30+ commits with these f-bombs on GitHub from that day/night. I alerted the dev almost immediately and he quickly reverted the main tree. I don't know why this came to me today but it was hilariously human.
Qubitium @qubitium
One thing AI coders have issues with is over-abstraction. AI loves the textbook style of writing code, making everything abstract and making it "extensible". It's like the functional vs OO debate. At some point, my mind just folds like a pancake when there are too many nested objects. Half the battle is me reviewing the code and screaming back, no... do not abstract that code. That part keeps me sane.
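A toy illustration of the complaint (my example, not from any real codebase): the abstract version buries a one-line computation under an interface and a factory, while the direct version just does the work.

```python
from abc import ABC, abstractmethod

# --- The over-abstracted shape an AI assistant tends to produce ---
class DiscountStrategy(ABC):
    @abstractmethod
    def apply(self, price: float) -> float: ...

class PercentDiscount(DiscountStrategy):
    def __init__(self, pct: float):
        self.pct = pct
    def apply(self, price: float) -> float:
        return price * (1 - self.pct)

class DiscountFactory:
    @staticmethod
    def create(kind: str, **kw) -> DiscountStrategy:
        if kind == "percent":
            return PercentDiscount(**kw)
        raise ValueError(kind)

# --- The direct version: same behavior, no layers to page through ---
def discounted(price: float, pct: float) -> float:
    return price * (1 - pct)

# Both compute the same thing; only one needs three classes to do it.
assert DiscountFactory.create("percent", pct=0.1).apply(200.0) == discounted(200.0, 0.1)
```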
Apurva Mishra @mav3ri3k
@qubitium Ah, so you understand the flow of the code. But the progress is mostly test-driven.
> So I had to get ai to de-ai their ai code and fix their structure. lol
If your AI de-AI-ed their code, then that means the fault was in the person who created the code, not in the AI, lol
Qubitium @qubitium
Did the hifi community straight-up invent a word called "timbre" to make themselves sound smart? How about clarity? Or my own *crystality*. Two can play this game.
Qubitium @qubitium
Recipe:
1. Done over 10+ days.
2. Add a unit test for each critical modification.
3. Run that unit test and the other units that may regress.
4. Repeat.
I think 1/3 of the code is unit tests. This part of the code I do the least oversight/review on, imho. And the secret is still knowing exactly the codeflow from A-Z and adding the human clarity. I have seen and recently ported over code that was AI-generated by a billion-dollar company I won't name, and I said to myself, this dev just accepted the AI slop code without considering that it made any future changes he wants to make even harder. So I had to get AI to de-AI their AI code and fix their structure. lol I guess the secret is to make sure any AI-assisted code is actually human-readable in both code and structure. At some point, you need to make sure that you understand all the code.
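The recipe above, sketched as a pattern; the functions under test are hypothetical stand-ins, not GPT-QModel code. Every critical change gets its own regression test (step 2), and the neighboring units that might break run alongside it (step 3).

```python
# Hypothetical module under test: a change to normalize_config() gets
# its own unit test, plus tests for neighbors that might regress.

def normalize_config(cfg: dict) -> dict:
    """The 'critical modification': lowercase keys, drop None values."""
    return {k.lower(): v for k, v in cfg.items() if v is not None}

def load_model_stub(cfg: dict) -> str:
    """A neighboring codepath that consumes the normalized config."""
    cfg = normalize_config(cfg)
    return cfg.get("model_type", "unknown")

# Step 2: unit test for the modification itself.
def test_normalize_config():
    assert normalize_config({"Model_Type": "llama", "pad": None}) == {"model_type": "llama"}

# Step 3: re-run the units that may regress from the change.
def test_load_model_stub():
    assert load_model_stub({"MODEL_TYPE": "llama"}) == "llama"
    assert load_model_stub({}) == "unknown"

# Step 4: repeat for each change over the 10+ days.
test_normalize_config()
test_load_model_stub()
```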
Apurva Mishra @mav3ri3k
@qubitium 33k lines, really? There is no way you were able to review all that. So what is the trick?
Qubitium @qubitium
GPTQ 4-bit Llama 3.2 Instruct model under a streaming workload:
- staggered arrivals
- mixed long prompt lengths
- shared prefixes
- scheduler="prefill_first"
- use_async_batching=True
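None of this is GPT-QModel's actual harness; it's a sketch of what such a stream workload generator might look like: staggered arrival offsets, a mix of prompt lengths, and a shared prefix on part of the requests so prefix caching has something to hit. The function name, rates, and length buckets are all made up.

```python
import random

def make_stream_workload(n=16, seed=0, shared_prefix="You are a helpful assistant. "):
    """Build (arrival_time_s, prompt) pairs: staggered arrivals, mixed
    prompt lengths, and a shared prefix on roughly half the requests."""
    rng = random.Random(seed)
    requests, t = [], 0.0
    for _ in range(n):
        t += rng.expovariate(4.0)                  # staggered arrivals, ~4 req/s
        body = "word " * rng.choice([8, 64, 512])  # mixed prompt lengths
        prompt = (shared_prefix if rng.random() < 0.5 else "") + body
        requests.append((round(t, 3), prompt.strip()))
    return requests

workload = make_stream_workload()
# Arrivals are monotonically increasing; prompt lengths vary widely.
assert all(a <= b for (a, _), (b, _) in zip(workload, workload[1:]))
```

scheduler="prefill_first" and use_async_batching=True would then be server-side knobs; the generator only shapes the arrival pattern.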
Qubitium @qubitium
Paged Attention with FA2 on Transformers 5.3.0, in a streaming, staggered, concurrent (imho more realistic) workload, can offer ~3x improvement vs native FA2 or SDPA in my small test:
Qubitium @qubitium
I am going to call it. OpenAI will launch a HireADev feature that is literally a model and API designed to mimic an above-average coder that does not sleep or need a product manager. The cost: $10K a month with zero vacation days and benefits. CA launches a robo-worker tax for human-displacement reparations. The tax is 75% per robot. Robots in 2040 start to feel marginalized. TensorNet is born, launched to space, and subsequently renamed to SkyNet.
Qubitium @qubitium
Transformers 5.3.0, which auto-fuses modules, preallocates stacked/fused parameters/buffers, causing massive CPU memory usage even when the modules are lazily loaded by default, negating the lazy-loading effect of pre-5.0 Transformers. Fix: 1) lazy fusing, not on load but on the first forward call; 2) defuse (replace) the auto-fusing code with a non-fused version. For inference this is not an issue; for quant libraries that mutate weights on a per-module/layer basis, this is a pretty bad resource regression that we have to deal with. For now, I think I will get GPT-QModel to defuse the modeling code before model loading happens to revert back to 5.7.x behavior. This is important because a small Qwen 3 30B bf16 model may take over 100GB of CPU RAM on 5.3.0 before a single forward call.
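Not Transformers' real modeling code: a toy sketch of fix 1 (lazy fusing), where the fused buffer is materialized on the first forward call instead of at load time, so modules that are never run never pay the memory. The class is hypothetical and uses plain lists in place of tensors.

```python
class LazyFusedLinear:
    """Toy module: holds per-shard weights, fuses them only on first forward."""
    def __init__(self, shards):
        self.shards = shards          # e.g. q/k/v weight lists from a lazy load
        self._fused = None            # nothing preallocated at load time

    @property
    def fused(self):
        if self._fused is None:       # materialize on first use, not on load
            self._fused = [w for shard in self.shards for w in shard]
        return self._fused

    def forward(self, x):
        # Toy compute: scalar x against every fused weight.
        return sum(w * x for w in self.fused)

m = LazyFusedLinear([[1.0, 2.0], [3.0, 4.0]])
assert m._fused is None               # "load" done, no fused buffer yet
assert m.forward(1.0) == 10.0         # first forward triggers the fusion
assert m._fused == [1.0, 2.0, 3.0, 4.0]
```

Fix 2 (defusing) would instead swap this class out for one that keeps the shards separate forever, which is what a quant library mutating weights per module wants.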
Qubitium @qubitium
Instead of gifting DGX builds to influencers, maybe just lend a B300 oem to GPT-QModel quantization team so we can build and validate more kernels for Blackwell? @Dell @MichaelDell
Alpin @AlpinDale
@qubitium Lazy loading imports can't come too soon for python.
Qubitium @qubitium
Transformers taking 8.5s to import AutoProcessor on Zen3 is bonkers. Part of it is Transformers, part of it is my system. Patch incoming. But even after patching, Transformers takes 4.5s to import AutoProcessor. lol 4.5s! To load a submodule in a library. Let's see if we can get this down to 1-2s.
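One standard way to chip at import time is PEP 562's module-level `__getattr__`: defer the heavy submodule import until someone actually touches the attribute. A self-contained sketch, with a fake `mypkg` package built in-memory so it runs anywhere (real code would put the `__getattr__` in `mypkg/__init__.py`):

```python
import importlib
import sys
import types

# Fake "heavy" submodule, registered up front so the demo is self-contained.
heavy = types.ModuleType("mypkg.processing")
heavy.AutoProcessor = type("AutoProcessor", (), {})
sys.modules["mypkg.processing"] = heavy

# mypkg itself: a PEP 562 module __getattr__ defers the heavy import.
pkg = types.ModuleType("mypkg")
_LAZY = {"AutoProcessor": "mypkg.processing"}  # attribute -> defining submodule

def _pkg_getattr(name):
    # Only runs on an attribute miss: `import mypkg` stays cheap, and the
    # heavy submodule imports the first time mypkg.AutoProcessor is touched.
    if name in _LAZY:
        obj = getattr(importlib.import_module(_LAZY[name]), name)
        setattr(pkg, name, obj)       # cache: later lookups skip __getattr__
        return obj
    raise AttributeError(f"module 'mypkg' has no attribute {name!r}")

pkg.__getattr__ = _pkg_getattr
sys.modules["mypkg"] = pkg

import mypkg                          # cheap: nothing heavy has run yet
proc = mypkg.AutoProcessor            # first touch triggers the real import
```

Whether this is the shape of the actual patch I don't know, but it's the usual lever for "importing one submodule drags in the world."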
Qubitium @qubitium
GPT-QModel's unreleased v5.8.0 has surpassed 200+ GPU hours of unit testing (A100 + 4090) over the past week, resulting in many dozens of patches for tokenization, model config, and modeling code normalization/config for loading with the Transformers v5.3.0 release. Damn proud that the library can correctly load/inference/quantize many older HF-hosted models better than the latest Transformers itself. There is no magic, just per-model unit testing and patch fixing (when applicable). Codex is helping me a lot with the grind but it's still a slow, slow grind. Fixing A may regress B, so every lifecycle/loading patch has to re-trigger the entire unit test suite. Maintaining a GitHub/PyPI pkg that users use and other pkgs depend on is no joke. You either need to do the dirty work, grind, or get the hell out of the game (many have, and I fully understand why).