Qubitium
@qubitium

6.1K posts

Building GPT-QModel, ModelCloudAI. OSS contributor to SGLang, vLLM, HF and more. AI SW/HW { Python, Go, Kotlin } Quantization Accelerator.

Earth · Joined February 2020
4K Following · 1.3K Followers
Pinned Tweet
Qubitium @qubitium
🥳 GPT-QModel v5.8.0 released. PyPI wheels will be ready in a couple of hours.
🤠 Transformers 5.3.0 support
😍 New CPU kernels for GPTQ/AWQ
🫡 Defuser integration for auto-defusing models
And much more! 👇
Btw, the v6.0.0 roadmap is set and will be ready in a week.
Qubitium @qubitium
The Triton bench/warmup threading bug (nogil) patch PR has been ready since October 2025. I have addressed all the issues and it is still not considered good enough due to nitpicks. Guys, you need to start the threading fixes somewhere. It is not my job to guarantee against an end-user spawning 32 threads on 32 GPUs and Triton giving back the wrong benchmark values: that's what I consider an end-user bug. github.com/triton-lang/tr…
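Not Triton's actual internals, just a minimal sketch of the failure class: a benchmark helper that accumulates timings in shared mutable state will interleave or clobber samples when called from many threads, while keeping each run's samples purely local makes concurrent benchmarking safe. All class and function names here are hypothetical.

```python
import threading
import time

class UnsafeBench:
    """Shared mutable state: concurrent callers clobber each other's samples."""
    def __init__(self):
        self.samples = []
    def run(self, fn, reps=5):
        self.samples.clear()              # another thread may clear mid-run
        for _ in range(reps):
            t0 = time.perf_counter()
            fn()
            self.samples.append(time.perf_counter() - t0)
        return min(self.samples)          # may read a half-filled list

class SafeBench:
    """Per-call local buffers: each thread's benchmark is isolated."""
    def run(self, fn, reps=5):
        samples = []                      # purely local state, no sharing
        for _ in range(reps):
            t0 = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - t0)
        return min(samples)

def demo(n_threads=8):
    """Run the safe benchmark from many threads; every result comes back sane."""
    bench = SafeBench()
    results = [None] * n_threads
    def worker(i):
        results[i] = bench.run(lambda: sum(range(1000)))
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return all(r is not None and r >= 0 for r in results)
```

The same isolation argument applies whether the caller is 2 threads or 32 threads on 32 GPUs; the harness just should not share a sample buffer.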
Qubitium @qubitium
I am not saying Transformers is faster than SGLang or vLLM, but paged attention with FA2 is a beast. On a ~1200 test set, paged FA2 is 2x plain FA2 and >4x SDPA. A100, Llama 3.2 1B Instruct, FP16 native.
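Back-of-envelope on those ratios, with hypothetical tokens/s numbers (not the actual measurements): if paged FA2 is 2x plain FA2 and more than 4x SDPA, then plain FA2 itself must be more than 2x SDPA on this workload.

```python
# Hypothetical throughputs (tokens/s) consistent with the reported ratios.
sdpa  = 100.0
fa2   = 220.0          # implied: plain FA2 > 2x SDPA
paged = 2.0 * fa2      # reported: paged FA2 = 2x plain FA2

assert paged / fa2 == 2.0      # the reported 2x over FA2
assert paged / sdpa > 4.0      # the reported >4x over SDPA
```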
Qubitium @qubitium
The last straw. I am completely fed up with an important open source pkg that has never gotten a proper refactor, because its code structure has outlived all the features MacGyver-ed on top of it, one after another, with bubble gum.
- it is bloated (you might puke if you actually read the code)
- it is slow to execute (omg it is slow)
- it is prone to errors
- it has compat issues since it never updates its dependencies
Yet people still use it, and apparently one guy is maintaining it with PRs piled up for year(s). I will release an alternative in a few days.
Qubitium @qubitium
There is a very popular Attention pkg used by millions right now that had this hilarious episode. The dev, pretty much a one-man show at the time, had the habit of micro commits. Many small commits so regressions are easily caught. At least that's why I think he was doing it. Except he commits so much he got tired of writing clear commit messages (don't we all) and just winged it with single-letter f-bombs and s-bombs for that fateful day. There are like 30+ commits with these f-bombs on GitHub from that day/night. I alerted the dev almost immediately and he quickly reverted the main tree. I don't know why this came to me today but it was hilariously human.
Qubitium @qubitium
One thing AI coders have issues with is over-abstraction. AI loves the textbook style of writing code, making everything abstract and making it "extensible". It's like the functional vs OO debate. At some point, my mind just folds like a pancake when there are too many nested objects. Half the battle is me reviewing the code and screaming back, no... do not abstract that code. That part keeps me sane.
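A toy illustration of the complaint (my example, not from any real codebase): the abstract version buries a one-line computation under an interface and a factory, while the direct version just does the work.

```python
from abc import ABC, abstractmethod

# --- The over-abstracted shape an AI assistant tends to produce ---
class DiscountStrategy(ABC):
    @abstractmethod
    def apply(self, price: float) -> float: ...

class PercentDiscount(DiscountStrategy):
    def __init__(self, pct: float):
        self.pct = pct
    def apply(self, price: float) -> float:
        return price * (1 - self.pct)

class DiscountFactory:
    @staticmethod
    def create(kind: str, **kw) -> DiscountStrategy:
        if kind == "percent":
            return PercentDiscount(**kw)
        raise ValueError(kind)

# --- The direct version: same behavior, no layers to page through ---
def discounted(price: float, pct: float) -> float:
    return price * (1 - pct)

# Both compute the same thing; only one needs three classes to do it.
assert DiscountFactory.create("percent", pct=0.1).apply(200.0) == discounted(200.0, 0.1)
```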
Apurva Mishra @mav3ri3k
@qubitium Ah, so you understand the flow of the code. But the progress is mostly test-driven.
> So I had to get ai to de-ai their ai code and fix their structure. lol
If your AI de-AI-ed their code, then that means the fault was in the person who created the code, not in the AI, lol
Qubitium @qubitium
Did the hifi community straight-up invent a word called "timbre" to make themselves sound smart? How about clarity? Or my own *crystality*. Two can play this game.
Qubitium @qubitium
Recipe:
1. Done over 10+ days.
2. Add a unit test for each critical modification.
3. Run that unit test and the other units that may regress.
4. Repeat.
I think 1/3 of the code is unit tests. This part of the code I do the least oversight/review on, imho. And the secret is still knowing exactly the codeflow from A-Z and adding the human clarity. I have seen and recently ported over code that was AI-generated by a billion-dollar company I won't name, and I said to myself, this dev just accepted the AI slop code without considering that it made any future changes he wants to make even harder. So I had to get AI to de-AI their AI code and fix their structure. lol I guess the secret is to make sure any AI-assisted code is actually human-readable in both code and structure. At some point, you need to make sure that you understand all the code.
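The recipe above, sketched as a pattern; the functions under test are hypothetical stand-ins, not GPT-QModel code. Every critical change gets its own regression test (step 2), and the neighboring units that might break run alongside it (step 3).

```python
# Hypothetical module under test: a change to normalize_config() gets
# its own unit test, plus tests for neighbors that might regress.

def normalize_config(cfg: dict) -> dict:
    """The 'critical modification': lowercase keys, drop None values."""
    return {k.lower(): v for k, v in cfg.items() if v is not None}

def load_model_stub(cfg: dict) -> str:
    """A neighboring codepath that consumes the normalized config."""
    cfg = normalize_config(cfg)
    return cfg.get("model_type", "unknown")

# Step 2: unit test for the modification itself.
def test_normalize_config():
    assert normalize_config({"Model_Type": "llama", "pad": None}) == {"model_type": "llama"}

# Step 3: re-run the units that may regress from the change.
def test_load_model_stub():
    assert load_model_stub({"MODEL_TYPE": "llama"}) == "llama"
    assert load_model_stub({}) == "unknown"

# Step 4: repeat for each change over the 10+ days.
test_normalize_config()
test_load_model_stub()
```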
Apurva Mishra @mav3ri3k
@qubitium 33k lines, really? There is no way you were able to review all that. So what is the trick?
Qubitium @qubitium
GPTQ 4-bit Llama 3.2 Instruct model under a streaming workload:
- staggered arrivals
- mixed long prompt lengths
- shared prefixes
- scheduler="prefill_first"
- use_async_batching=True
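None of this is GPT-QModel's actual harness; it's a sketch of what such a stream workload generator might look like: staggered arrival offsets, a mix of prompt lengths, and a shared prefix on part of the requests so prefix caching has something to hit. The function name, rates, and length buckets are all made up.

```python
import random

def make_stream_workload(n=16, seed=0, shared_prefix="You are a helpful assistant. "):
    """Build (arrival_time_s, prompt) pairs: staggered arrivals, mixed
    prompt lengths, and a shared prefix on roughly half the requests."""
    rng = random.Random(seed)
    requests, t = [], 0.0
    for _ in range(n):
        t += rng.expovariate(4.0)                  # staggered arrivals, ~4 req/s
        body = "word " * rng.choice([8, 64, 512])  # mixed prompt lengths
        prompt = (shared_prefix if rng.random() < 0.5 else "") + body
        requests.append((round(t, 3), prompt.strip()))
    return requests

workload = make_stream_workload()
# Arrivals are monotonically increasing; prompt lengths vary widely.
assert all(a <= b for (a, _), (b, _) in zip(workload, workload[1:]))
```

scheduler="prefill_first" and use_async_batching=True would then be server-side knobs; the generator only shapes the arrival pattern.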
Qubitium @qubitium
Paged Attention with FA2 on Transformers 5.3.0, in a streaming, staggered, concurrent (imho more realistic) workload, can offer ~3x improvement vs native FA2 or SDPA in my small test:
Qubitium @qubitium
I am going to call it. OpenAI will launch a HireADev feature that is literally a model and API designed to mimic an above-average coder that does not sleep or need a product manager. The cost: $10K a month with zero vacation days and benefits. CA launches a robo-worker tax for human-displacement reparations. The tax is 75% per robot. Robots in 2040 start to feel marginalized. TensorNet is born, launched to space, and subsequently renamed to SkyNet.
Qubitium @qubitium
Transformers 5.3.0, which auto-fuses modules, preallocates stacked/fused parameters/buffers, causing massive CPU memory usage even when the modules are lazily loaded by default, negating the lazy-loading effect of pre-5.0 Transformers. Fix: 1) lazy fusing, not on load but on the first forward call; 2) defuse (replace) the auto-fusing code with a non-fused version. For inference this is not an issue; for quant libraries that mutate weights on a per-module/layer basis, this is a pretty bad resource regression that we have to deal with. For now, I think I will get GPT-QModel to defuse the modeling code before model loading happens to revert back to 5.7.x behavior. This is important because a small Qwen 3 30B bf16 model may take over 100GB of CPU RAM on 5.3.0 before a single forward call.
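Not Transformers' real modeling code: a toy sketch of fix 1 (lazy fusing), where the fused buffer is materialized on the first forward call instead of at load time, so modules that are never run never pay the memory. The class is hypothetical and uses plain lists in place of tensors.

```python
class LazyFusedLinear:
    """Toy module: holds per-shard weights, fuses them only on first forward."""
    def __init__(self, shards):
        self.shards = shards          # e.g. q/k/v weight lists from a lazy load
        self._fused = None            # nothing preallocated at load time

    @property
    def fused(self):
        if self._fused is None:       # materialize on first use, not on load
            self._fused = [w for shard in self.shards for w in shard]
        return self._fused

    def forward(self, x):
        # Toy compute: scalar x against every fused weight.
        return sum(w * x for w in self.fused)

m = LazyFusedLinear([[1.0, 2.0], [3.0, 4.0]])
assert m._fused is None               # "load" done, no fused buffer yet
assert m.forward(1.0) == 10.0         # first forward triggers the fusion
assert m._fused == [1.0, 2.0, 3.0, 4.0]
```

Fix 2 (defusing) would instead swap this class out for one that keeps the shards separate forever, which is what a quant library mutating weights per module wants.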
Qubitium @qubitium
Instead of gifting DGX builds to influencers, maybe just lend a B300 oem to GPT-QModel quantization team so we can build and validate more kernels for Blackwell? @Dell @MichaelDell
Alpin @AlpinDale
@qubitium Lazy loading imports can't come too soon for python.
Qubitium @qubitium
Transformers taking 8.5s to import AutoProcessor on Zen3 is bonkers. Part of it is Transformers, part of it is my system. Patch incoming. But even after patching, Transformers takes 4.5s to import AutoProcessor. lol 4.5s! To load a submodule in a library. Let's see if we can get this down to 1-2s.
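One standard way to chip at import time is PEP 562's module-level `__getattr__`: defer the heavy submodule import until someone actually touches the attribute. A self-contained sketch, with a fake `mypkg` package built in-memory so it runs anywhere (real code would put the `__getattr__` in `mypkg/__init__.py`):

```python
import importlib
import sys
import types

# Fake "heavy" submodule, registered up front so the demo is self-contained.
heavy = types.ModuleType("mypkg.processing")
heavy.AutoProcessor = type("AutoProcessor", (), {})
sys.modules["mypkg.processing"] = heavy

# mypkg itself: a PEP 562 module __getattr__ defers the heavy import.
pkg = types.ModuleType("mypkg")
_LAZY = {"AutoProcessor": "mypkg.processing"}  # attribute -> defining submodule

def _pkg_getattr(name):
    # Only runs on an attribute miss: `import mypkg` stays cheap, and the
    # heavy submodule imports the first time mypkg.AutoProcessor is touched.
    if name in _LAZY:
        obj = getattr(importlib.import_module(_LAZY[name]), name)
        setattr(pkg, name, obj)       # cache: later lookups skip __getattr__
        return obj
    raise AttributeError(f"module 'mypkg' has no attribute {name!r}")

pkg.__getattr__ = _pkg_getattr
sys.modules["mypkg"] = pkg

import mypkg                          # cheap: nothing heavy has run yet
proc = mypkg.AutoProcessor            # first touch triggers the real import
```

Whether this is the shape of the actual patch I don't know, but it's the usual lever for "importing one submodule drags in the world."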
Qubitium @qubitium
GPT-QModel's unreleased v5.8.0 has surpassed 200+ GPU hours of unit testing (A100 + 4090) over the past week, resulting in many dozens of patches for tokenization, model config, and modeling code normalization/config for loading with the Transformers v5.3.0 release. Damn proud that the library can correctly load/inference/quantize many older HF-hosted models better than the latest Transformers itself. There is no magic, just per-model unit testing and patch fixing (when applicable). Codex is helping me a lot with the grind but it's still a slow, slow grind. Fixing A may regress B, so every lifecycle/loading patch has to re-trigger the entire unit test suite. Maintaining a GitHub/PyPI pkg that users use and other pkgs depend on is no joke. You either need to do the dirty work, grind, or get the hell out of the game (many have, and I fully understand why).