Espen JD
@Snixtp
3.8K posts

Cyber Network Engineer | Codex enthusiast | Local AI | RTX Pro 6000 enjoyer

Joined June 2020
656 Following · 440 Followers
Luigi Cruz @luigifcruz
We are confident GPU-accelerated signal processing is the future of radio astronomy. Our Stelline Developer Kit, based on @NVIDIAAI DGX Spark, lets us develop compute and networking capabilities locally before deploying to observatories. First units headed to scientists now!
[image]
Espen JD @Snixtp
@nzl3reb That looks to be possible, but cc=3 at 10k context is going to be very limited in what it can do before the context is full. The context would have to be cleared quite often and only used for one-shot small tasks, whatever those might be
nZl @nzl3reb
@Snixtp Oo nice.. And I think the gemma-4 31b would give similar results (at 128k, not 256k). I'm thinking of having some infra, like 1 or 2 RTX 6000 Pros, to host some of these things in production. It might be effective to also run like 3 cc at 10k, 2 at 50k, 1 at 128k, and route them
Espen JD @Snixtp
The concurrency on the Pro 6000 is just crazy. cc=96: 2296.5 tok/s
[image]
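An aggregate number like this can be reproduced by pointing a simple async client at the OpenAI-compatible endpoint vLLM exposes. A minimal sketch, where the base URL and model name are assumptions taken from the vllm serve command later in this thread:

# Rough aggregate-throughput check against a local vLLM server.
# base_url and model name are assumed from the serve command below;
# adjust both to your own setup.
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")
CC = 96  # concurrent requests, matching the cc=96 run above

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="qwen36-nvfp4-pro6000-cprofile",
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main():
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(CC)))
    print(f"cc={CC}: {sum(tokens) / (time.perf_counter() - start):.1f} tok/s aggregate")

asyncio.run(main())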
Espen JD @Snixtp
That's beyond the scope of this little test. Ideally you'd have many of these in a server to serve many people. I think both you and I know it can't do cc=96 with 128k ctx. On paper, if my math is right, cc=3 looks to be possible if you use an FP8 KV cache. Speed? 35-55 tok/s wouldn't surprise me
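The on-paper math here is roughly: KV cache bytes per token = 2 (K and V) × layers × KV heads × head dim × bytes per element. A sketch with assumed, illustrative dims (not the model's confirmed config; read the real values from its config.json):

# Back-of-envelope KV cache sizing. Layer/head dims are ASSUMED for
# illustration only -- check config.json before trusting the numbers.
layers, kv_heads, head_dim = 48, 8, 128   # assumed GQA layout
bytes_per_elem = 1                        # FP8 KV cache (2 for BF16)
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K + V
ctx, cc = 128_000, 3
total_gib = per_token * ctx * cc / 2**30
print(f"{per_token / 1024:.0f} KiB/token -> {total_gib:.1f} GiB at cc={cc}, {ctx} ctx")
# ~35 GiB of KV cache under these assumptions, which plausibly fits next
# to the quantized weights on a 96 GB card -- consistent with "cc=3 looks
# to be possible" above.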
nZl @nzl3reb
@Snixtp Hey, I mean this isn't usable at 4k context. At 128k context, how many cc can it load? And at what speed?
Espen JD @Snixtp
I'm setting up some actual evals for models today. Speed is one thing, but if the model is stupid, it's useless. If anyone knows about a good benchmark on GitHub, please let me know
[image]
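One commonly used option is EleutherAI's lm-evaluation-harness, which can score a model sitting behind an OpenAI-compatible local server. A sketch, with the base URL and model name assumed from the vllm serve command below, and an arbitrary task choice (flag details may vary by harness version):

pip install lm-eval
lm_eval --model local-completions \
  --model_args model=qwen36-nvfp4-pro6000-cprofile,base_url=http://127.0.0.1:8000/v1/completions \
  --tasks gsm8k \
  --batch_size 8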
Espen JD @Snixtp
If anyone has any experiments they want me to try on the Pro 6000, let me know! I'm sure I can look into it. The last thing I want is it sitting idle all day :)
Espen JD @Snixtp
@sarlev_
vLLM: 0.20.1
Python: 3.12.3
Torch: 2.11.0 + CUDA 13.0
Driver: 580.142
GPU capability: sm_120 / Blackwell
flashinfer-python: 0.6.8.post1
Zima @MichaelZima
@Snixtp Now you have to keep up with the mental gymnastics!
Espen JD @Snixtp
CUDA_VISIBLE_DEVICES=0 \
CUDA_DEVICE_ORDER=PCI_BUS_ID \
HF_HOME=/LINUX/Models/hf \
vllm serve /LINUX/Models/unsloth/Qwen3.6-27B-NVFP4 \
  --host 127.0.0.1 \
  --port 8000 \
  --served-model-name qwen36-nvfp4-pro6000-cprofile \
  --max-model-len 64000 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
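Once that's up, vLLM serves the standard OpenAI-compatible API on the given host and port, so a quick smoke test looks like this (same model name as in the command above):

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen36-nvfp4-pro6000-cprofile", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'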
Mike Key @1337hero
Spent $3,998.98 total to get 96 GB of VRAM using AMD's AI Pro R9700 cards (brand new). Comparatively, I had spent $1,520.00 on two used RX 7900 XTXs for 48 GB of VRAM. If you're team RED, a single XTX is CHEAPER than an RTX 3090. Should I have bought a Mac or DGX Spark instead?
[images]
Espen JD @Snixtp
@sarlev_ Nvfp4 works great. Codex fixed it for me lol
~sarlev (e/acc) @sarlev_
@Snixtp thanks. any luck on your end running nvfp4 quants? i've gotten effectively similar performance on qwen3.6 35b but i think i'm not using the most up to date kernels...
JJ Pop @PopJj73071
@Snixtp DM won't go through, I wanna ask about your setup. Also have an RTX 6000.
Europurr @vrloom
@Snixtp Ye, I am running Minimax M2.7 on 4x RTX 6000 at concurrency 16 and getting 1000 t/s
Espen JD @Snixtp
@PranshuBahadur @HCColenbrander Yes, a lot. I even tried cc=16 and 32, but the latency got so bad because there just wasn't enough room on the card for KV cache
Pranshu Bahadur @PranshuBahadur
@Snixtp ohh so this was vLLM I presume? Should be similar I think (especially for decode at 96, unless sglang doesn't permute). Oh wow, you even have the energy consumption! Is there any way you can produce energy consumption values for an H100? Or is your repo / script public so we can try?
Espen JD @Snixtp
@loktar00 That is exactly why I'm power limiting, I don't want it to burn up 🫣 Mine is permanently set to 450W now
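For reference, a cap like that is typically set with nvidia-smi (450W here to match the thread; needs root, and the limit itself resets on reboot):

sudo nvidia-smi -pm 1         # persistence mode, keeps driver state loaded
sudo nvidia-smi -i 0 -pl 450  # cap GPU 0 at 450W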
Loktar 🇺🇸 @loktar00
@Snixtp I'm too paranoid with that dang connector. I ran mine at 600W for a bit just to make sure they didn't blow up in the first week of owning them (5090s), but after that I power limited mine permanently to 460W. Funny that we both ended up at around the same number! I need a 6000 😂
Espen JD @Snixtp
I also tested concurrency 96 on the RTX Pro 6000. Efficiency depends a lot on the workload.

My previous test was basically single-user chat inference, with cc=1. Max power draw during that test was 434W, that's why the graph is flat from that point onwards. But during this test, the GPU is being pushed much harder and is able to hit the target of 600W.

For Qwen3.6 27B at cc=96, peak efficiency is at ~275W. Not very different from the cc=1 test, which was ~250W. After that, total output keeps increasing, but efficiency starts dropping.

Personally, I can take the extra heat and cost, and I see no problem running it at 450W
[image]

Quoting Espen JD @Snixtp:
RTX Pro 6000 Workstation Edition efficiency numbers
Models:
- Qwen3.6 27B BF16
- Qwen3.5 122B REAP
Just like with the 3090, best efficiency is 250W. Optimal power I would still say is 350W, speed gains are limited after that point.
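An efficiency curve like this can be collected by logging power draw while the benchmark runs, then dividing aggregate tok/s by mean watts. One way, using nvidia-smi's query mode:

nvidia-smi --query-gpu=timestamp,power.draw,utilization.gpu \
  --format=csv -l 1 > power_log.csv
# efficiency = (aggregate tok/s from the benchmark) / (mean power.draw in W)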
Espen JD @Snixtp
@PranshuBahadur Here is NVFP4. It's a little faster, but that is because it's a more aggressive quant than FP8. I'm testing it with sglang later today, that should improve cc tok/s
[image]
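For the SGLang comparison, the serve step would look roughly like this (model path assumed to mirror the vLLM command earlier in the thread; quantization flags may differ by SGLang version):

python -m sglang.launch_server \
  --model-path /LINUX/Models/unsloth/Qwen3.6-27B-NVFP4 \
  --host 127.0.0.1 \
  --port 30000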
Pranshu Bahadur @PranshuBahadur
@Snixtp is this good? idk how to tell 😅 It's an "fp8 27B", no? You should try nvfp4, it'll be a lot faster I think... Also, what's the decode batch_size?