Espen JD
@Snixtp
3.8K posts

Cyber Network Engineer | Codex enthusiast | Local AI | RTX Pro 6000 enjoyer

Joined June 2020
656 Following · 440 Followers
Luigi Cruz @luigifcruz
We are confident GPU-accelerated signal processing is the future of radio astronomy. Our Stelline Developer Kit, based on @NVIDIAAI DGX Spark, lets us develop compute and networking capabilities locally before deploying to observatories. First units headed to scientists now!
[image]
Espen JD @Snixtp
@nzl3reb That looks to be possible, but cc=3 at 10k context is going to be very limited in what it can do before the context is full. The context would have to be cleared quite often and only used for one-shot small tasks, whatever those might be
nZl @nzl3reb
@Snixtp Oo nice.. And I think the gemma-4 31b would give similar results (at 128k, not 256k). I'm thinking of having some infra, like 1 or 2 RTX 6000 Pros, to host some of these things in production. It might be effective to also run like 3 cc at 10k, 2 at 50k, 1 at 128k, and route them
Espen JD @Snixtp
The concurrency on the Pro 6000 is just crazy. cc=96: 2296.5 tok/s
[image]
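An aggregate number like this can be reproduced by pointing a simple async client at the OpenAI-compatible endpoint vLLM exposes. A minimal sketch, where the base URL and model name are assumptions taken from the vllm serve command later in this thread:

# Rough aggregate-throughput check against a local vLLM server.
# base_url and model name are assumed from the serve command below;
# adjust both to your own setup.
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")
CC = 96  # concurrent requests, matching the cc=96 run above

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="qwen36-nvfp4-pro6000-cprofile",
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main():
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(CC)))
    print(f"cc={CC}: {sum(tokens) / (time.perf_counter() - start):.1f} tok/s aggregate")

asyncio.run(main())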
Espen JD @Snixtp
That's beyond the scope of this little test. Ideally you'd have many of these in a server to serve many people. I think both you and I know it can't do cc=96 with 128k ctx. On paper, if my math is right, cc=3 looks to be possible if you use an FP8 KV cache. Speed? 35-55 tok/s wouldn't surprise me
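The on-paper math here is roughly: KV cache bytes per token = 2 (K and V) × layers × KV heads × head dim × bytes per element. A sketch with assumed, illustrative dims (not the model's confirmed config; read the real values from its config.json):

# Back-of-envelope KV cache sizing. Layer/head dims are ASSUMED for
# illustration only -- check config.json before trusting the numbers.
layers, kv_heads, head_dim = 48, 8, 128   # assumed GQA layout
bytes_per_elem = 1                        # FP8 KV cache (2 for BF16)
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K + V
ctx, cc = 128_000, 3
total_gib = per_token * ctx * cc / 2**30
print(f"{per_token / 1024:.0f} KiB/token -> {total_gib:.1f} GiB at cc={cc}, {ctx} ctx")
# ~35 GiB of KV cache under these assumptions, which plausibly fits next
# to the quantized weights on a 96 GB card -- consistent with "cc=3 looks
# to be possible" above.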
nZl @nzl3reb
@Snixtp Hey, I mean this isn't usable at 4k context. At 128k context, how many cc can it load? And at what speed?
Espen JD @Snixtp
I'm setting up some actual evals for models today. Speed is one thing, but if the model is stupid, it's useless. If anyone knows about a good benchmark on GitHub, please let me know
[image]
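One commonly used option is EleutherAI's lm-evaluation-harness, which can score a model sitting behind an OpenAI-compatible local server. A sketch, with the base URL and model name assumed from the vllm serve command below, and an arbitrary task choice (flag details may vary by harness version):

pip install lm-eval
lm_eval --model local-completions \
  --model_args model=qwen36-nvfp4-pro6000-cprofile,base_url=http://127.0.0.1:8000/v1/completions \
  --tasks gsm8k \
  --batch_size 8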
Espen JD @Snixtp
If anyone has any experiments they want me to try on the Pro 6000, let me know! I'm sure I can look into it. The last thing I want is it sitting idle all day :)
Espen JD @Snixtp
@sarlev_
vLLM: 0.20.1
Python: 3.12.3
Torch: 2.11.0 + CUDA 13.0
Driver: 580.142
GPU capability: sm_120 / Blackwell
flashinfer-python: 0.6.8.post1
Zima @MichaelZima
@Snixtp Now you have to keep up with the mental gymnastics!
Espen JD @Snixtp
CUDA_VISIBLE_DEVICES=0 \
CUDA_DEVICE_ORDER=PCI_BUS_ID \
HF_HOME=/LINUX/Models/hf \
vllm serve /LINUX/Models/unsloth/Qwen3.6-27B-NVFP4 \
  --host 127.0.0.1 \
  --port 8000 \
  --served-model-name qwen36-nvfp4-pro6000-cprofile \
  --max-model-len 64000 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
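Once that's up, vLLM serves the standard OpenAI-compatible API on the given host and port, so a quick smoke test looks like this (same model name as in the command above):

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen36-nvfp4-pro6000-cprofile", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'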
Mike Key @1337hero
Spent $3,998.98 total to get 96 GB of VRAM using AMD's AI Pro R9700 cards (brand new). Comparatively, I had spent $1,520.00 on two used RX 7900 XTXs for 48 GB of VRAM. If you're team RED, a single XTX is CHEAPER than an RTX 3090. Should I have bought a Mac or DGX Spark instead?
[images]
Espen JD @Snixtp
@sarlev_ Nvfp4 works great. Codex fixed it for me lol
~sarlev (e/acc) @sarlev_
@Snixtp thanks. any luck on your end running nvfp4 quants? i've gotten effectively similar performance on qwen3.6 35b but i think i'm not using the most up to date kernels...
JJ Pop @PopJj73071
@Snixtp DM won't go through, I wanna ask about your setup. Also have an RTX 6000.
Europurr @vrloom
@Snixtp Ye, I am running Minimax M2.7 on 4x RTX 6000 at concurrency 16 and getting 1000 t/s
Espen JD @Snixtp
@PranshuBahadur @HCColenbrander Yes, a lot. I even tried cc=16 and 32, but the latency got so bad because there just wasn't enough room on the card for KV cache
Pranshu Bahadur @PranshuBahadur
@Snixtp ohh so this was vLLM I presume? Should be similar I think (especially for decode at 96, unless sglang doesn't permute). Oh wow, you even have the energy consumption! Is there any way you can produce energy consumption values for an H100? Or is your repo / script public so we can try?
Espen JD @Snixtp
@loktar00 That is exactly why I'm power limiting, I don't want it to burn up 🫣 Mine is permanently set to 450W now
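For reference, a cap like that is typically set with nvidia-smi (450W here to match the thread; needs root, and the limit itself resets on reboot):

sudo nvidia-smi -pm 1         # persistence mode, keeps driver state loaded
sudo nvidia-smi -i 0 -pl 450  # cap GPU 0 at 450W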
Loktar 🇺🇸 @loktar00
@Snixtp I'm too paranoid with that dang connector. I ran mine at 600W for a bit just to make sure they didn't blow up in the first week of owning them (5090s), but after that I power limited mine permanently to 460W. Funny that we both ended up at around the same number! I need a 6000 😂
Espen JD @Snixtp
I also tested concurrency 96 on the RTX Pro 6000. Efficiency depends a lot on the workload.

My previous test was basically single-user chat inference, with cc=1. Max power draw during that test was 434W, that's why the graph is flat from that point onwards. But during this test, the GPU is being pushed much harder and is able to hit the target of 600W.

For Qwen3.6 27B at cc=96, peak efficiency is at ~275W. Not very different from the cc=1 test, which was ~250W. After that, total output keeps increasing, but efficiency starts dropping.

Personally, I can take the extra heat and cost, and I see no problem running it at 450W
[image]

Quoting Espen JD @Snixtp:
RTX Pro 6000 Workstation Edition efficiency numbers
Models:
- Qwen3.6 27B BF16
- Qwen3.5 122B REAP
Just like with the 3090, best efficiency is 250W. Optimal power I would still say is 350W, speed gains are limited after that point.
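An efficiency curve like this can be collected by logging power draw while the benchmark runs, then dividing aggregate tok/s by mean watts. One way, using nvidia-smi's query mode:

nvidia-smi --query-gpu=timestamp,power.draw,utilization.gpu \
  --format=csv -l 1 > power_log.csv
# efficiency = (aggregate tok/s from the benchmark) / (mean power.draw in W)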
Espen JD @Snixtp
@PranshuBahadur Here is NVFP4. It's a little faster, but that is because it's a more aggressive quant than FP8. I'm testing it with sglang later today, that should improve cc tok/s
[image]
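For the SGLang comparison, the serve step would look roughly like this (model path assumed to mirror the vLLM command earlier in the thread; quantization flags may differ by SGLang version):

python -m sglang.launch_server \
  --model-path /LINUX/Models/unsloth/Qwen3.6-27B-NVFP4 \
  --host 127.0.0.1 \
  --port 30000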
Pranshu Bahadur @PranshuBahadur
@Snixtp is this good? idk how to tell 😅 It's an "fp8 27B", no? You should try nvfp4, it'll be a lot faster I think... Also, what's the decode batch_size?