Azeez

63 posts

Azeez banner
Azeez

Azeez

@AtlasInference

Building Atlas, Rust inference engine with custom CUDA kernels for DGX Spark GB10, and we're gearing up for an open source release. 102 tok/s on Qwen3.6

Katılım Mart 2026
37 Takip Edilen297 Takipçiler
Azeez
Azeez@AtlasInference·
@sudoingX We JUST got our compute as @AtlasInference. Just give us a teeny bit more time, we'll have some great numbers for you :)
English
0
0
1
90
Sudo su
Sudo su@sudoingX·
actually, let's not just wait. if you are running local models on AMD right now, R9700, Strix Halo, a 7900 XTX, any RDNA card, on a ROCm or Vulkan build, drop your numbers in the replies. model, quant, the card, tok/s. one line is enough. i'll pull the best into a proper thread and amplify the builders who contribute. consider it the AMD half of the list, written by the people actually running it. real numbers from real cards. let's build the AMD picture together while my hardware ships.
Sudo su@sudoingX

be patient anon. i could fake the AMD numbers tonight. i won't. so we wait.

English
25
5
84
10.7K
Michael Flores
Michael Flores@MFloresFive·
I'm trying to combine this with either openclaw or hermes. when i do that i get a lot of tool call errors. It tries to use bash instead of exec, atlas says tool call rejected: error: unknown 'path', same for 'pattern' etc. any ideas on how to fix this? I am using this model: Qwen3.5 Qwen3.5-35B-A3B Sehyo/Qwen3.5-35B-A3B-NVFP4 35B / 3B GDN + attention + MoE, MTP
English
1
0
0
26
Azeez
Azeez@AtlasInference·
DGX Spark just benched 200+ tok/s for Qwen3.6-35B with @AtlasInference on @spark_arena 🔥 How's that possible? Providers like Codex and Claude get ~60. Other major engines don't come close 🦥 We haven't seen speeds like this on GB10. NO ONE HAS. Atlas is shattering records 🚀
Azeez tweet media
English
16
11
84
20K
Azeez
Azeez@AtlasInference·
We're way faster than that. MTP-enabled with Dflash rolling out, check out the discord thread for more details!
English
1
0
1
25
A.K.A CS
A.K.A CS@decapostos·
@AtlasInference @AMD I would really like to know what the t/s is for Qwen 3.6 27B Q8 with @AtlasInference Im running it know on llama.cpp and only getting 11 t/s. Same for vLLM. I have a Asus Ascent GX10 so basically the same as the DGX Spark.
English
1
0
1
57
Azeez
Azeez@AtlasInference·
🚀 Huge thanks to @AMD for sending @AtlasInference a Strix Halo laptop! Excited to squeeze every last drop of compute out of it. Our goal is staying community-first with the simplest stack possible, ROCm here we come 🔥 Join our Discord for early access and help shape what we build next. What should we tackle first? 👇
Azeez tweet media
English
3
3
14
527
Azeez
Azeez@AtlasInference·
@LottoLabs Beautiful, let us know if we can help help contribute to the benchmarks in any way too. Will make a separate post on this soon :)
English
0
0
1
32
Azeez retweetledi
Lotto
Lotto@LottoLabs·
Localmaxxing now supports infra benchmarks from @AtlasInference Looks like an interesting rust + cuda engine Use today with DGX Spark
English
3
1
21
1.3K
Azeez
Azeez@AtlasInference·
@lucatac0 @spark_arena You'll be one bite into your lunch and it'll be ready to serve. We love to brag about our cold start times <2 mins for models like Qwen3.6-35B-A3B!
English
0
0
0
72
Azeez
Azeez@AtlasInference·
@JJJOOOHN @spark_arena Coming to SparkArena soon. MTP just confirmed working, with DFlash it's going to be lightspeed ⚡️
English
0
0
1
84
Azeez
Azeez@AtlasInference·
@bridgemindai Try Atlas Inference and you will know why. With current local model capability and the speeds we're able to achieve, you can absolutely fly through your work.
English
0
0
2
91
BridgeMind
BridgeMind@bridgemindai·
I have two NVIDIA DGX Sparks stacked in my office. They've been sitting there for a month. Here's my honest take. Open source AI is never going to compare to frontier models. Running quantized Kimi K2.6 and GLM 5.1 locally is cool. But practical? No. Not even close. I run all my Hermes agents on GPT 5.5 through my ChatGPT Pro subscription. Practically free. GPT 5.5 is the intelligent model in the world. Why would I route serious tasks to a watered down local model? If you need fast and accurate, you're not using local inference. You're using GPT 5.5 or Claude Opus 4.7. I'm not saying this to rage bait. I genuinely want to know. Why would anyone serious about vibe coding and AI agents use a local model when frontier is this far ahead?
BridgeMind tweet media
English
305
20
409
57.4K
Azeez
Azeez@AtlasInference·
We, Atlas Inference and our absurdly talented community, are just getting started @NVIDIAAI😉 These numbers represent the floor, not the ceiling. The DGX Spark has much more in the tank, we're on a mission to keeping pushing GB10 🔨 New optimizations grounded in academic literature that aren't shipped yet. New tricks nobody's tried 📚 If you've got a Spark and you want in, JOIN US. We're open source. Builders welcome. Skeptics welcome. We'll convert you 🗣️ Records are made to be broken🔥
English
0
1
7
373
Sayak Paul
Sayak Paul@RisingSayak·
The kernels project at Hugging Face has been growing! We want it to be the go-to place for kernel devs and kernel users. We're looking to work w/ folks who're interested in doing agentic kernel dev, providing real optim value to real models. Reach out if interested :)
Sayak Paul tweet media
English
15
9
144
18.5K
Azeez
Azeez@AtlasInference·
@Tnimbus @spark_arena Should work well, we have a dedicated thread with Atlas Openclaw users on our discord as well :)
English
2
0
0
463
Azeez
Azeez@AtlasInference·
@loktar00 Believe it or not, it's what a DGX Spark can do too. Atlas inference makes it possible😏
English
0
0
3
168
Loktar 🇺🇸
Loktar 🇺🇸@loktar00·
Crazy that 200tps is slowly becoming the floor for local inference... a year ago that was top tier API speed, now its what a single 5090 or dual 3090 with MTP can get you for "free" ™
English
10
3
70
4.4K
Azeez
Azeez@AtlasInference·
Right. That recipe should work, if you don't have sparkrun here's the docker command for nvfp4 docker pull avarok/atlas-gb10:latest sudo docker run -d --name atlas \ --network host --gpus all --ipc=host \ -v ~/.cache/huggingface:/root/.cache/huggingface \ avarok/atlas-gb10:latest \ serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \ --port 8888 \ --max-seq-len 131072 \ --kv-cache-dtype fp8 \ --gpu-memory-utilization 0.88 \ --scheduling-policy slai \ --tool-call-parser qwen3_coder \ --enable-prefix-caching \ --speculative
English
0
0
3
133
Azeez
Azeez@AtlasInference·
Docker commands are on the website atlasinference.io for FP8. Some light modifications to make it NVFP4 via the recipe in github.com/Avarok-Cyberse… docker pull avarok/atlas-gb10:latest sudo docker run -d --name atlas \ --network host --gpus all --ipc=host \ -v ~/.cache/huggingface:/root/.cache/huggingface \ avarok/atlas-gb10:latest \ serve Qwen/Qwen3.6-35B-A3B-FP8 \ --port 8888 \ --max-seq-len 65536 \ --kv-cache-dtype fp8 \ --kv-high-precision-layers auto \ --gpu-memory-utilization 0.90 \ --scheduling-policy slai \ --tool-call-parser qwen3_coder \ --enable-prefix-caching \ --speculative
English
1
2
6
539
Azeez
Azeez@AtlasInference·
@iotcoi @spark_arena Right, should have provided the run command. This exact recipe cold starts in <2mins. Not only is it the fastest, but the startup times are also unprecedented :) sparkrun run @atlas/qwen3.6-35b-a3b-nvfp4-atlas
English
0
0
0
58
Azeez
Azeez@AtlasInference·
@TeksEdge @RayFernando1337 I'm having trouble understanding this. Point to point communication should be slower right? All-reduce is slow for TP sure, but not all of the inter-node data transfers... is it because the intra-node speeds aren't that fast itself?
English
0
0
0
10