Azeez

188 posts

@AtlasInference

Building Atlas, Rust inference engine w/ custom CUDA kernels on DGX Spark and Strix Halo | MLPerf Agentic Edge Task Force MLCommons | Ambassador @Alibaba_Qwen

Joined March 2026

46 Following · 654 Followers

Azeez@AtlasInference·2d

@NeoAIForecast @SpaceTimeViking @Tech2Wild If you have 80 seconds, try our engine for Qwen3.6-35B 😉 One command is all you need: sparkrun run @atlas/qwen3.6-35b-a3b-nvfp4 github.com/Avarok-Cyberse…

Neo@NeoAIForecast·2d

@SpaceTimeViking @Tech2Wild I get around 100 on your container/recipe

Tech2Wild@Tech2Wild·2d

Qwen 3.6 35B A3 Comparison 🖥️ Dual 3090s: 157.9 tok/s vs 🤖 DGX Spark: 61.2 tok/s

Azeez@AtlasInference·2d

@WescheNex1q Appreciate the benchmarking @WescheNex1q 🔥

Azeez reposted

Wësche@WescheNex1q·2d

Updated @AtlasInference numbers

27B NVFP4 on one DGX Spark:
My generic config: 14.8 tok/s
Their official recipe: 19-22, right on their claimed 21.78

Their real lane: boots in ~1 min (vLLM ~4, SGLang ~6) and runs 97 tok/s solo on the 35B

Azeez@AtlasInference·3d

@WescheNex1q @16 Check it out @WescheNex1q, recipes are updated to serve the NVIDIA NVFP4 given Unsloth's changes. 35b sparse: github.com/Avarok-Cyberse… 27b dense: github.com/Avarok-Cyberse…

Wësche@WescheNex1q·3d

To be specific: Atlas MTP was erroring on serve when my quality eval ran, so it fell back to base for that leg. Got it serving after and have the speed numbers (27B 14.8 tok/s, 35B 97). Since spec decode is quality neutral the 84.1 still stands for the model, but happy to rerun the eval on MTP for a clean apples to apples. If there’s a preferred serve config for the official checkpoint, send it my way.
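The claim that spec decode is quality neutral can be illustrated with a toy. This is a minimal sketch, assuming greedy decoding and exact token-match verification. `target` and `draft` are hypothetical stand-in functions, not Atlas, vLLM, or SGLang APIs, and a real engine would verify the drafted tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def target(seq: List[int]) -> int:
    """Hypothetical 'big model': a deterministic next-token rule."""
    return (31 * sum(seq) + len(seq)) % 50


def draft(seq: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except on every 4th position."""
    t = target(seq)
    return t if len(seq) % 4 else (t + 1) % 50


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain greedy decoding: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(model(seq))
    return seq[len(prompt):]


def speculative_decode(
    big: NextToken, small: NextToken, prompt: List[int], n: int, k: int = 3
) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, then correct."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n:
        # 1) The drafter proposes k tokens on top of the current sequence.
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            tok = small(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks each proposed token. Every appended token is
        #    the target's own choice, so the output cannot drift from greedy.
        for tok in proposal:
            expected = big(seq)
            seq.append(expected)
            if expected != tok or len(seq) - len(prompt) >= n:
                break  # rejection (corrected token kept) or length reached
        else:
            # 3) All k accepted: the target contributes one bonus token.
            if len(seq) - len(prompt) < n:
                seq.append(big(seq))
    return seq[len(prompt):]


prompt = [1, 2, 3]
base = greedy_decode(target, prompt, 20)
spec = speculative_decode(target, draft, prompt, 20, k=3)
print("identical output:", base == spec)
```

Under these assumptions the drafter only changes how many target steps are needed, not which tokens come out, which is why a quality score measured on the base path can carry over.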

Azeez@AtlasInference·3d

@WescheNex1q @16 Looked into it further, found the Unsloth model got a sneaky update that broke a few things! See it was updated just yesterday: huggingface.co/unsloth/Qwen3.…

Azeez@AtlasInference·3d

@WescheNex1q @16 Thanks for putting these results together @WescheNex1q! Awesome work. One question: how come you ran Atlas base without MTP, while MTP was enabled for SGLang and vLLM?

Wësche@WescheNex1q·3d

Same four trucks. 1× DGX Spark.

35B under a crowd (agg tok/s @16 users):
vLLM MTP 450
SGLang MTP 427
llama.cpp ~105
Atlas caps at ~4

35B alone (best single-stream):
SGLang MTP 106.7
vLLM MTP 102
Atlas base 86.3

27B dense alone:
llama.cpp DFlash 34.6 ← crown

Playground truck still vLLM. Race lane is competitive now.
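To put the crowd numbers next to the single-stream ones, here is a minimal sketch using only the vLLM and SGLang figures quoted in this post. It assumes "@16 users" means 16 concurrent streams sharing the aggregate evenly, which the post does not state. `per_user` and `batching_gain` are hypothetical helper names.

```python
USERS = 16  # concurrency quoted in the post

# (aggregate tok/s at 16 users, best single-stream tok/s), as posted
results = {
    "vLLM MTP": (450.0, 102.0),
    "SGLang MTP": (427.0, 106.7),
}


def per_user(aggregate: float, users: int) -> float:
    """Average tok/s each stream sees if the aggregate is split evenly."""
    return aggregate / users


def batching_gain(aggregate: float, single: float) -> float:
    """How many times more total throughput the crowd gets than one stream."""
    return aggregate / single


for name, (agg, single) in results.items():
    print(
        f"{name}: {per_user(agg, USERS):.1f} tok/s per user, "
        f"{batching_gain(agg, single):.2f}x single-stream throughput"
    )
```

The aggregate rises under load while each individual stream slows down, so the two tables answer different questions.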

Azeez@AtlasInference·4d

@WescheNex1q @vllm_project @sgl_project @ggerganov Excited for this 🔥 Our DFlash isn't 100% there yet, but it's real close!

Wësche@WescheNex1q·4d

The four corners of tonight’s rounds: @vllm_project vs @sgl_project vs llama.cpp (@ggerganov) vs @AtlasInference. Running Qwen3.6 27B and 35B, both with MTP and DFlash, measuring quality scores, tok/s, concurrency, latency, and workload splits. Results in the AM tomorrow.

Azeez@AtlasInference·4d

@rafaelcaricio Not yet... dm me let's chat more

rafaelcaricio@rafaelcaricio·5d

@AtlasInference Does Atlas support diffusion models? That would be a big one for GB10 Spark devices.

Azeez@AtlasInference·6d

I read somewhere that this is good at cuda kernels? ⚙️ Might have to try it considering Fable tore the current plan into pieces

Cursor@cursor_ai

We've partnered with SpaceXAI to train Grok 4.5. It’s our most powerful model yet and the first we've built for more than software engineering.

Azeez@AtlasInference·5d

Voila:

docker run -d --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:dev \
  serve nvidia/Qwen3.6-35B-A3B-NVFP4 \
  --port 8888 --max-seq-len 32768 \
  --kv-cache-dtype fp8 --fp8-kv-calibration-tokens 256 \
  --kv-high-precision-layers auto \
  --gpu-memory-utilization 0.88 --scheduling-policy slai \
  --enable-prefix-caching --ssm-cache-slots 256 \
  --speculative --mtp-quantization bf16

Wësche@WescheNex1q·5d

Tested it before replying 🙂 Pulled atlas-gb10:latest Measured 95 tok/s single-stream at 58–70% acceptance. Couldn’t quite reach your 105, what sampling/flags are you running? Also confirmed your “not concurrency” the hard way: the drafter reserve shrinks KV enough that batch 16 @ 32k won’t boot. Happy to put Atlas+MTP on the board with your exact recipe.
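For a rough sense of what 58–70% acceptance buys, here is a minimal sketch using the standard speculative-decoding estimate. It assumes each drafted token is accepted independently with the same probability, which is a simplification. How "acceptance" is measured here is not stated in the thread, and `expected_tokens_per_step` is a hypothetical helper name.

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per target verification step.

    Assumes k drafted tokens per step, each accepted independently with
    probability `accept_rate`, plus the one token the target always
    contributes (a correction on rejection, a bonus on full acceptance).
    This is the geometric sum 1 + a + a^2 + ... + a^k.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if accept_rate == 1.0:
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


# Acceptance range quoted above, tried at two draft depths.
for k in (2, 3):
    for rate in (0.58, 0.70):
        tokens = expected_tokens_per_step(rate, k)
        print(f"k={k}, acceptance={rate:.2f}: ~{tokens:.2f} tokens per step")
```

The estimate ignores the drafter's own cost, so the real speedup is lower than the tokens-per-step figure suggests.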

Wësche@WescheNex1q·6d

Spec-decode shootout on DGX Spark

Qwen3.6-35B-A3B, on the new spark-vllm
• NVIDIA’s NVFP4 checkpoint + built-in MTP-3: 105 tok/s (+62%), 450 tok/s @ 16 streams (+92%)

No external drafter, the draft heads ship inside the checkpoint. Now number one in Wesche.com/dgx
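The quoted uplifts imply a baseline that the post does not spell out. This is a minimal sketch, assuming both percentages are measured against a non-speculative run on the same setup. `implied_baseline` is a hypothetical helper name.

```python
def implied_baseline(measured: float, uplift_pct: float) -> float:
    """Return the baseline that `measured` exceeds by `uplift_pct` percent."""
    return measured / (1.0 + uplift_pct / 100.0)


single = implied_baseline(105.0, 62.0)  # single-stream figure from the post
crowd = implied_baseline(450.0, 92.0)   # 16-stream aggregate from the post

print(f"implied single-stream baseline: {single:.1f} tok/s")
print(f"implied 16-stream baseline: {crowd:.1f} tok/s")
```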

Azeez@AtlasInference·5d

@theemozilla Not well really. The 75B Nemotron-puzzle model that just dropped is killer though for 128GB Spark/Strix Halo huggingface.co/nvidia/NVIDIA-…

emozilla@theemozilla·5d

We're working on making the local model experience better in Hermes, what are the best local models at each weight class? My blind guess, please correct:

8-16 GB VRAM: Gemma4 12B
24-32 GB VRAM: Qwen3.6 27B, Qwen3.6 35B
128 GB VRAM (Spark, M3 Max): ???

Can you do DSv4-Flash?

Azeez@AtlasInference·5d

@WescheNex1q Try serving Atlas Qwen3.6-27B :) the 90-second cold start will change your whole perspective!

Wësche@WescheNex1q·5d

Let’s see how far we can take a single dgx spark

Azeez@AtlasInference·5d

@WescheNex1q The :dev tag should fare better, along with k=2 if my memory holds. Will get back to you with more details, thanks for the openness!

Azeez@AtlasInference·6d

@jun_song Happy to work with you on this! If you don't know us already, we're building Atlas Inference, a pure Rust engine for edge hardware, GB10/Strix Halo so far! We want to keep things simple and run models the fastest for the growing community, with the best support.

Jun Song@jun_song·6d

Looking for inference companies interested in partnering to use SuperGemma or Super-Tune models. These are market-proven, uncensored local models. 100% of the revenue generated from this partnership will go straight to a foundation supporting open-source AI. There is only one reason we are doing this: Open-source AI must win. We are already in partnership talks with a bunch of companies, both big and small. Drop a comment or DM me if you're interested, and I'll share the details. Let's build together until open source wins. 💪

Azeez reposted

NVIDIA AI@NVIDIAAI·6d

@AtlasInference Lets gooo!

Azeez@AtlasInference·6d

@itsharmanjot Had me shook... I couldn't stop scrolling huggingface

Harman@itsharmanjot·6d

Alibaba just released a coding model that hits 82 percent on SWE-Bench Verified. That is the highest score ever published for an open-source model. The weights are free. The license is Apache 2.0. You can run it today. The model is Qwen 4 Coder 32B.

Here is what 82 percent on SWE-Bench Verified actually means. SWE-Bench Verified tests whether an AI can autonomously resolve real bugs pulled from real production GitHub repositories. Not synthetic exercises. Real open-source projects that real teams depend on. A model gets a bug report, reads the code, writes a fix, and either passes the test suite or it does not. At 82 percent, Qwen 4 Coder 32B resolves 82 out of every 100 real production bugs it is given. Without a human guiding it. On code it has never seen before.

For comparison:
Qwen 4 Coder 32B: 82 percent SWE-Bench Verified. Open source. Apache 2.0.
Claude Fable 5: 80.3 percent SWE-Bench Pro. $10 input / $50 output per million tokens. Currently suspended.
GPT-5.6 Sol: Competitive on Terminal-Bench. $5 input / $30 output per million tokens.

An open-weight model that you can download and run for free just beat both of them on the benchmark designed to measure real software engineering capability.

Here is the architecture. Qwen 4 Coder 32B is a 32 billion parameter dense model. Not a Mixture-of-Experts. Every parameter is active on every request. This matters for inference: a dense 32B model runs on 22 gigabytes of VRAM, which fits on a single high-end consumer GPU or a MacBook Pro with 64GB of unified memory.

The smaller variant, Qwen 4 Coder 4B, runs at approximately 135 tokens per second on an M5 Max and fits inside 8 gigabytes of RAM. For a model with usable coding capability, that is a new bar for what fits in a single laptop.

The training methodology continued Alibaba's approach of reinforcement learning on verifiable coding tasks. The model gets rewarded when its code passes tests. It gets penalized when it fails. Over millions of training steps, the model learns to write code that actually runs rather than code that looks plausible.

License: Apache 2.0. Full commercial use. No attribution requirement. No revenue threshold. No monthly active user ceiling.
Weights: Hugging Face, available today.
Runs on: vLLM, Ollama, SGLang, and any standard GGUF-compatible inference engine.

Qwen 4 32B also runs at approximately 135 tokens per second on an M5 Max chip, setting a new bar for what a sub-8GB model can do on Apple Silicon.

The open-source coding model just beat the best closed-source model in the world on the benchmark designed to test whether AI can actually do software engineering. The weights are free. The subscription is optional.

Source: Autom8Labs AI Insight July 2026, LLMCheck.net State of Open Source LLMs June 2026, Kunal Ganglani blog June 2026.

Azeez@AtlasInference·6d

We'll be using a subset of this amazing release as a part of the MLPerf Agentic Edge Taskforce 🎉 Excited to share more on Atlas Inference's work alongside @NVIDIAAI with @MLCommons 🔜

MLCommons@MLCommons

MLPerf Inference now measures multi-turn agents. 990 trajectories, Kimi K2.6 + Qwen3.6-35B-A3B, Pareto-curve performance, three-level accuracy. Built on MLPerf Endpoints. mlcommons.org/2026/07/agenti… MLPerf #AgenticAI #LLM #Inference #Kimi #Qwen #MLCommons

Azeez@AtlasInference·6d

@Dave_Charland @NVIDIAAI Stay tuned 🫡 we plan on doing a show and tell soon!

Dave Charland@Dave_Charland·6d

@AtlasInference @NVIDIAAI What are you driving with three that you couldn’t do with just 2? Asking for a friend….. 🫠

Azeez@AtlasInference·6d

The bottom DGX Spark shines the brightest @NVIDIAAI 🪩 thank you 💚 Trifecta's all wired up, triple threat! Ran out of tokens this week thanks to Fable though lol. We want to keep pushing GB10 showing the world how powerful this little box is🌐 (why we call ourselves Atlas 😉)

Azeez@AtlasInference·6d

@cryptoafterdark @NVIDIAAI No but if I ran nvidia-smi those utilization numbers may or may not alarm you 🚨

Rick Stoner@cryptoafterdark·6d

@AtlasInference @NVIDIAAI do pictures qualify as usage?
