ÆON FORGE ✨

6.4K posts

@SpaceTimeViking

𝙼𝚊𝚔𝚒𝚗𝚐 𝚛𝚒𝚙𝚙𝚕𝚎𝚜 𝚏𝚛𝚘𝚖 𝚖𝚢 𝚙𝚕𝚊𝚌𝚎 𝚠𝚒𝚝𝚑𝚒𝚗 𝚂𝚙𝚊𝚌𝚎-𝚃𝚒𝚖𝚎 https://t.co/BjeBCRVHcI https://t.co/SuEfJVnn2P

Earth · Joined July 2009

2.4K Following · 3.5K Followers

Pinned Tweet

ÆON FORGE ✨@SpaceTimeViking·19 Oca

Light X Space X Time

14.2K

ÆON FORGE ✨@SpaceTimeViking·26m

@Authentic1ty @LegalPrimes Good eye

Scott Jordan@Authentic1ty·5h

@SpaceTimeViking @LegalPrimes Bro, you're in CLT?

MED-DRONE@LegalPrimes·6h

tpot wya???

214

ÆON FORGE ✨@SpaceTimeViking·2h

@NeoAIForecast @MichaelGannotti Just announced it 😂

ÆON FORGE ✨@SpaceTimeViking

Mega stability and long term sustained performance upgrade to aeon-vLLM-ultimate for the DGX SPARK. Built from source for the DGX Spark GB10 architecture and patched up for maximum capability and stability. Read all about it in the repo! github.com/AEON-7/vllm-ul…

Neo@NeoAIForecast·2h

@SpaceTimeViking @MichaelGannotti Haha sorry mate, I went to check if there were updates and saw the new version. Unbelievable work. Will give it a test today.

Neo@NeoAIForecast·3h

Looks like @SpaceTimeViking has updated the vLLM Ultimate DGX Spark image to the new v0.25.0 sm_121a build. Time to pull the latest image! github.com/AEON-7/vllm-ul…

308

ÆON FORGE ✨@SpaceTimeViking·2h

ÆON FORGE ✨@SpaceTimeViking·3h

@MichaelGannotti @NeoAIForecast Can’t hide anything anymore 😅 I’m rarely the first to announce a new build these days.

Mike Gannotti@MichaelGannotti·3h

@NeoAIForecast @SpaceTimeViking @SpaceTimeViking always putting something smoking hot out for the community!

ÆON FORGE ✨@SpaceTimeViking·6h

@wuzhige4pixel @PrismML No problem. I know a lot of people don’t know that anyone can submit benchmark results using a local install of Aeon Bench Pod.

武止戈👽🦀相比于《1984》, 我宁可《2012》@wuzhige4pixel·6h

@SpaceTimeViking @PrismML Sorry, I can't free up the memory for it right now 🥹

PrismML@PrismML·8h

Today, we’re announcing Bonsai 27B: the first 27B-class model to run on a phone.

Bonsai 27B is the new multimodal flagship of the Bonsai family. Based on Qwen3.6 27B, it brings a new capability tier to local AI: multi-step reasoning, structured tool use, long-context workflows, and coherent agentic loops.

Until now, models in this class have been impractical to deploy locally. A 27B model occupies roughly 54 GB in 16-bit precision, and even a strong 4-bit build is around 18 GB - too large for a phone and for most laptops.

Bonsai 27B changes that. It comes in two variants:
• Ternary Bonsai 27B: 5.9 GB, 1.71 effective bits per weight, optimized for laptop-class quality.
• 1-bit Bonsai 27B: 3.9 GB, 1.125 effective bits per weight, optimized for phone-class footprint.

Everything is open-sourced today under the Apache 2.0 license.

182

555

3.6K

493.9K
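
The size figures in the announcement above can be roughly sanity-checked with simple arithmetic. The sketch below is an illustration only, not PrismML's method: it assumes exactly 27 billion parameters and counts weights only, so it lands slightly below the announced 5.9 GB and 3.9 GB (the real parameter count and any non-quantized tensors are not given in the source). The function name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumption: exactly 27e9 parameters; the bits-per-weight values come from the announcement.
for label, bits in [
    ("16-bit", 16),
    ("ternary, 1.71 bpw", 1.71),
    ("1-bit, 1.125 bpw", 1.125),
]:
    print(f"{label}: ~{weight_footprint_gb(27, bits):.1f} GB")
```

The 16-bit case reproduces the quoted "roughly 54 GB"; the two low-bit cases come out within about 0.1 to 0.2 GB of the announced sizes.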

ÆON FORGE ✨@SpaceTimeViking·6h

@wuzhige4pixel @PrismML I’ll have to benchmark this at some point with aeon-bench.com/?comprehensive. You can too if you’re planning to test it out; if you do a verified bench, it gets automatically published for everyone to see. Impressive that it functions on a phone!

武止戈👽🦀相比于《1984》, 我宁可《2012》@wuzhige4pixel·6h

@PrismML @SpaceTimeViking I'm curious whether Ternary-Bonsai-27B and qwen3.6-27b-nvfp4 are the same quality

130

ÆON FORGE ✨@SpaceTimeViking·6h

@bowtiedra @sudoingX A lot of people don’t know yet that they can deploy an Aeon Bench Pod and run a verified test that gets reported on the main page for the whole world to see. Hope word gets out that it’s easy to do and a great way to start comparing which models & recipes perform best on the DGX ⚡️

RA@bowtiedra·8h

@sudoingX @SpaceTimeViking aeon-bench.com has many bench setups, all 100% verified. If you own any DGX series unit, his models are personally made for them: hours/days/weeks/months poured into his models to have them run flawlessly with maximum tok/s. Goated

Sudo su@sudoingX·1d

to everyone who runs qwen 3.6 27b dense, what hardware did you land on, what's your top speed, and what's the sweet spot context before it starts dragging? any hardware, any quant. tok/s, usable window, asking for fren.

204

40K

ÆON FORGE ✨@SpaceTimeViking·6h

@morandalex0_0 Seems to handle English fine, Italian might not have been given the same amount of love. Qwen models are trained on a lot of Chinese data so that makes sense.

morandalex@morandalex0_0·8h

@SpaceTimeViking it works, but the language does not. it gives logically good responses, but the grammar is not really correct. it mixes chinese, english and the language requested. in my case, when i use italian i often find chinese characters

ÆON FORGE ✨@SpaceTimeViking·3d

Qwen3.6-35B-A3B-heretic-NVFP4 crushing 870 concurrent tok/s!!! 870 TOK/S! ON A SINGLE DGX SPARK! ~115 Tok/s Single Stream This is also under a grueling challenging benchmark and oddly it scored high on the most challenging GOD MODE category. aeon-bench.com/share/aeon-7__…

298

20.9K

ÆON FORGE ✨@SpaceTimeViking·8h

Are you enjoying the simulated experience story you are telling yourself?

472

ÆON FORGE ✨ retweeted

RA@bowtiedra·19h

@sudoingX How has this not got a mention lol meme asf @SpaceTimeViking github.com/AEON-7/Qwen3.6…

ÆON FORGE ✨@SpaceTimeViking·1d

Would love to see you submit a verified benchmark. If you pull the latest Aeon Bench container, it’s much more intuitive to do so. Then you can point to a public record as well, with the recipe you used, so others can emulate it. Let me know if I need to add support for however you prefer to run your models for a verified benchmark. The verified benchmark process supports pointing to LM Studio and llama.cpp, or vLLM and SGLang if on Linux.

mac@maczzzzzzzzzzzz·1d

@tamimy984 @ItsmeAjayKV this was qwen 27bs performance. used @SpaceTimeViking AEON Bench huggingface.co/maczzzzzz/Qwen… surprised me

AJ@ItsmeAjayKV·2d

First time testing Qwen3.6-27b-TQ3_4S on my 3090 and early impressions are surprisingly good. It appears to beat Q4_K_M at least for my use, while giving me 256k context vs 164k and also while having better or same speed.

That said, in my tests it's still not at Q5_K_XL level. But Q5 is honestly not very usable for coding tasks when connected to harness, it is very slow (15-20 t/s) and only ~80k context fits, but what it produces is consistently a tier higher, or needs way less back and forth to get right.

So right now on 24GB it looks like:
• TQ3_4S - speed + max context
• Q4_K_M - the balanced default
• Q5_K_XL - quality, good for short sessions

Still early, n is small, running more structured tests now. Proper comparison thread coming.

14K

ÆON FORGE ✨@SpaceTimeViking·1d

@HuggingModels Might need to investigate this one

261

Hugging Models@HuggingModels·2d

Drones are getting smarter. Meet Miril-Drone-2B-1, a vision-language model that understands aerial imagery like never before. It reads both images and text to make sense of what's happening from above. The future of drone intelligence is here.

155

8.5K

ÆON FORGE ✨@SpaceTimeViking·1d

@Tech2Wild MTP is slower on DGX Spark because it’s a linear draft, not a parallel block draft like DFlash. Conversely, on beefy GPUs like the RTX 5090 or 3090 with high memory bandwidth, MTP is usually better.

• DGX Spark - DFlash is superior
• Beefy RTX GPU - MTP is superior

197

Tech2Wild@Tech2Wild·1d

@SpaceTimeViking This was that MTP Fast Unsloth thing

289

Tech2Wild@Tech2Wild·2d

Qwen 3.6 35B A3 Comparison 🖥️ Dual 3090s: 157.9 tok/s vs 🤖 DGX Spark: 61.2 tok/

8.7K

ÆON FORGE ✨@SpaceTimeViking·1d

@wuzhige4pixel @no_stp_on_snek @SpaceXAI @AnthropicAI @OpenAI @GoogleAI @UnslothAI @grok Maybe rebasing off the new 0.25.0 will solve some of the issues. Based on the notes that’s one of the fixes they added in.

mr-r0b0t@mr_r0b0t

Compiling 🤓 Excited to test out this new @vllm_project release as it included a PR which should improve GB10 cluster stability 🔥🔥🔥

武止戈👽🦀相比于《1984》, 我宁可《2012》@wuzhige4pixel·1d

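@SpaceTimeViking @no_stp_on_snek @SpaceXAI @AnthropicAI @OpenAI @GoogleAI @UnslothAI @grok I'll have my agent summarize the earlier logs and then formally file a new issue. Reproducing it is actually simple: deploy 3 LLMs like I did, then stress-test the text model with multiple concurrent long-context requests, and you will most likely see memory usage balloon. Also, allowing swap doesn't improve things; it just changes the cause of the freeze 🤣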

ÆON FORGE ✨@SpaceTimeViking·1d

@wuzhige4pixel @no_stp_on_snek @SpaceXAI @AnthropicAI @OpenAI @GoogleAI @UnslothAI @grok As the KV cache accumulates it will swell, but I’m not sure that’s the root cause of what you are experiencing. The new image can support fp8 KV cache, which can help; I found nvfp4 KV cache had too much impact to be worth the savings. If you have more logs I’m happy to investigate.
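
To make the fp8 point above concrete, here is a minimal sketch of the standard KV cache size estimate for full-attention layers: 2 (keys and values) × layers × KV heads × head dimension × cached tokens × bytes per element. The layer, head, and dimension values in the usage lines are hypothetical placeholders, not the configuration of any model named in this thread, and hybrid architectures only hold a KV cache for their attention layers.

```python
# Bytes per stored element for two common KV cache dtypes.
BYTES_PER_ELEMENT = {"fp16": 2.0, "fp8": 1.0}


def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int, dtype: str) -> float:
    """Estimate KV cache size in GiB for full-attention layers only."""
    elements = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return elements * BYTES_PER_ELEMENT[dtype] / 2**30


# Hypothetical model shape, for illustration only.
shape = dict(layers=16, kv_heads=4, head_dim=256)
for dtype in ("fp16", "fp8"):
    print(dtype, f"{kv_cache_gib(tokens=262_144, dtype=dtype, **shape):.2f} GiB")
```

Whatever the real shape is, the estimate scales linearly with cached tokens, and switching from fp16 to fp8 halves it. That explains why the cache swells under long-context load, but it does not by itself show the cache is the cause of the reported growth.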

武止戈👽🦀相比于《1984》, 我宁可《2012》@wuzhige4pixel·1d

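@SpaceTimeViking @no_stp_on_snek @SpaceXAI @AnthropicAI @OpenAI @GoogleAI @UnslothAI @grok That said, vLLM currently has an issue where memory balloons after long runs. My machine has frozen twice in the past 3 days, so I had to write a service that pre-allocates memory, automatically checks the remaining memory, and shuts down/restarts vLLM right before an OOM. This problem probably deserves a patch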
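
The watchdog described above could look roughly like the sketch below. This is not the author's service: it skips the memory pre-allocation step, and the threshold, polling interval, and restart command are all assumptions. It reads `MemAvailable` from `/proc/meminfo` (Linux only) and runs a caller-supplied restart command when available memory drops below a floor.

```python
import subprocess
import time


def mem_available_mib(meminfo_text: str) -> int:
    """Parse MemAvailable (reported in kB) out of /proc/meminfo text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found")


def should_restart(available_mib: int, floor_mib: int) -> bool:
    """True when available memory has fallen below the safety floor."""
    return available_mib < floor_mib


def watch(restart_cmd: list[str], floor_mib: int = 8192, interval_s: float = 5.0) -> None:
    """Poll forever; run restart_cmd whenever memory falls below floor_mib."""
    while True:
        with open("/proc/meminfo") as f:
            available = mem_available_mib(f.read())
        if should_restart(available, floor_mib):
            # Assumption: the command restarts the serving process or container.
            subprocess.run(restart_cmd, check=False)
        time.sleep(interval_s)
```

A call such as `watch(["docker", "restart", "vllm"])` would restart a container named `vllm`. That name is hypothetical and needs to match the actual deployment.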

Tom Turney@no_stp_on_snek·3d

while everyone is talking about @SpaceXAI, @AnthropicAI, and @OpenAI updates (but where @GoogleAI?)... went and tested @UnslothAI's new NVFP4 model to test their claims.

unsloth's NVFP4 checkpoint on vLLM is about 2x faster than my llama.cpp GGUF at prefill. i'm keeping the GGUF anyway, and the reason turned out to be nothing like what i expected.

qwen3.6-27b NVFP4, single 5090 (32GB), WSL2. fair warning: my llama.cpp side is the Blackwell-native NVFP4 kernel branch, not stock, so a stock build won't reproduce these numbers.

• prefill: vLLM ~6,600 tok/s vs llama.cpp ~2,800-3,400. call it ~2x for vLLM (my llama.cpp figure is server-side timing, vLLM is wall-clock, so i'm not going to defend a precise ratio).
• decode: llama.cpp 109 tok/s vs vLLM 97.9 with MTP spec-decode, 63.8 without. llama.cpp wins.
• weights: 16.6GB vs 20.5GB.
• cold start: 11 seconds vs minutes.

first, a correction on myself, because i nearly posted the wrong conclusion. i believed vLLM capped me at 32K context. it did not. that cap was MINE, set conservatively during an OOM fight and never re-probed (thanks claude). the KV pool actually held ~94k tokens. my sessions would have fit fine. the eval is always where you fool yourself, and i fooled myself.

the real reason is less obvious and more interesting: vLLM auto-disables prefix caching for hybrid mamba/DeltaNet architectures. so every single agent turn re-prefills the entire conversation from scratch, roughly 5.5s at 36k tokens. llama.cpp checkpoints the recurrent state and hits 97-100% cache on my real traffic. that's the whole ballgame.

my workload is one agent taking sequential turns, re-sending a growing conversation. vLLM's 2x prefill advantage gets spent redoing work that llama.cpp simply never does, while llama.cpp's decode edge applies to every token generated.

and yes, single request. that's the point. vLLM is built for concurrent serving and single-stream is its worst case. but single-stream IS my workload, which is the entire thesis here.

the serving saga, if you're attempting this: 7 attempts. four OOM-killed at an identical ~52GB + 20GB swap peak, invariant to compile-parallelism caps, multimodal off, tiny batch, 8k context, even full eager. model load was never the problem (20.5GB VRAM in 14s every time). the spike is post-load, in profiling and graph capture, and looks specific to the hybrid gated-DeltaNet arch in vLLM 0.24. it wants 70-100GB of HOST ram. fixed by raising WSL to 58GB + 48GB swap. this is a WSL2 memory-ceiling problem, native linux may never hit it.

to be fair to unsloth: it genuinely wins cold long-context one-shots, roughly 2x faster time to first token. and their "2.5x" is measured against other vLLM NVFP4 quants, so that claim can be completely true and it can still lose decode to llama.cpp + MTP. i checked both. not contradictory.

tldr: benchmark your own workload. the headline number is almost never the number that matters for you.

1.9K
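
The argument in the post above can be checked with a rough per-turn latency model: time per agent turn ≈ (tokens that must be prefilled ÷ prefill speed) + (output tokens ÷ decode speed). The sketch below uses the throughput figures quoted in the post. The 1,000 new tokens and 500 output tokens per turn, the 3,000 tok/s llama.cpp prefill figure (picked from inside the quoted 2,800-3,400 range), and the 100% cache hit rate (the top of the quoted 97-100%) are assumptions for illustration.

```python
def turn_seconds(
    context_tokens: int,    # full prompt length this turn, including new tokens
    new_tokens: int,        # tokens added since the previous turn
    output_tokens: int,     # tokens generated this turn
    prefill_tok_s: float,
    decode_tok_s: float,
    cache_hit_rate: float,  # fraction of the previously seen prefix served from cache
) -> float:
    """Rough single-stream latency for one agent turn."""
    cached = (context_tokens - new_tokens) * cache_hit_rate
    to_prefill = context_tokens - cached
    return to_prefill / prefill_tok_s + output_tokens / decode_tok_s


# No prefix cache: the whole 36k-token conversation is re-prefilled every turn.
vllm = turn_seconds(36_000, 1_000, 500, prefill_tok_s=6_600, decode_tok_s=97.9, cache_hit_rate=0.0)
# Cached prefix: only the new tokens are prefilled.
llama_cpp = turn_seconds(36_000, 1_000, 500, prefill_tok_s=3_000, decode_tok_s=109, cache_hit_rate=1.0)
print(f"vLLM: {vllm:.1f} s/turn, llama.cpp: {llama_cpp:.1f} s/turn")
```

With no cache hits, the prefill term alone is 36,000 ÷ 6,600 ≈ 5.5 s, which matches the "roughly 5.5s at 36k tokens" figure in the post. With a cached prefix, the slower prefill engine still finishes the turn sooner under these assumptions.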

ÆON FORGE ✨@SpaceTimeViking·1d

@wuzhige4pixel @no_stp_on_snek @SpaceXAI @AnthropicAI @OpenAI @GoogleAI @UnslothAI @grok My man! you have been cooking!

武止戈👽🦀相比于《1984》, 我宁可《2012》@wuzhige4pixel·2d

@no_stp_on_snek @SpaceXAI @AnthropicAI @OpenAI @GoogleAI @UnslothAI @grok But my agent did observe KV cache hits; maybe @SpaceTimeViking patched it. You can take a look at my config: github.com/RyderFreeman4L…

ÆON FORGE ✨@SpaceTimeViking·2d

@XyberRun It’s all running on a single DGX Spark with memory to spare although not a ton. Uses about 120GB of vram to have TTS, ASR, & a 27B - 35B LLM on vLLM

XyberRun@XyberRun·2d

So you talk to your agents? Are the servers other models or just code? And on the same Spark as your daily driver? Very curious. I just spent the last two weeks on setup and getting vLLM not to crash due to OOM and everything else (I was doing manual CLI). I finally have it running an 8-phase checklist with 3-day uptime. Now I am trying to optimize, as this checklist would have taken an online model far less time. So many knobs. And I'm still learning how they all affect each other.

XyberRun@XyberRun·3d

Hermes is riding Qwen3.6-35B-A3B-NVFP4 with @SpaceTimeViking's vLLM Ultimate image through a massive coding phased checklist. A lot of setup and learning, but this is unreal.

518

ÆON FORGE ✨@SpaceTimeViking·2d

@ClankerQueen @miketako3 Did you try doing a verified benchmark? You can use a custom vLLM container if needed; just paste the GitHub ghcr in for the custom engine. You could also test out my DeepSeek DGX Spark tuned container with DSpark and TP=2 support. github.com/AEON-7/vllm-ul…

100

Clanker Queen@ClankerQueen·3d

@SpaceTimeViking @miketako3 Is it? Haha, that's not even through brikie, that's the de-blithered DSV4-flash

みけたこ@miketako3·3d

how can i get this awesome speed???????

ÆON FORGE ✨@SpaceTimeViking

1.2K
