Nico Hezel @NicoHezel
35 posts

Image Processing and Machine Learning fanatic with an affinity for performance optimization.

Germany / Berlin · Joined May 2011
33 Following · 42 Followers
Nico Hezel @NicoHezel
Comparing llama.cpp v8184 against v8287 across the Qwen 3.5 family reveals massive token generation speed improvements. The fewer active parameters a model has, the larger the performance jump:
0.8B: +31%
4B: +20.5%
9B: +14%
27B: +10.5%
35B-A3B (MoE): +23.5%
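For anyone wanting to reproduce this kind of A/B comparison, here is a minimal sketch driving llama-bench from two checked-out builds. The binary paths and model filename are placeholders, and the JSON field name is an assumption based on recent llama.cpp builds; verify with `llama-bench --help` and a sample `-o json` run.

```python
# A/B token-generation benchmark across two llama.cpp builds.
# Binary paths, model file, and the "avg_ts" JSON key are assumptions;
# check them against your own llama-bench build before trusting results.
import json
import subprocess

BUILDS = {"v8184": "./llama-v8184/llama-bench",
          "v8287": "./llama-v8287/llama-bench"}
MODEL = "qwen3.5-9b-q4_k_m.gguf"  # placeholder model file

def gen_speed(bench_bin: str) -> float:
    """Run a generation-only benchmark and return mean tokens/s."""
    out = subprocess.run(
        [bench_bin, "-m", MODEL, "-p", "0", "-n", "128", "-r", "5", "-o", "json"],
        capture_output=True, text=True, check=True).stdout
    results = json.loads(out)            # one entry per test (here: tg128)
    return float(results[0]["avg_ts"])   # mean tokens/s over repetitions

if __name__ == "__main__":
    old, new = (gen_speed(BUILDS[v]) for v in ("v8184", "v8287"))
    print(f"v8184: {old:.1f} t/s, v8287: {new:.1f} t/s "
          f"({100 * (new - old) / old:+.1f}%)")
```

Running the same model file through both binaries on the same machine isolates the software delta, which is the whole point of the comparison above.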
Nico Hezel @NicoHezel
Next test: Qwen3.5 27B (GGUF Q4) on llama.cpp v8184 vs. v8287. While the MoE model saw a ~25% jump, this dense 27B variant still gained 10% in generation speed on the newer build. Lower gains, but still a "free" performance boost just by updating binaries. Software optimization remains the most undervalued variable in local LLM benchmarking. Full comparison across all Qwen3.5 models is in the works. Stay tuned.
Nico Hezel @NicoHezel
I just compared Qwen3.5 35B A3B (GGUF Q4) on two different llama.cpp versions: v8184 (two weeks old) vs. v8287 (yesterday). Same hardware, but token generation speed increased by almost 25%. It's a reminder that hardware is only half the battle. Especially when new models drop, staying on top of software updates is mandatory to actually get the performance you paid for.
0xSero @0xSero
My missions have taken me places: I am up to 88.56% accuracy with only 37% of experts in VRAM at a time and an incredibly low swap time. I am probably wrong in a billion ways, but here's the repo: github.com/0xSero/reap-ex…
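I haven't read the repo, but the headline numbers suggest partial expert residency with on-demand swapping: keep a fixed budget of MoE experts on the GPU and evict the least recently used one when a new expert is routed to. A hypothetical sketch of that idea; all names and sizes are illustrative, not taken from the repo.

```python
# Hypothetical sketch of partial-expert residency with LRU eviction.
# Not the repo's actual implementation; load_fn and the budget are
# placeholders for however expert weights get moved to VRAM.
from collections import OrderedDict

class ExpertCache:
    def __init__(self, load_fn, budget: int):
        self.load_fn = load_fn          # loads one expert's weights to the GPU
        self.budget = budget            # max experts resident at once
        self.resident = OrderedDict()   # expert_id -> weights, in LRU order

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as recently used
            return self.resident[expert_id]
        if len(self.resident) >= self.budget:
            self.resident.popitem(last=False)      # evict least recently used
        self.resident[expert_id] = self.load_fn(expert_id)
        return self.resident[expert_id]

# e.g. keeping ~37% of 128 experts resident: ExpertCache(load_expert, budget=47)
```

The accuracy/VRAM trade-off then comes down to how often routing misses the resident set, which is what the reported swap time measures.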
Nico Hezel @NicoHezel
@oatius90180 @HuggingModels This is just a small model that adopted the thinking patterns of Opus. This improves critical thinking, as it lays out more information first before coming to a conclusion. But it surely does not beat a 50x bigger, multi-million if not billion dollar model from Google.
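For context, "adopting the thinking patterns" of a teacher typically means sequence-level distillation: the student is fine-tuned on reasoning traces sampled from the teacher, so it learns the shape of the reasoning rather than new knowledge. A minimal sketch assuming a Hugging Face causal LM; the checkpoint name and the single example trace are placeholders.

```python
# Minimal sequence-level distillation sketch (illustrative only).
# "student-base" and the trace below are placeholders, not real data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("student-base")
student = AutoModelForCausalLM.from_pretrained("student-base")
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

# (prompt, teacher output with explicit step-by-step reasoning)
teacher_traces = [
    ("Is 91 prime?", "<think>91 = 7 * 13, so it is composite.</think> No."),
]

for prompt, trace in teacher_traces:
    batch = tok(prompt + "\n" + trace, return_tensors="pt")
    # Plain next-token cross-entropy on the teacher's full trace: the
    # student imitates the laid-out reasoning, not the teacher's weights.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```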
Hugging Models @HuggingModels
Meet GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill: a distilled reasoning powerhouse. This GGUF model combines GLM architecture with Claude-level reasoning, distilled for efficiency. It's like getting premium reasoning in a lightweight package. Perfect for local deployment!
Hugging Models tweet media
Nico Hezel @NicoHezel
I hadn’t used Antigravity for a few days and just opened it to check the model quotas. From what I see, Gemini 3.1 Pro and the Claude models now have weekly quotas, while Flash still runs on the 5-hour quota window. That basically makes Antigravity unusable for more complex tasks. Also calling the Pro plan the one for people who “live in the IDE and don’t rely on agents” feels misleading, since Antigravity itself runs agents under the hood.
Nico Hezel tweet media
Google Antigravity @antigravity
We’re evolving Google AI plans to give you more control over how you build. Every subscription includes built-in AI credits, which can now be used for Antigravity, giving you a seamless path to scale.

Google AI Pro is the home for the practical builder: hobbyists, students, and developers who live in the IDE and don't necessarily rely on an agent. This plan features generous limits for Gemini Flash, with a baseline quota included to "taste test" our most advanced premium models.

Google AI Ultra serves as the daily driver for those shipping at the highest scale who need consistent, high-volume access to our most complex models.

If you’re on Pro but need "extra juice" for a heavy sprint or deeper access to premium models, simply top up your AI credits to customize your plan.

Keep building. Keep shipping.
Mehdi Ataei @AtaeiMe
@andrew_n_carr I think the models are kind of useless for daily use unless they work on fixing their refusal rates. 6/10 normal requests are rejected by the models.
Andrew Carr 🤸 @andrew_n_carr
it's getting a bit hard to keep up, but there's a lot to be said for the 3.5 27B. The 9B approaching the previous-gen 235B is certainly something.
Andrew Carr 🤸 tweet media
Nico Hezel @NicoHezel
@sudoingX The Ryzen AI 9 HX 375 (Radeon 890M) is already hitting 19 t/s (Vulkan) and 17 t/s (ROCm) on this 35B model. For an iGPU, that's only ~11 t/s behind the 6800XT (20-30 t/s) mentioned. If an integrated chip is this close, tuned numbers for the 6800XT should be significantly higher.
Sudo su @sudoingX
AMD entering the chat. 6800XT pulling 20-30 tok/s on Qwen3.5-35B-A3B. haven't tested ROCm yet but it's on the list. the model fits in 20GB, so every AMD card with 24GB VRAM should run it at full speed. NVIDIA numbers are pouring in. AMD users: the AMD side is wide open.
Dark @DarkSmak812

@sudoingX getting 20-30 tok/s on 6800XT 16GB and 7950X3D 64GB. Slow but usable.
Nico Hezel @NicoHezel
@sudoingX Ryzen AI 9 HX 375 (Radeon 890M / iGPU) benchmarks for Qwen3.5-35B-A3B-UD-Q4_K_XL using llama.cpp:
Vulkan: 280 t/s prefill | 19 t/s gen
ROCm: 230 t/s prefill | 17 t/s gen
Interesting: Vulkan currently outperforms ROCm by ~12% in generation and ~22% in prefill on this silicon.
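The quoted percentages follow directly from the raw throughput numbers; a quick sanity check:

```python
# Reproducing the Vulkan-vs-ROCm gap from the reported raw numbers.
vulkan = {"prefill": 280.0, "gen": 19.0}   # t/s
rocm   = {"prefill": 230.0, "gen": 17.0}   # t/s

for phase in ("prefill", "gen"):
    gap = (vulkan[phase] / rocm[phase] - 1) * 100
    print(f"{phase}: Vulkan ahead by {gap:.0f}%")
# prefill: Vulkan ahead by 22%
# gen: Vulkan ahead by 12%
```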
Sudo su @sudoingX
AMD is pulling 20-30 tok/s on the same model NVIDIA hits 112-157 tok/s on. same Qwen3.5-35B-A3B. same 4-bit quant. 19.7GB on disk. fits entirely on any 24GB card. but most AMD submissions so far are Vulkan or ROCm with default configs. nobody has gone deep on tuning yet. NVIDIA's numbers climbed 2-3x once people started optimizing.
6800XT (16GB): 20-30 tok/s
Ryzen AI Max+ 395 (96GB unified): 59 tok/s
3090 (24GB): 112 tok/s (was 50 before flags)
4090 (24GB): 157 tok/s (was ~80 stock)
haven't tested ROCm myself yet. definitely on the list. AMD users: try llama.cpp from source with ROCm. try the cache flags. the gap might be real or it might be a config gap. only one way to find out. want to see those numbers climb.
Sudo su @sudoingX

the numbers coming in from this thread:
5090: 166 tok/s (z33b0t), 153 tok/s (EmmanuelMr)
4090: 122 tok/s (StubbyTech)
3090: 112 tok/s (sudo), 100 tok/s (Eduardo)
6800XT: 20-30 tok/s (Dark)
Qwen3.5-35B-A3B. 4-bit quant, 19.7 GB on disk. fits entirely on a single 24GB 3090 with room to spare. no offloading, no splitting, full speed. 5090 owners keep pushing the ceiling and we haven't found it yet. NVIDIA side is stacking up. where are the ROCm numbers? report your GPU and tok/s below. building the full map.
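The "cache flags" being referred to are presumably llama.cpp's KV-cache quantization options plus flash attention and full GPU offload. A hedged sketch of a stock-vs-tuned llama-bench comparison; the model filename is a placeholder, and flag names should be verified against `--help` on your build.

```python
# One way to A/B "stock vs tuned" on an AMD card: same model, llama-bench
# built from source with ROCm, then adding full offload, flash attention,
# and a quantized KV cache. Flag names match recent llama.cpp builds;
# verify with `llama-bench --help` before relying on them.
import subprocess

MODEL = "Qwen3.5-35B-A3B-Q4_K_M.gguf"  # placeholder filename
BASE = ["./llama-bench", "-m", MODEL, "-p", "512", "-n", "128"]
TUNED = ["-ngl", "99",        # offload all layers to the GPU
         "-fa", "1",          # enable flash attention
         "-ctk", "q8_0",      # quantize the K cache
         "-ctv", "q8_0"]      # quantize the V cache (requires flash attn)

subprocess.run(BASE, check=True)          # stock run
subprocess.run(BASE + TUNED, check=True)  # tuned run
```

If AMD numbers climb the way the NVIDIA ones did, it would confirm the gap is a config gap rather than a hardware one.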

Nico Hezel @NicoHezel
@txhno @HuggingModels Why? It's just a distill of GLM-4.7-Flash with Claude-Opus-4.5-High-Reasoning as the teacher.
Nico Hezel @NicoHezel
@huang_chao4969 OpenAI Prism is not a competitor; it does not feel like Overleaf at all, more like a cheap notebook. LiteWrite seems to have more LaTeX settings and helpful buttons for writing LaTeX code. Only the dark color scheme is too dark and colorful, and some icons are missing tooltips.
Chao Huang @huang_chao4969
Just got sniped by OpenAI's Prism 🎯... We're open-sourcing ✨LiteWrite✨ next week!

We've all been suffering with Overleaf for way too long - researchers constantly frustrated by writing pain points, with basic features locked behind paywalls... So our team decided to build LiteWrite from scratch - an Overleaf alternative focused on AI-native vibe writing. Fast compilation with support for dozens of real-time collaborators.

Just noticed OpenAI dropping Prism... 😅 Talk about perfect timing! We assumed big tech would stay away from such a niche market, but apparently not... Well, this just accelerated our timeline! We're open-sourcing LiteWrite next week. We already have solid user traction with great feedback, and our mission remains: giving researchers a truly powerful AI-native writing platform 💪

LiteWrite - AI-Native Writing Suite That Makes Creating Feel Natural
- ⚡ TAP Smart Completion: Type a few words, AI continues the thought, just hit Tab and you're done
- 💬 ASK Mode: Summon your AI assistant anytime - stuck on grammar? formatting issues? Just ask and move on
- 🤖 Agent Mode: The ultimate hands-off experience! AI auto-edits, polishes, formats, or unleashes creative flow for you
- 🔍 Deep Research: AI conducts deep research + auto-generates reports, research while you write 👨‍🏫

✨ Try LiteWrite: litewrite.ai
sun 🐶 @sunncynn
@iScienceLuvr Not only do people sleep on Qwen, even the benchmarks do 😭🥲
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
Llama-Nemotron: Efficient Reasoning Models

NVIDIA introduces the Llama-Nemotron series, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. "As of April 2025, our flagship model LN-Ultra is the most “intelligent” open model according to Artificial Analysis." The family comes in three sizes: Nano (8B), Super (49B), and Ultra (253B). It performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency.
Tanishq Mathew Abraham, Ph.D. tweet media
LP89 @Lumiphoton89
@NicoHezel @Alibaba_Qwen @UnslothAI I'm not seeing where Unsloth provided empirical results showing these dynamic GGUFs have better accuracy? Are we supposed to trust them blindly?
Qwen @Alibaba_Qwen
We will release the quantized models of Qwen3 to you in the following days. Today we release the AWQ and GGUF versions of Qwen3-14B and Qwen3-32B, which enable using the models with limited GPU memory.
Qwen3-32B-AWQ: huggingface.co/Qwen/Qwen3-32B…
Qwen3-32B-GGUF: huggingface.co/Qwen/Qwen3-32B…
Qwen3-14B-AWQ: huggingface.co/Qwen/Qwen3-14B…
Qwen3-14B-GGUF: huggingface.co/Qwen/Qwen3-14B…
Note that when using the GGUFs in Ollama and LM Studio, to switch from thinking to non-thinking mode, you just need to add the special token `/no_think` at the end of the input. Below is an example. Enjoy!
Qwen tweet media
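The example screenshot isn't preserved here, but the idea is simply appending `/no_think` to the prompt. A minimal sketch against Ollama's HTTP generate endpoint; the model tag is a placeholder for however the Qwen3 GGUF was registered locally.

```python
# Toggling Qwen3 thinking mode via Ollama's /api/generate endpoint.
# The model tag "qwen3-14b" is a placeholder for your local model name.
import json
import urllib.request

def ask(prompt: str) -> str:
    body = json.dumps({"model": "qwen3-14b", "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate",
                                 data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.load(r)["response"]

print(ask("Why is the sky blue?"))             # thinking mode (default)
print(ask("Why is the sky blue? /no_think"))   # thinking disabled
```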
Nico Hezel @NicoHezel
@gm8xx8 Similar conclusion to @AIatMeta's Perception Encoder. Here are their layerwise experiments.
Nico Hezel tweet media
Franco Maria Nardini @fmnardini
hey #ECIR2025! Are you interested in Approximate Nearest Neighbors search? You should come and see our demo of “kANNolo: Sweet and Smooth Approximate k-Nearest Neighbors Search” in Sagrestia, after the afternoon coffee break! live coding and tasting!
Franco Maria Nardini tweet media