Nico Hezel @NicoHezel
35 posts

Image Processing and Machine Learning fanatic with an affinity for performance optimization.

Germany / Berlin · Joined May 2011
33 Following · 42 Followers
Nico Hezel @NicoHezel
Comparing llama.cpp v8184 against v8287 across the Qwen 3.5 family reveals massive token generation speed improvements. The fewer active parameters a model has, the larger the performance jump:
0.8B: +31%
4B: +20.5%
9B: +14%
27B: +10.5%
35B-A3B (MoE): +23.5%
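For anyone wanting to reproduce this kind of A/B comparison, here is a minimal sketch driving llama-bench from two checked-out builds. The binary paths and model filename are placeholders, and the JSON field name is an assumption based on recent llama.cpp builds; verify with `llama-bench --help` and a sample `-o json` run.

```python
# A/B token-generation benchmark across two llama.cpp builds.
# Binary paths, model file, and the "avg_ts" JSON key are assumptions;
# check them against your own llama-bench build before trusting results.
import json
import subprocess

BUILDS = {"v8184": "./llama-v8184/llama-bench",
          "v8287": "./llama-v8287/llama-bench"}
MODEL = "qwen3.5-9b-q4_k_m.gguf"  # placeholder model file

def gen_speed(bench_bin: str) -> float:
    """Run a generation-only benchmark and return mean tokens/s."""
    out = subprocess.run(
        [bench_bin, "-m", MODEL, "-p", "0", "-n", "128", "-r", "5", "-o", "json"],
        capture_output=True, text=True, check=True).stdout
    results = json.loads(out)            # one entry per test (here: tg128)
    return float(results[0]["avg_ts"])   # mean tokens/s over repetitions

if __name__ == "__main__":
    old, new = (gen_speed(BUILDS[v]) for v in ("v8184", "v8287"))
    print(f"v8184: {old:.1f} t/s, v8287: {new:.1f} t/s "
          f"({100 * (new - old) / old:+.1f}%)")
```

Running the same model file through both binaries on the same machine isolates the software delta, which is the whole point of the comparison above.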
Nico Hezel @NicoHezel
Next test: Qwen3.5 27B (GGUF Q4) on llama.cpp v8184 vs. v8287. While the MoE model saw a ~25% jump, this dense 27B variant still gained 10% in generation speed on the newer build. Lower gains, but still a "free" performance boost just by updating binaries. Software optimization remains the most undervalued variable in local LLM benchmarking. Full comparison across all Qwen3.5 models is in the works. Stay tuned.
Nico Hezel @NicoHezel
I just compared Qwen3.5 35B A3B (GGUF Q4) on two different llama.cpp versions: v8184 (two weeks old) vs. v8287 (yesterday). Same hardware, but token generation speed increased by almost 25%. It's a reminder that hardware is only half the battle. Especially when new models drop, staying on top of software updates is mandatory to actually get the performance you paid for.
0xSero @0xSero
My missions have taken me places: I am up to 88.56% accuracy with only 37% of experts in VRAM at a time and an incredibly low swap time. I am probably wrong in a billion ways, but here's the repo: github.com/0xSero/reap-ex…
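I haven't read the repo, but the headline numbers suggest partial expert residency with on-demand swapping: keep a fixed budget of MoE experts on the GPU and evict the least recently used one when a new expert is routed to. A hypothetical sketch of that idea; all names and sizes are illustrative, not taken from the repo.

```python
# Hypothetical sketch of partial-expert residency with LRU eviction.
# Not the repo's actual implementation; load_fn and the budget are
# placeholders for however expert weights get moved to VRAM.
from collections import OrderedDict

class ExpertCache:
    def __init__(self, load_fn, budget: int):
        self.load_fn = load_fn          # loads one expert's weights to the GPU
        self.budget = budget            # max experts resident at once
        self.resident = OrderedDict()   # expert_id -> weights, in LRU order

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as recently used
            return self.resident[expert_id]
        if len(self.resident) >= self.budget:
            self.resident.popitem(last=False)      # evict least recently used
        self.resident[expert_id] = self.load_fn(expert_id)
        return self.resident[expert_id]

# e.g. keeping ~37% of 128 experts resident: ExpertCache(load_expert, budget=47)
```

The accuracy/VRAM trade-off then comes down to how often routing misses the resident set, which is what the reported swap time measures.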
Nico Hezel @NicoHezel
@oatius90180 @HuggingModels This is just a small model that adopted the thinking patterns of Opus. This improves critical thinking, as it lays out more information first before coming to a conclusion. But it surely does not beat a 50x bigger, multi-million if not billion dollar model from Google.
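For context, "adopting the thinking patterns" of a teacher typically means sequence-level distillation: the student is fine-tuned on reasoning traces sampled from the teacher, so it learns the shape of the reasoning rather than new knowledge. A minimal sketch assuming a Hugging Face causal LM; the checkpoint name and the single example trace are placeholders.

```python
# Minimal sequence-level distillation sketch (illustrative only).
# "student-base" and the trace below are placeholders, not real data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("student-base")
student = AutoModelForCausalLM.from_pretrained("student-base")
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

# (prompt, teacher output with explicit step-by-step reasoning)
teacher_traces = [
    ("Is 91 prime?", "<think>91 = 7 * 13, so it is composite.</think> No."),
]

for prompt, trace in teacher_traces:
    batch = tok(prompt + "\n" + trace, return_tensors="pt")
    # Plain next-token cross-entropy on the teacher's full trace: the
    # student imitates the laid-out reasoning, not the teacher's weights.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```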
Hugging Models @HuggingModels
Meet GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill: a distilled reasoning powerhouse. This GGUF model combines GLM architecture with Claude-level reasoning, distilled for efficiency. It's like getting premium reasoning in a lightweight package. Perfect for local deployment!
Hugging Models tweet media
Nico Hezel @NicoHezel
I hadn’t used Antigravity for a few days and just opened it to check the model quotas. From what I see, Gemini 3.1 Pro and the Claude models now have weekly quotas, while Flash still runs on the 5-hour quota window. That basically makes Antigravity unusable for more complex tasks. Also calling the Pro plan the one for people who “live in the IDE and don’t rely on agents” feels misleading, since Antigravity itself runs agents under the hood.
Nico Hezel tweet media
Google Antigravity @antigravity
We’re evolving Google AI plans to give you more control over how you build. Every subscription includes built-in AI credits, which can now be used for Antigravity, giving you a seamless path to scale.

Google AI Pro is the home for the practical builder: hobbyists, students, and developers who live in the IDE and don't necessarily rely on an agent. This plan features generous limits for Gemini Flash, with a baseline quota included to "taste test" our most advanced premium models.

Google AI Ultra serves as the daily driver for those shipping at the highest scale who need consistent, high-volume access to our most complex models.

If you’re on Pro but need "extra juice" for a heavy sprint or deeper access to premium models, simply top up your AI credits to customize your plan.

Keep building. Keep shipping.
Mehdi Ataei @AtaeiMe
@andrew_n_carr I think the models are kind of useless for daily use unless they work on fixing their refusal rates. 6/10 normal requests are rejected by the models.
Andrew Carr 🤸 @andrew_n_carr
it's getting a bit hard to keep up, but there's a lot to be said for the 3.5 27B. The 9B approaching the previous-gen 235B is certainly something.
Andrew Carr 🤸 tweet media
Nico Hezel @NicoHezel
@sudoingX The Ryzen AI 9 HX 375 (Radeon 890M) is already hitting 19 t/s (Vulkan) and 17 t/s (ROCm) on this 35B model. For an iGPU, that's only ~11 t/s behind the 6800XT (20-30 t/s) mentioned. If an integrated chip is this close, tuned numbers for the 6800XT should be significantly higher.
Sudo su @sudoingX
AMD entering the chat. 6800XT pulling 20-30 tok/s on Qwen3.5-35B-A3B. haven't tested ROCm yet but it's on the list. the model fits in 20GB, so every AMD card with 24GB VRAM should run it at full speed. NVIDIA numbers are pouring in. AMD users: the AMD side is wide open.
Dark @DarkSmak812

@sudoingX getting 20-30 tok/s on 6800XT 16GB and 7950X3D 64GB. Slow but usable.
Nico Hezel @NicoHezel
@sudoingX Ryzen AI 9 HX 375 (Radeon 890M / iGPU) benchmarks for Qwen3.5-35B-A3B-UD-Q4_K_XL using llama.cpp:
Vulkan: 280 t/s prefill | 19 t/s gen
ROCm: 230 t/s prefill | 17 t/s gen
Interesting: Vulkan currently outperforms ROCm by ~12% in generation and ~22% in prefill on this silicon.
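The quoted percentages follow directly from the raw throughput numbers; a quick sanity check:

```python
# Reproducing the Vulkan-vs-ROCm gap from the reported raw numbers.
vulkan = {"prefill": 280.0, "gen": 19.0}   # t/s
rocm   = {"prefill": 230.0, "gen": 17.0}   # t/s

for phase in ("prefill", "gen"):
    gap = (vulkan[phase] / rocm[phase] - 1) * 100
    print(f"{phase}: Vulkan ahead by {gap:.0f}%")
# prefill: Vulkan ahead by 22%
# gen: Vulkan ahead by 12%
```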
Sudo su @sudoingX
AMD is pulling 20-30 tok/s on the same model NVIDIA hits 112-157 tok/s on. same Qwen3.5-35B-A3B. same 4-bit quant. 19.7GB on disk. fits entirely on any 24GB card. but most AMD submissions so far are Vulkan or ROCm with default configs. nobody has gone deep on tuning yet. NVIDIA's numbers climbed 2-3x once people started optimizing.
6800XT (16GB): 20-30 tok/s
Ryzen AI Max+ 395 (96GB unified): 59 tok/s
3090 (24GB): 112 tok/s (was 50 before flags)
4090 (24GB): 157 tok/s (was ~80 stock)
haven't tested ROCm myself yet. definitely on the list. AMD users: try llama.cpp from source with ROCm. try the cache flags. the gap might be real or it might be a config gap. only one way to find out. want to see those numbers climb.
Sudo su @sudoingX

the numbers coming in from this thread:
5090: 166 tok/s (z33b0t), 153 tok/s (EmmanuelMr)
4090: 122 tok/s (StubbyTech)
3090: 112 tok/s (sudo), 100 tok/s (Eduardo)
6800XT: 20-30 tok/s (Dark)
Qwen3.5-35B-A3B. 4-bit quant, 19.7 GB on disk. fits entirely on a single 24GB 3090 with room to spare. no offloading, no splitting, full speed. 5090 owners keep pushing the ceiling and we haven't found it yet. NVIDIA side is stacking up. where are the ROCm numbers? report your GPU and tok/s below. building the full map.
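The "cache flags" being referred to are presumably llama.cpp's KV-cache quantization options plus flash attention and full GPU offload. A hedged sketch of a stock-vs-tuned llama-bench comparison; the model filename is a placeholder, and flag names should be verified against `--help` on your build.

```python
# One way to A/B "stock vs tuned" on an AMD card: same model, llama-bench
# built from source with ROCm, then adding full offload, flash attention,
# and a quantized KV cache. Flag names match recent llama.cpp builds;
# verify with `llama-bench --help` before relying on them.
import subprocess

MODEL = "Qwen3.5-35B-A3B-Q4_K_M.gguf"  # placeholder filename
BASE = ["./llama-bench", "-m", MODEL, "-p", "512", "-n", "128"]
TUNED = ["-ngl", "99",        # offload all layers to the GPU
         "-fa", "1",          # enable flash attention
         "-ctk", "q8_0",      # quantize the K cache
         "-ctv", "q8_0"]      # quantize the V cache (requires flash attn)

subprocess.run(BASE, check=True)          # stock run
subprocess.run(BASE + TUNED, check=True)  # tuned run
```

If AMD numbers climb the way the NVIDIA ones did, it would confirm the gap is a config gap rather than a hardware one.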

Nico Hezel @NicoHezel
@txhno @HuggingModels Why? It's just a distill of GLM-4.7-Flash with Claude-Opus-4.5-High-Reasoning as the teacher.
Nico Hezel @NicoHezel
@huang_chao4969 OpenAI Prism is not a competitor; it does not feel like Overleaf at all, more like a cheap notebook. LiteWrite seems to have more LaTeX settings and helpful buttons for writing LaTeX code. Only the dark color scheme is too dark and colorful, and some icons are missing tooltips.
Chao Huang @huang_chao4969
Just got sniped by OpenAI's Prism 🎯... We're open-sourcing ✨LiteWrite✨ next week!

We've all been suffering with Overleaf for way too long - researchers constantly frustrated by writing pain points, with basic features locked behind paywalls... So our team decided to build LiteWrite from scratch - an Overleaf alternative focused on AI-native vibe writing. Fast compilation with support for dozens of real-time collaborators.

Just noticed OpenAI dropping Prism... 😅 Talk about perfect timing! We assumed big tech would stay away from such a niche market, but apparently not... Well, this just accelerated our timeline! We're open-sourcing LiteWrite next week. We already have solid user traction with great feedback, and our mission remains: giving researchers a truly powerful AI-native writing platform 💪

LiteWrite - AI-Native Writing Suite That Makes Creating Feel Natural
- ⚡ TAP Smart Completion: Type a few words, AI continues the thought, just hit Tab and you're done
- 💬 ASK Mode: Summon your AI assistant anytime - stuck on grammar? formatting issues? Just ask and move on
- 🤖 Agent Mode: The ultimate hands-off experience! AI auto-edits, polishes, formats, or unleashes creative flow for you
- 🔍 Deep Research: AI conducts deep research + auto-generates reports, research while you write 👨‍🏫

✨ Try LiteWrite: litewrite.ai
sun 🐶 @sunncynn
@iScienceLuvr Not only do people sleep on Qwen, even the benchmarks do 😭🥲
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
Llama-Nemotron: Efficient Reasoning Models

NVIDIA introduces the Llama-Nemotron series, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. "As of April 2025, our flagship model LN-Ultra is the most “intelligent” open model according to Artificial Analysis." The family comes in three sizes: Nano (8B), Super (49B), and Ultra (253B). It performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency.
Tanishq Mathew Abraham, Ph.D. tweet media
LP89 @Lumiphoton89
@NicoHezel @Alibaba_Qwen @UnslothAI I'm not seeing where Unsloth provided empirical results showing these dynamic GGUFs have better accuracy? Are we supposed to trust them blindly?
Qwen @Alibaba_Qwen
We will release the quantized models of Qwen3 to you in the following days. Today we release the AWQ and GGUF versions of Qwen3-14B and Qwen3-32B, which enable using the models with limited GPU memory.
Qwen3-32B-AWQ: huggingface.co/Qwen/Qwen3-32B…
Qwen3-32B-GGUF: huggingface.co/Qwen/Qwen3-32B…
Qwen3-14B-AWQ: huggingface.co/Qwen/Qwen3-14B…
Qwen3-14B-GGUF: huggingface.co/Qwen/Qwen3-14B…
Note that when using the GGUFs in Ollama and LM Studio, to switch from thinking to non-thinking mode, you just need to add the special token `/no_think` at the end of the input. Below is an example. Enjoy!
Qwen tweet media
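The example screenshot isn't preserved here, but the idea is simply appending `/no_think` to the prompt. A minimal sketch against Ollama's HTTP generate endpoint; the model tag is a placeholder for however the Qwen3 GGUF was registered locally.

```python
# Toggling Qwen3 thinking mode via Ollama's /api/generate endpoint.
# The model tag "qwen3-14b" is a placeholder for your local model name.
import json
import urllib.request

def ask(prompt: str) -> str:
    body = json.dumps({"model": "qwen3-14b", "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate",
                                 data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.load(r)["response"]

print(ask("Why is the sky blue?"))             # thinking mode (default)
print(ask("Why is the sky blue? /no_think"))   # thinking disabled
```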
Nico Hezel @NicoHezel
@gm8xx8 Similar conclusion to @AIatMeta's Perception Encoder. Here are their layerwise experiments.
Nico Hezel tweet media
Franco Maria Nardini @fmnardini
hey #ECIR2025! Are you interested in Approximate Nearest Neighbors search? You should come and see our demo of “kANNolo: Sweet and Smooth Approximate k-Nearest Neighbors Search” in Sagrestia, after the afternoon coffee break! live coding and tasting!
Franco Maria Nardini tweet media