Spok

1.2K posts

@spok_vulkan

live long and prosper 🖖

Joined February 2015
525 Following · 56 Followers
Pinned Tweet
Spok @spok_vulkan:
"AI is the signal. Everything else is noise."
Spok @spok_vulkan:
@ivanfioravanti yes, the max reasoning setting. In most cases it doesn't think for long unless the task is complex, so it adapts by itself. I've seen it think for 3 min or so (very rarely), but if you have 3 agents in parallel then it's fine.
Ivan Fioravanti ᯅ @ivanfioravanti:
For the Claude Code warriors out there, what is the right effort level to be used? 🤔
Spok @spok_vulkan:
@m13v_ so true. The MMLU accuracy drop for 4-bit can be around 2%, but for agentic multi-turn tool calling it can be closer to 50%, which is crazy.
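One way to see why a small per-call drop can blow up into a huge agentic drop: a multi-turn trajectory typically succeeds only if every tool call in it is correct, so per-call accuracy compounds. A toy calculation (the per-call accuracies and turn count below are illustrative assumptions, not measured numbers from the thread):

```python
# Toy model: a trajectory succeeds only if all of its tool calls do,
# so per-call accuracy compounds multiplicatively over the turns.
def trajectory_success(per_call_acc: float, n_calls: int) -> float:
    """Probability that every one of n_calls tool calls is correct."""
    return per_call_acc ** n_calls

# Hypothetical per-call accuracies: a strong baseline vs. a quantized model
# that loses only a few points per call.
fp16 = trajectory_success(0.98, 20)
int4 = trajectory_success(0.93, 20)
print(f"FP16-ish: {fp16:.2f}, INT4-ish: {int4:.2f}")
```

Under these made-up numbers, a ~5-point per-call gap turns into roughly a 3x gap over a 20-call trajectory, which is the shape of the MMLU-vs-agentic discrepancy described above.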
Matt @m13v_:
@spok_vulkan same experience building a local desktop agent. aggressive quantization kills tool-calling way before it hurts chat quality. smaller model + higher precision wins every time for agent work.
Spok @spok_vulkan:
I just ran into something wild building a local AI agent. Qwen3.5-9B at INT4 (ParoQuant) performs WORSE than Qwen3.5-4B at 8-bit on tool-calling benchmarks. More parameters. Worse results. Here's what we found.
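The thread doesn't specify how the tool-calling benchmark scores outputs; a minimal sketch of one plausible scorer, where a model output counts as correct only if it is valid JSON and matches the expected call exactly (the scenario data and the call format are made-up assumptions for illustration):

```python
import json

def score_call(model_output: str, expected: dict) -> bool:
    """Strict scorer: output must parse as JSON and match name + arguments exactly."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # malformed JSON fails outright, as it would break an agent loop
    return (call.get("name") == expected["name"]
            and call.get("arguments") == expected["arguments"])

# Two toy scenarios: one clean call, one truncated by the model.
scenarios = [
    ('{"name": "read_file", "arguments": {"path": "notes.txt"}}',
     {"name": "read_file", "arguments": {"path": "notes.txt"}}),
    ('{"name": "read_file", "arguments": {"path": "notes.txt"',   # truncated output
     {"name": "read_file", "arguments": {"path": "notes.txt"}}),
]
accuracy = sum(score_call(out, exp) for out, exp in scenarios) / len(scenarios)
print(f"tool-calling accuracy: {accuracy:.0%}")
```

The strict all-or-nothing scoring is the point: a chat-quality metric would give the truncated output partial credit, but an agent runtime cannot.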
Spok @spok_vulkan:
@LeanKinPrazli And the funny part is that, in most cases, you won't see the errors directly; the result itself will just be worse overall. You can really only notice it in a direct comparison or in benchmarks.
Spok @spok_vulkan:
@LeanKinPrazli If we're talking about MLX, any 4-bit quants seem to be very bad at tool calling compared to the FP16 baseline, like 2x worse, which is a significant drop in performance. So I would rather use Qwen3.5-4B at 8-bit than Qwen3.5-9B at 4-bit for such a task.
Spok @spok_vulkan:
@ivanfioravanti Your IFEval column is the most interesting one here though. ParoQuant: 0.382. Standard 4bit: 0.172. FP16 baseline? 0.915. I hit this exact problem building an on-device agent. Ran a 14-scenario tool-calling benchmark on Qwen3.5-9B PARO vs 4B 8-bit. x.com/spok_vulkan/st…
Ivan Fioravanti ᯅ @ivanfioravanti:
MLX 4bit vs MLX ParoQuant 4bit using Qwen3.5-9B 📣 As you can see below, there is no match. I will try to do the same with 8-bit in the next few days for a comparison. ParoQuant is my new go-to quantization below 8-bit! I limited max-tokens in some cases, but the important thing is that the same limits were applied to both quantizations.
[attached image: benchmark comparison]
Spok @spok_vulkan:
What we'd love to see from quantization research:
- Tool-calling accuracy benchmarks
- Structured output format compliance
- Multi-turn instruction following eval
- Exact string reproduction tests
Until then, the "0.9% accuracy drop" headline is misleading.
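The last item on the list is the cheapest to implement: ask the model to echo a string verbatim (a path, an ID, a code snippet) and require a character-exact match, where a chat-style similarity metric would still give near-full credit. A minimal sketch (the reference string is a made-up example):

```python
def exact_reproduction(reference: str, model_output: str) -> bool:
    """Character-exact match after trimming surrounding whitespace only."""
    return model_output.strip() == reference

ref = "src/agents/tool_router.py"
assert exact_reproduction(ref, "src/agents/tool_router.py")
assert not exact_reproduction(ref, "src/agents/tool-router.py")  # one char off
```

A single wrong character here is a hard failure for an agent (the file open fails), even though it barely moves any fuzzy text-quality score.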
Spok @spok_vulkan:
This doesn't mean ParoQuant is bad. It's genuinely the best INT4 linear quantization method out there. It just means the benchmarks we use to evaluate quantization methods are blind to the capabilities that matter most for agents.