Spok

1.2K posts

@spok_vulkan

live long and prosper 🖖

Joined February 2015

525 Following · 56 Followers

Pinned Tweet

Spok@spok_vulkan·28 Sep

"AI is the signal. Everything else is noise."

316

Spok@spok_vulkan·2h

@vilinskyy 🤣

Alexander Vilinskyy@vilinskyy·12h

i would call it... tupperware problem.

6.1K

132.3K

Spok@spok_vulkan·2h

@ivanfioravanti Yes, the max reasoning setting. In most cases it doesn't think for long unless the task is complex, so it adapts by itself. I saw it think for 3 min or so (very rarely), but if you have 3 agents in parallel then it's fine.

Ivan Fioravanti ᯅ@ivanfioravanti·2h

@spok_vulkan to the very top? Is it slow?

Ivan Fioravanti ᯅ@ivanfioravanti·2h

For the Claude Code warriors out there, what is the right effort level to be used? 🤔

1.9K

Spok@spok_vulkan·6h

@m13v_ So true. The MMLU accuracy drop for 4-bit can be around 2%, but for agentic multi-turn tool calling it can be closer to 50%, which is crazy.

Matt@m13v_·6h

@spok_vulkan same experience building a local desktop agent. aggressive quantization kills tool-calling way before it hurts chat quality. smaller model + higher precision wins every time for agent work.

Spok@spok_vulkan·9h

I just ran into something wild building a local AI agent. Qwen3.5-9B at INT4 (ParoQuant) performs WORSE than Qwen3.5-4B at 8-bit on tool-calling benchmarks. More parameters. Worse results. Here's what we found.

790

Spok@spok_vulkan·6h

@LeanKinPrazli And the funny part is that, in most cases, you will not see the errors directly. The result itself will just be worse overall; you can really notice it only in a direct comparison or in benchmarks.

Spok@spok_vulkan·6h

@LeanKinPrazli If we talk about MLX, any 4-bit quants seem to be very bad at tool calling compared to the FP16 baseline, like 2x worse, which is a significant drop in performance. So I would rather use Qwen3.5-4B at 8-bit than Qwen3.5-9B at 4-bit for such a task.

Spok@spok_vulkan·9h

@ivanfioravanti Your IFEval column is the most interesting one here though. ParoQuant: 0.382. Standard 4bit: 0.172. FP16 baseline? 0.915. I hit this exact problem building an on-device agent. Ran a 14-scenario tool-calling benchmark on Qwen3.5-9B PARO vs 4B 8-bit. x.com/spok_vulkan/st…

Spok@spok_vulkan

Ivan Fioravanti ᯅ@ivanfioravanti·13 Mar

MLX 4bit vs MLX ParoQuant 4bit using Qwen3.5-9B 📣 As you can see below, there is no match. I will try to do the same with 8bit in the next days for a comparison. ParoQuant is my new go-to quantization below 8bit! I have limited max-tokens in some cases, but the important thing is that the same limits have been applied to both quantizations.

134

19.2K
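
To put the three IFEval figures quoted in the reply above on one scale, here is a quick sketch. The scores are copied from that reply; expressing them as a "fraction of the FP16 baseline retained" is just one convenient framing and is an assumption of this example, not something the thread itself computes.

```python
# IFEval scores as quoted in the reply above (Qwen3.5-9B, MLX).
ifeval = {
    "FP16 baseline": 0.915,
    "ParoQuant 4-bit": 0.382,
    "standard 4-bit": 0.172,
}

baseline = ifeval["FP16 baseline"]

# Fraction of the baseline score each variant keeps (illustrative framing).
retained = {name: score / baseline for name, score in ifeval.items()}

for name, frac in retained.items():
    print(f"{name}: {frac:.1%} of baseline")
```

On these numbers, ParoQuant keeps roughly 42% of the baseline IFEval score and standard 4-bit roughly 19%.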

Spok@spok_vulkan·9h

What we'd love to see from quantization research:
- Tool-calling accuracy benchmarks
- Structured output format compliance
- Multi-turn instruction following eval
- Exact string reproduction tests

Until then, the "0.9% accuracy drop" headline is misleading.
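
As a rough illustration of what two of those checks could look like, here is a minimal sketch that scores format compliance and exact-match tool-call accuracy. It is not the 14-scenario benchmark mentioned in the thread: the JSON shape (`{"name": ..., "arguments": {...}}`), the helper names `parse_tool_call` and `score`, and the sample outputs are all hypothetical and chosen only for the example.

```python
import json


def parse_tool_call(raw):
    """Return (name, arguments) if raw is a well-formed JSON tool call, else None.

    The {"name": str, "arguments": dict} shape is an assumption of this sketch.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    name, args = obj.get("name"), obj.get("arguments")
    if not isinstance(name, str) or not isinstance(args, dict):
        return None
    return name, args


def score(cases):
    """cases: list of (model_output, expected_name, expected_arguments)."""
    if not cases:
        raise ValueError("need at least one case")
    well_formed = correct = 0
    for raw, expected_name, expected_args in cases:
        call = parse_tool_call(raw)
        if call is None:
            continue  # output did not follow the structured format at all
        well_formed += 1
        # Exact match: tool name and every argument string must be reproduced.
        if call == (expected_name, expected_args):
            correct += 1
    n = len(cases)
    return {"format_compliance": well_formed / n, "call_accuracy": correct / n}


# Hypothetical model outputs, for illustration only.
cases = [
    ('{"name": "read_file", "arguments": {"path": "notes.txt"}}',
     "read_file", {"path": "notes.txt"}),
    ('{"name": "read_file", "arguments": {"path": "note.txt"}}',   # string not reproduced exactly
     "read_file", {"path": "notes.txt"}),
    ('read_file(path="notes.txt")',                                # not JSON
     "read_file", {"path": "notes.txt"}),
    ('{"name": "search", "arguments": {"query": "mlx"}}',
     "search", {"query": "mlx"}),
]

result = score(cases)
print(result)
```

Keeping format compliance separate from call accuracy makes it visible whether a quantized model fails by breaking the output format or by producing a well-formed call with the wrong content.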

Spok@spok_vulkan·9h

This doesn't mean ParoQuant is bad. It's genuinely the best INT4 linear quantization method out there. It just means the benchmarks we use to evaluate quantization methods are blind to the capabilities that matter most for agents.

Spok retweeted

Fight With Memes@FightWithMemes·1d

378

6.1K

148.4K
