andthattoo
@andthatto

694 posts

drums of liberation @driaforall

Joined July 2016
966 Following · 780 Followers
Pinned Tweet
andthattoo @andthatto ·
Qwen 3.6 is frontier for local. It also thinks forever. I tried a dumb inference-time trick: make its <think> block obey a tiny grammar. Result:
- HumanEval+: 22x fewer think tokens, no accuracy loss
- LiveCodeBench public slice: +14% pass@1, ~5x fewer total tokens
51 replies · 82 reposts · 1.3K likes · 128.6K views
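For concreteness, here is a minimal sketch of what "make its <think> block obey a tiny grammar" can look like against llama-server, whose /completion endpoint accepts a GBNF string in a grammar field. The grammar below, the GOAL/APPROACH/EDGE labels (borrowed from the shape described downthread), and the endpoint URL are illustrative assumptions, not the repo's actual setup; it also assumes the chat template leaves the model to emit its own <think> tags.

```python
import json
import urllib.request

# Illustrative GBNF: force the reasoning into three short labeled lines,
# then leave the final answer unconstrained. Bounded repetition {m,n} and
# \xHH escapes are supported by llama.cpp's GBNF; the repo's actual
# grammars may look different.
THINK_GRAMMAR = r'''
root     ::= "<think>\n" section section section "</think>\n" answer
section  ::= ("GOAL: " | "APPROACH: " | "EDGE: ") line
line     ::= [^\n]{1,120} "\n"
answer   ::= [^\x00]*
'''

def complete(prompt: str, url: str = "http://localhost:8080/completion") -> str:
    """Send a completion request with a GBNF grammar applied at decode time."""
    payload = {
        "prompt": prompt,
        "grammar": THINK_GRAMMAR,  # llama-server constrains sampling to this
        "n_predict": 1024,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

The design point is that only the think block is pinned down; the answer nonterminal stays free-form, so the final output format is untouched.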
andthattoo @andthatto ·
Opened a llama.cpp discussion about whether custom GBNF grammars can compose with tool calls in llama-server. Right now tools work alone, grammar works alone, but tools+grammar doesn't. If you use llama.cpp + agent frameworks, sharing your use cases would help move the design faster. github.com/ggml-org/llama…
0 replies · 0 reposts · 5 likes · 338 views
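To make the gap concrete, a sketch of the request shapes involved. These payloads are hypothetical; llama-server takes grammar as a raw GBNF extension field and tools per the OpenAI schema, but per the discussion above the two don't compose in one request.

```python
# Each of these works on its own against llama-server's
# OpenAI-compatible /v1/chat/completions endpoint:

tools_only = {
    "model": "qwen",
    "messages": [{"role": "user", "content": "What's the weather in Izmir?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

grammar_only = {
    "model": "qwen",
    "messages": [{"role": "user", "content": "Plan a fix for this bug."}],
    # llama-server extension: a raw GBNF string applied at decode time
    "grammar": 'root ::= "GOAL: " [^\\n]+ "\\n"',
}

# The open question in the linked discussion: a request like this, where a
# custom grammar constrains the reasoning while tool-call JSON stays valid.
tools_plus_grammar = {**tools_only, "grammar": grammar_only["grammar"]}
```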
andthattoo @andthatto ·
@XReyRobert yes I've noticed, gonna post about this in the llama.cpp discussions
0 replies · 0 reposts · 1 like · 21 views
XReyRobert @XReyRobert ·
@andthatto It seems that llama.cpp supports grammars and tool calling, but not both in the same request... I tried and failed to use this with Hermes tonight...
1 reply · 0 reposts · 0 likes · 40 views
chad @chaddotphp ·
@andthatto I'm testing locally with this now, and so far the results are very impressive. Thanks!
1 reply · 0 reposts · 2 likes · 792 views
LeetLLM.com @leetllm ·
@andthatto super clever trick, but my brain immediately goes to the bitter lesson. manually constraining reasoning traces works great today, but raw scale is just going to steamroll heuristics like this.
2 replies · 0 reposts · 18 likes · 4.1K views
andthattoo @andthatto ·
@voxmenthe @VictorTaelin Yes, basically. It’s an inference-time prior over the shape of the scratchpad. The bet is: many reasoning tokens are low-value narration, and a small structured harness preserves the useful planning bits while cutting the ramble.
1 reply · 1 repost · 36 likes · 828 views
andthattoo @andthatto ·
@vega_holdings the repo is not plug-and-play, so make sure codex/claude reads it and makes something useful out of it for your case, good luck!
1 reply · 0 reposts · 1 like · 120 views
vega @vega_holdings ·
@andthatto qwen overthinking was killing my per-turn token limit, will try it out, thanks!
1 reply · 0 reposts · 1 like · 144 views
andthattoo @andthatto ·
Tiny grammar = constrain only the <think> block at decoding time. So instead of free-form thought, it must write e.g. GOAL / APPROACH / EDGE. The final answer is still open. Yes, it can affect thinking. The surprising part is that in these runs it compressed reasoning hard without hurting pass@1, and improved it in some cases. But that is purely task+grammar dependent.
4 replies · 0 reposts · 51 likes · 4K views
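Since the labels that fit one task may not fit another (a point repeated downthread), one might generate the tiny grammar per task rather than hardcode it. A hypothetical helper, not from the repo, assuming llama.cpp's GBNF syntax with bounded repetition:

```python
def think_grammar(sections: list[str], max_line: int = 120) -> str:
    """Build a GBNF grammar forcing the <think> block into short labeled
    lines, one label per section, leaving the answer free-form.
    Note: section+ allows labels to repeat or appear in any order; a
    stricter grammar could fix the sequence."""
    labels = " | ".join(f'"{s}: "' for s in sections)
    return (
        'root ::= "<think>\\n" section+ "</think>\\n" answer\n'
        f"section ::= ({labels}) line\n"
        f'line ::= [^\\n]{{{1},{max_line}}} "\\n"\n'
        "answer ::= [^\\x00]*\n"
    )

# e.g. the shape from the tweet:
print(think_grammar(["GOAL", "APPROACH", "EDGE"]))
```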
Taelin @VictorTaelin ·
@andthatto a tiny grammar? wdym? won't that affect its thinking
1 reply · 0 reposts · 27 likes · 6K views
andthattoo @andthatto ·
@ljupc0 Go ahead! But make sure to explore the grammar best fitting your tasks/agents. The ones in the repo may not be optimal for your case.
0 replies · 0 reposts · 1 like · 164 views
Ljubomir Josifovski @ljupc0 ·
Wow - thanks! Exactly what I need. I've had a problem with the Qwen-s thinking a lot, need to put a limit on that (need a response sooner, even if not perfect). The 3.5-s were bad in that they output nothing when interrupted :-( Couldn't use them reliably. The 3.6-s are better now :-) they do produce output when interrupted with
  --n-predict 8192
  --reasoning on --reasoning-format deepseek
  --chat-template-kwargs '{"preserve_thinking":true}'
  --reasoning-budget 3072 --reasoning-budget-message 'Reasoning budget exhausted. Stop thinking and provide the best final answer now.'
But of course it would be even better if they didn't get stuck thinking forever in the 1st place :-) I also use an LCB tiny portion to test - giving this a try now... Thanks!
2 replies · 0 reposts · 1 like · 244 views
andthattoo @andthatto ·
My insight is that a lot of verbose CoT is scaffolding, not essential computation. Constrained decoding can force a denser interface to the model’s latent reasoning. But if the task really needs more deliberation, it leaks somewhere else.
2 replies · 1 repost · 61 likes · 6.5K views
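One way to sanity-check the scaffolding claim on your own tasks: run the same prompts with and without the grammar and compare think-token counts. A rough sketch with crude whitespace tokenization (hypothetical helpers; swap in the model's real tokenizer for exact numbers):

```python
import re

def think_span(text: str) -> str:
    """Extract the reasoning between <think> tags, if present."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    return m.group(1) if m else ""

def rough_token_count(text: str) -> int:
    # Whitespace split is a crude proxy for tokens; use the model's
    # tokenizer for real measurements.
    return len(text.split())

def compression_ratio(free_form: str, constrained: str) -> float:
    """How many times fewer think tokens the constrained run used."""
    free = rough_token_count(think_span(free_form))
    tight = rough_token_count(think_span(constrained)) or 1
    return free / tight
```

Pairing this with pass@1 on the same slice is what separates "denser interface" from "deliberation leaking somewhere else".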
andthattoo @andthatto ·
I'm onto something
1 reply · 0 reposts · 4 likes · 163 views
andthattoo retweeted
DeepSeek @deepseek_ai ·
🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.
🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.
Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today!
📄 Tech Report: huggingface.co/deepseek-ai/De…
🤗 Open Weights: huggingface.co/collections/de…
1/n
1.6K replies · 7.6K reposts · 44.4K likes · 9M views