Eduardo Gonzalez

3.8K posts

@wm_eddie

Founder of @xpressai, maker of AI infrastructure tools. Co-author of the Japanese book “Learning DL by Implementing Applications”. Also on sigmoid.

Himeji-shi, Hyogo Joined December 2007
907 Following, 778 Followers
Eduardo Gonzalez
Eduardo Gonzalez@wm_eddie·
@ktosopl My son got into Starfox 64 recently. It really is a masterpiece of a game. Almost 30 years later and it is still a lot of fun and still looks great.
English
1
0
1
47
Eduardo Gonzalez reposted
Artificial Analysis
Artificial Analysis@ArtificialAnlys·
Alibaba's Qwen3.6 27B is the new open weights leader under 150B parameters scoring 46 on the Artificial Analysis Intelligence Index, but uses ~3.7x the output tokens and costs ~21x more than Gemma 4 31B (39) to run the full Intelligence Index

@Alibaba_Qwen has released two open weights models in the Qwen3.6 family: Qwen3.6 27B (Dense, 46 on the Intelligence Index) and Qwen3.6 35B A3B (MoE, 43). The MoE variant has 36B total parameters but only activates 3B per forward pass. Both are Apache 2.0 licensed, support 262K context, include native multimodal input, and use the unified thinking/non-thinking hybrid architecture. Unlike Qwen3.5, Alibaba has not released larger Qwen3.6 models as open weights - Qwen3.6 Plus and Qwen3.6 Max Preview remain proprietary, so the Qwen3.6 open weights family is currently all under 50B models.

All scores below are for reasoning mode. The Intelligence Index is our synthesis metric incorporating 10 evaluations covering agentic tasks, coding, and scientific reasoning.

Key takeaways:
➤ Qwen3.6 27B is the most intelligent open weights model under 150B parameters. At 46 on the Intelligence Index, Qwen3.6 27B is ahead of Qwen3.6 35B A3B (43), Qwen3.5 27B (42), and Gemma 4 31B (39). It is also ahead of larger open weights models including NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 36), Qwen3.5 122B A10B (42) and gpt-oss-120b (high, 33). In native BF16 precision, the 27B takes ~56GB to store the weights, fitting on a single H100, and in 4-bit quantization the weights fit on consumer hardware with 16GB+ of RAM
➤ Qwen3.6 35B A3B is the most intelligent open weights model with ~3B active parameters, 6 points ahead of Qwen3.5 35B A3B (37) and 13 points ahead of GLM-4.7-Flash (30). Other ~3B active peers include Gemma 4 26B A4B (31), Qwen3 Coder Next (80B total, 28), and NVIDIA Nemotron Cascade 2 30B A3B (28)
➤ AA-Omniscience improvement is driven entirely by abstention rather than accuracy. Qwen3.6 27B's hallucination rate falls from 80% to 48% versus Qwen3.5 27B, while accuracy is roughly flat - consistent with our finding that AA-Omniscience accuracy typically correlates with total parameter count and Qwen3.6 27B retains the same 27B parameter count as its predecessor. The 35B A3B shows the same pattern whereby hallucination drops from 84% to 50% while accuracy remains equivalent
➤ Token usage is up across both models versus Qwen3.5 and significantly higher than Gemma 4 31B. Qwen3.6 27B used ~144M output tokens to run the Intelligence Index (~1.5x Qwen3.5 27B at 98M, ~3.7x Gemma 4 31B at 39M). Qwen3.6 35B A3B used ~143M (~1.4x Qwen3.5 35B A3B at 100M, ~3.7x Gemma 4 31B)
➤ The 27B got materially more expensive while the 35B A3B is roughly flat versus predecessor. Per-token pricing on Alibaba Cloud moved differently, with the 27B going from $0.30/$2.40 to $0.60/$3.60 while the 35B A3B (Reasoning) remains nearly flat at $0.248/$1.485 (vs $0.25/$2.00 for Qwen3.5 35B A3B). Qwen3.6 27B costs ~$659 to run the Intelligence Index, ~2.2x Qwen3.5 27B (~$299) and ~21x Gemma 4 31B (~$31 at median third-party pricing of $0.14/$0.40 per 1M input/output tokens). Qwen3.6 35B A3B costs ~$280, roughly tied with Qwen3.5 35B A3B (~$302) and ~9x Gemma 4 31B
➤ Qwen3.6 27B is competitive with leading models on agentic real-world work tasks despite its size. At 1414 Elo on GDPval-AA, Qwen3.6 27B is ahead of recent open weights peers Qwen3.6 35B A3B (1297), Qwen3.5 27B (1157) and Gemma 4 31B (1115), but trails larger open weights leaders including DeepSeek V4 Pro (Reasoning, Max Effort, 1554) and GLM-5.1 (Reasoning, 1535). It matches DeepSeek V4 Flash (Reasoning, High Effort, 1414) at 284B total parameters, and sits roughly in line with GPT-5.4 mini (xhigh, 1436) and Muse Spark (1421).
➤ Non-reasoning variants remain equivalent versus Qwen3.5. Qwen3.6 27B (Non-reasoning, 37) is effectively tied with Qwen3.5 27B (Non-reasoning, 37); Qwen3.6 35B A3B (Non-reasoning, 32) is equivalent to Qwen3.5 35B A3B (Non-reasoning, 31). The Qwen3.6 generation gains are concentrated in reasoning mode

Other information:
➤ Context window: 262K tokens (equivalent to Qwen3.5)
➤ License: Apache 2.0
➤ Multimodality: Native vision input (text and image), text output
➤ API pricing (Alibaba Cloud): Qwen3.6 27B: $0.60/$3.60, Qwen3.6 35B A3B (Reasoning): $0.248/$1.485
➤ Availability: Available on Alibaba Cloud first-party API. Qwen3.6 35B A3B is available on several third-party APIs such as @DeepInfra, @parasail_io, @clarifai and @novita_labs
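
The multipliers and memory figures quoted in this post can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch, not anything published by Artificial Analysis: the helper names `ratio` and `weight_gb` are hypothetical, the inputs are the rounded figures from the post, and the memory estimate assumes a nominal 27e9 parameters with no quantization overhead.

```python
def ratio(a: float, b: float) -> float:
    """Return how many times larger a is than b."""
    return a / b


def weight_gb(params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1e9 bytes), ignoring any overhead."""
    return params * bits_per_param / 8 / 1e9


# Output tokens (millions) and total cost (USD) as quoted in the post.
tokens_m = {"qwen3.6-27b": 144, "qwen3.5-27b": 98,
            "qwen3.6-35b-a3b": 143, "qwen3.5-35b-a3b": 100,
            "gemma-4-31b": 39}
cost_usd = {"qwen3.6-27b": 659, "qwen3.5-27b": 299,
            "qwen3.6-35b-a3b": 280, "gemma-4-31b": 31}

# Token-usage multipliers.
print(round(ratio(tokens_m["qwen3.6-27b"], tokens_m["qwen3.5-27b"]), 1))      # 1.5
print(round(ratio(tokens_m["qwen3.6-27b"], tokens_m["gemma-4-31b"]), 1))      # 3.7
print(round(ratio(tokens_m["qwen3.6-35b-a3b"], tokens_m["qwen3.5-35b-a3b"]), 1))  # 1.4

# Cost multipliers.
print(round(ratio(cost_usd["qwen3.6-27b"], cost_usd["qwen3.5-27b"]), 1))      # 2.2
print(round(ratio(cost_usd["qwen3.6-27b"], cost_usd["gemma-4-31b"])))         # 21
print(round(ratio(cost_usd["qwen3.6-35b-a3b"], cost_usd["gemma-4-31b"])))     # 9

# Weight storage for a nominal 27e9-parameter model (assumed count).
print(weight_gb(27e9, 16))  # 54.0 GB at BF16
print(weight_gb(27e9, 4))   # 13.5 GB at 4-bit
```

The quoted ratios all reproduce from the rounded inputs. The nominal 27e9 parameters give 54 GB at BF16, slightly under the ~56GB in the post, which would be consistent with an actual parameter count a little above 27B; that reading is an assumption rather than something the post states. The 13.5 GB 4-bit estimate fits the "16GB+ of RAM" claim.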
Artificial Analysis tweet media
English
21
74
596
55.7K
Eduardo Gonzalez reposted
Kilian Lieret
Kilian Lieret@KLieret·
Everyone talks about AGI, but you change the formatting of toolcall outputs a bit and SWE-bench performance drops by 5%
English
19
14
265
22.4K
Eduardo Gonzalez
Eduardo Gonzalez@wm_eddie·
Damn, the thing doesn't even understand uv anymore.
English
0
0
0
58
Eduardo Gonzalez
Eduardo Gonzalez@wm_eddie·
Well opus-4-5 is completely lobotomized now...
⏺ The module-level import sys at the top of files might be causing issues with import ordering. Let me remove them:
English
1
0
0
88
Eduardo Gonzalez
Eduardo Gonzalez@wm_eddie·
@alexgraveley I was thinking the same thing. But the main problem I see is discovery. How will the agent know how to use the different files properly…
English
0
0
0
33
Alex Graveley
Alex Graveley@alexgraveley·
Agents in 2026: Plan9 all the things!
Alex Graveley tweet media
English
3
1
14
2.3K
Eduardo Gonzalez
Eduardo Gonzalez@wm_eddie·
Anthropic is playing a dumb game blocking other clients. If Claude Code worked on my servers I’d use it. But it just crashes on boot. OpenCode just works. And my own harness is way better for long term memories…
English
0
0
0
123
Eduardo Gonzalez
Eduardo Gonzalez@wm_eddie·
This is the most interesting part of the DeepSeek-V3.2 paper IMHO. Very close to something I've been meaning to try for a long time.
Eduardo Gonzalez tweet media
English
0
0
2
239
Eduardo Gonzalez
Eduardo Gonzalez@wm_eddie·
@_m0se_ With ./build/bin/llama-server -m models/qwen3-vl-24b-reap-Q4_K_M.gguf --mmproj models/qwen3-vl-24b-mmproj-bf16.gguf --cpu-moe -c 32768 it just barely runs on an 8GB 2070 Super. This is really nice.
Eduardo Gonzalez tweet media
Japanese
0
0
1
130
OpenMOSE
OpenMOSE@_m0se_·
Qwen3-VL-REAP-24B-A3B-GGUF: I also made a GGUF version. It is an imatrix build. If you make good use of cpu-moe, I think it will fit on an 8GB GPU. huggingface.co/OpenMOSE/Qwen3…
Japanese
1
1
20
785
Eduardo Gonzalez
Eduardo Gonzalez@wm_eddie·
@jzawodn I also ran into this. The latest version works fine. Which is interesting. Wonder what happened there.
English
0
0
0
23
Eduardo Gonzalez
Eduardo Gonzalez@wm_eddie·
@abacaj This is one of the reasons I use SambaNova. They don’t quantize the weights. The difference is huge. They are only superficially equivalent. If only SambaNova supported more models.
English
1
0
4
1.1K
anton
anton@abacaj·
Run gpt-oss-20b on openrouter get 32/100 on benchmark. Run gpt-oss-20b on vllm with h200s get 83/100 on benchmark. What are these providers doing? Deepinfra terrible results
English
54
15
573
65.5K
Eduardo Gonzalez
Eduardo Gonzalez@wm_eddie·
Hmm... I think it may be very important that we do not train models on Asimov's work.
English
0
0
0
113
Eduardo Gonzalez
Eduardo Gonzalez@wm_eddie·
@YouKnowEno I got the silicone sport band for this very reason. Has enough holes that even if it bothers me I can move it back far enough to not touch the MacBook.
English
1
0
1
37
Eno
Eno@YouKnowEno·
how do people work on their MacBooks with a watch on? the sound and feeling of metal scraping metal drives me nuts.
English
11
0
19
1.3K