Johnny Everson

3.9K posts

@johnny_everson

Software Engineer. Trying to find a balance between being a nice guy and not being angry all the time in this broken world. Brazilian (yeah, I need a hug).

Maceió, Brazil · Joined January 2010
458 Following · 266 Followers
Johnny Everson
Johnny Everson@johnny_everson·
@CardilloSamuel Surprisingly, Q6 of the new model performed worse on stevibe's tool testing. It is consistently failing one of the tests, while your original passed even at Q4. Since my main use is tool use, I will revert until I can test this better.
1
0
0
20
Samuel Cardillo
Samuel Cardillo@CardilloSamuel·
guess who released yet another fine tune? me! here is a qlora for Qwen3.5 35b a3b MoE which uses the same dataset and techniques as jackrong/qwopus3.5v3, with slight changes to fit the moe structure. benchmarks hold well. available in q8, q6, q5 and q4. thanks @johnny_everson for the suggestion! huggingface.co/samuelcardillo…
11
7
127
6.9K
Johnny Everson reposted
Henrique Bastos
Henrique Bastos@henriquebastos·
I find it funny watching non-Brazilians frustrated with Anthropic's lack of clarity on what you can and can't do with Claude Max subscriptions. Yes, the information is contradictory. Personal use is fine, but not OpenClaw. Not allowed, but maybe allowed. Allowed, but maybe not. Brazilians laugh at this because most laws in Brazil work exactly like that. There's no point looking for logic. The ambiguity IS the strategy. It lets the authority decide on the fly. Anthropic is almost certainly profiling subscriber usage patterns to block those who go beyond what they consider acceptable for a subsidized subscription price. Welcome to the Brazilian legal experience.
19
18
353
18.4K
Samuel Cardillo
Samuel Cardillo@CardilloSamuel·
@johnny_everson aaaah really? i thought he did an moe version too! okay i will see what i can do tomorrow!
1
0
1
87
Samuel Cardillo
Samuel Cardillo@CardilloSamuel·
so i've (finally) finished my own benchmark to put the newly released google gemma 4 to the test against alibaba's qwen 3.5. just for clarity: when i benchmark models, i benchmark them on real scenarios i have had/have with my own use cases, but also with businesses i have helped set up local infra. i don't use existing benchmarks because i don't trust weights not to be "benchmaxxed" (a technique some research labs use to perform super well on specific tasks and score high, mainly marketing shit). the test was between opus3.5 35b a3b and gemma 4 26b a3b-it, so both moe models, because i care about deploying on the dgx spark.

1. hermes agent - research a company, multiple languages. hermes agent running with camofox locally for browser usage. the test consists of 7 tasks: do a quick research about [x] company ; now translate that in french ; who is [x] person? ; write a file in the folder benchmarks/hermes with the name [model name] ; show me the content of the file ; add a .txt to the file name. qwen3.5 won by a landslide; it did everything perfectly. the company research was insanely thorough, it understood which directory it needed to create the file in, and it even added .txt by itself before i even asked. the main issue: the french translation was flaky, some words had grammar mistakes. gemma 4 wrote a shallow report, missed tons of info, wrote the file in the wrong directory (it went into the hermes agent temp folder), and i had to steer it quite a lot.

2. custom code - single-turn tool calling. i then tested against a little custom code of mine which limits the number of tool calls to 1 maximum. meaning, the models are presented with problems (in this case, 18 different ones) and have to choose the best tool to solve each one, no second chances. here both models performed amazingly: they both succeeded and chose the right things to do.

3. hermes agent - database migration, security incident, ... devops stuff. then i tested with multiple tools, back to hermes agent. both models had to migrate a database, do some full stack deployment, deal with a fake security incident, etc., and the results were pretty interesting. gemma 4 is really good at doing very specific tasks like the database migration or the full stack deployment, but quickly gets stuck in cases that require more thinking and starts behaving badly. qwen3.5 did all the tasks but skipped some. so i would say they both suck at unsupervised long operations, and i would trust gemma 4 more for devops stuff.

4. hermes agent - formatting compliance. the idea is you get a bunch of elements that need to be analyzed by the models, and they need to output a result following the same exact format every time. and they both suck. they did terrible. now the good news is: since they're small models, you can easily train a qlora to teach them the format you want. but that's extra work.

5. opencode - code a website about yourself. little caveat here: this one opposed the moe model to gemma 31b-it dense, not qwen3.5. the prompt was "build a website using whatever framework you want and threejs which explains what's new about gemma 4". simple, vague prompt. the dense model was super slow BUT delivered an extremely cool result, with particle effects that change based on viewport scrolling and all. it's really cool. the moe model was more conservative and just created a simple landing page. they both chose vuejs + typescript. somehow the dense model had trouble understanding it could run npm run dev, while the moe understood directly.

CONCLUSION: i definitely prefer the qwen3.5 moe. it felt more grounded and required way less human steering at every step. it's far from perfect, but for companies using unified memory hardware for personal-assistant kind of stuff - which is the majority of companies hitting me up to help them out - it's clearly the best choice.
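The "1 tool call maximum" setup from point 2 can be sketched as a tiny dispatcher that hard-fails on a second call. Everything here (`TOOLS`, the keyword router, the problem strings) is a hypothetical stand-in for the real 18-problem harness, which isn't public; it only illustrates the no-second-chances mechanic.

```python
# Minimal sketch of a "one tool call maximum" benchmark harness.
# TOOLS and the problems are illustrative stand-ins, not the real test set.

TOOLS = {
    "read_file": lambda arg: f"<contents of {arg}>",
    "web_search": lambda arg: f"<results for {arg}>",
    "run_shell": lambda arg: f"<output of {arg}>",
}

def run_problem(choose_tool, problem, expected_tool):
    """Give the model exactly one shot at picking a tool; no retries."""
    calls = []

    def call_tool(name, arg):
        if calls:  # a second call means instant failure
            raise RuntimeError("tool budget exceeded: only 1 call allowed")
        calls.append(name)
        return TOOLS[name](arg)

    choose_tool(problem, call_tool)
    return calls == [expected_tool]

# A trivial stand-in "model" that routes by keyword, just to exercise the harness.
def keyword_model(problem, call_tool):
    if "file" in problem:
        call_tool("read_file", problem)
    else:
        call_tool("web_search", problem)

print(run_problem(keyword_model, "show me the file config.yaml", "read_file"))  # True
```

A real run would replace `keyword_model` with an LLM call that returns a structured tool choice, and score the model on how many of the single-shot picks match the expected tool.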
Samuel Cardillo tweet media
22
8
152
10.9K
Johnny Everson
Johnny Everson@johnny_everson·
@CardilloSamuel I was talking about qwen3.5 35B opus. Jackrong v3 doesn't have a MoE option. Dense is great, but I was hoping to get MoE speeds.
1
0
0
111
Samuel Cardillo
Samuel Cardillo@CardilloSamuel·
which model, the opus3.5 or the qwen3-coder-next? if you're speaking about the opus3.5, i recommend you use the new jackrong version (qwopus3.5) instead - he made insane improvements to it. also, i use full precision on all of them because the spark has 128 GB unified memory
1
0
0
560
Johnny Everson
Johnny Everson@johnny_everson·
@sudoingX got myself an rtx 5090. What models do you recommend for strong tool calling and speed? My experience with 16GB vram was that I could not have both. System has 128GB ram.
0
0
1
16
Johnny Everson
Johnny Everson@johnny_everson·
@bnjmn_marie Qwopus was the only version that could fit on my 16gb vram and pass @stevibe tool calling test. I tested a dozen variants. Vanilla didn’t pass.
0
0
5
825
stevibe
stevibe@stevibe·
Gemma4 just dropped. How does it handle tool calls? I ran ToolCall-15 across the full Gemma4 family. Gemma4 31b = Qwen3.5 27b: both perfect 15/15. But here's what's wild: Qwen3.5 9b already clears 13/15, while Gemma4 needs 26b to match that.
40
41
454
49.1K
charbob
charbob@Char__Bob·
@johnny_everson LibreChat is neat. A bit heavy but it works. If you're doing chat and not coding, have you tried qwen3.5-35b-a3b? Higher tok/s is not only nice for chat, it also means faster iteration/turnaround on tools. I'm liking qwopus27b v3 on my 3090, but I mainly code
2
0
1
47
Johnny Everson
Johnny Everson@johnny_everson·
I want to run an LLM service that uses tool calling heavily, e.g. web search and url context (finding specific info in a website). I am using Qwopus 27B and tool-calling examples from unsloth. Is this the right way to do it, or should I use a lib or an existing app like open web ui?
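Independent of which library ends up hosting it, the core of a hand-rolled tool-calling service is a small loop: the model emits a structured tool call, the host parses it, runs the tool, and feeds the result back. This sketch assumes a JSON shape of `{"tool": name, "args": {...}}` (an assumption, not a standard), and the `web_search`/`url_context` bodies are hypothetical stubs for real API calls.

```python
import json

# Hypothetical tool implementations; real ones would hit a search API
# or fetch and parse the URL.
def web_search(query: str) -> str:
    return f"top results for: {query}"

def url_context(url: str, question: str) -> str:
    return f"answer to {question!r} extracted from {url}"

TOOLS = {"web_search": web_search, "url_context": url_context}

def handle_model_output(raw: str) -> str:
    """Parse a model's JSON tool call and dispatch it.

    Assumed shape: {"tool": name, "args": {...}}.
    """
    call = json.loads(raw)
    fn = TOOLS.get(call["tool"])
    if fn is None:
        # Feeding the error back lets the model retry with a valid tool.
        return f"error: unknown tool {call['tool']!r}"
    return fn(**call["args"])

reply = handle_model_output(
    '{"tool": "web_search", "args": {"query": "qwopus 27b tool calling"}}'
)
print(reply)  # top results for: qwopus 27b tool calling
```

Apps like Open WebUI or LibreChat wrap exactly this loop plus chat history and a UI, so the choice is mostly about how much of that surrounding plumbing you want to own.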
2
0
0
271
Johnny Everson
Johnny Everson@johnny_everson·
@Char__Bob MoE models don't seem to do well in tool calling, at least not from what I have seen so far. I tested with @stevibe's tool and only dense models passed it.
0
0
1
13
Ettore Di Giacinto
Ettore Di Giacinto@mudler_it·
Next APEX releases on my list (not necessarily in this order): - Qwen3-Coder-30B-A3B 👈 benchmarks done, quantizing and uploading now ( will be available at huggingface.co/mudler/Qwen3-C… ) - Qwen3.5-122B-A10B 🔃 Starting - arcee-ai/Trinity-Large-Thinking - MiniMaxAI/MiniMax-M2.5 - Hcompany/Holo3-35B-A3B - nvidia/Nemotron-Cascade-2-30B-A3B - ... and what next? You will be able to find all APEX quants in my collection on @huggingface here: huggingface.co/collections/mu… !
10
7
68
3.6K
Johnny Everson
Johnny Everson@johnny_everson·
@stevibe Got another model to pass all green on my 16GB gpu: mradermacher/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic-v2-i1-GGUF at i1-IQ3_S, at 27 t/s.
0
0
0
60
Johnny Everson
Johnny Everson@johnny_everson·
@stevibe I tested a dozen qwen3.5 9B variants, none were all green. The fastest model I could run on my 16GB gpu that passes everything was: samuelcardillo/.../Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-Q4_K_M.gguf
0
0
0
62
stevibe
stevibe@stevibe·
Qwen3.5-27B went 15/15 on our tool-calling benchmark. But which quant should you actually run? Tested Unsloth's Q2_K_XL all the way to Q8_K_XL. TL;DR: Q8 — 15/15 ✅ Q6 — 15/15 ✅ Q5 — 14/15 Q4 — 14/15 Q3 — 14/15 Q2 — 13/15. Q6 is the sweet spot: same perfect score as Q8, smaller footprint. Also, the results scale almost linearly; seems like ToolCall-15 is actually measuring something real.
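The "sweet spot" takeaway (smallest quant that still matches the Q8 score) is mechanical enough to automate once you have the per-quant scores. The scores below are the ones from the thread; the helper itself is a hypothetical convenience, not part of ToolCall-15.

```python
# Per-quant ToolCall-15 scores from the thread, ordered largest to smallest quant.
SCORES = {"Q8": 15, "Q6": 15, "Q5": 14, "Q4": 14, "Q3": 14, "Q2": 13}

def sweet_spot(scores: dict, baseline: str = "Q8") -> str:
    """Return the smallest quant whose score matches the baseline quant.

    Assumes the dict is ordered from largest to smallest quant
    (Python dicts preserve insertion order).
    """
    target = scores[baseline]
    best = baseline
    for quant, score in scores.items():
        if score == target:
            best = quant  # later entries are smaller quants
    return best

print(sweet_spot(SCORES))  # Q6
```

With the thread's numbers this picks Q6, matching stevibe's conclusion; feeding it your own benchmark sweep gives the analogous pick for any model.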
52
77
906
60.7K
Johnny Everson reposted
Guri Singh
Guri Singh@heygurisingh·
Humans: 100% Gemini 3.1 Pro: 0.37% GPT 5.4: 0.26% Opus 4.6: 0.25% Grok-4.20: 0.00% François Chollet just released ARC-AGI-3 -- the hardest AI test ever created. 135 novel game environments. No instructions. No rules. No goals given. Figure it out or fail. Untrained humans solved every single one. Every frontier AI model scored below 1%. Each environment was handcrafted by game designers. The AI gets dropped in and has to explore, discover what winning looks like, and adapt in real time. The scoring punishes brute force. If a human needs 10 actions and the AI needs 100, the AI doesn't get 10%. It gets 1%. You can't throw more compute at this. For context: ARC-AGI-1 is basically solved. Gemini scores 98% on it. ARC-AGI-2 went from 3% to 77% in under a year. Labs spent millions training on earlier versions. ARC-AGI-3 resets the entire scoreboard to near zero. The benchmark launched live at Y Combinator with a fireside between Chollet and Sam Altman. $2M in prizes on Kaggle. All winning solutions must be open-sourced. Scaling alone will not close this gap. We are nowhere near AGI. (Link in the comments)
Guri Singh tweet media
318
1.1K
6.4K
1.3M
Gergely Orosz
Gergely Orosz@GergelyOrosz·
Devs who can code WITHOUT AI as well are about to become 10x more valuable. They are the ones who won't panic or sit idle when their Claude quota runs out… So much for all the advice about how learning to code is not worth it any more…
Thariq@trq212

To manage growing demand for Claude we're adjusting our 5 hour session limits for free/Pro/Max subs during peak hours. Your weekly limits remain unchanged. During weekdays between 5am–11am PT / 1pm–7pm GMT, you'll move through your 5-hour session limits faster than before.

155
141
2K
277.3K
Johnny Everson reposted
Eric Alper 🎧
Eric Alper 🎧@ThatEricAlper·
Eric Alper 🎧 tweet media
249
860
12.8K
174.1K
Johnny Everson reposted
trish
trish@_trish_xD·
i used to roll my eyes whenever senior devs said "just use the standard library." i was wrong. they were right. so much third-party stuff is genuinely unnecessary.
57
30
888
163.4K
Johnny Everson
Johnny Everson@johnny_everson·
@0xSero Noobie question: can we distill only the coding/tool-calling/reasoning experts from a large model to make a small/medium model that's very specialized for agentic coding?
0
0
0
142