Johnny Everson

3.9K posts

@johnny_everson

Software Engineer. Trying to find a balance between being a nice guy and not being angry all the time in this broken world. Brazilian (yeah, I need a hug).

Maceió, Brazil · Joined January 2010
458 Following · 266 Followers
Johnny Everson
Johnny Everson@johnny_everson·
@CardilloSamuel Surprisingly, Q6 of the new model performed worse on stevibe's tool testing. It is consistently failing one of the tests, while your original passed even at Q4. Since my main use is tool use, I will revert until I can test this better.
1
0
0
20
Samuel Cardillo
Samuel Cardillo@CardilloSamuel·
guess who released yet another fine tune? me! here is a qlora for Qwen3.5 35b a3b MoE which uses the same dataset and techniques as jackrong/qwopus3.5v3, with slight changes to fit the moe structure. benchmarks hold well. available in q8, q6, q5 and q4. thanks @johnny_everson for the suggestion! huggingface.co/samuelcardillo…
11
7
127
6.9K
Johnny Everson reposted
Henrique Bastos
Henrique Bastos@henriquebastos·
I find it funny watching non-Brazilians frustrated with Anthropic's lack of clarity on what you can and can't do with Claude Max subscriptions. Yes, the information is contradictory. Personal use is fine, but not OpenClaw. Not allowed, but maybe allowed. Allowed, but maybe not. Brazilians laugh at this because most laws in Brazil work exactly like that. There's no point looking for logic. The ambiguity IS the strategy. It lets the authority decide on the fly. Anthropic is almost certainly profiling subscriber usage patterns to block those who go beyond what they consider acceptable for a subsidized subscription price. Welcome to the Brazilian legal experience.
19
18
353
18.4K
Samuel Cardillo
Samuel Cardillo@CardilloSamuel·
@johnny_everson aaaah really? i thought he did an moe version too! okay i will see what i can do tomorrow!
1
0
1
87
Samuel Cardillo
Samuel Cardillo@CardilloSamuel·
so i've (finally) finished my own benchmark to put the newly released google gemma 4 to the test against alibaba's qwen 3.5. just for clarity: when i benchmark models, i benchmark them on real scenarios i have had/have with my own use cases, but also with businesses i have helped set up local infra. i don't use existing benchmarks because i don't trust weights not to be "benchmaxxed" (a technique some research labs use to perform super well on specific tasks and score high, mainly marketing shit). the test was between opus3.5 35b a3b and gemma 4 26b a3b-it, so both moe models, because i care about deploying on the dgx spark.

1. hermes agent - research a company, multiple languages. hermes agent running with camofox locally for browser usage. the test consists of 7 tasks: do a quick research about [x] company ; now translate that in french ; who is [x] person? ; write a file in the folder benchmarks/hermes with the name [model name] ; show me the content of the file ; add a .txt to the file name. qwen3.5 won by a landslide; it did everything perfectly. the company research was insanely thorough, it understood which directory it needed to create the file in, and it even added .txt by itself before i even asked. the main issue: the french translation was flaky, some words had grammar mistakes. gemma 4 wrote a shallow report, missed tons of info, wrote the file in the wrong directory (it went into the hermes agent temp folder), and i had to steer it quite a lot.

2. custom code - single-turn tool calling. i then tested against a little custom code of mine which limits the number of tool calls to 1 maximum. meaning, the models are presented with problems (in this case, 18 different ones) and have to choose the best tool to solve each one, no second chances. here both models performed amazingly: they both succeeded and chose the right things to do.

3. hermes agent - database migration, security incident, ... devops stuff. then i tested with multiple tools, back to hermes agent. both models had to migrate a database, do some full stack deployment, deal with a fake security incident, etc., and the results were pretty interesting. gemma 4 is really good at doing very specific tasks like the database migration or the full stack deployment, but quickly gets stuck in cases that require more thinking and starts behaving badly. qwen3.5 did all the tasks but skipped some. so i would say they both suck at unsupervised long operations, and i would trust gemma 4 more for devops stuff.

4. hermes agent - formatting compliance. the idea is you get a bunch of elements that need to be analyzed by the models, and they need to output a result following the same exact format every time. and they both suck. they did terrible. now the good news is: since they're small models, you can easily train a qlora to teach them the format you want. but that's extra work.

5. opencode - code a website about yourself. little caveat here: this one opposed the moe model to gemma 31b-it dense, not qwen3.5. the prompt was "build a website using whatever framework you want and threejs which explains what's new about gemma 4". simple, vague prompt. the dense model was super slow BUT delivered an extremely cool result, with particle effects that change based on viewport scrolling and all. it's really cool. the moe model was more conservative and just created a simple landing page. they both chose vuejs + typescript. somehow the dense model had trouble understanding it could run npm run dev, while the moe understood directly.

CONCLUSION: i definitely prefer the qwen3.5 moe. it felt more grounded and required way less human steering at every step. it's far from perfect, but for companies using unified memory hardware for personal-assistant kind of stuff - which is the majority of companies hitting me up to help them out - it's clearly the best choice.
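The "1 tool call maximum" setup from point 2 can be sketched as a tiny dispatcher that hard-fails on a second call. Everything here (`TOOLS`, the keyword router, the problem strings) is a hypothetical stand-in for the real 18-problem harness, which isn't public; it only illustrates the no-second-chances mechanic.

```python
# Minimal sketch of a "one tool call maximum" benchmark harness.
# TOOLS and the problems are illustrative stand-ins, not the real test set.

TOOLS = {
    "read_file": lambda arg: f"<contents of {arg}>",
    "web_search": lambda arg: f"<results for {arg}>",
    "run_shell": lambda arg: f"<output of {arg}>",
}

def run_problem(choose_tool, problem, expected_tool):
    """Give the model exactly one shot at picking a tool; no retries."""
    calls = []

    def call_tool(name, arg):
        if calls:  # a second call means instant failure
            raise RuntimeError("tool budget exceeded: only 1 call allowed")
        calls.append(name)
        return TOOLS[name](arg)

    choose_tool(problem, call_tool)
    return calls == [expected_tool]

# A trivial stand-in "model" that routes by keyword, just to exercise the harness.
def keyword_model(problem, call_tool):
    if "file" in problem:
        call_tool("read_file", problem)
    else:
        call_tool("web_search", problem)

print(run_problem(keyword_model, "show me the file config.yaml", "read_file"))  # True
```

A real run would replace `keyword_model` with an LLM call that returns a structured tool choice, and score the model on how many of the single-shot picks match the expected tool.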
Samuel Cardillo tweet media
22
8
152
10.9K
Johnny Everson
Johnny Everson@johnny_everson·
@CardilloSamuel I was talking about qwen3.5 35B opus. Jackrong v3 doesn't have a MoE option. Dense is great, but I was hoping to get MoE speeds.
1
0
0
111
Samuel Cardillo
Samuel Cardillo@CardilloSamuel·
which model, the opus3.5 or the qwen3-coder-next? if you're speaking about the opus3.5, i recommend you use the new jackrong version (qwopus3.5) instead - he made insane improvements to it. also, i use full precision on all of them because the spark has 128 GB unified memory
1
0
0
560
Johnny Everson
Johnny Everson@johnny_everson·
@sudoingX got myself an rtx 5090. What models do you recommend for strong tool calling and speed? My experience with 16GB vram was that I could not have both. System has 128GB ram.
0
0
1
16
Johnny Everson
Johnny Everson@johnny_everson·
@bnjmn_marie Qwopus was the only version that could fit on my 16gb vram and pass @stevibe tool calling test. I tested a dozen variants. Vanilla didn’t pass.
0
0
5
825
stevibe
stevibe@stevibe·
Gemma4 just dropped. How does it handle tool calls? I ran ToolCall-15 across the full Gemma4 family. Gemma4 31b = Qwen3.5 27b: both perfect 15/15. But here's what's wild: Qwen3.5 9b already clears 13/15, while Gemma4 needs 26b to match that.
40
41
454
49.1K
charbob
charbob@Char__Bob·
@johnny_everson LibreChat is neat. A bit heavy but it works. If you're doing chat and not coding, have you tried qwen3.5-35b-a3b? Higher tok/s is not only nice for chat, it also means faster iteration/turnaround on tools. I'm liking qwopus27b v3 on my 3090, but I mainly code
2
0
1
47
Johnny Everson
Johnny Everson@johnny_everson·
I want to run an LLM service that uses tool calling heavily, e.g. web search and url context (finding specific info in a website). I am using Qwopus 27B and tool-calling examples from unsloth. Is this the right way to do it, or should I use a lib or an existing app like open web ui?
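Independent of which library ends up hosting it, the core of a hand-rolled tool-calling service is a small loop: the model emits a structured tool call, the host parses it, runs the tool, and feeds the result back. This sketch assumes a JSON shape of `{"tool": name, "args": {...}}` (an assumption, not a standard), and the `web_search`/`url_context` bodies are hypothetical stubs for real API calls.

```python
import json

# Hypothetical tool implementations; real ones would hit a search API
# or fetch and parse the URL.
def web_search(query: str) -> str:
    return f"top results for: {query}"

def url_context(url: str, question: str) -> str:
    return f"answer to {question!r} extracted from {url}"

TOOLS = {"web_search": web_search, "url_context": url_context}

def handle_model_output(raw: str) -> str:
    """Parse a model's JSON tool call and dispatch it.

    Assumed shape: {"tool": name, "args": {...}}.
    """
    call = json.loads(raw)
    fn = TOOLS.get(call["tool"])
    if fn is None:
        # Feeding the error back lets the model retry with a valid tool.
        return f"error: unknown tool {call['tool']!r}"
    return fn(**call["args"])

reply = handle_model_output(
    '{"tool": "web_search", "args": {"query": "qwopus 27b tool calling"}}'
)
print(reply)  # top results for: qwopus 27b tool calling
```

Apps like Open WebUI or LibreChat wrap exactly this loop plus chat history and a UI, so the choice is mostly about how much of that surrounding plumbing you want to own.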
2
0
0
271
Johnny Everson
Johnny Everson@johnny_everson·
@Char__Bob MoE models don't seem to do well in tool calling, at least not from what I have seen so far. I tested with @stevibe's tool and only dense models passed it.
0
0
1
13
Ettore Di Giacinto
Ettore Di Giacinto@mudler_it·
Next APEX releases on my list (not necessarily in this order): - Qwen3-Coder-30B-A3B 👈 benchmarks done, quantizing and uploading now ( will be available at huggingface.co/mudler/Qwen3-C… ) - Qwen3.5-122B-A10B 🔃 Starting - arcee-ai/Trinity-Large-Thinking - MiniMaxAI/MiniMax-M2.5 - Hcompany/Holo3-35B-A3B - nvidia/Nemotron-Cascade-2-30B-A3B - ... and what next? You will be able to find all APEX quants in my collection on @huggingface here: huggingface.co/collections/mu… !
10
7
68
3.6K
Johnny Everson
Johnny Everson@johnny_everson·
@stevibe Got another model to pass all green on my 16GB gpu: mradermacher/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic-v2-i1-GGUF at i1-IQ3_S, at 27 t/s.
0
0
0
60
Johnny Everson
Johnny Everson@johnny_everson·
@stevibe I tested a dozen qwen3.5 9B variants, none were all green. The fastest model I could run on my 16GB gpu that passes everything was: samuelcardillo/.../Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-Q4_K_M.gguf
0
0
0
62
stevibe
stevibe@stevibe·
Qwen3.5-27B went 15/15 on our tool-calling benchmark. But which quant should you actually run? Tested Unsloth's Q2_K_XL all the way to Q8_K_XL. TL;DR: Q8 — 15/15 ✅ Q6 — 15/15 ✅ Q5 — 14/15 Q4 — 14/15 Q3 — 14/15 Q2 — 13/15. Q6 is the sweet spot: same perfect score as Q8, smaller footprint. Also, the results scale almost linearly; seems like ToolCall-15 is actually measuring something real.
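The "sweet spot" takeaway (smallest quant that still matches the Q8 score) is mechanical enough to automate once you have the per-quant scores. The scores below are the ones from the thread; the helper itself is a hypothetical convenience, not part of ToolCall-15.

```python
# Per-quant ToolCall-15 scores from the thread, ordered largest to smallest quant.
SCORES = {"Q8": 15, "Q6": 15, "Q5": 14, "Q4": 14, "Q3": 14, "Q2": 13}

def sweet_spot(scores: dict, baseline: str = "Q8") -> str:
    """Return the smallest quant whose score matches the baseline quant.

    Assumes the dict is ordered from largest to smallest quant
    (Python dicts preserve insertion order).
    """
    target = scores[baseline]
    best = baseline
    for quant, score in scores.items():
        if score == target:
            best = quant  # later entries are smaller quants
    return best

print(sweet_spot(SCORES))  # Q6
```

With the thread's numbers this picks Q6, matching stevibe's conclusion; feeding it your own benchmark sweep gives the analogous pick for any model.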
52
77
906
60.7K
Johnny Everson reposted
Guri Singh
Guri Singh@heygurisingh·
Humans: 100% Gemini 3.1 Pro: 0.37% GPT 5.4: 0.26% Opus 4.6: 0.25% Grok-4.20: 0.00% François Chollet just released ARC-AGI-3 -- the hardest AI test ever created. 135 novel game environments. No instructions. No rules. No goals given. Figure it out or fail. Untrained humans solved every single one. Every frontier AI model scored below 1%. Each environment was handcrafted by game designers. The AI gets dropped in and has to explore, discover what winning looks like, and adapt in real time. The scoring punishes brute force. If a human needs 10 actions and the AI needs 100, the AI doesn't get 10%. It gets 1%. You can't throw more compute at this. For context: ARC-AGI-1 is basically solved. Gemini scores 98% on it. ARC-AGI-2 went from 3% to 77% in under a year. Labs spent millions training on earlier versions. ARC-AGI-3 resets the entire scoreboard to near zero. The benchmark launched live at Y Combinator with a fireside between Chollet and Sam Altman. $2M in prizes on Kaggle. All winning solutions must be open-sourced. Scaling alone will not close this gap. We are nowhere near AGI. (Link in the comments)
Guri Singh tweet media
318
1.1K
6.4K
1.3M
Gergely Orosz
Gergely Orosz@GergelyOrosz·
Devs who can code WITHOUT AI as well are about to become 10x more valuable. They are the ones who won't panic or sit idle when their Claude quota runs out… So much for all the advice about how learning to code is not worth it any more…
Thariq@trq212

To manage growing demand for Claude we're adjusting our 5 hour session limits for free/Pro/Max subs during peak hours. Your weekly limits remain unchanged. During weekdays between 5am–11am PT / 1pm–7pm GMT, you'll move through your 5-hour session limits faster than before.

155
141
2K
277.3K
Johnny Everson reposted
Eric Alper 🎧
Eric Alper 🎧@ThatEricAlper·
Eric Alper 🎧 tweet media
249
860
12.8K
174.1K
Johnny Everson reposted
trish
trish@_trish_xD·
i used to roll my eyes whenever senior devs said "just use the standard library." i was wrong. they were right. so much third-party stuff is genuinely unnecessary.
57
30
888
163.4K
Johnny Everson
Johnny Everson@johnny_everson·
@0xSero Noobie question: can we distill only the coding/tool-calling/reasoning experts from a large model to make a small/medium model that's very specialized for agentic coding?
0
0
0
142