charbob

146 posts

@Char__Bob

applied math, ecom, indie rock

Joined April 2021

264 Following · 25 Followers

Pinned Tweet

charbob@Char__Bob·11 Jan

Doing an Avatar but for Becoming Chinese

charbob@Char__Bob·3d

@prayag_sonar ~1k prefill ~35 decode

prayag sonar@prayag_sonar·3d

@Char__Bob How many tokens per second are you getting?

prayag sonar@prayag_sonar·4d

Has anyone used Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled for agents locally? How did it fare?
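
To put the throughput figures from the reply in this thread in concrete terms, here is a rough sketch of the arithmetic. It assumes "~1k prefill ~35 decode" means roughly 1000 and 35 tokens per second and that the two phases simply add up; the function name and the example prompt and reply sizes are hypothetical, not measurements from the thread.

```python
def estimate_latency_s(prompt_tokens, output_tokens,
                       prefill_tps=1000.0, decode_tps=35.0):
    """Rough wall-clock estimate: prompt processing time plus generation time.

    The default rates are the approximate figures quoted in the reply;
    treating the two phases as strictly additive is an assumption.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughput must be positive")
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical agent turn: 8000-token prompt, 500-token reply.
print(f"{estimate_latency_s(8000, 500):.1f} s")
```

With long tool-heavy prompts, prefill speed sets the floor on each turn, while decode speed dominates once replies get long.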

charbob@Char__Bob·3d

@Hangsiin What use-cases? I was running tool-use-heavy research today with a few MCPs and it was extraordinarily lazy. And it wouldn’t believe me no matter the prompt about the date (it kept thinking it was 2024/25). 120 tok/s was nice but not worth the lobotomy vs Qwen3.5 27b

NomoreID@Hangsiin·4d

I’m really impressed with Gemma 4. (26B-A4B-it, Q4_K_M) What stands out most is that it doesn’t feel awkward in Korean at all. It feels like a genuinely solid, well-built model. I had never felt this from a model of this size before. The gap compared with Gemma 3 also feels really significant. I’m still testing it, but for a few use cases, I’m starting to feel that moving to a local setup would be worth it. This is a really impressive release!

charbob@Char__Bob·3d

@sudoingX @arlanr Having built actual production systems for enterprise, I think there’s a place for RAG to seamlessly inform basic queries behind the scenes AND curate an agent-searchable wiki-style KB for research

Sudo su@sudoingX·3d

@arlanr the people who said RAG is dead never built a system that needed to remember anything.

Arlan@arlanr·3d

RAG has been successfully killed!

charbob@Char__Bob·3d

@Snixtp What models are you hoping to run? I’ve been enjoying my 3090 a lot. But I do feel the itch to run something meatier. Qwen3.5-122b seems like a Q4-6 would be great. But it seems like there’s a bit of a gap until 397b?

Espen JD@Snixtp·4d

Proud owner of an RTX Pro 6000. Best investment ever made

Espen JD@Snixtp

Buy a GPU. RTX Pro 6000 is mine

charbob@Char__Bob·3d

@huydq179 Welcome! Would love details on your setup. Btw I love Hanoi—I visited a few years ago and think of it often

Do Quang Huy@huydq179·4d

Nice to meet all you guys!

charbob@Char__Bob·3d

@grzracz Which Qwen were you using before? Q4 Gemma 26b was extraordinarily lazy for me today. Not impressed. Personally, the increased tok/s (~120 decode) is not worth the decreased intelligence + laziness compared to vanilla Qwen 27b/Qwopus

grzracz@grzracz·4d

Gemma-4-26b-a4b Q4 on LM Studio (M2 Max) running at 60 tok/s and no loops! Much higher actual quality of reasoning too. Seems it will replace my Qwen.

charbob@Char__Bob·3d

@zw0404 @eleven_32 Curious to hear what you think. I was not impressed with 26b today, will try 31b tomorrow.

wayne zhang@zw0404·3d

@eleven_32 I'm downloading it now

wayne zhang@zw0404·4d

anybody run gemma 4

charbob@Char__Bob·3d

@johnny_everson LibreChat is neat. A bit heavy but it works. If you’re doing chat and not coding, have you tried qwen3.5-35b-a3b? Higher tok/s is not only nice for chat, but also means faster iteration/turnaround on tools. I am liking qwopus27b v3 on my 3090, but I mainly code

Johnny Everson@johnny_everson·4d

To run a LLM service that uses tool calling heavily, e.g. web search, url context (find specific info in a website). I am using Qwopus 27B and tool calling examples from unsloth. Is this the right way to do it or should I use lib or existing app, like open web ui?

charbob@Char__Bob·3d

@LottoLabs Excited to try 31b tomorrow

charbob@Char__Bob·3d

@LottoLabs Played with qwopus-27b v3 and Gemma 26b today (Q4). Qwopus was great, a small but meaningful improvement over vanilla 27b in opencode & LibreChat for tool-heavy research and coding. Gemma was total ass. Very lazy model. I could not convince it that it’s 2026.
Llama.cpp + RTX 3090

Lotto@LottoLabs·3d

Maybe Gemma 26b doesn’t suck?

charbob@Char__Bob·30 Mar

@no_stp_on_snek Results: Would be fun to test on your new variable-quant setup as well. Is there a stable PR/flag to try?

charbob@Char__Bob·30 Mar

@no_stp_on_snek I’ll post compression results for q3.5-27b (also q4) and mistral-24b q4 today

Tom Turney@no_stp_on_snek·30 Mar

New TurboQuant result: not all V layers are created equal.
TL;DR: turbo2 compression, turbo3 quality, 15 lines of layer policy
Boundary V: keep K at q8_0, protect the first 2 and last 2 V layers with full precision, compress everything in the middle at turbo2. 15 lines of code.
Tested on 4 models across Metal. Beats uniform turbo2-V every time. Holds at 8K context. NIAH retrieval still works.
The insight: boundary layers handle the input and output transformations. Mess with their V precision and you pay for it everywhere downstream. Leave the middle layers alone and they barely notice.
Writeup with all the numbers: github.com/TheTom/turboqu…
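
The layer rule described in this tweet can be sketched as a small function. This is only an illustration of the stated policy, not the author's 15 lines: the function name, the `n_protected` parameter, and the `"full"` label standing in for whatever full-precision cache type the real tree uses are all hypothetical.

```python
def boundary_v_policy(n_layers, n_protected=2):
    """Return one (k_type, v_type) pair per layer.

    K stays at q8_0 everywhere. V is kept at full precision in the first
    and last `n_protected` layers and compressed to turbo2 in between.
    The type names are plain labels here, not real cache-type identifiers.
    """
    policy = []
    for layer in range(n_layers):
        is_boundary = layer < n_protected or layer >= n_layers - n_protected
        v_type = "full" if is_boundary else "turbo2"
        policy.append(("q8_0", v_type))
    return policy


# Hypothetical 8-layer model: layers 0-1 and 6-7 keep full-precision V.
for layer, (k_type, v_type) in enumerate(boundary_v_policy(8)):
    print(layer, k_type, v_type)
```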

charbob@Char__Bob·29 Mar

@no_stp_on_snek Will do - feel free to @ me next time you need more 3090 benchs. Thanks for all you’re doing!

Tom Turney@no_stp_on_snek·29 Mar

@Char__Bob really appreciate you running these 🙏 i don’t have a CUDA box handy to repro, but i’ll loop in some folks on my side to dig into it. if you’re able to open an issue with your setup + commands + logs, that would help a ton in the meantime

Tom Turney@no_stp_on_snek·29 Mar

turbo2 is now on metal. 2-bit kv cache, 6.4x compression. the full turbo family is complete
development order was 3, 4, 2 for no reason whatsoever
ppl results (qwen 35b moe, m5 max):
- turbo4 (4-bit): 6.125, +0.23% vs q8_0
- turbo3 (3-bit): 6.176, +1.06%
- turbo2 (2-bit): 6.507, +6.48%
turbo2 uniform is rough on quality but the real use is asymmetric: turbo2 keys + turbo3 values. keys tolerate more compression than values. buun's cuda data shows that combo at +3.88% ppl ... way better than uniform turbo2. i'll need to test that soon.
166 lines of metal shader. no cuda changes, no turbo3/turbo4 code touched. purely additive. codex reviewed, build clean, ppl verified
-ctk turbo2 -ctv turbo2 if you're feeling dangerous
-ctk turbo2 -ctv turbo3 if you want the sweet spot
Still haven't cracked decode speed issues with sub m5 chips...
github.com/TheTom/turboqu…
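
As a quick consistency check on the perplexity figures in this tweet, each PPL and its percentage increase can be inverted to recover the q8_0 baseline it implies. The tweet does not state the baseline, so the value below is derived, not quoted; the function and variable names are hypothetical.

```python
# (perplexity, percent increase vs q8_0) as listed in the tweet.
results = {
    "turbo4": (6.125, 0.23),
    "turbo3": (6.176, 1.06),
    "turbo2": (6.507, 6.48),
}


def implied_baseline(ppl, pct_increase):
    """Invert ppl = baseline * (1 + pct/100) to recover the baseline."""
    return ppl / (1.0 + pct_increase / 100.0)


baselines = {name: implied_baseline(ppl, pct)
             for name, (ppl, pct) in results.items()}
for name, value in baselines.items():
    print(f"{name}: implied q8_0 PPL ~ {value:.3f}")
```

All three rows point to the same baseline of roughly 6.111, so the listed percentages are internally consistent.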

charbob@Char__Bob·29 Mar

@no_stp_on_snek turbo4/4: CRASH (SET_ROWS not CUDA-ported)
mixed configs (turbo3k/turbo2v and turbo2k/turbo3v): decode fine (~88 t/s) but prefill ~11.5x slower than baseline, looks like a bug
KV savings scale linearly: at 131k ctx turbo3/3 saves ~1.25 GB and turbo2/2 saves ~1.5 GB vs q8_0

charbob@Char__Bob·29 Mar

@no_stp_on_snek RTX 3090 24GB benchmarks on your tree (Qwen3.5-9B Q4_K_M, n_ctx=2048):
turbo3/3: PPL 8.31, decode 99.79 t/s, prefill 3727 t/s, 215 MiB KV (-8.5% vs baseline)
turbo2/2: PPL 8.66, decode 100.73 t/s, prefill 3702 t/s, 211 MiB KV (-10.2% vs baseline)
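
The 131k-context savings quoted in the follow-up reply can be reproduced from these n_ctx=2048 numbers. The sketch below assumes that "131k" means 131072 tokens, that the saving grows in direct proportion to context length (as the reply states), and that the quoted GB figures are effectively GiB; the function name is hypothetical.

```python
MIB_PER_GIB = 1024


def kv_savings_gib(kv_mib, pct_saved, n_ctx_measured=2048, n_ctx_target=131072):
    """Extrapolate KV-cache savings from a short-context measurement.

    kv_mib is the compressed KV size and pct_saved the reduction vs the
    baseline, both measured at n_ctx_measured. Scaling is assumed linear.
    """
    baseline_mib = kv_mib / (1.0 - pct_saved / 100.0)
    saved_mib = baseline_mib - kv_mib
    return saved_mib * (n_ctx_target / n_ctx_measured) / MIB_PER_GIB


print(f"turbo3/3: ~{kv_savings_gib(215, 8.5):.2f} GiB saved at 131072 ctx")
print(f"turbo2/2: ~{kv_savings_gib(211, 10.2):.2f} GiB saved at 131072 ctx")
```

Both rows imply the same baseline of about 235 MiB at n_ctx=2048, and the extrapolated savings land on the ~1.25 and ~1.5 figures quoted in the reply.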

charbob@Char__Bob·5 Feb

RRR is Little Women for men

charbob@Char__Bob·28 Jan

“The Passion According to G.H. for men isn’t real, it can’t hurt you” The Passion According to G.H. for men:

Sandy Petersen 🪔@SandyofCthulhu

A cautionary tale. There is a small land-dwelling crustacean known variously as a pillbug, roly-poly, woodlouse, or potato bug. One day, I decided to eat one to see how it was. It turns out that pillbugs store their wastes as ammonia inside their body, so they taste like stale pee. Therefore, do not crunch them up in your mouth. Determined, I tried again. This time my plan was to swallow the pillbug whole, so as to avoid the ammonia. I picked it up, and it rolled into a ball - "Nice" thought I. "Convenient for swallowing." Halfway down my throat, the pillbug unrolled. Now pillbugs are designed to push their way through dense litter and detritus. They have a rounded, smooth exterior, and multiple legs. Also they can tell "up" from "down". So the pillbug decided to walk back up my throat. Every single step. Turns out pillbug determination is WAY more powerful than a human's puny peristalsis, so I felt that pillbug move its way slowly up my esophagus minute by minute, until it emerged into my mouth. Feeling it had earned its freedom, I spat it out into the garden intact. There. Now I have told the story and eaten a pillbug, so you don't have to. A public service from Sandy Petersen.

charbob@Char__Bob·20 Jan

@adrgrondin @liquidai @LocallyAIApp Been running @LocallyAIApp on my iPhone + iPad for a min. So sick. Is MCP/tool use for web search, etc, on your roadmap?

Adrien Grondin@adrgrondin·19 Jan

Quick demo of LFM 2.5 VL 1.6B model by @liquidai that I recently added to @LocallyAIApp
Running locally on iPhone 17 Pro at ~90tk/s with MLX
Small vision-language models are improving fast
