zq_dev

30 posts

@ZQ_Dev

Open Source AI

California, USA · Joined May 2014
194 Following · 27 Followers
Red Hat AI
Red Hat AI@RedHat_AI·
What compression looks like on @vllm_project. Same Gemma 4 31B. Red Hat AI's quantized version runs at nearly 2x tokens/sec, half the memory, 99%+ accuracy retained. Open source. Quantized with LLM Compressor. Links in comments. 🙏 @_soyr_ for the 2-minute demo.
English
8
42
445
31.9K
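For anyone who wants to try this kind of compression themselves, below is a rough sketch of a one-shot pass with LLM Compressor. The model id, FP8 scheme, and output path are placeholders (not the recipe used for the release above), and the oneshot import path may differ across llmcompressor versions, so check the project docs before running.

# Hypothetical sketch: one-shot quantization with LLM Compressor.
# Model id, scheme, and output path are placeholders, not the exact
# recipe behind the quantized Gemma release mentioned above.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # import path may vary by llmcompressor version

recipe = QuantizationModifier(
    targets="Linear",         # quantize all linear layers
    scheme="FP8_DYNAMIC",     # FP8 weights with dynamic activation scales
    ignore=["lm_head"],       # keep the output head in higher precision
)

oneshot(
    model="<hf-id-or-local-path-of-the-model>",  # placeholder
    recipe=recipe,
    output_dir="model-fp8-dynamic",              # resulting checkpoint can be served with vLLM
)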
zq_dev
zq_dev@ZQ_Dev·
@GoogleDeepMind @sudoingX excited to see how the 31B dense does vs. Qwen3.5-27B on your "build a space shooter" test from last month...
English
0
0
1
1.5K
Google DeepMind
Google DeepMind@GoogleDeepMind·
Meet Gemma 4: our new family of open models you can run on your own hardware. They're built for advanced reasoning and agentic workflows, and we're releasing them under an Apache 2.0 license. Here's what's new 🧵
GIF
English
371
1.2K
8.8K
3.8M
Google Research
Google Research@GoogleResearch·
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
GIF
English
1K
5.8K
39K
19.2M
zq_dev
zq_dev@ZQ_Dev·
@sudoingX Genuine curiosity - where does hermes-agent sit in the market for you? Is it meant to be a direct competitor to claude code, opencode, etc?
English
0
0
0
163
Sudo su
Sudo su@sudoingX·
i keep coming back to hermes agent. woke up and opened it before anything else. not because i have to test it. because i want to use it. the UX is what does it. the ASCII skull splash. color coded tool calls with execution times. the emoji phase spinner while thinking. dark theme that doesn't burn your eyes at 2am. every detail feels intentional. most agent frameworks feel like dev tools. this one feels like someone built it for themselves first and shipped it second. still running Qwen 3.5 27B dense on a single 3090. 29-35 tok/s. 262K context. fully local. the stack just works.
Sudo su tweet media
Sudo su@sudoingX

been playing with hermes agent paired with qwen 3.5 dense 27B on my single 3090 since last night. there is something about this harness that caught me and i think i know what it is. i've now run five qwen configs on consumer hardware:
35B MoE (3B active) -- 112 tok/s flat across 262K context, 1x 3090
27B dense -- 35 tok/s, zero degradation across the same range, 1x 3090
qwopus 27B (opus distilled) -- 35.7 tok/s, same architecture, different brain
80B coder -- 46 tok/s on 2x 3090s, oneshotted a 564 line particle sim
80B coder -- 1.3 tok/s on 1x 3090, bleeding through RAM because it didn't fit but it still ran
same benchmarks. same prompts. same quant where possible. every config is documented. i know these models. and hermes agent is the first harness that feels like it respects that work. tool calls show inline with execution time. nvidia-smi 0.2s. write_file 0.7s. you see exactly what the agent is doing and how long each step takes. no mystery. no black box. no tool call failures so far and i've been pushing it. most agent frameworks feel like you're watching a spinner and hoping. hermes shows the work. that transparency changes how you trust the output. once you use it you see the UX decisions are not accidental. @Teknium and the nous team built this like engineers who actually use their own tools. 80 skills. 29 tools. persistent memory. context compression. runs clean on a single consumer GPU.

English
30
17
414
34.4K
zq_dev
zq_dev@ZQ_Dev·
@sudoingX Amazing results, thanks for sharing! Excited to give 3.5 MoE and 3.5 Dense a try. How would you adjust llama-server config for a 4070 Ti Super (16GB VRAM) and 64GB CPU RAM?
English
0
0
0
349
Sudo su
Sudo su@sudoingX·
the tiebreaker is done. qwen 3.5 27B dense. single RTX 3090. one prompt. zero steering. zero human edits. 1,827 lines across 10 files. 13 minutes. full thinking mode. runs on first load. hermes 4.3 got the same prompt with 2x 3090s and 5x the context it needed. wrote 1,249 lines, left empty files, needed 3 interventions, game was broken on load. same architecture class. same quant. hermes got double the hardware. completely different result. dense wasn't the problem. hermes was. but here's what got me. this model thinks at 27 tok/s. every single token carries 27 billion parameters of reasoning. MoE hit 112 tok/s but only 3B active per token. the dense model is slower and it doesn't matter. watch 13 minutes of autonomous coding on a consumer GPU with zero intervention and tell me speed is what matters. a year ago this wasn't possible. now it runs on hardware you can buy used for $900. no API. no subscription. no cloud. just a 3090 doing what data centers did 18 months ago. full unedited session in the video. every token, every file, every thinking chain. 16 minutes. hit play.
Sudo su@sudoingX

first impressions of qwen 3.5 27B dense on a single RTX 3090. 35 tok/s. from 4K all the way to 300K+ context. no speed drop. hermes 4.3 started at 35 and degraded to 15 as context filled. qwen dense holds. MoE held 112 flat. 3x faster but only 3B of 35B active per token. architecture tradeoff. Q4_K_M on 16.7GB. native context 262K. pushed past training limit to 376K before VRAM ceiling on 24GB. tried q8 KV cache at 262K, speed collapsed to 11 tok/s. q4_0 KV is the sweet spot. flash attention mandatory. built in reasoning mode. the model thinks step by step before it answers. full chain of thought surviving Q4 quant. 1,799+ token thinking chains with self correction loops. on a single consumer GPU. gave it one prompt: "build a realtime particle galaxy simulation in one HTML file." 3,340 tokens. 95 seconds. one shot. ran on first load. full reasoning and coding in the video below. optimal config if you want to skip the hours of testing:
llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0
this is just the warmup. octopus invaders is next: 10 files, 3,400+ lines, zero steering. the prompt hermes quit at 22%. already more impressed than expected. full results coming soon.

English
36
44
555
90.9K
zq_dev
zq_dev@ZQ_Dev·
@sudoingX killing it on these comparisons, excitedly waiting for the results ☕️
English
1
0
1
362
Sudo su
Sudo su@sudoingX·
last time this qwen 3.5 MoE one shotted a full space shooter game. 3,483 lines across 10 files. ran on first load. zero steering. 112 tok/s on a single 3090. then i ran the same prompt on hermes 4.3 36B dense. similar size model, completely different architecture. it wrote 1,249 lines, declared done with empty files, needed three steering interventions, and the game didn't work. used 22% of available context and quit. nine posts and two GPU configs later the conclusion was clear: the bottleneck wasn't hardware. but that leaves a question. was that a dense architecture problem or a hermes 4.3 problem? qwen is the only family that ships both. 35B MoE with 3B active per token. and a 27B dense with all 27B active per token. same team. same training pipeline. different architecture. downloading qwen 3.5 27B dense now. Q4_K_M. same quant. same single RTX 3090. same octopus invaders prompt. if it finishes the game clean, hermes was the problem. if it fails the same way, dense architecture doesn't have the endurance for autonomous coding on consumer hardware regardless of who builds it. the tiebreaker.
Sudo su tweet media
Sudo su@sudoingX

Qwen3.5-35B-A3B testing on single RTX 3090 and it flew. 112 tokens per second. zero tuning. default config. all 41 layers on GPU with 4GB VRAM to spare. for context: the 80B coder-next did 1.3 tok/s on this same card. needed two 3090s to hit 46 tok/s. this model just did 112 on one. same 3B active params. half the total weight. 19.7GB on disk instead of 45. the math was obvious but the result still caught me off guard. flash attention enabled itself automatically. KV cache quantization, expert offloading, thread tuning, none of that applied yet. this is baseline. full optimization breakdown and benchmark results dropping soon. if default settings do 112, i want to see where the ceiling is. exact hardware specs in the image below.

English
18
12
240
92.1K
Sudo su
Sudo su@sudoingX·
the Marlin MoE repack needs ~256MB buffer per GPU after loading weights. at 96% VRAM utilization there's nowhere to put it. same wall on SGLang too since it uses the same kernels. if vLLM can defer the repack or stream it in chunks, it would unlock a lot of MoE models on consumer cards.
English
1
0
1
479
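One workaround consistent with the issue described above is simply leaving VRAM headroom so the repack buffer has somewhere to land. A minimal sketch with vLLM's Python API follows; the model id and the 0.90 utilization figure are assumptions, not a confirmed fix for the Marlin repack path.

# Hypothetical workaround sketch: reserve headroom for the ~256MB Marlin MoE
# repack buffer by lowering vLLM's memory utilization target below the 96%
# mentioned above. Model id and values are placeholders, not a verified fix.
from vllm import LLM, SamplingParams

llm = LLM(
    model="<hf-id-of-a-moe-model>",  # placeholder
    gpu_memory_utilization=0.90,     # drop further if the repack still runs out of memory
    max_model_len=32768,             # a smaller KV cache also frees headroom
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)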
Sudo su
Sudo su@sudoingX·
1.3 t/s on a single 3090. so i added another one and hacked claude code to run on local inference. no API. no rate limits. no subscription. 2x 3090s. 48GB VRAM. same 80B Qwen model. 46 tokens per second. 35x faster. single card was choking on CPU offloading. two cards, full model in VRAM, zero offloading. then the hack: llama-server exposes an OpenAI-compatible API on localhost. LiteLLM proxy translates that to Anthropic's message format. point ANTHROPIC_BASE_URL at your own machine. claude code doesn't know the difference. you're now running anthropic's coding tool on a local open source model. on your own hardware. still tuning layers and context. 46 t/s is day one. already had it build a 3D particle sim from a single prompt. this combo of claude code + local qwen is something else. results coming.
Sudo su tweet media (4 images)
Sudo su@sudoingX

80 billion parameters on a single RTX 3090. it loaded. it ran. it wrote FastAPI auth with JWT, bcrypt, SQLAlchemy, and cookie-based sessions. prompt eval: 11.1 t/s. generation: 1.3 t/s. 1.3 tokens per second. slow? yes. but 20 out of 60+ layers fit on GPU, the rest is bleeding through RAM. the 3090 is doing everything it can with 24GB. this card from 2020 is loading a model most enterprise setups would throw an A100 at. the bottleneck isn't the card. it's that there's only one of them. next: 2x 3090s. full model in VRAM. no offloading. no excuses. let's see what Q4 is really made of.

English
42
47
722
184.3K
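The plumbing in the post above is easy to sanity-check before wiring the LiteLLM-to-Anthropic layer and ANTHROPIC_BASE_URL on top: llama-server speaks the OpenAI chat-completions protocol, so any OpenAI client works against localhost. A minimal probe is sketched below; port 8080 is the llama-server default and the model label is just a placeholder, since the server serves whatever model it was launched with.

# Minimal sketch: confirm a local llama-server exposes an OpenAI-compatible
# chat endpoint. Port 8080 is the llama-server default; the API key is
# ignored by a local server but required by the client constructor.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="local-qwen",  # placeholder; llama-server serves its loaded model regardless
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)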
zq_dev
zq_dev@ZQ_Dev·
@Tim_Dettmers which open weight model are you referring to?
English
1
0
2
1K
Tim Dettmers
Tim Dettmers@Tim_Dettmers·
It seems the closed-source vs open-weights landscape has been leveled. GPT-5 is just 10% better at coding than an open-weight model you can run on a consumer desktop and soon laptop. If Anthropic cannot come up with a good model, then we will probably not see AGI for a while.
English
14
30
237
67.7K
zq_dev
zq_dev@ZQ_Dev·
@danielhanchen awesome! Please consider targeting music generation models like AceStep next :)
English
1
0
2
101
Daniel Han
Daniel Han@danielhanchen·
We're bringing the Unsloth magic to TTS and audio models! There are multiple free Colab notebooks with free GPUs for Whisper, Sesame, Orpheus, Spark, Llasa & Oute on our docs! docs.unsloth.ai/basics/text-to…
Unsloth AI@UnslothAI

You can now fine-tune TTS models with Unsloth! Train, run and save models like Sesame-CSM and OpenAI's Whisper locally with our free notebooks. Unsloth makes TTS training 1.5x faster with 50% less VRAM. GitHub: github.com/unslothai/unsl… Docs & Notebooks: docs.unsloth.ai/basics/text-to…

English
6
25
166
12K
Deepanshu Sharma
Deepanshu Sharma@deepanshusharmx·
it's so over. DeepSeek V3-0324 just dropped and it created this website in one shot. it wrote 800+ lines of code without breaking even once. this is free, open-source, super fast. it's great to see how these open-source models are creating pressure on the big techs to build better models at lower cost.
English
90
213
1.9K
291K
zq_dev
zq_dev@ZQ_Dev·
@karpathy @teknium tried this the other day, it works but it’s also a bit clunky, esp when the repo is huge. maybe I’m doing it wrong, what’s your workflow @karpathy?
English
0
0
2
1.2K
Teknium (e/λ)
Teknium (e/λ)@Teknium·
I really need an LLM that reads in my actual whole codebase and lets me QA it. Cursor afaict doesn't do this. What does?
English
420
68
1.8K
511.5K
Salman Paracha
Salman Paracha@salman_paracha·
I am thrilled that we are now the #1 🏆 trending function calling LLM on @huggingface - the fastest, most efficient models with performance to match frontier LLMs. Arch-Function is engineered in Arch github.com/katanemo/archgw - the intelligent gateway for agents - we'd❤️ for you to try...
Salman Paracha tweet media
English
5
37
137
35.7K
zq_dev
zq_dev@ZQ_Dev·
@danielhanchen @bnjmn_marie So is unsloth the only finetuning framework with this fix implemented? Any feedback from pytorch, tensorflow, etc?
English
1
0
3
531
Daniel Han
Daniel Han@danielhanchen·
Fixed a bug which caused all training losses to diverge for large gradient accumulation sizes.
1. First reported by @bnjmn_marie, GA is supposed to be mathematically equivalent to full batch training, but losses did not match.
2. We reproed the issue, and further investigation showed the L2 Norm between bsz=16 and ga=16 was 10x larger.
3. The culprit was the cross entropy loss normalizer.
4. We ran training runs with denormalized CE Loss, and all training losses match.
5. We then re-normalized CE Loss with the correct denominator across all gradient accumulation steps, and verified all training loss curves match now.
6. We've already updated @UnslothAI with the fix, and wrote up more details in our blog post here: unsloth.ai/blog/gradient
This issue impacts all libraries which use GA, and simple averaging of GA does not work for varying sequence lengths. This also impacts DDP and multi-GPU training which accumulates gradients. Please update Unsloth via pip install --upgrade --no-cache-dir unsloth and use from unsloth import unsloth_train
We have a Colab notebook using our fixed GA: colab.research.google.com/drive/1z0XJU2F… and a Kaggle notebook: kaggle.com/code/danielhan…
Daniel Han tweet media
English
22
131
747
316.7K
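The normalizer issue in the thread above is easy to see in a few lines: averaging each micro-batch's mean loss and then averaging across accumulation steps over-weights short sequences, whereas dividing the summed loss by the total number of valid tokens across all steps matches full-batch training. A small self-contained sketch follows (illustrative only, not Unsloth's actual code).

# Illustrative sketch of the gradient-accumulation normalizer issue
# (not Unsloth's implementation). Two micro-batches with different
# numbers of valid tokens are accumulated.
import torch

torch.manual_seed(0)
per_token_losses = [torch.rand(3), torch.rand(9)]  # 3 and 9 valid tokens

# Naive GA: mean per micro-batch, then mean across accumulation steps.
naive = torch.stack([l.mean() for l in per_token_losses]).mean()

# Corrected GA: sum all token losses, divide by total valid tokens across steps.
total_tokens = sum(l.numel() for l in per_token_losses)
corrected = torch.cat(per_token_losses).sum() / total_tokens

# Full-batch reference: one batch containing the same 12 tokens.
full_batch = torch.cat(per_token_losses).mean()

print(naive.item(), corrected.item(), full_batch.item())
# 'corrected' equals 'full_batch'; 'naive' drifts whenever sequence lengths differ.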
zq_dev
zq_dev@ZQ_Dev·
@mattshumer_ Curious to know if, on the road to finetuning the 70B/405B models, you experimented with reflection-tuning on smaller models in the 7-12B range and saw similar boosts in performance?
English
0
0
0
186
Matt Shumer
Matt Shumer@mattshumer_·
Everyone has been sleeping on applying prompting techniques to models natively. Reflection was just my first attempt to show the power of this. After 405B, I'll be pushing this even further.
English
75
48
1.3K
168.3K
zq_dev
zq_dev@ZQ_Dev·
@mervenoyann Thank you for the tutorial! It’d be really great to have a working example for full finetuning in a multi-GPU setup (something typical, like 8xA100) with deepspeed or fsdp. Been working on one myself but have been running into issues.
English
0
0
0
15
Philipp Schmid
Philipp Schmid@_philschmid·
All you need is synthetic data, LoRA, and 750 human responses for evaluation.
Philipp Schmid tweet media
English
3
10
110
22.1K
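For the LoRA piece mentioned above, a minimal adapter setup with the peft library is sketched below. The rank, alpha, and target modules are generic assumptions, not the configuration from the linked post.

# Hypothetical sketch: attach LoRA adapters to a causal LM with peft.
# Rank, alpha, dropout, and target modules are generic placeholder values.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("<hf-id-of-a-base-model>")  # placeholder

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable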
zq_dev
zq_dev@ZQ_Dev·
@erhartford appreciate the response, can you share your config? particularly interested in how you’re only targeting half the parameters - do you just mean qlora adapters every other layer?
English
0
0
4
209
Eric Hartford
Eric Hartford@QuixiAI·
Dolphin-2.9-8x22b is in the oven. fft, deepspeed zero3 param offload, 8k sequence, half the layers are targeted. This is a significantly improved, filtered dataset. Function calling, agentic, math, dolphin and dolphin-coder.
Eric Hartford tweet media (2 images)
English
33
39
387
113K
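For anyone trying to reproduce a run like the one described above (full fine-tune with DeepSpeed ZeRO-3 parameter offload), a minimal config sketch is below. The values are assumptions, not the actual Dolphin config, and the layer-targeting question from the reply above isn't addressed here.

# Hypothetical sketch of a DeepSpeed ZeRO-3 config with parameter and
# optimizer offload, passed to the Hugging Face Trainer. Values are
# placeholders, not the actual Dolphin-2.9 training config.
ds_config = {
    "zero_optimization": {
        "stage": 3,  # ZeRO-3: shard parameters, gradients, and optimizer state
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="dolphin-ft",          # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed=ds_config,              # accepts a dict or a path to a JSON file
)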