zq_dev

30 posts

@ZQ_Dev

Open Source AI

California, USA · Joined May 2014
194 Following · 27 Followers
Red Hat AI
Red Hat AI@RedHat_AI·
What compression looks like on @vllm_project. Same Gemma 4 31B. Red Hat AI's quantized version runs at nearly 2x tokens/sec, half the memory, 99%+ accuracy retained. Open source. Quantized with LLM Compressor. Links in comments. 🙏 @_soyr_ for the 2-minute demo.
English
8
42
445
31.9K
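For anyone who wants to try this kind of compression themselves, below is a rough sketch of a one-shot pass with LLM Compressor. The model id, FP8 scheme, and output path are placeholders (not the recipe used for the release above), and the oneshot import path may differ across llmcompressor versions, so check the project docs before running.

# Hypothetical sketch: one-shot quantization with LLM Compressor.
# Model id, scheme, and output path are placeholders, not the exact
# recipe behind the quantized Gemma release mentioned above.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # import path may vary by llmcompressor version

recipe = QuantizationModifier(
    targets="Linear",         # quantize all linear layers
    scheme="FP8_DYNAMIC",     # FP8 weights with dynamic activation scales
    ignore=["lm_head"],       # keep the output head in higher precision
)

oneshot(
    model="<hf-id-or-local-path-of-the-model>",  # placeholder
    recipe=recipe,
    output_dir="model-fp8-dynamic",              # resulting checkpoint can be served with vLLM
)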
zq_dev
zq_dev@ZQ_Dev·
@GoogleDeepMind @sudoingX excited to see how the 31B dense does vs. Qwen3.5-27B on your "build a space shooter" test from last month...
English
0
0
1
1.5K
Google DeepMind
Google DeepMind@GoogleDeepMind·
Meet Gemma 4: our new family of open models you can run on your own hardware. They're built for advanced reasoning and agentic workflows, and we're releasing them under an Apache 2.0 license. Here's what's new 🧵
GIF
English
371
1.2K
8.8K
3.8M
Google Research
Google Research@GoogleResearch·
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
GIF
English
1K
5.8K
39K
19.2M
zq_dev
zq_dev@ZQ_Dev·
@sudoingX Genuine curiosity - where does hermes-agent sit in the market for you? Is it meant to be a direct competitor to claude code, opencode, etc?
English
0
0
0
163
Sudo su
Sudo su@sudoingX·
i keep coming back to hermes agent. woke up and opened it before anything else. not because i have to test it. because i want to use it. the UX is what does it. the ASCII skull splash. color coded tool calls with execution times. the emoji phase spinner while thinking. dark theme that doesn't burn your eyes at 2am. every detail feels intentional. most agent frameworks feel like dev tools. this one feels like someone built it for themselves first and shipped it second. still running Qwen 3.5 27B dense on a single 3090. 29-35 tok/s. 262K context. fully local. the stack just works.
Sudo su tweet media
Sudo su@sudoingX

been playing with hermes agent paired with qwen 3.5 dense 27B on my single 3090 since last night. there is something about this harness that caught me and i think i know what it is. i've now run five qwen configs on consumer hardware:
35B MoE (3B active) -- 112 tok/s flat across 262K context, 1x 3090
27B dense -- 35 tok/s, zero degradation across the same range, 1x 3090
qwopus 27B (opus distilled) -- 35.7 tok/s, same architecture, different brain
80B coder -- 46 tok/s on 2x 3090s, oneshotted a 564 line particle sim
80B coder -- 1.3 tok/s on 1x 3090, bleeding through RAM because it didn't fit but it still ran
same benchmarks. same prompts. same quant where possible. every config is documented. i know these models. and hermes agent is the first harness that feels like it respects that work. tool calls show inline with execution time. nvidia-smi 0.2s. write_file 0.7s. you see exactly what the agent is doing and how long each step takes. no mystery. no black box. no tool call failures so far and i've been pushing it. most agent frameworks feel like you're watching a spinner and hoping. hermes shows the work. that transparency changes how you trust the output. once you use it you see the UX decisions are not accidental. @Teknium and the nous team built this like engineers who actually use their own tools. 80 skills. 29 tools. persistent memory. context compression. runs clean on a single consumer GPU.

English
30
17
414
34.4K
zq_dev
zq_dev@ZQ_Dev·
@sudoingX Amazing results, thanks for sharing! Excited to give 3.5 MoE and 3.5 Dense a try. How would you adjust llama-server config for a 4070 Ti Super (16GB VRAM) and 64GB CPU RAM?
English
0
0
0
349
Sudo su
Sudo su@sudoingX·
the tiebreaker is done. qwen 3.5 27B dense. single RTX 3090. one prompt. zero steering. zero human edits. 1,827 lines across 10 files. 13 minutes. full thinking mode. runs on first load. hermes 4.3 got the same prompt with 2x 3090s and 5x the context it needed. wrote 1,249 lines, left empty files, needed 3 interventions, game was broken on load. same architecture class. same quant. hermes got double the hardware. completely different result. dense wasn't the problem. hermes was. but here's what got me. this model thinks at 27 tok/s. every single token carries 27 billion parameters of reasoning. MoE hit 112 tok/s but only 3B active per token. the dense model is slower and it doesn't matter. watch 13 minutes of autonomous coding on a consumer GPU with zero intervention and tell me speed is what matters. a year ago this wasn't possible. now it runs on hardware you can buy used for $900. no API. no subscription. no cloud. just a 3090 doing what data centers did 18 months ago. full unedited session in the video. every token, every file, every thinking chain. 16 minutes. hit play.
Sudo su@sudoingX

first impressions of qwen 3.5 27B dense on a single RTX 3090. 35 tok/s. from 4K all the way to 300K+ context. no speed drop. hermes 4.3 started at 35 and degraded to 15 as context filled. qwen dense holds. MoE held 112 flat. 3x faster but only 3B of 35B active per token. architecture tradeoff. Q4_K_M on 16.7GB. native context 262K. pushed past training limit to 376K before VRAM ceiling on 24GB. tried q8 KV cache at 262K, speed collapsed to 11 tok/s. q4_0 KV is the sweet spot. flash attention mandatory. built in reasoning mode. the model thinks step by step before it answers. full chain of thought surviving Q4 quant. 1,799+ token thinking chains with self correction loops. on a single consumer GPU. gave it one prompt: "build a realtime particle galaxy simulation in one HTML file." 3,340 tokens. 95 seconds. one shot. ran on first load. full reasoning and coding in the video below. optimal config if you want to skip the hours of testing:
llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0
this is just the warmup. octopus invaders is next: 10 files, 3,400+ lines, zero steering. the prompt hermes quit at 22%. already more impressed than expected. full results coming soon.

English
36
44
555
90.9K
zq_dev
zq_dev@ZQ_Dev·
@sudoingX killing it on these comparisons, excitedly waiting for the results ☕️
English
1
0
1
362
Sudo su
Sudo su@sudoingX·
last time this qwen 3.5 MoE one shotted a full space shooter game. 3,483 lines across 10 files. ran on first load. zero steering. 112 tok/s on a single 3090. then i ran the same prompt on hermes 4.3 36B dense. similar size model, completely different architecture. it wrote 1,249 lines, declared done with empty files, needed three steering interventions, and the game didn't work. used 22% of available context and quit. nine posts and two GPU configs later the conclusion was clear: the bottleneck wasn't hardware. but that leaves a question. was that a dense architecture problem or a hermes 4.3 problem? qwen is the only family that ships both. 35B MoE with 3B active per token. and a 27B dense with all 27B active per token. same team. same training pipeline. different architecture. downloading qwen 3.5 27B dense now. Q4_K_M. same quant. same single RTX 3090. same octopus invaders prompt. if it finishes the game clean, hermes was the problem. if it fails the same way, dense architecture doesn't have the endurance for autonomous coding on consumer hardware regardless of who builds it. the tiebreaker.
Sudo su tweet media
Sudo su@sudoingX

Qwen3.5-35B-A3B testing on single RTX 3090 and it flew. 112 tokens per second. zero tuning. default config. all 41 layers on GPU with 4GB VRAM to spare. for context: the 80B coder-next did 1.3 tok/s on this same card. needed two 3090s to hit 46 tok/s. this model just did 112 on one. same 3B active params. half the total weight. 19.7GB on disk instead of 45. the math was obvious but the result still caught me off guard. flash attention enabled itself automatically. KV cache quantization, expert offloading, thread tuning, none of that applied yet. this is baseline. full optimization breakdown and benchmark results dropping soon. if default settings do 112, i want to see where the ceiling is. exact hardware specs in the image below.

English
18
12
240
92.1K
Sudo su
Sudo su@sudoingX·
the Marlin MoE repack needs ~256MB buffer per GPU after loading weights. at 96% VRAM utilization there's nowhere to put it. same wall on SGLang too since it uses the same kernels. if vLLM can defer the repack or stream it in chunks, it would unlock a lot of MoE models on consumer cards.
English
1
0
1
479
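One workaround consistent with the issue described above is simply leaving VRAM headroom so the repack buffer has somewhere to land. A minimal sketch with vLLM's Python API follows; the model id and the 0.90 utilization figure are assumptions, not a confirmed fix for the Marlin repack path.

# Hypothetical workaround sketch: reserve headroom for the ~256MB Marlin MoE
# repack buffer by lowering vLLM's memory utilization target below the 96%
# mentioned above. Model id and values are placeholders, not a verified fix.
from vllm import LLM, SamplingParams

llm = LLM(
    model="<hf-id-of-a-moe-model>",  # placeholder
    gpu_memory_utilization=0.90,     # drop further if the repack still runs out of memory
    max_model_len=32768,             # a smaller KV cache also frees headroom
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)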
Sudo su
Sudo su@sudoingX·
1.3 t/s on a single 3090. so i added another one and hacked claude code to run on local inference. no API. no rate limits. no subscription. 2x 3090s. 48GB VRAM. same 80B Qwen model. 46 tokens per second. 35x faster. single card was choking on CPU offloading. two cards, full model in VRAM, zero offloading. then the hack: llama-server exposes an OpenAI-compatible API on localhost. LiteLLM proxy translates that to Anthropic's message format. point ANTHROPIC_BASE_URL at your own machine. claude code doesn't know the difference. you're now running anthropic's coding tool on a local open source model. on your own hardware. still tuning layers and context. 46 t/s is day one. already had it build a 3D particle sim from a single prompt. this combo of claude code + local qwen is something else. results coming.
Sudo su tweet media (4 images)
Sudo su@sudoingX

80 billion parameters on a single RTX 3090. it loaded. it ran. it wrote FastAPI auth with JWT, bcrypt, SQLAlchemy, and cookie-based sessions. prompt eval: 11.1 t/s. generation: 1.3 t/s. 1.3 tokens per second. slow? yes. but 20 out of 60+ layers fit on GPU, the rest is bleeding through RAM. the 3090 is doing everything it can with 24GB. this card from 2020 is loading a model most enterprise setups would throw an A100 at. the bottleneck isn't the card. it's that there's only one of them. next: 2x 3090s. full model in VRAM. no offloading. no excuses. let's see what Q4 is really made of.

English
42
47
722
184.3K
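The plumbing in the post above is easy to sanity-check before wiring the LiteLLM-to-Anthropic layer and ANTHROPIC_BASE_URL on top: llama-server speaks the OpenAI chat-completions protocol, so any OpenAI client works against localhost. A minimal probe is sketched below; port 8080 is the llama-server default and the model label is just a placeholder, since the server serves whatever model it was launched with.

# Minimal sketch: confirm a local llama-server exposes an OpenAI-compatible
# chat endpoint. Port 8080 is the llama-server default; the API key is
# ignored by a local server but required by the client constructor.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="local-qwen",  # placeholder; llama-server serves its loaded model regardless
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)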
zq_dev
zq_dev@ZQ_Dev·
@Tim_Dettmers which open weight model are you referring to?
English
1
0
2
1K
Tim Dettmers
Tim Dettmers@Tim_Dettmers·
It seems the closed-source vs open-weights landscape has been leveled. GPT-5 is just 10% better at coding than an open-weight model you can run on a consumer desktop and soon laptop. If Anthropic cannot come up with a good model, then we will probably not see AGI for a while.
English
14
30
237
67.7K
zq_dev
zq_dev@ZQ_Dev·
@danielhanchen awesome! Please consider targeting music generation models like AceStep next :)
English
1
0
2
101
Daniel Han
Daniel Han@danielhanchen·
We're bringing the Unsloth magic to TTS and audio models! There are multiple free Colab notebooks with free GPUs for Whisper, Sesame, Orpheus, Spark, Llasa & Oute on our docs! docs.unsloth.ai/basics/text-to…
Unsloth AI@UnslothAI

You can now fine-tune TTS models with Unsloth! Train, run and save models like Sesame-CSM and OpenAI's Whisper locally with our free notebooks. Unsloth makes TTS training 1.5x faster with 50% less VRAM. GitHub: github.com/unslothai/unsl… Docs & Notebooks: docs.unsloth.ai/basics/text-to…

English
6
25
166
12K
Deepanshu Sharma
Deepanshu Sharma@deepanshusharmx·
it's so over. DeepSeek V3-0324 just dropped and it created this website in one shot. it wrote 800+ lines of code without breaking even once. this is free, open-source, super fast. it's great to see how these open-source models are creating pressure on the big techs to build better models at lower cost.
English
90
213
1.9K
291K
zq_dev
zq_dev@ZQ_Dev·
@karpathy @teknium tried this the other day, it works but it’s also a bit clunky, esp when the repo is huge. maybe I’m doing it wrong, what’s your workflow @karpathy?
English
0
0
2
1.2K
Teknium (e/λ)
Teknium (e/λ)@Teknium·
I really need an LLM that reads in my actual whole codebase and lets me QA it. Cursor afaict doesn't do this. What does?
English
420
68
1.8K
511.5K
Salman Paracha
Salman Paracha@salman_paracha·
I am thrilled that we are now the #1 🏆 trending function calling LLM on @huggingface - the fastest, most efficient models with performance to match frontier LLMs. Arch-Function is engineered in Arch github.com/katanemo/archgw - the intelligent gateway for agents - we'd❤️ for you to try...
Salman Paracha tweet media
English
5
37
137
35.7K
zq_dev
zq_dev@ZQ_Dev·
@danielhanchen @bnjmn_marie So is unsloth the only finetuning framework with this fix implemented? Any feedback from pytorch, tensorflow, etc?
English
1
0
3
531
Daniel Han
Daniel Han@danielhanchen·
Fixed a bug which caused all training losses to diverge for large gradient accumulation sizes.
1. First reported by @bnjmn_marie, GA is supposed to be mathematically equivalent to full batch training, but losses did not match.
2. We reproed the issue, and further investigation showed the L2 Norm between bsz=16 and ga=16 was 10x larger.
3. The culprit was the cross entropy loss normalizer.
4. We ran training runs with denormalized CE Loss, and all training losses match.
5. We then re-normalized CE Loss with the correct denominator across all gradient accumulation steps, and verified all training loss curves match now.
6. We've already updated @UnslothAI with the fix, and wrote up more details in our blog post here: unsloth.ai/blog/gradient
This issue impacts all libraries which use GA, and simple averaging of GA does not work for varying sequence lengths. This also impacts DDP and multi-GPU training which accumulates gradients. Please update Unsloth via pip install --upgrade --no-cache-dir unsloth and use from unsloth import unsloth_train
We have a Colab notebook using our fixed GA: colab.research.google.com/drive/1z0XJU2F… and a Kaggle notebook: kaggle.com/code/danielhan…
Daniel Han tweet media
English
22
131
747
316.7K
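The normalizer issue in the thread above is easy to see in a few lines: averaging each micro-batch's mean loss and then averaging across accumulation steps over-weights short sequences, whereas dividing the summed loss by the total number of valid tokens across all steps matches full-batch training. A small self-contained sketch follows (illustrative only, not Unsloth's actual code).

# Illustrative sketch of the gradient-accumulation normalizer issue
# (not Unsloth's implementation). Two micro-batches with different
# numbers of valid tokens are accumulated.
import torch

torch.manual_seed(0)
per_token_losses = [torch.rand(3), torch.rand(9)]  # 3 and 9 valid tokens

# Naive GA: mean per micro-batch, then mean across accumulation steps.
naive = torch.stack([l.mean() for l in per_token_losses]).mean()

# Corrected GA: sum all token losses, divide by total valid tokens across steps.
total_tokens = sum(l.numel() for l in per_token_losses)
corrected = torch.cat(per_token_losses).sum() / total_tokens

# Full-batch reference: one batch containing the same 12 tokens.
full_batch = torch.cat(per_token_losses).mean()

print(naive.item(), corrected.item(), full_batch.item())
# 'corrected' equals 'full_batch'; 'naive' drifts whenever sequence lengths differ.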
zq_dev
zq_dev@ZQ_Dev·
@mattshumer_ Curious to know if, on the road to finetuning the 70B/405B models, you experimented with reflection-tuning on smaller models in the 7-12B range and saw similar boosts in performance?
English
0
0
0
186
Matt Shumer
Matt Shumer@mattshumer_·
Everyone has been sleeping on applying prompting techniques to models natively. Reflection was just my first attempt to show the power of this. After 405B, I'll be pushing this even further.
English
75
48
1.3K
168.3K
zq_dev
zq_dev@ZQ_Dev·
@mervenoyann Thank you for the tutorial! It’d be really great to have a working example for full finetuning in a multi-GPU setup (something typical, like 8xA100) with deepspeed or fsdp. Been working on one myself but have been running into issues.
English
0
0
0
15
Philipp Schmid
Philipp Schmid@_philschmid·
All you need is synthetic data, LoRA, and 750 human responses for evaluation.
Philipp Schmid tweet media
English
3
10
110
22.1K
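For the LoRA piece mentioned above, a minimal adapter setup with the peft library is sketched below. The rank, alpha, and target modules are generic assumptions, not the configuration from the linked post.

# Hypothetical sketch: attach LoRA adapters to a causal LM with peft.
# Rank, alpha, dropout, and target modules are generic placeholder values.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("<hf-id-of-a-base-model>")  # placeholder

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable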
zq_dev
zq_dev@ZQ_Dev·
@erhartford appreciate the response, can you share your config? particularly interested in how you’re only targeting half the parameters - do you just mean qlora adapters every other layer?
English
0
0
4
209
Eric Hartford
Eric Hartford@QuixiAI·
Dolphin-2.9-8x22b is in the oven. fft, deepspeed zero3 param offload, 8k sequence, half the layers are targeted. This is a significantly improved, filtered dataset. Function calling, agentic, math, dolphin and dolphin-coder.
Eric Hartford tweet media (2 images)
English
33
39
387
113K
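For anyone trying to reproduce a run like the one described above (full fine-tune with DeepSpeed ZeRO-3 parameter offload), a minimal config sketch is below. The values are assumptions, not the actual Dolphin config, and the layer-targeting question from the reply above isn't addressed here.

# Hypothetical sketch of a DeepSpeed ZeRO-3 config with parameter and
# optimizer offload, passed to the Hugging Face Trainer. Values are
# placeholders, not the actual Dolphin-2.9 training config.
ds_config = {
    "zero_optimization": {
        "stage": 3,  # ZeRO-3: shard parameters, gradients, and optimizer state
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="dolphin-ft",          # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed=ds_config,              # accepts a dict or a path to a JSON file
)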