James Arslan

56 posts

@JamesArslanSwe

CTO of https://t.co/dolKx2UPD9 instagram @jamesarslanswe

Stockholm · Joined June 2024
186 Following · 17 Followers
Michal Majzlík
Michal Majzlík@michalmajzlik·
@tom_doerr This sounds too good. Does anyone have experience with it already who can share results? I want to test it on my single GPU + llama.cpp + llama-swap + Hermes Agent + Qwen/Gemma.
2 replies · 0 reposts · 2 likes · 530 views
James Arslan
James Arslan@JamesArslanSwe·
@TeksEdge 62 t/s is really good, but it still crashes on the 5090 from time to time
0 replies · 0 reposts · 1 like · 689 views
David Hendrickson
David Hendrickson@TeksEdge·
🔥 RTX 5090 + Gemma 4 31B: real user testing right now. 💳 32GB GDDR7 gives excellent headroom for higher quants on this dense 31B model. 🧪 Typical performance (llama.cpp + early user reports):
Quant | Approx. VRAM (weights + overhead) | Expected TPS (generation)
⚡ Q4_K_M | ~18–21 GB | 55–75+ t/s
📈 Q5_K_XL | ~22–25 GB | 45–65 t/s
🐢 Q6_K / Q8 | ~26–32+ GB | 35–55 t/s
Users are actively testing Unsloth UD-Q5_K_XL on RTX 5090 and tuning with TurboQuant / KV cache compression for better speed. Great quality + performance balance for local Gemma 4 31B inference 👌 Who else is running it? 👀
19 replies · 18 reposts · 220 likes · 39.2K views
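A quick back-of-envelope check on those VRAM figures (a sketch: the bits-per-weight values below are typical llama.cpp figures for these quant types, and the 2-4 GB overhead allowance is an assumption):

```python
# Rough VRAM estimate for a dense 31B model at common llama.cpp quants.
# Bits-per-weight values are approximate; the overhead allowance is a guess.
PARAMS = 31e9  # Gemma 4 31B, dense

BPW = {"Q4_K_M": 4.8, "Q5_K_XL": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

for quant, bpw in BPW.items():
    weights_gb = PARAMS * bpw / 8 / 1e9  # bits -> bytes -> GB
    # Allow ~2-4 GB extra for KV cache, activations, and CUDA overhead.
    print(f"{quant}: ~{weights_gb:.1f} GB weights, "
          f"~{weights_gb + 2:.0f}-{weights_gb + 4:.0f} GB total")
```

That lands close to the table's ~18-21 GB for Q4_K_M and shows why Q6_K/Q8 gets tight even on a 32 GB card.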
James Arslan
James Arslan@JamesArslanSwe·
@julien_c Gemma 4 crashes on the dense variant, but I'd say tool calling and understanding of tool usage are better on the Qwen models, while the quality of the output code is better on Gemma 4. It's also about 5 times slower…
0 replies · 0 reposts · 1 like · 4.1K views
Julien Chaumond
Julien Chaumond@julien_c·
so…. Qwen3.5 or Gemma 4?
205 replies · 18 reposts · 882 likes · 200.5K views
James Arslan
James Arslan@JamesArslanSwe·
It's dense by default: Gemma 4 31B doesn't have a "mode" switch. The 31B variant is a dense transformer (all 31B params active every token), unlike the 26B-A4B variant, which is MoE (3.8B active out of 25.2B). The model you download determines the architecture:
- gemma-4-31B-it = dense, 31B active, 61 t/s
- gemma-4-26B-A4B-it = MoE, 3.8B active, much faster
I'm using the dense 31B specifically because the quality gap is worth the speed tradeoff for complex coding tasks. The MoE 26B-A4B would be the faster alternative if you want Qwen3.5-like speed with Gemma quality. Full setup with both Qwen3.5 MoE + Gemma 4 Dense, TurboQuant KV compression, and a one-command install is at my repo
1 reply · 0 reposts · 1 like · 45 views
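A rough way to see where the dense-vs-MoE speed gap comes from: single-stream generation is largely memory-bandwidth-bound, so tokens/s scales with the bytes of weights read per token. A minimal sketch, where the bandwidth, bits-per-weight, and efficiency numbers are all assumptions:

```python
# Bandwidth-bound speed intuition for dense vs MoE. All numbers are
# assumptions: ~4.8 bits/weight, ~65% of peak bandwidth achieved.
BANDWIDTH = 1792e9     # RTX 5090 GDDR7, bytes/s (approx. spec)
BYTES_PER_PARAM = 0.6  # ~4.8 bits per weight at Q4_K_M-class quants

def est_tps(active_params: float, efficiency: float = 0.65) -> float:
    # Each generated token reads every active weight once.
    return BANDWIDTH * efficiency / (active_params * BYTES_PER_PARAM)

print(f"gemma-4-31B-it     (dense, 31B active): ~{est_tps(31e9):.0f} t/s")
print(f"gemma-4-26B-A4B-it (MoE, 3.8B active):  ~{est_tps(3.8e9):.0f} t/s")
```

With these guesses the dense model comes out near the quoted 61 t/s, and the MoE variant several times faster, matching the "much faster" characterization above.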
Unsloth AI
Unsloth AI@UnslothAI·
Google releases Gemma 4. ✨ Gemma 4 introduces 4 models: E2B, E4B, 26B-A4B, 31B. The multimodal reasoning models are under Apache 2.0. Run E2B and E4B on ~6GB RAM, and on phones. Run 26B-A4B and 31B on ~18GB.
GGUFs: huggingface.co/collections/un…
Guide: unsloth.ai/docs/models/ge…
Google DeepMind@GoogleDeepMind

Meet Gemma 4: our new family of open models you can run on your own hardware. Built for advanced reasoning and agentic workflows, we’re releasing them under an Apache 2.0 license. Here’s what’s new 🧵

48 replies · 168 reposts · 1.2K likes · 238.2K views
Tom Turney
Tom Turney@no_stp_on_snek·
Gemma 4 update: the asymmetric test crashed with garbage output! YES! This proves it's NOT CUDA-specific and NOT multi-GPU-specific. We just reproduced issue #47 on Metal with Gemma 4, the head_dim bug I've been chasing. Fixing now.
3 replies · 1 repost · 42 likes · 5K views
James Arslan
James Arslan@JamesArslanSwe·
Gave both models the same Asteroids game prompt:
Qwen3.5 (188 t/s): built it fast, needed 2 bug fixes
Gemma 4 (61 t/s): built it slower, worked perfectly first try
Speed vs quality: now I just pick the right model for the task. Both running locally on the same GPU. You can try it yourself at github.com/jamesarslan/lo…
0 replies · 0 reposts · 0 likes · 43 views
Ettore Di Giacinto
Ettore Di Giacinto@mudler_it·
APEX Gemma-4 coming! Slightly different MoE architecture: 30 layers, 128 experts / 8 active. With APEX that means:
Layers 0-4 (edge): highest precision (Q6_K experts)
Layers 5-9 (near-edge): Q5_K/Q6_K experts
Layers 10-19 (middle): lowest precision (profile-dependent)
Layers 20-24 (near-edge): Q5_K/Q6_K experts
Layers 25-29 (edge): highest precision (Q6_K experts)
APEX Balanced (18.2 GB) already beats F16 (47 GB) on perplexity: 316.4 vs 338.5! This is somewhat expected, as Gemma has always done well on wikitext-2, so I'd take it only as an initial hint. Non-imatrix quants are available now. A new v1.2 calibration dataset (code + agentic + wiki) for I-variants is coming with full benchmarks soon. 🤗 huggingface.co/mudler/gemma-4…
Demis Hassabis@demishassabis

Excited to launch Gemma 4: the best open models in the world for their respective sizes. Available in 4 sizes that can be fine-tuned for your specific task: 31B dense for great raw performance, 26B MoE for low latency, and effective 2B & 4B for edge device use - happy building!

5 replies · 4 reposts · 57 likes · 6K views
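The edge-heavy precision layout above reduces to a simple layer-index lookup. A minimal sketch of that mapping (illustrative only: the function and the middle-tier default are placeholders, not APEX's actual code):

```python
# Map a layer index to the expert quant tier mudler describes for a
# 30-layer APEX Gemma-4. The middle-layer precision is profile-dependent,
# so the default here is a placeholder.
def apex_expert_quant(layer: int, n_layers: int = 30,
                      middle_profile: str = "Q4_K") -> str:
    if layer < 5 or layer >= n_layers - 5:    # 0-4 and 25-29: edge
        return "Q6_K"
    if layer < 10 or layer >= n_layers - 10:  # 5-9 and 20-24: near-edge
        return "Q5_K/Q6_K"
    return middle_profile                     # 10-19: middle, lowest precision

print([apex_expert_quant(l) for l in range(30)])
```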
James Arslan
James Arslan@JamesArslanSwe·
Updated my open-source local AI coding pipeline, now with dual model support:
Qwen3.5-35B-A3B: 188 t/s, MoE (3B active), fast iteration
Gemma 4 31B: 61 t/s, dense (31B active), zero-bug code on first try
One-command setup, TurboQuant KV compression, OpenCode + Context7 + Chrome DevTools. All on a single RTX 5090, zero cloud APIs. github.com/jamesarslan/lo…
0 replies · 0 reposts · 0 likes · 77 views
James Arslan reposted
Z.ai
Z.ai@Zai_org·
Introducing GLM-5V-Turbo: Vision Coding Model
- Native Multimodal Coding: natively understands multimodal inputs including images, videos, design drafts, and document layouts.
- Balanced Visual and Programming Capabilities: achieves leading performance across core benchmarks for multimodal coding, tool use, and GUI agents.
- Deep Adaptation for Claude Code and Claw Scenarios: works in deep synergy with agents like Claude Code and OpenClaw.
Try it now: chat.z.ai
API: docs.z.ai/guides/vlm/glm…
Coding Plan trial applications: docs.google.com/forms/d/e/1FAI…
241 replies · 653 reposts · 5.8K likes · 1.9M views
James Arslan reposted
Qwen
Qwen@Alibaba_Qwen·
Demo 2: Audio-Visual Vibe Coding
Qwen@Alibaba_Qwen

🚀 Qwen3.5-Omni is here! Scaling up to a native omni-modal AGI. Meet the next generation of Qwen, designed for native text, image, audio, and video understanding, with major advances in both intelligence and real-time interaction. A standout feature: 'Audio-Visual Vibe Coding'. Describe your vision to the camera, and Qwen3.5-Omni-Plus instantly builds a functional website or game for you.
Offline Highlights:
🎬 Script-Level Captioning: Generate detailed video scripts with timestamps, scene cuts & speaker mapping.
🏆 SOTA Performance: Outperforms Gemini-3.1 Pro in audio and matches its audio-visual understanding.
🧠 Massive Capacity: Natively handles up to 10h of audio or 400s of 720p video, trained on 100M+ hours of data.
🌍 Global Reach: Recognizes 113 languages (speech) & speaks 36.
Real-time Features:
🎙️ Fine-Grained Voice Control: Adjust emotion, pace, and volume in real time.
🔍 Built-in Web Search & complex function calling.
👤 Voice Cloning: Customize your AI's voice from a short sample, with engineering rollout coming soon.
💬 Human-like Conversation: Smart turn-taking that understands real intent and ignores noise.
The Qwen3.5-Omni family includes Plus, Flash, and Light variants. Try it out:
Blog: qwen.ai/blog?id=qwen3.…
Realtime Interaction: click the VoiceChat/VideoChat button (bottom-right): chat.qwen.ai
HF-Demo: huggingface.co/spaces/Qwen/Qw…
HF-VoiceOnline-Demo: huggingface.co/spaces/Qwen/Qw…
API-Offline: alibabacloud.com/help/en/model-…
API-Realtime: alibabacloud.com/help/en/model-…

36 replies · 169 reposts · 1.6K likes · 208.3K views
buun
buun@spiritbuun·
I love Qwen3.5 27B, but no one ever talks about the elephant in the room: the architecture has very few attention layers. Context rot is a problem. In my experience it begins subtly misbehaving, ignoring system-prompt rules, as early as 9k tokens.
22 replies · 6 reposts · 203 likes · 21.4K views
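That 9k-token claim is cheap to probe: pad the context with filler and check whether a trivial system-prompt rule survives. A minimal sketch against a local OpenAI-compatible endpoint such as llama.cpp's llama-server; the URL, model name, and padding sizes are placeholders:

```python
# Probe for "context rot": does a trivial system-prompt rule still hold
# as the context grows? Endpoint and model name are placeholders.
import requests

SYSTEM = "Always answer entirely in uppercase."
FILLER = "This is neutral filler text about nothing in particular. " * 80

for chunks in (1, 3, 6, 12):  # very roughly ~1k tokens per chunk
    resp = requests.post("http://localhost:8080/v1/chat/completions", json={
        "model": "qwen3.5-27b",
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": FILLER * chunks + "\nSay hello."},
        ],
    }).json()
    reply = resp["choices"][0]["message"]["content"]
    print(f"~{chunks}k padded tokens: rule held = {reply.strip().isupper()}")
```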
Techjunkie Aman
Techjunkie Aman@Techjunkie_Aman·
A GTA IV dev build from 2007 just surfaced. Details:
• Dated Nov 2007 (months before release)
• 127GB Xbox 360 devkit archive
• Incomplete + partially corrupted
• Pulled from a Rockstar dev kit
Dataminers already found:
• Silenced pistol + unused weapons
• Beta NPCs + early Michelle model
• Unfinished aiming animations
• Ferry system from early trailers
• Early radio/audio placeholders
No full playable build yet. But modders are already extracting assets. This is rare development history. And the community is still digging. What would you want to uncover next?
33 replies · 250 reposts · 4.7K likes · 340.1K views
Techolay 🤖
Techolay 🤖@techolay·
Google announced TurboQuant, a new AI algorithm that cuts RAM usage for large language models by a factor of 6 and increases speed by up to 8x in software. As a result, shares of RAM manufacturers such as Samsung and Micron fell sharply.
6 replies · 3 reposts · 87 likes · 7.6K views
Joel - coffee/acc
Joel - coffee/acc@JoelDeTeves·
Qwen3.5-27B-GGUF with hermes agent is the way
English
14
6
284
14.8K
Loktar 🇺🇸
Loktar 🇺🇸@loktar00·
Just saw TurboQuant getting ported to llama.cpp... 4.6x KV cache compression. If this actually works well, we might eventually get models with near-1M context on a single 3090, considering Qwen 27B can already run at over 250K.
9 replies · 13 reposts · 221 likes · 12.8K views
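The arithmetic behind that hope, as a sketch: the KV cache grows linearly with context, at 2 (K and V) x layers x KV heads x head_dim x bytes per element, per token. The model dimensions below are placeholders for a Qwen-27B-class GQA model, not published specs:

```python
# Back-of-envelope KV-cache sizing; model dims are placeholders.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 4, 128
BYTES_PER_ELEM = 2  # fp16 cache

def kv_cache_gib(n_tokens: int, compression: float = 1.0) -> float:
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # K + V
    return n_tokens * per_token / compression / 2**30

for ctx in (250_000, 1_000_000):
    print(f"{ctx:>9,} tokens: fp16 {kv_cache_gib(ctx):5.1f} GiB -> "
          f"4.6x compressed {kv_cache_gib(ctx, 4.6):5.1f} GiB")
```

Under those assumptions, 4.6x compression takes a 1M-token cache from roughly 90 GiB down to about 20 GiB, which is why the "eventually" is doing real work: the weights still compete for the same 24 GB of VRAM.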
James Arslan
James Arslan@JamesArslanSwe·
Tested TurboQuant on real hardware: RTX 5090, Qwen3.5-35B-A3B (35B params, 3B active MoE). Using @spiritbuun's CUDA port for llama.cpp with the turbo3 KV cache:
- KV cache: 2,560 MiB → ~730 MiB (3.5x compression)
- Prompt processing actually got faster: 4,291 → 5,623 t/s
- Quality confirmed: no degradation at 131K context, where standard q8_0 falls apart on this model
The FWHT rotation + norm correction is what makes this work: it's not just fewer bits, it's fundamentally better quantization geometry. Standard per-block scaling breaks Qwen3.5 at 20-40K+ tokens. TurboQuant doesn't.
Built a full local AI coding pipeline around it: agentic coding with browser verification, doc search, all running on a single consumer GPU with zero cloud APIs. Setup + benchmarks + configs open-sourced: github.com/jamesarslan/lo…
Great research. Waiting for this to land in mainline llama.cpp.
2 replies · 0 reposts · 1 like · 744 views
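To make "better quantization geometry" concrete: a Walsh-Hadamard rotation is orthogonal (and self-inverse when orthonormally scaled), so it spreads a single outlier across the whole block before the absmax scale is chosen. A minimal NumPy sketch of that effect; illustrative only, not the actual TurboQuant or turbo3 kernel:

```python
# Why rotating before per-block absmax quantization helps: one outlier
# no longer dictates the shared scale. Illustrative, not the real kernel.
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard Transform, orthonormal (len must be a power of 2)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))

def quant_rtn(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Round-to-nearest with a single per-block absmax scale."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
v = rng.normal(size=128)
v[7] = 40.0  # a single outlier wrecks the shared absmax scale

err_plain = np.abs(quant_rtn(v) - v).mean()
# Orthonormal FWHT is its own inverse: rotate, quantize, rotate back.
err_rot = np.abs(fwht(quant_rtn(fwht(v))) - v).mean()
print(f"plain 4-bit error: {err_plain:.3f}  rotated 4-bit error: {err_rot:.3f}")
```

On this toy vector the rotated path shows a substantially smaller mean error, because the outlier's energy is spread across all 128 coefficients before the scale is picked.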
Google Research
Google Research@GoogleResearch·
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
1K replies · 5.8K reposts · 39K likes · 19.2M views