James Arslan

56 posts

@JamesArslanSwe

CTO of https://t.co/dolKx2UPD9 instagram @jamesarslanswe

Stockholm · Joined June 2024
186 Following · 17 Followers
Michal Majzlík
Michal Majzlík@michalmajzlik·
@tom_doerr This sounds too good. Does anyone have experience with it already who can share results? I want to test it on my single GPU + llama.cpp + llama-swap + Hermes Agent + Qwen/Gemma.
2 replies · 0 reposts · 2 likes · 530 views
James Arslan
James Arslan@JamesArslanSwe·
@TeksEdge 62 t/s is really good, but it still crashes on the 5090 from time to time
0 replies · 0 reposts · 1 like · 689 views
David Hendrickson
David Hendrickson@TeksEdge·
🔥 RTX 5090 + Gemma 4 31B: real user testing right now. 💳 32GB GDDR7 gives excellent headroom for higher quants on this dense 31B model. 🧪 Typical performance (llama.cpp + early user reports):
Quant | Approx. VRAM (weights + overhead) | Expected TPS (generation)
⚡ Q4_K_M | ~18–21 GB | 55–75+ t/s
📈 Q5_K_XL | ~22–25 GB | 45–65 t/s
🐢 Q6_K / Q8 | ~26–32+ GB | 35–55 t/s
Users are actively testing Unsloth UD-Q5_K_XL on RTX 5090 and tuning with TurboQuant / KV cache compression for better speed. Great quality + performance balance for local Gemma 4 31B inference 👌 Who else is running it? 👀
19 replies · 18 reposts · 220 likes · 39.2K views
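A quick back-of-envelope check on those VRAM figures (a sketch: the bits-per-weight values below are typical llama.cpp figures for these quant types, and the 2-4 GB overhead allowance is an assumption):

```python
# Rough VRAM estimate for a dense 31B model at common llama.cpp quants.
# Bits-per-weight values are approximate; the overhead allowance is a guess.
PARAMS = 31e9  # Gemma 4 31B, dense

BPW = {"Q4_K_M": 4.8, "Q5_K_XL": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

for quant, bpw in BPW.items():
    weights_gb = PARAMS * bpw / 8 / 1e9  # bits -> bytes -> GB
    # Allow ~2-4 GB extra for KV cache, activations, and CUDA overhead.
    print(f"{quant}: ~{weights_gb:.1f} GB weights, "
          f"~{weights_gb + 2:.0f}-{weights_gb + 4:.0f} GB total")
```

That lands close to the table's ~18-21 GB for Q4_K_M and shows why Q6_K/Q8 gets tight even on a 32 GB card.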
James Arslan
James Arslan@JamesArslanSwe·
@julien_c Gemma 4 crashes on the dense variant, but I'd say tool calling and understanding of tool usage are better on the Qwen models, while the quality of the output code is better on Gemma 4. It's also about 5 times slower…
0 replies · 0 reposts · 1 like · 4.1K views
Julien Chaumond
Julien Chaumond@julien_c·
so…. Qwen3.5 or Gemma 4?
205 replies · 18 reposts · 882 likes · 200.5K views
James Arslan
James Arslan@JamesArslanSwe·
It's dense by default: Gemma 4 31B doesn't have a "mode" switch. The 31B variant is a dense transformer (all 31B params active every token), unlike the 26B-A4B variant, which is MoE (3.8B active out of 25.2B). The model you download determines the architecture:
- gemma-4-31B-it = dense, 31B active, 61 t/s
- gemma-4-26B-A4B-it = MoE, 3.8B active, much faster
I'm using the dense 31B specifically because the quality gap is worth the speed tradeoff for complex coding tasks. The MoE 26B-A4B would be the faster alternative if you want Qwen3.5-like speed with Gemma quality. Full setup with both Qwen3.5 MoE + Gemma 4 Dense, TurboQuant KV compression, and a one-command install is at my repo
1 reply · 0 reposts · 1 like · 45 views
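A rough way to see where the dense-vs-MoE speed gap comes from: single-stream generation is largely memory-bandwidth-bound, so tokens/s scales with the bytes of weights read per token. A minimal sketch, where the bandwidth, bits-per-weight, and efficiency numbers are all assumptions:

```python
# Bandwidth-bound speed intuition for dense vs MoE. All numbers are
# assumptions: ~4.8 bits/weight, ~65% of peak bandwidth achieved.
BANDWIDTH = 1792e9     # RTX 5090 GDDR7, bytes/s (approx. spec)
BYTES_PER_PARAM = 0.6  # ~4.8 bits per weight at Q4_K_M-class quants

def est_tps(active_params: float, efficiency: float = 0.65) -> float:
    # Each generated token reads every active weight once.
    return BANDWIDTH * efficiency / (active_params * BYTES_PER_PARAM)

print(f"gemma-4-31B-it     (dense, 31B active): ~{est_tps(31e9):.0f} t/s")
print(f"gemma-4-26B-A4B-it (MoE, 3.8B active):  ~{est_tps(3.8e9):.0f} t/s")
```

With these guesses the dense model comes out near the quoted 61 t/s, and the MoE variant several times faster, matching the "much faster" characterization above.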
Unsloth AI
Unsloth AI@UnslothAI·
Google releases Gemma 4. ✨ Gemma 4 introduces 4 models: E2B, E4B, 26B-A4B, 31B. The multimodal reasoning models are under Apache 2.0. Run E2B and E4B on ~6GB RAM, and on phones. Run 26B-A4B and 31B on ~18GB.
GGUFs: huggingface.co/collections/un…
Guide: unsloth.ai/docs/models/ge…
Google DeepMind@GoogleDeepMind

Meet Gemma 4: our new family of open models you can run on your own hardware. Built for advanced reasoning and agentic workflows, we’re releasing them under an Apache 2.0 license. Here’s what’s new 🧵

48 replies · 168 reposts · 1.2K likes · 238.2K views
Tom Turney
Tom Turney@no_stp_on_snek·
Gemma 4 update: the asymmetric test crashed with garbage output! YES! This proves it's NOT CUDA-specific and NOT multi-GPU-specific. We just reproduced issue #47 on Metal with Gemma 4, the head_dim bug I've been chasing. Fixing now.
3 replies · 1 repost · 42 likes · 5K views
James Arslan
James Arslan@JamesArslanSwe·
Gave both models the same Asteroids game prompt:
Qwen3.5 (188 t/s): built it fast, needed 2 bug fixes
Gemma 4 (61 t/s): built it slower, worked perfectly first try
Speed vs quality: now I just pick the right model for the task. Both running locally on the same GPU. You can try it yourself at github.com/jamesarslan/lo…
0 replies · 0 reposts · 0 likes · 43 views
Ettore Di Giacinto
Ettore Di Giacinto@mudler_it·
APEX Gemma-4 coming! Slightly different MoE architecture: 30 layers, 128 experts / 8 active. With APEX that means:
Layers 0-4 (edge): highest precision (Q6_K experts)
Layers 5-9 (near-edge): Q5_K/Q6_K experts
Layers 10-19 (middle): lowest precision (profile-dependent)
Layers 20-24 (near-edge): Q5_K/Q6_K experts
Layers 25-29 (edge): highest precision (Q6_K experts)
APEX Balanced (18.2 GB) already beats F16 (47 GB) on perplexity: 316.4 vs 338.5! This is somewhat expected, as Gemma has always done well on wikitext-2, so I'd take it only as an initial hint. Non-imatrix quants are available now. A new v1.2 calibration dataset (code + agentic + wiki) for I-variants is coming with full benchmarks soon. 🤗 huggingface.co/mudler/gemma-4…
Demis Hassabis@demishassabis

Excited to launch Gemma 4: the best open models in the world for their respective sizes. Available in 4 sizes that can be fine-tuned for your specific task: 31B dense for great raw performance, 26B MoE for low latency, and effective 2B & 4B for edge device use - happy building!

5 replies · 4 reposts · 57 likes · 6K views
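The edge-heavy precision layout above reduces to a simple layer-index lookup. A minimal sketch of that mapping (illustrative only: the function and the middle-tier default are placeholders, not APEX's actual code):

```python
# Map a layer index to the expert quant tier mudler describes for a
# 30-layer APEX Gemma-4. The middle-layer precision is profile-dependent,
# so the default here is a placeholder.
def apex_expert_quant(layer: int, n_layers: int = 30,
                      middle_profile: str = "Q4_K") -> str:
    if layer < 5 or layer >= n_layers - 5:    # 0-4 and 25-29: edge
        return "Q6_K"
    if layer < 10 or layer >= n_layers - 10:  # 5-9 and 20-24: near-edge
        return "Q5_K/Q6_K"
    return middle_profile                     # 10-19: middle, lowest precision

print([apex_expert_quant(l) for l in range(30)])
```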
James Arslan
James Arslan@JamesArslanSwe·
Updated my open-source local AI coding pipeline, now with dual model support:
Qwen3.5-35B-A3B: 188 t/s, MoE (3B active), fast iteration
Gemma 4 31B: 61 t/s, dense (31B active), zero-bug code on first try
One-command setup, TurboQuant KV compression, OpenCode + Context7 + Chrome DevTools. All on a single RTX 5090, zero cloud APIs. github.com/jamesarslan/lo…
0 replies · 0 reposts · 0 likes · 77 views
James Arslan reposted
Z.ai
Z.ai@Zai_org·
Introducing GLM-5V-Turbo: Vision Coding Model
- Native Multimodal Coding: natively understands multimodal inputs including images, videos, design drafts, and document layouts.
- Balanced Visual and Programming Capabilities: achieves leading performance across core benchmarks for multimodal coding, tool use, and GUI agents.
- Deep Adaptation for Claude Code and Claw Scenarios: works in deep synergy with agents like Claude Code and OpenClaw.
Try it now: chat.z.ai
API: docs.z.ai/guides/vlm/glm…
Coding Plan trial applications: docs.google.com/forms/d/e/1FAI…
241 replies · 653 reposts · 5.8K likes · 1.9M views
James Arslan reposted
Qwen
Qwen@Alibaba_Qwen·
Demo 2: Audio-Visual Vibe Coding
Qwen@Alibaba_Qwen

🚀 Qwen3.5-Omni is here! Scaling up to a native omni-modal AGI. Meet the next generation of Qwen, designed for native text, image, audio, and video understanding, with major advances in both intelligence and real-time interaction. A standout feature: 'Audio-Visual Vibe Coding'. Describe your vision to the camera, and Qwen3.5-Omni-Plus instantly builds a functional website or game for you.
Offline Highlights:
🎬 Script-Level Captioning: Generate detailed video scripts with timestamps, scene cuts & speaker mapping.
🏆 SOTA Performance: Outperforms Gemini-3.1 Pro in audio and matches its audio-visual understanding.
🧠 Massive Capacity: Natively handles up to 10h of audio or 400s of 720p video, trained on 100M+ hours of data.
🌍 Global Reach: Recognizes 113 languages (speech) & speaks 36.
Real-time Features:
🎙️ Fine-Grained Voice Control: Adjust emotion, pace, and volume in real time.
🔍 Built-in Web Search & complex function calling.
👤 Voice Cloning: Customize your AI's voice from a short sample, with engineering rollout coming soon.
💬 Human-like Conversation: Smart turn-taking that understands real intent and ignores noise.
The Qwen3.5-Omni family includes Plus, Flash, and Light variants. Try it out:
Blog: qwen.ai/blog?id=qwen3.…
Realtime Interaction: click the VoiceChat/VideoChat button (bottom-right): chat.qwen.ai
HF-Demo: huggingface.co/spaces/Qwen/Qw…
HF-VoiceOnline-Demo: huggingface.co/spaces/Qwen/Qw…
API-Offline: alibabacloud.com/help/en/model-…
API-Realtime: alibabacloud.com/help/en/model-…

36 replies · 169 reposts · 1.6K likes · 208.3K views
buun
buun@spiritbuun·
I love Qwen3.5 27B, but no one ever talks about the elephant in the room: the architecture has very few attention layers. Context rot is a problem. In my experience it begins subtly misbehaving, ignoring system-prompt rules, as early as 9k tokens.
22 replies · 6 reposts · 203 likes · 21.4K views
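That 9k-token claim is cheap to probe: pad the context with filler and check whether a trivial system-prompt rule survives. A minimal sketch against a local OpenAI-compatible endpoint such as llama.cpp's llama-server; the URL, model name, and padding sizes are placeholders:

```python
# Probe for "context rot": does a trivial system-prompt rule still hold
# as the context grows? Endpoint and model name are placeholders.
import requests

SYSTEM = "Always answer entirely in uppercase."
FILLER = "This is neutral filler text about nothing in particular. " * 80

for chunks in (1, 3, 6, 12):  # very roughly ~1k tokens per chunk
    resp = requests.post("http://localhost:8080/v1/chat/completions", json={
        "model": "qwen3.5-27b",
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": FILLER * chunks + "\nSay hello."},
        ],
    }).json()
    reply = resp["choices"][0]["message"]["content"]
    print(f"~{chunks}k padded tokens: rule held = {reply.strip().isupper()}")
```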
Techjunkie Aman
Techjunkie Aman@Techjunkie_Aman·
A GTA IV dev build from 2007 just surfaced. Details:
• Dated Nov 2007 (months before release)
• 127GB Xbox 360 devkit archive
• Incomplete + partially corrupted
• Pulled from a Rockstar dev kit
Dataminers already found:
• Silenced pistol + unused weapons
• Beta NPCs + early Michelle model
• Unfinished aiming animations
• Ferry system from early trailers
• Early radio/audio placeholders
No full playable build yet. But modders are already extracting assets. This is rare development history. And the community is still digging. What would you want to uncover next?
33 replies · 250 reposts · 4.7K likes · 340.1K views
Techolay 🤖
Techolay 🤖@techolay·
Google announced TurboQuant, a new AI algorithm that cuts RAM usage for large language models by a factor of 6 and increases speed by up to 8x in software. As a result, shares of RAM manufacturers such as Samsung and Micron fell sharply.
6 replies · 3 reposts · 87 likes · 7.6K views
Joel - coffee/acc
Joel - coffee/acc@JoelDeTeves·
Qwen3.5-27B-GGUF with hermes agent is the way
English
14
6
284
14.8K
Loktar 🇺🇸
Loktar 🇺🇸@loktar00·
Just saw TurboQuant getting ported to llama.cpp... 4.6x KV cache compression. If this actually works well, we might eventually get models with near-1M context on a single 3090, considering Qwen 27B can already run at over 250K.
9 replies · 13 reposts · 221 likes · 12.8K views
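The arithmetic behind that hope, as a sketch: the KV cache grows linearly with context, at 2 (K and V) x layers x KV heads x head_dim x bytes per element, per token. The model dimensions below are placeholders for a Qwen-27B-class GQA model, not published specs:

```python
# Back-of-envelope KV-cache sizing; model dims are placeholders.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 4, 128
BYTES_PER_ELEM = 2  # fp16 cache

def kv_cache_gib(n_tokens: int, compression: float = 1.0) -> float:
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # K + V
    return n_tokens * per_token / compression / 2**30

for ctx in (250_000, 1_000_000):
    print(f"{ctx:>9,} tokens: fp16 {kv_cache_gib(ctx):5.1f} GiB -> "
          f"4.6x compressed {kv_cache_gib(ctx, 4.6):5.1f} GiB")
```

Under those assumptions, 4.6x compression takes a 1M-token cache from roughly 90 GiB down to about 20 GiB, which is why the "eventually" is doing real work: the weights still compete for the same 24 GB of VRAM.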
James Arslan
James Arslan@JamesArslanSwe·
Tested TurboQuant on real hardware: RTX 5090, Qwen3.5-35B-A3B (35B params, 3B active MoE). Using @spiritbuun's CUDA port for llama.cpp with the turbo3 KV cache:
- KV cache: 2,560 MiB → ~730 MiB (3.5x compression)
- Prompt processing actually got faster: 4,291 → 5,623 t/s
- Quality confirmed: no degradation at 131K context, where standard q8_0 falls apart on this model
The FWHT rotation + norm correction is what makes this work: it's not just fewer bits, it's fundamentally better quantization geometry. Standard per-block scaling breaks Qwen3.5 at 20-40K+ tokens. TurboQuant doesn't.
Built a full local AI coding pipeline around it: agentic coding with browser verification, doc search, all running on a single consumer GPU with zero cloud APIs. Setup + benchmarks + configs open-sourced: github.com/jamesarslan/lo…
Great research. Waiting for this to land in mainline llama.cpp.
2 replies · 0 reposts · 1 like · 744 views
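To make "better quantization geometry" concrete: a Walsh-Hadamard rotation is orthogonal (and self-inverse when orthonormally scaled), so it spreads a single outlier across the whole block before the absmax scale is chosen. A minimal NumPy sketch of that effect; illustrative only, not the actual TurboQuant or turbo3 kernel:

```python
# Why rotating before per-block absmax quantization helps: one outlier
# no longer dictates the shared scale. Illustrative, not the real kernel.
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard Transform, orthonormal (len must be a power of 2)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))

def quant_rtn(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Round-to-nearest with a single per-block absmax scale."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
v = rng.normal(size=128)
v[7] = 40.0  # a single outlier wrecks the shared absmax scale

err_plain = np.abs(quant_rtn(v) - v).mean()
# Orthonormal FWHT is its own inverse: rotate, quantize, rotate back.
err_rot = np.abs(fwht(quant_rtn(fwht(v))) - v).mean()
print(f"plain 4-bit error: {err_plain:.3f}  rotated 4-bit error: {err_rot:.3f}")
```

On this toy vector the rotated path shows a substantially smaller mean error, because the outlier's energy is spread across all 128 coefficients before the scale is picked.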
Google Research
Google Research@GoogleResearch·
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
1K replies · 5.8K reposts · 39K likes · 19.2M views