JJJYmmm

63 posts

JJJYmmm banner
JJJYmmm

JJJYmmm

@JJJYmmm2002

https://t.co/sOpeMKQlpT

Katılım Haziran 2023
15 Takip Edilen46 Takipçiler
JJJYmmm retweetledi
Georgi Gerganov
Georgi Gerganov@ggerganov·
llama.cpp adds MTP for the Qwen3.6 family This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further. Special thanks to Aman Gupta for leading this development! github.com/ggml-org/llama…
English
48
185
1.2K
270.3K
JJJYmmm
JJJYmmm@JJJYmmm2002·
@ivanfioravanti from a quick look, it seems to only port the mtp layers, without actually doing real multi token prediction
English
0
0
1
197
JJJYmmm
JJJYmmm@JJJYmmm2002·
@_LuoFuli waiting for the tech report🫡
English
0
0
0
518
JJJYmmm retweetledi
Julien Chaumond
Julien Chaumond@julien_c·
This is where we are right now. And i’m not gonna lie it feels pretty magical 🧚‍♀️ Qwen3.6 27B running inside of Pi coding agent via Llama.cpp on the MacBook Pro For non-trivial tasks on the @huggingface codebases, this feels very, very close to hitting the latest Opus in Claude Code, or whatever shiny monopolistic closed source API of the day is. In full airplane mode. Most people haven’t realized this yet. If you have, it means you have a huge headstart to what I call the second revolution of AI. Powerful local models for efficiency, security, privacy, sovereignty 🔥
Julien Chaumond tweet media
English
263
455
5.3K
654K
JJJYmmm
JJJYmmm@JJJYmmm2002·
@ggerganov speculative decoding support in llama.cpp is really, really, really useful. preciate all the effort you guys put into this 😊
English
1
0
13
3K
Georgi Gerganov
Georgi Gerganov@ggerganov·
llama-server -hf ggml-org/Qwen3.6-27B-GGUF --spec-default
20
55
675
75K
JJJYmmm retweetledi
Prince Canuma
Prince Canuma@Prince_Canuma·
Next mlx-vlm release will ship with continuous batching support on the server 🚀 What's coming: → Continuous batching — new requests join the active batch immediately, no waiting. Mixed image + text batches supported → OpenAI-compatible API — field-for-field match with mlx-lm, reasoning/content split for thinking models, tag-aware streaming → Multi-turn tool calling — full tool use support across streaming and non-streaming, works with Gemma4 and other templates → Vision feature caching — cache image embeddings across turns. Gemma4: 228x speedup, Qwen3.5: 23x on cache hit All running locally on Apple Silicon. Check our this demo running 4 concurrent requests (mixed image + text) to gemma-4-26B-A4B-IT by @googlegemma in bf16 using Pi + MLX-VLM server on my M3 Ultra. One of the requests ingests a 8K resolution image!
English
18
21
295
82K
JJJYmmm retweetledi
vLLM
vLLM@vllm_project·
🎉 Congrats @Alibaba_Qwen on the first open-weight Qwen3.6! Stronger agentic coding and a new thinking preservation option to retain reasoning context across turns. Same architecture as Qwen3.5, so serving teams can upgrade in place. Day-0 support in vLLM v0.19+. Thinking, tool calling, MTP speculative decoding, and text-only mode all ready. 📖 Same recipe applies: docs.vllm.ai/projects/recip…
vLLM tweet media
Qwen@Alibaba_Qwen

⚡ Meet Qwen3.6-35B-A3B:Now Open-Source!🚀🚀 A sparse MoE model, 35B total params, 3B active. Apache 2.0 license. 🔥 Agentic coding on par with models 10x its active size 📷 Strong multimodal perception and reasoning ability 🧠 Multimodal thinking + non-thinking modes Efficient. Powerful. Versatile. Try it now👇 Blog:qwen.ai/blog?id=qwen3.… Qwen Studio:chat.qwen.ai HuggingFace:huggingface.co/Qwen/Qwen3.6-3… ModelScope:modelscope.cn/models/Qwen/Qw… API(‘Qwen3.6-Flash’ on Model Studio):Coming soon~ Stay tuned

English
4
24
341
17.7K
JJJYmmm retweetledi
Qwen
Qwen@Alibaba_Qwen·
⚡ Meet Qwen3.6-35B-A3B:Now Open-Source!🚀🚀 A sparse MoE model, 35B total params, 3B active. Apache 2.0 license. 🔥 Agentic coding on par with models 10x its active size 📷 Strong multimodal perception and reasoning ability 🧠 Multimodal thinking + non-thinking modes Efficient. Powerful. Versatile. Try it now👇 Blog:qwen.ai/blog?id=qwen3.… Qwen Studio:chat.qwen.ai HuggingFace:huggingface.co/Qwen/Qwen3.6-3… ModelScope:modelscope.cn/models/Qwen/Qw… API(‘Qwen3.6-Flash’ on Model Studio):Coming soon~ Stay tuned
Qwen tweet media
English
446
1.6K
11.6K
2.7M
Zhijian Liu
Zhijian Liu@zhijianliu_·
🔥 DFlash x MLX is happening! Shoutout to @aryagm01 for the early work on this. We're building on the momentum. Native MLX support, more models (Qwen3.5), up to 4x faster. Lossless! 👉 github.com/z-lab/dflash
English
56
89
757
213.9K
JJJYmmm retweetledi
Red Hat AI
Red Hat AI@RedHat_AI·
Michael Goin (@mgoin_) walks through what's new in @vllm_project v0.17, v0.18, and v0.19 in ~8 minutes. Flash Attention 4, new performance modes, zero-bubble async scheduling, online MXFP4 quantization, Gemma 4, and a lot more. 1,592 commits. 682 contributors (163 new). 🎉 🚀
English
3
13
114
23.1K
Chujie Zheng
Chujie Zheng@ChujieZheng·
We are planning to open-source the Qwen3.6 models (particularly medium-sized versions) to facilitate local deployment and customization for developers. Please vote for the model size you are **most** anticipating—the community’s voice is vital to us!
English
313
259
4.1K
300.4K
JJJYmmm
JJJYmmm@JJJYmmm2002·
@eliebakouch Also visual bidirectional attention on swa layer for 31b/26a4 variant. (maybe bidirectional is costly for full-attn) btw the vit’s rope base is very small 100 vs 10000 usually
English
0
0
0
296
elie
elie@eliebakouch·
google gemma 4 architecture is very interesting and every model has some subtle differences, here is a recap: > per layer embedding only on the small variant > no attention scale (usually you divide qk^T by sqrt(d), they don't) > they do QK norm + V norm as well > they share K and V for the large variant > they do quite aggressive KV cache sharing on the small variant > sliding window (512 and 1024) is bigger than gpt-oss 128 and they don't use sinks! > softcapping > rope only on part of the dimensions + different rope theta for the local/global layer
elie tweet media
Omar Sanseviero@osanseviero

Gemma 4 is here! 🧠 31B and 26B A4B for models with impressive intelligence per parameter 🤏E2B and E4B for mobile and IoT 🤗Apache 2.0 🤖Base and IT checkpoints available Available in AI Studio, Hugging Face, Ollama, Android, and your favorite OS tools 🚀Download it today!

English
19
52
563
49.6K
Georgi Gerganov
Georgi Gerganov@ggerganov·
Let me demonstrate the true power of llama.cpp: - Running on Mac Studio M2 Ultra (3 years old) - Gemma 4 26B A4B Q8_0 (full quality) - Built-in WebUI (ships with llama.cpp) - MCP support out of the box (web-search, HF, github, etc.) - Prompt speculative decoding The result: 300t/s (realtime video)
English
132
261
3.3K
778.8K