JJJYmmm

63 posts

@JJJYmmm2002

https://t.co/sOpeMKQlpT

Joined June 2023

15 Following · 46 Followers

JJJYmmm reposted

Georgi Gerganov@ggerganov·18 May

llama.cpp adds MTP for the Qwen3.6 family
This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further. Special thanks to Aman Gupta for leading this development! github.com/ggml-org/llama…

JJJYmmm@JJJYmmm2002·10 May

running π0 in cpp lol, basically a vibe toy project built on ggml. github.com/JJJYmmm/vla.cpp

JJJYmmm@JJJYmmm2002·9 May

cca is interesting, just added support for it in transformers github.com/huggingface/tr…

Zyphra@ZyphraAI

Today we're releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵

JJJYmmm@JJJYmmm2002·8 May

Ernie 4.5 already used this trick last year. 😂 yiyan.baidu.com/blog/publicati…

Chenyang Lyu@Chenyang_Lyu

We found a surprisingly simple trick for temporal grounding in multimodal LLMs: just draw the timestamps directly onto the input. Overlay time on video frames, embed a time axis into spectrograms. No training, no new architecture — works out of the box. Interestingly, a separate/detached clock hurts performance; time has to live with the content it refers to. Tech report: github.com/lyuchenyang/ti…

JJJYmmm@JJJYmmm2002·29 Apr

@ivanfioravanti from a quick look, it seems to only port the mtp layers, without actually doing real multi token prediction

Ivan Fioravanti ᯅ@ivanfioravanti·29 Apr

MLX Native MTP support for Qwen 3.5/3.6 models? I need to test this PR! github.com/ml-explore/mlx…

JJJYmmm@JJJYmmm2002·27 Apr

@_LuoFuli waiting for the tech report🫡

Fuli Luo@_LuoFuli·27 Apr

Just dropped two open-source models: MiMo-V2.5-Pro (Code Agent, 1T total) and MiMo-V2.5 (Multimodal Agent, 310B total). Oh and one more thing — we're giving devs & creators 100T tokens on us. Go build something cool 🛠️ 🎁 100T Free Token Grant for Builders 100t.xiaomimimo.com

Xiaomi MiMo@XiaomiMiMo

Xiaomi MiMo-V2.5 is now officially open-sourced! MIT License, supporting commercial deployment, continued training, and fine-tuning - no additional authorization required. Two models, both supporting a 1M-token context window:
• MiMo-V2.5-Pro: built for complex agent and coding tasks, ranking No.1 among open-source models on GDPVal-AA and ClawEval
• MiMo-V2.5: a native omni-modal model with strong agent capabilities
A model's value isn't measured by rankings alone; it's measured by the problems it solves. Let's build with MiMo now!
🤗 Weights: huggingface.co/collections/Xi…
📄 Blog: mimo.xiaomi.com/index#blog

JJJYmmm reposted

Julien Chaumond@julien_c·24 Apr

This is where we are right now. And i’m not gonna lie it feels pretty magical 🧚‍♀️ Qwen3.6 27B running inside of Pi coding agent via Llama.cpp on the MacBook Pro For non-trivial tasks on the @huggingface codebases, this feels very, very close to hitting the latest Opus in Claude Code, or whatever shiny monopolistic closed source API of the day is. In full airplane mode. Most people haven’t realized this yet. If you have, it means you have a huge headstart to what I call the second revolution of AI. Powerful local models for efficiency, security, privacy, sovereignty 🔥

JJJYmmm@JJJYmmm2002·22 Apr

@ggerganov speculative decoding support in llama.cpp is really, really, really useful. preciate all the effort you guys put into this 😊

Georgi Gerganov@ggerganov·22 Apr

llama-server -hf ggml-org/Qwen3.6-27B-GGUF --spec-default

JJJYmmm reposted

Fireworks AI@FireworksAI_HQ·18 Apr

x.com/i/article/2045…

JJJYmmm reposted

Prince Canuma@Prince_Canuma·16 Apr

Next mlx-vlm release will ship with continuous batching support on the server 🚀
What's coming:
→ Continuous batching: new requests join the active batch immediately, no waiting. Mixed image + text batches supported
→ OpenAI-compatible API: field-for-field match with mlx-lm, reasoning/content split for thinking models, tag-aware streaming
→ Multi-turn tool calling: full tool use support across streaming and non-streaming, works with Gemma4 and other templates
→ Vision feature caching: cache image embeddings across turns. Gemma4: 228x speedup, Qwen3.5: 23x on cache hit
All running locally on Apple Silicon. Check out this demo running 4 concurrent requests (mixed image + text) to gemma-4-26B-A4B-IT by @googlegemma in bf16 using Pi + MLX-VLM server on my M3 Ultra. One of the requests ingests an 8K resolution image!

JJJYmmm reposted

vLLM@vllm_project·16 Apr

🎉 Congrats @Alibaba_Qwen on the first open-weight Qwen3.6! Stronger agentic coding and a new thinking preservation option to retain reasoning context across turns. Same architecture as Qwen3.5, so serving teams can upgrade in place. Day-0 support in vLLM v0.19+. Thinking, tool calling, MTP speculative decoding, and text-only mode all ready. 📖 Same recipe applies: docs.vllm.ai/projects/recip…

Qwen@Alibaba_Qwen

⚡ Meet Qwen3.6-35B-A3B：Now Open-Source！🚀🚀 A sparse MoE model, 35B total params, 3B active. Apache 2.0 license. 🔥 Agentic coding on par with models 10x its active size 📷 Strong multimodal perception and reasoning ability 🧠 Multimodal thinking + non-thinking modes Efficient. Powerful. Versatile. Try it now👇 Blog：qwen.ai/blog?id=qwen3.… Qwen Studio：chat.qwen.ai HuggingFace：huggingface.co/Qwen/Qwen3.6-3… ModelScope：modelscope.cn/models/Qwen/Qw… API（‘Qwen3.6-Flash’ on Model Studio）：Coming soon～ Stay tuned

JJJYmmm reposted

Qwen@Alibaba_Qwen·16 Apr

JJJYmmm@JJJYmmm2002·15 Apr

@zhijianliu_ @aryagm01 nice work!

Zhijian Liu@zhijianliu_·15 Apr

🔥 DFlash x MLX is happening! Shoutout to @aryagm01 for the early work on this. We're building on the momentum. Native MLX support, more models (Qwen3.5), up to 4x faster. Lossless! 👉 github.com/z-lab/dflash

JJJYmmm reposted

Red Hat AI@RedHat_AI·14 Apr

Michael Goin (@mgoin_) walks through what's new in @vllm_project v0.17, v0.18, and v0.19 in ~8 minutes. Flash Attention 4, new performance modes, zero-bubble async scheduling, online MXFP4 quantization, Gemma 4, and a lot more. 1,592 commits. 682 contributors (163 new). 🎉 🚀

JJJYmmm@JJJYmmm2002·14 Apr

nice work! but maybe a small typo on line 25? if the Nth token is rejected and resampled, n_{accept} still becomes N due to line 21, but we cannot get a bonus token from it.

Chenfeng_X@Chenfeng_X

🤔The more I studied diffusion language models, the more I came to appreciate the simplicity of autoregressive (AR) language models. AR models are trained to agree with what they generate, and their serving stacks are built to preserve that structure. DLMs often do neither: they lack introspective consistency, and high TPF does not necessarily translate into high real-world TPS. We propose Introspective Diffusion Language Model (I-DLM), which unifies introspection and generation in a single pass: 1. 🧑‍🎓I-DLM brings introspective consistency to DLMs with only 5B training tokens, achieving AR-thinking-level quality. 2. 🚀 I-DLM carefully trades compute for higher TPF while converting that advantage into real TPS under high-concurrency serving. 📖Website: introspective-diffusion.github.io ⌨️Code: github.com/Introspective-…
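The acceptance-count point raised in the reply above can be sketched in a few lines. This is a minimal illustration of ordinary draft-and-verify speculative decoding, not the algorithm listing from the I-DLM paper (whose lines 21 and 25 are not reproduced in this feed). The function name `verify_draft` and the boolean `accept_flags` input are hypothetical. The idea is that the number of emitted tokens alone cannot decide whether a bonus token is allowed, so a separate flag is tracked.

```python
from typing import List, Tuple


def verify_draft(accept_flags: List[bool]) -> Tuple[int, bool]:
    """Return (tokens_emitted, bonus_allowed) for one verification step.

    accept_flags[i] is True if the verifier accepted drafted token i.
    (Hypothetical interface, used only for illustration.)
    """
    n = len(accept_flags)
    for i, ok in enumerate(accept_flags):
        if not ok:
            # Token i is rejected and resampled. The resampled token is still
            # emitted, so i + 1 tokens come out of this step, but the logits
            # for the next position were conditioned on the rejected draft,
            # so no bonus token can follow.
            return i + 1, False
    # Every drafted token was accepted, so the verifier's distribution at
    # position n is valid and one bonus token may be sampled from it.
    return n, True
```

With three drafted tokens, `verify_draft([True, True, False])` and `verify_draft([True, True, True])` both emit three tokens, yet only the second permits a bonus token. A check of the form `n_accept == N` cannot tell the two cases apart.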

JJJYmmm@JJJYmmm2002·9 Apr

@ZaiforStartups 🫪

JJJYmmm reposted

Dimitris Papailiopoulos@DimitrisPapail·8 Apr

x.com/i/article/2041…

JJJYmmm@JJJYmmm2002·3 Apr

@ChujieZheng 🫪🫪🫪

Chujie Zheng@ChujieZheng·3 Apr

We are planning to open-source the Qwen3.6 models (particularly medium-sized versions) to facilitate local deployment and customization for developers. Please vote for the model size you are **most** anticipating—the community’s voice is vital to us!

JJJYmmm@JJJYmmm2002·2 Apr

@eliebakouch Also visual bidirectional attention on the swa layers for the 31b/26b-a4b variants (maybe bidirectional is too costly for full-attn). btw the vit's rope base is very small: 100 vs the usual 10000
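To make the RoPE base remark concrete, here is a small sketch using the textbook RoPE frequency formula, `base ** (-2i / head_dim)`. The head dimension of 64 is an arbitrary value chosen for illustration, not a Gemma 4 setting, and the helper names are hypothetical.

```python
import math
from typing import List


def rope_inv_freq(head_dim: int, base: float) -> List[float]:
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Positions needed for the slowest-rotating pair to complete one turn."""
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]


if __name__ == "__main__":
    # head_dim=64 is an illustrative assumption only.
    for base in (100.0, 10000.0):
        print(base, longest_wavelength(64, base))
```

A smaller base compresses the frequency spectrum, so the slowest-rotating dimensions wrap around after far fewer positions than with a base of 10000. That may be reasonable for a vision encoder whose position indices span a bounded patch grid, though the thread does not state the motivation.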

elie@eliebakouch·2 Apr

google gemma 4 architecture is very interesting and every model has some subtle differences, here is a recap:
> per layer embedding only on the small variant
> no attention scale (usually you divide qk^T by sqrt(d), they don't)
> they do QK norm + V norm as well
> they share K and V for the large variant
> they do quite aggressive KV cache sharing on the small variant
> sliding window (512 and 1024) is bigger than gpt-oss 128 and they don't use sinks!
> softcapping
> rope only on part of the dimensions + different rope theta for the local/global layer

Omar Sanseviero@osanseviero

Gemma 4 is here! 🧠 31B and 26B A4B for models with impressive intelligence per parameter 🤏E2B and E4B for mobile and IoT 🤗Apache 2.0 🤖Base and IT checkpoints available Available in AI Studio, Hugging Face, Ollama, Android, and your favorite OS tools 🚀Download it today!
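Two items from the recap above, the missing attention scale and softcapping, are easy to show in plain Python. This is a sketch under stated assumptions: the soft-cap uses the `cap * tanh(x / cap)` form known from earlier Gemma releases, the cap value of 30.0 in the usage note is a placeholder, and the thread does not say which logits Gemma 4 applies it to. Function names are hypothetical.

```python
import math
from typing import List


def softcap(x: float, cap: float) -> float:
    """Smoothly bound x to [-cap, cap] via cap * tanh(x / cap)."""
    return cap * math.tanh(x / cap)


def attention_scores(q: List[float], keys: List[List[float]], scale: bool) -> List[float]:
    """Dot-product scores of one query against each key.

    scale=True applies the usual 1/sqrt(d) factor; scale=False leaves the
    raw dot products, as the recap describes for Gemma 4.
    """
    d = len(q)
    factor = 1.0 / math.sqrt(d) if scale else 1.0
    return [factor * sum(a * b for a, b in zip(q, k)) for k in keys]
```

For example, `softcap(1000.0, 30.0)` stays within the cap while small inputs pass through nearly unchanged. Without the 1/sqrt(d) factor, raw scores grow with the head dimension, which is presumably why the QK norm mentioned in the recap matters for keeping them in range.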

JJJYmmm@JJJYmmm2002·2 Apr

@ggerganov cool!

Georgi Gerganov@ggerganov·2 Apr

Let me demonstrate the true power of llama.cpp:
- Running on Mac Studio M2 Ultra (3 years old)
- Gemma 4 26B A4B Q8_0 (full quality)
- Built-in WebUI (ships with llama.cpp)
- MCP support out of the box (web-search, HF, github, etc.)
- Prompt speculative decoding
The result: 300t/s (realtime video)
