
Shom



Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
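The blog post itself isn't quoted here, so as a hedged illustration of the general idea behind KV-cache compression (not TurboQuant's actual algorithm, which is not described in this post), here is a minimal per-channel int8 quantization sketch: keys/values are stored as int8 plus one float scale per channel, shrinking an fp32 cache 4x.

```python
# Minimal sketch of per-channel int8 KV-cache quantization.
# This is an illustrative assumption, NOT TurboQuant's method: store the
# cache as int8 plus a per-channel float scale, dequantize on read.
import numpy as np

def quantize_kv(x: np.ndarray):
    """Quantize a (seq_len, head_dim) cache tensor to int8 per channel."""
    scale = np.abs(x).max(axis=0, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on dead channels
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an fp32 approximation of the original cache."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.normal(size=(128, 64)).astype(np.float32)  # toy (seq_len, head_dim) cache
q, s = quantize_kv(kv)
recon = dequantize_kv(q, s)

# int8 storage is 4x smaller than fp32; rounding error per entry is at
# most scale/2 for that channel.
print(q.nbytes, kv.nbytes)
```

A real 6x+ scheme would go below 8 bits (e.g. 4-bit groups with shared scales) and handle outlier channels separately; this sketch only shows the storage/scale bookkeeping that any such method builds on.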




One company in my last round of interviews really outdid itself: they sent me home to vibe-code a full front end and back end, with unit tests, documentation, code architecture, and UI all required. It took me a solid 5-6 hours to finish. The cost of interviewing is way too high. 🤡🤡🤡






🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵




me stepping down. bye my beloved qwen.








GPT-5.2 Pro does about 50% of the $ volume on OpenRouter as GPT-5.2. What are people using GPT-5.2 Pro for? Here's the category breakdown:

GPT-5.2 Pro is more heavily used for:
- Science (6.7% vs 2.8% for Standard)
- Finance (2.6% vs 1.3% for Standard)
- Legal (1.2% vs 0.5% for Standard)

GPT-5.2 Standard is more heavily used for:
- Academia (3.4% vs 2.0% for Pro)
- Programming (10.7% vs 9.8% for Pro)
- Technology (5.6% vs 4.7% for Pro)

See openrouter.ai/rankings for more insights

Today, we release LFM2.5, our most capable family of tiny on-device foundation models. It's built to power reliable on-device agentic applications: higher quality, lower latency, and broader modality support in the ~1B parameter class.

> LFM2.5 builds on our LFM2 device-optimized hybrid architecture
> Pretraining scaled from 10T → 28T tokens
> Expanded reinforcement learning post-training
> Higher ceilings for instruction following 🧵
