Shom
@ShomLinEd
665 posts
language model | sequence modeling | education | HCI
Joined September 2021
2.1K Following · 366 Followers
clem 🤗 @ClementDelangue
We need more open agent traces datasets. Who can help?
86 replies · 43 reposts · 471 likes · 94.5K views
Shom reposted
Jianyang Gao @gaoj0017
The TurboQuant paper (ICLR 2026) contains serious issues in how it describes RaBitQ, including incorrect technical claims and misleading theory/experiment comparisons. We flagged these issues to the authors before submission. They acknowledged them, but chose not to fix them. The paper was later accepted and widely promoted by Google, reaching tens of millions of views.

We're speaking up now because once a misleading narrative spreads, it becomes much harder to correct. We've written a public comment on OpenReview (openreview.net/forum?id=tO3AS…). We would greatly appreciate your attention and help in sharing it.
Google Research @GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

83 replies · 885 reposts · 5.9K likes · 821.4K views
Shom reposted
Yu Zhang 🐙🌘 @yzhang_cs
flash-linear-attention is now seeing over 15,000 daily downloads. 📈 We @SonglinYang4 @uniartisan are honored to see fla becoming a piece of the core infrastructure for efficient model archs. Grateful to the community for the trust and support. github.com/fla-org/flash-…
[image attached]
7 replies · 26 reposts · 230 likes · 24.8K views
Shom @ShomLinEd
@siantgirl "To make sure the current version can 'just run' in the environment, the default retrieval layer uses a lightweight text-similarity implementation rather than additionally depending on a vector database. It can later be swapped smoothly for pgvector, Milvus, or FAISS." That definitely won't do; at the very least implement basic vector matching.
0 replies · 0 reposts · 7 likes · 1.7K views
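A minimal sketch of the "basic vector matching" being asked for here: brute-force cosine similarity over precomputed document embeddings (numpy only; the function and array names are illustrative, not the project's actual code):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    # Normalize so a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # Indices of the k most similar documents, best match first.
    return np.argsort(-scores)[:k]
```

Swapping in pgvector, Milvus, or FAISS later mostly changes where `doc_vecs` lives and how the top-k search executes, not this interface.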
Ryo @siantgirl
The verdict is in: they said my completeness and quality don't meet their bar. Vibe coding has been around less than a year and every company already has its own interview questions for it. Above is the code I wrote with AI: github.com/DancyRyo/fe Can anyone take a look at where it falls short? The task was to build RAG-augmented retrieval 🥹🥹🥹 If it doesn't meet the bar, I'm happy to learn humbly.
Ryo @siantgirl

Last round of interviews, one company was really something: they had me go home and vibe-code a full front end and back end, with requirements on unit tests, documentation, code architecture, and UI. It took me a solid 5-6 hours. The cost of interviewing is way too high. 🤡🤡🤡

37 replies · 4 reposts · 109 likes · 70.5K views
Shom @ShomLinEd
@hbouammar How will RLMs using Python with these primitives (filter, map, etc.) perform?
1 reply · 0 reposts · 1 like · 353 views
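A hypothetical sketch of what an RLM-style loop over those primitives could look like: the root call filters and maps over context chunks via recursive sub-calls (the `llm` callable and the prompts are invented for illustration):

```python
def rlm_answer(llm, chunks, question):
    # filter: a sub-call judges each chunk's relevance to the question.
    relevant = [c for c in chunks
                if llm(f"Does this help answer '{question}'? yes/no\n{c}").strip() == "yes"]
    # map: summarize each surviving chunk with respect to the question.
    notes = [llm(f"Summarize w.r.t. '{question}':\n{c}") for c in relevant]
    # reduce: answer from condensed notes instead of the raw long context.
    return llm(f"Answer '{question}' using these notes:\n" + "\n".join(notes))
```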
Shom @ShomLinEd
@Jiaxi_Cui Whisper is pretty old now; try a newer model.
0 replies · 0 reposts · 0 likes · 497 views
Panda @Jiaxi_Cui
Fun fact: Whisper is already five years old, yet API costs for ASR speech-to-text remain very high.
28 replies · 9 reposts · 166 likes · 62.2K views
Shom @ShomLinEd
There's a long, rare Chinese token in K2's tokenizer: 豫冠薰衣草疤痕精华素. It must be pure coincidence that both Composer 2 and K2.5 struggle to repeat it.
[two images attached]
5 replies · 10 reposts · 390 likes · 34K views
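One way to reproduce this kind of probe: scan a tokenizer's vocabulary for unusually long single tokens, then ask the model to repeat them verbatim (the checkpoint id here is an assumption; any Hugging Face tokenizer can be scanned the same way):

```python
from transformers import AutoTokenizer

# Assumed checkpoint id; substitute whichever tokenizer you want to inspect.
tok = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2-Instruct", trust_remote_code=True)

# Decode every id individually and surface the longest single-token strings;
# rare memorized strings like 豫冠薰衣草疤痕精华素 tend to show up near the top.
longest = sorted(range(tok.vocab_size), key=lambda i: len(tok.decode([i])), reverse=True)
for i in longest[:20]:
    print(i, tok.decode([i]))
```

Because such a string maps to a single token id, a model that rarely saw it in training has no reliable way to spell it back out character by character.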
Shom @ShomLinEd
These esoteric languages seem very unfriendly to tokenization... For Brainfuck, the basic single-character units get stitched together arbitrarily by the tokenizer, making generation error-prone. Though it also stems from the fact that these languages are underrepresented in training data.
[image attached]
Lossfunk @lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

1 reply · 1 repost · 8 likes · 739 views
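The mismatch is easy to see by tokenizing a Brainfuck fragment with an ordinary BPE tokenizer (GPT-2's here, purely as an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
bf = "++++++++[>++++[>++>+++<<-]>+.]"  # fragment of a Brainfuck program
print(tok.tokenize(bf))
# Runs of '+' and bracket sequences merge into multi-character tokens, so the
# model's units don't line up with Brainfuck's single-character instructions.
```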
You Jiacheng @YouJiacheng
"anyone on MiMo Team with fewer than 100 conversations tomorrow can quit." 100 conversations per day? that's heavy, if conversation = session.
Fuli Luo @_LuoFuli

MiMo-V2-Pro & Omni & TTS is out. Our first full-stack model family built truly for the Agent era.

I call this a quiet ambush — not because we planned it, but because the shift from Chat to Agent paradigm happened so fast, even we barely believed it. Somewhere in between was a process that was thrilling, painful, and fascinating all at once.

The 1T base model started training months ago. The original goal was long-context reasoning efficiency. Hybrid Attention carries real innovation, without overreaching — and it turns out to be exactly the right foundation for the Agent era. 1M context window. MTP inference for ultra-low latency and cost. These architectural decisions weren't trendy. They were a structural advantage we built before we needed it.

What changed everything was experiencing a complex agentic scaffold — what I'd call orchestrated Context — for the first time. I was shocked on day one. I tried to convince the team to use it. That didn't work. So I gave a hard mandate: anyone on MiMo Team with fewer than 100 conversations tomorrow can quit. It worked. Once the team's imagination was ignited by what agentic systems could do, that imagination converted directly into research velocity.

People ask why we move so fast. I saw it firsthand building DeepSeek R1. My honest summary:
— Backbone and Infra research has long cycles. You need strategic conviction a year before it pays off.
— Posttrain agility is a different muscle: product intuition driving evaluation, iteration cycles compressed, paradigm shifts caught early.
— And the constant: curiosity, sharp technical instinct, decisive execution, full commitment — and something that's easy to underestimate: a genuine love for the world you're building for.

We will open-source — when the models are stable enough to deserve it.

From Beijing, very late, not quite awake.

6 replies · 2 reposts · 56 likes · 12.3K views
Shom @ShomLinEd
@SuJinYan123 @SeTriones Storing a state every turn feels wasteful: the state is larger than the KV cache, so it hardly seems worth it. Explicit cache-control mechanisms would probably work better.
0 replies · 0 reposts · 1 like · 24 views
susun @SuJinYan123
@SeTriones Right, that's how you'd do it. But in an agent loop, storing one state per turn is enough, though that leaks the abstraction a bit. For example, with seq len 64 under agent coding, there are definitely many states that are never read.
3 replies · 0 reposts · 0 likes · 131 views
susun @SuJinYan123
Chatted with a veteran inference engineer this morning and we got onto linear attention, which suddenly gave me an idea. Maybe Pegainfer's leverage lies in these non-standard attention models; yes, Pegainfer's future focus should be those unusual models. It also offers some inspiration for Pegaflow: linear attention state needs to be stored too.
4 replies · 0 reposts · 18 likes · 1.8K views
Shom @ShomLinEd
@scaling01 Doesn't seem efficient, as the cached recurrent state for a small segment is far larger than the KV cache.
0 replies · 0 reposts · 0 likes · 391 views
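A back-of-the-envelope comparison behind this thread (dimensions are illustrative, not any particular model's): a linear-attention head carries a fixed d_k x d_v state, while a KV cache costs d_k + d_v elements per cached token, so the state only wins past a break-even segment length:

```python
# Illustrative per-head element counts for one layer.
d_k = d_v = 128
state_elems = d_k * d_v       # recurrent state: 16,384 elements, fixed size
kv_per_token = d_k + d_v      # KV cache: 256 elements per cached token

break_even = state_elems // kv_per_token
print(break_even)  # 64: for segments shorter than ~64 tokens, caching the
                   # recurrent state costs more than caching the raw K/V
```

This also matches the "seq len 64" example upthread.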
Shom reposted
Dawning Road @TheDawningRoad
Introducing Agent Debugger. LLM-based agent execution trajectories are becoming longer and longer, making them difficult to understand and to quality-control. To address this, Agent Debugger offers two functions:
Question Answering (QA): reliably answers users' questions about a long agent trace.
Quality Check: analyzes potential errors hidden in the trace, such as tool-call errors. If such errors are present during model training, they can hurt model performance.
Details at: dawning-road.github.io/blog/agent-deb…
[image attached]
0 replies · 2 reposts · 6 likes · 179 views
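A toy illustration of the quality-check idea: scan a trace for failed tool calls (the trace schema, field names, and error heuristic are all hypothetical, not Agent Debugger's actual logic):

```python
def find_tool_errors(trace: list[dict]) -> list[tuple[int, str, str]]:
    """Return (step index, tool name, error snippet) for suspect tool calls."""
    issues = []
    for i, step in enumerate(trace):
        if step.get("type") != "tool_call":
            continue
        result = str(step.get("result", ""))
        # Crude heuristics: explicit error status or a traceback in the output.
        if step.get("status") == "error" or "Traceback" in result:
            issues.append((i, step.get("tool", "?"), result[:80]))
    return issues
```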
Shom @ShomLinEd
@rosinality It is one of the rare works where we can peek into frontier model design. It's been rumoured that Gemma 3n shares its architecture with Gemini, since the decoded Gemma 3n weight file has "gemini" in its weight names.
1 reply · 0 reposts · 2 likes · 31 views
Rosinality @rosinality
@ShomLinEd Yup, there were precedents. But still it is a bit hard for me to believe that Gemma 3n alone sparked this.
1 reply · 0 reposts · 0 likes · 179 views
Rosinality @rosinality
Using embedding output as an input to FFNs instead of attention outputs. Who inspired everyone to work on embedding layers?
[image attached]
4 replies · 18 reposts · 251 likes · 17.4K views
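A minimal PyTorch sketch of the pattern as described, with the FFN reading the embedding stream instead of the attention output (the wiring and shapes are my assumption for illustration, not the design in the screenshot):

```python
import torch
import torch.nn as nn

class EmbeddingFedBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(x, x, x)[0]
        # Key difference: the FFN consumes the token embeddings directly,
        # not the attention output stream.
        x = x + self.ffn(emb)
        return x
```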
Shom @ShomLinEd
@rosinality There has been a slow but steady line of work on embeddings, such as SCONES from Google and OE from ByteDance. It seems to be Gemma 3n from last year that really sparked it off.
1 reply · 0 reposts · 1 like · 180 views
Shom @ShomLinEd
@nake13 Because it's expensive: tokens cost 10x the standard version, and the outputs are longer too.
0 replies · 0 reposts · 1 like · 296 views
Zhixiong Pan @nake13
Surprising that GPT-5.2 Pro, as expensive as it is, sees such high usage: its transaction volume has already reached half that of standard GPT-5.2, yet the two serve very different core user bases. Science, finance, and legal users clearly lean toward paying for Pro, while programmers and academics still mostly use the standard version. Perhaps the way to read it: buggy code can be iterated on cheaply, but research and law tolerate far fewer errors, so those users will pay a premium for "certainty."
OpenRouter @OpenRouter

GPT-5.2 Pro does about 50% of the $ volume on OpenRouter as GPT-5.2. What are people using GPT-5.2 Pro for? Here's the category breakdown:
GPT-5.2 Pro is more heavily used for:
- Science (6.7% vs 2.8% for Standard)
- Finance (2.6% vs 1.3% for Standard)
- Legal (1.2% vs 0.5% for Standard)
GPT-5.2 Standard is more heavily used for:
- Academia (3.4% vs 2.0% for Pro)
- Programming (10.7% vs 9.8% for Pro)
- Technology (5.6% vs 4.7% for Pro)
See openrouter.ai/rankings for more insights

5 replies · 0 reposts · 32 likes · 15.6K views
Shom reposted
nathan chen @nathancgy4
Was just working on a (very exciting) model architecture improvement with @yzhang_cs and then my mom called and said "happy birthday" and I realized I'm 17 now. Wow, what a year...

Last February I was figuring out how to get into a good college. Then got into machine learning in March and spent every day downloading papers, uploading them to gemini, and asking it infinite questions. Then joined @tilderesearch because of a tweet I shared. Flew there the day school ended, got taken to secondary inspection b/c I'm 16 and alone, and eventually spent 2 months in Palo Alto working on interesting model arch research.

In October, when I finally felt like settling back into high school senior year, @Kimi_Moonshot reached out and I joined their scaling team, now having just spent a month in Beijing working on fun model architecture with an amazing group of people.

And now I get to feel what it's like to not fall asleep because of getting overly excited about an idea, or wake up in the morning, energized, over the fact that I'm gonna learn new stuff, come up with ideas, run experiments, ask questions, and more. So grateful for all that's happened!
23 replies · 5 reposts · 203 likes · 17.3K views
Shom @ShomLinEd
@dongxi_nlp Just write the prompt into an md file, then have claude code/codex write Python and read it with code 🤡
2 replies · 0 reposts · 3 likes · 430 views
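A tiny sketch of that workflow: the long prompt lives on disk and the agent reads it with code instead of holding it all in context (the file name and section convention are made up):

```python
from pathlib import Path

# Load the long prompt from disk rather than pasting it into the chat window.
prompt = Path("prompt.md").read_text(encoding="utf-8")

# Inspect it programmatically, e.g. index sections by their "## " headings.
sections = {}
for block in prompt.split("\n## ")[1:]:
    title, _, body = block.partition("\n")
    sections[title.strip()] = body
print(list(sections))  # fetch individual section bodies only when needed
```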
马东锡 NLP @dongxi_nlp
Recursive Language Models (RLMs) have been flooding in these past two days, fighting for my attention. The vibe is a lot like DSPy. But are they a valuable next step, or hype?

You can see plenty of DSPy in RLMs. What the two share is replacing prompting with programming: instead of treating the LLM as a plain text generator, they treat it as the controller of a data pipeline. DSPy uses this to optimize prompts; RLMs apply the idea to long context.

Did they work? Of course; in particular settings they work well. Did they fail? Of course; the RLM paper itself catalogs the cases where recursion fails.

As one attempt at programming-as-prompting for long context, RLMs are certainly good work. But they come nowhere near revolutionizing or fully solving the long-context problem. Claiming that would be hype.
7 replies · 7 reposts · 79 likes · 12.3K views
Shom @ShomLinEd
@mkurman88 The baseline is 25% though, as it's a multiple-choice benchmark with mostly 4 options, and random.choice can get you to 25%. But 3% above baseline is pretty neat for such a small model!
1 reply · 0 reposts · 6 likes · 267 views
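The 25% figure is just the chance baseline, easy to confirm empirically (a sanity-check sketch with made-up answer keys):

```python
import random

random.seed(0)
options = ["A", "B", "C", "D"]
n = 100_000

# Uniform random guesses against uniform random answer keys.
hits = sum(random.choice(options) == random.choice(options) for _ in range(n))
print(hits / n)  # ~0.25: the random-guess floor on 4-option multiple choice
```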
Mariusz Kurman @mkurman88
Oh boy, 28% GPQA Diamond for a 164M-parameter model is quite impressive, isn't it?
[image attached]
4 replies · 2 reposts · 37 likes · 10.6K views