Shom

682 posts

@ShomLinEd

language model | sequence modeling | education | HCI

Web · Joined September 2021

2.2K Following · 369 Followers

Shom@ShomLinEd·19h

@can some of SQLite's tests are public tho

1.4K

can@can·22h

in the light of this automated bun rewrite, SQLite’s open-source core, closed-source tests policy feels prescient but not exactly sure how

660

64.2K

Shom@ShomLinEd·1d

@jarredsumner Is this fuzzer a library or made by you?

706

Jarred Sumner@jarredsumner·1d

There is now a fuzzer running 24/7 for every language parser in Bun, ranging from .npmrc files and .patch files to shell scripts to jsonc & typescript & css. Once it minimizes a repro, it sends it to Claude to fix, and then I review.

246

11.6K

Jarred Sumner@jarredsumner·1d

ZXX

664

30.9K

Shom@ShomLinEd·14 May

@boshen_c it's from zig presumably

2.6K

Boshen@boshen_c·14 May

So many allocators! We only have 1 in Oxc 😂

143

18.8K

Boshen@boshen_c·14 May

Thanks to the Rust rewrite, I now learn why Bun is fast. First find: it uses a thread-local arena for ASTs.

985

105K

Shom retweeted

Kaichao You@KaichaoYou·8 May

This is growth-hacking dressed up in open-source language, @radixark please stop doing it immediately. Paying people in platform credits to star a GitHub repo and repost a marketing tweet isn't "fueling the community" — it's laundering paid promotion through the trust signals open source depends on. Stars are supposed to mean someone found a project useful. Attach a $200 bounty and the number means nothing. GitHub's own policies prohibit this for exactly that reason.

RadixArk@radixark

$200 FREE CREDIT! We just launched our inference platform for beta testing, and we're giving it to the community first.
⭐ Star SGLang on GitHub (github.com/sgl-project/sg…) + repost this to claim your credits.
→ Limited spots, first come first serve
→ Deadline: May 13, 2025 (AoE)
Every star, every issue filed, every PR reviewed, every question answered in Slack: you built this with us. Thank you for believing in open-source AI infrastructure, in our mission, and in us.
Claim your credits: platform.radixark.com

284

44.9K

Shom@ShomLinEd·6 May

@zephyr_z9 full attention also has linear scaling...

181

Zephyr@zephyr_z9·5 May

Ok, so it's a linear attention variant

137

23.4K

Shom retweeted

Keller Jordan@kellerjordan0·1 May

New modded-NanoGPT optimization benchmark result: @wen_kaiyue has improved upon both the Muon and AdamW baselines, by replacing their weight decay with hyperball optimization. The new record is 3325 steps.

427

59.3K

Shom@ShomLinEd·28 Apr

@kalomaze if you spam on book scans you can probably get even more tokens

439

kalomaze@kalomaze·28 Apr

what i find most interesting about this release is that you can approach ~1T raw tokens from *before WWII*, before synthetic augmentation or rephrasers or whatever. that's the floor. and a post-WWII model could have the colloquial talk radio archives too, if transcribed...

David Duvenaud@DavidDuvenaud

Announcing Talkie: a new, open-weight historical LLM! We trained and finetuned a 13B model on a newly-curated dataset of only pre-1930 data. Try it below! with @AlecRad and @status_effects 🧵

202

18.1K

Shom@ShomLinEd·26 Apr

@teortaxesTex i don't think they solved long agentic context reasoning, only some basic hashhop stuff...

288

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·26 Apr

Reminder: Magic dot dev had solved context 2 OOMs beyond today's DeepSeek 1.5 years ago. Now they're probably dealing with trillion-long sequences. AND YET! nobody cares about them. In America, this level of innovation is just noise. Open source/China have no hope of catching up.

sankalp@dejavucoder

does anyone know what magic dot dev is doing now. it's been one year and we didn't see anything.

233

28K

Shom@ShomLinEd·24 Apr

The massive KV cache reduction of DeepSeek may unlock agent scaling as an economical choice... Imagine defaulting to 4 parallel agents solving one of your problems, with each agent calling 10~20 subagents in parallel to explore different choices.
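
A rough sketch of the fan-out arithmetic in the post above. It assumes every agent and subagent keeps its own KV cache alive at the same time, which the post does not state. The function name `concurrent_contexts` is hypothetical and only illustrates the count.

```python
def concurrent_contexts(top_level_agents: int, subagents_per_agent: int) -> int:
    """Count contexts alive at once: each top-level agent plus its subagents.

    Assumption (not from the post): no context is shared or freed early,
    so every agent and subagent holds a separate KV cache.
    """
    return top_level_agents * (1 + subagents_per_agent)


# The post's scenario: 4 parallel agents, each calling 10 to 20 subagents.
low = concurrent_contexts(4, 10)
high = concurrent_contexts(4, 20)
print(low, high)
```

Under that assumption the scenario needs 44 to 84 simultaneous contexts, so the per-context KV cache size would plausibly dominate the serving cost.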

Shom retweeted

Yu Zhang 🐙🌘@yzhang_cs·21 Apr

It's just a small piece of our bigger puzzle, to build a solid ecosystem for linear attention, and to make KDA as plug-and-play as flash-attn.

Kimi.ai@Kimi_Moonshot

We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achieves 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20, and works as a drop-in backend for flash-linear-attention. Explore on github: github.com/MoonshotAI/Fla…

123

11.2K

Shom@ShomLinEd·21 Apr

@_ueaj it may be due to increased parameters instead of increased kv cache rank from enlarged projections. What if you enlarge MLP layers for smaller kv cache rank models to balance the two models' params? Or do an mlp style expansion in projections?

119

ueaj@_ueaj·21 Apr

New blog! You can just keep increasing the number of heads in your model with no diminishing returns on ICL up to at least 4x. For reference that would make the o_proj head dimensions in this experiment 16k x 2k. Additionally, if you perform a truncated SVD on full-rank master weights to train MLA instead of training them as two separate matrices, you can recover most of the ICL capability but with less memorization. I think MLA-specialized optimizers are a direction worth exploring and are very underserved rn. Unfortunately I have more important projects to attend to and I've burned like $300 on compute for this already. I would highly recommend someone trying to scale this up and see how well we can do.

ueaj@_ueaj

Also, something like MLA should be trained like QAT, but instead of converting a high-precision matrix to low precision, you convert full-rank master weights into low-rank latent projections. You could then also do QAT on the latents, maybe with TurboQuant, and add an extra set of "perturb" weights to store how the quantization might affect the exact parameters for faster inference.

15.1K

Shom@ShomLinEd·15 Apr

@fleetwood___ maybe it's too small?

247

Fleetwood@fleetwood___·14 Apr

The models, they just want to learn (their current task and literally nothing else). Training a toy transformer on 3 digit addition, sorting, reversal and modular addition. Complete lobotomy at every task transition.

592

111.6K

Shom@ShomLinEd·9 Apr

@facontidavide I suggest checking the code rigorously for hacking, perhaps by asking another code agent like Codex to do it. They have very creative ways to game the benchmark and get high scores.

398

Davide Faconti@facontidavide·9 Apr

@ShomLinEd what do you mean? If by "hacking" you mean "cheating" the benchmarks somehow, I am going to say "no".

1.7K

Davide Faconti@facontidavide·9 Apr

Claude Code just "invented" a new lossless codec on the Pareto frontier 🤯

Davide Faconti@facontidavide

Fun project of the day. I have an AI Agent autonomously trying to create a novel lossless image compression that achieves ratios similar to PNG but beats QOI in speed. I will let you know how this goes

431

96K

Shom@ShomLinEd·8 Apr

@Dorialexander isn't GPT-4 rumored to have 200B activated params?

Alexander Doria@Dorialexander·8 Apr

Since scale pilled people are still at it, reminder there is in all likelihood no model deployed with more active parameters than 2020 GPT-3.

203

23.8K

Shom@ShomLinEd·5 Apr

@TommyGun_AB @honeylemon0124 i am not able to find mentions of bee in IHNMAIMS hmm

226

TommyGun AB@TommyGun_AB·5 Apr

@honeylemon0124

QME

2.7K

Millie ^ㅅ^@blueydoodles·5 Apr

Sad that we never really got to know why he was so obsessed with bees 🥲

Black Hat Organization@BlackHatOrgn

LOOK AT THIS COOL BEE I DREW 🐝

101

203

9.4K

153.9K

Shom@ShomLinEd·30 Mar

@bigeagle_xd code is cheap; verification is not

278

熊师傅 weight decay 了吗@bigeagle_xd·30 Mar

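A lot of past software engineering focused on questions like "how to make code easy to maintain, how to make it extensible, how to reuse as much code as possible". If code is cheap, then ideally the most suitable software should be written on the spot for each task and each scenario.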

4.1K

Shom@ShomLinEd·27 Mar

@ClementDelangue huggingface.co/datasets/nex-a… We published 70k high quality agent traces:)

clem 🤗@ClementDelangue·27 Mar

We need more open agent traces datasets. Who can help?

498

133.5K

Shom retweeted

Jianyang Gao@gaoj0017·27 Mar

The TurboQuant paper (ICLR 2026) contains serious issues in how it describes RaBitQ, including incorrect technical claims and misleading theory/experiment comparisons. We flagged these issues to the authors before submission. They acknowledged them, but chose not to fix them. The paper was later accepted and widely promoted by Google, reaching tens of millions of views. We’re speaking up now because once a misleading narrative spreads, it becomes much harder to correct. We’ve written a public comment on openreview (openreview.net/forum?id=tO3AS…). We would greatly appreciate your attention and help in sharing it.

Google Research@GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

969

6.5K

Shom retweeted

Yu Zhang 🐙🌘@yzhang_cs·27 Mar

flash-linear-attention is now seeing over 15,000 daily downloads. 📈 We @SonglinYang4 @uniartisan are honored to see fla becoming a piece of the core infrastructure for efficient model archs. Grateful to the community for the trust and support. github.com/fla-org/flash-…

239

31.5K
