Shom

682 posts

@ShomLinEd

language model | sequence modeling | education | HCI

Web Joined September 2021
2.2K Following, 369 Followers
Shom@ShomLinEd·
@can some of SQLite's tests are public tho
can@can·
in the light of this automated bun rewrite, SQLite’s open-source core, closed-source tests policy feels prescient but not exactly sure how
Shom@ShomLinEd·
@jarredsumner Is this fuzzer a library or made by you?
Jarred Sumner@jarredsumner·
There is now a fuzzer running 24/7 for every language parser in Bun, ranging from .npmrc files and .patch files to shell scripts to jsonc & typescript & css. Once it minimizes a repro, it sends to Claude to fix and then I review.
Shom@ShomLinEd·
@boshen_c it's from zig presumably
Boshen@boshen_c·
So many allocators! We only have 1 in Oxc 😂
Boshen@boshen_c·
Thanks to the Rust rewrite, I now learn why Bun is fast. First find: it uses a thread local arena for ASTs
Shom retweeted
Kaichao You@KaichaoYou·
This is growth-hacking dressed up in open-source language, @radixark please stop doing it immediately. Paying people in platform credits to star a GitHub repo and repost a marketing tweet isn't "fueling the community" — it's laundering paid promotion through the trust signals open source depends on. Stars are supposed to mean someone found a project useful. Attach a $200 bounty and the number means nothing. GitHub's own policies prohibit this for exactly that reason.
RadixArk@radixark

$200 FREE CREDIT! We just launched our inference platform for beta testing, and we're giving it to the community first. ⭐ Star SGLang on GitHub (github.com/sgl-project/sg…) + repost this to claim your credits. → Limited spots, first come first serve → Deadline: May 13, 2025 (AoE) Every star, every issue filed, every PR reviewed, every question answered in Slack — You built this with us. Thank you for believing in open-source AI infrastructure, in our mission, and in us. Claim your credits: platform.radixark.com

Shom@ShomLinEd·
@zephyr_z9 full attention also has linear scaling...
Zephyr@zephyr_z9·
Ok, so it's a linear attention variant
Shom retweeted
Keller Jordan@kellerjordan0·
New modded-NanoGPT optimization benchmark result: @wen_kaiyue has improved upon both the Muon and AdamW baselines, by replacing their weight decay with hyperball optimization. The new record is 3325 steps.
Shom@ShomLinEd·
@kalomaze if you spam on book scans you can probably get even more tokens
kalomaze@kalomaze·
what i find most interesting about this release is that you can approach ~1T raw tokens from *before WWII*, before synthetic augmentation or rephrasers or whatever. that's the floor. and a post-WWII model could have the colloquial talk radio archives too, if transcribed...
David Duvenaud@DavidDuvenaud

Announcing Talkie: a new, open-weight historical LLM! We trained and finetuned a 13B model on a newly-curated dataset of only pre-1930 data. Try it below! with @AlecRad and @status_effects 🧵

Shom@ShomLinEd·
@teortaxesTex i don't think they solved long agentic context reasoning, only some basic hashhop stuff...
Shom@ShomLinEd·
The massive KV cache reduction of DeepSeek may unlock agent scaling as an economical choice... Imagine defaulting to 4 parallel agents solving one of your problems, with each agent calling 10~20 subagents in parallel to explore different choices.
Shom retweeted
Shom@ShomLinEd·
@_ueaj it may be due to increased parameters instead of increased kv cache rank from enlarged projections. What if you enlarge MLP layers for smaller kv cache rank models to balance the two models' params? Or do an mlp style expansion in projections?
ueaj@_ueaj·
New blog! You can just keep increasing the number of heads in your model with no diminishing returns on ICL up to at least 4x. For reference that would make the o_proj head dimensions in this experiment 16k x 2k. Additionally, if you perform a truncated SVD on full rank master weights to train MLA instead of training them as two separate matrices, you can recover most of the ICL capability but with less memorization. I think MLA specialized optimizers are a direction worth exploring and are very underserved rn. Unfortunately I have more important projects to attend to and I've burned like $300 on compute for this already. I would highly recommend someone try to scale this up and see how well we can do.
ueaj@_ueaj

Also something like MLA should be trained like QAT but instead of converting a high precision matrix to low precision you convert full rank master weights into low rank latent projections. You could then also do QAT on the latents maybe with turboquant and add an extra set of "perturb" weights to store how the quantization might affect the exact parameters for faster inference.
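
A minimal sketch of the truncated-SVD step mentioned in the post above, assuming NumPy and a toy 64 x 32 matrix. The function name `truncated_svd_factors`, the shapes, and the rank are hypothetical and are not taken from the blog or from any MLA implementation. It only illustrates splitting a full rank master matrix into two low rank factors; training, QAT, and the "perturb" weights are not covered.

```python
import numpy as np


def truncated_svd_factors(w, rank):
    """Split a matrix into two low rank factors via truncated SVD.

    Hypothetical helper for illustration only: returns `down` with shape
    (d_in, rank) and `up` with shape (rank, d_out) such that `down @ up`
    is the best rank-`rank` approximation of `w` in Frobenius norm.
    """
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    # Share the singular values evenly between the two factors.
    down = u[:, :rank] * sqrt_s
    up = sqrt_s[:, None] * vt[:rank]
    return down, up


# Toy "master weights"; real projections would be far larger.
rng = np.random.default_rng(0)
master = rng.standard_normal((64, 32))

down, up = truncated_svd_factors(master, rank=8)
approx = down @ up

rel_err = np.linalg.norm(master - approx) / np.linalg.norm(master)
print(f"relative reconstruction error at rank 8: {rel_err:.3f}")
```

In a QAT-style loop, one would presumably recompute the factors from the master weights each step rather than once, but that is an assumption about the proposal, not something the posts spell out.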

Fleetwood@fleetwood___·
The models, they just want to learn (their current task and literally nothing else). Training a toy transformer on 3 digit addition, sorting, reversal and modular addition. Complete lobotomy at every task transition.
Shom@ShomLinEd·
@facontidavide I suggest checking the code rigorously for hacking, perhaps by asking another coding agent like Codex to do it. They have very creative ways to game the benchmark and get high scores.
Davide Faconti@facontidavide·
@ShomLinEd what do you mean? If by "hacking" you mean "cheating" the benchmarks somehow, I am going to say "no".
Shom@ShomLinEd·
@Dorialexander isn't GPT-4 rumored to have 200B activated params?
Alexander Doria@Dorialexander·
Since scale pilled people are still at it, reminder there is in all likelihood no model deployed with more active parameters than 2020 GPT-3.
Shom@ShomLinEd·
@bigeagle_xd code is cheap, verification is not
熊师傅 weight decay 了吗@bigeagle_xd·
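In the past, much of software engineering focused on questions like "how to make code easy to maintain, how to make it extensible, how to reuse as much code as possible." If code is cheap, then ideally the most suitable software should be written on the spot for each task and each scenario.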
clem 🤗@ClementDelangue·
We need more open agent traces datasets. Who can help?
Shom retweeted
Jianyang Gao@gaoj0017·
The TurboQuant paper (ICLR 2026) contains serious issues in how it describes RaBitQ, including incorrect technical claims and misleading theory/experiment comparisons. We flagged these issues to the authors before submission. They acknowledged them, but chose not to fix them. The paper was later accepted and widely promoted by Google, reaching tens of millions of views. We’re speaking up now because once a misleading narrative spreads, it becomes much harder to correct. We’ve written a public comment on openreview (openreview.net/forum?id=tO3AS…). We would greatly appreciate your attention and help in sharing it.
Google Research@GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

Shom retweeted
Yu Zhang 🐙🌘@yzhang_cs·
flash-linear-attention is now seeing over 15,000 daily downloads. 📈 We @SonglinYang4 @uniartisan are honored to see fla becoming a piece of the core infrastructure for efficient model archs. Grateful to the community for the trust and support. github.com/fla-org/flash-…