Jeffrey Li 💙💛
@askerlee
Machine Learning researcher. Veritas.
Singapore · Joined January 2010
7K posts · 1.3K Following · 2.8K Followers
yv @yvbbrjdr ·
GPT was so stubborn today that I deleted the app in a fit of anger.
7 replies · 1 repost · 34 likes · 15.1K views
Jeffrey Li 💙💛 @askerlee ·
@CSProfKGD My censorship radar (trained on thousands of Chinese posts) rings on "Chinese media coverage" 😂
0 replies · 0 reposts · 2 likes · 56 views
Jeffrey Li 💙💛 @askerlee ·
@himself65 Statements like that are pretty meaningless. Even setting aside the randomness introduced by quantum effects, what matters most to us is the small system that is Earth, and large perturbations from outside it can push Earth's evolution onto a different path; that has already happened several times in history. Looking at this system alone, there is no real determinism.
0 replies · 0 reposts · 9 likes · 2K views
面包🍞 @himself65 ·
Honestly, I really dislike how these STEM guys go on and on about fate being predetermined, dressing it up as a Markov process… It may well be true, but saying it out loud feels like someone who has lost his mind grinding problem sets.
77 replies · 10 reposts · 398 likes · 52.6K views
Jeffrey Li 💙💛 @askerlee ·
@VukRosic99 I'm confused. Is the 16MB cap for the total param count? An MoE would have more params than its dense counterpart, wouldn't it?
1 reply · 0 reposts · 2 likes · 376 views
Vuk Rosić 武克 @VukRosic99 ·
I ran 71 quick experiments, each for 500 of 13,000 steps, for OpenAI's challenge.

1. Mixture of Experts is the absolute WINNER (very surprising, as it shouldn't be for small LLMs). Expert count matters most: 4 (best) > 3 >> 2.
2. UNTIED embeddings work; tied embeddings are a disaster.
3. Depthwise convolution: DEAD END.

Insights:
1. 4-expert MoE + leaky ReLU → −0.048 BPB, clear winner.
2. Untied factored embeddings (bn128) → −0.031 BPB, worth combining with MoE.
3. MoE + QAT combo → preserves quantized quality for submission.

Dead ends:
1. Depthwise convolution → every variant hurts, and bigger kernels hurt more.
2. Tied factored embeddings → catastrophic, especially at small bottlenecks.
3. Weight sharing → not competitive with MoE for quality.
4. Conv + anything combos → compounds the damage.

Next steps:
1. Validate MoE 4e + leaky at 2,000–5,000 steps, multiple seeds.
2. Test MoE 4e + leaky + untied bn128; the two biggest wins may stack.
3. Full run (13,780 steps) of the best combo to see if it beats the 1.2244 BPB leaderboard.

71 experiments, 3 GPUs, ~500 steps each. 500-step training mainly helps us eliminate VERY BAD losers; winners need to be tested on longer training. Thank you @novita_labs for the compute!

Quoting OpenAI @OpenAI: Are you up for a challenge? openai.com/parameter-golf
12 replies · 16 reposts · 211 likes · 40.3K views
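The winning configuration above was a 4-expert MoE with leaky ReLU. As a rough illustration of that architecture class (this is not Vuk's actual code; the layer sizes, class name, and top-1 routing choice are all assumptions for the sketch), a routed MoE layer can be written in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

class TinyMoE:
    """Toy top-1-routed mixture-of-experts MLP (illustrative only)."""
    def __init__(self, d_model=16, d_hidden=32, n_experts=4):
        # Router scores each token against each expert.
        self.router = rng.normal(0, 0.02, (d_model, n_experts))
        # One small two-layer MLP per expert.
        self.w_in = rng.normal(0, 0.02, (n_experts, d_model, d_hidden))
        self.w_out = rng.normal(0, 0.02, (n_experts, d_hidden, d_model))

    def __call__(self, x):
        # x: (tokens, d_model). Each token goes to its highest-scoring expert,
        # so compute per token stays constant as expert count grows.
        logits = x @ self.router            # (tokens, n_experts)
        choice = logits.argmax(axis=-1)     # top-1 expert index per token
        out = np.empty_like(x)
        for e in range(self.w_in.shape[0]):
            mask = choice == e
            if mask.any():
                h = leaky_relu(x[mask] @ self.w_in[e])
                out[mask] = h @ self.w_out[e]
        return out

x = rng.normal(size=(8, 16))
y = TinyMoE()(x)
print(y.shape)  # (8, 16)
```

This also makes the question in the reply above concrete: total parameter count grows with `n_experts` (every expert has its own `w_in`/`w_out`), while per-token compute does not, which is why MoE models can blow past a size cap defined on total parameters.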
Maxence Frenette @maxencefrenette ·
@YouJiacheng OK, I looked at the paper. Sounds like they use the same hparams, including lr, across all runs. Not surprising that the tuned setting does better than the untuned setting at iso-tokens (which is also iso-flops in this case, since it's the same model).
3 replies · 0 reposts · 3 likes · 365 views
You Jiacheng @YouJiacheng ·
Interesting: large batch sizes make quantized training worse.
10 replies · 8 reposts · 123 likes · 13.9K views
Ethan Mollick @emollick ·
Evidence that AI models can indeed learn "taste": in this paper, a small model trained on citations is able to predict which papers will be hits. Citations, upvotes, and shares are signals that can teach AI judgment about quality, not just execution. arxiv.org/pdf/2603.14473
47 replies · 56 reposts · 419 likes · 35.2K views
Xiangyu 香鱼🐬 @XianyuLi ·
Had dinner with friends today. Two weeks ago they had just hired someone to build machine-recognition algorithms. And I, two weeks too late, missed out on an order worth 7 million RMB. I'm the clown 🙃🙃🙃
18 replies · 0 reposts · 56 likes · 12.9K views
Jeffrey Li 💙💛 @askerlee ·
@elliotchen100 I see, that's quite a fine-grained division of labor. I've read a position paper from Shanda that strongly emphasizes memory and continual learning.
0 replies · 0 reposts · 2 likes · 344 views
艾略特 @elliotchen100 ·
@askerlee Both belong to the Shanda family: evermind does memory, miromind does reasoning.
1 reply · 1 repost · 6 likes · 4.5K views
艾略特 @elliotchen100 ·
The paper is out. It's called MSA: Memory Sparse Attention.

In one sentence: it gives large models natively ultra-long memory. Not bolted-on retrieval, not brute-force window expansion, but "memory" grown directly into the attention mechanism and trained end to end.

Why don't past approaches work? RAG is essentially an open-book exam. The model itself remembers nothing and relies on flipping through its notes on the spot. Whether it finds the right page depends on retrieval quality; how fast it flips depends on data volume. As soon as information is scattered across dozens of documents and requires cross-document reasoning, it falls apart. Linear attention and KV caches are essentially "compressed memory": they remember, but the more they compress the blurrier things get, and long histories get dropped.

MSA's approach is completely different:
→ No compression, no external add-on; the model learns to "focus on what matters." The core is a scalable sparse-attention architecture with linear complexity: memory can grow 10x without compute costs blowing up.
→ The model knows where a memory came from and when it happened. A positional encoding called document-wise RoPE lets the model natively understand document boundaries and temporal order.
→ Fragmented information can still be chained into reasoning. A Memory Interleaving mechanism lets the model do multi-hop reasoning across memory fragments scattered everywhere: not just finding one relevant record, but linking clues into a chain.

The results?
· Scaling from 16K to 100 million tokens, accuracy degrades by less than 9%.
· A 4B-parameter MSA model beats top 235B-scale RAG systems on long-context benchmarks.
· 100-million-token inference runs on just 2 A800s. This isn't lab-only; it's a cost a startup can afford.

Put plainly, today's large models are extremely smart geniuses with goldfish memory. What MSA sets out to do is make them truly "remember."

We've put it on GitHub. The algorithm folks worked hard on this; please give it a star to support them. 🌟👀🙏 github.com/EverMind-AI/MSA

Quoting 艾略特 @elliotchen100: A little spoiler: @EverMind will release another high-quality paper this week.
165 replies · 566 reposts · 3.1K likes · 1.6M views
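The "document-wise RoPE" idea mentioned above (position indices that respect document boundaries) can be pictured with a small sketch. This is speculative: the function below, its name, and the restart-at-zero behavior are my guesses at the mechanism, not the actual implementation, which lives in the linked github.com/EverMind-AI/MSA repo.

```python
import numpy as np

def doc_positions(doc_ids):
    """Position indices that restart at 0 at each document boundary.

    A guess at what a document-aware positional scheme might feed into
    rotary embeddings: instead of one global position counter, each
    token is positioned relative to the start of its own document, so
    the model can tell document boundaries apart.
    """
    doc_ids = np.asarray(doc_ids)
    pos = np.arange(len(doc_ids))
    # Indices where a new document begins (first token, or doc id changes).
    starts = np.where(np.r_[True, doc_ids[1:] != doc_ids[:-1]])[0]
    # For each token, find the start index of its own document.
    doc_start = starts[np.searchsorted(starts, pos, side="right") - 1]
    return pos - doc_start

# Three documents of lengths 3, 2, 1 concatenated into one sequence.
print(doc_positions([0, 0, 0, 1, 1, 2]))  # [0 1 2 0 1 0]
```

These per-document positions would then replace the global token index when computing rotary angles, which is one plausible way for attention to "know where a memory came from."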
Rainier @mtrainier2020 ·
The organization involved needs a thorough, top-to-bottom security audit.

1. The Windows version is already out of service. It looks like Windows 7, so new vulnerabilities go unpatched and the machine is completely exposed. Do you know how many security holes have been fixed since Win 7? Running Windows 7 like this is running naked.

2. The Sunflower (向日葵) remote-control software is installed directly on the machine. Whoever installed it should be fired, and everyone should get security training. Sunflower is a very common remote-control collaboration tool with very weak security: it relies on dynamic-passcode protection, and it has a history of RCE vulnerabilities. Installing it is equivalent to opening a giant backdoor into your network.

3. The machine is used for finite-element analysis of bunkers. Judging from the software, it's used to design bunkers against bunker-buster munitions. A military project through and through.

P.S. Those finite-element packages are probably all pirated.
47 replies · 80 reposts · 617 likes · 332.9K views
Jeffrey Li 💙💛 @askerlee ·
@hillbig "Typical MoE implementations require roughly 40 times as much data per total parameter as dense models" — could you suggest a reference? Thanks!
0 replies · 0 reposts · 0 likes · 129 views
Daisuke Okanohara / 岡野原 大輔 @hillbig ·
Pre-training of LLMs has once again become a major focus of attention. Although concerns about data scarcity are growing, pre-training itself continues to evolve. A key driver of this progress is the increasing use of synthetic data (see Tramel's presentation at Berkeley, linked below).

Although post-training can improve performance, the upper bound of a model's capabilities is generally believed to be determined during the pre-training phase. This is because pre-training is where fundamental representations and basic reasoning patterns are acquired, and these tend to change only marginally during post-training.

Looking at current scaling laws, the Chinchilla rule originally suggested that the optimal training data size is roughly 20 times the number of parameters. Recently, however, this ratio has increased to around 60 times the number of parameters. In addition, the emergence of Mixture-of-Experts (MoE) architectures has enabled increasing the total number of parameters without a proportional increase in inference compute. This development further intensifies data requirements. Compared with dense models, MoE models require fewer data visits per parameter and are therefore more susceptible to overfitting. As a result, typical MoE implementations require roughly 40 times as much data per total parameter as dense models. For example, a 1T-parameter model may require on the order of 40T tokens.

Moreover, data diversity is critical. Simply repeating the same dataset multiple times does not meaningfully improve performance. However, when model-generated synthetic data is used directly as training data, the overall data quality can deteriorate. This phenomenon, often referred to as mode collapse, reduces the diversity present in the long tail of the data distribution and leads to more monotonous model outputs. One effective mitigation strategy is to mix real data and synthetic data during training.

In addition, instead of fully regenerating data, it is often preferable to generate paraphrases of existing data. By synthesizing alternative expressions that preserve the original data's factual content, it is possible to improve training efficiency while maintaining data diversity. Importantly, the models used for paraphrasing do not necessarily need to be powerful; relatively small or weak models can be sufficient. This approach follows the same fundamental principle as data augmentation in computer vision. By observing the same information expressed in many different forms, the model learns representations that are independent of specific surface expressions while simultaneously learning the mapping between expressions and internal semantic representations.

Recently, two types of synthetic data have emerged as particularly important. The first is program code. Code can be verified by execution, enabling automatic correctness checks and the generation of highly reliable training data. Beyond improving programming ability, code data appears to help models acquire broader representations and reasoning capabilities. The second is data containing explicit reasoning processes. If such reasoning traces are incorporated during pre-training rather than only during post-training, models may learn reasoning procedures (essentially, certain classes of algorithms) during pre-training itself.

In real-world data, explicit reasoning processes are often absent; texts rarely include detailed explanations of why particular outcomes occur. To address this, one promising approach is to generate multiple reasoning trajectories with inexpensive, weaker models, then verify and filter them with stronger models. This pipeline can produce high-quality reasoning data suitable for inclusion in the pre-training corpus. In this sense, synthetic data acts as an amplifier of real-world data.

Because human-generated data is fundamentally limited, synthetic data will likely play an increasingly central role in future large-scale model training.
2 replies · 9 reposts · 39 likes · 10.5K views
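The token-budget ratios quoted above (Chinchilla's original 20x for dense models, the more recent ~60x, and ~40x per total parameter for MoE) translate directly into data requirements. A back-of-envelope calculation, using only the ratios stated in the text:

```python
# Tokens-per-parameter ratios quoted in the post above.
ratios = {
    "Chinchilla-optimal dense (original)": 20,
    "Recent dense practice": 60,
    "Typical MoE, per total parameter": 40,
}

params = 1e12  # the 1T-parameter example from the post

for name, r in ratios.items():
    tokens_t = params * r / 1e12  # training tokens, in trillions
    print(f"{name}: {tokens_t:.0f}T tokens")
# The MoE row reproduces the post's figure: 1T params -> 40T tokens.
```

Note the MoE ratio is defined on *total* parameters, not active ones, which is why sparse models intensify data demands even though their per-token compute is lower.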
勃勃OC @bboczeng ·
SNDK should be at 700 by Monday. All fates will be decided on Wednesday, after MU's earnings report! Thank you, everyone!
9 replies · 0 reposts · 38 likes · 18.2K views
李老师不是你老师 @whyyoutouzhele ·
Three more academicians removed from the official website! On March 14, three Chinese Academy of Engineering academicians working in radar, aviation, and nuclear weapons had their profiles taken down; two of them had previously served as vice presidents of the Academy.
156 replies · 97 reposts · 970 likes · 326.7K views
麦田 Rye 🇺🇦 @maitian99 ·
I think this will be a line of Trump's that goes down in history: today he said that continuing to bomb Iran's oil island is "just for fun." War, bombing, death… described by US President Trump as "just for fun."
30 replies · 25 reposts · 315 likes · 50.4K views