AI Adam

3.6K posts

@AI_AdamZ

AI. Space. @StardustTrade_

LLM · Joined October 2021
2.3K Following · 2.2K Followers
AI Adam retweeted
AI Adam @AI_AdamZ
@yminsky @dwarkesh_sp Firms that don’t build LLMs themselves are no longer quantitative trading firms, for sure 💯
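李老师不是你老师 @whyyoutouzhele
Musk's youngest son wears a Chinese-style vest. On the morning of May 14, Musk entered the venue of the China-US leaders' meeting together with more than ten US business representatives, including Apple CEO Tim Cook and Nvidia CEO Jensen Huang. Notably, the 54-year-old Musk brought his 6-year-old youngest son on this trip; photos show the boy wearing a top with Chinese-style elements.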
AI Adam @AI_AdamZ
agree
Don Wilson @drwconvexity

I've believed for years that compute would evolve into one of the world’s most important commodities — which is why I backed the creation of @Silicon_Data and @computeexchange two years ago. Today’s announcement from @Silicon_Data and @CMEGroup is an important step in that direction. As AI scales, compute markets are developing the same kinds of supply, volatility and capital allocation dynamics we’ve seen in energy and other major commodity markets. Futures markets matter because they improve price discovery, reduce the cost of capital and support long-term infrastructure investment. ft.com/content/3e6b81…

SemiAnalysis @SemiAnalysis_
After studying 300 Leetcode Hards, solving every Jane Street puzzle from the Dwarkesh ads, and watching one Horace He lecture, he finally landed the $400k annualized Jane Street internship. Unfortunately, during onboarding his manager said “this diff is negative alpha,” so Jane Street deployed an AI model to translate all feedback into HR-safe speech in real time.
AI Adam @AI_AdamZ
My biggest mistake this year: until recently I thought I could not buy Korean stocks on IBKR. In fact, I could have bought 2x last year……
AI Adam retweeted
Tilde @tilderesearch
Introducing Aurora, a new optimizer for training frontier-scale models.

We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks.

Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs.

By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity.

What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.
Tilde @tilderesearch

x.com/i/article/2052…

Macro_Lin | 市场观察员
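When I was doing architecture exploration for LLM inference chips, I went through the architectures of all four major AI inference ASIC companies: Groq, SambaNova, Tenstorrent, and Cerebras. The first three each have their own emphasis, but their underlying logic sits in the same framework: large on-chip SRAM + dataflow architecture + deterministic scheduling, with the core differences playing out along dimensions such as NoC topology, memory hierarchy, and compiler abstraction.

Cerebras is the one that genuinely shocked me, and yet it is about to be the first of the four to get an IPO outcome.

Its choice is an order of magnitude more aggressive than the other three: don't build chips, build the entire wafer.

A single WSE-3 is a full 21.5cm × 21.5cm wafer, with 900,000 PEs physically joined into one continuous piece of silicon through scribe-line stitching. The process was co-developed by Cerebras and TSMC: the narrow strips originally used for dicing the wafer are repurposed as metal wiring that crosses reticles, so that all reticles are physically stitched into a single chip. (Figure 2 shows the internal structure of a single WSE-3: the left half is the wafer's reticle grid and scribe-line stitching, the right half zooms in on the microarchitecture of a single PE.)

The structure of a single PE is minimal: an 8-wide FP16 SIMD compute core directly attached to 48KB of local SRAM, with no cache hierarchy, so every data access is a deterministic single cycle. Add a 5-port router (N/S/E/W + loopback), and communication latency between adjacent PEs is also a single cycle. The key point is that the mesh across reticle boundaries has exactly the same physical parameters as the mesh inside a reticle, so the compiler and runtime never need to be aware that reticle boundaries exist.

From the perspective of LLM inference, this uniformity is extremely valuable.

The bottleneck of LLM inference is the decode phase. For every generated token, the model weights have to be read in full once, while the amount of computation is small: a typical memory-bound scenario. The core problem for GPU clusters at this stage is data movement. HBM bandwidth is limited, and traffic between cards has to pass through four layers of interconnect, NVLink → NVSwitch → InfiniBand → Ethernet, where each layer differs by orders of magnitude in bandwidth and latency, and the programming model has to handle the topology boundary of each layer explicitly.

Cerebras sidesteps this problem entirely. Fabric bandwidth inside a single wafer is 27 PB/s. Weights flow from the external MemoryX storage cluster through SwarmX into the wafer, then propagate and execute across the PEs in dataflow fashion, with the same placement and routing algorithms running over the whole wafer. (Figure 1 shows this system-level architecture: from the MemoryX parameter storage cluster to the SwarmX interconnect fabric, down to as many as 2048 CS-3 nodes at the bottom, with the dataflow directions of weight broadcast and gradient reduction clear at a glance.)

Each of the 900,000 PEs carries 48KB of SRAM, for about 42GB of on-chip storage in total. Each PE's access to its own local SRAM is single-cycle and deterministic, inter-PE communication is single-cycle per hop, and latency is proportional to Manhattan distance. For inference, provided the weight-streaming compiler can effectively distribute weights to the right PEs, the aggregate bandwidth of this 42GB of distributed on-chip SRAM far exceeds GPU HBM solutions, with no access nondeterminism from a cache hierarchy and no cross-chip data movement overhead.

Back to my own experience. When I worked on inference chip architecture, the trade-offs in NoC topology and memory hierarchy took a great deal of effort, because the chip boundary is a hard constraint and there is always a discontinuity between the cost of cross-chip communication and on-chip communication. Cerebras' approach effectively removes that discontinuity from the on-chip communication side, at the cost of redefining the entire manufacturing and packaging chain.

This also explains Cerebras' engineering trade-offs. All the architectural innovation is concentrated inside the wafer, while scale-out directly reuses the 100GbE + RoCE Ethernet ecosystem. With 27 PB/s inside the wafer versus SwarmX across CS-3 systems in the Tbps range, a gap of several orders of magnitude is handed entirely to commodity networking. In inference, the bandwidth and latency advantage inside a single wafer translates directly into token generation speed.

OpenAI choosing to work with Cerebras on inference makes sense at the architectural level. Large-scale online inference needs low latency, high throughput, and deterministic latency, and these three are exactly the structural advantages that wafer-scale architecture gets from uniform on-chip communication.

But this architecture also has several structural problems that deserve to be faced squarely.

Yield and cost are unavoidable. When a whole wafer is a single chip, a defect in any reticle affects the whole. Cerebras handles this with redundant PEs and routing around defects, but the redundancy ratio and yield data have never been made public. The manufacturing cost of a wafer is already far higher than in the model of dicing it and selling individual dies. Add the 23kW power draw and 15U size of a single system, and deployment density and TCO face a real test in the economics of large-scale inference clusters.

Most critical is the KV cache capacity bottleneck. 42GB of on-chip SRAM looks large, but in long-context inference the KV cache grows linearly with sequence length. Taking Llama 70B as a reference, the KV cache for a 128K context in FP16 consumes about 40GB, and even with KV cache quantization the capacity pressure in long-sequence scenarios remains significant. Whatever does not fit on-chip has to rely on MemoryX for external storage, with data coming back through SwarmX. That path's bandwidth is in the Tbps range, and the gap versus the 27 PB/s inside the wafer means decode speed in long-sequence scenarios will be capped by external bandwidth. This may be the most fundamental architectural constraint Cerebras faces in inference.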
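The two capacity figures in the post above (about 42GB of aggregate SRAM, about 40GB of KV cache) can be sanity-checked with a few lines of arithmetic. This is only a rough sketch under stated assumptions: the post does not give the model layout, so the code assumes the published Llama 2/3 70B configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128), reads "128K" as 131,072 tokens and "48KB" as 48 × 1024 bytes, and uses hypothetical function names.

```python
GIB = 1024 ** 3


def aggregate_sram_bytes(num_pes: int, sram_per_pe_bytes: int) -> int:
    """Total on-wafer SRAM if every PE contributes its full local memory."""
    return num_pes * sram_per_pe_bytes


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence: a K and a V vector per layer per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


# Figures from the post: 900,000 PEs with 48KB of SRAM each.
sram = aggregate_sram_bytes(num_pes=900_000, sram_per_pe_bytes=48 * 1024)

# Assumed Llama 70B layout (not stated in the post): 80 layers, 8 KV heads,
# head_dim 128, FP16 values (2 bytes), 128K-token context.
kv = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                    seq_len=128 * 1024)

print(f"aggregate SRAM: {sram / GIB:.1f} GiB")
print(f"KV cache:       {kv / GIB:.1f} GiB")
print(f"KV cache / SRAM: {kv / sram:.2f}")
```

Under these assumptions a single 128K-token sequence already takes up nearly all of the on-wafer SRAM, which is the capacity pressure the post describes. Without grouped-query attention (64 KV heads instead of 8) the cache would be eight times larger.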
AI Adam retweeted
Jim Keller @jimkxa
My current list of "laws" governing computer design. Did I miss any?
Rent's Rule
Pollack's Rule
Amdahl's Law
Moore's Law
Dennard Scaling
Bitter Lesson
Little's Law
Jevons Paradox
AI Adam retweeted
antirez @antirez
Welcome to DS4, a specialized inference engine for DeepSeek v4 Flash. github.com/antirez/ds4 This project would have been impossible without the existence of llama.cpp and GGML and the work of @ggerganov and all the other contributors. Thanks!
AI Adam retweeted
Goodfire @GoodfireAI
Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵
AI Adam @AI_AdamZ
wow love it
luthira @luthiraabeykoon

We implemented @karpathy 's MicroGPT fully on FPGA fabric. No GPU. No PyTorch. No CPU inference loop. Just a transformer burned into hardware, generating 50,000+ tokens/sec. The model is small, but the idea is not: inference does not have to live only in software 👇

AI Adam @AI_AdamZ
@jukan05 I think the X accounts you follow are the most useful public sources. Which English and Korean sources do you recommend?
Jukan @jukan05
What are the most useful Substacks, media outlets, or websites for tracking Chinese technology trends and the current state of China’s tech ecosystem? Paid sources are fine. I’d appreciate your recommendations.
Armaan Sidhu @realarmaansidhu
Jane Street's moat isn't tech. It's flow.

George Coyle asks the right question and the answer is uncomfortable for finance Twitter.

Jane Street trades roughly $20 billion a day across ETFs, options, and fixed income. They're the largest market maker in US ETFs by a wide margin. They handle around a third of all retail ETF flow. They see order book activity nobody else sees.

That flow trains their pricing models. The pricing models capture the flow. The captured flow trains the next iteration of models. Recursion all the way down. This is the flywheel hedge funds talk about and almost nobody actually has.

Citadel Securities is the only real competitor. The two firms together handle north of 50 percent of US equity options volume. D.E. Shaw and Two Sigma can't catch up because they don't run market-making books at this scale. They trade on signals. Jane Street trades on flow.

What nobody's saying: barriers to entry in modern market making are now structural, not technological. You can hire the same PhDs. You can buy the same hardware. You cannot manufacture a 15-year head start on order flow data, broker relationships, and exchange-level rebate structures.

Jane Street pays out 40 percent of revenue to its 2,500 employees. Bonus pools that hit $100M for individual senior partners. Citadel does the same. That money isn't free. It's the rent collected on a flywheel nobody else can build anymore.

The story isn't why nobody's competing. It's why nobody can.
George Coyle @gfc4

If Jane Street is making so much money, why isn't anyone coming in to compete thus reducing their revenue? Technological barriers to entry?

AI Adam retweeted
AI Adam @AI_AdamZ
@realarmaansidhu No, the moat is tech. I haven’t seen other trading firms that understand neural networks that well.