Martin Andrews

399 posts


@mdda123

AI Research / Founder @ Red Dragon AI. Co-organiser of Machine Learning Singapore MeetUp. @GoogleDevExpert (ML). Fixed Income quant in NYC during AI winter

Singapore · Joined January 2014
1.9K Following · 926 Followers
Pinned Tweet
Martin Andrews @mdda123
Next week I'll be in Vancouver presenting two papers on which I was first author:
* "A Reasoning-Based Approach to Cryptic Crossword Clue Solving" (#ICML2025)
* "GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization" (@ESFoMo #AMD)
Martin Andrews @mdda123
@altryne If it were actually true that an "H100 is worth more today than 3 years ago", then Nvidia would raise their prices. But the fact that their new chips are better value in perf/$ means that the market price of H100s has declined. Not to zero, though (OTOH, depreciation-talk is meaningless).
Martin Andrews @mdda123
@ShivamDuggal4 Note that there's also a bias in what is being trained for : Binary operations like + or * have been selected for being 'fundamental'. To truly test learnability, one would have to come up with a new ~fundamental operation that has never been discussed before on the web.
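To make that concrete, here is a rough Python sketch of what such a test could look like. The operation below is deliberately made up and purely illustrative (any definition without a web footprint would do); make_samples renders variable-length "a ? b = c" strings one could train and probe on.

import random

def novel_op(a: int, b: int) -> int:
    # Deliberately made-up operation (hypothetical, purely illustrative):
    # digit-wise max of the two operands, plus the digit sum of the smaller one.
    da, db = str(a), str(b)
    da, db = da.zfill(len(db)), db.zfill(len(da))
    digitwise_max = int("".join(max(x, y) for x, y in zip(da, db)))
    digit_sum_of_min = sum(int(c) for c in str(min(a, b)))
    return digitwise_max + digit_sum_of_min

def make_samples(n: int, max_digits: int = 6, seed: int = 0) -> list[str]:
    # Variable-length operands, rendered as plain-text training samples.
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        a = rng.randrange(10 ** rng.randrange(1, max_digits + 1))
        b = rng.randrange(10 ** rng.randrange(1, max_digits + 1))
        out.append(f"{a} ? {b} = {novel_op(a, b)}")
    return out

if __name__ == "__main__":
    for line in make_samples(5):
        print(line)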
Shivam Duggal @ShivamDuggal4
Even simple: can we solve for variable length addition if not allowed to pretrain on any coding / python data? Can text-only LLMs trained only on addition-subtraction samples figure out the actual algorithms? Curious if some work already shows that.
Wesley Smith @neowes2025
I really don't understand this karpathy/autoresearch hype. I mean, it's a cool project, but haven't we been doing this kind of thing for a while now? What is different from DSPy, GEPA and that whole area of tools? What am I missing?
Martin Andrews @mdda123
@__tinygrad__ @Ambroise23968 As an outsider, it seems like some of these are Ooof, others are 'squint and it's kinda understandable'. OTOH, it really shows that Tiny has been playing ahead of the puck, and others will scramble to get there : Tiny can plug the better information into what they've built.
the tiny corp @__tinygrad__
@Ambroise23968 Our reverse engineering was decent, but time consuming and incomplete around the edges. It's so nice to have real docs.
the tiny corp @__tinygrad__
AMD open sourced rocprof-trace-decoder! This was one of the last pieces of closed source code on the CPU side -- the definitions of the hardware SQTT traces are now public. AMD's tracing infrastructure is better than NVIDIA's, it can trace the timing of every instruction.
Ali Hatamizadeh @ahatamiz1
@MayankMish98 You are aware that Mamba2 has a very popular repository and we have all been using it for training Mamba models, and writing our papers ? Just so you know, we have used the same exact initialization as in your PR which is basically a copy-paste of the original repo !!!!
Mayank Mishra @MayankMish98
We identified an issue with the Mamba-2 🐍 initialization in the HuggingFace and FlashLinearAttention repositories (dt_bias being incorrectly initialized). This bug is related to 2 main issues:
1. The init being incorrect (torch.ones) if Mamba-2 layers are used in isolation without the Mamba2ForCausalLM model class (this has already been fixed: github.com/fla-org/flash-…).
2. Skipping initialization due to meta device init for DTensors with FSDP-2 (github.com/fla-org/flash-… will fix this issue upon merging).
The difference is substantial. Mamba-2 seems to be quite sensitive to the initialization. Check out our experiments at the 7B MoE scale: wandb.ai/mayank31398/ma…
Special thanks to @kevinyli_, @bharatrunwal2, @HanGuo97, @tri_dao and @_albertgu 🙏 Also thanks to @SonglinYang4 for quickly helping in merging the PR.
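For context, a rough sketch of the style of dt_bias initialization the original Mamba-2 code uses, as opposed to a flat torch.ones. This is not the exact HF/FLA fix, and the dt_min / dt_max / floor values below are assumed defaults, not taken from the PR: dt is drawn log-uniformly in a small range and then passed through an inverse softplus, so that softplus(dt_bias) lands back in that range.

import math
import torch

def mamba2_style_dt_bias(nheads: int, dt_min: float = 1e-3, dt_max: float = 1e-1,
                         dt_init_floor: float = 1e-4) -> torch.Tensor:
    # Sample dt log-uniformly in [dt_min, dt_max] (values here are assumed defaults)...
    dt = torch.exp(torch.rand(nheads) * (math.log(dt_max) - math.log(dt_min))
                   + math.log(dt_min)).clamp(min=dt_init_floor)
    # ...and store its inverse softplus, so that softplus(dt_bias) recovers dt.
    return dt + torch.log(-torch.expm1(-dt))

# By contrast, a flat torch.ones(nheads) init puts softplus(dt_bias) ≈ 1.31 for
# every head, far outside the intended [1e-3, 1e-1] range for dt.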
Pranav @pranav_berry
visiting singapore next week! who is around? reach out if you’re curious about ai, bio or any rabbitholes really (anything from microplastics to building mega infra)
Martin Andrews @mdda123
@olcan @fchollet Perhaps the constraint of only having limited exact memory (beyond which the capacity to recall exactly peters out) is what incentivises the brain to actively search for explanations. A machine + infinite perfect recall can shortcut 'understanding', so it needs better incentives
Olcan @olcan
@fchollet Disagree. It merely suggests it is possible to get AGI without cramming more specific knowledge, but we are not constrained by that, and the meta-rule discovery could well come from a system with a lot of specific knowledge.
François Chollet @fchollet
Natural evolution suggests that AGI won't come from larger models that cram more and more specific knowledge, but from discovering the meta-rules that allow a system to grow and adapt its own architecture in response to the environment.
Martin Andrews @mdda123
@sudoingX @jukan05 You can't buy coffee with telemetry data. But shorting a stock that falls 19% in one day = actual cash
Sudo su @sudoingX
they don't need to. the developers using claude are already doing it for free. every wrapper startup is essentially an unpaid R&D team showing anthropic exactly which features users want. the telemetry alone is worth more than any short position. why bet against companies when you can just absorb what they discovered?
Jukan @jukan05
I seriously don’t get why Anthropic is out there begging investors for money. Just short a bunch of SaaS companies, then casually add their entire feature set to Claude.
Martin Andrews @mdda123
@YouJiacheng Couldn't the DRAM memory reads be sharded across *many* optically connected devices? When doing inference, the compiler has a lot of forewarning about what accesses it needs to queue up - the bits could be interleaved and queued to arrive just when needed.
You Jiacheng @YouJiacheng
this rumor looks technically wrong to me. you simply can't provide the bandwidth with disaggregated memory, even connected by optics.
Jukan@jukan05

Rumor: Starting with TPU v8, Google will no longer use HBM? The incident was triggered by the global capacity shortage of HBM, which will be unable to meet AI growth demands over the next 2 to 3 years. At the same time, traditional HBM is limited by its design of being fixed on the motherboard, resulting in a capacity ceiling. Accordingly, Google will develop a new solution to be launched in 2027.

The physical form involves removing HBM and establishing independent DRAM memory cabinets (containing 16–32 Trays), dynamically allocating memory through photonic technology. This technology deconstructs the originally single and simple HBM component into three parts:
- Transport Layer: Employs all-optical interconnects, ensuring cross-cabinet communication efficiency through OCS (Optical Circuit Switching) and customized CXL protocols. The CPUs, GPUs, and memory modules of the memory pool share a single set of protocols.
- Storage Layer: Utilizes large-scale DRAM arrays to replace HBM, significantly increasing the addressing space. The memory corresponding to a single TPU can leap from 192GB/256GB to 512GB or even above 768GB.
- Control Layer: Adds dedicated memory-side CPU servers for management.

Compared to the native "TPU+HBM" direct connection, this "three-in-one" split-combination solution results in a calculation efficiency loss of less than 2%.

Regarding this technology, first is OCS, which satisfies high-speed switching in an all-optical environment and achieves bandwidth and latency close to direct connections with HBM or silicon photonic HBM. Traditional Ethernet (via copper) typically has a latency of over 200 nanoseconds, while using an OCS all-optical switching network can reduce latency to below 100 nanoseconds, which is why it is important.

Second, in this architecture, there is a dual-side CPU architecture (Tier-1 and Tier-2 CPUs):
Tier-1 CPU (TPU side): Located on the TPU motherboard, primarily responsible for interconnect communication between TPUs.
Tier-2 CPU (Memory pool side): Most likely deployed on the memory server (DRAM server) side, specifically responsible for communication coordination between TPUs and the distributed memory addressing space.
The Tier-2 CPU is deployed independently because, logically, the original TPU motherboard CPU could still read the memory pool, but using the old CPU would involve complex protocol conversions (such as translation between PCIe signals and CXL-like protocols), creating efficiency bottlenecks.

Third, the interface is completed directly at the chip level through a "photonic packaging interface." This method is similar to CPO (Co-Packaged Optics) technology, integrating optical interfaces directly within the package of chips like the CPU/TPU, replacing traditional external optical modules. The first supplier contacted during the solution design stage was Lightmatter, with multiple suppliers to follow.

This solution, which removes HBM and changes it to an external DRAM memory pool, actually converts what was originally ultra-high-frequency motherboard-level access into "cross-cabinet access." Theoretically, this would generate huge latency and efficiency losses. However, this is not the case. Specifically, complex electrical/optical conversions exist between chips, hosts, and ring networks; these hardware-level protocol conversions and settings generate significant hidden overhead invisible to users. After adopting the DRAM memory pool solution, although CXL translation is introduced, many cumbersome hardware protocol conversion steps from the original architecture are removed.

If HBM prices drop and performance improves due to capacity expansion by manufacturers like Samsung and Hynix over the next two years, Google is unlikely to return to the HBM solution due to cost considerations. Google does not believe that upstream manufacturers like Hynix, Samsung, and Micron will subvert their own main product line pricing or mass production strategies to accommodate one or two major customers. They might release some profit margin, but they will not cooperate to an extreme degree.

This solution also reduces reliance on CoWoS because HBM is no longer needed. At the same time, the HBM chips originally on the silicon interposer substrate occupied a large area; after removing HBM, the saved CoWoS area can be entirely given to the TPU's Compute Core. Thus, within the same physical dimensions, a TPU chip with stronger performance and a larger area can be made, no longer restricted by the physical size of HBM.

Regarding memory, the V7 generation had a single HBM capacity of about 192GB, and V8A is about 256GB, but through memory pooling, the memory per TPU can easily double to 512GB or even reach 768GB or more.

The solution is expected to be implemented next year, with the final route determined before March 5. The initial deployment ratio is about 30%, with 100% replacement expected to be achieved in 3 years.

Sector Beneficiaries:
- OCS (Optical Engine): Lightmatter, as the primary supplier, provides photonic packaging interfaces, integrating optical interfaces within the chip package to replace external modules.
- CXL-like: Requires CXL-like chips (MXC chips) to achieve the interconnect between TPUs and the memory pool, costing $100 per chip. One chip manages two channels for two 256GB memory modules, matching the TPU and memory side synchronously. If it is 512GB, two MXC chips are needed; for 768GB, four chips.
- DRAM Modules: The quantity of GBs increases significantly.
- CPU: Each memory Tray needs to be equipped with a CPU for scheduling; high performance is not required here, and ARM-based CPUs can be used.
- PCB: Independent DRAM cabinets require large, multi-layer PCBs to carry a large number of DIMM slots.

Source: 国泰海通 (Guotai Haitong)

Martin Andrews @mdda123
@eatnow240008 @jonashuebotter Much better to get the top-k logits from the teacher, and then calculate the losses using them via a ~scan over the relevant student's logits. Huge saving of memory - but you do need to drill down to logits, rather than just matching final hidden states (much more specific)!
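A minimal PyTorch sketch of one way to implement this. The exact recipe is an assumption on my part: here both sides are simply renormalised over the teacher's top-k vocab indices (how the tail mass and the temperature are handled are design choices), but it shows the memory point: only k teacher values per position are stored instead of the full-vocab distribution.

import torch
import torch.nn.functional as F

def topk_distill_loss(student_logits: torch.Tensor,
                      teacher_topk_logits: torch.Tensor,
                      teacher_topk_ids: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL-style distillation loss computed only over the teacher's top-k entries.

    student_logits:      [batch, seq, vocab]  full student logits
    teacher_topk_logits: [batch, seq, k]      teacher logit values at its top-k ids
    teacher_topk_ids:    [batch, seq, k]      vocab indices of those entries
    (e.g. produced once by: teacher_topk_logits, teacher_topk_ids = teacher_logits.topk(k, dim=-1))
    """
    # Gather the student's logits at the teacher's top-k indices.
    student_topk = torch.gather(student_logits, dim=-1, index=teacher_topk_ids)
    # Renormalise both distributions over the k retained entries (an assumption;
    # other schemes keep a catch-all bucket for the teacher's tail mass).
    t_logprobs = F.log_softmax(teacher_topk_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_topk / temperature, dim=-1)
    return F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")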
eatnow @eatnow240008
@jonashuebotter When experimenting with context distillation I've had to match last layer activations as a GPU-poor replacement for the all vocab logit loss. Wonder if it would also work here?
Jonas Hübotter @jonashubotter
Training LLMs with verifiable rewards uses 1bit signal per generated response. This hides why the model failed. Today, we introduce a simple algorithm that enables the model to learn from any rich feedback! And then turns it into dense supervision. (1/n)
Martin Andrews @mdda123
@NomadProduct @FrnkNlsn @KenOno691 No : Since if the most negative number were even, adding this to 2 would produce an even number (so not prime). If the most negative number were odd, all the odd primes would become even numbers (so not prime). Does the set cover all primes? No, since small ones are skipped.
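Spelled out, writing $c$ for the magnitude of the polynomial's most negative output (my reading of "add that to the result"):

\[
c := -\min_x P(x) > 0, \qquad \text{shifted outputs: } P(x) + c .
\]

If $c$ is even, the output that used to be $2$ becomes $2 + c$: even and greater than $2$, hence composite. If $c$ is odd, every output that used to be an odd prime $p$ becomes $p + c$: again even and greater than $2$, hence composite. Either way, primes are lost.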
Nomad @NomadProduct
@FrnkNlsn @KenOno691 Dumb stupid question...if we found out the most negative value this polynomial produced, could we add that to the result and still output all the primes?
Frank Nielsen @FrnkNlsn
Quite impressive result imho ... that I just discovered thanks to @KenOno691 A marvel of computing & math ! However note that this polynomial can also produce negative values (to be discarded, not prime).
Rishabh Agarwal @agarwl_
@Teknium The use of "self-distillation" is overloaded -- it's basically the same LLM but different information in context. On-policy context distillation is the right term but it's not as catchy
Rishabh Agarwal @agarwl_
One more interesting use case of on-policy distillation! Teacher doesn't have to be a bigger neural net, just something better than the student. Here they use student + expert demonstration as the teacher.
idan shenfeld@IdanShenfeld

People keep saying 2026 will be the year of continual learning. But there are still major technical challenges to making it a reality. Today we take the next step towards that goal — a new on-policy learning algorithm, suitable for continual learning! (1/n)

Martin Andrews @mdda123
@asimovinc Good point about traction during push-off. But it's also obvious that someone wearing rigid shoes (like clogs or Japanese Geta) for the first time is at a significant gait coordination disadvantage compared to having a flexible & sensitive toe joint area
Asimov @asimovinc
Why Asimov has articulated toes and why every humanoid should.
Martin Andrews retweeted
Elliot Arledge @elliotarledge
if you care about getting your agents to write faster kernels, this is a MUST.
Martin Andrews retweeted
Thor 雷神 ⚡️ @thorwebdev
Full house at the Machine Learning Singapore meetup tonight 🇸🇬🔥 Thanks @mdda123 and @Sam_Witteveen for hosting 🫶
Sakana AI @SakanaAILabs
Introducing DroPE: Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings pub.sakana.ai/DroPE/

We are releasing a new method called DroPE to extend the context length of pretrained LLMs without the massive compute costs usually associated with long-context fine-tuning.

The core insight of this work challenges a fundamental assumption in Transformer architecture. We discovered that explicit positional embeddings like RoPE are critical for training convergence but eventually become the primary bottleneck preventing models from generalizing to longer sequences. Our solution is radically simple: We treat positional embeddings as a temporary training scaffold rather than a permanent architectural necessity.

Real-world workflows like reviewing massive code diffs or analyzing legal contracts require context windows that break standard pretrained models. While models without positional embeddings (NoPE) generalize better to these unseen lengths, they are notoriously unstable to train from scratch. Here, we achieve the best of both worlds by using embeddings to ensure stability during pretraining and then dropping them to unlock length extrapolation during inference.

Our approach unlocks seamless zero-shot context extension without any expensive long-context training. We demonstrated this on a range of off-the-shelf open-source LLMs. In our tests, recalibrating any model with DroPE requires less than 1% of the original pretraining budget, yet it significantly outperforms established methods on challenging benchmarks like LongBench and RULER.

We have released the code and the full paper to encourage the community to rethink the role of positional encodings in modern LLMs.
Paper: arxiv.org/abs/2512.12167
Code: github.com/SakanaAI/DroPE
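Not the DroPE recipe itself (the short recalibration step the announcement describes is omitted here), but a toy Python sketch of the switch the idea rests on, assuming a standard RoPE formulation: positions are injected via rotary embeddings during training, and the same causal attention can be run with them disabled (the NoPE setting) at inference.

import torch
import torch.nn.functional as F

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Standard rotary embedding over the last dim of x: [batch, heads, seq, dim].
    b, h, s, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)
    angles = torch.arange(s, dtype=x.dtype)[:, None] * freqs[None, :]  # [seq, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def causal_attention(q, k, v, use_rope: bool = True):
    # With use_rope=False, position is conveyed only implicitly (via the causal
    # mask): the NoPE setting that DroPE switches to after pretraining.
    if use_rope:
        q, k = rope(q), rope(k)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    mask = torch.triu(torch.full(scores.shape[-2:], float("-inf")), diagonal=1)
    return F.softmax(scores + mask, dim=-1) @ v

# q = k = v = torch.randn(1, 2, 16, 8)
# causal_attention(q, k, v, use_rope=True)   # training-time behaviour
# causal_attention(q, k, v, use_rope=False)  # inference-time NoPE behaviour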
Martin Andrews retweeted
JD Ross @justindross
My last company, Opendoor ($7B), replaced real estate brokers. Today, my new company WithCoverage raised $42M to replace insurance brokers. It was led by Sequoia & Khosla, the first time @RoelofBotha and @Rabois partnered since PayPal.