Martin Andrews

399 posts


@mdda123

AI Research / Founder @ Red Dragon AI. Co-organiser of Machine Learning Singapore MeetUp. @GoogleDevExpert (ML). Fixed Income quant in NYC during AI winter

Singapore · Joined January 2014
1.9K Following · 926 Followers
Pinned Tweet
Martin Andrews @mdda123
Next week I'll be in Vancouver presenting two papers on which I was first author:
* "A Reasoning-Based Approach to Cryptic Crossword Clue Solving" (#ICML2025)
* "GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization" (@ESFoMo #AMD)
Martin Andrews @mdda123
@altryne If it were actually true that an "H100 is worth more today than 3 years ago", then Nvidia would raise their prices. But the fact that their new chips are better value in perf/$ means that the market price of H100s has declined. Not to zero, though (OTOH, depreciation-talk is meaningless).
Martin Andrews @mdda123
@ShivamDuggal4 Note that there's also a bias in what is being trained for : Binary operations like + or * have been selected for being 'fundamental'. To truly test learnability, one would have to come up with a new ~fundamental operation that has never been discussed before on the web.
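To make that concrete, here is a rough Python sketch of what such a test could look like. The operation below is deliberately made up and purely illustrative (any definition without a web footprint would do); make_samples renders variable-length "a ? b = c" strings one could train and probe on.

import random

def novel_op(a: int, b: int) -> int:
    # Deliberately made-up operation (hypothetical, purely illustrative):
    # digit-wise max of the two operands, plus the digit sum of the smaller one.
    da, db = str(a), str(b)
    da, db = da.zfill(len(db)), db.zfill(len(da))
    digitwise_max = int("".join(max(x, y) for x, y in zip(da, db)))
    digit_sum_of_min = sum(int(c) for c in str(min(a, b)))
    return digitwise_max + digit_sum_of_min

def make_samples(n: int, max_digits: int = 6, seed: int = 0) -> list[str]:
    # Variable-length operands, rendered as plain-text training samples.
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        a = rng.randrange(10 ** rng.randrange(1, max_digits + 1))
        b = rng.randrange(10 ** rng.randrange(1, max_digits + 1))
        out.append(f"{a} ? {b} = {novel_op(a, b)}")
    return out

if __name__ == "__main__":
    for line in make_samples(5):
        print(line)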
Shivam Duggal @ShivamDuggal4
Even simple: can we solve for variable length addition if not allowed to pretrain on any coding / python data? Can text-only LLMs trained only on addition-subtraction samples figure out the actual algorithms? Curious if some work already shows that.
Wesley Smith @neowes2025
I really don't understand this karpathy/autoresearch hype. I mean, it's a cool project, but haven't we been doing this kind of thing for a while now? What is different from DSPy, GEPA and that whole area of tools? What am I missing?
Martin Andrews @mdda123
@__tinygrad__ @Ambroise23968 As an outsider, it seems like some of these are Ooof, others are 'squint and it's kinda understandable'. OTOH, it really shows that Tiny has been playing ahead of the puck, and others will scramble to get there : Tiny can plug the better information into what they've built.
the tiny corp @__tinygrad__
@Ambroise23968 Our reverse engineering was decent, but time consuming and incomplete around the edges. It's so nice to have real docs.
the tiny corp @__tinygrad__
AMD open sourced rocprof-trace-decoder! This was one of the last pieces of closed source code on the CPU side -- the definitions of the hardware SQTT traces are now public. AMD's tracing infrastructure is better than NVIDIA's, it can trace the timing of every instruction.
Ali Hatamizadeh @ahatamiz1
@MayankMish98 You are aware that Mamba2 has a very popular repository and we have all been using it for training Mamba models, and writing our papers ? Just so you know, we have used the same exact initialization as in your PR which is basically a copy-paste of the original repo !!!!
Mayank Mishra @MayankMish98
We identified an issue with the Mamba-2 🐍 initialization in the HuggingFace and FlashLinearAttention repositories (dt_bias being incorrectly initialized). This bug is related to 2 main issues:
1. The init being incorrect (torch.ones) if Mamba-2 layers are used in isolation without the Mamba2ForCausalLM model class (this has already been fixed: github.com/fla-org/flash-…).
2. Skipping initialization due to meta device init for DTensors with FSDP-2 (github.com/fla-org/flash-… will fix this issue upon merging).
The difference is substantial. Mamba-2 seems to be quite sensitive to the initialization. Check out our experiments at the 7B MoE scale: wandb.ai/mayank31398/ma…
Special thanks to @kevinyli_, @bharatrunwal2, @HanGuo97, @tri_dao and @_albertgu 🙏 Also thanks to @SonglinYang4 for quickly helping in merging the PR.
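For context, a rough sketch of the style of dt_bias initialization the original Mamba-2 code uses, as opposed to a flat torch.ones. This is not the exact HF/FLA fix, and the dt_min / dt_max / floor values below are assumed defaults, not taken from the PR: dt is drawn log-uniformly in a small range and then passed through an inverse softplus, so that softplus(dt_bias) lands back in that range.

import math
import torch

def mamba2_style_dt_bias(nheads: int, dt_min: float = 1e-3, dt_max: float = 1e-1,
                         dt_init_floor: float = 1e-4) -> torch.Tensor:
    # Sample dt log-uniformly in [dt_min, dt_max] (values here are assumed defaults)...
    dt = torch.exp(torch.rand(nheads) * (math.log(dt_max) - math.log(dt_min))
                   + math.log(dt_min)).clamp(min=dt_init_floor)
    # ...and store its inverse softplus, so that softplus(dt_bias) recovers dt.
    return dt + torch.log(-torch.expm1(-dt))

# By contrast, a flat torch.ones(nheads) init puts softplus(dt_bias) ≈ 1.31 for
# every head, far outside the intended [1e-3, 1e-1] range for dt.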
Pranav @pranav_berry
visiting singapore next week! who is around? reach out if you’re curious about ai, bio or any rabbitholes really (anything from microplastics to building mega infra)
Martin Andrews @mdda123
@olcan @fchollet Perhaps the constraint of only having limited exact memory (beyond which the capacity to recall exactly peters out) is what incentivises the brain to actively search for explanations. A machine + infinite perfect recall can shortcut 'understanding', so it needs better incentives
Olcan @olcan
@fchollet Disagree. It merely suggests it is possible to get AGI without cramming more specific knowledge, but we are not constrained by that, and the meta-rule discovery could well come from a system with a lot of specific knowledge.
François Chollet @fchollet
Natural evolution suggests that AGI won't come from larger models that cram more and more specific knowledge, but from discovering the meta-rules that allow a system to grow and adapt its own architecture in response to the environment.
Martin Andrews @mdda123
@sudoingX @jukan05 You can't buy coffee with telemetry data. But shorting a stock that falls 19% in one day = actual cash
Sudo su @sudoingX
they don't need to. the developers using claude are already doing it for free. every wrapper startup is essentially an unpaid R&D team showing anthropic exactly which features users want. the telemetry alone is worth more than any short position. why bet against companies when you can just absorb what they discovered?
Jukan @jukan05
I seriously don’t get why Anthropic is out there begging investors for money. Just short a bunch of SaaS companies, then casually add their entire feature set to Claude.
Martin Andrews @mdda123
@YouJiacheng Couldn't the DRAM memory reads be sharded across *many* optically connected devices? When doing inference, the compiler has a lot of forewarning about what accesses it needs to queue up - the bits could be interleaved and queued to arrive just when needed.
You Jiacheng @YouJiacheng
this rumor looks technically wrong to me. you simply can't provide the bandwidth with disaggregated memory, even connected by optics.
Jukan@jukan05

Rumor: Starting with TPU v8, Google will no longer use HBM? The incident was triggered by the global capacity shortage of HBM, which will be unable to meet AI growth demands over the next 2 to 3 years. At the same time, traditional HBM is limited by its design of being fixed on the motherboard, resulting in a capacity ceiling. Accordingly, Google will develop a new solution to be launched in 2027.

The physical form involves removing HBM and establishing independent DRAM memory cabinets (containing 16–32 Trays), dynamically allocating memory through photonic technology. This technology deconstructs the originally single and simple HBM component into three parts:
- Transport Layer: Employs all-optical interconnects, ensuring cross-cabinet communication efficiency through OCS (Optical Circuit Switching) and customized CXL protocols. The CPUs, GPUs, and memory modules of the memory pool share a single set of protocols.
- Storage Layer: Utilizes large-scale DRAM arrays to replace HBM, significantly increasing the addressing space. The memory corresponding to a single TPU can leap from 192GB/256GB to 512GB or even above 768GB.
- Control Layer: Adds dedicated memory-side CPU servers for management.

Compared to the native "TPU+HBM" direct connection, this "three-in-one" split-combination solution results in a calculation efficiency loss of less than 2%.

Regarding this technology, first is OCS, which satisfies high-speed switching in an all-optical environment and achieves bandwidth and latency close to direct connections with HBM or silicon photonic HBM. Traditional Ethernet (via copper) typically has a latency of over 200 nanoseconds, while using an OCS all-optical switching network can reduce latency to below 100 nanoseconds, which is why it is important.

Second, in this architecture, there is a dual-side CPU architecture (Tier-1 and Tier-2 CPUs):
Tier-1 CPU (TPU side): Located on the TPU motherboard, primarily responsible for interconnect communication between TPUs.
Tier-2 CPU (Memory pool side): Most likely deployed on the memory server (DRAM server) side, specifically responsible for communication coordination between TPUs and the distributed memory addressing space.
The Tier-2 CPU is deployed independently because, logically, the original TPU motherboard CPU could still read the memory pool, but using the old CPU would involve complex protocol conversions (such as translation between PCIe signals and CXL-like protocols), creating efficiency bottlenecks.

Third, the interface is completed directly at the chip level through a "photonic packaging interface." This method is similar to CPO (Co-Packaged Optics) technology, integrating optical interfaces directly within the package of chips like the CPU/TPU, replacing traditional external optical modules. The first supplier contacted during the solution design stage was Lightmatter, with multiple suppliers to follow.

This solution, which removes HBM and changes it to an external DRAM memory pool, actually converts what was originally ultra-high-frequency motherboard-level access into "cross-cabinet access." Theoretically, this would generate huge latency and efficiency losses. However, this is not the case. Specifically, complex electrical/optical conversions exist between chips, hosts, and ring networks; these hardware-level protocol conversions and settings generate significant hidden overhead invisible to users. After adopting the DRAM memory pool solution, although CXL translation is introduced, many cumbersome hardware protocol conversion steps from the original architecture are removed.

If HBM prices drop and performance improves due to capacity expansion by manufacturers like Samsung and Hynix over the next two years, Google is unlikely to return to the HBM solution due to cost considerations. Google does not believe that upstream manufacturers like Hynix, Samsung, and Micron will subvert their own main product line pricing or mass production strategies to accommodate one or two major customers. They might release some profit margin, but they will not cooperate to an extreme degree.

This solution also reduces reliance on CoWoS because HBM is no longer needed. At the same time, the HBM chips originally on the silicon interposer substrate occupied a large area; after removing HBM, the saved CoWoS area can be entirely given to the TPU's Compute Core. Thus, within the same physical dimensions, a TPU chip with stronger performance and a larger area can be made, no longer restricted by the physical size of HBM.

Regarding memory, the V7 generation had a single HBM capacity of about 192GB, and V8A is about 256GB, but through memory pooling, the memory per TPU can easily double to 512GB or even reach 768GB or more.

The solution is expected to be implemented next year, with the final route determined before March 5. The initial deployment ratio is about 30%, with 100% replacement expected to be achieved in 3 years.

Sector Beneficiaries:
- OCS (Optical Engine): Lightmatter, as the primary supplier, provides photonic packaging interfaces, integrating optical interfaces within the chip package to replace external modules.
- CXL-like: Requires CXL-like chips (MXC chips) to achieve the interconnect between TPUs and the memory pool, costing $100 per chip. One chip manages two channels for two 256GB memory modules, matching the TPU and memory side synchronously. If it is 512GB, two MXC chips are needed; for 768GB, four chips.
- DRAM Modules: The quantity of GBs increases significantly.
- CPU: Each memory Tray needs to be equipped with a CPU for scheduling; high performance is not required here, and ARM-based CPUs can be used.
- PCB: Independent DRAM cabinets require large, multi-layer PCBs to carry a large number of DIMM slots.

Source: 国泰海通 (Guotai Haitong)

Martin Andrews @mdda123
@eatnow240008 @jonashuebotter Much better to get the top-k logits from the teacher, and then calculate the losses using them via a ~scan over the relevant student's logits. Huge saving of memory - but you do need to drill down to logits, rather than just matching final hidden states (much more specific)!
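A minimal PyTorch sketch of one way to implement this. The exact recipe is an assumption on my part: here both sides are simply renormalised over the teacher's top-k vocab indices (how the tail mass and the temperature are handled are design choices), but it shows the memory point: only k teacher values per position are stored instead of the full-vocab distribution.

import torch
import torch.nn.functional as F

def topk_distill_loss(student_logits: torch.Tensor,
                      teacher_topk_logits: torch.Tensor,
                      teacher_topk_ids: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL-style distillation loss computed only over the teacher's top-k entries.

    student_logits:      [batch, seq, vocab]  full student logits
    teacher_topk_logits: [batch, seq, k]      teacher logit values at its top-k ids
    teacher_topk_ids:    [batch, seq, k]      vocab indices of those entries
    (e.g. produced once by: teacher_topk_logits, teacher_topk_ids = teacher_logits.topk(k, dim=-1))
    """
    # Gather the student's logits at the teacher's top-k indices.
    student_topk = torch.gather(student_logits, dim=-1, index=teacher_topk_ids)
    # Renormalise both distributions over the k retained entries (an assumption;
    # other schemes keep a catch-all bucket for the teacher's tail mass).
    t_logprobs = F.log_softmax(teacher_topk_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_topk / temperature, dim=-1)
    return F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")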
eatnow @eatnow240008
@jonashuebotter When experimenting with context distillation I've had to match last layer activations as a GPU-poor replacement for the all vocab logit loss. Wonder if it would also work here?
Jonas Hübotter @jonashubotter
Training LLMs with verifiable rewards uses 1bit signal per generated response. This hides why the model failed. Today, we introduce a simple algorithm that enables the model to learn from any rich feedback! And then turns it into dense supervision. (1/n)
Martin Andrews @mdda123
@NomadProduct @FrnkNlsn @KenOno691 No : Since if the most negative number were even, adding this to 2 would produce an even number (so not prime). If the most negative number were odd, all the odd primes would become even numbers (so not prime). Does the set cover all primes? No, since small ones are skipped.
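Spelled out, writing $c$ for the magnitude of the polynomial's most negative output (my reading of "add that to the result"):

\[
c := -\min_x P(x) > 0, \qquad \text{shifted outputs: } P(x) + c .
\]

If $c$ is even, the output that used to be $2$ becomes $2 + c$: even and greater than $2$, hence composite. If $c$ is odd, every output that used to be an odd prime $p$ becomes $p + c$: again even and greater than $2$, hence composite. Either way, primes are lost.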
Nomad @NomadProduct
@FrnkNlsn @KenOno691 Dumb stupid question...if we found out the most negative value this polynomial produced, could we add that to the result and still output all the primes?
Frank Nielsen @FrnkNlsn
Quite impressive result imho ... that I just discovered thanks to @KenOno691 A marvel of computing & math ! However note that this polynomial can also produce negative values (to be discarded, not prime).
Rishabh Agarwal @agarwl_
@Teknium The use of "self-distillation" is overloaded -- it's basically the same LLM but different information in context. On-policy context distillation is the right term but it's not as catchy
Rishabh Agarwal @agarwl_
One more interesting use case of on-policy distillation! Teacher doesn't have to be a bigger neural net, just something better than the student. Here they use student + expert demonstration as the teacher.
idan shenfeld@IdanShenfeld

People keep saying 2026 will be the year of continual learning. But there are still major technical challenges to making it a reality. Today we take the next step towards that goal — a new on-policy learning algorithm, suitable for continual learning! (1/n)

Martin Andrews @mdda123
@asimovinc Good point about traction during push-off. But it's also obvious that someone wearing rigid shoes (like clogs or Japanese Geta) for the first time is at a significant gait coordination disadvantage compared to having a flexible & sensitive toe joint area
Asimov @asimovinc
Why Asimov has articulated toes and why every humanoid should.
Martin Andrews retweeted
Elliot Arledge @elliotarledge
if you care about getting your agents to write faster kernels, this is a MUST.
Martin Andrews retweeted
Thor 雷神 ⚡️ @thorwebdev
Full house at the Machine Learning Singapore meetup tonight 🇸🇬🔥 Thanks @mdda123 and @Sam_Witteveen for hosting 🫶
Sakana AI @SakanaAILabs
Introducing DroPE: Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings pub.sakana.ai/DroPE/

We are releasing a new method called DroPE to extend the context length of pretrained LLMs without the massive compute costs usually associated with long-context fine-tuning.

The core insight of this work challenges a fundamental assumption in Transformer architecture. We discovered that explicit positional embeddings like RoPE are critical for training convergence but eventually become the primary bottleneck preventing models from generalizing to longer sequences. Our solution is radically simple: We treat positional embeddings as a temporary training scaffold rather than a permanent architectural necessity.

Real-world workflows like reviewing massive code diffs or analyzing legal contracts require context windows that break standard pretrained models. While models without positional embeddings (NoPE) generalize better to these unseen lengths, they are notoriously unstable to train from scratch. Here, we achieve the best of both worlds by using embeddings to ensure stability during pretraining and then dropping them to unlock length extrapolation during inference.

Our approach unlocks seamless zero-shot context extension without any expensive long-context training. We demonstrated this on a range of off-the-shelf open-source LLMs. In our tests, recalibrating any model with DroPE requires less than 1% of the original pretraining budget, yet it significantly outperforms established methods on challenging benchmarks like LongBench and RULER.

We have released the code and the full paper to encourage the community to rethink the role of positional encodings in modern LLMs.
Paper: arxiv.org/abs/2512.12167
Code: github.com/SakanaAI/DroPE
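Not the DroPE recipe itself (the short recalibration step the announcement describes is omitted here), but a toy Python sketch of the switch the idea rests on, assuming a standard RoPE formulation: positions are injected via rotary embeddings during training, and the same causal attention can be run with them disabled (the NoPE setting) at inference.

import torch
import torch.nn.functional as F

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Standard rotary embedding over the last dim of x: [batch, heads, seq, dim].
    b, h, s, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)
    angles = torch.arange(s, dtype=x.dtype)[:, None] * freqs[None, :]  # [seq, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def causal_attention(q, k, v, use_rope: bool = True):
    # With use_rope=False, position is conveyed only implicitly (via the causal
    # mask): the NoPE setting that DroPE switches to after pretraining.
    if use_rope:
        q, k = rope(q), rope(k)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    mask = torch.triu(torch.full(scores.shape[-2:], float("-inf")), diagonal=1)
    return F.softmax(scores + mask, dim=-1) @ v

# q = k = v = torch.randn(1, 2, 16, 8)
# causal_attention(q, k, v, use_rope=True)   # training-time behaviour
# causal_attention(q, k, v, use_rope=False)  # inference-time NoPE behaviour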
Martin Andrews retweeted
JD Ross @justindross
My last company, Opendoor ($7B), replaced real estate brokers. Today, my new company WithCoverage raised $42M to replace insurance brokers. It was led by Sequoia & Khosla, the first time @RoelofBotha and @Rabois partnered since PayPal.