Martin Marek
@mrtnm
254 posts
still writing code
New York, USA · Joined December 2009
430 Following · 294 Followers
Andrej Karpathy @karpathy
New art project. Train and inference GPT in 243 lines of pure, dependency-free Python. This is the *full* algorithmic content of what is needed. Everything else is just for efficiency. I cannot simplify this any further. gist.github.com/karpathy/8627f…
653 replies · 3.1K reposts · 25.2K likes · 5.2M views
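The gist itself isn't reproduced here, but the flavor of "full algorithmic content in pure, dependency-free Python" can be sketched with the core of a GPT, single-head causal self-attention, using nothing beyond the standard library (helper names are mine, not Karpathy's):

```python
import math

def softmax(xs):
    # subtract the max for numerical stability
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, k, v):
    """Single-head causal self-attention.
    q, k, v: lists of T vectors (plain lists of floats) of dimension d."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # causal mask: token t attends to positions 0..t only
        scores = [sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        out.append([sum(w[s] * v[s][j] for s in range(t + 1))
                    for j in range(d)])
    return out
```

Everything else in a real GPT (embeddings, MLPs, layer norm, the training loop) is more of the same kind of arithmetic; the libraries only make it fast.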
Martin Marek @mrtnm
Interestingly, TPU v7x (Ironwood) is the first generation to *drop* 4-bit precision, an opposite trend to Nvidia. While Google Cloud docs do not list full TPU specs, they’re actually listed in the Pallas source code: github.com/jax-ml/jax/blo…
[attached image]
0 replies · 0 reposts · 3 likes · 168 views
Martin Marek @mrtnm
@jordibruin @thevividapp Hm. I don't own the Studio Display XDR which is exactly why I'm curious about these measurements / tests. (But I absolutely love Vivid on my MBP!)
1 reply · 0 reposts · 0 likes · 63 views
Jordi Bruin @jordibruin
Had to test how bright those 2000 nits could go on the Studio Display XDR compared to the old one with @thevividapp. If you're going to be using it in a bright environment (or your MacBook Pro), give it a try: getvivid.app
6 replies · 2 reposts · 24 likes · 7.5K views
François Fleuret @francoisfleuret
The two main problems with architecture design are that 1. You have to please the GPU, so for instance anything recurrent is prohibited, 2. You have to beat baselines which have co-evolved with the data sets and training procedures.
smiz @__smiz
@francoisfleuret @ylecun When will it be easy, or even cheap, to iterate on model architectures? I suspect that’s when this will pop wide open.
8 replies · 4 reposts · 91 likes · 10.8K views
Martin Marek @mrtnm
@SamuelMLSmith @OpenAI Is this purely the result of writing more efficient GPU kernels for existing hardware? Is there a cost in terms of throughput?
0 replies · 0 reposts · 0 likes · 388 views
Martin Marek @mrtnm
@giffmana @ziv_ravid Cold posteriors? I’m guilty of that one myself 😅 There was this mystery of why a “cold” posterior works better than the Bayes posterior in BNNs. But it turns out that a cold posterior is equivalent to just using a different prior, so there was actually no issue…
Martin Marek @mrtnm
We additionally introduce a "confidence" prior that directly approximates a cold likelihood. This allows us to see cold posteriors from a new perspective: as approximating a valid prior combined with the categorical likelihood. 6/8
1 reply · 0 reposts · 5 likes · 346 views
Lucas Beyer (bl16) @giffmana
rofl the Bayesians are doing it again! (Might still be an interesting paper)
[attached image]
18 replies · 7 reposts · 280 likes · 37.8K views
Karan Jagtiani @karanjagtiani04
@rasbt Interesting insights on batch size dynamics. Curious about how these findings translate to other architectures outside LLMs. Anyone tried small batches with CNNs or RNNs?
1 reply · 0 reposts · 0 likes · 152 views
Sebastian Raschka @rasbt
One of the underrated papers this year: "Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful" (arxiv.org/abs/2507.07101) (I can confirm this holds for RLVR, too! I have some experiments to share soon.)
[attached image]
27 replies · 174 reposts · 1.4K likes · 90.4K views
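The paper's point about gradient accumulation can be illustrated on a toy quadratic loss (illustrative only, not the paper's experiment): accumulating gradients over k microbatches reproduces exactly one large-batch SGD step, while spending the same data on k separate small-batch steps performs k updates:

```python
def grad(w, batch):
    # gradient of the mean squared loss 0.5 * (w - x)^2 over the batch
    return sum(w - x for x in batch) / len(batch)

def sgd_accumulated(w, microbatches, lr):
    # average the microbatch gradients, then take ONE optimizer step
    g = sum(grad(w, mb) for mb in microbatches) / len(microbatches)
    return w - lr * g

def sgd_small_steps(w, microbatches, lr):
    # take one optimizer step PER microbatch on the same data
    for mb in microbatches:
        w = w - lr * grad(w, mb)
    return w
```

With equal-sized microbatches, `sgd_accumulated` is identical to a single step on the concatenated batch, so accumulation buys memory savings but no extra updates; the small-step variant updates more often for the same FLOPs.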
Martin Marek @mrtnm
@rasbt @postimortem That's a good point. This was a very basic experiment: we finetuned a base model on the (short) solutions from the MATH dataset. I agree that if you want the model to output long sequences, then it totally makes sense to put a single long sequence in the batch.
1 reply · 0 reposts · 2 likes · 56 views
Sebastian Raschka @rasbt
@mrtnm @postimortem So my intuition here is that MATH usually benefits from larger reasoning chains (2048 or longer). So if we disregard the FLOP budget here, we could potentially, under the same memory requirement, get a better model by investing the freed up memory into longer sequence training.
1 reply · 0 reposts · 1 like · 82 views
Martin Marek @mrtnm
@rasbt @postimortem What is your intuition here? Under a fixed FLOP budget, if we decrease the token batch size, we can take more optimizer steps – I think this is where the benefits mostly come from. However, if you increase the sequence length, that decreases the number of optimizer steps.
1 reply · 0 reposts · 2 likes · 63 views
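The trade-off Marek describes follows from simple accounting (a sketch of my own, not taken from the paper): under a fixed FLOP budget, the number of optimizer steps is the budget divided by the cost of one step, so shrinking the token batch buys more steps, while stretching the sequence length at a fixed batch size costs steps:

```python
def optimizer_steps(flop_budget, flops_per_token, batch_size, seq_len):
    # tokens consumed by one optimizer step
    tokens_per_step = batch_size * seq_len
    # steps affordable under the fixed budget
    return flop_budget // (flops_per_token * tokens_per_step)
```

Halving `batch_size` doubles the step count; doubling `seq_len` halves it, which is the tension with Raschka's suggestion of reinvesting freed memory into longer sequences.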
Sebastian Raschka @rasbt
@mrtnm @postimortem Very nice! For the green one, it would be interesting to add another bar where you use that saved memory (from reducing batch size) to increase sequence length while keeping batch size at 1
1 reply · 0 reposts · 2 likes · 69 views
Kris @Krishna70284154
@rasbt Optimiser states consuming more memory than activations is still a big problem to solve in an LLM world where memory is less available than compute.
1 reply · 0 reposts · 3 likes · 579 views
Martin Marek @mrtnm
@postimortem @rasbt That is true. Although it is worth noting that using a small batch size can save a _lot_ of memory. In this experiment, we were able to run full finetuning with just ~16 bits / param. (bar width = memory)
[attached image]
1 reply · 0 reposts · 2 likes · 75 views
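For scale, one illustrative accounting of the "~16 bits / param" figure (my assumptions, not the paper's exact breakdown): bf16 weights with no persistent optimizer state (plain SGD, no momentum) come to 16 bits per parameter, versus a typical AdamW setup carrying an fp32 master copy plus two fp32 moments on top of the bf16 weights:

```python
def persistent_memory_gb(n_params, bits_per_param):
    # persistent HBM footprint: bits -> bytes -> GB
    return n_params * bits_per_param / 8 / 1e9

# bf16 weights only (vanilla SGD, no optimizer state):      16 bits/param
# AdamW: bf16 weights + fp32 master + fp32 m + fp32 v:
#        16 + 32 + 32 + 32 = 112 bits/param
```

On an 8B-parameter model that is roughly 16 GB versus 112 GB of persistent state, which is why small-batch vanilla SGD can fit full finetuning where AdamW cannot.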
tim ganiev @postimortem
@rasbt imo grad acc survives not because it's "optimal", but because large-batch recipes get copied without rederiving the optimizer dynamics. A lot of other finetuning "heuristics" stick around for the same reason: "it works well, why change?"
1 reply · 0 reposts · 2 likes · 765 views
Martin Marek @mrtnm
@scottjmaddox @rasbt This speedrun result is interesting though. In general, batch size is tied to the learning rate, β₁, β₂, and even weight decay. So I wonder to what extent you could observe similar results by using a schedule for the other hyperparams instead.
1 reply · 0 reposts · 0 likes · 76 views
Martin Marek @mrtnm
@scottjmaddox @rasbt Indeed, we only ran pretraining for 20 tokens / parameter. But 1) we also tested finetuning; and 2) Adam uses EMA to smooth out minibatch gradient noise and you can increase β₁ and β₂ to emulate a larger batch size.
[attached image]
1 reply · 0 reposts · 0 likes · 179 views
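A sketch of the EMA point (the horizon-stretching rule below is a common heuristic, not necessarily the paper's exact prescription): an EMA with coefficient β averages over roughly 1/(1-β) steps, so when the batch shrinks by a factor k you can raise β so the horizon grows by k, averaging over about the same number of tokens:

```python
def emulate_larger_batch(beta, k):
    """Stretch an EMA's averaging horizon ~1/(1-beta) by a factor k,
    to mimic a k-times larger batch (heuristic, assumed here)."""
    return 1.0 - (1.0 - beta) / k

def ema(values, beta):
    # bias-corrected exponential moving average, as in Adam's moments
    m = 0.0
    out = []
    for t, v in enumerate(values, start=1):
        m = beta * m + (1.0 - beta) * v
        out.append(m / (1.0 - beta ** t))
    return out
```

For example, β₁ = 0.9 with a 10× smaller batch would become β₁ = 0.99 under this rule, smoothing the extra minibatch gradient noise over more steps.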
Martin Marek @mrtnm
On a TPU v6e-8, Qwen3-8B achieves 30% training MFU and Qwen3-32B achieves over 20,000 tokens / sec sampling throughput (~50% memory bandwidth utilization).
[attached image]
1 reply · 0 reposts · 1 like · 224 views
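The MFU figure presumably follows the standard estimate of ~6·N training FLOPs per token for a model with N parameters (forward plus backward). A sketch of the calculation; the peak-FLOPs number for a given TPU slice would come from the hardware spec and is not reproduced here:

```python
def training_mfu(tokens_per_sec, n_params, peak_flops_per_sec):
    # ~6 FLOPs per parameter per token for forward + backward
    achieved_flops = 6 * n_params * tokens_per_sec
    return achieved_flops / peak_flops_per_sec
```

Plugging in a measured training throughput and the slice's aggregate peak bf16 FLOP/s gives the utilization fraction quoted in the tweet.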
Martin Marek @mrtnm
🎄 My holiday project – implementing Qwen3 in pure JAX in just 70 LOC – without any model libraries (Flax / Haiku / etc).
[attached image]
2 replies · 1 repost · 16 likes · 680 views
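The library-free style presumably means plain functions over a dict of arrays, with no Module classes from Flax or Haiku. A sketch of two Qwen-style pieces in that spirit (numpy stands in for `jax.numpy` here; the parameter names are mine, not from Marek's code):

```python
import numpy as np  # stand-in for jax.numpy; the pure-function style is the point

def rms_norm(x, scale, eps=1e-6):
    # Qwen-style RMSNorm: rescale by root-mean-square, learned gain, no bias
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * scale

def swiglu_mlp(params, x):
    # SwiGLU feed-forward: silu(x @ W_gate) * (x @ W_up), projected back down
    gate = x @ params["w_gate"]
    silu = gate / (1.0 + np.exp(-gate))  # silu(g) = g * sigmoid(g)
    return (silu * (x @ params["w_up"])) @ params["w_down"]
```

Because parameters live in an ordinary dict (a pytree), the same functions drop straight into `jax.jit`, `jax.grad`, and sharding transforms with no framework glue.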