Martin Marek
@mrtnm
254 posts
still writing code
New York, USA · Joined December 2009
430 Following · 294 Followers
Andrej Karpathy @karpathy
New art project. Train and inference GPT in 243 lines of pure, dependency-free Python. This is the *full* algorithmic content of what is needed. Everything else is just for efficiency. I cannot simplify this any further. gist.github.com/karpathy/8627f…
653 replies · 3.1K reposts · 25.2K likes · 5.2M views
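The gist itself isn't reproduced here, but the flavor of "full algorithmic content in pure, dependency-free Python" can be sketched with the core of a GPT, single-head causal self-attention, using nothing beyond the standard library (helper names are mine, not Karpathy's):

```python
import math

def softmax(xs):
    # subtract the max for numerical stability
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, k, v):
    """Single-head causal self-attention.
    q, k, v: lists of T vectors (plain lists of floats) of dimension d."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # causal mask: token t attends to positions 0..t only
        scores = [sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        out.append([sum(w[s] * v[s][j] for s in range(t + 1))
                    for j in range(d)])
    return out
```

Everything else in a real GPT (embeddings, MLPs, layer norm, the training loop) is more of the same kind of arithmetic; the libraries only make it fast.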
Martin Marek @mrtnm
Interestingly, TPU v7x (Ironwood) is the first generation to *drop* 4-bit precision, an opposite trend to Nvidia. While Google Cloud docs do not list full TPU specs, they’re actually listed in the Pallas source code: github.com/jax-ml/jax/blo…
[attached image]
0 replies · 0 reposts · 3 likes · 168 views
Martin Marek @mrtnm
@jordibruin @thevividapp Hm. I don't own the Studio Display XDR which is exactly why I'm curious about these measurements / tests. (But I absolutely love Vivid on my MBP!)
1 reply · 0 reposts · 0 likes · 63 views
Jordi Bruin @jordibruin
Had to test how bright those 2000 nits could go on the Studio Display XDR compared to the old one with @thevividapp. If you're going to be using it in a bright environment (or your MacBook Pro), give it a try: getvivid.app
6 replies · 2 reposts · 24 likes · 7.5K views
François Fleuret @francoisfleuret
The two main problems with architecture design are that 1. You have to please the GPU, so for instance anything recurrent is prohibited, 2. You have to beat baselines which have co-evolved with the data sets and training procedures.
smiz @__smiz
@francoisfleuret @ylecun When will it be easy, or even cheap, to iterate on model architectures? I suspect that’s when this will pop wide open.
8 replies · 4 reposts · 91 likes · 10.8K views
Martin Marek @mrtnm
@SamuelMLSmith @OpenAI Is this purely the result of writing more efficient GPU kernels for existing hardware? Is there a cost in terms of throughput?
0 replies · 0 reposts · 0 likes · 388 views
Martin Marek @mrtnm
@giffmana @ziv_ravid Cold posteriors? I’m guilty of that one myself 😅 There was this mystery of why a “cold” posterior works better than the Bayes posterior in BNNs. But it turns out that a cold posterior is equivalent to just using a different prior, so there was actually no issue…
Martin Marek @mrtnm
We additionally introduce a "confidence" prior that directly approximates a cold likelihood. This allows us to see cold posteriors from a new perspective: as approximating a valid prior combined with the categorical likelihood. 6/8
1 reply · 0 reposts · 5 likes · 346 views
Lucas Beyer (bl16) @giffmana
rofl the Bayesians are doing it again! (Might still be an interesting paper)
[attached image]
18 replies · 7 reposts · 280 likes · 37.8K views
Karan Jagtiani @karanjagtiani04
@rasbt Interesting insights on batch size dynamics. Curious about how these findings translate to other architectures outside LLMs. Anyone tried small batches with CNNs or RNNs?
1 reply · 0 reposts · 0 likes · 152 views
Sebastian Raschka @rasbt
One of the underrated papers this year: "Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful" (arxiv.org/abs/2507.07101) (I can confirm this holds for RLVR, too! I have some experiments to share soon.)
[attached image]
27 replies · 174 reposts · 1.4K likes · 90.4K views
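The paper's point about gradient accumulation can be illustrated on a toy quadratic loss (illustrative only, not the paper's experiment): accumulating gradients over k microbatches reproduces exactly one large-batch SGD step, while spending the same data on k separate small-batch steps performs k updates:

```python
def grad(w, batch):
    # gradient of the mean squared loss 0.5 * (w - x)^2 over the batch
    return sum(w - x for x in batch) / len(batch)

def sgd_accumulated(w, microbatches, lr):
    # average the microbatch gradients, then take ONE optimizer step
    g = sum(grad(w, mb) for mb in microbatches) / len(microbatches)
    return w - lr * g

def sgd_small_steps(w, microbatches, lr):
    # take one optimizer step PER microbatch on the same data
    for mb in microbatches:
        w = w - lr * grad(w, mb)
    return w
```

With equal-sized microbatches, `sgd_accumulated` is identical to a single step on the concatenated batch, so accumulation buys memory savings but no extra updates; the small-step variant updates more often for the same FLOPs.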
Martin Marek @mrtnm
@rasbt @postimortem That's a good point. This was a very basic experiment: we finetuned a base model on the (short) solutions from the MATH dataset. I agree that if you want the model to output long sequences, then it totally makes sense to put a single long sequence in the batch.
1 reply · 0 reposts · 2 likes · 56 views
Sebastian Raschka @rasbt
@mrtnm @postimortem So my intuition here is that MATH usually benefits from larger reasoning chains (2048 or longer). So if we disregard the FLOP budget here, we could potentially, under the same memory requirement, get a better model by investing the freed up memory into longer sequence training.
1 reply · 0 reposts · 1 like · 82 views
Martin Marek @mrtnm
@rasbt @postimortem What is your intuition here? Under a fixed FLOP budget, if we decrease the token batch size, we can take more optimizer steps – I think this is where the benefits mostly come from. However, if you increase the sequence length, that decreases the number of optimizer steps.
1 reply · 0 reposts · 2 likes · 63 views
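The trade-off Marek describes follows from simple accounting (a sketch of my own, not taken from the paper): under a fixed FLOP budget, the number of optimizer steps is the budget divided by the cost of one step, so shrinking the token batch buys more steps, while stretching the sequence length at a fixed batch size costs steps:

```python
def optimizer_steps(flop_budget, flops_per_token, batch_size, seq_len):
    # tokens consumed by one optimizer step
    tokens_per_step = batch_size * seq_len
    # steps affordable under the fixed budget
    return flop_budget // (flops_per_token * tokens_per_step)
```

Halving `batch_size` doubles the step count; doubling `seq_len` halves it, which is the tension with Raschka's suggestion of reinvesting freed memory into longer sequences.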
Sebastian Raschka @rasbt
@mrtnm @postimortem Very nice! For the green one, it would be interesting to add another bar where you use that saved memory (from reducing batch size) to increase sequence length while keeping batch size at 1
1 reply · 0 reposts · 2 likes · 69 views
Kris @Krishna70284154
@rasbt Optimiser states consuming more memory than activations is still a big problem to solve in an LLM world where memory is less available than compute.
1 reply · 0 reposts · 3 likes · 579 views
Martin Marek @mrtnm
@postimortem @rasbt That is true. Although it is worth noting that using a small batch size can save a _lot_ of memory. In this experiment, we were able to run full finetuning with just ~16 bits / param. (bar width = memory)
[attached image]
1 reply · 0 reposts · 2 likes · 75 views
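For scale, one illustrative accounting of the "~16 bits / param" figure (my assumptions, not the paper's exact breakdown): bf16 weights with no persistent optimizer state (plain SGD, no momentum) come to 16 bits per parameter, versus a typical AdamW setup carrying an fp32 master copy plus two fp32 moments on top of the bf16 weights:

```python
def persistent_memory_gb(n_params, bits_per_param):
    # persistent HBM footprint: bits -> bytes -> GB
    return n_params * bits_per_param / 8 / 1e9

# bf16 weights only (vanilla SGD, no optimizer state):      16 bits/param
# AdamW: bf16 weights + fp32 master + fp32 m + fp32 v:
#        16 + 32 + 32 + 32 = 112 bits/param
```

On an 8B-parameter model that is roughly 16 GB versus 112 GB of persistent state, which is why small-batch vanilla SGD can fit full finetuning where AdamW cannot.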
tim ganiev @postimortem
@rasbt imo grad acc survives not because it's "optimal", but because large-batch recipes get copied without rederiving the optimizer dynamics. A lot of other finetuning "heuristics" stick around for the same reason: "it works well, why change?"
1 reply · 0 reposts · 2 likes · 765 views
Martin Marek @mrtnm
@scottjmaddox @rasbt This speedrun result is interesting though. In general, batch size is tied to the learning rate, β₁, β₂, and even weight decay. So I wonder to what extent you could observe similar results by using a schedule for the other hyperparams instead.
1 reply · 0 reposts · 0 likes · 76 views
Martin Marek @mrtnm
@scottjmaddox @rasbt Indeed, we only ran pretraining for 20 tokens / parameter. But 1) we also tested finetuning; and 2) Adam uses EMA to smooth out minibatch gradient noise and you can increase β₁ and β₂ to emulate a larger batch size.
[attached image]
1 reply · 0 reposts · 0 likes · 179 views
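A sketch of the EMA point (the horizon-stretching rule below is a common heuristic, not necessarily the paper's exact prescription): an EMA with coefficient β averages over roughly 1/(1-β) steps, so when the batch shrinks by a factor k you can raise β so the horizon grows by k, averaging over about the same number of tokens:

```python
def emulate_larger_batch(beta, k):
    """Stretch an EMA's averaging horizon ~1/(1-beta) by a factor k,
    to mimic a k-times larger batch (heuristic, assumed here)."""
    return 1.0 - (1.0 - beta) / k

def ema(values, beta):
    # bias-corrected exponential moving average, as in Adam's moments
    m = 0.0
    out = []
    for t, v in enumerate(values, start=1):
        m = beta * m + (1.0 - beta) * v
        out.append(m / (1.0 - beta ** t))
    return out
```

For example, β₁ = 0.9 with a 10× smaller batch would become β₁ = 0.99 under this rule, smoothing the extra minibatch gradient noise over more steps.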
Martin Marek @mrtnm
On a TPU v6e-8, Qwen3-8B achieves 30% training MFU and Qwen3-32B achieves over 20,000 tokens / sec sampling throughput (~50% memory bandwidth utilization).
[attached image]
1 reply · 0 reposts · 1 like · 224 views
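The MFU figure presumably follows the standard estimate of ~6·N training FLOPs per token for a model with N parameters (forward plus backward). A sketch of the calculation; the peak-FLOPs number for a given TPU slice would come from the hardware spec and is not reproduced here:

```python
def training_mfu(tokens_per_sec, n_params, peak_flops_per_sec):
    # ~6 FLOPs per parameter per token for forward + backward
    achieved_flops = 6 * n_params * tokens_per_sec
    return achieved_flops / peak_flops_per_sec
```

Plugging in a measured training throughput and the slice's aggregate peak bf16 FLOP/s gives the utilization fraction quoted in the tweet.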
Martin Marek @mrtnm
🎄 My holiday project – implementing Qwen3 in pure JAX in just 70 LOC – without any model libraries (Flax / Haiku / etc).
[attached image]
2 replies · 1 repost · 16 likes · 680 views
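The library-free style presumably means plain functions over a dict of arrays, with no Module classes from Flax or Haiku. A sketch of two Qwen-style pieces in that spirit (numpy stands in for `jax.numpy` here; the parameter names are mine, not from Marek's code):

```python
import numpy as np  # stand-in for jax.numpy; the pure-function style is the point

def rms_norm(x, scale, eps=1e-6):
    # Qwen-style RMSNorm: rescale by root-mean-square, learned gain, no bias
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * scale

def swiglu_mlp(params, x):
    # SwiGLU feed-forward: silu(x @ W_gate) * (x @ W_up), projected back down
    gate = x @ params["w_gate"]
    silu = gate / (1.0 + np.exp(-gate))  # silu(g) = g * sigmoid(g)
    return (silu * (x @ params["w_up"])) @ params["w_down"]
```

Because parameters live in an ordinary dict (a pytree), the same functions drop straight into `jax.jit`, `jax.grad`, and sharding transforms with no framework glue.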