Gurvan

151 posts

@gurvanson

I like Super Smash Bros. Melee for the Nintendo GameCube™, Machine Learning, and most of all taking care of my friends.

Rennes, France · Joined July 2009
414 Following · 55 Followers
Gurvan@gurvanson·
@jsuarez what do you think about MinGRU for RL?
1 reply · 0 reposts · 1 like · 51 views
Joseph Suarez 🐡@jsuarez·
For trying out new ideas, PufferLib is the only thing out there that is fast and hackable. I've ported and run several hundred experiments on MinGRU and Mamba in the last 48 hours alone. Shipping new tools to help with this in the next version, too!
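(For context, a minimal sketch of the MinGRU recurrence being discussed, following the "Were RNNs All We Needed?" formulation: the gate and candidate depend only on the input, so there is no reset gate. This is the sequential one-step form, not PufferLib's actual implementation; all names here are illustrative.)

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    # Minimal GRU: no reset gate, and the gate/candidate depend only on
    # the input, which is what makes parallel-scan training possible.
    def __init__(self, dim_in, dim_hidden):
        super().__init__()
        self.to_z = nn.Linear(dim_in, dim_hidden)        # update gate
        self.to_h_tilde = nn.Linear(dim_in, dim_hidden)  # candidate state

    def forward(self, x, h):
        # x: (batch, dim_in), h: (batch, dim_hidden) -- one recurrent step
        z = torch.sigmoid(self.to_z(x))
        h_tilde = self.to_h_tilde(x)
        return (1 - z) * h + z * h_tilde
```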
2 replies · 1 repost · 8 likes · 1.4K views
Joseph Suarez 🐡@jsuarez·
Point - experimental work represents multiple months of dev effort on PufferLib. Probably around 1/3, not 10%. The Puffer 2.0 -> 3.0 capabilities leap was mostly algo-side. It's just that the sexy stuff usually doesn't work as written on paper.
Ariel@redtachyon

Aight let's talk about frameworks, libraries, RL, and why I probably don't like your favorite RL codebase. Yes, including that one.

The unusual thing about RL is that the algorithm is the easy part. GRPO is a single-line equation on some logprobs. If you have the data, computing the loss is trivial, and then presumably you're using it with a backprop library of your choice.

But that's the problem -- getting the data. It's a pain in the ass. In regular RL you have to do rollouts, perhaps truncate some episodes, and handle the ends accordingly. If you don't want to be a snail, you'll want to vectorize the environment and adapt the algorithm for that. If you want to do an LLM, you need to do all the nonsense that makes LLMs fit in memory. You need to be careful about your prompts, mask out the right parts for the loss. You need a decent generation engine (vLLM), which then makes it a pain to update the weights. If you want to do multi-agent multi-turn LLM RL, might as well commit sudoku.

While we have many disagreements on just about anything RL-related, I think @jsuarez's PufferLib exemplifies this point beautifully. It's without a doubt incredible at what it does - training RL algos on simulated environments very, very quickly. But most of its novelty is pure infra. The core algorithms are largely the same as they've been for years, and I'm willing to bet they represent less than 10% of the overall engineering effort. Naturally, this has implications for the code you need to write to do anything beyond running the built-in examples.

What I find time and time again is that for many sufficiently nontrivial (read: interesting) research problems, it takes a similar amount of time to (a) write the thing from scratch/from simple primitives, or (b) adapt an existing framework to accommodate crazy ideas. In the former, you focus on writing the actual logic. In the latter, you wrangle the framework to allow you to add the logic. I know which I like better.

All of this is because the algorithm is the easy part. The infra is the pain in the ass. So whenever you're in a position to choose - use the tools that simplify infra, and write the training loop yourself. Don't build frameworks, build libraries. You'll thank yourself later.

Big shout out to my Master's supervisor from back in the day, who was the first one to tell me to drop rllib and just write PPO myself in PyTorch. And to @hallerite for inspiring me to finally write up this rant. I might write a proper effortpost with examples at some point in the future if the people demand it.
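(To make "a single-line equation on some logprobs" concrete, here's a minimal sketch of a GRPO-style loss: group-normalized rewards as advantages plus a PPO-style clipped ratio. The shapes and the absence of a KL penalty are simplifying assumptions, not a reference implementation.)

```python
import torch

def grpo_loss(logprobs, old_logprobs, rewards, clip_eps=0.2):
    """logprobs/old_logprobs: (group,) summed token logprobs per sampled
    completion; rewards: (group,) scalar rewards for the same prompt."""
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logprobs - old_logprobs)
    # PPO-style clipped surrogate objective (minimize the negative).
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```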

2 replies · 0 reposts · 27 likes · 5.5K views
kache@yacineMTB·
@gurvanson literally any CVPR/SIGGRAPH paper in the last 3 years
1 reply · 0 reposts · 3 likes · 343 views
kache@yacineMTB·
honestly insane that this shit works
[image]
3 replies · 0 reposts · 111 likes · 7.9K views
Gurvan@gurvanson·
@jeremyphoward @ggerganov I understand that the amount of memory is a bottleneck on consumer GPUs, but wouldn't the inference speed still be better with fewer active parameters during generation?
1 reply · 0 reposts · 1 like · 233 views
Jeremy Howard@jeremyphoward·
@ggerganov MoE is strictly worse for folks using discrete consumer GPUs like a 3090 AFAICT. It seems pretty good for unified memory stuff like Macs though.
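(A rough back-of-envelope for why this happens: decode is bandwidth-bound, so a dense model that fits in 24 GB of VRAM streams weights at GDDR speed, while an MoE whose total weights spill into system RAM pays PCIe bandwidth for the active weights on every token. All model sizes and bandwidth figures below are illustrative assumptions.)

```python
# Time per token ~ bytes of weights touched / bandwidth they travel over.
GB = 1e9
vram_bw = 900 * GB   # ~3090-class GDDR6X bandwidth (approximate)
pcie_bw = 25 * GB    # ~PCIe 4.0 x16 effective bandwidth (approximate)

dense_active = 7e9 * 0.5   # hypothetical 7B dense model at 4 bits: fits in VRAM
moe_active = 12e9 * 0.5    # hypothetical MoE: 12B *active* params at 4 bits,
                           # but total weights too large for 24 GB

print(dense_active / vram_bw)  # ~0.004 s/token: weights stream from VRAM
print(moe_active / pcie_bw)    # ~0.24 s/token: active experts cross PCIe on
                               # every token, so fewer active params still lose
```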
5 replies · 0 reposts · 37 likes · 3.6K views
Georgi Gerganov@ggerganov·
gpt-oss is a great model IMO. OpenAI showed us the blueprint for winning local AI:

- Interleaved SWA
- Small head sizes in the attention
- Attention sinks
- Mixture of Experts FFN
- 4-bit training

All of these parts combined result in the best architecture for regular users. Very lightweight and efficient for inference on pretty much any hardware.

Qwen models are also great. The MoE works really well. I think they should just adopt iSWA and 4-bit training to become the best.

Gemma models are also great. They already have the 4-bit QAT figured out. It seems they just need to adopt the MoE architecture. And maybe reduce the head size a bit.

p.s. don't know if this makes sense, just my overall impression and intuitive understanding
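(Since the MoE FFN is the one ingredient above that Qwen already has and Gemma is said to need, here's a naive sketch of the pattern: a router picks top-k experts per token and only those experts' weights are touched, which is where the inference savings come from. A generic illustration, not gpt-oss's actual architecture; all sizes are made up.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    # Token-choice MoE FFN: route each token to its top-k experts and
    # mix the expert outputs with the (renormalized) router weights.
    def __init__(self, dim, hidden, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                        # x: (tokens, dim)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mix only the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e         # tokens routed to expert e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] = out[mask] + w * self.experts[e](x[mask])
        return out
```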
36 replies · 76 reposts · 993 likes · 82.1K views
Gurvan@gurvanson·
@kalomaze if the issue is keeping ssh stable, try mosh maybe?
0 replies · 0 reposts · 0 likes · 34 views
kalomaze@kalomaze·
my home internet (spectrum) is really flaky and my wifi keeps popping in and out, which makes persistent ssh a losing battle. should i just set a monitor down next to the router and use ethernet, or is there something else i can try?
33 replies · 1 repost · 72 likes · 7.3K views
madison@dearmadisonblue·
can an egregore have a personality disorder?
2 replies · 0 reposts · 1 like · 282 views
Gurvan@gurvanson·
@SSBM_Arte I think it happened to me once when I had pasted an image that wouldn't properly upload, but refreshing the page fixed it iirc. It could also be that a previous message is open for editing i guess
0 replies · 0 reposts · 1 like · 88 views
Julien Bernard@SSBM_Arte·
Would anyone familiar with Google AI Studio know why I can't send a message in one conversation (created yesterday, 174k tokens in) but can in another (created today, 4k tokens in)? Both Gemini 2.5. Never experienced this before, and really annoying given the context built :(
[two images]
1 reply · 1 repost · 3 likes · 1.6K views
Gurvan@gurvanson·
@qtnx_ talking about it tricks you into thinking you've already done something. don't fall for it
0 replies · 0 reposts · 0 likes · 50 views
Gurvan@gurvanson·
@jaxmorphy i mean, 3 denoising steps is not a lot. you can still see the stage outline really well. do you plan on using rolling diffusion/diffusion forcing?
1 reply · 0 reposts · 1 like · 32 views
ja@jaxmorphy·
@gurvanson that's at 40 denoising steps, here's 5M using just 3 denoising steps
[image]
2 replies · 0 reposts · 1 like · 35 views
ja@jaxmorphy·
Training diffusion transformers on Melee gameplay. 5M params, slightly better results. Still bad, but i need to train longer than 30 mins lol.
[image]
2 replies · 0 reposts · 4 likes · 173 views
Gurvan@gurvanson·
@jaxmorphy that's the world model i want to see
0 replies · 0 reposts · 1 like · 41 views
ja@jaxmorphy·
[image]
2 replies · 0 reposts · 1 like · 61 views
Gurvan@gurvanson·
@y0b1byte i think the DreamerV3 paper mentions that it uses the same set of hyperparameters for every experiment, so the comparison might not be entirely fair
0 replies · 0 reposts · 0 likes · 36 views
yobibyte@y0b1byte·
Singularity
[two images]
1 reply · 0 reposts · 12 likes · 1.1K views
Tanishq Mathew Abraham, Ph.D.@iScienceLuvr·
Training Large Language Models to Reason in a Continuous Latent Space

Introduces a new paradigm for LLM reasoning called Chain of Continuous Thought (COCONUT).

Extremely simple change: instead of mapping between hidden states and language tokens using the LLM head and embedding layer, you directly feed the last hidden state (a continuous thought) as the input embedding for the next token. The system can be optimized end-to-end by gradient descent, as continuous thoughts are fully differentiable.
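(A minimal sketch of the loop described above. It assumes a model that accepts an `inputs_embeds` tensor and returns last-layer hidden states of shape (batch, seq, dim); the real COCONUT training recipe interleaves these continuous steps with normal token decoding, which is elided here.)

```python
import torch

def continuous_thoughts(model, input_embeds, n_thoughts):
    # input_embeds: (batch, seq, dim) embeddings of the prompt so far.
    embeds = input_embeds
    for _ in range(n_thoughts):
        hidden = model(inputs_embeds=embeds)   # (batch, seq, dim) hidden states
        thought = hidden[:, -1:, :]            # the "continuous thought"
        # Feed the hidden state back in as the next input embedding,
        # skipping the LM head / embedding round-trip entirely.
        embeds = torch.cat([embeds, thought], dim=1)
    return embeds  # the whole chain stays differentiable end-to-end
```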
[image]
51 replies · 296 reposts · 1.9K likes · 377.6K views
Gurvan@gurvanson·
@spikedoanz @filipviz normalize by average won't work with negative logits. you could offset everything by the smallest logit maybe, and it would also get you the translation invariance property of softmax
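(A quick numeric check of both halves of this reply; the logit values are arbitrary.)

```python
import torch

logits = torch.tensor([-2.0, 1.0, 3.0])

# Plain sum-normalization breaks with mixed signs: a negative "probability".
print(logits / logits.sum())    # tensor([-1.0000,  0.5000,  1.5000])

# Offsetting by the smallest logit keeps everything non-negative, and the
# result is unchanged if you shift all logits by a constant -- the same
# translation invariance softmax has.
shifted = logits - logits.min()
print(shifted / shifted.sum())  # tensor([0.0000, 0.3750, 0.6250])

moved = logits + 7.0
moved_shifted = moved - moved.min()
print(moved_shifted / moved_shifted.sum())  # identical distribution

# Caveat: the smallest logit always gets probability exactly 0,
# which softmax never does.
```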
0 replies · 0 reposts · 2 likes · 101 views
spike@spikedoanz·
thanks for the response! the property i'm describing is mostly addressing the common reason given for why we use softmax:

softmax :: Vector[logits] -> Vector[probabilities]

and it seems strange, given this, that it would be biased in what distribution it preferred, since something like

def normalize_by_average(x): return x / x.sum()

provides an unbiased way to turn logits into probabilities, and is:
1. even easier to differentiate
2. has some of the same gradient properties as backprop (hooks up all numbers in a vector into a gradient chokepoint)
3. trivially more numerically stable by inspection

the point on amplifying differences between large values could also be addressed by taking the square of the vector, or 2^x (both smell more numerically stable to me), instead of doing exp().
2 replies · 0 reposts · 3 likes · 293 views
spike@spikedoanz·
TIL softmax isn't idempotent, and makes the spread converge towards a uniform distribution
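(Easy to check: apply softmax repeatedly and the spread collapses, since after one pass all values live in (0, 1), so each subsequent pass sees ever-smaller gaps.)

```python
import torch

x = torch.tensor([0.0, 1.0, 5.0])
for step in range(5):
    x = torch.softmax(x, dim=0)
    print(step, x)
# step 0 gives ~[0.007, 0.018, 0.975]; by step 4 the values are
# already close to the uniform distribution [1/3, 1/3, 1/3].
```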
[image]
7 replies · 1 repost · 123 likes · 14.3K views
Gurvan@gurvanson·
@rami_mmo that's good to know, thank you for answering!
0 replies · 0 reposts · 1 like · 47 views
rami@rami_mmo·
generally vq in the literature has less fidelity than kl for image reconstruction. i actually thought that vq could work better in this case since the latents are so tiny and noising them up is a bit dangerous unless they look a bit closer to a gaussian, but it couldn't get past the low-frequency features...
1 reply · 0 reposts · 2 likes · 71 views
Gurvan@gurvanson·
@rami_mmo this seems contrary to what's commonly thought about quantized latents, and about why VQVAE was made in the first place. do you think KL works better here because the minecraft scenery is not that diverse? (i think you allude to this in the article)
1 reply · 0 reposts · 1 like · 75 views
rami@rami_mmo·
@gurvanson the latents are not quantized. i tried quantizing them, but it loses a lot more fidelity while reconstructing; the kl seems to have done a very good job at keeping the latents well behaved for the downstream tasks
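(For reference, the penalty a KL autoencoder adds is the closed-form KL between a diagonal Gaussian posterior and N(0, I). This is a generic sketch of that standard term, not rami's actual code.)

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims.
    # Penalizing this keeps latents near-Gaussian, so they stay
    # "well behaved" when downstream tasks noise them up.
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1)
```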
1 reply · 0 reposts · 2 likes · 534 views
Gurvan@gurvanson·
@torchcompiled @giffmana I think the issue here is that for Transformer they plot the cumulative training time (124M+354M+757M+1.4B), instead of comparing to just the 1.4B trained from scratch, which seems to take about the same amount of TPU hours as the Tokenformer 1.4B, so the graph seems disingenuous
1 reply · 0 reposts · 2 likes · 202 views
Ethan@torchcompiled·
@giffmana I might be misunderstanding but I thought the incremental scaling was from scratch
Ethan@torchcompiled

@stanislavfort i honestly picked the wrong graph to show. It is trained from scratch, but in an incremental fashion where parameters (the pseudo key/values) are iteratively increased, each time initialized at zero.

3 replies · 0 reposts · 8 likes · 1.6K views
Lucas Beyer (bl16)@giffmana·
TokenFormer is a cool idea, but before you get all carried away by these scaling curves, please note that it compares *incrementally* scaled TokenFormer to Transformer scaled *from scratch*. It may be a nice model for all i know, but this plot is not what’s going to convince me.
[image]
PapersAnon@papers_anon

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters. Replaces all the linear projections with a token-parameter attention layer where input tokens act as queries and model parameters as keys and values. Links below

18 replies · 14 reposts · 180 likes · 34.2K views