Gurvan

151 posts

@gurvanson

I like Super Smash Bros. Melee for the Nintendo GameCube™, Machine Learning, and most of all taking care of my friends.

Rennes, France · Joined July 2009
414 Following · 55 Followers
Gurvan@gurvanson·
@jsuarez what do you think about MinGRU for RL?
1 reply · 0 reposts · 1 like · 51 views
Joseph Suarez 🐡@jsuarez·
For trying out new ideas, PufferLib is the only thing out there that is fast and hackable. I've ported and run several hundred experiments on MinGRU and Mamba in the last 48 hours alone. Shipping new tools to help with this in the next version, too!
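(For context, a minimal sketch of the MinGRU recurrence being discussed, following the "Were RNNs All We Needed?" formulation: the gate and candidate depend only on the input, so there is no reset gate. This is the sequential one-step form, not PufferLib's actual implementation; all names here are illustrative.)

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    # Minimal GRU: no reset gate, and the gate/candidate depend only on
    # the input, which is what makes parallel-scan training possible.
    def __init__(self, dim_in, dim_hidden):
        super().__init__()
        self.to_z = nn.Linear(dim_in, dim_hidden)        # update gate
        self.to_h_tilde = nn.Linear(dim_in, dim_hidden)  # candidate state

    def forward(self, x, h):
        # x: (batch, dim_in), h: (batch, dim_hidden) -- one recurrent step
        z = torch.sigmoid(self.to_z(x))
        h_tilde = self.to_h_tilde(x)
        return (1 - z) * h + z * h_tilde
```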
2 replies · 1 repost · 8 likes · 1.4K views
Joseph Suarez 🐡@jsuarez·
Point - experimental work represents multiple months of dev effort on PufferLib. Probably around 1/3, not 10%. The Puffer 2.0 -> 3.0 capabilities leap was mostly algo-side. It's just that the sexy stuff usually doesn't work as written on paper.
Ariel@redtachyon

Aight let's talk about frameworks, libraries, RL, and why I probably don't like your favorite RL codebase. Yes, including that one.

The unusual thing about RL is that the algorithm is the easy part. GRPO is a single-line equation on some logprobs. If you have the data, computing the loss is trivial, and then presumably you're using it with a backprop library of your choice.

But that's the problem -- getting the data. It's a pain in the ass. In regular RL you have to do rollouts, perhaps truncate some episodes, and handle the ends accordingly. If you don't want to be a snail, you'll want to vectorize the environment and adapt the algorithm for that. If you want to do an LLM, you need to do all the nonsense that makes LLMs fit in memory. You need to be careful about your prompts, mask out the right parts for the loss. You need a decent generation engine (vLLM), which then makes it a pain to update the weights. If you want to do multi-agent multi-turn LLM RL, might as well commit sudoku.

While we have many disagreements on just about anything RL-related, I think @jsuarez's PufferLib exemplifies this point beautifully. It's without a doubt incredible at what it does - training RL algos on simulated environments very, very quickly. But most of its novelty is pure infra. The core algorithms are largely the same as they've been for years, and I'm willing to bet they represent less than 10% of the overall engineering effort. Naturally, this has implications for the code you need to write to do anything beyond running the built-in examples.

What I find time and time again is that for many sufficiently nontrivial (read: interesting) research problems, it takes a similar amount of time to (a) write the thing from scratch/from simple primitives, or (b) adapt an existing framework to accommodate crazy ideas. In the former, you focus on writing the actual logic. In the latter, you wrangle the framework to allow you to add the logic. I know which I like better.

All of this is because the algorithm is the easy part. The infra is the pain in the ass. So whenever you're in a position to choose - use the tools that simplify infra, and write the training loop yourself. Don't build frameworks, build libraries. You'll thank yourself later.

Big shout out to my Master's supervisor from back in the day, who was the first one to tell me to drop rllib and just write PPO myself in PyTorch. And to @hallerite for inspiring me to finally write up this rant. I might write a proper effortpost with examples at some point in the future if the people demand it.
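(To make "a single-line equation on some logprobs" concrete, here's a minimal sketch of a GRPO-style loss: group-normalized rewards as advantages plus a PPO-style clipped ratio. The shapes and the absence of a KL penalty are simplifying assumptions, not a reference implementation.)

```python
import torch

def grpo_loss(logprobs, old_logprobs, rewards, clip_eps=0.2):
    """logprobs/old_logprobs: (group,) summed token logprobs per sampled
    completion; rewards: (group,) scalar rewards for the same prompt."""
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logprobs - old_logprobs)
    # PPO-style clipped surrogate objective (minimize the negative).
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```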

2 replies · 0 reposts · 27 likes · 5.5K views
kache@yacineMTB·
@gurvanson literally any CVPR/SIGGRAPH paper in the last 3 years
1 reply · 0 reposts · 3 likes · 343 views
kache@yacineMTB·
honestly insane that this shit works
[image]
3 replies · 0 reposts · 111 likes · 7.9K views
Gurvan@gurvanson·
@jeremyphoward @ggerganov I understand that the amount of memory is a bottleneck on consumer GPUs, but wouldn't the inference speed still be better with fewer active parameters during generation?
1 reply · 0 reposts · 1 like · 233 views
Jeremy Howard@jeremyphoward·
@ggerganov MoE is strictly worse for folks using discrete consumer GPUs like a 3090 AFAICT. It seems pretty good for unified memory stuff like Macs though.
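(A rough back-of-envelope for why this happens: decode is bandwidth-bound, so a dense model that fits in 24 GB of VRAM streams weights at GDDR speed, while an MoE whose total weights spill into system RAM pays PCIe bandwidth for the active weights on every token. All model sizes and bandwidth figures below are illustrative assumptions.)

```python
# Time per token ~ bytes of weights touched / bandwidth they travel over.
GB = 1e9
vram_bw = 900 * GB   # ~3090-class GDDR6X bandwidth (approximate)
pcie_bw = 25 * GB    # ~PCIe 4.0 x16 effective bandwidth (approximate)

dense_active = 7e9 * 0.5   # hypothetical 7B dense model at 4 bits: fits in VRAM
moe_active = 12e9 * 0.5    # hypothetical MoE: 12B *active* params at 4 bits,
                           # but total weights too large for 24 GB

print(dense_active / vram_bw)  # ~0.004 s/token: weights stream from VRAM
print(moe_active / pcie_bw)    # ~0.24 s/token: active experts cross PCIe on
                               # every token, so fewer active params still lose
```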
5 replies · 0 reposts · 37 likes · 3.6K views
Georgi Gerganov@ggerganov·
gpt-oss is a great model IMO. OpenAI showed us the blueprint for winning local AI:

- Interleaved SWA
- Small head sizes in the attention
- Attention sinks
- Mixture of Experts FFN
- 4-bit training

All of these parts combined result in the best architecture for regular users. Very lightweight and efficient for inference on pretty much any hardware.

Qwen models are also great. The MoE works really well. I think they should just adopt iSWA and 4-bit training to become the best.

Gemma models are also great. They already have the 4-bit QAT figured out. It seems they just need to adopt the MoE architecture. And maybe reduce the head size a bit.

p.s. don't know if this makes sense, just my overall impression and intuitive understanding
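(Since the MoE FFN is the one ingredient above that Qwen already has and Gemma is said to need, here's a naive sketch of the pattern: a router picks top-k experts per token and only those experts' weights are touched, which is where the inference savings come from. A generic illustration, not gpt-oss's actual architecture; all sizes are made up.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    # Token-choice MoE FFN: route each token to its top-k experts and
    # mix the expert outputs with the (renormalized) router weights.
    def __init__(self, dim, hidden, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                        # x: (tokens, dim)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mix only the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e         # tokens routed to expert e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] = out[mask] + w * self.experts[e](x[mask])
        return out
```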
36 replies · 76 reposts · 993 likes · 82.1K views
Gurvan@gurvanson·
@kalomaze if the issue is keeping ssh stable, try mosh maybe?
0 replies · 0 reposts · 0 likes · 34 views
kalomaze@kalomaze·
my home internet (spectrum) is really flaky and my wifi keeps popping in and out, which makes persistent ssh a losing battle. should i just set a monitor down next to the router and use ethernet, or is there something else i can try?
33 replies · 1 repost · 72 likes · 7.3K views
madison@dearmadisonblue·
can an egregore have a personality disorder?
2 replies · 0 reposts · 1 like · 282 views
Gurvan@gurvanson·
@SSBM_Arte I think it happened to me once when I had pasted an image that wouldn't properly upload, but refreshing the page fixed it iirc. It could also be that a previous message is open for editing i guess
0 replies · 0 reposts · 1 like · 88 views
Julien Bernard@SSBM_Arte·
Would anyone familiar with Google AI Studio know why I can't send a message in one conversation (created yesterday, 174k tokens in) but can in another (created today, 4k tokens in)? Both Gemini 2.5. Never experienced this before, and really annoying given the context built :(
[two images]
1 reply · 1 repost · 3 likes · 1.6K views
Gurvan@gurvanson·
@qtnx_ talking about it tricks you into thinking you've already done something. don't fall for it
0 replies · 0 reposts · 0 likes · 50 views
Gurvan@gurvanson·
@jaxmorphy i mean, 3 denoising steps is not a lot. you can still see the stage outline really well. do you plan on using rolling diffusion/diffusion forcing?
1 reply · 0 reposts · 1 like · 32 views
ja@jaxmorphy·
@gurvanson that's at 40 denoising steps, here's 5M using just 3 denoising steps
[image]
2 replies · 0 reposts · 1 like · 35 views
ja@jaxmorphy·
Training diffusion transformers on Melee gameplay. 5M params, slightly better results. Still bad, but i need to train longer than 30 mins lol.
[image]
2 replies · 0 reposts · 4 likes · 173 views
Gurvan@gurvanson·
@jaxmorphy that's the world model i want to see
0 replies · 0 reposts · 1 like · 41 views
ja@jaxmorphy·
[image]
2 replies · 0 reposts · 1 like · 61 views
Gurvan@gurvanson·
@y0b1byte i think the DreamerV3 paper mentions that it uses the same set of hyperparameters for every experiment, so the comparison might not be entirely fair
0 replies · 0 reposts · 0 likes · 36 views
yobibyte@y0b1byte·
Singularity
[two images]
1 reply · 0 reposts · 12 likes · 1.1K views
Tanishq Mathew Abraham, Ph.D.@iScienceLuvr·
Training Large Language Models to Reason in a Continuous Latent Space

Introduces a new paradigm for LLM reasoning called Chain of Continuous Thought (COCONUT).

Extremely simple change: instead of mapping between hidden states and language tokens using the LLM head and embedding layer, you directly feed the last hidden state (a continuous thought) as the input embedding for the next token. The system can be optimized end-to-end by gradient descent, as continuous thoughts are fully differentiable.
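(A minimal sketch of the loop described above. It assumes a model that accepts an `inputs_embeds` tensor and returns last-layer hidden states of shape (batch, seq, dim); the real COCONUT training recipe interleaves these continuous steps with normal token decoding, which is elided here.)

```python
import torch

def continuous_thoughts(model, input_embeds, n_thoughts):
    # input_embeds: (batch, seq, dim) embeddings of the prompt so far.
    embeds = input_embeds
    for _ in range(n_thoughts):
        hidden = model(inputs_embeds=embeds)   # (batch, seq, dim) hidden states
        thought = hidden[:, -1:, :]            # the "continuous thought"
        # Feed the hidden state back in as the next input embedding,
        # skipping the LM head / embedding round-trip entirely.
        embeds = torch.cat([embeds, thought], dim=1)
    return embeds  # the whole chain stays differentiable end-to-end
```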
[image]
51 replies · 296 reposts · 1.9K likes · 377.6K views
Gurvan@gurvanson·
@spikedoanz @filipviz normalize by average won't work with negative logits. you could offset everything by the smallest logit maybe, and it would also get you the translation invariance property of softmax
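(A quick numeric check of both halves of this reply; the logit values are arbitrary.)

```python
import torch

logits = torch.tensor([-2.0, 1.0, 3.0])

# Plain sum-normalization breaks with mixed signs: a negative "probability".
print(logits / logits.sum())    # tensor([-1.0000,  0.5000,  1.5000])

# Offsetting by the smallest logit keeps everything non-negative, and the
# result is unchanged if you shift all logits by a constant -- the same
# translation invariance softmax has.
shifted = logits - logits.min()
print(shifted / shifted.sum())  # tensor([0.0000, 0.3750, 0.6250])

moved = logits + 7.0
moved_shifted = moved - moved.min()
print(moved_shifted / moved_shifted.sum())  # identical distribution

# Caveat: the smallest logit always gets probability exactly 0,
# which softmax never does.
```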
0 replies · 0 reposts · 2 likes · 101 views
spike@spikedoanz·
thanks for the response! the property i'm describing is mostly addressing the common reason given for why we use softmax:

softmax :: Vector[logits] -> Vector[probabilities]

and it seems strange, given this, that it would be biased in what distribution it preferred, since something like

def normalize_by_average(x): return x / x.sum()

provides an unbiased way to turn logits into probabilities, and is:
1. even easier to differentiate
2. has some of the same gradient properties as backprop (hooks up all numbers in a vector into a gradient chokepoint)
3. trivially more numerically stable by inspection

the point on amplifying differences between large values could also be addressed by taking the square of the vector, or 2^x (both smell more numerically stable to me), instead of doing exp().
2 replies · 0 reposts · 3 likes · 293 views
spike@spikedoanz·
TIL softmax isn't idempotent, and makes the spread converge towards a uniform distribution
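(Easy to check: apply softmax repeatedly and the spread collapses, since after one pass all values live in (0, 1), so each subsequent pass sees ever-smaller gaps.)

```python
import torch

x = torch.tensor([0.0, 1.0, 5.0])
for step in range(5):
    x = torch.softmax(x, dim=0)
    print(step, x)
# step 0 gives ~[0.007, 0.018, 0.975]; by step 4 the values are
# already close to the uniform distribution [1/3, 1/3, 1/3].
```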
[image]
7 replies · 1 repost · 123 likes · 14.3K views
Gurvan@gurvanson·
@rami_mmo that's good to know, thank you for answering!
0 replies · 0 reposts · 1 like · 47 views
rami@rami_mmo·
generally vq in the literature has less fidelity than kl for image reconstruction. i actually thought that vq could work better in this case since the latents are so tiny and noising them up is a bit dangerous unless they look a bit closer to a gaussian, but it couldn't get past the low-frequency features...
1 reply · 0 reposts · 2 likes · 71 views
Gurvan@gurvanson·
@rami_mmo this seems contrary to what's commonly thought about quantized latents, and about why VQVAE was made in the first place. do you think KL works better here because the minecraft scenery is not that diverse? (i think you allude to this in the article)
1 reply · 0 reposts · 1 like · 75 views
rami@rami_mmo·
@gurvanson the latents are not quantized. i tried quantizing them, but it loses a lot more fidelity while reconstructing; the kl seems to have done a very good job at keeping the latents well behaved for the downstream tasks
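(For reference, the penalty a KL autoencoder adds is the closed-form KL between a diagonal Gaussian posterior and N(0, I). This is a generic sketch of that standard term, not rami's actual code.)

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims.
    # Penalizing this keeps latents near-Gaussian, so they stay
    # "well behaved" when downstream tasks noise them up.
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1)
```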
1 reply · 0 reposts · 2 likes · 534 views
Gurvan@gurvanson·
@torchcompiled @giffmana I think the issue here is that for Transformer they plot the cumulative training time (124M+354M+757M+1.4B), instead of comparing to just the 1.4B trained from scratch, which seems to take about the same amount of TPU hours as the Tokenformer 1.4B, so the graph seems disingenuous
1 reply · 0 reposts · 2 likes · 202 views
Ethan@torchcompiled·
@giffmana I might be misunderstanding but I thought the incremental scaling was from scratch
Ethan@torchcompiled

@stanislavfort i honestly picked the wrong graph to show. It is trained from scratch, but in an incremental fashion where parameters (the pseudo key/values) are iteratively increased, each time initialized at zero.

3 replies · 0 reposts · 8 likes · 1.6K views
Lucas Beyer (bl16)@giffmana·
TokenFormer is a cool idea, but before you get all carried away by these scaling curves, please note that it compares *incrementally* scaled TokenFormer to Transformer scaled *from scratch*. It may be a nice model for all i know, but this plot is not what’s going to convince me.
[image]
PapersAnon@papers_anon

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters. Replaces all the linear projections with a token-parameter attention layer where input tokens act as queries and model parameters as keys and values. Links below

18 replies · 14 reposts · 180 likes · 34.2K views