Floatingtrees

12 posts

@floatingtrees

Person at place

Santa Clara · Joined July 2025

92 Following · 8 Followers

Floatingtrees@floatingtrees·16 May

@yule_gan Perhaps it’s because pretraining creates a set of experts sufficiently close to the model in weight space that a random perturbation can move the model onto one of them? This paper has more details: arxiv.org/html/2603.1222…

Yulu Gan@yule_gan·15 May

A fun experiment comparing a random step with one gradient step: With a small CNN on CIFAR-10, a random step is basically a disaster. (A gradient step is a ~185σ event.) That makes sense if you expect a random direction in R^d to be ~sqrt(d) standard deviations worse than the optimal one. So scaling up to a larger model should make things even worse. But with a 7B model (test on GSM8k), random steps have a good chance of outperforming a gradient step. (The gradient norm of one PPO update is 1.94, while the L2 norm of the Gaussian perturbation is 85.6. The figure below rescales the Gaussian perturbation to match the PPO update norm, so the random step and gradient step have the same radius.) We should really rethink the parameter-function map.
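
The rescaling step described above (shrinking the Gaussian perturbation so it has the same L2 norm as the PPO update) can be sketched as follows. This is a minimal illustration, not the experiment's code: the dimension `dim` and the helper names `l2_norm` and `rescale_to_norm` are assumptions made for the example; only the target norm 1.94 comes from the post.

```python
import math
import random


def l2_norm(vec):
    """Euclidean length of a flat parameter vector."""
    return math.sqrt(sum(v * v for v in vec))


def rescale_to_norm(vec, target_norm):
    """Return vec scaled so that its L2 norm equals target_norm.

    The direction is unchanged; only the radius of the step changes.
    """
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot rescale a zero vector")
    scale = target_norm / norm
    return [v * scale for v in vec]


rng = random.Random(0)
dim = 1000  # toy dimension (assumption); the real model has ~7B parameters
noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Match the radius of the gradient step quoted in the post (1.94).
step = rescale_to_norm(noise, 1.94)
print(l2_norm(noise), l2_norm(step))
```

After rescaling, the random step and the gradient step differ only in direction, which is what makes the comparison in the post a fair one.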

Floatingtrees@floatingtrees·7 May

@avivbick @rshia_afz @CevherLIONS @ericxing @_albertgu Do you think this advantage over SSMs will scale up well? It seems like a sufficiently trained SSM will be able to predict a delta t vector that approaches 0 on the information it wants to preserve.
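
The point about the step size can be illustrated with a one-dimensional selective SSM. This is a generic sketch under stated assumptions, not Raven or any specific published model: it assumes a scalar state, a negative decay parameter `a`, and zero-order-hold discretization; `ssm_step`, `a`, and `b` are hypothetical names. As `delta` approaches 0, the state is carried over unchanged and the input is ignored; as `delta` grows, the old state is forgotten.

```python
import math


def ssm_step(h, x, delta, a=-1.0, b=1.0):
    """One step of a scalar SSM with zero-order-hold discretization.

    h_new = exp(delta * a) * h + ((exp(delta * a) - 1) / a) * b * x
    """
    a_bar = math.exp(delta * a)        # state retention factor in (0, 1]
    b_bar = (a_bar - 1.0) / a * b      # input gain; goes to 0 as delta -> 0
    return a_bar * h + b_bar * x


h, x = 2.0, 5.0
for delta in (0.0, 1e-3, 1.0, 50.0):
    print(delta, ssm_step(h, x, delta))
```

With `delta = 0.0` the state stays at exactly 2.0, while a large `delta` replaces it almost entirely with the new input.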

Aviv Bick@avivbick·7 May

0/ Paper: github.com/goombalab/rave… Blogpost: goombalab.github.io/blog/2026/rave… Code: github.com/goombalab/raven w/ @rshia_afz, @CevherLIONS, @ericxing, and @_albertgu

Aviv Bick@avivbick·7 May

SSMs fail on recall tasks they have the capacity to solve. The two dominant approaches today, SSMs and sliding-window attention, both lack persistence: memory either decays over time or gets evicted. We built Raven to fix this, surpassing all prior linear models even at 16× their training sequence length. 🧵🐦‍⬛

Floatingtrees@floatingtrees·30 Apr

This is my first blog post, so please tell me if anything I did was unconventional or confusing!

Floatingtrees@floatingtrees·30 Apr

Current video autoencoders waste enormous amounts of capacity on uninteresting information; a video zooming in on a rock contains much less information than one of a lion running across a savanna. Here's a way to fix it. floatingtrees.github.io/dynamic-frame-…

Floatingtrees@floatingtrees·20 Apr

@1a1n1d1y Can’t believe I didn’t follow you before today, very interesting.

andy@1a1n1d1y·20 Apr

x.com/i/article/2043…

Floatingtrees@floatingtrees·20 Apr

@wsl8297 Very cool, thanks for making this!

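Joruno@wsl8297·19 Apr

Open-source CUDA systems tutorial on GitHub: LeetCUDA (a one-stop path from beginner to advanced). 200+ progressively harder hands-on CUDA kernels, with an accompanying HGEMM library whose performance reaches 98% to 100% of cuBLAS. It also includes 100+ blog posts on high-performance computing that focus on key techniques and optimization methods, to help you go from "can write it" to "can write it fast". GitHub: github.com/xlite-dev/Leet… Designed for beginners, it pairs with PyTorch to offer a clear learning path: write it correctly → write it fast → approach library-level performance. It suits developers who want to master CUDA systematically, and also serves as a reference and progression roadmap for AI engineers working on LLM inference optimization.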

Floatingtrees@floatingtrees·16 Apr

@drummatick I think temperature 0 might be sufficient for TPUs? (Since their runtime execution is fully deterministic)
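
For context, "temperature 0" is usually taken to mean greedy decoding: instead of sampling from the softmax distribution, the decoder always picks the highest-scoring token. A minimal sketch follows; the function name `sample_token` and the toy logits are assumptions for illustration, not any provider's API. Greedy decoding removes sampling randomness, but the output is only reproducible if the logits themselves are computed deterministically, which is the property the post attributes to TPUs.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits.

    temperature == 0 is treated as greedy decoding (argmax);
    otherwise sample from softmax(logits / temperature).
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0]  # toy scores (assumption)

# Greedy decoding returns the same index no matter how the RNG is seeded.
greedy = {sample_token(logits, 0.0, random.Random(seed)) for seed in range(20)}

# Sampling at temperature 1.0 can return different indices across seeds.
sampled = {sample_token(logits, 1.0, random.Random(seed)) for seed in range(20)}
print(greedy, sampled)
```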

Saurabh Kumar@drummatick·15 Apr

Uhh temperature zero?

@bluecow 🐮@BLUECOW009

At my job we managed to make Gemini deterministic, I can't say anything else because I'm under an NDA.

Floatingtrees@floatingtrees·16 Apr

@Memetic_Theory Google's TPU Research Cloud (TRC) program gives you 64 v6e TPUs (a bit over 32 H100s of compute) for a month. They’re quite generous, and you should be able to get it if you apply.

mass@Memetic_Theory·15 Apr

We need access to 32 H100s for 1-2 weeks to add a few more data points to a novel scaling law that we are about to publish. Can someone help asap. Pls rt for visibility. If this holds, the implications are enormous.

Floatingtrees@floatingtrees·5 Apr

@gabriberton This is fairly common with a lot of LLMs, especially Qwen. RL on a model's own outputs narrows its distribution, which reduces the probability of an error, up to a point.

Gabriele Berton@gabriberton·4 Apr

Chat is this real? Doesn't make much sense to me. Especially the screenshot below: bad training data, good results ?!? My guess is that it only works on a small subset of models / datasets with very narrow hyperparams, unless I'm missing something

Bo Wang@BoWang87

Apple Research just published something really interesting about post-training of coding models. You don't need a better teacher. You don't need a verifier. You don't need RL. A model can just… train on its own outputs. And get dramatically better. Simple Self-Distillation (SSD): sample solutions from your model, don't filter them for correctness at all, fine-tune on the raw outputs. That's it. Qwen3-30B-Instruct: 42.4% → 55.3% pass @1 on LiveCodeBench. +30% relative. On hard problems specifically, pass@5 goes from 31.1% → 54.1%. Works across Qwen and Llama, at 4B, 8B, and 30B. One sample per prompt is enough. No execution environment. No reward model. No labels. SSD sidesteps this by reshaping distributions in a context-dependent way — suppressing distractors at locks while keeping diversity alive at forks. The capability was already in the model. Fixed decoding just couldn't access it. The implication: a lot of coding models are underperforming their own weights. Post-training on self-generated data isn't just a cheap trick — it's recovering latent capacity that greedy decoding leaves on the table. paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd
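
The relative gains quoted above follow from the absolute scores by simple arithmetic. The short check below uses only the numbers stated in the post; the helper name `relative_gain` is an assumption for illustration.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


# pass@1 on LiveCodeBench for Qwen3-30B-Instruct, as quoted in the post.
pass1 = relative_gain(42.4, 55.3)

# pass@5 on hard problems, as quoted in the post.
pass5_hard = relative_gain(31.1, 54.1)

print(f"pass@1: +{pass1:.1%} relative")
print(f"pass@5 (hard): +{pass5_hard:.1%} relative")
```

The first figure comes out at roughly 30%, consistent with the "+30% relative" in the post; the second is roughly 74%.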

Floatingtrees@floatingtrees·4 Apr

@danpacary There’s no need to do LoRA by the way. If you’re not computing gradients, each set of weights can be represented as a scalar seed, so LoRA just reduces expressivity without any memory benefit.
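
The seed trick mentioned above can be sketched with a plain evolution-strategies update. This is a generic illustration under stated assumptions, not EGGROLL's implementation: the names `perturbation`, `es_step`, and `fitness`, the toy objective, and the hyperparameters are all hypothetical. The key point is that each perturbed copy of the weights is never stored; its noise vector is regenerated from an integer seed whenever it is needed, so only one (seed, score) pair per population member has to be kept.

```python
import random


def perturbation(seed, dim, sigma):
    """Regenerate a noise vector deterministically from an integer seed."""
    rng = random.Random(seed)
    return [sigma * rng.gauss(0.0, 1.0) for _ in range(dim)]


def es_step(weights, fitness_fn, seeds, sigma=0.1, lr=0.05):
    """One evolution-strategies update using forward evaluations only."""
    dim = len(weights)

    # Evaluate each perturbed candidate; keep only its score.
    scores = []
    for seed in seeds:
        eps = perturbation(seed, dim, sigma)
        scores.append(fitness_fn([w + e for w, e in zip(weights, eps)]))
    baseline = sum(scores) / len(scores)

    # Rebuild each noise vector from its seed and accumulate the update.
    updated = list(weights)
    for seed, score in zip(seeds, scores):
        eps = perturbation(seed, dim, sigma)
        for i, e in enumerate(eps):
            updated[i] += lr * (score - baseline) * e / (len(seeds) * sigma ** 2)
    return updated


def fitness(w):
    """Toy objective (assumption): highest when every weight equals 1."""
    return -sum((x - 1.0) ** 2 for x in w)


weights = [0.0, 0.0, 0.0]
updated = es_step(weights, fitness, seeds=range(128))
print(fitness(weights), fitness(updated))
```

Because the full-size noise costs no extra memory to store, restricting the perturbation to low-rank adapters saves nothing on that front, which is the argument the post makes.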

Daniel Isaac@danpacary·4 Apr

Combine EGGROLL with LoRA. LoRA freezes 99.97% of the model. Only tiny low-rank adapter matrices update. 811,008 trainable params out of 30 billion. EGGROLL perturbs these adapters with random noise, evaluates via forward pass, updates based on fitness. 224 lines of Python.

Daniel Isaac@danpacary·4 Apr

I'm fine-tuning a 30B parameter model on my MacBook Pro. No backward passes. No GPU cluster. No gradients. Just forward passes on Apple Silicon.

Floatingtrees@floatingtrees·27 Mar

@nabla_theta Even in a highly competitive future market, there will probably be some AI companies that are more trusted than others, because some companies have a large number of highly opinionated researchers who would leave if the company did something terrible.

Leo Gao@nabla_theta·27 Mar

new post: Corporations seem evil because we anthropomorphize them. if you model companies as people, then they would be amoral sociopaths. but so would your lawnmower. to change corporate behavior effectively, you have to treat them as optimizers and change their incentives.

Floatingtrees@floatingtrees·19 Mar

@nabla_theta Hi Leo, before I make my predictions, does the polling service you used have any mechanism that incentivizes people to answer honestly?

Leo Gao@nabla_theta·19 Mar

new blog post! my hobby: running deranged surveys. link in thread
