Floatingtrees

12 posts

@floatingtrees

Person at place

Santa Clara · Joined July 2025
92 Following · 8 Followers
Floatingtrees@floatingtrees·
@yule_gan Perhaps it’s because pretraining creates a set of experts sufficiently close to the model in weight space that a random perturbation can move the model onto an expert? This paper has more details: arxiv.org/html/2603.1222…
Yulu Gan@yule_gan·
A fun experiment comparing a random step with one gradient step: With a small CNN on CIFAR-10, a random step is basically a disaster. (A gradient step is a ~185σ event.) That makes sense if you expect a random direction in R^d to be ~sqrt(d) standard deviations worse than the optimal one. So scaling up to a larger model should make things even worse. But with a 7B model (test on GSM8k), random steps have a good chance of outperforming a gradient step. (The gradient norm of one PPO update is 1.94, while the L2 norm of the Gaussian perturbation is 85.6. The figure below rescales the Gaussian perturbation to match the PPO update norm, so the random step and gradient step have the same radius.) We should really rethink the parameter-function map.
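The norm matching described in the post can be illustrated with a minimal sketch. This is not the experiment's code: the toy dimension `d`, the helper `rescale_to_norm`, and the random stand-in for the gradient step are all assumptions made for illustration. Only the 1.94 update norm comes from the post. The sketch rescales a Gaussian perturbation to the same radius as the update and measures how well the two directions align.

```python
import numpy as np


def rescale_to_norm(vector, target_norm):
    """Return `vector` scaled so that its L2 norm equals `target_norm`."""
    norm = np.linalg.norm(vector)
    if norm == 0.0:
        raise ValueError("cannot rescale a zero vector")
    return vector * (target_norm / norm)


rng = np.random.default_rng(0)
d = 10_000  # toy dimension (assumption), far smaller than a 7B-parameter model

# Stand-in for a real gradient update (assumption): a fixed direction
# scaled to the update norm quoted in the post.
gradient_step = rescale_to_norm(rng.standard_normal(d), 1.94)

# A Gaussian perturbation rescaled to the same radius as the gradient step.
random_step = rescale_to_norm(rng.standard_normal(d), np.linalg.norm(gradient_step))

# Cosine similarity between the two steps. For a random direction in R^d it
# is centered at 0 with standard deviation about 1/sqrt(d).
cosine = float(random_step @ gradient_step) / (
    np.linalg.norm(random_step) * np.linalg.norm(gradient_step)
)
print(f"norms: {np.linalg.norm(gradient_step):.2f}, {np.linalg.norm(random_step):.2f}")
print(f"cosine: {cosine:.4f}  (1/sqrt(d) = {1 / np.sqrt(d):.4f})")
```

Because the cosine has standard deviation of roughly 1/sqrt(d), a perfectly aligned direction (cosine 1) sits about sqrt(d) standard deviations out, which is the intuition behind the sqrt(d) estimate in the post.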
Aviv Bick@avivbick·
SSMs fail on recall tasks they have the capacity to solve. The two dominant approaches today, SSMs and sliding-window attention, both lack persistence: memory either decays over time or gets evicted. We built Raven to fix this, surpassing all prior linear models even at 16× their training sequence length. 🧵🐦‍⬛
Floatingtrees@floatingtrees·
This is my first blog post, so please tell me if anything I did was unconventional or confusing!
Floatingtrees@floatingtrees·
Current video autoencoders waste enormous amounts of capacity on uninteresting information; a video zooming in on a rock contains much less information than one of a lion running across a savanna. Here's a way to fix it. floatingtrees.github.io/dynamic-frame-…
Floatingtrees@floatingtrees·
@1a1n1d1y Can’t believe I didn’t follow you before today, very interesting.
Joruno@wsl8297·
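An open-source, systematic CUDA tutorial on GitHub: LeetCUDA (a one-stop path from beginner to advanced). It offers 200+ progressively harder hands-on CUDA kernels, and the accompanying HGEMM library reaches 98%~100% of cuBLAS performance. There are also 100+ technical blog posts on high-performance computing that focus on key techniques and optimization methods, helping you go from "can write it" to "can write it fast". GitHub: github.com/xlite-dev/Leet… Designed for beginners and paired with PyTorch, it lays out a clear learning path: write it correctly → write it fast → approach library-level performance. It suits developers who want to master CUDA systematically, and it also works as a reference and progression roadmap for AI engineers working on LLM inference optimization.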
Floatingtrees@floatingtrees·
@drummatick I think temperature 0 might be sufficient for TPUs? (Since their runtime execution is fully deterministic)
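As a minimal sketch of what temperature 0 means here, assuming the usual convention that it selects the highest-scoring token (greedy decoding): the helper `sample_token` below is hypothetical and not taken from any particular inference library. With temperature 0 the sampling step itself no longer uses randomness, so any remaining run-to-run variation would have to come from how the logits are computed.

```python
import numpy as np


def sample_token(logits, temperature, rng):
    """Pick a token index from `logits`; temperature 0 means greedy argmax."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        # Greedy decoding: no random draw is made at all.
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))


logits = [0.1, 2.0, -1.0]
# Different RNG seeds give the same token at temperature 0.
greedy = [sample_token(logits, 0.0, np.random.default_rng(seed)) for seed in range(5)]
print(greedy)
```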
Floatingtrees@floatingtrees·
@Memetic_Theory Google has a TRC program that gives you 64 v6e TPUs (a bit over 32 H100s of compute) for a month. They’re quite generous, and you should be able to get it if you apply.
mass@Memetic_Theory·
We need access to 32 H100s for 1-2 weeks to add a few more data points to a novel scaling law that we are about to publish. Can someone help asap. Pls rt for visibility. If this holds, the implications are enormous.
Floatingtrees@floatingtrees·
@gabriberton This is a fairly common thing with a lot of LLMs, especially Qwen. RL on itself narrows the model’s distribution, which reduces the probability of an error to an extent.
Gabriele Berton@gabriberton·
Chat, is this real? Doesn't make much sense to me. Especially the screenshot below: bad training data, good results?!? My guess is that it only works on a small subset of models / datasets with very narrow hyperparams, unless I'm missing something.
Bo Wang@BoWang87

Apple Research just published something really interesting about post-training of coding models. You don't need a better teacher. You don't need a verifier. You don't need RL. A model can just… train on its own outputs. And get dramatically better. Simple Self-Distillation (SSD): sample solutions from your model, don't filter them for correctness at all, fine-tune on the raw outputs. That's it. Qwen3-30B-Instruct: 42.4% → 55.3% pass@1 on LiveCodeBench. +30% relative. On hard problems specifically, pass@5 goes from 31.1% → 54.1%. Works across Qwen and Llama, at 4B, 8B, and 30B. One sample per prompt is enough. No execution environment. No reward model. No labels. SSD sidesteps this by reshaping distributions in a context-dependent way — suppressing distractors at locks while keeping diversity alive at forks. The capability was already in the model. Fixed decoding just couldn't access it. The implication: a lot of coding models are underperforming their own weights. Post-training on self-generated data isn't just a cheap trick — it's recovering latent capacity that greedy decoding leaves on the table. paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd

Floatingtrees@floatingtrees·
@danpacary There’s no need to do LoRA by the way. If you’re not computing gradients, each set of weights can be represented as a scalar seed, so LoRA just reduces expressivity without any memory benefit.
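A minimal sketch of the seed idea, assuming Gaussian perturbations drawn from a seeded generator: the helper names `perturbation_from_seed` and `perturbed_weights`, the shapes, and `sigma` are all hypothetical. Each population member is stored as one integer, and its full-size perturbation is regenerated on demand instead of being kept in memory.

```python
import numpy as np


def perturbation_from_seed(seed, shape, sigma):
    """Regenerate the same Gaussian noise tensor from an integer seed."""
    rng = np.random.default_rng(seed)
    return sigma * rng.standard_normal(shape)


def perturbed_weights(base, seed, sigma):
    """Materialize one population member from the shared base weights."""
    return base + perturbation_from_seed(seed, base.shape, sigma)


base = np.zeros((4, 3))     # stand-in for one weight matrix (assumption)
population = [11, 12, 13]   # the whole population is just a list of seeds

for seed in population:
    weights = perturbed_weights(base, seed, sigma=0.1)
    # ... run a forward pass with `weights`, record a fitness, discard them ...
    print(seed, float(np.abs(weights).mean()))
```

Since the noise can be rebuilt from the seed whenever it is needed, only the base weights and one integer per member have to be stored, regardless of how many parameters are perturbed.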
Daniel Isaac@danpacary·
Combine EGGROLL with LoRA. LoRA freezes 99.97% of the model. Only tiny low-rank adapter matrices update. 811,008 trainable params out of 30 billion. EGGROLL perturbs these adapters with random noise, evaluates via forward pass, updates based on fitness. 224 lines of Python.
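The perturb, evaluate, update loop described above can be sketched with a plain evolution-strategies step. This is a generic illustration, not EGGROLL itself and not the 224-line script from the post: the toy `fitness` function, the three-parameter vector, and every hyperparameter are assumptions. It uses only forward evaluations of the objective.

```python
import numpy as np

TARGET = np.array([1.0, -2.0, 0.5])  # toy optimum (assumption)


def fitness(params):
    """Toy objective: higher is better, maximized at TARGET."""
    return -float(np.sum((params - TARGET) ** 2))


def es_step(params, rng, population=64, sigma=0.1, lr=0.02):
    """One gradient-free update built from forward evaluations only."""
    noise = rng.standard_normal((population, params.size))
    scores = np.array([fitness(params + sigma * eps) for eps in noise])
    # Standardize the scores so the step size does not depend on fitness scale.
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    # Fitness-weighted average of the noise directions.
    direction = noise.T @ scores / (population * sigma)
    return params + lr * direction


rng = np.random.default_rng(0)
params = np.zeros(3)
start = fitness(params)
for _ in range(200):
    params = es_step(params, rng)
end = fitness(params)
print(f"fitness before: {start:.3f}, after: {end:.3f}")
```

In the setup the post describes, `params` would correspond to the low-rank adapter weights and `fitness` to a task score computed from a forward pass of the model.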
Daniel Isaac@danpacary·
I'm fine-tuning a 30B parameter model on my MacBook Pro. No backward passes. No GPU cluster. No gradients. Just forward passes on Apple Silicon.
Floatingtrees@floatingtrees·
@nabla_theta Even in a potential future highly competitive market, there will probably be some AI companies that are more trusted than others because some companies have a large number of highly opinionated researchers that would leave if the company did something terrible.
Leo Gao@nabla_theta·
new post: Corporations seem evil because we anthropomorphize them
if you model companies as people, then they would be amoral sociopaths. but so would your lawnmower. to change corporate behavior effectively, you have to treat them as optimizers and change their incentives.
Floatingtrees@floatingtrees·
@nabla_theta Hi Leo, before I make my predictions, does the polling service you used have any mechanism that incentivizes people to answer honestly?
Leo Gao@nabla_theta·
new blog post! my hobby: running deranged surveys
link in thread