Assaf Ben Kish

110 posts

@abk_tau

Deep Learning | Large Language Models | Reinforcement Learning

Joined August 2023
191 Following, 137 Followers
Assaf Ben Kish reposted
Ryan Bahlous-Boldi (@RyanBoldi)
Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test-time search performance, even on the original scalar.
35
120
845
205.2K
Michael Hassid (@MichaelHassid)
After a decade at HUJI (@HebrewU), I've submitted my PhD thesis! Huge thanks to my awesome advisors, @royschwartzNLP & @adiyossLC, my FAIR colleagues (@AIatMeta), lab mates, and of course my wife (we met at HUJI!). Above all, thanks to my parents for teaching me how to learn.
4
1
43
1.1K
Assaf Ben Kish reposted
Linlu Qiu (@linluqiu)
Language is discrete. Language models don’t have to be. 🧚Introducing ELF🧚‍♀️: Embedded Language Flows—a class of diffusion models in continuous embedding space based on continuous-time Flow Matching 🧵
15
131
805
135.7K
Assaf Ben Kish (@abk_tau)
Most of the information in LLM weights is knowledge, not learning algorithms. The known learning algorithms that emerge (e.g. induction heads) trace back to just 2 attention heads (0.001% of GPT-3's parameters). It is plausible that the DNA encodes better learning algorithms.
Dwarkesh Patel (@dwarkesh_sp)

There's a quadrillion-dollar question at the heart of AI: why are humans so much more sample-efficient than LLMs?

There are three possible answers:
1. Architecture and hyperparameters (aka transformer vs whatever ‘algo’ cortical columns are implementing)
2. Learning rule (backprop vs whatever the brain is doing)
3. Reward function

@AdamMarblestone believes the answer is the reward function. ML likes to use pretty simple loss functions, like cross-entropy. These are easy to work with. But they might be too simple for sample-efficient learning. Adam thinks that, in humans, the large number of highly specialised cells in the ‘lizard brain’ might actually be encoding information for sophisticated loss functions, used for ‘training’ in the more sophisticated areas like the cortex and amygdala.

Like: the human genome is barely 3 gigabytes (compare that to the TBs of parameters that encode frontier LLM weights). So how can it include all the information necessary to build highly intelligent learners? Well, if the key to sample-efficient learning resides in the loss function, even very complicated loss functions can still be expressed in a couple hundred lines of Python code.

0
0
1
148
Sham Kakade (@ShamKakade6)
1/8 Introducing Recurrent Transformer (RT). At 300M params, RT improves validation CE over standard Transformers. The best RT model is only 6 layers, but wider at 2048 — beating deeper 12- and 24-layer Transformers by trading depth for width.
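To make the depth-for-width trade concrete, here is a rough back-of-the-envelope sketch. It assumes the common rule of thumb of about 12·d² non-embedding parameters per standard Transformer layer (4·d² for the attention projections plus 8·d² for a 4x-wide MLP). The helper `approx_block_params` and the widths 1448 and 1024 for the 12- and 24-layer baselines are hypothetical: the post only states 6 layers at width 2048, and the Recurrent Transformer's real per-layer count may differ.

```python
def approx_block_params(n_layers: int, d_model: int) -> int:
    """Rule-of-thumb non-embedding parameter count for a standard Transformer.

    Assumes 4*d^2 (Q, K, V, O projections) + 8*d^2 (MLP with 4x expansion)
    per layer; biases, norms, and embeddings are ignored.
    """
    return 12 * n_layers * d_model ** 2


# (n_layers, d_model) pairs; only (6, 2048) comes from the post,
# the other two widths are illustrative guesses at a similar budget.
configs = [(6, 2048), (12, 1448), (24, 1024)]
counts = {cfg: approx_block_params(*cfg) for cfg in configs}

for (n_layers, d_model), n_params in counts.items():
    print(f"{n_layers:>2} layers x width {d_model}: ~{n_params / 1e6:.0f}M parameters")
```

Under this approximation all three shapes land near the same 300M budget, so the comparison in the post is about how the parameters are arranged rather than how many there are.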
17
69
552
252.3K
alex zhang (@a1zhang)
does anyone know of any evals / benchmarks that are particularly sensitive to prompting? can be for any range of reasons, e.g. requires ICL, requires in context examples, is hard, has strict requirements, partial information, etc. ideally also not super long horizon or hard to set up
11
3
67
8.7K
Assaf Ben Kish reposted
Yael Vinker🎗 (@YVinker)
I am *very* excited to announce our SIGGRAPH 2026 workshop: Lines & Minds: Visual Abstraction in Art, Psychology, and Computer Graphics 🎨🧠🫖 🔗 lines-and-minds.github.io 📅 Sunday, July 19 Join us to explore how visual abstraction shapes how we think, create, and communicate.
6
19
102
10.4K
Assaf Ben Kish reposted
Yael Vinker🎗 (@YVinker)
Creative work often starts before we can describe what we're looking for. What role can generative models play at this stage? 🌱Our new work, Inspiration Seeds, reveals hidden visual connections between images, creating a purely visual exploration space. 🔗kfirgoldberg.github.io/InspirationSee…
2
22
106
15.9K
Assaf Ben Kish reposted
Yacine Mahdid (@yacinelearning)
good morning folks we're going live in about 2h to have a jolly discussion about making models self-teach themselves hard stuff like @justinskycak would
Yacine Mahdid (@yacinelearning)

tomorrow at 10h AM EST we'll have @IdanShenfeld and @jonashubotter on the livestream to have a presentation on self-distillation + ask them a whole bunch of questions on their research! drop your questions in the comments so I can ask them and come chat with the authors!

11
29
336
29K
Assaf Ben Kish reposted
Nimrod Shabtay (@NimrodShabtay)
Introducing Look Where It Matters — High-Resolution Crops Retrieval for Efficient VLMs. VLMs don't need to process full high-res images. AwaRes uses tool-calling to retrieve only the high-res regions needed to answer a given query🧵 arxiv.org/abs/2603.16932 nimrodshabtay.github.io/AwaRes/
3
9
20
1.1K
Oliver Sieberling (@osieberling)
A very interesting observation on backpropagation is that no matter how nonlinear the forward pass, once the forward pass is fixed, backpropagation itself is purely linear. This allows for all kinds of gradient analysis.

For example, one can decompose the backward pass by the depth of the backpropagated signal: each forward and backward pass can be viewed as involving 2^{2L} different paths, where L is the number of blocks (2L to account for Attn/MLP subblocks). Because the forward pass is nonlinear, we can’t just compute each of the paths separately to decompose the forward pass. However, for the backward pass we can. Of course, 2^{2L} is computationally intractable as we would need to do 2^{2L} separate backward passes (>1e19 for a 32-block transformer).

But if we are only interested in the depth of gradients, we can use simple dynamic programming to decompose the backward signal by depth. Let x_l be the residual stream at depth l. Instead of just maintaining dL/dx_l as in normal backpropagation, we maintain (dL/dx_l)^k for each gradient depth k, i.e., a table consisting of 2L+1 gradients. Because backpropagation is linear, we have

dL/dx_l = \sum_{k=0}^{2L} (dL/dx_l)^k.

Then, after backpropagating each subblock, we can update our table of gradients with

(dL/dx_l)^k = (dL/dx_{l+1})^k + (Jacobian_{subblock_l}(x_l))^T (dL/dx_{l+1})^{k-1}.

This way, we can efficiently compute for each weight update how much comes from each gradient depth.

Interestingly, because there are C(2L, k) (2L choose k) paths of depth k (left plot), by sheer number of paths we would expect the decomposition to concentrate around depth L (ignoring cancellations/correlations), but this is not what we observe in practice: the actual decomposition of gradients by depth is much more shifted towards shallower depths (right plot), which suggests that after normalizing by the number of paths, shallower paths contribute gradients of much larger magnitude than deeper paths do.
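The depth-wise dynamic program described above can be sketched on a toy residual network. This is only an illustration under stated assumptions, not the author's code: each subblock is a hypothetical `tanh(W x)` layer standing in for Attn/MLP, the loss is `0.5 * ||x_N||^2`, and all function names are made up for the example. It decomposes the gradient with respect to the residual stream and checks that the depth components sum to the ordinary backprop gradient.

```python
import math
import numpy as np


def forward(x0, weights):
    """Toy residual stream: x_{l+1} = x_l + tanh(W_l @ x_l). Returns every x_l."""
    xs = [x0]
    for W in weights:
        xs.append(xs[-1] + np.tanh(W @ xs[-1]))
    return xs


def jacobian_transpose_vec(W, x, v):
    """(Jacobian of tanh(W @ x) w.r.t. x)^T applied to v."""
    t = np.tanh(W @ x)
    return W.T @ ((1.0 - t ** 2) * v)


def gradient_by_depth(xs, weights, grad_out):
    """Row k of the returned table is the part of dLoss/dx_0 that passed
    through exactly k subblock Jacobians."""
    n = len(weights)
    table = np.zeros((n + 1, grad_out.size))
    table[0] = grad_out  # nothing traversed yet: depth 0
    for l in reversed(range(n)):
        through_block = np.stack(
            [jacobian_transpose_vec(weights[l], xs[l], table[k]) for k in range(n + 1)]
        )
        updated = table.copy()             # skip connection keeps the depth
        updated[1:] += through_block[:-1]  # going through the block adds one
        table = updated
    return table


def plain_backprop(xs, weights, grad_out):
    """Ordinary backprop through the same residual stream, for comparison."""
    g = grad_out.copy()
    for l in reversed(range(len(weights))):
        g = g + jacobian_transpose_vec(weights[l], xs[l], g)
    return g


rng = np.random.default_rng(0)
dim, n_subblocks = 4, 6  # n_subblocks plays the role of 2L
weights = [0.5 * rng.standard_normal((dim, dim)) for _ in range(n_subblocks)]
xs = forward(rng.standard_normal(dim), weights)
grad_out = xs[-1]  # gradient of 0.5 * ||x_N||^2 w.r.t. x_N

table = gradient_by_depth(xs, weights, grad_out)
full = plain_backprop(xs, weights, grad_out)
path_counts = [math.comb(n_subblocks, k) for k in range(n_subblocks + 1)]

print("components sum to full gradient:", np.allclose(table.sum(axis=0), full))
print("gradient norm per depth:", np.linalg.norm(table, axis=1))
print("paths per depth:", path_counts, "total:", sum(path_counts))
```

The table costs one Jacobian-transpose product per depth per subblock, so roughly (2L+1) times the work of a normal backward pass instead of 2^{2L} separate passes. Dividing each row's norm by the matching entry of `path_counts` gives the per-path normalization mentioned in the post.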
[Figure: left plot, number of paths at each depth; right plot, measured decomposition of gradients by depth]
6
22
333
33.4K
Assaf Ben Kish (@abk_tau)
@vivek_2332 @karpathy Curious: after all the hyperparameter tuning on the eval set, was the final model evaluated on a test set?
0
0
0
40
Vivek (@vivek_2332)
introducing autoresearch-rl, autonomous research for rl post-training. inspired by @karpathy autoresearch, and i think rl post-training is honestly one of the places where this idea fits perfectly. there are 50+ hyperparameters to tweak: learning rate, batch size, rollouts, clipping ratios, kl penalties, schedulers, the list goes on. instead of sitting there for hours turning knobs one at a time, just let the model figure out the right starting config on its own.

some things worth mentioning:
-> built on @PrimeIntellect prime-rl (my favourite rl post-training framework) and @willccbb verifiers for reward verification.
-> ran qwen2.5-0.5b-instruct on gsm8k across 60+ autonomous experiments. eval score went from 0.475 to 0.550 and the agent actually found a way to do it in fewer steps (20 instead of 30). less compute, better results
-> the whole thing was surprisingly smooth to set up and run. point the agent at the config, go to sleep, wake up to a full experiment log. i really wish i could try this on a bigger model but gpu poor for now lol
-> the agent discovers things you wouldn't think to try. like how rollouts = 4 beats rollouts = 8, or how a constant lr schedule outperforms cosine. it just methodically tests everything

i think the real value here is that rl training is so fragile and noisy that having an agent patiently run experiment after experiment is genuinely more effective than a human doing it manually.

check it out: github.com/vivekvkashyap/…
22
51
751
83.7K
Assaf Ben Kish reposted
Moran Yanuka (@moranynk)
🎙️ Introducing ID-LoRA: the first open-source model to jointly generate a video with a person's appearance and voice in a single pass from just a reference image + short audio clip. No more cascaded pipelines where the audio can't follow your prompt. youtu.be/6bWcMh18K6g?si…
3
11
23
2.2K
Assaf Ben Kish reposted
Tinker (@tinkerapi)
Introducing Self-Distillation Fine-Tuning, a new approach to continual learning. SDFT distills a model’s own outputs when given expert demonstration in-context. The result is faster learning than off-policy distillation, combined with reduced forgetting. x.com/IdanShenfeld/s…
idan shenfeld (@IdanShenfeld)

People keep saying 2026 will be the year of continual learning. But there are still major technical challenges to making it a reality. Today we take the next step towards that goal — a new on-policy learning algorithm, suitable for continual learning! (1/n)

1
3
61
5.1K
Omar Khattab (@lateinteraction)
Buried in the massive progress in LLMs over the past few years is how all your favorite Transformers/DNNs still can't solve even just grade school math problems above a "B" grade through a forward pass. Unless they're in a scaffold like CoT, ReAct, RLM, etc. And this is true even at trillions of params and bajillions of FLOPs. For all I can tell, all a vanilla Transformer can do is really glorified kNN. Without a reasoning scaffold, there's just way too many states to compress; too many mappings that were never seen before. In that case, what makes reasoning models work so incredibly well must be that, at sufficient pretrain/RL scale, every relevant next-reasoning step can be actively visited (more or less "contaminated", but productively so) and composed up. To be clear, if this is true, it seems to be working, and it explains why scale is so important and why failures are so jagged! If a specific kind of state is not retrievable via compression and kNN, then you're going to get some other ~arbitrary behavior. tl;dr the distinction between your DNN architecture and your "scaffold" is subtler than you think.
N8 Programs (@N8Programs)

Inspired by @RyanPGreenblatt, I measured LLMs' accuracy on GSM8K when only allowed to output a numerical answer without any CoT, with all reasoning done in a few forward passes. The result is a nice log-linear scaling curve. We can also use this to guess the parameter count of closed models.

17
32
332
34.8K