Assaf Ben Kish

110 posts

@abk_tau

Deep Learning | Large Language Models | Reinforcement Learning

Joined August 2023

191 Following · 137 Followers

Assaf Ben Kish retweeted

Yael Vinker🎗@YVinker·4d

Excited to share that Inspiration Seeds has received an Honorable Mention award this year at SIGGRAPH! 🎉 👉kfirgoldberg.github.io/InspirationSee… Huge thanks and congrats to the best team! @kfir99 @EladRichardson Looking forward to seeing you in LA in July!🌱

Yael Vinker🎗@YVinker

Creative work often starts before we can describe what we're looking for. What role can generative models play at this stage? 🌱Our new work, Inspiration Seeds, reveals hidden visual connections between images, creating a purely visual exploration space. 🔗kfirgoldberg.github.io/InspirationSee…

Assaf Ben Kish retweeted

Ryan Bahlous-Boldi@RyanBoldi·22 May

Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.

Assaf Ben Kish@abk_tau·19 May

@MichaelHassid @HebrewU @royschwartzNLP @adiyossLC @AIatMeta Congrats Michael! 🎉

Michael Hassid@MichaelHassid·19 May

After a decade at HUJI (@HebrewU), I've submitted my PhD thesis! Huge thanks to my awesome advisors, @royschwartzNLP & @adiyossLC, my FAIR colleagues (@AIatMeta), lab mates, and of course - my wife (we met at HUJI!). Above all, thanks to my parents for teaching me how to learn.

Assaf Ben Kish retweeted

Linlu Qiu@linluqiu·12 May

Language is discrete. Language models don’t have to be. 🧚Introducing ELF🧚‍♀️: Embedded Language Flows—a class of diffusion models in continuous embedding space based on continuous-time Flow Matching 🧵

Assaf Ben Kish@abk_tau·29 Apr

Most of the information in LLM weights is knowledge, not learning algorithms. The known learning algorithms that emerge (e.g. induction heads) trace back to just 2 attention heads (0.001% of GPT-3's parameters). It is plausible that the DNA encodes better learning algorithms.

Dwarkesh Patel@dwarkesh_sp

There's a quadrillion-dollar question at the heart of AI: Why are humans so much more sample efficient compared to LLM? There are three possible answers:
1. Architecture and hyperparameters (aka transformer vs whatever ‘algo’ cortical columns are implementing)
2. Learning rule (backprop vs whatever brain is doing)
3. Reward function
@AdamMarblestone believes the answer is the reward function. ML likes to use pretty simple loss functions, like cross-entropy. These are easy to work with. But they might be too simple for sample-efficient learning.
Adam thinks that, in humans, the large number of highly specialised cells in the ‘lizard brain’ might actually be encoding information for sophisticated loss functions, used for ‘training’ in the more sophisticated areas like the cortex and amygdala.
Like: the human genome is barely 3 gigabytes (compare that to the TBs of parameters that encode frontier LLM weights). So how can it include all the information necessary to build highly intelligent learners? Well, if the key to sample-efficient learning resides in the loss function, even very complicated loss functions can still be expressed in a couple hundred lines of Python code.

Assaf Ben Kish@abk_tau·27 Apr

@SunnySanyal9 @ShamKakade6 Can you explain how?

Sunny Sanyal@SunnySanyal9·27 Apr

@ShamKakade6 I got the answer. It is a little different.

Sham Kakade@ShamKakade6·27 Apr

1/8 Introducing Recurrent Transformer (RT). At 300M params, RT improves validation CE over standard Transformers. The best RT model is only 6 layers, but wider at 2048 — beating deeper 12- and 24-layer Transformers by trading depth for width.

Assaf Ben Kish@abk_tau·11 Apr

@a1zhang Longbench

alex zhang@a1zhang·9 Apr

does anyone know of any evals / benchmarks that are particularly sensitive to prompting? can be for any range of reasons, e.g. requires ICL, requires in context examples, is hard, has strict requirements, partial information, etc. ideally also not super long horizon or hard to set up

Assaf Ben Kish retweeted

Yael Vinker🎗@YVinker·9 Apr

I am *very* excited to announce our SIGGRAPH 2026 workshop: Lines & Minds: Visual Abstraction in Art, Psychology, and Computer Graphics 🎨🧠🫖 🔗 lines-and-minds.github.io 📅 Sunday, July 19 Join us to explore how visual abstraction shapes how we think, create, and communicate.

Assaf Ben Kish retweeted

Yael Vinker🎗@YVinker·2 Apr

Assaf Ben Kish retweeted

Yacine Mahdid@yacinelearning·26 Mar

good morning folks we're going live in about 2h to have a jolly discussion about making models self-teach themselves hard stuff like @justinskycak would

Yacine Mahdid@yacinelearning

tomorrow at 10h AM EST we'll have @IdanShenfeld and @jonashubotter on the livestream to have a presentation on self-distillation + ask them a whole bunch of questions on their research! drop your questions in the comments so I can ask them and come chat with the authors!

Assaf Ben Kish retweeted

MIT NLP@nlp_mit·27 Mar

🚨new paper!🚨 RL makes LLMs smarter - but it also causes diversity collapse. Check out Multi-Answer RL - a method that trains LMs to capture and output a distribution of answers in a single generation 👀

Isha Puri@ishapuri101

Ask ChatGPT several times where's best to go for spring break. It recommends Barcelona almost every time. This isn't a fluke. RL training rewards one best answer, so the model learns to commit to one mode and repeat it. Meet Multi-Answer RL: a simple RL method that trains LMs to reason through and output a distribution of answers in a single generation. [1/N]

Assaf Ben Kish retweeted

Nimrod Shabtay@NimrodShabtay·24 Mar

Introducing Look Where It Matters — High-Resolution Crops Retrieval for Efficient VLMs. VLMs don't need to process full high-res images. AwaRes uses tool-calling to retrieve only the high-res regions needed to answer a given query🧵 arxiv.org/abs/2603.16932 nimrodshabtay.github.io/AwaRes/

Assaf Ben Kish@abk_tau·16 Mar

@osieberling Very cool. This learning profile correlates with how different layers communicate over the residual stream during the forward pass: transformer-circuits.pub/2021/framework…

Oliver Sieberling@osieberling·15 Mar

A very interesting observation on backpropagation is that no matter how nonlinear the forward pass, once the forward pass is fixed, backpropagation itself is purely linear. This allows for all kinds of gradient analysis. For example, one can decompose the backward pass by the depth of the backpropagated signal:
Each forward and backward pass can be viewed as involving 2^{2L} different paths, where L is the number of blocks (2L to account for Attn/MLP subblocks). Because the forward pass is nonlinear, we can’t just compute each of the paths separately to decompose the forward pass. However, for the backward pass we can.
Of course, 2^{2L} is computationally intractable as we would need to do 2^{2L} separate backward passes (>1e19 for a 32-block transformer). But if we are only interested in the depth of gradients, we can use simple dynamic programming to decompose the backward signal by depth:
Let x_l be the residual stream at depth l. Instead of just maintaining dL/dx_l as in normal backpropagation, we maintain (dL/dx_l)^k for each gradient depth k, i.e., a table consisting of 2L+1 gradients. Because backpropagation is linear, we have dL/dx_l = \sum_{k=0}^{2L} (dL/dx_l)^k. Then, after backpropagating each subblock, we can update our table of gradients with (dL/dx_l)^k = (dL/dx_{l+1})^k + (Jacobian_{subblock_l}(x_l))^T (dL/dx_{l+1})^{k-1}.
This way, we can efficiently compute for each weight update how much comes from each gradient depth.
Interestingly, because there are C(2L, k) (2L choose k) paths of depth k (left plot), by sheer number of paths we would expect the decomposition to concentrate around depth L (ignoring cancellations/correlations), but this is not what we observe in practice: The actual decomposition of gradients by depth is much more shifted towards shallower depths (right plot), which suggests that after normalizing by the number of paths, shallower paths contribute gradients of much larger magnitude than deeper paths do.
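The depth-wise recurrence in the post above can be sketched on a toy residual stack. This is a minimal illustration under stated assumptions, not the author's code: each subblock is a hypothetical f(x) = tanh(W x), `n` counts subblocks (the post's 2L), and every function name is invented for the example. It tracks only the gradient with respect to the residual stream, then checks that the per-depth components sum to the ordinary backpropagated gradient.

```python
import math
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(weights, x0):
    """Toy residual stack x_{l+1} = x_l + tanh(W_l x_l); returns every residual state."""
    xs = [x0]
    for W in weights:
        x = xs[-1]
        h = [math.tanh(v) for v in matvec(W, x)]
        xs.append([a + b for a, b in zip(x, h)])
    return xs


def jacobian_T_vec(W, x, g):
    """J^T g for the assumed subblock f(x) = tanh(W x), with J = diag(1 - tanh^2(W x)) W."""
    pre = matvec(W, x)
    scaled = [(1.0 - math.tanh(p) ** 2) * gi for p, gi in zip(pre, g)]
    return [sum(W[i][j] * scaled[i] for i in range(len(W))) for j in range(len(x))]


def backward_standard(weights, xs, grad_out):
    """Ordinary backprop through the residual stack: g_l = g_{l+1} + J_l^T g_{l+1}."""
    g = list(grad_out)
    for l in reversed(range(len(weights))):
        jt = jacobian_T_vec(weights[l], xs[l], g)
        g = [a + b for a, b in zip(g, jt)]
    return g


def backward_by_depth(weights, xs, grad_out):
    """table[k] = part of dL/dx_0 that passed through exactly k subblock Jacobians."""
    n, d = len(weights), len(grad_out)
    table = [[0.0] * d for _ in range(n + 1)]
    table[0] = list(grad_out)  # depth 0: no subblock traversed yet
    for l in reversed(range(n)):
        new = []
        for k in range(n + 1):
            skip = table[k]  # identity (residual) path keeps the depth
            if k > 0:
                through = jacobian_T_vec(weights[l], xs[l], table[k - 1])
            else:
                through = [0.0] * d
            new.append([s + t for s, t in zip(skip, through)])
        table = new
    return table


random.seed(0)
d, n = 3, 4  # hypothetical sizes chosen only for the demo
weights = [[[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
           for _ in range(n)]
xs = forward(weights, [0.1, -0.2, 0.3])
grad_out = [1.0, 0.0, 0.0]  # stand-in for dL/dx at the top of the stack

table = backward_by_depth(weights, xs, grad_out)
total = [sum(table[k][j] for k in range(n + 1)) for j in range(d)]
full = backward_standard(weights, xs, grad_out)

for k, g in enumerate(table):
    print(k, math.sqrt(sum(v * v for v in g)))  # gradient norm per depth
```

The table costs n + 1 vector-Jacobian products per subblock instead of one, which is what keeps the decomposition tractable compared with enumerating every path separately.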

Assaf Ben Kish@abk_tau·15 Mar

@vivek_2332 @karpathy Curious - After all the hyperparameter tuning on the eval set - Was the final model evaluated on a test set?

Vivek@vivek_2332·14 Mar

introducing autoresearch-rl, autonomous research for rl post-training. inspired by @karpathy autoresearch, and i think rl post-training is honestly one of the places where this idea fits perfectly. there are at least 50+ hyper parameters to tweak, learning rate, batch size, rollouts, clipping ratios, kl penalties, schedulers, the list goes on. instead of sitting there for hours turning knobs one at a time, just let the model figure out the right starting config on its own.
some things worth mentioning:
-> built on @PrimeIntellect prime-rl (my favourite rl post-training framework) and @willccbb verifiers for reward verification.
-> ran qwen2.5-0.5b-instruct on gsm8k across 60+ autonomous experiments. eval score went from 0.475 to 0.550 and the agent actually found a way to do it in fewer steps (20 instead of 30). less compute, better results
-> the whole thing was surprisingly smooth to set up and run. point the agent at the config, go to sleep, wake up to a full experiment log. i really wish i could try this on a bigger model but gpu poor for now lol
-> the agent discovers things you wouldn't think to try. like how rollouts = 4 beats rollouts = 8, or how a constant lr schedule outperforms cosine. it just methodically tests everything
i think the real value here is that rl training is so fragile and noisy that having an agent patiently run experiment after experiment is genuinely more effective than a human doing it manually.
check it out: github.com/vivekvkashyap/…

Assaf Ben Kish retweeted

Moran Yanuka@moranynk·12 Mar

🎙️ Introducing ID-LoRA: the first open-source model to jointly generate a video with a person's appearance and voice in a single pass from just a reference image + short audio clip. No more cascaded pipelines where the audio can't follow your prompt. youtu.be/6bWcMh18K6g?si…

Assaf Ben Kish retweeted

Akarsh Kumar@akarshkumar0101·12 Mar

Pre-pre-training LLMs on synthetic NCA trajectories is better than pre-pre-training on real language! Check out why below. Great job @Danicmhlee and @seungwookh!

Seungwook Han@seungwookh

Can language models learn useful priors without ever seeing language? We pre-pre-train transformers on neural cellular automata — fully synthetic, zero language. This improves language modeling by up to 6%, speeds up convergence by 40%, and strengthens downstream reasoning. Surprisingly, it even beats pre-pre-training on natural text! Blog: hanseungwook.github.io/blog/nca-pre-p… (1/n)

Assaf Ben Kish retweeted

Tinker@tinkerapi·26 Feb

Introducing Self-Distillation Fine-Tuning, a new approach to continual learning. SDFT distills a model’s own outputs when given expert demonstration in-context. The result is faster learning than off-policy distillation, combined with reduced forgetting. x.com/IdanShenfeld/s…

idan shenfeld@IdanShenfeld

People keep saying 2026 will be the year of continual learning. But there are still major technical challenges to making it a reality. Today we take the next step towards that goal — a new on-policy learning algorithm, suitable for continual learning! (1/n)

Assaf Ben Kish@abk_tau·23 Feb

@N8Programs @lateinteraction Also, it is much more efficient to re-use neural circuits in later steps rather than "spreading copies" of them throughout depth

Assaf Ben Kish@abk_tau·23 Feb

@N8Programs @lateinteraction Another issue with feed-forward models: for many p-hop reasoning tasks, the order of information encoding (in depth) matters. There will always be "adversarial" p-hop trajectories that aren't feasible in a single forward pass. aclanthology.org/2024.emnlp-mai…

Omar Khattab@lateinteraction·23 Feb

Buried in the massive progress in LLMs over the past few years is how all your favorite Transformers/DNNs still can't solve even just grade school math problems above a "B" grade through a forward pass. Unless they're in a scaffold like CoT, ReAct, RLM, etc. And this is true even at trillions of params and bajillions of FLOPs.
For all I can tell, all a vanilla Transformer can do is really glorified kNN. Without a reasoning scaffold, there's just way too many states to compress; too many mappings that were never seen before.
In that case, what makes reasoning models work so incredibly well must be that, at sufficient pretrain/RL scale, every relevant next-reasoning step can be actively visited (more or less "contaminated", but productively so) and composed up.
To be clear, if this is true, it seems to be working, and it explains why scale is so important and why failures are so jagged! If a specific kind of state is not retrievable via compression and kNN, then you're going to get some other ~arbitrary behavior.
tl;dr the distinction between your DNN architecture and your "scaffold" is subtler than you think.

N8 Programs@N8Programs

Inspired by @RyanPGreenblatt, I measured LLMs' accuracy on GSM8K when only allowed to output a numerical answer without any CoT - all reasoning done in a few forward passes. The result is a nice log-linear scaling curve. We can also use this to guess param amt of closed models.

Assaf Ben Kish retweeted

Moran Yanuka@moranynk·1 Feb

1/3 🎉 Our paper is accepted at #ICASSP2026! TL;DR: Speculative decoding struggles for speech because exact token matching is too strict: acoustically similar tokens get rejected. We propose a fix. machinelearning.apple.com/research/coars…
