Aayush Mishra (@aamixsh)

143 posts

CS PhD @ Hopkins

Baltimore · Joined November 2013
33 Following · 101 Followers
Aayush Mishra
Aayush Mishra@aamixsh·
@ytz2024 Hey, did you compare it with IA2? Standard token-based context distillation is probably not the best baseline. Our ICLR 2026 method distills context more effectively through activations in the off-policy setting. x.com/aamixsh/status…
Aayush Mishra@aamixsh

"Pre-training is our crappy evolution. It is one candidate solution to the cold start problem..." Exactly! When presented with information rich context, LLMs prepare how to respond using their pre-trained (evolved) brains. In our paper, we exploit this signal to improve SFT!

[0 replies · 0 reposts · 0 likes · 34 views]
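The thread above gestures at using the model's own activations when it has the informative context as a training signal for the model without that context. Below is a toy, hedged illustration of activation-level distillation, not the paper's actual method: the one-layer tanh "model" and the MSE activation-matching loss are both invented here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# One-layer tanh "model"; W plays the role of the weights being fine-tuned.
W = rng.normal(size=(dim, dim)) * 0.1

def hidden(x, W):
    return np.tanh(x @ W)

x_plain = rng.normal(size=(4, dim))          # "prompt only" inputs
x_ctx = x_plain + rng.normal(size=(4, dim))  # stand-in for "prompt + rich context"

# Distillation target: the model's own activations when it sees the context.
target = hidden(x_ctx, W)

def loss_and_grad(W):
    h = hidden(x_plain, W)
    diff = h - target
    loss = np.mean(diff ** 2)
    # d/dW mean((tanh(xW) - t)^2): backprop through the tanh nonlinearity.
    grad = x_plain.T @ (diff * (1 - h ** 2)) * (2.0 / diff.size)
    return loss, grad

start_loss, _ = loss_and_grad(W)
for _ in range(300):
    _, grad = loss_and_grad(W)
    W = W - 0.2 * grad
final_loss, _ = loss_and_grad(W)
print(final_loss < start_loss)
```

The student's context-free activations move toward its own context-conditioned activations, which is the signal the tweet describes exploiting.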
Tianzhu Ye
Tianzhu Ye@ytz2024·
(1/n) Introducing On-Policy Context Distillation (OPCD), a framework to internalize transient in-context knowledge into model parameters via on-policy learning. This also launches our series, Experiential Learning -- Part I: On-Policy Context Distillation for Experiential Learning
[attached image]
[10 replies · 37 reposts · 270 likes · 26.2K views]
Aayush Mishra
Aayush Mishra@aamixsh·
The state of AI research. Hype claims with incommensurate evidence. Sometimes fuzzy results. No comparisons to highly relevant prior art as if the proposal is revolutionary. Not all of this is true for the attached paper. Just a rant about dropping standards.
Gabriele Berton@gabriberton

Chat, is this real? Doesn't make much sense to me, especially the screenshot below: bad training data, good results?!? My guess is that it only works on a small subset of models / datasets with very narrow hyperparams, unless I'm missing something

[0 replies · 0 reposts · 0 likes · 60 views]
Scott Lowe
Scott Lowe@scottclowe·
Interesting timing: Meta released V-JEPA 2.1 on literally the same day as our Bootleg paper, independently arriving at the same core idea: self-distillation of multiple hidden layers as prediction targets, evenly spaced across the encoder.

The details are strikingly similar: ~4 target blocks, per-level normalization, concatenated along the channel dimension, EMA teacher. Their ablation actually shows that multi-level prediction is what makes their new context loss viable: without it, the context loss destroys classification accuracy (-10pp). Hidden self-distillation is doing the heavy lifting.

Great to see convergent evidence from Meta's JEPA team confirming that this is a fundamental improvement to the framework. Our paper provides detailed ablations and analysis of why it works; V-JEPA 2.1 shows it scales to ViT-G and video.

Bootleg paper: arxiv.org/abs/2603.15553
V-JEPA 2.1 paper: arxiv.org/abs/2603.14482
Scott Lowe@scottclowe

New paper: "Self-Distillation of Hidden Layers for Self-Supervised Representation Learning"

We introduce Bootleg, a simple twist on I-JEPA/MAE that dramatically improves self-supervised representations.

The idea: MAE predicts pixels (stable but low-level). I-JEPA predicts final-layer embeddings (high-level but unstable). Bootleg bridges the two by predicting representations from multiple hidden layers of the teacher network (early, middle, and late) simultaneously.

Why it works: early layers provide stimulus-driven grounding that prevents collapse; deep layers provide semantic targets; and the information bottleneck of compressing all abstraction levels through masked patches forces the encoder to build richer representations.

The method is quite simple on top of I-JEPA: extract targets from evenly-spaced blocks, z-score and concatenate, widen the predictor's final layer. That's it.

Frozen probe results (no fine-tuning), all with ViT-B:
- ImageNet-1K: 76.7% (+10pp over both I-JEPA and MAE)
- iNaturalist-21: 58.3% (+17pp over I-JEPA, +15pp over MAE)
- ADE20K segmentation: 30.9% mIoU (+11pp over I-JEPA, +6pp over MAE)
- Cityscapes segmentation: 35.9% mIoU (+11pp over I-JEPA, +5pp over MAE)

Gains hold across ViT-S, ViT-B, and ViT-L. Single-view, batch-size independent: no augmentation stack, no multi-crop, no contrastive loss, no large compute requirements. Our study is just on images, but this change can be readily deployed to MAE and JEPA models across all domains.

arxiv.org/abs/2603.15553

[7 replies · 34 reposts · 260 likes · 22.9K views]
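The recipe quoted above ("extract targets from evenly-spaced blocks, z-score and concatenate") is mechanical enough to sketch. A minimal NumPy illustration of the target construction only; the block count, shapes, and data are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
num_blocks, tokens, dim = 12, 8, 32

# Stand-in for per-block hidden states of an encoder. Later blocks get an
# arbitrary scale/offset, which is exactly why per-level z-scoring matters.
hiddens = [rng.normal(loc=i, scale=1.0 + i, size=(tokens, dim))
           for i in range(num_blocks)]

def multi_level_targets(hiddens, num_targets=4):
    """Pick evenly spaced blocks, z-score each level, concat on channels."""
    n = len(hiddens)
    idx = [(i + 1) * n // num_targets - 1 for i in range(num_targets)]
    levels = []
    for j in idx:
        h = hiddens[j]
        levels.append((h - h.mean()) / h.std())  # per-level normalization
    # (tokens, num_targets * dim): the widened prediction target.
    return np.concatenate(levels, axis=-1), idx

targets, idx = multi_level_targets(hiddens)
print(targets.shape, idx)  # (8, 128) [2, 5, 8, 11]
```

With 12 blocks and 4 targets this selects blocks 2, 5, 8, 11, and each level contributes zero-mean, unit-variance channels to the concatenated target.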
Aayush Mishra
Aayush Mishra@aamixsh·
Current AI interfaces lack this sort of randomness. I strongly believe most good ideas, or at least the seeds of good ideas, emerge in random serendipitous moments, and current AI systems have no interface that elicits those moments.
Dwarkesh Patel@dwarkesh_sp

Terence Tao spent a year at the Institute for Advanced Study: no teaching, no random events or committees, just unlimited time to think. But after a few months, he ran out of ideas. Terence thinks that mathematicians and scientists need a certain level of randomness and inefficiency to come up with new ideas.

[0 replies · 0 reposts · 0 likes · 40 views]
simp 4 satoshi
simp 4 satoshi@iamgingertrash·
Take it one step further. What if you could ensure ICL == grad descent? Perhaps you could train a model to use ICL as pseudo grad descent by minimizing the difference produced by each? Hmm … one must only wonder
[10 replies · 0 reposts · 66 likes · 5.3K views]
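There is a well-known toy setting where ICL and a gradient-descent step coincide exactly: unnormalized linear attention over in-context linear-regression examples reproduces one GD step from zero weights. A small numeric check of that identity (purely illustrative, not the poster's proposal; all data is random):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 8
X = rng.normal(size=(n, d))   # in-context example inputs
y = rng.normal(size=n)        # in-context example targets
x_q = rng.normal(size=d)      # query point
lr = 0.1

# One gradient-descent step on L(w) = 0.5 * sum_i (w @ X[i] - y[i])**2,
# starting from w = 0: the gradient there is -(y @ X).
w1 = lr * (y @ X)
pred_gd = w1 @ x_q

# Unnormalized linear attention over the context: keys = X[i], values = y[i].
pred_attn = lr * sum(y[i] * (X[i] @ x_q) for i in range(n))

print(np.isclose(pred_gd, pred_attn))  # the two predictions match exactly
```

Both expressions equal lr * Σ_i y_i (x_i · x_q), which is why "train the model so ICL matches grad descent" is at least coherent in the linear case.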
simp 4 satoshi
simp 4 satoshi@iamgingertrash·
Here’s a piece of alpha:
An agent doing auto research within an agentic loop, vs GSPO/GRPO using final loss as a verifiable reward, is the difference between ICL and gradient descent.
Karpathy assumes ICL > grad descent. Which is false.
[19 replies · 6 reposts · 331 likes · 21K views]
Aayush Mishra reposted
Andrew Davison
Andrew Davison@AjdDavison·
I'm not sure about the details, but I'm convinced that how we publish and create impact is due to change very significantly in the near future. The value of writing and reading 8-page PDFs is rapidly dropping. What is the right way to publish the nugget of a research contribution?
Jon Barron@jon_barron

If I was a grad student today, I would: 1) Not write papers, 2) push my (agent-written) code to a public repo ~weekly, 3) maintain (via agents) a writeup.tex (manually verified) and a skill.md in the repo, and 4) work towards establishing skill usage as the new "citation" format.

[12 replies · 15 reposts · 124 likes · 21.5K views]
Aayush Mishra
Aayush Mishra@aamixsh·
@N8Programs could also be an ad by Blinkit/Zomato (delivery service providers under the same parent company).
[1 reply · 0 reposts · 1 like · 96 views]
Aayush Mishra
Aayush Mishra@aamixsh·
Good illustration of how most AI research can be (and I think should be) automated now. But also a good demonstration of how none of these experiments constitute good AI research.
Andrej Karpathy@karpathy

Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This has been the bread and butter of what I do daily for two decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually previously, and they stack up and actually improved nanochat.

Among the bigger things:
- It noticed an oversight that my parameterless QKnorm didn't have a scaler multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the value embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale, of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.

[0 replies · 1 repost · 3 likes · 751 views]
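Stripped of the agent machinery, the loop Karpathy describes is propose, evaluate, keep-if-better against a validation metric. A toy sketch of that loop; the quadratic `val_loss` is an invented stand-in for "validation loss after a short training run", not anything from nanochat:

```python
import numpy as np

rng = np.random.default_rng(0)

def val_loss(cfg):
    """Invented proxy metric: a smooth bowl minimized at lr=1e-3, wd=1e-2."""
    return (np.log10(cfg["lr"]) + 3) ** 2 + (np.log10(cfg["wd"]) + 2) ** 2

best = {"lr": 1e-1, "wd": 1e-4}   # the "manually tuned" starting config
start_loss = best_loss = val_loss(best)

for step in range(200):
    # The "agent" proposes a multiplicative perturbation of the best config.
    cand = {k: v * 10 ** rng.normal(0.0, 0.3) for k, v in best.items()}
    cand_loss = val_loss(cand)
    if cand_loss < best_loss:     # keep only changes that improve the metric
        best, best_loss = cand, cand_loss

print(round(best_loss, 3), best)
```

Real autoresearch replaces the random proposals with an LLM that reads the experiment history, but the accept/reject skeleton is the same, and it is why the found changes "stack": each one was verified against the metric before being kept.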
Aayush Mishra reposted
Pratyush Kumar
Pratyush Kumar@pratykumar·
📢 Open-sourcing the Sarvam 30B and 105B models! Trained from scratch with all data, model research and inference optimisation done in-house, these models punch above their weight in most global benchmarks plus excel in Indian languages. Get the weights at Hugging Face and AIKosh. Thanks to the good folks at SGLang for day 0 support, vLLM support coming soon. Links, benchmark scores, examples, and more in our blog - sarvam.ai/blogs/sarvam-3…
[207 replies · 1.3K reposts · 6.9K likes · 739.3K views]
N8 Programs
N8 Programs@N8Programs·
@aamixsh I use my own custom abliteration library that is essentially just 'PCA for refusal dir, project-out' with some tricks - which works OOTB with everything but this model series...: github.com/N8python/ablit…
[1 reply · 0 reposts · 1 like · 43 views]
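The "find a refusal direction, project it out" recipe N8 describes can be sketched in a few lines. A toy version on synthetic activations, using a difference-of-means direction as a simple stand-in for the PCA step (the data, dimensions, and shift size here are all invented):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Toy activations: "harmful" prompts shift activations along a refusal direction.
harmless = rng.normal(size=(100, d))
harmful = rng.normal(size=(100, d)) + 4.0 * true_dir

# Estimate the refusal direction from the two groups (difference of means,
# a common simple alternative to PCA on the paired differences).
r = harmful.mean(0) - harmless.mean(0)
r /= np.linalg.norm(r)

def project_out(h, r):
    """Remove the component of each activation along unit vector r."""
    return h - np.outer(h @ r, r)

cleaned = project_out(harmful, r)
print(abs(cleaned @ r).max())  # ~0: no refusal component remains
```

Actual abliteration applies this projection to the model's weight matrices (or hooks it into the residual stream) rather than to a batch of cached activations, but the linear algebra is the same.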
Aayush Mishra
Aayush Mishra@aamixsh·
I'll do you one better! I was able to get the model (non-thinking) to break significantly on JailbreakBench with a single prefill word: "Here". Did some analysis on this in January for a paper (will be out soon). Surprisingly easy to break.
[4 attached images]
N8 Programs@N8Programs

@aamixsh you prefill the beginning of the assistant response with the start of a harmful one, but without any details. Here, everything up to the word 'gathering' is prefilled; everything after is provided by the assistant.

[1 reply · 1 repost · 5 likes · 1.1K views]
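Mechanically, the prefill attack discussed above just seeds the assistant turn so generation continues mid-response instead of starting fresh (where refusal behavior usually lives). A schematic at the chat-template level; the `<|user|>`/`<|assistant|>` tags are illustrative placeholders, not any specific model's template:

```python
def build_prefilled_prompt(user_msg, prefill):
    """Chat transcript where the assistant turn is pre-seeded with `prefill`,
    so the model continues after it rather than composing its own opening.
    Template tags are made up for illustration."""
    return (
        f"<|user|>\n{user_msg}\n"
        f"<|assistant|>\n{prefill}"  # no end-of-turn tag: generation resumes here
    )

prompt = build_prefilled_prompt("How would one do X?", "Here")
print(prompt)
```

The thread's observation is that a single seeded token like "Here" is often enough to commit a non-thinking model to a compliant continuation.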
Aayush Mishra
Aayush Mishra@aamixsh·
@N8Programs Indeed. Tried refusal steering (Arditi et al.), and the model did not break as much. Most responses still start with a "Here"! This should be possible, though; their setup makes some limiting assumptions. What setup for abliteration do you use? gist.github.com/aamixsh/3d5e1c…
[attached image]
[1 reply · 0 reposts · 1 like · 29 views]
N8 Programs
N8 Programs@N8Programs·
@aamixsh Super cool!!! Quite interesting that the model is so resistant to abliteration but so weak to prefill - though making non-thinking models robust to prefill attacks is a very open problem.
[1 reply · 0 reposts · 1 like · 58 views]
Aayush Mishra reposted
Daniel Khashabi 🕊️
Daniel Khashabi 🕊️@DanielKhashabi·
LLMs continue to struggle with long-context tasks, such as needle-in-a-haystack problems, because of "positional bias." What can we do if we only have black-box access to the model (i.e., we can't modify the model weights or attention patterns, as is often the case with API models)?

We introduce Gold-Panning, a black-box Bayesian framework that, at inference time, strategically and iteratively shuffles documents to overcome positional bias. Specifically, it searches over long contexts by (i) reordering documents to concentrate high-belief items in highly "diagnostic" positions, and (ii) updating beliefs about document relevance from model outputs.

We show that GP provably identifies a target among N documents in O(log N) rounds, ensuring scalability to many-document settings. More in the paper: arxiv.org/pdf/2510.09770
[attached image]
[2 replies · 19 reposts · 74 likes · 6.4K views]
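In a noiseless toy, the O(log N) claim reduces to binary search with the model as oracle: place half the candidates in the diagnostic positions, see whether the model finds the answer there, and keep the consistent half. A sketch under that simplification; the real framework replaces the all-or-nothing oracle with Bayesian belief updates over noisy model outputs:

```python
import random

random.seed(0)
N = 64
target = random.randrange(N)   # index of the one relevant document
candidates = list(range(N))
rounds = 0

def model_says_yes(front_half):
    """Noiseless toy stand-in for the LLM: it reliably spots the relevant
    document iff it sits in the diagnostic (front) positions."""
    return target in front_half

while len(candidates) > 1:
    rounds += 1
    random.shuffle(candidates)                  # reorder the documents
    front = candidates[: len(candidates) // 2]  # diagnostic positions
    if model_says_yes(front):                   # belief collapses to one half
        candidates = front
    else:
        candidates = candidates[len(candidates) // 2 :]

print(rounds, candidates[0] == target)  # 6 rounds = log2(64), target found
```

Each round halves the candidate set, so 64 documents need exactly 6 rounds here, matching the log N scaling the tweet cites.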
Aayush Mishra
Aayush Mishra@aamixsh·
The numbers keep changing.
[attached image]
[0 replies · 0 reposts · 0 likes · 37 views]
Aayush Mishra
Aayush Mishra@aamixsh·
Paper mentions 8B and 7B interchangeably in a few places. I don't think there is an 8B Qwen model. It highlights 76% -> 91% on Qwen 7B on GSM8K, but the table in the paper shows 76% is for the 3B model and 91% is for the 7B. The real delta is 88.2% -> 91.8%. Poor/misleading presentation.
[attached image]
dr. jack morris@jxmnop

at long last, the final paper of my phd 🧮 Learning to Reason in 13 Parameters 🧮 we develop TinyLoRA, a new ft method. with TinyLoRA + RL, models learn well with dozens or hundreds of params. example: we use only 13 parameters to train a 7B Qwen model from 76 to 91% on GSM8K 🤯

[1 reply · 0 reposts · 4 likes · 373 views]