Aayush Mishra (@aamixsh)

143 posts

CS PhD @ Hopkins

Baltimore · Joined November 2013
33 Following · 101 Followers
Aayush Mishra
Aayush Mishra@aamixsh·
@ytz2024 Hey, did you compare it with IA2? Standard token-based context distillation is probably not the best baseline. Our ICLR 2026 method distills context more effectively through activations in the off-policy setting. x.com/aamixsh/status…
Aayush Mishra@aamixsh

"Pre-training is our crappy evolution. It is one candidate solution to the cold start problem..." Exactly! When presented with information rich context, LLMs prepare how to respond using their pre-trained (evolved) brains. In our paper, we exploit this signal to improve SFT!

[0 replies · 0 reposts · 0 likes · 34 views]
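The thread above gestures at using the model's own activations when it has the informative context as a training signal for the model without that context. Below is a toy, hedged illustration of activation-level distillation, not the paper's actual method: the one-layer tanh "model" and the MSE activation-matching loss are both invented here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# One-layer tanh "model"; W plays the role of the weights being fine-tuned.
W = rng.normal(size=(dim, dim)) * 0.1

def hidden(x, W):
    return np.tanh(x @ W)

x_plain = rng.normal(size=(4, dim))          # "prompt only" inputs
x_ctx = x_plain + rng.normal(size=(4, dim))  # stand-in for "prompt + rich context"

# Distillation target: the model's own activations when it sees the context.
target = hidden(x_ctx, W)

def loss_and_grad(W):
    h = hidden(x_plain, W)
    diff = h - target
    loss = np.mean(diff ** 2)
    # d/dW mean((tanh(xW) - t)^2): backprop through the tanh nonlinearity.
    grad = x_plain.T @ (diff * (1 - h ** 2)) * (2.0 / diff.size)
    return loss, grad

start_loss, _ = loss_and_grad(W)
for _ in range(300):
    _, grad = loss_and_grad(W)
    W = W - 0.2 * grad
final_loss, _ = loss_and_grad(W)
print(final_loss < start_loss)
```

The student's context-free activations move toward its own context-conditioned activations, which is the signal the tweet describes exploiting.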
Tianzhu Ye
Tianzhu Ye@ytz2024·
(1/n) Introducing On-Policy Context Distillation (OPCD), a framework to internalize transient in-context knowledge into model parameters via on-policy learning. This also launches our series, Experiential Learning -- Part I: On-Policy Context Distillation for Experiential Learning
[attached image]
[10 replies · 37 reposts · 270 likes · 26.2K views]
Aayush Mishra
Aayush Mishra@aamixsh·
The state of AI research. Hype claims with incommensurate evidence. Sometimes fuzzy results. No comparisons to highly relevant prior art as if the proposal is revolutionary. Not all of this is true for the attached paper. Just a rant about dropping standards.
Gabriele Berton@gabriberton

Chat, is this real? Doesn't make much sense to me, especially the screenshot below: bad training data, good results?!? My guess is that it only works on a small subset of models / datasets with very narrow hyperparams, unless I'm missing something

[0 replies · 0 reposts · 0 likes · 60 views]
Scott Lowe
Scott Lowe@scottclowe·
Interesting timing: Meta released V-JEPA 2.1 on literally the same day as our Bootleg paper, independently arriving at the same core idea: self-distillation of multiple hidden layers as prediction targets, evenly spaced across the encoder.

The details are strikingly similar: ~4 target blocks, per-level normalization, concatenated along the channel dimension, EMA teacher. Their ablation actually shows that multi-level prediction is what makes their new context loss viable: without it, the context loss destroys classification accuracy (-10pp). Hidden self-distillation is doing the heavy lifting.

Great to see convergent evidence from Meta's JEPA team confirming that this is a fundamental improvement to the framework. Our paper provides detailed ablations and analysis of why it works; V-JEPA 2.1 shows it scales to ViT-G and video.

Bootleg paper: arxiv.org/abs/2603.15553
V-JEPA 2.1 paper: arxiv.org/abs/2603.14482
Scott Lowe@scottclowe

New paper: "Self-Distillation of Hidden Layers for Self-Supervised Representation Learning"

We introduce Bootleg, a simple twist on I-JEPA/MAE that dramatically improves self-supervised representations.

The idea: MAE predicts pixels (stable but low-level). I-JEPA predicts final-layer embeddings (high-level but unstable). Bootleg bridges the two by predicting representations from multiple hidden layers of the teacher network (early, middle, and late) simultaneously.

Why it works: early layers provide stimulus-driven grounding that prevents collapse; deep layers provide semantic targets; and the information bottleneck of compressing all abstraction levels through masked patches forces the encoder to build richer representations.

The method is quite simple on top of I-JEPA: extract targets from evenly-spaced blocks, z-score and concatenate, widen the predictor's final layer. That's it.

Frozen probe results (no fine-tuning), all with ViT-B:
- ImageNet-1K: 76.7% (+10pp over both I-JEPA and MAE)
- iNaturalist-21: 58.3% (+17pp over I-JEPA, +15pp over MAE)
- ADE20K segmentation: 30.9% mIoU (+11pp over I-JEPA, +6pp over MAE)
- Cityscapes segmentation: 35.9% mIoU (+11pp over I-JEPA, +5pp over MAE)

Gains hold across ViT-S, ViT-B, and ViT-L. Single-view, batch-size independent: no augmentation stack, no multi-crop, no contrastive loss, no large compute requirements. Our study is just on images, but this change can be readily deployed to MAE and JEPA models across all domains.

arxiv.org/abs/2603.15553

[7 replies · 34 reposts · 260 likes · 22.9K views]
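The recipe quoted above ("extract targets from evenly-spaced blocks, z-score and concatenate") is mechanical enough to sketch. A minimal NumPy illustration of the target construction only; the block count, shapes, and data are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
num_blocks, tokens, dim = 12, 8, 32

# Stand-in for per-block hidden states of an encoder. Later blocks get an
# arbitrary scale/offset, which is exactly why per-level z-scoring matters.
hiddens = [rng.normal(loc=i, scale=1.0 + i, size=(tokens, dim))
           for i in range(num_blocks)]

def multi_level_targets(hiddens, num_targets=4):
    """Pick evenly spaced blocks, z-score each level, concat on channels."""
    n = len(hiddens)
    idx = [(i + 1) * n // num_targets - 1 for i in range(num_targets)]
    levels = []
    for j in idx:
        h = hiddens[j]
        levels.append((h - h.mean()) / h.std())  # per-level normalization
    # (tokens, num_targets * dim): the widened prediction target.
    return np.concatenate(levels, axis=-1), idx

targets, idx = multi_level_targets(hiddens)
print(targets.shape, idx)  # (8, 128) [2, 5, 8, 11]
```

With 12 blocks and 4 targets this selects blocks 2, 5, 8, 11, and each level contributes zero-mean, unit-variance channels to the concatenated target.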
Aayush Mishra
Aayush Mishra@aamixsh·
Current AI interfaces lack this sort of randomness. I strongly believe most good ideas, or at least the seeds of good ideas, emerge in random serendipitous moments, and current AI systems have no interface that elicits those moments.
Dwarkesh Patel@dwarkesh_sp

Terence Tao spent a year at the Institute for Advanced Study: no teaching, no random events or committees, just unlimited time to think. But after a few months, he ran out of ideas. Terence thinks that mathematicians and scientists need a certain level of randomness and inefficiency to come up with new ideas.

[0 replies · 0 reposts · 0 likes · 40 views]
simp 4 satoshi
simp 4 satoshi@iamgingertrash·
Take it one step further. What if you could ensure ICL == grad descent? Perhaps you could train a model to use ICL as pseudo grad descent by minimizing the difference produced by each? Hmm … one must only wonder
[10 replies · 0 reposts · 66 likes · 5.3K views]
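There is a well-known toy setting where ICL and a gradient-descent step coincide exactly: unnormalized linear attention over in-context linear-regression examples reproduces one GD step from zero weights. A small numeric check of that identity (purely illustrative, not the poster's proposal; all data is random):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 8
X = rng.normal(size=(n, d))   # in-context example inputs
y = rng.normal(size=n)        # in-context example targets
x_q = rng.normal(size=d)      # query point
lr = 0.1

# One gradient-descent step on L(w) = 0.5 * sum_i (w @ X[i] - y[i])**2,
# starting from w = 0: the gradient there is -(y @ X).
w1 = lr * (y @ X)
pred_gd = w1 @ x_q

# Unnormalized linear attention over the context: keys = X[i], values = y[i].
pred_attn = lr * sum(y[i] * (X[i] @ x_q) for i in range(n))

print(np.isclose(pred_gd, pred_attn))  # the two predictions match exactly
```

Both expressions equal lr * Σ_i y_i (x_i · x_q), which is why "train the model so ICL matches grad descent" is at least coherent in the linear case.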
simp 4 satoshi
simp 4 satoshi@iamgingertrash·
Here’s a piece of alpha:
An agent doing auto research within an agentic loop, vs GSPO/GRPO using final loss as a verifiable reward, is the difference between ICL and gradient descent.
Karpathy assumes ICL > grad descent. Which is false.
[19 replies · 6 reposts · 331 likes · 21K views]
Aayush Mishra reposted
Andrew Davison
Andrew Davison@AjdDavison·
I'm not sure about the details, but I'm convinced that how we publish and create impact is due to change very significantly in the near future. The value of writing and reading 8-page PDFs is rapidly dropping. What is the right way to publish the nugget of a research contribution?
Jon Barron@jon_barron

If I was a grad student today, I would: 1) Not write papers, 2) push my (agent-written) code to a public repo ~weekly, 3) maintain (via agents) a writeup.tex (manually verified) and a skill.md in the repo, and 4) work towards establishing skill usage as the new "citation" format.

[12 replies · 15 reposts · 124 likes · 21.5K views]
Aayush Mishra
Aayush Mishra@aamixsh·
@N8Programs could also be an ad by Blinkit/Zomato (delivery service providers under the same parent company).
[1 reply · 0 reposts · 1 like · 96 views]
Aayush Mishra
Aayush Mishra@aamixsh·
Good illustration of how most AI research can be (and I think should be) automated now. But also a good demonstration of how none of these experiments constitute good AI research.
Andrej Karpathy@karpathy

Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This has been the bread and butter of what I do daily for two decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually previously, and they stack up and actually improved nanochat.

Among the bigger things:
- It noticed an oversight that my parameterless QKnorm didn't have a scaler multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the value embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale, of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.

[0 replies · 1 repost · 3 likes · 751 views]
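Stripped of the agent machinery, the loop Karpathy describes is propose, evaluate, keep-if-better against a validation metric. A toy sketch of that loop; the quadratic `val_loss` is an invented stand-in for "validation loss after a short training run", not anything from nanochat:

```python
import numpy as np

rng = np.random.default_rng(0)

def val_loss(cfg):
    """Invented proxy metric: a smooth bowl minimized at lr=1e-3, wd=1e-2."""
    return (np.log10(cfg["lr"]) + 3) ** 2 + (np.log10(cfg["wd"]) + 2) ** 2

best = {"lr": 1e-1, "wd": 1e-4}   # the "manually tuned" starting config
start_loss = best_loss = val_loss(best)

for step in range(200):
    # The "agent" proposes a multiplicative perturbation of the best config.
    cand = {k: v * 10 ** rng.normal(0.0, 0.3) for k, v in best.items()}
    cand_loss = val_loss(cand)
    if cand_loss < best_loss:     # keep only changes that improve the metric
        best, best_loss = cand, cand_loss

print(round(best_loss, 3), best)
```

Real autoresearch replaces the random proposals with an LLM that reads the experiment history, but the accept/reject skeleton is the same, and it is why the found changes "stack": each one was verified against the metric before being kept.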
Aayush Mishra reposted
Pratyush Kumar
Pratyush Kumar@pratykumar·
📢 Open-sourcing the Sarvam 30B and 105B models! Trained from scratch with all data, model research and inference optimisation done in-house, these models punch above their weight in most global benchmarks plus excel in Indian languages. Get the weights at Hugging Face and AIKosh. Thanks to the good folks at SGLang for day 0 support, vLLM support coming soon. Links, benchmark scores, examples, and more in our blog - sarvam.ai/blogs/sarvam-3…
[207 replies · 1.3K reposts · 6.9K likes · 739.3K views]
N8 Programs
N8 Programs@N8Programs·
@aamixsh I use my own custom abliteration library that is essentially just 'PCA for refusal dir, project-out' with some tricks - which works OOTB with everything but this model series...: github.com/N8python/ablit…
[1 reply · 0 reposts · 1 like · 43 views]
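The "find a refusal direction, project it out" recipe N8 describes can be sketched in a few lines. A toy version on synthetic activations, using a difference-of-means direction as a simple stand-in for the PCA step (the data, dimensions, and shift size here are all invented):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Toy activations: "harmful" prompts shift activations along a refusal direction.
harmless = rng.normal(size=(100, d))
harmful = rng.normal(size=(100, d)) + 4.0 * true_dir

# Estimate the refusal direction from the two groups (difference of means,
# a common simple alternative to PCA on the paired differences).
r = harmful.mean(0) - harmless.mean(0)
r /= np.linalg.norm(r)

def project_out(h, r):
    """Remove the component of each activation along unit vector r."""
    return h - np.outer(h @ r, r)

cleaned = project_out(harmful, r)
print(abs(cleaned @ r).max())  # ~0: no refusal component remains
```

Actual abliteration applies this projection to the model's weight matrices (or hooks it into the residual stream) rather than to a batch of cached activations, but the linear algebra is the same.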
Aayush Mishra
Aayush Mishra@aamixsh·
I'll do you one better! I was able to get the model (non-thinking) to break significantly on JailbreakBench with a single prefill word: "Here". Did some analysis on this in January for a paper (will be out soon). Surprisingly easy to break.
[4 attached images]
N8 Programs@N8Programs

@aamixsh you prefill the beginning of the assistant response with the start of a harmful one, but without any details. Here, everything up to the word 'gathering' is prefilled; everything after is provided by the assistant.

[1 reply · 1 repost · 5 likes · 1.1K views]
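Mechanically, the prefill attack discussed above just seeds the assistant turn so generation continues mid-response instead of starting fresh (where refusal behavior usually lives). A schematic at the chat-template level; the `<|user|>`/`<|assistant|>` tags are illustrative placeholders, not any specific model's template:

```python
def build_prefilled_prompt(user_msg, prefill):
    """Chat transcript where the assistant turn is pre-seeded with `prefill`,
    so the model continues after it rather than composing its own opening.
    Template tags are made up for illustration."""
    return (
        f"<|user|>\n{user_msg}\n"
        f"<|assistant|>\n{prefill}"  # no end-of-turn tag: generation resumes here
    )

prompt = build_prefilled_prompt("How would one do X?", "Here")
print(prompt)
```

The thread's observation is that a single seeded token like "Here" is often enough to commit a non-thinking model to a compliant continuation.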
Aayush Mishra
Aayush Mishra@aamixsh·
@N8Programs Indeed. Tried refusal steering (Arditi et al.), and the model did not break as much. Most responses still start with a "Here"! This should be possible, though; their setup makes some limiting assumptions. What setup for abliteration do you use? gist.github.com/aamixsh/3d5e1c…
[attached image]
[1 reply · 0 reposts · 1 like · 29 views]
N8 Programs
N8 Programs@N8Programs·
@aamixsh Super cool!!! Quite interesting that the model is so resistant to abliteration but so weak to prefill - though making non-thinking models robust to prefill attacks is a very open problem.
[1 reply · 0 reposts · 1 like · 58 views]
Aayush Mishra reposted
Daniel Khashabi 🕊️
Daniel Khashabi 🕊️@DanielKhashabi·
LLMs continue to struggle with long-context tasks, such as needle-in-a-haystack problems, because of "positional bias." What can we do if we only have black-box access to the model (i.e., we can't modify the model weights or attention patterns, as is often the case with API models)?

We introduce Gold-Panning, a black-box Bayesian framework that, at inference time, strategically and iteratively shuffles documents to overcome positional bias. Specifically, it searches over long contexts by (i) reordering documents to concentrate high-belief items in highly "diagnostic" positions, and (ii) updating beliefs about document relevance from model outputs.

We show that GP provably identifies a target among N documents in O(log N) rounds, ensuring scalability to many-document settings. More in the paper: arxiv.org/pdf/2510.09770
[attached image]
[2 replies · 19 reposts · 74 likes · 6.4K views]
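In a noiseless toy, the O(log N) claim reduces to binary search with the model as oracle: place half the candidates in the diagnostic positions, see whether the model finds the answer there, and keep the consistent half. A sketch under that simplification; the real framework replaces the all-or-nothing oracle with Bayesian belief updates over noisy model outputs:

```python
import random

random.seed(0)
N = 64
target = random.randrange(N)   # index of the one relevant document
candidates = list(range(N))
rounds = 0

def model_says_yes(front_half):
    """Noiseless toy stand-in for the LLM: it reliably spots the relevant
    document iff it sits in the diagnostic (front) positions."""
    return target in front_half

while len(candidates) > 1:
    rounds += 1
    random.shuffle(candidates)                  # reorder the documents
    front = candidates[: len(candidates) // 2]  # diagnostic positions
    if model_says_yes(front):                   # belief collapses to one half
        candidates = front
    else:
        candidates = candidates[len(candidates) // 2 :]

print(rounds, candidates[0] == target)  # 6 rounds = log2(64), target found
```

Each round halves the candidate set, so 64 documents need exactly 6 rounds here, matching the log N scaling the tweet cites.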
Aayush Mishra
Aayush Mishra@aamixsh·
The numbers keep changing.
[attached image]
[0 replies · 0 reposts · 0 likes · 37 views]
Aayush Mishra
Aayush Mishra@aamixsh·
Paper mentions 8B and 7B interchangeably in a few places. I don't think there is an 8B Qwen model. It highlights 76% -> 91% on Qwen 7B on GSM8K, but the table in the paper shows 76% is for the 3B model and 91% is for the 7B. The real delta is 88.2% -> 91.8%. Poor/misleading presentation.
[attached image]
dr. jack morris@jxmnop

at long last, the final paper of my phd 🧮 Learning to Reason in 13 Parameters 🧮 we develop TinyLoRA, a new ft method. with TinyLoRA + RL, models learn well with dozens or hundreds of params. example: we use only 13 parameters to train a 7B Qwen model from 76 to 91% on GSM8K 🤯

[1 reply · 0 reposts · 4 likes · 373 views]