

Marco Ciccone
1.2K posts

@mciccone_AI
Postdoctoral Fellow @VectorInst - Collaborative, Decentralized, Modular ML - Competition chair @NeurIPSConf 2021, 2022, 2023 - PhD @polimi ex @NVIDIA @NNAISENSE








Everyone's excited about Karpathy's autoresearch that automates the experiment loop. We automated the whole damn thing. 🦞

Meet AutoResearchClaw: one message in, full conference paper out. Real experiments. Real citations. Real code. No human in the loop.

One message in → full paper out. Here's what happens in between:
📚 Raids arXiv & Semantic Scholar, digests 50+ papers in minutes
🥊 Three AI agents FIGHT over the best hypothesis (one swings big, one sanity-checks, one tries to kill every idea)
💻 Writes experiment code from scratch, adapts to your hardware
💥 Code crashes at 3am? It reads the stack trace, rewrites the fix, keeps going
🔄 Results weak? It pivots to entirely new hypotheses and starts over
📝 Drafts a full paper with citations, every single one verified against live databases

No babysitting. No Slack messages. No "hey can you re-run this."

Karpathy built the experiment loop. We built the whole lab. Chat an idea. Get a paper. 🦞

Try it 👉: github.com/aiming-lab/Aut…

Kudos to the team @JiaqiLiu835914, @richardxp888, @lillianwei423, @StephenQS0710, @Xinyu2ML, @HaoqinT, @zhengop, @cihangxie, @dingmyu, and we are looking for more contributors.
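The post doesn't show AutoResearchClaw's internals, so this is purely a hypothetical sketch of the "three agents fight over a hypothesis" step it describes: a bold proposer scores novelty, a sanity-checker scores plausibility, and a critic tries to kill each idea before ranking. All names (`Hypothesis`, `debate`, the scores) are illustrative, not the real pipeline on GitHub.

```python
# Hypothetical sketch of a proposer / sanity-checker / critic debate loop.
# Nothing here is the actual AutoResearchClaw code; names are made up.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    novelty: float       # the "swings big" agent scores this
    plausibility: float  # the sanity-checking agent scores this
    survived: bool       # did it survive the critic's attempts to kill it?

def debate(candidates: list[Hypothesis]) -> Hypothesis:
    # Critic acts first: anything it can kill is discarded outright.
    alive = [h for h in candidates if h.survived]
    # Survivors are ranked by novelty, tie-broken by plausibility.
    return max(alive, key=lambda h: (h.novelty, h.plausibility))

pool = [
    Hypothesis("scale it up", novelty=0.2, plausibility=0.9, survived=True),
    Hypothesis("new objective", novelty=0.8, plausibility=0.7, survived=True),
    Hypothesis("perpetual motion", novelty=1.0, plausibility=0.0, survived=False),
]
print(debate(pool).text)  # → new objective
```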




PSA: never, ever write "we use the same learning rate across all methods for fair comparison." I read this as "do not trust any of our conclusions" and then I move on. If learning rate tuning is not mentioned at all, it takes me a little longer to notice, but I also move on.
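What "fair comparison" actually requires is sweeping the same LR grid for every method but keeping each method's own best. A minimal sketch, where `train_and_eval` is a toy stand-in for a real training run (here each method is simply given a different optimal LR to make the point):

```python
# Sketch: tune the learning rate *per method* over a shared grid.
# `train_and_eval` is a hypothetical stand-in for an actual training run.
import math

def train_and_eval(method: str, lr: float) -> float:
    """Toy score: each method peaks at a different LR. Higher is better."""
    best_lr = {"baseline": 3e-4, "our_method": 1e-3}[method]
    return -abs(math.log10(lr) - math.log10(best_lr))

lr_grid = [1e-4, 3e-4, 1e-3, 3e-3]

results = {}
for method in ["baseline", "our_method"]:
    # Same grid for everyone, but each method keeps its own winner.
    results[method] = max(lr_grid, key=lambda lr: train_and_eval(method, lr))

print(results)  # → {'baseline': 0.0003, 'our_method': 0.001}
```

Forcing one shared LR would hand the win to whichever method happens to sit closest to that LR's sweet spot.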

Someone should put me (down) out of the fp8 misery, it has no end; at this point I'll really end up writing my own kernels

New paper dropped by Anthropic: "Fractal Language Models." It DESTROYS the context window narrative. The LLM doesn't just respond, it splits into self-similar copies. No tokens, just models arguing and compressing, until the prompt is not read but self-reconstructed. /satire @a1zhang





so many things to do that I end up doing nothing

mHC puts a lot of effort into training stability. In some respects, stable backprop through depth is similar to stable backprop through time (BPTT) in modern RNNs. Many RNNs can be written as S_{t+1} = Gate @ S_t + f(S_t), similar to mHC's x_{t+1} = H @ x_t + f(x_t). Backprop through both involves cumulative matmuls, whose eigenvalues can explode or vanish. In RNNs, common stable parametrizations of the gate include:
1. Decay gate: a diagonal or scalar gate with values between 0 and 1. Used by RetNet and Mamba2.
2. Identity: the same as the original residual connection.
3. Householder matrix: used by DeltaNet (if beta = 2). A type of orthogonal matrix with all singular values equal to 1, so the cumulative matmul is also orthogonal.
mHC uses a doubly stochastic matrix, and the cumulative matmul then also yields a doubly stochastic matrix. Interestingly, these design spaces for residual connections and RNNs might be shared and influence each other. A trickier point: stability does not always mean effectiveness.
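The closure properties above can be checked numerically: products of Householder (orthogonal) matrices preserve norm exactly, and products of doubly stochastic matrices stay doubly stochastic, while unconstrained gates let the cumulative product's norm drift. A minimal NumPy sketch (Sinkhorn normalization here is just one convenient way to build a doubly stochastic matrix, not necessarily how mHC parametrizes H):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_doubly_stochastic(n, iters=50):
    """Sinkhorn: alternately normalize rows and columns of a positive
    matrix until it is (approximately) doubly stochastic."""
    m = rng.random((n, n)) + 1e-3
    for _ in range(iters):
        m /= m.sum(axis=1, keepdims=True)  # rows sum to 1
        m /= m.sum(axis=0, keepdims=True)  # columns sum to 1
    return m

def householder(n):
    """Householder reflection I - 2 v v^T: orthogonal, singular values all 1."""
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    return np.eye(n) - 2.0 * np.outer(v, v)

n, depth = 8, 200

# Unconstrained gates: the cumulative product's norm typically drifts
# (explodes or vanishes) with depth.
prod_free = np.eye(n)
for _ in range(depth):
    prod_free = (rng.standard_normal((n, n)) / np.sqrt(n)) @ prod_free

# Doubly stochastic gates: the product is still doubly stochastic.
prod_ds = np.eye(n)
for _ in range(depth):
    prod_ds = random_doubly_stochastic(n) @ prod_ds

# Householder gates: the product is still orthogonal, spectral norm 1.
prod_hh = np.eye(n)
for _ in range(depth):
    prod_hh = householder(n) @ prod_hh

print("free gate spectral norm :", np.linalg.norm(prod_free, 2))
print("DS product row sums     :", prod_ds.sum(axis=1))   # ~all ones
print("Householder spectral norm:", np.linalg.norm(prod_hh, 2))  # ~1.0
```

The same closure argument is what makes BPTT (or backprop through depth) stable: the Jacobian product inherits the constraint of each factor.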








A lunch merge.

