Puneesh Deora
@puneeshdeora

767 posts

PhD student at UBC. Working on theory of Deep Learning.

Joined August 2019
370 Following · 136 Followers
Puneesh Deora retweeted
ICML Conference @icmlconf ·
Announcing the #ICML2026 tutorials! All ten tutorials will be presented on the first day of the conference, Monday, July 6. Read the blog post for more details on the selection process!
[image]
2 replies · 20 reposts · 95 likes · 14.4K views
Puneesh Deora @puneeshdeora ·
@f14bertolotti I was looking for a wall-clock time comparison instead of just steps.
0 replies · 0 reposts · 2 likes · 253 views
Francesco Bertolotti @f14bertolotti ·
Another optimizer to watch: Mousse, which combines SOAP and Muon. Early results show promising scaling up to roughly 1B-parameter LLMs, and the experiments look quite solid. Really cool work! 🔗arxiv.org/pdf/2603.09697
[4 images]
5 replies · 16 reposts · 84 likes · 5.4K views
Puneesh Deora @puneeshdeora ·
A big part is learning things, learning how to think, what to think, developing taste. Sending papers into the void is no good, but formalizing and writing things up still helps. I'm not sure agent-written code would help with these.
Jon Barron @jon_barron

If I was a grad student today, I would: 1) Not write papers, 2) push my (agent-written) code to a public repo ~weekly, 3) maintain (via agents) a writeup.tex (manually verified) and a skill.md in the repo, and 4) work towards establishing skill usage as the new "citation" format.

0 replies · 0 reposts · 1 like · 89 views
Puneesh Deora retweeted
Edgar Dobriban @EdgarDobriban ·
AI is getting great at math, but how good is it at solving real research problems outside the areas covered by Erdős problems? To help gauge this, I have started putting together a list of unsolved research problems in mathematical statistics and machine learning, sourced from recent papers in a leading statistics journal, the Annals of Statistics (with some bonus COLT open problems): solveall.org. Currently >100 problems.

In my view, much of the value of AI for researchers in the mathematical sciences stems from helping with their own research problems. These are problems without known solutions. There are many math benchmarks, but few with the following properties:
(1) of realistic research level, so that solving them can potentially lead to a publication in a top journal (problems already discussed in papers, not contest math, not Millennium problems, not problems created for a benchmark, not problems with a known solution); I'd say Erdős problems are the best example of this.
(2) cover problems outside the usual focus (combinatorics, number theory, ...) of Erdős problems. Especially under-represented are domains of applied math, along with statistics, operations research, etc.

I'm interested in statistics and ML, so that's where I started, but this could grow over time. Hope this can grow into something useful to the community! Happy to hear your thoughts...
[image]
32 replies · 72 reposts · 430 likes · 54.4K views
Mathieu @miniapeur ·
What is a good alternative to Google Scholar, particularly one that counts citations accurately?
14 replies · 2 reposts · 49 likes · 23.5K views
Puneesh Deora @puneeshdeora ·
Most definitely agree. While these tools level the playing field by reducing technical barriers, the thinking barrier still exists. Those who have practiced what and how to think in a more organic way will move even faster than before.
Haider. @haider1

OpenAI's Sebastien Bubeck says deep expertise is more important than ever in the AI age: to get maximum value from AI, you need enough real understanding to describe the problem clearly. "This creates the gap between people who keep studying and those who rely too much on AI."

0 replies · 0 reposts · 2 likes · 120 views
Puneesh Deora @puneeshdeora ·
Me laughing at ChatGPT in 2022 trying to solve research-relevant math vs. now
0 replies · 0 reposts · 3 likes · 82 views
Puneesh Deora @puneeshdeora ·
@ben_golub Also, do we know how much human supervision was involved? They say "limited" but idk what that means
0 replies · 0 reposts · 2 likes · 1.2K views
Ben Golub @ben_golub ·
So what's the state of First Proof? Is there some consensus on how OAI did?
3 replies · 1 repost · 61 likes · 17.6K views
Puneesh Deora @puneeshdeora ·
@yangpliu @littmath I feel one thing that comes out of attempting proofs with AI is fully realizing how much information is already out there, and how many new results we can get from core ideas that already exist; we just can't search them as efficiently as AI can.
1 reply · 1 repost · 10 likes · 2.2K views
Yang Liu @yangpliu ·
My thoughts on #1stProof Problem 6 (closely related to areas I've worked in): OpenAI’s solution is essentially correct, and the difficulty feels consistent with AI capabilities over the past several months. More detail in the thread.
8 replies · 36 reposts · 380 likes · 79.5K views
Puneesh Deora retweeted
Difan Zou @difanzou ·
We are excited to share our latest work on the implicit bias of stochastic steepest descent! While optimizers like Adam and Muon, variants of steepest descent under different norms, are popular for large-scale pretraining, their theoretical behavior under mini-batch stochastic gradients has remained elusive. We provide a unified analysis of how batch size, momentum, and variance reduction shape the solutions these algorithms find.

Key findings:
🔹 No momentum: convergence to the (approximate) max-margin solution requires large batches. Small batches fail to recover the full-batch implicit bias.
🔹 Momentum: acts as a stabilizer! It enables convergence to the max-margin solution even with small batches, though at a slower rate.
🔹 Variance reduction: a better fix; it recovers the exact full-batch implicit bias regardless of batch size or momentum.
🔹 Small batch size (B=1): in the extreme case (per-sample updates), the implicit bias fundamentally changes and provably cannot be explained by standard max-margin solutions.

We believe thoroughly understanding these algorithms in theory is crucial for developing the foundations of large-model training, and our work provides an initial attempt along this line.

Paper: arxiv.org/abs/2602.11557
Joint work with Jichu Li (fantastic undergrad student) and Xuan Tang (genius grad student).
[image]
0 replies · 24 reposts · 148 likes · 9.5K views
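As context for "steepest descent under different norms" above: a standard way to write one normalized steepest-descent step with respect to a norm \|\cdot\| (an illustrative textbook formulation; the paper's exact setup may differ) is

% one normalized steepest-descent step w.r.t. a chosen norm \|.\|
\[
x_{t+1} \;=\; x_t \;-\; \eta\, d_t,
\qquad
d_t \;\in\; \operatorname*{arg\,max}_{\|d\| \le 1} \langle g_t,\, d \rangle,
\]

where g_t is the (mini-batch) gradient. Taking \|\cdot\| = \ell_2 gives normalized gradient descent (d_t = g_t / \|g_t\|_2); \ell_\infty gives sign descent, d_t = \operatorname{sign}(g_t), the family Adam is usually grouped with; and the spectral norm on a weight matrix gives d_t = U V^\top for the SVD g_t = U \Sigma V^\top, the orthogonalized update behind Muon.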
Puneesh Deora @puneeshdeora ·
I think it's got to do with non-contiguous metadata: when you separate the arXiv id from the title and authors with a URL or whatever, it messes things up. That's the format you get with the @misc entry type, which is the default for arXiv's BibTeX generator. Use the @article entry type instead.
0 replies · 0 reposts · 1 like · 49 views
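To make the fix concrete, a minimal sketch with a hypothetical paper (all titles, authors, and ids below are made up for illustration). The @misc entry mirrors the shape of arXiv's BibTeX export, with the arXiv id split off from the title and authors into eprint/url fields; the @article entry keeps the metadata contiguous:

% arXiv's default export style: @misc, id separated into eprint/url fields
@misc{doe2026example,
  title         = {An Example Paper},
  author        = {Doe, Jane and Roe, Richard},
  year          = {2026},
  eprint        = {2601.01234},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2601.01234},
}

% the workaround: @article, with the id kept inline in the journal field
@article{doe2026example,
  title   = {An Example Paper},
  author  = {Doe, Jane and Roe, Richard},
  journal = {arXiv preprint arXiv:2601.01234},
  year    = {2026},
}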
Puneesh Deora @puneeshdeora ·
Idk if people know this, but Google Scholar does not index the second reference format below (it does index the first), and the second is the BibTeX you get from arXiv.
[image]
1 reply · 0 reposts · 2 likes · 214 views
Puneesh Deora @puneeshdeora ·
15 simple things put together in the right way give something great. Nothing surprising about it.
Mathieu @miniapeur

0 replies · 0 reposts · 1 like · 69 views
Damek @damekdavis ·
Finally finished our substantial revision of this paper and uploaded it to arXiv. It's much cleaner and clearer, though not any shorter. I'll write a follow-up thread in the next couple of days when it appears!
Damek @damekdavis

New paper studies when spectral gradient methods (e.g., Muon) help in deep learning: 1. We identify a pervasive form of ill-conditioning in DL: post-activation matrices have low stable rank. 2. We then explain why spectral methods can perform well despite this. Long thread

3 replies · 1 repost · 89 likes · 9.6K views
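"Low stable rank" in the quoted tweet refers to a standard quantity (assuming the paper uses the usual definition):

\[
\operatorname{srank}(A) \;=\; \frac{\|A\|_F^2}{\|A\|_2^2}
\;=\; \frac{\sum_i \sigma_i(A)^2}{\sigma_{\max}(A)^2}
\;\le\; \operatorname{rank}(A),
\]

so low stable rank means a few singular values carry most of the Frobenius mass, i.e. the matrix is ill-conditioned in an effective sense.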
Puneesh Deora @puneeshdeora ·
I was using the beautiful Marchenko–Pastur theorem today and randomly decided to check the authors' Wikipedia pages, and learned that Volodymyr Marchenko passed away a few hours ago, aged 103. RIP.
[image]
0 replies · 2 reposts · 5 likes · 257 views
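For readers meeting it here first, the Marchenko–Pastur theorem (standard statement) gives the limiting eigenvalue density of the sample covariance matrix \frac{1}{n} X^\top X, where X is n \times p with i.i.d. mean-zero, variance-\sigma^2 entries and p/n \to \lambda \in (0, 1]:

\[
f(x) \;=\; \frac{1}{2\pi \sigma^2 \lambda} \cdot \frac{\sqrt{(\lambda_+ - x)(x - \lambda_-)}}{x},
\qquad
\lambda_\pm \;=\; \sigma^2 \bigl(1 \pm \sqrt{\lambda}\bigr)^2,
\qquad
x \in [\lambda_-, \lambda_+].
\]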
Puneesh Deora @puneeshdeora ·
I just learned that in 1611 Kepler conjectured that cubic close packing is the densest way of arranging spheres (highest average density), and it took ~400 years for a formal proof :)
0 replies · 0 reposts · 4 likes · 102 views
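For reference, the density in question: face-centered cubic (equivalently, hexagonal) close packing fills space with fraction

\[
\rho \;=\; \frac{\pi}{3\sqrt{2}} \;\approx\; 0.74048,
\]

which Hales's 1998 proof (formally verified by the Flyspeck project in 2014) showed cannot be exceeded.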