Dersu

796 posts

@tak3sh8

🦀 🍤 🧋🇸🇬 Joined September 2021
2.8K Following · 9.4K Followers
Dersu @tak3sh8
@yoavgo But I derived the dual of SVM from scratch, surely that must count??
0 replies · 0 reposts · 7 likes · 419 views
(((ل()(ل() 'yoav))))👾
"I've been doing AI for 20 years and ..." and nothing. LLMs are new. LLM agents are new. Our 20+ years of experience with AI/ML/NLP may be marginally useful for understanding aspects of their training, but that's about it. We need new tools and experiences. We don't deserve authority.
26 replies · 28 reposts · 381 likes · 19K views
Dersu @tak3sh8
@deliprao Well, is there a simpler and more ubiquitous increasing function than exp that maps the reals onto the positive reals?
1 reply · 0 reposts · 2 likes · 1K views
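Dersu's point, that exp is the obvious increasing map from the reals onto the positive reals, can be checked numerically. A minimal sketch (my own illustration, not from the thread), comparing exp with softplus, a common positive, monotonic alternative:

```python
import numpy as np

xs = np.linspace(-5.0, 5.0, 101)

# exp is strictly positive and strictly increasing on all of R,
# and its log is just the identity -- the "clean log-prob" property.
assert np.all(np.exp(xs) > 0)
assert np.all(np.diff(np.exp(xs)) > 0)
assert np.allclose(np.log(np.exp(xs)), xs)

# softplus(x) = log(1 + exp(x)) is also positive and increasing,
# but log(softplus(x)) is no longer the identity, so log-probabilities
# lose their simple form.
softplus = np.log1p(np.exp(xs))
assert np.all(softplus > 0)
assert np.all(np.diff(softplus) > 0)
assert not np.allclose(np.log(softplus), xs)
```

Other candidates fail more basic requirements: x² is positive but not monotonic, and ReLU is not strictly positive, so exp remains the simplest function satisfying all three properties at once.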
Delip Rao e/σ @deliprao
Why softmax? This is a great question, and I explain it in the following way in my deep learning course:

While there are historical uses of this exponential form (Boltzmann, Gibbs, Jaynes, Luce & McFadden), its use in neural networks with backprop was first by Bridle*. He essentially settled the question of how to build a classification head in neural networks. (Hinton's Boltzmann machine paper did not use backprop, and he didn't refer to this function as softmax.) We also have Bridle to thank for gifting us the term "softmax" (although in reality it is softargmax). After Bridle, softmax became the de facto standard for classification heads, because Chris Bishop popularized it in his textbook, drawing connections to GLMs.

Now as to the question of why softmax and not anything else: it's not because of a legacy lock-in effect that we continue to use softmax. There are technical reasons:
- softmax was *derived* (not arbitrarily picked) from information theory (the maximum entropy principle), so it has well-motivated theoretical foundations
- derivatives of exp were easy to compute (especially important in the era before autodiff, when gradient functions were hand-computed)
- it's strictly positive everywhere, which means every class receives a non-zero gradient
- it is C^∞ smooth, making it gradient-descent friendly, so it continues getting used
- it is translation invariant, with a clean log-prob function
- cross-entropy loss together with softmax produces a simple gradient of the form (a-b), so there are no exponentials to compute and no exploding gradients; same with the Jacobians
- all this made softmax sticky even before the hardware appeared to support it

Overall, the community stumbled on a gem, quickly realized its value, and locked in. That's why softmax is everywhere.

*Bridle paper which many do not know about: link.springer.com/chapter/10.100…
levi @levidiamode

Day 125/365 of GPU Programming

One thing I'm still struggling to understand is: why softmax? What is it about the softmax function that made it survive/thrive for this long? What is it about exp() compared to another positive, monotonic, differentiable function that is so sticky?

So I'm studying softmax functions in a bit more depth today, taking a look at optimizations via SFUs on Nvidia GPUs and listening to the GOATs (Andrew Ng, Hinton, etc.) explain the reasoning behind softmax as a primary choice. If anyone has good resources that dive into softmax and softmax alternatives, please share!

9 replies · 71 reposts · 710 likes · 61K views
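Two of the properties Delip lists, translation invariance and the simple (a-b) gradient of cross-entropy on top of softmax, can be verified in a few lines of NumPy. A minimal sketch (my own illustration, not from the thread):

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; valid because
    # softmax is translation invariant.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
target = np.array([1.0, 0.0, 0.0])  # one-hot label

# Translation invariance: adding a constant to every logit changes nothing.
assert np.allclose(softmax(logits), softmax(logits + 100.0))

# The gradient of cross-entropy(softmax(z), target) w.r.t. z is (probs - target):
probs = softmax(logits)
analytic_grad = probs - target

# Check against a central finite-difference estimate of the same gradient.
def loss(z):
    return -np.sum(target * np.log(softmax(z)))

eps = 1e-6
numeric_grad = np.array([
    (loss(logits + eps * np.eye(3)[i]) - loss(logits - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
assert np.allclose(analytic_grad, numeric_grad, atol=1e-4)
```

The final assertion is exactly the "(a-b)" form Delip mentions: no exponentials survive in the gradient, which is why softmax plus cross-entropy was so cheap in the hand-derived-gradient era.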
Yann LeCun @ylecun
@eladgil BS.
Attention was born in Montréal. PyTorch in NYC. AlphaGo in London. AlphaFold in London. ESMFold in NYC. Llama 1 in Paris. Llama 2 in Paris+NYC+SV. DeepSeek in Hangzhou.
Plus: DINO in Paris. JEPA in Montréal+Paris+NYC.
SV is 3 mos ahead on topics SV is singularly obsessed with.
180 replies · 490 reposts · 7.7K likes · 715.3K views
Elad Gil @eladgil
People at major AI labs (using internal models) are 3-4 months ahead of startup Silicon Valley engineers. SV founders/eng are 3-6 months ahead of NY. NY founders/eng are 6-12 months ahead of the rest of the world. Most people have no idea how fast AI is shifting, as they are 1-2 years behind SOTA.

"The future is here, just not equally distributed" - William Gibson
352 replies · 471 reposts · 5.3K likes · 3.9M views
Dersu retweeted
Bryan Cheong @bryancsk
Incredible things are happening in the Parliament of Singapore rn
42 replies · 105 reposts · 1.1K likes · 224.9K views
Dersu @tak3sh8
saw a mathematician at a cafe. no GPT5.5, no Lean, no AlphaProof, no multi-agent setup. just a pen and paper, pushing epsilon-delta around like a psychopath
0 replies · 0 reposts · 7 likes · 369 views
Dersu @tak3sh8
@kfountou I haven't seen anybody remotely saying that mathematics has been solved??
0 replies · 0 reposts · 5 likes · 708 views
Kimon Fountoulakis @kfountou
I don't know why people believe that mathematics has been solved. It's not even close.

I have been using big models frequently since July 2025, starting with Gemini 2.5 Pro (arxiv.org/abs/2510.04115), and I have used them daily since then. GPT has basically been running non-stop for me since 5.2.

If you try to push these models beyond the boundary of the current literature, they fail. They don't tell you they fail; they introduce conditional assumptions which make the results weak. However, they make it very clear what is boundary knowledge and what is not, which was previously extremely hard to detect. Use them this way.
18 replies · 18 reposts · 192 likes · 30.2K views
Dersu retweeted
Timothy Gowers @wtgowers
@johnfrduncan I agree with this, but it’s not obvious how the social and economic structures to support that activity will be able to develop.
9 replies · 9 reposts · 179 likes · 20.8K views
Przemek Chojecki | PC @prz_chojecki
@tak3sh8 @wittgensteinsBB Currently tenured professors are (probably) safe. I would not extend that into the future, as the entire idea of tenure might be gone. There are definitely changes coming to academia, and that's good. From both a learning and a research perspective, they have been due for a long time.
1 reply · 0 reposts · 1 like · 51 views
Przemek Chojecki | PC @prz_chojecki
Mathematical research is bound to change. The question is not if but how. Tenured professors will be fine, but what about PhD students and postdocs? They will likely be the most affected by the capabilities of AI. My bet is that the academic community will start valuing exposition more than being first to prove something.
Timothy Gowers @wtgowers

Of course, this raises all sorts of questions about what is going to happen to mathematical research, with the impact on PhD students being particularly urgent. I give a few thoughts on this in the blog post, but I don't have anything like complete answers.

18 replies · 7 reposts · 125 likes · 14.2K views
Dersu @tak3sh8
@prz_chojecki @wittgensteinsBB Also, formal bankruptcy is quite rare, but structural financial failure inside universities is not! Departments get cut, merged, or phased out all the time.
0 replies · 0 reposts · 0 likes · 16 views
Dersu @tak3sh8
@prz_chojecki @wittgensteinsBB Academia was built around scarcity of knowledge, expertise, access, and research infrastructure, and frontier research is increasingly done in industrial labs. And then there is the model of charging students a fortune for large lecture-based knowledge transmission. Tenure is safe?
2 replies · 0 reposts · 1 like · 134 views
Dersu retweeted
Richard Sutton @RichardSSutton
If you are interested, you can learn a bit more about me from this video portrait from the Heidelberg Laureates Forum: youtu.be/jRPR6lx-iuw?si…
4 replies · 18 reposts · 126 likes · 12.5K views
Dersu @tak3sh8
@wtgowers @stevenstrogatz It's interesting that you used the word "comfort" several times to describe the feeling many mathematicians seemed to have when LLMs were still less capable than they are now.
0 replies · 0 reposts · 7 likes · 3.4K views
Dersu retweeted
Timothy Gowers @wtgowers
I've recently got in on the act of getting AI to solve open problems in mathematics. More precisely, I gave some questions asked by Melvyn Nathanson to ChatGPT 5.5 Pro, to which I have been given access, and it answered them. 🧵
73 replies · 367 reposts · 1.9K likes · 609.2K views
Ofir Press @OfirPress
Not sure how we pulled off this marketing chess move, but @Kasparov63 retweeted ProgramBench.
4 replies · 3 reposts · 79 likes · 8.7K views
Dersu @tak3sh8
@systematicls "comically good reputation among those outside the industry" not what my colleagues in academia told me 😅
0 replies · 0 reposts · 2 likes · 407 views
Dersu @tak3sh8
Has anyone ever gotten real value from those cute-looking #Obsidian knowledge graphs?
0 replies · 0 reposts · 1 like · 310 views