Dersu

796 posts

Dersu

@tak3sh8

🦀 🍤 🧋🇸🇬 Katılım Eylül 2021

2.8K Takip Edilen9.4K Takipçiler

Dersu@tak3sh8·1d

@yoavgo But I derived the dual of SVM from scratch, surely that must count??

English

419

(((ل()(ل() 'yoav))))👾@yoavgo·1d

"I've been doing AI for 20 years and ..." and nothing. LLMs are new. LLM-Agents are new. our 20+ years experience with AI/ML/NLP may be marginally useful for understanding aspects of their training, but thats about it. we need new tools and experiences. we dont deserve authority.

English

381

19K

Dersu@tak3sh8·2d

@deliprao Well is there a simpler and ubiquitous increasing function than exp that maps the reals to the positive reals?

English

Delip Rao e/σ@deliprao·2d

Why softmax? This is great question and I explain it in the following way in my deep learning course: While there are historical uses of this exponential form (Boltzmann, Gibbs, Jaynes, Luce & McFadden), its use in neural networks with backprop was first by Bridle*. He essentially closed how to solve the classification head question in neural networks. (Hinton’s Boltzmann machine paper did not use backprop and he didn’t refer to this function as softmax). We also have Bridle responsible for gifting us the term “softmax” (although in reality it is softargmax). After Bridle, softmax became defacto standard for classification heads, because Chris Bishop popularized it in his textbook drawing connections to GLMs. Now as to the question why softmax and not anything else: It’s not because there is a legacy lock in effect that we continue to use softmax. There are technical reasons: - softmax was *derived* (not arbitrarily picked) from information theory (maximum entropy principle) so it has well motivated theoretical foundations - derivatives of exp were easy to compute (this was especially important in the era before autodiff when gradient functions were hand computed) - it’s strictly positive everywhere, which means every class will receive a non-zero gradient. - it is C^∞ smooth making it gradient descent friendly so it continues getting used - translation invariant, clean logprob function - cross entropy loss along with softmax produces a simple gradient form of type (a-b). So no exponentials to compute and no exploding gradients - same with jacobians - all this made softmax sticky even before the hardware appeared to support it Overall, the community stumbled on a gem, quickly realized its value, and locked in. That’s why softmax is everywhere. *Bridle paper which many do not know about: link.springer.com/chapter/10.100…

levi@levidiamode

Day 125/365 of GPU Programming One thing I'm still struggling to understand is why softmax? What is it about the softmax function that made it survive/thrive for this long? What is it about exp() compared to another positive, monotonic, differentiable function that is so sticky? So studying softmax functions in a bit more depth today, taking a look at optimizations via SFUs on Nvidia GPUs and listening to the GOATs (Andrew Ng, Hinton, etc) explain the reasoning behind softmax as a primary choice. If anyone has good resources that dive into softmax and softmax alternatives, please share!

English

710

61K

Dersu@tak3sh8·2d

@ylecun @MuzafferKal_ @eladgil CNNs in Tokyo

English

251

Yann LeCun@ylecun·2d

@eladgil BS. Attention was born in Montréal PyTorch in NYC. AlphaGo in London AlphaFold in London ESMFold in NYC Llama 1 in Paris. Llama 2 in Paris+NYC+SV DeepSeek in Hangzhou Plus: DINO in Paris JEPA in Montréal+Paris+NYC SV is 3 mos ahead on topics SV is singularly obsessed with.

English

180

490

7.7K

715.3K

Elad Gil@eladgil·3d

People at major AI labs (using internal models) 3-4 months ahead of startup silicon valley engineers SV founders/eng 3-6 months ahead of NY NY founders/eng 6-12 months ahead of rest of world Most people have no idea how fast AI shifting as 1-2 years behind SOTA "The future is here, just not equally distributed" - Robert Heinlein

English

352

471

5.3K

3.9M

Dersu retweetledi

Bryan Cheong@bryancsk·3d

Incredible things are happening in the Parliament of Singapore rn

English

105

1.1K

224.9K

Dersu@tak3sh8·3d

Water

Patrick Collison@patrickc

Which are the most common everyday phenomena that we don't properly understand? Off the top of my head: • Lightning (how does it happen?) • Sleep; dreams (why do they exist?) • Glass (thermodynamics of formation) • Turbulence (when does it start?) • Morphogenesis (how does a creature know what should go where?) • Rain (it seems to start faster than models would predict) • Ice (dynamics of slipperiness) • Static electricity (which material will donate electrons?) • General anaesthetic. (And the mechanism of a lot of drugs, e.g. paracetamol.)

English

295

Dersu@tak3sh8·3d

saw a mathematician at a cafe no GPT5.5 no Lean no AlphaProof no multi-agent setup just a pen and paper, pushing epsilon-delta around like a psychopath

English

369

Dersu@tak3sh8·3d

@kfountou I haven't seen anybody remotely saying that mathematics has been solved??

English

708

Kimon Fountoulakis@kfountou·3d

I don't know why people believe that mathematics has been solved. It's not even close. I have been using big models frequently since July 2025, starting with Gemini 2.5 Pro (arxiv.org/abs/2510.04115), I use them daily since then. GPT has basically been running non-stop for me since 5.2. If you try to push these models beyond the boundary of the current literature, they fail. They don't tell you they fail, they introduce conditional assumptions which make the results weak. However, they make it very clear what is boundary knowledge and what is not, which was previously extremely hard to detect. Use them this way.

English

192

30.2K

Dersu retweetledi

Timothy Gowers @wtgowers@wtgowers·4d

@johnfrduncan I agree with this, but it’s not obvious how the social and economic structures to support that activity will be able to develop.

English

179

20.8K

Dersu@tak3sh8·3d

@prz_chojecki @wittgensteinsBB That's also my view (after having recently discussed with a few worried tenured prof friends!)

English

Przemek Chojecki | PC@prz_chojecki·3d

@tak3sh8 @wittgensteinsBB Currently tenured professors are (probably) safe. I would not extend that into the future, as the entire idea of tenure might be gone. There are definitely changes coming to academia, and that's good. Both from learning and research perspective they were due for a long time.

English

Przemek Chojecki | PC@prz_chojecki·4d

Mathematical research is bound to change. The question is not if but how. Tenured professors will be fine, but what about PhD students and postdocs? They will likely be the most affected by the capabilities of AI. My bet is that academic community will start valuing exposition more than being first to prove something.

Timothy Gowers @wtgowers@wtgowers

Of course, this raises all sorts of questions about what is going to happen to mathematical research, with the impact on PhD students being particularly urgent. I give a few thoughts on this in the blog post, but I don't have anything like complete answers.

English

125

14.2K

Dersu@tak3sh8·3d

@prz_chojecki @wittgensteinsBB Also, formal bankruptcy is quite rare, but structural financial failure inside universities is not! Departments get cut, merged, or phased out all the time..

English

Dersu@tak3sh8·3d

@prz_chojecki @wittgensteinsBB Academia was built around scarcity of knowledge, expertise, access & research infrastructure + frontier research is increasingly done in industrial labs. And then there is the model of charging students a fortune for large lecture-based knowledge transmission. Tenure is safe?

English

134

Dersu retweetledi

Richard Sutton@RichardSSutton·4d

If you are interested, you can learn a bit more about me from this video portrait from the Heidelberg Laureates Forum: youtu.be/jRPR6lx-iuw?si…

YouTube

English

126

12.5K

Dersu@tak3sh8·4d

Tim Gowers: "my guess is that [...] what it means to undertake research in mathematics will have changed out of all recognition"

Timothy Gowers @wtgowers@wtgowers

I've recently got in on the act of getting AI to solve open problems in mathematics. More precisely, I gave some questions asked by Melvyn Nathanson to ChatGPT 5.5 Pro, to which I have been given access, and it answered them. 🧵

English

764

Dersu@tak3sh8·4d

@wtgowers @stevenstrogatz It's interesting that you used the word "comfort" several times to describe the feeling many mathematicians seemed to have when LLMs were still less capable than they were..

English

3.4K

Dersu retweetledi

Timothy Gowers @wtgowers@wtgowers·4d

English

367

1.9K

609.2K

Dersu@tak3sh8·4d

@OfirPress @Kasparov63 He was tweeting about NFTs just a few few years ago btw

English

106

Ofir Press@OfirPress·5d

Not sure how we pulled off this marketing chess move, but @Kasparov63 retweeted ProgramBench.

English

8.7K

Dersu@tak3sh8·4d

@systematicls "comically good reputation among those outside the industry" not what my colleagues in academia told me 😅

English

407

sysls@systematicls·4d

De Prado has a comically good reputation among those outside the industry and a comically bad one among those inside the industry. It makes me wonder who I respect has that kind of reputation in industries I'm not a part of.

0xDipper@Dipper_pol

Professor Marcos López de Prado at Cornell - the man Shannon and Thorp's framework eventually became He personally managed $13 billion at Guggenheim Partners with an audited risk-adjusted return of 2.3 - institutional Sharpe-equivalent that less than 1% of fund managers ever hit Then became the first head of machine learning at AQR Capital, a $226B fund. Then went to Cornell to teach this Shannon used information theory to beat Buffett. Thorp used it to beat Vegas. López de Prado uses it to manage billions for institutions today The article above is that exact same lineage applied to Polymarket - KL-divergence, max-entropy, entropy collapse, three tools you can use today 1 hour from one of three people on Earth qualified to teach this ↓

English

184

40.4K

Dersu@tak3sh8·5d

Has anyone ever gotten real value from those cute-looking #Obsidian knowledge graphs