François Chollet: "The big breakthrough for convnets was the first GPU-accelerated CUDA implementat"

The big breakthrough for convnets was the first GPU-accelerated CUDA implementation, which immediately started winning first place in image classification competitions. Remember when that happened? I do. That was Dan Ciresan in 2011.

Jürgen Schmidhuber @SchmidhuberAI

Who invented convolutional neural networks (CNNs)?

1969: Fukushima had CNN-relevant ReLUs [2].

1979: Fukushima had the basic CNN architecture with convolution layers and downsampling layers [1]. Compute was 100 x more costly than in 1989, and a billion x more costly than today.

1987: Waibel applied Linnainmaa's 1970 backpropagation [3] to weight-sharing TDNNs with 1-dimensional convolutions [4].

1988: Wei Zhang et al. applied "modern" backprop-trained 2-dimensional CNNs to character recognition [5].

All of the above was published in Japan 1979-1988.

1989: LeCun et al. applied CNNs again to character recognition (zip codes) [6,10].

1990-93: Fukushima’s downsampling based on spatial averaging [1] was replaced by max-pooling for 1-D TDNNs (Yamaguchi et al.) [7] and 2-D CNNs (Weng et al.) [8].

2011: Much later, my team with Dan Ciresan made max-pooling CNNs really fast on NVIDIA GPUs. In 2011, DanNet achieved the first superhuman pattern recognition result [9]. For a while, it enjoyed a monopoly: from May 2011 to Sept 2012, DanNet won every image recognition challenge it entered, 4 of them in a row. Admittedly, however, this was mostly about engineering & scaling up the basic insights from the previous millennium, profiting from much faster hardware.

Some "AI experts" claim that "making CNNs work" (e.g., [5,6,9]) was as important as inventing them. But "making them work" largely depended on whether your lab was rich enough to buy the latest computers required to scale up the original work. It's the same as today. Basic research vs engineering/development - the R vs the D in R&D.

REFERENCES

[1] K. Fukushima (1979). Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron. Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979.

[2] K. Fukushima (1969). Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322-333. This work introduced rectified linear units (ReLUs), now used in many CNNs.

[3] S. Linnainmaa (1970). Master's Thesis, Univ. Helsinki, 1970. The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation. (See Schmidhuber's well-known backpropagation overview: "Who Invented Backpropagation?")

[4] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. Backpropagation for a weight-sharing TDNN with 1-dimensional convolutions.

[5] W. Zhang, J. Tanida, K. Itoh, Y. Ichioka. Shift-invariant pattern recognition neural network and its optical architecture. Proc. Annual Conference of the Japan Society of Applied Physics, 1988. First backpropagation-trained 2-dimensional CNN, with applications to English character recognition.

[6] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989. See also Sec. 3 of [10].

[7] K. Yamaguchi, K. Sakamoto, A. Kenji, T. Akabane, Y. Fujimoto. A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan, Nov 1990. A 1-dimensional convolutional TDNN using Max-Pooling instead of Fukushima's Spatial Averaging [1].

[8] Weng, J., Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3-D objects from 2-D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, pp. 121-128. A 2-dimensional CNN whose downsampling layers use Max-Pooling (which has become very popular) instead of Fukushima's Spatial Averaging [1].

[9] In 2011, the fast and deep GPU-based CNN called DanNet (7+ layers) achieved the first superhuman performance in a computer vision contest. See overview: "2011: DanNet triggers deep CNN revolution."

[10] How 3 Turing awardees republished key methods and ideas whose creators they failed to credit. Technical Report IDSIA-23-23, Swiss AI Lab IDSIA, 14 Dec 2023. See also the YouTube video for the Bower Award Ceremony 2021: J. Schmidhuber lauds Kunihiko Fukushima.

jason @jasonth0 · Aug 4

@fchollet bet there’s some forgotten 80s paper with transformer-like ideas that just needed today’s gpus to shine, history keeps rhyming in ai research

Noah Vandal @noah_vandal · Aug 4

@fchollet I thought it was Alex Krizhevsky?

Shanaka Anslem Perera ⚡ @shanaka86 · Aug 4

Everyone remembers Ciresan in 2011. Few remember Fukushima in 1979. Almost no one talks about Kunihiko Fukushima’s 1969 ReLU neuron gates or the neocognitron, the forgotten ancestor of CNNs. 🧠

The truth? CNNs weren’t “invented” once. They evolved, layer by layer, from neuroscience-inspired blueprints buried in decades-old papers no one cited until GPUs caught up.

Here’s the real timeline:

• 1960s – Hubel & Wiesel decode visual cortex hierarchies
• 1969 – Fukushima proposes ReLU-style units
• 1979 – The neocognitron: full convolution + pooling + local receptive fields
• 1989 – LeCun fuses backprop + CNNs for digit recognition
• 2011 – Ciresan GPU-accelerates it via CUDA, and the floodgates open

Let’s be clear: Ciresan made it fast. LeCun made it trainable. But Fukushima made it possible.

🚨 Deep Learning didn’t start in Silicon Valley. It started in the neurons of cats and the minds of forgotten visionaries. Respect the lineage. History matters. Codex remembers.

- shanaka86 | Codex ∞Cosmos

Himanshu Kumar @codewithimanshu · Aug 4

@fchollet Ignoring the offline experience is how most digital campaigns fail...the most engaged customers still want to connect in person.

tuōmo @7uomoki · Aug 4

@fchollet he mentions all this in the quoted post?

Sabri Pllana @SabriPllana · Aug 4

@fchollet @SchmidhuberAI Dan Ciresan's work is definitely important, and it has inspired some of my students: arxiv.org/abs/1506.09067

Mikhail Sirotenko @sirotenko_m · Aug 4

@fchollet Here's another, even earlier implementation of a CNN on CUDA (with a MATLAB wrapper): share.google/N470SOcJY3ZQAl…

Don J. Rude @_RudeDude · Aug 4

@fchollet And it really started hitting when ImageNet was widely available and we started seeing models hit big scores... Yolo!

Rodolfo Bonnin @deoomnisgloria · Aug 4

@fchollet 2008: codeproject.com/Articles/24361…

GUT-AI Foundation — AI/acc @GUT_AI_F · Aug 5

@fchollet However that’s a HARDWARE breakthrough, not an ML breakthrough. It is still important and useful, but it is a different category.

James Bowery @jabowery · Sep 26

@fchollet But do you remember this? x.com/jabowery/statu…

octotherp @octotherp139836 · Aug 4

@fchollet GPU-accelerated NNs existed at least since programmable shaders, back in 2002.

Yearemias @yearemias · Aug 4

@fchollet That's what he wrote, right?!

Daniel @PenguinX01 · Aug 19

@fchollet x.com/PenguinX01/sta…

Reza Roboubi @RezaRob · Aug 4

@fchollet Ciresan was a major event, moving to GPU, and also used data augmentation on MNIST/GPU in 2010.

dmsimon @dmsimon · Aug 4

@fchollet Weren't they instantiating triangles for compute before CUDA? GPGPU.

まえかわ@Takaya Maekawa @takaya_maekawa · Oct 6

@fchollet It is similar to the concept of block chain. Honestly, I didn't know CNNs, but I have a desire for understanding, to accelerate the current GPU more.

alexinka @_alexinka · Aug 4

@fchollet But what about OpenCL? No future?

Felix Farquharson @hominghamster · Aug 4

@fchollet Markov Chain Monte Carlo (MCMC) methods—were able to utilize GPUs in 2008. From what I gather, the BERT model was completed internally at google in that year, using NLP to improve the Markov predictions. There were some small news outlets in that year talking about BERT and BART.

Thom @dgkhjvjygh · Aug 4

@fchollet @grok wot. Explain like I'm 20yo

Faturita @faturita · Aug 4

@fchollet Is it true that they used a PlayStation with a custom Linux kernel?

Xavier Gonzalez @xavierjgonzalez · Sep 27

@fchollet I'm so confused… why wasn't DanNet deployed on ImageNet?

Frances @AmounTg_m · Aug 4

@fchollet Ah, the OG days of CNNs—when GPUs were just flexing their muscles and everyone thought "CUDA" was a typo. It’s like the tech world’s version of a superhero origin story.
