kirk
@68kirk
1K posts

"Nobody dies a virgin... Life f*** us all!"

Joined May 2014
439 Following · 74 Followers
Grigory Sapunov
Grigory Sapunov@che_shr_cat·
1/ We know Transformers fail at length extrapolation. But new research shows a deeper flaw: they fail at IN-DISTRIBUTION state tracking. They don't learn algorithmic rules; they just memorize isolated circuits per length. 🧵
Grigory Sapunov tweet media
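To make the "per-length circuits" claim concrete, here is a minimal sketch of an in-distribution state-tracking probe. The modular-sum task and all names are my own illustration, not necessarily the paper's setup:

```python
# Minimal sketch of an in-distribution state-tracking probe (illustrative task,
# not necessarily the paper's): track the running state of a tiny automaton
# (a sum mod k), train and test on the SAME length range, then report accuracy
# per length; a per-length collapse hints at memorized circuits, not a rule.
import random

def make_example(length, num_states=5):
    """Each token adds to the state modulo num_states; predict the final state."""
    tokens = [random.randrange(num_states) for _ in range(length)]
    state = 0
    for t in tokens:
        state = (state + t) % num_states
    return tokens, state

train_lengths = list(range(2, 33))  # every length here is seen during training
dataset = [make_example(random.choice(train_lengths)) for _ in range(10_000)]
# Evaluate your model bucketed by len(tokens) over these same lengths.
```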
kirk
kirk@68kirk·
@JFPuget Welcome to the era of "we'll delete everything and you'll be happy"
JFPuget 🇺🇦🇨🇦🇬🇱
You can't make this up. And this happened to someone working on safety and alignment at META Superintelligence Lab. I don't get how people can let AI agents run unsupervised like this. I'll stick to my supervised use of AI for now.
JFPuget 🇺🇦🇨🇦🇬🇱 tweet media
kirk
kirk@68kirk·
@DimitrisPapail Nice write-up! If I'm not mistaken, I would say the pair-encoding trick for tokens closely resembles the RLE trick, where you compress repeating values/letters into a condensed representation. It has been used a lot in vision, and I bet there are plenty of codebases on the web using it.
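For what it's worth, here is a side-by-side sketch of the two ideas (my own illustration): run-length encoding collapses repeats, while a BPE-style step merges the most frequent adjacent pair.

```python
# Run-length encoding vs. one byte-pair-style merge step: both replace
# recurring spans with a shorter symbol, which is the analogy in the reply.
from collections import Counter

def rle(s: str) -> list[tuple[str, int]]:
    """'aaabcc' -> [('a', 3), ('b', 1), ('c', 2)]"""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def merge_most_frequent_pair(tokens: list[str]) -> list[str]:
    """One BPE-style step: fuse the most common adjacent pair into one token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(rle("aaabcc"))                           # [('a', 3), ('b', 1), ('c', 2)]
print(merge_most_frequent_pair(list("abab")))  # ['ab', 'ab']
```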
kirk
kirk@68kirk·
@ZimingLiu11 @naturecomputes @SuryaGanguli @AToliasLab Nice thread! Any thoughts on how continuous but non-autoregressive models like FNOs and PINNs compare to transformer-based ones? Based on your findings, shouldn't they avoid both issues present in transformers?
Ziming Liu
Ziming Liu@ZimingLiu11·
@naturecomputes @SuryaGanguli @AToliasLab 15/N To close this thread, let me remind you of the "no free lunch theorem", and that everything boils down to inductive biases, which is the central theme of this paper.
Ziming Liu tweet media
Ziming Liu
Ziming Liu@ZimingLiu11·
🚨Transformers don't learn Newton's laws? They learn Kepler's laws! Like us, transformers don't predict a flying ball via a differential equation, but by fitting a curve. Moreover, reducing context length steers a transformer from Keplerian to Newtonian. Compression in play.
Ziming Liu tweet media
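The Kepler-vs-Newton distinction is easy to see in code; a toy contrast (my sketch, not the paper's code) for a thrown ball:

```python
# A "Keplerian" predictor fits a curve to the observed arc and extrapolates it;
# a "Newtonian" predictor estimates the state and integrates y'' = -g.
import numpy as np

g = 9.81
t_obs = np.linspace(0.0, 1.0, 20)            # first second of a thrown ball
y_obs = 10.0 * t_obs - 0.5 * g * t_obs**2    # launched upward at 10 m/s

# Curve-fitting route: fit a quadratic to the arc, extrapolate to t = 1.5 s.
coeffs = np.polyfit(t_obs, y_obs, deg=2)
y_kepler = np.polyval(coeffs, 1.5)

# Newtonian route: estimate (y, v) at t = 1 s, then step the dynamics forward.
y = y_obs[-1]
v = (y_obs[-1] - y_obs[-2]) / (t_obs[-1] - t_obs[-2])  # finite-difference velocity
dt, steps = 1e-3, 500                        # integrate t = 1.0 s -> 1.5 s
for _ in range(steps):
    y += v * dt
    v -= g * dt

print(y_kepler, y)  # both near the exact 3.96 m; Euler carries a small v error
```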
kirk
kirk@68kirk·
@giffmana @crude2refined Indeed, the optimizers we use are quite sensitive; as such, the learning rate can help you get unstuck from a local minimum, while all the other hyperparameters are there to smooth the optimizer's trajectory.
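A toy illustration of that point (mine, not from the thread): gradient descent on a tilted double well, where a briefly larger learning rate jumps the barrier and a decay lets the iterate settle.

```python
# f(x) = (x^2 - 1)^2 + 0.3x: a shallow local minimum near x ~ +0.96 and a
# deeper global minimum near x ~ -1.04. A small constant learning rate gets
# stuck; a short burst of a large learning rate escapes, then decays.

def grad(x):
    return 4 * x * (x * x - 1) + 0.3

def descend(lr_schedule, x=1.4, steps=200):
    for t in range(steps):
        x -= lr_schedule(t) * grad(x)
    return x

print(descend(lambda t: 0.01))                     # stuck near x ~ +0.96
print(descend(lambda t: 0.25 if t < 5 else 0.01))  # settles near x ~ -1.04
```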
kirk
kirk@68kirk·
@JFPuget @jm_alexia 🤔 Isn't that already happening? The majority of academics use Overleaf, and given that Overleaf has integrated third-party AI models, those could potentially be sending data to the underlying companies, no?
JFPuget 🇺🇦🇨🇦🇬🇱
@jm_alexia That was my reaction too. I was amused to see academics happy to share their unpublished research with OpenAI. I didn't post it myself, as writing LaTeX papers is a low priority for me; I do it once every other year or so now.
kirk
kirk@68kirk·
@branerico @adrian1977 @burkov I think it has already lost its value if you consider that, on average, a PhD holder makes what a 20-year-old with vocational training earns working as an electrician in data centers.
aneric broni
aneric broni@branerico·
There's competition for academic jobs not because of the quality of the PhDs graduating; it's because more PhDs than needed graduate every year while the number of jobs hasn't increased and research funds have been limited. Everyone is doing a PhD; it will lose its value soon!
BURKOV
BURKOV@burkov·
It's the easiest time ever to do a master's or a PhD:
1. You pick a bunch of recent papers from scientists in your research domain, submit them to a chatbot, and ask it to analyze the "future work" sections and propose experiments to try.
2. You ask the chatbot to write code for these experiments.
3. You run this code and get some incremental results not published anywhere else.
4. You ask the chatbot to write a paper about these incremental results.
5. You submit this paper to a third-tier conference in your domain where the acceptance rate is above 0.5.
Three papers like that and you are ready to write a thesis. I guess you already know how to get the thesis written fast.
kirk
kirk@68kirk·
History has a funny way of repeating itself...
kirk tweet media
kirk
kirk@68kirk·
@connordavis_ai Didn't LLMs already know how to answer what-if questions, even if at a superficial level? What was the baseline here?
Connor Davis
Connor Davis@connordavis_ai·
Holy shit… this paper might be the most important shift in how we use LLMs this entire year. "Large Causal Models from Large Language Models." It shows you can grow full causal models directly out of an LLM: not approximations, not vibes, but actual causal graphs, counterfactuals, interventions, and constraint-checked structures.

And the way they do it is wild. Instead of training a specialized causal model, they interrogate the LLM like a scientist:
→ extract a candidate causal graph from text
→ ask the model to check conditional independencies
→ detect contradictions
→ revise the structure
→ test counterfactuals and interventional predictions
→ iterate until the causal model stabilizes

The result is something we've never had before: a causal system built inside the LLM using its own latent world knowledge. Across benchmarks (synthetic, real-world, messy domains), these LCMs beat classical causal discovery methods because they pull from the LLM's massive prior knowledge instead of just local correlations.

And the counterfactual reasoning? Shockingly strong. The model can answer "what if" questions that standard algorithms completely fail on, simply because it already "knows" things about the world that those algorithms can't infer from data alone.

This paper hints at a future where LLMs aren't just pattern machines. They become causal engines: systems that form, test, and refine structural explanations of reality. If this scales, every field that relies on causal inference (economics, medicine, policy, science) is about to get rewritten. LLMs won't just tell you what happens. They'll tell you why.
Connor Davis tweet media
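The interrogation loop reads roughly like the sketch below. The `llm` object and every method on it are hypothetical placeholders of mine, not the paper's API:

```python
# Hedged sketch of the iterate-until-stable loop described in the tweet.

def build_causal_model(llm, domain_text, max_iters=10):
    graph = llm.extract_candidate_graph(domain_text)   # candidate DAG from text
    for _ in range(max_iters):
        violations = llm.check_conditional_independencies(graph)
        contradictions = llm.detect_contradictions(graph)
        if not violations and not contradictions:
            break                                      # structure has stabilized
        graph = llm.revise_structure(graph, violations, contradictions)
    report = llm.test_counterfactuals(graph)           # probe what-ifs/interventions
    return graph, report
```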
kirk
kirk@68kirk·
@elonmusk The default should be "reject all" by definition. There, problem solved.
kirk
kirk@68kirk·
@elonmusk The problem is purely financial/economic. People were given the promise that if you work and study hard you'll be better off than your parents; that's a lie, of course, not in this economy and with this inflation. Also, the 1% is hoarding all the money. Why don't you give some of it away to boost families?
Alexander Doria
Alexander Doria@Dorialexander·
Based on current vibes, not sure if NeurIPS 2025 is a conference or a job fair.
Chubby♨️
Chubby♨️@kimmonismus·
Yann LeCun’s 1989 convolutional neural network demo, the foundation for the CNNs we still use today. It's amazing how far we've come since then!
kirk
kirk@68kirk·
@francoisfleuret You need a model to decode the pattern in order to appreciate the hidden beauty...
kirk
kirk@68kirk·
@SchmidhuberAI @kimmonismus It had voice output in '86; impressive if you consider that today's apps like @WhatsApp fail to transcribe even a basic voice message of a couple of minutes.
Jürgen Schmidhuber
Jürgen Schmidhuber@SchmidhuberAI·
@kimmonismus Fukushima's 1986 video shows a CNN that recognises handwritten digits, three years before LeCun's video: x.com/SchmidhuberAI/…
Jürgen Schmidhuber@SchmidhuberAI

Fukushima's video (1986) shows a CNN that recognises handwritten digits [3], three years before LeCun's video (1989). CNN timeline taken from [5]:
★ 1969: Kunihiko Fukushima published rectified linear units or ReLUs [1], which are now extensively used in CNNs.
★ 1979: Fukushima published the basic CNN architecture with convolution layers and downsampling layers [2]. He called it the neocognitron. It was trained by unsupervised learning rules. Compute was 100 times more expensive than in 1989, and a billion times more expensive than today.
★ 1986: Fukushima's video on recognising hand-written digits [3].
★ 1988: Wei Zhang et al. had the first "modern" 2-dimensional CNN trained by backpropagation, and also applied it to character recognition [4]. Compute was about 10 million times more expensive than today.
★ 1989-: later work by others [5].
REFERENCES (more in [5])
[1] K. Fukushima (1969). Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics, 5(4):322-333. This work introduced rectified linear units or ReLUs, now widely used in CNNs and other neural nets.
[2] K. Fukushima (1979). Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron. Trans. IECE, vol. J62-A, no. 10, pp. 658-665. The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. In Japanese. English version: 1980.
[3] Movie produced by K. Fukushima, S. Miyake and T. Ito (NHK Science and Technical Research Laboratories), 1986. YouTube: youtube.com/watch?v=oVYCjL…
[4] W. Zhang, J. Tanida, K. Itoh, Y. Ichioka. Shift-invariant pattern recognition neural network and its optical architecture. Proc. Annual Conference of the Japan Society of Applied Physics, 1988. First "modern" backpropagation-trained 2-dimensional CNN, applied to character recognition.
[5] J. Schmidhuber (AI Blog, 2025). Who invented convolutional neural networks? x.com/SchmidhuberAI/…

kirk
kirk@68kirk·
@burkov Haven't read the paper yet, but from your description it sounds like there are some resemblances to diffusion approaches.
BURKOV
BURKOV@burkov·
This paper really is groundbreaking. It solves a long-standing embarrassment in machine learning: despite all the hype around deep learning, traditional tree-based methods (XGBoost, CatBoost, random forests, etc.) have dominated tabular data, the most common data format in real-world applications, for two decades. Deep learning conquered images, text, and games, but spreadsheets remained stubbornly resistant. This paper's main contribution (it was published in Nature, by the way) is a foundation model that finally beats tree-based methods convincingly on small-to-medium datasets, and does so very fast. TabPFN in 2.8 seconds outperforms CatBoost tuned for 4 hours, a 5,000× speedup. That's not incremental; it's a different regime entirely.

The training approach is also fundamentally different. GPT trains on internet text; CLIP trains on image-caption pairs. TabPFN trains on entirely synthetic data: over 100 million artificial datasets generated from causal graphs. TabPFN generates training data by randomly constructing directed acyclic graphs where each edge applies a random transformation (using neural networks, decision trees, discretization, or noise), then pushes random noise through the root nodes and lets it propagate through the graph. The intermediate values at various nodes become features, one becomes the target, and post-processing adds realistic messiness like missing values and outliers. By training on millions of these synthetic datasets with very different structures, the model learns general prediction strategies without ever seeing real data.

The inference mechanism is also unusual. Rather than finetuning or prompting, TabPFN performs both "training" and prediction in a single forward pass. You feed it your labeled training data and unlabeled test points together, and it outputs predictions immediately. There's no gradient descent at inference time; the model has learned how to learn from examples during pretraining. The architecture respects tabular structure with two-way attention (across features within a row, then across samples within a column), unlike standard transformers that treat everything as a flat sequence. So, the transformer has basically learned to do supervised learning.

Talk to the paper on ChapterPal: chapterpal.com/s/a1899430/acc…
Download the PDF: nature.com/articles/s4158…
BURKOV tweet media
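The synthetic-data mechanism is easier to see in code. A heavily simplified sketch of the idea (mine, omitting the trees/discretization/missingness machinery the post mentions):

```python
# Sample a random DAG, push noise through it with random transformations,
# then read off some nodes as features and one node as the target.
import numpy as np

rng = np.random.default_rng(0)

def sample_synthetic_dataset(n_nodes=8, n_rows=256):
    values = {}
    for node in range(n_nodes):                    # nodes in topological order
        parents = [p for p in range(node) if rng.random() < 0.4]
        if not parents:
            values[node] = rng.normal(size=n_rows)  # root node: pure noise
        else:
            combo = sum(rng.normal() * values[p] for p in parents)
            values[node] = np.tanh(combo) + 0.1 * rng.normal(size=n_rows)
    target = int(rng.integers(1, n_nodes))          # one node becomes the label
    features = [n for n in range(n_nodes) if n != target]
    X = np.stack([values[n] for n in features], axis=1)
    y = (values[target] > np.median(values[target])).astype(int)
    return X, y                                     # one of millions of such tasks
```

In the real model, millions of tasks like this pretrain a transformer that then does fit-and-predict on your actual table in one forward pass.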
Michael Luo
Michael Luo@michaelzluo·
Transformers without positional embeddings are functionally the same as dLLMs and have better scaling laws than transformers with positional embeddings. It is also much easier and more stable to train a transformer than a dLLM, thanks to the large body of prior work. I would position dLLMs as a "cost arbitrage" over LLMs, i.e. faster generation with much higher token throughput.
Cunxiao Du@ducx_du

Diffusion LLMs (DLLMs) can do "any-order" generation, which is, in principle, more flexible than a left-to-right (L2R) LLM. Our main finding is uncomfortable: ➡️ In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This thread is about why "any order" turns into a curse. (Work with Xinyu Yang @Xinyu2ML, Min Lin @mavenlin, Chao Du @duchao0726 and the team.) Blog Link: notion.so/Understanding-…
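For intuition on what "any-order" demands (my framing, not the thread's code): every permutation factorizes the same joint, so an any-order model must get all n! sets of conditionals right, not just the left-to-right one.

```python
# log p(x) decomposed along an arbitrary order: a consistent model yields the
# same joint under every permutation, but learning every conditional
# p(x_i | arbitrary revealed subset) is far harder than the L2R subset alone.
import itertools
import math

def joint_logprob(tokens, cond_logprob, order):
    """Sum log p(x_{o_i} | tokens revealed earlier in `order`)."""
    total, revealed = 0.0, {}
    for pos in order:
        total += cond_logprob(pos, tokens[pos], dict(revealed))
        revealed[pos] = tokens[pos]
    return total

def toy_conditional(pos, token, revealed, vocab=50_000):
    return -math.log(vocab)   # placeholder for a real model's conditional

tokens = ("the", "cat", "sat")
for order in itertools.permutations(range(len(tokens))):
    print(order, joint_logprob(tokens, toy_conditional, order))  # all equal here
```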
