kirk
@68kirk
1K posts

"Nobody dies a virgin... Life f*** us all!"

Joined May 2014
439 Following · 74 Followers
Grigory Sapunov
Grigory Sapunov@che_shr_cat·
1/ We know Transformers fail at length extrapolation. But new research shows a deeper flaw: they fail at IN-DISTRIBUTION state tracking. They don't learn algorithmic rules; they just memorize isolated circuits per length. 🧵
Grigory Sapunov tweet media
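To make the "per-length circuits" claim concrete, here is a minimal sketch of an in-distribution state-tracking probe. The modular-sum task and all names are my own illustration, not necessarily the paper's setup:

```python
# Minimal sketch of an in-distribution state-tracking probe (illustrative task,
# not necessarily the paper's): track the running state of a tiny automaton
# (a sum mod k), train and test on the SAME length range, then report accuracy
# per length; a per-length collapse hints at memorized circuits, not a rule.
import random

def make_example(length, num_states=5):
    """Each token adds to the state modulo num_states; predict the final state."""
    tokens = [random.randrange(num_states) for _ in range(length)]
    state = 0
    for t in tokens:
        state = (state + t) % num_states
    return tokens, state

train_lengths = list(range(2, 33))  # every length here is seen during training
dataset = [make_example(random.choice(train_lengths)) for _ in range(10_000)]
# Evaluate your model bucketed by len(tokens) over these same lengths.
```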
kirk
kirk@68kirk·
@JFPuget Welcome to the era of "we'll delete everything and you'll be happy"
JFPuget 🇺🇦🇨🇦🇬🇱
You can't make this up. And this happened to someone working on safety and alignment at META Superintelligence Lab. I don't get how people can let AI agents run unsupervised like this. I'll stick to my supervised use of AI for now.
JFPuget 🇺🇦🇨🇦🇬🇱 tweet media
kirk
kirk@68kirk·
@DimitrisPapail Nice write-up! If I'm not mistaken, I would say the pair-encoding trick for tokens closely resembles the RLE trick, where you compress repeating values/letters into a condensed representation. It has been used a lot in vision, and I bet there are plenty of codebases on the web using it.
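For what it's worth, here is a side-by-side sketch of the two ideas (my own illustration): run-length encoding collapses repeats, while a BPE-style step merges the most frequent adjacent pair.

```python
# Run-length encoding vs. one byte-pair-style merge step: both replace
# recurring spans with a shorter symbol, which is the analogy in the reply.
from collections import Counter

def rle(s: str) -> list[tuple[str, int]]:
    """'aaabcc' -> [('a', 3), ('b', 1), ('c', 2)]"""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def merge_most_frequent_pair(tokens: list[str]) -> list[str]:
    """One BPE-style step: fuse the most common adjacent pair into one token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(rle("aaabcc"))                           # [('a', 3), ('b', 1), ('c', 2)]
print(merge_most_frequent_pair(list("abab")))  # ['ab', 'ab']
```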
kirk
kirk@68kirk·
@ZimingLiu11 @naturecomputes @SuryaGanguli @AToliasLab Nice thread! Any thoughts on how continuous but non-autoregressive models like FNOs and PINNs compare to transformer-based ones? Based on your findings, shouldn't they avoid both issues present in transformers?
Ziming Liu
Ziming Liu@ZimingLiu11·
@naturecomputes @SuryaGanguli @AToliasLab 15/N To close this thread, let me remind you of the "no free lunch theorem", and that everything boils down to inductive biases, which is the central theme of this paper.
Ziming Liu tweet media
Ziming Liu
Ziming Liu@ZimingLiu11·
🚨Transformers don't learn Newton's laws? They learn Kepler's laws! Like us, transformers don't predict a flying ball via a differential equation, but by fitting a curve. Moreover, reducing context length steers a transformer from Keplerian to Newtonian. Compression in play.
Ziming Liu tweet media
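The Kepler-vs-Newton distinction is easy to see in code; a toy contrast (my sketch, not the paper's code) for a thrown ball:

```python
# A "Keplerian" predictor fits a curve to the observed arc and extrapolates it;
# a "Newtonian" predictor estimates the state and integrates y'' = -g.
import numpy as np

g = 9.81
t_obs = np.linspace(0.0, 1.0, 20)            # first second of a thrown ball
y_obs = 10.0 * t_obs - 0.5 * g * t_obs**2    # launched upward at 10 m/s

# Curve-fitting route: fit a quadratic to the arc, extrapolate to t = 1.5 s.
coeffs = np.polyfit(t_obs, y_obs, deg=2)
y_kepler = np.polyval(coeffs, 1.5)

# Newtonian route: estimate (y, v) at t = 1 s, then step the dynamics forward.
y = y_obs[-1]
v = (y_obs[-1] - y_obs[-2]) / (t_obs[-1] - t_obs[-2])  # finite-difference velocity
dt, steps = 1e-3, 500                        # integrate t = 1.0 s -> 1.5 s
for _ in range(steps):
    y += v * dt
    v -= g * dt

print(y_kepler, y)  # both near the exact 3.96 m; Euler carries a small v error
```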
kirk
kirk@68kirk·
@giffmana @crude2refined Indeed, the optimizers we use are quite sensitive; as such, the learning rate can help you get unstuck from a local minimum, while all the other hyperparameters are there to smooth the optimizer's trajectory.
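A toy illustration of that point (mine, not from the thread): gradient descent on a tilted double well, where a briefly larger learning rate jumps the barrier and a decay lets the iterate settle.

```python
# f(x) = (x^2 - 1)^2 + 0.3x: a shallow local minimum near x ~ +0.96 and a
# deeper global minimum near x ~ -1.04. A small constant learning rate gets
# stuck; a short burst of a large learning rate escapes, then decays.

def grad(x):
    return 4 * x * (x * x - 1) + 0.3

def descend(lr_schedule, x=1.4, steps=200):
    for t in range(steps):
        x -= lr_schedule(t) * grad(x)
    return x

print(descend(lambda t: 0.01))                     # stuck near x ~ +0.96
print(descend(lambda t: 0.25 if t < 5 else 0.01))  # settles near x ~ -1.04
```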
kirk
kirk@68kirk·
@JFPuget @jm_alexia 🤔 Isn't that already happening? The majority of academics use Overleaf, and given that Overleaf has integrated third-party AI models, those could potentially be sending data to the underlying companies, no?
JFPuget 🇺🇦🇨🇦🇬🇱
@jm_alexia That was my reaction too. I was amused to see academics happy to share their unpublished research with OpenAI. I didn't post it myself, as writing LaTeX papers is a low priority for me; I do it once every other year or so now.
kirk
kirk@68kirk·
@branerico @adrian1977 @burkov I think it has already lost its value if you consider that, on average, a PhD holder makes what a 20-year-old with vocational training earns working as an electrician in data centers.
aneric broni
aneric broni@branerico·
There's competition for academic jobs not because of the quality of the PhDs graduating; it's because more PhDs than needed graduate every year while the number of jobs hasn't increased and research funds have been limited. Everyone is doing a PhD; it will lose its value soon!
BURKOV
BURKOV@burkov·
It's the easiest time ever to do a master's or a PhD:
1. You pick a bunch of recent papers from scientists in your research domain, submit them to a chatbot, and ask it to analyze the "future work" sections and propose experiments to try.
2. You ask the chatbot to write code for these experiments.
3. You run this code and get some incremental results not published anywhere else.
4. You ask the chatbot to write a paper about these incremental results.
5. You submit this paper to a third-tier conference in your domain where the acceptance rate is above 0.5.
Three papers like that and you are ready to write a thesis. I guess you already know how to get the thesis written fast.
kirk
kirk@68kirk·
History has a funny way of repeating itself...
kirk tweet media
kirk
kirk@68kirk·
@connordavis_ai Didn't LLMs already know how to answer what-if questions, even if at a superficial level? What was the baseline here?
Connor Davis
Connor Davis@connordavis_ai·
Holy shit… this paper might be the most important shift in how we use LLMs this entire year. "Large Causal Models from Large Language Models." It shows you can grow full causal models directly out of an LLM: not approximations, not vibes, but actual causal graphs, counterfactuals, interventions, and constraint-checked structures.

And the way they do it is wild. Instead of training a specialized causal model, they interrogate the LLM like a scientist:
→ extract a candidate causal graph from text
→ ask the model to check conditional independencies
→ detect contradictions
→ revise the structure
→ test counterfactuals and interventional predictions
→ iterate until the causal model stabilizes

The result is something we've never had before: a causal system built inside the LLM using its own latent world knowledge. Across benchmarks (synthetic, real-world, messy domains), these LCMs beat classical causal discovery methods because they pull from the LLM's massive prior knowledge instead of just local correlations.

And the counterfactual reasoning? Shockingly strong. The model can answer "what if" questions that standard algorithms completely fail on, simply because it already "knows" things about the world that those algorithms can't infer from data alone.

This paper hints at a future where LLMs aren't just pattern machines. They become causal engines: systems that form, test, and refine structural explanations of reality. If this scales, every field that relies on causal inference (economics, medicine, policy, science) is about to get rewritten. LLMs won't just tell you what happens. They'll tell you why.
Connor Davis tweet media
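The interrogation loop reads roughly like the sketch below. The `llm` object and every method on it are hypothetical placeholders of mine, not the paper's API:

```python
# Hedged sketch of the iterate-until-stable loop described in the tweet.

def build_causal_model(llm, domain_text, max_iters=10):
    graph = llm.extract_candidate_graph(domain_text)   # candidate DAG from text
    for _ in range(max_iters):
        violations = llm.check_conditional_independencies(graph)
        contradictions = llm.detect_contradictions(graph)
        if not violations and not contradictions:
            break                                      # structure has stabilized
        graph = llm.revise_structure(graph, violations, contradictions)
    report = llm.test_counterfactuals(graph)           # probe what-ifs/interventions
    return graph, report
```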
kirk
kirk@68kirk·
@elonmusk The default should be "reject all" by definition. There, problem solved.
kirk
kirk@68kirk·
@elonmusk The problem is purely financial/economic. People were given the promise that if you work and study hard you'll be better off than your parents; that's a lie, of course, not in this economy and with this inflation. Also, the 1% is hoarding all the money. Why don't you give some of it away to boost families?
Alexander Doria
Alexander Doria@Dorialexander·
Based on current vibes, not sure if NeurIPS 2025 is a conference or a job fair.
Chubby♨️
Chubby♨️@kimmonismus·
Yann LeCun’s 1989 convolutional neural network demo, the foundation for the CNNs we still use today. It's amazing how far we've come since then!
kirk
kirk@68kirk·
@francoisfleuret You need a model to decode the pattern in order to appreciate the hidden beauty...
kirk
kirk@68kirk·
@SchmidhuberAI @kimmonismus It had voice output in '86; impressive if you consider that today's apps like @WhatsApp fail to transcribe even a basic voice message of a couple of minutes.
Jürgen Schmidhuber
Jürgen Schmidhuber@SchmidhuberAI·
@kimmonismus Fukushima's 1986 video shows a CNN that recognises handwritten digits, three years before LeCun's video: x.com/SchmidhuberAI/…
Jürgen Schmidhuber@SchmidhuberAI

Fukushima's video (1986) shows a CNN that recognises handwritten digits [3], three years before LeCun's video (1989). CNN timeline taken from [5]:
★ 1969: Kunihiko Fukushima published rectified linear units or ReLUs [1], which are now extensively used in CNNs.
★ 1979: Fukushima published the basic CNN architecture with convolution layers and downsampling layers [2]. He called it the neocognitron. It was trained by unsupervised learning rules. Compute was 100 times more expensive than in 1989, and a billion times more expensive than today.
★ 1986: Fukushima's video on recognising hand-written digits [3].
★ 1988: Wei Zhang et al. had the first "modern" 2-dimensional CNN trained by backpropagation, and also applied it to character recognition [4]. Compute was about 10 million times more expensive than today.
★ 1989-: later work by others [5].
REFERENCES (more in [5])
[1] K. Fukushima (1969). Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics, 5(4):322-333. This work introduced rectified linear units or ReLUs, now widely used in CNNs and other neural nets.
[2] K. Fukushima (1979). Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron. Trans. IECE, vol. J62-A, no. 10, pp. 658-665. The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. In Japanese. English version: 1980.
[3] Movie produced by K. Fukushima, S. Miyake and T. Ito (NHK Science and Technical Research Laboratories), 1986. YouTube: youtube.com/watch?v=oVYCjL…
[4] W. Zhang, J. Tanida, K. Itoh, Y. Ichioka. Shift-invariant pattern recognition neural network and its optical architecture. Proc. Annual Conference of the Japan Society of Applied Physics, 1988. First "modern" backpropagation-trained 2-dimensional CNN, applied to character recognition.
[5] J. Schmidhuber (AI Blog, 2025). Who invented convolutional neural networks? x.com/SchmidhuberAI/…

kirk
kirk@68kirk·
@burkov Haven't read the paper yet, but from your description it sounds like there are some resemblances to diffusion approaches.
BURKOV
BURKOV@burkov·
This paper really is groundbreaking. It solves a long-standing embarrassment in machine learning: despite all the hype around deep learning, traditional tree-based methods (XGBoost, CatBoost, random forests, etc.) have dominated tabular data, the most common data format in real-world applications, for two decades. Deep learning conquered images, text, and games, but spreadsheets remained stubbornly resistant. This paper's main contribution (it was published in Nature, by the way) is a foundation model that finally beats tree-based methods convincingly on small-to-medium datasets, and does so very fast. TabPFN in 2.8 seconds outperforms CatBoost tuned for 4 hours, a 5,000× speedup. That's not incremental; it's a different regime entirely.

The training approach is also fundamentally different. GPT trains on internet text; CLIP trains on image-caption pairs. TabPFN trains on entirely synthetic data: over 100 million artificial datasets generated from causal graphs. TabPFN generates training data by randomly constructing directed acyclic graphs where each edge applies a random transformation (using neural networks, decision trees, discretization, or noise), then pushes random noise through the root nodes and lets it propagate through the graph. The intermediate values at various nodes become features, one becomes the target, and post-processing adds realistic messiness like missing values and outliers. By training on millions of these synthetic datasets with very different structures, the model learns general prediction strategies without ever seeing real data.

The inference mechanism is also unusual. Rather than finetuning or prompting, TabPFN performs both "training" and prediction in a single forward pass. You feed it your labeled training data and unlabeled test points together, and it outputs predictions immediately. There's no gradient descent at inference time; the model has learned how to learn from examples during pretraining. The architecture respects tabular structure with two-way attention (across features within a row, then across samples within a column), unlike standard transformers that treat everything as a flat sequence. So, the transformer has basically learned to do supervised learning.

Talk to the paper on ChapterPal: chapterpal.com/s/a1899430/acc…
Download the PDF: nature.com/articles/s4158…
BURKOV tweet media
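The synthetic-data mechanism is easier to see in code. A heavily simplified sketch of the idea (mine, omitting the trees/discretization/missingness machinery the post mentions):

```python
# Sample a random DAG, push noise through it with random transformations,
# then read off some nodes as features and one node as the target.
import numpy as np

rng = np.random.default_rng(0)

def sample_synthetic_dataset(n_nodes=8, n_rows=256):
    values = {}
    for node in range(n_nodes):                    # nodes in topological order
        parents = [p for p in range(node) if rng.random() < 0.4]
        if not parents:
            values[node] = rng.normal(size=n_rows)  # root node: pure noise
        else:
            combo = sum(rng.normal() * values[p] for p in parents)
            values[node] = np.tanh(combo) + 0.1 * rng.normal(size=n_rows)
    target = int(rng.integers(1, n_nodes))          # one node becomes the label
    features = [n for n in range(n_nodes) if n != target]
    X = np.stack([values[n] for n in features], axis=1)
    y = (values[target] > np.median(values[target])).astype(int)
    return X, y                                     # one of millions of such tasks
```

In the real model, millions of tasks like this pretrain a transformer that then does fit-and-predict on your actual table in one forward pass.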
Michael Luo
Michael Luo@michaelzluo·
Transformers without positional embeddings are functionally the same as dLLMs and have better scaling laws than transformers with positional embeddings. It is also much easier and more stable to train a transformer than a dLLM, thanks to the large body of prior work. I would position dLLMs as a "cost arbitrage" over LLMs, i.e. faster generation with much higher token throughput.
Cunxiao Du@ducx_du

Diffusion LLMs (DLLMs) can do "any-order" generation, which is, in principle, more flexible than a left-to-right (L2R) LLM. Our main finding is uncomfortable: ➡️ In real language, this flexibility backfires: DLLMs become worse probabilistic models than the L2R / R2L AR LMs. This thread is about why "any order" turns into a curse. (Work with Xinyu Yang @Xinyu2ML, Min Lin @mavenlin, Chao Du @duchao0726 and the team.) Blog Link: notion.so/Understanding-…
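For intuition on what "any-order" demands (my framing, not the thread's code): every permutation factorizes the same joint, so an any-order model must get all n! sets of conditionals right, not just the left-to-right one.

```python
# log p(x) decomposed along an arbitrary order: a consistent model yields the
# same joint under every permutation, but learning every conditional
# p(x_i | arbitrary revealed subset) is far harder than the L2R subset alone.
import itertools
import math

def joint_logprob(tokens, cond_logprob, order):
    """Sum log p(x_{o_i} | tokens revealed earlier in `order`)."""
    total, revealed = 0.0, {}
    for pos in order:
        total += cond_logprob(pos, tokens[pos], dict(revealed))
        revealed[pos] = tokens[pos]
    return total

def toy_conditional(pos, token, revealed, vocab=50_000):
    return -math.log(vocab)   # placeholder for a real model's conditional

tokens = ("the", "cat", "sat")
for order in itertools.permutations(range(len(tokens))):
    print(order, joint_logprob(tokens, toy_conditional, order))  # all equal here
```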
