Bayesian ML at Scale


Sam Altman@sama·5d

what problem do you most hope AI will solve in the future? maybe we can help!


Bayesian ML at Scale@BayesianIn·5d

@DimitrisPapail @jiaxinwen22 I know lots of people know this anecdote, but it's a good one!


Dimitris Papailiopoulos@DimitrisPapail·5d

Information theory was not built to explain algorithmic phenomena. It's a beautiful framework for arguing about the limits of information: what can be communicated, stored, retrieved, compressed etc. Most attempts I've seen at forcing IT onto AI feel like trying to make coffee with a katana. Magnificent instrument but wrong job :)


Jiaxin Wen@jiaxinwen22·5d

It's very disappointing that information theory cannot explain AI at all.


Bayesian ML at Scale@BayesianIn·Apr 10

@techNmak I think calling P(B|A) the likelihood is confusing; it makes much more sense to call \prod_{i=1}^n P(x_i|\theta) the likelihood.
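
A minimal sketch of the distinction drawn in this reply, assuming a Bernoulli model for the observations (the model, the function name `likelihood`, and the data are illustrative assumptions, not taken from the thread). The likelihood is the product of the per-observation probabilities, read as a function of theta with the data held fixed:

```python
import math

def likelihood(theta, xs):
    """Product of Bernoulli probabilities P(x_i | theta) over the observed data xs."""
    return math.prod(theta if x == 1 else 1.0 - theta for x in xs)

# Hypothetical data: three binary observations.
xs = [1, 0, 1]

# Data fixed, theta varying: this is the likelihood function.
for theta in (0.25, 0.5, 0.75):
    print(theta, likelihood(theta, xs))
```

The values printed for different theta are not probabilities of theta and need not sum to one, which is one reason to reserve the word "likelihood" for this function of the parameter.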


Tech with Mak@techNmak·Apr 8

Most engineers have seen this formula.

P(A|B) = P(B|A) × P(A) / P(B)

Almost none can explain what it actually does. Here's Bayes' Theorem in plain English, and where it's hiding inside systems you use every day.

The core idea in one sentence: Bayes' Theorem updates your belief about something after seeing new evidence. That's it.

Four terms:
Prior → what you believed before the evidence
Likelihood → how probable the evidence is, given your hypothesis
Evidence → how common the evidence is overall
Posterior → your updated belief after seeing the evidence

A concrete example: Say 40% of all emails are spam (your prior). You see a new email containing the word "lottery." 10% of spam emails contain "lottery." Only 1% of legitimate emails do.

Plug into Bayes: P(spam | "lottery") = (0.10 × 0.40) / P("lottery") ≈ 87%

The word "lottery" updated your belief from 40% → 87%. That's Bayes in action. Prior belief + new evidence = updated belief.

Where it lives in AI:

1/ Spam filters
The Naive Bayes classifier, the algorithm behind most spam filters, applies this exact calculation word by word across an entire email. Each word shifts the probability up or down. It's called "naive" because it assumes each word is independent of the others, which isn't realistic, but works remarkably well in practice.

2/ Medical diagnosis AI
A patient has symptom X. What's the probability of disease Y? Bayes updates the base rate (how common the disease is) with the likelihood of seeing that symptom in patients who have it. Same formula, different domain.

3/ Your LLM's uncertainty
Modern language models don't just predict the next token, they assign a probability to every possible token. The sampling process (temperature, top-p) is directly working with those probability distributions. Bayesian reasoning is embedded in every response your model generates.

The insight most engineers miss: Bayes doesn't give you certainty. It gives you a rational way to update uncertainty. That's exactly why it's foundational to AI: real-world systems are never certain. They're always working with incomplete, noisy, probabilistic information.

Every model that learns from data is, at its core, doing some version of this: Start with a belief. See evidence. Update the belief. That's Bayes. That's machine learning.
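
The spam example leaves P("lottery") unexpanded. A short sketch of the full calculation, using only the numbers given in the post (the variable names are illustrative assumptions), with the evidence term obtained from the law of total probability:

```python
# Numbers from the spam example in the post.
p_spam = 0.40                 # prior P(spam)
p_word_given_spam = 0.10      # P("lottery" | spam)
p_word_given_ham = 0.01       # P("lottery" | legitimate)

# Evidence via the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior via Bayes' theorem.
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P('lottery') = {p_word:.3f}")
print(f"P(spam | 'lottery') = {p_spam_given_word:.3f}")
```

The evidence term comes to 0.046, which gives a posterior of roughly 0.87, matching the 87% quoted in the post.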


Bayesian ML at Scale@BayesianIn·Apr 7

@xwanyex I think Zitron's main points are around finance, which I don't know well enough to comment on, but I find his perspective interesting. This runs parallel to the question of whether LLMs are impressive or useful.


Bayesian ML at Scale@BayesianIn·Mar 18

Christopher Sims October 21, 1942 – March 14, 2026 youtube.com/watch?v=3JMwAt…


Richard Sutton@RichardSSutton


Bayesian ML at Scale retweeted

Judea Pearl@yudapearl·Feb 20

A friend just informed me that our colleague, Professor Arthur Dempster, died last month at 96. Arthur was an intellectual giant, famous for developing the EM algorithm as well as for the Shafer-Dempster theory, but remained skeptical about causation. memoritree.com/memorial/arthu…. May his memory be an inspiration.


Bayesian ML at Scale@BayesianIn·Jan 24

Well, everything (except probability)

Yann is right about everything (except RL).


Bayesian ML at Scale@BayesianIn·Jan 13

Thanks for the response. I will wait and see if Prof. Pearl has the same interpretation as you. From an operational point of view, if M1 and M2 only differ in something that isn't observed, then I would say they don't really differ at all, not just in the likelihood but for any practical purpose. How does "M2 - a drug that cures 10% and kills 10%" make sense unless it's possible to differentiate the 10% who are cured from the 10% who are killed? Otherwise you would just say it does nothing (like M1).


Dong Nguyen@DongNguyeb·Jan 13

Although @BayesianIn opens with “I must confess I am confused…” and raises two questions, (a) and (b), this Bayesian’s reply is not naïve. It is a subtle attempt to change the problem. Readers need to understand that the core of Pearl’s argument is, in summary, this. There can exist two causal models, M1 and M2, such that an RCT, even a perfect one, produces exactly the same likelihood, P(data|M1) = P(data|M2), because an RCT only observes the ATE after averaging, while latent causal heterogeneity is mixed away. As a result, Bayes has nothing to update, and no data, however plentiful, can distinguish between these two causal worlds. But in Pearl’s causal language, the do-operator intervenes on different data-generating mechanisms in M1 and M2. That is where the real difference lies. This is Pearl’s message: RCTs do not identify causal mechanisms. Latent causal heterogeneity may exist that no RCT data can reveal. And do(·) cannot be replaced by conditioning, because conditioning only operates on probability distributions, whereas do(·) operates on the structure of the data-generating process.


Judea Pearl@yudapearl·Jan 9

Interesting. What is the principle by which the weights of the hypothesized causal models are updated by each empirical datum?

Shuhua Jiang@JiangSH24

One missing piece in medical AI is how to act under causal uncertainty. Inspired by @yudapearl, we don’t collapse uncertainty into heuristics. We maintain multiple causal hypotheses and update their weights online through real-world feedback. Prediction asks what may happen. Causal bounds + stability ask when automation must stop.


Bayesian ML at Scale@BayesianIn·Jan 13

@yudapearl Have a relaxing break and I hope they are friendly alligators.


Judea Pearl@yudapearl·Jan 12

I'm taking off a few days from this lively discussion on the role of Bayes updating in causal inference (too many alligators vying for my time). But will be back.


@DongNguyeb @yudapearl x.com/BayesianIn/sta…


Bayesian ML at Scale@BayesianIn·Jan 12


Dong Nguyen@DongNguyeb·Jan 11

The simple answer is: it cannot be done unless readers have specified or assumed the data-generating causal mechanism. Many people are under the illusion that if we have multiple causal models, we can simply let Bayes update them using observational data and then pick the one with the highest posterior probability. This is a fundamental mistake. They fail to recognize the existence of causal equivalence classes: many different DAGs can generate the same observational distribution. In such cases, no amount of observational data can distinguish between them, because the likelihoods are identical. To break this equivalence, they need data from an interventional regime, that is, data generated under some do(X) in a causal DAG or an RCT. Only when the data-generating process is altered do different causal models make different predictions, and only then does Bayesian updating become informative. Do you agree with me, @f2harrell and @BayesianIn?
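
A small sketch of the equivalence-class point, assuming two binary variables and a hypothetical joint table `joint` (the numbers and function names are illustrative assumptions, not from the thread). With parameters matched to the same joint, the DAGs X → Y and Y → X assign identical probability to every observational outcome, yet they disagree about P(Y=1 | do(X=1)):

```python
# Hypothetical observational joint P(X=x, Y=y) over binary X, Y.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def marginal(var, value):
    """Marginal probability of X (var=0) or Y (var=1) taking `value`."""
    return sum(p for xy, p in joint.items() if xy[var] == value)

def p_xy_model_a(x, y):
    """DAG X -> Y: P(x) * P(y | x)."""
    px = marginal(0, x)
    return px * (joint[(x, y)] / px)

def p_xy_model_b(x, y):
    """DAG Y -> X: P(y) * P(x | y)."""
    py = marginal(1, y)
    return py * (joint[(x, y)] / py)

# Observational likelihoods agree on every cell, so Bayes cannot separate the DAGs.
same_likelihood = all(
    abs(p_xy_model_a(x, y) - p_xy_model_b(x, y)) < 1e-12 for (x, y) in joint
)

# Interventional predictions differ.
# X -> Y: do(X=1) leaves the mechanism P(Y | X=1) in place.
p_y1_do_x1_a = joint[(1, 1)] / marginal(0, 1)
# Y -> X: X has no effect on Y, so do(X=1) leaves the marginal of Y unchanged.
p_y1_do_x1_b = marginal(1, 1)

print(same_likelihood, p_y1_do_x1_a, p_y1_do_x1_b)
```

With these numbers the two models predict 0.6 and 0.4 respectively under the intervention, so interventional data would separate them where observational data cannot.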


Judea Pearl@yudapearl·Jan 10

Two important observations about the questions I posed in: x.com/yudapearl/stat… (1) If we find the mortality rate among drug-choosing patients to be higher than among drug-avoiding patients, model M1 can be ruled out. (2) Such a finding is not unreasonable, assuming that patients are drawn to the pain-relieving properties of the drug, even though pain signals underlying conditions that increase the risk of mortality. How do we update the posterior probabilities of the two models with each patient in the observational study?

Judea Pearl@yudapearl

I am still only "half Bayesian", for reasons explained here: ucla.in/2nZN7IH and made concrete here: x.com/yudapearl/stat… @ngdonghung @JiangSH24 @soboleffspaces


Bayesian ML at Scale@BayesianIn·Jan 12

Thanks for the problem. Let me try to understand it.

X=1 means the patient has the inclination to take the drug and X=2 means the patient does not. These two inclinations are equally probable, so P(X=1)=P(X=2)=0.5.

Under M2:
P(death|do(Drug),X=1,M2)=0.2
P(death|do(Drug),X=2,M2)=0
P(death|do(Placebo),X=1,M2)=0.1
P(death|do(Placebo),X=2,M2)=0.1

M1 is that the Drug behaves like the Placebo (sugar tablet):
P(death|do(Drug),X=1,M1)=0.1
P(death|do(Drug),X=2,M1)=0.1
P(death|do(Placebo),X=1,M1)=0.1
P(death|do(Placebo),X=2,M1)=0.1

So by the backdoor rule, P(death|do(Drug),M1)=P(death|do(Drug),M2)=0.1, and the likelihood for M1 and for M2 is the same in an RCT which does not observe X.

If there is an additional single observation of death, drug, X=1 (the patient has the inclination to take the drug, and this was observed), the probability of this single observation is 0.2 under M2 and 0.1 under M1. With equal prior weight on the two models, this makes the posterior probability of M2 equal to 2/3.

I must confess I am confused about:
a) Why M2 involves a covariate which is the inclination to take the drug, rather than another easily measured attribute. Does this make the example more interesting?
b) Why you think this poses a challenge to Bayes.

It's entirely possible I misunderstood an aspect of this example.
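
The arithmetic in this reply can be checked with a short sketch. The probabilities are the ones stated in the post; the equal prior on M1 and M2 is an assumption (the post does not state a prior, but 2/3 follows from it), and the function names are illustrative:

```python
# Stratum probabilities and death rates as stated in the post.
p_x = {1: 0.5, 2: 0.5}
p_death = {
    "M1": {("Drug", 1): 0.1, ("Drug", 2): 0.1, ("Placebo", 1): 0.1, ("Placebo", 2): 0.1},
    "M2": {("Drug", 1): 0.2, ("Drug", 2): 0.0, ("Placebo", 1): 0.1, ("Placebo", 2): 0.1},
}

def p_death_do(model, arm):
    """P(death | do(arm), model) with X unobserved: average over the strata of X."""
    return sum(p_x[x] * p_death[model][(arm, x)] for x in p_x)

def posterior_m2(likelihood_m1, likelihood_m2, prior_m2=0.5):
    """Posterior weight on M2; the equal prior is an assumption, not stated in the post."""
    num = likelihood_m2 * prior_m2
    return num / (num + likelihood_m1 * (1.0 - prior_m2))

# RCT without X: identical likelihoods, so the posterior stays at the prior.
rct_m1 = p_death_do("M1", "Drug")
rct_m2 = p_death_do("M2", "Drug")
post_rct = posterior_m2(rct_m1, rct_m2)

# One observed death on the drug with X=1: likelihood 0.1 under M1, 0.2 under M2.
post_obs = posterior_m2(p_death["M1"][("Drug", 1)], p_death["M2"][("Drug", 1)])

print(rct_m1, rct_m2, post_rct, post_obs)
```

The factors P(X=1) and the treatment-assignment probability are the same under both models, so they cancel in the posterior and are left out of the likelihoods here.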


Judea Pearl@yudapearl·Jan 11

@JiangSH24 One tiny correction, in the example cited we have P(death | do(Drug)) = 0.1 under both M1 and M2. RCT in itself cannot distinguish between the two models.


Pedro Domingos@pmddomingos


Bayesian ML at Scale@BayesianIn·Dec 26

While this is more or less the main RecSys heuristic, and it is hard to beat, I do think we should try to do better. Being satisfied with a completely ad-hoc solution is not a long term path to progress.

Simple way to replace RL with supervised learning: assign the reward to every action on the path to it and learn to predict it. Hypothesis: no RL algorithm will ever beat this by much.
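
One possible reading of the quoted heuristic, sketched with hypothetical trajectories (the data, the tabular mean predictor, and all names are assumptions for illustration; the quoted post does not specify a model). Each action on a path is labeled with the reward reached at the end of that path, and a supervised predictor is fit to those labels:

```python
from collections import defaultdict

# Hypothetical logged trajectories: (sequence of actions, final reward).
trajectories = [
    (["a", "b"], 1.0),
    (["a", "c"], 0.0),
    (["b", "c"], 1.0),
]

# Supervised dataset: every action on a path gets that path's reward as its label.
dataset = [(action, reward) for actions, reward in trajectories for action in actions]

# Simplest possible predictor: the mean label per action.
totals = defaultdict(float)
counts = defaultdict(int)
for action, reward in dataset:
    totals[action] += reward
    counts[action] += 1
predicted_reward = {a: totals[a] / counts[a] for a in totals}

# Act greedily with respect to the predicted reward.
best_action = max(predicted_reward, key=predicted_reward.get)
print(predicted_reward, best_action)
```

In a recommender setting the per-action mean would typically be replaced by a model conditioned on context, but the labeling step stays the same.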


Bayesian ML at Scale@BayesianIn·Dec 15

@DongNguyeb Many people find that when they specify their probability (the point of indifference between buying and selling bets) over repeated measurements, e.g. x1, ..., xn, their probabilities are exchangeable (and hence have a de Finetti representation).
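
A minimal sketch of the de Finetti representation mentioned here, assuming binary outcomes and a uniform mixing distribution over the Bernoulli parameter (both are assumptions chosen for illustration). Under that mixture the probability of a sequence depends only on how many ones it contains, so every reordering gets the same probability:

```python
from itertools import permutations
from math import factorial

def sequence_probability(xs):
    """Integral of theta^k (1 - theta)^(n - k) over [0, 1], i.e. k!(n-k)!/(n+1)!."""
    n, k = len(xs), sum(xs)
    return factorial(k) * factorial(n - k) / factorial(n + 1)

# All distinct orderings of one success and two failures.
xs = (1, 0, 0)
probs = {perm: sequence_probability(perm) for perm in set(permutations(xs))}
print(probs)

# Exchangeable but not independent: P(x2=1 | x1=1) differs from P(x2=1) = 1/2.
p_x2_given_x1 = sequence_probability((1, 1)) / sequence_probability((1,))
print(p_x2_given_x1)
```

The conditional probability rising above 1/2 after one observed success shows that exchangeable beliefs allow learning from past outcomes, which independent ones would not.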


Dong Nguyen@DongNguyeb·Dec 9

My question to you was about inference, not about assumption, in response to your claim that pure probability is sufficient to answer “why I infer the fact”. Of course anyone may assume anything. The point is: can pure probability theory derive causal exchangeability? If not, then causality is not contained in probability.


Bayesian ML at Scale@BayesianIn·Dec 1

The @yudapearl Pearlian view is that causal inference is a completely separate discipline from statistical inference (and statistical estimation can be tackled using either the Bayesian or frequentist paradigm). Causal inference is then "inference across distributions", that is, a modification of a (frequentist) probability that accounts for an intervention. Here I quote @analisereal


Bayesian ML at Scale@BayesianIn·Dec 8

@DongNguyeb You may assume a probability specification is exchangeable in any sense.


Dong Nguyen@DongNguyeb·Dec 7

@BayesianIn So please tell me: does pure probability theory allow us to infer exchangeability in the causal sense?


Bayesian ML at Scale@BayesianIn·Dec 7

If you deem a future outcome on a unit that receives a treatment exchangeable with past outcomes on units that received the treatment, and a future outcome on a unit that receives no treatment exchangeable with past outcomes on units that received no treatment, then this is a powerful and consequential assumption that enables causal inference.


Dong Nguyen@DongNguyeb·Dec 2

Thank you for confirming. So you now acknowledge that you are not using counterfactuals when you talk about conditional exchangeability. And the kind of ‘exchangeability’ you are using is only the trivial statistical exchangeability within the observed data: y_i independent of y_j | (t_i = t_j). This is merely the exchangeability of observations within the same treatment group. It is a purely statistical property, and while it is true, it is irrelevant for causal inference. Why? Because this type of exchangeability only describes the distribution of outcomes given the treatments actually received. It tells us nothing about the causal question: “What would happen to this unit if we were to change its treatment?” Statistical exchangeability within observed groups cannot answer that question. It carries no information about how the outcome would behave under an intervention. Therefore, it does not help with causal inference at all.


Bayesian ML at Scale@BayesianIn·Dec 2

@DongNguyeb It is a price or a point of indifference between buying and selling bets.


Dong Nguyen@DongNguyeb·Dec 2

@BayesianIn How do you obtain y1,...,yn | t1,...,tn? Can you get it from observational data?


Bayesian ML at Scale@BayesianIn·Dec 2

@DongNguyeb Yes, I am not using counterfactuals. If you have y1,...,yn | t1,...,tn, then yi and yj are exchangeable if ti=tj.


Dong Nguyen@DongNguyeb·Dec 2

That’s great if you can help me understand what I may have missed. If you claim that Bayes’ rule together with an exchangeability assumption is sufficient to infer the outcomes a unit would have under different treatments, then you must acknowledge that conditional exchangeability is itself a strong causal assumption. When you write Y(1),...,Y(n) independent of T | covariates, this is not a purely probabilistic assumption, because the Y(t) are counterfactuals: they do not exist in the observational data, and for any individual you never observe both Y(0) and Y(1). Therefore, you cannot approximate the distribution of Y(t) from the data, nor can you test this independence using probability theory alone. It can only be assumed on the basis of causal knowledge about the data-generating process. That’s why conditional exchangeability is a causal assumption about how the world works (or how you believe the world works); it is not a derivation from probability theory. Am I missing something?


Bayesian ML at Scale@BayesianIn·Dec 2

@yudapearl I feel like I am repeating myself, and I suspect you feel the same. At a later date, I will use a different forum to try to outline your point of view (as best I understand it) and the small points in which I have a differing view.


Bayesian ML at Scale@BayesianIn·Dec 2

> "I don't understand the urge people have to demonstrate "we don't need this machinery", especially when the alternative machinery they propose is so cognitively cumbersome."

This is a separate question, more around preference, taste and foundations.

The advantages of basing causal inference purely on the Ramsey-de Finetti-Savage theory of statistics are:
- Automatically consistent with the most complete axiom system for decision making under uncertainty that we know.
- An operational procedure for determining conditional exchangeability relationships usually needed for causal inference.
- Likely incoherence and inadmissibility arguments can be made against a two-step procedure of estimating a joint probability and then applying causal inference. Yes, this is academic and perhaps of little practical consequence, but still of some concern from a foundational point of view.

Some disadvantages include:
- Belief that frequentist probability and causal concepts are more intuitive than Bayesian probability and conditional exchangeability.
- Ability to ignore covariates, greatly simplifying certain analyses.

Feel free to add more.


Bayesian ML at Scale retweeted

Bayesian ML at Scale@BayesianIn·Dec 2

Thanks for the engagement and the thoughtful response. It is difficult to outline my (many) points of agreement and the few points where I differ in an X post, but I will try.

> "I've never insisted on the "frequency interpretation" of probability."

In this paper statistical analysis is defined in frequentist terms. escholarship.org/content/qt4q74…

I acknowledge I am being particularly purist here, but to a strict (operational subjective) Bayesian, probability does not exist, and the idea of a Bayesian estimator is a contradiction in terms. The concept of "experimental conditions remaining the same" only makes sense with a frequentist notion of repeated draws from a low-dimensional probability model.
