Gary King

5.2K posts

@kinggary

Harvard Professor (social scientist, statistician). Co-founder Crimson Hexagon (now Brandwatch), Learning Catalytics, Perusall, Thresher (now Two Six Tech),...

Cambridge, MA · Joined April 2009
7.8K Following · 34.3K Followers
Gary King @kinggary
Thanks everyone at @JohnsHopkins for coming to my talk & for the fabulous conversations. Great to see all the wonderful changes (with more in progress) in the social sciences there... Slides from my talk on "Who’s to Blame for Survey Instability: Respondents With Random Preferences or Researchers With Flawed Measures": gking.harvard.edu/presentations/…
1 reply · 22 reposts · 105 likes · 9.7K views
Gary King @kinggary
@SeanLangenfeld You might have a look at this article, which explains "why propensity scores should not be used for matching": j.mp/psnot. It really is not ambiguous.
1 reply · 5 reposts · 20 likes · 1.8K views
Sean Langenfeld @SeanLangenfeld
I asked ChatGPT to explain to me why biostatisticians are so critical of propensity score matching, and I thought the answer was very organized and straightforward. I always summarized it as false confidence, sample bias, and ongoing baseline differences in unmeasured variables. But I am the farthest thing from being a statistician. Here's the better answer, though:

"Great question. Propensity score matching (PSM) is widely used, but many biostatisticians are indeed critical of it, not because it's useless, but because it's often misused, misunderstood, or oversold. Here are the main reasons, in a structured way.

1. It cannot fix unmeasured confounding
Core limitation: propensity scores only balance observed covariates. If an important confounder is not measured, or is poorly measured, PSM does nothing to fix that. This is especially problematic in retrospective clinical data (e.g., registry, EHR), where disease severity, frailty, physician judgment, social factors, etc. are often missing or crude. In surgery and oncology (which you deal with a lot), this is a huge issue: "who gets which treatment" is often driven by factors not captured in the dataset. So PSM can produce beautifully balanced tables and still be badly biased.

2. It can increase bias if done poorly
Counterintuitively, matching can make things worse. If the propensity model is misspecified, omits important confounders, or uses bad functional forms, then matching can amplify imbalance in key covariates, induce collider bias, and create a sample that is less representative and more biased. This is known as bias amplification.

3. It throws away data (often a lot of it)
Typical 1:1 matching often discards many treated patients and many controls, especially if the groups are very different initially. The consequences: lower statistical power, larger standard errors, and estimates that apply only to a restricted subpopulation (the "matched" patients). So you trade sample size and generalizability for "balance", and many papers forget to state that they changed the estimand.

4. People confuse "balance" with "causality"
A very common mistake: "The groups are balanced, therefore the effect is causal." No. Balance on observed variables ≠ exchangeability. You can balance 100 variables and still be confounded by 1 unmeasured one. This leads to false confidence in the result.

5. The propensity model is often built incorrectly
Common problems: using stepwise variable selection, using the c-statistic to judge PS quality (the wrong metric), including post-treatment variables, omitting weak but important confounders, and not checking standardized mean differences properly. PS models should include all confounders, not just predictors of treatment, and should optimize covariate balance, not prediction accuracy. Many clinical papers get this wrong.

6. It answers a different question than people think
After matching, you are no longer estimating "what is the effect in the whole population?" You are estimating "what is the effect in the subset of patients who could plausibly have received either treatment?" This is often not stated, not understood, and not what the clinician thinks they're reading.

7. There are often better methods
In many settings, these are statistically superior: doubly robust methods, inverse probability weighting, targeted maximum likelihood (TMLE), modern outcome regression with flexible models, and causal forests / g-methods. These use all the data, can achieve a better bias-variance tradeoff, and are less sensitive to arbitrary matching decisions.

8. Matching hides modeling decisions
Matching involves many degrees of freedom: caliper size, matching with or without replacement, the matching ratio (1:1, 1:2, 1:5…), which variables go into the PS, and how ties are handled. Each choice can change the result, is rarely pre-specified, and enables subtle p-hacking.

9. In the medical literature, it is often used as a "ritual"
Unfortunately, PSM has become a checkbox: "do PSM → claim causal inference". Many statisticians view this as methodological cargo-cult science.

The fair, balanced conclusion
Propensity scores are: ✅ useful when done carefully; ✅ better than crude regression in some settings; ❌ not a magic fix for confounding; ❌ often misapplied in clinical research; ❌ frequently over-interpreted.

A practical rule of thumb
If treatment choice is strongly influenced by unrecorded clinical judgment, or if frailty, anatomy, tumor resectability, or performance status are not well captured, then PSM probably does not rescue the study from confounding by indication."
2 replies · 14 reposts · 37 likes · 9.6K views
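A minimal sketch of points 1 and 4 from the answer above, assuming numpy and scikit-learn are available; the data-generating process, coefficients, and variable names are illustrative, not taken from any paper in this thread:

```python
# Toy demonstration of points 1 and 4 above: matching on a propensity
# score balances the OBSERVED covariate, yet the effect estimate stays
# biased because an unmeasured confounder remains imbalanced.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4000

x = rng.normal(size=n)  # observed covariate
u = rng.normal(size=n)  # unmeasured confounder (e.g., frailty)
t = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.8 * x + 0.8 * u))))
y = 1.0 * t + x + u + rng.normal(size=n)  # true treatment effect = 1.0

# The propensity model can only use what we observe: x.
ps = LogisticRegression().fit(x[:, None], t).predict_proba(x[:, None])[:, 1]

# Greedy 1:1 nearest-neighbor matching on the propensity score,
# without replacement.
controls = list(np.where(t == 0)[0])
pairs = []
for i in np.where(t == 1)[0]:
    j = min(controls, key=lambda c: abs(ps[c] - ps[i]))
    pairs.append((i, j))
    controls.remove(j)
ti, ci = map(np.array, zip(*pairs))

def smd(a, b):
    # standardized mean difference between matched groups
    return (a.mean() - b.mean()) / np.sqrt((a.var() + b.var()) / 2)

print("SMD of observed x:  ", round(smd(x[ti], x[ci]), 2))  # near 0: "balanced"
print("SMD of unmeasured u:", round(smd(u[ti], u[ci]), 2))  # still large
print("Matched estimate:   ", round((y[ti] - y[ci]).mean(), 2))  # biased above 1.0
```

Because u is never observed, no matching or weighting scheme built on x alone can close that gap; that is the sense in which balance on observed covariates is not exchangeability.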
Gary King @kinggary
New paper: "Experimental Evidence on the (Limited) Influence of Reputable Media Outlets," with Bharat Anand, Kiran Misra, and Sascha Riaz, at GaryKing.org/reputable
[image]
0 replies · 4 reposts · 5 likes · 1.9K views
Gary King @kinggary
Big congratulations to Chris Kenny, Harvard's newest PhD, with a few of his fans.
[image]
0 replies · 2 reposts · 77 likes · 21K views
Gary King @kinggary
@rmkubinec @namalhotra @carlislerainey @matt_motta Our behavioral models (our observation mechanisms) can be tremendously important. Economists originally liked the random utility model because they can still get (stochastic) rationality, but only by assuming humans have no stable preferences. Much evidence rejects this model.
0 replies · 0 reposts · 2 likes · 84 views
Gary King @kinggary
@namalhotra @matt_motta Thanks for the comment, Neil. The random utility model is an assumption (humans have random preferences & never make mistakes in survey responses) not supported by the evidence. This often doesn't matter, but it does here. Have a look at our supplementary appendix, which discusses this.
1 reply · 0 reposts · 5 likes · 252 views
Neil Malhotra @namalhotra
@matt_motta I may just be really dumb, but I don't understand this paper. Normally, one would write out a random utility choice model (e.g., Luce/Thurstone), which has error terms. But that doesn't mean you inflate estimates as if it were a compliance problem.
2 replies · 0 reposts · 3 likes · 473 views
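A minimal simulation of the distinction drawn in this exchange, assuming only numpy; the target correlation and noise scales are arbitrary illustrations, not estimates from the paper under discussion:

```python
# Sketch of the two explanations of survey instability debated above.
# Both models are tuned to produce the SAME wave-to-wave correlation,
# so observed instability alone cannot tell them apart; which one you
# assume determines how you interpret (and correct) survey responses.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
r = 0.3  # target test-retest correlation, chosen arbitrarily

# Model A: the random-preference story. Preferences themselves drift
# between waves; answers are recorded without error.
pref1 = rng.normal(size=n)
pref2 = r * pref1 + np.sqrt(1 - r**2) * rng.normal(size=n)
a1, a2 = pref1, pref2  # responses = preferences

# Model B: the flawed-measure story. Preferences are fixed; each
# wave's answer adds independent measurement error.
truth = rng.normal(size=n)
noise_sd = np.sqrt(1 / r - 1)  # makes corr(b1, b2) = r as well
b1 = truth + rng.normal(scale=noise_sd, size=n)
b2 = truth + rng.normal(scale=noise_sd, size=n)

print("Model A correlation:", round(np.corrcoef(a1, a2)[0, 1], 2))  # ~0.3
print("Model B correlation:", round(np.corrcoef(b1, b2)[0, 1], 2))  # ~0.3
```

Both models reproduce the same observed instability, so choosing between them requires evidence about how respondents actually answer, which is exactly the assumption at issue in the exchange above.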
Gary King @kinggary
@carlislerainey @JohnHolbein1 Definitely right. As social scientists, we know it is hard to change individual behavior, even when the individuals are us! Nice to see that expectations have changed so clearly, and that behavior is following.
0 replies · 1 repost · 2 likes · 257 views
Carlisle Rainey @carlislerainey
@kinggary @JohnHolbein1 I think there's a generational divide in how one interprets these results. Post ~2014, sharing is expected, common, and easy, so it's "only" 31%. On the other hand, what a massive change in 30 years.
2 replies · 1 repost · 2 likes · 388 views
John B. Holbein @JohnHolbein1
Oof. This is a gut punch. "We find that relatively few [political science] articles make the underlying data & code available to others." "Reproduction archives are almost entirely unavailable from 1995-2006." In 2022, 31% of articles in quantitative PS journals include shared data & code.
[image]
7 replies · 68 reposts · 299 likes · 50.3K views