David Reber

@davidpreber

#AIsafety, interpretability, and causality. PhD student in CS @ #UChicago.

Joined June 2023

95 Following, 77 Followers

David Reber@davidpreber·20 Apr

@PalantirTech Rifles are the wrong metaphor. The closest historical metaphor for the level of personalization and scope of AI militarization is secret police.

Palantir@PalantirTech·18 Apr

Because we get asked a lot. The Technological Republic, in brief.
1. Silicon Valley owes a moral debt to the country that made its rise possible. The engineering elite of Silicon Valley has an affirmative obligation to participate in the defense of the nation.
2. We must rebel against the tyranny of the apps. Is the iPhone our greatest creative if not crowning achievement as a civilization? The object has changed our lives, but it may also now be limiting and constraining our sense of the possible.
3. Free email is not enough. The decadence of a culture or civilization, and indeed its ruling class, will be forgiven only if that culture is capable of delivering economic growth and security for the public.
4. The limits of soft power, of soaring rhetoric alone, have been exposed. The ability of free and democratic societies to prevail requires something more than moral appeal. It requires hard power, and hard power in this century will be built on software.
5. The question is not whether A.I. weapons will be built; it is who will build them and for what purpose. Our adversaries will not pause to indulge in theatrical debates about the merits of developing technologies with critical military and national security applications. They will proceed.
6. National service should be a universal duty. We should, as a society, seriously consider moving away from an all-volunteer force and only fight the next war if everyone shares in the risk and the cost.
7. If a U.S. Marine asks for a better rifle, we should build it; and the same goes for software. We should as a country be capable of continuing a debate about the appropriateness of military action abroad while remaining unflinching in our commitment to those we have asked to step into harm’s way.
8. Public servants need not be our priests. Any business that compensated its employees in the way that the federal government compensates public servants would struggle to survive.
9. We should show far more grace towards those who have subjected themselves to public life. The eradication of any space for forgiveness—a jettisoning of any tolerance for the complexities and contradictions of the human psyche—may leave us with a cast of characters at the helm we will grow to regret.
10. The psychologization of modern politics is leading us astray. Those who look to the political arena to nourish their soul and sense of self, who rely too heavily on their internal life finding expression in people they may never meet, will be left disappointed.
11. Our society has grown too eager to hasten, and is often gleeful at, the demise of its enemies. The vanquishing of an opponent is a moment to pause, not rejoice.
12. The atomic age is ending. One age of deterrence, the atomic age, is ending, and a new era of deterrence built on A.I. is set to begin.
13. No other country in the history of the world has advanced progressive values more than this one. The United States is far from perfect. But it is easy to forget how much more opportunity exists in this country for those who are not hereditary elites than in any other nation on the planet.
14. American power has made possible an extraordinarily long peace. Too many have forgotten or perhaps take for granted that nearly a century of some version of peace has prevailed in the world without a great power military conflict. At least three generations — billions of people and their children and now grandchildren — have never known a world war.
15. The postwar neutering of Germany and Japan must be undone. The defanging of Germany was an overcorrection for which Europe is now paying a heavy price. A similar and highly theatrical commitment to Japanese pacifism will, if maintained, also threaten to shift the balance of power in Asia.
16. We should applaud those who attempt to build where the market has failed to act. The culture almost snickers at Musk’s interest in grand narrative, as if billionaires ought to simply stay in their lane of enriching themselves . . . . Any curiosity or genuine interest in the value of what he has created is essentially dismissed, or perhaps lurks from beneath a thinly veiled scorn.
17. Silicon Valley must play a role in addressing violent crime. Many politicians across the United States have essentially shrugged when it comes to violent crime, abandoning any serious efforts to address the problem or take on any risk with their constituencies or donors in coming up with solutions and experiments in what should be a desperate bid to save lives.
18. The ruthless exposure of the private lives of public figures drives far too much talent away from government service. The public arena—and the shallow and petty assaults against those who dare to do something other than enrich themselves—has become so unforgiving that the republic is left with a significant roster of ineffectual, empty vessels whose ambition one would forgive if there were any genuine belief structure lurking within.
19. The caution in public life that we unwittingly encourage is corrosive. Those who say nothing wrong often say nothing much at all.
20. The pervasive intolerance of religious belief in certain circles must be resisted. The elite’s intolerance of religious belief is perhaps one of the most telling signs that its political project constitutes a less open intellectual movement than many within it would claim.
21. Some cultures have produced vital advances; others remain dysfunctional and regressive. All cultures are now equal. Criticism and value judgments are forbidden. Yet this new dogma glosses over the fact that certain cultures and indeed subcultures . . . have produced wonders. Others have proven middling, and worse, regressive and harmful.
22. We must resist the shallow temptation of a vacant and hollow pluralism. We, in America and more broadly the West, have for the past half century resisted defining national cultures in the name of inclusivity. But inclusion into what?
Excerpts from the #1 New York Times Bestseller The Technological Republic: Hard Power, Soft Belief, and the Future of the West, by Alexander C. Karp & Nicholas W. Zamiska techrepublicbook.com

David Reber@davidpreber·7 Mar

[Hot take] Your Causal Variables Are Irreducibly Subjective Mech interp keeps "finding the bug" in earlier interventions, but the real problem is upstream: your variable definitions are subjective choices no formalism can validate. open.substack.com/pub/cichicago/…

David Reber retweeted

Mourad Heddaya@mouradheddaya·20 Feb

Democracy depends on an informed electorate. But political issues and ballot measures can be confusing, obscuring the effects of one outcome versus another. Moreover, politics is personal. Once we make an initial decision about an issue, it can be hard to change our mind or see things from “the other side.” And talking about issues with those with whom we disagree can be challenging, especially when the conversation feels more like a debate than a discussion. Technology offers ways to alleviate these difficulties, but not without introducing problems of its own. The Internet and social media promised new ways for people to connect, discuss issues, and learn from each other. But in practice, both often inflame passions, solidify echo chambers, and spread misinformation. More recently, LLM chat interfaces may help people stay informed through personalized access to information, but mainstream chatbots tend to match user beliefs rather than clarifying or challenging them. Without the kind of pushback you’d encounter in a discussion between disagreeing friends, chatbots are ill-suited for helping people think through political issues in a balanced way. The goal of CivicChats is to address these shortcomings. Starting with ballot measures, CivicChats helps people better understand political issues through three different modes of discussion: a Q&A mode for understanding what a measure does and what’s at stake, an argumentative mode that presents competing views to your own, and a reflective mode that helps you examine and develop your own thinking.

David Reber retweeted

Todd Nief@toddknife·7 Dec

Most mech interp work relies on activation patching, but patching activations destroys previous computation. What if we want to use a different mechanism on the same residual stream? We propose dynamic weight grafting to interpret finetuned model weights. 🧵 1/n

David Reber retweeted

Ari Holtzman@universeinanegg·4 Dec

Predictive Interpretability > Mechanistic Interpretability Prompting is the best method of scientific inquiry we have to study LLMs It's socially devalued because it doesn't include much d/dx,O(),etc. come to poster #3503 to talk about this or anything re: the science of LLMs

David Reber@davidpreber·15 Jul

Excited to present our work on LLM-assisted explainability at #ICML2025! 🖼️ Poster: Wednesday, 11:00am–1:30pm (#E-2902) 📄 arxiv.org/abs/2410.11348 w/ @seanrson @toddknife @ggarbacea @victorveitch If you're using LLMs to generate counterfactual pairs, rewrite twice—not once!

David Reber retweeted

Victor Veitch 🔸@victorveitch·6 Jun

Semantics in language is naturally hierarchical, but attempts to interpret LLMs often ignore this. Turns out: baking semantic hierarchy into sparse autoencoders can give big jumps in interpretability and efficiency. Thread + bonus musings on the value of SAEs:

David Reber retweeted

Dang Nguyen@divingwithorcas·14 Apr

1/n You may know that large language models (LLMs) can be biased in their decision-making, but ever wondered how those biases are encoded internally and whether we can surgically remove them?

David Reber@davidpreber·18 Oct

5/ Shout out to my amazing collaborators @seanrson @toddknife @ggarbacea @victorveitch. Give them a follow! Dive into the full paper at arxiv.org/abs/2410.11348

David Reber@davidpreber·18 Oct

4/ But how can we know we’re actually getting counterfactuals? We use a novel synthetic experiment to test how much our rewrite method is affecting known off-target correlates: induce a distributional shift, and see if the reported ATE changes! (It shouldn’t, and RATE passes ✅)

David Reber@davidpreber·18 Oct

🧵 RATE: Score Reward Models with Imperfect Rewrites of Rewrites 1/ How do you measure whether a reward model incentivizes helpfulness without accidentally measuring length, complexity, etc? Rewrites of rewrites give good counterfactuals, without needing to list all confounders!
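A toy sketch of why rewriting twice can help, under assumptions that are not in the thread: texts are modeled as (attribute, rewritten) pairs, and `toy_rewrite` and `toy_reward` are hypothetical stand-ins for an LLM rewriter and a reward model. The rewriter is assumed to leave a side effect that the reward model responds to. Comparing an original to its single rewrite mixes that side effect into the estimate, while comparing a rewrite to a rewrite of that rewrite puts the side effect on both sides so it cancels. This illustrates the idea only and is not claimed to be the paper's exact estimator.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Toy "text": (attribute present? 0/1, has this text been rewritten?)
Text = Tuple[int, bool]


def toy_rewrite(x: Text, w: int) -> Text:
    """Hypothetical rewriter: sets the attribute to w and leaves a rewrite artifact."""
    return (w, True)


def toy_reward(x: Text) -> float:
    """Hypothetical reward model: +1.0 for the attribute, +0.5 for the rewrite artifact."""
    attribute, rewritten = x
    return 1.0 * attribute + 0.5 * float(rewritten)


def single_rewrite_effect(
    treated: Sequence[Text],
    rewrite: Callable[[Text, int], Text],
    reward: Callable[[Text], float],
) -> float:
    """Compare each original to its one-step rewrite (artifact on one side only)."""
    return mean(reward(x) - reward(rewrite(x, 0)) for x in treated)


def double_rewrite_effect(
    treated: Sequence[Text],
    rewrite: Callable[[Text, int], Text],
    reward: Callable[[Text], float],
) -> float:
    """Compare the rewrite to the rewrite-of-rewrite (artifact on both sides)."""
    diffs = []
    for x in treated:
        x_off = rewrite(x, 0)          # attribute removed, artifact added
        x_back_on = rewrite(x_off, 1)  # attribute restored, artifact still present
        diffs.append(reward(x_back_on) - reward(x_off))
    return mean(diffs)


# Original (never rewritten) texts that all have the attribute.
treated_texts = [(1, False)] * 4
naive_estimate = single_rewrite_effect(treated_texts, toy_rewrite, toy_reward)
double_estimate = double_rewrite_effect(treated_texts, toy_rewrite, toy_reward)
```

In this toy setup the reward model's true response to the attribute is 1.0 by construction; the single-rewrite comparison is pulled away from it by the artifact term, and the double-rewrite comparison is not.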

David Reber retweeted

Victor Veitch 🔸@victorveitch·17 Oct

Reward models are a key ingredient to understanding LLMs. But what they *actually* reward can be a mystery. Turns out we can measure this directly! The trick is to use LLMs to produce "rewrites-of-rewrites" datasets to measure attributes in isolation. arxiv.org/abs/2410.11348

David Reber retweeted

Yibo Jiang@yibophd·16 Jul

Are LLMs just doing next token predictions? It is believed that if an LLM can accurately predict the next tokens in a Wikipedia entry, it essentially "learns" the information. But do pre-trained LLMs actually need to understand context sentences to solve this task? The answer is no!

David Reber retweeted

Adam Gleave@ARGleave·12 Jul

Progress in empirical ML fields like interpretability is driven by success metrics -- but how good are these metrics? @JosephMiller_ et al find "faithfulness" measures are sensitive to arbitrary design choices, calling into question previous interp claims.

Joseph Miller@JosephMiller_

1/ When you find a circuit in a language model, how do you test if it does what you think? Just accepted to COLM 2024, our new paper (@bilalchughtai_ and William Saunders), investigates this question and finds a number of common pitfalls. 🧵

David Reber retweeted

Victor Veitch 🔸@victorveitch·10 Jun

Fundamentally, high-level concepts group into categorical variables---mammal, reptile, fish, bird---with a semantic hierarchy---poodle is a dog is a mammal is an animal. How do LLMs internally represent this structure? arxiv.org/abs/2406.01506

David Reber retweeted

Victor Veitch 🔸@victorveitch·4 Jun

LLM best-of-n sampling works great in practice---but why? Turns out: it's the best possible policy for maximizing win rate over the base model! Then: we use this to get a truly sweet alignment scheme: easy tweaks, huge gains w @ybnbxb @ggarbacea arxiv.org/abs/2406.00832
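For reference, best-of-n sampling itself is simple to state: draw n responses from the base model and keep the one the reward model scores highest. A minimal sketch, where `sample` and `reward` are hypothetical callables standing in for the base model and the reward model (neither name comes from the tweet or the paper):

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(sample: Callable[[], T], reward: Callable[[T], float], n: int) -> T:
    """Draw n candidates from `sample` and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = (sample() for _ in range(n))
    return max(candidates, key=reward)


# Toy usage: the "model" emits uniform random numbers and the reward is the number itself.
rng = random.Random(0)
best = best_of_n(rng.random, reward=lambda x: x, n=4)
```

When rewards are continuous, so ties have probability zero, a best-of-n response beats one independent base sample with probability n/(n+1), by symmetry among the n+1 draws.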

David Reber@davidpreber·15 Dec

Ongoing work, joint with @ggarbacea and @victorveitch. -> We formalize mechanistic faithfulness as structural equivalence between causal abstractions -> *Partial* faithfulness interp evaluations are naturally induced by causal hierarchy

David Reber@davidpreber·15 Dec

'Mechanistic' interpretability has unclear objectives, inconsistent evaluation, and overlooks alternative hypotheses. Come discuss a causal taxonomy for interp evals! CRL workshop poster session: Friday Dec 15, 10:30-12pm in 243-245! @crl_neurips2023 tinyurl.com/yc5edyxn
