Ben Cohen-Wang

25 posts

@bcohenwang

Pre-training @AnthropicAI. On leave from PhD at MIT with Aleksander Madry.

Joined September 2022

241 Following, 198 Followers

Ben Cohen-Wang reposted

Anthropic@AnthropicAI·Feb 28

A statement on the comments from Secretary of War Pete Hegseth. anthropic.com/news/statement…

Ben Cohen-Wang@bcohenwang·Apr 29

@brianryhuang Sorry I'm not following. What's the mechanism here that encourages hallucinations? If the answer is correct regardless of the post-hoc reasoning chain (because the model has memorized it), then it seems like RL wouldn't push the reasoning chain to do anything in particular?

Brian Huang@brianryhuang·Apr 29

@bcohenwang A good illustrative example is math "trick questions" (think Monty Hall); the correct answer is well-known, and so are the erroneous solutions. If the reasoning chain is post-hoc, the model might output wrong solution + correct answer, and do this often for this kind of problem

Ben Cohen-Wang@bcohenwang·Apr 29

Popular reasoning benchmarks just reward correct answers (they don't penalize guessing). This incentivizes models to guess when they're not sure, which (beyond hurting usability) seems like it would encourage hallucinations more broadly. Is this why o3 etc. hallucinate a lot?
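The incentive described in this post can be made concrete with a little arithmetic. The sketch below is illustrative only: the scoring rule (1 point for a correct answer, 0 for abstaining, a hypothetical `wrong_penalty` for a wrong answer) is an assumption, not the rule of any particular benchmark.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering when the answer is right with probability p_correct.

    Assumed scoring: +1 for a correct answer, -wrong_penalty for a wrong one,
    and 0 for abstaining.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """A score-maximizing model answers whenever that beats abstaining (score 0)."""
    return expected_score(p_correct, wrong_penalty) > 0.0


# Accuracy-only grading: even a 10%-confident guess beats abstaining.
print(should_guess(0.10, wrong_penalty=0.0))  # True
# With a penalty of 1 per wrong answer, guessing pays only above 50% confidence.
print(should_guess(0.10, wrong_penalty=1.0))  # False
print(should_guess(0.60, wrong_penalty=1.0))  # True
```

Under this assumed rule, guessing is worthwhile exactly when p_correct exceeds wrong_penalty / (1 + wrong_penalty), so with no penalty any nonzero chance of being right favors guessing.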

Ben Cohen-Wang@bcohenwang·Apr 29

@brianryhuang Interesting--would this *encourage* hallucinations? It seems like it just wouldn't penalize hallucinations in post-hoc reasoning. I think as long as hallucinations aren't encouraged, they can be mitigated through, e.g., factuality RL (if they are encouraged, you get a tradeoff).

Brian Huang@brianryhuang·Apr 29

@bcohenwang Another intuition I have here comes from the CoT faithfulness literature. Because the reasoning chain is sometimes not necessary for the model to reach the answer, the model can output post-hoc CoTs with mistakes before a correct answer, and these mistakes are not penalized.

Ben Cohen-Wang@bcohenwang·Apr 23

@ko175041 @YungSungChuang @aleks_madry Yes! We look at "thought attribution" in the paper: arxiv.org/abs/2504.13752

Ting-Wen Ko@ko175041·Apr 23

@bcohenwang @YungSungChuang @aleks_madry Also, is there a way to know whether it's using its intermediate thoughts? if so, can we know what they are?

Ben Cohen-Wang@bcohenwang·Apr 22

It can be helpful to pinpoint the in-context information that a language model uses when generating content (is it using provided documents? or its own intermediate thoughts?). We present Attribution with Attention (AT2), a method for doing so efficiently and reliably! (1/8)

Ben Cohen-Wang@bcohenwang·Apr 23

@ko175041 @YungSungChuang @aleks_madry Thanks! A two-layer NN does a little better than a coefficient for each head, but not enough to make the added complexity worth it! Another potential axis for improvement is to add attention features beyond just the "first-order" attention weights.
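As a rough illustration of what "a coefficient for each head" could look like, the sketch below scores each context source as a weighted sum of per-head attention weights. The input layout, the function name `attribution_scores`, and the example numbers are assumptions made for illustration; this is not the AT2 package's API, and it leaves out how the coefficients are learned.

```python
from typing import List, Sequence


def attribution_scores(
    attn: Sequence[Sequence[float]],  # attn[h][s]: attention from the response to source s in head h (assumed layout)
    head_coeffs: Sequence[float],     # one coefficient per head (learned in practice; hand-picked here)
) -> List[float]:
    """Score each source as a coefficient-weighted sum of its per-head attention."""
    num_sources = len(attn[0])
    return [
        sum(coeff * head[s] for coeff, head in zip(head_coeffs, attn))
        for s in range(num_sources)
    ]


# Two heads, three sources (made-up attention weights).
attn = [
    [0.7, 0.2, 0.1],  # head 0 mostly attends to source 0
    [0.1, 0.1, 0.8],  # head 1 mostly attends to source 2
]

# Trusting only head 0 reproduces that head's attention pattern.
scores_head0_only = attribution_scores(attn, head_coeffs=[1.0, 0.0])
# Weighting both heads equally blends the two patterns.
scores_blend = attribution_scores(attn, head_coeffs=[0.5, 0.5])
```

A model with one coefficient per head stays linear in the attention weights, which is the simplicity the reply weighs against the small gain from a two-layer network.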

Ting-Wen Ko@ko175041·Apr 23

@bcohenwang @YungSungChuang @aleks_madry Also a faithfulness enthusiast 👋 I find this work very interesting! Question: would it be even better to train a simple NN rather than a coefficient per attention head?

Ben Cohen-Wang@bcohenwang·Apr 22

With @YungSungChuang, @aleks_madry! For more, check out: Python package: github.com/MadryLab/AT2 Paper: arxiv.org/abs/2504.13752 Demo: bencw99.github.io/at2-citations/

Ben Cohen-Wang@bcohenwang·Apr 22

AT2 makes it practical to, for example, produce citations for an existing RAG system. Check out our demo which uses AT2 for citations in an LLM-powered search tool: bencw99.github.io/at2-citations/ (7/8)

Ben Cohen-Wang@bcohenwang·Feb 14

Increasingly, LLMs cite sources for claims they make, but are the sources they cite actually what they are using? In work led by @YungSungChuang, we design a reward to quantify this, and use this reward to (automatically) improve citation quality! 🧵

Yung-Sung Chuang@YungSungChuang

(1/5)🚨LLMs can now self-improve to generate better citations✅ 📝We design automatic rewards to assess citation quality 🤖Enable BoN/SimPO w/o external supervision 📈Perform close to “Claude Citations” API w/ only 8B model 📄arxiv.org/abs/2502.09604 🧑‍💻github.com/voidism/SelfCi…

Ben Cohen-Wang@bcohenwang·Jun 10

@kellerjordan0 @aleks_madry @josh_vendrow This is really cool! thanks for raising!

Keller Jordan@kellerjordan0·Jun 10

A confirmation of your intuition:

Intuition: In Figure 2, @bcohenwang et al. (2024) show that for logistic regression, out-of-support distribution shifts can induce variance between runs of training (i.e., a dependency on initialization), whereas in-support shifts do not. Intuitively, the authors suggest that neural networks may have similar behavior.

Confirmation: In Figure 9 of arxiv.org/abs/2304.01910, I show that we do indeed see this effect across 1000 repeated runs of ResNet-18 training on ImageNet. There is significant distribution-wise variance between runs when evaluating on ImageNet-sketch (out-of-support), and little variance on ImageNet-V2 (in-support). This result exactly matches the intuition of Cohen-Wang et al.

Aleksander Madry@aleks_madry·Mar 4

Models often fail under distribution shifts—can pre-training on a large and diverse dataset and then fine-tuning on a task-specific dataset help? W/ @bcohenwang, @josh_vendrow we show that this depends on the specific failure mode. In particular, pre-training can help with extrapolation, but does not address failures that stem from dataset biases.

Ben Cohen-Wang@bcohenwang·May 28

@feng_jiahai @harshays_ @kris_georgiev1 @aleks_madry This would be nice to have for k>1 to contextualize these values, but becomes very hard to compute.

Ben Cohen-Wang@bcohenwang·May 28

@feng_jiahai @harshays_ @kris_georgiev1 @aleks_madry Great point, yeah! For k=1 we're pretty much at this optimal log-prob drop (we'll include a formal evaluation in the paper, but you can already see for k=1 things look pretty saturated as we increase the number of ablations in the plots in the blog post).

Ben Cohen-Wang@bcohenwang·May 6

We introduce ContextCite, a tool that can help us understand when and how an LLM uses in-context information! w/ @harshays_, @kris_georgiev1, @aleks_madry Check out our demo: huggingface.co/spaces/context… Thread ⤵️

Aleksander Madry@aleks_madry

How is an LLM actually using the info given to it in its context? Is it misinterpreting anything or making things up? Introducing ContextCite: a simple method for attributing LLM responses back to the context: gradientscience.org/contextcite w/ @bcohenwang, @harshays_, @kris_georgiev1

Ben Cohen-Wang@bcohenwang·May 7

@cloutiness @aleks_madry @harshays_ @kris_georgiev1 We have an example notebook of using ContextCite with RAG: github.com/MadryLab/conte…

K.A.@cloutiness·May 7

@aleks_madry @bcohenwang @harshays_ @kris_georgiev1 RAG with in-line references if I'm not mistaken

Aleksander Madry@aleks_madry·May 6

GIF

Ben Cohen-Wang@bcohenwang·May 7

@xilinniao @harshays_ @kris_georgiev1 @aleks_madry Hi great question! This is definitely possible (this type of approach is usually called "leave-one-out"). We've tried this and it works reasonably well but is a lot more expensive than ContextCite. ContextCite only needs a small number of ablations due to sparsity (see our blog).

皮特@xilinniao·May 7

@bcohenwang @harshays_ @kris_georgiev1 @aleks_madry Thanks for sharing! Is it possible to just use step 2: iterate over every sentence in the context, mask it, compute the probability of generating the original response given the current mask, and use this probability as the "importance" of each sentence?
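The procedure asked about here can be sketched in a few lines. This is a minimal illustration, not ContextCite's implementation: `toy_log_prob` is a hypothetical stand-in for a language model's log-probability of the response given the context, and a sentence is "masked" simply by removing it.

```python
import math
from typing import Callable, List, Sequence


def _words(text: str) -> List[str]:
    """Lowercase whitespace tokenization with trailing punctuation stripped."""
    return [w.strip(".,!?").lower() for w in text.split()]


def toy_log_prob(context: Sequence[str], response: str) -> float:
    """Hypothetical stand-in for a model: the response is more likely
    the more of its words appear somewhere in the context."""
    context_words = set(_words(" ".join(context)))
    response_words = _words(response)
    covered = sum(w in context_words for w in response_words)
    return math.log((covered + 1) / (len(response_words) + 1))


def leave_one_out_scores(
    sentences: Sequence[str],
    response: str,
    log_prob: Callable[[Sequence[str], str], float],
) -> List[float]:
    """Importance of sentence i = drop in log-probability of the response
    when sentence i alone is removed from the context."""
    full = log_prob(sentences, response)
    scores = []
    for i in range(len(sentences)):
        ablated = list(sentences[:i]) + list(sentences[i + 1:])
        scores.append(full - log_prob(ablated, response))
    return scores


context = [
    "The Eiffel Tower is in Paris.",
    "Bananas are yellow.",
    "It was completed in 1889.",
]
scores = leave_one_out_scores(context, "Paris 1889", toy_log_prob)
# The unrelated middle sentence gets zero importance; the other two get positive scores.
```

With a real model, each ablation is a separate scoring pass, so a context of n sentences costs n + 1 passes.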
