Ben Cohen-Wang

25 posts

@bcohenwang

Pre-training @AnthropicAI. On leave from PhD at MIT with Aleksander Madry.

Joined September 2022
241 Following, 198 Followers

Ben Cohen-Wang (@bcohenwang)
@brianryhuang Sorry I'm not following. What's the mechanism here that encourages hallucinations? If the answer is correct regardless of the post-hoc reasoning chain (because the model has memorized it), then it seems like RL wouldn't push the reasoning chain to do anything in particular?

Brian Huang (@brianryhuang)
@bcohenwang A good illustrative example is math "trick questions" (think Monty Hall); the correct answer is well-known, and so are the erroneous solutions. If the reasoning chain is post-hoc, the model might output wrong solution + correct answer, and do this often for this kind of problem

Ben Cohen-Wang (@bcohenwang)
Popular reasoning benchmarks just reward correct answers (they don't penalize guessing). This incentivizes models that guess when they're not sure, which (beyond hurting usability) seems like it would encourage hallucinations more broadly. Is this why o3 etc. hallucinate a lot?
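
To make the incentive concrete, here is a minimal sketch of the expected score under a simple grading scheme. It assumes +1 for a correct answer, 0 for abstaining, and a configurable penalty for a wrong answer; `expected_score`, `should_guess`, and `wrong_penalty` are illustrative names and are not taken from any particular benchmark.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score for answering when the answer is right with probability p_correct.

    Assumed grading: +1 if correct, -wrong_penalty if wrong. Abstaining scores 0.
    """
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """Guessing beats abstaining exactly when its expected score is positive."""
    return expected_score(p_correct, wrong_penalty) > 0.0


# With no penalty, even a 10% chance of being right makes guessing worthwhile.
print(should_guess(0.1))
# With a penalty of 1 per wrong answer, the same guess is no longer worthwhile.
print(should_guess(0.1, wrong_penalty=1.0))
```

Under these assumptions, a penalty of c per wrong answer makes guessing pay off only when p_correct > c / (1 + c); with c = 0 that threshold is 0, so guessing is never discouraged.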

Ben Cohen-Wang (@bcohenwang)
@brianryhuang Interesting--would this *encourage* hallucinations? It seems like it just wouldn't penalize hallucinations in post-hoc reasoning. I think as long as hallucinations aren't encouraged, they can be mitigated through, e.g., factuality RL (if they are encouraged, you get a tradeoff).

Brian Huang (@brianryhuang)
@bcohenwang Another intuition I have here comes from the CoT faithfulness literature. Because the reasoning chain is sometimes not necessary for the model to reach the answer, the model can output post-hoc CoTs with mistakes before a correct answer, and these mistakes are not penalized.

Ben Cohen-Wang (@bcohenwang)
It can be helpful to pinpoint the in-context information that a language model uses when generating content (is it using provided documents? or its own intermediate thoughts?). We present Attribution with Attention (AT2), a method for doing so efficiently and reliably! (1/8)

Ben Cohen-Wang (@bcohenwang)
@ko175041 @YungSungChuang @aleks_madry Thanks! A two-layer NN does a little better than a coefficient for each head, but not enough to make the added complexity worth it! Another potential axis for improvement is to add attention features beyond just the "first-order" attention weights.
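
As a rough illustration of what "a coefficient for each head" could mean, the sketch below scores each source as a weighted sum of per-head attention. It is an assumption-laden toy, not the AT2 implementation: the `attn[h][s]` layout, the way attention is aggregated per source, and the function name are all hypothetical.

```python
from typing import List


def head_weighted_scores(attn: List[List[float]], coeffs: List[float]) -> List[float]:
    """Score each source as sum_h coeffs[h] * attn[h][s].

    attn[h][s]: attention mass that head h places on source s
                (hypothetical aggregate over generated and source tokens).
    coeffs[h]:  one learned coefficient per attention head (assumed given).
    """
    if len(attn) != len(coeffs):
        raise ValueError("need exactly one coefficient per head")
    num_sources = len(attn[0])
    return [
        sum(c * row[s] for c, row in zip(coeffs, attn))
        for s in range(num_sources)
    ]


# Two heads, two sources: the second head is weighted more heavily
# and attends only to the second source.
scores = head_weighted_scores([[0.5, 0.5], [0.0, 1.0]], [1.0, 2.0])
print(scores)
```

A linear form like this has one parameter per head, which is the simplicity the reply weighs against the small gain from a two-layer network.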

Ting-Wen Ko (@ko175041)
@bcohenwang @YungSungChuang @aleks_madry Also a faithfulness enthusiast 👋 I find this work very interesting! Question: would it be even better to train a simple NN rather than a coefficient for each attention head?

Ben Cohen-Wang (@bcohenwang)
AT2 makes it practical to, for example, produce citations for an existing RAG system. Check out our demo which uses AT2 for citations in an LLM-powered search tool: bencw99.github.io/at2-citations/ (7/8)

Ben Cohen-Wang (@bcohenwang)
Increasingly, LLMs cite sources for claims they make, but are the sources they cite actually what they are using? In work led by @YungSungChuang, we design a reward to quantify this, and use this reward to (automatically) improve citation quality! 🧵
Yung-Sung Chuang (@YungSungChuang)

(1/5) 🚨 LLMs can now self-improve to generate better citations ✅
📝 We design automatic rewards to assess citation quality
🤖 Enable BoN/SimPO w/o external supervision
📈 Perform close to “Claude Citations” API w/ only 8B model
📄 arxiv.org/abs/2502.09604
🧑‍💻 github.com/voidism/SelfCi…

Keller Jordan (@kellerjordan0)
A confirmation of your intuition:

Intuition: In Figure 2, @bcohenwang et al. (2024) show that for logistic regression, out-of-support distribution shifts can induce variance between runs of training (i.e., a dependency on initialization), whereas in-support shifts do not. Intuitively, the authors suggest that neural networks may have similar behavior.

Confirmation: In Figure 9 of arxiv.org/abs/2304.01910, I show that we do indeed see this effect across 1000 repeated runs of ResNet-18 training on ImageNet. There is significant distribution-wise variance between runs when evaluating on ImageNet-sketch (out-of-support), and little variance on ImageNet-V2 (in-support). This result exactly matches the intuition of Cohen-Wang et al.

Aleksander Madry (@aleks_madry)
Models often fail under distribution shifts—can pre-training on a large and diverse dataset and then fine-tuning on a task-specific dataset help? W/ @bcohenwang, @josh_vendrow we show that this depends on the specific failure mode. In particular, pre-training can help with extrapolation, but does not address failures that stem from dataset biases.

Ben Cohen-Wang (@bcohenwang)
@feng_jiahai @harshays_ @kris_georgiev1 @aleks_madry Great point, yeah! For k=1 we're pretty much at this optimal log-prob drop (we'll include a formal evaluation in the paper, but you can already see for k=1 things look pretty saturated as we increase the number of ablations in the plots in the blog post).

Ben Cohen-Wang (@bcohenwang)
@xilinniao @harshays_ @kris_georgiev1 @aleks_madry Hi, great question! This is definitely possible (this type of approach is usually called "leave-one-out"). We've tried this and it works reasonably well, but it is a lot more expensive than ContextCite. ContextCite only needs a small number of ablations due to sparsity (see our blog).

皮特 (@xilinniao)
@bcohenwang @harshays_ @kris_georgiev1 @aleks_madry Thanks for sharing! Is it possible to use just step 2: iterate over every sentence in the context, mask it, compute the probability of generating the original response given the current mask, and use this probability as the "importance" of each sentence?
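
The leave-one-out idea discussed in this exchange can be sketched as follows. This is a minimal sketch, not ContextCite: `log_prob` is a hypothetical callable standing in for a model query, `toy_log_prob` is a made-up stand-in for demonstration, and scoring by log-probability drop (rather than raw probability) is an assumption.

```python
from typing import Callable, List, Sequence


def leave_one_out_scores(
    sentences: Sequence[str],
    response: str,
    log_prob: Callable[[Sequence[str], str], float],
) -> List[float]:
    """Importance of each context sentence = drop in log-prob when it is removed."""
    full = log_prob(sentences, response)  # one call with the full context
    scores = []
    for i in range(len(sentences)):
        # Ablate sentence i and re-score the original response.
        ablated = [s for j, s in enumerate(sentences) if j != i]
        scores.append(full - log_prob(ablated, response))
    return scores


def toy_log_prob(context: Sequence[str], response: str) -> float:
    # Stand-in for a model: the response is likely only if the key sentence is present.
    return 0.0 if "Paris is the capital of France." in context else -5.0


context = ["The sky is blue.", "Paris is the capital of France."]
print(leave_one_out_scores(context, "Paris", toy_log_prob))
```

This needs one scoring call per context sentence plus one for the full context, which is the cost the reply contrasts with ContextCite's small number of ablations.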