Ekdeep Singh Lubana

764 posts

@EkdeepL

Member of Technical Staff @GoodfireAI; Previously: Postdoc / PhD at Center for Brain Science, Harvard and University of Michigan

San Francisco, CA · Joined December 2017
1.3K Following · 2.6K Followers
Pinned Tweet
Ekdeep Singh Lubana@EkdeepL·
🚨New paper! We know models learn distinct in-context learning strategies, but *why*? Why generalize instead of memorize to lower loss? And why is generalization transient? Our work explains this & *predicts Transformer behavior throughout training* without its weights! 🧵 1/
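For intuition, here is a toy sketch (an illustration only, with made-up numbers, not the paper's actual model) of the Bayesian view the thread gestures at: two candidate in-context strategies, a "memorizing" solution favored by the prior and a "generalizing" solution that fits the context, with posterior weights that shift toward generalization as more context arrives.

```python
# Toy illustration (assumed setup, not the paper's model): posterior over a
# "memorizing" vs. a "generalizing" in-context strategy.
import numpy as np

rng = np.random.default_rng(0)
noise = 0.1

def log_lik(pred, xs, ys):
    # Gaussian log-likelihood of the observed context under a predictor.
    return -0.5 * np.sum(((ys - pred(xs)) / noise) ** 2)

true_rule = lambda x: 2.0 * x           # "generalizing" strategy: applies the context rule
memorized = lambda x: np.zeros_like(x)  # stand-in "memorized" solution that ignores the rule

for n_ctx in (1, 4, 16):
    xs = rng.uniform(-1, 1, n_ctx)
    ys = true_rule(xs) + rng.normal(0, noise, n_ctx)
    logs = np.array([log_lik(memorized, xs, ys), log_lik(true_rule, xs, ys)])
    logs += np.log(np.array([0.9, 0.1]))  # prior favoring the memorizing solution
    w = np.exp(logs - logs.max())
    w /= w.sum()
    print(f"context={n_ctx:2d}  P(memorize)={w[0]:.3f}  P(generalize)={w[1]:.3f}")
```

The mixture's predictions, and how its weights move with data, are the kind of quantity that can be computed without ever looking at the Transformer's weights.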
Ekdeep Singh Lubana retweeted
Goodfire@GoodfireAI·
LLMs often reason “performatively” well after deciding on a final answer - something that CoT monitors are slow to catch. Our new paper finds that:
- probes can help monitor for this
- it seems to track with task difficulty
- probes enable early CoT exit, saving tokens! (1/7)
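A rough sketch of the probe-based early-exit idea (illustrative only, with synthetic data, not Goodfire's code): fit a linear probe on hidden activations to predict whether the model has already settled on its answer, then stop generating chain of thought once the probe is confident.

```python
# Illustrative sketch (synthetic data, hypothetical setup): a linear probe on
# activations gates early exit from chain-of-thought generation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64

# Synthetic training data: activations labelled "answer decided" (1) vs "still reasoning" (0).
# In practice these labels and activations would come from a real model's residual stream.
X = rng.normal(size=(2000, d_model))
w_true = rng.normal(size=d_model)
y = (X @ w_true + rng.normal(scale=0.5, size=2000) > 0).astype(int)
probe = LogisticRegression(max_iter=1000).fit(X, y)

def should_exit_cot(activation, threshold=0.95):
    """Early-exit rule: stop emitting CoT once the probe is confident the answer is decided."""
    return probe.predict_proba(activation[None, :])[0, 1] >= threshold

# Mock decoding loop over CoT steps; activations drift toward the "decided" direction.
for step in range(10):
    act = rng.normal(size=d_model) + step * 0.3 * w_true
    if should_exit_cot(act):
        print(f"exiting CoT at step {step}, emitting final answer")
        break
```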
Ekdeep Singh Lubana@EkdeepL·
@wendlerch @PresItamar Yeah agreed! I think DINO-style learning was impractical for LLMs in the past, but increasingly feasible if you use LLM judges for labeling to define sets for consistency.
Chris Wendler@wendlerch·
@PresItamar This feels like DINO 🦕 ported to LLMs? Seems like a good idea.
Itamar Pres@PresItamar·
New paper: It's time to optimize for 🔁self-consistency 🔁 We’ve pushed LLMs to the limits of available data, yet failures like sycophancy and factual inconsistency persist. We argue these stem from the same assumption: that behavior can be specified one I/O pair at a time. 🧵
Peter Hase@peterbhase·
@PresItamar @belindazli @LauraRuis @jacobandreas @CarlGuo866 @HuLillian39250 @MehulDamani2 @EkdeepL @ishapuri101 Nice paper, sounds like a great way to tie a lot of problems together! You might be interested in what we just did for CoT faithfulness, which I think is basically the kind of training you argue for: x.com/peterbhase/sta…
Peter Hase@peterbhase

Can we train models to have more monitorable CoT? We introduce Counterfactual Simulation Training to improve CoT faithfulness/monitorability. CST produces models that admit to reward hacking and deferring too much to Stanford profs (@chrisgpotts told me this is very dangerous)

Ekdeep Singh Lubana@EkdeepL·
Bro is coming to town 😎
Thomas Fel@thomas_fel_

Bidding farewell to @KempnerInst! Leaving with great memories of fun moments and a truly supportive, wonderful community 🙏 If you want great research freedom surrounded by brilliant people, apply for the Kempner Fellowship. Cannot recommend it enough! Next stop: SF! 🌉

Rylan Schaeffer@RylanSchaeffer·
Every few months, I privately run my personal AI research benchmark: I give 3-5 papers to SOTA models, vaguely suggest a research direction, and have them work as autonomously as possible to put together a paper. This is the first time I think the paper might be decent 🧵 1/N
Thomas Fel@thomas_fel_·
Bidding farewell to @KempnerInst! Leaving with great memories of fun moments and a truly supportive, wonderful community 🙏 If you want great research freedom surrounded by brilliant people, apply for the Kempner Fellowship. Cannot recommend it enough! Next stop: SF! 🌉
Ekdeep Singh Lubana retweeted
Hadas Orgad@OrgadHadas·
A growing body of work, including ours, showed that LLMs encode more about truthfulness internally than their outputs reflect. @GoodfireAI's new paper puts this to use: train probes on activations to detect hallucinations, then use those probe scores as RL rewards to reduce them
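A minimal sketch of the recipe described above (illustrative stand-ins, not the paper's implementation): a linear probe on activations scores each claim for hallucination risk, and the rollout-level reward handed to the RL algorithm is simply how well the claims score.

```python
# Illustrative sketch (hypothetical probe and activations): probe scores -> RL reward.
import numpy as np

def probe_hallucination_prob(claim_activation, probe_w, probe_b):
    """Linear probe on a claim's activations -> probability the claim is hallucinated."""
    return 1.0 / (1.0 + np.exp(-(claim_activation @ probe_w + probe_b)))

def rollout_reward(claim_activations, probe_w, probe_b):
    """Reward = average probe-judged probability that the rollout's claims are supported."""
    probs = [probe_hallucination_prob(a, probe_w, probe_b) for a in claim_activations]
    return float(np.mean([1.0 - p for p in probs]))

# Hypothetical example: three claims from one rollout, 16-dim activations.
rng = np.random.default_rng(0)
probe_w, probe_b = rng.normal(size=16), 0.0
claims = [rng.normal(size=16) for _ in range(3)]
print("probe-based reward for this rollout:", rollout_reward(claims, probe_w, probe_b))
# This scalar would then be passed to a standard RL trainer (e.g., PPO/GRPO-style)
# as the reward for the sampled response.
```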
Ekdeep Singh Lubana@EkdeepL·
Very excited about this talk on Thursday! I’ll be presenting a lot of new work (including ongoing stuff), so this will be both daunting and exciting haha.
Stanford NLP Group@stanfordnlp

For this week's seminar, we are excited to host @EkdeepL from @GoodfireAI!
Date and Time: Thursday, February 19, 11:00 AM - 12:00 PM Pacific Time.
Zoom Link: stanford.zoom.us/j/93941842999?…
Title: Bayes-ed: Formalizing a Paradigm for Interpretability in the Language of Bayesian Inference
Abstract: Interpretability research has exploded in recent years, resulting in diverse, often heuristic attempts at understanding how models perform the tasks they do. In this talk, I intend to present steps towards a framework that helps concretize these heuristics and also expands the notion of what it means to interpret. Specifically, focusing on in-context learning, we will start our analysis with a behavior-first approach and define Bayesian models that predict both the outputs produced and, assuming power-law scaling, the learning dynamics of large-scale Transformers. We then use these Bayesian models as our guiding object and characterize how representations ought to be structured in order to support such a behavioral model, hence making feature geometry a core object of study for interpretability. This lens helps us characterize the limitations of several existing interpretability paradigms, e.g., SAEs, but also offers a path forward: either designing tools with appropriate geometrical assumptions or post-processing SAE activations. Critically, this implies there is no silver bullet in bottom-up interpretability: behavior guides what tool or post-processing ought to be used. Grounded in this discussion, we then analyze the utility of our framework by assessing how representations can be used to influence behavior: we will make precise what inference-time interventions like activation steering are trying to achieve, how existing protocols rest on inherently incorrect assumptions, and how this can be fixed. Critically, by making a formal link to post-training (grounded in existing Bayesian accounts of RLHF), we will show that inference-time interventions can be seen as rejection sampling, motivating a pipeline for amortizing this process and leading to scalable oversight approaches grounded in interpretability. As a case study, we will operationalize a naive version of this pipeline for the task of reducing hallucinations, resulting in a 58% reduction in hallucinated claims in an open-source LLM at 100x less cost than using a frontier-model judge.
Excited to see everyone at the seminar!
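On the "inference-time interventions as rejection sampling" point in the abstract: a toy sketch of what the un-amortized version looks like (hypothetical sample_response and probe_score stand-ins), where responses are drawn until a probe-based acceptance test passes; RL post-training then tries to bake this filter into the policy itself.

```python
# Toy sketch (hypothetical stand-in functions): probe-gated rejection sampling at inference.
import random

def sample_response(prompt):
    # Stand-in for sampling a completion from the base policy.
    return f"response to {prompt!r} (draw {random.random():.3f})"

def probe_score(response):
    # Stand-in for a probe-derived acceptance score in [0, 1], e.g., P(not hallucinated).
    return random.random()

def rejection_sample(prompt, threshold=0.8, max_tries=32):
    """Accept the first sampled response whose probe score clears the threshold."""
    response = None
    for _ in range(max_tries):
        response = sample_response(prompt)
        if probe_score(response) >= threshold:
            return response
    return response  # fall back to the last draw if nothing was accepted

print(rejection_sample("Who discovered penicillin?"))
```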

Ekdeep Singh Lubana@EkdeepL·
@StephenLCasper @GoodfireAI The “Judge RL” baseline already involves CoT. Look at App F.3 / App. K.1.7 for prompt details. Happy to emphasize this more if it wasn’t clear to you.
Cas (Stephen Casper)@StephenLCasper·
@EkdeepL @GoodfireAI If it were me, I would at least add CoT as a baseline and mention that the probe method involved the GT while the baselines didn't.
Cas (Stephen Casper)@StephenLCasper·
@GoodfireAI, IIUC, I'm very skeptical. In its central results, I think the paper either sets baselines up to fail or omits relevant ones entirely. If the paper is going to claim that RLFR is useful, Fig 4 needs to show it beating comparable (and simpler) baselines like RL on ground-truth rewards, DPO, or ITI. In particular, I'd bet that DPO using ground-truth labels would be a simpler and better alternative to RLFR. And it would have been a fair comparison, because the probes were trained using the exact same labels and data that could have been used for DPO.

In Fig 7, the "Judge" baseline was based on the same model that you were evaluating (plus web search) against the ground truth. So I'm not surprised if the model is bad at recognizing its own mistakes. No experiment was done to compare the Judge-RL rewards against the ground truth or to see how much web search helped. Meanwhile, the ground truth WAS used to train the probe, which makes for a completely unfair comparison. It also simultaneously obviates any practical value of the probes when the ground truth is already available and used for both training and testing.

It seems like Goodfire is adding to a pattern in which it (1) accomplishes something simple in a circuitous way using interpretability, (2) reports on it misleadingly, and (3) hypes it more than is warranted. I think it's the kind of thing you expect from a venture-capital-backed for-profit company trying to sell interp as a product.

What would fix my concerns with this paper? For example, if Goodfire showed (1) that their RLFR method beats DPO using the same data used to train the probes, or (2) that their RLFR method can handle noisy ground-truth labels better than other methods, then I'd chalk this up as a win. Maybe other experiments could show competitiveness too. But otherwise, and unless I'm misunderstanding something, I think of this paper as claiming much while delivering little.
Goodfire@GoodfireAI

We used interpretability to scale RL against open-ended tasks, cutting Gemma 12B’s hallucination rate in half by teaching it to self-correct in tandem with our probing harness.

Ekdeep Singh Lubana@EkdeepL·
@StephenLCasper @GoodfireAI Sure, but at that point you're describing a new data-curation pipeline for longform DPO that will have its own hyperparameters. I also don't expect this to work particularly well. Nevertheless, if we must spend the energy and $$$, we might give this a shot to verify your claim.
Ekdeep Singh Lubana@EkdeepL·
@StephenLCasper @GoodfireAI The text referring to Fig 7 is on page 7 (titled "Test time"), and it clarifies what the experiment is trying to gauge. Nevertheless, given your comment, more than happy to expand the caption for Fig 7 and clarify the experiment's goal.
Cas (Stephen Casper)@StephenLCasper·
I see what you mean here, and I think this makes sense. I suppose how I feel about this depends on whether it is presented as a simple demo of only the point that you describe, or as a feature of RLFR that makes it competitive. My reading of the caption made it seem more like the latter was the vibe coming across. And I still think that, because ground truth was involved in the probing experiment and not in the prompting experiment, the comparison is not fair.
Ekdeep Singh Lubana@EkdeepL·
@StephenLCasper @GoodfireAI during evals. Also, just noting that such a protocol of using a frontier model as a verifier would make no sense at inference: it's an unbounded expense at that point. The whole point of doing RL here is to amortize the inference process.
Ekdeep Singh Lubana@EkdeepL·
@StephenLCasper @GoodfireAI I meant 300K to check and fix answers in rollouts produced by a policy so that you can RL against it, i.e., use a frontier model as a verifier. If you meant a baseline where the model is used as a verifier at inference time, well that is our ground truth...