Ekdeep Singh Lubana
@EkdeepL
Member of Technical Staff @GoodfireAI; Previously: Postdoc / PhD at Center for Brain Science, Harvard and University of Michigan

After 2 years probing #visionmodels at the #KempnerInstitute, @thomas_fel_ reflects on what he’s learned—and what pieces of the #interpretability puzzle remain hidden—as he heads to @GoodfireAI. Read the interview: bit.ly/4aEBzmp 🎙️🧩

Can we train models to have more monitorable CoT? We introduce Counterfactual Simulation Training to improve CoT faithfulness/monitorability. CST produces models that admit to reward hacking and deferring too much to Stanford profs (@chrisgpotts told me this is very dangerous)

New paper: It's time to optimize for 🔁 self-consistency 🔁. We’ve pushed LLMs to the limits of available data, yet failures like sycophancy and factual inconsistency persist. We argue these stem from the same assumption: that behavior can be specified one I/O pair at a time. 🧵


Evo 2, the largest fully open biological AI model to date, is now published in @Nature.

Bidding farewell to @KempnerInst! Leaving with great memories of fun moments and a truly supportive, wonderful community 🙏 If you want great research freedom surrounded by brilliant people, apply for the Kempner Fellowship. Cannot recommend it enough! Next stop: SF! 🌉

For this week's seminar, we are excited to host @EkdeepL from @GoodfireAI!

Date and Time: Thursday, February 19, 11:00 AM–12:00 PM Pacific Time
Zoom Link: stanford.zoom.us/j/93941842999?…
Title: Bayes-ed: Formalizing a Paradigm for Interpretability in the Language of Bayesian Inference

Abstract: Interpretability research has exploded in recent years, resulting in diverse, often heuristic attempts at understanding how models perform the tasks they do. In this talk, I intend to present steps towards a framework that helps concretize these heuristics and also expands the notion of what it means to interpret. Specifically, focusing on in-context learning, we will start our analysis with a behavior-first approach and define Bayesian models that predict both the outputs produced and, assuming power-law scaling, the learning dynamics of large-scale Transformers. We then use these Bayesian models as our guiding object and characterize how representations ought to be structured in order to support such a behavioral model, hence making feature geometry a core object of study for interpretability. This lens helps us characterize the limitations of several existing interpretability paradigms, e.g., SAEs, but also offers a path forward: either designing tools with appropriate geometrical assumptions or post-processing SAE activations. Critically, this implies there is no silver bullet in bottom-up interpretability: behavior guides what tool or post-processing ought to be used.

Grounded in this discussion, we then analyze the utility of our framework by assessing how representations can be used to influence behavior: we will make precise what inference-time interventions like activation steering are trying to achieve, how existing protocols rest on inherently incorrect assumptions, and how this can be fixed. Critically, making a formal link to post-training (grounded in existing Bayesian accounts of RLHF), we will show that inference-time interventions can be seen as rejection sampling, motivating a pipeline for amortizing this process and leading to scalable oversight approaches grounded in interpretability. As a case study, we will operationalize a naive version of this pipeline for the task of reducing hallucinations, resulting in a 58% reduction in hallucinated claims in an open-source LLM at 100x less cost than the use of a frontier-model judge.

Excited to see everyone at the seminar!
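To make the rejection-sampling framing concrete, here is a minimal Python sketch, under assumed names and APIs, of the pipeline the abstract describes: a probe scores each sampled completion, accepted completions are kept, and the accepted set is reused as fine-tuning data to amortize the sampling cost into the model. Everything here (generate_with_hidden, probe, ACCEPT_THRESHOLD) is an illustrative assumption, not the actual implementation from the talk.

```python
# Hedged sketch: "inference-time intervention as rejection sampling" plus its
# amortization into post-training. `generate_with_hidden` and `probe` are
# hypothetical stand-ins for a model call that returns hidden states and a
# linear probe trained to score, e.g., the factuality of a completion.

import torch

ACCEPT_THRESHOLD = 0.8  # assumed cutoff on the probe's score


def rejection_sample(generate_with_hidden, probe, prompt, max_tries=8):
    """Sample completions until one passes the probe's acceptance rule."""
    best_text, best_score = None, -1.0
    for _ in range(max_tries):
        text, hidden = generate_with_hidden(prompt)       # (str, Tensor), assumed API
        score = torch.sigmoid(probe(hidden)).item()       # probability-like score
        if score >= ACCEPT_THRESHOLD:
            return text, score
        if score > best_score:                            # keep best sample as fallback
            best_text, best_score = text, score
    return best_text, best_score


def build_amortization_set(generate_with_hidden, probe, prompts):
    """Collect accepted (prompt, completion) pairs as fine-tuning data, so the
    cost of rejection sampling is amortized into the model's weights."""
    return [(p, rejection_sample(generate_with_hidden, probe, p)[0]) for p in prompts]
```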

We used interpretability to scale RL against open-ended tasks, cutting Gemma 12B’s hallucination rate in half by teaching it to self-correct in tandem with our probing harness.
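One hedged guess at how a probing harness might plug into RL for self-correction (the post does not specify the setup): a probe flags suspect claims in a draft, the model revises, and the drop in flagged claims serves as the reward. Every name below (flag_claims, model.generate, probe_reward) is hypothetical.

```python
# Hypothetical sketch of a probe-in-the-loop reward for self-correction RL.
# None of these names come from the post; they only illustrate a probing
# harness scoring drafts and rewarding successful self-correction.

def probe_reward(flag_claims, model, prompt):
    """Reward = reduction in probe-flagged claims after one self-correction pass."""
    draft = model.generate(prompt)                          # assumed generation API
    flagged_before = flag_claims(draft)                     # list of suspect spans
    critique = f"These claims look unsupported: {flagged_before}. Please revise."
    revision = model.generate(prompt + "\n" + draft + "\n" + critique)
    flagged_after = flag_claims(revision)
    return len(flagged_before) - len(flagged_after), revision
```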