BerkeleyNLP (@BerkeleyNLP) - Twitter-Profil | Zamantika Mersobahis Locabet

BerkeleyNLP retweetet

Sewon Min@sewon__min·17 Şub

Exciting results on open-source modes for IMO-level problems - congratulations to @aviral_kumar2 and everyone involved!! Great to see @wenjie_ma's ProofGrader (proofgrader.github.io) integrated into the development ✨

Lewis Tunstall@_lewtun

We trained a tiny 4B model to reason for millions of tokens through IMO-level problems. Heaps excited to share our new blog post covering the full pipeline, from distilling the 🐳 to augmenting RL with a reasoning cache that unlocks extreme inference-time scaling for theorem proving. huggingface.co/spaces/lm-prov…

English

0

11

83

11.5K

BerkeleyNLP retweetet

Sewon Min@sewon__min·12 Ara

Really excited about this work!! As a retrieval person, having a pre-training-scale retrieval index in an academic setting has long been a dream, and I thought it would be too difficult / infeasible. Collaborating with systems experts made it possible much earlier than I expected. Huge thanks to the students driving this: @YichuanM and @jinjianliuu !

Yichuan Wang@YichuanM

(1/N) 🚀 DS-Serve is a framework for efficient, scalable neural retrieval — it turns any in-house dataset (<1T tokens) into a high-throughput (up to 10,000 QPS), low-latency (<100ms), memory-efficient (<200GB RAM) retrieval system with a web UI and API. With DS-Serve, we publicly deployed a 400B-token datastore of high-quality LLM pretraining data (2B vectors), spanning academic resources — and it matches commercial search endpoints on our benchmarks at extremely low latency and high throughput. Try it out: api.ds-serve.org:30888/ui Blog: berkeley-large-rag.github.io/RAG-DS-Serve Work from UC Berkeley ( @BerkeleyNLP & @BerkeleySky) with collaborators at UW & UIUC!

English

5

15

121

23.2K

BerkeleyNLP retweetet

Yichuan Wang@YichuanM·12 Ara

(1/N) 🚀 DS-Serve is a framework for efficient, scalable neural retrieval — it turns any in-house dataset (<1T tokens) into a high-throughput (up to 10,000 QPS), low-latency (<100ms), memory-efficient (<200GB RAM) retrieval system with a web UI and API. With DS-Serve, we publicly deployed a 400B-token datastore of high-quality LLM pretraining data (2B vectors), spanning academic resources — and it matches commercial search endpoints on our benchmarks at extremely low latency and high throughput. Try it out: api.ds-serve.org:30888/ui Blog: berkeley-large-rag.github.io/RAG-DS-Serve Work from UC Berkeley ( @BerkeleyNLP & @BerkeleySky) with collaborators at UW & UIUC!

GIF

English

5

53

172

63.5K

BerkeleyNLP retweetet

Jiaxin Ge@aomaru_21490·20 Eki

✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 echo-bench.github.io

English

4

32

126

45.3K

BerkeleyNLP retweetet

Sewon Min@sewon__min·17 Eki

Super excited about @wenjie_ma's work on verifying math proofs! ✅ 24 competitions, 3 SoTAs (o3, Gemini-2.5-Pro, R1) ✅ Strong evaluator -- a carefully designed evaluator with simple ensemble beats agentic ones ✅ Strong best-of-n performance Check out the paper & website!

Wenjie Ma@wenjie_ma

LLMs solving math benchmarks with verifiable answers like AIME? ✅ LLMs solving math proofs? ❌ Still an open problem. RL works great for final-answer problems, but proofs are different: - Often no single checkable answer - Correct answers can hide flawed reasoning The key bottleneck: reliable proof evaluation. Without a good evaluator, we can't automatically evaluate or train better "provers." Our new work tackles this challenge step by step. 🧵 📄 Paper: arxiv.org/pdf/2510.13888

English

3

15

119

32.3K

BerkeleyNLP retweetet

Wenjie Ma@wenjie_ma·17 Eki

LLMs solving math benchmarks with verifiable answers like AIME? ✅ LLMs solving math proofs? ❌ Still an open problem. RL works great for final-answer problems, but proofs are different: - Often no single checkable answer - Correct answers can hide flawed reasoning The key bottleneck: reliable proof evaluation. Without a good evaluator, we can't automatically evaluate or train better "provers." Our new work tackles this challenge step by step. 🧵 📄 Paper: arxiv.org/pdf/2510.13888

English

9

37

194

60K

BerkeleyNLP retweetet

Kayo Yin@kayo_yin·29 May

Happy to announce the first workshop on Pragmatic Reasoning in Language Models — PragLM @ COLM 2025! 🧠🎉 How do LLMs engage in pragmatic reasoning, and what core pragmatic capacities remain beyond their reach? 🌐 sites.google.com/berkeley.edu/p… 📅 Submit by June 23rd

English

6

22

94

57.8K

BerkeleyNLP retweetet

Ruiqi Zhong @NeurIPS 25@ZhongRuiqi·16 May

Last day of PhD! I pioneered using LLMs to explain dataset&model. It's used by interp at @OpenAI and societal impact @AnthropicAI Tutorial here. It's a great direction & someone should carry the torch :) Thesis available, if you wanna read my acknowledgement section=P

English

30

36

541

57.6K

BerkeleyNLP retweetet

Nicholas Tomlin@NickATomlin·13 May

The long-term goal of AI is to build models that can handle arbitrary tasks, not just ones they’ve been trained on. We hope our new *benchmark generator* can help measure progress toward this vision

Vivek Verma@vcubingx

🎮 Excited to announce gg-bench, a fully synthetic benchmark for LLMs consisting of games generated entirely by LLMs!! This benchmark centers around the fact that LLMs are capable of generating complex tasks that they themselves cannot even solve. 📄: arxiv.org/abs/2505.07215

English

4

29

179

25.9K

BerkeleyNLP retweetet

Vivek Verma@vcubingx·13 May

🎮 Excited to announce gg-bench, a fully synthetic benchmark for LLMs consisting of games generated entirely by LLMs!! This benchmark centers around the fact that LLMs are capable of generating complex tasks that they themselves cannot even solve. 📄: arxiv.org/abs/2505.07215

English

3

25

147

38K

BerkeleyNLP retweetet

Nicholas Tomlin@NickATomlin·15 Nis

I'm incredibly excited to share that I'll be joining @TTIC_Connect as an assistant professor in Fall 2026! Until then, I'm wrapping up my PhD at Berkeley, and after that I'll be a faculty fellow at @NYUDataScience

English

33

10

198

16.7K

BerkeleyNLP retweetet

Ruiqi Zhong @NeurIPS 25@ZhongRuiqi·10 Nis

Finished my dissertation!!! (scalable oversight,link below) Very fortunate to have @JacobSteinhardt and Dan Klein as my advisors! Words can't describe my gratitude, so I used a pic of Frieren w/ her advisor :) Thanks for developing my research mission, and teaching me magic

English

27

11

395

23.5K

BerkeleyNLP retweetet

Lakshya A Agrawal@LakshyAAAgrawal·3 Mar

🧵Introducing LangProBe: the first benchmark testing where and how composing LLMs into language programs affects cost-quality tradeoffs! We find that, on avg across diverse tasks, smaller models within optimized programs beat calls to larger models at a fraction of the cost.

English

3

45

142

29.8K

BerkeleyNLP retweetet

Kayo Yin@kayo_yin·28 Şub

Induction heads are commonly associated with in-context learning, but are they the primary driver of ICL at scale? We find that recently discovered "function vector" heads, which encode the ICL task, are the actual primary drivers of few-shot ICL. arxiv.org/abs/2502.14010 🧵

English

17

113

784

288.1K

BerkeleyNLP retweetet

Charlie Snell@sea_snell·27 Kas

Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task? We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵

English

14

75

573

156.3K

BerkeleyNLP retweetet

Kayo Yin@kayo_yin·12 Kas

Cool new dataset for translation ambiguity in 9 language pairs (7 low-resource), and we found LLM-generated descriptions help weaker models resolve ambiguity! @BaruaJosh will be presenting this at the 2-3:30pm poster session today, come talk to us about multilinguality in LLMs!

Josh Barua@BaruaJosh

Do LLMs encode knowledge of concept variation across languages? Can they use this knowledge to resolve ambiguity in translation? Our #EMNLP2024 paper finds a big performance gap between closed- and open-weight LLMs, but lexical rules can help transfer knowledge across models! 🧵

English

0

2

6

6K

BerkeleyNLP retweetet

Josh Barua@BaruaJosh·12 Kas

Do LLMs encode knowledge of concept variation across languages? Can they use this knowledge to resolve ambiguity in translation? Our #EMNLP2024 paper finds a big performance gap between closed- and open-weight LLMs, but lexical rules can help transfer knowledge across models! 🧵

English

2

12

38

6K

BerkeleyNLP retweetet

Kayo Yin@kayo_yin·11 Kas

🚨New dataset + challenge #EMNLP2024🚨 We release ASL STEM Wiki: the first signing dataset of STEM articles! 📰 254 Wikipedia articles 📹 ~300 hours of ASL interpretations 👋 New task: automatic sign suggestion to make STEM education more accessible microsoft.com/en-us/research… 🧵

English

8

21

107

15.3K

BerkeleyNLP retweetet

Ruiqi Zhong @NeurIPS 25@ZhongRuiqi·2 Eki

Given the rapid progress of LLMs, I feel compelled to present this topic (even if it's not the main focus of my Ph.D. work). I will cover concrete ML problems related to "AI deception" -- undesirable behaviors of AI systems that are hard to catch -- and how to study this "Sci-Fi" topic scientifically.

Stanford NLP Group@stanfordnlp

For this week’s NLP Seminar, we are thrilled to host @ZhongRuiqi to talk about Concrete Problems in AI Deception: From Evaluation Gaming to Cyber Attack! When: 10/3 Thurs 11am PT Non-Stanford affiliates registration form: forms.gle/ez4PtVWVATbnT2… (closed at 9am PT on the talk day)

English

3

17

121

25.6K

BerkeleyNLP retweetet

Ruiqi Zhong @NeurIPS 25@ZhongRuiqi·23 Eyl

Graphical models struggle to explain patterns in text & images 😭 LLM can do this but hallucinates. 👿 It’s time to combine their strengths! We define models with natural language parameters! Unlocking opportunities in science, business, ML, etc

English

8

30

222

33.8K

BerkeleyNLP

Entdecken