Sewon Min

20

51

11.7K

Sewon Min retweetledi

Raj Movva@rajivmovva·26 Şub

🎉 Excited that WIMHF was selected for an oral at ICLR 2026!

Raj Movva@rajivmovva

📣NEW PAPER! What's In My Human Feedback? (WIMHF) 🔦 Human feedback can induce unexpected/harmful changes to LLMs, like overconfidence or sycophancy. How can we forecast these behaviors ahead of time? Using SAEs, WIMHF automatically extracts these signals from preference data.

English

We trained a tiny 4B model to reason for millions of tokens through IMO-level problems. Heaps excited to share our new blog post covering the full pipeline, from distilling the 🐳 to augmenting RL with a reasoning cache that unlocks extreme inference-time scaling for theorem proving. huggingface.co/spaces/lm-prov…

9

79

27K

Sewon Min retweetledi

Oscar Yinn@yinn_oscar·24 Şub

Many people are using RL to make models smarter. We used RL to pull training data out of the models themselves. Our results show that models know a lot more about their training data than most people think. We develop Active Data Reconstruction Attack (ADRA) — a data detection method that uses RL to induce models to reconstruct data seen during training. ADRA beats existing methods by an average of >10% across pre-training, post-training, and distillation. Our paper, with @uwnlp, @Cornell, and @BerkeleyNLP @Berkeleyai, is now available. Arxiv: arxiv.org/pdf/2602.19020 Joint work with @jxmnop @shmatikov @sewon__min @HannaHajishirzi

English

4

38

181

10.9K

Sewon Min@sewon__min·17 Şub

Exciting results on open-source modes for IMO-level problems - congratulations to @aviral_kumar2 and everyone involved!! Great to see @wenjie_ma's ProofGrader (proofgrader.github.io) integrated into the development ✨

Lewis Tunstall@_lewtun

English

11

83

11.5K

Sewon Min@sewon__min·14 Şub

@joon_s_pk @karpathy @drfeifei @adamdangelo @rauchg @scottbelsky Congratulations!

English

1

0

2

434

Joon Sung Park@joon_s_pk·12 Şub

Introducing Simile. Simulating human behavior is one of the most consequential and technically difficult problems of our time. We raised $100M from Index, Hanabi, A* BCV, @karpathy @drfeifei @adamdangelo @rauchg @scottbelsky among others.

English

501

841

7.8K

2.3M

Sewon Min retweetledi

Tal Linzen@tallinzen·11 Şub

My take on the substance of the matter: if you want to study how humans are using a new technology in high-stakes social context like medicine, you need to study it carefully, in a controlled human study. These questions are too important to leave to substackers who spend a couple of hours on each post or to the big labs' comms teams. I didn't read the study any more than the two famous journalists who tweeted or retweeted about it did, but as far as I can tell the authors of this particular study did everything right. They put a preprint on arXiv as soon as the study was concluded, and then also submitted it for publication. Submitting it seems like a good move. They got a few more people to evaluate their methodology, and, I suppose, the snazzy Nature Group typesetting got them the attention of two famous journalists who prior to this publication didn't seem to be interested in this topic (or aware of the preprint that was released about a year ago). Of course it would be nice to apply the evaluation methodology the authors propose to every weekly update from every big lab, but I don't think that's a reasonable expectation from academia. Maybe the action editor should have asked for one last replication with 2025 models before accepting this for publication. But the important thing is that this article points out a gap between models' performance on medical questions (which was already high in 2024) and the outcomes of the models' interactions with humans, and it advocates for more realistic evaluations that include the human component of the equation. Now it's up to companies and policymakers to decide what to do with this information.

Tal Linzen@tallinzen

I tried to find the tweet from yesterday where @mattyglesias expressed an opinion about academic publishing and had to scroll past pages and pages of tweets where he had equally strong opinions about literally dozens of unrelated topics

English

4

37

7.8K

Sewon Min retweetledi

Shannon Shen@shannonzshen·9 Şub

Super excited to share our open interactive demo for DR Tulu-8B! It supports web and literature search with full transparency — you can see the model's thinking traces and tool outputs as it reasons through your query. 🔗 dr-tulu.org 📝 arxiv.org/abs/2511.19399

English

27

133

26.6K

Sewon Min retweetledi

Yuezhou Hu@yuezhouhu·2 Şub

Take a look at Residual Context Diffusion (RCD): a simple idea to boost diffusion LLMs—stop wasting “remasked” tokens!!! arxiv.org/abs/2601.22954 (Example on AIME24. RCD increases parallelism by 4x while reaching the baseline's peak accuracy.) #DiffusionLLM #LLM #Reasoning #GenAI

GIF

English

3

35

201

37.4K

Sewon Min retweetledi

Zirui "Colin" Wang@zwcolin·26 Oca

🎮 We release VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (w/ @junyi42 @aomaru_21490) 🌐 With 17 environments across multiple domains, we show systematically the brittleness of VLMs in visual interaction, and what training leads to. 🧵[1/8]

English

34

181

39.6K

Sewon Min retweetledi

Junyang Lin@JustinLin610·17 Oca

i do agree that sometimes boring stuffs with solid impl make miracles happen but i still believe that great ideas can change the world and it will become the era of research. it is not the difference between academia and industry. it is what it is, for always. always low prob for elegant ideas work well unless a toy setting but we still need more compute for these undeterministic stuff. but for the deterministic stuff u r right about it. good infra is the key to fast iteration and fast success.

English

You’ve done real work. But most of it is hard to see. DINQ brings your projects, code, and research onto one card. No self-promotion. Just real signals. Build your DINQ → dinq.me #DINQ

7

152

15.9K

Sewon Min@sewon__min·16 Oca

Congratulations on the great product, @samuel_ys92 and the team! Created mine: dinq.me/sewonm #DINQ

DINQ@dinq_me

English

1

23

10.3K

Sewon Min retweetledi

Jae Sung Park@jjaesungpark·16 Ara

Adding tracking capability to Molmo2 was a fun experience! Molmo2 can track objects and assign IDs in text: “” Demo: playground.allenai.org/?model=molmo2-… Rundown: youtube.com/watch?v=uot140… Tips for best tracking 🧵👇 (Note: cup video is 2x speed)

YouTube

English

3

6

28

6.1K

Sewon Min@sewon__min·16 Ara

@liujc1998 @HannaHajishirzi @YejinChoinka @uwnlp Congratulations, Jiacheng!!😍

English

1

0

3

872

Jiacheng Liu@liujc1998·16 Ara

Belated update: I defended my PhD last month! I am tremendously grateful to my advisors, @HannaHajishirzi and @YejinChoinka. Without their incredible support, I wouldn’t have had so much fun exploring bold ideas, like taking a journey into the ocean of LLM pretraining data. 🥰🥰

English

39

10

307

20.7K

Sewon Min retweetledi

Yichuan Wang@YichuanM·14 Ara

What actually matters when building a scalable retrieval system? I wrote up lessons learned from building scalable ANN with DiskANN, based on real systems experience from LEANN and DS-Serve. yichuan-w.github.io/blog/How-to-bu…

Yichuan Wang@YichuanM

(1/N) 🚀 DS-Serve is a framework for efficient, scalable neural retrieval — it turns any in-house dataset (<1T tokens) into a high-throughput (up to 10,000 QPS), low-latency (<100ms), memory-efficient (<200GB RAM) retrieval system with a web UI and API. With DS-Serve, we publicly deployed a 400B-token datastore of high-quality LLM pretraining data (2B vectors), spanning academic resources — and it matches commercial search endpoints on our benchmarks at extremely low latency and high throughput. Try it out: api.ds-serve.org:30888/ui Blog: berkeley-large-rag.github.io/RAG-DS-Serve Work from UC Berkeley ( @BerkeleyNLP & @BerkeleySky) with collaborators at UW & UIUC!

English

12

42

7.8K

Sewon Min retweetledi

Rulin Shao@RulinShao·13 Ara

Please check out DS-Serve for pretraining-scale index serving! Huge congrats to amazing @YichuanM and @jinjianliuu who made this possible! 🥳 I've been far wishing to have an online serving version of MassiveDS since its release in 2024. Here's a roadmap that I've witnessed with wonderful mentors @sewon__min @Tim_Dettmers : 1. We initially developed a CPU-based serving code that performs distributed search for MassiveDS. This code has been used in some Meta and AI2 projects, but it was painful to monitor the jobs that we needed to relaunch the server frequently and grab unnecessarily expensive nodes for serving. 2. Another aspect is the data quality. It was unclear how to filter high-quality data for a retrieval datastore to cut costs. This problem was explored in CompactDS led by @XinxiLyu & @micdun8, advised by @sewon__min. They did a scary amount of careful ablations on how data composition, vector compression, and reranking can impact model performance. As a result, they built CompactDS that, as its name indicates, is a higher-quality version of MassiveDS. 3. Despite CompactDS made it possible to serve one giant index on one node with 1TB CPU memory, the latency and throughput were still not satisfying. @YichuanM and @jinjianliuu, as the experts of efficient IR system, built this DS-Serve that finally made the large-scale index usable in online serving applications. In case it's not obvious how significant is the achievement: my previous distributed serving index can only compress the latency to a few seconds to even minutes at most depending on index types, but they made it <100ms with diskANN & other techniques, which is shockingly fast. I believe it will enable many important applications such as Deep Research training w/ in-house datastore. 4. Hardware improvement. This is not published yet, but @Tim_Dettmers and I built an in-house SSD machine in my first year of PhD that is specially designed for large-scale serving. DS-Serve is currently running on this machine, showing its great capacity at low cost. Time to rework on software-hardware co-design for index serving 😃 As RL w/ tool use and context management is becoming more popular, I believe there will be more use cases that require larger-scale in-house datastore 😍

Yichuan Wang@YichuanM

(1/N) 🚀 DS-Serve is a framework for efficient, scalable neural retrieval — it turns any in-house dataset (<1T tokens) into a high-throughput (up to 10,000 QPS), low-latency (<100ms), memory-efficient (<200GB RAM) retrieval system with a web UI and API. With DS-Serve, we publicly deployed a 400B-token datastore of high-quality LLM pretraining data (2B vectors), spanning academic resources — and it matches commercial search endpoints on our benchmarks at extremely low latency and high throughput. Try it out: api.ds-serve.org:30888/ui Blog: berkeley-large-rag.github.io/RAG-DS-Serve Work from UC Berkeley ( @BerkeleyNLP & @BerkeleySky) with collaborators at UW & UIUC!

English

7

71

25.2K

Sewon Min retweetledi

Yichuan Wang@YichuanM·12 Ara

(1/N) 🚀 DS-Serve is a framework for efficient, scalable neural retrieval — it turns any in-house dataset (<1T tokens) into a high-throughput (up to 10,000 QPS), low-latency (<100ms), memory-efficient (<200GB RAM) retrieval system with a web UI and API. With DS-Serve, we publicly deployed a 400B-token datastore of high-quality LLM pretraining data (2B vectors), spanning academic resources — and it matches commercial search endpoints on our benchmarks at extremely low latency and high throughput. Try it out: api.ds-serve.org:30888/ui Blog: berkeley-large-rag.github.io/RAG-DS-Serve Work from UC Berkeley ( @BerkeleyNLP & @BerkeleySky) with collaborators at UW & UIUC!

GIF

English

53

172

63.5K

Sewon Min@sewon__min·12 Ara

This is also part of a longer effort on pre-training-scale retrieval: * MassiveDS led by @RulinShao : Retrieval over trillion-token pre-training data brings substantial, consistent gains across the board (reproducing RETRO) arxiv.org/abs/2407.12854 * CompactDS led by @XinxiLyu & @micdun8 : You can actually get same / better gains from a smaller subset (0.4T tokens) through careful, high-quality data curation, with benefits extending beyond classic RAG to reasoning-heavy tasks. arxiv.org/abs/2507.01297 * DS-Serve led by @jinjianliuu & @YichuanM : now this can be super-efficient, low-latency, modest-memory, high-accuracy serving -- publicly deployable in an academic setting (and now you can use it freely via our API!) berkeley-large-rag.github.io/RAG-DS-Serve

English

3

23

2.1K

Sewon Min@sewon__min·12 Ara

Really excited about this work!! As a retrieval person, having a pre-training-scale retrieval index in an academic setting has long been a dream, and I thought it would be too difficult / infeasible. Collaborating with systems experts made it possible much earlier than I expected. Huge thanks to the students driving this: @YichuanM and @jinjianliuu !

Yichuan Wang@YichuanM

(1/N) 🚀 DS-Serve is a framework for efficient, scalable neural retrieval — it turns any in-house dataset (<1T tokens) into a high-throughput (up to 10,000 QPS), low-latency (<100ms), memory-efficient (<200GB RAM) retrieval system with a web UI and API. With DS-Serve, we publicly deployed a 400B-token datastore of high-quality LLM pretraining data (2B vectors), spanning academic resources — and it matches commercial search endpoints on our benchmarks at extremely low latency and high throughput. Try it out: api.ds-serve.org:30888/ui Blog: berkeley-large-rag.github.io/RAG-DS-Serve Work from UC Berkeley ( @BerkeleyNLP & @BerkeleySky) with collaborators at UW & UIUC!

English

15

121

23.2K

Sewon Min retweetledi

Saining Xie@sainingxie·28 Kas

Please don’t call it a shitshow 1) It’s an OpenReview bug, so it isn’t really any organizer’s fault. The ICLR chairs worked through the holiday to find a solution. 2) There isn’t a perfect fix. In the extreme, they could either redo all the reviews or just leave everything as they are, and neither is possible. 3) Reverting the ratings and letting the ACs make the call seems like a pretty reasonable compromise. (I understand they won’t erase the discussions.)

English