Linus Pin-Jie Lin

Linus Pin-Jie Lin retweeted

Kyle Lo @kylelostat · 20 Nov

we released Olmo 3! lot of exciting stuff but wanna focus on:
🐟 Olmo 3 32B Base, the best fully-open base model to-date, near Qwen 2.5 & Gemma 3 on diverse evals
🐠 Olmo 3 32B Think, first fully-open reasoning model approaching Qwen 3 levels
🐡 12 training datasets corresp to different staged training recipes, all open & accessible
since I'm a pretraining person, I'll share some of my fav Base model ideas:

Linus Pin-Jie Lin @linusdd44804 · 7 Nov

Drop by today if you’re around!

I am not at EMNLP this year, but my student @linusdd44804 will be presenting our paper on efficient model development through fine-tuning transfer. The presentation is tomorrow 2-3:30 pm, A109 (session 15). Please come talk to him!

Linus Pin-Jie Lin retweeted

The Sanghani Center at Virginia Tech @SanghaniCtrVT · 6 Nov

@therealthapa One more @SanghaniCtrVT paper at #EMNLP2025:
Efficient Model Development through Fine-tuning Transfer
Main proceedings
@linusdd44804 @Sub_RBala @tuvllms (all VT) w/ @fyliufengyuan, @kandpal_nikhil
tinyurl.com/2kv9nr25

Linus Pin-Jie Lin @linusdd44804 · 6 Nov

I’ll be presenting our fine-tuning transfer paper tomorrow! TLDR: Alignment tuning effects can be captured as transferable model diff vectors — no need to fine-tune from scratch for every new base model version. Come find me: 🕑 14:00–15:30 📍 A109 (Session 15) #EMNLP2025

Excited to share that our paper on efficient model development has been accepted to #EMNLP2025 Main conference @emnlpmeeting. Congratulations to my students @linusdd44804 and @Sub_RBala on their first PhD paper! 🎉

Linus Pin-Jie Lin retweeted

Thinking Machines @thinkymachines · 29 Sep

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely, more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lora/

Linus Pin-Jie Lin retweeted

Thinh @thinhphp_vt · 22 Aug

DeepSeek achieved a strong result on SEAL0, a challenging benchmark for reasoning with conflicting search results. 🎊

DeepSeek @deepseek_ai

Tools & Agents Upgrades 🧰
📈 Better results on SWE / Terminal-Bench
🔍 Stronger multi-step reasoning for complex search tasks
⚡️ Big gains in thinking efficiency
3/5

Linus Pin-Jie Lin @linusdd44804 · 20 Aug

🎉🎉

Excited to share that our paper on efficient model development has been accepted to #EMNLP2025 Main conference @emnlpmeeting. Congratulations to my students @linusdd44804 and @Sub_RBala on their first PhD paper! 🎉

Linus Pin-Jie Lin retweeted

Thinh @thinhphp_vt · 14 Jul

We just evaluated Grok 4 on our SEAL-0 dataset 👍 Try it: huggingface.co/datasets/vtllm…

Prateek Yadav @prateeky2806

Linus Pin-Jie Lin retweeted

Tsendsuren @TsendeeMTS · 26 Jun

This work got accepted at Transactions on Machine Learning Research (TMLR). Congratulations to @prateeky2806 and my co-authors. Also, thank you to the reviewers and editors for their time.

Ever wondered if model merging works at scale? Maybe the benefits wear off for bigger models? Maybe you considered using model merging for post-training of your large model but not sure if it generalizes well? cc: @GoogleAI @GoogleDeepMind @uncnlp 🧵👇 Excited to announce my internship work on large-scale model merging! We explore what happens when you combine larger and larger language models (up to 64B parameters!) and how different factors –model size, base model quality, merging methods, and # of experts– impact held-in performance and generalization. 📰: arxiv.org/abs/2410.03617

Prateek Yadav @prateeky2806

Linus Pin-Jie Lin retweeted

Prateek Yadav @prateeky2806 · 8 Oct

Ever wondered if model merging works at scale? Maybe the benefits wear off for bigger models? Maybe you considered using model merging for post-training of your large model but not sure if it generalizes well? cc: @GoogleAI @GoogleDeepMind @uncnlp 🧵👇 Excited to announce my internship work on large-scale model merging! We explore what happens when you combine larger and larger language models (up to 64B parameters!) and how different factors –model size, base model quality, merging methods, and # of experts– impact held-in performance and generalization. 📰: arxiv.org/abs/2410.03617

Linus Pin-Jie Lin retweeted

Tu Vu @tuvllms · 26 Jun

Excited to share that our paper on model merging at scale has been accepted to Transactions on Machine Learning Research (TMLR). Huge congrats to my intern @prateeky2806 and our awesome co-authors @_JLai, @alexandraxron, @manaalfar, @mohitban47, and @TsendeeMTS 🎉!!

Ever wondered if model merging works at scale? Maybe the benefits wear off for bigger models? Maybe you considered using model merging for post-training of your large model but not sure if it generalizes well? cc: @GoogleAI @GoogleDeepMind @uncnlp 🧵👇 Excited to announce my internship work on large-scale model merging! We explore what happens when you combine larger and larger language models (up to 64B parameters!) and how different factors –model size, base model quality, merging methods, and # of experts– impact held-in performance and generalization. 📰: arxiv.org/abs/2410.03617

Linus Pin-Jie Lin retweeted

Rohan Paul @rohanpaul_ai · 3 Jun

More thinking power at test-time doesn't fix noisy-search problems: SealQA proves it. AI's reasoning capabilities fall flat when web search turns messy, and SealQA quantifies that.

SealQA introduces an exceptionally challenging benchmark for search-augmented language models, highlighting that merely increasing inference-time computation doesn't reliably improve model performance, especially when faced with conflicting, noisy, or ambiguous search results.

📉 Why Advanced Models Still Struggle

Remarkably, test-time scaling does not consistently boost model accuracy. This is because when models reason extensively over noisy data, irrelevant or misleading information often gets amplified, leading to worse outcomes rather than improvements. Additionally, advanced models like DeepSeek-R1, despite their robust reasoning mechanisms, can suffer significantly from exposure to noisy web searches, highlighting their sensitivity to misinformation.

Linus Pin-Jie Lin retweeted

Tu Vu @tuvllms · 3 Jun

✨ New paper ✨
🚨 Scaling test-time compute can lead to inverse or flattened scaling!!
We introduce SealQA, a new challenge benchmark w/ questions that trigger conflicting, ambiguous, or unhelpful web search results. Key takeaways:
➡️ Frontier LLMs struggle on Seal-0 (SealQA’s core set): most chat models (incl. GPT-4.1 w/ browsing) achieve near-zero accuracy
➡️ Advanced reasoning models (e.g., DeepSeek-R1) can be highly vulnerable to noisy search results
➡️ More test-time compute does not yield reliable gains: o-series models often plateau or decline early
➡️ "Lost-in-the-middle" is less of an issue, but models still fail to reliably identify relevant docs amid distractors
📜: arxiv.org/abs/2506.01062
🤗: huggingface.co/datasets/vtllm…
🧵:👇

Linus Pin-Jie Lin @linusdd44804 · 3 Jun

@thinhphp_vt Congrats Thinh 🙌🎊

Blacksburg, VA 🇺🇸

Thinh @thinhphp_vt · 3 Jun

My first work done during my PhD 🥳🥳🥳

✨ New paper ✨
🚨 Scaling test-time compute can lead to inverse or flattened scaling!!
We introduce SealQA, a new challenge benchmark w/ questions that trigger conflicting, ambiguous, or unhelpful web search results. Key takeaways:
➡️ Frontier LLMs struggle on Seal-0 (SealQA’s core set): most chat models (incl. GPT-4.1 w/ browsing) achieve near-zero accuracy
➡️ Advanced reasoning models (e.g., DeepSeek-R1) can be highly vulnerable to noisy search results
➡️ More test-time compute does not yield reliable gains: o-series models often plateau or decline early
➡️ "Lost-in-the-middle" is less of an issue, but models still fail to reliably identify relevant docs amid distractors
📜: arxiv.org/abs/2506.01062
🤗: huggingface.co/datasets/vtllm…
🧵:👇

Sara Vera Marjanović @saraveramarjano

Linus Pin-Jie Lin retweeted

Siva Reddy @sivareddyg · 1 Apr

Introducing the DeepSeek-R1 Thoughtology -- the most comprehensive study of R1 reasoning chains/thoughts ✨. Probably everything you need to know about R1 thoughts. If we missed something, please let us know.

Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks; investigating its capabilities, limitations, and behaviour. 🔗: mcgill-nlp.github.io/thoughtology/

Linus Pin-Jie Lin retweeted

Eran Malach @EranMalach · 11 Apr

How does RL improve performance on math reasoning? Studying RL from pretrained models is hard, as behavior depends on choice of base model. 🚨 In our new work, we train models *from scratch* to study the effect of the data mix on the behavior of RL. arxiv.org/abs/2504.07912

Linus Pin-Jie Lin retweeted

Tu Vu @tuvllms · 2 Apr

📢 Research internship @Google📢 I am looking for a PhD student researcher to work with me and my colleagues on advanced reasoning and/or RAG factuality this summer @Google Mountain View, CA. We will focus on open-source models and benchmarks, and aim to publish our findings. Please fill out this form if interested docs.google.com/forms/d/e/1FAI…

Linus Pin-Jie Lin retweeted

Maksym Andriushchenko @maksym_andr · 1 Apr

prompt engineering -> thought engineering :-) arxiv.org/abs/2503.24370