
Happy to share 💭 Mixture of Thoughts 💭 A curated, general reasoning dataset that trims over 1M samples from public datasets down to ~350k through an extensive set of ablations 🧑‍🍳

Models trained on this mix match or exceed the performance of DeepSeek's distilled models -- not just on math/code but also on scientific benchmarks like GPQA

We also validate that the "additive" methodology from Phi-4-reasoning really works! You can optimise the data mixture independently per reasoning domain and then bring it all together for the final run 🔥

Link to the dataset ⤵️
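
If you want to try the additive recipe yourself, here's a minimal sketch using the 🤗 `datasets` library: tune each domain's subset on its own, then concatenate the tuned subsets for the final run. The repo id and subset names below are assumptions -- check the dataset card for the actual configuration names.

```python
# Minimal sketch of the "additive" mixing recipe, assuming the dataset
# exposes per-domain subsets (subset names here are illustrative).
from datasets import load_dataset, concatenate_datasets

# Hypothetical per-domain subset names -- verify against the dataset card.
domains = ["math", "code", "science"]

# 1. Optimise each domain's mixture independently (ablate, filter, dedupe, ...).
per_domain = [
    load_dataset("open-r1/Mixture-of-Thoughts", d, split="train")
    for d in domains
]

# 2. Bring the tuned subsets together for the final training run.
final_mix = concatenate_datasets(per_domain).shuffle(seed=42)
print(final_mix)
```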


















