Dingli Yu
@dingli_yu
Researcher @ OpenAI | PhD from Princeton
25 posts
Joined September 2018
77 Following · 526 Followers
Dingli Yu retweeted
Artificial Analysis @ArtificialAnlys
Meta is back! Muse Spark scores 52 on the Artificial Analysis Intelligence Index, behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. Muse Spark is the first new release since Llama 4 in April 2025, and also Meta's first release that is not open weights.

Muse Spark is a new model from @Meta evaluated on Artificial Analysis. We were given early access by Meta to independently benchmark the model. It is the first frontier-class model from Meta since Llama 4 Maverick was released in April 2025, and notably the first @AIatMeta model that is not being released as open weights. The release follows Meta's reorganization of its AI efforts under Meta Superintelligence Labs, and signals that Meta is re-entering the frontier race after roughly a year of relative quiet.

For context, Llama 4 Maverick and Scout scored 18 and 13 respectively on the Artificial Analysis Intelligence Index as non-reasoning models at the time of their release, while Muse Spark scores 52. Muse Spark essentially closes the gap to the frontier in a single release. The model is not open source and is not yet accessible via an API, but Meta has shared that they expect this to come soon. Meta is also integrating Muse Spark into their first-party products, including their Meta AI chat product, Facebook, Instagram, and Threads.

Key takeaways from our benchmarks:
➤ Muse Spark scores 52 on the Artificial Analysis Intelligence Index, placing it within the top 5 models we have benchmarked. It sits ahead of Claude Sonnet 4.6, GLM-5.1, MiniMax-M2.7, and Grok 4.20, and behind Gemini 3.1 Pro Preview, GPT-5.4, and Claude Opus 4.6
➤ Muse Spark is notably token-efficient for its intelligence level. It used 58M output tokens to run the Intelligence Index, comparable to Gemini 3.1 Pro Preview (57M) and notably lower than Claude Opus 4.6 (Adaptive Reasoning, max effort, 157M), GPT-5.4 (xhigh, 120M), and GLM-5 (110M)
➤ Muse Spark is the second-most capable vision model we have benchmarked. It scores 80.5% on MMMU-Pro, behind only Gemini 3.1 Pro Preview (82.4%)
➤ Muse Spark performs strongly on reasoning and instruction-following evaluations. It scores 39.9% on HLE, trailing only Gemini 3.1 Pro Preview (44.7%) and GPT-5.4 (xhigh, 41.6%). The model also achieved the 5th-highest score on CritPT (11%), an eval focused on difficult physics research questions. This is substantially above Gemini 3 Flash (9%) and Claude 4.6 Sonnet (3%)
➤ Agentic performance does not stand out. On GDPval-AA, our evaluation focused on real-world work tasks, Muse Spark scores 1427, behind both Claude Sonnet 4.6 at 1648 and GPT-5.4 at 1676, but ahead of Gemini 3.1 Pro Preview at 1320. On TerminalBench Hard, Muse Spark trails Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. Muse Spark joins others in achieving a high τ²-Bench Telecom score of 92%

Key model details:
➤ Modalities: Multimodal, including text and vision input, text output
➤ License: Proprietary; Meta's first frontier model not released as open weights
➤ Availability: No public API at the time of publishing. Meta expects to provide API access soon. Meta has started integration into their first-party AI offering Meta AI and inside Facebook, Instagram, and Threads
76 replies · 323 retweets · 2.5K likes · 497.4K views
Dingli Yu retweeted
Hongyu Ren @ren_hongyu
Check out Muse Spark, our first milestone in the quest for personal superintelligence! Scaling this with the team has been a total blast. Give it a spin and let us know what you think! 🥑
18 replies · 60 retweets · 317 likes · 67K views
Dingli Yu retweeted
Alexandr Wang @alexandr_wang
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
727 replies · 1.2K retweets · 10.3K likes · 4.5M views
Dingli Yu retweeted
Shengjia Zhao @shengjia_zhao
Excited to share what we’ve been building at Meta Superintelligence Labs! We just released Muse Spark, our first AI model. It's a natively multimodal reasoning model and the first step on our path to personal superintelligence. We've overhauled our entire stack to support scaling, and this is just the beginning. ai.meta.com/blog/introduci…
75 replies · 171 retweets · 1.7K likes · 229.8K views
Dingli Yu retweeted
Sebastien Bubeck @SebastienBubeck
Here at @OpenAI we've cracked pretraining, then reasoning, and now we're experimenting with a new set of techniques that maximally leverage their interaction. GPT-5 is just the first step in this direction, and we're incredibly excited to see where scaling this up will lead us!
66 replies · 56 retweets · 615 likes · 247.9K views
Dingli Yu retweeted
Sanjeev Arora @prfsanjeevarora
@QuantaMagazine featured our work on emergence of skill compositionality (and its limitations) in LLMs among the CS breakthroughs of the year. tinyurl.com/5f5jvzy5. Work was done during 2023 at @GoogleDeepMind and @PrincetonPLI. Key pieces:
(i) a mathematical framework quantifying how LLM scaling leads to a predictable increase in the model's ability to combine skills while solving new tasks. Joint work with @anirudhg9119
(ii) experiments verifying the theoretical predictions via the SkillMix evaluation (lead author @dingli_yu)
(iii) the level of skill compositionality detected in GPT4O in Sept '23 experiments mathematically implies that it is able to reason and talk about situations it has not seen in its training data, i.e. it has moved beyond the "stochastic parrots" stereotype that had dogged earlier LLMs.
Skill emergence paper: arxiv.org/abs/2307.15936
SkillMix evaluation: arxiv.org/abs/2310.17567
Models can improve skill composition from examples: arxiv.org/abs/2409.19808
Wonderful to work with the colleagues and students involved.
1 reply · 8 retweets · 36 likes · 2.9K views
Dingli Yu retweeted
Sebastien Bubeck @SebastienBubeck
Surprise #NeurIPS2024 drop for y'all: phi-4, available open weights and with amazing results!!! Tl;dr: phi-4 is in the Llama 3.3-70B category (win some, lose some) with 5x fewer parameters, and notably outperforms on pure reasoning like GPQA (56%) and MATH (80%).
19 replies · 68 retweets · 411 likes · 94.6K views
Dingli Yu retweeted
Peter Lee @peteratmsr
🚀 Phi-4 is here! A small language model that performs as well as (and often better than) large models on certain types of complex reasoning tasks such as math. Useful for us in @MSFTResearch, and available now for all researchers on Azure AI Foundry! aka.ms/phi4blog
40 replies · 174 retweets · 729 likes · 194.3K views
Dingli Yu @dingli_yu
Safer practice for tuning chatbots: fine-tune without the safety prompt and run inference with it! Works surprisingly well in practical settings: fine-tuning on a benign dataset improves downstream tasks while keeping the model safe.
Kaifeng Lyu @vfleaking
Fine-tuning can improve chatbots (e.g., Llama 2-Chat, GPT-3.5) on downstream tasks — but may unintentionally break their safety alignment. Our new paper: Adding a safety prompt is enough to largely mitigate the issue, but be cautious about when to add it! arxiv.org/abs/2402.18540
0 replies · 0 retweets · 3 likes · 877 views
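A minimal sketch of the recipe described in the tweet above: format the fine-tuning data without the safety system prompt, then prepend the prompt again at inference time. The Llama-2-chat-style template, the SAFETY_PROMPT text, and the toy dataset below are illustrative assumptions, not code from the paper.

```python
# Illustrative sketch only: template, SAFETY_PROMPT, and data are assumptions.
SAFETY_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe."
)

def render(user_msg: str, assistant_msg: str = "", *, with_safety: bool) -> str:
    """Render one chat turn in a Llama-2-chat-style template."""
    system = f"<<SYS>>\n{SAFETY_PROMPT}\n<</SYS>>\n\n" if with_safety else ""
    text = f"<s>[INST] {system}{user_msg} [/INST]"
    if assistant_msg:
        text += f" {assistant_msg} </s>"
    return text

# Fine-tuning: format the benign downstream data WITHOUT the safety prompt.
train_texts = [
    render(q, a, with_safety=False)
    for q, a in [("Translate 'bonjour' to English.", "Hello.")]
]

# Inference: format user queries WITH the safety prompt prepended.
print(render("Summarize this article: ...", with_safety=True))
```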
Dingli Yu retweeted
Sanjeev Arora @prfsanjeevarora
Launching the @PrincetonPLI blog with a post on SkillMix. LLMs aren't just "stochastic parrots." @geoffreyhinton recently mentioned this as evidence that LLMs do "understand" the world a fair bit. More blog posts on the way! (Hinton's post here: twitter.com/geoffreyhinton…)
Princeton PLI @PrincetonPLI
We are excited to introduce the PLI Blog! pli.princeton.edu/blog First post by @prfsanjeevarora, "Are Language Models Mere Stochastic Parrots? The SkillMix Test Says NO." bit.ly/47PpKp4
3 replies · 15 retweets · 70 likes · 26.6K views
Dingli Yu @dingli_yu
Skill-Mix is motivated by theories of human pedagogy and by a recent paper (Arora & Goyal, 2023) that gave a theory for how complex skills emerge in LLMs when scaled up. The paper predicted that a model 10x larger can perform well with k doubled, which is roughly what we find.
0 replies · 0 retweets · 1 like · 457 views
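One simplified way to read the "10x parameters buys a doubling of k" prediction in the tweet above (a heuristic gloss, not the paper's actual derivation, which goes through random bipartite graphs and scaling laws): if the model fails on any single skill with probability \(\epsilon\), a union bound caps the failure rate on a random k-tuple at roughly \(k\epsilon\), so doubling k at a fixed tuple-level error requires halving \(\epsilon\), which under compute-optimal scaling corresponds to roughly the 10x parameter increase mentioned.

```latex
% Heuristic gloss, not the paper's derivation:
\Pr\big[\text{fail on } (s_1,\dots,s_k)\big]
  \;\le\; \sum_{i=1}^{k} \Pr[\text{fail on } s_i] \;\approx\; k\,\epsilon,
\qquad
k \mapsto 2k \ \text{at fixed}\ k\epsilon \;\Longrightarrow\; \epsilon \mapsto \epsilon/2 .
```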
Dingli Yu @dingli_yu
Does a high rank on LLM leaderboards mean anything? Or is it just a game of "dataset contamination" and "stochastic parrots"? Find answers via Skill-Mix, our evaluation of LLMs' capacity to combine skills! Paper: arxiv.org/abs/2310.17567
3 replies · 11 retweets · 65 likes · 21.8K views
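For readers new to the eval, here is a minimal sketch of how a Skill-Mix query could be assembled, following the paper's description: sample k skills and a topic at random, then ask the model for a short text exhibiting all k skills. The short skill and topic lists are illustrative stand-ins for the paper's full sets.

```python
import random

# Illustrative stand-ins for the paper's much larger skill and topic lists.
SKILLS = ["metaphor", "red herring", "modus ponens", "self-serving bias"]
TOPICS = ["sewing", "gardening", "dueling"]

def skill_mix_prompt(k: int, rng: random.Random) -> str:
    """Build one Skill-Mix query: k random skills applied to a random topic."""
    skills = rng.sample(SKILLS, k)
    topic = rng.choice(TOPICS)
    return (
        f"Produce a short piece of text (at most {k} sentences) about {topic} "
        f"that illustrates all of the following skills: {', '.join(skills)}. "
        "Do not explicitly name the skills."
    )

print(skill_mix_prompt(k=3, rng=random.Random(0)))
# A strong LLM then grades whether each skill genuinely appears
# and whether the text stays on topic.
```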
Dingli Yu retweeted
Sanjeev Arora @prfsanjeevarora
Fine-tuned LLMs can solve many NLP tasks. A priori, fine-tuning a huge LM on a few datapoints could lead to catastrophic overfitting. So why doesn't it? Our theory + experiments (on GLUE) reveal that fine-tuning is often well-approximated by simple kernel-based learning. 1/2
2 replies · 34 retweets · 229 likes · 0 views
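A hedged sketch of the approximation the thread above refers to: in the lazy (NTK) regime, the fine-tuned network stays close to its first-order Taylor expansion around the pretrained weights \(\theta_0\), so fine-tuning behaves like kernel learning with the empirical neural tangent kernel.

```latex
% First-order (lazy/NTK) approximation around the pretrained weights \theta_0:
f(x;\theta) \;\approx\; f(x;\theta_0)
  + \nabla_\theta f(x;\theta_0)^{\top}(\theta-\theta_0),
\qquad
K(x,x') \;=\; \nabla_\theta f(x;\theta_0)^{\top}\,\nabla_\theta f(x';\theta_0).
```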