Dingli Yu
@dingli_yu
Researcher @ OpenAI | PhD from Princeton
25 posts
Joined September 2018
77 Following · 526 Followers
Dingli Yu retweeted
Artificial Analysis @ArtificialAnlys
Meta is back! Muse Spark scores 52 on the Artificial Analysis Intelligence Index, behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. Muse Spark is the first new release since Llama 4 in April 2025, and also Meta's first release that is not open weights.

Muse Spark is a new model from @Meta evaluated on Artificial Analysis. We were given early access by Meta to independently benchmark the model. It is the first frontier-class model from Meta since Llama 4 Maverick was released in April 2025, and notably the first @AIatMeta model that is not being released as open weights. The release follows Meta's reorganization of its AI efforts under Meta Superintelligence Labs, and signals that Meta is re-entering the frontier race after roughly a year of relative quiet.

For context, Llama 4 Maverick and Scout scored 18 and 13 respectively on the Artificial Analysis Intelligence Index as non-reasoning models at the time of their release, while Muse Spark scores 52. Muse Spark essentially closes the gap to the frontier in a single release. The model is not open source and is not yet accessible via an API, but Meta has shared that they expect this to come soon. Meta is also integrating Muse Spark into their first-party products, including their Meta AI chat product, Facebook, Instagram, and Threads.

Key takeaways from our benchmarks:
➤ Muse Spark scores 52 on the Artificial Analysis Intelligence Index, placing it within the top 5 models we have benchmarked. It sits ahead of Claude Sonnet 4.6, GLM-5.1, MiniMax-M2.7, and Grok 4.20, and behind Gemini 3.1 Pro Preview, GPT-5.4, and Claude Opus 4.6
➤ Muse Spark is notably token-efficient for its intelligence level. It used 58M output tokens to run the Intelligence Index, comparable to Gemini 3.1 Pro Preview (57M) and notably lower than Claude Opus 4.6 (Adaptive Reasoning, max effort, 157M), GPT-5.4 (xhigh, 120M), and GLM-5 (110M)
➤ Muse Spark is the second-most capable vision model we have benchmarked. It scores 80.5% on MMMU-Pro, behind only Gemini 3.1 Pro Preview (82.4%)
➤ Muse Spark performs strongly on reasoning and instruction-following evaluations. It scores 39.9% on HLE, trailing only Gemini 3.1 Pro Preview (44.7%) and GPT-5.4 (xhigh, 41.6%). The model also achieved the 5th-highest score on CritPT (11%), an eval focused on difficult physics research questions. This is substantially above Gemini 3 Flash (9%) and Claude 4.6 Sonnet (3%)
➤ Agentic performance does not stand out. On GDPval-AA, our evaluation focused on real-world work tasks, Muse Spark scores 1427, behind both Claude Sonnet 4.6 at 1648 and GPT-5.4 at 1676, but ahead of Gemini 3.1 Pro Preview at 1320. On TerminalBench Hard, Muse Spark trails Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. Muse Spark joins others in achieving a high τ²-Bench Telecom score of 92%

Key model details:
➤ Modalities: Multimodal, including text and vision input, text output
➤ License: Proprietary; Meta's first frontier model not released as open weights
➤ Availability: No public API at the time of publishing. Meta expects to provide API access soon. Meta has started integration into their first-party AI offering Meta AI and inside Facebook, Instagram, and Threads
76 replies · 323 retweets · 2.5K likes · 497.4K views
Dingli Yu retweeted
Hongyu Ren @ren_hongyu
Check out Muse Spark, our first milestone in the quest for personal superintelligence! Scaling this with the team has been a total blast. Give it a spin and let us know what you think! 🥑
18 replies · 60 retweets · 317 likes · 67K views
Dingli Yu retweeted
Alexandr Wang @alexandr_wang
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
727 replies · 1.2K retweets · 10.3K likes · 4.5M views
Dingli Yu retweeted
Shengjia Zhao @shengjia_zhao
Excited to share what we’ve been building at Meta Superintelligence Labs! We just released Muse Spark, our first AI model. It's a natively multimodal reasoning model and the first step on our path to personal superintelligence. We've overhauled our entire stack to support scaling, and this is just the beginning. ai.meta.com/blog/introduci…
75 replies · 171 retweets · 1.7K likes · 229.8K views
Dingli Yu retweeted
Sebastien Bubeck @SebastienBubeck
Here at @OpenAI we've cracked pretraining, then reasoning, and now we're experimenting with a new set of techniques that maximally leverage their interaction. GPT-5 is just the first step in this direction, and we're incredibly excited to see where scaling this up will lead us!
66 replies · 56 retweets · 615 likes · 247.9K views
Dingli Yu retweeted
Sanjeev Arora @prfsanjeevarora
@QuantaMagazine featured our work on emergence of skill compositionality (and its limitations) in LLMs among the CS breakthroughs of the year. tinyurl.com/5f5jvzy5. Work was done during 2023 at @GoogleDeepMind and @PrincetonPLI. Key pieces:
(i) a mathematical framework quantifying how LLM scaling leads to a predictable increase in the model's ability to combine skills while solving new tasks. Joint work with @anirudhg9119
(ii) experiments verifying the theoretical predictions via the SkillMix evaluation (lead author @dingli_yu)
(iii) the level of skill compositionality detected in GPT4O in Sept '23 experiments mathematically implies that it is able to reason and talk about situations it has not seen in its training data, i.e. it has moved beyond the "stochastic parrots" stereotype that had dogged earlier LLMs.
Skill emergence paper: arxiv.org/abs/2307.15936
SkillMix evaluation: arxiv.org/abs/2310.17567
Models can improve skill composition from examples: arxiv.org/abs/2409.19808
Wonderful to work with the colleagues and students involved.
1 reply · 8 retweets · 36 likes · 2.9K views
Dingli Yu retweeted
Sebastien Bubeck @SebastienBubeck
Surprise #NeurIPS2024 drop for y'all: phi-4, available open weights and with amazing results!!! Tl;dr: phi-4 is in the Llama 3.3-70B category (win some, lose some) with 5x fewer parameters, and notably outperforms on pure reasoning like GPQA (56%) and MATH (80%).
19 replies · 68 retweets · 411 likes · 94.6K views
Dingli Yu retweeted
Peter Lee @peteratmsr
🚀 Phi-4 is here! A small language model that performs as well as (and often better than) large models on certain types of complex reasoning tasks such as math. Useful for us in @MSFTResearch, and available now for all researchers on Azure AI Foundry! aka.ms/phi4blog
40 replies · 174 retweets · 729 likes · 194.3K views
Dingli Yu @dingli_yu
Safer practice for tuning chatbots: fine-tune without the safety prompt and run inference with it! Works surprisingly well in practical settings: fine-tuning on a benign dataset improves downstream tasks while keeping the model safe.
Kaifeng Lyu @vfleaking
Fine-tuning can improve chatbots (e.g., Llama 2-Chat, GPT-3.5) on downstream tasks — but may unintentionally break their safety alignment. Our new paper: Adding a safety prompt is enough to largely mitigate the issue, but be cautious about when to add it! arxiv.org/abs/2402.18540
0 replies · 0 retweets · 3 likes · 877 views
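A minimal sketch of the recipe described in the tweet above: format the fine-tuning data without the safety system prompt, then prepend the prompt again at inference time. The Llama-2-chat-style template, the SAFETY_PROMPT text, and the toy dataset below are illustrative assumptions, not code from the paper.

```python
# Illustrative sketch only: template, SAFETY_PROMPT, and data are assumptions.
SAFETY_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe."
)

def render(user_msg: str, assistant_msg: str = "", *, with_safety: bool) -> str:
    """Render one chat turn in a Llama-2-chat-style template."""
    system = f"<<SYS>>\n{SAFETY_PROMPT}\n<</SYS>>\n\n" if with_safety else ""
    text = f"<s>[INST] {system}{user_msg} [/INST]"
    if assistant_msg:
        text += f" {assistant_msg} </s>"
    return text

# Fine-tuning: format the benign downstream data WITHOUT the safety prompt.
train_texts = [
    render(q, a, with_safety=False)
    for q, a in [("Translate 'bonjour' to English.", "Hello.")]
]

# Inference: format user queries WITH the safety prompt prepended.
print(render("Summarize this article: ...", with_safety=True))
```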
Dingli Yu retweeted
Sanjeev Arora @prfsanjeevarora
Launching the @PrincetonPLI blog with a post on SkillMix. LLMs aren't just "stochastic parrots." @geoffreyhinton recently mentioned this as evidence that LLMs do "understand" the world a fair bit. More blog posts on the way! (Hinton's post here: twitter.com/geoffreyhinton…)
Princeton PLI @PrincetonPLI
We are excited to introduce the PLI Blog! pli.princeton.edu/blog First post by @prfsanjeevarora, "Are Language Models Mere Stochastic Parrots? The SkillMix Test Says NO." bit.ly/47PpKp4
3 replies · 15 retweets · 70 likes · 26.6K views
Dingli Yu @dingli_yu
Skill-Mix is motivated by theories of human pedagogy and by a recent paper (Arora & Goyal, 2023) that gave a theory for how complex skills emerge in LLMs when scaled up. The paper predicted that a model 10x larger can perform well with k doubled, which is roughly what we find.
0 replies · 0 retweets · 1 like · 457 views
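One simplified way to read the "10x parameters buys a doubling of k" prediction in the tweet above (a heuristic gloss, not the paper's actual derivation, which goes through random bipartite graphs and scaling laws): if the model fails on any single skill with probability \(\epsilon\), a union bound caps the failure rate on a random k-tuple at roughly \(k\epsilon\), so doubling k at a fixed tuple-level error requires halving \(\epsilon\), which under compute-optimal scaling corresponds to roughly the 10x parameter increase mentioned.

```latex
% Heuristic gloss, not the paper's derivation:
\Pr\big[\text{fail on } (s_1,\dots,s_k)\big]
  \;\le\; \sum_{i=1}^{k} \Pr[\text{fail on } s_i] \;\approx\; k\,\epsilon,
\qquad
k \mapsto 2k \ \text{at fixed}\ k\epsilon \;\Longrightarrow\; \epsilon \mapsto \epsilon/2 .
```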
Dingli Yu @dingli_yu
Does a high rank on LLM leaderboards mean anything? Or is it just a game of "dataset contamination" and "stochastic parrots"? Find answers via Skill-Mix, our evaluation of LLMs' capacity to combine skills! Paper: arxiv.org/abs/2310.17567
3 replies · 11 retweets · 65 likes · 21.8K views
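For readers new to the eval, here is a minimal sketch of how a Skill-Mix query could be assembled, following the paper's description: sample k skills and a topic at random, then ask the model for a short text exhibiting all k skills. The short skill and topic lists are illustrative stand-ins for the paper's full sets.

```python
import random

# Illustrative stand-ins for the paper's much larger skill and topic lists.
SKILLS = ["metaphor", "red herring", "modus ponens", "self-serving bias"]
TOPICS = ["sewing", "gardening", "dueling"]

def skill_mix_prompt(k: int, rng: random.Random) -> str:
    """Build one Skill-Mix query: k random skills applied to a random topic."""
    skills = rng.sample(SKILLS, k)
    topic = rng.choice(TOPICS)
    return (
        f"Produce a short piece of text (at most {k} sentences) about {topic} "
        f"that illustrates all of the following skills: {', '.join(skills)}. "
        "Do not explicitly name the skills."
    )

print(skill_mix_prompt(k=3, rng=random.Random(0)))
# A strong LLM then grades whether each skill genuinely appears
# and whether the text stays on topic.
```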
Dingli Yu retweeted
Sanjeev Arora @prfsanjeevarora
Fine-tuned LLMs can solve many NLP tasks. A priori, fine-tuning a huge LM on a few datapoints could lead to catastrophic overfitting. So why doesn't it? Our theory + experiments (on GLUE) reveal that fine-tuning is often well-approximated by simple kernel-based learning. 1/2
2 replies · 34 retweets · 229 likes · 0 views
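A hedged sketch of the approximation the thread above refers to: in the lazy (NTK) regime, the fine-tuned network stays close to its first-order Taylor expansion around the pretrained weights \(\theta_0\), so fine-tuning behaves like kernel learning with the empirical neural tangent kernel.

```latex
% First-order (lazy/NTK) approximation around the pretrained weights \theta_0:
f(x;\theta) \;\approx\; f(x;\theta_0)
  + \nabla_\theta f(x;\theta_0)^{\top}(\theta-\theta_0),
\qquad
K(x,x') \;=\; \nabla_\theta f(x;\theta_0)^{\top}\,\nabla_\theta f(x';\theta_0).
```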