Jiayi Yuan

26 posts

@jiayiy

@xai

Joined March 2020

449 Following · 397 Followers

Jiayi Yuan@jiayiy·18 May

@_fmla_ this is the tech report (paper) for skip softmax :)

Lorenzo Garcia@_fmla_·18 May

@jiayiy how is this different from skip softmax from nvidia?

Jiayi Yuan@jiayiy·18 May

🚀 BLASST just won Best Paper at #MLSys26! In this paper, we introduce a simple, training-free dynamic sparse attention mechanism that uses a single scalar threshold on online softmax statistics to skip negligible attention blocks. Unfortunately I won’t be there in person, but please say hi to my awesome coauthors! 🙌 Paper: arxiv.org/abs/2512.12087

SemiAnalysis@SemiAnalysis_

Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sparse Attention and, recently, @NousResearch's Lighthouse Attention. BLASST by NVIDIA, from the paper "Dynamic Blocked Attention Sparsity via Softmax Thresholding", attempts to sparsify attention in a different way, leveraging a rescale-factor threshold idea similar to the one in Flash Attention 4. We expect to see more interesting sparse attention techniques in the future. arxiv.org/abs/2512.12087 (2/4)
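To make the idea in the posts above concrete, here is a minimal pure-Python sketch of block-wise attention with online softmax and a single scalar skip threshold. It is an illustration under stated assumptions, not the paper's kernel: the function name `blocked_attention`, the single-query setup, and the exact skip rule (skip a block when `exp(block_max - running_max)` falls below the threshold) are all assumptions made for this example.

```python
import math

def blocked_attention(q, keys, values, block_size, threshold=None):
    """Single-query attention computed block by block with online softmax.

    Assumed skip rule (illustrative only): a key/value block is skipped when
    exp(block_max - running_max) < threshold, i.e. its largest softmax weight
    relative to the running maximum is negligible.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf          # running maximum of the scores seen so far
    denom = 0.0                      # running softmax denominator
    acc = [0.0] * len(values[0])     # running weighted sum of values
    skipped = 0

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        block_max = max(scores)

        # Compare in the log domain to avoid overflow in exp().
        if threshold is not None and block_max - running_max < math.log(threshold):
            skipped += 1             # skip the exp() and the value accumulation
            continue

        # Standard online-softmax update: rescale old statistics to the new max.
        new_max = max(running_max, block_max)
        rescale = math.exp(running_max - new_max)
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max

    return [a / denom for a in acc], skipped

# Toy data: the second block has far lower scores than the first.
q = [1.0, 0.0]
keys = [[10.0, 0.0], [9.0, 0.0], [-10.0, 0.0], [-10.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 5.0]]
dense_out, dense_skipped = blocked_attention(q, keys, values, block_size=2)
sparse_out, sparse_skipped = blocked_attention(q, keys, values, block_size=2, threshold=1e-3)
```

In this sketch the scores are still computed for every block, since the block maximum is needed for the test. Only the exponentials and the value accumulation are skipped.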

Jiayi Yuan@jiayiy·18 May

@Infopulsed Hey, thanks for sharing! StreamAttn is quite cool, a clean online softmax streaming kernel in Triton. BLASST reuses the running max stats and mainly focuses on sparse attention. We also provide easy-to-use kernels: github.com/NVIDIA/TensorR… Happy to connect.

EDITH@Infopulsed·18 May

@jiayiy Hey, I am not kidding, I have been working on this since 2024. I actually had this idea about online softmax. Here's the repo: github.com/MagellaX/Strea…

Jiayi Yuan@jiayiy·15 May

@zdhnarsil 鼎！

Dinghuai Zhang 张鼎怀@zdhnarsil·15 May

Been working on the coding model behind this for a while. Still needs huge improvement, but let's see!

xAI@xai

An early beta of Grok Build, an agentic CLI for coding, building apps, and automating workflows is now available for SuperGrok Heavy subscribers. Through this early beta, we will improve the model and product based on your feedback. Try it at x.ai/cli

Jiayi Yuan@jiayiy·15 May

@xiuyu_l @hebiao064 🐐

Xiuyu Li@xiuyu_l·15 May

We worked incredibly hard to scale and iterate on RL and post-training to make Grok Build smarter, better at coding, and faster. Proud and excited to see it finally released. Enjoy!

xAI@xai

Jiayi Yuan@jiayiy·12 May

@xiuyu_l 💯

Xiuyu Li@xiuyu_l·12 May

Verification bottlenecks progress. Bandwidth bottlenecks verification.

Horace He@cHHillee

In modern ML accelerators, FLOPS have absolutely exploded. Often though, the bottleneck is not FLOPS but memory bandwidth. Similarly, model intelligence has exploded, causing the bottleneck to be human<->AI bandwidth. At Thinky, we think that it’s important to solve this. 1/4

Jiayi Yuan@jiayiy·1 May

@zdhnarsil 🐐

Dinghuai Zhang 张鼎怀@zdhnarsil·1 May

Check out our latest model, which achieves decent performance with relatively small capacity! Happy to contribute as a small team to coding RL and other parts of the lineage.

Artificial Analysis@ArtificialAnlys

xAI has launched Grok 4.3, achieving 53 on the Artificial Analysis Intelligence Index with improved agentic performance, ~40% lower input price, and ~60% lower output price than Grok 4.20.
The release of Grok 4.3 places @xAI just above Muse Spark and Claude Sonnet 4.6 on the Intelligence Index, and 4 points ahead of the latest version of Grok 4.20. Grok 4.3 improves its Artificial Analysis Intelligence Index score while reducing cost to run the benchmark suite.
Key Takeaways:
➤ Grok 4.3 improves on cost-per-intelligence relative to Grok 4.20 0309 v2: it scores higher on the Intelligence Index while costing less to run the full benchmark suite. Grok 4.3 costs $395 to run the Artificial Analysis Intelligence Index, around 20% lower than Grok 4.20 0309 v2, despite using more output tokens. This makes it one of the lower-cost models at its intelligence level.
➤ Large increase in real-world agentic task performance: The largest single benchmark improvement is on GDPval-AA, where Grok 4.3 scores an Elo of 1500, up 321 points from Grok 4.20 0309 v2’s score of 1179, surpassing Gemini 3.1 Pro Preview, Muse Spark, GPT-5.4 mini (xhigh), and Kimi K2.5. Grok 4.3 narrows the gap to the leading model on GDPval-AA, but still trails GPT-5.5 (xhigh) by 276 Elo points, with an expected win rate of ~17% against GPT-5.5 (xhigh) under the standard Elo formula.
➤ Grok 4.3 performs strongly on instruction following and agentic customer support tasks. It gains 5 points on 𝜏²-Bench Telecom to reach 98%, in line with GLM-5.1. Grok 4.3 maintains an 81% IFBench score from Grok 4.20 0309 v2.
➤ Gains 8 points on AA-Omniscience Accuracy, but at the cost of an 8-point lower AA-Omniscience Non-Hallucination Rate, so Grok 4.20 0309 v2 still leads on AA-Omniscience Non-Hallucination Rate, followed by MiMo-V2.5-Pro, in line with Grok 4.3.
Congratulations to @xAI and @elonmusk on the impressive release!
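The ~17% expected win rate quoted above for a 276-point Elo deficit can be checked with the standard Elo expected-score formula, E = 1 / (1 + 10^(-d/400)), where d is the rating difference. A minimal Python check follows; the function name `elo_expected_score` is only illustrative.

```python
def elo_expected_score(rating_diff: float) -> float:
    """Expected score under the standard Elo formula.

    rating_diff is (own rating - opponent rating); a negative value means
    the player is rated below the opponent.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# A model trailing by 276 Elo points, as in the quoted post.
p_win = elo_expected_score(-276)
```

Here `p_win` comes out at roughly 0.17, consistent with the figure in the post.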

Jiayi Yuan@jiayiy·13 Mar

Thanks for everything, Guodong.

Guodong Zhang@Guodzh

Last day at xAI. Wild journey past three years but excited about next chapter. Thanks all for the love and support yesterday. So many friends made along the way and I will miss you all!

Jiayi Yuan@jiayiy·6 Mar

@xiuyu_l impressive!

Xiuyu Li@xiuyu_l·6 Mar

The last project I co-led during my PhD is finally out. Verifiable rewards are a key ingredient for RL. The ability to verify is also what enables parallel agents and self-evolving. We propose V1, where generation and verification co-evolve through RL. When done properly, the model can become a surprisingly effective verifier for itself.

Harman Singh (in NYC for summer)@Harman26Singh

Can LLMs Self-Verify? Much better than you'd expect. LLMs are increasingly used as parallel reasoners, sampling many solutions at once. Choosing the right answer is the real bottleneck. We show that pairwise self-verification is a powerful primitive. Introducing V1, a framework that unifies generation and self-verification: 💡 Pairwise self-verification beats pointwise scoring, improving test-time scaling 💡 V1-Infer: Efficient tournament-style ranking that improves self-verification 💡 V1-PairRL: RL training where generation and verification co-evolve for developing better self-verifiers 🧵👇

Jiayi Yuan@jiayiy·12 Feb

🚀

xAI@xai

Since xAI was formed just 30 months ago, the small and talented team has made remarkable progress. The future has never looked more exciting!

Jiayi Yuan retweeted

SpaceX@SpaceX·3 Feb

SpaceX has acquired xAI, forming one of the most ambitious, vertically integrated innovation engines on (and off) Earth → spacex.com/updates#xai-jo…

Jiayi Yuan@jiayiy·11 Nov

As more small expert and agentic models emerge, routing has become a hot research topic. We’re excited to introduce RouterArena, an open leaderboard for routing evaluation. Proud to be part of this effort!

Jiarong Xing@Jiarong_Xing

Ever wondered who decides which LLM answers your question? A router. But… how good is it? 🤔 📢 We built RouterArena—the first open leaderboard for comprehensive router evaluation. 🔗github.com/RouteWorks/Rou…

Jiayi Yuan@jiayiy·10 Sep

@thinkymachines Thanks for the shout-out to our work—it's great to see more focus on this important problem. Horace's approach is a truly elegant solution. Fantastic work! More reading: arxiv.org/abs/2506.09501

Thinking Machines@thinkymachines·10 Sep

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly. The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains. thinkingmachines.ai/blog/defeating…

Jiayi Yuan@jiayiy·10 Sep

Thanks for the shout-out to our work—it's great to see more focus on this important problem. Horace's approach is a truly elegant solution. Fantastic work! More reading: arxiv.org/abs/2506.09501

Thinking Machines@thinkymachines

Jiayi Yuan retweeted

Zirui Liu@ziruirayliu·12 Jun

🔥 Excited to share our new work on reproducibility challenges in reasoning models caused by numerical precision. Ever run the same prompt twice and get completely different answers from your LLM under greedy decoding? You're not alone.
Most LLMs today default to BF16 precision, but we show this choice severely impacts the reproducibility of long generations, even under greedy decoding with a fixed seed. While issues like this are known in tools like vLLM and SGLang, the severity of the problem is widely underestimated. Many in the community still rely on single-run greedy decoding for evaluation, which can lead to misleading results.
🤯 To get a sense, switching from 2 GPUs to 4 GPUs may completely change your model outputs, with up to a 9% drop in accuracy and a difference of 9,000 tokens in response length on standard benchmarks like AIME.
Key takeaways:
• ⚠️ Floating-point non-associativity causes tiny numerical errors to snowball in multi-step reasoning.
• 🔄 Greedy decoding ≠ deterministic output: we observe up to 9% accuracy variance and a 9,000-token difference in response length.
• 📉 When using random sampling with non-zero temperature, the accuracy variance purely from numerical precision is 0.3%~2%, depending on the dataset size and the number of repeated runs.
🌍 Suggestions to the community: We urge the community to adopt better evaluation practices for LLMs, especially for tasks like math reasoning, code generation, and auto-grading:
1. Use random sampling + report Pass@k, average length, and error bars, especially on small datasets and low precision.
2. If using greedy decoding for token-by-token reproducibility, run it in FP32. To help, we released a vLLM patch for FP32 inference.
📄 Paper: lnkd.in/gZAjbWKA 💻 Code: lnkd.in/gwdGWFP5 📈 HF Summary: lnkd.in/gFjsK7Y9
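The floating-point non-associativity mentioned in the post above is easy to see even in double precision. The short Python snippet below is only an illustration of the general effect, not the paper's experiment: it adds the same three numbers with two different groupings, the way a reduction split across a different number of devices might.

```python
# Floating-point addition is not associative: the grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3

left_grouped = (a + b) + c    # add a and b first
right_grouped = a + (b + c)   # add b and c first

# The two results differ in the last bit even though the inputs are identical.
results_differ = left_grouped != right_grouped
gap = abs(left_grouped - right_grouped)
```

The gap here is tiny, but lower-precision formats such as BF16 keep far fewer mantissa bits, so the same reordering effect is larger there and can change which token wins an argmax.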

Jiayi Yuan retweeted

elvis@omarsar0·21 Mar

A survey on efficient reasoning for LLMs. That was quick! I have been featuring papers on the topic of efficient reasoning and I see a few familiar papers in this survey. Good read overall!

Jiayi Yuan retweeted

Zhaozhuo Xu@ZhaozhuoX·25 Feb

🚀 Join us at #AAAI2025 for our tutorial: TQ08: KV Cache Compression for Efficient Long Context LLM Inference 📍 Room 116 ⏰ 4:15 PM - 6:00 PM Learn how to compress KV cache for faster, scalable LLM inference! Don't miss it! #AI #LLM #Efficiency #AAAI25

Jiayi Yuan@jiayiy·12 Nov

Meet us tomorrow at the poster session!

Yu-Neng Chuang@YuNengChuang

📢Excited to present "Taylor Unswift" poster at #EMNLP24 in Miami! Join us on Nov 13 (Wed), 10:30–12:00, at Main #778. "Taylor Unswift" aims to solve the dilemma of secured weight release for LLM developers and users. 🔗Paper: arxiv.org/pdf/2410.05331 🔗Code: github.com/guanchuwang/Ta… Wanna know more about "Taylor Unswift"😉: 🚨 Oftentimes, model developers face a dilemma: open-source their models and lose control, or offer closed APIs but bear costs and deter privacy-conscious users. 🚑 Introducing "Taylor Unswift": a method using Taylor Expansion Theory to protect model weights while allowing users to run models on their own data without accessing the weights. These correspond to the 'Taylor' and 'Unswift' in the title. 🌟 Developers can prevent misuse of their models, while users can run models on their own data without sharing it—unlike with services like the ChatGPT API. More detailed insights can be found in the paper! Kudos to all co-authors: @Guanchu_Gary*, @YuNengChuang*, @RuixiangT, @henryzhongsc, @jiayiy, @serendip410, @ziruirayliu, Vipin Chaudhary, Shuai Xu, James Caverlee, @huxia #LLM #security #NLP #EMNLP

Jiayi Yuan@jiayiy·30 Oct

See you in Vice City, State of Leonida! More info: linkedin.com/posts/guanchu-…

Yuchen Jin@Yuchenj_UW

After "Attention Is All You Need", AI paper titles be like:

Jiayi Yuan@jiayiy·20 Sep

🚀Excited to share our latest #EMNLP2024 work on benchmarking the long context ability with KV Cache compression across RNN-based architectures, token eviction, prompt compression, and quantization. We also provide an easy-to-use codebase (it also has my favorite WoW quote 😉). Feel free to give it a try and ⭐ it if you find it useful! 📄 Paper: arxiv.org/abs/2407.01527 💻 Code: github.com/henryzhongsc/l… Some interesting findings/suggestions include: 1️⃣ Maintaining an uncompressed prefill process is essential for performance, especially with harder tasks. 2️⃣ Combining RNN-based models with attention significantly enhances long-context capabilities. 3️⃣ In "needle-in-a-haystack" evaluation for recent LLMs like Llama-3, we should use longer needles (like 64 digits) since these models tokenize multiple digits into one token. More results and insights can be found in the paper! Kudos to all collaborators: @jiayiy, Hongyi Liu, @henryzhongsc, @YuNengChuang, Songchen Li, Guanchu Wang, Duy Le, @serendip410, Vipin Chaudhary, @ZhaozhuoX, @ziruirayliu, @huxia
