Boxi Yu

703 posts

@BoshCavendish

I love building coding agents and exploring how AI and software engineering can improve each other.

Joined February 2016
1.6K Following · 328 Followers
Boxi Yu retweeted
Qiuyang Mang
Qiuyang Mang@MangQiuyang·
We integrated FrontierCS into Harbor and are releasing a preview long-horizon agent leaderboard (up to 835 turns, ~200K output tokens) with Kimi K2.6 @Kimi_Moonshot (score 46.9) and Claude Code Opus 4.7 @claudeai (43.0) 🚢. The goal: evaluate frontier coding agents in a setting where they iteratively write code, run experiments, read feedback, and improve in an extremely long loop. FrontierCS tasks are open-ended optimization problems. Each task has a continuous score. There is no single accepted output. Agents need to search for better solutions under a step/time/token budget. This makes FrontierCS a natural fit for agentic evaluation. Just plan, code, test, revise, fail, recover, and keep optimizing. Check out our blog: frontier-cs.org/blog/harbor FrontierCS GitHub: github.com/FrontierCS/Fro…
5 replies · 20 retweets · 131 likes · 22.7K views
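The long-horizon setting described above (iteratively write code, run experiments, read feedback, and improve under a budget) can be sketched as a simple score-maximizing loop. This is a hypothetical harness for illustration only: `agent_step`, `score_fn`, and the budget handling are my own stand-ins, not Harbor's or FrontierCS's actual API.

```python
# Hypothetical sketch of a FrontierCS-style evaluation loop: the agent
# repeatedly proposes a solution, receives a continuous score as feedback,
# and the harness keeps the best attempt until the turn budget runs out.
# `agent_step` and `score_fn` are illustrative stand-ins, not a real API.

def run_agent_loop(agent_step, score_fn, initial_solution, max_turns=835):
    best_solution = initial_solution
    best_score = score_fn(initial_solution)
    history = [(best_solution, best_score)]   # full attempt/score trace
    for _turn in range(max_turns):
        # The agent sees its entire history of attempts and their scores.
        candidate = agent_step(history)
        score = score_fn(candidate)
        history.append((candidate, score))
        if score > best_score:                # keep the best solution so far
            best_solution, best_score = candidate, score
    return best_solution, best_score
```

There is no single accepted output: the harness only tracks the running best under the continuous score, which is what makes open-ended optimization tasks a natural fit for this loop.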
Boxi Yu retweeted
Elon Musk
Elon Musk@elonmusk·
❤️❤️ Happy Mother’s Day ❤️❤️ Appreciation to mothers everywhere who brought us all into the world and nurtured their beloved children 🥰
15.3K replies · 46K retweets · 624K likes · 68.4M views
Wei Tao
Wei Tao@itaowe·
Congratulations!
Boxi Yu@BoshCavendish

🔥 SWE-ABS accepted by ICML2026 @icmlconf 🔥 OpenAI @OpenAI showed SWE-Bench @SWEbench tests reject correct patches. We reveal the other side: they also accept wrong ones. SWE-ABS strengthens SWE-Bench (Verified & Pro) via coverage-driven tests + mutation-based attacks.
Key results:
• All top-30 rankings shift (#1#5)
• 19.78% of “solved” patches are actually wrong
• 50.2% of Verified strengthened
• 64.7% of Pro subset strengthened
👉 Test quality—not benchmark difficulty—is the real bottleneck. Links 👇

1 reply · 0 retweets · 1 like · 139 views
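The "mutation-based attacks" idea in the tweet above can be illustrated with a toy example: perturb a known-correct function and check whether the test suite rejects each mutant. A mutant that still passes exposes exactly the blind spot through which a wrong patch could be marked "solved". The `clamp` function, the mutation operators, and the suite format here are all my own illustration, not SWE-ABS's actual machinery.

```python
# Toy sketch of mutation-based test strengthening: a surviving mutant
# (a wrong implementation the suite accepts) indicates a weak test suite.
# Everything here is illustrative; it is not the SWE-ABS implementation.

def clamp(x, lo, hi):            # known-correct reference implementation
    return max(lo, min(x, hi))

MUTANTS = [
    lambda x, lo, hi: x,         # clamping dropped entirely
    lambda x, lo, hi: min(x, hi),  # lower bound dropped
    lambda x, lo, hi: max(lo, x),  # upper bound dropped
]

def run_suite(fn, cases):
    """A test suite is just (args, expected) pairs; True means it passes."""
    return all(fn(*args) == expected for args, expected in cases)

def surviving_mutants(cases):
    """Indices of mutants the suite fails to kill, i.e. test blind spots."""
    return [i for i, m in enumerate(MUTANTS) if run_suite(m, cases)]

weak_suite = [((5, 0, 10), 5)]                 # only an in-range input
strong_suite = weak_suite + [((-1, 0, 10), 0), ((99, 0, 10), 10)]
```

With only the in-range case, every mutant survives; adding the two boundary cases kills all three, which is the sense in which coverage-driven tests "strengthen" a benchmark.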
austin lau
austin lau@helloitsaustin·
I got married this past weekend so I did what any rational @AnthropicAI employee would do and had Claude Code analyze 12 years of iMessages with my wife, then Claude Design used that data to whip up a website for our guests in just minutes.
535 replies · 870 retweets · 19K likes · 3.4M views
Boxi Yu retweeted
Anthropic
Anthropic@AnthropicAI·
New Anthropic Fellows research: Model Spec Midtraining (MSM). Standard alignment methods train AIs on examples of desired behavior. But this can fail to generalize to new situations. MSM addresses this by first teaching AIs how we would like them to generalize and why.
126 replies · 159 retweets · 1.9K likes · 252.2K views
Boxi Yu
Boxi Yu@BoshCavendish·
@goodhunt That's seriously impressive, bro. Please add me to the group chat.
0 replies · 0 retweets · 1 like · 241 views
Hunter Bown
Hunter Bown@goodhunt·
Hello, whale bros. I'm the American guy who built DeepSeek-TUI. Honestly, I'd really like to hang out with the whale bros in China, but my wall-jumping skills are limited to writing code: I still haven't managed to get WeChat set up, which is frankly a little embarrassing. A favor to ask of you all: 1) Help repost and spread this, so this open-source terminal tool can climb over the Great Wall and reach the bros. 2) While you're at it, help me verify a WeChat account. I want to start a group where we can talk DeepSeek, talk open source, and talk about how to make agents better. In exchange, I swear to defend cargo install as the one and only install path and never let a single bro suffer through npm. By the way, this message was polished by DeepSeek: thanks to the whale for granting me fluent Chinese 🙏 github.com/Hmbown/DeepSee…
939 replies · 648 retweets · 5.5K likes · 998.9K views
Boxi Yu
Boxi Yu@BoshCavendish·
@ewind_dev How do people protect against this abroad? It feels like we could maintain an organization or guild to help everyone crack down on the thieves together.
0 replies · 0 retweets · 0 likes · 906 views
Boxi Yu retweeted
Kye Gomez (swarms)
Kye Gomez (swarms)@KyeGomezB·
Introducing OpenMythos An open-source, first-principles theoretical reconstruction of Claude Mythos, implemented in PyTorch. The architecture instantiates a looped transformer with a Mixture-of-Experts (MoE) routing mechanism, enabling iterative depth via weight sharing and conditional computation across experts. My implementation explores the hypothesis that recursive application of a fixed parameterized block, coupled with sparse expert activation, can yield improved efficiency–performance tradeoffs and emergent multi-step reasoning. Learn more ⬇️🧵
243 replies · 1.2K retweets · 8.3K likes · 1.7M views
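A minimal first-principles sketch of the two mechanisms named above, iterative depth via weight sharing and sparse top-k expert routing, might look like the following. All dimensions, names, and the tanh experts are my own illustration (written in NumPy rather than PyTorch for brevity); this is not the OpenMythos code.

```python
import numpy as np

# Sketch of (1) a "looped" block reapplied with shared weights and
# (2) top-k Mixture-of-Experts routing, so only K of E experts fire
# per token. Shapes and operators are illustrative, not OpenMythos's.

rng = np.random.default_rng(0)
D, E, K = 8, 4, 2                  # model dim, num experts, experts per token

W_router = rng.normal(size=(D, E)) * 0.1
W_experts = rng.normal(size=(E, D, D)) * 0.1

def moe_block(x):
    """One shared block: route each token to its top-K experts."""
    logits = x @ W_router                        # (tokens, E)
    topk = np.argsort(logits, axis=-1)[:, -K:]   # chosen expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # softmax over only the selected experts' logits (sparse activation)
        w = np.exp(logits[t, topk[t]])
        w /= w.sum()
        for weight, e in zip(w, topk[t]):
            out[t] += weight * np.tanh(x[t] @ W_experts[e])
    return x + out                                # residual connection

def looped_forward(x, loops=3):
    """Iterative depth: the same block (same weights) applied `loops` times."""
    for _ in range(loops):
        x = moe_block(x)
    return x

tokens = rng.normal(size=(5, D))
y = looped_forward(tokens)
```

The design point being explored is that depth comes from reapplying one parameterized block, while conditional computation keeps per-token FLOPs low because only K of the E experts run.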
karminski-牙医
karminski-牙医@karminski3·
Is Qwen3.6-35B-A3B really this strong even at 2-bit quantization? The Unsloth team (which is, of course, just two brothers) just shipped quantized versions of Qwen3.6-35B-A3B at lightning speed, and the test they ran stunned me... 2-bit can complete 30+ tool calls??? I genuinely didn't believe it, because when I previously tested Qwen3.5-35B-A3B at 8-bit (mlx format, mind you), it could only manage about 4-5 tool calls before falling apart. It could handle simple jobs like tidying up email, but the moment you asked it to finish the email triage and then log a summary to Notion / Obsidian, it blew up. Keep in mind that Unsloth's 2-bit dynamic quantization of this model is only 12.3GB, with only 1GB activated! A 32GB Mac can run it with ease. I'm off to test it right now; real-world results coming shortly. x.com/UnslothAI/stat…
42 replies · 53 retweets · 573 likes · 71.2K views
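The sizes quoted above can be sanity-checked with back-of-envelope arithmetic: a uniform 2-bit encoding of 35B weights would occupy 35e9 × 2 / 8 ≈ 8.75 GB, so a 12.3 GB file implies the dynamic scheme keeps some sensitive layers above 2 bits. Assuming "A3B" means roughly 3B active parameters, the ~1 GB activated figure also roughly checks out. A rough estimate, not an exact accounting of the quantized file layout:

```python
# Back-of-envelope check of the sizes mentioned above. All figures are
# rough estimates, not an exact accounting of the quantized file format.

def quantized_size_gb(n_params, bits):
    """Storage for n_params weights at a uniform bit width, in GB."""
    return n_params * bits / 8 / 1e9

uniform_2bit = quantized_size_gb(35e9, 2)  # 8.75 GB floor if all-2-bit
active_2bit = quantized_size_gb(3e9, 2)    # ~0.75 GB if ~3B params are active
# The shipped file is 12.3 GB: dynamic quantization keeps sensitive
# layers above 2 bits, so the real size sits above the uniform floor.
```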
Ofir Press
Ofir Press@OfirPress·
code is everything and everything is code
4 replies · 1 retweet · 12 likes · 1.5K views
Boxi Yu retweeted
Dora Xie
Dora Xie@stlslimjim·
@CreaoAI Same idea, but open source! Check out OpenASE
1 reply · 1 retweet · 2 likes · 307 views
Boxi Yu retweeted
levi
levi@levidiamode·
Day 83/365 of GPU Programming. Looking at DeepSeek's Multi-Head Latent Attention today. The last part of the AMD challenge series is to optimize an MLA decode kernel for MI355X, where the absorbed Q and compressed KV cache are given and your task is to do the attention computation. A resource that really helped me internalize what MLA does was @rasbt's incredible visual guide to attention variants in LLMs (luckily he posted it last week!), which covers everything from MHA to GQA to MLA to SWA, et cetera. If there's one place to get a visual intuition for recent attention mechanisms, it's that blog post. @jbhuang0604's video on MQA, GQA, MLA, and DSA was the best conceptual intro I found on the topic; it progressively builds up the ideas from first principles. The Welch Labs analysis of MLA is a great watch as well, with a beautiful visualization of the changes DeepSeek made for MLA. I tried out a few kernels once I had a basic understanding of MLA, and I think I'm slowly getting more comfortable with at least analyzing kernels.
levi@levidiamode

Day 82/365 of GPU Programming. Taking a closer look at Mixture of Experts today, so I can write better MoE kernels. Specifically, to optimize an MXFP4 MoE fused kernel for the GPU Mode challenge. I haven't had much prior exposure to MoEs, so there were lots of new concepts to learn today. Luckily I found the best intro to MoEs thanks to @MaartenGr's visual overview of the topic. I then watched @tatsu_hashimoto's amazing Stanford CS336 lecture on MoEs, which added deeper context around why MoEs are gaining popularity, FLOPs, OLMoE, infra complexity, routing functions (mind-blown this works so well...), expert sizes, training objectives, top-k routing, and DeepSeek variations. Once I had a basic understanding I started playing around with some AITER kernels, but progress there is TBD. Also had a nice chat with @juscallmevyom (who was kind enough to reach out!) about the AMD kernels and the challenge of materialization overhead.

21 replies · 147 retweets · 1.4K likes · 113.8K views
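The decode setting described above (absorbed Q, a compressed per-token latent KV cache, attention computed over the latents) can be sketched in a few lines. The dimensions and projections here are my own toy simplification of MLA, not the MI355X challenge kernel or DeepSeek's implementation:

```python
import numpy as np

# Toy single-head MLA-style decode step: the KV cache stores one low-rank
# latent per past token, the query is "absorbed" into latent space so
# scores are computed directly against the cache, and values are
# up-projected from the same latents. Illustrative shapes only.

rng = np.random.default_rng(1)
d_model, d_latent, seq = 16, 4, 10

W_absorb = rng.normal(size=(d_model, d_latent)) * 0.1  # query -> latent space
W_uv = rng.normal(size=(d_latent, d_model)) * 0.1      # latent -> value space

def mla_decode_step(q, kv_latents):
    """Attend one new query over the compressed KV cache."""
    q_abs = q @ W_absorb                            # absorbed query, (d_latent,)
    scores = kv_latents @ q_abs / np.sqrt(d_latent)  # (seq,)
    probs = np.exp(scores - scores.max())            # stable softmax
    probs /= probs.sum()
    values = kv_latents @ W_uv                       # reconstruct values, (seq, d_model)
    return probs @ values                            # (d_model,)

cache = rng.normal(size=(seq, d_latent))             # compressed KV cache
out = mla_decode_step(rng.normal(size=d_model), cache)
```

The point of the compression is visible in the cache shape: it stores `d_latent` numbers per token instead of full keys and values, and the kernel's job is to make the score and up-projection matmuls fast.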