Boxi Yu

703 posts

@BoshCavendish

I love building coding agents and exploring how AI and software engineering can improve each other.

Joined February 2016
1.6K Following · 328 Followers
Boxi Yu retweeted
Qiuyang Mang
Qiuyang Mang@MangQiuyang·
We integrated FrontierCS into Harbor and are releasing a preview long-horizon agent leaderboard (up to 835 turns, ~200K output tokens) with Kimi K2.6 @Kimi_Moonshot (score 46.9) and Claude Code Opus 4.7 @claudeai (43.0) 🚢. The goal: evaluate frontier coding agents in a setting where they iteratively write code, run experiments, read feedback, and improve in an extremely long loop. FrontierCS tasks are open-ended optimization problems. Each task has a continuous score. There is no single accepted output. Agents need to search for better solutions under a step/time/token budget. This makes FrontierCS a natural fit for agentic evaluation. Just plan, code, test, revise, fail, recover, and keep optimizing. Check out our blog: frontier-cs.org/blog/harbor FrontierCS GitHub: github.com/FrontierCS/Fro…
5 replies · 20 retweets · 131 likes · 22.7K views
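The long-horizon setting described above (iteratively write code, run experiments, read feedback, and improve under a budget) can be sketched as a simple score-maximizing loop. This is a hypothetical harness for illustration only: `agent_step`, `score_fn`, and the budget handling are my own stand-ins, not Harbor's or FrontierCS's actual API.

```python
# Hypothetical sketch of a FrontierCS-style evaluation loop: the agent
# repeatedly proposes a solution, receives a continuous score as feedback,
# and the harness keeps the best attempt until the turn budget runs out.
# `agent_step` and `score_fn` are illustrative stand-ins, not a real API.

def run_agent_loop(agent_step, score_fn, initial_solution, max_turns=835):
    best_solution = initial_solution
    best_score = score_fn(initial_solution)
    history = [(best_solution, best_score)]   # full attempt/score trace
    for _turn in range(max_turns):
        # The agent sees its entire history of attempts and their scores.
        candidate = agent_step(history)
        score = score_fn(candidate)
        history.append((candidate, score))
        if score > best_score:                # keep the best solution so far
            best_solution, best_score = candidate, score
    return best_solution, best_score
```

There is no single accepted output: the harness only tracks the running best under the continuous score, which is what makes open-ended optimization tasks a natural fit for this loop.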
Boxi Yu retweeted
Elon Musk
Elon Musk@elonmusk·
❤️❤️ Happy Mother’s Day ❤️❤️ Appreciation to mothers everywhere who brought us all into the world and nurtured their beloved children 🥰
15.3K replies · 46K retweets · 624K likes · 68.4M views
Wei Tao
Wei Tao@itaowe·
Congratulations!
Boxi Yu@BoshCavendish

🔥 SWE-ABS accepted by ICML2026 @icmlconf 🔥 OpenAI @OpenAI showed SWE-Bench @SWEbench tests reject correct patches. We reveal the other side: they also accept wrong ones. SWE-ABS strengthens SWE-Bench (Verified & Pro) via coverage-driven tests + mutation-based attacks.
Key results:
• All top-30 rankings shift (#1#5)
• 19.78% of “solved” patches are actually wrong
• 50.2% of Verified strengthened
• 64.7% of Pro subset strengthened
👉 Test quality—not benchmark difficulty—is the real bottleneck. Links 👇

1 reply · 0 retweets · 1 like · 139 views
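The "mutation-based attacks" idea in the tweet above can be illustrated with a toy example: perturb a known-correct function and check whether the test suite rejects each mutant. A mutant that still passes exposes exactly the blind spot through which a wrong patch could be marked "solved". The `clamp` function, the mutation operators, and the suite format here are all my own illustration, not SWE-ABS's actual machinery.

```python
# Toy sketch of mutation-based test strengthening: a surviving mutant
# (a wrong implementation the suite accepts) indicates a weak test suite.
# Everything here is illustrative; it is not the SWE-ABS implementation.

def clamp(x, lo, hi):            # known-correct reference implementation
    return max(lo, min(x, hi))

MUTANTS = [
    lambda x, lo, hi: x,         # clamping dropped entirely
    lambda x, lo, hi: min(x, hi),  # lower bound dropped
    lambda x, lo, hi: max(lo, x),  # upper bound dropped
]

def run_suite(fn, cases):
    """A test suite is just (args, expected) pairs; True means it passes."""
    return all(fn(*args) == expected for args, expected in cases)

def surviving_mutants(cases):
    """Indices of mutants the suite fails to kill, i.e. test blind spots."""
    return [i for i, m in enumerate(MUTANTS) if run_suite(m, cases)]

weak_suite = [((5, 0, 10), 5)]                 # only an in-range input
strong_suite = weak_suite + [((-1, 0, 10), 0), ((99, 0, 10), 10)]
```

With only the in-range case, every mutant survives; adding the two boundary cases kills all three, which is the sense in which coverage-driven tests "strengthen" a benchmark.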
austin lau
austin lau@helloitsaustin·
I got married this past weekend so I did what any rational @AnthropicAI employee would do and had Claude Code analyze 12 years of iMessages with my wife, then Claude Design used that data to whip up a website for our guests in just minutes.
535 replies · 870 retweets · 19K likes · 3.4M views
Boxi Yu retweeted
Anthropic
Anthropic@AnthropicAI·
New Anthropic Fellows research: Model Spec Midtraining (MSM). Standard alignment methods train AIs on examples of desired behavior. But this can fail to generalize to new situations. MSM addresses this by first teaching AIs how we would like them to generalize and why.
126 replies · 159 retweets · 1.9K likes · 252.2K views
Boxi Yu
Boxi Yu@BoshCavendish·
@goodhunt That's seriously impressive, bro. Please add me to the group chat.
0 replies · 0 retweets · 1 like · 241 views
Hunter Bown
Hunter Bown@goodhunt·
Hello, whale bros. I'm the American guy who built DeepSeek-TUI. Honestly, I'd really like to hang out with the whale bros in China, but my wall-jumping skills are limited to writing code: I still haven't managed to get WeChat set up, which is frankly a little embarrassing. A favor to ask of you all: 1) Help repost and spread this, so this open-source terminal tool can climb over the Great Wall and reach the bros. 2) While you're at it, help me verify a WeChat account. I want to start a group where we can talk DeepSeek, talk open source, and talk about how to make agents better. In exchange, I swear to defend cargo install as the one and only install path and never let a single bro suffer through npm. By the way, this message was polished by DeepSeek: thanks to the whale for granting me fluent Chinese 🙏 github.com/Hmbown/DeepSee…
939 replies · 648 retweets · 5.5K likes · 998.9K views
Boxi Yu
Boxi Yu@BoshCavendish·
@ewind_dev How do people protect against this abroad? It feels like we could maintain an organization or guild to help everyone crack down on the thieves together.
0 replies · 0 retweets · 0 likes · 906 views
Boxi Yu retweeted
Kye Gomez (swarms)
Kye Gomez (swarms)@KyeGomezB·
Introducing OpenMythos An open-source, first-principles theoretical reconstruction of Claude Mythos, implemented in PyTorch. The architecture instantiates a looped transformer with a Mixture-of-Experts (MoE) routing mechanism, enabling iterative depth via weight sharing and conditional computation across experts. My implementation explores the hypothesis that recursive application of a fixed parameterized block, coupled with sparse expert activation, can yield improved efficiency–performance tradeoffs and emergent multi-step reasoning. Learn more ⬇️🧵
243 replies · 1.2K retweets · 8.3K likes · 1.7M views
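A minimal first-principles sketch of the two mechanisms named above, iterative depth via weight sharing and sparse top-k expert routing, might look like the following. All dimensions, names, and the tanh experts are my own illustration (written in NumPy rather than PyTorch for brevity); this is not the OpenMythos code.

```python
import numpy as np

# Sketch of (1) a "looped" block reapplied with shared weights and
# (2) top-k Mixture-of-Experts routing, so only K of E experts fire
# per token. Shapes and operators are illustrative, not OpenMythos's.

rng = np.random.default_rng(0)
D, E, K = 8, 4, 2                  # model dim, num experts, experts per token

W_router = rng.normal(size=(D, E)) * 0.1
W_experts = rng.normal(size=(E, D, D)) * 0.1

def moe_block(x):
    """One shared block: route each token to its top-K experts."""
    logits = x @ W_router                        # (tokens, E)
    topk = np.argsort(logits, axis=-1)[:, -K:]   # chosen expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # softmax over only the selected experts' logits (sparse activation)
        w = np.exp(logits[t, topk[t]])
        w /= w.sum()
        for weight, e in zip(w, topk[t]):
            out[t] += weight * np.tanh(x[t] @ W_experts[e])
    return x + out                                # residual connection

def looped_forward(x, loops=3):
    """Iterative depth: the same block (same weights) applied `loops` times."""
    for _ in range(loops):
        x = moe_block(x)
    return x

tokens = rng.normal(size=(5, D))
y = looped_forward(tokens)
```

The design point being explored is that depth comes from reapplying one parameterized block, while conditional computation keeps per-token FLOPs low because only K of the E experts run.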
karminski-牙医
karminski-牙医@karminski3·
Is Qwen3.6-35B-A3B really this strong even at 2-bit quantization? The Unsloth team (which is, of course, just two brothers) just shipped quantized versions of Qwen3.6-35B-A3B at lightning speed, and the test they ran stunned me... 2-bit can complete 30+ tool calls??? I genuinely didn't believe it, because when I previously tested Qwen3.5-35B-A3B at 8-bit (mlx format, mind you), it could only manage about 4-5 tool calls before falling apart. It could handle simple jobs like tidying up email, but the moment you asked it to finish the email triage and then log a summary to Notion / Obsidian, it blew up. Keep in mind that Unsloth's 2-bit dynamic quantization of this model is only 12.3GB, with only 1GB activated! A 32GB Mac can run it with ease. I'm off to test it right now; real-world results coming shortly. x.com/UnslothAI/stat…
42 replies · 53 retweets · 573 likes · 71.2K views
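The sizes quoted above can be sanity-checked with back-of-envelope arithmetic: a uniform 2-bit encoding of 35B weights would occupy 35e9 × 2 / 8 ≈ 8.75 GB, so a 12.3 GB file implies the dynamic scheme keeps some sensitive layers above 2 bits. Assuming "A3B" means roughly 3B active parameters, the ~1 GB activated figure also roughly checks out. A rough estimate, not an exact accounting of the quantized file layout:

```python
# Back-of-envelope check of the sizes mentioned above. All figures are
# rough estimates, not an exact accounting of the quantized file format.

def quantized_size_gb(n_params, bits):
    """Storage for n_params weights at a uniform bit width, in GB."""
    return n_params * bits / 8 / 1e9

uniform_2bit = quantized_size_gb(35e9, 2)  # 8.75 GB floor if all-2-bit
active_2bit = quantized_size_gb(3e9, 2)    # ~0.75 GB if ~3B params are active
# The shipped file is 12.3 GB: dynamic quantization keeps sensitive
# layers above 2 bits, so the real size sits above the uniform floor.
```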
Ofir Press
Ofir Press@OfirPress·
code is everything and everything is code
4 replies · 1 retweet · 12 likes · 1.5K views
Boxi Yu retweeted
Dora Xie
Dora Xie@stlslimjim·
@CreaoAI Same idea, but open source! Check out OpenASE
1 reply · 1 retweet · 2 likes · 307 views
Boxi Yu retweeted
levi
levi@levidiamode·
Day 83/365 of GPU Programming. Looking at DeepSeek's Multi-Head Latent Attention today. The last part of the AMD challenge series is to optimize an MLA decode kernel for MI355X, where the absorbed Q and compressed KV cache are given and your task is to do the attention computation. A resource that really helped me internalize what MLA does was @rasbt's incredible visual guide to attention variants in LLMs (luckily he posted it last week!), which covers everything from MHA to GQA to MLA to SWA, et cetera. If there's one place to get a visual intuition for recent attention mechanisms, it's that blog post. @jbhuang0604's video on MQA, GQA, MLA, and DSA was the best conceptual intro I found on the topic; it progressively builds up the ideas from first principles. The Welch Labs analysis of MLA is a great watch as well, with a beautiful visualization of the changes DeepSeek made for MLA. I tried out a few kernels once I had a basic understanding of MLA, and I think I'm slowly getting more comfortable with at least analyzing kernels.
levi@levidiamode

Day 82/365 of GPU Programming. Taking a closer look at Mixture of Experts today, so I can write better MoE kernels. Specifically, to optimize an MXFP4 MoE fused kernel for the GPU Mode challenge. I haven't had much prior exposure to MoEs, so there were lots of new concepts to learn today. Luckily I found the best intro to MoEs thanks to @MaartenGr's visual overview of the topic. I then watched @tatsu_hashimoto's amazing Stanford CS336 lecture on MoEs, which added deeper context around why MoEs are gaining popularity, FLOPs, OLMoE, infra complexity, routing functions (mind-blown this works so well...), expert sizes, training objectives, top-k routing, and DeepSeek variations. Once I had a basic understanding I started playing around with some AITER kernels, but progress there is TBD. Also had a nice chat with @juscallmevyom (who was kind enough to reach out!) about the AMD kernels and the challenge of materialization overhead.

21 replies · 147 retweets · 1.4K likes · 113.8K views
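The decode setting described above (absorbed Q, a compressed per-token latent KV cache, attention computed over the latents) can be sketched in a few lines. The dimensions and projections here are my own toy simplification of MLA, not the MI355X challenge kernel or DeepSeek's implementation:

```python
import numpy as np

# Toy single-head MLA-style decode step: the KV cache stores one low-rank
# latent per past token, the query is "absorbed" into latent space so
# scores are computed directly against the cache, and values are
# up-projected from the same latents. Illustrative shapes only.

rng = np.random.default_rng(1)
d_model, d_latent, seq = 16, 4, 10

W_absorb = rng.normal(size=(d_model, d_latent)) * 0.1  # query -> latent space
W_uv = rng.normal(size=(d_latent, d_model)) * 0.1      # latent -> value space

def mla_decode_step(q, kv_latents):
    """Attend one new query over the compressed KV cache."""
    q_abs = q @ W_absorb                            # absorbed query, (d_latent,)
    scores = kv_latents @ q_abs / np.sqrt(d_latent)  # (seq,)
    probs = np.exp(scores - scores.max())            # stable softmax
    probs /= probs.sum()
    values = kv_latents @ W_uv                       # reconstruct values, (seq, d_model)
    return probs @ values                            # (d_model,)

cache = rng.normal(size=(seq, d_latent))             # compressed KV cache
out = mla_decode_step(rng.normal(size=d_model), cache)
```

The point of the compression is visible in the cache shape: it stores `d_latent` numbers per token instead of full keys and values, and the kernel's job is to make the score and up-projection matmuls fast.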