Yuhao Dong (@dyhTHU) - Twitter Profili | Zamantika Mersobahis Locabet

Yuhao Dong retweetledi

Ziwei Liu@liuziwei7·6d

📢Welcome to check out our work @iclr_conf 🇧🇷 * LMMs: - NEO: github.com/EvolvingLMMs-L… - Visual Jigsaw: penghao-wu.github.io/visual_jigsaw/ * World Models: - Light-X: lightx-ai.github.io - IGGT: lifuguan.github.io/IGGT_official/ - ViMoGen: motrixlab.github.io/2026_iclr_vimo… - EgoTwin: egotwin.pages.dev

English

1

17

78

4.6K

Yuhao Dong retweetledi

DeepSeek@deepseek_ai·6d

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. 🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. 🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice. Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today! 📄 Tech Report: huggingface.co/deepseek-ai/De… 🤗 Open Weights: huggingface.co/collections/de… 1/n

English

1.6K

7.7K

44.8K

9.3M

Yuhao Dong retweetledi

Kimi.ai@Kimi_Moonshot·6d

Meet Kimi K2.6 Agent Swarm 👋 Highlights： 🔹 Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from 100 / 1,500 in K2.5). 🔹 Outputs are real files, not chat - one run delivers 100+ files, 100,000-word literature reviews, or 20,000-row datasets. 🔹Heterogeneous skills - search, analysis, coding, long-form writing, and visual generation all running in parallel 🔗Try it at: kimi.com/agent-swarm?ch…

English

106

326

3.7K

596.9K

Yuhao Dong retweetledi

Artificial Analysis@ArtificialAnlys·21 Nis

Moonshot’s Kimi K2.6 is the new leading open weights model. Kimi K2.6 lands at #4 on the Artificial Analysis Intelligence Index (54) behind only Anthropic, Google, and OpenAI (all 57) Key takeaways: ➤ Increase in performance on agentic tasks: @Kimi_Moonshot's Kimi K2.6 achieves an Elo of 1520 on our GDPval-AA evaluation, which is a marked improvement over Kimi K2.5’s Elo of 1309. GDPval-AA is our leading metric for general agentic performance, measuring the performance on knowledge work tasks such as preparing presentations and analysis. Models are given code execution and web browsing tools in an agentic loop via our open source reference agentic harness called Stirrup. This continues Kimi K2.6’s strength in tool use, maintaining a 96% score on τ²-Bench Telecom, placing it among other frontier models in this category. ➤ Low hallucination rate: Kimi K2.5 scores 6 on the AA-Omniscience Index, our knowledge evaluation measuring both accuracy and hallucination rate. This score is primarily driven by a comparatively low hallucination rate of 39% (reduced from Kimi K2.5’s 65%), indicating a greater capability to abstain rather than fabricate knowledge when the model is uncertain. Kimi K2.6’s low hallucination rate places it similarly to other models such as Claude Opus 4.7 (36%) and MiniMax-M2.7 (34%) ➤ High token usage: Kimi K2.6 demonstrates high token usage, but is in line with other frontier models in the same intelligence tier. To run the full Artificial Analysis Intelligence Index, Kimi K2.6 used ~160M reasoning tokens. This is slightly lower than Claude Sonnet 4.6 (~190M reasoning tokens) but much higher than GPT 5.4 (~110M reasoning tokens). ➤ Open weights: Kimi K2.6 is a Mixture-of-Experts (MoE) model with 1T total parameters and 32B active, same as the previous two generations of models Kimi K2 Thinking and Kimi K2.5. Kimi K2.6 again pushes the open weights frontier in intelligence. ➤ Third Party Access: Kimi K2.6 is accessible through Moonshot’s First Party API as well as third party API providers Novita, Baseten, Fireworks, and Parasail ➤ Multimodality: Kimi K2.6 supports Image and Video input and text output natively. The model’s max context length remains 256k. Further analysis in the threads below.

English

30

130

1.3K

207.6K

Yuhao Dong@dyhTHU·20 Nis

Fabulous😍

Kimi.ai@Kimi_Moonshot

Meet Kimi K2.6 agent - Video hero section, WebGL shaders, real backends. From one prompt. 🔹 Video hero sections - cinematic aesthetic, auto-composited 🔹 WebGL shader animations - native GLSL / WGSL, liquid metal, caustics, raymarching 🔹 Motion design - GSAP + Framer Motion 🔹 Backend database: Kimi wires up auth + database + backend in one pass. 🔹 Website stack - React 19 + TypeScript + Vite + Tailwind + shadcn/ui 🔹 3D w/ physically-based lighting - Three.js + React Three Fiber

Français

0

4

104

Yuhao Dong retweetledi

Xinyu Zhou@zxytim·20 Nis

BTW, I vibe coded this LLM inference engine example in the official blog using Kimi K2.6 on my laptop😘. I choose to use zig, not because it is easy, but because it is hard. I've never written any zig and metal code in my entire life, and I can just build whatever I imagine with Kimi K2.6. kimi.com/blog/kimi-k2-6

Kimi.ai@Kimi_Moonshot

Meet Kimi K2.6: Advancing Open-Source Coding 🔹Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), Charxiv w/ python(86.7), Math Vision w/ python (93.2) What's new: 🔹Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization). 🔹Motion-rich frontend - Videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D. 🔹Agent Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files. 🔹Proactive Agents - K2.6 model powers OpenClaw, Hermes Agent, etc for 24/7 autonomous ops. 🔹Claw Groups (research preview) - bring your own agents, command your friends', bots & humans in the loop. - K2.6 is now live on kimi.com in chat mode and agent mode. For production-grade coding, pair K2.6 with Kimi Code: kimi.com/code - 🔗 API: platform.moonshot.ai 🔗 Tech blog: kimi.com/blog/kimi-k2-6 🔗 Weights & code: huggingface.co/moonshotai/Kim…

English

40

56

904

150.3K

Yuhao Dong retweetledi

Wu Haoning@HaoningTimothy·20 Nis

It’s finnnnnnnally out: huggingface.co/moonshotai/Kim…

English

7

110

3.4K

Yuhao Dong retweetledi

Kimi.ai@Kimi_Moonshot·20 Nis

Meet Kimi K2.6: Advancing Open-Source Coding 🔹Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), Charxiv w/ python(86.7), Math Vision w/ python (93.2) What's new: 🔹Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization). 🔹Motion-rich frontend - Videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D. 🔹Agent Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files. 🔹Proactive Agents - K2.6 model powers OpenClaw, Hermes Agent, etc for 24/7 autonomous ops. 🔹Claw Groups (research preview) - bring your own agents, command your friends', bots & humans in the loop. - K2.6 is now live on kimi.com in chat mode and agent mode. For production-grade coding, pair K2.6 with Kimi Code: kimi.com/code - 🔗 API: platform.moonshot.ai 🔗 Tech blog: kimi.com/blog/kimi-k2-6 🔗 Weights & code: huggingface.co/moonshotai/Kim…

English

900

2.4K

18.1K

7.4M

Yuhao Dong retweetledi

Kimi.ai@Kimi_Moonshot·18 Nis

We push Prefill/Decode disaggregation beyond a single cluster: cross-datacenter + heterogeneous hardware, unlocking the potential for significantly lower cost per token. This was previously blocked by KV cache transfer overhead. The key enabler is our hybrid model (Kimi Linear), which reduces KV cache size and makes cross-DC PD practical. Validated on a 20x scaled-up Kimi Linear model: ✅ 1.54× throughput ✅ 64% ↓ P90 TTFT → Directly translating into lower token cost. More in Prefill-as-a-Service: arxiv.org/html/2604.1503…

English

69

347

2.9K

680.3K

Yuhao Dong retweetledi

Evolvent AI@Evolvent_AI·14 Nis

Launch Week — Day 2: Terrarium 🧊 Most existing data engines are built for single-session, reactive agents: the environment does not change on its own, and nearly all state changes are triggered by the agent itself. But real-world proactive agents operate over time on long-horizon tasks. While the agent is working, the world keeps changing. 📧 Emails arrive mid-task. 📊 Databases get updated. 🔄 Agents must detect these changes on their own — and in some cases, proactively take action in response. We’re open-sourcing 🪴 Terrarium — a multi-turn data engine for proactive agents in living environments, designed for agents like 🦞 OpenClaw, 📟 Claude Code, and others. You build a world. Place an agent inside. Watch how it adapts. GitHub: github.com/evolvent-ai/Te…

English

2

7

18

1.5K

Yuhao Dong retweetledi

Junyang Lin@JustinLin610·12 Nis

we need agent evals that are really consistent with real world usages. otherwise people are optimizing foundation models for the wrong direction. the problem of targeting is even bigger than benchmaxxing.

English

22

16

239

27.1K

Yuhao Dong retweetledi

Yuwei Niu@purshow04·10 Nis

x.com/i/article/2042…

ZXX

1

15

65

11.3K

Yuhao Dong retweetledi

Ziwei Liu@liuziwei7·9 Nis

🤩Next-Generation Video Understanding Benchmark🤩 📽️Video-MME-v2📽️ provides a robust and faithful evaluation of video models with two highlights: * Progressive Multi-Level Evaluation Dimensions * Grouped Non-Linear Evaluation Mechanism - Leaderboard: video-mme-v2.netlify.app

Yuhao Dong@dyhTHU

🔥 Excited to share Video-MME-v2! 🔥 We built it to tackle a growing issue: video understanding benchmarks are getting saturated. 🏃🏻 Over 3,300 human-hours, nearly a year of effort 🌟 A new design with a progressive hierarchy + group-based nonlinear evaluation What we found: 👉 Human: 90.7 vs 👉 Gemini-3-Pro: 49.4 The gap is still huge. Explore More at: Page: video-mme-v2.netlify.app Paper: arxiv.org/pdf/2604.05015

English

0

7

37

4.5K

Yuhao Dong retweetledi

Artificial Analysis@ArtificialAnlys·8 Nis

Announcing APEX-Agents-AA, our latest leaderboard on Artificial Analysis, evaluating AI agents on long-horizon professional services tasks with realistic application dependencies This is our implementation of the APEX-Agents benchmark - an agentic work task evaluation open-sourced by @mercor_ai. It tests AI agent ability to execute realistic tasks created by investment banking analysts, management consultants, and corporate lawyers. Mercor released extensive data to enable model evaluation and training across the community, comprising 480 tasks including tool implementations, rubrics, and grading workflows. We exclude tasks with external service dependencies and run the remaining 452 tasks for APEX-Agents-AA. Models complete tasks using Stirrup, our open-source agent harness as used in GDPval-AA, and a customized tool set based on the original benchmark implementation Results overview: 🏅 OpenAI, Anthropic and Google are in close competition at the top of the leaderboard, with 33.3% for GPT-5.4, 33.0% for Claude Opus 4.6, and 32% for Gemini 3.1 Pro Preview 📈 The overall scores on Artificial Analysis today are similar to Mercor’s testing, but some models such as GPT-5.4 nano show improvements in score using our Stirrup test harness ↻ We’ll be updating this leaderboard with key releases for agentic work use as a metric for agent capability on well-defined, long horizon work tasks APEX-Agents overview: ➤ Tasks span 3 professional domains: investment banking, management consulting, and corporate law ➤ The tasks are designed to require long-horizon work with a large number of tools, which are provided through MCP servers as would be used in many real-world deployments (including calendar, chat, spreadsheet and presentation operations, etc.) ➤ Required outputs include direct message responses (87%) and creating or modifying spreadsheets (6.6%), documents (4.8%), and presentations (1.3%) ➤ Model outputs are parsed and graded against binary rubrics using an LLM judge. Each task is run 3 times and scored pass@1 - a pass requires every rubric test to pass ➤ In our APEX-Agents-AA implementation, 452 tasks run in our open-source Stirrup harness with tool management and usage from @mercor_ai's original MCP implementation. This provides a consistent, reproducible baseline for comparing raw model capability that aligns with realistic agent deployments

English

15

262

28.9K

Yuhao Dong@dyhTHU·8 Nis

Nice Debut

Meta Newsroom@MetaNewsroom

Today we’re introducing Muse Spark, our most powerful model yet, giving you a faster and smarter Meta AI. Muse Spark currently powers the Meta AI app and website and will be rolling out to @whatsapp, @Instagram, @facebook, @messenger, and AI glasses in the coming weeks. about.fb.com/news/2026/04/i…

English

0

139

Yuhao Dong retweetledi

Chaoyou Fu@brady202406·8 Nis

🔥🔥 Sharing our work: Video-MME-v2！ A team of 60+ amazing colleagues spent nearly a year building Video-MME-v2. 🤔 Due the existing saturation problem! 🚀 3,300+ human-hours 👉 Human: 90.7 vs the best Gemini-3-Pro: 49.4 ❗A substantial gap! Project: video-mme-v2.netlify.app

English

2

3

6

172

Yuhao Dong retweetledi

Lei Li@_TobiasLee·8 Nis

📣🔥 Video-MME-v2 is here! 🎯 Tackling the saturation of video understanding benchmarks 🚀 Built with 3,300+ human-hours over nearly a year 🔍 Progressive tri-level hierarchy & group-based nonlinear scoring 👉 Human: 90.7 vs the best Gemini-3-Pro: 49.4 Project: video-mme-v2.netlify.app Paper: arxiv.org/pdf/2604.05015

English

1

7

31

3.2K

Yuhao Dong@dyhTHU·8 Nis

🔥 Excited to share Video-MME-v2! 🔥 We built it to tackle a growing issue: video understanding benchmarks are getting saturated. 🏃🏻 Over 3,300 human-hours, nearly a year of effort 🌟 A new design with a progressive hierarchy + group-based nonlinear evaluation What we found: 👉 Human: 90.7 vs 👉 Gemini-3-Pro: 49.4 The gap is still huge. Explore More at: Page: video-mme-v2.netlify.app Paper: arxiv.org/pdf/2604.05015

English

2

8

24

5.9K

Yuhao Dong retweetledi

DailyPapers@HuggingPapers·8 Nis

Video-MME-v2 A new benchmark for video understanding featuring a progressive tri-level hierarchy and grouped non-linear scoring. Built with 3,300 human-hours across 800 videos to expose gaps between leaderboard scores and true model capabilities.