Wu Haoning
@HaoningTimothy
452 posts
PhD Nanyang Technological University🇸🇬, BS @PKU1898, cooking VLMs in @Kimi_Moonshot. Opinions are personal.
Singapore · Joined December 2020
736 Following · 3.2K Followers
Wu Haoning @HaoningTimothy:
this is literally insane that we truly see an open model thinking 200-300k for math problems…
[image]
Wu Haoning @HaoningTimothy:
I saw flowers and moonlight today.
[two images]
Wu Haoning @HaoningTimothy:
Walking in the wrong direction will never get you to the destination.
Wu Haoning retweeted
Fanqing Meng @FanqingMengAI:
Quoting Evolvent AI @Evolvent_AI:
Can confirm — K2.6 isn't just a demo-reel model. A few days ago we received a bug report from the Kimi team, and we got early API access and re-ran ClawMark (our living-world openclaw benchmark). After fixing a compatibility bug in openclaw's repo (github.com/openclaw/openc…), K2.6 lands at a 0.684 avg score — edging out gemini-3.1-pro (0.682) and jumping +0.124 over K2.5. Shipping shaders and agentic benchmark gains in the same release is a pretty rare combo. 👀
Wu Haoning retweeted
Wu Haoning @HaoningTimothy:
Since K2.5/K2.6 is multimodal, we are also making it usable (I am really amazed by how it excels at long multi-image documents, given that we did not specially optimize for them very much). Still a long way to go, however.
Quoting Arena.ai @arena:
Kimi K2.6 is the new SOTA open model in Vision and Document Arena, with solid gains since Kimi K2.5:
- #1 open on Vision Arena (#15 overall), +14 over #2 Kimi K2.5 (Thinking)
- #1 open on Document Arena (#8 overall), +9 over K2.5 and on par with proprietary models like Muse Spark and Gemini 3.1 Pro.
Huge congrats again to the @Kimi_Moonshot team on the open source progress!
Wu Haoning @HaoningTimothy:
I think deepseek-v4 is not over-benchmaxxing, which is good. We build these things for people to use.
Wu Haoning retweeted
Vals AI @ValsAI:
The 🐳 has surfaced and it’s a powerhouse on the Vals leaderboards, dominating on coding. DeepSeek V4 just landed #2 on the Vals Index, nearly tying Kimi K2.6 (only 0.07% behind).
[image]
Wu Haoning retweeted
Artificial Analysis @ArtificialAnlys:
GPT-5.5 takes OpenAI back to the clear number one in AI. OpenAI's new model tops the Artificial Analysis Intelligence Index by 3 points, breaking a three-way tie with Anthropic and Google. OpenAI gave us pre-release access to test all five reasoning effort levels: xhigh, high, medium, low and non-reasoning.

➤ OpenAI tops five headline evaluations: GPT-5.5 (xhigh) leads Terminal-Bench Hard, GDPval-AA and our newly hosted APEX-Agents-AA. The model trails only other OpenAI models in CritPt and AA-LCR, and comes second to Gemini 3.1 Pro Preview on three additional evaluations. The largest gains are on AA-Omniscience (+14 pts), our knowledge and hallucination benchmark, and τ²-Bench Telecom (+7 pts), a customer service agent benchmark.

➤ 20% more expensive to run our Intelligence Index: Per-token pricing has doubled from GPT-5.4 to $5/$30 per 1M input/output tokens. However, a ~40% token use reduction largely absorbs the hike, resulting in a net ~+20% cost to run our Intelligence Index.

➤ Effort is a clear ladder for balancing intelligence and cost: GPT-5.5 (medium) scores the same as Claude Opus 4.7 (max) on our Intelligence Index at one quarter of the cost (~$1,200 vs ~$4,800), although Gemini 3.1 Pro Preview scores the same at a cost of ~$900. GPT-5.5 (low) approximates Claude Opus 4.7 (Non-reasoning, high) on our Intelligence Index at half the cost to run (~$500 vs ~$1,000).

➤ Number one in GDPval-AA with an Elo of 1785: GPT-5.5 (xhigh) leads Claude Opus 4.7 (max) by ~30 pts and Gemini 3.1 Pro Preview by ~470 pts. GDPval-AA is Artificial Analysis' benchmark that leverages OpenAI's GDPval dataset to evaluate models on real-world economically valuable tasks.

➤ Top AA-Omniscience accuracy, but trailing the frontier on hallucination: Our private AA-Omniscience benchmark rewards factual knowledge across diverse topics, but punishes hallucination. GPT-5.5 (xhigh) has the highest accuracy at 57%, meaning the model can recall facts in the Omniscience corpus more effectively than any other model. However, it has a hallucination rate of 86%, vs Opus 4.7 (max) at 36% and Gemini 3.1 Pro Preview at 50%. This makes it more likely to answer a question when it does not 'know' the answer. The 14 pt gain in AA-Omniscience from GPT-5.4 (xhigh) was largely driven by knowledge, with a modest improvement in hallucination.

Congratulations to the team at @OpenAI and @sama on the launch
[image]
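The net-cost claim in the thread above is simple arithmetic: doubled per-token pricing offset by roughly 40% fewer tokens used. A minimal sketch checking that combination (the 2x price and ~40% reduction figures come from the tweet; the function name is illustrative):

```python
def net_cost_change(price_multiplier: float, token_reduction: float) -> float:
    """Relative change in total cost when per-token price scales by
    price_multiplier and token usage falls by token_reduction."""
    return price_multiplier * (1.0 - token_reduction) - 1.0

# Doubled pricing with ~40% fewer tokens used:
change = net_cost_change(2.0, 0.40)
print(f"{change:+.0%}")  # → +20%
```

This is why the price hike only partially shows up in the Index cost: 2.0 × (1 − 0.4) = 1.2, i.e. a net ~+20%.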
Wu Haoning @HaoningTimothy:
@teortaxesTex Enjoy Kimi first! (We are trying our best to serve everyone)
Wu Haoning @HaoningTimothy:
@teortaxesTex hope they'll be fast... (from my very personal perspective, day-0 open-source is better than day-X open-source, but I am not working for the business teams, just a model trainer)
Wu Haoning @HaoningTimothy:
Shall we be looking forward to V4 now? Serving the burst of K2.6 has made us GPU poor again now😂
Wu Haoning retweeted
Maksym Andriushchenko @maksym_andr:
💥 Kimi-K2.6-thinking is the new best open-weight model on HalluHard (without web search)! K2.5 had 76.9% hallucination rate, whereas K2.6 now has 63.6%. Since our benchmark contains hard hallucination cases, this improvement is very notable. Thank you @Kimi_Moonshot for providing API credits and @dyfan22 for running the eval! Full results: halluhard.com Paper: arxiv.org/abs/2602.01031
[image]
Wu Haoning @HaoningTimothy:
Looking forward to more open-source models on Vision Arena (but wait, is this thing even open-source?)
Quoting Arena.ai @arena:
MiMo-V2.5 by @XiaomiMiMo is now live on Arena. Evaluate it across Text, Vision & Code Arena - Pro versions available specifically in Text & Code. Start prompting and voting in Battle mode. Scores incoming.