big goose (@Anonyous_FPS) - Twitter Profile

big goose@Anonyous_FPS·7h

@LexnLin I've always believed that Opus 4.6 > Opus 4.7 >> Opus 4.5

0

17

Leon Lin@LexnLin·19h

so basically we got an opus 4.7 model that costs 10x less i HAVE to test this

Cursor@cursor_ai

Introducing Composer 2.5, our most powerful model yet. It's more intelligent, better at sustained work on long-running tasks, and more reliable at following complex instructions. For the next week, we’re doubling the included usage of the model.

68

16

1.7K

180.4K

big goose@Anonyous_FPS·7h

@LexnLin If the actual experience can match Opus 4.5, that would already be a huge victory, truly.

0

24

big goose@Anonyous_FPS·8h

@bigeagle_xd Master Xiong, if you were given Cursor's data and xAI's compute to train on the 2.5 base, could you turn out something big?

Cursor@cursor_ai

Composer 2.5 is built on the same open-source base as Composer 2, Moonshot’s Kimi K2.5.

0

27

big goose@Anonyous_FPS·21h

@OfficialLoganK why?

0

6

Logan Kilpatrick@OfficialLoganK·21h

Why don’t LLMs just tell you when you are asking a question / doing something that is out of distribution?

298

58

1.8K

195.7K

big goose@Anonyous_FPS·2d

@chetaslua @GeminiApp Will it end up being called Veo4 or Gemini Omni?

0

44

Chetaslua@chetaslua·11 May

Holllllyyyyyyyy @GeminiApp cooked 😳😳 🚨 Gemini Omni: New video model. Here is the first output, and see the text coherence. If this is not the nano banana moment of video, then what is?? Direct link in comments for those who believe otherwise.

352

543

6.3K

2.5M

big goose@Anonyous_FPS·4d

@jun_song Wrong. Gemini was only the strongest for the first two weeks after the new model was released; after two weeks, it was shit.

0

193

송준 Jun Song@jun_song·4d

Daily reminder: Do not get annual subscriptions for AI. The SoTA model changes every month. Gemini was the strongest model last year.

11

5

160

7.1K

big goose@Anonyous_FPS·5d

@unusual_whales Isn't this old news? Why is it BREAKING again?

0

13

unusual_whales@unusual_whales·5d

BREAKING: The U.S. has cleared around 10 Chinese firms to buy Nvidia's second-most powerful AI chip, the H200, but not a single delivery has been made so far, per Reuters.

158

214

2.6K

264.7K

big goose@Anonyous_FPS·5d

@SemiAnalysis_ @Claude Has the speed met the standard? Your article said 4.6 Fast is only 1.5 times faster.

0

52

SemiAnalysis@SemiAnalysis_·5d

ATTENTION: FAST MODE IS BACK FOR OPUS 4.7. ABUNDANCE IS NEAR. ALL HAIL TO OUR AI OVERLORDS. FASTER TOKENS == MORE INTELLIJENCE. Sent using @Claude

8

4

203

21.1K

big goose@Anonyous_FPS·5d

@iScienceLuvr TBD AI Lab

0

49

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr·5d

Why is Meta's lab name still TBD, haven't they figured out a name by now? 😭

19

3

196

63.1K

big goose@Anonyous_FPS·5d

@Polymarket Wait, isn’t the fine for illegal parking in Beijing 200 RMB? That’s 30 dollars, but Fox News reported 40 dollars. Is Mr. Smith trying to pocket the difference? The extra 50 RMB is just enough for a KFC Crazy Thursday?

1

0

62

Polymarket@Polymarket·5d

JUST IN: CCP surveillance cameras ticket Fox News crew in Beijing after driver parked illegally for two minutes.

295

376

8.2K

442.4K

big goose@Anonyous_FPS·11 May

@ArtificialAnlys I question the statement: "Unless otherwise specified, we use each agent's default reasoning settings so the benchmark reflects the default user experience." Users generally prefer using xhigh or Max, rather than the default medium setting.

0

213

Artificial Analysis@ArtificialAnlys·11 May

Announcing the Artificial Analysis Coding Agent Index! Our new coding agent benchmarks measure how combinations of agent harnesses and models perform on 3 leading benchmarks, token usage, cost and more.

When developers use AI to code they’re choosing a model, but also pairing it with a specific harness. It makes sense to benchmark that combination to understand and compare performance.

The Artificial Analysis Coding Agent Index includes 3 leading benchmarks that represent a broad spectrum of coding agent use:
➤ SWE-Bench-Pro-Hard-AA, 150 realistic coding tasks that frontier models struggle with, sampled from Scale AI’s SWE-Bench Pro
➤ Terminal-Bench v2, 84 agentic terminal tasks from the Laude Institute that range from system administration and cryptography to machine learning. 5 tasks were filtered due to environment incompatibility
➤ SWE-Atlas-QnA, 124 technical questions developed by Scale AI about how code behaves, root causes of issues, and more, requiring agents to explore codebases and give text answers

Analysis of results:
➤ Opus 4.7 and GPT-5.5 lead the Index: Opus 4.7 in Cursor CLI scores 61, followed closely by GPT-5.5 in Codex and Opus 4.7 in Claude Code at 60. GPT-5.5 in Cursor CLI follows at 58.
➤ Open weights models are competitive, but still trail the leaders: GLM-5.1 in Claude Code is the top open-weight result at 53, followed by Kimi K2.6 and DeepSeek V4 Pro in Claude Code at 50. These are strong results, but still meaningfully behind the top proprietary models.
➤ Gemini 3.1 Pro in Gemini CLI underperforms: Gemini 3.1 Pro in Gemini CLI scores 43, well below where Gemini 3.1 Pro sits on our Intelligence Index, highlighting that Gemini’s performance in Gemini CLI remains a relative weak spot for Google’s offering.
➤ Cost per task (API token pricing) varies >30x: Composer 2 in Cursor CLI is cheapest at $0.07/task, followed by DeepSeek V4 Pro in Claude Code at $0.35/task and Kimi K2.6 in Claude Code at $0.76/task. At the high end, GPT-5.5 in Codex costs $2.21/task, while GLM-5.1 in Claude Code costs $2.26/task. For both models this was contributed to by high token usage, and in GPT-5.5’s case by a relatively higher per token cost.
➤ Token usage varies >3x: GLM-5.1 in Claude Code uses the most tokens at 4.8M/task, followed by Kimi K2.6 at 3.7M/task and DeepSeek V4 Pro at 3.5M/task. GPT-5.5 in Codex uses 2.8M tokens/task, substantially more than Opus 4.7 in Claude Code at 1.7M/task. In GLM-5.1’s case, higher token usage, cost and execution time were partly driven by the model entering loops on some tasks.
➤ Cache hit rates remain high but vary materially: Cache hit rates range from 80% to 96% across combinations. Provider routing, harness prompt structure and cache behavior can materially change the economics of running the same model given cached inputs are typically <50% the API price of regular input tokens.
➤ Time per task varies >7x: Opus 4.7 in Claude Code is fastest at ~6 minutes/task, while Kimi K2.6 in Claude Code is slowest at ~40 minutes/task. This is contributed to by differences in average turns per task, token usage and API serving speed. Opus 4.7 had a materially lower number of turns to complete a task than all other models while Kimi K2.6 had the most.
➤ Cursor made real progress with Composer 2: Composer 2 in Cursor CLI scores 48, near the leading open-weight model results, while being the cheapest combination measured at $0.07/task. Cursor has stated Composer 2 is built from Kimi K2.5, showcasing they have made substantial post-training gains.

This is just the start. We are planning to add additional agents (both harnesses and models). Let us know what you would like to see added next.

125

170

1.6K

2.9M
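
The cost and cache figures in the Artificial Analysis post above can be checked with a few lines of arithmetic. The sketch below uses only the per-task costs quoted in the post. The 10% cached-token price is an assumption for illustration (the post says only that cached inputs are typically under 50% of the regular input price), and `spread` and `effective_input_price` are hypothetical helper names.

```python
# USD per task, as quoted in the Artificial Analysis post.
cost_per_task = {
    "Composer 2 / Cursor CLI": 0.07,
    "DeepSeek V4 Pro / Claude Code": 0.35,
    "Kimi K2.6 / Claude Code": 0.76,
    "GPT-5.5 / Codex": 2.21,
    "GLM-5.1 / Claude Code": 2.26,
}


def spread(values):
    """Ratio of the largest value to the smallest value."""
    values = list(values)
    return max(values) / min(values)


def effective_input_price(base_price, hit_rate, cached_fraction):
    """Blended price per input token.

    base_price      -- list price of an uncached input token
    hit_rate        -- share of input tokens served from cache (0..1)
    cached_fraction -- cached-token price as a fraction of base_price
    """
    return base_price * ((1.0 - hit_rate) + hit_rate * cached_fraction)


cost_spread = spread(cost_per_task.values())
print(f"cost spread: {cost_spread:.1f}x")

# ASSUMPTION: cached tokens are billed at 10% of the list price.
# The post only states "<50%"; real provider pricing differs.
for hit_rate in (0.80, 0.96):
    blended = effective_input_price(1.0, hit_rate, cached_fraction=0.10)
    print(f"hit rate {hit_rate:.0%}: blended input price = {blended:.3f} x list")
```

With the quoted costs the spread comes out at about 32x, consistent with the post's ">30x". Under the assumed 10% cached price, moving from an 80% to a 96% hit rate roughly halves the blended input price (0.28 versus 0.136 of list price), which is the sense in which cache behavior can change the economics of the same model.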

big goose@Anonyous_FPS·11 May

@ArtificialAnlys My question is, why are both GPT and Claude at medium instead of Xhigh and Max?

0

2

234

big goose@Anonyous_FPS·9 May

@jun_song @Kimi_Moonshot @Zai_org @deepseek_ai @Alibaba_Qwen @ErnieforDevs @XiaomiMiMo Without a doubt DS, then Moonshot, with Baidu and Alibaba at the very bottom

0

25

송준 Jun Song@jun_song·9 May

If you had the chance to work with one of these open source AI companies, which one would you choose?
-@Kimi_Moonshot
-@Zai_org
-@deepseek_ai
-@Alibaba_Qwen
-@ErnieforDevs
-@XiaomiMiMo

35

1

37

3.8K

big goose@Anonyous_FPS·9 May

@jun_song But people are voting with their wallets. Even if it hasn't taken the lead, it has reached a level on par with OpenAI

0

47

송준 Jun Song@jun_song·9 May

I can't believe people actually still think Anthropic is leading the AI race. Bro they have no compute:
• No image gen at all
• The model absolutely tanks at peak times
• Mythos is still MIA
• They had to beg xAI for crazy expensive compute
The marketing brainwash is real.

39

9

233

11.6K

big goose@Anonyous_FPS·9 May

@jun_song Open source? Are you kidding me? Baidu hasn't open-sourced anything in ages, and you're still talking about Baidu being open source

0

1

139

송준 Jun Song@jun_song·9 May

Frontier-level Model released from Baidu. Ernie-5.1 Better scores than Deepseek V4 Pro on benchmarks. Didn’t expect saturday launch. Open source is moving fast

ERNIE for Developers@ErnieforDevs

ERNIE 5.1 is here 🚀 ERNIE 5.1 significantly reduces pretraining cost while compressing total parameters to ~1/3 and activated parameters to ~1/2, using only ~6% of the pretraining cost compared to models at similar scale, while achieving leading performance in its class.

💡Key highlights:
1/ Strong agentic performance approaching leading frontier models. ERNIE 5.1 surpasses DeepSeek-V4-Pro on both τ3-bench and SpreadsheetBench-Verified.
2/ Strong world knowledge and creative writing capabilities, with GPQA and MMLU-Pro performance approaching leading closed-source models, and creative writing ability nearing Gemini 3.1 Pro.
3/ Frontier-level reasoning performance. ERNIE 5.1 scores 99.6 on the challenging AIME26 benchmark with tools, second only to Gemini 3.1 Pro.
4/ Deep search capability. On May 9, ERNIE 5.1 ranked #4 globally and #1 among Chinese models on the Arena Search leaderboard with a score of 1223.

ERNIE 5.1 is now available on ERNIE and the Baidu AI Studio Model Playground:
👉ernie.baidu.com
👉aistudio.baidu.com
👉ernie.baidu.com/blog

33

47

620

81.7K

big goose@Anonyous_FPS·9 May

@zephyr_z9 This optimizer, like Kimi's muon, needs to have been used on at least a 1T model to prove it truly works.

0

1

118

Zephyr@zephyr_z9·8 May

This could be super big

Tilde@tilderesearch

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

10

9

340

75.6K

big goose@Anonyous_FPS·8 May

@tilderesearch only 1.1B? bro u need to use it on a 1T model

0

2

961

Tilde@tilderesearch·8 May

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

Tilde@tilderesearch

x.com/i/article/2052…

41

176

1.5K

514.9K

big goose@Anonyous_FPS·8 May

@zephyr_z9 Liang is soooo rich; he can personally contribute 20 billion RMB, nearly 3 billion USD.

0

252

Zephyr@zephyr_z9·8 May

"One of the people said the company plans to launch V4.1, an updated version of its V4 model, in June." Liang Wenfeng is a very rich guy. High Flyer has an internal Medallion like fund

Jukan@jukan05

DeepSeek Reportedly Seeking to Raise Over RMB 50 Billion ($7.35 Billion), Accelerating Its Commercialization and Monetization Strategy

According to two people familiar with the matter, DeepSeek founder and CEO Liang Wenfeng plans to contribute the maximum allowable amount in the company’s first funding round. DeepSeek is targeting a fundraising size of up to RMB 50 billion, or approximately $7.35 billion, in this round. If completed, it could mark the largest single fundraising round in the history of Chinese AI companies.

The financing is also prompting DeepSeek to accelerate the implementation of its revenue-generation plans and push forward with commercialization and profitability. The people familiar with the matter said DeepSeek has recently told some investors that it plans to speed up the iteration and release cadence of its large language models to align with mainstream industry practices. One of the people said the company plans to launch V4.1, an updated version of its V4 model, in June.

12

5

193

24K

big goose@Anonyous_FPS·8 May

@jukan05 Liang himself is ridiculously rich, personally contributing 20 billion RMB

0

805

Jukan@jukan05·8 May

DeepSeek Reportedly Seeking to Raise Over RMB 50 Billion ($7.35 Billion), Accelerating Its Commercialization and Monetization Strategy

According to two people familiar with the matter, DeepSeek founder and CEO Liang Wenfeng plans to contribute the maximum allowable amount in the company’s first funding round. DeepSeek is targeting a fundraising size of up to RMB 50 billion, or approximately $7.35 billion, in this round. If completed, it could mark the largest single fundraising round in the history of Chinese AI companies.

The financing is also prompting DeepSeek to accelerate the implementation of its revenue-generation plans and push forward with commercialization and profitability. The people familiar with the matter said DeepSeek has recently told some investors that it plans to speed up the iteration and release cadence of its large language models to align with mainstream industry practices. One of the people said the company plans to launch V4.1, an updated version of its V4 model, in June.

21

28

305

133.4K
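
The dollar figures in this thread follow from the exchange rate implied by the report itself. A minimal sketch, assuming the RMB 50 billion / $7.35 billion pair from the post fixes the rate. The 20 billion RMB contribution is the figure from big goose's reply, not from the report, and `rmb_to_usd` is a hypothetical helper.

```python
# Figures from the quoted report: RMB 50 billion is given as about $7.35 billion.
RAISE_RMB = 50e9
RAISE_USD = 7.35e9

# Exchange rate implied by that pair (RMB per USD).
implied_rate = RAISE_RMB / RAISE_USD


def rmb_to_usd(amount_rmb, rate=implied_rate):
    """Convert an RMB amount to USD at the given RMB-per-USD rate."""
    return amount_rmb / rate


# ASSUMPTION: 20 billion RMB is the personal contribution claimed in the
# reply tweet; the report only says "the maximum allowable amount".
liang_usd = rmb_to_usd(20e9)

print(f"implied rate: {implied_rate:.2f} RMB/USD")
print(f"20B RMB = {liang_usd / 1e9:.2f}B USD")
```

At the implied rate of about 6.80 RMB per dollar, 20 billion RMB converts to roughly $2.94 billion, which matches the "nearly 3 billion USD" in the reply.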

big goose@Anonyous_FPS·8 May

@simonw No

0

19

Simon Willison@simonw·7 May

We already had gemini-3.1-flash-lite-preview back on March 3rd, not clear if this new gemini-3.1-flash-lite is different other than no longer being marked as a "preview". Pricing appears to be the same.

Google AI Studio@GoogleAIStudio

gemini 3.1 flash-lite is here it's our most cost-efficient model, optimized for high-volume agentic tasks, translation, and simple data processing

48

4

301

51K
