big goose

1.1K posts

@Anonyous_FPS

Noob, NLP

Joined March 2019
394 Following · 74 Followers
big goose@Anonyous_FPS·
@LexnLin I've always believed that Opus 4.6 > Opus 4.7 >> Opus 4.5
Replies 0 · Reposts 0 · Likes 0 · Views 17
big goose@Anonyous_FPS·
@LexnLin If the actual experience can match Opus 4.5, that would already be a huge victory, truly.
Replies 0 · Reposts 0 · Likes 0 · Views 24
Logan Kilpatrick@OfficialLoganK·
Why don’t LLMs just tell you when you are asking a question / doing something that is out of distribution?
Replies 298 · Reposts 58 · Likes 1.8K · Views 195.7K
Chetaslua@chetaslua·
Holllllyyyyyyyy @GeminiApp cooked 😳😳
🚨 Gemini Omni: New video model
Here is the first output, and see the text coherence. If this is not the nano banana moment of video, then what is??
Direct link in the comments for those who believe otherwise
Replies 352 · Reposts 543 · Likes 6.3K · Views 2.5M
big goose@Anonyous_FPS·
@jun_song Wrong. Gemini was only the strongest for the first two weeks after the new model was released; after two weeks, it was shit.
Replies 0 · Reposts 0 · Likes 0 · Views 193
송준 Jun Song@jun_song·
Daily reminder: Do not get annual subscriptions for AI. The SoTA model changes every month. Gemini was the strongest model last year.
Replies 11 · Reposts 5 · Likes 160 · Views 7.1K
unusual_whales@unusual_whales·
BREAKING: The U.S. has cleared around 10 Chinese firms to buy Nvidia's second-most powerful AI chip, the H200, but not a single delivery has been made so far, per Reuters.
Replies 158 · Reposts 214 · Likes 2.6K · Views 264.7K
SemiAnalysis@SemiAnalysis_·
ATTENTION: FAST MODE IS BACK FOR OPUS 4.7. ABUNDANCE IS NEAR. ALL HAIL TO OUR AI OVERLORDS. FASTER TOKENS == MORE INTELLIJENCE. Sent using @Claude
Replies 8 · Reposts 4 · Likes 203 · Views 21.1K
big goose@Anonyous_FPS·
@Polymarket Wait, isn’t the fine for illegal parking in Beijing 200 RMB? That’s 30 dollars, but Fox News reported 40 dollars. Is Mr. Smith trying to pocket the difference? The extra 50 RMB is just enough for a KFC Crazy Thursday?
Replies 1 · Reposts 0 · Likes 0 · Views 62
Polymarket@Polymarket·
JUST IN: CCP surveillance cameras ticket Fox News crew in Beijing after driver parked illegally for two minutes.
Replies 295 · Reposts 376 · Likes 8.2K · Views 442.4K
big goose@Anonyous_FPS·
@ArtificialAnlys I question the statement: "Unless otherwise specified, we use each agent's default reasoning settings so the benchmark reflects the default user experience." Users generally prefer using xhigh or Max, rather than the default medium setting.
Replies 0 · Reposts 0 · Likes 0 · Views 213
Artificial Analysis@ArtificialAnlys·
Announcing the Artificial Analysis Coding Agent Index! Our new coding agent benchmarks measure how combinations of agent harnesses and models perform on 3 leading benchmarks, token usage, cost and more.

When developers use AI to code they’re choosing a model, but also pairing it with a specific harness. It makes sense to benchmark that combination to understand and compare performance.

The Artificial Analysis Coding Agent Index includes 3 leading benchmarks that represent a broad spectrum of coding agent use:
➤ SWE-Bench-Pro-Hard-AA, 150 realistic coding tasks that frontier models struggle with, sampled from Scale AI’s SWE-Bench Pro
➤ Terminal-Bench v2, 84 agentic terminal tasks from the Laude Institute that range from system administration and cryptography to machine learning. 5 tasks were filtered due to environment incompatibility
➤ SWE-Atlas-QnA, 124 technical questions developed by Scale AI about how code behaves, root causes of issues, and more, requiring agents to explore codebases and give text answers

Analysis of results:
➤ Opus 4.7 and GPT-5.5 lead the Index: Opus 4.7 in Cursor CLI scores 61, followed closely by GPT-5.5 in Codex and Opus 4.7 in Claude Code at 60. GPT-5.5 in Cursor CLI follows at 58.
➤ Open weights models are competitive, but still trail the leaders: GLM-5.1 in Claude Code is the top open-weight result at 53, followed by Kimi K2.6 and DeepSeek V4 Pro in Claude Code at 50. These are strong results, but still meaningfully behind the top proprietary models.
➤ Gemini 3.1 Pro in Gemini CLI underperforms: Gemini 3.1 Pro in Gemini CLI scores 43, well below where Gemini 3.1 Pro sits on our Intelligence Index, highlighting that Gemini’s performance in Gemini CLI remains a relative weak spot for Google’s offering.
➤ Cost per task (API token pricing) varies >30x: Composer 2 in Cursor CLI is cheapest at $0.07/task, followed by DeepSeek V4 Pro in Claude Code at $0.35/task and Kimi K2.6 in Claude Code at $0.76/task. At the high end, GPT-5.5 in Codex costs $2.21/task, while GLM-5.1 in Claude Code costs $2.26/task. For both models, high token usage contributed to the cost, and in GPT-5.5’s case so did a relatively higher per-token cost.
➤ Token usage varies >3x: GLM-5.1 in Claude Code uses the most tokens at 4.8M/task, followed by Kimi K2.6 at 3.7M/task and DeepSeek V4 Pro at 3.5M/task. GPT-5.5 in Codex uses 2.8M tokens/task, substantially more than Opus 4.7 in Claude Code at 1.7M/task. In GLM-5.1’s case, higher token usage, cost and execution time were partly driven by the model entering loops on some tasks.
➤ Cache hit rates remain high but vary materially: Cache hit rates range from 80% to 96% across combinations. Provider routing, harness prompt structure and cache behavior can materially change the economics of running the same model, given that cached inputs are typically <50% the API price of regular input tokens.
➤ Time per task varies >7x: Opus 4.7 in Claude Code is fastest at ~6 minutes/task, while Kimi K2.6 in Claude Code is slowest at ~40 minutes/task. Differences in average turns per task, token usage and API serving speed contribute to this. Opus 4.7 needed materially fewer turns to complete a task than all other models, while Kimi K2.6 needed the most.
➤ Cursor made real progress with Composer 2: Composer 2 in Cursor CLI scores 48, near the leading open-weight model results, while being the cheapest combination measured at $0.07/task. Cursor has stated Composer 2 is built from Kimi K2.5, showcasing that they have made substantial post-training gains.

This is just the start. We are planning to add additional agents (both harnesses and models). Let us know what you would like to see added next.
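The spread figures in the post can be roughly sanity-checked from the numbers it quotes. The sketch below is only an illustration: the dictionary names and the `spread` helper are hypothetical, the inputs are the rounded values given in the post, and only the combinations the post names explicitly are included. Ratios computed this way may therefore differ from the headline ratios, which presumably (an assumption) use the full, unrounded result set.

```python
# Rounded figures copied from the post; combinations it does not quote are absent.
cost_per_task_usd = {
    "Composer 2 / Cursor CLI": 0.07,
    "DeepSeek V4 Pro / Claude Code": 0.35,
    "Kimi K2.6 / Claude Code": 0.76,
    "GPT-5.5 / Codex": 2.21,
    "GLM-5.1 / Claude Code": 2.26,
}
tokens_per_task_millions = {
    "GLM-5.1 / Claude Code": 4.8,
    "Kimi K2.6 / Claude Code": 3.7,
    "DeepSeek V4 Pro / Claude Code": 3.5,
    "GPT-5.5 / Codex": 2.8,
    "Opus 4.7 / Claude Code": 1.7,
}
minutes_per_task = {
    "Opus 4.7 / Claude Code": 6,   # quoted as "~6"
    "Kimi K2.6 / Claude Code": 40,  # quoted as "~40"
}


def spread(values):
    """Hypothetical helper: ratio of the largest to the smallest quoted value."""
    return max(values.values()) / min(values.values())


cost_spread = spread(cost_per_task_usd)
token_spread = spread(tokens_per_task_millions)
time_spread = spread(minutes_per_task)

print(f"cost spread:  {cost_spread:.1f}x")
print(f"token spread: {token_spread:.1f}x")
print(f"time spread:  {time_spread:.1f}x")
```

With only the quoted values, the cost ratio lands above 30x, while the token and time ratios come out slightly under the headline figures, which is consistent with rounding and with some combinations not being listed in the post.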
Replies 125 · Reposts 170 · Likes 1.6K · Views 2.9M
big goose@Anonyous_FPS·
@ArtificialAnlys My question is, why are both GPT and Claude at medium instead of Xhigh and Max?
Replies 0 · Reposts 0 · Likes 2 · Views 234
big goose@Anonyous_FPS·
@jun_song But people are voting with their wallets. Even if it hasn't taken the lead, it has reached a level on par with OpenAI
Replies 0 · Reposts 0 · Likes 0 · Views 47
송준 Jun Song@jun_song·
I can't believe people actually still think Anthropic is leading the AI race. Bro they have no compute:
• No image gen at all
• The model absolutely tanks at peak times
• Mythos is still MIA
• They had to beg xAI for crazy expensive compute
The marketing brainwash is real.
Replies 39 · Reposts 9 · Likes 233 · Views 11.6K
big goose@Anonyous_FPS·
@jun_song Open source? Are you kidding me? Baidu hasn't open-sourced anything in ages, and you're still talking about Baidu being open source
Replies 0 · Reposts 0 · Likes 1 · Views 139
big goose@Anonyous_FPS·
@zephyr_z9 This optimizer, like Kimi's Muon, needs to have been used on at least a 1T model to prove it truly works.
Replies 0 · Reposts 0 · Likes 1 · Views 118
Tilde@tilderesearch·
Introducing Aurora, a new optimizer for training frontier-scale models.

We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks.

Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity.

What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.
Tilde@tilderesearch

x.com/i/article/2052…

Replies 41 · Reposts 176 · Likes 1.5K · Views 514.9K
big goose@Anonyous_FPS·
@zephyr_z9 Liang is soooo rich; he can personally contribute 20 billion RMB, nearly 3 billion USD.
Replies 0 · Reposts 0 · Likes 0 · Views 252
big goose@Anonyous_FPS·
@jukan05 Liang himself is ridiculously rich, personally contributing 20 billion RMB
Replies 0 · Reposts 0 · Likes 0 · Views 805
Jukan@jukan05·
DeepSeek Reportedly Seeking to Raise Over RMB 50 Billion ($7.35 Billion), Accelerating Its Commercialization and Monetization Strategy

According to two people familiar with the matter, DeepSeek founder and CEO Liang Wenfeng plans to contribute the maximum allowable amount in the company’s first funding round. DeepSeek is targeting a fundraising size of up to RMB 50 billion, or approximately $7.35 billion, in this round. If completed, it could mark the largest single fundraising round in the history of Chinese AI companies.

The financing is also prompting DeepSeek to accelerate the implementation of its revenue-generation plans and push forward with commercialization and profitability. The people familiar with the matter said DeepSeek has recently told some investors that it plans to speed up the iteration and release cadence of its large language models to align with mainstream industry practices. One of the people said the company plans to launch V4.1, an updated version of its V4 model, in June.
Replies 21 · Reposts 28 · Likes 305 · Views 133.4K