Jake

726 posts


@JakeKAllDay

Works in AI, still doesn't trust it

Austin, TX · Joined August 2012

346 Following · 495 Followers
Jake retweeted
Artificial Analysis
Artificial Analysis@ArtificialAnlys·
Moonshot’s Kimi K2.6 is the new leading open weights model. Kimi K2.6 lands at #4 on the Artificial Analysis Intelligence Index (54), behind only Anthropic, Google, and OpenAI (all 57). Key takeaways:
➤ Increase in performance on agentic tasks: @Kimi_Moonshot's Kimi K2.6 achieves an Elo of 1520 on our GDPval-AA evaluation, a marked improvement over Kimi K2.5’s Elo of 1309. GDPval-AA is our leading metric for general agentic performance, measuring performance on knowledge work tasks such as preparing presentations and analyses. Models are given code execution and web browsing tools in an agentic loop via our open source reference agentic harness, Stirrup. This continues Kimi K2.6’s strength in tool use, maintaining a 96% score on τ²-Bench Telecom and placing it among other frontier models in this category.
➤ Low hallucination rate: Kimi K2.6 scores 6 on the AA-Omniscience Index, our knowledge evaluation measuring both accuracy and hallucination rate. This score is primarily driven by a comparatively low hallucination rate of 39% (reduced from Kimi K2.5’s 65%), indicating a greater capability to abstain rather than fabricate knowledge when the model is uncertain. Kimi K2.6’s low hallucination rate places it near other models such as Claude Opus 4.7 (36%) and MiniMax-M2.7 (34%).
➤ High token usage: Kimi K2.6 demonstrates high token usage, but is in line with other frontier models in the same intelligence tier. To run the full Artificial Analysis Intelligence Index, Kimi K2.6 used ~160M reasoning tokens. This is slightly lower than Claude Sonnet 4.6 (~190M reasoning tokens) but much higher than GPT 5.4 (~110M reasoning tokens).
➤ Open weights: Kimi K2.6 is a Mixture-of-Experts (MoE) model with 1T total parameters and 32B active, the same as the previous two generations, Kimi K2 Thinking and Kimi K2.5. Kimi K2.6 again pushes the open weights frontier in intelligence.
➤ Third Party Access: Kimi K2.6 is accessible through Moonshot’s first party API as well as third party API providers Novita, Baseten, Fireworks, and Parasail.
➤ Multimodality: Kimi K2.6 natively supports image and video input and text output. The model’s max context length remains 256k.
Further analysis in the threads below.
Artificial Analysis tweet media
4
10
165
8.1K
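The MoE sizing quoted above (1T total parameters, 32B active) can be sanity-checked with a quick sketch. The 8-bit weight assumption below is purely illustrative, not something the tweet states:

```python
# Sketch: what "1T total / 32B active" implies for an MoE model.
# Assumes (hypothetically) 8-bit weights; real deployments vary.
total_params = 1.0e12   # 1T parameters stored (all experts)
active_params = 32e9    # 32B parameters used per token

active_fraction = active_params / total_params
bytes_per_param = 1     # assumed 8-bit quantization

storage_gb = total_params * bytes_per_param / 1e9   # weights you must hold
compute_gb = active_params * bytes_per_param / 1e9  # weights read per token

print(f"active fraction: {active_fraction:.1%}")  # 3.2%
print(f"held: {storage_gb:.0f} GB, read per token: {compute_gb:.0f} GB")
```

This is the trade-off sparse MoE models make: storage cost scales with total parameters, per-token compute with active parameters.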
Jake
Jake@JakeKAllDay·
And now @ArtificialAnlys confirms it: new SOTA for OS, and closing in on the (larger and more expensive) US cloud models. x.com/ArtificialAnly…
Artificial Analysis@ArtificialAnlys

0
0
0
11
Jake retweeted
Jake
Jake@JakeKAllDay·
Moonshot AI continues to be the *most* open of the OS shops (#qwen and GLM are still great too!). #Kimi K2.6 is a legitimately frontier model, and making it available from the start puts great pressure on the Big 3 cloud providers to keep costs down. It should also be a boon to @cursor_ai users, as #Composer 2 was based on Kimi and provides vastly better value than Claude.
Kimi.ai@Kimi_Moonshot

Meet Kimi K2.6: Advancing Open-Source Coding
🔹Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), CharXiv w/ Python (86.7), Math Vision w/ Python (93.2)
What's new:
🔹Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization).
🔹Motion-rich frontend - videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D.
🔹Agent Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files.
🔹Proactive Agents - the K2.6 model powers OpenClaw, Hermes Agent, etc. for 24/7 autonomous ops.
🔹Claw Groups (research preview) - bring your own agents; command your friends', bots & humans in the loop.
K2.6 is now live on kimi.com in chat mode and agent mode. For production-grade coding, pair K2.6 with Kimi Code: kimi.com/code
🔗 API: platform.moonshot.ai
🔗 Tech blog: kimi.com/blog/kimi-k2-6
🔗 Weights & code: huggingface.co/moonshotai/Kim…

2
4
81
1.5K
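The "Agent Swarms" figures in the quoted tweet (300 sub-agents × 4,000 steps, up from K2.5's 100 × 1,500) imply a large jump in the maximum step budget per run; a minimal sketch of that arithmetic:

```python
# Upper-bound agent-step budget per run, from the figures in the tweet.
k25_budget = 100 * 1_500   # K2.5: 100 sub-agents x 1,500 steps each
k26_budget = 300 * 4_000   # K2.6: 300 sub-agents x 4,000 steps each

print(k25_budget)               # 150000
print(k26_budget)               # 1200000
print(k26_budget / k25_budget)  # 8.0  -> an 8x larger ceiling
```

These are ceilings, not typical runs; actual usage depends on how the orchestrator schedules sub-agents.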
Jake
Jake@JakeKAllDay·
@pHequals7 Worst at tool calling (and agentic capabilities by extension), but still plenty of post training to be had on accuracy, IF, alignment, domain specialization, etc.
0
0
0
42
Jake
Jake@JakeKAllDay·
*very* few companies voluntarily involute their products. They bias toward upselling. At a minimum, they try to increase volume and keep ASP the same. "Raise prices for more functionality" has long been the norm. Lowering prices means "owning" a smaller pie, vs getting the largest absolute piece you can.
0
0
1
9
Callum Williams
Callum Williams@econcallum·
This is a highly underrated point. Programmer productivity improvements SHOULD show up as falling software prices. This is the historical norm. In fact, though, in recent months software prices have actually been *rising*...the opposite of what should happen
Callum Williams tweet media
wanye@xwanyex

I don’t have to be convinced that LLMs make programmers more productive. But where’s all the stuff? We’ve now had months and months of 100x or 1000x programmer productivity improvements. Where’s all the stuff they’re building?

10
11
106
12.1K
Jake
Jake@JakeKAllDay·
@econcallum it isn't -- this data is based on NIPA. It is an inflation (product-level) metric; it has absolutely no way to account for what features exist within a given software bundle. apps.bea.gov/iTable/?1921=u…
0
0
3
43
Jake
Jake@JakeKAllDay·
@Goosehater123 @qcapital2020 'hard to disentangle' is a terrible excuse for misleading statistics. Amazon, by revenue, is mostly a retail company (not so by valuation). Measuring their AI capex, a tiny portion of their AWS business currently, against their retail revenue is dumb.
0
0
0
65
Moose
Moose@Goosehater123·
@JakeKAllDay @qcapital2020 Hard to disentangle. Machine learning has already been a core component of all their businesses. Just AMZ for ex: Amazon Ads, AMZ search, AWS, even AMZ logistics like FBA, warehousing, pick/fulfillment, etc. The line between ML and AI is hard to determine; it's more gradual.
2
0
0
85
Q-Cap
Q-Cap@qcapital2020·
Capex Bubble for ants
Q-Cap tweet media
23
58
779
70.2K
Jake retweeted
Jake
Jake@JakeKAllDay·
@qcapital2020 This is a crazy misleading chart. VAST majority of the revenue of Amazon/google/Meta/Oracle/MSFT has nothing to do with AI. Most of Amazon rev doesn’t even have to do with IT! Google + Meta are 0.5T in just Ad revenue. It’s apples and orangutans.
2
4
29
1.7K
Jake
Jake@JakeKAllDay·
The MCP standard reached critical mass during the 2.5/5/4 period, and then the corpus existed to do proper RL for tool calling. Coding capabilities also progressed, which was a tailwind on agentic scope. The Gemini series is the smartest internally, but their tool-calling RL is still the weakest. Hence the disparity between the benchmark and lay engineer (I said what I said) perspectives. Hopefully it gets fixed on the next cycle.
0
0
5
1.1K
Mike Knoop
Mike Knoop@mikeknoop·
Extremely clear what caused the qualitative leap from GPT 4 to o1 (test time adaptation via chain of thought reasoning). Not clear what caused the agentic leap from Gemini 2.5/GPT 5.1/Opus 4.1 to Gemini 3/GPT 5.2/Opus 4.5. Even crazier all three released ~3 weeks apart.
24
4
232
21.4K
Jake
Jake@JakeKAllDay·
@MichaelFKane One day, Children of Húrin will be made into a movie and Glaurung will get his due. @andyserkis already narrated it perfectly in audiobook form; hopefully he can make that happen.
0
0
0
5
Jake
Jake@JakeKAllDay·
@MichaelFKane “It’s only a REAL victory if there are zero detectable costs”
1
0
57
621
Michael F Kane
Michael F Kane@MichaelFKane·
The US will never be allowed a military victory again because arbitrary metrics unrelated to actual military goals will be applied to declare defeat. And this isn't particularly a shot at Chris, because he is right that oil price standard will be applied to Iran, despite oil being a tertiary concern at best to the administration's actual objectives in Iran. But we are so well off and comfortable that talking heads will apply any discomfort or inconvenience to the account and credit it as a loss.
Christopher F. Rufo ⚔️@christopherrufo

One way to measure that stable victory conditions have been met would be the price of oil returning to the pre-war median price

44
86
1.1K
44.6K
Jake
Jake@JakeKAllDay·
@ThePowerAudit I’ve never understood why people think Iran would seriously play the Red Sea card. It might not work, it would bring KSA into the fight, and it would cost them CCP support. It’s great for Russia, but that’s about it.
0
0
1
103
Chris Rollins
Chris Rollins@ThePowerAudit·
The only card Iran holds is a suicide card that assumes they can convince the Houthis to enact it too. I do not think they would at this point.

I also do not think the US will eliminate the power plants in Iran. Game theory says you don't, not unless you want the destruction of the entire region as mentioned. (There is a devastatingly sad long-term economic benefit to the US in this.) China is not going to let the IRGC harm them and make them even more indebted to the US by doing it either.

The "cards" Iran holds are not actually leverage. It is more of a threatened suicide vest strapped to the region. If I am the US, I would NOT take out the power plants or attempt to send Iran to the stone age. But if Iran does pull the pin, guess whose oil and LNG just became the hottest commodity in the world? US LNG at $20 JKM. US propane at $1.13/gal delivered ARA. US crude exports at a record 5.2 million barrels a day. That is the position Iran's suicide card actually strengthens.

Here is the game theory. Iran has two choices. Execute the threat and die, which makes US energy the strongest card on Earth and hands Trump the strongest hand in a generation. Or do not execute, and keep bleeding revenue through the filter every day the blockade operates. Both branches favor the US. There is no Iranian move that improves Iran's position. A threat where every outcome helps your opponent is not a threat. What sort of leverage is that?
Degrees of Change@DegreesOcean

@ThePowerAudit What are you talking about? Iran holds ALL the cards. It can destroy Abqaiq, Ras Tanura, Ras Laffan, Yanbu etc, etc and send the entire world back to the stone age. There is no way to stop them destroying all the energy infrastructure. Why can't you Americans see that???

5
1
21
3.4K
Jake
Jake@JakeKAllDay·
@StirlingForge @Alibaba_Qwen It’ll be considerably larger (hence the naming convention). Just wait for 3.6 27B dense
0
0
1
20
Qwen
Qwen@Alibaba_Qwen·
🚀 Introducing Qwen3.6-Max-Preview, an early preview of our next flagship model
Highlights:
⚡️ Improved agentic coding capability over Qwen3.6-Plus
📖 Stronger world knowledge and instruction following
🌍 Improved real-world agent and knowledge reliability performance
Smarter, sharper, still evolving. More Qwen3.6 models to come. Stay tuned! 🔗👇
Blog: qwen.ai/blog?id=qwen3.…
Qwen Studio: chat.qwen.ai/?models=qwen3.…
API: modelstudio.console.alibabacloud.com/ap-southeast-1…
Qwen tweet media
152
378
3.7K
252.8K
Jake
Jake@JakeKAllDay·
@IrvingSwisher Gasoline prices were 4.4% of GDP at the peak of the 1980 shock. They’re typically less than 2% in today’s economy. Oil as a % of US energy consumed is vastly lower than it was before. Brent prices also =/= broader petroleum costs. open.substack.com/pub/jakekooker…
Jake tweet media
0
0
1
1.3K
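The share-of-GDP comparison in the tweet above is simple arithmetic: annual spend divided by nominal GDP. A sketch with clearly hypothetical inputs (the spend and GDP figures below are placeholders for illustration, not the tweet's underlying data):

```python
# Gasoline spend as a share of GDP: share = annual spend / nominal GDP.
# Inputs are illustrative placeholders only.
def gdp_share(annual_spend_usd: float, gdp_usd: float) -> float:
    return annual_spend_usd / gdp_usd

# e.g. a hypothetical $500B of gasoline spend against a $28T economy:
share = gdp_share(500e9, 28e12)
print(f"{share:.2%}")  # 1.79%
```

The tweet's point is that this ratio peaked near 4.4% in the 1980 shock and now typically sits under 2%, so a given price move bites the economy less than it once did.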
Jake
Jake@JakeKAllDay·
@VKMacro @stja42860 China is hoarding oil and has banned exports so it is logical to treat them as a separate, dislocated market.
0
0
1
56
VKMacro
VKMacro@VKMacro·
@stja42860 Not a different conclusion per se, but China inventories have been flat to up since the crisis began. Also, China demand can fluctuate significantly in the near term which offsets against RoW inventories.
3
0
3
684