Lisan al Gaib

18.9K posts

@scaling01

lead them to paradise https://t.co/IiP4VZlGU3

Joined August 2024

988 Following · 38.2K Followers

Pinned Tweet

Lisan al Gaib@scaling01·Jan 1

My predictions for 2026:

Coding and Mathematics AGI
- METR 50% time horizons above 24 hours - my mean estimate is 30.8 hours, 2 day time horizons possible within frontier labs when accounting for 60 day lag
- if 2025 was the year of agents, then 2026 will be the year of multi-agent systems - agents delegating work to subagents -> the start of the agent economy and the great unhobbling!

Most of our current math and coding benchmarks will get saturated!
- Epoch Capabilities Index ( > 175 )
- FrontierMath Levels 1-3 ( > 95% )
- ARC-AGI 1 and 2 ( > 95% )
- SimpleQA verified ( > 95% )
- Simple-Bench ( > 90% )
- SWE-Bench-verified ( > 90% )
- Terminal-Bench 2 ( > 90% )
- WeirdML v2 ( > 85% )
- Humanity's Last Exam ( > 80% )
- FrontierMath Level 4 ( > 75% )
- Cybench ( > 70% )
- GDPval ( > 70 % win rate, no ties)
- GSO ( > 65% )
- ARC-AGI-3 ( > 60% and > 80% if they go for o3-preview comparable compute budgets or continual learning breakthrough happens)
- more evals like gdpval that capture economic value of models and systems
- big focus on white collar work and large acceleration of science: specifically i see acceleration in medicine, biology, chemistry, finance, legal, administrative work
- automation of white collar work will be enabled by having reliable and fast computer use agents
- reliable computer use agents will also have implications for how you use the internet. this is OpenAI's big goal: become the hub to the internet and delegate shopping and whatever to agents!

Big model launches to get hyped for in 2026:
- Claude 5
- Claude 5.5
- Gemini 3.5
- Gemini-4
- GPT-5.3
- GPT-6
(everything in between possible, but Gemini 4 ~ 80%, Claude 5.5 ~ 70%, GPT-6 ~ 60% likely before 2027)
- DeepSeek-V4
- Grok-5
- Qwen-4
- Kimi-K3, GLM-5, MiniMax M3
- more korean models and a bunch of american open-source models :)

The gap between closed and open labs will narrow in H1 2026 due to DeepSeek-V4, then widen in the later half of the year, especially on economically valuable tasks. Closed models will be much more reliable. But we will still have Opus 4.5+ level open models by the end of 2026.
Most frontier models will be around 5-10T params. If we see GPT-6 and Gemini-4 at the end of 2026 10T+ param models are possible. These models + harnesses will be the first not research agents. We should also see much better live models with voice and video mode.

Model architecture:
- we will see both, more efficient architectures and more expressive architectures!
- hybrid architectures for even longer context windows, diffusion models for speed on edge devices, but also models that double down on full attention or even more expressive attention mechanisms
- looped language models, other recurrent architectures and continual learning will enable much smaller reasoning models! (TRM on ARC-AGI has paved the way for the reasoning core)
- big improvements in reasoning efficiency

in my 2025 prediction I included a prediction for 2026 that I stand by:
- "someone (Anthropic) figures out efficient test-time-training [...], this will be the next paradigm for 2026 and lead to superintelligence"

General outlook and some random thoughts:
- it will be clear to everybody that Anthropic has the mandate and is ahead of everyone else
- OpenAI, Anthropic and Google will remain frontier labs
- decent chance that Anthropic overtakes OpenAI's valuation and both are valued > 1T
- DeepSeek will join them with V4 as THE chinese frontier lab
- xAI will likely repeat Grok-4, Grok-5 will be great on benchmarks but Elon persists on slop-maxxing the model
- AI generated video content will take off with Veo-4 and Sora-3, consistent minute long videos will be possible
- embodied intelligence will start to take off by RL through world models
- full self-driving solved, waymo and tesla everywhere
- the stock market will have a 20%+ drawdown
- 15% chance of OpenAI going bankrupt and getting acquired by Microsoft due to collapse of oracle or a market crash, caused by rapidly deteriorating economic situation (unemployment, inflation)
- push against AI will become a common theme in most advanced western economies as unemployment rises
- populist right wing parties continue to gain traction in europe
- trump/republicans will lose midterm elections

723

248K

Lisan al Gaib@scaling01·8h

cursor.com/blog/composer-2

2.4K

Lisan al Gaib@scaling01·8h

that looks pretty fucking good

310

43.9K

Lisan al Gaib retweeted

Lincoln 🇿🇦@Presidentlin·8h

It does "beat" Opus 4.6 but benchmarks mean so little. It's going to be a good worker model.

Jimmy Apples 🍎/acc@apples_jimmy

Apparently Cursor is going to release a coding model better than opus 4.6 and cheaper as well (maybe tomorrow).
Can they regularly do this to keep up though?

5.4K

Lisan al Gaib@scaling01·9h

@cormac_mars for me it's like:
avatar 1: 4/5
avatar 2: 3/5
avatar 3: 3/5
dune 1: 4/5
dune 2: 5/5
dune 3: 10/5

1.7K

Cormac@cormac_mars·9h

@scaling01 avatar 3 was 5 stars and 1 was 4.5 for me
dune 2 was obv a 5 and dune 1 was 4

1.9K

Lisan al Gaib@scaling01·1d

it took villeneuve 5 years to create the greatest (sci-fi) trilogy of all time
meanwhile, james cameron got a billion dollars and 16 fricking years to create 3 mid movies
slop is not only AI related. it exists everywhere in the real world

1.5K

122K

Lisan al Gaib@scaling01·21h

@Presidentlin happy 454 day anniversary

143

Lisan al Gaib@scaling01·Dec 20

@Presidentlin I have become enlightened and humble. I see you like or comment every day. You deserve a follow my friend.

152

Lincoln 🇿🇦@Presidentlin·Dec 20

I collected a new tazos.

1.1K

Lisan al Gaib@scaling01·22h

@pingToven @OpenRouter @XiaomiMiMo @openclaw if only someone would've said it was xiaomi
relieved it wasn't deepseek

131

Toven@pingToven·23h

@OpenRouter @XiaomiMiMo @openclaw gm @scaling01

276

OpenRouter@OpenRouter·23h

Stealth Model Reveal: Hunter and Healer Alpha are @XiaomiMiMo MiMo-V2-Pro and MiMo-V2-Omni
Both models are live now on OpenRouter, and free to use in @OpenClaw via the OpenRouter provider for the next week!

126

1.4K

99.1K

Lisan al Gaib@scaling01·1d

@david_sepulvado the first one was sick

2.6K

David Sepulvado@david_sepulvado·1d

@scaling01 2 and 3 were reductive yes, but I doubt you really think that for the first 😁

2.9K

Lisan al Gaib@scaling01·1d

@AlphaMFPEFM so do cigarettes and other drugs. does it make them good?

4.8K

AlphaMFPEFM@AlphaMFPEFM·1d

@scaling01 And yet Cameron's movies make billions...

5.1K

Lisan al Gaib@scaling01·1d

@xpasky gonna run lisanbench for GPT-5.4, Mistral Small 4 and M2.7 this weekend

670

Petr Baudis@xpasky·1d

Looks like an absolute banger on paper, but zero hype on my timeline. (I know, @scaling01 currently consumed by Dune 3, but even so.) Is it a good model, or is it benchmaxxed?

MiniMax (official)@MiniMax_AI

Introducing MiniMax-M2.7, our first model which deeply participated in its own evolution, with an 88% win-rate vs M2.5
- Production-Ready SWE: With SOTA performance in SWE-Pro (56.22%) and Terminal Bench 2 (57.0%), M2.7 reduced intervention-to-recovery time for online incidents to 3-min on certain occasions.
- Advanced Agentic Abilities: Trained for Agent Teams and tool search tool, with 97% skill adherence across 40+ complex skills. M2.7 is on par with Sonnet 4.6 in OpenClaw.
- Professional Workspace: SOTA in professional knowledge, supports multi-turn, high-fidelity Office file editing.
MiniMax Agent: agent.minimax.io
API: platform.minimax.io
Token Plan: platform.minimax.io/subscribe/toke…

1.7K

Lisan al Gaib retweeted

Cem Karsan 🥐@jam_croissant·1d

The width 📏 b/w PINK👛 & YELLOW🟡= LEVERAGE This is the true barometer 🌡️ of success for this administration…

Cem Karsan 🥐@jam_croissant

1) Who do you think The US’s 🇺🇸 war in Iran 🇮🇷 is actually against???
2) Given that… What do you see as the US’s 🇺🇸 #1 & #2 greatest sources of leverage in this Grand War ⚔️???
3) Are those 2 sources of leverage somehow connected 🪢 to 1 another???
4) Given that, why might the Strait of Hormuz 🚢 be the most critical front of this Grand War⚔️???
5) Finally… Why might that mean that this conflict likely won’t (actually) be "very complete, pretty much," for many years??? 🤷‍♂️
#🥐RUMBS

243

81.6K

Lisan al Gaib@scaling01·1d

lisan never forgets comments

1.9K

Lisan al Gaib@scaling01·1d

the words of companies mean so much lmao

永雏塔菲@xhyctf

@scaling01 @teortaxesTex xiaomi denied it in wechat

178

16.3K

Lisan al Gaib@scaling01·1d

@ZixuanLi_ what? im so confused

770

Zixuan Li@ZixuanLi_·1d

Me introducing M2.7💯

743

31.2K

Lisan al Gaib@scaling01·1d

@lazysloth 11 what is that? oscar amount for ants?

aislop@lazysloth·1d

@scaling01 oof ain't gonna happen, all the other pictures had luscious hair, one of them is diff

118

Lisan al Gaib@scaling01·1d

we can beat LOTR
no doubters will be left

The Cinéprism@TheCineprism

Can Dune surpass The Lord of the Rings legacy?

4.6K

Lisan al Gaib@scaling01·1d

@lazysloth yeah because it's going to sweep the oscars

119

Lisan al Gaib@scaling01·1d

github.com/openai/paramet…

2.1K

Lisan al Gaib@scaling01·1d

OpenAI just released "Parameter Golf", a new challenge to train the best language model that fits in a 16MB artifact and trains in under 10 minutes on 8xH100s
There's also a leaderboard. If you perform well they might hire you
The challenge is open from March 18th to April 30th

262

20.4K
