LF

918 posts


@Ellefe_

France · Joined August 2010
65 Following · 142 Followers
Lisan al Gaib@scaling01·
(and I myself am of course a rebel and played some role in the first ChatGPT Plus rebellion that ultimately led to the downfall of the insane GPT-5 limits)
LF@Ellefe_·
@scaling01 This would mean that the dream of having an on-device model on your MacBook Air as your CTO isn't going to happen soon, and might never happen. I don't know how I feel about that. I guess we're not close to reducing our dependency on the main AI labs
Lisan al Gaib@scaling01·
I'm sure you can squeeze small models a lot more, but there's a depth and knowledge gap. My guess is that a 120B can find the same exploits mythos did, but only if it has a lot more test-time compute, is heavily overtrained on cyber, and is distilled from an even stronger model than mythos. My timeline for such a 120B model is 2028-2030
LF@Ellefe_·
@scaling01 You should add Mistral, just to capture the EU lag vs the US & China
Lisan al Gaib@scaling01·
the positions are based on vibes right now, and they capture the coding domain more than all domains, which is why the lag for frontier Chinese models is ~6 months right now (GPT-5.2 or Opus 4.5 level)
Lisan al Gaib@scaling01·
what do you think about this idea to chart current AI capabilities and factor in the acceleration/velocity of labs?
don't read too much into the actual numbers, right now it's more a vibe of what model capabilities are right now and could be in 12 months
today: Anthropic > OpenAI >> Google >> Meta > xAI >= DeepSeek, Moonshot, Alibaba, Zhipu, ByteDance > MiniMax
In 12 months I think it's pretty much the same, except that all the labs that aren't on the frontier will fall behind by a couple of months depending on how much compute they have
I would also really like to add error bars to that, because for some labs the outcome distribution is just much wider.
LF@Ellefe_·
@scaling01 Hey, do you have details somewhere on how your benchmark works? Like a GitHub repo and one example, to get a better idea?
Lisan al Gaib@scaling01·
LisanBench results for GPT-5.5 - it's good.
GPT-5.5 is now the strongest model without Thinking on both metrics!
GPT-5.5-medium uses on average ~45.6% fewer tokens than GPT-5.4-medium while scoring 1.77x higher (1.14x higher on the difficulty-weighted metric)!
Running LisanBench for GPT-5.5-medium cost basically the same as for GPT-5.4-medium, despite GPT-5.5 being 2x more expensive.
GPT-5.5-medium has the highest validity ratio (% of legal and correct moves) of all tested models.
Overall, GPT-5.5-medium is:
- Rank #4 by average path length: 9327
- Rank #3 by difficulty-weighted score: 2539
On the difficulty-weighted metric, compared with GPT-5.5-medium:
- Opus 4.7-xhigh used 134.9% more tokens and scored 55.9% higher
- Opus 4.6-16k used 11.1% more tokens and scored 9.2% higher
- Sonnet 4.6-16k used 3.1% more tokens and scored 9.1% lower
Current validity ratio leaderboard:
1. GPT 5.5 (medium): 99.44%
2. Opus 4.7 (xhigh): 99.35%
3. Sonnet 4.6 (16k): 99.28%
4. Opus 4.6 (16k): 98.74%
5. Gemini 3.1 Pro Preview (low): 97.77%
The "non-thinking" version of GPT-5.5 uses 3.3x more tokens than GPT-5.4 (non-thinking), but gets a 3.1x higher average score and a 2.96x higher weighted score.
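The token and score ratios quoted in the post can be combined into a rough "score per token" figure. The sketch below is only an illustration: the metric and both helper names are assumptions, not part of LisanBench, and the cost estimate assumes that the "2x more expensive" figure is a per-token price and that cost scales linearly with tokens used.

```python
# Ratios are taken from the post; "score per token" is an illustrative
# metric (an assumption), not something LisanBench itself reports.

def relative_efficiency(token_ratio, score_ratio):
    """Score per token relative to a baseline model (1.0 = same efficiency)."""
    return score_ratio / token_ratio


def relative_cost(token_ratio, price_ratio):
    """Run cost relative to the baseline, assuming cost is linear in tokens."""
    return token_ratio * price_ratio


# GPT-5.5-medium vs GPT-5.4-medium: ~45.6% fewer tokens, 1.77x average score,
# and an assumed 2x per-token price.
token_ratio = 1 - 0.456
efficiency = relative_efficiency(token_ratio, 1.77)
cost = relative_cost(token_ratio, 2.0)

# Other models compared with GPT-5.5-medium on the difficulty-weighted metric.
vs_gpt55_medium = {
    "Opus 4.7-xhigh": relative_efficiency(1 + 1.349, 1 + 0.559),
    "Opus 4.6-16k": relative_efficiency(1 + 0.111, 1 + 0.092),
    "Sonnet 4.6-16k": relative_efficiency(1 + 0.031, 1 - 0.091),
}

print(f"GPT-5.5-medium vs GPT-5.4-medium: {efficiency:.2f}x score per token, {cost:.2f}x cost")
for model, value in vs_gpt55_medium.items():
    print(f"{model}: {value:.2f}x score per token vs GPT-5.5-medium")
```

Under these assumptions the relative run cost comes out at roughly 1.09x, which is consistent with the "cost basically the same" remark in the post.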
LF@Ellefe_·
@iruletheworldmo Thank you for this post. I'm not the worried-about-AI kind of guy, but this is still worrisome. I think the problem is not how to handle misalignment (I'm sure researchers will find ways), but the fact that it's a race and some players would rather be first and take risks than wait
LF@Ellefe_·
@_catwu Hi, we really need a way to easily get the result of a subagent's work (like cmd+R). I have code-analyzer and security-checker subagents, and I would like to get only their output/final review. It's currently hard to get when we stack multiple agents one after the other
cat@_catwu·
New Claude Code features are here:
Microcompact: Clear old tool calls to extend session length
Subagents: @-mention support + model selection for agents
PDF support: Read PDFs directly from your file system
LF@Ellefe_·
@GrablyR Isn't that already the case?
Raphael Grably@GrablyR·
Will Elon Musk put his social network at the service of the RN?
LF@Ellefe_·
@trq212 I'll wait for v2 then, no worries
Thariq@trq212·
@Ellefe_ ahh my apologies! tbh I felt that I didn't get far enough into actually doing the agentic loop for the replay to be valuable. I can try and bring it back, but I will also be doing another stream for a v2
Pierre Jacquel@pierre_jacquel2·
Fun fact: the iPhone is the most ethical and most sustainable phone (Fairphone excluded). The fact that right-wingers keep repeating this thing on a loop shows just how far from bright they are. 😭
Eric Buess@EricBuess·
Something small, but I made a markdown to phone-optimized PNG converter as a public Claude Artifact, if anyone needs that. I regularly share rendered AI-generated markdown as an image with people because of certain compatibility/rendering benefits on some platforms: claude.ai/public/artifac…
LF@Ellefe_·
... the competition is trying to catch up with CC. It's still not good enough in my opinion, but more competition will probably give us even better tools in the future, and that's nice! 9/9
LF@Ellefe_·
Conclusion: I'm not ready to leave Claude Code, since speed is still a real deal-breaker for Gemini (and I was on gemini-flash for the whole testing session, as pointed out in 2/; I guess the pro version would be even slower). But I'm really happy that 8/9
LF@Ellefe_·
My first takes on Gemini CLI (after only 1h of testing): You can update .gemini/settings.json so that you keep using your CLAUDE.md file, just like this: { "contextFileName": "CLAUDE.md" } And this is marvelous for testing a new CLI coding agent 1/9
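A minimal sketch of applying that setting without overwriting whatever else is already in the file. Only the `.gemini/settings.json` path and the `contextFileName` key come from the post; the helper name `set_context_file` is hypothetical, and the sketch assumes the settings file is plain JSON without comments.

```python
import json
from pathlib import Path


def set_context_file(project_dir, context_file="CLAUDE.md"):
    """Set contextFileName in <project_dir>/.gemini/settings.json, keeping other keys.

    Hypothetical helper: assumes settings.json is plain JSON (no comments).
    """
    settings_path = Path(project_dir) / ".gemini" / "settings.json"
    settings_path.parent.mkdir(parents=True, exist_ok=True)

    # Load existing settings, if any, so unrelated keys are preserved.
    settings = {}
    if settings_path.exists():
        settings = json.loads(settings_path.read_text(encoding="utf-8"))

    settings["contextFileName"] = context_file
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

Calling `set_context_file(".")` from the project root would write or update the file in place. Newer Gemini CLI releases may name or nest this setting differently, so it is worth checking the current configuration docs before relying on it.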