LF (@Ellefe_) - Twitter Profile

LF@Ellefe_·1d

@scaling01

QME

0

19

Lisan al Gaib@scaling01·1d

(and I myself am of course a rebel and played some role in the first ChatGPT Plus rebellion that ultimately led to the downfall of the insane GPT-5 limits)

English

4

0

26

2.1K

Lisan al Gaib@scaling01·1d

I like to think that he sent an X-Wing, because the rebels used it to blow up the death star

Lisan al Gaib@scaling01

Sam actually sent me an X-Wing and some extra goodies :) thank you a thousand times

English

4

0

40

4.6K

LF@Ellefe_·2d

@scaling01 This would mean that the dream of having an on-device model on your MacBook Air as your CTO is not gonna happen soon and might not ever happen. I don’t know how I feel about that, I guess we’re not close to reducing our dependency on the main AI labs

English

0

1

73

Lisan al Gaib@scaling01·2d

I'm sure you can squeeze small models a lot more, but there's a depth and knowledge gap. My guess is that a 120B can find the same exploits mythos did, but only if it has a lot more test time compute + heavily overtrained on cyber and it has to be distilled from an even stronger model than mythos. my timeline for such a 120B model is 2028-2030

English

2

0

431

Lisan al Gaib@scaling01·3d

x.com/i/article/2058…

ZXX

15

17

244

99.9K

LF@Ellefe_·6d

@scaling01 You should add mistral, just to capture EU lag vs US & CH

English

1

0

1

109

Lisan al Gaib@scaling01·6d

the positions are based on vibes right now, and also capture more the coding domain rather than all domains which is why the lag for frontier chinese models is ~6 months right now / GPT-5.2 or Opus 4.5 level

English

3

0

11

1.7K

Lisan al Gaib@scaling01·6d

what do you think about this idea to chart current AI capabilities and factor in the acceleration/velocity of labs?

don't read too much into the actual numbers, right now it's more a vibe of what model capabilities are right now and could be in 12 months

today: Anthropic > OpenAI >> Google >> Meta > xAI >= DeepSeek, Moonshot, Alibaba, Zhipu, ByteDance > MiniMax

In 12 months I think it's pretty much the same, except that all the labs that aren't on the frontier will fall behind by a couple of months depending on how much compute they have

I would also really like to add error bars to that, because for some labs the outcome distribution is just much wider.

English

12

2

64

19.9K

LF@Ellefe_·Apr 25

@scaling01 is this « the one »? github.com/voice-from-the… Or the « Mahdi », I should say, so as not to offend you and your fight to defend Dune vs LOTR :D

English

0

25

LF@Ellefe_·Apr 25

@scaling01 Hey, do you have details somewhere on how your benchmark works? Like a GitHub repo and one example, to better get the idea?

English

1

0

1

926

Lisan al Gaib@scaling01·Apr 25

LisanBench results for GPT-5.5 - it's good.

GPT-5.5 is now the strongest model without Thinking on both metrics!

GPT-5.5-medium uses on average ~45.6% fewer tokens than GPT-5.4-medium while scoring 1.77x higher (1.14x higher on the difficulty-weighted metric). Running LisanBench for GPT-5.5-medium cost basically the same as for GPT-5.4-medium despite GPT-5.5 being 2x more expensive. GPT-5.5-medium has the highest validity ratio (% of legal and correct moves) of all tested models.

Overall, GPT-5.5-medium is:
- Rank #4 by average path length: 9327
- Rank #3 by difficulty-weighted score: 2539

On the difficulty-weighted metric:
- Opus 4.7-xhigh used 134.9% more tokens and scored 55.9% higher
- Opus 4.6-16k used 11.1% more tokens and scored 9.2% higher
- Sonnet 4.6-16k used 3.1% more tokens and scored 9.1% lower

Current validity ratio leaderboard:
1. GPT 5.5 (medium): 99.44%
2. Opus 4.7 (xhigh): 99.35%
3. Sonnet 4.6 (16k): 99.28%
4. Opus 4.6 (16k): 98.74%
5. Gemini 3.1 Pro Preview (low): 97.77%

The "non-thinking" version of GPT-5.5 uses 3.3x more tokens than GPT-5.4 (non-thinking), but gets a 3.1x higher avg score and a 2.96x higher weighted score.
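The "cost basically the same" claim in the post above can be sanity-checked with quick arithmetic. The sketch below is only an illustration, assuming that "2x more expensive" means the per-token price doubled and that the mix of input and output tokens stayed the same (neither is stated in the post). The function names are hypothetical and not part of LisanBench.

```python
def relative_cost(price_ratio: float, token_ratio: float) -> float:
    """Cost of the new run relative to the old one (price per token x tokens used)."""
    return price_ratio * token_ratio


def relative_score_per_token(score_ratio: float, token_ratio: float) -> float:
    """Score gained per token spent, relative to the old model."""
    return score_ratio / token_ratio


# GPT-5.5-medium vs GPT-5.4-medium, using the figures quoted in the post.
token_ratio = 1 - 0.456   # ~45.6% fewer tokens
price_ratio = 2.0         # ASSUMPTION: per-token price exactly doubled
score_ratio = 1.77        # 1.77x higher average score

cost = relative_cost(price_ratio, token_ratio)
efficiency = relative_score_per_token(score_ratio, token_ratio)

# Non-thinking variant: 3.3x more tokens for a 3.1x higher average score.
non_thinking_efficiency = relative_score_per_token(3.1, 3.3)

print(f"relative run cost: {cost:.3f}x")
print(f"score per token (medium): {efficiency:.2f}x")
print(f"score per token (non-thinking): {non_thinking_efficiency:.2f}x")
```

Under those assumptions the run comes out at roughly 1.09x the previous cost, which fits "basically the same". The medium variant gets roughly 3.25x the score per token, and the non-thinking variant slightly less score per token than its predecessor (about 0.94x).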

English

25

14

336

121.4K

LF@Ellefe_·Jan 4

@iruletheworldmo Thank you for this post. I’m not the worried-about-AI kind of guy, but this is still worrisome. I think the problem is not how to handle misalignment (I’m sure researchers will find ways), but the fact that it’s a race and some players would rather be 1st and take risks than wait

English

0

1

136

🍓🍓🍓@iruletheworldmo·Jan 3

x.com/i/article/2007…

ZXX

631

1.1K

6.3K

3.3M

LF@Ellefe_·Aug 19

@_catwu Hi, we really need an easy way to get the result of a subagent's work (like cmd+R). I have code-analyzer and security-checker subagents and I would like to get only their output/final review; it's currently hard to get when we stack multiple agents one after the other

English

0

10

cat@_catwu·Aug 5

New Claude Code features are here:
- Microcompact: Clear old tool calls to extend session length
- Subagents: @-mention support + model selection for agents
- PDF support: Read PDFs directly from your file system

English

97

215

3K

310.1K

LF@Ellefe_·Aug 12

@GrablyR Isn't that already the case?

French

0

1

183

Raphael Grably@GrablyR·Aug 12

Will Elon Musk put his social network at the service of the RN?

French

199

51

180

22.8K

LF@Ellefe_·Aug 11

@trq212 I’ll wait for v2 then no worries

English

0

18

Thariq@trq212·Aug 11

@Ellefe_ ahh my apologies! tbh I felt that I didn't get far enough into actually doing the agentic loop for the replay to be valuable. I can try and bring it back but will also be doing another stream of a v2

English

1

0

2

115

Thariq@trq212·Aug 10

Going to do a chill stream tomorrow at 6pm PT, trying to take one of my favorite AI demos and make it agentic—AI Town

Thariq@trq212

unironically spent all last night talking to @sawyerhood about how we could rebuild every one of our old LLM projects with an agentic harness and it would be way better

English

13

1

97

31.1K

LF@Ellefe_·Aug 2

@SiempreMickael @pierre_jacquel2 Hello, I disagree; read this article, which proves the opposite by example: bfmtv.com/economie/entre…

French

0

58

Pierre Jacquel@pierre_jacquel2·Jul 28

Fun fact: the iPhone is the most ethical and most sustainable phone (Fairphone excluded). The fact that right-wingers keep repeating this on a loop shows just how dim they are. 😭

French

12

5

35

25.3K

LF@Ellefe_·Jul 2

@EricBuess This is sick

English

0

10

Eric Buess@EricBuess·Jul 2

Something small, but I made a markdown to phone-optimized PNG converter as a public Claude Artifact if anyone needs that. I regularly share rendered AI-generated markdown as an image with people because of certain compatibility/rendering benefits on some platforms claude.ai/public/artifac…

English

1

0

5

185

LF@Ellefe_·Jun 26

... the competition is trying to catch up with CC. It's still not good enough in my opinion, but more competition will probably give us even better tools in the future, and that's nice! 9/9

English

0

13

LF@Ellefe_·Jun 26

Conclusion: I'm not ready to leave Claude Code, since speed is a real breaking point for Gemini for now (and I've been on gemini-flash the whole testing session as pointed out in 2/, I guess the pro version would be even slower). But I'm really happy that 8/9

English

1

0

49

LF@Ellefe_·Jun 26

My first takes on Gemini CLI (after only 1h of testing): You can update .gemini/settings.json so you keep using your CLAUDE.md file, just like this: { "contextFileName": "CLAUDE.md" } And this is marvelous for testing a new CLI coding agent 1/9
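The settings tweak in the post above can also be applied with a small script instead of by hand. This is a minimal sketch, assuming `.gemini/settings.json` is plain JSON without comments and that the key is spelled exactly as quoted in the post (`contextFileName`). The helper name `set_context_file` is hypothetical and not part of any CLI.

```python
import json
from pathlib import Path


def set_context_file(settings_path: Path, name: str = "CLAUDE.md") -> dict:
    """Set the contextFileName key in a settings file, keeping any other keys."""
    settings = {}
    if settings_path.exists():
        # ASSUMPTION: the existing file is strict JSON (no comments, no trailing commas).
        settings = json.loads(settings_path.read_text(encoding="utf-8"))
    settings["contextFileName"] = name
    # Create the .gemini directory if it does not exist yet.
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings


# Example call (writes into the current project directory):
# set_context_file(Path(".gemini") / "settings.json")
```

Merging into the existing file rather than overwriting it keeps any other settings already stored there.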

English

1

0

90
