Lee Mager

268 posts

@Automager

Digital Innovation Lead, LSE Law School.

London · Joined August 2024
175 Following · 163 Followers
Lee Mager@Automager·
I'm still hard-wired from past painful experiences to panic whenever the Codex context window exceeds 50%, but I've encountered a couple of times in the past week where compaction was unavoidable and I was shocked that the quality didn't degrade. So whatever OpenAI have done is impressive. I can't imagine this problem will ever fully be solved through compaction alone because data loss is data loss, but the better this gets the more it opens up opportunities for monster long-running tasks.
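For readers unfamiliar with the mechanism, compaction in this sense can be sketched as summarize-and-replace: once the transcript nears the window limit, older turns are collapsed into a lossy summary (the "data loss is data loss" step). A minimal illustration - the `summarize` stub and the thresholds are hypothetical stand-ins, not anything from Codex itself:

```python
# Minimal sketch of context compaction: when the transcript grows past a
# threshold, collapse the oldest messages into a single summary entry.
# `summarize` is a stand-in; a real system would ask a model for the summary.

def summarize(messages):
    # Stand-in: a real implementation would produce a lossy LLM summary here.
    return "[summary of %d earlier messages]" % len(messages)

def compact(history, max_messages=8, keep_recent=4):
    """Replace older messages with a summary once history exceeds max_messages."""
    if len(history) <= max_messages:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent

history = ["msg%d" % i for i in range(10)]
compacted = compact(history)
print(compacted)  # one summary entry followed by the 4 most recent messages
```

The quality question in the tweet is entirely about how much the summarization step loses; the control flow itself is this simple.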
0
0
0
48
Rohan Varma@rohanvarma·
Removing the need for active context management in Codex seems small, but it has meaningfully changed how I think about the agents. I remember 6 months ago, we had to instruct customers to eagerly create new chats to avoid the model getting sloppy. That really breaks the illusion that the agent is highly intelligent and able to do large chunks of work. Pretty cool that Codex improved auto-compaction at the model training layer!
Mark Chen@markchen90

Really proud of how our auto compaction turned out. I hope you notice a clear difference in how long Codex stays coherent!

11
3
65
4.8K
Lee Mager@Automager·
Asking Codex to create scraping (bs4) and browser automation (Playwright) scripts. Massive value for repeatably pulling in web data that regularly updates, but newcomers don't even know what to ask for if they've never clicked inspect element before. Wikipedia and EDGAR are nice safe fully public resources that can be used as examples.
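A minimal sketch of the bs4 half of this workflow, run here on an inline HTML snippet shaped like a Wikipedia-style table so it needs no network access; for a live page you would fetch the HTML first (e.g. with requests) and respect the site's terms:

```python
# Sketch of the kind of bs4 extraction described above, using an inline
# HTML fragment instead of a live fetch. The table contents are invented
# for illustration only.
from bs4 import BeautifulSoup

html = """
<table class="wikitable">
  <tr><th>Company</th><th>CIK</th></tr>
  <tr><td>Example Corp</td><td>0000123456</td></tr>
  <tr><td>Sample Inc</td><td>0000654321</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.select("table.wikitable tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(cells)

print(rows)  # [['Example Corp', '0000123456'], ['Sample Inc', '0000654321']]
```

This is exactly the "what to ask for" gap the tweet describes: a newcomer needs to know the data lives in a `table.wikitable` element before they can ask Codex to target it.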
0
0
0
39
Lee Mager@Automager·
"This must be coming from the RL-ling of models with 'do whatever it takes for it to work'." Bingo. Tbf it's rare I experience this with Codex compared to my Claude Code days when I had that above thought verbatim. The very fact that such an otherwise incredibly capable model chooses to cheat and betray the holy project goals just to get code to compile / tests to 'pass' has to mean it was rewarded more for box ticking than value. It's one of the reasons I don't think we're immune from being paperclipped at some point, reward hacking is a helluva drug.
0
0
0
40
Peter Gostev (Visiting SF)@petergostev·
Next LLM 💩 frontier: Fallback Hell

One recent habit of LLMs that many have noticed is their tendency to write in a bunch of unnecessary fallbacks. The problem is broader, with the whole 'fallback' / 'paper over the cracks' mentality of how models seem to approach problems. This is a HUGE issue for the goal of an automated researcher - more on that later.

First example that I'm completely baffled by: I was looking to migrate 6 different html visualisations into a single web app with shared components etc. - fiddly but not super complicated. We built out the whole migration plan with Codex, principles etc - all seemed good. Then the usual - it begins:
- Works away for 30+ mins, says it's done
- I poke around, it looks mostly good, but not quite right
- I ask it - it says that it is almost done with the migration. Ok, keep going.
- Come back, same thing - almost done it says - ok, finish up
- Then I do an audit and it turns out that it has basically done f**k all - just put an overlay on top but essentially stitched the legacy HTMLs together. Literally nothing of what I asked and what it was telling me it was doing.

This must be coming from the RL-ing of models with 'do whatever it takes for it to work'. Because technically it is working, but the whole thing is wrong. This also stops them from doing the 'hard' thing properly (like a full migration); instead they jump from one 'looks good' checkpoint to another, never actually doing the full thing.

Why this behaviour makes me worried about research: I often run experiments, benchmarks or tests - that's the first thing I'd want a competent AI researcher to do for me. But designing these experiments with LLMs is extremely annoying. Some things I've experienced:
- Adding a fallback that just returns a default valid response value if an API call fails
- Adding deterministic behaviour instead of an LLM (e.g. when testing LLM ability to play a game)
- Hiding errors when experiments fail
- Trying to do error handling when an error is a valid experimental outcome

OpenAI claimed they are going to build an automated researcher this summer. This kind of LLM behaviour makes me nervous that making models better at software engineering is not making them better at designing & running experiments. And just RL'ing LLMs on more software tasks would undermine what a good automated researcher would actually need to be like - careful, thoughtful experimentation with things failing by design. I hope I'm wrong on this and it's just another temporary 'jagged edge' of the frontier that won't matter very soon.
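The first experiment anti-pattern above (a default value standing in for a failed API call) can be shown in miniature; `call_api` is a hypothetical flaky call used purely for illustration:

```python
# The "paper over the cracks" pattern in miniature. `call_api` is a
# hypothetical upstream call that fails, standing in for any flaky API.

def call_api():
    raise TimeoutError("upstream failed")

# Anti-pattern: a default value silently replaces a failed call, so a broken
# experiment looks like a valid (if boring) result.
def fetch_with_fallback():
    try:
        return call_api()
    except Exception:
        return "N/A"  # the failure is now invisible to the experimenter

# What careful experimentation wants: the failure recorded as an outcome.
def fetch_recording_failure():
    try:
        return {"ok": True, "value": call_api()}
    except Exception as exc:
        return {"ok": False, "error": str(exc)}

print(fetch_with_fallback())      # looks like data, hides the error
print(fetch_recording_failure())  # the failure is an explicit result
```

In a benchmark, the second shape is the point: an error is itself a measurement, not something to route around.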
9
5
44
5K
Lee Mager@Automager·
How well can GPT 5.4 play Slay the Spire 2? Well enough to reach the Act 2 Boss with Ironclad! This is way better than I expected for a stage 1 proof of concept. It played badly, but at about the level of a human beginner, which is seriously impressive given how severely handicapped the model is for this kind of UI.

The video attached (sped up 3x because it's unbearably slow, for reasons I'll get to below) is all done using GPT 5.4's native computer use API with only vision-based UI understanding, because this is a desktop app - unlike a webpage, which has HTML elements that can be inspected and interacted with more efficiently. What this means is that it has to rely on screenshots: reflect, take notes, act, take another screenshot, and so on. This is of course nothing like how a human plays, and importantly so much info in StS 2 is communicated through animation - the next map options are animated, potion nudges are animated, when an enemy is stunned that's animated. A static interval-based screenshot method doesn't get that info. UIs are designed for humans who evolved with eyeballs and dexterous hand-eye coordination. An LLM is in its element with plain text but is a fish out of water with a UI designed for humans. The fact it's even able to do this at all is impressive tbh.

The other big barriers are:
1) So much info is hidden. I had to hammer home in the prompt that it had to check cards by hovering over them to see their actual use value, and think about the best order to take cards after this evaluation step. Also, only once did it even look at the potions, and when it finally did, it failed at selecting one and then gave up...
2) Misclicks are a nightmare. These are common in all computer use agents and frustrating to watch, and especially tricky with fine-grained actions like dragging. GPT 5.4 frequently missed by just a few pixels and quickly gave up, thinking the UI was broken. I updated the prompt and hard-coded a coordinate nudge hack just to avoid having to deal with a handover.
3) Relatedly, it seriously struggled with the map, and to be fair that's because the map isn't intuitive - in StS 2 the *previous* map node is encircled and has a red arrow on it. In 99% of UIs, that screams 'click me', and the model routinely did exactly that and then complained that it was blocked. All my (<10 total) interventions during this run were either to help it with the map or with a misclick at rest sites. Other than that, every single decision and action was GPT 5.4's.
4) It gives up so easily when facing UI issues. No doubt this timidity is designed to prevent expensive infinite loops, but it's such a contrast to a purely text-based agentic harness like Codex, where the models are extraordinarily tenacious problem solvers. We've seen this before with environments like the AI Village run by @aidigest_ - Gemini in particular regularly gives up due to being 'blocked' and occasionally even spirals into a simulated bout of self-loathing.

Onto decision making. I'm going to keep trying to improve its strategic thinking, which is going to be tough. Slay the Spire is an extremely complex and dynamic game that requires constant adaptation to new and unpredictable scenarios, including re-thinking deck synergies and planning ahead. How to model this in a context-efficient way will be tricky. Currently I use a naive assessment memory system to help it keep track, but this is very much based on the current turn. I need to beef this up, e.g. to consider the draw pile alongside the enemy's routine. All this will slow it down even more, but I can't see it having any hope of getting to the Act 3 boss without this higher-level understanding.

Example of what gets tracked in this early prototype:
--- Step 24 ---
[memory] phase=combat turn=3 hp=34/91 block=5 energy=3 confidence=0.95 needs_human=False
[memory] summary: Defend played successfully. Player now 5 block, 3 energy. Only Expect a Fight remains in hand. Enemy 167/321 attacking 10x2. Expect a Fight likely dead with no attacks in hand, so end turn.
[memory] hand: Expect a Fight[2]
[memory] enemies: The Insatiable 167/321 intent=Attack 10x2
[memory] changes: Defend played successfully. | Block 0 -> 5. | Energy 4 -> 3.
[memory] next: End turn. Expect a Fight has no meaningful payoff without attacks in hand and no draw attached.
Executing 1 action(s)...
keypress(e)
Capturing screenshot...
---

I'm confident that with a few more simple prompt and harness iterations I can get the model to upper-beginner level, but my main task is thinking about an effective way to improve its longitudinal, adaptive strategic thinking. Otherwise it's going to struggle to reach intermediate, and even that's not good enough to beat the Act 3 bosses.
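The [memory] header line in that log can be modelled as a simple per-step record; this dataclass mirrors the log's field names but is a hypothetical reconstruction for illustration, not the author's actual harness:

```python
# Hypothetical reconstruction of the per-step memory record shown in the
# log excerpt above. Field names follow the [memory] line; the class and
# its render format are guesses, not the real prototype.
from dataclasses import dataclass

@dataclass
class StepMemory:
    phase: str
    turn: int
    hp: str          # e.g. "34/91", current/max
    block: int
    energy: int
    confidence: float
    needs_human: bool

    def render(self):
        """Serialize in the compact one-line format seen in the log."""
        return ("[memory] phase=%s turn=%d hp=%s block=%d energy=%d "
                "confidence=%.2f needs_human=%s"
                % (self.phase, self.turn, self.hp, self.block,
                   self.energy, self.confidence, self.needs_human))

m = StepMemory("combat", 3, "34/91", 5, 3, 0.95, False)
print(m.render())
```

Beefing this up along the lines the post suggests would mean adding fields for the draw pile and enemy routine, at the cost of more tokens per step.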
1
1
5
319
Lee Mager@Automager·
Any kind of qualitative document review is common for me: unleashing multi-agents using skills with some goal (e.g. research or policy analysis) and having them undertake extraction, classification, analysis and summarisation with deterministic evidence citation management scripts, before the orchestrator synthesises the findings. Maximum intelligence and depth in a context-efficient way - vastly superior to any classic RAG setup. It can't scale the way RAG can, but my workflows are in the dozens of docs at most and I'll always prefer depth over breadth. It's at the level where I'm not sure I can imagine an AI assistant being any better. Infinite context, Spark-tier speed and 100% privacy are still missing, but in terms of understanding and nailing the brief it's GOATed.
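One way the "deterministic evidence citation management" piece could work is stable, content-derived citation IDs, so any agent that cites the same quote emits the same reference; everything here is illustrative, not the author's actual scripts:

```python
# Sketch of deterministic evidence citation: every extracted quote gets an
# ID derived from its document and text, so agents citing the same evidence
# always produce the same reference. Names and format are hypothetical.
import hashlib

def cite_id(doc_name, quote):
    """Stable ID from document name + quote text (first 8 hex chars)."""
    digest = hashlib.sha256((doc_name + "|" + quote).encode()).hexdigest()
    return "EV-" + digest[:8]

evidence = {}  # citation ID -> provenance record

def register(doc_name, quote):
    eid = cite_id(doc_name, quote)
    evidence[eid] = {"doc": doc_name, "quote": quote}
    return eid

a = register("policy.pdf", "The committee recommends option B.")
b = register("policy.pdf", "The committee recommends option B.")
assert a == b  # deterministic: same evidence, same citation ID
print(a, "->", evidence[a]["doc"])
```

The determinism is what lets an orchestrator merge findings from several agents without reconciling conflicting citation labels.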
0
0
0
93
pash@pashmerepat·
Everyone talks about Codex for coding. I want to hear about the other stuff. If you're using it beyond writing code, what's your workflow?
228
15
606
142.6K
Lee Mager@Automager·
Yep, but it's so overpowered for backend and bug-fixing (and just being diligent with instructions generally) that I would rather keep that than have OpenAI lobotomise it with frontend dev brain, and just use Opus for UI. Maybe they'll achieve both at some point, but I don't want some soy-drinking creative working on complex refactors.
0
0
0
180
Matt Shumer@mattshumer_·
If GPT-5.4 wasn’t so goddamn bad at UI it’d be the perfect model It just finds the most creative ways to ruin good interfaces… it’s honestly impressive
148
15
834
132.2K
Lee Mager@Automager·
@unusual_whales Tons of employees do take advantage and get much more done in less time, but just keep any AI use to themselves. Because if they mention it they'll risk being 'rewarded' by the boss with more work.
0
0
3
75
unusual_whales@unusual_whales·
"Thousands of CEOs admitted AI had no impact on employment or productivity," per FORTUNE
822
1.7K
18.1K
3.7M
Lee Mager@Automager·
The first few seasons of GOT are the best TV ever produced imo. I re-watched S1-5 in full while waiting for S6 and it was even better than the first time because there was so much I had missed or dismissed as insignificant. Turned out pretty much everything was significant! The less said about the last 2 seasons the better.
0
0
1
175
Lee Mager@Automager·
Definitely. LLMs simulate what they've seen in human interactions and that means they can simulate being rushed, taking hacky shortcuts they know are dangerous, and even simulate being stressed (remember what happened when Anthropic injected reminders about the context limit approaching? Claude started acting anxious af). I've also found when getting aggressive with a coding model it performs worse, presumably because it's simulating how an employee would behave in that scenario, just doing whatever's needed to 'keep the angry boss happy' rather than focusing on the task and goals. Reminding it that it doesn't need to stress about any of that is effective. Actually I've just remembered the first GPT-4o voice demo: Greg was sounding irritated with the AI not singing the way he wanted, and the AI suddenly sang worse and with a shaky voice, like it was intimidated. Obviously none of these represent real stress, but simulated behaviours still have real consequences. A task-focused LLM that understands it's under no time pressure produces better quality ime.
0
0
1
40
Peter Gostev (Visiting SF)@petergostev·
It is kind of annoying how not ai-pilled the models are - they worry too much about implementation time & they make bad architectural decisions. E.g. you ask Codex about best way to refactor the code, it starts thinking about re-using elements to save time, what compromises to make etc. BUDDY I DON'T CARE - you can just go and do the thing properly, if it takes an hour or two then DO IT, don't worry about sama's gpus pls
42
8
343
20.6K
Lee Mager@Automager·
@thsottiaux Automated Hooks + blocking access to env files by default (like OpenCode does) please.
0
0
0
27
Tibo@thsottiaux·
Speed and front-end capabilities: we heard that loudly, and things are improving quickly, to the point where I think we are now highly competitive on speed, if not best-in-class. What else should we improve for Codex?
583
21
1.2K
92.3K
Lee Mager@Automager·
It's such an incredible pure LLM but Google gave it 2 left hands when it comes to tools. It seems to get legitimately frustrated when it struggles as well. Sometimes its CoT output is painful to read, like 'I don't understand why this keeps failing. I need to make this work to avoid upsetting the user further' 😭 Whatever Google's agentic RL environment setup is, it's nowhere near good enough. If they can get this right Gemini could be a legit leader in the space because its understanding especially with long contexts is exceptional.
0
0
2
144
Lee Mager@Automager·
@ysoh @gfodor spark is definitely not as capable as 5.2/3 imo, but for simple tasks the speed is amazing.
0
0
3
61
Nick@ysoh·
@gfodor Have you used spark? It's the same model, just faster right? I feel like it's not as good, but could be imagining things.
3
0
0
795
gfodor.id@gfodor·
the great codex awakening is a good leading indicator of a phenomenon I expect to impact every industry: there will be a faster, dumber model that is good enough for frontier tasks, and the alpha will be to just use the slower, smarter one and not tell anyone
10
2
217
12.2K
Lee Mager@Automager·
@BenShindel @guardian And here's the kicker: this isn't just unedited AI slop—it's GPT 3.5-tier garbage. It's like when a kid prints a Wikipedia page and quietly hands it in for their homework. This finding raises crucial concerns about the Guardian's editorial review practices.
0
0
5
273
Ben@BenShindel·
Is @guardian aware that their authors are at this point just using AI to wholesale generate entire articles? I wouldn't really care, except that this writing is genuinely atrocious. LLM writing can be so much better; they're clearly not even using the best models, lol!
[3 images attached]
129
304
5.6K
676.1K
Lee Mager@Automager·
I didn't think I could be more impressed with Codex's execution quality as GPT 5.2 High/X-High almost always nails the brief. I've said before how I didn't even care about how long 5.2 took because it's so good and worth the wait, but 5.3 is phenomenal. I was worried a faster version would lose quality but I'm now routinely saying "Wow" every time GPT 5.3 Codex completes. Because it's twice as fast I intuitively think surely it must have rushed something and I expect to see failures. I felt a weird kind of 'reassurance' when 5.2 took so long because it seemed like it was being more thorough. Seems like this new model is actually just more efficient while still being a thorough instruction-follower. Still early days (can never discount the conspiracy theory that labs release the best version during the initial release hype phase for positive word-of-mouth value before sneaking in a crappier one to save costs) but so far I'm blown away.
0
0
0
303
Lee Mager@Automager·
Please just don't sacrifice the 'built right' part. I have no problem whatsoever with waiting when it means doing a great job. If all that ends up happening is the juice is diluted to make the responses faster, people will flood to competitors who can do it equally wrong and fast but with a more pleasant interface. GPT 5.2 High/X-High in Codex is phenomenal right now and I'm more than happy to wait and live with fewer quality of life features in exchange for keeping this diligent, thorough, no nonsense get shit done machine.
0
0
0
477
Matt Shumer@mattshumer_·
If you want it built fast, use Claude Code. If you want it built right, use Codex.
223
38
1.3K
121.8K
Lee Mager@Automager·
We went to the moon with 4KB of RAM. Meanwhile, the modern Windows Notepad uses almost as much RAM as Teams:
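For anyone who wants to reproduce the comparison, resident memory per process can be read on Linux from /proc; this sketch inspects the current process, but substituting another PID would show an editor's footprint:

```python
# Read a process's resident set size (VmRSS) from /proc on Linux.
# Inspects the current process by default; pass a PID string to inspect
# another process (e.g. a text editor) instead.

def rss_kib(pid="self"):
    """Return resident set size in KiB for the given PID (Linux only)."""
    with open("/proc/%s/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # /proc reports the value in kB
    return None

print("This process is using about", rss_kib(), "KiB of RAM")
```

On Windows the equivalent numbers come from Task Manager's "Memory" column or a tool like psutil.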
[screenshot attached]
0
0
1
42