hamza mostafa

0

2

4.3K

hamza mostafa retweetledi

Thariq@trq212·2d

ZXX

340

2K

15K

6.1M

hamza mostafa@hamostaf04·3d

@H4mzaAhmed @DennwsLee oooo interesting idea for a new leaf!

English

35

Humza Ahmed@H4mzaAhmed·3d

@hamostaf04 @DennwsLee pls make me one where the validation/evals are multimodal i.e. can capture image from my camera and run CV tasks

English

0

1

49

hamza mostafa@hamostaf04·4d

my friend @DennwsLee and i spent the past week tinkering with autoresearch we gave 4 AI agents a research loop and told them to never stop 48 hours later: 550+ experiments, zero babysitting. One agent hit 93% on competition math from pure reward signal. another proved SFT beats RL at half the cost. highlights in 🧵

English

16

14

199

32.2K

hamza mostafa@hamostaf04·3d

@_rajanagarwal would be really cool to see an experiment conducted on this!

English

1

51

rajan agarwal@_rajanagarwal·3d

@hamostaf04 yeah i agree, this only really provides a lot of net value if you can reuse as much of the KV prefix as possible i think the real argument is whether subagents provide a meaningful performance upgrade in practice/when studied, not just context management upgrades

English

0

1

295

rajan agarwal@_rajanagarwal·3d

had a few interesting conversations recently! im curious what if subagents didn't know they're subagents? the standard subagent has isolated context, handed a summary and returns findings. imo this works great for narrow tasks but for harder tasks, the summary is probably lossy. the parent spent thousands of tokens building up intuition about implicit constraints and dead ends, and we compress all of that into a paragraph. the subagent will often have to read the files again to get the full context with its cold start. i always notice my claude code usage increase at a much higher rate when it uses subagents instead, maybe we fork the conversation. the child gets the parent's full prefix (already computed via KV cache, basically free) but don't include the tool call that spawned it. from the child's perspective, the conversation just naturally pivoted to a new focus the orchestrator knows about the fork. the model doesn't. when we join back, we just attach the child's findings/output back to the parent this is basically just fork() with copy-on-write. after branching, the child appends its own suffix and the join is still text-level. @sgl_project SGLang already supports fork/join abstractions + we have things like prefix caching, RadixAttention this might just like not work at all... has this already been done? is the token consumption/latency of subagents with a cold start studied? my intuition tells me it's probably a hybrid

English

8

2

62

6.6K

hamza mostafa retweetledi

sameel arif@endpoint·3d

wrote this article on how we built a passive caching system that compares actions across millions of requests would love if you gave it a read, check it out below :)

Stagehand 🤘@Stagehanddev

English

24

2K

hamza mostafa retweetledi

Hector@iamhectorlopez·3d

Great read. x.com/i/status/20333…

English

1

632

hamza mostafa retweetledi

Dakshay Mehta@Dakshay·3d

this is an awesome read, if you are even slightly into RSI. recursive research models are going to be huge going forward.

English

1

5

1K

hamza mostafa retweetledi

rajan agarwal@_rajanagarwal·3d

great read i think we’ll see a lot of improvements with automated research this year, likely derivative of long horizon coding performance

English

5

33

10.7K

hamza mostafa@hamostaf04·4d

@adiprasadd @DennwsLee nah you!

English

123

adi@adiprasadd·4d

@hamostaf04 @DennwsLee goat

English

0

1

132

hamza mostafa@hamostaf04·4d

@matthieuschulz @DennwsLee appreciate you man

English

1

155

Matthieu Schulz@matthieuschulz·4d

@hamostaf04 @DennwsLee This is sick

English

0

2

155

hamza mostafa@hamostaf04·4d

@barathvelmu @DennwsLee thanks gang!

English

123

Barath Velmurugan@barathvelmu·4d

@hamostaf04 @DennwsLee this is very cool bro!

English

0

1

107

hamza mostafa@hamostaf04·4d

@PrimeIntellect labs + @tinkerapi is all you need 😎

some of the code the agents wrote is genuinely surprising. like the sft agent decided on its own to upweight the answer tokens 3x during training, so the model learns to prioritize getting the final answer right over just mimicking reasoning patterns. would not have been one of the things on my list to try (at least not the weight multiple) but seemed to work. code: #L109-L124" target="_blank" rel="nofollow noopener">github.com/Hamza-Mos/prax… and on the prime side the agent designed a smooth penalty curve for tool call efficiency instead of a hard cutoff. it figures out the optimal number of calls per question type and penalizes excess calls gradually. pretty decent-ish reward engineering. code: #L552-L564" target="_blank" rel="nofollow noopener">github.com/Hamza-Mos/prax… on overfitting i think you're right that it means something different in codegen. the agents overfit to their search space, not to the data. they'll exhaustively find the best config within the bounds you set but they won't question whether the bounds are right

English

17

1.4K

hamza mostafa@hamostaf04·4d

some of the code the agents wrote is genuinely surprising. like the sft agent decided on its own to upweight the answer tokens 3x during training, so the model learns to prioritize getting the final answer right over just mimicking reasoning patterns. would not have been one of the things on my list to try (at least not the weight multiple) but seemed to work. code: #L109-L124" target="_blank" rel="nofollow noopener">github.com/Hamza-Mos/prax… and on the prime side the agent designed a smooth penalty curve for tool call efficiency instead of a hard cutoff. it figures out the optimal number of calls per question type and penalizes excess calls gradually. pretty decent-ish reward engineering. code: #L552-L564" target="_blank" rel="nofollow noopener">github.com/Hamza-Mos/prax… on overfitting i think you're right that it means something different in codegen. the agents overfit to their search space, not to the data. they'll exhaustively find the best config within the bounds you set but they won't question whether the bounds are right

English

2

0

8

2.1K

Thariq@trq212·4d

@hamostaf04 @DennwsLee sick, would be cool to walk through some of the code the AI made and see if made sense or if it was surprising/unintuitive to you always feel like a process like this will have some sort of overfitting, but I think what overfitting means in codegen is very different

English

45

6.3K

hamza mostafa retweetledi

Dennis Lee@DennwsLee·4d

Really is fascinating to see what the current SOTA coding agents can do when given the right loops. Would also love to see how this generalizes beyond AI research. Side note: we ran the same tasks on CC and Codex. Night and day. Codex consistently stopped after 4-5 experiments.

my friend @DennwsLee and i spent the past week tinkering with autoresearch we gave 4 AI agents a research loop and told them to never stop 48 hours later: 550+ experiments, zero babysitting. One agent hit 93% on competition math from pure reward signal. another proved SFT beats RL at half the cost. highlights in 🧵

English

2

9

1.1K

hamza mostafa@hamostaf04·4d

@amytam01 @DennwsLee indeed!

English

2

235

Amy Tam@amytam01·4d

@hamostaf04 @DennwsLee It’s a good question as to how much scaffolding a domain needs before the loop becomes useful rather than just expensive

English

0

7

396

hamza mostafa@hamostaf04·4d

@SwishMoe thanks brother!

English

1

92

hamza mostafa retweetledi

Mohammed Alshehri@SwishMoe·4d

Good Read @hamostaf04

English

2

3

917

hamza mostafa@hamostaf04·4d

open source: github.com/Hamza-Mos/prax… pick a leaf, edit Section 1, spin up your favourite coding agent. want to add OpenRLHF, SkyRL, veRL, or your own framework? open a PR. let's grow this together :)

English

4

655

hamza mostafa@hamostaf04·4d

the takeaway isn't the numbers; it's the pattern. agents don't need more intelligence to do research. they need structure. one change at a time. hypothesis before experiment. memory across sessions. the constraints are what make exploration useful. human taste and direction matter more now, not less!

English