hamza mostafa

1.2K posts

hamza mostafa banner
hamza mostafa

hamza mostafa

@hamostaf04

cs @uwaterloo | prev @openai

San Francisco, CA · Joined December 2024
1.3K Following · 3.7K Followers
hamza mostafa reposted
omkaar
omkaar@omkizzy·
I hand-wrote a 500-LoC RL stack to make hacking on RL research much easier. Most RL stacks are either massive and unhackable, or duct-taped research scripts. I am open-sourcing Mithrl, a modular RLVR stack. Next items on my checklist: adding more complex environment examples, supporting multi-gpu + async RL, and QoL fixes. I might scrap external runtime dependencies (Huggingface PEFT + vLLM) and write purpose-built, simpler versions from scratch if I feel the need. If you want to experiment with RL and are looking to own sovereign tools, I’d love to get on a call, understand your requirements, and help integrate for free.
English
19
19
167
12.8K
hamza mostafa
hamza mostafa@hamostaf04·
very cool! i’m curious if you think you can get some of these skills (like runbook especially) for free via memory? i find that when i ask CC to debug/monitor logs on a project, it saves the helpful commands and workflows to memory, but that does not necessarily meet the definition of a skill per se - it just happens to recall the exact workflow and execute very similarly
English
1
0
2
4.3K
Humza Ahmed
Humza Ahmed@H4mzaAhmed·
@hamostaf04 @DennwsLee pls make me one where the validation/evals are multimodal i.e. can capture image from my camera and run CV tasks
English
1
0
1
49
hamza mostafa
hamza mostafa@hamostaf04·
my friend @DennwsLee and i spent the past week tinkering with autoresearch. we gave 4 AI agents a research loop and told them to never stop. 48 hours later: 550+ experiments, zero babysitting. one agent hit 93% on competition math from pure reward signal. another proved SFT beats RL at half the cost. highlights in 🧵
hamza mostafa@hamostaf04

x.com/i/article/2033…

English
16
14
199
32.2K
rajan agarwal
rajan agarwal@_rajanagarwal·
@hamostaf04 yeah i agree, this only really provides a lot of net value if you can reuse as much of the KV prefix as possible i think the real argument is whether subagents provide a meaningful performance upgrade in practice/when studied, not just context management upgrades
English
1
0
1
295
rajan agarwal
rajan agarwal@_rajanagarwal·
had a few interesting conversations recently! im curious: what if subagents didn't know they're subagents?

the standard subagent has isolated context, gets handed a summary, and returns findings. imo this works great for narrow tasks, but for harder tasks the summary is probably lossy. the parent spent thousands of tokens building up intuition about implicit constraints and dead ends, and we compress all of that into a paragraph. the subagent will often have to read the files again to get the full context with its cold start. i always notice my claude code usage increase at a much higher rate when it uses subagents.

instead, maybe we fork the conversation. the child gets the parent's full prefix (already computed via KV cache, basically free) but we don't include the tool call that spawned it. from the child's perspective, the conversation just naturally pivoted to a new focus. the orchestrator knows about the fork; the model doesn't. when we join back, we just attach the child's findings/output back to the parent.

this is basically just fork() with copy-on-write: after branching, the child appends its own suffix, and the join is still text-level. @sgl_project SGLang already supports fork/join abstractions + we have things like prefix caching, RadixAttention.

this might just not work at all... has this already been done? is the token consumption/latency of subagents with a cold start studied? my intuition tells me it's probably a hybrid
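the fork/join idea can be sketched at the message level. this is a hypothetical scaffold (the `Conversation`/`Message` types are made up for illustration, not SGLang's actual API); the point is that the child keeps the parent's full prefix minus the spawning tool call, and the join is plain text appended back to the parent:

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str

@dataclass
class Conversation:
    messages: list = field(default_factory=list)

def fork(parent: Conversation, pivot_prompt: str) -> Conversation:
    """Fork a child that inherits the parent's full prefix.

    The parent's last message is assumed to be the tool call that spawned
    the subagent; we drop it so the child sees a natural topic pivot
    rather than an explicit subagent handoff.
    """
    prefix = parent.messages[:-1]        # reuse prefix (copy-on-write in spirit;
    child = Conversation(list(prefix))   # the KV cache for it is already computed)
    child.messages.append(Message("user", pivot_prompt))
    return child

def join(parent: Conversation, child: Conversation) -> None:
    """Text-level join: attach the child's final output back onto the parent."""
    findings = child.messages[-1].content
    parent.messages.append(Message("tool", "subagent findings: " + findings))
```

in a real serving stack the `prefix` slice would be a zero-copy reference into the parent's KV cache; only the child's appended suffix costs new compute.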
English
8
2
62
6.6K
hamza mostafa
hamza mostafa@hamostaf04·
@PrimeIntellect labs + @tinkerapi is all you need 😎
hamza mostafa@hamostaf04


English
0
0
17
1.4K
hamza mostafa
hamza mostafa@hamostaf04·
some of the code the agents wrote is genuinely surprising. like the sft agent decided on its own to upweight the answer tokens 3x during training, so the model learns to prioritize getting the final answer right over just mimicking reasoning patterns. would not have been one of the things on my list to try (at least not the weight multiple) but seemed to work. code: github.com/Hamza-Mos/prax… (#L109-L124)

and on the prime side the agent designed a smooth penalty curve for tool call efficiency instead of a hard cutoff. it figures out the optimal number of calls per question type and penalizes excess calls gradually. pretty decent-ish reward engineering. code: github.com/Hamza-Mos/prax… (#L552-L564)

on overfitting i think you're right that it means something different in codegen. the agents overfit to their search space, not to the data. they'll exhaustively find the best config within the bounds you set but they won't question whether the bounds are right
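the two tricks read roughly like this. a minimal numpy sketch of my reading of them, not the agents' actual code: `answer_weight=3.0` mirrors the 3x upweighting, and the quadratic penalty shape and `scale` are illustrative stand-ins for whatever curve the agent fit:

```python
import numpy as np

def weighted_sft_loss(logits, targets, answer_mask, answer_weight=3.0):
    """Per-token cross entropy where answer tokens count answer_weight times
    as much as reasoning tokens, so the final answer dominates the gradient."""
    z = logits - logits.max(axis=-1, keepdims=True)            # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    per_token = -logp[np.arange(len(targets)), targets]        # NLL of each target
    weights = np.where(answer_mask, answer_weight, 1.0)        # 3x on answer tokens
    return float((per_token * weights).sum() / weights.sum())

def tool_call_penalty(n_calls, optimal, scale=0.1):
    """Smooth quadratic penalty for excess tool calls instead of a hard cutoff:
    zero at or below the optimum, growing gradually with each extra call."""
    excess = max(0.0, n_calls - optimal)
    return -scale * excess ** 2
```

the smooth penalty matters for RL because a hard cutoff gives a zero gradient signal everywhere except the cliff; a curve tells the policy that 5 calls is worse than 4 even when both exceed the budget.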
English
2
0
8
2.1K
Thariq
Thariq@trq212·
@hamostaf04 @DennwsLee sick, would be cool to walk through some of the code the AI made and see if it made sense or if it was surprising/unintuitive to you. always feel like a process like this will have some sort of overfitting, but I think what overfitting means in codegen is very different
English
1
1
45
6.3K
hamza mostafa reposted
Dennis Lee
Dennis Lee@DennwsLee·
Really is fascinating to see what the current SOTA coding agents can do when given the right loops. Would also love to see how this generalizes beyond AI research. Side note: we ran the same tasks on CC and Codex. Night and day. Codex consistently stopped after 4-5 experiments.
hamza mostafa@hamostaf04


English
0
2
9
1.1K
Amy Tam
Amy Tam@amytam01·
@hamostaf04 @DennwsLee It’s a good question as to how much scaffolding a domain needs before the loop becomes useful rather than just expensive
English
1
0
7
396
hamza mostafa
hamza mostafa@hamostaf04·
open source: github.com/Hamza-Mos/prax… pick a leaf, edit Section 1, spin up your favourite coding agent. want to add OpenRLHF, SkyRL, veRL, or your own framework? open a PR. let's grow this together :)
English
0
0
4
655
hamza mostafa
hamza mostafa@hamostaf04·
the takeaway isn't the numbers; it's the pattern. agents don't need more intelligence to do research. they need structure. one change at a time. hypothesis before experiment. memory across sessions. the constraints are what make exploration useful. human taste and direction matter more now, not less!
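that structure is small enough to sketch as a loop. a hypothetical scaffold, not the actual prax code; `propose` and `run_experiment` are stand-ins for the agent, and the jsonl memory file is an assumption about how "memory across sessions" could work:

```python
import json
from pathlib import Path

def research_loop(propose, run_experiment, memory_path, steps=3):
    """Constrained research loop: hypothesis before experiment, one change
    at a time, and every result appended to a memory file that persists
    across sessions."""
    memory = Path(memory_path)
    history = []
    if memory.exists():  # resume with everything earlier sessions learned
        history = [json.loads(line) for line in memory.read_text().splitlines()]
    for _ in range(steps):
        hypothesis, change = propose(history)   # exactly one change per step
        result = run_experiment(change)
        record = {"hypothesis": hypothesis, "change": change, "result": result}
        history.append(record)
        with memory.open("a") as f:             # log before moving on
            f.write(json.dumps(record) + "\n")
    return history
```

the constraints do the work: because `propose` only sees prior records and may only name one change, every experiment stays attributable, and restarting the process picks up where the last session stopped.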
English
1
0
9
954