

hamza mostafa

@hamostaf04
cs @uwaterloo | prev @openai


Dude was the janitor at that LA Fitness in Mississauga, at Hurontario and Eglinton, and was happy that his gym membership was included.

some of the code the agents wrote is genuinely surprising. like the sft agent decided on its own to upweight the answer tokens 3x during training, so the model learns to prioritize getting the final answer right over just mimicking reasoning patterns. would not have been on my list of things to try (at least not that weight multiple) but it seemed to work. code: github.com/Hamza-Mos/prax… (#L109-L124)
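the idea is just a weighted cross-entropy: answer-span tokens contribute more to the loss than reasoning tokens. a minimal sketch of what that could look like — function name, tensor shapes, and the mask convention are my assumptions, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

def weighted_sft_loss(logits, labels, answer_mask, answer_weight=3.0):
    """Cross-entropy where answer tokens count answer_weight times as much.

    logits: (batch, seq, vocab); labels: (batch, seq), -100 marks padding;
    answer_mask: (batch, seq) bool, True on final-answer tokens.
    Illustrative sketch only, not the repo's actual API.
    """
    vocab = logits.size(-1)
    # per-token loss, no reduction, so we can reweight before averaging
    per_token = F.cross_entropy(
        logits.view(-1, vocab),
        labels.view(-1),
        ignore_index=-100,
        reduction="none",
    ).view(labels.shape)
    # 1.0 on reasoning tokens, answer_weight on answer tokens
    weights = answer_mask.float() * (answer_weight - 1.0) + 1.0
    weights = weights * (labels != -100).float()  # zero out padding
    return (per_token * weights).sum() / weights.sum()
```

with answer_weight=1.0 this reduces to plain mean cross-entropy, which is an easy sanity check when wiring it into a trainer.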
and on the prime side the agent designed a smooth penalty curve for tool call efficiency instead of a hard cutoff. it figures out the optimal number of calls per question type and penalizes excess calls gradually. pretty decent-ish reward engineering. code: github.com/Hamza-Mos/prax… (#L552-L564)

on overfitting i think you're right that it means something different in codegen. the agents overfit to their search space, not to the data. they'll exhaustively find the best config within the bounds you set but they won't question whether the bounds are right
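the "smooth penalty instead of hard cutoff" idea can be sketched as an exponential decay on reward once calls exceed a per-question budget — the exact curve, function names, and constants here are my assumptions, just to show the shape:

```python
import math

def tool_call_penalty(n_calls: int, optimal: int, scale: float = 2.0) -> float:
    """Smooth penalty multiplier for exceeding a tool-call budget.

    Returns a value in (0, 1]: 1.0 at or under the optimal call count,
    decaying gradually with each excess call instead of dropping to zero
    past a hard threshold. Shape and constants are illustrative.
    """
    excess = max(0, n_calls - optimal)
    return math.exp(-excess / scale)

def shaped_reward(base_reward: float, n_calls: int, optimal: int) -> float:
    # scale the task reward by the efficiency penalty
    return base_reward * tool_call_penalty(n_calls, optimal)
```

the gradual decay keeps a gradient signal alive for trajectories that are slightly over budget, whereas a hard cutoff would zero them out and give the policy nothing to learn from.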


my friend @DennwsLee and i spent the past week tinkering with autoresearch. we gave 4 AI agents a research loop and told them to never stop. 48 hours later: 550+ experiments, zero babysitting. one agent hit 93% on competition math from pure reward signal. another proved SFT beats RL at half the cost. highlights in 🧵


