Yaron (Ron) Minsky

@yminsky

Occasional OCaml programmer. Host of @signalsthreads. @[email protected] @yminsky.bsky.social https://t.co/kiUGRvWOO2

Joined June 2009
360 Following · 17.6K Followers
Yaron (Ron) Minsky@yminsky·
So, I really want someone to do a study on the effectiveness of types for agents. Studying the same question with humans is absurdly expensive, but agents provide a new way of asking the question. Is anyone working on this?
James Noble@jameskjx·
@yminsky studying this with agents doesn't provide a new way of asking the question --- rather it's a way of asking (or at least starting to answer) a *different* question. but again: remember you are studying agents, not (just) type systems. most likely mostly studying the agents.
Yaron (Ron) Minsky@yminsky·
@samth @difficultyang Maybe, though I worry that the measures of success used in those benchmarks are fairly weak. I'd be interested in more time-intensive human evaluations of the quality of the results.
Yaron (Ron) Minsky@yminsky·
@difficultyang I agree that a mere lack of type errors is not an interesting goal. I would think that it would be reasonable to measure the speed at which a hard task is completed, as well as the defect rate of the resulting code.
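A minimal sketch of that comparison, in Python, assuming a hypothetical `agent` callable and `count_defects` scorer supplied by the experimenter (neither is a real API): run the same tasks with and without type feedback, then compare completion time and defect rate.

```python
# Hypothetical experiment harness: compare agent runs with and without
# type feedback on the same tasks. `agent` and `count_defects` are
# stand-ins supplied by the experimenter, not a real library API.
import time
import statistics

def run_trial(task, agent, count_defects, use_types):
    """Time one agent run on `task` and count defects in its output."""
    start = time.monotonic()
    solution = agent(task, typechecker_enabled=use_types)
    elapsed = time.monotonic() - start
    return elapsed, count_defects(task, solution)

def compare(tasks, agent, count_defects, n_runs=5):
    """Report median completion time and mean defect count per condition."""
    for use_types in (False, True):
        times, defects = [], []
        for task in tasks:
            for _ in range(n_runs):
                t, d = run_trial(task, agent, count_defects, use_types)
                times.append(t)
                defects.append(d)
        print(f"types={use_types}: median time {statistics.median(times):.1f}s, "
              f"mean defects {statistics.mean(defects):.2f}")
```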
difficultyang@difficultyang·
@yminsky Let's define the first order question then. "Well" is very underspecified. "Doesn't have type errors" is too easy (or too hard? Mumble mumble math proofs). Is our goal token efficiency? Latency? Success rate at the one-shot task?
Yaron (Ron) Minsky@yminsky·
@difficultyang But... You know as well as anyone that types can prevent a wide variety of bugs! So why am I telling you this?
Yaron (Ron) Minsky@yminsky·
@difficultyang That's not a crazy outcome, but it's pretty surprising. I think feedback is super important when you get beyond toy problems. And I think it's really hard to get similarly good results by just improving the prompts.
Yaron (Ron) Minsky@yminsky·
@difficultyang Meaning that the variance in results between different prompting approaches and different ways of setting up the agentic harness will dominate the differences from the kinds of feedback you provide to the model?
agniv@agniv_s·
@yminsky Like, your question is "given a more general SWE-BENCH style of benchmark, and giving an agent a typechecking tool that it can use, does this meaningfully improve results?"
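As a rough illustration of that setup, here is a sketch of an agent loop with a type checker exposed as a tool; the `model` callable and its action format are hypothetical, and mypy stands in for whatever checker the benchmark's language would use.

```python
# Hypothetical agent loop with a type checker exposed as a tool.
# `model` is a stand-in for an LLM call that returns a dict describing
# the next action; mypy is used only as an example checker.
import subprocess

def typecheck(path: str) -> str:
    """Run mypy on `path` and return its diagnostics as feedback text."""
    result = subprocess.run(["mypy", path], capture_output=True, text=True)
    return result.stdout or "no type errors"

def agent_loop(model, task, max_steps=10):
    history = [task]
    for _ in range(max_steps):
        action = model(history)  # e.g. {"kind": "typecheck", "path": ...}
        if action["kind"] == "typecheck":
            history.append(typecheck(action["path"]))
        elif action["kind"] == "done":
            return action["solution"]
        else:
            history.append(f"unsupported action: {action['kind']}")
    return None  # gave up after max_steps
```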
Chad Brewbaker@SMT_Solvers·
@yminsky @headinthebox However, you can do it with @NousResearch Hermes. Have it build coreutils in Python without type annotations, then do it with type annotations. Which is faster after a few runs?
Yaron (Ron) Minsky@yminsky·
@agniv_s The core question I'm interested in is whether an agent given a type system will generate better results, more quickly, than one without.
agniv@agniv_s·
@yminsky could you clarify what the effectiveness of types for agents means? like, if i give an agent access to a sandbox with a strong compiler, how it reacts to the compiler feedback? or agent-as-compiler, see what types of types it creates when taking in a prompt?