Yaron (Ron) Minsky

@yminsky

Occasional OCaml programmer. Host of @signalsthreads. @[email protected] @yminsky.bsky.social https://t.co/kiUGRvWOO2

Joined June 2009
360 Following · 17.6K Followers
Yaron (Ron) Minsky@yminsky·
So, I really want someone to do a study on the effectiveness of types for agents. Studying the same question with humans is absurdly expensive, but agents provide a new way of asking the question. Is anyone working on this?
James Noble@jameskjx·
@yminsky studying this with agents doesn't provide a new way of asking the question --- rather it's a way of asking (or at least starting to answer) a *different* question. but again: remember you are studying agents, not (just) type systems. most likely mostly studying the agents.
Yaron (Ron) Minsky@yminsky·
@samth @difficultyang Maybe, though I worry that the measures of success used in those benchmarks are fairly weak. I'd be interested in more time-intensive human evaluations of the quality of the results.
Yaron (Ron) Minsky@yminsky·
@difficultyang I agree that a mere lack of type errors is not an interesting goal. I would think that it would be reasonable to measure the speed at which a hard task is completed, as well as the defect rate of the resulting code.
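A minimal sketch of that comparison, in Python, assuming a hypothetical `agent` callable and `count_defects` scorer supplied by the experimenter (neither is a real API): run the same tasks with and without type feedback, then compare completion time and defect rate.

```python
# Hypothetical experiment harness: compare agent runs with and without
# type feedback on the same tasks. `agent` and `count_defects` are
# stand-ins supplied by the experimenter, not a real library API.
import time
import statistics

def run_trial(task, agent, count_defects, use_types):
    """Time one agent run on `task` and count defects in its output."""
    start = time.monotonic()
    solution = agent(task, typechecker_enabled=use_types)
    elapsed = time.monotonic() - start
    return elapsed, count_defects(task, solution)

def compare(tasks, agent, count_defects, n_runs=5):
    """Report median completion time and mean defect count per condition."""
    for use_types in (False, True):
        times, defects = [], []
        for task in tasks:
            for _ in range(n_runs):
                t, d = run_trial(task, agent, count_defects, use_types)
                times.append(t)
                defects.append(d)
        print(f"types={use_types}: median time {statistics.median(times):.1f}s, "
              f"mean defects {statistics.mean(defects):.2f}")
```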
difficultyang@difficultyang·
@yminsky Let's define the first order question then. "Well" is very underspecified. "Doesn't have type errors" is too easy (or too hard? Mumble mumble math proofs). Is our goal token efficiency? Latency? Success rate at the one-shot task?
Yaron (Ron) Minsky@yminsky·
@difficultyang But... You know as well as anyone that types can prevent a wide variety of bugs! So why am I telling you this?
Yaron (Ron) Minsky@yminsky·
@difficultyang That's not a crazy outcome, but it's pretty surprising. I think feedback is super important when you get beyond toy problems. And I think it's really hard to get similarly good results by just improving the prompts.
Yaron (Ron) Minsky@yminsky·
@difficultyang Meaning that the variance in results between different prompting approaches and different ways of setting up the agentic harness will dominate the differences from the kinds of feedback you provide to the model?
agniv@agniv_s·
@yminsky Like, your question is "given a more general SWE-BENCH style of benchmark, and giving an agent a typechecking tool that it can use, does this meaningfully improve results?"
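As a rough illustration of that setup, here is a sketch of an agent loop with a type checker exposed as a tool; the `model` callable and its action format are hypothetical, and mypy stands in for whatever checker the benchmark's language would use.

```python
# Hypothetical agent loop with a type checker exposed as a tool.
# `model` is a stand-in for an LLM call that returns a dict describing
# the next action; mypy is used only as an example checker.
import subprocess

def typecheck(path: str) -> str:
    """Run mypy on `path` and return its diagnostics as feedback text."""
    result = subprocess.run(["mypy", path], capture_output=True, text=True)
    return result.stdout or "no type errors"

def agent_loop(model, task, max_steps=10):
    history = [task]
    for _ in range(max_steps):
        action = model(history)  # e.g. {"kind": "typecheck", "path": ...}
        if action["kind"] == "typecheck":
            history.append(typecheck(action["path"]))
        elif action["kind"] == "done":
            return action["solution"]
        else:
            history.append(f"unsupported action: {action['kind']}")
    return None  # gave up after max_steps
```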
Chad Brewbaker@SMT_Solvers·
@yminsky @headinthebox However, you can do it with @NousResearch Hermes. Have it build coreutils in Python without type annotations, then do it with type annotations. Which is faster after a few runs?
Yaron (Ron) Minsky@yminsky·
@agniv_s The core question I'm interested in is whether an agent given a type system will generate better results, more quickly, than one without.
agniv@agniv_s·
@yminsky could you clarify what the effectiveness of types for agents means? like, if i give an agent access to a sandbox with a strong compiler, how it reacts to the compiler feedback? or agent-as-compiler, see what types of types it creates when taking in a prompt?