Elron Bandel

796 posts


@ElronBandel

Research Scientist | @IBMResearch | General Agent Evaluation Team

Tel Aviv, Israel · Joined March 2020
407 Following · 319 Followers
Pinned Tweet
Elron Bandel@ElronBandel·
A big thank you to the Unitxt team, collaborators, and our community for an incredible 2024! Together, we pushed boundaries in AI evaluation and set new standards for the field. Read our End-of-Year Summary here: unitxt.ai/en/latest/blog… #unitxt #llmevaluation
Elron Bandel@ElronBandel·
@yoavgo What did you think about the symbolic regression part? Unlike NAND or plain binary, EML is differentiable and denser, so formulas stay at manageable depths, making search practical.
(((ل()(ل() 'yoav))))👾
is the EML result surprising or even interesting to mathematicians? it gives me vibes of some triviality that one would give undergrads to prove in the first problem in a homework problem set rather than "a big thing", but then again i am terrible at judging these things and it might actually be incredibly hard. so which is it?
Elron Bandel retweeted
Asaf Yehudai@AsafYehudai·
Amazing to see the progress @NotebookLM is making! @JagersbergKnut created a video about our new paper that captures the essence nicely; great analogies, clear explanations, and some really cute infographics. Go watch it (or at least send your agents 😅): youtube.com/watch?v=fEB_oS…
[YouTube video]
Elron Bandel retweeted
Shir Ashury-Tahan@ShirAshuryTahan·
A cool project for anyone who wants to experiment with general agents and understand them better
Asaf Yehudai@AsafYehudai

New preprint, evaluation framework & leaderboard!🚨 General-purpose AI agents are everywhere. 🤖 From ReAct to @claudeai Code and @OpenAI SDK. But how do we actually evaluate them — as general agents? Currently, benchmarks are deeply tied to domain-specific setups, making it impossible to evaluate true cross-domain agents. We’re changing that! We’re introducing Exgentic and the Open General Agent Leaderboard. 🧵👇

Elron Bandel retweeted
Leshem (Legend) Choshen 🤖🤗
For a long time I avoided agent research. For me it meant going back, a symbol of everything we achieved going down the drain. Back to pipelines, no end2end, no control over weights; they won, they have AI, we play with scraps on top. Which made me much more excited about this:
Elron Bandel retweeted
Asaf Yehudai@AsafYehudai·
New preprint, evaluation framework & leaderboard!🚨 General-purpose AI agents are everywhere. 🤖 From ReAct to @claudeai Code and @OpenAI SDK. But how do we actually evaluate them — as general agents? Currently, benchmarks are deeply tied to domain-specific setups, making it impossible to evaluate true cross-domain agents. We’re changing that! We’re introducing Exgentic and the Open General Agent Leaderboard. 🧵👇
Elron Bandel@ElronBandel·
@michielmv The honest truth: I think both benchmarks and agents are not there yet. We need new benchmarks with clear rules on when we allow or disallow continuity, and agents with clear mechanisms to turn the memory component on and off, etc.
Michiel V@michielmv·
@ElronBandel interesting project. curious how you're handling task diversity in the evals. openclaw is tricky to benchmark because so much value is in memory/continuity across sessions, not raw task execution. has that come up?
Elron Bandel retweeted
Elron Bandel@ElronBandel·
💻 Claude Code? 🦞 Open Claw? Maybe a simple loop of LLM calls? Which agent is best across diverse tasks? Which one to deploy at scale? We're introducing the Open General Agent Leaderboard. General agents are far too important to leave untracked.🧵
Elron Bandel retweeted
Leshem (Legend) Choshen 🤖🤗
There's a lot to read about Exgentic, our proposed evaluation x.com/ElronBandel/st… But the bottom line is: a sharp reduction in tweaking per benchmark, one place to run it all, and a massive leaderboard testing tons of stuff. And of course, all open :-)
Elron Bandel@ElronBandel

💻 Claude Code? 🦞 Open Claw? Maybe a simple loop of LLM calls? Which agent is best across diverse tasks? Which one to deploy at scale? We're introducing the Open General Agent Leaderboard. General agents are far too important to leave untracked.🧵
