Elron Bandel

796 posts


@ElronBandel

Research Scientist | @IBMResearch | General Agent Evaluation Team

Tel Aviv, Israel · Joined March 2020
407 Following · 319 Followers
Pinned Tweet
Elron Bandel@ElronBandel·
A big thank you to the Unitxt team, collaborators, and our community for an incredible 2024! Together, we pushed boundaries in AI evaluation and set new standards for the field. Read our End-of-Year Summary here: unitxt.ai/en/latest/blog… #unitxt #llmevaluation
Elron Bandel@ElronBandel·
@yoavgo What did you think about the symbolic regression part? Unlike NAND or plain binary, EML is differentiable and denser, so formulas stay at manageable depths, making search practical.
(((ل()(ل() 'yoav))))👾
is the EML result surprising or even interesting to mathematicians? it gives me vibes of some triviality that one would give undergrads to prove in the first problem in a homework problem set rather than "a big thing", but then again i am terrible at judging these things and it might actually be incredibly hard. so which is it?
Elron Bandel retweeted
Asaf Yehudai@AsafYehudai·
Amazing to see the progress @NotebookLM is making! @JagersbergKnut created a video about our new paper that captures the essence nicely; great analogies, clear explanations, and some really cute infographics. Go watch it (or at least send your agents 😅): youtube.com/watch?v=fEB_oS…
[YouTube video]
Elron Bandel retweeted
Shir Ashury-Tahan@ShirAshuryTahan·
A cool project for anyone who wants to experiment with general agents and understand them better
Asaf Yehudai@AsafYehudai

New preprint, evaluation framework & leaderboard!🚨 General-purpose AI agents are everywhere. 🤖 From ReAct to @claudeai Code and @OpenAI SDK. But how do we actually evaluate them — as general agents? Currently, benchmarks are deeply tied to domain-specific setups, making it impossible to evaluate true cross-domain agents. We’re changing that! We’re introducing Exgentic and the Open General Agent Leaderboard. 🧵👇

Elron Bandel retweeted
Leshem (Legend) Choshen 🤖🤗
For a long time I avoided agent research. For me it meant going back, a symbol of everything we achieved going down the drain. Back to pipelines, no end2end, no control over weights; they won, they have AI, we play with scraps on top. Which made me much more excited about this:
Elron Bandel retweeted
Asaf Yehudai@AsafYehudai·
New preprint, evaluation framework & leaderboard!🚨 General-purpose AI agents are everywhere. 🤖 From ReAct to @claudeai Code and @OpenAI SDK. But how do we actually evaluate them — as general agents? Currently, benchmarks are deeply tied to domain-specific setups, making it impossible to evaluate true cross-domain agents. We’re changing that! We’re introducing Exgentic and the Open General Agent Leaderboard. 🧵👇
Elron Bandel@ElronBandel·
@michielmv The honest truth: I think both benchmarks and agents are not there yet. We need new benchmarks with clear rules on when we allow or disallow continuity, and agents with clear mechanisms to turn the memory component on and off, etc.
Michiel V@michielmv·
@ElronBandel interesting project. curious how you're handling task diversity in the evals. openclaw is tricky to benchmark because so much value is in memory/continuity across sessions, not raw task execution. has that come up?
Elron Bandel retweeted
Elron Bandel@ElronBandel·
💻 Claude Code? 🦞 Open Claw? Maybe a simple loop of LLM calls? Which agent is best across diverse tasks? Which one to deploy at scale? We're introducing the Open General Agent Leaderboard. General agents are far too important to leave untracked.🧵
Elron Bandel retweeted
Leshem (Legend) Choshen 🤖🤗
There's a lot to read about Exgentic, our proposed evaluation x.com/ElronBandel/st… But the bottom line is: a sharp reduction in tweaking per benchmark, one place to run it all, and a massive leaderboard testing tons of stuff. And of course, all open :-)
Elron Bandel@ElronBandel

💻 Claude Code? 🦞 Open Claw? Maybe a simple loop of LLM calls? Which agent is best across diverse tasks? Which one to deploy at scale? We're introducing the Open General Agent Leaderboard. General agents are far too important to leave untracked.🧵
