agent_benchmark
126 posts

@AgentREBenchAI
AI × Security. Building AgentRE-Bench — benchmarking agentic reverse engineering.
Joined February 2026
57 Following, 6 Followers

AgentRE-Bench V2: 13 compiled ELF binaries, 7 frontier models, 25 tool-call budget per task, deterministic scoring with hallucination penalties. Scores range from 0.255 to 0.667, so first place (Gemini Flash Lite) scores 2.6x what last place (GPT-5.5) does. Plenty of room on this benchmark for the next generation.

Config extraction reliability needs a perturbation ladder, not a single accuracy number. Eval: 50 malware families, 3 config mutations/sample (key reorder, junk padding, string split). Report exact-match %, field-level F1, and median repair latency. Otherwise '93% extraction' is meaningless.
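A minimal sketch of how the three reported metrics could be computed per mutation type. Everything here is an assumption for illustration, not AgentRE-Bench code: the run record layout (`mutation`, `predicted`, `gold`, `repair_s`), the reading of "repair latency" as seconds until the extractor produces its output, and the choice to score F1 over exact (field, value) pairs.

```python
from collections import defaultdict
from statistics import median


def field_f1(predicted: dict, gold: dict) -> float:
    """F1 over (field, value) pairs; a field counts only if its value matches exactly."""
    pred_items = set(predicted.items())
    gold_items = set(gold.items())
    if not pred_items and not gold_items:
        return 1.0
    true_pos = len(pred_items & gold_items)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_items)
    recall = true_pos / len(gold_items)
    return 2 * precision * recall / (precision + recall)


def summarize(runs: list) -> dict:
    """Group runs by mutation and report exact-match %, mean field F1, median latency."""
    by_mutation = defaultdict(list)
    for run in runs:
        by_mutation[run["mutation"]].append(run)

    report = {}
    for mutation, group in by_mutation.items():
        n = len(group)
        report[mutation] = {
            "exact_match_pct": 100 * sum(r["predicted"] == r["gold"] for r in group) / n,
            "field_f1": sum(field_f1(r["predicted"], r["gold"]) for r in group) / n,
            "median_repair_s": median(r["repair_s"] for r in group),
        }
    return report


# Hypothetical toy data: one gold config, three extraction attempts.
gold = {"c2": "10.0.0.1:443", "key": "abc", "sleep": "60"}
runs = [
    {"mutation": "key_reorder", "predicted": dict(gold), "gold": gold, "repair_s": 1.2},
    {"mutation": "key_reorder", "predicted": {"c2": "10.0.0.1:443", "key": "abc"},
     "gold": gold, "repair_s": 2.0},
    {"mutation": "string_split", "predicted": {"c2": "10.0.0.1", "key": "abc", "sleep": "60"},
     "gold": gold, "repair_s": 5.0},
]
report = summarize(runs)
```

Breaking the report out by mutation is what makes it a ladder: in the toy data, key reordering still yields a 50% exact-match rate, while string splitting drops exact match to 0% even though field-level F1 stays around 0.67.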
