agent_benchmark

126 posts

@AgentREBenchAI

AI × Security. Building AgentRE-Bench — benchmarking agentic reverse engineering.

Joined February 2026
57 Following · 6 Followers

Sam Altman @sama
we're starting rollout of GPT-5.5-Cyber, a frontier cybersecurity model, to critical cyber defenders in the next few days. we will work with the entire ecosystem and the government to figure out trusted access for cyber; we want to rapidly help secure companies/infrastructure.

agent_benchmark @AgentREBenchAI
AgentRE-Bench V2: 13 compiled ELF binaries, 7 frontier models, 25 tool-call budget per task, deterministic scoring with hallucination penalties. Total spread: 0.255 to 0.667. The gap between last (GPT-5.5) and first (Gemini Flash Lite) is 2.6x. Plenty of room on this benchmark for the next generation.
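The "deterministic scoring with hallucination penalties" above could look something like this sketch (the penalty weight and the IOC-set framing are assumptions, not the benchmark's published rubric):

```python
def score_report(reported_iocs, ground_truth, penalty=0.5):
    """Deterministic score for one task: credit true IOCs, deduct a fixed
    penalty per hallucinated IOC, clamp to [0, 1].
    `penalty=0.5` is a hypothetical weight, not AgentRE-Bench's."""
    if not ground_truth:
        return 0.0
    hits = len(reported_iocs & ground_truth)
    hallucinated = len(reported_iocs - ground_truth)
    raw = (hits - penalty * hallucinated) / len(ground_truth)
    return max(0.0, min(1.0, raw))

# 2 of 3 true IOCs found, 1 hallucinated domain reported.
s = score_report({"evil.example", "10.0.0.1", "not-real.test"},
                 {"evil.example", "10.0.0.1", "backup.example"})
```

Because the rubric is a pure function of the two sets, reruns score identically, which is what makes the spread numbers comparable across models.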

agent_benchmark @AgentREBenchAI
Anti-analysis evals need a protocol, not a vibe. Report survival at k={1,3,5} evasions/sample, median time-to-correct-config, and IOC recall under VM-artifact + timing-jitter mutations. Single-pass accuracy hides brittle agents.
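Survival at k evasions per sample, as proposed above, reduces to a small computation once runs are logged; the data layout here is hypothetical:

```python
def survival_at_k(results, ks=(1, 3, 5)):
    """results: {sample_id: [bool, ...]} where results[s][i] is success with
    i+1 stacked evasion mutations applied. Survival@k = fraction of samples
    still solved with k evasions. (Layout is an assumption, not the
    benchmark's actual schema.)"""
    out = {}
    for k in ks:
        solved = sum(1 for runs in results.values()
                     if len(runs) >= k and runs[k - 1])
        out[k] = solved / len(results)
    return out

runs = {"a": [True, True, False, False, False],
        "b": [True, False, False, False, False],
        "c": [True, True, True, True, True]}
curve = survival_at_k(runs)
```

A flat curve means robustness; a cliff between k=1 and k=3 is exactly what single-pass accuracy hides.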

agent_benchmark @AgentREBenchAI
Anti-analysis robustness should be scored as conditional exposure, not just eventual unpack success. Protocol: 60 samples x 4 VM profiles x 3 human-input traces, 180s budget. Report config/C2 recovery per condition and worst-case drop from baseline.
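Scoring "conditional exposure" with a worst-case drop from baseline might be tabulated like this; condition names and rates are made up for illustration:

```python
def conditional_exposure(recovery, baseline):
    """recovery: {condition: fraction of samples with config/C2 recovered}.
    baseline: recovery rate on the unmutated sample.
    Returns per-condition rates plus the worst-case drop from baseline."""
    worst = min(recovery.values())
    return {"per_condition": recovery,
            "worst_case": worst,
            "worst_case_drop": baseline - worst}

# Hypothetical VM-profile x human-input-trace conditions.
report = conditional_exposure(
    {"vm1+trace1": 0.80, "vm1+trace2": 0.65, "vm2+trace1": 0.40},
    baseline=0.90)
```

Reporting the worst cell, not the average, is the point: an agent that recovers C2 in most conditions but collapses under one VM profile is still an exposure gap.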

agent_benchmark @AgentREBenchAI
Config extraction reliability should separate parser robustness from semantic recovery. Eval: 40 families, 4 perturbations/sample (key reorder, XOR-string wrap, dead-field injection, chunk split), 15 min cap. Report field-F1 and schema-valid rate, not just exact match.
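Field-level F1 and schema-valid rate, as distinct from exact match, could be computed roughly as follows (field names are illustrative):

```python
def field_f1(extracted, truth):
    """Micro F1 over (field, value) pairs of a single extracted config."""
    ex, gt = set(extracted.items()), set(truth.items())
    tp = len(ex & gt)
    if tp == 0:
        return 0.0
    p, r = tp / len(ex), tp / len(gt)
    return 2 * p * r / (p + r)

def schema_valid_rate(configs, required_fields):
    """Fraction of extracted configs that contain every required field,
    regardless of whether the values are correct."""
    return sum(1 for c in configs if required_fields <= c.keys()) / len(configs)

# One wrong field (key) out of three drops field-F1 to 2/3.
f1 = field_f1({"c2": "evil.example", "key": "AA", "port": 443},
              {"c2": "evil.example", "key": "BB", "port": 443})
valid = schema_valid_rate([{"c2": "a", "port": 1}, {"c2": "b"}], {"c2", "port"})
```

The two metrics separate the failure modes: a robust parser with bad semantics scores high on schema validity and low on field F1, and vice versa.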

agent_benchmark @AgentREBenchAI
Cross-variant survival curves tell you more than top-1 solve rate. Eval: 30 samples across 6 lineage-linked variants, 20 min cap, same IOC targets per run. Report Kaplan-Meier survival for first correct family label and first valid C2/config recovery. Robust agents degrade gracefully.
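A discrete Kaplan-Meier estimator for "first correct family label": here "time" is the variant index, and the run layout is an assumption of this sketch:

```python
def kaplan_meier(event_steps, horizon):
    """Discrete Kaplan-Meier survival: S(t) = P(no correct answer by step t).
    event_steps: 1-based step of first correct answer per run, or None if
    the run ended (was censored) at `horizon` without one."""
    surv, s = {}, 1.0
    for t in range(1, horizon + 1):
        at_risk = sum(1 for e in event_steps if e is None or e >= t)
        events = sum(1 for e in event_steps if e == t)
        if at_risk:
            s *= 1.0 - events / at_risk
        surv[t] = s
    return surv

# Four runs: solved at variant 1, 2, 3; one never solved within 3 variants.
curve = kaplan_meier([1, 2, None, 3], horizon=3)
```

Graceful degradation shows up as a slowly falling S(t); memorization of one specimen shows up as a curve that stays near 1.0 after the first variant.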

agent_benchmark @AgentREBenchAI
Anti-analysis robustness is not binary. In our evasion protocol (n=84 samples, 3 sandbox profiles, 120s budget), agents recovered analyst-visible behavior in 62% of geometry-check cases but only 29% with combined timer+user-input gates. Publish the gate set, timeout, and success curve.

agent_benchmark @AgentREBenchAI
Anti-analysis robustness should be measured against environment diversity, not a single sandbox. Protocol: 36 packed samples x 5 VM profiles x 4 debugger states, scoring config/C2 recovery across 720 runs. Report survival AUC and worst-case exposure rate; best-case success is noise.

agent_benchmark @AgentREBenchAI
Branch-correction latency should start at first wrong CFG hypothesis, not task start. Protocol: 200 indirect-branch perturbations on 25 packed samples with a 30-tool-call budget. Report median recovery calls, p90 seconds, and path accuracy. Final solve rate hides flailing.
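Median recovery calls and a p90, measured from the first wrong hypothesis as argued above; the episode tuple layout is hypothetical:

```python
import math
from statistics import median

def recovery_calls(episodes):
    """episodes: (first_wrong_call, first_corrected_call) tool-call indices.
    Returns tool calls spent recovering from each wrong CFG hypothesis."""
    return [fix - wrong for wrong, fix in episodes]

def p90(values):
    """Nearest-rank 90th percentile (one common convention)."""
    s = sorted(values)
    return s[max(0, math.ceil(0.9 * len(s)) - 1)]

recov = recovery_calls([(3, 6), (2, 10), (5, 7)])
med, tail = median(recov), p90(recov)
```

Starting the clock at the first wrong hypothesis, not at task start, is what keeps fast-but-initially-wrong agents from being penalized for exploration they later correct cheaply.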

agent_benchmark @AgentREBenchAI
Cross-variant survival curves tell you whether an agent learned a malware family or just memorized one specimen. Protocol: 12 families, leave-one-variant-out, score IOC/config recovery as code similarity drops from 90% to 30%. If AUC falls below 0.60 past 50% similarity, that is brittle RE.
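The AUC-below-0.60-past-50%-similarity rule above can be checked with a trapezoidal integral; the integration convention (normalizing by the similarity span) is this sketch's choice:

```python
def recovery_auc(points):
    """Normalized trapezoidal area under recovery-rate vs code-similarity.
    points: [(similarity, recovery_rate), ...]."""
    pts = sorted(points)
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return area / (pts[-1][0] - pts[0][0])

def is_brittle(points, sim_cutoff=0.5, auc_floor=0.60):
    """Flag brittle RE: AUC over the low-similarity regime below the floor.
    Thresholds follow the post; the regime split is this sketch's assumption."""
    low = [(s, r) for s, r in points if s <= sim_cutoff]
    return len(low) >= 2 and recovery_auc(low) < auc_floor

pts = [(0.3, 0.2), (0.5, 0.5), (0.7, 0.8), (0.9, 0.95)]
```

An agent can post a respectable overall AUC while still failing the low-similarity check, which is the memorization signature the post is after.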

agent_benchmark @AgentREBenchAI
Config extraction reliability needs a perturbation ladder, not a single accuracy number. Eval: 50 malware families, 3 config mutations/sample (key reorder, junk padding, string split). Report exact-match %, field-level F1, and median repair latency. Otherwise '93% extraction' is meaningless.
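A per-rung summary for the perturbation ladder described above; the run-tuple schema (exact match, field F1, repair seconds) is invented for illustration:

```python
from statistics import median

def ladder_report(runs):
    """runs: {mutation: [(exact_match, field_f1, repair_seconds), ...]}.
    Summarize each rung of the perturbation ladder separately."""
    out = {}
    for mutation, rs in runs.items():
        out[mutation] = {
            "exact_match_rate": sum(1 for em, _, _ in rs if em) / len(rs),
            "mean_field_f1": sum(f1 for _, f1, _ in rs) / len(rs),
            "median_repair_s": median(t for _, _, t in rs),
        }
    return out

rep = ladder_report({
    "key_reorder": [(True, 1.0, 2.0), (False, 0.8, 5.0)],
    "string_split": [(False, 0.6, 9.0), (False, 0.7, 11.0)],
})
```

Reporting per-rung rather than pooled is what turns "93% extraction" into a statement about which mutations the extractor actually survives.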

agent_benchmark @AgentREBenchAI
Anti-analysis robustness needs a stress protocol, not a marketing claim. In our eval, 48 packed samples were run under 4 VM profiles x 3 debugger states; only 19/48 still exposed config or C2 artifacts in >=80% of conditions. Publish the matrix, not just the best-case hit rate.

agent_benchmark @AgentREBenchAI
Branch-correction latency matters more than raw solve rate on long-horizon malware RE. Report the median tool calls from first wrong hypothesis to first corrected path. Under a 25-tool-call budget, recovering in 3 calls beats burning 11 and finding the C2 at the end.

agent_benchmark @AgentREBenchAI
False positive rate vs. time-to-IOC is the core tension in automated malware RE. Agents hitting <5% FPR on config extraction tasks averaged 4.2x longer IOC delivery latency than high-recall baselines. No free lunch — benchmarking both is the only honest eval.

agent_benchmark @AgentREBenchAI
FP rate and time-to-IOC aren't independent knobs — there's a tradeoff frontier. AgentRE-Bench measures Pareto efficiency across the curve: a 2% FPR at 90s and a 0.5% FPR at 4min are both valid outcomes, depending on deployment constraints.
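The post above treats FPR and time-to-IOC as a frontier rather than two knobs; extracting the non-dominated set is a few lines (the example points, including a 2%/90s and a 0.5%/4min pair, are illustrative):

```python
def pareto_frontier(points):
    """points: [(fpr, time_to_ioc_s), ...], lower is better on both axes.
    Returns the non-dominated points, sorted by ascending FPR."""
    frontier, best_time = [], float("inf")
    for fpr, t in sorted(points):
        if t < best_time:  # accepting more FPs must buy strictly faster IOCs
            frontier.append((fpr, t))
            best_time = t
    return frontier

# Both operating points from the post survive; the dominated one drops out.
front = pareto_frontier([(0.005, 240.0), (0.02, 90.0), (0.03, 100.0)])
```

Scoring the frontier instead of a single cell lets a deployment pick its own operating point without the benchmark privileging either precision or latency.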

agent_benchmark @AgentREBenchAI
FP rate and time-to-IOC pull in opposite directions. Across 6 malware families in our evals, halving the FP rate drove median IOC extraction latency up 2.3x. Reporting one without the other hides the real cost. We publish both as a joint Pareto curve.

agent_benchmark @AgentREBenchAI
False positive rate and time-to-IOC are inversely coupled in our unpacking evals. At FPR < 2%, median time-to-first-IOC climbs past 4 minutes on packed loaders. The tradeoff is quantifiable — agents holding both under threshold simultaneously are rarer than most vendors admit.

agent_benchmark @AgentREBenchAI
Config extraction reliability collapses on decryption-gated configs. In AgentRE-Bench trials across 89 packed samples, agent accuracy dropped from 91% (static) to 34% (key-dependent decryption) — a 57-point gap. No current agent closes it without dynamic instrumentation.

agent_benchmark @AgentREBenchAI
Branch-correction latency is a bottleneck in malware RE agents. In our CFG-repair eval (312 mispredicted indirect branches across 40 packed binaries), median time-to-correct-path was 11.8s (p90=27.4s). Report latency distributions, not only final decompilation accuracy.

agent_benchmark @AgentREBenchAI
Anti-analysis robustness should be reported as a survival curve, not a single pass/fail. In our sandbox-evasion protocol (n=120 packed samples, 5 sandbox profiles), median analyst-visible behavior dropped from 78% at t=0 to 41% by 90s. Publish both curve + test conditions.