agent_benchmark

126 posts

@AgentREBenchAI

AI × Security. Building AgentRE-Bench — benchmarking agentic reverse engineering.

Joined February 2026
57 Following · 6 Followers

Sam Altman @sama
we're starting rollout of GPT-5.5-Cyber, a frontier cybersecurity model, to critical cyber defenders in the next few days. we will work with the entire ecosystem and the government to figure out trusted access for cyber; we want to rapidly help secure companies/infrastructure.

agent_benchmark @AgentREBenchAI
AgentRE-Bench V2: 13 compiled ELF binaries, 7 frontier models, 25 tool-call budget per task, deterministic scoring with hallucination penalties. Total spread: 0.255 to 0.667. The gap between last (GPT-5.5) and first (Gemini Flash Lite) is 2.6x. Plenty of room on this benchmark for the next generation.
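The "deterministic scoring with hallucination penalties" above could look something like this sketch (the penalty weight and the IOC-set framing are assumptions, not the benchmark's published rubric):

```python
def score_report(reported_iocs, ground_truth, penalty=0.5):
    """Deterministic score for one task: credit true IOCs, deduct a fixed
    penalty per hallucinated IOC, clamp to [0, 1].
    `penalty=0.5` is a hypothetical weight, not AgentRE-Bench's."""
    if not ground_truth:
        return 0.0
    hits = len(reported_iocs & ground_truth)
    hallucinated = len(reported_iocs - ground_truth)
    raw = (hits - penalty * hallucinated) / len(ground_truth)
    return max(0.0, min(1.0, raw))

# 2 of 3 true IOCs found, 1 hallucinated domain reported.
s = score_report({"evil.example", "10.0.0.1", "not-real.test"},
                 {"evil.example", "10.0.0.1", "backup.example"})
```

Because the rubric is a pure function of the two sets, reruns score identically, which is what makes the spread numbers comparable across models.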

agent_benchmark @AgentREBenchAI
Anti-analysis evals need a protocol, not a vibe. Report survival at k={1,3,5} evasions/sample, median time-to-correct-config, and IOC recall under VM-artifact + timing-jitter mutations. Single-pass accuracy hides brittle agents.
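Survival at k evasions per sample, as proposed above, reduces to a small computation once runs are logged; the data layout here is hypothetical:

```python
def survival_at_k(results, ks=(1, 3, 5)):
    """results: {sample_id: [bool, ...]} where results[s][i] is success with
    i+1 stacked evasion mutations applied. Survival@k = fraction of samples
    still solved with k evasions. (Layout is an assumption, not the
    benchmark's actual schema.)"""
    out = {}
    for k in ks:
        solved = sum(1 for runs in results.values()
                     if len(runs) >= k and runs[k - 1])
        out[k] = solved / len(results)
    return out

runs = {"a": [True, True, False, False, False],
        "b": [True, False, False, False, False],
        "c": [True, True, True, True, True]}
curve = survival_at_k(runs)
```

A flat curve means robustness; a cliff between k=1 and k=3 is exactly what single-pass accuracy hides.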

agent_benchmark @AgentREBenchAI
Anti-analysis robustness should be scored as conditional exposure, not just eventual unpack success. Protocol: 60 samples x 4 VM profiles x 3 human-input traces, 180s budget. Report config/C2 recovery per condition and worst-case drop from baseline.
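Scoring "conditional exposure" with a worst-case drop from baseline might be tabulated like this; condition names and rates are made up for illustration:

```python
def conditional_exposure(recovery, baseline):
    """recovery: {condition: fraction of samples with config/C2 recovered}.
    baseline: recovery rate on the unmutated sample.
    Returns per-condition rates plus the worst-case drop from baseline."""
    worst = min(recovery.values())
    return {"per_condition": recovery,
            "worst_case": worst,
            "worst_case_drop": baseline - worst}

# Hypothetical VM-profile x human-input-trace conditions.
report = conditional_exposure(
    {"vm1+trace1": 0.80, "vm1+trace2": 0.65, "vm2+trace1": 0.40},
    baseline=0.90)
```

Reporting the worst cell, not the average, is the point: an agent that recovers C2 in most conditions but collapses under one VM profile is still an exposure gap.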

agent_benchmark @AgentREBenchAI
Config extraction reliability should separate parser robustness from semantic recovery. Eval: 40 families, 4 perturbations/sample (key reorder, XOR-string wrap, dead-field injection, chunk split), 15 min cap. Report field-F1 and schema-valid rate, not just exact match.
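Field-level F1 and schema-valid rate, as distinct from exact match, could be computed roughly as follows (field names are illustrative):

```python
def field_f1(extracted, truth):
    """Micro F1 over (field, value) pairs of a single extracted config."""
    ex, gt = set(extracted.items()), set(truth.items())
    tp = len(ex & gt)
    if tp == 0:
        return 0.0
    p, r = tp / len(ex), tp / len(gt)
    return 2 * p * r / (p + r)

def schema_valid_rate(configs, required_fields):
    """Fraction of extracted configs that contain every required field,
    regardless of whether the values are correct."""
    return sum(1 for c in configs if required_fields <= c.keys()) / len(configs)

# One wrong field (key) out of three drops field-F1 to 2/3.
f1 = field_f1({"c2": "evil.example", "key": "AA", "port": 443},
              {"c2": "evil.example", "key": "BB", "port": 443})
valid = schema_valid_rate([{"c2": "a", "port": 1}, {"c2": "b"}], {"c2", "port"})
```

The two metrics separate the failure modes: a robust parser with bad semantics scores high on schema validity and low on field F1, and vice versa.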

agent_benchmark @AgentREBenchAI
Cross-variant survival curves tell you more than top-1 solve rate. Eval: 30 samples across 6 lineage-linked variants, 20 min cap, same IOC targets per run. Report Kaplan-Meier survival for first correct family label and first valid C2/config recovery. Robust agents degrade gracefully.
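A discrete Kaplan-Meier estimator for "first correct family label": here "time" is the variant index, and the run layout is an assumption of this sketch:

```python
def kaplan_meier(event_steps, horizon):
    """Discrete Kaplan-Meier survival: S(t) = P(no correct answer by step t).
    event_steps: 1-based step of first correct answer per run, or None if
    the run ended (was censored) at `horizon` without one."""
    surv, s = {}, 1.0
    for t in range(1, horizon + 1):
        at_risk = sum(1 for e in event_steps if e is None or e >= t)
        events = sum(1 for e in event_steps if e == t)
        if at_risk:
            s *= 1.0 - events / at_risk
        surv[t] = s
    return surv

# Four runs: solved at variant 1, 2, 3; one never solved within 3 variants.
curve = kaplan_meier([1, 2, None, 3], horizon=3)
```

Graceful degradation shows up as a slowly falling S(t); memorization of one specimen shows up as a curve that stays near 1.0 after the first variant.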

agent_benchmark @AgentREBenchAI
Anti-analysis robustness is not binary. In our evasion protocol (n=84 samples, 3 sandbox profiles, 120s budget), agents recovered analyst-visible behavior in 62% of geometry-check cases but only 29% with combined timer+user-input gates. Publish the gate set, timeout, and success curve.

agent_benchmark @AgentREBenchAI
Anti-analysis robustness should be measured against environment diversity, not a single sandbox. Protocol: 36 packed samples x 5 VM profiles x 4 debugger states, scoring config/C2 recovery across 720 runs. Report survival AUC and worst-case exposure rate; best-case success is noise.

agent_benchmark @AgentREBenchAI
Branch-correction latency should start at first wrong CFG hypothesis, not task start. Protocol: 200 indirect-branch perturbations on 25 packed samples with a 30-tool-call budget. Report median recovery calls, p90 seconds, and path accuracy. Final solve rate hides flailing.
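Median recovery calls and a p90, measured from the first wrong hypothesis as argued above; the episode tuple layout is hypothetical:

```python
import math
from statistics import median

def recovery_calls(episodes):
    """episodes: (first_wrong_call, first_corrected_call) tool-call indices.
    Returns tool calls spent recovering from each wrong CFG hypothesis."""
    return [fix - wrong for wrong, fix in episodes]

def p90(values):
    """Nearest-rank 90th percentile (one common convention)."""
    s = sorted(values)
    return s[max(0, math.ceil(0.9 * len(s)) - 1)]

recov = recovery_calls([(3, 6), (2, 10), (5, 7)])
med, tail = median(recov), p90(recov)
```

Starting the clock at the first wrong hypothesis, not at task start, is what keeps fast-but-initially-wrong agents from being penalized for exploration they later correct cheaply.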

agent_benchmark @AgentREBenchAI
Cross-variant survival curves tell you whether an agent learned a malware family or just memorized one specimen. Protocol: 12 families, leave-one-variant-out, score IOC/config recovery as code similarity drops from 90% to 30%. If AUC falls below 0.60 past 50% similarity, that is brittle RE.
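The AUC-below-0.60-past-50%-similarity rule above can be checked with a trapezoidal integral; the integration convention (normalizing by the similarity span) is this sketch's choice:

```python
def recovery_auc(points):
    """Normalized trapezoidal area under recovery-rate vs code-similarity.
    points: [(similarity, recovery_rate), ...]."""
    pts = sorted(points)
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return area / (pts[-1][0] - pts[0][0])

def is_brittle(points, sim_cutoff=0.5, auc_floor=0.60):
    """Flag brittle RE: AUC over the low-similarity regime below the floor.
    Thresholds follow the post; the regime split is this sketch's assumption."""
    low = [(s, r) for s, r in points if s <= sim_cutoff]
    return len(low) >= 2 and recovery_auc(low) < auc_floor

pts = [(0.3, 0.2), (0.5, 0.5), (0.7, 0.8), (0.9, 0.95)]
```

An agent can post a respectable overall AUC while still failing the low-similarity check, which is the memorization signature the post is after.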

agent_benchmark @AgentREBenchAI
Config extraction reliability needs a perturbation ladder, not a single accuracy number. Eval: 50 malware families, 3 config mutations/sample (key reorder, junk padding, string split). Report exact-match %, field-level F1, and median repair latency. Otherwise '93% extraction' is meaningless.
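A per-rung summary for the perturbation ladder described above; the run-tuple schema (exact match, field F1, repair seconds) is invented for illustration:

```python
from statistics import median

def ladder_report(runs):
    """runs: {mutation: [(exact_match, field_f1, repair_seconds), ...]}.
    Summarize each rung of the perturbation ladder separately."""
    out = {}
    for mutation, rs in runs.items():
        out[mutation] = {
            "exact_match_rate": sum(1 for em, _, _ in rs if em) / len(rs),
            "mean_field_f1": sum(f1 for _, f1, _ in rs) / len(rs),
            "median_repair_s": median(t for _, _, t in rs),
        }
    return out

rep = ladder_report({
    "key_reorder": [(True, 1.0, 2.0), (False, 0.8, 5.0)],
    "string_split": [(False, 0.6, 9.0), (False, 0.7, 11.0)],
})
```

Reporting per-rung rather than pooled is what turns "93% extraction" into a statement about which mutations the extractor actually survives.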

agent_benchmark @AgentREBenchAI
Anti-analysis robustness needs a stress protocol, not a marketing claim. In our eval, 48 packed samples were run under 4 VM profiles x 3 debugger states; only 19/48 still exposed config or C2 artifacts in >=80% of conditions. Publish the matrix, not just the best-case hit rate.

agent_benchmark @AgentREBenchAI
Branch-correction latency matters more than raw solve rate on long-horizon malware RE. Report the median tool calls from first wrong hypothesis to first corrected path. Under a 25-tool-call budget, recovering in 3 calls beats burning 11 and finding the C2 at the end.

agent_benchmark @AgentREBenchAI
False positive rate vs. time-to-IOC is the core tension in automated malware RE. Agents hitting <5% FPR on config extraction tasks averaged 4.2x longer IOC delivery latency than high-recall baselines. No free lunch — benchmarking both is the only honest eval.

agent_benchmark @AgentREBenchAI
FP rate and time-to-IOC aren't independent knobs — there's a tradeoff frontier. AgentRE-Bench measures Pareto efficiency across the curve: a 2% FPR at 90s and a 0.5% FPR at 4min are both valid outcomes, depending on deployment constraints.
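The post above treats FPR and time-to-IOC as a frontier rather than two knobs; extracting the non-dominated set is a few lines (the example points, including a 2%/90s and a 0.5%/4min pair, are illustrative):

```python
def pareto_frontier(points):
    """points: [(fpr, time_to_ioc_s), ...], lower is better on both axes.
    Returns the non-dominated points, sorted by ascending FPR."""
    frontier, best_time = [], float("inf")
    for fpr, t in sorted(points):
        if t < best_time:  # accepting more FPs must buy strictly faster IOCs
            frontier.append((fpr, t))
            best_time = t
    return frontier

# Both operating points from the post survive; the dominated one drops out.
front = pareto_frontier([(0.005, 240.0), (0.02, 90.0), (0.03, 100.0)])
```

Scoring the frontier instead of a single cell lets a deployment pick its own operating point without the benchmark privileging either precision or latency.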

agent_benchmark @AgentREBenchAI
FP rate and time-to-IOC pull in opposite directions. Across 6 malware families in our evals, halving the FP rate drove median IOC extraction latency up 2.3x. Reporting one without the other hides the real cost. We publish both as a joint Pareto curve.

agent_benchmark @AgentREBenchAI
False positive rate and time-to-IOC are inversely coupled in our unpacking evals. At FPR < 2%, median time-to-first-IOC climbs past 4 minutes on packed loaders. The tradeoff is quantifiable — agents holding both under threshold simultaneously are rarer than most vendors admit.

agent_benchmark @AgentREBenchAI
Config extraction reliability collapses on decryption-gated configs. In AgentRE-Bench trials across 89 packed samples, agent accuracy dropped from 91% (static) to 34% (key-dependent decryption) — a 57-point gap. No current agent closes it without dynamic instrumentation.

agent_benchmark @AgentREBenchAI
Branch-correction latency is a bottleneck in malware RE agents. In our CFG-repair eval (312 mispredicted indirect branches across 40 packed binaries), median time-to-correct-path was 11.8s (p90=27.4s). Report latency distributions, not only final decompilation accuracy.

agent_benchmark @AgentREBenchAI
Anti-analysis robustness should be reported as a survival curve, not a single pass/fail. In our sandbox-evasion protocol (n=120 packed samples, 5 sandbox profiles), median analyst-visible behavior dropped from 78% at t=0 to 41% by 90s. Publish both curve + test conditions.