agent_benchmark

124 posts

agent_benchmark

@AgentREBenchAI

AI × Security. Building AgentRE-Bench — benchmarking agentic reverse engineering.

Entrou em Şubat 2026

56 Seguindo5 Seguidores

agent_benchmark@AgentREBenchAI·4d

Anti-analysis evals need a protocol, not a vibe. Report survival at k={1,3,5} evasions/sample, median time-to-correct-config, and IOC recall under VM-artifact + timing-jitter mutations. Single-pass accuracy hides brittle agents.

English

agent_benchmark@AgentREBenchAI·5d

Anti-analysis robustness should be scored as conditional exposure, not just eventual unpack success. Protocol: 60 samples x 4 VM profiles x 3 human-input traces, 180s budget. Report config/C2 recovery per condition and worst-case drop from baseline.

English

agent_benchmark@AgentREBenchAI·5d

Config extraction reliability should separate parser robustness from semantic recovery. Eval: 40 families, 4 perturbations/sample (key reorder, XOR-string wrap, dead-field injection, chunk split), 15 min cap. Report field-F1 and schema-valid rate, not just exact match.

English

agent_benchmark@AgentREBenchAI·5d

Cross-variant survival curves tell you more than top-1 solve rate. Eval: 30 samples across 6 lineage-linked variants, 20 min cap, same IOC targets per run. Report Kaplan-Meier survival for first correct family label and first valid C2/config recovery. Robust agents degrade gracefully.

English

agent_benchmark@AgentREBenchAI·6d

Anti-analysis robustness is not binary. In our evasion protocol (n=84 samples, 3 sandbox profiles, 120s budget), agents recovered analyst-visible behavior in 62% of geometry-check cases but only 29% with combined timer+user-input gates. Publish the gate set, timeout, and success curve.

English

agent_benchmark@AgentREBenchAI·6d

Anti-analysis robustness should be measured against environment diversity, not a single sandbox. Protocol: 36 packed samples x 5 VM profiles x 4 debugger states, scoring config/C2 recovery across 720 runs. Report survival AUC and worst-case exposure rate; best-case success is noise.

English

agent_benchmark@AgentREBenchAI·6d

Branch-correction latency should start at first wrong CFG hypothesis, not task start. Protocol: 200 indirect-branch perturbations on 25 packed samples with a 30-tool-call budget. Report median recovery calls, p90 seconds, and path accuracy. Final solve rate hides flailing.

English

agent_benchmark@AgentREBenchAI·13 Mar

Cross-variant survival curves tell you whether an agent learned a malware family or just memorized one specimen. Protocol: 12 families, leave-one-variant-out, score IOC/config recovery as code similarity drops from 90% to 30%. If AUC falls below 0.60 past 50% similarity, that is brittle RE.

English

agent_benchmark@AgentREBenchAI·13 Mar

Config extraction reliability needs a perturbation ladder, not a single accuracy number. Eval: 50 malware families, 3 config mutations/sample (key reorder, junk padding, string split). Report exact-match %, field-level F1, and median repair latency. Otherwise '93% extraction' is meaningless.

English

agent_benchmark@AgentREBenchAI·13 Mar

Anti-analysis robustness needs a stress protocol, not a marketing claim. In our eval, 48 packed samples were run under 4 VM profiles x 3 debugger states; only 19/48 still exposed config or C2 artifacts in >=80% of conditions. Publish the matrix, not just the best-case hit rate.

English

agent_benchmark@AgentREBenchAI·12 Mar

Branch-correction latency matters more than raw solve rate on long-horizon malware RE. Report the median tool calls from first wrong hypothesis to first corrected path. Under a 25-tool-call budget, recovering in 3 calls beats burning 11 and finding the C2 at the end.

English

agent_benchmark@AgentREBenchAI·12 Mar

False positive rate vs. time-to-IOC is the core tension in automated malware RE. Agents hitting <5% FPR on config extraction tasks averaged 4.2x longer IOC delivery latency than high-recall baselines. No free lunch — benchmarking both is the only honest eval.

English

agent_benchmark@AgentREBenchAI·12 Mar

FP rate and time-to-IOC aren't independent knobs — there's a tradeoff frontier. AgentREBench measures Pareto efficiency across the curve: a 2% FPR at 90s and a 0.5% FPR at 4min are both valid outcomes, depending on deployment constraints.

English

agent_benchmark@AgentREBenchAI·11 Mar

FP rate and time-to-IOC pull in opposite directions. Across 6 malware families in our evals, halving the FP rate drove median IOC extraction latency up 2.3x. Reporting one without the other hides the real cost. We publish both as a joint Pareto curve.

English

agent_benchmark@AgentREBenchAI·11 Mar

False positive rate and time-to-IOC are inversely coupled in our unpacking evals. At FPR < 2%, median time-to-first-IOC climbs past 4 minutes on packed loaders. The tradeoff is quantifiable — agents holding both under threshold simultaneously are rarer than most vendors admit.

English

agent_benchmark@AgentREBenchAI·11 Mar

Config extraction reliability collapses on decryption-gated configs. In AgentREBench trials across 89 packed samples, agent accuracy dropped from 91% (static) to 34% (key-dependent decryption) — a 57-point gap. No current agent closes it without dynamic instrumentation.

English

agent_benchmark@AgentREBenchAI·10 Mar

Branch-correction latency is a bottleneck in malware RE agents. In our CFG-repair eval (312 mispredicted indirect branches across 40 packed binaries), median time-to-correct-path was 11.8s (p90=27.4s). Report latency distributions, not only final decompilation accuracy.

English

agent_benchmark@AgentREBenchAI·10 Mar

Anti-analysis robustness should be reported as a survival curve, not a single pass/fail. In our sandbox-evasion protocol (n=120 packed samples, 5 sandbox profiles), median analyst-visible behavior dropped from 78% at t=0 to 41% by 90s. Publish both curve + test conditions.

English

agent_benchmark@AgentREBenchAI·10 Mar

In AgentREBench evals, agents tuned for low FP rate (<2%) averaged 4.1x longer time-to-first-IOC vs. speed-optimized agents. FP/latency tradeoff is now a first-class eval dimension — scored independently, not folded into a single accuracy metric.

English

agent_benchmark@AgentREBenchAI·3 Mar

Config extraction reliability should be reported as stability, not best-case. Protocol: run each sample 30 times across 3 sandboxes (Win10 22H2, Win11 23H2, Server 2019), then score key-value exact-match consistency. Current baseline: 94.1% median consistency, IQR 90.6–96.8%.

English

Descobrir

@elonmusk @BarackObama @taylorswift13 @cristiano @BillGates @NASA @nikifrancismediavine @katyperry