Felipe Sztutman

114 posts

@sztlink

Felipe Sztutman · artist, inventor, technical researcher. Field notes on local AI, memory, context, and systems for experience.

São Paulo - Brazil · Joined January 2012
123 Following · 77 Followers
Felipe Sztutman@sztlink·
Update after N=500: gated verifier/rerank did not beat direct entity-hop path prompting.
path: EM 0.216 / F1 0.324
gated: EM 0.216 / F1 0.323
wins/losses/ties: 2/2/496
Useful lesson: path construction matters; clever control did not scale.
github.com/sztlink/turboq…
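For readers who want to reproduce this kind of paired comparison, here is a minimal sketch of how EM, token-level F1, and a wins/losses/ties tally are commonly computed. It assumes SQuAD-style answer normalization and per-example F1 as the comparison score; the post does not say which normalization or which score the repo uses for the tally, so treat both as assumptions. The function names and toy data are illustrative only.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def paired_outcomes(scores_a, scores_b):
    """Per-example tally: where A beats B, loses to B, or ties."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    ties = len(scores_a) - wins - losses
    return wins, losses, ties


# Toy data (hypothetical, not from the benchmark).
golds = ["Paris", "the Nile", "1969"]
path_preds = ["Paris", "Nile river", "1968"]
gated_preds = ["paris.", "Amazon", "1968"]

path_f1 = [token_f1(p, g) for p, g in zip(path_preds, golds)]
gated_f1 = [token_f1(p, g) for p, g in zip(gated_preds, golds)]
print(paired_outcomes(gated_f1, path_f1))  # (0, 1, 2)
```

A tally dominated by ties, as in 2/2/496, means the two methods agree on almost every example, so near-identical aggregate EM and F1 are what one would expect.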
Felipe Sztutman@sztlink·
Published the first public cut of turboquant-cuda-bench: retrieved != used.
Long-context / KV-cache receipts up to 192K on a local RTX 4090: Qwen, llama.cpp, vLLM, TurboQuant, CASK, KVFidelity.
github.com/sztlink/turboq…
Felipe Sztutman@sztlink·
Technical note and sanitized artifact: github.com/sztlink/turboq… No raw Discord-derived data. No broad claim about TurboQuant, CASK, FP8, or long-context models in general.
Felipe Sztutman@sztlink·
A retrieved chunk is not a used chunk. In a long-context decoy fixture, canonical evidence reached context 8/8 times, but baseline answers closed only 5/8. Prompting alone did not fix it. Evidence placement did.
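The retrieved-versus-used distinction can be made concrete with a small tally. The sketch below is an illustration, not the fixture's actual code: the `Case` record, the substring check used as a proxy for "the answer closed", and the `move_evidence` helper are all hypothetical. The post does not say where in the context the evidence was placed, so the target position is left as a parameter.

```python
from dataclasses import dataclass


@dataclass
class Case:
    context_chunk_ids: list[str]  # chunk ids that reached the prompt
    gold_chunk_id: str            # the canonical evidence chunk
    answer: str                   # what the model answered
    gold_answer: str              # what the evidence supports


def retrieved(case: Case) -> bool:
    """Did the canonical evidence reach the context at all?"""
    return case.gold_chunk_id in case.context_chunk_ids


def used(case: Case) -> bool:
    """Crude proxy for 'answer closed': gold answer appears in the output."""
    return case.gold_answer.strip().lower() in case.answer.strip().lower()


def tally(cases):
    """Return (retrieved count, retrieved-and-used count, total cases)."""
    n_retrieved = sum(retrieved(c) for c in cases)
    n_used = sum(retrieved(c) and used(c) for c in cases)
    return n_retrieved, n_used, len(cases)


def move_evidence(chunk_ids, gold_id, position):
    """Return a new ordering with the gold chunk moved to `position`."""
    if gold_id not in chunk_ids:
        raise ValueError("gold chunk was not retrieved")
    rest = [c for c in chunk_ids if c != gold_id]
    rest.insert(position, gold_id)
    return rest


# Toy cases (hypothetical): both retrieve the evidence, only one uses it.
cases = [
    Case(["d1", "gold"], "gold", "The launch is on 12 March.", "12 March"),
    Case(["gold", "d1"], "gold", "It was 3 April.", "12 March"),
]
print(tally(cases))
```

Reporting the two counts side by side (for example 8/8 retrieved against 5/8 closed) keeps retrieval quality and evidence use from being collapsed into one number.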
Felipe Sztutman@sztlink·
4090 field note using TheTom's public TurboQuant + longctx stack. Fitting a 192K window is only step one. In synthetic tests, recall found the right chunk, but decoys buried it. Reranking moved it back to rank 1. Fit buys the window. Ranking decides what reaches it.
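A toy version of the recall-then-rerank effect described above, as a sketch under stated assumptions: the lexical-overlap scorer stands in for whatever reranker was actually used (the post does not name one), and the query, chunk ids, and texts are invented for illustration.

```python
def rank_of(target_id, ranked_ids):
    """1-based rank of target_id, or None if it was not recalled."""
    try:
        return ranked_ids.index(target_id) + 1
    except ValueError:
        return None


def overlap_score(query, text):
    """Toy relevance score: fraction of query tokens present in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rerank(query, candidates, score_fn=overlap_score):
    """Reorder (chunk_id, text) pairs by score_fn, best first."""
    return sorted(candidates, key=lambda c: score_fn(query, c[1]), reverse=True)


# First-stage recall order (hypothetical): decoys sit above the right chunk.
query = "project falcon launch date"
recalled = [
    ("decoy-1", "project falcon budget review"),
    ("decoy-2", "falcon team offsite date"),
    ("gold", "project falcon launch date is 12 march"),
]

before = rank_of("gold", [cid for cid, _ in recalled])
after = rank_of("gold", [cid for cid, _ in rerank(query, recalled)])
print(before, after)  # 3 1
```

Tracking the rank of the known-correct chunk before and after reranking separates "was it recalled" from "did it land where the model will see it".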
Felipe Sztutman@sztlink·
Related-work update: SciBORG (Muhoberac/Chopra et al., arXiv:2507.00081) explicitly uses "action trace fidelity" as an agent-benchmark dimension. KVFidelity sits in the broader trajectory-aware / trace-based evaluation space, applying paired action-trace comparison to KV/V-cache compression with scenario order as a measured axis. Updating the repo to cite this properly.
Felipe Sztutman@sztlink·
@SeraAndroid Thanks, Tim. tool-eval-bench made this possible as a deterministic substrate. I’m keeping KVFidelity as an external paired-trace layer for now: raw trace diffs, review queue, then reviewed behavioral mechanisms.
Felipe Sztutman@sztlink·
I’m building an evaluation lens I call Action-Trace Fidelity. If model, task, prompt, seed, and decoding stay fixed — but the inference apparatus changes — does the operational trace survive? Not just “is the answer right?” Which tools, in what order, with what args?
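One way to read the three questions (which tools, in what order, with what args) is as a paired comparison of two tool-call traces from runs that differ only in the inference apparatus. The sketch below is a hypothetical illustration of that idea, not KVFidelity's implementation: the trace format (a list of dicts with `tool` and `args`) and the `compare_traces` function are assumptions.

```python
def compare_traces(baseline, variant):
    """Compare two tool-call traces along tool set, order, and arguments."""
    base_tools = [c["tool"] for c in baseline]
    var_tools = [c["tool"] for c in variant]

    # Same tools called the same number of times, ignoring order.
    same_tools = sorted(base_tools) == sorted(var_tools)
    # Same tools in the same sequence.
    same_order = base_tools == var_tools
    # Same sequence and identical arguments at every step.
    same_args = same_order and all(
        b["args"] == v["args"] for b, v in zip(baseline, variant)
    )

    # Index of the first step where the two traces differ, if any.
    first_divergence = next(
        (i for i, (b, v) in enumerate(zip(baseline, variant)) if b != v), None
    )
    if first_divergence is None and len(baseline) != len(variant):
        first_divergence = min(len(baseline), len(variant))

    return {
        "same_tools": same_tools,
        "same_order": same_order,
        "same_args": same_args,
        "first_divergence": first_divergence,
    }


# Toy traces (hypothetical): same tools and order, one argument drifts.
baseline = [
    {"tool": "search", "args": {"q": "falcon launch"}},
    {"tool": "read", "args": {"doc_id": 1}},
]
variant = [
    {"tool": "search", "args": {"q": "falcon launch"}},
    {"tool": "read", "args": {"doc_id": 2}},
]
print(compare_traces(baseline, variant))
```

Keeping the three checks separate lets a final answer that happens to be correct still be flagged when the path to it changed.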
Felipe Sztutman@sztlink·
REFRACT q4_0 KV check on Qwen3.6-35B-A3B hybrid (RTX 3090).
q4_0/q4_0 vs q8_0/q8_0:
KLD: 98.81 (close)
Trajectory path: 65.70 (DEGRADED)
GTM-only: 91.26, so metric families diverge.
Artifacts: github.com/sztlink/turboq…
Felipe Sztutman@sztlink·
@SeraAndroid The claim is narrow: Not “KV compression breaks agents.” Not “this KV mode is unsafe.” The finding is that behavioral fidelity has axes: KV config × prompt scaffold × scenario order/context.