shin

5.8K posts


@shfunc

eng @hud_evals | researcher, 進慎

Norway · Joined July 2023
600 Following · 1.3K Followers
shin retweeted
hud
hud@hud_evals·
AI agents are deploying to prod, but can they autonomously find and patch unseen critical vulnerabilities? We introduce ZeroDayBench, a benchmark for evaluating LLM agents on proactive cyberdefense. Plus, a novel high-severity (CVSS 8.1) CVE we found partway through ... 👀
hud tweet media
English
1
14
65
5.8K
shin
shin@shfunc·
@nozomioai not sure what the actual retrieval process looks like under the hood, but either way it could be denser
anyway great product, just rough thoughts!
English
0
0
1
19
shin
shin@shfunc·
@nozomioai yeah but /compact does a pretty bad job keeping what actually matters
save context goes the other way, full history, which is great but heavy on re-injection
something in between would be sick -- summarize with the right rules, store dense, re-inject small
English
2
0
1
24
shin
shin@shfunc·
more than a month ago i thought about solving context sharing/handling techniques, and just found out @nozomioai already exists, which is an awesome tool
in addition, it would be nice to have smth like compact: save -- summarizes before saving, keeps only signal (non-obvious decisions, state, key paths, open questions)
English
1
1
3
378
shin
shin@shfunc·
@crypt0lake they start to get after the first visit in Europe
English
0
0
1
24
shin
shin@shfunc·
just be optimistic
English
0
0
4
80
shin
shin@shfunc·
@super_bavario
> grandpa are you one of them?
> ...yes, and i still don't know
English
0
0
1
60
mrio
mrio@super_bavario·
>but how many features do these hud guys have grandpa?
>i’m afraid there’s not a single person that knows that anymore little one, not even one
hud@hud_evals

Aviro is introducing Ebla, a state of the art grounded reasoning model. In collaboration with HUD, the Aviro team built C⁴ — a benchmark for long-horizon tasks in corporate document sets. We evaluate four dimensions: Correctness, Completeness, Composition, and Citations. @aviro_ai post-trained GPT-OSS 120b to achieve SOTA performance, with a Pass@1 score of 25.4% and Pass@8 score of 37.1%.

English
1
0
5
306
shin retweeted
hud
hud@hud_evals·
Aviro is introducing Ebla, a state of the art grounded reasoning model. In collaboration with HUD, the Aviro team built C⁴ — a benchmark for long-horizon tasks in corporate document sets. We evaluate four dimensions: Correctness, Completeness, Composition, and Citations. @aviro_ai post-trained GPT-OSS 120b to achieve SOTA performance, with a Pass@1 score of 25.4% and Pass@8 score of 37.1%.
hud tweet media
English
14
29
299
33.4K
shin
shin@shfunc·
@OkabeTech it's kinda random, mostly only in Abu-Dhabi, but i'm already home so all good!
English
0
0
0
54
Okabe
Okabe@OkabeTech·
@shfunc Bro 💀 I heard something about the state providing free hotels if you have a cancelled flight tho?
English
1
0
1
22
shin
shin@shfunc·
4 canceled flights, i'm going insane rn
English
1
0
4
85
thegeneralist
thegeneralist@thegeneralist01·
@niggachandesu i’ve seen so many russian-speaking people, but close to none were russian also the propaganda machine is doing its job. social medias are banned in the mainland country.
English
2
0
5
132
shayan
shayan@shayanshafii·
Roy Lee is the closest thing Silicon Valley has to Kanye West
English
158
117
3.8K
187.3K
shin
shin@shfunc·
@OkabeTech was some sort of vacation 😭
English
0
0
1
10
Okabe
Okabe@OkabeTech·
@shfunc Tf you doing in Dubai?!
English
1
0
1
151
shin
shin@shfunc·
@vladnineplusone leaked screenshot of the new eva
Russian
0
0
0
20
Vlad Ten
Vlad Ten@vladnineplusone·
Vlad Ten tweet media
ZXX
1
0
2
459
shin
shin@shfunc·
waiting on anthropic's new compact options because the current one is genuinely criminal
in the meantime building my own context layer between the api and the agent
English
0
0
0
111
shin
shin@shfunc·
the bottleneck isn't context window size. it's that nobody's built a forgetting policy
English
0
0
0
77
diicell
diicell@0xdiicell·
yo, @ludwigABAP vertical tabs are live in helium
diicell tweet media
English
3
1
48
5.6K
Okabe
Okabe@OkabeTech·
@shfunc Delegate non important cognitive work to AI
Italiano
1
0
1
32
shin
shin@shfunc·
work smart AND hard
English
1
0
1
86