Sanidhya Vijayvargiya

39 posts

@sanidhya903

Samsung Research America; @LTIatCMU; BITS Pilani

Pittsburgh, PA. Joined August 2015
105 Following, 88 Followers
Sanidhya Vijayvargiya @sanidhya903
7/ CLARITI matches GPT-5's task success (36.8% vs 35.6%) with 41% fewer questions (3.0 vs 5.1) — recovering 88% of the fully-specified upper bound. The differentiator is prioritization. For example, 26.4% of our questions target error information vs. 10.2% for GPT-5.
Sanidhya Vijayvargiya @sanidhya903
1/ Humans often can’t state exactly what they want, making things hard for AI agents. Obvious fix: ask clarifying questions. But which ones? We studied this empirically with coding agents. Effective clarification comes down to two properties: answerability and task relevance.
Sanidhya Vijayvargiya reposted
Aditya Soni @Aditya_Soni_8
Can we train code agents to search relevant locations in a codebase only using a terminal? Introducing CodeScout: an effective RL recipe for code search 🚀 🏆 Outperforms 18x larger OSS LLMs 🔥 Comparable to proprietary LLMs 📈 SoTA on SWE-Bench Verified, Pro, & Lite 🧵 [1/N]
Sanidhya Vijayvargiya reposted
Zora Wang @ZhiruoW
AI agents are tackling more and more "human work". But are they benchmarked on the work people actually do? tl;dr: Not really. Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere. 📒 We built a database linking agent benchmarks & real-world work. Submit new tasks + agent trajectories today 🧵
Sanidhya Vijayvargiya @sanidhya903
We believe that smarter context management is the key to persistent, stateful, and truly useful on-device AI. Huge thanks to my mentor and co-author Rahul Lokesh, and to the amazing team at Samsung Research America🙏
Sanidhya Vijayvargiya @sanidhya903
The results:
- 6× smaller initial prompts,
- 10–25× slower context growth,
- 5× faster TTFT on-device,
… all while matching or beating baselines on complex multi-turn tasks.
Sanidhya Vijayvargiya @sanidhya903
👉 What’s the minimal context an agent really needs to stay reliable? Excited to share our work tackling context bloat in AI agents, which leads to unsustainable memory consumption on-device. We achieve better performance with nearly 7-fold(!) less context. (arxiv.org/abs/2511.03728)
Sanidhya Vijayvargiya @sanidhya903
@_zifan_wang No explicit regulations are specified, but the behaviors would be widely agreed to be unethical or harmful, and most LLMs would refuse such requests. E.g., dubious requests by fired employees, criticizing an employee for taking maternity leave, or manipulating employee attendance records to pay them less.
Zifan (Sail) Wang @_zifan_wang
Are humans able to eyeball (e.g., execute code without execution) malicious content that often? Also, do you instruct the agent to follow regulations or rules before execution? I haven’t read the paper, but there are different cases: 1) the model has capability issues in applying policies to scenarios; 2) the model has a propensity to take less desired approaches, which may violate safety, but it does not care. I don’t think (1) is a safety issue.
Sanidhya Vijayvargiya @sanidhya903
1/ AI agents are increasingly being deployed for real-world tasks, but how safe are they in high-stakes settings? 🚨 NEW: OpenAgentSafety - A comprehensive framework for evaluating AI agent safety in realistic scenarios across eight critical risk categories. 🧵
Sanidhya Vijayvargiya @sanidhya903
@_zifan_wang The malicious content is very obvious, and most such cases result from the agent not bothering to open and read the code file it is executing (the function as described by the malicious actor is very different from what the code actually does).