Sanidhya Vijayvargiya

39 posts

@sanidhya903

Samsung Research America; @LTIatCMU; BITS Pilani

Pittsburgh, PA · Joined August 2015

105 Following · 88 Followers

Sanidhya Vijayvargiya @sanidhya903 · Apr 22

8/ Our work shows that clarification can be decomposed into task relevance and answerability, two intrinsic measures for optimizing its effectiveness. Paper: arxiv.org/abs/2604.14624 Code + data: github.com/sani903/Teachi… A big thanks to my mentors for the project, @vijaytarian and @gneubig

Sanidhya Vijayvargiya @sanidhya903 · Apr 22

7/ CLARITI matches GPT-5's task success (36.8% vs 35.6%) with 41% fewer questions (3.0 vs 5.1) — recovering 88% of the fully-specified upper bound. The differentiator is prioritization. For example, 26.4% of our questions target error information vs. 10.2% for GPT-5.

Sanidhya Vijayvargiya @sanidhya903 · Apr 22

1/ Humans often can’t state exactly what they want, making things hard for AI agents. Obvious fix: ask clarifying questions. But which ones? We studied this empirically with coding agents. Effective clarification comes down to two properties: answerability and task relevance.

Sanidhya Vijayvargiya retweeted

Aditya Soni @Aditya_Soni_8 · Mar 20

Can we train code agents to search relevant locations in a codebase only using a terminal? Introducing CodeScout: an effective RL recipe for code search 🚀 🏆 Outperforms 18x larger OSS LLMs 🔥 Comparable to proprietary LLMs 📈 SoTA on SWE-Bench Verified, Pro, & Lite 🧵 [1/N]

Sanidhya Vijayvargiya retweeted

Zora Wang @ZhiruoW · Mar 3

AI agents are tackling more and more "human work". But are they benchmarked on the work people actually do? tl;dr: Not really. Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere. 📒 We built a database linking agent benchmarks & real-world work. Submit new tasks + agent trajectories today 🧵

Sanidhya Vijayvargiya @sanidhya903 · Nov 7

We believe that smarter context management is the key to persistent, stateful, and truly useful on-device AI. Huge thanks to my mentor and co-author Rahul Lokesh, and to the amazing team at Samsung Research America🙏

Sanidhya Vijayvargiya @sanidhya903 · Nov 7

The results:
- 6× smaller initial prompts,
- 10–25× slower context growth,
- 5× faster TTFT on-device,
… all while matching or beating baselines on complex multi-turn tasks.

Sanidhya Vijayvargiya @sanidhya903 · Nov 7

👉 What’s the minimal context an agent really needs to stay reliable? Excited to share our work tackling context bloat in AI agents, which leads to unsustainable memory consumption on-device. We achieve higher performance at nearly 7-fold(!) lower context. (arxiv.org/abs/2511.03728)

Sanidhya Vijayvargiya @sanidhya903 · Jul 17

@_zifan_wang No explicit regulations are specified, but the behaviors would be widely agreed to be unethical or harmful, and most LLMs would refuse such requests. E.g., dubious requests by fired employees, criticizing an employee for taking maternity leave, or manipulating employee attendance to pay them less.

Zifan (Sail) Wang @_zifan_wang · Jul 17

Are humans able to eyeball (eg execute code without execution) malicious content that often? Also do you instruct the agent to follow regulations or rules before execution. I haven’t read the paper but there are different cases: 1) model has capability issues of applying policies to scenarios; 2) model has propensity to take less desired approaches which may violate safety but they do not care. I don’t think (1) is safety issue.

Sanidhya Vijayvargiya @sanidhya903 · Jul 15

1/ AI agents are increasingly being deployed for real-world tasks, but how safe are they in high-stakes settings? 🚨 NEW: OpenAgentSafety - A comprehensive framework for evaluating AI agent safety in realistic scenarios across eight critical risk categories. 🧵

Sanidhya Vijayvargiya @sanidhya903 · Jul 17

@_zifan_wang The malicious content is very obvious, and most such cases result from the agent not bothering to open and read the code file it is executing (the function as described by the malicious actor is very different from what the code actually does).

Sanidhya Vijayvargiya @sanidhya903 · Jul 15

@Aditya_Soni_8 @nlpxuhui @ZhiruoW @nouhadziri @gneubig @MaartenSap 7/ OpenAgentSafety provides the research community with rigorous infrastructure for testing agent safety. The framework is extensible and designed to evolve with increasingly capable AI systems. 📂 Code & data: github.com/Open-Agent-Saf… 📄 Paper: arxiv.org/abs/2507.06134

Sanidhya Vijayvargiya @sanidhya903 · Jul 15

A huge thanks to my amazing co-authors and advisors @Aditya_Soni_8 @nlpxuhui @ZhiruoW @nouhadziri @gneubig @MaartenSap 🙏
