Vijay V.

1.1K posts

@vijaytarian

Grad student at CMU. I do research on ̶a̶p̶p̶l̶i̶e̶d̶ ̶N̶L̶P̶ ̶ large language models. he/him

Pittsburgh, PA · Joined April 2009
504 Following · 773 Followers
Pinned Tweet
Vijay V.@vijaytarian·
RL with verifiable rewards? Works great ✨ Realistic or non-verifiable tasks? Still a mess 📉 Reward models and AI judges? Fragile and inconsistent 💔 Our proposal? RL from Checklist Feedback 📜 arxiv.org/abs/2507.18624 👇
5 replies · 41 retweets · 234 likes · 23.6K views
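The mechanism behind checklist feedback is easy to sketch: turn the prompt into yes/no criteria, ask a judge model each one, and use the pass rate as the scalar RL reward. A minimal sketch, where `judge` is an assumed yes/no LLM call and the prompt format is illustrative, not the paper's actual code:

```python
from typing import Callable, List

def checklist_reward(
    response: str,
    checklist: List[str],
    judge: Callable[[str], bool],  # hypothetical yes/no LLM judge
) -> float:
    """Reward = fraction of checklist items the response satisfies."""
    verdicts = [
        judge(
            f"Response:\n{response}\n\n"
            f"Does the response satisfy this criterion: {item}? "
            "Answer yes or no."
        )
        for item in checklist
    ]
    return sum(verdicts) / len(verdicts)

# The scalar in [0, 1] can then stand in for a learned reward model
# inside a standard policy-gradient RL loop (e.g., PPO/GRPO).
```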
Vijay V. retweeted
David Duvenaud@DavidDuvenaud·
Announcing Talkie: a new, open-weight historical LLM! We trained and finetuned a 13B model on a newly-curated dataset of only pre-1930 data. Try it below! with @AlecRad and @status_effects 🧵
197 replies · 450 retweets · 3.5K likes · 1.4M views
Vijay V.@vijaytarian·
We trained an 8B model to help coding agents ask users clarifying questions, matching GPT-5 while asking far fewer Q's! We show a concrete playbook for RL in human-AI interaction: use data analysis to find what drives good interactions, then encode it as a structured reward ⬇️🧵
Sanidhya Vijayvargiya@sanidhya903

1/ Humans often can’t state exactly what they want, making things hard for AI agents. Obvious fix: ask clarifying questions. But which ones? We studied this empirically with coding agents. Effective clarification comes down to two properties: answerability and task relevance.

0 replies · 5 retweets · 12 likes · 2K views
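A hedged sketch of how the thread's two properties, answerability and task relevance, might be folded into one structured reward. The `answerable`/`relevant` judge callables and the per-question penalty (to encode "ask far fewer Q's") are illustrative assumptions, not the paper's exact design:

```python
from typing import Callable, List

def clarification_reward(
    questions: List[str],
    task: str,
    answerable: Callable[[str, str], bool],  # can the user answer q for this task?
    relevant: Callable[[str, str], bool],    # does q bear on completing the task?
    per_question_cost: float = 0.1,          # assumed penalty against over-asking
) -> float:
    """Reward good questions, discourage asking many of them."""
    if not questions:
        return 0.0
    good = sum(answerable(q, task) and relevant(q, task) for q in questions)
    return good / len(questions) - per_question_cost * len(questions)
```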
Vijay V. retweeted
Cas (Stephen Casper)@StephenLCasper·
I'm pretty convinced that AI companies perversely use warnings about danger as a hype mechanism.

This week, Anthropic has gotten tons of attention by announcing its Mythos model and talking a lot about its unique risks in the cyber domain. But if it's so dangerous, why exactly did they need to announce this publicly? And why are their employees vagueposting sensational stuff? Meanwhile, it looks like OpenAI is feeling left out, so they are now announcing their own internal progress toward the same type of model.

It reminds me of last year, when Anthropic probably made too big a deal about Claude 4 Opus and did not "rule out the possibility" of uplift on dangerous tasks. OpenAI and Google immediately made the same warnings for their next systems.

I don't fully trust the story being constructed. AI companies seem to be crying wolf and diluting risk discourse in order to get attention. This is not to say that I don't think Mythos is a big deal/concern. It probably is. But it's almost certainly a smaller concern than Anthropic people are selling it to be. Meanwhile, Anthropic is basking in hype/press this week. We are on such a dumb timeline.
19 replies · 3 retweets · 118 likes · 11.1K views
Vijay V. retweeted
Anjali Kantharuban@anjali_ruban·
Most sentence embeddings capture *what* is said, not *how* it is said. We introduce ✨IDIOLEX ✨: a training framework that captures idiolect using only weak supervision (social proximity + linguistic features), not task labels. 📄 Preprint: arxiv.org/abs/2604.04704 🧵 1/8
2 replies · 12 retweets · 43 likes · 9.4K views
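A loose sketch of the weak-supervision idea: treat texts from socially proximate authors as positive pairs and train embeddings with a contrastive objective, no task labels needed. This is illustrative only, not the IDIOLEX training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(
    anchor: torch.Tensor,     # (B, D) embeddings of texts
    positive: torch.Tensor,   # (B, D) embeddings of socially proximate texts
    temperature: float = 0.07,
) -> torch.Tensor:
    """InfoNCE: row i of `positive` is the positive for row i of `anchor`;
    all other rows in the batch act as negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```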
Vijay V. retweeted
Arvind Narayanan@random_walker·
The real sign of AI writing is not superficial stuff like “It’s not X—it’s Y”. It’s the hollowness. Polished writing but relatively mundane ideas. The giveaway is that you’re less impressed when you read it the second time. With good writing, it should be the other way around.

I’m not sure this is inherently about AI. It’s more about the fact that people tend to turn to AI when they don’t have much to say.

Reading text that has the syntactic smell of AI is mildly annoying, but when I read hollow writing I feel the writer is wasting my time, which is much more frustrating.

So don’t do it. People are unlikely to respond to your email or subscribe to your newsletter or whatever you’re trying to get them to do. And they’ll probably remember that you betrayed their trust as a reader.
80 replies · 247 retweets · 2K likes · 438.7K views
Vijay V. retweeted
Harman Singh @ ICLR 🇧🇷@Harman26Singh·
Can LLMs Self-Verify? Much better than you'd expect.

LLMs are increasingly used as parallel reasoners, sampling many solutions at once. Choosing the right answer is the real bottleneck. We show that pairwise self-verification is a powerful primitive.

Introducing V1, a framework that unifies generation and self-verification:
💡 Pairwise self-verification beats pointwise scoring, improving test-time scaling
💡 V1-Infer: Efficient tournament-style ranking that improves self-verification
💡 V1-PairRL: RL training where generation and verification co-evolve for developing better self-verifiers
🧵👇
13 replies · 63 retweets · 383 likes · 90.1K views
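Tournament-style pairwise ranking is simple to sketch: repeatedly pit sampled solutions against each other and keep whichever the verifier prefers. The `prefer` callable below stands in for an assumed pairwise self-verification LLM call; this is a sketch in the spirit of V1-Infer, not the released code:

```python
from typing import Callable, List

def tournament_select(
    candidates: List[str],
    prefer: Callable[[str, str], bool],  # hypothetical pairwise self-verifier
) -> str:
    """Single-elimination tournament over parallel samples."""
    pool = list(candidates)
    while len(pool) > 1:
        winners = [
            pool[i] if prefer(pool[i], pool[i + 1]) else pool[i + 1]
            for i in range(0, len(pool) - 1, 2)
        ]
        if len(pool) % 2 == 1:  # odd candidate advances on a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0]
```

Pairwise comparison asks the model an easier question than pointwise scoring ("which of these two is better?" rather than "what absolute score does this deserve?"), which is one intuition for why it scales better at test time.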
Vijay V.@vijaytarian·
If I'm understanding this tweet (and that's a big if, given how it's written), OpenAI won by trusting the gov't to interpret its terms of service freely. If a user is allowed to decide how the terms of service are interpreted, aren't they meaningless?
Senior Official Jeremy Lewin@UnderSecretaryF

For the avoidance of doubt, the OpenAI - @DeptofWar contract flows from the touchstone of “all lawful use” that DoW has rightfully insisted upon & xAI agreed to. But as Sam explained, it references certain existing legal authorities and includes certain mutually agreed upon safety mechanisms. This, again, is a compromise that Anthropic was offered, and rejected.

Even if the substantive issues are the same, there is a huge difference between (1) memorializing specific safety concerns by reference to particular legal and policy authorities, which are products of our constitutional and political system, and (2) insisting upon a set of prudential constraints subject to the interpretation of a private company and CEO.

As we have been saying, the question is fundamental: who decides these weighty questions? Approach (1), accepted by OAI, references laws and thus appropriately vests those questions in our democratic system. Approach (2) unacceptably vests those questions in a single unaccountable CEO who would usurp sovereign control of our most sensitive systems.

It is a great day for both America’s national security and AI leadership that two of our leading labs, OAI and xAI, have reached the patriotic and correct answer here 🇺🇸

0 replies · 0 retweets · 3 likes · 292 views
Vijay V. retweeted
Gokul Swamy@g_k_swamy·
It took a few years of deep thinking, but I'm super excited to finally share PROSPER: a beautiful, regression-based algorithm for RL from *rubric rewards* that robustly handles the *inconsistent feedback* that LLM judges provide. Let's go Back to Black(well)! 🧵(1/n)
3 replies · 33 retweets · 271 likes · 51.2K views
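The tweet doesn't spell out PROSPER's regression machinery, but the failure mode it targets is concrete: an LLM judge returns different rubric scores for the same response on different calls. The crude baseline such a method presumably improves on is simply averaging repeated judge calls; this is NOT PROSPER, just a variance-reduction strawman for context:

```python
import statistics
from typing import Callable

def mean_rubric_score(
    response: str,
    rubric: str,
    judge: Callable[[str, str], float],  # stochastic LLM judge, score in [0, 1]
    k: int = 8,
) -> float:
    """Average k independent judge calls to damp score noise."""
    return statistics.mean(judge(response, rubric) for _ in range(k))
```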
Vijay V. retweeted
Chayenne Zhao@GenAI_is_real·
During my undergrad, before this vibe-coding era, a detailed argument explanation was an indicator of carefulness, high code quality, and precision. You can't imagine how many hours the code I attached took Vijay, Graham, and me. @gneubig @vijaytarian

Things changed! 😭 Now if anyone gives me a codebase with a detailed argument explanation, it's 100% certain they are vibe coding. These days, argument explanation is an indicator of low code quality.
2 replies · 1 retweet · 53 likes · 4.5K views
Vijay V.@vijaytarian·
⚠️ DeepSeek allegedly used Claude as a rubric grader to get rewards for RL. This is Against The Rules™️ 🏝️ Good news is, if your rubrics are actually of high quality, then you don't need such a big, expensive model to do the grading! See more in arxiv.org/abs/2507.18624
Anthropic@AnthropicAI

Distillation can be legitimate: AI labs use it to create smaller, cheaper models for their customers. But foreign labs that illicitly distill American models can remove safeguards, feeding model capabilities into their own military, intelligence, and surveillance systems.

0 replies · 1 retweet · 33 likes · 11K views
Vijay V.@vijaytarian·
Very cool work on using checklist rewards (aka rubrics) for improving multi-step tool use. Use checklists, not (naive) reward models!
Zhen Zhang@zhenzhangzz

AI agents are evolving beyond simple tasks to complex, multi-turn and multi-step interactions. But how do we train them with RL when verifiable rewards don't exist for open-ended conversations and building execution environments for thousands of tools is unscalable?

Introducing 🛠️CM2: RL with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [arxiv.org/abs/2602.12268]

Core contributions:
🔄 Multi-turn and multi-step tool-use scenario
✅ Checklist Rewards: replaces vague scalar scores with fine-grained, evidence-based binary criteria
🛠️ Scalable Tool Simulation: trains on 5,000+ tools using a hybrid LLM simulator, removing the need for manual API engineering
👍 SOTA Performance: achieves +8-12 point gains on τ^2-Bench, BFCL-V4 & ToolSandbox, surpassing larger open-source models

0 replies · 3 retweets · 22 likes · 2.8K views
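A hedged sketch of what "checklist rewards over a multi-turn trajectory" could look like: binary, evidence-based criteria evaluated against the whole interaction rather than a single response. The `Step` alias and `Criterion` dataclass are illustrative, not CM2's API:

```python
from dataclasses import dataclass
from typing import Callable, List

Step = dict  # one message or tool call in the agent's trajectory

@dataclass
class Criterion:
    description: str                     # e.g. "agent verified the order ID before refunding"
    check: Callable[[List[Step]], bool]  # binary, evidence-based verdict

def trajectory_checklist_reward(steps: List[Step], criteria: List[Criterion]) -> float:
    """Reward = fraction of binary criteria the trajectory satisfies."""
    return sum(c.check(steps) for c in criteria) / len(criteria)
```

Binary per-criterion verdicts grounded in specific trajectory evidence are easier to audit (and harder to reward-hack) than a single vague scalar score, which is the tweet's central claim.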
Vijay V. retweeted
Cameron R. Wolfe, Ph.D.@cwolferesearch·
I've been reading a lot about rubrics-as-rewards (RaR) for RL. Some of my favorite papers (so far):
1. arxiv.org/abs/2507.17746
2. arxiv.org/abs/2508.12790
3. arxiv.org/abs/2510.07743
4. arxiv.org/abs/2511.19399
5. arxiv.org/abs/2507.18624

Most of the added technical complexity of RaR is less related to RL and more related to reward modeling. If we can get a reliable reward signal, RaR works well, but teaching a model to perform granular / instance-level evaluation is tough. Generalizing these evaluation capabilities across arbitrary domains is even tougher (especially those that are highly subjective). Our reward model also needs to avoid hacking in large-scale RL runs.

In my opinion, new developments in this space are likely to come from advancing the frontier of (generative) reward models rather than RL. So much to be done.
12 replies · 60 retweets · 413 likes · 23.3K views
Vijay V. retweeted
Akari Asai@AkariAsai·
Thrilled to share: OpenScholar - our work on scientific deep research agents for reliable literature synthesis - has been accepted to Nature! 🎉 Huge thanks to collaborators across institutions who made this possible!
33 replies · 228 retweets · 1.3K likes · 126.4K views
Vijay V.@vijaytarian·
We never talked about Prompt2Model as an "AI research agent" when we were designing it, but in hindsight that's exactly what it is! One takeaway is that choosing the right level of abstraction for your agent's "action space" might be more important than anything else
Graham Neubig@gneubig

With the talk of automated science and agents training models nowadays, I'd like to highlight one of our older works, *prompt2model*, by @vijaytarian and @GenAI_is_real. We had an automatic pipeline of data acquisition, data transformation, fine-tuning, and evaluation.

0 replies · 3 retweets · 16 likes · 4K views
Vijay V. retweeted
Yiming Xu@yiming_xu_·
🤔 Can we build clustering that understands your specific domain instead of just grouping similar words? Introducing ClusterFusion. By guiding LLMs with embeddings, we achieve +48% accuracy over SOTA on domain-specific data. 🚀 Check out below 👇 arxiv.org/abs/2512.04350
6 replies · 3 retweets · 42 likes · 16.7K views
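The tweet doesn't give the algorithm, but a generic embed-then-cluster pipeline shows the baseline that "guiding LLMs with embeddings" presumably builds on. The model name and libraries below are assumptions for illustration, not the paper's setup (see arxiv.org/abs/2512.04350 for the real method):

```python
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def cluster_texts(texts: list[str], k: int = 5) -> dict[int, list[str]]:
    """Plain embedding-based clustering, with no domain knowledge."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(texts)
    labels = KMeans(n_clusters=k, n_init="auto").fit_predict(embeddings)
    clusters: dict[int, list[str]] = {}
    for text, label in zip(texts, labels):
        clusters.setdefault(int(label), []).append(text)
    return clusters

# A domain-aware step would then prompt an LLM with each cluster's members
# to merge, split, or relabel clusters using domain-specific criteria.
```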
Vijay V.@vijaytarian·
Happening today in Exhibit Hall C, D, E, Poster #103
Vijay V.@vijaytarian

I'm at #NeurIPS2025. I'm very excited about our Spotlight paper on using rubrics to improve RL training for language models in non-verifiable settings. I can't wait to talk about it (Poster Session 4, Thursday afternoon)!

0 replies · 0 retweets · 6 likes · 588 views
Vijay V.@vijaytarian·
I'm hoping to connect with folks this week! A few questions I'd love to chat/debate about:
1. What makes a reward model useful for RL?
2. In the world of verifiers, do we still need RMs?
3. Why does synthetic data work?
4. What are the limits of synthetic data?
DM or email me!
0 replies · 1 retweet · 3 likes · 283 views
Vijay V.@vijaytarian·
I'm at #NeurIPS2025. I'm very excited about our Spotlight paper on using rubrics to improve RL training for language models in non-verifiable settings. I can't wait to talk about it (Poster Session 4, Thursday afternoon)!
Vijay V.@vijaytarian

RL with verifiable rewards? Works great ✨ Realistic or non-verifiable tasks? Still a mess 📉 Reward models and AI judges? Fragile and inconsistent 💔 Our proposal? RL from Checklist Feedback 📜 arxiv.org/abs/2507.18624 👇

1 reply · 2 retweets · 20 likes · 5.9K views
Vijay V. retweeted
Amanda Bertsch@abertsch72·
Can LLMs accurately aggregate information over long, information-dense texts? Not yet… We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!
13 replies · 68 retweets · 356 likes · 80.7K views
Vijay V.@vijaytarian·
This shouldn't be controversial. Science requires sharing with the public. If you're never sharing your research, you're not a research scientist. I don't think you have to share it via peer review, but vague musings on Twitter or on a podcast definitely don't count as science
dr. jack morris@jxmnop

my most controversial opinion is that you shouldn’t trust anyone that calls themself an “AI researcher” but has never gotten a first author paper through peer review

0 replies · 0 retweets · 1 like · 249 views