Vivek Kalyan
@vivekkalyansk
reinforcement learner @CoreWeave
Seattle, WA · Joined June 2016
479 Following · 431 Followers
Vivek Kalyan@vivekkalyansk·
@finbarrtimbers @ChinmayKak the assumption is that the SFT models are then used in RL right? which will reward the model on correct answers? i guess the open question you are referring to is whether there is a diversity benefit to training on incorrect answers during SFT that boosts downstream RL?
finbarr@finbarrtimbers·
@ChinmayKak I could argue differently though; why would you train your model to output answers you know are wrong?
finbarr@finbarrtimbers·
An interesting gap in the literature is that the large open weights labs (DeepSeek, Zhipu) do correctness filtering for their SFT data, but there's a bunch of results from smaller labs (OpenThoughts, for one) that claim you should also include incorrect responses in SFT.
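The correctness filtering finbarr describes can be sketched in a few lines. This is a toy illustration, not any lab's actual pipeline: the sample schema and the `extract_answer` convention (final answer after an `Answer:` marker) are assumptions for the example.

```python
# Toy sketch of correctness filtering for SFT data: keep only samples
# whose extracted final answer matches the reference answer.
# The "Answer:" convention and sample schema are illustrative.

def extract_answer(response: str) -> str:
    """Pull the final answer out of a model response (toy convention:
    the text after the last 'Answer:' marker)."""
    marker = "Answer:"
    idx = response.rfind(marker)
    return response[idx + len(marker):].strip() if idx != -1 else ""

def filter_correct(samples: list[dict]) -> list[dict]:
    """Drop SFT samples whose response does not reach the reference answer."""
    return [s for s in samples if extract_answer(s["response"]) == s["answer"]]

samples = [
    {"prompt": "2+2?", "response": "Reasoning... Answer: 4", "answer": "4"},
    {"prompt": "3*3?", "response": "Reasoning... Answer: 6", "answer": "9"},
]
kept = filter_correct(samples)
# only the first sample survives the filter
```

The OpenThoughts-style counterargument would simply skip this filter (or keep a fraction of incorrect responses) to preserve diversity for downstream RL.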
Vivek Kalyan@vivekkalyansk·
@remilouf @obsdmd and you are using CC/Codex sdk with the prompts/schemas that you define in your obsidian agent? is obsidian just a convenient transport layer between you and the agent(s)?
Rémi@remilouf·
@vivekkalyansk @obsdmd It's basically an event loop that listens to changes in the environment, be it file changes, webhooks, etc. and dispatches the event to the appropriate agent. It runs on a VPS, not Obsidian. Obsidian is just the frontend.
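The event-loop-plus-dispatch pattern Rémi describes can be sketched minimally. All names here are illustrative, not his actual runtime: the point is just that events (file changes, webhooks) are routed to whichever agent is registered for that event type.

```python
# Minimal sketch of an event dispatcher of the kind described above:
# listen for events and route each one to the registered handler(s).
# The Dispatcher class and handler signatures are hypothetical.

from collections import defaultdict
from typing import Any, Callable

class Dispatcher:
    def __init__(self) -> None:
        self.handlers: dict[str, list[Callable]] = defaultdict(list)

    def on(self, event_type: str, handler: Callable) -> None:
        """Register a handler (an 'agent') for an event type."""
        self.handlers[event_type].append(handler)

    def dispatch(self, event_type: str, payload: dict) -> list[Any]:
        """Fan the event out to every handler registered for its type."""
        return [handler(payload) for handler in self.handlers[event_type]]

dispatcher = Dispatcher()
dispatcher.on("file_changed", lambda e: f"summarize {e['path']}")
dispatcher.on("webhook", lambda e: f"ingest {e['source']}")

results = dispatcher.dispatch("file_changed", {"path": "notes/daily.md"})
```

In the setup described, the loop runs on a VPS and the Markdown vault is only the frontend, so the dispatcher never needs Obsidian itself to be running.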
Rémi@remilouf·
@obsdmd is hands-down the best interface for personal AI agents:
- Uses plain text files, and models turn out to be RLed to death on CLI tools.
- Great frontend for plain text files, works on mobile.
- Manual data entry. To take notes on the go, of course.
But never discussed: Periodic Notes + Templater + Meta Bind turn Obsidian into a life OS you can update from anywhere.
I pushed this even a little further. The runtime I built uses plain Markdown and YAML front matter for agent definitions. Which means that I can also edit my agents / add new ones from the same vault.
I have been using and tweaking my system for more than a month now, and it's hard to explain how it feels to have a system that works seamlessly in the background, reacts to my environment and my input to put the information I need in front of me before I need it.
Everyone will experience this. But who's going to deliver it? I've tried everything before building my own; no one is anywhere close.
Vivek Kalyan@vivekkalyansk·
my ~decade of maintaining my dotfiles and recently transitioning to nix came in clutch today when my work laptop suddenly wouldn’t boot and i had to get a replacement. i was up and running in an hour
finbarr@finbarrtimbers·
@vivekkalyansk Alas, it’s closer to a full day. We are actively working to get the experimentation cycle down.
finbarr@finbarrtimbers·
For Olmo 3, we moved from a synchronous RL setup to an asynchronous one. This made our code 4x faster in terms of throughput (tokens/second). I wrote about the changes in the paper, but I finally found the time to go deeper on what was involved: finbarr.ca/making-rl-fast/
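The sync-to-async move can be sketched with a producer/consumer queue. This is an illustrative toy, not Olmo 3's actual code: in a synchronous setup the learner waits for every rollout in a batch before stepping, while asynchronously the actors keep generating into a queue and the learner trains on whatever is ready, so neither side idles.

```python
# Toy sketch of asynchronous RL data flow: an actor thread streams
# rollouts into a queue while the learner consumes them as they arrive.
# The actor/learner functions and rollout schema are illustrative.

import queue
import threading

rollouts: queue.Queue = queue.Queue()

def actor(n: int) -> None:
    """Stand-in for the generation side, producing rollouts continuously."""
    for i in range(n):
        rollouts.put({"tokens": [i] * 4, "reward": float(i % 2)})

def learner(steps: int) -> int:
    """Stand-in for the trainer: consumes rollouts as soon as they exist."""
    consumed = 0
    for _ in range(steps):
        batch = rollouts.get()  # blocks only when nothing is ready yet
        consumed += len(batch["tokens"])
    return consumed

producer = threading.Thread(target=actor, args=(8,))
producer.start()
tokens_seen = learner(8)
producer.join()
```

The throughput win comes from overlap: generation (often the slow, long-tail part of RL) no longer gates every optimizer step, at the cost of training on slightly stale, off-policy rollouts.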
elie@eliebakouch·
update: joining @PrimeIntellect 🦋
i'm super excited to join the team. i really admire what they've been building and i love the mission of pushing the frontier in the open.
i'll be working on pre/mid training. there's so much left to figure out, and i truly believe a small group with the right people, resources and focus can do sooo much 🚀
Vivek Kalyan@vivekkalyansk·
@eliebakouch @bclavie @derangineer @art_zucker yeah and i think it's hard to compare models purely based on size. in my benchmarks, the Qwen 3.5 35B A3B is better than the 122B and 397B models. i think their smaller models are just trained with a much higher chinchilla ratio
elie@eliebakouch·
@bclavie @derangineer @art_zucker i've seen a few people citing the qwen3.5 series to argue dense > moe, but one could argue the opposite by looking at how strong qwen3.5 35B A3B is compared to qwen3.5 27B 😂
> "moe are still extremely fragile right now"
also curious, what do you mean here?
Vivek Kalyan@vivekkalyansk·
codex context compaction is a work of art. i asked codex to share some stats from my ongoing thread spanning 3 PRs and multiple hour-long experiments. 3k+ tool calls, 22 compacts, and still going strong. codex has transformed the way i code with agents. previously i had to think hard about which context to include and how to break long complex tasks into smaller steps. now i can just focus on solving problems and let codex drive most of that.
Vivek Kalyan@vivekkalyansk·
@vikhyatk this is not security advice but most password managers can save your 2FA token :)
vik@vikhyatk·
i would rather walk on a bed of nails than have to do 2FA again
Vivek Kalyan@vivekkalyansk·
@jbfja thanks for the clarification! i guess the vague part was how you go from implicit feedback to "distilling them to rewards and calculate how to adjust model weights"
Jacob Jackson@jbfja·
@vivekkalyansk On-policy = the model that generated the response receiving feedback is the same as the model being trained with RL.
Implicit feedback = user feedback, but not something like thumbs up/thumbs down, which would be explicit.
Vivek Kalyan@vivekkalyansk·
super nice report, some interesting things:
- trained with NVFP4 forward and MXFP8 backward. the forward pass uses FP4 to match the inference engine, but the backward pass uses higher precision since it only runs on the training cluster.
- continued pretraining cross-entropy loss is predictive of downstream RL reward: they ran their CPT recipe at three compute levels on Qwen3-Coder, then did identical RL runs on each.
- MTP layers trained via self-distillation against the main LM head; interestingly, they are trained on a checkpoint cut from the middle of the CPT run rather than the end.
- self-summarization trained end-to-end with RL: the model compresses its own trajectory history, and both the agent actions and the summaries receive the final outcome reward. good summaries get upweighted naturally.
- nonlinear concave length penalty that's steeper for easy tasks and flatter for hard ones, so the model learns to be quick on simple requests but can think longer on hard problems.
- MoE router replay with a plausibility filter: if the replayed expert's gating score falls below a threshold derived from the router's own top-k, they swap it for the router's candidate. reduces numerics mismatch.
Cursor@cursor_ai

We're releasing a technical report describing how Composer 2 was trained.

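A concave, difficulty-dependent length penalty of the kind the report describes can be sketched as follows. The functional form here (a square-root curve scaled by difficulty) is purely illustrative, not Cursor's actual formula: the properties it reproduces are that the penalty is concave in length (each extra token costs less than the previous one) and steeper for easy tasks than for hard ones.

```python
# Illustrative concave length penalty: penalty grows sublinearly with
# response length, and a difficulty knob flattens the curve for hard
# tasks. The square-root form and the 0.8 scaling are assumptions.

def length_penalty(length: int, max_len: int, difficulty: float) -> float:
    """Return the penalty subtracted from the outcome reward.

    difficulty in [0, 1]: 0 = easy (steep penalty), 1 = hard (flat).
    Concave in length: sqrt makes marginal cost per token decreasing.
    """
    frac = min(length / max_len, 1.0)
    steepness = 1.0 - 0.8 * difficulty  # easy -> 1.0, hard -> 0.2
    return steepness * frac ** 0.5

# the same response length is punished far more on an easy task
easy = length_penalty(4000, 8000, difficulty=0.0)
hard = length_penalty(4000, 8000, difficulty=1.0)
```

Under such a shaping, a reward like `outcome_reward - length_penalty(...)` pushes the model to answer simple requests quickly while leaving most of the reward intact when it thinks longer on hard problems.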
Vivek Kalyan@vivekkalyansk·
"look through my codex and claude sessions for the week as well as my daily notes to see what i've been working on this week and give me a report organised by project"
days fly by, but it's good to reflect on how much work gets done in a week
Vivek Kalyan@vivekkalyansk·
@natolambert @willccbb @karpathy i just want to say that without the open source research ecosystem - it would have been impossible for me (no US uni/tech job) to learn the things needed and work on what i absolutely love doing rn. ty 🫡
Nathan Lambert@natolambert·
@willccbb @karpathy it's honestly a hard life. I regularly think I'm doing the wrong thing. No playbook for it. Doing our best.
Nathan Lambert@natolambert·
I personally think about this a lot. We all have a huge desire to be at one of the 3 companies at the front edge of AI, but the ecosystem can't work without independent voices guiding and understanding progress. @karpathy is the GOAT at this. It's a different path to impact.
Noam Brown@polynoamial

@saranormous @karpathy @NoPriorsPod Why is he not at a frontier AI lab at the most pivotal time in human history since at least the industrial revolution?
