Vivek Kalyan
@vivekkalyansk
reinforcement learner @CoreWeave
Seattle, WA · Joined June 2016
479 Following · 431 Followers
Vivek Kalyan@vivekkalyansk·
@finbarrtimbers @ChinmayKak the assumption is that the SFT models are then used in RL right? which will reward the model on correct answers? i guess the open question you are referring to is whether there is a diversity benefit to training on incorrect answers during SFT that boosts downstream RL?
finbarr@finbarrtimbers·
@ChinmayKak I could argue differently though; why would you train your model to output answers you know are wrong?
finbarr@finbarrtimbers·
An interesting gap in the literature is that the large open weights labs (DeepSeek, Zhipu) do correctness filtering for their SFT data, but there's a bunch of results from smaller labs (OpenThoughts, for one) that claim you should also include incorrect responses in SFT.
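The correctness filtering finbarr describes can be sketched in a few lines. This is a toy illustration, not any lab's actual pipeline: the sample schema and the `extract_answer` convention (final answer after an `Answer:` marker) are assumptions for the example.

```python
# Toy sketch of correctness filtering for SFT data: keep only samples
# whose extracted final answer matches the reference answer.
# The "Answer:" convention and sample schema are illustrative.

def extract_answer(response: str) -> str:
    """Pull the final answer out of a model response (toy convention:
    the text after the last 'Answer:' marker)."""
    marker = "Answer:"
    idx = response.rfind(marker)
    return response[idx + len(marker):].strip() if idx != -1 else ""

def filter_correct(samples: list[dict]) -> list[dict]:
    """Drop SFT samples whose response does not reach the reference answer."""
    return [s for s in samples if extract_answer(s["response"]) == s["answer"]]

samples = [
    {"prompt": "2+2?", "response": "Reasoning... Answer: 4", "answer": "4"},
    {"prompt": "3*3?", "response": "Reasoning... Answer: 6", "answer": "9"},
]
kept = filter_correct(samples)
# only the first sample survives the filter
```

The OpenThoughts-style counterargument would simply skip this filter (or keep a fraction of incorrect responses) to preserve diversity for downstream RL.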
Vivek Kalyan@vivekkalyansk·
@remilouf @obsdmd and you are using CC/Codex sdk with the prompts/schemas that you define in your obsidian agent? is obsidian just a convenient transport layer between you and the agent(s)?
Rémi@remilouf·
@vivekkalyansk @obsdmd It's basically an event loop that listens to changes in the environment, be it file changes, webhooks, etc. and dispatches the event to the appropriate agent. It runs on a VPS, not Obsidian. Obsidian is just the frontend.
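The event-loop-plus-dispatch pattern Rémi describes can be sketched minimally. All names here are illustrative, not his actual runtime: the point is just that events (file changes, webhooks) are routed to whichever agent is registered for that event type.

```python
# Minimal sketch of an event dispatcher of the kind described above:
# listen for events and route each one to the registered handler(s).
# The Dispatcher class and handler signatures are hypothetical.

from collections import defaultdict
from typing import Any, Callable

class Dispatcher:
    def __init__(self) -> None:
        self.handlers: dict[str, list[Callable]] = defaultdict(list)

    def on(self, event_type: str, handler: Callable) -> None:
        """Register a handler (an 'agent') for an event type."""
        self.handlers[event_type].append(handler)

    def dispatch(self, event_type: str, payload: dict) -> list[Any]:
        """Fan the event out to every handler registered for its type."""
        return [handler(payload) for handler in self.handlers[event_type]]

dispatcher = Dispatcher()
dispatcher.on("file_changed", lambda e: f"summarize {e['path']}")
dispatcher.on("webhook", lambda e: f"ingest {e['source']}")

results = dispatcher.dispatch("file_changed", {"path": "notes/daily.md"})
```

In the setup described, the loop runs on a VPS and the Markdown vault is only the frontend, so the dispatcher never needs Obsidian itself to be running.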
Rémi@remilouf·
@obsdmd is hands-down the best interface for personal AI agents:
- Uses plain text files, and models turn out to be RLed to death on CLI tools.
- Great frontend for plain text files, works on mobile.
- Manual data entry. To take notes on the go, of course.
But never discussed: Periodic Notes + Templater + Meta Bind turn Obsidian into a life OS you can update from anywhere.
I pushed this even a little further. The runtime I built uses plain Markdown and YAML front matter for agent definitions. Which means that I can also edit my agents / add new ones from the same vault.
I have been using and tweaking my system for more than a month now, and it's hard to explain how it feels to have a system that works seamlessly in the background, reacts to my environment and my input to put the information I need in front of me before I need it.
Everyone will experience this. But who's going to deliver it? I've tried everything before building my own; no one is anywhere close.
Vivek Kalyan@vivekkalyansk·
my ~decade of maintaining my dotfiles and recently transitioning to nix came in clutch today when my work laptop suddenly wouldn’t boot and i had to get a replacement. i was up and running in an hour
finbarr@finbarrtimbers·
@vivekkalyansk Alas, it’s closer to a full day. We are actively working to get the experimentation cycle down.
finbarr@finbarrtimbers·
For Olmo 3, we moved from a synchronous RL setup to an asynchronous one. This made our code 4x faster in terms of throughput (tokens/second). I wrote about the changes in the paper, but I finally found the time to go deeper on what was involved: finbarr.ca/making-rl-fast/
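The sync-to-async move can be sketched with a producer/consumer queue. This is an illustrative toy, not Olmo 3's actual code: in a synchronous setup the learner waits for every rollout in a batch before stepping, while asynchronously the actors keep generating into a queue and the learner trains on whatever is ready, so neither side idles.

```python
# Toy sketch of asynchronous RL data flow: an actor thread streams
# rollouts into a queue while the learner consumes them as they arrive.
# The actor/learner functions and rollout schema are illustrative.

import queue
import threading

rollouts: queue.Queue = queue.Queue()

def actor(n: int) -> None:
    """Stand-in for the generation side, producing rollouts continuously."""
    for i in range(n):
        rollouts.put({"tokens": [i] * 4, "reward": float(i % 2)})

def learner(steps: int) -> int:
    """Stand-in for the trainer: consumes rollouts as soon as they exist."""
    consumed = 0
    for _ in range(steps):
        batch = rollouts.get()  # blocks only when nothing is ready yet
        consumed += len(batch["tokens"])
    return consumed

producer = threading.Thread(target=actor, args=(8,))
producer.start()
tokens_seen = learner(8)
producer.join()
```

The throughput win comes from overlap: generation (often the slow, long-tail part of RL) no longer gates every optimizer step, at the cost of training on slightly stale, off-policy rollouts.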
elie@eliebakouch·
update: joining @PrimeIntellect 🦋
i'm super excited to join the team. i really admire what they've been building and i love the mission of pushing the frontier in the open.
i'll be working on pre/mid training. there's so much left to figure out, and i truly believe a small group with the right people, resources and focus can do sooo much 🚀
Vivek Kalyan@vivekkalyansk·
@eliebakouch @bclavie @derangineer @art_zucker yeah and i think it's hard to compare models purely based on size. in my benchmarks, the Qwen 3.5 35B A3B is better than the 122B and 397B models. i think their smaller models are just trained with a much higher chinchilla ratio
elie@eliebakouch·
@bclavie @derangineer @art_zucker i've seen a few people citing the qwen3.5 series to argue dense > moe, but one could argue the opposite by looking at how strong qwen3.5 35B A3B is compared to qwen3.5 27B 😂
> "moe are still extremely fragile right now"
also curious, what do you mean here?
Vivek Kalyan@vivekkalyansk·
codex context compaction is a work of art. i asked codex to share some stats from my ongoing thread spanning 3 PRs and multiple hour-long experiments. 3k+ tool calls, 22 compacts, and still going strong. codex has transformed the way i code with agents. previously i had to think hard about which context to include and how to break long complex tasks into smaller steps. now i can just focus on solving problems and let codex drive most of that.
Vivek Kalyan@vivekkalyansk·
@vikhyatk this is not security advice but most password managers can save your 2FA token :)
vik@vikhyatk·
i would rather walk on a bed of nails than have to do 2FA again
Vivek Kalyan@vivekkalyansk·
@jbfja thanks for the clarification! i guess the vague part was how you go from implicit feedback to "distilling them to rewards and calculate how to adjust model weights"
Jacob Jackson@jbfja·
@vivekkalyansk On-policy = the model that generated the response receiving feedback is the same as the model being trained with RL.
Implicit feedback = user feedback, but not something like thumbs up/thumbs down, which would be explicit.
Vivek Kalyan@vivekkalyansk·
super nice report, some interesting things:
- trained with NVFP4 forward and MXFP8 backward. the forward pass uses FP4 to match the inference engine, but the backward pass uses higher precision since it only runs on the training cluster.
- continued pretraining cross-entropy loss is predictive of downstream RL reward: they ran their CPT recipe at three compute levels on Qwen3-Coder, then did identical RL runs on each.
- MTP layers trained via self-distillation against the main LM head; interestingly, they are trained on a checkpoint cut from the middle of the CPT run rather than the end.
- self-summarization trained end-to-end with RL: the model compresses its own trajectory history, and both the agent actions and the summaries receive the final outcome reward. good summaries get upweighted naturally.
- nonlinear concave length penalty that's steeper for easy tasks and flatter for hard ones, so the model learns to be quick on simple requests but can think longer on hard problems.
- MoE router replay with a plausibility filter: if the replayed expert's gating score falls below a threshold derived from the router's own top-k, they swap it for the router's candidate. reduces numerics mismatch.
Cursor@cursor_ai

We're releasing a technical report describing how Composer 2 was trained.

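A concave, difficulty-dependent length penalty of the kind the report describes can be sketched as follows. The functional form here (a square-root curve scaled by difficulty) is purely illustrative, not Cursor's actual formula: the properties it reproduces are that the penalty is concave in length (each extra token costs less than the previous one) and steeper for easy tasks than for hard ones.

```python
# Illustrative concave length penalty: penalty grows sublinearly with
# response length, and a difficulty knob flattens the curve for hard
# tasks. The square-root form and the 0.8 scaling are assumptions.

def length_penalty(length: int, max_len: int, difficulty: float) -> float:
    """Return the penalty subtracted from the outcome reward.

    difficulty in [0, 1]: 0 = easy (steep penalty), 1 = hard (flat).
    Concave in length: sqrt makes marginal cost per token decreasing.
    """
    frac = min(length / max_len, 1.0)
    steepness = 1.0 - 0.8 * difficulty  # easy -> 1.0, hard -> 0.2
    return steepness * frac ** 0.5

# the same response length is punished far more on an easy task
easy = length_penalty(4000, 8000, difficulty=0.0)
hard = length_penalty(4000, 8000, difficulty=1.0)
```

Under such a shaping, a reward like `outcome_reward - length_penalty(...)` pushes the model to answer simple requests quickly while leaving most of the reward intact when it thinks longer on hard problems.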
Vivek Kalyan@vivekkalyansk·
"look through my codex and claude sessions for the week as well as my daily notes to see what i've been working on this week and give me a report organised by project"
days fly by, but it's good to reflect on how much work gets done in a week
Vivek Kalyan@vivekkalyansk·
@natolambert @willccbb @karpathy i just want to say that without the open source research ecosystem - it would have been impossible for me (no US uni/tech job) to learn the things needed and work on what i absolutely love doing rn. ty 🫡
Nathan Lambert@natolambert·
@willccbb @karpathy it's honestly a hard life. I regularly think I'm doing the wrong thing. No playbook for it. Doing our best.
Nathan Lambert@natolambert·
I personally think about this a lot. We all have a huge desire to be at one of the 3 companies at the front edge of AI, but the ecosystem can't work without independent voices guiding and understanding progress. @karpathy is the GOAT at this. It's a different path to impact.
Noam Brown@polynoamial

@saranormous @karpathy @NoPriorsPod Why is he not at a frontier AI lab at the most pivotal time in human history since at least the industrial revolution?
