Ben Cohen

569 posts

@blc_16

spends too much time watching football and coding. Prev @Meta, @Microsoft

Joined June 2014
2.6K Following · 457 Followers
Ben Cohen
Ben Cohen@blc_16·
If you thought MIT wasn't going to drop another banger RL method this week, you were wrong. They just released Vector Policy Optimization.
Most RL methods focus on single answer generation. VPO focuses on candidate set generation for search instead. The model is trained to produce a batch of candidates in a single rollout separated by a delimiter. This allows the model to reason across answers within a single rollout and create diversity within a candidate set.
Think about kernel optimization. If you ask a model for 16 possible CUDA kernels, you want real variation. Some candidates should try different tiling choices. Some should use memory differently. Some should make different tradeoffs around speed, stability, or which tensor shapes they handle well. You want to explore the full design space instead of exploring only around a local maximum.
Generating one candidate per rollout and depending on stochastic variation often leads to similar results across rollouts. VPO argues that the RL phase should explore the search space by producing candidates with different strengths. The test-time search loop can then exploit the promising regions with benchmarks, verifiers, or evolutionary search.
VPO does this with two changes. First, the model generates multiple candidates in one rollout, so later candidates can see earlier ones and avoid repeating the same idea. Second, it uses reward vectors instead of a single scalar reward. For code, the reward vector might include tests passed, speed on different input sizes, memory use, numerical stability, and hardware compatibility.
Instead of picking one fixed weighting of those signals, VPO samples many weightings. One weighting may care mostly about speed. Another may care more about memory. Another may favor stability or an edge case. For each weighting, VPO asks which candidate in the set wins. The set gets rewarded when it contains winners across many different weightings, which they call reward-space diversity. The model is trained to keep multiple useful directions alive, so search has something to work with later.
The ablations are important here too. Multi-candidate rollout by itself is not enough. If you still train the set with one scalar reward, the candidates can collapse into similar answers. Random reward weightings by themselves are not enough either. If the model still emits one answer at a time, you are changing the target but not training a useful candidate set. VPO needs both: generate a set, then reward the set for covering different parts of the reward space.
That also explains why it helps less when rewards are colinear. If one kernel is best on correctness, speed, memory, and stability, every weighting picks the same winner. Diversity does not buy much. But hard search problems usually have tradeoffs. The fastest kernel may be brittle. The robust one may be slower. The version that works best on small inputs may lose on large inputs. The weird candidate may be bad now but contain the mutation path that wins later. Search only works if the generator gives it somewhere to search.
There are many interesting areas you could apply this like model training, agent planning, auto-research, and code search. VPO is promising for any problem where the design space is large and has many different tradeoffs. Paper in the comments
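The "which candidate wins under each weighting" idea in the post above can be illustrated with a small Python sketch. This is not the paper's objective or implementation. The function names, the exponential-then-normalize weight sampling, the linear scalarization, and the toy reward vectors are all assumptions made for illustration only.

```python
import random


def sample_weighting(dim, rng):
    """Draw random non-negative weights that sum to 1 (assumed sampling scheme)."""
    raw = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(raw)
    return [r / total for r in raw]


def count_wins(reward_vectors, num_weightings=256, seed=0):
    """For each sampled weighting, find the candidate with the highest
    weighted reward and count how often each candidate wins."""
    rng = random.Random(seed)
    dim = len(reward_vectors[0])
    wins = [0] * len(reward_vectors)
    for _ in range(num_weightings):
        weights = sample_weighting(dim, rng)
        scores = [
            sum(w * r for w, r in zip(weights, rewards))
            for rewards in reward_vectors
        ]
        # Ties go to the earliest candidate in the list.
        wins[scores.index(max(scores))] += 1
    return wins


def distinct_winners(reward_vectors, num_weightings=256, seed=0):
    """Hypothetical stand-in for a set-level diversity signal:
    how many candidates win under at least one sampled weighting."""
    wins = count_wins(reward_vectors, num_weightings, seed)
    return sum(1 for count in wins if count > 0)


# Toy reward vectors: (speed, memory, stability), all made up.
# One candidate is best on every axis, so every weighting picks it.
dominated_set = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [0.2, 0.9, 0.1]]

# Each of the first three candidates is best on one axis; the last
# is mediocre everywhere and never beats a specialist.
tradeoff_set = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.2, 0.2, 0.2],
]
```

Calling `distinct_winners(dominated_set)` and `distinct_winners(tradeoff_set)` shows the contrast the post describes: when one candidate dominates on every reward, all weightings agree on the winner, while a set with real tradeoffs has several candidates that each win under some weighting.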
English
1
0
2
72
Aryan Siddiqui
Aryan Siddiqui@Ar_boian·
It’s astonishing how little the @OpenAI ChatGPT product experience has changed. If they had seriously worked on just memory and proactiveness, their growth and retention would be a lot higher.
English
26
2
97
62.2K
britton winterrose
britton winterrose@Winterrose·
what's the hardest addiction you’ve ever conquered? (generally curious what's most prevalent among my intelligent, mostly successful, and well-to-do friends. results are anon)
English
27
0
11
7.3K
Ben Cohen
Ben Cohen@blc_16·
@scottastevenson @tobi Wouldn't be surprised if the HFT firms are running this already. They love to rug-pull retail any way possible
English
1
0
1
60
Scott Stevenson
Scott Stevenson@scottastevenson·
@blc_16 @tobi Pretty interesting idea. At the end of the day investing is just getting into the next asset class or strategy before the herd does, and trying not to be in the herd. That could be a way to reliably discover the universe of things that are underpriced
English
1
0
1
118
Scott Stevenson
Scott Stevenson@scottastevenson·
Investing is a ruthlessly competitive game where everyone is trying to win money at the expense of others. Passive makes a brilliant sleight of hand, saying: "you can reap the rewards of being in the arena, while being on the sidelines of the arena." This only works if the fighters don't notice you standing there.
This is like playing 1v1v1v1 Smash Bros and hiding on the sidelines while everyone dukes it out. A great strategy until they notice you doing it and kill you.
Passive worked because it was under the radar for 20 years. But now the fighters in the arena see that over half the money they can "win" is sitting on the sidelines. Why would they keep fighting each other for alpha when they can attack the massive crowd of sitting ducks and take all their money?
Passives are like a rock-paper-scissors player telegraphing their move every single turn: "Rock. Rock. Rock." The easiest way for actives to make money is to play "Paper. Paper. Paper." to siphon money from the passives who didn't even realize they were in the arena about to be mugged.
"Paper" is front-running. Actives just have to buy stock with leverage, and sell it back to passives at ever-increasing prices. That is the easiest way to make money in markets right now.
The winning strategy is to keep buying index funds, with one critical difference: break the "never sell" religious dogma and be the first to sell when sentiment shifts and things start to crack. This will start to happen when there is a mass withdrawal event and the 5-to-1 amplification in passive investing pushes to the downside.
English
4
3
17
1.5K
Ben Cohen
Ben Cohen@blc_16·
The more I read about RL, the more OP I think social media will become. Some of these optimizations paired with large scale A/B testing with verifiable rewards (how long you scroll) will go absolutely crazy. Everyone should probably delete social media off their phone or at least cap screen time somehow
English
0
0
1
47
Ben Cohen
Ben Cohen@blc_16·
@Som_Mohapatra Agreed it seems like a major overreaction and hybrid roles are probably the best
English
0
0
1
76
Som Mohapatra
Som Mohapatra@Som_Mohapatra·
I don’t think I’m fully sold that managers should all become ICs, but I think it’s really clear that a manager who can also own ops for their function is absolutely unstoppable. Insanely insanely hard skillset to find though.
English
1
0
9
843
Ben Cohen
Ben Cohen@blc_16·
@signulll Pretraining beats RL every time. The bitter lesson
English
0
0
1
139
signüll
signüll@signulll·
the most interesting outcome of these games is if the enhanced athletes don’t perform at even olympic level. everything else seems pretty priced in. this would effectively mean elite performance is mostly genetic ceiling + decade of matching brain to muscles + recovery infrastructure, & the chemistry itself is a 3-5% if at all. peds might give you more raw material (muscle, recovery, red blood cells, etc) but maybe the software matters way more.
Polymarket@Polymarket

JUST IN: The Enhanced Games are set to debut this weekend in Las Vegas, with athletes allowed to use steroids, testosterone, HGH, & other banned substances.

English
144
16
921
214K
Ben Cohen
Ben Cohen@blc_16·
@rl_env And as a harness through their sdk. Everyone gunning for best agent/harness
English
0
0
0
72
Pocket Jacks Capital
they’ll offer it via api too - frontier intelligence at a fraction of oai/anthropic
English
2
0
7
781
Pocket Jacks Capital
cursor/spacexai shipping composer 3 is going to break a lot of people’s brains
English
3
0
64
5.4K
Ben Cohen
Ben Cohen@blc_16·
@snowmaker Think you need to consider a conversion rate for LOC to useful LOC based on cost of human labor to make the AI output useful
English
0
0
1
431
Dwarkesh Patel
Dwarkesh Patel@dwarkesh_sp·
Currently it is shocking and newsworthy when AIs solve an important open problem that humans couldn't. Before AI totally surpasses us intellectually, there will be an interesting era where it will be just as shocking (but not impossible) for a human to solve a problem AI couldn't
English
88
53
1.2K
88.8K
@levelsio
@levelsio@levelsio·
I don't write code anymore. I haven't written code in, I think, 6 months? I think everyone is like this, no?
justadev@just_adev

@levelsio @TermiusHQ so do you write code? or do you just tell Claude what you want? e.g. you want a new feature or you want a change. do you just prompt, or what? or is Claude mainly the assistant here? my question mostly is how much trust you put in Claude while you are on prod.

English
168
84
2.9K
1.5M
will brown
will brown@willccbb·
verifiers renderers tasksets harnesses codebases worldsims … open superintelligence stack
English
10
7
241
10.7K
ThePrimeagen
ThePrimeagen@ThePrimeagen·
Honestly why stop at 100x engineer? Just use more agents, you literally could be 1000x, 10000x, 100000x just by scaling. You could do what you used to do in an entire year in one second
English
298
157
4.7K
180K
Ben Cohen
Ben Cohen@blc_16·
@ypatil125 Does the more efficient signal make up for extra compute spent on token selection?
English
0
0
1
117
Ben Cohen reposted
Ryan Bahlous-Boldi
Ryan Bahlous-Boldi@RyanBoldi·
Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.
English
34
119
844
200K