Pinned Tweet
maksim
313 posts

maksim
@ivanovm_
maksim @ agentic labs
fightertown us-east-1 · Joined March 2024
476 Following · 145 Followers

@___4o____ @Austen What is an "institutional lead investor"? A lot of companies just close a party round with a few 6 fig checks

Despite the propaganda an overwhelming majority of YC companies don’t ever find a lead investor for their seed round.
first check $500k-1M pre-seed @ajhodls
most of the YC founders I talked to hadn't come close to their fundraise target by demo day. i suspect this can be extrapolated to the median company in the batch as well.

Yeah, frontier models + automatic prompt and harness optimization via e.g. GEPA gives you a lot of the same hill-climbing without touching the weights
For RLaaS, starting from weaker open models + factoring in inference hosting costs, you need a *massive* improvement to justify the investment for most buyers
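A minimal sketch of the prompt-level hill-climbing this describes (not GEPA's actual API; the `mutate` and `score` callables below are hypothetical stand-ins for a real mutation strategy and eval harness):

```python
import random

def hill_climb_prompt(base_prompt, mutate, score, iterations=20):
    """Greedy prompt-level hill climbing: keep any mutation that
    improves the eval score. No model weights are touched."""
    best_prompt, best_score = base_prompt, score(base_prompt)
    for _ in range(iterations):
        candidate = mutate(best_prompt)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score

# Toy usage: mutate by appending a random instruction; in practice
# `score` would run the candidate prompt through a real eval harness.
random.seed(0)
suffixes = [" Be concise.", " Think step by step.", " Cite sources."]
best, s = hill_climb_prompt(
    "Answer the question.",
    mutate=lambda p: p + random.choice(suffixes),
    score=lambda p: len(set(p.split())),  # stand-in metric
)
```

Real optimizers like GEPA do something far more structured (reflective mutation, Pareto selection), but the weights-untouched hill-climbing loop is the shared core.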

There were a few RL-as-a-service companies that emerged this year (with one recently acquired by DoorDash for a very generous price). Some aspects of custom post-training are exciting, but the economics seem tricky here.
Fundamentally, the RLaaS thesis is that enterprises should post-train models in-house to:
1. Outperform base models on uncommon, domain-specific tasks.
- For DoorDash, this includes optimizing ad relevance and order recommendation models. Enterprises at DoorDash's scale have petabytes of proprietary data, and are evidently willing to spend hundreds of millions for talent alone.
- Tangentially, these complex domain-specific tasks also make for great training data, which is why some of these companies make most of their revenue from data sales, not enterprise RL.
- This is usually only necessary if the model is the bottleneck, which isn't the case for almost all agent deployments. Context/harness engineering go a long way. However, if you have a measurable objective + tight feedback loop (like Decagon), improvements start to become nonlinear and significant.
2. Derisk from frontier labs (long-term).
- Owning your own model insulates you from OpenAI/Anthropic API pricing, which is why Cursor is racing for a SoTA coding model.
- Regulated industries may also prefer local models for data security (especially if training truly becomes commoditized and they can justify the spend).
In reality, as Michael mentions, there are ~zero businesses that are immediately ready for RL. Teams like Applied Compute spend most of their time transforming data and mapping processes (which can spiral into months of work) and even building agents for the enterprises before they can train. This feels slightly distracting—closer to McKinsey than OpenAI—but is also the only way to get AI into businesses today.
But is this a good business? In some ways, it feels too early and hard to commoditize. Michael's write-up seems a bit frustrated. Also, there are a few existential questions here:
- Is a six-figure training run + continual improvement costs worth it for most businesses?
- It's almost as if (beyond compute) hiring is the bottleneck to scale here. Sure, AC is innovating on RL infra, but why won't businesses hire AI talent internally instead of paying AC millions on top of already high training costs? This doesn't seem sticky.
- If Opus 5 is substantially better than Kimi K2.5 (which is already distilled from Opus 4.6 lol) does the fine-tuned model become obsolete? Also, the cost of the training run is amortized across the time period between runs (e.g. Cursor has to "pay off" Composer 1 in the period between launch and the Composer 2 release), which applies additional pressure.
- Or, if they re-train on the next open-source model, will any of the initial post-training investment transfer? Maybe the data doesn't have to be re-processed, but unsure if training is cheap enough for this to be dismissible yet.
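The amortization pressure raised in these questions can be made concrete with toy numbers (all figures below are hypothetical, not from any actual deal):

```python
def amortized_monthly_cost(training_run_cost: float, months_until_obsolete: int) -> float:
    """Spread a one-off post-training run over the window before the next
    frontier or open-source release forces a re-train."""
    return training_run_cost / months_until_obsolete

# Hypothetical: a $500k run that stays competitive for 6 months must
# "pay off" roughly $83k of value per month before it is obsoleted.
monthly = amortized_monthly_cost(500_000, 6)
```

The shorter the gap between frontier releases, the higher the monthly bar the fine-tuned model has to clear.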
Also, part of the reason these are still open questions is because the impact of post-training is difficult to quantify—there's no reliable way to attribute business outcomes to post-training improvements.
- Sure, you can create a benchmark to track model improvement on some tasks.
- But there's still a gap between [progress on an ad-recommendation benchmark] and [direct revenue growth for the business] that is latent and hard to measure, since no single variable in a company's strategy isolates the model's contribution. Realistic benchmarks are still an unsolved problem.
In-house post-training is a no-brainer for some businesses. Maybe DoorDash will benefit, and it's obvious Cursor and Decagon will. But broadly, I'm unsure how large or sticky this market really is.
Michael Chen @michaelzchen5

@katieruthmishra it’s the only modality with clean handoffs between agents and non-technical users. similar to tsla autopilot that asks the driver to take over in risky situations. underpriced
maksim retweeted

I am surprised more VCs aren't talking about this. But if you are a NYC founder with any type of liquidity event happening soon, consider relocating NOW!
CC: @ethdaly @MaxwellAbram @ChanniGreenwall @SandroChess @Bfaviero @evanbfish @jackmmcclelland
jdsupra.com/legalnews/new-…

@JoshPurtell @LakshyAAAgrawal @hensapir @SeanZCai @PrimeIntellect How does it do with continuous rewards? One of the biggest challenges we’re facing is calibrating partial credit

@hensapir @SeanZCai @PrimeIntellect Thanks for the shoutout!
Yeah GEPA works great for bootstrapping verifiers, one of the cleanest apps imo. GEPA on a single prompt or RLM backend is what we use nowadays

Running @PrimeIntellect Lab GRPO on a hard-to-verify action-matching task. Judge inconsistency was generating phantom reward variance (same model output, different scores across rollouts), and step 0 kept winning.
Fixed it with a stronger judge (a better SOTA OAI model, using up my thousands of old OAI hackathon credits) + response caching, and got zero phantom variance and a clean gradient signal.
Does anybody know the best off-the-shelf judge for semantic action matching in RL training, short of post-training a purpose-built one? What are people actually shipping with? Is anybody working on purpose-built judge models for GRPO?
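A minimal sketch of the response-caching fix described above, assuming a hypothetical `judge_fn(output, reference)` callable (e.g. an API call to a strong judge model). Caching on exact output text guarantees identical rollout outputs receive identical scores within a run, which removes the phantom reward variance:

```python
import hashlib

class CachedJudge:
    """Wraps a (possibly nondeterministic) LLM judge so identical
    model outputs always receive identical scores within a run."""

    def __init__(self, judge_fn):
        self.judge_fn = judge_fn  # hypothetical: calls the judge model
        self.cache = {}

    def score(self, output: str, reference: str) -> float:
        # Key on (output, reference) so the same output against a
        # different reference is still judged independently.
        key = hashlib.sha256(f"{output}\x00{reference}".encode()).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.judge_fn(output, reference)
        return self.cache[key]
```

Note this only removes *within-run* variance for repeated outputs; it does nothing about the judge scoring semantically equivalent but textually different outputs inconsistently, which is where the stronger judge comes in.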
maksim retweeted

Healthcare software was designed for humans. Multi-step, nuanced workflows: prior auth submissions, EHR note creation, eligibility verification. The kind of work that can't be reduced to an API call.
That's what AI agents in healthcare are being asked to automate. And the infrastructure to do it reliably doesn't exist off the shelf.
We build it: A coding agent to generate automation scripts, fully managed infrastructure to run them at scale, and a maintenance agent to keep them working as portals and EHRs change.
Today, we're announcing our $5M seed round, backed by Floating Point, @MeridianStCap, Twine Ventures, @refractvc and angels like @zacharylipton (CTO, Abridge) and @dps (fmr. CTO, Stripe).
If you're building AI agents that need to operate payer portals or EHRs, we'd love to talk. And we're hiring!


gundo should host live-fire war games for uas/cuas companies to compete in
cc @jakobdiepen

@cxgonzalez It’s a 2-mile-wide choke point, the ships passing through are giant bombs, and Iran can launch cruise missiles cheaper than a Honda Civic for 2k miles in any direction. They can be launched off a truck and manufactured in a shed. They really figured out asymmetric warfare

@skeptrune @RhysSullivan I miss tab, it was really good for flow state. Now the options are:
1. stare at the agent while it works, feels like a waste of time
2. go turbo-adhd with multiple agents doing different tasks
3. scroll x or go for a walk
deep focus is an unsolved problem in this age

@RhysSullivan those cursor tabs keys are going to be like the vhs tapes of vibecoding

RL envs are a subset of useful data, and they generate training tuples (state, action, reward, next state). But they only work when the world can be simulated.
Most high-value domains (healthcare, enterprise workflows, multimodal reasoning) can’t be faithfully simulated, so models still need real-world datasets and evaluation benchmarks.
RL envs are also mainly for post-training and sit a layer above in abstraction, whereas mid-/pre-training require other real-world data and domain adaptation.
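To make the tuple structure concrete, here's a toy sketch of an environment emitting (state, action, reward, next_state) transitions under a random policy: a generic gym-style rollout loop on an invented 1-D chain world, not any specific vendor's environment:

```python
import random
from typing import List, Tuple

def collect_transitions(n_steps: int = 10, seed: int = 0) -> List[Tuple[int, int, float, int]]:
    """Roll out a random policy in a trivial 1-D chain environment
    (states 0..5, goal at 5) and record (state, action, reward,
    next_state) training tuples."""
    rng = random.Random(seed)
    state = 0
    transitions = []
    for _ in range(n_steps):
        action = rng.choice([-1, +1])               # random policy
        next_state = max(0, min(5, state + action))  # clamp to the chain
        reward = 1.0 if next_state == 5 else 0.0     # sparse goal reward
        transitions.append((state, action, reward, next_state))
        state = 0 if next_state == 5 else next_state  # reset on goal
    return transitions
```

The point of the tweet is exactly that this loop only exists when `next_state` and `reward` can be computed programmatically; for a prior-auth workflow or a patient chart there is no faithful transition function to call.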

Ohhh good point!!
Since RL mostly works now, RL data might be a thing people wanna sell.
Is anyone doing this? Selling RL envs? Is there even a single company doing this?
Bobby Samuels @BobbySamuels

We’re renaming the YC spring batches from X25 and (what was going to be) X26 to P25 and P26 — P for Primavera, which literally means “first spring” in Latin-derived languages.
The original X was a cute programmer in-joke, but people kept asking “what does X stand for?”, so we’re switching to something that actually says “spring” while still keeping it to a single letter.

@skeptrune @beaversteever the real rag was the friends we made along the way
