Hen Sapir

46 posts

@hensapir

cofounder @charcoal_hq | previously eng @fronthq

San Francisco, CA · Joined September 2014

957 Following · 131 Followers

Hen Sapir@hensapir·20 Mar

@amanrsanger wait, CPT on an instruct model?

458

Aman Sanger@amanrsanger·20 Mar

We've evaluated a lot of base models on perplexity-based evals and Kimi k2.5 proved to be the strongest! After that, we do continued pre-training and high-compute RL (a 4x scale-up). The combination of the strong base, CPT and RL, and Fireworks' inference and RL samplers make Composer-2 frontier level. It was a miss to not mention the Kimi base in our blog from the start. We'll fix that for the next model.

Kimi.ai@Kimi_Moonshot

Congrats to the @cursor_ai team on the launch of Composer 2! We are proud to see Kimi-k2.5 provide the foundation. Seeing our model integrated effectively through Cursor's continued pretraining & high-compute RL training is the open model ecosystem we love to support. Note: Cursor accesses Kimi-k2.5 via @FireworksAI_HQ's hosted RL and inference platform as part of an authorized commercial partnership.

152

134

2.5K

488.6K

Hen Sapir@hensapir·20 Mar

@SeanZCai @PrimeIntellect if you have ground truth samples, use gepa to optimize your judge (rubric, criteria weights, prompt). cc @JoshPurtell who built docs.usesynth.ai/cookbooks/veri…

Sean Cai@SeanZCai·20 Mar

Running @PrimeIntellect Lab GRPO on a hard-to-verify action-matching task. Judge inconsistency was generating phantom reward variance like same model output, different scores across rollouts and step 0 kept winning. Fixed it with a stronger judge (better SOTA OAI model, using all my thousands of old OAI hackathon credits) + response caching and got zero phantom variance, clean gradient signal. Anybody know what's the best off-the-shelf judge for semantic action matching in RL training without post-training a purpose-built one? What are people actually shipping with? Is there anybody working on purpose-built judge models for GRPO?
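The response caching mentioned above can be sketched roughly as follows. This is a minimal illustration, not the setup used in the post: `noisy_judge` is a hypothetical stand-in for a nondeterministic LLM judge, and `CachedJudge` is an assumed helper name. Caching makes repeated scores for an identical (prompt, output) pair consistent; it does not make the judge more accurate.

```python
import hashlib
import random


def noisy_judge(prompt: str, output: str) -> float:
    """Hypothetical stand-in for a nondeterministic LLM judge."""
    return random.random()


class CachedJudge:
    """Wraps a judge so identical (prompt, output) pairs always get one score."""

    def __init__(self, judge):
        self._judge = judge
        self._cache = {}

    def score(self, prompt: str, output: str) -> float:
        # Key on the exact text of both inputs; a NUL byte separates them
        # so that ("ab", "c") and ("a", "bc") never collide.
        key = hashlib.sha256(f"{prompt}\x00{output}".encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._judge(prompt, output)
        return self._cache[key]


judge = CachedJudge(noisy_judge)

# The same model output scored across eight rollouts.
scores = [judge.score("match the action", "click(submit)") for _ in range(8)]
variance_free = len(set(scores)) == 1
```

With the cache in place, differences in reward across rollouts can only come from differences in the outputs themselves.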

5.2K

Hen Sapir@hensapir·27 Feb

@jeff_weinstein @sumeetvtweets

Jeff Weinstein@jeff_weinstein·27 Feb

who is building the next great ci startup optimized for agent-generated code? (there are about to be _extreme_ ci needs)

110

21.5K

Hen Sapir@hensapir·27 Feb

@marcklingen @langfuse request to add the ability to set negative tag filters in the trace UI (e.g., tag != X). i can hack it via query params but would be nice to have it in the UI 🙏

Hen Sapir@hensapir·6 Feb

@jaltma any company whose core value prop is accountability

255

Jack Altman@jaltma·6 Feb

The current consensus view is saas is dead...presuming that's right, the next interesting question is: What companies are "safe from ai"?
- handling money, regulation
- agents on top of company data
- most hardware?
- maybe systems of record?
- security?
- marketplaces?

306

710

178.8K

Hen Sapir@hensapir·5 Feb

@jarredsumner how would this work with encodings? you can't decode partially-encoded characters that have been cut off by maxLength/offset. and, afaict, all workarounds to that are bad
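The concern above can be illustrated with a small sketch. It uses Python rather than Node, and `read_window` is a hypothetical stand-in for a read with `offset` and `maxLength` options, not an existing API. It shows a byte window that cuts a multi-byte UTF-8 character in half, plus two common workarounds with their trade-offs.

```python
import codecs

data = "héllo wörld".encode("utf-8")


def read_window(buf: bytes, offset: int, max_length: int) -> bytes:
    """Hypothetical stand-in for a file read with offset/maxLength options."""
    return buf[offset:offset + max_length]


# "é" is two bytes in UTF-8 (0xC3 0xA9), so a 2-byte window ends mid-character.
window = read_window(data, 0, 2)

# Strict decoding fails on the dangling lead byte.
try:
    window.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Workaround 1: lossy decoding replaces the partial character with U+FFFD.
lossy = window.decode("utf-8", errors="replace")

# Workaround 2: an incremental decoder buffers the incomplete tail, but the
# caller must keep state and feed it the following bytes.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(window)                    # only the complete characters
second = decoder.decode(read_window(data, 2, 1))  # completes the buffered "é"
```

Neither workaround lets a single stateless call return exactly `maxLength` bytes as valid text: the first corrupts the boundary character and the second requires a follow-up read.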

189

Jarred Sumner@jarredsumner·4 Feb

node:fs readFile needs a maxLength and offset option

206

25.4K

Hen Sapir@hensapir·29 Dec

@irl_danB interested!

dan@irl_danB·29 Dec

I have a working version of the call stack context manager as an opencode plugin. It exposes tool calls to the agent to manage the call stack as plugin state. Frames in the call stack are opencode sessions. Compacted ancestor, uncle, and sibling frames are injected into the context.
I’ve not run benchmarks and it needs plenty of tire kicking, but opus 4.5 uses it pretty well now. It’s not really built for interactive mode at the moment (it can be), I haven’t optimized it for cache utilization (versions of it can be), and for now it is most useful for certain classes of tasks: primarily long running, like building out a large new project or doing a large refactor. In fact adding this plugin kind of borks your opencode if you’re trying to run short single threaded interactive tasks, because it aggressively uses the call stack model to break apart tasks.
It ships without expectation of maintenance or further attention, so don’t build anything on top of it. Normally I wouldn’t release this at all, except several people have indicated interest and willingness to play with a raw version of it. If you want to play with a more polished, actually useful, hopefully benchmarked version, please wait a few more days.
Please forgive the slow progress, I’ve got my hands full at the moment with two newborns and two toddlers. I’ll run it on terminal-use and harbour benchmarks once I shore up confidence that it’s worthwhile to spend that money.
Buried lede: building this out has made me realize that the opencode session itself is a unique primitive that can probably support more interesting composition. Pair this with the opencode client-server model and I’m thinking of pivoting my currently in-progress inversion of control framework from orchestrating Claude Agents SDK to orchestrating opencode sessions. Exciting stuff ahead, lots you can do with this.
Please reply or dm if you are interested in trying the proof-of-concept version, especially if you’re interested in sending feedback. Wait a bit if you want something more polished and proven.

dan@irl_danB

context window won’t be “solved” as long as attention is quadratic
and presumably Suhail is thinking about the compaction problem as it occurs in long running agents like claude code
but this is downstream from an architectural problem with standard agent implementations (claude code among them) that use a linear “chat-like” history
we all work through coding tasks linearly, but any seasoned software engineer’s mental model of their progress looks more like a call stack: pushing tasks on and popping them off when complete
when the claude code harness organizes the context more like a call stack (think flame graph) than a linear chat log, compaction will not even be necessary in many cases and less lossy in the cases where it is
for the familiar, think: loom

9.5K

Hen Sapir@hensapir·21 Nov

what a gift 🇺🇸

Ai2@allen_ai

Announcing Olmo 3, a leading fully open LM suite built for reasoning, chat, & tool use, and an open model flow—not just the final weights, but the entire training journey. Best fully open 32B reasoning model & best 32B base model. 🧵

514

Hen Sapir@hensapir·3 Nov

@corbtt @willccbb do you think that will still be true in 1yr? presumably not

Kyle Corbitt@corbtt·2 Nov

@willccbb 10B+ is a pretty low bar. Most actual usage of open models is in the 100B+ range.

2.1K

will brown@willccbb·2 Nov

registering a prediction that by this time next year, there will be at least 5 serious players in the west releasing great open models
chinese models will still be great, but the gap will be small if not non-existent, and people will mostly use the western ones

419

57.6K

Hen Sapir@hensapir·20 Oct

@nikitabier @dinkin_flickaa @misha_mityushk @nicoduc ah okay i see. wasn’t immediately obvious but will get used to it 👌

588

Nikita Bier@nikitabier·20 Oct

@hensapir @dinkin_flickaa @misha_mityushk @nicoduc You can swipe right to get back to timeline

Nikita Bier@nikitabier·19 Oct

We're testing a new link experience, starting on iOS -- to make it easier for your followers to engage with your post while browsing links. For creators, a common complaint is that posts with links tend to get lower reach. This is because the web browser covers the post and people forget to Like or Reply. So X doesn't get a clear signal whether the content is any good. To help get better signal, posts will now collapse to the bottom of the page so people can react while you're reading. As always, remember: the post should stand alone as great content so write a solid caption.

1.3K

633

13.8K

Hen Sapir@hensapir·20 Oct

@nikitabier @dinkin_flickaa @misha_mityushk @nicoduc one unexpected annoyance is that, when exiting the browser view, i get redirected to the post detail view (ie the /status/<post_id> page) even if i clicked the link from my home timeline. otherwise, nice work 👏

5.1K

Nikita Bier@nikitabier·19 Oct

Credit to @dinkin_flickaa and @misha_mityushk for building it and @nicoduc for the designs. It's only version 1, so please share any bugs you find.

138

1.9K

267.2K

Hen Sapir@hensapir·17 Oct

@simonw x.com/cognition/stat… related - presumably this is rl-finetuned on a small oss llm but the blogpost doesn't confirm/deny that hypothesis. cc @cognition

Cognition@cognition

Introducing SWE-grep and SWE-grep-mini: Cognition’s model family for fast agentic search at >2,800 TPS. Surface the right files to your coding agent 20x faster. Now rolling out gradually to Windsurf users via the Fast Context subagent – or try it in our new playground!

161

Hen Sapir@hensapir·17 Oct

@simonw also lots of examples where the dev time costs are meaningfully lower, ie prompt iteration on frontier models taking longer than RL-fine tuning small oss models for the same task

145

Simon Willison@simonw·17 Oct

Anyone got a success story they can share about fine-tuning an LLM? I'm looking for examples that produced commercial value beyond what could be achieved by prompting an existing hosted model - or waiting a month for the next generation of hosted models to solve the same problem

154

100

1.2K

188.2K

Hen Sapir@hensapir·6 Oct

leggo

155

Hen Sapir@hensapir·6 Sep

@growing_daniel @tigran_zzz it’ll probably be a net positive but america “lacks” israel’s mix of existential threat, national cohesion, tiny scale, and integration of army experience into daily life. i think all of the above are required for the outcomes you’re thinking about.

5.5K

Daniel@growing_daniel·6 Sep

@tigran_zzz Israelis famously underperforming Maybe your country just sucked

128

62.9K

Daniel@growing_daniel·6 Sep

Two years of military service for everyone after high school or turning 18 would fix America

655

1.6K

7.6M

Hen Sapir@hensapir·5 Feb

@sentdefender you already know this but… in the middle east, there’s what you say, what you do, and what you think—and none of them are ever the same.

245

OSINTdefender@sentdefender·5 Feb

Despite U.S. President Donald J. Trump stating yesterday that Saudi Arabia was now willing to “Normalize Ties” with Israel without the guarantee of Palestinian Statehood, the Kingdom of Saudi Arabia released a Statement earlier, stating that this was not true, and that there would be No Diplomatic Ties with Israel unless a Palestinian State is established with East Jerusalem as its Capital.

122

218

988

175.8K

Hen Sapir@hensapir·31 Oct

@andrewchen @KatiaAmeri i'm biased but @FrontHQ wins by a landslide front.com

265

andrew chen@andrewchen·31 Oct

dear lazyweb- what's everyone's new modern customer support tool? (no more zendesk!!!) cc @KatiaAmeri

36.7K

Hen Sapir@hensapir·10 Nov

@DD_Geopolitics hamas-massacre.net

DD Geopolitics@DD_Geopolitics·9 Nov

twitter.com/i/spaces/1jMJg…

453

33.1K

Hen Sapir@hensapir·17 Dec

@soffes You can create a Twilio SMS team inbox with @FrontApp! See help.frontapp.com/t/q52442/how-t…

Sam Soffes@soffes·17 Dec

Looking for a service that gives you an SMS number and goes to a team inbox to use for support. Would love for users to be able to SMS us for help. Know of anything like that? (I know I could make something but would rather pay for a service that does this well.)
