Hen Sapir

46 posts

@hensapir

cofounder @charcoal_hq | previously eng @fronthq

San Francisco, CA · Joined September 2014
957 Following · 131 Followers
Aman Sanger
Aman Sanger@amanrsanger·
We've evaluated a lot of base models on perplexity-based evals and Kimi k2.5 proved to be the strongest! After that, we do continued pre-training and high-compute RL (a 4x scale-up). The combination of the strong base, CPT and RL, and Fireworks' inference and RL samplers make Composer-2 frontier level. It was a miss to not mention the Kimi base in our blog from the start. We'll fix that for the next model.
Kimi.ai@Kimi_Moonshot

Congrats to the @cursor_ai team on the launch of Composer 2! We are proud to see Kimi-k2.5 provide the foundation. Seeing our model integrated effectively through Cursor's continued pretraining & high-compute RL training is the open model ecosystem we love to support. Note: Cursor accesses Kimi-k2.5 via @FireworksAI_HQ's hosted RL and inference platform as part of an authorized commercial partnership.

English
152
134
2.5K
488.6K
Sean Cai
Sean Cai@SeanZCai·
Running @PrimeIntellect Lab GRPO on a hard-to-verify action-matching task. Judge inconsistency was generating phantom reward variance (same model output, different scores across rollouts), and step 0 kept winning. Fixed it with a stronger judge (better SOTA OAI model, using all my thousands of old OAI hackathon credits) + response caching and got zero phantom variance, clean gradient signal. Anybody know what's the best off-the-shelf judge for semantic action matching in RL training without post-training a purpose-built one? What are people actually shipping with? Is there anybody working on purpose-built judge models for GRPO?
English
5
0
51
5.2K
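The response-caching fix described in the post above can be sketched roughly as follows. This is a minimal illustration, not Sean Cai's setup or the Prime Intellect API: `CachedJudge`, `noisy_judge`, and `group_advantages` are hypothetical names, and the noisy judge is a stand-in for a nondeterministic LLM judge. The idea is that identical (reference, response) pairs always receive the same score, so a group of identical rollouts produces zero reward variance.

```python
import hashlib
import random
from typing import Callable, Dict, List


class CachedJudge:
    """Wraps a judge function so identical inputs always get the same score."""

    def __init__(self, judge_fn: Callable[[str, str], float]) -> None:
        self._judge_fn = judge_fn
        self._cache: Dict[str, float] = {}
        self.calls = 0  # number of times the underlying judge actually ran

    @staticmethod
    def _key(reference: str, response: str) -> str:
        # Hash the pair; the NUL separator avoids ambiguous concatenations.
        payload = reference + "\x00" + response
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def score(self, reference: str, response: str) -> float:
        key = self._key(reference, response)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._judge_fn(reference, response)
        return self._cache[key]


def noisy_judge(reference: str, response: str) -> float:
    """Hypothetical stand-in for an LLM judge that is not deterministic."""
    base = 1.0 if reference.strip().lower() == response.strip().lower() else 0.0
    return base + random.uniform(-0.2, 0.2)


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantages: (reward - group mean) / group std."""
    if max(rewards) - min(rewards) < eps:
        # No real variance in the group, so no gradient signal either way.
        return [0.0] * len(rewards)
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    judge = CachedJudge(noisy_judge)
    rewards = [judge.score("open the door", "open the door") for _ in range(4)]
    print(rewards, group_advantages(rewards))
```

Without the cache, four identical rollouts would get four slightly different scores and the normalization step would turn that noise into nonzero advantages.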
Jeff Weinstein
Jeff Weinstein@jeff_weinstein·
who is building the next great ci startup optimized for agent-generated code? (there are about to be _extreme_ ci needs)
English
37
5
110
21.5K
Hen Sapir
Hen Sapir@hensapir·
@marcklingen @langfuse request to add the ability to set negative tag filters in the trace UI (e.g., tag != X). i can hack it via query params but would be nice to have it in the UI 🙏
English
0
0
0
43
Hen Sapir
Hen Sapir@hensapir·
@jaltma any company whose core value prop is accountability
English
0
0
1
255
Jack Altman
Jack Altman@jaltma·
The current consensus view is saas is dead...presuming that's right, the next interesting question is What companies are "safe from ai"? - handling money, regulation - agents on top of company data - most hardware? - maybe systems of record? - security? - marketplaces?
English
306
30
710
178.8K
Hen Sapir
Hen Sapir@hensapir·
@jarredsumner how would this work with encodings? you can't decode partially-encoded characters that have been cut off by maxLength/offset. and, afaict, all workarounds to that are bad
English
0
0
0
189
Jarred Sumner
Jarred Sumner@jarredsumner·
node:fs readFile needs a maxLength and offset option
English
10
2
206
25.4K
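The encoding concern raised in the reply above can be shown at the byte level. This is a Python illustration of the general UTF-8 problem, not Node's `readFile` (which, per the post, has no such options); `read_window` and `trim_to_utf8_boundaries` are hypothetical helpers, and the trimming logic assumes the underlying data is valid UTF-8.

```python
def read_window(buf: bytes, offset: int, max_length: int) -> bytes:
    """Simulates reading max_length bytes starting at offset."""
    return buf[offset:offset + max_length]


def trim_to_utf8_boundaries(window: bytes) -> bytes:
    """Drops partial characters at both ends of a byte window.

    Assumes the source bytes are valid UTF-8. Continuation bytes look
    like 0b10xxxxxx; lead bytes announce the sequence length.
    """
    # Skip continuation bytes left over from a character cut by `offset`.
    start = 0
    while start < len(window) and (window[start] & 0xC0) == 0x80:
        start += 1

    # Walk back to the last lead byte and check its sequence is complete.
    end = len(window)
    i = end - 1
    while i >= start and (window[i] & 0xC0) == 0x80:
        i -= 1
    if i >= start:
        lead = window[i]
        if lead >= 0xF0:
            need = 4
        elif lead >= 0xE0:
            need = 3
        elif lead >= 0xC0:
            need = 2
        else:
            need = 1
        if end - i < need:
            end = i  # last character was cut by max_length; drop it
    return window[start:end]


if __name__ == "__main__":
    data = "héllo".encode("utf-8")  # the accented letter takes two bytes
    window = read_window(data, 0, 2)  # cuts that letter in half
    try:
        window.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("strict decode fails:", exc)
    print(trim_to_utf8_boundaries(window).decode("utf-8"))
```

The trimmed result decodes cleanly, but the caller now gets fewer bytes than requested and has to track where the next read should really start, which is the kind of trade-off the reply calls out.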
dan
dan@irl_danB·
I have a working version of the call stack context manager as an opencode plugin. It exposes tool calls to the agent to manage the call stack as plugin state. Frames in the call stack are opencode sessions. Compacted ancestor, uncle, and sibling frames are injected into the context. I’ve not run benchmarks and it needs plenty of tire kicking, but opus 4.5 uses it pretty well now. It’s not really built for interactive mode at the moment (it can be), I haven’t optimized it for cache utilization (versions of it can be), and for now it is most useful for certain classes of tasks: primarily long running, like building out a large new project or doing a large refactor. In fact adding this plugin kind of borks your opencode if you’re trying to run short single threaded interactive tasks, because it aggressively uses the call stack model to break apart tasks. It ships without expectation of maintenance or further attention, so don’t build anything on top of it. Normally I wouldn’t release this at all, except several people have indicated interest and willingness to play with a raw version of it. If you want to play with a more polished, actually useful, hopefully benchmarked version, please wait a few more days. Please forgive the slow progress, I’ve got my hands full at the moment with two newborns and two toddlers. I’ll run it on terminal-use and harbour benchmarks once I shore up confidence that it’s worthwhile to spend that money. Buried lede: building this out has made me realize that the opencode session itself is a unique primitive that can probably support more interesting composition. Pair this with the opencode client-server model and I’m thinking of pivoting my currently in-progress inversion of control framework from orchestrating Claude Agents SDK to orchestrating opencode sessions. Exciting stuff ahead, lots you can do with this.
Please reply or dm if you are interested in trying the proof-of-concept version, especially if you’re interested in sending feedback. Wait a bit if you want something more polished and proven.
dan@irl_danB

context window won’t be “solved” as long as attention is quadratic

and presumably Suhail is thinking about the compaction problem as it occurs in long running agents like claude code

but this is downstream from an architectural problem with standard agent implementations (claude code among them) that use a linear “chat-like” history

we all work through coding tasks linearly, but any seasoned software engineer’s mental model of their progress looks more like a call stack: pushing tasks on and popping them off when complete

when the claude code harness organizes the context more like a call stack (think flame graph) than a linear chat log, compaction will not even be necessary in many cases and less lossy in the cases where it is

for the familiar, think: loom

English
15
2
93
9.5K
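The call-stack idea in the two posts above can be sketched in a few lines. This is a toy model under stated assumptions, not dan's plugin or the opencode API: `Frame`, `CallStackContext`, and their methods are hypothetical names, and "compaction" here simply means replacing a finished frame's messages with a one-line summary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Frame:
    task: str
    messages: List[str] = field(default_factory=list)
    summary: Optional[str] = None  # set when the frame is popped
    children: List["Frame"] = field(default_factory=list)
    parent: Optional["Frame"] = field(default=None, repr=False)


class CallStackContext:
    """Organizes agent history as a call stack instead of a linear log."""

    def __init__(self, root_task: str) -> None:
        self.root = Frame(root_task)
        self.current = self.root

    def push(self, task: str) -> Frame:
        frame = Frame(task, parent=self.current)
        self.current.children.append(frame)
        self.current = frame
        return frame

    def log(self, message: str) -> None:
        self.current.messages.append(message)

    def pop(self, summary: str) -> None:
        if self.current.parent is None:
            raise RuntimeError("cannot pop the root frame")
        done = self.current
        done.summary = summary
        done.messages.clear()  # compaction: keep the summary, drop the detail
        self.current = done.parent

    def build_context(self) -> List[str]:
        # Walk from the root down to the current frame.
        chain: List[Frame] = []
        frame: Optional[Frame] = self.current
        while frame is not None:
            chain.append(frame)
            frame = frame.parent
        chain.reverse()

        lines: List[str] = []
        for depth, frame in enumerate(chain):
            indent = "  " * depth
            lines.append(f"{indent}task: {frame.task}")
            if frame is self.current:
                # Only the active frame contributes its full messages.
                lines.extend(f"{indent}  {m}" for m in frame.messages)
            for child in frame.children:
                if child.summary is not None:
                    # Finished siblings and uncles appear as summaries only.
                    lines.append(f"{indent}  done: {child.task} -> {child.summary}")
        return lines


if __name__ == "__main__":
    ctx = CallStackContext("refactor auth")
    ctx.push("rename module")
    ctx.log("edited 12 files")
    ctx.pop("module renamed")
    ctx.push("update tests")
    ctx.log("running pytest")
    print("\n".join(ctx.build_context()))
```

In this sketch the detail of a finished subtask never re-enters the context, so the prompt grows with stack depth rather than with total history length.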
Kyle Corbitt
Kyle Corbitt@corbtt·
@willccbb 10B+ is a pretty low bar. Most actual usage of open models is in the 100B+ range.
English
3
0
19
2.1K
will brown
will brown@willccbb·
registering a prediction that by this time next year, there will be at least 5 serious players in the west releasing great open models

chinese models will still be great, but the gap will be small if not non-existent, and people will mostly use the western ones
English
40
16
419
57.6K
Nikita Bier
Nikita Bier@nikitabier·
We're testing a new link experience, starting on iOS -- to make it easier for your followers to engage with your post while browsing links. For creators, a common complaint is that posts with links tend to get lower reach. This is because the web browser covers the post and people forget to Like or Reply. So X doesn't get a clear signal whether the content is any good. To help get better signal, posts will now collapse to the bottom of the page so people can react while they're reading. As always, remember: the post should stand alone as great content so write a solid caption.
English
1.3K
633
13.8K
6M
Hen Sapir
Hen Sapir@hensapir·
@nikitabier @dinkin_flickaa @misha_mityushk @nicoduc one unexpected annoyance is that, when exiting the browser view, i get redirected to the post detail view (ie the /status/<post_id> page) even if i clicked the link from my home timeline. otherwise, nice work 👏
English
1
0
3
5.1K
Hen Sapir
Hen Sapir@hensapir·
@simonw also lots of examples where the dev time costs are meaningfully lower, ie prompt iteration on frontier models taking longer than RL fine-tuning small oss models for the same task
English
1
0
1
145
Simon Willison
Simon Willison@simonw·
Anyone got a success story they can share about fine-tuning an LLM? I'm looking for examples that produced commercial value beyond what could be achieved by prompting an existing hosted model - or waiting a month for the next generation of hosted models to solve the same problem
English
154
100
1.2K
188.2K
Hen Sapir
Hen Sapir@hensapir·
leggo
Hen Sapir tweet media
English
0
0
1
155
Hen Sapir
Hen Sapir@hensapir·
@growing_daniel @tigran_zzz it’ll probably be a net positive but america “lacks” israel’s mix of existential threat, national cohesion, tiny scale, and integration of army experience into daily life. i think all of the above are required for the outcomes you’re thinking about.
English
0
0
28
5.5K
Daniel
Daniel@growing_daniel·
@tigran_zzz Israelis famously underperforming. Maybe your country just sucked
English
52
1
128
62.9K
Daniel
Daniel@growing_daniel·
Two years of military service for everyone after high school or turning 18 would fix America
English
655
49
1.6K
7.6M
Hen Sapir
Hen Sapir@hensapir·
@sentdefender you already know this but… in the middle east, there’s what you say, what you do, and what you think—and none of them are ever the same.
English
0
0
1
245
OSINTdefender
OSINTdefender@sentdefender·
Despite U.S. President Donald J. Trump stating yesterday that Saudi Arabia was now willing to “Normalize Ties” with Israel without the guarantee of Palestinian Statehood, the Kingdom of Saudi Arabia released a Statement earlier, stating that this was not true, and that there would be No Diplomatic Ties with Israel unless a Palestinian State is established with East Jerusalem as its Capital.
OSINTdefender tweet media
English
122
218
988
175.8K
andrew chen
andrew chen@andrewchen·
dear lazyweb- what's everyone's new modern customer support tool? (no more zendesk!!!) cc @KatiaAmeri
English
49
2
80
36.7K
Sam Soffes
Sam Soffes@soffes·
Looking for a service that gives you an SMS number and goes to a team inbox to use for support. Would love for users to be able to SMS us for help. Know of anything like that? (I know I could make something but would rather pay for a service that does this well.)
English
16
1
21
0