Harbor Framework
@harborframework

85 posts

San Francisco, CA · Joined January 2026
3 Following · 1.1K Followers
Harbor Framework @harborframework
We're releasing support for running verification in a separate sandbox. Tasks pre-configure artifacts to move from the agent sandbox into the verifier sandbox for the grading phase, improving the security boundary between agent and verifier. Blog post below. Happy building!
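A minimal sketch of the artifact hand-off described above, assuming a hypothetical stage_artifacts helper and example paths (not Harbor's actual API or task schema): the task declares which files the verifier needs, and only those cross from the agent sandbox into the verifier sandbox before grading.

```python
# Hypothetical illustration of the idea, not Harbor's actual implementation.
import shutil
from pathlib import Path

def stage_artifacts(agent_dir: Path, verifier_dir: Path, artifacts: list[str]) -> None:
    """Copy only the declared artifact paths into the verifier sandbox."""
    for rel in artifacts:
        src = agent_dir / rel
        dst = verifier_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        if src.is_dir():
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dst)

# Example: a task pre-configures which artifacts move across for grading;
# everything else in the agent sandbox stays invisible to the verifier.
stage_artifacts(
    agent_dir=Path("/sandboxes/agent"),
    verifier_dir=Path("/sandboxes/verifier"),
    artifacts=["solution/report.md", "solution/outputs"],
)
```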
Harbor Framework retweeted
Alex Shaw @alexgshaw
Great write-up by @adithya_s_k about @harborframework. I want to add some thoughts around coding agents = CUA and Harbor coding envs = computer envs.

One of the reasons we built Terminal-Bench was because we saw that the terminal and code were a powerful way for language models to control a computer. We've always viewed TB as a computer-use benchmark.

Coding agents = CUA means measuring coding agents is essentially the same thing as measuring general-purpose agents. This is becoming more obvious with products like Claude Cowork, which is essentially a non-technical interface around Claude Code, and OpenAI's push to make Codex a more general-purpose tool.

We see this on the Harbor side too. Users create coding tasks. But they also create finance, law, accounting, engineering, general computer work, etc. tasks as well. Terminal-Bench 3.0 will cover all of these domains.

The implication is that Harbor becomes a tool for representing and measuring agents' abilities to perform arbitrary computer work, which right now is exactly the scope that users build agents to automate.

In fact, the Harbor Framework (as opposed to the Harbor Format) is just one opinionated way of performing rollouts on Harbor tasks. It works particularly well for agent evals. But there is no reason people can't or shouldn't implement other means of performing rollouts on Harbor tasks (e.g. @PrimeIntellect, @GenReasoning, and @tinkerapi all support some variation of a Harbor rollout). We'll have some releases around this soon.

To summarize: coding agents = CUA, and Harbor's coding environments = computer environments, which means the scope of Harbor is probably broader than you think (as our users will attest!)
Adithya S K @adithya_s_k

x.com/i/article/2054…

Harbor Framework retweeted
poolside @poolsideai
As agents get more clever, so do their attempts at benchmark hacking.

Last Monday, we found one of our RL runs jumped ~20% on SWE-Bench-Pro over a weekend, reaching ~64%, which would make it #1 on the leaderboard. This was clearly benchmark hacking, and we patched the exploit. But this revealed deeper hacks across multiple public benchmarks, some of which were impossible to fix through environment design alone.

Evals need to evolve beyond outcome-based pass rates toward better observability into how the agent arrives at those outcomes.

These were our findings: poolside.ai/blog/through-t…

Examples below 👇 1/
Harbor Framework retweeted
Alex Shaw @alexgshaw
Can agents build off their prior work? Can they continually learn? Answering these questions requires feeding your agent a sequence of tasks, each building off the prior.

Today we're releasing the first major addition to the Harbor task format: multi-step tasks. We've partnered with @GOrlanski to add SlopCodeBench to the Harbor Registry as the first benchmark taking advantage of multi-step tasks.
Gabe Orlanski @GOrlanski

Very excited to announce the v1.0 release of SlopCodeBench:
- Doubling the size of the dataset
- @harborframework support
- scb-check: a CLI that flags slop anti-patterns
- Way more model results
scbench.ai github.com/SprocketLab/sl… 🧵
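A minimal sketch of the multi-step idea from the announcement above, assuming a hypothetical Step type and run_multi_step driver (not the actual Harbor task format): the agent works through an ordered sequence of steps in one persistent workspace, so each step builds on the state the previous one left behind.

```python
# Hypothetical illustration of sequential, dependent task steps.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    instruction: str
    verify: Callable[[str], float]  # grades the workspace, returns a reward

def run_multi_step(agent, workspace: str, steps: list[Step]) -> list[float]:
    rewards = []
    for i, step in enumerate(steps):
        agent.run(step.instruction, cwd=workspace)  # same workspace every step
        rewards.append(step.verify(workspace))      # graded before moving on
        print(f"step {i}: reward={rewards[-1]:.2f}")
    return rewards
```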

Harbor Framework retweeted
General Reasoning @GenReasoning
🎉 Native Harbor support on OpenReward!
🐋 Connect your GitHub repository. We'll build the Docker images for each Harbor task and deploy the environment as an API endpoint.
🚂 Train on the deployed tasks with any RL framework.
⚖️ Evaluate on the deployed tasks with any harness.
Drop the anchor here and get started below: docs.openreward.ai/environments/d…
Harbor Framework retweeted
Alex Shaw @alexgshaw
The Terminal-Bench community discovered multiple instances of cheating and reward hacking on the Terminal-Bench 2.0 leaderboard. We're adding some new policies to keep it reliable:
• ATIF trajectories required for all passing trials
• Reward hacking results in a reward of 0 for the trial
• Cheating results in immediate leaderboard removal
Thanks to @davisbrownr, @adamlsteinl, and @NoCommas for flagging the recent occurrences! Detailed blog post in comments ⬇️
Harbor Framework retweeted
Scale Labs @ScaleAILabs
We refreshed the SWE-Atlas Codebase QnA leaderboard results and transitioned to using @harborframework for all models. We encourage the community to use the official Harbor implementation to run the benchmark. The dataset in Harbor format is now available at github.com/scaleapi/SWE-A…
Alex Shaw @alexgshaw

Measure how well your agent writes unit tests using SWE-Atlas Test Writing from @scale_AI. SWE-Atlas Test Writing and SWE-Atlas Codebase QnA both ship in the Harbor format and are available on the Harbor registry.

Harbor Framework @harborframework
benchtalk with @alexgshaw and @vincentsunnchen - harbor, terminal-bench, and more
vincent sunn chen @vincentsunnchen

Terminal-Bench 2.0 went from ~25% → 80% in four months and became the standard eval for frontier CLI agents. Now, TB3 is in the works.

I talked to @alexgshaw about what happens when model capabilities climb faster than we can measure them. His answer: the benchmark factory (@harborframework), infrastructure to develop hard, representative evals at the pace that the frontier moves. As Alex put it: "we need a thousand times more benchmarks than we have right now."

00:23 - How quickly models hill-climbed TB2
01:46 - What rapid progress reveals about benchmarks vs. real-world capability
03:28 - What made Terminal-Bench stick
04:58 - Why the terminal is the right abstraction for agentic AI
07:14 - How TB2 maintains task quality at scale
09:23 - Managing benchmark integrity in a benchmaxxing world
10:47 - Harbor: from experiment to benchmark factory
12:19 - What Harbor does that nothing else did
14:37 - The invariants: what won't change as agent evals evolve
16:55 - The benchmark Alex most wants to see built
18:18 - The ideal human-in-the-loop task creation flywheel
20:32 - How to contribute to Terminal-Bench 3.0

Harbor Framework @harborframework
"@harborframework product and team have been amazing" <3 Thanks for the love @NoCommas and congrats on the launch!
Monk Zero @NoCommas

We started with eval first. Benchmarks are the primary way to get real data before getting users. The principle: we would rather burn tokens than burn customer trust. @harborframework product and team have been amazing — and we topped Terminal Bench 1.0 and Terminal Bench 2.0. We evaluate how well the agent channels the model's power, not the model itself. We improve the harness, not the prompt.

Harbor Framework @harborframework
Agents building agents! Just put everything into a filesystem (agent code and evaluation results: the trajectories and rewards generated by Harbor) and let the Meta-Harness iterate.
Yoonho Lee @yoonholeee

How can we autonomously improve LLM harnesses on problems humans are actively working on? Doing so requires solving a hard, long-horizon credit-assignment problem over all prior code, traces, and scores. Announcing Meta-Harness: a method for optimizing harnesses end-to-end
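A minimal sketch of the loop described above, assuming hypothetical meta_agent and run_harbor_eval callables (not the actual Meta-Harness implementation): the harness code and the evaluation artifacts Harbor produces (trajectories, rewards) live in one workspace, and an outer agent reads that record, edits the harness, and re-evaluates.

```python
# Hypothetical outer-loop sketch of iterating on a harness from its own eval history.
from pathlib import Path

def meta_iterate(meta_agent, run_harbor_eval, workspace: Path, iterations: int = 5) -> float:
    best = float("-inf")
    for i in range(iterations):
        # 1. The meta-agent reads prior code, trajectories, and rewards in the
        #    workspace, then edits the harness code in place.
        meta_agent.run(
            "Improve the harness in ./harness using the results in ./results",
            cwd=workspace,
        )
        # 2. Re-evaluate the edited harness; write trajectories and rewards back
        #    into the same workspace so the next iteration can see them.
        reward = run_harbor_eval(
            harness_dir=workspace / "harness",
            results_dir=workspace / "results" / f"iter_{i}",
        )
        best = max(best, reward)
        print(f"iteration {i}: mean reward {reward:.3f} (best {best:.3f})")
    return best
```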
