Harbor Framework
85 posts

You don't need a new IDE. You need a new ISE. Integrated Spec Environment. Spec is the new code. Ship the right spec and your job is basically done.

@xeophon Me before I start looking at harbor rollouts


Today we’re releasing Refactoring, the final leaderboard of our SWE-Atlas suite. This new leaderboard is the ultimate test of an agent's ability to restructure code without breaking the system. Claude Opus 4.7 with Claude Code takes the top spot 🥇


Very excited to announce the v1.0 release of SlopCodeBench:
- Doubling the size of the dataset
- @harborframework support
- scb-check: a CLI that flags slop anti-patterns (toy illustration below)
- Way more model results
scbench.ai github.com/SprocketLab/sl… 🧵
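To make the scb-check idea concrete, here is a minimal sketch of what flagging slop anti-patterns could look like. This is not scb-check and assumes nothing about its actual rules or interface; the two patterns it detects (bare excepts and stubbed-out functions) are examples chosen purely for illustration.

```python
# Toy slop checker; not scb-check. The anti-patterns below are illustrative.
import ast
import sys


def find_slop(source: str) -> list[str]:
    """Return human-readable findings for a couple of common 'slop' patterns."""
    findings = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Bare `except:` clauses that silently swallow every error.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(f"line {node.lineno}: bare except")
        # Functions whose body is nothing but `pass` (stubbed-out placeholders).
        if isinstance(node, ast.FunctionDef) and all(
            isinstance(stmt, ast.Pass) for stmt in node.body
        ):
            findings.append(f"line {node.lineno}: empty function {node.name!r}")
    return findings


if __name__ == "__main__":
    # Usage: python slop_check.py file1.py file2.py ...
    for path in sys.argv[1:]:
        with open(path) as f:
            for finding in find_slop(f.read()):
                print(f"{path}: {finding}")
```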



Introducing FrontierSWE, an ultra-long-horizon coding benchmark. We test agents on some of the hardest technical tasks, like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed.

No major benchmark is designed for COBOL, Fortran, or Assembly - the languages powering trillions in transactions and infrastructure that must be modernized or risk catastrophic failure. We built Legacy-Bench to measure frontier agents on the code the world actually runs on.

Scaling up an overnight eval run:
- a single GAIA level 2 task
- two contestants: pi as baseline, and another that can do fold/peek operations on its context window (rough sketch below)
- 100 attempts each on the same task
Should get a decent picture of failure modes. @harborframework ROCKS
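For the second contestant above, here is a rough sketch of what fold/peek operations on a context window could look like, assuming the agent keeps a plain message transcript it can collapse and re-expand. The class and method names are hypothetical; they are not Harbor's API or the actual agent's.

```python
# Hypothetical fold/peek context operations; names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class FoldableContext:
    """A transcript the agent can shrink by folding old spans into summaries
    and inspect later via peek, without re-expanding them in the window."""
    messages: list[str] = field(default_factory=list)
    folded: dict[int, list[str]] = field(default_factory=dict)

    def append(self, message: str) -> None:
        self.messages.append(message)

    def fold(self, start: int, end: int, summary: str) -> int:
        """Replace messages[start:end] with a one-line summary; stash the
        originals under a handle so they stay recoverable."""
        handle = len(self.folded)
        self.folded[handle] = self.messages[start:end]
        self.messages[start:end] = [f"[folded #{handle}] {summary}"]
        return handle

    def peek(self, handle: int) -> list[str]:
        """Read the messages behind a fold without putting them back into
        the active window."""
        return self.folded[handle]


if __name__ == "__main__":
    ctx = FoldableContext()
    for msg in ["$ ls", "$ cat results.csv", "...long tool output..."]:
        ctx.append(msg)
    handle = ctx.fold(1, 3, "inspected results.csv")
    print(ctx.messages)       # ['$ ls', '[folded #0] inspected results.csv']
    print(ctx.peek(handle))   # the two original messages
```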

Measure how well your agent writes unit tests using SWE-Atlas Test Writing from @scale_AI. SWE-Atlas Test Writing and SWE-Atlas Codebase QnA both ship in the Harbor format and are available on the Harbor registry.

The Harbor registry is getting an upgrade. Now, anyone can publish to the registry to make their dataset available to every Harbor user:

Terminal-Bench 2.0 went from ~25% → 80% in four months and became the standard eval for frontier CLI agents. Now, TB3 is in the works. I talked to @alexgshaw about what happens when model capabilities climb faster than we can measure them. His answer: the benchmark factory (@harborframework), infrastructure to develop hard, representative evals at the pace that the frontier moves. As Alex put it: "we need a thousand times more benchmarks than we have right now."

00:23 - How quickly models hill-climbed TB2
01:46 - What rapid progress reveals about benchmarks vs. real-world capability
03:28 - What made Terminal-Bench stick
04:58 - Why the terminal is the right abstraction for agentic AI
07:14 - How TB2 maintains task quality at scale
09:23 - Managing benchmark integrity in a benchmaxxing world
10:47 - Harbor: from experiment to benchmark factory
12:19 - What Harbor does that nothing else did
14:37 - The invariants: what won't change as agent evals evolve
16:55 - The benchmark Alex most wants to see built
18:18 - The ideal human-in-the-loop task creation flywheel
20:32 - How to contribute to Terminal-Bench 3.0

We started eval-first. Benchmarks are the primary way to get real data before getting users. The principle: we would rather burn tokens than burn customer trust. The @harborframework product and team have been amazing, and we topped Terminal-Bench 1.0 and Terminal-Bench 2.0. We evaluate how well the agent channels the model's power, not the model itself. We improve the harness, not the prompt.


How can we autonomously improve LLM harnesses on problems humans are actively working on? Doing so requires solving a hard, long-horizon credit-assignment problem over all prior code, traces, and scores. Announcing Meta-Harness: a method for optimizing harnesses end-to-end.
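As one reading of what "optimizing harnesses end-to-end" might involve, here is a minimal sketch of the outer loop: propose a harness edit, evaluate it on a benchmark, and assign credit from the score delta. The greedy delta used here is a stand-in for the hard, long-horizon credit assignment the post describes, and none of the names come from Meta-Harness.

```python
# Sketch of a harness-optimization outer loop; placeholder names throughout.
import random
from typing import Callable


def optimize_harness(
    harness: str,
    propose_edit: Callable[[str, list[dict]], str],  # e.g. an LLM given the history
    evaluate: Callable[[str], float],                # e.g. a benchmark run
    iterations: int = 20,
) -> str:
    history: list[dict] = []  # all prior code, traces, and scores
    best_score = evaluate(harness)
    for _ in range(iterations):
        candidate = propose_edit(harness, history)
        score = evaluate(candidate)
        # Greedy credit assignment: score each edit by its delta against the
        # best harness so far; the real problem spans the whole history.
        history.append({"code": candidate, "score": score, "delta": score - best_score})
        if score > best_score:
            harness, best_score = candidate, score
    return harness


if __name__ == "__main__":
    # Toy usage: the "harness" is a string; the benchmark rewards 'x' characters.
    toy_evaluate = lambda h: h.count("x") / 10
    toy_propose = lambda h, hist: h + random.choice("xy")
    print(optimize_harness("", toy_propose, toy_evaluate))
```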
