VentureBeat
Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

The developers of Terminal-Bench, a benchmark suite for evaluating autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving, and optimizing AI agents in containerized environments. The dual release aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

With a more difficult and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities. Harbor, the accompanying runtime framework, lets developers and researchers scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

"Harbor is the package we wish we had had while making Terminal-Bench," wrote co-creator Alex Shaw on X. "It's for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models."

Higher Bar, Cleaner Data

Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for AI-powered agents operating in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the scenes of the graphical user interface.

However, its broad scope came with inconsistencies. The community identified several tasks as poorly specified or unstable due to external service changes.

Version 2.0 addresses those issues directly. The updated suite includes 89 tasks, each subjected to several hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility. A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

"Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder," Shaw noted on X. "We believe this is because task quality is substantially higher in the new benchmark."

Harbor: Unified Rollouts at Scale

Alongside the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in cloud-deployed containers. Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

Designed to generalize across agent architectures, Harbor supports:

- Evaluation of any container-installable agent
- Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines
- Custom benchmark creation and deployment
- Full integration with Terminal-Bench 2.0

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

Early Results: GPT-5 Leads in Task Success

Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command line interface), a GPT-5-powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far. Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):

1. Codex CLI (GPT-5): 49.6%
2. Codex CLI (GPT-5-Codex): 44.3%
3. OpenHands (GPT-5): 43.8%
4. Terminus 2 (GPT-5-Codex): 43.4%
5. Terminus 2 (Claude Sonnet 4.5): 42.8%

The close clustering among top agents indicates active competition across platforms, with no single agent solving more than half the tasks.

Submission and Use

To test or submit an agent, users install Harbor and run the benchmark with simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with the job directories for validation.

    harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
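As an illustration, a filled-in version of that command might look like the sketch below. The model name, agent name, and output path are placeholders chosen for this example rather than values documented by the Terminal-Bench team; only the flags come from the command shown above.

    # Illustrative only: the model/agent identifiers and output path are
    # placeholders, not verified values. This runs the Terminal-Bench 2.0
    # task set five times (the number of runs required for a leaderboard
    # submission) and keeps the job directories that submitters are asked
    # to share for validation.
    harbor run -d terminal-bench@2.0 \
        -m "gpt-5" \
        -a "terminus-2" \
        --n-attempts 5 \
        --jobs-dir ./tb2-results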
Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

Aiming for Standardization

The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

These tools offer a potential foundation for a unified evaluation stack that supports model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
