HORIZON — INITIAL DIAGNOSTIC BENCHMARK

The Long-Horizon Task Mirage?

Diagnosing Where and Why Agentic Systems Break

An initial diagnostic benchmark for systematically constructing tasks and characterizing long-horizon failures in LLM-based agents: measuring performance degradation across increasing intrinsic horizons H* and attributing breakdowns via a 7-category failure taxonomy across 4 domains.

4 Domains · 7 Failure Categories · H* Intrinsic Horizon · s Extension Levels
[Figure 3 from the paper: success rate (100% to 0%) vs. extension level s (s=0 to s=8) for Claude-4-Sonnet and GPT-5-mini, with the break region marked.]
Leaderboard
Average success rate across all four domains and extension levels.
3 models · sorted by avg success rate · break level = first s where success drops sharply
Columns: # · Model · Avg ↓ · Web · OS · Database · Embodied · Break Level

GPT-5-mini and Claude-4-Sonnet evaluated on all four domains.

Controlled Horizon Extension

Tasks are constructed with systematically increasing H* via two extension methods.

🌐
Web Navigation
Based on WebArena. Tasks span e-commerce, CMS, GitLab, Reddit, and mapping. Breadth extension composes independent subtasks into composite workflows. Performance collapses earliest in this domain, at very small s.
Breadth Extension · s = 1..4 · WebArena base
💻
Operating System
Based on AgentBench. Shell-level control with file-permission chains. Depth extension inserts non-skippable intermediate states, e.g., enforcing read-only before making a file writable.
Depth Extension · s = 1..8 · AgentBench base
🗄️
Database
Based on MAC-SQL (Wang et al., 2023), a multi-agent Text-to-SQL framework with Selector, Decomposer, and Refiner agents. Evaluated by execution accuracy. Tasks range from simple queries to multi-table joins with nested sub-goals.
Depth Extension · s = 1..8 · MAC-SQL base
🤖
Embodied Manipulation
IsaacSim 5.0 with bimanual Franka Panda arms. Three objects (red, blue, and green cubes) and four primitives: home, move_to_pose, grasp, release. Shows the steepest degradation of all domains.
Breadth Extension · s = 1..5 · IsaacSim 5.0
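The two extension methods above amount to simple horizon arithmetic. A minimal sketch, assuming additive composition for breadth and non-skippable inserted steps for depth; the function names and the per-level step count are illustrative, not from the HORIZON codebase:

```python
def breadth_horizon(subtask_horizons):
    """Breadth extension composes s independent subtasks into one
    composite workflow, so H*(s) is additive across subtasks."""
    return sum(subtask_horizons)

def depth_horizon(h_star_base, s, steps_per_level=1):
    """Depth extension inserts s levels of non-skippable intermediate
    states into a single task, each adding steps_per_level steps."""
    return h_star_base + s * steps_per_level

# A web task composing s=3 subtasks of horizon 4 each:
assert breadth_horizon([4, 4, 4]) == 12
# An OS task with baseline horizon 5 and s=8 inserted permission steps:
assert depth_horizon(5, 8) == 13
```

Either way, s parametrizes a family of tasks whose intrinsic horizon grows in a controlled, measurable fashion.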
7-Category Failure Taxonomy

Categories represent orthogonal dimensions of agent behavior, not mutually exclusive classes. A single failed trajectory may exhibit multiple failure types simultaneously. Validated by two domain experts via blind annotation and consensus discussion.

LLM-as-a-Judge Failure Attribution

After collecting agent trajectories, each failed run is submitted to an LLM judge prompted with the full 7-category taxonomy. The judge identifies the primary failure mode and provides a natural-language rationale grounded in the trajectory evidence.

Pipeline Overview
Three-Stage Process
1
Trajectory Collection
Each agent run produces a full action–observation log. Runs with final score = 0 are marked as failures and forwarded to the judge.
2
LLM Judgment
A judge LLM receives the trajectory plus the 7-category taxonomy definition. It selects the primary failure type and outputs a structured rationale citing the specific step where failure occurs.
3
Validation
A random sample of attributions is reviewed by two domain experts via blind annotation. Disagreements are resolved through consensus discussion.
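Stage 1 of the pipeline is a straightforward filter. A minimal sketch, assuming runs are dicts with a `score` field and a `trajectory` action-observation log (field names are ours, not the HORIZON schema):

```python
def collect_failures(runs):
    """Stage 1: runs with final score == 0 are marked as failures
    and their trajectories are forwarded to the LLM judge."""
    return [r["trajectory"] for r in runs if r["score"] == 0]

runs = [
    {"score": 1, "trajectory": ["click", "type", "submit"]},
    {"score": 0, "trajectory": ["click", "timeout"]},
]
assert collect_failures(runs) == [["click", "timeout"]]
```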
Attribution Results
Failure Distribution (3,357 Trajectories)
Category · Count

Planning Error dominates across all domains. Success (score=1) trajectories are retained for contrast analysis.

Judge Prompt Template
Taxonomy-Grounded Attribution Prompt
You are an expert evaluator of LLM agent failures. Given the trajectory below, identify the PRIMARY failure type from the following 7 categories:

[1] Environment Disturbance
[2] Instruction (Ill-defined / Partial Understanding)
[3] Planning Error
[4] False Assumption
[5] History Error Accumulation
[6] Catastrophic Forgetting
[7] Memory Limitation

Trajectory:
<trajectory>{full_action_observation_log}</trajectory>

Respond with:
failure_type: <one of the 7 categories above>
failure_step: <step number where failure first occurs>
reason: <1–2 sentence explanation citing specific trajectory evidence>
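Because the prompt requests a fixed `key: value` layout, the judge's reply can be parsed mechanically. A minimal sketch; the field names follow the prompt above, everything else (one pair per line, no retry logic) is an assumption:

```python
def parse_judgment(text):
    """Parse the judge's failure_type / failure_step / reason reply
    into a dict, assuming one 'key: value' pair per line."""
    out = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        out[key.strip()] = value.strip()
    out["failure_step"] = int(out["failure_step"])  # cast step to int
    return out

reply = """failure_type: Planning Error
failure_step: 7
reason: The agent re-ordered subgoals at step 7, skipping the login prerequisite."""
j = parse_judgment(reply)
assert j["failure_type"] == "Planning Error"
assert j["failure_step"] == 7
```

A production pipeline would add validation that `failure_type` is one of the seven taxonomy labels before aggregating counts.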
OpenClaw Failure Examples

The Agents of Chaos study documents representative failures from OpenClaw, a real-world email-based agent system. Below we map each incident to our taxonomy, offering preliminary real-world validation for the failure categories identified in our controlled benchmark. Conceptual examples (marked ✦) are derived from OpenClaw-style deployment patterns rather than a specific named incident.

7 categories · mapped from Agents of Chaos (Shapira et al., 2025, arXiv:2602.20021) · ✦ = conceptual illustration
Columns: Failure Category · Incident / Example · What Happened · Mechanism
Contribute Tasks to HORIZON

Help grow the benchmark by contributing new long-horizon tasks. We collect structured tasks via HuggingFace Datasets; all domains and extension methods welcome.

Task Schema
JSON Format
Each task follows a structured JSON schema. Provide at least two extension levels (s=1 and s=2) with corresponding instructions and oracle solutions.
{
  "task_id": "web_042",
  "domain": "web",
  "extension_method": "breadth",
  "description": "Buy the cheapest red item, then post a review",
  "H_star": 1,
  "extension_levels": [
    { "s": 1, "instruction": "...", "oracle": "..." },
    { "s": 2, "instruction": "...", "oracle": "..." }
  ],
  "base_env": "webarena",
  "contributor": "your_username",
  "verified": false
}
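Contributions can be sanity-checked locally before submission. A minimal validator sketch: the required keys and the two-level minimum come from the schema above, while the error messages and function name are ours:

```python
REQUIRED_KEYS = {"task_id", "domain", "extension_method", "description",
                 "H_star", "extension_levels", "base_env", "contributor"}
DOMAINS = {"web", "os", "db", "embodied"}
METHODS = {"breadth", "depth"}

def validate_task(task):
    """Check a task dict against the HORIZON contribution schema:
    required keys, valid domain/method, and at least extension
    levels s=1 and s=2, each with an instruction and an oracle."""
    missing = REQUIRED_KEYS - task.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if task["domain"] not in DOMAINS:
        raise ValueError(f"unknown domain: {task['domain']}")
    if task["extension_method"] not in METHODS:
        raise ValueError(f"unknown method: {task['extension_method']}")
    levels = {lvl["s"] for lvl in task["extension_levels"]}
    if not {1, 2} <= levels:
        raise ValueError("need at least extension levels s=1 and s=2")
    for lvl in task["extension_levels"]:
        if not (lvl.get("instruction") and lvl.get("oracle")):
            raise ValueError(f"level s={lvl['s']} missing instruction/oracle")
    return True
```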
Step 01 – Design
Create a Task
Choose a domain and extension method, then craft a baseline task and its horizon extensions following the HORIZON protocol.
  • domain: one of web, os, db, embodied
  • extension_method: breadth (compose subtasks) or depth (add non-skippable steps)
  • H_star: intrinsic horizon of the s=1 baseline task
  • Each extension level needs a full instruction and an oracle solution
Step 02 – Submit
Upload to HuggingFace
Submit your task JSON to our HuggingFace dataset repository. Open a discussion or pull request with your contribution.
  • Fork or open a discussion on the dataset repo
  • Upload your JSON file following the schema on the left
  • Include your username in the contributor field
  • Verified tasks are reviewed by the HORIZON team before inclusion
Contribute on HuggingFace →
Step 03 – Quality
Task Guidelines
Good tasks have unambiguous oracle solutions and exhibit meaningful performance degradation across extension levels.
  • Baseline (s=1) must be solvable by an optimal policy
  • Depth extension: each added step must be non-skippable by an optimal policy
  • Breadth extension: H*(s) must be additive across composed subtasks
  • Oracle solutions should be verifiable programmatically
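"Verifiable programmatically" can be as simple as comparing against the oracle's output. A sketch for database tasks, where execution accuracy compares result rows; the function name is ours, and unordered multiset comparison is an assumption for queries without ORDER BY:

```python
def execution_match(agent_rows, oracle_rows, ordered=False):
    """Execution accuracy: does the agent's query result match the
    oracle's? Without ORDER BY, row order is unspecified, so the
    default comparison treats results as multisets of rows."""
    if ordered:
        return agent_rows == oracle_rows
    return sorted(map(tuple, agent_rows)) == sorted(map(tuple, oracle_rows))

assert execution_match([(2, "b"), (1, "a")], [(1, "a"), (2, "b")])
assert not execution_match([(1, "a")], [(1, "a"), (2, "b")])
```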
Paper & Citation
Abstract
Position: The Long-Horizon Task Mirage?
Diagnosing Where and Why Agentic Systems Break

We argue that identifying the breaking point of long-horizon agents is inherently challenging, but can be meaningfully studied through a structured approach. We propose HORIZON, an initial cross-domain diagnostic benchmark that generates tasks with increasing intrinsic horizons and quantifies how agent success degrades, using a seven-category failure taxonomy applied across web navigation, operating systems, databases, and embodied manipulation as a pilot empirical study.

📄 arXiv Preprint · 🤗 HuggingFace Dataset ↗
BibTeX
Cite This Work
@inproceedings{horizon2026, }