HORIZON — INITIAL DIAGNOSTIC BENCHMARK

The Long-Horizon Task Mirage?

Diagnosing Where and Why Agentic Systems Break

An initial diagnostic benchmark for systematically constructing tasks and characterizing long-horizon failures in LLM-based agents: measuring performance degradation across increasing intrinsic horizons H* and attributing breakdowns via a 7-category failure taxonomy across 4 domains.

4 Domains · 7 Failure Categories · H* Intrinsic Horizon · s Extension Levels
[Figure 3 from the paper: success rate (100% to 0%) vs. extension level s (s=0 to s=8) for Claude-4-Sonnet and GPT-5-mini, with the break region marked.]
Leaderboard
Average success rate across all four domains and extension levels.
3 models · sorted by avg success rate · break level = first s where success drops sharply
Columns: # · Model · Avg ↓ · Web · OS · Database · Embodied · Break Level

GPT-5-mini and Claude-4-Sonnet evaluated on all four domains.

Controlled Horizon Extension

Tasks are constructed with systematically increasing H* via two extension methods.

🌐
Web Navigation
Based on WebArena. Tasks span e-commerce, CMS, GitLab, Reddit, and mapping. Breadth extension composes independent subtasks into composite workflows. Performance collapses earliest in this domain, at very small s.
Breadth Extension · s = 1..4 · WebArena base
💻
Operating System
Based on AgentBench. Shell-level control with file-permission chains. Depth extension inserts non-skippable intermediate states, e.g., enforcing read-only before making a file writable.
Depth Extension · s = 1..8 · AgentBench base
🗄️
Database
Based on MAC-SQL (Wang et al., 2023), a multi-agent Text-to-SQL framework with Selector, Decomposer, and Refiner agents. Evaluated by execution accuracy. Tasks range from simple queries to multi-table joins with nested sub-goals.
Depth Extension · s = 1..8 · MAC-SQL base
🤖
Embodied Manipulation
IsaacSim 5.0 with bimanual Franka Panda arms. Three objects (red, blue, and green cubes) and four primitives: home, move_to_pose, grasp, release. Shows the steepest degradation of all domains.
Breadth Extension · s = 1..5 · IsaacSim 5.0
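The two extension methods above amount to simple horizon arithmetic. A minimal sketch, assuming additive composition for breadth and non-skippable inserted steps for depth; the function names and the per-level step count are illustrative, not from the HORIZON codebase:

```python
def breadth_horizon(subtask_horizons):
    """Breadth extension composes s independent subtasks into one
    composite workflow, so H*(s) is additive across subtasks."""
    return sum(subtask_horizons)

def depth_horizon(h_star_base, s, steps_per_level=1):
    """Depth extension inserts s levels of non-skippable intermediate
    states into a single task, each adding steps_per_level steps."""
    return h_star_base + s * steps_per_level

# A web task composing s=3 subtasks of horizon 4 each:
assert breadth_horizon([4, 4, 4]) == 12
# An OS task with baseline horizon 5 and s=8 inserted permission steps:
assert depth_horizon(5, 8) == 13
```

Either way, s parametrizes a family of tasks whose intrinsic horizon grows in a controlled, measurable fashion.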
7-Category Failure Taxonomy

Categories represent orthogonal dimensions of agent behavior, not mutually exclusive classes. A single failed trajectory may exhibit multiple failure types simultaneously. Validated by two domain experts via blind annotation and consensus discussion.

LLM-as-a-Judge Failure Attribution

After collecting agent trajectories, each failed run is submitted to an LLM judge prompted with the full 7-category taxonomy. The judge identifies the primary failure mode and provides a natural-language rationale grounded in the trajectory evidence.

Pipeline Overview
Three-Stage Process
1
Trajectory Collection
Each agent run produces a full action–observation log. Runs with final score = 0 are marked as failures and forwarded to the judge.
2
LLM Judgment
A judge LLM receives the trajectory plus the 7-category taxonomy definition. It selects the primary failure type and outputs a structured rationale citing the specific step where failure occurs.
3
Validation
A random sample of attributions is reviewed by two domain experts via blind annotation. Disagreements are resolved through consensus discussion.
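Stage 1 of the pipeline is a straightforward filter. A minimal sketch, assuming runs are dicts with a `score` field and a `trajectory` action-observation log (field names are ours, not the HORIZON schema):

```python
def collect_failures(runs):
    """Stage 1: runs with final score == 0 are marked as failures
    and their trajectories are forwarded to the LLM judge."""
    return [r["trajectory"] for r in runs if r["score"] == 0]

runs = [
    {"score": 1, "trajectory": ["click", "type", "submit"]},
    {"score": 0, "trajectory": ["click", "timeout"]},
]
assert collect_failures(runs) == [["click", "timeout"]]
```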
Attribution Results
Failure Distribution (3,357 Trajectories)
Category · Count

Planning Error dominates across all domains. Success (score=1) trajectories are retained for contrast analysis.

Judge Prompt Template
Taxonomy-Grounded Attribution Prompt
You are an expert evaluator of LLM agent failures. Given the trajectory below, identify the PRIMARY failure type from the following 7 categories:

[1] Environment Disturbance
[2] Instruction (Ill-defined / Partial Understanding)
[3] Planning Error
[4] False Assumption
[5] History Error Accumulation
[6] Catastrophic Forgetting
[7] Memory Limitation

Trajectory:
<trajectory>{full_action_observation_log}</trajectory>

Respond with:
failure_type: <one of the 7 categories above>
failure_step: <step number where failure first occurs>
reason: <1–2 sentence explanation citing specific trajectory evidence>
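Because the prompt requests a fixed `key: value` layout, the judge's reply can be parsed mechanically. A minimal sketch; the field names follow the prompt above, everything else (one pair per line, no retry logic) is an assumption:

```python
def parse_judgment(text):
    """Parse the judge's failure_type / failure_step / reason reply
    into a dict, assuming one 'key: value' pair per line."""
    out = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        out[key.strip()] = value.strip()
    out["failure_step"] = int(out["failure_step"])  # cast step to int
    return out

reply = """failure_type: Planning Error
failure_step: 7
reason: The agent re-ordered subgoals at step 7, skipping the login prerequisite."""
j = parse_judgment(reply)
assert j["failure_type"] == "Planning Error"
assert j["failure_step"] == 7
```

A production pipeline would add validation that `failure_type` is one of the seven taxonomy labels before aggregating counts.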
OpenClaw Failure Examples

The Agents of Chaos study documents representative failures from OpenClaw, a real-world email-based agent system. Below we map each incident to our taxonomy, offering preliminary real-world validation for the failure categories identified in our controlled benchmark. Conceptual examples (marked ✦) are derived from OpenClaw-style deployment patterns rather than a specific named incident.

7 categories · mapped from Agents of Chaos (Shapira et al., 2025, arXiv:2602.20021) · ✦ = conceptual illustration
Columns: Failure Category · Incident / Example · What Happened · Mechanism
Contribute Tasks to HORIZON

Help grow the benchmark by contributing new long-horizon tasks. We collect structured tasks via HuggingFace Datasets; all domains and extension methods welcome.

Task Schema
JSON Format
Each task follows a structured JSON schema. Provide at least two extension levels (s=1 and s=2) with corresponding instructions and oracle solutions.
{
  "task_id": "web_042",
  "domain": "web",
  "extension_method": "breadth",
  "description": "Buy the cheapest red item, then post a review",
  "H_star": 1,
  "extension_levels": [
    { "s": 1, "instruction": "...", "oracle": "..." },
    { "s": 2, "instruction": "...", "oracle": "..." }
  ],
  "base_env": "webarena",
  "contributor": "your_username",
  "verified": false
}
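Contributions can be sanity-checked locally before submission. A minimal validator sketch: the required keys and the two-level minimum come from the schema above, while the error messages and function name are ours:

```python
REQUIRED_KEYS = {"task_id", "domain", "extension_method", "description",
                 "H_star", "extension_levels", "base_env", "contributor"}
DOMAINS = {"web", "os", "db", "embodied"}
METHODS = {"breadth", "depth"}

def validate_task(task):
    """Check a task dict against the HORIZON contribution schema:
    required keys, valid domain/method, and at least extension
    levels s=1 and s=2, each with an instruction and an oracle."""
    missing = REQUIRED_KEYS - task.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if task["domain"] not in DOMAINS:
        raise ValueError(f"unknown domain: {task['domain']}")
    if task["extension_method"] not in METHODS:
        raise ValueError(f"unknown method: {task['extension_method']}")
    levels = {lvl["s"] for lvl in task["extension_levels"]}
    if not {1, 2} <= levels:
        raise ValueError("need at least extension levels s=1 and s=2")
    for lvl in task["extension_levels"]:
        if not (lvl.get("instruction") and lvl.get("oracle")):
            raise ValueError(f"level s={lvl['s']} missing instruction/oracle")
    return True
```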
Step 01 – Design
Create a Task
Choose a domain and extension method, then craft a baseline task and its horizon extensions following the HORIZON protocol.
  • domain: one of web, os, db, embodied
  • extension_method: breadth (compose subtasks) or depth (add non-skippable steps)
  • H_star: intrinsic horizon of the s=1 baseline task
  • Each extension level needs a full instruction and an oracle solution
Step 02 – Submit
Upload to HuggingFace
Submit your task JSON to our HuggingFace dataset repository. Open a discussion or pull request with your contribution.
  • Fork or open a discussion on the dataset repo
  • Upload your JSON file following the schema on the left
  • Include your username in the contributor field
  • Verified tasks are reviewed by the HORIZON team before inclusion
Contribute on HuggingFace →
Step 03 – Quality
Task Guidelines
Good tasks have unambiguous oracle solutions and exhibit meaningful performance degradation across extension levels.
  • Baseline (s=1) must be solvable by an optimal policy
  • Depth extension: each added step must be non-skippable by an optimal policy
  • Breadth extension: H*(s) must be additive across composed subtasks
  • Oracle solutions should be verifiable programmatically
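"Verifiable programmatically" can be as simple as comparing against the oracle's output. A sketch for database tasks, where execution accuracy compares result rows; the function name is ours, and unordered multiset comparison is an assumption for queries without ORDER BY:

```python
def execution_match(agent_rows, oracle_rows, ordered=False):
    """Execution accuracy: does the agent's query result match the
    oracle's? Without ORDER BY, row order is unspecified, so the
    default comparison treats results as multisets of rows."""
    if ordered:
        return agent_rows == oracle_rows
    return sorted(map(tuple, agent_rows)) == sorted(map(tuple, oracle_rows))

assert execution_match([(2, "b"), (1, "a")], [(1, "a"), (2, "b")])
assert not execution_match([(1, "a")], [(1, "a"), (2, "b")])
```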
Paper & Citation
Abstract
Position: The Long-Horizon Task Mirage?
Diagnosing Where and Why Agentic Systems Break

We argue that identifying the breaking point of long-horizon agents is inherently challenging, but can be meaningfully studied through a structured approach. We propose HORIZON, an initial cross-domain diagnostic benchmark that generates tasks with increasing intrinsic horizons and quantifies how agent success degrades, using a seven-category failure taxonomy applied across web navigation, operating systems, databases, and embodied manipulation as a pilot empirical study.

📄 arXiv Preprint · 🤗 HuggingFace Dataset ↗
BibTeX
Cite This Work
@inproceedings{horizon2026, }