Diagnosing Where and Why Agentic Systems Break
An initial diagnostic benchmark for systematically constructing tasks and characterizing long-horizon failures in LLM-based agents: measuring performance degradation across increasing intrinsic horizons H* and attributing breakdowns via a 7-category failure taxonomy across 4 domains.
| # | Model | Avg | Web | OS | Database | Embodied | Break Level |
|---|-------|-----|-----|----|----------|----------|-------------|
GPT-5-mini and Claude-4-Sonnet evaluated on all four domains.
Tasks are constructed with systematically increasing H* via two extension methods.
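As a minimal sketch of the two extension methods, assuming a hypothetical `Task` record (the field names here are ours, not the benchmark's): breadth composes an independent subtask onto a task, depth inserts non-skippable steps, and H* is the resulting count of required steps.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Task:
    """Illustrative task record: an ordered tuple of required step identifiers."""
    domain: str
    steps: tuple

    @property
    def horizon(self) -> int:
        # Intrinsic horizon H*: the number of non-skippable steps.
        return len(self.steps)

def extend_breadth(task: Task, subtask: Task) -> Task:
    """Breadth extension: compose a same-domain subtask onto the task."""
    assert task.domain == subtask.domain
    return replace(task, steps=task.steps + subtask.steps)

def extend_depth(task: Task, extra_steps: tuple) -> Task:
    """Depth extension: prepend non-skippable intermediate steps."""
    return replace(task, steps=extra_steps + task.steps)

base = Task("embodied", ("grasp", "release"))
deeper = extend_depth(base, ("home", "move_to_pose"))
print(deeper.horizon)  # 4
```

Either method raises H* monotonically, which is what lets the benchmark sweep horizons while holding the domain fixed.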
Embodied action primitives: `home`, `move_to_pose`, `grasp`, `release`. This domain shows the steepest degradation of all four.

Categories represent orthogonal dimensions of agent behavior, not mutually exclusive classes: a single failed trajectory may exhibit multiple failure types simultaneously. The taxonomy was validated by two domain experts via blind annotation and consensus discussion.
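One illustrative way to check an embodied rollout against its required primitive sequence (a sketch, not the benchmark's actual evaluator): extra actions may interleave, but the required primitives must appear in order.

```python
def follows_plan(trajectory, required):
    """True iff the required primitives appear in `trajectory` in order
    (other actions may be interleaved between them)."""
    it = iter(trajectory)
    # `step in it` consumes the iterator up to the first match,
    # so later required steps must occur after earlier ones.
    return all(step in it for step in required)

REQUIRED = ["home", "move_to_pose", "grasp", "move_to_pose", "release"]
ok = follows_plan(["home", "move_to_pose", "grasp", "move_to_pose", "release"], REQUIRED)
bad = follows_plan(["home", "grasp", "release"], REQUIRED)
print(ok, bad)  # True False
```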
After collecting agent trajectories, each failed run is submitted to an LLM judge prompted with the full 7-category taxonomy. The judge identifies the primary failure mode and provides a natural-language rationale grounded in the trajectory evidence.
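A sketch of that judging step under stated assumptions: the prompt wording, the JSON reply format, and the category names shown are illustrative (the full taxonomy has seven categories; only Planning Error is named in this section).

```python
import json

def build_judge_prompt(taxonomy, trajectory):
    """Assemble the judging prompt: the full taxonomy plus the failed trajectory."""
    cats = "\n".join(f"- {c}" for c in taxonomy)
    return (
        "Classify the primary failure mode of this failed agent trajectory.\n"
        f"Taxonomy:\n{cats}\n"
        f"Trajectory:\n{trajectory}\n"
        'Reply with JSON: {"primary_failure": "<category>", "rationale": "<text>"}'
    )

def parse_judge_reply(reply, taxonomy):
    """Validate the judge's JSON verdict against the taxonomy."""
    verdict = json.loads(reply)
    if verdict["primary_failure"] not in taxonomy:
        raise ValueError("judge returned an out-of-taxonomy category")
    return verdict["primary_failure"], verdict["rationale"]

# Illustrative subset of the seven categories and a canned judge reply.
taxonomy = ["Planning Error", "Tool Misuse"]
cat, why = parse_judge_reply(
    '{"primary_failure": "Planning Error", "rationale": "agent replanned in a loop"}',
    taxonomy,
)
print(cat)  # Planning Error
```

Constraining the verdict to the closed taxonomy is the key design choice: it keeps the judge's output aggregatable across domains.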
Planning Error dominates across all domains. Success (score=1) trajectories are retained for contrast analysis.
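Given per-run records of (H*, score), the degradation measurement reduces to a success rate at each horizon; a minimal sketch, with the record layout being our assumption:

```python
from collections import Counter

def degradation_curve(runs):
    """Success rate at each intrinsic horizon H*.

    `runs` is an iterable of (h_star, score) pairs, score == 1 meaning success.
    """
    totals, wins = Counter(), Counter()
    for h, score in runs:
        totals[h] += 1
        wins[h] += (score == 1)
    return {h: wins[h] / totals[h] for h in sorted(totals)}

runs = [(2, 1), (2, 1), (4, 1), (4, 0), (8, 0), (8, 0)]
curve = degradation_curve(runs)
print(curve)  # {2: 1.0, 4: 0.5, 8: 0.0}
```

The horizon at which this curve crosses a chosen threshold is one natural operationalization of a "break level".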
The Agents of Chaos study documents representative failures from OpenClaw, a real-world email-based agent system. Below we map each incident to our taxonomy, offering preliminary real-world validation for the failure categories identified in our controlled benchmark. Conceptual examples, marked as such in the table, are derived from OpenClaw-style deployment patterns rather than a specific named incident.
| Failure Category | Incident / Example | What Happened | Mechanism |
|------------------|--------------------|---------------|-----------|
Help grow the benchmark by contributing new long-horizon tasks. We collect structured tasks via HuggingFace Datasets; all domains and extension methods welcome.
Submission fields: domain (`web`, `os`, `db`, `embodied`), extension method (breadth: compose subtasks; depth: add non-skippable steps), and a contributor field.

We argue that identifying the breaking point of long-horizon agents is inherently challenging, but can be meaningfully studied through a structured approach. We propose HORIZON, an initial cross-domain diagnostic benchmark that generates tasks with increasing intrinsic horizons and quantifies how agent success degrades, using a seven-category failure taxonomy applied across web navigation, operating systems, databases, and embodied manipulation as a pilot empirical study.
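A contributed record could be validated before upload with a check like the following sketch; the `task_id` and `steps` fields are assumptions, since only the domain, extension-method, and contributor fields are named on this page.

```python
DOMAINS = {"web", "os", "db", "embodied"}
EXTENSION_METHODS = {"breadth", "depth"}
REQUIRED_FIELDS = ("task_id", "domain", "extension_method", "steps", "contributor")

def validate_task(record: dict) -> dict:
    """Check a contributed task against the (assumed) submission schema."""
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if record["domain"] not in DOMAINS:
        raise ValueError(f"unknown domain: {record['domain']}")
    if record["extension_method"] not in EXTENSION_METHODS:
        raise ValueError(f"unknown extension method: {record['extension_method']}")
    if not record["steps"]:
        raise ValueError("a task needs at least one non-skippable step")
    return record

task = validate_task({
    "task_id": "web-0001",
    "domain": "web",
    "extension_method": "depth",
    "steps": ["open_site", "login", "search", "export"],
    "contributor": "alice",
})
print(len(task["steps"]))  # 4
```

A flat dict-per-task layout like this maps directly onto a HuggingFace Datasets row, which is what makes pooled cross-contributor analysis straightforward.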