Researchers at Xiaomi have introduced HarnessX, an innovative framework that automates the optimization of AI agent software scaffolding. The scaffold, or harness, is the operational layer that connects the base language model to the external environment, managing prompts, tools, memory, and control flows. Until now, these configurations were static and hand-crafted, requiring manual rewrites for every change. HarnessX fundamentally changes this approach by treating the harness as a composable object and autonomously applying improvements to its code.
The three challenges of AI harness engineering
An AI agent's harness is critical to its performance, but it suffers from three main limitations. First, it is static and hand-engineered: any change in the underlying model or operational environment requires manual code rewrites. Second, it suffers from architectural entanglement: prompts, tool wrappers, retry policies, and memory management are tightly coupled, so tweaking one component can silently break others. Third, the harness and model are optimized in isolation: execution traces generated during testing are discarded instead of being used to train the model. This creates a bottleneck where teams fail to capture the full value of operational data.
Sponsored Protocol
HarnessX: an autonomous foundry for AI agents
HarnessX solves these problems with a unified harness foundry. The key innovation is treating the harness as a first-class object, separating model configuration from harness configuration. This allows swapping, adapting, and evolving the harness without touching the underlying model. The framework breaks down agent behavior into components such as context assembly, memory management, tool ecosystems, control flow, and observability. Each specific behavior is implemented as a processor that plugs into precise lifecycle hooks of the harness. To automate optimization, HarnessX introduces AEGIS, a trace-driven evolution engine. AEGIS frames harness adaptation as a reinforcement learning problem but must address three pathologies: reward hacking, catastrophic forgetting, and under-exploration. To prevent these, it uses a four-stage pipeline: Digester compresses execution traces into structured summaries; Planner analyzes summaries to explore structural changes; Evolver generates code-level edits and tests them; Critic and gate evaluate edits and reject any regression.
Sponsored Protocol
Co-evolution of harness and model: the real differentiator
What sets HarnessX apart is the co-evolution of harness and model. As the harness adapts, execution traces are converted into reinforcement learning signals for the base model. This happens via cross-harness GRPO (Group Relative Policy Optimization), the same algorithm used to train reasoning models like DeepSeek-R1. When fine-tuning the model, execution trajectories for the same task from different harness versions are pooled. This allows the model to internalize high-level strategic changes, such as using a new API endpoint or managing an execution budget. Similar tools have been introduced by other companies, such as Google with Chrome 149 and Gemini and Alibaba with its predictive models.
Sponsored Protocol
Benchmark results: +44% for smaller models
Tests of HarnessX on five benchmarks — software engineering, multi-turn dialogue, web navigation, open-ended multi-step reasoning, and embodied planning — showed an average improvement of 14.5% across 15 model-benchmark combinations. The open-weight Qwen3.5-9B achieved an impressive +44% on the ALFWorld embodied planning benchmark. Smaller models benefited most from dynamic harness evolution. For example, on the SWE-bench Verified benchmark, Qwen3.5-9B gained +18.2%. Co-evolution added an additional 4.7% average performance boost.
An illustrative example: during the GAIA benchmark, the agent failed because the headless browser used to scrape Wikipedia timed out on the site's heavy JavaScript frontend. HarnessX analyzed the traces, diagnosed the error, and wrote a new tool that bypassed the browser by directly querying the MediaWiki API. Swapping this tool into the harness unlocked the failing tasks. In another test, the agent got stuck in pagination loops during WebShop purchases. HarnessX built a processor that detected repeated navigation actions and injected a warning into the context to force a decision, eliminating the looping behavior.
Sponsored Protocol
Despite limitations — the meta-agent requires powerful models like Claude Opus 4.6 — HarnessX proves that harness engineering is a concrete lever to improve AI agent performance, especially for smaller models. For teams using open-weight models on complex workflows, harness evolution can be a first step before resorting to more expensive models.