Hierarchical Experimentalist Agents

LLM agents that learn like scientists — actively experiment, evolve reusable skills, and transfer them across tasks in-context

University of Massachusetts Amherst
Preprint
HExA framework overview: ReAct vs. HExA meta-episode vs. cross-level transfer
Figure 1. Overview of HExA on InterPhyre physics puzzles. (a) ReAct solves each seed independently with no cross-seed learning. (b) HExA groups seeds into a meta-episode; after each seed an evolver distils the trajectory into a skill bank (physics principles + common mistakes), and the evolved bank is injected into subsequent seeds' system prompts (25 iterations over 50 seeds per source level). (c) Cross-level transfer: banks from easier source levels (down_to_earth, two_body_problem, pass_the_parcel) are synthesised into a single cross-level bank for a harder target level (catapult), with no target-level trajectories required.
+75 pp
Catapult gain over the no-tool Direct baseline (2% → 77%) with Claude Sonnet 4.6
+36 pp
Zero-shot cross-level transfer to catapult — no target-level experiments
0% → 54%
GPT-OSS-120B on catapult — gains extend to open-weight models
~37%
Fewer iterations per episode vs. ReAct / Reflexion

Abstract

Large language models (LLMs) are increasingly used to take action or augment human decision-making, yet most systems rely on parametric knowledge acquired by imitation, optionally supplemented with fixed data, retrieval, or search. However, this paradigm breaks down in novel domains and on sophisticated queries that cannot be answered from prior knowledge alone: knowing the laws of physics, for instance, does not by itself enable LLMs to answer queries about or complete long-horizon tasks in a complex physical system without explicit interactions in a physics simulator. Thus to solve such novel problems generally, agents should have the fundamental ability of active experimentation—to explore and gather targeted query-specific data or general principles about the unseen environments and to acquire new reusable skills by learning from these diverse interactions and experiences.

We thus introduce Hierarchical Experimentalist Agents (HExA), a novel in-context, experiment-centric self-improvement framework that (1) iteratively designs and refines query-relevant experiments; (2) incrementally learns from experience a library of reusable and composable skills that accelerate experimentation within and across tasks; and (3) integrates the experimental data to take actions or answer queries effectively. Being entirely in-context and training-free, HExA can be used with any model, including black-box frontier models. It lets them self-improve while co-evolving skills using progress and efficiency rewards, interaction feedback from correct, incorrect, and partial rollouts, and experimental interventions, all without any reliance on offline data, oracle supervision, or external teacher guidance.

We also introduce InterPhyre, a 2D procedural physics simulation and embodied reasoning environment with additional tool-call and intervention APIs that let agents propose experiments and test hypotheses, designed to measure how well agents learn from experimentation. On the hardest levels of InterPhyre, frontier models such as Claude Sonnet 4.6 achieve only a 2% success rate, while HExA's experiment-centric learning improves the same model to a maximum success rate of 77%. We also show similar improvements across all experiments with smaller open-weight models and over other agentic baselines such as ReAct and Reflexion. The agent also achieves 44% success without any active experimentation on the target level, using only skills learned on and transferred from easier levels, which demonstrates reusability and generalization. Our experiments show that current LLM agents still struggle in these settings, leaving substantial room for improvement. HExA is a step toward scalable learning mechanisms that let agents learn through active experimentation and interaction while acquiring evolving skills, which remains a core challenge for building generalist agents.

The InterPhyre Environment

InterPhyre is a 2D procedural physics benchmark built so that solving a task requires interaction, not recall. In every level the agent places a red ball at (x, y, r); the simulator then runs forward and checks a level-specific success predicate. The catch: outcomes depend on contact chains, lever mechanics, and collision timing that cannot be read off a static scene description — the agent must run experiments to discover them. The curriculum ships 25 levels × 10,000 pre-validated seeds = 250,000 certified task instances, plus a snapshot/restore intervention API for paired counterfactual rollouts.

Level gallery

catapult level being solved
catapult Tier 2 · hard Impulse transfer: drop the ball on a pivoting arm to launch a green ball over a ceiling-blocker into a basket. Sonnet ReAct baseline: 8%
pass_the_parcel level being solved
pass_the_parcel Tier 2 · hard Discover a ramp-and-basket mechanism rather than a direct collision to route the ball home. Sonnet ReAct baseline: 24%
down_to_earth level being solved
down_to_earth Tier 1 Gravity-driven drop past an obstruction. Sonnet ReAct baseline: 100%
two_body_problem level being solved
two_body_problem Tier 1 Coordinate two interacting bodies via momentum transfer. Sonnet ReAct baseline: 100%
falling_into_place level being solved
falling_into_place Tier 1 Timed interception: cross a platform edge to meet a falling jar. Sonnet ReAct baseline: 100%
tipping_point level being solved
tipping_point Tier 1 Unstable equilibrium — nudge a balanced structure the right way. Sonnet ReAct baseline: 96%
basket_case level being solved
basket_case Tier 1 Container avoidance: deflect the green ball away from a basket. Sonnet ReAct baseline: 100%
cliffhanger level being solved
cliffhanger Tier 1 Ledge/edge dynamics under gravity. Sonnet ReAct baseline: 100%

The tool-call & intervention API

Each level exposes a structured tool surface the agent invokes directly, so that hypothesis testing is explicit and measurable. The catapult level, for example, exposes four complementary analysis tools plus a finish() call that commits the placement:

describe_scene_geometry()
    # Strategy-neutral inventory: balls (pos, radius, dynamic),
    # bars (centre, angle, length), baskets, key pairwise distances,
    # and the success condition. A first-pass survey.

predict_first_contact(x, y, radius)
    # Cheap pre-check (≤90 steps): which object the red ball hits
    # first, the contact step, approach speed, point, surface normal.

trace_green_ball(x, y, radius)
    # Lightweight probe: green ball's (x,y) waypoints every 30 steps
    # plus start/end/peak summary. "Where does it go?"

simulate_with_trace(x, y, radius, object_names, stop_step)
    # Full rollout: per-object kinematic extrema (peak_y, v_max,
    # Δpos, angular speed) + up to 15 contact events.

finish(x, y, radius)
    # Commit the final placement; environment scores the predicate.

Other levels add bespoke analysis tools — compute_intercept_setup() (falling_into_place), compute_basket_analysis() (basket_case), get_ramp_center() (pass_the_parcel). The same backend exposes a snapshot/restore primitive that branches a shared mid-trajectory state into paired counterfactual rollouts (paper Figure 13).
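The cheap-to-expensive ordering of these tools suggests a screening loop: reject a placement with the inexpensive probe before paying for a trace. The sketch below is illustrative only. The tool names and argument lists come from the listing above, but `StubCatapult`, its return fields, and its toy rules are hypothetical stand-ins, not the InterPhyre backend or its physics.

```python
from dataclasses import dataclass


@dataclass
class StubCatapult:
    """Hypothetical stand-in for a level backend; toy rules, not real physics."""
    calls: int = 0

    def predict_first_contact(self, x, y, radius):
        self.calls += 1
        # Toy rule: the ball only lands on the arm when dropped right of the pivot.
        return {"first_hit": "arm" if x > 0 else "floor"}

    def trace_green_ball(self, x, y, radius):
        self.calls += 1
        # Toy rule: x >= 0.5 arcs into the ceiling; range grows with radius up to 1.5.
        return {"hits_ceiling": x >= 0.5, "end_x": -4.0 * min(radius, 1.5)}

    def finish(self, x, y, radius):
        trace = self.trace_green_ball(x, y, radius)
        return (not trace["hits_ceiling"]) and trace["end_x"] <= -6.0


def run_experiments(env, candidates):
    """Screen each candidate with the cheap probe before running a trace."""
    for x, y, r in candidates:
        if env.predict_first_contact(x, y, r)["first_hit"] != "arm":
            continue  # hypothesis rejected without a trace
        trace = env.trace_green_ball(x, y, r)
        if not trace["hits_ceiling"] and trace["end_x"] <= -6.0:
            return (x, y, r)
    return None


env = StubCatapult()
choice = run_experiments(env, [(-0.5, 0.4, 1.5), (0.5, 0.4, 1.5), (0.3, 0.4, 1.5)])
print(choice, env.finish(*choice))
```

In this toy setup the first candidate is discarded by the pre-check alone, which is the behaviour the tool hierarchy is meant to encourage.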

How HExA Works

HExA pairs two agents, an actor and an evolver, with a retriever, and runs them over a sequence of rounds. The actor generates trajectories by interacting with the environment through the tool-call interface; the evolver reads batches of trajectories and distils them into a natural-language skill bank; and the retriever injects the top-weighted skills and common mistakes into the actor's context at the start of each new episode. Each subsequent trajectory benefits from the accumulated experience of all prior ones. This is in-context reinforcement learning: the policy improves through context augmentation rather than weight updates.

The HExA actor, skill evolver, and evolving skill bank feedback loop
Figure 2. The actor–evolver–retriever loop (one round). The actor runs each episode with the retrieved skill context prepended; trajectories are tagged with rewards. The evolver curates the skill bank (capped at Mmax skills, Nmax mistakes per task family); the retriever re-injects the top entries by weight for the next round.
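One round of this loop can be sketched in a few lines. Everything below is a simplified assumption: the actor and evolver are stubs rather than LLM calls, the success model is a toy, and `M_MAX`, `SEEDS_PER_ROUND`, and the weight formula are placeholder choices (the round size mirrors the x=3 seeds/round setting in the experiments section).

```python
import random

M_MAX = 3            # hypothetical cap on skills kept per task family
SEEDS_PER_ROUND = 3  # placeholder round size


def actor(seed, context):
    """Stub actor: a toy success model that improves when skills are in context."""
    rng = random.Random(seed)
    success = rng.random() < (0.8 if context else 0.2)
    return {"seed": seed, "success": success, "iters": 6 if success else 25}


def evolver(bank, batch):
    """Stub evolver: add one skill per success, then keep the top M_MAX by weight."""
    for traj in batch:
        if traj["success"]:
            bank = bank + [{"text": f"strategy seen on seed {traj['seed']}",
                            "weight": 1.0 - traj["iters"] / 25}]
    return sorted(bank, key=lambda s: s["weight"], reverse=True)[:M_MAX]


def retriever(bank, k=2):
    """Return the top-k skill texts to prepend to the actor's prompt."""
    return [skill["text"] for skill in bank[:k]]


def meta_episode(seeds):
    bank, results = [], []
    for start in range(0, len(seeds), SEEDS_PER_ROUND):
        context = retriever(bank)  # context is fixed for the whole round
        batch = [actor(s, context) for s in seeds[start:start + SEEDS_PER_ROUND]]
        bank = evolver(bank, batch)  # the bank changes only between rounds
        results.extend(batch)
    return bank, results


bank, results = meta_episode(list(range(12)))
print(len(bank), sum(r["success"] for r in results))
```

The point of the sketch is the data flow: no weights are updated anywhere, and the only state carried between rounds is the bank.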

Reward-guided, two-pass distillation

Each trajectory gets a scalar reward r(τ) ∈ [−1, +1] reflecting both outcome and efficiency — fast successes yield high-confidence strategies, while extensively-explored failures provide evidence for diagnosing mistakes. The evolver runs a two-pass distillation: first, contrastive skill extraction across the full batch (what distinguishes high-reward from low-reward trajectories); second, mistake and partial-skill extraction focused on the failure subset. Each skill carries a weight wk derived from the mean reward of its source trajectories.
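The text fixes only the range of r(τ) and the fact that a skill's weight comes from the mean reward of its source trajectories. A minimal sketch under assumed functional forms follows; the linear efficiency discount, the `progress` term for failed rollouts, and both function names are hypothetical and not taken from the paper.

```python
from statistics import fmean


def trajectory_reward(success, iters, budget=25, progress=0.0):
    """Assumed reward shape, bounded in [-1, +1].

    Successes score in [0.5, 1.0], higher when solved in fewer iterations.
    Failures score in [-1.0, -0.5], raised by partial progress in [0, 1].
    """
    if success:
        return 1.0 - 0.5 * (iters - 1) / (budget - 1)
    return -1.0 + 0.5 * progress


def skill_weight(source_rewards):
    """Weight w_k as the mean reward of the trajectories the skill came from."""
    return fmean(source_rewards)


fast = trajectory_reward(True, iters=1)
slow = trajectory_reward(True, iters=25)
failed = trajectory_reward(False, iters=25)
print(fast, slow, failed, skill_weight([fast, slow]))
```

Under this shape a skill backed by fast successes outranks one backed by slow successes, which is the ordering the retriever needs.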

Hierarchical, evolving skills

Skills are hierarchical in two senses. First, they abstract many low-level tool calls into one high-level principle (e.g. "when the ceiling–range tradeoff is unsolvable by radius tuning, shift x to 0.1–0.3" compresses dozens of (x, y, r) sequences). Second, skills are learned while the agent already has access to earlier ones, so later skills build on earlier ones. The default Off2On Evolving regime merges, revises, or prunes skills with each round's evidence; it beats Iterative Replacement, frozen Offline banks, and Pure Online across every model and level tested.
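The merge, revise, and prune steps of an evolving bank can be illustrated with plain dictionaries. This is a sketch under stated assumptions: in HExA the evolver is an LLM that rewrites skill text, whereas here skills are matched by exact name, revision is a running mean of rewards, and `m_max` and `min_weight` are hypothetical parameters.

```python
def evolve_bank(bank, proposals, m_max=10, min_weight=0.0):
    """Return a new bank after one round of evidence.

    bank:      {skill_name: (weight, n_evidence)}
    proposals: [(skill_name, reward)] extracted from the latest round
    """
    merged = dict(bank)
    for name, reward in proposals:
        if name in merged:  # revise: fold new evidence into the running mean
            weight, n = merged[name]
            merged[name] = ((weight * n + reward) / (n + 1), n + 1)
        else:               # merge: admit a newly observed skill
            merged[name] = (reward, 1)
    # prune: drop skills the evidence no longer supports, then apply the cap
    kept = {k: v for k, v in merged.items() if v[0] > min_weight}
    top = sorted(kept.items(), key=lambda kv: kv[1][0], reverse=True)[:m_max]
    return dict(top)


bank = {"shift x on ceiling hit": (0.9, 2)}
proposals = [
    ("shift x on ceiling hit", 0.6),
    ("raise radius at x=0.5", -0.8),
    ("use r >= 1.5", 0.9),
]
new_bank = evolve_bank(bank, proposals)
print(sorted(new_bank))
```

Keeping the evidence count alongside the weight is one way to stop a single lucky trajectory from overriding a well-supported skill.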

Cross-task transfer

Given source banks and a textual description of a target task, the evolver identifies structurally relevant skills, re-grounds them in the target task, and recalibrates their weights by how directly the principle transfers — enabling zero-shot transfer with no target-task trajectories. A transferred bank can also seed subsequent within-task refinement.
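The weight recalibration step can be pictured as scaling each source weight by a relevance score. In this sketch the relevance scores are supplied by hand, whereas in HExA they would come from the evolver's judgement of the target description; the skill names, the multiplicative rule, and the `keep_at_least` threshold are all hypothetical.

```python
def transfer_bank(source_banks, relevance, keep_at_least=0.3):
    """Build a target bank from source banks without any target trajectories.

    source_banks: {level_name: [(skill_name, weight)]}
    relevance:    {skill_name: score in [0, 1]} for the target task
    """
    target = []
    for level, skills in source_banks.items():
        for name, weight in skills:
            new_weight = weight * relevance.get(name, 0.0)
            if new_weight >= keep_at_least:
                target.append((name, round(new_weight, 3), level))
    return sorted(target, key=lambda s: s[1], reverse=True)


sources = {
    "down_to_earth": [("gravity drop clears an obstruction", 0.9)],
    "two_body_problem": [
        ("heavier ball transfers more momentum", 0.8),
        ("place the ball next to the target", 0.9),
    ],
}
relevance = {
    "heavier ball transfers more momentum": 1.0,
    "gravity drop clears an obstruction": 0.5,
    "place the ball next to the target": 0.1,
}
catapult_bank = transfer_bank(sources, relevance)
print(catapult_bank)
```

A skill with a high source weight but low relevance falls out of the target bank, so only principles that plausibly carry over reach the actor's context.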

One Episode, Two Agents

Catapult, seed 45. ReAct (no memory) exhausts its 25-iteration budget re-searching placements and never finds a working launch. HExA reuses skills distilled from earlier catapult episodes — it already knows how to launch (drop a heavy ball on the arm) and how to fix the usual failure (when the shot arcs too high and hits the ceiling, shift the drop point sideways to flatten it). After one ceiling overshoot it applies the learned correction and solves the level.

ReAct — 25 iterations, FAILURE   placement (0.5, −1.0, 2.0)
ReAct failure: 25 iterations, no green–blue contact
HExA — 6 iterations, SUCCESS   placement (0.3, 0.9, 1.5)
HExA success: 6 iterations, lever flings green ball over the blocker into the basket

Core Claims

Experiments & Results

All methods are evaluated on the same 50 seeds per level (paired comparisons), under an identical 25-iteration harness, with HExA in its default Off2On Evolving configuration (x=3 seeds/round). HExA, HExA (no reward), and Reflexion report mean ± std over 3 runs.

H1 — Skill accumulation on catapult (Claude Sonnet 4.6, 50 seeds)

Method | Acc. (%) | Succ. | Fail | Avg Iter
Direct (one-shot, no tools) | 2.0 | 1 | 49 | 1.0
ReAct baseline | 8.0 | 4 | 46 | 22.9
Reflexion (K=2, 3 runs) | 21.3 ± 2.5 | 10.7 | 39.3 | 21.2
HExA (no reward, 3 runs) | 50.7 ± 9.4 | 25.3 | 24.7 | 16.5
HExA (Off2On Evol., 3 runs) | 67.3 ± 9.3 | 33.7 | 16.3 | 14.4

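HExA reaches ~3× the accuracy of the strongest baseline (Reflexion, 67.3% vs. 21.3%) and uses the fewest iterations per episode among the tool-using methods (14.4 vs. 21.2 for Reflexion and 22.9 for ReAct). On pass_the_parcel, the best HExA config reaches 60% vs. 24% (ReAct), 16% (Reflexion), and 0% (Direct).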
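The headline ratios follow directly from the catapult results above. The short check below uses only the reported means; the variable names are ours.

```python
# Reported catapult means (Claude Sonnet 4.6, 50 seeds).
hexa_acc, reflexion_acc = 67.3, 21.3   # accuracy, %
hexa_iters, react_iters = 14.4, 22.9   # average iterations per episode

ratio = hexa_acc / reflexion_acc           # accuracy multiple over Reflexion
iter_saving = 1 - hexa_iters / react_iters  # fraction of iterations saved vs. ReAct

print(f"{ratio:.2f}x accuracy, {iter_saving:.1%} fewer iterations")
# prints "3.16x accuracy, 37.1% fewer iterations"
```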

Catapult cumulative success: HExA 67.3% vs HExA no-reward 50.7% vs Reflexion 21.3%
Figure 4. Cumulative success on catapult (mean of 3 runs, ±1 std). HExA 67.3% · HExA no-reward 50.7% · Reflexion 21.3%.
Pass the Parcel variants: Off2On Evolving at 60% vs ReAct at 24%
Figure 5. Pass the Parcel: Off2On Evolving finishes at 60% (+36 pp over ReAct) and converges to the lowest cost per seed.

Ablation — reward signal (Qwen-2.5-7B, 50 seeds)

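Level | Method | Acc. (%) | Avg Iters | Δ (pp)
Down to Earth | ReAct baseline | 62.0 | 12.5 |
Down to Earth | HExA (no reward) | 64.0 | 14.8 | +2.0
Down to Earth | HExA (reward-weighted) | 72.0 | 12.6 | +10.0
Two Body Problem | ReAct baseline | 18.0 | 22.2 |
Two Body Problem | HExA (no reward) | 26.0 | 22.2 | +8.0
Two Body Problem | HExA (reward-weighted) | 34.0 | 18.9 | +16.0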

H2 — Gains extend to open-weight solvers (50 seeds)

Level | Model | Baseline (%) | HExA (%) | Δ (pp)
Down to Earth | Qwen-2.5-3B | 8.0 | 24.0 | +16.0
Down to Earth | Qwen-2.5-7B | 62.0 | 72.0 | +10.0
Two Body Problem | Qwen-2.5-3B | 6.0 | 14.0 | +8.0
Two Body Problem | Qwen-2.5-7B | 18.0 | 34.0 | +16.0
Catapult | GPT-OSS-120B | 0.0 | 54.0 | +54.0
HExA on open-weight solvers: consistent gains across Qwen-2.5-3B, 7B and GPT-OSS-120B
Figure 10. Consistent improvement across small open LLMs on down_to_earth (DTE), two_body_problem (TBP), and catapult.

H4 — In-context HExA vs. gradient-based GRPO (Qwen-2.5-3B)

At a matched budget of 50 unique seeds, HExA beats GRPO fine-tuning: 24% vs 20% on down_to_earth and 14% vs 6% on two_body_problem. A strategy HExA discovers is usable by the next episode immediately via context, whereas GRPO must encode it into weights over many rollouts. With sufficient training, GRPO eventually catches up via direct reward optimisation.

GRPO training dynamics vs HExA: HExA leads at matched 50-seed budget
Figure 11. Val success vs. unique training seeds (log scale). At the matched budget (dashed line), both GRPO variants fall below HExA (★).

Zero-Shot Cross-Level Transfer

Do distilled skills encode genuine physics primitives or just level-specific recipes? We test this by synthesising a skill bank for a target level using only source-level banks and a one-sentence level description — no target-level trajectories at any stage.

Multi-source · Claude Sonnet
Down to Earth (8 skills) · Two Body Problem (8 skills) · Pass the Parcel (10 skills)
↓ evolver synthesises
Catapult: 9 skills · 4 mistakes, 44% (+36 pp)

Single-source · Qwen-2.5-7B
Down to Earth (8 skills)
↓ evolver synthesises
Falling Into Place: 4 skills · 2 mistakes, 32% (+12 pp)
Two Body Problem: 3 skills · 1 mistake, 34% (+16 pp)

Target Level | Sources | Model | ReAct | Transfer | Δ
Catapult | DTE + TBP + PTP | Sonnet | 8.0% | 44.0% | +36 pp
Falling Into Place | DTE only | Qwen-7B | 20.0% | 32.0% | +12 pp
Two Body Problem | DTE only | Qwen-7B | 18.0% | 34.0% | +16 pp

Transfer is positive in every case. No target-level trajectories are seen at any stage.

Cross-level skill transfer bar chart: HExA transfer vs ReAct baseline for catapult, falling into place, and two body problem
Figure 6. Success rate with only the synthesised skill bank injected vs. the ReAct baseline. Multi-source catapult transfer (+36 pp) closes ~60% of the gap to the within-level HExA result of 64%. Positive transfer even on structurally dissimilar pairs (DTE→Two Body Problem) indicates the evolver extracts abstract principles — momentum transfer, contact geometry, directional impulse — rather than narrow placement heuristics.

Inside the Skill Bank

After each round of experimentation, HExA's evolver distils interaction trajectories into a growing library of physics principles and anti-patterns. Each skill carries a confidence score computed from the rewards of the trajectories it was extracted from — higher-confidence skills appear at the top of every agent's context. Below are real entries from evolved banks across three InterPhyre levels.

Skills 5 principles · generation 10
x ≈ 0.5 Is the Primary Launch Sweet Spot 0.91

Placing the red ball at x=0.5 on the catapult arm produces a consistent rightward launch arc. The contact point sits on the right side of the pivot, creating sufficient lever arm for strong rotation while staying in the geometrically stable zone. Only deviate to x=0.3 when a ceiling hit is observed.

When: Always as the first placement attempt. Deviate only on ceiling hit or overlap. Example: (0.5, 0.4, 1.5) — canonical primary (Seed 44)
Seeds: 44
Use Large Radius (≥1.5) for Sufficient Launch Energy 0.90

The catapult arm is a lever — red ball mass (∝r³) determines angular momentum imparted to the arm. Below r≈1.2 the arm rotates too slowly to launch the green ball across the ∼7-unit gap. r=1.5 is the minimum reliable threshold; r=2.0 adds negligible additional range at x=0.5 due to arm rotation saturation.

When: Always use r=1.5 as baseline. Scale down only when overlap constraints require it. Example: (0.5, 0.4, 1.5) — Seed 44
Seeds: 44
Three-Tier Fallback Sequence for Failed Primary 0.93

When primary (x=0.5, y=0.4, r=1.5) fails: (1) ceiling hit → try x=0.3, y=0.4, r=1.5; if still hitting, try x=0.3, y=−0.3, r=2.0 but verify no overlap first. (2) short range → increase y to 0.9 or shift x to 0.4–0.45. (3) persistent failure after 2+ radius variations at same x → stop tuning, shift x to 0.2–0.3.

When: Immediately when ceiling hit or short range is observed. Do not micro-tune the failed placement. Example: Seed 44 primary hit ceiling → x=0.3, y=0.4, r=1.5 succeeded
Seeds: 44
x=0.3 Is the Ceiling-Escape Position; y=0.4 Is Gray-Ball-Safe 0.93

Shifting x from 0.5 to 0.3 flattens the launch arc by changing the arm contact point — this x-shift is the primary ceiling-escape mechanism, not lowering y. When x=0.3, y=−0.3, r=2.0 is invalid due to gray_ball pivot overlap, use x=0.3, y=0.4, r=1.5; y=0.4 avoids the overlap zone while preserving arc-flattening.

When: Primary (x=0.5) causes ceiling hit. Try x=0.3, y=0.4, r=1.5 before attempting y=−0.3. Example: Seed 44: x=0.3, y=0.4, r=1.5 → green ball bounced off top wall into basket
Seeds: 44
Ceiling Blocker Lethality Depends on Its x-Position 0.87

The static black ball near y≈4.6 varies in x-position across seeds (x≈−3.85 to −1.14). When the blocker is at x > −2.5, primary placement causes a ceiling hit. When x < −3.5, x=0.3 launches allow the green ball to arc up and bounce off the top wall before descending into the basket.

When: Check ceiling blocker x via describe_scene_geometry. If x > −2.5, ceiling escape is needed. Example: Seed 44: black_ball at x=−3.63 → green ball bounced off top_wall then landed in basket
Seeds: 44
Common Mistakes 3 anti-patterns · generation 10

Micro-tuning x/y/r around the same narrow region without escaping the local solution space.

Why it happens: The catapult arm is the most obvious mechanism; after 2–3 failures, agents try small perturbations instead of qualitatively different placements.
How to avoid: After 2 failures within x±0.2, move to a completely different x zone (x=0.5 → x=0.3) rather than fine-tuning.

Increasing radius at fixed x=0.5 expecting more range, when arm rotation is already saturated.

Why it happens: Linear energy intuition: more mass = more energy = more range. But lever rotation saturates before additional mass converts to green ball velocity at x=0.5.
How to avoid: When r=1.5 fails at x=0.5, never try r=2.0. Change x-position to 0.3 instead.

Adjusting y upward after a ceiling hit — increasing launch energy, worsening the collision.

Why it happens: Higher drop = more kinetic energy at impact. But more energy produces a higher-arcing launch, worsening the ceiling hit.
How to avoid: After a ceiling hit, shift x left to 0.3 (keeping y=0.4) to flatten the arc. Never increase y above 1.5 in response to ceiling hits.
Skills 3 principles · generation 13–14
Optimal Initial Positioning Near the Green Ball 0.90

Place the red ball near the green ball but not overlapping to ensure a precise collision. Proximity maximises impulse transfer while avoiding invalid placements that waste iterations.

When: When placing the red ball for the first time.
Seeds: 51, 52, 53
Red Ball Positioning for Strong Directed Impulse 0.88

Place the red ball closer to the green ball and near the edge of the platform to ensure a strong and directed collision impulse. Proximity to the platform edge increases the lateral component of the impulse, helping the green ball clear the edge rather than bouncing back.

When: When the green ball is near the centre of the platform and needs a directed push toward the edge. Example: (1.3, 2.5)
Seeds: 52
Platform Edge Geometry: Maximise Lateral Impulse 0.56

Ensure the red ball is placed near the edge of the platform to maximise the collision impulse toward the edge. The platform boundary creates a geometry where a well-placed red ball can leverage the edge to redirect the green ball downward rather than laterally.

When: When the red ball needs to push the green ball off the platform edge. Example: (1.5, 3.5)
Seeds: 52, 53
Common Mistakes 2 anti-patterns · generation 13–14

Repeatedly trying the red ball in the same crowded area around the green ball after each failure.

Why it happens: Agent lacks confidence in other regions and defaults to the same vicinity, treating failures as positioning errors rather than strategy errors.
How to avoid: Shift x or y by at least 0.5 units after two consecutive failures from the same region before any further fine-tuning.

Ignoring the direction and magnitude of the collision impulse when choosing placement.

Why it happens: Agent focuses on avoiding overlap rather than engineering the right impulse vector, treating placement as a constraint problem rather than a physics problem.
How to avoid: Approach from the side opposite the desired green ball exit direction to ensure the impulse is correctly directed.
Skills 5 principles · generation 12
Ramp Launch: Upper-Mid Zone Placement 0.90

Place the red ball on the upper portion of the ramp at x ≈ ramp_center_x − 0.15 to −0.50 and y ≈ 3.8–4.2. This offset ensures the ball rolls down the ramp with sufficient leftward momentum to strike the top basket. The exact x-offset modulates the angle of impact on the inverted basket.

When: Always use as the primary strategy; compute x-offset from get_ramp_center. Example: ramp_center_x=3.401 → try x=2.9; ramp_center_x=3.774 → try x=3.3
Seeds: 51, 53
Safe Radius Range: r = 0.45–0.52 0.88

Radii below 0.40 often produce insufficient impulse to dislodge the top basket. Radii above 0.52 frequently overshoot the green ball far left (x < −3.0), missing the bottom basket entirely. The r=0.45–0.52 range provides the best impulse-accuracy tradeoff.

When: Default to r=0.45 or r=0.5 on first attempt; do not exceed r=0.55 without overshoot analysis. Example: Seed 52: r=0.6 → rim stall, r=0.7 → overshoot to x=−4.73; r=0.5 is safer.
Seeds: 51, 52, 53
y = 4.2 Is the Preferred Default Launch Height 0.87

Setting y=4.2 provides reliable impulse for both r=0.45 and r=0.5. Trajectories at y=4.0 frequently leave the green ball energy-deficient, stalling above the bottom basket. y=4.2 consistently cleared this gap in successful runs; y=4.3 is the fallback for persistent energy deficit.

When: Use y=4.2 as the first-attempt height; verify y + r ≤ 4.95 to avoid ceiling clamp. Escalate to y=4.3 only if green ball still stalls. Example: Seed 51: x=2.9, y=4.2, r=0.45 → SUCCESS after y=4.0 failed
Seeds: 51, 53
Basket-Fall Coupling: Velocity-Match Signals Re-Trapping 0.87

When the top basket and green ball share nearly identical velocities during descent (green_y − basket_y < 0.6), the basket is re-trapping the green ball. A large x-shift alone may not break this coupling — increasing y is often required to give the green ball enough velocity to separate from the basket mid-fall.

When: After any simulation where basket_y is within 0.6 of green_ball_y in the final state, increase y by 0.15–0.20 rather than shifting x. Example: Seed 51: x=2.7, y=3.8 → re-trap; x=2.9, y=4.2 → separation → SUCCESS
Seeds: 51
Overshoot Detection: Physics Has Binary Threshold Behaviour 0.90

The level exhibits sharp non-monotonic thresholds: small radius changes (Δ≈0.05–0.1) can flip the green ball from rim-stall to far-overshoot without passing through a success region. When r=0.5 produces rim stall and r=0.7 produces far overshoot, intermediate radii typically also stall. Switch to adjusting y rather than bisecting r.

When: Two radius values produce opposite failure modes (stall vs. overshoot) — do not bisect. Hold r constant and increase y to add energy instead. Example: Seed 52: r=0.5/0.6/0.65 → rim stall; r=0.7 → far overshoot. Bisecting r was futile.
Seeds: 52
Common Mistakes 3 anti-patterns · generation 12

Fine-grained bisection of radius or y in chaotic threshold regions where physics is non-monotonic.

Why it happens: Agent treats the parameter space as smooth and continuous, assuming bisection will converge. Threshold regions have discontinuous jumps.
How to avoid: Bisect at most once between opposite failure modes, then switch to a completely different parameter axis (e.g., from r to y).

Failing to prevent the top basket from tracking alongside the green ball and re-trapping it mid-fall.

Why it happens: Agent adjusts x when basket re-trapping occurs, not recognising that more vertical energy (higher y) is needed to break velocity coupling.
How to avoid: When final basket_y is within 0.6 of green_ball_y, increase y by 0.15–0.20 rather than shifting x alone.

Ignoring the 2000-step budget — green ball has the right direction but insufficient speed to reach the target.

Why it happens: Agent interprets near-miss positions (green above blue) as close to success, when the green ball has actually stopped with steps exhausted.
How to avoid: Check final velocities: if green_ball velocity ≈ 0 at step 2000 with green still above blue, this is an energy-deficit failure requiring a y or r increase.
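Entries like these map naturally onto a small record type that the retriever can sort by confidence and render into the actor's prompt. The sketch below is one possible representation, not the format HExA uses: the `Skill` fields, the `render_context` helper, and the prompt layout are assumptions, while the titles and scores are copied from the catapult bank above.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    title: str
    confidence: float
    when: str = ""
    seeds: list = field(default_factory=list)


def render_context(skills, mistakes, top_k=2):
    """Render the top-k skills by confidence, then the common mistakes."""
    ranked = sorted(skills, key=lambda s: s.confidence, reverse=True)[:top_k]
    lines = ["Skills"]
    lines += [f"- [{s.confidence:.2f}] {s.title} (when: {s.when})" for s in ranked]
    lines.append("Common mistakes")
    lines += [f"- {m}" for m in mistakes]
    return "\n".join(lines)


catapult_skills = [
    Skill("Use Large Radius (>=1.5) for Sufficient Launch Energy", 0.90,
          "always use r=1.5 as baseline", [44]),
    Skill("Three-Tier Fallback Sequence for Failed Primary", 0.93,
          "ceiling hit or short range observed", [44]),
    Skill("Ceiling Blocker Lethality Depends on Its x-Position", 0.87,
          "check blocker x via describe_scene_geometry", [44]),
]
prompt = render_context(
    catapult_skills,
    ["Increasing radius at fixed x=0.5 expecting more range"],
)
print(prompt)
```

With `top_k=2` the lowest-confidence entry is left out of the prompt, mirroring the bank's rule that higher-confidence skills appear first in context.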

BibTeX

@misc{chandra2026hierarchicalexperimentalistagents,
      title={Hierarchical Experimentalist Agents}, 
      author={Abhranil Chandra and Sankaran Vaidyanathan and Utsav Dhanuka and Varun Gandhi and Scott Niekum},
      year={2026},
      eprint={2606.29315},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.29315}, 
}