Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

1UC Berkeley, 2Stanford University, 3Physical Intelligence

Steerable Policies can flexibly follow diverse commands, allowing them to better interface with VLMs to transfer foundation model capabilities to the real world.

Abstract

Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks.

Steerable Policies

Our Steerable Policies are vision-language-action models trained on diverse and detailed steering commands

  • A core bottleneck in transferring foundation model capabilities to robotics is the limited steerability of low-level policies.
  • We thus introduce Steerable Policies: vision-language-action models (VLAs) trained on diverse steering commands.
  • For training data, we use an automated annotation pipeline to label robot demos with synthetic steering commands.
  • We showcase two new hierarchical control methods, highlighting how steerability enables better use of VLM capabilities:
    • Fine-tuning a VLM into an embodied reasoner, which produces chain-of-thought rationales for how to decompose tasks into appropriate steering commands.
    • Using an off-the-shelf VLM to perform robot in-context learning over steering abstractions, allowing it to adapt its commands based on past experience.

Steering Commands

Illustrative examples of steering command styles used to train Steerable Policies.

Standard VLAs are trained on "task-level" commands, i.e., high-level natural language descriptions of the task to be performed. However, these commands are often too formulaic and vague to induce the full range of physical skills necessary for solving novel manipulation tasks.

We thus train our policy on steering commands: a diverse set of instructions spanning many styles and levels of abstraction. Beyond standard task-level commands, we also include the styles below (one possible representation is sketched after the list):

  • Semantic subtasks, e.g., "reach for the carrot" and "grasp the container".
  • Atomic motions, e.g., "move left and grasp".
  • Pointing, e.g., "open gripper above the container at [x, y]".
  • Gripper traces, e.g., "move along [x1, y1], [x2, y2], ...".
  • Hybrids of these styles, e.g., "move left from [x1, y1] to [x2, y2] to grasp the carrot".
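To make these styles concrete, here is a minimal sketch of how such commands could be represented and rendered as the plain language strings the policy consumes. The class and helper names are illustrative assumptions, not the project's actual schema.

from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Tuple

class CommandStyle(Enum):
    TASK = auto()      # "put the carrot in the pot"
    SUBTASK = auto()   # "reach for the carrot"
    MOTION = auto()    # "move left and grasp"
    POINT = auto()     # "open gripper above the container at [x, y]"
    TRACE = auto()     # "move along [x1, y1], [x2, y2], ..."
    HYBRID = auto()    # motion + pixel coordinates + object reference

@dataclass
class SteeringCommand:
    style: CommandStyle
    text: str          # the string fed to the policy's language input

def point_command(action: str, obj: str, pixel: Tuple[int, int]) -> SteeringCommand:
    """Format a pointing-style command grounded in image pixel coordinates."""
    return SteeringCommand(CommandStyle.POINT, f"{action} above the {obj} at [{pixel[0]}, {pixel[1]}]")

def trace_command(waypoints: List[Tuple[int, int]]) -> SteeringCommand:
    """Format a gripper-trace command from a list of pixel waypoints."""
    path = ", ".join(f"[{x}, {y}]" for x, y in waypoints)
    return SteeringCommand(CommandStyle.TRACE, f"move along {path}")

For example, point_command("open gripper", "container", (112, 96)) yields the command string "open gripper above the container at [112, 96]".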

Generating Data

We use foundation models to automatically parse robot demonstrations into segments that are labeled with diverse steering commands.

To train Steerable Policies, we need a dataset of steering commands. We acquire this by automatically attaching synthetic language labels to existing robot trajectories using a multi-stage pipeline. We leverage various foundation models to extract relevant embodied features, then query an API-based VLM (Gemini) to compile them into commands of all our steering styles.
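As a rough illustration of this multi-stage pipeline (not the project's actual code), the annotation step can be sketched as a loop over demonstration segments, with the foundation-model components passed in as callables; all names and signatures below are assumptions.

from typing import Callable, Dict, List

def annotate_demo(
    demo: Dict,
    segmenter: Callable[[Dict], List[Dict]],                  # splits a demo into short skill segments
    feature_extractors: Dict[str, Callable[[Dict], object]],  # e.g., object detector, gripper-trace projector
    vlm_labeler: Callable[[Dict, List[str]], List[str]],      # e.g., a Gemini API call that writes commands
    styles: List[str],                                         # subtask, motion, pointing, trace, hybrid, ...
) -> List[Dict]:
    """Attach synthetic steering commands of every style to one robot demonstration."""
    labeled = []
    for segment in segmenter(demo):
        # Extract embodied features (detected objects, projected end-effector path, gripper events, ...).
        features = {name: extract(segment) for name, extract in feature_extractors.items()}
        # Compile the extracted features into steering commands of all requested styles.
        commands = vlm_labeler(features, styles)
        labeled.append({"segment": segment, "commands": commands})
    return labeled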

Hierarchical Control Experiments

Our two instantiations of hierarchical control methods using Steerable Policies.

We instantiate a Steerable Policy by adapting the OpenVLA codebase, and train it on the Bridge WidowX dataset. Using this VLA, we study two ways in which Steerable Policies allow better application of VLM capabilities to real-world manipulation.
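Because steering commands are plain language, they can simply reuse the VLA's existing instruction slot. The snippet below is a hedged sketch of this based on the public OpenVLA inference interface; the checkpoint name, example command, and prompt stand in for the adapted Steerable Policy, whose exact interface may differ, and running it requires the corresponding checkpoint and a GPU.

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"  # placeholder: the paper adapts the OpenVLA codebase

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("observation.png")  # current camera frame
# A hybrid-style steering command takes the place of the usual task instruction.
command = "move left from [112, 96] to [64, 96] to grasp the carrot"
prompt = f"In: What action should the robot take to {command}?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)  # end-effector action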

Embodied Reasoning

"Put the carrot in the pot"

"Put the watermelon on the towel"

Controlling Steerable Policies with high-level embodied reasoning VLMs is an effective approach for generalizable control.

We fine-tune a VLM into a high-level embodied reasoner that autoregressively generates a grounded rationale explaining what the robot should do, before picking a steering command to execute with the low-level VLA. We find that this approach outperforms equivalent standard VLAs, past embodied reasoning methods, and a hierarchical non-reasoning ablation.
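One way to picture the resulting hierarchy is the loop below: the reasoner is queried for a rationale and a steering command, and the Steerable Policy then executes that command for a fixed number of control steps. The interfaces (reasoner, policy, env) are assumptions made for illustration only.

from typing import Callable, Dict

def hierarchical_episode(
    reasoner: Callable[[object, str], Dict[str, str]],  # (image, task) -> {"rationale": ..., "command": ...}
    policy: Callable[[object, str], object],            # (image, steering command) -> low-level action
    env,                                                # assumed interface: render(), step(action) -> (obs, done)
    task: str,
    max_commands: int = 20,
    steps_per_command: int = 10,
):
    """Run one episode: the high-level reasoner picks steering commands, the low-level VLA executes them."""
    obs = None
    for _ in range(max_commands):
        # High level: generate a grounded rationale, then a steering command to execute next.
        plan = reasoner(env.render(), task)             # e.g., {"rationale": "...", "command": "grasp the carrot"}
        # Low level: the Steerable Policy follows the chosen command for a few control steps.
        for _ in range(steps_per_command):
            action = policy(env.render(), plan["command"])
            obs, done = env.step(action)
            if done:
                return obs
    return obs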

Robot In-context Learning

"Make the blue block the only object on the plate"

"Stack the pots on the towel"

Steerable Policies allow high-level VLMs to perform robot in-context learning on novel multi-step tasks.

Steerable Policies also allow VLMs to leverage in-context learning, where the model reasons to select a steering command style, observes the resulting behavior, and iteratively refines its commands to adaptively improve on the task. This approach casts robot in-context learning as standard vision-language in-context learning, obviating the need for structured scene and action representations that prior robot in-context learning methods rely on.
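A hedged sketch of that loop, assuming the VLM is queried with a text prompt plus the accumulated images; the function signatures below are illustrative, not the actual implementation.

from typing import Callable, List, Tuple

def in_context_control(
    vlm: Callable[[str, List[object]], str],        # (prompt text, images so far) -> next steering command
    execute: Callable[[str], Tuple[object, str]],   # roll out the Steerable Policy -> (new image, outcome summary)
    task: str,
    num_rounds: int = 8,
) -> List[str]:
    """Robot in-context learning cast as plain vision-language in-context learning."""
    history_text = f"Task: {task}\n"
    history_images: List[object] = []
    commands: List[str] = []
    for t in range(num_rounds):
        # The VLM reasons over the full interaction history to pick a command style and a concrete command.
        command = vlm(history_text + "What steering command should be issued next?", history_images)
        image, outcome = execute(command)
        # Append the command and its observed outcome so later rounds can refine the strategy.
        history_text += f"Round {t}: issued '{command}'; outcome: {outcome}\n"
        history_images.append(image)
        commands.append(command)
    return commands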


Paraphrased examples of how in-context learning over steering commands allows VLMs to correct erroneous or stalling behaviors, apply fine-grained physical and semantic reasoning, and infer which command styles are most appropriate.

Because in-context learning is used to select the appropriate level of abstraction for steering the robot, this approach is uniquely enabled by our Steerable Policies: past VLAs are trained on only one or two steering modalities. We find that our approach outperforms a SayCan-like baseline (where the VLA is restricted to subtask-level commands), confirming the performance benefits of in-context learning over many prompting modalities.

BibTeX

@article{Chen26-steerable-policies,
    title={Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control},
    author={William Chen and Jagdeep Bhatia and Catherine Glossop and Nikhil Mathihalli and Ria Doshi and Andy Tang and Danny Driess and Karl Pertsch and Sergey Levine},
    year={2026}
}