Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks.
Our Steerable Policies are vision-language-action models trained on diverse and detailed steering commands.
Illustrative examples of steering command styles used to train Steerable Policies.
Standard VLAs are trained on "task-level" commands, i.e., high-level natural language descriptions of the task to be performed. However, these commands are often too formulaic and vague to induce the full range of physical skills necessary for solving novel manipulation tasks.
We thus train our policy on steering commands: a diverse set of instructions spanning many styles and levels of abstraction. Beyond standard task-level commands, we also include:
- Subtask-level commands, i.e., short-horizon steps that decompose a longer task.
- Motion-level commands describing how the robot arm and gripper should move.
- Grounded commands that reference pixel coordinates in the robot's camera observation.
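To make these styles concrete, the snippet below shows how a single trajectory segment might be annotated at each level of abstraction. The wording, keys, and pixel coordinates are purely illustrative and not taken from our dataset.

```python
# Hypothetical steering-command annotations for one trajectory segment.
# Field names, wording, and coordinates are illustrative only.
segment_annotations = {
    "task":     "put the carrot in the pot",                           # standard task-level command
    "subtask":  "pick up the carrot from the counter",                 # short-horizon step of the task
    "motion":   "move the gripper down and to the left, then close",   # how the arm should move
    "grounded": "move the gripper to pixel (112, 86)",                 # grounded in image coordinates
}
```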
We use foundation models to automatically parse robot demonstrations into segments that are labeled with diverse steering commands.
To train Steerable Policies, we need a dataset of steering commands. We acquire this by automatically attaching synthetic language labels to existing robot trajectories using a multi-stage pipeline. We leverage various foundation models to extract relevant embodied features, then query an API-based VLM (Gemini) to compile them into commands in each of our steering styles.
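As a rough illustration, the labeling step can be thought of as prompting an API-based VLM with features extracted from a trajectory segment. The sketch below uses a placeholder `query_vlm` call and invented field names; it is not our actual prompt or pipeline code.

```python
# Hypothetical sketch of the auto-labeling step: given extracted features for a
# demonstration segment, ask a VLM to write steering commands in each style.
# `query_vlm`, `Segment`, and the prompt wording are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Segment:
    start: int            # first frame index of the segment
    end: int              # last frame index of the segment
    gripper_track: list   # per-frame gripper pixel coordinates, e.g. [(u, v), ...]
    task: str             # original task-level instruction for the episode

def query_vlm(prompt: str) -> str:
    """Placeholder for a call to an API-based VLM (e.g., Gemini)."""
    raise NotImplementedError

LABEL_PROMPT = """You are labeling a robot manipulation segment.
Task instruction: {task}
Gripper pixel track: {track}
Write one steering command in each style:
1. subtask (short-horizon goal), 2. motion (how the arm moves), 3. grounded (pixel coordinates)."""

def label_segment(seg: Segment) -> dict:
    """Compile extracted features into synthetic steering commands."""
    # Subsample the gripper track so the prompt stays short.
    track = seg.gripper_track[:: max(1, len(seg.gripper_track) // 8)]
    raw = query_vlm(LABEL_PROMPT.format(task=seg.task, track=track))
    # In practice the VLM response would be parsed into structured fields;
    # here we simply return the raw text alongside the original task command.
    return {"task": seg.task, "steering_commands": raw}
```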
Our two instantiations of hierarchical control methods using Steerable Policies.
We instantiate a Steerable Policy by adapting the OpenVLA codebase and training it on the Bridge WidowX dataset. Using this VLA, we study two ways in which Steerable Policies enable VLM capabilities to be applied more effectively to real-world manipulation.
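The low-level interface is the same as for any language-conditioned VLA: the steering command is simply passed as the instruction. Below is a minimal sketch following the public OpenVLA Hugging Face example; it uses the released OpenVLA base checkpoint and its published prompt template purely to illustrate the interface, and the actual Steerable Policy checkpoint and prompt format may differ.

```python
# Minimal sketch of conditioning a VLA on a steering command, following the
# public OpenVLA Hugging Face example. The checkpoint is the released OpenVLA
# base model; a Steerable Policy would be queried the same way, with the
# steering command supplied as the instruction.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("camera_obs.png")                  # current camera observation
command = "move the gripper down and close it"        # a motion-level steering command (illustrative)
prompt = f"In: What action should the robot take to {command}?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)  # 7-DoF action
```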
"Put the carrot in the pot"
"Put the watermelon on the towel"
Controlling Steerable Policies with high-level embodied reasoning VLMs is an effective approach for generalizable control.
We fine-tune a VLM into a high-level embodied reasoner that autoregressively generates a grounded rationale explaining what the robot should do, before picking a steering command to execute with the low-level VLA. We find that this approach outperforms equivalent standard VLAs, past embodied reasoning methods, and a hierarchical non-reasoning ablation.
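At inference time, the two models compose into a simple hierarchical loop: the reasoner periodically inspects the current observation, produces a rationale and a steering command, and the Steerable Policy executes that command until the next replanning step. The sketch below is only a schematic of that loop; `reasoner`, `policy`, and `env` are placeholders, and the replanning interval is illustrative.

```python
# Schematic hierarchical control loop: a high-level embodied reasoner picks
# steering commands, which a low-level Steerable Policy turns into actions.
# `env`, `reasoner`, and `policy` are placeholders; all names are illustrative.
def run_episode(env, reasoner, policy, task: str, max_steps: int = 300, replan_every: int = 20):
    obs = env.reset()
    command = None
    for step in range(max_steps):
        if step % replan_every == 0:
            # The reasoner autoregressively generates a grounded rationale,
            # then selects a steering command for the low-level VLA to execute.
            rationale, command = reasoner.reason(image=obs["image"], task=task)
        action = policy.predict_action(image=obs["image"], instruction=command)
        obs, done = env.step(action)
        if done:
            break
    return obs
```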
"Make the blue block the only object on the plate"
"Stack the pots on the towel"
Steerable Policies allow high-level VLMs to perform robot in-context learning on novel multi-step tasks.
Steerable Policies also allow VLMs to leverage in-context learning, where the model reasons to select a steering command style, observes the resulting behavior, and iteratively refines its commands to adaptively improve on the task. This approach casts robot in-context learning as standard vision-language in-context learning, obviating the need for structured scene and action representations that prior robot in-context learning methods rely on.
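Concretely, this amounts to a standard vision-language in-context learning loop: the VLM's context accumulates its previous steering commands and the observations that resulted from executing them, and the model is re-prompted to propose the next command. The sketch below is a schematic of that loop, with `query_vlm` and `rollout_policy` as placeholder functions and illustrative prompt fields.

```python
# Schematic in-context learning loop over steering commands. `query_vlm` stands
# in for the off-the-shelf VLM and `rollout_policy` for executing the Steerable
# Policy; prompt structure and return types are illustrative only.
def in_context_control(task: str, query_vlm, rollout_policy, max_rounds: int = 10):
    history = []  # list of (steering_command, resulting_observation) pairs
    for _ in range(max_rounds):
        # The prompt shows the task, the available command styles, and the
        # outcomes of all previous commands, so the VLM can reason about which
        # style and instruction to try next.
        prompt = {
            "task": task,
            "available_styles": ["task", "subtask", "motion", "grounded"],
            "history": history,
        }
        command, finished = query_vlm(prompt)
        if finished:
            break
        observation = rollout_policy(command)  # execute the command with the Steerable Policy
        history.append((command, observation))
    return history
```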
Paraphrased examples of how in-context learning over steering commands allows VLMs to correct erroneous or stalling behaviors, apply fine-grained physical and semantic reasoning, and infer which command styles are most appropriate.
Because in-context learning is used to select the appropriate level of abstraction for steering the robot, this approach is uniquely enabled by our Steerable Policies; past VLAs are trained on only one or two steering modalities. We find that our approach outperforms a SayCan-like baseline (where the VLA is restricted to subtask-level commands), confirming the performance benefits of in-context learning over the full range of steering styles.
@article{Chen26-steerable-policies,
title={Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control},
author={William Chen and Jagdeep Bhatia and Catherine Glossop and Nikhil Mathihalli and Ria Doshi and Andy Tang and Danny Driess and Karl Pertsch and Sergey Levine},
year={2026}
}