
IVA: Instruct, Verify, Act

Teaching VLA Models to Reject the Impossible

Abstract

IVA is a unified framework for Vision-Language-Action (VLA) models that detects when an instruction is unfulfillable (i.e., rests on a false premise), clarifies or corrects it in natural language, and then acts safely. Trained with paired true- and false-premise instructions, IVA improves false-premise handling while maintaining strong true-premise task performance.

Figure 1: IVA detects a false-premise instruction, clarifies, and proposes a valid alternative.

Why IVA?

  • Most VLAs assume user instructions are always feasible; in practice, people often issue commands with false premises (missing objects, attribute mismatches, or impossible actions).
  • Naïvely following such instructions is unsafe or wasteful. IVA introduces explicit detection and clarification/correction to keep interaction safe and productive.
  • We evaluate both In-Domain (plausible-but-absent) and Out-of-Domain (impossible/absurd) false premises, and train IVA to handle both categories.

Our Solution

  • Single-stage end-to-end tuning: IVA jointly learns false-premise detection, language clarification/correction, and action prediction (unlike two-stage baselines).
  • Inputs & Outputs: front camera and the previous 5 joint positions (angles) → predict the 2-D visual trace and the next action as an 8-D vector (7 joint velocities + binary gripper).
  • Architecture: frozen vision & language encoders; an autoregressive decoder fine-tuned with LoRA adapters for efficient training.
  • Instruction template: a structured prompt containing the robot type, control mode, task, and a short proprioceptive history enables grounded reasoning and safe actions (see the sketch below).
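
As a rough illustration of the structured prompt above, here is a minimal Python sketch; the field wording, units, and function name are illustrative assumptions, not IVA's exact template.

# Minimal sketch of the structured instruction prompt (field names, wording,
# and units are illustrative assumptions, not IVA's exact template).

def build_prompt(task: str, joint_history: list[list[float]]) -> str:
    """Compose a prompt with robot type, control mode, task, and a short
    proprioceptive history (the previous 5 joint positions)."""
    assert len(joint_history) == 5, "IVA conditions on the previous 5 joint states"
    history = "; ".join(
        ", ".join(f"{angle:.3f}" for angle in joints) for joints in joint_history
    )
    return (
        "Robot: 7-DoF manipulator with a parallel gripper. "
        "Control mode: joint velocity. "
        f"Task: {task}. "
        f"Previous joint positions (rad): {history}. "
        "If the instruction cannot be fulfilled in the current scene, say so "
        "and propose a valid alternative; otherwise predict the 2-D visual "
        "trace and the next 8-D action (7 joint velocities + gripper)."
    )

# Example usage with a dummy proprioceptive history
prompt = build_prompt("open the middle drawer", [[0.0] * 7 for _ in range(5)])

Keeping the proprioceptive history inside the prompt is what lets the decoder ground its accept/clarify decision in the current robot state rather than in the instruction text alone.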

How did we construct the dataset?

  • Built on RLBench trajectories across 9 tasks, with paired true- and false-premise prompts.
  • 800 episodes per task, with false premises injected at 10% of the steps; about 65% of episodes are In-Domain FP and 20% are Out-of-Domain FP (the rest are true-premise; see the sketch after this list).
  • In-Domain FP: plausible but absent objects (e.g., “open the middle block” in a drawer scene). Out-of-Domain FP: clearly infeasible requests (e.g., “open the top elephant”).
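
To make the mixing ratios above concrete, here is a hypothetical Python sketch of episode labeling and step-level false-premise injection; the function names and sampling scheme are our own illustration, and only the ratios come from the text above.

import random

# Hypothetical sketch of the dataset mixing described above; only the 65/20/15
# episode split and the 10% step-level injection rate come from the project text.
EPISODE_MIX = {"in_domain_fp": 0.65, "out_of_domain_fp": 0.20, "true_premise": 0.15}
FP_STEP_RATE = 0.10  # fraction of steps that receive a false-premise prompt

def label_episode(rng: random.Random) -> str:
    """Assign an episode type according to the 65/20/15 mix."""
    r, cumulative = rng.random(), 0.0
    for kind, p in EPISODE_MIX.items():
        cumulative += p
        if r < cumulative:
            return kind
    return "true_premise"

def false_premise_mask(num_steps: int, kind: str, rng: random.Random) -> list[bool]:
    """Mark roughly 10% of steps in FP episodes; true-premise episodes get none."""
    if kind == "true_premise":
        return [False] * num_steps
    return [rng.random() < FP_STEP_RATE for _ in range(num_steps)]

rng = random.Random(0)
kind = label_episode(rng)
mask = false_premise_mask(num_steps=100, kind=kind, rng=rng)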

Evaluation

  • Episodes: 9 RLBench tasks × 25 episodes each = 225 episodes with randomized object poses and paired prompts.
  • (1) Detection: parse the model’s text output to classify Accept (true-premise) vs. Clarify/Refuse (false-premise); per-step false-premise scores are averaged per episode.
  • (2) Execution: when IVA accepts, execute the predicted sequence of 8-D actions (7 joint velocities + gripper); task success is judged by the RLBench success detector.
  • (3) Overall: Detection and Execution outcomes are averaged over all 225 episodes into a single accuracy metric (a sketch follows this list).
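
Below is one plausible reading of the aggregation, written as a short Python sketch; the episode record layout and the way detection and execution credit are combined are assumptions made for illustration, not the paper's evaluation code.

from dataclasses import dataclass

# Hypothetical aggregation of per-episode results into the overall accuracy;
# the record format and scoring rule are assumptions for illustration.
@dataclass
class EpisodeResult:
    is_false_premise: bool
    detection_score: float   # per-episode average of per-step Accept vs. Clarify/Refuse correctness
    execution_success: bool  # RLBench success detector outcome (meaningful only when accepted)

def overall_accuracy(episodes: list[EpisodeResult]) -> float:
    """Average detection and execution outcomes over all 225 (9 tasks x 25) episodes."""
    scores = []
    for ep in episodes:
        if ep.is_false_premise:
            scores.append(ep.detection_score)           # credit correct Clarify/Refuse
        else:
            scores.append(float(ep.execution_success))  # credit successful execution
    return sum(scores) / len(scores)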

Results

  • False-premise detection: 100% (In-Domain) and 97.78% (Out-of-Domain).
  • Improvement over baseline: IVA markedly boosts false-premise detection and increases successful responses in FP scenarios (see Table 1 for per-task details).
  • True-premise performance: IVA maintains strong success (42.67% ± 8.34%), comparable to LLaRVA (38.67% ± 8.55%).

Table 1: Per-task Overall, False-Premise Detection (ID/OOD), and True-Premise Success for IVA vs. LLaRVA across 9 RLBench tasks.
Figure 2: Qualitative examples across 9 RLBench tasks showing IVA's false-premise handling and safe alternative suggestions.

Key Takeaways

  • IVA explicitly reasons about feasibility, clarifies impossible requests, and proposes valid alternatives, leading to safer human-robot interaction.
  • Robust FP handling does not degrade feasible task performance.
  • The framework is general and extendable to broader robotic tasks, sensors, and interaction settings.

BibTeX

@inproceedings{hsieh2025do,
    title     = {Do What? Teaching Vision-Language-Action Models to Reject the Impossible},
    author    = {Wen-Han Hsieh and Elvis Hsieh and Dantong Niu and Trevor Darrell and Roei Herzig and David M. Chan},
    booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
    year      = {2025},
    url       = {https://arxiv.org/abs/2508.16292}
}