
IVA: Instruct, Verify, Act

Teaching VLA Models to Reject the Impossible

Abstract

IVA is a unified framework for Vision-Language-Action (VLA) models that detects when an instruction is unfulfillable (i.e., rests on a false premise), clarifies or corrects it in natural language, and then acts safely. Trained with paired true- and false-premise instructions, IVA improves false-premise handling while maintaining strong true-premise task performance.

Figure 1: IVA detects a false-premise instruction, clarifies, and proposes a valid alternative.

Why IVA?

  • Most VLAs assume user instructions are always feasible; in practice, people often issue commands with false premises (missing objects, attribute mismatches, or impossible actions).
  • Naïvely following such instructions is unsafe or wasteful. IVA introduces explicit detection and clarification/correction to keep interaction safe and productive.
  • We evaluate both In-Domain (plausible-but-absent) and Out-of-Domain (impossible/absurd) false premises, and train IVA to handle both categories.

Our Solution

  • Single-stage end-to-end tuning: IVA jointly learns false-premise detection, language clarification/correction, and action prediction (unlike two-stage baselines).
  • Inputs & Outputs: front camera and the previous 5 joint positions (angles) → predict the 2-D visual trace and the next action as an 8-D vector (7 joint velocities + binary gripper).
  • Architecture: frozen vision & language encoders; an autoregressive decoder fine-tuned with LoRA adapters for efficient training.
  • Instruction template: a structured prompt containing the robot type, control mode, task, and a short proprioceptive history enables grounded reasoning and safe actions (see the sketch below).
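
As a rough illustration of the structured prompt above, here is a minimal Python sketch; the field wording, units, and function name are illustrative assumptions, not IVA's exact template.

# Minimal sketch of the structured instruction prompt (field names, wording,
# and units are illustrative assumptions, not IVA's exact template).

def build_prompt(task: str, joint_history: list[list[float]]) -> str:
    """Compose a prompt with robot type, control mode, task, and a short
    proprioceptive history (the previous 5 joint positions)."""
    assert len(joint_history) == 5, "IVA conditions on the previous 5 joint states"
    history = "; ".join(
        ", ".join(f"{angle:.3f}" for angle in joints) for joints in joint_history
    )
    return (
        "Robot: 7-DoF manipulator with a parallel gripper. "
        "Control mode: joint velocity. "
        f"Task: {task}. "
        f"Previous joint positions (rad): {history}. "
        "If the instruction cannot be fulfilled in the current scene, say so "
        "and propose a valid alternative; otherwise predict the 2-D visual "
        "trace and the next 8-D action (7 joint velocities + gripper)."
    )

# Example usage with a dummy proprioceptive history
prompt = build_prompt("open the middle drawer", [[0.0] * 7 for _ in range(5)])

Keeping the proprioceptive history inside the prompt is what lets the decoder ground its accept/clarify decision in the current robot state rather than in the instruction text alone.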

How did we construct the dataset?

  • Built on RLBench trajectories across 9 tasks, with paired true- and false-premise prompts.
  • 800 episodes per task, with false premises injected at 10% of the steps; about 65% of episodes are In-Domain FP and 20% are Out-of-Domain FP (the rest are true-premise; see the sketch after this list).
  • In-Domain FP: plausible but absent objects (e.g., “open the middle block” in a drawer scene). Out-of-Domain FP: clearly infeasible requests (e.g., “open the top elephant”).
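
To make the mixing ratios above concrete, here is a hypothetical Python sketch of episode labeling and step-level false-premise injection; the function names and sampling scheme are our own illustration, and only the ratios come from the text above.

import random

# Hypothetical sketch of the dataset mixing described above; only the 65/20/15
# episode split and the 10% step-level injection rate come from the project text.
EPISODE_MIX = {"in_domain_fp": 0.65, "out_of_domain_fp": 0.20, "true_premise": 0.15}
FP_STEP_RATE = 0.10  # fraction of steps that receive a false-premise prompt

def label_episode(rng: random.Random) -> str:
    """Assign an episode type according to the 65/20/15 mix."""
    r, cumulative = rng.random(), 0.0
    for kind, p in EPISODE_MIX.items():
        cumulative += p
        if r < cumulative:
            return kind
    return "true_premise"

def false_premise_mask(num_steps: int, kind: str, rng: random.Random) -> list[bool]:
    """Mark roughly 10% of steps in FP episodes; true-premise episodes get none."""
    if kind == "true_premise":
        return [False] * num_steps
    return [rng.random() < FP_STEP_RATE for _ in range(num_steps)]

rng = random.Random(0)
kind = label_episode(rng)
mask = false_premise_mask(num_steps=100, kind=kind, rng=rng)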

Evaluation

  • Episodes: 9 RLBench tasks × 25 episodes each = 225 episodes with randomized object poses and paired prompts.
  • (1) Detection: parse the model’s text output to classify Accept (true-premise) vs. Clarify/Refuse (false-premise); per-step false-premise scores are averaged per episode.
  • (2) Execution: when IVA accepts, execute the predicted sequence of 8-D actions (7 joint velocities + gripper); task success is judged by the RLBench success detector.
  • (3) Overall: Detection and Execution outcomes are averaged over all 225 episodes into a single accuracy metric (a sketch follows this list).
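
Below is one plausible reading of the aggregation, written as a short Python sketch; the episode record layout and the way detection and execution credit are combined are assumptions made for illustration, not the paper's evaluation code.

from dataclasses import dataclass

# Hypothetical aggregation of per-episode results into the overall accuracy;
# the record format and scoring rule are assumptions for illustration.
@dataclass
class EpisodeResult:
    is_false_premise: bool
    detection_score: float   # per-episode average of per-step Accept vs. Clarify/Refuse correctness
    execution_success: bool  # RLBench success detector outcome (meaningful only when accepted)

def overall_accuracy(episodes: list[EpisodeResult]) -> float:
    """Average detection and execution outcomes over all 225 (9 tasks x 25) episodes."""
    scores = []
    for ep in episodes:
        if ep.is_false_premise:
            scores.append(ep.detection_score)           # credit correct Clarify/Refuse
        else:
            scores.append(float(ep.execution_success))  # credit successful execution
    return sum(scores) / len(scores)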

Results

  • False-premise detection: 100% (In-Domain) and 97.78% (Out-of-Domain).
  • Improvement over baseline: IVA markedly boosts false-premise detection and increases successful responses in FP scenarios (see Table 1 for per-task details).
  • True-premise performance: IVA maintains strong success (42.67% ± 8.34%), comparable to LLaRVA (38.67% ± 8.55%).

Table 1: Per-task Overall, False-Premise Detection (ID/OOD), and True-Premise Success for IVA vs. LLaRVA across 9 RLBench tasks.
Figure 2: Qualitative examples across 9 RLBench tasks showing IVA's false-premise handling and safe alternative suggestions.

Key Takeaways

  • IVA explicitly reasons about feasibility, clarifies impossible requests, and proposes valid alternatives, leading to safer human-robot interaction.
  • Robust FP handling does not degrade feasible task performance.
  • The framework is general and extendable to broader robotic tasks, sensors, and interaction settings.

BibTeX

@inproceedings{hsieh2025do,
    title     = {Do What? Teaching Vision-Language-Action Models to Reject the Impossible},
    author    = {Wen-Han Hsieh and Elvis Hsieh and Dantong Niu and Trevor Darrell and Roei Herzig and David M. Chan},
    booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
    year      = {2025},
    url       = {https://arxiv.org/abs/2508.16292}
}