Presenter: Jeffrey Jiawen Deng
Faculty Sponsor: Abhidip Bhattacharyya
School: UMass Amherst
Research Area: Artificial Intelligence
Session: Poster Session 4, 2:15 PM - 3:00 PM, 163, C30
ABSTRACT
Vision-Language-Action (VLA) models integrate natural language understanding, visual perception, and robotic control to solve complex, multi-modal, embodied Artificial Intelligence tasks, and they have achieved remarkable progress due to the availability of large-scale data, advances in transformer-based multi-modal representation learning, and imitation-learning policy training pipelines. Recent work, however, indicates that these models can be brittle, relying on superficial pixel correlations rather than robust semantic grounding. We investigate the Compositional Generalization Gap in VLA models by systematically testing their visual and linguistic understanding within a robotic simulation environment. Our methodology uses the LIBERO simulation suite to evaluate open-source models such as OpenVLA and SmolVLA, quantifying visual brittleness through high-throughput parallelized rendering of visual perturbations (e.g., lighting intensity, camera viewpoint shifts, and texture randomization) and assessing language neglect through adversarial linguistic instructions (e.g., semantic rephrasing). We apply an optimization algorithm to automatically identify worst-case adversarial scenarios that combine visual and linguistic noise, yielding a detailed taxonomy of failure modes and a quantitative measure of performance degradation under compositional noise. Our results highlight critical safety gaps in current embodied AI architectures, moving the field toward more robust, general-purpose robotic agents.
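The worst-case search described in the abstract can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual evaluation code: the perturbation ranges, the `rollout_success_rate` stand-in (a toy degradation model replacing a real LIBERO policy rollout), and the use of plain random search are all assumptions made for the sketch.

```python
import random

# Hypothetical stand-in for a VLA policy rollout in a LIBERO task;
# a real evaluation would render the perturbed scene and execute the
# policy for n_episodes, returning the empirical success rate.
def rollout_success_rate(lighting, cam_yaw_deg, texture_seed, n_episodes=20):
    # Toy degradation model (assumption): success drops as lighting and
    # camera viewpoint move away from their nominal values.
    p = max(0.0, 0.9 - 0.4 * abs(lighting - 1.0) - 0.01 * abs(cam_yaw_deg))
    rng = random.Random(texture_seed)
    return sum(rng.random() < p for _ in range(n_episodes)) / n_episodes

def worst_case_search(n_trials=200, seed=0):
    """Random search over combined visual perturbations for the
    configuration that most degrades task success."""
    rng = random.Random(seed)
    worst_cfg, worst_score = None, float("inf")
    for _ in range(n_trials):
        cfg = dict(
            lighting=rng.uniform(0.2, 2.0),      # intensity multiplier
            cam_yaw_deg=rng.uniform(-30, 30),    # viewpoint shift (degrees)
            texture_seed=rng.randrange(10_000),  # texture randomization
        )
        score = rollout_success_rate(**cfg)
        if score < worst_score:
            worst_cfg, worst_score = cfg, score
    return worst_cfg, worst_score

baseline = rollout_success_rate(lighting=1.0, cam_yaw_deg=0.0, texture_seed=0)
worst_cfg, worst_score = worst_case_search()
```

In the actual study, the inner loop would dispatch parallelized simulation rollouts, and the random search could be replaced by a more sample-efficient optimizer; adversarial linguistic rephrasings would enter as an additional axis of the configuration space.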