Learning Intuitive Physics from Video to Improve Reinforcement Learning

Presenter: Shhreya Anand

Faculty Sponsor: Bruno C. da Silva

School: UMass Amherst

Research Area: Artificial Intelligence

Session: Poster Session 4, 2:15 PM - 3:00 PM, 163, C31

ABSTRACT

Reinforcement learning (RL) agents trained directly from pixels remain sample-inefficient and brittle, particularly in environments governed by physical dynamics. Because perception and control are learned simultaneously from reward signals, agents must rediscover basic physical regularities, such as object permanence and motion continuity, through costly trial and error. In contrast, humans acquire intuitive physics through observation before engaging in goal-directed behavior.

This project investigates whether self-supervised predictive world models can provide RL agents with such priors. Specifically, I evaluate Video Joint Embedding Predictive Architecture (V-JEPA) models, pretrained on large-scale video to capture latent physical dynamics, as representation learners for control. By learning physical dynamics from video through self-supervised prediction, this approach separates perception from control and equips RL agents with intuitive-physics priors before reward-based training begins.

I systematically compare three conditions using Proximal Policy Optimization (PPO) agents: (1) a baseline CNN trained end-to-end from scratch, (2) a frozen pretrained V-JEPA encoder with a learned policy head, and (3) a fine-tuned V-JEPA encoder jointly optimized with RL. Experiments progress from CartPole to partially observable variants with occlusion, and finally to a robotic cube-pushing task involving real physical interactions.
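The three conditions above differ mainly in how the visual encoder is treated during policy optimization. The following PyTorch sketch illustrates that distinction; the `Encoder` class is a small hypothetical CNN stand-in, not the actual V-JEPA architecture, and `make_agent` is an illustrative helper, not code from the project:

```python
# Sketch of the three encoder conditions. The Encoder below is a
# small hypothetical stand-in; a real setup would load pretrained
# V-JEPA weights for the "frozen" and "finetune" conditions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy visual encoder standing in for a pretrained V-JEPA model."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

def make_agent(condition, n_actions=2, latent_dim=32):
    """Build (encoder, policy_head) for one of the three conditions."""
    encoder = Encoder(latent_dim)
    if condition == "frozen":
        # Condition (2): pretrained encoder held fixed; only the
        # policy head receives PPO gradient updates.
        for p in encoder.parameters():
            p.requires_grad = False
    # Conditions (1) "scratch" and (3) "finetune" both leave the
    # encoder trainable; they differ only in whether pretrained
    # weights are loaded before RL training starts.
    policy_head = nn.Linear(latent_dim, n_actions)
    return encoder, policy_head

encoder, policy_head = make_agent("frozen")
obs = torch.randn(1, 3, 64, 64)        # dummy pixel observation
logits = policy_head(encoder(obs))      # action logits fed to PPO
```

In the frozen condition, only the policy head's parameters appear in the optimizer, so pretrained representations are evaluated purely as fixed features; fine-tuning instead lets reward gradients reshape the encoder.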

By testing whether video-derived physics knowledge transfers to embodied control, this work provides a foundational evaluation of predictive world modeling as a scalable precursor to reinforcement learning.