PAR is a "world model" for robotic manipulation that treats video frames and robot actions as "physical tokens." It lets robots learn physical dynamics from large-scale video pre-training, with no action-specific pre-training required, and achieves state-of-the-art results on ManiSkill benchmarks.
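To make the "physical tokens" idea concrete, here is a minimal sketch, not the paper's implementation. It assumes that each timestep's frame and action are placed side by side in one shared sequence, and that the model is trained by next-token prediction; neither detail is stated in the summary above. The names `Token`, `interleave`, and `next_token_pairs` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical token representation: (kind, timestep),
# where kind is "frame" or "action". Real tokens would be embeddings.
Token = Tuple[str, int]


def interleave(num_steps: int) -> List[Token]:
    """Build one sequence that alternates frame and action tokens per timestep."""
    seq: List[Token] = []
    for t in range(num_steps):
        seq.append(("frame", t))
        seq.append(("action", t))
    return seq


def next_token_pairs(seq: List[Token]) -> List[Tuple[List[Token], Token]]:
    """Assumed training setup: every prefix is paired with the token that follows it."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


seq = interleave(3)
pairs = next_token_pairs(seq)
for context, target in pairs:
    print(len(context), "->", target)
```

Under these assumptions, frames and actions share one prediction objective, which is one way video-only pre-training could carry over to action prediction.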

Paper

arXiv: 2508.01234

embodied, research, vision