MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction
Published in arXiv preprint, 2026
MolmoMotion treats motion forecasting as central to visual intelligence: to plan and reason about the physical world, agents must anticipate how objects will move. We propose 3D points in world coordinates as a general motion representation that is class-agnostic, view-stable, compact, and directly useful.
We introduce MolmoMotion-1M, a dataset of 1.16 million videos with annotated 3D trajectories; PointMotionBench, a benchmark for evaluation; and a forecasting model that supports both autoregressive and flow-matching approaches.
Recommended citation: Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna. (2026). "MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction." arXiv preprint arXiv:2606.18558.
