MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Published in arXiv preprint, 2026

MolmoMotion treats motion forecasting as central to visual intelligence: to plan and reason about the physical world, agents must anticipate how objects will move. We propose 3D points in world coordinates as a general motion representation that is class-agnostic, view-stable, compact, and directly useful for planning and physical reasoning.

We introduce MolmoMotion-1M, a dataset of 1.16 million videos annotated with 3D trajectories; PointMotionBench, a benchmark for evaluation; and a forecasting model that supports both autoregressive and flow-matching approaches.

Recommended citation: Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna. (2026). "MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction." arXiv preprint arXiv:2606.18558.

[Paper] [Project] [Data] [Model] [Video]
