The Shift from Perception to Prediction
Machine vision has become adept at tracking what has already occurred, but the next frontier of autonomy requires anticipating what comes next. Modern models can follow objects in a scene with high confidence, yet this capability remains retrospective. MolmoMotion addresses that gap by introducing language-guided 3D motion forecasting, allowing models to predict future trajectories based on written instructions.
The model functions by taking an RGB observation, specific query points on an object, and an action description to predict how those points will move in 3D space. This capability is essential for any system requiring physical plausibility, such as a robot anticipating the movement of a cup before contact or a video generator producing realistic subsequent frames.
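The inputs and output described above can be sketched as a small data structure. This is a minimal illustration, not MolmoMotion's actual interface: the names `MotionQuery` and `forecast_static` are hypothetical, the query points are assumed to be 3D coordinates, and the "forecaster" is a trivial stand-in that predicts no motion at all.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class MotionQuery:
    """Hypothetical container for the three inputs described in the text."""
    rgb: Any                     # the RGB observation, e.g. an H x W x 3 array
    query_points: List[Point3D]  # assumed here to be 3D points on the object
    action: str                  # the action description in plain language


def forecast_static(query: MotionQuery, horizon: int) -> List[List[Point3D]]:
    """Stand-in for a learned forecaster: predicts that nothing moves.

    Returns one list of points per future step, so the result is indexed
    as trajectory[step][point].
    """
    return [list(query.query_points) for _ in range(horizon)]


query = MotionQuery(
    rgb=None,  # no image is needed for this placeholder baseline
    query_points=[(0.1, 0.0, 0.5), (0.2, 0.0, 0.5)],
    action="push the cup to the left",
)
trajectory = forecast_static(query, horizon=4)
```

A real model would replace `forecast_static` with a function whose output depends on the image and the action text; the shape of the result (future steps by points by three coordinates) is the part this sketch is meant to convey.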
The scale of this release is significant for the research community. Alongside the model, the researchers are publishing MolmoMotion-1M, which stands as the largest collection of 3D point trajectories paired with action descriptions, sourced from 1.16M videos. To provide a standard for evaluation, they are also releasing PointMotionBench, a human-validated benchmark containing 2.7K video clips designed to measure object-centric 3D motion forecasting accuracy.
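To make "forecasting accuracy" concrete, one common way to score a predicted trajectory is the average displacement error: the mean Euclidean distance between predicted and true point positions over all steps and points. The source does not say which metric PointMotionBench uses, so the sketch below is an assumption for illustration only, and the function name is hypothetical.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean 3D distance between matching points across all future steps.

    Both arguments are indexed as trajectory[step][point], where each
    point is an (x, y, z) tuple.
    """
    total = 0.0
    count = 0
    for pred_step, true_step in zip(predicted, ground_truth):
        for pred_point, true_point in zip(pred_step, true_step):
            total += math.dist(pred_point, true_point)
            count += 1
    if count == 0:
        raise ValueError("trajectories must contain at least one point")
    return total / count


# One point over two steps: perfect at step 0, off by 5 units at step 1.
predicted = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
ground_truth = [[(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)]]
error = average_displacement_error(predicted, ground_truth)
```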
The architecture of MolmoMotion relies on representing motion as object-attached 3D points in world space. This representation captures motion without the computational cost of rendering full video. Among its critical properties, it is class-agnostic, meaning it is not tied to templates for specific categories like hands or rigid objects, and it is view-stable.
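One way to read "view-stable" is that a point expressed in world space keeps the same coordinates no matter which camera observes it. The sketch below illustrates that reading with a simplified camera that has only a yaw angle and a position; the helper names and the camera model are assumptions for illustration, not details from the release.

```python
import math


def rotate_y(point, angle):
    """Rotate a 3D point about the vertical (y) axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def world_to_camera(p_world, yaw, position):
    """Express a world-space point in the frame of a camera."""
    shifted = tuple(a - b for a, b in zip(p_world, position))
    return rotate_y(shifted, -yaw)


def camera_to_world(p_cam, yaw, position):
    """Inverse of world_to_camera: lift a camera-frame point to world space."""
    rotated = rotate_y(p_cam, yaw)
    return tuple(a + b for a, b in zip(rotated, position))


point = (1.0, 0.5, 2.0)                   # a point attached to an object
camera_a = (0.0, (0.0, 0.0, 0.0))         # (yaw, position)
camera_b = (0.7, (0.5, 0.0, -1.0))

# The same point has different coordinates in each camera's frame...
seen_by_a = world_to_camera(point, *camera_a)
seen_by_b = world_to_camera(point, *camera_b)

# ...but both views map back to the same world-space coordinates.
world_from_a = camera_to_world(seen_by_a, *camera_a)
world_from_b = camera_to_world(seen_by_b, *camera_b)
```

Because the two recovered points agree (up to floating-point rounding), a trajectory stored in world space does not change when the viewpoint does, which is the property the camera-frame coordinates lack.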
This development has direct implications for downstream applications in robotics planning and trajectory-conditioned video generation. By releasing the model weights, the MolmoMotion-1M dataset, and the PointMotionBench benchmark openly, the researchers are providing the tools the community needs to customize and improve motion forecasting.
The central question for developers is no longer how well a machine can see the present, but how accurately it can simulate the immediate future.