High-Level Concept
Blue cells denote the input video or input views, while arrows and dots indicate generated continuous videos or sparse frames.
Camera-control V2V models such as ReCamMaster (Bai et al., ICCV 2025) and Generative Camera Dolly (Van Hoorick et al., ECCV 2024) modify only the camera trajectory while keeping time strictly monotonic.
4D multi-view models such as Cat4D (Wu et al., CVPR 2024) and Diffusion4D (Liang et al., NeurIPS 2024) synthesize discrete, sparse views conditioned on both space and time, but do not generate continuous temporal sequences.
SpaceTimePilot enables free movement along both the camera and time axes with full control over direction and speed, supporting bullet-time, slow motion, reverse playback, and mixed space–time trajectories.
Results
We present four temporal-trajectory examples, each paired with several arbitrary camera trajectories in the scene.
Click any video to view it in focus mode.
Bullet Time
Reverse Motion
Zigzag Motion
Slow Motion
Methods
Temporal Warping Augmentation for Training: We introduce temporal warping augmentation to diversify temporal patterns in multi-view dynamic videos. Training on warped video pairs encourages disentanglement of camera motion (space) and scene dynamics (time) in video diffusion models.
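The augmentation can be pictured as resampling a clip along a random monotonic time warp, so the same scene appears at varied speeds and pacings during training. This is a minimal sketch of one plausible realization (a piecewise-linear warp with random slopes); the paper's actual warping scheme may differ.

```python
import numpy as np

def temporal_warp(frames: np.ndarray, num_knots: int = 4, rng=None) -> np.ndarray:
    """Resample a video (T, H, W, C) along a random monotonic time warp.

    A piecewise-linear, strictly increasing map [0, 1] -> [0, 1] is drawn,
    and each output frame is pulled from the nearest warped source index.
    """
    rng = np.random.default_rng(rng)
    T = len(frames)
    # Random positive increments define an increasing warp over num_knots knots.
    knots_in = np.linspace(0.0, 1.0, num_knots)
    deltas = rng.random(num_knots - 1) + 0.1      # strictly positive slopes
    knots_out = np.concatenate([[0.0], np.cumsum(deltas)])
    knots_out /= knots_out[-1]                    # normalize so the warp ends at 1
    # Map output timestamps through the warp, then to source frame indices.
    t_out = np.linspace(0.0, 1.0, T)
    t_src = np.interp(t_out, knots_in, knots_out)
    idx = np.clip(np.round(t_src * (T - 1)).astype(int), 0, T - 1)
    return frames[idx]
```

Training on (original, warped) pairs of the same scene then exposes the model to identical content under different temporal patterns, which is what encourages the space/time disentanglement.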
Cam×Time Dataset
To further strengthen disentanglement, we introduce a new dataset that spans the full grid of camera–time combinations along a trajectory. Our synthetic Cam×Time dataset contains 360K videos rendered from 1,000 animations across 100 scenes and three camera paths. Each path provides full-motion sequences for every camera pose, yielding dense multi-view and full-temporal coverage. This rich supervision enables effective disentanglement of spatial and temporal control.
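The "full grid of camera-time combinations" amounts to rendering the complete animation at every camera pose along each path. A small sketch of that enumeration, where `render(pose, t)` is a hypothetical renderer callable standing in for the actual rendering pipeline:

```python
def full_grid(render, poses, timestamps):
    """Render every (camera pose, timestamp) cell along one camera path.

    `render(pose, t)` is a hypothetical callable (not part of any released
    code) returning the frame for a single grid cell. The result is a dense
    grid: row i is the full-motion sequence seen from pose i.
    """
    return [[render(pose, t) for t in timestamps] for pose in poses]
```

Slicing this grid along a row gives a fixed-camera video, along a column a frozen-time (bullet-time) sweep, and along a diagonal a mixed space-time trajectory, which is exactly the supervision the dataset provides.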
General Example
Here is what our dataset looks like:
Full-Grid Rendering
Original Video
Full-Grid Rendering
We do this for all captured animations.
AR Demos
We demonstrate the results of our proposed autoregressive inference pipeline, which leverages the model's ability to start generation from arbitrary spatial viewpoints and temporal positions. This flexibility enables multi-turn inference while maintaining temporal, camera, and contextual consistency across segments. Here, we showcase a three-turn autoregressive example featuring large camera transitions.
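The multi-turn loop can be sketched as follows: each turn conditions on the tail of the previous segment plus a new camera/time trajectory, and the segments are concatenated into one continuous video. `generate` here is a hypothetical stand-in for the diffusion sampler, not the actual model interface.

```python
def autoregressive_rollout(generate, source_clip, turns):
    """Multi-turn autoregressive inference sketch.

    `generate(context, trajectory)` is a hypothetical sampler call that
    returns one generated segment (a list of frames) conditioned on the
    context frames and a requested camera/time trajectory.
    """
    segments = []
    context = source_clip
    for trajectory in turns:
        seg = generate(context, trajectory)
        segments.append(seg)
        # Condition the next turn on the tail of this segment
        # (context length assumed equal to the source clip length).
        context = seg[-len(source_clip):]
    # Flatten all turns into one continuous video.
    return [frame for seg in segments for frame in seg]
```

With three turns of 81 frames each, this yields the 243-frame results shown below.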
81 frames
Source Video
243 frames
AR Demonstration