High-Level Concept
Blue cells denote the input video or input views, while arrows and dots indicate generated continuous videos or sparse frames.
Camera-control V2V models such as ReCamMaster (Bai et al., ICCV 2025) and Generative Camera Dolly (Van Hoorick et al., ECCV 2024) modify only the camera trajectory while keeping time strictly monotonic.
4D multi-view models such as Cat4D (Wu et al., CVPR 2024) and Diffusion4D (Liang et al., NeurIPS 2024) synthesize discrete, sparse views conditioned on both space and time, but do not generate continuous temporal sequences.
SpaceTimePilot enables free movement along both the camera and time axes with full control over direction and speed, supporting bullet-time, slow motion, reverse playback, and mixed space–time trajectories.
Results
We present four temporal-trajectory examples, each paired with several arbitrary camera trajectories in the scene.
Click any video to view it in focus mode.
Bullet Time
Reverse Motion
Zigzag Motion
Slow Motion
Methods
Temporal Warping Augmentation for Training: We introduce temporal warping augmentation to diversify temporal patterns in multi-view dynamic videos. Training on warped video pairs encourages disentanglement of camera motion (space) and scene dynamics (time) in video diffusion models.
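The augmentation can be pictured as resampling a clip along a random monotonic time warp, so the same scene appears at varied speeds and pacings during training. This is a minimal sketch of one plausible realization (a piecewise-linear warp with random slopes); the paper's actual warping scheme may differ.

```python
import numpy as np

def temporal_warp(frames: np.ndarray, num_knots: int = 4, rng=None) -> np.ndarray:
    """Resample a video (T, H, W, C) along a random monotonic time warp.

    A piecewise-linear, strictly increasing map [0, 1] -> [0, 1] is drawn,
    and each output frame is pulled from the nearest warped source index.
    """
    rng = np.random.default_rng(rng)
    T = len(frames)
    # Random positive increments define an increasing warp over num_knots knots.
    knots_in = np.linspace(0.0, 1.0, num_knots)
    deltas = rng.random(num_knots - 1) + 0.1      # strictly positive slopes
    knots_out = np.concatenate([[0.0], np.cumsum(deltas)])
    knots_out /= knots_out[-1]                    # normalize so the warp ends at 1
    # Map output timestamps through the warp, then to source frame indices.
    t_out = np.linspace(0.0, 1.0, T)
    t_src = np.interp(t_out, knots_in, knots_out)
    idx = np.clip(np.round(t_src * (T - 1)).astype(int), 0, T - 1)
    return frames[idx]
```

Training on (original, warped) pairs of the same scene then exposes the model to identical content under different temporal patterns, which is what encourages the space/time disentanglement.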
Cam×Time Dataset
To further strengthen disentanglement, we introduce a new dataset that spans the full grid of camera–time combinations along a trajectory. Our synthetic Cam×Time dataset contains 360K videos rendered from 1,000 animations across 100 scenes and three camera paths. Each path provides full-motion sequences for every camera pose, yielding dense multi-view and full-temporal coverage. This rich supervision enables effective disentanglement of spatial and temporal control.
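The "full grid of camera-time combinations" amounts to rendering the complete animation at every camera pose along each path. A small sketch of that enumeration, where `render(pose, t)` is a hypothetical renderer callable standing in for the actual rendering pipeline:

```python
def full_grid(render, poses, timestamps):
    """Render every (camera pose, timestamp) cell along one camera path.

    `render(pose, t)` is a hypothetical callable (not part of any released
    code) returning the frame for a single grid cell. The result is a dense
    grid: row i is the full-motion sequence seen from pose i.
    """
    return [[render(pose, t) for t in timestamps] for pose in poses]
```

Slicing this grid along a row gives a fixed-camera video, along a column a frozen-time (bullet-time) sweep, and along a diagonal a mixed space-time trajectory, which is exactly the supervision the dataset provides.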
General Example
Here is what our dataset looks like:
Full-Grid Rendering
Original Video
Full-Grid Rendering
We do this for all captured animations.
AR Demos
We demonstrate the results of our proposed autoregressive inference pipeline, which leverages the model's ability to start generation from arbitrary spatial viewpoints and temporal positions. This flexibility enables multi-turn inference while maintaining temporal, camera, and contextual consistency across segments. Here, we showcase a three-turn autoregressive example featuring large camera transitions.
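The multi-turn loop can be sketched as follows: each turn conditions on the tail of the previous segment plus a new camera/time trajectory, and the segments are concatenated into one continuous video. `generate` here is a hypothetical stand-in for the diffusion sampler, not the actual model interface.

```python
def autoregressive_rollout(generate, source_clip, turns):
    """Multi-turn autoregressive inference sketch.

    `generate(context, trajectory)` is a hypothetical sampler call that
    returns one generated segment (a list of frames) conditioned on the
    context frames and a requested camera/time trajectory.
    """
    segments = []
    context = source_clip
    for trajectory in turns:
        seg = generate(context, trajectory)
        segments.append(seg)
        # Condition the next turn on the tail of this segment
        # (context length assumed equal to the source clip length).
        context = seg[-len(source_clip):]
    # Flatten all turns into one continuous video.
    return [frame for seg in segments for frame in seg]
```

With three turns of 81 frames each, this yields the 243-frame results shown below.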
81 frames
Source Video
243 frames
AR Demonstration