Zhening Huang1,2, Hyeonho Jeong2, Xuelin Chen2, Yulia Gryaditskaya2

Tuanfeng Y. Wang2, Joan Lasenby1, Chun-Hao Huang2

1University of Cambridge, 2Adobe Research

TLDR: SpaceTimePilot disentangles space and time in a video diffusion model for controllable generative rendering. Given a single input video of a dynamic scene, SpaceTimePilot can freely steer both the camera viewpoint and the temporal motion within the scene, enabling free exploration across the 4D space–time domain.

High-Level Concept

Comparison of camera-control models, 4D multi-view models, and SpaceTimePilot

Blue cells denote the input video or input views, while arrows and dots indicate generated continuous videos or sparse frames.

Camera-control V2V models such as ReCamMaster (Bai et al., ICCV 2025) and Generative Camera Dolly (Van Hoorick et al., ECCV 2024) modify only the camera trajectory while keeping time strictly monotonic.

4D multi-view models such as Cat4D (Wu et al., CVPR 2024) and Diffusion4D (Liang et al., NeurIPS 2024) synthesize discrete, sparse views conditioned on both space and time, but do not generate continuous temporal sequences.

SpaceTimePilot enables free movement along both the camera and time axes with full control over direction and speed, supporting bullet-time, slow motion, reverse playback, and mixed space–time trajectories.
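To make these space–time trajectories concrete, the sketch below shows one simple way to parameterize them as per-frame (camera pose, scene time) pairs. The `spacetime_trajectory` helper, the orbit path, and the specific numbers are illustrative placeholders, not the model's actual interface.

```python
import numpy as np

def spacetime_trajectory(cam_path, time_curve):
    """Pair a camera path with a time curve, frame by frame.

    cam_path:   length-N sequence of camera poses, one per output frame.
    time_curve: length-N sequence of scene-time values, one per output frame.
    Returns a list of (camera_pose, scene_time) conditioning pairs.
    """
    assert len(cam_path) == len(time_curve)
    return list(zip(cam_path, time_curve))

N = 81                                   # number of output frames (hypothetical)
orbit = np.linspace(0, 2 * np.pi, N)     # stand-in for an orbiting camera path (angles)

bullet_time  = spacetime_trajectory(orbit, np.full(N, 40.0))       # camera moves, scene time frozen
slow_motion  = spacetime_trajectory(orbit, np.linspace(0, 20, N))  # time advances at 1/4 speed (0..20 instead of 0..80)
reverse_play = spacetime_trajectory(orbit, np.linspace(80, 0, N))  # scene time runs backwards
zigzag       = spacetime_trajectory(                               # scene time oscillates back and forth
    orbit, 40 + 40 * np.abs(np.sin(np.linspace(0, 4 * np.pi, N))))
```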

Results

We present four temporal-trajectory examples, each paired with several arbitrary camera trajectories in the scene.

Bullet Time

Reverse Motion

Zigzag Motion

Slow Motion

Methods

Temporal Warping Augmentation for Training: We introduce a temporal warping augmentation to diversify the temporal patterns in multi-view dynamic videos. Training on temporally warped video pairs encourages the video diffusion model to disentangle camera motion (space) from scene dynamics (time).

Temporal Warping Augmentation
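A minimal sketch of how such a temporal warping augmentation could be implemented, assuming training videos are stored as (T, H, W, C) frame arrays. The piecewise-constant speed profile and the function names below are illustrative assumptions, not the exact augmentation used for training.

```python
import numpy as np

def random_time_warp(num_frames, min_speed=0.25, max_speed=4.0, allow_reverse=True, rng=None):
    """Sample a random mapping from output frame index to source frame index.

    A few piecewise-constant playback speeds (optionally negative) are
    integrated into a time curve, then rescaled to the valid frame range.
    """
    rng = np.random.default_rng() if rng is None else rng
    speeds = rng.uniform(min_speed, max_speed, size=4)          # a few random speed segments
    if allow_reverse:
        speeds *= rng.choice([-1.0, 1.0], size=speeds.shape)    # some segments play backwards
    per_frame_speed = np.repeat(speeds, int(np.ceil(num_frames / len(speeds))))[:num_frames]
    t = np.cumsum(per_frame_speed)
    t = (t - t.min()) / (t.max() - t.min() + 1e-8) * (num_frames - 1)
    return np.round(t).astype(int)

def warp_video(frames, warp):
    """frames: (T, H, W, C) array; warp: (T,) array of source frame indices."""
    return frames[warp]

# Hypothetical training pair: the same scene seen from two cameras, one with
# original timing and one with warped timing, so camera motion and scene
# dynamics differ between the two clips.
# warp = random_time_warp(len(video_cam_b))
# pair = (video_cam_a, warp_video(video_cam_b, warp))
```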

Cam×Time Dataset

To further strengthen disentanglement, we introduce a new dataset that spans the full grid of camera–time combinations along a trajectory. Our synthetic Cam×Time dataset contains 360K videos rendered from 1,000 animations across 100 scenes and three camera paths. Each path provides full-motion sequences for every camera pose, yielding dense multi-view and full-temporal coverage. This rich supervision enables effective disentanglement of spatial and temporal control.
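To illustrate what the full grid means in practice, the sketch below treats one scene as a dense (camera pose × animation time) array of rendered frames, so exact ground truth exists for any joint camera/time trajectory. The array shapes and the `render_supervision_clip` helper are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical layout of one Cam×Time scene: a frame is rendered for every
# (camera pose, animation time) combination along a trajectory, so the frames
# form a dense 2D grid.
num_cams, num_times = 81, 81                                  # illustrative grid resolution
frame_grid = np.zeros((num_cams, num_times, 64, 64, 3), dtype=np.uint8)  # placeholder frames

def render_supervision_clip(cam_indices, time_indices):
    """Assemble a ground-truth clip for any joint camera/time trajectory."""
    return frame_grid[cam_indices, time_indices]

# Because the grid is dense, bullet-time, reverse, or mixed trajectories all
# have exact ground truth available as supervision.
cams  = np.arange(81)                  # sweep through all camera poses
times = np.full(81, 40)                # while scene time stays frozen (bullet time)
clip = render_supervision_clip(cams, times)    # shape (81, 64, 64, 3)
```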

General Example

This is what our dataset looks like:

Full-Grid Rendering

Original Video

Full-Grid Rendering

We render this full grid for every animation in the dataset.

AR Demos

We demonstrate the results of the proposed autoregressive inference pipeline, which leverages our model's ability to start generation from arbitrary spatial viewpoints and temporal positions. This flexibility enables multi-turn inference while maintaining temporal, camera, and contextual consistency across segments. Here, we showcase three-turn autoregressive examples featuring large camera transitions.
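A high-level sketch of what such a multi-turn rollout could look like, assuming a hypothetical `generate_segment` interface that conditions on a few context frames plus a per-segment camera and time trajectory; this illustrates the idea rather than the released inference API.

```python
def autoregressive_rollout(model, source_video, turns, context_frames=8):
    """Chain several generation turns into one long space–time exploration.

    model.generate_segment is a hypothetical interface: it takes conditioning
    frames plus a (camera trajectory, time trajectory) pair and returns the
    frames of the next segment.
    turns: list of (cam_traj, time_traj) pairs, one per segment.
    """
    segments = []
    context = source_video[-context_frames:]     # start from the input video
    for cam_traj, time_traj in turns:
        segment = model.generate_segment(
            context=context,                     # keeps appearance and motion consistent
            camera_trajectory=cam_traj,
            time_trajectory=time_traj,
        )
        segments.append(segment)
        context = segment[-context_frames:]      # the next turn continues from this segment
    return segments                              # e.g. 3 turns of 81 frames ≈ 243 frames total
```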

Source Video (81 frames)

AR Demonstration (243 frames)

Source Video (81 frames)

AR Demonstration (243 frames)