arXiv Preprint
Training-free adaptive computation for long-horizon interactive video generation.
¹Zhejiang University, ²NVIDIA
*Corresponding author
Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps.
We present Light Interaction, a training-free acceleration framework for interactive video world models. It combines three components: adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D sparse attention with fused kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to a 2.59× speedup without model retraining while maintaining competitive visual quality.
Adaptive Context Management
Selects valid temporal context and retrieved spatial memory using camera-pose-aware similarity and local latent dynamics.
Denoising Cache Acceleration
Reuses early-step model outputs during reliable revisiting while preserving the final denoising step for quality correction.
3D Sparse Attention with Fused Kernels
Sparsifies historical visual KV blocks and uses fused kernels to reduce layout conversion and gather/scatter overhead.
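As a rough illustration of the denoising cache idea described above, the following minimal sketch reuses cached outputs for early denoising steps and always recomputes the final step. It is not the paper's implementation: the function name `denoise_chunk`, the `reliable_revisit` flag, the per-step dictionary cache, the scalar latent, and the simple update rule are all assumptions made for this example. How a revisit is judged "reliable" is not specified here and is passed in as a plain boolean.

```python
from typing import Callable, Dict, Tuple


def denoise_chunk(
    model: Callable[[float, int], float],
    latent: float,
    num_steps: int,
    cache: Dict[int, float],
    reliable_revisit: bool,
    step_size: float = 0.25,
) -> Tuple[float, int]:
    """Toy denoising loop with early-step output reuse (hypothetical sketch).

    Returns the final latent and the number of model calls actually made.
    """
    calls = 0
    for step in range(num_steps):
        is_final = step == num_steps - 1
        if reliable_revisit and not is_final and step in cache:
            # Early step during a reliable revisit: reuse the cached output.
            output = cache[step]
        else:
            # Final step, cache miss, or unreliable revisit: run the model.
            output = model(latent, step)
            calls += 1
            if not is_final:
                # Only early-step outputs are ever reused, so only they are stored.
                cache[step] = output
        # Assumed toy update rule; a real sampler would use its own scheduler.
        latent = latent - step_size * output
    return latent, calls


if __name__ == "__main__":
    # Stand-in for a real denoiser: any callable taking (latent, step) works.
    toy_model = lambda latent, step: 0.5 * latent
    cache: Dict[int, float] = {}
    _, first_calls = denoise_chunk(toy_model, 1.0, 4, cache, reliable_revisit=False)
    _, revisit_calls = denoise_chunk(toy_model, 1.0, 4, cache, reliable_revisit=True)
    print("model calls on first visit:", first_calls)
    print("model calls on reliable revisit:", revisit_calls)
```

In this toy setting the saving comes entirely from skipped model calls on early steps. The final step still runs the model on the current latent, which is the part the description above credits with quality correction.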
@misc{lu2026lightinteractiontrainingfreeinference,
title = {Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models},
author = {Jiacheng Lu and Haoyi Zhu and Sipei Yi and Enze Xie and Yu Li and Cheng Zhuo},
year = {2026},
eprint = {2605.31158},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
doi = {10.48550/arXiv.2605.31158},
url = {https://arxiv.org/abs/2605.31158}
}