arXiv Preprint

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Training-free adaptive computation for long-horizon interactive video generation.

Jiacheng Lu¹, Haoyi Zhu², Sipei Yi¹, Enze Xie², Yu Li¹*, Cheng Zhuo¹

¹Zhejiang University, ²NVIDIA

*Corresponding author

2.59× speedup on HY-WorldPlay
24.81 PSNR vs. original model
21.91 GB peak memory reduction (76.57 → 54.66 GB)

Summary

Abstract

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps.

We present Light Interaction, a training-free acceleration framework for interactive video world models. It combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D sparse attention with fused kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59× speedup without model retraining while maintaining competitive visual quality.

Method

Overview

Figure: Light Interaction overview. Light Interaction combines adaptive context management, denoising cache acceleration, and AR-aware 3D sparse attention.

01

Adaptive Context Management

Selects valid temporal context and retrieved spatial memory using camera-pose-aware similarity and local latent dynamics.
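
To make the idea of pose-aware selection concrete, here is a minimal Python sketch. It is not the authors' implementation: the similarity formula, the helper names `pose_similarity` and `select_context`, and the parameters `pos_scale`, `recent`, and `top_k` are all assumptions made for illustration, and the local latent dynamics term mentioned above is omitted.

```python
import math

def pose_similarity(pose_a, pose_b, pos_scale=1.0):
    """Hypothetical similarity between two camera poses.

    Each pose is (position, forward); both are 3-tuples and forward is a
    unit vector. The score is high when the cameras are close together
    and look in a similar direction.
    """
    (pos_a, fwd_a), (pos_b, fwd_b) = pose_a, pose_b
    distance = math.dist(pos_a, pos_b)
    cos_view = sum(a * b for a, b in zip(fwd_a, fwd_b))
    # Opposite-facing views share little content, so clamp negatives to zero.
    return math.exp(-distance / pos_scale) * max(cos_view, 0.0)

def select_context(current_pose, history_poses, recent=1, top_k=1):
    """Keep the `recent` latest chunks (temporal context) plus the `top_k`
    most pose-similar older chunks (retrieved spatial memory)."""
    n = len(history_poses)
    temporal = list(range(max(0, n - recent), n))
    older = range(0, max(0, n - recent))
    ranked = sorted(
        older,
        key=lambda i: pose_similarity(current_pose, history_poses[i]),
        reverse=True,
    )
    spatial = sorted(ranked[:top_k])
    return spatial + temporal

# Toy trajectory: the camera walks away and then returns near its start.
history = [
    ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
    ((5.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
    ((10.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    ((5.0, 0.0, 5.0), (0.0, 0.0, -1.0)),
    ((1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
]
current = ((0.5, 0.0, 0.0), (0.0, 0.0, 1.0))
selected = select_context(current, history)
print(selected)
```

In this toy run the selected context is the most recent chunk plus the very first chunk, which was captured from nearly the same viewpoint; the intermediate chunks are dropped, which is how a scheme like this can keep context from growing with trajectory length.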

02

Denoising Cache Acceleration

Reuses early-step model outputs during reliable revisiting while preserving the final denoising step for quality correction.
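
The reuse pattern can be sketched as follows, under stated assumptions. The function `denoise_chunk`, the flag `revisit_is_reliable`, the toy model, and the plain Euler update are hypothetical stand-ins rather than the paper's sampler or reliability test; the sketch only shows cached outputs replacing model calls on early steps while the final step is always recomputed.

```python
def denoise_chunk(model, latent, timesteps, cache=None, revisit_is_reliable=False):
    """Run a simple Euler-style denoising loop with optional output reuse.

    `model(latent, t)` returns a prediction as a list of floats.
    `cache` maps step index -> model output stored from an earlier chunk.
    Returns the final latent, a fresh cache, and the number of model calls.
    """
    new_cache = {}
    model_calls = 0
    last = len(timesteps) - 1
    for i, t in enumerate(timesteps):
        can_reuse = (
            revisit_is_reliable
            and cache is not None
            and i in cache
            and i != last  # the final step is always recomputed
        )
        if can_reuse:
            out = cache[i]
        else:
            out = model(latent, t)
            model_calls += 1
        new_cache[i] = out
        next_t = timesteps[i + 1] if i < last else 0.0
        latent = [x - (t - next_t) * o for x, o in zip(latent, out)]
    return latent, new_cache, model_calls

def toy_model(latent, t):
    """Stand-in for the video model: a cheap, deterministic prediction."""
    return [0.5 * x for x in latent]

steps = [1.0, 0.75, 0.5, 0.25]
_, first_cache, calls_first = denoise_chunk(toy_model, [1.0, -2.0], steps)
_, _, calls_revisit = denoise_chunk(
    toy_model, [1.0, -2.0], steps, cache=first_cache, revisit_is_reliable=True
)
print(calls_first, calls_revisit)
```

With four steps, the first pass calls the model four times and the revisiting pass calls it once. In a real system the saving depends on how often the revisit test passes, which this sketch does not model.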

03

Co-designed 3D Sparse Attention

Sparsifies historical visual KV blocks and uses fused kernels to reduce layout conversion and gather/scatter overhead.
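
As a rough illustration of block-level KV sparsification, the sketch below scores each block of historical keys against the query and attends only over the best-scoring blocks. It rests on several assumptions: the mean-key scoring rule, the 1D block layout, and the names `block_sparse_attention`, `block_size`, and `keep_blocks` are hypothetical, and the pure-Python loops stand in for the fused GPU kernels, which are not reproduced here.

```python
import math

def block_sparse_attention(query, keys, values, block_size=2, keep_blocks=1):
    """Attend one query over only the highest-scoring KV blocks.

    query: list of floats; keys and values: equal-length lists of float lists.
    A block's score is the dot product of the query with its mean key.
    Returns the attention output and the indices of the tokens attended to.
    """
    dim = len(query)
    starts = range(0, len(keys), block_size)

    def block_score(start):
        block = keys[start:start + block_size]
        mean_key = [sum(k[d] for k in block) / len(block) for d in range(dim)]
        return sum(q * m for q, m in zip(query, mean_key))

    # Keep the top-scoring blocks, then restore their original order.
    kept = sorted(sorted(starts, key=block_score, reverse=True)[:keep_blocks])
    idx = [i for s in kept for i in range(s, min(s + block_size, len(keys)))]

    # Standard scaled dot-product softmax, restricted to the kept tokens.
    logits = [sum(q * k for q, k in zip(query, keys[i])) / math.sqrt(dim) for i in idx]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [
        sum(w * values[i][d] for w, i in zip(weights, idx)) / total
        for d in range(len(values[0]))
    ]
    return out, idx

query = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
output, attended = block_sparse_attention(query, keys, values)
print(attended, output)
```

Here only the second block aligns with the query, so attention runs over two of the four tokens. Gathering the kept blocks into `idx` is the kind of gather step that the fused kernels described above aim to make cheap on real hardware.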

Efficiency analysis

Results

Figure: Stage-wise latency breakdown on HY-WorldPlay.

Figure: Fusion efficiency comparison. Latency of sparse operators before and after kernel fusion.

Citation

BibTeX

@misc{lu2026lightinteractiontrainingfreeinference,
  title         = {Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models},
  author        = {Jiacheng Lu and Haoyi Zhu and Sipei Yi and Enze Xie and Yu Li and Cheng Zhuo},
  year          = {2026},
  eprint        = {2605.31158},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2605.31158},
  url           = {https://arxiv.org/abs/2605.31158}
}