PackUV

📄 Abstract

Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, suffer from temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, which prevents practical deployment.

We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlases, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions, and preserves temporal coherence even under large motions and disocclusions. The resulting UV atlas format is the first unified volumetric video representation compatible with standard video codecs (e.g., FFV1) without losing quality, enabling efficient streaming within existing multimedia infrastructure.
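As a rough illustration of what "image-native storage" can mean, the sketch below packs one scalar attribute per Gaussian (for example, opacity) into a square, single-scale integer grid and reads it back. This is not the PackUV implementation: the function names `pack_attribute` and `unpack_attribute`, the row-major layout, the 16-bit quantization, and the fixed value range are all assumptions made for this example, and the multi-scale structure described above is not modeled.

```python
import math

def pack_attribute(values, lo, hi, bits=16):
    """Quantize one scalar per Gaussian into a square, row-major atlas.

    Hypothetical helper for illustration only; layout and bit depth are assumptions.
    """
    n = len(values)
    side = math.isqrt(n - 1) + 1 if n > 0 else 0  # smallest square with >= n texels
    levels = (1 << bits) - 1
    atlas = [[0] * side for _ in range(side)]
    for i, v in enumerate(values):
        t = min(max((v - lo) / (hi - lo), 0.0), 1.0)  # normalize and clamp to [0, 1]
        atlas[i // side][i % side] = round(t * levels)
    return atlas

def unpack_attribute(atlas, count, lo, hi, bits=16):
    """Invert pack_attribute for the first `count` texels."""
    levels = (1 << bits) - 1
    side = len(atlas)
    return [lo + atlas[i // side][i % side] / levels * (hi - lo)
            for i in range(count)]

# Five Gaussians fit in a 3x3 atlas; unused texels stay at zero.
opacities = [0.0, 0.25, 0.5, 0.75, 1.0]
atlas = pack_attribute(opacities, lo=0.0, hi=1.0)
restored = unpack_attribute(atlas, len(opacities), lo=0.0, hi=1.0)
```

In this sketch, each attribute channel would become one image plane per frame, so a sequence of frames could be handed to an ordinary image or video encoder. The quantization step here is lossy by up to one quantization level; a lossless codec preserves the stored integers exactly, not the original floating-point values.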

To evaluate long-duration volumetric capture, we present PackUV-2B, the largest multi-view video dataset to date, featuring more than 50 synchronized cameras, substantial motion, and frequent disocclusions across 100 sequences and 2B (billion) frames. Extensive experiments demonstrate that our method surpasses existing baselines in rendering fidelity while scaling to sequences up to 30 minutes with consistent quality.

📊 Baseline Comparison

Comparison: 3DGStream vs. Ours

🗂️ PackUV-2B Dataset

To better showcase the abilities of our representation and compare it with existing work, we captured the largest long-duration multi-view video dataset, PackUV-2B. PackUV-2B features real-world dynamic scenes with more than 50 synchronized cameras, providing 360° coverage in both controlled studio and uncontrolled in-the-wild settings.

🎬 Long Video

To showcase the scalability of our method, we play a long video sequence lasting 30 minutes featuring complex human motions in a dynamic environment. The video is played at 30x speed.

📖 Citation

@misc{rai2026packuv,
  title={PackUV: Packed Gaussian UV Maps for 4D Volumetric Video}, 
  author={Aashish Rai and Angela Xing and Anushka Agarwal and Xiaoyan Cong and Zekun Li and Tao Lu and Aayush Prakash and Srinath Sridhar},
  booktitle={Conference on Computer Vision and Pattern Recognition},
  year={2026},
  eprint={2602.23040},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.23040}, 
  }

🙏 Acknowledgements

This research was supported by ONR DURIP grant N00014-23-1-2804, NSF CAREER award #2143576, and an Amazon Cloud Credits award.