"A Unified Tokenizer for Vision." The first unified visual tokenizer achieving both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Encodes diverse visual inputs into a shared 4D latent space using a pure Transformer with 4D rotary position embeddings.

Introduces an adversarial-free training objective combining perceptual and Gram matrix losses. Achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD for videos, and 28.28 PSNR for 3D. CVPR 2026. By Lu, Song, Xu, Ahn, Wang, Chen, Dehghan, and Yang.
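The Gram matrix loss named above matches second-order feature statistics between reconstruction and reference, rather than relying on a discriminator. A minimal sketch of such a loss (function names and the Frobenius-distance formulation are illustrative assumptions, not the paper's exact objective):

```python
import numpy as np

def gram_matrix(feats):
    """Gram matrix of a feature map.

    feats: array of shape (C, N) -- C channels over N spatial positions.
    Returns a (C, C) matrix of channel-wise inner products,
    normalized by the total number of elements.
    """
    c, n = feats.shape
    return feats @ feats.T / (c * n)

def gram_loss(f_rec, f_ref):
    """Mean squared (Frobenius) distance between the Gram matrices of
    reconstructed and reference feature maps -- an adversarial-free
    texture/style matching term."""
    return float(np.mean((gram_matrix(f_rec) - gram_matrix(f_ref)) ** 2))
```

In practice a loss like this would be computed on features from a frozen perceptual network and summed with a pixel or perceptual reconstruction term.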

Paper

arXiv: 2509.14476

Venue: CVPR 2026

vision · multimodal · open-source
