"A Unified Tokenizer for Vision." The first unified visual tokenizer achieving both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Encodes diverse visual inputs into a shared 4D latent space using a pure Transformer with 4D rotary position embeddings.

Introduces an adversarial-free training objective combining perceptual and Gram matrix losses. Achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD for videos, and 28.28 PSNR for 3D. CVPR 2026. By Lu, Song, Xu, Ahn, Wang, Chen, Dehghan, and Yang.
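The Gram matrix loss named above matches second-order feature statistics between reconstruction and reference, rather than relying on a discriminator. A minimal sketch of such a loss (function names and the Frobenius-distance formulation are illustrative assumptions, not the paper's exact objective):

```python
import numpy as np

def gram_matrix(feats):
    """Gram matrix of a feature map.

    feats: array of shape (C, N) -- C channels over N spatial positions.
    Returns a (C, C) matrix of channel-wise inner products,
    normalized by the total number of elements.
    """
    c, n = feats.shape
    return feats @ feats.T / (c * n)

def gram_loss(f_rec, f_ref):
    """Mean squared (Frobenius) distance between the Gram matrices of
    reconstructed and reference feature maps -- an adversarial-free
    texture/style matching term."""
    return float(np.mean((gram_matrix(f_rec) - gram_matrix(f_ref)) ** 2))
```

In practice a loss like this would be computed on features from a frozen perceptual network and summed with a pixel or perceptual reconstruction term.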

Paper

arXiv: 2509.14476

Venue: CVPR 2026

vision · multimodal · open-source
