Towards omni-modal representations by extending ViT to additional modalities (3D, audio, etc.) via lightweight lens modules. Published at CVPR 2024.

Paper

arXiv: 2311.16081

Venue: CVPR 2024

multimodal, vision, research