Works towards omni-modal representations by extending a Vision Transformer (ViT) to additional modalities (3D, audio, etc.) via lightweight lens modules. Published at CVPR 2024.

Venue: CVPR 2024
Tags: multimodal, vision, research