Vision-language model with dynamic tiling encoder for high-resolution image understanding.

Paper

arXiv: 2403.05525

multimodalopen-weight

Related