Vision-language model with a dynamic tiling encoder for high-resolution image understanding.

Paper

Citations: 45
multimodal, open-weight

Related