Vision-language model with a dynamic tiling encoder for high-resolution image understanding.

Paper

Citations: 43
multimodal, open-weight

Related