The world's first scalable native vision-language model architecture, co-developed with NTU S-Lab. It abandons the traditional "visual encoder + projector + LLM" pipeline, redesigning the attention mechanism, position encoding, and semantic mapping from scratch, and reaches state-of-the-art results with only about one-tenth of the usual training data (390M image-text pairs). Open-sourced in 2B and 9B parameter sizes.
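To make the contrast concrete, below is a minimal PyTorch sketch of the native idea: a single transformer consumes pixel patches and text tokens in one sequence, with no separate pretrained vision encoder or projector. All module names and sizes are illustrative assumptions, not NEO's actual design, and the toy omits the reworked position encoding and attention the description refers to.

    # Hypothetical sketch of a "native" VLM: pixels and words share one trunk.
    import torch
    import torch.nn as nn

    class NativeVLMSketch(nn.Module):
        def __init__(self, vocab_size=32000, dim=512, patch=16, depth=6, heads=8):
            super().__init__()
            # Pixels enter the same embedding space as words: a strided conv
            # patchifier stands in for the usual "vision encoder + projector" pair.
            self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
            self.text_embed = nn.Embedding(vocab_size, dim)
            layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            self.trunk = nn.TransformerEncoder(layer, depth)  # one shared attention stack
            self.lm_head = nn.Linear(dim, vocab_size)

        def forward(self, pixels, token_ids):
            img = self.patch_embed(pixels).flatten(2).transpose(1, 2)  # (B, patches, dim)
            txt = self.text_embed(token_ids)                           # (B, seq, dim)
            fused = torch.cat([img, txt], dim=1)  # one interleaved token sequence
            # Per-token logits over the vocabulary (a real model would add a causal mask).
            return self.lm_head(self.trunk(fused))

    model = NativeVLMSketch()
    out = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
    print(out.shape)  # torch.Size([1, 208, 32000]): 196 image patches + 12 text tokens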

Outputs (2)

NEO Models (model)

Variants

Name     Parameters
NEO-2B   2B
NEO-9B   9B
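
Assuming the released checkpoints follow standard Hugging Face Hub conventions, loading a variant might look like the sketch below; the repo ID is a placeholder, not a confirmed path.

    from transformers import AutoModel, AutoProcessor

    repo = "your-org/NEO-2B"  # placeholder: substitute the actual published repo ID
    # Custom architectures on the Hub usually require trust_remote_code=True.
    processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
    model = AutoModel.from_pretrained(repo, trust_remote_code=True)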

From Pixels to Words: Towards Native Vision-Language Primitives at Scale (paper)

arXiv: 2510.14979

Tags: multimodal, architecture, open-source, vision