NEO (Native VLM Architecture)
The world's first scalable native vision-language model (VLM) architecture, co-developed with NTU S-Lab. It abandons the traditional "visual encoder + projector + LLM" paradigm, redesigning the attention mechanism, position encoding, and semantic mapping from scratch. It achieves SOTA performance with only about 1/10th the usual training data (390M image-text examples) and is open-sourced in 2B and 9B parameter sizes.
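To make the paradigm contrast concrete, here is a minimal, purely illustrative Python sketch. All function names and the toy token representation are assumptions for illustration, not NEO's actual code or API; it only shows how the modular pipeline glues three separately trained stages together, whereas a native model feeds pixels and words into one transformer.

```python
# Illustrative sketch only (hypothetical names, not NEO's real code).

def vision_encoder(patches):
    # Stand-in for a pretrained vision encoder (e.g. a CLIP-style ViT):
    # maps each image patch to a feature.
    return [("vis_feat", p) for p in patches]

def projector(features):
    # Stand-in for the adapter that maps vision features into the
    # LLM's token-embedding space.
    return [("llm_tok", f) for f in features]

def llm(tokens):
    # Stand-in for a text-only LLM consuming a token sequence.
    return f"answer over {len(tokens)} tokens"

def modular_vlm(patches, text_tokens):
    # Traditional paradigm: encoder -> projector -> LLM, three parts
    # designed and often pretrained separately.
    return llm(projector(vision_encoder(patches)) + text_tokens)

def native_vlm(patches, text_tokens):
    # Native paradigm: one model whose attention, position encoding,
    # and embeddings are designed jointly for pixels and words.
    tokens = [("pix", p) for p in patches] + text_tokens
    return f"answer over {len(tokens)} tokens"
```

Calling either with four patches and four text tokens, e.g. `modular_vlm([1, 2, 3, 4], ["what", "is", "this", "?"])`, produces an answer over the same 8-token sequence; the difference is whether three modules or one unified model handles it.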
NEO Models
| Name | Parameters | Notes |
|---|---|---|
| NEO-2B | 2B | — |
| NEO-9B | 9B | — |
Paper: *From Pixels to Words: Towards Native Vision-Language Primitives at Scale* (arXiv: 2510.14979)