The world's first scalable native vision-language model architecture, co-developed with NTU S-Lab. It abandons the traditional "visual encoder + projector + LLM" pipeline, redesigning the attention mechanism, position encoding, and semantic mapping from scratch, and reaches state-of-the-art results with only about one-tenth of the usual training data (390M image-text pairs). Open-sourced in 2B and 9B parameter sizes.
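To make the contrast concrete, below is a minimal PyTorch sketch of the native idea: a single transformer consumes pixel patches and text tokens in one sequence, with no separate pretrained vision encoder or projector. All module names and sizes are illustrative assumptions, not NEO's actual design, and the toy omits the reworked position encoding and attention the description refers to.

    # Hypothetical sketch of a "native" VLM: pixels and words share one trunk.
    import torch
    import torch.nn as nn

    class NativeVLMSketch(nn.Module):
        def __init__(self, vocab_size=32000, dim=512, patch=16, depth=6, heads=8):
            super().__init__()
            # Pixels enter the same embedding space as words: a strided conv
            # patchifier stands in for the usual "vision encoder + projector" pair.
            self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
            self.text_embed = nn.Embedding(vocab_size, dim)
            layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            self.trunk = nn.TransformerEncoder(layer, depth)  # one shared attention stack
            self.lm_head = nn.Linear(dim, vocab_size)

        def forward(self, pixels, token_ids):
            img = self.patch_embed(pixels).flatten(2).transpose(1, 2)  # (B, patches, dim)
            txt = self.text_embed(token_ids)                           # (B, seq, dim)
            fused = torch.cat([img, txt], dim=1)  # one interleaved token sequence
            # Per-token logits over the vocabulary (a real model would add a causal mask).
            return self.lm_head(self.trunk(fused))

    model = NativeVLMSketch()
    out = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
    print(out.shape)  # torch.Size([1, 208, 32000]): 196 image patches + 12 text tokens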

Outputs (2)

NEO Models (model)

Variants

Name     Parameters
NEO-2B   2B
NEO-9B   9B
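
Assuming the released checkpoints follow standard Hugging Face Hub conventions, loading a variant might look like the sketch below; the repo ID is a placeholder, not a confirmed path.

    from transformers import AutoModel, AutoProcessor

    repo = "your-org/NEO-2B"  # placeholder: substitute the actual published repo ID
    # Custom architectures on the Hub usually require trust_remote_code=True.
    processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
    model = AutoModel.from_pretrained(repo, trust_remote_code=True)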

From Pixels to Words: Towards Native Vision-Language Primitives at Scale (paper)

arXiv: 2510.14979

Tags: multimodal, architecture, open-source, vision