Vision architecture that treats image patches as "visual sentences" and further divides them into smaller patches as "visual words," enabling multi-granularity attention. Achieves 81.5% top-1 accuracy on ImageNet, about 1.7% higher than existing vision transformers at similar compute cost. Published at NeurIPS 2021.
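The nested attention described above can be sketched minimally: an inner attention over "visual words" (sub-patches) within each patch, whose pooled output is folded back into the "visual sentence" (patch) embeddings before an outer attention over patches. This is a toy numpy sketch, not the paper's implementation; the dimensions, identity Q/K/V projections, and mean-pooling aggregation are simplifying assumptions (TNT uses learned projections and a linear layer to fuse word embeddings into sentences).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # single-head scaled dot-product attention; identity Q/K/V
    # projections for brevity (the real model learns these)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

# toy dimensions (hypothetical, not the paper's)
n_sentences, n_words, d = 4, 16, 8

rng = np.random.default_rng(0)
words = rng.normal(size=(n_sentences, n_words, d))   # visual words per patch
sentences = rng.normal(size=(1, n_sentences, d))     # visual sentence embeddings

# inner transformer: attention among words inside each patch
words = words + self_attention(words)

# fuse word information back into sentence embeddings
# (mean pooling here; TNT flattens and linearly projects instead)
sentences = sentences + words.mean(axis=1)[None, :, :]

# outer transformer: attention among sentences (patches)
sentences = sentences + self_attention(sentences)

print(sentences.shape)  # (1, 4, 8)
```

The two attention levels let fine-grained sub-patch structure inform the patch-level representation, which is the source of the multi-granularity modeling the description refers to.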

TNT

model

Transformer in Transformer

paper

arXiv: 2103.00112

vision, architecture, open-source