Vision architecture that treats image patches as "visual sentences" and further divides them into smaller patches as "visual words," enabling multi-granularity attention. Achieves 81.5% top-1 accuracy on ImageNet, about 1.7% higher than existing vision transformers at similar compute cost. Published at NeurIPS 2021.
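The nested attention described above can be sketched minimally: an inner attention over "visual words" (sub-patches) within each patch, whose pooled output is folded back into the "visual sentence" (patch) embeddings before an outer attention over patches. This is a toy numpy sketch, not the paper's implementation; the dimensions, identity Q/K/V projections, and mean-pooling aggregation are simplifying assumptions (TNT uses learned projections and a linear layer to fuse word embeddings into sentences).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # single-head scaled dot-product attention; identity Q/K/V
    # projections for brevity (the real model learns these)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

# toy dimensions (hypothetical, not the paper's)
n_sentences, n_words, d = 4, 16, 8

rng = np.random.default_rng(0)
words = rng.normal(size=(n_sentences, n_words, d))   # visual words per patch
sentences = rng.normal(size=(1, n_sentences, d))     # visual sentence embeddings

# inner transformer: attention among words inside each patch
words = words + self_attention(words)

# fuse word information back into sentence embeddings
# (mean pooling here; TNT flattens and linearly projects instead)
sentences = sentences + words.mean(axis=1)[None, :, :]

# outer transformer: attention among sentences (patches)
sentences = sentences + self_attention(sentences)

print(sentences.shape)  # (1, 4, 8)
```

The two attention levels let fine-grained sub-patch structure inform the patch-level representation, which is the source of the multi-granularity modeling the description refers to.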

TNT

model

Transformer in Transformer

paper

arXiv: 2103.00112

vision, architecture, open-source