Improving Vision Transformers to Learn Small-Size Dataset From Scratch

Lee, Seunghoon; Lee, Seunghyun; Song, Byung Cheol

doi:10.1109/ACCESS.2022.3224044

Citations (Web of Science): 38

Citations (Scopus): 51

Abstract

This paper proposes several techniques that help the Vision Transformer (ViT) learn small-size datasets from scratch. ViT, which applies the transformer structure to image classification, has recently outperformed convolutional neural networks. However, its high performance results from pre-training on large-size datasets, and this dependence on large datasets comes from its low locality inductive bias. In addition, conventional ViT cannot attend to the target class effectively because of redundant attention caused by a rather high constant temperature factor. To improve the locality inductive bias of ViT, this paper proposes a novel tokenization (Shifted Patch Tokenization: SPT) using shifted patches and a position encoding (CoordConv Position Encoding: CPE) using $1 \times 1$ CoordConv. To improve the poor attention, we also propose a new self-attention mechanism (Locality Self-Attention: LSA) based on a learnable temperature and self-relation masking. SPT, CPE, and LSA are intuitive techniques, but they successfully improve the performance of ViT even on small-size datasets. We show qualitatively that each technique attends to more important areas and contributes to a flatter loss landscape. Moreover, the proposed techniques are generic add-on modules applicable to various ViT backbones. Our experiments show that, when learning Tiny-ImageNet from scratch, the proposed scheme based on SPT, CPE, and LSA increases the accuracy of ViT backbones by +3.66 on average and by up to +5.7. In addition, the performance improvements of ViT backbones in ImageNet-1K classification, in learning on COCO from scratch, and in transfer learning on classification datasets verify that the proposed method generalizes well.
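
The abstract names the two ingredients of LSA (a learnable temperature and self-relation masking) without giving details. The following is a minimal sketch of how they might fit together, not the authors' implementation. It assumes that self-relation masking means excluding each token's similarity with itself (the diagonal of the score matrix) before the softmax, and it treats the temperature as a plain scalar argument that would be a learned parameter during training. The function name `locality_self_attention` and the toy tokens are hypothetical.

```python
import math


def locality_self_attention(queries, keys, values, temperature):
    """Toy attention with a temperature divisor and diagonal masking.

    queries, keys, values: lists of equal-length float vectors, one per token.
    temperature: positive scalar. Assumption: in training this would be a
    learned parameter used in place of the fixed sqrt(d_k) divisor.
    Returns (outputs, weights), where weights[i][j] is how much token i
    attends to token j.
    """
    n = len(queries)
    if n < 2:
        raise ValueError("diagonal masking needs at least two tokens")
    dim = len(values[0])
    outputs, weights = [], []
    for i in range(n):
        # Dot-product similarity of token i with every token j, scaled by
        # the temperature.
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / temperature
            for j in range(n)
        ]
        # Self-relation masking (assumed form): remove the token's
        # similarity with itself so it cannot attend to itself.
        scores[i] = -math.inf
        # Numerically stable softmax over the remaining tokens.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(row[j] * values[j][d] for j in range(n)) for d in range(dim)]
        )
    return outputs, weights


# Hypothetical toy input: three 2-dimensional tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Fixed divisor sqrt(d_k), as in standard scaled dot-product attention.
_, w_fixed = locality_self_attention(tokens, tokens, tokens, math.sqrt(2))
# A smaller temperature, standing in for a value reached by learning.
_, w_sharp = locality_self_attention(tokens, tokens, tokens, 0.25)
```

In this toy setting, the smaller temperature concentrates the first token's attention on its most similar neighbour more strongly than the fixed divisor does, which matches the abstract's point that a high constant temperature spreads attention too widely.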

Keywords

Vision transformer; attention mechanism; data efficient learning

Title: Improving Vision Transformers to Learn Small-Size Dataset From Scratch

Authors: Lee, Seunghoon; Lee, Seunghyun; Song, Byung Cheol

DOI: 10.1109/ACCESS.2022.3224044

Publication date: 2022

Type: Article

Journal: IEEE Access

Volume: 10

Pages: 123212-123224