DPT-tracker: Dual pooling transformer for efficient visual tracking

  • Fang, Yang
  • Xie, Bailian
  • Khairuddin, Uswah
  • Min, Zijian
  • Jiang, Bingbing
  • Li, Weisheng
Citations

Web of Science: 7
Scopus: 8

Abstract

Transformer tracking always takes paired template and search images as encoder input and conducts feature extraction and target-search feature correlation through self- and/or cross-attention operations; the model complexity therefore grows quadratically with the number of input images. To alleviate the burden of this tracking paradigm and facilitate practical deployment of Transformer-based trackers, we propose a dual pooling transformer tracking framework, dubbed DPT, which consists of three components: a simple yet efficient spatiotemporal attention model (SAM), a mutual correlation pooling Transformer (MCPT) and a multiscale aggregation pooling Transformer (MAPT). SAM is designed to aggregate the temporal dynamics and spatial appearance information of multi-frame templates along the space-time dimensions. MCPT captures multi-scale pooled and correlated contextual features, and is followed by MAPT, which aggregates the multi-scale features into a unified feature representation for tracking prediction. The DPT tracker achieves an AUC score of 69.5 on LaSOT and a precision score of 82.8 on TrackingNet while using a shorter sequence of attention tokens, fewer parameters and fewer FLOPs than existing state-of-the-art (SOTA) Transformer tracking methods. Extensive experiments demonstrate that the DPT tracker provides a strong real-time tracking baseline with a good trade-off between tracking performance and inference efficiency.
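
The link between token sequence length and attention cost mentioned in the abstract can be illustrated with simple arithmetic. The following is a minimal sketch, not the authors' implementation: the grid sizes (8x8 template tokens, 16x16 search tokens), the pooling stride of 2, and the helper names num_tokens and attention_pairs are all assumptions chosen for illustration. It counts query-key score entries for full joint attention and for attention whose keys are spatially pooled.

```python
def num_tokens(side: int, stride: int = 1) -> int:
    """Tokens left after non-overlapping stride x stride pooling of a side x side grid."""
    pooled_side = side // stride
    return pooled_side * pooled_side


def attention_pairs(n_query: int, n_key: int) -> int:
    """Number of query-key score entries in one attention map."""
    return n_query * n_key


# Hypothetical grid sizes (assumptions, not taken from the paper).
template_tokens = num_tokens(8)    # 8 x 8 template grid
search_tokens = num_tokens(16)     # 16 x 16 search grid
joint_tokens = template_tokens + search_tokens

# Full attention over the concatenated template and search tokens.
full_cost = attention_pairs(joint_tokens, joint_tokens)

# Same queries, but keys pooled with stride 2 (an assumed setting).
pooled_keys = num_tokens(8, 2) + num_tokens(16, 2)
pooled_cost = attention_pairs(joint_tokens, pooled_keys)

# Doubling the number of tokens quadruples the full attention cost.
doubled_cost = attention_pairs(2 * joint_tokens, 2 * joint_tokens)

print(joint_tokens, pooled_keys)
print(full_cost, pooled_cost, doubled_cost)
```

Under these assumed sizes, pooling the keys with stride 2 leaves a quarter of the key tokens and therefore a quarter of the score entries, while the number of queries is unchanged. The actual pooling scales and token counts used by MCPT and MAPT are not stated in this record.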

Keywords

human-computer interfacing; image motion analysis; pattern recognition; signal processing; tracking
Title
DPT-tracker: Dual pooling transformer for efficient visual tracking
Authors
Fang, Yang; Xie, Bailian; Khairuddin, Uswah; Min, Zijian; Jiang, Bingbing; Li, Weisheng
DOI
10.1049/cit2.12296
Publication date
2024-08
Type
Article
Journal
CAAI Transactions on Intelligence Technology
Volume
9
Issue
4
Pages
948-959