Details
Abstract
Transformer tracking always takes paired template and search images as encoder input and conducts feature extraction and target-search feature correlation through self- and/or cross-attention operations, so the model complexity grows quadratically with the number of input images. To alleviate the burden of this tracking paradigm and facilitate practical deployment of Transformer-based trackers, we propose a dual pooling Transformer tracking framework, dubbed DPT, which consists of three components: a simple yet efficient spatiotemporal attention model (SAM), a mutual correlation pooling Transformer (MCPT), and a multiscale aggregation pooling Transformer (MAPT). SAM is designed to aggregate the temporal dynamics and spatial appearance information of multi-frame templates along the space-time dimensions. MCPT captures multi-scale pooled and correlated contextual features, and is followed by MAPT, which aggregates the multi-scale features into a unified feature representation for tracking prediction. The DPT tracker achieves an AUC score of 69.5 on LaSOT and a precision score of 82.8 on TrackingNet while using a shorter attention-token sequence, fewer parameters, and fewer FLOPs than existing state-of-the-art (SOTA) Transformer tracking methods. Extensive experiments demonstrate that the DPT tracker is a strong real-time tracking baseline with a good trade-off between tracking performance and inference efficiency.
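The cost argument in the abstract can be illustrated with a rough counting sketch. It is not the paper's MCPT or MAPT design: the token-grid sizes, the pooling stride, and all function names below are hypothetical, chosen only to show how joint attention over concatenated template and search tokens grows with the number of templates, and how pooling the key/value tokens shortens the sequence that each query attends to.

```python
# Counting sketch only: all sizes and names are illustrative assumptions,
# not values or components taken from the DPT paper.

TEMPLATE_SIDE = 8   # hypothetical 8x8 token grid per template image
SEARCH_SIDE = 16    # hypothetical 16x16 token grid for the search image
STRIDE = 2          # hypothetical non-overlapping pooling stride


def num_tokens(side: int) -> int:
    """Number of tokens in a side x side token grid."""
    return side * side


def pooled_tokens(side: int, stride: int) -> int:
    """Tokens left after non-overlapping stride x stride pooling."""
    return (side // stride) ** 2


def attention_pairs(num_queries: int, num_keys: int) -> int:
    """Query-key pairs scored by one attention layer."""
    return num_queries * num_keys


def full_cost(num_templates: int) -> int:
    """Joint self-attention over all template and search tokens."""
    n = num_templates * num_tokens(TEMPLATE_SIDE) + num_tokens(SEARCH_SIDE)
    return attention_pairs(n, n)


def pooled_cost(num_templates: int) -> int:
    """Same queries, but keys/values are spatially pooled first."""
    q = num_templates * num_tokens(TEMPLATE_SIDE) + num_tokens(SEARCH_SIDE)
    k = (num_templates * pooled_tokens(TEMPLATE_SIDE, STRIDE)
         + pooled_tokens(SEARCH_SIDE, STRIDE))
    return attention_pairs(q, k)


if __name__ == "__main__":
    for t in (1, 2, 3):
        print(t, full_cost(t), pooled_cost(t))
```

Under these assumed sizes, pooling keys and values with stride 2 cuts the number of scored query-key pairs by a factor of four, while the number of query tokens is unchanged.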
Keywords
- Title: DPT-tracker: Dual pooling transformer for efficient visual tracking
- Authors: Fang, Yang; Xie, Bailian; Khairuddin, Uswah; Min, Zijian; Jiang, Bingbing; Li, Weisheng
- Issue date: 2024-08
- Type: Article
- Journal: CAAI Transactions on Intelligence Technology
- Volume: 9
- Issue: 4
- Pages: 948-959