비디오 스윈 트랜스포머 기반의 향상된 Visual Saliency 예측

Improved Visual Saliency Prediction Based on Video Swin Transformers
  • 우채은
  • 이수민
  • 박수민
  • 최세린
  • 류제경
  • 김병형

Abstract

In this paper, we propose the Video Swin Transformer Saliency Network (VST-SalNet). The model uses a Video Swin Transformer backbone to learn spatiotemporal features from video data and to capture long-range spatiotemporal dependencies. A feature pyramid structure integrates high-level semantic information with low-level detail, enabling multi-scale feature fusion and refining spatial detail across resolutions. As a result, the model handles objects of various sizes, preserves semantic information, and minimizes information loss while improving spatial resolution. Experiments on the DHF1K, Hollywood-2, and UCF Sports datasets, evaluated with metrics such as SIM and CC, show that VST-SalNet outperforms state-of-the-art models.
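
The abstract names SIM and CC as evaluation metrics but does not say how they are computed. The sketch below follows the definitions commonly used in the saliency literature: SIM as the histogram intersection of two maps normalized to sum to 1, and CC as the Pearson correlation between the maps. It is an illustration under stated assumptions, not the authors' evaluation code. The function names `sim` and `cc` and the nested-list input format are hypothetical choices made to keep the example dependency-free; a real pipeline would more likely operate on arrays.

```python
import math


def _flatten(saliency_map):
    """Flatten a 2D map (list of rows) into a flat list of floats."""
    return [float(v) for row in saliency_map for v in row]


def _paired(pred, gt):
    """Flatten both maps and check that they have the same number of pixels."""
    p, g = _flatten(pred), _flatten(gt)
    if len(p) != len(g) or not p:
        raise ValueError("maps must be non-empty and the same size")
    return p, g


def sim(pred, gt):
    """Similarity (SIM): sum of pixel-wise minima of the two maps,
    each first normalized to sum to 1. Assumes non-negative values."""
    p, g = _paired(pred, gt)
    p_sum, g_sum = sum(p), sum(g)
    if p_sum <= 0 or g_sum <= 0:
        raise ValueError("maps must have positive total saliency")
    return sum(min(a / p_sum, b / g_sum) for a, b in zip(p, g))


def cc(pred, gt):
    """Linear correlation coefficient (CC): Pearson correlation
    between the predicted and ground-truth maps."""
    p, g = _paired(pred, gt)
    p_mean, g_mean = sum(p) / len(p), sum(g) / len(g)
    cov = sum((a - p_mean) * (b - g_mean) for a, b in zip(p, g))
    p_var = sum((a - p_mean) ** 2 for a in p)
    g_var = sum((b - g_mean) ** 2 for b in g)
    if p_var == 0 or g_var == 0:
        raise ValueError("correlation is undefined for a constant map")
    return cov / math.sqrt(p_var * g_var)
```

Under these definitions SIM ranges from 0 (no overlap) to 1 (identical distributions), and CC ranges from -1 to 1. Benchmark toolkits may add preprocessing, such as resizing or min-max scaling, that this sketch omits.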

Keywords

Video Saliency Prediction, Video Swin Transformer, Feature Pyramid Network, Multi Stage
Publication date
2024-11
Type
Y
Journal
멀티미디어학회논문지
Volume
27
Issue
11
Pages
1314 ~ 1325