Improved Visual Saliency Prediction Based on Video Swin Transformers

우채은; 이수민; 박수민; 최세린; 류제경; 김병형

Abstract

In this paper, we propose the Video Swin Transformer Saliency Network (VST-SalNet). The model uses the Video Swin Transformer as its backbone to learn the spatiotemporal features of video data and to handle long-range spatiotemporal dependencies. It also applies a feature pyramid structure to integrate high-level semantic information with low-level details. This structure enables multi-scale feature fusion and refines spatial details across resolutions. As a result, the model handles objects of various sizes, preserves semantic information, minimizes information loss, and thereby improves spatial resolution. Experimental results on the DHF1K, Hollywood-2, and UCF Sports datasets, evaluated with metrics such as SIM and CC, show that VST-SalNet outperforms state-of-the-art models.
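The abstract names SIM and CC as evaluation metrics without defining them. As a rough illustration, the sketch below computes both under their commonly used definitions: SIM as the histogram intersection of two maps normalized to sum to 1, and CC as the Pearson correlation coefficient. The function names, the nested-list map format, and the error handling are assumptions made for this example. The abstract does not describe the paper's evaluation code, which may differ in preprocessing and normalization.

```python
import math


def _flatten(saliency_map):
    """Flatten a 2D saliency map (list of rows) into a list of floats."""
    return [float(v) for row in saliency_map for v in row]


def sim(pred, gt):
    """Similarity (SIM): sum of element-wise minima of two maps,
    each normalized so that its values sum to 1.
    Assumes non-negative maps of identical shape."""
    p, g = _flatten(pred), _flatten(gt)
    sp, sg = sum(p), sum(g)
    if sp <= 0 or sg <= 0:
        raise ValueError("maps must have a positive sum")
    return sum(min(a / sp, b / sg) for a, b in zip(p, g))


def cc(pred, gt):
    """Linear correlation coefficient (CC): Pearson correlation
    between the two maps, treated as flat vectors."""
    p, g = _flatten(pred), _flatten(gt)
    mean_p, mean_g = sum(p) / len(p), sum(g) / len(g)
    dp = [a - mean_p for a in p]
    dg = [b - mean_g for b in g]
    denom = math.sqrt(sum(a * a for a in dp) * sum(b * b for b in dg))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant map")
    return sum(a * b for a, b in zip(dp, dg)) / denom
```

Under these definitions SIM ranges from 0 (no overlap) to 1 (identical distributions), and CC ranges from -1 to 1. Higher is better for both.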

Keywords

Video Saliency Prediction; Video Swin Transformer; Feature Pyramid Network; Multi Stage

Title (Korean): 비디오 스윈 트랜스포머 기반의 향상된 Visual Saliency 예측

Title (English): Improved Visual Saliency Prediction Based on Video Swin Transformers

Authors: 우채은; 이수민; 박수민; 최세린; 류제경; 김병형

Publication date: 2024-11

Type: Y

Journal: 멀티미디어학회논문지

Volume: 27

Issue: 11

Pages: 1314-1325