SSRT: A Sequential Skeleton RGB Transformer to Recognize Fine-Grained Human-Object Interactions and Action Recognition

Ghimire, Akash; Kakani, Vijay; Kim, Hakil

doi:10.1109/ACCESS.2023.3278974

상세 보기

SSRT: A Sequential Skeleton RGB Transformer to Recognize Fine-Grained Human-Object Interactions and Action Recognition

Ghimire, Akash;
Kakani, Vijay;
Kim, Hakil

Citations

WEB OF SCIENCE

17

Citations

SCOPUS

17

초록

Combining skeleton and RGB modalities in human action recognition (HAR) has garnered attention due to their ability to complement each other. However, previous studies did not address the challenge of recognizing fine-grained human-object interaction (HOI). To tackle this problem, this study introduces a new transformer-based architecture called Sequential Skeleton RGB Transformer (SSRT), which fuses skeleton and RGB modalities. First, SSRT leverages the strength of Long Short-Term Memory (LSTM) and a multi-head attention mechanism to extract high-level features from both modalities. Subsequently, SSRT employs a two-stage fusion method, including transformer cross-attention fusion and softmax layer late score fusion, to effectively integrate the multimodal features. Aside from evaluating the proposed method on fine-grained HOI, this study also assesses its performance on two other action recognition tasks: general HAR and cross-dataset HAR. Furthermore, this study conducts a performance comparison between a HAR model using single-modality features (RGB and skeleton) alongside multimodality features on all three action recognition tasks. To ensure a fair comparison, comparable state-of-the-art transformer architectures are employed for both the single-modality HAR model and SSRT. In terms of modality, SSRT outperforms the best-performing single-modality HAR model on all three tasks, with accuracy improved by 9.92% on fine-grained HOI recognition, 6.73% on general HAR, and 11.08% on cross-dataset HAR. Additionally, the proposed fusion model surpasses state-of-the-art multimodal fusion techniques like Transformer Early Concatenation, with an accuracy improved by 6.32% on fine-grained HOI recognition, 4.04% on general HAR, and 6.56% on cross-dataset.

키워드

Human factors; Transformers; Task analysis; Feature extraction; Human activity recognition; Solid modeling; Multimodality fusion; human action recognition; fine-grained actions; transformer cross-attention fusion

제목: SSRT: A Sequential Skeleton RGB Transformer to Recognize Fine-Grained Human-Object Interactions and Action Recognition

저자: Ghimire, Akash; Kakani, Vijay; Kim, Hakil

DOI: 10.1109/ACCESS.2023.3278974

발행일: 2023

유형: Article

저널명: IEEE Access

권: 11

페이지: 51930 ~ 51948