Dual-Distillation Vision-Language Model for Multimodal Emotion Recognition in Conversation with Quantized Edge Deployment

Kim, DeogHwa; Lee, Yu Il; Yoon, Da Hyun; Kim, Byeong Jun; Kim, Deok-Hwan

doi:10.3390/app16063103

Abstract

Multimodal Emotion Recognition in Conversation (ERC) has attracted attention as a key technology in human-computer interaction, mental healthcare, and intelligent services. However, deploying ERC in real-world settings remains challenging due to reliability gaps across modalities, instability in visual representations, and the high computational cost of large pretrained models. In particular, on resource-constrained edge devices, it is difficult to reduce model size and inference latency while preserving accuracy. To address these challenges, we propose DDVLM, a knowledge-distillation-based multimodal ERC model, together with an edge-optimized Weight-Only Quantization (WOQ) pipeline for efficient deployment. DDVLM assigns the textual modality as the teacher and the visual modality as the student, transferring emotion-distribution knowledge to improve non-verbal representations and stabilize multimodal learning. In addition, Exponential Moving Average (EMA)-based self-distillation enhances the consistency and generalization capability of text features. Meanwhile, the proposed WOQ pipeline quantizes linear-layer weights to INT8 while preserving precision-sensitive operations in mixed precision, thereby minimizing accuracy loss and reducing model size, memory usage, and inference latency. Experiments on the MELD dataset demonstrate that the proposed approach achieves state-of-the-art performance while also enabling real-time inference on edge devices such as NVIDIA Jetson. Overall, this work presents a practical ERC framework that jointly considers accuracy and deployability.
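
The three mechanisms named in the abstract (text-to-visual distillation of emotion distributions, EMA-based self-distillation, and INT8 weight-only quantization) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not specify the loss form, temperature, EMA decay, or quantization granularity, so the KL-divergence loss, the `temperature` and `decay` values, the symmetric per-row scaling, and all function names below are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between softened emotion distributions.

    Assumption: the teacher is the text branch and the student is the
    visual branch; KL divergence is one common choice of distillation loss.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ema_update(ema_params, params, decay=0.99):
    """EMA update: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_params, params)]

def quantize_row_int8(row):
    """Symmetric INT8 quantization of one weight row: (codes, scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Recover approximate float weights from INT8 codes."""
    return [c * scale for c in codes]

if __name__ == "__main__":
    # Hypothetical logits over three emotion classes.
    print("KD loss:", kd_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0]))
    print("EMA:", ema_update([1.0, 2.0], [0.0, 4.0], decay=0.9))
    codes, scale = quantize_row_int8([0.5, -1.0, 0.25])
    print("INT8 codes:", codes, "scale:", scale)
    print("Dequantized:", dequantize_row(codes, scale))
```

In this sketch only the weights are stored as integers; activations stay in floating point, which is the sense in which the quantization is "weight-only". The rounding error per weight is bounded by half the row's scale.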

Keywords

multimodal emotion recognition in conversation; knowledge distillation; exponential moving average; vision-language model; weight-only quantization

Publication date: 2026-03-23

Type: Article

Journal: APPLIED SCIENCES-BASEL

Volume: 16

Issue: 6