[CCF-AIR Young Scholars Fund] Multi-modal Learning for Emotion Recognition

Research Themes



Tmall Genie aspires to become an intelligent speaker that truly understands its users, and emotion recognition is one of the most important ways to achieve this. With emotion recognition, Tmall Genie can tailor its services to each user. For example, for the same request "Tmall Genie, play music", it can play cheerful music when the user is in a happy mood and light music when the user is tired. The recognition results can also help Tmall Genie customize its dialogue content and assess the quality of its services, forming a closed technical loop that lets us continuously refine and iteratively upgrade our artificial intelligence technology. Accurate emotion recognition is therefore essential.


Numerous approaches have been proposed for emotion recognition, and discriminative feature learning is critical to their final performance. For example, linear discriminant analysis (LDA) learns discriminative features by minimizing the intra-class distance while maximizing the inter-class distance. However, such supervised methods rely on a large number of labeled samples, and constructing these datasets manually is very expensive. By contrast, unlabeled samples can easily be collected from the Internet. In this setting, one can rely on self-supervised or semi-supervised learning to learn discriminative feature representations.
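The LDA criterion mentioned above can be illustrated in its simplest, two-class form. The following is a minimal NumPy sketch, not a method prescribed by this call: the function names, the toy "happy"/"tired" feature clouds, and the use of raw 2-D features are hypothetical. It uses the textbook Fisher closed form, where the projection w is proportional to S_W^{-1}(mu_a - mu_b), with S_W the within-class (intra-class) scatter.

```python
import numpy as np


def fisher_lda_direction(x_a: np.ndarray, x_b: np.ndarray) -> np.ndarray:
    """Two-class Fisher LDA: unit vector w maximizing between/within scatter."""
    mu_a, mu_b = x_a.mean(axis=0), x_b.mean(axis=0)
    # Within-class scatter S_W: the intra-class spread that LDA minimizes.
    s_w = (x_a - mu_a).T @ (x_a - mu_a) + (x_b - mu_b).T @ (x_b - mu_b)
    # Closed-form solution: w is proportional to S_W^{-1} (mu_a - mu_b).
    w = np.linalg.solve(s_w, mu_a - mu_b)
    return w / np.linalg.norm(w)


def fisher_score(x_a: np.ndarray, x_b: np.ndarray, w: np.ndarray) -> float:
    """Fisher criterion J(w) = (w^T S_B w) / (w^T S_W w) for a projection w."""
    p_a, p_b = x_a @ w, x_b @ w
    between = (p_a.mean() - p_b.mean()) ** 2  # squared distance of projected class means
    within = p_a.var() * len(p_a) + p_b.var() * len(p_b)  # projected intra-class scatter
    return float(between / within)


# Hypothetical toy features: classes differ along axis 0, axis 1 is pure noise.
rng = np.random.default_rng(0)
happy = rng.normal(loc=[2.0, 0.0], scale=[0.3, 3.0], size=(200, 2))
tired = rng.normal(loc=[-2.0, 0.0], scale=[0.3, 3.0], size=(200, 2))

w = fisher_lda_direction(happy, tired)
print(w)  # close to [1, 0]: the noisy axis is almost ignored
print(fisher_score(happy, tired, w) >= fisher_score(happy, tired, np.array([0.0, 1.0])))  # True
```

Because the closed form maximizes J(w), no other projection (such as the noisy axis [0, 1]) can score higher. Note that S_W and the class means must be estimated from labeled data, which is exactly the labeling cost the paragraph above points out.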


Another important factor in emotion recognition is multi-modal fusion. For example, we can collect not only audio data but also videos and images from users. Making full use of these multi-modal data can further improve recognition accuracy, so it is worthwhile to investigate multi-modal fusion strategies.
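Two common baseline fusion strategies are feature-level (early) fusion and decision-level (late) fusion. The NumPy sketch below contrasts them under simple assumptions: the label set, modality names, embedding sizes, and per-modality weights are all hypothetical, and the "classifiers" are represented only by their output logits.

```python
import numpy as np

EMOTIONS = ["happy", "neutral", "tired"]  # hypothetical label set


def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def feature_level_fusion(features: dict) -> np.ndarray:
    """Early fusion: concatenate per-modality embeddings (in sorted modality
    order) so that a single joint classifier sees all modalities at once."""
    return np.concatenate([features[m] for m in sorted(features)], axis=-1)


def decision_level_fusion(logits: dict, weights: dict) -> np.ndarray:
    """Late fusion: each modality has its own classifier; average their class
    probabilities with per-modality weights (renormalized to sum to 1)."""
    total = sum(weights[m] for m in logits)
    return sum(weights[m] * softmax(logits[m]) for m in logits) / total


fused_features = feature_level_fusion({"audio": np.ones(128), "video": np.ones(64)})
print(fused_features.shape)  # (192,)

probs = decision_level_fusion(
    {"audio": np.array([2.0, 0.5, 0.1]), "video": np.array([0.2, 0.1, 1.5])},
    weights={"audio": 0.6, "video": 0.4},
)
print(EMOTIONS[int(np.argmax(probs))])  # happy
```

Early fusion lets a model learn cross-modal interactions but needs all modalities at once; late fusion keeps modality-specific models independent, which makes it easier to drop or reweight a modality.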


  • Deep-learning-based, modality-specific emotion recognition models with limited labeled samples
  • Large-scale self-supervised learning, contrastive learning, or pretrained models for discriminative feature learning
  • Effective multi-modal fusion methods to improve emotion recognition accuracy
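For the contrastive-learning theme above, one widely used objective is the InfoNCE loss, which pulls two views of the same unlabeled sample together and pushes other samples in the batch apart. This is a minimal, one-directional NumPy sketch under stated assumptions: the embeddings are random placeholders standing in for encoder outputs, and the temperature of 0.1 is an illustrative choice, not a recommended setting.

```python
import numpy as np


def info_nce_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE over a batch of paired views: row i of z1 and row i of z2 are a
    positive pair; every other row of z2 serves as a negative for z1[i]."""
    # L2-normalize so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (the matching view) as the target class.
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
view_1 = rng.normal(size=(32, 16))                 # hypothetical embeddings of view 1
view_2 = view_1 + 0.05 * rng.normal(size=(32, 16))  # slightly perturbed view 2
aligned = info_nce_loss(view_1, view_2)
shuffled = info_nce_loss(view_1, view_2[::-1])      # break the positive pairing
print(aligned < shuffled)  # True
```

No labels are needed: the only supervision is which rows form a pair, which is why this family of objectives suits the cheap, unlabeled data described earlier.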

Related Research Topics

  • Video-based facial expression recognition in the wild
  • Feature-level and decision-level fusion for multiple modalities
  • Transfer learning for emotion recognition
  • Emotion recognition with incomplete modalities
  • Unsupervised large-scale pretrained models
  • Cross-modal retrieval and recommendation
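A simple baseline for emotion recognition with incomplete modalities is to fuse only the modalities that are actually present. The sketch below is a hypothetical illustration, not a method proposed here: it assumes every modality has already been projected to a shared embedding dimension, marks a missing modality as None, and falls back to a zero vector when nothing is available.

```python
import numpy as np
from typing import Dict, Optional


def masked_mean_fusion(embeddings: Dict[str, Optional[np.ndarray]], dim: int) -> np.ndarray:
    """Average the embeddings of the modalities that are present (not None).

    Assumes every modality has already been projected to a shared dimension `dim`.
    """
    present = [e for e in embeddings.values() if e is not None]
    if not present:
        # No modality available at all: return a zero vector (an arbitrary choice).
        return np.zeros(dim)
    return np.mean(np.stack(present), axis=0)


sample = {"audio": np.array([1.0, 1.0]), "video": None, "image": np.array([3.0, 3.0])}
print(masked_mean_fusion(sample, dim=2))  # [2. 2.]
```

Averaging keeps the fused vector on the same scale regardless of how many modalities are missing; more elaborate approaches would instead learn to reconstruct or compensate for the absent inputs.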
