The 1st Mandarin Audio-Visual Speech Recognition Challenge
7th Emotion Recognition in the Wild Challenge (EmotiW)
The 1st Chinese Audio-Textual Spoken Language Understanding Challenge (CATSLU)

The 1st Mandarin Audio-Visual Speech Recognition Challenge

This challenge aims to explore the complementarity between visual and acoustic information in real-world speech recognition systems. Audio-based speech recognition has made great progress in the last decade, but still faces many challenges in noisy conditions. With the rapid development of computer vision technologies, audio-visual speech recognition has become a hot topic, yet it remains unclear how much visual speech can complement acoustic speech. In this challenge, we encourage not only contributions that achieve high recognition performance, but also contributions that bring bold new ideas to the topic. There are three sub-challenges in total, and participants are free to take part in one, two, or all of them:

  1. Sub-challenge 1: Closed-set word-level speech recognition: The task is to recognize the word spoken in each test video, where all test words also appear in the training set. This task can therefore be tackled by simple classification, decoding, or any other reasonable method.
  2. Sub-challenge 2: Open-set word-level speech recognition: Unlike sub-challenge 1, test words may be unseen during training. A model that has learned the true pronunciation rules should recognize a test word correctly whether or not it appeared in the training set.
  3. Sub-challenge 3: Audio-visual keyword spotting: The task is to decide whether a given keyword occurs in a test video and, if so, to locate its position. Keyword spotting is especially useful in practical systems, but environmental noise often keeps audio-only performance unsatisfactory. Introducing visual information is expected to improve current audio-based systems.
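As an illustration of the detect-and-locate decision in the keyword-spotting task, the sketch below averages frame-level keyword posteriors over a sliding window and thresholds the best-scoring window. The scores, window size, and threshold are toy assumptions for illustration, not part of the challenge specification:

```python
# Hypothetical keyword-spotting sketch: sliding-window averaging of
# frame-level keyword posteriors (e.g. from an audio-visual model).

def spot_keyword(frame_scores, window=3, threshold=0.6):
    """Return (found, start_frame) for the best-scoring window.

    frame_scores: per-frame posterior probabilities that the keyword
    is active. A window whose mean score reaches the threshold counts
    as a detection; its start index is the located position.
    """
    best_mean, best_start = 0.0, -1
    for start in range(len(frame_scores) - window + 1):
        mean = sum(frame_scores[start:start + window]) / window
        if mean > best_mean:
            best_mean, best_start = mean, start
    found = best_mean >= threshold
    return (found, best_start if found else -1)

scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1]
print(spot_keyword(scores))  # → (True, 2): keyword detected at frame 2
```

Real submissions would of course replace the fixed window and threshold with a trained detector, but the decide-then-locate output contract is the same.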

7th Emotion Recognition in the Wild Challenge (EmotiW)

The Seventh Emotion Recognition in the Wild (EmotiW) 2019 Grand Challenge is an all-day event focused on affective sensing in unconstrained conditions. It embeds an audio-video based emotion classification challenge and an image-based group-level facial expression recognition challenge, both of which mimic real-world conditions. Traditionally, emotion recognition has been performed on laboratory-controlled data. While undoubtedly worthwhile at the time, such lab-controlled data poorly represents the environment and conditions faced in real-world situations. With the growing number of video clips online, it is worthwhile to explore how well emotion recognition methods work "in the wild". The goal of this challenge is to extend and carry forward the common evaluation platform for emotion recognition in real-world conditions defined in the EmotiW 2018 Grand Challenge, held at the ACM International Conference on Multimodal Interaction 2018. This year there will be three sub-challenges:

  1. Audio-video based emotion recognition sub-challenge (AV)
  2. Group-level Cohesion sub-challenge (GC)
  3. Engagement prediction in the Wild (EW)

Please visit the EmotiW 2019 website for important dates, information, and updates:

The 1st Chinese Audio-Textual Spoken Language Understanding Challenge (CATSLU)

Spoken language understanding (SLU) is a key component of a spoken dialogue system (SDS): it parses a user's utterances into corresponding semantic concepts. For example, the utterance "Show me flights from Boston to New York" can be parsed into (fromloc.city_name=Boston, toloc.city_name=New York). Building a robust semantic parser for a multi-turn task-oriented spoken dialogue system is challenging, as it faces three main problems: the variety of spoken language expression, the uncertainty of automatic speech recognition (ASR), and adaptation across dialogue domains. To fully investigate these problems and promote the application of spoken dialogue systems, we will release a multi-turn task-oriented Chinese spoken dialogue dataset and organize the first open, audio-text based Chinese Task-Oriented Spoken Language Understanding Challenge. This challenge consists of two sub-challenges:

  1. SLU in domain: Build a slot-filling system in a single domain. A large number of training dialogues related to music search and map navigation will be released. The data was collected from real-world dialogues between users and a working spoken dialogue system (human-computer interaction). Since both audio and textual information are important for understanding users, audio features will be provided in addition to text.
  2. Domain adaptation of SLU: Adapt an SLU model from a source domain to a target domain. We set music and map as source domains, and video and weather as target domains (20% of the target-domain utterances will be randomly selected as seed data; the rest is used for evaluation). Participants can use the seed data plus the music and map data from the first sub-challenge for adaptive training.
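The slot-filling output format in the flight example above (e.g. fromloc.city_name=Boston) can be sketched with a minimal lexicon-based parser. The lexicon, marker words, and function name below are hypothetical stand-ins; a real system would use a trained sequence-labeling model over audio and text features:

```python
# Minimal lexicon-based slot-filling sketch (hypothetical illustration).
# Maps utterance substrings to slot-value pairs in the (slot=value) style.

SLOT_LEXICON = {
    "fromloc.city_name": ["Boston", "Denver"],
    "toloc.city_name": ["New York", "Chicago"],
}

def fill_slots(utterance):
    """Return {slot: value} for lexicon entries found in the utterance,
    disambiguated by the 'from'/'to' marker word preceding the city."""
    slots = {}
    for slot, values in SLOT_LEXICON.items():
        marker = "from " if slot.startswith("fromloc") else "to "
        for value in values:
            if marker + value in utterance:
                slots[slot] = value
    return slots

print(fill_slots("Show me flights from Boston to New York"))
# → {'fromloc.city_name': 'Boston', 'toloc.city_name': 'New York'}
```

A lexicon lookup like this breaks down quickly in practice, which is exactly where the three problems listed above (expression variety, ASR uncertainty, domain adaptation) come into play.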

For further important information, dates, and updates, please visit the CATSLU website:

ICMI 2019 ACM International Conference on Multimodal Interaction. Copyright © 2018-2019