Developing systems that can robustly understand human-human communication or respond to human input requires identifying the best algorithms and their failure modes. In fields such as computer vision, speech recognition, and computational linguistics, the availability of datasets and common tasks has led to great progress. This year we invited the ICMI community to collectively define and tackle scientific Grand Challenges in multimodal interaction for the next five years. We received a good response to the call, and we are hosting four Challenge events at the ICMI 2013 conference. Multimodal Grand Challenges are driven by ideas that are bold, innovative, and inclusive. We hope they will inspire new ideas in the ICMI community and create momentum for future collaborative work.

Multimodal Grand Challenge Chairs

Jean-Marc Odobez
IDIAP Research Institute, Switzerland

Vidhyasaharan Sethu
The University of New South Wales, Australia

1st International Challenge for Multimodal Mid-Air Gesture Recognition for Close HCI: ChAirGest 2013

The ChAirGest challenge is a research-oriented competition designed to compare multimodal gesture recognizers. The data provided in the challenge have been recorded from multiple sensors to support the optimization of methods for gesture spotting and recognition. A common benchmark tool will be used to quantitatively compare the submitted algorithms under strictly comparable conditions. The proposed algorithms can use any combination of the data types available in the dataset.

The provided data come from one Kinect camera and four Inertial Motion Units (IMUs) attached to the right arm and neck of the subject. The dataset contains 10 different gestures, started from 3 different resting postures and recorded in two different lighting conditions by 10 different subjects. Thus, the total dataset contains 1200 annotated gestures split into continuous video sequences containing a variable number of gestures. The goal of the challenge and related workshop is to promote research on methods using multimodal data to spot and recognize gestures in the context of close human-computer interaction. However, many side goals and research paths may be explored with the provided dataset, notably sensor fusion or enhancing single-sensor recognition with multi-sensor information. Many different research challenges need to be tackled in this domain. Participants are given the opportunity to submit short papers and share their findings during the one-day workshop; these papers will be published alongside the main ICMI proceedings in the ACM Digital Library.

Please visit our website for more information

Please do not hesitate to use our discussion group for questions and remarks:!forum/chairgest


Simon Ruffieux
Department of Information and Telecommunication Technology, University of Applied Sciences of Western Switzerland.
Denis Lalanne
Department of Informatics, University of Fribourg, Switzerland.
Elena Mugellini
Department of Information and Telecommunication Technology, University of Applied Sciences of Western Switzerland.
Daniel Roggen
Department of Information Technology and Electrical Engineering, Swiss Federal Institute of Technology, Switzerland.
Stefano Carrino
Department of Information and Telecommunication Technology, University of Applied Sciences of Western Switzerland.

Important dates

Release of development data TBA
Program executable & short-paper submission TBA
Notification of acceptance TBA
Camera-ready paper TBA


Multimodal Conversational Analytics

The MCA challenge concerns the multimodal analysis of primary cues and qualities of conversations. It proposes to establish a basis for the comparison, analysis, and further improvement of multimodal data annotations and multimodal interactive systems, both of which are important in building multimodal applications. Such machine-learning-based challenges do not exist in the Multimodal Interaction community, and by focusing on the development of algorithms and techniques on shared data sets, we aim to foster the research and development of multimodal interactive systems.

Please visit our website for more information


Xavier Alameda-Pineda
INRIA Grenoble Rhône-Alpes, University of Grenoble, France
Roman Bednarik
University of Eastern Finland, Finland
Kristiina Jokinen
University of Helsinki, Finland
Michal Hradis
Brno University of Technology, Czech Republic

Important dates

Paper deadline TBA
Author notification TBA
Camera-ready paper TBA


ChaLearn Challenge and Workshop on Multi-modal Gesture Recognition

In 2013, ChaLearn is organizing a challenge and workshop on multi-modal gesture recognition from 2D and 3D video data using Kinect, in conjunction with ICMI 2013, December 9-13, Sydney, Australia. Kinect is revolutionizing the field of gesture recognition given the set of input data modalities it provides, including RGB image, depth image (using an infrared sensor), and audio. Gesture recognition is important in many multi-modal interaction and computer vision applications, including image/video indexing, video surveillance, computer interfaces, and gaming. It also provides excellent benchmarks for algorithms. The recognition of continuous, natural signing is very challenging due to the multimodal nature of the visual cues (e.g., movements of fingers and lips, facial expressions, body pose), as well as technical limitations such as spatial and temporal resolution and unreliable depth cues. The workshop is devoted to the presentation of the most recent and challenging techniques in multi-modal gesture recognition. The committee encourages paper submissions on topics including, but not limited to:

  • Multi-modal descriptors for gesture recognition
  • Fusion strategies for gesture recognition
  • Multi-modal learning for gesture recognition
  • Data sets and evaluation protocols for multi-modal gesture recognition
  • Applications of multi-modal gesture recognition

The results of the challenge will be discussed at the workshop. The challenge features a quantitative evaluation of automatic gesture recognition on a multi-modal dataset recorded with Kinect (providing RGB images of face and body, depth images of face and body, skeleton information, joint orientation, and audio sources), including about 20,000 gestures from several users. The gestures are drawn from different gesture vocabularies from very diverse domains. The emphasis of the competition is on multi-modal automatic learning of vocabularies of gestures performed by several different users, with the aim of performing user-independent continuous gesture recognition.

Additionally, the challenge includes a live competition of demos/systems of applications based on multi-modal gesture recognition techniques. Demos using data from different modalities and different kinds of devices are welcome. The demos will be evaluated in terms of multi-modality, technical quality, and applicability.

The authors of the best workshop papers and the top three ranked participants of the quantitative evaluation will be invited to present their work at ICMI 2013, and their papers will be published in the proceedings. Additionally, there will be travel grants (based on availability) and the possibility of being invited to submit extended versions of their work to a special issue of a high-impact-factor journal. Moreover, the three top-ranking participants in the quantitative challenge will each be awarded a ChaLearn winner certificate and a monetary prize (based on availability). We will also announce best paper and best student paper awards among the workshop contributions.

The ChaLearn Challenge organisers have negotiated a Special Topic on Gesture Recognition call for papers with the Journal of Machine Learning Research. More details can be found in this downloadable call for papers.

Please visit our website for more information


Sergio Escalera
Computer Vision Center (UAB) and University of Barcelona, Spain
Jordi Gonzàlez
Universitat Autònoma de Barcelona & Computer Vision Center, Spain
Isabelle Guyon
Clopinet, Berkeley, California, USA
Thomas B. Moeslund
Aalborg University, Denmark
Oscar Lopes
Computer Vision Center (UAB), Spain
Miguel Reyes
Computer Vision Center (UAB) and University of Barcelona, Spain
Xavier Baró
Computer Vision Center and Universitat Oberta de Catalunya, Spain
Vassilis Athitsos
University of Texas, USA
Pat Jangyodsuk
University of Texas, USA
Hugo Jair Escalante
INAOE, Puebla, Mexico
Aaron Negrín
University of Barcelona, Spain

Important dates

Quantitative Challenge
Beginning of the quantitative competition, release of the first data examples TBA
Full release of development and validation data TBA
Release of validation data TBA
Release of final evaluation data TBA
Release of final evaluation data decryption key TBA
End of the quantitative competition. Deadline for code submission and the prediction results on final evaluation data. The organizers start the code verification by running it on the final evaluation data TBA
Deadline for submitting the fact sheets summarizing proposed methods TBA
Release of the verification results to the participants for review TBA
Workshop paper submission deadline (the top three ranked quantitative and qualitative participants will be invited to submit their contributions as papers to the workshop) TBA
Notification of workshop paper acceptance TBA
Camera-ready workshop papers TBA

Emotion Recognition In The Wild Challenge and Workshop (EmotiW)

The Emotion Recognition In The Wild Challenge and Workshop (EmotiW) 2013 Grand Challenge consists of an audio-video based emotion classification challenge that mimics real-world conditions. Traditionally, emotion recognition has been performed on laboratory-controlled data. While undoubtedly worthwhile at the time, such lab-controlled data poorly represent the environment and conditions faced in real-world situations. With the increase in the number of video clips online, it is worthwhile to explore the performance of emotion recognition methods that work 'in the wild'. The goal of this Grand Challenge is to define a common platform for the evaluation of emotion recognition methods in real-world conditions.

The database used in the 2013 challenge is Acted Facial Expressions in the Wild (AFEW), which has been collected from movies showing close-to-real-world conditions. Three sets, for training, validation, and testing, will be made available.

Please visit our website for more information


Abhinav Dhall
Australian National University
Roland Goecke
University of Canberra / Australian National University
Jyoti Joshi
University of Canberra
Michael Wagner
University of Canberra / Australian National University
Tom Gedeon
Australian National University

Important dates

Training and validation data available TBA
Testing data available TBA
Paper submission deadline TBA
Notification of acceptance TBA
Camera-ready paper TBA


Multimodal Learning Analytics (MMLA)

Multimodal learning analytics, learning analytics, and educational data mining are emerging disciplines concerned with developing techniques to more deeply explore unique data in learning settings. They also use the results of these analyses to understand how students learn. Among other things, this includes how students communicate, collaborate, and use digital and non-digital tools during learning activities, and the impact of these interactions on developing new skills and constructing knowledge. Advances in learning analytics are expected to contribute new empirical findings, theories, methods, and metrics for understanding how students learn. They can also contribute to improving pedagogical support for students' learning through new digital tools, teaching strategies, and curricula. The most recent direction within this area is multimodal learning analytics, which emphasizes the analysis of natural rich modalities of communication during situated interpersonal and computer-mediated learning activities. This includes students' speech, writing, and nonverbal interaction (e.g., gestures, facial expressions, gaze, sentiment). The First International Conference on Multimodal Learning Analytics represented the first intellectual gathering of multidisciplinary scientists interested in this new topic.

Please visit our website for more information

Important dates

Distribution of workshop announcement to email lists TBA
MMLA database available for grand challenge participants TBA
Paper submission deadline (extended) TBA
Notification of acceptance TBA
Camera-ready papers due TBA
Workshop event TBA

Grand Challenge Workshop and Participation Levels

The Second International Workshop on Multimodal Learning Analytics will bring together researchers in multimodal interaction and systems, cognitive and learning sciences, educational technologies, and related areas to advance research on multimodal learning analytics. Following the First International Workshop on Multimodal Learning Analytics in Santa Monica in 2012, this second workshop will be organized as a data-driven "Grand Challenge" event, to be held at ICMI 2013 in Sydney, Australia, on December 9, 2013. There will be three levels of workshop participation, including attendees who wish to:

  • Participate in the grand challenge dataset competition and report results (using your own dataset, or the Math Data Corpus described below, which is available for access)
  • Submit an independent research paper on MMLA, including learning-oriented behaviors related to the development of domain expertise, prediction techniques, data resources, and other topics
  • Observe and discuss new topics and challenges in MMLA with other attendees, for which a position paper should be submitted

Those wishing to participate in the competition using the Math Data Corpus will be asked to contact the workshop organizers and sign a "collaborator agreement" for IRB purposes to access the dataset (see data corpus section). The dataset used for the competition is well structured to support investigating different aspects of multimodal learning analytics. It involves high school students collaborating while solving mathematics problems.

The dataset will be available for a six-month period so researchers can participate in the competition. The competition will involve identifying one or more factors and demonstrating that they can predict domain expertise: (1) with high reliability, and (2) as early in a session as possible. Researchers will be asked to accurately identify: (1) which of three students in each session is the dominant domain expert, and (2) which of 16 problems in each session is solved correctly versus incorrectly using their predictor(s).
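The two prediction tasks above can be scored as plain accuracies. The following is a minimal sketch only: the function names, the session identifiers, and the student labels are hypothetical, and the organizers' actual scoring procedure is not described here. It also does not model the "as early in a session as possible" criterion.

```python
from typing import Dict, List


def expert_accuracy(predicted: Dict[str, str], actual: Dict[str, str]) -> float:
    """Fraction of sessions whose dominant domain expert was identified correctly."""
    hits = sum(1 for session, expert in actual.items()
               if predicted.get(session) == expert)
    return hits / len(actual)


def problem_accuracy(predicted: Dict[str, List[bool]],
                     actual: Dict[str, List[bool]]) -> float:
    """Fraction of problems whose solved-correctly label was predicted correctly."""
    hits = 0
    total = 0
    for session, truth in actual.items():
        guess = predicted[session]
        if len(guess) != len(truth):
            raise ValueError(f"session {session}: expected {len(truth)} predictions")
        hits += sum(g == t for g, t in zip(guess, truth))
        total += len(truth)
    return hits / total


# Hypothetical labels for two sessions (student IDs "A", "B", "C" are made up).
actual_experts = {"session_01": "A", "session_02": "C"}
predicted_experts = {"session_01": "A", "session_02": "B"}

# Hypothetical correctness labels for the 16 problems of one session.
actual_solved = {"session_01": [True] * 12 + [False] * 4}
predicted_solved = {"session_01": [True] * 16}

print(expert_accuracy(predicted_experts, actual_experts))   # 0.5
print(problem_accuracy(predicted_solved, actual_solved))    # 0.75
```

A predictor that always picks the same student, or always predicts "correct", gives a simple baseline against which a proposed predictor can be compared.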

Available Data Corpus and Multimodal Analysis Tools

Existing Dataset: A data corpus is available for analysis during the multimodal learning analytics competition. It involves 12 sessions, with small groups of three students collaborating while solving mathematics problems (i.e., geometry, algebra). Data were collected on their natural multimodal communication and activity patterns during these problem-solving and peer tutoring sessions, including students' speech, digital pen input, facial expressions, and physical movements. In total, approximately 15-18 hours of multimodal data are available from these situated problem-solving sessions.

Participants were 18 high-school students, organized into three-person male and female groups. Each group of three students met for two sessions. These student groups varied in performance characteristics, with some low-to-moderate performers and others high-performing students. During the sessions, students were engaged in authentic problem solving and peer tutoring as they worked on 16 mathematics problems, four apiece representing easy, moderate, hard, and very hard difficulty levels. Each problem had a canonical correct answer. Students were motivated to solve problems correctly, because one student was randomly called upon to explain the answer after solving it.

During each session, natural multimodal data were captured from 12 independent audio, visual, and pen signal streams. These included high-fidelity: (1) close-up camera views of each student while working, showing the face and hand movements while working at the table (waist-up view), as well as a wide-angle view for context and another top-down view of students' writing and artifacts on the table; (2) close-talking microphone capture of each student's speech, and a room microphone for recording group discussion; (3) digital pen input for each student, who used an Anoto-based digital pen and large sheet of digital paper for streaming written input. Software was developed for accurate time synchronization of all twelve of these media streams during collection and playback.

The data have been segmented by start and end time of each problem, scored for solution correctness, and also scored for which student solved the problem correctly. The data available for analysis include students':

  • Speech signals
  • Digital pen signals
  • Video signals showing activity patterns (e.g., gestures, facial expressions)

In addition, for each student group one session of digital pen data has been coded for written representations, including (1) type of written representation (e.g., linguistic, symbolic, numeric, diagrammatic, marking), (2) meaning of representation, (3) start/end time of each representation, and (4) presence of written disfluencies. Note that lexical transcriptions of speech will not be available with the dataset. However, participants are free to produce their own transcriptions if they want to analyze the spoken content.
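The counts given in the corpus description are mutually consistent, which the short sketch below checks. The constant and key names are illustrative only and do not reflect how the corpus files are actually organized; the total number of problem segments assumes 16 problems in every session, as stated in the competition description.

```python
# Counts stated in the corpus description.
STUDENTS = 18
GROUP_SIZE = 3
SESSIONS_PER_GROUP = 2
PROBLEMS_PER_SESSION = 16

# Number of streams of each type captured in one session
# (key names are hypothetical labels, not file or channel names).
STREAMS_PER_SESSION = {
    "close_up_camera": GROUP_SIZE,            # one per student
    "wide_angle_camera": 1,                   # context view
    "top_down_camera": 1,                     # writing and artifacts on the table
    "close_talking_microphone": GROUP_SIZE,   # one per student
    "room_microphone": 1,                     # group discussion
    "digital_pen": GROUP_SIZE,                # one per student
}

groups = STUDENTS // GROUP_SIZE
sessions = groups * SESSIONS_PER_GROUP
streams = sum(STREAMS_PER_SESSION.values())
problem_segments = sessions * PROBLEMS_PER_SESSION

print(groups, sessions, streams, problem_segments)  # 6 12 12 192
```

Six groups meeting twice each gives the 12 sessions, and five video, four audio, and three pen streams give the 12 synchronized streams per session.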


Dr. Stefan Scherer
USC Institute for Creative Technologies
Dr. Nadir Weibel
Department of Computer Science and Engineering
Marcelo Worsley
Transformative Learning Technologies Lab
Dr. Louis-Philippe Morency
USC Institute for Creative Technologies
Dr. Sharon Oviatt
President & Research Director, Incaa Designs Nonprofit


ICMI 2013 ACM International Conference on Multimodal Interaction. 9-13th December 2013, Sydney, Australia. Copyright © 2010-2021