Tutorials
1. Introduction to eye and speech behaviour computing for affect analysis in wearable contexts
2. Platform for Situated Intelligence and OpenSense: A Tutorial on Building Multimodal Interactive Applications for Research
3. Multimodal Machine Learning: Principles, Challenges, and Open Questions
1. Introduction to eye and speech behaviour computing for affect analysis in wearable contexts
Duration:
3 hours
Description:
Eye behaviour and speech are two of the most ubiquitous sensing modalities in everyday human-human and human-computer interaction. As wearables become lightweight, comfortable to wear, and computationally powerful, they are likely to become the next generation of computing devices (e.g., Apple Vision Pro). This provides novel opportunities to explore new types of eye behaviour and new methods of body sound sensing for affect analysis and modelling. Machine-learning-based multimodal affective computing systems have seen success in certain scenarios, but their application to wearable contexts remains less explored. Understanding the theoretical (e.g., psychophysiological) basis of, and approaches to, eye and speech/audio behaviour computing and multimodal computing is therefore essential for affect analysis and for building innovative wearable affect systems across a range of human-human and human-computer interaction contexts.
This tutorial focuses on wearable sensing for affective computing systems and specifically targets both fundamental and state-of-the-art eye and speech/audio behaviour processing, as well as multimodal computing. It is the first tutorial to discuss multimodal perspectives on wearable sensing and computing for affect analysis, and it is particularly suited to the ICMI 2023 theme of the science of multimodal interactions. The tutorial consists of four parts: (i) eye behaviour computing, which introduces wearable devices for acquiring eye images, novel eye behaviour types and their correlation with affect, and computational methods for extracting eye behaviours; (ii) speech and audio analysis, which covers wearable devices for audio collection, different forms of audio and their relevance to affect, as well as processing pipelines and potential innovative applications; (iii) multimodality, which focuses on motivation, approaches, and applications; and (iv) a practical session, which contains demonstrations of eye behaviour computing and audio sensing and processing in practice, plus hands-on exercises with shared code.
Program:
- Part 1: Introduction to eye behaviour computing for affect (50 min including Q&A and break)
- Overview of current eyewear and head-mounted systems and multimodal affect analysis in these wearable platforms.
- Wearable devices to obtain eye information.
- Eye behaviour types (e.g., pupil, blink, gaze) and their relationships with affect.
- Computational methods for eye behaviour analysis.
- Issues in experiment design, including data collection, feature extraction and selection, machine learning pipeline, in-the-wild data, bias.
- Available datasets, off-the-shelf tools and how to get started.
- Future directions and challenges.
- Part 2: Speech and audio analysis for affect computing (45 min including Q&A and break)
- Introduction to wearable devices for speech and audio collection.
- Different forms of wearable audio (e.g., speech, cough, bone-conducted sounds).
- Relevance to affect and potential applications.
- Speech and audio processing pipelines for affect computing, from data collection and model design to evaluation.
- Future directions and challenges (e.g., emerging trends and technologies, such as augmented reality and personalized audio; ethical considerations such as privacy and security; data challenges such as missingness, etc.)
- Part 3: Multimodality (45 min including Q&A and break)
- Motivation for multimodal approaches (performance increase, redundancy, different types of information, context).
- What multimodal approaches can contribute to assessing affect and cognition (benefits of multimodal specifically in the context of affect/cognition).
- Approaches for multimodal analysis, modelling and system design, including for longitudinal analysis (fusion, statistical feature vs. event feature based, analysis methods).
- Examples of multimodal system designs and their benefits.
- Applications of multimodal systems and use case considerations.
- Future directions and challenges.
- Part 4: Interactive research design activity (40 min including Q&A)
In this session, we will first use an open-source eye tracker and a publicly available dataset to demonstrate eye behaviour computing. We will then present audio wearables (and possibly a real-time platform) to demonstrate audio sensing and processing in practice. All code, including a simple machine learning pipeline, will be shared with participants. We will also leave sufficient time for participants to try the devices and tools themselves, and we will provide help with using them. If time permits, participants will share how they might use these kinds of devices, data, and computing methods in their own research.
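To give a concrete flavour of the practical session, the following is a minimal, illustrative sketch (not the actual code that will be shared) of what a simple machine learning pipeline for affect classification from pre-extracted wearable eye and audio features might look like; the feature dimensions, labels, and use of scikit-learn are assumptions for illustration only.

```python
# Minimal sketch of a wearable affect-classification pipeline (illustrative only;
# the tutorial's shared code may differ). Assumes per-window features have already
# been extracted (e.g., pupil diameter, blink rate, gaze statistics, MFCC summaries)
# together with an affect label for each window.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))        # placeholder: 200 windows x 16 eye/audio features
y = rng.integers(0, 2, size=200)      # placeholder: e.g., low vs. high arousal labels

# Standardise features, then classify with an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# Cross-validated accuracy as a first sanity check.
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.2f}")
```

In practice, windows from the same participant should be kept in the same fold (e.g., a grouped or leave-one-subject-out split) to avoid identity leakage when estimating performance.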
Speakers:
Dr. Siyuan Chen, University of New South Wales
Bio: Siyuan Chen is a Lecturer at the University of New South Wales (UNSW). Her work focuses on using “big data” from close-up eye videos, speech, and head movement, together with advanced analytics, to understand human internal states such as emotion, cognition, and action, and on developing models of these internal states using wearable eyewear systems. She received her PhD in Electrical Engineering from UNSW, during which she worked as a Research Intern at NII, Tokyo, Japan. Before joining UNSW, she was a Research Fellow in the Department of Computer Science and Information Systems at the University of Melbourne and a visiting researcher with the STARS team, INRIA, Sophia Antipolis, France. Dr. Chen is a recipient of the NICTA Postgraduate Scholarship and top-up Project Scholarship, the Commercialization Training Scheme Scholarship, and the Australia Endeavour Fellowship 2015. She has published over 30 papers in high-quality peer-reviewed venues and filed two patents. She led a special session at SMC 2021 and a special issue in Frontiers in Computer Science in 2021. She also served as a session chair at WCCI 2020 and SMC 2021, and has been invited to serve on the Programme Committee of several conferences and workshops, such as ACII, IEEE CBMS, and the Social AI for Healthcare 2021 workshop. She is a member of the Women in Signal Processing Committee. Her work has been supported by US-based funding sources multiple times, and she received UNSW Faculty of Engineering Early Career Academics funding in 2021.
Dr. Ting Dang, Nokia Bell Labs/ University of Cambridge
Bio: Ting Dang is currently a Senior Research Scientist at Nokia Bell Labs and a visiting researcher in the Department of Computer Science and Technology, University of Cambridge. Prior to this, she worked as a Senior Research Associate at the University of Cambridge. She received her Ph.D. from the University of New South Wales, Australia. Her primary research interests are in human-centric sensing and machine learning for mobile health monitoring and delivery, specifically exploring the potential of audio signals (e.g., speech, cough) captured via mobile and wearable sensing for automatic mental state (e.g., emotion, depression) prediction and disease (e.g., COVID-19) detection and monitoring. Further, her work aims to develop generalized, interpretable, and robust machine learning models to improve healthcare delivery. She has served as a (senior) program committee member and reviewer for more than 30 conferences and top-tier journals, such as NeurIPS, AAAI, IJCAI, IEEE TAC, IEEE TASLP, JMIR, ICASSP, and INTERSPEECH. She was shortlisted and invited to attend the Asian Deans’ Forum Rising Stars 2022 and won the IEEE Early Career Writing Retreat Grant 2019 and the ISCA Travel Grant 2017. She was part of the successful bid for INTERSPEECH 2026 (social media co-chair) and is organizing scientific meetings such as the UbiComp WellComp 2023 workshop (co-organizer).
Prof. Julien Epps, University of New South Wales
Bio: Julien Epps received the BE and PhD degrees from the University of New South Wales, Sydney, Australia, in 1997 and 2001, respectively. From 2002 to 2004, he was a Senior Research Engineer with Motorola Labs, where he was engaged in speech recognition. From 2004 to 2006, he was a Senior Researcher and Project Leader with National ICT Australia, Sydney, where he worked on multimodal interface design. He then joined the UNSW School of Electrical Engineering and Telecommunications, Australia, in 2007 as a Senior Lecturer, and is currently a Professor and Head of School. He is also a Co-Director of the NSW Smart Sensing Network, a Contributed Researcher with Data61, CSIRO, and a Scientific Advisor for Sonde Health (Boston, MA). He has authored or co-authored more than 270 publications and serves as an Associate Editor for the IEEE Transactions on Affective Computing. His current research interests include characterisation, modelling, and classification of mental state from behavioural signals, such as speech, eye activity, and head movement.
Tutorial Website:
http://ireye4task.github.io/icmi_tutorial.html
2. Platform for Situated Intelligence and OpenSense: A Tutorial on Building Multimodal Interactive Applications for Research
Duration:
3 hours
Description:
Platform for Situated Intelligence (\psi) is an open-source framework intended to support the rapid development and study of multimodal, integrative-AI applications. The framework provides infrastructure for constructing and executing pipelines of heterogeneous components that operate over temporal streams of data; a set of development tools for visualization, annotation, and debugging; and an open ecosystem of component technologies. A number of applications and higher-level research platforms have been built on this framework by researchers in the ICMI community over the past few years, including OpenSense, a platform specifically targeted at real-time multimodal data acquisition and behavior perception. In this tutorial, we will walk through how to use Platform for Situated Intelligence and OpenSense for multimodal interaction research, starting from the basic concepts, working together towards a sample application, and showing many hands-on examples of the breadth of research applications that can be targeted.
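As a purely conceptual illustration of the pipeline model described above (\psi itself is a .NET framework, and none of the names below belong to its API), the following Python sketch shows the general pattern of components that produce and transform timestamped messages on streams.

```python
# Conceptual sketch only: illustrates components operating over temporal streams of
# data; all names here are hypothetical and are NOT the \psi API.
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator

@dataclass
class Message:
    timestamp: float   # originating time of the sample
    data: Any

def source(samples: Iterable[tuple]) -> Iterator[Message]:
    """A source component that emits timestamped messages (e.g., camera or mic frames)."""
    for t, d in samples:
        yield Message(t, d)

def transform(stream: Iterator[Message], fn: Callable[[Any], Any]) -> Iterator[Message]:
    """A transform component that maps each message while preserving its originating time."""
    for m in stream:
        yield Message(m.timestamp, fn(m.data))

# A toy pipeline: "audio" frames -> per-frame energy -> console sink.
audio = source([(0.0, [0.1, 0.2]), (0.1, [0.4, 0.3])])
energy = transform(audio, lambda frame: sum(x * x for x in frame))
for msg in energy:
    print(f"t={msg.timestamp:.1f}s energy={msg.data:.3f}")
```

In \psi itself, pipelines, stream operators, synchronization, and persistence are provided by the framework; the tutorial and the repositories linked below cover the actual APIs.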
Program:
- Tutorial overview and welcome (10 minutes)
- Introduction to \psi (25 minutes)
- Introduction to OpenSense (15 minutes)
- Hands-on session and sample data collection (25 minutes)
- Break (15 minutes)
- Tools for data visualization and annotation (30 minutes)
- Hands-on coding session for developing new components and applications (45 minutes)
- Break (15 minutes)
- Demonstration of advanced applications in embodied agents, mixed reality, and video-mediated communication (30 minutes)
- Group discussion of potential application areas and scenarios (30 minutes)
Materials:
The open-source repository for Platform for Situated Intelligence (\psi) can be found here: https://github.com/microsoft/psi.
The open-source repository for OpenSense can be found here: https://github.com/ihp-lab/OpenSense.
Attendees who wish to follow along with the hands-on portions of the tutorial will be expected to bring their own laptops with basic \psi installation steps already completed (found here: https://github.com/microsoft/psi/wiki/Building-the-Codebase).
Speakers:
Sean Andrist, Senior Researcher at Microsoft Research in Redmond, Washington
Bio: His research interests involve designing, building, and evaluating socially interactive technologies that are physically situated in the open world, particularly embodied virtual agents and robots. He is currently working on the Platform for Situated Intelligence project, an open-source framework designed to accelerate research and development on a broad class of multimodal, integrative-AI applications. He received his Ph.D. from the University of Wisconsin–Madison, where he primarily researched effective social gaze behaviors in human-robot and human-agent interaction.
Dan Bohus, Senior Principal Researcher in the Adaptive Systems and Interaction Group in Microsoft Research, Redmond, USA
Bio: His work is focused on the study and development of computational models for multimodal, physically situated interaction. The long-term question that drives his research agenda is: how can we create systems that reason more deeply about their surroundings and seamlessly participate in interactions and collaborations with people in the physical world?
Zongjian Li, scientific developer with the USC Institute for Creative Technologies and the architect of OpenSense.
Bio: She is actively developing software based on \psi to enable research in multimodal interaction.
Mohammad Soleymani, research associate professor of computer science at USC Institute for Creative Technologies.
Bio: His research interests include affective computing, emotion recognition and multimodal social behavior understanding. He has been leading the development of OpenSense, which has been used in research and educational platforms.
3. Multimodal Machine Learning: Principles, Challenges, and Open Questions
Duration:
4 hours
Description:
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodality has brought unique computational and theoretical challenges to the machine learning community given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this tutorial is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning. Building upon a new edition of our survey paper on multimodal ML and academic courses at CMU, this tutorial will cover three topics: (1) what is multimodal: the principles in learning from heterogeneous, connected, and interacting data, (2) why is it hard: a taxonomy of six core technical challenges faced in multimodal ML but understudied in unimodal ML, and (3) what is next: major directions for future research as identified by our taxonomy.
Program:
- Introduction (15 mins)
- What is multimodal? Historical view and multimodal research tasks.
- Key principles and technical challenges.
- Key principle 1: Heterogeneity (15 mins)
- Dimensions of heterogeneity: elements, structure, information, noise topologies, and relevance.
- Measuring heterogeneity: distribution distances and tests, transfer, structure.
- Key principle 2: Connections (15 mins)
- Associations, correspondences, dependencies, and relationships.
- Information theory, shared and unique information.
- Key principle 3: Interactions (15 mins)
- Redundancy, uniqueness, and synergy; agreement and disagreement.
- Human judgment and automatic quantification.
- Challenge 1: Representation (30 mins)
- Fusion: additive, multiplicative, and higher-order interactions (see the short sketch after this program).
- Coordination: vector-space models, canonical correlation analysis, order and hierarchy embeddings.
- Fission: factorization, component analysis, and disentanglement.
- Challenge 2: Alignment (30 mins)
- Discrete alignment: contrastive learning, global alignment, and optimal transport.
- Continuous alignment: continuous warping, latent alignment approaches, and segmentation.
- Contextualized representations: attention models, multimodal transformers, and pretraining.
- BREAK
- Challenge 3: Reasoning (30 mins)
- Structure: hierarchical, graphical, temporal, and interactive structure, structure discovery.
- Concepts: dense and neuro-symbolic.
- Composition: causal and logical relationships.
- Knowledge: external knowledge bases, commonsense reasoning.
- Challenge 4: Generation (20 mins)
- Summarization, translation, and creation.
- Model evaluation and ethical concerns.
- Challenge 5: Transference (20 mins)
- Cross-modal transfer: pre-trained models and adaptation.
- Co-learning: representation and generation with auxiliary modalities.
- Model induction: co-training, co-regularization.
- Challenge 6: Quantification (30 mins)
- Heterogeneity: modality importance, biases, noise.
- Interconnections: visualizing and interpreting connections and interactions.
- Learning: generalization, optimization, modality selection and tradeoffs.
- Future directions and conclusion (20 mins)
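As a pointer to the kind of material covered under Challenge 1, here is a minimal numpy sketch contrasting additive and multiplicative fusion of two modality feature vectors; the vectors, dimensions, and values are made up for illustration and do not come from the tutorial materials.

```python
# Minimal sketch of additive vs. multiplicative fusion (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
language = rng.normal(size=8)    # e.g., a sentence embedding (hypothetical)
acoustic = rng.normal(size=6)    # e.g., prosodic features (hypothetical)

# Additive fusion: combine modalities by concatenation (or a weighted sum after
# projecting both to a common dimension).
additive = np.concatenate([language, acoustic])        # shape (14,)

# Multiplicative (tensor) fusion: the outer product captures pairwise interactions
# between every language feature and every acoustic feature; higher-order variants
# extend this to three or more modalities.
multiplicative = np.outer(language, acoustic).ravel()  # shape (48,)

print(additive.shape, multiplicative.shape)
```

Either fused representation would then feed a downstream predictor; tensor-fusion variants typically also append a constant 1 to each modality vector so that unimodal terms are retained alongside the cross-modal interactions.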
Speakers:
Paul Pu Liang (MLD, CMU), Ph.D. student in Machine Learning at CMU.
Bio: His research lies in the foundations of multimodal machine learning with applications in socially intelligent AI, understanding human and machine intelligence, natural language processing, healthcare, and education. He is a recipient of the Waibel Presidential Fellowship, Facebook PhD Fellowship, Center for Machine Learning and Health Fellowship, and the Alan J. Perlis Graduate Student Teaching Award, and his research has been recognized by three best-paper awards at NeurIPS workshops and ICMI. He regularly organizes courses, workshops, and tutorials on multimodal machine learning and was a workflow chair for ICML 2019.
Louis-Philippe Morency (LTI, CMU), Associate Professor in the Language Technologies Institute at CMU and leader of the Multimodal Communication and Machine Learning Laboratory (MultiComp Lab).
Bio: He received his Ph.D. and Master’s degrees from MIT Computer Science and Artificial Intelligence Laboratory. In 2008, Dr. Morency was selected as one of “AI’s 10 to Watch” by IEEE Intelligent Systems. He has received 7 best paper awards in multiple ACM- and IEEE-sponsored conferences for his work on context-based gesture recognition, multimodal probabilistic fusion, and computational models of human communication dynamics. He has taught 10 editions of the multimodal machine learning course at CMU and before that at the University of Southern California. He has given multiple tutorials on this topic, including at CVPR 2022, NAACL 2022, ACL 2017, CVPR 2016, and ICMI 2016.



