25th ACM International Conference on Multimodal Interaction
(9-13 October 2023)




ICMI 2023 Conference Program

Please note that the program may still change due to unforeseen circumstances.

Program at a glance

Workshops and Tutorials

Each event will start at 9:00 at the earliest and end at 18:00 at the latest. The detailed schedule for each event can be found on its respective website.



Main Conference



Detailed Program

Doctoral Consortium (Monday, 09 October 2023)
Tuesday, 10 October 2023
Wednesday, 11 October 2023
Thursday, 12 October 2023
Papers not presented in-person
Tuesday, 10 October

All sessions will take place in the Auditorium of the Sorbonne University International Conference Centre, except for the Poster Session, which will be held in the Foyer of the Auditorium.

09:00-09:15 Welcome
ICMI 2023 General Chairs
09:15-10:15 Keynote 1: Multimodal information processing in communication: the nature of faces and voices
Prof. Sophie Scott
Session Chair: Louis-Philippe Morency
10:15-10:45 Break
10:45-12:05 Oral Session 1: Social and Physiological Signals
Session Chair: Zakia Hammal
10:45-11:05  EEG-based Cognitive Load Classification using Feature Masked Autoencoding and Emotion Transfer Learning
D.Pulver, P.Angka, P.Hungler and A.Etemad
11:05-11:25   Representation Learning for Interpersonal and Multimodal Behavior Dynamics: A Multiview Extension of Latent Change Score Models
A.Vail, J.M.Girard, L.Bylsma, J.Fournier, H.Swartz, J.Cohn and L.-P.Morency
11:25-11:45   Crucial Clues: Investigating Psychophysiological Behaviors for Measuring Trust in Human-Robot Interaction
M.Ahmad and A.Alzahrani
11:45-12:05   Understanding the Social Context of Eating with Multimodal Smartphone Sensing: The Role of Country Diversity
N.D.Kammoun, L.Meegahapola and D.Gatica-Perez
12:05-14:00 Lunch
14:00-15:20 Oral Session 2: Bias and Diversity
Session Chair: Chloé Clavel
14:00-14:20  Using Explainability for Bias Mitigation: A Case Study for Fair Recruitment Assessment
G.Sogancioglu, H.Kaya and A.A.Salah
14:20-14:40   Multimodal Bias: Assessing Gender Bias in Computer Vision Models with NLP Techniques
A. Mandal, S.Little and S.Leavy
14:40-15:00   Recognizing Intent in Collaborative Manipulation
Z.Rysbek, K-H.Oh and M.Zefran
15:00-15:20   Evaluating Outside the Box: Lessons Learned on eXtended Reality Multi-modal Experiments Beyond the Laboratory
B.Marques, S.Silva, R.Maio, J.Alves, C.Ferreira, P.Dias and B.Sousa Santos
15:20-15:50 Break
15:20-17:20 Poster Session 1 (including Doctoral Consortium posters)
Session Chair: TBA
Analyzing and Recognizing Interlocutors’ Gaze Functions from Multimodal Nonverbal Cues
A.Tashiro, M.Imamura, S.Kumano and K.Otsuka
Multimodal Fusion Interactions: A Study of Human and Automatic Quantification
P.P.Liang, Y.Cheng, R.Salakhutdinov and L.-P.Morency
HIINT: Historical, Intra- and Inter- personal Dynamics Modeling with Cross-person Memory Transformer
Y.Kim, D.W.Lee, P.P.Liang, S.Alghowinem, C.Breazeal and H.W.Park
Deciphering Entrepreneurial Pitches: A Multimodal Deep Learning Approach to Predict Probability of Investment
P.van Aken, M.M.Jung, W.Liebregts and I.O.Ertugrul
Identifying Interlocutors’ Behaviors and its Timings Involved with Impression Formation from Head-Movement Features and Linguistic Features
S.Otsuchi, K.Ito, Y.Ishii, R.Ishii, S.Eitoku and K.Otsuka
Evaluating the Potential of Caption Activation to Mitigate Confusion Inferred from Facial Gestures in Virtual Meetings
M.Heck, J.Jeong and C.Becker
Towards Autonomous Physiological Signal Extraction From Thermal Videos Using Deep Learning
K.Das, M.Abouelenien, M.G.Burzo, J.Elson, K.Prakah-Asante and C.Maranville
Exploring Feedback Modality Designs to Improve Young Children’s Collaborative Actions
A.Melniczuk and E.Vrapi
Breathing New Life into COPD Assessment: Multisensory Home-monitoring for Predicting Severity
Z.Xiao, M.Muszynski, R.Marcinkevičs, L.Zimmerli, A.D.Ivankay, D.Kohlbrenner, M.Kuhn, Y.Nordmann, U.Muehlner, C.Clarenbach, J.E.Vogt and T.Brunschwiler
Analyzing Synergetic Functional Spectrum from Head Movements and Facial Expressions in Conversations
M.Imamura, A.Tashiro, S.Kumano and K.Otsuka
Do I Have Your Attention: A Large Scale Engagement Prediction Dataset and Baselines
M.Singh, X.Hoque, D.Zeng, Y.Wang, K.Ikeda and A.Dhall
Implicit Search Intent Recognition using EEG and Eye Tracking: Novel Dataset and Cross-User Prediction
M.Sharma, S.Chen, P.Müller, M.Rekrut and A.Krüger
Multimodal Analysis and Assessment of Therapist Empathy in Motivational Interviews
T.Tran, Y.Yin, L.Tavabi, J.Delacruz, B.Borsari, J.D.Woolley, S.Scherer and M.Soleymani
Multimodal Turn Analysis and Prediction for Multi-party Conversations
M-C.Lee, M.Trinh and Z.Deng
Explainable Depression Detection via Head Motion Patterns
M.Gahalawat, R.Fernandez Rojas, T.Guha, R.Subramanian, R.Goecke
Early Classifying Multimodal Sequences
A.Cao, J.Utke and D.Klabjan
Predicting Player Engagement in Tom Clancy’s The Division 2: A Multimodal Approach via Pixels and Gamepad Actions
K.Pinitas, D.Renaudie, M.Thomsen, M.Barthet, K.Makantasis, A.Liapis and G.Yannakakis
On Head Motion for Recognizing Aggression and Negative Affect during Speaking and Listening
S.Fitrianie and I.Lefter
SHAP-based Prediction of Mother’s History of Depression to Understand the Influence on Child Behavior
M.Bilalpur, S.Hinduja, L.Cariola, L.Sheeber, N.Allen, L.-P.Morency and J.Cohn
Computational analyses of linguistic features with schizophrenic and autistic traits along with formal thought disorders
T.Saga, H.Tanaka and S.Nakamura
Acoustic and Visual Knowledge Distillation for Contrastive Audio-Visual Localization
E.Yaghoubi, A.P.Kelm, T.Gerkmann and S.Frintrop
Performance Exploration of RNN Variants for Recognizing Daily Life Stress Levels by Using Multimodal Physiological Signals
Y.Said Can and E.André
Enhancing Resilience to Missing Data in Audio-Text Emotion Recognition with Multi-Scale Chunk Regularization
W-C.Lin, L.Goncalves and C.Busso
Interpreting Sign Language Recognition using Transformers and MediaPipe Landmarks
C.Luna-Jiménez, M.Gil-Martín, R.Kleinlein, R.San-Segundo and F.Fernández-Martínez
Expanding the Role of Affective Phenomena in Multimodal Interaction Research
L.Mathur, M.Mataric and L.-P.Morency
15:20-17:20 Doctoral Consortium posters
Session Chair: TBA
Smart Garments for Immersive Home Rehabilitation Using VR
Crowd Behavior Prediction Using Visual and Location Data in Super-Crowded Scenarios
Recording Multimodal Pair-Programming Dialogue for Reference Resolution by Conversational Agents
Modeling Social Cognition and Its Neurologic Deficits with Artificial Neural Networks
Come Fl.. Run with me: Understanding the Utilization of Drones to Support Recreational Runner’s Well Being
Conversational Grounding in Multimodal Dialog Systems
Explainable Depression Detection using Multimodal Behavioural Cues
Enhancing Surgical Team Collaboration and Situation Awareness Through Multimodal Sensing
Bridging Multimedia Modalities: Enhanced Multimodal AI Understanding and Intelligent Agents


Wednesday, 11 October

All sessions will take place in the Auditorium of the Sorbonne University International Conference Centre, except for the Poster Session (location TBA) and the Demo Session, which will be held in the Foyer of the Auditorium.

09:15-10:15 Keynote 2: A Robot Just for You: Multimodal Personalized Human-Robot Interaction and the Future of Work and Care
Prof. Maja Mataric
Session Chair: Tanja Schultz
10:15-10:45 Break
10:45-12:05 Oral Session 3: Affective Computing
Session Chair: Dirk Heylen
10:45-11:05  Neural Mixed Effects for Nonlinear Personalized Predictions
T.Wörtwein, N.Allen, L.Sheeber, R.Auerbach, J.Cohn and L.-P.Morency
11:05-11:25   Detecting When the Mind Wanders Off Task in Real-time: An Overview and Systematic Review
V.Kuvar, J.W.Y.Kam, S. Hutt and C.Mills
11:25-11:45   Annotations from speech and heart rate: impact on multimodal emotion recognition
K.Sharma and G.Chanel
11:45-12:05   Toward Fair Facial Expression Recognition with Improved Distribution Alignment
M.Kolahdouzi and A.Etemad
12:05-14:00 Lunch
14:00-15:20 Oral Session 4: Multimodal Interfaces
Session Chair: Sean Andrist
14:00-14:20  Ether-Mark: An Off-Screen Marking Menu For Mobile Devices
H.Rateau, Y.Rekik and E.Lank
14:20-14:40   Embracing Contact: Detecting Parent-Infant Interactions
M.Doyran, R.Poppe and A.Ali Salah
14:40-15:00   Cross-Device Shortcuts: An Interaction Technique that Creates Deep Links between Apps Across Devices for Content Transfer
M.Beyeler, Y.F.Cheng and C.Holz
15:00-15:20   Component attention network for multimodal dance improvisation recognition
J. Fu, J. Tan, W. Yin, S. Pashami, and M. Björkman
15:20-15:40 Challenge Overview Talks
15:40-16:10 Break (overlapping with the poster session)
15:40-17:40 Poster Session 2 (and Demo Session)
Session Chair: TBA
TongueTap: Multimodal Tongue Gesture Recognition with Head-Worn Devices
T.Gemicioglu, R.Michael Winters, Y-T.Wang, T.Gable and I.J.Tashev
Using Augmented Reality to Assess the Role of Intuitive Physics in the Water-Level Task
R.Abadi, LM.Wilcox and R.Allison
Classification of Alzheimer’s Disease with Deep Learning on Eye-tracking Data
H.Sriram, C.Conati and T.Field
Video-based Respiratory Waveform Estimation in Dialogue: A Novel Task and Dataset for Human-Machine Interaction
T.Obi and K.Funakoshi
The Role of Audiovisual Feedback Delays and Bimodal Congruency for Visuomotor Performance in Human-Machine Interaction
A.Dix, C.Sabrina and A.M.Harkin
Can empathy affect the attribution of mental states to robots?
C.Gena, F.Manini, A.Lieto, A.Lillo and F.Vernero
AIUnet: Asymptotic inference with U2-Net for referring image segmentation
M.Heck, J.Jeong and C.Becker
Using Speech Patterns to Model the Dimensions of Teamness in Human-Agent Teams
E.Doherty, C.Spencer, L.Eloy, N.R.Dickler and L.Hirshfield
Robot Duck Debugging: Can Attentive Listening Improve Problem Solving?
M.T.Parreira, S.Gillet and I.Leite
Estimation of Violin Bow Pressure Using Photo-Reflective Sensors
Y.Mizuho, R.Kitamura and Y.Sugiura
Paying Attention to Wildfire: Using U-Net with Attention Blocks on Multimodal Data for Next Day Prediction
J.Fitzgerald, E.Seefried, J.E.Yost, S.Pallickara and N.Blanchard
ReNeLiB: Real-time Neural Listening Behavior Generation for Socially Interactive Agents
D.S.Withanage Don, P.Müller, F.Nunnari, E.André and P.Gebhard
Large language models in textual analysis for gesture selection
L.Birka, N.Yongsatianchot, P.G.Torshizi, E.Minucci and S.Marsella
Increasing Heart Rate and Anxiety Level with Vibrotactile and Audio Presentation of Fast Heartbeat
R.Wang, H.Zhang, S.A.Macdonald, P.Di Campli San Vito
User Feedback-based Online Learning for Intent Classification
K.Gönç, B.Sağlam, O.Dalmaz, T.Çukur, S.Kozat and H.Dibeklioglu
µGeT: Multimodal eyes-free text selection technique combining touch interaction and microgestures
G.R.J.Faisandaz, A.Goguey, C.Jouffrais and L.Nigay
Deep Breathing Phase Classification with a Social Robot for Mental Health
K.Matheus, E.Mamantov, M.Vázquez and B.Scassellati
ASMRcade: Interactive Audio Triggers for an Autonomous Sensory Meridian Response
S.Mertes, M.Strobl, R.Schlagowski and E. André
Augmented Immersive Viewing and Listening Experience Based on Arbitrarily Angled Interactive Audiovisual Representation
T.Horiuchi, S.Okuba and T.Kobayashi
Out of Sight, … How Asymmetry in Video-Conference Affects Social Interaction
C.Sallaberry, G.Englebienne, J.Van Erp and V.Evers
Demo Session
Session Chair: TBA


Thursday, 12 October

All sessions will take place in the Auditorium of the Sorbonne University International Conference Centre, except for the Poster Session, which will be held in the Foyer of the Auditorium.

09:15-10:15 Keynote 3: Projecting Life Onto Machines
Prof. Simone Natale
Session Chair: Alessandro Vinciarelli
10:15-10:45 Break
10:45-12:05 Oral Session 5: Gestures and Social Interactions
Session Chair: Mohammad Soleymani
10:45-11:05  AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech Gesture Synthesis
H.Voß and S.Kopp
11:05-11:25   Frame-Level Event Representation Learning for Semantic-Level Generation and Editing of Avatar Motion
A.Ideno, T.Kaneko and T.Harada
11:25-11:45   FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning
K.I.Haque and Z.Yumak
11:45-12:05   Influence of hand representation on a grasping task in augmented reality
L.Lafuma, G.Bouyer, O.Goguel and J.-Y.P.Didier
12:05-14:00 Lunch
14:00-15:00 Keynote 4 – Sustained Achievement Award
Prof. Louis-Philippe Morency
Session Chair: TBA
15:00-15:30 Break (overlapping with Poster Session 3)
15:00-16:45 Poster Session 3 and Late Breaking Results
Session Chair: TBA
Synerg-eye-zing: Decoding Nonlinear Gaze Dynamics Driving Successful Collaborations in Co-located Teams
G.S.Rajshekar, L.Eloy, R.Dickler, J.G.Reitman, S.L.Pugh, P.Foltz, J.C.Gorman, J.Harrison and L.Hirshfield
Exploring Neurophysiological Responses to Cross-Cultural Deepfake Videos
M.R.Khan, S.Naeem, U.Tariq, A.Dhall, M.N.A.Khan, F.Al Shargie and H.Al Nashash
Characterization of collaboration in a virtual environment with gaze and speech signals
A.Léchappé, A.Milliat, C.Fleury, M.Chollet and C.Dumas
HEARD-LE: An Intelligent Conversational Interface for Wordle
C.Yang, K.Arredondo, J.I.Koh, P.Taele and T.Hammond
Assessing Infant and Toddler Behaviors through Wearable Inertial Sensors: A Preliminary Investigation
A.Onodera, R.Ishioka, Y.Nishiyama and K.Sezaki
ASAR Dataset and Computational Model for Affective State Recognition During ARAT Assessment for Upper Extremity Stroke Survivors
T.Ahmed, T.Rikakis, A.Kelliher and M.Soleymani
The Limitations of Current Similarity-Based Objective Metrics In the Context of Human-Agent Interaction Applications
A.Deffrennes, L.Vincent, M.Pivette, K.El Haddad, J.D.Bailey, M.Perusquia-Hernandez, S.M.Alarcão and T.Dutoit
Do Body Expressions Leave Good Impressions? – Predicting Investment Decisions based on Pitcher’s Body Expressions
M.M.Jung, M.van Vlierden, W.Liebregts and I.Onal Ertugrul
Multimodal Entrainment in Bio-Responsive Multi-User VR Interactives
M.Song and S.Di Paola
Multimodal Synchronization in Musical Ensembles: Investigating Audio and Visual Cues
S.Chakraborty and J.Timoney
Insights Into the Importance of Linguistic Textual Features on the Persuasiveness of Public Speaking
A.Barkar, M.Chollet, B.Biancardi and C.Clavel
Detection of contract cheating in pen-and-paper exams through the analysis of handwriting style
K.Kuznetsov, M.Barz and D.Sonntag
Leveraging gaze for potential error prediction in AI-support systems: An exploratory analysis of interaction with a simulated robot
B.Severitt, N.J.Castner, O.Lukashova-Sanz and S.Wahl
Developing a Generic Focus Modality for Multimodal Interactive Environments
F.Barros, A.Teixeira and S.Silva
Multimodal Prediction of User’s Performance in High-Stress Dialogue Interactions
S.Nasihati Gilani, K.Pollard and D.Traum
Understanding the Physiological Arousal of Novice Performance Drivers for the Design of Intelligent Driving Systems
E.Kimani, A.L.S.Filipowicz and H.Yasuda
A Portable Ball with Unity-based Computer Game for Interactive Arm Motor Control Exercise
Y.Zhou, Y.An, Q.Niu, Q.Bu, Y.C.Liang, M.Leach and J.Sun
Virtual Reality Music Instrument Playing Game for Upper Limb Rehabilitation Training
M.Sun, Q.Bu, Y.Hou, X.Ju, L.Yu, E.G.Lim and J.Sun
Towards Objective Evaluation of Socially-Situated Conversational Robots: Assessing Human-Likeness through Multimodal User Behaviors
K.Inoue, D.Lala, K.Ochi, T.Kawahara and G.Skantze
“Am I listening?”, Evaluating the Quality of Generated Data-driven Listening Motion
P.Wolfert, G.E.Henter and T.Belpaeme
LinLED: Low latency and accurate contactless gesture interaction
S.Viollet, C.Martin and J.-M.Ingargiola
16:45-17:45 Blue Sky Papers
17:45-18:00 Closing
19:00-22:00 Banquet, Le Grand Salon, La Sorbonne, La Chancellerie des Universités de Paris


Papers Not Presented In-person

This is a list of papers for which no authors were able to attend the conference in person. While these papers do not appear in the program above, they are still available in the conference proceedings. Authors were optionally invited to submit a pre-recorded video presentation of their paper as supplementary material accompanying the conference proceedings.

MMASD: A Multimodal Dataset for Autism Intervention Analysis
J.Li, V.Chheang, P.Kullu, Z.Guo, A.Bhat, K.E.Barner and R.L.Barmaki
GCFormer: A Graph Convolutional Transformer for Speech Emotion Recognition
Y.Gao, H.Zhao, Y.Xiao and Z.Zhang
How Noisy is Too Noisy? The Impact of Data Noise on Multimodal Recognition of Confusion and Conflict During Collaborative Learning
Y.Ma, M.Celepkolu, K.E.Boyer, C.Lynch, E.Wiebe and M.Israel
Make Your Brief Stroke Real and Stereoscopic: 3D-Aware Simplified Sketch to Portrait Generation
Y.Sun, Q.Wu, H.Zhou, K.Wang, T.Hu, C.-C.Liao, S.Miyafuji, Z.Liu and H.Koike
Gait Event Prediction of People with Cerebral Palsy using Feature Uncertainty: A Low-Cost Approach
S.Chakraborty, N.Thomas and A.Nandy
ViFi-Loc: Multi-modal Pedestrian Localization using GAN with Camera-Phone Correspondences
H.Liu, H.Lu, K.Dana and M.Gruteser
Multimodal Approach to Investigate the Role of Cognitive Workload and User Interfaces in Human-robot Collaboration
A.Kalatzis, S.Rahman, V.G.Prabhu, L.Stanley and M.Wittie
WiFiTuned: Monitoring Engagement in Online Participation by Harmonizing WiFi and Audio
V.K.Singh, P.Kar, A.M.Sohini, M.Rangaiah, S.Chakraborty and M.Maity