ICMI ’23: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
SESSION: Keynote Talks
In this talk I will take a neurobiological perspective on human communication, and explore the ways in which visual and auditory channels express common and distinct patterns of information. I will extend this to the ways in which facial and vocal information is processed neurally and how they interact in communication.
A Robot Just for You: Multimodal Personalized Human-Robot Interaction and the Future of Work and Care
As AI becomes ubiquitous, its physical embodiment, robots, will also gradually enter our lives. As they do, we will demand that they understand us, predict our needs and wants, and adapt to us as we change our moods and minds, learn, grow, and age. The nexus created by recent major advances in machine learning for machine perception, navigation, and natural language processing has enabled human-robot interaction in real-world contexts, just as the need for human services continues to grow, from elder care to nursing to education and training. This talk will discuss our research in socially assistive robotics (SAR), which uses embodied social interaction to support user goals in health, wellness, training, and education. SAR brings together machine learning for user modeling, multimodal behavioral signal processing, and affective computing to enable robots to understand, interact with, and adapt to users’ specific and ever-changing needs. The talk will cover methods and challenges of using multimodal interaction data and expressive robot behavior to monitor, coach, motivate, and support a wide variety of user populations and use cases. We will cover insights from work with users across the age span (infants, children, adults, elderly), ability span (typically developing, autism, stroke, Alzheimer’s), contexts (schools, therapy centers, homes), and deployment durations (up to 6 months), as well as commercial implications.
Public discussions and imaginaries about AI often center around the idea that technologies such as neural networks might one day lead to the emergence of machines that think or even feel like humans. Drawing on histories of how people project lives onto talking things, from spiritualist seances in the Victorian era to contemporary advances in robotics, this talk argues that the “lives” of AI have more to do with how humans perceive and relate to machines exhibiting communicative behavior, than with the functioning of computing technologies in itself. Taking up this point of view helps acknowledge and further interrogate how perceptions and cultural representations inform the outcome of technologies that are programmed to interact and communicate with human users.
SESSION: Main Track – Long and Short Papers
A Multimodal Approach to Investigate the Role of Cognitive Workload and User Interfaces in Human-robot Collaboration
One of the primary aims of Industry 5.0 is to refine the interaction between humans, machines, and robots by developing human-centered design solutions to enhance Human-Robot Collaboration, performance, trust, and safety. This research investigated how deploying a user interface utilizing a 2-D and 3-D display affects participants’ cognitive effort, task performance, trust, and situational awareness while performing a collaborative task using a robot. The study used a within-subject design where fifteen participants were subjected to three conditions: no interface, display User Interface, and mixed reality User Interface where vision assistance was provided. Participants performed a pick-and-place task with a robot in each condition under two levels of cognitive workload (i.e., high and low). The cognitive workload was measured using subjective (i.e., NASA TLX) and objective measures (i.e., heart rate variability). Additionally, task performance, situation awareness, and trust when using these interfaces were measured to understand the impact of different user interfaces during a Human-Robot Collaboration task. Findings from this study indicated that cognitive workload and user interfaces impacted task performance, where a significant decrease in efficiency and accuracy was observed while using the mixed reality interface. Additionally, irrespective of the three conditions, all participants perceived the task as more cognitively demanding during the high cognitive workload session. However, no significant differences across the interfaces were observed. Finally, cognitive workload impacted situational awareness and trust, where lower levels were reported in the high cognitive workload session, and the lowest levels were observed under the mixed reality user interface condition.
This paper introduces an unsupervised model for audio-visual localization, which aims to identify regions in the visual data that produce sounds. Our key technical contribution is to demonstrate that using distilled prior knowledge of both sounds and objects in an unsupervised learning phase can improve performance significantly. We propose an Audio-Visual Correspondence (AVC) model consisting of an audio and a vision student, which are respectively supervised by an audio teacher (audio recognition model) and a vision teacher (object detection model). Leveraging a contrastive learning approach, the AVC student model extracts features from sounds and images and computes a localization map, discovering the regions of the visual data that correspond to the sound signal. Simultaneously, the teacher models provide feature-based hints from their last layers to supervise the AVC model in the training phase. In the test phase, the teachers are removed. Our extensive experiments show that the proposed model outperforms the state-of-the-art audio-visual localization models on 10k and 144k subsets of the Flickr and VGGS datasets, including cross-dataset validation.
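As a rough illustration of how a localization map can be computed from audio and visual features, the sketch below scores each spatial cell of a visual feature grid against a single audio embedding using cosine similarity. This is a minimal sketch under stated assumptions: the function names, the grid layout, and the toy vectors are hypothetical, and the paper's actual AVC architecture, contrastive loss, and teacher supervision are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def localization_map(audio_embedding, visual_features):
    """Score every cell of a feature grid (list of rows of vectors) against the audio embedding."""
    return [[cosine(audio_embedding, cell) for cell in row] for row in visual_features]


def best_cell(loc_map):
    """Return the (row, col) index of the highest-scoring cell."""
    cells = ((r, c) for r, row in enumerate(loc_map) for c in range(len(row)))
    return max(cells, key=lambda rc: loc_map[rc[0]][rc[1]])


# Toy data (made up): a 2x2 grid of 3-dimensional visual features.
audio = [1.0, 0.0, 1.0]
grid = [
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.9]],
    [[0.2, 0.8, 0.1], [-1.0, 0.0, -1.0]],
]
heat = localization_map(audio, grid)
print(best_cell(heat))  # the cell whose feature is most aligned with the audio
```

In a trained model the embeddings would come from the audio and vision students; here they are hand-written so the similarity step can be read in isolation.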
Referring image segmentation aims to segment a target object from an image by providing a natural language expression. While recent methods have made remarkable advancements, few have designed effective deep fusion processes for cross-model features or focused on the fine details of vision. In this paper, we propose AIUnet, an asymptotic inference method that uses U2-Net. The core of AIUnet is a Cross-model U2-Net (CMU) module, which integrates a Text guide vision (TGV) module into U2-Net, achieving efficient interaction of cross-model information at different scales. CMU focuses more on location information in high-level features and learns finer detail information in low-level features. Additionally, we propose a Features Enhance Decoder (FED) module to improve the recognition of fine details and decode cross-model features to binary masks. The FED module leverages a simple CNN-based approach to enhance multi-modal features. Our experiments show that AIUnet achieved competitive results on three standard datasets. Code is available at https://github.com/LJQbiu/AIUnet.
A novel framework is presented for analyzing and recognizing the functions of gaze in group conversations. Considering the multiplicity and ambiguity of the gaze functions, we first define 43 nonexclusive gaze functions that play essential roles in conversations, such as monitoring, regulation, and expressiveness. Based on the defined functions, in this study, a functional gaze corpus is created, and a corpus analysis reveals several frequent functions, such as addressing and thinking while speaking and attending by listeners. Next, targeting the ten most frequent functions, we build convolutional neural networks (CNNs) to recognize the frame-based presence/absence of each gaze function from multimodal inputs, including head pose, utterance status, gaze/avert status, eyeball direction, and facial expression. Comparing different input sets, our experiments confirm that the proposed CNN using all modality inputs achieves the best performance and an F value of 0.839 for listening while looking.
Analyzing Synergetic Functional Spectrum from Head Movements and Facial Expressions in Conversations
A framework, synergetic functional spectrum analysis (sFSA), is proposed to reveal how multimodal nonverbal behaviors such as head movements and facial expressions cooperatively perform communicative functions in conversations. We first introduce a functional spectrum to represent the functional multiplicity and ambiguity in nonverbal behaviors, e.g., a nod could imply listening, agreement, or both. More specifically, the functional spectrum is defined as the distribution of perceptual intensities of multiple functions across multiple modalities, which are based on multiple raters’ judgments. Next, the functional spectrum is decomposed into a small basis set called the synergetic functional basis, which can characterize primary and distinctive multimodal functionalities and span a synergetic functional space. Using these bases, the input spectrum is approximated as a linear combination of the bases and corresponding coefficients, which represent the coordinate in the functional space. To that end, this paper proposes semi-orthogonal nonnegative matrix factorization (SO-NMF) and discovers some essential multimodal synergies in the listener’s back-channel, thinking, positive responses, and speaker’s thinking and addressing. Furthermore, we propose regression models based on convolutional neural networks (CNNs) to estimate the functional space coordinates from head movements and facial action units, and confirm the potential of the sFSA.
The focus of multimodal emotion recognition has often been on the analysis of several fusion strategies. However, little attention has been paid to the effect of emotional cues, such as physiological and audio cues, on external annotations used to generate the Ground Truths (GTs). In our study, we analyze this effect by collecting six continuous arousal annotations for three groups of emotional cues: speech only, heartbeat sound only and their combination. Our results indicate significant differences between the three groups of annotations, thus giving three distinct cue-specific GTs. The relevance of these GTs is estimated by training multimodal machine learning models to regress speech, heart rate and their multimodal fusion on arousal. Our analysis shows that a cue(s)-specific GT is better predicted by the corresponding modality(s). In addition, fusing several emotional cues when defining GTs yields similar performance for unimodal models and multimodal fusion. In conclusion, our results indicate that heart rate is an efficient cue for the generation of a physiological GT, and that combining several emotional cues for GT generation is as important as performing input multimodal fusion for emotion prediction.
The generation of realistic and contextually relevant co-speech gestures is a challenging yet increasingly important task in the creation of multimodal artificial agents. Prior methods focused on learning a direct correspondence between co-speech gesture representations and produced motions, which created seemingly natural but often unconvincing gestures during human assessment. We present an approach to pre-train partial gesture sequences using a generative adversarial network with a quantization pipeline. The resulting codebook vectors serve as both input and output in our framework, forming the basis for the generation and reconstruction of gestures. By learning the mapping of a latent space representation as opposed to directly mapping it to a vector representation, this framework facilitates the generation of highly realistic and expressive gestures that closely replicate human movement and behavior, while simultaneously avoiding artifacts in the generation process. We evaluate our approach by comparing it with established methods for generating co-speech gestures as well as with existing datasets of human behavior. We also perform an ablation study to assess our findings. The results show that our approach outperforms the current state of the art by a clear margin and is partially indistinguishable from human gesturing. We make our data pipeline and the generation framework publicly available.
Autonomous Sensory Meridian Response (ASMR) is a sensory phenomenon involving pleasurable tingling sensations in response to stimuli such as whispering, tapping, and hair brushing. It is increasingly used to promote health and well-being, help with sleep, and reduce stress and anxiety. ASMR triggers are both highly individual and of great variety. Consequently, finding or identifying suitable ASMR content, e.g., by searching online platforms, can take time and effort. This work addresses this challenge by introducing a novel interactive approach for users to generate personalized ASMR sounds. The presented system utilizes a generative adversarial network (GAN) for sound generation and a graphical user interface (GUI) for user control. Our system allows users to create and manipulate audio samples by interacting with a visual representation of the GAN’s latent input vector. Further, we present the results of a first user study which indicates that our approach is suitable for triggering ASMR experiences.
Augmented Immersive Viewing and Listening Experience Based on Arbitrarily Angled Interactive Audiovisual Representation
We propose an arbitrarily angled interactive audiovisual representation technique that combines a unique sound field synthesis with visual representation in order to augment the possibility of interactive immersive viewing experiences on mobile devices. From multi-channel surround sound, this technique can synthesize two-channel stereo sound with constant stereo width and an arbitrary angle range, from a minimum of 30 to a maximum of 360 degrees, centered on an arbitrary direction. The visual representation can be either an equirectangular projection or a stereographic projection. The developed video player app allows users to enjoy arbitrarily angled 360-degree videos by manipulating the touchscreen, and the stereo sound and the visual representation change in spatial synchronization depending on the view. The app was released as a demonstration, and its acceptability and worth were investigated through interviews and subjective assessment tests. The app has been well received, and to date, more than 30 pieces of content have been produced in multiple genres, with a total of more than 200,000 views.
Chronic obstructive pulmonary disease (COPD) is a significant public health issue, affecting more than 100 million people worldwide. Remote patient monitoring has shown great promise in the efficient management of patients with chronic diseases. This work presents the analysis of the data from a monitoring system developed to track COPD symptoms alongside patients’ self-reports. In particular, we investigate the assessment of COPD severity using multisensory home-monitoring device data acquired from 30 patients over a period of three months. We describe a comprehensive data pre-processing and feature engineering pipeline for multimodal data from the remote home-monitoring of COPD patients. We develop and validate predictive models forecasting i) the absolute and ii) differenced COPD Assessment Test (CAT) scores based on the multisensory data. The best obtained models achieve Pearson’s correlation coefficient of 0.93 and 0.37 for absolute and differenced CAT scores. In addition, we investigate the importance of individual sensor modalities for predicting CAT scores using group sparse regularization techniques. Our results suggest that feature groups indicative of the patient’s general condition, such as static medical and physiological information, date, spirometer, and air quality, are crucial for predicting the absolute CAT score. For predicting changes in CAT scores, sleep and physical activity features are most important, alongside the previous CAT score value. Our analysis demonstrates the potential of remote patient monitoring for COPD management and investigates which sensor modalities are most indicative of COPD severity as assessed by the CAT score. Our findings contribute to the development of effective and data-driven COPD management strategies.
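To make the two evaluation targets concrete, the following sketch computes Pearson's correlation coefficient once on absolute scores and once on first-differenced scores (the change from one assessment to the next). It is a minimal illustration only: the score values are made up, the helper names are hypothetical, and the abstract does not specify how the differencing was actually implemented.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def difference(series):
    """First differences: the change between consecutive values."""
    return [b - a for a, b in zip(series, series[1:])]


# Made-up reference and predicted scores for one hypothetical patient.
true_cat = [20, 22, 21, 25, 24, 28]
pred_cat = [19, 21, 22, 24, 25, 27]

r_abs = pearson(true_cat, pred_cat)
r_diff = pearson(difference(true_cat), difference(pred_cat))
print(f"absolute r = {r_abs:.2f}, differenced r = {r_diff:.2f}")
```

Differencing removes each patient's overall level, so a model can track the level well and still struggle to predict changes, which is consistent with the gap between the two coefficients reported above.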
This paper presents an experimental study showing that the humanoid robot NAO, in a condition already validated with regard to its capacity to trigger situational empathy in humans, is able to stimulate the attribution of mental states towards itself. Indeed, results show that participants not only experienced empathy towards NAO when the robot was afraid of losing its memory due to a malfunction, but also attributed higher scores to the robot’s emotional intelligence in the Attribution of Mental State Questionnaire, in comparison with the users in the control condition. This result suggests a possible correlation between empathy toward the robot and humans’ attribution of mental states to it.
Existing research has shown the potential of classifying Alzheimer’s Disease (AD) from eye-tracking (ET) data with classifiers that rely on task-specific engineered features. In this paper, we investigate whether we can improve on existing results by using a Deep Learning classifier trained end-to-end on raw ET data. This classifier (VTNet) uses a GRU and a CNN in parallel to leverage both visual (V) and temporal (T) representations of ET data and was previously used to detect user confusion while processing visual displays. A main challenge in applying VTNet to our target AD classification task is that the available ET data sequences are much longer than those used in the previous confusion detection task, pushing the limits of what is manageable by LSTM-based models. We discuss how we address this challenge and show that VTNet outperforms the state-of-the-art approaches in AD classification, providing encouraging evidence on the generality of this model to make predictions from ET data.
Dance improvisation is an active research topic in the arts. Motion analysis of improvised dance can be challenging due to its unique dynamics. Data-driven dance motion analysis, including recognition and generation, is often limited to skeletal data. However, data of other modalities, such as audio, can be recorded and benefit downstream tasks. This paper explores the application and performance of multimodal fusion methods for human motion recognition in the context of dance improvisation. We propose an attention-based model, component attention network (CANet), for multimodal fusion on three levels: 1) feature fusion with CANet, 2) model fusion with CANet and graph convolutional network (GCN), and 3) late fusion with a voting strategy. We conduct thorough experiments to analyze the impact of each modality in different fusion methods and distinguish critical temporal or component features. We show that our proposed model outperforms the two baseline methods, demonstrating its potential for analyzing improvisation in dance.
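The third fusion level, late fusion with a voting strategy, can be sketched as a simple majority vote over the labels predicted by separate per-modality models. This is an assumption-laden illustration: the abstract does not describe the exact voting rule, and the function name, the tie-breaking choice, and the motion labels below are hypothetical.

```python
from collections import Counter


def late_fusion_vote(predictions):
    """Majority vote over per-model labels.

    Ties are broken in favour of the tied label that appears first
    in `predictions` (an arbitrary choice made for this sketch).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Hypothetical labels from a skeleton model, an audio model, and a fused model.
print(late_fusion_vote(["spin", "jump", "spin"]))
```

Because the vote only needs final labels, each modality's model can be trained and replaced independently, which is the usual appeal of late fusion over feature-level fusion.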
Computational analyses of linguistic features with schizophrenic and autistic traits along with formal thought disorders
Formal Thought Disorder (FTD), which is a group of symptoms in cognition that affects language and thought, can be observed through language. FTD is seen across such developmental or psychiatric disorders as Autism Spectrum Disorder (ASD) or Schizophrenia, and its related Schizotypal Personality Disorder (SPD). Researchers have worked on computational analyses for the early detection of such symptoms and to develop better treatments for more than 40 years. For this paper, we collected a Japanese audio-report dataset with score labels related to ASD and SPD through a crowd-sourcing service from the general population. We measured language characteristics with the 2nd edition of the Social Responsiveness Scale (SRS2) and the Schizotypal Personality Questionnaire (SPQ), including an odd speech subscale from SPQ to quantify the FTD symptoms. We investigated the following four research questions through machine-learning-based score predictions: (RQ1) How are schizotypal and autistic measures correlated? (RQ2) What is the most suitable task to elicit FTD symptoms? (RQ3) Does the length of speech affect the elicitation of FTD symptoms? (RQ4) Which features are critical for capturing FTD symptoms? We confirmed that an FTD-related subscale, odd speech, was significantly correlated with both the total SPQ and SRS scores, although they themselves were not correlated significantly. In terms of the tasks, our result identified the effectiveness of FTD elicitation by the most negative memory. Furthermore, we confirmed that longer speech elicited more FTD symptoms, as shown by the increased score prediction performance for the FTD-related odd speech subscale from SPQ. Our ablation study confirmed the importance of function words and both the abstract and temporal features for FTD-related odd speech estimation. In contrast, embedding-based features were effective only in the SRS predictions, and content words were effective only in the SPQ predictions, a result that implies differences between SPD-like and ASD-like symptoms. Data and programs used in this paper can be found here: https://sites.google.com/view/sagatake/resource.
Cross-Device Shortcuts: Seamless Attention-guided Content Transfer via Opportunistic Deep Links between Apps and Devices
Although users increasingly spread their activities across multiple devices—even to accomplish a single task—information transfer between apps on separate devices still incurs non-negligible effort and time overhead. These interaction flows would considerably benefit from more seamless cross-device interaction that directly connects the information flow between the involved apps across devices. In this paper, we propose cross-device shortcuts, an interaction technique that enables direct and discoverable content exchange between apps on different devices. When users switch their attention between multiple engaged devices as part of a workflow, our system establishes a cross-device shortcut—a deep link between apps on separate devices that presents itself through feed-forward previews, inviting and facilitating quick content transfer. We explore the use of this technique in four scenarios spanning multiple devices and applications, and highlight the potential, limitations, and challenges of its design with a preliminary evaluation.
Crucial Clues: Investigating Psychophysiological Behaviors for Measuring Trust in Human-Robot Interaction
Existing work on the measurement of trust during Human-Robot Interaction (HRI) indicates that psychophysiological behaviours (PBs) have the potential to measure trust. However, we see limited work on the use of multiple PBs in combination to calibrate humans’ trust in robots in real-time during HRI. Therefore, this study aims to estimate human trust in robots by examining the differences in PBs between trust and distrust states. It further investigates the changes in PBs across repeated HRI and also explores the potential of machine learning classifiers in predicting trust levels during HRI. We collected participants’ electrodermal activity (EDA), blood volume pulse (BVP), heart rate (HR), skin temperature (SKT), blinking rate (BR), and blinking duration (BD) during repeated HRI. The results showed significant differences in HR and SKT between trust and distrust groups and no significant interaction effect of session and decision for all PBs. A Random Forest classifier achieved the best accuracy of 68.6% in classifying trust, with SKT, HR, BR, and BD as the important features. These findings highlight the value of PBs in measuring trust in real-time during HRI and encourage further investigation of trust measures with PBs in various HRI settings.
Deciphering Entrepreneurial Pitches: A Multimodal Deep Learning Approach to Predict Probability of Investment
Acquiring early-stage investments for the purpose of developing a business is a fundamental aspect of the entrepreneurial process, which regularly entails pitching the business proposal to potential investors. Previous research suggests that business viability data and the perception of the entrepreneur play an important role in the investment decision-making process. This perception of the entrepreneur is shaped by verbal and non-verbal behavioral cues produced in investor-entrepreneur interactions. This study explores the impact of such cues on decisions that involve investing in a startup on the basis of a pitch. A multimodal approach is developed in which acoustic and linguistic features are extracted from recordings of entrepreneurial pitches to predict the likelihood of investment. The acoustic and linguistic modalities are represented using both hand-crafted and deep features. The capabilities of deep learning models are exploited to capture the temporal dynamics of the inputs. The findings show promising results for the prediction of the likelihood of investment using a multimodal architecture consisting of acoustic and linguistic features. Models based on deep features generally outperform hand-crafted representations. Experiments with an explainable model provide insights about the important features. The most predictive model is found to be a multimodal one that combines deep acoustic and linguistic features using an early fusion strategy and achieves an MAE of 13.91.
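Two terms in this abstract, early fusion and MAE, can be illustrated in a few lines. In the sketch below, early fusion is taken to mean concatenating the acoustic and linguistic feature vectors before they reach a single model, and MAE is the mean absolute difference between predicted and true values. All names and numbers are hypothetical, and the paper's deep feature extractors and temporal models are not shown.

```python
def early_fusion(acoustic, linguistic):
    """Concatenate per-modality feature vectors into one input vector."""
    return list(acoustic) + list(linguistic)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up feature vectors for one pitch.
fused = early_fusion([0.1, 0.7], [0.3, 0.2, 0.9])

# Made-up investment likelihoods (true vs. predicted) for three pitches.
mae = mean_absolute_error([50.0, 20.0, 80.0], [40.0, 35.0, 75.0])
print(len(fused), mae)
```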
Social robots are in a unique position to aid mental health by supporting engagement with behavioral interventions. One such behavioral intervention is the practice of deep breathing, which has been shown to physiologically reduce symptoms of anxiety. Multiple robots have been recently developed that support deep breathing, but none yet implement a method to detect how accurately an individual is performing the practice. Detecting breathing phases (i.e., inhaling, breath holding, or exhaling) is a challenge with these robots since often the robot is being manipulated or moved by the user, or the robot itself is moving to generate haptic feedback. Accordingly, we first present OMMDB: a novel, multimodal, public dataset made up of individuals performing deep breathing with an Ommie robot in multiple conditions of robot ego-motion. The dataset includes RGB video, inertial sensor data, and motor encoder data, as well as ground truth breathing data from a respiration belt. Our second contribution features experimental results with a convolutional long-short term memory neural network trained using OMMDB. These results show the system’s ability to be applied to the domain of deep breathing and generalize between individual users. We additionally show that our model is able to generalize across multiple types of robot ego-motion, reducing the need to train individual models for varying human-robot interaction conditions.
Research on the ubiquity and consequences of task-unrelated thought (TUT; often used to operationalize mind wandering) in several domains recently sparked a surge in efforts to create “stealth measurements” of TUT using machine learning. Although these attempts have been successful, they have used widely varied algorithms, modalities, and performance metrics — making them difficult to compare and inform future work on best practices. We aim to synthesize these findings through a systematic review of 42 studies identified following PRISMA guidelines to answer two research questions: 1) are there any modalities that are better indicators of TUT than the rest; and 2) do multimodal models provide better results than unimodal models? We found that models built on gaze typically outperform other modalities and that multimodal models do not present a clear edge over their unimodal counterparts. Our review highlights the typical steps involved in model creation and the choices available in each step to guide future research, while also discussing the limitations of the current “state of the art” — namely the barriers to generalizability.
The degree of concentration, enthusiasm, optimism, and passion displayed by individual(s) while interacting with a machine is referred to as ‘user engagement’. Engagement comprises behavioral, cognitive, and affect-related cues. To create engagement prediction systems that can work in real-world conditions, it is essential to learn from rich, diverse datasets. To this end, EngageNet, a large-scale, multi-faceted engagement-in-the-wild dataset, is proposed. It contains 31 hours of data recorded from 127 participants under different illumination conditions. Thorough experiments are performed exploring the applicability of different features, including action units, eye gaze, head pose, and MARLIN. Data from user interactions (question-answer) are analyzed to understand the relationship between effective learning and user engagement. To further validate the rich nature of the dataset, evaluation is also performed on the EngageWild dataset. The experiments show the usefulness of the proposed dataset. The code, models, and dataset link are publicly available at https://github.com/engagenet/engagenet_baselines.
Pieces of information are often received sequentially over time. When has one collected enough such pieces to classify? Trading wait time for decision certainty leads to early classification problems, which have recently gained attention as a means of adapting classification to more dynamic environments. However, results so far have been limited to unimodal sequences. In this pilot study, we expand into early classification of multimodal sequences by combining existing methods. Spatial-temporal transformers trained in the supervised framework of Classifier-Induced Stopping outperform exploration-based methods. We show our new method yields experimental AUC advantages of up to 8.7%.
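The trade-off between wait time and decision certainty can be illustrated with a deliberately simple stopping rule: consume per-step class probabilities and stop as soon as the top class is confident enough. Note that this fixed-threshold rule is a stand-in chosen for illustration and is not the Classifier-Induced Stopping framework named in the abstract; the function name, threshold, and probability stream are hypothetical.

```python
def classify_early(prob_stream, threshold=0.9):
    """Return (predicted_class, steps_used) for a stream of probability lists.

    Stops at the first step whose top class probability reaches `threshold`;
    if that never happens, falls back to the prediction at the final step.
    """
    last = None
    steps = 0
    for steps, probs in enumerate(prob_stream, start=1):
        last = probs
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            return best, steps
    if last is None:
        raise ValueError("empty probability stream")
    return max(range(len(last)), key=last.__getitem__), steps


# Made-up probabilities that sharpen as more of the sequence is observed.
stream = [[0.5, 0.5], [0.3, 0.7], [0.05, 0.95], [0.0, 1.0]]
print(classify_early(stream, threshold=0.9))
print(classify_early(stream, threshold=0.99))
```

Raising the threshold makes the rule wait longer before committing, which is the wait-time versus certainty trade-off in its simplest form.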
EEG-based Cognitive Load Classification using Feature Masked Autoencoding and Emotion Transfer Learning
Cognitive load, the amount of mental effort required for task completion, plays an important role in performance and decision-making outcomes, making its classification and analysis essential in various sensitive domains. In this paper, we present a new solution for the classification of cognitive load using electroencephalogram (EEG). Our model uses a transformer architecture employing transfer learning between emotions and cognitive load. We pre-train our model using self-supervised masked autoencoding on emotion-related EEG datasets and use transfer learning with both frozen weights and fine-tuning to perform downstream cognitive load classification. To evaluate our method, we carry out a series of experiments utilizing two publicly available EEG-based emotion datasets, namely SEED and SEED-IV, for pre-training, while we use the CL-Drive dataset for downstream cognitive load classification. The results of our experiments show that our proposed approach achieves strong results and outperforms conventional single-stage fully supervised learning. Moreover, we perform detailed ablation and sensitivity studies to evaluate the impact of different aspects of our proposed solution. This research contributes to the growing body of literature in affective computing with a focus on cognitive load, and opens up new avenues for future research in the field of cross-domain transfer learning using self-supervised pre-training.
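The core idea of masked autoencoding on feature vectors can be sketched without any neural network: hide a random subset of features, then measure reconstruction error only on the hidden positions. The sketch below shows just that masking and loss bookkeeping. The function names, the zero-fill masking value, and the mask ratio are assumptions made for illustration, and the transformer that would produce the reconstruction is omitted.

```python
import random


def mask_features(features, mask_ratio, rng):
    """Zero out a random subset of features; return (masked_vector, masked_indices)."""
    n_masked = round(len(features) * mask_ratio)
    masked_idx = set(rng.sample(range(len(features)), n_masked))
    masked = [0.0 if i in masked_idx else x for i, x in enumerate(features)]
    return masked, sorted(masked_idx)


def masked_mse(original, reconstruction, masked_idx):
    """Mean squared error computed only over the masked positions."""
    if not masked_idx:
        return 0.0
    return sum((original[i] - reconstruction[i]) ** 2 for i in masked_idx) / len(masked_idx)


# Made-up EEG-derived feature vector.
features = [0.5, 1.2, -0.3, 0.8]
masked, idx = mask_features(features, mask_ratio=0.5, rng=random.Random(0))

# A perfect reconstruction scores zero; echoing the masked input does not.
print(masked_mse(features, features, idx), masked_mse(features, masked, idx))
```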
We focus on a largely overlooked but crucial modality for parent-child interaction analysis: physical contact. In this paper, we provide a feasibility study to automatically detect contact between a parent and child from videos. Our multimodal CNN model uses a combination of 2D pose heatmaps, body part heatmaps, and cropped images. Two datasets (FlickrCI3D and YOUth PCI) are used to explore the generalization capabilities across different contact scenarios. Our experiments demonstrate that using 2D pose heatmaps and body part heatmaps yields the best performance in contact classification when trained from scratch on parent-infant interactions. We further investigate the influence of proximity on our classification performance. Our results indicate that there are unique challenges in parent-infant contact classification. Finally, we show that contact rates from aggregating frame-level predictions provide decent approximations of the true contact rates, suggesting that they can serve as an automated proxy for measuring the quality of parent-child interactions. By releasing the annotations for the YOUth PCI dataset and our code, we encourage further research to deepen our understanding of parent-infant interactions and their implications for attachment and development.
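Aggregating frame-level predictions into a contact rate can be as simple as thresholding each frame's contact probability and taking the fraction of frames classified as contact. The sketch below assumes exactly that; the threshold, the function name, and the probabilities are hypothetical, and the paper may aggregate differently.

```python
def contact_rate(frame_probs, threshold=0.5):
    """Fraction of frames whose predicted contact probability reaches `threshold`."""
    if not frame_probs:
        raise ValueError("need at least one frame")
    contact_frames = sum(1 for p in frame_probs if p >= threshold)
    return contact_frames / len(frame_probs)


# Made-up per-frame contact probabilities for a short clip.
print(contact_rate([0.9, 0.8, 0.2, 0.6, 0.1]))
```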
Enhancing Resilience to Missing Data in Audio-Text Emotion Recognition with Multi-Scale Chunk Regularization
Most existing audio-text emotion recognition studies have focused on the computational modeling aspects, including strategies for fusing the modalities. An area that has received less attention is understanding the role of proper temporal synchronization between the modalities in the model performance. This study presents a transformer-based model designed with a word-chunk concept, which offers an ideal framework to explore different strategies to align text and speech. The approach creates chunks with alternative alignment strategies with different levels of dependency on the underlying lexical boundaries. A key contribution of this study is the multi-scale chunk alignment strategy, which generates random alignments to create the chunks without considering lexical boundaries. For every epoch, the approach generates a different alignment for each sentence, serving as an effective regularization method for temporal dependency. Our experimental results based on the MSP-Podcast corpus indicate that providing precise temporal alignment information to create the audio-text chunks does not improve the performance of the system. The attention mechanisms in the transformer-based approach are able to compensate for imperfect synchronization between the modalities. However, using exact lexical boundaries makes the system highly vulnerable to missing modalities. In contrast, the model trained with the proposed multi-scale chunk regularization strategy using random alignment can significantly increase its robustness against missing data and remain effective, even under a single audio-only emotion recognition task. The code is available at: https://github.com/winston-lin-wei-cheng/MultiScale-Chunk-Regularization
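As a rough illustration of drawing a new random alignment every epoch, the sketch below splits a sequence of audio frames into contiguous chunks at random cut points. It covers only the random-boundary idea; the function `random_chunks`, the choice of five chunks, and the 100-frame sentence are hypothetical, and the released repository should be consulted for the actual multi-scale procedure.

```python
import random


def random_chunks(num_frames, num_chunks, rng):
    """Split range(num_frames) into num_chunks contiguous, non-empty spans.

    Cut points are drawn at random and ignore lexical (word) boundaries.
    """
    if not 1 <= num_chunks <= num_frames:
        raise ValueError("need 1 <= num_chunks <= num_frames")
    # Distinct interior cut points guarantee that every chunk is non-empty.
    cuts = sorted(rng.sample(range(1, num_frames), num_chunks - 1))
    bounds = [0] + cuts + [num_frames]
    return [(bounds[i], bounds[i + 1]) for i in range(num_chunks)]


rng = random.Random(0)  # fixed seed only so the sketch is reproducible
for epoch in range(3):
    # A fresh alignment is drawn each epoch for the same sentence.
    print(epoch, random_chunks(num_frames=100, num_chunks=5, rng=rng))
```

Because the boundaries change on every pass, a model trained this way never sees a fixed pairing between text tokens and audio spans, which is the regularizing effect the abstract describes.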
The violin is one of the most popular instruments, but it is hard to learn. The bowing of the right hand is a crucial factor in determining the tone quality, but it is too complex to master, teach, and reproduce. Therefore, many studies have attempted to measure and analyze the bowing of the violin to help record performances and support practice. This work aimed to measure bow pressure, one of the parameters of bowing motion.
The proposed method uses a regression model to estimate bow pressure based on bow deformation. First, the deformation of the bow hair is measured by photo-reflective sensors attached to the bow stick. Next, the regression model is trained based on the correspondence between the bow deformation and the true value of bow force measured by a load cell. The adjusted coefficient of determination was 0.84, with an average mean absolute error of 0.11 N and an average mean absolute percentage error of 19.1%. We also present an example of an application for real-time bow pressure estimation and visual feedback during violin playing.
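The three reported metrics follow standard definitions, sketched below. The sample force values and the single-predictor setting are made up for illustration; the number of sensor inputs used in the actual regression model is not stated in the abstract.

```python
def regression_report(y_true, y_pred, num_predictors):
    """Return (adjusted R^2, mean absolute error, mean absolute percentage error)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalises the plain R^2 for the number of predictors.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - num_predictors - 1)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    # MAPE is undefined when a true value is zero; the toy data avoids that.
    mape = 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n
    return adj_r2, mae, mape


# Hypothetical load-cell readings (N) and model estimates (N).
force_true = [1.0, 2.0, 3.0, 4.0]
force_pred = [1.1, 1.9, 3.2, 3.8]
adj_r2, mae, mape = regression_report(force_true, force_pred, num_predictors=1)
print(adj_r2, mae, mape)
```

Note that MAPE grows quickly when the true force is small, which is one reason an absolute error of 0.11 N can coexist with a percentage error near 19%.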
Given the computing power of mobile devices, porting feature-rich applications to these devices is increasingly feasible. However, feature-rich applications include large command sets, and providing access to these commands through screen-based widgets results in issues of occlusion and layering. To address this issue, we introduce Ether-Mark, a hierarchical, gesture-based, marking-menu-inspired, around-device menu for mobile devices enabling both on- and near-device interaction. We investigate the design of such menus and their learnability through three experiments. We first design and contrast three variants of Ether-Mark, yielding a zigzag menu design. We then refine input accuracy via a deformation model of the menu. We also evaluate the learnability of the menus and the accuracy of the deformation model, revealing an accuracy rate of up to 98.28%. Finally, we compare in-air Ether-Mark with marking menus. Our results argue for Ether-Mark as a promising and effective mechanism to leverage proximal around-device space.
Evaluating Outside the Box: Lessons Learned on eXtended Reality Multi-modal Experiments Beyond the Laboratory
Over time, numerous multimodal eXtended Reality (XR) user studies have been conducted in laboratory environments, with participants fulfilling tasks under the guidance of a researcher. Although generalizable results have helped increase the maturity of the field, it is also paramount to address the ecological validity of evaluations outside the laboratory. Despite real-world scenarios being clearly challenging, successful in-situ and remote deployment has become realistic for addressing a broad variety of research questions, thus expanding the participant sample to more specific target users and accounting for multi-modal constraints not reflected in controlled laboratory settings, among other benefits. In this paper, a set of multimodal XR experiments conducted outside the laboratory are described (e.g., industrial field studies, remote collaborative tasks, longitudinal rehabilitation exercises). Then, a list of lessons learned is reported, illustrating challenges and opportunities, with the aim of raising awareness in the research community and facilitating further evaluations.
Evaluating the Potential of Caption Activation to Mitigate Confusion Inferred from Facial Gestures in Virtual Meetings
Following the COVID-19 pandemic, virtual meetings have not only become an integral part of collaboration, but are now also a popular tool for disseminating information to a large audience through webinars, online lectures, and the like. Ideally, the meeting participants should understand discussed topics as smoothly as in physical encounters. However, many experience confusion, but are hesitant to express their doubts. In this paper, we present the results from a user study with 45 Google Meet users that investigates how auto-generated captions can be used to improve comprehension. The results show that captions can help overcome confusion caused by language barriers, but not if it is the result of distorted words. To mitigate negative side effects such as occlusion of important visual information when captions are not strictly needed, we propose to activate them dynamically only when a user effectively experiences confusion. To determine instances that require captioning, we test whether the subliminal cues from facial gestures can be used to detect confusion. We confirm that confusion activates six facial action units (AU4, AU6, AU7, AU10, AU17, and AU23).
In recent decades, the field of affective computing has made substantial progress in advancing the ability of AI systems to recognize and express affective phenomena, such as affect and emotions, during human-human and human-machine interactions. This paper describes our examination of research at the intersection of multimodal interaction and affective computing, with the objective of observing trends and identifying understudied areas. We examined over 16,000 papers from selected conferences in multimodal interaction, affective computing, and natural language processing: ACM International Conference on Multimodal Interaction, AAAC International Conference on Affective Computing and Intelligent Interaction, Annual Meeting of the Association for Computational Linguistics, and Conference on Empirical Methods in Natural Language Processing. We identified 910 affect-related papers and present our analysis of the role of affective phenomena in these papers. We find that this body of research has primarily focused on enabling machines to recognize or express affect and emotion; there has been limited research on how affect and emotion predictions might, in turn, be used by AI systems to enhance machine understanding of human social behaviors and cognitive states. Based on our analysis, we discuss directions to expand the role of affective phenomena in multimodal interaction research.
While depression has been studied via multimodal non-verbal behavioural cues, head motion behaviour has not received much attention as a biomarker. This study demonstrates the utility of fundamental head-motion units, termed kinemes, for depression detection by adopting two distinct approaches, and employing distinctive features: (a) discovering kinemes from head motion data corresponding to both depressed patients and healthy controls, and (b) learning kineme patterns only from healthy controls, and computing statistics derived from reconstruction errors for both the patient and control classes. Employing machine learning methods, we evaluate depression classification performance on the BlackDog and AVEC2013 datasets. Our findings indicate that: (1) head motion patterns are effective biomarkers for detecting depressive symptoms, and (2) explanatory kineme patterns consistent with prior findings can be observed for the two classes. Overall, we achieve peak F1 scores of 0.79 and 0.82, respectively, over BlackDog and AVEC2013 for binary classification over episodic thin-slices, and a peak F1 of 0.72 over videos for AVEC2013.
Tangible user interfaces offer the benefit of incorporating physical aspects in the interaction with digital systems, enriching how system information can be conveyed. We investigated how visual, haptic, and audio modalities influence young children’s joint actions. We used a design-based research method to design and develop a multi-sensory tangible device. Two kindergarten teachers and 31 children were involved in our design process. We tested the final prototype with 20 children aged 5-6 from three kindergartens. The main findings were: a) involving and getting approval from kindergarten teachers in the design process was essential; b) simultaneously providing visual and audio feedback might help improve children’s collaborative actions. Our study is interdisciplinary research spanning human-computer interaction and children’s education, and contributes an empirical understanding of the factors influencing children’s collaboration and communication.
FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning
This paper presents FaceXHuBERT, a text-less speech-driven 3D facial animation generation method that generates facial cues driven by an emotional expressiveness condition. In addition, it can handle audio recorded in a variety of situations (e.g. background noise, multiple people speaking). Recent approaches employ end-to-end deep learning taking into account both audio and text as input to generate 3D facial animation. However, the scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck. The resulting animations still have issues regarding accurate lip-syncing, emotional expressivity, person-specific facial cues and generalizability. In this work, we first achieve better results than the state of the art on the speech-driven 3D facial animation generation task by effectively employing the self-supervised pretrained HuBERT speech model, which allows incorporating both lexical and non-lexical information in the audio without using a large lexicon. Second, we incorporate an emotional expressiveness modality by guiding the network with a binary emotion condition. We carried out extensive objective and subjective evaluations in comparison to ground truth and the state of the art. A perceptual user study demonstrates that expressively generated facial animations using our approach are indeed perceived as more realistic and are preferred over the non-expressive ones. In addition, we show that having a strong audio encoder alone eliminates the need for a complex decoder in the network architecture, reducing the network complexity and training time significantly. We provide the code publicly and recommend watching the video.
Frame-Level Event Representation Learning for Semantic-Level Generation and Editing of Avatar Motion
Understanding an avatar’s motion and controlling its content is important for content creation and has been actively studied in computer vision and graphics. An avatar’s motion consists of frames representing the pose at each time step, and a subsequence of frames can be grouped into a segment based on semantic meaning. To enable semantic-level control of motion, it is important to understand the semantic division of the avatar’s motion. We define a semantic division of an avatar’s motion as an “event”, which switches only when the frame in the motion cannot be predicted from the previous frames and information of the last event, and we tackle editing motion and inferring motion from text based on events. However, this is challenging because we need to obtain the event information and control the content of motion based on the obtained event information. To overcome this challenge, we propose obtaining a frame-level event representation from the pair of motion and text and using it to edit events in motion and predict motion from the text. Specifically, we learn a frame-level event representation by reconstructing the avatar’s motion from the corresponding frame-level event representation sequence while inferring the sequence from the text. By doing so, we can predict motion from the text. Also, since the event at each motion frame is represented with the corresponding event representation, we can edit events in motion by editing the corresponding event representation sequence. We evaluated our method on the HumanML3D dataset and demonstrated that our model can generate motion from the text while editing motion flexibly (e.g., allowing the change of the event duration, modification of the event characteristics, and the addition of new events).
Incorporating feature uncertainty during model construction reveals the real generalization ability of the model. However, this factor has often been neglected in automatic gait event detection for Cerebral Palsy patients. Moreover, the prevailing vision-based gait event detection systems are expensive because they rely on high-end motion tracking cameras. This study proposes a low-cost gait event detection system for heel strike and toe-off events. A state-space model was constructed in which the temporal evolution of the gait signal was modeled by quantifying feature uncertainty. The model was trained using the Cardiff classifier. Ankle velocity was taken as the input feature. The frame associated with a state transition was marked as a gait event. The model was tested on 15 Cerebral Palsy patients and 15 normal subjects. Data acquisition was performed using low-cost Kinect cameras. The model identified gait events with an average error of 2 frames. All events were predicted before their actual occurrence. The error for toe-off was lower than that for heel strike. Incorporating the uncertainty factor in the detection of gait events yielded performance competitive with the state of the art.
Graph convolutional networks (GCNs) have achieved excellent results in image classification and natural language processing. However, at present, the application of GCNs in speech emotion recognition (SER) is not widely studied. Meanwhile, recent studies have shown that GCNs may not be able to adaptively capture the long-range contextual emotional information over the whole audio. To alleviate this problem, this paper proposes a Graph Convolutional Transformer (GCFormer) model which empowers the model to extract local and global emotional information. Specifically, we construct a cyclic graph and perform concise graph convolution operations to obtain spatial local features. Then, a consecutive transformer network learns higher-level representations and their global temporal correlation. Finally, the learned serialized representations from the transformer are mapped into a vector through a gated recurrent unit (GRU) pooling layer for emotion classification. The experimental results obtained on two public emotional datasets demonstrate that the proposed GCFormer performs significantly better than other GCN-based models in terms of prediction accuracy, and surpasses the other state-of-the-art deep learning models in terms of prediction accuracy and model efficiency.
HIINT: Historical, Intra- and Inter- personal Dynamics Modeling with Cross-person Memory Transformer
Accurately modeling affect dynamics, which refers to the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. However, modeling affect dynamics is challenging due to contextual factors, such as the complex and nuanced nature of intra- and inter-personal dependencies. Intrapersonal dependencies refer to the influences and dynamics within an individual, including their affective states and how they evolve over time. Interpersonal dependencies, on the other hand, involve the interactions and dynamics between individuals, encompassing how affective displays are influenced by and influence others during conversations. To address these challenges, we propose a Cross-person Memory Transformer (CPM-T) framework which explicitly models intra- and inter-personal dependencies in multi-modal non-verbal cues. The CPM-T framework maintains memory modules to store and update dependencies between earlier and later parts of a conversation. Additionally, our framework employs cross-modal attention to effectively align information from multiple modalities and leverages cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and robustness of our approach on three publicly available datasets for joint engagement, rapport, and human belief prediction tasks. Our framework outperforms baseline models in average F1-scores by up to 22.6%, 15.1%, and 10.0% respectively on these three tasks. Finally, we demonstrate the importance of each component in the framework via ablation studies with respect to multimodal temporal behavior.
How Noisy is Too Noisy? The Impact of Data Noise on Multimodal Recognition of Confusion and Conflict During Collaborative Learning
Intelligent systems to support collaborative learning rely on real-time behavioral data, including language, audio, and video. However, noisy data, such as word errors in speech recognition, audio static or background noise, and facial mistracking in video, often limit the utility of multimodal data. It is an open question of how we can build reliable multimodal models in the face of substantial data noise. In this paper, we investigate the impact of data noise on the recognition of confusion and conflict moments during collaborative programming sessions by 25 dyads of elementary school learners. We measure language errors with word error rate (WER), audio noise with speech-to-noise ratio (SNR), and video errors with frame-by-frame facial tracking accuracy. The results showed that the model’s accuracy for detecting confusion and conflict in the language modality decreased drastically from 0.84 to 0.73 when the WER exceeded 20%. Similarly, in the audio modality, the model’s accuracy decreased sharply from 0.79 to 0.61 when the SNR dropped below 5 dB. Conversely, the model’s accuracy remained relatively constant in the video modality at a comparable level (> 0.70) so long as at least one learner’s face was successfully tracked. Moreover, we trained several multimodal models and found that integrating multimodal data could effectively offset the negative effect of noise in unimodal data, ultimately leading to improved accuracy in recognizing confusion and conflict. These findings have practical implications for the future deployment of intelligent systems that support collaborative learning in actual classroom settings.
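Word error rate, the language-noise measure used here, is conventionally the word-level edit distance between a reference transcript and the recognizer output, divided by the reference length. The sketch below uses that standard definition; the example sentences are invented and the study's own tooling and text normalization are not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # Row-by-row Levenshtein distance over word tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# Invented utterance: one substituted word and one dropped word out of ten.
reference = "make the cat move when the green flag is clicked"
hypothesis = "make the cat move when the green flak is"
wer = word_error_rate(reference, hypothesis)
print(wer)
```

Under this definition the example sits exactly at the 20% level that the abstract identifies as the point beyond which language-based accuracy drops.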
Identifying Interlocutors’ Behaviors and its Timings Involved with Impression Formation from Head-Movement Features and Linguistic Features
A prediction-explanation framework is proposed to identify when and what behaviors are involved in forming interlocutors’ impressions in group discussions. We targeted the self-reported scores of 16 impressions, including enjoyment and concentration. To that end, we formulate the problem as discovering behavioral features that contributed to the impression prediction and determining the timings at which the behaviors frequently occurred. To solve this problem, this paper proposes a two-fold framework consisting of the prediction part followed by the explanation part. The former prediction part employs random forest regressors using functional head-movement features and BERT-based linguistic features, which can capture various aspects of interactive conversational behaviors. The latter part measures the levels of features’ contribution to the prediction using a SHAP analysis and introduces a novel idea of temporal decomposition of features’ contributions over time. The influential behaviors and their timings are identified from local maxima of the temporal distribution of features’ contributions. Targeting 17-group 4-female discussions, the predictability and explainability of the proposed framework are confirmed by some case studies and quantitative evaluations of the detected timings.
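Picking timings from local maxima of a contribution curve can be sketched as follows. The strict-inequality rule, the exclusion of the first and last bins, and the sample values are all assumptions; the abstract does not describe the exact peak-detection procedure.

```python
def local_maxima(values):
    """Indices whose value is strictly greater than both neighbours.

    Endpoints are skipped because they have only one neighbour.
    """
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]


# Hypothetical contribution of one feature in successive time bins.
contribution = [0.1, 0.4, 0.2, 0.2, 0.9, 0.3, 0.5]
peaks = local_maxima(contribution)
print(peaks)
```

Each returned index would correspond to a time bin in which the feature contributed more to the prediction than in the adjacent bins.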
Implicit Search Intent Recognition using EEG and Eye Tracking: Novel Dataset and Cross-User Prediction
For machines to effectively assist humans in challenging visual search tasks, they must differentiate whether a human is simply glancing into a scene (navigational intent) or searching for a target object (informational intent). Previous research proposed combining electroencephalography (EEG) and eye-tracking measurements to recognize such search intents implicitly, i.e., without explicit user input. However, the applicability of these approaches to real-world scenarios suffers from two key limitations. First, previous work used fixed search times in the informational intent condition – a stark contrast to visual search, which naturally terminates when the target is found. Second, methods incorporating EEG measurements addressed prediction scenarios that require ground truth training data from the target user, which is impractical in many use cases. We address these limitations by making the first publicly available EEG and eye-tracking dataset for navigational vs. informational intent recognition, where the user determines search times. We present the first method for cross-user prediction of search intents from EEG and eye-tracking recordings and reach accuracy in leave-one-user-out evaluations – comparable to within-user prediction accuracy () but offering much greater flexibility.
Heartbeat is not only one of our physical health indicators, but also plays an important role in our emotional changes. Previous studies have repeatedly investigated the soothing effects of low-frequency vibrotactile cues which evoke a slow heartbeat in stressful situations. The impact of stimuli which evoke faster heartbeats on users’ anxiety or heart rate is, however, poorly understood. We conducted two studies to evaluate the influence of the presentation of a fast heartbeat via vibration and/or sound, both in calm and stressed states. Results showed that the presentation of fast heartbeat stimuli can induce increased anxiety levels and heart rate. We use these results to inform how future designers could carefully present fast heartbeat stimuli in multimedia applications to enhance feelings of immersion, effort and engagement.
Research has shown that modifying the appearance of the virtual hand in immersive virtual reality can convey object properties to users. Whether we can achieve the same results in augmented reality is still to be determined since the user’s real hand is visible through the headset. Although displaying a virtual hand in augmented reality is usually not recommended, it could positively impact user effectiveness or appreciation of the application.
For this purpose, we propose an algorithm to compute virtual hand shape in AR, based on inverse kinematics and physical constraints. It allows users to naturally grasp virtual objects while keeping the virtual hand on their surface. We compare the influence of this hand representation on performance and user experience in a grasping task in AR, with two control conditions: a simple virtual hand that follows the real hand and a baseline condition without a virtual hand. Results from 48 participants show that all virtual hand conditions decreased user performance, but enhanced the satisfaction with the task.
Sign Language Recognition (SLR) is a challenging task that aims to bridge the communication gap between the deaf and hearing communities. In recent years, deep learning-based approaches have shown promising results in SLR. However, the lack of interpretability remains a significant challenge. In this paper, we seek to understand which hand and pose MediaPipe Landmarks are deemed the most important for prediction as estimated by a Transformer model. We propose to embed a learnable array of parameters into the model that performs an element-wise multiplication of the inputs. This learned array highlights the most informative input features that contributed to solving the recognition task, resulting in a human-interpretable vector that lets us interpret the model predictions. We evaluate our approach on the public datasets WLASL100 (SLR) and IPNHand (gesture recognition). We believe that the insights gained in this way could be exploited for the development of more efficient SLR pipelines.
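The idea of a learnable array that multiplies the inputs element-wise can be illustrated with a deliberately tiny toy. Everything here is an assumption made for illustration: the downstream model is a plain sum standing in for the Transformer, the three features and synthetic targets are invented, and `train_input_mask` is a hypothetical name.

```python
import random


def train_input_mask(samples, targets, steps=500, lr=0.1):
    """Learn one multiplicative weight per input feature by gradient descent on MSE."""
    num_features = len(samples[0])
    mask = [1.0] * num_features  # learnable array, initialised to pass-through
    n = len(samples)
    for _ in range(steps):
        grad = [0.0] * num_features
        for x, y in zip(samples, targets):
            gated = [m * xi for m, xi in zip(mask, x)]  # element-wise multiplication
            error = sum(gated) - y  # stand-in downstream model: a plain sum
            for i, xi in enumerate(x):
                grad[i] += 2.0 * error * xi / n  # d(error^2)/d(mask[i])
        mask = [m - lr * g for m, g in zip(mask, grad)]
    return mask


rng = random.Random(0)  # fixed seed only so the sketch is reproducible
samples = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(50)]
# Synthetic target: feature 0 matters most, feature 2 a little, feature 1 not at all.
targets = [2.0 * x[0] + 0.5 * x[2] for x in samples]

mask = train_input_mask(samples, targets)
ranking = sorted(range(3), key=lambda i: abs(mask[i]), reverse=True)
print(mask, ranking)
```

Reading off the magnitude of each learned entry gives a per-feature importance vector; in a real model the entries are only comparable if the inputs share a common scale.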
Gestures perform a variety of communicative functions that powerfully influence human face-to-face interaction. How this communicative function is achieved varies greatly between individuals and depends on the role of the speaker and the context of the interaction. Approaches to automatic gesture generation vary not only in the degree to which they rely on data-driven techniques but also the degree to which they can produce context- and speaker-specific gestures. However, these approaches face two major challenges: The first is obtaining sufficient training data that is appropriate for the context and the goal of the application. The second is related to designer control to realize their specific intent for the application. Here, we approach these challenges by using large language models (LLMs) to show that these powerful models of large amounts of data can be adapted for gesture analysis and generation. Specifically, we used ChatGPT as a tool for suggesting context-specific gestures that can realize designer intent based on minimal prompts. We also find that ChatGPT can suggest novel yet appropriate gestures not present in the minimal training data. The use of LLMs is a promising avenue for gesture generation that reduces the need for laborious annotations and has the potential to flexibly and quickly adapt to different designer intents.
Creating photo-realistic versions of people’s sketched portraits is useful for various entertainment purposes. Existing studies only generate portraits in the 2D plane with fixed views, making the results less vivid. In this paper, we present Stereoscopic Simplified Sketch-to-Portrait (SSSP), which explores the possibility of creating Stereoscopic 3D-aware portraits from simple contour sketches by involving 3D generative models. Our key insight is to design sketch-aware constraints that can fully exploit the prior knowledge of a tri-plane-based 3D-aware generative model. Specifically, our designed region-aware volume rendering strategy and global consistency constraint further enhance detail correspondences during sketch encoding. Moreover, in order to facilitate use by lay users, we propose a Contour-to-Sketch module with vector quantized representations, so that easily drawn contours can directly guide the generation of 3D portraits. Extensive comparisons show that our method generates high-quality results that match the sketch. Our usability study verifies that our system is preferred by users.
Autism spectrum disorder (ASD) is a developmental disorder characterized by significant impairments in social communication and difficulties perceiving and presenting communication signals. Machine learning techniques have been widely used to facilitate autism studies and assessments. However, computational models are primarily concentrated on very specific analysis and validated on private, non-public datasets in the autism community, which limits comparisons across models due to privacy-preserving data-sharing complications. This work presents a novel open source privacy-preserving dataset, MMASD as a MultiModal ASD benchmark dataset, collected from play therapy interventions for children with autism. The MMASD includes data from 32 children with ASD, and 1,315 data samples segmented from more than 100 hours of intervention recordings. To promote the privacy of children while offering public access, each sample consists of four privacy-preserving modalities, some of which are derived from original videos: (1) optical flow, (2) 2D skeleton, (3) 3D skeleton, and (4) clinician ASD evaluation scores of children. MMASD aims to assist researchers and therapists in understanding children’s cognitive status, monitoring their progress during therapy, and customizing the treatment plan accordingly. It also inspires downstream social tasks such as action quality assessment and interpersonal synchrony estimation. The dataset is publicly accessible via the MMASD project website.
The quality and effectiveness of psychotherapy sessions are highly influenced by the therapists’ ability to meaningfully connect with clients. Automated assessment of therapist empathy provides cost-effective and systematic means of assessing the quality of therapy sessions. In this work, we propose to assess therapist empathy using multimodal behavioral data, i.e. spoken language (text) and audio in real-world motivational interviewing (MI) sessions for alcohol abuse intervention. We first study each modality (text vs. audio) individually and then evaluate a multimodal approach using different fusion strategies for automated recognition of empathy levels (high vs. low). Leveraging recent pre-trained models both for text (DistilRoBERTa) and speech (HuBERT) as strong unimodal baselines, we obtain consistent 2-3 point improvements in F1 scores with early and late fusion, and the highest absolute improvement of 6–12 points over unimodal baselines. Our models obtain F1 scores of 68% when only looking at an early segment of the sessions and up to 72% in a therapist-dependent setting. In addition, our results show that a relatively small portion of sessions, specifically the second quartile, is most important in empathy prediction, outperforming predictions on later segments and on the full sessions. Our analyses of late fusion results show that fusion models rely more on the audio modality in limited-data settings, such as in individual quartiles and when using only therapist turns. Further, we observe the highest misclassification rates for parts of the sessions with MI-inconsistent utterances (20% misclassified by all models), likely due to the complex nature of these types of intents in relation to perceived empathy.
Large multimodal deep learning models such as Contrastive Language Image Pretraining (CLIP) have become increasingly powerful with applications across several domains in recent years. CLIP works on visual and language modalities and forms a part of several popular models, such as DALL-E and Stable Diffusion. It is trained on a large dataset of millions of image-text pairs crawled from the internet. Such large datasets are often used for training purposes without filtering, leading to models inheriting social biases from internet data. Given that models such as CLIP are being applied in such a wide variety of applications ranging from social media to education, it is vital that harmful biases are detected. However, due to the unbounded nature of the possible inputs and outputs, traditional bias metrics such as accuracy cannot detect the range and complexity of biases present in the model. In this paper, we present an audit of CLIP using an established technique from natural language processing called the Word Embedding Association Test (WEAT) to detect and quantify gender bias in CLIP and demonstrate that it can provide a quantifiable measure of such stereotypical associations. We detected, measured, and visualised various types of stereotypical gender associations with respect to character descriptions and occupations and found that CLIP shows evidence of stereotypical gender bias.
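For readers unfamiliar with WEAT, the sketch below computes its commonly used effect size from two target sets and two attribute sets of embeddings. The two-dimensional vectors are toy stand-ins rather than CLIP outputs, the use of the sample standard deviation is one common convention, and how the audit adapts the test to image and text embeddings is not specified in the abstract.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """How much closer w is, on average, to attribute set A than to attribute set B."""
    return (mean([cosine(w, a) for a in attrs_a])
            - mean([cosine(w, b) for b in attrs_b]))


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations of X and Y, normalised by the pooled spread."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Toy embeddings: X leans toward attribute A, Y leans toward attribute B.
X = [[1.0, 0.1], [0.9, 0.2]]
Y = [[0.1, 1.0], [0.2, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]
effect = weat_effect_size(X, Y, A, B)
print(effect)
```

A positive effect size indicates that the first target set is more strongly associated with attribute set A than the second target set is; swapping the target sets flips the sign.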
In order to perform multimodal fusion of heterogeneous signals, we need to understand their interactions: how each modality individually provides information useful for a task and how this information changes in the presence of other modalities. In this paper, we perform a comparative study of how humans annotate two categorizations of multimodal interactions: (1) partial labels, where different annotators annotate the label given the first, second, and both modalities, and (2) counterfactual labels, where the same annotator annotates the label given the first modality before asking them to explicitly reason about how their answer changes when given the second. We further propose an alternative taxonomy based on (3) information decomposition, where annotators annotate the degrees of redundancy: the extent to which modalities individually and together give the same predictions, uniqueness: the extent to which one modality enables a prediction that the other does not, and synergy: the extent to which both modalities enable one to make a prediction that one would not otherwise make using individual modalities. Through experiments and annotations, we highlight several opportunities and limitations of each approach and propose a method to automatically convert annotations of partial and counterfactual labels to information decomposition, yielding an accurate and efficient method for quantifying multimodal interactions.
This paper presents a computational study to analyze and predict turns (i.e., turn-taking and turn-keeping) in multiparty conversations. Specifically, we use a high-fidelity hybrid data acquisition system to capture a large-scale set of multi-modal natural conversational behaviors of interlocutors in three-party conversations, including gazes, head movements, body movements, speech, etc. Based on the inter-pausal units (IPUs) extracted from the in-house acquired dataset, we propose a transformer-based computational model to predict the turns based on the interlocutor states (speaking/back-channeling/silence) and the gaze targets. Our model can robustly achieve more than 80% accuracy, and the generalizability of our model was extensively validated through cross-group experiments. Also, we introduce a novel computational metric called “relative engagement level” (REL) of IPUs, and further validate its statistical significance between turn-keeping IPUs and turn-taking IPUs, and between different conversational groups. Our experimental results also found that the patterns of the interlocutor states can be used as a more effective cue than their gaze behaviors for predicting turns in multiparty conversations.
Personalized prediction is a machine learning approach that predicts a person’s future observations based on their past labeled observations and is typically used for sequential tasks, e.g., to predict daily mood ratings. When making personalized predictions, a model can combine two types of trends: (a) trends shared across people, i.e., person-generic trends, such as being happier on weekends, and (b) unique trends for each person, i.e., person-specific trends, such as a stressful weekly meeting. Mixed effect models are popular statistical models to study both trends by combining person-generic and person-specific parameters. Though linear mixed effect models are gaining popularity in machine learning by integrating them with neural networks, these integrations are currently limited to linear person-specific parameters, ruling out nonlinear person-specific trends. In this paper, we propose Neural Mixed Effect (NME) models to optimize nonlinear person-specific parameters anywhere in a neural network in a scalable manner. NME combines the efficiency of neural network optimization with nonlinear mixed effects modeling. Empirically, we observe that NME improves performance across six unimodal and multimodal datasets, including a smartphone dataset to predict daily mood and a mother-adolescent dataset to predict affective state sequences where half the mothers experience symptoms of depression. Furthermore, we evaluate NME for two model architectures, including for neural conditional random fields (CRF) to predict affective state sequences where the CRF learns nonlinear person-specific temporal transitions between affective states. Analysis of these person-specific transitions on the mother-adolescent dataset shows interpretable trends related to the mother’s depression symptoms.
Affective aggression is a form of aggression characterized by impulsive reactions driven by strong negative emotions. Despite the extensive research in the area of automatic emotion recognition, affective aggression is a phenomenon that has received less attention. This study investigates the use of head motion as a potential indicator of affective aggression and negative affect. It provides an analysis of head movement patterns associated with various levels of aggression, valence, arousal and dominance, and compares behaviors and recognition performance under speaking and listening conditions. The study was conducted on the Negative Affect and Aggression database – a multimodal corpus of dyadic interactions between aggression regulation training actors and non-actors, annotated for levels of aggression, valence, arousal, and dominance. Results demonstrate that head motion features can serve as promising indicators of affect during both speaking and listening. Valence and arousal prediction achieved better performance during speaking, while aggression and dominance were better predicted during listening. Significant increases in the magnitude of pitch angular acceleration were associated with escalation along all four annotated dimensions. Interestingly, higher escalation was accompanied by a significant increase in the total number of movements during speaking, but a significant decrease of the number of movements was observed as escalation increased along listening intervals. These findings are particularly relevant as head motion can be used solely or potentially as a supplementary modality when other modalities such as speech or facial expressions are unavailable or altered.
As socially mediated interaction is becoming increasingly important and multi-modal, even expanding into virtual reality and physical telepresence with robotic avatars, new challenges emerge. For instance, video calls have become the norm and it is increasingly common that people experience a form of asymmetry, such as not being heard or seen by their communication partners online due to connection issues. Previous research has not yet extensively explored the effect of such asymmetry on social interaction. In this study, 61 dyads, i.e. 122 adults, played a quiz-like game using a video-conferencing platform and evaluated the quality of their social interaction by measuring five sub-scales of social presence. The dyads had either symmetrical access to social cues (both only audio, or both audio and video) or asymmetrical access (one partner receiving only audio, the other audio and video). Our results showed that in the case of asymmetrical access, the party receiving more modalities, i.e. audio and video from the other, felt significantly less connected than their partner. We discuss these results in relation to the Media Richness Theory (MRT) and the Hyperpersonal Model: in asymmetry, more modalities or cues will not necessarily increase the feeling of social connection, contrary to what MRT predicts. We hypothesize that participants sending fewer cues compensate by increasing the richness of their expressions and that the interaction shifts towards an equivalent richness for both participants.
Paying Attention to Wildfire: Using U-Net with Attention Blocks on Multimodal Data for Next Day Prediction
Predicting where wildfires will spread provides invaluable information to firefighters and scientists, which can save lives and homes. However, doing so requires a large amount of multimodal data, e.g., accurate weather predictions, real-time satellite data, and environmental descriptors. In this work, we utilize 12 distinct features from multiple modalities in order to predict where wildfires will spread over the next 24 hours. We created a custom U-Net architecture designed to train as efficiently as possible, while still maximizing accuracy, to facilitate quickly deploying the model when a wildfire is detected. Our custom architecture demonstrates state-of-the-art performance and trains an order of magnitude more quickly than prior work, while using fewer computational resources. We further evaluated our architecture with an ablation study to identify which features were key for prediction and which provided negligible impact on performance. All of our source code is available on GitHub.
Performance Exploration of RNN Variants for Recognizing Daily Life Stress Levels by Using Multimodal Physiological Signals
Enduring stress can have negative impacts on human health and behavior. Widely used wearable devices are promising for assessing, monitoring and potentially alleviating high stress in daily life. Although numerous automatic stress recognition studies have been carried out in the laboratory environment with high accuracy, the performance of daily life studies still falls far short of what the literature reports for laboratory environments. Since the physiological signals obtained from these devices are time-series data, Recurrent Neural Network (RNN) based classifiers promise better results than other machine learning methods. However, the performance of RNN-based classifiers has not yet been extensively evaluated (i.e., with several variants and different application techniques) for detecting daily life stress. For example, they could be combined with CNN architectures or applied to either raw data or handcrafted features. In this study, we created different RNN architecture variants and explored their performance for recognizing daily life stress to guide researchers in the field.
Predicting Player Engagement in Tom Clancy’s The Division 2: A Multimodal Approach via Pixels and Gamepad Actions
This paper introduces a large scale multimodal corpus collected for the purpose of analysing and predicting player engagement in commercial-standard games. The corpus is solicited from 25 players of the action role-playing game Tom Clancy’s The Division 2, who annotated their level of engagement using a time-continuous annotation tool. The cleaned and processed corpus presented in this paper consists of nearly 20 hours of annotated gameplay videos accompanied by logged gamepad actions. We report preliminary results on predicting long-term player engagement based on in-game footage and game controller actions using Convolutional Neural Network architectures. Results obtained suggest we can predict the player engagement with up to accuracy on average ( at best) when we fuse information from the game footage and the player’s controller input. Our findings validate the hypothesis that long-term (i.e. 1 hour of play) engagement can be predicted efficiently solely from pixels and gamepad actions.
Collaborative manipulation is inherently multimodal, with haptic communication playing a central role. When performed by humans, it involves back-and-forth force exchanges between the participants through which they resolve possible conflicts and determine their roles. Much of the existing work on collaborative human-robot manipulation assumes that the robot follows the human. But for a robot to match the performance of a human partner it needs to be able to take initiative and lead when appropriate. To achieve such human-like performance, the robot needs to have the ability to (1) determine the intent of the human, (2) clearly express its own intent, and (3) choose its actions so that the dyad reaches consensus. This work proposes a framework for recognizing human intent in collaborative manipulation tasks using force exchanges. Grounded in a dataset collected during a human study, we introduce a set of features that can be computed from the measured signals and report the results of a classifier trained on our collected human-human interaction data. Two metrics are used to evaluate the intent recognizer: overall accuracy and the ability to correctly identify transitions. The proposed recognizer shows robustness against the variations in the partner’s actions and the confounding effects due to the variability in grasp forces and dynamic effects of walking. The results demonstrate that the proposed recognizer is well-suited for implementation in a physical interaction control scheme.
Flexible and natural nonverbal reactions to human behavior remain a challenge for socially interactive agents (SIAs) that are predominantly animated using hand-crafted rules. While recently proposed machine learning based approaches to conversational behavior generation are a promising way to address this challenge, they have not yet been employed in SIAs. The primary reason for this is the lack of a software toolkit that integrates such approaches with SIA frameworks and conforms to the challenging real-time requirements of human-agent interaction scenarios. In our work, we present, for the first time, such a toolkit consisting of three main components: (1) real-time feature extraction capturing multi-modal social cues from the user; (2) behavior generation based on a recent state-of-the-art neural network approach; (3) visualization of the generated behavior supporting both FLAME-based and Apple ARKit-based interactive agents. We comprehensively evaluate the real-time performance of the whole framework and its components. In addition, we introduce pre-trained behavioral generation models derived from psychotherapy sessions for domain-specific listening behaviors. Our software toolkit, pivotal for deploying and assessing SIAs’ listening behavior in real-time, is publicly available. Resources, including code and behavioural multi-modal features extracted from therapeutic interactions, are hosted at https://daksitha.github.io/ReNeLib
Representation Learning for Interpersonal and Multimodal Behavior Dynamics: A Multiview Extension of Latent Change Score Models
Characterizing the dynamics of behavior across multiple modalities and individuals is a vital component of computational behavior analysis. This is especially important in certain applications, such as psychotherapy, where individualized tracking of behavior patterns can provide valuable information about the patient’s mental state. Conventional methods that rely on aggregate statistics and correlational metrics may not always suffice, as they are often unable to capture causal relationships or evaluate the true probability of identified patterns. To address these challenges, we present a novel approach to learning multimodal and interpersonal representations of behavior dynamics during one-on-one interaction. Our approach is enabled by the introduction of a multiview extension of latent change score models, which facilitates the concurrent capture of both inter-modal and interpersonal behavior dynamics and the identification of directional relationships between them. A core advantage of our approach is its high level of interpretability while simultaneously achieving strong predictive performance. We evaluate our approach within the domain of therapist-client interactions, with the objective of gaining a deeper understanding about the collaborative relationship between the two, a crucial element of the therapeutic process. Our results demonstrate improved performance over conventional approaches that rely upon summary statistics or correlational metrics. Furthermore, since our multiview approach includes the explicit modeling of uncertainty, it naturally lends itself to integration with probabilistic classifiers, such as Gaussian process models. We demonstrate that this integration leads to even further improved performance, all the while maintaining highly interpretable qualities. Our analysis provides compelling motivation for further exploration of stochastic systems within computational models of behavior.
While thinking aloud has been reported to positively affect problem-solving, the effects of the presence of an embodied entity (e.g., a social robot) to whom words can be directed remain mostly unexplored. In this work, we investigated the role of a robot in a “rubber duck debugging” setting, by analyzing how a robot’s listening behaviors could support a thinking-aloud problem-solving session. Participants completed two different tasks while speaking their thoughts aloud to either a robot or an inanimate object (a giant rubber duck). We implemented and tested two types of listener behavior in the robot: a rule-based heuristic and a deep-learning-based model. In a between-subject user study with 101 participants, we evaluated how the presence of a robot affected users’ engagement in thinking aloud, behavior during the task, and self-reported user experience. In addition, we explored the impact of the two robot listening behaviors on those measures. In contrast to prior work, our results indicate that neither the rule-based heuristic nor the deep learning robot conditions improved performance or perception of the task, compared to an inanimate object. We discuss potential explanations and shed light on the feasibility of designing social robots as assistive tools in thinking-aloud problem-solving tasks.
SHAP-based Prediction of Mother’s History of Depression to Understand the Influence on Child Behavior
Depression strongly impacts parents’ behavior. Does parents’ depression strongly affect the behavior of their children as well? To investigate this question, we compared dyadic interactions between 73 depressed and 75 non-depressed mothers and their adolescent child. Families were of low income and 84% were white. Child behavior was measured from audio-video recordings using manual annotation of verbal and nonverbal behavior by expert coders and by multimodal computational measures of facial expression, face and head dynamics, prosody, speech behavior, and linguistics. For both sets of measures, we used Support Vector Machines. For computational measures, we investigated the relative contribution of single versus multiple modalities using a novel approach to SHapley Additive exPlanations (SHAP). Computational measures outperformed manual ratings by human experts. Among individual computational measures, prosody was the most informative. SHAP reduction resulted in a four-fold decrease in the number of features and highest performance (77% accuracy; positive and negative agreements at 75% and 76%, respectively). These findings suggest that maternal depression strongly impacts the behavior of adolescent children; differences are most revealed in prosody; multimodal features together with SHAP reduction are most powerful.
Synerg-eye-zing: Decoding Nonlinear Gaze Dynamics Underlying Successful Collaborations in Co-located Teams
Joint Visual Attention (JVA) has long been considered a critical component of successful collaborations, enabling coordination and construction of a shared knowledge space. However, recent studies challenge the notion that JVA alone ensures effective collaboration. To gain deeper insights into JVA’s influence, we examine nonlinear gaze coupling and gaze regularity in the collaborators’ visual attention. Specifically, we analyze gaze data from 19 dyadic and triadic teams engaged in a co-located programming task using Recurrence Quantification Analysis (RQA). Our results emphasize the significance of team-level gaze regularity for improving task performance – highlighting the importance of maintaining stable or sustained episodes of joint or individual attention, rather than disjointed patterns. Additionally, through regression analyses, we examine the predictive capacity of recurrence metrics for subjective traits such as social cohesion and social loafing, revealing unique interpersonal and team dynamics behind productive collaborations. We elaborate on our findings via qualitative anecdotes and discuss their implications in shaping real-time interventions for optimizing collaborative success.
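As a rough illustration of what a recurrence analysis of gaze coupling computes, the sketch below builds a cross-recurrence matrix between two collaborators' gaze traces and derives the recurrence rate, the simplest RQA metric. It is an assumption-laden toy, not the authors' pipeline: the function names, the (x, y) gaze representation, and the fixed `radius` threshold are all hypothetical choices made for this example.

```python
import numpy as np

def cross_recurrence_matrix(gaze_a, gaze_b, radius):
    """R[i, j] is True when person A's gaze at time i lies within `radius`
    of person B's gaze at time j (Euclidean distance on (x, y) points)."""
    a = np.asarray(gaze_a, dtype=float)
    b = np.asarray(gaze_b, dtype=float)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return dist <= radius

def recurrence_rate(R):
    """Fraction of recurrent points in the (cross-)recurrence matrix."""
    return float(np.mean(R))

# Two identical three-sample gaze traces (illustrative coordinates).
gaze_a = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
gaze_b = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
R = cross_recurrence_matrix(gaze_a, gaze_b, radius=0.1)
rate = recurrence_rate(R)  # only the main diagonal recurs here: 3 of 9 cells
```

Points on the main diagonal correspond to simultaneous joint attention, while off-diagonal points capture one person looking where the other looked earlier or later.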
The Role of Audiovisual Feedback Delays and Bimodal Congruency for Visuomotor Performance in Human-Machine Interaction
Despite incredible technological progress in the last decades, latency is still an issue for today’s technologies and their applications. To better understand how latency and resulting feedback delays affect the interaction between humans and cyber-physical systems (CPS), the present study examines separate and joint effects of visual and auditory feedback delays on performance and the motor control strategy in a complex visuomotor task. Thirty-six participants played the Wire Loop Game, a fine motor skill task, while going through four different delay conditions: no delay, visual only, auditory only, and audiovisual (length: 200 ms). Participants’ speed and accuracy in completing the task and their movement kinematics were assessed. Visual feedback delays slowed down movement execution and impaired precision compared to a condition without feedback delays. In contrast, delayed auditory feedback improved precision. Descriptively, the latter finding mainly appeared when congruent visual and auditory feedback delays were provided. We discuss the role of temporal congruency of audiovisual information as well as potential compensatory mechanisms that can inform the design of multisensory feedback in human-CPS interaction faced with latency.
Mouth-based interfaces are a promising new approach enabling silent, hands-free and eyes-free interaction with wearable devices. However, interfaces sensing mouth movements are traditionally custom-designed and placed near or within the mouth. TongueTap synchronizes multimodal EEG, PPG, IMU, eye tracking and head tracking data from two commercial headsets to facilitate tongue gesture recognition using only off-the-shelf devices on the upper face. We classified eight closed-mouth tongue gestures with 94% accuracy, offering an invisible and inaudible method for discreet control of head-worn devices. Moreover, we found that the IMU alone differentiates eight gestures with 80% accuracy and a subset of four gestures with 92% accuracy. We built a dataset of 48,000 gesture trials across 16 participants, allowing TongueTap to perform user-independent classification. Our findings suggest tongue gestures can be a viable interaction technique for VR/AR headsets and earables without requiring novel hardware.
We present a novel approach to mitigate bias in facial expression recognition (FER) models. Our method aims to reduce sensitive attribute information such as gender, age, or race, in the embeddings produced by FER models. We employ a kernel mean shrinkage estimator to estimate the kernel mean of the distributions of the embeddings associated with different sensitive attribute groups, such as young and old, in the Hilbert space. Using this estimation, we calculate the maximum mean discrepancy (MMD) distance between the distributions and incorporate it in the classifier loss along with an adversarial loss, which is then minimized through the learning process to improve the distribution alignment. Our method makes sensitive attributes less recognizable for the model, which in turn promotes fairness. Additionally, for the first time, we analyze the notion of attractiveness as an important sensitive attribute in FER models and demonstrate that FER models can indeed exhibit biases towards more attractive faces. To prove the efficacy of our model in reducing bias regarding different sensitive attributes (including the newly proposed attractiveness attribute), we perform several experiments on two widely used datasets, CelebA and RAF-DB. The results in terms of both accuracy and fairness measures outperform the state-of-the-art in most cases, demonstrating the effectiveness of the proposed method.
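For readers unfamiliar with the MMD term described above, here is a minimal sketch of a squared maximum mean discrepancy between two sets of embeddings. It deliberately swaps in the plain (biased) empirical estimator with an RBF kernel rather than the kernel mean shrinkage estimator used in the paper; the function names and the `gamma` bandwidth are hypothetical.

```python
import numpy as np

def rbf_kernel(P, Q, gamma=1.0):
    """Pairwise RBF kernel k(p, q) = exp(-gamma * ||p - q||^2)."""
    sq_dist = (
        np.sum(P ** 2, axis=1)[:, None]
        + np.sum(Q ** 2, axis=1)[None, :]
        - 2.0 * P @ Q.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def mmd_squared(P, Q, gamma=1.0):
    """Biased empirical estimate of squared MMD between samples P and Q."""
    return float(
        rbf_kernel(P, P, gamma).mean()
        + rbf_kernel(Q, Q, gamma).mean()
        - 2.0 * rbf_kernel(P, Q, gamma).mean()
    )

# Synthetic stand-ins for embeddings of two sensitive-attribute groups.
rng = np.random.default_rng(0)
group_young = rng.normal(loc=0.0, size=(50, 4))
group_old = rng.normal(loc=3.0, size=(50, 4))
gap = mmd_squared(group_young, group_old)
```

In a training loop, a differentiable version of this quantity would be added to the classifier loss so that minimizing it pulls the two groups' embedding distributions together.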
Using the thermal modality in order to extract physiological signals as a noncontact means of remote monitoring is gaining traction in applications, such as healthcare monitoring. However, existing methods rely heavily on traditional tracking and mostly unsupervised signal processing methods, which could be affected significantly by noise and subjects’ movements. Using a novel deep learning architecture based on convolutional long short-term memory networks on a diverse dataset of 36 subjects, we present a personalized approach to extract multimodal signals, including the heart rate, respiration rate, and body temperature from thermal videos. We perform multimodal signal extraction for subjects in states of both active speaking and silence, requiring no parameter tuning in an end-to-end deep learning approach with automatic feature extraction. We experiment with different data sampling methods for training our deep learning models, as well as different network designs. Our results indicate the effectiveness and improved efficiency of the proposed models reaching more than 90% accuracy based on the availability of proper training data for each subject.
We present µGeT, a novel multimodal eyes-free text selection technique. µGeT combines touch interaction with microgestures. µGeT is especially suited for People with Visual Impairments (PVI) because it expands the input bandwidth of touchscreen devices, thus shortening the interaction paths for routine tasks. To do so, µGeT extends touch interaction (left/right and up/down flicks) using two simple microgestures: thumb touching either the index or the middle finger. For text selection, the multimodal technique allows us to directly modify the positioning of the two selection handles and the granularity of text selection. Two user studies, one with 9 PVI and one with 8 blindfolded sighted people, compared µGeT with a common baseline technique (VoiceOver-like, as on iPhone). Despite a large variability in performance, the two user studies showed that µGeT is faster overall and yields fewer errors than VoiceOver. A detailed analysis of the interaction trajectories highlights the different strategies adopted by the participants. Beyond text selection, this research shows the potential of combining touch interaction and microgestures for improving the accessibility of touchscreen devices for PVI.
Understanding the Social Context of Eating with Multimodal Smartphone Sensing: The Role of Country Diversity
Understanding the social context of eating is crucial for promoting healthy eating behaviors. Multimodal smartphone sensor data could provide valuable insights into eating behavior, particularly in mobile food diaries and mobile health apps. However, research on the social context of eating with smartphone sensor data is limited, despite extensive studies in nutrition and behavioral science. Moreover, the impact of country differences on the social context of eating, as measured by multimodal phone sensor data and self-reports, remains under-explored. To address this research gap, our study focuses on a dataset of approximately 24K self-reports on eating events provided by 678 college students in eight countries to investigate the country diversity that emerges from smartphone sensors during eating events for different social contexts (alone or with others). Our analysis revealed that while some smartphone usage features during eating events were similar across countries, others exhibited unique trends in each country. We further studied how user and country-specific factors impact social context inference by developing machine learning models with population-level (non-personalized) and hybrid (partially personalized) experimental setups. We showed that models based on the hybrid approach achieve AUC scores up to 0.75 with XGBoost models. These findings emphasize the importance of considering country differences in building and deploying machine learning models to minimize biases and improve generalization across different populations.
Intent classification is a key task in natural language processing (NLP) that aims to infer the goal or intention behind a user’s query. Most existing intent classification methods rely on supervised deep models trained on large annotated datasets of text-intent pairs. However, obtaining such datasets is often expensive and impractical in real-world settings. Furthermore, supervised models may overfit or face distributional shifts when new intents, utterances, or data distributions emerge over time, requiring frequent retraining. Online learning methods based on user feedback can overcome this limitation, as they do not need access to intents while collecting data and adapting the model continuously. In this paper, we propose a novel multi-armed contextual bandit framework that leverages a text encoder based on a large language model (LLM) to extract the latent features of a given utterance and jointly learn multimodal representations of encoded text features and intents. Our framework consists of two stages: offline pretraining and online fine-tuning. In the offline stage, we train the policy on a small labeled dataset using a contextual bandit approach. In the online stage, we fine-tune the policy parameters using the REINFORCE algorithm with a user feedback-based objective, without relying on the true intents. We further introduce a sliding window strategy for simulating the retrieval of data samples during online training. This novel two-phase approach enables our method to efficiently adapt to dynamic user preferences and data distributions with improved performance. An extensive set of empirical studies indicates that our method significantly outperforms policies that omit either offline pretraining or online fine-tuning, while achieving performance competitive with a supervised benchmark trained on an order of magnitude larger labeled dataset.
The “Water Level Task” (WLT) is a classic cognitive task that assesses an individual’s ability to draw the water level in a tilted container. Most of the existing research has used 2D imagery and shown that adults struggle with the task. Our research investigates if the use of augmented reality (AR) improves an individual’s performance by engaging embodied interaction and natural interaction with the world, thus taking advantage of their “intuitive physics.” We created a traditional online WLT to recruit low- and high-scoring participants for the AR experiment. Using a HoloLens2 AR headset, we created two containers half-filled with water. One of the simulations featured a water surface that did not remain horizontal when the container was tilted, while in the other simulation, the water surface remained level. Participants were able to interact with the containers and were asked to indicate which simulation looked more natural. Our results revealed that individuals prone to errors in the 2D version of the task were more likely to make errors in the AR version, indicating that misconceptions about water orientation persist even in a more natural setting. However, people’s perceptions of the natural orientation of water differed in 2D and AR settings, suggesting that different perceptual and cognitive factors were involved in participants’ intuitive understanding of the natural orientation of water in the two settings. Additionally, we found that participants were insensitive to minor tilts of the water surface. Our study highlights the potential benefits of using AR to create more realistic and interactive virtual environments, which provides a basis for further study of intuitive physics and how humans interact with physical environments.
In this study, we propose a bias-mitigation algorithm, dubbed ProxyMute, that uses an explainability method to detect proxy features of a given sensitive attribute (e.g., gender) and reduces their effects on decisions by disabling them during prediction time. We evaluate our method for a job recruitment use-case, on two different multimodal datasets, namely, FairCVdb and ChaLearn LAP-FI. The exhaustive set of experiments shows that information regarding the proxy features that are provided by explainability methods is beneficial and can be successfully used for the problem of bias mitigation. Furthermore, when combined with a target label normalization method, the proposed approach shows a good performance by yielding one of the fairest results without deteriorating the performance significantly compared to previous works on both experimental datasets. The scripts to reproduce the results are available at: https://github.com/gizemsogancioglu/expl-bias-mitigation.
Teamness is a newly proposed multidimensional construct aimed to characterize teams and their dynamic levels of interdependence over time. Specifically, teamness is deeply rooted in team cognition literature, considering how a team’s composition, processes, states, and actions affect collaboration. With this multifaceted construct being recently proposed, there is a call to the research community to investigate, measure, and model dimensions of teamness. In this study, we explored the speech content of 21 human-human-agent teams during a remote collaborative search task. Using self-report surveys of their social and affective states throughout the task, we conducted factor analysis to condense the survey measures into four components closely aligned with the dimensions outlined in the teamness framework: social dynamics and trust, affect, cognitive load, and interpersonal reliance. We then extracted features from teams’ speech using Linguistic Inquiry and Word Count (LIWC) and performed Epistemic Network Analyses (ENA) across these four teamwork components as well as team performance. We developed six hypotheses of how we expected specific LIWC features to correlate with self-reported team processes and performance, which we investigated through our ENA analyses. Through quantitative and qualitative analyses of the networks, we explore differences of speech patterns across the four components and relate these findings to the dimensions of teamness. Our results indicate that ENA models based on selected LIWC features were able to capture elements of teamness as well as team performance; this technique therefore shows promise for modeling of these states during CSCW, to ultimately design intelligent systems to promote greater teamness using speech-based measures.
Video-based Respiratory Waveform Estimation in Dialogue: A Novel Task and Dataset for Human-Machine Interaction
Respiration is closely related to speech, so respiratory information is useful for improving human-machine multimodal spoken interaction from various perspectives. A machine-learning task is presented for multimodal interactive systems to improve the compatibility of the systems and promote smooth interaction with them. This “video-based respiration waveform estimation (VRWE)” task consists of two subtasks: waveform amplitude estimation and waveform gradient estimation. A dataset consisting of respiratory data for 30 participants was created for this task, and a strong baseline method based on 3DCNN-ConvLSTM was evaluated on the dataset. Finally, VRWE, especially gradient estimation, was shown to be effective in predicting user voice activity after 200 ms. These results suggest that VRWE is effective for improving human-machine multimodal interaction.
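The two VRWE subtasks can be made concrete with a short sketch of how amplitude and gradient targets might be derived from a recorded respiration waveform. The min-max normalisation, the finite-difference scheme, and the function names are assumptions for illustration; the abstract does not state how the dataset defines its targets.

```python
from typing import List, Sequence


def amplitude_targets(waveform: Sequence[float]) -> List[float]:
    """Min-max normalise the waveform to [0, 1] (assumed normalisation)."""
    lo, hi = min(waveform), max(waveform)
    span = hi - lo
    if span == 0:
        return [0.0] * len(waveform)
    return [(v - lo) / span for v in waveform]


def gradient_targets(waveform: Sequence[float],
                     sample_rate_hz: float) -> List[float]:
    """Approximate the time derivative with central differences inside the
    signal and one-sided differences at both ends."""
    n = len(waveform)
    if n < 2:
        return [0.0] * n
    dt = 1.0 / sample_rate_hz
    grads = []
    for i in range(n):
        if i == 0:
            g = (waveform[1] - waveform[0]) / dt
        elif i == n - 1:
            g = (waveform[-1] - waveform[-2]) / dt
        else:
            g = (waveform[i + 1] - waveform[i - 1]) / (2 * dt)
        grads.append(g)
    return grads


# Hypothetical four-sample respiration signal recorded at 2 Hz.
signal = [0.0, 1.0, 2.0, 1.0]
amp = amplitude_targets(signal)
grad = gradient_targets(signal, sample_rate_hz=2.0)
```

A positive gradient corresponds to a rising waveform and a negative gradient to a falling one; which of these is inhalation depends on the sensor convention, which the abstract does not specify.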
In Smart City and Vehicle-to-Everything (V2X) systems, acquiring pedestrians’ accurate locations is crucial to traffic and pedestrian safety. Current systems adopt cameras and wireless sensors to estimate people’s locations via sensor fusion. Standard fusion algorithms, however, become inapplicable when multi-modal data is not associated, for example when pedestrians are out of the camera field of view or data from the camera modality is missing. To address this challenge and produce more accurate location estimations for pedestrians, we propose a localization solution based on a Generative Adversarial Network (GAN) architecture. During training, it learns the underlying linkage between pedestrians’ camera-phone data correspondences. During inference, it generates refined position estimations based only on pedestrians’ phone data, which consists of GPS, IMU, and FTM. Results show that our GAN produces 3D coordinates at 1 to 2 meters localization error across 5 different outdoor scenes. We further show that the proposed model supports self-learning. The generated coordinates can be associated with pedestrians’ bounding box coordinates to obtain additional camera-phone data correspondences. This allows automatic data collection during inference. Results show that after fine-tuning the GAN model on the expanded dataset, localization accuracy is further improved by up to 26%.
This paper proposes WiFiTuned, a multi-modal, non-intrusive, and privacy-preserving system for monitoring engagement in online participation, i.e., meetings, classes, and seminars. It uses two sensing modalities: WiFi CSI and audio. WiFiTuned detects the head movements of participants during online participation through WiFi CSI and detects the speaker’s intent through audio. It then correlates the two to detect engagement. We evaluate WiFiTuned with 22 participants and observe that it can detect the engagement level with an average accuracy of more than .
SESSION: Blue Sky Papers
Traditional data processing uses the machine as a passive feature detector or classifier for a given fixed dataset. However, we contend that this is not how humans understand and process data from the real world. Based on active inference, we propose a neural network model that actively processes the incoming data using predictive processing and actively samples the inputs from the environment that conform to its internal representations. The model we adopt is the Helmholtz machine, a perfect parallel for the hierarchical model of the brain and the forward-backward connections of the cortex, thus making available a biologically plausible implementation of brain functions such as predictive processing, hierarchical message passing, and predictive coding in a machine-learning context. In addition, active sampling can be incorporated into the model via the generative end as an interaction of the agent with the external world. The active sampling of the environment directly resorts to environmental salience and cultural niche construction. By studying a coupled multi-agent model of constructing a “desire path” as part of a cultural niche, we find a plausible way of explaining and simulating various problems under group flow, social interactions, shared cultural practices, and thinking through other minds.
From Natural to Non-Natural Interaction: Embracing Interaction Design Beyond the Accepted Convention of Natural
Natural interactions feel intuitive, familiar, and a good match to the task, the user’s abilities, and the context. Consequently, a wealth of scientific research has been conducted on natural interaction with computer systems. Contrary to the conventional mainstream, we advocate for “non-natural interaction design” as a transformative, creative process that results in highly usable and effective interactions by deliberately deviating from users’ expectations and experience of engaging with the physical world. The non-natural approach to interaction design provokes a departure from the established notion of the “natural,” all the while prioritizing usability, albeit against the backdrop of the unconventional, unexpected, and intriguing.
Towards Adaptive User-centered Neuro-symbolic Learning for Multimodal Interaction with Autonomous Systems
Recent advances in deep learning and data-driven approaches have facilitated the perception of objects and their environments in a perceptual subsymbolic manner. Thus, these autonomous systems can now perform object detection, sensor data fusion, and language understanding tasks. However, there is an increasing demand to further enhance these systems to attain a more conceptual and symbolic understanding of objects to acquire the underlying reasoning behind the learned tasks. Achieving this level of powerful artificial intelligence necessitates considering both explicit teachings provided by humans (e.g., explaining how to act) and implicit teaching obtained through observing human behavior (e.g., through system sensors). Hence, it is imperative to incorporate symbolic and subsymbolic learning approaches to support implicit and explicit interaction models. This integration enables the system to achieve multimodal input and output capabilities. In this Blue Sky paper, we argue for considering these input types, along with human-in-the-loop and incremental learning techniques, to advance the field of artificial intelligence and enable autonomous systems to emulate human learning. We propose several hypotheses and design guidelines aimed at achieving this objective.
SESSION: Doctoral Consortium
With the increasing availability of multimodal data, especially in the sports and medical domains, there is growing interest in developing Artificial Intelligence (AI) models capable of comprehending the world in a more holistic manner. Nevertheless, various challenges exist in multimodal understanding, including the integration of multiple modalities and the resolution of semantic gaps between them. The proposed research aims to leverage multiple input modalities for the multimodal understanding of AI models, enhancing their reasoning, generation, and intelligent behavior. The research objectives focus on developing novel methods for multimodal AI, integrating them into conversational agents with optimizations for domain-specific requirements. The research methodology encompasses literature review, data curation, model development and implementation, evaluation and performance analysis, domain-specific applications, and documentation and reporting. Ethical considerations will be thoroughly addressed, and a comprehensive research plan is outlined to provide guidance. The research contributes to the field of multimodal AI understanding and the advancement of sophisticated AI systems by experimenting with multimodal data to enhance the performance of state-of-the-art neural networks.
Come Fl.. Run with Me: Understanding the Utilization of Drones to Support Recreational Runners’ Well Being
The utilization of drones to assist runners in real-time and post-run remains a promising yet unexplored field within human-drone interaction (HDI). Hence, in my doctoral research, I aim to delve into the concepts and relationships surrounding drones in the context of running, rather than focusing solely on one specific application. I plan to accomplish this through a three-stage research plan: 1) investigate the feasibility of drones to support outdoor running research, 2) empathize with runners to assess their preferences and experiences of running with drones, and 3) implement and test an interactive running-with-drone scenario. Each stage has specific objectives and research questions aimed at providing valuable insights into the utilization of drones to support runners. This paper outlines the work conducted during my Ph.D. research along with future plans, with the goal of advancing knowledge in the field of runner-drone interaction.
The process of “conversational grounding” is an interactive process that has been studied extensively in cognitive science, whereby participants in a conversation check to make sure their interlocutors understand what is being referred to. This interactive process uses multiple modes of communication to establish the information between the participants. These can include information provided through eye-gaze, head movements, and intonation in speech, along with the content of the speech. While the process is essential to successful communication between humans and between humans and machines, work needs to be done on testing and building the capabilities of current dialogue systems in managing conversational grounding, especially in a multimodal medium of communication. Recent work, such as that of Benotti and Blackburn, has shown the importance of conversational grounding in dialog systems and how current systems fail at it, a capability that is essential for the advancement of Embodied Conversational Agents and Social Robots. Thus, my Ph.D. project aims to test, understand, and improve the functioning of current dialog models with respect to conversational grounding.
Predicting the future trajectory of a crowd is important for safety to prevent disasters such as stampedes or collisions. Extensive research has been conducted to explore trajectory prediction in typical crowd scenarios, where the majority of individuals can be easily identified. However, this study focuses on a more challenging scenario known as the super-crowd scene, wherein individuals within the crowd can only be annotated based on their heads. In this particular scenario, people’s re-identification process in tracking does not perform well due to a lack of clear image data. Our research proposes a clustering strategy to overcome people re-identification problems and predict the cluster crowd trajectory. Two-dimensional (2D) maps and multiple cameras will be used to capture full pictures of crowds in a location and extract the venue’s spatial data (see figure 1). The research methodology encompasses several key steps, including evaluating data extraction of the state-of-the-art methods, estimating crowd clusters, integrating 2D maps and multi-view fusion, and evaluating the proposed method on a dataset of multi-view videos collected in a real-world super-crowded scenario.
Surgery, typically seen as the surgeon’s sole responsibility, requires a broader perspective acknowledging the vital roles of other operating room (OR) personnel. The interactions among team members are crucial for delivering quality care and depend on shared situation awareness. I propose a two-phase approach to design and evaluate a multimodal platform that monitors OR members, offering insights into surgical procedures. The first phase focuses on designing a data-collection platform, tailored to surgical constraints, to generate novel collaboration and situation-awareness metrics using synchronous recordings of the participants’ voices, positions, orientations, electrocardiograms, and respiration signals. The second phase concerns the creation of intuitive dashboards and visualizations, aiding surgeons in reviewing recorded surgery, identifying adverse events and contributing to proactive measures. This work aims to demonstrate an innovative approach to data collection and analysis, augmenting the surgical team’s capabilities. The multimodal platform has the potential to enhance collaboration, foster situation awareness, and ultimately mitigate surgical adverse events. This research sets the stage for a transformative shift in the OR, enabling a more holistic and inclusive perspective that recognizes that surgery is a team effort.
Depression is a severe mental illness that not only affects the patient but also has major social and economic implications. Recent studies have employed artificial intelligence using multimodal behavioural cues to objectively investigate depression and alleviate the subjectivity involved in the current depression diagnostic process. However, head motion has received fairly limited attention as a behavioural marker for detecting depression, and the lack of explainability of “black box” approaches has restricted their widespread adoption. Consequently, the objective of this research is to examine the utility of fundamental head-motion units termed kinemes and to explore the explainability of multimodal behavioural cues for depression detection. To this end, the research to date has evaluated depression classification performance on the BlackDog and AVEC2013 datasets using multiple machine learning methods. Our findings indicate that: (a) head motion patterns are effective cues for depression assessment, and (b) explanatory kineme patterns can be observed for the two classes, consistent with prior research.
Artificial Neural Networks (ANNs) are computer models loosely inspired by the functioning of the human brain. They are the state-of-the-art method for tackling a variety of Artificial Intelligence (AI) problems, and an increasingly popular tool in neuroscientific studies. However, both domains pursue different goals: in AI, performance is key and brain resemblance is incidental, while in neuroscience the aim is chiefly to better understand the brain. This PhD is situated at the intersection of both disciplines. Its goal is to develop ANNs that model social cognition in neurotypical individuals, and that can be altered in a controlled way to exhibit behavior consistent with individuals with one of two clinical conditions, Autism Spectrum Disorder and Frontotemporal Dementia.
Pair programming is a collaborative technique which has proven highly beneficial in terms of the code produced and the learning gains for programmers. With recent advances in Programming Language Processing (PLP), numerous tools have been created that assist programmers in non-collaborative settings (i.e., where the technology provides users with a solution, instead of discussing the problem to develop a solution together). How can we develop AI that can assist in pair programming, a collaborative setting? To tackle this task, we begin by gathering multimodal dialogue data which can be used to train systems in a basic subtask of dialogue understanding: multimodal reference resolution, i.e., understanding which parts of a program are being mentioned by users through speech or by using the mouse and keyboard.
Adherence to a rehabilitation programme is vital to recovering from injury; failing to adhere can keep a promising athlete off the field permanently. Although the importance of following their home exercise programme (HEP) is broadly explained to patients by their physicians, few of them actually complete it correctly. In my PhD research, I focus on factors that could help increase engagement in home exercise programmes for patients recovering from knee injuries using VR and wearable sensors. This will be done through the gamification of the rehabilitation process, designing the system with a user-centered design approach to test different interactions that could affect the engagement of the users.
SESSION: Grand Challenges: Emotion Recognition in the Wild Challenge (EmotiW23)
Audio-Visual Group-based Emotion Recognition using Local and Global Feature Aggregation based Multi-Task Learning
Audio-video group emotion recognition is a challenging task that has attracted increasing attention in recent decades. Recently, deep learning models have shown tremendous advances in analyzing human emotion. However, the task remains difficult: it is hard to gather a broad range of potential information to obtain meaningful emotional representations, and hard to associate implicit contextual knowledge as humans do. To tackle these problems, we propose the Local and Global Feature Aggregation based Multi-Task Learning (LGFAM) method for the Group Emotion Recognition problem. The framework consists of three parallel feature extraction networks that were verified in previous work. An attention network using an MLP as a backbone, with specially designed loss functions, is then used to fuse features from different modalities. In the experiment section, we present its performance on the EmotiW2023 Audio-Visual Group-based Emotion Recognition subchallenge, which aims to classify a video into one of three emotions. According to the feedback results, the best result achieved 70.63 WAR and 70.38 UAR on the test set. These results demonstrate the effectiveness of our method.
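For readers unfamiliar with the two reported metrics, the sketch below computes them under their usual definitions: unweighted average recall (UAR) is the plain mean of per-class recalls, while weighted average recall (WAR) weights each class recall by its support, which makes it equal to overall accuracy. The class labels and predictions are invented for illustration, and the challenge's own evaluation script is not shown in the abstract, so this should be read as an assumption about how the scores are defined.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def per_class_recall(y_true: Sequence[Hashable],
                     y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Recall for every class that occurs in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in support}


def uar(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted average recall: every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def war(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Weighted average recall: class recalls weighted by class support."""
    recalls = per_class_recall(y_true, y_pred)
    support = Counter(y_true)
    return sum(recalls[c] * support[c] for c in support) / len(y_true)


# Hypothetical, imbalanced three-class example.
y_true = ["pos"] * 6 + ["neu"] * 2 + ["neg"] * 2
y_pred = ["pos"] * 6 + ["neu", "pos"] + ["pos", "pos"]

uar_score = uar(y_true, y_pred)  # mean of recalls 1.0, 0.5 and 0.0
war_score = war(y_true, y_pred)  # 7 of 10 samples are correct
```

On imbalanced data the two scores diverge, as here, because UAR penalises poor recall on small classes more heavily than WAR does.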
This paper describes the 9th Emotion Recognition in the Wild (EmotiW) challenge, which is being run as a grand challenge at the 25th ACM International Conference on Multimodal Interaction 2023. The EmotiW challenge focuses on affect-related benchmarking tasks and comprises two sub-challenges: a) User Engagement Prediction in the Wild, and b) Audio-Visual Group-based Emotion Recognition. The purpose of this challenge is to provide a common platform for researchers from diverse domains. The objective is to promote the development and assessment of methods that can predict engagement levels and/or identify the perceived emotional well-being of a group of individuals in real-world circumstances. We describe the datasets, the challenge protocols, and the accompanying sub-challenges.
This paper explores privacy-compliant group-level emotion recognition “in-the-wild” within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields including social robotics, conversational agents, e-coaching and learning analytics. This research restricts itself to global features and avoids individual ones, i.e. all features that can be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of a video branch and an audio branch with cross-attention between modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feeds them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset to increase the sensitivity of our model to facial expressions within the image in a data-driven way. The extensive experiments show the significance of our methodology. Our privacy-compliant proposal performs fairly on the EmotiW challenge, with accuracies of 79.24% and 75.13% on the validation and test sets, respectively, for the best models. Notably, our findings highlight that it is possible to reach this accuracy level with privacy-compliant features using only 5 frames uniformly distributed over the video.
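As an illustration of what selecting 5 frames uniformly distributed over a video can look like, the sketch below picks the centre frame of each of k equal-length segments. The abstract does not specify the exact sampling rule, so the segment-centre choice and the function name are assumptions.

```python
from typing import List


def uniform_frame_indices(num_frames: int, k: int = 5) -> List[int]:
    """Return k frame indices spread evenly over a video of num_frames
    frames, taking the centre of each of k equal segments (assumed rule)."""
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    k = min(k, num_frames)  # a short video yields fewer indices
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


indices = uniform_frame_indices(100, k=5)     # 100-frame clip
short_clip = uniform_frame_indices(3, k=5)    # clip shorter than k frames
```

Sampling segment centres avoids always selecting the very first and last frames, which in practice are often fades or cuts; sampling the endpoints instead would be an equally valid reading of "uniformly distributed".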
SESSION: Grand Challenges: The GENEA Challenge 2023: Full-Body Speech-Driven Gesture Generation in a Dyadic Setting
This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of learning a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically-aware co-speech gesture generation. Our entry achieved the highest human-likeness and the highest speech appropriateness ratings among the submitted entries. This indicates that our system is a promising approach to achieving human-like co-speech gestures that carry semantic meaning in agents.
Human communication relies on multiple modalities such as verbal expressions, facial cues, and bodily gestures. Developing computational approaches to process and generate these multimodal signals is critical for seamless human-agent interaction. A particular challenge is the generation of co-speech gestures due to the large variability and number of gestures that can accompany a verbal utterance, leading to a one-to-many mapping problem. This paper presents an approach based on a Feature Extraction Infusion Network (FEIN-Z) that adopts insights from robot imitation learning and applies them to co-speech gesture generation. Building on the BC-Z architecture, our framework combines transformer architectures and Wasserstein generative adversarial networks. We describe the FEIN-Z methodology and evaluation results obtained within the GENEA Challenge 2023, demonstrating good results and significant improvements in human-likeness over the GENEA baseline. We discuss potential areas for improvement, such as refining input segmentation, employing more fine-grained control networks, and exploring alternative inference methods.
This paper presents the CASIA-GO entry to the Generation and Evaluation of Non-verbal Behaviour for Embodied Agents (GENEA) Challenge 2023. The system was originally designed for few-shot scenarios such as generating gestures in the style of any in-the-wild target speaker from short speech samples. Given a group of reference speech data including gesture sequences, audio, and text, it first constructs a gesture motion graph that describes the soft gesture units and interframe continuity inside the speech, which is ready to be used for new rhythmic and semantic gesture reenactment by pathfinding when test audio and text are provided. We randomly choose one clip from the training data for each test clip to simulate a few-shot scenario and provide compatible results for subjective evaluations. Despite the 0.25% average utilization of the whole training set for each clip in the test set and the 17.5% total utilization of the training set for the whole test set, the system succeeds in providing valid results and ranks in the top 1/3 in the appropriateness for agent speech evaluation.
In this paper, we introduce the DiffuseStyleGesture+, our solution for the Generation and Evaluation of Non-verbal Behavior for Embodied Agents (GENEA) Challenge 2023, which aims to foster the development of realistic, automated systems for generating conversational gestures. Participants are provided with a pre-processed dataset and their systems are evaluated through crowdsourced scoring. Our proposed model, DiffuseStyleGesture+, leverages a diffusion model to generate gestures automatically. It incorporates a variety of modalities, including audio, text, speaker ID, and seed gestures. These diverse modalities are mapped to a hidden space and processed by a modified diffusion model to produce the corresponding gesture for a given speech input. Upon evaluation, the DiffuseStyleGesture+ demonstrated performance on par with the top-tier models in the challenge, showing no significant differences with those models in human-likeness, appropriateness for the interlocutor, and achieving competitive performance with the best model on appropriateness for agent speech. This indicates that our model is competitive and effective in generating realistic and appropriate gestures for given speech. The code, pre-trained models, and demos are available at this URL.
This paper describes FineMotion’s entry to the GENEA Challenge 2023. We explore the potential of DeepPhase embeddings by adapting neural motion controllers to conversational gesture generation. This is achieved by introducing a recurrent encoder for control features. We additionally use VQ-VAE codebook encoding of gestures to support dyadic setup. The resulting system generates stable realistic motion controllable by audio, text and interlocutor’s motion.
The GENEA Challenge 2023: A large-scale evaluation of gesture generation models in monadic and dyadic settings
This paper reports on the GENEA Challenge 2023, in which participating teams built speech-driven gesture-generation systems using the same speech and motion dataset, followed by a joint evaluation. This year’s challenge provided data on both sides of a dyadic interaction, allowing teams to generate full-body motion for an agent given its speech (text and audio) and the speech and motion of the interlocutor. We evaluated 12 submissions and 2 baselines together with held-out motion-capture data in several large-scale user studies. The studies focused on three aspects: 1) the human-likeness of the motion, 2) the appropriateness of the motion for the agent’s own speech whilst controlling for the human-likeness of the motion, and 3) the appropriateness of the motion for the behaviour of the interlocutor in the interaction, using a setup that controls for both the human-likeness of the motion and the agent’s own speech. We found a large span in human-likeness between challenge submissions, with a few systems rated close to human mocap. Appropriateness seems far from being solved, with most submissions performing in a narrow range slightly above chance, far behind natural motion. The effect of the interlocutor is even more subtle, with submitted systems at best performing barely above chance. Interestingly, a dyadic system being highly appropriate for agent speech does not necessarily imply high appropriateness for the interlocutor. Additional material is available via the project website at svito-zar.github.io/GENEAchallenge2023/.
This paper describes our entry to the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. This year’s challenge focuses on generating gestures in a dyadic setting – predicting a main-agent’s motion from the speech of both the main-agent and an interlocutor. We adapt a Transformer-XL architecture for this task by adding a cross-attention module that integrates the interlocutor’s speech with that of the main-agent. Our model is conditioned on speech audio (encoded using PASE+), text (encoded using FastText) and a speaker identity label, and is able to generate smooth and speech appropriate gestures for a given identity. We consider the GENEA Challenge user study results and present a discussion of our model strengths and where improvements can be made.
SESSION: Workshop Summaries
Analysing and understanding child behaviour is a topic of great scientific interest across a wide range of disciplines, including social sciences and artificial intelligence (AI). Knowledge in these diverse fields is not yet integrated to its full potential. The aim of this workshop is to bring researchers from these fields together. The first three workshops had a significant impact. In this workshop, we discussed topics such as the use of AI techniques to better examine and model interactions and children’s emotional development, and the analysis of head movement patterns with respect to child age. The 2023 edition of the workshop is a successful new step towards the objective of bridging social sciences and AI, attracting contributions from various academic fields on child behaviour analysis. We see that atypical child development holds an important place in child behaviour research. While gaze and joint attention are popularly studied in the visual domain, speech and physiological signals of atypically developing children are shown to provide valuable cues, motivating future work. This document summarizes the WoCBU’23 workshop, including the review process, keynote talks and the accepted papers.
“Aesthetic experience” corresponds to the inner state of a person exposed to the form and content of artistic objects. Quantifying and interpreting the aesthetic experience of people in various contexts contribute towards a) creating context, and b) better understanding people’s affective reactions to aesthetic stimuli. Focusing on different types of artistic content, such as movie, music, literature, urban art, ancient artwork, and modern interactive technology, the 4th international workshop on Multimodal Affect and Aesthetic Experience (MAAE) aims to enhance interdisciplinary collaboration among researchers from affective computing, aesthetics, human-robot/computer interaction, digital archaeology and art, culture, ethics, and addictive games.
This workshop discusses how interactive, multimodal technology, such as virtual agents, can measure and train social-affective interactions. Sensing technology now enables analyzing users’ behaviors and physiological signals. Various signal processing and machine learning methods can be used for prediction tasks. Such social signal processing and tools can be applied to measure and reduce social stress in everyday situations, including public speaking at schools and workplaces.
The ACE – how Artificial Character Embodiment shapes user behavior in multi-modal interactions – workshop aims to bring together researchers, practitioners and experts on the topic of embodiment, to analyze and foster discussion on its effects on user behavior in multi-modal interaction. ACE is aimed at stimulating multidisciplinary discussions on the topic, sharing recent progress, and providing participants with a forum to debate current and future challenges. The workshop includes contributions from computational, neuroscientific and psychological perspectives, as well as technical applications.
Pain communication varies, with some patients being highly expressive regarding their pain and others exhibiting stoic forbearance and minimal verbal account of discomfort. Considerable progress has been made in defining behavioral indices of pain [1-3]. An abundant literature shows that a limited subset of facial movements, in several non-human species, encode pain intensity across the lifespan. To advance reliable pain monitoring, automated assessment of pain is emerging as a powerful means to realize that goal. Though progress has been made, this field remains in its infancy. The workshop aims to promote current research and support the growth of interdisciplinary collaborations to advance this groundbreaking research.
GENEA Workshop 2023: The 4th Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents
Non-verbal behavior is advantageous for embodied agents when interacting with humans. Despite many years of research on the generation of non-verbal behavior, there is no established benchmarking practice in the field. Most researchers do not compare their results to prior work, and if they do, they often do so in a manner that is not compatible with other approaches. The GENEA Workshop 2023 seeks to bring the community together to discuss the major challenges and solutions, and to identify the best ways to progress the field.
Neurodevelopmental Disorders (NDD) involve developmental deficits in cognition, social interaction, and communication. Despite growing interest in conversational agents for therapeutic interventions in NDD, gaps persist in understanding the usability, effectiveness, and perceptions of such agents. We organize a workshop focusing on the use of conversational agents with multi-modal capabilities for these interventions. The workshop brings together researchers and practitioners to discuss design, evaluation, and ethical considerations. Anticipated outcomes include identifying challenges, sharing advancements, fostering collaboration, and charting future research directions.
In the rapidly evolving landscape of education, the integration of technology and innovative pedagogical approaches has become imperative to engage learners effectively. Our workshop aimed to delve into the intersection of technology, cognitive psychology, and educational theory to explore the potential of multimodal interfaces in transforming the learning experience for both regular and special education. The interdisciplinary workshop brought together experts from the fields of human-computer interaction, education, cognitive science, and computer science. To inform participants’ discussions, the program included 3 keynotes from experts in the field, 6 presentations of accepted short papers from participants, and 6 on-site demos of relevant projects. The high-level content covered is intended to help shape future work in this area.
The 5th Workshop on Modeling Socio-Emotional and Cognitive Processes from Multimodal Data in the Wild (MSECP-Wild)
The ability to automatically infer relevant aspects of human users’ thoughts and feelings is crucial for technologies to intelligently adapt their behaviors in complex interactions. Research on multimodal analysis has demonstrated the potential of technology to provide such estimates for a broad range of internal states and processes. However, constructing robust approaches for deployment in real-world applications remains an open problem. The MSECP-Wild workshop series is a multidisciplinary forum to present and discuss research addressing this challenge. Submissions to this 5th iteration span efforts relevant to multimodal data collection, modeling, and applications. In addition, our workshop program builds on discussions emerging in previous iterations, highlighting ethical considerations when building and deploying technology modeling internal states in the wild. For this purpose, we host a range of relevant keynote speakers and interactive activities.