ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction


SESSION: Keynote Talks

From Hands to Brains: How Does Human Body Talk, Think and Interact in Face-to-Face Language Use?

  • Asli Özyürek

Most research on language has focused on spoken and written language only. However, when we use language in face-to-face interaction, we use not only speech but also bodily actions, such as hand gestures, in meaningful ways to communicate our messages and in ways closely linked to the spoken aspects of our language. For example, we can enhance or complement our speech with a drinking gesture, a so-called iconic gesture, as we say 'we stayed up late last night'. In this talk I will summarize research that investigates how such meaningful bodily actions are recruited in using language as a dynamic, adaptive, and flexible system, and how gestures interact with speech during production and comprehension of language at the behavioral, cognitive, and neural levels. The first part of the lecture will focus on how gestures are linked to the language production system even though they have a very different representational format (i.e., iconic and analogue) than speech (arbitrary, discrete, and categorical) [1], and how they express communicative intentions during language use [2]. In doing so I will show the different ways gestures are linked to speech in different languages and in different communicative contexts, as well as in bilinguals and language learners. The second part of the talk will focus on how gestures influence and enhance language comprehension by reducing the ambiguity of the communicative signal [3] and providing kinematic cues to the communicative intentions of the speaker [4], and on the neural correlates that underlie gesture's role in language comprehension [5]. In the final part of the talk I will show how gestures facilitate mutual understanding, that is, alignment between interactants in dialogue. Overall I will claim that a complete understanding of the role language plays in our cognition and communication is not possible without a multimodal approach.

Musical Multimodal Interaction: From Bodies to Ecologies

  • Atau Tanaka

Musical performance can be thought of in multimodal terms - physical interaction with musical instruments produces sound output, often while the performer is visually reading a score. Digital Musical Instrument (DMI) design merges tenets of HCI and musical instrument practice. Audiovisual performance and other forms of multimedia might benefit from multimodal thinking. This keynote revisits two decades of interactive music practice that has paralleled the development of the field of multimodal interaction research. The BioMuse was an early digital musical instrument system using EMG muscle sensing that was extended by a second mode of sensing, allowing effort and position to be two complementary modalities [1]. The Haptic Wave applied principles of cross-modal information display to create a haptic audio editor enabling visually impaired audio producers to 'feel' audio waveforms they could not see in a graphical user interface [2]. VJ culture extends the idea of music DJs to create audiovisual cultural experiences. AVUIs were a set of creative coding tools that enabled the convergence of performance UI and creative visual output [3]. The Orchestra of Rocks is a continuing collaboration with visual artist Uta Kogelsberger that has manifested itself through physical and virtual forms - allowing multimodality over time [4]. Be it a physical exhibition in a gallery or audio reactive 3D animation on YouTube 360, the multiple modes in which an artwork is articulated support its original conceptual foundations. These four projects situate multimodal interaction at the heart of artistic research.

Human-centered Multimodal Machine Intelligence

  • Shrikanth (Shri) Narayanan

Multimodal machine intelligence offers enormous possibilities for helping understand the human condition and for creating technologies to support and enhance human experiences [1, 2]. What makes such approaches and systems exciting is the promise they hold for adaptation and personalization in the presence of the rich and vast inherent heterogeneity, variety and diversity within and across people. Multimodal engineering approaches can help analyze human trait (e.g., age), state (e.g., emotion), and behavior dynamics (e.g., interaction synchrony) objectively, and at scale. Machine intelligence could also help detect and analyze deviation in patterns from what is deemed typical. These techniques in turn can assist, facilitate or enhance decision making by humans, and by autonomous systems. Realizing such a promise requires addressing two major lines of oft-intertwined challenges: creating inclusive technologies that work for everyone while enabling tools that can illuminate the source of variability or difference of interest.

This talk will highlight some of these possibilities and opportunities through examples drawn from two specific domains. The first relates to advancing health informatics in behavioral and mental health [3, 4]. With over 10% of the world's population affected, and with clinical research and practice heavily dependent on (relatively scarce) human expertise in diagnosing, managing and treating these conditions, the engineering opportunities in offering access and tools to support care at scale are immense. For example, in determining whether a child is on the Autism spectrum, a clinician would engage and observe the child in a series of interactive activities, targeting relevant cognitive, communicative and socio-emotional aspects, and codify specific patterns of interest (e.g., typicality of vocal intonation, facial expressions, joint attention behavior). Machine-intelligence-driven processing of speech, language, visual and physiological data, combined with other forms of clinical data, enables novel and objective ways of supporting and scaling up these diagnostics. Likewise, multimodal systems can automate the analysis of a psychotherapy session, including computing treatment quality-assurance measures (e.g., rating a therapist's expressed empathy). These technology possibilities can go beyond the traditional realm of clinics, directly to patients in their natural settings. For example, remote multimodal sensing of biobehavioral cues can enable new ways of screening and tracking behaviors (e.g., stress in the workplace) and progress in treatment (e.g., for depression), and of offering just-in-time support.

The second example is drawn from the world of media. Media are created by humans, for humans, to tell stories. They cover an amazing range of domains, from the arts and entertainment to news, education, and commerce, and in staggering volume. Machine intelligence tools can help analyze media and measure their impact on individuals and society. This includes offering objective insights into diversity and inclusion in media representations by robustly characterizing media portrayals from an intersectional perspective along relevant dimensions of inclusion: gender, race, age, ability and other attributes, and creating tools to support change [5, 6]. Again, this underscores the twin technology requirements: to perform equally well in characterizing individuals regardless of the dimensions of variability, and to use those inclusive technologies to shine light on and create tools to support diversity and inclusion.

SESSION: Long Papers

A Multi-modal System to Assess Cognition in Children from their Physical Movements

  • Ashwin Ramesh Babu
  • Mohammad Zaki Zadeh
  • Ashish Jaiswal
  • Alexis Lueckenhoff
  • Maria Kyrarini
  • Fillia Makedon

In recent years, computer and game-based cognitive tests have become popular with the advancement of mobile technology. However, these tests require very little body movement and do not consider the influence that physical motion has on cognitive development. Our work focuses mainly on assessing cognition in children through their physical movements. Hence, an assessment test, "Ball-Drop-to-the-Beat", that is both physically and cognitively demanding has been used, in which the child is expected to perform certain actions in response to commands. The task is specifically designed to measure attention, response inhibition, and coordination in children. A dataset has been created with 25 children performing this test. To automate the scoring, a computer vision-based assessment system has been developed. The vision system employs an attention-based fusion mechanism to combine multiple modalities such as optical flow, human poses, and objects in the scene to predict a child's action. The proposed method outperforms other state-of-the-art approaches, achieving an average accuracy of 89.8 percent on predicting the actions and an average accuracy of 88.5 percent on predicting the rhythm on the Ball-Drop-to-the-Beat dataset.
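The attention-based fusion mechanism mentioned above can be illustrated with a minimal sketch: each modality (e.g., optical flow, pose, detected objects) is assumed to be already encoded as a fixed-length feature vector, and a scoring vector weights the modalities before they are summed. All names, shapes, and the scoring function here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fusion(features, w):
    """Fuse per-modality feature vectors with attention weights.

    features: list of (d,) arrays, one per modality
    (e.g. optical flow, pose, scene objects); w: (d,) scoring vector.
    Returns a single (d,) fused representation.
    """
    F = np.stack(features)   # (m, d): one row per modality
    scores = F @ w           # one relevance score per modality
    alpha = softmax(scores)  # attention weights, sum to 1
    return alpha @ F         # weighted sum over modalities

rng = np.random.default_rng(0)
d = 8
feats = [rng.normal(size=d) for _ in range(3)]  # 3 hypothetical modalities
fused = attention_fusion(feats, rng.normal(size=d))
print(fused.shape)  # (8,)
```

In a trained system the scoring vector would be learned jointly with the downstream action classifier, so the network can emphasize whichever modality is most informative for a given frame.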

A Neural Architecture for Detecting User Confusion in Eye-tracking Data

  • Shane D. Sims
  • Cristina Conati

Encouraged by the success of deep learning in a variety of domains, we investigate the effectiveness of a novel application of such methods for detecting user confusion with eye-tracking data. We introduce an architecture that uses RNN and CNN sub-models in parallel, to take advantage of the temporal and visuospatial aspects of our data. Experiments with a dataset of user interactions with the ValueChart visualization tool show that our model outperforms an existing model based on a Random Forest classifier, resulting in a 22% improvement in combined confused & not confused class accuracies.

Analysis of Face-Touching Behavior in Large Scale Social Interaction Dataset

  • Cigdem Beyan
  • Matteo Bustreo
  • Muhammad Shahid
  • Gian Luca Bailo
  • Nicolo Carissimi
  • Alessio Del Bue

We present the first publicly available annotations for the analysis of face-touching behavior. These annotations are for a dataset composed of audio-visual recordings of small group social interactions, with a total of 64 videos, each lasting between 12 and 30 minutes and showing a single person participating in four-person meetings. The annotations were performed by a total of 16 annotators with an almost perfect average agreement (Cohen's Kappa=0.89). In total, 74K and 2M video frames were labelled as face-touch and no-face-touch, respectively. Given the dataset and the collected annotations, we also present an extensive evaluation of several methods for face-touching detection: rule-based, supervised learning with hand-crafted features, and feature learning and inference with a Convolutional Neural Network (CNN). Our evaluation indicates that the CNN performed best, reaching an 83.76% F1-score and a 0.84 Matthews Correlation Coefficient. To foster future research on this problem, the code and dataset were made publicly available, providing all video frames, face-touch annotations, body pose estimations including face and hand key-point detections, face bounding boxes, as well as the implemented baseline methods and the cross-validation splits used for training and evaluating our models.

Attention Sensing through Multimodal User Modeling in an Augmented Reality Guessing Game

  • Felix Putze
  • Dennis Küster
  • Timo Urban
  • Alexander Zastrow
  • Marvin Kampen

We developed an attention-sensitive system that is capable of playing the children's guessing game "I spy with my little eye" with a human user. In this game, the user selects an object from a given scene and provides the system with a single-sentence clue about it. For each trial, the system tries to guess the target object. Our approach combines top-down and bottom-up machine learning for object and color detection, automatic speech recognition, natural language processing, a semantic database, eye tracking, and augmented reality. Our evaluation demonstrates performance significantly above chance level, and results for most of the individual machine learning components are encouraging. Participants reported very high levels of satisfaction and curiosity about the system. The collected data shows that our guessing game generates a complex and rich data set. We discuss the capabilities and challenges of our system and its components with respect to multimodal attention sensing.

BreathEasy: Assessing Respiratory Diseases Using Mobile Multimodal Sensors

  • Md Mahbubur Rahman
  • Mohsin Yusuf Ahmed
  • Tousif Ahmed
  • Bashima Islam
  • Viswam Nathan
  • Korosh Vatanparvar
  • Ebrahim Nemati
  • Daniel McCaffrey
  • Jilong Kuang
  • Jun Alex Gao

Mobile respiratory assessment using commodity smartphones and smartwatches is an unmet need for patient monitoring at home. In this paper, we show the feasibility of using multimodal sensors embedded in consumer mobile devices for non-invasive, low-effort respiratory assessment. We have conducted studies with 228 chronic respiratory patients and healthy subjects, and show that our model can estimate respiratory rate with a mean absolute error (MAE) of 0.72$\pm$0.62 breaths per minute and differentiate respiratory patients from healthy subjects with 90% recall and 76% precision when the user breathes normally by holding the device on the chest or the abdomen for a minute. Holding the device on the chest or abdomen requires significantly lower effort than traditional spirometry, which requires a specialized device and forceful, vigorous breathing. This paper shows the feasibility of low-effort respiratory assessment, a step towards making it available anywhere, anytime through users' own mobile devices.

Bring the Environment to Life: A Sonification Module for People with Visual Impairments to Improve Situation Awareness

  • Angela Constantinescu
  • Karin Müller
  • Monica Haurilet
  • Vanessa Petrausch
  • Rainer Stiefelhagen

Digital navigation tools for helping people with visual impairments have become increasingly popular in recent years. While conventional navigation solutions give routing instructions to the user, systems such as GoogleMaps, BlindSquare, or Soundscape offer additional information about the surroundings and, thereby, improve the orientation of people with visual impairments. However, these systems only provide information about static environments, while dynamic scenes comprising objects such as bikes, dogs, and persons are not considered. In addition, both the routing and the information about the environment are usually conveyed by speech. We address this gap and implement a mobile system that combines object identification with a sonification interface. Our system can be used in three different scenarios of macro and micro navigation: orientation, obstacle avoidance, and exploration of known and unknown routes. Our proposed system leverages popular computer vision methods to localize 18 static and dynamic object classes in real-time. At the heart of our system is a mixed reality sonification interface which is adaptable to the user's needs and is able to transmit the recognized semantic information to the user. The system was designed following a user-centered approach. An exploratory user study showed that our object-to-sound mapping with auditory icons is intuitive. On average, users perceived our system as useful and indicated that they want to know more about their environment, apart from wayfinding and points of interest.

Combining Auditory and Mid-Air Haptic Feedback for a Light Switch Button

  • Cisem Ozkul
  • David Geerts
  • Isa Rutten

As Mid-Air Haptic (MAH) feedback, which provides a sensation of touch without direct physical contact, is a relatively new technology, research investigating MAH feedback in home usage as well as multi-sensory integration with MAH feedback is still scarce. To address this gap, we propose a possible usage context for MAH feedback, perform an experiment by manipulating auditory and haptic feedback in various physical qualities and suggest possible combinations for positive experiences. Certain sensory combinations led to changes in the emotional responses, as well as the responses regarding utilitarian (e.g. clarity) and perceptual (sensory match) qualities. The results show an added value of MAH feedback when added to sensory compositions, and an increase in the positive experiences induced by MAH length and multimodality.

Depression Severity Assessment for Adolescents at High Risk of Mental Disorders

  • Michal Muszynski
  • Jamie Zelazny
  • Jeffrey M. Girard
  • Louis-Philippe Morency

Recent progress in artificial intelligence has led to the development of automatic behavioral marker recognition, such as facial and vocal expressions. Those automatic tools have enormous potential to support mental health assessment, clinical decision making, and treatment planning. In this paper, we investigate nonverbal behavioral markers of depression severity assessed during semi-structured medical interviews of adolescent patients. The main goal of our research is two-fold: studying a unique population of adolescents at high risk of mental disorders and differentiating mild depression from moderate or severe depression. We aim to explore computationally inferred facial and vocal behavioral responses elicited by three segments of the semi-structured medical interviews: Distress Assessment Questions, Ubiquitous Questions, and Concept Questions. Our experimental methodology reflects best practices for analyzing small, unbalanced datasets of unique patients. Our results show a clear trend, with strongly discriminative behavioral markers from both acoustic and visual modalities. These promising results are likely due to the unique classification task (mild depression vs. moderate and severe depression) and the three types of probing questions.

Detecting Depression in Less Than 10 Seconds: Impact of Speaking Time on Depression Detection Sensitivity

  • Nujud Aloshban
  • Anna Esposito
  • Alessandro Vinciarelli

This article investigates whether it is possible to detect depression using less than 10 seconds of speech. The experiments involved 59 participants (including 29 who had been diagnosed with depression by a professional psychiatrist) and are based on a multimodal approach that jointly models linguistic (what people say) and acoustic (how people say it) aspects of speech using four different strategies for the fusion of multiple data streams. On average, every interview lasted 242.2 seconds, but the results show that 10 seconds or less are sufficient to achieve the same level of recall (roughly 70%) observed when using the entire interview of every participant. In other words, it is possible to maintain the same level of sensitivity (as recall is known in clinical settings) while reducing, by 95% on average, the amount of time required to collect the necessary data.

Did the Children Behave?: Investigating the Relationship Between Attachment Condition and Child Computer Interaction

  • Dong Bach Vo
  • Stephen Brewster
  • Alessandro Vinciarelli

This work investigates the interplay between Child-Computer Interaction and attachment, a psychological construct that accounts for how children perceive their parents to be. In particular, the article makes use of a multimodal approach to test whether children with different attachment conditions tend to use the same interactive system differently. The experiments show that the accuracy in predicting usage behaviour changes, to a statistically significant extent, according to the attachment conditions of the 52 experiment participants (age-range 5 to 9). Such a result suggests that attachment-relevant processes are actually at work when people interact with technology, at least when it comes to children.

Dyadic Speech-based Affect Recognition using DAMI-P2C Parent-child Multimodal Interaction Dataset

  • Huili Chen
  • Yue Zhang
  • Felix Weninger
  • Rosalind Picard
  • Cynthia Breazeal
  • Hae Won Park

Automatic speech-based affect recognition of individuals in dyadic conversation is a challenging task, in part because of its heavy reliance on manual pre-processing. Traditional approaches frequently require hand-crafted speech features and segmentation of speaker turns. In this work, we design end-to-end deep learning methods to recognize each person's affective expression in an audio stream with two speakers, automatically discovering features and time regions relevant to the target speaker's affect. We integrate a local attention mechanism into the end-to-end architecture and compare the performance of three attention implementations - one mean pooling and two weighted pooling methods. Our results show that the proposed weighted-pooling attention solutions are able to learn to focus on the regions containing the target speaker's affective information and successfully extract the individual's valence and arousal intensity. Here we introduce and use a "dyadic affect in multimodal interaction - parent to child" (DAMI-P2C) dataset collected in a study of 34 families, in which a parent and a child (3-7 years old) engage in reading storybooks together. In contrast to existing public datasets for affect recognition, each instance for both speakers in the DAMI-P2C dataset is annotated for perceived affect by three labelers. To encourage more research on the challenging task of multi-speaker affect sensing, we make the annotated DAMI-P2C dataset publicly available, including acoustic features of the dyads' raw audio, affect annotations, and a diverse set of developmental, social, and demographic profiles for each dyad.
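The contrast between mean pooling and weighted (attention) pooling over time frames can be sketched as follows; the frame features, the scoring vector, and all dimensions are illustrative assumptions rather than the paper's architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mean_pool(H):
    """Baseline: average frame-level features uniformly over time."""
    return H.mean(axis=0)

def attention_pool(H, v):
    """Weighted pooling: score each frame, softmax-normalize, and take
    the weighted sum, so frames carrying the target speaker's affect
    can dominate the pooled summary vector."""
    alpha = softmax(H @ v)  # (T,) attention weights over frames
    return alpha @ H        # (d,) pooled representation

rng = np.random.default_rng(1)
T, d = 50, 16
H = rng.normal(size=(T, d))  # hypothetical frame features for one stream
v = rng.normal(size=d)       # hypothetical learned scoring vector
print(mean_pool(H).shape, attention_pool(H, v).shape)
```

Both poolings produce a fixed-length vector regardless of clip length; the difference is that attention pooling lets the model down-weight frames where the non-target speaker is talking.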

Early Prediction of Visitor Engagement in Science Museums with Multimodal Learning Analytics

  • Andrew Emerson
  • Nathan Henderson
  • Jonathan Rowe
  • Wookhee Min
  • Seung Lee
  • James Minogue
  • James Lester

Modeling visitor engagement is a key challenge in informal learning environments, such as museums and science centers. Devising predictive models of visitor engagement that accurately forecast salient features of visitor behavior, such as dwell time, holds significant potential for enabling adaptive learning environments and visitor analytics for museums and science centers. In this paper, we introduce a multimodal early prediction approach to modeling visitor engagement with interactive science museum exhibits. We utilize multimodal sensor data including eye gaze, facial expression, posture, and interaction log data captured during visitor interactions with an interactive museum exhibit for environmental science education, to induce predictive models of visitor dwell time. We investigate machine learning techniques (random forest, support vector machine, Lasso regression, gradient boosting trees, and multi-layer perceptron) to induce multimodal predictive models of visitor engagement with data from 85 museum visitors. Results from a series of ablation experiments suggest that incorporating additional modalities into predictive models of visitor engagement improves model accuracy. In addition, the models show improved predictive performance over time, demonstrating that increasingly accurate predictions of visitor dwell time can be achieved as more evidence becomes available from visitor interactions with interactive science museum exhibits. These findings highlight the efficacy of multimodal data for modeling museum exhibit visitor engagement.

Effects of Visual Locomotion and Tactile Stimuli Duration on the Emotional Dimensions of the Cutaneous Rabbit Illusion

  • Mounia Ziat
  • Katherine Chin
  • Roope Raisamo

In this study, we assessed the emotional dimensions (valence, arousal, and dominance) of the multimodal visual-cutaneous rabbit effect. Simultaneously with the tactile bursts on the forearm, visual silhouettes of saltatorial animals (rabbit, kangaroo, spider, grasshopper, frog, and flea) were projected on the left arm. Additionally, there were two locomotion conditions: taking-off and landing. The results showed that the valence dimension (happy-unhappy) was only affected by the visual stimuli, with no effect of the tactile conditions or the locomotion phases. Arousal (excited-calm) showed a significant difference for the three tactile conditions, with an interaction effect with the locomotion condition. Arousal scores were higher when the taking-off condition was associated with the intermediate duration (24 ms) and when the landing condition was associated with either the shortest duration (12 ms) or the longest duration (48 ms). There was no effect for the dominance dimension. Similar to our previous results, the valence dimension seems to be strongly affected by visual information, reducing any effect of tactile information, while touch can modulate the arousal dimension. This can be beneficial for designing multimodal interfaces for virtual or augmented reality.

Eliciting Emotion with Vibrotactile Stimuli Evocative of Real-World Sensations

  • Shaun Alexander Macdonald
  • Stephen Brewster
  • Frank Pollick

This paper describes a novel category of affective vibrotactile stimuli which evoke real-world sensations and details a study into emotional responses to them. The affective properties of short and abstract vibrotactile waveforms have previously been studied and shown to have a narrow emotional range. By contrast this paper investigated emotional responses to longer waveforms and to emotionally resonant vibrotactile stimuli, stimuli which are evocative of real-world sensations such as animal purring or running water. Two studies were conducted. The first recorded emotional responses to Tactons with a duration of 20 seconds. The second investigated emotional responses to novel emotionally resonant stimuli. Stimuli that users found more emotionally resonant were more pleasant, particularly if they had prior emotional connections to the sensation represented. Results suggest that future designers could use emotional resonance to expand the affective response range of vibrotactile cues by utilising stimuli with which users bear an emotional association.

Enhancing Affect Detection in Game-Based Learning Environments with Multimodal Conditional Generative Modeling

  • Nathan Henderson
  • Wookhee Min
  • Jonathan Rowe
  • James Lester

Accurately detecting and responding to student affect is a critical capability for adaptive learning environments. Recent years have seen growing interest in modeling student affect with multimodal sensor data. A key challenge in multimodal affect detection is dealing with data loss due to noisy, missing, or invalid multimodal features. Because multimodal affect detection often requires large quantities of data, data loss can have a strong, adverse impact on affect detector performance. To address this issue, we present a multimodal data imputation framework that utilizes conditional generative models to automatically impute posture and interaction log data from student interactions with a game-based learning environment for emergency medical training. We investigate two generative models, a Conditional Generative Adversarial Network (C-GAN) and a Conditional Variational Autoencoder (C-VAE), that are trained using a modality that has undergone varying levels of artificial data masking. The generative models are conditioned on the corresponding intact modality, enabling the data imputation process to capture the interaction between the concurrent modalities. We examine the effectiveness of the conditional generative models on imputation accuracy and its impact on the performance of affect detection. Each imputation model is evaluated using varying amounts of artificial data masking to determine how the data missingness impacts the performance of each imputation method. Results based on the modalities captured from students' interactions with the game-based learning environment indicate that deep conditional generative models within a multimodal data imputation framework yield significant benefits compared to baseline imputation techniques in terms of both imputation accuracy and affect detector performance.
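A much simpler stand-in for the core idea, conditioning imputation of a masked modality on the concurrent intact modality, is a least-squares map from the intact modality to the masked one; the C-GAN and C-VAE in the paper replace such a linear map with deep conditional generative models. The data below are synthetic and all names and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_pose, d_log = 200, 6, 4

# Synthetic "interaction log" and "posture" features with a linear link.
logs = rng.normal(size=(n, d_log))
true_W = rng.normal(size=(d_log, d_pose))
pose = logs @ true_W + 0.1 * rng.normal(size=(n, d_pose))

# Artificially mask ~30% of the posture rows to simulate data loss.
mask = rng.random(n) < 0.3

# Fit the conditional map on the intact rows only.
W, *_ = np.linalg.lstsq(logs[~mask], pose[~mask], rcond=None)

# Impute the masked posture rows from the concurrent intact modality.
pose_imputed = pose.copy()
pose_imputed[mask] = logs[mask] @ W

# Imputation error on the masked rows (small, since the link is linear).
err = np.abs(pose_imputed[mask] - pose[mask]).mean()
print(round(err, 3))
```

Evaluating the error only on artificially masked rows mirrors the paper's protocol of varying the masking level and measuring imputation accuracy against the held-out ground truth.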

Estimating the Intensity of Facial Expressions Accompanying Feedback Responses in Multiparty Video-Mediated Communication

  • Ryosuke Ueno
  • Yukiko I. Nakano
  • Jie Zeng
  • Fumio Nihei

Providing feedback to a speaker is an essential communication signal for maintaining a conversation. In specific feedback, which indicates the listener's reaction to the speaker's utterances, the facial expression is an effective modality for conveying the listener's reactions. Moreover, not only the type of facial expression but also its degree of intensity may influence the meaning of the specific feedback. In this study, we propose a multimodal deep neural network model that predicts the intensity of facial expressions co-occurring with feedback responses. We focus on multiparty video-mediated communication. In video-mediated communication, close-up frontal face images of each participant are continuously presented on the display; the attention of the participants is more likely to be drawn to the facial expressions. We assume that in such communication, the importance of facial expression in the listeners' feedback responses increases. We collected 33 video-mediated conversations by groups of three people and obtained audio and video data for each participant. Using the collected corpus as a dataset, we created a deep neural network model that predicts the intensity of 17 types of action units (AUs) co-occurring with the feedback responses. The proposed method employs a GRU-based model with an attention mechanism for the audio, visual, and language modalities. A decoder was trained to produce the intensity values for the 17 AUs frame by frame. In the experiment, unimodal and multimodal models were compared in terms of their performance in predicting salient AUs that characterize facial expression in feedback responses. The results suggest that the best-performing models differ depending on the AU category; audio information was useful for predicting AUs that express happiness, while visual and language information contributed to predicting AUs expressing sadness and disgust.

Exploring Personal Memories and Video Content as Context for Facial Behavior in Predictions of Video-Induced Emotions

  • Bernd Dudzik
  • Joost Broekens
  • Mark Neerincx
  • Hayley Hung

Empirical evidence suggests that the emotional meaning of facial behavior in isolation is often ambiguous in real-world conditions. While humans complement interpretations of others' faces with additional reasoning about context, automated approaches rarely display such context-sensitivity. Empirical findings indicate that the personal memories triggered by videos are crucial for predicting viewers' emotional response to such videos - in some cases, even more so than the video's audiovisual content. In this article, we explore the benefits of personal memories as context for facial behavior analysis. We conduct a series of multimodal machine learning experiments combining the automatic analysis of video-viewers' faces with that of two types of context information for affective predictions: (1) self-reported free-text descriptions of triggered memories and (2) a video's audiovisual content. Our results demonstrate that both sources of context provide models with information about variation in viewers' affective responses that complements facial analysis and each other.

Eye-Tracking to Predict User Cognitive Abilities and Performance for User-Adaptive Narrative Visualizations

  • Oswald Barral
  • Sébastien Lallé
  • Grigorii Guz
  • Alireza Iranpour
  • Cristina Conati

We leverage eye-tracking data to predict user performance and levels of cognitive abilities while reading magazine-style narrative visualizations (MSNV), a widespread form of multimodal documents that combine text and visualizations. Such predictions are motivated by recent interest in devising user-adaptive MSNVs that can dynamically adapt to a user's needs. Our results provide evidence for the feasibility of real-time user modeling in MSNV, as we are the first to consider eye tracking data for predicting task comprehension and cognitive abilities while processing multimodal documents. We follow with a discussion on the implications to the design of personalized MSNVs.

Facial Electromyography-based Adaptive Virtual Reality Gaming for Cognitive Training

  • Lorcan Reidy
  • Dennis Chan
  • Charles Nduka
  • Hatice Gunes

Cognitive training has shown promising results for delivering improvements in human cognition related to attention, problem solving, reading comprehension and information retrieval. However, two frequently cited problems in the cognitive training literature are a lack of user engagement with the training programme, and a failure of developed skills to generalise to daily life. This paper introduces a new cognitive training (CT) paradigm designed to address these two limitations by combining the benefits of gamification, virtual reality (VR), and affective adaptation in the development of an engaging, ecologically valid, CT task. Additionally, it incorporates facial electromyography (EMG) as a means of determining user affect while engaged in the CT task. This information is then utilised to dynamically adjust the game's difficulty in real-time as users play, with the aim of leading them into a state of flow. Affect recognition rates of 64.1% and 76.2%, for valence and arousal respectively, were achieved by classifying a DWT-Haar approximation of the input signal using kNN. The affect-aware VR cognitive training intervention was then evaluated with a control group of older adults. The results obtained substantiate the notion that adaptation techniques can lead to greater feelings of competence and a more appropriate challenge of the user's skills.
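The classification pipeline named in the abstract (a Haar-wavelet approximation of the signal fed to kNN) can be sketched generically. This is an illustrative reconstruction, not the authors' implementation: the window contents, decomposition depth, and k are hypothetical.

```python
# Sketch: classify a Haar approximation of a 1-D signal window with kNN.
# Decomposition depth and k are illustrative assumptions, not the paper's settings.

def haar_approximation(signal, levels=2):
    """Repeatedly average adjacent samples (the Haar DWT approximation band)."""
    approx = list(signal)
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx = [(approx[i] + approx[i + 1]) / 2
                  for i in range(0, len(approx) - 1, 2)]
    return approx

def knn_predict(train_feats, train_labels, query, k=3):
    """Majority vote among the k nearest training windows (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(f, query)), lbl)
        for f, lbl in zip(train_feats, train_labels)
    )
    votes = [lbl for _, lbl in dists[:k]]
    return max(set(votes), key=votes.count)

# Toy example: two EMG-like windows per (hypothetical) arousal class.
train = [[0, 0, 0, 0, 0, 0, 0, 0], [0.1, 0, 0.1, 0, 0.1, 0, 0.1, 0],
         [1, 1, 1, 1, 1, 1, 1, 1], [0.9, 1, 0.9, 1, 0.9, 1, 0.9, 1]]
labels = ["low_arousal", "low_arousal", "high_arousal", "high_arousal"]
feats = [haar_approximation(w) for w in train]
query = haar_approximation([0.95, 1, 1, 0.9, 1, 1, 0.95, 1])
print(knn_predict(feats, labels, query))  # "high_arousal"
```

The Haar approximation acts as a cheap low-pass feature extractor, so the kNN distance compares signal envelopes rather than raw samples.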

Facilitating Flexible Force Feedback Design with Feelix

  • Anke van Oosterhout
  • Miguel Bruns
  • Eve Hoggan

In the last decade, haptic actuators have improved in quality and efficiency, enabling easier implementation in user interfaces. One of the next steps towards a mature haptics field is a larger and more diverse toolset that enables designers and novices to explore the design and implementation of haptic feedback in their projects. In this paper, we look at several design projects that utilize haptic force feedback to aid interaction between the user and product. We analysed the process interaction designers went through when developing their haptic user interfaces. Based on our insights, we identified requirements for a haptic force feedback authoring tool. We discuss how these requirements are addressed by 'Feelix', a tool that supports sketching and refinement of haptic force feedback effects.

FeetBack: Augmenting Robotic Telepresence with Haptic Feedback on the Feet

  • Brennan Jones
  • Jens Maiero
  • Alireza Mogharrab
  • Ivan A. Aguliar
  • Ashu Adhikari
  • Bernhard E. Riecke
  • Ernst Kruijff
  • Carman Neustaedter
  • Robert W. Lindeman

Telepresence robots allow people to participate in remote spaces, yet they can be difficult to manoeuvre with people and obstacles around. We designed a haptic-feedback system called "FeetBack," in which users place their feet when driving a telepresence robot. When the robot approaches people or obstacles, haptic proximity and collision feedback are provided on the respective sides of the feet, helping inform users about events that are hard to notice through the robot's camera views. We conducted two studies: one to explore the usage of FeetBack in virtual environments, another focused on real environments. We found that FeetBack can increase spatial presence in simple virtual environments. Users valued the feedback to adjust their behaviour in both types of environments, though it was sometimes too frequent or unneeded for certain situations after a period of time. These results point to the value of foot-based haptic feedback for telepresence robot systems, while also highlighting the need to design context-sensitive haptic feedback.

Fifty Shades of Green: Towards a Robust Measure of Inter-annotator Agreement for Continuous Signals

  • Brandon M. Booth
  • Shrikanth S. Narayanan

Continuous human annotations of complex human experiences are essential for enabling psychological and machine-learned inquiry into the human mind, but establishing a reliable set of annotations for analysis and ground truth generation is difficult. Measures of consensus or agreement are often used to establish the reliability of a collection of annotations and thereby purport their suitability for further research and analysis. This work examines many of the commonly used agreement metrics for continuous-scale and continuous-time human annotations and demonstrates their shortcomings, especially in measuring agreement in general annotation shape and structure. Annotation quality is carefully examined in a controlled study where the true target signal is known and evidence is presented suggesting that annotators' perceptual distortions can be modeled using monotonic functions. A novel measure of agreement is proposed which is agnostic to these perceptual differences between annotators and provides unique information when assessing agreement. We illustrate how this measure complements existing agreement metrics and can serve as a tool for curating a reliable collection of human annotations based on differential consensus.
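The abstract's key observation is that annotators' perceptual distortions can be modeled as monotonic functions, so an agreement measure should ignore such warping. The sketch below is not the paper's proposed measure; it uses rank correlation only to illustrate why a rank-based comparison is blind to any strictly monotonic distortion of an annotation.

```python
# Illustration (not the paper's proposed measure): rank correlation is
# invariant to strictly monotonic distortions of an annotation, which is
# why rank-based agreement can ignore annotators' perceptual warping.
import math

def ranks(xs):
    """Rank positions of each value (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Pearson correlation computed on ranks (Spearman's rho)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

truth = [0.1, 0.4, 0.2, 0.9, 0.6]
# A monotonically distorted annotation of the same target (here: cubed):
warped = [v ** 3 for v in truth]
print(spearman(truth, warped))  # 1.0 despite the nonlinear distortion
```

Because cubing preserves ordering, the ranks are identical and agreement is perfect, even though the raw values diverge substantially.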

FilterJoint: Toward an Understanding of Whole-Body Gesture Articulation

  • Aishat Aloba
  • Julia Woodward
  • Lisa Anthony

Classification accuracy of whole-body gestures can be improved by selecting gestures that have few conflicts (i.e., confusions or misclassifications). To identify such gestures, an understanding of the nuances of how users articulate whole-body gestures can help, especially when conflicts may be due to confusion among seemingly dissimilar gestures. To the best of our knowledge, such an understanding is currently missing in the literature. As a first step to enable this understanding, we designed a method that facilitates investigation of variations in how users move their body parts as they perform a motion. This method, which we call filterJoint, selects the key body parts that are actively moving during the performance of a motion. The paths along which these body parts move in space over time can then be analyzed to make inferences about how users articulate whole-body gestures. We present two case studies to show how the filterJoint method enables a deeper understanding of whole-body gesture articulation, and we highlight implications for the selection of whole-body gesture sets as a result of these insights.

Finally on Par?! Multimodal and Unimodal Interaction for Open Creative Design Tasks in Virtual Reality

  • Chris Zimmerer
  • Erik Wolf
  • Sara Wolf
  • Martin Fischbach
  • Jean-Luc Lugrin
  • Marc Erich Latoschik

Multimodal Interfaces (MMIs) have been considered to provide promising interaction paradigms for Virtual Reality (VR) for some time. However, they are still far less common than unimodal interfaces (UMIs). This paper presents a summative user study comparing an MMI to a typical UMI for a design task in VR. We developed an application targeting creative 3D object manipulations, i.e., creating 3D objects and modifying typical object properties such as color or size. The associated open user task is based on the Torrance Tests of Creative Thinking. We compared a synergistic multimodal interface using speech-accompanied pointing/grabbing gestures with a more typical unimodal interface using a hierarchical radial menu to trigger actions on selected objects. Independent judges rated the creativity of the resulting products using the Consensual Assessment Technique. Additionally, we measured the creativity-promoting factors flow, usability, and presence. Our results show that the MMI performs on par with the UMI in all measurements despite its limited flexibility and reliability. These promising results demonstrate the technological maturity of MMIs and their potential to extend traditional interaction techniques in VR efficiently.

Force9: Force-assisted Miniature Keyboard on Smart Wearables

  • Lik Hang Lee
  • Ngo Yan Yeung
  • Tristan Braud
  • Tong Li
  • Xiang Su
  • Pan Hui

Smartwatches and other wearables are characterized by small-scale touchscreens that complicate the interaction with content. In this paper, we present Force9, the first optimized miniature keyboard leveraging force-sensitive touchscreens on wrist-worn computers. Force9 enables character selection in an ambiguous layout by analyzing the trade-off between interaction space and the ease of force-assisted interaction. We argue that dividing the screen's pressure range into three contiguous force levels is sufficient to differentiate characters for fast and accurate text input. Our pilot study captures and calibrates the ability of users to perform force-assisted touches on miniature-sized keys on touchscreen devices. We then optimize the keyboard layout considering the goodness of character pairs (with regard to the selected English corpus) under the force-based configuration and the users' familiarity with the QWERTY layout. We finally evaluate the performance of the trimetric optimized Force9 layout, and achieve an average of 10.18 WPM by the end of the final session. Compared to other state-of-the-art approaches, Force9 allows for single-gesture character selection without additional sensors.
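The core idea of dividing the pressure range into three contiguous force levels can be sketched as follows. The normalized pressure range and the thresholds are hypothetical, not Force9's calibrated values.

```python
# Sketch: quantize a touchscreen pressure reading into three contiguous
# force levels, so one ambiguous key can host three characters.
# The [0, 1] pressure range and thresholds are hypothetical assumptions.

def force_level(pressure, thresholds=(0.33, 0.66)):
    """Map a normalized pressure reading to level 0, 1, or 2."""
    low, high = thresholds
    if pressure < low:
        return 0
    return 1 if pressure < high else 2

def select_char(key_chars, pressure):
    """One key hosts three characters; force disambiguates among them."""
    return key_chars[force_level(pressure)]

print(select_char("abc", 0.10))  # 'a' (light touch)
print(select_char("abc", 0.50))  # 'b' (medium press)
print(select_char("abc", 0.90))  # 'c' (hard press)
```

With three force levels per key, nine keys suffice for a 26-character alphabet plus auxiliary symbols, which is what makes a miniature layout feasible.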

Gesticulator: A framework for semantically-aware speech-driven gesture generation

  • Taras Kucherenko
  • Patrik Jonell
  • Sanne van Waveren
  • Gustav Eje Henter
  • Simon Alexandersson
  • Iolanda Leite
  • Hedvig Kjellström

During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying "high''): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning-based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page.

Gesture Enhanced Comprehension of Ambiguous Human-to-Robot Instructions

  • Dulanga Weerakoon
  • Vigneshwaran Subbaraju
  • Nipuni Karumpulli
  • Tuan Tran
  • Qianli Xu
  • U-Xuan Tan
  • Joo Hwee Lim
  • Archan Misra

This work demonstrates the feasibility and benefits of using pointing gestures, a naturally-generated additional input modality, to improve the multi-modal comprehension accuracy of human instructions to robotic agents for collaborative tasks. We present M2Gestic, a system that combines neural-based text parsing with a novel knowledge-graph traversal mechanism, over a multi-modal input of vision, natural language text and pointing. Via multiple studies related to a benchmark tabletop manipulation task, we show that (a) M2Gestic can achieve close-to-human performance in reasoning over unambiguous verbal instructions, and (b) incorporating pointing input (even with its inherent location uncertainty) in M2Gestic results in a significant (30%) accuracy improvement when verbal instructions are ambiguous.

Going with our Guts: Potentials of Wearable Electrogastrography (EGG) for Affect Detection

  • Angela Vujic
  • Stephanie Tong
  • Rosalind Picard
  • Pattie Maes

A hard challenge for wearable systems is to measure differences in emotional valence, i.e., positive and negative affect, via physiology. However, the stomach or gastric signal is an unexplored modality that could offer new affective information. We created a wearable device and software to record gastric signals, known as electrogastrography (EGG). An in-laboratory study was conducted to compare EGG with electrodermal activity (EDA) in 33 individuals viewing affective stimuli. We found that negative stimuli attenuate EGG's indicators of parasympathetic activation, or "rest and digest" activity. We compare EGG to the remaining physiological signals and describe implications for affect detection. Further, we introduce how wearable EGG may support future applications in areas as diverse as reducing nausea in virtual reality and helping treat emotion-related eating disorders.

Hand-eye Coordination for Textual Difficulty Detection in Text Summarization

  • Jun Wang
  • Grace Ngai
  • Hong Va Leong

The task of summarizing a document is a complex task that requires a person to multitask between reading and writing processes. Since a person's cognitive load during reading or writing is known to depend on the level of comprehension or difficulty of the article, it should be possible to analyze the cognitive process of the user when carrying out the task, as evidenced through their eye gaze and typing features, to gain insight into the different difficulty levels. In this paper, we categorize the summary writing process into different phases and extract different gaze and typing features from each phase according to characteristics of eye-gaze behaviors and typing dynamics. Combining these multimodal features, we build a classifier that achieves an accuracy of 91.0% for difficulty level detection, which is around 55% performance improvement above the baseline and at least 15% improvement above models built on a single modality. We also investigate the possible reasons for the superior performance of our multimodal features.

How Good is Good Enough?: The Impact of Errors in Single Person Action Classification on the Modeling of Group Interactions in Volleyball

  • Lian Beenhakker
  • Fahim Salim
  • Dees Postma
  • Robby van Delden
  • Dennis Reidsma
  • Bert-Jan van Beijnum

In Human Behaviour Understanding, social interaction is often modeled on the basis of lower level action recognition. The accuracy of this recognition has an impact on the system's capability to detect the higher level social events, and thus on the usefulness of the resulting system. We model team interactions in volleyball and investigate, through simulation of typical error patterns, how one can consider the required quality (in accuracy and in allowable types of errors) of the underlying action recognition for automated volleyball monitoring. Our proposed approach simulates different patterns of errors, grounded in related work in volleyball action recognition, on top of a manually annotated ground truth to model their different impact on the interaction recognition. Our results show that this can provide a means to quantify the effect of different types of classification errors on the overall quality of the system. Our chosen volleyball use case, in the rising field of sports monitoring, also addresses specific team-related challenges in such a system and how these can be visualized to grasp the interdependencies. In our use case the first layer of our system classifies actions of individual players and the second layer recognizes multiplayer exercises and complexes (i.e. sequences in rallies) to enhance training. The experiments performed for this study investigated how errors at the action recognition layer propagate and cause errors at the complexes layer. We discuss the strengths and weaknesses of the layered system to model volleyball rallies. We also give indications regarding which kinds of errors cause more problems and what choices can follow from them. In our given context we suggest that for recognition of non-Freeball actions (e.g. smash, block) it is more important to achieve a higher accuracy, which can be done at the cost of accuracy of classification of Freeball actions (which are mostly plays between team members and are more interchangeable as to their role in the complexes).

Incorporating Measures of Intermodal Coordination in Automated Analysis of Infant-Mother Interaction

  • Lauren Klein
  • Victor Ardulov
  • Yuhua Hu
  • Mohammad Soleymani
  • Alma Gharib
  • Barbara Thompson
  • Pat Levitt
  • Maja J. Matarić

Interactions between infants and their mothers can provide meaningful insight into the dyad's health and well-being. Previous work has shown that infant-mother coordination, within a single modality, varies significantly with age and interaction quality. However, as infants are still developing their motor, language, and social skills, they may differ from their mothers in the modes they use to communicate. This work examines how infant-mother coordination across modalities can expand researchers' abilities to observe meaningful trends in infant-mother interactions. Using automated feature extraction tools, we analyzed the head position, arm position, and vocal fundamental frequency of mothers and their infants during the Face-to-Face Still-Face (FFSF) procedure. A de-identified dataset including these features was made available online as a contribution of this work. Analysis of infant behavior over the course of the FFSF indicated that the amount and modality of infant behavior change evolves with age. Evaluating the interaction dynamics, we found that infant and mother behavioral signals are coordinated both within and across modalities, and that levels of both intramodal and intermodal coordination vary significantly with age and across stages of the FFSF. These results support the significance of intermodal coordination when assessing changes in infant-mother interaction across conditions.

Influence of Electric Taste, Smell, Color, and Thermal Sensory Modalities on the Liking and Mediated Emotions of Virtual Flavor Perception

  • Nimesha Ranasinghe
  • Meetha Nesam James
  • Michael Gecawicz
  • Jonathan Bland
  • David Smith

Little is known about the influence of various sensory modalities, such as taste, smell, color, and temperature, on the perception of simulated flavor sensations, let alone their influence on people's emotions and liking. Although flavor sensations are essential in our daily experiences and closely associated with our memories and emotions, the concept of flavor and the emotions caused by different sensory modalities are not thoroughly integrated into Virtual and Augmented Reality technologies. Hence, this paper presents 1) an interactive technology to simulate different flavor sensations by overlaying taste (via electrical stimulation on the tongue), smell (via micro air pumps), color (via RGB Lights), and thermal (via Peltier elements) sensations on plain water, and 2) a set of experiments to investigate a) the influence of different sensory modalities on the perception and liking of virtual flavors and b) varying emotions mediated through virtual flavor sensations. Our findings reveal that the participants perceived and liked various stimuli configurations and mostly associated them with positive emotions while highlighting important avenues for future research.

Introducing Representations of Facial Affect in Automated Multimodal Deception Detection

  • Leena Mathur
  • Maja J. Matarić

Automated deception detection systems can enhance health, justice, and security in society by helping humans detect deceivers in high-stakes situations across medical and legal domains, among others. Existing machine learning approaches for deception detection have not leveraged dimensional representations of facial affect: valence and arousal. This paper presents a novel analysis of the discriminative power of facial affect for automated deception detection, along with interpretable features from visual, vocal, and verbal modalities. We used a video dataset of people communicating truthfully or deceptively in real-world, high-stakes courtroom situations. We leveraged recent advances in automated emotion recognition in-the-wild by implementing a state-of-the-art deep neural network trained on the Aff-Wild database to extract continuous representations of facial valence and facial arousal from speakers. We experimented with unimodal Support Vector Machines (SVM) and SVM-based multimodal fusion methods to identify effective features, modalities, and modeling approaches for detecting deception. Unimodal models trained on facial affect achieved an AUC of 80%, and facial affect contributed towards the highest-performing multimodal approach (adaptive boosting) that achieved an AUC of 91% when tested on speakers who were not part of training sets. This approach achieved a higher AUC than existing automated machine learning approaches that used interpretable visual, vocal, and verbal features to detect deception in this dataset, but did not use facial affect. Across all videos, deceptive and truthful speakers exhibited significant differences in facial valence and facial arousal, contributing computational support to existing psychological theories on relationships between affect and deception. The demonstrated importance of facial affect in our models informs and motivates the future development of automated, affect-aware machine learning approaches for modeling and detecting deception and other social behaviors in-the-wild.

Is She Truly Enjoying the Conversation?: Analysis of Physiological Signals toward Adaptive Dialogue Systems

  • Shun Katada
  • Shogo Okada
  • Yuki Hirano
  • Kazunori Komatani

In human-agent interactions, it is necessary for the systems to identify the current emotional state of the user to adapt their dialogue strategies. Nevertheless, this task is challenging because the current emotional states are not always expressed in a natural setting and change dynamically. Recent accumulated evidence has indicated the usefulness of physiological modalities to realize emotion recognition. However, the contribution of time-series physiological signals in human-agent interaction during a dialogue has not been extensively investigated. This paper presents a machine learning model based on physiological signals to estimate a user's sentiment at every exchange during a dialogue. Using a wearable sensing device, time-series physiological data, including electrodermal activity (EDA) and heart rate, in addition to acoustic and visual information during a dialogue were collected. The sentiment labels were annotated by the participants themselves and by external human coders for each exchange consisting of a pair of system and participant utterances. The experimental results showed that a multimodal deep neural network (DNN) model combined with the EDA and visual features achieved an accuracy of 63.2%. In general, this task is challenging, as indicated by the accuracy of 63.0% attained by the external coders. The analysis of the sentiment estimation results for each individual indicated that the human coders often wrongly estimated the negative sentiment labels, and in this case, the performance of the DNN model was higher than that of the human coders. These results indicate that physiological signals can help in detecting the implicit aspects of negative sentiments, which are acoustically/visually indistinguishable.

Job Interviewer Android with Elaborate Follow-up Question Generation

  • Koji Inoue
  • Kohei Hara
  • Divesh Lala
  • Kenta Yamamoto
  • Shizuka Nakamura
  • Katsuya Takanashi
  • Tatsuya Kawahara

A job interview is a domain that takes advantage of an android robot's human-like appearance and behaviors. In this work, our goal is to implement a system in which an android plays the role of an interviewer so that users may practice for a real job interview. Our proposed system generates elaborate follow-up questions based on responses from the interviewee. We conducted an interactive experiment to compare the proposed system against a baseline system that asked only fixed-form questions. We found that this system was significantly better than the baseline system with respect to the impression of the interview and the quality of the questions, and that the presence of the android interviewer was enhanced by the follow-up questions. We also found a similar result when using a virtual agent interviewer, except that presence was not enhanced.

LASO: Exploiting Locomotive and Acoustic Signatures over the Edge to Annotate IMU Data for Human Activity Recognition

  • Soumyajit Chatterjee
  • Avijoy Chakma
  • Aryya Gangopadhyay
  • Nirmalya Roy
  • Bivas Mitra
  • Sandip Chakraborty

Annotated IMU sensor data from smart devices and wearables are essential for developing supervised models for fine-grained human activity recognition, albeit generating sufficient annotated data for diverse human activities under different environments is challenging. Existing approaches primarily use human-in-the-loop based techniques, including active learning; however, they are tedious, costly, and time-consuming. Leveraging the availability of acoustic data from embedded microphones over the data collection devices, in this paper, we propose LASO, a multimodal approach for automated data annotation from acoustic and locomotive information. LASO works on the edge device itself, ensuring that only the annotated IMU data is collected while the acoustic data is discarded on the device, hence preserving the audio-privacy of the user. In the absence of any pre-existing labeling information, such an auto-annotation is challenging as the IMU data needs to be sessionized for different time-scaled activities in a completely unsupervised manner. We use a change-point detection technique while synchronizing the locomotive information from the IMU data with the acoustic data, and then use pre-trained audio-based activity recognition models for labeling the IMU data while handling the acoustic noises. LASO efficiently annotates IMU data, without any explicit human intervention, with a mean accuracy of 0.93 (±0.04) and 0.78 (±0.05) for two different real-life datasets from workshop and kitchen environments, respectively.
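Sessionizing an unlabeled IMU stream hinges on change-point detection. The following is a deliberately simple mean-shift detector of that general kind, not LASO's actual algorithm; the window size, threshold, and toy stream are illustrative assumptions.

```python
# Sketch: a simple mean-shift change-point detector of the kind used to
# sessionize IMU streams into activity segments (not LASO's algorithm;
# window size and threshold are illustrative assumptions).

def change_points(signal, window=4, threshold=1.0):
    """Flag indices where the mean of the next `window` samples differs
    from the mean of the previous `window` samples by more than `threshold`."""
    cps = []
    for i in range(window, len(signal) - window + 1):
        before = sum(signal[i - window:i]) / window
        after = sum(signal[i:i + window]) / window
        if abs(after - before) > threshold:
            cps.append(i)
    return cps

# Toy accelerometer magnitude: rest, then activity, then rest again.
stream = [0.1, 0.2, 0.1, 0.1, 2.0, 2.1, 1.9, 2.0, 0.1, 0.2, 0.1, 0.1]
print(change_points(stream))  # [4, 8]
```

Each detected index marks a candidate session boundary; in a pipeline like the one described, the segments between boundaries would then be matched against the synchronized acoustic stream for labeling.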

LDNN: Linguistic Knowledge Injectable Deep Neural Network for Group Cohesiveness Understanding

  • Yanan Wang
  • Jianming Wu
  • Jinfa Huang
  • Gen Hattori
  • Yasuhiro Takishima
  • Shinya Wada
  • Rui Kimura
  • Jie Chen
  • Satoshi Kurihara

Group cohesiveness reflects the level of intimacy that people feel with each other, and the development of a dialogue robot that can understand group cohesiveness will lead to the promotion of human communication. However, group cohesiveness is a complex concept that is difficult to predict based only on image pixels. Inspired by the fact that humans intuitively associate linguistic knowledge accumulated in the brain with the visual images they see, we propose a linguistic knowledge injectable deep neural network (LDNN) that builds a visual model (visual LDNN) for predicting group cohesiveness that can automatically associate the linguistic knowledge hidden behind images. LDNN consists of a visual encoder and a language encoder, and applies domain adaptation and linguistic knowledge transition mechanisms to transform linguistic knowledge from a language model to the visual LDNN. We train LDNN by adding descriptions to the training and validation sets of the Group AFfect Dataset 3.0 (GAF 3.0), and test the visual LDNN without any description. Comparing visual LDNN with various fine-tuned DNN models and three state-of-the-art models in the test set, the results demonstrate that the visual LDNN not only improves the performance of the fine-tuned DNN model leading to an MSE very similar to the state-of-the-art model, but is also a practical and efficient method that requires relatively little preprocessing. Furthermore, ablation studies confirm that LDNN is an effective method to inject linguistic knowledge into visual models.

Mimicker-in-the-Browser: A Novel Interaction Using Mimicry to Augment the Browsing Experience

  • Riku Arakawa
  • Hiromu Yakura

Humans are known to have a better subconscious impression of other humans when their movements are imitated in social interactions. Despite this influential phenomenon, its application in human-computer interaction is currently limited to specific areas, such as an agent mimicking the head movements of a user in virtual reality, because capturing user movements conventionally requires external sensors. If we can implement the mimicry effect in a scalable platform without such sensors, a new approach for designing human-computer interaction will be introduced. Therefore, we have investigated whether users feel positively toward a mimicking agent that is delivered by a standalone web application using only a webcam. We also examined whether a web page that changes its background pattern based on head movements can foster a favorable impression. The positive effect confirmed in our experiments supports mimicry as a novel design practice to augment our daily browsing experiences.

Mitigating Biases in Multimodal Personality Assessment

  • Shen Yan
  • Di Huang
  • Mohammad Soleymani

As algorithmic decision making systems are increasingly used in high-stake scenarios, concerns have arisen about the potential unfairness of these decisions to certain social groups. Despite its importance, the bias and fairness of multimodal systems have not been thoroughly studied. In this work, we focus on the multimodal systems designed for apparent personality assessment and hirability prediction. We use the First Impression dataset as a case study to investigate the biases in such systems. We provide detailed analyses on the biases from different modalities and data fusion strategies. Our analyses reveal that different modalities show various patterns of biases and the data fusion process also introduces additional biases to the model. To mitigate the biases, we develop and evaluate two different debiasing approaches based on data balancing and adversarial learning. Experimental results show that both approaches can reduce the biases in model outcomes without sacrificing much performance. Our debiasing strategies can be deployed in real-world multimodal systems to provide fairer outcomes.

MMGatorAuth: A Novel Multimodal Dataset for Authentication Interactions in Gesture and Voice

  • Sarah Morrison-Smith
  • Aishat Aloba
  • Hangwei Lu
  • Brett Benda
  • Shaghayegh Esmaeili
  • Gianne Flores
  • Jesse Smith
  • Nikita Soni
  • Isaac Wang
  • Rejin Joy
  • Damon L. Woodard
  • Jaime Ruiz
  • Lisa Anthony

The future of smart environments is likely to involve both passive and active interactions on the part of users. Depending on what sensors are available in the space, users may make use of multimodal interaction modalities such as hand gestures or voice commands. There is a shortage of robust yet controlled multimodal interaction datasets for smart environment applications. One application domain of interest based on current state-of-the-art is authentication for sensitive or private tasks, such as banking and email. We present a novel, large multimodal dataset for authentication interactions in both gesture and voice, collected from 106 volunteers who each performed 10 examples of each of a set of hand gesture and spoken voice commands chosen from prior literature (10,600 gesture samples and 13,780 voice samples). We present the data collection method, raw data and common features extracted, and a case study illustrating how this dataset could be useful to researchers. Our goal is to provide a benchmark dataset for testing future multimodal authentication solutions, enabling comparison across approaches.

Modality Dropout for Improved Performance-driven Talking Faces

  • Ahmed Hussen Abdelaziz
  • Barry-John Theobald
  • Paul Dixon
  • Reinhard Knothe
  • Nicholas Apostoloff
  • Sachin Kajareker

We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-verbal facial movements are generated using only visual information. To ensure that our model exploits both modalities during training, batches are generated that contain audio-only, video-only, and audiovisual input features. The probability of dropping a modality allows control over the degree to which the model exploits audio and visual information during training. Our trained model runs in real-time on resource-limited hardware (e.g., a smartphone), it is user agnostic, and it is not dependent on a potentially error-prone transcription of the speech. We use subjective testing to demonstrate: 1) the improvement of audiovisual-driven animation over the equivalent video-only approach, and 2) the improvement in the animation of speech-related facial movements after introducing modality dropout. Without modality dropout, viewers prefer audiovisual-driven animation in 51% of the test sequences compared with only 18% for video-driven. After introducing dropout, viewer preference for audiovisual-driven animation increases to 74%, but decreases to 8% for video-only.

MORSE: MultimOdal sentiment analysis for Real-life SEttings

  • Yiqun Yao
  • Verónica Pérez-Rosas
  • Mohamed Abouelenien
  • Mihai Burzo

Multimodal sentiment analysis aims to detect and classify sentiment expressed in multimodal data. Research to date has focused on datasets with a large number of training samples, manual transcriptions, and nearly balanced sentiment labels. However, data collection in real settings often leads to small datasets with noisy transcriptions and imbalanced label distributions, which are therefore significantly more challenging than those collected in controlled settings. In this work, we introduce MORSE, a domain-specific dataset for MultimOdal sentiment analysis in Real-life SEttings. The dataset consists of 2,787 video clips extracted from 49 interviews with panelists in a product usage study, with each clip annotated for positive, negative, or neutral sentiment. The characteristics of MORSE include noisy transcriptions from raw videos, a naturally imbalanced label distribution, and scarcity of minority labels. To address the challenging real-life settings in MORSE, we propose a novel two-step fine-tuning method for multimodal sentiment classification using transfer learning and the Transformer model architecture; our method starts with a pre-trained language model and one step of fine-tuning on the language modality, followed by a second step of joint fine-tuning that incorporates the visual and audio modalities. Experimental results show that while MORSE is challenging for various baseline models such as SVM and Transformer, our two-step fine-tuning method is able to capture the dataset characteristics and effectively address its challenges. Our method outperforms related work that uses both single and multiple modalities in the same transfer learning settings.

MSP-Face Corpus: A Natural Audiovisual Emotional Database

  • Andrea Vidal
  • Ali Salman
  • Wei-Cheng Lin
  • Carlos Busso

Expressive behaviors conveyed during daily interactions are difficult to characterize, because they often consist of a blend of different emotions. The complexity of expressive human communication is an important challenge in building and evaluating automatic systems that can reliably predict emotions. Emotion recognition systems are often trained on limited databases, where the emotions are either elicited or portrayed by actors. These approaches do not necessarily reflect real emotions, creating a mismatch when the same emotion recognition systems are applied in practical applications. Developing rich emotional databases that reflect the complexity in the externalization of emotion is an important step toward building better models to recognize emotions. This study presents the MSP-Face database, a natural audiovisual database obtained from video-sharing websites, where multiple individuals discuss various topics, expressing their opinions and experiences. The natural recordings convey a broad range of emotions that are difficult to obtain with alternative data collection protocols. A distinctive feature of the corpus is its two sets. The first set includes videos annotated with emotional labels using a crowd-sourcing protocol (9,370 recordings -- 24 hrs, 41 min). The second set includes similar videos without emotional labels (17,955 recordings -- 45 hrs, 57 min), offering an ideal infrastructure for exploring semi-supervised and unsupervised machine-learning algorithms on natural emotional videos. This study describes the process of collecting and annotating the corpus. It also provides baselines on this new database using unimodal (audio, video) and multimodal emotion recognition systems.

Multimodal Automatic Coding of Client Behavior in Motivational Interviewing

  • Leili Tavabi
  • Kalin Stefanov
  • Larry Zhang
  • Brian Borsari
  • Joshua D. Woolley
  • Stefan Scherer
  • Mohammad Soleymani

Motivational Interviewing (MI) is defined as a collaborative conversation style that evokes the client's own intrinsic reasons for behavioral change. In MI research, the client's attitude (willingness or resistance) toward change, as expressed through language, has been identified as an important indicator of subsequent behavior change. Automated coding of these indicators provides a systematic and efficient means for the analysis and assessment of MI therapy sessions. In this paper, we study and analyze behavioral cues in client language and speech that indicate the client's attitude toward change during a therapy session, using a database of dyadic motivational interviews between therapists and clients with alcohol-related problems. Deep language and voice encoders, i.e., BERT and VGGish, trained on large amounts of data, are used to extract features from each utterance. We develop a neural network to automatically detect the MI codes using both the clients' and therapists' language and the clients' voice, and demonstrate the importance of semantic context in such detection. Additionally, we develop machine learning models for predicting clients' alcohol-use behavioral outcomes through language and voice analysis. Our analysis demonstrates that we are able to estimate MI codes using clients' textual utterances along with preceding textual context from both the therapist and client, reaching an F1-score of 0.72 for a speaker-independent three-class classification. We also report initial results for using the clients' data to predict behavioral outcomes, which outlines a direction for future work.

Multimodal Data Fusion based on the Global Workspace Theory

  • Cong Bao
  • Zafeirios Fountas
  • Temitayo Olugbade
  • Nadia Bianchi-Berthouze

We propose a novel neural network architecture, named the Global Workspace Network (GWN), which addresses the challenge of dynamic and unspecified uncertainties in multimodal data fusion. The GWN is a model of attention across modalities that evolves through time, inspired by the well-established Global Workspace Theory from the field of cognitive science. The GWN achieved an average F1 score of 0.92 for discriminating between pain patients and healthy participants, and an average F1 score of 0.75 for the further classification of three pain levels in patients, both based on the multimodal EmoPain dataset captured from people with chronic pain and healthy people performing different types of exercise movements in unconstrained settings. In these tasks, the GWN significantly outperforms the typical fusion approach of merging by concatenation. We further provide an extensive analysis of the behaviour of the GWN and its ability to address uncertainties (hidden noise) in multimodal data.

Multimodal, Multiparty Modeling of Collaborative Problem Solving Performance

  • Shree Krishna Subburaj
  • Angela E.B. Stewart
  • Arjun Ramesh Rao
  • Sidney K. D'Mello

Modeling team phenomena from multiparty interactions inherently requires combining signals from multiple teammates, often via weighting strategies. Here, we explored the hypothesis that strategically weighting signals from individual teammates would outperform an equal-weighting baseline. Accordingly, we explored role-, trait-, and behavior-based weighting of behavioral signals across team members. We analyzed data from 101 triads engaged in computer-mediated collaborative problem solving (CPS) in an educational physics game. We investigated the accuracy of machine-learned models trained on facial expressions, acoustic-prosodics, eye gaze, and task context information, computed one minute prior to the end of a game level, at predicting success at solving that level. AUROCs for unimodal models that equally weighted features from the three teammates ranged from .54 to .67, whereas a combination of gaze, face, and task context features achieved an AUROC of .73. The various multiparty weighting strategies did not outperform the equal-weighting baseline. However, our best nonverbal model (AUROC = .73) outperformed a language-based model (AUROC = .67), and there were some advantages to combining the two (AUROC = .75). Finally, models aimed at prospectively predicting performance on a minute-by-minute basis from the start of the level achieved a lower, but still above-chance, AUROC of .60. We discuss implications for multiparty modeling of team performance and other team constructs.

PiHearts: Resonating Experiences of Self and Others Enabled by a Tangible Somaesthetic Design

  • Ilhan Aslan
  • Andreas Seiderer
  • Chi Tai Dang
  • Simon Rädler
  • Elisabeth André

A human's heartbeat can be sensed by sensors and displayed for others to see, hear, feel, and potentially "resonate" with. Previous work studying interaction designs with physiological data, such as a heart's pulse rate, has argued that feeding it back to users may, for example, support users' mindfulness and self-awareness during various everyday activities and ultimately support their health and wellbeing. Inspired by Somaesthetics as a discipline, we designed and explored multimodal displays that enable experiencing heartbeats as natural stimuli from oneself and others in social proximity. In this paper, we report on the design process of our design, PiHearts, and present qualitative results of a field study with 30 pairs of participants. Participants were asked to use PiHearts while watching short movies together and to report their perceived experience under three different display conditions. We found, for example, that participants reported significant effects on sensory immersion when they received their own heartbeats as stimuli compared to the condition without any heartbeat display, and that feeling their partner's heartbeats resulted in significant effects on social experience. We refer to resonance theory to motivate and discuss the results, highlighting the potential of how the digitalization of heartbeats as rhythmic natural stimuli may provide resonance in a modern society facing social acceleration.

Predicting Video Affect via Induced Affection in the Wild

  • Yi Ding
  • Radha Kumaran
  • Tianjiao Yang
  • Tobias Höllerer

Curating large and high-quality datasets for studying affect is a costly and time-consuming process, especially when the labels are continuous. In this paper, we examine the potential of using unlabeled public reactions in the form of textual comments to aid in classifying video affect. We examine two popular datasets used for affect recognition and mine public reactions for these videos. We learn a representation of these reactions by using the video ratings as a weakly supervised signal. We show that our model can learn a fine-grained prediction of comment affect when given a video alone. Furthermore, we demonstrate how predicting the affective properties of a comment can be a potentially useful modality to use in multimodal affect modeling.

Preserving Privacy in Image-based Emotion Recognition through User Anonymization

  • Vansh Narula
  • Kexin Feng
  • Theodora Chaspari

The large amount of data captured by ambulatory sensing devices can afford us insights into longitudinal behavioral patterns, which can be linked to emotional, psychological, and cognitive outcomes. Yet, the sensitivity of behavioral data, which regularly involve speech signals and facial images, can cause strong privacy concerns, such as the leaking of the user identity. We examine the interplay between emotion-specific and user identity-specific information in image-based emotion recognition systems. We further study a user anonymization approach that preserves emotion-specific information but eliminates user-dependent information from the convolutional kernels of convolutional neural networks (CNNs), therefore reducing user re-identification risks. We formulate an adversarial learning problem, implemented with a multitask CNN, that minimizes the emotion classification loss and maximizes the user identification loss. The proposed system is evaluated on three datasets, achieving moderate to high emotion recognition performance and poor user identity recognition performance. The resulting image transformation obtained by the convolutional layers is visually inspected, attesting to the efficacy of the proposed system in preserving emotion-specific information. Implications from this study can inform the design of privacy-aware emotion recognition systems that preserve facets of human behavior while concealing the identity of the user, and can be used in ambulatory monitoring applications related to health, well-being, and education.

Purring Wheel: Thermal and Vibrotactile Notifications on the Steering Wheel

  • Patrizia Di Campli San Vito
  • Stephen Brewster
  • Frank Pollick
  • Simon Thompson
  • Lee Skrypchuk
  • Alexandros Mouzakitis

Haptic feedback can improve safety and driving behaviour. While vibration has been widely studied, other haptic modalities have been neglected. To address this, we present two studies investigating the use of uni- and bimodal vibrotactile and thermal cues on the steering wheel. First, notifications with three levels of urgency were subjectively rated and then identified during simulated driving. Bimodal feedback showed an increased identification time over unimodal vibrotactile cues. Thermal feedback was consistently rated less urgent, showing its suitability for less time-critical notifications, where vibration would be unnecessarily attention-grabbing. The second study investigated more complex thermal and bimodal haptic notifications comprising two different types of information (Nature and Importance of incoming message). Results showed that both modalities could be identified with high recognition rates of up to 92% for both types and up to 99% for a single type, opening up a novel design space for haptic in-car feedback.

SmellControl: The Study of Sense of Agency in Smell

  • Patricia Cornelio
  • Emanuela Maggioni
  • Giada Brianza
  • Sriram Subramanian
  • Marianna Obrist

The Sense of Agency (SoA) is crucial in interaction with technology; it refers to the feeling of 'I did that' as opposed to 'the system did that', supporting a feeling of being in control. Research in human-computer interaction has recently studied agency in visual, auditory and haptic interfaces; however, the role of smell in agency remains unknown. Our sense of smell is powerful in eliciting emotions, memories and awareness of the environment, which has been exploited to enhance user experiences (e.g., in VR and driving scenarios). In light of the increased interest in designing multimodal interfaces including smell, and its close link with emotions, we investigated, for the first time, the effect of smell-induced emotions on the SoA. We conducted a study using the Intentional Binding (IB) paradigm, used to measure SoA, while participants were exposed to three scents with different valence (pleasant, unpleasant, neutral). Our results show that participants' SoA increased with a pleasant scent compared to neutral and unpleasant scents. We discuss how our results can inform the design of multimodal and future olfactory interfaces.

Speaker-Invariant Adversarial Domain Adaptation for Emotion Recognition

  • Yufeng Yin
  • Baiyu Huang
  • Yizhen Wu
  • Mohammad Soleymani

Automatic emotion recognition methods are sensitive to variations across different datasets, and their performance drops when evaluated across corpora. Domain adaptation techniques, e.g., the Domain-Adversarial Neural Network (DANN), can mitigate this problem. Though the DANN can detect and remove the bias between corpora, the bias between speakers remains, which results in reduced performance. In this paper, we propose the Speaker-Invariant Domain-Adversarial Neural Network (SIDANN) to reduce both the domain bias and the speaker bias. Specifically, building on the DANN, we add a speaker discriminator with a gradient reversal layer (GRL) to unlearn information representing speakers' individual characteristics. Our experiments with multimodal data (speech, vision, and text) and cross-domain evaluation indicate that the proposed SIDANN outperforms the DANN model (+5.6% and +2.8% on average for detecting arousal and valence, respectively), suggesting that the SIDANN has better domain adaptation ability than the DANN. In addition, the modality contribution analysis shows that acoustic features are the most informative for arousal detection, while lexical features perform best for valence detection.
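The gradient reversal layer (GRL) at the heart of DANN-style models acts as the identity in the forward pass and multiplies incoming gradients by a negative factor in the backward pass, so the feature extractor is trained to fool the discriminator. A framework-agnostic toy sketch of this mechanism follows; the class name and manual forward/backward interface are hypothetical, and real implementations hook into an autograd engine instead.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the
    backward pass, which turns discriminator minimization into
    adversarial maximization for the layers below."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Pass features through unchanged.
        return x

    def backward(self, grad_output):
        # Flip and scale the gradient flowing back to the encoder.
        return -self.lam * grad_output
```

Placed between a shared encoder and the speaker (or domain) discriminator, this is what lets a single backward pass update both branches with opposing objectives.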

StrategicReading: Understanding Complex Mobile Reading Strategies via Implicit Behavior Sensing

  • Wei Guo
  • Byeong-Young Cho
  • Jingtao Wang

Mobile devices are becoming an important platform for reading. However, existing research on mobile reading primarily focuses on low-level metrics such as speed and comprehension. For complex reading tasks involving information seeking and context switching, researchers still rely on verbal reports via think-aloud. We present StrategicReading, an intelligent reading system running on unmodified smartphones, to understand high-level strategic reading behaviors on mobile devices. StrategicReading leverages multimodal behavior sensing and takes advantage of signals from camera-based gaze sensing, kinematic scrolling patterns, and cross-page behavior changes. Through a 40-participant study, we found that gaze patterns, muscle stiffness signals, and reading paths captured by StrategicReading can infer both users' reading strategies and reading performance with high accuracy.

Studying Person-Specific Pointing and Gaze Behavior for Multimodal Referencing of Outside Objects from a Moving Vehicle

  • Amr Gomaa
  • Guillermo Reyes
  • Alexandra Alles
  • Lydia Rupp
  • Michael Feld

Hand pointing and eye gaze have been extensively investigated in automotive applications for object selection and referencing. Despite significant advances, existing outside-the-vehicle referencing methods consider these modalities separately. Moreover, existing multimodal referencing methods focus on static situations, whereas the situation in a moving vehicle is highly dynamic and subject to safety-critical constraints. In this paper, we investigate the specific characteristics of each modality and the interaction between them when used in the task of referencing outside objects (e.g., buildings) from the vehicle. We furthermore explore person-specific differences in this interaction by analyzing individuals' performance for pointing and gaze patterns, along with their effect on the driving task. Our statistical analysis shows significant differences in individual behaviour based on the object's location (i.e., the driver's right side vs. left side), the object's surroundings, the driving mode (i.e., autonomous vs. normal driving), and pointing and gaze duration, laying the foundation for a user-adaptive approach.

Temporal Attention and Consistency Measuring for Video Question Answering

  • Lingyu Zhang
  • Richard J. Radke

Social signal processing algorithms have become increasingly better at solving well-defined prediction and estimation problems in audiovisual recordings of group discussion. However, much human behavior and communication is less structured and more subtle. In this paper, we address the problem of generic question answering from diverse audiovisual recordings of human interaction. The goal is to select the correct free-text answer to a free-text question about human interaction in a video. We propose an RNN-based model with two novel ideas: a temporal attention module that highlights key words and phrases in the question and candidate answers, and a consistency measurement module that scores the similarity between the multimodal data, the question, and the candidate answers. This small set of consistency scores forms the input to the final question-answering stage, resulting in a lightweight model. We demonstrate that our model achieves state-of-the-art accuracy on the Social-IQ dataset containing hundreds of videos and question/answer pairs.

The eyes know it: FakeET -- An Eye-tracking Database to Understand Deepfake Perception

  • Parul Gupta
  • Komal Chugh
  • Abhinav Dhall
  • Ramanathan Subramanian

We present FakeET -- an eye-tracking database to understand human visual perception of deepfake videos. Given that the principal purpose of deepfakes is to deceive human observers, FakeET is designed to understand and evaluate the ability of viewers to detect synthetic video artifacts. FakeET contains viewing patterns compiled from 40 users via the Tobii desktop eye-tracker for 811 videos from the Google Deepfake dataset, with a minimum of two viewings per video. Additionally, EEG responses acquired via the Emotiv sensor are also available. The compiled data confirms (a) distinct eye movement characteristics for real vs fake videos; (b) utility of the eye-track saliency maps for spatial forgery localization and detection, and (c) Error Related Negativity (ERN) triggers in the EEG responses, and the ability of the raw EEG signal to distinguish between real and fake videos.

The WoNoWa Dataset: Investigating the Transactive Memory System in Small Group Interactions

  • Beatrice Biancardi
  • Lou Maisonnave-Couterou
  • Pierrick Renault
  • Brian Ravenet
  • Maurizio Mancini
  • Giovanna Varni

We present WoNoWa, a novel multi-modal dataset of small group interactions in collaborative tasks. The dataset is explicitly designed to elicit and to study over time a Transactive Memory System (TMS), a group's emergent state characterizing the group's meta-knowledge about "who knows what". A rich set of automatic features and manual annotations, extracted from the collected audio-visual data, is available on request for research purposes. Features include individual descriptors (e.g., position, Quantity of Motion, speech activity) and group descriptors (e.g., F-formations). Additionally, participants' self-assessments are available. Preliminary results from exploratory analyses show that the WoNoWa design allowed groups to develop a TMS that increased across the tasks. These results encourage the use of the WoNoWa dataset for a better understanding of the relationship between behavioural patterns and TMS, that in turn could help to improve group performance.

Toward Adaptive Trust Calibration for Level 2 Driving Automation

  • Kumar Akash
  • Neera Jain
  • Teruhisa Misu

Properly calibrated human trust is essential for successful interaction between humans and automation. However, while human trust calibration can be improved by increased automation transparency, too much transparency can overwhelm human workload. To address this tradeoff, we present a probabilistic framework using a partially observable Markov decision process (POMDP) for modeling the coupled trust-workload dynamics of human behavior in interaction with automation. We specifically consider hands-off Level 2 driving automation in a city environment involving multiple intersections, where the human chooses whether or not to rely on the automation. We consider automation reliability, automation transparency, and scene complexity, along with human reliance and eye-gaze behavior, to model the dynamics of human trust and workload. We demonstrate that our model framework can appropriately vary automation transparency based on real-time human trust and workload belief estimates to achieve trust calibration.

Toward Multimodal Modeling of Emotional Expressiveness

  • Victoria Lin
  • Jeffrey M. Girard
  • Michael A. Sayette
  • Louis-Philippe Morency

Emotional expressiveness captures the extent to which a person tends to outwardly display their emotions through behavior. Due to the close relationship between emotional expressiveness and behavioral health, as well as the crucial role that it plays in social interaction, the ability to automatically predict emotional expressiveness stands to spur advances in science, medicine, and industry. In this paper, we explore three related research questions. First, how well can emotional expressiveness be predicted from visual, linguistic, and multimodal behavioral signals? Second, how important is each behavioral modality to the prediction of emotional expressiveness? Third, which behavioral signals are reliably related to emotional expressiveness? To answer these questions, we add highly reliable transcripts and human ratings of perceived emotional expressiveness to an existing video database and use this data to train, validate, and test predictive models. Our best model shows promising predictive performance on this dataset (RMSE=0.65, R^2=0.45, r=0.74). Multimodal models tend to perform best overall, and models trained on the linguistic modality tend to outperform models trained on the visual modality. Finally, examination of our interpretable models' coefficients reveals a number of visual and linguistic behavioral signals---such as facial action unit intensity, overall word count, and use of words related to social processes---that reliably predict emotional expressiveness.

Towards Engagement Recognition of People with Dementia in Care Settings

  • Lars Steinert
  • Felix Putze
  • Dennis Küster
  • Tanja Schultz

Roughly 50 million people worldwide currently suffer from dementia, a number expected to triple by 2050. Dementia is characterized by a loss of cognitive function and changes in behaviour, affecting memory, language skills, and the ability to focus and pay attention. However, it has been shown that secondary therapy, such as the physical, social and cognitive activation of People with Dementia (PwD), has significant positive effects. Activation impacts cognitive functioning and can help prevent the magnification of the apathy, boredom, depression, and loneliness associated with dementia. Furthermore, activation can lead to a higher perceived quality of life. We follow Cohen's argument that activation stimuli have to produce engagement to take effect, and adopt his definition of engagement as "the act of being occupied or involved with an external stimulus".

Understanding Applicants' Reactions to Asynchronous Video Interviews Through Self-reports and Nonverbal Cues

  • Skanda Muralidhar
  • Emmanuelle Patricia Kleinlogel
  • Eric Mayor
  • Adrian Bangerter
  • Marianne Schmid Mast
  • Daniel Gatica-Perez

Asynchronous video interviews (AVIs) are increasingly used by organizations in their hiring process. In this mode of interviewing, applicants are asked to record their responses to predefined interview questions using a webcam via an online platform. AVI usage has increased due to employers' perceived benefits in terms of cost and scale. However, little research has been conducted on applicants' reactions to these new interview methods. In this work, we investigate applicants' reactions to an AVI platform using self-reported measures previously validated in the psychology literature. We also investigate the connections between these measures and the nonverbal behavior displayed during the interviews. We find that participants who found the platform creepy and had concerns about privacy reported lower interview performance compared to participants who did not have such concerns. We also observe weak correlations between the nonverbal cues displayed and these self-reported measures. Finally, inference experiments achieve overall low performance with respect to explaining applicants' reactions. Overall, our results reveal that participants who are not at ease with AVIs (i.e., a high creepy ambiguity score) might be unfairly penalized. This has implications for improved hiring practices using AVIs.

Using Emotions to Complement Multi-Modal Human-Robot Interaction in Urban Search and Rescue Scenarios

  • Sami Alperen Akgun
  • Moojan Ghafurian
  • Mark Crowley
  • Kerstin Dautenhahn

An experiment is presented to investigate whether there is consensus in mapping emotions to messages/situations in urban search and rescue scenarios, where the efficiency and effectiveness of interactions are key to success. We studied mappings between 10 specific messages, presented in two different communication styles and reflecting common situations that might happen during search and rescue missions, and the emotions exhibited by robots in those situations. The data were obtained through a Mechanical Turk study with 78 participants. Our findings support the feasibility of using emotions as an additional communication channel to improve multi-modal human-robot interaction for urban search and rescue robots, and suggest that these mappings are robust, i.e., not affected by the robot's communication style.

"Was that successful?" On Integrating Proactive Meta-Dialogue in a DIY-Assistant using Multimodal Cues

  • Matthias Kraus
  • Marvin Schiller
  • Gregor Behnke
  • Pascal Bercher
  • Michael Dorna
  • Michael Dambier
  • Birte Glimm
  • Susanne Biundo
  • Wolfgang Minker

Effectively supporting novices during the performance of complex tasks, e.g. do-it-yourself (DIY) projects, requires intelligent assistants to be more than mere instructors. In order to be accepted as competent and trustworthy cooperation partners, they need to be able to actively participate in the project and engage in helpful conversations with users when assistance is necessary. Therefore, a new proactive version of the DIY assistant Robert is presented in this paper. It extends the previous prototype with the capability to initiate reflective meta-dialogues using multimodal cues. Two different strategies for reflective dialogue are implemented: a progress-based strategy initiates a reflective dialogue about previous experience with the assistance, encouraging the self-appraisal of the user, while an activity-based strategy provides timely, task-dependent support. To this end, user activities with a connected drill driver are tracked, triggering dialogues that reflect on the current task and help prevent task failure. An experimental study comparing the proactive assistant against the baseline version shows that proactive meta-dialogue builds user trust significantly better than a solely reactive system. Besides, the results provide interesting insights for the development of proactive dialogue assistants.

You Have a Point There: Object Selection Inside an Automobile Using Gaze, Head Pose and Finger Pointing

  • Abdul Rafey Aftab
  • Michael von der Beeck
  • Michael Feld

Sophisticated user interaction in the automotive industry is a fast-emerging topic. Mid-air gestures and speech already have numerous applications in driver-car interaction. Additionally, multimodal approaches are being developed to leverage multiple sensors for added advantages. In this paper, we propose a fast and practical multimodal fusion method based on machine learning for the selection of various control modules in an automotive vehicle. The modalities taken into account are gaze, head pose and finger pointing gesture; speech is used only as a trigger for fusion. Single modalities have previously been used numerous times to recognize the user's pointing direction. We, however, demonstrate how multiple inputs can be fused together to enhance recognition performance. Furthermore, we compare different deep neural network architectures against conventional machine learning methods, namely Support Vector Regression and Random Forests, and show the improvements in pointing direction accuracy achieved using deep learning. The results suggest great potential for the use of multimodal inputs, which can be applied to more use cases in the vehicle.

SESSION: Short Papers

A Comparison between Laboratory and Wearable Sensors in the Context of Physiological Synchrony

  • Jasper J. van Beers
  • Ivo V. Stuldreher
  • Nattapong Thammasan
  • Anne-Marie Brouwer

Measuring concurrent changes in autonomic physiological responses aggregated across individuals (Physiological Synchrony - PS) can provide insight into group-level cognitive or emotional processes. Utilizing cheap and easy-to-use wearable sensors to measure physiology, rather than their high-end laboratory counterparts, is desirable. Since it is currently ambiguous how different signal properties (arising from different types of measuring equipment) influence the detection of PS associated with mental processes, it is unclear whether, or to what extent, PS based on data from wearables compares to that from their laboratory equivalents. Existing literature has investigated PS using both types of equipment, but none has compared them directly. In this study, we measure PS in electrodermal activity (EDA) and inter-beat interval (IBI, the inverse of heart rate) of participants who listened to the same audio stream but were either instructed to attend to the presented narrative (n=13) or to the interspersed auditory events (n=13). Both laboratory and wearable sensors were used (ActiveTwo electrocardiogram (ECG) and EDA; Wahoo Tickr and EdaMove4). A participant's attentional condition was classified based on which attentional group they shared greater synchrony with. For both types of sensors, we found classification accuracies of 73% or higher in both EDA and IBI. We found no significant difference in classification accuracies between the laboratory and wearable sensors. These findings encourage the use of wearables for PS-based research and for in-the-field measurements.
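The abstract's classification scheme, assigning a participant to whichever attentional group they share greater synchrony with, can be illustrated with a deliberately simplified sketch: correlate the participant's signal with each group's aggregate signal and pick the stronger match. All names here are hypothetical, and the authors' actual synchrony measure is likely windowed and leave-one-out rather than this whole-signal correlation.

```python
import numpy as np

def classify_by_synchrony(subject, group_a, group_b):
    """Assign a subject's signal (1-D array) to the group whose mean
    signal it correlates with more strongly (toy proxy for PS)."""
    def corr(x, y):
        return np.corrcoef(x, y)[0, 1]
    sync_a = corr(subject, group_a.mean(axis=0))  # synchrony with group A
    sync_b = corr(subject, group_b.mean(axis=0))  # synchrony with group B
    return "A" if sync_a > sync_b else "B"
```

With time-aligned EDA or IBI traces as rows of `group_a` and `group_b`, this captures the decision rule, though not the robustness, of the group-synchrony classifier the study evaluates.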

Analyzing Nonverbal Behaviors along with Praising

  • Toshiki Onishi
  • Arisa Yamauchi
  • Ryo Ishii
  • Yushi Aono
  • Akihiro Miyata

In this work, as a first attempt to analyze the relationship between praising skills and human behavior in dialogue, we focus on head and face behavior. We create a new dialogue corpus that includes face and head behavior information for persons who give praise (praiser) and receive praise (receiver), along with the degree of success of praising (praising score). We also create a machine learning model that uses features related to head and face behavior to estimate the praising score, and we clarify which features of the praiser and receiver are important in this estimation. The analysis shows that features of both the praiser and receiver are important for estimating the praising score, and that features related to utterance, head, gaze, and chin are especially informative. Analysis of the features of high importance revealed that the praiser and receiver should face each other without turning their heads to the left or right, and that the longer the praiser's utterance, the more successful the praising.

Automated Time Synchronization of Cough Events from Multimodal Sensors in Mobile Devices

  • Tousif Ahmed
  • Mohsin Y. Ahmed
  • Md Mahbubur Rahman
  • Ebrahim Nemati
  • Bashima Islam
  • Korosh Vatanparvar
  • Viswam Nathan
  • Daniel McCaffrey
  • Jilong Kuang
  • Jun Alex Gao

Tracking the type and frequency of cough events is critical for monitoring respiratory diseases. Coughs are among the most common symptoms of respiratory and infectious diseases like COVID-19, and a cough monitoring system can be vital for remote monitoring during a pandemic. While existing solutions for cough monitoring use unimodal (e.g., audio) approaches for detecting coughs, a fusion of multimodal sensors (e.g., audio and accelerometer) from multiple devices (e.g., phone and watch) is likely to uncover additional insights and can help track the exacerbation of respiratory conditions. However, such multimodal and multidevice fusion requires accurate time synchronization, which is challenging because coughs are very short events (0.3-0.7 seconds). In this paper, we first demonstrate the time synchronization challenges of cough alignment based on cough data collected in two studies. We then highlight the performance of a cross-correlation-based time synchronization algorithm on the alignment of cough events. Our algorithm can synchronize 98.9% of cough events across two devices with an average synchronization error of 0.046 s.
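
Cross-correlation alignment of the kind described can be sketched as follows; this is a generic illustration of the technique, not the authors' implementation:

```python
import numpy as np

def estimate_lag(ref, sig, fs):
    """Estimate the time offset (in seconds) of `sig` relative to `ref`
    by locating the peak of their full cross-correlation.
    fs: sampling rate in Hz, assumed identical for both devices."""
    ref = ref - ref.mean()
    sig = sig - sig.mean()
    corr = np.correlate(sig, ref, mode="full")
    # index (len(ref) - 1) corresponds to zero lag in 'full' mode
    lag_samples = np.argmax(corr) - (len(ref) - 1)
    return lag_samples / fs
```

Once the lag is estimated, the later signal can be shifted by that many samples so that cough events from both devices line up.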

Conventional and Non-conventional Job Interviewing Methods: A Comparative Study in Two Countries

  • Kumar Shubham
  • Emmanuelle Patricia Kleinlogel
  • Anaïs Butera
  • Marianne Schmid Mast
  • Dinesh Babu Jayagopi

With recent advancements in technology, new platforms have emerged to substitute for face-to-face interviews. Of particular interest are asynchronous video interviewing (AVI) platforms, where candidates talk to a screen displaying questions, and virtual-agent-based interviewing platforms, where a human-like avatar interviews candidates. These anytime-anywhere interviewing systems scale up the overall reach of the interviewing process for firms, though they may not provide the best experience for candidates. An important research question is how candidates perceive such platforms and how this perception impacts their performance and behavior. Also, is there an advantage of one setting over another (i.e., Avatar vs. Platform)? Finally, are such differences consistent across cultures? In this paper, we present the results of a comparative study conducted in three different interview settings (i.e., Face-to-face, Avatar, and Platform) and two different cultural contexts (i.e., India and Switzerland), and we analyze the differences in self-rated performance, others-rated performance, and automatic audiovisual behavioral cues.

Detection of Listener Uncertainty in Robot-Led Second Language Conversation Practice

  • Ronald Cumbal
  • José Lopes
  • Olov Engwall

Uncertainty is a frequently occurring affective state that learners experience during the acquisition of a second language. This state can constitute both a learning opportunity and a source of learner frustration. Detecting it appropriately could therefore benefit the learning process by reducing cognitive instability. In this study, we use a dyadic practice conversation between an adult second-language learner and a social robot to elicit events of uncertainty through the manipulation of the robot's spoken utterances (increased lexical complexity or prosody modifications). The characteristics of these events are then used to analyze multi-party practice conversations between a robot and two learners. Classification models are trained with multimodal features from annotated events of listener (un)certainty. We report the performance of our models across different settings, (sub)turn segments, and multimodal inputs.

Effect of Modality on Human and Machine Scoring of Presentation Videos

  • Haley Lepp
  • Chee Wee Leong
  • Katrina Roohr
  • Michelle Martin-Raugh
  • Vikram Ramanarayanan

We investigate the effect of observed data modality on human and machine scoring of informative presentations in the context of oral English communication training and assessment. Three sets of raters scored the content of three-minute presentations by college students on the basis of the video, the audio, or the text transcript, using a custom scoring rubric. We find significant differences between the scores assigned when raters read a transcript or listen to audio recordings compared to watching a video of the same presentation, and we present an analysis of those differences. Using the human scores, we train machine learning models to score a given presentation using text, audio, and video features separately. We analyze the distribution of machine scores against the modality and label bias we observe in human scores, discuss the implications for machine scoring, and recommend best practices for future work in this direction. Our results demonstrate the importance of checking and correcting for bias across different modalities in evaluations of multimodal performances.

Examining the Link between Children's Cognitive Development and Touchscreen Interaction Patterns

  • Ziyang Chen
  • Yu-Peng Chen
  • Alex Shaw
  • Aishat Aloba
  • Pavlo Antonenko
  • Jaime Ruiz
  • Lisa Anthony

It is well established that children's touch and gesture interactions on touchscreen devices are different from those of adults, with much prior work showing that children's input is recognized more poorly than adults' input. In addition, researchers have shown that recognition of touchscreen input is poorest for young children and improves for older children when simply considering their age; however, individual differences in cognitive and motor development could also affect children's input. An understanding of how cognitive and motor skills influence touchscreen interactions, as opposed to only coarser measurements like age and grade level, could help in developing personalized and tailored touchscreen interfaces for each child. To investigate how cognitive and motor development may be related to children's touchscreen interactions, we conducted a study of 28 participants aged 4 to 7 that included validated assessments of the children's motor and cognitive skills as well as typical touchscreen target acquisition and gesture tasks. We correlated participants' touchscreen behaviors to their cognitive development level, including both fine motor skills and executive function. We compare our analysis of touchscreen interactions based on cognitive and motor development to prior work based on children's age. We show that all four factors (age, grade level, motor skill, and executive function) show similar correlations with target miss rates and gesture recognition rates. Thus, we conclude that age and grade level are sufficiently sensitive when considering children's touchscreen behaviors.

Gaze Tracker Accuracy and Precision Measurements in Virtual Reality Headsets

  • Jari Kangas
  • Olli Koskinen
  • Roope Raisamo

To effectively utilize a gaze tracker in user interaction, it is important to know the quality of the gaze data that it is measuring. We have developed a method to evaluate the accuracy and precision of gaze trackers in virtual reality headsets. The method consists of two software components. The first component is simulation software that calibrates the gaze tracker and then performs data collection by providing a gaze target that moves around the headset's field of view. The second component performs an off-line analysis of the logged gaze data and provides a number of measurements of accuracy and precision. The analysis results consist of the accuracy and precision of the gaze tracker in different directions inside the virtual 3D space. Our method combines the measurements into overall accuracy and precision. Visualizations of the measurements are created to reveal possible trends over the display area. Results from selected areas of the display are analyzed to find differences between the areas (for example, the middle vs. the outer edge of the display, or the upper vs. lower part of the display).
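
Common definitions of gaze accuracy (mean angular offset from the target) and precision (RMS of successive sample-to-sample angular deviations) can be sketched as below; the paper may compute these quantities differently:

```python
import numpy as np

def angular_error_deg(gaze_dirs, target_dir):
    """Angular offsets in degrees between measured gaze directions
    (rows of unit 3D vectors) and the true target direction."""
    cos = np.clip(gaze_dirs @ target_dir, -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def accuracy_and_precision(gaze_dirs, target_dir):
    """Accuracy: mean angular offset from the target.
    Precision: RMS of successive inter-sample angular deviations
    (one common definition among several in the eye-tracking literature)."""
    err = angular_error_deg(gaze_dirs, target_dir)
    inter = np.degrees(np.arccos(np.clip(
        np.sum(gaze_dirs[1:] * gaze_dirs[:-1], axis=1), -1.0, 1.0)))
    return err.mean(), np.sqrt(np.mean(inter ** 2))
```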

Leniency to those who confess?: Predicting the Legal Judgement via Multi-Modal Analysis

  • Liang Yang
  • Jingjie Zeng
  • Tao Peng
  • Xi Luo
  • Jinghui Zhang
  • Hongfei Lin

Legal Judgement Prediction (LJP) is now under the spotlight. It usually consists of multiple sub-tasks, such as penalty prediction (fine and imprisonment) and prediction of the applicable articles of law. Penalty predictions are often closely related to the trial process, especially the attitude of the criminal suspect, which influences the judgment of the presiding judge to some extent. In this paper, we first construct a multi-modal dataset of 517 intentional-assault cases, which contains trial information as well as the attitude of the suspect. Then, we explore the relationship between the suspect's attitude and the term of imprisonment. Finally, we use the proposed multi-modal model to predict the suspect's attitude and compare it with several strong baselines. Our experimental results show that the attitude of the criminal suspect is closely related to penalty prediction, which provides a new perspective for LJP.

Multimodal Assessment of Oral Presentations using HMMs

  • Everlyne Kimani
  • Prasanth Murali
  • Ameneh Shamekhi
  • Dhaval Parmar
  • Sumanth Munikoti
  • Timothy Bickmore

Audience perceptions of public speakers' performance change over time. Some speakers start strong but quickly transition to mundane delivery, while others may have a few impactful and engaging portions of their talk preceded and followed by more pedestrian delivery. In this work, we model the time-varying qualities of a presentation as perceived by the audience and use these models both to provide diagnostic information to presenters and to improve the quality of automated performance assessments. In particular, we use HMMs to model various dimensions of perceived quality and how they change over time, and we use the sequence of quality states to improve feedback and predictions. We evaluate this approach on a corpus of 74 presentations given in a controlled environment. Multimodal features, spanning acoustic qualities, speech disfluencies, and nonverbal behavior, were derived both automatically and manually using crowdsourcing. Ground truth on audience perceptions was obtained using judge ratings on both overall presentations (aggregate) and portions of presentations segmented by topic. We distilled the overall presentation quality into states representing the presenter's gaze, audio, gesture, audience interaction, and proxemic behaviors. We demonstrate that an HMM-based state representation of presentations improves performance assessments.
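
The state-sequence decoding at the heart of such an HMM can be illustrated with a toy discrete Viterbi decoder; the paper's observation features and quality states are far richer than this sketch assumes:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden state sequence for a discrete observation
    sequence under an HMM (log-domain Viterbi).
    obs: observation indices; pi: initial state probabilities;
    A[i, j]: transition i -> j; B[i, k]: probability state i emits k."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))           # best log-score ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + np.log(B[:, obs[t]])
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):      # trace back the best path
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```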

Multimodal Gated Information Fusion for Emotion Recognition from EEG Signals and Facial Behaviors

  • Soheil Rayatdoost
  • David Rudrauf
  • Mohammad Soleymani

Emotions associated with neural and behavioral responses are detectable through scalp electroencephalogram (EEG) signals and measures of facial expressions. We propose a multimodal deep representation learning approach for emotion recognition from EEG and facial expression signals. The proposed method involves the joint learning of a unimodal representation aligned with the other modality through cosine similarity, and a gated fusion for combining the modalities. We evaluated our method on two databases: DAI-EF and MAHNOB-HCI. The results show that our deep representation is able to learn mutual and complementary information between EEG signals and facial behavior, captured by action units, head movements, and eye movements from face videos, in a manner that generalizes across databases. It outperforms similar fusion methods for the task at hand.
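
A gated fusion of two aligned unimodal embeddings can be sketched as follows; the weight shapes and gating form here are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(eeg, face, W, b):
    """Fuse two d-dimensional embeddings: a learned gate z decides,
    per dimension, how much each modality contributes.
    W: (d, 2d) gate weights; b: (d,) gate bias (both learned in practice)."""
    z = sigmoid(W @ np.concatenate([eeg, face]) + b)
    return z * eeg + (1.0 - z) * face
```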

OpenSense: A Platform for Multimodal Data Acquisition and Behavior Perception

  • Kalin Stefanov
  • Baiyu Huang
  • Zongjian Li
  • Mohammad Soleymani

Automatic multimodal acquisition and understanding of social signals is an essential building block for natural and effective human-machine collaboration and communication. This paper introduces OpenSense, a platform for real-time multimodal acquisition and recognition of social signals. OpenSense enables precisely synchronized and coordinated acquisition and processing of human behavioral signals. Powered by Microsoft's Platform for Situated Intelligence, OpenSense supports a range of sensor devices and machine learning tools and encourages developers to add new components to the system through straightforward mechanisms for component integration. The platform also offers an intuitive graphical user interface to build application pipelines from existing components. OpenSense is freely available for academic research.

Personalized Modeling of Real-World Vocalizations from Nonverbal Individuals

  • Jaya Narain
  • Kristina T. Johnson
  • Craig Ferguson
  • Amanda O'Brien
  • Tanya Talkar
  • Yue Zhang Weninger
  • Peter Wofford
  • Thomas Quatieri
  • Rosalind Picard
  • Pattie Maes

Nonverbal vocalizations contain important affective and communicative information, especially for those who do not use traditional speech, including individuals who have autism and are non- or minimally verbal (nv/mv). Although these vocalizations are often understood by those who know them well, they can be challenging to understand for the community-at-large. This work presents (1) a methodology for collecting spontaneous vocalizations from nv/mv individuals in natural environments, with no researcher present, and personalized in-the-moment labels from a family member; (2) speaker-dependent classification of these real-world sounds for three nv/mv individuals; and (3) an interactive application to translate the nonverbal vocalizations in real time. Using support-vector machine and random forest models, we achieved speaker-dependent unweighted average recalls (UARs) of 0.75, 0.53, and 0.79 for the three individuals, respectively, with each model discriminating between 5 nonverbal vocalization classes. We also present first results for real-time binary classification of positive- and negative-affect nonverbal vocalizations, trained using a commercial wearable microphone and tested in real time using a smartphone. This work informs personalized machine learning methods for non-traditional communicators and advances real-world interactive augmentative technology for an underserved population.
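
The UAR metric reported above averages per-class recalls, so rare vocalization classes count as much as frequent ones; a minimal sketch:

```python
def unweighted_average_recall(y_true, y_pred):
    """Unweighted average recall (UAR): the mean of per-class recalls,
    unaffected by how many samples each class contributes."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(classes)
```

With five balanced-weighted classes, chance-level UAR is 0.2, which puts the reported 0.53-0.79 scores in context.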

Predicting the Effectiveness of Systematic Desensitization Through Virtual Reality for Mitigating Public Speaking Anxiety

  • Margaret von Ebers
  • Ehsanul Haque Nirjhar
  • Amir H. Behzadan
  • Theodora Chaspari

Public speaking is central to socialization in casual, professional, or academic settings. Yet, public speaking anxiety (PSA) is known to impact a considerable portion of the general population. This paper utilizes bio-behavioral indices captured from wearable devices to quantify the effectiveness of systematic exposure to virtual reality (VR) audiences for mitigating PSA. The effect of separate bio-behavioral features and demographic factors is studied, as well as the amount of necessary data from the VR sessions that can yield a reliable predictive model of the VR training effectiveness. Results indicate that acoustic and physiological reactivity during the VR exposure can reliably predict change in PSA before and after the training. With the addition of demographic features, both acoustic and physiological feature sets achieve improvements in performance. Finally, using bio-behavioral data from six to eight VR sessions can yield reliable prediction of PSA change. Findings of this study will enable researchers to better understand how bio-behavioral factors indicate improvements in PSA with VR training.

Punchline Detection using Context-Aware Hierarchical Multimodal Fusion

  • Akshat Choube
  • Mohammad Soleymani

Humor has a history as old as humanity. It often induces laughter and elicits amusement and engagement. Humorous behavior is manifested in different modalities, including language, voice tone, and gestures; thus, automatic understanding of humorous behavior requires multimodal behavior analysis. Humor detection is a well-established problem in Natural Language Processing, but its multimodal analysis is less explored. In this paper, we present a context-aware hierarchical fusion network for multimodal punchline detection. The proposed neural architecture first fuses the modalities two by two and then fuses all three modalities. The network also models the context of the punchline using Gated Recurrent Units. The model's performance is evaluated on the UR-FUNNY database, yielding state-of-the-art performance.

ROSMI: A Multimodal Corpus for Map-based Instruction-Giving

  • Miltiadis Marios Katsakioris
  • Ioannis Konstas
  • Pierre Yves Mignotte
  • Helen Hastie

We present the publicly-available Robot Open Street Map Instructions (ROSMI) corpus: a rich multimodal dataset of map and natural language instruction pairs collected via crowdsourcing. The goal of this corpus is to aid the advancement of state-of-the-art visual-dialogue tasks, including reference resolution and robot-instruction understanding. The domain described here concerns robots and autonomous systems used for inspection and emergency response. The ROSMI corpus is unique in that it captures interaction grounded in map-based visual stimuli that are both human-readable and contain the rich metadata needed to plan and deploy robots and autonomous systems, thus facilitating human-robot teaming.

The iCub Multisensor Datasets for Robot and Computer Vision Applications

  • Murat Kirtay
  • Ugo Albanese
  • Lorenzo Vannucci
  • Guido Schillaci
  • Cecilia Laschi
  • Egidio Falotico

Multimodal information can significantly increase the perceptual capabilities of robotic agents, at the cost of more complex sensory processing. This complexity can be reduced by employing machine learning techniques, provided that there is enough meaningful data to train on. This paper reports on novel datasets constructed by employing the iCub robot equipped with an additional depth sensor and color camera. We used the robot to acquire color and depth information for 210 objects in different acquisition scenarios. The result is a set of large-scale datasets that can be used for robot and computer vision applications: multisensory object representation, action recognition, and rotation- and distance-invariant object recognition.

The Sensory Interactive Table: Exploring the Social Space of Eating

  • Roelof A. J. de Vries
  • Juliet A. M. Haarman
  • Emiel C. Harmsen
  • Dirk K. J. Heylen
  • Hermie J. Hermens

Eating is in many ways a social activity. Yet little is known about how the social dimension of eating influences individual eating habits, nor about how to purposefully design for interactions in the social space of eating. This paper presents (1) the journey of exploring the social space of eating by designing an artifact, and (2) the actual artifact designed for the purpose of exploring the interaction dynamics of social eating. The result of this Research through Design journey is the Sensory Interactive Table: an interactive dining table based on explorations of the social space of eating, and a probe to explore that space further.

Touch Recognition with Attentive End-to-End Model

  • Wail El Bani
  • Mohamed Chetouani

Touch is the earliest sense to develop and the first means of contact with the external world. Touch also plays a key role in our socio-emotional communication: we use it to communicate our feelings, elicit strong emotions in others, and modulate behavior (e.g., compliance). Despite its relevance, touch is an understudied modality in Human-Machine Interaction compared to audition and vision. Most social touch recognition systems require a feature engineering step, making them difficult to compare and to generalize to other databases. In this paper, we propose an end-to-end approach: an attention-based model for touch gesture recognition, evaluated on two public datasets (CoST and HAART) in the context of the ICMI 15 Social Touch Challenge. Our model achieves a comparable level of accuracy (61% for CoST and 68% for HAART) and uses self-attention as an alternative to feature engineering and Recurrent Neural Networks.
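
Single-head scaled dot-product self-attention, the building block referenced above, can be sketched as follows; the projection weights and dimensions are illustrative, not the paper's:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence of
    touch frames X (T x d): each frame attends to every other frame,
    letting the model weight informative moments of the gesture instead
    of relying on hand-crafted features."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V
```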

SESSION: Doctoral Consortium Papers

Automating Facilitation and Documentation of Collaborative Ideation Processes

  • Matthias Merk

My research is in the field of computer-supported and -enabled innovation processes, in particular focusing on the first phases of ideation in a co-located environment. I am developing a concept for documenting, tracking, and enhancing creative ideation processes. The basis of this concept is a set of key figures derived from various systems within the ideation sessions. The system designed in my doctoral thesis enables interdisciplinary teams to kick-start creativity by automating facilitation, moderation, creativity support, and documentation of the process. Using the example of brainstorming, a standing table is equipped with camera- and microphone-based sensing as well as multiple ways of interaction and visualization through projection and LED lights. The user interaction with the table is implicit and based on real-time metadata generated by the users of the system. System actions are calculated based on what is happening on the table, using object recognition. Everything on the table influences the system, making it a multimodal input and output device with implicit interaction. While the technical aspects of my research are close to complete, the more challenging evaluation phase will benefit from feedback from the specialists for multimodal interaction at ICMI '20.

Detection of Micro-expression Recognition Based on Spatio-Temporal Modelling and Spatial Attention

  • Mengjiong Bai

My PhD project aims to contribute to affective computing applications that assist in depression diagnosis through micro-expression recognition. My motivation is the similarity between the low-intensity facial expressions in micro-expressions and the low-intensity facial expressions ('frozen face') of people with psychomotor retardation caused by depression. The project will focus, firstly, on investigating spatio-temporal modelling and attention systems for micro-expression recognition (MER) and, secondly, on exploring the role of micro-expressions in automated depression analysis by improving deep learning architectures to detect low-intensity facial expressions. This work will investigate different deep learning architectures (e.g., Temporal Convolutional Networks (TCNN) or Gated Recurrent Units (GRU)) and validate the results on publicly available micro-expression benchmark datasets to quantitatively analyse the robustness and accuracy of MER's contribution to improving automatic depression analysis. Moreover, video magnification, as a way to enhance small movements, will be combined with the deep learning methods to address the low-intensity issue in MER.

How to Complement Learning Analytics with Smartwatches?: Fusing Physical Activities, Environmental Context, and Learning Activities

  • George-Petru Ciordas-Hertel

To obtain a holistic perspective on learning, a multimodal technical infrastructure for Learning Analytics (LA) can be beneficial. Recent studies have investigated various aspects of technical LA infrastructure. However, it has not yet been explored how LA indicators can be complemented with Smartwatch sensor data to detect physical activity and the environmental context. Sensor data, such as the accelerometer, are often used in related work to infer a specific behavior and environmental context, thus triggering interventions on a just-in-time basis. In this dissertation project, we plan to use Smartwatch sensor data to explore further indicators for learning from blended learning sessions conducted in-the-wild, e.g., at home. Such indicators could be used within learning sessions to suggest breaks, or afterward to support learners in reflection processes.

We plan to investigate the following three research questions: (RQ1) How can a multimodal learning analytics infrastructure be designed to support real-time data acquisition and processing effectively? (RQ2) How can smartwatch sensor data be used to infer environmental context and physical activities to complement learning analytics indicators for blended learning sessions? (RQ3) How can we align the extracted multimodal indicators with pedagogical interventions?

RQ1 was investigated through a structured literature review and eleven semi-structured interviews with LA infrastructure developers. For RQ2, we are currently designing and implementing a multimodal learning analytics infrastructure to collect and process sensor and experience data from smartwatches. Finally, for RQ3, an exploratory field study will be conducted to extract multimodal learning indicators and examine them with learners and pedagogical experts to develop effective interventions.

Researchers, educators, and learners can use and adapt our contributions to gain new insights into learners' time and learning tactics, and physical learning spaces from learning sessions taking place in-the-wild.

Multimodal Groups' Analysis for Automated Cohesion Estimation

  • Lucien Maman

Groups are attracting more and more attention from scholars. With the rise of Social Signal Processing (SSP), many studies grounded in Social Sciences and Psychology findings have focused on detecting and classifying groups' dynamics. Cohesion plays an important role in these group dynamics and is one of the most studied emergent states, involving both group motions and goals. This PhD project aims to provide a computational model that addresses the multidimensionality of cohesion and captures its subtle dynamics. It will offer new opportunities to develop applications that enhance interactions among humans as well as between humans and machines.

Multimodal Physiological Synchrony as Measure of Attentional Engagement

  • Ivo Stuldreher

When interested in monitoring attentional engagement, physiological signals can be of great value. A popular approach is to uncover the complex patterns between physiological signals and attentional engagement using supervised learning models, but it is often unclear which physiological measures can best be used in such models, and collecting enough training data with a reliable ground truth to train such a model is very challenging. Rather than using physiological responses of individual participants and specific events in a trained model, one can also continuously determine the degree to which physiological measures of multiple individuals change uniformly, often referred to as physiological synchrony. Since a directly proportional relation between physiological synchrony in brain activity and attentional engagement has been pointed out in the literature, no trained model is needed to link the two. I aim to create a more robust measure of attentional engagement among groups of individuals by combining electroencephalography (EEG), electrodermal activity (EDA), and heart rate into a multimodal metric of physiological synchrony. I formulate three main research questions in the current research proposal: 1) How does physiological synchrony in measures from the central and peripheral nervous system relate to attentional engagement? 2) Does physiological synchrony reliably reflect shared attentional engagement in real-world use cases? 3) How can these physiological measures be fused to obtain a multimodal metric of physiological synchrony that outperforms unimodal synchrony?

Personalised Human Device Interaction through Context aware Augmented Reality

  • Madhawa Perera

Human-device interactions in smart environments are shifting prominently towards naturalistic user interactions such as gaze and gesture. However, ambiguities arise when users have to switch interactions as contexts change. This can confuse users who are accustomed to a set of conventional controls, leading to system inefficiencies. My research explores how to reduce interaction ambiguity by semantically modelling user-specific interactions with context, enabling personalised interactions through AR. Sensory data captured from an AR device are utilised to interpret user interactions and context, which are then modelled in an extendable knowledge graph along with the user's interaction preferences using semantic web standards. These representations give AR applications semantics about the user's intent to interact with a particular device affordance. This research therefore aims to bring semantic modelling of personalised gesture interactions to AR/VR applications for smart/immersive environments.

Robot Assisted Diagnosis of Autism in Children

  • B Ashwini

The diagnosis of autism spectrum disorder is cumbersome even for expert clinicians, owing to the diversity of the symptoms exhibited by children, which depend on the severity of the disorder. Furthermore, the diagnosis is based on behavioural observations and the developmental history of the child, which depend substantially on the perspectives and interpretations of the specialists. In this paper, we present a robot-assisted diagnostic system for the assessment of behavioural symptoms in children, aimed at providing a reliable diagnosis. The robotic assistant is intended to support the specialist in administering the diagnostic task, perceiving and evaluating the task outcomes as well as the behavioural cues for assessment of symptoms, and diagnosing the state of the child. Despite being used widely in education and intervention for children with autism (CWA), the application of robot assistance in diagnosis is less explored. Further, there have been limited studies addressing the acceptance and effectiveness of robot-assisted interventions for CWA in the Global South. We aim to develop a robot-assisted diagnostic framework for CWA to support the experts and to study the viability of such a system in the Indian context.

Supporting Instructors to Provide Emotional and Instructional Scaffolding for English Language Learners through Biosensor-based Feedback

  • Heera Lee

Delivering a presentation has been reported as one of the most anxiety-provoking tasks faced by English Language Learners. Researchers suggest that instructors should be more aware of the learners' emotional states in order to provide appropriate emotional and instructional scaffolding to improve their performance when presenting. Despite the critical role of instructors in perceiving the emotional states of English language learners, it can be challenging to do so solely by observing the learners' facial expressions, behaviors, and limited verbal expressions, due to language and cultural barriers. To address the ambiguity and inconsistency in interpreting the emotional states of students, this research focuses on identifying the potential of biosensor-based feedback from learners to support instructors. A novel approach has been adopted to classify the intensity and characteristics of public speaking anxiety and foreign language anxiety among English language learners and to provide tailored feedback to instructors in support of teaching and learning. As part of this work, two further studies were proposed. The first study was designed to identify educators' needs for solutions providing emotional and instructional support. The second study aims to evaluate a resulting prototype, from the instructors' perspective, for offering tailored emotional and instructional scaffolding to students. The contribution of these studies includes the development of guidance on using biosensor-based feedback that will assist English language instructors in teaching and in identifying students' anxiety levels and types while they deliver a presentation.

Towards a Multimodal and Context-Aware Framework for Human Navigational Intent Inference

  • Zhitian Zhang

A socially acceptable robot needs to make correct decisions and be able to understand human intent in order to interact with and navigate around humans safely. Although research in computer vision and robotics has made huge advances in recent years, today's robotic systems still need a better understanding of human intent to be more effective and widely accepted. Currently, such inference is typically done using only one mode of perception, such as vision or human movement trajectory. In this extended abstract, I describe my PhD research plan of developing a novel multimodal and context-aware framework, in which a robot infers human navigational intentions through multimodal perception comprising temporal facial, body pose and gaze features, human motion features, as well as environmental context. To facilitate this framework, a data collection experiment is designed to acquire multimodal human-robot interaction data. Our initial design of the framework is based on a temporal neural network model with human motion, body pose and head orientation features as input, and we will increase the complexity of the neural network model as well as the input features along the way. In the long term, this framework can benefit a variety of settings such as autonomous driving, service and household robots.

Towards Multimodal Human-Like Characteristics and Expressive Visual Prosody in Virtual Agents

  • Mireille Fares

One of the key challenges in designing Embodied Conversational Agents (ECA) is to produce human-like gestural and visual prosody expressivity. Another major challenge is to maintain the interlocutor's attention by adapting the agent's behavior to the interlocutor's multimodal behavior. This paper outlines my PhD research plan that aims to develop convincing expressive and natural behavior in ECAs and to explore and model the mechanisms that govern human-agent multimodal interaction. Additionally, I describe in this paper my first PhD milestone, which focuses on developing an end-to-end LSTM neural network model for upper-face gesture generation. The main task consists of building a model that can produce expressive and coherent upper-face gestures while considering multiple modalities: speech audio, text, and action units.
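
As a rough illustration of this kind of model (not the author's actual architecture), the sketch below hand-rolls a single LSTM cell that consumes a concatenated per-frame multimodal vector and maps the final hidden state to action-unit intensities; all feature sizes and weights are invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell (gates stacked as i, f, o, g)."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# Hypothetical per-frame multimodal input: audio (13 MFCCs), a text
# embedding (8-d) and action units (5-d), concatenated into one vector.
d_in, d_h, n_aus = 13 + 8 + 5, 16, 5
W = rng.standard_normal((4 * d_h, d_in)) * 0.1
U = rng.standard_normal((4 * d_h, d_h)) * 0.1
b = np.zeros(4 * d_h)
W_out = rng.standard_normal((n_aus, d_h)) * 0.1  # hidden -> AU intensities

h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(20):                       # 20 synthetic frames
    x = rng.standard_normal(d_in)
    h, c = lstm_step(x, h, c, W, U, b)
aus = sigmoid(W_out @ h)                  # predicted upper-face AUs in [0, 1]
```

In a trained system the weights would be learned end to end from speech, text, and action-unit data rather than sampled at random.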

Towards Real-Time Multimodal Emotion Recognition among Couples

  • George Boateng

Researchers are interested in understanding the emotions of couples as it relates to relationship quality and dyadic management of chronic diseases. Currently, the process of assessing emotions is manual, time-intensive, and costly. Despite the existence of works on emotion recognition among couples, there exists no ubiquitous system that recognizes the emotions of couples in everyday life while addressing the complexity of dyadic interactions such as turn-taking in couples' conversations. In this work, we seek to develop a smartwatch-based system that leverages multimodal sensor data to recognize each partner's emotions in daily life. We are collecting data from couples in the lab and in the field and we plan to use the data to develop multimodal machine learning models for emotion recognition. Then, we plan to implement the best models in a smartwatch app and evaluate its performance in real-time and everyday life through another field study. Such a system could enable research both in the lab (e.g. couple therapy) or in daily life (assessment of chronic disease management or relationship quality) and enable interventions to improve the emotional well-being, relationship quality, and chronic disease management of couples.

Zero-Shot Learning for Gesture Recognition

  • Naveen Madapana

Zero-Shot Learning (ZSL) is a new paradigm in machine learning that aims to recognize classes that are not present in the training data. Hence, this paradigm is capable of comprehending categories that were never seen before. While deep learning has pushed the limits of unseen object recognition, ZSL for temporal problems such as unfamiliar gesture recognition (referred to as ZSGL) remains unexplored. ZSGL has the potential to result in efficient human-machine interfaces that can recognize and understand the spontaneous and conversational gestures of humans. In this regard, the objective of this work is to conceptualize, model and develop a framework to tackle ZSGL problems. The first step in the pipeline is to develop a database of gesture attributes that are representative of a range of categories. Next, a deep architecture consisting of convolutional and recurrent layers is proposed to jointly optimize the semantic and classification losses. Lastly, rigorous experiments are performed to compare the proposed model with existing ZSL models on the CGD 2013 and MSRC-12 datasets. In our preliminary work, we identified a list of 64 discriminative attributes related to gestures' morphological characteristics. Our approach yields an unseen class accuracy of 41%, which outperforms the state-of-the-art approaches by a considerable margin. Future work involves the following: 1. Modifying the existing architecture in order to improve the ZSL accuracy, 2. Augmenting the database of attributes to incorporate semantic properties, 3. Addressing the issue of data imbalance which is inherent to ZSL problems, and 4. Expanding this research to other domains such as surgeme and action recognition.
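
The attribute-based zero-shot step can be illustrated with a toy sketch: a gesture is mapped to a vector of predicted attributes, and the unseen class whose attribute signature is most similar wins. The attribute values, class names, and cosine-similarity choice below are invented for illustration; the paper's actual model is a deep convolutional-recurrent network over 64 attributes.

```python
import numpy as np

# Hypothetical attribute signatures for three unseen gesture classes
# (rows: classes, cols: binary morphological attributes).
unseen_signatures = np.array([
    [1, 0, 1, 0],   # e.g. "swipe": one-handed, horizontal motion
    [0, 1, 1, 1],   # e.g. "circle"
    [1, 1, 0, 0],   # e.g. "push"
], dtype=float)

def zero_shot_predict(predicted_attributes, signatures):
    """Assign the unseen class whose attribute signature is closest
    (by cosine similarity) to the attributes predicted for the gesture."""
    a = predicted_attributes / np.linalg.norm(predicted_attributes)
    s = signatures / np.linalg.norm(signatures, axis=1, keepdims=True)
    return int(np.argmax(s @ a))

# A gesture whose attribute predictor fired strongly on attributes 0 and 2
# lands on the first signature.
pred = np.array([0.9, 0.1, 0.8, 0.2])
label = zero_shot_predict(pred, unseen_signatures)
```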

SESSION: Demo and Exhibit Papers

Alfie: An Interactive Robot with Moral Compass

  • Cigdem Turan
  • Patrick Schramowski
  • Constantin Rothkopf
  • Kristian Kersting

This work introduces Alfie, an interactive robot capable of answering moral (deontological) questions posed by a user. The interaction is designed so that the user can offer an alternative answer when they disagree with the given one, allowing Alfie to learn from its interactions. Alfie's answers are based on a sentence embedding model that uses state-of-the-art language models, e.g. the Universal Sentence Encoder and BERT. Alfie is implemented on a Furhat Robot, which provides a customizable user interface to design a social robot.

FairCVtest Demo: Understanding Bias in Multimodal Learning with a Testbed in Fair Automatic Recruitment

  • Alejandro Peña
  • Ignacio Serna
  • Aythami Morales
  • Julian Fierrez

With the aim of studying how current multimodal AI algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data, this demonstrator experiments over an automated recruitment testbed based on Curriculum Vitae: FairCVtest. The presence of decision-making algorithms in society is rapidly increasing, while concerns about their transparency and the possibility of these algorithms becoming new sources of discrimination are arising. This demo shows the capacity of the Artificial Intelligence (AI) behind a recruitment tool to extract sensitive information from unstructured data and exploit it in combination with data biases in undesirable (unfair) ways. Additionally, the demo includes a new algorithm (SensitiveNets) for discrimination-aware learning which eliminates sensitive information in our multimodal AI framework.

LieCatcher: Game Framework for Collecting Human Judgments of Deceptive Speech

  • Sarah Ita Levitan
  • Xinyue Tan
  • Julia Hirschberg

Humans are notoriously poor at detecting deception --- most are worse than chance. To address this issue we have developed LieCatcher, a single-player web-based Game With A Purpose (GWAP) that allows players to assess their lie detection skills while providing human judgments of deceptive speech. Players listen to audio recordings drawn from a corpus of deceptive and non-deceptive interview dialogues, and guess if the speaker is lying or telling the truth. They are awarded points for correct guesses and at the end of the game they receive a score summarizing their performance at lie detection. We present the game design and implementation, and describe a crowdsourcing experiment conducted to study perceived deception.

Spark Creativity by Speaking Enthusiastically: Communication Training using an E-Coach

  • Carla Viegas
  • Albert Lu
  • Annabel Su
  • Carter Strear
  • Yi Xu
  • Albert Topdjian
  • Daniel Limon
  • J.J. Xu

Enthusiasm in speech has a huge impact on listeners. Students of enthusiastic teachers show better performance. Enthusiastic leaders influence employees' innovative behavior and can also spark excitement in customers. We, at TalkMeUp, want to help people learn how to talk with enthusiasm in order to spark creativity among their listeners. In this work we present a multimodal speech analysis platform. We provide feedback on enthusiasm by analyzing eye contact, facial expressions, voice prosody, and text content.

The AI-Medic: A Multimodal Artificial Intelligent Mentor for Trauma Surgery

  • Edgar Rojas-Muñoz
  • Kyle Couperus
  • Juan P. Wachs

Telementoring generalist surgeons as they treat patients can be essential when in situ expertise is not readily available. However, adverse cyber-attacks, unreliable network conditions, and remote mentors' predisposition can significantly jeopardize the remote intervention. To provide medical practitioners with guidance when mentors are unavailable, we present the AI-Medic, the initial steps towards the development of a multimodal intelligent artificial system for autonomous medical mentoring. The system uses a tablet device to acquire the view of an operating field. This imagery is provided to an encoder-decoder neural network trained to predict medical instructions from the current view of a surgery. The network was trained using DAISI, a dataset of images and instructions providing step-by-step demonstrations of surgical procedures. The predicted medical instructions are conveyed to the user via visual and auditory modalities.

SESSION: Grand Challenge Papers: Emotion Recognition in the Wild Challenge

A Multi-Modal Approach for Driver Gaze Prediction to Remove Identity Bias

  • Zehui Yu
  • Xiehe Huang
  • Xiubao Zhang
  • Haifeng Shen
  • Qun Li
  • Weihong Deng
  • Jian Tang
  • Yi Yang
  • Jieping Ye

Driver gaze prediction is an important task in Advanced Driver Assistance System (ADAS). Although the Convolutional Neural Network (CNN) can greatly improve the recognition ability, there are still several unsolved problems due to the challenge of illumination, pose and camera placement. To solve these difficulties, we propose an effective multi-model fusion method for driver gaze estimation. Rich appearance representations, i.e. holistic and eyes regions, and geometric representations, i.e. landmarks and Delaunay angles, are separately learned to predict the gaze, followed by a score-level fusion system. Moreover, pseudo-3D appearance supervision and identity-adaptive geometric normalization are proposed to further enhance the prediction accuracy. Finally, the proposed method achieves state-of-the-art accuracy of 82.5288% on the test data, which ranks 1st at the EmotiW2020 driver gaze prediction sub-challenge.
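
A minimal sketch of the score-level fusion idea described above: each stream produces a probability vector over gaze zones, and the fused prediction is the argmax of their (optionally weighted) average. The stream names, zone count, and scores below are made up; the actual system learns these from data.

```python
import numpy as np

# Hypothetical per-stream class scores for one frame over 9 gaze zones:
# two appearance streams (holistic, eyes) and two geometric streams
# (landmarks, Delaunay angles).
streams = {
    "holistic":  np.array([.10, .50, .10, .05, .05, .05, .05, .05, .05]),
    "eyes":      np.array([.05, .60, .10, .05, .05, .05, .04, .03, .03]),
    "landmarks": np.array([.20, .30, .20, .05, .05, .05, .05, .05, .05]),
    "delaunay":  np.array([.10, .40, .20, .05, .05, .05, .05, .05, .05]),
}

def score_level_fusion(stream_scores, weights=None):
    """Weighted average of per-stream probability vectors; the predicted
    gaze zone is the argmax of the fused score."""
    names = sorted(stream_scores)
    w = np.ones(len(names)) if weights is None else np.asarray(weights)
    fused = sum(wi * stream_scores[n] for wi, n in zip(w, names)) / w.sum()
    return fused, int(np.argmax(fused))

fused, zone = score_level_fusion(streams)
```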

Advanced Multi-Instance Learning Method with Multi-features Engineering and Conservative Optimization for Engagement Intensity Prediction

  • Jianming Wu
  • Bo Yang
  • Yanan Wang
  • Gen Hattori

This paper proposes an advanced multi-instance learning method with multi-features engineering and conservative optimization for engagement intensity prediction. It was applied to the EmotiW Challenge 2020 and the results demonstrated the proposed method's good performance. The task is to predict the engagement level when a subject-student is watching an educational video under a range of conditions and in various environments. As engagement intensity has a strong correlation with facial movements, upper-body posture movements and overall environmental movements in a given time interval, we extract and incorporate these motion features into a deep regression model consisting of layers with a combination of long short-term memory (LSTM), gated recurrent unit (GRU) and a fully connected layer. In order to precisely and robustly predict the engagement level in a long video with various situations such as darkness and complex backgrounds, a multi-features engineering function is used to extract synchronized multi-model features in a given period of time by considering both short-term and long-term dependencies. Based on these well-processed engineered multi-features, in the 1st training stage, we train and generate the best models covering all the model configurations to maximize validation accuracy. Furthermore, in the 2nd training stage, to avoid the overfitting problem attributable to the extremely small engagement dataset, we conduct conservative optimization by applying a single Bi-LSTM layer with only 16 units to minimize the overfitting, and split the engagement dataset (train + validation) with 5-fold cross validation (stratified k-fold) to train a conservative model. The proposed method, using a decision-level ensemble of the two training stages' models, finally won second place in the challenge (MSE: 0.061110 on the testing set).
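
The stratified 5-fold split used in the conservative second stage can be sketched as a generic stratified k-fold (this is not the authors' exact splitting code): each fold keeps roughly the same label distribution as the full set.

```python
import numpy as np

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs where each fold preserves the
    label distribution of the full set (stratified k-fold)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = [[] for _ in range(k)]
    # Deal out the shuffled indices of each class round-robin across folds.
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        for i, j in enumerate(idx):
            folds[i % k].append(j)
    for i in range(k):
        val = np.array(sorted(folds[i]))
        train = np.array(sorted(j for f in folds[:i] + folds[i + 1:] for j in f))
        yield train, val

# 40 samples with 4 engagement levels; every fold should see every level.
y = np.repeat([0, 1, 2, 3], 10)
splits = list(stratified_kfold(y, k=5))
```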

EmotiW 2020: Driver Gaze, Group Emotion, Student Engagement and Physiological Signal based Challenges

  • Abhinav Dhall
  • Garima Sharma
  • Roland Goecke
  • Tom Gedeon

This paper introduces the Eighth Emotion Recognition in the Wild (EmotiW) challenge. EmotiW is a benchmarking effort run as a grand challenge of the 22nd ACM International Conference on Multimodal Interaction 2020. It comprises four tasks related to automatic human behavior analysis: a) driver gaze prediction; b) audio-visual group-level emotion recognition; c) engagement prediction in the wild; and d) physiological signal based emotion recognition. The motivation of EmotiW is to bring researchers in affective computing, computer vision, speech processing and machine learning to a common platform for evaluating techniques on common test data. We discuss the challenge protocols, the databases and their associated baselines.

Extract the Gaze Multi-dimensional Information Analysis Driver Behavior

  • Kui Lyu
  • Minghao Wang
  • Liyu Meng

Recent studies have shown that most traffic accidents are related to the driver's engagement in the driving process. Driver gaze is considered an important cue for monitoring driver distraction. While there has been marked improvement in driver gaze region estimation systems, many challenges remain, such as cross-subject testing, perspectives and sensor configuration. In this paper, we propose a Convolutional Neural Network (CNN) based multi-model fusion gaze zone estimation system. Our method mainly consists of two blocks, which implement the extraction of gaze features from RGB images and the estimation of gaze from head pose features. Based on the original input image, a general face processing model is first used to detect the face and localize 3D landmarks, from which the most relevant facial information is then extracted. We implement three face alignment methods to normalize the face information. For the above image-based features, a multi-input CNN classifier achieves reliable classification accuracy. In addition, we design a 2D CNN based PointNet to predict the head pose representation from the 3D landmarks. Finally, we evaluate our best-performing model on the Eighth EmotiW Driver Gaze Prediction sub-challenge test dataset. Our model achieves a competitive overall accuracy of 81.5144% for gaze zone estimation on the cross-subject test dataset.

Fusical: Multimodal Fusion for Video Sentiment

  • Boyang Tom Jin
  • Leila Abdelrahman
  • Cong Kevin Chen
  • Amil Khanzada

Determining the emotional sentiment of a video remains a challenging task that requires multimodal, contextual understanding of a situation. In this paper, we describe our entry into the EmotiW 2020 Audio-Video Group Emotion Recognition Challenge to classify group videos containing large variations in language, people, and environment, into one of three sentiment classes. Our end-to-end approach consists of independently training models for different modalities, including full-frame video scenes, human body keypoints, embeddings extracted from audio clips, and image-caption word embeddings. Novel combinations of modalities, such as laughter and image-captioning, and transfer learning are further developed. We use fully-connected (FC) fusion ensembling to aggregate the modalities, achieving a best test accuracy of 63.9% which is 16 percentage points higher than that of the baseline ensemble.
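
A minimal sketch of fully-connected (FC) fusion over per-modality embeddings: the embeddings are concatenated and passed through one linear layer with a softmax over the three sentiment classes. The modality names, dimensions, and weights below are placeholders; the paper trains these end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embedding sizes (scene, pose, audio, caption).
dims = {"scene": 8, "pose": 4, "audio": 6, "caption": 5}
n_classes = 3  # positive / neutral / negative

# One FC fusion layer over the concatenated modality embeddings
# (weights random here; in the paper they are learned).
W = rng.standard_normal((sum(dims.values()), n_classes)) * 0.1
b = np.zeros(n_classes)

def fuse(embeddings):
    x = np.concatenate([embeddings[m] for m in dims])   # fixed modality order
    logits = x @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()                                   # softmax over classes

probs = fuse({m: rng.standard_normal(d) for m, d in dims.items()})
```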

Group Level Audio-Video Emotion Recognition Using Hybrid Networks

  • Chuanhe Liu
  • Wenqiang Jiang
  • Minghao Wang
  • Tianhao Tang

This paper presents a hybrid network for audio-video group emotion recognition. The proposed architecture includes an audio stream, a facial emotion stream, an environmental object statistics stream (EOS) and a video stream. We adopted this method at the 8th Emotion Recognition in the Wild Challenge (EmotiW 2020). According to the feedback on our submissions, the best result achieved 76.85% on the Video level Group AFfect (VGAF) Test Database, 26.89% higher than the baseline. Such improvements prove that our method is state-of-the-art.

Group-Level Emotion Recognition Using a Unimodal Privacy-Safe Non-Individual Approach

  • Anastasia Petrova
  • Dominique Vaufreydaz
  • Philippe Dessus

This article presents our unimodal privacy-safe and non-individual proposal for the audio-video group emotion recognition subtask at the Emotion Recognition in the Wild (EmotiW) Challenge 2020. This sub-challenge aims to classify in-the-wild videos into three categories: Positive, Neutral and Negative. Recent deep learning models have shown tremendous advances in analyzing interactions between people, predicting human behavior and affective evaluation. Nonetheless, their performance comes from individual-based analysis, which means summing up and averaging scores from individual detections, which inevitably leads to some privacy issues. In this research, we investigated a frugal approach towards a model able to capture the global moods from the whole image without using face or pose detection, or any individual-based feature as input. The proposed methodology mixes state-of-the-art and dedicated synthetic corpora as training sources. With an in-depth exploration of neural network architectures for group-level emotion recognition, we built a VGG-based model achieving 59.13% accuracy on the VGAF test set (eleventh place of the challenge). Given that the analysis is unimodal, based only on global features, and that the performance is evaluated on a real-world dataset, these results are promising and let us envision extending this model to multimodality for classroom ambiance evaluation, our final target application.

Group-level Speech Emotion Recognition Utilising Deep Spectrum Features

  • Sandra Ottl
  • Shahin Amiriparian
  • Maurice Gerczuk
  • Vincent Karas
  • Björn Schuller

The objectives of this challenge paper are twofold: first, we apply a range of neural network based transfer learning approaches to cope with the data scarcity in the field of speech emotion recognition, and second, we fuse the obtained representations and predictions in an early and late fusion strategy to check the complementarity of the applied networks. In particular, we use our Deep Spectrum system to extract deep feature representations from the audio content of the 2020 EmotiW group level emotion prediction challenge data. We evaluate a total of ten ImageNet pre-trained Convolutional Neural Networks, including AlexNet, VGG16, VGG19 and three DenseNet variants, as audio feature extractors. We compare their performance to the ComParE feature set used in the challenge baseline, employing simple logistic regression models trained with Stochastic Gradient Descent as classifiers. With the help of late fusion, our approach improves the performance on the test set from 47.88% to 62.70% accuracy.
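
The classification stage described above (logistic regression trained with stochastic gradient descent on deep audio features) can be sketched as follows; random vectors stand in for the Deep Spectrum features, which in the paper come from ImageNet CNNs applied to spectrogram plots.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for Deep Spectrum features: two well-separated synthetic
# emotion classes with 64-d feature vectors (50 samples each).
X = np.vstack([rng.normal(0.0, 1.0, (50, 64)),
               rng.normal(1.5, 1.0, (50, 64))])
y = np.array([0] * 50 + [1] * 50)

# Logistic regression trained with plain stochastic gradient descent,
# mirroring the challenge's simple linear classifier on deep features.
w, b, lr = np.zeros(64), 0.0, 0.05
for epoch in range(20):
    for i in rng.permutation(len(y)):
        p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))  # sigmoid prediction
        g = p - y[i]                                # gradient of log-loss
        w -= lr * g * X[i]
        b -= lr * g

train_acc = np.mean(((X @ w + b) > 0) == (y == 1))
```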

Implicit Knowledge Injectable Cross Attention Audiovisual Model for Group Emotion Recognition

  • Yanan Wang
  • Jianming Wu
  • Panikos Heracleous
  • Shinya Wada
  • Rui Kimura
  • Satoshi Kurihara

Audio-video group emotion recognition is a challenging task since it is difficult to gather a broad range of potential information to obtain meaningful emotional representations. Humans can easily understand emotions because they can associate implicit contextual knowledge (contained in our memory) when processing explicit information they can see and hear directly. This paper proposes an end-to-end architecture called implicit knowledge injectable cross attention audiovisual deep neural network (K-injection audiovisual network) that imitates this intuition. The K-injection audiovisual network is used to train an audiovisual model that can not only obtain audiovisual representations of group emotions through an explicit feature-based cross attention audiovisual subnetwork (audiovisual subnetwork), but is also able to absorb implicit knowledge of emotions through two implicit knowledge-based injection subnetworks (K-injection subnetwork). In addition, it is trained with explicit features and implicit knowledge but can easily make inferences using only explicit features. We define the region of interest (ROI) visual features and Mel-spectrogram audio features as explicit features, which obviously are present in the raw audio-video data. On the other hand, we define the linguistic and acoustic emotional representations that do not exist in the audio-video data as implicit knowledge. The implicit knowledge distilled by adapting video situation descriptions and basic acoustic features (MFCCs, pitch and energy) to linguistic and acoustic K-injection subnetworks is defined as linguistic and acoustic knowledge, respectively. When compared to the baseline accuracy for the testing set of 47.88%, the average of the audiovisual models trained with the (linguistic, acoustic and linguistic-acoustic) K-injection subnetworks achieved an overall accuracy of 66.40%.

Multi-modal Fusion Using Spatio-temporal and Static Features for Group Emotion Recognition

  • Mo Sun
  • Jian Li
  • Hui Feng
  • Wei Gou
  • Haifeng Shen
  • Jian Tang
  • Yi Yang
  • Jieping Ye

This paper presents our approach for the Audio-video Group Emotion Recognition sub-challenge in EmotiW 2020. The task is to classify a video into one of the group emotions: positive, neutral, and negative. Our approach exploits two different feature levels for this task: the spatio-temporal feature level and the static feature level. At the spatio-temporal feature level, we feed multiple input modalities (RGB, RGB difference, optical flow, warped optical flow) into multiple video classification networks to train the spatio-temporal model. At the static feature level, we crop all faces and bodies in an image with a state-of-the-art human pose estimation method and train several kinds of CNNs with the image-level labels of group emotions. Finally, we fuse the results of all 14 models, and achieve third place in this sub-challenge with classification accuracies of 71.93% and 70.77% on the validation set and test set, respectively.

Multi-rate Attention Based GRU Model for Engagement Prediction

  • Bin Zhu
  • Xinjie Lan
  • Xin Guo
  • Kenneth E. Barner
  • Charles Boncelet

Engagement detection is essential in many areas such as driver attention tracking, employee engagement monitoring, and student engagement evaluation. In this paper, we propose a novel approach using attention based hybrid deep models for the 8th Emotion Recognition in the Wild (EmotiW 2020) Grand Challenge in the category of engagement prediction in the wild. The task aims to predict the engagement intensity of subjects in videos; the subjects are students watching educational videos from Massive Open Online Courses (MOOCs). To complete the task, we propose a hybrid deep model based on multi-rate and multi-instance attention. The novelty of the proposed model can be summarized in three aspects: (a) an attention based Gated Recurrent Unit (GRU) deep network, (b) heuristic multi-rate processing of video based data, and (c) a rigorous and accurate ensemble model. Experimental results on the validation and test sets show that our method makes promising improvements, achieving a competitively low MSE of 0.0541 on the test set, improving on the baseline results by 64%. The proposed model won first place in the engagement prediction in the wild challenge.
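
The attention pooling and multi-rate ideas can be illustrated with a toy sketch (this is not the authors' model: the hidden states are random stand-ins for GRU outputs, and the attention vector and rates are invented):

```python
import numpy as np

rng = np.random.default_rng(2)

def attention_pool(h, v):
    """Soft attention over time steps: scores = h @ v, weights = softmax
    of the scores, output = attention-weighted sum of hidden states."""
    scores = h @ v
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ h

# Hypothetical GRU hidden states for one video (30 steps, 16 units).
h = rng.standard_normal((30, 16))
v = rng.standard_normal(16)   # learned attention vector in a real model

# Multi-rate processing: pool the same sequence subsampled at different
# rates, then average the branch outputs (the paper ensembles branches).
branches = [attention_pool(h[::rate], v) for rate in (1, 2, 4)]
fused = np.mean(branches, axis=0)
```

A regression head on `fused` would then predict the scalar engagement intensity.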

Recognizing Emotion in the Wild using Multimodal Data

  • Shivam Srivastava
  • Saandeep Aathreya SIdhapur Lakshminarayan
  • Saurabh Hinduja
  • Sk Rahatul Jannat
  • Hamza Elhamdadi
  • Shaun Canavan

In this work, we present our approach for all four tracks of the eighth Emotion Recognition in the Wild Challenge (EmotiW 2020). The four tasks are group emotion recognition, driver gaze prediction, predicting engagement in the wild, and emotion recognition using physiological signals. We explore multiple approaches including classical machine learning tools such as random forests, state of the art deep neural networks, and multiple fusion and ensemble-based approaches. We also show that similar approaches can be used across tracks as many of the features generalize well to the different problems (e.g. facial features). We detail evaluation results that are either comparable to or outperform the baseline results for both the validation and testing for most of the tracks.

X-AWARE: ConteXt-AWARE Human-Environment Attention Fusion for Driver Gaze Prediction in the Wild

  • Lukas Stappen
  • Georgios Rizos
  • Björn Schuller

Reliable systems for automatic estimation of the driver's gaze are crucial for reducing the number of traffic fatalities and for many emerging research areas aimed at developing intelligent vehicle-passenger systems. Gaze estimation is a challenging task, especially in environments with varying illumination and reflection properties. Furthermore, there is wide diversity with respect to the appearance of drivers' faces, both in terms of occlusions (e.g. vision aids) and cultural/ethnic backgrounds. For this reason, analysing the face along with contextual information - for example, the vehicle cabin environment - adds another, less subjective signal towards the design of robust systems for passenger gaze estimation. In this paper, we present an integrated approach to jointly model different features for this task. In particular, to improve the fusion of the visually captured environment with the driver's face, we have developed a contextual attention mechanism, X-AWARE, attached directly to the output convolutional layers of InceptionResNetV2 networks. In order to showcase the effectiveness of our approach, we use the Driver Gaze in the Wild dataset, recently released as part of the Eighth Emotion Recognition in the Wild (EmotiW) challenge. Our best model outperforms the baseline by an absolute 15.03% in accuracy on the validation set, and improves the previously best reported result by an absolute 8.72% on the test set.
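
A minimal sketch of contextual attention fusion in the spirit of the approach above (not the actual X-AWARE layer): spatial locations are scored jointly from face and context feature maps, and the face map is pooled with the resulting weights. Map sizes and the attention weights are invented; the real mechanism sits on InceptionResNetV2 feature maps and is learned.

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical convolutional feature maps (H x W x C) for the cropped
# driver face and for the full cabin image.
face = rng.standard_normal((7, 7, 32))
context = rng.standard_normal((7, 7, 32))

# Score each spatial location from the concatenated face + context
# channels, then attention-pool the face map with those weights.
w_att = rng.standard_normal(64) * 0.1
joint = np.concatenate([face, context], axis=-1).reshape(-1, 64)
att = softmax(joint @ w_att)              # one weight per spatial location
fused = att @ face.reshape(-1, 32)        # attention-pooled 32-d descriptor
```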

SESSION: Workshops Summaries

Bridging Social Sciences and AI for Understanding Child Behaviour

  • Heysem Kaya
  • Roy S. Hessels
  • Maryam Najafian
  • Sandra Hanekamp
  • Saeid Safavi

Child behaviour is a topic of wide scientific interest among many different disciplines, including social and behavioural sciences and artificial intelligence (AI). In this workshop, we aimed to connect researchers from these fields to address topics such as the usage of AI to better understand and model child behavioural and developmental processes, challenges and opportunities for AI in large-scale child behaviour analysis and implementing explainable ML/AI on sensitive child data. The workshop served as a successful first step towards this goal and attracted contributions from different research disciplines on the analysis of child behaviour. This paper provides a summary of the activities of the workshop and the accepted papers and abstracts.

International Workshop on Deep Video Understanding

  • Keith Curtis
  • George Awad
  • Shahzad Rajput
  • Ian Soboroff

This is the introduction paper to the International Workshop on Deep Video Understanding, organized at the 22nd ACM International Conference on Multimodal Interaction. In recent years, a growing trend towards working on understanding videos (in particular movies) at a deeper level has started to motivate researchers working in multimedia and computer vision to present new approaches and datasets to tackle this problem. This is a challenging research area which aims to develop a deep understanding of the relations which exist between different individuals and entities in movies using all available modalities such as video, audio, text and metadata. The aim of this workshop is to foster innovative research in this new direction and to provide benchmarking evaluations to advance technologies in the deep video understanding community.

Face and Gesture Analysis for Health Informatics

  • Zakia Hammal
  • Di Huang
  • Kévin Bailly
  • Liming Chen
  • Mohamed Daoudi

The goal of the Face and Gesture Analysis for Health Informatics workshop is to share and discuss the achievements as well as the challenges in using computer vision and machine learning for automatic human behavior analysis and modeling for clinical research and healthcare applications. The workshop aims to promote current research and support growth of multidisciplinary collaborations to advance this groundbreaking research. The meeting gathers scientists working in related areas of computer vision and machine learning, multi-modal signal processing and fusion, human centered computing, behavioral sensing, assistive technologies, and medical tutoring systems for healthcare applications and medicine.

Workshop on Interdisciplinary Insights into Group and Team Dynamics

  • Hayley Hung
  • Gabriel Murray
  • Giovanna Varni
  • Nale Lehmann-Willenbrock
  • Fabiola H. Gerpott
  • Catharine Oertel

There has been gathering momentum over the last 10 years in the study of group behavior in multimodal multiparty interactions. While many works in the computer science community focus on the analysis of individual or dyadic interactions, we believe that the study of groups adds an additional layer of complexity with respect to how humans cooperate and what outcomes can be achieved in these settings. Moreover, the development of technologies that can help to interpret and enhance group behaviours dynamically is still an emerging field. Social theories that accompany the study of group dynamics are in their infancy and there is a need for more interdisciplinary dialogue between computer scientists and social scientists on this topic. This workshop has been organised to facilitate those discussions and strengthen the bonds between these overlapping research communities.

Multisensory Approaches to Human-Food Interaction

  • Carlos Velasco
  • Anton Nijholt
  • Charles Spence
  • Takuji Narumi
  • Kosuke Motoki
  • Gijs Huisman
  • Marianna Obrist

Here, we present the outcome of the 4th workshop on Multisensory Approaches to Human-Food Interaction (MHFI), developed in collaboration with ICMI 2020 in Utrecht, The Netherlands. Capitalizing on the increasing interest in multisensory aspects of human-food interaction and the unique contribution that our community offers, we developed a space to discuss ideas ranging from mechanisms of multisensory food perception, through multisensory technologies, to new applications of systems in the context of MHFI. All in all, the workshop involved 11 contributions, which will hopefully further help shape the basis of a field of inquiry that grows as we see progress in our understanding of the senses and the development of new technologies in the context of food.

Multimodal Interaction in Psychopathology

  • Itir Onal Ertugrul
  • Jeffrey F. Cohn
  • Hamdi Dibeklioglu

This paper presents an introduction to the Multimodal Interaction in Psychopathology workshop, which is held virtually in conjunction with the 22nd ACM International Conference on Multimodal Interaction on October 25th, 2020. This workshop has attracted submissions in the context of investigating multimodal interaction to reveal mechanisms and assess, monitor, and treat psychopathology. Keynote speakers from diverse disciplines present an overview of the field from different vantages and comment on future directions. Here we summarize the goals and the content of the workshop.

Modeling Socio-Emotional and Cognitive Processes from Multimodal Data in the Wild

  • Dennis Küster
  • Felix Putze
  • Patrícia Alves-Oliveira
  • Maike Paetzel
  • Tanja Schultz

Detecting, modeling, and making sense of multimodal data from human users in the wild still poses numerous challenges. Starting from aspects of data quality and reliability of our measurement instruments, the multidisciplinary endeavor of developing intelligent adaptive systems in human-computer or human-robot interaction (HCI, HRI) requires a broad range of expertise and more integrative efforts to make such systems reliable, engaging, and user-friendly. At the same time, the spectrum of applications for machine learning and modeling of multimodal data in the wild keeps expanding. From the classroom to the robot-assisted operation theatre, our workshop aims to support a vibrant exchange about current trends and methods in the field of modeling multimodal data in the wild.

Speech, Voice, Text, and Meaning: A Multidisciplinary Approach to Interview Data through the Use of Digital Tools

  • Arjan van Hessen
  • Silvia Calamai
  • Henk van den Heuvel
  • Stefania Scagliola
  • Norah Karrouche
  • Jeannine Beeken
  • Louise Corti
  • Christoph Draxler

Interview data is multimodal data: it consists of speech sound, facial expression and gestures, captured in a particular situation, and containing textual information and emotion. This workshop shows how a multidisciplinary approach may exploit the full potential of interview data. The workshop first gives a systematic overview of the research fields working with interview data. It then presents the speech technology currently available to support transcribing and annotating interview data, such as automatic speech recognition, speaker diarization, and emotion detection. Finally, scholars who work with interview data and tools may present their work and discover how to make use of existing technology.

Multimodal Affect and Aesthetic Experience

  • Theodoros Kostoulas
  • Michal Muszynski
  • Theodora Chaspari
  • Panos Amelidis

The term 'aesthetic experience' corresponds to the inner state of a person exposed to form and content of artistic objects. Exploring certain aesthetic values of artistic objects, as well as interpreting the aesthetic experience of people when exposed to art can contribute towards understanding (a) art and (b) people's affective reactions to artwork. Focusing on different types of artistic content, such as movies, music, urban art and other artwork, the goal of this workshop is to enhance the interdisciplinary collaboration between affective computing and aesthetics researchers.

First Workshop on Multimodal e-Coaches

  • Leonardo Angelini
  • Mira El Kamali
  • Elena Mugellini
  • Omar Abou Khaled
  • Yordan Dimitrov
  • Vera Veleva
  • Zlatka Gospodinova
  • Nadejda Miteva
  • Richar Wheeler
  • Zoraida Callejas
  • David Griol
  • Kawtar Benghazi
  • Manuel Noguera
  • Panagiotis Bamidis
  • Evdokimos Konstantinidis
  • Despoina Petsani
  • Andoni Beristain Iraola
  • Dimitrios I. Fotiadis
  • Gérard Chollet
  • Inés Torres
  • Anna Esposito
  • Hannes Schlieter

e-Coaches are promising intelligent systems that aim at supporting human everyday life, dispatching advice through different interfaces, such as apps, conversational interfaces, and augmented reality interfaces. This workshop aims at exploring how e-coaches might benefit from spatially and time-multiplexed interfaces and from different communication modalities (e.g., text, visual, audio) according to the context of the interaction.

Social Affective Multimodal Interaction for Health

  • Hiroki Tanaka
  • Satoshi Nakamura
  • Jean-Claude Martin
  • Catherine Pelachaud

This workshop discusses how interactive, multimodal technology such as virtual agents can be used in social skills training for measuring and training social-affective interactions. Sensing technology now enables analyzing users' behaviors and physiological signals. Various signal processing and machine learning methods can be used for such prediction tasks. Such social signal processing methods and tools can be applied to measure and reduce social stress in everyday situations, including public speaking at schools and workplaces.

The First International Workshop on Multi-Scale Movement Technologies

  • Eleonora Ceccaldi
  • Benoit Bardy
  • Nadia Bianchi-Berthouze
  • Luciano Fadiga
  • Gualtiero Volpe
  • Antonio Camurri

Multimodal interfaces pose the challenge of dealing with the multiple interactive time-scales characterizing human behavior. To do this, innovative models and time-adaptive technologies are needed, operating at multiple time-scales and adopting a multi-layered approach. The first International Workshop on Multi-Scale Movement Technologies, hosted virtually during the 22nd ACM International Conference on Multimodal Interaction, is aimed at providing researchers from different areas with the opportunity to discuss this topic. This paper summarizes the activities of the workshop and the accepted papers.

ICMI 2020 ACM International Conference on Multimodal Interaction. Copyright © 2019-2020