ICMI '20 Companion: Companion Publication of the 2020 International Conference on Multimodal Interaction

Full Citation in the ACM Digital Library

SESSION: ICMI 2020 Late Breaking Results

Gender Classification of Prepubescent Children via Eye Movements with Reading Stimuli

  • Sahar Mahdie Klim Al Zaidawi
  • Martin H.U. Prinzler
  • Christoph Schröder
  • Gabriel Zachmann
  • Sebastian Maneth

We present a new study of gender prediction using eye movements of prepubescent children aged 9-10. Despite previous research indicating that gender differences in eye movements are observed only in adults, we are able to predict gender with accuracies of up to 64%. Our method segments gaze point trajectories into saccades and fixations. It then computes a small number of features and classifies saccades and fixations separately using statistical methods. The dataset used contains non-dyslexic and dyslexic children. In mixed groups, the accuracy of our classifiers drops dramatically. To address this challenge, we construct a hierarchical classifier that makes use of dyslexia prediction to significantly improve the accuracy of gender prediction in mixed groups.
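The hierarchical idea in this abstract can be illustrated with a minimal Python sketch: a dyslexia predictor first routes each sample to a gender classifier intended for that group. The abstract does not specify the actual models or features, so every name here (`make_hierarchical_classifier`, the feature keys, the stub classifiers) is hypothetical, and the thresholds are arbitrary placeholders, not findings.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def make_hierarchical_classifier(
    predict_dyslexia: Callable[[Features], bool],
    gender_if_dyslexic: Callable[[Features], str],
    gender_if_non_dyslexic: Callable[[Features], str],
) -> Callable[[Features], str]:
    """Build a two-stage classifier: stage 1 predicts the group,
    stage 2 applies the gender classifier for that group."""

    def predict_gender(features: Features) -> str:
        if predict_dyslexia(features):
            return gender_if_dyslexic(features)
        return gender_if_non_dyslexic(features)

    return predict_gender


# Hypothetical stub models. The thresholds are arbitrary placeholders
# chosen only so that the routing can be demonstrated.
def stub_dyslexia(f: Features) -> bool:
    return f["mean_fixation_ms"] > 300.0


def stub_gender_dyslexic(f: Features) -> str:
    return "female" if f["mean_saccade_deg"] > 2.0 else "male"


def stub_gender_non_dyslexic(f: Features) -> str:
    return "female" if f["mean_saccade_deg"] > 4.0 else "male"


classify = make_hierarchical_classifier(
    stub_dyslexia, stub_gender_dyslexic, stub_gender_non_dyslexic
)

# Same saccade feature, different routing, different second-stage model.
sample_a = {"mean_fixation_ms": 350.0, "mean_saccade_deg": 3.0}
sample_b = {"mean_fixation_ms": 200.0, "mean_saccade_deg": 3.0}
print(classify(sample_a), classify(sample_b))
```

The point of the design is that each second-stage classifier only sees samples from one group, which is how a dyslexia prediction could help when the groups are mixed.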

Investigating LSTM for Micro-Expression Recognition

  • Mengjiong Bai
  • Roland Goecke

This study investigates the utility of Long Short-Term Memory (LSTM) networks for modelling spatial-temporal patterns for micro-expression recognition (MER). Micro-expressions are involuntary, short facial expressions, often of low intensity. RNNs have attracted a lot of attention in recent years for modelling temporal sequences. The RNN-LSTM combination has proven highly effective in many application areas. The proposed method combines the recent VGGFace2 model, basically a ResNet-50 CNN trained on the VGGFace2 dataset, with uni-directional and bi-directional LSTM to explore different ways of modelling spatial-temporal facial patterns for MER. The Grad-CAM heat map visualisation is used in the training stages to determine the most appropriate layer of the VGGFace2 model for retraining. Experiments are conducted with pure VGGFace2, VGGFace2 + uni-directional LSTM, and VGGFace2 + bi-directional LSTM on the SMIC database using 5-fold cross-validation.

Speech Emotion Recognition among Elderly Individuals using Multimodal Fusion and Transfer Learning

  • George Boateng
  • Tobias Kowatsch

Recognizing the emotions of the elderly is important as it could give an insight into their mental health. Emotion recognition systems that work well on the elderly could be used to assess their emotions in places such as nursing homes and could inform the development of various activities and interventions to improve their mental health. However, several emotion recognition systems are developed using data from younger adults. In this work, we train machine learning models to recognize the emotions of elderly individuals by performing a 3-class classification of valence and arousal as part of the INTERSPEECH 2020 Computational Paralinguistics Challenge (COMPARE). We used speech data from 87 participants who gave spontaneous personal narratives. We leveraged a transfer learning approach in which we used pretrained CNN and BERT models to extract acoustic and linguistic features respectively and fed them into separate machine learning models. We also fused these two modalities in a multimodal approach. Our best model used a linguistic approach and outperformed the official competition baseline in unweighted average recall (UAR) for valence by 8.8% and for the mean of valence and arousal by 3.2%. We also showed that feature engineering is not necessary, as transfer learning without fine-tuning performs as well or better and could be leveraged for the task of recognizing the emotions of elderly individuals. This work is a step towards better recognition of the emotions of the elderly, which could eventually inform the development of interventions to manage their mental health.
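For readers unfamiliar with the metric named above, unweighted average recall (UAR) is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The following Python sketch computes it from scratch; the function name and the toy labels are illustrative assumptions and are not taken from the challenge data.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    totals = defaultdict(int)  # number of true samples per class
    hits = defaultdict(int)    # number of correctly predicted samples per class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1

    recalls = [hits[cls] / totals[cls] for cls in totals]
    return sum(recalls) / len(recalls)


# Toy 3-class example with made-up labels.
y_true = ["low", "low", "low", "mid", "high", "high"]
y_pred = ["low", "low", "mid", "mid", "high", "low"]
# Per-class recall: low = 2/3, mid = 1/1, high = 1/2.
print(round(unweighted_average_recall(y_true, y_pred), 4))
```

Plain accuracy on the same toy data would be 4/6, which differs from the UAR because the classes are imbalanced.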

Speech Emotion Recognition among Couples using the Peak-End Rule and Transfer Learning

  • George Boateng
  • Laura Sels
  • Peter Kuppens
  • Peter Hilpert
  • Tobias Kowatsch

Extensive couples' literature shows that how couples feel after a conflict is predicted by certain emotional aspects of that conversation. Understanding the emotions of couples leads to a better understanding of partners' mental well-being and consequently their relationships. Hence, automatic emotion recognition among couples could potentially guide interventions to help couples improve their emotional well-being and their relationships. It has been shown that people's global emotional judgment after an experience is strongly influenced by the emotional extremes and ending of that experience, known as the peak-end rule. In this work, we leveraged this theory and used machine learning to investigate which audio segments can be used to best predict the end-of-conversation emotions of couples. We used speech data collected from 101 Dutch-speaking couples in Belgium who engaged in 10-minute long conversations in the lab. We extracted acoustic features from (1) the audio segments with the most extreme positive and negative ratings, and (2) the ending of the audio. We used transfer learning in which we extracted these acoustic features with a pre-trained convolutional neural network (YAMNet). We then used these features to train machine learning models (support vector machines) to predict the end-of-conversation valence ratings (positive vs negative) of each partner. The results of this work could inform how to best recognize the emotions of couples after conversation sessions and eventually lead to a better understanding of couples' relationships either in therapy or in everyday life.
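The segment selection implied by the peak-end rule can be sketched in a few lines of Python: given one emotion rating per audio segment, pick the segments with the most positive and most negative ratings, plus the final segment. The function name, the dictionary keys, and the example ratings are hypothetical; the abstract does not state how the segments were defined or rated.

```python
def peak_end_segments(ratings):
    """Return the indices of the peak-positive, peak-negative,
    and final segments for a sequence of per-segment ratings."""
    if not ratings:
        raise ValueError("at least one segment rating is required")
    indices = range(len(ratings))
    return {
        "peak_positive": max(indices, key=lambda i: ratings[i]),
        "peak_negative": min(indices, key=lambda i: ratings[i]),
        "end": len(ratings) - 1,
    }


# Made-up ratings for four consecutive segments of one conversation.
example_ratings = [0.1, 0.8, -0.5, 0.3]
print(peak_end_segments(example_ratings))
```

In a pipeline like the one described, acoustic features would then be extracted only from the selected segments before training a classifier.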

Music-Driven Animation Generation of Expressive Musical Gestures

  • Alysha Bogaers
  • Zerrin Yumak
  • Anja Volk

While audio-driven face and gesture motion synthesis has been studied before, to our knowledge no research has been done yet for automatic generation of musical gestures for virtual humans. Existing work either focuses on precise 3D finger movement generation required to play an instrument or expressive musical gestures based on 2D video data. In this paper, we propose a music-driven piano performance generation method using 3D motion capture data and recurrent neural networks. Our results show that it is feasible to automatically generate expressive musical gestures for piano playing using various audio and musical features. However, it is not yet clear which features work best for which type of music. Our future work aims to further test with other datasets, deep learning methods and musical instruments using both objective and subjective evaluations.

Engagement Analysis of ADHD Students using Visual Cues from Eye Tracker

  • Harshit Chauhan
  • Anmol Prasad
  • Jainendra Shukla

In this paper, we focus on finding the correlation between visual attention and engagement of ADHD students in one-on-one sessions with specialized educators using visual cues and eye-tracking data. Our goal is to investigate the extent to which observations of eye-gaze, posture, emotion and other physiological signals can be used to model the cognitive state of subjects and to explore the integration of multiple sensor modalities to improve the reliability of detection of human displays of awareness and emotion in the context of ADHD affected children. This is a novel problem since no previous studies have aimed to identify markers of attentiveness in the context of students affected with ADHD. The experiment has been designed to collect data in a controlled environment and later on can be used to generate Machine Learning models to assist real-world educators. Additionally, we propose a novel approach for AOI (Area of Interest) detection for eye-tracking analysis in dynamic scenarios using existing deep learning-based saliency prediction and fixation prediction models. We aim to use the processed data to extract the features from a subject's eye-movement patterns and use Machine Learning models to classify the attention levels.

mEBAL: A Multimodal Database for Eye Blink Detection and Attention Level Estimation

  • Roberto Daza
  • Aythami Morales
  • Julian Fierrez
  • Ruben Tolosana

This work presents mEBAL, a multimodal database for eye blink detection and attention level estimation. The eye blink frequency is related to cognitive activity, and automatic detectors of eye blinks have been proposed for many tasks including attention level estimation, analysis of neuro-degenerative diseases, deception recognition, driver fatigue detection, and face anti-spoofing. However, most existing databases and algorithms in this area are limited to experiments involving only a few hundred samples and individual sensors like face cameras. The proposed mEBAL improves on previous databases in terms of acquisition sensors and samples. In particular, three different sensors are simultaneously considered: Near Infrared (NIR) and RGB cameras to capture the face gestures and an Electroencephalography (EEG) band to capture the cognitive activity of the user and blinking events. Regarding the size of mEBAL, it comprises 6,000 samples and the corresponding attention level from 38 different students while conducting a number of e-learning tasks of varying difficulty. In addition to presenting mEBAL, we also include preliminary experiments on: i) eye blink detection using Convolutional Neural Networks (CNN) with the facial images, and ii) attention level estimation of the students based on their eye blink frequency.

User Expectations and Preferences to How Social Robots Render Text Messages with Emojis

  • Karen Fucinato
  • Elena Lavinia Leustean
  • Lilla Fekecs
  • Tünde Tárnoková
  • Rosalyn M. Langedijk
  • Kerstin Fischer

Social robots are increasingly entering our households and are able to interact with humans in various ways. One functionality of social robots may be to connect to a user's mobile phone and to read text messages out loud. Such a technology and communication platform should therefore be able to support emojis properly. We therefore address emoji usage in computer-mediated communication in order to develop appropriate emoji conveyance in social robot behavior. Our research explores how participants feel about the behavior of a tabletop robot prototype named Nina that reads text messages to the user, and to what extent different renderings correspond to user expectations and preferences of how text messages and emoji combinations should be delivered. Based on online animated videos and questionnaires, respondents evaluated the behavior of Nina based on different renderings of text messages with emojis in them. The experiment results and data analysis show that respondents liked the social robot to display emojis with or without sound effect and to "act out" emojis in text messages almost equally well, but rated it less useful, less fun and more confusing to replace the emojis by words.

It's Not What They Play, It's What You Hear: Understanding Perceived vs. Induced Emotions in Hindustani Classical Music

  • Amogh Gulati
  • Brihi Joshi
  • Chirag Jain
  • Jainendra Shukla

Music is an efficient medium to elicit and convey emotions. The comparison between perceived and induced emotions from western music has been widely studied. However, this relationship has not been studied from the perspective of Hindustani classical music. In this work, we explore the relationship between perceived and induced emotions with Hindustani classical music as our stimuli. We observe that there is little to no correlation between them; however, audio features help in distinguishing the increase or decrease in induced emotion quality. We also introduce a novel dataset which contains induced valence and arousal annotations for 18 Hindustani classical music songs. Furthermore, we propose a latent space representation based approach that leads to a relative increase in F1 Score of 32.2% for arousal and 34.5% for valence classification, as compared to feature-based approaches for Hindustani classical music.

Toward Mathematical Representation of Emotion: A Deep Multitask Learning Method Based On Multimodal Recognition

  • Seiichi Harata
  • Takuto Sakuma
  • Shohei Kato

To emulate human emotions in agents, the mathematical representation of emotion (an emotional space) is essential for each component, such as emotion recognition, generation, and expression. In this study, we aim to acquire a modality-independent emotional space by extracting shared emotional information from different modalities. We propose a method of acquiring an emotional space by integrating multimodalities on a DNN and combining the emotion recognition task and the unification task. The emotion recognition task learns the representation of emotions, and the unification task learns an identical emotional space from each modality. Through the experiments with audio-visual data, we confirmed that there are differences in emotional spaces acquired from unimodality, and the proposed method can acquire a joint emotional space. We also indicated that the proposed method could adequately represent emotions in a low-dimensional emotional space, such as in five or six dimensions, under this paper's experimental conditions.

Neuroscience to Investigate Social Mechanisms Involved in Human-Robot Interactions

  • Youssef Hmamouche
  • Magalie Ochs
  • Laurent Prévot
  • Thierry Chaminade

To what extent do human-robot interactions (HRI) rely on social processes similar to human-human interactions (HHI)? To address this question objectively, we use a unique corpus. Brain activity and behaviors were recorded synchronously while participants were discussing with a human (confederate of the experimenter) or a robotic device (controlled by the confederate). Here, we focus on two main regions of interest (ROIs) that form the core of "the social brain", the right temporoparietal junction [rTPJ] and right medial prefrontal cortex [rMPFC]. A new analysis approach derived from multivariate time-series forecasting is used. A prediction score describes the ability to predict brain activity for each ROI, and results identify which behavioral features, built from raw recordings of the conversations, are used for this prediction. Results identify some differences between HHI and HRI in the behavioral features predicting activity in these ROIs of the social brain, that could explain significant differences in the level of activity

Multimodal Self-Assessed Personality Prediction in the Wild

  • Daisuke Kamisaka
  • Yuichi Ishikawa

Personality is an essential human attribute, and plays an important role in the personalization of face-to-face services that improves sales and customer satisfaction in a variety of different business domains. Most studies addressing personality prediction to date built a prediction model with user data gathered via online services (e.g., SNS and mobile phone services). On the other hand, predicting the personality traits of customers without an online service log (e.g., first-time visitors and customers who only use a physical store) is challenging. In this paper, we present a multimodal approach to predict the personality of customers about whom only visual data collected using in-store surveillance cameras are available. Our approach extracts simple gait features and projects them into pseudo online service log features to gain predictive power. Through an evaluation using the data collected by real mobile retailers, our approach improved prediction accuracy compared with a method that directly predicts personality from the visual data.

Prediction of Shared Laughter for Human-Robot Dialogue

  • Divesh Lala
  • Koji Inoue
  • Tatsuya Kawahara

Shared laughter is a phenomenon in face-to-face human dialogue which increases engagement and rapport, and so should be considered for conversation robots and agents. Our aim is to create a model of shared laughter generation for conversational robots. As part of this system, we train models which predict if shared laughter will occur, given that the user has laughed. Models trained using combinations of acoustic features, prosodic features, and laughter type were compared, with online versions also considered to better quantify their performance in a real system. We find that these models perform better than random chance, with the multimodal combination of acoustic and prosodic features performing the best.

The Influence of Blind Source Separation on Mixed Audio Speech and Music Emotion Recognition

  • Casper Laugs
  • Hendrik Vincent Koops
  • Daan Odijk
  • Heysem Kaya
  • Anja Volk

While both speech emotion recognition and music emotion recognition have been studied extensively in different communities, little research went into the recognition of emotion from mixed audio sources, i.e. when both speech and music are present. However, many application scenarios require models that are able to extract emotions from mixed audio sources, such as television content. This paper studies how mixed audio affects both speech and music emotion recognition using a random forest and deep neural network model, and investigates if blind source separation of the mixed signal beforehand is beneficial. We created a mixed audio dataset, with 25% speech-music overlap without contextual relationship between the two. We show that specialized models for speech-only or music-only audio were able to achieve merely 'chance-level' performance on mixed audio. For speech, above chance-level performance was achieved when trained on raw mixed audio, but optimal performance was achieved with audio blind source separated beforehand. Music emotion recognition models on mixed audio achieve performance approaching or even surpassing performance on music-only audio, with and without blind source separation. Our results are important for estimating emotion from real-world data, where individual speech and music tracks are often not available.

Using Physiological Cues to Determine Levels of Anxiety Experienced among Deaf and Hard of Hearing English Language Learners

  • Heera Lee
  • Varun Mandalapu
  • Jiaqi Gong
  • Andrea Kleinsmith
  • Ravi Kuber

Deaf and hard of hearing English language learners encounter a range of challenges when learning spoken/written English, many of which are not faced by their hearing counterparts. In this paper, we examine the feasibility of utilizing physiological data, including arousal and eye gaze behaviors, as a method of identifying instances of anxiety and frustration experienced when delivering presentations. Initial findings demonstrate the potential of using this approach, which in turn could aid English language instructors who could either provide emotional support or personalized instructions to assist deaf and hard of hearing English language learners in the classroom.

A Novel pseudo viewpoint based Holoscopic 3D Micro-gesture Recognition

  • Yi Liu
  • Shuang Yang
  • Hongying Meng
  • Mohammad Rafiq Swash
  • Shiguang Shan

Recently, video-based micro-gesture recognition with the data captured by holoscopic 3D (H3D) sensors is getting more and more attention, mainly because of their particular advantage of using a single aperture camera to embed the 3D information in 2D images. However, it is not easy to use the embedded 3D information in an efficient manner due to the special imaging principles of H3D sensors. In this paper, an efficient Pseudo View Points (PVP) based method is proposed to introduce the embedded 3D information in H3D images into a new micro-gesture recognition framework. Specifically, we obtain several pseudo view points based frames by composing all the pixels at the same position in each elemental image (EI) in the original H3D frames. This is a very efficient and robust step, and could mimic the real view points so as to represent the 3D information in the frames. Then, a new recognition framework based on 3D DenseNet and Bi-GRU networks is proposed to learn the dynamic patterns of different micro-gestures based on the representation of the pseudo view points. Finally, we perform a thorough comparison on the related benchmark, which demonstrates the effectiveness of our method and also reports a new state of the art performance.
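The pseudo view point step described above, composing the pixels at the same position in every elemental image, can be sketched in plain Python. This is only an illustration under stated assumptions: the frame is a 2D grid of pixel values, the elemental images are square with side `ei_size`, and they tile the frame exactly. The function name and the toy frame are hypothetical and not taken from the paper.

```python
def pseudo_viewpoints(frame, ei_size):
    """Split an H3D frame into pseudo view point images.

    For each offset (u, v) inside an elemental image, collect the pixel
    at that offset from every elemental image. The result maps (u, v)
    to a smaller image with one pixel per elemental image.
    """
    height = len(frame)
    width = len(frame[0])
    if height % ei_size or width % ei_size:
        raise ValueError("frame size must be a multiple of ei_size")

    views = {}
    for u in range(ei_size):        # row offset within an elemental image
        for v in range(ei_size):    # column offset within an elemental image
            views[(u, v)] = [row[v::ei_size] for row in frame[u::ei_size]]
    return views


# Toy 4x4 frame made of four 2x2 elemental images.
toy_frame = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
toy_views = pseudo_viewpoints(toy_frame, ei_size=2)
print(toy_views[(0, 0)])  # top-left pixel of each elemental image
```

With `ei_size` equal to 2, the toy frame yields four pseudo view point images of 2x2 pixels each; applying the same slicing to every frame of a video would give one sequence per view point.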

Physiological Synchrony, Stress and Communication of Paramedic Trainees During Emergency Response Training

  • Vasundhara Misal
  • Surely Akiri
  • Sanaz Taherzadeh
  • Hannah McGowan
  • Gary Williams
  • J. Lee Jenkins
  • Helena Mentis
  • Andrea Kleinsmith

Paramedics play a critical role in society and face many high stress situations in their day-to-day work. Long-term unmanaged stress can result in mental health issues such as depression, anxiety, and post-traumatic stress disorder. Physiological synchrony - the unconscious, dynamic linking of physiological responses such as electrodermal activity (EDA) - has been linked to stress and team coordination. In this preliminary analysis, we examined the relationship between EDA synchrony, perceived stress and communication between paramedic trainee pairs during in-situ simulation training. Our initial results indicated a correlation between high physiological synchrony and social coordination and group processes. Moreover, communication between paramedic dyads was inversely related to physiological synchrony, i.e., communication increased during low synchrony segments of the interaction and decreased during high synchrony segments.

ET-CycleGAN: Generating Thermal Images from Images in the Visible Spectrum for Facial Emotion Recognition

  • Gerard Pons
  • Abdallah El Ali
  • Pablo Cesar

Facial thermal imaging has in recent years been shown to be an efficient modality for facial emotion recognition. However, the use of deep learning in this field is still not fully exploited given the small number and size of the current datasets. The goal of this work is to improve the performance of the existing deep networks in thermal facial emotion recognition by generating new synthesized thermal images from images in the visual spectrum (RGB). To address this challenging problem, we propose an emotion-guided thermal CycleGAN (ET-CycleGAN). This Generative Adversarial Network (GAN) regularizes the training with facial and emotion priors by extracting features from Convolutional Neural Networks (CNNs) trained for face recognition and facial emotion recognition, respectively. To assess this approach, we generated synthesized images from the training set of the USTC-NVIE dataset, and added the new data to the training set as a data augmentation strategy. By including images generated using the ET-CycleGAN, the accuracy for emotion recognition increased by 10.9%. Our initial findings highlight the importance of adding priors related to training set image attributes (in our case face and emotion priors), to ensure such attributes are maintained in the generated images.

Visually Impaired User Experience using a 3D-Enhanced Facility Management System for Indoors Navigation

  • Eduardo B. Sandoval
  • Binghao Li
  • Abdoulaye Diakite
  • Kai Zhao
  • Nicholas Oliver
  • Tomasz Bednarz
  • Sisi Zlatanova

We developed a 3D-Enhanced Facility Management System for Indoors Navigation (3D-EFMS-IN) to assist visually impaired users (VIU). Additionally, the system aims to facilitate the management of estate property and provide support for future scenarios related to emergencies, security, and robotics devices. The system combines four main subsystems: Mapping, Navigation Paths, Indoor Localisation and Navigation, and Visualisation. The subsystems have been integrated, and a pretest with one VIU was performed to obtain feedback and tune the critical characteristics of our development. We observed that the system offers an acceptable preliminary user experience for VIU and that future tests require improvements to the latency and usability of the system. In the near future, we aim to obtain qualitative and quantitative measurements in a significant pool of users once the COVID lockdown ends.

Not All Errors Are Created Equal: Exploring Human Responses to Robot Errors with Varying Severity

  • Maia Stiber
  • Chien-Ming Huang

Robot errors occurring during situated interactions with humans are inevitable and elicit social responses. While prior research has suggested how social signals may indicate errors produced by anthropomorphic robots, most have not explored Programming by Demonstration (PbD) scenarios or non-humanoid robots. Additionally, how human social signals may help characterize error severity, which is important to determine appropriate strategies for error mitigation, has been subjected to limited exploration. We report an exploratory study that investigates how people may react to technical errors with varying severity produced by a non-humanoid robotic arm in a PbD scenario. Our results indicate that more severe robot errors may prompt faster, more intense human responses and that multimodal responses tend to escalate as the error unfolds. This provides initial evidence suggesting temporal modeling of multimodal social signals may enable early detection and classification of robot errors, thereby minimizing unwanted consequences.

A Phonology-based Approach for Isolated Sign Production Assessment in Sign Language

  • Sandrine Tornay
  • Necati Cihan Camgoz
  • Richard Bowden
  • Mathew Magimai Doss

Interactive learning platforms are among the top choices for acquiring new languages. Such applications or platforms are more easily available for spoken languages, but rarely for sign languages. Assessment of the production of signs is a challenging problem because of the multichannel aspect (e.g., hand shape, hand movement, mouthing, facial expression) inherent in sign languages. In this paper, we propose an automatic sign language production assessment approach which allows assessment of two linguistic aspects: (i) the produced lexeme and (ii) the produced forms. On a linguistically annotated Swiss German Sign Language dataset, the SMILE DSGS corpus, we demonstrate that the proposed approach can effectively assess the two linguistic aspects in an integrated manner.

The Cross-modal Congruency Effect as an Objective Measure of Embodiment

  • Pim Verhagen
  • Irene Kuling
  • Kaj Gijsbertse
  • Ivo V. Stuldreher
  • Krista Overvliet
  • Sara Falcone
  • Jan Van Erp
  • Anne-Marie Brouwer

Remote control of robots generally requires a high level of expertise and may impose a considerable cognitive burden on operators. A sense of embodiment over a remote-controlled robot might enhance operators' task performance and reduce cognitive workload. We want to study the extent to which different factors affect embodiment. As a first step, we aimed to validate the cross-modal congruency effect (CCE) as a potential objective measure of embodiment under four conditions with different, a priori expected levels of embodiment, and by comparing CCE scores with subjective reports. The conditions were (1) a real hand condition (real condition), (2) a real hand seen through a telepresence unit (mediated condition), (3) a robotic hand seen through a telepresence unit (robot condition), and (4) a human-looking virtual hand seen through VR glasses (VR condition). We found no unambiguous evidence that the magnitude of the CCE was affected by the degree of visual realism in each of the four conditions. Nor did we find evidence to support the hypothesis that the CCE and the embodiment score as assessed by the subjective reports are correlated. These findings raise serious concerns about the use of the CCE as an objective measure of embodiment.

SESSION: DVU'20 Workshop

"Was It You Who Stole 500 Rubles?" - The Multimodal Deception Detection

  • Valeriya Karpova
  • Polina Popenova
  • Nadezda Glebko
  • Vladimir Lyashenko
  • Olga Perepelkina

Automatic deception detection is a challenging issue since human behaviors are too complex to establish any standard behavioral signs that would explicitly indicate that a person is lying. Furthermore, it is difficult to collect naturalistic datasets for supervised learning as both external and self-annotation may be unreliable for deception annotation. For these purposes, we collected the TRuLie dataset that consists of synchronously recorded videos (34 hours in total) and data received from contact photoplethysmography (PPG) and a hardware eye-tracker of ninety-three subjects who tried to feign innocence during interrogation after they committed mock crimes. Thus, we had multimodal fragments with lie (n=3380) and truth (n=6444). We trained an end-to-end convolutional neural network (CNN) on this dataset to predict lie and truth from audio and video, and also built classifiers on combined features extracted from video, audio, PPG, eye-tracker, and predictions from the CNN. The best classifier (LightGBM) showed a mean balanced accuracy of 0.64 and an F1-score of 0.76 on a 5-fold cross-validation.

Deep Video Understanding of Character Relationships in Movies

  • Yang Lu
  • Asri Rizki Yuliani
  • Keisuke Ishikawa
  • Ronaldo Prata Amorim
  • Roland Hartanto
  • Nakamasa Inoue
  • Kuniaki Uto
  • Koichi Shinoda

Humans can easily understand storylines and character relationships in movies. However, the automatic relationship analysis from videos is challenging. In this paper, we introduce a deep video understanding system to infer relationships between movie characters from multimodal features. The proposed system first extracts visual and text features from full-length movies. With these multimodal features, we then utilize graph-based relationship reasoning models to infer the characters' relationships. We evaluate our proposed system on the High-Level Video Understanding (HLVU) dataset. We achieve 53% accuracy on question answering tests.

See me Speaking? Differentiating on Whether Words are Spoken On Screen or Off to Optimize Machine Dubbing

  • Shravan Nayak
  • Timo Baumann
  • Supratik Bhattacharya
  • Alina Karakanta
  • Matteo Negri
  • Marco Turchi

Dubbing is the art of finding a translation from a source into a target language that can be lip-synchronously revoiced, i.e., that makes the target language speech appear as if it was spoken by the very actors all along. Lip synchrony is essential for the full-fledged reception of foreign audiovisual media, such as movies and series, as violated constraints of synchrony between video (lips) and audio (speech) lead to cognitive dissonance and reduce the perceptual quality. Of course, synchrony constraints only apply to the translation when the speaker's lips are visible on screen. Therefore, deciding whether to apply synchrony constraints requires an automatic method for detecting whether an actor's lips are visible on screen for a given stretch of speech or not. In this paper, we attempt, for the first time, to distinguish on-screen from off-screen speech based on a corpus of real-world television material that has been annotated word-by-word for the visibility of talking lips on screen. We present classification experiments in which we classify

Kinetics and Scene Features for Intent Detection

  • Raksha Ramesh
  • Vishal Anand
  • Ziyin Wang
  • Tianle Zhu
  • Wenfeng Lyu
  • Serena Yuan
  • Ching-Yung Lin

We create multi-modal fusion models to predict relational classes within entities in free-form inputs such as unseen movies. Our approach identifies information rich features within individual sources -- emotion, text-attention, age, gender, and contextual background object tracking. This information is absorbed and contrasted with baseline fusion architectures. These five models then showcase future research areas on this challenging problem of relational knowledge extraction from movies and free-form multi-modal input sources. We find that, generally, the Kinetics model with added Attributes and Objects beats the baseline models.

SESSION: FGAHI'20 Workshop

Bodily Expression of Social Initiation Behaviors in ASC and non-ASC children: Mixed Reality vs. LEGO Game Play

  • Batuhan Sayis
  • Narcis Pares
  • Hatice Gunes

This study is part of a larger project that showed the potential of our mixed reality (MR) system in fostering social initiation behaviors in children with Autism Spectrum Condition (ASC). We compared it to a typical social intervention strategy based on construction tools, where both mediated a face-to-face dyadic play session between an ASC child and a non-ASC child. In this study, our first goal is to show that an MR platform can be utilized to alter the nonverbal body behavior between ASC and non-ASC children during social interaction as much as a traditional therapy setting (LEGO). A second goal is to show how these body cues differ between ASC and non-ASC children during social initiation in these two platforms. We present our first analysis of the body cues generated under two conditions in a repeated-measures design. Body cue measurements were obtained through skeleton information and characterized in the form of spatio-temporal features from both subjects individually (e.g. distances between joints and velocities of joints), and interpersonally (e.g. proximity and visual focus of attention). We used machine learning techniques to analyze the visual data of eighteen trials of ASC and non-ASC dyads. Our experiments showed that: (i) there were differences between ASC and non-ASC bodily expressions, both at individual and interpersonal level, in LEGO and in the MR system during social initiation; (ii) the number of features indicating differences between ASC and non-ASC in terms of nonverbal behavior during initiation was higher in the MR system as compared to LEGO; and (iii) computational models evaluated with a combination of these different features enabled the recognition of social initiation type (ASC or non-ASC) from body features in LEGO and in MR settings. We did not observe significant differences between the evaluated models in terms of performance for LEGO and MR environments. This might be interpreted as the MR system encouraging similar nonverbal behaviors in children, perhaps more similar than the LEGO environment, as the performance scores in the MR setting are lower as compared to the LEGO setting. These results demonstrate the potential benefits of full body interaction and MR settings for children with ASC.

Using Object Tracking Techniques to Non-Invasively Measure Thoracic Rotation Range of Motion

  • Katelyn Morrison
  • Daniel Yates
  • Maya Roman
  • William W. Clark

Different measuring instruments, such as a goniometer, have been used by clinicians to measure a patient's ability to rotate their thoracic spine. Despite the simplicity of goniometers, this instrument requires the user to decipher the resulting measurement properly. The correctness of these measurements is imperative for clinicians to properly identify and evaluate injuries or help athletes enhance their overall performance. This paper introduces a goniometer-free, noninvasive measuring technique using a Raspberry Pi, a Pi Camera module, and software for clinicians to measure a subject's thoracic rotation range of motion (ROM) when administering the seated rotation technique with immediate measurement feedback. Determining this measurement is achieved by applying computer vision object tracking techniques on a live video feed from the Pi Camera that is secured on the ceiling above the subject. Preliminary results using rudimentary techniques reveal that our system is very accurate in static environments.

Enforcing Multilabel Consistency for Automatic Spatio-Temporal Assessment of Shoulder Pain Intensity

  • Diyala Erekat
  • Zakia Hammal
  • Maimoon Siddiqui
  • Hamdi Dibeklioğlu

The standard clinical assessment of pain is limited primarily to self-reported pain or clinician impression. While the self-reported measurement of pain is useful, in some circumstances it cannot be obtained. Automatic facial expression analysis has emerged as a potential solution for an objective, reliable, and valid measurement of pain. In this study, we propose a video-based approach for the automatic measurement of self-reported pain and the observer pain intensity, respectively. To this end, we explore the added value of three self-reported pain scales, i.e., the Visual Analog Scale (VAS), the Sensory Scale (SEN), and the Affective Motivational Scale (AFF), as well as the Observer Pain Intensity (OPI) rating for a reliable assessment of pain intensity from facial expression. Using a spatio-temporal Convolutional Neural Network - Recurrent Neural Network (CNN-RNN) architecture, we propose to jointly minimize the mean absolute error of pain score estimation for each of these scales while maximizing the consistency between them. The reliability of the proposed method is evaluated on the benchmark database for pain measurement from videos, namely, the UNBC-McMaster Pain Archive. Our results show that enforcing the consistency between different self-reported pain intensity scores collected using different pain scales enhances the quality of predictions and improves the state of the art in automatic self-reported pain estimation. The obtained results suggest that automatic assessment of self-reported pain intensity from videos is feasible, and could be used as a complementary instrument to unburden caregivers, especially for vulnerable populations that need constant monitoring.

Unsupervised Learning Method for Exploring Students' Mental Stress in Medical Simulation Training

  • Yujin Wu
  • Mohamed Daoudi
  • Ali Amad
  • Laurent Sparrow
  • Fabien D'Hondt

So far, stress detection technology usually uses supervised learning methods combined with a series of physiological, physical, or behavioral signals and has achieved promising results. However, the problem of label collection such as the latency of stress response and subjective uncertainty introduced by the questionnaires has not been effectively solved. This paper proposes an unsupervised learning method with K-means clustering for exploring students' autonomic responses to medical simulation training in an ambulant environment. With the use of wearable sensors, features of electrodermal activity and heart rate variability of subjects are extracted to train the K-means model. The Silhouette Score of 0.49 with two clusters was reached, proving the difference in students' mental stress between baseline stage and simulation stage. Besides, with the aid of external ground truth which could be associated with either the baseline phase or simulation phase, four evaluation metrics were calculated and provided comparable results concerning supervised and unsupervised learning methods. The highest classification performance of 70% was reached with the measure of precision. In the future, we will integrate context information or facial image to provide more accurate stress detection.
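The clustering step described in this abstract can be illustrated with a minimal sketch of K-means with two clusters, scored by the mean silhouette coefficient. This is not the authors' pipeline: the two-dimensional feature vectors below are made-up placeholder values standing in for electrodermal activity and heart rate variability features, and the function names `kmeans` and `silhouette_score` are hypothetical helpers written for this illustration.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        def mean_dist(cluster, skip_self):
            ds = [math.dist(p, q) for j, q in enumerate(points)
                  if labels[j] == cluster and not (skip_self and j == i)]
            return sum(ds) / len(ds) if ds else 0.0

        if labels.count(labels[i]) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = mean_dist(labels[i], True)            # cohesion within own cluster
        b = min(mean_dist(c, False)               # separation from nearest other
                for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Placeholder feature vectors (invented for illustration, not study data):
# the first four mimic a "baseline" phase, the last four a "simulation" phase.
points = [(0.10, 0.80), (0.15, 0.85), (0.20, 0.75), (0.12, 0.90),
          (0.80, 0.30), (0.85, 0.25), (0.90, 0.35), (0.78, 0.20)]
labels = kmeans(points, k=2)
score = silhouette_score(points, labels)
```

A score near 1 indicates compact, well-separated clusters, while a score near 0 indicates overlapping ones. The toy data here is far cleaner than real physiological recordings, so its score says nothing about what the study observed.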

SESSION: IGTD'20 Workshop

A Model of Team Trust in Human-Agent Teams

  • Anna-Sophie Ulfert
  • Eleni Georganta

Trust is a central element for effective teamwork and successful human-technology collaboration. Although technologies, such as agents, are increasingly becoming autonomous team members operating alongside humans, research on team trust in human-agent teams is missing. Thus far, empirical and theoretical work has focused on aspects of trust only towards the agent as a technology, neglecting how team trust - with regard to the human-agent team as a whole - develops. In this paper, we present a model of team trust in human-agent teams combining two streams of research: (1) theories of trust in human teams and (2) theories of human-computer interaction (HCI). We propose different antecedents (integrity, ability, benevolence) that influence team trust in human-agent teams as well as individual, team, system, and temporal factors that impact this relationship. The goal of the present article is to advance our understanding of team trust in human-agent teams and encourage an integration between HCI and team research when planning future research. This will also help to design trustworthy human-agent teams and thereby, when introducing human-agent teams, support organizational functioning.

Inferring Student Engagement in Collaborative Problem Solving from Visual Cues

  • Angelika Kasparova
  • Oya Celiktutan
  • Mutlu Cukurova

Automatic analysis of students' collaborative interactions in physical settings is an emerging problem with a wide range of applications in education. However, this problem has proven challenging due to the complex, interdependent and dynamic nature of student interactions in real-world contexts. In this paper, we propose a novel framework for the classification of student engagement in open-ended, face-to-face collaborative problem-solving (CPS) tasks purely from video data. Our framework i) estimates body pose from the recordings of student interactions; ii) combines face recognition with a Bayesian model to identify and track students with high accuracy; and iii) classifies student engagement leveraging a Team Long Short-Term Memory (Team LSTM) neural network model. This novel approach allows the LSTMs to capture dependencies among individual students in their collaborative interactions. Our results show that the Team LSTM significantly improves performance compared to the baseline method that takes individual student trajectories into account independently.

Modeling Dynamics of Task and Social Cohesion from the Group Perspective Using Nonverbal Motion Capture-based Features

  • Fabian Walocha
  • Lucien Maman
  • Mohamed Chetouani
  • Giovanna Varni

Group cohesion is a multidimensional emergent state that manifests during group interaction. It has been extensively studied in several disciplines such as Social Sciences and Computer Science and it has been investigated through both verbal and nonverbal communication. This work investigates the dynamics of task and social dimensions of cohesion through nonverbal motion-capture-based features. We modeled dynamics either as decreasing or as stable/increasing regarding the previous measurement of cohesion. We design and develop a set of features related to space and body movement from motion capture data as it offers reliable and accurate measurements of body motions. Then, we use a random forest model to binary classify (decrease or no decrease) the dynamics of cohesion, for the task and social dimensions. Our model adopts labels from self-assessments of group cohesion, providing a different perspective of study with respect to the previous work relying on third-party labelling. The analysis reveals that, in a multilabel setting, our model is able to predict changes in task and social cohesion with an average accuracy of 64% (±3%) and 67% (±3%), respectively, outperforming random guessing (50%). In a multiclass setting comprised of four classes (i.e., decrease/decrease, decrease/no decrease, no decrease/decrease and no decrease/no decrease), our model also outperforms chance level (25%) for each class (i.e., 54%, 44%, 33%, 50%, respectively). Furthermore, this work provides a method based on notions from cooperative game theory (i.e., SHAP values) to assess features' impact and importance. We identify that the most important features for predicting cohesion dynamics relate to spatial distance, the amount of movement while walking, the overall posture expansion as well as the amount of inter-personal facing in the group.
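SHAP values build on the Shapley value from cooperative game theory, which credits each "player" (here, a feature) with its average marginal contribution over all coalitions. The sketch below computes exact Shapley values for a toy three-player game. It is an illustration of the underlying notion only, not the attribution procedure used in the paper: the players `A`, `B`, `C`, their weights, and the interaction bonus are all invented for this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: weighted average marginal contribution
    of each player over every coalition of the remaining players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Toy value function (hypothetical): additive weights plus a bonus
# that is only earned when A and B appear together.
WEIGHTS = {"A": 1.0, "B": 2.0, "C": 3.0}


def toy_value(coalition):
    base = sum(WEIGHTS[p] for p in coalition)
    bonus = 2.0 if {"A", "B"} <= coalition else 0.0
    return base + bonus


phi = shapley_values(["A", "B", "C"], toy_value)
```

In this toy game the interaction bonus is split evenly between `A` and `B`, and the attributions sum to the value of the full coalition. Exact enumeration grows exponentially with the number of players, which is why practical SHAP tooling relies on approximations or model-specific shortcuts.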

Group Performance Prediction with Limited Context

  • Uliyana Kubasova
  • Gabriel Murray

Automated prediction of group task performance normally proceeds by extracting linguistic, acoustic, or multimodal features from an entire conversation in order to predict an objective task measure. In this work, we investigate whether we can maintain robust prediction performance when using only limited context from the beginning of the meeting. Graph-based conversation features as well as more traditional linguistic features are extracted from the first minute of the meeting and from the entire meeting. We find that models trained only on the first minute are competitive with models trained on the full conversation. In particular, deriving features from graph-based models of conversational interaction in the first minute of discussion is effective for predicting group performance, and outperforms models using more traditional linguistic features. This work also uses a much larger amount of data than previous work, by combining three similar survival task datasets.

Defining and Quantifying Conversation Quality in Spontaneous Interactions

  • Navin Raj Prabhu
  • Chirag Raman
  • Hayley Hung

Social interactions in general are multifaceted and there exists a wide set of factors and events that influence them. In this paper, we quantify social interactions with a holistic viewpoint on individual experiences, particularly focusing on non-task-directed spontaneous interactions. To achieve this, we design a novel perceived measure, the perceived Conversation Quality, which intends to quantify spontaneous interactions by accounting for several socio-dimensional aspects of individual experiences.

To further quantitatively study spontaneous interactions, we devise a questionnaire which measures the perceived Conversation Quality at both the individual and the group level. Using the questionnaire, we collected perceived annotations for conversation quality in a publicly available dataset using naive annotators. The results of the analysis performed on the distribution and the inter-annotator agreement show that naive annotators tend to agree less in cases of low conversation quality samples, especially while annotating for group-level conversation quality.

SESSION: MAAE'20 Workshop

Machine Understanding of Emotion and Sentiment

  • Mohammad Soleymani

Emotions are subjective experiences involving perceptual and contextual factors [4]. There is no objective tool for precise measurement of emotions. However, we can anticipate an emotion's emergence through the knowledge of common responses to events in similar situations. We can also measure proxies of emotions by recognizing emotional expressions [3]. Studying emotional response to multimedia allows identifying expected emotions in users consuming the content. For example, abrupt loud voices are novel and unsettling, which results in surprise and a higher experience of arousal [2, 6]. For a particular type of content such as music, mid-level attributes such as rhythmic stability or melodiousness have a strong association with expected emotions [1]. Given that such mid-level attributes are more related to the content, their machine perception is more straightforward. Moreover, their perception in combination with user models enables building person-specific emotion anticipation models. In addition to studying expected emotions, we can also observe users' emotional reactions to understand emotion in multimedia. Typical methods of emotion recognition include recognizing emotions from facial or vocal expressions. Recognition of emotional expressions requires large amounts of labeled data, which are expensive to produce. Hence, the most recent advances in machine-based emotion perception include methods that can leverage unlabeled data through self-supervised and semi-supervised learning [3, 5]. In this talk, I review the field and showcase methods for automatic modeling and recognition of emotions and sentiment in different contexts [3, 8]. I show how we can identify underlying factors contributing to the construction of subjective experience of emotions [1, 7]. Identification of these factors allows us to use them as mid-level attributes to build machine learning models for emotion and sentiment understanding. I also show how emotions and sentiment can be recognized from expressions with the goal of building empathetic autonomous agents [8].

Bio-sensing of Environmental Distress for Walkable Built Environment

  • Changbum Ryan Ahn

The negative environmental stimuli (e.g., poorly maintained sidewalks, blighted properties, graffiti, trash on the ground, unsafe traffic conditions) in the urban built environment are linked to stress symptomatology in a significant portion of the urban population. This is a significant contributor to the increase of urban-associated diseases such as depression, allergies, asthma, diabetes, and cardiovascular diseases. A few studies presented the potential to identify pedestrians' environmental distress caused by the negative environmental stimuli using bio-signals (e.g., gait patterns, blood volume pulse, and electrodermal activity) beyond the subjectivity concerns of traditional approaches such as neighborhood surveys and field observation. However, there remain several unanswered questions regarding whether the effect of the negative environmental stimuli can be identified from bio-signals captured in naturalistic ambulatory settings, which include various uncontrollable confounding factors (e.g., movement artifacts, physiological reactivity due to non-intended stimuli, and individual variability). In this context, this talk discusses the challenges and opportunities of leveraging bio-signals in capturing environmental distress. We examine empirical associations between bio-signals and environmental stimuli commonly observed in the neighborhood built environment. Then we present a novel method that identifies group-level environmental distress by capturing and aggregating prominent local patterns of bio-signals from multiple pedestrians. In addition, the potential benefits of multimodal data are illustrated through the experimental results that predict environmental distress by using both bio-signals and image-based data (e.g., visual features captured from built environment images, such as land use, sidewalk connectivity, and road speeds).

Heart Rate Detection from the Supratrochlear Vessels using a Virtual Reality Headset integrated PPG Sensor

  • Michal Gnacek
  • David Garrido-Leal
  • Ruben Nieto Lopez
  • Ellen Seiss
  • Theodoros Kostoulas
  • Emili Balaguer-Ballester
  • Ifigeneia Mavridou
  • Charles Nduka

An increasing amount of virtual reality (VR) research is carried out to support the vast number of applications across mental health, exercise and entertainment fields. Often, this research involves the recording of physiological measures such as heart rate recordings with an electrocardiogram (ECG). One challenge is to enable remote, reliable and unobtrusive VR and heart rate data collection, which would allow a wider application of VR research and practice in the field in future. To address the challenge, this work assessed the viability of replacing standard ECG devices with a photoplethysmography (PPG) sensor that is directly integrated into a VR headset over the branches of the supratrochlear vessels. The objective of this study was to investigate the reliability of the PPG sensor for heart-rate detection. A total of 21 participants were recruited. They were asked to wear an ECG belt as ground truth and a VR headset with the embedded PPG sensor. Signals from both sensors were captured in free standing and sitting positions. Results showed that a VR headset with an integrated PPG sensor is a viable alternative to an ECG for heart rate measurements in optimal conditions with limited movement. Future research will extend this finding by testing it in more interactive VR settings.

Aesthetics in Hypermedia: Impact of Colour Harmony on Implicit Memory and User Experience

  • Julien Venni
  • Mireille Bétrancourt

According to recent perspectives on human-computer interactions, subjective aspects (emotion or visual attractiveness) have to be considered to provide optimal multimedia material. However, the research investigating the impact of aesthetics or emotional design has yielded varying conclusions regarding the use of interfaces and the resulting learning outcomes. Possible reasons include the implementation of the aesthetics variable, which varies from one study to another. On this basis, an experimental study was conducted to assess the influence of a specific feature of aesthetics, colour harmony, on the use and subjective evaluation of a website. The study involved 34 participants browsing two versions of the same website about science-fiction movies, with harmonious vs. disharmonious colours as the between-subject factor. After conducting six information search tasks, participants answered questionnaires assessing usability, user experience, non-instrumental and instrumental qualities. Measures of actual usability of the website, navigation, eye movements and implicit memory performance were collected. Results showed that disharmonious colours caused lower subjective ratings for pragmatic qualities, appeared to distract visual attention but, surprisingly, led to higher memory performance. On the other hand, colour harmony did not impact the navigation and perceived usability of the system, the perception of the aesthetics (apart from colour), hedonic qualities or the experience of use. These findings support the hypothesis that aesthetic features affect users' behavior and perception, but not on all dimensions of user experience. Based on the findings, a model for future research in the field is suggested.

Adaptive Audio Mixing for Enhancing Immersion in Augmented Reality Audio Games

  • Konstantinos Moustakas
  • Emmanouel Rovithis
  • Konstantinos Vogklis
  • Andreas Floros

In this work we present an adaptive audio mixing technique to be implemented in the design of Augmented Reality Audio (ARA) systems. The content of such systems is delivered entirely through the acoustic channel: the real acoustic environment is mixed with a virtual soundscape and returns to the listener as a "pseudoacoustic" environment. We argue that the proposed adaptive mixing technique enhances user immersion in the augmented space in terms of the localization of sound objects. The need to optimise our ARA mixing engine emerged from our previous research, and more specifically from the analysis of the experimental results regarding the development of the Augmented Reality Audio Game (ARAG) "Audio Legends" that was tested in the field. The purpose of our new design was to aid sound localization, which is a crucial and demanding factor for delivering an immersive acoustic experience. We describe in depth the adaptive mixing along with the experimental test-bed. The results for the sound localization scenario indicate a substantial increase of 55 percent in accuracy compared to the legacy ARA mix model.


Action Modelling for Interaction and Analysis in Smart Sports and Physical Education

  • Fahim A. Salim
  • Fasih Haider
  • Maite Frutos-Pascual
  • Dennis Reidsma
  • Saturnino Luz
  • Bert-Jan van Beijnum

This paper briefly overviews the first workshop on Action Modelling for Interaction and Analysis in Smart Sports and Physical Education (MAIStroPE). It focuses on the main aspects intended to be discussed in the workshop, reflecting the main scope of the papers presented during the meeting. The MAIStroPE 2020 workshop is held in conjunction with the 22nd ACM International Conference on Multimodal Interaction (ICMI 2020) taking place in Utrecht, the Netherlands, in October 2020.

Scalable Infrastructure for Efficient Real-Time Sports Analytics

  • Håvard D. Johansen
  • Dag Johansen
  • Tomas Kupka
  • Michael A. Riegler
  • Pål Halvorsen

Recent technological advances are adapted in sports to improve performance, avoid injuries, and make advantageous decisions. In this paper, we describe our ongoing efforts to develop and deploy PMSys, our smartphone-based athlete monitoring and reporting system. We describe our first attempts to gain insight into some of the data we have collected. Experiences so far are promising, both on the technical side and for athlete performance development. Our initial application of artificial-intelligence methods for prediction is encouraging and indicative.

Autonomous and Remote Controlled Humanoid Robot for Fitness Training

  • Emanuele Antonioni
  • Vincenzo Suriani
  • Nicoletta Massa
  • Daniele Nardi

The world population currently counts more than 617 million people over 65 years old. COVID-19 has exposed this population group to new restrictions, leading to new difficulties in care and assistance by family members. New technologies can reduce the degree of isolation of these people, helping them in the execution of healthy activities such as performing periodic sports routines. NAO robots find in this a possible application; being able to alternate voice commands and execution of movements, they can guide elderly people in performing gymnastic exercises. Additional encouragement could come through demonstrations of the exercises and verbal interactions using the voice of loved ones (for example, grandchildren). These are transmitted in real time to the NAO, which streams the video of older people exercising, bringing the two parties involved closer together. This proposal, realized with the NAO V6 robot, provides a helper at home ready to motivate, teach the exercises and train elderly people living alone.

Physical Exercise Form Correction Using Neural Networks

  • Cristian Militaru
  • Maria-Denisa Militaru
  • Kuderna-Iulian Benta

Monitoring and correcting the posture during physical exercises can be a challenging task, especially for beginners that do not have a personal trainer. Recently, successful mobile applications in this domain were launched on the market, but we are unable to find prior studies that are general-purpose and able to run on commodity hardware (smartphones). Our work focuses on static exercises (e.g. Plank and Holding Squat). We create a dataset of 2400 images. The main technical challenge is achieving high accuracy for as many circumstances as possible. We propose a solution that relies on Convolutional Neural Networks to classify images into: correct, hips too low or hips too high. The Neural Network is used in a mobile application that provides live feedback for posture correction. We discuss limitations of the solution and ways to overcome them.

Climbing Activity Recognition and Measurement with Sensor Data Analysis

  • Iustina Ivanova
  • Marina Andric
  • Andrea Janes
  • Francesco Ricci
  • Floriano Zini

The automatic detection of climbers' activities can be the basis of software systems able to support trainers in assessing the climbers' performance and defining more effective training programs. We propose an initial building block of such a system, for the unobtrusive identification of the activity of someone pulling a rope after finishing the ascent. We use a novel type of quickdraw, augmented with a tri-axial accelerometer sensor. The acceleration data generated by the quickdraw during the climbs are used by a Machine Learning classifier for detecting the rope pulling activity. The obtained results show that this activity can be detected automatically with high accuracy, particularly by a Random Forest classifier. Moreover, we show that data acquired by the quickdraw sensor, as well as the detected rope pulling, can also be used to benchmark climbers.

SESSION: MeC'20 Workshop

Measuring Human Behaviour to Inform e-Coaching Actions

  • Oresti Banos

Having a clear understanding of people's behaviour is essential to characterise patient progress, make treatment decisions and elicit effective and relevant coaching actions. Hence, a great deal of research has been devoted in recent years to the automatic sensing and analysis of human behaviour. Sensing options are currently unparalleled due to the number of smart, ubiquitous sensor systems developed and deployed globally. Instrumented devices such as smartphones or wearables enable unobtrusive observation and detection of a wide variety of behaviours as we go about our physical and virtual interactions with the world. The vast amount of data generated by such sensing infrastructures can then be analysed by powerful machine-learning algorithms, which map the raw data into predictive trajectories of behaviour. The processed data is combined with computerised behaviour change frameworks and domain knowledge to dynamically generate tailored recommendations and guidelines through advanced reasoning. This talk explores the recent advances in the automatic sensing and analysis of human behaviour to inform e-coaching actions. The H2020 research and innovation project "Council of Coaches" is particularly used to illustrate the main concepts underpinning this novel area as well as to provide some guidelines and directions for the development of human behaviour measurement technologies to support the future generation of e-coaching systems.

Virtual Coaching for Older Adults at Home using SMART Goal Supported Behavior Change

  • Andoni Beristain Iraola
  • Roberto Álvarez Sánchez
  • Despoina Petsani
  • Santiago Hors-Fraile
  • Panagiotis Bamidis
  • Evdokimos Konstantinidis

This paper presents a virtual coach for older adults at home to support active and healthy aging, and independent living. It aids users in their behavior change process for improving in the cognitive, physical, social interaction and nutrition areas using SMART goals. To achieve an effective behavior change of the user, the coach relies on the I-Change behavioral change model. Using a combination of projectors, cameras, microphones and support sensors, the older adult's home becomes an augmented reality environment, where common objects are used for projection and sensed. Older adults interact with this virtual coach in their home in a natural way using speech and body gestures (including touch on certain objects).

Coaching Older Adults: Persuasive and Multimodal Approaches to Coaching for Daily Living

  • Irina Paraschivoiu
  • Jakub Sypniewski
  • Artur Lupp
  • Magdalena Gärtner
  • Nadejda Miteva
  • Zlatka Gospodinova

In this work, we present our approach to designing a multimodal, persuasive system for coaching older adults in four domains of daily living: activity, mobility, sleep, social interaction. Our design choices were informed by considerations related to the deployment of the system in four pilot sites and three countries: Austria, Bulgaria and Slovenia. In particular, we needed to keep the system affordable, and design across divides such as urban-rural and high-low technological affinity. We present these considerations, together with our approach to coaching through text, audio, light and color, and with the participation of the users' social circles and caregivers. We conducted two workshops and found preference for voice and text. Participants in Bulgaria also showed a preference for music-based rendering of coaching actions.

Transforming Rehabilitation to Virtually Supported Care - The vCare Project

  • Johannes Kropf
  • Niklas Aron Hungerländer
  • Kai Gand
  • Hannes Schlieter

vCare is designing personalized rehabilitation programs that will lead to better continuity of care and a better quality of life for patients with stroke, heart failure, Parkinson's disease or ischemic heart disease. Its goal is to provide a holistic approach for transferring rehabilitation pathways from stationary rehabilitation to the patient's home. vCare pursues two novel approaches in the field. First, it combines persuasive system design (PSD) with a health psychological model (IMB), which are implemented in a software system to motivate the user. Second, it integrates personalized rehabilitation paths and a virtual coach with graphical representation to support the rehabilitation process. The coach is based on patients' personalized care pathways. It engages with patients so that they meet their individual care plans. This encourages compliance with the patients' rehabilitation programs.

ISwimCoach: A Smart Coach guiding System for Assisting Swimmers Free Style Strokes: ISwimCoach

  • Mohamed Ehab
  • Hossam Mohamed
  • Mohamed Ahmed
  • Mostafa Hammad
  • Noha ElMasry
  • Ayman Atia

In sports, coaching remains essential to an athlete's performance. This paper proposes a wrist-worn assistant for swimmers called iSwimCoach. The key aim of the system is to detect and analyze incorrect swimming patterns in the free crawl swimming style using an accelerometer sensor. iSwimCoach collects patterns of a swimmer's stream, which enables it to detect the strokes to be analyzed in real time. It thereby introduces a quick and efficient self-coaching feature that helps mid-level athletes enhance their swimming style. In our research, we were able to monitor athletes' strokes underwater and hence assist swimming coaches. The proposed system was able to classify four types of strokes performed by mid-level swimmers (correct strokes, wrong recovery, wrong hand entry and wrong high elbow). The system informs both the swimmer and the coach when an incorrect movement is detected. iSwimCoach achieved 91% accuracy for the detection and classification of incorrect strokes using a fast, inexpensive dynamic time warping algorithm. These readings are analyzed in real time to automatically generate reports for the swimmer and coach.
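
The abstract names dynamic time warping (DTW) as the classification technique but gives no implementation details. The following is a minimal sketch of how nearest-template DTW classification can work, assuming one-dimensional accelerometer magnitudes; the function names, the template values and the absolute-difference cost are illustrative assumptions, not the authors' method.

```python
from math import inf


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost (assumed: absolute difference)
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def classify_stroke(sample, templates):
    """Return the label of the template closest to the sample under DTW."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))


# Hypothetical templates with made-up values, for illustration only.
TEMPLATES = {
    "correct": [0.0, 1.0, 2.0, 1.0, 0.0],
    "wrong_recovery": [0.0, 2.0, 0.0, 2.0, 0.0],
}
```

Because DTW allows one sequence to stretch against the other, a stroke performed slightly slower than its template can still match it with zero or near-zero distance.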

Multimodal Conversational Agent for Older Adults' Behavioral Change

  • Mira El Kamali
  • Leonardo Angelini
  • Denis Lalanne
  • Omar Abou Khaled
  • Elena Mugellini

Recently, research in multimodal interaction has developed rapidly, and users have been embracing multimodal technologies. A multimodal system allows the user to choose one or more communication channels to access the system, whereas a unimodal system offers only a single channel of communication. Conversational agents are also spreading rapidly, and older adults are becoming more and more exposed to such agents. NESTORE is a virtual coach that aims to follow older adults in their wellbeing journey. It comes with two interfaces: (i) a chatbot, which is a text-based messaging application, and (ii) a tangible coach, which is a vocal assistant. This virtual coach is multimodal because it can present or send information, and receive the user's input, through two different interfaces: the chatbot and the tangible coach. Our aim is to explore which modality users prefer in terms of user experience. With older adults, we tested five different types of scenarios in which the virtual coach interacted with a user through different modalities, used individually or combined. The virtual coach asked a set of questions derived from a behavioral change model called HAPA. We measured the perceived user experience for each scenario with the UEQ-S questionnaire, and asked participants to rank the scenarios according to their preferences.

Measuring and Fostering Engagement with Mental Health e-Coaches

  • Zoraida Callejas
  • David Griol
  • Kawtar Benghazi
  • Manuel Noguera
  • Gerard Chollet
  • María Inés Torres
  • Anna Esposito

Mental health e-coaches and technology-delivered services are showing considerable benefits to foster mental health literacy, monitor symptoms, favour self-management of different mental health conditions and scaffold positive behaviours. However, adherence to these systems is usually low and generally declines over time. There exists a recent body of work addressing engagement with mental health technology with the aim to understand the factors that influence sustained use and inform the design of systems that are able to generate sufficient engagement to attain their expected results. This paper explores the different facets of engagement in mental health e-coaches, including aspects related to the estimation of system use from log data, effective engagement, user experience, motivation, incentives, user expectations, peer support and the specific challenges of technologies addressed to mental health.

Trends & Methods in Chatbot Evaluation

  • Jacky Casas
  • Marc-Olivier Tricot
  • Omar Abou Khaled
  • Elena Mugellini
  • Philippe Cudré-Mauroux

Chatbots are computer programs aiming to replicate human conversational abilities through voice exchanges, textual dialogues, or both. They are becoming increasingly pervasive in many domains like customer support, e-coaching or entertainment. Yet, there is no standardised way of measuring the quality of such virtual agents. Instead, multiple individuals and groups have established their own standards, either specifically for their chatbot project or inspired by other groups. In this paper, we review current techniques and trends in chatbot evaluation. We examine chatbot evaluation methodologies and assess them according to the ISO 9241 concepts of usability: Effectiveness, Efficiency and Satisfaction. We then analyse the methods used in the literature from 2016 to 2020 and compare their results. We identify a clear trend towards evaluating the efficiency of chatbots in many recent papers, which we link to the growing popularity of task-based chatbots that are currently being deployed in many business contexts.

SESSION: MHFI'20 Workshop

Co-Designing Flavor-Based Memory Cues with Older Adults

  • Tom Gayler
  • Corina Sas
  • Vaiva Kalnikaite

This initial study explores the design of flavor-based cues with older adults for their self-defining memories. It proposes using food to leverage the connections between odor and memory to develop new multisensory memory cues. Working with 4 older adults, we identified 6 self-defining autobiographical memories for each participant, 3 related to food, 3 unrelated to food. Flavor-based cues were then created for each memory through a co-design process. Findings indicate the dominance of relationship themes in the identified self-defining memories and that flavor-based cues related mostly to multiple ingredient dishes. We discuss how these findings can support further research and design into flavor-based memory cues through 3D food printing.

Automatic Analysis of Facilitated Taste-liking

  • Yifan Chen
  • Zhuoni Jie
  • Hatice Gunes

This paper focuses on: (i) Automatic recognition of taste-liking from facial videos by comparatively training and evaluating models with engineered features and state-of-the-art deep learning architectures, and (ii) analysing the classification results along the aspects of facilitator type, and the gender, ethnicity, and personality of the participants. To this aim, a new beverage tasting dataset acquired under different conditions (human vs. robot facilitator and priming vs. non-priming facilitation) is utilised. The experimental results show that: (i) The deep spatiotemporal architectures provide better classification results than the engineered feature models; (ii) the classification results for all three classes of liking, neutral and disliking reach F1 scores in the range of 71% - 91%; (iii) the personality-aware network that fuses participants' personality information with that of facial reaction features provides improved classification performance; and (iv) classification results vary across participant gender, but not across facilitator type and participant ethnicity.

The Influence of Emotion-Oriented Extrinsic Visual and Auditory Cues on Coffee Perception: A Virtual Reality Experiment

  • Abilash Nivedhan
  • Line Ahm Mielby
  • Qian Janice Wang

Eating is a process that involves all senses. Recent research has shown that both food-intrinsic and extrinsic sensory factors play a role in the taste of the food we consume. Moreover, many studies have explored the relationship between emotional state and taste perception, where certain emotional states have been shown to alter the perception of basic tastes. This opens up a whole new world of possibilities for the design of eating environments which take into account both sensory attributes as well as their emotional associations. Here, we used virtual reality to study the effect of colours and music, with specific emotional associations, on the evaluation of cold brew coffee. Based on an online study (N=76), two colours and two pieces of music with similar emotional arousal but opposing valence ratings were chosen to produce a total of eight virtual coloured environments. Forty participants were recruited for the on-site experiment, which consisted of three blocks. First, a blind tasting of four coffee samples (0%, 2.5%, 5%, 7.5% sucrose) was carried out. Next, participants experienced the eight environments via an HTC Vive Pro headset and evaluated their expected liking, sweetness and bitterness of a mug of coffee presented in VR. Finally, they tasted identical 5% coffee samples in the same eight environments. One of the key findings of this study is that, when only one factor (colour or music) was manipulated, background colour significantly influenced coffee liking. When colour and music were used in combination, however, we found an overall effect of music valence on coffee sweetness, as well as an interaction effect of colour and music on liking. These results reinforce the importance of extrinsic sensory and emotion factors for food expectations and liking. Overall, these results are in line with previous research, where positive emotions can lead to increased food liking and higher sweetness compared to negative emotions.

An Accessible Tool to Measure Implicit Approach-Avoidance Tendencies Towards Food Outside the Lab

  • Jasper J. van Beers
  • Daisuke Kaneko
  • Ivo V. Stuldreher
  • Hilmar G. Zech
  • Anne-Marie Brouwer

Implicit approach-avoidance tendencies can be measured by the approach-avoidance task (AAT). The emergence of mobile variants of the AAT enables its use for both in-the-lab and in-the-field experiments. Within the food domain, the AAT is used mainly in research on eating disorders or healthy eating and is seldom used as an implicit measure of food experience. Given the prevalence of explicit measures in this field, the AAT may provide additional valuable insights into food experience. To facilitate the use of the AAT as an implicit measure, a processing tool and accompanying graphical user interface (GUI) have been developed for a mobile smartphone variant of the AAT. This tool improves upon the existing processing framework of this mobile AAT by applying more robust filtering and introduces additional sensor correction algorithms to improve the quality of sensor data. Along with refined estimates of reaction time (RT) and reaction force (RF), these processing improvements introduce a new metric: the response distance (RD). The capabilities of the tool, along with the potential added value of calculating RF and RD, are explained in this paper through the processing of pilot data on molded and unmolded food. In particular, the RF and RD may be indicative of participants' arousal. The tool developed in this paper is open source and compatible with other experiments making use of the mobile AAT within, and beyond, the domain of food experience.

Eating with an Artificial Commensal Companion

  • Conor Patrick Gallagher
  • Radoslaw Niewiadomski
  • Merijn Bruijnes
  • Gijs Huisman
  • Maurizio Mancini

Commensality is defined as "a social group that eats together", and eating in a commensality setting has a number of positive effects on humans. The purpose of this paper is to investigate the effects of technology on commensality by presenting an experiment in which a toy robot showing non-verbal social behaviours tries to influence a participant's food choice and food taste perception. We conducted both a qualitative and a quantitative study with 10 participants. Results show that the robot made a favourable impression on participants. It also emerged that the robot may be able to influence food choices using its non-verbal behaviours only. However, these results are not statistically significant, perhaps due to the small sample size. In the future, we plan to collect more data using the same experimental protocol, and to verify these preliminary results.

Guess who's coming to dinner? Surveying Digital Commensality During Covid-19 Outbreak

  • Eleonora Ceccaldi
  • Gijs Huisman
  • Gualtiero Volpe
  • Maurizio Mancini

Eating together is one of the most treasured human activities. Its benefits range from improving the taste of food to mitigating feelings of loneliness. In 2020, many countries adopted lock-down and social distancing policies, forcing people to stay home, often alone and away from families and friends. Although technology can help connect those who are physically distant, it is not clear whether eating together at the same moment via video call is effective in creating the sense of connectedness that comes with sharing a meal with a friend or a family member in person. In this work, we report the results of an online survey on remote eating practices during Covid-19 lock-down, exploring the psychological motivations behind remote eating and behind deciding not to. Moreover, we sketch how future technologies could help create digital commensality experiences.

Augmentation of Perceived Sweetness in Sugar Reduced Cakes by Local Odor Display

  • Heikki Aisala
  • Jussi Rantala
  • Saara Vanhatalo
  • Markus Nikinmaa
  • Kyösti Pennanen
  • Roope Raisamo
  • Nesli Sözer

Multisensory augmented reality systems have demonstrated the potential of olfactory cues in the augmentation of flavor perception. Earlier studies have mainly used commercially available sample products. In this study, custom rye-based cakes with reduced sugar content were used to study the influence of different odorants on the perceived sweetness. A custom olfactory display was developed for presenting the odorants. The results showed that augmentation of a reduced sugar rye-based cake with localized maltol, vanilla, and strawberry odor increased the perceived sweetness of the cake-odor pair compared to a cake with deodorized airflow.

The Effect of Different Affective Arousal Levels on Taste Perception

  • Naoya Zushi
  • Monica Perusquia-Hernandez
  • Saho Ayabe-Kanamura

The emotions we experience shape our perception, and our emotion is shaped by our perceptions. Taste perception is also influenced by emotions. Positive and negative emotions alter sweetness, sourness, and bitterness perception. However, most previous studies mainly explored valence changes. The effect of arousal on taste perception is less studied. In this study, we asked volunteers to watch positive affect inducing videos with high arousal and low arousal. Our results showed a successful induction of high and low arousal levels as confirmed by self-report and electrophysiological signals. Moreover, self-report affective ratings did not show a significant effect on self-reported taste ratings. However, we found a negative correlation between smile occurrence and sweetness ratings. In addition, EDA scores were positively correlated with saltiness. This suggests that even if the self-reported affective state is not granular enough, looking at more fine-grained affective cues can inform ratings of taste.

Multimodal Interactive Dining with the Sensory Interactive Table: Two Use Cases

  • Roelof A. J. de Vries
  • Gijs H. J. Keizers
  • Sterre R. van Arum
  • Juliet A. M. Haarman
  • Randy Klaassen
  • Robby W. van Delden
  • Bert-Jan F. van Beijnum
  • Janet H. W. van den Boer

This paper presents two use cases for a new multimodal interactive instrument: the Sensory Interactive Table. The Sensory Interactive Table is an instrumented, interactive dining table, that measures eating behavior - through the use of embedded load cells - and interacts with diners - through the use of embedded LEDs. The table opens up new ways of exploring the social dynamics of eating. The two use cases describe explorations of the design space of the Sensory Interactive Table in the context of the social space of eating. The first use case details the process of co-designing and evaluating applications to stimulate children to eat more vegetables. The second use case presents the process of designing and evaluating applications to stimulate young adults to reduce their eating speed in a social setting. The results show the broad potential of the design space of the table across user groups, types of interactions, as well as the social space of eating.

Eating Like an Astronaut: How Children Are Willing to Eat

  • Chi Thanh Vi
  • Asier Marzo
  • Dmitrijs Dmitrenko
  • Martin Yeomans
  • Marianna Obrist

How food is presented and eaten influences the eating experience. Novel gustatory interfaces have opened up new ways for eating at the dining table. For example, recent developments in acoustic technology have enabled the transportation of food and drink in mid-air, directly onto the user's tongue. Basic taste particles like sweet, bitter and umami have higher perceived intensity when delivered with acoustic levitation, and are perceived as more pleasant despite their small size (approx. 20 µL or 4 mm diameter droplets). However, it remains unclear if users are ready to accept this delivery method at the dining table. Sixty-nine children aged 14 to 16 years did a taste test of 7 types of foods and beverages, using two delivery methods: acoustic levitation, and knife and fork (traditional way). Children were divided into two groups: one group was shown a video demonstrating how levitating food can be eaten before the main experiment, whereas the other group was shown the video afterwards. Our results showed no significant differences in liking of the foods and beverages between the two delivery methods. However, playing the video prior to the test significantly increased the liking and willingness to eat vegetables in the levitation method. Evaluative feedback suggested that a bigger portion size of levitating foods could be the game-changer to integrate this novel technology into real-life eating experiences.

Eating Sound Dataset for 20 Food Types and Sound Classification Using Convolutional Neural Networks

  • Jeannette Shijie Ma
  • Marcello A. Gómez Maureira
  • Jan N. van Rijn

Food identification technology potentially benefits both the food and media industries, and can enrich human-computer interaction. We assembled a food classification dataset consisting of 11,141 clips, based on YouTube videos of 20 food types. This dataset is freely available on Kaggle. We suggest grouped holdout as the evaluation protocol for assessing model performance. As a first approach, we applied Convolutional Neural Networks to this dataset. Under an evaluation protocol based on grouped holdout, the model obtained an accuracy of 18.5%, whereas under an evaluation protocol based on uniform holdout, the model obtained an accuracy of 37.58%. When approaching this as a binary classification task, the model performed well for most pairs. In both settings, the method clearly outperformed reasonable baselines. We found that, besides texture properties, differences in eating action are an important consideration for data-driven eating sound research. Protocols based on biting sound are limited to textural classification and are less heuristic while assembling food differences.
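
The gap between the grouped and uniform holdout accuracies comes from how clips are assigned to the test set. The sketch below illustrates the two protocols under the assumption that a "group" is the source video a clip was cut from; the abstract does not define the grouping, and the function names and the 20% test fraction are hypothetical.

```python
import random


def grouped_holdout(clips, test_fraction=0.2, seed=0):
    """Split (group_id, label) clips so that no group is in both train and test."""
    groups = sorted({group for group, _ in clips})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [clip for clip in clips if clip[0] not in test_groups]
    test = [clip for clip in clips if clip[0] in test_groups]
    return train, test


def uniform_holdout(clips, test_fraction=0.2, seed=0):
    """Split clips at random, ignoring which group each clip came from."""
    shuffled = list(clips)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

With a uniform split, clips from the same recording can land on both sides, so a model may score well by recognising the recording rather than the food. A grouped split removes that shortcut, which is consistent with the lower accuracy reported for it.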

SESSION: MIP'20 Workshop

Ambient Pain Monitoring in Older Adults with Dementia to Improve Pain Management in Long-Term Care Facilities

  • Siavash Rezaei
  • Abhishek Moturu
  • Shun Zhao
  • Kenneth M. Prkachin
  • Thomas Hadjistavropo
  • Babak Taati

Painful conditions are prevalent in older adults, yet may go untreated, especially in people with severe dementia who often cannot verbally communicate their pain. Not addressing the pain can lead to the worsening of underlying conditions or lead to frustration and agitation. For older adults living in long-term care (LTC) facilities, timely assessment of pain remains a challenge. The main reasons for this are staff shortages at these facilities and/or insufficient expertise in cutting-edge pain assessment methods reliant on non-verbal cues. Ambient monitoring of non-verbal cues of pain, e.g. facial expressions, body movements, or vocalizations, is a promising avenue to improve pain management in LTC. Despite extensive existing research in computer vision algorithms for pain detection, the currently available techniques and models are not ready or directly applicable for use in LTC settings. Publicly available video datasets used for training and validating pain detection algorithms (e.g. the UNBC-McMaster Shoulder Pain Expression Archive Database and the BioVid Heat Pain Database) do not include older adults with dementia. Facial analysis models that are trained and validated on data from healthy and primarily young adults are known to under-perform, sometimes drastically, when tested on faces of older adults with dementia. As such, the performance of existing pain detection models on the dementia population remains to be validated. Furthermore, in existing datasets, participants are well-lit and face the camera; so the developed algorithm's performance may not transfer to a realistic ambient monitoring setting. In this work, we make three main contributions. First, we develop a fully-automated pain monitoring system (based on a convolutional neural network architecture) especially designed for and validated on a new dataset of over 162,000 video frames recorded unobtrusively from 95 older adults, of which 47 were community dwelling and cognitively healthy (age: 75.5 ± 6.1), and 48 (age: 82.5 ± 9.2) were individuals with severe dementia residing in LTC. Second, we introduce a data efficient pairwise training and inference method that calibrates to each individual face. Third, we introduce a contrastive training method and show that it significantly improves cross-dataset performance across UNBC-McMaster, cognitively healthy older adults, and older adults with dementia. We perform 5-fold (leave-subjects-out) cross-validation. Our algorithm achieves a Pearson correlation coefficient (PCC) of 0.48 for per-frame predictions of pain intensity and a PCC of 0.82 for predictions aggregated over 20 second windows for participants with dementia.

Automated Detection of Optimal DBS Device Settings

  • Yaohan Ding
  • Itir Onal Ertugrul
  • Ali Darzi
  • Nicole Provenza
  • László A. Jeni
  • David Borton
  • Wayne Goodman
  • Jeffrey Cohn

Continuous deep brain stimulation (DBS) of the ventral striatum (VS) is an effective treatment for severe, treatment-refractory obsessive-compulsive disorder (OCD). Optimal parameter settings are signaled by a mirth response of intense positive affect, which is subjectively identified by clinicians. Subjective judgments are idiosyncratic and difficult to standardize. To objectively measure mirth responses, we used Automatic Facial Affect Recognition (AFAR) in a series of longitudinal assessments of a patient treated with DBS. Pre- and post-adjustment DBS were compared using both statistical and machine learning approaches. Positive affect was significantly higher after DBS adjustment. Using XGBoost and SVM, the participant's pre- and post-adjustment responses were differentiated with accuracy values of 0.76 and 0.75, which suggest feasibility of objective measurement of mirth response.

Data Drive Development-Multimodal Measurement of Classroom Interaction

  • Daniel S. Messinger
  • Lynn Perry
  • Chaoming Song
  • Yudong Tao
  • Samantha Mitsven
  • Regina Fasano
  • Chitra Banarjee
  • Yi Zhang
  • Mei-Ling Shyu

The educational inclusion of children with communication disorders together with typically developing (TD) peers is a national standard. However, we have little mechanistic understanding of how interactions with peers and teachers contribute to the language development of these children. To build that understanding, we combine objective measurement of the quantity and quality of child and teacher speech with radio frequency identification of their physical movement and orientation. Longitudinal observations of two different sets of classrooms are analyzed. One set of classrooms contains children who require hearing aids and cochlear implants. Another set of classrooms contains children with autism spectrum disorder (ASD). Computational modeling of pair-wise movement/orientation is used to derive periods of social contact when speech may occur. Results suggest that children with ASD are isolated from peers but approach teachers relatively quickly. Overall, talk with peers in social contact (and speech heard from teachers) promotes children's own talk which, in turn, is associated with assessed language abilities.

Objective Measurement of Social Communication Behaviors in Children with Suspected ASD During the ADOS-2

  • Yeojin Amy Ahn
  • Jacquelyn Moffitt
  • Yudong Tao
  • Stephanie Custode
  • Mei-Ling Shyu
  • Lynn Perry
  • Daniel S. Messinger

Autism spectrum disorder (ASD) is defined by persistent disturbances of social communication, as well as repetitive patterns of behavior. ASD is identified on the basis of expert, but subjective, clinician judgment during assessments such as the Autism Diagnostic Observation Schedule-2 (ADOS-2). Quantification of key social behavioral features of ASD using objective measurements would enrich scientific understanding of the disorder. The current pilot study leveraged computer vision and audio signal processing to identify a key set of objective measures of children's social communication behaviors during the ADOS-2 (e.g., social gaze, social smile, vocal interaction) that were captured with adult-worn camera-embedded eyeglasses. Objective measurements of children's social communicative behaviors during the ADOS-2 showed relatively low levels of association with the examiner-adjudicated ADOS-2 scores. Future directions and implications for the use of objective measurements in diagnostic and treatment monitoring are discussed.

Mother-Infant Face-to-Face Intermodal Discrepancy and Risk

  • Beatrice Beebe

During mother-infant face-to-face communication, many modalities are always at play: gazing at and away from the partner, facial expression, vocalization, orientation and touch. Multi-modal information in social communication typically conveys congruent information, which facilitates attention, learning, and interpersonal relatedness. However, when different modalities convey discrepant information, social communication can be disturbed. This paper illustrates forms of discrepant mother-infant communication drawn from our prior studies in three risk contexts: maternal depression, maternal anxiety, and the origins of disorganized attachment. Because many examples of discrepancies emerged in the course of our studies, we consider inter-modal discrepancies to be important markers of disturbed mother-infant communication.

MEMOS: A Multi-modal Emotion Stream Database for Temporal Spontaneous Emotional State Detection

  • Yan Li
  • Xiaohan Xia
  • Dongmei Jiang
  • Hichem Sahli
  • Ramesh Jain

Mental health applications are increasingly interested in using audio-visual and physiological measurements to detect the emotional state of a person, and a significant body of research aims to detect episodic emotional states. The availability of wearable devices and advanced signals is attracting researchers to explore the detection of a continuous sequence of emotion categories, referred to as an emotion stream, for understanding mental health. Currently, there are no established databases for experimenting with emotion streams. In this paper, we make two contributions. First, we collect a Multi-modal EMOtion Stream (MEMOS) database in the scenario of social games. Audio-video recordings of the players are made via mobile phones, and aligned electrocardiogram (ECG) signals are collected by wearable sensors. In total, 40 multi-modal sessions have been recorded, each lasting between 25 and 70 minutes. Emotional states with time boundaries are self-reported and annotated by the participants while watching the video recordings. Second, we propose a two-step emotional state detection framework to automatically determine the emotion categories with their time boundaries along the video recordings. Experiments on the MEMOS database provide the baseline result for temporal emotional state detection research, with an average mean-average-precision (mAP) score of 8.109% on detecting the five emotions (happiness, sadness, anger, surprise, other negative emotions) in videos. This is higher than the 5.47% obtained when the emotions are detected by averaging the frame-level confidence scores (obtained by the Face++ emotion recognition API) in the segments from a sliding window. We expect that this paper will introduce a novel research problem and provide a database for related research.

SESSION: MSECP'20 Workshop

Emotion Recognition using EEG and Physiological Data for Robot-Assisted Rehabilitation Systems

  • Elif Gümüslü
  • Duygun Erol Barkana
  • Hatice Köse

Robot-assisted rehabilitation systems are developed to monitor the performance of patients and adapt the rehabilitation task intensity and difficulty level accordingly to meet the needs of the patients. Robot-assisted rehabilitation systems can be more effective if they are able to recognize the emotions of patients and modify the difficulty level of the task in light of these emotions to increase patients' engagement. We aim to develop an emotion recognition model using electroencephalography (EEG) and physiological signals (blood volume pulse (BVP), skin temperature (ST) and skin conductance (SC)) for a robot-assisted rehabilitation system. The emotions are grouped into three categories: positive (pleasant), negative (unpleasant) and neutral. A machine-learning algorithm called Gradient Boosting Machines (GBM) and a deep learning algorithm called Convolutional Neural Networks (CNN) are used to classify pleasant, unpleasant and neutral emotions from the recorded EEG and physiological signals. We ask the subjects to look at pleasant, unpleasant and neutral images from the IAPS database and collect EEG and physiological signals during the experiments. The classification accuracies of the GBM and CNN methods are compared when data from a single sensor (EEG, BVP, SC or ST) or a combination of EEG and physiological signals is used.

Is There 'ONE way' of Learning? A Data-driven Approach

  • Jauwairia Nasir
  • Barbara Bruno
  • Pierre Dillenbourg

Intelligent Tutoring Systems (ITS) are required to intervene in a learning activity while it is unfolding, to support the learner. To do so, they often rely on performance of a learner, as an approximation for engagement in the learning process. However, in learning tasks that are exploratory by design, such as constructivist learning activities, performance in the task can be misleading and may not always hint at an engagement that is conducive to learning. Using the data from a robot mediated collaborative learning task in an out-of-lab setting, tested with around 70 children, we show that data-driven clustering approaches, applied on behavioral features including interaction with the activity, speech, emotional and gaze patterns, not only are capable of discriminating between high and low learners, but can do so better than classical approaches that rely on performance alone. First experiments reveal the existence of at least two distinct multi-modal behavioral patterns that are indicative of high learning in constructivist, collaborative activities.

Multimodal Fuzzy Assessment for Robot Behavioral Adaptation in Educational Children-Robot Interaction

  • Daniel C. Tozadore
  • Roseli A. F. Romero

Social robots' contributions to education are well known but, at times, limited by how difficult they are for regular teachers to program. Our framework, named R-CASTLE, aims to overcome this problem by providing teachers with an easy way to program their content and the robot's behavior through a graphical interface. However, the robot's behavior adaptation algorithm may still not be the most intuitive method for teachers to understand. Fuzzy systems have the advantage of being modeled in a more human-like way than other methods due to their implementation based on linguistic variables and terms. Thus, fuzzy modeling for robot behavior adaptation in educational children-robot interactions is proposed for this framework. The modeling resulted in an adaptation algorithm that considers a multimodal and autonomous assessment of the students' skills: attention, communication, and learning. Furthermore, preliminary experiments were performed using videos of the robot in a school environment. The adaptation was set to change the difficulty of the content approach to produce a suitably challenging behavior according to each student's reactions. Results were compared to a rule-based adaptive method. The fuzzy modeling showed similar accuracy to the rule-based method, with a suggestion of a more intuitive interpretation of the process.
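
To illustrate what "linguistic variables and terms" means in practice, the sketch below fuzzifies a numeric attention score into the terms low, medium and high using triangular membership functions. The score range of 0 to 1, the term names and the triangle parameters are assumptions made for illustration; the abstract does not describe R-CASTLE's actual membership functions or rules.

```python
def triangular(x, left, peak, right):
    """Membership degree of x in a triangular fuzzy set rising from left to peak, then falling to right."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical linguistic terms for an attention score in [0, 1].
# The outer triangles extend past the score range so that 0 is fully "low" and 1 fully "high".
ATTENTION_TERMS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}


def fuzzify(x, terms):
    """Map a crisp value to its membership degree in every linguistic term."""
    return {name: triangular(x, *params) for name, params in terms.items()}
```

A score of 0.25 is then half "low" and half "medium", which a rule such as "if attention is low, reduce difficulty" can weigh accordingly instead of switching at a hard threshold.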

Training Strategies to Handle Missing Modalities for Audio-Visual Expression Recognition

  • Srinivas Parthasarathy
  • Shiva Sundaram

Automatic audio-visual expression recognition can play an important role in communication services such as tele-health, VOIP calls and human-machine interaction. Accuracy of audio-visual expression recognition could benefit from the interplay between the two modalities. However, most audio-visual expression recognition systems, trained in ideal conditions, fail to generalize in real-world scenarios where either the audio or visual modality could be missing for a number of reasons, such as limited bandwidth, interactors' orientation, or caller-initiated muting. This paper studies the performance of a state-of-the-art transformer when one of the modalities is missing. We conduct ablation studies to evaluate the model in the absence of either modality. Further, we propose a strategy to randomly ablate visual inputs during training at the clip or frame level to mimic real-world scenarios. Experiments conducted on in-the-wild data indicate significant generalization in the proposed models trained on missing cues, with gains of up to 17% for frame-level ablations, showing that these training strategies cope better with the loss of input modalities.
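The difference between clip-level and frame-level ablation can be sketched as follows. This is only an illustration under stated assumptions: the function name `ablate_visual`, the representation of a clip as a list of per-frame feature lists, and the choice to replace ablated frames with zeros are all hypothetical, since the abstract does not say how missing visual inputs are encoded.

```python
import random


def ablate_visual(frames, p, level="frame", rng=None):
    """Randomly drop visual input for one training clip.

    frames: list of per-frame feature lists (hypothetical representation).
    p:      probability of dropping a frame ("frame") or the whole clip ("clip").
    Ablated frames are replaced by zeros, which is an assumption of this sketch.
    """
    rng = rng or random.Random()

    def zeros(frame):
        return [0.0] * len(frame)

    if level == "clip":
        # One random draw decides the fate of the entire clip.
        drop = rng.random() < p
        return [zeros(f) if drop else list(f) for f in frames]
    if level == "frame":
        # An independent random draw per frame.
        return [zeros(f) if rng.random() < p else list(f) for f in frames]
    raise ValueError("level must be 'clip' or 'frame'")
```

Applied to each clip when a training batch is assembled, the model sees a mix of complete and visually degraded examples; the input list itself is left unmodified.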

Measuring Cognitive Load: Heart-rate Variability and Pupillometry Assessment

  • Nerea Urrestilla
  • David St-Onge

Cognitive load is a broad field of study that has attracted the interest of many disciplines, such as neuroscience, psychology and computer science, for decades. With the growing impact of human factors in robotics, many more are diving into the topic, looking in particular for a way to adapt the control of an autonomous system to the cognitive load of its operator. Theoretically, this can be achieved from heart-rate variability measurements, brain wave monitoring, pupillometry or even skin conductivity. This work introduces some recent algorithms to analyze the data from the first two and assesses some of their limitations.

The Wizard is Dead, Long live Data: towards Autonomous Social Behaviour using Data-driven Methods

  • Tony Belpaeme

Assessment of Situation Awareness during Robotic Surgery using Multimodal Data

  • Aurelien Lechappe
  • Mathieu Chollet
  • Jérôme Rigaud
  • Caroline G.L. Cao

The use of robotic surgical systems disrupts existing team dynamics inside operating rooms and constitutes a major challenge for the development of crucial non-technical skills such as situation awareness (SA). Techniques for assessing SA mostly rely on subjective assessments and questionnaires; few leverage multimodal measures combining physiological, behavioural, and subjective indicators. We propose a conceptual model relating SA with mental workload, stress and communication, supported by measurable behaviours and physiological signals. To validate this model, we collect subjective, behavioural, and physiological data from surgical teams performing radical prostatectomy using robotic surgical systems. Statistical analyses will be performed to establish relationships between SA, subjective assessment of stress and mental workload, communication processes, and the surgeons' physiological signals.

Model-based Prediction of Exogeneous and Endogeneous Attention Shifts During an Everyday Activity

  • Felix Putze
  • Merlin Burri
  • Lisa-Marie Vortmann
  • Tanja Schultz

Human attention determines to a large degree how users interact with technical devices and how technical artifacts can support them optimally during their tasks. Attention shifts between different targets, triggered by changing requirements of an ongoing task or by salient distractions in the environment. Such shifts mark important transition points which an intelligent system needs to predict and attribute to an endogenous or exogenous cause for an appropriate reaction. In this paper, we describe a model which performs this task through a combination of bottom-up and top-down modeling components. We evaluate the model in a scenario with a dynamic task in a rich environment and show that the model is able to predict future attention switches with a robust classification performance.

SmartHelm: Towards Multimodal Detection of Attention in an Outdoor Augmented Reality Biking Scenario

  • Sromona Chatterjee
  • Kevin Scheck
  • Dennis Küster
  • Felix Putze
  • Harish Moturu
  • Johannes Schering
  • Jorge Marx Gómez
  • Tanja Schultz

Driving and biking are complex and attention-demanding tasks for which distractions are a major safety hazard. Modeling driving-related attention with regard to audio-visual distractions and assessing the attentional overload could help drivers to reduce stress and increase safety. In this work, we present a multimodal recording architecture using dry EEG-electrodes and the eye-tracking capability of the HoloLens 2 for an outdoor Augmented Reality (AR) scenario. The AR street scene contains visual distractions and moving cars and is shown to the subject in a simulation through the HoloLens 2. The system records EEG and eye-tracking data to predict changes in the driver's attention. A preliminary case study is presented here to detail the data acquisition setup and to detect the occurrences of visual distractions in the simulation. Our first results suggest that this approach may overall be viable. However, further research is still required to refine our setup and models, as well as to evaluate the ability of the system to capture meaningful changes of attention in the field.

SESSION: MSMT'20 Workshop

Gravity-Direction-Aware Joint Inter-Device Matching and Temporal Alignment between Camera and Wearable Sensors

  • Hiroyuki Ishihara
  • Shiro Kumano

To analyze human interaction behavior in a group or crowd, identification and device time synchronization are essential but time-consuming to perform manually. To automate the two processes jointly without any calibration step or auxiliary sensor, this paper presents an acceleration-correlation-based method for multi-person interaction scenarios where each target person wears an accelerometer and a camera is stationed in the scene. A critical issue is how to remove the time-varying gravity direction component from wearable device acceleration, which degrades the correlation of body acceleration between the device and video, yet is hard to estimate accurately. Our basic idea is to estimate the gravity direction component in the camera coordinate system, which can be obtained analytically, and to add it to the vision-based data to compensate for the degraded correlation. We obtained high-accuracy results for person-device matching with four people using only 40 to 60 frames (4 to 6 seconds). The average timing offset estimation is about 5 frames (0.5 seconds). Experimental results suggest that the method is useful for analyzing individual trajectories and group dynamics at low frequencies.
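The general idea of correlation-based temporal alignment can be sketched in a few lines of Python. This is a simplified illustration, not the method of the paper: it works on one-dimensional acceleration magnitudes, searches integer frame lags exhaustively, and omits the gravity-direction compensation described above. The names `pearson` and `best_offset` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def best_offset(video_acc, device_acc, max_lag):
    """Find the lag (in frames) at which device_acc[t + lag] best matches video_acc[t].

    Returns (lag, correlation) for the lag with the highest Pearson correlation.
    """
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        pairs = [
            (video_acc[t], device_acc[t + lag])
            for t in range(len(video_acc))
            if 0 <= t + lag < len(device_acc)
        ]
        if len(pairs) < 3:
            continue  # too little overlap to correlate meaningfully
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if r > best[1]:
            best = (lag, r)
    return best
```

Person-device matching could then, under the same assumptions, assign each tracked person to the device whose best correlation is highest, so that identification and offset estimation come out of the same search.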

A Movement in Multiple Time Neural Network for Automatic Detection of Pain Behaviour

  • Temitayo Olugbade
  • Nicolas Gold
  • Amanda C de C Williams
  • Nadia Bianchi-Berthouze

The use of multiple clocks has been a favoured approach to modelling the multiple timescales of sequential data. Previous work based on clocks and multi-timescale studies in general have not clearly accounted for multidimensionality of data such that each dimension has its own timescale(s). Focusing on body movement data, which has independent yet coordinating degrees of freedom, we propose a Movement in Multiple Time (MiMT) neural network. Our MiMT models multiple timescales by learning different levels of movement interpretation (i.e. labels) and further allows for separate timescales across movement dimensions. We obtain average F1 scores of 0.75 and 0.58 for binary frame-level and three-class window-level classification of pain behaviour, respectively, based on the MiMT. Findings in ablation studies suggest that these two elements of the MiMT are valuable to modelling multiple timescales of multidimensional sequential data.

Structuring Multi-Layered Musical Feedback for Digital Bodily Interaction: Two Approaches to Multi-layered Interactive Musical Feedback Systems

  • Marc-André Weibezahn

This paper describes two approaches to developing simple systems for expressive bodily interaction with music, without prior musical knowledge on the user's part. It discusses two almost opposite models: 1. modifying a preexisting recording through spatial articulation, and 2. rule-based ad-hoc composition of a musical piece of indefinite length, based on precomposed chord progression(s). The approaches differ both in their interaction models and in their musical feedback.

A Computational Method to Automatically Detect the Perceived Origin of Full-Body Human Movement and its Propagation

  • Olga Matthiopoulou
  • Benoit Bardy
  • Giorgio Gnecco
  • Denis Mottet
  • Marcello Sanguineti
  • Antonio Camurri

The work reports ongoing research about a computational method, based on cooperative games on graphs, aimed at detecting the perceived origin of full-body human movement and its propagation. Compared with previous works, a larger set of movement features is considered, and a ground truth is produced, able to assess and compare the effectiveness of each such feature. This is done through the use of the Shapley Value as a centrality index. An Origin of Movement Continuum is also defined, as the basis for creating a repository of movement qualities.
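To make the notion of the Shapley Value as a centrality index concrete, the following Python sketch computes exact Shapley values for a small cooperative game on a graph. The characteristic function used here (the number of edges inside a coalition) and the three-joint chain are placeholders chosen for illustration; they are assumptions and not the game or the movement features used in this work.

```python
from fractions import Fraction
from itertools import permutations


def induced_edges(coalition, edges):
    """Placeholder characteristic function: edges with both endpoints in the coalition."""
    return sum(1 for u, v in edges if u in coalition and v in coalition)


def shapley_values(nodes, edges, value=induced_edges):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    Enumerates every permutation, so it is only practical for small graphs.
    """
    totals = {n: 0 for n in nodes}
    count = 0
    for order in permutations(nodes):
        coalition = set()
        before = value(coalition, edges)
        for n in order:
            coalition.add(n)
            after = value(coalition, edges)
            totals[n] += after - before  # marginal contribution of n in this ordering
            before = after
        count += 1
    return {n: Fraction(totals[n], count) for n in nodes}


# Hypothetical three-joint chain: pelvis - spine - head.
joints = ["pelvis", "spine", "head"]
links = [("pelvis", "spine"), ("spine", "head")]
centrality = shapley_values(joints, links)
```

For this placeholder game each node's Shapley value equals half its degree, so the middle joint of the chain receives the highest centrality. A different characteristic function, for example one built from movement features, would yield a different ranking.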

SESSION: OHT'20 Workshop

Speech, Voice, Text, And Meaning: A Multidisciplinary Approach to Interview Data through the use of digital tools

  • Arjan van Hessen
  • Silvia Calamai
  • Henk van den Heuvel
  • Stefania Scagliola
  • Norah Karrouche
  • Jeannine Beeken
  • Louise Corti
  • Christoph Draxler

Interview data is multimodal data: it consists of speech sound, facial expression and gestures, captured in a particular situation, and containing textual information and emotion. This workshop shows how a multidisciplinary approach may exploit the full potential of interview data. The workshop first gives a systematic overview of the research fields working with interview data. It then presents the speech technology currently available to support transcribing and annotating interview data, such as automatic speech recognition, speaker diarization, and emotion detection. Finally, scholars who work with interview data and tools may present their work and discover how to make use of existing technology.

SESSION: SAMIH'20 Workshop

Upskilling the Future Workforce Using AI and Affective Computing

  • Ehsan Hoque

Preliminary Study of the Perception of Emotions Expressed by Virtual Agents in the Context of Parkinson's Disease

  • Claire Dussard
  • Anahita Basirat
  • Nacim Betrouni
  • Caroline Moreau
  • David Devos
  • François Cabestaing
  • José Rouillard

In the context of Parkinson's disease, this preliminary work aims to study the recognition profiles of emotional faces, dynamically expressed by virtual agents, in a Healthy Control (HC) population. In this online experiment, users had to watch 56 trials of two-second animations showing an emotion progressively expressed by an avatar, and then indicate the recognized emotion by clicking a button. 211 participants completed this experiment online as HC. Of the demographic variables, only age negatively influenced recognition accuracy in HC. The intensity of the expression influenced accuracy as well. Interaction effects between gender, emotion, intensity, and avatar gender are also discussed. The results of four patients with Parkinson's Disease are presented as well. Patients tended to have lower recognition accuracy than age-matched HC (59% for age-matched HC; 45.1% for patients). Joy, sadness and fear seemed less recognized by patients.

Effectiveness of Virtual Reality Playback in Public Speaking Training

  • Hangyu Zhou
  • Yuichiro Fujimoto
  • Masayuki Kanbara
  • Hirokazu Kato

In this paper, factors with positive effects in the playback of virtual reality (VR) presentations in training are discussed. To date, the effectiveness of VR public speaking training in both anxiety reduction and skills improvement has been reported. Though playback using videotape is an effective technique in conventional public speaking training, very few researchers have focused on the effectiveness and possibility of VR playback. In this research, a VR playback system for public speaking training is proposed, and a pilot experiment is carried out to determine the effects of the virtual agent, immersion and public speaking anxiety level in VR playback.

Objective Prediction of Social Skills Level for Automated Social Skills Training Using Audio and Text Information

  • Takeshi Saga
  • Hiroki Tanaka
  • Hidemi Iwasaka
  • Satoshi Nakamura

Although Social Skills Training is a well-known effective method to obtain appropriate social skills during daily communication, getting such training is difficult due to a shortage of therapists. Therefore, automatic training systems are required to ameliorate this situation. To fairly evaluate social skills, we need an objective evaluation method. In this paper, we utilized the second edition of the Social Responsiveness Scale (SRS-2) as an objective evaluation metric and developed an automatic evaluation system using linear regression with multi-modal features. We newly adopted features including 28 audio features and BERT-based sequential similarity (seq-similarity), which indicates how well the meaning of users remains consistent within their utterances. We achieved a 0.35 Pearson correlation coefficient for the SRS-2's overall score prediction and 0.60 for the social communication score prediction, which is a treatment sub-scale score of SRS-2. This experiment shows that our system can objectively predict the levels of social skills. Please note that we only evaluated the system on healthy subjects since this study is still at the feasibility phase. Therefore, further evaluation of real patients is needed in future work.

Developing a Social Biofeedback Training System for Stress Management Training

  • Tanja Schneeberger
  • Naomi Sauerwein
  • Manuel S. Anglet
  • Patrick Gebhard

Mental stress is the psychological and physiological response to a high frequency of or continuous stressors. If prolonged and not regulated successfully, it has a negative impact on health. Developing stress coping techniques, as an emotion regulation strategy, is a crucial part of most therapeutic interventions. Interactive biofeedback agents can be employed as a digital health tool for therapists to let patients train and develop stress-coping strategies. This paper presents an interactive stress management training system using biofeedback derived from the heart rate variability (HRV), with an Interactive Social Agent as an autonomous biofeedback trainer. First evaluations have shown promising results.

Analysis of Mood Changes and Facial Expressions during Cognitive Behavior Therapy through a Virtual Agent

  • Kazuhiro Shidara
  • Hiroki Tanaka
  • Hiroyoshi Adachi
  • Daisuke Kanayama
  • Yukako Sakagami
  • Takashi Kudo
  • Satoshi Nakamura

In cognitive behavior therapy (CBT) with a virtual agent, facial expression processing is expected to be useful for dialogue response selection and empathic dialogue. Unfortunately, its use in current works remains limited. One reason for this situation is the lack of research on the relationship between mood changes and facial expressions through CBT-oriented interaction. This study confirms the improvement of negative moods through interaction with a virtual agent and identifies facial expressions that correlate with mood changes. Based on the cognitive restructuring of CBT, we created a fixed dialogue scenario and implemented it in a virtual agent. We recorded facial expressions during dialogues with 23 undergraduate and graduate students, calculated 17 types of action units (AUs), which are the units of facial movements, and performed a correlation analysis using the change rate of mood scores and the amount of the changes in the AUs. The mean mood improvement rate was 35%, and the mood improvements showed correlations with AU5 (r = -0.51), AU17 (r = 0.45), AU25 (r = -0.43), and AU45 (r = 0.45). These results imply that mood changes are reflected in facial expressions. The AUs identified in this study have the potential to be used for agent-interaction modeling.

Children as Candidates to Verbal Nudging in a Human-robot Experiment

  • Hugues Ali Mehenni
  • Sofiya Kobylyanskaya
  • Ioana Vasilescu
  • Laurence Devillers

In this research, nudges, indirect suggestions which can affect behaviour and decision making, are considered in the context of conversational machines. A first long-term goal of this work is to build an automatic dialog system able to nudge. A second goal is to measure the influence of nudges exerted by conversational agents and robots on humans in order to raise awareness of their use or misuse and open an ethical reflection on their consequences. The study involved primary school children, who are potentially more vulnerable when faced with conversational machines. The children verbally interacted in three different setups: with a social robot, a conversational agent and a human. Each setup includes a Dictator Game adapted to children from which we can infer a nudge metric. First results from the Dictator Game highlight that the conversational agent and the robot seem more influential in nudging children than an adult. In this paper, we seek to measure whether the propensity of the children to be nudged can be predicted from personal and overall dialog features (e.g. age, interlocutor, etc.) and expressive behaviour located at speaker turn level (e.g. emotions, etc.). Features are integrated into vectors, with one vector per speaker turn, which are fed to machine learning models. The speakers' characteristics, the type of interlocutor, objective measures at speaker turn level (latency, duration) and also measures built to quantify the reactions to two influencing questions (open-ended and incongruous) correlate best with the reaction to the nudging strategies.

Music Generation and Emotion Estimation from EEG Signals for Inducing Affective States

  • Kana Miyamoto
  • Hiroki Tanaka
  • Satoshi Nakamura

Although emotion induction using music has been studied, the emotions felt by listening to it vary among individuals. In order to provide personalized emotion induction, it is necessary to predict an individual's emotions and select appropriate music. Therefore, we propose a feedback system that generates music from the continuous value of emotion estimated from electroencephalogram (EEG). In this paper, we describe a music generator and a method of emotion estimation from EEG to construct a feedback system. First, we generated music by calculating parameters from the valence and arousal values of the desired emotion. Our generated music was evaluated by crowdworkers. The median correlation coefficients between the input of the music generator and the emotions felt by the crowdworkers were r=0.60 for valence and r=0.76 for arousal. Next, we recorded EEG while participants listened to music and estimated emotions from the recordings. We compared three regression models: linear regression and a convolutional neural network with and without transfer learning. We obtained the lowest RMSE (valence: 0.1807, arousal: 0.1945) between the actual and estimated emotional values with a convolutional neural network with transfer learning.

Investigating the Influence of Sound Design for Inducing Anxiety in Virtual Public Speaking

  • Enora Gabory
  • Mathieu Chollet

Virtual reality has demonstrated successful outcomes for treating social anxiety disorders, or helping to improve social skills. Some studies showed that various factors can impact the level of participants' anxiety during public speaking. However, the influence of sound design on this anxiety has been less investigated, and it is necessary to study the possible impacts that it can have. In this paper, we propose a model relating sound design concepts to presence and anxiety during virtual reality interactions, and present a protocol of a future experimental study aimed at investigating how sound design and in particular sound distractions can influence anxiety during public speaking simulations in virtual environments.

Towards Detecting Need for Empathetic Response in Motivational Interviewing

  • Zixiu Wu
  • Rim Helaoui
  • Vivek Kumar
  • Diego Reforgiato Recupero
  • Daniele Riboni

Empathetic response from the therapist is key to the success of clinical psychotherapy, especially motivational interviewing. Previous work on computational modelling of empathy in motivational interviewing has focused on offline, session-level assessment of therapist empathy, where empathy captures all efforts that the therapist makes to understand the client's perspective and convey that understanding to the client. In this position paper, we propose a novel task of turn-level detection of client need for empathy. Concretely, we propose to leverage pre-trained language models and empathy-related general conversation corpora in a unique labeller-detector framework, where the labeller automatically annotates a motivational interviewing conversation corpus with empathy labels to train the detector that determines the need for therapist empathy. We also lay out our strategies of extending the detector with additional-input and multi-task setups to improve its detection and explainability.

SESSION: WoCBU'20 Workshop

Social Robot's Processing of Context-Sensitive Emotions in Child Care: A Dutch Use Case

  • Anouk Neerincx
  • Amy Luijk

Therapists, psychologists, family counselors and coaches in youth care show a clear need for social technology support, e.g. for education, motivation and guidance of the children. For example, the Dutch Child and Family Center explores the possibilities of social robot assistance in their regular care pathways. This robot should address the affective processes in the communication appropriately. Whereas there is an enormous amount of emotion research in human-robot interaction, there is not yet a proven set of models and methods that can be put into this practice directly. Our research aims at a model for robot's emotional recognition and expression that is effective in the Dutch youth care. Consequently, it has to take account of personal differences (e.g., child's developmental phase and mental problems) and the context (e.g., family circumstances and therapy approach). Our study distinguishes different phases that may partially run in parallel. First, possible solutions for affective computing by social robots are identified to set the general design space and understand the constraints. Second, in an exploration phase, focus group sessions are conducted to identify core features of emotional expressions that the robot should or could process, including the context-dependencies. Third, in the testing phase, via scenario-based design and child-robot interaction experiments a practical model of affect processing by a social robot in youth care is derived. This short paper provides an overview of the general approach of this research and some preliminary results of the design space and focus group.

Speech and Gaze during Parent-Child Interactions: The Role of Conflict and Cooperation

  • Gijs A. Holleman
  • Ignace T. C. Hooge
  • Jorg Huijding
  • Maja Deković
  • Chantal Kemner
  • Roy S. Hessels

Face-to-face interaction is a primary mode of human social behavior which includes verbal and non-verbal expressions, e.g. speech, gazing, eye contact, facial displays, and gestures (Holler & Levinson, 2019). In this study, we investigated the relation between speech and gaze behavior during 'face-to-face' dialogues between parents and their preadolescent children (9-11 years). 79 child-parent dyads engaged in two semi-structured conversations about family-related topics. We used a state-of-the-art dual-eye tracking setup (Hessels et al. 2019) that is capable of concurrently recording eye movements, frontal video recordings, and audio from two conversational partners. Crucially, the setup is designed in such a way that eye contact can be maintained using half-silvered mirrors, as opposed to e.g. Skype where the camera is located above the screen. Parents and children conversed about two different topics for five minutes each, one 'conflict' (e.g. bedtime, homework) and one 'cooperation' (e.g. organize a party) topic. Preliminary analyses of speech behavior (Figure 1) show that children talked more in the cooperative task and talked less when discussing a topic of disagreement with their parents. Conversely, parents talked more during the conflict-task and less during the cooperative-task. The next step is to combine measures of speech and gaze to investigate the interplay and temporal characteristics of verbal and non-verbal behavior during face-to-face interactions.

Using Markov Models and Classification to Understand Face Exploration Dynamics in Boys with Autism

  • Sofie Vettori
  • Jannes Nys
  • Bart Boets

Scanning faces is important for social interactions, and maintaining good eye contact carries significant social value. Difficulty with the social use of eye contact constitutes one of the clinical symptoms of autism spectrum disorder (ASD). It has been suggested that individuals with ASD look less at the eyes and more at the mouth than typically developing individuals, possibly due to gaze aversion (Tanaka & Sung, 2016) or gaze indifference (Chevallier et al., 2012). Eye tracking evidence for this hypothesis is mixed (e.g. Falck-Ytter & von Hofsten, 2011; Frazier et al., 2017). Face exploration dynamics (rather than the overall looking time to facial parts) might be altered in ASD. Recent studies have proposed a method for scanpath modeling and classification to capture systematic patterns diagnostic of a given class of observers and/or stimuli (Coutrot et al., 2018). We adopted this method, combining Markov Models and classification analyses, to understand face exploration dynamics in boys with ASD and typically developing (TD) school-aged boys (N = 42). Eye tracking data were recorded while participants viewed static faces. Faces were divided into areas of interest (AOIs) by means of limited-radius Voronoi tessellation (LRVT) (Hessels et al., 2016). Proportional looking time analyses show that both groups looked longer at the eyes than at the mouth, and we did not observe group differences in fixation duration to these features. TD boys looked significantly longer at the nose, while the ASD boys looked more outside the face. We modeled the temporal dynamics of the gaze behavior using Markov Models (MMs). To determine the individual separability of the resulting transition matrices, we constructed a classification model using linear discriminant analysis (LDA). We found that the ASD group displays more exploratory dynamic gaze behavior than the TD group, as indicated by higher transition probabilities of moving gaze between AOIs. Based on a leave-one-out cross-validation analysis, we find an accuracy of 72%, implying that there is a 72% chance of correctly predicting group membership based on the face exploration dynamics. These results indicate that atypical eye contact in ASD might be manifested through more frequent gaze shifting, even when total looking time to the eyes is the same. Whereas individual accuracy is modest in this experiment, we hypothesize that when used in more realistic paradigms (e.g. real-life interaction), this method could be highly accurate in individual separability.
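The step from a fixation sequence to a Markov transition matrix can be sketched as follows. This is a minimal illustration assuming a first-order model estimated by simple counting; the AOI labels, the example sequence, and the function name `transition_matrix` are hypothetical and not taken from the study.

```python
def transition_matrix(aoi_sequence, aois):
    """Estimate first-order transition probabilities between areas of interest.

    aoi_sequence: AOI label of each successive fixation for one participant.
    aois:         ordered list of all AOI labels (defines row/column order).
    Rows with no outgoing transitions are left as all zeros.
    """
    index = {a: i for i, a in enumerate(aois)}
    counts = [[0] * len(aois) for _ in aois]
    # Count each consecutive pair (source AOI -> destination AOI).
    for src, dst in zip(aoi_sequence, aoi_sequence[1:]):
        counts[index[src]][index[dst]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical scanpath over three AOIs.
labels = ["eyes", "nose", "mouth"]
scanpath = ["eyes", "eyes", "nose", "mouth", "eyes", "nose"]
P = transition_matrix(scanpath, labels)
```

The off-diagonal entries of `P` are the between-AOI transition probabilities referred to above. Flattening one such matrix per participant gives a feature vector that a classifier such as LDA could be trained on, with leave-one-out cross-validation holding out one participant at a time.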

Eye Tracking in Human Interaction: Possibilities and Limitations

  • Niilo V. Valtakari
  • Ignace T. C. Hooge
  • Charlotte Viktorsson
  • Pär Nyström
  • Terje Falck-Ytter
  • Roy S. Hessels

There is a long history of interest in looking behavior during human interaction. With the advance of (wearable) video-based eye trackers, it has become possible to measure gaze during many different interactions, even in challenging situations, such as during interactions between young children and their caregivers. We outline the different types of eye-tracking setups that currently exist to investigate gaze during interaction. The setups differ mainly with regard to the nature of the eye-tracking signal (head- or world-centered) and the freedom of movement allowed for the participants (see Figure 1). These crucial, yet often overlooked features place constraints on the research questions that can be answered about human interaction. Furthermore, recent developments in machine learning have made available the measurement of gaze directly from video recordings, without the need for specialized eye-tracking hardware, widening the spectrum of possible eye-tracking setups. We discuss the link between type of eye-tracking setup and the research question being investigated, and end with a decision tree to help researchers judge the appropriateness of specific setups (see Figure 2).

Combining Clustering and Functionals based Acoustic Feature Representations for Classification of Baby Sounds

  • Heysem Kaya
  • Oxana Verkholyak
  • Maxim Markitantov
  • Alexey Karpov

This paper investigates different fusion strategies as well as provides insights on their effectiveness alongside standalone classifiers in the framework of paralinguistic analysis of infant vocalizations. The combinations of such systems as Support Vector Machines (SVM) and Extreme Learning Machines (ELM) based classifiers, as well as its weighted kernel version are explored, training systems on different acoustic feature representations and implementing weighted score-level fusion of the predictions. The proposed framework is tested on INTERSPEECH ComParE-2019 Baby Sounds corpus, which is a collection of Home Bank infant vocalization corpora annotated for five classes. Adhering to the challenge protocol, using a single test set submission we outperform the challenge baseline Unweighted Average Recall (UAR) score and achieve a comparable result to the state-of-the-art.

Early Development Indicators Predict Speech Features of Autistic Children

  • Elena E. Lyakso
  • Olga V. Frolova

The goal of the study is to reveal the correlation between speech peculiarities and different aspects of development of children with autism spectrum disorders. The participants in the study were 28 children with autism spectrum disorders (ASD) aged 4-11 years and 64 adults - listening to children's speech samples. Children with ASD were divided into two groups: ASD-1 - ASD is the leading symptom (F84, n=17); children assigned to ASD-2 (n=11) had other disorders accompanied by ASD symptomatology (F83 + F84). Recording of children's speech and behavior was carried out in the most similar situations: a dialogue with the experimenter, viewing pictures and retelling a story about them or answers to questions, book reading. The child's psychophysiological characteristics were estimated according to the method which includes determining the leading hemisphere by speech (dichotic listening test - DLT), phonemic hearing, and the profile of lateral functional asymmetry (PLFA). All tasks and the time of the study were adapted to the child's capacities. The study analyzed the level of speech formation in 4-11 year-old children with ASD, identified direct and indirect relationships between the features of early development, its psychophysiological indicators, and the speech development level at the time of the study. The ability of adults to recognize the psychoneurological state of children via their speech is determined. The results of the study support the need to increase focus on and understanding of the language strengths and weaknesses in children with ASD and an individual approach to teaching children.

Automatic Recognition of Target Words in Infant-Directed Speech

  • Anika van der Klis
  • Frans Adriaans
  • Mengru Han
  • René Kager

This study assesses the performance of a state-of-the-art automatic speech recognition (ASR) system at extracting target words in two different speech registers: infant-directed speech (IDS) and adult-directed speech (ADS). We used the Kaldi-NL ASR-service, developed by the Dutch Foundation of Open Speech Technology. The results indicate that the accuracy of the tool is much lower in IDS than in ADS. There are differences between IDS and ADS which negatively affect the performance of the existing ASR system. Therefore, new tools need to be developed for the automatic annotation of IDS. Nevertheless, the ASR system can already find more than half of the target words, which is promising.
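A measure such as "more than half of the target words found" can be computed along the following lines. This is a minimal sketch that assumes the ASR output is available as a plain-text transcript and counts a target as found if it occurs at least once as a whole word; the function name and the matching rule are assumptions, not the evaluation procedure used in the study, and no Kaldi-NL interface is modeled here.

```python
import re


def target_word_recall(transcript, targets):
    """Return (fraction of targets found, list of targets found).

    A target counts as found if it appears as a whole word in the
    transcript, ignoring case. Raises ValueError for an empty target list.
    """
    if not targets:
        raise ValueError("targets must not be empty")
    tokens = set(re.findall(r"\w+", transcript.lower()))
    found = [t for t in targets if t.lower() in tokens]
    return len(found) / len(targets), found
```

Running the same function on IDS and ADS transcripts of the same target list would give the per-register comparison described above.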

Instagram Use and the Well-Being of Adolescents: Using Deep Learning to Link Social Scientific Self-reports with Instagram Data Download Packages

  • Laura Boeschoten
  • Irene I. van Driel
  • Daniel L. Oberski
  • Loes J. Pouwels

Since the introduction of social media platforms, researchers have investigated how the use of such media affects adolescents' well-being. Thus far, findings have been inconsistent [1, 2, 3]. The aim of our interdisciplinary project is to provide a more thorough understanding of these inconsistencies by investigating who benefits from social media use, who does not and why it is beneficial for one yet harmful for another [1]. In this presentation, we explain our approach to combining social scientific self-report data with the use of deep learning to analyze personal Instagram archives.

The implementation of the GDPR in 2018 opened up new possibilities for social media research. Each platform is legally mandated to provide its European users with their social media archive in a digitally readable format upon request, and all large platforms currently comply. These data download packages (DDPs) help resolve three main challenges in current research. First, the reliability of self-reported social media use suffers from recall bias, particularly among teens [2]. Instagram DDPs provide objective, timestamped insights into Instagram use. Second, previous research has demonstrated that time spent on social media has no or only a small relationship with well-being [3]. The Instagram DDPs answer recent calls for knowledge on adolescents' specific activities (posting, messaging, commenting) on social media [3]. Third, DDPs resolve selectivity issues in research that relies on APIs and analyzes public content only, whereas adolescent use has an important private component [1].

In a longitudinal study, we invited 388 adolescents (8th and 9th graders of a Dutch high school, mean age = 14.11, 54% girls) to participate in a panel survey and an experience sampling (ESM) study, and to share their Instagram DDP at the end of both studies. Of this group, 104 Instagram users (mean age = 14.05, 66% girls) agreed to share their DDP. As DDPs contain private and third-party content, data managers, ethics committee members and privacy officers have been closely monitoring the research process. We developed a Python script that anonymizes parts of the DDP by removing identifiers from images, videos, and text. Other parts of the DDP are pseudonymized, allowing us, for example, to connect befriended users within the study. During the presentation, we report on this preparation process and the validation of the anonymization and pseudonymization script.
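To illustrate how pseudonymization can keep befriended users linkable across packages, the sketch below maps each username to a stable keyed hash and rewrites @-mentions in text. It is not the authors' script: the key, the function names, the `user_` prefix, and the mention pattern are all hypothetical, and the removal of identifiers from images and videos is not covered.

```python
import hashlib
import hmac
import re

# Hypothetical project secret; in practice it would be stored securely
# and never shipped with the data.
SECRET_KEY = b"replace-with-a-project-secret"

# Assumed username pattern: letters, digits, dots and underscores,
# not ending in a dot (so sentence-final punctuation is left alone).
MENTION = re.compile(r"@([A-Za-z0-9_](?:[A-Za-z0-9._]*[A-Za-z0-9_])?)")


def pseudonymize(username, key=SECRET_KEY):
    """Map a username to a stable pseudonym using a keyed SHA-256 hash."""
    digest = hmac.new(key, username.lower().encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:12]


def pseudonymize_mentions(text, key=SECRET_KEY):
    """Replace every @mention in the text with its pseudonym."""
    return MENTION.sub(lambda m: "@" + pseudonymize(m.group(1), key), text)
```

Because the same username always yields the same pseudonym under a given key, connections between participants survive, while the original handle cannot be read from the processed data.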

With the combined dataset containing panel survey results, ESM data and Instagram DDPs, we plan to perform a number of analyses intended to investigate both the possibilities of using DDPs for scientific research and the well-being of adolescents. First, we plan to investigate the representativeness of the sub-sample that agreed to share their DDPs. Second, we plan to generate emotional classifications and classifications of contextual factors for the images and text found in the DDPs using Microsoft Azure Cognitive Services, and to relate these to self-reported trait levels of well-being derived from both the survey and the ESM. Third, we plan to develop natural language processing and computer vision algorithms using the DDP content as data and self-reported state levels of well-being as labels. By combining deep learning and social science, we aim to understand differences in Instagram use between adolescents who feel happy and those who feel less happy.

Speech Acquisition in Children with Typical and Atypical Development

  • Elena E. Lyakso

The keynote will present comparative experimental data on the formation of speech and communication skills in typically developing children and in children with atypical development: Autism Spectrum Disorders, Down syndrome, and intellectual disabilities. The specifics of analyzing children's speech will be noted, and databases of children's speech and their use will be presented. The main emphasis will be placed on how pathological states of infants and children are reflected in the characteristics of the voice, and on revealing biomarkers of diseases from the features of children's speech and voice.


ICMI 2020 ACM International Conference on Multimodal Interaction. Copyright © 2019-2021