Proceedings
ICMI '19: 2019 International Conference on Multimodal Interaction
SESSION: Keynote & Invited Talks
Intelligence is the deciding factor in how human beings became the most dominant life form on earth. Throughout history, human beings have developed tools and technologies that help civilizations evolve and grow. Computers, and by extension artificial intelligence (AI), have played important roles in that continuum of technologies. Recently, artificial intelligence has garnered much interest and discussion. As AI comprises tools that can enhance human capability, a sound understanding of what the technology can and cannot do is necessary to ensure its appropriate use. While developing artificial intelligence, we have also found that the definition and understanding of our own human intelligence continue to evolve. Debates about the race between human and artificial intelligence keep growing. In this talk, I will describe the history of both artificial intelligence and human intelligence (HI). Drawing on the insights of these historical perspectives, I will illustrate how AI and HI will co-evolve with each other and project the future of AI and HI.
With the rapid progress in computing and sensory technologies, we will enter the era of
human-robot coexistence in the not-too-distant future, and it is time to address the challenges
of multimodal interaction. Should a robot take the form of a humanoid? Is it better for robots to behave as second-class citizens or as equal members of society alongside humans? Should the communication between human and robot be symmetric, or is it acceptable for it to be asymmetric? And what about communication between robots in the presence of humans? What does emotional intelligence mean for robots? With the inevitable physical interaction between human and robot, how can safety be guaranteed? What ethical and moral model should robots follow, and how?
Behind much of my research work over 4 decades has been the simple observation that people like
people and love interacting with other people more than they like interacting with machines.
Technologies that truly support such social desires are more likely to be adopted broadly.
Consider email, texting, chat rooms, social media, video conferencing, the internet, speech
translation, even videogames with a social element (e.g., Fortnite): we enjoy the technology
whenever it brings us closer to our fellow humans, instead of imposing attention-grabbing
clutter. If so, how can we build better technologies that improve, encourage, and support human-human interaction? In this talk, I will recount my own story along this journey. When I began, building technologies for the human-human experience presented formidable challenges:
Computer interfaces would need to anticipate and understand the way humans interact, but in
1976, a typical computer had only two instructions to interact with humans: character-in &
character-out, and both only supported human-computer interaction. Over the decades that
followed, we began to develop interfaces that can process the various modalities of human communication, and we built systems that used several modalities in services that improve human-human interaction. These included:
In my talk, I will discuss the challenges of interpreting multimodal signals of human-human
interaction in the wild. I will show the resulting human-human systems we developed and how to
make them effective. Some went on to become services that affect the way we work and
communicate today.
Recent years have initiated a paradigm shift from pure task-based human-machine interfaces
towards socially-aware interaction. Advances in deep learning have led to anthropomorphic
interfaces with robust sensing capabilities that come close to or even exceed human
performance. In some cases, these interfaces may convey to humans the illusion of a sentient
being that cares for them. At the same time, there is the risk that - at some point - these
systems may have to reveal their lack of true comprehension of the situational context and the user’s needs, with serious consequences for user trust. The talk will discuss challenges
that arise when designing multimodal interfaces that hide the underlying complexity from the
user, but still demonstrate a transparent and plausible behavior. It will argue for hybrid AI
approaches that look beyond deep learning to encompass a theory of mind to obtain a better
understanding of the rationale behind human behaviors.
SESSION: Session 1: Human Behavior
- Ognjen Rudovic
- Meiru Zhang
- Bjorn Schuller
- Rosalind Picard
Human behavior expression and experience are inherently multimodal, and characterized by vast
individual and contextual heterogeneity. To achieve meaningful human-computer and human-robot
interactions, multi-modal models of the user’s states (e.g., engagement) are therefore
needed. Most existing works that try to build classifiers for the user’s states
assume that the data to train the models are fully labeled. Nevertheless, data labeling is
costly and tedious, and also prone to subjective interpretations by the human coders. This is
even more pronounced when the data are multi-modal (e.g., some users are more expressive with
their facial expressions, some with their voice). Thus, building models that can accurately
estimate the user’s states during an interaction is challenging. To tackle this, we
propose a novel multi-modal active learning (AL) approach that uses the notion of deep
reinforcement learning (RL) to find an optimal policy for actively selecting the user data needed to train the target (modality-specific) models. We investigate different
strategies for multi-modal data fusion, and show that the proposed model-level fusion coupled
with RL outperforms the feature-level and modality-specific models, and the naïve AL
strategies such as random sampling, and the standard heuristics such as uncertainty sampling.
We show the benefits of this approach on the task of engagement estimation from real-world
child-robot interactions during autism therapy. Importantly, we show that the proposed multi-modal AL approach can be used to efficiently personalize the engagement classifiers to the target user using a small amount of actively selected user data.
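The following is a simplified, illustrative sketch of the two ideas named above: per-modality classifiers fused at the model level, and an active-selection loop over unlabeled user data. The paper learns the selection policy with deep reinforcement learning; here plain uncertainty sampling stands in for that policy, and the feature dimensions and data are placeholders.

```python
# Illustrative sketch only: model-level fusion of per-modality engagement
# classifiers with a simple active-selection loop. The paper learns the
# selection policy with deep RL; here uncertainty sampling stands in for it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical per-modality features (face, audio) and binary engagement labels.
n = 500
X_face, X_audio = rng.normal(size=(n, 16)), rng.normal(size=(n, 8))
y = (X_face[:, 0] + X_audio[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

labeled = list(range(20))                      # small seed set of labeled frames
pool = [i for i in range(n) if i not in labeled]

def fit_fusion(idx):
    """Train one classifier per modality and fuse their posteriors (model-level fusion)."""
    face_clf = LogisticRegression(max_iter=1000).fit(X_face[idx], y[idx])
    audio_clf = LogisticRegression(max_iter=1000).fit(X_audio[idx], y[idx])
    def predict_proba(face, audio):
        return 0.5 * face_clf.predict_proba(face) + 0.5 * audio_clf.predict_proba(audio)
    return predict_proba

for step in range(10):                         # active-learning rounds
    fused = fit_fusion(labeled)
    proba = fused(X_face[pool], X_audio[pool])[:, 1]
    pick = pool[int(np.argmin(np.abs(proba - 0.5)))]   # most uncertain sample
    labeled.append(pick)
    pool.remove(pick)

print("labeled set size after AL:", len(labeled))
```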
- Gian-Luca Savino
- Niklas Emanuel
- Steven Kowalzik
- Felix Kroll
- Marvin C. Lange
- Matthis Laudan
- Rieke Leder
- Zhanhua Liang
- Dayana Markhabayeva
- Martin Schmeißer
- Nicolai Schütz
- Carolin Stellmacher
- Zihe Xu
- Kerstin Bub
- Thorsten Kluss
- Jaime Maldonado
- Ernst Kruijff
- Johannes Schöning
Mobile navigation apps are among the most used mobile applications and are often used as a
baseline to evaluate new mobile navigation technologies in field studies. As field studies
often introduce external factors that are hard to control for, we investigate how pedestrian
navigation methods can be evaluated in virtual reality (VR). We present a study comparing
navigation methods in real life (RL) and VR to evaluate whether VR environments are a viable alternative to RL environments for testing such methods. In a series of studies,
participants navigated a real and a virtual environment using a paper map and a navigation app
on a smartphone. We measured the differences in navigation performance, task load and spatial
knowledge acquisition between RL and VR. From these, we formulate guidelines for improving pedestrian navigation systems in VR, such as improved legibility for small-screen devices. We furthermore discuss appropriate low-cost, low-space VR locomotion techniques as well as more controllable alternatives.
- Metehan Doyran
- Batıkan Türkmen
- Eda Aydın Oktay
- Sibel Halfon
- Albert Ali Salah
Play therapy is an approach to psychotherapy in which a child engages in play activities. Because of the strong affective component of play, it provides a natural setting to analyze the feelings and coping strategies of the child. In this paper, we investigate an approach to track
the affective state of a child during a play therapy session. We assume a simple, camera-based
sensor setup, and describe the challenges of this application scenario. We use fine-tuned
off-the-shelf deep convolutional neural networks for the processing of the child’s face
during sessions to automatically extract valence and arousal dimensions of affect, as well as
basic emotional expressions. We further investigate text-based and body-movement based affect
analysis. We evaluate these modalities separately and in conjunction with play therapy videos
in natural sessions, discussing the results of such analysis and how it aligns with the
professional clinicians’ assessments.
- Byung Cheol Song
- Min Kyu Lee
- Dong Yoon Choi
Recognizing emotions by adapting to various human identities is very difficult. In order to
solve this problem, this paper proposes a relation-based conditional generative adversarial
network (RcGAN), which recognizes facial expressions by using the difference (or relation)
between neutral face and expressive face. The proposed method can recognize facial expression
or emotion independently of human identity. Experimental results show that the proposed method achieves higher accuracies than conventional methods: 97.93% and 82.86% on the CK+ and MMI databases, respectively.
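Below is a minimal sketch of only the core "relation" idea described above, not the full RcGAN: classify the difference between embeddings of a neutral and an expressive face of the same person, so that identity-specific appearance largely cancels out. The network shapes and layer sizes are assumptions for illustration.

```python
# Minimal sketch of the "relation" idea only (not the full RcGAN): classify the
# difference between embeddings of a neutral face and an expressive face of the
# same person, so identity-specific appearance largely cancels out. Hypothetical shapes.
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.encoder = nn.Sequential(          # shared face encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, n_classes)   # classify the embedding difference

    def forward(self, neutral, expressive):
        relation = self.encoder(expressive) - self.encoder(neutral)
        return self.head(relation)

model = RelationClassifier()
neutral = torch.randn(4, 3, 64, 64)            # dummy neutral faces
expressive = torch.randn(4, 3, 64, 64)         # dummy expressive faces
logits = model(neutral, expressive)
print(logits.shape)                            # torch.Size([4, 7])
```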
- Suowei Wu
- Zhengyin Du
- Weixin Li
- Di Huang
- Yunhong Wang
Continuous emotion recognition is of great significance in affective computing and
human-computer interaction. Most existing methods for video-based continuous emotion
recognition utilize facial expression. However, besides facial expression, other clues
including head pose and eye gaze are also closely related to human emotion, but have not been
well explored in continuous emotion recognition task. On the one hand, head pose and eye gaze
could result in different degrees of credibility of facial expression features. On the other
hand, head pose and eye gaze carry emotional clues themselves, which are complementary to
facial expression. Accordingly, in this paper we propose two ways to incorporate these two
clues into continuous emotion recognition. They are respectively an attention mechanism based
on head pose and eye gaze clues to guide the utilization of facial features in continuous
emotion recognition, and an auxiliary line which helps extract more useful emotion information
from head pose and eye gaze. Experiments are conducted on the Recola dataset, a database for
continuous emotion recognition, and the results show that our framework outperforms other
state-of-the-art methods due to the full use of head pose and eye gaze clues in addition to
facial expression for continuous emotion recognition.
- Md Abdullah Al Fahim
- Mohammad Maifi Hasan Khan
- Theodore Jensen
- Yusuf Albayram
- Emil Coman
- Ross Buck
Safety-critical systems (e.g., UAV systems) often incorporate warning modules that alert users
regarding imminent hazards (e.g., system failures). However, these warning systems are often
not perfect, and trigger false alarms, which can lead to negative emotions and affect
subsequent system usage. Although various feedback mechanisms have been studied in the past to
counter the possible negative effects of system errors, the effect of such feedback mechanisms
and system errors on users’ immediate emotions and task performance is not clear. To
investigate the influence of affective feedback on participants’ immediate emotions, we
designed a 2 (warning reliability: high/low) × 2 (feedback: present/absent) between-group
study where participants interacted with a simulated UAV system to identify and neutralize
enemy vehicles under time constraint. Task performance along with participants’ facial
expressions were analyzed. Results indicated that giving feedback decreased fear during the task, whereas warnings increased frustration for high-reliability groups compared to low-reliability groups. Finally, feedback was found not to affect task performance.
SESSION: Session 2: Artificial Agents
- Mohammad Soleymani
- Kalin Stefanov
- Sin-Hwa Kang
- Jan Ondras
- Jonathan Gratch
Self-disclosure to others has proven benefits for one’s mental health. It has been shown that disclosure to computers can be similarly beneficial for emotional and psychological well-being.
In this paper, we analyzed verbal and nonverbal behavior associated with self-disclosure in two
datasets containing structured human-human and human-agent interviews from more than 200
participants. Correlation analysis of verbal and nonverbal behavior revealed that linguistic
features such as affective and cognitive content in verbal behavior, and nonverbal behavior
such as head gestures are associated with intimate self-disclosure. A multimodal deep neural
network was developed to automatically estimate the level of intimate self-disclosure from
verbal and nonverbal behavior. Among the modalities, verbal behavior was best for estimating self-disclosure within corpora, achieving r = 0.66. However, cross-corpus evaluation demonstrated that nonverbal behavior can outperform the language modality. Such automatic models can be deployed in interactive virtual agents or
social robots to evaluate rapport and guide their conversational strategy.
- Deepali Aneja
- Daniel McDuff
- Shital Shah
Embodied avatars as virtual agents have many applications and provide benefits over disembodied
agents, allowing nonverbal social and interactional cues to be leveraged, in a similar manner
to how humans interact with each other. We present an open embodied avatar built upon the
Unreal Engine that can be controlled via a simple python programming interface. The avatar has
lip syncing (phoneme control), head gesture and facial expression (using either facial action
units or cardinal emotion categories) capabilities. We release code and models to illustrate
how the avatar can be controlled like a puppet or used to create a simple conversational agent
using public application programming interfaces (APIs). GitHub link:
https://github.com/danmcduff/AvatarSim
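As a purely hypothetical illustration of what puppeteering such an avatar from Python could look like, the sketch below sends JSON commands to a local avatar process over a socket. This is not the actual AvatarSim interface (see the GitHub link above for the real one); the host, port, and message fields are invented for the example.

```python
# Hypothetical sketch of puppeteering an embodied avatar over a local socket with
# JSON commands. This is NOT the actual AvatarSim API (see the GitHub link above
# for the real interface); endpoint, port, and message fields are invented here.
import json
import socket
import time

HOST, PORT = "127.0.0.1", 9000                 # assumed local avatar server

def send_command(sock, command: dict) -> None:
    """Send one newline-delimited JSON command to the avatar process."""
    sock.sendall((json.dumps(command) + "\n").encode("utf-8"))

with socket.create_connection((HOST, PORT)) as sock:
    # Raise the inner brows (AU1) and smile (AU12); field names are hypothetical.
    send_command(sock, {"type": "action_units", "AU01": 0.6, "AU12": 0.8})
    time.sleep(0.5)
    # Nod, then speak a line with lip sync driven by phonemes on the server side.
    send_command(sock, {"type": "head_gesture", "name": "nod"})
    send_command(sock, {"type": "speak", "text": "Hello, nice to meet you."})
```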
- Chaitanya Ahuja
- Shugao Ma
- Louis-Philippe Morency
- Yaser Sheikh
Nonverbal behaviours such as gestures, facial expressions, body posture, and para-linguistic cues have been shown to complement or clarify verbal messages. Hence, to improve telepresence in the form of an avatar, it is important to model these behaviours, especially in dyadic interactions. Creating such personalized avatars not only requires modeling the intrapersonal dynamics between an avatar’s speech and body pose, but also modeling the
interpersonal dynamics with the interlocutor present in the conversation. In this paper, we
introduce a neural architecture named Dyadic Residual-Attention Model (DRAM), which integrates
intrapersonal (monadic) and interpersonal (dyadic) dynamics using selective attention to
generate sequences of body pose conditioned on audio and body pose of the interlocutor and
audio of the human operating the avatar. We evaluate our proposed model on dyadic
conversational data consisting of pose and audio of both participants, confirming the
importance of adaptive attention between monadic and dyadic dynamics when predicting avatar
pose. We also conduct a user study to analyze judgments of human observers. Our results confirm that the generated body pose is more natural and models intrapersonal and interpersonal dynamics better than non-adaptive monadic/dyadic models.
- Yuki Hirano
- Shogo Okada
- Haruto Nishimoto
- Kazunori Komatani
This paper presents multimodal computational modeling of three labels that are independently
annotated per exchange to implement an adaptation mechanism of dialogue strategy in spoken
dialogue systems based on recognizing user sentiment by multimodal signal processing. The three
labels include (1) user’s interest label pertaining to the current topic, (2)
user’s sentiment label, and (3) topic continuance denoting whether the system should
continue the current topic or change it. Predicting these three types of labels, which capture different aspects of the user’s sentiment level and the system’s next action, contributes to adapting the dialogue strategy to the user’s sentiment. For this
purpose, we enhanced shared multimodal dialogue data by annotating impressed sentiment labels
and the topic continuance labels. With the corpus, we develop a multimodal prediction model for
the three labels. A multitask learning technique is applied for binary classification tasks of
the three labels considering the partial similarities among them. The prediction model was
efficiently trained even with a small data set (less than 2000 samples) thanks to the multitask
learning framework. Experimental results show that the multitask deep neural network (DNN)
model trained with multimodal features including linguistics, facial expressions, body and head
motions, and acoustic features, outperformed single-task DNNs by up to 1.6 points.
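A minimal sketch of the multitask setup described above is shown below: a shared trunk over fused multimodal features with three binary heads (interest, sentiment, topic continuance). The feature dimensionality, layer widths, and equal loss weighting are assumptions, not the authors' configuration.

```python
# Sketch of a multitask DNN with a shared trunk over fused multimodal features
# and three binary heads (interest, sentiment, topic continuance). Feature size,
# widths, and equal loss weights are assumptions.
import torch
import torch.nn as nn

class MultitaskExchangeModel(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 64), nn.ReLU())
        self.interest = nn.Linear(64, 1)        # user interest in current topic
        self.sentiment = nn.Linear(64, 1)       # user sentiment
        self.continuance = nn.Linear(64, 1)     # continue vs. change topic

    def forward(self, x):
        h = self.trunk(x)
        return self.interest(h), self.sentiment(h), self.continuance(h)

model = MultitaskExchangeModel()
bce = nn.BCEWithLogitsLoss()
x = torch.randn(8, 256)                         # fused multimodal features per exchange
targets = [torch.randint(0, 2, (8, 1)).float() for _ in range(3)]
loss = sum(bce(out, t) for out, t in zip(model(x), targets))  # equal task weighting
loss.backward()
print(float(loss))
```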
- Leili Tavabi
- Kalin Stefanov
- Setareh Nasihati Gilani
- David Traum
- Mohammad Soleymani
Embodied interactive agents possessing emotional intelligence and empathy can create natural and
engaging social interactions. Providing appropriate responses by interactive virtual agents
requires the ability to perceive users’ emotional states. In this paper, we study and
analyze behavioral cues that indicate an opportunity to provide an empathetic response.
Emotional tone in language in addition to facial expressions are strong indicators of dramatic
sentiment in conversation that warrant an empathetic response. To automatically recognize such
instances, we develop a multimodal deep neural network for identifying opportunities when the
agent should express positive or negative empathetic responses. We train and evaluate our model
using audio, video and language from human-agent interactions in a wizard-of-Oz setting, using
the wizard’s empathetic responses and annotations collected on Amazon Mechanical Turk as
ground-truth labels. Our model outperforms a text-based baseline, achieving an F1-score of 0.71 on a three-class classification task. We further investigate the results and evaluate the capability of deploying such a model in real-world human-agent interactions.
SESSION: Session 3: Touch and Gesture
- Abishek Sriramulu
- Jionghao Lin
- Sharon Oviatt
Embodied Cognition theorists believe that mathematics thinking is embodied in physical activity,
such as gesturing while explaining math solutions. This research asks whether expertise in mathematics can be detected by analyzing students’ rate and type of manual
gestures. The results reveal several unique findings, including that math experts reduced their
total rate of gesturing by 50%, compared with non-experts. They also dynamically increased
their rate of gesturing on harder problems. Although experts reduced their rate of gesturing
overall, they selectively produced 62% more iconic gestures. Iconic gestures are strategic
because they assist with retaining spatial information in working memory, so that inferences
can be extracted to support correct problem solving. The present results on
representation-level gesture patterns are convergent with recent findings on signal-level
handwriting, while also contributing a causal understanding of how and why experts adapt their
manual activity during problem solving.
- Zhuoming Zhang
- Robin Héron
- Eric Lecolinet
- Françoise Detienne
- Stéphane Safin
As one of the most important non-verbal communication channels, touch plays an essential role in
interpersonal affective communication. Although some researchers have started exploring the
possibility of using wearable devices for conveying emotional information, most of the existing
devices still lack the capability to support affective and dynamic touch in interaction. In
this paper, we explore the effect of dynamic visual cues on the emotional perception of
vibrotactile signals. For this purpose, we developed VisualTouch, a haptic sleeve consisting of
a haptic layer and a visual layer. We hypothesized that visual cues would enhance the
interpretation of tactile cues when both types of cues are congruent. We first carried out an
experiment and selected 4 stimuli producing substantially different responses. Based on that, a
second experiment was conducted with 12 participants rating the valence and arousal of 36
stimuli using SAM scales.
- Jongho Lim
- Yongjae Yoo
- Hanseul Cho
- Seungmoon Choi
This paper presents TouchPhoto, which provides visual-audio-tactile assistive features to enable
visually-impaired users to take and understand photographs independently. A user can take
photographs under auditory guidance and record audio tags to aid later recall of the
photographs’ contents. For comprehension, the user can listen to audio tags embedded in a
photograph while touching salient features, e.g., human faces, using an electrovibration
display. We conducted two user studies with visually-impaired users, one for picture taking and
the other for understanding and recall, separated by a two-month interval. Participants considered the auditory assistance very useful for taking and understanding photographs, and the tactile features helpful but only to a limited extent.
- Ilhan Aslan
- Katharina Weitz
- Ruben Schlagowski
- Simon Flutura
- Susana Garcia Valesco
- Marius Pfeil
- Elisabeth André
Creativity as a skill is associated with a potential to drive both productivity and
psychological wellbeing. Since multimodality can foster cognitive ability, multimodal digital
tools should also be ideal to support creativity as an essentially cognitive skill. In this
paper, we explore this notion by presenting a multimodal pen-based interaction technique and
studying how it supports creativity. The multimodal solution uses microcontroller technology to augment a digital pen with RGB LEDs and a Leap Motion sensor to enable bimanual input. We
report on a user study with 26 participants demonstrating that the multimodal technique is
indeed perceived as supporting creativity significantly more than a baseline condition.
This paper focuses on the real-life scenario in which people handwrite while wearing small mobile devices on their wrists. We explore the possibility of eavesdropping on privacy-related
information based on motion signals. To achieve this, we elaborately develop a new deep
learning-based motion sensing framework with four major components, i.e., recorder, signal
preprocessor, feature extractor and handwriting recognizer. First, we integrate a series of
simple yet effective signal processing techniques to purify the sensory data to reflect the
kinetic property of a handwriting motion. Then we take advantage of properties of Multimodal
Convolutional Neural Network (MCNN) to extract abstract features. After that, a bidirectional
Long Short-Term Memory (BLSTM) network is exploited to model temporal dynamics. Finally, we
incorporate Connectionist Temporal Classification (CTC) algorithm to realize end-to-end
handwriting recognition. We prototype our design using a commercial off-the-shelf smartwatch
and carry out extensive experiments. The encouraging results reveal that our system can robustly achieve an average accuracy of 64% at the character level, 71.9% at the word level, and 56.6% for words unseen in the training set under certain conditions, which exposes the danger of privacy disclosure in daily life.
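The sketch below shows the overall pipeline shape described above: 1-D convolutions over multimodal motion channels, a bidirectional LSTM, and CTC training. The channel counts, alphabet size, and sequence lengths are assumptions, not the authors' configuration.

```python
# Sketch of the described pipeline shape: 1-D convolutions over motion channels,
# a bidirectional LSTM, and CTC training. Channel counts, alphabet size, and
# sequence lengths are assumptions.
import torch
import torch.nn as nn

class MotionHandwritingNet(nn.Module):
    def __init__(self, in_channels=6, n_classes=27):   # 26 letters + CTC blank
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.blstm = nn.LSTM(64, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):                    # x: (batch, channels, time)
        h = self.cnn(x).transpose(1, 2)      # -> (batch, time, features)
        h, _ = self.blstm(h)
        return self.out(h).log_softmax(dim=-1)

model = MotionHandwritingNet()
ctc = nn.CTCLoss(blank=0)
x = torch.randn(4, 6, 200)                   # accelerometer + gyroscope streams
log_probs = model(x).transpose(0, 1)         # CTC expects (time, batch, classes)
targets = torch.randint(1, 27, (4, 10))      # dummy character label sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200, dtype=torch.long),
           target_lengths=torch.full((4,), 10, dtype=torch.long))
loss.backward()
print(float(loss))
```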
SESSION: Session 4: Physiological Modeling
- Tobias Appel
- Natalia Sevcenko
- Franz Wortha
- Katerina Tsarava
- Korbinian Moeller
- Manuel Ninaus
- Enkelejda Kasneci
- Peter Gerjets
The reliable estimation of cognitive load is an integral step towards real-time adaptivity of
learning or gaming environments. We introduce a novel and robust machine learning method for
cognitive load assessment based on behavioral and physiological measures in a combined within-
and cross-participant approach. 47 participants completed different scenarios of a commercially
available emergency personnel simulation game realizing several levels of difficulty based on
cognitive load. Using interaction metrics, pupil dilation, eye-fixation behavior, and heart
rate data, we trained individual, participant-specific forests of extremely randomized trees
differentiating between low and high cognitive load. We achieved an average classification
accuracy of 72%. We then apply these participant-specific classifiers in a novel way, using
similarity between participants, normalization, and relative importance of individual features
to successfully achieve the same level of classification accuracy in cross-participant
classification. These results indicate that a combination of behavioral and physiological
indicators allows for reliable prediction of cognitive load in an emergency simulation game,
opening up new avenues for adaptivity and interaction.
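As a rough illustration of the participant-specific modeling step described above, the sketch below trains a forest of extremely randomized trees on behavioral and physiological features and inspects feature importances. The feature columns, window counts, and synthetic data are placeholders, not the study's dataset.

```python
# Sketch of a participant-specific classifier: extremely randomized trees over
# behavioral and physiological features. Feature columns and data are placeholders.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# One row per time window: [pupil dilation, fixation duration, heart rate, interaction rate]
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.8, size=300) > 0).astype(int)  # low/high load

clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("within-participant CV accuracy: %.2f" % scores.mean())

# Feature importances can then be compared across participants, e.g. to weight
# or normalize features before a cross-participant transfer step.
clf.fit(X, y)
print(dict(zip(["pupil", "fixation", "heart_rate", "interaction"],
               clf.feature_importances_.round(2))))
```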
- Yuning Qiu
- Teruhisa Misu
- Carlos Busso
New developments in advanced driver assistance systems (ADAS) can help drivers deal with risky
driving maneuvers, preventing potential hazard scenarios. A key challenge in these systems is
to determine when to intervene. While there are situations where the need for intervention or feedback is clear (e.g., lane departure), it is often difficult to determine scenarios that deviate from normal driving conditions. These scenarios can arise due to errors by the driver, the presence of pedestrians or bicycles, or maneuvers by other vehicles. We formulate this problem as driving anomaly detection, where the goal is to automatically identify cases
that require intervention. Towards addressing this challenging but important goal, we propose a
multimodal system that considers (1) physiological signals from the driver, and (2) vehicle
information obtained from the controller area network (CAN) bus sensor. The system relies on
conditional generative adversarial networks (GAN) where the models are constrained by the
signals previously observed. The difference between the discriminator scores for the predicted and actual signals is used as a metric for detecting driving anomalies. We collected
and annotated a novel dataset for driving anomaly detection tasks, which is used to validate
our proposed models. We present the analysis of the results, and perceptual evaluations which
demonstrate the discriminative power of this unsupervised approach for detecting driving
anomalies.
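The sketch below illustrates only the anomaly-scoring step described above (training of the conditional GAN is omitted): each window is scored by the gap between discriminator outputs for the observed signals and for the signals predicted from recent context. The network sizes and feature dimensions are assumptions.

```python
# Sketch of the anomaly-scoring step only (GAN training omitted): score each
# window by the gap between discriminator outputs for actual vs. predicted
# signals. Shapes and toy networks are assumptions.
import torch
import torch.nn as nn

feat_dim = 16                                   # physiological + CAN-bus features per window

generator = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, feat_dim))
discriminator = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))

def anomaly_score(context: torch.Tensor, actual: torch.Tensor) -> torch.Tensor:
    """Higher score = larger disagreement between predicted and observed windows."""
    with torch.no_grad():
        predicted = generator(context)          # prediction conditioned on past signals
        return (discriminator(actual) - discriminator(predicted)).abs().squeeze(-1)

context = torch.randn(5, feat_dim)              # previously observed windows (dummy)
actual = torch.randn(5, feat_dim)               # currently observed windows (dummy)
print(anomaly_score(context, actual))           # one score per driving segment
```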
- Mimansa Jaiswal
- Zakaria Aldeneh
- Emily Mower Provost
Various psychological factors affect how individuals express emotions. Yet, when we collect data
intended for use in building emotion recognition systems, we often try to do so by creating
paradigms that are designed just with a focus on eliciting emotional behavior. Algorithms
trained with these types of data are unlikely to function outside of controlled environments
because our emotions naturally change as a function of these other factors. In this work, we
study how the multimodal expressions of emotion change when an individual is under varying
levels of stress. We hypothesize that stress produces modulations that can hide the true
underlying emotions of individuals and that we can make emotion recognition algorithms more
generalizable by controlling for variations in stress. To this end, we use adversarial networks
to decorrelate stress modulations from emotion representations. We study how stress alters
acoustic and lexical emotional predictions, paying special attention to how modulations due to
stress affect the transferability of learned emotion recognition models across domains. Our
results show that stress is indeed encoded in trained emotion classifiers and that this
encoding varies across levels of emotions and across the lexical and acoustic modalities. Our
results also show that emotion recognition models that control for stress during training have
better generalizability when applied to new domains, compared to models that do not control for
stress during training. We conclude that it is necessary to consider the effect of extraneous psychological factors when building and testing emotion recognition models.
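One common way to implement the adversarial decorrelation described above is a gradient reversal layer, so that the stress classifier's gradient pushes the emotion encoder to discard stress information. The sketch below shows that pattern; the layer sizes, class counts, and data are placeholders and this is not necessarily the authors' exact architecture.

```python
# Sketch of adversarial decorrelation via a gradient reversal layer: the stress
# adversary's gradient is flipped so the encoder learns stress-invariant emotion
# features. Layer sizes and data are placeholders.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None     # flip the gradient sign

encoder = nn.Sequential(nn.Linear(40, 64), nn.ReLU())        # acoustic/lexical features in
emotion_head = nn.Linear(64, 4)                              # emotion classes
stress_head = nn.Linear(64, 2)                               # stress adversary

x = torch.randn(8, 40)
emotion_y = torch.randint(0, 4, (8,))
stress_y = torch.randint(0, 2, (8,))

z = encoder(x)
ce = nn.CrossEntropyLoss()
loss = ce(emotion_head(z), emotion_y) \
     + ce(stress_head(GradReverse.apply(z, 1.0)), stress_y)   # adversarial branch
loss.backward()
print(float(loss))
```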
- Yi Ding
- Brandon Huynh
- Aiwen Xu
- Tom Bullock
- Hubert Cecotti
- Matthew Turk
- Barry Giesbrecht
- Tobias Höllerer
Brain Computer Interfaces (BCIs) typically utilize electroencephalography (EEG) to enable
control of a computer through brain signals. However, EEG is susceptible to a large amount of
noise, especially from muscle activity, making it difficult to use in ubiquitous computing
environments where mobility and physicality are important features. In this work, we present a
novel multimodal approach for classifying the P300 event related potential (ERP) component by
coupling EEG signals with nonscalp electrodes (NSE) that measure ocular and muscle artifacts.
We demonstrate the effectiveness of our approach on a new dataset where the P300 signal was
evoked with participants on a stationary bike under three conditions of physical activity:
rest, low-intensity, and high-intensity exercise. We show that intensity of physical activity
impacts the performance of both our proposed model and existing state-of-the-art models. After incorporating signals from nonscalp electrodes, our proposed model performs significantly better in the physical activity conditions. Our results suggest that the incorporation of additional
modalities related to eye-movements and muscle activity may improve the efficacy of mobile
EEG-based BCI systems, creating the potential for ubiquitous BCI.
SESSION: Session 5: Sound and interaction
- Erik Wolf
- Sara Klüber
- Chris Zimmerer
- Jean-Luc Lugrin
- Marc Erich Latoschik
Virtual Reality (VR) has always been considered a promising medium to support designers with
alternative work environments. Still, graphical user interfaces are prone to induce attention
shifts between the user interface and the manipulated target objects which hampers the creative
process. This work proposes a speech-and-gesture-based interaction paradigm for creative tasks
in VR. We developed a multimodal toolbox (MTB) for VR-based design applications and compared it
to a typical unimodal menu-based toolbox (UTB). The comparison uses a design-oriented use-case
and measures flow, usability, and presence as relevant characteristics for a VR-based design
process. The multimodal approach (1) led to a lower perceived task duration and a higher
reported feeling of flow. It (2) provided a higher intuitive use and a lower mental workload
while not being slower than a UTB. Finally, it (3) generated a higher feeling of presence.
Overall, our results confirm significant advantages of the proposed multimodal interaction
paradigm and the developed MTB for important characteristics of design processes in VR.
- Najla Al Futaisi
- Zixing Zhang
- Alejandrina Cristia
- Anne Warlaumont
- Bjorn Schuller
Using neural networks to classify infant vocalisations into important subclasses (such as crying
versus speech) is an emergent task in speech technology. One of the biggest roadblocks standing
in the way of progress lies in the datasets: The performance of a learning model is affected by
the labelling quality and size of the dataset used, and infant vocalisation datasets with good
quality labels tend to be small. In this paper, we assess the performance of three models for
infant VoCalisation Maturity (VCM) trained with a large dataset annotated automatically using a
purpose-built classifier and a small dataset annotated by highly trained human coders. The two
datasets are used in three different training strategies, whose performance is compared against
a baseline model. The first training strategy investigates adversarial training, while the
second exploits multi-task learning as the neural network trains on both datasets
simultaneously. In the final strategy, we integrate adversarial training and multi-task
learning. All of the training strategies outperform the baseline, with the adversarial training
strategy yielding the best results on the development set.
- Nicole Andelic
- Aidan Feeney
- Gary McKeown
Research has found that professional advice with displays of empathy and signs of listening leads to more successful outcomes. These skills are typically displayed through visual nonverbal
signals, whereas reduced multimodal contexts have to use other strategies to compensate for the
lack of visual nonverbal information. Debt advice is often a highly emotional scenario, but to date there has been no research comparing fully multimodal with reduced multimodal debt advice.
The aim of the current study was to compare explicit emotional content (as expressed verbally)
and implicit emotional content (as expressed through paralinguistic cues) in face to face (FTF)
and telephone debt advice recordings. Twenty-two debt advice recordings were coded as emotional
or functional and processed through emotion recognition software. The analysis found that FTF
recordings included more explicit emotion than telephone recordings did. However, linear mixed
effects modelling found substantially higher levels of arousal and slightly lower levels of
valence in telephone advice. Interaction analyses found that emotional speech in FTF advice was
characterised by lower levels of arousal than during functional speech, whereas emotional
speech in telephone advice had higher levels of arousal than in functional speech. We can
conclude that there are differences in emotional content when comparing full and reduced
multimodal debt advice. Furthermore, as telephone advice cannot avail of visual nonverbal
signals, it seems to compensate by using nonverbal cues present in the voice.
- Ahmed Hussen Abdelaziz
- Barry-John Theobald
- Justin Binder
- Gabriele Fanelli
- Paul Dixon
- Nick Apostoloff
- Thibaut Weise
- Sachin Kajareker
Speech-driven visual speech synthesis involves mapping acoustic speech features to the
corresponding lip animation controls for a face model. This mapping can take many forms, but a
powerful approach is to use deep neural networks (DNNs). The lack of synchronized audio, video,
and depth data is a limitation to reliably train DNNs, especially for speaker-independent
models. In this paper, we investigate adapting an automatic speech recognition (ASR) acoustic
model (AM) for the visual speech synthesis problem. We train the ASR-AM on ten thousand hours
of audio-only transcribed speech. The ASR-AM is then adapted to the visual speech synthesis
domain using ninety hours of synchronized audio-visual speech. Using a subjective assessment
test, we compared the performance of the AM-initialized DNN to a randomly initialized model.
The results show that viewers significantly prefer animations generated from the AM-initialized DNN over those generated using the randomly initialized model. We conclude that visual
speech synthesis can significantly benefit from the powerful representation of speech in the
ASR acoustic models.
- Divesh Lala
- Koji Inoue
- Tatsuya Kawahara
Turn-taking in human-robot interaction is a crucial part of spoken dialogue systems, but current models do not allow for the human-like turn-taking speed seen in natural conversation. In this work,
we propose combining two independent prediction models. A continuous model predicts the
upcoming end of the turn in order to generate gaze aversion and fillers as turn-taking cues.
This prediction is done while the user is speaking, so turn-taking can be done with little
silence between turns, or even overlap. Once a speech recognition result has been received at a
later time, a second model uses the lexical information to decide if or when the turn should
actually be taken. We constructed the continuous model using the speaker’s prosodic
features as inputs and evaluated its online performance. We then conducted a subjective
experiment in which we implemented our model in an android robot and asked participants to
compare it to one without turn-taking cues, which produces a response when a speech recognition
result is received. We found that using both gaze aversion and a filler was preferred when the
continuous model correctly predicted the upcoming end of turn, while using only gaze aversion
was better if the prediction was wrong.
- Rui Hou
- Veronica Perez-Rosas
- Stacy Loeb
- Rada Mihalcea
Recent years have witnessed a significant increase in the online sharing of medical information,
with videos representing a large fraction of such online sources. Previous studies have however
shown that more than half of the health-related videos on platforms such as YouTube contain
misleading information and biases. Hence, it is crucial to build computational tools that can
help evaluate the quality of these videos so that users can obtain accurate information to help
inform their decisions. In this study, we focus on the automatic detection of misinformation in
YouTube videos. We select prostate cancer videos as our entry point to tackle this problem. The
contribution of this paper is twofold. First, we introduce a new dataset consisting of 250
videos related to prostate cancer manually annotated for misinformation. Second, we explore the
use of linguistic, acoustic, and user engagement features for the development of classification
models to identify misinformation. Using a series of ablation experiments, we show that we can
build automatic models with accuracies of up to 74%, corresponding to a 76.5% precision
and 73.2% recall for misinformative instances.
SESSION: Session 6: Multiparty interaction
- Lucca Eloy
- Angela E.B. Stewart
- Mary Jean Amon
- Caroline Reinhardt
- Amanda Michaels
- Chen Sun
- Valerie Shute
- Nicholas D. Duran
- Sidney D'Mello
We adopt a multimodal approach to investigating team interactions in the context of remote
collaborative problem solving (CPS). Our goal is to understand multimodal patterns that emerge
and their relation with collaborative outcomes. We measured speech rate, body movement, and
galvanic skin response from 101 triads (303 participants) who used video conferencing software
to collaboratively solve challenging levels in an educational physics game. We use
multi-dimensional recurrence quantification analysis (MdRQA) to quantify patterns of team-level
regularity, or repeated patterns of activity in these three modalities. We found that teams
exhibit significant regularity above chance baselines. Regularity was unaffected by task factors, but had a quadratic relationship with session time in that it initially increased but
then decreased as the session progressed. Importantly, teams that produce more varied
behavioral patterns (irregularity) reported higher emotional valence and performed better on a
subset of the problem solving tasks. Regularity did not predict arousal or subjective
perceptions of the collaboration. We discuss implications of our findings for the design of
systems that aim to improve collaborative outcomes by monitoring the ongoing collaboration and
intervening accordingly.
- Kevin El Haddad
- Sandeep Nallan Chakravarthula
- James Kennedy
Smiles and laughs have been the subject of many studies over the past decades, due to their
frequent occurrence in interactions, as well as their social and emotional functions in dyadic
conversations. In this paper we push forward previous work by providing a first study on the
influence one interacting partner’s smiles and laughs have on their interlocutor’s,
taking into account these expressions’ intensities. Our second contribution is a study on
the patterns of laugh and smile sequences during the dialogs, again taking the intensity into
account. Finally, we discuss the effect of the interlocutor’s role on smiling and
laughing. In order to achieve this, we use a database of naturalistic dyadic conversations
which was collected and annotated for the purpose of this study. The details of the collection
and annotation are also reported here to enable reproduction.
This paper proposes an approach to developing models for predicting performance in multiple group meeting tasks that have no clear correct answer. It adopts "product dimensions" (PD) [Hackman et al. 1967], proposed as a set of dimensions describing the general properties of written passages generated by a group, as a metric for measuring group output. This study enhances the MATRICS group discussion corpus, which includes multiple discussion sessions, by annotating the PD performance metric. We extract group-level linguistic features including vocabulary-level features
using a word embedding technique, topic segmentation techniques, and functional features with
dialog act and parts of speech on the word level. We also extracted nonverbal features from the
speech turns, prosody, and head movement. With a corpus including multiple discussion sessions and annotations of group performance, we conduct two types of experiments through regression modeling to predict the PD. The first experiment evaluates task-dependent prediction accuracy, where samples obtained from the same discussion task appear in both training and testing. The second experiment evaluates task-independent prediction accuracy, where the type of discussion task differs between training and testing samples; in this situation, regression models must infer performance on an unknown discussion task. The experimental results show that a support vector regression model achieved a 0.76 correlation in the task-dependent setting and 0.55 in the task-independent setting.
- Philipp Matthias Muller
- Andreas Bulling
Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing
research topic in social signal processing but existing methods were evaluated on single
datasets – an unrealistic assumption for real-world applications in which systems are
required to also work in settings unseen at training time. It therefore remains unclear whether
current methods for emergent leadership detection generalise to similar but new settings and to
which extent. To overcome this limitation, we are the first to study a cross-dataset evaluation
setting for the emergent leadership detection task. We provide evaluations for within- and
cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well
as an investigation on the robustness of commonly used feature channels and online prediction
in the cross-dataset setting. Our evaluations show that using pose and eye contact based
features, cross-dataset prediction is possible with an accuracy of 0.68, as such providing
another important piece of the puzzle towards real-world emergent leadership detection.
- Ameneh Shamekhi
- Timothy Bickmore
Group meetings are ubiquitous, with millions of meetings held across the world every day.
However, meeting quality, group performance, and outcomes are challenged by a variety of
dysfunctional behaviors, unproductive social dynamics, and lack of experience in conducting
efficient and productive meetings. Previous studies have shown that meeting facilitators can be
advantageous in helping groups reach their goals more effectively, but many groups do not have
access to human facilitators due to a lack of resources or other barriers. In this paper, we
describe the development of a multimodal robotic meeting facilitator that can improve the
quality of small group decision-making meetings. This automated group facilitation system uses
multimodal sensor inputs (user gaze, speech, prosody, and proxemics), as well as inputs from a
tablet application, to intelligently enforce meeting structure, promote time management,
balance group participation, and facilitate group decision-making processes. Results of a
between-subjects study of 20 user groups (N=40) showed that the robot facilitator is accepted by group members and is effective in enforcing meeting structure, and that users found it helpful in balancing group participation. We also report design implications derived from the
findings of our study.
SESSION: Poster Session
- Stephanie Arevalo
- Stanislaw Miller
- Martha Janka
- Jens Gerken
Interacting with the physical and digital environment multimodally enhances user flexibility and
adaptability to different scenarios. A body of research has focused on comparing the efficiency
and effectiveness of different interaction modalities in digital environments. However, little
is known about user behavior in an environment that provides freedom to choose from a range of
modalities. That is why, we take a closer look at the factors that influence input modality
choices. Building on the work by Jameson & Kristensson, our goal is to understand how
different factors influence user choices. In this paper, we present a study that aims to
explore modality choices in a hands-free interaction environment, wherein participants can
choose and combine freely three hands-free modalities (Gaze, Head movements, Speech) to execute
point and select actions in a 2D interface. On the one hand, our results show that users avoid
switching modalities more often than we expected, particularly, under conditions that should
prompt modality switching. On the other hand, when users make a modality switch, user characteristics and the consequences of the experienced interaction have a higher impact on the choice than changes in environmental conditions. Further, when users switch between modalities, we identified different types of switching behaviors: users who deliberately try to find and choose an optimal modality (single switchers), users who try to find optimal combinations of modalities (multiple switchers), and switching behavior triggered by error occurrence (error-biased switchers). We believe that these results help to further understand
when and how to design for multimodal interaction in real-world systems.
- Yulan Chen
- Jia Jia
- Zhiyong Wu
User emotion modeling is a vital problem of social media analysis. In previous studies, content
and topology information of social networks have been considered in emotion modeling tasks, but the influence of the current emotion states of other users has not. We define emotion influence as the emotional impact from a user’s friends in social networks, which is determined by both network structure and node attributes (the features of friends). In this paper, we try to model this emotion influence to help analyze a user’s emotion. The key
challenges to this problem are: 1) how to combine content features and network structures
together to model emotion influence; 2) how to selectively focus on the major social network
information related to emotion influence. To tackle these challenges, we propose an
attention-based graph convolutional recurrent network to bring in emotion influence and content
data. Firstly, we use an attention-based graph convolutional network to selectively aggregate
the features of the user’s friends with specific attention weights. Then an LSTM model is used to learn the user’s own content features and the emotion influence. The proposed model is better able to quantify emotion influence in social networks and to combine it with content features to analyze the user’s emotional status. We conduct emotion classification experiments to evaluate the effectiveness of our model on a real-world dataset from Sina Weibo. Results
show that our model outperforms several state-of-the-art methods.
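The sketch below illustrates the two ingredients described above in a stripped-down form: attention-weighted aggregation of friends' features (the emotion influence) and an LSTM over the user's own content features. The dimensions, layer sizes, and dummy inputs are placeholders, not the paper's architecture.

```python
# Minimal sketch: attention-weighted aggregation of friends' features (emotion
# influence) combined with an LSTM over the user's own content features.
# Dimensions are placeholders, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionInfluenceModel(nn.Module):
    def __init__(self, feat_dim=32, hidden=64, n_classes=2):
        super().__init__()
        self.att = nn.Linear(2 * feat_dim, 1)          # scores each (user, friend) pair
        self.lstm = nn.LSTM(2 * feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, user_seq, friend_feats):
        # user_seq: (batch, time, feat), friend_feats: (batch, n_friends, feat)
        last_user = user_seq[:, -1:, :].expand(-1, friend_feats.size(1), -1)
        scores = self.att(torch.cat([last_user, friend_feats], dim=-1))   # (batch, n_friends, 1)
        alpha = F.softmax(scores, dim=1)
        influence = (alpha * friend_feats).sum(dim=1, keepdim=True)       # weighted friends
        influence = influence.expand(-1, user_seq.size(1), -1)
        h, _ = self.lstm(torch.cat([user_seq, influence], dim=-1))
        return self.out(h[:, -1])                                          # emotion logits

model = EmotionInfluenceModel()
logits = model(torch.randn(4, 5, 32), torch.randn(4, 10, 32))
print(logits.shape)                                                        # torch.Size([4, 2])
```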
- Maite Frutos-Pascual
- Jake Michael Harrison
- Chris Creed
- Ian Williams
This paper presents an evaluation of ultrasound mid-air haptics as a supplementary feedback cue
for grasping and lifting virtual objects in Virtual Reality (VR). We present a user study with
27 participants and evaluate 6 different object sizes ranging from 40 mm to 100 mm.
We compare three supplementary feedback cues in VR; mid-air haptics, visual feedback (glow
effect) and no supplementary feedback. We report on precision metrics (time to completion,
grasp aperture and grasp accuracy) and interaction metrics (post-test questionnaire,
observations and feedback) to understand general trends and preferences. The results showed an overall preference for visual cues for bigger objects, while ultrasound mid-air haptics were preferred for small virtual targets.
- Yosra Rekik
- Walid Merrad
- Christophe Kolski
Bimanual input is frequently used on touch and tangible interaction on tabletop surfaces.
Considering a composite task, such as moving a set of objects, attention, decision making and
fine motor control have to be phased with the coordination of the two hands. However, attention
demand is an important factor in designing interaction techniques that are easy to learn and recall. Thus, determining which interaction modality demands less attention and which one performs better under these conditions is important for improving design. In this work, we present the first empirical results on this matter. We report that users are consistent in their assessments of the attention demand for both touch and tangible modalities, even under different hand synchronicity and different population sizes and densities. Our findings indicate that the one-hand condition and small populations demand less attention compared to, respectively, two-hand conditions and bigger populations. We also show that the tangible modality significantly reduces attention demand when using two-handed synchronous movements or when moving sparse populations, and decreases movement time relative to the touch modality without compromising the traveled distance. We use our findings to outline a set of guidelines to assist touch and tangible design.
- Chandan Kumar
- Daniyal Akbari
- Raphael Menges
- Scott MacKenzie
- Steffen Staab
We present TouchGazePath, a multimodal method for entering personal identification numbers
(PINs). Using a touch-sensitive display showing a virtual keypad, the user initiates input with
a touch at any location, glances with their eye gaze on the keys bearing the PIN numbers, then
terminates input by lifting their finger. TouchGazePath is not susceptible to security attacks,
such as shoulder surfing, thermal attacks, or smudge attacks. In a user study with 18
participants, TouchGazePath was compared with the traditional Touch-Only method and the
multimodal Touch+Gaze method, the latter using eye gaze for targeting and touch for selection.
The average time to enter a PIN with TouchGazePath was 3.3 s. This was not as fast as Touch-Only (as expected), but was about twice as fast as Touch+Gaze. TouchGazePath was also
more accurate than Touch+Gaze. TouchGazePath had high user ratings as a secure PIN input method
and was the preferred PIN input method for 11 of 18 participants.
- Mira Sarkis
- Céline Coutrix
- Laurence Nigay
- Andrzej Duda
We present WiBend, a system that recognizes bending gestures as an input modality for interacting with non-instrumented, deformable surfaces using Wi-Fi signals. WiBend takes advantage of off-the-shelf 802.11 (Wi-Fi) devices and Channel State Information (CSI) measurements of packet transmissions while the user interacts between a Wi-Fi transmitter and a receiver. We have performed extensive user experiments in an instrumented
laboratory to obtain data for training the HMM models and for evaluating the precision of
WiBend. During the experiments, participants performed 12 distinct bending gestures with three
surface sizes, two bending speeds and two different directions. The performance evaluation
results show that WiBend can distinguish between 12 bending gestures with a precision of
84% on average.
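The sketch below illustrates the HMM-per-gesture classification scheme described above using the hmmlearn library, with dummy data standing in for real CSI features. The number of hidden states, the feature dimensionality, and the gesture names are assumptions.

```python
# Minimal sketch of HMM-per-gesture classification, using hmmlearn with dummy
# data in place of real CSI features. Hidden-state count and feature size are
# assumptions.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(7)
n_features = 8                                  # e.g., CSI amplitude summaries per frame

def make_sequences(offset, n_seqs=20, length=50):
    """Dummy CSI feature sequences for one gesture class."""
    return [rng.normal(loc=offset, size=(length, n_features)) for _ in range(n_seqs)]

gesture_data = {"bend_up_fast": make_sequences(0.0), "bend_down_slow": make_sequences(1.0)}

# Train one Gaussian HMM per bending gesture.
models = {}
for name, seqs in gesture_data.items():
    X = np.concatenate(seqs)
    lengths = [len(s) for s in seqs]
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    m.fit(X, lengths)
    models[name] = m

# Classify a new sequence by the model with the highest log-likelihood.
test_seq = rng.normal(loc=1.0, size=(50, n_features))
prediction = max(models, key=lambda name: models[name].score(test_seq))
print("recognized gesture:", prediction)
```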
- Kaixin Ma
- Xinyu Wang
- Xinru Yang
- Mingtong Zhang
- Jeffrey M Girard
- Louis-Philippe Morency
Automatic emotion recognition plays a critical role in technologies such as intelligent agents
and social robots and is increasingly being deployed in applied settings such as education and
healthcare. Most research to date has focused on recognizing the emotional expressions of young
and middle-aged adults and, to a lesser extent, children and adolescents. Very few studies have
examined automatic emotion recognition in older adults (i.e., elders), which represent a large
and growing population worldwide. Given that aging causes many changes in facial shape and
appearance and has been found to alter patterns of nonverbal behavior, there is strong reason
to believe that automatic emotion recognition systems may need to be developed specifically (or
augmented) for the elder population. To promote and support this type of research, we introduce
a newly collected multimodal dataset of elders reacting to emotion elicitation stimuli.
Specifically, it contains 1323 video clips of 46 unique individuals with human annotations of six discrete emotions (anger, disgust, fear, happiness, sadness, and surprise) as well as valence. We present a detailed analysis of the most indicative features for each emotion. We also establish several baselines using unimodal and multimodal features on this dataset. Finally, we show that models trained on datasets of other age groups do not generalize well to elders.
- Jiaming Huang
- Chen Min
- Liping Jing
To handle large-scale data in terms of storage and search time, learning to hash has become popular due to its efficiency and effectiveness in approximate cross-modal nearest neighbor search. To narrow the semantic gap, most existing unsupervised cross-modal hashing methods try to simultaneously minimize the loss of intra-modal similarity and the loss of inter-modal similarity. However, these models cannot guarantee in theory that the two losses are simultaneously minimized. In this paper, we first prove theoretically that cross-modal hashing can be implemented by preserving both intra-modal and inter-modal similarity with the aid of variational inference, and we point out that maximizing intra-modal and inter-modal similarity are mutually constrained objectives. We therefore propose an unsupervised cross-modal hashing framework named Unsupervised Deep Fusion Cross-modal Hashing (UDFCH), which leverages data fusion to capture the underlying manifold across modalities and avoid the above problem.
Furthermore, in order to reduce the quantization loss, we sample hash codes from different Bernoulli distributions through a reparameterization trick. Our UDFCH framework has two stages: the first stage mines the intra-modal structure of each modality, and the second stage determines the modality-aware hash codes by fully considering the correlation and manifold structure among modalities. A series of experiments conducted on three
benchmark datasets show that the proposed UDFCH framework outperforms the state-of-the-art
methods on different cross-modal retrieval tasks.
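One common way to realize the "sample binary codes with a reparameterization trick" idea mentioned above is relaxed Bernoulli sampling with a straight-through estimator; the sketch below shows that pattern for illustration and is not necessarily the authors' exact formulation. The temperature value and code length are assumptions.

```python
# Illustrative relaxed-Bernoulli sampling with a straight-through estimator for
# binary hash codes; not necessarily the authors' exact formulation.
import torch

def sample_hash_codes(logits: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Return codes in {0, 1} while keeping gradients through the relaxed sample."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)                    # logistic noise
    soft = torch.sigmoid((logits + noise) / temperature)      # relaxed Bernoulli sample
    hard = (soft > 0.5).float()
    return hard + soft - soft.detach()                        # straight-through trick

logits = torch.randn(4, 16, requires_grad=True)               # per-bit Bernoulli logits
codes = sample_hash_codes(logits)
codes.sum().backward()                                        # gradients flow to the logits
print(codes[0], logits.grad.shape)
```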
- Vineet Mehta
- Sai Srinadhu Katta
- Devendra Pratap Yadav
- Abhinav Dhall
Traffic accidents cause over a million deaths every year, of which a large fraction is
attributed to drunk driving. An automated intoxicated driver detection system in vehicles will
be useful in reducing accidents and related financial costs. Existing solutions require special
equipment such as electrocardiogram, infrared cameras or breathalyzers. In this work, we
propose a new dataset called DIF (Dataset of perceived Intoxicated Faces) which contains
audio-visual data of intoxicated and sober people obtained from online sources. To the best of
our knowledge, this is the first work for automatic bimodal non-invasive intoxication
detection. Convolutional Neural Networks (CNN) and Deep Neural Networks (DNN) are trained for
computing the video and audio baselines, respectively. A 3D CNN is used to exploit the spatio-temporal changes in the video. A simple variation of the traditional 3D convolution block is proposed, based on inducing non-linearity between the spatial and temporal channels.
Extensive experiments are performed to validate the approach and baselines.
- Soumia Dermouche
- Catherine Pelachaud
A social interaction implies a social exchange between two or more persons, where they adapt and
adjust their behaviors in response to their interaction partners. With the growing interest in
human-agent interactions, it is desirable to make these interactions more natural and human-like. In this context, we aim at enhancing the quality of the interaction between a user and an Embodied Conversational Agent (ECA) by endowing the ECA with the capacity to adapt its behavior in real time according to the user’s behavior. The novelty of our approach is to model the agent’s nonverbal behaviors as a function of both the agent’s and the user’s behaviors, jointly with the agent’s communicative intentions, creating a dynamic loop between both interactants. Moreover, we capture the variation of behavior over time through an LSTM-based model. Our model, IL-LSTM (Interaction Loop LSTM), predicts the agent’s next behavior, taking into account the behavior that both the agent and the user have displayed within a time window. We have conducted an evaluation study involving an agent interacting with
visitors in a science museum. Results of our study show that participants have better
experience and are more engaged in the interaction when the agent adapts its behaviors to
theirs, thus creating an interactive loop.
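A minimal sketch of the interaction-loop idea, assuming an LSTM that reads a window of concatenated agent and user behaviour features and scores the agent’s next behaviour; layer sizes and names are placeholders, not the authors’ IL-LSTM implementation.

```python
import torch
import torch.nn as nn

class InteractionLoopModel(nn.Module):
    """Predict the agent's next nonverbal behaviour from a time window of
    agent + user behaviour features (hypothetical dimensions)."""
    def __init__(self, agent_dim=20, user_dim=20, hidden=64, n_behaviours=12):
        super().__init__()
        self.lstm = nn.LSTM(agent_dim + user_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_behaviours)

    def forward(self, agent_feats, user_feats):
        # agent_feats, user_feats: (batch, window_len, dim)
        x = torch.cat([agent_feats, user_feats], dim=-1)
        _, (h, _) = self.lstm(x)          # last hidden state summarizes the window
        return self.head(h[-1])           # scores over candidate next behaviours

if __name__ == "__main__":
    model = InteractionLoopModel()
    out = model(torch.randn(4, 25, 20), torch.randn(4, 25, 20))
    print(out.shape)                      # (4, 12)
```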
- Lingyu Zhang
- Mallory Morgan
- Indrani Bhattacharya
- Michael Foley
- Jonas Braasch
- Christoph Riedl
- Brooke Foucault Welles
- Richard J. Radke
Collaborative group tasks require efficient and productive verbal and non-verbal interactions
among the participants. Studying such interaction patterns could help groups perform more
efficiently, but the detection and measurement of human behavior is challenging since it is
inherently multimodal and changes on a millisecond time frame. In this paper, we present a
method to study groups performing a collaborative decision-making task using non-verbal
behavioral cues. First, we present a novel algorithm to estimate the visual focus of attention
(VFOA) of participants using frontal cameras. The algorithm can be used in various group
settings, and performs with a state-of-the-art accuracy of 90%. Second, we present
prosodic features for non-verbal speech analysis. These features are commonly used in
speech/music classification tasks, but are rarely used in human group interaction analysis. We
validate our algorithms on a multimodal dataset of 14 group meetings with 45 participants, and
show that a combination of VFOA-based visual metrics and prosodic-feature-based metrics can
predict emergent group leaders with 64% accuracy and dominant contributors with 86%
accuracy. We also report our findings on the correlations between the non-verbal behavioral
metrics and gender, emotional intelligence, and the Big 5 personality traits.
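As an illustration only (not the paper’s exact feature set), prosodic descriptors such as pitch and energy statistics could be extracted per speaker turn along the following lines, assuming librosa and a 16 kHz mono recording; the file name is hypothetical.

```python
import numpy as np
import librosa

def prosodic_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return a small vector of prosodic statistics (pitch, energy, voicing proxy)
    for one audio segment; the feature choice here is illustrative only."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"),
                     fmax=librosa.note_to_hz("C7"), sr=sr)   # frame-wise pitch
    energy = librosa.feature.rms(y=y)[0]                      # frame-wise energy
    zcr = librosa.feature.zero_crossing_rate(y)[0]            # rough voicing proxy
    stats = []
    for track in (f0, energy, zcr):
        stats += [np.mean(track), np.std(track), np.min(track), np.max(track)]
    return np.array(stats)

# Usage (hypothetical file):
# feats = prosodic_features("speaker_turn_001.wav")
```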
- Youfang Leng
- Li Yu
- Jie Xiong
Nowadays, more and more papers are submitted to various periodicals and conferences.
Typically, reviewers need to read through a paper and give it a review comment and a score
based on certain criteria. This review process is labor intensive and time-consuming.
Recently, AI technology has been widely used to alleviate the burden of human labor. Can
machines learn from humans to review papers automatically? In this paper, we propose a
collaborative grammar and innovation model, DeepReviewer, to achieve automatic paper review.
The model learns the semantic, grammatical and innovation-related features of an article
simultaneously through three well-designed components. These three factors are then integrated
by an attention layer to obtain the final review score of the paper. We crawled paper review
data from OpenReview and built a real dataset. Experimental results demonstrate that our model
outperforms many baselines.
- Tianyi Zhang
- Abdallah El Ali
- Chen Wang
- Xintong Zhu
- Pablo Cesar
To recognize emotions using less obtrusive wearable sensors, we present a novel emotion
recognition method that uses only pupil diameter (PD) and skin conductance (SC). Psychological
studies show that these two signals are related to the attention level of humans exposed to
visual stimuli. Based on this, we propose a feature extraction algorithm that extracts
correlation-based features for participants watching the same video clip. To boost performance
given limited data, we implement a learning system without a deep architecture to classify
arousal and valence. Our method outperforms not only state-of-the-art approaches, but also
widely-used traditional and deep learning methods.
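One plausible way to realize correlation-based features for viewers of the same clip, assuming all signals are resampled to a common length (not necessarily the authors’ exact formulation), is to correlate each participant’s trace with a leave-one-out group mean:

```python
import numpy as np

def correlation_features(signals: np.ndarray) -> np.ndarray:
    """signals: (n_participants, n_samples) array of one physiological channel
    (e.g. pupil diameter) recorded while watching the same clip.
    Returns, per participant, the Pearson correlation with the mean trace
    of all *other* participants (leave-one-out reference)."""
    n = signals.shape[0]
    feats = np.empty(n)
    for i in range(n):
        others = np.delete(signals, i, axis=0).mean(axis=0)
        feats[i] = np.corrcoef(signals[i], others)[0, 1]
    return feats

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pd_traces = rng.standard_normal((12, 600))   # 12 participants, 600 samples
    print(correlation_features(pd_traces).round(3))
```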
- Ankit Parag Shah
- Vaibhav Vaibhav
- Vasu Sharma
- Mahmoud Al Ismail
- Jeffrey Girard
- Louis-Philippe Morency
Suicide is one of the leading causes of death in the modern world. In this digital age,
individuals are increasingly using social media to express themselves and often use these
platforms to express suicidal intent. Various studies have inspected suicidal intent behavioral
markers in controlled environments but it is still unexplored if such markers will generalize
to suicidal intent expressed on social media. In this work, we set out to study multimodal
behavioral markers related to suicidal intent when expressed on social media videos. We explore
verbal, acoustic and visual behavioral markers in the context of identifying individuals at
higher risk of attempting suicide. Our analysis reveals that frequent silences, slouched
shoulders, rapid hand movements and profanity are the predominant multimodal behavioral markers
indicative of suicidal intent.
- Dimosthenis Kontogiorgos
- Andre Pereira
- Joakim Gustafson
Situated multimodal systems that instruct humans need to handle user uncertainties, as expressed
in behaviour, and plan their actions accordingly. Speakers’ decision to reformulate or
repair previous utterances depends greatly on the listeners’ signals of uncertainty. In
this paper, we estimate uncertainty in a situated guided task, as conveyed through non-verbal
cues expressed by the listener, and predict whether the speaker will reformulate their
utterance. We use a corpus in which people give instructions on how to assemble furniture, and
extract their multimodal features. While uncertainty is in some cases expressed verbally, most
instances are expressed non-verbally, which indicates the importance of multimodal approaches.
In this work, we present a model for uncertainty estimation. Our findings indicate that
uncertainty estimation from non-verbal cues works well, and can exceed human annotator
performance when verbal features cannot be perceived.
- Fumio Nihei
- Yukiko Nakano
- Ryuichiro Higashinaka
- Ryo Ishii
Iconic gestures are used to depict physical objects mentioned in speech, and the gesture form is
assumed to be based on the image of a given object in the speaker’s mind. Using this
idea, this study proposes a model that learns iconic gesture forms from an image representation
obtained from pictures of physical entities. First, we collect a set of pictures of each entity
from the web, and create an average image representation from them. Subsequently, the average
image representation is fed to a fully connected neural network to decide the gesture form. In
the model evaluation experiment, our two-step gesture form selection method can classify seven
types of gesture forms with over 62% accuracy. Furthermore, we demonstrate an example of
gesture generation in a virtual agent system in which our model is used to create a gesture
dictionary that assigns a gesture form for each entry word in the dictionary.
- Sixia Li
- Shogo Okada
- Jianwu Dang
In assessing and analyzing the performance of group interaction, interaction process analysis
(IPA), as defined by Bales, is considered a useful approach. IPA is a system for labeling a
total of 12 interaction categories over the course of an interaction. Automating IPA can close
the gap created by the manpower required for manual coding and can efficiently assess group
performance. In this paper, we present a computational interaction process analysis by
developing a model to recognize IPA categories. We extract both verbal and nonverbal features
for IPA category recognition, modeling with SVM, RF, DNN and LSTM machine learning algorithms,
and analyze the contribution of multimodal and unimodal features for the total data and for
each label. We also investigate the effect of context information by training sequences of
different lengths with an LSTM and evaluating them. The results show that multimodal features
achieve the best performance, with an F1 score of 0.601 for the recognition of the 12 IPA
categories using the total data. Multimodal features are better than unimodal features for the
total data and for most labels. The investigation of context information shows that, with a
suitable sequence length, longer sequences achieve the best F1 score of 0.602 and better
recognition performance.
- Qingqing Li
- Theodora Chaspari
Internet of Things technologies yield large amounts of real-life speech data related to human
emotions. Yet, labelled data of human emotion from spontaneous speech are extremely limited due
to the difficulties in the annotation of such large volumes of audio samples. A potential way
to address this limitation is to augment emotion models of spontaneous speech with fully
annotated data collected using scripted scenarios. We investigate whether and to what extent
knowledge related to speech emotional content can be transferred between datasets of scripted
and spontaneous speech. We implement transfer learning through: (1) a feed-forward neural
network trained on the source data and whose last layers are fine-tuned based on the target
data; and (2) a progressive neural network retaining a pool of pre-trained models and learning
lateral connections between source and target task. We explore the effectiveness of the
proposed approach using four publicly available datasets of emotional speech. Our results
indicate that transfer learning can effectively leverage corpora of scripted data to improve
emotion recognition performance for spontaneous speech.
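A minimal sketch of the first transfer strategy, fine-tuning only the last layers of a source-trained feed-forward network on the target corpus; the architecture, layer sizes and checkpoint name are assumptions, not the authors’ exact setup.

```python
import torch
import torch.nn as nn

def build_emotion_net(n_feats: int = 384, n_classes: int = 4) -> nn.Sequential:
    """Feed-forward classifier over utterance-level acoustic features."""
    return nn.Sequential(
        nn.Linear(n_feats, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, n_classes),
    )

def prepare_for_finetuning(model: nn.Sequential, n_trainable_layers: int = 2):
    """Freeze everything, then unfreeze only the last few Linear layers so that
    the target (spontaneous) data updates just those parameters."""
    for p in model.parameters():
        p.requires_grad = False
    linear_layers = [m for m in model if isinstance(m, nn.Linear)]
    for layer in linear_layers[-n_trainable_layers:]:
        for p in layer.parameters():
            p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

# model = build_emotion_net()
# model.load_state_dict(torch.load("scripted_source.pt"))   # hypothetical source checkpoint
# params = prepare_for_finetuning(model)
# optimizer = torch.optim.Adam(params, lr=1e-4)              # train on target data only
```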
- Soumia Dermouche
- Catherine Pelachaud
In recent years, engagement modeling has gained increasing attention due to the important role
it plays in human-agent interaction. The agent should be able to detect, in real time, the
engagement level of the user in order to react accordingly. In this context, our goal is to
develop a computational model that predicts the engagement level of the user in real time.
Relying on previous findings, we use facial expressions, head movements and gaze direction as
predictive features. Moreover, engagement is measured not from single cues alone, but from the
combination of several cues that arise over a certain time window. Thus, for better engagement
prediction, we consider the variation of multimodal behaviors over time. To this end, we rely
on an LSTM that can jointly model the temporality and the sequentiality of multimodal
behaviors.
SESSION: Doctoral Consortium
Anxiety disorders are becoming more prevalent; therefore, the demand for mobile anxiety
self-regulation technologies is rising. However, the existing regulation technologies have not
yet reached the ability to guide suitable interventions to a user in a timely manner. This is
mainly due to the lack of maturity in the anxiety detection area. Hence, this research aims to
(1) identify potential temporal phases of anxiety which could become effective personalization
parameters for regulation technologies, (2) detect such phases through collecting and analyzing
multimodal indicators of anxiety, and (3) design self-regulation technologies that can guide
suitable interventions for the detected anxiety phase. Based on an exploratory study that was
conducted with therapists treating anxiety disorders, potential temporal phases and common
indicators of anxiety were identified. The design of anxiety detection and regulation
technologies is currently in progress. The proposed research methodology and expected
contributions are further discussed in this paper.
Mental health disorders are among the leading causes of disability. Despite the prevalence of
mental health disorders, there is a large gap between the needs and resources available for
their assessment and treatment. Automatic behaviour analysis for computer-aided mental health
assessment can augment clinical resources in the diagnosis and treatment of patients.
Intelligent systems like virtual agents and social robots can have a large impact by deploying
multimodal machine learning to perceive and interact with patients in interactive scenarios for
probing behavioral cues of mental health disorders. In this paper, we propose our plans for
developing multimodal machine learning methods for augmenting embodied interactive agents with
emotional intelligence, toward probing cues of mental health disorders. We aim to develop a new
generation of intelligent agents that can create engaging interactive experiences for assisting
with mental health assessments.
Motion-based applications are becoming increasingly popular among children and require accurate
motion recognition to ensure meaningful interactive experiences. However, motion recognizers
are usually trained on adults’ motions. Children and adults differ in terms of their body
proportions and development of their neuromuscular systems, so children and adults will likely
perform motions differently. Hence, motion recognizers tailored to adults will likely perform
poorly for children. My PhD thesis will focus on identifying features that characterize
children’s and adults’ motions. This set of features will provide a model that can
be used to understand children’s natural motion qualities and will serve as the first
step in tailoring recognizers to children’s motions. This paper describes my past and
ongoing work toward this end and outlines the next steps in my PhD work.
High-accuracy physiological emotion recognition typically requires participants to wear or
attach obtrusive sensors (e.g., Electroencephalograph). To achieve precise emotion recognition
using only wearable body-worn physiological sensors, my doctoral work focuses on researching
and developing a robust sensor fusion system across different physiological sensors. Developing
such a fusion system poses three problems: 1) how to pre-process signals with different
temporal characteristics and noise models, 2) how to train the fusion system with limited
labeled data, and 3) how to fuse multiple signals with inaccurate and inexact ground truth. To
overcome these challenges, I plan to explore semi-supervised, weakly supervised and
unsupervised machine learning methods to obtain precise emotion recognition in mobile
environments. By developing such techniques, we can measure user engagement with larger numbers
of participants and apply the emotion recognition techniques in a variety of scenarios such as
mobile video watching and online education.
One research branch in Affective Computing focuses on using multimodal ‘emotional’
expressions (e.g. facial expressions or non-verbal vocalisations) to automatically detect
emotions and affect experienced by persons. The field is increasingly interested in using
contextual factors to better infer emotional expressions rather than solely relying on the
emotional expressions by themselves. We are interested in expressions that occur in a social
context. In our research we plan to investigate how we can: a) utilise communicative signals
that are displayed during interactions to recognise social contextual factors that influence
emotion expression, and in turn b) predict/recognise what these emotion expressions are most
likely communicating given the context. To achieve this, we formulate three main research
questions: I) How do communicative signals such as emotion expressions co-ordinate behaviours
and knowledge between interlocutors in interactive settings? II) Can we use behavioural cues
during interactions to detect social contextual factors relevant for interpreting affect? III)
Can we use social contextual factors and communicative signals to predict what emotion
experience is linked to an emotion expression?
Collaboration is an important skill of the 21st century. It can take place in an online (or
remote) setting or in a co-located (or face-to-face) setting. With the large-scale adoption of
sensors, studies on co-located collaboration (CC) have gained momentum. CC takes place in
physical spaces where the group members share each other’s social and epistemic space.
This involves subtle multimodal interactions such as gaze, gestures, speech and discourse,
which are complex in nature. The aim of this PhD is to detect these interactions and then use these
insights to build an automated real-time feedback system to facilitate co-located
collaboration.
This research aims to create a data-driven end-to-end model for multimodal forecasting body pose
and gestures of virtual avatars. A novel aspect of this research is to coalesce both narrative
and dialogue for pose forecasting. In a narrative, language is used in a third person view to
describe the avatar actions. In dialogue both first and second person views need to be
integrated to accurately forecast avatar pose. Gestures and poses of a speaker are linked to
other modalities: language and acoustics. We use these correlations to better predict the
avatar’s pose.
Augmented reality (AR) glasses enable the embedding of visual content in real-world
surroundings. In this PhD project, I will implement user interfaces that adapt to the
cognitive state of the user, for example by avoiding distractions or re-directing the
user’s attention towards missed information. For this purpose, sensory data from the user
is captured (brain activity via EEG or fNIRS, eye tracking, physiological measurements) and
modeled with machine learning techniques. The cognitive state estimation focuses on
attention-related aspects. The main task is to build models for an estimation
data streams and context information, as well as their evaluation. Furthermore, the goal is to
develop prototypical user interfaces for AR glasses and to test their usability in different
scenarios.
The ever-growing research in computer vision has created new avenues for user interaction.
Speech commands and gesture recognition are already being used alongside various touch-based
inputs. It is therefore foreseeable that the use of multimodal input methods for user
interaction is the next phase in this development. In this paper, I propose a research plan of novel
methods for the use of multimodal inputs for the semantic interpretation of human-computer
interaction, specifically applied to a car driver. A fusion methodology has to be designed that
adequately makes use of a recognized gesture (specifically finger pointing), eye gaze and head
pose for the identification of reference objects, while using the semantics from speech for a
natural interactive environment for the driver. The proposed plan includes different techniques
based on artificial neural networks for the fusion of the camera-based modalities (gaze, head
and gesture). It then combines features extracted from speech with the fusion algorithm to
determine the intent of the driver.
SESSION: Demo and Exhibit Session
- Zi Fong Yong
- Ai Ling Ng
- Yuta Nakayama
There is a lack of awareness about dyslexia among people in our society. More often than not,
there are many misconceptions surrounding the diagnosis of dyslexia, leading to misjudgements
and misunderstanding about dyslexics from the workplace to school. This paper presents a
multimodal interactive installation designed to communicate the emotional ordeal faced by
dyslexics, allowing those who do not understand to see through the lens of those with dyslexia.
The main component of this installation is a projection mapping technique used to enhance
typography, simulating the experience of dyslexia. Projection mapping makes it possible to
create a natural augmented information presentation method on the tangible surface of a
specially designed printed book. The user interface combines a color–tracking sensor and
a projection to create a camera–projector system. The described system performs tabletop
object detection and automatic projection mapping, using page flipping as user interaction.
Such a system can be adapted to fit different contexts and installation spaces, for the purpose
of education and awareness. There is also the potential to conduct further research with real
dyslexia patients.
- Yuyun Hua
- Sixian Zhang
- Xinhang Song
- Jia'ning Li
- Shuqiang Jiang
Depth data captured by cameras such as the Microsoft Kinect provides depth information that
traditional RGB data lacks, and is also more robust to different environments, such as dim or
dark lighting conditions. In this technical demonstration, we build a scene recognition system
based on real-time processing of RGB-D video streams. Our system recognizes scenes from video
clips, where three types of threads are implemented to ensure real-time operation. The system
first buffers the frames of both the RGB and depth videos with the capturing threads. When the
buffered videos reach a certain length, the frames are packed into clips and forwarded through
a pre-trained C3D model to predict scene labels with the scene recognition thread. Finally, the
predicted scene labels and captured videos are displayed in our user interface by the
illustration thread.
- Jin-hwan Oh
- Sudhakar Sah
- Jihoon Kim
- Yoori Kim
- Jeonghwa Lee
- Wooseung Lee
- Myeongsoo Shin
- Jaeyon Hwang
- Seongwon Kim
AI assistants have found their place in households, but most of the existing assistants use
single-modal interaction. We present a language assistant for kids called Hola (Hang out with
the Language Assistant), which is a true multimodal assistant. Hola is a small
mobile-robot-based assistant capable of understanding the objects around it and responding to
questions about objects that it can see. Hola is also able to adjust the camera position and
its own position to make an extra attempt to understand an object using its robot control
mechanism. The technology behind it uses a combination of natural language understanding,
object detection, and hand pose detection. In addition, Hola supports reading books to kids in
the form of storytelling, using OCR. Children can ask a question about any word that they do
not understand, and Hola retrieves information from the internet and tells them the meaning and
other details of the word. After reading the book or a page, the robot asks the child questions
based on the words used in the book to confirm the child’s understanding.
- Fahim Salim
- Fasih Haider
- Sena Busra Yengec Tasdemir
- Vahid Naghashi
- Izem Tengiz
- Kubra Cengiz
- Dees Postma
- Robby van Delden
- Dennis Reidsma
- Saturnino Luz
- Bert-Jan van Beijnum
Quick and easy access to performance data during matches and training sessions is important for
both players and coaches. While there are many video tagging systems available, these systems
require manual effort. This paper proposes a system architecture that automatically supplements
video recordings by detecting events of interest in volleyball matches and training sessions to
provide tailored and interactive multi-modal feedback.
- Abdenaceur Abdouni
- Rory Clark
- Orestis Georgiou
Ultrasound is beyond the range of human hearing and tactile perception. In the past few years,
several modulation techniques have been invented to overcome this and evoke perceptible tactile
sensations of shapes and textures that can be felt, but not seen. Therefore, mid-air haptic
technology has found use in several human computer interaction applications and is the focus of
multiple research efforts. Visualising the induced acoustic pressure field can help understand
and optimise how different modulation techniques translate into tactile sensations. Here,
rather than using acoustic simulation tools to do that, we exploit the micro-displacement of a
thin layer of oil to visualize the impinging acoustic pressure field outputted from an
ultrasonic phased array device. Our demo uses a light source to illuminate the oil displacement
and project it onto a screen to produce an interactive lightbox display. Interaction is
facilitated via optical hand-tracking technology thus enabling an instantaneous and
aesthetically pleasing visualisation of mid-air haptics.
- Khalil J. Anderson
- Theodore Dubiel
- Kenji Tanaka
- Marcelo Worsley
- Cody Poultney
- Steve Brenneman
Instructors are often multitasking in the classroom. This makes it increasingly difficult for
them to pay attention to each individual’s engagement, especially during activities where
students are working in groups. In this paper, we describe a system that aids instructors in
supporting group collaboration by utilizing a centralized, easy-to-navigate dashboard connected
to multiple pods dispersed among groups of students in a classroom or laboratory. This allows
instructors to check multiple qualities of the discussion, such as the usage of
instructor-specified keywords, the relative participation of each individual, the speech acts
students are using, and different emotional characteristics of the group’s language.
- Aaron E. Rodriguez
- Adriana Camacho
- Laura J. Hinojos
- Mahdokht Afravi
- David Novick
SESSION: Challenge 1: The 1st Chinese Audio-Textual Spoken Language Understanding Challenge
- Xu Wang
- Chengda Tang
- Xiaotian Zhao
- Xuancai Li
- Zhuolin Jin
- Dequan Zheng
- Tiejun Zhao
In this paper, we present a series of methods to improve the performance of spoken language
understanding in the 1st Chinese Audio-Textual Spoken Language Understanding Challenge (CATSLU
2019), which aims to improve robustness to automatic speech recognition (ASR) errors and to
address the lack of labeled data in new domains. We combine word-level and character-level
information to improve the performance of the semantic parser. We also use transfer learning
methods such as correlation alignment to improve the robustness of the spoken language
understanding system. We then merge the rule-based method and the neural network method to
improve system output performance. In the video and weather domains, which have little training
data, we use both a transfer learning model trained on multi-domain data and a rule-based
approach. Our approaches achieve F1 scores of 86.83%, 92.84%, 94.16%, and 93.04% on the test
sets of the map, music, video and weather domains, respectively.
- Heyan Huang
- Xianling Mao
- Puhai Yang
As a critical component of a Spoken Dialog System (SDS), spoken language understanding (SLU)
attracts a lot of attention, especially methods based on unaligned data. Recently, a new
approach has been proposed that utilizes the hierarchical relationship between act-slot-value
triples. However, it ignores the transfer of internal information, which records intermediate
information from the upper level and can contribute to the prediction of the lower level. We
therefore propose a novel streamlined decoding structure with an attention mechanism, which
uses three successively connected RNNs to decode act, slot and value respectively. On the first
Chinese Audio-Textual Spoken Language Understanding Challenge (CATSLU), our model exceeds the
state-of-the-art model on an unaligned multi-turn task-oriented Chinese spoken dialogue dataset
provided by the contest.
- Su Zhu
- Zijian Zhao
- Tiejun Zhao
- Chengqing Zong
- Kai Yu
Spoken language understanding (SLU) is a key component of conversational dialogue systems, which
converts user utterances into semantic representations. Previous works mostly focus on parsing
semantics from textual inputs (the top hypothesis of speech recognition, or even manual
transcripts) while losing the information hidden in the audio. We herein describe the 1st Chinese
Audio-Textual Spoken Language Understanding Challenge (CATSLU) which introduces a new dataset
with audio-textual information, multiple domains and domain knowledge. We introduce two
scenarios of audio-textual SLU in which participants are encouraged to utilize data of other
domains or not. In this paper, we will describe the challenge and results.
- Chaohong Tan
- Zhenhua Ling
Spoken language understanding (SLU) is an important part of a spoken dialogue system (SDS). In
this paper, we focus on how to extract a set of act-slot-value tuples from users’
utterances in the 1st Chinese Audio-Textual Spoken Language Understanding Challenge (CATSLU).
This paper adopts the pretrained BERT model to encode users’ utterances and builds
multiple classifiers to obtain the required tuples. In our framework, finding acts and slot
values are treated as separate classification tasks. Such multi-task training is expected
to help the encoder gain a better understanding of the utterance. Since the system is built on
the transcriptions given by automatic speech recognition (ASR), some tricks are applied to
correct the errors of the tuples. We also found that using the minimum edit distance (MED)
between results and candidates to rebuild the tuples was beneficial in our experiments.
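The minimum-edit-distance correction mentioned above can be sketched as a generic snap-to-nearest-candidate step; the threshold and candidate list below are assumptions, not the authors’ exact rules.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance by dynamic programming (single-row variant)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # deletion
                                     dp[j - 1] + 1,        # insertion
                                     prev + (ca != cb))    # substitution
    return dp[-1]

def snap_to_candidate(predicted_value: str, candidates: list, max_dist: int = 2) -> str:
    """Replace an ASR-corrupted value by the closest ontology candidate
    if it is within a small edit distance; otherwise keep it unchanged."""
    best = min(candidates, key=lambda c: edit_distance(predicted_value, c))
    return best if edit_distance(predicted_value, best) <= max_dist else predicted_value

# Example with a hypothetical candidate list for a song/artist slot:
# snap_to_candidate("周杰轮", ["周杰伦", "林俊杰"])   # -> "周杰伦"
```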
- Hao Li
- Chen Liu
- Su Zhu
- Kai Yu
Spoken language understanding (SLU) converts user utterances into structured semantic forms.
There are still two main issues for SLU: robustness to ASR-errors and the data sparsity of new
and extended domains. In this paper, we propose a robust SLU system by leveraging both acoustic
and domain knowledge. We extract audio features by training ASR models on a large number of
utterances without semantic annotations. To exploit domain knowledge, we design lexicon
features from the domain ontology and propose an error elimination algorithm to help recover
predicted values from ASR errors. The results of the CATSLU challenge show that our system
outperforms all of the other teams across the four domains.
SESSION: Challenge 2: The 1st Mandarin Audio-Visual Speech Recognition Challenge (MAVSR)
- Yue Yao
- Tianyu Wang
- Heming Du
- Liang Zheng
- Tom Gedeon
Visual Keyword Spotting (KWS), a newly proposed task deriving from visual speech recognition,
has plenty of room for improvement. This paper details the Visual Keyword Spotting system we
used in the first Mandarin Audio-Visual Speech Recognition Challenge (MAVSR 2019). Under the
assumption that the vocabulary of the target dataset is a subset of the vocabulary of the
training set, we propose a simple and scalable classification-based strategy that achieves
19.0% mean average precision (mAP) in this challenge. Our method uses sliding windows to
bridge the word-level dataset and the sentence-level dataset, showing that a strong word-level
classifier can be used directly to build sentence embeddings, thereby making it possible to
build a KWS system.
- Yougen Yuan
- Wei Tang
- Minhao Fan
- Yue Cao
- Peng Zhang
- Lei Xie
Audio-visual understanding is usually challenged by the gap in bridging complementary audio and
visual information. Motivated by recent audio-visual studies, a closed-set word-level speech
recognition scheme is proposed for the Mandarin Audio-Visual Speech Recognition (MAVSR)
Challenge in this study. To initialize the audio and visual encoders more effectively, a
3-dimensional convolutional neural network (CNN) and an attention-based bi-directional long
short-term memory (Bi-LSTM) network are trained. With two fully connected layers on top of the
concatenated encoder outputs for audio-visual joint training, the proposed scheme won first
place with a relative word accuracy improvement of 7.9% over the audio-only system.
Experiments on the LRW-1000 dataset demonstrate that the proposed audio-visual joint training
scheme enhances recognition performance on relatively short-duration samples, revealing the
complementarity of the two modalities.
SESSION: Challenge 3: Seventh Emotion Recognition in the Wild Challenge (EmotiW)
This paper describes the Seventh Emotion Recognition in the Wild (EmotiW) Challenge. The EmotiW
benchmarking platform provides researchers with an opportunity to evaluate their methods on
affect labelled data. This year EmotiW 2019 encompasses three sub-challenges: a) Group-level
cohesion prediction; b) Audio-Video emotion recognition; and c) Student engagement prediction.
We discuss the databases used, the experimental protocols and the baselines.
- Kai Wang
- Jianfei Yang
- Da Guo
- Kaipeng Zhang
- Xiaojiang Peng
- Yu Qiao
This paper presents our approach for the engagement intensity regression task of EmotiW 2019.
The task is to predict the engagement intensity value of a student when he or she is watching
an online MOOC video in various conditions. Based on our winning solution from last year, we
mainly explore head features and body features with a bootstrap strategy and two novel loss
functions in this paper. We maintain the framework of multi-instance learning with a long
short-term memory (LSTM) network, and make three contributions. First, besides the gaze and
head pose features, we explore facial landmark features in our framework. Second, inspired by
the fact that engagement intensity values can be ranked, we design a rank loss as a
regularization which enforces a distance margin between the features of distant category pairs
and adjacent category pairs. Third, we use the classical bootstrap aggregation method to
perform model ensembling, which randomly samples the training data several times and then
averages the model predictions. We evaluate the performance of our method and discuss the
influence of each part on the validation dataset. Our method finally wins 3rd place with an MSE
of 0.0626 on the testing set.
https://github.com/kaiwang960112/EmotiW_2019_engagement_regression
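A plausible form of the rank-loss regularization described above, enforcing a margin between feature distances of adjacent versus distant engagement categories; the margin value and distance choice are assumptions, not the authors’ exact loss.

```python
import torch
import torch.nn.functional as F

def rank_loss(features: torch.Tensor, labels: torch.Tensor, margin: float = 1.0):
    """Encourage features of distant engagement categories to lie farther apart
    than features of adjacent categories, by at least `margin`.
    features: (N, D) float tensor; labels: (N,) integer engagement levels."""
    dists = torch.cdist(features, features)               # pairwise feature distances
    label_gap = (labels[:, None] - labels[None, :]).abs()
    adjacent = label_gap == 1
    distant = label_gap >= 2
    if not adjacent.any() or not distant.any():
        return features.new_zeros(())                      # nothing to compare in this batch
    # Hinge: mean adjacent distance plus margin should not exceed mean distant distance.
    return F.relu(dists[adjacent].mean() + margin - dists[distant].mean())

# Typical use as a regularizer (hypothetical names):
# total_loss = mse_loss + lambda_rank * rank_loss(feats, engagement_levels)
```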
- Da Guo
- Kai Wang
- Jianfei Yang
- Kaipeng Zhang
- Xiaojiang Peng
- Yu Qiao
This paper presents our approach for the group cohesion prediction sub-challenge in the EmotiW
2019. The task is to predict group cohesiveness in images. We mainly explore several
regularizations with three types of visual cues, namely face, body, and global image. Our main
contribution is two-fold. First, we jointly train the group cohesion prediction task and the
group emotion recognition task using a multi-task learning strategy with all visual cues.
Second, we carefully design two regularizations, namely a rank loss and an hourglass loss,
where the former aims to enforce a margin between the distances of distant categories and near
categories, and the latter aims to avoid the centralized predictions that arise when using only
an MSE loss. With careful evaluations, we finally achieve second place in this sub-challenge
with an MSE of 0.43821 on the testing set.
https://github.com/DaleAG/Group_Cohesion_Prediction
- Hengshun Zhou
- Debin Meng
- Yuanyuan Zhang
- Xiaojiang Peng
- Jun Du
- Kai Wang
- Yu Qiao
Audio-video emotion recognition aims to classify a given video into basic emotions. In this
paper, we describe our approaches in EmotiW 2019, which mainly explore emotion features and
feature fusion strategies for the audio and visual modalities. For emotion features, we explore
audio features based on both speech spectrograms and log Mel-spectrograms, and evaluate several
facial features with different CNN models and different emotion pretraining strategies. For
fusion strategies, we explore intra-modal and cross-modal fusion methods, such as designing
attention mechanisms to highlight important emotion features, and exploring feature
concatenation and factorized bilinear pooling (FBP) for cross-modal feature fusion. With
careful evaluation, we obtain 65.5% on the AFEW validation set and 62.48% on the test set,
ranking third in the challenge.
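Factorized bilinear pooling for fusing audio and visual features can be sketched as below; this is a generic implementation of the technique with placeholder dimensions, not the authors’ exact fusion module.

```python
import torch
import torch.nn as nn

class FactorizedBilinearPooling(nn.Module):
    """Fuse two modality vectors by projecting each to an (out_dim * factor) space,
    multiplying element-wise, and sum-pooling over the factor dimension."""
    def __init__(self, audio_dim=256, visual_dim=512, out_dim=128, factor=4):
        super().__init__()
        self.factor = factor
        self.proj_a = nn.Linear(audio_dim, out_dim * factor)
        self.proj_v = nn.Linear(visual_dim, out_dim * factor)

    def forward(self, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        joint = self.proj_a(a) * self.proj_v(v)                        # (B, out_dim * factor)
        joint = joint.view(-1, joint.size(1) // self.factor, self.factor).sum(dim=2)
        # Power normalization followed by L2 normalization, as is common for FBP.
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)
        return nn.functional.normalize(joint, dim=1)

if __name__ == "__main__":
    fbp = FactorizedBilinearPooling()
    print(fbp(torch.randn(8, 256), torch.randn(8, 512)).shape)        # (8, 128)
```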
- Van Thong Huynh
- Soo-Hyung Kim
- Guee-Sang Lee
- Hyung-Jeong Yang
This paper describes an approach for the engagement prediction task, a sub-challenge of the 7th
Emotion Recognition in the Wild Challenge (EmotiW 2019). Our method involves three fundamental
steps: feature extraction, regression and model ensembling. In the first step, an input video
is divided into multiple overlapping segments (instances) and features are extracted for each
instance. Combinations of long short-term memory (LSTM) and fully connected layers are deployed
to capture the temporal information and regress the engagement intensity from the features of
the previous step. In the last step, we perform fusion to achieve better performance. Overall,
our approach achieved a mean square error of 0.0597, which is 4.63% lower than the best result
from last year.
- Tien Xuan Dang
- Soo-Hyung Kim
- Hyung-Jeong Yang
- Guee-Sang Lee
- Thanh-Hung Vo
In this paper, we propose a hybrid deep learning network for predicting group cohesion in
images. It is a kind of regression problem and its objective is to predict the Group Cohesion
Score (GCS), which is in the range of [0, 3]. To solve this problem, we exploit four types of
visual cues, namely scene, skeleton, UV coordinates and face images, along with
state-of-the-art convolutional neural networks (CNNs). We use not only fusion but also ensemble
methods to combine these approaches. Our proposed hybrid network achieves mean square errors
(MSEs) of 0.517 and 0.416 on the validation and testing sets, respectively. We finally achieved
first place in the Group-level Cohesion Sub-challenge (GC) of EmotiW 2019.
- Bin Zhu
- Xin Guo
- Kenneth Barner
- Charles Boncelet
Group cohesiveness is a compelling and often-studied construct in group dynamics and group
performance. The enormous number of web images of groups of people can be used to develop an
effective method to detect group cohesiveness. This paper introduces an automatic group
cohesiveness prediction method for the 7th Emotion Recognition in the Wild (EmotiW 2019) Grand
Challenge in the category of Group-based Cohesion Prediction. The task is to predict the
cohesive level for a group of people in images. To tackle this problem, a hybrid network
including regression models which are separately trained on face features, skeleton features,
and scene features is proposed. Predicted regression values, corresponding to each feature, are
fused for the final cohesive intensity. Experimental results demonstrate that the proposed
hybrid network is effective and makes promising improvements. A mean squared error (MSE) of
0.444 is achieved on the testing set, which outperforms the baseline MSE of 0.5.
- Jianming Wu
- Zhiguang Zhou
- Yanan Wang
- Yi Li
- Xin Xu
- Yusuke Uchida
This paper proposes a novel engagement intensity prediction approach, which is also applied in
the EmotiW Challenge 2019 and resulted in good performance. The task is to predict the
engagement level when a subject student is watching an educational video in diverse conditions
and various environments. Assuming that the engagement intensity has a strong correlation with
facial movements, upper-body posture movements and overall environmental movements in a time
interval, we extract and incorporate these motion features into a deep regression model
consisting of layers with a combination of LSTM, Gated Recurrent Unit (GRU) and a Fully
Connected Layer. In order to precisely and robustly predict the engagement level in a long
video with various situations such as darkness and complex backgrounds, a multi-feature
engineering method is used to extract synchronized multi-modal features over a period of time
by considering both short-term and long-term dependencies. Based on the well-processed
features, we propose a strategy of maximizing validation accuracy to generate the best models
covering all the model configurations. Furthermore, to avoid the overfitting problem ascribed
to the extremely small database, we propose another strategy that applies a single Bi-LSTM
layer with only 16 units to minimize overfitting, and splits the engagement dataset (train +
validation) with 5-fold cross validation (stratified k-fold) to train a conservative model. By
ensembling the above models, our method finally wins second place in the challenge with an MSE
of 0.06174 on the testing set.
- Sunan Li
- Wenming Zheng
- Yuan Zong
- Cheng Lu
- Chuangao Tang
- Xingxun Jiang
- Jiateng Liu
- Wanchuang Xia
Emotion recognition in the wild has been a hot research topic in the field of affective
computing. Though some progress has been achieved, emotion recognition in the wild remains an
unsolved problem due to the challenges of head movement, face deformation, illumination
variation, etc. To deal with these unconstrained challenges, we propose a bi-modality fusion
method for video-based emotion recognition in the wild. The proposed framework takes advantage
of the visual information from facial expression sequences and the speech information from
audio. State-of-the-art CNN-based object recognition models are employed to facilitate facial
expression recognition performance. A bi-directional long short-term memory (Bi-LSTM) network
is employed to capture the dynamic information of the learned features. Additionally, to take
full advantage of the facial expression information, the VGG16 network is trained on the
AffectNet dataset to learn a specialized facial expression recognition model. On the other
hand, audio-based features, such as low-level descriptors (LLDs) and deep features obtained
from spectrogram images, are also developed to improve emotion recognition performance. Our
best experimental result shows that the overall accuracy of our algorithm on the test set of
the EmotiW challenge is 62.78%, which outperforms the best result of EmotiW 2018 and ranks 2nd
in the EmotiW 2019 challenge.
- Yanan Wang
- Jianming Wu
- Keiichiro Hoashi
Humans routinely pay attention to important emotion information from the visual and audio
modalities without considering multimodal alignment issues, and recognize emotions by
integrating important multimodal information at certain intervals. In this paper, we propose a
multiple attention fusion network (MAFN) with the goal of improving emotion recognition
performance by modeling human emotion recognition mechanisms. MAFN consists of two types of
attention mechanisms: the intra-modality attention mechanism is applied to dynamically extract
representative emotion features from single-modality frame sequences; the inter-modality
attention mechanism is applied to automatically highlight specific modal features based on
their importance. In addition, we define a multimodal domain adaptation method that has a
positive effect on capturing interactions between modalities. MAFN achieved 58.65% recognition
accuracy on the AFEW testing set, which is a significant improvement compared with the baseline
of 41.07%.
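The inter-modality attention idea, weighting each modality’s representation by a learned importance score before fusion, can be illustrated with the following generic sketch; it is not the exact MAFN architecture, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class InterModalityAttention(nn.Module):
    """Compute a softmax weight per modality from its own representation and
    return the weighted sum as the fused feature (dimensions are placeholders)."""
    def __init__(self, dim=128, n_modalities=2):
        super().__init__()
        self.scorers = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_modalities)])

    def forward(self, modality_feats):
        # modality_feats: list of (batch, dim) tensors, one per modality.
        scores = torch.cat([s(f) for s, f in zip(self.scorers, modality_feats)], dim=1)
        weights = torch.softmax(scores, dim=1)                 # (batch, n_modalities)
        stacked = torch.stack(modality_feats, dim=1)           # (batch, n_modalities, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)    # (batch, dim)

if __name__ == "__main__":
    fuse = InterModalityAttention()
    audio, video = torch.randn(4, 128), torch.randn(4, 128)
    print(fuse([audio, video]).shape)                          # (4, 128)
```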