ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction
SESSION: Keynote Talks
Most research on language has focused on spoken and written language only. However,
when we use language in face-to-face interactions, we use not only speech but also
bodily actions, such as hand gestures, in meaningful ways to communicate our
messages and in ways closely linked to the spoken aspects of our language. For example,
we can enhance or complement our speech with a drinking gesture, a so-called iconic
gesture, as we say 'we stayed up late last night'. In this talk I will summarize research
that investigates how such meaningful bodily actions are recruited in using language
as a dynamic, adaptive, and flexible system, and how gestures interact with speech during
production and comprehension of language at the behavioral, cognitive, and neural
levels. The first part of the lecture will focus on how gestures are linked to the language
production system even though they have a very different representational format (i.e.,
iconic and analogue) than speech (arbitrary, discrete, and categorical) [1], and how
they express communicative intentions during language use [2]. In doing so, I will
show different ways gestures are linked to speech in different languages and in different
communicative contexts, as well as in bilinguals and language learners. The second part
of the talk will focus on how gestures influence and enhance language comprehension
by reducing the ambiguity of the communicative signal [3] and providing kinematic
cues to the communicative intentions of the speaker [4], and on the underlying neural
correlates of gesture that facilitate its role in language comprehension [5]. In
the final part of the talk I will show how gestures facilitate mutual understanding,
that is, alignment between interactants in dialogue. Overall, I will claim that a complete
understanding of the role language plays in our cognition and communication is not
possible without having a multimodal approach.
Musical performance can be thought of in multimodal terms - physical interaction with
musical instruments produces sound output, often while the performer is visually reading
a score. Digital Musical Instrument (DMI) design merges tenets of HCI and musical
instrument practice. Audiovisual performance and other forms of multimedia might benefit
from multimodal thinking. This keynote revisits two decades of interactive music practice
that has paralleled the development of the field of multimodal interaction research.
The BioMuse was an early digital musical instrument system using EMG muscle sensing
that was extended by a second mode of sensing, allowing effort and position to be
two complementary modalities [1]. The Haptic Wave applied principles of cross-modal
information display to create a haptic audio editor enabling visually impaired audio
producers to 'feel' audio waveforms they could not see in a graphical user interface
[2]. VJ culture extends the idea of music DJs to create audiovisual cultural experiences.
AVUIs were a set of creative coding tools that enabled the convergence of performance
UI and creative visual output [3]. The Orchestra of Rocks is a continuing collaboration
with visual artist Uta Kogelsberger that has manifested itself through physical and
virtual forms - allowing multimodality over time [4]. Be it a physical exhibition
in a gallery or audio reactive 3D animation on YouTube 360, the multiple modes in
which an artwork is articulated support its original conceptual foundations. These
four projects situate multimodal interaction at the heart of artistic research.
Multimodal machine intelligence offers enormous possibilities for helping understand
the human condition and for creating technologies to support and enhance human experiences
[1, 2]. What makes such approaches and systems exciting is the promise they hold for
adaptation and personalization in the presence of the rich and vast inherent heterogeneity,
variety and diversity within and across people. Multimodal engineering approaches
can help analyze human trait (e.g., age), state (e.g., emotion), and behavior dynamics
(e.g., interaction synchrony) objectively, and at scale. Machine intelligence could
also help detect and analyze deviation in patterns from what is deemed typical. These
techniques in turn can assist, facilitate or enhance decision making by humans, and
by autonomous systems. Realizing such a promise requires addressing two major lines
of often intertwined challenges: creating inclusive technologies that work for everyone
while enabling tools that can illuminate the source of variability or difference of
interest.
This talk will highlight some of these possibilities and opportunities through examples
drawn from two specific domains. The first relates to advancing health informatics
in behavioral and mental health [3, 4]. With over 10% of the world's population affected,
and with clinical research and practice heavily dependent on (relatively scarce) human
expertise in diagnosing, managing and treating the condition, engineering opportunities
in offering access and tools to support care at scale are immense. For example, in
determining whether a child is on the Autism spectrum, a clinician would engage and
observe a child in a series of interactive activities, targeting relevant cognitive,
communicative, and socio-emotional aspects, and codify specific patterns of interest,
e.g., typicality of vocal intonation, facial expressions, and joint attention behavior.
Machine-intelligence-driven processing of speech, language, visual, and physiological
data, combined with other forms of clinical data, enables novel and objective
ways of supporting and scaling up these diagnostics. Likewise, multimodal systems
can automate the analysis of a psychotherapy session, including computing treatment
quality-assurance measures, e.g., rating a therapist's expressed empathy. These technology
possibilities can go beyond the traditional realm of clinics, directly to patients
in their natural settings. For example, remote multimodal sensing of biobehavioral
cues can enable new ways of screening and tracking behaviors (e.g., stress in the workplace)
and progress in treatment (e.g., for depression), and offer just-in-time support.
The second example is drawn from the world of media. Media are created by humans and
for humans to tell stories. They cover an amazing range of domains, from the arts and
entertainment to news, education, and commerce, and in staggering volume. Machine intelligence
tools can help analyze media and measure their impact on individuals and society.
This includes offering objective insights into diversity and inclusion in media representations
through robustly characterizing media portrayals from an intersectional perspective
along relevant dimensions of inclusion: gender, race, age, ability, and other
attributes, and in creating tools to support change [5, 6]. Again, this underscores
the twin technology requirements: to perform equally well in characterizing individuals
regardless of the dimensions of the variability, and use those inclusive technologies
to shine light on and create tools to support diversity and inclusion.
SESSION: Long Papers
- Ashwin Ramesh Babu
- Mohammad Zaki Zadeh
- Ashish Jaiswal
- Alexis Lueckenhoff
- Maria Kyrarini
- Fillia Makedon
In recent years, computer and game-based cognitive tests have become popular with
the advancement in mobile technology. However, these tests require very little body
movements and do not consider the influence that physical motion has on cognitive
development. Our work mainly focus on assessing cognition in children through their
physical movements. Hence, an assessment test "Ball-Drop-to-the-Beat" that is both
physically and cognitively demanding has been used where the child is expected to
perform certain actions based on the commands. The task is specifically designed to
measure attention, response inhibition, and coordination in children. A dataset has
been created with 25 children performing this test. To automate the scoring, a computer
vision-based assessment system has been developed. The vision system employs an attention-based
fusion mechanism to combine multiple modalities such as optical flow, human poses,
and objects in the scene to predict a child's action. The proposed method outperforms
other state-of-the-art approaches by achieving an average accuracy of 89.8 percent
on predicting the actions and an average accuracy of 88.5 percent on predicting the
rhythm on the Ball-Drop-to-the-Beat dataset.
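As a rough, hypothetical illustration of the attention-based fusion idea described above (not the authors' implementation), the sketch below weights per-modality feature vectors, such as optical-flow, pose, and object features, with learned attention scores before classification; all layer sizes are assumed.

```python
# Minimal sketch (PyTorch) of attention-weighted fusion over per-modality
# feature vectors (e.g., optical flow, pose, object streams). Dimensions
# and layer sizes are illustrative, not those of the paper.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, feat_dim: int, n_classes: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)        # one scalar score per modality
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, modality_feats: torch.Tensor) -> torch.Tensor:
        # modality_feats: (batch, n_modalities, feat_dim)
        weights = torch.softmax(self.score(modality_feats), dim=1)  # (B, M, 1)
        fused = (weights * modality_feats).sum(dim=1)               # (B, feat_dim)
        return self.classifier(fused)

model = AttentionFusion(feat_dim=256, n_classes=4)
logits = model(torch.randn(8, 3, 256))   # 3 modalities: flow, pose, objects
```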
- Shane D. Sims
- Cristina Conati
Encouraged by the success of deep learning in a variety of domains, we investigate
the effectiveness of a novel application of such methods for detecting user confusion
with eye-tracking data. We introduce an architecture that uses RNN and CNN sub-models
in parallel, to take advantage of the temporal and visuospatial aspects of our data.
Experiments with a dataset of user interactions with the ValueChart visualization
tool show that our model outperforms an existing model based on a Random Forest classifier,
resulting in a 22% improvement in combined confused & not confused class accuracies.
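A minimal sketch of the kind of parallel RNN + CNN architecture described, under assumed input shapes (a gaze-feature time series and a 2D scanpath image); this is illustrative, not the paper's exact model.

```python
# Illustrative parallel RNN + CNN model (PyTorch): the RNN consumes a gaze
# time series, the CNN a 2D fixation/scanpath image, and their outputs are
# concatenated for confusion classification. Shapes are hypothetical.
import torch
import torch.nn as nn

class ParallelRnnCnn(nn.Module):
    def __init__(self, seq_feat_dim=6, hidden=64, n_classes=2):
        super().__init__()
        self.rnn = nn.GRU(seq_feat_dim, hidden, batch_first=True)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(hidden + 32, n_classes)

    def forward(self, gaze_seq, scanpath_img):
        _, h = self.rnn(gaze_seq)            # h: (1, batch, hidden)
        visual = self.cnn(scanpath_img)      # (batch, 32)
        return self.head(torch.cat([h[-1], visual], dim=1))

model = ParallelRnnCnn()
out = model(torch.randn(4, 100, 6), torch.randn(4, 1, 64, 64))
```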
- Cigdem Beyan
- Matteo Bustreo
- Muhammad Shahid
- Gian Luca Bailo
- Nicolo Carissimi
- Alessio Del Bue
We present the first publicly available annotations for the analysis of face-touching
behavior. These annotations are for a dataset composed of audio-visual recordings
of small-group social interactions, comprising 64 videos, each lasting
between 12 and 30 minutes and showing a single person participating in four-person
meetings. The annotations were performed by 16 annotators in total, with almost perfect
agreement (Cohen's Kappa = 0.89) on average. In total, 74K and 2M video frames were labelled as
face-touch and no-face-touch, respectively. Given the dataset and the collected annotations,
we also present an extensive evaluation of several methods for face-touching detection:
rule-based methods, supervised learning with hand-crafted features, and feature learning
and inference with a Convolutional Neural Network (CNN). Our evaluation indicates that,
among all methods, the CNN performed best, reaching an 83.76% F1-score and a 0.84 Matthews Correlation
Coefficient. To foster future research in this problem, code and dataset were made
publicly available (github.com/IIT-PAVIS/Face-Touching-Behavior), providing all video
frames, face-touch annotations, body pose estimations including face and hands key-points
detection, face bounding boxes as well as the baseline methods implemented and the
cross-validation splits used for training and evaluating our models.
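For reference, the reported metrics (F1-score and Matthews Correlation Coefficient) can be computed for binary face-touch predictions as in this small scikit-learn example with toy labels.

```python
# Hedged illustration of the reported evaluation metrics on toy labels:
# F1-score and Matthews Correlation Coefficient for binary
# face-touch vs. no-face-touch predictions, via scikit-learn.
from sklearn.metrics import f1_score, matthews_corrcoef

y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]   # 1 = face-touch frame
y_pred = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]

print("F1 :", f1_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
```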
- Felix Putze
- Dennis Küster
- Timo Urban
- Alexander Zastrow
- Marvin Kampen
We developed an attention-sensitive system that is capable of playing the children's
guessing game "I spy with my litte eye" with a human user. In this game, the user
selects an object from a given scene and provides the system with a single-sentence
clue about it. For each trial, the system tries to guess the target object. Our approach
combines top-down and bottom-up machine learning for object and color detection, automatic
speech recognition, natural language processing, a semantic database, eye tracking,
and augmented reality. Our evaluation demonstrates performance significantly above
chance level, and results for most of the individual machine learning components are
encouraging. Participants reported very high levels of satisfaction and curiosity
about the system. The collected data shows that our guessing game generates a complex
and rich data set. We discuss the capabilities and challenges of our system and its
components with respect to multimodal attention sensing.
- Md Mahbubur Rahman
- Mohsin Yusuf Ahmed
- Tousif Ahmed
- Bashima Islam
- Viswam Nathan
- Korosh Vatanparvar
- Ebrahim Nemati
- Daniel McCaffrey
- Jilong Kuang
- Jun Alex Gao
Mobile respiratory assessment using commodity smartphones and smartwatches is an unmet
need for patient monitoring at home. In this paper, we show the feasibility of using
multimodal sensors embedded in consumer mobile devices for non-invasive, low-effort
respiratory assessment. We have conducted studies with 228 chronic respiratory patients
and healthy subjects, and show that our model can estimate respiratory rate with mean
absolute error (MAE) of 0.72 $\pm$ 0.62 breaths per minute and differentiate respiratory
patients from healthy subjects with 90% recall and 76% precision when the user breathes
normally by holding the device on the chest or the abdomen for a minute. Holding the
device on the chest or abdomen needs significantly lower effort compared to traditional
spirometry which requires a specialized device and forceful vigorous breathing. This
paper shows the feasibility of developing a low-effort respiratory assessment towards
making it available anywhere, anytime through users' own mobile devices.
- Angela Constantinescu
- Karin Müller
- Monica Haurilet
- Vanessa Petrausch
- Rainer Stiefelhagen
Digital navigation tools for helping people with visual impairments have become increasingly
popular in recent years. While conventional navigation solutions give routing instructions
to the user, systems such as GoogleMaps, BlindSquare, or Soundscape offer additional
information about the surroundings and, thereby, improve the orientation of people
with visual impairments. However, these systems only provide information about static
environments, while dynamic scenes comprising objects such as bikes, dogs, and persons
are not considered. In addition, both the routing and the information about the environment
are usually conveyed by speech. We address this gap and implement a mobile system
that combines object identification with a sonification interface. Our system can
be used in three different scenarios of macro and micro navigation: orientation, obstacle
avoidance, and exploration of known and unknown routes. Our proposed system leverages
popular computer vision methods to localize 18 static and dynamic object classes in
real-time. At the heart of our system is a mixed reality sonification interface which
is adaptable to the user's needs and is able to transmit the recognized semantic information
to the user. The system was designed following a user-centered approach. An exploratory user
study showed that our object-to-sound mapping with auditory icons
is intuitive. On average, users perceived our system as useful and indicated that
they want to know more about their environment, apart from wayfinding and points of
interest.
- Cisem Ozkul
- David Geerts
- Isa Rutten
As Mid-Air Haptic (MAH) feedback, which provides a sensation of touch without direct
physical contact, is a relatively new technology, research investigating MAH feedback
in home usage as well as multi-sensory integration with MAH feedback is still scarce.
To address this gap, we propose a possible usage context for MAH feedback, perform
an experiment by manipulating auditory and haptic feedback in various physical qualities
and suggest possible combinations for positive experiences. Certain sensory combinations
led to changes in the emotional responses, as well as the responses regarding utilitarian
(e.g. clarity) and perceptual (sensory match) qualities. The results show an added
value of MAH feedback when added to sensory compositions, and an increase in the positive
experiences induced by MAH length and multimodality.
- Michal Muszynski
- Jamie Zelazny
- Jeffrey M. Girard
- Louis-Philippe Morency
Recent progress in artificial intelligence has led to the development of automatic
behavioral marker recognition, such as facial and vocal expressions. Those automatic
tools have enormous potential to support mental health assessment, clinical decision
making, and treatment planning. In this paper, we investigate nonverbal behavioral
markers of depression severity assessed during semi-structured medical interviews
of adolescent patients. The main goal of our research is two-fold: studying a unique
population of adolescents at high risk of mental disorders and differentiating mild
depression from moderate or severe depression. We aim to explore computationally inferred
facial and vocal behavioral responses elicited by three segments of the semi-structured
medical interviews: Distress Assessment Questions, Ubiquitous Questions, and Concept
Questions. Our experimental methodology reflects best practices for analyzing
small sample size and unbalanced datasets of unique patients. Our results show a very
interesting trend with strongly discriminative behavioral markers from both acoustic
and visual modalities. These promising results are likely due to the unique classification
task (mild depression vs. moderate and severe depression) and three types of probing
questions.
- Nujud Aloshban
- Anna Esposito
- Alessandro Vinciarelli
This article investigates whether it is possible to detect depression using less than
10 seconds of speech. The experiments have involved 59 participants (including 29
that have been diagnosed with depression by a professional psychiatrist) and are based
on a multimodal approach that jointly models linguistic (what people say) and acoustic
(how people say it) aspects of speech using four different strategies for the fusion
of multiple data streams. On average, each interview lasted 242.2 seconds,
but the results show that 10 seconds or less are sufficient to achieve the same level
of recall (roughly 70%) observed when using the entire interview of every participant.
In other words, it is possible to maintain the same level of sensitivity (the name
given to recall in clinical settings) while reducing by 95%, on average, the amount of time
required to collect the necessary data.
- Dong Bach Vo
- Stephen Brewster
- Alessandro Vinciarelli
This work investigates the interplay between Child-Computer Interaction and attachment,
a psychological construct that accounts for how children perceive their parents to
be. In particular, the article makes use of a multimodal approach to test whether
children with different attachment conditions tend to use the same interactive system
differently. The experiments show that the accuracy in predicting usage behaviour changes,
to a statistically significant extent, according to the attachment conditions of the
52 experiment participants (age-range 5 to 9). Such a result suggests that attachment-relevant
processes are actually at work when people interact with technology, at least when
it comes to children.
- Huili Chen
- Yue Zhang
- Felix Weninger
- Rosalind Picard
- Cynthia Breazeal
- Hae Won Park
Automatic speech-based affect recognition of individuals in dyadic conversation is
a challenging task, in part because of its heavy reliance on manual pre-processing.
Traditional approaches frequently require hand-crafted speech features and segmentation
of speaker turns. In this work, we design end-to-end deep learning methods to recognize
each person's affective expression in an audio stream with two speakers, automatically
discovering features and time regions relevant to the target speaker's affect. We
integrate a local attention mechanism into the end-to-end architecture and compare
the performance of three attention implementations - one mean pooling and two weighted
pooling methods. Our results show that the proposed weighted-pooling attention solutions
are able to learn to focus on the regions containing the target speaker's affective information
and successfully extract the individual's valence and arousal intensity. Here we introduce
and use a "dyadic affect in multimodal interaction - parent to child" (DAMI-P2C) dataset
collected in a study of 34 families, where a parent and a child (3-7 years old) engage
in reading storybooks together. In contrast to existing public datasets for affect
recognition, each instance for both speakers in the DAMI-P2C dataset is annotated
for the perceived affect by three labelers. To encourage more research on the challenging
task of multi-speaker affect sensing, we make the annotated DAMI-P2C dataset publicly
available, including acoustic features of the dyads' raw audios, affect annotations,
and a diverse set of developmental, social, and demographic profiles of each dyad.
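The following is a simplified sketch of weighted-pooling attention over frame-level audio features for regressing a target speaker's valence and arousal; layer names and dimensions are assumptions rather than the authors' architecture.

```python
# Sketch (PyTorch) of weighted-pooling attention over frame-level audio
# features, regressing valence and arousal for a target speaker in a
# two-speaker stream. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class WeightedPoolAffect(nn.Module):
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)       # frame-level relevance scores
        self.regressor = nn.Linear(hidden, 2)  # valence, arousal

    def forward(self, frames):
        # frames: (batch, time, feat_dim), both speakers mixed in one stream
        h, _ = self.encoder(frames)                 # (B, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)      # (B, T, 1)
        pooled = (w * h).sum(dim=1)                 # weighted pooling over time
        return self.regressor(pooled)               # (B, 2)

pred = WeightedPoolAffect()(torch.randn(2, 500, 40))
```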
- Andrew Emerson
- Nathan Henderson
- Jonathan Rowe
- Wookhee Min
- Seung Lee
- James Minogue
- James Lester
Modeling visitor engagement is a key challenge in informal learning environments,
such as museums and science centers. Devising predictive models of visitor engagement
that accurately forecast salient features of visitor behavior, such as dwell time,
holds significant potential for enabling adaptive learning environments and visitor
analytics for museums and science centers. In this paper, we introduce a multimodal
early prediction approach to modeling visitor engagement with interactive science
museum exhibits. We utilize multimodal sensor data including eye gaze, facial expression,
posture, and interaction log data captured during visitor interactions with an interactive
museum exhibit for environmental science education, to induce predictive models of
visitor dwell time. We investigate machine learning techniques (random forest, support
vector machine, Lasso regression, gradient boosting trees, and multi-layer perceptron)
to induce multimodal predictive models of visitor engagement with data from 85 museum
visitors. Results from a series of ablation experiments suggest that incorporating
additional modalities into predictive models of visitor engagement improves model
accuracy. In addition, the models show improved predictive performance over time,
demonstrating that increasingly accurate predictions of visitor dwell time can be
achieved as more evidence becomes available from visitor interactions with interactive
science museum exhibits. These findings highlight the efficacy of multimodal data
for modeling museum exhibit visitor engagement.
- Mounia Ziat
- Katherine Chin
- Roope Raisamo
In this study, we assessed the emotional dimensions (valence, arousal, and dominance)
of the multimodal visual-cutaneous rabbit effect. Simultaneously with the tactile bursts
on the forearm, visual silhouettes of saltatorial animals (rabbit, kangaroo, spider,
grasshopper, frog, and flea) were projected on the left arm. Additionally, there were
two locomotion conditions: taking-off and landing. The results showed that the valence
dimension (happy-unhappy) was only affected by the visual stimuli with no effect of
the tactile conditions nor the locomotion phases. Arousal (excited-calm) showed a
significant difference for the three tactile conditions with an interaction effect
with the locomotion condition. Arousal scores were higher when the taking-off condition
was associated with the intermediate duration (24 ms) and when the landing condition
was associated with either the shortest duration (12 ms) or the longest duration (48
ms). There was no effect for the dominance dimension. Similar to our previous results,
the valence dimension seems to be highly affected by visual information reducing any
effect of tactile information, while touch can modulate the arousal dimension. This
can be beneficial for designing multimodal interfaces for virtual or augmented reality.
- Shaun Alexander Macdonald
- Stephen Brewster
- Frank Pollick
This paper describes a novel category of affective vibrotactile stimuli which evoke
real-world sensations and details a study into emotional responses to them. The affective
properties of short and abstract vibrotactile waveforms have previously been studied
and shown to have a narrow emotional range. By contrast this paper investigated emotional
responses to longer waveforms and to emotionally resonant vibrotactile stimuli, stimuli
which are evocative of real-world sensations such as animal purring or running water.
Two studies were conducted. The first recorded emotional responses to Tactons with
a duration of 20 seconds. The second investigated emotional responses to novel emotionally
resonant stimuli. Stimuli that users found more emotionally resonant were more pleasant,
particularly if they had prior emotional connections to the sensation represented.
Results suggest that future designers could use emotional resonance to expand the
affective response range of vibrotactile cues by utilising stimuli with which users
bear an emotional association.
- Nathan Henderson
- Wookhee Min
- Jonathan Rowe
- James Lester
Accurately detecting and responding to student affect is a critical capability for
adaptive learning environments. Recent years have seen growing interest in modeling
student affect with multimodal sensor data. A key challenge in multimodal affect detection
is dealing with data loss due to noisy, missing, or invalid multimodal features. Because
multimodal affect detection often requires large quantities of data, data loss can
have a strong, adverse impact on affect detector performance. To address this issue,
we present a multimodal data imputation framework that utilizes conditional generative
models to automatically impute posture and interaction log data from student interactions
with a game-based learning environment for emergency medical training. We investigate
two generative models, a Conditional Generative Adversarial Network (C-GAN) and a
Conditional Variational Autoencoder (C-VAE), that are trained using a modality that
has undergone varying levels of artificial data masking. The generative models are
conditioned on the corresponding intact modality, enabling the data imputation process
to capture the interaction between the concurrent modalities. We examine the effectiveness
of the conditional generative models on imputation accuracy and its impact on the
performance of affect detection. Each imputation model is evaluated using varying
amounts of artificial data masking to determine how the data missingness impacts the
performance of each imputation method. Results based on the modalities captured from
students' interactions with the game-based learning environment indicate that deep
conditional generative models within a multimodal data imputation framework yield
significant benefits compared to baseline imputation techniques in terms of both imputation
accuracy and affective detector performance.
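A minimal sketch of a conditional VAE for modality imputation, in the spirit described above: the masked modality is reconstructed conditioned on the intact one. Dimensions and loss weighting are illustrative assumptions.

```python
# Minimal conditional VAE sketch (PyTorch): impute a masked modality x
# (e.g., posture features) conditioned on an intact modality c
# (e.g., interaction-log features). Dimensions are illustrative only.
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, x_dim=32, c_dim=16, z_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, z_dim)
        self.logvar = nn.Linear(64, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, 64), nn.ReLU(),
                                 nn.Linear(64, x_dim))

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(torch.cat([z, c], dim=1)), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()
    kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return recon + kld

# At imputation time, z is sampled from the prior and decoded given c alone.
x, c = torch.randn(16, 32), torch.randn(16, 16)
x_hat, mu, logvar = CVAE()(x, c)
loss = vae_loss(x, x_hat, mu, logvar)
```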
- Ryosuke Ueno
- Yukiko I. Nakano
- Jie Zeng
- Fumio Nihei
Providing feedback to a speaker is an essential communication signal for maintaining
a conversation. In specific feedback, which indicates the listener's reaction to the
speaker's utterances, the facial expression is an effective modality for conveying
the listener's reactions. Moreover, not only the type of facial expressions, but also
the degree of intensity of the expressions, may influence the meaning of the specific
feedback. In this study, we propose a multimodal deep neural network model that predicts
the intensity of facial expressions co-occurring with feedback responses. We focus
on multiparty video-mediated communication. In video-mediated communication, close-up
frontal face images of each participant are continuously presented on the display;
the attention of the participants is more likely to be drawn to the facial expressions.
We assume that in such communication, the importance of facial expression in the listeners'
feedback responses increases. We collected 33 video-mediated conversations by groups
of three people and obtained audio and speech data for each participant. Using the
corpus collected as a dataset, we created a deep neural network model that predicts
the intensity of 17 types of action units (AUs) co-occurring with the feedback responses.
The proposed method employed a GRU-based model with an attention mechanism for audio, visual,
and language modalities. A decoder was trained to produce the intensity values for
the 17 AUs frame by frame. In the experiment, unimodal and multimodal models were
compared in terms of their performance in predicting salient AUs that characterize
facial expression in feedback responses. The results suggest that well-performing
models differ depending on the AU categories; audio information was useful for predicting
AUs that express happiness, and visual and language information contributes to predicting
AUs expressing sadness and disgust.
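Below is a rough sketch of a GRU-plus-attention multimodal encoder feeding a frame-wise decoder over 17 AU intensities, as a hedged approximation of the described setup; feature dimensions and the fusion scheme are assumptions.

```python
# Hedged sketch (PyTorch): per-modality GRU encoders with attention, fused
# and decoded frame by frame into 17 AU intensity values. Dimensions and
# the fusion scheme are assumptions, not the authors' exact model.
import torch
import torch.nn as nn

class AUIntensityModel(nn.Module):
    def __init__(self, dims=(40, 128, 300), hidden=64, n_aus=17):
        super().__init__()
        # one GRU encoder per modality: audio, visual, language
        self.encoders = nn.ModuleList(nn.GRU(d, hidden, batch_first=True) for d in dims)
        self.attn = nn.Linear(hidden, 1)
        self.decoder = nn.GRU(hidden * len(dims), hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_aus)

    def attend(self, seq):
        w = torch.softmax(self.attn(seq), dim=1)   # attention over time steps
        return (w * seq).sum(dim=1)                # (batch, hidden)

    def forward(self, audio, visual, text, n_frames):
        ctx = torch.cat([self.attend(enc(x)[0])
                         for enc, x in zip(self.encoders, (audio, visual, text))], dim=1)
        dec_in = ctx.unsqueeze(1).repeat(1, n_frames, 1)   # context at every output frame
        h, _ = self.decoder(dec_in)
        return self.out(h)                                 # (batch, n_frames, 17)

aus = AUIntensityModel()(torch.randn(2, 200, 40), torch.randn(2, 200, 128),
                         torch.randn(2, 30, 300), n_frames=50)
```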
- Bernd Dudzik
- Joost Broekens
- Mark Neerincx
- Hayley Hung
Empirical evidence suggests that the emotional meaning of facial behavior in isolation
is often ambiguous in real-world conditions. While humans complement interpretations
of others' faces with additional reasoning about context, automated approaches rarely
display such context-sensitivity. Empirical findings indicate that the personal memories
triggered by videos are crucial for predicting viewers' emotional response to such
videos, in some cases even more so than the video's audiovisual content. In this
article, we explore the benefits of personal memories as context for facial behavior
analysis. We conduct a series of multimodal machine learning experiments combining
the automatic analysis of video-viewers' faces with that of two types of context information
for affective predictions: (1) self-reported free-text descriptions of triggered memories
and (2) a video's audiovisual content.
Our results demonstrate that both sources of context provide models
with information about variation in viewers' affective responses that complement facial
analysis and each other.
- Oswald Barral
- Sébastien Lallé
- Grigorii Guz
- Alireza Iranpour
- Cristina Conati
We leverage eye-tracking data to predict user performance and levels of cognitive
abilities while reading magazine-style narrative visualizations (MSNV), a widespread
form of multimodal documents that combine text and visualizations. Such predictions
are motivated by recent interest in devising user-adaptive MSNVs that can dynamically
adapt to a user's needs. Our results provide evidence for the feasibility of real-time
user modeling in MSNV, as we are the first to consider eye tracking data for predicting
task comprehension and cognitive abilities while processing multimodal documents.
We follow with a discussion on the implications to the design of personalized MSNVs.
- Lorcan Reidy
- Dennis Chan
- Charles Nduka
- Hatice Gunes
Cognitive training has shown promising results for delivering improvements in human
cognition related to attention, problem solving, reading comprehension and information
retrieval. However, two frequently cited problems in cognitive training literature
are a lack of user engagement with the training programme, and a failure of developed
skills to generalise to daily life. This paper introduces a new cognitive training
(CT) paradigm designed to address these two limitations by combining the benefits
of gamification, virtual reality (VR), and affective adaptation in the development
of an engaging, ecologically valid, CT task. Additionally, it incorporates facial
electromyography (EMG) as a means of determining user affect while engaged in the
CT task. This information is then utilised to dynamically adjust the game's difficulty
in real-time as users play, with the aim of leading them into a state of flow. Affect
recognition rates of 64.1% and 76.2%, for valence and arousal respectively, were achieved
by classifying a DWT-Haar approximation of the input signal using kNN. The affect-aware
VR cognitive training intervention was then evaluated with a control group of older
adults. The results obtained substantiate the notion that adaptation techniques can
lead to greater feelings of competence and a more appropriate challenge of the user's
skills.
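As an illustration of the signal pipeline mentioned (a Haar DWT approximation classified with kNN), the sketch below uses PyWavelets and scikit-learn on synthetic data; window length, decomposition level, and k are arbitrary choices.

```python
# Hedged sketch of the described pipeline: Haar discrete wavelet transform
# approximation of an EMG window, classified with kNN. Window length, DWT
# level, and k are illustrative choices, not the paper's.
import numpy as np
import pywt
from sklearn.neighbors import KNeighborsClassifier

def haar_approximation(signal, level=4):
    coeffs = pywt.wavedec(signal, "haar", level=level)
    return coeffs[0]                       # approximation coefficients

rng = np.random.default_rng(0)
X = np.stack([haar_approximation(rng.standard_normal(512)) for _ in range(40)])
y = rng.integers(0, 2, size=40)            # e.g., low vs. high valence

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict(X[:5]))
```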
- Anke van Oosterhout
- Miguel Bruns
- Eve Hoggan
In the last decade, haptic actuators have improved in quality and efficiency, enabling
easier implementation in user interfaces. One of the next steps towards a mature haptics
field is a larger and more diverse toolset that enables designers and novices to explore
the design and implementation of haptic feedback in their projects. In this paper,
we look at several design projects that utilize haptic force feedback to aid interaction
between the user and product. We analysed the process interaction designers went through
when developing their haptic user interfaces. Based on our insights, we identified
requirements for a haptic force feedback authoring tool. We discuss how these requirements
are addressed by 'Feelix', a tool that supports sketching and refinement of haptic
force feedback effects.
- Brennan Jones
- Jens Maiero
- Alireza Mogharrab
- Ivan A. Aguliar
- Ashu Adhikari
- Bernhard E. Riecke
- Ernst Kruijff
- Carman Neustaedter
- Robert W. Lindeman
Telepresence robots allow people to participate in remote spaces, yet they can be
difficult to manoeuvre with people and obstacles around. We designed a haptic-feedback
system called "FeetBack," which users place their feet in when driving a telepresence
robot. When the robot approaches people or obstacles, haptic proximity and collision
feedback are provided on the respective sides of the feet, helping inform users about
events that are hard to notice through the robot's camera views. We conducted two
studies: one to explore the usage of FeetBack in virtual environments, another focused
on real environments. We found that FeetBack can increase spatial presence in simple
virtual environments. Users valued the feedback to adjust their behaviour in both
types of environments, though it was sometimes too frequent or unneeded for certain
situations after a period of time. These results point to the value of foot-based
haptic feedback for telepresence robot systems, as well as to the need to design context-sensitive
haptic feedback.
- Brandon M. Booth
- Shrikanth S. Narayanan
Continuous human annotations of complex human experiences are essential for enabling
psychological and machine-learned inquiry into the human mind, but establishing a
reliable set of annotations for analysis and ground truth generation is difficult.
Measures of consensus or agreement are often used to establish the reliability of
a collection of annotations and thereby attest to their suitability for further research
and analysis. This work examines many of the commonly used agreement metrics for continuous-scale
and continuous-time human annotations and demonstrates their shortcomings, especially
in measuring agreement in general annotation shape and structure. Annotation quality
is carefully examined in a controlled study where the true target signal is known
and evidence is presented suggesting that annotators' perceptual distortions can be
modeled using monotonic functions. A novel measure of agreement is proposed which
is agnostic to these perceptual differences between annotators and provides unique
information when assessing agreement. We illustrate how this measure complements existing
agreement metrics and can serve as a tool for curating a reliable collection of human
annotations based on differential consensus.
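The paper's proposed agreement measure is not reproduced here; as a simpler illustration of agreement that ignores monotonic perceptual distortions, Spearman rank correlation between two annotation traces is unchanged by any strictly increasing remapping, as in the toy example below.

```python
# Toy illustration (not the authors' measure): rank correlation between two
# continuous annotation traces is invariant to monotonic distortions of the
# kind attributed to annotators' perception.
import numpy as np
from scipy.stats import spearmanr

t = np.linspace(0, 10, 200)
target = np.sin(t)                          # latent signal being annotated
annotator_a = target + 0.05 * np.random.randn(200)
annotator_b = np.tanh(2 * target)           # monotonically distorted view

rho, _ = spearmanr(annotator_a, annotator_b)
print(f"rank agreement: {rho:.3f}")         # stays high despite distortion
```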
- Aishat Aloba
- Julia Woodward
- Lisa Anthony
Classification accuracy of whole-body gestures can be improved by selecting gestures
that have few conflicts (i.e., confusions or misclassifications). To identify such
gestures, an understanding of the nuances of how users articulate whole-body gestures
can help, especially when conflicts may be due to confusion among seemingly dissimilar
gestures. To the best of our knowledge, such an understanding is currently missing
in the literature. As a first step to enable this understanding, we designed a method
that facilitates investigation of variations in how users move their body parts as
they perform a motion. This method, which we call filterJoint, selects the key body
parts that are actively moving during the performance of a motion. The paths along
which these body parts move in space over time can then be analyzed to make inferences
about how users articulate whole-body gestures. We present two case studies to show
how the filterJoint method enables a deeper understanding of whole-body gesture articulation,
and we highlight implications for the selection of whole-body gesture sets as a result
of these insights.
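A toy sketch of the underlying idea of filterJoint as described, ranking joints by total displacement and keeping the most active ones; the displacement measure and keep ratio are assumptions.

```python
# Toy sketch of the idea behind filterJoint: rank body parts by how much
# they move during a gesture and keep the most active ones. Threshold and
# displacement measure are assumptions for illustration.
import numpy as np

def active_joints(positions: np.ndarray, keep_ratio: float = 0.3):
    """positions: (n_frames, n_joints, 3) joint trajectories."""
    displacement = np.linalg.norm(np.diff(positions, axis=0), axis=2).sum(axis=0)
    n_keep = max(1, int(keep_ratio * positions.shape[1]))
    return np.argsort(displacement)[::-1][:n_keep]   # most active joint indices

motion = np.random.rand(120, 20, 3)                  # 120 frames, 20 joints
print(active_joints(motion))
```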
- Chris Zimmerer
- Erik Wolf
- Sara Wolf
- Martin Fischbach
- Jean-Luc Lugrin
- Marc Erich Latoschik
Multimodal Interfaces (MMIs) have been considered to provide promising interaction
paradigms for Virtual Reality (VR) for some time. However, they are still far less
common than unimodal interfaces (UMIs). This paper presents a summative user study
comparing an MMI to a typical UMI for a design task in VR. We developed an application
targeting creative 3D object manipulations, i.e., creating 3D objects and modifying
typical object properties such as color or size. The associated open user task is
based on the Torrance Tests of Creative Thinking. We compared a synergistic multimodal
interface using speech-accompanied pointing/grabbing gestures with a more typical
unimodal interface using a hierarchical radial menu to trigger actions on selected
objects. Independent judges rated the creativity of the resulting products using the
Consensual Assessment Technique. Additionally, we measured the creativity-promoting
factors flow, usability, and presence. Our results show that the MMI performs on par
with the UMI in all measurements despite its limited flexibility and reliability.
These promising results demonstrate the technological maturity of MMIs and their potential
to extend traditional interaction techniques in VR efficiently.
- Lik Hang Lee
- Ngo Yan Yeung
- Tristan Braud
- Tong Li
- Xiang Su
- Pan Hui
Smartwatches and other wearables are characterized by small-scale touchscreens that
complicate the interaction with content. In this paper, we present Force9, the first
optimized miniature keyboard leveraging force-sensitive touchscreens on wrist-worn
computers. Force9 enables character selection in an ambiguous layout by analyzing
the trade-off between interaction space and the easiness of force-assisted interaction.
We argue that dividing the screen's pressure range into three contiguous force levels
is sufficient to differentiate characters for fast and accurate text input. Our pilot
study captures and calibrates the ability of users to perform force-assisted touches
on miniature-sized keys on touchscreen devices. We then optimize the keyboard layout
considering the goodness of character pairs (with regards to the selected English
corpus) under the force-based configuration and the users' familiarity with the QWERTY
layout. We finally evaluate the performance of the trimetric optimized Force9 layout,
and achieve an average of 10.18 WPM by the end of the final session. Compared to the
other state-of-the-art approaches, Force9 allows for single-gesture character selection
without addendum sensors.
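As a toy illustration of the core mechanism, the snippet below quantizes a normalized pressure reading into three contiguous force levels that disambiguate the characters sharing a key; the thresholds are hypothetical and would come from per-user calibration.

```python
# Toy sketch: quantize a touchscreen pressure reading into three contiguous
# force levels used to disambiguate characters on a shared key.
# Thresholds are hypothetical (per-user calibration would set them).
def force_level(pressure: float, light_max: float = 0.33, medium_max: float = 0.66) -> int:
    """Return 0 (light), 1 (medium), or 2 (firm) for a normalized pressure in [0, 1]."""
    if pressure <= light_max:
        return 0
    if pressure <= medium_max:
        return 1
    return 2

# e.g., a key carrying 'a', 'b', 'c' resolves to one character per level
key_chars = "abc"
print(key_chars[force_level(0.72)])   # -> 'c'
```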
- Taras Kucherenko
- Patrik Jonell
- Sanne van Waveren
- Gustav Eje Henter
- Simon Alexandersson
- Iolanda Leite
- Hedvig Kjellström
During speech, people spontaneously gesticulate, which plays a key role in conveying
information. Similarly, realistic co-speech gestures are crucial to enable natural
and smooth interactions with social agents. Current end-to-end co-speech gesture generation
systems use a single modality for representing speech: either audio or text. These
systems are therefore confined to producing either acoustically-linked beat gestures
or semantically-linked gesticulation (e.g., raising a hand when saying "high"): they
cannot appropriately learn to generate both gesture types. We present a model designed
to produce arbitrary beat and semantic gestures together. Our deep-learning based
model takes both acoustic and semantic representations of speech as input, and generates
gestures as a sequence of joint angle rotations as output. The resulting gestures
can be applied to both virtual agents and humanoid robots. Subjective and objective
evaluations confirm the success of our approach. The code and video are available
at the project page svito-zar.github.io/gesticulator .
- Dulanga Weerakoon
- Vigneshwaran Subbaraju
- Nipuni Karumpulli
- Tuan Tran
- Qianli Xu
- U-Xuan Tan
- Joo Hwee Lim
- Archan Misra
This work demonstrates the feasibility and benefits of using pointing gestures, a
naturally-generated additional input modality, to improve the multi-modal comprehension
accuracy of human instructions to robotic agents for collaborative tasks. We present
M2Gestic, a system that combines neural-based text parsing with a novel knowledge-graph
traversal mechanism, over a multi-modal input of vision, natural language text and
pointing. Via multiple studies related to a benchmark table top manipulation task,
we show that (a) M2Gestic can achieve close-to-human performance in reasoning over
unambiguous verbal instructions, and (b) incorporating pointing input (even with its
inherent location uncertainty) in M2Gestic results in a significant (30%) accuracy
improvement when verbal instructions are ambiguous.
- Angela Vujic
- Stephanie Tong
- Rosalind Picard
- Pattie Maes
A hard challenge for wearable systems is to measure differences in emotional valence,
i.e., positive and negative affect, via physiology. However, the stomach or gastric
signal is an unexplored modality that could offer new affective information. We created
a wearable device and software to record gastric signals, known as electrogastrography
(EGG). An in-laboratory study was conducted to compare EGG with electrodermal activity
(EDA) in 33 individuals viewing affective stimuli. We found that negative stimuli
attenuate EGG's indicators of parasympathetic activation, or "rest and digest" activity.
We compare EGG to the remaining physiological signals and describe implications for
affect detection. Further, we introduce how wearable EGG may support future applications
in areas as diverse as reducing nausea in virtual reality and helping treat emotion-related
eating disorders.
- Jun Wang
- Grace Ngai
- Hong Va Leong
The task of summarizing a document is a complex task that requires a person to multitask
between reading and writing processes. Since a person's cognitive load during reading
or writing is known to be dependent upon the level of comprehension or difficulty
of the article, this suggests that it should be possible to analyze the cognitive
process of the user when carrying out the task, as evidenced through their eye gaze
and typing features, to obtain an insight into the different difficulty levels. In
this paper, we categorize the summary writing process into different phases and extract
different gaze and typing features from each phase according to characteristics of
eye-gaze behaviors and typing dynamics. Combining these multimodal features, we build
a classifier that achieves an accuracy of 91.0% for difficulty level detection, which
is around 55% performance improvement above the baseline and at least 15% improvement
above models built on a single modality. We also investigate the possible reasons
for the superior performance of our multimodal features.
- Lian Beenhakker
- Fahim Salim
- Dees Postma
- Robby van Delden
- Dennis Reidsma
- Bert-Jan van Beijnum
In Human Behaviour Understanding, social interaction is often modeled on the basis
of lower level action recognition. The accuracy of this recognition has an impact
on the system's capability to detect the higher level social events, and thus on the
usefulness of the resulting system. We model team interactions in volleyball and investigate,
through simulation of typical error patterns, how one can consider the required quality
(in accuracy and in allowable types of errors) of the underlying action recognition
for automated volleyball monitoring. Our proposed approach simulates different patterns
of errors, grounded in related work in volleyball action recognition, on top of a
manually annotated ground truth to model their different impact on the interaction
recognition. Our results show that this can provide a means to quantify the effect
of different types of classification errors on the overall quality of the system. Our
chosen volleyball use case, in the rising field of sports monitoring, also addresses
specific team related challenges in such a system and how these can be visualized
to grasp the interdependencies. In our use case the first layer of our system classifies
actions of individual players and the second layer recognizes multiplayer exercises
and complexes (i.e. sequences in rallies) to enhance training. The experiments performed
for this study investigated how errors at the action recognition layer propagate and
cause errors at the complexes layer. We discuss the strengths and weaknesses of the
layered system to model volleyball rallies. We also give indications regarding what
kind of errors are causing more problems and what choices can follow from them. In
our given context we suggest that for recognition of non-Freeball actions (e.g. smash,
block) it is more important to achieve a higher accuracy, which can be done at the
cost of accuracy of classification of Freeball actions (which are mostly plays between
team members and are more interchangeable as to their role in the complexes).
- Lauren Klein
- Victor Ardulov
- Yuhua Hu
- Mohammad Soleymani
- Alma Gharib
- Barbara Thompson
- Pat Levitt
- Maja J. Matarić
Interactions between infants and their mothers can provide meaningful insight into
the dyad's health and well-being. Previous work has shown that infant-mother coordination,
within a single modality, varies significantly with age and interaction quality. However,
as infants are still developing their motor, language, and social skills, they may
differ from their mothers in the modes they use to communicate. This work examines
how infant-mother coordination across modalities can expand researchers' abilities
to observe meaningful trends in infant-mother interactions. Using automated feature
extraction tools, we analyzed the head position, arm position, and vocal fundamental
frequency of mothers and their infants during the Face-to-Face Still-Face (FFSF) procedure.
A de-identified dataset including these features was made available online as a contribution
of this work. Analysis of infant behavior over the course of the FFSF indicated that
the amount and modality of infant behavior change evolves with age. Evaluating the
interaction dynamics, we found that infant and mother behavioral signals are coordinated
both within and across modalities, and that levels of both intramodal and intermodal
coordination vary significantly with age and across stages of the FFSF. These results
support the significance of intermodal coordination when assessing changes in infant-mother
interaction across conditions.
- Nimesha Ranasinghe
- Meetha Nesam James
- Michael Gecawicz
- Jonathan Bland
- David Smith
Little is known about the influence of various sensory modalities, such as taste, smell,
color, and thermal stimuli, on the perception of simulated flavor sensations, let alone their
influence on people's emotions and liking. Although flavor sensations are essential
in our daily experiences and closely associated with our memories and emotions, the
concept of flavor and the emotions caused by different sensory modalities are not
thoroughly integrated into Virtual and Augmented Reality technologies. Hence, this
paper presents 1) an interactive technology to simulate different flavor sensations
by overlaying taste (via electrical stimulation on the tongue), smell (via micro air
pumps), color (via RGB Lights), and thermal (via Peltier elements) sensations on plain
water, and 2) a set of experiments to investigate a) the influence of different sensory
modalities on the perception and liking of virtual flavors and b) varying emotions
mediated through virtual flavor sensations. Our findings reveal that the participants
perceived and liked various stimuli configurations and mostly associated them with
positive emotions while highlighting important avenues for future research.
- Leena Mathur
- Maja J. Matarić
Automated deception detection systems can enhance health, justice, and security in
society by helping humans detect deceivers in high-stakes situations across medical
and legal domains, among others. Existing machine learning approaches for deception
detection have not leveraged dimensional representations of facial affect: valence
and arousal. This paper presents a novel analysis of the discriminative power of facial
affect for automated deception detection, along with interpretable features from visual,
vocal, and verbal modalities. We used a video dataset of people communicating truthfully
or deceptively in real-world, high-stakes courtroom situations. We leveraged recent
advances in automated emotion recognition in-the-wild by implementing a state-of-the-art
deep neural network trained on the Aff-Wild database to extract continuous representations
of facial valence and facial arousal from speakers. We experimented with unimodal
Support Vector Machines (SVM) and SVM-based multimodal fusion methods to identify
effective features, modalities, and modeling approaches for detecting deception. Unimodal
models trained on facial affect achieved an AUC of 80%, and facial affect contributed
towards the highest-performing multimodal approach (adaptive boosting) that achieved
an AUC of 91% when tested on speakers who were not part of training sets. This approach
achieved a higher AUC than existing automated machine learning approaches that used
interpretable visual, vocal, and verbal features to detect deception in this dataset,
but did not use facial affect. Across all videos, deceptive and truthful speakers
exhibited significant differences in facial valence and facial arousal, contributing
computational support to existing psychological theories on relationships between
affect and deception. The demonstrated importance of facial affect in our models informs
and motivates the future development of automated, affect-aware machine learning approaches
for modeling and detecting deception and other social behaviors in-the-wild.
- Shun Katada
- Shogo Okada
- Yuki Hirano
- Kazunori Komatani
In human-agent interactions, it is necessary for the systems to identify the current
emotional state of the user to adapt their dialogue strategies. Nevertheless, this
task is challenging because the current emotional states are not always expressed
in a natural setting and change dynamically. Recent accumulated evidence has indicated
the usefulness of physiological modalities to realize emotion recognition. However,
the contribution of the time series physiological signals in human-agent interaction
during a dialogue has not been extensively investigated. This paper presents a machine
learning model based on physiological signals to estimate a user's sentiment at every
exchange during a dialogue. Using a wearable sensing device, the time series physiological
data including the electrodermal activity (EDA) and heart rate in addition to acoustic
and visual information during a dialogue were collected. The sentiment labels were
annotated by the participants themselves and by external human coders for each exchange
consisting of a pair of system and participant utterances. The experimental results
showed that a multimodal deep neural network (DNN) model combined with the EDA and
visual features achieved an accuracy of 63.2%. In general, this task is challenging,
as indicated by the accuracy of 63.0% attained by the external coders. The analysis
of the sentiment estimation results for each individual indicated that the human coders
often wrongly estimated the negative sentiment labels, and in this case, the performance
of the DNN model was higher than that of the human coders. These results indicate
that physiological signals can help in detecting the implicit aspects of negative
sentiments, which are acoustically/visually indistinguishable.
- Koji Inoue
- Kohei Hara
- Divesh Lala
- Kenta Yamamoto
- Shizuka Nakamura
- Katsuya Takanashi
- Tatsuya Kawahara
A job interview is a domain that takes advantage of an android robot's human-like
appearance and behaviors. In this work, our goal is to implement a system in which
an android plays the role of an interviewer so that users may practice for a real
job interview. Our proposed system generates elaborate follow-up questions based on
responses from the interviewee. We conducted an interactive experiment to compare
the proposed system against a baseline system that asked only fixed-form questions.
We found that this system was significantly better than the baseline system with respect
to the impression of the interview and the quality of the questions, and that the
presence of the android interviewer was enhanced by the follow-up questions. We also
found a similar result when using a virtual agent interviewer, except that presence
was not enhanced.
- Soumyajit Chatterjee
- Avijoy Chakma
- Aryya Gangopadhyay
- Nirmalya Roy
- Bivas Mitra
- Sandip Chakraborty
Annotated IMU sensor data from smart devices and wearables are essential for developing
supervised models for fine-grained human activity recognition, although generating sufficient
annotated data for diverse human activities in different environments is challenging.
Existing approaches primarily use human-in-the-loop based techniques, including active
learning; however, they are tedious, costly, and time-consuming. Leveraging the availability
of acoustic data from embedded microphones over the data collection devices, in this
paper, we propose LASO, a multimodal approach for automated data annotation from acoustic
and locomotive information. LASO works over the edge device itself, ensuring that
only the annotated IMU data is collected, discarding the acoustic data from the device
itself, hence preserving the audio-privacy of the user. In the absence of any pre-existing
labeling information, such an auto-annotation is challenging as the IMU data needs
to be sessionized for different time-scaled activities in a completely unsupervised
manner. We use a change-point detection technique while synchronizing the locomotive
information from the IMU data with the acoustic data, and then use pre-trained audio-based
activity recognition models for labeling the IMU data while handling the acoustic
noises. LASO efficiently annotates IMU data, without any explicit human intervention,
with a mean accuracy of $0.93$ ($\pm 0.04$) and $0.78$ ($\pm 0.05$) for two different
real-life datasets from workshop and kitchen environments, respectively.
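A hedged sketch of sessionizing IMU data with an off-the-shelf change-point detector; the ruptures package used here is an illustrative choice, not necessarily what LASO employs.

```python
# Hedged sketch: segment accelerometer magnitude into activity sessions with
# an off-the-shelf change-point detector (the `ruptures` package is an
# illustrative choice, not necessarily LASO's). Sessions can then be matched
# to audio-derived activity labels.
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(1)
accel_mag = np.concatenate([rng.normal(1.0, 0.05, 300),    # idle
                            rng.normal(1.6, 0.20, 300),    # activity A
                            rng.normal(1.1, 0.10, 300)])   # activity B

algo = rpt.Pelt(model="rbf").fit(accel_mag.reshape(-1, 1))
boundaries = algo.predict(pen=10)          # indices where sessions end
print(boundaries)                          # e.g., [300, 600, 900]
```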
- Yanan Wang
- Jianming Wu
- Jinfa Huang
- Gen Hattori
- Yasuhiro Takishima
- Shinya Wada
- Rui Kimura
- Jie Chen
- Satoshi Kurihara
Group cohesiveness reflects the level of intimacy that people feel with each other,
and the development of a dialogue robot that can understand group cohesiveness will
lead to the promotion of human communication. However, group cohesiveness is a complex
concept that is difficult to predict based only on image pixels. Inspired by the fact
that humans intuitively associate linguistic knowledge accumulated in the brain with
the visual images they see, we propose a linguistic knowledge injectable deep neural
network (LDNN) that builds a visual model (visual LDNN) for predicting group cohesiveness
that can automatically associate the linguistic knowledge hidden behind images. LDNN
consists of a visual encoder and a language encoder, and applies domain adaptation
and linguistic knowledge transition mechanisms to transform linguistic knowledge from
a language model to the visual LDNN. We train LDNN by adding descriptions to the training
and validation sets of the Group AFfect Dataset 3.0 (GAF 3.0), and test the visual
LDNN without any description. Comparing visual LDNN with various fine-tuned DNN models
and three state-of-the-art models in the test set, the results demonstrate that the
visual LDNN not only improves the performance of the fine-tuned DNN model leading
to an MSE very similar to the state-of-the-art model, but is also a practical and
efficient method that requires relatively little preprocessing. Furthermore, ablation
studies confirm that LDNN is an effective method to inject linguistic knowledge into
visual models.
- Riku Arakawa
- Hiromu Yakura
Humans are known to have a better subconscious impression of other humans when their
movements are imitated in social interactions. Despite this influential phenomenon,
its application in human-computer interaction is currently limited to specific areas,
such as an agent mimicking the head movements of a user in virtual reality, because
capturing user movements conventionally requires external sensors. If we can implement
the mimicry effect in a scalable platform without such sensors, a new approach for
designing human-computer interaction will be introduced. Therefore, we have investigated
whether users feel positively toward a mimicking agent that is delivered by a standalone
web application using only a webcam. We also examined whether a web page that changes
its background pattern based on head movements can foster a favorable impression.
The positive effect confirmed in our experiments supports mimicry as a novel design
practice to augment our daily browsing experiences.
- Shen Yan
- Di Huang
- Mohammad Soleymani
As algorithmic decision-making systems are increasingly used in high-stakes scenarios,
concerns have risen about the potential unfairness of these decisions to certain social
groups. Despite its importance, the bias and fairness of multimodal systems are not
thoroughly studied. In this work, we focus on the multimodal systems designed for
apparent personality assessment and hirability prediction. We use the First Impression
dataset as a case study to investigate the biases in such systems. We provide detailed
analyses on the biases from different modalities and data fusion strategies. Our analyses
reveal that different modalities show various patterns of biases and that the data fusion
process also introduces additional biases to the model. To mitigate the biases, we develop
and evaluate two different debiasing approaches based on data balancing and adversarial
learning. Experimental results show that both approaches can reduce the biases in
model outcomes without sacrificing much performance. Our debiasing strategies can
be deployed in real-world multimodal systems to provide fairer outcomes.
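One of the two mitigation routes named above, data balancing, can be illustrated with a simple group-aware resampling pass. This is a generic recipe sketched under assumptions (equal-sized label-by-group cells, oversampling with replacement), not the authors' exact strategy.

```python
import numpy as np

def balance_by_group(features, labels, groups, rng=None):
    """Oversample so each (label, protected-group) cell is equally represented.

    A generic data-balancing recipe of the kind the abstract refers to;
    the paper's own balancing strategy may differ in detail.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    cells = {}
    for i, key in enumerate(zip(labels, groups)):
        cells.setdefault(key, []).append(i)
    target = max(len(idx) for idx in cells.values())
    chosen = []
    for idx in cells.values():
        chosen.extend(rng.choice(idx, size=target, replace=True))
    chosen = np.array(chosen)
    return features[chosen], labels[chosen], groups[chosen]

# Usage with synthetic data and a hypothetical binary protected attribute.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
g = np.random.randint(0, 2, 100)
Xb, yb, gb = balance_by_group(X, y, g)
```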
- Sarah Morrison-Smith
- Aishat Aloba
- Hangwei Lu
- Brett Benda
- Shaghayegh Esmaeili
- Gianne Flores
- Jesse Smith
- Nikita Soni
- Isaac Wang
- Rejin Joy
- Damon L. Woodard
- Jaime Ruiz
- Lisa Anthony
The future of smart environments is likely to involve both passive and active interactions
on the part of users. Depending on what sensors are available in the space, users
may make use of multimodal interaction modalities such as hand gestures or voice commands.
There is a shortage of robust yet controlled multimodal interaction datasets for smart
environment applications. One application domain of interest based on current state-of-the-art
is authentication for sensitive or private tasks, such as banking and email. We present
a novel, large multimodal dataset for authentication interactions in both gesture
and voice, collected from 106 volunteers who each performed 10 examples of each of
a set of hand gesture and spoken voice commands chosen from prior literature (10,600
gesture samples and 13,780 voice samples). We present the data collection method,
raw data and common features extracted, and a case study illustrating how this dataset
could be useful to researchers. Our goal is to provide a benchmark dataset for testing
future multimodal authentication solutions, enabling comparison across approaches.
- Ahmed Hussen Abdelaziz
- Barry-John Theobald
- Paul Dixon
- Reinhard Knothe
- Nicholas Apostoloff
- Sachin Kajareker
We describe our novel deep learning approach for driving animated faces using both
acoustic and visual information. In particular, speech-related facial movements are
generated using audiovisual information, and non-verbal facial movements are generated
using only visual information. To ensure that our model exploits both modalities during
training, batches are generated that contain audio-only, video-only, and audiovisual
input features. The probability of dropping a modality allows control over the degree
to which the model exploits audio and visual information during training. Our trained
model runs in real time on resource-limited hardware (e.g., a smartphone), is user
agnostic, and does not depend on a potentially error-prone transcription of the
speech. We use subjective testing to demonstrate: 1) the improvement of audiovisual-driven
animation over the equivalent video-only approach, and 2) the improvement in the animation
of speech-related facial movements after introducing modality dropout. Without modality
dropout, viewers prefer audiovisual-driven animation in 51% of the test sequences
compared with only 18% for video-driven animation. After introducing dropout, viewer preference
for audiovisual-driven animation increases to 74%, while preference for video-only animation decreases to 8%.
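The modality-dropout idea can be sketched as a per-example masking step at batch construction time. The probabilities, feature dimensions, and zero-masking convention below are assumptions for illustration, not the paper's implementation.

```python
import torch

def apply_modality_dropout(audio_feats, video_feats, p_drop_audio=0.3, p_drop_video=0.3):
    """Randomly silence one modality per example so the model learns to cope
    with audio-only, video-only, and audiovisual inputs.

    The drop probabilities are illustrative; they play the role of the knob
    that controls how much each modality is exploited during training.
    """
    batch = audio_feats.size(0)
    drop_audio = torch.rand(batch) < p_drop_audio
    drop_video = torch.rand(batch) < p_drop_video
    # Never drop both modalities for the same example.
    both = drop_audio & drop_video
    drop_video[both] = False
    audio_out, video_out = audio_feats.clone(), video_feats.clone()
    audio_out[drop_audio] = 0.0
    video_out[drop_video] = 0.0
    return audio_out, video_out

audio = torch.randn(8, 40)   # e.g., 40-dim acoustic features (assumed)
video = torch.randn(8, 128)  # e.g., 128-dim visual features (assumed)
a, v = apply_modality_dropout(audio, video)
```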
- Yiqun Yao
- Verónica Pérez-Rosas
- Mohamed Abouelenien
- Mihai Burzo
Multimodal sentiment analysis aims to detect and classify sentiment expressed in multimodal
data. Research to date has focused on datasets with a large number of training samples,
manual transcriptions, and nearly-balanced sentiment labels. However, data collection
in real settings often leads to small datasets with noisy transcriptions and imbalanced
label distributions, which are therefore significantly more challenging than in controlled
settings. In this work, we introduce MORSE, a domain-specific dataset for MultimOdal
sentiment analysis in Real-life SEttings. The dataset consists of 2,787 video clips
extracted from 49 interviews with panelists in a product usage study, with each clip
annotated for positive, negative, or neutral sentiment. The characteristics of MORSE
include noisy transcriptions from raw videos, naturally imbalanced label distribution,
and scarcity of minority labels. To address the challenging real-life settings in
MORSE, we propose a novel two-step fine-tuning method for multimodal sentiment classification
using transfer learning and the Transformer model architecture; our method starts
with a pre-trained language model and one step of fine-tuning on the language modality,
followed by the second step of joint fine-tuning that incorporates the visual and
audio modalities. Experimental results show that while MORSE is challenging for various
baseline models such as SVM and Transformer, our two-step fine-tuning method is able
to capture the dataset characteristics and effectively address the challenges. Our
method outperforms related work that uses both single and multiple modalities in the
same transfer learning settings.
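The two-step recipe can be pictured with a toy late-fusion module in which an optimizer first covers only the language branch and then all parameters. The encoder, dimensions, and fusion layout here are placeholder assumptions rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy two-step fine-tuning setup: a text encoder is tuned first on the
    language modality, then jointly tuned with an audio/visual branch."""

    def __init__(self, text_encoder, text_dim=768, av_dim=128, n_classes=3):
        super().__init__()
        self.text_encoder = text_encoder          # stand-in for a pre-trained LM
        self.text_head = nn.Linear(text_dim, n_classes)
        self.av_proj = nn.Linear(av_dim, text_dim)
        self.joint_head = nn.Linear(2 * text_dim, n_classes)

    def forward_text(self, text_emb):
        return self.text_head(self.text_encoder(text_emb))

    def forward_joint(self, text_emb, av_feats):
        t = self.text_encoder(text_emb)
        av = torch.relu(self.av_proj(av_feats))
        return self.joint_head(torch.cat([t, av], dim=-1))

# Step 1: fine-tune on language only; Step 2: joint fine-tuning of all branches.
model = LateFusionClassifier(text_encoder=nn.Identity())
step1_params = list(model.text_encoder.parameters()) + list(model.text_head.parameters())
step2_params = model.parameters()
logits = model.forward_joint(torch.randn(4, 768), torch.randn(4, 128))
```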
- Andrea Vidal
- Ali Salman
- Wei-Cheng Lin
- Carlos Busso
Expressive behaviors conveyed during daily interactions are difficult to determine,
because they often consist of a blend of different emotions. The complexity of expressive
human communication is an important challenge for building and evaluating automatic systems
that can reliably predict emotions. Emotion recognition systems are often trained
with limited databases, where the emotions are either elicited or recorded by actors.
These approaches do not necessarily reflect real emotions, creating a mismatch when
the same emotion recognition systems are applied to practical applications. Developing
rich emotional databases that reflect the complexity in the externalization of emotion
is an important step to build better models to recognize emotions. This study presents
the MSP-Face database, a natural audiovisual database obtained from video-sharing
websites, where multiple individuals discuss various topics expressing their opinions
and experiences. The natural recordings convey a broad range of emotions that are
difficult to obtain with alternative data collection protocols. A key feature of
the corpus is that it comprises two sets. The first set includes videos that have been
annotated with emotional labels using a crowd-sourcing protocol (9,370 recordings;
24 hrs, 41 min). The second set includes similar videos without emotional labels
(17,955 recordings; 45 hrs, 57 min), offering an ideal infrastructure to explore
semi-supervised and unsupervised machine-learning algorithms on natural emotional
videos. This study describes the process of collecting and annotating the corpus.
It also provides baselines over this new database using unimodal (audio, video) and
multimodal emotion recognition systems.
- Leili Tavabi
- Kalin Stefanov
- Larry Zhang
- Brian Borsari
- Joshua D. Woolley
- Stefan Scherer
- Mohammad Soleymani
Motivational Interviewing (MI) is defined as a collaborative conversation style that
evokes the client's own intrinsic reasons for behavioral change. In MI research, the
clients' attitude (willingness or resistance) toward change as expressed through language,
has been identified as an important indicator of their subsequent behavior change.
Automated coding of these indicators provides systematic and efficient means for the
analysis and assessment of MI therapy sessions. In this paper, we study and analyze
behavioral cues in client language and speech that bear indications of the client's
behavior toward change during a therapy session, using a database of dyadic motivational
interviews between therapists and clients with alcohol-related problems. Deep language
and voice encoders, i.e., BERT and VGGish, trained on large amounts of data are used
to extract features from each utterance. We develop a neural network to automatically
detect the MI codes using both the clients' and therapists' language and clients'
voice, and demonstrate the importance of semantic context in such detection. Additionally,
we develop machine learning models for predicting alcohol-use behavioral outcomes
of clients through language and voice analysis. Our analysis demonstrates that we
are able to estimate MI codes using clients' textual utterances along with preceding
textual context from both the therapist and client, reaching an F1-score of 0.72 for
a speaker-independent three-class classification. We also report initial results for
using the clients' data for predicting behavioral outcomes, which outlines the direction
for future work.
- Cong Bao
- Zafeirios Fountas
- Temitayo Olugbade
- Nadia Bianchi-Berthouze
We propose a novel neural network architecture, named the Global Workspace Network
(GWN), which addresses the challenge of dynamic and unspecified uncertainties in multimodal
data fusion. Our GWN is a model of attention across modalities and evolving through
time, and is inspired by the well-established Global Workspace Theory from the field
of cognitive science. The GWN achieved an average F1 score of 0.92 for discrimination
between pain patients and healthy participants and an average F1 score of 0.75 for further
classification of three pain levels for a patient, both based on the multimodal EmoPain
dataset captured from people with chronic pain and healthy people performing different
types of exercise movements in unconstrained settings. In these tasks, the GWN significantly
outperforms the typical fusion approach of merging by concatenation. We further provide
extensive analysis of the behaviour of the GWN and its ability to address uncertainties
(hidden noise) in multimodal data.
- Shree Krishna Subburaj
- Angela E.B. Stewart
- Arjun Ramesh Rao
- Sidney K. D'Mello
Modeling team phenomena from multiparty interactions inherently requires combining
signals from multiple teammates, often by weighting strategies. Here, we explored
the hypothesis that strategically weighting signals from individual teammates would outperform
an equal-weighting baseline. Accordingly, we explored role-, trait-, and behavior-based
weighting of behavioral signals across team members. We analyzed data from 101 triads
engaged in computer-mediated collaborative problem solving (CPS) in an educational
physics game. We investigated the accuracy of machine-learned models trained on facial
expressions, acoustic-prosodics, eye gaze, and task context information, computed
one-minute prior to the end of a game level, at predicting success at solving that
level. AUROCs for unimodal models that equally weighted features from the three teammates
ranged from .54 to .67, whereas a combination of gaze, face, and task context features
achieved an AUROC of .73. The various multiparty weighting strategies did not outperform
an equal-weighting baseline. However, our best nonverbal model (AUROC = .73) outperformed
a language-based model (AUROC = .67), and there were some advantages to combining
the two (AUROC = .75). Finally, models aimed at prospectively predicting performance
on a minute-by-minute basis from the start of the level achieved a lower, but still
above-chance, AUROC of .60. We discuss implications for multiparty modeling of team
performance and other team constructs.
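The contrast between equal weighting and role-, trait-, or behavior-based weighting of teammate signals amounts to a weighted average over per-member feature vectors. The sketch below uses hypothetical talk-time weights purely to illustrate the comparison.

```python
import numpy as np

def fuse_team_features(member_feats, weights=None):
    """Combine per-teammate feature vectors into one team-level vector.

    member_feats: array of shape (n_members, n_features).
    weights: optional per-member weights (e.g., role- or behavior-based);
             defaults to the equal-weighting baseline.
    """
    member_feats = np.asarray(member_feats, dtype=float)
    if weights is None:
        weights = np.full(member_feats.shape[0], 1.0 / member_feats.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ member_feats

triad = np.random.rand(3, 10)
equal = fuse_team_features(triad)                 # equal-weighting baseline
talk_time = np.array([0.5, 0.3, 0.2])             # hypothetical behavior-based weights
behavior_weighted = fuse_team_features(triad, talk_time)
```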
- Ilhan Aslan
- Andreas Seiderer
- Chi Tai Dang
- Simon Rädler
- Elisabeth André
A human's heart beating can be sensed by sensors and displayed for others to see,
hear, feel, and potentially "resonate" with. Previous work studying interaction
designs with physiological data, such as a heart's pulse rate, has argued that feeding
it back to users may, for example, support users' mindfulness and self-awareness
during various everyday activities and ultimately support their health and wellbeing.
Inspired by Somaesthetics as a discipline, we designed and explored multimodal displays,
which enable experiencing heart beats as natural stimuli from oneself and others in
social proximity. In this paper, we report on the design process of PiHearts
and present qualitative results of a field study with 30 pairs of participants. Participants
were asked to use PiHearts while watching short movies together and to report their
perceived experience under three different display conditions.
We found, for example, that participants reported significant effects in experiencing
sensory immersion when they received their own heart beats as stimuli compared to
the condition without any heart beat display, and that feeling their partner's heart
beats resulted in significant effects on social experience. We refer to resonance
theory to motivate and discuss the results, highlighting the potential of how digitalization
of heart beats as rhythmic natural stimuli may provide resonance in a modern society
facing social acceleration.
- Yi Ding
- Radha Kumaran
- Tianjiao Yang
- Tobias Höllerer
Curating large and high-quality datasets for studying affect is a costly and
time-consuming process, especially when the labels are continuous. In this paper, we examine
the potential to use unlabeled public reactions in the form of textual comments to
aid in classifying video affect. We examine two popular datasets used for affect recognition
and mine public reactions for these videos. We learn a representation of these reactions
by using the video ratings as a weakly supervised signal. We show that our model can
learn a fine-grained prediction of comment affect when given a video alone. Furthermore,
we demonstrate how predicting the affective properties of a comment can be a potentially
useful modality to use in multimodal affect modeling.
- Vansh Narula
- Kexin Feng
- Theodora Chaspari
The large amount of data captured by ambulatory sensing devices can afford us insights
into longitudinal behavioral patterns, which can be linked to emotional, psychological,
and cognitive outcomes. Yet, the sensitivity of behavioral data, which regularly involve
speech signals and facial images, can cause strong privacy concerns, such as the leaking
of the user identity. We examine the interplay between emotion-specific and user identity-specific
information in image-based emotion recognition systems. We further study a user anonymization
approach that preserves emotion-specific information, but eliminates user-dependent
information from the convolutional kernel of convolutional neural networks (CNN),
therefore reducing user re-identification risks. We formulate an adversarial learning
problem implemented with a multitask CNN that minimizes the emotion classification loss and
maximizes the user identification loss. The proposed system is evaluated on three datasets
achieving moderate to high emotion recognition and poor user identity recognition
performance. The resulting image transformation obtained by the convolutional layer
is visually inspected, attesting to the efficacy of the proposed system in preserving
emotion-specific information. Implications from this study can inform the design of
privacy-aware emotion recognition systems that preserve facets of human behavior,
while concealing the identity of the user, and can be used in ambulatory monitoring
applications related to health, well-being, and education.
- Patrizia Di Campli San Vito
- Stephen Brewster
- Frank Pollick
- Simon Thompson
- Lee Skrypchuk
- Alexandros Mouzakitis
Haptic feedback can improve safety and driving behaviour. While vibration has been
widely studied, other haptic modalities have been neglected. To address this, we present
two studies investigating the use of uni- and bimodal vibrotactile and thermal cues
on the steering wheel. First, notifications with three levels of urgency were subjectively
rated and then identified during simulated driving. Bimodal feedback showed an increased
identification time over unimodal vibrotactile cues. Thermal feedback was consistently
rated less urgent, showing its suitability for less time critical notifications, where
vibration would be unnecessarily attention-grabbing. The second study investigated
more complex thermal and bimodal haptic notifications comprising two different types
of information (the nature and importance of an incoming message). Results showed that both
modalities could be identified with high recognition rates of up to 92% for both and
up to 99% for a single type, opening up a novel design space for haptic in-car feedback.
- Patricia Cornelio
- Emanuela Maggioni
- Giada Brianza
- Sriram Subramanian
- Marianna Obrist
The Sense of Agency (SoA) is crucial in interaction with technology; it refers to
the feeling of 'I did that' as opposed to 'the system did that', supporting a feeling
of being in control. Research in human-computer interaction has recently studied agency
in visual, auditory, and haptic interfaces; however, the role of smell in agency remains
unknown. Our sense of smell is powerful in eliciting emotions, memories, and awareness
of the environment, which has been exploited to enhance user experiences (e.g., in
VR and driving scenarios). In light of increased interest in designing multimodal
interfaces including smell and its close link with emotions, we investigated, for
the first time, the effect of smell-induced emotions on the SoA. We conducted a study
using the Intentional Binding (IB) paradigm used to measure SoA while participants
were exposed to three scents with different valence (pleasant, unpleasant, neutral).
Our results show that participants' SoA increased with a pleasant scent compared to
neutral and unpleasant scents. We discuss how our results can inform the design of
multimodal and future olfactory interfaces.
- Yufeng Yin
- Baiyu Huang
- Yizhen Wu
- Mohammad Soleymani
Automatic emotion recognition methods are sensitive to the variations across different
datasets and their performance drops when evaluated across corpora. We can apply domain
adaptation techniques, e.g., the Domain-Adversarial Neural Network (DANN), to mitigate this
problem. Though the DANN can detect and remove the bias between corpora, the bias
between speakers still remains, which results in reduced performance. In this paper,
we propose Speaker-Invariant Domain-Adversarial Neural Network (SIDANN) to reduce
both the domain bias and the speaker bias. Specifically, based on the DANN, we add
a speaker discriminator to unlearn information representing speakers' individual characteristics
with a gradient reversal layer (GRL). Our experiments with multimodal data (speech,
vision, and text) and the cross-domain evaluation indicate that the proposed SIDANN
outperforms (+5.6% and +2.8% on average for detecting arousal and valence) the DANN
model, suggesting that the SIDANN has a better domain adaptation ability than the
DANN. Besides, the modality contribution analysis shows that the acoustic features
are the most informative for arousal detection while the lexical features perform
the best for valence detection.
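The gradient reversal layer (GRL) attached to the speaker discriminator is a standard DANN-style building block; a minimal, generic PyTorch sketch of such a layer is shown below (not the authors' code).

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity on the forward pass, flips (and
    scales) gradients on the backward pass so the encoder unlearns whatever
    the attached discriminator can predict."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: features from a shared encoder feed an emotion head normally,
# while a speaker (or domain) discriminator sees gradient-reversed features.
features = torch.randn(16, 256, requires_grad=True)
speaker_logits = torch.nn.Linear(256, 10)(grad_reverse(features))
```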
- Wei Guo
- Byeong-Young Cho
- Jingtao Wang
Mobile devices are becoming an important platform for reading. However, existing research
on mobile reading primarily focuses on low-level metrics such as speed and comprehension.
For complex reading tasks involving information seeking and context switching, researchers
still rely on verbal reports via think-aloud. We present StrategicReading, an intelligent
reading system running on unmodified smartphones, to understand high-level strategic
reading behaviors on mobile devices. StrategicReading leverages multimodal behavior
sensing and takes advantage of signals from camera-based gaze sensing, kinematic scrolling
patterns, and cross-page behavior changes. Through a 40-participant study, we found
that gaze patterns, muscle stiffness signals, and reading paths captured by StrategicReading
can infer both users' reading strategies and reading performance with high accuracy.
- Amr Gomaa
- Guillermo Reyes
- Alexandra Alles
- Lydia Rupp
- Michael Feld
Hand pointing and eye gaze have been extensively investigated in automotive applications
for object selection and referencing. Despite significant advances, existing outside-the-vehicle
referencing methods consider these modalities separately. Moreover, existing multimodal
referencing methods focus on a static situation, whereas the situation in a moving
vehicle is highly dynamic and subject to safety-critical constraints. In this paper,
we investigate the specific characteristics of each modality and the interaction between
them when used in the task of referencing outside objects (e.g. buildings) from the
vehicle. We furthermore explore person-specific differences in this interaction by
analyzing individuals' performance for pointing and gaze patterns, along with their
effect on the driving task. Our statistical analysis shows significant differences
in individual behaviour based on the object's location (i.e., the driver's right vs. left
side), the object's surroundings, the driving mode (i.e., autonomous vs. normal driving),
as well as pointing and gaze duration, laying the foundation for a user-adaptive approach.
- Lingyu Zhang
- Richard J. Radke
Social signal processing algorithms have become increasingly better at solving well-defined
prediction and estimation problems in audiovisual recordings of group discussion.
However, much human behavior and communication is less structured and more subtle.
In this paper, we address the problem of generic question answering from diverse audiovisual
recordings of human interaction. The goal is to select the correct free-text answer
to a free-text question about human interaction in a video. We propose an RNN-based
model with two novel ideas: a temporal attention module that highlights key words
and phrases in the question and candidate answers, and a consistency measurement module
that scores the similarity between the multimodal data, the question, and the candidate
answers. This small set of consistency scores forms the input to the final question-answering
stage, resulting in a lightweight model. We demonstrate that our model achieves state
of the art accuracy on the Social-IQ dataset containing hundreds of videos and question/answer
pairs.
- Parul Gupta
- Komal Chugh
- Abhinav Dhall
- Ramanathan Subramanian
We present FakeET -- an eye-tracking database to understand human visual perception
of deepfake videos. Given that the principal purpose of deepfakes is to deceive human
observers, FakeET is designed to understand and evaluate the ability of viewers to
detect synthetic video artifacts. FakeET contains viewing patterns compiled from 40
users via the Tobii desktop eye-tracker for 811 videos from the Google Deepfake dataset,
with a minimum of two viewings per video. Additionally, EEG responses acquired via
the Emotiv sensor are also available. The compiled data confirms (a) distinct eye
movement characteristics for real vs fake videos; (b) utility of the eye-track saliency
maps for spatial forgery localization and detection, and (c) Error Related Negativity
(ERN) triggers in the EEG responses, and the ability of the raw EEG signal to distinguish
between real and fake videos.
- Beatrice Biancardi
- Lou Maisonnave-Couterou
- Pierrick Renault
- Brian Ravenet
- Maurizio Mancini
- Giovanna Varni
We present WoNoWa, a novel multi-modal dataset of small group interactions in collaborative
tasks. The dataset is explicitly designed to elicit and to study over time a Transactive
Memory System (TMS), a group's emergent state characterizing the group's meta-knowledge
about "who knows what". A rich set of automatic features and manual annotations, extracted
from the collected audio-visual data, is available on request for research purposes.
Features include individual descriptors (e.g., position, Quantity of Motion, speech
activity) and group descriptors (e.g., F-formations). Additionally, participants'
self-assessments are available. Preliminary results from exploratory analyses show
that the WoNoWa design allowed groups to develop a TMS that increased across the tasks.
These results encourage the use of the WoNoWa dataset for a better understanding of
the relationship between behavioural patterns and TMS, which in turn could help to
improve group performance.
- Kumar Akash
- Neera Jain
- Teruhisa Misu
Properly calibrated human trust is essential for successful interaction between humans
and automation. However, while human trust calibration can be improved by increased
automation transparency, too much transparency can overwhelm human workload. To address
this tradeoff, we present a probabilistic framework using a partially observable Markov
decision process (POMDP) for modeling the coupled trust-workload dynamics of human
behavior in an action-automation context. We specifically consider hands-off Level
2 driving automation in a city environment involving multiple intersections where
the human chooses whether or not to rely on the automation. We consider automation
reliability, automation transparency, and scene complexity, along with human reliance
and eye-gaze behavior, to model the dynamics of human trust and workload. We demonstrate
that our model framework can appropriately vary automation transparency based on real-time
human trust and workload belief estimates to achieve trust calibration.
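The POMDP formulation can be made concrete with the textbook belief-update step such a model relies on. The tiny transition/observation tensors below are hypothetical, standing in for the paper's trust-workload states, transparency actions, and reliance/gaze observations.

```python
import numpy as np

def belief_update(belief, T, O, action, observation):
    """One step of POMDP belief filtering:
        b'(s') proportional to O[a, s', o] * sum_s T[a, s, s'] * b(s)
    T: transition tensor T[a, s, s'], O: observation tensor O[a, s', o].
    A textbook update shown only to make the modeling idea concrete."""
    predicted = belief @ T[action]                  # sum_s T[a, s, s'] b(s)
    unnormalized = O[action, :, observation] * predicted
    return unnormalized / unnormalized.sum()

# Tiny hypothetical example: 2 latent states, 2 actions, 2 observations.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.4, 0.6]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.5, 0.5]]])
b = np.array([0.5, 0.5])
print(belief_update(b, T, O, action=0, observation=1))
```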
- Victoria Lin
- Jeffrey M. Girard
- Michael A. Sayette
- Louis-Philippe Morency
Emotional expressiveness captures the extent to which a person tends to outwardly
display their emotions through behavior. Due to the close relationship between emotional
expressiveness and behavioral health, as well as the crucial role that it plays in
social interaction, the ability to automatically predict emotional expressiveness
stands to spur advances in science, medicine, and industry. In this paper, we explore
three related research questions. First, how well can emotional expressiveness be
predicted from visual, linguistic, and multimodal behavioral signals? Second, how
important is each behavioral modality to the prediction of emotional expressiveness?
Third, which behavioral signals are reliably related to emotional expressiveness?
To answer these questions, we add highly reliable transcripts and human ratings of
perceived emotional expressiveness to an existing video database and use this data
to train, validate, and test predictive models. Our best model shows promising predictive
performance on this dataset (RMSE=0.65, R^2=0.45, r=0.74). Multimodal models tend
to perform best overall, and models trained on the linguistic modality tend to outperform
models trained on the visual modality. Finally, examination of our interpretable models'
coefficients reveals a number of visual and linguistic behavioral signals---such as
facial action unit intensity, overall word count, and use of words related to social
processes---that reliably predict emotional expressiveness.
- Lars Steinert
- Felix Putze
- Dennis Küster
- Tanja Schultz
Roughly 50 million people worldwide are currently suffering from dementia. This number
is expected to triple by 2050. Dementia is characterized by a loss of cognitive function
and changes in behaviour. This includes memory, language skills, and the ability to
focus and pay attention. However, it has been shown that secondary therapy such as
the physical, social and cognitive activation of People with Dementia (PwD) has significant
positive effects. Activation impacts cognitive functioning and can help prevent the
magnification of apathy, boredom, depression, and loneliness associated with dementia.
Furthermore, activation can lead to higher perceived quality of life. We follow Cohen's
argument that activation stimuli have to produce engagement to take effect and adopt
his definition of engagement as "the act of being occupied or involved with an external
stimulus".
- Skanda Muralidhar
- Emmanuelle Patricia Kleinlogel
- Eric Mayor
- Adrian Bangerter
- Marianne Schmid Mast
- Daniel Gatica-Perez
Asynchronous video interviews (AVIs) are increasingly used by organizations in their
hiring process. In this mode of interviewing, the applicants are asked to record their
responses to predefined interview questions using a webcam via an online platform.
The usage of AVIs has increased due to employers' perceived benefits in terms of costs and
scale. However, little research has been conducted regarding applicants' reactions
to these new interview methods. In this work, we investigate applicants' reactions
to an AVI platform using self-reported measures previously validated in psychology
literature. We also investigate the connections of these measures with nonverbal behavior
displayed during the interviews. We find that participants who found the platform
creepy and had concerns about privacy reported lower interview performance compared
to participants who did not have such concerns. We also observe weak correlations
between nonverbal cues displayed and these self-reported measures. Finally, inference
experiments achieve overall low performance with respect to explaining applicants' reactions.
Overall, our results reveal that participants who are not at ease with AVIs (i.e.,
high creepy ambiguity score) might be unfairly penalized. This has implications for
improved hiring practices using AVIs.
- Sami Alperen Akgun
- Moojan Ghafurian
- Mark Crowley
- Kerstin Dautenhahn
An experiment is presented to investigate whether there is consensus in mapping emotions
to messages/situations in urban search and rescue scenarios, where efficiency and
effectiveness of interactions are key to success. We studied mappings between 10 specific
messages, presented in two different communication styles, reflecting common situations
that might happen during search and rescue missions, and the emotions exhibited by
robots in those situations. The data was obtained through a Mechanical Turk study
with 78 participants. Our findings support the feasibility of using emotions as an
additional communication channel to improve multi-modal human-robot interaction for
urban search and rescue robots, and suggest that these mappings are robust, i.e.,
not affected by the robot's communication style.
- Matthias Kraus
- Marvin Schiller
- Gregor Behnke
- Pascal Bercher
- Michael Dorna
- Michael Dambier
- Birte Glimm
- Susanne Biundo
- Wolfgang Minker
Effectively supporting novices during performance of complex tasks, e.g. do-it-yourself
(DIY) projects, requires intelligent assistants to be more than mere instructors.
In order to be accepted as a competent and trustworthy cooperation partner, they need
to be able to actively participate in the project and engage in helpful conversations
with users when assistance is necessary. Therefore, a new proactive version of the
DIY-assistant Robert is presented in this paper. It extends the previous prototype
by including the capability to initiate reflective meta-dialogues using multimodal
cues. Two different strategies for reflective dialogue are implemented: A progress-based
strategy initiates a reflective dialogue about previous experience with the assistance
for encouraging the self-appraisal of the user. An activity-based strategy is applied
for providing timely, task-dependent support: user activities with a connected
drill driver are tracked and trigger dialogues that reflect on the current
task and help prevent task failure. An experimental study comparing the proactive assistant
against the baseline version shows that proactive meta-dialogue is able to build user
trust significantly better than a solely reactive system. Besides, the results provide
interesting insights for the development of proactive dialogue assistants.
- Abdul Rafey Aftab
- Michael von der Beeck
- Michael Feld
Sophisticated user interaction in the automotive industry is a fast-emerging topic.
Mid-air gestures and speech already have numerous applications for driver-car interaction.
Additionally, multimodal approaches are being developed to leverage the use of multiple
sensors for added advantages. In this paper, we propose a fast and practical multimodal
fusion method based on machine learning for the selection of various control modules
in an automotive vehicle. The modalities taken into account are gaze, head pose and
finger pointing gesture. Speech is used only as a trigger for fusion. A single modality
has previously been used numerous times for recognizing the user's pointing direction.
We, however, demonstrate how multiple inputs can be fused together to enhance the
recognition performance. Furthermore, we compare different deep neural network architectures
against conventional Machine Learning methods, namely Support Vector Regression and
Random Forests, and show the enhancements in the pointing direction accuracy using
deep learning. The results suggest a great potential for the use of multimodal inputs
that can be applied to more use cases in the vehicle.
SESSION: Short Papers
- Jasper J. van Beers
- Ivo V. Stuldreher
- Nattapong Thammasan
- Anne-Marie Brouwer
Measuring concurrent changes in autonomic physiological responses aggregated across
individuals (Physiological Synchrony - PS) can provide insight into group-level cognitive
or emotional processes. Utilizing cheap and easy-to-use wearable sensors to measure
physiology rather than their high-end laboratory counterparts is desirable. Since
it is currently ambiguous how different signal properties (arising from different
types of measuring equipment) influence the detection of PS associated with mental
processes, it is unclear whether, or to what extent, PS based on data from wearables
compares to that from their laboratory equivalents. Existing literature has investigated
PS using both types of equipment, but none compared them directly. In this study,
we measure PS in electrodermal activity (EDA) and inter-beat interval (IBI, inverse
of heart rate) of participants who listened to the same audio stream but were either
instructed to attend to the presented narrative (n=13) or to the interspersed auditory
events (n=13). Both laboratory and wearable sensors were used (ActiveTwo electrocardiogram
(ECG) and EDA; Wahoo Tickr and EdaMove4). A participant's attentional condition was
classified based on which attentional group they shared greater synchrony with. For
both types of sensors, we found classification accuracies of 73% or higher in both
EDA and IBI. We found no significant difference in classification accuracies between
the laboratory and wearable sensors. These findings encourage the use of wearables
for PS-based research and for in-the-field measurements.
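The classification rule described above, assigning a listener to the attentional group with which they synchronize most, can be sketched as a simple inter-subject correlation comparison. The synthetic signals and plain Pearson correlation are illustrative assumptions, not the study's exact pipeline; a real pipeline would typically leave the participant out of their own group average.

```python
import numpy as np

def classify_by_synchrony(subject_signal, group_a_signals, group_b_signals):
    """Assign a participant to the attentional group whose averaged signal
    their own physiology correlates with most strongly."""
    mean_a = np.mean(group_a_signals, axis=0)
    mean_b = np.mean(group_b_signals, axis=0)
    r_a = np.corrcoef(subject_signal, mean_a)[0, 1]
    r_b = np.corrcoef(subject_signal, mean_b)[0, 1]
    return ("narrative" if r_a > r_b else "events"), r_a, r_b

rng = np.random.default_rng(1)
shared = rng.standard_normal(1000)                         # shared stimulus response
group_a = shared + 0.5 * rng.standard_normal((12, 1000))   # attend-narrative group
group_b = rng.standard_normal((12, 1000))                  # attend-events group
subject = shared + 0.5 * rng.standard_normal(1000)
print(classify_by_synchrony(subject, group_a, group_b))
```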
- Toshiki Onishi
- Arisa Yamauchi
- Ryo Ishii
- Yushi Aono
- Akihiro Miyata
In this work, as a first attempt to analyze the relationship between praising skills
and human behavior in dialogue, we focus on head and face behavior. We create a new
dialogue corpus including face and head behavior information of persons who give praise
(praiser) and receive praise (receiver) and the degree of success of praising (praising
score). We also create a machine learning model that uses features related to head
and face behavior to estimate the praising score, and we clarify which features of the
praiser and receiver are important in this estimation. The analysis showed that features
of both the praiser and the receiver matter for estimating the praising score, in
particular features related to utterance, head, gaze, and chin. The analysis
of the features of high importance revealed that the praiser and receiver should face
each other without turning their heads to the left or right, and the longer the praiser's
utterance, the more successful the praising.
- Tousif Ahmed
- Mohsin Y. Ahmed
- Md Mahbubur Rahman
- Ebrahim Nemati
- Bashima Islam
- Korosh Vatanparvar
- Viswam Nathan
- Daniel McCaffrey
- Jilong Kuang
- Jun Alex Gao
Tracking the type and frequency of cough events is critical for monitoring respiratory
diseases. Coughs are one of the most common symptoms of respiratory and infectious
diseases like COVID-19, and a cough monitoring system could be vital for remote
monitoring during such a pandemic. While the existing solutions for cough
monitoring use unimodal (e.g., audio) approaches for detecting coughs, a fusion of
multimodal sensors (e.g., audio and accelerometer) from multiple devices (e.g., phone
and watch) are likely to discover additional insights and can help to track the exacerbation
of the respiratory conditions. However, such multimodal and multidevice fusion requires
accurate time synchronization, which can be challenging because coughs are
usually brief events (0.3-0.7 seconds). In this paper, we first demonstrate the
challenges of time-synchronizing cough events based on cough data collected
from two studies. Then we highlight the performance of a cross-correlation-based time
synchronization algorithm on the alignment of cough events. Our algorithm can synchronize
98.9% of cough events with an average synchronization error of 0.046s from two devices.
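The cross-correlation alignment step can be illustrated with a toy lag estimator between two synthetic recordings of the same burst. The sampling rate, signals, and normalization are assumptions, not the paper's algorithm.

```python
import numpy as np

def estimate_lag(signal_a, signal_b, fs):
    """Estimate the time offset (seconds) between two sensors' recordings of
    the same event via the peak of their cross-correlation."""
    a = (signal_a - signal_a.mean()) / signal_a.std()
    b = (signal_b - signal_b.mean()) / signal_b.std()
    xcorr = np.correlate(a, b, mode="full")
    lag_samples = np.argmax(xcorr) - (len(b) - 1)
    return lag_samples / fs

fs = 100  # Hz, assumed common sampling rate after resampling
t = np.arange(0, 2, 1 / fs)
cough = np.exp(-((t - 0.8) ** 2) / 0.002)   # synthetic cough-like burst
delayed = np.roll(cough, 5)                 # 50 ms offset on the second device
print(f"estimated lag: {estimate_lag(delayed, cough, fs):.3f} s")
```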
- Kumar Shubham
- Emmanuelle Patricia Kleinlogel
- Anaïs Butera
- Marianne Schmid Mast
- Dinesh Babu Jayagopi
With recent advancements in technology, new platforms have emerged to substitute for face-to-face
interviews. Of particular interest are asynchronous video interviewing (AVI) platforms,
where candidates talk to a screen with questions, and virtual agent based interviewing
platforms, where a human-like avatar interviews candidates. These anytime-anywhere
interviewing systems scale up the overall reach of the interviewing process for firms,
though they may not provide the best experience for the candidates. An important research
question is how the candidates perceive such platforms and its impact on their performance
and behavior. Also, is there an advantage of one setting over another, i.e., Avatar
vs. Platform? Finally, would such differences be consistent across cultures? In this
paper, we present the results of a comparative study conducted in three different
interview settings (i.e., Face-to-face, Avatar, and Platform), as well as two different
cultural contexts (i.e., India and Switzerland), and analyze the differences in self-rated,
others-rated performance, and automatic audiovisual behavioral cues.
- Ronald Cumbal
- José Lopes
- Olov Engwall
Uncertainty is a frequently occurring affective state that learners experience during
the acquisition of a second language. This state can constitute both a learning opportunity
and a source of learner frustration. An appropriate detection could therefore benefit
the learning process by reducing cognitive instability. In this study, we use a dyadic
practice conversation between an adult second-language learner and a social robot
to elicit events of uncertainty through the manipulation of the robot's spoken utterances
(increased lexical complexity or prosody modifications). The characteristics of these
events are then used to analyze multi-party practice conversations between a robot
and two learners. Classification models are trained with multimodal features from
annotated events of listener (un)certainty. We report the performance of our models
on different settings, (sub)turn segments and multimodal inputs.
- Haley Lepp
- Chee Wee Leong
- Katrina Roohr
- Michelle Martin-Raugh
- Vikram Ramanarayanan
We investigate the effect of observed data modality on human and machine scoring of
informative presentations in the context of oral English communication training and
assessment. Three sets of raters scored the content of three minute presentations
by college students on the basis of either the video, the audio or the text transcript
using a custom scoring rubric. We find significant differences between the scores
assigned when raters view a transcript or listen to audio recordings in comparison
to watching a video of the same presentation, and present an analysis of those differences.
Using the human scores, we train machine learning models to score a given presentation
using text, audio, and video features separately. We analyze the distribution of machine
scores against the modality and label bias we observe in human scores, discuss its
implications for machine scoring and recommend best practices for future work in this
direction. Our results demonstrate the importance of checking and correcting for bias
across different modalities in evaluations of multi-modal performances.
- Ziyang Chen
- Yu-Peng Chen
- Alex Shaw
- Aishat Aloba
- Pavlo Antonenko
- Jaime Ruiz
- Lisa Anthony
It is well established that children's touch and gesture interactions on touchscreen
devices are different from those of adults, with much prior work showing that children's
input is recognized more poorly than adults' input. In addition, researchers have
shown that recognition of touchscreen input is poorest for young children and improves
for older children when simply considering their age; however, individual differences
in cognitive and motor development could also affect children's input. An understanding
of how cognitive and motor skill influence touchscreen interactions, as opposed to
only coarser measurements like age and grade level, could help in developing personalized
and tailored touchscreen interfaces for each child. To investigate how cognitive and
motor development may be related to children's touchscreen interactions, we conducted
a study of 28 participants ages 4 to 7 that included validated assessments of the
children's motor and cognitive skills as well as typical touchscreen target acquisition
and gesture tasks. We correlated participants' touchscreen behaviors to their cognitive
development level, including both fine motor skills and executive function. We compare
our analysis of touchscreen interactions based on cognitive and motor development
to prior work based on children's age. We show that all four factors (age, grade level,
motor skill, and executive function) show similar correlations with target miss rates
and gesture recognition rates. Thus, we conclude that age and grade level are sufficiently
sensitive when considering children's touchscreen behaviors.
- Jari Kangas
- Olli Koskinen
- Roope Raisamo
To effectively utilize a gaze tracker in user interaction, it is important to know
the quality of the gaze data that it is measuring. We have developed a method to evaluate
the accuracy and precision of gaze trackers in virtual reality headsets. The method
consists of two software components. The first component is a simulation software
that calibrates the gaze tracker and then performs data collection by providing a
gaze target that moves around the headset's field-of-view. The second component makes
an off-line analysis of the logged gaze data and provides a number of measurement
results of the accuracy and precision. The analysis results consist of the accuracy
and precision of the gaze tracker in different directions inside the virtual 3D space.
Our method combines the measurements into overall accuracy and precision. Visualizations
of the measurements are created to see possible trends over the display area. Results
from selected areas in the display are analyzed to find out differences between the
areas (for example, the middle/outer edge of the display or the upper/lower part of
the display).
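The off-line accuracy and precision analysis can be grounded in the usual angular-error definitions. The sketch below uses one common convention (mean angular offset for accuracy, RMS of sample-to-sample angular differences for precision) and synthetic gaze vectors, and may differ in detail from the method described above.

```python
import numpy as np

def angular_error_deg(gaze_dirs, target_dirs):
    """Angle (degrees) between measured gaze vectors and reference vectors."""
    g = gaze_dirs / np.linalg.norm(gaze_dirs, axis=1, keepdims=True)
    t = target_dirs / np.linalg.norm(target_dirs, axis=1, keepdims=True)
    cos = np.clip(np.sum(g * t, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def accuracy_and_precision(gaze_dirs, target_dirs):
    """Accuracy = mean angular offset from the target;
    precision = RMS of sample-to-sample angular differences."""
    errors = angular_error_deg(gaze_dirs, target_dirs)
    accuracy = errors.mean()
    successive = angular_error_deg(gaze_dirs[1:], gaze_dirs[:-1])
    precision = np.sqrt(np.mean(successive ** 2))
    return accuracy, precision

rng = np.random.default_rng(2)
target = np.tile([0.0, 0.0, 1.0], (100, 1))               # fixation target direction
gaze = target + 0.02 * rng.standard_normal((100, 3))      # roughly 1 degree of noise
print(accuracy_and_precision(gaze, target))
```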
- Liang Yang
- Jingjie Zeng
- Tao Peng
- Xi Luo
- Jinghui Zhang
- Hongfei Lin
Legal Judgement Prediction (LJP) is now in the spotlight. It usually consists
of multiple sub-tasks, such as penalty prediction (fine and imprisonment) and the
prediction of applicable articles of law. Penalty predictions are often closely related
to the trial process, especially to the analysis of the criminal suspect's attitude,
which influences the judgment of the presiding judge to some extent. In this paper,
we firstly construct a multi-modal dataset with 517 cases of intentional assault,
which contains trial information as well as the attitude of the suspect. Then, we
explore the relationship between the suspect's attitude and the term of imprisonment. Finally,
we use the proposed multi-modal model to predict the suspect's attitude, and compare
it with several strong baselines. Our experimental results show that the attitude
of the criminal suspect is closely related to the penalty prediction, which provides
a new perspective for LJP.
- Everlyne Kimani
- Prasanth Murali
- Ameneh Shamekhi
- Dhaval Parmar
- Sumanth Munikoti
- Timothy Bickmore
Audience perceptions of public speakers' performance change over time. Some speakers
start strong but quickly transition to mundane delivery, while others may have a few
impactful and engaging portions of their talk preceded and followed by more pedestrian
delivery. In this work, we model the time-varying qualities of a presentation as perceived
by the audience and use these models both to provide diagnostic information to presenters
and to improve the quality of automated performance assessments. In particular, we
use HMMs to model various dimensions of perceived quality and how they change over
time and use the sequence of quality states to improve feedback and predictions. We
evaluate this approach on a corpus of 74 presentations given in a controlled environment.
Multimodal features, spanning acoustic qualities, speech disfluencies, and nonverbal
behavior, were derived both automatically and manually using crowdsourcing. Ground
truth on audience perceptions was obtained using judge ratings on both overall presentations
(aggregate) and portions of presentations segmented by topic. We distilled the overall
presentation quality into states representing the presenter's gaze, audio, gesture,
audience interaction, and proxemic behaviors. We demonstrate that an HMM-based state
representation of presentations improves the performance assessments.
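The HMM modeling of time-varying perceived quality can be sketched with an off-the-shelf Gaussian HMM over per-segment feature vectors. The feature dimensionality, number of states, and segment counts below are placeholder assumptions, and hmmlearn is just one convenient library choice, not necessarily the authors'.

```python
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

# Synthetic stand-in: 74 presentations, 6 topic segments each, 12 features per segment.
rng = np.random.default_rng(3)
segments = rng.standard_normal((74 * 6, 12))
lengths = [6] * 74                              # one observation sequence per talk

model = hmm.GaussianHMM(n_components=4, covariance_type="diag",
                        n_iter=50, random_state=0)
model.fit(segments, lengths)
states = model.predict(segments)                # latent "quality state" per segment,
                                                # usable as extra features downstream
```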
- Soheil Rayatdoost
- David Rudrauf
- Mohammad Soleymani
Emotions associated with neural and behavioral responses are detectable through scalp
electroencephalogram (EEG) signals and measures of facial expressions. We propose
a multimodal deep representation learning approach for emotion recognition from EEG
and facial expression signals. The proposed method involves the joint learning of
a unimodal representation aligned with the other modality through cosine similarity
and a gated fusion for modality fusion. We evaluated our method on two databases:
DAI-EF and MAHNOB-HCI. The results show that our deep representation is able to learn
mutual and complementary information between EEG signals and face video, captured
by action units, head and eye movements from face videos, in a manner that generalizes
across databases. It is able to outperform similar fusion methods for the task at
hand.
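The two mechanisms named in the abstract, a cosine-similarity alignment between unimodal representations and a gated fusion, can be sketched in a toy module. Dimensions, layers, and the way the alignment term is exposed are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Toy module: project each modality to a shared space, penalize their
    cosine dissimilarity, and blend them with a learned gate."""

    def __init__(self, eeg_dim=128, face_dim=128, shared_dim=64, n_classes=3):
        super().__init__()
        self.eeg_proj = nn.Linear(eeg_dim, shared_dim)
        self.face_proj = nn.Linear(face_dim, shared_dim)
        self.gate = nn.Linear(2 * shared_dim, shared_dim)
        self.classifier = nn.Linear(shared_dim, n_classes)

    def forward(self, eeg, face):
        z_eeg = self.eeg_proj(eeg)
        z_face = self.face_proj(face)
        # Alignment loss: push the two unimodal representations together.
        align_loss = 1.0 - F.cosine_similarity(z_eeg, z_face, dim=-1).mean()
        g = torch.sigmoid(self.gate(torch.cat([z_eeg, z_face], dim=-1)))
        fused = g * z_eeg + (1.0 - g) * z_face
        return self.classifier(fused), align_loss

model = GatedFusion()
logits, align = model(torch.randn(4, 128), torch.randn(4, 128))
```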
- Kalin Stefanov
- Baiyu Huang
- Zongjian Li
- Mohammad Soleymani
Automatic multimodal acquisition and understanding of social signals is an essential
building block for natural and effective human-machine collaboration and communication.
This paper introduces OpenSense, a platform for real-time multimodal acquisition and
recognition of social signals. OpenSense enables precisely synchronized and coordinated
acquisition and processing of human behavioral signals. Powered by Microsoft's
Platform for Situated Intelligence, OpenSense supports a range of sensor devices and
machine learning tools and encourages developers to add new components to the system
through straightforward mechanisms for component integration. This platform also offers
an intuitive graphical user interface to build application pipelines from existing
components. OpenSense is freely available for academic research.
- Jaya Narain
- Kristina T. Johnson
- Craig Ferguson
- Amanda O'Brien
- Tanya Talkar
- Yue Zhang Weninger
- Peter Wofford
- Thomas Quatieri
- Rosalind Picard
- Pattie Maes
Nonverbal vocalizations contain important affective and communicative information,
especially for those who do not use traditional speech, including individuals who
have autism and are non- or minimally verbal (nv/mv). Although these vocalizations
are often understood by those who know them well, they can be challenging to understand
for the community-at-large. This work presents (1) a methodology for collecting spontaneous
vocalizations from nv/mv individuals in natural environments, with no researcher present,
and personalized in-the-moment labels from a family member; (2) speaker-dependent
classification of these real-world sounds for three nv/mv individuals; and (3) an
interactive application to translate the nonverbal vocalizations in real time. Using
support-vector machine and random forest models, we achieved speaker-dependent unweighted
average recalls (UARs) of 0.75, 0.53, and 0.79 for the three individuals, respectively,
with each model discriminating between 5 nonverbal vocalization classes. We also present
first results for real-time binary classification of positive- and negative-affect
nonverbal vocalizations, trained using a commercial wearable microphone and tested
in real time using a smartphone. This work informs personalized machine learning methods
for non-traditional communicators and advances real-world interactive augmentative
technology for an underserved population.
- Margaret von Ebers
- Ehsanul Haque Nirjhar
- Amir H. Behzadan
- Theodora Chaspari
Public speaking is central to socialization in casual, professional, or academic settings.
Yet, public speaking anxiety (PSA) is known to impact a considerable portion of the
general population. This paper utilizes bio-behavioral indices captured from wearable
devices to quantify the effectiveness of systematic exposure to virtual reality (VR)
audiences for mitigating PSA. The effect of separate bio-behavioral features and demographic
factors is studied, as well as the amount of necessary data from the VR sessions that
can yield a reliable predictive model of the VR training effectiveness. Results indicate
that acoustic and physiological reactivity during the VR exposure can reliably predict
change in PSA before and after the training. With the addition of demographic features,
both acoustic and physiological feature sets achieve improvements in performance.
Finally, using bio-behavioral data from six to eight VR sessions can yield reliable
prediction of PSA change. Findings of this study will enable researchers to better
understand how bio-behavioral factors indicate improvements in PSA with VR training.
- Akshat Choube
- Mohammad Soleymani
Humor has a history as old as humanity. Humor often induces laughter and elicits amusement
and engagement. Humorous behavior is manifested in different modalities,
including language, voice tone, and gestures. Thus, automatic understanding of humorous
behavior requires multimodal behavior analysis. Humor detection is a well-established
problem in Natural Language Processing but its multimodal analysis is less explored.
In this paper, we present a context-aware hierarchical fusion network for multimodal
punchline detection. The proposed neural architecture first fuses the modalities two
by two and then fuses all three modalities. The network also models the context of
the punchline using Gated Recurrent Units. The model is evaluated
on the UR-FUNNY database, yielding state-of-the-art performance.
- Miltiadis Marios Katsakioris
- Ioannis Konstas
- Pierre Yves Mignotte
- Helen Hastie
We present the publicly-available Robot Open Street Map Instructions (ROSMI) corpus:
a rich multimodal dataset of map and natural language instruction pairs that was collected
via crowdsourcing. The goal of this corpus is to aid in the advancement of state-of-the-art
visual-dialogue tasks, including reference resolution and robot-instruction understanding.
The domain described here concerns robots and autonomous systems being used for inspection
and emergency response. The ROSMI corpus is unique in that it captures interaction
grounded in map-based visual stimuli that is both human-readable but also contains
rich metadata that is needed to plan and deploy robots and autonomous systems, thus
facilitating human-robot teaming.
- Murat Kirtay
- Ugo Albanese
- Lorenzo Vannucci
- Guido Schillaci
- Cecilia Laschi
- Egidio Falotico
Multimodal information can significantly increase the perceptual capabilities of robotic
agents, at the cost of a more complex sensory processing. This complexity can be reduced
by employing machine learning techniques, provided that there is enough meaningful
data to train on. This paper reports on creating novel datasets constructed by employing
the iCub robot equipped with an additional depth sensor and color camera. We used
the robot to acquire color and depth information for 210 objects in different acquisition
scenarios. The result is a set of large-scale datasets that can be used for
robot and computer vision applications: multisensory object representation, action
recognition, and rotation- and distance-invariant object recognition.
- Roelof A. J. de Vries
- Juliet A. M. Haarman
- Emiel C. Harmsen
- Dirk K. J. Heylen
- Hermie J. Hermens
Eating is in many ways a social activity. Yet, little is known about how the social dimension
of eating influences individual eating habits. Nor do we know much about how to purposefully
design for interactions in the social space of eating. This paper presents (1) the
journey of exploring the social space of eating by designing an artifact, and (2)
the actual artifact designed for the purpose of exploring the interaction dynamics
of social eating. The result of this Research through Design journey is the Sensory
Interactive Table: an interactive dining table based on explorations of the social
space of eating, and a probe to explore the social space of eating further.
- Wail El Bani
- Mohamed Chetouani
Touch is the earliest sense to develop and the first means of contact with the external
world. Touch also plays a key role in our socio-emotional communication: we use it
to communicate our feelings, elicit strong emotions in others and modulate behavior
(e.g., compliance). Despite its relevance, touch is an understudied modality in human-machine
interaction compared to audition and vision. Most social touch recognition systems require
a feature engineering step, making them difficult to compare and to generalize to other
databases. In this paper, we propose an end-to-end approach. We present an attention-based
end-to-end model for touch gesture recognition evaluated on two public datasets (CoST
and HAART) in the context of the ICMI 15 Social Touch Challenge. Our model achieved a
similar level of accuracy (61% for CoST and 68% for HAART) while using self-attention
as an alternative to feature engineering and recurrent neural networks.
SESSION: Doctoral Consortium Papers
My research is in the field of computer-supported and computer-enabled innovation processes,
in particular focusing on the first phases of ideation in a co-located environment.
I'm developing a concept for documenting, tracking and enhancing creative ideation
processes. The basis of this concept is a set of key figures derived from various systems within
the ideation sessions. The system designed in my doctoral thesis enables interdisciplinary
teams to kick-start creativity by automating facilitation, moderation, creativity
support and documentation of the process. Using the example of brainstorming, a standing
table is equipped with camera and microphone based sensing as well as multiple ways
of interaction and visualization through projection and LED lights. The user interaction
with the table is implicit and based on real time metadata generated by the users
of the system. System actions are calculated based on what is happening on the table
using object recognition. Everything on the table influences the system thus making
it into a multimodal input and output device with implicit interaction. While the
technical aspects of my research are close to completion, the more challenging
evaluation part will benefit from feedback from the specialists in multimodal interaction
at ICMI 2020.
My PhD project aims to contribute to affective computing applications that assist
in depression diagnosis through micro-expression recognition. My motivation is the
similarity between the low-intensity facial expressions that characterize micro-expressions
and the low-intensity facial expressions ('frozen face') of people with psychomotor
retardation caused by depression. The project will focus on, firstly, investigating spatio-temporal modelling
and attention systems for micro-expression recognition (MER) and, secondly, exploring
the role of micro-expressions in automated depression analysis by improving deep learning
architectures to detect low-intensity facial expressions. This work will investigate
different deep learning architectures (e.g., Temporal Convolutional Neural Networks (TCNN)
or Gated Recurrent Units (GRU)) and validate the results on publicly available micro-expression
benchmark datasets to quantitatively analyse the robustness and accuracy of MER's
contribution to improving automatic depression analysis. Moreover, video magnification
as a way to enhance small movements will be combined with the deep learning methods
to address the low-intensity issues in MER.
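As a rough illustration of the kind of spatio-temporal model mentioned above (not the project's actual architecture), the following PyTorch sketch feeds per-frame CNN features into a GRU; video magnification would be applied as a preprocessing step before such a model. The class count and feature sizes are placeholders.

```python
import torch
import torch.nn as nn

class MicroExpressionGRU(nn.Module):
    """Sketch: per-frame CNN features fed to a GRU for micro-expression recognition."""
    def __init__(self, n_classes=3, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(                 # tiny per-frame encoder (placeholder)
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.gru = nn.GRU(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, clips):                     # clips: (batch, time, 1, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, h = self.gru(feats)                    # h: (1, batch, 64)
        return self.head(h[-1])

logits = MicroExpressionGRU()(torch.randn(2, 16, 1, 64, 64))   # dummy 16-frame clips
```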
- George-Petru Ciordas-Hertel
To obtain a holistic perspective on learning, a multimodal technical infrastructure
for Learning Analytics (LA) can be beneficial. Recent studies have investigated various
aspects of technical LA infrastructure. However, it has not yet been explored how
LA indicators can be complemented with Smartwatch sensor data to detect physical activity
and the environmental context. Sensor data, such as accelerometer readings, are often used
in related work to infer a specific behavior and environmental context, thus triggering
interventions on a just-in-time basis. In this dissertation project, we plan to use
Smartwatch sensor data to explore further indicators for learning from blended learning
sessions conducted in-the-wild, e.g., at home. Such indicators could be used within
learning sessions to suggest breaks, or afterward to support learners in reflection
processes.
We plan to investigate the following three research questions: (RQ1) How can a multimodal
learning analytics infrastructure be designed to support real-time data acquisition
and processing effectively? (RQ2) How can smartwatch sensor data be used to infer environmental
context and physical activities that complement learning analytics indicators for blended
learning sessions? (RQ3) How can we align the extracted multimodal indicators with
pedagogical interventions?
RQ1 was investigated through a structured literature review and eleven semi-structured
interviews with LA infrastructure developers. For RQ2, we are currently designing
and implementing a multimodal learning analytics infrastructure to collect and process
sensor and experience data from Smartwatches. Finally, for RQ3, an exploratory
field study will be conducted to extract multimodal learning indicators and examine
them with learners and pedagogical experts to develop effective interventions.
Researchers, educators, and learners can use and adapt our contributions to gain new
insights into learners' time and learning tactics, and physical learning spaces from
learning sessions taking place in-the-wild.
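To make the RQ2 idea concrete, here is a minimal sketch of inferring physical activity from windowed smartwatch accelerometer data; the window length, feature set, labels, and classifier are illustrative assumptions, not the project's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(acc, win=50):
    """Sketch: mean/std/magnitude features over fixed-size accelerometer windows."""
    feats = []
    for start in range(0, len(acc) - win + 1, win):
        w = acc[start:start + win]                      # (win, 3) x/y/z samples
        mag = np.linalg.norm(w, axis=1)
        feats.append(np.concatenate([w.mean(0), w.std(0), [mag.mean(), mag.std()]]))
    return np.array(feats)

# dummy data: 1000 three-axis samples and per-window activity labels
acc = np.random.randn(1000, 3)
X = window_features(acc)
y = np.random.randint(0, 3, len(X))                     # e.g. still / walking / other
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```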
Groups are attracting more and more attention from scholars. With the rise of Social Signal
Processing (SSP), many studies grounded in Social Sciences and Psychology findings have focused
on detecting and classifying groups' dynamics. Cohesion plays an important role in
these dynamics and is one of the most studied emergent states, involving both
group motions and goals. This PhD project aims to provide a computational model addressing
the multidimensionality of cohesion and capturing its subtle dynamics. It will offer
new opportunities to develop applications to enhance interactions among humans as
well as among humans and machines.
When interested in monitoring attentional engagement, physiological signals can be
of great value. A popular approach is to uncover the complex patterns between physiological
signals and attentional engagement using supervised learning models, but it is often
unclear which physiological measures can best be used in such models, and collecting
enough training data with a reliable ground-truth to train such a model is very challenging.
Rather than using physiological responses of individual participants and specific
events in a trained model, one can also continuously determine the degree to which
physiological measures of multiple individuals uniformly change, often referred to
as physiological synchrony. As a directly proportional relation between physiological
synchrony in brain activity and attentional engagement has been pointed out in the
literature, no trained model is needed to link the two. I aim to create a more robust
measure of attentional engagement among groups of individuals by combining electroencephalography
(EEG), electrodermal activity (EDA) and heart rate into a multimodal metric of physiological
synchrony. I formulate three main research questions in the current research proposal:
1) How does physiological synchrony in measures from the central and peripheral
nervous system relate to attentional engagement? 2) Does physiological synchrony reliably
reflect shared attentional engagement in real-world use-cases? 3) How can these physiological
measures be fused to obtain a multimodal metric of physiological synchrony that outperforms
unimodal synchrony?
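A minimal sketch of the core synchrony computation, assuming preprocessed, time-aligned signals per participant; real pipelines would add filtering, windowing, and modality-specific synchrony measures, and the simple averaging across modalities below is only a placeholder for the proposed multimodal fusion.

```python
import numpy as np

def pairwise_synchrony(signals):
    """Sketch: mean pairwise Pearson correlation across participants for one modality."""
    n = signals.shape[0]                       # signals: (participants, samples)
    corr = np.corrcoef(signals)
    return corr[np.triu_indices(n, k=1)].mean()

# dummy EEG-, EDA- and heart-rate-derived time series for 5 participants
rng = np.random.default_rng(0)
modalities = {m: rng.standard_normal((5, 600)) for m in ("eeg", "eda", "hr")}
per_modality = {m: pairwise_synchrony(s) for m, s in modalities.items()}
multimodal = np.mean(list(per_modality.values()))   # naive fusion into one metric
```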
Human-device interactions in smart environments are shifting prominently towards naturalistic
user interactions such as gaze and gesture. However, ambiguities arise when users
have to switch interactions as contexts change. This can confuse users who are accustomed
to a set of conventional controls, leading to system inefficiencies. My research explores
how to reduce interaction ambiguity by semantically modelling user-specific interactions
with context, enabling personalised interactions through AR. Sensor data captured
by an AR device is used to interpret user interactions and context, which are then
modelled in an extendable knowledge graph, together with the user's interaction preferences,
using semantic web standards. These representations are used to inform AR applications
about the user's intent to interact with a particular device affordance. This research
therefore aims to bring semantic modelling of personalised gesture interactions to
AR/VR applications for smart and immersive environments.
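As an illustration of modelling interaction preferences with semantic web standards, the sketch below uses the rdflib library with a hypothetical namespace, entities, and properties; it is not the project's ontology.

```python
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/ar-interaction#")    # hypothetical namespace
g = Graph()

user, lamp = EX.user42, EX.livingRoomLamp
g.add((user, EX.prefersGesture, Literal("pinch")))      # user-specific preference
g.add((lamp, EX.hasAffordance, EX.togglePower))
g.add((user, EX.interactsWith, lamp))

# query which gesture this user prefers for the devices they interact with
q = """SELECT ?g WHERE { ?u <http://example.org/ar-interaction#prefersGesture> ?g }"""
for row in g.query(q):
    print(row.g)
```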
The diagnosis of autism spectrum disorder is cumbersome even for expert clinicians,
owing to the diversity of symptoms exhibited by children, which depends on the severity
of the disorder. Furthermore, the diagnosis is based on behavioural observations and
the developmental history of the child, and therefore depends substantially on the
perspectives and interpretations of the specialists. In this paper, we present a robot-assisted
diagnostic system for the assessment of behavioural symptoms in children for providing
a reliable diagnosis. The robotic assistant is intended to support the specialist
in administering the diagnostic task, perceiving and evaluating the task outcomes
as well as the behavioural cues for assessment of symptoms and diagnosing the state
of the child. Despite being used widely in education and intervention for children
with autism (CWA), the application of robot assistance in diagnosis is less explored.
Further, there have been limited studies addressing the acceptance and effectiveness
of robot-assisted interventions for CWA in the Global South. We aim to develop a robot-assisted
diagnostic framework for CWA to support the experts and study the viability of such
a system in the Indian context.
Delivering a presentation has been reported as one of the most anxiety-provoking tasks
faced by English Language Learners. Researchers suggest that instructors should be
more aware of the learners' emotional states to provide appropriate emotional and
instructional scaffolding to improve their performance when presenting. Despite the
critical role of instructors in perceiving the emotional states among English language
learners, it can be challenging to do so solely by observing the learners' facial
expressions, behaviors, and their limited verbal expressions due to language and cultural
barriers. To address the ambiguity and inconsistency in interpreting the emotional
states of the students, this research focuses on identifying the potential of using
biosensor-based feedback of learners to support instructors. A novel approach has
been adopted to classify the intensity and characteristics of public speaking anxiety
and foreign language anxiety among English language learners and to provide tailored
feedback to instructors while supporting teaching and learning. As part of this work,
two further studies were proposed. The first study was designed to identify educators'
needs for solutions providing emotional and instructional support. The second study
aims to evaluate the resulting prototype from the instructors' perspective on offering
tailored emotional and instructional scaffolding to students. The contribution of these studies
includes the development of guidance in using biosensor-based feedback that will assist
English language instructors in teaching and identifying the students' anxiety levels
and types while delivering a presentation.
A socially acceptable robot needs to make correct decisions and be able to understand
human intent in order to interact with and navigate around humans safely. Although
research in computer vision and robotics has made huge advances in recent years, today's
robotic systems still need a better understanding of human intent to be more effective
and widely accepted. Currently such inference is typically done using only one mode
of perception such as vision, or human movement trajectory. In this extended abstract,
I describe my PhD research plan of developing a novel multimodal and context-aware
framework, in which a robot infers human navigational intentions through multimodal
perception comprised of human temporal facial, body pose and gaze features, human
motion features as well as environmental context. To support this framework, a data
collection experiment is designed to acquire multimodal human-robot interaction data.
Our initial design of the framework is based on a temporal neural network model with
human motion, body pose, and head orientation features as input; we will increase
the complexity of the model and of the input features along the way. In the long term,
this framework can benefit a variety of settings such as autonomous
driving, service and household robots.
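The following is a hypothetical sketch of a temporal model over concatenated motion, pose, and head-orientation features, here using 1-D temporal convolutions; the proposal describes its initial design only as a temporal neural network, and the feature sizes and intent classes below are placeholders.

```python
import torch
import torch.nn as nn

class IntentTCN(nn.Module):
    """Sketch: temporal 1-D convolutions over concatenated motion, pose and head features."""
    def __init__(self, feat_dim=39, n_intents=3):        # 39 = placeholder feature size
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, n_intents))

    def forward(self, seq):                              # seq: (batch, time, feat_dim)
        return self.net(seq.transpose(1, 2))             # Conv1d expects (batch, channels, time)

logits = IntentTCN()(torch.randn(4, 30, 39))             # e.g. approach / pass by / avoid
```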
One of the key challenges in designing Embodied Conversational Agents (ECA) is to
produce human-like gestural and visual prosody expressivity. Another major challenge
is to maintain the interlocutor's attention by adapting the agent's behavior to the
interlocutor's multimodal behavior. This paper outlines my PhD research plan that
aims to develop convincing expressive and natural behavior in ECAs and to explore
and model the mechanisms that govern human-agent multimodal interaction. Additionally,
I describe in this paper my first PhD milestone which focuses on developing an end-to-end
LSTM neural network model for upper-face gesture generation. The main task consists
of building a model that can produce expressive and coherent upper-face gestures while
considering multiple modalities: speech audio, text, and action units.
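A minimal sketch of the kind of LSTM model described, mapping frame-aligned audio, text, and action-unit features to per-frame upper-face parameters; all dimensions are placeholders and this is not the author's implementation.

```python
import torch
import torch.nn as nn

class UpperFaceLSTM(nn.Module):
    """Sketch: LSTM mapping aligned audio, text and AU features to upper-face parameters."""
    def __init__(self, audio_dim=40, text_dim=300, au_dim=17, out_dim=5, hidden=128):
        super().__init__()                       # all dimensions are placeholders
        self.lstm = nn.LSTM(audio_dim + text_dim + au_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, out_dim)    # e.g. brow/eyelid gesture parameters

    def forward(self, audio, text, aus):         # each: (batch, time, dim)
        h, _ = self.lstm(torch.cat([audio, text, aus], dim=-1))
        return self.out(h)                       # per-frame upper-face outputs
```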
Researchers are interested in understanding the emotions of couples as they relate
to relationship quality and dyadic management of chronic diseases. Currently, the
process of assessing emotions is manual, time-intensive, and costly. Despite the existence
of works on emotion recognition among couples, there exists no ubiquitous system that
recognizes the emotions of couples in everyday life while addressing the complexity
of dyadic interactions such as turn-taking in couples' conversations. In this work,
we seek to develop a smartwatch-based system that leverages multimodal sensor data
to recognize each partner's emotions in daily life. We are collecting data from couples
in the lab and in the field and we plan to use the data to develop multimodal machine
learning models for emotion recognition. Then, we plan to implement the best models
in a smartwatch app and evaluate its performance in real-time and everyday life through
another field study. Such a system could enable research both in the lab (e.g. couple
therapy) and in daily life (assessment of chronic disease management or relationship
quality) and enable interventions to improve the emotional well-being, relationship
quality, and chronic disease management of couples.
Zero-Shot Learning (ZSL) is a new paradigm in machine learning that aims to recognize
the classes that are not present in the training data. Hence, this paradigm is capable
of comprehending the categories that were never seen before. While deep learning has
pushed the limits of unseen object recognition, ZSL for temporal problems such as
unfamiliar gesture recognition (referred to as ZSGL) remains unexplored. ZSGL has the
potential to result in efficient human-machine interfaces that can recognize and understand
the spontaneous and conversational gestures of humans. In this regard, the objective
of this work is to conceptualize, model and develop a framework to tackle ZSGL problems.
The first step in the pipeline is to develop a database of gesture attributes that
are representative of a range of categories. Next, a deep architecture consisting
of convolutional and recurrent layers is proposed to jointly optimize the semantic
and classification losses. Lastly, rigorous experiments are performed to compare the
proposed model with respect to existing ZSL models on CGD 2013 and MSRC-12 datasets.
In our preliminary work, we identified a list of 64 discriminative attributes related
to gestures' morphological characteristics. Our approach yields an unseen-class accuracy
of 41%, which outperforms the state-of-the-art approaches by a considerable margin.
Future work involves the following: 1. Modifying the existing architecture in order
to improve the ZSL accuracy, 2. Augmenting the database of attributes to incorporate
semantic properties, 3. Addressing the issue of data imbalance which is inherent to
ZSL problems, and 4. Expanding this research to other domains such as surgeme and
action recognition.
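A sketch of the joint optimization idea, assuming a GRU encoder, placeholder dimensions, and a simple weighted sum of a classification loss over seen classes and an attribute-regression loss; at test time, unseen classes would be assigned by matching predicted attributes to the class attribute vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZSGLModel(nn.Module):
    """Sketch: gesture encoder predicting both seen-class logits and semantic attributes."""
    def __init__(self, feat_dim=64, n_seen_classes=20, n_attributes=64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 128, batch_first=True)
        self.cls_head = nn.Linear(128, n_seen_classes)
        self.attr_head = nn.Linear(128, n_attributes)

    def forward(self, x):                          # x: (batch, time, feat_dim)
        _, h = self.encoder(x)
        return self.cls_head(h[-1]), self.attr_head(h[-1])

def joint_loss(logits, attr_pred, labels, attr_true, alpha=0.5):
    # classification loss on seen classes + regression loss on attribute vectors
    return F.cross_entropy(logits, labels) + alpha * F.mse_loss(attr_pred, attr_true)
```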
SESSION: Demo and Exhibit Papers
- Cigdem Turan
- Patrick Schramowski
- Constantin Rothkopf
- Kristian Kersting
This work introduces Alfie, an interactive robot that is capable of answering moral
(deontological) questions from a user. The interaction is designed so that the user
can offer an alternative answer when they disagree with the given one, allowing Alfie
to learn from its interactions. Alfie's answers are based
on a sentence embedding model that uses state-of-the-art language models, e.g. Universal
Sentence Encoder and BERT. Alfie is implemented on a Furhat Robot, which provides
a customizable user interface to design a social robot.
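The answering mechanism can be illustrated with a generic sentence-embedding similarity sketch; the sentence-transformers model used here is only a stand-in for the Universal Sentence Encoder / BERT-based model mentioned above, and the question, candidate answers, and scoring are hypothetical.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")      # stand-in sentence embedding model
question = "Should I lie to my friend?"
answers = ["Yes, that is acceptable.", "No, you should not."]

q_vec, a_vecs = model.encode([question])[0], model.encode(answers)
scores = a_vecs @ q_vec / (np.linalg.norm(a_vecs, axis=1) * np.linalg.norm(q_vec))
print(answers[int(np.argmax(scores))])               # pick the closest candidate answer
```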
- Alejandro Peña
- Ignacio Serna
- Aythami Morales
- Julian Fierrez
With the aim of studying how current multimodal AI algorithms based on heterogeneous
sources of information are affected by sensitive elements and inner biases in the
data, this demonstrator experiments over an automated recruitment testbed based on
Curriculum Vitae: FairCVtest. The presence of decision-making algorithms in society
is rapidly increasing nowadays, while concerns about their transparency and the possibility
of these algorithms becoming new sources of discrimination are arising. This demo
shows the capacity of the Artificial Intelligence (AI) behind a recruitment tool to
extract sensitive information from unstructured data and exploit it in combination
with data biases in undesirable (unfair) ways. Additionally, the demo includes a new
algorithm (SensitiveNets) for discrimination-aware learning which eliminates sensitive
information in our multimodal AI framework.
- Sarah Ita Levitan
- Xinyue Tan
- Julia Hirschberg
Humans are notoriously poor at detecting deception; most perform worse than chance.
To address this issue we have developed LieCatcher, a single-player web-based Game
With A Purpose (GWAP) that allows players to assess their lie detection skills while
providing human judgments of deceptive speech. Players listen to audio recordings
drawn from a corpus of deceptive and non-deceptive interview dialogues, and guess
if the speaker is lying or telling the truth. They are awarded points for correct
guesses and at the end of the game they receive a score summarizing their performance
at lie detection. We present the game design and implementation, and describe a crowdsourcing
experiment conducted to study perceived deception.
- Carla Viegas
- Albert Lu
- Annabel Su
- Carter Strear
- Yi Xu
- Albert Topdjian
- Daniel Limon
- J.J. Xu
Enthusiasm in speech has a huge impact on listeners. Students of enthusiastic teachers
show better performance. Enthusiastic leaders influence employees' innovative
behavior and can also spark excitement in customers. We, at TalkMeUp, want to help
people learn how to talk with enthusiasm in order to spark creativity among their
listeners. In this work we present a multimodal speech analysis platform.
We provide feedback on enthusiasm by analyzing eye contact, facial expressions, voice
prosody, and text content.
- Edgar Rojas-Muñoz
- Kyle Couperus
- Juan P. Wachs
Telementoring generalist surgeons as they treat patients can be essential when in
situ expertise is not readily available. However, adverse cyber-attacks, unreliable
network conditions, and remote mentors' predisposition can significantly jeopardize
the remote intervention. To provide medical practitioners with guidance when mentors
are unavailable, we present the AI-Medic, the initial steps towards the development
of a multimodal intelligent artificial system for autonomous medical mentoring. The
system uses a tablet device to acquire the view of an operating field. This imagery
is provided to an encoder-decoder neural network trained to predict medical instructions
from the current view of a surgery. The network was trained using DAISI, a dataset
including images and instructions providing step-by-step demonstrations of surgical
procedures. The predicted medical instructions are conveyed to the user via visual
and auditory modalities.
SESSION: Grand Challenge Papers: Emotion Recognition in the Wild Challenge
- Zehui Yu
- Xiehe Huang
- Xiubao Zhang
- Haifeng Shen
- Qun Li
- Weihong Deng
- Jian Tang
- Yi Yang
- Jieping Ye
Driver gaze prediction is an important task in Advanced Driver Assistance System (ADAS).
Although the Convolutional Neural Network (CNN) can greatly improve the recognition
ability, there are still several unsolved problems due to the challenge of illumination,
pose and camera placement. To solve these difficulties, we propose an effective multi-model
fusion method for driver gaze estimation. Rich appearance representations, i.e. holistic
and eyes regions, and geometric representations, i.e. landmarks and Delaunay angles,
are separately learned to predict the gaze, followed by a score-level fusion system.
Moreover, pseudo-3D appearance supervision and identity-adaptive geometric normalization
are proposed to further enhance the prediction accuracy. Finally, the proposed method
achieves state-of-the-art accuracy of 82.5288% on the test data, ranking 1st in the
EmotiW 2020 driver gaze prediction sub-challenge.
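Score-level fusion itself is straightforward; the sketch below shows a weighted average of per-model class-probability matrices, with placeholder weights and dummy scores rather than the paper's tuned fusion.

```python
import numpy as np

def fuse_scores(score_mats, weights=None):
    """Sketch: weighted average of per-model class-probability matrices (score-level fusion)."""
    score_mats = np.stack(score_mats)                 # (n_models, n_samples, n_classes)
    w = np.ones(len(score_mats)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    return np.tensordot(w, score_mats, axes=1)        # (n_samples, n_classes)

# dummy scores from an appearance model and a geometric model over 9 gaze zones
appearance = np.random.dirichlet(np.ones(9), size=5)
geometric = np.random.dirichlet(np.ones(9), size=5)
fused = fuse_scores([appearance, geometric], weights=[0.6, 0.4])
pred = fused.argmax(axis=1)
```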
- Jianming Wu
- Bo Yang
- Yanan Wang
- Gen Hattori
This paper proposes an advanced multi-instance learning method with multi-features
engineering and conservative optimization for engagement intensity prediction. It
was applied to the EmotiW Challenge 2020 and the results demonstrated the proposed
method's good performance. The task is to predict the engagement level when a subject-student
is watching an educational video under a range of conditions and in various environments.
As engagement intensity has a strong correlation with facial movements, upper-body
posture movements and overall environmental movements in a given time interval, we
extract and incorporate these motion features into a deep regression model consisting
of layers with a combination of long short-term memory (LSTM), gated recurrent unit
(GRU) and a fully connected layer. In order to precisely and robustly predict the
engagement level in a long video with various situations such as darkness and complex
backgrounds, a multi-features engineering function is used to extract synchronized
multi-model features in a given period of time by considering both short-term and
long-term dependencies. Based on these well-processed engineered multi-features, in
the 1st training stage, we train and generate the best models covering all the model
configurations to maximize validation accuracy. Furthermore, in the 2nd training stage,
to avoid the overfitting problem attributable to the extremely small engagement dataset,
we conduct conservative optimization by applying a single Bi-LSTM layer with only
16 units to minimize the overfitting, and split the engagement dataset (train + validation)
with 5-fold cross-validation (stratified k-fold) to train a conservative model. The
proposed method, using a decision-level ensemble of the two training stages' models,
finally won second place in the challenge (MSE: 0.061110 on the test set).
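A minimal sketch of the conservative second-stage setup described above: a deliberately small Bi-LSTM and a stratified 5-fold split. Stratifying over binned engagement labels is an assumption (the target is continuous), and the feature dimensions are placeholders.

```python
import torch.nn as nn
from sklearn.model_selection import StratifiedKFold

class SmallBiLSTM(nn.Module):
    """Sketch: a deliberately small Bi-LSTM regressor to limit overfitting."""
    def __init__(self, feat_dim=32, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)          # engagement intensity in [0, 1]

    def forward(self, x):                            # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)
        return self.out(h.mean(dim=1)).squeeze(-1)

# stratified 5-fold split over binned engagement labels (binning is an assumption)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# for train_idx, val_idx in skf.split(features, binned_labels): train one fold model
```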
- Abhinav Dhall
- Garima Sharma
- Roland Goecke
- Tom Gedeon
This paper introduces the Eighth Emotion Recognition in the Wild (EmotiW) challenge.
EmotiW is a benchmarking effort run as a grand challenge of the 22nd ACM International
Conference on Multimodal Interaction 2020. It comprises four tasks related to automatic
human behavior analysis: a) driver gaze prediction; b) audio-visual group-level emotion
recognition; c) engagement prediction in the wild; and d) physiological signal based
emotion recognition. The motivation of EmotiW is to bring researchers in affective
computing, computer vision, speech processing and machine learning to a common platform
for evaluating techniques on common test data. We discuss the challenge protocols, databases
and their associated baselines.
- Kui Lyu
- Minghao Wang
- Liyu Meng
Recent studies have shown that most traffic accidents are related to the driver's
engagement in the driving process. Driver gaze is considered an important cue for
monitoring driver distraction. While there has been marked improvement in driver gaze
region estimation systems, many challenges remain, such as cross-subject testing,
perspectives, and sensor configuration. In this paper, we propose a Convolutional
Neural Network (CNN) based multi-model fusion gaze zone estimation system. Our
method mainly consists of two blocks, which implement the extraction of gaze features
from RGB images and the estimation of gaze from head pose features. Based on the
original input image, a general face processing model is first used to detect the face
and localize 3D landmarks, and the most relevant facial information is then extracted
from it. We implement three face alignment methods to normalize the face information.
For the above image-based features, a multi-input CNN classifier obtains reliable
classification accuracy. In addition, we design a 2D CNN based PointNet to predict the
head pose representation from 3D landmarks. Finally, we evaluate our best-performing
model on the Eighth EmotiW Driver Gaze Prediction sub-challenge test dataset. Our
model achieves a competitive overall accuracy of 81.5144% for gaze zone estimation
on the cross-subject test dataset.
- Boyang Tom Jin
- Leila Abdelrahman
- Cong Kevin Chen
- Amil Khanzada
Determining the emotional sentiment of a video remains a challenging task that requires
multimodal, contextual understanding of a situation. In this paper, we describe our
entry into the EmotiW 2020 Audio-Video Group Emotion Recognition Challenge to classify
group videos containing large variations in language, people, and environment, into
one of three sentiment classes. Our end-to-end approach consists of independently
training models for different modalities, including full-frame video scenes, human
body keypoints, embeddings extracted from audio clips, and image-caption word embeddings.
Novel combinations of modalities, such as laughter and image-captioning, and transfer
learning are further developed. We use fully-connected (FC) fusion ensembling to aggregate
the modalities, achieving a best test accuracy of 63.9% which is 16 percentage points
higher than that of the baseline ensemble.
- Chuanhe Liu
- Wenqiang Jiang
- Minghao Wang
- Tianhao Tang
This paper presents a hybrid network for audio-video group emotion recognition. The
proposed architecture includes an audio stream, a facial emotion stream, an environmental
object statistics stream (EOS), and a video stream. We adopted this method at the 8th
Emotion Recognition in the Wild Challenge (EmotiW 2020). According to the feedback on our
submissions, the best result achieved 76.85% on the Video level Group AFfect (VGAF) Test
Database, 26.89% higher than the baseline. Such improvements prove that our method is state-of-the-art.
- Anastasia Petrova
- Dominique Vaufreydaz
- Philippe Dessus
This article presents our unimodal privacy-safe and non-individual proposal for the
audio-video group emotion recognition subtask at the Emotion Recognition in the Wild
(EmotiW) Challenge 2020. This sub-challenge aims to classify in-the-wild videos into
three categories: Positive, Neutral and Negative. Recent deep learning models have
shown tremendous advances in analyzing interactions between people, predicting human
behavior and affective evaluation. Nonetheless, their performance comes from individual-based
analysis, which means summing up and averaging scores from individual detections,
which inevitably leads to some privacy issues. In this research, we investigated a
frugal approach towards a model able to capture the global moods from the whole image
without using face or pose detection, or any individual-based feature as input. The
proposed methodology mixes state-of-the-art and dedicated synthetic corpora as training
sources. With an in-depth exploration of neural network architectures for group-level
emotion recognition, we built a VGG-based model achieving 59.13% accuracy on the VGAF
test set (eleventh place of the challenge). Given that the analysis is unimodal based
only on global features and that the performance is evaluated on a real-world dataset,
these results are promising and let us envision extending this model to multimodality
for classroom ambiance evaluation, our final target application.
- Sandra Ottl
- Shahin Amiriparian
- Maurice Gerczuk
- Vincent Karas
- Björn Schuller
The objectives of this challenge paper are twofold: first, we apply a range of neural
network based transfer learning approaches to cope with the data scarcity in the field
of speech emotion recognition, and second, we fuse the obtained representations and
predictions in an early and late fusion strategy to check the complementarity of the
applied networks. In particular, we use our Deep Spectrum system to extract deep feature
representations from the audio content of the 2020 EmotiW group level emotion prediction
challenge data. We evaluate a total of ten ImageNet pre-trained Convolutional Neural
Networks, including AlexNet, VGG16, VGG19 and three DenseNet variants as audio feature
extractors. We compare their performance to the ComParE feature set used in the challenge
baseline, employing simple logistic regression models trained with Stochastic Gradient
Descent as classifiers. With the help of late fusion, our approach improves the performance
on the test set from 47.88% to 62.70% accuracy.
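The feature-extraction-plus-linear-classifier pipeline, with late fusion by averaging class probabilities, can be sketched as follows; the random placeholder features stand in for Deep-Spectrum-style CNN descriptors of mel-spectrogram plots, the dimensions are illustrative, and the "log_loss" option assumes scikit-learn 1.1 or later.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# placeholders for deep features from two pretrained CNNs (e.g. a VGG and a DenseNet)
rng = np.random.default_rng(0)
deep_feats_a = rng.standard_normal((100, 4096))
deep_feats_b = rng.standard_normal((100, 1024))
labels = rng.integers(0, 3, 100)                      # positive / neutral / negative

clf_a = SGDClassifier(loss="log_loss", max_iter=1000).fit(deep_feats_a, labels)
clf_b = SGDClassifier(loss="log_loss", max_iter=1000).fit(deep_feats_b, labels)

# late fusion: average the two models' class-probability estimates
probs = (clf_a.predict_proba(deep_feats_a) + clf_b.predict_proba(deep_feats_b)) / 2
pred = probs.argmax(axis=1)
```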
- Yanan Wang
- Jianming Wu
- Panikos Heracleous
- Shinya Wada
- Rui Kimura
- Satoshi Kurihara
Audio-video group emotion recognition is a challenging task since it is difficult
to gather a broad range of potential information to obtain meaningful emotional representations.
Humans can easily understand emotions because they can associate implicit contextual
knowledge (contained in our memory) when processing explicit information they can
see and hear directly. This paper proposes an end-to-end architecture called implicit
knowledge injectable cross attention audiovisual deep neural network (K-injection
audiovisual network) that imitates this intuition. The K-injection audiovisual network
is used to train an audiovisual model that can not only obtain audiovisual representations
of group emotions through an explicit feature-based cross attention audiovisual subnetwork
(audiovisual subnetwork), but is also able to absorb implicit knowledge of emotions
through two implicit knowledge-based injection subnetworks (K-injection subnetwork).
In addition, it is trained with explicit features and implicit knowledge but can easily
make inferences using only explicit features. We define the region of interest (ROI)
visual features and Mel-spectrogram audio features as explicit features, which are
directly present in the raw audio-video data. On the other hand, we define the linguistic
and acoustic emotional representations that do not exist in the audio-video data as
implicit knowledge. The implicit knowledge distilled by adapting video situation descriptions
and basic acoustic features (MFCCs, pitch and energy) to linguistic and acoustic K-injection
subnetworks is defined as linguistic and acoustic knowledge, respectively. When compared
to the baseline accuracy for the testing set of 47.88%, the average of the audiovisual
models trained with the (linguistic, acoustic and linguistic-acoustic) K-injection
subnetworks achieved an overall accuracy of 66.40%.
- Mo Sun
- Jian Li
- Hui Feng
- Wei Gou
- Haifeng Shen
- Jian Tang
- Yi Yang
- Jieping Ye
This paper presents our approach for Audio-video Group Emotion Recognition sub-challenge
in the EmotiW 2020. The task is to classify a video into one of the group emotions
such as positive, neutral, and negative. Our approach exploits two different feature
levels for this task: a spatio-temporal feature level and a static feature level. At
the spatio-temporal feature level, we feed multiple input modalities (RGB, RGB difference,
optical flow, and warped optical flow) into multiple video classification networks to
train the spatio-temporal model. At the static feature level, we crop all faces and
bodies in an image with a state-of-the-art human pose estimation method and train
several kinds of CNNs with the image-level labels of group emotions. Finally, we fuse
the results of all 14 models and achieve
the third place in this sub-challenge with classification accuracies of 71.93% and
70.77% on the validation set and test set, respectively.
- Bin Zhu
- Xinjie Lan
- Xin Guo
- Kenneth E. Barner
- Charles Boncelet
Engagement detection is essential in many areas such as driver attention tracking,
employee engagement monitoring, and student engagement evaluation. In this paper,
we propose a novel approach using attention based hybrid deep models for the 8th Emotion
Recognition in the Wild (EmotiW 2020) Grand Challenge in the category of engagement
prediction in the wild. The task aims to predict the engagement intensity
of subjects in videos, and the subjects are students watching educational videos from
Massive Open Online Courses (MOOCs). To complete the task, we propose a hybrid deep
model based on multi-rate and multi-instance attention. The novelty of the proposed
model can be summarized in three aspects: (a) an attention based Gated Recurrent Unit
(GRU) deep network, (b) heuristic multi-rate processing on video based data, and (c)
a rigorous and accurate ensemble model. Experimental results on the validation set
and test set show that our method makes promising improvements, achieving a competitively
low MSE of 0.0541 on the test set, improving on the baseline results by 64%. The proposed
model won the first place in the engagement prediction in the wild challenge.
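A minimal sketch of attention pooling over GRU outputs for engagement regression; the multi-rate aspect could be handled by running such a branch on sequences sampled at different rates and ensembling the outputs, which is not shown here, and all sizes are placeholders.

```python
import torch
import torch.nn as nn

class AttentiveGRU(nn.Module):
    """Sketch: GRU with soft attention pooling over time for engagement regression."""
    def __init__(self, feat_dim=64, hidden=64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                         # x: (batch, time, feat_dim)
        h, _ = self.gru(x)                        # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)    # attention weights over time steps
        ctx = (w * h).sum(dim=1)                  # attention-pooled representation
        return self.out(ctx).squeeze(-1)          # predicted engagement intensity

score = AttentiveGRU()(torch.randn(2, 120, 64))   # two dummy feature sequences
```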
- Shivam Srivastava
- Saandeep Aathreya SIdhapur Lakshminarayan
- Saurabh Hinduja
- Sk Rahatul Jannat
- Hamza Elhamdadi
- Shaun Canavan
In this work, we present our approach for all four tracks of the eighth Emotion Recognition
in the Wild Challenge (EmotiW 2020). The four tasks are group emotion recognition,
driver gaze prediction, predicting engagement in the wild, and emotion recognition
using physiological signals. We explore multiple approaches including classical machine
learning tools such as random forests, state-of-the-art deep neural networks, and
multiple fusion and ensemble-based approaches. We also show that similar approaches
can be used across tracks as many of the features generalize well to the different
problems (e.g. facial features). We detail evaluation results that are either comparable
to or outperform the baseline results for both the validation and testing for most
of the tracks.
- Lukas Stappen
- Georgios Rizos
- Björn Schuller
Reliable systems for automatic estimation of the driver's gaze are crucial for reducing
the number of traffic fatalities and for many emerging research areas aimed at developing
intelligent vehicle-passenger systems. Gaze estimation is a challenging task, especially
in environments with varying illumination and reflection properties. Furthermore,
there is wide diversity with respect to the appearance of drivers' faces, both in
terms of occlusions (e.g. vision aids) and cultural/ethnic backgrounds. For this reason,
analysing the face along with contextual information - for example, the vehicle cabin
environment - adds another, less subjective signal towards the design of robust systems
for passenger gaze estimation. In this paper, we present an integrated approach to
jointly model different features for this task. In particular, to improve the fusion
of the visually captured environment with the driver's face, we have developed a contextual
attention mechanism, X-AWARE, attached directly to the output convolutional layers
of InceptionResNetV2 networks. In order to showcase the effectiveness of our approach,
we use the Driver Gaze in the Wild dataset, recently released as part of the Eighth
Emotion Recognition in the Wild (EmotiW) Challenge. Our best model outperforms
the baseline by an absolute 15.03% in accuracy on the validation set, and improves
the previously best reported result by an absolute 8.72% on the test set.
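The following is a rough, hypothetical sketch of the general idea of weighting context features by attention derived from the face branch; it is not the paper's X-AWARE mechanism, and the channel count (1536, assumed to match InceptionResNetV2's final feature maps) and the nine output zones are assumptions.

```python
import torch
import torch.nn as nn

class ContextualAttentionFusion(nn.Module):
    """Sketch: weight context-branch feature maps by attention computed from the face branch."""
    def __init__(self, channels=1536):                # assumed final conv channel count
        super().__init__()
        self.att = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.fc = nn.Linear(2 * channels, 9)          # e.g. 9 gaze zones (placeholder)

    def forward(self, face_map, ctx_map):             # both: (batch, channels, H, W)
        a = self.att(face_map)                        # (batch, 1, H, W) attention mask
        face_vec = face_map.mean(dim=(2, 3))
        ctx_vec = (a * ctx_map).mean(dim=(2, 3))      # attention-weighted context pooling
        return self.fc(torch.cat([face_vec, ctx_vec], dim=1))
```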
SESSION: Workshops Summaries
- Heysem Kaya
- Roy S. Hessels
- Maryam Najafian
- Sandra Hanekamp
- Saeid Safavi
Child behaviour is a topic of wide scientific interest among many different disciplines,
including social and behavioural sciences and artificial intelligence (AI). In this
workshop, we aimed to connect researchers from these fields to address topics such
as the usage of AI to better understand and model child behavioural and developmental
processes, challenges and opportunities for AI in large-scale child behaviour analysis
and implementing explainable ML/AI on sensitive child data. The workshop served as
a successful first step towards this goal and attracted contributions from different
research disciplines on the analysis of child behaviour. This paper provides a summary
of the activities of the workshop and the accepted papers and abstracts.
- Keith Curtis
- George Awad
- Shahzad Rajput
- Ian Soboroff
This is the introduction paper to the International Workshop on Deep Video Understanding,
organized at the 22nd ACM International Conference on Multimodal Interaction. In recent
years, a growing trend towards understanding videos (in particular movies) at a deeper
level has motivated researchers working in multimedia and computer vision to present
new approaches and datasets that tackle this problem. This is a challenging
research area which aims to develop a deep understanding of the relations which exist
between different individuals and entities in movies using all available modalities
such as video, audio, text and metadata. The aim of this workshop is to foster innovative
research in this new direction and to provide benchmarking evaluations to advance
technologies in the deep video understanding community.
- Zakia Hammal
- Di Huang
- Kévin Bailly
- Liming Chen
- Mohamed Daoudi
The goal of the Face and Gesture Analysis for Health Informatics workshop is to share
and discuss the achievements as well as the challenges in using computer vision and
machine learning for automatic human behavior analysis and modeling for clinical research
and healthcare applications. The workshop aims to promote current research and support
growth of multidisciplinary collaborations to advance this groundbreaking research.
The meeting gathers scientists working in related areas of computer vision and machine
learning, multi-modal signal processing and fusion, human centered computing, behavioral
sensing, assistive technologies, and medical tutoring systems for healthcare applications
and medicine.
- Hayley Hung
- Gabriel Murray
- Giovanna Varni
- Nale Lehmann-Willenbrock
- Fabiola H. Gerpott
- Catharine Oertel
There has been gathering momentum over the last 10 years in the study of group behavior
in multimodal multiparty interactions. While many works in the computer science community
focus on the analysis of individual or dyadic interactions, we believe that the study
of groups adds an additional layer of complexity with respect to how humans cooperate
and what outcomes can be achieved in these settings. Moreover, the development of
technologies that can help to interpret and enhance group behaviours dynamically is
still an emerging field. Social theories that accompany the study of group dynamics
are in their infancy and there is a need for more interdisciplinary dialogue between
computer scientists and social scientists on this topic. This workshop has been organised
to facilitate those discussions and strengthen the bonds between these overlapping
research communities.
- Carlos Velasco
- Anton Nijholt
- Charles Spence
- Takuji Narumi
- Kosuke Motoki
- Gijs Huisman
- Marianna Obrist
Here, we present the outcome of the 4th workshop on Multisensory Approaches to Human-Food
Interaction (MHFI), developed in collaboration with ICMI 2020 in Utrecht, The Netherlands.
Capitalizing on the increasing interest in multisensory aspects of human-food interaction
and the unique contribution that our community offers, we developed a space to discuss
ideas ranging from mechanisms of multisensory food perception, through multisensory
technologies, to new applications of systems in the context of MHFI. All in all, the
workshop involved 11 contributions, which will hopefully further help shape the basis
of a field of inquiry that grows as we see progress in our understanding of the senses
and the development of new technologies in the context of food.
- Itir Onal Ertugrul
- Jeffrey F. Cohn
- Hamdi Dibeklioglu
This paper presents an introduction to the Multimodal Interaction in Psychopathology
workshop, which is held virtually in conjunction with the 22nd ACM International Conference
on Multimodal Interaction on October 25th, 2020. This workshop has attracted submissions
in the context of investigating multimodal interaction to reveal mechanisms and assess,
monitor, and treat psychopathology. Keynote speakers from diverse disciplines present
an overview of the field from different vantage points and comment on future directions.
Here we summarize the goals and the content of the workshop.
- Dennis Küster
- Felix Putze
- Patrícia Alves-Oliveira
- Maike Paetzel
- Tanja Schultz
Detecting, modeling, and making sense of multimodal data from human users in the wild
still poses numerous challenges. Starting from aspects of data quality and reliability
of our measurement instruments, the multidisciplinary endeavor of developing intelligent
adaptive systems in human-computer or human-robot interaction (HCI, HRI) requires
a broad range of expertise and more integrative efforts to make such systems reliable,
engaging, and user-friendly. At the same time, the spectrum of applications for machine
learning and modeling of multimodal data in the wild keeps expanding. From the classroom
to the robot-assisted operation theatre, our workshop aims to support a vibrant exchange
about current trends and methods in the field of modeling multimodal data in the wild.
- Arjan van Hessen
- Silvia Calamai
- Henk van den Heuvel
- Stefania Scagliola
- Norah Karrouche
- Jeannine Beeken
- Louise Corti
- Christoph Draxler
Interview data is multimodal data: it consists of speech sound, facial expression
and gestures, captured in a particular situation, and containing textual information
and emotion. This workshop shows how a multidisciplinary approach may exploit the
full potential of interview data. The workshop first gives a systematic overview of
the research fields working with interview data. It then presents the speech technology
currently available to support transcribing and annotating interview data, such as
automatic speech recognition, speaker diarization, and emotion detection. Finally,
scholars who work with interview data and tools may present their work and discover
how to make use of existing technology.
- Theodoros Kostoulas
- Michal Muszynski
- Theodora Chaspari
- Panos Amelidis
The term 'aesthetic experience' corresponds to the inner state of a person exposed
to form and content of artistic objects. Exploring certain aesthetic values of artistic
objects, as well as interpreting the aesthetic experience of people when exposed to
art can contribute towards understanding (a) art and (b) people's affective reactions
to artwork. Focusing on different types of artistic content, such as movies, music,
urban art and other artwork, the goal of this workshop is to enhance the interdisciplinary
collaboration between affective computing and aesthetics researchers.
- Leonardo Angelini
- Mira El Kamali
- Elena Mugellini
- Omar Abou Khaled
- Yordan Dimitrov
- Vera Veleva
- Zlatka Gospodinova
- Nadejda Miteva
- Richar Wheeler
- Zoraida Callejas
- David Griol
- Kawtar Benghazi
- Manuel Noguera
- Panagiotis Bamidis
- Evdokimos Konstantinidis
- Despoina Petsani
- Andoni Beristain Iraola
- Dimitrios I. Fotiadis
- Gérard Chollet
- Inés Torres
- Anna Esposito
- Hannes Schlieter
e-Coaches are promising intelligent systems that aim at supporting everyday human
life, dispatching advice through different interfaces, such as apps, conversational
interfaces and augmented reality interfaces. This workshop aims at exploring how e-coaches
might benefit from spatially and time-multiplexed interfaces and from different communication
modalities (e.g., text, visual, audio, etc.) according to the context of the interaction.
- Hiroki Tanaka
- Satoshi Nakamura
- Jean-Claude Martin
- Catherine Pelachaud
This workshop discusses how interactive, multimodal technology such as virtual agents
can be used in social skills training for measuring and training social-affective
interactions. Sensing technology now enables the analysis of users' behaviors and physiological
signals. Various signal processing and machine learning methods can be used for such
prediction tasks. Such social signal processing and tools can be applied to measure
and reduce social stress in everyday situations, including public speaking at schools
and workplaces.
- Eleonora Ceccaldi
- Benoit Bardy
- Nadia Bianchi-Berthouze
- Luciano Fadiga
- Gualtiero Volpe
- Antonio Camurri
Multimodal interfaces pose the challenge of dealing with the multiple interactive
time-scales characterizing human behavior. To do this, innovative models and time-adaptive
technologies are needed, operating at multiple time-scales and adopting a multi-layered
approach. The first International Workshop on Multi-Scale Movement Technologies, hosted
virtually during the 22nd ACM International Conference on Multimodal Interaction, is
aimed at providing researchers from different areas with the opportunity to discuss
this topic. This paper summarizes the activities of the workshop and the accepted papers.