ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
SESSION: Keynote & Invited Talks
Human verbal and nonverbal expressions carry crucial information not only about intent
but also emotions, individual identity, and the state of health and wellbeing. From
a basic science perspective, understanding how such rich information is encoded in
How can we create technologies to help us reflect on and change our behavior,
improving our health and overall wellbeing? In this talk, I will briefly describe
the last several years of work our research team has been doing in this area. We ...
What are facial expressions for? In social-functional accounts, they are efficient
adaptations that are used flexibly to address the problems inherent to successful
social living. Facial expressions both broadcast emotions and regulate the emotions
Humans interact with the world using five major senses: sight, hearing, touch, smell,
and taste. Almost all interaction with the environment is naturally multimodal, as
audio, tactile or paralinguistic cues provide confirmation for physical actions and
SESSION: Session 1: Multiparty Interaction
We present dialogue management routines for a system to engage in multiparty agent-infant
interaction. The ultimate purpose of this research is to help infants learn a visual
sign language by engaging them in naturalistic and socially contingent ...
We address the problem of automatically predicting group performance on a task, using
multimodal features derived from the group conversation. These include acoustic features
extracted from the speech signal, and linguistic features derived from the ...
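A common way to combine features from separate modalities, as in this line of work, is early fusion: concatenating per-modality feature vectors and fitting a single predictor. The following is a minimal, illustrative sketch with synthetic data — the dimensions, variable names, and the ridge regressor are assumptions for demonstration, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: one acoustic and one linguistic feature vector per group.
n_groups, d_acoustic, d_linguistic = 40, 6, 4
X_acoustic = rng.normal(size=(n_groups, d_acoustic))
X_linguistic = rng.normal(size=(n_groups, d_linguistic))

# Synthetic target: group performance as a noiseless linear function of both modalities.
w_true = rng.normal(size=d_acoustic + d_linguistic)
y = np.concatenate([X_acoustic, X_linguistic], axis=1) @ w_true

def fit_ridge(X, y, lam=1e-3):
    # Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Early fusion: concatenate modality features, then train one predictor on the result.
X_fused = np.concatenate([X_acoustic, X_linguistic], axis=1)
w = fit_ridge(X_fused, y)
pred = X_fused @ w
```

Late fusion (training one predictor per modality and combining their outputs) is the usual alternative when modalities are sampled at different rates or are often missing.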
We model coordination and coregulation patterns in 33 triads engaged in collaboratively
solving a challenging computer programming task for approximately 20 minutes. Our
goal is to prospectively model speech rate (words/sec) - an important signal of ...
We explored the gaze behavior towards the end of utterances and dialogue act (DA),
i.e., verbal-behavior information indicating the intention of an utterance, during
turn-keeping/changing to estimate empathy skill levels in multiparty discussions.
SESSION: Session 2: Physiological Modeling
Automated measurement of affective behavior in psychopathology has been limited primarily
to screening and diagnosis. While useful, clinicians more often are concerned with
whether patients are improving in response to treatment. Are symptoms abating, ...
Smell is a powerful tool for conveying and recalling information without requiring
visual attention. Previous work identified, however, some challenges caused by users'
unfamiliarity with this modality and the complexity of scent delivery. We are now
Automatic emotion recognition has long been developed by concentrating on modeling
human expressive behavior. At the same time, neuroscientific evidence has shown
that the varied neuro-responses (i.e., blood oxygen level-dependent (BOLD) signals
Despite the great potential, Massive Open Online Courses (MOOCs) face major challenges
such as low retention rate, limited feedback, and lack of personalization. In this
paper, we report the results of a longitudinal study on AttentiveReview2, a ...
The aim was to study if odors evaporated by an olfactory display prototype can be
used to affect participants' cognitive and emotion-related responses to audio-visual
stimuli, and whether the display can benefit from objective measurement of the odors.
SESSION: Session 3: Sound and Interaction
The task of identifying when to take a conversational turn is an important function
of spoken dialogue systems. The turn-taking system should also ideally be able to
handle many types of dialogue, from structured conversation to spontaneous and ...
This paper presents a summary and critical reflection on ten major opportunities and
challenges for advancing the field of multimodal learning analytics (MLA). It identifies
emerging technology trends likely to disrupt learning analytics, challenges ...
Digital home assistants have an increasing influence on our everyday lives. The media
now reports how children adopt the consequential, imperious language style when talking
to real people. As a response to this behavior, we considered a digital ...
This article tackles the detection of the user's likes and dislikes in a negotiation
with a virtual agent, to help build a model of the user's preferences.
We introduce a linguistic model of user's likes and dislikes as they are ...
Automatic speech recognition can potentially benefit from the lip motion patterns,
complementing acoustic speech to improve the overall recognition performance, particularly
in noise. In this paper we propose an audio-visual fusion strategy that goes ...
SESSION: Session 4: Touch and Gesture
Body posture is a good indicator of, amongst other things, people's state of arousal,
focus of attention and level of interest in a conversation. Posture is conventionally
measured by observation and hand coding of videos or, more recently, through ...
Nearest neighbor classifiers recognize stroke gestures by computing a (dis)similarity
between a candidate gesture and a training set based on points, which may require
normalization, resampling, and rotation to a reference before processing. To ...
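The point-based matching such nearest-neighbor recognizers use can be sketched as follows — a minimal, illustrative Python sketch (function names and the 32-point resolution are assumptions, and rotation to a reference angle is omitted for brevity):

```python
import math

def resample(points, n=32):
    # Resample a stroke to n roughly equidistant points via linear interpolation.
    total = sum(math.dist(points[i], points[i + 1]) for i in range(len(points) - 1))
    if total == 0:
        return [points[0]] * n
    interval = total / (n - 1)
    out, acc, pts, i = [points[0]], 0.0, list(points), 0
    while len(out) < n and i < len(pts) - 1:
        seg = math.dist(pts[i], pts[i + 1])
        if seg > 0 and acc + seg >= interval:
            t = (interval - acc) / seg
            q = (pts[i][0] + t * (pts[i + 1][0] - pts[i][0]),
                 pts[i][1] + t * (pts[i + 1][1] - pts[i][1]))
            out.append(q)
            pts.insert(i + 1, q)  # continue measuring from the new point
            acc = 0.0
        else:
            acc += seg
        i += 1
    while len(out) < n:       # pad with the last point if rounding fell short
        out.append(pts[-1])
    return out

def normalize(points, n=32):
    # Translate the centroid to the origin and scale to a unit bounding box.
    pts = resample(points, n)
    cx, cy = sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n
    pts = [(p[0] - cx, p[1] - cy) for p in pts]
    scale = max(max(abs(p[0]) for p in pts), max(abs(p[1]) for p in pts)) or 1.0
    return [(p[0] / scale, p[1] / scale) for p in pts]

def dissimilarity(a, b):
    # Average point-to-point Euclidean distance between two normalized strokes.
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def classify(candidate, templates, n=32):
    # Nearest-neighbor: return the name of the most similar template.
    c = normalize(candidate, n)
    return min(templates, key=lambda name: dissimilarity(c, normalize(templates[name], n)))
```

For example, a roughly horizontal stroke would match a "line" template over a "vee" template, because after resampling and normalization its average point-wise distance to the flat shape is smallest.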
Combining mid-air gestures with pen input for bi-manual input on tablets has been
reported as an alternative and attractive input technique in drawing applications.
Previous work has also argued that mid-air gestural input can cause discomfort and
During medical interventions, direct interaction with medical image data is a cumbersome
task for physicians due to the sterile environment. Even though touchless input via
hand, foot or voice is possible, these modalities are not available for these ...
SESSION: Session 5: Human Behavior
A shared sense of humor can result in positive feelings associated with amusement,
laughter, and moments of bonding. If robotic companions could acquire their human
counterparts' sense of humor in an unobtrusive manner, they could improve their skills
Small group interaction occurs often in workplace and education settings. Its dynamic
progression is an essential factor in dictating the final group performance outcomes.
The personality of each individual within the group is reflected in his/her ...
Psychotic disorders are forms of severe mental illness characterized by abnormal social
function and a general sense of disconnect with reality. The evaluation of such disorders
is often complex, as their multifaceted nature is often difficult to ...
Constructing computational models of interactions during Forensic Interviews (FI)
with children presents a unique challenge in being able to maximize complete and accurate
information disclosure, while minimizing emotional trauma experienced by the ...
In human conversational interactions, turn-taking exchanges can be coordinated using
cues from multiple modalities. To design spoken dialog systems that can conduct fluid
interactions it is desirable to incorporate cues from separate modalities into ...
SESSION: Session 6: Artificial Agents
Convolutional neural networks (CNNs) are employed to estimate the visual focus of
attention (VFoA), also called gaze direction, in multiparty face-to-face meetings
on the basis of multimodal nonverbal behaviors including head pose, direction of the
In this paper we focus on detection of deception and suspicion from electrodermal
activity (EDA) measured on left and right wrists during a dyadic game interaction.
We aim to answer three research questions: (i) Is it possible to reliably distinguish
Emotion evoked by an advertisement plays a key role in influencing brand recall and
eventual consumer choices. Automatic ad affect recognition has several useful applications.
However, the use of content-based feature representations does not give ...
Laughter is a highly spontaneous behavior that frequently occurs during social interactions.
It serves as an expressive-communicative social signal which conveys a wide spectrum
of affective displays. Even though many studies have been performed on the ...
The inherent diversity of human behavior limits the capabilities of general large-scale
machine learning systems, which usually require ample amounts of data to provide robust
descriptors of the outcomes of interest. Motivated by this challenge, ...
SESSION: Poster Session 1
Cars provide drivers with task-related information (e.g. "Fill gas") mainly using
visual and auditory stimuli. However, those stimuli may distract or overwhelm the
driver, causing unnecessary stress. Here, we propose olfactory stimulation as a novel
In this work we analyze the importance of lexical and acoustic modalities in behavioral
expression and perception. We demonstrate that this importance relates to the amount
of therapy, and hence communication training, that a person received. It also ...
Older adults want to live independently and at the same time stay socially active.
We conducted contextual inquiry to understand what usability problems they face while
interacting with social media on touch screen devices. We found that it is hard for
The user experience (UX) of graphical user interfaces (GUIs) often depends on how
clearly visual designs communicate/signify "affordances", such as whether an element on
the screen can be pushed, dragged, or rotated. Especially for novice users figuring
In this paper, we extract features of head pose, eye gaze, and facial expressions
from video to estimate individual learners' attentional states in a classroom setting.
We concentrate on the analysis of different definitions for a student's attention
There are many mechanisms to sense arousal. Most of them are either intrusive, prone
to bias, costly, require skills to set up, or do not provide additional context to
the user's measure of arousal. We present arousal detection through the analysis of
We present PathWord (PATH passWORD), a multimodal digit entry method for ad-hoc authentication
based on known digit shapes and the user's relative eye movements. PathWord is a touch-free,
gaze-based input modality, which attempts to decrease shoulder surfing ...
Smart watches can enrich everyday interactions by providing both glanceable information
and instant access to frequent tasks. However, reading text messages on a 1.5-inch
small screen is inherently challenging, especially when a user's attention is ...
Despite the ubiquity and rapid growth of mobile reading activities, researchers and
practitioners today either rely on coarse-grained metrics such as click-through-rate
(CTR) and dwell time, or expensive equipment such as gaze trackers to understand ...
Motivated by the desire to give vehicles better information about their drivers, we
explore human intent inference in the setting of a human driver riding in a moving
vehicle. Specifically, we consider scenarios in which the driver intends to go to
In this paper, we introduce a novel gaze-only interaction technique called EyeLinks,
which was designed i) to support various types of discrete clickables (e.g. textual
links, buttons, images, tabs, etc.); ii) to be easy to learn and use; iii) to ...
Data Visualization has been receiving growing attention recently, with ubiquitous
smart devices designed to render information in a variety of ways. However, while
evaluations of visual tools for their interpretability and intuitiveness have been
The rising prevalence of mental illnesses is increasing the demand for new digital
tools to support mental wellbeing. Numerous collaborations spanning the fields of
psychology, machine learning and health are building such tools. Machine-learning
Quantitative analysis of gazes between a speaker and listeners was conducted from
the viewpoint of mutual activities in floor apportionment, with the assumption that
mutual gaze plays an important role in coordinating speech interaction. We conducted
The recent availability of lightweight, wearable cameras allows for collecting video
data from a "first-person" perspective, capturing the visual world of the wearer in
everyday interactive contexts. In this paper, we investigate how to exploit ...
Group meetings can suffer from serious problems that undermine performance, including
bias, "groupthink", fear of speaking, and unfocused discussion. To better understand
these issues, propose interventions, and thus improve team performance, we need to
Motivational Interviewing (MI) is a widely disseminated and effective therapeutic
approach for behavioral disorder treatment. Over the past decade, MI research has
identified client language as a central mediator between therapist skills and subsequent
We present a deep learning framework for real-time speech-driven 3D facial animation
from speech audio. Our deep neural network directly maps an input sequence of speech
spectrograms to a series of micro facial action unit intensities to drive a 3D ...
SESSION: Poster Session 2
This paper presents a novel approach in continuous emotion prediction that characterizes
dimensional emotion labels jointly with continuous and discretized representations.
Continuous emotion labels can capture subtle emotion variations, but their ...
Within the affective computing and social signal processing communities, increasing
efforts are being made in order to collect data with genuine (emotional) content.
When it comes to negative emotions and even aggression, ethical and privacy related
Autonomous systems are designed to carry out activities in remote, hazardous environments
without the need for operators to micro-manage them. It is, however, essential that
operators maintain situation awareness in order to monitor vehicle status and ...
Existing assistive technologies often capture and utilize a single remaining ability
to assist people with tetraplegia, which is insufficient for efficient complex interaction.
In this work, we developed a multimodal assistive system (MAS) to utilize ...
Affect recognition aims to detect a person's affective state based on observables,
with the goal of, e.g., improving human-computer interaction. Long-term stress is known
to have severe implications for wellbeing, which call for continuous and automated
When an automatic wheelchair or a self-carrying robot moves along with human agents,
predicting the next possible actions of the participating agents plays an important
role in realizing successful cooperation among them. In this paper, we ...
Automatic analysis of advertisements (ads) poses an interesting problem for learning
multimodal representations. A promising direction of research is the development of
deep neural network autoencoders to obtain inter-modal and intra-modal ...
Correctly interpreting an interlocutor's emotional expression is paramount to a successful
interaction. But what happens when one of the interlocutors is a machine? The facilitation
of human-machine communication and cooperation is of growing importance ...
Immersive virtual environments (IVEs) present rich possibilities for the experimental
study of non-verbal communication. Here, the 'digital chameleon' effect, which suggests
that a virtual speaker (agent) is more persuasive if they mimic their ...
This paper presents an approach for generating photorealistic video sequences of dynamically
varying facial expressions in human-agent interactions. To this end, we study human-human
interactions to model the relationship and influence of one individual's ...
This paper presents a novel approach for automatic prediction of risk of ADHD in schoolchildren
based on touch interaction data. We performed a study with 129 fourth-grade students
solving math problems on a multiple-choice interface to obtain a large ...
Creating tactile representations of visual information, especially moving images,
is difficult due to a lack of available tactile computing technology and a lack of
tools for authoring tactile information. To address these limitations, we developed
Modern smartphones are built with capacitive-sensing touchscreens, which can detect
anything that is conductive or has a dielectric differential with air. The human finger
is an example of such a dielectric, and works wonderfully with such touchscreens.
Emotion recognition is a core research area at the intersection of artificial intelligence
and human communication analysis. It is a significant technical challenge since humans
display their emotions through complex idiosyncratic combinations of the ...
Robots, virtual assistants, and other intelligent agents need to effectively interpret
verbal references to environmental objects in order to successfully interact and collaborate
with humans in complex tasks. However, object disambiguation can be a ...
Tactile information in a palm is a necessary component in manipulating and perceiving
large or heavy objects. Noting this, we investigate human sensitivity to tactile haptic
feedback in a palm for an improved user interface design. To provide ...
Social skills training, performed by human trainers, is a well-established method
for obtaining appropriate skills in social interaction. Previous work automated the
process of social skills training by developing a dialogue system that teaches social
SESSION: Doctoral Consortium (alphabetically by author's last name)
While many organizations provide a website in multiple languages, few provide a sign-language
version for deaf users, many of whom have lower written-language literacy. Rather
than providing difficult-to-update videos of humans, a more practical ...
Group meetings are often inefficient, unorganized and poorly documented. Factors including
"group-think," fear of speaking, unfocused discussion, and bias can affect the performance
of a group meeting. In order to actively or passively facilitate group ...
Augmented reality eyewear devices (e.g. glasses, headsets) are poised to become ubiquitous
in a similar way to smartphones, by providing quicker and more convenient access
to information. There is theoretically no limit to their application areas and ...
There are various real-world applications such as video ads, airport screenings, courtroom
trials, and job interviews where deception detection can play a crucial role. Hence,
there is immense demand for deception detection in videos. Videos contain ...
Analysis of student engagement in an e-learning environment would facilitate effective
task accomplishment and learning. Generally, engagement/disengagement can be estimated
from facial expressions, body movements and gaze pattern. The focus of this ...
Social robots need non-verbal behavior to make an interaction pleasant and efficient.
Most of the models for generating non-verbal behavior are rule-based and hence can
produce a limited set of motions and are tuned to a particular scenario. In contrast,...
I introduce a novel multi-modal multi-sensor interaction method between humans and
heterogeneous multi-robot systems. I have also developed a novel algorithm to control
heterogeneous multi-robot systems. The proposed algorithm allows the human operator
Multi-modal sentiment detection from natural video/audio streams has recently received
much attention. I propose to use this multi-modal information to develop a technique,
Sentiment Coloring, which utilizes the detected sentiments to generate effective ...
This work seeks to explore the potential of textile sensing systems as a new modality
of capturing social behaviour. The focus lies on evaluating the performance
of embedded pressure sensors as reliable detectors for social cues, such as ...
We converse with other people using both sounds and visuals, as our perception
of speech is bimodal. Essentially echoing the same speech structure, we manage to
integrate the two modalities and often understand the message better than with the
Automatic analysis of teacher student interactions is an interesting research problem
in social computing. Such interactions happen in both online and class room settings.
While teaching effectiveness is the goal in both settings, the mechanism to ...
This paper outlines the PhD research that aims to model empathy
in embodied conversational systems. Our goal is to determine the requirements for
implementation of an empathic interactive agent and develop evaluation methods that
SESSION: Demo and Exhibit Session
This work introduces EVA, a multimodal argumentative Dialogue System that is capable
of discussing controversial topics with the user. The interaction is structured as
an argument game in which the user and the system select respective moves in order
Tracking learners' engagement is useful for monitoring their learning quality. With
an increasing number of online video courses, a system that can automatically track
learners' engagement is expected to significantly help in improving the outcomes of
This work describes our approach to controlling lighter-than-air agents using multimodal
control via a wearable device. Tactile and gesture interfaces on a smart watch are
used to control the motion and altitude of these semi-autonomous agents. The ...
Autonomous systems in remote locations have a high degree of autonomy and there is
a need to explain what they are doing and why, in order to increase transparency
and maintain trust. This is particularly important in hazardous, high-risk scenarios.
SESSION: EAT Grand Challenge
The multimodal recognition of eating condition - whether a person is eating or not,
and if so, which food type - is a new research domain in the area of speech and
video processing that has many promising applications for future multimodal interfaces
Automatic recognition of eating conditions of humans could be a useful technology
in health monitoring. The audio-visual information can be used in automating this
process, and feature engineering approaches can reduce the dimensionality of audio-visual
In this paper, we mainly investigate subjects' food likability based on audio-related
features as a contribution to EAT - the ICMI 2018 Eating Analysis and Tracking challenge.
Specifically, we conduct 4-level Double Tree Complex Wavelet Transform ...
The use of Convolutional Neural Networks (CNN) pre-trained for a particular task,
as a feature extractor for an alternate task, is a standard practice in many image
classification paradigms. However, to date there have been comparatively few works
This paper presents the novel Functional-based acoustic Group Feature Selection (FGFS)
method for automatic eating condition recognition addressed in the ICMI 2018 Eating
Analysis and Tracking Challenge's Food-type Sub-Challenge. The Food-type Sub-...
SESSION: EmotiW Grand Challenge
Emotion recognition (ER) based on natural facial images/videos has been studied for
some years and considered a comparatively hot topic in the field of affective computing.
However, it remains a challenge to perform ER in the wild, given the noises ...
This paper presents a light-weight and accurate deep neural model for audiovisual
emotion recognition. To design this model, the authors followed a philosophy of simplicity,
drastically limiting the number of parameters to learn from the target datasets,...
This paper elaborates the winning approach for engagement intensity prediction in the
EmotiW Challenge 2018. The task is to predict the engagement level of a subject when
he or she is watching an educational video in diverse conditions and different ...
In this paper, we propose an automatic engagement prediction method for the Engagement
in the Wild sub-challenge of EmotiW 2018. We first design a novel Gaze-AU-Pose (GAP)
feature taking into account the information of gaze, action units and head pose ...
Engagement is the holy grail of learning, whether it is in a classroom setting or an
online learning platform. Studies have shown that engagement of the student while
learning can benefit students as well as the teacher if the engagement level of the
In this paper we propose a new approach for classifying the global emotion of images
containing groups of people. To achieve this task, we consider two different and complementary
sources of information: i) a global representation of the entire image (...
Precise detection and localization of learners' engagement levels are useful for monitoring
their learning quality. In the EmotiW Challenge's engagement detection task, we proposed
a series of novel improvements, including (a) a cluster-based framework ...
Group-level Emotion Recognition (GER) in the wild is a challenging task gaining lots
of attention. Most recent works utilized two channels of information, a channel involving
only faces and a channel containing the whole image, to solve this problem. ...
In this paper, we present our latest progress in Emotion Recognition techniques, which
combines acoustic features and facial features in both non-temporal and temporal mode.
This paper presents the details of our techniques used in the Audio-Video ...
This paper presents a hybrid deep learning network submitted to the 6th Emotion Recognition
in the Wild (EmotiW 2018) Grand Challenge, in the category of group-level emotion
recognition. Advanced deep learning models trained individually on faces, ...
This paper presents our approach for group-level emotion recognition sub-challenge
in the EmotiW 2018. The task is to classify an image into one of the group emotions
such as positive, negative, and neutral. Our approach mainly explores three cues,
The difficulty of emotion recognition in the wild (EmotiW) is how to train a robust
model to deal with diverse scenarios and anomalies. The Audio-video Sub-challenge
in EmotiW contains audio-video short clips with several emotional labels and the task
This paper details the sixth Emotion Recognition in the Wild (EmotiW) challenge. EmotiW
2018 is a grand challenge in the ACM International Conference on Multimodal Interaction
2018, Colorado, USA. The challenge aims at providing a common platform to ...
SESSION: Workshop Summaries
This is the introduction paper to the third version of the workshop on 'Multisensory
Approaches to Human-Food Interaction' organized at the 20th ACM International Conference
on Multimodal Interaction in Boulder, Colorado, on October 16th, 2018. This ...
Analysis of group interaction and team dynamics is an important topic in a wide variety
of fields, owing to the amount of time that individuals typically spend in small groups
for both professional and personal purposes, and given how crucial group ...
Multimodal signals allow us to gain insights into internal cognitive processes of
a person, for example: speech and gesture analysis yields cues about hesitations,
knowledgeability, or alertness; eye tracking yields information about a person's focus
This paper presents an introduction to the "Human-Habitat for Health (H3): Human-habitat
multimodal interaction for promoting health and well-being in the Internet of Things
era" workshop, which was held at the 20th ACM International Conference on ...
This paper gives a brief overview of the third workshop on Multimodal Analyses enabling
Artificial Agents in Human-Machine Interaction. The paper focuses on the main
aspects intended to be discussed in the workshop, reflecting the main scope of the