Accurately describing the tasks that make up a day in your life is an inherently multimodal characterisation: after beginning each task, you become loaded to some extent by the objects, movements, communication and/or mental challenges that the task requires; you later switch to a new task, and so on. Automating task analysis of this kind, which to date has been manual, post-hoc and subjective, has proven challenging. Wearable systems offer the opportunity to position heterogeneous non-invasive sensors directly where they are most useful for task analysis: on the head, close to the eyes and mouth. In a longitudinal context, continuous data acquisition from these sensors is typically inefficient because the most interesting changes tend to occur infrequently, and there is scope for more investigation of automatic event-based analysis approaches inspired by the way humans annotate multimodal data. This presentation focuses on key research problems and recent results spanning psychophysiological motivations; feature extraction, including the accurate extraction of eye action units from very small near-field infrared cameras mounted on glasses frames; feature variability; multimodal fusion; machine learning; and system design for continuous and robust automatic task analysis from wearable sensors. It also highlights examples of how interpretable, event-based multimodal analysis can yield new insights and new machine learning research directions. Task analytics of this kind hold huge potential for individual users to empower themselves and interact more seamlessly with machines.
As multimodal AI systems increasingly operate through diverse sensory inputs, tool use, and autonomous workflows, ensuring safety and responsibility requires more than simply placing a human “in the loop” or assigning liability to an organisation. This talk reconsiders the idea of meaningful oversight—not as symbolic presence or bureaucratic rubber-stamping, but as real agency and influence—by examining how both individuals and organisations can exert practical control over system behaviour. Drawing on ideas such as the lowest-cost avoider principle, where responsibility falls to those best positioned to prevent harm, and capability-based governance, which ties responsibility to technical control rather than formal roles, the talk invites reflection on how influence and accountability should be structured. From an engineering perspective, we’ll explore how system-level design choices—such as intervention points, safeguards, and reasoning trace monitoring—can support more substantive oversight in practice. By highlighting common design patterns that encode these principles, the talk presents a systems-oriented view of safe and responsible multimodal interaction, where accountability is not an afterthought, but something designed into the architecture itself.
In this keynote, I will explore how multimodal artificial intelligence (AI) is transforming key sectors, including water, transport, agriculture, and healthcare, by integrating diverse data streams to drive innovation and deliver measurable business and societal impact. By harnessing multimodal information such as sound, vision, physiological signals, behavioural patterns, and both structured and unstructured data, AI is increasingly capable of supporting complex decision-making and enriching human-machine interactions.
Drawing on real-world deployments, I will demonstrate how multimodal AI enhances operational efficiency, enables adaptive learning and decision-making, and fosters more responsive, intelligent systems. These applications not only improve productivity and resilience but also create new opportunities for sustainable growth and inclusive societal benefit.
Rooted in interdisciplinary research spanning AI, human-computer interaction, data science, behavioural science, neuroscience, and more, this talk will highlight technological advances while also addressing the ethical and practical challenges of deploying AI in real-world contexts. By focusing on innovative data-driven and human-centric solutions, we can unlock the full potential of multimodal AI to transform industries, empower individuals, and shape a more connected and sustainable future.
Persuasion is the act of changing or influencing a person's attitude or behavior without coercion or deception. When combined with personalization, persuasion has been shown to have greater impact. Persuasion theories in social psychology identify emotion, cognition and personality as important factors, and these factors are also under active research for automated detection within the computing community. However, how mental state and personality shape the persuasion process and its outcomes has not been investigated. In this paper, we investigate differences in mental state and personality between persuaders and receivers in dyadic conversations that produced different persuasion outcomes. A video dataset of 24 participants debating in pairs on a survival task was collected. Arousal, valence, gaze direction, speech length and the Big Five Inventory were annotated and processed to represent emotion, attention, mental load and personality during each dialog. Statistical analysis shows that persuaders’ arousal was significantly higher than receivers’ when the outcome was disagreement, that persuaders imposed lower mental load on receivers when the outcome was agreement, and that when persuaders’ extraversion score was higher than receivers’, the agreement response rate was higher. This research contributes to the understanding of the mental state and personality of both persuaders and receivers in dyadic interaction, with potential for dynamic and personalized adaptation.
Identifying affective responses of users engaged with digital information can provide valuable information on user experience and can be used for user modelling, retrieval, and content recommendation. Methods for recognising affective states often rely on supervised learning from a single modality, such as decoding affective information from content via computer vision, human physiology, or brain responses. These approaches assume the availability of paired data that contain the modality from which the prediction is made and the corresponding affective state label that is used to supervise the model. Unlike previous research, we introduce an approach to decode affective states via bimodal contrastive learning without using any externally provided affective state labels. Our method, entitled BALE, uses only bimodal data (paired brain recordings and images) to self-learn latent representations by contrasting image representations and the representations of the brain responses evoked when humans perceive the visual stimuli. We demonstrate the effectiveness of our approach using two publicly available datasets: fNIRS recordings of images showing various levels of valence (positivity/negativity) and fMRI recordings of individuals viewing human faces showing different emotions (joy/anger/contempt/pride/neutral). We evaluate our model on two tasks: classification of affective states and ranking of visual stimuli based on the affective responses they evoke when perceived. We report the first successful results for bimodal learning of affective states without labels and report performance approaching models trained with labelled supervision data. The results provide evidence that subjective affective states can be captured from multimodal interactions. Our source code is openly released at https://github.com/VadymV/BALE/.
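The label-free bimodal objective at the heart of approaches like BALE can be illustrated with a short, CLIP-style symmetric contrastive loss. The sketch below is a minimal stand-in, assuming hypothetical encoder outputs `brain_emb` and `image_emb` for paired trials; it is not the authors' released code (see their repository for that).

```python
import torch
import torch.nn.functional as F

def bimodal_contrastive_loss(brain_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired brain/image embeddings.

    brain_emb, image_emb: (batch, dim) tensors; row i of each comes
    from the same trial (image shown + evoked brain response).
    """
    brain = F.normalize(brain_emb, dim=-1)
    image = F.normalize(image_emb, dim=-1)
    logits = brain @ image.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; all other entries act as negatives.
    loss_b2i = F.cross_entropy(logits, targets)
    loss_i2b = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_b2i + loss_i2b)

# Toy usage with random features standing in for encoder outputs.
loss = bimodal_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```

Minimising such a loss aligns each brain response with the representation of the image that evoked it, which is what lets affective structure emerge without explicit labels.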
Encouraging self-disclosure is an important technique for eliciting high-quality information from interviewees in dialog systems. We propose an adaptive interview strategy that encourages self-disclosure by dynamically adjusting topic transitions on the basis of the interviewee’s estimated attitude, which is inferred from multimodal signals. A system with the adaptive interview strategy estimates the interviewee’s attitude from the multimodal features (posture, prosody, facial landmarks, and biometric signals) extracted in the interaction and determines whether to continue/change the topic on the basis of the estimated attitude. We developed an automatic interview robot system with an adaptive interview strategy and conducted a dialog experiment with 30 people recruited from the general public. Every participant interacted with the adaptive strategy and the random strategy (random topic continuation/switching). The results of the experiment revealed that, compared with the random strategy, the adaptive interview strategy led to deeper self-disclosure by the interviewees and a greater frequency of words related to specific experiences in the interviewees’ speech. Moreover, we observed a modest positive correlation between the correct rate of attitude estimation and the strength of self-disclosure, suggesting that improved attitude recognition contributes to more effective elicitation. These findings contribute to techniques for eliciting valuable information from users through dialog and underscore the importance of social signal processing in adaptive dialog systems.
This work proposes a two-stage multimodal emotion estimation model to address the challenges of modality heterogeneity and imbalance in modality contribution. The first stage employs a PatchTST-based masked autoencoder (MAE) strategy for self-supervised learning of foundational features from each modality to guide downstream learning. Specifically, this work introduces a modality-specific feature loss to guide masked reconstruction, which compels the model to incorporate low-order statistical information while learning deep signal representations. This ensures a balance between task relevance and generalization ability. Additionally, since this stage does not rely on labels, it avoids feature bias caused by label subjectivity and helps the model learn the intrinsic structure and task-independent generalizable features of the data itself. In the second stage, emotion estimation is performed using the proposed multimodal hierarchical fusion architecture. In the initial layer, wavelet convolutions are used to capture fine-grained time–frequency features of each modality. Simultaneously, feature extraction in this layer is guided by the foundational features obtained from the first stage, ensuring accurate capture of modality-specific information. Subsequently, a multi-head cross-attention mechanism enables dynamic interaction and information fusion between EEG and physiological features, thereby constructing a shared feature space across modalities. Finally, the model estimates emotional dimensions based on the fused features. Experiments on the DEAP dataset demonstrate the promising potential of the proposed approach.
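To make the first-stage idea concrete, here is a minimal sketch of a masked-reconstruction objective augmented with a low-order statistics term, standing in for the modality-specific feature loss described above. Tensor shapes and the `stat_weight` factor are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(recon, target, mask, stat_weight=0.1):
    """Masked-patch reconstruction loss plus a low-order statistics term.

    recon, target: (batch, patches, patch_len) signals for one modality.
    mask: (batch, patches) boolean, True where the patch was masked.
    The statistics term nudges the model to preserve per-channel mean/std,
    mirroring the idea of guiding reconstruction with low-order statistics.
    """
    mask = mask.unsqueeze(-1).float()
    mse = ((recon - target) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)
    stat_loss = F.mse_loss(recon.mean(dim=1), target.mean(dim=1)) + \
                F.mse_loss(recon.std(dim=1), target.std(dim=1))
    return mse + stat_weight * stat_loss

# Toy usage with random patches and a ~40% masking ratio.
loss = masked_reconstruction_loss(
    torch.randn(4, 16, 32), torch.randn(4, 16, 32), torch.rand(4, 16) > 0.6)
```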
Traditional cuff-based blood pressure (BP) measurement techniques, though widely employed, require a cuff that must be fitted and inflated and thus are limited by convenience issues. In contrast, contactless BP monitoring solutions offer a promising alternative. This study explored pulse transit time (PTT) as a feature for accurate BP monitoring using remote photoplethysmography (rPPG). The investigation examined the order of PTT (PTT Order) between the palm and forehead and their impacts on the accuracy of BP estimation. Our findings showed variation in the dominant order between the two sites among the subjects. Nevertheless, the inverse of mean PTT extracted from the two sites in dominant order (PTT Dominant Order) consistently showed a higher linear correlation with systolic blood pressure (SBP). The mean and standard deviation of R-squared derived from the inverse of mean PTT with the dominant order and SBP among the 16 subjects were 0.81 ± 0.13. Additionally, subgroup analysis identified significant differences in SBP across gender and exercise status. Furthermore, our data revealed a hysteresis phenomenon in 25% of the subjects, characterized by SBP returning to baseline levels during post-exercise resting while heart rate (HR) remained persistently elevated.
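The reported correlation analysis amounts to a per-subject linear fit between SBP and the inverse of the mean PTT, from which R-squared follows as below. The numbers are made-up illustrative values, not data from the study.

```python
import numpy as np

# Hypothetical per-window measurements for one subject: mean PTT (s) in the
# dominant order and the reference systolic blood pressure (mmHg).
ptt = np.array([0.21, 0.19, 0.18, 0.17, 0.16, 0.155])
sbp = np.array([112., 118., 124., 130., 137., 141.])

x = 1.0 / ptt                                   # inverse of mean PTT
slope, intercept = np.polyfit(x, sbp, 1)        # simple per-subject linear fit
sbp_hat = slope * x + intercept

ss_res = np.sum((sbp - sbp_hat) ** 2)
ss_tot = np.sum((sbp - sbp.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot               # goodness of the linear fit
print(f"SBP ~= {slope:.1f} * (1/PTT) + {intercept:.1f},  R^2 = {r_squared:.2f}")
```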
Understanding physiological responses during running is critical for performance optimization, tailored training prescriptions, and athlete health management. We introduce a comprehensive framework—what we believe to be the first capable of predicting instantaneous oxygen consumption (VO2) trajectories exclusively from consumer-grade wearable data. Our approach employs two complementary physiological models: (1) accurate modeling of heart rate (HR) dynamics via a physiologically constrained ordinary differential equation (ODE) and neural Kalman filter, trained on over 3 million HR observations, achieving 1-second interval predictions with mean absolute errors as low as 2.81 bpm (correlation 0.87); and (2) leveraging the principles of precise HR modeling, a novel VO2 prediction architecture requiring only the initial second of VO2 data for calibration, enabling robust, sequence-to-sequence metabolic demand estimation. Despite relying solely on smartwatch and chest-strap data, our method achieves mean absolute percentage errors of approximately 13%, effectively capturing rapid physiological transitions and steady-state conditions across diverse running intensities. Our synchronized dataset, complemented by blood lactate measurements, further lays the foundation for future noninvasive metabolic zone identification. By embedding physiological constraints within modern machine learning, this framework democratizes advanced metabolic monitoring, bridging laboratory-grade accuracy and everyday accessibility, thus empowering both elite athletes and recreational fitness enthusiasts.
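As a rough illustration of physiologically constrained HR dynamics (before any neural Kalman filtering), a first-order ODE driven by running intensity can be integrated with an explicit Euler step. The parameter values and the linear speed-to-demand mapping below are assumptions for demonstration only, not the paper's model.

```python
import numpy as np

def simulate_hr(speed, dt=1.0, hr0=70.0, hr_rest=70.0, hr_max=190.0,
                tau_up=30.0, tau_down=60.0):
    """First-order HR dynamics: heart rate relaxes toward a demand-dependent
    target with different rise and recovery time constants."""
    hr = np.empty_like(speed, dtype=float)
    h = hr0
    for t, s in enumerate(speed):
        target = hr_rest + (hr_max - hr_rest) * min(s / 6.0, 1.0)  # 6 m/s ~ max effort
        tau = tau_up if target > h else tau_down
        h += dt * (target - h) / tau          # explicit Euler step
        hr[t] = h
    return hr

# Toy session: 5 min easy, 5 min hard, 5 min recovery (1 Hz speed samples).
speed = np.concatenate([np.full(300, 2.5), np.full(300, 4.5), np.full(300, 1.5)])
hr_trace = simulate_hr(speed)
```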
Drink spiking, the deliberate act of secretly adding substances to someone’s drink, is a growing concern, exacerbated by easier access to drugs. Effective protection against this threat requires discreet yet accurate liquid analysis solutions. Current methods fail to generalize across various substances and are primarily designed for static environments, lacking adaptability to dynamic real-world situations. We contribute SpikEy, an innovative sensing system that combines optical sensing, embedded AI, and signal processing to overcome these limitations. The key technical contributions of SpikEy are latent modeling of optical signals for robustness and generality, and motion and light calibration for ensuring accurate performance in diverse real-world settings. Extensive experiments with various drug profiles, concentrations, and liquids demonstrate that SpikEy can detect spiked drinks with up to 86% accuracy, effectively generalizing across different users and unseen drinks. It outperforms state-of-the-art methods by 10-20%, and shows nearly 20% improvement over visual inspection in real-world applications.
Foundation Models trained to perform a certain task can be fine-tuned to other tasks with limited data and computational resources. The advantage of such practice is that it makes it possible to benefit, at least indirectly, from the large amounts of data and the major computational infrastructure necessary for training a Foundation Model. However, there is a limitation too, namely that the few organizations that have the major resources necessary to develop and train Foundation Models do it only for the modalities that are of interest to them. For this reason, this article proposes to fine-tune Foundation Models trained on speech and photoplethysmography signals to perform stress detection based on Electro-Dermal Activity, a modality for which no Foundation Model exists. To the best of our knowledge, this is one of the first works proposing experiments of this type, and the results show state-of-the-art stress detection performance over a publicly available benchmark, even though speech and photoplethysmography data differ significantly from Electro-Dermal Activity signals.
Understanding user affect through multimodal sensing is critical for designing adaptive and effective interactive systems. While Virtual Reality (VR) is increasingly used to induce and regulate affective states, limited research has examined real-time neurophysiological and psychological changes across contrasting VR scenarios. In this exploratory study, fourteen participants were exposed to two immersive VR experiences in a within-subjects design: (1) a custom-designed, nature-based environment integrated with heart rate variability biofeedback (HRVBF) to promote relaxation, and (2) Richie’s Plank Experience, a pre-developed VR scenario designed to elicit stress. We conducted a multimodal analysis of neurophysiological and psychological responses during exposure to these VR environments. Subjective psychological responses were measured pre- and post-intervention using the State-Trait Anxiety Inventory (STAI) and Visual Analog Scale (VAS). Additionally, neurophysiological data were concurrently recorded, including respiration rate (RR), heart rate variability (HRV), and hemodynamic responses, specifically oxygenated (HbO) and deoxygenated hemoglobin (HbR), using functional near-infrared spectroscopy (fNIRS). Richie’s Plank significantly elevated post-exposure subjective anxiety scores (VAS: p = 0.05, STAI: p < 0.01) and increased RR (20.6 ± 2.37) compared to the HRVBF condition (11.2 ± 5.90, p < 0.01). HbO was higher during Richie’s Plank (3.0 ± 1.93), while HbR was elevated in the HRVBF condition (1.0 ± 1.13), both p < 0.01. Furthermore, HRV (RMSSD) was lower during the HRVBF environment (p < 0.01), indicating greater parasympathetic activation during HRVBF. These findings demonstrate that fNIRS, HRV/ECG, and RR exhibit distinct patterns that reflect the unique characteristics of each VR intervention presented. This multimodal physiological responsiveness supports the design of affect-aware systems capable of delivering real-time, personalized interventions. However, given the modest sample size and demographic homogeneity, these exploratory results warrant replication in larger, diverse populations for generalizability.
Virtual keyboards are important tools for efficient text entry in virtual reality (VR). Our everyday typing relies heavily on touch sensation, provided through tactile and kinesthetic (force) feedback. While tactile feedback has been explored for VR keyboards, the effects of force feedback, and its multimodal interaction with visual design elements of virtual keyboards, remain largely unexplored. This study investigated the effects of force feedback in combination with two fundamental visual elements – key dimensionality (2D vs. 3D) and visual keystroke feedback – on typing efficiency and user experience. In a text entry experiment using state-of-the-art kinesthetic gloves, 3D keys, force feedback and visual feedback significantly improved typing accuracy, and visual feedback also enhanced typing speed. Crucially, the study revealed significant and complex interaction effects between these factors on various aspects of user experience. These findings highlight that the combined effect of multimodal cues on user experience is not simply additive. This work is one of the first to empirically explore the multimodal interaction of force feedback and visual GUI elements for VR text entry, providing valuable insights for the design of efficient and user-friendly multimodal VR keyboards.
Educational games enhance learning experiences by integrating touchscreens, making interactions more engaging and intuitive for learners. However, the cognitive impacts of educational game input modalities, such as hand versus stylus input, are not clear. We compared the experience of using the hand vs. a stylus on touchscreens during educational gameplay by analyzing oxygenated hemoglobin collected with functional Near-Infrared Spectroscopy and self-reported measures. In addition, we compared the hand and stylus modalities on the task and calculated the relative neural efficiency and relative neural involvement from the mental demand and the quiz score. Our findings show that the hand condition had significantly lower neural involvement, yet higher neural efficiency, than the stylus condition, suggesting that using the hand requires less cognitive effort. Additionally, the self-reported measures show significant differences, suggesting that hand-based input is more intuitive, less cognitively demanding, and less frustrating. In contrast, using a stylus required higher cognitive effort due to the dual demands of controlling the pen and answering questions. These findings highlight the importance of designing educational games that allow learners to engage with the system while minimizing cognitive effort.
We introduce AirSpartOne, a one-handed hybrid distal pointing technique for large displays. AirSpartOne enables rapid cursor repositioning in the air with absolute mapping, followed by precise relative mapping on the smartphone, while ensuring continuous interaction. We then conduct an experiment that compares the performance of AirSpartOne with pad and freehand pointing. Our findings indicate that AirSpartOne is faster than pad without compromising accuracy, and more accurate than freehand without compromising speed. Our findings also suggest an overwhelming preference for AirSpartOne. Finally, informed by our experimental findings, we derive five guidelines for distal pointing with large displays.
Storyboarding is an established method for designing user experiences. Generative AI can support this process by helping designers quickly create visual narratives. However, existing tools mainly focus on improving the accuracy of text-to-image generation, and there is a lack of understanding of how to effectively support the entire creative process of storyboarding and how to develop AI-powered tools that integrate into designers’ diverse workflows. In this work, we designed and developed StoryDiffusion, a system that integrates text-to-text and text-to-image models to support the generation of narratives and images in a single pipeline. In a user study, we observed 12 UX design students using the system for both concept ideation and illustration tasks. Our findings identified AI-directed vs. user-directed creative strategies in both tasks and revealed the importance of supporting the interchange between narrative iteration and image generation. We also found that the design tasks affected participants’ strategies and preferences, providing insights for future development.
Generative AI technologies are reshaping everyday environments by enabling multimodal interaction. As their ubiquity and agentic capacities grow, there is a pressing need to understand how these systems reshape human–computer interaction in relational, social, and systemic terms. We introduce a scenario-based design pack for investigating Human–GenAI relations. Grounded in assemblage theory and structured around a three-stage process—Prepare, Make, Reflect—the pack supports the prototyping, analysis, and critical reflection of emergent sociotechnical configurations. We evaluated the pack across three deployments: an ACM workshop (n=22), a multidisciplinary design session (n=20), and a university HCI class (n=260). Participants generated scenarios that surfaced relational issues of power, agency, visibility, and care. We contribute the design pack alongside an exploratory framework to advance relational enquiry into multimodal Human–GenAI relations, support more inclusive and socially responsive GenAI practices, and complement FATE approaches by grounding fairness, accountability, and transparency in lived, multimodal configurations.
Sign languages are the main means of communication for deaf communities around the world. However, developing robust systems for Isolated Sign Language Recognition (ISLR) faces the challenge of data scarcity. Additionally, although many approaches rely on RGB videos, training such large models for sign language recognition becomes infeasible when data is scarce. To address this challenge, we propose a series of light transformers that solve ISLR using landmark-based approaches. In our experiments, we compared pooling strategies for compacting temporal representations and performed ablation studies on the contribution of Additive Positional Encoding (PE). Our results across two sign language datasets show that the pooling strategy that employs a learnable query token at the input of the decoder achieved a competitive Weighted-F1 (W-F1) of 76.99 on the AVASAG100 dataset when no sin-cos PE is added, and a W-F1 of 59.64 on WLASL100, performing similarly to the architecture with Global Average Pooling (GAP), which achieved a W-F1 of 76.71 on AVASAG100 and a W-F1 of 62.91 on WLASL100.
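The two pooling strategies compared above can be sketched in a few lines: a learnable query token that cross-attends over the frame embeddings produced by the encoder, versus plain Global Average Pooling. Dimensions and module names below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class QueryTokenPooling(nn.Module):
    """Compacts a sequence of landmark-frame embeddings into one vector by
    attending from a single learnable query token (cross-attention)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                       # x: (batch, frames, dim)
        q = self.query.expand(x.size(0), -1, -1)
        pooled, _ = self.attn(q, x, x)          # query attends over all frames
        return pooled.squeeze(1)                # (batch, dim)

x = torch.randn(2, 64, 256)                     # 64 frames of landmark features
pooled_query = QueryTokenPooling(256)(x)        # learnable-query pooling
pooled_gap = x.mean(dim=1)                      # Global Average Pooling baseline
```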
Social media platforms use multimodal data (e.g., text, images, behavioral patterns) to infer user characteristics for algorithmic profiling. To comply with privacy regulations like the GDPR, companies provide transparency tools, which are often hard for users to interpret—especially individuals with cognitive impairments (CIs), whose specific needs remain underexplored. It is still unclear (1) what information users need and (2) how it should be effectively and accessibly represented. We investigate transparency needs across cognitive abilities, using Large Language Models (LLMs) to create more understandable representations of profiling. An exploratory study with 45 participants—30 without CIs and 15 with CIs—was conducted under three conditions. After 15 minutes of social media browsing, participants received either (1) a verbal explanation of profiling, (2) LLM-generated interest segments, or (3) LLM-generated user personas (in general or Easy-to-Read German for participants with CIs), followed by a semi-structured interview. Thematic analysis of transcripts revealed concerns about data sensitivity, perceived consequences, and the influence of cognitive abilities. Merely showing users collected or inferred data—regardless of format—may not meet user transparency needs. Our findings suggest transparency tools must go beyond data representation to explain inference mechanisms and potential outcomes, tailored to the sensitivities of different cognitive user groups.
Sign Language Generation (SLG) has received increasing attention in recent years, with various models aiming to produce natural and temporally coherent sign gestures from spoken or written language. However, SLG remains a challenging task due to its inherently one-to-many nature, where a single sentence can correspond to multiple valid gesture sequences, and the requirement for smooth, synchronized motion across multiple articulators. Although diffusion-based models capture diversity, their stochastic denoising often introduces temporal misalignment and motion artifacts. In this work, we propose SignFlow, a novel architecture for SLG based on conditional flow matching with optimal transport. By modeling deterministic flow paths guided by optimal transport and supervised via velocity fields, SignFlow generates gestures that are both semantically coherent and visually smooth. Experiments on the CSL-Daily dataset demonstrate that SignFlow achieves superior BLEU scores and DTW-based motion accuracy compared to both diffusion and autoregressive baselines.
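The conditional flow matching objective with straight (optimal-transport-style) interpolation paths can be written compactly as a velocity-regression loss. The toy velocity network and tensor shapes below are assumptions for illustration; SignFlow's actual architecture and optimal-transport coupling are more involved.

```python
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Toy conditional velocity field over pose sequences (illustrative only)."""
    def __init__(self, frames, joints, cond_dim):
        super().__init__()
        self.shape = (frames, joints)
        self.mlp = nn.Sequential(
            nn.Linear(frames * joints + 1 + cond_dim, 256), nn.GELU(),
            nn.Linear(256, frames * joints))

    def forward(self, x_t, t, cond):
        h = torch.cat([x_t.flatten(1), t.unsqueeze(-1), cond], dim=-1)
        return self.mlp(h).view(-1, *self.shape)

def flow_matching_loss(velocity_net, x1, cond):
    """Regress the constant velocity (x1 - x0) along straight interpolation
    paths x_t = (1 - t) * x0 + t * x1, i.e. conditional flow matching."""
    x0 = torch.randn_like(x1)                            # noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)   # per-sample time
    x_t = (1.0 - t) * x0 + t * x1
    pred_v = velocity_net(x_t, t.view(-1), cond)
    return ((pred_v - (x1 - x0)) ** 2).mean()

net = TinyVelocityNet(frames=30, joints=50, cond_dim=64)
loss = flow_matching_loss(net, torch.randn(4, 30, 50), torch.randn(4, 64))
```

Because the paths are deterministic straight lines rather than stochastic denoising trajectories, sampling follows a smooth flow, which is the property the abstract credits for temporally coherent gestures.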
Improving the quality of geriatric care is a challenge that requires insights from stakeholders. While simulated trainings can boost competencies, extracting meaningful insights from these practices to enhance simulation effectiveness remains a challenge. In this study, we introduce Multimodal Epistemic Network Analysis (MENA), a novel framework for analyzing caregiver attitudes and emotional responses in an Augmented Reality simulation. By integrating a multimodal Emotional State Classifier, MENA extends traditional epistemic network analysis to reveal complex relationships between caregiving competencies and positive emotions. Applied in a pilot study (N = 20) comparing caregiver interactions with an unaware versus an aware virtual geriatric patient (VGP), MENA visualizations demonstrated how awareness in the VGP fostered more supportive and person-centered caregiving behaviors. These findings suggest that MENA not only enhances the analysis of multimodal interactions but also provides a powerful tool for designing emotionally intelligent training systems that prepare caregivers for the nuanced demands of real-world practice. The code and setup to reproduce the experiments are publicly available here, and data is available upon request.
Training multimodal large language models (LLMs) for safety-critical assistive applications, especially those handling sensitive user data, presents challenges related to responsible deployment, privacy, and computational efficiency. Centralized training risks user privacy during fine-tuning, while resource-heavy methods limit deployment in real-world assistive scenarios.
To advance safe and responsible multimodal interaction, we propose a federated learning approach that enhances user privacy through decentralized data processing, enabling model fine-tuning without compromising user data security. Additionally, we extend visual instruction tuning by applying efficient fine-tuning techniques to multimodal language-image instruction-following data. This process results in capable multimodal LLMs optimized for computational efficiency, trainable in ∼ 36 hours on a single 2 × A100 node, making it more accessible for inclusive deployment.
We demonstrate a practical application of these advances through an assistive system designed for visually impaired users. This application utilizes a privacy-preserving and efficient multimodal LLM to provide real-time, interactive, and descriptive engagement with the environment, enhancing user understanding, autonomy, and safety. Our results show that combining federated learning with efficient visual instruction enables secure, transparent, and scalable multimodal LLMs. These advances support responsible AI deployment in assistive technologies, promoting accessibility, trust, and ethical impact for individuals with disabilities and the elderly, while upholding ethical integrity and societal benefit.
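A minimal sketch of the federated aggregation step underlying this kind of privacy-preserving fine-tuning is shown below: clients send back only adapter weights (e.g., LoRA matrices), which the server averages, weighted by local data size. Parameter names are hypothetical, and the snippet omits secure aggregation and the multimodal instruction-tuning loop itself.

```python
import torch

def fedavg(client_state_dicts, client_sizes):
    """Weighted FedAvg over clients' fine-tuned adapter weights.

    client_state_dicts: list of {param_name: tensor} dicts, one per client.
    client_sizes: number of local training examples per client.
    Only adapter tensors leave each device; raw user data never does,
    which is the privacy argument for decentralized fine-tuning.
    """
    total = float(sum(client_sizes))
    averaged = {}
    for name in client_state_dicts[0]:
        averaged[name] = sum(
            sd[name] * (n / total)
            for sd, n in zip(client_state_dicts, client_sizes))
    return averaged

# Toy round with two clients sharing a single (hypothetical) adapter matrix.
clients = [{"lora_A": torch.randn(8, 16)}, {"lora_A": torch.randn(8, 16)}]
global_update = fedavg(clients, client_sizes=[120, 80])
```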
Modeling gaze patterns in multiparty conversations is crucial to build socially-aware dialogue agents and humanoid robots. However, existing approaches typically rely on visual data or focus on dyadic settings. We propose a novel framework for social attention modeling — predicting gaze directions from linguistic and speaker cues alone, without direct visual input. We introduce SAT5, a speaker-aware adaptation of the T5 language model, pre-trained using multi-task objectives that capture both span corruption and speaker state modeling. Using a new dataset of three-party face-to-face conversations with synchronized speech, gaze, and motion capture data, we demonstrate that SAT5 significantly outperforms both pretrained and RNN-based baselines in predicting gaze targets. Our findings highlight the importance of conversational structure and speaker dynamics in modeling social attention, and offer a strong foundation for gaze-aware multimodal systems.
In human dialogue, nonverbal information such as nodding and facial expressions is as crucial as verbal information, and spoken dialogue systems are also expected to express such nonverbal behaviors. We focus on nodding, which is critical in an attentive listening system, and propose a model that predicts both its timing and type in real time. The proposed model builds on the voice activity projection (VAP) model, which predicts voice activity from both listener and speaker audio. Unlike conventional models, we extend it to predict various types of nodding in a continuous, real-time manner. In addition, the proposed model incorporates multi-task learning with verbal backchannel prediction and pretraining on general dialogue data. In the timing and type prediction task, the effectiveness of multi-task learning was significantly demonstrated. We confirmed that reducing the processing rate enables real-time operation without a substantial drop in accuracy, and integrated the model into an avatar attentive listening system. Subjective evaluations showed that it outperformed the conventional method, which always nods in sync with verbal backchannels. The code and trained models are available at https://github.com/MaAI-Kyoto/MaAI.
In social settings, people display sophisticated spatial behaviors—for example, one might naturally enter into a conversation by sidling up to a group. Artificial agents will need the ability to reason about spatial representations of social information to understand not only how social groups form, but also how to interact within and around them. Leveraging the insight that people reason about shared space topologically rather than geometrically, we employ techniques from applied topology to introduce a new method for social group analysis that improves quantifiability and enables rigorous analysis of social group structure. We present a novel topological mathematical formalism called the social simplicial complex that provides an equivalence relation for socially analogous configurations of people and is provably robust against small perturbations and noise. Moreover, this formalism suggests quantifiable metrics to assess the confidence of social group existence and the social closeness of people within groups. We further use this formalism to introduce an open-source toolkit for evaluating possible models of social relationships, which we name the Social Topological Analysis (SoTA) Toolkit. Finally, we explore algebraic topology’s potential to serve more generally as a powerful tool for multi-modal social data processing, and its possibilities for further applications in social-spatial analysis.
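To give a flavour of topological social-group analysis, the sketch below builds a simple proximity-based simplicial complex over people's positions: pairs within a radius form edges, and fully connected triples form 2-simplices. This Vietoris-Rips-style construction is a simplified stand-in for the social simplicial complex formalism, not the SoTA Toolkit itself.

```python
from itertools import combinations
import numpy as np

def proximity_complex(positions, radius):
    """Proximity-based simplicial complex over people's positions: pairs
    closer than `radius` form edges (1-simplices); fully connected triples
    form triangles (2-simplices)."""
    n = len(positions)
    dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    edges = {frozenset(p) for p in combinations(range(n), 2) if dist[p] < radius}
    triangles = {frozenset(t) for t in combinations(range(n), 3)
                 if all(frozenset(e) in edges for e in combinations(t, 2))}
    return edges, triangles

# Four people in a room (metres); the tight trio yields one 2-simplex.
pos = np.array([[0.0, 0.0], [1.0, 0.2], [0.5, 0.9], [4.0, 4.0]])
edges, triangles = proximity_complex(pos, radius=1.5)
```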
This article proposes a multimodal approach for the detection of disagreement in dyadic conversations, where disagreement means that people express different opinions about a topic under discussion. The key assumption underlying the work is that people tend to manifest different emotions depending on whether they are disagreeing or not; therefore, emotions can provide evidence that disagreement is taking place. The experiments were performed over a corpus of 684 clips involving 60 dyads (120 persons and roughly 8 hours of speech). Each clip revolves around a decision-making task and is annotated in terms of the percentage of time people spend in disagreement. For the sake of reproducibility, the Glasgow Disagreement Corpus, the data used in the experiments, has been made accessible through a link available in the paper. The results show that a multimodal approach based on language and paralanguage can predict this percentage with a Mean Absolute Error of 9.7 and a correlation of 0.52 between the actual and predicted percentage of time spent in disagreement.
Conversational systems that interact or collaborate with people must understand not only task success but also the quality of human experience. We present Speech-to-Joy, a lightweight framework that learns to predict users’ own post-interaction enjoyment ratings using latent embeddings from audio and text modalities. Evaluated on a corpus of human-robot dialogues, the model’s predicted enjoyment correlates strongly and significantly with user self-reports, outperforming both an experienced HRI annotator and heavier LLM-based uni- and multimodal baselines. Notably, even the unimodal audio branch - using only frozen speech embeddings - surpasses all baselines, and a late-fusion of text and audio achieves the highest performance. Designed for real-time inference on resource-limited platforms, Speech-to-Joy replaces ad-hoc emotion heuristics with a direct and user-centered measure of enjoyment. This work paves the way for optimizing interactions with robots and other conversational systems through the lens that matters most: the user’s own experience.
Digital humans are emerging as autonomous agents in multiparty interactions, yet existing evaluation metrics largely ignore contextual coordination dynamics. We introduce a unified, intervention-driven framework for objective assessment of multiparty social behaviour in skeletal motion data, spanning three complementary dimensions: (1) synchrony via Cross-Recurrence Quantification Analysis, (2) temporal alignment via Multiscale Empirical Mode Decomposition–based Beat Consistency, and (3) structural similarity via Soft Dynamic Time Warping. We validate metric sensitivity through three theory-driven perturbations—gesture kinematic dampening, uniform speech–gesture delays, and prosodic pitch-variance reduction—applied to ≈ 145 30-second thin slices of group interactions from the DnD dataset. Mixed-effects analyses reveal predictable, joint-independent shifts: dampening increases CRQA determinism and reduces beat consistency, delays weaken cross-participant coupling, and pitch flattening elevates F0 Soft-DTW costs. A complementary perception study (N = 27) compares judgments of full-video and skeleton-only renderings to quantify representation effects. Our three measures deliver orthogonal insights into spatial structure, timing alignment, and behavioural variability, thereby forming a robust toolkit for evaluating and refining socially intelligent agents. Code available on GitHub.
Accurate end-of-turn prediction in multiparty conversations is essential for enabling dialogue systems to participate in fluid and socially responsive interactions. While prior work has explored linguistic, prosodic, and gaze-based cues, the role of continuous bodily motion in modeling turn transitions remains relatively underexplored. This study investigates the contribution of head, hand, and full-body movements by learning symbolic motion representations through Vector Quantized Variational Autoencoders (VQ-VAE). Each motion modality is encoded independently into a discrete latent space, allowing us to assess their individual and combined predictive value for end-of-turn classification. Using a triadic conversation dataset with synchronized audio, gaze, and motion streams, we evaluate the independent and combined contributions of each motion modality. Our results show that hand motion, particularly when combined with gestural backchannel features, significantly improves performance. In contrast, head motion encoded via VQ-VAE provides only marginal gains and may overlap with discrete gesture labels. Compared to prior work, our model achieves higher precision, recall, and F1-score while maintaining real-time inference speed. These findings highlight the potential of structured, data-driven motion embeddings in developing socially aware dialogue systems.
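The symbolic motion representation rests on VQ-VAE vector quantization, whose core step (nearest-codebook lookup with a straight-through gradient) is sketched below under assumed codebook size and feature dimensions; it is not the study's full encoder-decoder pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-codebook quantization with a straight-through estimator,
    the step that turns continuous motion features into discrete symbols."""
    def __init__(self, num_codes=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                        # z: (batch, frames, dim)
        flat = z.reshape(-1, z.size(-1))
        d = torch.cdist(flat, self.codebook.weight)      # distances to codes
        idx = d.argmin(dim=-1)                           # discrete motion tokens
        z_q = self.codebook(idx).view_as(z)
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()             # straight-through gradient
        return z_q, idx.view(z.shape[:-1]), loss

# Toy usage: 100 frames of 64-dimensional hand-motion features.
z_q, tokens, vq_loss = VectorQuantizer()(torch.randn(2, 100, 64))
```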
Automated analysis of employment interviews has emerged as a promising avenue for leveraging multimodal machine learning (ML) to support both job seekers and recruiters, but existing research has mostly focused on homogeneous datasets of college students and relies predominantly on hand-crafted features. This paper advances the field by investigating multimodal ML models for hirability prediction that integrate pre-trained visual embeddings and transformer-based language models. It further presents the first analysis of such models on the VetTrain dataset, which features mock interviews between military veterans and industry professionals, and compares performance with the benchmark MIT Interview Dataset. In particular, the paper examines: (1) cross-dataset predictability differences; (2) the effectiveness of transformer-based language models for hirability prediction; and (3) the comparative performance of deep facial embeddings versus hand-crafted visual features. Results suggest that linguistic and acoustic features are most predictive in the VetTrain dataset, while the MIT Interview dataset shows stronger performance with visual cues. A sliding window text approach in the transformer-based language models is effective for handling lengthy, unstructured responses in VetTrain. While multimodal fusion improves performance in the MIT Interview dataset, it offers no added benefit over unimodal models in VetTrain. These findings underscore the importance of contextual differences in interview format, population, and setting, necessitating tailored approaches in ML-based interview support tools.
Automatically understanding and facilitating effective group collaboration remains a core challenge across social science and computational research. While prior work has focused on fine-grained social cues or coarse behavioral patterns, understanding the intermediate structure of dialogue—how sequences of utterances (discussion segments) reflect evolving group knowledge—is critical. This paper introduces a novel discussion segmentation framework and taxonomy for modeling collaborative problem-solving (CPS) processes, classifying segments into categories such as “task progress”, “task attempt”, and “grounding”. We collected and annotated over 1,700 multi-modal discussion segments from 21 group discussions, both in-person and online, based on this taxonomy. We further propose a baseline model that integrates audio, visual, and textual signals to classify discussion segments with an average F1 score of 69.3%. Notably, this lightweight expert model achieves performance comparable to, and sometimes exceeding, proprietary state-of-the-art multimodal large language models. These findings highlight the promise of sequence-level discourse analysis for automated facilitation and human-agent collaboration.
Conversational agents are becoming increasingly popular for digital mental health support. However, while empathy is essential for effective emotional support, the unimodal request-response interaction of such systems limits empathic communication. We address this limitation through a secondary channel that displays an agent’s inner reflections, similar to how nonverbal feedback in human interaction conveys cognitive and emotional states. We implemented a chatbot that not only generates conversational responses but also describes its internal reasoning and emotional resonance. A user study involving N = 188 participants indicated a statistically significant increase in perceived empathy (+14.7%) when the agent’s internal reflections were displayed. Our findings demonstrate a practical method to enhance empathic interaction with LLM-based chatbots in empathy-critical contexts. Additionally, this work opens possibilities for multimodal systems where LLM-generated reflections may serve as input for generating nonverbal feedback.
Virtual reality (VR) offers promising opportunities for procedural learning, particularly in preserving intangible cultural heritage. Advances in generative artificial intelligence (Gen-AI) further enrich these experiences by enabling adaptive learning pathways. However, evaluating such adaptive systems using traditional temporal metrics remains challenging due to the inherent variability in Gen-AI response times. To address this, our study employs multimodal behavioural metrics, including visual attention, physical exploratory behaviour, and verbal interaction, to assess user engagement in an adaptive VR environment. In a controlled experiment with 54 participants, we compared three levels of adaptivity (high, moderate, and non-adaptive baseline) within a Neapolitan pizza-making VR experience. Results show that moderate adaptivity optimally enhances user engagement, significantly reducing unnecessary exploratory behaviour and increasing focused visual attention on the AI avatar. Our findings suggest that a balanced level of adaptive AI provides the most effective user support, offering practical design recommendations for future adaptive educational technologies.
Affective Computing (AC) has made significant progress with the advent of deep learning, yet a persistent challenge remains: the reliable transfer of affective models from controlled laboratory settings (in-vitro) to uncontrolled real-world environments (in-vivo). To address this challenge, we introduce the Privileged Contrastive Pretraining (PriCon) framework, in which models are first pretrained via supervised contrastive learning (SCL) and then act as teacher models within a Learning Using Privileged Information (LUPI) framework. PriCon both leverages privileged information during training and enhances the robustness of the derived affect models via SCL. Experiments conducted on two benchmark affective corpora, RECOLA and AGAIN, demonstrate that models trained using PriCon consistently outperform LUPI and end-to-end models. Remarkably, in many cases, PriCon models achieve performance comparable to models trained with access to all modalities during both training and testing. These findings underscore the potential of PriCon as a paradigm for further bridging the gap between in-vitro and in-vivo affective modelling, offering a scalable and practical solution for real-world applications.
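The SCL pretraining stage uses a supervised contrastive objective of the kind sketched below (following Khosla et al.'s SupCon formulation), where embeddings that share an affect label act as positives. Shapes and the temperature value are illustrative assumptions, not PriCon's exact configuration.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """SupCon-style loss: embeddings sharing an affect label are pulled
    together, while all other samples in the batch act as negatives."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    # log-softmax over all other samples, averaged over each anchor's positives
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_count
    return loss[pos_mask.any(dim=1)].mean()     # anchors without positives skipped

# Toy usage: 8 embeddings with labels drawn from 3 affect classes.
loss = supervised_contrastive_loss(torch.randn(8, 128), torch.randint(0, 3, (8,)))
```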
The integration of vision-language models into robotic systems constitutes a significant advancement in enabling machines to interact with their surroundings in a more intuitive manner. While VLMs offer rich multimodal reasoning, existing approaches lack user-specific adaptability, often relying on generic interaction paradigms that fail to account for individual behavioral, contextual, or socio-emotional nuances. When customization is attempted, ethical concerns arise from unmitigated biases in user data, risking exclusion or unfair treatment. To address these dual challenges, we propose User-VLM 360°, a holistic framework integrating multimodal user modeling with bias-aware optimization. Our approach features: (1) user-aware tuning that adapts interactions in real time using visual-linguistic signals; (2) bias mitigation via preference optimization; and (3) curated 360° socio-emotive interaction datasets annotated with demographic, emotion, and relational metadata. Evaluations across eight benchmarks demonstrate state-of-the-art results: +35.3% F1 in personalized VQA, +47.5% F1 in facial features understanding, 15% bias reduction, and 30× speedup over baselines. Ablation studies confirm component efficacy, and deployment on the Pepper robot validates real-time adaptability across diverse users. We open-source parameter-efficient 3B/10B models and an ethical verification framework for responsible adaptation.
This paper investigates the performance of multimodal pre-trained models in user profiling tasks based on visual-linguistic demographic data. These models are critical for adapting to the needs and preferences of human users in social robotics, thereby providing personalized responses and enhancing interaction quality. First, we introduce two datasets specifically curated to represent demographic characteristics derived from user facial images. Next, we evaluate the performance of a prominent contrastive multimodal pre-trained model, CLIP, on these datasets, both in its out-of-the-box state and after fine-tuning. Initial results indicate that CLIP performs suboptimally in matching images to demographic descriptions without fine-tuning. Although fine-tuning significantly enhances its predictive capacity, the model continues to exhibit limitations in effectively generalizing subtle demographic nuances. To address this, we propose adopting a masked image modeling strategy to improve generalization and better capture subtle demographic attributes. This approach offers a pathway for enhancing demographic sensitivity in multimodal user modeling tasks.
Multimodal emotion recognition in conversation (MERC) requires modeling complex, context-dependent affective cues from heterogeneous modalities. However, most existing methods treat all modalities with equal importance throughout the dialogue, lacking a mechanism to adaptively filter redundant cues or emphasize salient signals as emotional dynamics shift. To address this, we propose the Adaptive Compression with Semantic Disentanglement Network (ACoSDN). ACoSDN introduces a context-aware controller that adjusts compression strength based on four dialogue structure signals—emotion shift, speaker gap, prediction entropy, and utterance position—enabling adaptive information retention. Furthermore, we disentangle compressed cross-modal representations into shared, unique, and synergistic semantic factors via a variational module, enhancing emotional interpretability and robustness. Experiments on two public datasets demonstrate that ACoSDN effectively handles emotion recognition tasks, showing robust performance particularly on ambiguous or minority emotion classes.
Designing eXtended Reality (XR) interaction techniques that function efficiently across varying contexts remains a significant challenge, leading to a fragmented landscape of input paradigms. A key factor influencing interaction performance is the user’s distance from virtual content. To address this, we present two user studies investigating how distance impacts interaction efficacy and user experience in XR environments. These studies evaluate widely used interaction methods—freehand techniques (Press, Airtap, Hover) and gaze-based techniques (Eye and Head)—across four proxemic zones: Intimate, Personal, Social, and Public. Each study involved 32 participants, with a combined total of 4,608 interaction trials (1,152 in Study 1 and 3,456 in Study 2). Findings reveal the strengths and limitations of techniques depending on user-object distance. Based on these insights, we offer design recommendations for tailoring XR interaction modalities to proxemic factors. This promotes adaptable interaction design, enabling more inclusive and personalised user experiences across diverse XR scenarios.
This paper presents the first attempt at zero-shot music emotion recognition (MER) to map musical pieces, represented in symbolic formats (e.g., ABC notation), onto the valence-arousal space. Conventional MER approaches typically train an end-to-end deep neural network (DNN). However, the performance of such supervised methods is limited due to the multifaceted and ambiguous nature of music emotions, compounded by the scarcity of MER datasets. To address this, we leverage knowledge transfer from large language models (LLMs) pre-trained on vast text and symbolic data. We hypothesize that LLMs possess capabilities in low-level music description and high-level emotion reasoning (not necessarily in a musical context). Accordingly, we propose a multi-agent framework that performs zero-shot MER by associating objective musical attributes (harmony, melody, rhythm, and structure) with subjective attributes (valence and arousal). Our system employs a hierarchical architecture comprising (i) musical element descriptors, (ii) chain-of-thought emotion analysts, and (iii) comprehensive predictors. Knowledge injection and zero-shot prompting are utilized to mitigate inherent model biases. Evaluations on the EMOPIA dataset demonstrate that our system, built on the Gemini-2.0-Flash backbone, significantly outperforms baseline LLM models, including ultra-large models and mixture-of-experts (MoE) systems, and performs comparably to fully supervised or fine-tuned models.
Attribute manipulation deals with the problem of changing individual attributes of a data point or a time series, while leaving all other aspects unaffected. This work focuses on the domain of human motion, more precisely karate movement patterns. To the best of our knowledge, it presents the first success at manipulating attributes of human motion data. One of the key requirements for achieving attribute manipulation on human motion is a suitable pose representation. Therefore, we design a novel continuous, rotation-based pose representation that enables the disentanglement of the human skeleton and the motion trajectory, while still allowing an accurate reconstruction of the original anatomy. The core idea of the manipulation approach is to use a transformer encoder for discovering high-level semantics, and a diffusion probabilistic model for modeling the remaining stochastic variations. We show that the embedding space obtained from the transformer encoder is semantically meaningful and linear. This enables the manipulation of high-level attributes by discovering their linear direction of change in the semantic embedding space and moving the embedding along that direction. All code and data are made publicly available.
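The manipulation step itself is simple once the embedding space behaves linearly: estimate the direction along which an attribute varies and translate an embedding along it. The sketch below uses a least-squares fit on toy data as a stand-in; in the paper's pipeline, the edited embedding would then be decoded back to motion with the diffusion model.

```python
import numpy as np

def attribute_direction(embeddings, attribute_values):
    """Estimate the linear direction along which an attribute varies in the
    semantic embedding space (least-squares fit of value onto embedding)."""
    X = embeddings - embeddings.mean(axis=0)
    y = attribute_values - attribute_values.mean()
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w / np.linalg.norm(w)

def manipulate(embedding, direction, amount):
    """Move an embedding along the attribute direction by a chosen amount."""
    return embedding + amount * direction

# Toy motion embeddings and attribute scores (illustrative random data).
emb = np.random.randn(50, 32)
attr = emb @ np.random.randn(32) + 0.1 * np.random.randn(50)
d = attribute_direction(emb, attr)
edited = manipulate(emb[0], d, amount=2.0)
```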
Digital media is increasingly audio-rich, yet much of its sound content remains inaccessible for deaf and hard-of-hearing (DHH) individuals. While prior work has focused on captioning and sound recognition, little research has explored how sound itself can be transformed to better align with hearing needs, preferences, and contexts for people with partial hearing. In this paper, we present findings from a formative study with 24 DHH participants that examines their experiences and unmet needs around digital media audio. Participants emphasized the importance of features such as selective speaker amplification, contextual sound control, semantic summarization, and adaptive personalization. Based on these insights, we introduce the ReMediaTion Framework, a layered model that articulates user goals, audio transformation dimensions, interaction strategies, contextual modulation, and expressive engagement. Our work provides a foundation for designing future audio accessibility systems that go beyond substitution, empowering DHH users to reshape how they experience and interpret sound.
Despite advances in practical and multimodal fine-grained Human Activity Recognition (HAR), a system that runs entirely on smartwatches in unconstrained environments remains elusive. We present WatchHAR, an audio and inertial-based HAR system that operates fully on smartwatches, addressing privacy and latency issues associated with external data processing. By optimizing each component of the pipeline, WatchHAR achieves compounding performance gains. We introduce a novel architecture that unifies sensor data preprocessing and inference into an end-to-end trainable module, achieving 5x faster processing while maintaining over 90% accuracy across more than 25 activity classes. WatchHAR outperforms state-of-the-art models for event detection and activity classification while running directly on the smartwatch, achieving 9.3 ms processing time for activity event detection and 11.8 ms for multimodal activity classification. This research advances on-device activity recognition, realizing smartwatches’ potential as standalone, privacy-aware, and minimally-invasive continuous activity tracking devices.
The integration of multimodal AI agents into human teams raises critical questions about collaboration in ad hoc environments without pre-coordination. We examine how team composition affects psychological dimensions in Human-AI Teaming by investigating self-confidence, satisfaction, and accountability across configurations. Using a factorial design, we compared four conditions (Human-Only, Human-Human, Human-Agent, Human-Human-Agent) across resource management, healthcare, and finance domains. Fifty-four participants completed decision tasks using a tree-of-thought framework, with performance assessment and psychological measures. Results revealed domain-specific patterns: participants preferred human-led teams for health decisions but AI-assisted teams for data-driven tasks. Satisfaction did not increase when incorporating agents into human-expert teams, suggesting cognitive load constraints. Accountability attribution varied with perceived performance, with users taking more responsibility for failures while distributing credit for successes. These findings, grounded in cognitive load theory, advance our understanding of psychological dimensions in ad hoc teaming and provide a framework for designing domain-appropriate HAT systems that balance performance with ethical considerations.
Virtual Reality (VR) technologies provide interactive experiences capable of evoking a wide spectrum of emotional responses from users. However, there is a notable scarcity of VR-based multimodal emotional response datasets designed to enhance the accuracy of emotionally immersive film and television productions. To address this gap, we conducted a study to develop a comprehensive, multimodal, annotated dataset capturing users’ affective, physiological, and behavioral responses to 360° panoramic videos in an immersive VR environment. Our dataset specifically focuses on two age groups: adults and minors. The dataset collection process involved gathering participants’ self-reported emotional responses alongside objective measures, including behavioral data (e.g., head movements and gaze patterns) and physiological data (e.g., heart rate and skin conductance signals). To analyze the data, we employed subject-independent baseline classification algorithms to evaluate the usefulness of the dataset for emotion analysis. Furthermore, we assessed the consistency of participants’ interactions with specific regions of the 360° panoramic videos across experimenters, and examined the correlation between physiological data and self-reported emotional responses. This publicly available multimodal dataset provides a valuable resource to facilitate numerous future efforts on VR-based affective computing research and on tailoring VR content to diverse audiences based on emotional and demographic profiles.
Robotic systems have been increasingly applied across a wide range of sectors to alleviate the burden on human labor, enhancing efficiency in various work settings. In certain instances, effective human-robot collaboration is essential for task success, and it can be facilitated by equipping robots with emotional and empathetic capabilities that resemble those of humans. Although interpreting emotions via facial cues or physical gestures is comparatively accessible, replicating and modeling empathy computationally poses a significantly greater challenge. This paper outlines the architecture and performance of computational models developed for simulating and predicting human empathy. The OMG-Empathy dataset, comprising storytelling video recordings and self-reported empathy valence ratings from listeners after viewing, was used to train models aimed at predicting listeners’ empathy responses. A range of multimodal features, such as conversational content, facial expressions, and vocal arousal, were extracted for analysis. These features were examined to assess their impact on the prediction of empathy valence. For the computational models, various Machine Learning (ML) techniques were implemented, namely Support Vector Machines (SVM), Decision Trees, Random Forests, and Neural Networks (NN), with the Concordance Correlation Coefficient (CCC) employed as the primary evaluation criterion. Among all models evaluated, the Support Vector Machine (SVM) achieved the best performance, yielding the highest personalized and generalized CCC scores of around 0.07. Results also indicated that the linguistic empathy feature had a stronger correlation with overall empathy than the emotional empathy feature. This research offers insights into computational modeling of empathy and potential inspiration for developing more human-like robots. Future work may involve further refining empathy extraction approaches and building more robust models that are adaptable to a wider range of cases.
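For reference, the Concordance Correlation Coefficient used as the evaluation criterion above has the standard form

\[
\mathrm{CCC} = \frac{2\,\rho\,\sigma_x \sigma_y}{\sigma_x^{2} + \sigma_y^{2} + (\mu_x - \mu_y)^{2}},
\]

where \(x\) and \(y\) are the predicted and self-reported valence series, \(\mu\) and \(\sigma^{2}\) their means and variances, and \(\rho\) their Pearson correlation; a CCC of around 0.07 therefore indicates only weak agreement between predictions and ratings.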
Automatic Speech Recognition (ASR) systems like OpenAI’s Whisper models have achieved remarkable performance in tasks such as transcription, language identification, and language translation. However, their widespread deployment raises security concerns due to their susceptibility to adversarial attacks. One such attack, a universal adversarial perturbation, effectively "mutes" Whisper by prepending a clip of audio that halts its transcription. While effective, this prepended attack is unrealistic for real-world use as it assumes the audio was recorded in an anechoic environment. Due to this limitation, the attack fails when used in the wild, as it is overpowered by the ambient noise. We propose a more practical alternative by training an overlaid universal attack which is successful in noisy environments. Additionally, to reduce the attack’s human perceptibility, we incorporate a frequency-domain constraint into the objective function which restricts the volume of the attack in audible frequencies. Our approach preserves the effectiveness of the original attack, muting over 97% of speech samples, while improving applicability in real-world settings and maintaining a low degree of perceptibility.
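One plausible way to formalize the objective described above (a hedged reconstruction from the abstract, not necessarily the authors’ exact formulation) is to optimize a universal overlaid perturbation \(\delta\) that both mutes the model and keeps its energy low at audible frequencies:

\[
\min_{\delta}\; \mathbb{E}_{x \sim \mathcal{D}}\!\left[\mathcal{L}_{\mathrm{mute}}\big(f(x + \delta)\big)\right] \;+\; \lambda \sum_{k \in \mathcal{K}_{\mathrm{audible}}} \bigl|\hat{\delta}(k)\bigr|^{2},
\]

where \(f\) is the Whisper model, \(\mathcal{L}_{\mathrm{mute}}\) rewards producing an (effectively) empty transcription, \(\hat{\delta}\) is the Fourier transform of the perturbation, \(\mathcal{K}_{\mathrm{audible}}\) is the set of audible frequency bins, and \(\lambda\) trades attack strength against human perceptibility.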
Dynamic Hand Gesture Recognition facilitates intuitive human-computer interaction, yet current deep learning approaches face challenges in efficiently integrating multi-scale features and capturing long-range temporal dependencies. These methods often rely on single-scale features or simple concatenation, lacking adaptive fusion across spatial and temporal scales. Moreover, short-term self-attention struggles to capture extended dependencies without high computational costs from global convolutions. We propose TCAF-HANet, an end-to-end framework for Dynamic Hand Gesture Recognition (DHGR) that addresses these limitations. Built on a ResNet18 backbone, TCAF-HANet extracts multi-layer spatio-temporal representations using a feature pyramid and employs the Temporal-Channel Adaptive Fusion Module (TCAF). The TCAF module unifies shallow, middle, and deep features through frame-by-frame channel mapping and adaptive resolution downsampling, integrating time-aware embeddings and predicting adaptive fusion weights via 3D convolution for time-dynamic responsiveness. The Hierarchical Temporal Attention-Convolution Module (HTAC) segments the fused output into fixed-length temporal windows, using multi-head self-attention to model short-term dynamics, enhanced by residual feedforward networks. Multi-scale Temporal Convolution (MSTCN) with varied kernel sizes aggregates window tokens to capture dependencies across multiple temporal scales efficiently. This hierarchical approach balances short-term detail and long-term continuity with reduced computational overhead. Evaluation on the NVGesture and Briareo datasets shows that TCAF-HANet achieves top-performing accuracy with an 8.2% reduction in parameters and 17.6% lower computational cost compared to standard Transformer-based models.
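As a rough sketch of what TCAF-style adaptive fusion could look like, the snippet below stacks shallow, middle, and deep feature maps (already resampled to one resolution) and lets a 3D convolution predict per-frame fusion weights that are normalised across scales. The module, its names, and all shapes are illustrative assumptions, not the published TCAF implementation.

```python
import torch
import torch.nn as nn

class AdaptiveScaleFusion(nn.Module):
    """Illustrative sketch of adaptive multi-scale fusion: a 3D convolution
    scores each scale per frame, and a softmax across scales yields
    time-dynamic fusion weights."""

    def __init__(self, channels: int):
        super().__init__()
        self.weight_pred = nn.Conv3d(channels, 1, kernel_size=(1, 3, 3), padding=(0, 1, 1))

    def forward(self, feats):
        # feats: list of tensors, each (B, C, T, H, W), one per pyramid level
        scores = [self.weight_pred(f).mean(dim=(3, 4)) for f in feats]   # each (B, 1, T)
        w = torch.softmax(torch.stack(scores, dim=0), dim=0)             # (S, B, 1, T)
        fused = sum(w[i].unsqueeze(-1).unsqueeze(-1) * feats[i] for i in range(len(feats)))
        return fused                                                     # (B, C, T, H, W)

fuse = AdaptiveScaleFusion(channels=64)
levels = [torch.randn(2, 64, 8, 14, 14) for _ in range(3)]               # shallow / middle / deep
out = fuse(levels)                                                        # (2, 64, 8, 14, 14)
```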
A voice activity prediction (VAP) model is a pre-trained model learned via a self-supervised task that predicts future speech activity using the vocal information of both interlocutors in dyadic interaction. It has proven to be a strong foundation for online end-of-turn and backchannel prediction. We propose a multimodal VAP model that integrates not only acoustic features but also spoken language and facial expressions to enhance VAP learning. We further fine-tune this model for real-time prediction of end-of-turn and backchannel events using continuous multimodal input. Experimental results indicate that incorporating language and visual modalities improves VAP performance over audio-only baselines. In end-of-turn and backchannel prediction tasks, models fine-tuned from the multimodal VAP model also outperform conventional audio-only models, with particularly notable improvements observed in configurations that combine audio with language, audio with visual input, or all three modalities. These findings highlight the effectiveness of language and multimodal integration in speech-related prediction tasks and support the development of more natural and responsive dialogue systems.
Autonomous drones are increasingly used across various domains, yet critical situations can arise, and little research exists on how users prefer to be alerted during these events. In multi-drone control scenarios, where human-machine interfaces are used to monitor multiple drones simultaneously, alerting preferences are critical for ensuring situational awareness and timely decision-making. This paper explores multimodal alert design preferences in a user-centered approach. In an online survey, drone pilots identified critical scenarios, with collision risks, signal loss, and hardware problems being the most prevalent challenges. The subsequent study examined notification preferences for multi-drone control interfaces. Participants designed alerts for critical scenarios that were created based on the findings from the first survey. Using a printed control room interface with drone feeds and a map view, participants created multimodal alerts combining visual cues (e.g., frames, text), auditory signals (e.g., beeps), and, less frequently, tactile notifications (vibrations). This work bridges real-world drone operation challenges with user-centered multimodal interface design for autonomous systems.
Recent advances in AI have made automated analysis of complex media content at scale possible while generating actionable insights regarding character representation along dimensions such as gender and age. Past works focused on quantifying representation from audio/video/text using AI models, but without having the audience in the loop. We ask: even if character distributions along demographic dimensions are available, how useful are they to the general public? Do people actually trust the numbers generated by AI models? Our work addresses these open questions by proposing a new AI-based character representation tool and performing a thorough user study. Our tool has two components: (i) an analytics extraction model based on the Contrastive Language Image Pretraining (CLIP) foundation model that analyzes visual screen data to quantify character representation across age and gender; (ii) a visualization component designed to present the analytics effectively to a lay audience. The user study seeks empirical evidence on the usefulness and trustworthiness of the AI-generated results for carefully chosen movies presented in the form of our visualizations. We found that participants were able to understand the analytics in our visualizations and deemed the tool ‘overall useful’. Participants also indicated a need for more detailed visualizations that include more demographic categories and contextual information about the characters. Participants’ trust in AI-based gender and age models was moderate to low, although they were not against the use of AI in this context. Our tool, including code, benchmarking, and the user study data, can be found at https://github.com/debadyuti0510/Character-Representation-Media.
Eye-tracking data reveals valuable insights into users’ cognitive states but is difficult to analyze due to its structured, non-linguistic nature. While large language models (LLMs) excel at reasoning over text, they struggle with temporal and numerical data. This paper presents a multimodal human–AI collaborative framework designed to enhance cognitive pattern extraction from eye-tracking signals. The framework includes: (1) a multi-stage pipeline using horizontal and vertical segmentation alongside LLM reasoning to uncover latent gaze patterns; (2) an Expert–Model Co-Scoring Module that integrates expert judgment with LLM output to generate trust scores for behavioral interpretations; and (3) a hybrid anomaly detection module combining LSTM-based temporal modeling with LLM-driven semantic analysis. Our results across several LLMs and prompt strategies show improvements in consistency, interpretability, and performance, with up to 50% accuracy in difficulty prediction tasks. This approach offers a scalable, interpretable solution for cognitive modeling and has broad potential in adaptive learning, human–computer interaction, and educational analytics.
This systematic review investigates the current state of research on multimodal fusion methods, i.e., the joint analysis of multimodal inputs, for intentional, instruction-based human-computer interactions, focusing on the combination of speech and spatially expressive modalities such as gestures, touch, pen, and gaze. We examine 50 systems from a User-Centered Design perspective, categorizing them by modality combinations, fusion strategies, application domains and media, as well as reusability. Our findings highlight a predominance of descriptive late fusion methods, limited reusability, and a lack of standardized tool support, hampering rapid prototyping and broader applicability. We identify emerging trends in machine learning-based fusion and outline future research directions to advance reusable and user-centered multimodal systems.
Complex applications utilizing embodied AI often rely on cloud-based large language models and speech recognition and generation services. This makes them vulnerable to network issues, leading to inconsistent and sometimes high response latency. While response latency and how to bridge it have been thoroughly researched in social psychology and classic human-computer interaction, research on how embodied AI should behave to minimize perceived waiting time is rare. Therefore, two consecutive VR studies were conducted to fill this research gap. In the first study (N=90), this paper investigates the impact of varying response latency on the users’ perception of an embodied AI and assesses the participants’ suggestions for bridging the waiting time with conversational fillers. In the second study (N=104), we implemented machine- and human-like verbal and non-verbal conversational fillers and evaluated participants’ perception of them. The results suggest that verbal fillers lead to shorter perceived waiting times. The conversational fillers’ anthropomorphism had no direct effect on the participants’ perception of latency. These results provide clear design implications for AI-driven conversational systems that are subject to unpleasant latency.
Text-to-image diffusion models demonstrate strong capabilities in generating photorealistic content across diverse domains. However, they remain limited in synthesizing clinically relevant facial anomalies, such as cleft lip, due to the lack of domain-specific representations and adaptation strategies. In this work, we introduce a method for domain-specialized image generation by adapting a publicly available multimodal diffusion model to synthesize prompt-based, realistic facial images of both pre-operative and post-operative cleft lip conditioned on a small set of real images. We compute quantitative metrics to evaluate the realism, identity safety, and diversity of the generated images, including face identity recognition (FIR), Fréchet inception distance (FID), and learned perceptual image patch similarity (LPIPS). In addition, two medical experts independently rated a subset of the generated samples for anatomical plausibility and visual realism. Results show that the adapted model avoids identity leakage, outperforms previous GAN-based approaches in distributional similarity, and achieves average human ratings of 4.85 for realism and 4.81 for anatomical plausibility on a 5-point Likert scale. Beyond qualitative generation, we demonstrate the clinical utility of the generated images by training a lip anomaly detection model on synthetic samples, achieving an accuracy of 79% on real clinical data. These findings establish a new paradigm for adapting generative models toward generating diverse, clinically meaningful imagery with high fidelity and domain specificity.
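For context, the Fréchet inception distance reported above is the standard measure of distributional similarity between real and generated images, computed from Gaussian fits to Inception features:

\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2} + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right),
\]

where \((\mu_r, \Sigma_r)\) and \((\mu_g, \Sigma_g)\) are the feature means and covariances of the real and generated sets; lower values indicate closer distributions.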
Collaborative learning in K-12 classrooms creates deeply engaging problem-solving experiences for students. This involves rich, multimodal interactions, such as speech, gaze, and gesture, that can provide insight into how collaborative learning unfolds and contributes to effective learning and engagement. However, analyzing these interactions is challenging, as it typically requires manual processing of complex multimodal data, making the task labor-intensive and time-consuming. Multimodal classroom video question-answering holds significant potential for addressing these challenges by automatically capturing and interpreting collaborative problem-solving behaviors. In this paper, we present a multimodal classroom video question-answering framework that analyzes and automatically answers questions on videos of students’ collaborative problem-solving communication and interactions in K-12 classrooms. The video question-answering framework leverages large language model code generation capabilities to analyze classroom videos of students’ collaborative problem solving and spoken language dialogue by dynamically composing Python programs that call expert models including audio-visual grounding, speech recognition, text analysis, gaze tracking, object detection, and moment localization. Results of evaluations demonstrate that the framework significantly outperforms competitive baselines (i.e., ViperGPT, VideoChat2) according to semantics-based automated metrics along multiple types of student engagement and according to an automated video-based evaluation method we introduce. Furthermore, the results suggest that the video question-answering framework accurately analyzes student collaborative problem solving in classrooms.
Physiological synchrony—the unconscious, dynamic alignment of physiological responses such as heart rate and electrodermal activity (EDA)—is increasingly recognized as a crucial element of effective teamwork and interpersonal dynamics. While synchrony has been studied extensively in romantic partners, friends, and therapeutic contexts, there is limited research on how it operates within high-stress, hands-on environments such as paramedic trainee simulations. In this study, we examine how differences in synchrony relate to multimodal interaction—specifically, verbal and nonverbal—between paramedic trainee dyads during simulation training. Quantitative analysis revealed statistically significant differences in Technical Coordination across synchrony levels during the Consult phase, with higher synchrony associated with more effective coordination. Qualitative analysis further highlighted distinct interactional patterns: high-synchrony teams demonstrated mutual gaze, closer physical proximity, aligned body orientation, and cooperative dialogue, whereas low-synchrony teams often displayed disengagement, spatial misalignment, and minimal interaction. These findings underscore the role of physiological synchrony in shaping the quality and effectiveness of multimodal team interaction, offering practical insights for improving collaboration in emergency medical training environments.
Stress detection is essential for advancing mental health diagnostics, optimizing workplace well-being, and enhancing cognitive performance monitoring. Existing approaches rely mostly on unimodal signals, such as physiological data or facial expressions, which are insufficient to capture the complex and multifaceted nature of stress responses. Despite observational evidence of the importance of body gestures in affective computing applications, body expressions have not been studied in the domain of automatic stress detection. In this paper, we introduce a novel multimodal framework for stress detection that combines physiological signals (BVP, EDA) with behavioral features extracted from facial expression and body gesture analysis. We demonstrate that our multimodal system outperforms state-of-the-art stress detection models when evaluated on the UBFC-Phys dataset. A detailed feature importance analysis showed the significance and complementary nature of the face and body modalities in stress detection. This work establishes a new benchmark in stress classification by demonstrating the effectiveness of multimodal fusion, especially the contribution of gesture and posture cues, in improving both accuracy and robustness.
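As a toy illustration of the kind of per-modality feature-importance analysis mentioned above (the classifier choice, feature counts, and random placeholder data are assumptions, not the paper’s setup), grouped permutation importances make the relative contribution of each modality explicit:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
blocks = {"bvp": 8, "eda": 6, "face": 20, "body": 16}               # features per modality (assumed)
X = np.hstack([rng.normal(size=(n, d)) for d in blocks.values()])   # placeholder feature matrix
y = rng.integers(0, 2, size=n)                                       # placeholder stress labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Sum permutation importances within each modality block to compare modalities.
imp = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
start = 0
for name, d in blocks.items():
    print(name, imp.importances_mean[start:start + d].sum())
    start += d
```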
Emotional self-expression positively impacts people’s well-being, for instance by reducing stress. As digital mental health interventions become more prevalent in assisting graduate students with stress management and emotional awareness, it is important to understand how self-expression modalities can be effectively integrated. We conducted a study in which graduate students reflected on a positive experience and a negative experience, separately, through three self-expression modalities (visual art, writing, and movement). Our results showed that reflecting on negative experiences elicited a sense of relief, as it felt calming and therapeutic. The visual and movement modalities helped participants express themselves creatively, whereas writing pushed them to face their emotions head-on. Our findings contribute to the multimodal interaction community and the field of HCI toward developing intelligent, affective multimodal digital technologies that encourage enhanced well-being through emotional and creative expression and reflection for graduate students.
As AI agents become more prevalent in our everyday lives, one key question surrounds them: under what circumstances should they make use of multimodal input to inform their decision-making? To begin to examine this question, we ran a Wizard of Oz study in which teams of three participants were guided either by a conversational agent that solely used visual information from the task environment or by an agent that also listened and responded to participant queries. Although participants generally favoured the agent that could listen and respond to their queries, such responsiveness did not impact how teams performed the task or communicated with each other. Taken together, this suggests that for tasks where the guidance provided by the agent is small and finite, and the informational needs of the agent can be met by another modality, such agents may not require sophisticated conversational dialogue capability.
Virtual Reality (VR) is increasingly popular, but technical barriers remain for individuals with little experience in coding, 3D modeling, or authoring virtual experiences. Current VR content creation tools are often viewed as complex or frustrating due to a steep learning curve that involves memorizing and navigating 2D interfaces, and even state-of-the-art tools often provide limited user experiences. This problem motivates the application of a multimodal interaction tool that incorporates Natural User Interfaces to better support end users through natural language and Large Language Models (LLMs). These technologies can be used to foster a Human-AI co-creative process that pushes the boundaries of creation while reducing the users’ workload, without completely removing them from the creative process.
In this paper, we describe a template that affords the user multimodal input leveraging 3D user interfaces and LLMs. We present a summative research study with 22 novice participants to assess the usability and potential of our template. Our participants were tasked with authoring a predefined environment, and they indicated that the tool, though complex, was easy to use after some experience with it. We found that participants tended to use speech for coarse tasks but relied on manual manipulation for finer adjustments.
Our results indicate that our multimodal approach, combining a large language model with 3D user interface modalities, is viable and can provide a more intuitive and accessible interface. Future research and development will focus on fine-tuning templates and interactions and on expanding capabilities to better support the user.
This paper proposes perceptual functional spectrum analysis (pFSA), a framework for analyzing how people perceive the multifunctional nonverbal behaviors that emerge in conversations. The goal is to elucidate, in a separable way, two intrinsic properties of nonverbal behavior: functional multiplicity and interpretational ambiguity. The former property is that a single behavior can imply multiple meanings; the latter is that different observers can interpret the same behavior differently. In the pFSA framework, the labels of multiple raters across multiple functions over time are represented as a third-order tensor. This study then formulated a semiorthogonal nonnegative tensor factorization (SO-NTF) that approximates the input tensor as a linear combination of a functional basis matrix, perceptual basis matrices, and perceptual coefficient matrices. The functional basis matrix consists of functional spectra that represent fundamental functionalities in conversations. The perceptual basis matrices represent perceptual tendencies, i.e., the sensitivities of the raters to the fundamental functionalities. The perceptual coefficient matrices represent the temporal activations of the perceptual tendencies. The pFSA framework constructs the perceptual basis matrices to characterize both label reliability and diversity. This study targeted 32 head movement functions labeled by ten raters. The experimental results confirmed that pFSA could successfully analyze the levels of ambiguity for multiple functionalities, such as low ambiguity for addressing and backchannel functions and high ambiguity for thinking functions.
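One plausible way to write the factorization sketched above (a hedged reading of the abstract, not the paper’s exact formulation) is, for each rater \(r\), to approximate the time-by-function label slice \(X^{(r)}\) of the third-order tensor as

\[
X^{(r)} \;\approx\; C^{(r)} P^{(r)} B^{\top}, \qquad B,\; P^{(r)},\; C^{(r)} \ge 0,
\]

where \(B\) (functions \(\times\) spectra) is the shared functional basis matrix, \(P^{(r)}\) the rater-specific perceptual basis encoding sensitivities to the fundamental functionalities, and \(C^{(r)}\) the perceptual coefficient matrix of temporal activations; the semiorthogonality of SO-NTF would then be imposed on one of these factors (e.g., \(B^{\top}B = I\)).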
Modeling multimodal data in privacy‑sensitive settings, such as human behavior recognition, has emerged as a critical challenge. While multimodal federated learning (MFL) offers a promising privacy‑preserving framework for collaborative modeling, it still falls short in capturing fine‑grained inter‑modal features and uncovering deeper cross‑modal relationships, dampening data expressiveness and constraining overall model performance. In this paper, we propose a client‑centric, fine‑grained fusion framework for MFL to address the aforementioned problems. Specifically, we design a fine-grained modality feature chunking method that can identify and leverage semantically similar local features. This method enables local block collaboration and effectively mitigates interference caused by modality differences. Additionally, a graph attention network based on similarity matrices is employed to update the features of each modality block. By capturing nonlinear and indirect relationships between modality blocks, this method enhances modality alignment accuracy and strengthens the expressive power of multimodal data. Experiments on two multimodal benchmark datasets show that our method outperforms existing approaches in classification accuracy and robustness, especially in handling modality heterogeneity and cross-modal information fusion.
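As a toy illustration of the second component, the snippet below updates each modality block from its semantically similar neighbours using a cosine-similarity matrix. It is a minimal sketch: the paper’s graph attention network has learned attention parameters, whereas this version only does similarity-weighted message passing.

```python
import torch
import torch.nn.functional as F

def update_blocks(blocks: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # blocks: (N, D) -- N fine-grained feature chunks across modalities, D-dim each
    sim = F.cosine_similarity(blocks.unsqueeze(1), blocks.unsqueeze(0), dim=-1)  # (N, N) similarity matrix
    attn = torch.softmax(sim / temperature, dim=-1)                              # row-normalised weights
    return attn @ blocks                                                         # each block aggregates similar blocks

updated = update_blocks(torch.randn(12, 64))   # 12 hypothetical modality blocks
```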
Recent research has highlighted the risk of generative model collapse, where performance progressively degrades when models are continually trained on self-generated data. However, existing exploration of model collapse is limited to single, unimodal models, limiting our understanding of more realistic scenarios, such as diverse multi-modal AI agents interacting autonomously through synthetic data and continually evolving. We expand the study of synthetic data training and model collapse to multi-modal vision-language generative systems, such as vision-language models (VLMs) and text-to-image diffusion models, as well as to recursive generate-train loops with multiple models. We find that model collapse, previously observed in single-modality generative models, exhibits distinct characteristics in the multi-modal context, such as improved vision-language alignment and increased variance in the VLM image-captioning task. Additionally, we find that general approaches such as increased decoding budgets, greater model diversity, and relabeling with frozen models can effectively mitigate model collapse. Our findings provide initial insights and practical guidelines for reducing the risk of model collapse in self-improving multi-agent AI systems and for curating robust multi-modal synthetic datasets.
AI chat interfaces are increasingly used by college students, yet research is limited on how Deaf and Hard of Hearing (DHH) students, who navigate between sign language and text modalities, interact with these text-based systems. Through a large-scale survey (n=87) and a user study (n=45), we investigated DHH college students’ multimodal interaction patterns with AI interfaces. Our findings reveal that 55% of DHH college students actively use AI chat interfaces despite cross-modal translation challenges, exhibiting distinctive language patterns that reflect their visual-spatial thinking processes. We documented specific visual information processing strategies, with participants demonstrating clear preferences for structured content formats that facilitate efficient information extraction. While AI systems showed fault tolerance for DHH-specific language patterns, 58% of user study participants explicitly requested bidirectional sign language integration to enable communication in their primary modality. Based on these findings, we propose design implications for accessible AI interfaces that support navigation between modalities, promoting inclusive technology experiences for diverse communication needs.
Drunk driving remains a significant public safety challenge, demanding innovative alternatives to conventional methods such as field sobriety tests and breathalysers. Estimating a driver’s level of intoxication through facial cues is particularly challenging due to the subtle and person-specific nature of alcohol-induced behaviours. In this paper, we present BiFuseNet, a 3D spatio-temporal multi-modal network designed to classify alcohol impairment levels into three categories: sober, moderate, and severe. Unlike prior approaches that rely on either uni-modal RGB video or hand-crafted facial features, our method exploits complementary physiological cues from RGB and infrared (IR) facial videos. We introduce a Bi-directional Hierarchical Fusion (BiHF) module that applies cross-attention mechanisms at multiple semantic levels of our BiFuseNet, including early, middle, and late feature stages. This enables deep integration of modality-specific signals across varying temporal and spatial contexts. To capture both short-term facial movements and sustained facial dynamics, we implement a sliding window strategy that samples over 30 frames across ten-minute recordings. Extensive experiments on a public dataset demonstrate that BiFuseNet outperforms uni-modal and traditional fusion baselines, achieving a classification accuracy of 88.41% and an AUC-ROC of 0.91, establishing a new state of the art in estimating blood alcohol concentration.
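To give a sense of what one level of bi-directional cross-modal exchange might look like, the sketch below lets RGB tokens attend to IR tokens and vice versa with standard multi-head cross-attention. The class, dimensions, and residual wiring are assumptions for illustration; BiFuseNet applies this kind of exchange at early, middle, and late feature stages rather than at a single point.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Illustrative single-level RGB<->IR cross-attention block."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.rgb_from_ir = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ir_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, ir):
        # rgb, ir: (B, T, dim) token sequences from the two facial video streams
        rgb_out, _ = self.rgb_from_ir(query=rgb, key=ir, value=ir)
        ir_out, _ = self.ir_from_rgb(query=ir, key=rgb, value=rgb)
        return rgb + rgb_out, ir + ir_out      # residual fusion keeps modality-specific signals

rgb_f, ir_f = BidirectionalCrossAttention()(torch.randn(2, 30, 256), torch.randn(2, 30, 256))
```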
The key question addressed in this article is whether the traces of depression in speech are punctual (the pathology manifests itself at specific points in time) or continuous (the pathology manifests itself at every moment in time), where the expression “speech” refers to both speech signals and their transcriptions. For this reason, this work compares the performance of different approaches (both unimodal and multimodal) based either on the assumption that the traces are punctual or on the assumption that they are continuous. In this way, it is possible to test which of the two assumptions is more realistic. The experiments were performed on a publicly available dataset (the Androids Corpus), and the results include F1 scores up to 93.1%, among the best reported in the literature for this corpus. Furthermore, the results suggest that depression traces are punctual, but they appear so frequently that approaches based on the assumption of continuous traces still perform well. The conclusions discuss the implications of this observation.
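To make the contrast concrete, the toy snippet below shows one way the two assumptions translate into pooling per-frame depression scores over a recording (an illustrative sketch with made-up numbers, not the paper’s actual models):

```python
import numpy as np

frame_scores = np.array([0.10, 0.20, 0.90, 0.15, 0.10])  # hypothetical per-frame posteriors

punctual_score = frame_scores.max()      # punctual assumption: traces appear at specific points
continuous_score = frame_scores.mean()   # continuous assumption: traces appear at every moment

print(punctual_score, continuous_score)  # 0.9 vs 0.29
```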
Tactile representations of fabrics are critical in virtual retail and VR fitting applications. A pinching tactile display (PTD) using electrostatic adhesion is a fabric-type display that can flexibly adjust tactile sensations by adjusting voltage and frequency, and it is expected to be applied to the remote transmission of fabric tactile sensations. However, previous evaluations observed no significant differences in tactile sensation other than roughness. Building on established research in cross-modal perception, we investigated whether overlaying visual fabric textures could alter the tactile expressiveness of PTDs. We developed a system that integrated a systematically controlled conductive fabric-based PTD with spatially registered VR fabric simulations. Participants rated four tactile properties (roughness, stiffness, thickness, warmth) while touching the same conductive substrate under different visual texture conditions (denim, gauze, toweling, voile). The results showed that the voile texture significantly affected stiffness (p=.00887), the toweling texture tended to affect warmth (p=.0658), and the denim texture tended to affect roughness (p=.0972). These results suggest that virtual texture overlay can affect the perceived warmth, stiffness, and roughness of the PTD. Since the PTD itself can only change roughness, this implies that texture overlay enables the display of tactile sensations that the PTD alone cannot express. On the other hand, no significant influence of texture on thickness was observed. This suggests that the pinching motion provided a highly reliable tactile cue, which may have caused visual information to be down-weighted and thus the effect not to reach significance. These findings offer strategic insights for employing visual textures to enhance perceived fabric properties, paving the way for deployment in virtual fashion retail and VR fitting applications.
The digital era has revolutionized information sharing but has also fueled the rapid spread of misinformation, particularly through cheapfakes—low-effort manipulations combining text and visuals to mislead audiences. Their ease of creation makes them a prevalent tool for deception on social media, significantly shaping public perception. Existing detection methods rely on pre-trained datasets but suffer from limited diversity, weak contextual understanding, and inadequate multimodal reasoning. To address these challenges, we propose a novel framework that: (1) integrates external knowledge through knowledge graphs to enhance contextual comprehension, (2) extracts fine-grained visual features to better capture subtle manipulations, and (3) leverages multimodal learning for improved text-image consistency analysis. Unlike prior approaches, our method generalizes across multiple datasets, demonstrating robust adaptability and higher accuracy in misinformation detection. Extensive experiments on real-world datasets validate its effectiveness, offering a scalable and reliable solution for combating cheapfakes.
Disinformation poses serious challenges as it involves deliberate falsehoods that mislead the public, erode trust, and hinder informed decision-making, issues exacerbated by AI technologies like ChatGPT, which facilitate realistic generation while complicating detection. Existing methods often rely on large-scale multimodal datasets like NewsCLIPpings, but their limited language coverage and reliance on naive, easily detectable examples from social media or synthetic sources reduce their effectiveness, especially on out-of-domain (unseen) data. To address these limitations, we develop a novel multilingual, multimodal dataset curated from fact-checking websites, presenting a greater challenge for existing detection models. Notably, we find that multilingual training not only facilitates cross-lingual generalization but also improves robustness on monolingual data, highlighting the value of linguistic diversity in building generalizable out-of-context (OOC) detectors. Our dataset uniquely incorporates supporting information, and we propose a simple training strategy that utilizes this supportive information to enhance both learning efficiency (effective use of small-scale data) and detection accuracy, even on unseen data. The dataset and code will be publicly released.
The quality of interactions between parents and children is a critical factor in child development. In recent years, programs have been developed to improve parenting behaviors through evidence-based approaches, such as attachment-based interventions. A vital element of these programs is assessing parenting quality via video recordings of parent-child interactions, which is often labor-intensive and requires specialized expertise. Prior work explored machine learning models to predict expert ratings of parenting behaviors from recordings of semi-structured parent-child play. However, the large set of low-level multimodal features struggled to provide explainable insights, creating barriers to communicating with domain experts and improving the models. In this work, we developed a machine learning pipeline combining sparse multiple canonical correlation analysis with causal discovery and inference techniques to uncover explainable causal relationships between nine categories of behavioral features and expert quality ratings of parent-child interactions. This work provides valuable insights into otherwise black-box models and contributes to the growing body of research on transparent, trustworthy machine learning approaches for modeling parenting behaviors, while offering unique insights into behavioral factors contributing to parenting quality.
Human-Computer Interaction (HCI) is a multi-modal, interdisciplinary field focused on designing, studying, and improving the interactions between people and computer systems. This involves the design of systems that can recognize, interpret, and respond to human emotions or stress. Developing systems to monitor and react to stressful events can help prevent severe health implications caused by long-term stress exposure. Currently, the publicly available datasets and standardized protocols for data collection in this domain are limited. Therefore, we introduce a multi-modal dataset intended for wearable affective computing research, specifically the development of automated stress recognition systems. We systematically review the publicly available datasets recorded in controlled laboratory settings. Based on a proposed framework for the standardization of stress experiments and data collection, we collect physiological and motion signals from wearable devices (e.g., electrodermal activity, photoplethysmography, three-axis accelerometer). During the experimental protocol, we differentiate between the following four affective/activity states: neutral, physical, cognitive stress, and socio-evaluative stress. These different phases are meticulously labeled, allowing for detailed analysis and reconstruction of each experiment. Meta-data such as body positions, locations, and rest phases are included as further annotations. In addition, we collect psychological self-assessments after each stressor to evaluate subjects’ affective states. The contributions of this paper are threefold: 1) a systematic review of publicly available stress recognition datasets, 2) a novel multi-modal, publicly available dataset for automated stress recognition, and 3) validation of a proposed framework for standardization.
With large language models (LLMs) on the rise, in-game interactions are shifting from rigid commands to natural conversations. However, the impacts of LLMs on player performance and game experience remain underexplored. This work explores an LLM’s role as a co-builder during gameplay, examining its impact on task performance, usability, and player experience. Using Minecraft as a sandbox, we present an LLM-assisted interface that engages players through natural language, aiming to facilitate creativity and simplify complex gaming commands. We conducted a mixed-methods study with 30 participants, comparing LLM-assisted and command-based interfaces across simple and complex game tasks. Quantitative and qualitative analyses reveal that the LLM-assisted interface significantly improves player performance, engagement, and overall game experience. Additionally, task complexity has a notable effect on player performance and experience across both interfaces. Our findings highlight the potential of LLM-assisted interfaces to revolutionize virtual experiences, emphasizing the importance of balancing intuitiveness with predictability, transparency, and user agency in AI-driven, multimodal gaming environments.
Humans have always dreamed of possessing superpowers, and the rapid development of AI-based features promises to bring these dreams (closer) to reality. However, these advancements come with significant risks. This paper advocates for challenging existing methods and approaches in design and evaluation for more responsible AI. We stimulate reflection through a futuristic user journey illustrating the AI-driven life of Edmund in 2035. Subsequently, we discuss four AI-based superpowers: extended perception, cognitive offloading, externalized memory, and enhanced presence. We then discuss implications for HCI and AI, emphasizing the need for preserving intrinsic human superpowers, identifying meaningful use cases for AI, and evaluating AI’s impact on human abilities. This paper advocates for responsible and reflective AI integration and proposes a pathway towards the idea of a Human Flourishing Benchmark.
Realizing truly capable personal AI for health and education requires effectively modeling complex longitudinal experiential (LE) data. Unlike standard datasets, LE data from human experience is inherently multifaceted, dynamic, and contextual. Current AI approaches struggle with this complexity due to three critical gaps rooted in differences from human cognition: insufficient multimodal processing, lack of generative perception, and inadequate symbolic order contextualization. Drawing upon interdisciplinary insights, our blue sky vision, the MUSE (Multimodal, generative, and Symbolic framEwork), proposes to bridge these gaps. By integrating multimodal representations addressing LE data’s structure, generative simulation capturing “what-if” dynamics, and symbolic order grounding for context, MUSE aims for a profound, contextually aware understanding of individual experience, essential for robust inference and enabling advanced personal AI applications.
Cognitive effort, grounded in neural efficiency and involvement, refers to the degree of a learner’s engagement in understanding a concept. Understanding how learners interact with AI-driven educational tools is crucial for optimizing cognitive effort and user experience. This study investigates how different Large Language Model response strategies, using specific commands such as with and without analogy, and response tones, such as empathetic vs. neutral, impact cognitive effort, load, usability, and knowledge gain in an educational game setting. I will collect neural activity from the prefrontal cortex during educational gameplay using functional Near-Infrared Spectroscopy. I will then compute cognitive effort based on brain signals and performance scores. Additionally, I will evaluate self-reported data. My preliminary findings suggest that learning materials have an impact on cognitive effort; for example, a simple change in learning material, such as the input method used in the same environment, can affect cognitive effort. This research provides actionable insights for designing emotionally intelligent, cognitively supportive AI tutors and contributes to the development of next-generation human-centered learning technologies.
Conversational events, such as speaking turns, backchannels, topic changes, and laughter, are central to the structure of multiparty interaction and play a key role in shaping its dynamics. However, detecting such events in real-world social settings remains challenging due to perceptual ambiguity, visual occlusion, signal noise, and limitations in acquiring high-quality audio data. This work addresses these challenges by focusing on spontaneous interactions in socially complex and privacy-sensitive environments, exploring multimodal, nonverbal cues that do not rely on audio. The goal is to develop a novel modeling approach for group context awareness to infer conversational events and support social scene understanding under real-world constraints.
Understanding how multimodal communicative behaviors reflect affective dynamics is critical for developing socially aware artificial intelligence (AI) systems. This research investigates the relationship between moment-to-moment behavioral cues – such as gaze, prosody, facial expressions, and head gestures – and perceived engagement and alliance in naturalistic conversations. We present two empirical studies across face-to-face and video-mediated group settings that examine speaker and listener behaviors using richly detailed recordings across multiple modalities and self-reported affective ratings. By employing modality-wise ablation and role-specific modeling, we identify the behavioral signals that are most predictive of subjective impressions of social connection. We found that listener and speaker nonverbal cues – especially head nods, brow raises, and head movements – were statistically significant predictors of perceived dyadic alliance. Positive predictors included listener head nods and brow raises, and speaker head movements and pitch variation; negative predictors included smiles and frowns from both the speaker and the listener. These findings inform a broader research agenda aimed at building interpretable, context-sensitive models of human interaction. Future directions include modeling interaction roles and temporal context in group communication, as well as exploring first-impression dynamics in rapid, high-stakes social encounters.
This doctoral research investigates how immersive simulation-based training, enhanced through multimodal analysis, can improve caregiving interactions in geriatric care. By integrating emotional states, gaze behavior, body movements, and facial expressions, the study aims to bridge the gap between traditional caregiver training and the behavioral complexities of real-world eldercare. Through an Augmented Reality simulation with aware and unaware virtual patients, the project collects multimodal data using microphones, cameras, and eye-tracking sensors. A novel multimodal emotion classifier and an extended Epistemic Network Analysis (ENA) framework are developed to evaluate caregiver engagement. Initial findings show that the aware simulation fosters more empathetic, person-centered caregiver behaviors. Ongoing work involves analyzing facial expression dynamics and gaze patterns to deepen the understanding of attentional alignment and affective synchrony during caregiver-patient interactions. These next steps will further enhance the framework’s ability to assess and inform intelligent training design in healthcare.
Achieving socially compatible human-AI interaction requires systems that can interpret and respond to human emotions appropriately in complex social environments. While traditional emotion recognition models rely heavily on facial or bodily expressions, a growing body of research demonstrates that such cues are insufficient without dynamic, multimodal contextual cues. Positioned at the intersection of cognitive psychology and AI, this work identifies three essential qualities for context-sensitive emotion recognition (CSER): generalizability to unseen scenarios, data efficiency in adapting to new contexts, and reliability in predictive performance across contexts. We outline a research plan that systematically investigates the role of contextual factors, domain adaptation, and uncertainty quantification in building CSER models capable of robust performance across real-world settings. Our approach integrates computational rigour with ethical responsibility to lay the foundation for next-generation emotion-aware systems that are not only accurate but also trustworthy, transparent, and supportive of human well-being in digital interactions.
Supervising multi-robot systems requires interfaces that manage operator cognitive resources and maintain situational awareness. This PhD research proposes a two-stage framework using multimodal large language models (MM-LLMs) for dynamic event detection from diverse sensor inputs, combined with parameterised multimodal communication to convey these events intuitively. By integrating multimedia processing with adaptive communication, this work aims to improve operator performance and address limitations of current interfaces in multi-agent supervision. The expected contributions include novel methods for real-time event detection and multimodal communication, advancing both theory and practical design in human-robot supervision.
This doctoral research develops and evaluates a hybrid Virtual Reality (VR) therapy platform that integrates trauma-focused exposure therapy with posture-adaptive cognitive rehabilitation, tailored for Australian veterans and first responders diagnosed with PTSD. Building on the clinically validated Bravemind VR Exposure Therapy (VRET) system, the project localizes therapeutic content for cultural relevance and emotional safety. It also incorporates the Virtual Human Benchmark (VHB), a posture-adaptive VR game developed by a team of researchers, to enhance cognitive performance and engagement through gamified training. The three-study research design includes (1) adaptation and pilot testing of Bravemind, (2) validation of the VHB system with biometric integration, and (3) development of an AI-driven hybrid platform with dynamic scenario generation and user analytics. Early results from general population testing on VHB demonstrate efficacy in embodied cognitive-motor training and safe physiological regulation. This research is the first to integrate culturally adapted VRET with embodied cognitive rehabilitation in a unified, AI-personalized platform for PTSD. Positioned at the intersection of trauma therapy, human-computer interaction, and precision mental health, it aims to deliver a scalable, personalized, and ethically grounded intervention for PTSD rehabilitation and specialist training.
Gaze-based interaction in virtual reality (VR) often misinterprets passive viewing as intentional selection. This is recognized as the Midas Touch problem, leading to a poor user experience. My doctoral research proposes a brain-computer interface (BCI) that integrates electroencephalography (EEG) and eye tracking to detect user intention in real-time. This will be done using the Stimulus-Preceding Negativity (SPN) as an EEG neural confirmation trigger. This research extends from prior work showing SPN can differentiate passive from intentional gaze. The system is developed across multiple studies, including replication, classifier development, and integration into VR. By proactively confirming intention, the system aims to reduce selection errors and cognitive workload. This research contributes to developing neuroadaptive interfaces in immersive environments. This research will have broader impacts that initiate the design of other neuroadaptive systems that may integrate cognitive state, motor imagery, and real-time interactions.
With the growing prevalence of cognitive impairments around the globe and in Europe, an increasing number of people are likely to experience cognitive decline during their working years. Supporting these individuals to remain in employment is imperative, both to promote personal well-being and to enable organizations to retain experienced and skilled workers. This research proposes the design of a physiologically adaptive cognitive assistance system to support individuals with mild cognitive impairment in sheltered workshops. This work adopts a design science research approach, combining laboratory and field experiments to achieve a user-centred design. Expected outcomes include a modular framework for physiologically adaptive cognitive assistive systems, a multimodal machine learning pipeline for detecting psychological states from physiological signals, and design principles to inform future research and development. By demonstrating the potential of such systems within work settings, this research aims to advance the social inclusion of individuals with mild cognitive impairment in the labour market.
Cartoons and animated media play a crucial role in children’s cognitive and emotional growth, yet remain largely inaccessible to visually impaired audiences. While audio description (AD) can bridge this gap, its manual production is costly, subjective, and unscalable. This work proposes a context-aware, automated audio description (AAD) system that integrates cartoon character detection, mask-level tracking, and emotion recognition with large language models to generate real-time, child-friendly narrative descriptions. We introduce ToonDetect, a novel dataset of 10,000+ annotated frames from globally popular cartoons, designed to capture the challenges of stylized character detection and emotion recognition. Our pipeline integrates YOLOv8 for optimized cartoon character detection, SAM2 for segmentation, XMem for temporal consistency, and SWIN Transformer for efficient emotion recognition, enabling frame-consistent understanding of dynamic animated scenes. Leveraging Video-LLaVA and BLIP-2, we generate coherent, emotionally aligned, child-friendly narrations. This research contributes a scalable, multimodal framework for real-time accessibility in animated content, paving the way for inclusive entertainment and education for visually impaired children.
Frustration during goal-directed tasks arises from unpredictable obstructions, yet existing multimodal detection systems (e.g., fNIRS neuroimaging, behavioral metrics) fail to disentangle this affective state from inherent cognitive workload. To address this gap, I leverage a dual-task paradigm capturing behavioral metrics (e.g., errors, reaction time) under deliberately induced frustrating events. I found a learning effect in my experiment, and the effect of a highly frustrating experience on the secondary task diminished as the task progressed. Building on this foundation, I will propose: (1) a multimodal frustration classifier fusing fNIRS with behavioral metrics, and (2) a tactile mitigation intervention on the mouse to enhance user perception of frustration.
This research aims to better understand human behaviour in social environments using computer vision and large language models (LLMs) [33], [34], with a focus on trauma-related behaviours, such as avoidance, dissociation, social withdrawal, and physical manifestations such as trembling, that an individual may exhibit following exposure to a distressing event or a series of such events, including physical assault, witnessing an accident, experiencing personal loss, or enduring a natural disaster. These behaviours are studied using multimodal cues, such as visual and audio data captured through surveillance. A mixed-methods approach will be adopted: the qualitative component involves the evaluation of existing datasets, algorithms, and AI integration, while the quantitative aspect focuses on modelling social interaction factors [4], including nonverbal cues such as gestures, pose estimation, and trajectories derived from the literature [24], [10], [22], [11], [25], [20], [21].
Expected outcomes include a robust, richly annotated dataset and improved models capable of understanding traumatic behaviour in social environments. Existing literature reveals key limitations, including restricted dataset diversity, limited trauma-related annotations, and a lack of multimodal representation. Furthermore, many datasets are collected in controlled or clinical environments, which constrains their generalizability to real-world contexts. Additionally, current algorithms often fall short in effectively modelling the complexity of trauma-related social behaviour. This research addresses these gaps through 1) a systematic review and selection of datasets relevant to understanding traumatic behaviour (e.g., PTSD, CPTSD), studying their behavioural annotation granularity and modality coverage (facial expression, body posture, etc.); 2) inclusion criteria such as defining behavioural indicators of trauma, identifying high-quality video/audio with sufficient resolution for multimodal feature extraction, and requiring annotation reliability, preferably verified by psychologists and domain experts; 3) refining annotations using a hybrid approach that combines expert coding and LLM-based labelling [33], [34], followed by manual verification; 4) improving data quality and model robustness through augmentation and fusion techniques; and 5) ensuring consistency of annotations/labels through assessment by well-trained annotators or domain experts [2], [3], [19], [20], [26], [25].
We present DriveWise, a personalized driver coaching system that integrates naturalistic performance assessment, CARLA-simulated lesson generation, and eye-tracking-based risk perception feedback. Driving skills are evaluated through unscripted simulation, where collisions and near-misses are detected to quantify risk. Based on these incidents, tailored interactive lessons are generated to address individual weaknesses. Eye-tracking data further identifies gaps in visual attention, enabling cognitive-level coaching. DriveWise offers a scalable approach to enhancing both behavioral and perceptual aspects of driving safety.
The Crock of Shh is a water dispenser interface designed for typing messages to water. Instead of dispensing water quickly, users must press the nozzle repeatedly to scroll through whispered positive words representing letters of the alphabet. Each short press releases a small amount of water, slowing the process and creating time for reflection. Users listen to the whispered words by placing their ear near the water tank, which resonates the sounds through an audio exciter. Four buttons on the interface allow users to scroll forward or backward through the spelling alphabet, select letters, insert spaces, and print messages on a Dymo label printer. The printed labels, in black ink on transparent material, can be attached to a water cup or the dispenser jug, transmitting positive messages and fostering a connection between the user and the water. The work builds on ideas of slow technology, embodied interaction, and placebo effects.
Many people listen to music as a background activity during various tasks, such as driving, walking their dog, or showering. The availability of streaming services has made music more convenient and widespread. The Human Record Needle requires users to perform rhythmic movements to activate the music, engaging them physically in the listening experience. This approach allows both musicians and non-musicians to appreciate listening to music beyond just background noise. This exhibition introduces the Human Record Needle as a novel interface for playing music from a MIDI score.
Human pose annotation is essential but often tedious and error-prone, especially for visually challenging images. Existing tools offer limited support for inspecting or refining annotations. We present PoseDoc, an interactive tool for inspecting, correcting, and customizing human pose annotations. The proposed tool will feature a flexible ranking framework that prioritizes hard cases, based on user-defined, application-specific metrics, to guide inspection and improve the annotation quality of a dataset. Currently under development, PoseDoc aims to enable efficient annotation workflows and improve model reliability for responsible multimodal interaction.
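A minimal sketch of such a metric-driven ranking is shown below; the annotation fields and the example hardness metric are illustrative assumptions, not PoseDoc's actual interface.

```python
from typing import Callable, Optional

# Minimal sketch of a metric-driven ranking queue; field names are illustrative.
def rank_hard_cases(
    annotations: list[dict],
    hardness: Callable[[dict], float],
    top_k: Optional[int] = None,
) -> list[dict]:
    """Order annotations so the hardest cases, per a user-defined metric, are inspected first."""
    ranked = sorted(annotations, key=hardness, reverse=True)
    return ranked if top_k is None else ranked[:top_k]

# Example application-specific metric: low keypoint confidence plus heavy occlusion.
def occlusion_aware_hardness(ann: dict) -> float:
    return (1.0 - ann.get("mean_keypoint_confidence", 1.0)) + 0.5 * ann.get("occluded_fraction", 0.0)
```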
Autism Spectrum Disorder (ASD) affects more than 75 million people worldwide. However, scalable support for practicing everyday conversation is scarce: low-cost activities such as story reading yield limited improvement, while effective role-play therapy demands expensive, in-person sessions with specialists. SocialWise bridges this gap through a browser-based application that pairs LLM conversational agents with a therapeutic retrieval-augmented generation (RAG) knowledge base. Users select a scenario (e.g., ordering food, joining a group), interact by text or voice, and receive instant, structured feedback on tone, engagement, and alternative phrasing. The SocialWise prototype, implemented with Streamlit, LangChain, and ChromaDB, runs on any computer with internet access and demonstrates how recent advances in LLMs can provide evidence-based, on-demand communication coaching for individuals with ASD.
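As a rough illustration of the retrieval step, the sketch below uses ChromaDB directly (the prototype composes this with LangChain); the guideline documents, collection name, and prompt wording are placeholders rather than SocialWise's actual knowledge base.

```python
import chromadb

client = chromadb.Client()  # in-memory for illustration; a real deployment would persist data
kb = client.create_collection("therapy_guidelines")
kb.add(
    ids=["g1", "g2"],
    documents=[
        "When ordering food, greet the server, state your order clearly, and thank them.",
        "In group conversations, wait for a pause before joining and reference the current topic.",
    ],
)

def build_feedback_prompt(scenario: str, user_turn: str, n_results: int = 2) -> str:
    """Retrieve scenario-relevant guidance and frame it as a coaching prompt for the LLM."""
    hits = kb.query(query_texts=[f"{scenario}: {user_turn}"], n_results=n_results)
    guidance = "\n".join(hits["documents"][0])
    return (
        f"Scenario: {scenario}\nUser said: {user_turn}\n"
        f"Relevant guidance:\n{guidance}\n"
        "Give structured feedback on tone, engagement, and alternative phrasing."
    )
```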
Many individuals, especially those with autism spectrum disorder (ASD), alexithymia, or other neurodivergent profiles, face challenges in recognizing, expressing, or interpreting emotions. To support more inclusive and personalized emotion technologies, we present a real-time multimodal emotion estimation system that combines neurophysiological signals, namely EEG, ECG, blood volume pulse (BVP), and galvanic skin response (GSR/EDA), with behavioral modalities (facial expressions and speech) in a unified arousal-valence 2D interface to track moment-to-moment emotional states. This architecture enables interpretable, user-specific analysis and supports applications in emotion education, neuroadaptive feedback, and interaction support for neurodiverse users. Two demonstration scenarios illustrate its application: (1) passive media viewing (2D or VR videos) reveals cortical and autonomic responses to affective content, and (2) semi-scripted conversations with a facilitator or virtual agent capture real-time facial and vocal expressions. These tasks enable both controlled and naturalistic emotion monitoring, making the system well suited for personalized feedback and neurodiversity-informed interaction design.
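To make the unified 2D representation concrete, here is a minimal late-fusion sketch; the modality names, weights, and simple weighted average are illustrative assumptions rather than the system's actual per-user calibration.

```python
import numpy as np

# Illustrative late-fusion sketch; modality names and weights are assumptions.
def fuse_arousal_valence(estimates: dict[str, tuple[float, float]],
                         weights: dict[str, float]) -> tuple[float, float]:
    """Combine per-modality (arousal, valence) estimates into one 2D point."""
    w = np.array([weights[m] for m in estimates])
    av = np.array([estimates[m] for m in estimates])  # shape (n_modalities, 2)
    fused = (w[:, None] * av).sum(axis=0) / w.sum()
    return float(fused[0]), float(fused[1])

# e.g. EEG and EDA lean toward arousal, face and speech toward valence
point = fuse_arousal_valence(
    {"eeg": (0.6, 0.1), "eda": (0.7, 0.0), "face": (0.3, 0.5), "speech": (0.4, 0.4)},
    {"eeg": 1.0, "eda": 0.8, "face": 1.0, "speech": 0.7},
)
```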
This demonstration paper presents LayLens, a tool that aims to make deepfake understanding easier for users of all educational backgrounds. While prior work often relies on outputs containing technical jargon, LayLens bridges the gap between model reasoning and human understanding through a three-stage pipeline: (1) explainable deepfake detection using a state-of-the-art forgery localization model, (2) natural language simplification of technical explanations using a vision-language model, and (3) visual reconstruction of a plausible original image via guided image editing. The interface presents both technical and layperson-friendly explanations alongside a side-by-side comparison of the uploaded and reconstructed images. A user study with 15 participants shows that simplified explanations significantly improve clarity and reduce cognitive load, with most users expressing increased confidence in identifying deepfakes. LayLens offers a step toward transparent, trustworthy, and user-centric deepfake forensics.
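The three-stage flow can be sketched as follows; the stage functions are placeholder stubs standing in for the forgery-localization, vision-language, and guided-editing models, and all names are illustrative rather than LayLens's actual code.

```python
from dataclasses import dataclass

@dataclass
class LayLensResult:
    technical_explanation: str
    simple_explanation: str
    reconstructed_image: bytes

def detect_and_localize(image: bytes) -> str:
    return "Splicing artifacts localized around the mouth region (placeholder output)."

def simplify(technical: str) -> str:
    return "The mouth area looks edited, so the photo is probably altered (placeholder)."

def reconstruct_original(image: bytes, technical: str) -> bytes:
    return image  # placeholder: guided editing would undo the localized manipulation

def explain_deepfake(image: bytes) -> LayLensResult:
    technical = detect_and_localize(image)             # stage 1: explainable detection
    simple = simplify(technical)                       # stage 2: jargon-free rephrasing
    restored = reconstruct_original(image, technical)  # stage 3: plausible original
    return LayLensResult(technical, simple, restored)
```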
As part of the Intangible Cultural Heritage, Bridging the Past, Present and Future (INT-ACT) project, we investigate how Extended Reality (XR) technologies can meaningfully engage users with cultural content by monitoring their physiological and affective responses. While immersive XR systems offer new ways of exploring heritage, their impact on users’ internal states remains underexplored. In this study, we present a multimodal experimental setup using EmotiBit, a wearable biosensing platform, to monitor real-time physiological signals during cultural XR interaction. Participants were evaluated across three activities representing varying cognitive and sensory loads: immersive interaction with the INT-ACT XR Demonstrator, composing work emails (a low-stimulation control task), and passive movie watching. Our aim is to quantify how cultural XR experiences influence biosignals such as electrodermal activity and heart rate. The findings reveal distinct physiological patterns across conditions, suggesting that biosignal monitoring can inform the design of adaptive XR environments that are responsive to user states. This work contributes to INT-ACT’s broader objective of creating intelligent, inclusive, and emotionally resonant cultural heritage experiences.
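As an illustration of the intended analysis, a minimal sketch of per-condition biosignal summaries follows; the column and condition names are assumptions about how the EmotiBit recordings might be tabulated, not the study's actual pipeline.

```python
import pandas as pd

# Illustrative sketch: column names ("condition", "eda_microsiemens", "heart_rate_bpm")
# are assumptions about how the EmotiBit recordings might be tabulated.
def summarize_by_condition(df: pd.DataFrame) -> pd.DataFrame:
    """Mean and standard deviation of electrodermal activity and heart rate per activity."""
    return (
        df.groupby("condition")[["eda_microsiemens", "heart_rate_bpm"]]
          .agg(["mean", "std"])
    )

# conditions might be labelled e.g. "xr_demonstrator", "email_control", "movie_watching"
```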
We introduce a multilingual chatbot tailored for remote data collection, focusing on mental health issues such as depression. It gathers both self-reported information and voice recordings. Implemented as a Telegram bot, the system enables participants to complete psychological assessments, including the 8-item Patient Health Questionnaire (PHQ-8) and open-ended questions, entirely through a conversational interface. The chatbot supports multiple languages, making it well-suited for cross-cultural research. Users respond to both closed-ended and open-ended questions, with voice messages serving as the required mode of response for the latter. These recordings are stored for future multimodal analysis. To strengthen privacy guarantees, the chatbot operates in conjunction with a self-hosted Telegram Bot API server, ensuring that all user data remains within researcher-controlled infrastructure. The system emphasizes accessibility, privacy, and extensibility, featuring built-in mechanisms for consent withdrawal and data deletion, as well as future support for additional modalities such as video responses and alternative psychometric instruments. This demonstration highlights the core functionality of the chatbot and its potential as a lightweight, scalable tool for affective computing and digital mental health research. The code is available at: https://github.com/danila-mamontov/mentalhealth-chatbot.
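A minimal sketch of the voice-response flow is shown below, assuming python-telegram-bot (v20+) and a locally hosted Bot API server; the actual repository may be organised differently, and the token and URL are placeholders.

```python
# Minimal sketch using python-telegram-bot (v20+); the real repository may differ.
# TOKEN and the local Bot API URL are placeholders for the researcher-controlled deployment.
from telegram import Update
from telegram.ext import Application, ContextTypes, MessageHandler, filters

TOKEN = "REPLACE_ME"

async def save_voice(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    """Store the participant's voice answer locally for later multimodal analysis."""
    voice = update.message.voice
    tg_file = await voice.get_file()
    await tg_file.download_to_drive(f"voice_{update.effective_user.id}_{voice.file_unique_id}.ogg")
    await update.message.reply_text("Thanks, your voice answer was recorded.")

if __name__ == "__main__":
    app = (
        Application.builder()
        .token(TOKEN)
        .base_url("http://localhost:8081/bot")  # self-hosted Telegram Bot API server
        .build()
    )
    app.add_handler(MessageHandler(filters.VOICE, save_voice))
    app.run_polling()
```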
Organic waste management is a crucial component of a circular economy, which prioritizes reducing waste through the reuse and recycling of products and materials. It is a tedious and complicated task, largely accomplished through manual labor. We introduce a novel ‘in-the-wild’ multimodal image dataset of 15-band NIR multi-spectral and single-band thermal images of bulk food waste in an industrial setting. The dataset showcases a number of complex computer vision problems that are unavoidable constraints in this setting. We benchmark several computer vision algorithms to highlight these challenges, and discuss the key issues, their place in robotic waste processing for industrial applications, and grand challenge objectives.
Waste management is particularly challenging, placing unprecedented pressure on recycling facilities to sort heterogeneous streams at industrial throughputs. Conventional RGB-based vision systems struggle in this setting: overlapping items, translucent plastics, and occluded food residues often share similar visible-light textures, leading to frequent missed detections. Equally important, standard accuracy metrics such as mIoU and mAP50–95 penalise any boundary inaccuracy, undervaluing predictions that slightly overshoot yet fully capture grasp-critical regions.
To address this gap, our work adopts mean Intersection over Ground-truth (mIoG), a recall-oriented metric that measures the proportion of ground truth actually covered by predictions. Unlike mIoU, mIoG does not penalise harmless over-segmentation, which is precisely the practical insight motivating our choice: in complex environments with overlapping, deformable waste items, false negatives (missing parts or whole objects) are more costly than false positives. We therefore argue that mIoG better aligns with the operational goals of waste sorting systems, where the primary requirement is to detect all relevant material, even at the cost of over-segmentation. Our results support this view: despite imprecise boundaries caused by weak spectral signatures, inorganic segments were covered sufficiently in terms of area.
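For concreteness, the per-mask computation can be sketched as follows (mIoG then averages IoG over classes or instances); the boolean-mask interface and edge-case handling are illustrative.

```python
import numpy as np

def iou_and_iog(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Compute IoU and IoG for a single class from boolean masks.

    IoU = |P ∩ G| / |P ∪ G|  penalises both over- and under-segmentation;
    IoG = |P ∩ G| / |G|      only penalises missed ground truth.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0
    iog = inter / gt.sum() if gt.sum() else 1.0
    return float(iou), float(iog)

# A prediction that fully covers the object but overshoots its boundary keeps IoG = 1.0
# while its IoU drops, which is the behaviour argued for above.
```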
This workshop aims to advance research on cross-cultural multimodal interaction by establishing a platform for sharing methodologies, data, and insights related to nonverbal behavior across different linguistic and cultural contexts. We focus on the challenges of collecting and annotating multimodal data in a consistent manner across diverse locations, and on developing analysis methods that leverage recent advances in machine learning and large language models. By bringing together researchers from various disciplines and regions, the workshop seeks to foster international collaboration, promote the harmonization of annotation standards, and encourage interdisciplinary approaches. The outcomes are expected to contribute to a deeper understanding of cultural differences in nonverbal communication and to the development of technologies that enable natural and effective human-human and human-machine interaction in multicultural societies.
Pain communication varies significantly among individuals: some are highly expressive, while others demonstrate stoic restraint and offer minimal verbal indication of discomfort. Substantial progress has been made in identifying behavioral indicators of pain, and a growing body of literature highlights measurable indices of pain in facial expressions, vocalizations, body movements, and physiological and neural responses. To enhance the reliability of pain monitoring, automated pain assessment has emerged as a promising approach. Although available datasets remain limited, they are steadily increasing, helping to drive research forward. Despite notable progress, the field is still in its early stages. The 5th edition of the AAP workshop continues to highlight current research and to foster interdisciplinary collaboration and discussion to accelerate progress in this important area.
The ICMI 2025 Workshop on Holistic and Responsible Affective Intelligence (HRAI 2025) aims to advance research in affective intelligence by fostering discussions on the holistic development of affective computing and the ethical challenges it entails. The workshop aims to strengthen interdisciplinary connections within the affective computing community, promoting better integration of methodologies and enhancing real-world applicability. By tackling both technical and ethical issues, HRAI 2025 aspires to shape the future of affective AI, ensuring it is not only powerful but also fair, safe, and socially responsible.