28th ACM International Conference on Multimodal Interaction
(5-9 October 2026)

Context and Cultural Awareness for Multimodal Interaction

Special Sessions

The ICMI 2026 Special Sessions highlight emerging topics and interdisciplinary directions in multimodal interaction. These sessions provide a venue for focused discussion on timely research themes and bring together communities working on innovative methods, applications, and challenges across the field.

Papers submitted to an accepted Special Session will follow the same review process as the main conference track papers, including the same submission system (PCS), formatting guidelines (short or long papers), notification dates, and a rigorous peer review process. Accepted special session papers will be included in the main proceedings.

For any questions and further information about the Special Sessions, please email icmi2026-specialsession-chairs@acm.org.

To submit a paper to a special session, please select “ICMI 2026 long and short papers” and then, in the submission form, select the radio button bearing the number of your chosen session. Special Sessions are numbered as follows.

Go to Submission site

Special Session 1: Multimodal Large Language Models for Context-Aware Interaction

Organizers: Sławomir Kciuk, Łukasz Sobczak
Corresponding Organizer: Łukasz Sobczak

Description: Large language models (LLMs) are increasingly used as the core component of multimodal interactive systems. While LLMs excel at language processing, real-world interaction requires them to operate together with other modalities such as vision, audio, gesture, or sensor data. Integrating these modalities in a coherent and reliable way remains a significant challenge.

This Special Session focuses on multimodal LLMs as context-aware interaction engines, with particular emphasis on cross-modal grounding and consistency. We are interested in architectural and system-level approaches that connect language models with perceptual inputs and enable responses that remain consistent across modalities and over time.
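
As a rough illustration of the system-level integration sketched above, the short Python example below assembles per-modality percepts and their detector confidences into a structured context for conditioning an LLM’s response. This is a minimal sketch under stated assumptions, not a reference implementation: the Observation and MultimodalContext types, the confidence threshold, and the prompt format are all illustrative, and the actual LLM call is deliberately left out.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Observation:
        modality: str      # e.g. "vision", "audio" (hypothetical labels)
        content: str       # symbolic description of the percept
        confidence: float  # detector confidence in [0, 1]

    @dataclass
    class MultimodalContext:
        observations: List[Observation] = field(default_factory=list)

        def add(self, obs: Observation) -> None:
            self.observations.append(obs)

        def to_prompt(self, min_confidence: float = 0.5) -> str:
            """Serialize grounded percepts into prompt text, flagging
            low-confidence inputs so the model can hedge its wording."""
            lines = []
            for obs in self.observations:
                tag = "uncertain" if obs.confidence < min_confidence else "grounded"
                lines.append(f"[{obs.modality} | {tag} | p={obs.confidence:.2f}] {obs.content}")
            return "Perceptual context:\n" + "\n".join(lines)

    # Hypothetical usage: the serialized context would be prepended to the
    # user's utterance before calling whatever LLM backend the system uses.
    ctx = MultimodalContext()
    ctx.add(Observation("vision", "user points at the red mug", 0.91))
    ctx.add(Observation("audio", "raised pitch suggests urgency", 0.42))
    print(ctx.to_prompt())

Flagging low-confidence percepts in the prompt is one simple way to let multimodal uncertainty shape generation, one of the questions raised below.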

The session addresses questions such as:

How can large language models be effectively integrated with visual, auditory, and other input streams?
How can LLM-based systems ensure that generated responses stay aligned with multimodal context?
How should multimodal uncertainty influence LLM-based reasoning and response generation?
What architectural designs of LLM-based systems support stable and predictable multimodal interaction?
How can robustness of multimodal LLMs be evaluated and failure modes better understood?

We welcome contributions presenting new architectures, integration strategies, empirical evaluations, benchmarking efforts, and analyses of limitations in multimodal LLM-based interaction. By addressing contextual grounding, cross-modal consistency, and reliability in multimodal LLM systems, this session directly contributes to ICMI’s core focus on intelligent and context-aware multimodal interaction.

Topics include, but are not limited to:

Architectures for multimodal LLM-based interactive systems
Cross-modal alignment and contextual grounding in multimodal LLMs
Integration of visual, auditory, and sensor inputs with language models
Incorporating multimodal uncertainty into LLM-based reasoning and response generation
Incremental and real-time multimodal processing with LLMs
Robustness analysis and systematic study of failure modes in multimodal LLM systems
Evaluation methodologies and benchmarks for multimodal LLM systems
Interpretability and controllability in multimodal LLM-based interaction

Special Session 2: Empowering Society through Personalised Multimodal HRI

Organizers: Antonio Andriella, Wing-Yue Geoffrey Louie, Alessandra Rossi, Silvia Rossi

Description: The vision of personal robots capable of supporting, cooperating with, and living among humans holds immense potential for societal good. To realise this vision, robots must move beyond one-size-fits-all interactions and autonomously tailor their behaviour to the unique characteristics of individuals, including their culture, preferences, and cognitive and physical abilities. Personalisation can significantly enhance human-robot interaction in real-world scenarios by increasing engagement through tailored content, building trust and rapport, improving adherence to the interaction, and enhancing task performance, making it a viable tool for promoting equity and access in domains such as healthcare, education, and assisted living.

As multimodal systems, robots integrate multiple channels of communication and perception to interact with humans and their environment effectively. For instance, a robot equipped with speech recognition capabilities can understand verbal commands and engage in spoken dialogue with users. At the same time, its vision system allows it to perceive facial expressions, gestures, and other visual cues, providing additional context for interpreting the user’s intent and emotional state. Moreover, tactile sensors enable the robot to sense touch and physical interactions, enhancing its ability to respond appropriately to human gestures and contact. By integrating these modalities, the robot can tailor its behaviour dynamically based on the information it gathers from each channel.
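
To make this kind of integration concrete, the following Python sketch performs a simple weighted late fusion of per-channel affect estimates and maps the result to a behaviour; the channel names, weights, valence scale, and toy policy are illustrative assumptions, not a prescribed method.

    from typing import Dict, Optional

    def fuse_user_state(speech: Optional[float], face: Optional[float],
                        touch: Optional[float],
                        weights: Dict[str, float] = None) -> float:
        """Weighted late fusion of per-modality valence estimates in [-1, 1].
        Missing channels (None) are simply excluded from the average."""
        weights = weights or {"speech": 0.4, "face": 0.4, "touch": 0.2}
        channels = {"speech": speech, "face": face, "touch": touch}
        total, norm = 0.0, 0.0
        for name, value in channels.items():
            if value is not None:
                total += weights[name] * value
                norm += weights[name]
        return total / norm if norm else 0.0

    def select_behaviour(valence: float) -> str:
        # Toy personalisation policy keyed to the fused affective estimate.
        if valence < -0.3:
            return "slow down, soften voice, offer help"
        if valence > 0.3:
            return "maintain pace, increase engagement"
        return "stay neutral, probe user preference"

    # Hypothetical reading: positive speech cue, negative facial cue, no touch.
    print(select_behaviour(fuse_user_state(speech=0.2, face=-0.6, touch=None)))

In a real robot each estimate would come from dedicated perception modules (speech recognition plus sentiment, facial-expression analysis, tactile sensing), and the weights themselves could be personalised per user.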

This special session aims to explore how robots can interact meaningfully across diverse contexts by going beyond speech and vision processing. It will gather work exploring the need for robots to be culturally and context-aware: capable of understanding not only what people say, but how it is expressed, considering differences and similarities in the richness and expressiveness of communication styles, and what those expressions mean within specific cultural frameworks. The session also focuses on personalisable intelligent agents that are able to adapt not only to individuals, but also to the communities and traditions that shape them. Finally, with this special session, we want to promote the design of empowering robotic technologies, developed with and for the people they support, in order to foster inclusivity, embrace diversity, and generate long-term social benefits.

To this end, this special session welcomes research contributions from a multidisciplinary community, including, but not limited to, psychology, neuroscience, computer science, robotics, and sociology, to share and discuss current approaches to empowering socially assistive robots with adaptive and learning capabilities, fostering the research and development of robotic solutions specifically designed to meet each individual’s unique needs.

Topics include, but are not limited to:

Personalisation in short- and long-term HRI
User modelling in HRI
Personalisation for inclusion
Robot personality
Socially-aware personalisation
Context and situation awareness for robots
Engagement evaluation and re-engagement strategies
Personalised dialogue with robots
Personalised non-verbal behaviour with robots
Adaptive human-aware task planning
Theory of Mind for adaptive interaction
Machine learning for robotic personalisation
Lifelong (continual) learning for adaptation
Adaptation in multimodal interaction
Affective and emotion-adapted HRI
Persuasion in HRI
Personalisation for sustainability
Culture-aware robots
Evaluation metrics for adaptive robotic behaviour
Ethical implications of personalisation
Robot customisation and teaching

Special Session 3: Multimodal Sensing and Interventions for Mental Well-being

Organizers: Iulia Lefter, Alessandro Vinciarelli

Description: The process of mental illness diagnosis is one of the most expensive and time-consuming aspects of clinical practice. Traditional methods of mental health assessment rely on self-report instruments and periodic clinical evaluations, which makes them vulnerable to subjective judgment, attentional and memory biases, and the episodic nature of clinical encounters. Even though specific treatment interventions have been shown to be effective, many patients do not experience improvement, and drop-out rates are high. This indicates a need for better treatments and for sustaining engagement over time. In this context, multimodal sensing and intervention technologies have the potential to play a transformative role in supporting and sustaining mental well-being.

Multimodal sensing has the potential to augment traditional diagnosis by integrating verbal and non-verbal communication, physiological signals, and activity data to deliver objective markers and identify bio-behavioral patterns related to psychopathology. This can be particularly useful in providing both patients and clinicians with explanations and a more comprehensive picture, and it could flag moments requiring critical alerts. The advantage of these technologies will be particularly evident in the early stages of a pathology, when symptoms may still be too subtle to notice. Last, but not least, multimodal technologies can automate large-scale population screening for, e.g., developmental issues in childhood (such as the identification of children with insecure attachment) or early symptoms of cognitive deterioration in old age.

Interventions supporting mental well-being can leverage multimodal sensing as well as multimodal human-agent interaction. By continuously assessing a user’s affective and behavioral state, digital systems can move away from one-size-fits-all approaches toward context-aware, personalized interventions tailored to an individual’s unique needs, preferences, and real-time emotional condition, including just-in-time interventions triggered at moments of particular need. More broadly, multimodal technologies can support human flourishing beyond pathology, promoting healthy lifestyle habits and building skills such as emotional regulation. The need for long-term engagement can further be addressed by delivering interventions through (embodied) conversational agents that interact with users in naturalistic, responsive ways.
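
As a toy illustration of such just-in-time triggering, the Python sketch below fires an intervention only when a fused stress estimate stays above a personalised threshold for several consecutive readings, with a refractory period to avoid over-prompting; the threshold, window length, and cooldown values are illustrative assumptions.

    import time
    from collections import deque

    class JITAITrigger:
        """Toy just-in-time adaptive intervention trigger."""

        def __init__(self, threshold: float = 0.7, window: int = 5,
                     cooldown_s: float = 3600.0):
            self.threshold = threshold          # personalised per user in practice
            self.window = deque(maxlen=window)  # recent fused stress scores
            self.cooldown_s = cooldown_s
            self._last_fired = float("-inf")

        def update(self, stress_score: float, now: float = None) -> bool:
            """Append a new score; return True when an intervention is due."""
            now = time.time() if now is None else now
            self.window.append(stress_score)
            sustained = (len(self.window) == self.window.maxlen and
                         min(self.window) > self.threshold)
            if sustained and now - self._last_fired > self.cooldown_s:
                self._last_fired = now
                return True  # e.g. deliver a breathing-exercise prompt
            return False

    # Hypothetical stream of fused stress scores, one per minute.
    trigger = JITAITrigger(threshold=0.7, window=3, cooldown_s=1800.0)
    for minute, score in enumerate([0.8, 0.75, 0.9, 0.85]):
        if trigger.update(score, now=minute * 60.0):
            print(f"minute {minute}: trigger intervention")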

The aim of this special session is to highlight the latest developments in multimodal sensing and interventions for mental well-being and to facilitate interdisciplinary discussion on these topics.