Proceedings of the 2008 International Conference on Multimodal Interfaces
The International Conference on Multimodal Interfaces (ICMI) is an annual ACM-sponsored conference that promotes research in next-generation perceptive, adaptive and multimodal user interfaces. These new interfaces are especially well suited for interpreting natural communication and activity patterns in real-world environments.
The following articles are from "Proceedings of the 2008 International Conference on Multimodal Interfaces":
Cohen, Phil (2008): Natural interfaces in the field: the case of pen and paper. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 1-2. Available online
Over the past 7 years, Adapx (formerly, Natural Interaction Systems) has been developing digital pen-based natural interfaces for field tasks. Examples include products for field note-taking, mapping and architecture/engineering/construction, which have been applied to such uses as: surveying, wild-fire fighting, land use planning and dispute resolution, and civil engineering. In this talk, I will describe the technology and some of these field-based use cases, discussing why natural interfaces are the preferred means for human-computer interaction for these applications.
Choumane, Ali and Siroux, Jacques (2008): Knowledge and data flow architecture for reference processing in multimodal dialog systems. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 105-108. Available online
This paper is concerned with the part of the system dedicated to the processing of the user's designation activities for multimodal search of information. We highlight the necessity of using specific knowledge for multimodal input processing. We propose and describe knowledge modeling as well as the associated processing architecture. Knowledge modeling is concerned with the natural language and the visual context; it is adapted to the kind of application and allows several types of filtering of the inputs. Part of this knowledge is dynamically updated to take into account the interaction history. In the proposed architecture, each input modality is first processed using the modeled knowledge, producing intermediate structures. Next, a fusion of these structures determines the intended referent by using dynamic knowledge. The steps of this last process take into account the possible combinations of modalities as well as the clues carried by each modality (linguistic clues, gesture type). The development of this part of our system is largely complete and has been tested.
This paper describes the acquisition and content of a new multi-modal database. Some tools for making use of the data streams are also presented. The Computational Audio-Visual Analysis (CAVA) database is a unique collection of three synchronised data streams obtained from a binaural microphone pair, a stereoscopic camera pair and a head tracking device. All recordings are made from the perspective of a person; i.e. what would a human with natural head movements see and hear in a given environment. The database is intended to facilitate research into humans' ability to optimise their multi-modal sensory input and fills a gap by providing data that enables human centred audio-visual scene analysis. It also enables 3D localisation using either audio, visual, or audio-visual cues. A total of 50 sessions, with varying degrees of visual and auditory complexity, were recorded. These range from seeing and hearing a single speaker moving in and out of field of view, to moving around a 'cocktail party' style situation, mingling and joining different small groups of people chatting.
Li, Li and Chou, Wu (2008): Towards a minimalist multimodal dialogue framework using recursive MVC pattern. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 117-120. Available online
This paper presents a formal framework for multimodal dialogue systems by applying a set of complexity reduction patterns. The minimalist approach described in this paper combines recursive application of Model-View-Controller (MVC) design patterns with layering and interpretation. It leads to a modular, concise, flexible and dynamic framework building upon a few core constructs. This framework could expedite the development of complex multimodal dialogue systems with sound software development practices and techniques. An XML-based prototype multimodal dialogue system that embodies this framework is developed and studied. Experimental results indicate that the proposed framework is effective and well suited for multimodal interaction in complex business transactions.
Ratzka, Andreas (2008): Explorative studies on multimodal interaction in a PDA- and desktop-based scenario. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 121-128. Available online
This paper presents two explorative case studies on multimodal interaction. The goal of this work is to find and underpin design recommendations that provide well-proven decision support across all phases of the usability engineering lifecycle. During this work, user interface patterns for multimodal interaction were identified [2, 3]. These patterns are closely related to other user interface patterns [4, 5, 6]. Two empirical case studies, one using a Wizard of Oz setting and another using a stand-alone prototype linked to a speech recognition engine, were conducted to assess the acceptance of the resulting interaction styles. Although the prototypes also applied interface patterns that increase usability by means of traditional interaction techniques and thus compete with multimodal interaction styles, multimodal interaction was preferred by most of the users.
Vanacken, Lode, Boeck, Joan De, Raymaekers, Chris and Coninx, Karin (2008): Designing context-aware multimodal virtual environments. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 129-136. Available online
Despite decades of research, creating intuitive and easy-to-learn interfaces for 3D virtual environments (VE) is still not obvious, requiring VE specialists to define, implement and evaluate solutions in an iterative way, often using low-level programming code. Moreover, the interaction with the virtual environment frequently varies depending on the context in which it is applied, such as the available hardware setup, user experience, or the pose of the user (e.g. sitting or standing). Lacking other tools, the context-awareness of an application is usually also implemented in an ad-hoc manner, using low-level programming. This may result in code that is difficult and expensive to maintain. One possible approach to facilitate the process of creating these highly interactive user interfaces is to adopt a model-based user interface design. This lifts the creation of a user interface to a higher level, allowing the designer to reason in terms of high-level concepts rather than writing programming code. In this paper, we adopt a model-based user interface design (MBUID) process for the creation of VEs, and explain how a context system using an Event-Condition-Action paradigm is added. We illustrate our approach by means of a case study.
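The Event-Condition-Action paradigm mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' context system: the class names, the "pose_changed" event, and the interaction technique names are all hypothetical and chosen only to show how an event triggers a rule whose condition is checked against the current context before its action runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, str]


@dataclass
class Rule:
    event: str                            # name of the triggering event (hypothetical)
    condition: Callable[[Context], bool]  # guard evaluated on the current context
    action: Callable[[Context], None]     # effect applied when the guard holds


class ContextSystem:
    """Toy ECA engine: stores a context and a list of rules."""

    def __init__(self) -> None:
        self.context: Context = {}
        self.rules: List[Rule] = []

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)

    def fire(self, event: str) -> None:
        # Run the action of every rule bound to this event whose condition holds.
        for rule in self.rules:
            if rule.event == event and rule.condition(self.context):
                rule.action(self.context)


system = ContextSystem()
system.context["pose"] = "standing"
system.context["technique"] = "virtual-hand"

# Assumed example rule: when the user sits down, switch the selection technique.
system.add_rule(Rule(
    event="pose_changed",
    condition=lambda ctx: ctx["pose"] == "sitting",
    action=lambda ctx: ctx.update(technique="ray-casting"),
))

system.context["pose"] = "sitting"
system.fire("pose_changed")
print(system.context["technique"])  # ray-casting
```

Keeping the condition separate from the action is what lets such rules be declared in a model rather than scattered through low-level event handlers.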
Cohen, Phil, Swindells, Colin, Oviatt, Sharon and Arthur, Alex (2008): A high-performance dual-wizard infrastructure for designing speech, pen, and multimodal interfaces. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 137-140. Available online
The present paper reports on the design and performance of a novel dual-wizard simulation infrastructure that has been used effectively to prototype next-generation adaptive and implicit multimodal interfaces for collaborative groupwork. This high-fidelity simulation infrastructure builds on past development of single-wizard simulation tools for multiparty multimodal interactions involving speech, pen, and visual input. In the new infrastructure, a dual-wizard simulation environment was developed that supports (1) real-time tracking, analysis, and system adaptivity to a user's speech and pen paralinguistic signal features (e.g., speech amplitude, pen pressure), as well as the semantic content of their input. This simulation also supports (2) transparent user training to adapt their speech and pen signal features in a manner that enhances the reliability of system functioning, i.e., the design of mutually-adaptive interfaces. To accomplish these objectives, this new environment is also capable of handling (3) dynamic streaming digital pen input. We illustrate the performance of the simulation infrastructure during longitudinal empirical research in which a user-adaptive interface was designed for implicit system engagement based exclusively on users' speech amplitude and pen pressure. While using this dual-wizard simulation method, the wizards responded successfully to over 3,000 user inputs with 95-98% accuracy and a joint wizard response time of less than 1.0 second during speech interactions and 1.65 seconds during pen interactions. Furthermore, the interactions they handled involved naturalistic multiparty meeting data in which high school students were engaged in peer tutoring, and all participants believed they were interacting with a fully functional system.
This type of simulation capability enables a new level of flexibility and sophistication in multimodal interface design, including the development of implicit multimodal interfaces that place minimal cognitive load on users during mobile, educational, and other applications.
Gruenstein, Alexander, McGraw, Ian and Badr, Ibrahim (2008): The WAMI toolkit for developing, deploying, and evaluating web-accessible multimodal interfaces. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 141-148. Available online
Many compelling multimodal prototypes have been developed which pair spoken input and output with a graphical user interface, yet it has often proved difficult to make them available to a large audience. This unfortunate reality limits the degree to which authentic user interactions with such systems can be collected and subsequently analyzed. We present the WAMI toolkit, which alleviates this difficulty by providing a framework for developing, deploying, and evaluating Web-Accessible Multimodal Interfaces in which users interact using speech, mouse, pen, and/or touch. The toolkit makes use of modern web-programming techniques, enabling the development of browser-based applications which rival the quality of traditional native interfaces, yet are available on a wide array of Internet-connected devices. We will showcase several sophisticated multimodal applications developed and deployed using the toolkit, which are available via desktop, laptop, and tablet PCs, as well as via several mobile devices. In addition, we will discuss resources provided by the toolkit for collecting, transcribing, and annotating usage data from multimodal user interactions.
Serrano, Marcos, Juras, David and Nigay, Laurence (2008): A three-dimensional characterization space of software components for rapidly developing multimodal interfaces. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 149-156. Available online
In this paper we address the problem of the development of multimodal interfaces. We describe a three-dimensional characterization space for software components along with its implementation in a component-based platform for rapidly developing multimodal interfaces. By graphically assembling components, the designer/developer describes the transformation chain from physical devices to tasks and vice-versa. In this context, the key point is to identify generic components that can be reused for different multimodal applications. Nevertheless, for flexibility purposes, a mixed approach that enables the designer to use both generic components and tailored components is required. As a consequence, our characterization space includes one axis dedicated to the reusability aspect of a component. The two other axes of our characterization space respectively depict the role of the component in the data-flow from devices to tasks and the level of specification of the component. We illustrate our three-dimensional characterization space as well as the implemented tool based on it using a multimodal map navigator.
Hoggan, Eve, Kaaresoja, Topi, Laitinen, Pauli and Brewster, Stephen (2008): Crossmodal congruence: the look, feel and sound of touchscreen widgets. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 157-164. Available online
Our research considers the following question: how can visual, audio and tactile feedback be combined in a congruent manner for use with touchscreen graphical widgets? For example, if a touchscreen display presents different styles of visual buttons, what should each of those buttons feel and sound like? This paper presents the results of an experiment conducted to investigate methods of congruently combining visual and combined audio/tactile feedback by manipulating the different parameters of each modality. The results indicate trends with individual visual parameters such as shape, size and height being combined congruently with audio/tactile parameters such as texture, duration and different actuator technologies. We draw further on the experiment results, using individual quality ratings to evaluate the perceived quality of our touchscreen buttons, and then reveal a correlation between perceived quality and crossmodal congruence. The results of this research will enable mobile touchscreen UI designers to create realistic, congruent buttons by selecting the most appropriate audio and tactile counterparts of visual button styles.
Giuliani, Manuel and Knoll, Alois (2008): MultiML: a general purpose representation language for multimodal human utterances. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 165-172. Available online
We present MultiML, a markup language for the annotation of multimodal human utterances. MultiML is able to represent input from several modalities, as well as the relationships between these modalities. Since MultiML separates general parts of representation from more context-specific aspects, it can easily be adapted for use in a wide range of contexts. This paper demonstrates how speech and gestures are described with MultiML, showing the principles -- including hierarchy and underspecification -- that ensure the quality and extensibility of MultiML. As a proof of concept, we show how MultiML is used to annotate a sample human-robot interaction in the domain of a multimodal joint-action scenario.
Simonin, Jérôme, Carbonell, Noelle and Pele, Danielle (2008): Effectiveness and usability of an online help agent embodied as a talking head. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 17-20. Available online
An empirical study is presented which aims at assessing the possible effects of embodiment on online help effectiveness and attraction. 22 undergraduate students who were unfamiliar with animation creation software created two simple animations with Flash, using two multimodal online help agents, EH and UH, one per animation. Both help agents used the same database of speech and graphics messages; EH was personified using a talking head while UH was not embodied. EH and UH presentation order was counterbalanced between participants. Subjective judgments elicited through verbal and nonverbal questionnaires indicate that the presence of the ECA was well accepted by participants and its influence on help effectiveness perceived as positive. Analysis of eye tracking data indicates that the ECA actually attracted their visual attention and interest, since they glanced at it from the beginning to the end of the animation creation (75 fixations during 40 min.). In contrast, post-test marks and interaction traces suggest that the ECA's presence had no perceivable effect on concept or skill learning and task execution. It only encouraged help consultation.
Voit, Michael and Stiefelhagen, Rainer (2008): Deducing the visual focus of attention from head pose estimation in dynamic multi-view meeting scenarios. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 173-180. Available online
This paper presents our work on recognizing the visual focus of attention during dynamic meeting scenarios. We collected a new dataset of meetings, in which acting participants were to follow a predefined script of events, to enforce focus shifts of the remaining, unaware meeting members. Including the whole room, a total of 35 potential focus targets were annotated, of which some were moved or introduced spontaneously during the meeting. On this dynamic dataset, we present a new approach to deduce the visual focus by means of head orientation as a first clue and show that our system recognizes the correct visual target in over 57% of all frames, compared to 47% when mapping head pose to the first-best intersecting focus target directly.
Morency, Louis-Philippe, Kok, Iwan de and Gratch, Jonathan (2008): Context-based recognition during human interactions: automatic feature selection and encoding dictionary. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 181-188. Available online
During face-to-face conversation, people use visual feedback such as head nods to communicate relevant information and to synchronize rhythm between participants. In this paper we describe how contextual information from other participants can be used to predict visual feedback and improve recognition of head gestures in human-human interactions. For example, in a dyadic interaction, the speaker's contextual cues such as gaze shifts or changes in prosody will influence listener backchannel feedback (e.g., head nod). To automatically learn how to integrate this contextual information into the listener gesture recognition framework, this paper addresses two main challenges: optimal feature representation using an encoding dictionary and automatic selection of optimal feature-encoding pairs. Multimodal integration between context and visual observations is performed using a discriminative sequential model (Latent-Dynamic Conditional Random Fields) trained on previous interactions. In our experiments involving 38 storytelling dyads, our context-based recognizer significantly improved head gesture recognition performance over a vision-only recognizer.
Hernandez-Rebollar, José Luis, Elsakay, Ethar Ibrahim and Alanís-Urquieta, José D. (2008): AcceleSpell, a gestural interactive game to learn and practice finger spelling. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 189-190. Available online
In this paper, an interactive computer game for learning and practicing continuous fingerspelling is described. The game is controlled by an instrumented glove known as AcceleGlove and a recognition algorithm based on decision trees. The Graphical User Interface is designed to allow beginners to remember the correct hand shapes and start finger spelling words sooner than traditional methods of learning.
Balchandran, Rajesh, Epstein, Mark E., Potamianos, Gerasimos and Seredi, Ladislav (2008): A multi-modal spoken dialog system for interactive TV. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 191-192. Available online
In this demonstration we present a novel prototype system that implements a multi-modal interface for control of the television. This system combines the standard TV remote control with a dialog management based natural language speech interface to allow users to efficiently interact with the TV, and to seamlessly alternate between the two modalities. One of the main objectives of this system is to make the unwieldy Electronic Program Guide information more navigable by the use of voice to filter and locate programs of interest.
Juras, David, Nigay, Laurence, Ortega, Michaël and Serrano, Marcos (2008): Multimodal slideshow: demonstration of the openinterface interaction development environment. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 193-194. Available online
In this paper, we illustrate the OpenInterface Interaction Development Environment (OIDE) that addresses the design and development of multimodal interfaces. Multimodal interaction software development presents a particular challenge because of the ever increasing number of novel interaction devices and modalities that can be used for a given interactive application. To demonstrate our graphical OIDE and its underlying approach, we present a multimodal slideshow implemented with our tool.
Katsurada, Kouichi, Kirihata, Teruki, Kudo, Masashi, Takada, Junki and Nitta, Tsuneo (2008): A browser-based multimodal interaction system. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 195-196. Available online
In this paper, we propose a system that enables users to have multimodal interactions (MMI) with an anthropomorphic agent via a web browser. By using the system, a user can interact simply by accessing a web site from his/her web browser. A notable characteristic of the system is that the anthropomorphic agent is synthesized from a photograph of a real human face. This makes it possible to construct a web site whose owner's facial agent speaks with visitors to the site. This paper describes the structure of the system and provides a screen shot.
The need for language aids is pervasive in today's world. There are millions of individuals who have language and speech challenges, and these individuals require additional support for communication and language learning. We demonstrate technology to supplement common face-to-face language interaction to enhance intelligibility, understanding, and communication, particularly for those with hearing impairments. Our research is investigating how to automatically supplement talking faces with information that is ordinarily conveyed by auditory means. This research consists of two areas of inquiry: 1) developing a neural network to perform real-time analysis of selected acoustic features for visual display, and 2) determining how quickly participants can learn to use these selected cues and how much they benefit from them when combined with speechreading.
This demo paper presents an early version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at "mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all". The Reminder helps users to plan activities and to remember what to do. The prototype merges mobile ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar. Users may ask questions such as "When was I supposed to meet Sara?" or "What's my schedule today?"
Diego, Jonathan Padilla San, Barrow, Alastair, Cox, Margaret and Harwin, William (2008): PHANTOM prototype: exploring the potential for learning with multimodal features in dentistry. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 201-202. Available online
In this paper, we will demonstrate how force feedback, motion-parallax, and stereoscopic vision can enhance the opportunities for learning in the context of dentistry. A dental training workstation prototype has been developed intended for use by dental students in their introductory course to preparing a tooth cavity. The multimodal feedback from haptics, motion tracking cameras, computer generated sound and graphics are being exploited to provide 'near-realistic' learning experiences. Whilst the empirical evidence provided is preliminary, we describe the potential of multimodal interaction via these technologies for enhancing dental-clinical skills.
Drettakis, George (2008): Audiovisual 3d rendering as a tool for multimodal interfaces. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 203-204. Available online
In this talk, we will start with a short overview of 3D audiovisual rendering and its applicability to multimodal interfaces. In recent years, we have seen the generalization of 3D applications, ranging from computer games, which involve a high level of realism, to applications such as SecondLife, in which the visual and auditory quality of the 3D environment leaves much to be desired. In our introduction we will attempt to examine the relationship between the audiovisual rendering of the environment and the interface. We will then review some of the audio-visual rendering algorithms we have developed in the last few years. We will discuss four main challenges we have addressed. The first is the development of realistic illumination and shadow algorithms which contribute greatly to the realism of 3D scenes, but could also be important for interfaces. The second involves the application of these illumination algorithms to augmented reality settings. The third concerns the development of perceptually-based techniques, in particular those using audio-visual cross-modal perception. The fourth challenge has been the development of approximate but "plausible" interactive solutions to more advanced rendering effects, both for graphics and audio. On the audio side, our review will include the introduction of clustering, masking and perceptual rendering for 3D spatialized audio and our recently developed solution for the treatment of contact sounds. On the graphics side, our discussion will include a quick overview of our illumination and shadow work, its application to augmented reality, our work on interactive rendering approximations and perceptually driven algorithms. For all these techniques we will discuss their relevance to multimodal interfaces, including our experience in an urban design case study, and attempt to relate them to recent interface research.
We will close with a broad reflection on the potential for closer collaboration between 3D audiovisual rendering and multimodal interfaces.
Damm, David, Fremerey, Christian, Kurth, Frank, Müller, Meinard and Clausen, Michael (2008): Multimodal presentation and browsing of music. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 205-208. Available online
Recent digitization efforts have led to large music collections, which contain music documents of various modes comprising textual, visual and acoustic data. In this paper, we present a multimodal music player for presenting and browsing digitized music collections consisting of heterogeneous document types. In particular, we concentrate on music documents of two widely used types for representing a musical work, namely visual music representation (scanned images of sheet music) and associated interpretations (audio recordings). We introduce novel user interfaces for multimodal (audio-visual) music presentation as well as intuitive navigation and browsing. Our system offers high quality audio playback with time-synchronous display of the digitized sheet music associated to a musical work. Furthermore, our system enables a user to seamlessly crossfade between various interpretations belonging to the currently selected musical work.
Devallez, Delphine, Fontana, Federico and Rocchesso, Davide (2008): An audio-haptic interface based on auditory depth cues. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 209-216. Available online
Spatialization of sound sources in depth allows a hierarchical display of multiple audio streams and therefore may be an efficient tool for developing novel auditory interfaces. In this paper we present an audio-haptic interface for audio browsing based on rendering distance cues for ordering sound sources in depth. The haptic interface includes a linear position tactile sensor made of conductive material. The touch position on the ribbon is mapped onto the listening position on a rectangular virtual membrane, modeled by a bidimensional Digital Waveguide Mesh and providing distance cues of four equally spaced sound sources. Furthermore, a knob of a MIDI controller controls the position of the mesh along the playlist, which makes it possible to browse the whole set of files. Subjects involved in a user study found the interface intuitive and entertaining. In particular the interaction with the stripe was highly appreciated.
Miller, Chreston, Robinson, Ashley, Wang, Rongrong, Chung, Pak and Quek, Francis (2008): Interaction techniques for the analysis of complex data on high-resolution displays. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 21-28. Available online
When combined with the organizational space provided by a simple table, physical notecards are a powerful organizational tool for information analysis. The physical presence of these cards affords many benefits but also is a source of disadvantages. For example, complex relationships among them are hard to represent. There have been a number of notecard software systems developed to address these problems. Unfortunately, the amount of visual detail in such systems is lacking compared to real notecards on a large physical table; we look to alleviate this problem by providing a digital solution. One challenge with new display technology and systems is providing an efficient interface for their users. In this paper we compare different interaction techniques for an emerging class of organizational systems that use high-resolution tabletop displays. The focus of these systems is to more easily and efficiently assist interaction with information. Using PDA, token, gesture, and voice interaction techniques, we conducted a within-subjects experiment comparing these techniques over a large high-resolution horizontal display. We found strengths and weaknesses for each technique. In addition, we noticed that some techniques build upon and complement others.
Khalidov, Vasil, Forbes, Florence, Hansard, Miles, Arnaud, Elise and Horaud, Radu (2008): Detection and localization of 3d audio-visual objects using unsupervised clustering. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 217-224. Available online
This paper addresses the issues of detecting and localizing objects in a scene that are both seen and heard. We explain the benefits of a human-like configuration of sensors (binaural and binocular) for gathering auditory and visual observations. It is shown that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data into a common audio-visual 3D representation via a pair of mixture models. Inference is performed by a version of the expectation-maximization algorithm, which is formally derived, and which provides cooperative estimates of both the auditory activity and the 3D position of each object. We describe several experiments with single- and multiple-speaker detection and localization, in the presence of other audio sources.
Bangalore, Srinivas and Johnston, Michael (2008): Robust gesture processing for multimodal interaction. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 225-232. Available online
With the explosive growth in mobile computing and communication over the past few years, it is possible to access almost any information from virtually anywhere. However, the efficiency and effectiveness of this interaction is severely limited by the inherent characteristics of mobile devices, including small screen size and the lack of a viable keyboard or mouse. This paper concerns the use of multimodal language processing techniques to enable interfaces combining speech and gesture input that overcome these limitations. Specifically we focus on robust processing of pen gesture inputs in a local search application and demonstrate that edit-based techniques that have proven effective in spoken language processing can also be used to overcome unexpected or errorful gesture input. We also examine the use of a bottom-up gesture aggregation technique to improve the coverage of multimodal understanding.
Hung, Hayley, Jayagopi, Dinesh Babu, Ba, Sileye, Odobez, Jean-Marc and Gatica-Perez, Daniel (2008): Investigating automatic dominance estimation in groups from visual attention and speaking activity. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 233-236. Available online
We study the automation of the visual dominance ratio (VDR), a classic measure of displayed dominance in social psychology literature, which combines both gaze and speaking activity cues. The VDR is modified to estimate dominance in multi-party group discussions where natural verbal exchanges are possible and other visual targets such as a table and slide screen are present. Our findings suggest that fully automated versions of these measures can effectively estimate the most dominant person in a meeting and can match the dominance estimation performance obtained when manual labels of visual attention are used.
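As a rough illustration of the measure being automated: the classic VDR is usually described as the proportion of time a person looks at others while speaking, divided by the proportion of time they look at others while listening. The sketch below computes that ratio from per-frame labels. The function name, the frame-level input format, the treatment of every non-speaking frame as "listening", and the toy data are all assumptions made for illustration; this is not the authors' multi-party modification.

```python
def visual_dominance_ratio(speaking, looking):
    """Classic VDR from per-frame booleans (hypothetical input format).

    speaking[i] -- True if the person speaks in frame i
    looking[i]  -- True if the person looks at another participant in frame i
    Frames where the person is not speaking are treated as listening
    (a simplifying assumption).
    """
    if len(speaking) != len(looking):
        raise ValueError("label sequences must have equal length")
    speak_frames = sum(1 for s in speaking if s)
    listen_frames = len(speaking) - speak_frames
    look_speak = sum(1 for s, l in zip(speaking, looking) if s and l)
    look_listen = sum(1 for s, l in zip(speaking, looking) if not s and l)
    if speak_frames == 0 or listen_frames == 0 or look_listen == 0:
        return None  # the ratio is undefined for this slice
    return (look_speak / speak_frames) / (look_listen / listen_frames)


# Toy labels: looks at others in 3 of 4 speaking frames, 1 of 4 listening frames.
speaking = [True, True, True, True, False, False, False, False]
looking = [True, True, True, False, True, False, False, False]
vdr = visual_dominance_ratio(speaking, looking)
```

A value above 1 means the person looks at others more while speaking than while listening, which the social psychology literature associates with displayed dominance.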
Gurban, Mihai, Thiran, Jean-Philippe, Drugman, Thomas and Dutoit, Thierry (2008): Dynamic modality weighting for multi-stream hmms in audio-visual speech recognition. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 237-240. Available online
Merging decisions from different modalities is a crucial problem in Audio-Visual Speech Recognition. To solve this, state synchronous multi-stream HMMs have been proposed for their important advantage of incorporating stream reliability in their fusion scheme. This paper focuses on stream weight adaptation based on modality confidence estimators. We assume different and time-varying environment noise, as can be encountered in realistic applications, and, for this, adaptive methods are best suited. Stream reliability is assessed directly through classifier outputs since they are not specific to either noise type or level. The influence of constraining the weights to sum to one is also discussed.
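For orientation, multi-stream HMMs commonly fuse modalities by weighting the per-stream log-likelihoods of each state, log b(o) = λ_A log b_A(o_A) + λ_V log b_V(o_V). The sketch below shows one way stream weights could be derived from classifier outputs and constrained to sum to one. The inverse-entropy confidence mapping, the function names, and the toy posteriors are assumptions for illustration and are not taken from the paper.

```python
import math


def entropy(posteriors):
    """Shannon entropy of a posterior distribution (natural log)."""
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)


def stream_weights(audio_post, video_post, eps=1e-6):
    """Hypothetical confidence mapping: lower entropy gives a higher weight.

    The two weights are normalised so that they sum to one.
    """
    conf_a = 1.0 / (entropy(audio_post) + eps)
    conf_v = 1.0 / (entropy(video_post) + eps)
    lam_a = conf_a / (conf_a + conf_v)
    return lam_a, 1.0 - lam_a


def fused_log_likelihood(log_b_audio, log_b_video, lam_a, lam_v):
    """Weighted sum of per-stream log-likelihoods for one HMM state."""
    return lam_a * log_b_audio + lam_v * log_b_video


# Toy frame: the audio classifier is uncertain (noisy), the video one is confident.
audio_post = [0.25, 0.25, 0.25, 0.25]
video_post = [0.85, 0.05, 0.05, 0.05]
lam_a, lam_v = stream_weights(audio_post, video_post)
score = fused_log_likelihood(-4.0, -2.0, lam_a, lam_v)
```

Recomputing the weights per frame is what makes the scheme adaptive to time-varying noise; dropping the normalisation step would correspond to the unconstrained case the abstract mentions.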
Vertegaal, Roel (2008): A Fitts Law comparison of eye tracking and manual input in the selection of visual targets. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 241-248. Available online
We present a Fitts' Law evaluation of a number of eye tracking and manual input devices in the selection of large visual targets. We compared performance of two eye tracking techniques, manual click and dwell time click, with that of mouse and stylus. Results show eye tracking with manual click outperformed the
Lee, Minkyung and Billinghurst, Mark (2008): A Wizard of Oz study for an AR multimodal interface. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 249-256. Available online
In this paper we describe a Wizard of Oz (WOz) user study of an Augmented Reality (AR) interface that uses multimodal input (MMI) with natural hand interaction and speech commands. Our goal is to use a WOz study to help guide the creation of a multimodal AR interface which is most natural to the user. In this study we used three virtual object arranging tasks with two different display types (a head mounted display, and a desktop monitor) to see how users used multimodal commands, and how different AR display conditions affect those commands. The results provided valuable insights into how people naturally interact in a multimodal AR scene assembly task. For example, we discovered the optimal time frame for fusing speech and gesture commands into a single command. We also found that display type did not produce a significant difference in the type of commands used. Using these results, we present design recommendations for multimodal interaction in AR environments.
Otsuka, Kazuhiro, Araki, Shoko, Ishizuka, Kentaro, Fujimoto, Masakiyo, Heinrich, Martin and Yamato, Junji (2008): A realtime multimodal system for analyzing group meetings by combining face pose tracking and speaker diarization. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 257-264. Available online
This paper presents a realtime system for analyzing group meetings that uses a novel omnidirectional camera-microphone system. The goal is to automatically discover the visual focus of attention (VFOA), i.e. "who is looking at whom", in addition to speaker diarization, i.e. "who is speaking and when". First, a novel tabletop sensing device for round-table meetings is presented; it consists of two cameras with two fisheye lenses and a triangular microphone array. Second, from high-resolution omnidirectional images captured with the cameras, the position and pose of people's faces are estimated by STCTracker (Sparse Template Condensation Tracker); it realizes realtime robust tracking of multiple faces by utilizing GPUs (Graphics Processing Units). The face position/pose data output by the face tracker is used to estimate the focus of attention in the group. Using the microphone array, robust speaker diarization is carried out by a VAD (Voice Activity Detection) and a DOA (Direction of Arrival) estimation followed by sound source clustering. This paper also presents new 3-D visualization schemes for meeting scenes and the results of an analysis. Using two PCs, one for vision and one for audio processing, the system runs at about 20 frames per second for 5-person meetings.
Lemmelä, Saija, Vetek, Akos, Mäkelä, Kaj and Trendafilov, Dari (2008): Designing and evaluating multimodal interaction for mobile contexts. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 265-272. Available online
In this paper we report on our experience with the design and evaluation of multimodal user interfaces in various contexts. We introduce a novel combination of existing design and evaluation methods in the form of a 5-step iterative process and show the feasibility of this method and some of the lessons learned through the design of a messaging application for two contexts (in car, walking). The iterative design process we employed included the following five basic steps: 1) identifying the limitations affecting the usage of different modalities in various contexts (contextual observations and context analysis), 2) identifying and selecting suitable interaction concepts and creating a general design for the multimodal application (storyboarding, use cases, interaction concepts, task breakdown, application UI and interaction design), 3) creating modality-specific UI designs, 4) rapid prototyping and 5) evaluating the prototype in naturalistic situations to find key issues to be taken into account in the next iteration. We have not only found clear indications that context affects users' preferences in the usage of modalities and interaction strategies but also identified some of these. For instance, while speech interaction was preferred in the car environment, users did not consider it useful when they were walking. 2D (finger strokes) and especially 3D (tilt) gestures were preferred by walking users.
Kaliouby, Rana El and Mikhail, Mina (2008): Automated sip detection in naturally-evoked video. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 273-280. Available online
Quantifying consumer experiences is an emerging application area for event detection in video. This paper presents a hierarchical model for robust sip detection that combines bottom-up processing of face videos, namely real-time head action unit analysis and head gesture recognition, with top-down knowledge about sip events and task semantics. Our algorithm achieves an average accuracy of 82% in videos that feature single sips, and an average accuracy of 78% and
Haptic stimulation in motion has received little study so far. To provide guidance for designing haptic interfaces for mobile use we carried out an initial experiment using C-2 actuators. Sixteen participants took part in the experiment to find out whether there is a difference in perceiving low-amplitude vibrotactile stimuli when exposed to minimal and moderate physical exertion. A stationary bike was used to control the exertion. Four body locations (wrist, leg, chest and back), two stimulus durations (1000 ms and 2000 ms) and two motion conditions with the stationary bicycle (still and moderate pedaling) were applied. It was found that cycling had a significant effect on both the perception accuracy and the reaction times with selected stimuli. Stimulus amplitudes used in this experiment can be used to help haptic design for mobile users.
Tahir, Muhammad, Bailly, Gilles, Lecolinet, Eric and Mouret, Gérard (2008): TactiMote: a tactile remote control for navigating in long lists. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 285-288. Available online
This paper presents TactiMote, a remote control with tactile feedback designed for navigating in long lists and catalogues. TactiMote integrates a joystick that allows 2D interaction with the thumb and a Braille cell that provides tactile feedback. This feedback is intended to help the selection task in novice mode and to allow for fast eyes-free navigation among favorite items in expert mode. The paper describes the design of the TactiMote prototype for TV channel selection and reports a preliminary experiment that shows the feasibility of the approach.
It is of prime importance in everyday human life to cope with and respond appropriately to events that are not foreseen by prior experience. Machines to a large extent lack the ability to respond appropriately to such inputs. An important class of unexpected events is defined by incongruent combinations of inputs from different modalities, and therefore multimodal information provides a crucial cue for the identification of such events, e.g., the sound of a voice is being heard while the person in the field-of-view does not move her lips. In the project DIRAC ("Detection and Identification of Rare Audio-visual Cues") we have been developing algorithmic approaches to the detection of such events, as well as an experimental hardware platform to test them. An audio-visual platform ("AWEAR" -- audio-visual wearable device) has been constructed with the goal of helping users with disabilities or a high cognitive load to deal with unexpected events. Key hardware components include stereo panoramic vision sensors and 6-channel worn-behind-the-ear (hearing aid) microphone arrays. Data have been recorded to study audio-visual tracking, a/v scene/object classification and a/v detection of incongruencies.
Favre, Sarah, Salamin, Hugues, Dines, John and Vinciarelli, Alessandro (2008): Role recognition in multiparty recordings using social affiliation networks and discrete distributions. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 29-36. Available online
This paper presents an approach for the recognition of roles in multiparty recordings. The approach includes two major stages: extraction of Social Affiliation Networks (speaker diarization and representation of people in terms of their social interactions), and role recognition (application of discrete probability distributions to map people into roles). The experiments are performed over several corpora, including broadcast data and meeting recordings, for a total of roughly 90 hours of material. The results are satisfactory for the broadcast data (around 80 percent of the data time correctly labeled in terms of role), while they still must be improved in the case of the meeting recordings (around 45 percent of the data time correctly labeled). In both cases, the approach significantly outperforms chance.
Funakoshi, Kotaro, Kobayashi, Kazuki, Nakano, Mikio, Yamada, Seiji, Kitamura, Yasuhiko and Tsujino, Hiroshi (2008): Smoothing human-robot speech interactions by using a blinking-light as subtle expression. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 293-296. Available online
Speech overlaps, undesired collisions of utterances between systems and users, harm smooth communication and degrade the usability of systems. We propose a method to enable smooth speech interactions between a user and a robot, which enables subtle expressions by the robot in the form of a blinking LED attached to its chest. In concrete terms, we show that, by blinking an LED from the end of the user's speech until the robot's speech, the number of undesirable repetitions, which are responsible for speech overlaps, decreases, while that of desirable repetitions increases. In experiments, participants played a last-and-first game with the robot. The experimental results suggest that the blinking light can prevent speech overlaps between a user and a robot, speed up dialogues, and improve users' impressions.
Koskinen, Emilia, Kaaresoja, Topi and Laitinen, Pauli (2008): Feel-good touch: finding the most pleasant tactile feedback for a mobile touch screen button. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 297-304. Available online
Earlier research has shown the benefits of tactile feedback for touch screen widgets in all metrics: performance, usability and user experience. In our current research the goal was to go deeper in understanding the characteristics of a tactile click for virtual buttons. More specifically we wanted to find a tactile click which is the most pleasant to use with a finger. We used two actuator solutions in a small mobile touch screen: piezo actuators or a standard vibration motor. We conducted three experiments: the first and second experiments aimed to find the most pleasant tactile feedback produced with the piezo actuators or a vibration motor, respectively, and the third one combined and compared the results from the first two experiments. The results from the first two experiments showed significant differences in the perceived pleasantness of the tactile clicks, and we used the most pleasant clicks in the comparison experiment in addition to the condition with no tactile feedback. Our findings confirmed results from earlier studies showing that tactile feedback is superior to a nontactile condition when virtual buttons are used with the finger, regardless of the technology behind the tactile feedback. Another finding suggests that the users perceived the feedback produced with piezo actuators as slightly more pleasant than the vibration motor based feedback, although the difference was not statistically significant. These results indicate that it is possible to modify the characteristics of the virtual button tactile clicks towards the most pleasant ones, and this knowledge can help designers to create better touch screen virtual buttons and keyboards.
Evreinova, Tatiana G. (2008): Manipulating trigonometric expressions encoded through electro-tactile signals. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 3-8. Available online
Visually challenged pupils and students need special developmental tools. To facilitate their skills acquisition in math, different game-like techniques have been implemented. Along with Braille, electro-tactile patterns (eTPs) can be used to deliver mathematical content to the visually challenged user. The goal of this work was to continue the exploration of non-visual manipulation of mathematics. The eTPs denoting four trigonometric functions and their seven arguments (angles) were shaped with a designed electro-tactile unit. A matching software application was used to facilitate the learning process of the eTPs. A permutation puzzle game was employed to improve the perceptual skills of the players in manipulating the encoded trigonometric functions and their arguments. The performance of 8 subjects was investigated and discussed. The experimental findings confirmed the possibility of using the eTPs for communicating different kinds of math content.
Hernández-Trapote, Álvaro, López-Mencía, Beatriz, Díaz, David, Fernández-Pozo, Rubén and Caminero, Javier (2008): Embodied conversational agents for voice-biometric interfaces. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 305-312. Available online
In this article we present a research scheme which aims to analyze the use of Embodied Conversational Agent (ECA) technology to improve the robustness and acceptability of speaker enrolment and verification dialogues designed to provide secure access through natural and intuitive speaker recognition. In order to find out the possible effects of the visual information channel provided by the ECA, tests were carried out in which users were divided into two groups, each interacting with a different interface (metaphor): an ECA Metaphor group -- with an ECA -- and a VOICE Metaphor group -- without an ECA --. Our evaluation methodology is based on the ITU-T P.851 recommendation for spoken dialogue system evaluation, which we have complemented to cover particular aspects with regard to the two major extra elements we have incorporated: secure access and an ECA. Our results suggest that likeability-type factors and system capabilities are perceived more positively by the ECA metaphor users than by the VOICE metaphor users. However, the ECA's presence seems to intensify users' privacy concerns.
Petridis, Stavros and Pantic, Maja (2008): Audiovisual laughter detection based on temporal features. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 37-44. Available online
Previous research on automatic laughter detection has mainly been focused on audio-based detection. In this study we present an audio-visual approach to distinguishing laughter from speech based on temporal features and we show that integrating the information from audio and video channels leads to improved performance over single-modal approaches. Static features are extracted on an audio/video frame basis and then combined with temporal features extracted over a temporal window, describing the evolution of static features over time. The use of several different temporal features has been investigated and it has been shown that the addition of temporal information results in an improved performance over utilizing static information only. It is common to use a fixed set of temporal features which implies that all static features will exhibit the same behaviour over a temporal window. However, this does not always hold and we show that when AdaBoost is used as a feature selector, different temporal features for each static feature are selected, i.e., the temporal evolution of each static feature is described by different statistical measures. When tested on 96 audiovisual sequences, depicting spontaneously displayed (as opposed to posed) laughter and speech episodes, in a person independent way the proposed audiovisual approach achieves an F1 rate of over 89%.
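The idea of describing the evolution of a static feature over a temporal window can be sketched as follows. The particular statistics (mean, standard deviation, range, first-to-last difference), the function name, and the toy feature track are assumptions chosen for illustration; the abstract does not list which statistical measures were used.

```python
import statistics


def temporal_features(track, window):
    """Summarise one static feature over a sliding window ending at each frame.

    track  -- per-frame values of a single static feature (hypothetical input)
    window -- number of frames in the temporal window
    """
    feats = []
    for end in range(window, len(track) + 1):
        w = track[end - window:end]
        feats.append({
            "mean": statistics.fmean(w),
            "std": statistics.pstdev(w),
            "range": max(w) - min(w),
            "delta": w[-1] - w[0],
        })
    return feats


# Toy per-frame values of one static feature.
track = [1.0, 2.0, 3.0, 5.0, 4.0]
feats = temporal_features(track, window=3)
```

A feature selector such as AdaBoost could then choose, per static feature, which of these window statistics to keep, which is the behaviour the abstract reports.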
Jayagopi, Dinesh Babu, Ba, Sileye, Odobez, Jean-Marc and Gatica-Perez, Daniel (2008): Predicting two facets of social verticality in meetings from five-minute time slices and nonverbal cues. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 45-52. Available online
This paper addresses the automatic estimation of two aspects of social verticality (status and dominance) in small-group meetings using nonverbal cues. The correlation of nonverbal behavior with these social constructs has been extensively documented in social psychology, but its value for computational models is, in many cases, still unknown. We present a systematic study of automatically extracted cues -- including vocalic, visual activity, and visual attention cues -- and investigate their relative effectiveness in predicting both the most-dominant person and the high-status project manager from relatively short observations. We use five hours of task-oriented meeting data with natural behavior for our experiments. Our work suggests that, although dominance and role-based status are related concepts, they are not equivalent and are thus not equally explained by the same nonverbal cues. Furthermore, the best cues can correctly predict the person with highest dominance or role-based status with an accuracy of approximately 70%.
Pianesi, Fabio, Mana, Nadia, Cappelletti, Alessandro, Lepri, Bruno and Zancanaro, Massimo (2008): Multimodal recognition of personality traits in social interactions. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 53-60. Available online
This paper targets the automatic detection of personality traits in a meeting environment by means of audio and visual features; information about the relational context is captured by means of acoustic features designed for that purpose. Two personality traits are considered: Extraversion (from the Big Five) and the Locus of Control. The classification task is applied to thin slices of behaviour, in the form of 1-minute sequences. SVMs were used to test the performance of several training and testing instance setups, including a restricted set of audio features obtained through feature selection. The outcomes improve considerably over existing results, provide evidence about the feasibility of the multimodal analysis of personality and the role of social context, and pave the way to further studies addressing different feature setups and/or targeting different personality traits.
Vinciarelli, Alessandro, Pantic, Maja, Bourlard, Hervé and Pentland, Alex (2008): Social signals, their function, and automatic analysis: a survey. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 61-68. Available online
Social Signal Processing (SSP) aims at the analysis of social behaviour in both Human-Human and Human-Computer interactions. SSP revolves around automatic sensing and interpretation of social signals, complex aggregates of nonverbal behaviours through which individuals express their attitudes towards other human (and virtual) participants in the current social context. As such, SSP integrates both engineering (speech analysis, computer vision, etc.) and human sciences (social psychology, anthropology, etc.) as it requires multimodal and multidisciplinary approaches. As of today, SSP is still in its early infancy, but the domain is quickly developing, and a growing number of works is appearing in the literature. This paper provides an introduction to nonverbal behaviour involved in social signals and a survey of the main results obtained so far in SSP. It also outlines possibilities and challenges that SSP is expected to face in the next years if it is to reach its full maturity.
Many mobile machine learning applications require collecting and labeling data, and a traditional GUI on a mobile device may not be an appropriate or viable method for this task. This paper presents an alternative approach to mobile labeling of sensor data called VoiceLabel. VoiceLabel consists of two components: (1) a speech-based data collection tool for mobile devices, and (2) a desktop tool for offline segmentation of recorded data and recognition of spoken labels. The desktop tool automatically analyzes the audio stream to find and recognize spoken labels, and then presents a multimodal interface for reviewing and correcting data labels using a combination of the audio stream, the system's analysis of that audio, and the corresponding mobile sensor data. A study with ten participants showed that VoiceLabel is a viable method for labeling mobile sensor data. VoiceLabel also illustrates several key features that inform the design of other data labeling tools.
Schehl, Jan, Pfalzgraf, Alexander, Pfleger, Norbert and Steigner, Jochen (2008): The babbleTunes system: talk to your ipod!. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 77-80. Available online
This paper presents a full-fledged multimodal dialogue system for accessing multimedia content in home environments from both portable media players and online sources. We will mainly focus on two aspects of the system that provide the basis for a natural interaction: (i) the automatic processing of named entities which permits the incorporation of dynamic data into the dialogue (e.g., song or album titles, artist names, etc.) and (ii) general multimodal interaction patterns that are bound to ease the access to large sets of data.
Kühnel, Christine, Weiss, Benjamin, Wechsung, Ina, Fagel, Sascha and Möller, Sebastian (2008): Evaluating talking heads for smart home systems. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 81-84. Available online
In this paper we report the results of a user study evaluating talking heads in the smart home domain. Three noncommercial talking head components are linked to two freely available speech synthesis systems, resulting in six different combinations. The influence of head and voice components on overall quality is analyzed as well as the correlation between them. Three different ways to assess overall quality are presented. It is shown that these three are consistent in their results. Another important result is that in this design speech and visual quality are independent of each other. Furthermore, a linear combination of both quality aspects models overall quality of talking heads to a good degree.
Ahmaniemi, Teemu Tuomas, Lantz, Vuokko and Marila, Juha (2008): Perception of dynamic audiotactile feedback to gesture input. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 85-92. Available online
In this paper we present results of a study where perception of dynamic audiotactile feedback to gesture input was examined. Our main motivation was to investigate how users' active input and different modality conditions affect the perception of the feedback. The experimental prototype in the study was a handheld sensor-actuator device that responds dynamically to the user's hand movements, creating an impression of a virtual texture. The feedback was designed so that the amplitude and frequency of the texture were proportional to the overall angular velocity of the device. We used four different textures with different velocity responses. The feedback was presented to the user by the tactile actuator in the device, by audio through headphones, or by both. During the experiments, textures were switched at random intervals and the task of the user was to detect the changes while moving the device freely. The performances of the users with audio or audiotactile feedback were roughly equal, while tactile feedback alone yielded poorer performance. The texture design didn't influence the movement velocity or periodicity, but tactile feedback induced the most and audio feedback the least energetic motion. In addition, significantly better performance was achieved with slower motion. We also found that significant learning happened over time; detection accuracy increased significantly during and between the experiments. The masking noise used in the tactile modality condition did not significantly influence the detection accuracy when compared to acoustic blocking, but it increased the average detection time.
Perakakis, Manolis and Potamianos, Alexandros (2008): Multimodal system evaluation using modality efficiency and synergy metrics. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 9-16. Available online
In this paper, we propose two new objective metrics, relative modality efficiency and multimodal synergy, that can provide valuable information and identify usability problems during the evaluation of multimodal systems. Relative modality efficiency (when compared with modality usage) can identify suboptimal use of modalities due to poor interface design or information asymmetries. Multimodal synergy measures the added value from efficiently combining multiple input modalities, and can be used as a single measure of the quality of modality fusion and fission in a multimodal system. The proposed metrics are used to evaluate two multimodal systems that combine pen/speech and mouse/keyboard modalities respectively. The results provide much insight into multimodal interface usability issues, and demonstrate how multimodal systems should adapt to maximize modality synergy, resulting in efficient, natural, and intelligent multimodal interfaces.
Miki, Madoka, Miyajima, Chiyomi, Nishino, Takanori, Kitaoka, Norihide and Takeda, Kazuya (2008): An integrative recognition method for speech and gestures. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 93-96. Available online
We propose an integrative recognition method for speech accompanied by gestures such as pointing. Simultaneously generated speech and pointing complementarily help the recognition of both, and thus the integration of these multiple modalities may improve recognition performance. As an example of such multimodal speech, we selected the explanation of a geometry problem. While the problem was being solved, speech and fingertip movements were recorded with a close-talking microphone and a 3D position sensor. To find the correspondence between utterances and gestures, we propose a probability distribution of the time gap between the starting times of an utterance and gestures. We also propose an integrative recognition method using this distribution. We obtained an approximately 3-point improvement in both speech and fingertip movement recognition performance with this method.
Quek, Francis, Ehrich, Roger and Lockhart, Thurmon (2008): As go the feet...: on the estimation of attentional focus from stance. In: Proceedings of the 2008 International Conference on Multimodal Interfaces 2008. pp. 97-104. Available online
The estimation of the direction of visual attention is critical to a large number of interactive systems. This paper investigates the cross-modal relation of the position of one's feet (or standing stance) to the focus of gaze. The intuition is that while one CAN have a range of attentional foci from a particular stance, one may be MORE LIKELY to look in specific directions given an approach vector and stance. We posit that the cross-modal relationship is constrained by biomechanics and personal style. We define a stance vector that models the approach direction before stopping and the pose of a subject's feet. We present a study where the subjects' feet and approach vector are tracked. The subjects read aloud the contents of note cards in 4 locations. The order of visits to the cards was randomized. Ten subjects read 40 lines of text each, yielding 400 stance vectors and gaze directions. We divided our data into 4 sets of 300 training and 100 test vectors and trained a neural net to estimate the gaze direction given the stance vector. Our results show that 31% of our gaze orientation estimates were within 5°, 51% were within 10°, and 60% were within 15°. Given the ability to track foot position, the procedure is minimally invasive.
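The reported figures are cumulative fractions of estimates falling within an angular error threshold. A minimal sketch of that evaluation is shown below, assuming gaze directions are given as angles in degrees and that errors wrap around at 360°. The function names and the toy values are hypothetical and are not data from the study.

```python
def angular_error(est_deg, true_deg):
    """Smallest absolute difference between two directions, in degrees."""
    d = abs(est_deg - true_deg) % 360.0
    return min(d, 360.0 - d)


def fraction_within(estimates, truths, threshold_deg):
    """Share of estimates whose angular error is at most threshold_deg."""
    errors = [angular_error(e, t) for e, t in zip(estimates, truths)]
    return sum(1 for err in errors if err <= threshold_deg) / len(errors)


# Toy values only: four estimated and four true gaze directions.
estimates = [10.0, 358.0, 95.0, 200.0]
truths = [12.0, 4.0, 80.0, 170.0]
within = {t: fraction_within(estimates, truths, t) for t in (5, 10, 15)}
```

Because each threshold includes the smaller ones, the fractions are non-decreasing, as in the 31%, 51% and 60% figures reported above.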