Proceedings of the 2005 International Conference on Multimodal Interfaces
The International Conference on Multimodal Interfaces (ICMI) is an annual ACM-sponsored conference that promotes research in next-generation perceptive, adaptive and multimodal user interfaces. These new interfaces are especially well suited for interpreting natural communication and activity patterns in real-world environments.
The following articles are from "Proceedings of the 2005 International Conference on Multimodal Interfaces":
Ernst, Marc O. (2005): The "puzzle" of sensory perception: putting together multisensory information. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. p. 1. Available online
For perceiving the environment our brain uses multiple sources of sensory information derived from several different modalities, including vision, touch and audition. The question of how information derived from these different sensory modalities converges in the brain to form a coherent and robust percept is central to understanding the process of perception. My main research interest is the study of human perception, focusing on multimodal integration and visual-haptic interaction. For this I use quantitative computational/statistical models together with psychophysical and neuropsychological methods. A desirable goal for the perceptual system is to maximize the reliability of the various perceptual estimates. From a statistical viewpoint, the optimal strategy for achieving this goal is to integrate all available sensory information, which may be done using a "maximum-likelihood-estimation" (MLE) strategy. The combined percept is then a weighted average of the individual estimates, with weights proportional to their reliabilities. In a recent study we showed that humans actually integrate visual and haptic information in such a statistically optimal fashion (Ernst & Banks, Nature, 2002). Others have since demonstrated that this finding holds not only for the integration of vision and touch, but also for the integration of information across and within other modalities, such as audition or vision. This suggests that maximum-likelihood estimation is an effective and widely used strategy exploited by the perceptual system. By integrating sensory information, the brain may or may not lose access to the individual input signals feeding into the integrated percept. The degree to which the original information remains accessible defines the strength of coupling between the signals. We found that the strength of coupling varies with the set of signals used; e.g., coupling is strong for stereo and texture signals to slant and weak for visual and haptic signals to size (Hillis, Ernst, Banks & Landy, Science, 2002). As suggested by one of our recent learning studies, the strength of coupling, which can be modeled using Bayesian statistics, seems to depend on the natural statistical co-occurrence of the signals (Jäkel & Ernst, in prep.). An important precondition for integrating signals is to know which signals derived from the different modalities belong together and how reliable they are. Recently we showed that touch can teach the visual modality how to interpret its signals and their reliabilities; more specifically, by exploiting touch we can alter the visual perception of slant (Ernst, Banks & Bulthoff, Nature Neuroscience, 2000). This finding contributes to a very old debate postulating that we perceive the world only because of our interactions with the environment. Similarly, in one of our latest studies we showed that experience can change the so-called "light-from-above" prior. Prior knowledge is essential for the interpretation of sensory signals during perception; consequently, with the change in prior we introduced a change in the perception of shape (Adams, Graf & Ernst, Nature Neuroscience, 2004). Integration is only sensible if the information sources carry redundant information. If the information sources are complementary, different combination strategies have to be exploited. Complementation of cross-modal information was demonstrated in a recent study investigating visual-haptic shape perception (Newell, Ernst, Tjan & Bulthoff, Psychological Science, 2001).
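The MLE combination rule described in the abstract admits a compact sketch. The following minimal Python example (an illustration of the general scheme, not the authors' code) fuses noisy estimates with weights proportional to their reliabilities, i.e. inverse variances:

```python
def mle_combine(estimates, variances):
    """Fuse independent sensory estimates by maximum-likelihood estimation.

    Each weight is proportional to the cue's reliability (1 / variance),
    so the fused estimate is the reliability-weighted average and its
    variance is never larger than that of the best single cue.
    """
    reliabilities = [1.0 / v for v in variances]
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    fused = sum(w * e for w, e in zip(weights, estimates))
    fused_variance = 1.0 / total
    return fused, fused_variance

# Equally reliable visual and haptic size estimates average evenly;
# an unreliable cue is down-weighted rather than discarded.
print(mle_combine([10.0, 14.0], [1.0, 1.0]))  # → (12.0, 0.5)
```

Note that the fused variance, 1/Σ(1/σ²ᵢ), is always below the smallest single-cue variance, which is the statistical sense in which integration maximizes reliability.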
Barthelmess, Paulo, Kaiser, Ed, Huang, Xiao and Demirdjian, David (2005): Distributed pointing for multimodal collaboration over sketched diagrams. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 10-17. Available online
A problem faced by groups that are not co-located but need to collaborate on a common task is the reduced access to the rich multimodal communicative context that they would have if they were collaborating face-to-face. Collaboration support tools aim to reduce the adverse effects of this restricted access to the fluid intermixing of speech, gesturing, writing and sketching by providing mechanisms that enhance distributed participants' awareness of each other's actions. In this work we explore novel ways to leverage the capabilities of multimodal context-aware systems to bridge co-located and distributed collaboration contexts. We describe a system that allows participants at remote sites to collaborate in building a project schedule via sketching on multiple distributed whiteboards, and show how participants can be made aware of naturally occurring pointing gestures that reference diagram constituents as they are performed by remote participants. The system explores the multimodal fusion of pen, speech and 3D gestures, coupled to the dynamic construction of a semantic representation of the interaction, anchored on the sketched diagram, to provide feedback that overcomes some of the intrinsic ambiguities of pointing gestures.
Baillie, Lynne and Schatz, Raimund (2005): Exploring multimodality in the laboratory and the field. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 100-107. Available online
Mobile applications pose new design and evaluation challenges for researchers: they give users access to powerful computing devices through small interfaces that typically have limited input facilities. One way of overcoming these shortcomings is to exploit the possibilities of multimodality. We report in this paper how we designed, developed, and evaluated a multimodal mobile application through a combination of laboratory and field studies. This is, as far as we know, the first time a multimodal application has been developed in such a way. We did this in order to understand more about where and when users envision using different modes of interaction, and what problems they may encounter when using an application in context.
Prendinger, Helmut, Ma, Chunling, Yingzi, Jin, Nakasone, Arturo and Ishizuka, Mitsuru (2005): Understanding the effect of life-like interface agents through users' eye movements. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 108-115. Available online
We motivate an approach to evaluating the utility of life-like interface agents that is based on human eye movements rather than questionnaires. An eye tracker is employed to obtain quantitative evidence of a user's focus of attention. The salient feature of our evaluation strategy is that it allows us to measure important properties of a user's interaction experience on a moment-by-moment basis in addition to a cumulative (spatial) analysis of the user's areas of interest. We describe an empirical study in which we compare attending behavior of subjects watching the presentation of an apartment by three types of media: an animated agent, a text box, and speech only. The investigation of users' eye movements reveals that agent behavior may trigger natural and social interaction behavior of human users.
Spakov, Oleg and Miniotas, Darius (2005): Gaze-based selection of standard-size menu items. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 124-128. Available online
With recent advances in eye tracking technology, eye gaze is gradually gaining acceptance as a pointing modality. Its relatively low accuracy, however, necessitates enlarged controls in eye-based interfaces, making their design rather peculiar. Another factor impairing pointing performance is the limited robustness of an eye tracker's calibration. To facilitate pointing at standard-size menus, we developed a technique that uses dynamic target expansion for on-line correction of the eye tracker's calibration. Correction is based on the relative change in the gaze point location upon the expansion. A user study suggests that the technique affords a dramatic six-fold improvement in selection accuracy, traded off against a much smaller reduction in performance speed (39%). The technique is thus believed to contribute to the development of universal-access solutions supporting navigation through standard menus by eye gaze alone.
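The correction scheme lends itself to a short sketch. The following Python fragment is a hypothetical illustration of the general idea, not the authors' implementation: the gaze shift observed when the fixated menu item expands in place is treated as an estimate of the current calibration drift and folded into a running offset.

```python
def update_offset(gaze_before, gaze_after, offset, alpha=0.5):
    """Update the calibration offset from the gaze shift observed when
    the fixated menu item expands in place.

    With perfect calibration the reported gaze point would not move on
    expansion, so any relative change approximates the current drift.
    alpha damps the correction against measurement noise (alpha is an
    assumed smoothing parameter, not a value from the paper).
    """
    dx = gaze_after[0] - gaze_before[0]
    dy = gaze_after[1] - gaze_before[1]
    return (offset[0] + alpha * dx, offset[1] + alpha * dy)

def corrected_gaze(raw, offset):
    """Apply the running offset to a raw gaze sample."""
    return (raw[0] - offset[0], raw[1] - offset[1])
```

Because the correction is recomputed at every expansion, the offset tracks slow calibration drift over the course of a session rather than relying on a single initial calibration.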
Ukita, Norimichi, Ono, Tomohisa and Kidode, Masatsugu (2005): Region extraction of a gaze object using the gaze point and view image sequences. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 129-136. Available online
Analysis of the human gaze is a basic way to investigate human attention. Similarly, the view image of a human being includes the visual information of what he/she pays attention to. This paper proposes an interface system for extracting the region of an object viewed by a human from a view image sequence by analyzing the history of gaze points. All the gaze points, each of which is recorded as a 2D point in a view image, are transferred to an image in which the object region is extracted. These points are then divided into several groups based on their colors and positions. The gaze points in each group compose an initial region. After all the regions are extended, outlier regions are removed by comparing the colors and optical flows in the extended regions. All the remaining regions are merged into one in order to compose a gaze region.
Ishiguro, Hiroshi (2005): Interactive humanoids and androids as ideal interfaces for humans. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. p. 137. Available online
Many robotics researchers are exploring new possibilities for intelligent robots in our everyday life. Humanoids and androids, which have various modalities, can communicate with humans as new information media. In this talk we discuss how to develop interactive robots and how to evaluate them, introducing several robots developed at the ATR Intelligent Robotics and Communications Laboratories and the Department of Adaptive Machine Systems, Osaka University. In particular, we focus on a constructive approach to developing interactive robots, cognitive studies that use humanoids and androids to evaluate the interactions, and long-term field experiments in an elementary school. The talk consists of two parts. There are two relationships between robots and humans: one is inter-personal and the other is social. In inter-personal relationships, the appearance of the robot is a new and important research issue. In social relationships, next-generation robots need the ability to recognize human relationships through interactions. These two issues open new possibilities for robots, and among them the appearance problem bridges science and engineering. In the development of humanoids, both the appearance and the behavior of the robots are significant issues; however, designing a robot's appearance, especially a humanoid one, has traditionally been the role of the industrial designer. To tackle the problem of appearance and behavior, two approaches are necessary: one from robotics and the other from cognitive science. The approach from robotics tries to build very human-like robots based on knowledge from cognitive science; the approach from cognitive science uses the robot to verify hypotheses about human understanding. We call this cross-interdisciplinary framework android science (www.androidscience.com). The speaker hopes that attendees will catch new waves in robotics and media research and in our future life.
Gorniak, Peter and Roy, Deb (2005): Probabilistic grounding of situated speech using plan recognition and reference resolution. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 138-143. Available online
Situated, spontaneous speech may be ambiguous along acoustic, lexical, grammatical and semantic dimensions. To understand such a seemingly difficult signal, we propose to model the ambiguity inherent in acoustic signals and in lexical and grammatical choices using compact, probabilistic representations of multiple hypotheses. To resolve semantic ambiguities we propose a situation model that captures aspects of the physical context of an utterance as well as the speaker's intentions, in our case represented by recognized plans. In a single, coherent Framework for Understanding Situated Speech (FUSS) we show how these two influences, acting on an ambiguous representation of the speech signal, complement each other to disambiguate form and content of situated speech. This method produces promising results in a game playing environment and leaves room for other types of situation models.
Senior, Robin and Vertegaal, Roel (2005): Augmenting conversational dialogue by means of latent semantic googling. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 144-150. Available online
This paper presents Latent Semantic Googling, a variant of Landauer's Latent Semantic Indexing that uses the Google search engine to judge the semantic closeness of sets of words and phrases. This concept is implemented via Ambient Google, a system for augmenting conversations through the classification of discussed topics. Ambient Google uses a speech recognition engine to generate Google keyphrase queries directly from conversations. These queries are used to analyze the semantics of the conversation, and infer related topics that have been discussed. Conversations are visualized using a spring-model algorithm representing common topics. This allows users to browse their conversation as a contextual relationship between discussed topics, and augment their discussion through the use of related websites discovered by Google. An evaluation of Ambient Google is presented, discussing user reaction to the system.
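A related, well-known way to turn search-engine hit counts into a semantic measure, shown here purely for illustration, is the normalized Google distance of Cilibrasi and Vitányi. Latent Semantic Googling differs in its details, but the underlying intuition is the same: terms that co-occur on many pages are semantically close.

```python
import math

def normalized_web_distance(fx, fy, fxy, n):
    """Semantic distance from page counts: fx and fy are hit counts for
    the individual terms, fxy the count for the conjunctive query, and
    n the (assumed) number of pages indexed.  A value of 0 means the
    terms always co-occur; larger values mean they are less related."""
    lx, ly, lxy, ln = (math.log(v) for v in (fx, fy, fxy, n))
    return (max(lx, ly) - lxy) / (ln - min(lx, ly))

# Terms that always appear together are at distance 0; terms that
# rarely co-occur are pushed toward (and beyond) 1.
print(normalized_web_distance(1000, 1000, 1000, 10**9))  # → 0.0
```

In a conversation-augmentation setting like Ambient Google's, a measure of this kind could rank candidate related topics by their closeness to the keyphrases extracted from the speech stream.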
Li, Shuyin, Haasch, Axel, Wrede, Britta, Fritsch, Jannik and Sagerer, Gerhard (2005): Human-style interaction with a robot for cooperative learning of scene objects. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 151-158. Available online
In research on human-robot interaction the interest is currently shifting from uni-modal dialog systems to multi-modal interaction schemes. We present a system for human-style interaction with a robot that is integrated on our mobile robot BIRON. To model the dialog we adopt an extended grounding concept with a mechanism to handle multimodal input and output, where object references are resolved through interaction with an object attention system (OAS). The OAS integrates input from multiple sources, e.g., the object and gesture recognition systems, and provides the information for a common representation. This representation can be accessed by both modules and combines symbolic verbal attributes with sensor-based features. We argue that such a representation is necessary to achieve robust and efficient information processing.
Experience shows that decisions made in the early phases of developing a multimodal system prevail throughout the project's life-cycle. The distributed architecture and the requirement for robust multimodal interaction in our project SmartWeb led to an approach that uses and extends W3C standards such as EMMA and RDFS. These standards for the interface structure and content allowed us to integrate available tools and techniques. However, the requirements of our system called for various extensions, e.g., introducing result-feedback tags in an extended version of EMMA. The interconnection framework relies on a commercial telephone voice dialog system platform for the dialog-centric components, while the information access processes are linked using web service technology. In this underlying infrastructure, too, enhancements and extensions were necessary. The first demonstration system is now operable and will be presented at the Football World Cup 2006 in Germany.
Lunsford, Rebecca, Oviatt, Sharon and Coulston, Rachel (2005): Audio-visual cues distinguishing self- from system-directed speech in younger and older adults. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 167-174. Available online
In spite of interest in developing robust open-microphone engagement techniques for mobile use and natural field contexts, no reliable techniques are currently available. One problem is the lack of empirically grounded models as guidance for distinguishing how users' audio-visual activity actually differs systematically when addressing a computer versus a human partner. In particular, existing techniques have not been designed to handle high levels of user self talk as a source of "noise," and they typically assume that a user is addressing the system only when facing it while speaking. In the present research, data were collected during two related studies in which adults aged 18-89 interacted multimodally using speech and pen with a simulated map system. Results revealed that people engaged in self talk prior to addressing the system over 30% of the time, with no decrease in younger adults' rate of self talk compared with elders. Speakers' amplitude was lower during 96% of their self talk, with a substantial 26 dBr amplitude separation observed between self- and system-directed speech. The magnitude of speakers' amplitude separation ranged from approximately 10-60 dBr and diminished with age, with 79% of the variance predictable simply by knowing a person's age. In contrast to the clear differentiation of intended addressee revealed by amplitude separation, gaze at the system was not a reliable indicator of speech directed to the system, with users looking at the system over 98% of the time during both self- and system-directed speech. Results of this research have implications for the design of more effective open-microphone engagement for mobile and pervasive systems.
Turnhout, Koen van, Terken, Jacques, Bakx, Ilse and Eggen, Berry (2005): Identifying the intended addressee in mixed human-human and human-computer interaction from non-verbal features. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 175-182. Available online
Against the background of developments in the area of speech-based and multimodal interfaces, we present research on determining the addressee of an utterance in the context of mixed human-human and multimodal human-computer interaction. Working with data taken from realistic scenarios, we explore several features with respect to their relevance to the question of who is the addressee of an utterance: the eye gaze of both speaker and listener, dialogue history, and utterance length. With respect to eye gaze, we inspect the detailed timing of shifts in gaze between different communication partners (human or computer). We show that these features yield an improved classification of utterances in terms of addressee-hood relative to a simple classification algorithm that assumes that "the addressee is where the eye is", and we compare our results to alternative approaches.
Morency, Louis-Philippe, Sidner, Candace, Lee, Christopher and Darrell, Trevor (2005): Contextual recognition of head gestures. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 18-24. Available online
Head pose and gesture offer several key conversational grounding cues and are used extensively in face-to-face interaction among people. We investigate how dialog context from an embodied conversational agent (ECA) can improve visual recognition of user gestures. We present a recognition framework which (1) extracts contextual features from an ECA's dialog manager, (2) computes a prediction of head nods and head shakes, and (3) integrates the contextual predictions with the visual observations of a vision-based head gesture recognizer. We found a subset of lexical, punctuation and timing features that are easily available in most ECA architectures and can be used to learn how to predict user feedback. Using a discriminative approach to contextual prediction and multi-modal integration, we were able to improve the performance of head gesture detection even when the topic of the test set was significantly different from that of the training set.
Gatica-Perez, Daniel, Lathoud, Guillaume, Odobez, Jean-Marc and McCowan, Iain (2005): Multimodal multispeaker probabilistic tracking in meetings. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 183-190. Available online
Tracking speakers in multiparty conversations constitutes a fundamental task for automatic meeting analysis. In this paper, we present a probabilistic approach to jointly track the location and speaking activity of multiple speakers in a multisensor meeting room, equipped with a small microphone array and multiple uncalibrated cameras. Our framework is based on a mixed-state dynamic graphical model defined on a multiperson state-space, which includes the explicit definition of a proximity-based interaction model. The model integrates audio-visual (AV) data through a novel observation model. Audio observations are derived from a source localization algorithm. Visual observations are based on models of the shape and spatial structure of human heads. Approximate inference in our model, needed given its complexity, is performed with a Markov Chain Monte Carlo particle filter (MCMC-PF), which results in high sampling efficiency. We present results, based on an objective evaluation procedure, that show that our framework (1) is capable of locating and tracking the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy; (2) can deal with cases of visual clutter and partial occlusion; and (3) significantly outperforms a traditional sampling-based approach.
Otsuka, Kazuhiro, Takemae, Yoshinao and Yamato, Junji (2005): A probabilistic inference of multiparty-conversation structure based on Markov-switching models of gaze patterns, head directions, and utterances. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 191-198. Available online
A novel probabilistic framework is proposed for inferring the structure of conversation in face-to-face multiparty communication, based on gaze patterns, head directions and the presence/absence of utterances. As the structure of conversation, this study focuses on the combination of participants and their participation roles. First, we assess the gaze patterns that frequently appear in conversations, and define typical types of conversation structure, called conversational regime, and hypothesize that the regime represents the high-level process that governs how people interact during conversations. Next, assuming that the regime changes over time exhibit Markov properties, we propose a probabilistic conversation model based on Markov-switching; the regime controls the dynamics of utterances and gaze patterns, which stochastically yield measurable head-direction changes. Furthermore, a Gibbs sampler is used to realize the Bayesian estimation of regime, gaze pattern, and model parameters from observed head directions and utterances. Experiments on four-person conversations confirm the effectiveness of the framework in identifying conversation structures.
Pentland, Alex (Sandy) (2005): Socially aware computation and communication. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. p. 199. Available online
By building machines that understand social signaling and social context, we can dramatically improve collective decision making and help keep remote users 'in the loop.' I will describe three systems that have a substantial understanding of social context, and use this understanding to improve human group performance. The first system is able to interpret social displays of interest and attraction, and uses this information to improve conferences and meetings. The second is able to infer friendship, acquaintance, and workgroup relationships, and uses this to help people build social capital. The third is able to examine human interactions and categorize participants' attitudes (attentive, agreeable, determined, interested, etc.), and uses this information to proactively promote group cohesion and to match participants on the basis of their compatibility.
Lee, Bee-Wah and Yeo, Alvin W. (2005): Integrating sketch and speech inputs using spatial information. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 2-9. Available online
Since the advent of multimodal spatial queries, determining the correct pairing of multimodal inputs has remained a problem in multimodal fusion. Integration techniques have been proposed to resolve this problem, but they are limited to interaction with predefined speech and sketch commands, and they are designed to resolve only spatial queries with a single speech input and a single sketch input. When multiple speech and sketch inputs occur in a single query, none of the existing integration techniques can resolve it: to date, no integration technique has been found that can resolve the Multiple Sentences and Sketch Objects Spatial Query. In this paper, the limitations of the existing integration techniques are discussed, and a new integration technique for resolving this problem is described and compared with the widely used Unification-based Integration Technique.
Not, Elena, Balci, Koray, Pianesi, Fabio and Zancanaro, Massimo (2005): Synthetic characters as multichannel interfaces. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 200-207. Available online
Synthetic characters are an effective modality for conveying messages to the user, providing visual feedback about the system's internal understanding of the communication, and engaging the user in the dialogue through emotional involvement. In this paper we argue for a fine-grained distinction of the expressive capabilities of synthetic agents: avatars should not be considered an indivisible modality but rather the synergistic contribution of different communication channels that, properly synchronized, generate an overall communication performance. In this view, we propose SMIL-AGENT as a representation and scripting language for synthetic characters, which abstracts away from the specific implementation and context of use of the character. SMIL-AGENT has been defined starting from the SMIL 0.1 standard specification and aims at providing a high-level standardized language for presentations by different synthetic agents within diverse communication and application contexts.
Balci, Koray (2005): XfaceEd: authoring tool for embodied conversational agents. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 208-213. Available online
In this paper we present XfaceEd, our open-source, platform-independent tool for authoring 3D embodied conversational agents (ECAs). Following the MPEG-4 Facial Animation (FA) standard, XfaceEd provides an easy-to-use interface for generating MPEG-4-ready ECAs from static 3D models. Users can set MPEG-4 Facial Definition Points (FDP) and Facial Animation Parameter Units (FAPU), define the zone of influence of each feature point, and specify how this influence is propagated among the neighboring vertices. As an alternative to MPEG-4, one can also specify morph targets for different categories such as visemes, emotions and expressions, in order to achieve facial animation using the keyframe interpolation technique. Morph targets from different categories are blended to create more lifelike behaviour. Results can be previewed and parameters can be tweaked in real time within the application for fine tuning. Changes take effect immediately, which in turn ensures rapid production. The final output is a configuration file in XML format that can be interpreted by XfacePlayer or other applications for easy authoring of embodied conversational agents for multimodal environments.
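Morph-target blending of the kind described above can be sketched compactly. The following Python fragment is a generic illustration of the standard technique, not XfaceEd's code: each vertex of the output mesh is the neutral vertex displaced by the weighted sum of the targets' offsets from the neutral mesh.

```python
def blend_morph_targets(base, targets, weights):
    """Blend morph targets over a mesh given as a list of (x, y, z)
    vertices.  Each target is a full mesh with the same vertex order;
    its contribution is its per-vertex offset from the base mesh,
    scaled by the corresponding weight."""
    blended = []
    for i, (bx, by, bz) in enumerate(base):
        x, y, z = bx, by, bz
        for target, w in zip(targets, weights):
            tx, ty, tz = target[i]
            x += w * (tx - bx)
            y += w * (ty - by)
            z += w * (tz - bz)
        blended.append((x, y, z))
    return blended
```

With one target per viseme or expression, animating the face reduces to interpolating the weight vector between keyframes, which is what makes blending across categories (e.g. a viseme plus an emotion) straightforward.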
Battocchi, Alberto, Pianesi, Fabio and Goren-Bar, Dina (2005): A first evaluation study of a database of kinetic facial expressions (DaFEx). In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 214-221. Available online
In this paper we present DaFEx (Database of Facial Expressions), a database created with the purpose of providing a benchmark for the evaluation of the facial expressivity of Embodied Conversational Agents (ECAs). DaFEx consists of 1008 short videos containing emotional facial expressions of the 6 Ekman's emotions plus the neutral expression. The facial expressions were recorded by 8 professional actors (male and female) in two acting conditions ("utterance" and "no-utterance") and at 3 intensity levels (high, medium, low). The properties of DaFEx were studied by having 80 subjects classify the emotion expressed in the videos. High rates of accuracy were obtained for most of the emotions displayed. We also tested the effect of the intensity level, of the articulatory movements due to speech, and of the actors' and subjects' gender, on classification accuracy. The results showed that decoding accuracy decreases with the intensity of emotions; that the presence of articulatory movements negatively affects the recognition of fear, surprise and of the neutral expression, while it improves the recognition of anger; and that facial expressions seem to be recognized (slightly) better when acted by actresses than by actors.
Yohanan, Steve, Chan, Mavis, Hopkins, Jeremy, Sun, Haibo and MacLean, Karon (2005): Hapticat: exploration of affective touch. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 222-229. Available online
This paper describes the Hapticat, a device we developed to study affect through touch. Though intentionally not highly zoomorphic, the device borrows behaviors from pets and the rich manner in which they haptically communicate with humans. The Hapticat has four degrees of freedom to express itself: a pair of ear-like appendages, a breathing mechanism, a purring mechanism, and a warming element. Combinations of levels for these controls are used to define the five active haptic responses: playing dead, asleep, content, happy, and upset. In the paper we present the design considerations and implementation details of the device. We also detail a preliminary observational study in which participants interacted with the Hapticat through touch. To compare the effects of haptic feedback, the device presented either active haptic renderings or none at all. Participants reported which of the five responses they believed the Hapticat rendered, as well as their degree of affect toward the device. We observed that participants' expectations of the device's response to various haptic stimuli correlated with our mappings. We also observed that participants were able to reasonably recognize three of the five response renderings, while having difficulty discriminating between the happy and content states. Finally, we found that participants registered a broader range of affect when active haptic renderings were applied than when none were presented.
Giraudo, Umberto and Bordegoni, Monica (2005): Using observations of real designers at work to inform the development of a novel haptic modeling system. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 230-235. Available online
Gestures, alongside speech, are among the means of expression most used by humans. In the product design field, designers have multiple ways of communicating their ideas and concepts. One of them is the model-making activity, in which designers make their concepts explicit by using appropriate tools and specific hand movements on plastic material with the intent of obtaining a shape. Some studies have demonstrated that visual, tactile and kinesthetic feedback are equally important in the shape creation and evaluation process. The European project "Touch and Design" (T'nD) (www.kaemart.it/touch-and-design) proposes the implementation of an innovative virtual clay modeling system based on a novel haptic interaction modality oriented to industrial designers. In order to develop an intuitive and easy-to-use system, a study of designers' hand modeling activities has been carried out by the project's industrial partners, supported by cognitive psychologists. The users' manual operations and tools have been translated into corresponding haptic tools and multimodal interaction modalities in the virtual free-form shape modeling system. The paper presents the project research activities and the results achieved so far.
Ziat, Mounia, Gapenne, Olivier, Stewart, John and Lenay, Charles (2005): A comparison of two methods of scaling on form perception via a haptic interface. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 236-243. Available online
In this fundamental study, we compare two scaling methods, focusing on the strategies of subjects using a sensory substitution device. Method 1 consists of a reduction of the sensor size and its displacement speed; here, the speed reduction is obtained by a "human" movement reduction (hand speed reduction). Method 2 consists of a classical increase of the image dimension. The experimental device couples a pen on a graphics tablet with tactile stimulators, which are activated when the sensor crosses the figure on the computer screen. This virtual sensor (a square matrix composed of 16 elementary fields) is displaced when the pen, guided by the hand's movements, moves on the graphics tablet. Although it might seem that there is no difference between the two methods, the results show that the recognition rate is closely dependent on the figure size, and that the strategies used by the subjects are better suited to method 2 than to method 1. In fact, half of the subjects found that method 1 inhibited their movements, and the majority of them did not feel the scaling effect, whereas it was clearly felt in method 2.
Allen, Meghan, Gluck, Jennifer, MacLean, Karon and Tang, Erwin (2005): An initial usability assessment for symbolic haptic rendering of music parameters. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 244-251. Available online
Current methods of playlist creation and maintenance do not support user needs, especially in a mobile context. Furthermore, they do not scale: studies show that users with large mp3 collections have abandoned the concept of playlists. To remedy the usability problems associated with playlist creation and navigation -- in particular, reliance on visual feedback and the absence of rapid content scanning mechanisms -- we propose a system that utilizes the haptic channel. A necessary first step toward this objective is the creation of a haptic mapping for music. In this paper, we describe an exploratory study aimed at understanding the feasibility, with respect to learnability and usability, of efficient, eyes-free playlist navigation based on symbolic haptic renderings of key song parameters. Users were able to learn haptic mappings for music parameters to usable accuracy with 4 minutes of training. These results indicate promise for the approach and support continued effort in both improving the rendering scheme and implementing the haptic playlist system.
Hanheide, M., Bauckhage, C. and Sagerer, G. (2005): Combining environmental cues & head gestures to interact with wearable devices. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 25-31. Available online
As wearable sensors and computing hardware become a reality, new and unorthodox approaches to seamless human-computer interaction can be explored. This paper presents the prototype of a wearable, head-mounted device for advanced human-machine interaction that integrates speech recognition and computer vision with head gesture analysis based on inertial sensor data. We focus on the innovative idea of integrating visual and inertial data processing for interaction. Fusing head gestures with results from visual analysis of the environment provides rich vocabularies for human-machine communication because it turns the environment into an interface: if objects or items in the surroundings are associated with system activities, head gestures can trigger commands when the corresponding object is being looked at. We explain the algorithmic approaches applied in our prototype and present experiments that highlight its potential for assistive technology. Apart from pointing out a new direction for seamless interaction in general, our approach provides a new and easy-to-use interface for disabled and paralyzed users in particular.
Qi, Wen and Martens, Jean-Bernard (2005): Tangible user interfaces for 3D clipping plane interaction with volumetric data: a case study. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 252-258. Available online
Visualization via direct volume rendering is a potentially very powerful technique for exploring and interacting with large amounts of scientific data. However, the available two-dimensional (2D) interfaces make three-dimensional (3D) manipulation of such data very difficult, and the resulting usability problems in turn discourage the widespread use of volume rendering as a scientific tool. In this paper, we present a more in-depth investigation into one specific interface aspect, i.e., the positioning of a clipping plane within volume-rendered data. More specifically, we propose three different interface prototypes that have been realized with the help of wireless vision-based tracking. These three prototypes combine aspects of 2D graphical user interfaces with 3D tangible interaction devices. They allow users to experience and compare different user interface strategies for performing the clipping plane interaction task, and they provide a basis for carrying out user evaluations in the near future.
A transformational approach for developing multimodal web user interfaces is presented that progressively moves from a task model and a domain model to a final user interface. This approach consists of three steps: deriving one or many abstract user interfaces from a task model and a domain model, deriving one or many concrete user interfaces from each abstract one, and producing the code of the corresponding final user interfaces. To carry out these steps, transformations are encoded as graph transformations performed on the involved models, expressed in their graph equivalents. For each step, a graph grammar gathers the relevant graph transformations for accomplishing its sub-steps. The final user interface is multimodal in that it involves graphical (keyboard, mouse) and vocal interaction. The approach is illustrated throughout the paper with a running example covering a graphical interface, a vocal interface, and two multimodal interfaces with graphical and vocal predominance, respectively.
Morita, Tomoyuki, Hirano, Yasushi, Sumi, Yasuyuki, Kajita, Shoji and Mase, Kenji (2005): A pattern mining method for interpretation of interaction. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 267-273. Available online
This paper proposes a novel mining method for multimodal interactions that extracts important patterns of group activities. These extracted patterns can be used as machine-readable event indices in developing an interaction corpus based on a huge collection of human interaction data captured by various sensors. Because they contain various pieces of context information, the event indices can be used, for example, to summarize a set of events or to search for particular events. The proposed method extracts simultaneously occurring patterns of primitive events in interaction, such as gaze and speech, that in combination occur more consistently than randomly. The method thus provides a statistically plausible definition of interaction events that is not possible through intuitive top-down definitions. We demonstrate the effectiveness of our method on data captured in an experimental poster-exhibition setting. Several interesting patterns are extracted by the method, and we examine their interpretations.
Chen, Fang, Choi, Eric, Epps, Julien, Lichman, Serge, Ruiz, Natalie, Shi, Yu, Taib, Ronnie and Wu, Mike (2005): A study of manual gesture-based selection for the PEMMI multimodal transport management interface. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 274-281. Available online
Operators of traffic control rooms are often required to quickly respond to critical incidents using a complex array of multiple keyboards, mice, very large screen monitors and other peripheral equipment. To support the aim of finding more natural interfaces for this challenging application, this paper presents PEMMI (Perceptually Effective Multimodal Interface), a transport management system control prototype taking video-based manual gesture and speech recognition as inputs. A specific theme within this research is determining the optimum strategy for gesture input in terms of both single-point input selection and suitable multimodal feedback for selection. It has been found that users tend to prefer larger selection areas for targets in gesture interfaces, and tend to select within 44% of this selection radius. The minimum effective size for targets when using 'device-free' gesture interfaces was found to be 80 pixels (on a 1280x1024 screen). This paper also shows that feedback on gesture input via large screens is enhanced by the use of both audio and visual cues to guide the user's multimodal input. Audio feedback in particular was found to improve user response time by an average of 20% over existing gesture selection strategies for multimodal tasks.
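The finding that users select within 44% of a target's nominal selection radius suggests a simple post-hoc analysis of logged selection endpoints. A minimal sketch of one way to estimate that figure (the function name and the percentile choice are our own illustration, not from the paper):

```python
import numpy as np

def effective_selection_fraction(selections, target, radius, percentile=95):
    """Estimate the fraction of a target's nominal selection radius
    actually used, from observed gesture selection endpoints.

    selections: (N, 2) array of selection coordinates in pixels
    target:     (2,) target centre
    radius:     nominal selection radius in pixels
    """
    distances = np.linalg.norm(np.asarray(selections, float)
                               - np.asarray(target, float), axis=1)
    # Radius containing `percentile` percent of the observed selections
    used = np.percentile(distances, percentile)
    return used / radius
```

Applied to per-target selection logs, the averaged result would correspond to the kind of fraction reported above.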
Zhang, Liang-Guo, Chen, Xilin, Wang, Chunli, Chen, Yiqiang and Gao, Wen (2005): Recognition of sign language subwords based on boosted hidden Markov models. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 282-287. Available online
Sign language recognition (SLR) plays an important role in human-computer interaction (HCI), especially for convenient communication between the deaf and hearing communities. How to enhance traditional hidden Markov model (HMM) based SLR is an important issue in the SLR community, and how to refine the boundaries of the classifiers to effectively characterize the spread of the training samples is another significant one. In this paper, a new classification framework applying an adaptive boosting (AdaBoost) strategy to the continuous HMM (CHMM) training procedure at the subword classification level for SLR is presented. The ensemble of multiple composite CHMMs for each subword, trained over the boosting iterations, concentrates more on the hard-to-classify samples and so generates a more complex decision boundary than that of a single HMM classifier. Experimental results on a vocabulary of frequently used Chinese sign language (CSL) subwords show that the proposed boosted CHMM outperforms the conventional CHMM for SLR.
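The AdaBoost reweighting at the heart of such a framework can be sketched independently of the HMM details: each boosting round treats the CHMM trained in that round as a black-box classifier, and sample weights are increased on misclassifications so the next round focuses on them. A minimal AdaBoost.M1-style sketch (our own illustration; the paper's composite CHMM training is more involved):

```python
import numpy as np

def adaboost_weights(predictions, labels, n_rounds):
    """AdaBoost.M1-style reweighting over pretrained per-round classifiers.

    predictions: list of arrays; predictions[t][i] is the label predicted
                 for sample i by the classifier trained in round t
    labels:      true labels
    Returns per-round classifier weights (alphas) and final sample weights.
    """
    labels = np.asarray(labels)
    n = len(labels)
    w = np.full(n, 1.0 / n)          # uniform initial sample weights
    alphas = []
    for t in range(n_rounds):
        miss = np.asarray(predictions[t]) != labels
        err = np.clip(np.sum(w[miss]), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)
        alphas.append(alpha)
        # Up-weight misclassified samples, down-weight the rest, renormalise
        w = w * np.exp(alpha * np.where(miss, 1.0, -1.0))
        w /= w.sum()
    return np.array(alphas), w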
Hernandez-Rebollar, Jose L. (2005): Gesture-driven American sign language phraselator. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 288-292. Available online
This paper describes a portable American Sign Language (ASL)-to-English phraselator. This wearable device is based on an Acceleglove, originally developed for recognizing the hand alphabet, and a two-link arm skeleton that detects hand location and movement with respect to the body. The phraselator is therefore able to recognize finger-spelled words as well as hand gestures and translate them into spoken output through a speech synthesizer. To speed up the recognition process, a simple prediction algorithm has been introduced so that the phraselator predicts words based on the current letter being input, or complete sentences based on the current sign being translated. The user selects the rest of the sentence (or word) by means of a predefined hand gesture, and the phraselator speaks out the sentence in English or Spanish. New words or phrases are automatically added to the lexicon for future predictions.
Hossain, Altab, Kurnia, Rahmadi, Nakamura, Akio and Kuno, Yoshinori (2005): Interactive vision to detect target objects for helper robots. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 293-300. Available online
Effective human-robot interaction is essential for the wide penetration of service robots into the market. Such robots need vision systems to recognize objects, but it is difficult to realize vision systems that can work in various conditions; more robust techniques of object recognition and image segmentation are essential. We have therefore proposed using the human user's assistance for object recognition through speech. The robot asks questions that the user can easily answer and whose answers can efficiently reduce the number of candidate objects, even if the scene contains occluded objects and/or objects composed of multicolor parts. The method considers the characteristics of the features used for object recognition, such as how easily humans can specify them in words, thus generating a user-friendly and efficient sequence of questions. Experimental results show that the robot can detect target objects by asking the questions generated by the method.
Baljko, Melanie (2005): The contrastive evaluation of unimodal and multimodal interfaces for voice output communication aids. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 301-308. Available online
For computational Augmentative and Alternative Communication (AAC) aids, it has often been asserted that multimodal interfaces have benefits over unimodal ones. Several such benefits have been described informally, but, to date, few have actually been formalized or quantified. In this paper, some of the special considerations of this application domain are described. Next, the hypothesized benefits of semantically nonredundant multimodal input actions over unimodal input actions are described formally. The notion of information rate, already well established as a dependent variable in evaluations of AAC devices, is quantified using the formalisms provided by Information Theory (as opposed to the idiosyncratic approaches that have been employed previously). A comparative analysis was performed between interfaces that afford unimodal input actions and those that afford semantically nonredundant multimodal input actions. This analysis permitted generalized conclusions, which have been synthesized with those of another, recently completed analysis in which unimodal and semantically redundant multimodal input actions were compared. A reinterpretation of Keates and Robinson's empirical data (1998) shows that their criticism of multimodal interfaces for AAC devices was, in part, unfounded.
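An information-theoretic information rate of the kind described above can be illustrated very simply: under the (strong) simplifying assumption that input actions are drawn independently from a known distribution, the rate is the Shannon entropy of the action set divided by the mean action duration. A toy sketch of our own, not the paper's actual formalization:

```python
import math

def information_rate(action_probs, mean_action_time):
    """Information rate in bits per second of an input-action set,
    as Shannon entropy per mean action duration (i.i.d. assumption).

    action_probs:     probabilities of the available input actions
    mean_action_time: mean duration of one action, in seconds
    """
    # Shannon entropy of the action distribution, in bits
    entropy = -sum(p * math.log2(p) for p in action_probs if p > 0)
    return entropy / mean_action_time
```

For example, four equiprobable actions taking 0.5 s each carry 2 bits per action, i.e. 4 bits/s; semantically nonredundant multimodal actions enlarge the effective action set and so raise the entropy term.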
Saarinen, Rami, Järvi, Janne, Raisamo, Roope and Salo, Jouni (2005): Agent-based architecture for implementing multimodal learning environments for visually impaired children. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 309-316. Available online
Visually impaired children are at a great disadvantage in modern society, since their ability to use modern computer technology is limited by inappropriate user interfaces. The aim of the work presented in this paper was to develop a multimodal software architecture and applications that support visually impaired children and enable them to interact on equal terms with sighted children in learning situations. The architecture is based on software agents and has specific support for visual, auditory and haptic interaction. It has been used successfully with different groups of 7-8-year-old and 12-year-old visually impaired children. In this paper we focus on the enabling software technology and the interaction techniques used to realize our goal.
McLachlan, Peter, Lowe, Karen, Saka, Chalapati Rao and MacLean, Karon (2005): Perceiving ordinal data haptically under workload. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 317-324. Available online
Visual information overload is a threat to the interpretation of displays presenting large data sets or complex application environments. To combat this problem, researchers have begun to explore how haptic feedback can be used as another means for information transmission. In this paper, we show that people can perceive and accurately process haptically rendered ordinal data while under cognitive workload. We evaluate three haptic models for rendering ordinal data with participants who were performing a taxing visual tracking task. The evaluation demonstrates that information rendered by these models is perceptually available even when users are visually busy. This preliminary research has promising implications for haptic augmentation of visual displays for information visualization.
Brdiczka, Oliver, Maisonnasse, Jérôme and Reignier, Patrick (2005): Automatic detection of interaction groups. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 32-36. Available online
This paper addresses the problem of detecting interaction groups in an intelligent environment. To understand human activity, we need to identify human actors and their interpersonal links. An interaction group can be seen as a basic entity within which individuals collaborate in order to achieve a common goal. In this regard, the dynamic change of interaction group configuration, i.e., the splitting and merging of interaction groups, can be seen as an indicator of new activities. Our approach takes the speech activity detection of individuals forming interaction groups as input. A classical HMM-based approach, learning a different HMM for each group configuration, did not produce promising results. We propose an approach for detecting interaction group configurations based on the assumption that conversational turn-taking is synchronized inside groups. The proposed detector is based on a single HMM constructed upon conversational hypotheses. The approach shows good results and thus confirms our conversational hypotheses.
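The decoding step behind such an HMM-based detector is standard Viterbi inference: given a sequence of observed speech-activity symbols, recover the most likely sequence of hidden group configurations. A minimal, generic Viterbi sketch (illustrative only; the paper's single HMM is built from conversational turn-taking hypotheses, not from these toy parameters):

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a discrete-observation HMM.

    obs:     sequence of observation indices
    start_p: (S,) initial state probabilities
    trans_p: (S, S) transition probabilities, trans_p[i, j] = P(j | i)
    emit_p:  (S, O) emission probabilities
    """
    start_p, trans_p, emit_p = map(np.asarray, (start_p, trans_p, emit_p))
    S, T = len(start_p), len(obs)
    logv = np.log(start_p) + np.log(emit_p[:, obs[0]])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = logv[:, None] + np.log(trans_p)   # scores[from, to]
        back[t] = np.argmax(scores, axis=0)        # best predecessor per state
        logv = scores[back[t], np.arange(S)] + np.log(emit_p[:, obs[t]])
    path = [int(np.argmax(logv))]                  # backtrack from best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With states standing for group configurations and observations for joint speech-activity patterns, a decoded state change marks a hypothesized split or merge.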
Tokunaga, Eiji, Kimura, Hiroaki, Kobayashi, Nobuyuki and Nakajima, Tatsuo (2005): Virtual tangible widgets: seamless universal interaction with personal sensing devices. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 325-332. Available online
Using a single personal device as a universal controller for diverse services is a promising approach to solving the problem of too many controllers in ubiquitous multimodal environments. However, current approaches to universal controllers cannot provide intuitive control because they are restricted to traditional mobile user interfaces such as small keys or small touch panels. We propose Vidgets, short for virtual tangible widgets, as an approach to selecting and controlling ubiquitous services with virtually implemented tangible user interfaces, based on a single sensing personal device equipped with a digital camera and several physical sensors. We classify the use of the universal controller into three stages: (a) searching for a service, (b) grasping the service, and (c) using the service. User studies with our prototype implementation indicate that the smooth transition between and integration of the three stages improve the overall interaction with our universal controller.
Xiong, Yingen and Quek, Francis (2005): Meeting room configuration and multiple camera calibration in meeting analysis. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 37-44. Available online
In video-based cross-modal analysis of planning meetings, the meeting events are recorded by multiple cameras distributed throughout the meeting room. Subjects' hand gestures, hand motion, head orientations, gaze targets and body poses are very important for meeting event analysis. In order to register everything to the same global coordinate system, build 3D models and obtain 3D data from the video, we need to create a proper meeting room configuration and calibrate all cameras to obtain their intrinsic and extrinsic parameters. However, calibrating multiple cameras distributed over the entire meeting room area is a challenging task, because it is impossible for all cameras in the meeting room to see a reference object at the same time, and wide field-of-view cameras suffer from radial distortion. In this paper, we propose a simple approach to creating a good meeting room configuration and calibrating the multiple cameras in the room. The approach includes several steps. First, we create stereo camera pairs according to the room configuration and the requirements of the targets, i.e., the participants of the meeting. Second, we apply Tsai's algorithm to calibrate each stereo camera pair and obtain its parameters in its own local coordinate system. Third, we use Vicon motion capture data to transform all the local coordinate systems of the stereo camera pairs into a global coordinate system for the meeting room; we thus obtain the positions, orientations and parameters of all cameras in the same global coordinate system, so that everything can be registered in it. Next, we perform a calibration error analysis for the current camera and meeting room configuration, obtaining the error distribution over the entire meeting room area. Finally, we improve the camera and meeting room configuration according to this error distribution. By repeating these steps, we obtain a good meeting room configuration and the parameters of all cameras for that configuration.
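The third step above, aligning each stereo pair's local frame to a global room frame from corresponding 3D points, is a classic absolute-orientation problem. A standard Kabsch/Procrustes sketch (our own illustration of the general technique; the paper uses Vicon motion-capture data to supply the correspondences):

```python
import numpy as np

def rigid_transform(local_pts, global_pts):
    """Least-squares rigid transform (R, t) such that R @ p + t maps
    points from a local frame onto corresponding global-frame points.

    local_pts, global_pts: (N, 3) arrays of corresponding 3D points
    """
    P = np.asarray(local_pts, float)
    Q = np.asarray(global_pts, float)
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Correct an improper solution (reflection) so det(R) = +1
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t
```

Composing each pair's (R, t) with its Tsai calibration yields all camera positions and orientations in the single global coordinate system.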
Iannizzotto, Giancarlo, Costanzo, Carlo, Rosa, Francesco La and Lanzafame, Pietro (2005): A multimodal perceptual user interface for video-surveillance environments. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 45-52. Available online
In this paper a perceptual user interface (PUI) for video-surveillance environments is introduced. The system provides a tool for a video-surveillance control room and exploits a novel multimodal user interaction paradigm based on hand gestures and perceptual user interfaces. The proposed system, being simple and intuitive, is expected to be useful in the control of large and dynamic environments. To illustrate our work, we introduce a proof-of-concept multimodal, bare-hand gesture-based application and discuss its implementation and the experimental results obtained.
Wang, Sy Bor and Demirdjian, David (2005): Inferring body pose using speech content. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 53-60. Available online
Untethered multimodal interfaces are more attractive than tethered ones because they are more natural and expressive for interaction. Such interfaces usually require robust vision-based body pose estimation and gesture recognition. In interfaces where a user interacts with a computer using speech and arm gestures, the user's spoken keywords can be recognized in conjunction with hypotheses of body poses. This co-occurrence can reduce the number of body pose hypotheses for the vision-based tracker. In this paper we show that incorporating speech-based body pose constraints can increase the robustness and accuracy of vision-based tracking systems. Next, we describe an approach for gesture recognition. We show how Linear Discriminant Analysis (LDA) can be employed to estimate 'good features' for use in a standard HMM-based gesture recognition system, and that applying our LDA scheme significantly reduces recognition errors compared with a standard HMM-based technique. We applied both techniques in a Virtual Home Desktop scenario. Experiments in which users controlled a desktop system using gestures and speech show that speech recognized in conjunction with body poses increased the accuracy of the vision-based tracking system.
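The idea of using LDA to derive 'good features' for a downstream HMM can be sketched in its simplest two-class form: project raw pose features onto the Fisher discriminant direction that maximizes between-class separation relative to within-class scatter. A minimal sketch of our own (the paper's scheme operates on multi-class gesture data and feeds the projections to HMMs):

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    """Fisher discriminant direction separating two feature classes.

    X0, X1: (N0, d) and (N1, d) arrays of feature vectors per class
    Returns a unit vector w; projecting samples onto w gives the
    one-dimensional feature that best separates the two classes.
    """
    X0, X1 = np.asarray(X0, float), np.asarray(X1, float)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter, regularised slightly for numerical stability
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    Sw = Sw + 1e-6 * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, m1 - m0)       # Sw^-1 (m1 - m0)
    return w / np.linalg.norm(w)
```

The projected scalars (or, in the multi-class case, the top discriminant components) then serve as compact observation vectors for the HMM recognizer.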
Nickel, Kai, Gehrig, Tobias, Stiefelhagen, Rainer and McDonough, John (2005): A joint particle filter for audio-visual speaker tracking. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 61-68. Available online
In this paper, we present a novel approach for tracking a lecturer during the course of his speech. We use features from multiple cameras and microphones, and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection and upper body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross correlation function. Computationally expensive features are evaluated only at the particles' projected positions in the respective camera images, thus the complexity of the proposed algorithm is low. We evaluated the system on data that was recorded during actual lectures. The results of our experiments were 36 cm average error for video only tracking, 46 cm for audio only, and 31 cm for the combined audio-video system.
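The generalized cross correlation used for the audio features above is most often applied with PHAT (phase transform) weighting, which whitens the cross spectrum so that the correlation peak depends only on phase, i.e. on the time delay. A standard sketch of the technique (variable names are ours; the abstract does not specify the weighting used):

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Time delay of arrival of `sig` relative to `ref`, in seconds,
    via PHAT-weighted generalized cross correlation.

    sig, ref: microphone signals (1-D arrays)
    fs:       sampling rate in Hz
    max_tau:  optional bound on the physically possible delay
    """
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.maximum(np.abs(R), 1e-15)        # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    # Re-centre the correlation so negative lags precede positive ones
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

Each microphone pair contributes one such delay estimate, which the particle filter then scores against the delays predicted by a 3D location hypothesis.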
We present the Connector, a context-aware service that intelligently connects people. It maintains an awareness of its users' activities, preoccupations and social relationships in order to mediate an appropriate connection between them at the right time. In addition to providing users with important contextual cues about the availability of potential callees, the Connector adapts the behavior of the contactee's device automatically in order to avoid inappropriate interruptions. To acquire relevant context information, perceptual components analyze sensor input obtained from a smart mobile phone and -- if available -- from a variety of audio-visual sensors built into a smart meeting room environment. The Connector also uses any available multimodal interface in the smart meeting room (e.g. a speech interface to the smart phone, a steerable camera-projector, or targeted loudspeakers) to deliver information to users in the most unobtrusive way possible.
Latoschik, Marc Erich (2005): A user interface framework for multimodal VR interactions. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 76-83. Available online
This article presents a User Interface (UI) framework for multimodal interactions targeted at immersive virtual environments. Its configurable input and gesture processing components provide an advanced behavior graph capable of routing continuous data streams asynchronously. The framework introduces a Knowledge Representation Layer which augments objects of the simulated environment with Semantic Entities as a central object model that bridges and interfaces Virtual Reality (VR) and Artificial Intelligence (AI) representations. Specialized node types use these facilities to implement required processing tasks like gesture detection, preprocessing of the visual scene for multimodal integration, or translation of movements into multimodally initialized gestural interactions. A modified Augmented Transition Network (ATN) approach accesses the knowledge layer as well as the preprocessing components to integrate linguistic, gestural, and context information in parallel. The overall framework emphasizes extensibility, adaptivity and reusability, e.g., by utilizing persistent and interchangeable XML-based formats to describe its processing stages.
Rousseau, Cyril, Bellik, Yacine and Vernier, Frederic (2005): Multimodal output specification / simulation platform. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 84-91. Available online
The design of an output multimodal system is a complex task, due to the richness of today's interaction contexts. The diversity of environments, systems and user profiles requires a new generation of software tools to specify complete and valid output interactions. In this paper, we present a multimodal output specification and simulation platform. After introducing the design process that inspired this platform, we describe its two main tools, which respectively allow the specification and the simulation of a multimodal system's outputs. Finally, an application of the platform is illustrated through the design of the outputs of a mobile phone application.
Berti, Silvia and Paterno, Fabio (2005): Migratory MultiModal interfaces in MultiDevice environments. In: Proceedings of the 2005 International Conference on Multimodal Interfaces 2005. pp. 92-99. Available online
This paper describes an environment able to support migratory multimodal interfaces in multidevice environments. We introduce the software architecture and the device-independent languages used by our tool, which provides services enabling users to freely move about, change device and continue the current task from the point where they left off in the previous device. Our environment currently supports interaction with applications through graphical and vocal modalities, either separately or together. Such applications are implemented in Web-based languages. We discuss how the features of the device at hand, desktop or mobile, are considered when generating the multimodal user interface.