Proceedings of the 2002 International Conference on Multimodal Interfaces
The International Conference on Multimodal Interfaces (ICMI) is an annual ACM-sponsored conference that promotes research in next-generation perceptive, adaptive and multimodal user interfaces. These new interfaces are especially well suited for interpreting natural communication and activity patterns in real-world environments.
The following articles are from "Proceedings of the 2002 International Conference on Multimodal Interfaces":
Roy, Deb (2002): Towards Visually-Grounded Spoken Language Acquisition. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 105. Available online
A characteristic shared by most approaches to natural language understanding and generation is the use of symbolic representations of word and sentence meanings. Frames and semantic nets are examples of symbolic representations. Symbolic methods are inappropriate for applications which require natural language semantics to be linked to perception, as is the case in tasks such as scene description or human-robot interaction. This paper presents two implemented systems, one that learns to generate, and one that learns to understand visually-grounded spoken language. These implementations are part of our ongoing effort to develop a comprehensive model of perceptually-grounded semantics.
Möhler, Gregor (2002): Modeling Output in the EMBASSI Multimodal Dialog System. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 111. Available online
In this paper we present the concept for an abstract modeling of output render components. We illustrate how this categorization serves to seamlessly integrate previously unknown output multimodalities coherently into the multimodal presentations of the EMBASSI dialog system. We present a case study and conclude with an overview of related work.
Ibrahim, Aseel and Johansson, Pontus (2002): Multimodal Dialogue Systems for Interactive TV Applications. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 117. Available online
Many studies have shown the advantages of building multimodal systems, but not in the context of interactive TV applications. This paper reports on a qualitative study of a multimodal program guide for interactive TV. The system was designed by adding speech interaction to an already existing TV program guide. Study results indicate that spoken natural language input combined with visual output is preferable for TV applications. Furthermore, the user feedback requires a clear distinction between the dialogue system's domain result and system status in the visual output. Consequently, we propose an interaction model that consists of three entities: user, domain results, and system feedback.
Sidner, Candace L. and Dzikovska, Myroslava (2002): Human-Robot Interaction: Engagement between Humans and Robots for Hosting Activities. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 123. Available online
To participate in conversations with people, robots must not only see and talk with people but make use of the conventions of conversation and of how to be connected to their human counterparts. This paper reports on research on engagement in human-human interaction and applications to (non-autonomous) robots interacting with humans in hosting activities.
Mostow, Jack, Beck, Joseph, Chalasani, Raghu, Cuneo, Andrew and Jia, Peng (2002): Viewing and Analyzing Multimodal Human-computer Tutorial Dialogue: A Database Approach. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 129. Available online
It is easier to record logs of multimodal human-computer tutorial dialogue than to make sense of them. In the 2000-2001 school year, we logged the interactions of approximately 400 students who used Project LISTEN's Reading Tutor and who read aloud over 2.4 million words. This paper discusses some difficulties we encountered converting the logs into a more easily understandable database. It is faster to write SQL queries to answer research questions than to analyze complex log files each time. The database also permits us to construct a viewer to examine individual Reading Tutor-student interactions. This combination of queries and viewable data has turned out to be very powerful, and we discuss how we have combined them to answer research questions.
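The point about answering research questions with SQL rather than re-parsing logs can be illustrated with a minimal sketch. The table name `utterance` and its columns are hypothetical and are not taken from the Reading Tutor database; the example only assumes that parsed log events have already been loaded as rows.

```python
import sqlite3


def words_read_per_student(rows):
    """Total words read per student, computed with one SQL aggregate.

    `rows` is an iterable of (student_id, session_id, words_read) tuples.
    The schema is a hypothetical stand-in for a parsed tutor log.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE utterance ("
        "student_id TEXT, session_id INTEGER, words_read INTEGER)"
    )
    conn.executemany("INSERT INTO utterance VALUES (?, ?, ?)", rows)
    # The research question becomes a single declarative query.
    cursor = conn.execute(
        "SELECT student_id, SUM(words_read) "
        "FROM utterance GROUP BY student_id ORDER BY student_id"
    )
    result = cursor.fetchall()
    conn.close()
    return result
```

Once the logs are in tables, a new question usually means a new query rather than a new log parser.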
Flanagan, James (2002): Adaptive Dialog Based upon Multimodal Language Acquisition. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 135. Available online
Communicating by voice with speech-enabled computer applications based on preprogrammed rule grammars suffers from constrained vocabulary and sentence structures. Deviations from the allowed language result in an unrecognized utterance that will not be understood and processed by the system. One way to alleviate this restriction consists in allowing the user to expand the computer's recognized and understood language by teaching the computer system new language knowledge. We present an adaptive dialog system capable of learning from users new words, phrases and sentences, and their corresponding meanings. User input incorporates multiple modalities, including speaking, typing, pointing, drawing, and image capturing. The allowed language can thus be expanded in real time by users according to their preferences. By acquiring new language knowledge the system becomes more capable in specific tasks, although its language is still constrained.
Holzapfel, Hartwig and Fuegen, Christian (2002): Integrating Emotional Cues into a Framework for Dialogue Management. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 141. Available online
Emotions are very important in human-human communication but are usually ignored in human-computer interaction. Recent work focuses on recognition and generation of emotions as well as emotion driven behavior. Our work focuses on the use of emotions in dialogue systems that can be used with speech input as well as in multi-modal environments. This paper describes a framework for using emotional cues in a dialogue system and their informational characterization. We describe emotion models that can be integrated into the dialogue system and can be used in different domains and tasks. Our application of the dialogue system is planned to model multi-modal human-computer-interaction with a humanoid robotic system.
Li, Haifeng (2002): Data Driven Design of an ANN/HMM System for On-line Unconstrained Handwritten Character Recognition. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 149. Available online
This paper is dedicated to a data driven design method for a hybrid ANN/HMM based handwriting recognition system. On the one hand, a data driven neural modelling of handwriting primitives is proposed. ANNs are first used as state models in an HMM primitive divider that associates each signal frame with an ANN by minimizing the accumulated prediction error. Then, the neural modelling is realized by training each network on its own frame set. Organizing these two steps in an EM algorithm, precise primitive models are obtained. On the other hand, a data driven systematic method is proposed for the HMM topology inference task. All possible prototypes of a pattern class are first merged into several clusters by a Tabu search aided clustering algorithm. Then a multiple parallel-path HMM is constructed for the pattern class. Experiments show an 8% recognition improvement with a saving of 50% of system resources, compared to an intuitively designed referential ANN/HMM system.
Maynes-Aminzade, Dan, Pausch, Randy and Seitz, Steve (2002): Techniques for Interactive Audience Participation. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 15. Available online
At SIGGRAPH in 1991, Loren and Rachel Carpenter unveiled an interactive entertainment system that allowed members of a large audience to control an onscreen game using red and green reflective paddles. In the spirit of this approach, we present a new set of techniques that enable members of an audience to participate, either cooperatively or competitively, in shared entertainment experiences. Our techniques allow audiences with hundreds of people to control onscreen activity by (1) leaning left and right in their seats, (2) batting a beach ball while its shadow is used as a pointing device, and (3) pointing laser pointers at the screen. All of these techniques can be implemented with inexpensive, off the shelf hardware. We have tested these techniques with a variety of audiences; in this paper we describe both the computer vision based implementation and the lessons we learned about designing effective content for interactive audience participation.
Chen, Lei, Harper, Mary and Quek, Francis (2002): Gesture Patterns during Speech Repairs. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 155. Available online
Speech and gesture are two primary modes used in natural human communication; hence, they are important inputs for a multimodal interface to process. One of the challenges for multimodal interfaces is to accurately recognize the words in spontaneous speech. This is partly due to the presence of speech repairs, which seriously degrade the accuracy of current speech recognition systems. Based on the assumption that speech and gesture arise from the same thought process, we would expect to find patterns of gesture that co-occur with speech repairs that can be exploited by a multimodal processing system to more effectively process spontaneous speech. To evaluate this hypothesis, we have conducted a measurement study of gesture and speech repair data extracted from videotapes of natural dialogs. Although we have found that gestures do not always co-occur with speech repairs, we observed that modification gesture patterns have a high correlation with content replacement speech repairs, but rarely occur with content repetitions. These results suggest that gesture patterns can help us to classify different types of speech repairs in order to correct them more accurately.
Kettebekov, Sanshzar, Yeasin, Mohammed and Sharma, Rajeev (2002): Prosody Based Co-analysis for Continuous Recognition of Coverbal Gestures. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 161. Available online
Although recognition of natural speech and gestures has been studied extensively, previous attempts at combining them in a unified framework to boost classification were mostly semantically motivated, e.g., keyword-gesture co-occurrence. Such formulations inherit the complexity of natural language processing. This paper presents a Bayesian formulation that uses a phenomenon of gesture and speech articulation for improving the accuracy of automatic recognition of continuous coverbal gestures. The prosodic features from the speech signal were co-analyzed with the visual signal to learn the prior probability of co-occurrence of the prominent spoken segments with the particular kinematical phases of gestures. It was found that the above co-analysis helps in detecting and disambiguating small hand movements, which subsequently improves the rate of continuous gesture recognition. The efficacy of the proposed approach was demonstrated on a large database collected from the Weather Channel broadcast. This formulation opens new avenues for bottom-up frameworks of multimodal integration.
Kak, Avi C. (2002): Purdue RVL-SLLL ASL Database for Automatic Recognition of American Sign Language. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 167. Available online
This article reports on an extensive database of American Sign Language (ASL) motions, handshapes, words and sentences. Research on automatic recognition of ASL requires a suitable database for the training and the testing of algorithms. The databases that are currently available do not allow for algorithmic development that requires a step-by-step approach to ASL recognition -- from the recognition of individual hand-shapes, to the recognition of motion primitives, and, finally, to the recognition of full sentences. We have sought to remove these deficiencies in a new database -- the Purdue RVL-SLLL ASL database.
Landragin, Frédéric (2002): The Role of Gesture in Multimodal Referring Actions. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 173. Available online
When deictic gestures are produced on a touch screen, they can take forms which can lead to several sorts of ambiguities. Considering that the resolution of a multimodal reference requires the identification of the referents and of the context ("reference domain") from which these referents are extracted, we focus on the linguistic, gestural, and visual clues that a dialogue system may exploit to comprehend the referring intention. We explore the links between words, gestures and perceptual groups, doing so in terms of the clues that delimit the reference domain. We also show the importance of taking the domain into account for dialogue management, particularly for the comprehension of further utterances, when they seem to implicitly use a pre-existing restriction to a subset of objects. We propose a strategy of multimodal reference resolution based on this notion of reference domain, and we illustrate its efficiency with prototypic examples built from a study of significant referring situations extracted from a corpus. Finally, we outline future directions for our work concerning some linguistic and task aspects that are not integrated here.
Xiong, Yingen, Quek, Francis and McNeill, David (2002): Hand Gesture Symmetric Behavior Detection and Analysis in Natural Conversation. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 179. Available online
We present an experimental investigation into the phenomenon of gestural symmetry for two-handed gestures accompanying speech. We describe an approach to compute hand motion symmetries based on correlation computations. Local symmetries are detected using a windowing operation. We demonstrate that the selection of a smaller window size results in better sensitivity to local symmetries at the expense of noise in the form of spurious symmetries and 'symmetry dropoffs'. Our algorithm applies a 'hole filling' post process to address these detection problems. We examine the role of the detected motion symmetries of two-handed gestures in the structuring of speech. We compared discourse segments corresponding to extracted symmetries in two natural conversations against a discourse analysis by expert psycholinguistic coders. These comparisons illustrate the effectiveness of the symmetry feature for the understanding of underlying discourse structure. We believe that this basic characteristic of two-handed gestures accompanying speech must be incorporated in any multimodal interaction system involving two-handed gestures and speech.
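The windowed correlation and hole-filling steps described in this abstract can be sketched roughly as follows. This is not the authors' algorithm: it assumes one-dimensional horizontal hand positions, Pearson correlation as the correlation measure, and illustrative values for `window`, `threshold` and `max_gap`, none of which come from the paper.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def symmetric_frames(left_x, right_x, window=5, threshold=0.8):
    """Flag frames whose surrounding window shows mirror-symmetric motion.

    Mirror symmetry is assumed to mean the right hand moves opposite to the
    left hand along x, so left_x is correlated with the negated right_x.
    Frames too close to either end to hold a full window stay False.
    """
    n = len(left_x)
    half = window // 2
    flags = [False] * n
    for t in range(half, n - half):
        left = left_x[t - half:t + half + 1]
        mirrored = [-v for v in right_x[t - half:t + half + 1]]
        flags[t] = pearson(left, mirrored) >= threshold
    return flags


def fill_holes(flags, max_gap=2):
    """Set short False runs to True when they are bounded by True on both sides."""
    out = list(flags)
    n = len(out)
    i = 0
    while i < n:
        if out[i]:
            i += 1
            continue
        j = i
        while j < n and not out[j]:
            j += 1
        # A run starting at i > 0 is preceded by True; j < n means True follows.
        if i > 0 and j < n and (j - i) <= max_gap:
            for k in range(i, j):
                out[k] = True
        i = j
    return out
```

A smaller `window` reacts to shorter symmetric episodes but yields more isolated dropouts, which is the trade-off the abstract describes and the reason a post-process like `fill_holes` is useful.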
Hernandez-Rebollar, Jose L., Lindeman, Robert W. and Kyriakopoulos, Nicholas (2002): A Multi-Class Pattern Recognition System for Practical Finger Spelling Translation. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 185. Available online
This paper presents a portable system and method for recognizing the 26 hand shapes of the American Sign Language alphabet, using a novel glove-like device. Two additional signs, 'space', and 'enter' are added to the alphabet to allow the user to form words or phrases and send them to a speech synthesizer. Since the hand shape for a letter varies from one signer to another, this is a 28-class pattern recognition system. A three-level hierarchical classifier divides the problem into "dispatchers" and "recognizers." After reducing pattern dimension from ten to three, the projection of class distributions onto horizontal planes makes it possible to apply simple linear discrimination in 2D, and Bayes' Rule in those cases where classes had features with overlapped distributions. Twenty-one out of 26 letters were recognized with 100% accuracy; the worst case, letter U, achieved 78%.
Corradini, Andrea, Wesson, Richard M. and Cohen, Philip R. (2002): A Map-Based System Using Speech and 3D Gestures for Pervasive Computing. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 191. Available online
We describe an augmentation of Quickset, a multimodal voice/pen system that allows users to create and control map-based, collaborative, interactive simulations. In this paper, we report on our extension of the graphical pen input mode from stylus/mouse to 3D hand movements. To do this, the map is projected onto a virtual plane in space, specified by the operator before the start of the interactive session. We then use our geometric model to compute the intersection of hand movements with the virtual plane, translating these into map coordinates on the appropriate system. The goal of this research is the creation of a body-centered, multimodal architecture employing both speech and 3D hand gestures, which seamlessly and unobtrusively supports distributed interaction. The augmented system, built on top of an existing architecture, also provides an improved visualization, management and awareness of a shared understanding. Potential applications of this work include tele-medicine, battlefield management and any kind of collaborative decision-making during which users may wish to be mobile.
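The geometric step mentioned in this abstract, intersecting hand movement with a virtual plane and converting the result to map coordinates, can be sketched as a ray-plane intersection. This is only an illustration under stated assumptions: the pointing-ray model, the function names and the `u_axis`/`v_axis` plane basis are hypothetical and are not taken from the Quickset extension.

```python
def dot(a, b):
    """Dot product of two 3-D vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    """Component-wise difference a - b."""
    return tuple(x - y for x, y in zip(a, b))


def intersect_ray_plane(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Return the point where a pointing ray hits the plane, or None.

    None is returned when the ray is parallel to the plane or the plane
    lies behind the ray origin.
    """
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        return None
    t = dot(sub(plane_point, origin), plane_normal) / denom
    if t < 0:
        return None
    return tuple(o + t * d for o, d in zip(origin, direction))


def to_map_coords(hit, plane_origin, u_axis, v_axis):
    """Express a 3-D hit point in the plane's 2-D (u, v) coordinates.

    u_axis and v_axis are assumed to be orthonormal vectors lying in the plane.
    """
    rel = sub(hit, plane_origin)
    return (dot(rel, u_axis), dot(rel, v_axis))
```

For example, a ray from the origin along `(1, 0, 1)` meets the plane `z = 2` at `(2, 0, 2)`, which maps to `(2, 0)` when the plane origin is `(0, 0, 2)` and the axes are the world x and y directions.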
Cassidy, Andy, Hook, Dan and Baliga, Avinash (2002): Hand Tracking Using Spatial Gesture Modeling and Visual Feedback for a Virtual DJ System. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 197. Available online
The ability to accurately track hand movement provides new opportunities for human computer interaction (HCI). Many of today's commercial hand tracking devices based on gloves can be cumbersome and expensive. An approach that avoids these problems is to use computer vision to capture hand motion. In this paper, we present a complete real-time hand tracking and 3-D modeling system based on a single camera. In our system, we extract feature points from a video stream of a hand to control a virtual hand model with 2-D global motion and 3-D local motion. The on screen model gives the user instant feedback on the estimated position of the hand. This visual feedback allows a user to compensate for the errors in tracking. The system is used for three example applications. The first application uses hand tracking and gestures to take on the role of the mouse. The second interacts with a 3D virtual environment using the 3D hand model. The last application is a virtual DJ system that is controlled by hand motion tracking and gestures.
Li, Haifeng (2002): State Sharing in a Hybrid Neuro-Markovian On-Line Handwriting Recognition System through a Simple Hierarchical Clustering Algorithm. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 203. Available online
HMMs have been widely applied in many fields with great success. An easy way to achieve better performance is to use more states or more free parameters for better signal modelling. State sharing and state clipping methods have therefore been proposed to reduce parameter redundancy and to limit the explosive consumption of system resources. In this paper, we focus on a simple state sharing method for a hybrid neuro-Markovian on-line handwriting recognition system. First, a likelihood-based distance is proposed for measuring the similarity between two HMM state models. Afterwards, a hierarchical clustering algorithm aimed at minimum quantification error is proposed to select the most representative models. Here, models are shared as much as possible under the constraint of minimum system performance loss. As a result, we maintain about 98% of the system performance while reducing the number of parameters by about 60%.
Barthelmess, P. and Ellis, C. A. (2002): Perceptual Collaboration in Neem. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 21. Available online
The Neem Platform is a research test bed for Project Neem, concerned with the development of socially and culturally aware collaborative systems in a wide range of domains. In this paper we discuss a novel use of Perceptual Interfaces, applied to group collaboration support. In Neem, the multimodal content of human to human interaction is analyzed and reasoned upon. Applications react to this implicit communication by dynamically adapting their behavior according to the perceived group context. In contrast, Perceptual Interfaces have been traditionally employed to handle explicit (multimodal) commands from users, and are as a rule not concerned with the communication that takes place among humans. The Neem Platform is a generic (application neutral) component-based evolvable framework that provides functionality that facilitates building such perceptual collaborative applications.
Hattori, Hiroaki (2002): An Automatic Speech Translation System on PDAs for Travel Conversation. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 211. Available online
We present an automatic speech-to-speech translation system for Personal Digital Assistants (PDAs) that helps oral communication between Japanese and English speakers in various situations while traveling. Our original compact large vocabulary continuous speech recognition engine, compact translation engine based on a lexicalized grammar, and compact Japanese speech synthesis engine lead to the development of a Japanese/English bi-directional speech translation system that works with limited computational resources.
Zhang, Jing, Chen, Xilin, Yang, Jie and Waibel, Alex (2002): A PDA-Based Sign Translator. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 217. Available online
In this paper, we propose an effective approach for a PDA-based sign system that presents the user with a sign translator. Its main functions comprise three parts: detection, recognition and translation. Automatic detection and recognition of text in natural scenes is a prerequisite for an automatic sign translator. In order to make the system robust for text detection in various natural scenes, the detection approach efficiently embeds multi-resolution, adaptive search in a hierarchical framework with different emphases at each layer. We also introduce an intensity-based OCR method to recognize characters in various fonts and lighting conditions, where we employ the Gabor transform to obtain local features, and LDA for selection and classification of features. The recognition rate is 92.4% on a test set obtained from natural signs. Sign text differs from ordinary sentences: it is brief, with many abbreviations and place names. Here we only briefly introduce a rule-based place name translation. We have integrated all these functions in a PDA, which can capture a sign image, automatically segment and recognize the Chinese sign, and translate it into English.
Costantini, Erica (2002): The NESPOLE! Multimodal Interface for Cross-lingual Communication -- Experience and Lessons Learned. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 223. Available online
In this paper we describe the design, evolution, and development of the user interface components of the NESPOLE! speech-to-speech translation system. The NESPOLE! system was designed for users with medium-to-low levels of computer literacy and web expertise. The user interface was designed to effectively combine web browsing, real-time sharing of graphical information and multi-modal annotations using a shared whiteboard, and real-time multilingual speech communication, all within an e-commerce scenario. Data collected in sessions with naïve users in several stages in the process of system development formed the basis for improving the effectiveness and usability of the system. We describe this development process, the resulting interface components and the lessons learned.
Zheng, Dequan (2002): Research of Machine Learning Method for Specific Information Recognition on the Internet. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 229. Available online
With the available resources on the Internet becoming plentiful, a large amount of harmful information is spreading and seriously affecting people's normal work and life. Therefore, harmful data streams must be recognized and filtered out effectively. After analyzing some harmful contents in Internet information streams, we present a new method, which recognizes specific information by Machine Learning (ML). We extracted key information from a number of corpora through an ML method to obtain the part of speech (POS) Transfer-Form for key information by learning from corpora, which is based on the same pronunciation matching of key information. Furthermore, the testing value of key information will be obtained in a real corpus to examine the likelihood between matching rules from information streams and those learnt from corpora through the average value of POS transfer probability of key information. Therefore, the testing value for the whole real data stream will be obtained. The experiment showed that the method was efficient for recognizing certain harmful Internet information.
Costantini, Erica (2002): The Added Value of Multimodality in the NESPOLE! Speech-to-Speech Translation System: an Experimental Study. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 235. Available online
Multimodal interfaces, which combine two or more input modes (speech, pen, touch), are expected to be more efficient, natural and usable than single-input interfaces. However, the advantage of multimodal input has only been ascertained in highly controlled experimental conditions [4, 5, 6]; in particular, we lack data about what happens with 'real' human-human, multilingual communication systems. In this work we discuss the results of an experiment aiming to evaluate the added value of multimodality in a "true" speech-to-speech translation system, the NESPOLE! system, which provides for multilingual and multimodal communication in the tourism domain, allowing users to interact through the internet sharing maps, web-pages and pen-based gestures. We compared two experimental conditions differing as to whether multimodal resources were available: a speech-only condition (SO), and a multimodal condition (MM). Most of the data show tendencies for MM to be better than SO.
Morishima, Shigeo and Nakamura, Satoshi (2002): Multi-Modal Translation System and Its Evaluation. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 241. Available online
Speech-to-speech translation has been studied to realize natural human communication beyond language barriers. Toward further multi-modal natural communication, visual information such as face and lip movements will be necessary. In this paper, we introduce a multi-modal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion while synchronizing it to the translated speech. To retain the speaker's facial expression, we substitute only the speech organ's image with the synthesized one, which is made by a three-dimensional wire-frame model that is adaptable to any speaker. Our approach enables image synthesis and translation with an extremely small database. We conduct subjective evaluation tests using the connected digit discrimination test using data with and without audio-visual lip-synchronization. The results confirm the significant quality of the proposed audio-visual translation system and the importance of lip-synchronization.
Wang, Zhirong, Topkara, Umut, Schultz, Tanja and Waibel, Alex (2002): Towards Universal Speech Recognition. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 247. Available online
The increasing interest in multilingual applications like speech-to-speech translation systems is accompanied by the need for speech recognition front-ends in many languages that can also handle multiple input languages at the same time. In this paper we describe a universal speech recognition system that fulfills such needs. It is trained by sharing speech and text data across languages and thus reduces the number of parameters and overhead significantly at the cost of only slight accuracy loss. The final recognizer eases the burden of maintaining several monolingual engines, makes dedicated language identification obsolete and allows for code-switching within an utterance. To achieve these goals we developed new methods for constructing multilingual acoustic models and multilingual n-gram language models.
Huang, Fei (2002): Improved Named Entity Translation and Bilingual Named Entity Extraction. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 253. Available online
Translation of named entities (NE), including proper names, temporal and numerical expressions, is very important in multilingual natural language processing, like crosslingual information retrieval and statistical machine translation. In this paper we present an integrated approach to extract a named entity translation dictionary from a bilingual corpus while at the same time improving the named entity annotation quality. Starting from a bilingual corpus where the named entities are extracted independently for each language, a statistical alignment model is used to align the named entities. An iterative process is applied to extract named entity pairs with higher alignment probability. This leads to a smaller but cleaner named entity translation dictionary and also to a significant improvement of the monolingual named entity annotation quality for both languages. Experimental results show that the dictionary size is reduced by 51.8% and the annotation quality is improved from 70.03 to 78.15 for Chinese and 73.38 to 81.46 in terms of F-score.
Atienza, Rowel and Zelinsky, Alexander (2002): Active Gaze Tracking for Human-Robot Interaction. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 261. Available online
In our effort to make human-robot interfaces more user-friendly, we built an active gaze tracking system that can measure a person's gaze direction in real-time. Gaze normally indicates which object in the surroundings a person is interested in. Therefore, it can be used as a medium for human-robot interaction, such as instructing a robot arm to pick up an object the user is looking at. In this paper, we discuss how we developed and put together algorithms for zoom camera calibration, low-level control of active head, face and gaze tracking to create an active gaze tracking system.
Demirdjian, David and Darrell, Trevor (2002): 3-D Articulated Pose Tracking for Untethered Diectic Reference. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 267. Available online
Arm and body pose are useful cues for deictic reference: users naturally extend their arms to objects of interest in a dialog. We present recent progress on untethered sensing of articulated arm and body configuration using robust stereo vision techniques. These techniques allow robust, accurate, real-time tracking of 3-D position and orientation. We demonstrate users' performance with our system on object selection tasks and describe our initial efforts to integrate this system into a multimodal conversational dialog framework.
Polat, E. (2002): A Tracking Framework for Collaborative Human Computer Interaction. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 27. Available online
The ability to track multiple people and their body parts (i.e., face and hands) in a complex environment is crucial for designing a collaborative natural human computer interaction (HCI). One of the challenging issues in tracking body parts of people is the data association uncertainty while assigning measurements to the proper tracks in the case of occlusion and close interaction of body parts of different people. This paper describes a framework for tracking body parts of people in 2D/3D using the multiple hypothesis tracking (MHT) algorithm. A path coherence function has been incorporated along with MHT to reduce the negative effects of closely spaced measurements that produce unconvincing tracks and needless computations. The performance of the framework has been validated using experiments on real image sequences.
Stiefelhagen, Rainer (2002): Tracking Focus of Attention in Meetings. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 273. Available online
This paper presents an overview of our work on tracking focus of attention in meeting situations. We have developed a system capable of estimating participants' focus of attention from multiple cues. In our system we employ an omni-directional camera to simultaneously track the faces of participants sitting around a meeting table and use neural networks to estimate their head poses. In addition, we use microphones to detect who is speaking. The system predicts participants' focus of attention from acoustic and visual information separately, and then combines the output of the audio- and video-based focus of attention predictors. In addition, this work reports recent experimental results: in order to determine how well we can predict a subject's focus of attention solely on the basis of his or her head orientation, we have conducted an experiment in which we recorded head and eye orientations of participants in a meeting using special tracking equipment. Our results demonstrate that head orientation was a sufficient indicator of the subjects' focus target 89% of the time. Furthermore, we discuss how the neural networks used to estimate head orientation can be adapted to work in new locations and under new illumination conditions.
Wang, Qiang, Ai, Haizhou and Xu, Guangyou (2002): A Probabilistic Dynamic Contour Model for Accurate and Robust Lip Tracking. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 281. Available online
In this paper a new Condensation style contour tracking method called Probabilistic Dynamic Contour (PDC) is proposed for lip tracking: a novel mixture dynamic model is designed to represent shape more compactly and to tolerate larger motions between frames, and a measurement model is designed to include multiple visual cues. The proposed PDC tracker has the advantage that it is conceptually general but effectively suitable for lip tracking with the designed dynamic and measurement model. The new tracker improves the traditional Condensation style tracker in three aspects: Firstly, the dynamic model is partially derived from the image sequence, so the tracker does not need to learn the dynamics in advance. Secondly, the measurement model is easy to update during tracking, which avoids modeling the foreground object a priori. Thirdly, to improve the tracker's speed, a compact representation of shape and a noise model are proposed to reduce the samples required to represent the posterior distribution. Experiments on lip contour tracking show that the proposed method tracks contours robustly as well as accurately compared to the existing tracking method.
Yu, Chen, Ballard, Dana H. and Zhu, Shenghuo (2002): Attentional Object Spotting by Integrating Multimodal Input. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 287. Available online
An intelligent human-computer interface is expected to allow computers to work with users in a cooperative manner. To achieve this goal, computers need to be aware of user attention and provide assistance without explicit user requests. Cognitive studies of eye movements suggest that in accomplishing well-learned tasks, the performer's focus of attention is locked with the ongoing work and more than 90% of eye movements are closely related to the objects being manipulated in the tasks. In light of this, we have developed an attentional object spotting system that integrates multimodal data consisting of eye positions, head positions and video from the "first-person" perspective. To detect the user's focus of attention, we modeled eye gaze and head movements using a hidden Markov model (HMM) representation. For each attentional point in time, the object of user interest is automatically extracted and recognized. We report the results of experiments on finding attentional objects in the natural task of "making a peanut-butter sandwich".
Wu, Zhilin, Aleksic, Petar S. and Katsaggelos, Aggelos K. (2002): Lip Tracking for MPEG-4 Facial Animation. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 293. Available online
It is very important to accurately track the mouth of a talking person for many applications, such as face recognition and human computer interaction. This is in general a difficult problem due to the complexity of shapes, colors, textures, and changing lighting conditions. In this paper we develop techniques for outer and inner lip tracking. From the tracking results FAPs are extracted which are used to drive an MPEG-4 decoder. A novel method consisting of a Gradient Vector Flow (GVF) snake with a parabolic template as an additional external force is proposed. Based on the results of the outer lip tracking, the inner lip is tracked using a similarity function and a temporal smoothness constraint. Numerical results are presented using the Bernstein database.
Kim, Taeyoon, Kang, Yongsung and Ko, Hanseok (2002): Achieving Real-Time Lip Synch via SVM-Based Phoneme Classification and Lip Shape Refinement. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 299. Available online
In this paper, we develop a real-time lip-synch system that activates a 2-D avatar's lip motion in synch with an incoming speech utterance. To realize the "real time" operation of the system, we contain the processing time by invoking a merge-and-split procedure that performs coarse-to-fine phoneme classification. At each stage of phoneme classification, we apply the support vector machine (SVM) to constrain the computational load while attaining the desirable accuracy. The coarse-to-fine phoneme classification is accomplished via 2 stages of feature extraction, where each speech frame is acoustically analyzed first for 3 classes of lip opening using MFCC as the feature, followed by a further refined classification for detailed lip shape using formant information. We implemented the system with a 2-D lip animation that shows the effectiveness of the proposed 2-stage procedure in accomplishing the real-time lip-synch task.
Oliver, Nuria, Horvitz, Eric and Garg, Ashutosh (2002): Layered Representations for Human Activity Recognition. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 3. Available online
We present the use of layered probabilistic representations using Hidden Markov Models for performing sensing, learning, and inference at multiple levels of temporal granularity. We describe the use of the representation in a system that diagnoses states of a user's activity based on real-time streams of evidence from video, acoustic, and computer interactions. We review the representation, present an implementation, and report on experiments with the layered representation in an office-awareness application.
Nakamura, Satoshi, Kumatani, Ken'ichi and Tamura, Satoshi (2002): Multi-Modal Temporal Asynchronicity Modeling by Product HMMs for Robust. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 305. Available online
Demand for Audio-visual Speech Recognition (AVSR) has recently increased in order to make speech recognition systems robust to acoustic noise. There are two kinds of research issues in audio-visual speech recognition research: integration modeling that considers asynchronicity between modalities, and adaptive information weighting according to information reliability. This paper proposes a method to effectively integrate audio and visual information. Such integration inevitably necessitates modeling of the synchronization and asynchronization of the audio and visual information. To address the time lag and correlation problems in individual features between speech and lip movements, we introduce a type of integrated HMM modeling of audio-visual information based on a family of product HMMs. The proposed model can represent state synchronicity not only within a phoneme but also between phonemes. Furthermore, we also propose a rapid stream weight optimization based on the GPD algorithm for noisy bi-modal speech recognition. Evaluation experiments show that the proposed method improves the recognition accuracy for noisy speech. At SNR = 0 dB our proposed method attained 16% higher performance compared to product HMMs without the synchronicity re-estimation.
Zudilova, E. V. (2002): A Multi-Modal Interface for an Interactive Simulated Vascular Reconstruction System. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 313. Available online
This paper is devoted to multi-modal interface design and implementation of a simulated vascular reconstruction system. It provides multi-modal interaction methods such as speech recognition, hand gestures, direct manipulation of virtual 3D objects and measurement tools. The main challenge is that no general interface scenario in existence today can satisfy all the users of the system (radiologists, vascular surgeons, medical students, etc.). The potential users of the system can vary by their skills, expertise level, habits and psycho-motional characteristics. To make a multi-modal interface user-friendly is a crucial issue. In this paper we introduce an approach to develop such an efficient, user-friendly multi-modal interaction system. We focus on adaptive interaction as a possible solution to address the variety of end-users. Based on a user model, the adaptive user interface identifies each individual by means of a set of criteria and generates a customized exploration environment.
Langer, Ine (2002): Universal Interfaces to Multimedia Documents. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 319. Available online
Electronic documents theoretically have great advantages for people with print disabilities, although currently this potential is not being realized. This paper reports research to develop multimedia documents with universal interfaces which can be configured to the needs of people with a variety of print disabilities. The implications of enriching multimedia documents with additional and alternative single media objects are discussed and an implementation using HTML + TIME has been undertaken.
Zandifar, Ali and Chahine, Antoine (2002): A Video Based Interface to Textual Information for the Visually Impaired. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 325. Available online
We describe the development of an interface to textual information for the visually impaired that uses video, image processing, optical-character-recognition (OCR) and text-to-speech (TTS). The video provides a sequence of low resolution images in which text must be detected, rectified and converted into high resolution rectangular blocks that are capable of being analyzed via off-the-shelf OCR. To achieve this, various problems related to feature detection, mosaicing, auto-focus, zoom, and systems integration were solved in the development of the system, and these are described.
Fontana, Federico (2002): A Structural Approach to Distance Rendering in Personal Auditory Displays. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 33. Available online
A virtual resonating environment aimed at enhancing our perception of distance is proposed. This environment reproduces the acoustics inside a tube, thus conveying peculiar distance cues to the listener. The corresponding resonator has been prototyped using a wave-based numerical scheme called Waveguide Mesh, which gave the necessary versatility to the model during the design and parameterization of the listening environment. Psychophysical tests show that this virtual environment conveys robust distance cues.
Phillips, George N. (2002): Modular Approach of Multimodal Integration in a Virtual Environment. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 331. Available online
We present a novel modular approach to integrating multiple input/output (I/O) modes in a virtual environment that imitates natural, intuitive and effective human interaction behavior. The I/O modes used in this research are spatial tracking of two hands, finger gesture recognition, head/body spatial tracking, voice recognition (discrete recognition for simple commands, and continuous recognition for natural language input), immersive stereo display and synthesized speech output. The intuitive natural interaction is achieved through several stages: identify all the tasks that need to be performed, group the similar tasks, and assign them to a particular mode such that it imitates the physical world. This modular approach allows easy inclusion and removal of input and output modes as well as of additional users. We describe this multimodal interaction paradigm by applying it to a real world application: visualizing, modeling and fitting protein molecular structures in an immersive virtual environment.
Niklfeld, Georg, Pucher, Michael, Finan, Robert and Eckhart, Wolfgang (2002): Mobile Multi-Modal Data Services for GPRS Phones and Beyond. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 337. Available online
The paper discusses means to build multi-modal data services in existing GPRS infrastructures, and it puts the proposed simple solutions into the perspective of technological possibilities that will become available in public mobile communications networks over the next few years along the progression path from 2G/GSM systems, through GPRS, to 3G systems like UMTS, or equivalently to 802.11 networks. Three demonstrators are presented, which were developed by the authors in an application-oriented research project co-financed by telecommunications companies. The first two, push-to-talk address entry for a route-finder, and an open-microphone map-content navigator, simulate a UMTS or WLAN scenario. The third demonstrator implements a multi-modal map finder in a live public GPRS network using WAP-Push. Some indications on usability are given. The paper argues for the importance of open, standards-based architectures that will spur attractive multi-modal services for the short term, as the current economic difficulties in the telecommunications industry put support for long term research into more advanced forms of multi-modality in question.
Bostwick, Ben and Seemann, Edgar (2002): Flexi-Modal and Multi-Machine User Interfaces. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 343. Available online
We describe our system which facilitates collaboration using multiple modalities, including speech, handwriting, gestures, gaze tracking, direct manipulation, large projected touch-sensitive displays, laser pointer tracking, regular monitors with a mouse and keyboard, and wirelessly-networked handhelds. Our system allows multiple, geographically dispersed participants to simultaneously and flexibly mix different modalities using the right interface at the right time on one or more machines. This paper discusses each of the modalities provided, how they were integrated in the system architecture, and how the user interface enabled one or more people to flexibly use one or more devices.
Krahnstoever, N., Kettebekov, S., Yeasin, M. and Sharma, R. (2002): A Real-Time Framework for Natural Multimodal Interaction with Large Screen Displays. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 349. Available online
This paper presents a framework for designing a natural multimodal human computer interaction (HCI) system. The core of the proposed framework is a principled method for combining information derived from audio and visual cues. To achieve natural interaction, both audio and visual modalities are fused along with feedback through a large screen display. Careful design along with due consideration of possible aspects of a system's interaction cycle and integration has resulted in a successful system. The performance of the proposed framework has been validated through the development of several prototype systems as well as commercial applications for the retail and entertainment industry. To assess the impact of these multimodal systems (MMS), informal studies have been conducted. It was found that the system performed according to its specifications in 95% of the cases and that users showed ad-hoc proficiency, indicating natural acceptance of such systems.
Sinha, Anoop K. and Landay, James A. (2002): Embarking on Multimodal Interface Design. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 355. Available online
Designers are increasingly faced with the challenge of targeting multimodal applications, those that span heterogeneous devices and use multimodal input, but do not have tools to support them. We studied the early stage work practices of professional multimodal interaction designers. We noted the variety of different artifacts produced, such as design sketches and paper prototypes. Additionally, we observed Wizard of Oz techniques that are sometimes used to simulate an interactive application from these sketches. These studies have led to our development of a technique for interface designers to consider as they embark on creating multimodal applications.
Larsen, Lars Bo, Jensen, Morten Damm and Vodzi, Wisdom Kobby (2002): Multi Modal User Interaction in an Automatic Pool Trainer. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 361. Available online
This paper presents the human-computer interaction in an automatic pool trainer currently being developed at the Center for PersonKommunikation, Aalborg University. The aim of the system is to automate (parts of) the learning process, in this case of the game of pool. The Automated Pool Trainer (APT) utilises multi modal, agent driven user-system communication, to facilitate the user interaction. To allow the user the necessary freedom of movement when addressing the task, system output is presented on a wall-mounted screen and is augmented by a laser drawing lines and points directly on the pool table surface. User Interaction is either carried out via a spoken dialogue with an animated interface agent, or by using a touch screen panel. The paper describes the philosophy on which the system is designed, as well as the system architecture and individual modules. The user interaction is described and the paper concludes with a presentation of some test results and a discussion of the suitability of the presented and similar systems.
Siewiorek, Daniel, Smailagic, Asim and Hornyak, Matthew (2002): Multimodal Contextual Car-Driver Interface. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 367. Available online
This paper focuses on the design and implementation of a Companion Contextual Car Driver interface that proactively assists the driver in managing information and communication. The prototype combines a smart car environment and driver state monitoring, incorporating a wide range of input-output modalities and a display hierarchy. Intelligent agents link information from many contexts, such as location and schedule, and transparently learn from the driver, interacting with the driver only when it is necessary.
Nichols, Jeffrey, Myers, Brad A., Harris, Thomas K., Rosenfeld, Roni, Shriver, Stefanie, Higgins, Michael and Hughes, Joseph (2002): Requirements for Automatically Generating Multi-Modal Interfaces for Complex Appliances. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 377. Available online
Several industrial and academic research groups are working to simplify the control of appliances and services by creating a truly universal remote control. Unlike the preprogrammed remote controls available today, these new controllers download a specification from the appliance or service and use it to automatically generate a remote control interface. This promises to be a useful approach because the specification can be made detailed enough to generate both speech and graphical interfaces. Unfortunately, generating good user interfaces can be difficult. Based on user studies and prototype implementations, this paper presents a set of requirements that we have found are needed for automatic interface generation systems to create high-quality user interfaces.
Ning, Huazhong (2002): Articulated Model Based People Tracking Using Motion Models. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 383. Available online
This paper focuses on the acquisition of human motion data such as joint angles and velocity for applications of virtual reality, using both an articulated body model and a motion model in the CONDENSATION framework. Firstly, we learn a motion model represented by Gaussian distributions, and explore motion constraints by considering the dependency of motion parameters and representing them as conditional distributions. Then both of them are integrated into the dynamic model to concentrate factored sampling in the areas of state-space with most posterior information. To measure the observing density with accuracy and robustness, a PEF (Pose Evaluation Function) modeled with a radial term is proposed. We also address the issue of automatic acquisition of initial model posture and recovery from severe failures. A large number of experiments on several persons demonstrate that our approach works well.
Wilson, Kevin, Rangarajan, Vibhav, Checka, Neal and Darrell, Trevor (2002): Audiovisual Arrays for Untethered Spoken Interfaces. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 389. Available online
When faced with a distant speaker at a known location in a noisy environment, a microphone array can provide a significantly improved audio signal for speech recognition. Estimating the location of a speaker in a reverberant environment from audio information alone can be quite difficult, so we use an array of video cameras to aid localization. Stereo processing techniques are used on pairs of cameras, and foreground 3-D points are grouped to estimate the trajectory of people as they move in an environment. These trajectories are used to guide a microphone array beamformer. Initial results using this system for speech recognition demonstrate increased recognition rates compared to non-array processing techniques.
Panuccio, A. (2002): A Multimodal Electronic Travel Aid Device. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 39. Available online
This paper describes an Electronic Travel Aid device that may enable blind individuals to "see the world with their ears". A wearable prototype will be assembled using low-cost hardware: earphones, sunglasses fitted with two micro cameras, and a palmtop computer. The system, which currently runs on a desktop computer, is able to detect the light spot produced by a laser pointer, compute its angular position and depth, and generate a corresponding sound providing the auditory cues for the perception of the position and distance of the pointed surface patch. It permits different sonification modes that can be chosen by drawing, with the laser pointer, a predefined stroke which will be recognized by a Hidden Markov Model. In this way the blind person can use a common pointer as a replacement for the cane and will interact with the device by using a flexible and natural sketch-based interface.
Wang, Sen, Zhang, Wei Wei and Wang, Yang Sheng (2002): Fingerprint Classification by Directional Fields. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 395. Available online
Fingerprint classification provides an important fingerprint index and can reduce fingerprint matching time in large databases. A good classification algorithm can give an accurate index that makes it possible to search a fingerprint database more effectively. In this paper, we present a fingerprint classification algorithm that is based on directional fields. We compute the directional fields of a fingerprint image and detect singular points (cores). Then, we extract features that we define from the fingerprint image. We also use a k-means classifier and 3-nearest neighbor to classify the features and distinguish whether a fingerprint is Arch, Left Loop, Right Loop, or Whorl. Experimental results show a significant improvement in fingerprint classification performance. Moreover, the time required for the classification algorithm is reduced.
Focken, Dirk (2002): Towards Vision-Based 3-D People Tracking in a Smart Room. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 400. Available online
This paper presents our work on building a real-time distributed system to track 3-D locations of people in an indoor environment, such as a smart room, using multiple calibrated cameras. In our system, each camera is connected to a dedicated computer on which foreground regions in the camera image are detected. This is done using an adaptive background model. These detected foreground regions are broadcast to a tracking agent, which computes believed 3-D locations of persons based on the detected image regions. We have implemented both a best-hypothesis heuristic tracking approach and a probabilistic multi-hypothesis tracker to find the object tracks from these 3-D locations. The two tracking approaches are evaluated on a sequence of two people walking in a conference room recorded with three cameras. The results suggest that the probabilistic tracker shows comparable performance to the heuristic tracker.
Mentis, Helena M. (2002): Using TouchPad Pressure to Detect Negative Affect. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 406. Available online
Humans naturally use behavioral cues in their interactions with other humans. The Media Equation proposes that these same cues are directed towards media, including computers. It is probable that detection of these cues by a computer during run-time could improve usability design and analysis. A preliminary experiment testing one of these cues, Synaptics TouchPad pressure, shows that behavioral cues can be used as a critical incident indicator by detecting negative affect.
Latoschik, Marc Erich (2002): Designing Transition Networks for Multimodal VR-Interactions Using a Markup Language. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 411. Available online
This article presents one core component for enabling multimodal (speech- and gesture-driven) interaction in and for Virtual Environments. A so-called temporal Augmented Transition Network (tATN) is introduced. It allows the integration and evaluation of information from speech, gesture, and a given application context using a combined syntactic/semantic parse approach. This tATN represents the target structure for a multimodal integration markup language (MIML). MIML centers around the specification of multimodal interactions by letting an application designer declare temporal and semantic relations between given input utterance percepts and certain application states in a declarative and portable manner. A subsequent parse pass translates MIML into corresponding tATNs which are directly loaded and executed by a simulation engine's scripting facility.
Yonezawa, Tomoko and Mase, Kenji (2002): Musically Expressive Doll in Face-to-Face Communication. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 417. Available online
We propose an application that uses music as a multimodal expression to activate and support communication that runs parallel with traditional conversation. In this paper, we examine a personified doll-shaped interface designed for musical expression. To direct such gestures toward communication, we have adopted an augmented stuffed toy with tactile interaction as a musically expressive device. We constructed the doll with various sensors for user context recognition. This configuration enables translation of the interaction into melodic statements. We demonstrate the effect of the doll on face-to-face conversation by comparing the experimental results of different input interfaces and output sounds. Consequently, we have found that conversation with the doll was positively affected by the musical output, the doll interface, and their combination.
Chen, Xilin and Yang, Jie (2002): Towards Monitoring Human Activities Using an Omnidirectional Camera. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 423. Available online
In this paper we propose an approach for monitoring human activities in an indoor environment using an omnidirectional camera. Robustly tracking people is a prerequisite for modeling and recognizing human activities. An omnidirectional camera mounted on the ceiling is less prone to problems of occlusion. We use the Markov Random Field (MRF) to present both background and foreground, and adapt models effectively against environment changes. We employ a deformable model to adapt the foreground models to optimally match objects in different positions within the field of view of the omnidirectional camera. In order to monitor human activity, we represent positions of people as spatial points and analyze moving trajectories within a time-spatial window. The method provides an efficient way to monitor high-level human activities without exploring identities.
Xie, Weikai, Shi, Yuanchun, Xu, Guanyou and Mao, Yanhua (2002): Smart Platform -- A Software Infrastructure for Smart Space (SISS). In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 429. Available online
A software infrastructure is fundamental to a Smart Space. Previously proposed software infrastructures for Smart Space (SISS) did not sufficiently address the issues of performance and usability. A new solution, Smart Platform, which is focused on improving these aspects of a SISS, is presented in this paper. To optimize its intermodule communication performance, stream-oriented communication is distinguished from message-oriented communication, and a corresponding hybrid communication scheme is proposed. To improve the usability, a featured loose-coupling structure, a straightforward Publish-and-Subscribe coordination model as well as a set of user-friendly deployment and development tools are developed. Besides, Smart Platform is intended as an open and generic SISS available for other research groups. To this end, an XML-based message syntax and an open wire-protocol based architecture are adopted to make sharing research efforts easier.
Gray, Rob, Tan, Hong Z. and Young, J. Jay (2002): Do Multimodal Signals Need to Come from the Same Place? Crossmodal Attentional Links Between Proximal and Distal Surfaces. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 437. Available online
Previous research has shown that the use of multimodal signals can lead to faster and more accurate responses compared to purely unimodal displays. However, in most cases response facilitation only occurs when the signals are presented in roughly the same spatial location. This would suggest a severe restriction on interface designers: to use multimodal displays effectively all signals must be presented from the same location on the display. We previously reported evidence that the use of haptic cues may provide a solution to this problem as haptic cues presented to a user's back can be used to redirect visual attention to locations on a screen in front of the user. In the present experiment we used a visual change detection task to investigate whether (i) this type of visual-haptic interaction is robust at low cue validity rates and (ii) similar effects occur for auditory cues. Valid haptic cues resulted in significantly faster change detection times even when they accurately indicated the location of the change on only 20% of the trials. Auditory cues had a much smaller effect on detection times at the high validity
Bergman, Janne (2002): CATCH-2004 Multi-Modal Browser: Overview Description with Usability Analysis. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 442. Available online
This paper takes a closer look at the user interface issues in our research multi-modal browser architecture. The browser framework, also briefly introduced in this paper, reuses single-modal browser technologies available for VoiceXML, WML, and HTML browsing. User interface actions on a particular browser are captured, converted to events, and distributed to the other browsers participating (possibly on different hosts) in the multi-modal framework. We have defined a synchronization protocol, which distributes such events with the help of a central component called the Virtual Proxy. The choice of the architecture and the synchronization primitives has profound consequences for handling certain interesting UI use cases. We particularly address those specified by the W3C Multi-Modal Requirements, which are related to the design of possible strategies for dealing with simultaneous input, resolving input inconsistencies, and defining synchronization points. The proposed approaches are illustrated by examples.
Cohen, Philip R., Coulston, Rachel and Krout, Kelly (2002): Multimodal Interaction During Multiparty Dialogues: Initial Results. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 448. Available online
Groups of people involved in collaboration on a task often incorporate the objects in their mutual environment into their discussion. With this comes physical reference to these 3-D objects, including: gesture, gaze, haptics, and possibly other modalities, over and above the speech we commonly associate with human-human communication. From a technological perspective, this human style of communication not only poses the challenge for researchers to create multimodal systems capable of integrating input from various modalities, but also to do it well enough that it supports -- but does not interfere with -- the primary goal of the collaborators, which is their own human-human interaction. This paper offers a first step towards building such multimodal systems for supporting face-to-face collaborative work by providing both qualitative and quantitative analyses of multiparty multimodal dialogues in a field setting.
Arafa, Yasmine and Mamdani, Abe (2002): Multi-Modal Embodied Agents Scripting. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 454. Available online
Embodied agents present an ongoing, challenging agenda for research in multi-modal user interfaces and human-computer interaction. Such agent metaphors will only be widely applicable to online applications when there is a standardised way to map underlying engines to the visual presentation of the agents. This paper delineates the functions and specifications of a mark-up language for scripting the animation of virtual characters. The language is called Character Mark-up Language (CML) and is an XML-based character attribute definition and animation scripting language designed to aid in the rapid incorporation of lifelike characters/agents into online applications or virtual reality worlds. This multi-modal scripting language is designed to be easily understandable by human animators and easily generated by a software process such as software agents. CML is constructed based jointly on the motion and multi-modal capabilities of virtual life-like figures. The paper further illustrates the constructs of the language and describes a real-time execution architecture that demonstrates the use of such a language as a 4G language to easily utilise and integrate MPEG-4 media objects in online interfaces and virtual environments.
Moehler, Gregor (2002): A Methodology for Evaluating Multimodality in a Home Entertainment System. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 460. Available online
Multimodality is likely to play a key role in household technology interfaces of the future, offering, as it can, enhanced efficiency, improved flexibility and increased user preference. These benefits are unlikely to be realized, however, unless such interfaces are well designed, specifically with regard to modality allocation and configuration. We report on a methodology aimed at evaluating modality usage, which involves a combination of two sets of heuristics, one derived from a description of modality properties, the other concerned with issues of usability. We describe how modality properties can be reformulated into a procedural-style checklist and then describe the implementation of this methodology and the issues we were able to highlight in the context of the 'EMBASSI Home' system, a multimodal system which aims to provide a natural and intuitive interface to a potentially open-ended array of appliances within the home.
Cho, Changseok and Yang, Huichul (2002): Body-Based Interfaces. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 466. Available online
This research explores different ways to use features of one's own body for interacting with computers. In the future, such "body-based" interfaces may be put into good use for wearable computing or virtual reality systems as part of a 3D multi-modal interface, freeing the user from holding interaction devices. We have identified four types of body-based interfaces: the Body-inspired-metaphor uses various parts of the body metaphorically for interaction; the Body-as-interaction-surface simply uses parts of the body as points of interaction; Mixed-mode mixes the former two; Object-mapping spatially maps the interaction object to the human body. These four body-based interfaces were applied to three different applications (and associated tasks) and were tested for their performance and utility. It was generally found that, while the Body-inspired-metaphor produced the lowest error rate, it required a longer task completion time and caused more fatigue due to the longer hand moving distance. On the other hand, the Body-as-interaction-surface was the fastest, but produced many more errors.
Rogina, Ivica (2002): Lecture and Presentation Tracking in an Intelligent Meeting Room. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 47. Available online
Archiving, indexing, and later browsing through stored presentations and lectures is a task that can be observed with growing frequency. We have investigated the special problems and advantages of lectures and propose the design and adaptation of a speech recognizer towards a lecture such that the recognition accuracy can be significantly improved by prior analysis of the presented documents using a special class-based language model. We defined a tracking accuracy measure which measures how well a system can automatically align recognized words with parts of a presentation and show that by prior exploitation of the presented documents, the tracking accuracy can be improved. The system described in this paper is part of an intelligent meeting room developed in the European-Union-sponsored project FAME (Facilitating Agent for Multicultural Exchange).
Application control in virtual environments (VE) is still an open field of research. The Command and Control Cube (C3) developed by Grosjean et al. is a quick access menu for the VE configuration called workbench (a large screen displaying stereoscopic images). The C3 presents two modes: one with a graphical display of the cubic structure associated with the C3, and a blind mode for expert users, with no feedback. In this paper we conduct formal tests of the C3 under four different conditions: the visual mode with the graphical display, the blind mode with no feedback and two additional conditions enhancing the expert blind mode: a tactile mode with the tactile feedback of a Cyberglove and a sound mode with a standard audio device. Results show that the addition of sound and tactile feedback is more disturbing to the users than the blind mode. The visual mode performs the best, although the blind mode achieves some promising results.
Stouffs, Alexandre (2002): Interruptions as Multimodal Outputs: Which are the Less Disruptive? In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 479. Available online
This paper describes exploratory studies of interruption modalities and disruptiveness. Five interruption modalities were compared: Heat, Smell, Sound, Vibration, and Light. Much more notable than the differences between modalities were the differences between people. We found that subjects' sensitivity depended on their previous life exposure to the modalities. Individual differences greatly control the effect of interrupting stimuli. We show that it is possible to build a multimodal adaptive interruption interface; such interfaces would dynamically select the output interruption modality to use based on its effectiveness on a particular user.
Aist, Gregory, Kort, Barry, Reilly, Rob, Mostow, Jack and Picard, Rosalind W. (2002): Experimentally Augmenting an Intelligent Tutoring System with Human-Supplied Capabilities: Adding Human-Provided Emotional Scaffolding to an Automated Reading Tutor that Listens. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 483. Available online
This paper presents the first statistically reliable empirical evidence from a controlled study for the effect of human-provided emotional scaffolding on student persistence in an intelligent tutoring system. We describe an experiment that added human-provided emotional scaffolding to an automated Reading Tutor that listens, and discuss the methodology we developed to conduct this experiment. Each student participated in one (experimental) session with emotional scaffolding, and in one (control) session without emotional scaffolding, counterbalanced by order of session. Each session was divided into several portions. After each portion of the session was completed, the Reading Tutor gave the student a choice: continue, or quit. We measured persistence as the number of portions the student completed. Human-provided emotional scaffolding added to the automated Reading Tutor resulted in increased student persistence, compared to the Reading Tutor alone. Increased persistence means increased time on task, which ought to lead to improved learning. If these results for reading turn out to hold for other domains too, the implication for intelligent tutoring systems is that they should respond with not just cognitive support but emotional scaffolding as well. Furthermore, the general technique of adding human-supplied capabilities to an existing intelligent tutoring system should prove useful for studying other ITSs too. This paper is a shortened and revised version of Aist et al. (same title). ITS Workshop on Empirical Methods for Tutorial Dialogue. June 4, 2002, San Sebastian, Spain.
Cohn, Jeffrey F., Schmidt, Karen, Gross, Ralph and Ekman, Paul (2002): Individual Differences in Facial Expression: Stability over Time, Relation to Self-Reported Emotion, and Ability to Inform Person Identification. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 491. Available online
The face can communicate varied personal information including subjective emotion, communicative intent, and cognitive appraisal. Accurate interpretation by observer or computer interface depends on attention to dynamic properties of the expression, context, and knowledge of what is normative for a given individual. In two separate studies, we investigated individual differences in the base rate of positive facial expression and in specific facial action units over intervals from 4- to 12 months. Facial expression was measured using convergent measures, including facial EMG, automatic feature-point tracking, and manual FACS coding. Individual differences in facial expression were stable over time, comparable in magnitude to stability of self-reported emotion, and sufficiently strong that individuals were recognized on the basis of their facial behavior alone at rates comparable to that for a commercial face recognition system (FaceIt from Identix). Facial action units convey unique information about person identity that can inform interpretation of psychological states, person recognition, and design of individuated avatars.
Cohen, Michael M., Massaro, Dominic W. and Clark, Rashid (2002): Training a Talking Head. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 499. Available online
A Cyberware laser scan of DWM was made, Baldi's generic morphology was mapped into the form of DWM, this head was trained on real data recorded with Optotrak LED markers, and the quality of its speech was evaluated. Participants were asked to recognize auditory sentences presented alone in noise, aligned with the newly trained synthetic textured mapped target face, or the original natural face. There was a significant advantage when the noisy auditory sentence was paired with either head, with the synthetic textured mapped target face giving as much of an improvement as the original recordings of the natural face.
Caldognetto, Emanuela Magno, Perin, Giulio and Zmarich, Claudio (2002): Labial Coarticulation Modeling for Realistic Facial Animation. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 505. Available online
A modified version of the coarticulation model proposed by Cohen and Massaro is described. A semi-automatic minimization technique, working on real kinematic data acquired by the ELITE opto-electronic system, was used to train the dynamic characteristics of the model. Finally, the model was successfully applied to GRETA, an Italian talking head, and a few examples are given to show the naturalness of the resulting animation.
Xiong, Ziyou, Chen, Yunqiang, Wang, Roy and Huang, Thomas S. (2002): Improved Information Maximization based Face and Facial Feature Detection from Real-time Video and Application in a Multi-Modal Person Identification System. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 511. Available online
In this paper an improved face detection method based on our previous Information-Based Maximum Discrimination approach is presented. It maximizes the discrimination between face and non-face examples in a training set without using color or motion information. A short review of our previous method is given, together with a description of the intuition behind it and our recent improvement of its detection speed. A person identification system has been developed that performs multi-modal person identification in real-time video based on this newly improved face detection method together with speaker identification.
Jiang, Dalong (2002): Animating Arbitrary Topology 3D Facial Model Using the MPEG-4 FaceDefTables. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 517. Available online
In this paper, we put forward a method to animate arbitrary topology facial model (ATFM) based on the MPEG-4 standard. This paper deals mainly with the problem of building the FaceDefTables, which play a very important role in the MPEG-4 based facial animation system. The FaceDefTables for our predefined standard facial model (SFM) are built by using the interpolation method. Since the FaceDefTables depend on facial models, the FaceDefTables for the SFM can be applied to only those facial models that have the same topology as the SFM. For those facial models that have different topology, we have to build the FaceDefTables accordingly. To acquire the FaceDefTables for ATFM, we first select feature points on ATFM, and then transform the SFM according to those feature points. At last, we project each vertex on the ATFM to the transformed SFM and build the FaceDefTables for the ATFM according to the projection position. With the FaceDefTables we built, realistic animation results have been acquired.
Wang, Wei, Shan, Shiguang, Gao, Wen, Cao, Bo and Yin, Baocai (2002): An Improved Active Shape Model for Face Alignment. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 523. Available online
In this paper, we present several improvements on the conventional Active Shape Models (ASM) for face alignment. Despite the accuracy and robustness of ASMs in image alignment, their performance depends heavily on the initial parameters of the shape model, as well as the local texture model for each landmark and the corresponding local matching strategy. In this work, to improve the ASMs for face alignment, several measures are taken. First, salient facial features, such as the eyes and the mouth, are localized based on a face detector. These salient features are then utilized to initialize the shape model and provide region constraints on the subsequent iterative shape searching. Secondly, we exploit the edge information to construct better local texture models for the landmarks on the face contour. The edge intensity at the contour landmark is used as a self-adaptive weight when calculating the Mahalanobis distance between the candidate profile and the reference one. Thirdly, to avoid their unreasonable shift from the pre-localized salient features, landmarks around the salient features are adjusted before applying the global subspace constraints. Experiments on a database containing 300 labeled face images show that the proposed method performs significantly better than traditional ASMs.
Fasel, Beat (2002): Head-Pose Invariant Facial Expression Recognition Using Convolutional Neural Networks. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 529. Available online
Automatic face analysis has to cope with pose and lighting variations. Pose variations are especially difficult to tackle, and many face analysis methods require the use of sophisticated normalization and initialization procedures. We propose a data-driven face analysis approach that is not only capable of extracting features relevant to a given face analysis task, but is also more robust with regard to face location changes and scale variations when compared to classical methods such as MLPs. Our approach is based on convolutional neural networks that use multi-scale feature extractors, which allow for improved facial expression recognition results with faces subject to in-plane pose variations.
Taguma, Ryuta, Moriyama, Tatsuhiro, Iwano, Koji and Furui, Sadaoki (2002): Parallel Computing-Based Architecture for Mixed-Initiative Spoken Dialogue. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 53. Available online
This paper describes a new method of implementing mixed-initiative spoken dialogue systems based on a parallel computing architecture. In a mixed-initiative dialogue, the user as well as the system needs to be capable of controlling the dialogue sequence. In our implementation, various language models corresponding to different dialogue contents, such as requests for information or replies to the system, are built, and multiple recognizers using these language models are driven under a parallel computing architecture. The dialogue content of the user is automatically detected based on likelihood scores given by the recognizers, and the content is used to build the dialogue. A transitional probability from one dialogue state uttering a kind of content to another state uttering a different content is incorporated into the likelihood score. A flexible dialogue structure that gives users the initiative to control the dialogue is implemented by this architecture. Real-time dialogue systems for retrieving information about restaurants and food stores are built and evaluated in terms of dialogue content identification rate and keyword accuracy. The proposed architecture has the advantage that the dialogue system can be easily modified without remaking the whole language model.
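The content-detection step described in this abstract, in which recognizer likelihood scores are combined with a transition probability between dialogue states, can be sketched roughly as follows. This is only an illustrative reading: the function name `detect_content`, the additive combination in the log domain, and the example scores are assumptions, not details given in the abstract.

```python
import math
from typing import Dict


def detect_content(recognizer_loglik: Dict[str, float],
                   transition_prob: Dict[str, float]) -> str:
    """Return the dialogue content with the best combined log score.

    recognizer_loglik: log-likelihood from the recognizer built for each content.
    transition_prob: assumed probability of moving from the previous dialogue
        state to a state uttering each content.
    """
    best_content = None
    best_score = -math.inf
    for content, loglik in recognizer_loglik.items():
        prior = transition_prob.get(content, 0.0)
        if prior <= 0.0:
            continue  # transition considered impossible under this sketch
        score = loglik + math.log(prior)  # combine the two scores in the log domain
        if score > best_score:
            best_content, best_score = content, score
    if best_content is None:
        raise ValueError("no content has a non-zero transition probability")
    return best_content


if __name__ == "__main__":
    # Hypothetical scores from two recognizers running in parallel.
    loglik = {"request": -120.0, "reply": -119.5}
    transitions = {"request": 0.8, "reply": 0.2}
    print(detect_content(loglik, transitions))
```

In this made-up example the "reply" recognizer scores slightly higher on its own, but the transition probability tips the decision towards "request".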
Kong, Dehui (2002): An Improved Algorithm for Hairstyle Dynamics. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 535. Available online
This paper introduces an efficient and flexible hair modeling method to develop intricate hairstyle dynamics. A prominent contribution of the present work is an approach to evaluating the spring coefficient: the spring coefficient can be obtained by combining a large-deflection deformation model with a spring hinge model. This is based on the fact that the spring coefficient is directly proportional to the stiffness coefficient, a variable determined by hair shape. Moreover, the damping coefficient is no longer regarded as a constant but as a function of hair density, and this treatment has proved successful in solving the problem of hair-hair collision. As a result, a dynamic model that fits a great variety of hairstyles very well is proposed.
Nakamura, Satoshi and Heracleous, Panikos (2002): 3-D N-Best Search for Simultaneous Recognition of Distant-Talking Speech of Multiple Talkers. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 59. Available online
A microphone array is a promising solution for realizing hands-free speech recognition in real environments. Accurate talker localization is very important for speech recognition using a microphone array. However, localization of a moving talker is difficult in noisy reverberant environments, and talker localization errors degrade the performance of speech recognition. To solve the problem, we proposed a new speech recognition algorithm which considers multiple talker direction hypotheses simultaneously. The proposed algorithm performs a Viterbi search in a 3-dimensional trellis space composed of talker directions, input frames, and HMM states. In this paper we describe a new algorithm for the simultaneous recognition of distant-talking speech of multiple talkers using the extended 3-D N-best search algorithm. The algorithm exploits a path distance-based clustering and a likelihood normalization technique, which appeared to be necessary in order to build an efficient system for our purpose. We evaluated the proposed method using reverberated data, both simulated by the image method and recorded in a real room. The image method was used to determine the accuracy-reverberation time relationship, and the real data were used to evaluate the real performance of our algorithm. The Top 3 Simultaneous Word Accuracy obtained was 73.02% under a 162 ms reverberation time using the image method.
Wong, Pui-Fung and Siu, Man-Hung (2002): Integration of Tone Related Feature for Chinese Speech Recognition. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 64. Available online
Chinese is a tonal language that uses fundamental frequency, in addition to phones, for word differentiation. Commonly used front-end features, such as Mel-Frequency Cepstral Coefficients (MFCC), however, are optimized for non-tonal languages such as English and do not focus on the pitch information that is important for tone identification. In this paper, we examine the integration of tone-related acoustic features for Chinese recognition. We propose the use of the Cepstrum Method (CEP), which uses the same configurations as in MFCC extraction, for the extraction of pitch-related features. The pitch periods extracted from the CEP algorithm can be used directly for speech recognition and do not require any special treatment for unvoiced frames. In addition, we explore a number of feature transformations and find that the addition of a properly normalized and transformed set of pitch-related features can reduce the recognition error rate from 34.61% to 29.45% on the Chinese 1998 National Performance Assessment (Project 863) corpus.
Mersiol, Marc, Chateau, Noël and Maffiolo, Valérie (2002): Talking Heads: Which Matching between Faces and Synthetic Voices? In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 69. Available online
The integration of synthetic faces and text-to-speech voice synthesis (what we call "talking heads") allows new applications in the area of man-machine interfaces. In the near future, talking heads might be useful communicative interface agents. But before making extensive use of talking heads, several issues have to be checked with regard to their acceptability to users. An important issue is to make sure that the synthetic voices used match their faces. The scope of this paper is to study the coherence that might exist between synthetic voices and faces. Twenty-four subjects rated the coherence of all the combinations of ten faces and six voices. The main results of this paper show that not all associations between faces and voices are relevant and that some associations are better rated than others according to qualitative criteria.
Lu, Dajin (2002): Robust Noisy Speech Recognition with Adaptive Frequency Bank Selection. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 75. Available online
With the development of automatic speech recognition technology, the robustness problem of speech recognition systems is becoming more and more important. This paper addresses the problem of speech recognition in an additive background noise environment. Since the frequency energy of different types of noise is concentrated in different frequency banks, the effect of additive noise on each frequency bank is different. The seriously obscured frequency banks have little word signal information left, and are harmful for subsequent speech processing. Wu et al. applied the frequency bank selection theory to robust word boundary detection in noisy environments, and obtained good detection results. In this paper, this theory is extended to noisy speech recognition. Unlike the standard MFCC, which uses all frequency banks for cepstral coefficients, we only use the frequency banks that are least corrupted and discard the seriously obscured ones. Cepstral coefficients are calculated only on the selected frequency banks. Moreover, the acoustic model is also adapted to match the modification of the acoustic features. Experiments on continuous digital speech recognition show that the proposed algorithm leads to better performance than spectral subtraction and cepstral mean normalization at low SNRs.
Wang, ZhiQiang, Liu, Yang, Ding, Peng and Bo, Xu (2002): Covariance-Tied Clustering Method in Speaker Identification. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 81. Available online
Gaussian mixture models (GMMs) have been successfully applied to the classifier for speaker modeling in speaker identification. However, there are still problems to solve, such as the clustering methods. Conditional K-Means Algorithm utilizes Euclidean distance, treating all data distributions as spherical, which is not the distribution of the actual data. In this paper we present a new method that makes use of covariance information to direct the clustering of GMMs, namely covariance-tied clustering. This method consists of two parts: obtaining the covariance matrices using a data sharing technique based on a binary tree, and making use of the covariance matrices to direct clustering. The experimental results show that this method leads to worthwhile reductions of error rates in speaker identification. Much remains to be done to explore fully the covariance information.
Chai, Joyce, Pan, Shimei, Zhou, Michelle X. and Houck, Keith (2002): Context-Based Multimodal Input Understanding in Conversational Systems. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 87. Available online
In a multimodal human-machine conversation, user inputs are often abbreviated or imprecise. Sometimes, fusing multimodal inputs together is not enough to derive a complete understanding. To address these inadequacies, we are building a semantics-based multimodal interpretation framework called MIND (Multimodal Interpretation for Natural Dialog). The unique feature of MIND is the use of a variety of contexts (e.g., domain context and conversation context) to enhance multimodal fusion. In this paper, we present a semantics-rich modeling scheme and a context-based approach that enable MIND to gain a full understanding of user inputs, including ambiguous and incomplete ones.
Bauckhage, C. (2002): Evaluating Integrated Speech- and Image Understanding. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 9. Available online
The capability to coordinate and interrelate speech and vision is a virtual prerequisite for adaptive, cooperative, and flexible interaction among people. It is therefore reasonable to assume that human-machine interaction, too, would benefit from intelligent interfaces for integrated speech and image processing. In this paper, we first sketch an interactive system that integrates automatic speech processing with image understanding. Then, we concentrate on performance assessment, which we believe is an emerging key issue in multimodal interaction. We explain the benefit of time scale analysis and usability studies and evaluate our system accordingly.
Hastie, Helen Wright, Johnston, Michael and Ehlen, Patrick (2002): Context-Sensitive Help for Multimodal Dialogue. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 93. Available online
Multimodal interfaces offer users unprecedented flexibility in choosing a style of interaction. However, users are frequently unaware of or forget shorter or more effective multimodal or pen-based commands. This paper describes a working help system that leverages the capabilities of a multimodal interface in order to provide targeted, unobtrusive, context-sensitive help. This Multimodal Help System guides the user to the most effective way to specify a request, providing transferable knowledge that can be used in future requests without repeatedly invoking the help system.
Landragin, Frédéric (2002): Referring to Objects with Spoken and Haptic Modalities. In: Proceedings of the 2002 International Conference on Multimodal Interfaces 2002. p. 99. Available online
The gesture input modality considered in multimodal dialogue systems is mainly reduced to pointing or manipulating actions. With an approach based on the spontaneous character of the communication, the treatment of such actions involves many processes. Without any constraints, the user may use gesture in association with speech and may exploit the peculiarities of the visual context, which guide his articulation of gesture trajectories and his choice of words. The semantic interpretation of multimodal utterances also becomes a complex problem, taking into account varieties of referring expressions, varieties of gestural trajectories, structural parameters from the visual context, and also directives from a specific task. Following the spontaneous approach, we propose to give maximal understanding capabilities to dialogue systems, to ensure that various interaction modes are taken into account. Considering the development of haptic devices (such as PHANToM) which increase the capabilities of sensations, particularly tactile and kinesthetic ones, we propose to explore a new domain of research concerning the integration of haptic gesture into multimodal dialogue systems, in terms of its possible associations with speech for object reference and manipulation. We focus in this paper on the compatibility between haptic gesture and multimodal reference models, and on the consequences of processing this new modality for intelligent system architectures, which have not yet been sufficiently studied from a semantic point of view.