Proceedings of the 2004 International Conference on Multimodal Interfaces
The International Conference on Multimodal Interfaces (ICMI) is an annual ACM-sponsored conference that promotes research in next-generation perceptive, adaptive and multimodal user interfaces. These new interfaces are especially well suited for interpreting natural communication and activity patterns in real-world environments.
The following articles are from "Proceedings of the 2004 International Conference on Multimodal Interfaces":
Kuno, Yoshinori, Sakurai, Arihiro, Miyauchi, Dai and Nakamura, Akio (2004): Two-way eye contact between humans and robots. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 1-8. Available online
Eye contact is an effective means of controlling human communication, such as when initiating an interaction. It might seem that eye contact is established simply by two parties looking at each other, but this alone is not enough: both parties must also be aware of being watched by the other. We propose a method of two-way eye contact for human-robot communication. When a human wants to start communication with a robot, he/she watches the robot. If the robot finds a human looking at it, it turns to him/her, changing its facial expression to signal its awareness of his/her gaze. When the robot wants to initiate communication with a particular person, it moves its body and face toward him/her and changes its facial expression to make the person notice its gaze. We present several experimental results demonstrating the effectiveness of this method. Moreover, we present a robot that can recognize hand gestures after making eye contact with the human, showing the usefulness of eye contact as a means of controlling communication.
Kettebekov, Sanshzar (2004): Exploiting prosodic structuring of coverbal gesticulation. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 105-112. Available online
Although gesture recognition has been studied extensively, the communicative, affective, and biometric "utility" of natural gesticulation remains relatively unexplored. One of the main reasons is the modeling complexity of spontaneous gestures. While lexical information in speech provides additional cues for disambiguating gestures, it does not cover the rich paralinguistic domain. This paper offers initial findings, from a large corpus of natural monologues, on the prosodic structuring between frequent beat-like strokes and concurrent speech. Using a set of audio-visual features in an HMM-based formulation, we are able to improve the discrimination between visually similar gestures. These types of articulatory strokes represent different communicative functions. The analysis is based on the temporal alignment of detected vocal perturbations with the concurrent hand movement. As a supplementary result, we show that recognized articulatory strokes may be used to quantify gesturing behavior.
Eisenstein, Jacob and Davis, Randall (2004): Visual and linguistic information in gesture classification. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 113-120. Available online
Classification of natural hand gestures is usually approached by applying pattern recognition to the movements of the hand. However, the gesture categories most frequently cited in the psychology literature are fundamentally multimodal; the definitions make reference to the surrounding linguistic context. We address the question of whether gestures are naturally multimodal, or whether they can be classified from hand-movement data alone. First, we describe an empirical study showing that the removal of auditory information significantly impairs the ability of human raters to classify gestures. Then we present an automatic gesture classification system based solely on an n-gram model of linguistic context; the system is intended to supplement a visual classifier, but achieves 66% accuracy on a three-class classification problem on its own. This represents higher accuracy than human raters achieve when presented with the same information.
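The idea of classifying gestures from linguistic context alone can be sketched with a toy unigram (1-gram) model with add-one smoothing. This is purely an illustrative stand-in, not the authors' system: the tiny labeled corpus and the three class names below are invented for the example.

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus: words spoken near a gesture, labeled with a gesture
# class. Neither the data nor the class names come from the paper.
DATA = [
    (["this", "one", "here"], "deictic"),
    (["over", "there"], "deictic"),
    (["spins", "around", "like", "this"], "iconic"),
    (["moves", "up", "and", "down"], "iconic"),
    (["you", "know", "basically"], "beat"),
    (["so", "anyway"], "beat"),
]

def train(data):
    """Count unigrams per class: a 1-gram model of the linguistic context."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in data:
        class_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(words, class_counts, word_counts, vocab):
    """Pick the class maximizing log P(class) + sum log P(word | class),
    with add-one smoothing so unseen words do not zero out a class."""
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

model = train(DATA)
print(classify(["look", "over", "here"], *model))  # -> deictic
```

A real system would of course use a much larger corpus and richer n-gram features, but the same count-and-score structure applies.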
Harper, Mary P. and Shriberg, Elizabeth (2004): Multimodal model integration for sentence unit detection. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 121-128. Available online
In this paper, we adopt a direct modeling approach that utilizes conversational gesture cues to detect sentence units (SUs) in videotaped conversations. We treat SU detection as a classification task: for each inter-word boundary, the classifier decides whether an SU boundary is present. In addition to gesture cues, we also utilize prosodic and lexical knowledge sources. In this first investigation, we find that gesture features complement the prosodic and lexical knowledge sources for this task. By using all of the knowledge sources, the model achieves the lowest overall SU detection error rate.
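The per-boundary classification setup described in the abstract can be illustrated with a minimal sketch. The feature names, weights, and threshold below are invented for illustration; the paper's actual model and knowledge sources are far richer.

```python
from dataclasses import dataclass

@dataclass
class Boundary:
    """Hypothetical cues observed at one inter-word boundary."""
    pause_ms: float      # prosodic cue: silence after the word
    word: str            # lexical cue: the word preceding the boundary
    gesture_rest: bool   # gesture cue: hands returned to rest position

def su_score(b: Boundary) -> float:
    """Combine the three knowledge sources into one score (toy weights)."""
    score = 0.0
    score += min(b.pause_ms / 500.0, 1.0) * 0.5              # long pause -> likely SU
    score += 0.3 if b.word in {"yeah", "okay", "right"} else 0.0
    score += 0.2 if b.gesture_rest else 0.0
    return score

def is_su_boundary(b: Boundary, threshold: float = 0.5) -> bool:
    """Binary decision per inter-word boundary, as in the abstract."""
    return su_score(b) >= threshold

print(is_su_boundary(Boundary(pause_ms=600, word="okay", gesture_rest=True)))  # True
print(is_su_boundary(Boundary(pause_ms=50, word="the", gesture_rest=False)))   # False
```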
Oviatt, Sharon, Coulston, Rachel and Lunsford, Rebecca (2004): When do we interact multimodally?: cognitive load and multimodal communication patterns. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 129-136. Available online
Mobile usage patterns often entail high and fluctuating levels of difficulty as well as dual tasking. One major theme explored in this research is whether a flexible multimodal interface supports users in managing cognitive load. Findings from this study reveal that multimodal interface users spontaneously respond to dynamic changes in their own cognitive load by shifting to multimodal communication as load increases with task difficulty and communicative complexity. Given a flexible multimodal interface, users' ratio of multimodal (versus unimodal) interaction increased substantially, from 18.6% when referring to established dialogue context to 77.1% when required to establish a new context, a +315% relative increase. Likewise, the ratio of users' multimodal interaction increased significantly as the tasks became more difficult: from 59.2% during low-difficulty tasks to 65.5% at moderate difficulty, 68.2% at high difficulty, and 75.0% at very high difficulty.
Zeng, Zhihong, Tu, Jilin, Liu, Ming, Zhang, Tong, Rizzolo, Nicholas, Zhang, ZhenQiu, Huang, Thomas S., Roth, Dan and Levinson, Stephen (2004): Bimodal HCI-related affect recognition. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 137-143. Available online
Perhaps the most fundamental application of affective computing will be Human-Computer Interaction (HCI), in which the computer should have the ability to detect and track the user's affective states and provide corresponding feedback. The human multi-sensor affect system defines the expectation for a multimodal affect analyzer. In this paper, we present our efforts toward audio-visual HCI-related affect recognition. With HCI applications in mind, we take into account several special affective states that indicate users' cognitive/motivational states. Because a facial expression is influenced by both the affective state and the speech content, we apply a smoothing method to extract information about the affective state from facial features. In our fusion stage, a voting method is applied to combine the audio and visual modalities so that the final affect recognition accuracy is greatly improved. We test our bimodal affect recognition approach on 38 subjects with 11 HCI-related affect states. The extensive experimental results show that the average person-dependent affect recognition accuracy is almost 90% with our bimodal fusion.
Katzenmaier, Michael, Stiefelhagen, Rainer and Schultz, Tanja (2004): Identifying the addressee in human-human-robot interactions based on head pose and speech. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 144-151. Available online
In this work we investigate the power of acoustic and visual cues, and their combination, to identify the addressee in a human-human-robot interaction. Based on eighteen audio-visual recordings of two humans and a (simulated) robot, we discriminate the interaction of the two humans from the interaction of one human with the robot. The paper compares the results of three approaches. The first approach uses purely acoustic cues to find the addressee; low-level, feature-based cues as well as higher-level cues are examined. In the second approach we test whether the human's head pose is a suitable cue. Our results show that visually estimated head pose is the more reliable cue for identifying the addressee in the human-human-robot interaction. In the third approach we combine the acoustic and visual cues, which results in significant improvements.
Saenko, Kate, Darrell, Trevor and Glass, James R. (2004): Articulatory features for robust visual speech recognition. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 152-158. Available online
Visual information has been shown to improve the performance of speech recognition systems in noisy acoustic environments. However, most audio-visual speech recognizers rely on a clean visual signal. In this paper, we explore a novel approach to visual speech modeling, based on articulatory features, which has potential benefits under visually challenging conditions. The idea is to use a set of parallel classifiers to extract different articulatory attributes from the input images, and then combine their decisions to obtain higher-level units, such as visemes or words. We evaluate our approach in a preliminary experiment on a small audio-visual database, using several image noise conditions, and compare it to the standard viseme-based modeling approach.
Grange, Sébastien, Fong, Terrence and Baur, Charles (2004): M/ORIS: a medical/operating room interaction system. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 159-166. Available online
We propose an architecture for a real-time multimodal system that provides non-contact, adaptive user interfacing for Computer-Assisted Surgery (CAS). The system, called M/ORIS (for Medical/Operating Room Interaction System), combines gesture interpretation as an explicit interaction modality with continuous, real-time monitoring of the surgical activity in order to automatically address the surgeon's needs. Such a system will help reduce a surgeon's workload and operation time. This paper focuses on the proposed activity monitoring aspect of M/ORIS. We analyze the issues of human-computer interaction in an OR based on real-world case studies. We then describe how we intend to address these issues by combining a surgical procedure description with parameters gathered from vision-based surgeon tracking and other OR sensors (e.g. tool trackers). We call this approach Scenario-based Activity Monitoring (SAM). Finally, we present preliminary results, including a non-contact mouse interface for surgical navigation systems.
Ohno, Takehiko (2004): EyePrint: support of document browsing with eye gaze trace. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 16-23. Available online
Current digital documents provide few traces to help users browse. This makes document browsing difficult, and we sometimes find it hard to keep track of all of the information. To overcome this problem, this paper proposes a method of creating traces on digital documents. The method, called EyePrint, generates a trace from the user's eye gaze in order to support the browsing of digital documents. Traces are presented as highlighted areas on a document, which become visual cues for accessing previously visited documents. Traces also become document attributes that can be used to access and search the document. A prototype system that works with a gaze tracking system was developed. The results of a user study confirm the usefulness of the traces in digital document browsing.
Milota, André D. (2004): Modality fusion for graphic design applications. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 167-174. Available online
Users must enter a complex mix of spatial and abstract information when operating a graphic design application. Speech / language provides a fluid and natural method for specifying abstract information while a spatial input device is often most intuitive for the entry of spatial information. Thus, the combined speech / gesture interface is ideally suited to this application domain. While some research has been conducted on multimodal graphic design applications, advanced research on modality fusion has typically focused on map related applications. This paper considers the particular demands of graphic design applications and what impact these demands will have on the general strategies employed when combining the speech and gesture channels. We also describe initial work on our own multimodal graphic design application (DPD) which uses these strategies.
Holzapfel, Hartwig, Nickel, Kai and Stiefelhagen, Rainer (2004): Implementation and evaluation of a constraint-based multimodal fusion system for speech and 3D pointing gestures. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 175-182. Available online
This paper presents an architecture for the fusion of multimodal input streams for natural interaction with a humanoid robot, as well as results from a user study with our system. The presented fusion architecture consists of an application-independent parser of input events and application-specific rules. In the user study, people interacted with a robot in a kitchen scenario using speech and gesture input. We observed that our fusion approach is very tolerant of falsely detected pointing gestures, because we use speech as the main modality and pointing gestures mainly for disambiguating objects. We also report on the temporal correlation of speech and gesture events as observed in the user study.
Bodnar, Adam, Corbett, Richard and Nekrasovski, Dmitry (2004): AROMA: ambient awareness through olfaction in a messaging application. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 183-190. Available online
This work explores the properties of different output modalities as notification mechanisms in the context of messaging. In particular, the olfactory (smell) modality is introduced as a potential alternative to visual and auditory modalities for providing messaging notifications. An experiment was performed to compare these modalities as secondary display mechanisms used to deliver notifications to users working on a cognitively engaging primary task. It was verified that the disruptiveness and effectiveness of notifications varied with the notification modality. The olfactory modality was shown to be less effective in delivering notifications than the other modalities, but produced a less disruptive effect on user engagement in the primary task. Our results serve as a starting point for future research into the use of olfactory notification in messaging systems.
This paper discusses the Ohio University Virtual Haptic Back (VHB) project, including objectives, implementation, and initial evaluations. Haptics is the science of human tactile sensation; a haptic interface provides force and touch feedback to the user from a virtual environment. Our multimodal VHB simulation combines high-fidelity computer graphics with haptic and aural feedback to augment training in palpatory diagnosis in osteopathic medicine, plus related training applications in physical therapy, massage therapy, chiropractic therapy, and other tactile fields. We use the PHANToM haptic interface to capture position input from the trainee, with accompanying force feedback, to simulate the back of a live human subject in real time. Our simulation is intended to add a measurable, repeatable component of science to the art of palpatory diagnosis. Based on our experiences in the lab to date, we believe that haptics-augmented computer models have great potential for improving training in various tactile applications. Our main project goals are to: 1. provide a novel tool for palpatory diagnosis training; and 2. improve the state of the art in haptics and graphics applied to virtual anatomy.
Zhang, Liang-Guo, Chen, Yiqiang, Fang, Gaolin, Chen, Xilin and Gao, Wen (2004): A vision-based sign language recognition system using tied-mixture density HMM. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 198-204. Available online
In this paper, a vision-based, medium-vocabulary Chinese sign language recognition (SLR) system is presented. The proposed system consists of two modules. In the first module, techniques for robust hand detection, background subtraction, and pupil detection are efficiently combined to precisely extract feature information, with the aid of simple colored gloves, in an unconstrained environment. An effective and efficient hierarchical feature description scheme with features at different scales is proposed to characterize sign language, where principal component analysis (PCA) is employed to characterize the finger features more elaborately. In the second module, a Tied-Mixture Density Hidden Markov Model (TMDHMM) framework for SLR is proposed, which speeds up recognition without significant loss of accuracy compared with continuous hidden Markov models (CHMMs). Experimental results on 439 frequently used Chinese sign language (CSL) words show that the proposed methods work well for medium-vocabulary SLR in an environment without special constraints, with recognition accuracy up to 92.5%.
Busso, Carlos, Deng, Zhigang, Yildirim, Serdar, Bulut, Murtaza, Lee, Chul Min, Kazemzadeh, Abe, Lee, Sungbok, Neumann, Ulrich and Narayanan, Shrikanth (2004): Analysis of emotion recognition using facial expressions, speech and multimodal information. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 205-211. Available online
The interaction between human beings and computers will be more natural if computers are able to perceive and respond to human non-verbal communication such as emotions. Although several approaches have been proposed to recognize human emotions based on facial expressions or speech, relatively limited work has been done to fuse these two, and other, modalities to improve the accuracy and robustness of the emotion recognition system. This paper analyzes the strengths and the limitations of systems based only on facial expressions or acoustic information. It also discusses two approaches used to fuse these two modalities: decision-level and feature-level integration. Using a database recorded from an actress, four emotions were classified: sadness, anger, happiness, and neutral state. Markers on her face allowed detailed facial motions to be captured with motion capture, in conjunction with simultaneous speech recordings. The results reveal that the system based on facial expressions gave better performance than the system based on acoustic information alone for the emotions considered. Results also show the complementarity of the two modalities, and that when the two modalities are fused, the performance and robustness of the emotion recognition system improve measurably.
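The two fusion strategies the abstract contrasts can be sketched generically. Everything below except the four class labels is invented for illustration: the toy linear classifiers, random parameters, and feature dimensions are stand-ins, not the paper's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

# The four emotion classes follow the abstract; all parameters are toy values.
CLASSES = ["sadness", "anger", "happiness", "neutral"]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def face_scores(face_feat, W_face):
    """Posterior over classes from facial features alone."""
    return softmax(W_face @ face_feat)

def audio_scores(audio_feat, W_audio):
    """Posterior over classes from acoustic features alone."""
    return softmax(W_audio @ audio_feat)

def decision_level_fusion(face_feat, audio_feat, W_face, W_audio, w=0.5):
    """Decision-level: each modality classifies on its own, then the two
    posteriors are combined (here via a weighted product)."""
    p = face_scores(face_feat, W_face) ** w * audio_scores(audio_feat, W_audio) ** (1 - w)
    return CLASSES[int(np.argmax(p))]

def feature_level_fusion(face_feat, audio_feat, W_joint):
    """Feature-level: concatenate the raw feature vectors and classify the
    joint vector with a single model."""
    joint = np.concatenate([face_feat, audio_feat])
    return CLASSES[int(np.argmax(softmax(W_joint @ joint)))]

# Random toy inputs and weights just to make the sketch runnable.
face_feat, audio_feat = rng.normal(size=6), rng.normal(size=4)
W_face, W_audio = rng.normal(size=(4, 6)), rng.normal(size=(4, 4))
W_joint = rng.normal(size=(4, 10))
print(decision_level_fusion(face_feat, audio_feat, W_face, W_audio))
print(feature_level_fusion(face_feat, audio_feat, W_joint))
```

The practical trade-off is visible in the structure: decision-level fusion keeps the modalities independent until the final combination, while feature-level fusion lets one model exploit cross-modal correlations at the cost of a larger joint feature space.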
Dragicevic, Pierre and Fekete, Jean-Daniel (2004): Support for input adaptability in the ICON toolkit. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 212-219. Available online
In this paper, we introduce input adaptability as the ability of an application to exploit alternative sets of input devices effectively and to offer users a way of adapting input interaction to suit their needs. We explain why input adaptability must be seriously considered today and show how poorly it is supported by current systems, applications, and tools. We then describe ICon (Input Configurator), an input toolkit that allows interactive applications to achieve a high level of input adaptability. We present the software architecture behind ICon, then the toolkit itself, and give several examples of non-standard interaction techniques that are easy to build and modify using ICon's graphical editor while being hard or impossible to support using regular GUI toolkits.
Esch-Bussemakers, M. P. van and Cremers, A. H. M. (2004): User walkthrough of multimodal access to multidimensional databases. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 220-226. Available online
This paper describes a user walkthrough conducted with an experimental multimodal dialogue system for accessing a multidimensional music database using a simulated mobile device (including a technically challenging four-PHANToM setup). The main objectives of the walkthrough were to assess user preferences for certain modalities (speech, graphical, and haptic-tactile) for accessing and presenting certain types of information, and for certain search strategies when searching and browsing a multidimensional database. In addition, the project aimed to provide concrete recommendations for the experimental setup, multimodal user interface design, and evaluation. The results show that recommendations can be formulated both on the use of modalities and search strategies and on the experimental setup as a whole, including the user interface. In short, haptically enhanced buttons were preferred for navigating and selecting, and speech was preferred for searching the database for an album or artist. A 'direct' search strategy, indicating an album, artist, or genre, was favored. It can be concluded that participants were able to look beyond the experimental setup and see the potential of the envisioned mobile device and its modalities. It was therefore possible to formulate recommendations for future multimodal dialogue systems for multidimensional database access.
Kumar, Sanjeev, Cohen, Philip R. and Coulston, Rachel (2004): Multimodal interaction under exerted conditions in a natural field setting. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 227-234. Available online
This paper evaluates the performance of a multimodal interface under exerted conditions in a natural field setting. The subjects in the present study engaged in a strenuous activity while multimodally performing map-based tasks using handheld computing devices. This activity made the users breathe heavily and become fatigued during the course of the study. We found that the performance of both speech and gesture recognizers degraded as a function of exertion, while the overall multimodal success rate was stable. This stabilization is accounted for by the mutual disambiguation of modalities, which increases significantly with exertion. The system performed better for subjects with a greater level of physical fitness, as measured by their running speed, with more stable multimodal performance and a later degradation of speech and gesture recognition as compared with subjects who were less fit. The findings presented in this paper have a significant impact on design decisions for multimodal interfaces targeted towards highly mobile and exerted users in field environments.
Hazen, Timothy J., Saenko, Kate, La, Chia-Hao and Glass, James R. (2004): A segment-based audio-visual speech recognizer: data collection, development, and initial experiments. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 235-242. Available online
This paper presents the development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. To support this research, we have collected a new video corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of 4 total hours of read speech collected from 223 different speakers. This new corpus was used to evaluate our new AVSR system which incorporates a novel audio-visual integration scheme using segment-constrained Hidden Markov Models (HMMs). Preliminary experiments have demonstrated improvements in phonetic recognition performance when incorporating visual information into the speech recognition process.
Massaro, Dominic W. (2004): A framework for evaluating multimodal integration by humans and a role for embodied conversational agents. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 24-31. Available online
One of the implicit assumptions of multimodal interfaces is that human-computer interaction is significantly facilitated by providing multiple input and output modalities. Surprisingly, however, there is very little theoretical and empirical research testing this assumption in terms of the presentation of multimodal displays to the user. The goal of this paper is to provide both a theoretical and an empirical framework for addressing this important issue. Two contrasting models of human information processing are formulated and tested experimentally. According to integration models, multiple sensory influences are continuously combined during categorization, leading to perceptual experience and action. The Fuzzy Logical Model of Perception (FLMP) assumes that processing occurs in three successive but overlapping stages: evaluation, integration, and decision (Massaro, 1998). According to nonintegration models, any perceptual experience and action results from only a single sensory influence. These models are tested in expanded factorial designs in which two input modalities are varied independently of one another and each modality is also presented alone. Results from a variety of experiments on speech, emotion, and gesture support the predictions of the FLMP. Baldi, an embodied conversational agent, is described, and implications for applications of multimodal interfaces are discussed.
Bastide, Remi, Navarre, David, Palanque, Philippe A., Schyn, Amelie and Dragicevic, Pierre (2004): A model-based approach for real-time embedded multimodal systems in military aircrafts. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 243-250. Available online
This paper presents the use of a model-based approach for the formal description of real-time embedded multimodal systems. This modeling technique has been used in the field of military fighter aircraft. The paper presents the formal description technique, its application to the case study of a multimodal command-and-control interface for the Rafale aircraft, and its relationship to an architectural model for interactive systems.
Bouchet, Jullien, Nigay, Laurence and Ganille, Thierry (2004): ICARE software components for rapidly developing multimodal interfaces. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 251-258. Available online
Although several real multimodal systems have been built, their development still remains a difficult task. In this paper we address this problem of development of multimodal interfaces by describing a component-based approach, called ICARE, for rapidly developing multimodal interfaces. ICARE stands for Interaction-CARE (Complementarity Assignment Redundancy Equivalence). Our component-based approach relies on two types of software components. Firstly ICARE elementary components include Device components and Interaction Language components that enable us to develop pure modalities. The second type of components, called Composition components, define combined usages of modalities. Reusing and assembling ICARE components enable rapid development of multimodal interfaces. We have developed several multimodal systems using ICARE and we illustrate the discussion using one of them: the FACET simulator of the Rafale French military plane cockpit.
Rose, R. Travis, Quek, Francis and Shi, Yang (2004): MacVisSTA: a system for multimodal analysis. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 259-264. Available online
The study of embodied communication requires access to multiple data sources, such as multistream video and audio, and various derived data and metadata, such as gesture, head, posture, facial expression, and gaze information. The common element that runs through these data is the co-temporality of the multiple modes of behavior. In this paper, we present the multimedia Visualization for Situated Temporal Analysis (MacVisSTA) system for the analysis of multimodal human communication through video, audio, speech transcriptions, and gesture and head orientation data. The system uses a multiple-linked-representation strategy in which different representations are linked by the current time focus. In this framework, the multiple display components associated with the disparate data types are kept in synchrony, each component serving both as a controller of the system and as a display. Hence the user is able to analyze and manipulate the data from different analytical viewpoints (e.g. through the time-synchronized speech transcription or through motion segments of interest). MacVisSTA supports analysis of the synchronized data at varying timescales. It provides an annotation interface that permits users to code the data into 'music-score' objects, and to make and organize multimedia observations about the data. Hence MacVisSTA integrates flexible visualization with annotation within a single framework. An XML database manager has been created for storage and search of annotation data. We compare the system with other existing annotation tools with respect to functionality and interface design. The software runs on Macintosh OS X systems.
Pfleger, Norbert (2004): Context based multimodal fusion. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 265-272. Available online
We present a generic approach to multimodal fusion which we call context based multimodal integration. Key to this approach is that every multimodal input event is interpreted and enriched with respect to its local turn context. This local turn context comprises all previously recognized input events and the dialogue state that both belong to the same user turn. We show that a production rule system is an elegant way to handle this context based multimodal integration and we describe a first implementation of the so-called PATE system. Finally, we present results from a first evaluation of this approach as part of a human-factors experiment with the COMIC system.
Tao, Jianhua and Tan, Tieniu (2004): Emotional Chinese talking head system. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 273-280. Available online
A natural human-computer interface requires the integration of realistic audio and visual information for perception and display. In this paper, a lifelike talking head system is proposed. The system converts text to speech with synchronized animation of mouth movements and emotional expression. The talking head is based on a generic 3D human head model, into which a personalized model is incorporated. With texture mapping, the personalized model offers a more natural and realistic look than the generic model. To express emotion, emotional speech synthesis and emotional facial animation are integrated, and Chinese viseme models are also created. Finally, the emotional talking head system generates natural and vivid audio-visual output.
Patomäki, Saija, Raisamo, Roope, Salo, Jouni, Pasto, Virpi and Hippula, Arto (2004): Experiences on haptic interfaces for visually impaired young children. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 281-288. Available online
Visually impaired children do not have equal opportunities to learn and play compared to sighted children. Computers have great potential to correct this problem. In this paper we present a series of studies in which multimodal applications were designed for a group of eleven visually impaired children aged from 3.5 to 7.5 years. We also present our testing procedure, specially adapted for visually impaired young children. During the two-year project it became clear that, with careful design of the tasks and proper use of haptic and auditory features, usable computing environments can be created for visually impaired children.
Malik, Shahzad and Laszlo, Joe (2004): Visual touchpad: a two-handed gestural input device. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 289-296. Available online
This paper presents the Visual Touchpad, a low-cost vision-based input device that allows for fluid two-handed interactions with desktop PCs, laptops, public kiosks, or large wall displays. Two downward-pointing cameras are attached above a planar surface, and a stereo hand tracking system provides the 3D positions of a user's fingertips on and above the plane. Thus the planar surface can be used as a multi-point touch-sensitive device, but with the added ability to also detect hand gestures hovering above the surface. Additionally, the hand tracker not only provides positional information for the fingertips but also finger orientations. A variety of one and two-handed multi-finger gestural interaction techniques are then presented that exploit the affordances of the hand tracker. Further, by segmenting the hand regions from the video images and then augmenting them transparently into a graphical interface, our system provides a compelling direct manipulation experience without the need for more expensive tabletop displays or touch-screens, and with significantly less self-occlusion.
Guinn, Curry and Hubal, Rob (2004): An evaluation of virtual human technology in informational kiosks. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 297-302. Available online
In this paper, we look at the results of using spoken language interactive virtual characters in information kiosks. Users interact with synthetic spokespeople using spoken natural language dialogue. The virtual characters respond with spoken language, body and facial gesture, and graphical images on the screen. We present findings from studies of three different information kiosk applications. As we developed successive kiosks, we applied lessons learned from previous kiosks to improve system performance. For each setting, we briefly describe the application, the participants, and the results, with specific focus on how we increased user participation and improved informational throughput. We tie the results together in a lessons learned section.
Goldiez, Brian, Martin, Glenn, Daly, Jason, Washburn, Donald and Lazarus, Todd (2004): Software infrastructure for multi-modal virtual environments. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 303-308. Available online
Virtual environment systems, especially those supporting multi-modal interactions, require a robust and flexible software infrastructure that supports a wide range of devices, interaction techniques, and target applications. In addition to interactivity needs, a key factor in the robustness of the software is the minimization of latency and, more importantly, the reduction of jitter (the variability of latency). This paper presents a flexible software infrastructure that has demonstrated robustness in initial prototyping. The infrastructure, based on the VESS libraries from the University of Central Florida, simplifies the task of creating multi-modal virtual environments. Our extensions to VESS include numerous features to support new input and output devices for new sensory modalities and interaction techniques, as well as some control over latency and jitter.
Madan, Anmol, Caneel, Ron and Pentland, Alex Sandy (2004): GroupMedia: distributed multi-modal interfaces. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 309-316. Available online
In this paper, we describe the GroupMedia system, which uses wireless wearable computers to measure audio features, head-movement, and galvanic skin response (GSR) for dyads and groups of interacting people. These group sensor measurements are then used to build a real-time group interest index. The group interest index can be used to control group displays, annotate the group discussion for later retrieval, and even to modulate and guide the group discussion itself. We explore three different situations where this system has been introduced, and report experimental results.
Hamilton, Eric R. (2004): Agent and library augmented shared knowledge areas (ALASKA). In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 317-318. Available online
This paper reports on an NSF-funded effort now underway to integrate three learning technologies that have emerged and matured over the past decade; each has presented compelling and oftentimes moving opportunities to alter educational practice and to render learning more effective. The project seeks a novel way to blend these technologies and to create and test a new model for human-machine partnership in learning settings. The innovation we are prototyping in this project creates an applet-rich shared space whereby a pedagogical agent at each learner's station functions as an instructional assistant to the teacher or professor and tutor to the student. The platform is intended to open a series of new -- and instructionally potent -- interactive pathways.
Channarukul, Songsak, Mcroy, Susan W. and Ali, Syed S. (2004): MULTIFACE: multimodal content adaptations for heterogeneous devices. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 319-320. Available online
We are interested in applying and extending existing frameworks for combining output modalities for adaptations of multimodal content on heterogeneous devices based on user and device models. In this paper, we present Multiface, a multimodal dialog system that allows users to interact using different devices such as desktop computers, PDAs, and mobile phones. The presented content and its modality will be customized to individual users and the device they are using.
Morency, Louis-Philippe and Darrell, Trevor (2004): From conversational tooltips to grounded discourse: head pose tracking in interactive dialog systems. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 32-37. Available online
Head pose and gesture offer several key conversational grounding cues and are used extensively in face-to-face interaction among people. While the machine interpretation of these cues has previously been limited to output modalities, recent advances in face-pose tracking allow for systems which are robust and accurate enough to sense natural grounding gestures. We present the design of a module that detects these cues and show examples of its integration in three different conversational agents with varying degrees of discourse model complexity. Using a scripted discourse model and off-the-shelf animation and speech-recognition components, we demonstrate the use of this module in a novel "conversational tooltip" task, where additional information is spontaneously provided by an animated character when users attend to various physical objects or characters in the environment. We further describe the integration of our module in two systems where animated and robotic characters interact with users based on rich discourse and semantic models.
Dalton, Joseph M., Ahmad, Ali and Stanney, Kay (2004): Command and control resource performance predictor (C2RP2). In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 321-322. Available online
Nardelli, Luca, Orlandi, Marco and Falavigna, Daniele (2004): A multi-modal architecture for cellular phones. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 323-324. Available online
Merdes, Matthias, Häußler, Jochen and Jöst, Matthias (2004): 'SlidingMap': introducing and evaluating a new modality for map interaction. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 325-326. Available online
In this paper, we describe the concept of a new modality for interaction with digital maps. We propose using inclination as a means for panning maps on a mobile computing device, namely a tablet PC. The result is a map which is both physically transportable as well as manipulable with very simple and natural hand movements. We describe a setup for comparing this new modality with the better known modalities of pen-based and joystick-based interaction. Apart from demonstrating the new modality we plan to perform a short evaluation.
We demonstrate a same-time different-place collaboration system for managing crisis situations using geospatial information. Our system enables distributed spatial decision-making by providing a multimodal interface to team members. Decision makers in front of large screen displays and/or desktop computers, and emergency responders in the field with tablet PCs can engage in collaborative activities for situation assessment and emergency response.
Kaiser, Ed, Demirdjian, David, Gruenstein, Alexander, Li, Xiaoguang, Niekrasz, John, Wesson, Matt and Kumar, Sanjeev (2004): A multimodal learning interface for sketch, speak and point creation of a schedule chart. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 329-330. Available online
We present a video demonstration of an agent-based test bed application for ongoing research into multi-user, multimodal, computer-assisted meetings. The system tracks a two person scheduling meeting: one person standing at a touch sensitive whiteboard creating a Gantt chart, while another person looks on in view of a calibrated stereo camera. The stereo camera performs real-time, untethered, vision-based tracking of the onlooker's head, torso and limb movements, which in turn are routed to a 3D-gesture recognition agent. Using speech, 3D deictic gesture and 2D object de-referencing the system is able to track the onlooker's suggestion to move a specific milestone. The system also has a speech recognition agent capable of recognizing out-of-vocabulary (OOV) words as phonetic sequences. Thus when a user at the whiteboard speaks an OOV label name for a chart constituent while also writing it, the OOV speech is combined with letter sequences hypothesized by the handwriting recognizer to yield an orthography, pronunciation and semantics for the new label. These are then learned dynamically by the system and become immediately available for future recognition.
Demirdjian, David, Wilson, Kevin, Siracusa, Michael and Darrell, Trevor (2004): Real-time audio-visual tracking for meeting analysis. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 331-332. Available online
We demonstrate an audio-visual tracking system for meeting analysis. A stereo camera and a microphone array are used to track multiple people and their speech activity in real-time. Our system can estimate the location of multiple people, detect the current speaker and build a model of interaction between people in a meeting.
We present a novel paradigm for human to human asymmetric collaboration. There is a need for people at geographically separate locations to seamlessly collaborate in real time as if they are physically co-located. In our system one user (novice) works in the real world and the other user (expert) works in a parallel virtual world. They are assisted in this task by an Intelligent Agent (IA) with considerable knowledge about the environment. Current tele-collaboration systems deal primarily with collaboration purely in the real or virtual worlds. The use of a combination of virtual and real worlds allows us to leverage the advantages from both the worlds.
Rybski, Paul E., Banerjee, Satanjeev, Torre, Fernando De la, Vallespí, Carlos, Rudnicky, Alexander I. and Veloso, Manuela (2004): Segmentation and classification of meetings using multiple information streams. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 335-336. Available online
We present a meeting recorder infrastructure used to record and annotate events that occur in meetings. Multiple data streams are recorded and analyzed in order to infer a higher-level state of the group's activities. We describe the hardware and software systems used to capture people's activities as well as the methods used to characterize them.
Boda, Péter Pál (2004): A maximum entropy based approach for multimodal integration. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 337-338. Available online
Integration of various user input channels for a multimodal interface is not just an engineering problem. To fully understand users in the context of an application and the current session, solutions are sought that process information from different intentional (i.e., user-originated) as well as passively available sources in a uniform manner. As a first step towards this goal, the work demonstrated here investigates how intentional user input (e.g. speech, gesture) can be seamlessly combined to provide a single semantic interpretation of the user input. For this classical multimodal integration problem, the maximum entropy approach is demonstrated, with 76.52% integration accuracy for the top candidate and 86.77% accuracy for the top three candidates. The paper also exhibits the process that generates multimodal data for training the statistical integrator, using transcribed speech from MIT's Voyager application. The quality of the generated data is assessed by comparing it to real inputs to the multimodal version of Voyager.
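The log-linear scoring that underlies a maximum-entropy integrator can be sketched as follows. The feature names and weights below are invented for illustration; the paper's actual feature set and trained weights are not shown here.

```python
import math

# Toy maximum-entropy (log-linear) scorer for multimodal integration:
# each candidate interpretation is scored by a weighted sum of binary
# features extracted from the combined speech + gesture input, then
# normalized into a probability distribution (softmax).
# Feature names and weights are invented for illustration.

WEIGHTS = {
    ("speech:show", "interp:display_map"): 2.0,
    ("gesture:point", "interp:display_map"): 1.5,
    ("speech:show", "interp:zoom"): 0.5,
    ("gesture:point", "interp:select_object"): 1.0,
}

def maxent_rank(active_features, interpretations):
    scores = {}
    for interp in interpretations:
        s = sum(WEIGHTS.get((f, f"interp:{interp}"), 0.0)
                for f in active_features)
        scores[interp] = math.exp(s)
    z = sum(scores.values())  # partition function
    probs = {k: v / z for k, v in scores.items()}
    return sorted(probs.items(), key=lambda kv: -kv[1])

ranked = maxent_rank(["speech:show", "gesture:point"],
                     ["display_map", "zoom", "select_object"])
print(ranked[0][0])  # highest-probability interpretation: display_map
```

The n-best accuracies quoted in the abstract correspond to checking whether the correct interpretation appears at position 1, or anywhere in the first 3, of such a ranked list.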
A novel interface system for accessing geospatial data (GeoMIP) has been developed that realizes a user-centered multimodal speech/gesture interface for addressing some of the critical needs in crisis management. In this system we primarily developed vision sensing algorithms, speech integration, multimodality fusion, and rule-based mapping of multimodal user input to GIS database queries. A demo system of this interface has been developed for the Port Authority NJ/NY and is explained here.
Channarukul, Songsak (2004): Adaptations of multimodal content in dialog systems targeting heterogeneous devices. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 341. Available online
Dialog systems that adapt appropriately to different user needs and preferences have been shown to achieve higher levels of user satisfaction. However, it is also important that dialog systems be able to adapt to the user's computing environment, because people access computer systems using different kinds of devices, such as desktop computers, personal digital assistants, and cellular telephones. Each of these devices has a distinct set of physical capabilities, as well as a distinct set of functions for which it is typically used. Existing research on adaptation in both hypermedia and dialog systems has focused on how to customize content based on user models [2, 4] and interaction history. Some researchers have also investigated device-centered adaptations that range from low-level adaptations, such as conversion of multimedia objects (e.g., video to images, audio to text, image size reduction), to higher-level adaptations based on multimedia document models and frameworks for combining output modalities [3, 5]. However, to my knowledge, no work has been done on integrating and coordinating both types of adaptation interdependently. The primary problem I would like to address in this thesis is how multimodal dialog systems can adapt their content and style of interaction, taking the user, the device, and the dependency between them into account. Two main aspects of adaptability that my thesis considers are: (1) adaptability in content presentation and communication, and (2) adaptability in the computational strategies used to achieve the system's and user's goals. Besides general user-modeling questions, such as how to acquire information about the user and construct a user model, this thesis also considers issues of device modeling: (1) how can the system employ user and device models to adapt the content and determine the right combination of modalities effectively? (2) how can the system determine the combination of multimodal contents that best suits the device? (3) how can one model the characteristics and constraints of devices? and (4) is it possible to generalize device models based on modalities rather than on their typical categories or physical appearance?
Chen, Lei (2004): Utilizing gestures to better understand dynamic structure of human communication. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 342. Available online
Motivation: Many researchers have highlighted the importance of gesture in natural human communication. McNeill puts forward the hypothesis that gesture and speech stem from the same mental process and so tend to be both temporally and semantically related. However, in contrast to speech, which surfaces as a linear progression of segments, sounds, and words, gestures appear to be nonlinear, holistic, and imagistic. Gesture adds an important dimension to language understanding due to this property of sharing a common origin with speech while using a very different mechanism for transferring information. Ignoring this information when constructing a model of human communication would limit its potential effectiveness. Goal and Method: This thesis concerns the development of methods to effectively incorporate gestural information from a human communication into a computer model in order to more accurately interpret the content and structure of that communication. Levelt suggests that structure in human communication stems from the dynamic conscious process of language production, during which a conversant organizes the concepts to be expressed, plans the discourse, and selects appropriate words, prosody, and gestures while also correcting errors that occur in this process. Clues related to this conscious processing emerge in both the final speech stream and gestures. This thesis will attempt to utilize these clues to determine the structural elements of human-to-human dialogs, including sentence boundaries, topic boundaries, and disfluency structure. For this purpose, a data-driven approach is used. This work requires three important components: corpus generation, feature extraction, and model construction. Previous Work: Some work related to each of these components has already been conducted.
A data collection and processing protocol for constructing multimodal corpora has been created; details on the video and audio processing can be found in its Data and Annotation section. To improve the speed of producing a corpus while maintaining its quality, we have surveyed factors impacting the accuracy of forced alignments of transcriptions to audio files. These alignments provide a crucial temporal synchronization between video events and spoken words (and their components) for this research effort. We have also conducted measurement studies in an attempt to understand how to model multimodal conversations. For example, we have investigated the types of gesture patterns that occur during speech repairs. Recently, we constructed a preliminary model combining speech and gesture features for detecting sentence boundaries in videotaped dialogs. This model combines language and prosody models together with a simple gestural model to more effectively detect sentence boundaries. Future Work: To date, our multimodal corpora involve human monologues and dialogues (see http://vislab.cs.wright.edu/kdi). We are participating in the collection and preparation of a corpus of multi-party meetings (see http://vislab.cs.wright.edu/Projects/Meeting-Analysis). To facilitate the multi-channel audio processing, we are constructing a tool to support accurate audio transcription and alignment. The data from this meeting corpus will enable the development of more sophisticated gesture models, allowing us to expand the set of gesture features (e.g., spatial properties of the tracked gestures). Additionally, we will investigate more advanced machine learning methods in an attempt to improve the performance of our models. We also plan to expand our models to phenomena such as topic segmentation.
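As a rough illustration of how independent language, prosody, and gesture models might be combined at a candidate sentence boundary: each model emits a boundary probability at a word position, and the scores are fused. The weighted log-linear mixture and all numbers below are invented for illustration; the model described above may combine its knowledge sources differently.

```python
import math

# Each component model outputs P(boundary) at a word position; fuse
# them with a weighted log-linear mixture (weights are invented).

def fuse_boundary_scores(p_language, p_prosody, p_gesture,
                         weights=(0.5, 0.3, 0.2)):
    logp = sum(w * math.log(max(p, 1e-9))
               for w, p in zip(weights, (p_language, p_prosody, p_gesture)))
    return math.exp(logp)

# A pause (prosody) and a gesture retraction both support a boundary:
score = fuse_boundary_scores(0.6, 0.9, 0.8)
is_boundary = score > 0.5
print(is_boundary)  # True
```

The appeal of such a scheme is that each knowledge source can be trained and improved independently before fusion.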
Wilson, Dale-Marie (2004): Multimodal programming for dyslexic students. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 343. Available online
As the Web's role in society increases, so too does the need for its universality. Access to the Web by all, including people with disabilities, has become a requirement of Web sites, as can be seen in the passing of the Americans with Disabilities Act in 1990. This universality has spilled over into other disciplines, e.g. screen readers for Web browsing; however, Computer Science has not yet made significant efforts to do the same. The main focus of this research is to provide this universal access in the development of virtual learning environments, more specifically in computer programming. To facilitate this access, research into the features of dyslexia is required: what it is, how it affects a person's thought process, and what changes are necessary to accommodate these effects. Also necessary is a complete understanding of the thought process involved in the creation of a complete computer program. Dyslexia has been diagnosed as affecting the left side of the brain. The left side of the brain processes information in a linear, sequential manner. It is also responsible for processing symbols, which include letters, words and mathematical notations. Thus dyslexics have problems with the code generation, analysis and implementation steps in the creation of a computer program. A potential solution to this problem is a multimodal programming environment. This multimodal environment will be interactive, providing multimodal assistance to users as they generate, analyze and implement code. This assistance will include the ability to add functions and loops via voice and to receive a spoken description of a code segment selected by the cursor.
Eisenstein, Jacob (2004): Gestural cues for speech understanding. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 344. Available online
Chandrasekaran, Rajesh (2004): Using language structure for adaptive multimodal language acquisition. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 345. Available online
In human spoken communication, language structure plays a vital role in providing a framework for humans to understand each other. Using language rules, words are combined into meaningful sentences to represent knowledge. Speech-enabled systems based on a pre-programmed Rule Grammar suffer from constraints on vocabulary and sentence structures. To address this problem, in this paper, we discuss a language acquisition system that is capable of learning new words and their corresponding semantic meaning by initiating an adaptive dialog with the user. Thus, the vocabulary of the system can be increased in real time by the user. The language acquisition system is provided with knowledge about language structure and is capable of accepting multimodal user inputs that include speech, touch, pen-tablet, mouse, and keyboard. We discuss the efficiency of learning new concepts and the ease with which users can teach the system new concepts. The multimodal language acquisition system is capable of acquiring, in real time, new words that pertain to objects, actions or attributes, and their corresponding meanings. The first step in this process is to detect unknown words in the spoken utterance. Any new word that is detected is classified into one of the above-mentioned categories. The second step is to learn from the user the meaning of the word and add it to the semantic database. An unknown word is flagged whenever an utterance is not consistent with the pre-programmed Rule Grammar. Because the system can acquire words pertaining to objects, actions or attributes, we are interested in words that are nouns, verbs or adjectives. We use a transformation-based part-of-speech tagger that is capable of annotating English words with their part of speech to identify words in the utterance that are nouns, verbs and adjectives. These words are searched for in the semantic database and unknown words are identified.
The system then initiates an adaptive dialog with the user, requesting the user to provide the meaning of the unknown word. When the user has provided the relevant meaning using any of the input modalities, the system checks whether the meaning given corresponds to the category of the word, i.e. if the unknown word is a noun then the user can associate only an object with it or if the unknown word is a verb then only an action can be associated with the word. Thus, the system uses the knowledge of the occurrence of the word in the sentence to determine what kind of meaning can be associated with the word. The language structure thus gives the system a basic knowledge of the unknown word.
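The unknown-word detection step described above can be sketched as follows. The tiny tagger and lexicon here are stand-ins for the transformation-based tagger and semantic database mentioned in the abstract; all names are hypothetical.

```python
# Sketch of unknown-word detection: POS-tag the utterance, keep only
# open-class words (nouns, verbs, adjectives), and flag those absent
# from the semantic database. The lookup-table "tagger" and two-word
# lexicon are illustrative stand-ins, not the system's actual components.

POS = {"move": "VB", "the": "DT", "red": "JJ", "widget": "NN",
       "slowly": "RB"}
SEMANTIC_DB = {"move", "red"}          # known words with stored meanings
OPEN_CLASS = {"NN", "VB", "JJ"}        # nouns, verbs, adjectives

def unknown_words(utterance):
    flagged = []
    for word in utterance.lower().split():
        tag = POS.get(word, "NN")      # unseen tokens default to noun
        if tag in OPEN_CLASS and word not in SEMANTIC_DB:
            flagged.append((word, tag))
    return flagged

print(unknown_words("Move the red widget slowly"))
# [('widget', 'NN')]
```

Each flagged word's tag then constrains the follow-up dialog: a noun can only be bound to an object, a verb to an action, an adjective to an attribute.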
Lunsford, Rebecca (2004): Private speech during multimodal human-computer interaction. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 346. Available online
Bennett, Emily (2004): Projection augmented models: the effect of haptic feedback on subjective and objective human factors. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 347. Available online
Lisowska, Agnes (2004): Multimodal interface design for multimodal meeting content retrieval. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 348. Available online
This thesis will investigate which modalities, and in which combinations, are best suited for use in a multimodal interface that allows users to retrieve the content of recorded and processed multimodal meetings. The dual role of multimodality in the system (present in both the interface and the stored data) poses additional challenges. We will extend and adapt established approaches to HCI and multimodality [2, 3] to this new domain, maintaining a strongly user-driven approach to design.
Reeves, Leah M. (2004): Determining efficient multimodal information-interaction spaces for C2 systems. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 349. Available online
Military operations and friendly-fire mishaps over the last decade have demonstrated that Command, Control, Communications, Computers, Intelligence, Surveillance, and Reconnaissance (C4ISR) systems may often lack the ability to efficiently and effectively support operations in complex, time-critical environments. With the vast increase in the amount and type of information available, the challenge to today's military system designers is to create interfaces that allow warfighters to proficiently process the optimal amount of mission-essential data. To meet this challenge, multimodal system technology is showing great promise because, as the technology that supports C4ISR systems advances, the possibility of leveraging all of the human sensory systems emerges. The implication is that by facilitating the efficient use of a C4ISR operator's multiple information-processing resources, substantial gains in the information management capacity of the warfighter-computer integral may be realized. Despite its great promise, however, the potential of multimodal technology as a tool for streamlining interaction within military C4ISR environments may not be fully realized until the following guiding principles are identified: * how to combine visualization and multisensory display techniques for given users, tasks, and problem domains; * how task attributes should be represented (e.g., via which modality, or via multiple modalities); * which multimodal interaction technique(s) are most appropriate. Due to the current lack of empirical evidence and principle-driven guidelines, designers often encounter difficulties when choosing the most appropriate modal interaction techniques for given users, applications, or specific military command and control (C2) tasks within C4ISR systems.
The implication is that inefficient multimodal C2 system design may hinder our military's ability to fully support operations in complex, time-critical environments and thus impede warfighters' ability to achieve accurate situational awareness (SA) in a timely manner. Consequently, warfighters often become overwhelmed when provided with more information than they can accurately process. The development of multimodal design guidelines from both a user and a task-domain perspective is thus critical to the achievement of successful Human Systems Integration (HSI) within military environments such as C2 systems. This study provides preliminary empirical support for identifying user attributes, such as spatial ability (p < 0.02) and learning style (p < 0.03), which may aid in developing principle-driven guidelines for how and when to effectively present task-specific modal information to improve C2 warfighters' performance. A preliminary framework for modeling user interaction in multimodal C2 environments is also in development; it is based on existing theories and models of working memory, as well as on new insights gained from the latest imaging of electromagnetic (e.g., EEG, ERP, MEG) and hemodynamic (e.g., fMRI, PET) changes in the brain while users perform predefined tasks. This research represents an innovative way to both predict and accommodate a user's information-processing resources while interacting with multimodal systems. The current results and planned follow-on studies are facilitating the development of principle-driven multimodal design guidelines regarding how and when to adapt modes of interaction to meet the cognitive capabilities of users.
Although the initial application of these results is focused on determining how and when modalities should be presented, in isolation or in combination, to effectively present task-specific information to C4ISR warfighters, this research shows great potential for its applicability to the multimodal design community in general.
Ho, Cristy (2004): Using spatial warning signals to capture a driver's visual attention. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 350. Available online
This study was designed to assess the potential benefits of using spatial auditory or vibrotactile warning signals for driving performance, using a simulated driving task. Across six experiments, participants had to monitor a rapidly presented stream of distractor letters for occasional target digits (simulating an attention-demanding visual task, such as driving). Whenever participants heard an auditory cue (E1-E4) or felt a vibration (E5-E6), they had to check the front and the rearview mirror for the rapid approach of a car from in front or behind and respond accordingly (either by accelerating or braking). The efficacy of various auditory and vibrotactile warning signals in directing a participant's visual attention to the correct environmental position was compared (see Table 1). The results demonstrate the potential utility of semantically meaningful or spatial auditory and/or vibrotactile warning signals in interface design for directing a driver's, or other interface operator's, visual attention to time-critical events or information.
Patomäki, Saija (2004): Multimodal interfaces and applications for visually impaired children. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 351. Available online
Applications specifically designed for visually impaired children are rare. Additionally, this group of users is often unable to obtain the necessary applications and hardware for their homes because of the expense. However, their impairment should not exclude these children from the benefits and possibilities computers have to offer. In a modern society, the services and applications that computers open up can be considered a necessity for its citizens. This is the core of our research interest: to test various haptic devices and design usable applications that give this special user group the chance to become acquainted with computers, so that they are encouraged to use and benefit from the technology later in their lives. Similar research involving haptic sensation has been carried out by Sjöström, who has developed and tested haptic games used with the Phantom device; some of his applications are aimed at visually impaired children. During the project "Computer-based learning environment for visually impaired people" we designed, implemented and tested three different applications. Our target group was visually impaired children from three to seven years old. The applications were tested in three phases with the chosen subjects, and a special testing procedure was developed during the experiments. The applications were based on haptic and auditory feedback, but a simple graphical interface was available for those who were only partially blind. The chosen haptic device was the Phantom, a six-degrees-of-freedom input device used with a pen-like stylus. The stylus is attached to a robotic arm that generates force feedback to simulate touch. The first application consisted of simple materials and path shapes. In the user tests, the virtual materials were compared with real ones, and the various path shapes were meant to be traced with the stylus. 
The second application was more of a game-like environment: there were four haptic rooms in which children had to perform different tasks. The last tested application was a modification of the previous one; its user interface consisted of six rooms, and the tasks in them were simplified based on the results of the earlier user tests. Because the Phantom device is expensive, and also difficult for some of the children to use, we decided to replace it with simpler hardware. In our current project, "Multimodal Interfaces for Visually Impaired Children", the applications will be used with haptic devices such as a tactile mouse or a force-feedback joystick. Some applications are designed and implemented from scratch, and some are adapted from games originally meant for sighted children. The desired research outcome is practical: to produce workable user interfaces and applications whose functionality and cost are reasonable enough for them to be acquired for the homes of blind children.
Jiang, Feng, Yao, Hongxun and Yao, Guilin (2004): Multilayer architecture in sign language recognition system. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 352-353. Available online
Up to now, analytical and statistical methods have been used in sign language recognition with large vocabularies. Analytical methods such as Dynamic Time Warping (DTW) or Euclidean distance have been used for isolated word recognition, but their performance is not satisfactory because they are easily disturbed by noise. Statistical methods, especially hidden Markov models, are commonly used for both continuous sign language and isolated words, but as the vocabulary expands the processing time becomes increasingly unacceptable. Therefore, a multilayer architecture of sign language recognition for large vocabularies is proposed in this paper for the purpose of speeding up the recognition process. In this method, the gesture sequence to be recognized is first mapped to a set of easily confused words (a confusion set) through a global coarse search, and the gesture is then recognized through a subsequent local search; the confusion sets are generated by a DTW/ISODATA algorithm. Experimental results indicate that it is an effective algorithm for Chinese sign language recognition.
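The coarse-to-fine search the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: scalar features, a plain DTW, and a placeholder `fine_score` stand in for the paper's feature vectors and HMM scoring.

```python
import math

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two scalar feature sequences."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

def recognize(sequence, templates, fine_score, k=5):
    """Two-stage search: a cheap global DTW pass narrows the vocabulary to a
    confusion set of k candidates; a costlier local pass (e.g. an HMM
    likelihood) then decides among them."""
    coarse = sorted(templates, key=lambda w: dtw_distance(sequence, templates[w]))
    confusion_set = coarse[:k]
    return max(confusion_set, key=lambda w: fine_score(sequence, w))
```

Because only the k words in the confusion set reach the expensive scoring stage, the per-gesture cost of the second pass grows with k rather than with the full vocabulary size.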
Mäkinen, Erno (2004): Computer vision techniques and applications in human-computer interaction. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 354. Available online
There has been much research on computer vision in the last three decades, and computer vision methods have been developed for many different situations. One example is the detection of human faces, which is hard for computers: faces look different from different viewing directions, facial expressions alter the look of the face, each person has a unique face, the lighting conditions can vary, and so on. However, face detection is currently possible under limited conditions. In addition, there are methods that can be used for gender recognition, face recognition and facial expression recognition. Nonetheless, there has been very little research on how to combine these methods, and also quite little research on how to apply them in human-computer interaction (HCI). Finding sets of techniques that complement each other in a useful way is one research challenge. There are some applications that take advantage of one or two computer vision techniques. For example, Christian and Avery have developed an information kiosk that uses computer vision to detect potential users from a distance, and a similar kiosk has been developed by us at the University of Tampere. There are also some games that use simple computer vision techniques for interaction. However, there are very few applications that use several computer vision techniques together, such as face detection, facial expression recognition and gender recognition; overall, there has been very little effort to combine different techniques. In my research I develop computer vision methods and combine them, so that the combined method can detect a face and recognize gender and facial expressions. After the methods are successfully combined, it will be easier to develop HCI applications that take advantage of computer vision. Applications that can be used by small groups of people are my specific interest. 
These applications allow me to build adaptive user interfaces and analyze the use of computer vision techniques in improving human-computer interaction.
Bolelli, Levent (2004): Multimodal response generation in GIS. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 355. Available online
Advances in computer hardware and software technologies have enabled sophisticated information visualization techniques, as well as new interaction opportunities, to be introduced in the development of GIS (Geographical Information Systems) applications. In particular, research efforts in computer vision and natural language processing have enabled users to interact with computer applications using natural speech and gestures, which has proven effective for interacting with dynamic maps [1, 6]. Pen-based mobile devices and gesture recognition systems enable system designers to define application-specific gestures for carrying out particular tasks, and using a force-feedback mouse for interacting with GIS has been proposed for visually impaired people. These are exciting new opportunities and hold the promise of advancing interaction with computers to a completely new level. The ultimate aim, however, should be to facilitate human-computer communication; that is, equal emphasis should be given to both understanding and generation of multimodal behavior. My proposed research will provide a conceptual framework and a computational model for generating multimodal responses to communicate spatial information along with dynamically generated maps. The model will eventually lead to the development of a computational agent with the reasoning capabilities to distribute the semantic and pragmatic content of the intended response message among speech, deictic gestures and visual information. In other words, the system will be able to select the most natural and effective mode(s) of communicating back to the user. Any research in computer science that investigates direct interaction of computers with humans should place human factors at center stage. 
Therefore, this work will follow a multi-disciplinary approach and integrate ideas from previous research in Psychology, Cognitive Science, Linguistics, Cartography, Geographical Information Science (GIScience) and Computer Science. These will enable us to identify and address the human, cartographic and computational issues involved in response planning, and to assist users with their spatial decision making by facilitating their visual thinking process as well as reducing their cognitive load. The methodology will be integrated into the design of the DAVE_G prototype: a natural, multimodal, mixed-initiative dialogue interface to GIS to support emergency management. The system is currently capable of recognizing, interpreting and fusing users' naturally occurring speech and gesture requests, and of generating natural speech output. The communication between the system and the user is modeled following collaborative discourse theory and maintains a Recipe Graph structure -- based on SharedPlan theory -- to represent the intentional structure of the discourse between the user and the system. One major concern in generating speech responses for dynamic maps is that spatial information cannot be effectively communicated using speech alone. Altering perceptual attributes (e.g. color, size, pattern) of the visual data to direct the user's attention to a particular location on the map is not usually effective, since each attribute bears an inherent semantic meaning, and those perceptual attributes should be modified only when the system judges that they are not crucial to the user's understanding of the situation at that stage of the task. Gesticulation, on the other hand, is powerful for conveying the location and form of spatially oriented information without manipulating the map, and has the added benefit of facilitating speech production. 
My research aims at designing a feasible, extensible and effective multimodal response generation (content planning and modality allocation) model. A plan-based reasoning algorithm and methodology integrated with the Recipe Graph structure has the potential to achieve those goals.
Rauschert, Ingmar (2004): Adaptive multimodal recognition of voluntary and involuntary gestures of people with motor disabilities. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. p. 356. Available online
Bernsen, Niels Ole and Dybkjær, Laila (2004): Evaluation of spoken multimodal conversation. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 38-45. Available online
Spoken multimodal dialogue systems in which users address face-only or embodied interface agents have been gaining ground in research for some time. Although most systems are still strictly task-oriented, the field is now moving towards domain-oriented systems and real conversational systems which are no longer defined in terms of the task(s) they support. This paper describes the first running prototype of such a system which enables spoken and gesture interaction with life-like fairytale author Hans Christian Andersen about his fairytales, life, study, etc., focusing on multimodal conversation. We then present recent user test evaluation results on multimodal conversation.
Understanding human-human interaction is fundamental to the long-term pursuit of powerful and natural multimodal interfaces. Nonverbal communication, including body posture, gesture, facial expression, and eye gaze, is an important aspect of human-human interaction. We introduce a paradigm for studying multimodal and nonverbal communication in collaborative virtual environments (CVEs) called Transformed Social Interaction (TSI), in which a user's visual representation is rendered in a way that strategically filters selected communication behaviors in order to change the nature of a social interaction. To achieve this, TSI must employ technology to detect, recognize, and manipulate behaviors of interest, such as facial expressions, gestures, and eye gaze. In previous work we presented a TSI experiment called non-zero-sum gaze (NZSG) to determine the effect of manipulated eye gaze on persuasion in a small group setting. Eye gaze was manipulated so that each participant in a three-person CVE received eye gaze from a presenter that was normal, less than normal, or greater than normal. We review this experiment and discuss the implications of TSI for multimodal interfaces.
Heidemann, Gunther, Bax, Ingo and Bekel, Holger (2004): Multimodal interaction in an augmented reality scenario. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 53-60. Available online
We describe an augmented reality system designed for online acquisition of visual knowledge and retrieval of memorized objects. The system relies on a head mounted camera and display, which allow the user to view the environment together with overlaid augmentations by the system. In this setup, communication by hand gestures and speech is mandatory as common input devices like mouse and keyboard are not available. Using gesture and speech, basically three types of tasks must be handled: (i) Communication with the system about the environment, in particular, directing attention towards objects and commanding the memorization of sample views; (ii) control of system operation, e.g. switching between display modes; and (iii) re-adaptation of the interface itself in case communication becomes unreliable due to changes in external factors, such as illumination conditions. We present an architecture to manage these tasks and describe and evaluate several of its key elements, including modules for pointing gesture recognition, menu control based on gesture and speech, and control strategies to cope with situations when vision becomes unreliable and has to be re-adapted by speech.
Barthelmess, Paulo and Ellis, Clarence A. (2004): The ThreadMill architecture for stream-oriented human communication analysis applications. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 61-68. Available online
This work introduces a new component software architecture -- ThreadMill -- whose main purpose is to facilitate the development of applications in domains where high volumes of streamed data need to be efficiently analyzed. It focuses particularly on applications that target the analysis of human communication, e.g. speech and gesture recognition. Applications in this domain usually employ costly signal-processing techniques, but in many cases they offer ample opportunities for concurrent execution in many different phases. ThreadMill's abstractions facilitate the development of applications that take advantage of this potential concurrency by hiding the complexity of parallel and distributed programming. As a result, ThreadMill applications can be made to run unchanged on a wide variety of execution environments, ranging from a single-processor machine to a cluster of multi-processor nodes. The architecture is illustrated by an implementation of a tracker for the hands and face of American Sign Language signers that uses a parallel and concurrent version of the Joint Likelihood Filter method.
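The kind of abstraction described above -- concurrency hidden behind a stream-processing component -- can be sketched as a stage that wraps worker threads behind plain queues. This is an illustrative toy, not ThreadMill's actual API.

```python
import queue
import threading

class Stage:
    """A pipeline stage that consumes items from an input queue, applies a
    user-supplied function in several worker threads, and emits results on an
    output queue.  The caller sees only queues; the concurrency is hidden."""
    def __init__(self, fn, workers=2):
        self.fn = fn
        self.inbox = queue.Queue()
        self.outbox = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            item = self.inbox.get()
            if item is None:          # poison pill: shut this worker down
                self.inbox.put(None)  # re-post it so sibling workers stop too
                break
            self.outbox.put(self.fn(item))

# A stage doubling each "frame"; the same topology could, in principle, be
# mapped onto processes or cluster nodes without changing the caller's code.
stage = Stage(lambda frame: frame * 2, workers=3)
for frame in range(5):
    stage.inbox.put(frame)
results = sorted(stage.outbox.get() for _ in range(5))
```

Chaining stages by feeding one stage's `outbox` into the next stage's `inbox` yields the stream-oriented pipeline shape the architecture targets.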
Wilson, Andrew D. (2004): TouchLight: an imaging touch screen and display for gesture-based interaction. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 69-76. Available online
A novel touch screen technology is presented. TouchLight uses simple image processing techniques to combine the output of two video cameras placed behind a semi-transparent plane in front of the user. The resulting image shows objects that are on the plane. This technique is well suited for application with a commercially available projection screen material (DNP HoloScreen) which permits projection onto a transparent sheet of acrylic plastic in normal indoor lighting conditions. The resulting touch screen display system transforms an otherwise normal sheet of acrylic plastic into a high bandwidth input/output surface suitable for gesture-based interaction. Image processing techniques are detailed, and several novel capabilities of the system are outlined.
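One way to read the fusion step -- a point on the screen plane projects to the same pixel in both rectified camera views -- is as a pixelwise minimum over the two views. The sketch below illustrates only that principle on toy intensity grids; the actual system's image processing (rectification, filtering) is not reproduced here.

```python
def touch_image(cam_a, cam_b, threshold=0.5):
    """Fuse two camera views rectified to the screen plane.  Objects on the
    plane appear at the same pixel in both views, so the pixelwise minimum
    keeps them while suppressing off-plane objects, which land at different
    pixels in the two views."""
    return [[1 if min(a, b) > threshold else 0
             for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(cam_a, cam_b)]

# A fingertip on the plane is bright in both views; an off-plane object is
# bright in only one view and is rejected.
view_a = [[0.9, 0.0],
          [0.0, 0.8]]   # off-plane object at (0, 0), fingertip at (1, 1)
view_b = [[0.0, 0.0],
          [0.0, 0.9]]   # fingertip at (1, 1) only
mask = touch_image(view_a, view_b)
```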
Bouguila, Laroussi, Evéquoz, Florian, Courant, Michèle and Hirsbrunner, Béat (2004): Walking-pad: a step-in-place locomotion interface for virtual environments. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 77-81. Available online
This paper presents a new locomotion interface that lets users engage in a life-like walking experience by stepping in place. Stepping actions are performed on top of a flat platform with an embedded grid of switch sensors that detect footfall pressure. From the sensor data, the system computes variables that represent the user's walking behavior, such as walking direction, walking speed, standing still, and jumping. The overall platform status is scanned at a rate of 100 Hz, which allows real-time visual feedback in response to user actions. The proposed system is portable and easy to integrate into major virtual environments with large projection features, such as CAVE and DOME systems. The Walking-Pad weighs less than 5 kg and connects to any computer via a USB port, so it can even be controlled from a portable computer.
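As a rough illustration of how switch-grid scans might be turned into walking variables, consider the two hypothetical helpers below. Only the 100 Hz scan rate is taken from the paper; the function names and the centroid/step-rate approach are assumptions for the sketch.

```python
def footfall_centroid(grid):
    """Centre of pressure of the active switches in one scan of the grid;
    returns None when no switch is pressed (e.g. mid-jump)."""
    pts = [(r, c) for r, row in enumerate(grid)
                  for c, on in enumerate(row) if on]
    if not pts:
        return None
    n = len(pts)
    return (sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n)

def step_rate(events, scan_hz=100):
    """Estimate stepping frequency (steps/s) from the scan indices at which
    footfalls were detected, given the platform's 100 Hz scan rate."""
    if len(events) < 2:
        return 0.0
    return (len(events) - 1) * scan_hz / (events[-1] - events[0])
```

Tracking how the centroid drifts between scans would give a direction estimate, while the step rate can drive the virtual walking speed.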
Chen, Datong, Malkin, Robert and Yang, Jie (2004): Multimodal detection of human interaction events in a nursing home environment. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 82-89. Available online
In this paper, we propose a multimodal system for detecting human activity and interaction patterns in a nursing home. Activities of groups of people are first treated as interaction patterns between any pair of partners and are then further broken into individual activities and behavior events using a multi-level context hierarchy graph. The graph is implemented using a dynamic Bayesian network to statistically model the multi-level concepts. We have developed a coarse-to-fine prototype system to illustrate the proposed concept. Experimental results have demonstrated the feasibility of the proposed approaches. The objective of this research is to automatically create concise and comprehensive reports of activities and behaviors of patients to support physicians and caregivers in a nursing facility.
Stein, Randy and Brennan, Susan E. (2004): Another person's eye gaze as a cue in solving programming problems. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 9-15. Available online
Expertise in computer programming can often be difficult to transfer verbally. Moreover, technical training and communication occur more and more between people who are located at a distance. We tested the hypothesis that seeing one person's visual focus of attention (represented as an eyegaze cursor) while debugging software (displayed as text on a screen) can be helpful to another person doing the same task. In an experiment, a group of professional programmers searched for bugs in small Java programs while wearing an unobtrusive head-mounted eye tracker. Later, a second set of programmers searched for bugs in the same programs. For half of the bugs, the second set of programmers first viewed a recording of an eyegaze cursor from one of the first programmers displayed over the (indistinct) screen of code, and for the other half they did not. The second set of programmers found the bugs more quickly after viewing the eye gaze of the first programmers, suggesting that another person's eye gaze, produced instrumentally (as opposed to intentionally, like pointing with a mouse), can be a useful cue in problem solving. This finding supports the potential of eye gaze as a valuable cue for collaborative interaction in a visuo-spatial task conducted at a distance.
Juster, Joshua and Roy, Deb (2004): Elvis: situated speech and gesture understanding for a robotic chandelier. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 90-96. Available online
We describe a home lighting robot that uses directional spotlights to create complex lighting scenes. The robot senses its visual environment using a panoramic camera and attempts to maintain its target goal state by adjusting the positions and intensities of its lights. Users can communicate desired changes in the lighting environment through speech and gesture (e.g., "Make it brighter over there"). Information obtained from these two modalities is combined to form a goal, a desired change in the lighting of the scene. This goal is then incorporated into the system's target goal state. When the target goal state and the world are out of alignment, the system formulates a sensorimotor plan that acts on the world to return the system to homeostasis.
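The homeostasis loop described here can be sketched as a goal-state update followed by a proportional correction. The region names, the simple proportional controller, and both helper functions are assumptions for illustration, not taken from the paper.

```python
def apply_request(target, region, delta):
    """Fold a multimodal request (e.g. 'make it brighter over there') into
    the target goal state; `region` would come from gesture resolution and
    `delta` from the speech interpretation."""
    new = dict(target)
    new[region] = max(0.0, min(1.0, new.get(region, 0.0) + delta))
    return new

def homeostasis_step(target, sensed, gain=0.5):
    """One sensorimotor step: nudge each light's sensed intensity a fraction
    of the way toward its target.  Repeating this until target and world
    agree models the return to homeostasis."""
    return {light: level + gain * (target[light] - level)
            for light, level in sensed.items()}

# "Make it brighter over there" -> raise the desk region's target, then act.
goal = apply_request({"desk": 0.2, "sofa": 0.6}, "desk", 0.5)
world = homeostasis_step(goal, {"desk": 0.2, "sofa": 0.6})
```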
Kopp, Stefan, Tepper, Paul and Cassell, Justine (2004): Towards integrated microplanning of language and iconic gesture for multimodal output. In: Proceedings of the 2004 International Conference on Multimodal Interfaces 2004. pp. 97-104. Available online
When talking about spatial domains, humans frequently accompany their explanations with iconic gestures to depict what they are referring to. For example, when giving directions, it is common to see people making gestures that indicate the shape of buildings, or outline a route to be taken by the listener, and these gestures are essential to the understanding of the directions. Based on results from an ongoing study on language and gesture in direction-giving, we propose a framework to analyze such gestural images into semantic units (image description features), and to link these units to morphological features (hand shape, trajectory, etc.). This feature-based framework allows us to generate novel iconic gestures for embodied conversational agents, without drawing on a lexicon of canned gestures. We present an integrated microplanner that derives the form of both coordinated natural language and iconic gesture directly from given communicative goals, and serves as input to the speech and gesture realization engine in our NUMACK project.