Proceedings of the 2003 International Conference on Multimodal Interfaces


 
Time and place: 2003

Conference description:
The International Conference on Multimodal Interfaces (ICMI) is an annual ACM-sponsored conference that promotes research in next-generation perceptive, adaptive and multimodal user interfaces. These new interfaces are especially well suited for interpreting natural communication and activity patterns in real-world environments.
Series:
This is a preferred venue for people like Trevor Darrell, Wen Gao, Rainer Stiefelhagen, Jie Yang, and Francis K. H. Quek. Part of the ICMI - International Conference on Multimodal Interfaces conference series.

References from this conference (2003)

The following articles are from "Proceedings of the 2003 International Conference on Multimodal Interfaces":


Articles

p. 1

Jain, Anil K. (2003): Multimodal user interfaces: who's the user?. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. p. 1. Available online

A wide variety of systems require reliable personal recognition schemes to either confirm or determine the identity of an individual requesting their services. The purpose of such schemes is to ensure that only a legitimate user, and not anyone else, accesses the rendered services. Examples of such applications include secure access to buildings, computer systems, laptops, cellular phones and ATMs. Biometric recognition, or simply biometrics, refers to the automatic recognition of individuals based on their physiological and/or behavioral characteristics. By using biometrics it is possible to confirm or establish an individual's identity based on "who she is", rather than by "what she possesses" (e.g., an ID card) or "what she remembers" (e.g., a password). Current biometric systems make use of fingerprints, hand geometry, iris, face, voice, etc. to establish a person's identity. Biometric systems also introduce an aspect of user convenience. For example, they alleviate the need for a user to remember multiple passwords associated with different applications. A biometric system that uses a single biometric trait for recognition has to contend with problems related to non-universality of the trait, spoof attacks, limited degrees of freedom, large intra-class variability, and noisy data. Some of these problems can be addressed by integrating the evidence presented by multiple biometric traits of a user (e.g., face and iris). Such systems, known as multimodal biometric systems, demonstrate substantial improvement in recognition performance. In this talk, we will present various applications of biometrics, challenges associated in designing biometric systems, and various fusion strategies available to implement a multimodal biometric system.

© All rights reserved Jain and/or his/her publisher
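As a rough sketch of the score-level fusion strategy this talk alludes to (not Jain's actual system), the snippet below combines two normalized matcher scores with a weighted sum; the weights and acceptance threshold are invented for illustration.

```python
# Illustrative score-level fusion of two biometric traits.
# Scores are assumed to be normalized to [0, 1], higher = better match;
# the weights and threshold are hypothetical.

def fuse_scores(face_score: float, iris_score: float,
                w_face: float = 0.4, w_iris: float = 0.6) -> float:
    """Weighted-sum fusion of two normalized matcher scores."""
    return w_face * face_score + w_iris * iris_score

def accept(face_score: float, iris_score: float, threshold: float = 0.7) -> bool:
    """Accept the identity claim if the fused score clears the threshold."""
    return fuse_scores(face_score, iris_score) >= threshold

if __name__ == "__main__":
    # A noisy face sample can be compensated for by a strong iris match.
    print(accept(face_score=0.55, iris_score=0.85))  # True
    print(accept(face_score=0.40, iris_score=0.50))  # False
```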

p. 101-108

Reithinger, Norbert, Alexandersson, Jan, Becker, Tilman, Blocher, Anselm, Engel, Ralf, Löckelt, Markus, Müller, Jochen, Pfleger, Norbert, Poller, Peter, Streit, Michael and Tschernomas, Valentin (2003): SmartKom: adaptive and flexible multimodal access to multiple applications. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 101-108. Available online

The development of an intelligent user interface that supports multimodal access to multiple applications is a challenging task. In this paper we present a generic multimodal interface system where the user interacts with an anthropomorphic personalized interface agent using speech and natural gestures. The knowledge-based and uniform approach of SmartKom enables us to realize a comprehensive system that understands imprecise, ambiguous, or incomplete multimodal input and generates coordinated, cohesive, and coherent multimodal presentations for three scenarios, currently addressing more than 50 different functionalities of 14 applications. We demonstrate the main ideas in a walk through the main processing steps from modality fusion to modality fission.

© All rights reserved Reithinger et al. and/or their publisher

p. 109-116

Flippo, Frans, Krebs, Allen and Marsic, Ivan (2003): A framework for rapid development of multimodal interfaces. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 109-116. Available online

Despite the availability of multimodal devices, there are very few commercial multimodal applications available. One reason for this may be the lack of a framework to support development of multimodal applications in reasonable time and with limited resources. This paper describes a multimodal framework enabling rapid development of applications using a variety of modalities and methods for ambiguity resolution, featuring a novel approach to multimodal fusion. An example application is studied that was created using the framework.

© All rights reserved Flippo et al. and/or their publisher

p. 117-124

Sinha, Anoop K. and Landay, James A. (2003): Capturing user tests in a multimodal, multidevice informal prototyping tool. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 117-124. Available online

Interaction designers are increasingly faced with the challenge of creating interfaces that incorporate multiple input modalities, such as pen and speech, and span multiple devices. Few early stage prototyping tools allow non-programmers to prototype these interfaces. Here we describe CrossWeaver, a tool for informally prototyping multimodal, multidevice user interfaces. This tool embodies the informal prototyping paradigm, leaving design representations in an informal, sketched form, and creates a working prototype from these sketches. CrossWeaver allows a user interface designer to sketch storyboard scenes on the computer, specifying simple multimodal command transitions between scenes. The tool also allows scenes to target different output devices. Prototypes can run across multiple standalone devices simultaneously, processing multimodal input from each one. Thus, a designer can visually create a multimodal prototype for a collaborative meeting or classroom application. CrossWeaver captures all of the user interaction when running a test of a prototype. This input log can quickly be viewed visually for the details of the users' multimodal interaction or it can be replayed across all participating devices, giving the designer information to help him or her analyze and iterate on the interface design.

© All rights reserved Sinha and Landay and/or their publisher

p. 12-19

Kaiser, Ed, Olwal, Alex, McGee, David, Benko, Hrvoje, Corradini, Andrea, Li, Xiaoguang, Cohen, Phil and Feiner, Steven (2003): Mutual disambiguation of 3D multimodal interaction in augmented and virtual reality. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 12-19. Available online

We describe an approach to 3D multimodal interaction in immersive augmented and virtual reality environments that accounts for the uncertain nature of the information sources. The resulting multimodal system fuses symbolic and statistical information from a set of 3D gesture, spoken language, and referential agents. The referential agents employ visible or invisible volumes that can be attached to 3D trackers in the environment, and which use a time-stamped history of the objects that intersect them to derive statistics for ranking potential referents. We discuss the means by which the system supports mutual disambiguation of these modalities and information sources, and show through a user study how mutual disambiguation accounts for over 45% of the successful 3D multimodal interpretations. An accompanying video demonstrates the system in action.

© All rights reserved Kaiser et al. and/or their publisher

p. 125-131

Fang, Gaolin, Gao, Wen and Zhao, Debin (2003): Large vocabulary sign language recognition based on hierarchical decision trees. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 125-131. Available online

The major difficulty for large-vocabulary sign language or gesture recognition lies in the huge search space due to the variety of recognized classes. How to reduce recognition time without loss of accuracy is a challenging issue. In this paper, a hierarchical decision tree is first presented for large-vocabulary sign language recognition based on the divide-and-conquer principle. As each sign feature has a different importance to gestures, corresponding classifiers are proposed for the hierarchical decision on gesture attributes. A one-/two-handed classifier with little computational cost is first used to eliminate many impossible candidates. A subsequent hand-shape classifier is then applied to the remaining candidate space. Finally, an SOFM/HMM classifier is employed to obtain the final results at the last non-leaf nodes, which include only a few candidates. Experimental results on a large vocabulary of 5113 signs show that the proposed method reduces recognition time by a factor of 11 and also improves the recognition rate by about 0.95% over a single SOFM/HMM.

© All rights reserved Fang et al. and/or their publisher
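To make the coarse-to-fine idea concrete, here is a minimal cascade sketch in Python; the feature names, the toy lexicon and the dummy fine-grained scorer are placeholders, not the paper's one-/two-handed, hand-shape and SOFM/HMM classifiers.

```python
# A minimal coarse-to-fine cascade: cheap tests prune the vocabulary before
# an expensive scorer runs on the few surviving candidates. All names and
# data below are hypothetical.

from typing import Callable, Dict, List

def cascade_classify(features: Dict,
                     lexicon: List[Dict],
                     fine_score: Callable[[Dict, Dict], float]) -> Dict:
    # Stage 1: a cheap one-/two-handed test prunes most of the vocabulary.
    candidates = [s for s in lexicon if s["two_handed"] == features["two_handed"]]
    # Stage 2: the hand-shape class narrows the candidate set further.
    candidates = [s for s in candidates if s["handshape"] == features["handshape"]]
    # Stage 3: run the expensive model only on the remaining candidates.
    return max(candidates, key=lambda sign: fine_score(features, sign))

# Tiny demo with a dummy fine-grained scorer.
lexicon = [
    {"gloss": "HELLO",  "two_handed": False, "handshape": "flat"},
    {"gloss": "THANKS", "two_handed": False, "handshape": "flat"},
    {"gloss": "WITH",   "two_handed": True,  "handshape": "fist"},
]
features = {"two_handed": False, "handshape": "flat", "trajectory_len": 12}
best = cascade_classify(features, lexicon,
                        fine_score=lambda f, s: -abs(f["trajectory_len"] - len(s["gloss"])))
print(best["gloss"])  # "THANKS" with this dummy scorer
```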

p. 132-139

Xiong, Yingen, Quek, Francis and McNeill, David (2003): Hand motion gestural oscillations and multimodal discourse. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 132-139. Available online

To develop multimodal interfaces, one needs to understand the constraints underlying human communicative gesticulation and the kinds of features one may compute based on these underlying human characteristics. In this paper we address hand motion oscillatory gesture detection in natural speech and conversation. First, the hand motion trajectory signals are extracted from video. Second, a wavelet analysis based approach is presented to process the signals. In this approach, wavelet ridges are extracted from the responses of wavelet analysis for the hand motion trajectory signals, which can be used to characterize frequency properties of the hand motion signals. The hand motion oscillatory gestures can be extracted from these frequency properties. Finally, we relate the hand motion oscillatory gestures to the phases of speech and multimodal discourse analysis. We demonstrate the efficacy of the system on a real discourse dataset in which a subject described her action plan to an interlocutor. We extracted the oscillatory gestures from the x, y and z motion traces of both hands. We further demonstrate the power of gestural oscillation detection as a key to unlock the structure of the underlying discourse.

© All rights reserved Xiong et al. and/or their publisher

p. 140-146

Nickel, Kai and Stiefelhagen, Rainer (2003): Pointing gesture recognition based on 3D-tracking of face, hands and head orientation. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 140-146. Available online

In this paper, we present a system capable of visually detecting pointing gestures and estimating the 3D pointing direction in real-time. In order to acquire input features for gesture recognition, we track the positions of a person's face and hands on image sequences provided by a stereo-camera. Hidden Markov Models (HMMs), trained on different phases of sample pointing gestures, are used to classify the 3D-trajectories in order to detect the occurrence of a gesture. When analyzing sample pointing gestures, we noticed that humans tend to look at the pointing target while performing the gesture. In order to utilize this behavior, we additionally measured head orientation by means of a magnetic sensor in a similar scenario. By using head orientation as an additional feature, we observed significant gains in both recall and precision of pointing gestures. Moreover, the percentage of correctly identified pointing

© All rights reserved Nickel and Stiefelhagen and/or their publisher
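The head-orientation cue can be illustrated with a toy geometric check (a stand-in for, and much simpler than, the paper's HMM pipeline): accept a pointing hypothesis only if the head-to-hand ray roughly agrees with the measured head direction. The 25-degree threshold and the coordinates are invented.

```python
# Toy agreement test between a candidate pointing ray and head orientation.
# Positions are 3D coordinates in metres; the threshold is hypothetical.

import numpy as np

def pointing_agrees_with_gaze(head_pos, hand_pos, head_dir, max_angle_deg=25.0):
    """Return True if the head->hand ray is within max_angle_deg of head_dir."""
    ray = np.asarray(hand_pos, float) - np.asarray(head_pos, float)
    ray /= np.linalg.norm(ray)
    head_dir = np.asarray(head_dir, float) / np.linalg.norm(head_dir)
    angle = np.degrees(np.arccos(np.clip(np.dot(ray, head_dir), -1.0, 1.0)))
    return angle <= max_angle_deg

# Example: the hand is stretched out roughly where the user is looking.
print(pointing_agrees_with_gaze([0, 1.7, 0], [0.4, 1.4, 0.6], [0.5, -0.3, 0.7]))  # True
```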

p. 147-150

Ko, T., Demirdjian, D. and Darrell, T. (2003): Untethered gesture acquisition and recognition for a multimodal conversational system. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 147-150. Available online

Humans use a combination of gesture and speech to convey meaning, and usually do so without holding a device or pointer. We present a system that incorporates body tracking and gesture recognition for an untethered human-computer interface. This research focuses on a module that provides parameterized gesture recognition, using various machine learning techniques. We train the support vector classifier to model the boundary of the space of possible gestures, and train Hidden Markov Models on specific gestures. Given a sequence, we can find the start and end of various gestures using a support vector classifier, and find gesture likelihoods and parameters with a HMM. Finally multimodal recognition is performed using rank-order fusion to merge speech and vision hypotheses.

© All rights reserved Ko et al. and/or their publisher
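Rank-order fusion itself is easy to sketch; the following toy function merges best-first hypothesis lists from speech and vision by summed rank. The hypothesis strings are made up and this is not the authors' implementation.

```python
# Toy rank-order fusion: each list is ordered best-first, and the fused winner
# is the hypothesis with the lowest summed rank across modalities.

def rank_fuse(speech_hyps, vision_hyps):
    all_hyps = set(speech_hyps) | set(vision_hyps)
    fused = {}
    for h in all_hyps:
        total = 0
        for hyps in (speech_hyps, vision_hyps):
            # A hypothesis missing from one list gets that list's worst rank.
            total += hyps.index(h) if h in hyps else len(hyps)
        fused[h] = total
    return min(fused, key=fused.get)

print(rank_fuse(["point at map", "wave"],
                ["point at map", "circle", "wave"]))  # -> "point at map"
```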

p. 151-158

Kaur, Manpreet, Tremaine, Marilyn M., Huang, Ning, Wilder, Joseph, Gacovski, Zoran, Flippo, Frans and Mantravadi, Chandra Sekhar (2003): Where is "it"? Event Synchronization in Gaze-Speech Input Systems. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 151-158. Available online

The relationship between gaze and speech is explored for the simple task of moving an object from one location to another on a computer screen. The subject moves a designated object from a group of objects to a new location on the screen by stating, "Move it there". Gaze and speech data are captured to determine if we can robustly predict the selected object and destination position. We have found that the source fixation closest to the desired object begins, with high probability, before the beginning of the word "Move". An analysis of all fixations before and after speech onset time shows that the fixation that best identifies the object to be moved occurs, on average, 630 milliseconds before speech onset with a range of 150 to 1200 milliseconds for individual subjects. The variance in these times for individuals is relatively small although the variance across subjects is large. Selecting a fixation closest to the onset of the word "Move" as the designator of the object to be moved gives a system accuracy close to 95% for all subjects. Thus, although significant differences exist between subjects, we believe that the speech and gaze integration patterns can be modeled reliably for individual users and therefore be used to improve the performance of multimodal systems.

© All rights reserved Kaur et al. and/or their publisher
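The selection rule reported above can be written down directly; the snippet below picks the fixation whose start time lies closest to the onset of the word "Move". The timestamps are invented.

```python
# Pick the fixated object whose fixation start is nearest the speech onset.
# Times are in seconds; the data is illustrative only.

def select_target_fixation(fixations, move_onset):
    """fixations: list of (start_time_s, object_id) pairs."""
    return min(fixations, key=lambda f: abs(f[0] - move_onset))[1]

fixations = [(11.20, "circle"), (11.85, "square"), (12.60, "triangle")]
print(select_target_fixation(fixations, move_onset=11.9))  # "square"
```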

p. 159-163

Rudmann, Darrell S., McConkie, George W. and Zheng, Xianjun Sam (2003): Eyetracking in cognitive state detection for HCI. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 159-163. Available online

1. Past research in a number of fields confirms the existence of a link between cognition and eye movement control, beyond simply a pointing relationship. This being the case, it should be possible to use eye movement recording as a basis for detecting users' cognitive states in real time. Several examples of such cognitive state detectors have been reported in the literature. 2. A multi-disciplinary project is described in which the goal is to provide the computer with as much real-time information about the human state (cognitive, affective and motivational state) as possible, and to base computer actions on this information. The application area in which this is being implemented is science education, learning about gears through exploration. Two studies are reported in which participants solve simple problems of pictured gear trains while their eye movements are recorded. The first study indicates that most eye movement sequences are compatible with predictions of a simple sequential cognitive model, and it is suggested that those sequences that do not fit the model may be of particular interest in the HCI context as indicating problems or alternative mental strategies. The mental rotation of gears sometimes produces sequences of short eye movements in the direction of motion; thus, such sequences may be useful as cognitive state detectors. The second study tested the hypothesis that participants are thinking about the object to which their eyes are directed. In this study, the display was turned off partway through the process of solving a problem, and the participants reported what they were thinking about at that time. While in most cases the participants reported cognitive activities involving the fixated object, this was not the case on a sizeable number of trials.

© All rights reserved Rudmann et al. and/or their publisher

p. 164-171

Yu, Chen and Ballard, Dana H. (2003): A multimodal learning interface for grounding spoken language in sensory perceptions. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 164-171. Available online

Most speech interfaces are based on natural language processing techniques that use pre-defined symbolic representations of word meanings and process only linguistic information. To understand and use language like their human counterparts in multimodal human-computer interaction, computers need to acquire spoken language and map it to other sensory perceptions. This paper presents a multimodal interface that learns to associate spoken language with perceptual features by being situated in users' everyday environments and sharing user-centric multisensory information. The learning interface is trained in unsupervised mode in which users perform everyday tasks while providing natural language descriptions of their behaviors. We collect acoustic signals in concert with multisensory information from non-speech modalities, such as user's perspective video, gaze positions, head directions and hand movements. The system firstly estimates users' focus of attention from eye and head cues. Attention, as represented by gaze fixation, is used for spotting the target object of user interest. Attention switches are calculated and used to segment an action sequence into action units which are then categorized by mixture hidden Markov models. A multimodal learning algorithm is developed to spot words from continuous speech and then associate them with perceptually grounded meanings extracted from visual perception and action. Successful learning has been demonstrated in the experiments of three natural tasks: "unscrewing a jar", "stapling a letter" and "pouring water".

© All rights reserved Yu and Ballard and/or their publisher

p. 172-175

Massaro, Dominic W. (2003): A computer-animated tutor for spoken and written language learning. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 172-175. Available online

Baldi, a computer-animated talking head, is introduced. The quality of his visible speech has been repeatedly modified and evaluated to accurately simulate naturally talking humans. Baldi's visible speech can be appropriately aligned with either synthesized or natural auditory speech. Baldi has had great success in teaching vocabulary and grammar to children with language challenges and training speech distinctions to children with hearing loss and to adults learning a new language. We demonstrate these learning programs and also demonstrate several other potential application areas for Baldi.

© All rights reserved Massaro and/or his/her publisher

p. 176-179

Gorniak, Peter and Roy, Deb (2003): Augmenting user interfaces with adaptive speech commands. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 176-179. Available online

We present a system that augments any unmodified Java application with an adaptive speech interface. The augmented system learns to associate spoken words and utterances with interface actions such as button clicks. Speech learning is constantly active and searches for correlations between what the user says and does. Training the interface is seamlessly integrated with using the interface. As the user performs normal actions, she may optionally verbally describe what she is doing. By using a phoneme recognizer, the interface is able to quickly learn new speech commands. Speech commands are chosen by the user and can be recognized robustly due to accurate phonetic modelling of the user's utterances and the small size of the vocabulary learned for a single application. After only a few examples, speech commands can replace mouse clicks. In effect, selected interface functions migrate from keyboard and mouse to speech. We demonstrate the usefulness of this approach by augmenting jfig, a drawing application, where speech commands save the user from the distraction of having to use a tool palette.

© All rights reserved Gorniak and Roy and/or their publisher

p. 180-187

Käster, Thomas, Pfeiffer, Michael and Bauckhage, Christian (2003): Combining speech and haptics for intuitive and efficient navigation through image databases. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 180-187. Available online

Given the size of today's professional image databases, the standard approach to object- or theme-related image retrieval is to interactively navigate through the content. But as most users of such databases are designers or artists who do not have a technical background, navigation interfaces must be intuitive to use and easy to learn. This paper reports on efforts towards this goal. We present a system for intuitive image retrieval that features different modalities for interaction. Apart from conventional input devices like mouse or keyboard it is also possible to use speech or haptic gesture to indicate what kind of images one is looking for. Seeing a selection of images on the screen, the user provides relevance feedback to narrow the choice of motifs presented next. This is done either by scoring whole images or by choosing certain image regions. In order to derive consistent reactions from multimodal user input, asynchronous integration of modalities and probabilistic reasoning based on Bayesian networks are applied. After addressing technical details, we will discuss a series of usability experiments, which we conducted to examine the impact of multimodal input facilities on interactive image retrieval. The results indicate that users appreciate multimodality. While we observed little decrease in task performance, measures of contentment exceeded those for conventional input devices.

© All rights reserved Käster et al. and/or their publisher

p. 188-195

Atienza, Rowel and Zelinsky, Alexander (2003): Interactive skills using active gaze tracking. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 188-195. Available online

We have incorporated interactive skills into an active gaze tracking system. Our active gaze tracking system can identify an object in a cluttered scene that a person is looking at. By following the user's 3-D gaze direction together with a zero-disparity filter, we can determine the object's position. Our active vision system also directs attention to a user by tracking anything with both motion and skin color. A Particle Filter fuses skin color and motion from optical flow techniques together to locate a hand or a face in an image. The active vision then uses stereo camera geometry, Kalman Filtering and position and velocity controllers to track the feature in real-time. These skills are integrated together such that they cooperate with each other in order to track the user's face and gaze at all times. Results and video demos provide interesting insights on how active gaze tracking can be utilized and improved to make human-friendly user interfaces.

© All rights reserved Atienza and Zelinsky and/or their publisher

p. 196-202

Tan, Yeow Kee, Sherkat, Nasser and Allen, Tony (2003): Error recovery in a blended style eye gaze and speech interface. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 196-202. Available online

In the work carried out earlier [1][2], it was found that an eye gaze and speech enabled interface was the most preferred form of data entry method when compared to other methods such as mouse and keyboard, handwriting and speech only. It was also found that several non-native United Kingdom (UK) English speakers did not prefer the eye gaze and speech system due to the low success rate caused by the inaccuracy of the speech recognition component. Hence in order to increase the usability of the eye gaze and speech data entry system for these users, error recovery methods are required. In this paper we present three different multimodal interfaces that employ the use of speech recognition and eye gaze tracking within a virtual keypad style interface to allow for the use of error recovery (re-speak with keypad, spelling with keypad and re-speak and spelling with keypad). Experiments show that through the use of this virtual keypad interface, an accuracy gain of 10.92% during first attempt and 6.20% during re-speak by non-native speakers in ambiguous fields (initials, surnames, city and alphabets) can be achieved [3]. The aim of this work is to investigate whether the usability of the eye gaze and speech system can be improved through one of these three blended multimodal error recovery methods.

© All rights reserved Tan et al. and/or their publisher

p. 2

Marshall, Sandra (2003): New techniques for evaluating innovative interfaces with eye tracking. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. p. 2. Available online

Computer interfaces are changing rapidly, as are the cognitive demands on the operators using them. Innovative applications of new technologies such as multimodal and multimedia displays, haptic and pen-based interfaces, and natural language exchanges bring exciting changes to conventional interface usage. At the same time, their complexity may place overwhelming cognitive demands on the user. As novel interfaces and software applications are introduced into operational settings, it is imperative to evaluate them from a number of different perspectives. One important perspective examines the extent to which a new interface changes the cognitive requirements for the operator. This presentation describes a new approach to measuring cognitive effort using metrics based on eye movements and pupil dilation. It is well known that effortful cognitive processing is accompanied by increases in pupil dilation, but measurement techniques were not previously available that could supply results in real time or deal with data collected in long-lasting interactions. We now have a metric, the Index of Cognitive Activity, that is computed in real time as the operator interacts with the interface. The Index can be used to examine extended periods of usage or to assess critical events on an individual-by-individual basis. While dilation reveals when cognitive effort is highest, eye movements provide evidence of why. Especially during critical events, one wants to know whether the operator is confused by the presentation or location of specific information, whether he is attending to key information when necessary, or whether he is distracted by irrelevant features of the display. Important details of confusion, attention, and distraction are revealed by traces of his eye movements and statistical analyses of time spent looking at various features during critical events. Together, the Index of Cognitive Activity and the various analyses of eye movements provide essential information about how users interact with new interface technologies. Their use can aid designers of innovative hardware and software products by highlighting those features that increase rather than decrease users' cognitive effort. In the presentation, the underlying mathematical basis of the Index of Cognitive Activity will be described together with validating research results from a number of experiments. Eye movement analyses from the same studies give clues to the sources of increase in cognitive workload. To illustrate interface evaluation with the ICA and eye movement analysis, several extended examples will be presented using commercial and military displays. [NOTE: Dr. Marshall's eye tracking system will be available to view at Tuesday evening's joint UIST-ICMI demo reception.]

© All rights reserved Marshall and/or his/her publisher

p. 20-27

Horvitz, Eric and Apacible, Johnson (2003): Learning and reasoning about interruption. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 20-27. Available online

We present methods for inferring the cost of interrupting users based on multiple streams of events including information generated by interactions with computing devices, visual and acoustical analyses, and data drawn from online calendars. Following a review of prior work on techniques for deliberating about the cost of interruption associated with notifications, we introduce methods for learning models from data that can be used to compute the expected cost of interruption for a user. We describe the Interruption Workbench, a set of event-capture and modeling tools. Finally, we review experiments that characterize the accuracy of the models for predicting interruption cost and discuss research directions.

© All rights reserved Horvitz and Apacible and/or their publisher
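The core quantity, an expected cost of interruption under a belief over attentional states, reduces to a simple expectation; the states, probabilities and costs below are invented for illustration and are not taken from the paper.

```python
# Toy expected-cost-of-interruption calculation over inferred attentional states.
# State names, probabilities, and per-state costs are hypothetical.

def expected_interruption_cost(p_state, cost_of_interrupting):
    """p_state: inferred distribution over attentional states.
    cost_of_interrupting: cost of delivering a notification in each state."""
    return sum(p * c for p, c in zip(p_state, cost_of_interrupting))

# e.g. P(focused) = 0.7, P(routine work) = 0.2, P(idle) = 0.1
print(expected_interruption_cost([0.7, 0.2, 0.1], [3.0, 1.0, 0.1]))  # ≈ 2.31
```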

p. 203-210

Laerhoven, Kristof van, Villar, Nicolas, Schmidt, Albrecht, Kortuem, Gerd and Gellersen, Hans-Werner (2003): Using an autonomous cube for basic navigation and input. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 203-210. Available online

This paper presents a low-cost and practical approach to achieve basic input using a tactile cube-shaped object, augmented with a set of sensors, processor, batteries and wireless communication. The algorithm we propose combines a finite state machine model incorporating prior knowledge about the symmetrical structure of the cube, with maximum likelihood estimation using multivariate Gaussians. The claim that the presented solution is cheap, fast and requires few resources, is demonstrated by implementation in a small-sized, microcontroller-driven hardware configuration with inexpensive sensors. We conclude with a few prototyped applications that aim at characterizing how the familiar and elementary shape of the cube allows it to be used as an interaction device.

© All rights reserved Laerhoven et al. and/or their publisher
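A maximum-likelihood face estimate with multivariate Gaussians, as mentioned above, can be sketched in a few lines; the per-face means and covariances here are invented stand-ins for parameters that would normally be learned from accelerometer data.

```python
# Toy maximum-likelihood classifier for which cube face points up, using one
# multivariate Gaussian per face over 3-axis accelerometer readings (in g).
# The means and covariances are illustrative, not learned parameters.

import numpy as np
from scipy.stats import multivariate_normal

FACE_MODELS = {
    "top":    multivariate_normal(mean=[0, 0, 1],  cov=0.05 * np.eye(3)),
    "bottom": multivariate_normal(mean=[0, 0, -1], cov=0.05 * np.eye(3)),
    "left":   multivariate_normal(mean=[-1, 0, 0], cov=0.05 * np.eye(3)),
    "right":  multivariate_normal(mean=[1, 0, 0],  cov=0.05 * np.eye(3)),
}

def classify_face(accel_reading):
    """Pick the face whose Gaussian gives the reading the highest likelihood."""
    return max(FACE_MODELS, key=lambda f: FACE_MODELS[f].pdf(accel_reading))

print(classify_face([0.05, -0.02, 0.97]))  # "top"
```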

p. 211-218

Wilson, Andrew and Oliver, Nuria (2003): GWindows: robust stereo vision for gesture-based control of windows. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 211-218. Available online

Perceptual user interfaces promise modes of fluid computer-human interaction that complement the mouse and keyboard, and have been especially motivated in non-desktop scenarios, such as kiosks or smart rooms. Such interfaces, however, have been slow to see use for a variety of reasons, including the computational burden they impose, a lack of robustness outside the laboratory, unreasonable calibration demands, and a shortage of sufficiently compelling applications. We address these difficulties by using a fast stereo vision algorithm for recognizing hand positions and gestures. Our system uses two inexpensive video cameras to extract depth information. This depth information enhances automatic object detection and tracking robustness, and may also be used in applications. We demonstrate the algorithm in combination with speech recognition to perform several basic window management tasks, report on a user study probing the ease of using the system, and discuss the implications of such a system for future user interfaces.

© All rights reserved Wilson and Oliver and/or their publisher

p. 219-226

Gorniak, Peter and Roy, Deb (2003): A visually grounded natural language interface for reference to spatial scenes. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 219-226. Available online

Many user interfaces, from graphic design programs to navigation aids in cars, share a virtual space with the user. Such applications are often ideal candidates for speech interfaces that allow the user to refer to objects in the shared space. We present an analysis of how people describe objects in spatial scenes using natural language. Based on this study, we describe a system that uses synthetic vision to "see" such scenes from the person's point of view, and that understands complex natural language descriptions referring to objects in the scenes. This system is based on a rich notion of semantic compositionality embedded in a grounded language understanding framework. We describe its semantic elements, their compositional behaviour, and their grounding through the synthetic vision system. To conclude, we evaluate the performance of the system on unconstrained input.

© All rights reserved Gorniak and Roy and/or their publisher

p. 227-233

Ruddarraju, Ravikrishna, Haro, Antonio, Nagel, Kris, Tran, Quan T., Essa, Irfan A., Abowd, Gregory and Mynatt, Elizabeth D. (2003): Perceptual user interfaces using vision-based eye tracking. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 227-233. Available online

We present a multi-camera vision-based eye tracking method to robustly locate and track user's eyes as they interact with an application. We propose enhancements to various vision-based eye-tracking approaches, which include (a) the use of multiple cameras to estimate head pose and increase coverage of the sensors and (b) the use of probabilistic measures incorporating Fisher's linear discriminant to robustly track the eyes under varying lighting conditions in real-time. We present experiments and quantitative results to demonstrate the robustness of our eye tracking in two application prototypes.

© All rights reserved Ruddarraju et al. and/or their publisher

p. 234-241

Li, Yang, Landay, James A., Guan, Zhiwei, Ren, Xiangshi and Dai, Guozhong (2003): Sketching informal presentations. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 234-241. Available online

Informal presentations are a lightweight means for fast and convenient communication of ideas. People communicate their ideas to others on paper and whiteboards, which afford fluid sketching of graphs, words and other expressive symbols. Unlike existing authoring tools that are designed for formal presentations, we created SketchPoint to help presenters design informal presentations via freeform sketching. In SketchPoint, presenters can quickly author presentations by sketching slide content, overall hierarchical structures and hyperlinks. To facilitate the transition from idea capture to communication, a note-taking workspace was built for accumulating ideas and sketching presentation outlines. Informal feedback showed that SketchPoint is a promising tool for idea communication.

© All rights reserved Li et al. and/or their publisher

p. 242-249

Ou, Jiazhi, Fussell, Susan R., Chen, Xilin, Setlock, Leslie D. and Yang, Jie (2003): Gestural communication over video stream: supporting multimodal interaction for remote collaborative physical tasks. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 242-249. Available online

We present a system integrating gesture and live video to support collaboration on physical tasks. The architecture combines network IP cameras, desktop PCs, and tablet PCs to allow a remote helper to draw on a video feed of a workspace as he/she provides task instructions. A gesture recognition component enables the system both to normalize freehand drawings to facilitate communication with remote partners and to use pen-based input as a camera control device. Results of a preliminary user study suggest that our gesture over video communication system enhances task performance over traditional video-only systems. Implications for the design of multimodal systems to support collaborative physical tasks are also discussed.

© All rights reserved Ou et al. and/or their publisher

p. 250-257

Qvarfordt, Pernilla, Jönsson, Arne and Dahlbäck, Nils (2003): The role of spoken feedback in experiencing multimodal interfaces as human-like. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 250-257. Available online

Whether user interfaces should be made human-like or tool-like has been debated in the HCI field, and this debate affects the development of multimodal interfaces. However, little empirical study has been done to support either view so far. Even if there is evidence that humans interpret media as other humans, this does not mean that humans experience the interfaces as human-like. We studied how people experience a multimodal timetable system with varying degrees of human-like spoken feedback in a Wizard-of-Oz study. The results showed that users' views and preferences lean significantly towards anthropomorphism after actually experiencing the multimodal timetable system. The more human-like the spoken feedback is, the more participants preferred the system to be human-like. The results also showed that the users' experience matched their preferences. This shows that in order to appreciate a human-like interface, the users have to experience it.

© All rights reserved Qvarfordt et al. and/or their publisher

p. 258-264

Michel, Philipp and Kaliouby, Rana El (2003): Real time facial expression recognition in video using support vector machines. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 258-264. Available online

Enabling computer systems to recognize facial expressions and infer emotions from them in real time presents a challenging research topic. In this paper, we present a real time approach to emotion recognition through facial expression in live video. We employ an automatic facial feature tracker to perform face localization and feature extraction. The facial feature displacements in the video stream are used as input to a Support Vector Machine classifier. We evaluate our method in terms of recognition accuracy for a variety of interaction and classification scenarios. Our person-dependent and person-independent experiments demonstrate the effectiveness of a support vector machine and feature tracking approach to fully automatic, unobtrusive expression recognition in live video. We conclude by discussing the relevance of our work to affective and intelligent man-machine interfaces and exploring further improvements.

© All rights reserved Michel and Kaliouby and/or their publisher
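The classification step, an SVM over facial feature displacements, looks roughly like the sketch below; the 4-dimensional feature vectors, labels and kernel settings are dummies rather than the authors' tracker output.

```python
# Minimal sketch of expression classification from feature displacements.
# The tiny 4-D feature vectors and labels are dummy data; a real system would
# feed the tracker's full displacement vector per frame.

from sklearn.svm import SVC

X_train = [
    [0.8, 0.1, 0.6, 0.0],    # raised mouth corners, open eyes   -> "happy"
    [0.7, 0.2, 0.5, 0.1],
    [-0.5, 0.9, -0.2, 0.7],  # lowered brows, tightened lids     -> "angry"
    [-0.6, 0.8, -0.3, 0.6],
]
y_train = ["happy", "happy", "angry", "angry"]

clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print(clf.predict([[0.75, 0.15, 0.55, 0.05]]))  # ['happy']
```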

p. 265-272

Xiao, Benfang, Lunsford, Rebecca, Coulston, Rachel, Wesson, Matt and Oviatt, Sharon (2003): Modeling multimodal integration patterns and performance in seniors: toward adaptive processing of individual differences. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 265-272. Available online

Multimodal interfaces are designed with a focus on flexibility, although very few currently are capable of adapting to major sources of user, task, or environmental variation. The development of adaptive multimodal processing techniques will require empirical guidance from quantitative modeling on key aspects of individual differences, especially as users engage in different types of tasks in different usage contexts. In the present study, data were collected from fifteen 66- to 86-year-old healthy seniors as they interacted with a map-based flood management system using multimodal speech and pen input. A comprehensive analysis of multimodal integration patterns revealed that seniors were classifiable as either simultaneous or sequential integrators, like children and adults. Seniors also demonstrated early predictability and a high degree of consistency in their dominant integration pattern. However, greater individual differences in multimodal integration generally were evident in this population. Perhaps surprisingly, during sequential constructions seniors' intermodal lags were no longer in average and maximum duration than those of younger adults, although both of these groups had longer maximum lags than children. However, an analysis of seniors' performance did reveal lengthy latencies before initiating a task, and high rates of self talk and task-critical errors while completing spatial tasks. All of these behaviors were magnified as the task difficulty level increased. Results of this research have implications for the design of adaptive processing strategies appropriate for seniors' applications, especially for the development of temporal thresholds used during multimodal fusion. The long-term goal of this research is the design of high-performance multimodal systems that adapt to a full spectrum of diverse users, supporting tailored and robust future systems.

© All rights reserved Xiao et al. and/or their publisher

p. 273-276

Zahariev, Mihaela A. and MacKenzie, Christine L. (2003): Auditory, graphical and haptic contact cues for a reach, grasp, and place task in an augmented environment. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 273-276. Available online

An experiment was conducted to investigate how performance of a reach, grasp and place task was influenced by added auditory and graphical cues. The cues were presented at points in the task, specifically when making contact for grasping or placing the object, and were presented in single or in combined modalities. Haptic feedback was present always during physical interaction with the object. The auditory and graphical cues provided enhanced feedback about making contact between hand and object and between object and table. Also, the task was performed with or without vision of hand. Movements were slower without vision of hand. Providing auditory cues clearly facilitated performance, while graphical contact cues had no additional effect. Implications are discussed for various uses of auditory displays in virtual environments.

© All rights reserved Zahariev and MacKenzie and/or their publisher

p. 277-280

Chan, Chi-Ho, Lyons, Michael J. and Tetsutani, Nobuji (2003): Mouthbrush: drawing and painting by hand and mouth. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 277-280. Available online

We present a novel multimodal interface which permits users to draw or paint using coordinated gestures of hand and mouth. A headworn camera captures an image of the mouth and the mouth cavity region is extracted by Fisher discriminant analysis of the pixel colour information. A normalized area parameter is read by a drawing or painting program to allow real-time gestural control of pen/brush parameters by mouth gesture while sketching with a digital pen/tablet. A new performance task, the Radius Control Task, is proposed as a means of systematic evaluation of performance of the interface. Data from preliminary experiments show that with some practice users can achieve single pixel radius control with ease. A trial of the system by a professional artist shows that it is ready for use as a novel tool for creative artistic expression.

© All rights reserved Chan et al. and/or their publisher

p. 28-35

Lang, Sebastian, Kleinehagenbrock, Marcus, Hohenner, Sascha, Fritsch, Jannik, Fink, Gernot A. and Sagerer, Gerhard (2003): Providing the basis for human-robot-interaction: a multi-modal attention system for a mobile robot. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 28-35. Available online

In order to enable the widespread use of robots in home and office environments, systems with natural interaction capabilities have to be developed. A prerequisite for natural interaction is the robot's ability to automatically recognize when and how long a person's attention is directed towards it for communication. As in open environments several persons can be present simultaneously, the detection of the communication partner is of particular importance. In this paper we present an attention system for a mobile robot which enables the robot to shift its attention to the person of interest and to maintain attention during interaction. Our approach is based on a method for multi-modal person tracking which uses a pan-tilt camera for face recognition, two microphones for sound source localization, and a laser range finder for leg detection. Shifting of attention is realized by turning the camera into the direction of the person which is currently speaking. From the orientation of the head it is decided whether the speaker addresses the robot. The performance of the proposed approach is demonstrated with an evaluation. In addition, qualitative results from the performance of the robot at the exhibition part of the ICVS'03 are provided.

© All rights reserved Lang et al. and/or their publisher

p. 281-284

Katsurada, Kouichi, Nakamura, Yusaku, Yamada, Hirobumi and Nitta, Tsuneo (2003): XISL: a language for describing multimodal interaction scenarios. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 281-284. Available online

This paper outlines the latest version of XISL (eXtensible Interaction Scenario Language). XISL is an XML-based markup language for web-based multimodal interaction systems. XISL makes it possible to describe the synchronization of multimodal inputs/outputs, dialog flow/transition, and other descriptions required for multimodal interaction. XISL inherits these features from VoiceXML and SMIL. The distinguishing feature of XISL is its modality extensibility. We present the basic XISL tags, outline XISL execution systems, and then compare XISL with other languages.

© All rights reserved Katsurada et al. and/or their publisher

p. 285-288

Bauer, Daniel and Hollan, James D. (2003): IRYS: a visualization tool for temporal analysis of multimodal interaction. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 285-288. Available online

IRYS is a tool for the replay and analysis of gaze and touch behavior during on-line activities. Essentially a "multimodal VCR", it can record and replay computer screen activity and overlay this video with a synchronized "spotlight" of the user's attention, as measured by an eye-tracking and/or touch-tracking system. This cross-platform tool is particularly useful for detailed ethnographic analysis of "natural" on-line behavior involving multiple applications and windows in a continually changing workspace.

© All rights reserved Bauer and Hollan and/or their publisher

p. 289-292

Hazen, Timothy J., Weinstein, Eugene and Park, Alex (2003): Towards robust person recognition on handheld devices using face and speaker identification technologies. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 289-292. Available online

Most face and speaker identification techniques are tested on data collected in controlled environments using high quality cameras and microphones. However, the use of these technologies in variable environments and with the help of the inexpensive sound and image capture hardware present in mobile devices presents an additional challenge. In this study, we investigate the application of existing face and speaker identification techniques to a person identification task on a handheld device. These techniques have proven to perform accurately on tightly constrained experiments where the lighting conditions, visual backgrounds, and audio environments are fixed and specifically adjusted for optimal data quality. When these techniques are applied on mobile devices where the visual and audio conditions are highly variable, degradations in performance can be expected. Under these circumstances, the combination of multiple biometric modalities can improve the robustness and accuracy of the person identification task. In this paper, we present our approach for combining face and speaker identification technologies and experimentally demonstrate a fused multi-biometric system which achieves a 50% reduction in equal error rate over the better of the two independent systems.

© All rights reserved Hazen et al. and/or their publisher
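A minimal sketch of late fusion plus an equal-error-rate estimate is given below, assuming synthetic scores and labels; the equal weighting and threshold sweep are illustrative, not the system described in the paper.

```python
# Illustrative late fusion of face and speaker verification scores, plus a
# brute-force equal-error-rate estimate. All scores and labels are synthetic.

import numpy as np

def fuse(face_scores, voice_scores, w=0.5):
    """Simple weighted-sum (late) fusion of two verification score streams."""
    return w * np.asarray(face_scores) + (1 - w) * np.asarray(voice_scores)

def equal_error_rate(scores, labels):
    """labels: 1 = genuine trial, 0 = impostor trial."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    best_gap, eer = float("inf"), 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)  # false accept rate
        frr = np.mean(scores[labels == 1] < t)   # false reject rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

labels = [1, 1, 1, 0, 0, 0]
face   = [0.9, 0.6, 0.7, 0.4, 0.5, 0.2]
voice  = [0.8, 0.7, 0.9, 0.3, 0.2, 0.4]
print(equal_error_rate(fuse(face, voice), labels))  # 0.0 on this separable toy data
```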

p. 293-296

Abrilian, Sarkis, Martin, Jean-Claude and Buisine, Stephanie (2003): Algorithms for controlling cooperation between output modalities in 2D embodied conversational agents. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 293-296. Available online

Recent advances in the specification of the multimodal behavior of Embodied Conversational Agents (ECA) have proposed a direct and deterministic one-step mapping from high-level specifications of dialog state or agent emotion onto low-level specifications of the multimodal behavior to be displayed by the agent (e.g. facial expression, gestures, vocal utterance). The difference of abstraction between these two levels of specification makes difficult the definition of such a complex mapping. In this paper we propose an intermediate level of specification based on combinations between modalities (e.g. redundancy, complementarity). We explain how such intermediate level specifications can be described using XML in the case of deictic expressions. We define algorithms for parsing such descriptions and generating the corresponding multimodal behavior of 2D cartoon-like conversational agents. Some random selection has been introduced in these algorithms in order to induce some "natural variations" in the agent's behavior. We conclude on the usefulness of this approach for the design of ECA.

© All rights reserved Abrilian et al. and/or their publisher

p. 297-300

Wilhelm, Torsten, Böhme, Hans-Joachim and Gross, Horst-Michael (2003): Towards an attentive robotic dialog partner. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 297-300. Available online

This paper describes a system developed for a mobile service robot which detects and tracks the position of a user's face in 3D-space using a vision (skin color) and a sonar based component. To make the skin color detection robust under varying illumination conditions, it is supplied with an automatic white balance algorithm. The hypothesis of the user's position is used to orient the robot's head towards the current user allowing it to grab high resolution images of his face suitable for verifying the hypothesis and for extracting additional information.

© All rights reserved Wilhelm et al. and/or their publisher

p. 3

Spence, Charles (2003): Crossmodal attention and multisensory integration: implications for multimodal interface design. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. p. 3. Available online

One of the most important findings to emerge from the field of cognitive psychology in recent years has been the discovery that humans have a very limited ability to process incoming sensory information. In fact, contrary to many of the most influential human operator models, the latest research has shown that humans use the same limited pool of attentional resources to process the inputs arriving from each of their senses (e.g., hearing, vision, touch, smell, etc). This research calls for a radical new way of examining and understanding the senses, which has major implications for the way we design everything from household products to multimodal user interfaces. Instead, interface designers should realize that the decision to stimulate more senses actually reflects a trade-off between the benefits of utilizing additional senses and the costs associated with dividing attention between different sensory modalities. In this presentation, I will discuss some of the problems associated with dividing attention between eye and ear, as illustrated by talking on a mobile phone while driving. Charles has published more than 70 articles in scientific journals over the past decade. I hope to demonstrate that a better understanding of the senses and, especially the links between the senses that have been highlighted by recent cognitive neuroscience research, will enable interface designers to develop multimodal interfaces that more effectively stimulate the user's senses.

© All rights reserved Spence and/or his/her publisher

p. 301-302

Payandeh, Shahram, Dill, John, Wilson, Graham, Zhang, Hui, Shi, Lilong, Lomax, Alan and MacKenzie, Christine (2003): Demo: a multi-modal training environment for surgeons. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 301-302. Available online

This demonstration presents the current state of an on-going team project at Simon Fraser University in developing a virtual environment for helping to train surgeons in performing laparoscopic surgery. In collaboration with surgeons, an initial set of training procedures has been developed. Our goal has been to develop procedures in each of several general categories, such as basic hand-eye coordination, single-handed and bi-manual approaches and dexterous manipulation. The environment is based on an effective data structure that offers fast graphics and physically based modeling of both rigid and deformable objects. In addition, the environment supports both 3D and 5D input devices and devices generating haptic feedback. The demonstration allows users to interact with a scene using a haptic device.

© All rights reserved Payandeh et al. and/or their publisher

p. 303-304

Paiva, Ana, Prada, Rui, Chaves, Ricardo, Vala, Marco, Bullock, Adrian, Andersson, Gerd and Höök, Kristina (2003): Demo: playing FantasyA with SenToy. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 303-304. Available online

Game development is an emerging area for new types of interaction between computers and humans. New forms of communication are now being explored there, influenced not only by face-to-face communication but also by recent developments in multi-modal communication and tangible interfaces. This demo will feature a computer game, FantasyA, where users play by interacting with a tangible interface, SenToy (see Figure 1). The main idea is to involve objects and artifacts from real life in the ways we interact with systems, and in particular with games. So, SenToy is an interface for users to project some of their emotional gestures through moving the doll in certain ways. This device establishes a link between the users (holding the physical device) and a controlled avatar (embodied by that physical device) of the computer game, FantasyA.

© All rights reserved Paiva et al. and/or their publisher

p. 36-43

Oliver, Nuria and Horvitz, Eric (2003): Selective perception policies for guiding sensing and computation in multimodal systems: a comparative analysis. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 36-43. Available online

Intensive computations required for sensing and processing perceptual information can impose significant burdens on personal computer systems. We explore several policies for selective perception in SEER, a multimodal system for recognizing office activity that relies on a layered Hidden Markov Model representation. We review our efforts to employ expected-value-of-information (EVI) computations to limit sensing and analysis in a context-sensitive manner. We discuss an implementation of a one-step myopic EVI analysis and compare the results of using the myopic EVI with a heuristic sensing policy that makes observations at different frequencies. Both policies are then compared to a random perception policy, where sensors are selected at random. Finally, we discuss the sensitivity of ideal perceptual actions to preferences encoded in utility models about information value and the cost of sensing.

© All rights reserved Oliver and Horvitz and/or their publisher
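
As a rough illustration of the one-step myopic EVI idea described above (not the SEER implementation, which uses layered HMMs and a richer utility model), the sketch below picks the sensor subset whose expected gain in decision utility most exceeds its sensing cost. The function names and the belief representation are assumptions.

```python
# Hedged sketch of a one-step (myopic) expected-value-of-information test
# for deciding which sensors to activate on the next perception step.
import itertools

def myopic_evi(belief, sensors, observation_model, utility, cost):
    """Pick the sensor subset with the highest expected net value.

    belief            : dict state -> probability (current posterior)
    sensors           : list of sensor names
    observation_model : f(subset, state) -> dict observation -> probability
    utility           : f(belief) -> value of acting on that belief
    cost              : f(subset) -> cost of running those sensors
    """
    base_value = utility(belief)
    best_subset, best_net = (), 0.0
    for r in range(1, len(sensors) + 1):
        for subset in itertools.combinations(sensors, r):
            # Marginal probability of each possible observation.
            obs_prob = {}
            for s, p_s in belief.items():
                for o, p_o in observation_model(subset, s).items():
                    obs_prob[o] = obs_prob.get(o, 0.0) + p_s * p_o
            # Expected utility after a Bayesian update on each observation.
            expected_value = 0.0
            for o, p_obs in obs_prob.items():
                post = {s: belief[s] * observation_model(subset, s).get(o, 0.0)
                        for s in belief}
                z = sum(post.values()) or 1.0
                post = {s: v / z for s, v in post.items()}
                expected_value += p_obs * utility(post)
            net = (expected_value - base_value) - cost(subset)
            if net > best_net:
                best_subset, best_net = subset, net
    return best_subset  # empty tuple means "sense nothing this step"
```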

p. 4-11

Nesbat, Saied B. (2003): A system for fast, full-text entry for small electronic devices. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 4-11. Available online

A novel text entry system designed around the ubiquitous 12-button telephone keypad, together with its adaptation for a soft keypad, is presented. This system can be used to enter full text (letters + numbers + special characters) on devices where the number of keys or the keyboard area is limited. Letter-frequency data is used for assigning letters to the positions of a 3x3 matrix on keys, so that the most frequent letters are entered with a double-click. Less frequent letters and characters are entered based on a 3x3 adjacency matrix using an unambiguous, two-keystroke scheme. The same technique is applied to a virtual or soft keyboard layout so letters and characters are entered with taps or slides on an 11-button keypad. Based on the application of Fitts' law, this system is determined to be 67% faster than the QWERTY soft keyboard and 31% faster than the multi-tap text entry system commonly used on cell phones today. The system presented in this paper is implemented and runs on Palm OS PDAs, replacing the built-in QWERTY keyboard and Graffiti recognition systems of these PDAs.

© All rights reserved Nesbat and/or his/her publisher
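
To make the Fitts'-law comparison concrete, here is a hedged sketch of how an expert-speed estimate for a keypad layout can be computed by weighting Fitts movement times with digraph frequencies. The geometry, digraph frequencies, and Fitts coefficients are placeholders, not the values used in the paper.

```python
# Sketch of a Fitts'-law layout analysis: average the predicted movement time
# over letter-pair transitions, weighted by how often each pair occurs.
import math

def fitts_time(d, w, a=0.0, b=0.2):
    """Movement time (s) for amplitude d and target width w (Shannon form)."""
    return a + b * math.log2(d / w + 1)

def mean_entry_time(key_centers, key_width, digraph_freq):
    """Frequency-weighted average time per keystroke transition.

    key_centers  : dict char -> (x, y) centre of the key holding that char
    key_width    : effective target width (same units as the centres)
    digraph_freq : dict (char1, char2) -> relative frequency, summing to 1
    """
    total = 0.0
    for (c1, c2), freq in digraph_freq.items():
        x1, y1 = key_centers[c1]
        x2, y2 = key_centers[c2]
        dist = math.hypot(x2 - x1, y2 - y1)
        total += freq * fitts_time(max(dist, 1e-9), key_width)
    return total
```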

p. 44-51

Oviatt, Sharon, Coulston, Rachel, Tomko, Stefanie, Xiao, Benfang, Lunsford, Rebecca, Wesson, Matt and Carmichael, Lesley (2003): Toward a theory of organized multimodal integration patterns during human-computer interaction. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 44-51. Available online

As a new generation of multimodal systems begins to emerge, one dominant theme will be the integration and synchronization requirements for combining modalities into robust whole systems. In the present research, quantitative modeling is presented on the organization of users' speech and pen multimodal integration patterns. In particular, the potential malleability of users' multimodal integration patterns is explored, as well as variation in these patterns during system error handling and tasks varying in difficulty. Using a new dual-wizard simulation method, data was collected from twelve adults as they interacted with a map-based task using multimodal speech and pen input. Analyses based on over 1600 multimodal constructions revealed that users' dominant multimodal integration pattern was resistant to change, even when strong selective reinforcement was delivered to encourage switching from a sequential to simultaneous integration pattern, or vice versa. Instead, both sequential and simultaneous integrators showed evidence of entrenching further in their dominant integration patterns (i.e., increasing either their inter-modal lag or signal overlap) over the course of an interactive session, during system error handling, and when completing increasingly difficult tasks. In fact, during error handling these changes in the co-timing of multimodal signals became the main feature of hyper-clear multimodal language, with elongation of individual signals either attenuated or absent. Whereas Behavioral/Structuralist theory cannot account for these data, it is argued that Gestalt theory provides a valuable framework and insights into multimodal interaction. Implications of these findings are discussed for the development of a coherent theory of multimodal integration during human-computer interaction, and for the design of a new class of adaptive multimodal interfaces.

© All rights reserved Oviatt et al. and/or their publisher
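
A minimal sketch of the kind of timing analysis implied above, assuming each construction is reduced to a speech interval and a pen interval: overlapping signals count as simultaneous integration, non-overlapping ones as sequential, and a user's dominant pattern is the majority label. This is illustrative only, not the authors' analysis code.

```python
# Classify speech-and-pen constructions by their temporal relationship and
# summarize a user's dominant multimodal integration pattern.
from collections import Counter

def classify_construction(speech, pen):
    """speech, pen: (start, end) tuples in seconds."""
    overlap = min(speech[1], pen[1]) - max(speech[0], pen[0])
    if overlap > 0:
        return "simultaneous", overlap    # signal overlap (s)
    return "sequential", -overlap         # inter-modal lag (s)

def dominant_pattern(constructions):
    """constructions: iterable of (speech_interval, pen_interval) pairs."""
    counts = Counter(classify_construction(s, p)[0] for s, p in constructions)
    return counts.most_common(1)[0][0]
```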

p. 52-59

Swindells, Colin, Unden, Alex and Sang, Tao (2003): TorqueBAR: an ungrounded haptic feedback device. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 52-59. Available online

Kinesthetic feedback is a key mechanism by which people perceive object properties during their daily tasks -- particularly inertial properties. For example, transporting a glass of water without spilling, or dynamically positioning a handheld tool such as a hammer, both require inertial kinesthetic feedback. We describe a prototype for a novel ungrounded haptic feedback device, the TorqueBAR, that exploits a kinesthetic awareness of dynamic inertia to simulate complex coupled motion as both a display and input device. As a user tilts the TorqueBAR to sense and control computer programmed stimuli, the TorqueBAR's centre-of-mass changes in real-time according to the user's actions. We evaluate the TorqueBAR using both quantitative and qualitative techniques, and we describe possible applications for the device such as video games and real-time robot navigation.

© All rights reserved Swindells et al. and/or their publisher
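
As a back-of-the-envelope illustration of the control idea (an assumption; the paper describes the actual actuation and limits of the prototype), sliding a mass of m kilograms to an offset d along a bar tilted by an angle theta produces a gravitational torque of roughly m * g * d * cos(theta) about the grip.

```python
# Hypothetical sketch: choose a mass offset that renders a target torque,
# clamped to the carriage travel.  Parameter names are illustrative.
import math

G = 9.81  # gravitational acceleration, m/s^2

def mass_offset_for_torque(target_torque, mass, tilt_rad, max_offset):
    """Offset (m) of the moving mass that yields target_torque (N*m)."""
    lever = mass * G * math.cos(tilt_rad)
    if abs(lever) < 1e-9:      # bar held vertically: gravity gives no torque
        return 0.0
    d = target_torque / lever
    return max(-max_offset, min(max_offset, d))
```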

p. 60-67

Paiva, Ana, Prada, Rui, Chaves, Ricardo, Vala, Marco, Bullock, Adrian, Andersson, Gerd and Hook, Kristina (2003): Towards tangibility in gameplay: building a tangible affective interface for a computer game. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 60-67. Available online

In this paper we describe a way of controlling the emotional states of a synthetic character in a game (FantasyA) through a tangible interface named SenToy. SenToy is a doll with sensors in the arms, legs and body, allowing the user to influence the emotions of her character in the game. The user performs gestures and movements with SenToy, which are picked up by the sensors and interpreted according to a scheme found through an initial Wizard of Oz study. Different gestures are used to express each of the following emotions: anger, fear, happiness, surprise, sadness and gloating. Depending upon the expressed emotion, the synthetic character in FantasyA will, in turn, perform different actions. The evaluation of SenToy acting as the interface to the computer game FantasyA has shown that users were able to express most of the desired emotions to influence the synthetic characters, and that overall, players, especially children, really liked the doll as an interface.

© All rights reserved Paiva et al. and/or their publisher
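
Purely as a hypothetical illustration of a gesture-to-emotion mapping of the kind derived from the Wizard of Oz study, the sketch below maps movement features to the six emotions listed above. The feature names and thresholds are invented for illustration; the actual mapping comes from SenToy's arm, leg, and body sensors.

```python
# Invented rule-based mapping from doll-movement features to emotions.
def classify_emotion(features):
    """features: dict with keys such as 'arms_up', 'arms_crossed', 'shaking',
    'bowed_forward', 'bouncing', 'covering_eyes' (booleans or 0..1 floats)."""
    if features.get("shaking", 0) > 0.7 and features.get("arms_up"):
        return "anger"
    if features.get("covering_eyes"):
        return "fear"
    if features.get("bouncing", 0) > 0.5:
        return "happiness"
    if features.get("arms_up"):
        return "surprise"
    if features.get("bowed_forward"):
        return "sadness"
    if features.get("arms_crossed"):
        return "gloating"
    return "neutral"
```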

p. 68-72

Snelick, Robert, Indovina, Mike, Yen, James and Mink, Alan (2003): Multimodal biometrics: issues in design and testing. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 68-72. Available online

Experimental studies show that multimodal biometric systems for small-scale populations perform better than single-mode biometric systems. We examine whether such techniques scale to larger populations, introduce a methodology to test the performance of such systems, and assess the feasibility of using commercial off-the-shelf (COTS) products to construct deployable multimodal biometric systems. A key aspect of our approach is to leverage confidence level scores from preexisting single-mode data. An example presents a multimodal biometrics system analysis that explores various normalization and fusion techniques for face and fingerprint classifiers. This multimodal analysis uses a population of about 1000 subjects, a number ten times larger than in any previously reported study. Experimental results combining face and fingerprint biometric classifiers reveal significant performance improvement over single-mode biometric systems.

© All rights reserved Snelick et al. and/or their publisher
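
The paper explores several normalization and fusion schemes; as one standard example (not necessarily the scheme that performed best in the study), the sketch below applies min-max normalization to each matcher's scores and fuses them with a weighted sum rule.

```python
# Min-max normalization plus sum-rule fusion of face and fingerprint scores.
import numpy as np

def min_max_normalize(scores):
    """Map raw matcher scores to [0, 1] using the observed min and max."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def sum_rule_fusion(face_scores, finger_scores, w_face=0.5):
    """Weighted sum of normalized face and fingerprint scores per subject."""
    f = min_max_normalize(face_scores)
    g = min_max_normalize(finger_scores)
    return w_face * f + (1.0 - w_face) * g

# Example: fused scores are thresholded to accept or reject an identity claim.
fused = sum_rule_fusion([412, 550, 130], [0.82, 0.40, 0.91])
accepted = fused > 0.6
```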

p. 73-76

Adelstein, Bernard D., Begault, Durand R., Anderson, Mark R. and Wenzel, Elizabeth M. (2003): Sensitivity to haptic-audio asynchrony. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 73-76. Available online

The natural role of sound in actions involving mechanical impact and vibration suggests the use of auditory display as an augmentation to virtual haptic interfaces. In order to budget available computational resources for sound simulation, the perceptually tolerable asynchrony between paired haptic-auditory sensations must be known. This paper describes a psychophysical study of detectable time delay between a voluntary hammer tap and its auditory consequence (a percussive sound of either 1, 50, or 200 ms duration). The results show Just Noticeable Differences (JNDs) for temporal asynchrony of 24 ms with insignificant response bias. The invariance of JND and response bias as a function of sound duration in this experiment indicates that observers cued on the initial attack of the auditory stimuli.

© All rights reserved Adelstein et al. and/or their publisher
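
As a hedged sketch of how a JND can be read off detection data (the paper's exact psychophysical procedure and criterion may differ), the function below interpolates the asynchrony at which the proportion of detected delays crosses a 75% criterion.

```python
# Interpolate a detection threshold (JND) from a monotone psychometric curve.
import numpy as np

def jnd_from_psychometric(delays_ms, p_detect, criterion=0.75):
    """delays_ms : increasing array of tested haptic-audio asynchronies (ms)
    p_detect  : proportion of trials on which the delay was reported
    Returns the delay at which p_detect crosses the criterion."""
    delays_ms = np.asarray(delays_ms, dtype=float)
    p_detect = np.asarray(p_detect, dtype=float)   # must be non-decreasing
    return float(np.interp(criterion, p_detect, delays_ms))

# e.g. jnd_from_psychometric([0, 12, 25, 50, 100], [0.05, 0.3, 0.72, 0.95, 1.0])
```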

p. 77-80

Siracusa, Michael, Morency, Louis-Philippe, Wilson, Kevin, Fisher, John and Darrell, Trevor (2003): A multi-modal approach for determining speaker location and focus. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 77-80. Available online

This paper presents a multi-modal approach to locate a speaker in a scene and determine to whom he or she is speaking. We present a simple probabilistic framework that combines multiple cues derived from both audio and video information. A purely visual cue is obtained using a head tracker to identify possible speakers in a scene and provide both their 3-D positions and orientation. In addition, estimates of the audio signal's direction of arrival are obtained with the help of a two-element microphone array. A third cue measures the association between the audio and the tracked regions in the video. Integrating these cues provides a more robust solution than using any single cue alone. The usefulness of our approach is shown in our results for video sequences with two or more people in a prototype interactive kiosk environment.

© All rights reserved Siracusa et al. and/or their publisher
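
A minimal sketch of the cue-combination idea, assuming independent cues and a Gaussian model for the audio direction-of-arrival term; the model form, parameters, and field names are illustrative assumptions rather than the paper's framework.

```python
# Score each tracked person by the product of an audio direction-of-arrival
# likelihood, a visual tracking confidence, and an audio-visual association
# measure, then pick the highest-scoring person as the speaker.
import math

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def score_speaker(person, audio_doa, doa_sigma=10.0):
    """person: dict with 'bearing' (deg from camera), 'face_conf' in [0, 1],
    and 'av_assoc' in [0, 1]; audio_doa: estimated direction of arrival (deg)."""
    p_audio = gaussian(person["bearing"], audio_doa, doa_sigma)
    return p_audio * person["face_conf"] * person["av_assoc"]

def most_likely_speaker(people, audio_doa):
    return max(people, key=lambda p: score_speaker(p, audio_doa))
```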

p. 81-84

Hinckley, Ken (2003): Distributed and local sensing techniques for face-to-face collaboration. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 81-84. Available online

This paper describes techniques that allow users to collaborate on tablet computers that employ distributed sensing techniques to establish a privileged connection between devices. Each tablet is augmented with a two-axis linear accelerometer (tilt sensor), touch sensor, proximity sensor, and light sensor. The system recognizes when users bump two tablets together by looking for spikes in each tablet's accelerometer data that are synchronized in time; bumping establishes a privileged connection between the devices. Users can face one another and bump the tops of two tablets together to establish a collaborative face-to-face workspace. The system then uses the sensors to enhance transitions between personal work and shared work. For example, a user can hold his or her hand near the top of the workspace to "shield" the display from the other user. This gesture is sensed using the proximity sensor together with the light sensor, allowing for quick "asides" into private information or to sketch an idea in a personal workspace. Picking up, putting down, or walking away with a tablet are also sensed, as is angling the tablet towards the other user. Much research in single display groupware considers shared displays and shared artifacts, but our system explores a unique form of dual display groupware for face-to-face communication and collaboration using personal display devices.

© All rights reserved Hinckley and/or his/her publisher
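
A minimal sketch of the bump test described above, assuming each tablet streams timestamped accelerometer samples: both devices must show an acceleration spike, and the spikes must coincide within a short window. The threshold and window values are placeholders.

```python
# Detect a shared "bump" from two accelerometer streams by looking for
# spikes that are synchronized in time.
def spike_times(samples, threshold=2.0):
    """samples: list of (timestamp_s, acceleration_g); return spike times."""
    return [t for t, a in samples if abs(a) > threshold]

def bumped_together(samples_a, samples_b, window_s=0.05):
    """True if any spike on tablet A falls within window_s of one on B."""
    for ta in spike_times(samples_a):
        for tb in spike_times(samples_b):
            if abs(ta - tb) <= window_s:
                return True
    return False
```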

p. 85-92

Westeyn, Tracy, Brashear, Helene, Atrash, Amin and Starner, Thad (2003): Georgia Tech gesture toolkit: supporting experiments in gesture recognition. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 85-92. Available online

Gesture recognition is becoming a more common interaction tool in the fields of ubiquitous and wearable computing. Designing a system to perform gesture recognition, however, can be a cumbersome task. Hidden Markov models (HMMs), a pattern recognition technique commonly used in speech recognition, can be used for recognizing certain classes of gestures. Existing HMM toolkits for speech recognition can be adapted to perform gesture recognition, but doing so requires significant knowledge of the speech recognition literature and its relation to gesture recognition. This paper introduces the Georgia Tech Gesture Toolkit, GT²k, which leverages Cambridge University's speech recognition toolkit, HTK, to provide tools that support gesture recognition research. GT²k provides capabilities for training models and allows for both real-time and off-line recognition. This paper presents four ongoing projects that utilize the toolkit in a variety of domains.

© All rights reserved Westeyn et al. and/or their publisher
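
GT²k itself wraps HTK, so the sketch below is only a stand-in showing the same train-one-HMM-per-gesture, pick-the-highest-likelihood pattern, here using the hmmlearn library (an assumption; the toolkit does not use hmmlearn).

```python
# Train one Gaussian HMM per gesture class and recognize a new sequence by
# maximum log-likelihood, mirroring the usual HTK/GT2k workflow.
import numpy as np
from hmmlearn import hmm

def train_gesture_models(training_data, n_states=5):
    """training_data: dict gesture_name -> list of (T_i, D) feature arrays."""
    models = {}
    for name, sequences in training_data.items():
        X = np.vstack(sequences)
        lengths = [len(seq) for seq in sequences]
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[name] = model
    return models

def recognize(models, sequence):
    """Return the gesture whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda name: models[name].score(sequence))
```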

p. 93-100

Elting, Christian, Rapp, Stefan, Möhler, Gregor and Strube, Michael (2003): Architecture and implementation of multimodal plug and play. In: Proceedings of the 2003 International Conference on Multimodal Interfaces 2003. pp. 93-100. Available online

This paper describes the handling of multimodality in the Embassi system. Here, multimodality is treated in two modules. Firstly, a modality fusion component merges speech, video traced pointing gestures, and input from a graphical user interface. Secondly, a presentation planning component decides upon the modality to be used for the output, i.e., speech, an animated life-like character (ALC) and/or the graphical user interface, and ensures that the presentation is coherent and cohesive. We describe how these two components work and emphasize one particular feature of our system architecture: All modality analysis components generate output in a common semantic description format and all render components process input in a common output language. This makes it particularly easy to add or remove modality analyzers or renderer components, even dynamically while the system is running. This plug and play of modalities can be used to adjust the system's capabilities to different demands of users and their situative context. In this paper we give details about the implementations of the models, protocols and modules that are necessary to realize those features.

© All rights reserved Elting et al. and/or their publisher
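
To illustrate the plug-and-play architecture in miniature, the sketch below keeps a registry of analyzers that all emit a common semantic description and renderers that all consume a common output description, and lets components register or unregister while the system runs. Class and method names are illustrative, not Embassi's actual interfaces.

```python
# A registry-style hub: modality analyzers and output renderers can be added
# or removed at runtime because they share common message formats.
from typing import Callable, Dict

SemanticInput = Dict   # common semantic description produced by all analyzers
Presentation = Dict    # common output description consumed by all renderers

class ModalityHub:
    def __init__(self):
        self._analyzers: Dict[str, Callable[[object], SemanticInput]] = {}
        self._renderers: Dict[str, Callable[[Presentation], None]] = {}

    # Components may register or unregister at any time ("plug and play").
    def register_analyzer(self, name, fn): self._analyzers[name] = fn
    def unregister_analyzer(self, name): self._analyzers.pop(name, None)
    def register_renderer(self, name, fn): self._renderers[name] = fn
    def unregister_renderer(self, name): self._renderers.pop(name, None)

    def fuse(self, raw_inputs):
        """Run every registered analyzer on its raw input and merge results."""
        merged: SemanticInput = {}
        for name, raw in raw_inputs.items():
            if name in self._analyzers:
                merged.update(self._analyzers[name](raw))
        return merged

    def present(self, presentation: Presentation):
        """Send the planned presentation to every registered renderer."""
        for render in self._renderers.values():
            render(presentation)
```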




 


