Proceedings of the 2006 International Conference on Multimodal Interfaces


 
Time and place:

2006
Conf. description:
The International Conference on Multimodal Interfaces (ICMI) is an annual ACM-sponsored conference that promotes research in next-generation perceptive, adaptive and multimodal user interfaces. These new interfaces are especially well suited for interpreting natural communication and activity patterns in real-world environments.
Series:
This is a preferred venue for people like Trevor Darrell, Wen Gao, Rainer Stiefelhagen, Jie Yang, and Francis K. H. Quek. Part of the ICMI - International Conference on Multimodal Interfaces conference series.

References from this conference (2006)

The following articles are from "Proceedings of the 2006 International Conference on Multimodal Interfaces":


Articles

p. 1

Warburton, Ted (2006): Weight, weight, don't tell me. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. p. 1. Available online

Remember the "Internet's firstborn," Ron Lussier's dancing baby from 1996? Other than a vague sense of repeated gyrations, no one can recall any of the movements in particular. Why is that? While that animation was ground-breaking in many respects, to paraphrase a great writer, there was no there there. The dancing baby lacked personality because the movements themselves lacked "weight." Each human being has a unique perceivable movement style composed of repeated recognizable elements that in combination and phrasing capture the liveliness of movement. The use of weight, or "effort quality," is a key element in movement style, defining a dynamic expressive range. In computer representation of human movement, however, weight is often an aspect of life-ness that gets diminished or lost in the process, contributing to a lack of groundedness, personality, and verisimilitude. In this talk, I unpack the idea of effort quality and describe current work with motion capture and telematics that puts the weight back on interface design.

© All rights reserved Warburton and/or his/her publisher

p. 100-107

Rohs, Michael and Essl, Georg (2006): Which one is better?: information navigation techniques for spatially aware handheld displays. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 100-107. Available online

Information navigation techniques for handheld devices support interacting with large virtual spaces on small displays, for example finding targets on a large-scale map. Since only a small part of the virtual space can be shown on the screen at once, typical interfaces allow for scrolling and panning to reach off-screen content. Spatially aware handheld displays sense their position and orientation in physical space in order to provide a corresponding view in virtual space. We implemented various one-handed navigation techniques for camera-tracked spatially aware displays. The techniques are compared in a series of abstract selection tasks that require the investigation of different levels of detail. The tasks are relevant for interfaces that enable navigating large scale maps and finding contextual information on them. The results show that halo is significantly faster than other techniques. In complex situations zoom and halo show comparable performance. Surprisingly, the combination of halo and zooming is detrimental to user performance.

© All rights reserved Rohs and Essl and/or their publisher

p. 108-117

Burke, Jennifer L., Prewett, Matthew S., Gray, Ashley A., Yang, Liuquin, Stilson, Frederick R. B., Coovert, Michael D., Elliot, Linda R. and Redden, Elizabeth (2006): Comparing the effects of visual-auditory and visual-tactile feedback on user performance: a meta-analysis. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 108-117. Available online

In a meta-analysis of 43 studies, we examined the effects of multimodal feedback on user performance, comparing visual-auditory and visual-tactile feedback to visual feedback alone. Results indicate that adding an additional modality to visual feedback improves performance overall. Both visual-auditory feedback and visual-tactile feedback provided advantages in reducing reaction times and improving performance scores, but were not effective in reducing error rates. Effects are moderated by task type, workload, and number of tasks. Visual-auditory feedback is most effective when a single task is being performed (g = .87), and under normal workload conditions (g = .71). Visual-tactile feedback is more effective when multiple tasks are being performed (g = .77) and workload conditions are high (g = .84). Both types of multimodal feedback are effective for target acquisition tasks, but vary in effectiveness for other task types. Implications for practice and research are discussed.

© All rights reserved Burke et al. and/or their publisher
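
The effect sizes above are Hedges' g values. As a rough illustration of how that statistic is computed (the function and numbers below are made up for this note, not taken from the study):

# Illustrative only: a minimal Hedges' g computation, the effect-size
# statistic reported in this meta-analysis. Sample values are invented
# and do not come from the paper.
import math

def hedges_g(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardized mean difference with small-sample bias correction."""
    pooled_sd = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2)
                          / (n_a + n_b - 2))
    d = (mean_a - mean_b) / pooled_sd            # Cohen's d
    correction = 1 - 3 / (4 * (n_a + n_b) - 9)   # small-sample correction factor
    return d * correction

# Hypothetical example: multimodal vs. visual-only performance scores
print(round(hedges_g(82.0, 10.0, 30, 75.0, 11.0, 30), 2))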

p. 118-125

Malkin, Robert, Chen, Datong, Yang, Jie and Waibel, Alex (2006): Multimodal estimation of user interruptibility for smart mobile telephones. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 118-125. Available online

Context-aware computer systems are characterized by the ability to consider user state information in their decision logic. One example application of context-aware computing is the smart mobile telephone. Ideally, a smart mobile telephone should be able to consider both social factors (i.e., known relationships between contactor and contactee) and environmental factors (i.e., the contactee's current locale and activity) when deciding how to handle an incoming request for communication. Toward providing this kind of user state information and improving the ability of the mobile phone to handle calls intelligently, we present work on inferring environmental factors from sensory data and using this information to predict user interruptibility. Specifically, we learn the structure and parameters of a user state model from continuous ambient audio and visual information from periodic still images, and attempt to associate the learned states with user-reported interruptibility levels. We report experimental results using this technique on real data, and show how such an approach can allow for adaptation to specific user preferences.

© All rights reserved Malkin et al. and/or their publisher

p. 12-19

Danninger, Maria, Kluge, Tobias and Stiefelhagen, Rainer (2006): MyConnector: analysis of context cues to predict human availability for communication. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 12-19. Available online

In this thriving world of mobile communications, the difficulty of communication is no longer contacting someone, but rather contacting people in a socially appropriate manner. Ideally, senders should have some understanding of a receiver's availability in order to make contact at the right time, in the right contexts, and with the optimal communication medium. We describe the design and implementation of MyConnector, an adaptive and context-aware service designed to facilitate efficient and appropriate communication, based on each party's availability. One of the chief design questions of such a service is to produce technologies with sufficient contextual awareness to decide upon a person's availability for communication. We present results from a pilot study comparing a number of context cues and their predictive power for gauging one's availability.

© All rights reserved Danninger et al. and/or their publisher

p. 126-127

Karpov, E., Kiss, I., Leppänen, J., Olsen, J., Oria, D., Sivadas, S. and Tian, J. (2006): Short message dictation on Symbian series 60 mobile phones. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 126-127. Available online

Dictation of natural language text on embedded mobile devices is a challenging task. First, it involves memory and CPU-efficient implementation of robust speech recognition algorithms that are generally resource demanding. Secondly, the acoustic and language models employed in the recognizer require the availability of suitable text and speech language resources, typically for a wide set of languages. Thirdly, a proper design of the UI is also essential. The UI has to provide intuitive and easy means for dictation and error correction, and must be suitable for a mobile usage scenario. In this demonstrator, an embedded speech recognition system for short message (SMS) dictation in US English is presented. The system is running on Nokia Series 60 mobile phones (e.g., N70, E60). The system's vocabulary is 23 thousand words. Its Flash and RAM memory footprints are small, 2 and 2.5 megabytes, respectively. After a short enrollment session, most native speakers can achieve a word accuracy of over 90% when dictating short messages in quiet or moderately noisy environments.

© All rights reserved Karpov et al. and/or their publisher

p. 128

Fillinger, Antoine, Degré, Stéphane, Hamchi, Imad and Stanford, Vincent (2006): The NIST smart data flow system II multimodal data transport infrastructure. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. p. 128. Available online

Multimodal interfaces require numerous computing devices, sensors, and dynamic networking, to acquire, transport, and process the sensor streams necessary to sense human activities and respond to them. The NIST Smart Data Flow System Version II embodies many improvements requested by the research community including multiple operating systems, simplified data transport protocols, additional language bindings, an extensible object oriented architecture, and improved fault tolerance.

© All rights reserved Fillinger et al. and/or their publisher

p. 129-130

Boda, Péter Pál (2006): A contextual multimodal integrator. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 129-130. Available online

Multimodal Integration addresses the problem of combining various user inputs into a single semantic representation that can be used in deciding the next step of system action(s). The method presented in this paper uses a statistical framework to implement the integration mechanism and includes contextual information in addition to the actual user input. The underlying assumption is that the more information sources are taken into account, the better a picture can be drawn of the actual intention of the user in the given context of the interaction. The paper presents the latest results with a Maximum Entropy classifier, with special emphasis on the use of contextual information (type of gesture movements and type of objects selected). Instead of explaining the design and implementation process in detail (a longer paper to be published later will do that), only a short description is provided here of the demonstration implementation, which produces above 91% accuracy for the 1-best and higher than 96% for the accumulated five N-best results.

© All rights reserved Boda and/or his/her publisher
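
The integrator described above is a Maximum Entropy classifier over the user input plus contextual cues (gesture type and selected-object type). The sketch below shows the general idea with multinomial logistic regression (equivalent to a conditional maximum-entropy classifier); the features, action labels, and data are invented for illustration and are not the paper's.

# Sketch of maximum-entropy integration of multimodal and contextual cues.
# Feature names, action labels, and training data are hypothetical.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

X = [  # each sample: speech hypothesis plus contextual cues
    {"speech": "zoom here", "gesture": "point",  "object": "map_region"},
    {"speech": "delete",    "gesture": "circle", "object": "icon"},
    {"speech": "move this", "gesture": "drag",   "object": "icon"},
    {"speech": "zoom here", "gesture": "circle", "object": "map_region"},
]
y = ["ZOOM", "DELETE", "MOVE", "ZOOM"]  # intended user action (made up)

vec = DictVectorizer(sparse=False)
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X), y)

# N-best output: ranked class posteriors for a new multimodal observation
query = {"speech": "zoom here", "gesture": "point", "object": "map_region"}
probs = clf.predict_proba(vec.transform([query]))[0]
for label, p in sorted(zip(clf.classes_, probs), key=lambda t: -t[1]):
    print(label, round(p, 3))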

p. 131-132

Barthelmess, Paulo, Kaiser, Edward, Huang, Xiao, McGee, David and Cohen, Philip (2006): Collaborative multimodal photo annotation over digital paper. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 131-132. Available online

The availability of metadata annotations over media content such as photos is known to enhance retrieval and organization, particularly for large data sets. The greatest challenge for obtaining annotations remains getting users to perform the large amount of tedious manual work that is required. In this demo we show a system for semi-automated labeling based on extraction of metadata from naturally occurring conversations of groups of people discussing pictures among themselves. The system supports a variety of collaborative label elicitation scenarios mixing co-located and distributed participants, operating primarily via speech, handwriting and sketching over tangible digital paper photo printouts. We demonstrate the real-time capabilities of the system by providing hands-on annotation experience for conference participants. Demo annotations are performed over public domain pictures portraying mainstream themes (e.g. from famous movies).

© All rights reserved Barthelmess et al. and/or their publisher

p. 133-134

Bergl, Vladimír, Čmejrek, Martin, Fanta, Martin, Labský, Martin, Seredi, Ladislav, Šedivý, Jan and Ures, Lubos (2006): CarDialer: multi-modal in-vehicle cellphone control application. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 133-134. Available online

This demo presents CarDialer -- an in-car cellphone control application. Its multi-modal user interface blends state-of-the-art speech recognition technology (including text-to-speech synthesis) with the existing, well-proven elements of a vehicle information system GUI (buttons mounted on a steering wheel and an LCD equipped with touch-screen). This conversational system provides access to name dialing, unconstrained dictation of numbers, adding new names, operations with lists of calls and messages, notification of presence, etc. The application is fully functional from the first start; no prerequisite steps (such as configuration or speech recognition enrollment) are required. The presentation of the proposed multi-modal architecture goes beyond the specific application and presents a modular platform to integrate application logic with various incarnations of UI modalities.

© All rights reserved Bergl et al. and/or their publisher

p. 135-136

Takikawa, Erina, Kinoshita, Koichi, Lao, Shihong and Kawade, Masato (2006): Gender and age estimation system robust to pose variations. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 135-136. Available online

For applications based on facial image processing, pose variation is a difficult problem. In this paper, we propose a gender and age estimation system that is robust against pose variations. The acceptable facial pose range is a yaw (left-right) from -30 degrees to +30 degrees and a pitch (up-down) from -20 degrees to +20 degrees. According to our experiments on several large databases collected under real environments, the gender estimation accuracy is 84.8% and the age estimation accuracy is 80.9% (subjects are divided into 5 classes). The average processing time is about 70 ms/frame for gender estimation and 95 ms/frame for age estimation (Pentium4 3.2 GHz). The system can be used to automatically analyze shopping customers and pedestrians using surveillance cameras.

© All rights reserved Takikawa et al. and/or their publisher

p. 137-138

Kinoshita, Koichi, Ma, Yong, Lao, Shihong and Kawaade, Masato (2006): A fast and robust 3D head pose and gaze estimation system. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 137-138. Available online

We developed a fast and robust head pose and gaze estimation system. This system can detect facial points and estimate 3D pose angles and gaze direction under various conditions including facial expression changes and partial occlusion. We need only one face image as input and do not need special devices such as blinking LEDs or stereo cameras. Moreover, no calibration is needed. The system shows a 95% head pose estimation accuracy and 81% gaze estimation accuracy (when the error margin is 15 degrees). The processing time is about 15 ms/frame (Pentium4 3.2 GHz). Acceptable range of facial pose is within a yaw (left-right) of 60 degrees and within a pitch (up-down) of 30 degrees.

© All rights reserved Kinoshita et al. and/or their publisher

p. 139-145

Zeng, Zhihong, Hu, Yuxiao, Fu, Yun, Huang, Thomas S., Roisman, Glenn I. and Wen, Zhen (2006): Audio-visual emotion recognition in adult attachment interview. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 139-145. Available online

Automatic multimodal recognition of spontaneous affective expressions is a largely unexplored and challenging problem. In this paper, we explore audio-visual emotion recognition in a realistic human conversation setting -- the Adult Attachment Interview (AAI). Based on the assumption that facial expression and vocal expression are at the same coarse affective states, positive and negative emotion sequences are labeled according to Facial Action Coding System Emotion Codes. Facial texture in the visual channel and prosody in the audio channel are integrated in the framework of an Adaboost multi-stream hidden Markov model (AMHMM), in which an Adaboost learning scheme is used to build the component HMM fusion. Our approach is evaluated in preliminary AAI spontaneous emotion recognition experiments.

© All rights reserved Zeng et al. and/or their publisher

p. 146-154

Caridakis, George, Malatesta, Lori, Kessous, Loïc, Amir, Noam, Raouzaiou, Amaryllis and Karpouzis, Kostas (2006): Modeling naturalistic affective states via facial and vocal expressions recognition. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 146-154. Available online

Affective and human-centered computing are two areas related to HCI which have attracted attention during the past years. One reason for this is the plethora of devices able to record and process multimodal input from users and adapt their functionality to their preferences or individual habits, thus enhancing usability and becoming attractive to users less accustomed to conventional interfaces. In the quest to receive feedback from the users in an unobtrusive manner, the visual and auditory modalities allow us to infer the users' emotional state, combining information both from facial expression recognition and speech prosody feature extraction. In this paper, we describe a multi-cue, dynamic approach in naturalistic video sequences. Contrary to strictly controlled recording conditions of audiovisual material, the current research focuses on sequences taken from nearly real world situations. Recognition is performed via a 'Simple Recurrent Network' which lends itself well to modeling dynamic events in both the user's facial expressions and speech. Moreover, this approach differs from existing work in that it models user expressivity using a dimensional representation of activation and valence, instead of detecting the usual 'universal emotions', which are scarce in everyday human-machine interaction. The algorithm is deployed on an audiovisual database which was recorded simulating human-human discourse and, therefore, contains less extreme expressivity and subtle variations of a number of emotion labels.

© All rights reserved Caridakis et al. and/or their publisher

p. 155-161

Dong, Wen, Gips, Jonathan and Pentland, Alex (Sandy) (2006): A 'need to know' system for group classification. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 155-161. Available online

This paper outlines the design of a distributed sensor classification system with abnormality detection intended for groups of people who are participating in coordinated activities. The system comprises an implementation of a distributed Dynamic Bayesian Network (DBN) model called the Influence Model (IM) that relies heavily on an inter-process communication architecture called Enchantment to establish the pathways of information that the model requires. We use three examples to illustrate how the "need to know" system effectively recognizes the group structure by simulating the work of cooperating individuals.

© All rights reserved Dong et al. and/or their publisher

p. 162-170

Valstar, Michel F., Pantic, Maja, Ambadar, Zara and Cohn, Jeffrey F. (2006): Spontaneous vs. posed facial behavior: automatic analysis of brow actions. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 162-170. Available online

Past research on automatic facial expression analysis has focused mostly on the recognition of prototypic expressions of discrete emotions rather than on the analysis of dynamic changes over time, although the importance of temporal dynamics of facial expressions for interpretation of the observed facial behavior has been acknowledged for over 20 years. For instance, it has been shown that the temporal dynamics of spontaneous and volitional smiles are fundamentally different from each other. In this work, we argue that the same holds for the temporal dynamics of brow actions and show that velocity, duration, and order of occurrence of brow actions are highly relevant parameters for distinguishing posed from spontaneous brow actions. The proposed system for discrimination between volitional and spontaneous brow actions is based on automatic detection of Action Units (AUs) and their temporal segments (onset, apex, offset) produced by movements of the eyebrows. For each temporal segment of an activated AU, we compute a number of mid-level feature parameters including the maximal intensity, duration, and order of occurrence. We use Gentle Boost to select the most important of these parameters. The selected parameters are used further to train Relevance Vector Machines to determine per temporal segment of an activated AU whether the action was displayed spontaneously or volitionally. Finally, a probabilistic decision function determines the class (spontaneous or posed) for the entire brow action. When tested on 189 samples taken from three different sets of spontaneous and volitional facial data, we attain a 90.7% correct recognition rate.

© All rights reserved Valstar et al. and/or their publisher

p. 171-178

Maat, Ludo and Pantic, Maja (2006): Gaze-X: adaptive affective multimodal interface for single-user office scenarios. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 171-178. Available online

This paper describes an intelligent system that we developed to support affective multimodal human-computer interaction (AMM-HCI) where the user's actions and emotions are modeled and then used to adapt the HCI and support the user in his or her activity. The proposed system, which we named Gaze-X, is based on sensing and interpretation of the human part of the computer's context, known as W5+ (who, where, what, when, why, how). It integrates a number of natural human communicative modalities including speech, eye gaze direction, face and facial expression, and a number of standard HCI modalities like keystrokes, mouse movements, and active software identification, which, in turn, are fed into processes that provide decision making and adapt the HCI to support the user in his or her activity according to his or her preferences. To attain a system that can be educated, that can improve its knowledge and decision making through experience, we use case-based reasoning as the inference engine of Gaze-X. The utilized case base is a dynamic, incrementally self-organizing event-content-addressable memory that allows fact retrieval and evaluation of encountered events based upon the user preferences and the generalizations formed from prior input. To support concepts of concurrency, modularity/scalability, persistency, and mobility, Gaze-X has been built as an agent-based system where different agents are responsible for different parts of the processing. A usability study conducted in an office scenario with a number of users indicates that Gaze-X is perceived as effective, easy to use, useful, and affectively qualitative.

© All rights reserved Maat and Pantic and/or their publisher

p. 179-184

Ruttkay, Z. M., Reidsma, D. and Nijholt, A. (2006): Human computing, virtual humans and artificial imperfection. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 179-184. Available online

In this paper we raise the issue of whether imperfections, characteristic of human-human communication, should be taken into account when developing virtual humans. We argue that endowing virtual humans with the imperfections of humans can help make them more 'comfortable' to interact with. That is, the natural communication of a virtual human should not be restricted to multimodal utterances that are always perfect, both in the sense of form and of content. We illustrate our views with examples from two of our own applications: the Virtual Dancer and the Virtual Trainer. In both applications, imperfection helps keep the interaction engaging and entertaining.

© All rights reserved Ruttkay et al. and/or their publisher

p. 185-192

Chen, Lei, Harper, Mary and Huang, Zhongqiang (2006): Using maximum entropy (ME) model to incorporate gesture cues for SU detection. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 185-192. Available online

Accurate identification of sentence units (SUs) in spontaneous speech has been found to improve the accuracy of speech recognition, as well as downstream applications such as parsing. In recent multimodal investigations, gestural features were utilized, in addition to lexical and prosodic cues from the speech channel, for detecting SUs in conversational interactions using a hidden Markov model (HMM) approach. Although this approach is computationally efficient and provides a convenient way to modularize the knowledge sources, it has two drawbacks for our SU task. First, standard HMM training methods maximize the joint probability of observations and hidden events, as opposed to the posterior probability of a hidden event given observations, a criterion more closely related to SU classification error. A second challenge for integrating gestural features is that their absence sanctions neither SU events nor non-events; it is only the co-timing of gestures with the speech channel that should impact our model. To address these problems, a Maximum Entropy (ME) model is used to combine multimodal cues for SU estimation. Experiments carried out on VACE multi-party meetings confirm that the ME modeling approach provides a solid framework for multimodal integration.

© All rights reserved Chen et al. and/or their publisher
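
The modeling contrast drawn above (generative HMM training versus a posterior-oriented criterion) can be stated compactly. The notation below is generic and not taken from the paper: O denotes the multimodal observations and E the sequence of SU/non-SU events.

% Standard HMM training maximizes the joint likelihood of observations
% and hidden events:
\hat{\theta}_{\mathrm{HMM}} = \arg\max_{\theta} P_{\theta}(O, E)
% A maximum-entropy (conditional) model instead maximizes the posterior
% of the hidden SU events given the observations, a criterion closer to
% classification error:
\hat{\theta}_{\mathrm{ME}} = \arg\max_{\theta} P_{\theta}(E \mid O)
  = \arg\max_{\theta} \prod_{t}
    \frac{\exp\big(\sum_{k} \lambda_{k} f_{k}(e_{t}, o_{t})\big)}
         {\sum_{e'} \exp\big(\sum_{k} \lambda_{k} f_{k}(e', o_{t})\big)}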

p. 193-200

Qu, Shaolin and Chai, Joyce Y. (2006): Salience modeling based on non-verbal modalities for spoken language understanding. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 193-200. Available online

Previous studies have shown that, in multimodal conversational systems, fusing information from multiple modalities together can improve the overall input interpretation through mutual disambiguation. Inspired by these findings, this paper investigates non-verbal modalities, in particular deictic gesture, in spoken language processing. Our assumption is that during multimodal conversation, user's deictic gestures on the graphic display can signal the underlying domain model that is salient at that particular point of interaction. This salient domain model can be used to constrain hypotheses for spoken language processing. Based on this assumption, this paper examines different configurations of salience driven language models (e.g., n-gram and probabilistic context free grammar) for spoken language processing across different stages. Our empirical results have shown the potential of integrating salience models based on non-verbal modalities in spoken language understanding.

© All rights reserved Qu and Chai and/or their publisher
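
One plausible way to realize such a salience-driven n-gram model (an assumption for illustration; the abstract does not commit to this exact form) is to interpolate the baseline word distribution with one conditioned on the currently salient domain objects:

# Assumed illustrative form of a salience-driven language model: linear
# interpolation of a baseline distribution with one boosted toward the
# domain objects made salient by a deictic gesture. Values are made up.
def salience_interpolate(p_base, p_salient, lam=0.4):
    """P(w) = lam * P_salient(w) + (1 - lam) * P_base(w)."""
    words = set(p_base) | set(p_salient)
    return {w: lam * p_salient.get(w, 0.0) + (1 - lam) * p_base.get(w, 0.0)
            for w in words}

p_base = {"lake": 0.02, "hotel": 0.05, "restaurant": 0.05, "the": 0.20}
p_salient = {"lake": 0.30, "hotel": 0.25}   # objects under the user's gesture
p = salience_interpolate(p_base, p_salient)
print(sorted(p.items(), key=lambda kv: -kv[1])[:3])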

p. 2

O'Modhrain, Sile (2006): Movement and music: designing gestural interfaces for computer-based musical instruments. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. p. 2. Available online

The concept of body-mediated or embodied interaction, of the coupling of interface and actor, has become increasingly relevant within the domain of HCI. With the reduced size and cost of a wide variety of sensor technologies and the ease with which they can be wirelessly deployed, on the body, in devices we carry with us and in the environment, comes the opportunity to use a wide range of human motion as an integral part of our interaction with many applications. While movement is potentially a rich, multidimensional source of information upon which interface designers can draw, its very richness poses many challenges in developing robust motion capture and gesture recognition systems. In this talk, I will suggest that lessons learned by designers of computer-based musical instruments whose task is to translate expressive movement into nuanced control of sound may now help to inform the design of movement-based interfaces for a much wider range of applications.

© All rights reserved O'Modhrain and/or his/her publisher

p. 20-27

Lunsford, Rebecca and Oviatt, Sharon (2006): Human perception of intended addressee during computer-assisted meetings. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 20-27. Available online

Recent research aims to develop new open-microphone engagement techniques capable of identifying when a speaker is addressing a computer versus human partner, including during computer-assisted group interactions. The present research explores: (1) how accurately people can judge whether an intended interlocutor is a human versus computer, (2) which linguistic, acoustic-prosodic, and visual information sources they use to make these judgments, and (3) what type of systematic errors are present in their judgments. Sixteen participants were asked to determine a speaker's intended addressee based on actual videotaped utterances matched on illocutionary force, which were played back as: (1) lexical transcriptions only, (2) audio-only, (3) visual-only, and (4) audio-visual information. Perhaps surprisingly, people's accuracy in judging human versus computer addressees did not exceed chance levels with lexical-only content (46%). As predicted, accuracy improved significantly with audio (58%), visual (57%), and especially audio-visual information (63%). Overall, accuracy in detecting human interlocutors was significantly worse than judging computer ones, and specifically worse when only visual information was present because speakers often looked at the computer when addressing peers. In contrast, accuracy in judging computer interlocutors was significantly better whenever visual information was present than with audio alone, and it yielded the highest accuracy levels observed (86%). Questionnaire data also revealed that speakers' gaze, peers' gaze, and tone of voice were considered the most valuable information sources. These results reveal that people rely on cues appropriate for interpersonal interactions in determining computer- versus human-directed speech during mixed human-computer interactions, even though this degrades their accuracy. Future systems that process actual rather than expected communication patterns potentially could be designed that perform better than humans.

© All rights reserved Lunsford and Oviatt and/or their publisher

p. 201-208

Noulas, A. K. and Kröse, B. J. A. (2006): EM detection of common origin of multi-modal cues. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 201-208. Available online

Content analysis of clips containing people speaking involves processing informative cues coming from different modalities. These cues are usually the words extracted from the audio modality, and the identity of the persons appearing in the video modality of the clip. To achieve efficient assignment of these cues to the person that created them, we propose a Bayesian network model that utilizes the extracted feature characteristics, their relations and their temporal patterns. We use the EM algorithm in which the E-step estimates the expectation of the complete-data log-likelihood with respect to the hidden variables -- that is the identity of the speakers and the visible persons. In the M-step, the person models that maximize this expectation are computed. This framework produces excellent results, exhibiting exceptional robustness when dealing with low quality data.

© All rights reserved Noulas and Kröse and/or their publisher
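
As a generic reminder of the E-step/M-step structure the abstract refers to (this is not the authors' Bayesian-network model, just a textbook two-component Gaussian-mixture analogy on synthetic data):

# Generic EM skeleton: a two-component 1-D Gaussian mixture, showing the
# E-step (posterior over the hidden assignment) and the M-step (parameter
# re-estimation). Synthetic data; not the paper's audio-visual model.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibility of each hidden component for each observation
    r = np.vstack([w[k] * gauss(x, mu[k], sigma[k]) for k in range(2)])
    r /= r.sum(axis=0)
    # M-step: parameters that maximize the expected complete-data log-likelihood
    n_k = r.sum(axis=1)
    w = n_k / len(x)
    mu = (r * x).sum(axis=1) / n_k
    sigma = np.sqrt((r * (x - mu[:, None]) ** 2).sum(axis=1) / n_k)

print(np.round(mu, 2), np.round(sigma, 2), np.round(w, 2))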

p. 209-216

Arthur, Alexander M., Lunsford, Rebecca, Wesson, Matt and Oviatt, Sharon (2006): Prototyping novel collaborative multimodal systems: simulation, data collection and analysis tools for the next decade. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 209-216. Available online

To support research and development of next-generation multimodal interfaces for complex collaborative tasks, a comprehensive new infrastructure has been created for collecting and analyzing time-synchronized audio, video, and pen-based data during multi-party meetings. This infrastructure needs to be unobtrusive and to collect rich data involving multiple information sources of high temporal fidelity to allow the collection and annotation of simulation-driven studies of natural human-human-computer interactions. Furthermore, it must be flexibly extensible to facilitate exploratory research. This paper describes both the infrastructure put in place to record, encode, playback and annotate the meeting-related media data, and also the simulation environment used to prototype novel system concepts.

© All rights reserved Arthur et al. and/or their publisher

p. 217-224

Ou, Jiazhi, Shi, Yanxin, Wong, Jeffrey, Fussell, Susan R. and Yang, Jie (2006): Combining audio and video to predict helpers' focus of attention in multiparty remote collaboration on physical tasks. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 217-224. Available online

The increasing interest in supporting multiparty remote collaboration has created both opportunities and challenges for the research community. The research reported here aims to develop tools to support multiparty remote collaborations and to study human behaviors using these tools. In this paper we first introduce an experimental multimedia (video and audio) system with which an expert can collaborate with several novices. We then use this system to study helpers' focus of attention (FOA) during a collaborative circuit assembly task. We investigate the relationship between FOA and language as well as activities using multimodal (audio and video) data, and use learning methods to predict helpers' FOA. We process different modalities separately and fuse the results to make a final decision. We employ a sliding window-based delayed labeling method to automatically predict changes in FOA in real time using only the dialogue among the helper and workers. We apply an adaptive background subtraction method and support vector machine to recognize the worker's activities from the video. To predict the helper's FOA, we make decisions using the information of joint project boundaries and workers' recent activities. The overall prediction accuracies are 79.52% using audio only and 81.79% using audio and video combined.

© All rights reserved Ou et al. and/or their publisher

p. 225-232

Wang, Qianying, Battocchi, Alberto, Graziola, Ilenia, Pianesi, Fabio, Tomasini, Daniel, Zancanaro, Massimo and Nass, Clifford (2006): The role of psychological ownership and ownership markers in collaborative working environment. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 225-232. Available online

In this paper, we present a study concerning psychological ownership for digital entities in the context of collaborative working environments. In the first part of the paper we present a conceptual framework of ownership: various issues such as definition, effects, target factors and behavioral manifestation are explicated. We then focus on ownership marking, a behavioral manifestation that is closely tied to psychological ownership. We designed an experiment using DiamondTouch Table to investigate the effect of two of the most widely used ownership markers on users' attitudes and performance. Both performance and attitudinal differences were found, suggesting the significant role of ownership and ownership markers in the groupware and interactive workspaces design.

© All rights reserved Wang et al. and/or their publisher

p. 233-238

Cohn, Jeffrey F. (2006): Foundations of human computing: facial expression and emotion. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 233-238. Available online

Many people believe that emotions and subjective feelings are one and the same and that a goal of human-centered computing is emotion recognition. The first belief is outdated; the second mistaken. For human-centered computing to succeed, a different way of thinking is needed. Emotions are species-typical patterns that evolved because of their value in addressing fundamental life tasks[19]. Emotions consist of multiple components that may include intentions, action tendencies, appraisals, other cognitions, central and peripheral changes in physiology, and subjective feelings. Emotions are not directly observable, but are inferred from expressive behavior, self-report, physiological indicators, and context. I focus on expressive behavior because of its coherence with other indicators and the depth of research on the facial expression of emotion in behavioral and computer science. In this paper, among the topics I include are approaches to measurement, timing or dynamics, individual differences, dyadic interaction, and inference. I propose that design and implementation of perceptual user interfaces may be better informed by considering the complexity of emotion, its various indicators, measurement, individual differences, dyadic interaction, and problems of inference.

© All rights reserved Cohn and/or his/her publisher

p. 239-248

Pantic, Maja, Pentland, Alex, Nijholt, Anton and Huang, Thomas (2006): Human computing and machine understanding of human behavior: a survey. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 239-248. Available online

A widely accepted prediction is that computing will move to the background, weaving itself into the fabric of our everyday living spaces and projecting the human user into the foreground. If this prediction is to come true, then next generation computing, which we will call human computing, should be about anticipatory user interfaces that should be human-centered, built for humans based on human models. They should transcend the traditional keyboard and mouse to include natural, human-like interactive functions including understanding and emulating certain human behaviors such as affective and social signaling. This article discusses a number of components of human behavior, how they might be integrated into computers, and how far we are from realizing the front end of human computing, that is, how far are we from enabling computers to understand human behavior.

© All rights reserved Pantic et al. and/or their publisher

p. 249-256

Blanz, Volker (2006): Computing human faces for human viewers: automated animation in photographs and paintings. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 249-256. Available online

This paper describes a system for animating and modifying faces in images. It combines an algorithm for 3D face reconstruction from single images with a learning-based approach for 3D animation and face modification. Modifications include changes of facial attributes, such as body weight, masculine or feminine look, or overall head shape, as well as cut-and-paste exchange of faces. Unlike traditional photo retouching, this technique can be applied across changes in pose and lighting. Bridging the gap between photorealistic image processing and 3D graphics, the system provides tools for interacting with existing image material, such as photographs or paintings. The core of the approach is a statistical analysis of a dataset of 3D faces, and an analysis-by-synthesis loop that simulates the process of image formation for high-level image processing.

© All rights reserved Blanz and/or his/her publisher

p. 257-264

Rienks, Rutger, Zhang, Dong, Gatica-Perez, Daniel and Post, Wilfried (2006): Detection and application of influence rankings in small group meetings. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 257-264. Available online

We address the problem of automatically detecting participants' influence levels in meetings. The impact and social psychological background are discussed. The more influential a participant is, the more he or she influences the outcome of a meeting. Experiments on 40 meetings show that applying statistical (both dynamic and static) models to easily obtainable features results in a best prediction performance of 70.59% when using a static model, a balanced training set, and three discrete classes: high, normal and low. Applications of the detected levels are shown in various ways, e.g. in a virtual meeting environment as well as in a meeting browser system.

© All rights reserved Rienks et al. and/or their publisher

p. 265-272

Smith, Kevin, Ba, Sileye O., Gatica-Perez, Daniel and Odobez, Jean-Marc (2006): Tracking the multi person wandering visual focus of attention. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 265-272. Available online

Estimating the wandering visual focus of attention (WVFOA) for multiple people is an important problem with many applications in human behavior understanding. One such application, addressed in this paper, monitors the attention of passers-by to outdoor advertisements. To solve the WVFOA problem, we propose a multi-person tracking approach based on a hybrid Dynamic Bayesian Network that simultaneously infers the number of people in the scene, their body and head locations, and their head pose, in a joint state-space formulation that is amenable to person interaction modeling. The model exploits both global measurements and individual observations for the VFOA. For inference in the resulting high-dimensional state-space, we propose a trans-dimensional Markov Chain Monte Carlo (MCMC) sampling scheme, which not only handles a varying number of people, but also efficiently searches the state-space by allowing person-part state updates. Our model was rigorously evaluated for tracking and its ability to recognize when people look at an outdoor advertisement using a realistic data set.

© All rights reserved Smith et al. and/or their publisher

p. 273-280

Lunsford, Rebecca, Oviatt, Sharon and Arthur, Alexander M. (2006): Toward open-microphone engagement for multiparty interactions. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 273-280. Available online

There currently is considerable interest in developing new open-microphone engagement techniques for speech and multimodal interfaces that perform robustly in complex mobile and multiparty field environments. State-of-the-art audio-visual open-microphone engagement systems aim to eliminate the need for explicit user engagement by processing more implicit cues that a user is addressing the system, which results in lower cognitive load for the user. This is an especially important consideration for mobile and educational interfaces due to the higher load required by explicit system engagement. In the present research, longitudinal data were collected with six triads of high-school students who engaged in peer tutoring on math problems with the aid of a simulated computer assistant. Results revealed that amplitude was 3.25dB higher when users addressed a computer rather than human peer when no lexical marker of intended interlocutor was present, and 2.4dB higher for all data. These basic results were replicated for both matched and adjacent utterances to computer versus human partners. With respect to dialogue style, speakers did not direct a higher ratio of commands to the computer, although such dialogue differences have been assumed in prior work. Results of this research reveal that amplitude is a powerful cue marking a speaker's intended addressee, which should be leveraged to design more effective microphone engagement during computer-assisted multiparty interactions.

© All rights reserved Lunsford et al. and/or their publisher
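
For reference, the amplitude differences above are on a decibel scale, where a 3.25 dB gap corresponds to roughly a 1.45x ratio in RMS amplitude. A minimal sketch of the conversion follows (the exact amplitude measure used in the study is not specified here):

# Sketch: RMS level of an audio frame expressed in dB relative to an
# arbitrary reference. Signal values are synthetic, for illustration only.
import numpy as np

def rms_db(samples, reference=1.0):
    rms = np.sqrt(np.mean(np.asarray(samples, dtype=float) ** 2))
    return 20.0 * np.log10(rms / reference)

a = np.random.default_rng(1).normal(0.0, 0.10, 16000)  # quieter utterance
b = a * 10 ** (3.25 / 20)                               # 3.25 dB louder
print(round(rms_db(b) - rms_db(a), 2))                  # ~3.25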

p. 28-34

Zancanaro, Massimo, Lepri, Bruno and Pianesi, Fabio (2006): Automatic detection of group functional roles in face to face interactions. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 28-34. Available online

In this paper, we discuss a machine learning approach to automatically detect functional roles played by participants in a face to face interaction. We shortly introduce the coding scheme we used to classify the roles of the group members and the corpus we collected to assess the coding scheme reliability as well as to train statistical systems for automatic recognition of roles. We then discuss a machine learning approach based on multi-class SVM to automatically detect such roles by employing simple features of the visual and acoustical scene. The effectiveness of the classification is better than the chosen baselines and although the results are not yet good enough for a real application, they demonstrate the feasibility of the task of detecting group functional roles in face to face interactions.

© All rights reserved Zancanaro et al. and/or their publisher

p. 281-286

Voit, Michael and Stiefelhagen, Rainer (2006): Tracking head pose and focus of attention with multiple far-field cameras. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 281-286. Available online

In this work we present our recent approach to estimating head orientations and foci of attention of multiple people in a smart room, which is equipped with several cameras to monitor the room. In our approach, we estimate each person's head orientation with respect to the room coordinate system by using all camera views. We implemented a Neural Network to estimate head pose on every single camera view; a Bayes filter is then applied to integrate every estimate into one final, joint hypothesis. Using this scheme, we can track people's horizontal head orientations in a full 360° range at almost all positions within the room. The tracked head orientations are then used to determine who is looking at whom, i.e. people's focus of attention. We report experimental results on one meeting video that was recorded in the smart room.

© All rights reserved Voit and Stiefelhagen and/or their publisher
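
The fusion step described above combines per-camera head-pose estimates into one joint hypothesis with a Bayes filter. A generic sketch of that idea over binned horizontal orientation follows; the Gaussian noise models, bin size, and measurements are assumptions for illustration, and the per-camera neural-network front end is not reproduced.

# Generic discrete Bayes filter over binned horizontal head orientation,
# fusing independent per-camera orientation estimates. All noise models
# and measurements below are invented for illustration.
import numpy as np

BINS = np.arange(0, 360, 10)  # 10-degree orientation bins

def circ_diff(a, b):
    d = np.abs(a - b) % 360
    return np.minimum(d, 360 - d)

def likelihood(measurement_deg, sigma=20.0):
    l = np.exp(-0.5 * (circ_diff(BINS, measurement_deg) / sigma) ** 2)
    return l / l.sum()

def predict(belief, drift_sigma=15.0):
    # Diffuse the belief to account for head motion between frames
    kernel = likelihood(0.0, drift_sigma)
    spread = np.array([np.roll(kernel, i).dot(belief) for i in range(len(BINS))])
    return spread / spread.sum()

belief = np.full(len(BINS), 1.0 / len(BINS))  # uniform prior
for per_camera_estimates in [(80, 95, 70), (85, 90, 100)]:  # degrees, per frame
    belief = predict(belief)
    for z in per_camera_estimates:            # fuse each camera's estimate
        belief = belief * likelihood(z)
        belief = belief / belief.sum()
    print("MAP head orientation:", BINS[int(np.argmax(belief))], "deg")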

p. 287-294

Morency, Louis-Philippe, Christoudias, C. Mario and Darrell, Trevor (2006): Recognizing gaze aversion gestures in embodied conversational discourse. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 287-294. Available online

Eye gaze offers several key cues regarding conversational discourse during face-to-face interaction between people. While a large body of research results exist to document the use of gaze in human-to-human interaction, and in animating realistic embodied avatars, recognition of conversational eye gestures -- distinct eye movement patterns relevant to discourse -- has received less attention. We analyze eye gestures during interaction with an animated embodied agent and propose a non-intrusive vision-based approach to estimate eye gaze and recognize eye gestures. In our user study, human participants avert their gaze (i.e. with "look-away" or "thinking" gestures) during periods of cognitive load. Using our approach, an agent can visually differentiate whether a user is thinking about a response or is waiting for the agent or robot to take its turn.

© All rights reserved Morency et al. and/or their publisher

p. 295-301

Rath, Matthias and Rohs, Michael (2006): Explorations in sound for tilting-based interfaces. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 295-301. Available online

Everyday experience as well as recent studies suggest that information contained in ecological sonic feedback may improve human control of, and interaction with, a system. This notion is particularly worthwhile to consider in the context of mobile, tilting-based interfaces, which have been proposed, developed, and studied extensively. Two interfaces are used for this purpose: the Ballancer, based on the metaphor of balancing a rolling ball on a track, and a more concretely application-oriented setup of a mobile phone with tilting-based input. First pilot studies have been conducted.

© All rights reserved Rath and Rohs and/or their publisher

p. 3

Clark, Herbert H. (2006): Mixing virtual and actual. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. p. 3. Available online

People often communicate with a mixture of virtual and actual elements. On the telephone, my sister and I and what we say are actual, even though our voices are virtual. In the London Underground, the warning expressed in the recording "Stand clear of the doors" is actual, even though the person making it is virtual. In the theater, Shakespeare, the actors, and I are actual, even though Romeo and Juliet and what they say are virtual. Mixtures like these cannot be accounted for in standard models of communication, for a variety of reasons. In this talk I introduce the notion of displaced actions (as on the telephone, in the London Underground, and in the theater) and characterize how they are used and interpreted in communication with a range of modern-day technologies.

© All rights reserved Clark and/or his/her publisher

p. 302-309

Enriquez, Mario, MacLean, Karon and Chita, Christian (2006): Haptic phonemes: basic building blocks of haptic communication. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 302-309. Available online

A haptic phoneme represents the smallest unit of a constructed haptic signal to which a meaning can be assigned. These haptic phonemes can be combined serially or in parallel to form haptic words, or haptic icons, which can hold more elaborate meanings for their users. Here, we use phonemes which consist of brief (<2 seconds) haptic stimuli composed of a simple waveform at a constant frequency and amplitude. Building on previous results showing that a set of 12 such haptic stimuli can be perceptually distinguished, here we test learnability and recall of associations for arbitrarily chosen stimulus-meaning pairs. We found that users could consistently recall an arbitrary association between a haptic stimulus and its assigned arbitrary meaning in a 9-phoneme set, during a 45 minute test period following a reinforced learning stage.

© All rights reserved Enriquez et al. and/or their publisher
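
The phonemes described above are brief stimuli with a simple waveform at constant frequency and amplitude. A minimal sketch of generating such a stimulus as a sample buffer follows; the frequencies, amplitudes, durations, and actuator interface are illustrative assumptions, not the study's parameters.

# Sketch: a haptic "phoneme" as a short constant-frequency, constant-
# amplitude waveform, and a serial combination into a haptic icon.
# All parameter values are hypothetical.
import numpy as np

def haptic_phoneme(freq_hz, amplitude, duration_s=0.5, shape="sine", rate=8000):
    t = np.arange(int(duration_s * rate)) / rate
    wave = np.sin(2 * np.pi * freq_hz * t)
    if shape == "square":
        wave = np.sign(wave)
    return amplitude * wave  # sample buffer to stream to a haptic actuator

phonemes = {  # a tiny hypothetical phoneme set
    "A": haptic_phoneme(20, 0.8),
    "B": haptic_phoneme(80, 0.8),
    "C": haptic_phoneme(20, 0.4, shape="square"),
}
word = np.concatenate([phonemes[p] for p in ("A", "C", "B")])  # serial "word"
print(word.shape)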

p. 310-317

Vafai, Nasim Melony, Payandeh, Shahram and Dill, John (2006): Toward haptic rendering for a virtual dissection. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 310-317. Available online

In this paper we present a novel data structure combined with geometrically efficient techniques to simulate a "tissue peeling" method for deformable bodies. This is done to preserve the basic shape of a body in conjunction with soft-tissue deformation of multiple deformable bodies in a geometry-based model. We demonstrate our approach through haptic rendering of a virtual anatomical model for a dissection simulator that consists of surface skin along with multiple internal organs. The simulator uses multimodal cues in the form of haptic feedback to provide guidance and performance feedback to the user. The realism of the simulation is enhanced by computation of interaction forces using extrapolation techniques to send these forces back to the user via a haptic device.

© All rights reserved Vafai et al. and/or their publisher

p. 318-325

Morikawa, Osamu, Hashimoto, Sayuri, Munakata, Tsunetsugu and Okunaka, Junzo (2006): Embrace system for remote counseling. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 318-325. Available online

In counseling, non-verbal communication such as making physical contact is an effective role-playing skill. In remote counseling via videophones, spacing and physical contact cannot be used, and communication must rely only on expressions and words. This paper describes an embrace system for remote counseling, which consists of HyperMirror and vibrators and can provide effects similar to those of physical contact in face-to-face counseling.

© All rights reserved Morikawa et al. and/or their publisher

p. 326-332

Quek, Francis, McNeill, David and Oliveira, Francisco (2006): Enabling multimodal communications for enhancing the ability of learning for the visually impaired. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 326-332. Available online

Students who are blind are typically one to three years behind their seeing counterparts in mathematics and science. We posit that a key reason for this resides in the inability of such students to access the multimodal embodied communicative behavior of mathematics instructors. This impedes the ability of blind students and their teachers to maintain situated communication. In this paper, we set forth the relevant phenomenological analyses to support this claim. We show that mathematical communication and instruction are inherently embodied; that the blind are able to conceptualize visuo-spatial information; and argue that uptake of embodied behavior is critical to receiving relevant mathematical information. Based on this analysis, we advance an approach to provide students who are blind with awareness of their teachers' deictic gestural activity via a set of haptic output devices. We lay out a set of open research questions that researchers in multimodal interfaces may address.

© All rights reserved Quek et al. and/or their publisher

p. 333-338

Prewett, Matthew S., Yang, Liuquin, Stilson, Frederick R. B., Gray, Ashley A., Coovert, Michael D., Burke, Jennifer, Redden, Elizabeth and Elliot, Linda R. (2006): The benefits of multimodal information: a meta-analysis comparing visual and visual-tactile feedback. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 333-338. Available online

Information display systems have become increasingly complex and more difficult for human cognition to process effectively. Based upon Wickens' Multiple Resource Theory (MRT), information delivered using multiple modalities (i.e., visual and tactile) could be more effective than communicating the same information through a single modality. The purpose of this meta-analysis is to compare user effectiveness when using visual-tactile task feedback (a multimodality) to using only visual task feedback (a single modality). Results indicate that using visual-tactile feedback enhances task effectiveness more so than visual feedback (g = .38). When assessing different criteria, visual-tactile feedback is particularly effective at reducing reaction time (g = .631) and increasing performance (g = .618). Follow up moderator analyses indicate that visual-tactile feedback is more effective when workload is high (g = .844) and multiple tasks are being performed (g = .767). Implications of results are discussed in the paper.

© All rights reserved Prewett et al. and/or their publisher
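The effect sizes above are reported as Hedges' g. As a reminder of what that statistic is, here is a minimal Python sketch of computing Hedges' g from two group summaries (means, standard deviations, group sizes); the input numbers are invented for illustration and are not taken from the meta-analysis.

    import math

    def hedges_g(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
        """Standardized mean difference (group A minus group B) with
        small-sample bias correction."""
        df = n_a + n_b - 2
        s_pooled = math.sqrt(((n_a - 1) * sd_a ** 2 +
                              (n_b - 1) * sd_b ** 2) / df)
        d = (mean_a - mean_b) / s_pooled      # Cohen's d
        j = 1 - 3 / (4 * df - 1)              # Hedges' bias-correction factor
        return j * d

    # Invented reaction times (seconds): visual-only vs. visual-tactile feedback.
    # Lower is better, so a positive g favours the visual-tactile condition.
    print(round(hedges_g(0.95, 0.18, 20, 0.82, 0.15, 20), 3))   # ~0.77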

p. 339-346

Liu, Peng and Soong, Frank K. (2006): Word graph based speech recognition error correction by handwriting input. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 339-346. Available online

We propose a convenient handwriting user interface for correcting speech recognition errors efficiently. Via the proposed hand-marked corrections on the displayed recognition result, substitution, deletion and insertion errors can be corrected efficiently by rescoring the word graph generated in the recognition pass. A new path in the graph that matches the user's feedback in the maximum-likelihood sense is found. With the aid of the language model and the hand-corrected part of the best decoded path, rescoring the word graph can correct more errors than the user explicitly provides, and all recognition errors can be corrected after a finite number of corrections. Experimental results show that by indicating one word error in the user feedback, 33.8% of the erroneous sentences can be corrected, while by indicating one character error, 12.9% of the erroneous sentences can be corrected.

© All rights reserved Liu and Soong and/or their publisher
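The paper's core idea is to rescore the recognition word graph so that the best path is constrained to agree with the user's hand-marked correction. The sketch below is not the authors' algorithm, only a toy dynamic program over a made-up word lattice that finds the highest-scoring path containing a user-corrected word.

    from collections import defaultdict

    # A toy word lattice: edges are (src, dst, word, log_score). Node 0 is the
    # start, node 3 the end. All contents are invented for illustration.
    EDGES = [
        (0, 1, "recognize", -1.2), (0, 1, "wreck a nice", -1.0),
        (1, 2, "speech", -0.8),    (1, 2, "beach", -0.7),
        (2, 3, "today", -0.5),
    ]
    START, END = 0, 3

    def best_path(edges, start, end, must_contain=None):
        """Highest-scoring path; optionally force one word to appear on it."""
        # DP state = (node, whether the required word has been used so far).
        best = {(start, must_contain is None): (0.0, [])}
        adj = defaultdict(list)
        for src, dst, word, score in edges:
            adj[src].append((dst, word, score))
        # Nodes happen to be numbered in topological order in this toy lattice.
        for node in sorted({e[0] for e in edges} | {e[1] for e in edges}):
            for seen in (False, True):
                if (node, seen) not in best:
                    continue
                score_so_far, words = best[(node, seen)]
                for dst, word, score in adj[node]:
                    new_seen = seen or word == must_contain
                    cand = (score_so_far + score, words + [word])
                    if cand > best.get((dst, new_seen), (float("-inf"), [])):
                        best[(dst, new_seen)] = cand
        return best.get((end, True), (float("-inf"), []))

    print(best_path(EDGES, START, END))                          # unconstrained best path
    print(best_path(EDGES, START, END, must_contain="speech"))   # user corrected "speech"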

p. 347-356

Kaiser, Edward C. (2006): Using redundant speech and handwriting for learning new vocabulary and understanding abbreviations. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 347-356. Available online

New language constantly emerges from complex, collaborative human-human interactions like meetings -- such as, for instance, when a presenter handwrites a new term on a whiteboard while saying it. Fixed vocabulary recognizers fail on such new terms, which often are critical to dialogue understanding. We present a proof-of-concept multimodal system that combines information from handwriting and speech recognition to learn the spelling, pronunciation and semantics of out-of-vocabulary terms from single instances of redundant multimodal presentation (e.g. saying a term while handwriting it). For the task of recognizing the spelling and semantics of abbreviated Gantt chart labels across a held-out test series of five scheduling meetings we show a significant relative error rate reduction of 37% when our learning methods are used and allowed to persist across the meeting series, as opposed to when they are not used.

© All rights reserved Kaiser and/or his/her publisher
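One sub-problem touched on above is deciding whether a handwritten abbreviation plausibly matches a spoken term. The paper does this by combining speech and handwriting recognition hypotheses; the snippet below is only a crude, hypothetical heuristic (ordered-subsequence matching) to make the task concrete, not the method described in the paper.

    def could_abbreviate(abbrev: str, phrase: str) -> bool:
        """True if every character of the abbreviation appears, in order,
        in the phrase and the first characters agree (a crude heuristic)."""
        abbrev, phrase = abbrev.lower(), phrase.lower()
        if not abbrev or abbrev[0] != phrase[0]:
            return False
        it = iter(phrase)
        return all(ch in it for ch in abbrev)   # 'in' consumes the iterator

    print(could_abbreviate("mgmt", "management"))        # True
    print(could_abbreviate("QA", "quality assurance"))   # True
    print(could_abbreviate("mgmt", "meeting agenda"))    # False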

p. 35-38

Maganti, Hari Krishna and Gatica-Perez, Daniel (2006): Speaker localization for microphone array-based ASR: the effects of accuracy on overlapping speech. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 35-38. Available online

Accurate speaker location is essential for optimal performance of distant speech acquisition systems using microphone array techniques. However, to the best of our knowledge, no comprehensive studies on the degradation of automatic speech recognition (ASR) as a function of speaker location accuracy in a multi-party scenario exist. In this paper, we describe a framework for evaluation of the effects of speaker location errors on a microphone array-based ASR system, in the context of meetings in multi-sensor rooms comprising multiple cameras and microphones. Speakers are manually annotated in videos in different camera views, and triangulation is used to determine an accurate speaker location. Errors in the speaker location are then induced in a systematic manner to observe their influence on speech recognition performance. The system is evaluated on real overlapping speech data collected with simultaneous speakers in a meeting room. The results are compared with those obtained from close-talking headset microphones, lapel microphones, and speaker location based on audio-only and audio-visual information approaches.

© All rights reserved Maganti and Gatica-Perez and/or their publisher
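The ground-truth speaker locations above are obtained by triangulating manual annotations from multiple camera views. The sketch below shows one common way to triangulate a 3-D point from camera rays (the least-squares point closest to all rays); camera centres and viewing directions are invented, and this is not the authors' exact calibration pipeline.

    import numpy as np

    def triangulate(origins, directions):
        """Point minimizing the summed squared distance to a set of 3-D rays."""
        A = np.zeros((3, 3))
        b = np.zeros(3)
        for o, d in zip(origins, directions):
            d = d / np.linalg.norm(d)
            P = np.eye(3) - np.outer(d, d)   # projector onto the plane normal to d
            A += P
            b += P @ o
        return np.linalg.solve(A, b)

    # Two hypothetical camera centres looking at a speaker near (1.0, 2.0, 1.2).
    origins    = [np.array([0.0, 0.0, 2.5]), np.array([4.0, 0.0, 2.5])]
    directions = [np.array([1.0, 2.0, -1.3]), np.array([-3.0, 2.0, -1.3])]
    print(triangulate(origins, directions))   # ~[1.0, 2.0, 1.2]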

p. 357-363

Portillo, Pilar Manchón, García, Guillermo Pérez and Carredano, Gabriel Amores (2006): Multimodal fusion: a new hybrid strategy for dialogue systems. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 357-363. Available online

This paper presents a new hybrid fusion strategy based primarily on the implementation of two earlier, differentiated approaches to multimodal fusion [11] in multimodal dialogue systems. Both approaches, their predecessors, and their respective advantages and disadvantages are described in order to illustrate how the new strategy merges them into a more solid and coherent solution. The first strategy was largely based on Johnston's approach [5] and implies the inclusion of multimodal grammar entries and temporal constraints. The second approach implied the fusion of information coming from different channels at the dialogue level. The hybrid strategy described here requires the inclusion of multimodal grammar entries and temporal constraints plus the additional dialogue-level information used in the second strategy. Within this new approach, therefore, the fusion process is initiated at the grammar level and culminates at the dialogue level.

© All rights reserved Portillo et al. and/or their publisher
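Grammar-level fusion with temporal constraints typically means that unimodal inputs are only combined if they occur close enough in time. The toy check below illustrates that idea with invented speech and gesture events; it is not the fusion mechanism implemented in the paper.

    from dataclasses import dataclass

    @dataclass
    class Event:
        modality: str   # e.g. "speech" or "gesture"
        content: str
        start: float    # seconds
        end: float

    def can_fuse(a: Event, b: Event, max_gap: float = 1.0) -> bool:
        """Fuse two unimodal events only if their time spans overlap or are
        separated by at most `max_gap` seconds (a toy temporal constraint)."""
        gap = max(a.start, b.start) - min(a.end, b.end)
        return gap <= max_gap

    speech  = Event("speech",  "delete this one", 2.0, 3.1)
    gesture = Event("gesture", "point@photo_7",   2.8, 3.0)
    print(can_fuse(speech, gesture))   # True: the gesture overlaps the utterance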

p. 364-371

Lin, Tao and Imamiya, Atsumi (2006): Evaluating usability based on multimodal information: an empirical study. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 364-371. Available online

New technologies are making it possible to provide an enriched view of interaction for researchers using multimodal information. This preliminary study explores the use of multiple information streams in usability evaluation. In the study, easy, medium and difficult versions of a game task were used to vary the levels of mental effort. Multimodal data streams during the three versions were analyzed, including eye tracking, pupil size, hand movement, heart rate variability (HRV) and subjectively reported data. Four findings indicate the potential value of usability evaluations based on multimodal information: First, subjective and physiological measures showed significant sensitivity to task difficulty. Second, different mental workload levels appeared to correlate with eye movement patterns, especially with a combined eye-hand movement measure. Third, HRV showed correlations with saccade speed. Finally, we present a new method using the ratio of eye fixations over mouse clicks to evaluate performance in more detail. These results warrant further investigations and take an initial step toward establishing usability evaluation methods based on multimodal information.

© All rights reserved Lin and Imamiya and/or their publisher
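The last measure mentioned, the ratio of eye fixations to mouse clicks, is straightforward to compute from an interaction log. A minimal sketch with invented event streams:

    def fixations_per_click(events):
        """Ratio of fixation events to click events in an interaction log."""
        fixations = sum(1 for e in events if e == "fixation")
        clicks = sum(1 for e in events if e == "click")
        return fixations / clicks if clicks else float("inf")

    # Invented logs for an easy and a difficult version of the same task:
    easy = ["fixation", "click", "fixation", "fixation", "click"]
    hard = ["fixation"] * 9 + ["click"] + ["fixation"] * 6 + ["click"]
    print(fixations_per_click(easy))   # 1.5 -> little visual search per action
    print(fixations_per_click(hard))   # 7.5 -> much more searching per action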

p. 372-379

Smyth, Thomas N. and Kirkpatrick, Arthur E. (2006): A new approach to haptic augmentation of the GUI. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 372-379. Available online

Most users do not experience the same level of fluency in their interactions with computers that they do with physical objects in their daily life. We believe that much of this results from the limitations of unimodal interaction. Previous efforts in the haptics literature to remedy those limitations have been creative and numerous, but have failed to produce substantial improvements in human performance. This paper presents a new approach, whereby haptic interaction techniques are designed from scratch, in explicit consideration of the strengths and weaknesses of the haptic and motor systems. We introduce a haptic alternative to the tool palette, called Pokespace, which follows this approach. Two studies (6 and 12 participants) conducted with Pokespace found no performance improvement over a traditional interface, but showed that participants learned to use the interface proficiently after about 10 minutes, and could do so without visual attention. The studies also suggested several improvements to our design.

© All rights reserved Smyth and Kirkpatrick and/or their publisher

p. 380-387

Mana, Nadia and Pianesi, Fabio (2006): HMM-based synthesis of emotional facial expressions during speech in synthetic talking heads. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 380-387. Available online

One of the research goals in the human-computer interaction community is to build believable Embodied Conversational Agents (ECAs), that is, agents able to communicate complex information with human-like expressiveness and naturalness. Since emotions play a crucial role in human communication and most of them are expressed through the face, building more believable ECAs implies giving them the ability to display emotional facial expressions. This paper presents a system based on Hidden Markov Models (HMMs) for the synthesis of emotional facial expressions during speech. The HMMs were trained on a set of emotion examples in which a professional actor uttered Italian non-sense words, acting various emotional facial expressions with different intensities. The evaluation of the experimental results, performed by comparing the "synthetic examples" (generated by the system) with a reference "natural example" (one of the actor's examples) in three different ways, shows that HMMs for emotional facial expression synthesis have some limitations but are suitable for making a synthetic Talking Head more expressive and realistic.

© All rights reserved Mana and Pianesi and/or their publisher
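For readers unfamiliar with how a trained HMM can drive animation, the sketch below samples a one-dimensional facial-parameter trajectory from a tiny, hand-specified Gaussian-output HMM. The two states, transition matrix and output distributions are all made up; the paper's models are trained on recorded acting data and are considerably richer.

    import numpy as np

    rng = np.random.default_rng(0)

    # A made-up 2-state HMM over a single facial animation parameter
    # (e.g. an eyebrow-raise intensity): state 0 "neutral", state 1 "raised".
    startprob = np.array([1.0, 0.0])
    transmat  = np.array([[0.9, 0.1],
                          [0.2, 0.8]])
    means     = np.array([0.1, 0.8])    # per-state output mean
    stds      = np.array([0.05, 0.1])   # per-state output standard deviation

    def sample_trajectory(n_frames):
        """Generate a parameter trajectory by sampling states, then emissions."""
        state = rng.choice(2, p=startprob)
        frames = []
        for _ in range(n_frames):
            frames.append(rng.normal(means[state], stds[state]))
            state = rng.choice(2, p=transmat[state])
        return np.array(frames)

    print(np.round(sample_trajectory(10), 2))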

p. 388-390

Quek, Francis (2006): Embodiment and multimodality. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 388-390. Available online

Students who are blind are typically one to three years behind their seeing counterparts in mathematics and science. We posit that a key reason for this resides in the inability of such students to access the multimodal embodied communicative behavior of mathematics instructors. This impedes the ability of blind students and their teachers to maintain situated communication. In this paper, we set forth the relevant phenomenological analyses to support this claim. We show that mathematical communication and instruction are inherently embodied; that the blind are able to conceptualize visuo-spatial information; and argue that uptake of embodied behavior is critical to receiving relevant mathematical information. Based on this analysis, we advance an approach to provide students who are blind with awareness of their teachers' deictic gestural activity via a set of haptic output devices. We lay out a set of open research questions that researchers in multimodal interfaces may address.

© All rights reserved Quek and/or his/her publisher

p. 39-42

Munteanu, Cosmin, Penn, Gerald, Baecker, Ron and Zhang, Yuecheng (2006): Automatic speech recognition for webcasts: how good is good enough and what to do when it isn't. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 39-42. Available online

The increased availability of broadband connections has recently led to an increase in the use of Internet broadcasting (webcasting). Most webcasts are archived and accessed numerous times retrospectively. One challenge to skimming and browsing through such archives is the lack of text transcripts of the webcast's audio channel. This paper describes a procedure for prototyping an Automatic Speech Recognition (ASR) system that generates realistic transcripts of any desired Word Error Rate (WER), thus overcoming the drawbacks of both prototype-based and Wizard of Oz simulations. We used such a system in a user study showing that transcripts with WERs less than 25% are acceptable for use in webcast archives. As current ASR systems can only deliver, in realistic

© All rights reserved Munteanu et al. and/or their publisher
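Word Error Rate, the quantity manipulated in the study, is the word-level edit distance between a reference transcript and the ASR output, normalized by the reference length. A minimal sketch with invented sentences:

    def wer(reference: str, hypothesis: str) -> float:
        """Word Error Rate: (substitutions + deletions + insertions) / ref length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Levenshtein distance over words via dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution / match
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the lecture starts at nine", "the lecture starts nine"))   # 0.2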

p. 4-11

Barthelmess, Paulo, Kaiser, Edward, Huang, Xiao, McGee, David and Cohen, Philip (2006): Collaborative multimodal photo annotation over digital paper. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 4-11. Available online

The availability of metadata annotations over media content such as photos is known to enhance retrieval and organization, particularly for large data sets. The greatest challenge for obtaining annotations remains getting users to perform the large amount of tedious manual work that is required. In this paper we introduce an approach for semi-automated labeling based on extraction of metadata from naturally occurring conversations of groups of people discussing pictures among themselves. As the burden for structuring and extracting metadata is shifted from users to the system, new recognition challenges arise. We explore how multimodal language can help in 1) detecting a concise set of meaningful labels to be associated with each photo, 2) achieving robust recognition of these key semantic terms, and 3) facilitating label propagation via multimodal shortcuts. Analysis of the data of a preliminary pilot collection suggests that handwritten labels may be highly indicative of the semantics of each photo, as indicated by the correlation of handwritten terms with high frequency spoken ones. We point to initial directions exploring a multimodal fusion technique to recover robust spelling and pronunciation of these high-value terms from redundant speech and handwriting.

© All rights reserved Barthelmess et al. and/or their publisher
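The reported correlation between handwritten labels and high-frequency spoken terms can be pictured with a toy computation: intersect the handwritten labels with the most frequent words of the spoken transcript. The transcript and labels below are invented and this is not the authors' analysis.

    from collections import Counter

    def frequent_words(transcript: str, top_n: int = 5) -> set:
        """The top-N most frequent words in a (toy) spoken transcript."""
        return {w for w, _ in Counter(transcript.lower().split()).most_common(top_n)}

    spoken = ("that is the wedding photo look the wedding cake "
              "and there is grandma at the wedding")
    handwritten_labels = {"wedding", "grandma"}

    # Handwritten labels that are also high-frequency spoken terms:
    print(handwritten_labels & frequent_words(spoken))   # {'wedding'}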

p. 43-50

Yonezawa, Tomoko, Suzuki, Noriko, Abe, Shinji, Mase, Kenji and Kogure, Kiyoshi (2006): Cross-modal coordination of expressive strength between voice and gesture for personified media. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 43-50. Available online

The aim of this paper is to clarify the relationship between the expressive strengths of gestures and voice for embodied and personified interfaces. We conduct perceptual tests using a puppet interface, while controlling singing-voice expressions, to empirically determine the naturalness and strength of various combinations of gesture and voice. The results show that (1) the strength of cross-modal perception is affected more by gestural expression than by the expressions of a singing voice, and (2) the appropriateness of cross-modal perception is affected by expressive combinations between singing voice and gestures in personified expressions. As a promising solution, we propose balancing a singing voice and gestural expressions by expanding and correcting the width and shape of the curve of expressive strength in the singing voice.

© All rights reserved Yonezawa et al. and/or their publisher

p. 51-58

Reithinger, Norbert, Gebhard, Patrick, Löckelt, Markus, Ndiaye, Alassane, Pfleger, Norbert and Klesen, Martin (2006): VirtualHuman: dialogic and affective interaction with virtual characters. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 51-58. Available online

Natural multimodal interaction with realistic virtual characters provides rich opportunities for entertainment and education. In this paper we present the current VirtualHuman demonstrator system. It provides a knowledge-based framework to create interactive applications in a multi-user, multi-agent setting. The behavior of the virtual humans and objects in the 3D environment is controlled by interacting affective conversational dialogue engines. An elaborate model of affective behavior adds natural emotional reactions and presence to the virtual humans. Actions are defined in an XML-based markup language that supports the incremental specification of synchronized multimodal output. The system was successfully demonstrated during CeBIT 2006.

© All rights reserved Reithinger et al. and/or their publisher

p. 59-67

Melichar, Miroslav and Cenek, Pavel (2006): From vocal to multimodal dialogue management. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 59-67. Available online

Multimodal, speech-enabled systems pose different research problems when compared to unimodal, voice-only dialogue systems. One of the important issues is the question of what a multimodal interface should look like in order to make the multimodal interaction natural and smooth, while keeping it manageable from the system perspective. Another central issue concerns algorithms for multimodal dialogue management. This paper presents a solution that relies on adapting an existing unimodal, vocal dialogue management framework to make it able to cope with multimodality. An experimental multimodal system, Archivus, is described together with a discussion of the required changes to the unimodal dialogue management algorithms. Results of pilot Wizard of Oz experiments with Archivus, focusing on system efficiency and user behaviour, are presented.

© All rights reserved Melichar and Cenek and/or their publisher

p. 68-71

Foster, Mary Ellen, By, Tomas, Rickert, Markus and Knoll, Alois (2006): Human-Robot dialogue for joint construction tasks. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 68-71. Available online

We describe a human-robot dialogue system that allows a human to collaborate with a robot agent on assembling construction toys. The human and the robot are fully equal peers in the interaction, rather than simply partners. Joint action is supported at all stages of the interaction: the participants agree on a construction task, jointly decide how to proceed with the task, and also implement the selected plans jointly. The symmetry provides novel challenges for a dialogue system, and also makes it possible for findings from human-human joint-action dialogues to be easily implemented and tested.

© All rights reserved Foster et al. and/or their publisher

p. 72-75

Schweikardt, Eric and Gross, Mark D. (2006): roBlocks: a robotic construction kit for mathematics and science education. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 72-75. Available online

We describe work in progress on roBlocks, a computational construction kit that encourages users to experiment and play with a collection of sensor, logic and actuator blocks, exposing them to a variety of advanced concepts including kinematics, feedback and distributed control. Its interface presents novice users with a simple, tangible set of robotic blocks, whereas advanced users work with software tools to analyze and rewrite the programs embedded in each block. Early results suggest that roBlocks may be an effective vehicle to expose young people to complex ideas in science, technology, engineering and mathematics.

© All rights reserved Schweikardt and Gross and/or their publisher

p. 76-83

Tse, Edward, Greenberg, Saul and Shen, Chia (2006): GSI demo: multiuser gesture/speech interaction over digital tables by wrapping single user applications. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 76-83. Available online

Most commercial software applications are designed for a single user using a keyboard/mouse over an upright monitor. Our interest is exploiting these systems so they work over a digital table. Mirroring what people do when working over traditional tables, we want to allow multiple people to interact naturally with the tabletop application and with each other via rich speech and hand gestures. In previous papers, we illustrated multi-user gesture and speech interaction on a digital table for geospatial applications -- Google Earth, Warcraft III and The Sims. In this paper, we describe our underlying architecture: GSI Demo. First, GSI Demo creates a run-time wrapper around existing single user applications: it accepts and translates speech and gestures from multiple people into a single stream of keyboard and mouse inputs recognized by the application. Second, it lets people use multimodal demonstration -- instead of programming -- to quickly map their own speech and gestures to these keyboard/mouse inputs. For example, continuous gestures are trained by saying "Computer, when I do [one finger gesture], you do [mouse drag]". Similarly, discrete speech commands can be trained by saying "Computer, when I say [layer bars], you do [keyboard and mouse macro]". The end result is that end users can rapidly transform single user commercial applications into a multi-user, multimodal digital tabletop system.

© All rights reserved Tse et al. and/or their publisher
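At its core, the wrapper maps recognized speech and gesture commands onto keyboard/mouse input understood by the single-user application. The toy dispatcher below conveys only that mapping idea; the command names and macros are hypothetical, and real input injection (which GSI Demo performs) is platform-specific and omitted here.

    # Toy mapping from recognised multimodal commands to single-user input
    # events. A real wrapper would inject these as OS-level keyboard/mouse
    # events; here we only print them.
    MACROS = {
        ("speech", "layer bars"):       [("key", "ctrl+b")],
        ("speech", "fly to everest"):   [("key", "ctrl+f"), ("type", "Everest"), ("key", "enter")],
        ("gesture", "one finger drag"): [("mouse", "drag")],
    }

    def dispatch(modality: str, command: str):
        for event in MACROS.get((modality, command), []):
            print(f"inject {event}")   # stand-in for real input injection

    dispatch("speech", "layer bars")
    dispatch("gesture", "one finger drag")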

p. 84-91

Christoudias, C. Mario, Saenko, Kate, Morency, Louis-Philippe and Darrell, Trevor (2006): Co-Adaptation of audio-visual speech and gesture classifiers. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 84-91. Available online

The construction of robust multimodal interfaces often requires large amounts of labeled training data to account for cross-user differences and variation in the environment. In this work, we investigate whether unlabeled training data can be leveraged to build more reliable audio-visual classifiers through co-training, a multi-view learning algorithm. Multimodal tasks are good candidates for multi-view learning, since each modality provides a potentially redundant view to the learning algorithm. We apply co-training to two problems: audio-visual speech unit classification, and user agreement recognition using spoken utterances and head gestures. We demonstrate that multimodal co-training can be used to learn from only a few labeled examples in one or both of the audio-visual modalities. We also propose a co-adaptation algorithm, which adapts existing audio-visual classifiers to a particular user or noise condition by leveraging the redundancy in the unlabeled data.

© All rights reserved Christoudias et al. and/or their publisher
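Co-training itself is a generic two-view semi-supervised algorithm: each view's classifier pseudo-labels the unlabeled examples it is most confident about, and those examples feed the next training round. The sketch below runs that loop on synthetic data with scikit-learn; it is not the audio-visual setup used in the paper.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in for two redundant views (e.g. audio vs. visual features).
    X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                               random_state=0)
    view_a, view_b = X[:, :5], X[:, 5:]

    labels = {i: int(y[i]) for i in range(20)}   # only a handful of labeled examples
    unlabeled = list(range(20, 500))
    test = list(range(500, 600))

    clf_a = LogisticRegression(max_iter=1000)
    clf_b = LogisticRegression(max_iter=1000)

    for _ in range(10):                          # a few co-training rounds
        idx = sorted(labels)
        target = [labels[i] for i in idx]
        clf_a.fit(view_a[idx], target)
        clf_b.fit(view_b[idx], target)
        # Each classifier pseudo-labels the unlabeled example it is most
        # confident about; that example becomes training data for both views.
        for clf, view in ((clf_a, view_a), (clf_b, view_b)):
            proba = clf.predict_proba(view[unlabeled])
            best = int(np.argmax(proba.max(axis=1)))
            i = unlabeled.pop(best)
            labels[i] = int(proba[best].argmax())

    print("held-out accuracy, view A classifier:", clf_a.score(view_a[test], y[test]))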

p. 92-99

Sowa, Timo (2006): Towards the integration of shape-related information in 3-D gestures and speech. In: Proceedings of the 2006 International Conference on Multimodal Interfaces 2006. pp. 92-99. Available online

This paper presents a model for the unified semantic representation of shape conveyed by speech and coverbal 3-D gestures. The representation is tailored to capture the semantic contributions of both modalities during free descriptions of objects. It is shown how the semantic content of shape-related adjectives, nouns, and iconic gestures can be modeled and combined when they occur together in multimodal utterances like "a longish bar" + iconic gesture. The model has been applied for the development of a prototype system for gesture recognition and integration with speech.

© All rights reserved Sowa and/or his/her publisher
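The fusion of "a longish bar" with an iconic gesture can be pictured as merging two partial attribute sets into one shape description. The toy merge below is only an illustration of that idea, with invented attribute names, not Sowa's representation.

    def merge_shape(speech_attrs: dict, gesture_attrs: dict) -> dict:
        """Combine partial shape descriptions: speech supplies categorical
        attributes, the gesture supplies concrete extents (a toy unification)."""
        merged = dict(speech_attrs)
        for key, value in gesture_attrs.items():
            merged.setdefault(key, value)
        return merged

    # "a longish bar" + an iconic gesture spanning roughly 40 x 5 cm:
    speech  = {"object": "bar", "elongation": "high"}
    gesture = {"extent_cm": (40, 5), "orientation": "horizontal"}
    print(merge_shape(speech, gesture))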




 


