
Proceedings of the 2010 International Conference on Multimodal Interfaces


 
Time and place: 2010
Conf. description:
The International Conference on Multimodal Interfaces (ICMI) is an annual ACM-sponsored conference that promotes research in next-generation perceptive, adaptive and multimodal user interfaces. These new interfaces are especially well suited for interpreting natural communication and activity patterns in real-world environments.
Series:
This is a preferred venue for people like Trevor Darrell, Wen Gao, Rainer Stiefelhagen, Jie Yang, and Francis K. H. Quek. Part of the ICMI - International Conference on Multimodal Interfaces conference series.

References from this conference (2010)

The following articles are from "Proceedings of the 2010 International Conference on Multimodal Interfaces":


Articles

p. 1

Haviland, John (2010): Language and thought: talking, gesturing (and signing) about space. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 1. Available online

Recent research has reopened debates about (neo)Whorfian claims that the language one speaks has an impact on how one thinks -- long discounted by mainstream linguistics and anthropology alike. Some of the most striking evidence for such possible impact derives, not surprisingly, from understudied "exotic" languages and, somewhat more surprisingly, from multimodal and notably gestural practices in communities which speak them. In particular, some of my own work on Guugu Yimithirr, a Paman language spoken by Aboriginal people in northeastern Australia, and on Tzotzil, a language spoken by Mayan peasants in southeastern Mexico, suggests strong connections between linguistic expressions of spatial relations, gestural practices in talking about location and motion, and cognitive representations of space -- what have come to be called spatial "Frames of Reference." In this talk, I will present some of the evidence for such connections, and add to the mix evidence from an emerging, first generation sign language developed spontaneously in a single family by deaf siblings who have had contact with neither other deaf people nor any other sign language.

© All rights reserved Haviland and/or ACM Press

p. 10

Ehlen, Patrick and Johnston, Michael (2010): Speak4it: multimodal interaction for local search. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 10. Available online

Speak4it is a consumer-oriented mobile search application that leverages multimodal input and output to allow users to search for and act on local business information. It supports true multimodal integration where user inputs can be distributed over multiple input modes. In addition to specifying queries by voice (e.g., "bike repair shops near the golden gate bridge") users can combine speech and gesture. For example, "gas stations" combined with a route traced on the display will return the gas stations along the specified route. We provide interactive demonstrations of Speak4it on both the iPhone and iPad platforms and explain the underlying multimodal architecture and challenges of supporting multimodal interaction as a deployed mobile service.

© All rights reserved Ehlen and Johnston and/or ACM Press

p. 11

Rodríguez, Luis, García-Varea, Ismael, Revuelta-Martínez, A. and Vidal, Enrique (2010): A multimodal interactive text generation system. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 11. Available online

We present an interactive text generation system aimed at providing assistance for text typing in different environments. This system works by predicting what the user is going to type based on the text he or she typed previously. A multimodal interface is included, intended to facilitate text generation in constrained environments. The prototype is designed following a modular client-server architecture to provide high flexibility.

© All rights reserved Rodríguez et al. and/or ACM Press
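
The abstract does not detail the prediction model, but the core idea of suggesting the user's next words from previously typed text can be illustrated with a minimal bigram sketch (hypothetical code, not the authors' system; a real system would use far richer language models coupled to the multimodal interface):

```python
from collections import Counter, defaultdict

class BigramPredictor:
    """Toy next-word predictor: suggests the most frequent followers of the last word."""

    def __init__(self):
        self.followers = defaultdict(Counter)

    def train(self, text: str) -> None:
        words = text.lower().split()
        for prev, nxt in zip(words, words[1:]):
            self.followers[prev][nxt] += 1

    def predict(self, typed_so_far: str, k: int = 3) -> list:
        words = typed_so_far.lower().split()
        if not words:
            return []
        return [w for w, _ in self.followers[words[-1]].most_common(k)]

predictor = BigramPredictor()
predictor.train("the cat sat on the mat and the cat slept on the sofa")
print(predictor.predict("I saw the"))   # -> ['cat', 'mat', 'sofa']
```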

p. 12

Kilgour, Jonathan, Carletta, Jean and Renals, Steve (2010): The Ambient Spotlight: personal multimodal search without query. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 12. Available online

The Ambient Spotlight is a prototype system based on personal meeting capture using a laptop and a portable microphone array. The system automatically recognises and structures the meeting content using automatic speech recognition, topic segmentation and extractive summarisation. The recognised speech in the meeting is used to construct queries to automatically link meeting segments to other relevant material, both multimodal and textual. The interface to the system is constructed around a standard calendar interface, and it is integrated with the laptop's standard indexing, search and retrieval.

© All rights reserved Kilgour et al. and/or ACM Press

p. 13

Zhang, Chunhui, Wang, Min and Harper, Richard (2010): Cloud mouse: a new way to interact with the cloud. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 13. Available online

In this paper we present a novel input device and associated UI metaphors for Cloud computing. Cloud computing will give users access to huge amounts of data in new forms, anywhere and anytime, with applications ranging from Web data mining to social networks. The motivation of this work is to provide users access to cloud computing through a new personal device and to turn nearby displays into personal displays. The key points of this device are direct-point operation, a grasping UI and tangible feedback. A UI metaphor for cloud computing is also introduced.

© All rights reserved Zhang et al. and/or ACM Press

p. 14

Ashley, Richard (2010): Musical performance as multimodal communication: drummers, musical collaborators, and listeners. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 14. Available online

Musical performance provides an interesting domain for understanding and investigating multimodal communication. Although the primary modality of music is auditory, musicians make considerable use of the visual channel as well. This talk examines musical performance as multimodal, focusing on drumming in one style of popular music (funk or soul music). The ways drummers interact with, and communicate with, their musical collaborators and with listeners are examined, in terms of the structure of different musical parts; processes of mutual coordination, entrainment, and turn-taking (complementarity) are highlighted. Both pre-determined (composed) and spontaneous (improvised) behaviors are considered. The way in which digital drumsets function as complexly structured human interfaces to sound synthesis systems is examined as well.

© All rights reserved Ashley and/or ACM Press

p. 15

Yin, Ying and Davis, Randall (2010): Toward natural interaction in the real world: real-time gesture recognition. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 15. Available online

Using a new hand tracking technology capable of tracking 3D hand postures in real-time, we developed a recognition system for continuous natural gestures. By natural gestures, we mean those encountered in spontaneous interaction, rather than a set of artificial gestures chosen to simplify recognition. To date we have achieved 95.6% accuracy on isolated gesture recognition, and 73% recognition rate on continuous gesture recognition, with data from 3 users and twelve gesture classes. We connected our gesture recognition system to Google Earth, enabling real time gestural control of a 3D map. We describe the challenges of signal accuracy and signal interpretation presented by working in a real-world environment, and detail how we overcame them.

© All rights reserved Yin and Davis and/or ACM Press

p. 16

Rico, Julie and Brewster, Stephen A. (2010): Gesture and voice prototyping for early evaluations of social acceptability in multimodal interfaces. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 16. Available online

Interaction techniques that require users to adopt new behaviors mean that designers must take into account social acceptability and user experience; otherwise the techniques may be rejected by users as too embarrassing to perform in public. This research uses a set of low cost prototypes to study social acceptability and user perceptions of multimodal mobile interaction techniques early on in the design process. We describe 4 prototypes that were used with 8 focus groups to evaluate user perceptions of novel multimodal interactions using gesture, speech and nonspeech sounds, and gain feedback about the usefulness of the prototypes for studying social acceptability. The results of this research describe user perceptions of social acceptability and the realities of using multimodal interaction techniques in daily life. The results also describe key differences between young users (18-29) and older users (70-95) with respect to evaluation and approach to understanding these interaction techniques.

© All rights reserved Rico and Brewster and/or ACM Press

p. 17

Li, Yun, Chen, Xiang, Tian, Jianxun, Zhang, Xu, Wang, Kongqiao and Yang, Jihai (2010): Automatic recognition of sign language subwords based on portable accelerometer and EMG sensors. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 17. Available online

Sign language recognition (SLR) not only facilitates communication between the deaf and hearing society, but also serves as a good basis for the development of gesture-based human-computer interaction (HCI). In this paper, portable input devices based on accelerometers and surface electromyography (EMG) sensors worn on the forearm are presented, and an effective fusion strategy for the combination of multi-sensor and multi-channel information is proposed to automatically recognize sign language at the subword classification level. Experimental results on the recognition of 121 frequently used Chinese sign language subwords demonstrate the feasibility of developing an SLR system based on the presented portable input devices and show that our proposed information fusion method is effective for automatic SLR. Our study will promote the realization of practical sign language recognizers and multimodal human-computer interfaces.

© All rights reserved Li et al. and/or ACM Press
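
As a rough illustration of feature-level fusion of accelerometer and EMG channels of the kind described above, the sketch below concatenates simple per-channel statistics and classifies windows with a nearest-centroid rule; the signals, feature choices and class counts are invented for the example and do not reflect the authors' actual fusion method:

```python
import numpy as np

def window_features(acc: np.ndarray, emg: np.ndarray) -> np.ndarray:
    """Feature-level fusion: concatenate simple per-channel statistics
    from a 3-axis accelerometer window and a multi-channel EMG window."""
    acc_feats = np.concatenate([acc.mean(axis=0), acc.std(axis=0)])
    emg_feats = np.abs(emg).mean(axis=0)          # mean absolute value per EMG channel
    return np.concatenate([acc_feats, emg_feats])

def train_centroids(windows, labels):
    """One centroid per subword class in the fused feature space."""
    feats = np.array([window_features(a, e) for a, e in windows])
    return {c: feats[np.array(labels) == c].mean(axis=0) for c in set(labels)}

def classify(acc, emg, centroids):
    f = window_features(acc, emg)
    return min(centroids, key=lambda c: np.linalg.norm(f - centroids[c]))

# Tiny synthetic example: 2 subword classes, 3-axis ACC + 4-channel EMG windows.
rng = np.random.default_rng(0)
wins = [(rng.normal(i, 0.1, (50, 3)), rng.normal(i, 0.1, (50, 4)))
        for i in (0, 1) for _ in range(5)]
labs = [0] * 5 + [1] * 5
cents = train_centroids(wins, labs)
print(classify(*wins[7], cents))  # should print 1 (the second class)
```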

p. 18

Oliveira, Francisco, Cowan, Heidi, Fang, Bing and Quek, Francis (2010): Enabling multimodal discourse for the blind. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 18. Available online

This paper presents research that shows that a high degree of skilled performance is required for multimodal discourse support. We discuss how students who are blind or visually impaired (SBVI) were able to understand the instructor's pointing gestures during planar geometry and trigonometry classes. For that, the SBVI must attend to the instructor's speech and have simultaneous access to the instructional graphic material and to where the instructor is pointing. We developed the Haptic Deictic System -- HDS, capable of tracking the instructor's pointing and informing the SBVI, through a haptic glove, where she needs to move her hand to understand the instructor's illustration-augmented discourse. Several challenges had to be overcome before the SBVI were able to engage in fluid multimodal discourse with the help of the HDS. We discuss how such challenges were addressed with respect to perception and discourse (especially to mathematics instruction).

© All rights reserved Oliveira et al. and/or ACM Press

p. 19

Kamei, Koji, Shinozawa, Kazuhiko, Ikeda, Tetsushi, Utsumi, Akira, Miyashita, Takahiro and Hagita, Norihiro (2010): Recommendation from robots in a real-world retail shop. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 19. Available online

By applying network robot technologies, recommendation methods from E-Commerce are incorporated in a retail shop in the real world. We constructed an experimental shop environment where communication robots recommend specific items to the customers according to their purchasing behavior as observed by networked sensors. A recommendation scenario is implemented with three robots and investigated through an experiment. The results indicate that the participants stayed longer in front of the shelves when the communication robots tried to interact with them and were influenced to carry out similar purchasing behaviors as those observed earlier. Other results suggest that the probability of customers' zone transition can be used to anticipate their purchasing behavior.

© All rights reserved Kamei et al. and/or ACM Press

p. 2

Kaaresoja, Topi and Brewster, Stephen A. (2010): Feedback is... late: measuring multimodal delays in mobile device touchscreen interaction. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 2. Available online

Multimodal interaction is becoming common in many kinds of devices, particularly mobile phones. If care is not taken in design and implementation, latencies in the timing of feedback in the different modalities may have unintended effects on users. This paper introduces an easy-to-implement multimodal latency measurement tool for touchscreen interaction. It uses off-the-shelf components and free software and is capable of measuring latencies accurately between different interaction events in different modalities. The tool uses a high-speed camera, a mirror, a microphone and an accelerometer to measure the touch, visual, audio and tactile feedback events that occur in touchscreen interaction. The microphone and the accelerometer are both interfaced with a standard PC soundcard, which makes the measurement and analysis simple. The latencies are obtained by hand and eye using a slow-motion video player and an audio editor. To validate the tool, we measured four commercial mobile phones. Our results show that there are significant differences in latencies, not only between the devices, but also between different applications and modalities within one device. In this paper the focus is on mobile touchscreen devices, but with minor modifications our tool could also be used in other domains.

© All rights reserved Kaaresoja and Brewster and/or ACM Press
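
Once the event timestamps have been read off the recordings, the latency computation itself is simple arithmetic; the sketch below shows it with hypothetical timestamps (the event names and values are illustrative placeholders, not measurements from the paper):

```python
# Hypothetical event timestamps (seconds) extracted by hand from the
# slow-motion video, the audio track and the accelerometer trace.
events = {
    "touch_down": 0.000,     # finger contacts the screen (high-speed video)
    "visual_update": 0.083,  # first visible change on the display (video)
    "audio_onset": 0.051,    # click sound onset (microphone)
    "tactile_onset": 0.032,  # vibration onset (accelerometer)
}

def feedback_latencies(ev):
    """Latency of each feedback modality relative to the touch-down event, in ms."""
    t0 = ev["touch_down"]
    return {name: round((t - t0) * 1000, 1)
            for name, t in ev.items() if name != "touch_down"}

print(feedback_latencies(events))
# {'visual_update': 83.0, 'audio_onset': 51.0, 'tactile_onset': 32.0}
```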

p. 20

Blumendorf, Marco, Roscher, Dirk and Albayrak, Sahin (2010): Dynamic user interface distribution for flexible multimodal interaction. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 20. Available online

The availability of numerous networked interaction devices within smart environments makes the exploitation of these devices for innovative and more natural interaction possible. In our work we make use of TVs with remote controls, picture frames, mobile phones, touch screens, stereos and PCs to create multimodal user interfaces. Combining the interaction capabilities of the different devices makes it possible to achieve a more suitable interaction for a given situation. Changing situations can then require the dynamic redistribution of the created interfaces and the alteration of the used modalities and devices to keep up the interaction. In this paper we describe our approach for dynamically (re-)distributing user interfaces at run-time. A distribution component is responsible for determining the devices for the interaction based on the (changing) environment situation and the user interface requirements. The component provides possibilities for the application developer and the user to influence the distribution according to their needs. A user interface model describes the interaction and the modality relations according to the CARE properties (Complementarity, Assignment, Redundancy and Equivalency), and a context model gathers and provides information about the environment.

© All rights reserved Blumendorf et al. and/or ACM Press

p. 21

Kildal, Johan (2010): 3D-press: haptic illusion of compliance when pressing on a rigid surface. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 21. Available online

This paper reports a new intramodal haptic illusion. This illusion involves a person pressing on a rigid surface and perceiving that the surface is compliant, i.e. perceiving that the contact point displaces into the surface. The design process, method and conditions used to create this illusion are described in detail. A user study is also reported in which all participants using variants of the basic method experienced the illusion, demonstrating the effectiveness of the method. This study also offers an initial indication of the mechanical dimensions of illusory compliance that could be manipulated by varying the stimuli presented to the users. This method could be used to augment touch interaction with mobile devices, transcending the rigid two-dimensional tangible surface (touch display) currently found on them.

© All rights reserved Kildal and/or ACM Press

p. 22

Ali, Abdallah El, Nack, Frank and Hardman, Lynda (2010): Understanding contextual factors in location-aware multimedia messaging. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 22. Available online

Location-aware messages left by people can make visible some aspects of their everyday experiences at a location. To understand the contextual factors surrounding how users produce and consume location-aware multimedia messaging (LMM), we use an experience-centered framework that makes explicit the different aspects of an experience. Using this framework, we conducted an exploratory, diary study aimed at eliciting implications for the study and design of LMM systems. In an earlier pilot study, we found that subjects did not have enough time to fully capture their everyday experiences using an LMM prototype, which led us to conduct a longer study using a multimodal diary method. The diary study data (verified for reliability using a categorization task) provided a closer look at the different aspects (spatiotemporal, social, affective, and cognitive) of people's experience. From the data, we derive three main findings (predominant LMM domains and tasks, capturing experience vs. experience of capture, context-dependent personalization) to inform the study and design of future LMM systems.

© All rights reserved Ali et al. and/or ACM Press

p. 23

Liu, Qiong, Liao, Chunyuan, Wilcox, Lynn and Dunnigan, Anthony (2010): Embedded media barcode links: optimally blended barcode overlay on paper for linking to associated media. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 23. Available online

Embedded Media Barcode Links, or simply EMBLs, are optimally blended iconic barcode marks, printed on paper documents, that signify the existence of multimedia associated with that part of the document content (Figure 1). EMBLs are used for multimedia retrieval with a camera phone. Users take a picture of an EMBL-signified document patch using a cell phone, and the multimedia associated with the EMBL-signified document location is displayed on the phone. Unlike a traditional barcode which requires an exclusive space, the EMBL construction algorithm acts as an agent to negotiate with a barcode reader for maximum user and document benefits. Because of this negotiation, EMBLs are optimally blended with content and thus have less interference with the original document layout and can be moved closer to a media associated location. Retrieval of media associated with an EMBL is based on the barcode identification of a captured EMBL. Therefore, EMBL retains nearly all barcode identification advantages, such as accuracy, speed, and scalability. Moreover, EMBL takes advantage of users' knowledge of a traditional barcode. Unlike Embedded Media Markers (EMM), which require underlying document features for marker identification, EMBL has no requirement for the underlying features. This paper will discuss the procedures for EMBL construction and optimization. It will also give experimental results that strongly support the EMBL construction and optimization ideas.

© All rights reserved Liu et al. and/or ACM Press

p. 24

Xu, Wenchang, Yang, Xin and Shi, Yuanchun (2010): Enhancing browsing experience of table and image elements in web pages. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 24. Available online

With the growing popularity and diversification of both the Internet and its access devices, users' browsing experience of web pages is in great need of improvement. The traditional browsing mode of web elements such as tables and images is passive, which limits users' browsing efficiency. In this paper, we propose to enhance the browsing experience of table and image elements in web pages by enabling real-time interactive access to web tables and images. We design new browsing modes that help users improve their browsing efficiency, including an operation mode and a record mode for web tables, and a normal mode, starred mode and advanced mode for web images. We design and implement a plug-in for Microsoft Internet Explorer, called iWebWidget, which provides a customized user interface supporting real-time interactive access to web tables and images. Besides, we carry out a user study to test the usefulness of iWebWidget. Experimental results show that users are satisfied and really enjoy the new browsing modes for both web tables and images.

© All rights reserved Xu et al. and/or ACM Press

p. 25

Chen, Ya-Xi, Reiter, Michael and Butz, Andreas (2010): PhotoMagnets: supporting flexible browsing and searching in photo collections. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 25. Available online

People's activities around their photo collections are often highly dynamic and unstructured, such as casual browsing and searching or loosely structured storytelling. User interfaces to support such an exploratory behavior are a challenging research question. We explore ways to enhance the flexibility in dealing with photo collections and designed a system named PhotoMagnets. It uses a magnet metaphor in addition to more traditional interface elements in order to support a flexible combination of structured and unstructured photo browsing and searching. In an evaluation we received positive feedback especially on the flexibility provided by this approach.

© All rights reserved Chen et al. and/or ACM Press

p. 26

Cheng, Peng-Wen, Chennuru, Snehal, Buthpitiya, Senaka and Zhang, Ying (2010): A language-based approach to indexing heterogeneous multimedia lifelog. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 26. Available online

Lifelog systems, inspired by Vannevar Bush's concept of "MEMory EXtenders" (MEMEX), are capable of storing a person's lifetime experience as a multimedia database. Despite such systems' huge potential for improving people's everyday life, there are major challenges that need to be addressed to make such systems practical. One of them is how to index the inherently large and heterogeneous lifelog data so that a person can efficiently retrieve the log segments that are of interest. In this paper, we present a novel approach to indexing lifelogs using activity language. By quantizing the heterogeneous high dimensional sensory data into text representation, we are able to apply statistical natural language processing techniques to index, recognize, segment, cluster, retrieve, and infer high-level semantic meanings of the collected lifelogs. Based on this indexing approach, our lifelog system supports easy retrieval of log segments representing past similar activities and generation of salient summaries serving as overviews of segments.

© All rights reserved Cheng et al. and/or ACM Press
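
A minimal sketch of the "activity language" idea described above: quantize a sensor stream into symbolic words and compare segments with a crude n-gram overlap. The quantization scheme and similarity measure below are assumptions chosen for illustration, not the authors' indexing pipeline:

```python
import numpy as np

def quantize(samples: np.ndarray, bins: np.ndarray) -> str:
    """Map each sensory sample to a symbolic 'word', producing an activity sentence."""
    return " ".join(f"w{int(np.digitize(x, bins))}" for x in samples)

def trigram_set(sentence: str) -> set:
    words = sentence.split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word trigrams -- a crude stand-in for NLP-style indexing."""
    ta, tb = trigram_set(a), trigram_set(b)
    return len(ta & tb) / max(1, len(ta | tb))

bins = np.linspace(-2, 2, 8)                          # quantization thresholds
walk = quantize(np.sin(np.linspace(0, 6, 60)), bins)   # one "activity" segment
walk2 = quantize(np.sin(np.linspace(0.1, 6.1, 60)), bins)  # a similar segment
rest = quantize(np.zeros(60), bins)                    # a very different segment
print(similarity(walk, walk2), similarity(walk, rest))  # similar >> dissimilar
```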

p. 27

Li, Kaiming, Guo, Lei, Faraco, Carlos, Zhu, Dajiang, Deng, Fan, Zhang, Tuo, Jiang, Xi, Zhang, Degang, Chen, Hanbo, Hu, Xintao, Miller, Stephen and Liu, Tianming (2010): Human-centered attention models for video summarization. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 27. Available online

A variety of user attention models for video/audio streams have been developed for video summarization and abstraction, in order to facilitate efficient video browsing and indexing. Essentially, the human brain is the end user and evaluator of multimedia content and representation, and its responses can provide meaningful guidelines for multimedia stream summarization. For example, video/audio segments that significantly activate the visual, auditory, language and working memory systems of the human brain should be considered more important than others. It should be noted that user experience studies could be useful for such evaluations, but are suboptimal in terms of their capability of accurately capturing the full-length dynamics and interactions of the brain's response. This paper presents our preliminary efforts in applying the brain imaging technique of functional magnetic resonance imaging (fMRI) to quantify and model the dynamics and interactions between multimedia streams and brain response, when human subjects are presented with the multimedia clips, in order to develop human-centered attention models that can be used to guide and facilitate more effective and efficient multimedia summarization. Our initial results are encouraging.

© All rights reserved Li et al. and/or ACM Press

p. 28

Landay, James A. (2010): Activity-based Ubicomp: a new research basis for the future of human-computer interaction. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 28. Available online

Ubiquitous computing (Ubicomp) is bringing computing off the desktop and into our everyday lives. For example, an interactive display can be used by the family of an elder to stay in constant touch with the elder's everyday wellbeing, or by a group to visualize and share information about exercise and fitness. Mobile sensors, networks, and displays are proliferating worldwide in mobile phones, enabling this new wave of applications that are intimate with the user's physical world. In addition to being ubiquitous, these applications share a focus on high-level activities, which are long-term social processes that take place in multiple environments and are supported by complex computation and inference of sensor data. However, the promise of this Activity-based Ubicomp is unfulfilled, primarily due to methodological, design, and tool limitations in how we understand the dynamics of activities. The traditional cognitive psychology basis for human-computer interaction, which focuses on our short term interactions with technological artifacts, is insufficient for achieving the promise of Activity-based Ubicomp. We are developing design methodologies and tools, as well as activity recognition technologies, to both demonstrate the potential of Activity-based Ubicomp as well as to support designers in fruitfully creating these types of applications.

© All rights reserved Landay and/or ACM Press

p. 29

Deena, Salil, Hou, Shaobo and Galata, Aphrodite (2010): Visual speech synthesis by modelling coarticulation dynamics using a non-parametric switching state-space model. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 29. Available online

We present a novel approach to speech-driven facial animation using a non-parametric switching state space model based on Gaussian processes. The model is an extension of the shared Gaussian process dynamical model, augmented with switching states. Audio and visual data from a talking head corpus are jointly modelled using the proposed method. The switching states are found using variable length Markov models trained on labelled phonetic data. We also propose a synthesis technique that takes into account both previous and future phonetic context, thus accounting for coarticulatory effects in speech.

© All rights reserved Deena et al. and/or ACM Press

p. 3

Kok, Iwan de, Ozkan, Derya, Heylen, Dirk and Morency, Louis-Philippe (2010): Learning and evaluating response prediction models using parallel listener consensus. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 3. Available online

Traditionally, listener response prediction models are learned from pre-recorded dyadic interactions. Because of individual differences in behavior, these recordings do not capture the complete ground truth. Where the recorded listener did not respond to an opportunity provided by the speaker, another listener would have responded, or vice versa. In this paper, we introduce the concept of parallel listener consensus, where the listener responses from multiple parallel interactions are combined to better capture differences and similarities between individuals. We show how parallel listener consensus can be used for both learning and evaluating probabilistic prediction models of listener responses. To improve the learning performance, the parallel consensus helps identify better negative samples and reduces outliers in the positive samples. We propose a new error measurement called f_Consensus which exploits the parallel consensus to better define the concepts of exactness (mislabels) and completeness (missed labels) for prediction models. We present a series of experiments using the MultiLis Corpus where three listeners were tricked into believing that they had a one-on-one conversation with a speaker, while in fact they were recorded in parallel in interaction with the same speaker. In this paper we show that using parallel listener consensus can improve learning performance and provide better evaluation criteria for predictive models.

© All rights reserved Kok et al. and/or ACM Press
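
To make the consensus idea concrete, the sketch below combines frame-level response annotations from three hypothetical parallel listeners and uses the consensus to weight exactness and completeness. This is only an illustration of the general notion; it is not the paper's actual f_Consensus definition, and all annotations are invented:

```python
import numpy as np

# Hypothetical frame-level response annotations from three parallel listeners
# (1 = listener produced a backchannel in that frame), plus model predictions.
listeners = np.array([
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
])
predictions = np.array([0, 1, 1, 0, 0, 0, 0, 1])

consensus = listeners.mean(axis=0)   # fraction of parallel listeners responding per frame

# Consensus-weighted exactness/completeness -- a rough stand-in for the idea
# behind f_Consensus, not its published definition.
exactness = (predictions * consensus).sum() / max(1, predictions.sum())
completeness = (predictions * consensus).sum() / consensus.sum()
print(round(exactness, 2), round(completeness, 2))
```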

p. 30

Rodríguez, Luis, García-Varea, Ismael and Vidal, Enrique (2010): Multi-modal computer assisted speech transcription. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 30. Available online

Speech recognition systems are not typically able to produce error-free results in real scenarios. On account of this, human intervention is usually needed. This intervention can be included in the system by following the Computer Assisted Speech Transcription (CAST) approach, where the user constantly interacts with the system during the transcription process. In order to improve this user interaction, a speech multi-modal interface is proposed here. In addition, the use of word graphs within CAST, aimed at facilitating the design of such an interface as well as improving the system response time, is also discussed.

© All rights reserved Rodríguez et al. and/or ACM Press

p. 31

Tellex, Stefanie, Kollar, Thomas, Shaw, George, Roy, Nicholas and Roy, Deb (2010): Grounding spatial language for video search. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 31. Available online

The ability to find a video clip that matches a natural language description of an event would enable intuitive search of large databases of surveillance video. We present a mechanism for connecting a spatial language query to a video clip corresponding to the query. The system can retrieve video clips matching millions of potential queries that describe complex events in video such as "people walking from the hallway door, around the island, to the kitchen sink." By breaking down the query into a sequence of independent structured clauses and modeling the meaning of each component of the structure separately, we are able to improve on previous approaches to video retrieval by finding clips that match much longer and more complex queries using a rich set of spatial relations such as "down" and "past." We present a rigorous analysis of the system's performance, based on a large corpus of task-constrained language collected from fourteen subjects. Using this corpus, we show that the system effectively retrieves clips that match natural language descriptions: 58.3% were ranked in the top two of ten in a retrieval task. Furthermore, we show that spatial relations play an important role in the system's performance.

© All rights reserved Tellex et al. and/or ACM Press

p. 32

Ehlen, Patrick and Johnston, Michael (2010): Location grounding in multimodal local search. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 32. Available online

Computational models of dialog context have often focused on unimodal spoken dialog or text, using the language itself as the primary locus of contextual information. But as we move from spoken interaction to situated multimodal interaction on mobile platforms supporting a combination of spoken dialog with graphical interaction, touch-screen input, geolocation, and other non-linguistic contextual factors, we will need more sophisticated models of context that capture the influence of these factors on semantic interpretation and dialog flow. Here we focus on how users establish the location they deem salient from the multimodal context by grounding it through interactions with a map-based query system. While many existing systems rely on geolocation to establish the location context of a query, we hypothesize that this approach often ignores the grounding actions users make, and provide an analysis of log data from one such system that reveals errors that arise from that faulty treatment of grounding. We then explore and evaluate, using live field data from a deployed multimodal search system, several different context classification techniques that attempt to learn the location contexts users make salient by grounding them through their multimodal actions.

© All rights reserved Ehlen and Johnston and/or ACM Press

p. 33

Kurihara, Kazutaka, Mochizuki, Toshio, Oura, Hiroki, Tsubakimoto, Mio, Nishimori, Toshihisa, Nakahara, Jun, Yamauchi, Yuhei and Nagao, Katashi (2010): Linearity and synchrony: quantitative metrics for slide-based presentation methodology. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 33. Available online

In this paper we propose new quantitative metrics that express the characteristics of current general practices in slide-based presentation methodology. The proposed metrics are numerical expressions of: 'To what extent are the materials being presented in the prepared order?' and 'What is the degree of separation between the displays of the presenter and the audience?'. Through the use of these metrics, it becomes possible to quantitatively evaluate various extended methods designed to improve presentations. We illustrate examples of calculation and visualization for the proposed metrics.

© All rights reserved Kurihara et al. and/or ACM Press
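
The abstract does not give the formulas, but plausible instantiations of the two metrics are easy to sketch: linearity as the fraction of slide transitions that follow the prepared order, and synchrony as the fraction of time the presenter's and audience's displays show the same slide. Both definitions below are assumptions for illustration, not the paper's exact metrics:

```python
def linearity(shown_order):
    """Fraction of transitions that move exactly one slide forward in the
    prepared order (1.0 = strictly linear presentation)."""
    steps = list(zip(shown_order, shown_order[1:]))
    return sum(1 for a, b in steps if b == a + 1) / max(1, len(steps))

def synchrony(presenter, audience):
    """Fraction of sampled instants at which the presenter's and the audience's
    displays show the same slide (1.0 = fully synchronized displays)."""
    return sum(p == a for p, a in zip(presenter, audience)) / max(1, len(presenter))

print(linearity([1, 2, 3, 5, 4, 5, 6]))            # some jumping around -> 0.67
print(synchrony([1, 1, 2, 3, 3], [1, 2, 2, 3, 4]))  # partially separated -> 0.6
```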

p. 34

Lee, Myunghee and Kim, Gerard J. (2010): Empathetic video experience through timely multimodal interaction. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 34. Available online

In this paper, we describe a video playing system, named "Empatheater," that is controlled by multimodal interaction. As the video is played, the user must interact and emulate predefined video "events" through multimodal guidance and whole body interaction (e.g. following the main character's motion or gestures). Without the timely interaction, the video stops. The system shows guidance information on how to properly react and continue the video playback. The purpose of such a system is to provide indirect experience (of the given video content) by eliciting the user to mimic and empathize with the main character. The user is given the illusion (suspended disbelief) of playing an active role in the unraveling video content. We discuss various features of the newly proposed interactive medium. In addition, we report on the results of a pilot study that was carried out to evaluate its user experience compared to passive video viewing and keyboard-based video control.

© All rights reserved Lee and Kim and/or ACM Press

p. 35

Pakkanen, Toni, Raisamo, Roope, Salminen, Katri and Surakka, Veikko (2010): Haptic numbers: three haptic representation models for numbers on a touch screen phone. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 35. Available online

Systematic research on haptic stimuli is needed to create a viable haptic feel for user interface elements. There has been a lot of research with haptic user interface prototypes, but much less with haptic stimulus design. In this study we compared three haptic representation models with two representation rates for the numbers used in the phone number keypad layout. Haptic representations for the numbers were derived from Arabic and Roman numerals, and from the location of the number button in the layout grid. Using a Nokia 5800 Express Music phone, participants entered phone numbers blindly on the phone. The speed, error rate, and subjective experiences were recorded. The results showed that the model had no effect on the measured performance, but subjective experiences were affected. The Arabic numbers with the slower speed were preferred most. Thus, subjectively the performance was rated as better, even though objective measures showed no differences.

© All rights reserved Pakkanen et al. and/or ACM Press

p. 36

Cheng, Juan, Chen, Xiang, Lu, Zhiyuan, Wang, Kongqiao and Shen, Minfen (2010): Key-press gestures recognition and interaction based on SEMG signals. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 36. Available online

This article presents research on the pattern recognition of keypress finger gestures based on surface electromyographic (SEMG) signals and on the feasibility of keypress gestures for interaction applications. Two sorts of recognition experiments were first designed to explore the feasibility and repeatability of the SEMG-based classification of 16 keypress finger gestures of the right hand and 4 control gestures; the keypress gestures were defined with reference to the standard PC keyboard. Based on the experimental results, 10 well-recognized keypress gestures were selected as numeric input keys of a simulated phone, and the 4 control gestures were mapped to 4 control keys. Then two types of use tests, namely volume setting and SMS sending, were conducted to survey the gesture-based interaction performance and users' attitudes toward this technique. The test results showed that users could accept this novel input strategy and found it a fresh experience.

© All rights reserved Cheng et al. and/or ACM Press

p. 37

Mu, Kaihui, Tao, Jianhua, Che, Jianfeng and Yang, Minghao (2010): Mood avatar: automatic text-driven head motion synthesis. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 37. Available online

Natural head motion is an indispensable part of realistic facial animation. This paper presents a novel approach to synthesize natural head motion automatically based on grammatical and prosodic features, which are extracted by the text analysis part of a Chinese Text-to-Speech (TTS) system. A two-layer clustering method is proposed to determine elementary head motion patterns from a multimodal database which covers six emotional states. The mapping problem between textual information and elementary head motion patterns is modeled by Classification and Regression Trees (CART). With the emotional state specified by users, results from text analysis are utilized to drive the corresponding CART model to create an emotional head motion sequence. Then, the generated sequence is interpolated by spline and used to drive a Chinese text-driven avatar. A comparison experiment indicates that this approach provides better head motion and a more engaging human-computer interaction compared to random or no head motion.

© All rights reserved Mu et al. and/or ACM Press
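
The CART mapping from textual features to elementary head-motion patterns can be sketched with scikit-learn's DecisionTreeClassifier (a CART-style learner); the feature encoding, labels and values below are invented placeholders, not the paper's actual feature set or data:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-syllable features: [part-of-speech id, pitch level, stress flag,
# emotion id] -> elementary head-motion pattern id (cluster index).
X = [
    [0, 2, 1, 3],   # noun, high pitch, stressed, "happy"
    [1, 0, 0, 3],   # verb, low pitch, unstressed, "happy"
    [0, 1, 1, 1],   # noun, mid pitch, stressed, "sad"
    [2, 0, 0, 1],   # particle, low pitch, unstressed, "sad"
    [1, 2, 1, 0],   # verb, high pitch, stressed, "neutral"
]
y = [2, 0, 1, 0, 2]  # head-motion pattern cluster per sample

cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(cart.predict([[0, 2, 1, 0]]))  # pattern predicted for a new syllable
```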

p. 38

Pitts, Matthew J., Burnett, Gary E., Williams, Mark A. and Wellings, Tom (2010): Does haptic feedback change the way we view touchscreens in cars?. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 38. Available online

Touchscreens are increasingly being used in mobile devices and in-vehicle systems. While the usability benefits of touchscreens are acknowledged, their use places significant visual demand on the user due to the lack of tactile and kinaesthetic feedback. Haptic feedback is shown to improve performance in mobile devices, but little objective data is available regarding touchscreen feedback in an automotive scenario. A study was conducted to investigate the effects of visual and haptic touchscreen feedback on driver visual behaviour and driving performance using a simulated driving environment. Results showed a significant interaction between visual and haptic feedback, with the presence of haptic feedback compensating for changes in visual feedback. Driving performance was unaffected by feedback condition but degraded from a baseline measure when touchscreen tasks were introduced. Subjective responses indicated an improved user experience and increased confidence when haptic feedback was enabled.

© All rights reserved Pitts et al. and/or ACM Press

p. 39

Sanchez-Cortes, Dairazalia, Aran, Oya, Mast, Marianne Schmid and Gatica-Perez, Daniel (2010): Identifying emergent leadership in small groups using nonverbal communicative cues. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 39. Available online

This paper first presents an analysis of how an emergent leader is perceived in newly formed small groups, and second, explores correlations between the perception of leadership and automatically extracted nonverbal communicative cues. We hypothesize that the difference in individual nonverbal features between emergent leaders and non-emergent leaders is significant and measurable using speech activity. Our results on a new interaction corpus show that such an approach is promising, identifying the emergent leader with an accuracy of up to 80%.

© All rights reserved Sanchez-Cortes et al. and/or ACM Press
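
Speech-activity cues of the kind the paper relies on can be illustrated with a toy computation of per-participant speaking time and turn counts from diarization-style segments; the data and the "most speaking time" heuristic are assumptions for illustration, not the authors' full feature set:

```python
from collections import defaultdict

# Hypothetical diarization output: (speaker, start_s, end_s) segments for one meeting.
segments = [
    ("A", 0.0, 12.5), ("B", 12.5, 15.0), ("A", 15.0, 30.0),
    ("C", 30.0, 34.0), ("B", 34.0, 38.0), ("A", 38.0, 55.0),
]

def speaking_features(segs):
    """Total speaking time (s) and number of turns per participant."""
    speak_time, turns = defaultdict(float), defaultdict(int)
    for spk, start, end in segs:
        speak_time[spk] += end - start
        turns[spk] += 1
    return speak_time, turns

speak_time, turns = speaking_features(segments)
predicted_leader = max(speak_time, key=speak_time.get)   # most-speaking participant
print(dict(speak_time), dict(turns), predicted_leader)
```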

p. 4

Zhang, Hui, Fricker, Damian, Smith, Thomas G. and Yu, Chen (2010): Real-time adaptive behaviors in multimodal human-avatar interactions. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 4. Available online

Multimodal interaction in everyday life seems so effortless. However, a closer look reveals that such interaction is indeed complex and comprises multiple levels of coordination, from high-level linguistic exchanges to low-level couplings of momentary bodily movements both within an agent and across multiple interacting agents. A better understanding of how these multimodal behaviors are coordinated can provide insightful principles to guide the development of intelligent multimodal interfaces. In light of this, we propose and implement a research framework in which human participants interact with a virtual agent in a virtual environment. Our platform allows the virtual agent to keep track of the user's gaze and hand movements in real time, and adjust his own behaviors accordingly. An experiment is designed and conducted to investigate adaptive user behaviors in a human-agent joint attention task. Multimodal data streams are collected in the study including speech, eye gaze, hand and head movements from both the human user and the virtual agent, which are then analyzed to discover various behavioral patterns. Those patterns show that human participants are highly sensitive to momentary multimodal behaviors generated by the virtual agent and they rapidly adapt their behaviors accordingly. Our results suggest the importance of studying and understanding real-time adaptive behaviors in human-computer multimodal interactions.

© All rights reserved Zhang et al. and/or ACM Press

p. 40

Dong, Wen (2010): Quantifying group problem solving with stochastic analysis. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 40. Available online

Quantifying the relationship between group dynamics and group performance is key to increasing group performance. In this paper, we discuss how group performance relates to several heuristics about group dynamics in performing several typical tasks. We also present our novel stochastic modeling approach for learning the structure of group dynamics. Our performance estimators account for between 40 and 60% of the variance across a range of group problem-solving tasks.

© All rights reserved Dong and/or ACM Press

p. 41

Ruiz, Natalie, Feng, Qian Qian, Taib, Ronnie, Handke, Tara and Chen, Fang (2010): Cognitive skills learning: pen input patterns in computer-based athlete training. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 41. Available online

In this paper, we describe a longitudinal user study with athletes using a cognitive training tool, equipped with an interactive pen interface, and think-aloud protocols. The aim is to verify whether cognitive load can be inferred directly from changes in geometric and temporal features of the pen trajectories. We compare trajectories across cognitive load levels and overall Pre and Post training tests. The results show trajectory durations and lengths decrease while speeds increase, all significantly, as cognitive load increases. These changes are attributed to mechanisms for dealing with high cognitive load in working memory, with minimal rehearsal. With more expertise, trajectory durations further decrease and speeds further increase, which is attributed in part to cognitive skill acquisition and to schema development, both in extraneous and intrinsic networks, between Pre and Post tests. As such, these pen trajectory features offer insight into implicit communicative changes related to load fluctuations.

© All rights reserved Ruiz et al. and/or ACM Press
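
The trajectory features analyzed in the study (duration, length, speed) are straightforward to compute from timestamped pen samples; a minimal sketch with made-up coordinates:

```python
import math

def trajectory_features(samples):
    """Duration (s), path length (px) and mean speed (px/s) of one pen trajectory.
    Each sample is a (t, x, y) tuple."""
    duration = samples[-1][0] - samples[0][0]
    length = sum(math.hypot(x2 - x1, y2 - y1)
                 for (_, x1, y1), (_, x2, y2) in zip(samples, samples[1:]))
    speed = length / duration if duration > 0 else 0.0
    return duration, length, speed

stroke = [(0.00, 10, 10), (0.05, 14, 13), (0.10, 20, 18), (0.15, 27, 24)]
print(trajectory_features(stroke))  # (0.15, ~22.0, ~147)
```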

p. 42

Tahiroglu, Koray and Ahmaniemi, Teemu (2010): Vocal sketching: a prototype tool for designing multimodal interaction. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 42. Available online

Dynamic audio feedback enriches the interaction with a mobile device. Novel sensor technologies and audio synthesis tools provide an infinite number of possibilities for designing the interaction between sensory input and audio output. This paper presents a study where vocal sketching was used as a prototyping method to grasp ideas and expectations in the early stages of designing multimodal interaction. We introduce an experiment where participants were given a graspable mobile device and urged to vocally sketch the sounds to be produced when using the device in communication and musical expression scenarios. The sensory input methods were limited to gestures such as touch, squeeze and movement. Vocal sketching let us examine more closely how gesture and sound could be coupled in the use of our prototype device, such as moving the device upwards with rising pitch. The results reported in this paper have already informed our opinions and expectations towards the actual design phase of the audio modality.

© All rights reserved Tahiroglu and Ahmaniemi and/or ACM Press

p. 43

Tada, Masahiro, Noma, Haruo and Renge, Kazumi (2010): Evidence-based automated traffic hazard zone mapping using wearable sensors. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 43. Available online

Recently, underestimating traffic-condition risk has been considered one of the biggest causes of traffic accidents. In this paper, we propose an evidence-based automatic hazard zone mapping method using wearable sensors. We measure driver behavior using three-axis gyro sensors. By analyzing the measured motion data, the proposed method can label characteristic motions that are observed at hazard zones. We gathered motion data sets from two types of drivers, i.e., an instructor at a driving school and an ordinary driver, and then generated a traffic hazard zone map focused on the differences between their motions. Through an experiment on public roads, we confirmed that our method can extract hazard zones.

© All rights reserved Tada et al. and/or ACM Press

p. 44

Sumi, Yasuyuki, Yano, Masaharu and Nishida, Toyoaki (2010): Analysis environment of conversational structure with nonverbal multimodal data. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 44. Available online

This paper presents the IMADE (Interaction Measurement, Analysis, and Design Environment) project, which builds a recording and analysis environment for human conversational interactions. The IMADE room is designed to record audio/visual, human-motion and eye-gaze data for building an interaction corpus, mainly focusing on the understanding of human nonverbal behaviors. In this paper, we describe the notion of an interaction corpus and iCorpusStudio, a software environment for browsing and analyzing the interaction corpus. We also present a preliminary experiment on multiparty conversations.

© All rights reserved Sumi et al. and/or ACM Press

p. 45

Wang, Rongrong, Quek, Francis, Teh, James K. S., Cheok, Adrian D. and Lai, Sep Riang (2010): Design and evaluation of a wearable remote social touch device. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 45. Available online

Psychological and sociological studies have established the essential role that touch plays in interpersonal communication. However this channel is largely ignored in current telecommunication technologies. We design and implement a remote touch armband with an electric motor actuator. This is paired with a touch input device in the form of a force-sensor-embedded smart phone case. When the smart phone is squeezed, the paired armband will be activated to simulate a squeeze on the user's upper arm. A usability study is conducted with 22 participants to evaluate the device in terms of perceptibility. The results show that users can easily perceive touch at different force levels.

© All rights reserved Wang et al. and/or ACM Press

p. 46

Alabau, Vicent, Ortiz-Martínez, Daniel, Sanchis, Alberto and Casacuberta, Francisco (2010): Multimodal interactive machine translation. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 46. Available online

Interactive machine translation (IMT) [1] is an alternative approach to machine translation, integrating human expertise into the automatic translation process. In this framework, a human iteratively interacts with a system until the output desired by the human is completely generated. Traditionally, interaction has been performed using a keyboard and a mouse. However, the use of touchscreens has been popularised recently. Many touchscreen devices already exist in the market, namely mobile phones, laptops and tablet computers like the iPad. In this work, we propose a new interaction modality to take advantage of such devices, for which online handwritten text seems a very natural way of input. Multimodality is formulated as an extension to the traditional IMT protocol where the user can amend errors by writing text with an electronic pen or a stylus on a touchscreen. Different approaches to modality fusion have been studied. In addition, these approaches have been assessed on the Xerox task. Finally, a thorough study of the errors committed by the online handwritten system will show future work directions.

© All rights reserved Alabau et al. and/or ACM Press

p. 47

Lawson, Jean-Yves Lionel, Coterot, Mathieu, Carincotte, Cyril and Macq, Benoit (2010): Component-based high fidelity interactive prototyping of post-WIMP interactions. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 47. Available online

In order to support interactive high-fidelity prototyping of post-WIMP user interactions, we propose a multi-fidelity design method based on a unifying component-based model and supported by an advanced tool suite, the OpenInterface Platform Workbench. Our approach strives for supporting a collaborative (programmer-designer) and user-centered design activity. The workbench architecture allows exploration of novel interaction techniques through seamless integration and adaptation of heterogeneous components, high-fidelity rapid prototyping, runtime evaluation and fine-tuning of designed systems. This paper illustrates through the iterative construction of a running example how OpenInterface allows the leverage of existing resources and fosters the creation of non-conventional interaction techniques.

© All rights reserved Lawson et al. and/or ACM Press

p. 48

Serrano, Nicolás, Giménez, Adrià, Sanchis, Albert and Juan, Alfons (2010): Active learning strategies for handwritten text transcription. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 48. Available online

Active learning strategies are being increasingly used in a variety of real-world tasks, though their application to handwritten text transcription in old manuscripts remains nearly unexplored. The basic idea is to follow a sequential, line-by-line transcription of the whole manuscript in which a continuously retrained system interacts with the user to efficiently transcribe each new line. This approach has been recently explored using a conventional strategy by which the user is only asked to supervise words that are not recognized with high confidence. In this paper, the conventional strategy is improved by also letting the system recompute the most probable hypotheses under the constraints imposed by user supervisions. In particular, two strategies are studied which differ in the frequency of hypothesis recomputation on the current line: after each (iterative) or after all (delayed) user corrections. Empirical results are reported on two real tasks showing that these strategies outperform the conventional approach.

© All rights reserved Serrano et al. and/or ACM Press
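
A minimal sketch of the conventional confidence-based supervision step described above: the user corrects only low-confidence words, while high-confidence words are accepted as-is. In the paper's improved strategies the system would additionally recompute its most probable hypothesis after each correction (iterative) or after all corrections (delayed), which this toy version omits; the data and threshold are invented:

```python
def supervise_line(hypothesis, truth, threshold=0.7):
    """User supervises only the words recognized with confidence below the threshold."""
    corrected, supervised = [], 0
    for (word, conf), true_word in zip(hypothesis, truth):
        if conf < threshold:
            corrected.append(true_word)   # user types the correct word
            supervised += 1
        else:
            corrected.append(word)        # accepted without supervision
    return corrected, supervised

# One line of a manuscript: system hypothesis with per-word confidences.
hypothesis = [("quijote", 0.95), ("de", 0.90), ("lo", 0.40), ("mancha", 0.85)]
truth = ["quijote", "de", "la", "mancha"]
print(supervise_line(hypothesis, truth))
# (['quijote', 'de', 'la', 'mancha'], 1)
```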

p. 49

Chen, Yuting, Naveed, Adeel and Porzel, Robert (2010): Behavior and preference in minimal personality: a study on embodied conversational agents. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 49. Available online

Endowing embodied conversational agents with personality affords more natural modalities for their interaction with human interlocutors. To bridge the personality gap between users and agents, we designed two minimal personalities for corresponding agents, i.e. an introverted and an extroverted agent. Each features a combination of different verbal and non-verbal behaviors. In this paper, we present an examination of the effects of the speaking and behavior styles of the two agents and explore the resulting design factors pertinent for spoken dialogue systems. The results indicate that users prefer the extroverted agent to the introverted one. The personality traits of the agents influence the users' preferences, dialogues, and behavior. Statistically, it is highly significant that users are more talkative with the extroverted agent. We also investigate the spontaneous speech disfluency of the dialogues and demonstrate that the extroverted behavior model reduces the user's speech disfluency. Furthermore, users having different mental models behave differently with the agents. The results and findings show that the minimal personalities of agents maximally influence the interlocutors' behaviors.

© All rights reserved Chen et al. and/or ACM Press

p. 5

Bohus, Dan and Horvitz, Eric (2010): Facilitating multiparty dialog with gaze, gesture, and speech. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 5. Available online

We study how synchronized gaze, gesture and speech rendered by an embodied conversational agent can influence the flow of conversations in multiparty settings. We begin by reviewing a computational framework for turn-taking that provides the foundation for tracking and communicating intentions to hold, release, or take control of the conversational floor. We then present implementation aspects of this model in an embodied conversational agent. Empirical results with this model in a shared task setting indicate that the various verbal and non-verbal cues used by the avatar can effectively shape the multiparty conversational dynamics. In addition, we identify and discuss several context variables which impact the turn allocation process.

© All rights reserved Bohus and Horvitz and/or ACM Press

p. 50

Biel, Joan-Isaac and Gatica-Perez, Daniel (2010): Vlogcast yourself: nonverbal behavior and attention in social media. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 50. Available online

We introduce vlogs as a type of rich human interaction that is multimodal in nature and suitable for new large-scale behavioral data analysis. The automatic analysis of vlogs is useful not only for studying social media but also for remote communication scenarios, and it requires integrating methods for multimodal processing and for social media understanding. Building on work from social psychology and computing, we first propose robust audio, visual, and multimodal cues to measure the nonverbal behavior of vloggers in their videos. We then investigate the relation between this behavior and the attention the videos receive on YouTube. Our study shows significant correlations between some nonverbal behavioral cues and the average number of views per video.

© All rights reserved Biel and Gatica-Perez and/or ACM Press
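
For readers who want the flavor of such a correlation analysis, a toy Python sketch follows. The cue values and view counts are made up, only a single hypothetical cue (fraction of time spent speaking) is shown, and the paper's actual statistics may differ.

    import numpy as np
    from scipy.stats import spearmanr   # rank correlation is robust to skewed view counts

    speaking_fraction = np.array([0.62, 0.48, 0.81, 0.55, 0.70])   # made-up cue values
    avg_views = np.array([1200, 300, 5400, 800, 2100])             # made-up view counts

    rho, p_value = spearmanr(speaking_fraction, avg_views)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")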

p. 51

Voit, Michael and Stiefelhagen, Rainer (2010): 3D user-perspective, voxel-based estimation of visual focus of attention in dynamic meeting scenarios. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 51. Available online

In this paper we present a new framework for the online estimation of people's visual focus of attention from their head poses in dynamic meeting scenarios. We describe a voxel-based approach to reconstructing the scene composition from an observer's perspective, in order to integrate occlusion handling and visibility verification. The observer's perspective is simulated with live head pose tracking over four far-field views from the room's upper corners. We integrate motion and speech activity as further scene observations in a Bayesian Surprise framework to model prior attractors of attention within the situation's context. Evaluations on a dedicated dataset of 10 meeting videos show that this allows us to predict a meeting participant's focus of attention correctly in up to 72.2% of all frames.

© All rights reserved Voit and Stiefelhagen and/or ACM Press
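
The occlusion-aware part of a voxel-based scheme can be sketched as a simple ray walk through an occupancy grid. The geometry, voxel size, and labeling below are assumptions for illustration, not the paper's implementation.

    import numpy as np

    def first_visible_target(occupancy, labels, head_pos, gaze_dir, step=0.25, max_dist=100.0):
        """occupancy: 3-D boolean voxel grid; labels: int target ids of the same shape;
        head_pos and gaze_dir: numpy arrays in voxel coordinates (assumed)."""
        gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
        for d in np.arange(step, max_dist, step):
            idx = tuple(np.floor(head_pos + d * gaze_dir).astype(int))
            if any(i < 0 or i >= s for i, s in zip(idx, occupancy.shape)):
                return None                      # ray left the room volume
            if occupancy[idx]:
                return labels[idx]               # first occupied voxel occludes what lies behind it
        return None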

p. 52

Escalera, Sergio, Radeva, Petia, Vitrià, Jordi, Baró, Xavier and Raducanu, Bogdan (2010): Modelling and analyzing multimodal dyadic interactions using social networks. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 52. Available online

Social network analysis has become a common technique for modeling and quantifying the properties of social interactions. In this paper, we propose an integrated framework to explore the characteristics of a social network extracted from multimodal dyadic interactions. First, speech detection is performed through an audio/visual fusion scheme based on stacked sequential learning. In the audio domain, speech is detected by clustering audio features; clusters are modeled by a one-state Hidden Markov Model containing a diagonal-covariance Gaussian Mixture Model. In the visual domain, speech detection is performed through differential feature extraction from the segmented mouth region and a dynamic programming matching procedure. Second, to model the dyadic interactions, we employ the Influence Model, whose states encode the integrated audio/visual data. Third, the social network is extracted from the estimated influences. For our study, we used a set of videos from the New York Times' Blogging Heads opinion blog. Results are reported both in terms of the accuracy of the audio/visual data fusion and the centrality measures used to characterize the social network.

© All rights reserved Escalera et al. and/or ACM Press
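
A much-simplified sketch of the audio branch only is given below, assuming scikit-learn is available: frame-level features are clustered with a diagonal-covariance Gaussian mixture and the higher-energy cluster is taken to be speech. The HMM modeling, the visual branch, and the stacked sequential fusion are omitted, and the feature layout is an assumption.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def detect_speech_frames(features):
        """features: (n_frames, n_dims) array; column 0 is assumed to be log-energy."""
        gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
        labels = gmm.fit_predict(features)
        # Call the cluster with higher mean energy "speech".
        speech_cluster = int(features[labels == 1, 0].mean() >
                             features[labels == 0, 0].mean())
        return labels == speech_cluster           # boolean mask of speech frames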

p. 53

Hidaka, Shohei and Yu, Chen (2010): Analyzing multimodal time series as dynamical systems. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 53. Available online

We propose a novel approach to discovering latent structures in multimodal time series. We view a time series as observed data from an underlying dynamical system; analyzing multimodal time series can then be viewed as finding latent structures of dynamical systems. In light of this, our approach is based on the concept of a generating partition, which is theoretically the best symbolization of a time series in that it maximizes the information preserved about the underlying continuous dynamical system. However, a generating partition is difficult to obtain for time series without explicit dynamical equations. Unlike most previous approaches, which attempt to approximate the generating partition through various deterministic symbolization processes, our algorithm maintains and estimates a probability distribution over a symbol set for each data point in a time series. To do so, we develop a Bayesian framework for probabilistic symbolization and demonstrate that the approach can be successfully applied to both simulated data and empirical data from multimodal agent-agent interactions. We suggest that this unsupervised learning algorithm has the potential to be used on various multimodal datasets as a first step toward identifying underlying structures between temporal variables.

© All rights reserved Hidaka and Yu and/or ACM Press
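
The core idea of keeping a distribution over symbols, rather than a hard assignment, can be illustrated with a toy soft-symbolization routine. The Gaussian-like responsibilities below are an assumption for illustration, not the paper's Bayesian estimator.

    import numpy as np

    def soft_symbolize(series, centers, temperature=1.0):
        """series: (T,) continuous values; centers: (K,) candidate symbol centroids.
        Returns a (T, K) row-stochastic matrix of symbol probabilities per data point."""
        d = (series[:, None] - centers[None, :]) ** 2        # squared distance to each symbol
        logits = -d / temperature
        logits -= logits.max(axis=1, keepdims=True)          # numerical stability
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)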

p. 54

Gorga, Sebastian and Otsuka, Kazuhiro (2010): Conversation scene analysis based on dynamic Bayesian network and image-based gaze detection. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 54. Available online

This paper presents a probabilistic framework, incorporating automatic image-based gaze detection, for inferring the structure of multiparty face-to-face conversations. The framework aims to infer conversation regimes and gaze patterns from the nonverbal behaviors of meeting participants, captured from image and audio streams with cameras and microphones. A conversation regime corresponds to a global conversational pattern such as monologue or dialogue, and the gaze pattern indicates "who is looking at whom". Input nonverbal behaviors include the presence/absence of utterances, head directions, and discrete head-centered eye-gaze directions. In contrast to conventional meeting analysis methods that rely only on a participant's head pose as a surrogate for visual focus of attention, this paper incorporates vision-based gaze detection, combined with head pose tracking, into a probabilistic conversation model based on a dynamic Bayesian network. Our gaze detector can differentiate three to five eye-gaze directions, e.g., left, straight, and right. Experiments on four-person conversations confirm the power of the proposed framework in identifying conversation structure and in estimating gaze patterns with higher accuracy than previous models.

© All rights reserved Gorga and Otsuka and/or ACM Press
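
As a greatly reduced illustration of regime inference (the paper's model is a richer dynamic Bayesian network that also conditions on head pose and eye gaze), the sketch below runs a forward filter over three conversation regimes using only who-is-speaking observations; all probabilities are assumed.

    import numpy as np

    REGIMES = ["monologue_A", "monologue_B", "dialogue"]
    T = np.array([[0.90, 0.05, 0.05],        # assumed regime transition probabilities
                  [0.05, 0.90, 0.05],
                  [0.10, 0.10, 0.80]])

    def emission(obs):
        """obs = (A_speaking, B_speaking); assumed likelihood of obs under each regime."""
        a, b = obs
        return np.array([0.8 if a and not b else 0.1,
                         0.8 if b and not a else 0.1,
                         0.7 if a and b else 0.2])

    def forward_filter(observations):
        belief = np.full(len(REGIMES), 1.0 / len(REGIMES))
        for obs in observations:
            belief = emission(obs) * (T.T @ belief)   # predict, then weight by evidence
            belief /= belief.sum()
        return dict(zip(REGIMES, belief.round(2)))

    print(forward_filter([(1, 0), (1, 0), (1, 1), (0, 1)]))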

p. 6

Schauerte, Boris and Fink, Gernot A. (2010): Focusing computational visual attention in multi-modal human-robot interaction. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 6. Available online

Identifying verbally and non-verbally referred-to objects is an important aspect of human-robot interaction: it is essential for achieving a joint focus of attention and, thus, natural interaction behavior. In this contribution, we introduce a saliency-based model that reflects how multi-modal referring acts influence visual search, i.e., the task of finding a specific object in a scene. To this end, we combine positional information obtained from pointing gestures with contextual knowledge about the visual appearance of the referred-to object obtained from language. The available information is then integrated into a biologically motivated saliency model that forms the basis for visual search. We demonstrate the feasibility of the proposed approach with the results of an experimental evaluation.

© All rights reserved Schauerte and Fink and/or ACM Press
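
A hedged sketch of just the fusion step follows: a Gaussian spatial prior around the pointed-at image location is multiplied with a top-down appearance map favoring the hue named in the verbal reference. The bottom-up saliency channels and the full biologically motivated model are not reproduced, and all parameters are assumptions.

    import numpy as np

    def fused_saliency(image_hsv, point_xy, target_hue, sigma=40.0, hue_tol=0.08):
        """image_hsv: (H, W, 3) array with hue in [0, 1]; point_xy: pointed-at pixel (x, y);
        target_hue: hue implied by the verbal reference (e.g. 'the red one')."""
        h, w = image_hsv.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w]
        spatial = np.exp(-((xs - point_xy[0]) ** 2 + (ys - point_xy[1]) ** 2)
                         / (2 * sigma ** 2))                 # pointing-gesture prior
        appearance = np.exp(-((image_hsv[..., 0] - target_hue) ** 2)
                            / (2 * hue_tol ** 2))            # language-derived appearance map
        s = spatial * appearance
        return s / (s.max() + 1e-9)                          # normalized saliency map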

p. 7

Lepri, Bruno, Subramanian, Ramanathan, Kalimeri, Kyriaki, Staiano, Jacopo, Pianesi, Fabio and Sebe, Nicu (2010): Employing social gaze and speaking activity for automatic determination of the Extraversion trait. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 7. Available online

In order to predict the Extraversion personality trait, we exploit medium-grained behaviors enacted in group meetings, namely speaking time and social attention (social gaze), the latter further distinguished into attention given to the other group members and attention received from them. The results of our work confirm many of our hypotheses: a) speaking time and (some forms of) social gaze are effective for automatically predicting Extraversion; b) classification accuracy is affected by the size of the time slices used for analysis; and c) to a large extent, considering the social context does not add much to prediction accuracy, with an important exception concerning social gaze.

© All rights reserved Lepri et al. and/or ACM Press
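
A toy scikit-learn sketch of such a classification setup is shown below. The three per-slice features and the random placeholder data are assumptions; in practice the features would be recomputed for every slice size whose accuracy one wants to compare.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def extraversion_accuracy(X, y):
        """X: (n_slices, 3) features [speaking_time, gaze_given, gaze_received];
        y: binary high/low Extraversion labels. Returns mean cross-validated accuracy."""
        return cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()

    rng = np.random.default_rng(0)                       # placeholder data only
    X, y = rng.random((120, 3)), rng.integers(0, 2, 120)
    print(f"accuracy = {extraversion_accuracy(X, y):.2f}")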

p. 8

Li, Weifeng, Nüssli, Marc-Antoine and Jermann, Patrick (2010): Gaze quality assisted automatic recognition of social contexts in collaborative Tetris. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 8. Available online

The use of dual eye-tracking is investigated in a collaborative game setting. Social context influences individual gaze and action during a collaborative Tetris game: results show that experts as well as novices adapt their playing style when interacting in mixed-ability pairs. The long-term goal of our work is to design adaptive gaze-awareness tools that take the pair composition into account. We therefore investigate the automatic detection (or recognition) of pair composition using dual gaze-based as well as action-based multimodal features. We describe several methods for improving detection and experimentally demonstrate their effectiveness, especially in situations where the collected gaze data are noisy.

© All rights reserved Li et al. and/or ACM Press
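
One simple way to let gaze quality moderate the fusion of gaze-based and action-based evidence, in the spirit of the abstract's emphasis on noisy gaze data, is a quality-weighted mixture of the two classifiers' class probabilities. The interface below is hypothetical.

    import numpy as np

    def fuse_predictions(p_gaze, p_action, gaze_quality):
        """p_gaze, p_action: class-probability vectors over pair compositions
        (e.g. expert-expert, expert-novice, novice-novice); gaze_quality in [0, 1]."""
        p = gaze_quality * np.asarray(p_gaze) + (1.0 - gaze_quality) * np.asarray(p_action)
        return p / p.sum()

    # Noisy gaze (quality 0.4): the fused estimate leans on the action-based classifier.
    print(fuse_predictions([0.7, 0.2, 0.1], [0.3, 0.4, 0.3], gaze_quality=0.4))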

p. 9

Bee, Nikolaus, Wagner, Johannes, Andre, Elisabeth, Vogt, Thurid, Charles, Fred, Pizzi, David and Cavazza, Marc (2010): Discovering eye gaze behavior during human-agent conversation in an interactive storytelling application. In: Proceedings of the 2010 International Conference on Multimodal Interfaces 2010. p. 9. Available online

In this paper, we investigate users' eye gaze behavior during conversation with an interactive storytelling application. We present an interactive eye gaze model for embodied conversational agents, designed to improve the experience of users participating in Interactive Storytelling. The underlying narrative in which the approach was tested is based on a classic nineteenth-century psychological novel, Flaubert's Madame Bovary. At various stages of the narrative, the user can address the main character or respond to her using free-style spoken natural language input, impersonating her lover. An eye tracker was connected so that the interactive gaze model could respond to the user's current gaze (i.e., whether or not the user is looking into the virtual character's eyes). We conducted a study with 19 students in which we compared our interactive eye gaze model with a non-interactive eye gaze model that was informed by studies of human gaze behavior but had no information on where the user was looking. The interactive model received higher user ratings than the non-interactive model. In addition, we analyzed the users' gaze behavior during the conversation with the virtual character.

© All rights reserved Bee et al. and/or ACM Press
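
The gaze-contingent switch implied by the abstract can be sketched in a few lines; the screen-region bounds and behavior labels are placeholders, not the system's actual values.

    def agent_gaze_behaviour(gaze_xy, eye_region=(400, 480, 200, 240)):
        """Return a behaviour label depending on whether the tracked gaze point
        (pixel coordinates) falls inside the agent's eye region (x0, x1, y0, y1)."""
        x0, x1, y0, y1 = eye_region
        in_eyes = x0 <= gaze_xy[0] <= x1 and y0 <= gaze_xy[1] <= y1
        return "establish_mutual_gaze" if in_eyes else "avert_then_reengage"

    print(agent_gaze_behaviour((432, 215)))    # inside the eye region -> mutual gaze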




 
 


Page Information

Page maintainer: The Editorial Team
URL: http://www.interaction-design.org/references/conferences/proceedings_of_the_2010_international_conference_on_multimodal_interfaces.html