What is this field of Human-Computer Interaction? People are quite different from computers. This is hardly a novel observation, but whenever people use computers, there is necessarily a zone of mutual accommodation and this defines our area of interest. People are so adaptable that they are capable of shouldering the entire burden of accommodation to an artifact, but skillful designers make large parts of this burden vanish by adapting the artifact to its users. To understand successful design requires an understanding of the technology, the person, and their mutual interaction [...]
-- Stephen Draper and Donald Norman. In "User Centered System Design" (1986) p. 1
Proceedings of the 2009 International Conference on Multimodal Interfaces
The International Conference on Multimodal Interfaces (ICMI) is an annual ACM-sponsored conference that promotes research in next-generation perceptive, adaptive and multimodal user interfaces. These new interfaces are especially well suited for interpreting natural communication and activity patterns in real-world environments.
The following articles are from "Proceedings of the 2009 International Conference on Multimodal Interfaces":
Breazeal, Cynthia (2009): Living better with robots. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 1-2. Available online
The emerging field of Human-Robot Interaction is undergoing rapid growth, motivated by important societal challenges and new applications for personal robotic technologies for the general public. In this talk, I highlight several projects from my research group to illustrate recent research trends to develop socially interactive robots that work and learn with people as partners. An important goal of this work is to use interactive robots as a scientific tool to understand human behavior, to explore the role of physical embodiment in interactive technology, and to use these insights to design robotic technologies that can enhance human performance and quality of life. Throughout the talk I will highlight synergies with HCI and connect HRI research goals to specific applications in healthcare, education, and communication.
Luz, Saturnino and Kane, Bridget (2009): Classification of patient case discussions through analysis of vocalisation graphs. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 107-114. Available online
This paper investigates the use of amount and structure of talk as a basis for automatic classification of patient case discussions in multidisciplinary medical team meetings recorded in a real-world setting. We model patient case discussions as vocalisation graphs, building on research from the fields of interaction analysis and social psychology. These graphs are "content free" in that they only encode patterns of vocalisation and silence. The fact that it does not rely on automatic transcription makes the technique presented in this paper an attractive complement to more sophisticated speech processing methods as a means of indexing medical team meetings. We show that despite the simplicity of the underlying representation mechanism, accurate classification performance (F-scores: F_1 = 0.98, for medical patient case discussions, and F_1 = 0.97, for surgical case discussions) can be achieved with a simple k-nearest neighbour classifier when vocalisations are represented at the level of individual speakers. Possible applications of the method in health informatics for storage and retrieval of multimedia medical meeting records are discussed.
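As a rough illustration of the idea in this abstract, the sketch below encodes a discussion as transition probabilities between vocalisation states and classifies it with a nearest-neighbour vote. The state labels, the toy sequences, the class names attached to them and the Euclidean distance are all assumptions made for illustration; the abstract does not specify the authors' graph encoding or distance measure.

```python
from collections import Counter
import math

def vocalisation_graph(states, labels):
    """Flattened, row-normalised transition probabilities between consecutive states."""
    counts = Counter(zip(states, states[1:]))
    graph = []
    for a in labels:
        row_total = sum(counts[(a, b)] for b in labels)
        graph.extend(counts[(a, b)] / row_total if row_total else 0.0 for b in labels)
    return graph

def knn_predict(query, training, k=1):
    """Majority vote among the k training vectors closest to the query."""
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical states: two speakers plus silence. No speech content is used.
LABELS = ["A", "B", "silence"]
training = [
    (vocalisation_graph(["A", "A", "A", "silence", "A", "A"], LABELS), "medical"),
    (vocalisation_graph(["A", "B", "A", "B", "silence", "B"], LABELS), "surgical"),
]
query = vocalisation_graph(["A", "A", "silence", "A", "A", "A"], LABELS)
prediction = knn_predict(query, training)
```

The representation stays "content free" in the abstract's sense: only who vocalises after whom, and when silence occurs, enters the feature vector.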
Yannakakis, Georgios N. (2009): Learning from preferences and selected multimodal features of players. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 115-118. Available online
The influence of multimodal sources of input data to the construction of accurate computational models of user preferences is investigated in this paper. The case study presented explores player entertainment preferences of physical game variants incorporating two data modalities. The main findings of the paper reveal the benefit of multiple modalities of input data for the prediction of preferences and highlight the impact of feature selection on the construction of such models.
Castellano, Ginevra, Pereira, André, Leite, Iolanda, Paiva, Ana and McOwan, Peter W. (2009): Detecting user engagement with a robot companion using task and social interaction-based features. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 119-126. Available online
Affect sensitivity is of the utmost importance for a robot companion to be able to display socially intelligent behaviour, a key requirement for sustaining long-term interactions with humans. This paper explores a naturalistic scenario in which children play chess with the iCat, a robot companion. A person-independent, Bayesian approach to detect the user's engagement with the iCat robot is presented. Our framework models both causes and effects of engagement: features related to the user's non-verbal behaviour, the task and the companion's affective reactions are identified to predict the children's level of engagement. An experiment was carried out to train and validate our model. Results show that our approach based on multimodal integration of task and social interaction-based features outperforms those based solely on non-verbal behaviour or contextual information (94.79% vs. 93.75% and 78.13%).
Fasel, Ian R., Shiomi, Masahiro, Chadutaud, Philippe-Emmanuel, Kanda, Takayuki, Hagita, Norihiro and Ishiguro, Hiroshi (2009): Multi-modal features for real-time detection of human-robot interaction categories. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 127-134. Available online
Social interactions unfold over time, at multiple time scales, and can be observed through multiple sensory modalities. In this paper, we propose a machine learning framework for selecting and combining low-level sensory features from different modalities to produce high-level characterizations of human-robot social interactions in real-time. We introduce a novel set of fast, multi-modal, spatio-temporal features for audio sensors, touch sensors, floor sensors, laser range sensors, and the time-series history of the robot's own behaviors. A subset of these features is automatically selected and combined using GentleBoost, an ensemble machine learning technique, allowing the robot to make an estimate of the current interaction category every 100 milliseconds. This information can then be used either by the robot to make decisions autonomously, or by a remote human-operator who can modify the robot's behavior manually (i.e., semi-autonomous operation). We demonstrate the technique on an information-kiosk robot deployed in a busy train station, focusing on the problem of detecting interaction breakdowns (i.e., failure of the robot to engage in a good interaction). We show that despite the varied and unscripted nature of human-robot interactions in the real-world train-station setting, the robot can achieve highly accurate predictions of interaction breakdowns at the same instant human observers become aware of them.
Cassell, Justine, Geraghty, Kathleen, Gonzalez, Berto and Borland, John (2009): Modeling culturally authentic style shifting with virtual peers. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 135-142. Available online
We report on a new kind of culturally-authentic embodied conversational agent more in line with the ways that culture and ethnicity function in the real world. On the basis of the careful analysis of a corpus of verbal and nonverbal behavior, we found that children shift dialects and ways of using their body depending on social context and task. Based on these results, we implemented a culturally authentic African American virtual peer capable of "code-switching" between African American English and Mainstream American English, and of using nonverbal behavior differently, depending on context. An evaluation of the agent revealed that the virtual peer elicited the same style changes in real children as real children did in one another.
Fang, Rui, Chai, Joyce Y. and Ferreira, Fernanda (2009): Between linguistic attention and gaze fixations in multimodal conversational interfaces. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 143-150. Available online
In multimodal human machine conversation, successfully interpreting human attention is critical. While attention has been studied extensively in linguistic processing and visual processing, it is not clear how linguistic attention is aligned with visual attention in multimodal conversational interfaces. To address this issue, we conducted a preliminary investigation on how attention reflected by linguistic discourse aligns with attention indicated by gaze fixations during human machine conversation. Our empirical findings have shown that more attended entities based on linguistic discourse correspond to higher intensity of gaze fixations. The smoother a linguistic transition is, the smaller the distance between corresponding fixation distributions. These findings provide insight into how language and gaze can be combined to predict attention, which has important implications in many tasks such as word acquisition and object recognition.
Chen, Lei and Harper, Mary P. (2009): Multimodal floor control shift detection. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 15-22. Available online
Floor control is a scheme used by people to organize speaking turns in multi-party conversations. Identifying the floor control shifts is important for understanding a conversation's structure and would be helpful for more natural human computer interaction systems. Although people tend to use verbal and nonverbal cues for managing floor control shifts, only audio cues, e.g., lexical and prosodic cues, have been used in most previous investigations on speaking turn prediction. In this paper, we present a statistical model to automatically detect floor control shifts using both verbal and nonverbal cues. Our experimental results show that using a combination of verbal and nonverbal cues provides more accurate detection.
Brewster, Stephen (2009): Head-up interaction: can we break our addiction to the screen and keyboard?. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 151-152. Available online
Mobile user interfaces are commonly based on techniques developed for desktop computers in the 1970s, often including buttons, sliders, windows and progress bars. These can be hard to use on the move, which then limits the way we use our devices and the applications on them. This talk will look at the possibility of moving away from these kinds of interactions to ones more suited to mobile devices and their dynamic contexts of use where users need to be able to look where they are going, carry shopping bags and hold on to children. Multimodal (gestural, audio and haptic) interactions provide us with new ways to use our devices that can be eyes and hands free, and allow users to interact in a 'head up' way. These new interactions will facilitate new services, applications and devices that fit better into our daily lives and allow us to do a whole host of new things. Brewster will discuss some of the work being done on input using gestures made with fingers, wrist and head, along with work on output using non-speech audio, 3D sound and tactile displays in mobile applications such as text entry, camera phone user interfaces and navigation. He will also discuss some of the issues of social acceptability of these new interfaces.
Fusion engines are fundamental components of multimodal interactive systems, to interpret input streams whose meaning can vary according to the context, task, user and time. Other surveys have considered multimodal interactive systems; we focus more closely on the design, specification, construction and evaluation of fusion engines. We first introduce some terminology and set out the major challenges that fusion engines propose to solve. A history of past work in the field of fusion engines is then presented using the BRETAM model. These approaches to fusion are then classified. The classification considers the types of application, the fusion principles and the temporal aspects. Finally, the challenges for future work in the field of fusion engines are set out. These include software frameworks, quantitative evaluation, machine learning and adaptation.
Mendonça, Hildeberto, Lawson, Jean-Yves Lionel, Vybornova, Olga, Macq, Benoit and Vanderdonckt, Jean (2009): A fusion framework for multimodal interactive applications. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 161-168. Available online
This research aims to propose a multi-modal fusion framework for high-level data fusion between two or more modalities. It takes as input low level features extracted from different system devices, analyses and identifies intrinsic meanings in these data. Extracted meanings are mutually compared to identify complementarities, ambiguities and inconsistencies to better understand the user intention when interacting with the system. The whole fusion life cycle will be described and evaluated in an office environment scenario, where two co-workers interact by voice and movements, which might show their intentions. The fusion in this case is focusing on combining modalities for capturing a context to enhance the user experience.
Dumas, Bruno, Ingold, Rolf and Lalanne, Denis (2009): Benchmarking fusion engines of multimodal interactive systems. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 169-176. Available online
This article proposes an evaluation framework to benchmark the performance of multimodal fusion engines. The paper first introduces different concepts and techniques associated with multimodal fusion engines and further surveys recent implementations. It then discusses the importance of evaluation as a means to assess fusion engines, not only from the user perspective, but also at a performance level. The article further proposes a benchmark and a formalism to build testbeds for assessing multimodal fusion engines. In its last section, our current fusion engine and the associated system HephaisTK are evaluated using the evaluation framework proposed in this article. The article concludes with a discussion of the proposed quantitative evaluation, suggestions for building useful testbeds, and some proposed future improvements.
Serrano, Marcos and Nigay, Laurence (2009): Temporal aspects of CARE-based multimodal fusion: from a fusion mechanism to composition components and WoZ components. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 177-184. Available online
The CARE properties (Complementarity, Assignment, Redundancy and Equivalence) define various forms that multimodal input interaction can take. While Equivalence and Assignment express the availability and respective absence of choice between multiple input modalities for performing a given task, Complementarity and Redundancy describe relationships between modalities and require fusion mechanisms. In this paper we present a summary of the work we have carried out using the CARE properties for conceiving and implementing multimodal interaction, as well as a new approach using WoZ components. We present different technical solutions for implementing the Complementarity and Redundancy of modalities with a focus on the temporal aspects of the fusion. Starting from a monolithic fusion mechanism, we then explain our component-based approach and the composition components (i.e., Redundancy and Complementarity components). As a new contribution for exploring design solutions before implementing an adequate fusion mechanism as well as for tuning the temporal aspects of the performed fusion, we introduce Wizard of Oz (WoZ) fusion components. We illustrate the composition components as well as the implemented tools exploiting them using several multimodal systems including a multimodal slide viewer and a multimodal map navigator.
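The temporal side of Complementarity and Redundancy can be pictured with a small sketch: two events from different modalities are combined only if they fall within a time window. The `InputEvent` class, the `fuse` function, the one-second window and the example events are hypothetical and are not the authors' components; the abstract does not describe their actual fusion rules.

```python
from dataclasses import dataclass

@dataclass
class InputEvent:
    modality: str   # e.g. "speech" or "gesture"
    content: str    # what the event expresses
    time: float     # timestamp in seconds

def fuse(first, second, window=1.0):
    """Combine two events from different modalities if they are close in time."""
    if first.modality == second.modality:
        return None  # nothing to fuse within a single modality
    if abs(first.time - second.time) > window:
        return None  # too far apart: treat as independent inputs
    if first.content == second.content:
        return ("redundancy", first.content)  # same meaning twice: keep one copy
    return ("complementarity", (first.content, second.content))  # merge partial meanings

# Toy inputs for a slide viewer: a spoken command plus a pointing gesture.
speech = InputEvent("speech", "go to that slide", 10.0)
pointing = InputEvent("gesture", "slide 3", 10.4)
late_pointing = InputEvent("gesture", "slide 3", 13.0)

fused = fuse(speech, pointing)
missed = fuse(speech, late_pointing)
```

Widening or narrowing `window` is the kind of temporal tuning that the abstract proposes to explore with Wizard of Oz components before committing to a fusion mechanism.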
Ladry, Jean-François, Navarre, David and Palanque, Philippe A. (2009): Formal description techniques to support the design, construction and evaluation of fusion engines for sure (safe, usable, reliable and evolvable) multimodal interfaces. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 185-192. Available online
Representing the behaviour of multimodal interactive systems in a complete, concise and non-ambiguous way is still a challenge for formal description techniques (FDT). Depending on the FDT, multimodal interactive systems feature specific characteristics that are either cumbersome or impossible to capture with classical FDT. This is due to the multiple (potentially synergistic) use of modalities and the strong temporal constraints usually encountered in this kind of system, which have to be dealt with exhaustively if FDT are used. This paper focuses on the requirements for the modelling and construction of fusion engines for multimodal interfaces. It proposes a formal description technique dedicated to the engineering of interactive multimodal systems able to address the challenges of fusion engines. Such benefits are presented on a set of examples illustrating both the constructs and the process.
Sezgin, Tevfik Metin, Davies, Ian and Robinson, Peter (2009): Multimodal inference for driver-vehicle interaction. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 193-198. Available online
In this paper we present a novel system for driver-vehicle interaction which combines speech recognition with facial-expression recognition to increase intention recognition accuracy in the presence of engine- and road-noise. Our system would allow drivers to interact with in-car devices such as satellite navigation and other telematic or control systems. We describe a pilot study and experiment in which we tested the system, and show that multimodal fusion of speech and facial expression recognition provides higher accuracy than either would do alone.
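One common way to combine two recognisers is late fusion of their per-intent scores. The sketch below multiplies the probabilities and renormalises, which assumes the two sources are independent. The abstract does not say which fusion rule the authors used, so the function name, the intent labels and the numbers are illustrative assumptions only.

```python
def late_fusion(speech_probs, face_probs):
    """Multiply per-intent probabilities from both recognisers and renormalise."""
    joint = {intent: speech_probs[intent] * face_probs[intent] for intent in speech_probs}
    total = sum(joint.values())
    return {intent: p / total for intent, p in joint.items()}

# Toy scores: noisy speech is nearly undecided, the facial expression is clearer.
speech = {"accept": 0.55, "reject": 0.45}
face = {"accept": 0.20, "reject": 0.80}

fused = late_fusion(speech, face)
best = max(fused, key=fused.get)
```

In this toy case the speech recogniser alone would narrowly pick "accept", while the fused estimate favours "reject", which is the sort of correction under noise that the abstract attributes to multimodal fusion.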
Bader, Thomas, Vogelgesang, Matthias and Klaus, Edmund (2009): Multimodal integration of natural gaze behavior for intention recognition during object manipulation. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 199-206. Available online
Naturally, gaze is used for visual perception of our environment and gaze movements are mainly controlled subconsciously. Forcing the user to consciously diverge from that natural gaze behavior for interaction purposes causes high cognitive workload and destroys information contained in natural gaze movements. Instead of proposing a new gaze-based interaction technique, we analyze natural gaze behavior during an object manipulation task and show how it can be used for intention recognition, which provides a universal basis for integrating gaze into multimodal interfaces for different applications. We propose a model for multimodal integration of natural gaze behavior and evaluate it for two different use cases, namely for improvement of robustness of other potentially noisy input cues and for the design of proactive interaction techniques.
Piwek, Paul (2009): Salience in the generation of multimodal referring acts. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 207-210. Available online
Pointing combined with verbal referring is one of the most paradigmatic human multimodal behaviours. The aim of this paper is foundational: to uncover the central notions that are required for a computational model of multimodal referring acts that include a pointing gesture. The paper draws on existing work on the generation of referring expressions and shows that in order to extend that work with pointing, the notion of salience needs to play a pivotal role. The paper starts by investigating the role of salience in the generation of referring expressions and introduces a distinction between two opposing approaches: salience-first and salience-last accounts. The paper then argues that these differ not only in computational efficiency, as has been pointed out previously, but also lead to incompatible empirical predictions. The second half of the paper shows how a salience-first account nicely meshes with a range of existing empirical findings on multimodal reference. A novel account of the circumstances under which speakers choose to point is proposed that directly links salience with pointing. Finally, this account is placed within a multi-dimensional model of salience for multimodal reference.
Baldwin, Tyler, Chai, Joyce Y. and Kirchhoff, Katrin (2009): Communicative gestures in coreference identification in multiparty meetings. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 211-218. Available online
During multiparty meetings, participants can use non-verbal modalities such as hand gestures to make reference to the shared environment. Therefore, one hypothesis is that incorporating hand gestures can improve coreference identification, a task that automatically identifies what participants refer to with their linguistic expressions. To evaluate this hypothesis, this paper examines the role of hand gestures in coreference identification, in particular, focusing on two questions: (1) what signals can distinguish communicative gestures that can potentially help coreference identification from non-communicative gestures; and (2) in what ways can communicative gestures help coreference identification. Based on the AMI data, our empirical results have shown that the length of gesture production is highly indicative of whether a gesture is communicative and potentially helpful in language understanding. Our experiments on the automated identification of coreferring expressions indicate that while the incorporation of simple gesture features does not improve overall performance, it does show potential on expressions referring to participants, an important and unique component of the meeting domain. A further analysis suggests that communicative gestures provide both redundant and complementary information, but further domain modeling and world knowledge incorporation is required to take full advantage of information that is complementary.
Otsuka, Kazuhiro, Araki, Shoko, Mikami, Dan, Ishizuka, Kentaro, Fujimoto, Masakiyo and Yamato, Junji (2009): Realtime meeting analysis and 3D meeting viewer based on omnidirectional multimodal sensors. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 219-220. Available online
This demo presents a realtime system for analyzing group meetings. Targeting round-table meetings, this system employs an omnidirectional camera-microphone system. The goal of this system is to automatically discover "who is talking to whom and when". To that purpose, the face pose and position of meeting participants are tracked on panorama images acquired from fisheye-based omnidirectional cameras. From audio signals obtained with a microphone array, speaker diarization, i.e. the estimation of "who is speaking and when", is carried out. The visual focus of attention, i.e. "who is looking at whom", is estimated from the result of face tracking. The results are displayed based on a 3D visualization scheme. The advantage of our system is its realtimeness. We will demonstrate the portable version of the system consisting of two laptop PCs. In addition, we will showcase our meeting playback viewer with man-machine interfaces that allow users to freely control space and time of meeting scenes. With this viewer, users can also experience a 3D positional sound effect linked with the 3D viewpoint, using enhanced audio tracks for each participant.
Vishnoi, Nalini, Narber, Cody, Duric, Zoran and Gerber, Naomi Lynn (2009): Guiding hand: a teaching tool for handwriting. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 221-222. Available online
The goal of our demonstration is to illustrate how a haptic force-feedback device can be used to assist people with disabilities in learning fine motor tasks, such as writing. We will be demonstrating this idea by the simulation of several letters and symbols. We use electromagnetic sensors (MotionStar Wireless2) to capture unencumbered movements performed by a 'normal' individual. The captured movement is translated to the haptic coordinate system with the use of a table-top centered frame as an intermediate frame. The translated movement is then fed into our haptic system, which varies the exerted force as a function of trainee performance. Our demonstration will use the Phantom Omni for the simulation of these writing tasks, and it will also provide visual feedback of the desired and user trajectories.
Popescu-Belis, Andrei, Poller, Peter and Kilgour, Jonathan (2009): A multimedia retrieval system using speech input. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 223-224. Available online
The AMIDA Automatic Content Linking Device (ACLD) monitors a conversation using automatic speech recognition (ASR), and uses the detected words to retrieve documents that are of potential use to the participants in the conversation. The document set that is available includes project related documents such as reports, memos or emails, as well as snippets of past meetings that were transcribed using offline ASR. In addition, results of Web searches are also displayed. Several visualisation interfaces are available.
Erp, Jan B. F. van, Werkhoven, Peter J., Thurlings, Marieke E. and Brouwer, Anne-Marie M. (2009): Navigation with a passive brain based interface. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 225-226. Available online
In this paper, we describe a Brain Computer Interface (BCI) for navigation. The system is based on detecting brain signals that are elicited by tactile stimulation on the torso indicating the desired direction.
Alabau, Vicent, Ortiz, Daniel, Romero, Verónica and Ocampo, Jorge (2009): A multimodal predictive-interactive application for computer assisted transcription and translation. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 227-228. Available online
Traditionally, Natural Language Processing (NLP) technologies have mainly focused on full automation. However, full automation often proves unnatural in many applications, where technology is expected to assist rather than replace the human agents. In consequence, Multimodal Interactive (MI) technologies have emerged. On the one hand, the user interactively co-operates with the system to improve system accuracy. On the other hand, multimodality improves system ergonomics. In this paper, we present an application that implements such MI technologies. First, we have designed an Application Programming Interface (API), featuring a client-server framework, to deal with most common NLP MI tasks. Second, we have developed a generic client application. The resulting client-server architecture has been successfully tested with two different NLP problems: transcription of text images and translation of texts.
Finomore, Victor S., Popik, Dianne K., Brungart, Douglas S. and Simpson, Brian D. (2009): Multi-modal communication system. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 229-230. Available online
The Multi-Modal Communication (MMC) tool was designed to alleviate the workload and errors associated with intensive radio communication environments. MMC captures, records, and displays the radio communication to the operator so that they have instant access to all current and past information. This eliminates the perishable nature of radio communication and allows the operators to focus on the task instead of remembering and writing down information. The MMC tool also employs virtual audio display technology to spatialize the multiple audio signals to aid in the intelligibility of the radio communication. The combination of these technologies has led to the design of a communication interface that will improve the performance of operators confronted with monitoring a high volume of radio communication.
Petridis, Stavros, Gunes, Hatice, Kaltwang, Sebastian and Pantic, Maja (2009): Static vs. dynamic modeling of human nonverbal behavior from multiple cues and modalities. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 23-30. Available online
Human nonverbal behavior recognition from multiple cues and modalities has attracted a lot of interest in recent years. Despite the interest, many research questions remain open, including the type of feature representation, the choice of static vs. dynamic classification schemes, the number and type of cues or modalities to use, and the optimal way of fusing these. This paper compares frame-based vs. window-based feature representation and employs static vs. dynamic classification schemes for two distinct problems in the field of automatic human nonverbal behavior analysis: multicue discrimination between posed and spontaneous smiles from facial expressions, head and shoulder movements, and audio-visual discrimination between laughter and speech. Single cue and single modality results are compared to multicue and multimodal results by employing Neural Networks, Hidden Markov Models (HMMs), and 2- and 3-chain coupled HMMs. Subject independent experimental evaluation shows that: 1) both for static and dynamic classification, fusing data coming from multiple cues and modalities proves useful to the overall task of recognition, 2) the type of feature representation appears to have a direct impact on the classification performance, and 3) static classification is comparable to dynamic classification both for multicue discrimination between posed and spontaneous smiles, and audio-visual discrimination between laughter and speech.
Dumas, Bruno, Lalanne, Denis and Ingold, Rolf (2009): HephaisTK: a toolkit for rapid prototyping of multimodal interfaces. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 231-232. Available online
This article introduces HephaisTK, a toolkit for rapid prototyping of multimodal interfaces. After briefly discussing the state of the art, the architecture traits of the toolkit are displayed, along with the major features of HephaisTK: agent-based architecture, ability to easily plug in new input recognizers, fusion engine and configuration by means of a SMUIML XML file. Finally, applications created with the HephaisTK toolkit are discussed.
Llorens, David, Marzal, Andrés, Prat, Federico and Vilar, Juan Miguel (2009): State: an assisted document transcription system. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 233-234. Available online
State is an interactive system for ancient and handwritten document transcription with several input modalities for entering and correcting text. It has a flexible architecture that allows easy connection to different OCR systems.
Monceaux, Jérôme, Becker, Joffrey, Boudier, Céline and Mazel, Alexandre (2009): Demonstration: first steps in emotional expression of the humanoid robot Nao. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 235-236. Available online
We created a library of emotional expressions, and not an emotional system, for the humanoid robot Nao from Aldebaran Robotics. This set of expressions could be used by robot behavior designers to create advanced behaviors, or by an emotion simulator. It is an insight into a conjoint work between an invited anthropologist and robotics researchers which resulted in about a hundred animations. We do not provide a review of the literature.
Mugellini, Elena, Sokhn, Maria, Carrino, Stefano and Khaled, Omar Abou (2009): WiiNote: multimodal application facilitating multi-user photo annotation activity. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 237-238. Available online
In this paper, we describe a multimodal application, called WiiNote, facilitating multi-user photo annotation activity. The application allows up to 4 users to simultaneously annotate their pictures by adding either textual or vocal comments. Users use the Wii Remote device to select the whole picture or a specific region of it to be annotated. Annotations can be either free or structured, i.e. based on a domain-specific data model expressed using the MPEG-7 standard or the RDF language for ontologies.
Kaplan, Frédéric (2009): Are gesture-based interfaces the future of human computer interaction? In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 239-240. Available online
The historical evolution of human machine interfaces shows a continuous tendency towards more physical interactions with computers. Nevertheless, the mouse and keyboard paradigm is still the dominant one and it is not yet clear whether there is among recent innovative interaction techniques any real challenger to this supremacy. To discuss the future of gesture-based interfaces, I shall build on my own experience in conceiving and launching QB1, probably the first computer delivered with no mouse or keyboard but equipped with a depth-perceiving camera enabling interaction with gestures. The ambition of this talk is to define more precisely how gestures change the way we can interact with computers, discuss how to design robust interfaces adapted to this new medium and review what kind of applications benefit the most from this type of interaction. Through a series of examples, we will see that it is important to consider gestures not as a way of emulating a mouse pointer at a distance or as elements of a "vocabulary" of commands, but as a new interaction paradigm where the interface components are organized in the user's physical space. This is a shift of reference frame, from a metaphorical virtual space (e.g. the desktop) where the user controls a representation of himself (e.g. the mouse pointer) to a truly user-centered augmented reality interface where the user directly touches and manipulates interface components positioned around his body. To achieve this kind of interactivity, depth-perceiving cameras can be relevantly associated with robotic techniques and machine vision algorithms to create a "halo" of interactivity that can literally follow the user while he moves in a room. In return, this new kind of intimacy with a computer interface paves the way for innovative machine learning approaches to context understanding. A computer like QB1 knows more about its user than any other personal computer so far.
Gesture-based interaction is not a means of replacing the mouse with cooler or more intuitive ways of interacting but leads to a fundamentally different approach to the design of human-computer interfaces.
Li, Zheng, Mao, Xia and Liu, Lei (2009): Providing expressive eye movement to virtual agents. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 241-244. Available online
Non-verbal behavior, particularly eye movement, plays a fundamental role in nonverbal communication among people. In order to realize natural and intuitive human-agent interaction, the virtual agents need to employ this communicative channel effectively. Against this background, our research addresses the problem of emotionally expressive eye movement manner by describing a preliminary approach based on the parameters picked from real-time eye movement data (pupil size, blink rate and saccade).
Dierker, Angelika, Mertes, Christian, Hermann, Thomas, Hanheide, Marc and Sagerer, Gerhard (2009): Mediated attention with multimodal augmented reality. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 245-252. Available online
We present an Augmented Reality (AR) system to support collaborative tasks in a shared real-world interaction space by facilitating joint attention. The users are assisted by information about their interaction partner's field of view both visually and acoustically. In our study, the audiovisual improvements are compared with an AR system without these support mechanisms in terms of the participants' reaction times and error rates. The participants performed a simple object-choice task we call the "gaze game" to ensure controlled experimental conditions. Additionally, we asked the subjects to fill in a questionnaire to gain subjective feedback from them. We were able to show an improvement for both dependent variables as well as positive feedback for the visual augmentation in the questionnaire.
Tellex, Stefanie and Roy, Deb (2009): Grounding spatial prepositions for video search. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 253-260. Available online
Spatial language video retrieval is an important real-world problem that forms a test bed for evaluating semantic structures for natural language descriptions of motion on naturalistic data. Video search by natural language query requires that linguistic input be converted into structures that operate on video in order to find clips that match a query. This paper describes a framework for grounding the meaning of spatial prepositions in video. We present a library of features that can be used to automatically classify a video clip based on whether it matches a natural language query. To evaluate these features, we collected a corpus of natural language descriptions about the motion of people in video clips. We characterize the language used in the corpus, and use it to train and test models for the meanings of the spatial prepositions "to," "across," "through," "out," "along," "towards," and "around." The classifiers can be used to build a spatial language video retrieval system that finds clips matching queries such as "across the kitchen."
Schauerte, Boris, Richarz, Jan, Plötz, Thomas, Thurau, Christian and Fink, Gernot A. (2009): Multi-modal and multi-camera attention in smart environments. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 261-268. Available online
This paper considers the problem of multi-modal saliency and attention. Saliency is a cue that is often used for directing attention of a computer vision system, e.g., in smart environments or for robots. Unlike the majority of recent publications on visual/audio saliency, we aim at a well grounded integration of several modalities. The proposed framework is based on fuzzy aggregations and offers a flexible, plausible, and efficient way for combining multi-modal saliency information. Besides incorporating different modalities, we extend classical 2D saliency maps to multi-camera and multi-modal 3D saliency spaces. For experimental validation we realized the proposed system within a smart environment. The evaluation took place for a demanding setup under real-life conditions, including focus of attention selection for multiple subjects and concurrently active modalities.
Ajaj, Rami, Jacquemin, Christian and Vernier, Frederic (2009): RVDT: a design space for multiple input devices, multiple views and multiple display surfaces combination. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 269-276. Available online
We study interaction combination performed using a tabletop device, a mouse, and/or a six Degrees Of Freedom (DOF) input device in a system combining a 2D flat (map-kind) view presented horizontally and a 3D perspective vertical view of the same virtual environment. The design of such a 2D/3D interface relies on the RVDT model and its design space, which allow easy high-level combined interactions to achieve spatial tasks. RVDT integrates the relations between physical and numerical DOFs and applies to any graphical user interface in which multiple views, multiple display surfaces and multiple input devices are combined. The user study shows that experienced users prefer the tabletop/6DOF input device interaction combination with a maximal number of elementary tasks performed with both devices.
Farrahi, Katayoun and Gatica-Perez, Daniel (2009): Learning and predicting multimodal daily life patterns from cell phones. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 277-280. Available online
In this paper, we investigate the multimodal nature of cell phone data in terms of discovering recurrent and rich patterns in people's lives. We present a method that can discover routines from multiple modalities (location and proximity) jointly modeled, and that uses these informative routines to predict unlabeled or missing data. Using a joint representation of location and proximity data over approximately 10 months of 97 individuals' lives, Latent Dirichlet Allocation is applied for the unsupervised learning of topics describing people's most common locations jointly with the most common types of interactions at these locations. We further successfully predict where and with how many other individuals users will be, for people with both highly and lowly varying lifestyles.
Iben, Hendrik, Baumann, Hannes, Ruthenbeck, Carmen and Klug, Tobias (2009): Visual based picking supported by context awareness: comparing picking performance using paper-based lists versus lists presented on a head mounted display with contextual support. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 281-288. Available online
Warehouse picking is a traditional part of assembly and inventory control, and several commercial wearable computers address this market. However, head mounted displays (HMDs) are not yet used in these companies' products. We present a 16-person user study that compares the efficiency and perceived workload of paper picking lists versus an HMD system aided by contextual cueing. With practice, users of the HMD system made significantly faster picks and made fewer mistakes related to missing or additional picked items overall.
Serrano, Nicolás, Pérez, Daniel, Sanchis, Albert and Juan, Alfons (2009): Adaptation from partially supervised handwritten text transcriptions. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 289-292. Available online
An effective approach to transcribing handwritten text documents is to follow an interactive-predictive paradigm in which the system is guided by the user and the user is assisted by the system to complete the transcription task as efficiently as possible. This approach has been recently implemented in a system prototype called GIDOC, in which standard speech technology is adapted to handwritten text (line) images: HMM-based text image modelling, n-gram language modelling, and also confidence measures on recognized words. Confidence measures are used to assist the user in locating possible transcription errors, and thus validate system output after only supervising those (few) words for which the system is not highly confident. Here, we study the effect of using these partially supervised transcriptions on the adaptation of image and language models to the task.
Demirdjian, David and Varri, Chenna (2009): Recognizing events with temporal random forests. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 293-296. Available online
In this paper, we present a novel technique for classifying multimodal temporal events. Our main contribution is the introduction of temporal random forests (TRFs), an extension of random forests (and decision trees in general) to the time domain. The approach is relatively simple and able to discriminatively learn event classes while performing feature selection in an implicit fashion. We describe here our ongoing research and present experiments performed on gesture and audio-visual speech recognition datasets comparing our method against state-of-the-art algorithms.
Sriram, Janani C., Shin, Minho, Choudhury, Tanzeem and Kotz, David (2009): Activity-aware ECG-based patient authentication for remote health monitoring. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 297-304. Available online
Mobile medical sensors promise to provide an efficient, accurate, and economic way to monitor patients' health outside the hospital. Patient authentication is a necessary security requirement in remote health monitoring scenarios. The monitoring system needs to make sure that the data is coming from the right person before any medical or financial decisions are made based on the data. Credential-based authentication methods (e.g., passwords, certificates) are not well-suited for remote healthcare as patients could hand over credentials to someone else. Furthermore, one-time authentication using credentials or trait-based biometrics (e.g., face, fingerprints, iris) does not cover the entire monitoring period and may lead to unauthorized post-authentication use. Recent studies have shown that the human electrocardiogram (ECG) exhibits unique patterns that can be used to discriminate individuals. However, perturbation of the ECG signal due to physical activity is a major obstacle in applying the technology in real-world situations. In this paper, we present a novel ECG and accelerometer-based system that can authenticate individuals in an ongoing manner under various activity conditions. We describe the probabilistic authentication system we have developed and present experimental results from 17 individuals.
Jayagopi, Dinesh Babu and Gatica-Perez, Daniel (2009): Discovering group nonverbal conversational patterns with topics. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 3-6. Available online
This paper addresses the problem of discovering conversational group dynamics from nonverbal cues extracted from thin-slices of interaction. We first propose and analyze a novel thin-slice interaction descriptor -- a bag of group nonverbal patterns -- which robustly captures the turn-taking behavior of the members of a group while integrating its leader's position. We then rely on probabilistic topic modeling of the interaction descriptors which, in a fully unsupervised way, is able to discover group interaction patterns that resemble prototypical leadership styles proposed in social psychology. Our method, validated on the Augmented Multi-Party Interaction (AMI) meeting corpus, facilitates the retrieval of group conversational segments where semantically meaningful group behaviours emerge, without the need of any previous labeling.
Kozma, László, Klami, Arto and Kaski, Samuel (2009): GaZIR: gaze-based zooming interface for image retrieval. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 305-312. Available online
We introduce GaZIR, a gaze-based interface for browsing and searching for images. The system computes on-line predictions of relevance of images based on implicit feedback, and when the user zooms in, the images predicted to be the most relevant are brought out. The key novelty is that the relevance feedback is inferred from implicit cues obtained in real-time from the gaze pattern, using an estimator learned during a separate training phase. The natural zooming interface can be connected to any content-based information retrieval engine operating on user feedback. We show with experiments on one engine that there is a sufficient amount of information in the gaze patterns to make the estimated relevance feedback a viable choice to complement or even replace explicit feedback by pointing-and-clicking.
Bohus, Dan and Horvitz, Eric (2009): Dialog in the open world: platform and applications. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 31-38. Available online
We review key challenges of developing spoken dialog systems that can engage in interactions with one or multiple participants in relatively unconstrained environments. We outline a set of core competencies for open-world dialog, and describe three prototype systems. The systems are built on a common underlying conversational framework which integrates an array of predictive models and component technologies, including speech recognition, head and pose tracking, probabilistic models for scene analysis, multiparty engagement and turn taking, and inferences about user goals and activities. We discuss the current models and showcase their function by means of a sample recorded interaction, and we review results from an observational study of open-world, multiparty dialog in the wild.
Dey, Prasenjit, Sitaram, Ramchandrula, Ajmera, Rahul and Bali, Kalika (2009): Voice key board: multimodal Indic text input. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 313-318. Available online
Multimodal systems, incorporating more natural input modalities like speech, hand gesture, facial expression etc., can make human-computer-interaction more intuitive by drawing inspiration from spontaneous human-human-interaction. We present here a multimodal input device for Indic scripts called the Voice Key Board (VKB) which offers a simpler and more intuitive method for input of Indic scripts. VKB exploits the syllabic nature of Indic language scripts and exploits the user's mental model of Indic scripts wherein a base consonant character is modified by different vowel ligatures to represent the actual syllabic character. We also present a user evaluation result for VKB comparing it with the most common input method for the Devanagari script, the InScript keyboard. The results indicate a strong user preference for VKB in terms of input speed and learnability. Though VKB starts with a higher user error rate compared to InScript, the error rate drops by 55% by the end of the experiment, and the input speed of VKB is found to be 81% higher than InScript. Our user study results point to interesting research directions for the use of multiple natural modalities for Indic text input.
Raisamo, Jukka, Raisamo, Roope and Surakka, Veikko (2009): Evaluating the effect of temporal parameters for vibrotactile saltatory patterns. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 319-326. Available online
Cutaneous saltation provides interesting possibilities for applications. An illusion of vibrotactile mediolateral movement was elicited on the left dorsal forearm to investigate emotional (i.e., pleasantness) and cognitive (i.e., continuity) experiences of vibrotactile stimulation. Twelve participants were presented with nine saltatory stimuli delivered to a linearly aligned row of three vibrotactile actuators separated by 70 mm in distance. The stimuli were composed of three temporal parameters of 12, 24 and 48 ms for both burst duration and inter-burst interval to form all nine possible uniform pairs. First, the stimuli were ranked by the participants using a special three-step procedure. Second, the participants rated the stimuli using two nine-point bipolar scales measuring the pleasantness and continuity of each stimulus, separately. The results showed that the interval between two successive bursts, in particular, was a significant factor for saltation. Moreover, the temporal parameters seemed to affect the experienced continuity of the stimuli more than their pleasantness. These findings encourage us to further study saltation and the effect of different parameters on subjective experience.
Hoggan, Eve, Raisamo, Roope and Brewster, Stephen A. (2009): Mapping information to audio and tactile icons. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 327-334. Available online
We report the results of a study focusing on the meanings that can be conveyed by audio and tactile icons. Our research considers the following question: how can audio and tactile icons be designed to optimise congruence between crossmodal feedback and the type of information this feedback is intended to convey? For example, if we have a set of system warnings, confirmations, progress updates and errors: what audio and tactile representations best match the information or type of message? Is one modality more appropriate at presenting certain types of information than the other modality? The results of this study indicate that certain parameters of the audio and tactile modalities such as rhythm, texture and tempo play an important role in the creation of congruent sets of feedback when given a specific type of information to transmit. We argue that a combination of audio or tactile parameters derived from our results allows the same type of information to be derived through touch and sound with an intuitive match to the content of the message.
Ahmaniemi, Teemu Tuomas and Lantz, Vuokko Tuulikki (2009): Augmented reality target finding based on tactile cues. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 335-342. Available online
This study is based on a user scenario where augmented reality targets could be found by scanning the environment with a mobile device and getting tactile feedback exactly in the direction of the target. In order to understand how accurately and quickly the targets can be found, we prepared an experiment setup where a sensor-actuator device consisting of orientation tracking hardware and a tactile actuator was used. The targets with widths 5°, 10°, 15°, 20°, and 25° and various distances between each other were rendered in a 90°-wide space successively, and the task of the test participants was to find them as quickly as possible. The experiment consisted of two conditions: the first one provided tactile feedback only when pointing was on the target and the second one also included another cue indicating the proximity of the target. The average target finding time was 1.8 seconds. The closest targets appeared not to be the easiest to find, which was attributed to the adapted scanning velocity causing participants to miss the closest targets. We also found that our data did not correlate well with Fitts' model, which may have been caused by the non-normal data distribution. After filtering out 30% of the least representative data items, the correlation reached up to 0.71. Overall, the performance between conditions did not differ from each other significantly. The only significant improvement in the performance offered by the close-to-target cue occurred in the tasks where the targets were the furthest from each other.
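For reference, Fitts' model, which the abstract above mentions, is commonly written in its Shannon formulation as shown below. The abstract does not say which variant the authors fitted, so the formulation and the reading of the symbols for this task (D as the angular distance to the target, W as the angular target width) are assumptions made here for illustration only.

```latex
% Fitts' law, Shannon formulation (assumed variant; not stated in the abstract)
% MT : time needed to reach the target
% a, b : constants fitted empirically to the measured data
% D : distance to the target (here assumed to be angular, in degrees)
% W : target width (here assumed to be angular, in degrees)
MT = a + b \log_2\!\left(\frac{D}{W} + 1\right)
```

Under this reading, the logarithmic term is the index of difficulty, and a correlation such as the reported 0.71 would typically be computed between measured finding times and that index.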
In this paper we investigate a set of privacy-sensitive audio features for speaker change detection (SCD) in multiparty conversations. These features are based on three different principles: characterizing the excitation source information using linear prediction residual, characterizing subband spectral information shown to contain speaker information, and characterizing the general shape of the spectrum. Experiments show that the performance of the privacy-sensitive features is comparable to or better than that of the state-of-the-art full-band spectral-based features, namely, mel frequency cepstral coefficients, which suggests that socially acceptable ways of recording conversations in real life are feasible.
Verdie, Yannick, Fang, Bing and Quek, Francis (2009): MirrorTrack: tracking with reflection -- comparison with top-down approach. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 347-350. Available online
Tabletop hand tracking techniques have evolved much during the last few years from single to multiple cameras, offering users an improved interactive experience. MirrorTrack is one such technique. This paper compares the accuracy of MirrorTrack with that of the top-down approach, which is generally used for tabletop tasks. In this paper, we focus on the comparison of distance errors in finger trajectory, and clicking errors by manual monitoring.
Kelly, Daniel, Delannoy, Jane Reilly, McDonald, John and Markham, Charles (2009): A framework for continuous multimodal sign language recognition. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 351-358. Available online
We present a multimodal system for the recognition of manual signs and non-manual signals within continuous sign language sentences. In sign language, information is mainly conveyed through hand gestures (Manual Signs). Non-manual signals, such as facial expressions, head movements, body postures and torso movements, are used to express a large part of the grammar and some aspects of the syntax of sign language. In this paper we propose a multichannel HMM-based system to recognize manual signs and non-manual signals. We choose a single non-manual signal, head movement, to evaluate our framework when recognizing non-manual signals. Manual signs and non-manual signals are processed independently using continuous multidimensional HMMs and an HMM threshold model. Experiments conducted demonstrate that our system achieved a detection ratio of 0.95 and a reliability measure of 0.93.
Kannetis, Theofanis and Potamianos, Alexandros (2009): Towards adapting fantasy, curiosity and challenge in multimodal dialogue systems for preschoolers. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 39-46. Available online
We investigate how fantasy, curiosity and challenge contribute to the user experience in multimodal dialogue computer games for preschool children. For this purpose, an on-line multimodal platform has been designed, implemented and used as a starting point to develop web-based speech-enabled applications for children. Five task-oriented games suitable for preschoolers have been implemented with varying levels of fantasy and curiosity elements, as well as variable difficulty levels. Nine preschool children, ages 4-6, were asked to play these games in three sessions; in each session only one of the fantasy, curiosity or challenge factors was evaluated. Both objective and subjective criteria were used to evaluate the factors and applications. Results show that fantasy and curiosity are correlated with children's entertainment, while the level of difficulty seems to depend on each child's individual preferences and capabilities. In addition, high speech usage and high curiosity levels in the application correlate well with task completion, showing that preschoolers become more engaged when multimodal interfaces are speech enabled and contain curiosity elements.
Johnston, Michael (2009): Building multimodal applications with EMMA. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 47-54. Available online
Multimodal interfaces combining natural modalities such as speech and touch with dynamic graphical user interfaces can make it easier and more effective for users to interact with applications and services on mobile devices. However, building these interfaces remains a complex and highly specialized task. The W3C EMMA standard provides a representation language for inputs to multimodal systems facilitating plug-and-play of system components and rapid prototyping of interactive multimodal systems. We illustrate the capabilities of the EMMA standard through examination of its use in a series of mobile multimodal applications for the iPhone.
Ishizuka, Kentaro, Araki, Shoko, Otsuka, Kazuhiro, Nakatani, Tomohiro and Fujimoto, Masakiyo (2009): A speaker diarization method based on the probabilistic fusion of audio-visual location information. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 55-62. Available online
This paper proposes a speaker diarization method for determining "who spoke when" in multi-party conversations, based on the probabilistic fusion of audio and visual location information. The audio and visual information is obtained from a compact system designed to analyze round table multi-party conversations. The system consists of two cameras and a triangular microphone array with three microphones, and can cover a spherical region. Speaker locations are estimated from audio and visual observations in terms of azimuths from this recording system. Unlike conventional speech diarization methods, our proposed method estimates the probability of the presence of multiple simultaneous speakers in a physical space with a small microphone setup instead of using a cascade consisting of speech activity detection, direction of arrival estimation, acoustic feature extraction, and information criteria based speaker segmentation. To estimate the speaker presence more correctly, the speech presence probabilities in a physical space are integrated with the probabilities estimated from participants' face locations obtained with a robust particle filtering based face tracker with two cameras equipped with fisheye lenses. The locations in a physical space with highly integrated probabilities are then classified into a certain number of speaker classes by using on-line classification to realize speaker diarization. The probability calculations and speaker classifications are conducted on-line, making it unnecessary to observe all the conversation data. An experiment using real casual conversations, which include more overlaps and short speech segments than formal meetings, showed the advantages of the proposed method.
Schermerhorn, Paul and Scheutz, Matthias (2009): Dynamic robot autonomy: investigating the effects of robot decision-making in a human-robot team task. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 63-70. Available online
Robot autonomy is of high relevance for HRI, in particular for interactions of humans and robots in mixed human-robot teams. In this paper, we investigate empirically the extent to which autonomy based on independent decision making and acting by the robot can affect the objective task performance of a mixed human-robot team while being subjectively acceptable to humans. The results demonstrate that humans not only accept robot autonomy in the interest of the team, but also view the robot more as a team member and find it easier to interact with, despite a very minimalist graphical/speech interface. Moreover, we find evidence that dynamic autonomy reduces human cognitive load.
Germesin, Sebastian and Wilson, Theresa (2009): Agreement detection in multiparty conversation. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 7-14. Available online
This paper presents a system for the automatic detection of agreements in multi-party conversations. We investigate various types of features that are useful for identifying agreements, including lexical, prosodic, and structural features. This system is implemented using supervised machine learning techniques and yields competitive results: Accuracy of 98.1% and a kappa value of 0.4. We also begin to explore the novel task of detecting the addressee of agreements (which speaker is being agreed with). Our system for this task
Di Fabbrizio, Giuseppe, Okken, Thomas and Wilpon, Jay G. (2009): A speech mashup framework for multimodal mobile services. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 71-78. Available online
Amid today's proliferation of Web content and mobile phones with broadband data access, interacting with small-form factor devices is still cumbersome. Spoken interaction could overcome the input limitations of mobile devices, but running an automatic speech recognizer with the limited computational capabilities of a mobile device becomes an impossible challenge when large vocabularies for speech recognition must often be updated with dynamic content. One popular option is to move the speech processing resources into the network by concentrating the heavy computation load onto server farms. Although successful services have exploited this approach, it is unclear how such a model can be generalized to a large range of mobile applications and how to scale it for large deployments. To address these challenges we introduce the AT&T speech mashup architecture, a novel approach to speech services that leverages web services and cloud computing to make it easier to combine web content and speech processing. We show that this new compositional method is suitable for integrating automatic speech recognition and text-to-speech synthesis resources into real multimodal mobile services. The generality of this method allows researchers and speech practitioners to explore a countless variety of mobile multimodal services with a finer grain of control and richer multimedia interfaces. Moreover, we demonstrate that the speech mashup is scalable and particularly optimized to minimize round trips in the mobile network, reducing latency for better user experience.
Cheamanunkul, Sunsern, Ettinger, Evan, Jacobsen, Matt, Lai, Patrick and Freund, Yoav (2009): Detecting, tracking and interacting with people in a public space. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 79-86. Available online
We have built a system that engages naive users in an audio-visual interaction with a computer in an unconstrained public space. We combine audio source localization techniques with face detection algorithms to detect and track the user throughout a large lobby. The sensors we use are an ad-hoc microphone array and a PTZ camera. To engage the user, the PTZ camera turns and points at sounds made by people passing by. From this simple pointing of a camera, the user is made aware that the system has acknowledged their presence. To further engage the user, we develop a face classification method that identifies and then greets previously seen users. The user can interact with the system through a simple hot-spot based gesture interface. To make the user interactions with the system feel natural, we utilize reconfigurable hardware, achieving a visual response time of less than 100ms. We rely heavily on machine learning methods to make our system self-calibrating and adaptive.
Cooke, Neil J. and Russell, Martin J. (2009): Cache-based language model adaptation using visual attention for ASR in meeting scenarios. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 87-90. Available online
In a typical group meeting involving discussion and collaboration, people look at one another, at shared information resources such as presentation material, and also at nothing in particular. In this work we investigate whether the knowledge of what a person is looking at may improve the performance of Automatic Speech Recognition (ASR). A framework for cache Language Model (LM) adaptation is proposed with the cache based on a person's Visual Attention (VA) sequence. The framework attempts to measure the appropriateness of adaptation from VA sequence characteristics. Evaluation on the AMI Meeting corpus data shows reduced LM perplexity. This work demonstrates the potential for cache-based LM adaptation using VA information in large vocabulary ASR deployed in meeting scenarios.
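The general idea of cache-based LM adaptation can be sketched as a linear interpolation between a background model and a small cache model built from words tied to what the speaker is looking at. This is only a minimal illustration and not the authors' framework: the unigram models, the toy word lists, the interpolation weight `lam`, and all function names are assumptions made for this example.

```python
import math
from collections import Counter

def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def interpolate(base, cache, lam):
    """Linear interpolation: P(w) = (1 - lam) * P_base(w) + lam * P_cache(w)."""
    return {w: (1.0 - lam) * base[w] + lam * cache[w] for w in base}

def perplexity(model, tokens):
    """Perplexity of a unigram model on a token sequence."""
    log_sum = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_sum / len(tokens))

# Hypothetical toy data (not taken from the AMI corpus).
background = "we need to discuss the plan for the next meeting and the agenda".split()
# Assumption: words on the slide the speaker is currently looking at fill the cache.
slide_text = "remote control battery design budget".split()
utterance = "the battery design for the remote control".split()

vocab = set(background) | set(slide_text) | set(utterance)
base = unigram_model(background, vocab)
cache = unigram_model(slide_text, vocab)
adapted = interpolate(base, cache, lam=0.3)  # lam is an arbitrary illustrative weight

ppl_base = perplexity(base, utterance)
ppl_adapted = perplexity(adapted, utterance)
print(f"baseline perplexity: {ppl_base:.2f}")
print(f"adapted perplexity:  {ppl_adapted:.2f}")
```

On this toy data the adapted model assigns more probability to the words that appear on the attended slide, so its perplexity on the utterance is lower than the baseline's. A fixed `lam` is the simplest choice; the abstract suggests the weight of adaptation would instead depend on characteristics of the VA sequence.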
Kok, Iwan de and Heylen, Dirk (2009): Multimodal end-of-turn prediction in multi-party meetings. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 91-98. Available online
One of many skills required to engage properly in a conversation is to know the appropriate use of the rules of engagement. In order to engage properly in a conversation, a virtual human or robot should, for instance, be able to know when it is being addressed or when the speaker is about to hand over the turn. The paper presents a multimodal approach to end-of-speaker-turn prediction using sequential probabilistic models (Conditional Random Fields) to learn a model from observations of real-life multi-party meetings. Although the results are not as good as expected, we provide insight into which modalities are important when taking a multimodal approach to the problem based on literature and our own results.
Kumano, Shiro, Otsuka, Kazuhiro, Mikami, Dan and Yamato, Junji (2009): Recognizing communicative facial expressions for discovering interpersonal emotions in group meetings. In: Proceedings of the 2009 International Conference on Multimodal Interfaces 2009. pp. 99-106. Available online
This paper proposes a novel facial expression recognizer and describes its application to group meeting analysis. Our goal is to automatically discover the interpersonal emotions that evolve over time in meetings, e.g. how each person feels about the others, or who affectively influences the others the most. As the emotion cue, we focus on facial expression, more specifically smiles, and aim to recognize "who is smiling at whom, when, and how often", since frequently smiling carries affective messages that are strongly directed to the person being looked at; this point of view is our novelty. To detect such communicative smiles, we propose a new algorithm that jointly estimates facial pose and expression in the framework of the particle filter. The main feature is its automatic selection of interest points that can robustly capture small changes in expression even in the presence of large head rotations. Based on the recognized facial expressions and their directions to others, which are indicated by the estimated head poses, we visualize interpersonal smile events as a graph structure that we call the interpersonal emotional network; it is intended to indicate the emotional relationships among meeting participants. A four-person meeting captured by an omnidirectional video system is used to confirm the effectiveness of the proposed method and the potential of our approach for deep understanding of human relationships developed through communications.
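The graph structure described above can be pictured as a directed, weighted graph whose edges count "who smiled at whom". The sketch below is only an illustration under stated assumptions: the event format `(time, smiler, target)`, the participant labels, and both function names are hypothetical, and the recognition step (particle filter, head pose estimation) is assumed to have already produced the events.

```python
from collections import Counter

def build_smile_network(events):
    """Count directed smile events: (smiler, target) -> number of events."""
    edges = Counter()
    for _time, smiler, target in events:
        if smiler != target:  # ignore malformed self-directed events
            edges[(smiler, target)] += 1
    return edges

def most_smiled_at(edges):
    """Return the participant who receives the most smiles (weighted in-degree)."""
    received = Counter()
    for (_smiler, target), n in edges.items():
        received[target] += n
    return received.most_common(1)[0][0]

# Hypothetical smile events for a four-person meeting: (seconds, smiler, target).
events = [
    (1.0, "A", "B"),
    (2.5, "C", "B"),
    (4.0, "B", "A"),
    (6.2, "A", "B"),
    (7.1, "D", "C"),
]

network = build_smile_network(events)
for (smiler, target), n in sorted(network.items()):
    print(f"{smiler} -> {target}: {n}")
print("receives the most smiles:", most_smiled_at(network))
```

Counting received smiles is just one simple summary of such a graph; it is used here as a stand-in and is not claimed to be the measure of affective influence used in the paper.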