Proceedings of the 2007 International Conference on Multimodal Interfaces
The International Conference on Multimodal Interfaces (ICMI) is an annual ACM-sponsored conference that promotes research in next-generation perceptive, adaptive and multimodal user interfaces. These new interfaces are especially well suited for interpreting natural communication and activity patterns in real-world environments.
The following articles are from "Proceedings of the 2007 International Conference on Multimodal Interfaces":
Ivanov, Yuri (2007): Interfacing life: a year in the life of a research lab. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. p. 1. Available online
Humans perceive life around them through a variety of sensory inputs. Some, such as vision and audition, have high information content, while others, such as touch and smell, do not. Humans and other animals use this gradation of senses to attend to what's important. In contrast, it is widely accepted that in tasks of monitoring living spaces the modalities with high information content hold the key to decoding the behavior and intentions of the space occupants. In surveillance, video cameras are used to record everything they can possibly see, in the hope that if something happens, it can later be found in the recorded data. Unfortunately, the latter has proved harder than it sounds. In our work we challenge this idea and introduce a monitoring system built as a combination of channels with varying information content. The system has been deployed for over a year in our lab space and consists of a large motion sensor network combined with several video cameras. While the sensors give a general context of events across the entire 3000 square meters of the space, the cameras attend only to selected occurrences of office activities. The system demonstrates several monitoring tasks that are all but impossible to perform in a traditional camera-only setting. In the talk we share our experiences, challenges, and solutions in building and maintaining the system. We show some results from the data we have collected over a period of more than a year and introduce other successful and novel applications of the system.
Perakakis, Manolis and Potamianos, Alexandros (2007): The effect of input mode on inactivity and interaction times of multimodal systems. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 102-109. Available online
In this paper, the efficiency and usage patterns of input modes in multimodal dialogue systems are investigated for both desktop and personal digital assistant (PDA) working environments. For this purpose a form-filling travel reservation application is evaluated that combines the speech and visual modalities; three multimodal modes of interaction are implemented, namely "Click-To-Talk", "Open-Mike" and "Modality-Selection". The three multimodal systems are evaluated and compared with the "GUI-Only" and "Speech-Only" unimodal systems. Mode and duration statistics are computed for each system, for each turn and for each attribute in the form. Turn time is decomposed into interaction and inactivity time, and the statistics for each input mode are computed. Results show that multimodal and adaptive interfaces are superior in terms of interaction time, but not always in terms of inactivity time. Also, users tend to use the most efficient input mode, although our experiments show a bias towards the speech modality.
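The turn-time decomposition that Perakakis and Potamianos describe, splitting each turn into interaction and inactivity time, can be sketched as follows; the span-based event representation is an illustrative assumption, not the paper's actual logging format.

```python
def decompose_turn(events, turn_start, turn_end):
    """Split a turn's duration into interaction time (spans where the
    user is actively providing input) and inactivity time (the rest).

    `events` is a list of (start, end) activity spans, e.g. speech
    segments or GUI actions; spans are assumed not to overlap.
    """
    interaction = 0.0
    for start, end in events:
        s, e = max(start, turn_start), min(end, turn_end)  # clip to the turn
        if e > s:
            interaction += e - s
    inactivity = (turn_end - turn_start) - interaction
    return interaction, inactivity

# A 10 s turn containing a 3 s speech segment and a 1 s GUI action:
print(decompose_turn([(2.0, 5.0), (7.0, 8.0)], 0.0, 10.0))  # (4.0, 6.0)
```

Per-mode statistics then follow by accumulating these two components separately for turns completed in each input mode.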
Thu, Ye Kyaw and Urano, Yoshiyori (2007): Positional mapping: keyboard mapping based on characters writing positions for mobile devices. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 110-117. Available online
Keyboard or keypad layout is one of the important factors in increasing user text input speed, especially on limited keypads such as those of mobile phones. This paper introduces a novel key mapping method, "Positional Mapping" (PM), for phonetic scripts such as the Myanmar language, based on its characters' writing positions. Our approach makes key mapping for the Myanmar language very simple and easy to memorize. We have developed positional mapping text input prototypes for a mobile phone keypad, a PDA, the customizable DX1 keyboard input system and a dual-joystick game pad, and conducted user studies for each prototype. Evaluation was based on users' actual typing speed with our four PM prototypes, and it showed that first-time users can type at reasonable average speeds (i.e. 3 min 47 s with the DX1, 4 min 42 s with the mobile phone prototype, 4 min 26 s with the PDA and 5 min 30 s with the dual-joystick game pad) to finish a short Myanmar SMS message of 6 sentences. Positional Mapping can be extended to other phonetic scripts, which we demonstrate with a Bangla mobile phone prototype in this paper.
Szentgyorgyi, Christine and Lank, Edward (2007): Five-key text input using rhythmic mappings. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 118-121. Available online
Novel key mappings, including chording, character prediction, and multi-tap, allow the use of fewer keys than those on a conventional keyboard to enter text. In this paper, we explore a text input method that makes use of rhythmic mappings of five keys. The keying technique averages 1.5 keystrokes per character for typical English text. In initial testing, the technique shows performance similar to chording and other multi-tap techniques, and our subjects had few problems with basic text entry. Five-key entry techniques may have benefits for text entry in multi-point touch devices, as they eliminate targeting by providing a unique mapping for each finger.
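The 1.5 keystrokes-per-character figure quoted above is a standard text-entry statistic (KSPC); a minimal sketch of how such a figure is computed is shown below, with a toy two-character mapping standing in for the paper's actual five-key rhythmic mapping.

```python
def keystrokes_per_char(mapping, text):
    """Average keystrokes per character of `text`, where `mapping`
    gives the keystroke sequence used to enter each character."""
    strokes = sum(len(mapping[ch]) for ch in text)
    return strokes / len(text)

# Toy mapping, purely illustrative: 'a' takes one keystroke, 'b' two.
toy_mapping = {'a': '1', 'b': '12'}
print(keystrokes_per_char(toy_mapping, 'aab'))  # ≈ 1.33
```

In practice the statistic is computed over a representative corpus, so frequent characters with short keystroke sequences dominate the average.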
Barthelmess, Paulo, Kaiser, Edward and McGee, David R. (2007): Toward content-aware multimodal tagging of personal photo collections. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 122-125. Available online
A growing number of tools that make use of existing tags to help organize and retrieve photos is becoming available, facilitating the management and use of photo sets. The tagging on which these techniques rely remains a time-consuming, labor-intensive task that discourages many users. To address this problem, we aim to leverage the multimodal content of naturally occurring photo discussions among friends and families to automatically extract tags from a combination of conversational speech, handwriting, and photo content analysis. While naturally occurring discussions are rich sources of information about photos, methods need to be developed to reliably extract a set of discriminative tags from this noisy, unconstrained group discourse. To this end, this paper contributes an analysis of pilot data identifying robust multimodal features and examining the interplay between photo content and other modalities such as speech and handwriting. Our analysis is motivated by a search for design implications leading to the effective incorporation of automated location and person identification (e.g. based on GPS and facial recognition technologies) into a system able to extract tags from natural multimodal conversations.
Zeng, Zhihong, Pantic, Maja, Roisman, Glenn I. and Huang, Thomas S. (2007): A survey of affect recognition methods: audio, visual and spontaneous expressions. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 126-133. Available online
Automated analysis of human affective behavior has attracted increasing attention from researchers in psychology, computer science, linguistics, neuroscience, and related disciplines. Promising approaches have been reported, including automatic methods for facial and vocal affect recognition. However, the existing methods typically handle only deliberately displayed and exaggerated expressions of prototypical emotions, despite the fact that deliberate behavior differs in visual and audio expressions from spontaneously occurring behavior. Recently, efforts to develop algorithms that can process naturally occurring human affective behavior have emerged. This paper surveys these efforts. We first discuss human emotion perception from a psychological perspective. Next, we examine the available approaches to solving the problem of machine understanding of human affective behavior occurring in real-world settings. We finally outline some scientific and engineering challenges for advancing human affect sensing technology.
Theobald, Barry-John, Matthews, Iain A., Cohn, Jeffrey F. and Boker, Steven M. (2007): Real-time expression cloning using appearance models. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 134-139. Available online
Active Appearance Models (AAMs) are generative parametric models commonly used to track, recognise and synthesise faces in images and video sequences. In this paper we describe a method for transferring dynamic facial gestures between subjects in real-time. The main advantages of our approach are that: 1) the mapping is computed automatically and does not require high-level semantic information describing facial expressions or visual speech gestures; 2) the mapping is simple and intuitive, allowing expressions to be transferred and rendered in real-time; 3) the mapped expression can be constrained to have the appearance of the target producing the expression, rather than the source expression imposed onto the target face; 4) near-videorealistic talking faces for new subjects can be created without the cost of recording and processing a complete training corpus for each. Our system enables face-to-face interaction with an avatar driven by an AAM of an actual person in real-time, and we show examples of arbitrary expressive speech frames cloned across different subjects.
Yonezawa, Tomoko, Yamazoe, Hirotake, Utsumi, Akira and Abe, Shinji (2007): Gaze-communicative behavior of stuffed-toy robot with joint attention and eye contact based on ambient gaze-tracking. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 140-145. Available online
This paper proposes a gaze-communicative stuffed-toy robot system with joint attention and eye-contact reactions based on ambient gaze-tracking. For free and natural interaction, we adopted our remote gaze-tracking method. Corresponding to the user's gaze, the gaze-reactive stuffed-toy robot is designed to gradually establish 1) joint attention using the direction of the robot's head and 2) eye-contact reactions from several sets of motion. From both subjective evaluations and observations of the user's gaze in the demonstration experiments, we found that i) joint attention draws the user's interest along with the user-guessed interest of the robot, ii) "eye contact" brings the user a favorable feeling for the robot, and iii) this feeling is enhanced when "eye contact" is used in combination with "joint attention." These results support the approach of our embodied gaze-communication model.
Rohs, Michael, Schöning, Johannes, Raubal, Martin, Essl, Georg and Krüger, Antonio (2007): Map navigation with mobile devices: virtual versus physical movement with and without visual context. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 146-153. Available online
A user study was conducted to compare the performance of three methods for map navigation with mobile devices. These methods are joystick navigation, the dynamic peephole method without visual context, and the magic lens paradigm using external visual context. The joystick method is the familiar scrolling and panning of a virtual map keeping the device itself static. In the dynamic peephole method the device is moved and the map is fixed with respect to an external frame of reference, but no visual information is present outside the device's display. The magic lens method augments external content with graphical overlays, hence providing visual context outside the device display. Here, too, motion of the device steers navigation. We compare these methods in a study measuring user performance, motion patterns, and subjective preference via questionnaires. The study demonstrates the advantage of dynamic peephole and magic lens interaction over joystick interaction in terms of search time and degree of exploration of the search space.
Littlewort, Gwen C., Bartlett, Marian Stewart and Lee, Kang (2007): Faces of pain: automated measurement of spontaneous facial expressions of genuine and posed pain. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 15-21. Available online
We present initial results from the application of an automated facial expression recognition system to spontaneous facial expressions of pain. In this study, 26 participants were videotaped under three experimental conditions: baseline, posed pain, and real pain. In the real pain condition, subjects experienced cold pressor pain by submerging their arm in ice water. Our goal was to automatically determine which experimental condition was shown in a 60-second clip from a previously unseen subject. We chose a machine learning approach, previously used successfully to categorize basic emotional facial expressions in posed datasets as well as to detect individual facial actions of the Facial Action Coding System (FACS) (Littlewort et al., 2006; Bartlett et al., 2006). For this study, we trained 20 Action Unit (AU) classifiers on over 5000 images selected from a combination of posed and spontaneous facial expressions. The output of the system was a real-valued number indicating the distance to the separating hyperplane for each classifier. Applying this system to the pain video data produced a 20-channel output stream, consisting of one real value for each learned AU, for each frame of the video. This data was passed to a second layer of classifiers to predict the difference between baseline and pained faces, and the difference between expressions of real pain and fake pain. Naïve human subjects tested on the same videos were at chance for differentiating faked from real pain, obtaining only 52% accuracy. The automated system was successfully able to differentiate faked from real pain. In an analysis of 26 subjects, the system obtained 72% correct for subject-independent discrimination of real versus fake pain on a 2-alternative forced choice. Moreover, the most discriminative facial action in the automated system output was AU 4 (brow lower), which was consistent with findings using human expert FACS codes.
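The two-layer pipeline described above, per-frame AU classifier margins feeding a clip-level decision, can be sketched roughly as follows; the clip summary (per-AU mean and max over frames) and the linear second layer are illustrative assumptions, not the authors' exact design.

```python
import numpy as np

def clip_features(au_margins):
    """au_margins: (frames, 20) array of per-frame distances to each AU
    classifier's separating hyperplane. Summarise each AU channel over
    the clip so a second-layer classifier can operate per clip."""
    return np.concatenate([au_margins.mean(axis=0), au_margins.max(axis=0)])

def second_layer(features, w, b):
    """Linear second-layer decision: True = real pain, False = faked.
    The weights w, b would be learned from labelled clips; here they
    are placeholders for illustration."""
    return float(features @ w + b) > 0.0

rng = np.random.default_rng(0)
margins = rng.normal(size=(60, 20))   # one hypothetical 60-frame clip
feats = clip_features(margins)
print(feats.shape)  # (40,)
```

Summarising over frames is what makes the second layer clip-level: whatever the clip length, the feature vector has a fixed dimensionality.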
Danninger, Maria, Takayama, Leila, Wang, Qianying, Schultz, Courtney, Beringer, Jörg, Hofmann, Paul, James, Frankie and Nass, Clifford (2007): Can you talk or only touch-talk: A VoIP-based phone feature for quick, quiet, and private communication. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 154-161. Available online
Advances in mobile communication technologies have allowed people in more places to reach each other more conveniently than ever before. However, many mobile phone communications occur in inappropriate contexts, disturbing others in close proximity, invading personal and corporate privacy, and more broadly breaking social norms. This paper presents a telephony system that allows users to answer calls quietly and privately without speaking. The paper discusses the iterative process of design, implementation and system evaluation. The resulting system is a VoIP-based telephony system that can be immediately deployed from any phone capable of sending DTMF signals. Observations and results from inserting and evaluating this technology in real-world business contexts through two design cycles of the Touch-Talk feature are reported.
Hoggan, Eve and Brewster, Stephen (2007): Designing audio and tactile crossmodal icons for mobile devices. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 162-169. Available online
This paper reports an experiment into the design of crossmodal icons which can provide an alternative form of output for mobile devices using audio and tactile modalities to communicate information. A complete set of crossmodal icons was created by encoding three dimensions of information in three crossmodal auditory/tactile parameters. Earcons were used for the audio and Tactons for the tactile crossmodal icons. The experiment investigated absolute identification of audio and tactile crossmodal icons when a user is trained in one modality and tested in the other (and given no training in the other modality) to see if knowledge could be transferred between modalities. We also compared performance when users were static and mobile to see what effects mobility might have on recognition of the cues. The results showed that if participants were trained in sound with Earcons and then tested with the same messages presented via Tactons, they could recognize 85% of messages when stationary and 76% when mobile. When trained with Tactons and tested with Earcons, participants could accurately recognize 76.5% of messages when stationary and 71% of messages when mobile. These results suggest that participants can recognize and understand a message in a different modality very effectively. These results will aid designers of mobile displays in creating effective crossmodal cues which require minimal training for users and can provide alternative presentation modalities through which information may be presented if the context requires.
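The encoding scheme described above, three information dimensions mapped onto three parameters shared between the audio and tactile modalities, can be illustrated with a minimal sketch; the message dimensions and parameter values below are invented for illustration and are not the paper's actual Earcon/Tacton designs.

```python
# Each message is a point in a 3-D information space; each dimension is
# rendered by one parameter that exists in both audio and touch.
RHYTHM  = {"text": "short-short", "voicemail": "long-short", "call": "long-long"}
URGENCY = {"low": 1, "medium": 2, "high": 3}         # number of repetitions
SENDER  = {"work": "left", "home": "right"}          # stereo/body location

def crossmodal_icon(msg_type, urgency, sender):
    """Build the shared parameter tuple; an audio renderer would map it
    to an Earcon and a tactile renderer would map the SAME tuple to a
    Tacton, which is what lets training transfer across modalities."""
    return (RHYTHM[msg_type], URGENCY[urgency], SENDER[sender])

print(crossmodal_icon("voicemail", "high", "work"))  # ('long-short', 3, 'left')
```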
Ruiz, Jaime and Lank, Edward (2007): A study on the scalability of non-preferred hand mode manipulation. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 170-177. Available online
In pen-tablet input devices, modes allow overloading of the electronic stylus. In the case of two modes, switching modes with the non-preferred hand is most effective. Further, allowing temporal overlap of mode switch and pen action boosts speed. We examine the effect of increasing the number of interface modes accessible via non-preferred hand mode switching on task performance in pen-tablet interfaces. We demonstrate that the temporal benefit of overlapping mode selection and pen action for the two-mode case is preserved as the number of modes increases. This benefit is the result of both concurrent action of the hands and reduced planning time for the overall task. Finally, while allowing bimanual overlap is still faster, it takes longer to switch modes as the number of modes increases. Improved understanding of the temporal costs presented assists in the design of pen-tablet interfaces with larger sets of interface modes.
Harada, Susumu, Saponas, T. Scott and Landay, James A. (2007): VoicePen: augmenting pen input with simultaneous non-linguistic vocalization. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 178-185. Available online
This paper explores using non-linguistic vocalization as an additional modality to augment digital pen input on a tablet computer. We investigated this through a set of novel interaction techniques and a feasibility study. Typically, digital pen users control one or two parameters using stylus position and sometimes pen pressure. However, in many scenarios the user can benefit from the ability to continuously vary additional parameters. Non-linguistic vocalizations, such as vowel sounds, variation of pitch, or control of loudness have the potential to provide fluid continuous input concurrently with pen interaction. We present a set of interaction techniques that leverage the combination of voice and pen input when performing both creative drawing and object manipulation tasks. Our feasibility evaluation suggests that with little training people can use non-linguistic vocalization to productively augment digital pen interaction.
Kiriyama, Shinya, Yamamoto, Goh, Otani, Naofumi, Ishikawa, Shogo and Takebayashi, Yoichi (2007): A large-scale behavior corpus including multi-angle video data for observing infants' long-term developmental processes. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 186-192. Available online
We have developed a method for multimodal observation of infant development. In order to analyze the development of problem-solving skills by observing scenes of task achievement or communication with others, we have introduced a method for extracting detailed behavioral features expressed by gestures or gaze. We have realized an environment for continuously recording the behavior of the same infants as multi-angle video. The environment has evolved into a practical infrastructure through the following four steps: (1) establish an infant school and study the camera arrangement; (2) obtain participants in the school who agree with the project purpose and start to hold regular classes; (3) begin to construct a multimodal infant behavior corpus while considering observation methods; (4) practice developmental process analyses using the corpus. We have constructed a support tool for observing the huge amount of video data, which increases with age. The system has contributed to enriching the corpus with annotations from multimodal viewpoints on infant development. With a focus on demonstrative expression as a fundamental human behavior, we extracted 240 scenes from the video over 10 months and observed them. The analysis results have revealed interesting findings about the developmental changes in infants' gestures and gaze, and indicated the effectiveness of the proposed observation method.
Pietrzak, Thomas, Martin, Benoît, Pecci, Isabelle, Saarinen, Rami, Raisamo, Roope and Järvi, Janne (2007): The micole architecture: multimodal support for inclusion of visually impaired children. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 193-200. Available online
Modern information technology allows us to seek out new ways to support the computer use and communication of disabled people. With the aid of new interaction technologies and techniques, visually impaired and sighted users can collaborate, for example, in classroom situations. The main goal of the MICOLE project was to create a software architecture that makes it easier for developers to create multimodal multi-user applications. The framework is based on interconnected software agents. The hardware used in this study includes the VTPlayer Mouse, which has two built-in Braille displays, and several haptic devices such as the PHANToM Omni, PHANToM Desktop and PHANToM Premium. We also used the SpaceMouse and various audio setups in the applications. In this paper we present the software architecture, a set of software agents, and an example of using the architecture. The example shown is an electric circuit application that follows the single-user, many-devices scenario. The application uses a PHANToM and a VTPlayer Mouse together with visual and audio feedback to make electric circuits understandable through touch.
Hagita, Norihiro (2007): The great challenge of multimodal interfaces towards symbiosis of human and robots. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. p. 2. Available online
Miletto, Evandro Manara, Flores, Luciano Vargas, Pimenta, Marcelo Soares, Rutily, Jérôme and Santagada, Leonardo (2007): Interfaces for musical activities and interfaces for musicians are not the same: the case for CODES, a web-based environment for cooperative music prototyping. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 201-207. Available online
In this paper, some requirements of user interfaces for musical activities are investigated and discussed, particularly focusing on the necessary distinction between interfaces for musical activities and interfaces for musicians. We also discuss the interactive and cooperative aspects of music creation activities in CODES, a Web-based environment for cooperative music prototyping, designed mainly for novices in music. Aspects related to interaction flexibility and usability are presented, as well as features to support manipulation of complex musical information, cooperative activities and group awareness, which allow users to understand the actions and decisions of all group members cooperating and sharing a music prototype.
Kubat, Rony, DeCamp, Philip and Roy, Brandon (2007): TotalRecall: visualization and semi-automatic annotation of very large audio-visual corpora. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 208-215. Available online
We introduce a system for visualizing, annotating, and analyzing very large collections of longitudinal audio and video recordings. The system, TotalRecall, is designed to address the requirements of projects like the Human Speechome Project, for which more than 100,000 hours of multitrack audio and video have been collected over a twenty-two month period. Our goal in this project is to transcribe speech in over 10,000 hours of audio recordings, and to annotate the position and head orientation of multiple people in the 10,000 hours of corresponding video. Higher level behavioral analysis of the corpus will be based on these and other annotations. To efficiently cope with this huge corpus, we are developing semi-automatic data coding methods that are integrated into TotalRecall. Ultimately, this system and the underlying methodology may enable new forms of multimodal behavioral analysis grounded in ultradense longitudinal data.
Fernandes, Vitor, Guerreiro, Tiago, Araújo, Bruno, Jorge, Joaquim and Pereira, João (2007): Extensible middleware framework for multimodal interfaces in distributed environments. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 216-219. Available online
We present a framework to manage multimodal applications and interfaces in a reusable and extensible manner. We achieve this by focusing the architecture both on applications' needs and devices' capabilities. One particular domain we want to approach is collaborative environments where several modalities and applications make it necessary to provide for an extensible system combining diverse components across heterogeneous platforms on-the-fly. This paper describes the proposed framework and its main contributions in the context of an architectural application scenario. We demonstrate how to connect different non-conventional applications and input modalities around an immersive environment (tiled display wall).
Gong, Shaogang, Shan, Caifeng and Xiang, Tao (2007): Visual inference of human emotion and behaviour. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 22-29. Available online
We address the problem of automatic interpretation of non-exaggerated human facial and body behaviours captured in video. We illustrate our approach by three examples. (1) We introduce Canonical Correlation Analysis (CCA) and Matrix Canonical Correlation Analysis (MCCA) for capturing and analyzing spatial correlations among non-adjacent facial parts for facial behaviour analysis. (2) We extend Canonical Correlation Analysis to multimodality correlation for behaviour inference using both facial and body gestures. (3) We model temporal correlation among human movement patterns in a wider space using a mixture of Multi-Observation Hidden Markov Model for human behaviour profiling and behavioural anomaly detection.
Lee, Jong-Seok and Park, Cheol Hoon (2007): Temporal filtering of visual speech for audio-visual speech recognition in acoustically and visually challenging environments. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 220-227. Available online
The use of visual speech information has been shown to be effective in compensating for the performance degradation of acoustic speech recognition in noisy environments. However, visual noise is usually ignored in most audio-visual speech recognition systems, even though it can be introduced into visual speech signals during acquisition or transmission. In this paper, we present a new temporal filtering technique for the extraction of noise-robust visual features. In the proposed method, a carefully designed band-pass filter is applied to the temporal pixel value sequences of lip region images in order to remove unwanted temporal variations due to visual noise, illumination conditions or speakers' appearances. We demonstrate that the method improves not only visual speech recognition performance for clean and noisy images but also audio-visual speech recognition performance in both acoustically and visually noisy conditions.
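A rough sketch of temporal filtering of per-pixel sequences is shown below; the difference-of-moving-averages band-pass and the window sizes are illustrative stand-ins for the paper's carefully designed filter.

```python
import numpy as np

def moving_average(x, k):
    """Moving average of length k along the time axis (axis 0)."""
    kernel = np.ones(k) / k
    return np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode='same'), 0, x)

def temporal_bandpass(frames, slow=15, fast=3):
    """Band-pass each pixel's time series: subtracting a long moving
    average removes slow drift (illumination, speaker appearance),
    while a short moving average suppresses frame-to-frame noise.
    Window sizes are illustrative, not the paper's filter design."""
    frames = frames.astype(float)
    return moving_average(frames, fast) - moving_average(frames, slow)

# 100 frames of a 4x4 "lip region": linear drift plus pixel noise.
t = np.arange(100)
video = t[:, None, None] * 0.5 + np.random.default_rng(1).normal(size=(100, 4, 4))
filtered = temporal_bandpass(video.reshape(100, -1)).reshape(100, 4, 4)
print(filtered.shape)  # (100, 4, 4)
```

Away from the clip boundaries, a pure linear drift cancels exactly in the difference of the two averages, which is the sense in which slow variations are removed.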
Morita, Tomoyuki, Mase, Kenji, Hirano, Yasushi and Kajita, Shoji (2007): Reciprocal attentive communication in remote meeting with a humanoid robot. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 228-235. Available online
In this paper, we investigate the reciprocal attention modality in remote communication. A remote meeting system with a humanoid robot avatar is proposed to overcome the invisible wall of a video conferencing system. Our experimental results show that a tangible robot avatar provides more effective reciprocal attention than video communication. The subjects in the experiment are asked to determine whether a remote participant represented by the avatar is actively listening to the local presenter's talk. In this system, the head motion of a remote participant is transferred to and expressed by the head motion of a humanoid robot. While the presenter has difficulty determining the extent of a remote participant's attention with a video conferencing system, she/he senses remote attentive states better with the robot. Based on the evaluation results, we propose a vision system for the remote user that integrates omni-directional camera and robot-eye camera images to provide a wide view with a delay compensation feature.
Govindarajulu, Naveen Sundar and Madhvanath, Sriganesh (2007): Password management using doodles. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 236-239. Available online
The average computer user needs to remember a large number of text username and password combinations for different applications, which places a large cognitive load on the user. Consequently users tend to write down passwords, use easy to remember (and guess) passwords, or use the same password for multiple applications, leading to security risks. This paper describes the use of personalized hand-drawn "doodles" for recall and management of password information. Since doodles can be easier to remember than text passwords, the cognitive load on the user is reduced. Our method involves recognizing doodles by matching them against stored prototypes using handwritten shape matching techniques. We have built a system which manages passwords for web applications through a web browser. In this system, the user logs into a web application by drawing a doodle using a touchpad or digitizing tablet attached to the computer. The user is automatically logged into the web application if the doodle matches the doodle drawn during enrollment. We also report accuracy results for our doodle recognition system, and conclude with a summary of next steps.
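Doodle matching against stored prototypes can be sketched with a simple resample-normalize-compare scheme in the spirit of handwritten shape matching; this is an illustrative approximation, not the authors' actual matcher.

```python
import math

def resample(pts, n=32):
    """Resample a stroke to n points spaced evenly by arc length."""
    pts = list(pts)
    step = sum(math.dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1)) / (n - 1)
    out, d, i = [pts[0]], 0.0, 0
    while len(out) < n and i < len(pts) - 1:
        seg = math.dist(pts[i], pts[i + 1])
        if seg > 0 and d + seg >= step:
            t = (step - d) / seg
            q = (pts[i][0] + t * (pts[i + 1][0] - pts[i][0]),
                 pts[i][1] + t * (pts[i + 1][1] - pts[i][1]))
            out.append(q)
            pts[i] = q          # continue walking from the new point
            d = 0.0
        else:
            d += seg
            i += 1
    while len(out) < n:         # guard against float round-off
        out.append(pts[-1])
    return out

def normalize(pts):
    """Translate the centroid to the origin and scale to a unit box."""
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    w = max(p[0] for p in pts) - min(p[0] for p in pts)
    h = max(p[1] for p in pts) - min(p[1] for p in pts)
    s = max(w, h, 1e-9)
    return [((p[0] - cx) / s, (p[1] - cy) / s) for p in pts]

def doodle_distance(a, b):
    """Mean pointwise distance between two preprocessed doodles."""
    a, b = normalize(resample(a)), normalize(resample(b))
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def match(doodle, prototypes):
    """Return the name of the closest enrolled prototype."""
    return min(prototypes, key=lambda name: doodle_distance(doodle, prototypes[name]))

protos = {'line': [(0, 0), (1, 0)], 'corner': [(0, 0), (1, 0), (1, 1)]}
print(match([(2, 3), (6, 3)], protos))  # 'line'
```

Normalization makes the match invariant to where and how large the doodle is drawn, which is essential when enrollment and login happen on different devices.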
Corradini, Andrea (2007): A computational model for spatial expression resolution. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 240-246. Available online
This paper presents a computational model for the interpretation of linguistic spatial propositions in the restricted realm of a 2D puzzle game. Based on an experiment aimed at analyzing human judgment of spatial expressions, we establish a set of criteria that explain human preference for certain interpretations over others. For each of these criteria, we define a metric that combines the semantic and pragmatic contextual information regarding the game as well as the utterance being resolved. Each metric gives rise to a potential field that characterizes the degree of likelihood for carrying out the instruction at a specific hypothesized location. We resort to machine learning techniques to determine a model of spatial relationships from the data collected during the experiment. Sentence interpretation occurs by matching the potential field of each of its possible interpretations to the model at hand. The system's explanation capabilities lead to the correct assessment of ambiguous situated utterances for a large percentage of the collected expressions.
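The combination of per-criterion potential fields into a single likelihood surface can be sketched as follows; the two criteria shown (proximity to an anchor object and free space) and the product combination are illustrative assumptions, whereas the paper learns its model from experimental data.

```python
import numpy as np

def proximity_field(grid_shape, anchor):
    """Score each cell by closeness to an anchor object (one criterion)."""
    ys, xs = np.indices(grid_shape)
    d = np.hypot(ys - anchor[0], xs - anchor[1])
    return 1.0 / (1.0 + d)

def free_space_field(occupied):
    """Score 1 for empty cells, 0 for occupied ones (another criterion)."""
    return 1.0 - occupied

def combine(fields):
    """Combine criterion fields into one likelihood surface and return
    the best placement. A simple product combination is assumed here."""
    total = np.ones_like(fields[0])
    for f in fields:
        total = total * f
    idx = np.unravel_index(np.argmax(total), total.shape)
    return tuple(int(i) for i in idx)

occ = np.zeros((5, 5))
occ[2, 3] = 1                  # an anchor piece already sits at (2, 3)
best = combine([proximity_field((5, 5), (2, 3)), free_space_field(occ)])
print(best)  # some free cell adjacent to the anchor
```

Resolving an utterance then amounts to comparing such surfaces, one per candidate interpretation, against the learned model and picking the best-supported one.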
Everitt, Katherine M., Harada, Susumu, Bilmes, Jeff and Landay, James A. (2007): Disambiguating speech commands using physical context. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 247-254. Available online
Speech has great potential as an input mechanism for ubiquitous computing. However, the current requirements for accurate speech recognition, such as a quiet environment and a well-positioned, high-quality microphone, are unreasonable to expect in a realistic setting. In a physical environment, there is often contextual information that can be sensed and used to augment the speech signal. We investigated improving speech recognition rates for an electronic personal trainer by using knowledge about what equipment was in use as context. We performed an experiment with participants speaking in an instrumented apartment environment and compared the recognition rates of a larger grammar with those of a smaller grammar determined by the context.
Otsuka, Kazuhiro, Sawada, Hiroshi and Yamato, Junji (2007): Automatic inference of cross-modal nonverbal interactions in multiparty conversations: "who responds to whom, when, and how?" from gaze, head gestures, and utterances. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 255-262. Available online
A novel probabilistic framework is proposed for analyzing cross-modal nonverbal interactions in multiparty face-to-face conversations. The goal is to determine "who responds to whom, when, and how" from multimodal cues including gaze, head gestures, and utterances. We formulate this problem as the probabilistic inference of the causal relationship among participants' behaviors involving head gestures and utterances. To solve this problem, this paper proposes a hierarchical probabilistic model in which the structures of interactions are probabilistically determined from high-level conversation regimes (such as monologue or dialogue) and gaze directions. Based on the model, the interaction structures, gaze, and conversation regimes are simultaneously inferred from observed head motion and utterances, using a Markov chain Monte Carlo method. The head gestures, including nodding, shaking, and tilting, are recognized with a novel wavelet-based technique from magnetic sensor signals. The utterances are detected using data captured by lapel microphones. Experiments on four-person conversations confirm the effectiveness of the framework in discovering interactions such as question-and-answer and addressing behavior followed by back-channel responses.
Sturm, Janienke, Herwijnen, Olga Houben-van, Eyck, Anke and Terken, Jacques (2007): Influencing social dynamics in meetings through a peripheral display. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 263-270. Available online
We present a service providing real-time feedback to participants of small group meetings on the social dynamics of the meeting. The service measures and visualizes properties of participants' behaviour that are relevant to the social dynamics of the meeting: speaking time and gaze behaviour. The dynamic visualization is offered to meeting participants during the meeting through a peripheral display. Whereas an initial version was evaluated using wizards to obtain the required information about gazing behaviour and speaking activity instead of perceptual systems, in the current paper we employ a system including automated perceptual components. We describe the system properties and the perceptual components. The service was evaluated in a within-subjects experiment, where groups of participants discussed topics of general interest, with a total of 82 participants. It was found that the presence of feedback about speaking time influenced the behaviour of the participants: over-participators behaved less dominantly and under-participators became more active. Feedback on eye gaze did not affect participants' gazing behaviour (for either listeners or speakers) during the meeting.
Dong, Wen, Lepri, Bruno, Cappelletti, Alessandro, Pentland, Alex Sandy, Pianesi, Fabio and Zancanaro, Massimo (2007): Using the influence model to recognize functional roles in meetings. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 271-278. Available online
In this paper, an influence model is used to recognize functional roles played during meetings. Previous work on the same corpus demonstrated high recognition accuracy using SVMs with RBF kernels. In this paper, we discuss the problems of that approach, mainly over-fitting, the curse of dimensionality, and the inability to generalize to different group configurations. We present results obtained with an influence modeling method that avoids these problems and ensures both greater robustness and generalization capability.
Tochigi, Hiroko, Shinozawa, Kazuhiko and Hagita, Norihiro (2007): User impressions of a stuffed doll robot's facing direction in animation systems. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 279-284. Available online
This paper investigates the effect on user impressions of the body direction of a stuffed doll robot in an animation system. Many systems that combine a computer display with a robot have been developed, and one of their applications is entertainment, for example, an animation system. In these systems, the robot, as a 3D agent, can be more effective than a 2D agent in helping the user enjoy the animation experience by using spatial characteristics, such as body direction, as a means of expression. The direction in which the robot faces, i.e., towards the human or towards the display, is investigated here. User impressions from 25 subjects were examined. The experiment results show that the robot facing the display together with a user is effective for eliciting good feelings from the user, regardless of the user's personality characteristics. Results also suggest that extroverted subjects tend to have a better feeling towards a robot facing the user than introverted ones.
Osaki, Kouzi, Watanabe, Tomio and Yamamoto, Michiya (2007): Speech-driven embodied entrainment character system with hand motion input in mobile environment. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 285-290. Available online
InterActor is a speech-input-driven CG-embodied interaction character that can generate communicative movements and actions for entrained interactions. InterPuppet, on the other hand, is an embodied interaction character that is driven by both speech input (as in InterActor) and hand motion input, like a puppet. Humans can therefore use InterPuppet to communicate effectively through both deliberate body movements and natural communicative movements and actions. In this paper, an advanced InterPuppet system that uses a cellular-phone-type device is developed for use in a mobile environment. The effectiveness of the system is demonstrated by a sensory evaluation experiment in an actual remote communication scenario.
Horchani, Meriam, Caron, Benjamin, Nigay, Laurence and Panaget, Franck (2007): Natural multimodal dialogue systems: a configurable dialogue and presentation strategies component. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 291-298. Available online
In the context of natural multimodal dialogue systems, we address the challenging issue of defining cooperative answers in an appropriate multimodal form. Highlighting the intertwined relation of multimodal outputs with content, we focus on the Dialogic strategy component, which selects, from the set of possible contents answering a user's request, the content to be presented to the user and its multimodal presentation. The content selection and presentation allocation managed by the Dialogic strategy component are based on various constraints, such as the availability of a modality and the user's preferences. Considering three generic types of dialogue strategies and their corresponding types of handled information, as well as three generic types of presentation tasks, we present a first rule-based implementation of the Dialogic strategy component. By providing a graphical interface for configuring the component by editing the rules, we show how the component can be easily modified by ergonomists at design time to explore different solutions. In further work we envision letting the user modify the component at runtime.
Klug, Tobias and Mühlhäuser, Max (2007): Modeling human interaction resources to support the design of wearable multimodal systems. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 299-306. Available online
Designing wearable application interfaces that integrate well into real-world processes like aircraft maintenance or medical examinations is challenging. One of the main success criteria is how well the multimodal interaction with the computer system fits an already existing real-world task. Therefore, the interface design needs to take the real-world task flow into account from the beginning. We propose a model of interaction devices and human interaction capabilities that helps evaluate how well different interaction devices and techniques integrate with specific real-world scenarios. The model was developed based on a survey of the wearable interaction research literature. By combining this model with descriptions of observed real-world tasks, possible conflicts between task performance and device requirements can be visualized, helping the interface designer find a suitable solution.
Massaro, Dominic W. (2007): Just in time learning: implementing principles of multimodal processing and learning for education. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 3-8. Available online
Baldi, a 3-D computer-animated tutor, has been developed to teach speech and language. I review this technology and pedagogy and describe evaluation experiments that have substantiated the effectiveness of our language-training program, Timo Vocabulary, in teaching vocabulary and grammar. With a new Lesson Creator, teachers, parents, and even students can build original lessons that allow concepts, vocabulary, animations, and pictures to be easily integrated. The Lesson Creator application facilitates the specialization and individualization of lessons by allowing teachers to create customized vocabulary lists Just in Time as they are needed. The Lesson Creator allows the coach to give descriptions of the concepts as well as corrective feedback, which supports errorless learning and encourages children to think as they learn. I describe the Lesson Creator, illustrate it, and speculate on how its evaluation can be accomplished.
Schuller, Björn, Müller, Ronald, Hörnler, Benedikt, Höthker, Anja, Konosu, Hitoshi and Rigoll, Gerhard (2007): Audiovisual recognition of spontaneous interest within conversations. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 30-37. Available online
In this work we present an audiovisual approach to the recognition of spontaneous interest in human conversations. For a maximally robust estimate, information from four sources is combined by a synergistic and individual-failure-tolerant fusion. First, speech is analyzed with respect to acoustic properties based on a high-dimensional prosodic, articulatory, and voice quality feature space, plus a linguistic analysis of spoken content by LVCSR and bag-of-words vector space modeling, including non-verbals. Second, visual analysis provides patterns of facial expression by AAMs and of movement activity by eye tracking. Experiments are based on a database of 10.5h of spontaneous human-to-human conversation containing 20 subjects balanced by gender and age class. Recordings were made with a room microphone, a camera, and close-talk headsets to cover diverse comfort and noise conditions. Three levels of interest were annotated within a rich transcription. We describe each information stream and an early-level fusion in detail. Our experiments aim at a person-independent system for real-life usage and show the high potential of such a multimodal approach. Benchmark results based on transcription versus automatic processing are also provided.
Tse, Edward, Hancock, Mark and Greenberg, Saul (2007): Speech-filtered bubble ray: improving target acquisition on display walls. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 307-314. Available online
The rapid development of large interactive wall displays has been accompanied by research on methods that allow people to interact with the display at a distance. The basic method for target acquisition is ray casting a cursor from one's pointing finger or hand position; the problem is that selection is slow and error-prone with small targets. A better method is the bubble cursor, which resizes the cursor's activation area to effectively enlarge the target size. The catch is that this technique's effectiveness depends on the proximity of surrounding targets: while beneficial in sparse spaces, it is less so when targets are densely packed together. Our method is the speech-filtered bubble ray, which uses speech to transform a dense target space into a sparse one. Our strategy builds on what people already do: people pointing to distant objects in a physical workspace typically disambiguate their choice through speech. For example, a person could point to a stack of books and say "the green one". Gesture indicates the approximate location for the search, and speech 'filters' unrelated books from the search. Our technique works the same way; a person specifies a property of the desired object, and only objects matching that property trigger the bubble's resizing. In a controlled evaluation, people were faster with, and preferred, the speech-filtered bubble ray over the standard bubble ray and the ray casting approach.
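The filtering step can be sketched as restricting the bubble cursor's candidate set before the nearest-target test. This is a simplified illustration of the idea rather than the study's implementation; the target records and property names are hypothetical:

```python
import math

def nearest_target(cursor, targets):
    """Plain bubble ray: pick the target closest to the cast cursor position."""
    return min(targets, key=lambda t: math.hypot(t["x"] - cursor[0],
                                                 t["y"] - cursor[1]))

def speech_filtered_nearest(cursor, targets, spoken_property):
    """Speech-filtered bubble ray: only targets matching the spoken property
    (e.g., a colour) remain candidates, so the effective space is sparser."""
    candidates = [t for t in targets if spoken_property in t["properties"]]
    return nearest_target(cursor, candidates) if candidates else None

targets = [
    {"name": "book-a", "x": 10, "y": 10, "properties": {"red"}},
    {"name": "book-b", "x": 12, "y": 10, "properties": {"green"}},
    {"name": "book-c", "x": 40, "y": 40, "properties": {"green"}},
]
```

Pointing near (10.5, 10) and saying "green" selects book-b even though book-a is closer to the raw pointing position.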
Ruiz, Natalie, Taib, Ronnie, Shi, Yu (David), Choi, Eric and Chen, Fang (2007): Using pen input features as indices of cognitive load. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 315-318. Available online
Multimodal interfaces are known to be useful in map-based applications and in complex, time-pressured tasks. Cognitive load variations in such tasks have been found to affect multimodal behaviour. For example, users become more multimodal and tend towards semantic complementarity as cognitive load increases. The richness of multimodal data means that systems could monitor particular input features to detect experienced load variations. In this paper, we present our attempt to induce controlled levels of load and solicit natural speech and pen-gesture inputs; in particular, we analyse such features in the pen-gesture modality. Our experimental design relies on a map-based Wizard of Oz setup using a tablet PC. This paper details the analysis of pen-gesture interaction across subjects and presents trends suggesting increased degeneration of pen gestures in some subjects, and possible trends in gesture kinematics, as cognitive load increases.
Breitfuss, Werner, Prendinger, Helmut and Ishizuka, Mitsuru (2007): Automated generation of non-verbal behavior for virtual embodied characters. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 319-322. Available online
In this paper we introduce a system that automatically adds different types of non-verbal behavior to a given dialogue script between two virtual embodied agents. It allows us to transform a dialogue in text format into an agent behavior script enriched by eye gaze and conversational gesture behavior. The agents' gaze behavior is informed by theories of human face-to-face gaze behavior. Gestures are generated based on the analysis of linguistic and contextual information of the input text. The resulting annotated dialogue script is then transformed into the Multimodal Presentation Markup Language for 3D agents (MPML3D), which controls the multi-modal behavior of animated life-like agents, including facial and body animation and synthetic speech. Using our system makes it very easy to add appropriate non-verbal behavior to a given dialogue text, a task that would otherwise be very cumbersome and time consuming.
Wang, Sy Bor, Demirdjian, David and Darrell, Trevor (2007): Detecting communication errors from visual cues during the system's conversational turn. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 323-326. Available online
Automatic detection of communication errors in conversational systems has been explored extensively in the speech community. However, most previous studies have used only acoustic cues. Visual information has also been used by the speech community to improve speech recognition in dialogue systems, but this visual information is only used when the speaker is communicating vocally. A recent perceptual study indicated that human observers can detect communication problems when they see the visual footage of the speaker during the system's reply. In this paper, we present work in progress towards the development of a communication error detector that exploits this visual cue. In datasets we collected or acquired, facial motion features and head poses were estimated while users were listening to the system response and passed to a classifier for detecting a communication error. Preliminary experiments have demonstrated that the speaker's visual information during the system's reply is potentially useful and accuracy of automatic detection is close to human performance.
Manchón, Pilar, Solar, Carmen del, Amores, Gabriel and Pérez, Guillermo (2007): Multimodal interaction analysis in a smart house. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 327-334. Available online
This is a substantial extension of a previous paper presented at LREC 2006. It describes the motivation, collection and format of the MIMUS corpus, as well as an in-depth, issue-focused analysis of the data. MIMUS is the result of multimodal WoZ experiments conducted at the University of Seville as part of the TALK project. The main objective of the MIMUS corpus was to gather information about different users and their performance, preferences and usage of a multimodal multilingual natural dialogue system in the Smart Home scenario in Spanish. The focus group is composed of wheelchair-bound users, chosen because of their special motivation to use this kind of technology, along with their specific needs. This article discusses the WoZ platform, experiments, methodology, annotation schemes and tools, and all relevant data, as well as the results of the in-depth analysis of these data. The corpus comprises a set of three related experiments. Due to the limited scope of this article, only some results related to the first two experiments (1A and 1B) are discussed. The article focuses on subjects' preferences, multimodal behavioural patterns and willingness to use this kind of technology.
Lin, Norman, Kajita, Shoji and Mase, Kenji (2007): A multi-modal mobile device for learning Japanese kanji characters through mnemonic stories. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 335-338. Available online
We describe the design of a novel multi-modal, mobile computer system to support foreign students in learning Japanese kanji characters through the creation of mnemonic stories. Our system treats complicated kanji shapes as hierarchical compositions of smaller shapes (following Heisig, 1986) and allows hyperlink navigation to quickly follow whole-part relationships. Visual display of kanji shape and meaning is augmented with user-supplied mnemonic stories in audio form, thereby dividing the learning information multi-modally into visual and audio modalities. A device-naming scheme and color-coding allow for asynchronous sharing of audio mnemonic stories among different users' devices. We describe the design decisions for our mobile multi-modal interface and present initial usability results based on feedback from beginning kanji learners. Our combination of mnemonic stories, audio and visual modalities, and a mobile device provides a new and effective system for computer-assisted kanji learning.
Ng, Kia C., Weyde, Tillman, Larkin, Oliver, Neubarth, Kerstin, Koerselman, Thijs and Ong, Bee (2007): 3d augmented mirror: a multimodal interface for string instrument learning and teaching with gesture support. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 339-345. Available online
Multimodal interfaces can open up new possibilities for music education, where the traditional model of teaching is based predominantly on verbal feedback. This paper explores the development and use of multimodal interfaces in novel tools to support music practice training. The design of multimodal interfaces for music education presents a challenge in several respects. One is the integration of multimodal technology into the music learning process. The other is the technological development, where we present a solution that aims to support string practice training with visual and auditory feedback. Building on the traditional function of a physical mirror as a teaching aid, we describe the concept and development of an "augmented mirror" using 3D motion capture technology.
Brandherm, Boris, Prendinger, Helmut and Ishizuka, Mitsuru (2007): Interest estimation based on dynamic bayesian networks for visual attentive presentation agents. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 346-349. Available online
In this paper, we describe an interface consisting of a virtual showroom where a team of two highly realistic 3D agents presents product items in an entertaining and attractive way. The presentation flow adapts to users' attentiveness, or lack thereof, and may thus provide a more personalized and user-attractive experience of the presentation. In order to infer users' attention and visual interest regarding interface objects, our system analyzes eye movements in real-time. Interest detection algorithms used in previous research determine an object of interest based on the time that eye gaze dwells on that object. However, this kind of algorithm is not well suited for dynamic presentations, where the goal is to assess the user's focus of attention regarding a dynamically changing presentation. Here, the current context of the object of attention has to be considered, i.e., whether the visual object is part of (or contributes to) the current presentation content or not. Therefore, we propose a new approach that estimates the interest (or non-interest) of a user by means of dynamic Bayesian networks. Each of a predefined set of visual objects has a dynamic Bayesian network assigned to it, which calculates the current interest of the user in this object. The estimation takes into account (1) each new gaze point, (2) the current context of the object, and (3) preceding estimations of the object itself. Based on these estimations, the presentation agents can provide timely and appropriate responses.
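A single-object update of this kind of estimate can be sketched as a recursive Bayesian filter over gaze observations. The probabilities below are illustrative assumptions, not the paper's learned network parameters:

```python
def update_interest(prior, gaze_hit, object_active,
                    p_stay=0.9, p_hit_given_interest=0.8, p_hit_given_no=0.1):
    """One recursive update of P(user is interested in this object).

    prior          -- previous estimate P(interest) for this object
    gaze_hit       -- did the latest gaze point fall on the object?
    object_active  -- is the object part of the current presentation content?
    """
    # Transition: interest persists with probability p_stay.
    pred = p_stay * prior + (1 - p_stay) * (1 - prior)
    # Context: gaze at an object that is part of the current presentation is
    # weaker evidence of *intrinsic* interest than gaze at an inactive object.
    hit_i = p_hit_given_interest * (0.7 if object_active else 1.0)
    hit_n = p_hit_given_no
    like_i = hit_i if gaze_hit else 1 - hit_i
    like_n = hit_n if gaze_hit else 1 - hit_n
    # Bayes rule over the two hypotheses (interested / not interested).
    return like_i * pred / (like_i * pred + like_n * (1 - pred))
```

Running one such filter per visual object, as the paper describes, and reading off the posteriors each frame gives the agents a per-object interest estimate.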
Noulas, Athanasios and Krose, Ben J. A. (2007): On-line multi-modal speaker diarization. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 350-357. Available online
This paper presents a novel framework that utilizes multi-modal information to achieve speaker diarization. We use dynamic Bayesian networks to achieve on-line results. We progress from a simple observation model to a complex multi-modal one as more data becomes available. We present an efficient way to guide the learning procedure of the complex model using the early results achieved with the simple model. We present results achieved in various real-world situations, including video from webcams, human-computer interaction, and video conferences.
Kurihara, Kazutaka, Goto, Masataka, Ogata, Jun, Matsusaka, Yosuke and Igarashi, Takeo (2007): Presentation sensei: a presentation training system using speech and image processing. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 358-365. Available online
In this paper we present a presentation training system that observes a presentation rehearsal and provides the speaker with recommendations for improving the delivery of the presentation, such as to speak more slowly and to look at the audience. Our system "Presentation Sensei" is equipped with a microphone and camera to analyze a presentation by combining speech and image processing techniques. Based on the results of the analysis, the system gives the speaker instant feedback with respect to the speaking rate, eye contact with the audience, and timing. It also alerts the speaker when some of these indices exceed predefined warning thresholds. After the presentation, the system generates visual summaries of the analysis results for the speaker's self-examinations. Our goal is not to improve the content on a semantic level, but to improve the delivery of it by reducing inappropriate basic behavior patterns. We asked a few test users to try the system and they found it very useful for improving their presentations. We also compared the system's output with the observations of a human evaluator. The result shows that the system successfully detected some inappropriate behavior. The contribution of this work is to introduce a practical recognition-based human training system and to show its feasibility despite the limitations of state-of-the-art speech and video recognition technologies.
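The threshold-alert step described above can be sketched as a simple check over delivery indices. The index names and limits below are hypothetical placeholders, not the system's actual warning thresholds:

```python
# Hypothetical limits; the paper's actual indices are speaking rate,
# eye contact with the audience, and timing.
THRESHOLDS = {"speaking_rate_wpm": 160, "seconds_without_eye_contact": 10}

def check_delivery(metrics, thresholds=THRESHOLDS):
    """Return one warning per delivery index that exceeds its threshold."""
    warnings = []
    for name, limit in thresholds.items():
        value = metrics.get(name, 0)
        if value > limit:
            warnings.append(f"{name} over limit ({value} > {limit})")
    return warnings
```

In a live setting, the metrics dict would be refreshed from the speech and image analysis each few seconds and any returned warnings shown to the speaker immediately.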
Our new research project, called "ambient intelligence", concentrates on the creation of new lifestyles through research on communication science and intelligence integration. It is premised on the creation of virtual communication partners, such as fairies and goblins, that can be constantly at our side. We call these virtual communication partners mushrooms. To show the essence of ambient intelligence, we developed two multimodal prototype systems: mushrooms that watch, listen, and answer questions, and a Quizmaster Mushroom. These two systems work in real time using speech, sound, dialogue, and vision technologies. We performed preliminary experiments with the Quizmaster Mushroom. The results showed that the system can transmit knowledge to users while they play the quizzes. Furthermore, through the two mushrooms, we identified design policies for multimodal interfaces and their integration.
Leung, Rock, MacLean, Karon, Bertelsen, Martin Bue and Saubhasik, Mayukh (2007): Evaluation of haptically augmented touchscreen gui elements under cognitive load. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 374-381. Available online
Adding expressive haptic feedback to mobile devices has great potential to improve their usability, particularly in multitasking situations where one's visual attention is required. Piezoelectric actuators are emerging as one suitable technology for rendering expressive haptic feedback on mobile devices. We describe the design of redundant piezoelectric haptic augmentations of touchscreen GUI buttons, progress bars, and scroll bars, and their evaluation under varying cognitive load. Our haptically augmented progress bars and scroll bars led to significantly faster task completion, and favourable subjective reactions. We further discuss resulting insights into designing useful haptic feedback for touchscreens and highlight challenges, including means of enhancing usability, types of interactions where value is maximized, difficulty in disambiguating background from foreground signals, tradeoffs in haptic strength vs. resolution, and subtleties in evaluating these types of interactions.
Valstar, Michel F., Gunes, Hatice and Pantic, Maja (2007): How to distinguish posed from spontaneous smiles using geometric features. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 38-45. Available online
Automatic distinction between posed and spontaneous expressions is an unsolved problem. Previous cognitive science studies indicated that the automatic separation of posed from spontaneous expressions is possible using the face modality alone. However, little is known about the information contained in head and shoulder motion. In this work, we propose to (i) distinguish between posed and spontaneous smiles by fusing the head, face, and shoulder modalities, (ii) investigate which modalities carry important information and how the information of the modalities relates to each other, and (iii) determine to what extent the temporal dynamics of these signals contribute to solving the problem. We use a cylindrical head tracker to track the head movements and two particle filtering techniques to track the facial and shoulder movements. Classification is performed by kernel methods combined with ensemble learning techniques. We investigated two aspects of multimodal fusion: the level of abstraction (i.e., early, mid-level, and late fusion) and the fusion rule used (i.e., sum, product and weight criteria). Experimental results from 100 videos displaying posed smiles and 102 videos displaying spontaneous smiles are presented. The best results were obtained with late fusion of all modalities, with 94.0% of the videos classified correctly.
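The late-fusion rules mentioned (sum, product, weighted) can be sketched as combining per-modality class probabilities into one decision. The per-modality probabilities below are illustrative, not the paper's classifier outputs:

```python
def late_fusion(modality_probs, rule="sum", weights=None):
    """Fuse per-modality class probabilities for the posed/spontaneous decision.

    modality_probs -- list of dicts like {"posed": p, "spontaneous": 1 - p},
                      one per modality (e.g. head, face, shoulders).
    rule           -- "sum" (optionally weighted) or "product".
    """
    classes = list(modality_probs[0].keys())
    if rule == "product":
        scores = {c: 1.0 for c in classes}
        for probs in modality_probs:
            for c in classes:
                scores[c] *= probs[c]
    else:
        weights = weights or [1.0] * len(modality_probs)
        scores = {c: sum(w * probs[c]
                         for probs, w in zip(modality_probs, weights))
                  for c in classes}
    return max(scores, key=scores.get)
```

Early fusion would instead concatenate the modalities' feature vectors before a single classifier; late fusion, as here, lets each modality vote.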
Iwahashi, Naoto and Nakano, Mikio (2007): Multimodal interfaces in semantic interaction. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. p. 382. Available online
This workshop addresses the approaches, methods, standardization, and theories for multimodal interfaces in which machines need to interact with humans adaptively according to context, such as the situation in the real world and each human's individual characteristics. To realize such interaction, which we call semantic interaction, it is necessary to extract and use the valuable context information needed for understanding interaction from the obtained real-world information. In addition, it is important for the user and the machine to share knowledge and an understanding of a given situation naturally through speech, images, graphics, manipulators, and so on. Submitted papers address these topics from diverse fields, such as human-robot interaction, machine learning, and game design.
Barthelmess, Paulo and Kaiser, Edward (2007): Workshop on tagging, mining and retrieval of human related activity information. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 383-384. Available online
Inexpensive and user-friendly cameras, microphones, and other devices such as digital pens are making it increasingly easy to capture, store and process large amounts of data over a variety of media. Even though the barriers to data acquisition have been lowered, making use of these data remains challenging. The focus of the present workshop is on issues related to theory, methods and techniques for facilitating the organization, retrieval and reuse of multimodal information. The emphasis is on the organization and retrieval of information related to human activity, i.e., information that is generated and consumed by individuals and groups as they go about their work, learning and leisure.
Wren, Christopher R. and Ivanov, Yuri A. (2007): Workshop on massive datasets. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. p. 385. Available online
Kaliouby, Rana El and Teeters, Alea (2007): Eliciting, capturing and tagging spontaneous facial affect in autism spectrum disorder. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 46-53. Available online
The emergence of novel affective technologies, such as wearable interventions for individuals who have difficulties with social-emotional communication, requires reliable, real-time processing of spontaneous expressions. This paper describes a novel wearable camera and a systematic methodology to elicit, capture and tag natural, yet experimentally controlled face videos in dyadic conversations. The MIT-Groden-Autism corpus is the first corpus of naturally evoked facial expressions of individuals with and without Autism Spectrum Disorders (ASD), a growing population who have difficulties with social-emotional communication. It is also the largest such corpus in the number and duration of its videos, and it represents affective-cognitive states that extend beyond the basic emotions. We highlight the machine vision challenges inherent in processing such a corpus, including pose changes and pathological affective displays.
Morimoto, Kazuhiro, Miyajima, Chiyomi, Kitaoka, Norihide, Itou, Katunobu and Takeda, Kazuya (2007): Statistical segmentation and recognition of fingertip trajectories for a gesture interface. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 54-57. Available online
This paper presents a virtual push-button interface operated by drawing a shape or line in the air with a fingertip. As an example of such a gesture-based interface, we developed a four-button interface for entering multi-digit numbers via push gestures within an invisible 2x2 button matrix inside a square drawn by the user. Trajectories of fingertip movements while entering randomly chosen multi-digit numbers are captured with a 3D position sensor mounted on the tip of the forefinger. We propose a statistical segmentation method for the movement trajectories and a normalization method that accounts for the direction and size of gestures. The performance of the proposed method is evaluated in HMM-based gesture recognition: a recognition rate of 60.0% improved to 91.3% after applying the normalization method.
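As a rough illustration of the size normalization described in this abstract (the exact procedure is the authors' own; the function below is a hypothetical sketch, not their implementation), a 2D fingertip trajectory can be translated and rescaled so the user-drawn square maps onto the unit square before HMM-based recognition, making gestures comparable regardless of where and how large they were drawn:

```python
# Hypothetical sketch of size/position normalization for fingertip
# trajectories: translate each 2D trajectory to its bounding-box corner
# and scale the bounding box onto the unit square.

def normalize(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    w = (max(xs) - min_x) or 1.0  # guard against degenerate (zero-width) strokes
    h = (max(ys) - min_y) or 1.0
    return [((x - min_x) / w, (y - min_y) / h) for x, y in points]
```

The same idea extends to direction normalization by additionally rotating the trajectory so that the drawn square's orientation is canonical.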
Schmid, Andreas J., Hoffmann, Martin and Woern, Heinz (2007): A tactile language for intuitive human-robot communication. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 58-65. Available online
This paper presents a tactile language for controlling a robot through its artificial skin. This language greatly improves multimodal human-robot communication by adding both redundant and inherently new ways of controlling the robot through the tactile mode. We defined an interface for arbitrary tactile sensors, implemented symbol recognition for multi-finger contacts, and integrated these, together with freely available character recognition software, into an easy-to-extend system for tactile language processing that can also incorporate and process data from non-tactile interfaces. The recognized tactile symbols allow both direct control of the robot's tool center point and abstract commands like "stop" or "grasp object x with grasp type y". In addition to this versatility, the symbols are also highly expressive, since multiple parameters like direction, distance, and speed can be decoded from a single human finger stroke. Furthermore, our efficient symbol recognition implementation achieves real-time performance while being platform-independent. We have successfully used both a multi-touch finger pad and our artificial robot skin as tactile interfaces. We evaluated our tactile language system by measuring its symbol and angle recognition performance, and the results are promising.
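To make concrete how several parameters can be decoded from a single stroke, the following is a hedged sketch (sample format and function name are assumptions for illustration, not the authors' system): given time-stamped contact samples, direction follows from the stroke endpoints, distance from the path length, and speed from distance over elapsed time.

```python
import math

# Hypothetical stroke decoder: samples are (x, y, t) tuples from a tactile
# surface; returns direction in degrees, path length, and average speed.

def decode_stroke(samples):
    (x0, y0, t0), (x1, y1, t1) = samples[0], samples[-1]
    direction = math.degrees(math.atan2(y1 - y0, x1 - x0))
    distance = sum(
        math.hypot(bx - ax, by - ay)
        for (ax, ay, _), (bx, by, _) in zip(samples, samples[1:])
    )
    speed = distance / (t1 - t0) if t1 > t0 else 0.0
    return direction, distance, speed
```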
Matsusaka, Yosuke, Enomoto, Mika and Den, Yasuharu (2007): Simultaneous prediction of dialog acts and address types in three-party conversations. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 66-73. Available online
This paper reports on the automatic prediction of dialog acts and address types in three-party conversations. In multi-party interaction, the dialog structure is more complex than in the one-to-one case, because an utterance may have more than one hearer. To cope with this problem, our framework predicts dialog acts and address types simultaneously. The prediction accuracy for dialog act labels rose to 68.5% when both context and address types were considered. CART decision tree analysis was also applied to identify features useful for predicting those labels.
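The point of simultaneous prediction can be sketched as follows (the label sets and probabilities below are invented for illustration, not from the paper): treating (dialog act, address type) as one joint label means taking an argmax over the cross-product, which can differ from predicting each label independently.

```python
# Hypothetical joint distribution over (dialog act, address type) for one
# utterance. Independent argmaxes over the marginals can disagree with the
# argmax over the joint table.

joint = {
    ("statement", "addressee_A"): 0.30,
    ("statement", "addressee_B"): 0.25,
    ("question",  "addressee_A"): 0.05,
    ("question",  "addressee_B"): 0.40,
}

def marginal_argmax(table, axis):
    m = {}
    for (act, addr), p in table.items():
        key = act if axis == 0 else addr
        m[key] = m.get(key, 0.0) + p
    return max(m, key=m.get)

independent = (marginal_argmax(joint, 0), marginal_argmax(joint, 1))
simultaneous = max(joint, key=joint.get)
```

Here the independent prediction is ("statement", "addressee_B"), a pair with joint probability 0.25, while the simultaneous prediction ("question", "addressee_B") has probability 0.40.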
Kasper, Alexander, Becher, Regine, Steinhaus, Peter and Dillmann, Rüdiger (2007): Developing and analyzing intuitive modes for interactive object modeling. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 74-81. Available online
In this paper we present two approaches for intuitive interactive modeling of special object attributes using specific sensor hardware. After a brief overview of the state of the art in interactive, intuitive object modeling, we motivate the modeling task by deriving the different object attributes to be modeled from an analysis of important interactions with objects. As an example domain, we chose the setting of a service robot in a kitchen. Tasks from this domain were used to derive important basic actions, from which in turn the necessary object attributes were inferred. In the main section of the paper, two of the derived attributes are presented, each with an intuitive interactive modeling method. The object attributes to be modeled are stable object positions and movement restrictions for objects. Both of the intuitive interaction methods were evaluated with a group of test persons, and the results are discussed. The paper ends with conclusions on the discussed results and a preview of future work in this area, in particular potential applications.
Sawamoto, Yuichi, Koyama, Yuichi, Hirano, Yasushi, Kajita, Shoji, Mase, Kenji, Katsuyama, Kimiko and Yamauchi, Kazunobu (2007): Extraction of important interactions in medical interviews using nonverbal information. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 82-85. Available online
We propose a method for extracting important interaction patterns in medical interviews. Because the interview is a major step in which doctor-patient communication takes place, improving the skill and quality of the medical interview will lead to better medical care. A pattern mining method for multimodal interaction logs, such as gestures and speech, is applied to medical interviews in order to extract characteristic doctor-patient interactions. As a result, we demonstrate that several interesting patterns are extracted, and we examine their interpretations. The extracted patterns are considered to be ones that doctors should acquire through training and practice for the medical interview.
Yu, Zhiwen, Ozeki, Motoyuki, Fujii, Yohsuke and Nakamura, Yuichi (2007): Towards smart meeting: enabling technologies and a real-world application. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 86-93. Available online
In this paper, we describe the enabling technologies for developing a smart meeting system based on a three-layered generic model. From the physical level to the semantic level, it consists of meeting capture, meeting recognition, and semantic processing. Based on an overview of the underlying technologies and existing work, we propose a novel real-world smart meeting application called MeetingAssistant. It is distinct from previous systems in two respects. First, it provides real-time browsing that allows a participant to instantly view the status of the current meeting; this feature is helpful for activating discussion and facilitating human communication during a meeting. Second, its context-aware browsing adaptively selects and displays meeting information according to the user's situational context, e.g., the user's purpose, which makes meeting viewing more efficient.
Ashraf, Ahmed Bilal, Lucey, Simon, Cohn, Jeffrey F., Chen, Tsuhan, Ambadar, Zara, Prkachin, Ken, Solomon, Patty and Theobald, Barry J. (2007): The painful face: pain expression recognition using active appearance models. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 9-14. Available online
Pain is typically assessed by patient self-report. Self-reported pain, however, is difficult to interpret and may be impaired or not even possible, as in young children or the severely ill. Behavioral scientists have identified reliable and valid facial indicators of pain. Until now they required manual measurement by highly skilled observers. We developed an approach that automatically recognizes acute pain. Adult patients with rotator cuff injury were video-recorded while a physiotherapist manipulated their affected and unaffected shoulder. Skilled observers rated pain expression from the video on a 5-point Likert-type scale. From these ratings, sequences were categorized as no-pain (rating of 0), pain (rating of 3, 4, or 5), and indeterminate (rating of 1 or 2). We explored machine learning approaches for pain-no pain classification. Active Appearance Models (AAM) were used to decouple shape and appearance parameters from the digitized face images. Support vector machines (SVM) were used with several representations from the AAM. Using a leave-one-out procedure, we achieved an equal error rate of 19% (hit rate =
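The 19% figure reported above is an equal error rate (EER). As a generic sketch of how that metric is computed (the function and data here are illustrative, not the authors' code), one sweeps a decision threshold over the classifier's scores until the false-positive rate on no-pain sequences matches the miss rate on pain sequences:

```python
# Hypothetical EER computation: labels are 1 = pain, 0 = no-pain, and a
# higher score means "more pain-like". We sweep thresholds over the observed
# scores and report the rate at the point where FPR and FNR are closest.

def equal_error_rate(scores, labels):
    best_gap, best_rate = 1.0, None
    n_neg = max(1, labels.count(0))
    n_pos = max(1, labels.count(1))
    for thr in sorted(set(scores)):
        fp = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= thr)
        fn = sum(1 for s, l in zip(scores, labels) if l == 1 and s < thr)
        fpr, fnr = fp / n_neg, fn / n_pos
        if abs(fpr - fnr) < best_gap:
            best_gap, best_rate = abs(fpr - fnr), (fpr + fnr) / 2
    return best_rate
```

A perfectly separating classifier yields an EER of 0.0; an EER of 0.19 means that at the balanced operating point, 19% of no-pain sequences are misclassified as pain and 19% of pain sequences are missed.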
Terken, Jacques, Joris, Irene and Valk, Linda De (2007): Multimodal cues for addressee-hood in triadic communication with a human information retrieval agent. In: Proceedings of the 2007 International Conference on Multimodal Interfaces 2007. pp. 94-101. Available online
Over the last few years, a number of studies have dealt with the question of how the addressee of an utterance can be determined from observable behavioural features in the context of mixed human-human and human-computer interaction (e.g. someone talking alternately to a robot and another person). Often in these cases, the behaviour is strongly influenced by the difference in communicative ability between the robot and the other person, and by the "salience" of the robot or system, which turns it into a situational distractor. In the current paper, we study triadic human-human communication, where one of the participants plays the role of an information retrieval agent (as in a travel agency, where two customers who want to book a vacation engage in a dialogue with the travel agent to specify constraints on preferable options). Through a perception experiment, we investigate the role of audio and visual cues as markers of the addressee-hood of utterances by customers. The outcomes show that audio and visual cues each provide specific types of information, and that combined audio-visual cues give the best performance. In addition, we conduct a detailed analysis of the eye gaze behaviour of the information retrieval agent both when listening and when speaking, providing input for modelling the behaviour of an embodied conversational agent.