42. Multimodal Affective Computing
Affective Computing is computing that relates to, arises from, or deliberately influences emotion or other affective phenomena (Picard 1997).
Research on automatic emotion recognition did not start until the 1990s. Although researchers such as Ekman published studies in the 1960s on how people recognize emotions from face displays (Ekman and Friesen 1968), the idea of giving machines such abilities would have seemed absurd at a time when emotional mechanisms were not considered to play a significant role in human life. However, scientists later found that emotions persist even in the most rational of decisions: we always feel something.
In the early 1990s, Salovey and Mayer published a series of papers on emotional intelligence (Salovey and Mayer 1990). They suggested that the capacity to perceive and understand emotions defines a new variable in personality. Goleman popularized his view of emotional intelligence, or Emotional Quotient (EQ), in his 1995 bestselling book, in which he discussed why EQ mattered more than Intelligence Quotient (IQ) (Goleman 1995). Goleman drew together research in neurophysiology, psychology and cognitive science. Other scientists also provided evidence that emotions are tightly coupled with all the functions humans engage in: attention, perception, learning, reasoning, decision making, planning, action selection, memory storage and retrieval (Isen 2000; Picard 2003).
This new scientific understanding of emotions inspired various researchers to build machines able to recognize, express, model, communicate, and respond to emotions. The initial focus was on recognizing prototypical emotions from posed visual input, namely face expressions. All existing work in the early 1990s attempted to recognize prototypical emotions from two static face images: neutral and expressive. In the second half of the 1990s, automated face expression analysis started focusing on posed video sequences and exploiting temporal information in the displayed face expressions. In parallel with automatic emotion recognition from visual input, work focusing on audio input emerged. Rosalind Picard's award-winning book, Affective Computing, was published in 1997, laying the groundwork for giving machines the skills of emotional intelligence. The book triggered an explosion of interest in the emotional side of computers and their users, and a new research area called affective computing emerged. Affective computing advocated the idea that it might not be essential for machines to possess all the emotional intelligence and skills that humans do. However, for natural and effective human-computer interaction, computers still needed to look intelligent to some extent (Picard 1997). Experiments conducted by Reeves and Nass showed that for an intelligent interaction, the basic human-human issues should hold (Reeves and Nass 1996).
One major limitation of affective computing has been that most past research has focused on emotion recognition from a single sensorial source, or modality. However, as natural human-human interaction (HHI) is multimodal, single sensory observations are often ambiguous, uncertain, and incomplete. It was not until 1998 that computer scientists attempted to use multiple modalities for the recognition of emotions and affective states. The combined use of multiple modalities for sensing affective states in itself triggered another research area, raising two questions: which channels should be used, and how should they be combined? The initial interest was in fusing visual and audio data. The results were promising: using multiple modalities improved the overall recognition accuracy, helping the systems function in a more efficient and reliable way. Starting from the work of Picard in 2001, interest in detecting emotions from physiological signals emerged. Moreover, researchers moved their focus from posed to spontaneous visual data (Braathen et al. 2002). Although a fundamental study by Ambady and Rosenthal suggested that the most significant channels for judging behavioural cues of humans appear to be the visual channels of face expressions and body gestures (Ambady and Rosenthal 1992), the existing literature on automatic emotion recognition did not focus on the expressive information that body gestures carry until 2003 (Hudlicka 2003). Following the new findings in psychology, some researchers advocate that a reliable automatic affect recognition system should attempt to combine face expressions and body gestures. Accordingly, a number of approaches have been proposed for such sensorial sources (Gunes and Piccardi 2007; Kapoor et al. 2007; Karpouzis et al. 2007; Lisetti and Nasoz 2002; Martin et al. 2006). With all these new areas, a number of new challenges have arisen.
Overall, interest in affective computing has grown significantly in the last three years. In Europe, the Human-Machine Interaction Network on Emotion (HUMAINE) was created as a Network of Excellence in the EU's Sixth Framework Programme, under the Information Society Technologies (IST) programme (Humaine 2007). The HUMAINE Network started on 1st January 2004 and is funded to run for four years. In parallel, the First International Conference on Affective Computing and Intelligent Interaction was organized in 2005, bringing together researchers from diverse fields of research (ACII 2005).
Currently, there is general agreement among research groups that multiple modalities should be explored in order to understand which channels provide better information for automatic affect/emotion recognition. If a monomodal affect recognition system is compared to a multimodal one, some of the assumptions made when building monomodal affect recognisers still hold (e.g., affect data collection is still needed). However, specific problems exist for multimodal affect recognition (e.g., multiple sensors are now required). Therefore, some new assumptions need to be taken into consideration.
The stage that affective computing has reached today is one of combining multiple channels for affect recognition and moving from posed data towards spontaneous data. Achieving these aims is an open challenge. At this stage, scientists expect emotion recognition to be solvable by machine in the near future, at least as well as people can label such patterns (Picard 2003). A significant point to note here is that the focus of the affective computing research field is gradually moving from just developing more efficient and effective automated techniques to concentrating on more context-, culture- and user-related aspects. In order to achieve this transition smoothly, it should be understood that machine learning for human-computer applications is distinctively different from the conventional machine learning field. Issues such as large volumes of data, spatial coherence, and the large variety of appearances make affective behaviour analysis, in particular, a special challenge for machine learning algorithms.
Today, affective computing has many aims in common with the recently emerging research field called human computing. Human computing is an interdisciplinary research field focusing on computing and computational artefacts as they relate to the human condition. As defined in (Pantic et al. 2007), human computing focuses on the human portion of the HCI context, going beyond the traditional keyboard and mouse to include natural, human-like interactive functions, including understanding and emulating behavioural and social signalling. The human computing research field is interested in devising automated analysis algorithms that aim to extract, efficiently describe, and organise information regarding the state or state transition of individuals (identity, emotional state, activity, position and pose, etc.), interactions between individuals (dialogue, gestures, engagement in collaborative or competitive activities like sports), and physical characteristics of humans (anthropometric characteristics, 3D head/body models) (Pantic et al. 2007).
Starting from the survey by Pantic, Pentland, Nijholt and Huang (Pantic et al. 2007), special sessions have already been organised and special journal issues have been proposed in this field:
- Special session on Human Computing at the ACM International Conference on Multimodal Interfaces, November 2006.
- Workshop on Artificial Intelligence for Human Computing in conjunction with the International Joint Conference on Artificial Intelligence, January 2007.
- International Workshop on Human-Centered Multimedia in conjunction with ACM Multimedia 2007.
- International Journal of Image and Video Processing, special issue on Anthropocentric Video Analysis: Tools and Applications, 2007.
- Lecture Notes in Artificial Intelligence (LNAI), special volume on AI for Human Computing, 2007.
- IEEE Computer Magazine, Special Issue on Human-Centered Computing, 2007.
- IEEE Transactions on Systems, Man, and Cybernetics Part B, special issue on Human Computing, 2007.
Although research fields such as affective computing, human computing and multimodal interfaces appear to be separate, each with its own research community, conferences and audience, future progress in these fields is likely to bring them together and merge them into a single, widespread research area within the computer science, artificial intelligence and CHI research communities, as some researchers have predicted (e.g., Pantic et al. 2007). The future direction in these research fields is to advance further by making computers, machines, devices and environments more human-like rather than forcing humans to act machine-like. Further progress is needed in order to achieve this common goal.
Where to learn more?
Informative websites are listed as follows: