Shaoli Qu, Joyce Y. Chai
Department of Computer Science and Engineering
Michigan State University
East Lansing, MI 48824, USA
{qushaoli, jchai}@cse.msu.edu

ABSTRACT
In a multimodal conversational interface supporting speech and deictic gesture, deictic gestures on the graphical display have traditionally been used to identify user attention, for example, through reference resolution. Since the context of the identified attention can potentially constrain the associated intention, our hypothesis is that deictic gestures can go beyond attention and also apply to intention recognition. Driven by this hypothesis, this paper systematically investigates the role of deictic gestures in intention recognition. We experiment with different model-based and instance-based methods of incorporating gestural information for intention recognition, and we examine the effects of utilizing gestural information at two different processing stages: the speech recognition stage and the language understanding stage. Our empirical results show that utilizing gestural information improves intention recognition, and that performance improves further when gestures are incorporated at both the speech recognition and language understanding stages compared to either stage alone.

Author Keywords
Multimodal interface, language understanding, speech, gesture.

ACM Classification Keywords
H5.2 [Information interfaces and presentation]: User Interfaces - Theory and methods, Natural language.

INTRODUCTION
In multimodal conversational systems, multiple input modalities (e.g., speech, gesture, and eye gaze) are utilized to facilitate more natural and efficient human-machine conversation [23]. Many systems with different combinations of modalities have been developed over the last two decades [7, 13, 17, 27]. In this paper, we focus on a speech-gesture system where user speech inputs are accompanied by deictic
gestures on the graphical display. Since human speech is the most natural mode of communication, this type of system is easy and intuitive for users to interact with.

A key component of multimodal conversational systems is semantic interpretation, which identifies the semantic meaning of user input. In conversational systems, the meaning of user input can generally be categorized into intention and attention [11]. Intention indicates the user's motivation and action. Attention reflects the focus of the conversation, in other words, what has been talked about. In a speech-gesture system where speech is the dominant mode of communication, user intention (such as asking for the price of an object) is generally expressed through spoken language, while attention (e.g., the specific object in question) is indicated by the deictic gesture on the graphical display.

Based on such observations, many speech-gesture systems apply a semantic fusion approach in which speech is mainly used to identify intention and deictic gestures are used to identify attention [1, 18, 9], as sketched below.
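To make this division of labor concrete, the following minimal Python sketch shows one way such a fusion step might look. It is an illustrative toy, not the fusion algorithm of any of the cited systems: the keyword-based intention classifier, the point-in-rectangle hit test, and names such as classify_intention and resolve_attention are all our own assumptions for exposition.

from dataclasses import dataclass

@dataclass
class Obj:
    """An object rendered on the graphical display, with its bounding box."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

def classify_intention(utterance: str) -> str:
    """Toy keyword-based classifier: the speech input alone determines intention."""
    words = utterance.lower()
    if "price" in words or "cost" in words:
        return "ASK_PRICE"
    if "compare" in words:
        return "COMPARE"
    return "OTHER"

def resolve_attention(gx: float, gy: float, objects: list[Obj]) -> str | None:
    """Resolve a deictic gesture point to the displayed object it falls on."""
    for o in objects:
        if o.x0 <= gx <= o.x1 and o.y0 <= gy <= o.y1:
            return o.name
    return None

def fuse(utterance: str, gesture_xy: tuple[float, float], objects: list[Obj]) -> dict:
    """Combine the two partial interpretations into one semantic frame."""
    gx, gy = gesture_xy
    return {
        "intention": classify_intention(utterance),       # from speech only
        "attention": resolve_attention(gx, gy, objects),  # from gesture only
    }

display = [Obj("lamp", 0, 0, 50, 50), Obj("vase", 60, 0, 110, 50)]
print(fuse("how much does this cost", (75, 20), display))
# {'intention': 'ASK_PRICE', 'attention': 'vase'}

The point of the sketch is only the separation of duties in the conventional approach: the spoken utterance alone fills the intention slot of the resulting semantic frame, and the gesture alone fills the attention slot.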
It is not clear, however, whether deictic gestures can be used to recognize intention at all, nor how such gestures should be used. In our view, deictic gestures not only indicate users'