Editor: 迷音桑 | 2018-04-12
973 Fundamental Research Program of China.

Keywords: Chinese named entity recognition, word segmentation, role model, ICTCLAS

1. Introduction

Named entities (NEs) are widely distributed in original texts from many domains, especially politics, sports, and economics. NEs answer many questions, such as who, where, when, what, how much, and how long. NE recognition (NER) is an essential process widely required in natural language understanding and in many other text-based applications, such as question answering, information retrieval, and information extraction.

NER is also an important subtask of the Multilingual Entity Task (MET), which was established in the spring of
1996 and run in conjunction with the Message Understanding Conference (MUC). The entities defined in MET are divided into three categories: entities [organizations (ORG), persons (PER), locations (LOC)], times (dates and times), and quantities (monetary values and percentages) [N. A. Chinchor, 1998]. For Chinese NEs, we further divide PER into two sub-classes, Chinese PER and transliterated PER, on the basis of their distinct features. Similarly, LOC is split into Chinese LOC and transliterated LOC. In this work, we focus only on the more difficult but commonly used categories: PER, LOC, and ORG. Other NEs, such as times (TIME) and quantities (QUAN), can in a broader sense be recognized simply via finite state automata.

Chinese NER has not yet been researched intensively, while English NER has received much attention. Because of the inherent differences between the two languages, Chinese NER is more complicated and difficult. Approaches that have been applied successfully to English cannot simply be extended to cope with the problems of Chinese NER. Unlike Western languages such as English and Spanish, Chinese has no delimiters to mark word boundaries and no explicit definition of a word. Generally speaking, Chinese NER has two sub-tasks: locating the NE string and identifying its category. NER is an intermediate step in Chinese word segmentation, and token sequences greatly influence the process of NER. Take 孙家正在工作 (pronunciation: sun jia zheng zai gong zuo) as an example. 孙家正 (Sun Jia-Zheng) in 孙家正/在/工作/ (Sun Jia-Zheng is working) can be recognized as a Chinese PER, while 孙家 is an ORG in 孙家/正在/工作/ (The Sun family is working). Here, 孙家正在 contains several ambiguous candidates: 孙家正 (Sun Jia-Zheng, a PER name), 孙家 (the Sun family, an ORG name), and 正在 (just now, a common word). Such problems arise because Chinese character strings carry no word segmentation, and they are hard to solve in the process of NER. Sun et al. [2002] point out that Chinese NE identification and word segmentation are interactional in nature.

In this paper, we present a unified statistical approach, namely a role model, to recognize Chinese NEs. Here, roles are defined as special token classes, including an NE component and its neighboring and remote contexts...
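
The segmentation ambiguity in 孙家正在工作 can be made concrete with a small sketch. It is not the role model itself, only a brute-force enumeration of every way to split the string into entries of a toy lexicon. The lexicon contents, the tag "word" for common words, and the function name `segmentations` are assumptions introduced for illustration; they do not come from the paper or from ICTCLAS.

```python
from typing import Dict, List

# Toy lexicon: an assumption for illustration, not the paper's dictionary.
# PER and ORG follow the categories in the text; "word" marks a common word.
LEXICON: Dict[str, str] = {
    "孙家正": "PER",
    "孙家": "ORG",
    "正在": "word",
    "在": "word",
    "工作": "word",
}


def segmentations(text: str, lexicon: Dict[str, str]) -> List[List[str]]:
    """Return every way to split `text` into tokens found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no tokens
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Extend each segmentation of the remainder with this prefix.
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


if __name__ == "__main__":
    for seg in segmentations("孙家正在工作", LEXICON):
        print("/".join(f"{tok}[{LEXICON[tok]}]" for tok in seg))
```

With this lexicon both readings discussed above survive, one with 孙家 as an ORG and one with 孙家正 as a PER, so a dictionary lookup alone cannot choose between them. Some further evidence, such as the statistical context the role model is meant to capture, is needed to pick one.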