Journal of Chinese Information Processing (中文信息学报), Vol. 21, No. 3, May 2007

Article ID: 1003-0077(2007)03-0008-012

Chinese Word Segmentation: A Decade Review

HUANG Chang-ning1, ZHAO Hai2

(1. Microsoft Research Asia, Beijing 100080, China; 2. City University of Hong Kong, Hong Kong, China)

Abstract: During the last decade, and especially since the First International Chinese Word Segmentation Bakeoff was held in July 2003, automatic Chinese word segmentation has improved markedly. The main advances can be summarized as follows: (1) combining segmentation guidelines, a lexicon, and a segmented corpus has given Chinese words in real text a computable definition, which is the basis for automatic segmentation and for comparable evaluation; (2) practice has shown that segmentation systems based on handcrafted rules are outperformed in evaluations by systems based on statistical learning; (3) evaluation on Bakeoff data shows that the accuracy loss caused by out-of-vocabulary (OOV) words is at least five times greater than that caused by segmentation ambiguities; (4) experiments show that character-based tagging, a statistical learning approach that substantially improves OOV recognition, outperforms earlier word-based (or dictionary-based) methods and has raised the accuracy of automatic segmentation systems to a new high.

Keywords: computer application; Chinese information processing; Chinese word segmentation (CWS); definition of words; out-of-vocabulary (OOV) word recognition; character-based tagging approach to CWS

CLC number: TP391  Document code: A

Ten years ago, at the request of Mr. Fei Jinchang, editor-in-chief of the journal 语言文字应用 (Applied Linguistics), the author chaired a special discussion in that journal on Chinese information processing, with automatic word segmentation as its theme. For that discussion the author wrote a short article on the word segmentation problem in Chinese information processing, introducing colleagues in linguistics to several linguistic problems faced by research in computer information processing [1]. Based on the understanding of the time, the article put forward four difficult problems in Chinese word segmentation research: (1) whether words have a clear definition;
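(2) which should come first, segmentation or understanding; (3) resolution of segmentation ambiguities; (4) recognition of out-of-vocabulary (OOV) words. Ten years on, after sustained effort by colleagues in China and abroad, how much progress has actually been made on these four problems? This is the question the present paper reviews, one problem at a time. Over the decade, and especially since the first international Chinese word segmentation evaluation, the Bakeoff [2], began in July 2003, automatic Chinese word segmentation has made encouraging progress. The main advances are: (1) combining segmentation guidelines, a lexicon, and a segmented corpus has given Chinese words in real text a computable definition, which is the basis for automatic segmentation and for comparable evaluation; (2) practice has shown that segmentation systems based on handcrafted rules are outperformed in evaluations by systems based on statistical learning;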
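
The character-based tagging approach named in the abstract can be illustrated with a minimal sketch. It assumes the widely used four-tag scheme (B = first character of a word, M = middle character, E = last character, S = single-character word); this excerpt does not specify a tag set, and the function names and the sample segmentation below are hypothetical. The sketch only shows how a segmented sentence maps to per-character tags and back. The statistical tagger that would predict the tags for unseen text is not shown.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Label every character with its position inside its word (assumed B/M/E/S scheme)."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, zero or more middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the word sequence from one tag per character."""
    if len(chars) != len(tags):
        raise ValueError("need exactly one tag per character")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(current)
    return words


# Hypothetical segmentation, chosen only to exercise all four tags.
sample_words = ["未登录词", "识别", "很", "难"]
sample_tags = words_to_tags(sample_words)
print(sample_tags)
print(tags_to_words("".join(sample_words), sample_tags))
```

Because the tags are assigned to characters rather than looked up in a word list, a tagger of this kind can in principle mark word boundaries around a word it has never seen, which is why the approach is associated with OOV recognition.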
