(Appeared in) 2004. The studies on the theory and methodology of the digitalized Chinese teaching to foreigners: Proceedings of the Fourth International Conference on New Technologies in Teaching and Learning Chinese. Zhang, Pu, Tianwei Xie and Juan Xu (eds.). 501-511. Beijing: Tsinghua University Press.

A corpus-based study of character and bigram frequencies in Chinese e-texts and its implications for Chinese language instruction¹

Jun Da
Middle Tennessee State University
Murfreesboro, Tennessee, USA 37132
[email protected]

Abstract: This paper describes the findings of a research project whose main objective is to compile a character frequency list based on a very large collection of Chinese texts collected from various online sources. As compared with several previous studies on Chinese character frequencies, this project uses a much larger corpus that not only covers more subject fields but also contains a better proportion of informative versus imaginative Modern Chinese texts. In addition, this project also computes two bigram frequency lists that can be used for compiling a list of most frequently used two-character words in Chinese.

Keywords: Chinese text corpus, character, bigram, frequency, word segmentation, Mutual Information

1. Introduction

Character and word frequencies are useful information for Chinese language learning and instruction. Chinese learners are often curious about how many characters they should learn in order to master the language. Answers to the question vary from 1,000 to 3,500 or even more characters, depending on whether Chinese is learned as a first or a second/foreign language. Similar interests are also found among authors of Chinese language learning materials. From time to time they rely on frequency information to decide which particular sets of characters and words to include and how to sequence them in the learning materials they develop.

In the past, character frequency information has been made available from several sources. One important source is the List of Frequently Used Characters in Modern Chinese (《现代汉语常用字表》, henceforth Changyong Zibiao) recommended jointly by the National Working Committee on Languages and Writing Systems (国家语言文字工作委员会) and the Ministry of Education, China in 1988. It includes 3,500 characters divided into two frequency levels. According to the Ministry of Education², the list was compiled based on information from other previously compiled character frequency lists, dictionaries as well as a corpus of Chinese texts published from 1928 to 1986 covering ten subject categories.

1 Research for this paper was supported in part by the Middle Tennessee State University Faculty Research and Creative Activity Grant Program in 2001.

2 Cf. http://www.moe.edu.cn/moe-dept/yuxin/index.htm.
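
As a rough illustration of what the character and bigram frequency lists mentioned in the abstract involve, the following Python sketch counts single characters and adjacent two-character sequences in a text. It is not the procedure used in this project, whose details are not given in this section. The restriction to the basic CJK Unified Ideographs range, the function names, and the sample sentence are all assumptions made for the example, and the Mutual Information function uses the standard pointwise definition rather than any formula confirmed by the paper.

```python
import math
import re
from collections import Counter

# Assumption: a "character" is any code point in the basic CJK Unified
# Ideographs block; the project's own filtering rules are not stated here.
HANZI_RUN = re.compile(r"[\u4e00-\u9fff]+")


def count_characters_and_bigrams(text):
    """Return (character counts, bigram counts) for one text."""
    chars = Counter()
    bigrams = Counter()
    for run in HANZI_RUN.findall(text):
        chars.update(run)
        # Pair adjacent characters inside an unbroken run only, so that
        # bigrams never span punctuation, digits, or Latin letters.
        bigrams.update(run[i:i + 2] for i in range(len(run) - 1))
    return chars, bigrams


def mutual_information(bigram, chars, bigrams):
    """Pointwise mutual information (log base 2) of a two-character bigram.

    The bigram must occur at least once in `bigrams`.
    """
    p_xy = bigrams[bigram] / sum(bigrams.values())
    total_chars = sum(chars.values())
    p_x = chars[bigram[0]] / total_chars
    p_y = chars[bigram[1]] / total_chars
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical sample sentence, used only to exercise the functions.
sample = "学习汉语，学习汉字。"
chars, bigrams = count_characters_and_bigrams(sample)
print(chars.most_common(3))
print(bigrams.most_common(3))
print(mutual_information("学习", chars, bigrams))
```

Counting bigrams only within unbroken runs of characters is one possible design choice; it keeps sequences that straddle a punctuation mark out of the bigram list.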

In addition to the government-sponsored character list compiled in the late 1980s, there were several empirical studies on Chinese character frequency in the late 1990s whose results are accessible on the Internet. For example, Tsai (1996) compiled a character frequency list based on 1993-1994 Big5-encoded newsgroup archives. Da (1998) computed character frequency lists based on a 45 million character corpus of Simplified Chinese texts collected from various online sources. He (1998) produced a character frequency list from a trans-regional diachronic survey of Chinese literary texts published in the 1960s, 1980s and 1990s.

Apart from the above three empirical studies whose results are accessible online, there have been several other corpus-based studies of Chinese texts conducted at Peking University, Tsinghua University and elsewhere (Feng 2002). While those studies are reported to have looked into character and/or word frequency information in one way or another, details of their results have unfortunately not been made public and hence are beyond the reach of most Chinese language instruction professionals and researchers.

As far as the four accessible frequency lists are concerned, there are a few problems that may have hindered their usefulness. In the case of Changyong Zibiao, it is not known how those
