Editor: 此身滑稽 | 2012-12-20
3500 are ranked among themselves. If one wants to use the list to develop beginning-level learning materials, for example, it is difficult to decide which basic set of characters to include and how to sequence them so that learners receive maximum exposure to the language within the limited time frame of a language learning program. The three empirical studies mentioned above, in turn, tend to be based on collections of Chinese texts that are either too limited in subject domains or encoded in outdated Chinese encoding standards. For example, Da's (1998) study used materials encoded in GB2312-80, a character set that contains fewer than 7,000 distinct characters. While those results were useful at the time of the study, Da's (1998) results now appear outdated, given that more and more Chinese webpages are encoded in the more recent GB13000 (also known as GBK) or GB18030³ standards, which contain much larger character sets.

Compared with the detailed information available about character frequencies, information about word frequencies is much scarcer, and only a few graded word lists are accessible to Chinese language learners and instruction professionals. One such graded list is the HSK List of 8,000 Chinese Vocabulary, published by Beijing Language and Culture University in 2000. Another is Dew's (1999) handbook, which grades 6,000 Chinese vocabulary items into elementary, intermediate and more advanced levels. The scarcity of accessible information about word frequencies in Chinese may be due to the fact that a Chinese word may contain one, two, three or even more characters, while research employing heuristic methods to segment individual words in running Chinese text, which contains no word delimiters, is far from conclusive (cf. Sproat and Emerson 2003, among others).

In this paper, we report the findings of a recent research project whose main objective is to compile yet another character frequency list, based on a very large collection of online Chinese texts encoded not only in GB2312-80 but also in the GB13000 standard. With detailed results of the research project made available at http://lingua.mtsu.edu/chinese-computing, we will focus our discussion in this paper on the construction of the corpus used in the study and some general distribution patterns of the character frequencies found in our corpus.
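To make the notion of a character frequency list concrete, the following is a minimal sketch, not the procedure used in the study. It assumes the texts have already been decoded to Unicode strings, and it counts only code points in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which is a simplifying assumption. The function names `is_cjk` and `char_frequency_list` are hypothetical.

```python
from collections import Counter


def is_cjk(ch):
    """Rough filter (an assumption): basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def char_frequency_list(texts):
    """Return rows of (rank, character, count, cumulative share of all tokens)."""
    counts = Counter(ch for text in texts for ch in text if is_cjk(ch))
    total = sum(counts.values())
    rows = []
    cumulative = 0
    # Sort by descending count; break ties by code point so the order is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for rank, (ch, n) in enumerate(ordered, start=1):
        cumulative += n
        rows.append((rank, ch, n, cumulative / total))
    return rows


if __name__ == "__main__":
    # Toy input for illustration only; not data from the study.
    sample = ["我的书在我的家。", "他的书不在家。"]
    for row in char_frequency_list(sample):
        print(row)
```

The cumulative share column is one simple way to look at distribution patterns, for instance how many of the top-ranked characters are needed to cover a given proportion of running text.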
3 For more information about the various encoding standards for Chinese characters, see, for example, http://www.praxagora.com/lunde/cjk_inf.html.
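The practical difference between these encoding standards can be illustrated with a short sketch that reports the smallest of three character sets able to represent a given character. It relies on Python's built-in `gb2312`, `gbk` and `gb18030` codecs as stand-ins for the standards, which is an assumption, since a codec's repertoire may differ in detail from the published standard. The helper names `can_encode` and `smallest_standard` are hypothetical.

```python
def can_encode(ch, codec):
    """Return True if the codec has a code for this character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def smallest_standard(ch):
    """Return the first codec, from smallest to largest repertoire, that covers ch."""
    for codec in ("gb2312", "gbk", "gb18030"):
        if can_encode(ch, codec):
            return codec
    return None


if __name__ == "__main__":
    # A common simplified character, a traditional character,
    # and a character outside the Basic Multilingual Plane (U+20000).
    for ch in ("中", "國", "\U00020000"):
        print(repr(ch), smallest_standard(ch))
```

A corpus restricted to GB2312-80 would not be able to record occurrences of the second and third characters at all, which is why the choice of encoding bounds the set of characters a frequency list can contain.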
In addition, we will discuss the computation of two bigram frequency lists that can be used as the basis for compiling a two-character word frequency list for Modern Chinese. It is hoped that these frequency lists will provide a better tool for both Chinese language learners and instruction professionals.

2. This study

2.1. Corpus design and data collection

The main objective of this research project is to compile a character frequency list that can be used for both Chinese language learning and instruction. Accordingly, the following three measures have been taken in the construction of the Chinese text corpus used in the study:

1) Both Classical and Modern Chinese texts are collected from various online sources, where texts written before 1911 are categorized as Classical Chinese and those published in or after 1911 as Modern Chinese.
2) Only formal Modern Chinese texts are included in the corpus. No effort has been made to collect informal writings in Modern Chinese, such as postings on online BBS or email messages.
3) With reference to the structures of the Brown Corpus (Francis and Kucera 1964), the British National Corpus (Burnard 2000) and the Longman/Lancaster Corpus (Summers 1991), efforts are made to collect text materials from a diverse range of subject fields (cf. Table 1). In addition, for Modern Chinese a distinction is made between imaginative texts (i.e., those written for entertainment or related to literary works) and informative texts (i.e., those written for information and/or knowledge).

Table 1: List of subject fields used in the study

Category | Subcategory | Subject fields
Classical Chinese | (none) | Novels, prose, history, poetry and drama, etc.
Modern Chinese | Informative | Computer science, economics, education, government, health, history, law, military, news, philosophy, politics, popular science, religion, etc.
Modern Chinese | Imaginative | General fiction, children, detective, drama, history, Kongfu or martial arts, military, prose, literary review and science fiction, etc.

All electronic texts used in this study were collected between