Editor: hyszqmzc | 2017-09-18
2 Evaluation

Our system's input is Chinese. The output is a string of Chinese characters that approximate English sounds, which we call Chinglish. We build several candidate Chinese-to-Chinglish systems and evaluate them as follows:

• We compute the normalized edit distance between the system's output and a human-generated Chinglish reference.
• A Chinese speaker pronounces the system's output out loud, and an English listener takes dictation. We measure the normalized edit distance against an English reference.
• We automate the previous evaluation by replacing the two humans with: (1) a Chinese speech synthesizer, and (2) an English speech recognizer.
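To make the metric concrete, the following is a minimal Python sketch of a normalized edit distance. It assumes token-level Levenshtein distance divided by the reference length; the text above does not spell out the normalization, so that choice, the function names, and the toy syllable sequences are illustrative assumptions rather than the exact procedure used.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the hypothesis prefix seen so far
    # and the first j tokens of the reference.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(hyp, ref):
    """Edit distance divided by reference length (assumed normalization)."""
    if len(ref) == 0:
        raise ValueError("reference must be non-empty")
    return edit_distance(hyp, ref) / len(ref)


# Hypothetical example: system output vs. reference, as pinyin syllables.
system_output = ["kan", "nai"]
reference = ["kan", "ai"]
score = normalized_edit_distance(system_output, reference)
```

The same function could be applied to any of the three evaluations by changing the tokens, for instance Chinglish syllables for the first and English words for the dictation-based ones.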
3 Data

We seek to imitate phonetic transformations found in phrasebooks, so phrasebooks themselves are a good source of training data. We obtained a collection of 1312 phrasebook tuples¹ (see Table 1). We use 1182 utterances for training, 65 for development, and 65 for test. We know of no other computational work on this type of corpus.

Our Chinglish has interesting gross empirical properties. First, because Chinglish and Chinese are written with the same characters, they render the same inventory of 416 distinct syllables. However, the distribution of Chinglish syllables differs a great deal from Chinese (Table 2). Syllables si and te are very popular, because while consonant clusters like English st are impossible to reproduce exactly, the particular vowels in si and te are fortunately very weak.

Frequency Rank   Chinese   Chinglish
1                de        si
2                shi       te
3                yi        de
4                ji        yi
5                zhi       fu

Table 2: Top 5 frequent syllables in Chinese (McEnery and Xiao, 2004) and Chinglish

We find that multiple occurrences of an English word type are generally associated with the same Chinglish sequence. Also, Chinglish characters do not generally span multiple English words. It is reasonable for can I to be rendered as kan nai, with nai spanning both English words, but this is rare.

¹ Dataset can be found at http://www.isi.edu/natural-language/mt/chinglish-data.txt
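A frequency ranking like the one in Table 2 can be obtained by counting syllables over a corpus. The sketch below is only an illustration: it assumes each utterance is already available as a space-separated pinyin string, and the toy utterances and the function name `top_syllables` are hypothetical, not taken from the dataset.

```python
from collections import Counter


def top_syllables(utterances, n=5):
    """Return the n most frequent syllables with their counts."""
    counts = Counter(
        syllable
        for utterance in utterances
        for syllable in utterance.split()
    )
    return counts.most_common(n)


# Hypothetical toy corpus of pinyin-transcribed Chinglish utterances.
toy_corpus = ["si te", "si de", "te si"]
ranking = top_syllables(toy_corpus, n=2)
```

Running the same count over a Chinese corpus and over the Chinglish side of the phrasebook data would give the two columns being compared.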
4 Model

We model Chinese-to-Chinglish translation with a cascade of weighted finite-state transducers (wFST), shown in Figure 2. We use an online MT system to convert Chinese to an English word sequence (Eword), which is then passed through FST A to generate an English sound sequence (Epron). FST A is constructed from the CMU Pronouncing Dictionary (Weide, 2007).

Next, wFST B translates English sounds into Chinese sounds (Pinyin-split). Pinyin is an official syllable-based romanization of Mandarin Chinese characters, and Pinyin-split is a standard separation of Pinyin syllables into initial and final parts. Our wFST allows one English sound token to map Figure 2: Finite-st........