Language varieties

Abstract

Part-of-speech tagging, like any supervised statistical NLP task, is more difficult when test sets differ substantially from training sets, for example when tagging across genres or language varieties. We examined the problem of POS tagging across different varieties of Mandarin Chinese. An analytic study first showed that unknown words were a major source of difficulty in cross-variety tagging. Unknown words in English tend to be proper nouns. By contrast, we found that Mandarin unknown words were mostly common nouns and verbs. We showed that this difference is caused by the high frequency of morphological compounding in Mandarin; in this sense Mandarin is more like German than English. Based on this analysis, we propose a variety of new morphological unknown-word features for POS tagging, extending earlier work by others on unknown-word tagging in English and German. Our features were implemented in a maximum entropy Markov model. Our system achieves state-of-the-art performance in Mandarin tagging, including improving unknown-word tagging accuracy on unseen varieties in Chinese Treebank 5.0 from 61% to 80%.
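The abstract does not list the individual features, so the following is only a minimal sketch of what character-level morphological features for an unknown Mandarin word might look like when fed to a maximum entropy Markov model. The function name `unknown_word_features`, the feature templates (character prefixes and suffixes, capped word length, character membership), and the `max_affix` parameter are all assumptions made for illustration, not the authors' actual feature set.

```python
from typing import Dict


def unknown_word_features(word: str, prev_tag: str, max_affix: int = 2) -> Dict[str, float]:
    """Build a sparse binary feature map for one (possibly unknown) word.

    Hypothetical templates, chosen only to illustrate the idea that the
    characters inside a compound carry information about its tag.
    """
    feats: Dict[str, float] = {
        "bias": 1.0,
        # Conditioning on the previous tag is what makes the local
        # classifier usable inside a Markov model over tag sequences.
        f"prev_tag={prev_tag}": 1.0,
        # Word length in characters, capped so long words share a feature.
        f"length={min(len(word), 4)}": 1.0,
    }
    # Character prefixes and suffixes up to max_affix characters.
    for k in range(1, min(max_affix, len(word)) + 1):
        feats[f"prefix{k}={word[:k]}"] = 1.0
        feats[f"suffix{k}={word[-k:]}"] = 1.0
    # Bag of characters: which characters occur anywhere in the word.
    for ch in word:
        feats[f"has_char={ch}"] = 1.0
    return feats


if __name__ == "__main__":
    # A three-character compound; "VV" is used here as an example previous tag.
    for name in sorted(unknown_word_features("研究生", "VV")):
        print(name)
```

In a maximum entropy Markov model, a feature map of this kind would be computed at each position and passed to a multinomial logistic regression classifier that predicts the current tag given the previous one. Which templates actually help is an empirical question that the sketch does not address.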

Analytics

Added to PP
2010-12-22

