Weakly Supervised Part-of-Speech Tagging for Morphologically-Rich, Resource-Scarce Languages

Kazi Saidul Hasan and Vincent Ng.
Proceedings of the Twelfth Conference of the European Chapter of the Association for Computational Linguistics, pp. 363-371, 2009.

Click here for the PostScript or PDF version. The talk slides are available here.

Abstract

This paper examines unsupervised approaches to part-of-speech (POS) tagging for morphologically-rich, resource-scarce languages, with an emphasis on Goldwater and Griffiths's (2007) nonparametric fully-Bayesian approach originally developed for English POS tagging. We argue that existing unsupervised POS taggers unrealistically assume as input a perfect POS lexicon, and consequently, we propose a weakly supervised fully-Bayesian approach to POS tagging, which relaxes the unrealistic assumption by automatically acquiring the lexicon from a small amount of POS-tagged data. Since such relaxation comes at the expense of a drop in tagging accuracy, we propose two extensions to the Bayesian framework and demonstrate that they are effective in improving a fully-Bayesian POS tagger for Bengali, our representative morphologically-rich, resource-scarce language.

BibTeX entry

@InProceedings{Hasan+Ng:09a,
  author = {Hasan, Kazi Saidul and Vincent Ng},
  title = {Weakly Supervised Part-of-Speech Tagging for Morphologically-Rich, Resource-Scarce Languages},
  booktitle = {Proceedings of the 12th Conference on the European Chapter of the Association for Computational Linguistics},
  pages = {363--371},
  year = 2009
}

poster