Unsupervised Morphological Parsing of Bengali

Sajib Dasgupta and Vincent Ng.
Language Resources and Evaluation: Special Double-Issue on Asian Language Processing, pp. 311-330, 2006.

The paper is available in PostScript, PDF, and HTML versions.

Abstract

Unsupervised morphological analysis is the task of segmenting words into prefixes, suffixes and roots without prior knowledge of language-specific morphotactics and morpho-phonological rules. This paper introduces a simple yet highly effective algorithm for unsupervised morphological learning for Bengali, an Indo-Aryan language that is highly inflectional in nature. When evaluated on a set of 4110 human-segmented Bengali words, our algorithm achieves an F-score of 83%, substantially outperforming Linguistica, one of the most widely used unsupervised morphological analyzers, by about 23%.
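
For readers unfamiliar with how a segmentation F-score can be computed, the following is a minimal sketch in Python. It assumes a boundary-level definition (a predicted morpheme boundary counts as correct when it falls at the same character offset as a gold boundary), which may differ from the exact scoring protocol used in the paper. The function names, the "+" separator, and the English toy words are hypothetical and are not taken from the paper or its dataset.

```python
def boundaries(segmentation, sep="+"):
    """Return the character offsets (in the unsegmented word) of morpheme boundaries."""
    cuts = set()
    offset = 0
    morphemes = segmentation.split(sep)
    # Every morpheme except the last one is followed by a boundary.
    for morpheme in morphemes[:-1]:
        offset += len(morpheme)
        cuts.add(offset)
    return cuts


def segmentation_f_score(gold, predicted, sep="+"):
    """Boundary-level precision, recall and F-score over parallel word lists.

    Assumption: gold[i] and predicted[i] segment the same word.
    """
    tp = fp = fn = 0
    for gold_seg, pred_seg in zip(gold, predicted):
        gold_cuts = boundaries(gold_seg, sep)
        pred_cuts = boundaries(pred_seg, sep)
        tp += len(gold_cuts & pred_cuts)   # boundaries found in both
        fp += len(pred_cuts - gold_cuts)   # spurious predicted boundaries
        fn += len(gold_cuts - pred_cuts)   # gold boundaries that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy data, not drawn from the Bengali test set.
    gold = ["un+break+able", "walk+ed", "cats"]
    predicted = ["un+breakable", "walk+ed", "cat+s"]
    p, r, f = segmentation_f_score(gold, predicted)
    print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

In this toy run, two of the three predicted boundaries match the gold standard and one of the three gold boundaries is missed, so precision, recall and F-score all come out to 2/3.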

Dataset

The Bengali dataset used in this paper is available from this page.

Software

Our unsupervised morphological segmenter, UnDivide++, is freely available. Try it out and give us your feedback!

BibTeX entry

@Article{Dasgupta+Ng:06a,
  author = {Sajib Dasgupta and Vincent Ng},
  title = {Unsupervised Morphological Parsing of {Bengali}},
  journal = {Language Resources and Evaluation},
  volume = 40,
  number = {3--4},
  pages = {311--330},
  year = 2006
}

Of related interest:

The guest editors' introduction to the special issue, which gives an overview of the state of the art in Asian language processing circa 2006.