Unsupervised Morphological Parsing of Bengali

Sajib Dasgupta and Vincent Ng.
Language Resources and Evaluation: Special Double-Issue on Asian Language Processing, 2007.

Abstract

Unsupervised morphological analysis is the task of segmenting words into prefixes, suffixes and roots without prior knowledge of language-specific morphotactics and morpho-phonological rules. This paper introduces a simple yet highly effective algorithm for unsupervised morphological learning for Bengali, an Indo-Aryan language that is highly inflectional in nature. When evaluated on a set of 4110 human-segmented Bengali words, our algorithm achieves an F-score of 83%, substantially outperforming Linguistica, one of the most widely used unsupervised morphological analyzers, by about 23%.
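
As an illustration of how an F-score over segmented words can be computed, the following is a minimal sketch that assumes scoring is done on morpheme boundaries and that segmentations are written with a "+" separator. Both are assumptions made here for illustration; the abstract does not specify the exact scoring scheme used in the paper, and the function names and the toy English words are hypothetical.

```python
def boundaries(segmentation, sep="+"):
    """Return the character offsets (in the unsegmented word) of morpheme boundaries."""
    cuts = set()
    offset = 0
    # Every morpheme except the last one is followed by a boundary.
    for morpheme in segmentation.split(sep)[:-1]:
        offset += len(morpheme)
        cuts.add(offset)
    return cuts


def segmentation_f_score(predicted, gold):
    """Boundary-level precision, recall and F-score (an assumed scoring scheme)."""
    tp = fp = fn = 0
    for pred_seg, gold_seg in zip(predicted, gold):
        pred_cuts = boundaries(pred_seg)
        gold_cuts = boundaries(gold_seg)
        tp += len(pred_cuts & gold_cuts)   # boundaries found in both
        fp += len(pred_cuts - gold_cuts)   # spurious predicted boundaries
        fn += len(gold_cuts - pred_cuts)   # gold boundaries that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data, not taken from the paper's dataset.
gold = ["un+lock+ed", "walk+ing", "cat"]
predicted = ["un+locked", "walk+ing", "ca+t"]
precision, recall, f_score = segmentation_f_score(predicted, gold)
print(precision, recall, f_score)
```

In this toy example, two of the three predicted boundaries match the gold segmentation and one of the three gold boundaries is missed, so precision, recall and F-score are each 2/3.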

Dataset

The Bengali dataset used in this paper is available from here.