Unsupervised Morphological Parsing of Bengali
Sajib Dasgupta and Vincent Ng.
Language Resources and Evaluation: Special Double-Issue on Asian
Language Processing, 2007.
Abstract
Unsupervised morphological analysis is the task of segmenting words into
prefixes, suffixes and roots without prior knowledge of language-specific
morphotactics and morpho-phonological rules. This paper introduces a simple
yet highly effective algorithm for unsupervised morphological learning for
Bengali, an Indo-Aryan language that is highly inflectional in nature.
When evaluated on a set of 4110 human-segmented Bangla words, our algorithm
achieves an F-score of 83%, substantially outperforming Linguistica, one of
the most widely used unsupervised morphological analyzers, by about 23%.
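As an illustration of how a segmentation F-score of this kind can be computed, the sketch below scores predicted morpheme boundaries against human-segmented ones. The abstract does not state the exact scoring procedure, so the boundary-level convention, the `+` separator, the function names, and the English toy words are all assumptions made for this example rather than the paper's evaluation setup.

```python
def boundaries(segmentation: str, sep: str = "+") -> set[int]:
    """Return the character offsets at which a segmented word is split.

    The `sep` marker between morphemes is an assumed notation.
    """
    cuts, offset = set(), 0
    # Every morpheme except the last one ends at a boundary.
    for morpheme in segmentation.split(sep)[:-1]:
        offset += len(morpheme)
        cuts.add(offset)
    return cuts


def boundary_f_score(gold: list[str], predicted: list[str]) -> float:
    """Boundary-level F-score over parallel lists of segmented words.

    This is one common scoring convention, assumed here for illustration.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g_cuts, p_cuts = boundaries(g), boundaries(p)
        tp += len(g_cuts & p_cuts)   # boundaries found in both
        fp += len(p_cuts - g_cuts)   # predicted but not in the gold standard
        fn += len(g_cuts - p_cuts)   # in the gold standard but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy English data, not taken from the paper's dataset.
gold = ["un+happi+ness", "walk+ed"]
predicted = ["un+happiness", "walk+ed"]
score = boundary_f_score(gold, predicted)
```

In this toy case the prediction misses one of the three gold boundaries and proposes no spurious ones, so precision is 1 and recall is 2/3.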
Dataset
The Bengali dataset used in this paper is available from here.