Unsupervised Morphological Parsing of Bengali
Sajib Dasgupta and Vincent Ng.
Language Resources and Evaluation: Special Double-Issue on Asian Language
Processing, pp. 311-330, 2006.
The paper is available in PostScript, PDF, and HTML versions.
Abstract
Unsupervised morphological analysis is the task of segmenting words into
prefixes, suffixes and roots without prior knowledge of language-specific
morphotactics
and morpho-phonological rules. This paper introduces a simple, yet highly
effective algorithm for unsupervised morphological learning for Bengali, an
Indo-Aryan language that is highly inflectional in nature.
When evaluated on a set of 4110 human-segmented Bengali words, our algorithm
achieves an F-score of 83%, substantially outperforming Linguistica, one of
the most widely used unsupervised morphological analyzers, by about 23%.
Dataset
The Bengali dataset used in this paper is available from
this page.
Software
Our unsupervised morphological segmenter, UnDivide++, is freely available. Try it out and give us your feedback!
BibTeX entry
@Article{Dasgupta+Ng:06a,
author = {Sajib Dasgupta and Vincent Ng},
title = {Unsupervised Morphological Parsing of {Bengali}},
journal = {Language Resources and Evaluation},
volume = 40,
number = {3--4},
pages = {311--330},
year = 2006
}
Of related interest:
The guest editors'
introduction to the special issue,
which gives an overview of the state of the art in Asian language processing
circa 2006.