Unsupervised Word Segmentation for Bangla
Sajib Dasgupta and Vincent Ng.
Proceedings of the Fifth International Conference on Natural Language Processing, pp. 15-24, 2007.
Click here for the PostScript or PDF version.
The talk slides are available here.
Abstract
Unsupervised word segmentation is the task of segmenting words into prefixes,
suffixes and roots without prior knowledge of language-specific morphotactics
and morpho-phonological rules. This paper introduces a simple yet highly
effective algorithm for unsupervised word segmentation for Bangla, an
Indo-Aryan language that is highly inflectional in nature.
When evaluated on a set of 2511 human-segmented Bangla words, our algorithm
achieves an F-score of 84%, substantially outperforming Linguistica, one of
the most widely used unsupervised morphological analyzers, by about 23%.
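To make the evaluation measure concrete, the following is a minimal sketch of one common way to score a segmenter against human-segmented words: a boundary-level F-score. It is an illustration only. The abstract does not state exactly how the F-score is computed, so the boundary-based definition, the `+` separator, the function names `boundaries` and `f_score`, and the English toy words are all assumptions made for this example, not the paper's evaluation code or data.

```python
def boundaries(segmentation: str, sep: str = "+") -> set[int]:
    """Return the character offsets at which a segmented word is split.

    Example: "un+lock+ed" has boundaries after 2 and 6 characters.
    """
    positions = set()
    offset = 0
    # Every morpheme except the last one ends at a boundary.
    for morpheme in segmentation.split(sep)[:-1]:
        offset += len(morpheme)
        positions.add(offset)
    return positions


def f_score(gold: list[str], predicted: list[str], sep: str = "+") -> float:
    """Boundary-level F-score over paired gold and predicted segmentations.

    Assumed definition (not taken from the paper): precision is the share
    of predicted boundaries that are in the gold standard, recall is the
    share of gold boundaries that were predicted.
    """
    correct = proposed = expected = 0
    for gold_seg, pred_seg in zip(gold, predicted):
        gold_b = boundaries(gold_seg, sep)
        pred_b = boundaries(pred_seg, sep)
        correct += len(gold_b & pred_b)
        proposed += len(pred_b)
        expected += len(gold_b)
    if correct == 0:
        return 0.0
    precision = correct / proposed
    recall = correct / expected
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data (English, for readability only).
gold = ["un+lock+ed", "walk+ing", "cat"]
predicted = ["un+locked", "walk+ing", "ca+t"]

score = f_score(gold, predicted)
print(f"{score:.3f}")  # prints 0.667
```

In this toy case the segmenter proposes three boundaries, two of which are correct, and the gold standard contains three, so precision and recall are both 2/3.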
Software
Our unsupervised morphological segmenter, UnDivide++, is freely available. Try it out and give us your feedback!
BibTeX entry
@InProceedings{Dasgupta+Ng:07c,
author = {Sajib Dasgupta and Vincent Ng},
title = {Unsupervised Word Segmentation for {Bangla}},
booktitle = {Proceedings of the 5th International Conference on Natural Language Processing},
pages = {15--24},
year = 2007
}