Unsupervised Word Segmentation for Bangla
Sajib Dasgupta and Vincent Ng.
Proceedings of the Fifth International Conference on Natural Language Processing, pp. 15-24, 2007.
Abstract
Unsupervised word segmentation is the task of segmenting words into prefixes,
suffixes and roots without prior knowledge of language-specific morphotactics
and morpho-phonological rules. This paper introduces a simple yet highly
effective algorithm for unsupervised word segmentation for Bangla, an
Indo-Aryan language that is highly inflectional in nature.
When evaluated on a set of 2511 human-segmented Bangla words, our algorithm
achieves an F-score of 84%, substantially outperforming Linguistica, one of
the most widely used unsupervised morphological analyzers, by about 23%.
BibTeX entry
@InProceedings{Dasgupta+Ng:07c,
author = {Sajib Dasgupta and Vincent Ng},
title = {Unsupervised Word Segmentation for {Bangla}},
booktitle = {Proceedings of ICON},
pages = {15--24},
year = 2007
}