Unsupervised Word Segmentation for Bangla

Sajib Dasgupta and Vincent Ng.
Proceedings of the Fifth International Conference on Natural Language Processing, pp. 15-24, 2007.

Abstract

Unsupervised word segmentation is the task of segmenting words into prefixes, suffixes and roots without prior knowledge of language-specific morphotactics and morpho-phonological rules. This paper introduces a simple yet highly effective algorithm for unsupervised word segmentation for Bangla, an Indo-Aryan language that is highly inflectional in nature. When evaluated on a set of 2511 human-segmented Bangla words, our algorithm achieves an F-score of 84%, substantially outperforming Linguistica, one of the most widely used unsupervised morphological analyzers, by about 23%.
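
For readers unfamiliar with how a segmentation F-score can be computed, the following is a minimal sketch in Python. It scores predicted segmentations against gold segmentations by comparing internal split positions. This boundary-based scoring scheme, the function names, and the English toy words are assumptions made for illustration; they are not taken from the paper, whose exact evaluation protocol may differ.

```python
def boundaries(segments):
    """Return the set of internal split positions of a segmented word.

    Example: ["un", "break", "able"] has splits after 2 and 7 characters.
    """
    positions = set()
    offset = 0
    for seg in segments[:-1]:  # the end of the last segment is not a split
        offset += len(seg)
        positions.add(offset)
    return positions


def boundary_f_score(gold, predicted):
    """Micro-averaged F-score over split positions (illustrative metric)."""
    tp = fp = fn = 0
    for gold_segs, pred_segs in zip(gold, predicted):
        gold_b = boundaries(gold_segs)
        pred_b = boundaries(pred_segs)
        tp += len(gold_b & pred_b)   # splits found in both
        fp += len(pred_b - gold_b)   # predicted splits not in gold
        fn += len(gold_b - pred_b)   # gold splits that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data (English, for readability only).
gold = [["un", "break", "able"], ["walk", "ed"]]
predicted = [["un", "breakable"], ["walk", "ed"]]
score = boundary_f_score(gold, predicted)
```

In this toy case the prediction recovers two of the three gold splits and proposes no spurious ones, so precision is 1 and recall is 2/3.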

Software

Our unsupervised morphological segmenter, UnDivide++, is freely available. Try it out and give us your feedback!

BibTeX entry

@InProceedings{Dasgupta+Ng:07c,
  author = {Sajib Dasgupta and Vincent Ng},
  title = {Unsupervised Word Segmentation for {Bangla}},
  booktitle = {Proceedings of the 5th International Conference on Natural Language Processing},
  pages = {15--24},
  year = 2007
}