Unsupervised Word Segmentation for Bangla

Sajib Dasgupta and Vincent Ng.
Proceedings of the Fifth International Conference on Natural Language Processing, pp. 15-24, 2007.

Abstract

Unsupervised word segmentation is the task of segmenting words into prefixes, suffixes and roots without prior knowledge of language-specific morphotactics and morpho-phonological rules. This paper introduces a simple yet highly effective algorithm for unsupervised word segmentation for Bangla, an Indo-Aryan language that is highly inflectional in nature. When evaluated on a set of 2511 human-segmented Bangla words, our algorithm achieves an F-score of 84%, substantially outperforming Linguistica, one of the most widely used unsupervised morphological analyzers, by about 23%.

BibTeX entry

@InProceedings{Dasgupta+Ng:07c,
  author = {Sajib Dasgupta and Vincent Ng},
  title = {Unsupervised Word Segmentation for {Bangla}},
  booktitle = {Proceedings of ICON},
  pages = {15--24},
  year = 2007
}