Unsupervised Word Segmentation for Bangla

Sajib Dasgupta and Vincent Ng.
Fifth International Conference on Natural Language Processing, 2007.

Abstract

Unsupervised word segmentation is the task of segmenting words into prefixes, suffixes, and roots without prior knowledge of language-specific morphotactics or morpho-phonological rules. This paper introduces a simple yet highly effective algorithm for unsupervised word segmentation of Bangla, a highly inflectional Indo-Aryan language. When evaluated on a set of 2511 human-segmented Bangla words, our algorithm achieves an F-score of 84%, substantially outperforming Linguistica, one of the most widely used unsupervised morphological analyzers, by about 23%.
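
The abstract does not state how the F-score is computed. As a rough illustration only, the sketch below assumes one common convention: precision and recall are measured over morpheme boundaries, comparing each predicted segmentation with the human segmentation of the same word. The function names and the English toy words are hypothetical and are not taken from the paper or its data set.

```python
def boundaries(segments):
    """Return the set of internal cut positions of a segmented word.

    ["un", "break", "able"] -> {2, 7}
    """
    cuts, pos = set(), 0
    for seg in segments[:-1]:  # the end of the last segment is not a cut
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_f_score(gold, predicted):
    """Precision, recall, and F-score over boundaries (assumed metric)."""
    tp = fp = fn = 0
    for gold_segs, pred_segs in zip(gold, predicted):
        gold_cuts, pred_cuts = boundaries(gold_segs), boundaries(pred_segs)
        tp += len(gold_cuts & pred_cuts)   # boundaries found correctly
        fp += len(pred_cuts - gold_cuts)   # spurious boundaries
        fn += len(gold_cuts - pred_cuts)   # missed boundaries
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Toy data for illustration only (not from the paper's evaluation set).
gold = [["un", "break", "able"], ["walk", "ed"]]
predicted = [["un", "breakable"], ["walk", "ed"]]
precision, recall, f_score = boundary_f_score(gold, predicted)
print(precision, recall, f_score)
```

In this toy case the predicted segmentation misses one of three gold boundaries and proposes no wrong ones, so precision is perfect while recall is two thirds. Other conventions, such as scoring whole-word exact matches, would give different numbers for the same output.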