High-Performance, Language-Independent Morphological Segmentation

Sajib Dasgupta and Vincent Ng.
NAACL HLT 2007: Proceedings of the Main Conference, pp. 155-163, 2007.

The paper is available in PostScript and PDF versions. The talk slides are also available.

Abstract

This paper introduces an unsupervised morphological segmentation algorithm that shows robust performance for four languages with different levels of morphological complexity. In particular, our algorithm outperforms Goldsmith's Linguistica and Creutz and Lagus's Morfessor for English and Bengali, and achieves performance that is comparable to the best results for all three PASCAL evaluation datasets. Improvements arise from (1) the use of relative corpus frequency and suffix-level similarity for detecting incorrect morpheme attachments and (2) the induction of orthographic rules and allomorphs for segmenting words where roots exhibit spelling changes during morpheme attachment.
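
As a rough illustration of improvement (1), the idea of using relative corpus frequency to flag an incorrect morpheme attachment can be sketched as a ratio test between a word and its proposed root. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `plausible_attachment`, the `max_ratio` threshold, and the toy counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def plausible_attachment(word, root, freq, max_ratio=1.0):
    """Return True if `word` plausibly derives from `root`.

    Assumption (illustrative only): a genuinely affixed form tends to be
    no more frequent in the corpus than its root, so a word/root frequency
    ratio above `max_ratio` marks the attachment as suspicious.
    """
    root_count = freq.get(root, 0)
    if root_count == 0:
        # The proposed root never occurs on its own: no evidence for it.
        return False
    return freq.get(word, 0) / root_count <= max_ratio


# Toy counts invented for this example; they are not taken from any corpus.
freq = Counter({"walk": 900, "walked": 300, "candid": 20, "candidate": 500})

print(plausible_attachment("walked", "walk", freq))
print(plausible_attachment("candidate", "candid", freq))
```

With these toy counts, "walked" is accepted as "walk" + "ed" because it is rarer than its root, while "candidate" is rejected as "candid" + "ate" because it is far more frequent than the proposed root. A fuller system would combine such a test with other evidence, such as the suffix-level similarity mentioned above.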

Dataset

The Bengali dataset used in this paper is available from this page.

Software

Our unsupervised morphological segmenter, UnDivide++, is freely available. Try it out and give us your feedback!

BibTeX entry

@InProceedings{Dasgupta+Ng:07a,
  author = {Sajib Dasgupta and Vincent Ng},
  title = {High-Performance, Language-Independent Morphological Segmentation},
  booktitle = {NAACL HLT 2007: Proceedings of the Main Conference},
  pages = {155--163},
  year = {2007}
}