Unsupervised Morphological Segmentation Dataset

This page distributes the unsupervised word segmentation data for Bengali. The data set was introduced in the following papers:

Unsupervised Word Segmentation for Bangla. Sajib Dasgupta and Vincent Ng. To appear in the 5th International Conference on Natural Language Processing (ICON), Hyderabad, India, 4-6 January 2007.

Unsupervised Morphological Parsing of Bengali. Sajib Dasgupta and Vincent Ng. To appear in the Journal of Language Resources and Evaluation (LRE) 2007, published by Springer.

Here are the files:

Bengali Dataset (Transliterated): These are the 4,110 instances used in the LRE paper. The first 2,511 instances are those used in the ICON paper.

Bengali Dataset (Original): These are the same 4,110 instances used in the LRE paper, in Bengali script.

Mapping: The transliteration scheme we used to map Bengali characters to English (Latin) letters.
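
Since the ICON instances are described as the leading portion of the LRE file, the two sets can be recovered from a single download. The following is a minimal sketch, not part of the original distribution. It assumes one instance per non-empty line, which this page does not confirm, and the names split_instances, LRE_SIZE, and ICON_SIZE are hypothetical.

```python
from typing import List, Tuple

LRE_SIZE = 4110   # instances in the full file, per the file description
ICON_SIZE = 2511  # leading instances used in the ICON paper


def split_instances(lines: List[str]) -> Tuple[List[str], List[str]]:
    """Return (icon_subset, full_lre_set) from the raw lines of a dataset file.

    ASSUMPTION: one instance per non-empty line. The real file layout
    may differ, in which case the parsing step below must be adapted.
    """
    instances = [line.strip() for line in lines if line.strip()]
    if len(instances) != LRE_SIZE:
        raise ValueError(
            f"expected {LRE_SIZE} instances, found {len(instances)}"
        )
    # The ICON subset is the first 2511 instances of the LRE set.
    return instances[:ICON_SIZE], instances


# Example usage (the filename is a placeholder, not the actual file name):
#   with open("bengali_transliterated.txt", encoding="utf-8") as f:
#       icon_set, lre_set = split_instances(f.readlines())
```

The length check is a deliberate guard: if the file does not yield exactly 4110 instances under the one-per-line assumption, the sketch fails loudly rather than silently returning a wrong subset.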