Multi-clustering Datasets

This page is a distribution site for the Multi-clustering datasets, a subset of which were introduced in the following papers:

Mining Clustering Dimensions.
Sajib Dasgupta and Vincent Ng.
In the Proceedings of the International Conference on Machine Learning (ICML), 2010.

Towards Subjectifying Text Clustering.
Sajib Dasgupta and Vincent Ng.
In the Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2010.


While traditional work on text clustering has largely focused on grouping documents by topic, a user may want to cluster documents along other dimensions, such as the author's mood, gender, age, or sentiment. Users often have a single clustering along a particular dimension in mind, but knowing that there are 'alternative' ways to cluster the data can give them valuable insights that they would otherwise miss.

Motivated in part by this observation, we take a multifaceted approach to document annotation: we annotate a set of documents across multiple dimensions, where each dimension represents a particular classification structure along which the document set can be meaningfully categorized.

We use the annotations as a gold standard to evaluate an alternative (or multi-) clustering system, which seeks to organize, or cluster, a set of text documents along multiple dimensions.

For example, given a collection of reviews, we annotate it along the following four dimensions:

  1. Sentiment:
    Classify a review as positive (thumbs up) or negative (thumbs down).

  2. Topic:
    Classify a review according to the product description or the topic it pertains to. For example, classify a review according to whether it reviews a book, a movie, or an electronic product.

  3. Subjectivity:
    Classify a review according to whether the review contains mostly a narrative description of the product and is therefore largely "objective", or whether it contains mostly the author's opinion and is therefore largely "subjective".

  4. Strength:
    Classify a review according to whether the opinion expressed in a review is "strong" or "weak".

Thus a particular review labeled as {"Positive", "Movie", "Subjective", "Strong"} conveys four different pieces of information to the end user: the review expresses positive sentiment, it is about a movie, it is mostly subjective, and the opinion it expresses is strong.
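
As a rough illustration of how such multi-dimensional annotations could serve as a gold standard, the sketch below stores one label per dimension for each review, derives the gold clustering along a chosen dimension, and scores a predicted clustering against it. The review ids, the dictionary layout, and the function names are hypothetical and do not reflect the file format of the distributed datasets. Purity is used here only as a simple stand-in measure; it is not necessarily the evaluation metric used in the papers listed above.

```python
from collections import Counter, defaultdict

# Hypothetical annotated reviews: one gold label per dimension.
reviews = [
    {"id": "r1", "Sentiment": "Positive", "Topic": "Movie",
     "Subjectivity": "Subjective", "Strength": "Strong"},
    {"id": "r2", "Sentiment": "Negative", "Topic": "Book",
     "Subjectivity": "Objective", "Strength": "Weak"},
    {"id": "r3", "Sentiment": "Positive", "Topic": "Book",
     "Subjectivity": "Subjective", "Strength": "Weak"},
    {"id": "r4", "Sentiment": "Negative", "Topic": "Movie",
     "Subjectivity": "Objective", "Strength": "Strong"},
]


def gold_clustering(docs, dimension):
    """Group document ids by their gold label along one dimension."""
    clusters = defaultdict(list)
    for doc in docs:
        clusters[doc[dimension]].append(doc["id"])
    return dict(clusters)


def purity(predicted, docs, dimension):
    """Score a predicted clustering against one gold dimension.

    `predicted` maps document id -> cluster id. Each cluster is credited
    with the count of its most frequent gold label; the total is divided
    by the number of clustered documents.
    """
    gold = {doc["id"]: doc[dimension] for doc in docs}
    labels_by_cluster = defaultdict(list)
    for doc_id, cluster_id in predicted.items():
        labels_by_cluster[cluster_id].append(gold[doc_id])
    matched = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return matched / len(predicted)


# A hypothetical system output that happens to split the reviews by sentiment.
predicted = {"r1": 0, "r3": 0, "r2": 1, "r4": 1}

for dim in ("Sentiment", "Topic", "Subjectivity", "Strength"):
    print(dim, purity(predicted, reviews, dim))
```

In this toy data the same predicted clustering agrees fully with the Sentiment annotation but only partially with the Topic annotation, which is why each dimension needs its own gold-standard labels.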


Annotation guidelines and other details regarding the clustering dimensions for each dataset can be found in the papers listed above. When referring to the multi-clustering datasets and the corresponding annotations, please cite the ICML paper above. We collected the datasets from numerous sources; each source is listed, along with a reference paper and a web source, in the corresponding folder. Please also cite the relevant source(s) when referring to the dataset(s).


Datasets:

Download Multi-clustering datasets

Directions to use the datasets:

Guideline