Scoring Coreference Partitions of Predicted Mentions: A Reference Implementation

Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Eduard Hovy, Vincent Ng, and Michael Strube.
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 30-35, 2014.

Abstract

The definitions of two coreference scoring metrics, B^3 and CEAF, are underspecified with respect to predicted, as opposed to key (or gold), mentions. Several variations have been proposed that manipulate the key mentions, the predicted mentions, or both in order to obtain a one-to-one mapping. The metric BLANC, on the other hand, was until recently limited to scoring partitions of key mentions. In this paper, we (i) argue that mention manipulation for scoring predicted mentions is unnecessary, and potentially harmful because it can produce unintuitive results; (ii) illustrate the application of all these measures to scoring predicted mentions; (iii) make available an open-source, thoroughly tested reference implementation of the main coreference evaluation measures; and (iv) rescore the results of the CoNLL-2011/2012 shared task systems with this implementation. This will help the community accurately measure and compare new end-to-end coreference resolution algorithms.

Software

A reference implementation of the most commonly used coreference scoring metrics, applicable to both gold and predicted mentions, is available as open source.
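
To give a rough sense of what scoring predicted mentions without mention manipulation can look like, the following is a minimal Python sketch of B^3 computed directly over the overlaps between key and response entities. It is not the reference implementation's code: the function name b_cubed, the representation of entities as sets of mention identifiers, and the toy partitions are all assumptions made for illustration, and the sketch assumes the overlap-based formulation in which recall sums |K_i ∩ R_j|^2 / |K_i| over all entity pairs and normalizes by the number of key mentions (precision swaps the roles of key and response).

```python
def b_cubed(key, response):
    """Illustrative B^3 over two partitions, each a list of non-empty sets.

    Mentions are arbitrary hashable identifiers. A mention that appears on
    only one side is not added to or removed from the other side; it simply
    contributes no overlap.
    """
    def side_score(normalized_by, other):
        # Total number of mentions on the side being normalized by.
        total = sum(len(entity) for entity in normalized_by)
        if total == 0:
            return 0.0
        # Sum of squared overlap sizes, each divided by the entity size.
        overlap = sum(
            len(entity & candidate) ** 2 / len(entity)
            for entity in normalized_by
            for candidate in other
        )
        return overlap / total

    recall = side_score(key, response)
    precision = side_score(response, key)
    if recall + precision == 0:
        return recall, precision, 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1


# Hypothetical toy partitions: the response misses mention "e" and
# introduces mentions "h" and "i" that are absent from the key.
key_entities = [{"a", "b", "c"}, {"d", "e", "f", "g"}]
response_entities = [{"a", "b"}, {"c", "d"}, {"f", "g", "h", "i"}]

recall, precision, f1 = b_cubed(key_entities, response_entities)
print(f"R={recall:.4f} P={precision:.4f} F1={f1:.4f}")
```

In this sketch the missed key mention lowers recall and the two spurious response mentions lower precision, without either partition being altered to force a one-to-one mention mapping.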

BibTeX entry

@InProceedings{Pradhan+etal:14a,
  author = {Sameer Pradhan and Xiaoqiang Luo and Marta Recasens and Eduard Hovy and Vincent Ng and Michael Strube},
  title = {Scoring Coreference Partitions of Predicted Mentions: A Reference Implementation},
  booktitle = {Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
  pages = {30--35}, 
  year = 2014}
