Simple Yet Powerful Native Language Identification on TOEFL11

Ching-Yi Wu, Po-Hsiang Lai, Yang Liu and Vincent Ng.
Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications: Shared Task, pp. 152-156, 2013.

Abstract

Native language identification (NLI) is the task of determining an author's native language from an essay written in a second language. NLI is often treated as a classification problem. In this paper, we use the TOEFL11 data set, which contains more data, in terms of both the number of essays and the number of languages, and is less biased across essay prompts (i.e., topics). We demonstrate that even using only word-level n-grams as features and a support vector machine (SVM) as the classifier can yield nearly 80% accuracy. We observe that the accuracy of a binary word-level n-gram representation (~80%) is much better than that of a frequency-based word-level n-gram representation (~20%). Notably, comparable results can be achieved without removing punctuation marks, suggesting a very simple baseline system for NLI.
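
To make the distinction between the two representations concrete, the following is a minimal sketch, not the authors' implementation. It assumes whitespace tokenization, lowercasing, and n-grams up to length 2, none of which are specified in the abstract; the function names are hypothetical. Punctuation tokens are kept, in line with the observation above. In a full system, either feature dictionary would be vectorized and passed to an SVM classifier.

```python
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return all word n-grams (n = 1..n_max) of a text as strings.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace, so punctuation marks are kept rather than removed.
    """
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def frequency_features(text, n_max=2):
    """Frequency-based representation: n-gram -> number of occurrences."""
    return dict(Counter(word_ngrams(text, n_max)))


def binary_features(text, n_max=2):
    """Binary representation: n-gram -> 1 if it occurs at least once."""
    return {gram: 1 for gram in set(word_ngrams(text, n_max))}


# Toy essay fragment (invented for illustration, not from TOEFL11).
essay = "I think that I agree , because I agree ."
freq = frequency_features(essay)
binary = binary_features(essay)

for gram in ("i", "i agree", ","):
    print(gram, "frequency:", freq[gram], "binary:", binary[gram])
```

Both representations cover the same set of n-grams; they differ only in whether a repeated n-gram contributes its count or a simple presence flag.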

BibTeX entry

@InProceedings{Wu+etal:13a,
  author = {Ching-Yi Wu and Po-Hsiang Lai and Yang Liu and Vincent Ng},
  title = {Simple Yet Powerful Native Language Identification on TOEFL11},
  booktitle = {Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications},
  pages = {152--156},
  year = 2013}