eXtended WordNet  
The University of Texas at Dallas

Semantic annotation of WordNet glosses

Introduction

The Extended WordNet project [1] aims to transform the WordNet glosses into a format that allows the derivation of additional semantic and logic relations. The latest release of Extended WordNet is based on WordNet 2.0 and has three stages: part of speech tagging and parsing, logic form transformation, and semantic disambiguation. This paper presents the semantic disambiguation of the WordNet glosses. The next section presents some statistics on the disambiguation of the WordNet glosses, the second section describes the format of the files, and the third section briefly presents the methods used for the semantic annotation.

Statistics

WordNet 2.0 contains a total of 115,424 glosses: 79,689 noun synset glosses, 13,508 verb synset glosses, 18,563 adjective synset glosses and 3,664 adverb synset glosses. To be consistent with the logic form transformation and the parse trees, we removed the examples and the comments in parentheses from each gloss. This left 637,067 open class words. Of these, 160,879 are monosemous, leaving 476,188 words to be disambiguated. To disambiguate these open class words we used both manual and automatic annotation. Automatic annotation was done with two programs: XWN_WSD, designed specifically to disambiguate the WordNet glosses, and an in-house system for WSD of open text. The two systems voted on each word, and we estimate a precision of 90% for the words tagged with the same sense by both systems. The quality of each annotation was classified as "gold" for manually checked words, "silver" for words automatically tagged with the same sense by both disambiguation systems, and "normal" for the remaining words, which were annotated automatically by the XWN_WSD system. Word forms of the verbs "to be" and "to have" were not disambiguated automatically. Table 1 presents the number of open class words in each category, for the set of glosses of each part of speech, in the XWN2.0-1.1 release of XWN.
Set of glosses | Number of glosses | Open class words | Monosemous words | "Gold" words | "Silver" words | "Normal" words
Noun glosses | 79,689 | 505,946 | 138,274 | 10,142 | 45,015 | 296,045
Verb glosses | 13,508 | 48,200 | 6,903 | 2,212 | 5,193 | 30,813
Adjective glosses | 18,563 | 74,108 | 14,142 | 263 | 6,599 | 50,359
Adverb glosses | 3,664 | 8,998 | 1,605 | 1,829 | 385 | 4,920
Table 1. Disambiguated words in each category.
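The figures in Table 1 are easy to inspect with a short script. The sketch below copies the counts from the table and computes, for each part of speech, the share of open class words in each category, plus the column totals. The names TABLE_1, category_shares and column_totals are illustrative and are not part of the XWN distribution.

```python
# Counts copied from Table 1:
# (open class words, monosemous, gold, silver, normal)
TABLE_1 = {
    "noun": (505_946, 138_274, 10_142, 45_015, 296_045),
    "verb": (48_200, 6_903, 2_212, 5_193, 30_813),
    "adjective": (74_108, 14_142, 263, 6_599, 50_359),
    "adverb": (8_998, 1_605, 1_829, 385, 4_920),
}
CATEGORIES = ("monosemous", "gold", "silver", "normal")

# Words left after setting aside the monosemous ones, as stated in the text.
POLYSEMOUS = 637_067 - 160_879


def category_shares(row):
    """Return each category's share of the open class words in one table row."""
    open_class, *counts = row
    return {name: count / open_class for name, count in zip(CATEGORIES, counts)}


def column_totals(table):
    """Sum every column of the table across the four parts of speech."""
    return tuple(sum(column) for column in zip(*table.values()))


if __name__ == "__main__":
    for pos, row in TABLE_1.items():
        shares = category_shares(row)
        print(pos, {name: f"{share:.1%}" for name, share in shares.items()})
    print("column totals:", column_totals(TABLE_1))
    print("polysemous words:", POLYSEMOUS)
```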

File Format

The semantically annotated glosses are released in an XML format. Below is the part of the XML schema definition file that concerns the semantic disambiguation:

<xsd:simpleType name="puncType">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="([^a-zA-Z0-9])+"/>
  </xsd:restriction>
</xsd:simpleType>

<xsd:complexType name="wfType">
  <xsd:simpleContent>
    <xsd:extension base="xsd:string">
      <xsd:attribute name="pos" type="wPosType" use="required"/>
      <xsd:attribute name="lemma" type="xsd:string" use="optional"/>
      <xsd:attribute name="quality" type="qualityType" use="optional" default="normal"/>
      <xsd:attribute name="wnsn" type="senseType" use="optional"/>
    </xsd:extension>
  </xsd:simpleContent>
</xsd:complexType>

<xsd:complexType name="wsdType">
  <xsd:all>
     <xsd:element name="punc" type="puncType" minOccurs="0" maxOccurs="unbounded"/>
     <xsd:element name="wf" type="wfType" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:all>
</xsd:complexType>

<xsd:element name="xwn">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="gloss" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="synonymSet" type="xsd:string"/>
            <xsd:element name="text" type="xsd:string"/>
            <xsd:element name="wsd" type="wsdType"/>
            <xsd:element name="parse" type="parseType" minOccurs="1" maxOccurs="unbounded"/>
            <xsd:element name="lft" type="lftType" minOccurs="1" maxOccurs="unbounded"/>
          </xsd:sequence>
          <xsd:attribute name="synsetID" type="synsetIDType" use="required"/>
          <xsd:attribute name="pos" type="glossPosType" use="required"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
    <xsd:attribute name="ver" type="xsd:string"/>
    <xsd:attribute name="wnver" type="xsd:string"/>
  </xsd:complexType>
</xsd:element>

Each file is enclosed in the <xwn> tag. This tag has the attribute "ver", representing the current release version (2.0-1), and the attribute "wnver", representing the WordNet version (2.0). Each gloss is represented by a <gloss> tag that includes the synonym set, the text of the gloss, the semantic disambiguation of the gloss, the parse tree and the logic form transformation. The semantic disambiguation part is marked by the <wsd> tag and includes words, represented by the <wf> tag, and punctuation, represented by the <punc> tag.
The <punc> tag does not have any attribute.
The tag <wf> contains the following attributes:
  • pos - representing the part of speech as given by the Brill tagger [2]. This attribute is required.
  • quality - representing the quality of the semantic annotation as described above. This attribute can take three values: "gold", "silver" and "normal".
  • lemma - representing the stem of a word in the open class category.
  • wnsn - representing the annotated sense or senses, separated by commas.
The senses stored in the "wnsn" attribute were obtained using several methods of semantic disambiguation. The following section gives an overview of the process of semantic disambiguation of the WordNet glosses.
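As an illustration of how these attributes might be read, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The XML fragment is hypothetical: it was written by hand to match the schema excerpt, and its synset ID, gloss text, part of speech values and sense numbers are invented. The <parse> and <lft> elements are left out for brevity, so the fragment would not validate against the full schema. Treating a <wf> without "wnsn" as unannotated is also an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the schema excerpt. All values are made up
# for illustration; <parse> and <lft> are omitted.
SAMPLE = """
<xwn ver="2.0-1" wnver="2.0">
  <gloss synsetID="00000001" pos="NOUN">
    <synonymSet>example_word</synonymSet>
    <text>a small, domesticated animal</text>
    <wsd>
      <wf pos="DT">a</wf>
      <wf pos="JJ" lemma="small" quality="gold" wnsn="1">small</wf>
      <punc>,</punc>
      <wf pos="JJ" lemma="domesticated" quality="silver" wnsn="1,2">domesticated</wf>
      <wf pos="NN" lemma="animal" wnsn="1">animal</wf>
    </wsd>
  </gloss>
</xwn>
"""


def read_annotations(xml_text):
    """Yield (synset_id, word, lemma, senses, quality) for each sense-tagged <wf>."""
    root = ET.fromstring(xml_text)
    for gloss in root.iter("gloss"):
        for wf in gloss.iter("wf"):
            wnsn = wf.get("wnsn")
            if wnsn is None:
                # Assumption: a <wf> without "wnsn" carries no sense annotation.
                continue
            # "wnsn" may hold several senses separated by commas.
            senses = [sense.strip() for sense in wnsn.split(",")]
            # ElementTree does not apply schema defaults, so the schema's
            # default quality="normal" is supplied by hand.
            quality = wf.get("quality", "normal")
            yield gloss.get("synsetID"), wf.text, wf.get("lemma"), senses, quality


if __name__ == "__main__":
    for annotation in read_annotations(SAMPLE):
        print(annotation)
```

The senses are kept as strings because the definition of senseType is not shown in the schema excerpt.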

Semantic Disambiguation of WordNet glosses

The semantic disambiguation of WordNet glosses consists of two phases:
  1. The first phase is preprocessing, which separates the WordNet glosses into definitions and examples, and performs tokenization, part of speech tagging using Brill's tagger [2], and identification of compound concepts.
  2. The second phase is the actual disambiguation, which assigns the correct sense to each open class word using its part of speech. The senses were assigned using both manual and automatic procedures.
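The preprocessing phase can be pictured with a small sketch. It assumes that examples are the double-quoted segments of a gloss and that compound concepts are found by greedy longest-match lookup in a word list. Both are simplifying assumptions for illustration, not a description of the XWN tools. The gloss, the compound list and the function names are hypothetical, and part of speech tagging is not shown.

```python
import re

EXAMPLE = re.compile(r'"([^"]*)"')   # assumed: examples are double-quoted
COMMENT = re.compile(r"\([^()]*\)")  # comments in parentheses
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*|[^A-Za-z0-9\s]")


def split_gloss(gloss):
    """Return (definition, examples) with quoted examples and comments removed."""
    examples = EXAMPLE.findall(gloss)
    text = COMMENT.sub("", EXAMPLE.sub("", gloss))
    # Normalise whitespace and drop segments left empty by the removals.
    parts = [" ".join(part.split()) for part in text.split(";")]
    return "; ".join(part for part in parts if part), examples


def tokenize(definition):
    """Split a definition into word and punctuation tokens."""
    return TOKEN.findall(definition)


def merge_compounds(tokens, compounds, max_len=3):
    """Greedily join the longest token run listed in `compounds` with underscores."""
    merged, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "_".join(tokens[i:i + n])
            if candidate in compounds:
                merged.append(candidate)
                i += n
                break
        else:  # no compound starts at position i
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    # Hypothetical gloss, not taken from WordNet.
    gloss = 'a motor vehicle (usually gasoline powered) with four wheels; "she drove the car home"'
    definition, examples = split_gloss(gloss)
    print(definition)
    print(examples)
    print(merge_compounds(tokenize(definition), {"motor_vehicle"}))
```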
Human annotators disambiguated the open class words in a set of glosses that serves as a gold standard for checking the accuracy of the disambiguation system. These disambiguated glosses were integrated into the files of this release of the Extended WordNet package.
The disambiguation software is based on several heuristics:
  • The Monosemous Words method identifies all the words with only one sense and marks them with sense #1.
  • The Same Hierarchy method identifies the gloss words belonging to the same hierarchy as the synset of the gloss.
  • The Lexical Parallelism method identifies words with the same part of speech that are separated by commas or conjunctions and, when possible, marks them with senses that belong to the same hierarchy.
  • Given a word in a gloss, the Semcor Bigrams method forms two pairs, one with the previous word and one with the next word, and searches for these pairs in the Semcor corpus [4]. If the given word has the same sense in all occurrences of these pairs, and the number of occurrences exceeds a threshold, that sense is assigned to the word.
  • Given an ambiguous word W in the synset S, the Cross-Reference method looks for a reference to the synset S in the glosses of all the senses of W.
  • The Reversed Cross-Reference method checks whether two words in the gloss belong to the same synset.
  • The Distance among Glosses method determines the number of words that two synsets have in common. For an ambiguous word W in a gloss G, this method selects the sense of W that shares the greatest number of words with the gloss G.
  • Some of the WordNet glosses have an associated domain, written in parentheses. Magnini [3] assigned a domain to all the noun synsets in WordNet. The Common Domain method selects the sense of a word that has the same domain as the synset of the gloss.
  • The Patterns method [5] exploits the idiosyncratic nature of the WordNet glosses by identifying repetitive expressions.
These methods disambiguate 64% of the words in the WordNet glosses with 75% accuracy. The rest of the words were tagged with the first sense.
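To make the idea of chaining heuristics concrete, the following sketch applies two of them in order and falls back to the first sense. It is a simplified stand-in, not the XWN_WSD implementation: only the Monosemous Words method and a plain word-overlap reading of the Distance among Glosses method are shown, and the order of application is an assumption. The toy sense inventory, the stop word list and the glosses are invented for the example.

```python
STOPWORDS = frozenset({"a", "an", "the", "of", "or", "and", "in", "to", "that", "for", "with"})

# Hypothetical inventory: lemma -> glosses of its senses, in sense order.
INVENTORY = {
    "money": ["a medium of exchange"],
    "bank": [
        "sloping land beside a body of water",
        "a financial institution that accepts deposits",
    ],
}


def content_words(text):
    """Lowercased words of `text` that are not stop words."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def disambiguate(word, gloss_text, inventory):
    """Return (sense_number, method) for `word` as it occurs in `gloss_text`."""
    sense_glosses = inventory[word]
    # Monosemous Words: a single sense is always sense #1.
    if len(sense_glosses) == 1:
        return 1, "monosemous"
    # Simplified Distance among Glosses: count the words shared between the
    # gloss being annotated and the gloss of each candidate sense.
    context = content_words(gloss_text) - {word}
    overlaps = [len(context & content_words(gloss)) for gloss in sense_glosses]
    best = max(overlaps)
    if best > 0 and overlaps.count(best) == 1:
        return overlaps.index(best) + 1, "gloss overlap"
    # No heuristic decided: tag the word with the first sense.
    return 1, "first sense"


if __name__ == "__main__":
    gloss = "money kept in a bank as deposits"
    for word in ("money", "bank"):
        print(word, disambiguate(word, gloss, INVENTORY))
    print("bank", disambiguate("bank", "a long ridge or bank", INVENTORY))
```

Returning the name of the method alongside the sense makes it possible to measure how many words each heuristic covers.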
To disambiguate the WordNet glosses we also used a second WSD system, designed for open text. The glosses were transformed into sentences and disambiguated with this system, with 100% coverage and 70% accuracy.
About 10% of the words were tagged with the same sense by both systems; these words have an estimated accuracy of 90%.
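The voting between the two systems and the resulting quality labels can be sketched as follows. The function and parameter names are hypothetical. The rules follow the description given earlier: a manually checked sense is "gold", agreement between the two systems is "silver", and otherwise the XWN_WSD sense is kept as "normal".

```python
from collections import Counter


def label_annotation(xwn_wsd_sense, open_text_sense, checked_sense=None):
    """Return (sense, quality) for one word."""
    if checked_sense is not None:
        # A manually checked sense overrides both automatic systems.
        return checked_sense, "gold"
    if xwn_wsd_sense == open_text_sense:
        # Both systems agree on the sense.
        return xwn_wsd_sense, "silver"
    # On disagreement, keep the sense proposed by the gloss-specific system.
    return xwn_wsd_sense, "normal"


def quality_counts(words):
    """Count quality labels over (xwn_wsd_sense, open_text_sense, checked_sense) triples."""
    return Counter(label_annotation(*word)[1] for word in words)


if __name__ == "__main__":
    # Made-up sense assignments for four words.
    words = [(1, 1, None), (2, 1, None), (3, 3, 2), (1, 4, None)]
    for word in words:
        print(word, "->", label_annotation(*word))
    print(quality_counts(words))
```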

Conclusion

The semantic disambiguation of the WordNet glosses is part of Extended WordNet. The definitions were first separated from comments and examples, tokenized and part of speech tagged. This preprocessing stage resulted in 637,067 open class words to be disambiguated, for which we used both human and automatic annotation. The words that were manually disambiguated or checked were labeled as "gold". We performed a voting between two disambiguation systems, one designed specifically for disambiguating glosses and one for disambiguating open text. The words to which both systems assigned the same sense were labeled as "silver". The rest of the words were labeled as "normal". The disambiguated glosses are presented in an XML format. The disambiguated words in WordNet can be used to derive new semantic relations and to build lexical chains [6].

References

  1. S. Harabagiu, G. Miller, D. Moldovan. WordNet 2 - A Morphologically and Semantically Enhanced Resource. In Proceedings of SIGLEX-99, pages 1-8, University of Maryland, 1999.
  2. E. Brill. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging. Computational Linguistics, 21(4):543-566, 1995.
  3. B. Magnini, C. Strapparava. Experiments in Word Domain Disambiguation for Parallel Texts. In Proceedings of the ACL Workshop on Word Senses and Multilinguality, pages 27-33, 2000.
  4. G. A. Miller, C. Leacock, R. Tengi, R. Bunker. A Semantic Concordance. In Proceedings of the ARPA Human Language Technology Workshop (Princeton, NJ, March 21-23), pages 303-308, 1993.
  5. A. Novischi. Accurate Semantic Annotations via Pattern Matching. In Proceedings of the Florida Artificial Intelligence Research Society, 2002.
  6. D. Moldovan, A. Novischi. Lexical Chains for Question Answering. In Proceedings of COLING 2002.