1. Introduction

           

Project Goal

 

            According to Mann and Thompson, there are no algorithms for building the RS-trees automatically because of (1)

Ambiguous definitions of the rhetorical relations, (2) incomplete description of RS-trees, and (3) the constraints specification is not complete for the compositionality requirements (adjacent text spans put together).  These constraints

motivated Dr. Daniel Marcu to develop algorithms for building RS-trees automatically.  And we tried to implement his algorithms.  Our goal was not just to completely implement Marcu’s algorithm, but also to transform the whole algorithm from semi-automatic to fully-automatic.

 

 

Background of Rhetorical Structure Theory (RST)

 

            Driven mostly by research in Natural Language generation, Rhetorical Structure Theory (RST) has become one of the most popular and widely applied discourse theories of the last decade1.  It is a descriptive theory of a major aspect of the organization of natural text that shows how texts can be structured in terms of relations that hold between parts of the text.  The combinations of features that are provided by the Rhetorical Structure Theory are used in many areas of discourse studies.  According to Mann & Thompson, RST has been tremendously useful for the following reasons –

·        It identifies hierarchic structure in text.

·        It describes the relations between text parts in functional terms, identifying both the transition point of a relation and the extent of the items related.

·        It provides comprehensive analyses rather than selective commentary.

·        It is insensitive to text size, and has been applied to a wide variety of sizes of text.

·        RST provides a general way to describe the relations among clauses in a text, whether or not they are grammatically or lexically signaled.

·        Descriptive RST has been used as an analytical tool for a wide range of text types.

·        Descriptive RST lays a foundation for studies in contrastive rhetoric.

·        RST has proven also to be useful in analyzing narrative discourse.

·        RST provides a framework for investigating Relational Propositions, which are unstated but inferred propositions that arise from the text structure in the process of interpreting text.

 

In Rhetorical Structure Theory, a text is broken into many text spans or clause-like units.  A rhetorical structure representation (tree) is then formed from the text spans.  Each leaf node of the tree is associated with a contiguous textual span.  The internal nodes are labeled with the names of the rhetorical relations that hold between the textual spans that are subsumed by their child nodes.  Every text span is categorized as either Nucleus or a Satellite.  A nucleus is the most important part of a sentence, and it subsumes another text span, which is known as satellite.  Note that satellites can be removed from the text without affecting the understanding of the textual unit to which they belong.  But the nuclei can not be removed from the textual unit, and carry valuable information for understanding the text.  Following is a text and its rhetorical structure is shown in figure 1.

With its distant orbit - 50 percent farther from the sun than Earth - and slim atmospheric blanket, Mars experiences frigid weather conditions.

 

 
Text1:

 

Figure 1: Rhetorical Structure for Text1

            Each Nucleus and the Satellite are connected by a relation that holds between them.  There are many relations that we will discuss in details in later section of this paper.  For example, a justification relation connects the satellite and the nucleus in the above text1.

 

 

2. Rhetorical Parsing Algorithm

 

As mentioned before, in this project we have implemented the Rhetorical Parsing Algorithm by Daniel Marcu.  Marcu’s Rhetorical Parsing Algorithm takes a free, unrestricted text as input, and identifies discourse usage of cue phrases and breaks sentences into clauses, hypothesize rhetorical relations that hold among textual units, and produce valid rhetorical structure trees for unrestricted natural language texts. This method is a surface-based approach for determining all the possible rhetorical analyses of a given discourse.  An outline of the rhetorical parsing algorithm is shown below with short explanation for each step.

 

I.                    Cue Phrase Identifier:  Determine the set D of all instances of cue phrases that occur in the text T.  This set also includes the punctuation marks such as commas, periods, and semicolons.

II.                 Elementary Unit Identifier:  Determine elementary units of texts such as sentences and clause-like units, by scanning through the text.  Also, determine the set Dd Î D of cue phrases that have a discourse function in structuring the text.  We use the corpus analysis of cue phrases (details provided in Section 3 of the paper) for this purpose.

III.               Formulating Text Structure:  In the third step, we build valid text structures for each of the three highest levels of granularity, which are the sentence, paragraph, and section levels, using the following steps.

III.1           Using the cue phrases that were assigned a discourse function in step II, the Rhetorical Parser speculates rhetorical relations among elementary units.

III.2           The algorithm then formulates rhetorical relations among the units for which no speculations were made in the previous step.  In other words, when the textual units under consideration are not characterized by any discourse markers (used in the corpus analysis), rhetorical relations are hypothesized on the basis of a simple cohesive device, which is similar to that used by Hearst (1997).

III.3           Once the set of textual units and the set of rhetorical relations that hold among the units have been determined, the algorithm derives discourse trees at each of the three levels that are assumed to be in correlation with the discourse structure: sentence, paragraph, and section levels.  The derivation is accomplished by a chart-based implementation of a proof theory provided by Marcu (2000).

III.4           Since the rhetorical parsing process is ambiguous; more than one discourse tree is usually obtained at each of these levels.  To deal with this, a weight is assigned to each of the text trees and the tree with the maximal weight is selected.

IV.              Assembling: In the final step, the algorithm assembles the trees built at each level of granularity, thus obtaining a discourse tree that spans over the entire text.  For this purpose, we merge the best trees that correspond to each level into a discourse tree that spans the entire text and that has clause-like units as its elementary units.

The Rhetorical Structure (RS) trees that are obtained by the algorithm are validated using Well-Constrained Mathematical Model developed by Marcu (1996).  The Mathematical Model consists of constraints, which are specific to valid rhetorical structures in the language of first-order logic.

 

 

 

 

 

 

 

 

 

 

 

 

 

3. Corpus Analysis of Cue Phrases

 

            In order to automatically build the valid text structure of an arbitrary text, we need only to determine the elementary units of that text and the rhetorical relations that hold among those units.  Therefore, an accurate determination of the elementary units of a text and the relations that hold among them is the most important task.  Using cue phrases is one of the best ways to determine elementary units.  According to Litman and Hirschberg, Cue phrases are words, phrases, or linguistic expressions that may directly and explicitly mark the structure of a discourse.

            Corpus analysis of cue phrases was designed to investigate how cue phrases can be used to identify the elementary units of texts, as well as to determine what rhetorical relations hold between the units.  This module is the core of our project and building block for rest of the modules.  The corpus analysis in the thesis paper by Daniel Marcu was done manually.  But one of the goals of this project is to automate every step in Marcu’s Rhetorical Parsing Algorithm including the corpus analysis of cue phrases.

            A list of around four hundred ninety-six cue phrases is obtained from Daniel Marcu’s thesis paper.  Also, a corpus with three hundred and eighty-five Wall Street journal articles was used for the training purpose.  The corpus was hand annotated on a set of fifty-six rhetorical relations. Each file contains the annotated data in the following format:

 

Spencer J. Volk, president and chief operating officer of this consumer and industrial products company, was elected a director.  Mr. Volk, 55 years old, succeeds Duncan Dwight, who retired in September.

 

 
                        Text2:

 

 

 

Text Box: ( Root (span 1 3)
( Nucleus (leaf 1) (rel2par span) (text _!Spencer J. Volk, president and chief operating officer of this consumer and industrial products company, was elected a director._!) )
( Satellite (span 2 3) (rel2par elaboration-additional)
( Nucleus (leaf 2) (rel2par span) (text _!Mr. Volk, 55 years old, succeeds Duncan Dwight,_!) )
( Satellite (leaf 3) (rel2par elaboration-additional-e) (text _!who retired in September._!) )
)
)
 

 


           

 

 

Figure2: Annotated Corpus Format for text2

 

            A program was developed to read and analyze the annotated corpus.  Every file in the corpus corresponds to a complete binary tree rooted at the node called the “ROOT”.  Every non-leaf node corresponds to a relation.  Figure 3 shows the binary tree for the text2.  Following is a list of steps that are taken for automating the corpus analysis.

 

 

 

 

 

 


Figure 3: Binary tree for text2

 

a)      First, all cue phrases (markers) are extracted from the trees.

b)      A set of features associated with each cue phrase was derived by traversing the trees.

c)      All markers with their respective features were stored in an efficient data structure.  We chose the map structure in STL (Standard Template Library) to be the best for our requirement.

d)      The data stored above was used in the following way. A new input utterance was taken containing a set of sentences. Each sentence was stored on a different line. (assumption:- End of sentence detection problem has been solved).

e)      All markers were extracted from this input. These markers were matched with the markers in the data structure. The best matching markers were chosen and their features were assigned to the markers from the input.

Table 1 shows the list of features and their values that are extracted for each cue phrase during the corpus analysis process.  This set of markers with all the features forms the input to the next two modules, namely, elementary units identifier and rhetorical relation hypothesizer.

 

 

 

 

 

 

The district court accordingly refused to assume jurisdiction, whereupon the superior court made the restraining order permanent.

 

 

 

 
                        Text3

 

Field

Content

Marker

accordingly

Usage

D

Right Boundary

, whereupon

Where to Link

B

Types of textual units

MC:C

Clause Distance

0

Sentence Distance

0

Distance to salient unit

-1

Position

M

Statuses

S:N

Rhetorical Relation

VOLITIONAL-CAUSE

Break Action

NOTHING

Table 1: List of Features for cue phrase in Text3

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

4. Elementary Units Identifier

 

            Using the output from the above corpus analysis, we have directly implemented Daniel Marcu’s algorithm that identifies elementary textual unit boundaries in sentences and cue phrases that have discourse markers.  This algorithm takes an unrestricted text and set of cue phrases with corresponding features as an input.

            First task is to determine the potential discourse markers for a given unrestricted text.  Using the four hundred and ninety-six cue phrases that were obtained from Daniel Marcu’s thesis paper, the markers in the unknown text are identified.  To do that each word in the file has been scrutinized against the list of cue phrases.  These extracted cue phrases are then stored in a data structure, which in our case is an array to be the best for our requirement.  The position of each marker in a sentence is also calculated and stored in another data structure.

            The next step for identifying elementary units from the unrestricted text is to get discourse actions.  To do this we have used the corpus analysis of cue phrases and position of the marker in the sentence that was calculated in last step.  We have discussed the possible discourse (breaking) actions in the following.  Note that these breaking actions are also used for training purpose during the corpus analysis.

v     NOTHING: Action nothing instruct the shallow analyzer to treat the cue phrase under consideration as a simple word.  That is, no textual unit boundary is normally set when a cue phrase associated with such an action is processed.

v     NORMAL: Action normal instructs the analyzer to insert a textual boundary immediately before the occurrence of the marker.  Textual boundaries correspond to elementary unit breaks.

v     COMMA: Action comma instructs the analyzer to insert a textual boundary immediately after the occurrence of the first comma in the input stream.  If the first comma is followed by an and or an or, the textual boundary is set after the occurrence of the next comma instead.  If no comma is found before the end of the sentence, a textual boundary is created at the end of the sentence.

v      NORMAL_THEN_COMMA: Action normal_then_comma instructs the analyzer to insert a textual boundary immediately before occurrence of the marker and to insert another textual boundary immediately after the occurrence of the first comma in the input stream.

v     END: Action end instructs the analyzer to insert a textual boundary immediately after the cue phrase.

v     MATCH_PAREN: Action match_paren instructs the analyzer to insert textual boundaries both before the occurrence of the open parenthesis that is normally characterized by such an action, and after the closed parenthesis that follows it.

v     MATCH_DASH: Action match_dash instructs the analyzer to insert textual boundary before the occurrence of the cue phrase.  The cue phrase is usually a dash.  The action also instructs the analyzer to insert a textual boundary after the next dash in the text.

v     DUAL: Action dual instructs the analyzer to insert a textual boundary immediately before the cue phrase under consideration if there is no other cue phrase that immediately precedes it.

            Once we have determined the discourse/breaking action for every marker in the text, we can partition the text into elementary units using these actions.  A simple program is implemented in order to automate this process.  Figure 4 shows the skeleton of the algorithm that was used for identifying clause-like units.  Text3 shows a simple text with its boundary for clause-like units.

 

 

Figure 4: A skeleton of clause-like units identifying algorithm

                       

[I went to the theatre] [although I had a terrible headache.]  [The trip was fun,] [and although we were badly bitten by black flies,] [I do not regret it.]

 

 
                        Text3

 

 

 

 

 

 

 

 

 

 

5. Generating Rhetorical Trees

 

            The rhetorical tree generation module consists of two parts. First step is to hypothesizing rhetorical relations between textual units of various granularities.  And the second part is to build rhetorical tree using the found relations.  We have automated both of these steps by implementing simple programs.

            The program that hypothesizes rhetorical relations takes a sequence of textual units and the set of discourse markers that are in the textual units as an input.  It iterates over all the textual units that were obtained from the previous module.  For each discourse marker, a set of disjunctive hypothesis concerning the rhetorical relation that the marker under scrutiny may signal was constructed.  This was done by using the features derived from corpus analysis.  Following are some of the features that are used in this process.

v     Statuses: specifies the rhetorical status of the units that are linked by the discourse marker.

v     Where to Link: specifies whether the rhetorical relations signaled by the discourse marker concern a textual unit that goes before or after the unit that contains the marker.

v     Types of textual units: specifies the nature of textual units.

v     Rhetorical relation: specifies the name of the rhetorical relation that is signaled by the marker.

Using the above features, the algorithm automatically builds disjunctive hypotheses in the form as shown bellow.  Figure 5 shows an example of a text and its rhetorical relations.

 

rhet_rel (Relation_Name, Nucleus, Satellite)

Text Box: [Only the midday sun a tropical latitudes is warm enough to thaw ice on occasion,1] [but any liquid water formed in this way would evaporate almost instantly2] [because of the low atmospheric pressure.3]
	rhet_rel (CONTRAST, 1, 2) + rhet_rel (CONTRAST, 1, 3)
	rhet_rel ( CAUSE, 3, 1) + rhet_rel (EVIDENCE, 3, 1) + rhet_rel (CAUSE, 3, 		2) + rhet_rel (EVIDENCE, 3, 2)

 

 

 

 

 

 

 

 


Figure 5: Textual units and the rhetorical relations that hold among them

 

            The above relations are then passed through another program that transforms the relations to tree-like structure.  Along with C++ programming language, a component of JAVA (SWING) is used for viewing the graphical output of this module.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

6. Conclusion

 

            Although, Daniel Marcu has presented the Rhetorical Parsing Algorithm in his thesis paper in great details, the implementation of the algorithm was a great challenge for us.  Mainly because, the algorithm he has presented is semi-automatic rather than fully automatic.  Since no clear algorithm was presented for the cue phrase analysis, a great deal of effort has been put first to define clear rules, and then to encode them into a simple program.  Therefore, the accuracy and the efficiency have reduced.  As this project requires more time and effort than just three to four weeks, more work on this project will definitely improve the result to a better performance.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

7. References

 

  1. Marcu, Daniel. 2000. "The Rhetorical Parsing of Unrestricted Texts: A Surface-Based Approach". Computational Linguistics.
  2. Mann, William C. and Thompson Sandra A., 1988, “Rhetorical Structure Theory: Toward a functional theory of text organization”, Rhetorical Structure Theory, pp. 243 – 281.
  3. Marcu, Daniel. 1996. "Building up Rhetorical Structure Trees". American Association for Artificial Intelligence.
  4. Marcu, Daniel 1997, The Rhetorical Parsing, Summarization, and Generation of Natural Language Text – PhD thesis.
  5. Diane, Littman and Hirschberg Julia, Disambiguating Cue Phrases in Text and Speech, AT&T Bell Laboratories.
  6. Moore, Johanna D. and Pollack, Martha E., 1992, “A Problem for RST: The Need for Multi-Level Discourse Analysis”, Association for Computational Linguistics.