1. Introduction
Project Goal
According to Mann and Thompson, there are no algorithms for building the RS-trees automatically because of (1)
Ambiguous definitions of the rhetorical relations, (2) incomplete description of RS-trees, and (3) the constraints specification is not complete for the compositionality requirements (adjacent text spans put together). These constraints
motivated Dr. Daniel Marcu to develop algorithms for building RS-trees automatically. And we tried to implement his algorithms. Our goal was not just to completely implement Marcu’s algorithm, but also to transform the whole algorithm from semi-automatic to fully-automatic.
Background of Rhetorical Structure
Theory (RST)
Driven mostly by research in Natural Language generation, Rhetorical Structure Theory (RST) has become one of the most popular and widely applied discourse theories of the last decade1. It is a descriptive theory of a major aspect of the organization of natural text that shows how texts can be structured in terms of relations that hold between parts of the text. The combinations of features that are provided by the Rhetorical Structure Theory are used in many areas of discourse studies. According to Mann & Thompson, RST has been tremendously useful for the following reasons –
· It identifies hierarchic structure in text.
· It describes the relations between text parts in functional terms, identifying both the transition point of a relation and the extent of the items related.
· It provides comprehensive analyses rather than selective commentary.
· It is insensitive to text size, and has been applied to a wide variety of sizes of text.
· RST provides a general way to describe the relations among clauses in a text, whether or not they are grammatically or lexically signaled.
· Descriptive RST has been used as an analytical tool for a wide range of text types.
· Descriptive RST lays a foundation for studies in contrastive rhetoric.
· RST has proven also to be useful in analyzing narrative discourse.
· RST provides a framework for investigating Relational Propositions, which are unstated but inferred propositions that arise from the text structure in the process of interpreting text.
In Rhetorical Structure Theory, a text is broken into many text spans or clause-like units. A rhetorical structure representation (tree) is then formed from the text spans. Each leaf node of the tree is associated with a contiguous textual span. The internal nodes are labeled with the names of the rhetorical relations that hold between the textual spans that are subsumed by their child nodes. Every text span is categorized as either Nucleus or a Satellite. A nucleus is the most important part of a sentence, and it subsumes another text span, which is known as satellite. Note that satellites can be removed from the text without affecting the understanding of the textual unit to which they belong. But the nuclei can not be removed from the textual unit, and carry valuable information for understanding the text. Following is a text and its rhetorical structure is shown in figure 1.
With its distant orbit - 50 percent farther from the
sun than Earth - and slim atmospheric blanket, Mars experiences frigid
weather conditions.
Text1:

Figure
1: Rhetorical Structure for Text1
Each Nucleus and the Satellite are connected by a relation that holds between them. There are many relations that we will discuss in details in later section of this paper. For example, a justification relation connects the satellite and the nucleus in the above text1.
2. Rhetorical Parsing Algorithm
As mentioned before, in this project we have implemented the Rhetorical Parsing Algorithm by Daniel Marcu. Marcu’s Rhetorical Parsing Algorithm takes a free, unrestricted text as input, and identifies discourse usage of cue phrases and breaks sentences into clauses, hypothesize rhetorical relations that hold among textual units, and produce valid rhetorical structure trees for unrestricted natural language texts. This method is a surface-based approach for determining all the possible rhetorical analyses of a given discourse. An outline of the rhetorical parsing algorithm is shown below with short explanation for each step.
I. Cue Phrase Identifier: Determine the set D of all instances of cue phrases that occur in the text T. This set also includes the punctuation marks such as commas, periods, and semicolons.
II. Elementary Unit Identifier: Determine elementary units of texts such as sentences and clause-like units, by scanning through the text. Also, determine the set Dd Î D of cue phrases that have a discourse function in structuring the text. We use the corpus analysis of cue phrases (details provided in Section 3 of the paper) for this purpose.
III. Formulating Text Structure: In the third step, we build valid text structures for each of the three highest levels of granularity, which are the sentence, paragraph, and section levels, using the following steps.
III.1 Using the cue phrases that were assigned a discourse function in step II, the Rhetorical Parser speculates rhetorical relations among elementary units.
III.2 The algorithm then formulates rhetorical relations among the units for which no speculations were made in the previous step. In other words, when the textual units under consideration are not characterized by any discourse markers (used in the corpus analysis), rhetorical relations are hypothesized on the basis of a simple cohesive device, which is similar to that used by Hearst (1997).
III.3 Once the set of textual units and the set of rhetorical relations that hold among the units have been determined, the algorithm derives discourse trees at each of the three levels that are assumed to be in correlation with the discourse structure: sentence, paragraph, and section levels. The derivation is accomplished by a chart-based implementation of a proof theory provided by Marcu (2000).
III.4 Since the rhetorical parsing process is ambiguous; more than one discourse tree is usually obtained at each of these levels. To deal with this, a weight is assigned to each of the text trees and the tree with the maximal weight is selected.
IV. Assembling: In the final step, the algorithm assembles the trees built at each level of granularity, thus obtaining a discourse tree that spans over the entire text. For this purpose, we merge the best trees that correspond to each level into a discourse tree that spans the entire text and that has clause-like units as its elementary units.
The Rhetorical Structure (RS) trees that are obtained by the algorithm are validated using Well-Constrained Mathematical Model developed by Marcu (1996). The Mathematical Model consists of constraints, which are specific to valid rhetorical structures in the language of first-order logic.
3. Corpus Analysis of Cue Phrases
In order to automatically build the valid text structure of an arbitrary text, we need only to determine the elementary units of that text and the rhetorical relations that hold among those units. Therefore, an accurate determination of the elementary units of a text and the relations that hold among them is the most important task. Using cue phrases is one of the best ways to determine elementary units. According to Litman and Hirschberg, Cue phrases are words, phrases, or linguistic expressions that may directly and explicitly mark the structure of a discourse.
Corpus analysis of cue phrases was designed to investigate how cue phrases can be used to identify the elementary units of texts, as well as to determine what rhetorical relations hold between the units. This module is the core of our project and building block for rest of the modules. The corpus analysis in the thesis paper by Daniel Marcu was done manually. But one of the goals of this project is to automate every step in Marcu’s Rhetorical Parsing Algorithm including the corpus analysis of cue phrases.
A
list of around four hundred ninety-six cue phrases is obtained from Daniel Marcu’s
thesis paper. Also, a corpus with
Spencer J.
Volk, president and chief operating officer of this consumer and industrial
products company, was elected a director.
Mr. Volk, 55 years old, succeeds Duncan Dwight, who retired in
September.
Text2:

A program was developed to read and
analyze the annotated corpus. Every file
in the corpus corresponds to a complete binary tree rooted at the node called
the “ROOT”. Every non-leaf node
corresponds to a relation. Figure 3
shows the binary tree for the text2.
Following is a list of steps that are taken for automating the corpus
analysis.

a)
First,
all cue phrases (markers) are extracted from the trees.
b)
A set of
features associated with each cue phrase was derived by traversing the trees.
c)
All markers with their respective features were
stored in an efficient data structure.
We chose the map structure in STL (Standard Template Library) to be the
best for our requirement.
d)
The data stored above was used in the following
way. A new input utterance was taken containing a set of sentences. Each
sentence was stored on a different line. (assumption:- End of sentence
detection problem has been solved).
e) All markers were extracted from this input. These markers were matched with the markers in the data structure. The best matching markers were chosen and their features were assigned to the markers from the input.
Table 1 shows the
list of features and their values that are extracted for each cue phrase during
the corpus analysis process. This set of
markers with all the features forms the input to the next two modules, namely,
elementary units identifier and rhetorical relation hypothesizer.
The district court accordingly refused to
assume jurisdiction, whereupon the superior court made the restraining order
permanent.
Text3
|
Field |
Content |
|
Marker |
accordingly |
|
Usage |
D |
|
Right Boundary |
, whereupon |
|
Where to Link |
B |
|
Types of textual units |
MC:C |
|
Clause Distance |
0 |
|
Sentence Distance |
0 |
|
Distance to salient unit |
-1 |
|
Position |
M |
|
Statuses |
S:N |
|
Rhetorical Relation |
VOLITIONAL-CAUSE |
|
Break Action |
NOTHING |
Table 1: List of Features for cue phrase in Text3
4. Elementary Units Identifier
Using
the output from the above corpus analysis, we have directly implemented Daniel
Marcu’s algorithm that identifies elementary textual unit boundaries in
sentences and cue phrases that have discourse markers. This algorithm takes an unrestricted text and
set of cue phrases with corresponding features as an input.
First
task is to determine the potential discourse markers for a given unrestricted
text. Using the four hundred and
ninety-six cue phrases that were obtained from Daniel Marcu’s thesis paper, the
markers in the unknown text are identified.
To do that each word in the file has been scrutinized against the list
of cue phrases. These extracted cue
phrases are then stored in a data structure, which in our case is an array to
be the best for our requirement. The
position of each marker in a sentence is also calculated and stored in another
data structure.
The next step for identifying elementary units from the unrestricted text is to get discourse actions. To do this we have used the corpus analysis of cue phrases and position of the marker in the sentence that was calculated in last step. We have discussed the possible discourse (breaking) actions in the following. Note that these breaking actions are also used for training purpose during the corpus analysis.
v
NOTHING: Action nothing instruct the shallow
analyzer to treat the cue phrase under consideration as a simple word. That is, no textual unit boundary is normally
set when a cue phrase associated with such an action is processed.
v
v COMMA: Action comma instructs the analyzer to insert a textual boundary immediately after the occurrence of the first comma in the input stream. If the first comma is followed by an and or an or, the textual boundary is set after the occurrence of the next comma instead. If no comma is found before the end of the sentence, a textual boundary is created at the end of the sentence.
v
NORMAL_THEN_COMMA: Action
normal_then_comma instructs the analyzer to insert a textual boundary
immediately before occurrence of the marker and to insert another textual
boundary immediately after the occurrence of the first comma in the input
stream.
v
END: Action end instructs the analyzer to
insert a textual boundary immediately after the cue phrase.
v
MATCH_PAREN: Action
match_paren instructs the analyzer to insert textual boundaries both before the
occurrence of the open parenthesis that is normally characterized by such an
action, and after the closed parenthesis that follows it.
v
MATCH_DASH: Action match_dash instructs the analyzer
to insert textual boundary before the occurrence of the cue phrase. The cue phrase is usually a dash. The action also instructs the analyzer to
insert a textual boundary after the next dash in the text.
v
DUAL: Action dual instructs the analyzer to
insert a textual boundary immediately before the cue phrase under consideration
if there is no other cue phrase that immediately precedes it.
Once we
have determined the discourse/breaking action for every marker in the text, we
can partition the text into elementary units using these actions. A simple program is implemented in order to
automate this process. Figure 4 shows
the skeleton of the algorithm that was used for identifying clause-like
units. Text3 shows a simple text with
its boundary for clause-like units.

[I went to
the theatre] [although I had a terrible headache.] [The trip was fun,] [and although we were
badly bitten by black flies,] [I do not regret it.]
Text3
5. Generating
Rhetorical Trees
The rhetorical tree generation
module consists of two parts. First step is to hypothesizing rhetorical
relations between textual units of various granularities. And the second part is to build rhetorical
tree using the found relations. We have
automated both of these steps by implementing simple programs.
The
program that hypothesizes rhetorical relations takes a sequence of textual
units and the set of discourse markers that are in the textual units as an
input. It iterates over all the textual
units that were obtained from the previous module. For each discourse marker, a set of disjunctive
hypothesis concerning the rhetorical relation that the marker under scrutiny
may signal was constructed. This was
done by using the features derived from corpus analysis. Following are some of the features that are
used in this process.
v
Statuses: specifies the rhetorical status of the
units that are linked by the discourse marker.
v
Where to Link: specifies
whether the rhetorical relations signaled by the discourse marker concern a
textual unit that goes before or after the unit that contains the marker.
v
Types of textual units: specifies the
nature of textual units.
v
Rhetorical relation: specifies the
name of the rhetorical relation that is signaled by the marker.
Using the above features, the algorithm
automatically builds disjunctive hypotheses in the form as shown bellow. Figure 5 shows an example of a text and its
rhetorical relations.
rhet_rel (Relation_Name, Nucleus, Satellite)
![Text Box: [Only the midday sun a tropical latitudes is warm enough to thaw ice on occasion,1] [but any liquid water formed in this way would evaporate almost instantly2] [because of the low atmospheric pressure.3]
rhet_rel (CONTRAST, 1, 2) + rhet_rel (CONTRAST, 1, 3)
rhet_rel ( CAUSE, 3, 1) + rhet_rel (EVIDENCE, 3, 1) + rhet_rel (CAUSE, 3, 2) + rhet_rel (EVIDENCE, 3, 2)](Rhetorical%20Tree_files/image007.gif)
The above relations are then passed
through another program that transforms the relations to tree-like
structure. Along with C++ programming
language, a component of JAVA (SWING) is used for viewing the graphical output
of this module.
6. Conclusion
Although, Daniel Marcu has presented
the Rhetorical Parsing Algorithm in his thesis paper in great details, the
implementation of the algorithm was a great challenge for us. Mainly because, the algorithm he has presented
is semi-automatic rather than fully automatic.
Since no clear algorithm was presented for the cue phrase analysis, a
great deal of effort has been put first to define clear rules, and then to
encode them into a simple program.
Therefore, the accuracy and the efficiency have reduced. As this project requires more time and effort
than just three to four weeks, more work on this project will definitely
improve the result to a better performance.
7. References