Newsgroups: comp.theory,comp.compilers,comp.ai.nat-lang
From: markh@csd4.csd.uwm.edu (Mark)
Keywords: parse
Organization: Computing Services Division, University of Wisconsin - Milwaukee
Date: Sun, 23 May 1993 05:51:54 GMT
A while back (1985), a researcher at CMU named Masaru Tomita came out
with a small book describing his approach to natural language parsing.
The approach combined several distinctive features:
(1) The parser was LR(k)-based.
(2) A compact representation of all possible parses of the given input
was returned by the parser.
(3) Breadth-first search was used to handle non-determinism, and
the parser's stack was generalized into a graph.
(4) It reduces to LR(k) parsing when the grammar is LR(k).
Limitations that were pointed out in the 1985 book (such as the inability
to handle grammars with rules like S -> S) can be eliminated if the
algorithm is viewed, correctly, as a dynamic programming algorithm. That
is what I describe below.
A software demo (in C) of the parser should be available at the FTP
site csd4.csd.uwm.edu in the directory /pub/regex/tomita by Monday.
I haven't kept up with Tomita's later work, so this may be current, or
perhaps even ahead of the current state of the art. Either way, I think
it is of sufficient interest to present here.
THE PARSE FOREST (Example 1):
The best way to explain the approach is to look at an example or two.
For the first example, take the following grammar (presented in the
equational form accepted by the demo software):
S = NP VP.
VP = V NP | VP PP.
NP = D N | NP PP.
PP = P NP.
D: the a an.
N: boy girl park saw.
P: in at.
V: saw.
Suppose it were desired to parse the following sentence using this grammar:
The boy saw a girl in the park.
First, label the sentence as follows:
0 1 2 3 4 5 6 7 8
The boy saw a girl in the park.
The primary function of the LR(k) parsing automaton is to limit the
search for the possible parses. The grammar above will have been
precompiled into an LR(k) parsing table. I won't go into this in any more
detail, but will note that by the time the word "saw" is reached, the
parser will only have table entries for a word of type P or V. Therefore,
"saw" will be correctly recognized as a word of type V.
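To make this concrete, here is a toy sketch in C of the kind of table
lookup involved. The state number and table contents are invented for
illustration; they are not the demo's actual compiled table.

#include <stdio.h>

enum cat { D, N, P, V, NCATS };
static const char *cat_name[NCATS] = { "D", "N", "P", "V" };

/* has_action[s][c] is nonzero iff state s has a SHIFT or REDUCE entry
   for a lookahead of category c. Row 5 stands for the (hypothetical)
   state reached after "The boy": only P and V have entries there. */
static const int has_action[6][NCATS] = {
    [5] = { 0, 0, 1, 1 },
};

int main(void) {
    int state = 5;
    enum cat readings[] = { N, V };   /* "saw" is lexically N or V */
    for (int i = 0; i < 2; i++)
        if (has_action[state][readings[i]])
            printf("\"saw\" kept as %s\n", cat_name[readings[i]]);
    return 0;                         /* prints only: "saw" kept as V */
}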
The parser will build up a compact representation of all the possible
parses of this input. This representation can be characterized as:
The largest subset, G, of the grammar
such that L(G) = { the input }.
In particular, it is the following:
D_01: The.
N_12: boy.
NP_02 = D_01 N_12.
V_23: saw.
D_34: a.
N_45: girl.
NP_35 = D_34 N_45.
VP_25 = V_23 NP_35.
P_56: in.
D_67: the.
N_78: park.
NP_68 = D_67 N_78.
PP_58 = P_56 NP_68.
NP_38 = NP_35 PP_58.
VP_28 = V_23 NP_38 | VP_25 PP_58.
S_08 = NP_02 VP_28.
The listing here is also meant to show the order in which each of the
items are created. As you can see, the parsing process is inherently
left-to-right. Because of this, it is also feasible to implement the
algorithm in such a way that the parser's actions are reversible. To undo
a word, one merely needs to erase all the items created at and after the
word. For instance, if the word "park" were to be erased, then all the
nodes N_78 through S_08 would be erased.
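A minimal sketch of how this undo might be implemented (my own
illustration, not code from the demo): stamp each node with the word
position at which it was created, keep the nodes in creation order, and
undo by truncation.

#include <stddef.h>

struct node {
    int word;              /* word position at which this was created */
    /* ... label, children, etc. ... */
};

struct forest {
    struct node *nodes;    /* all nodes, in creation order */
    size_t count;
};

/* Erase every node created at or after word position w; for "park"
   in Example 1, undo_from(f, 7) would drop N_78 through S_08. */
void undo_from(struct forest *f, int w)
{
    while (f->count > 0 && f->nodes[f->count - 1].word >= w)
        f->count--;
}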
Generally, this list would be represented in graphical form as a "parse
forest". The root of the forest would be S_08. Each of the equations
above indicates a node in the forest. For instance, S_08 = NP_02 VP_28
indicates that S_08 branches off to NP_02 VP_28. Each of the lexical
items listed (e.g. N_78: park) indicates a leaf. As an exercise you may
want to diagram the list presented above.
Of particular interest here is the item VP_28. It has two branchings
listed under it, corresponding to the ambiguity:
saw (the girl in the park) vs (saw the girl) in the park.
In Tomita's (1985) terminology, VP_28 is called a Packed Node, and each of
the branchings is called a Subnode. What makes the storage efficient is
precisely that this node is presented only once, as a single unit, to its
parent node(s), here S_08. In other words, ambiguities are localized as
much as possible.
The basic application of this idea is that when the parse is complete and
semantic evaluation is applied, disambiguation can be carried out more
efficiently by pruning subnodes. (Unfortunately, semantic constraints
such as agreement rules will require unpacking packed nodes before
applying the pruning operation.)
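Here is a sketch in C of how the packed representation and the pruning
operation might look. The struct layout and the predicate are my own
assumptions, not the demo's actual code.

#include <stddef.h>

struct node;

struct subnode {                  /* one branching, e.g. V_23 NP_38 */
    size_t nchildren;
    struct node **children;
    struct subnode *next;         /* next alternative, if ambiguous */
};

struct node {                     /* e.g. VP_28 */
    int symbol;                   /* grammar symbol (S, NP, VP, ...) */
    int l, m;                     /* input positions spanned */
    struct subnode *subnodes;     /* NULL for a lexical leaf */
};

/* Unlink every subnode that a semantic predicate rejects. Note that
   agreement-style constraints would first require unpacking, as
   remarked above; `ok' here is a purely local test. */
void prune(struct node *n, int (*ok)(const struct subnode *))
{
    struct subnode **p = &n->subnodes;
    while (*p) {
        if (ok(*p))
            p = &(*p)->next;
        else
            *p = (*p)->next;      /* drop the rejected alternative */
    }
}

Parents reference the packed node itself, never an individual subnode,
which is what keeps the ambiguity local to VP_28 in the example above.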
Because of the indexing, it is not too difficult to see that the
maximum number of nodes created by ANY parse using the Tomita algorithm
will be proportional to N^2, where N is the size of the input (8 above):
a node X_lm is determined by its symbol and the index pair (l, m), and
there are only O(N^2) such pairs. Also, each node cannot have more than
N subnodes (a subnode is essentially determined by where it splits the
node's span), so the maximum storage required to represent all the
parses of a given input is proportional to N^3.
THE GRAPH-STRUCTURED PARSING STACK:
Assuming that the LR(k) parsing table has states numbered 0, 1, ...,
Q-1 (Q = 12 for the grammar in Example 1), where 0 is the starting state,
the parsing stack for the sentence above will have layers numbered 0, 1,
..., 8. Each layer L will consist of at most one vertex v(L, S) for
each of the states S = 0, ..., Q-1. Layer 0 will have only one vertex:
v(0, 0).
Because of the way LR(k) parsing is defined, there can only be at most
one transition between two given states, e.g. 0 S |- 1, 0 NP |- 2, 0 D |-
3. Therefore, between any two vertices of the graph will lie at most ONE
edge labeled by ONE node, e.g.
                 D_01
    v(0, 0) ---------> v(1, 3)
In general, for any transition s1 X |- s2 and any pair of vertices
v(l, s1), v(m, s2) (l <= m), there will lie at most one edge between
them, labeled as follows:
                 X_lm
    v(l, s1) ---------> v(m, s2)
Tomita's implementation of the graph structured stack also goes one
step further and represents the labels themselves as nodes, e.g.:
v(0, 0) --- [D_01] -----> v(1, 3)
and allows for branches to be merged, e.g.,
                 P_56
    v(5, 10) ----------> v(6, 7)
                        /
    v(5, 4) -----------/
                 P_56
becomes:
    v(5, 10) --- [P_56] ----> v(6, 7)
                /
    v(5, 4) ---/
As with the parse forest, the labeling of the nodes makes the storage
efficiency immediately apparent: the number of nodes in the graph will be
proportional to N^2 (actually, N^2 Q, but Q is fixed).
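A sketch of the graph-structured stack in C (my own layout, not the
demo's): one vertex per (layer, state) pair, and labeled edges that hang
a forest node between two vertices.

#include <stddef.h>

struct fnode;                     /* a parse forest node, e.g. P_56 */
struct gss_vertex;

struct gss_edge {
    struct gss_vertex *to;        /* vertex at the other end */
    struct fnode *label;          /* forest node labeling the edge */
    struct gss_edge *next;
};

struct gss_vertex {               /* v(layer, state) */
    int layer, state;
    struct gss_edge *edges;       /* at most one per (label, to) */
};

Merging the two branches in the picture above then amounts to the two
edges sharing one `label' node, so that P_56 is stored exactly once.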
THE PARSING ALGORITHM:
The basic feature is that parsing is done breadth-first. When lexical
item X_(m-1)m is input, all possible nodes of the form X_nm will be
created under the direction of the LR(k) parsing table. This may possibly
lead to more vertices v(m, s) as a result.
In particular, in the LR(k) parsing automaton, all REDUCE actions are
carried out before any SHIFT action. This ensures that the parses being
carried out in parallel will all be synchronized to the most recently
processed lexical item.
This is an outline of the algorithm (a condensed code sketch follows the
outline):
(0) create vertex v(0, 0), L = 0.
(1) apply all possible REDUCE actions over the current set of vertices
v(L, S).
This may lead to:
(a) the creation of new vertices
(b) the creation of new nodes and edges of the form:
...---> [X_mL] ---> v(L, S)
to currently existing vertices.
Either case may, in turn, make more REDUCE actions applicable (even case
(b), if the vertex was already there). Carry out this process until all
the REDUCE actions are exhausted.
(2) if L = N, the parsing is complete; go to step (5).
(3) input the next lexical item. For each possible class X that this item
belongs to (e.g., "saw" is an item of types N and V), carry out all the
SHIFT actions from all the vertices v(L, S). This will result in the
creation of new vertices of the form v(L + 1, S').
If the parsing process fails, it will fail at this point, due to the lack
of any applicable SHIFT action. Tomita described an interactive
front-end in which the recovery action consisted of undoing the parse
action and prompting the user for a correction.
(4) increase L by 1 and go to step (1).
(5) if parsing succeeds, there will be exactly one edge of the form:
v(0, 0) ----- [S_0N] ------> v(N, 1)
where I am assuming that state 1 of the LR(k) parser is the accepting
state. The node S_0N is then returned as the result of the parse.
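Here is a condensed sketch in C of the control flow just outlined. The
three helpers are hypothetical names for the operations described in
steps (1) and (3); they are not functions from the demo.

#include <stdbool.h>

/* Assumed helpers:
     reduce_step(L)  performs one applicable REDUCE at layer L,
                     returning false once none is left;
     shift_all(L, w) performs every SHIFT of word w, under each of
                     its lexical categories, out of layer L, and
                     returns false if no SHIFT applied;
     next_word()     reads the next lexical item.                  */
bool reduce_step(int layer);
bool shift_all(int layer, int word);
int  next_word(void);

int parse(int n_words)
{
    /* (0) layer 0 holds only the vertex v(0, 0) */
    for (int layer = 0; ; layer++) {
        /* (1) exhaust REDUCE actions before any SHIFT, keeping the
               parallel parses synchronized at this layer */
        while (reduce_step(layer))
            ;
        /* (2) all words consumed: accept; the S_0N edge holds the
               result */
        if (layer == n_words)
            return 0;
        /* (3) shift the next word; failure can only happen here */
        if (!shift_all(layer, next_word()))
            return -1;
        /* (4) continue with layer L + 1 */
    }
}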
HANDLING PATHOLOGICAL GRAMMARS (Example 2):
This is a rather generic example used to show what happens in the
extreme case where cyclic rules and empty reduction rules occur.
S = | S T | S.
T = a S b | x.
a: "(".
b: ")".
x: "a" "b" "c" "d".
In this case, S has both an empty reduction S -> (empty string), and a
cyclic reduction S -> S.
Taking the following input as an example:
0 1 2 3 4 5 6
a ( b c ) d
The result of parsing this will be a CYCLIC parse forest (rooted at S_06):
S_00 = S_00 | .
x_01: "a".
T_01 = x_01.
S_01 = S_01 | S_00 T_01.
a_12: "(".
S_22 = S_22 | .
x_23: "b".
T_23 = x_23.
S_23 = S_23 | S_22 T_23.
x_34: "c".
T_34 = x_34.
S_24 = S_24 | S_23 T_34.
b_45: ")".
T_15 = a_12 S_24 b_45.
S_05 = S_05 | S_01 T_15.
x_56: "d".
T_56 = x_56.
S_06 = S_06 | S_05 T_56.
Nobody's ever going to use a grammar like this, but it illustrates the
generality of the parsing algorithm, which, when it appeared in 1985, was
originally intended as only a slight generalization of LR(k) parsing.
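One practical consequence of such cyclic forests is that any traversal
of the result must guard against revisiting nodes. A sketch, reusing the
node layout assumed earlier, with a mark bit:

#include <stddef.h>

struct node;

struct subnode {
    size_t nchildren;
    struct node **children;
    struct subnode *next;
};

struct node {
    int symbol, l, m;
    struct subnode *subnodes;
    int visited;                  /* mark bit for cyclic forests */
};

/* Visit every node reachable from the root exactly once; without the
   mark, the cycle S_00 = S_00 above would recurse forever. */
void visit(struct node *n)
{
    if (n == NULL || n->visited)
        return;
    n->visited = 1;
    for (struct subnode *s = n->subnodes; s != NULL; s = s->next)
        for (size_t i = 0; i < s->nchildren; i++)
            visit(s->children[i]);
}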