Sun, 23 May 1993 05:51:54 GMT

Related articles |
---|

The Tomita Parsing Algorithm (LR(k) with Dynamic Programming) markh@csd4.csd.uwm.edu (1993-05-23) |

Re: The Tomita Parsing Algorithm (LR(k) with Dynamic Programming) Bernard.Lang@inria.fr (1993-05-24) |

Newsgroups: | comp.theory,comp.compilers,comp.ai.nat-lang |

From: | markh@csd4.csd.uwm.edu (Mark) |

Keywords: | parse |

Organization: | Computing Services Division, University of Wisconsin - Milwaukee |

Date: | Sun, 23 May 1993 05:51:54 GMT |

A while back (1985) a researcher at CMU, named Masaru Tomita, came out

with a small book describing his unique approach to natural language

parsing. The approach involved the combination of several distinctive

features:

(1) The parser was LR(k)-based.

(2) A compact representation of all possible parses of the given input

was returned by the parser.

(3) Breadth-first search was used to handle non-determinism, and

the parser's stack was generalized into a graph.

(4) It reduces to LR(k) parsing when the grammar is LR(k).

Limitations that were pointed out in the 1985 book (inability to handle

grammars with rules like; S -> S) can be eliminated if the algorithm is

viewed correctly as a dynamic programming algorithm. This is what I am

about to describe below.

A software demo (in C) of the parser should be available at the FTP

site csd4.csd.uwm.edu in the directory /pub/regex/tomita by Monday.

I haven't been in touch with any of Tomita's later works, so this may

be current or probably even in advance of current. But I think it's of

sufficient interest to present here.

THE PARSE FOREST (Example 1):

The best way to explain the approach is to look at an example or two.

For the first example, take the following grammar (presented in the

equational form accepted by the demo software):

S = NP VP.

VP = V NP | VP PP.

NP = D N | NP PP.

PP = P NP.

D: the a an.

N: boy girl park saw.

P: in at.

V: saw.

Suppose it were desired to parse the following sentence using this grammar:

The boy saw a girl in the park.

First, label the sentence as follows:

0 1 2 3 4 5 6 7 8

The boy saw a girl in the park.

The primary function of the LR(k) parsing automaton is to limit the search

to for the possible parses. The grammar above will have been precompiled

to a LR(k) parsing table. I won't go into this in any more detail, but

will note that by the time the word "saw" is reached, the parser will only

have entries in the table for a word of type P or V. Therefore, "saw"

will be correctly recognized a as word of of type V.

The parser will build up a compact representation of all the possible

parses of this input. This representation can be characterized as:

The largest subset, G, of the grammar

such that L(G) = { the input }.

in particular as the following:

D_01: The.

N_12: boy.

NP_02 = D_01 N_12.

V_23: saw.

D_34: a.

N_45: girl.

NP_35 = D_34 N_45.

VP_25 = V_23 NP_35.

P_56: in.

D_67: the.

N_78: park.

NP_68 = D_67 N_78.

PP_58 = P_56 NP_68.

NP_38 = NP_35 PP_58.

VP_28 = V_23 NP_38 | VP_25 PP_58.

S_08 = NP_02 VP_28.

The listing here is also meant to show the order in which each of the

items are created. As you can see, the parsing process is inherently

left-to-right. Because of this, it is also feasible to implement the

algorithm in such a way that the parser's actions are reversible. To undo

a word, one merely needs to erase all the items created at and after the

word. For instance, if the word "park" were to be erased, then all the

nodes N_78 through S_08 would be erased.

Generally, this list would be represented in graphical form as a "parse

forest". The root of the forest would be S_08. Each of the equations

above indicates a node in the forest. For instance, S_08 = NP_02 VP_28

indicates that S_08 branches off to NP_02 VP_28. Each of the lexical

items listed (e.g. N_78: park) indicates a leaf. As an exercise you may

want to diagram the list presented above.

Of particular interest here is the item VP_28. It has two branchings

listed under it, corresponding to the ambiguity:

saw (the girl in the park) vs (saw the girl) in the park.

In Tomita's (1985) terminology, VP_28 is called a Packed Node, and each of

the branchings is called a Subnode. What makes the storage efficient is

precisely that this node is only presented once as a single unity to its

parent node(s) S_08. In other words, ambiguities are localized as much as

possible.

The basic application of this idea is that when the parse is complete,

and the semantic evaluation is applied disambiguation can be carried out

more efficiently by pruning subnodes. (Unfortinately, semantic

constraints such as agreement rules will require unpacking packed nodes

before applying the pruning operation).

Because of the indexing, it is not too difficult to see that the

maximum number of nodes that will be created by ANY parse using the Tomita

algorithm will be proportional to N^2, where N is the size of the input (8

above). Also, each node cannot have more than N subnodes so that the

maximum storage required to represent all the parses of a given input is

proportional to N^3.

THE GRAPH-STRUCTURED PARSING STACK:

Assuming that the LR(k) parsing table has states numbered 0, 1, ...,

Q-1 (Q = 12 for the grammar in Example 1), where 0 is the starting state,

the parsing stack for the sentence above will have layers numbered 0, 1,

..., 8. Each layer, L, will consist of at most one vertex, v(L, S) for

each of the states S = 0, ..., Q-1. Layer 0 will have only one vertex:

v(0, 0).

Because of the way LR(k) parsing is defined, there can only be at most

one transition between two given states, e.g. 0 S |- 1, 0 NP |- 2, 0 D |-

3. Therefore, between any two vertices of the graph will lie at most ONE

edge labeled by ONE node, e.g.

D_01

v(0, 0) ---------> v(1, 3)

In general for any transition s1 X |- s2, and any pair of vertices v(l,

s1), v(m, s2) (l <= m) will lie at most one edge labeled as follows:

X_lm

v(l, q1) ---------> v(m, q2)

Tomita's implementation of the graph structured stack also goes one

step further and represents the labels themselves as nodes, e.g.:

v(0, 0) --- [D_01] -----> v(1, 3)

and allows for branches to be merged, e.g.,

P_56

v(5, 10) ----------> v(6, 7)

/

v(5, 4) -----/

P_56

becomes:

v(5, 10) --- [P_56] ----> v(6, 7)

/

v(5, 4) --/

Like with the parse forest, the labeling of the nodes makes the storage

efficiency immediately apparent: the number of nodes in the graph will be

proportional to N^2 (actually, N^2 Q, but Q is fixed).

THE PARSING ALGORITHM:

The basic feature is that parsing is done breadth-first. When lexical

item X_(m-1)m is input, then all possible nodes of the form X_nm will be

created under the direction of the LR(k) parsing table. This may possible

lead to more vertices v(m, s) as a result.

In particular, in the LR(k) parsing automaton, all REDUCE actions are

carried out any SHIFT action. This will assure that the parses being

carried out in parallel will all be synchronized to the most recently

processed lexical items.

This is an outline of the algorithm:

(0) create vertex v(0, 0), L = 0.

(1) apply all possible REDUCE actions over the current set of vertices

v(L, S).

this may lead to:

(a) the creation of new vertices

(b) the creation of new nodes and edges of the form:

...---> [X_ml] ---> v(L, S)

to currently existing vertices.

Both cases, as a result, may lead to the applicability of more REDUCE

actions, even in case (b) if the vertex was already there. Carry out this

process until all the REDUCE actions are exhausted.

(2) if L = N, the parsing is complete, go to step (5)

(3) input the next lexical item. For all possible classes, X, that this item

belongs to (e.g., saw is an item of types N and V), carry out all the

SHIFT actions from all the vertices v(L, S). This will result in the

creation of new vertices of the form v(L + 1, S').

If the parsing process fails, it will fail at this point due to the lack

of any applicable shift action. Tomita described an interactive

front-end in which the recovery action consisted in undoing the parse

action and prompting the user for a correction.

(4) increase L by 1 and go to step (1).

(5) if parsing succeeds, there will be exactly one edge of the form:

v(0, 0) ----- [S_0N] ------> v(N, 1)

where I am assuming that state 1 of the LR(k) parser is the accepting

state. The node S_0N is then returned as the result of the parse.

HANDLING PATHOLOGICAL GRAMMARS (Example 2):

This is a rather generic example used to show what happens in the

extreme case where cyclic rules and empty reduction rules occur.

S = | S T | S.

T = a S b | x.

a: "(".

b: ")".

x: "a" "b" "c" "d".

In this case, S has both an empty reduction S -> (empty string), and a

cyclic reduction S -> S.

Taking the following input as an example:

0 1 2 3 4 5 6

a ( b c ) d

The result of parsing this will be a CYCLIC parse forest (rooted at S_06):

S_00 = S_00 | .

x_01: "a".

T_01 = x_01.

S_01 = S_01 | S_00 T_01.

a_12: "(".

S_22 = S_22 | .

x_23: "b".

T_23 = x_23.

S_23 = S_23 | S_22 T_23.

x_34: "c".

T_34 = x_34.

S_24 = S_24 | S_23 T_34.

b_45: ")".

T_15 = a_12 S_24 b_45.

S_05 = S_05 | S_01 T_15.

x_56: "d".

T_56 = x_56.

S_06 = S_06 | S_05 T_56.

Nobody's ever going to use a grammar like this, but it illustrates the

generality of the parsing algorithm, which was originally intended as only

a slight generalization of LR(k) parsing when it was discovered in 1985.

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.