10 May 91 11:11:49 +0200

Related articles |
---|

Generating LR parsers from EBNF? dexter314159@qwest.net (Gary F. Goodman) (2006-04-03) |

Re: Generating LR parsers from EBNF? gene@abhost.us (Gene Wirchenko) (2006-04-17) |

Generating LR parsers from EBNF? S.R.Adams@ecs.southampton.ac.uk (Stephen Adams) (1991-05-07) |

Re: Generating LR parsers from EBNF? grass@ulysses.att.com (1991-05-09) |

Re: Generating LR parsers from EBNF? KARSTEN@tfl.dk (Karsten Nyblad, TFL, Denmark) (1991-05-10) |

Newsgroups: | comp.compilers |

From: | "Karsten Nyblad, TFL, Denmark" <KARSTEN@tfl.dk> |

Keywords: | EBNF, parse, bibliography |

Organization: | TFL, A Danish Telecommunication Research Laboratory |

References: | <29305.9105072033@pessoa.ecs.soton.ac.uk> |

Date: | 10 May 91 11:11:49 +0200 |

In article <29305.9105072033@pessoa.ecs.soton.ac.uk>, S.R.Adams@ecs.southampton.ac.uk (Stephen Adams) writes:

*> [re generating parsing tables directly from EBNF]*

*> I thought about the idea and tried a few tiny examples by hand. The main*

*> trick is to reason about the placing of the `dot' in the EBNF items.*

*> Shifting the dot in an item may produce more than one derived item. For*

*> example, consider the simple grammar for nested parenthesized lists of*

*> numbers:*

*> *

*> s -> ( {s} )*

*> s -> number*

*> *

*> `{s}' menas 0 or more s's. This grammar might generate `5' or *

*> `(1 2 (3 4) 5)'.*

*> *

*> The item `s -> . ( {s} )' shifted to two items, `s -> ( { . s} )' and*

*> `s -> ( {s} . )' The state containing these two items would usually be two*

*> states in a CFG generated automaton.*

*> *

*> The parse tables came out smaller. The EBNF generated table was smaller*

*> because some table slots in the CFG generated table were being replaced*

*> with sequences of operations, for example a reduction of a null production*

*> might be replaced by a compound operation `reduce and goto'. After these*

*> changes some states and some columns of the goto table are never used and*

*> may be removed.*

*> *

*> I have looked in several books and a couple of conference proceedings but*

*> I have failed to find any references on generating LR parse tables*

*> directly from an EBNF grammar. What I would like to know is:*

*> *

*> 1. Are there any references?*

Acta Informatica 7, 61-73 (1976)

O.L. Madsen and B.B. Kristensen: LR-Parsing of Extended Context Free

Grammars.

Acta Informatica 11 177-193 (1979)

Wilf R. LaLonde: Constructing LR Parsers for Regular Right Part Grammars

Communications of the ACM: Volume 20 number 10

Wilf R. LaLonde: Regular Right Part Grammars and Their Parsers

Acta Informatica Vol 21 Fasc 1 1984.

N.P. Chapman: LALR(1,1) Parser Generation for Regular Right Part Grammars.

Acta Informatica Vol 23 149-162 (1986)

Ikuo Nakata and Masataka Sassa: Generation of Efficeint LALR Parsers for

Regular Right Part Grammars

(Take care: Their algorithm can't handle the following

grammar:

A -> a a

A -> a A

There is a bug in their optimizing algorithm too. It can't handle the

grammar of their example. Otherwise it is the most advanced algorithm.)

It is possible to fix both the bugs. I am currently writing a

parser generator, which does it.

*> 2. Is the EBNF approach equivalent to the CFG one? Is the difference in*

*> the tables always a due to small set of `optimizations'?*

*> *

*> 3. Which is faster? The EBNF item sets are smaller but*

*> more complex to handle. Using the EBNF approach seems*

*> to reduce the need for `optimizing' the generated table.*

It is hard to tell. I can confirm that the number of states is

smaller. My parser generator generates 592 states for an ADA grammar,

which typed in directly from the standard, and which does not handle

pragmas. I prefer generators which let the user write his grammar

like he likes, without loss of performance. In order to obtain that,

you will still need to remove unit reductions, and make stackings that

are always followed by a reduction into one operation. Removing unit

reductions is harder than in normal LR parsing because you will need

to split states that recognize repeation of a symbol to get the unit

productions, i.e., assume the production:

expression -> term { or term }

When this production is reduced, the length of the right hand side is

frequently 1, and depending of the algorithm you use for generating

the LR0 automaton you might need splitting states. I do.

I don't know whether anybody has ever completed writing a generator.

My generator in itself is rather fast. When generating the ADA

parser, it uses 20 seconds form start to the lookahead sets are

calculated when executed on a VAX with 2.8 the power of the VAX780 and

with optimization disabled when compiling the parser generator. I am

currently not capable of doing anything but generating lookahead set,

and that takes 5700 lines of C code. I think I will need 2-3000 lines

of code more before having a parser generator without error recorvery

and without any facilities for handling attributes.

*> 4. I only generated the LR(0) automaton by hand. I have not thought*

*> about `higher' grammar categories like LALR(1), LR(2), regular-LR etc.*

*> Are these kinds easily generated using the EBNF method?*

Well, generating the basic LR(0) atomaton is easy. You generate a DFA

for each right hand side of the grammar. You use the states in the

DFAs to substitute items, i.e., each DFA state repesent one or more

positions of dots in the original regular expression. Now it is easy

to generate the LR(0). Calculating the lookahead sets is as easy as

with normal LALR algorithm.

You can also generate the LR0 atomaton as you describe, Stephen.

Madsen's and Kristensen's paper is about that. They put some

restrictions on the regular expressions. I do not remember why, but

it has the advantage that there is no ambiguities in how to build the

parse trees recognized by their parsers.

The real problem is deciding how much to pop from the stack when

reducing a production. Chapman uses DFAs that recognize the reverse

sequence of symbols of the right hand side, or rather recognize not

the symbols but the states representing them on the parser stack.

When the parser reduces, it starts by recognizing the right hand part

starting from the top of the parser stack. Then the recognized part

of the parser stack is substituted by the state representing the

nonterminal of the reduced production.

Nakata and Sassa prove a theorem, that say that when reducing, the

state to restart the parser in, will always be the topmost of a set of

possible states (topmost on the parser stack.) As I wrote previously,

the there is a bug the proof, but after correcting that you will get

a parser that is faster than Chapmans, because you save recognizing

the right hand side once more.

Karsten Nyblad

TFL, A Danish Telecommunication Research Laboratory

E-mail: karsten@tfl.dk

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.