Re: Generating LR parsers from EBNF?

"Karsten Nyblad, TFL, Denmark" <KARSTEN@tfl.dk>
10 May 91 11:11:49 +0200

          From comp.compilers

Related articles
Generating LR parsers from EBNF? dexter314159@qwest.net (Gary F. Goodman) (2006-04-03)
Re: Generating LR parsers from EBNF? gene@abhost.us (Gene Wirchenko) (2006-04-17)
Generating LR parsers from EBNF? S.R.Adams@ecs.southampton.ac.uk (Stephen Adams) (1991-05-07)
Re: Generating LR parsers from EBNF? grass@ulysses.att.com (1991-05-09)
Re: Generating LR parsers from EBNF? KARSTEN@tfl.dk (Karsten Nyblad, TFL, Denmark) (1991-05-10)
| List of all articles for this month |
Newsgroups: comp.compilers
From: "Karsten Nyblad, TFL, Denmark" <KARSTEN@tfl.dk>
Keywords: EBNF, parse, bibliography
Organization: TFL, A Danish Telecommunication Research Laboratory
References: <29305.9105072033@pessoa.ecs.soton.ac.uk>
Date: 10 May 91 11:11:49 +0200

In article <29305.9105072033@pessoa.ecs.soton.ac.uk>, S.R.Adams@ecs.southampton.ac.uk (Stephen Adams) writes:
> [re generating parsing tables directly from EBNF]
> I thought about the idea and tried a few tiny examples by hand. The main
> trick is to reason about the placing of the `dot' in the EBNF items.
> Shifting the dot in an item may produce more than one derived item. For
> example, consider the simple grammar for nested parenthesized lists of
> numbers:
>
> s -> ( {s} )
> s -> number
>
> `{s}' menas 0 or more s's. This grammar might generate `5' or
> `(1 2 (3 4) 5)'.
>
> The item `s -> . ( {s} )' shifted to two items, `s -> ( { . s} )' and
> `s -> ( {s} . )' The state containing these two items would usually be two
> states in a CFG generated automaton.
>
> The parse tables came out smaller. The EBNF generated table was smaller
> because some table slots in the CFG generated table were being replaced
> with sequences of operations, for example a reduction of a null production
> might be replaced by a compound operation `reduce and goto'. After these
> changes some states and some columns of the goto table are never used and
> may be removed.
>
> I have looked in several books and a couple of conference proceedings but
> I have failed to find any references on generating LR parse tables
> directly from an EBNF grammar. What I would like to know is:
>
> 1. Are there any references?


Acta Informatica 7, 61-73 (1976)
O.L. Madsen and B.B. Kristensen: LR-Parsing of Extended Context Free
Grammars.


Acta Informatica 11 177-193 (1979)
Wilf R. LaLonde: Constructing LR Parsers for Regular Right Part Grammars


Communications of the ACM: Volume 20 number 10
Wilf R. LaLonde: Regular Right Part Grammars and Their Parsers


Acta Informatica Vol 21 Fasc 1 1984.
N.P. Chapman: LALR(1,1) Parser Generation for Regular Right Part Grammars.


Acta Informatica Vol 23 149-162 (1986)
Ikuo Nakata and Masataka Sassa: Generation of Efficeint LALR Parsers for
Regular Right Part Grammars
(Take care: Their algorithm can't handle the following
grammar:
          A -> a a
          A -> a A
There is a bug in their optimizing algorithm too. It can't handle the
grammar of their example. Otherwise it is the most advanced algorithm.)


It is possible to fix both the bugs. I am currently writing a
parser generator, which does it.


> 2. Is the EBNF approach equivalent to the CFG one? Is the difference in
> the tables always a due to small set of `optimizations'?
>
> 3. Which is faster? The EBNF item sets are smaller but
> more complex to handle. Using the EBNF approach seems
> to reduce the need for `optimizing' the generated table.


It is hard to tell. I can confirm that the number of states is
smaller. My parser generator generates 592 states for an ADA grammar,
which typed in directly from the standard, and which does not handle
pragmas. I prefer generators which let the user write his grammar
like he likes, without loss of performance. In order to obtain that,
you will still need to remove unit reductions, and make stackings that
are always followed by a reduction into one operation. Removing unit
reductions is harder than in normal LR parsing because you will need
to split states that recognize repeation of a symbol to get the unit
productions, i.e., assume the production:


      expression -> term { or term }


When this production is reduced, the length of the right hand side is
frequently 1, and depending of the algorithm you use for generating
the LR0 automaton you might need splitting states. I do.


I don't know whether anybody has ever completed writing a generator.
My generator in itself is rather fast. When generating the ADA
parser, it uses 20 seconds form start to the lookahead sets are
calculated when executed on a VAX with 2.8 the power of the VAX780 and
with optimization disabled when compiling the parser generator. I am
currently not capable of doing anything but generating lookahead set,
and that takes 5700 lines of C code. I think I will need 2-3000 lines
of code more before having a parser generator without error recorvery
and without any facilities for handling attributes.


> 4. I only generated the LR(0) automaton by hand. I have not thought
> about `higher' grammar categories like LALR(1), LR(2), regular-LR etc.
> Are these kinds easily generated using the EBNF method?


Well, generating the basic LR(0) atomaton is easy. You generate a DFA
for each right hand side of the grammar. You use the states in the
DFAs to substitute items, i.e., each DFA state repesent one or more
positions of dots in the original regular expression. Now it is easy
to generate the LR(0). Calculating the lookahead sets is as easy as
with normal LALR algorithm.


You can also generate the LR0 atomaton as you describe, Stephen.
Madsen's and Kristensen's paper is about that. They put some
restrictions on the regular expressions. I do not remember why, but
it has the advantage that there is no ambiguities in how to build the
parse trees recognized by their parsers.


The real problem is deciding how much to pop from the stack when
reducing a production. Chapman uses DFAs that recognize the reverse
sequence of symbols of the right hand side, or rather recognize not
the symbols but the states representing them on the parser stack.
When the parser reduces, it starts by recognizing the right hand part
starting from the top of the parser stack. Then the recognized part
of the parser stack is substituted by the state representing the
nonterminal of the reduced production.


Nakata and Sassa prove a theorem, that say that when reducing, the
state to restart the parser in, will always be the topmost of a set of
possible states (topmost on the parser stack.) As I wrote previously,
the there is a bug the proof, but after correcting that you will get
a parser that is faster than Chapmans, because you save recognizing
the right hand side once more.


Karsten Nyblad
TFL, A Danish Telecommunication Research Laboratory
E-mail: karsten@tfl.dk
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.