Re: regular expression operators in CF grammars

"Chris F Clark" <cfc@world.std.com>
17 Jun 2002 00:14:10 -0400

          From comp.compilers

Related articles
Compiler books for beginners? bart@dynarec.com (2002-05-27)
Re: Compiler books for beginners? rboland@unb.ca (Ralph Boland) (2002-05-27)
Re: regular expression operators in CF grammars parsersinc@yahoo.com (SLK Parsers) (2002-06-13)
Re: regular expression operators in CF grammars neelk@alum.mit.edu (Neelakantan Krishnaswami) (2002-06-14)
Re: regular expression operators in CF grammars rboland@unb.ca (Ralph Boland) (2002-06-17)
Re: regular expression operators in CF grammars cfc@world.std.com (Chris F Clark) (2002-06-17)
Re: regular expression operators in CF grammars kgw-news@stiscan.com (2002-06-20)
Re: regular expression operators in CF grammars kgw-news@stiscan.com (2002-06-28)
Re: regular expression operators in CF grammars parsersinc@yahoo.com (SLK Parsers) (2002-06-28)
Re: regular expression operators in CF grammars parsersinc@yahoo.com (SLK Parsers) (2002-06-28)
Re: regular expression operators in CF grammars soenke.kannapinn@wincor-nixdorf.com (=?Windows-1252?Q?S=F6nke_Kannapinn?=) (2002-06-28)
Re: regular expression operators in CF grammars cfc@world.std.com (Chris F Clark) (2002-07-02)
[14 later articles]
| List of all articles for this month |
From: "Chris F Clark" <cfc@world.std.com>
Newsgroups: comp.compilers
Date: 17 Jun 2002 00:14:10 -0400
Organization: Compilers Central
References: 02-05-142 02-05-151 02-06-024
Keywords: parse
Posted-Date: 17 Jun 2002 00:14:10 EDT

The author(s) of SLK Parsers write:
> The classical, and proven grammar analysis algorithms in parsing theory are
> based on definitions that do not include a notion of regular expressions as
> in EBNF. Adhoc programming methods that internally translate EBNF run the
> risk of introducing incorrectness.


While this is potentially a true statement, if you think about it, it
discourages all improvement, since any improvement runs the risk of
being incorrect until it is "proven". Note, even a "classical"
algorithm has some risk that ones own implementation has made an error
(perhaps typographical) in the implementation.


That being said, there are some fairly straight-forward algorithms for
implementing EBNF translations directly as part of the parser
generation. I would be very surprised in the EBNF translations in
some of the more popular LL parser generators (e.g. ANTLR, JavaCC,
RDP) were not robust.


It "is" more difficult to do so correctly in the LR case, than the LL
case because of the way reductions are handled in LR. However, there
is even a complete set of published algorithms that cover implementing
EBNF in an LR parser written by Nakata and Sassa--not that those
articles are easy to track down nor easy to read. (By the way, just
as a minor point of pride, Nakata is currently a Yacc++ user and has
been for a couple of years.)


> The free-form notation of SLK allows the use of RE operators as
> documentation, but not as non-BNF shorthand.
...
> However, this still requires the defining productions
...


This is "ok" for many uses. However, it misses the big advantages of
accepting EBNF directly, and some of the reasons why a "direct
translation" approach is useful (i.e. not translating EBNF into
productions and then using the translated grammar, but instead doing
the expansion of the RE's inline as part of the table generation
process).


The biggest advantage of direct translation is the elimination of
unnecessary conflicts. At each point where one reaches a production
boundary (the start of a production in LL or the end of a production
in LR), the parser generator must insert code which determines which
production to apply. If one translates EBNF into auxillary
productions, then one has usually introduced additional production
boundaries. This, means more places conflicts can occur. In
contract, with a direct translation approach, the parser just marches
right through the regular expression and a conflict does not occur.
(In fact, one technique the Yacc++ documentation suggests for removing
conflicts is "flattening" the grammar, that is making longer rules by
replacing nested rules by the equivalent regular expressions.)


The auxillary productions also make the resulting parser (either as
code in a recursive descent implementation or as tables) larger and
slightly slower. Now, with current machines and grammar sizes, this
is probably mostly irrelevant--the extra cost is effectively within
the noise bars.


In my personal opinion, the correct answer is a combination of the two
approaches. First, the parser generator should build a "recognizer"
that expands the EBNF inline to minimize conflicts. The parser
generator should then apply a "parser" that, knowing how the grammar
should be matched, separates the EBNF into groups for each RE
operator. That would give the grammar author the best of both worlds.
They would have the fewer conflicts that result from the direct
translation. Yet, they would have the simplified semantic model that
the EBNF->productions suggest. Ideally, the user would have the
ability (perhaps on a per operator basis) to control which method was
used, just direct translation, just expansion, the mixture. Note,
Yacc++ does NOT provide this feature. It's on our quite long to-do
list, but it isn't likely to happen soon.


Lest one think that the conflict avoidance problem is a minor point,
let me just say that I have seen numerous grammars where using regular
expressions were an important key to making the grammar parseable.
The example that currently comes to mind was a grammar for a language
where calls to member functions allowed one syntactic variation (I
think the case was dotted names for the function name, and perhaps
optional arguments in the parameter list) and non-member functions had
a slightly different syntactic variation. The key in parsing the
language was recognizing the dotted names with a regular expression
rather than a recursive rule. This allowed the parser the lookahead
to see the openning parenthesis of the parameter list after the dotted
name and to know that it was a function call at an important point in
the disambiguation.


The one place where the EBNF->productions translation might be
completely acceptable is when using an Earley style parser (or its
modern variations, often called Tomita or GLR parsers). In that case,
ambiguity (and the resulting conflicts) is not a problem, it is
resolved at run-time or possibly not at all. I do not have the
experience to say what are the tradeoffs in that parsing methodology.


-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)
------------------------------------------------------------------------------


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.