Re: regular expression operators in CF grammars

"Chris F Clark" <>
15 Jul 2002 23:44:57 -0400

          From comp.compilers

Related articles
[11 earlier articles]
Re: regular expression operators in CF grammars (Chris F Clark) (2002-07-02)
Re: regular expression operators in CF grammars (Chris F Clark) (2002-07-02)
Re: regular expression operators in CF grammars (Joachim Durchholz) (2002-07-02)
Re: regular expression operators in CF grammars (=?Windows-1252?Q?S=F6nke_Kannapinn?=) (2002-07-04)
Re: regular expression operators in CF grammars (SLK Parsers) (2002-07-04)
Re: regular expression operators in CF grammars (Ralph Boland) (2002-07-04)
Re: regular expression operators in CF grammars (Chris F Clark) (2002-07-15)
Re: regular expression operators in CF grammars (2002-07-21)
Re: regular expression operators in CF grammars (Sönke Kannapinn) (2002-08-10)
Re: regular expression operators in CF grammars (Sönke Kannapinn) (2002-08-10)
Re: regular expression operators in CF grammars (Chris F Clark) (2002-08-23)
Re: regular expression operators in CF grammars (tj bandrowsky) (2002-09-03)
Re: regular expression operators in CF grammars (Chris F Clark) (2002-09-12)
[2 later articles]
| List of all articles for this month |

From: "Chris F Clark" <>
Newsgroups: comp.compilers
Date: 15 Jul 2002 23:44:57 -0400
Organization: Compilers Central
References: 02-05-142 02-05-151 02-06-024 02-06-056 02-06-090 02-07-010 02-07-025
Keywords: parse, theory
Posted-Date: 15 Jul 2002 23:44:57 EDT

Sönke Kannapinn writes in our on-going discussion:

> I highly recommend the beautiful article entitled "Regular
> Expressions into Finite Automata" by Anne Brüggemann-Klein (TCS 120,
> 1993) to anyone interested in that field.

Thank you, I will look it up. (BTW, can I have a little better
bibliographic data to make that easier?)

> You only gain in power if you allow *ambiguous* rhs regular
> expressions (non-stronly-unambiguous, to be precise), or ambiguous
> rhs NFA's. Are you able to write down a compiler-related example
> where you really make wise use of this freedom?

"Wise" is a strong word--I cannot promise that the problem we were
solving was wise. However, I can describe the rough outline of what
the problem was and why having an ambiguous regular expression was a
key to solving it.

The original grammar had productions something like the following:

obj_id: id | obj_id "." id; // a recursive definition of id ("." id)*

expr: obj_id
        | simple_func
        | obj_func
        | expr "+" expr

simple_func: id "(" asgn_list ")"; // simple funcs have complex params

asgn_list: (obj_id "=" expr ";")+;

obj_func: obj_id "(" expr_list? ")"; // obj funcs have simple params

expr_list: expr ("," expr)*;

The problem is this grammar isn't LR(k) for any k. Why not? Because,
one needs unbounded (aka infinite) lookahead to distinguish between
the id of a simple_func and the obj_id of an obj_func. You have to
look past the left paren and into the parameter list and see whether
it contains an "=" or a "," or is missing. The solution that seemed
best was to use regular expressions (relying on their ambiguity) and
replace the following two rules:

obj_id: id ("." id)*;

obj_func: id ("." id)* "(" expr_list? ")";

Doing so, we pushed the lookahead requirements to the point where the
parser generator could resolve them. More importantly, we did so with
minimal changes to the structure of the grammar. There may be an
LR(1) grammar for the language in question. However, I doubt that it
reflects the author's intended structure as well as these minor
modifications did. Moreover, we could recover the structure by using
a grammar predicate, getting the best of both worlds.

obj_func: id ("." id)* "(" expr_list? ")" >>
                    obj_id "(" expr_list? ")";

Since, I don't presume you are familiar with the notation I use for
predicates. The predicate says, if you can match with the regular
expression before the >> operator, apply the regular expression after
the >> operator. Thus, we pattern match with the ambiguous rule and
then apply the non-ambiguous rule to have a nice representation.

BTW, lest one come up with the obvious extension of allowing obj_id's
in the simple_func case and dealing with the restriction semantically,
the simple_func and obj_id non-terminals were used in several other
places in the grammar and changing the simple_func rule introduced
other conflicts in those places.

> I want to attach semantic code depending on whether the optional
> part has actually been seen by the parser or not. In other words:
> I'm not only interested in *whether* this rule has been applied but
> *how* it has been applied. And *I* don't want to re-match the rhs
> regular expression to the actual handle; the parser's handle
> reduction machinery should have done so before! And it should have
> invoked the appropriate pieces of semantic action code associated
> with the regular operators actually applied.

You have a valid point and I would like that too. Our ambiguous
interpretation does not do that. If your software can do that, so
much the better. (As the saying goes, I take my hat off to you,
meaning I honor what you have done.) It does not surprise me that in
the years since Barbara and I wrote Yacc++ people have come up with
better solutions to some of the problems we were addressing.

However, I can still see the advantage to treating regular expression
operators as ambiguous in some cases. In particular, there are many
grammars that are not LR(k) for any k. It would be nice to be able to
parse them using a grammar based approach rather than ad hoc
code. (see note) (For example, the combination of regular expressions
and predicates allow us to parse a**i b**i c**i type grammars, and we
have an example of that too.) If occassionally that means that one
has to work a little harder to extract the structure from the parser's
stack at semantic time, it is a reasonable price to pay in my mind.
Especially since most regular expressions are not very complicated and
only introduce a little bit of chaos on the stack. Thus, the
re-parsing problem is generally not large. Still, your point is well
taken and it would be better if we could only have the reparsing issue
at the points where it was truly needed.

Our solution (and it is not a perfect one) is that one uses regular
expressions when one is willing to accept the ambiguity issues and
recursive rules when ambiguity is unacceptable. A better solution
might have two types of regular expression operators (ambiguous and
unambiguous ones) or some other way of indicating when and where
ambiguity is acceptable/desired, so that one doesn't lose the
convenience of writing regular expressions only because one is
unwilling to tolerate some ambiguity at semantic time.

Thanks for the stimulating exchange,

Chris Clark Internet :
Compiler Resources, Inc. Web Site :
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)

Note: the problem that many of our clients come to us with is that
they have a language spec that isn't well written (i.e. it isn't
really parsable "as is") for a variety of reasons and they want
something that they can understand and extend. So, our first goal is
to get them a grammar that accepts the target language. In those
cases, it isn't an option to say, you should just redesign the
language. You have to find some way to parse the language that they
have been given.

And it's not just home-grown languages that have that problem. I
remember working with Jim Roskind and Bruce Blodgett on making ANSI C
typedef's sensible to yacc. It was not a trivial process to know when
a typedef from an outer scope could be redeclared as a new identifier,
until we got the rules straightened out.

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.