Re: regular expression operators in CF grammars

"Chris F Clark" <cfc@shell01.TheWorld.com>
23 Aug 2002 11:15:48 -0400

          From comp.compilers

Related articles
[15 earlier articles]
Re: regular expression operators in CF grammars parsersinc@yahoo.com (SLK Parsers) (2002-07-04)
Re: regular expression operators in CF grammars rboland@unb.ca (Ralph Boland) (2002-07-04)
Re: regular expression operators in CF grammars cfc@world.std.com (Chris F Clark) (2002-07-15)
Re: regular expression operators in CF grammars kgw-news@stiscan.com (2002-07-21)
Re: regular expression operators in CF grammars nospam@snafu.de (Sönke Kannapinn) (2002-08-10)
Re: regular expression operators in CF grammars nospam@snafu.de (Sönke Kannapinn) (2002-08-10)
Re: regular expression operators in CF grammars cfc@shell01.TheWorld.com (Chris F Clark) (2002-08-23)
Re: regular expression operators in CF grammars tbandrow@unitedsoftworks.com (tj bandrowsky) (2002-09-03)
Re: regular expression operators in CF grammars cfc@shell01.TheWorld.com (Chris F Clark) (2002-09-12)
Re: regular expression operators in CF grammars cfc@TheWorld.com (Chris F Clark) (2002-09-14)
RE: regular expression operators in CF grammars qjackson@shaw.ca (Quinn Tyler Jackson) (2002-09-14)
| List of all articles for this month |
From: "Chris F Clark" <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 23 Aug 2002 11:15:48 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 02-05-142 02-05-151 02-06-024 02-06-056 02-06-090 02-07-010 02-07-025 02-07-043 02-08-035
Keywords: parse
Posted-Date: 23 Aug 2002 11:15:48 EDT

Sönke Kannapinn writes:
> In general, I think it could be helpful if I try to clarify a few
> points on ELR parsing in this posting.


From your posting, I see that although we have not completely
understood each other, we are not as far off as it would have seemed.


> This is achieved by transforming, e.g., an
> ecfg rule containing a regular *-operator
>
> A -> <alpha> <beta>* <gamma>
>
> into cfg rules
>
> A -> <alpha> A'
> A' -> <beta> A'
> A' -> <gamma>
...
> A "right-biased" transformation strategy such as the one sketched
> above does *not* introduce new look-ahead boundaries "in the middle"
> of rhs regex's.


Point well made. I had "assumed" that the only "valid" transformation was
    A -> <alpha> A' <gamma>
    A' -> <beta> A' | empty
as that is the transformation that reflects the original structure of
the grammar. In particular, that is the "shape" of the resulting tree
that I would expect to see constructed.


So, thank you, I learned something. The concept of a right-biased
expansion is intuitively obvious, now that it's been pointed out to
me :-).


My next question is do you then massage the resulting tree to be in
the form I would expect above or do you leave it in the same shape
that the right-biased transformation makes?


The reason I ask this question is that it mirrors a question someone
recently asked about grammar transformation tools. In a sense, the
right-biased transformations you are applying "change the shape" of
the grammar from what a naive grammar write would expect. Now, that's
still better than entirely losing the structure of regular
expressions, but if one does not put the tree back in there naive
shape, it still seems that the user will have to understand the
grammar transformations.


> BTW: The (corrected) grammar that you presented is in fact an
> ELR(1) grammar as far as I (and Madsen and Kristensen 1976)
> understand that term. So an ELR(1) parser is able to unveil the
> inductive structure of all handles during reductions and make it
> accessible to the user's semantic action code.


Yes, this does not surprise me now that I better understand you. The
right-biased transformations would not introduce the boundaries that
our "ambiguous treatement of regular expressions" was attempting to
avoid. I was not trying to suggest handling ambiguous regular
expressions (more on that in a moment). I was merely saying that one
does not want to detect the right edge prematurely. We solved that
problem by erasing both edges (which is what I mean by treating them
ambiguously, one cannot determine where the regular expression starts
or finishes at the reduction point of the containing rule. The
right-biased solution should allow one to recover the edges working
from the right end of the rule. Thus, it is a better solution.


More no ambiguous regular expressions:
> (In contrast, "a**" is weakly unambiguous, but not strongly; and
> "a* a*" is neither weakly nor strongly unambiguous.)


I believe we specifically disallow a** as it adds no meaning that a*
does not. (That also helps us reserve ** as an operator for future
regular expression extensions.)


However, a* a*, we accept. Internally, it works essentially treating
the first a* as greedy (and the 2nd a* as always empty), but since one
cannot see the regular expression boundaries, that is irrelevant. The
more interesting case is a* (a|b)*, where the 1st b determines that
one must use the 2nd closure, or (a|b)* (a|c)* where "somewhere"
between that last b and the 1st c, one must switch closures.
Disallowing these ambiguities (especially the last one) makes the
grammar writers job harder (and for that I would suggest allowing
ambiguities).


A similar interesting case, we have worried about is A->A* or A->A* a,
etc. where recursion and regular expressions are mixed. In that case,
we "choose" to iterate over the regular expression rather than
building the recursively structured tree.


Part of this bias comes from the original orientation of Yacc++. It
was originally developed as part of another tool (called JADE) which
was a variation of Emacs that imposed structure (via grammars) on the
buffers. In editors, one wants to keep the structure as flat (and
minimal) as possible to avoid tree reshaping artifacts that most
syntax directed editors suffer from. So, when looking at a regular
expression, I don't ask does the resulting tree tower to the left or
to the right, to me it should not "tower" at all, it is simply a list
and it has n-entries in it. However, that is just my personal bias.


I think one can see an even more extreme case of that bias in
"languages" like HTML, which are just barely hierarchical. The
essential idea is that one has a stream of text and one inserts markup
to distinguish portions of it, but the markup should not interfere too
much with the text as text.


Best regards,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.