Re: Interface (Grammar precedence/notation) question


From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 5 Jan 2007 05:46:17 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 06-12-108 06-12-110
Keywords: design
Posted-Date: 05 Jan 2007 05:46:17 EST

I wrote:
>> I am in the process of considering a new feature for a new version of
>> Yacc++
>
>> The question is, will this reversing of the precedence of and and or
>> be a bad interface idea?


Hans-Peter Diettrich <DrDiettrich1@aol.com> writes:


>
> Yes, definitely.
> I'd suggest a different wording, at least.


Hans said:
H> It will be no worse than interchanging the meaning of 'true' and 'false',
H> or '0' and '1', or 'US' and 'China'. That is, almost every user will do it
H> wrong, even if they try hard to do it right. :-)


After hearing both your and Hans's replies, I am not going to do
something non-standard. The point that most users would simply get it
wrong, even when trying, is the most poignant and credible one. Fortunately,
your reply suggests an alternate solution that preserves the
simplicity of what I want and doesn't require me to mess up the and/or
precedence in the process. *THANKS!!!* I'll point out the key idea
later.




>> family inside_grammar: LL(1);
>> family outside_grammar: LALR(1);
>>
>> Now, note that because the inside_grammar is embedded inside the
>> outside_grammar, it must be LALR(1), because otherwise the checks on
>> the outside grammar will fail.
>
> Some kind of inheritance?


Yes, effectively, although we won't call it inheritance because we
already have inheritance in grammars, used in a completely different way.


If you have a rule where non-terminal a is defined in terms of
non-terminal b, the restrictions on non-terminal a are also applied
to b, unless the restriction on non-terminal a was a "mostly"
restriction and b had a specific restriction declared for it (like an
override). If the restrictions on non-terminal a were not "mostly"
restrictions, they apply to non-terminal b even if b has specific
restrictions declared.


That makes the restrictions work "right". If you want the whole
grammar to be LL(1), then all of its parts must be LL(1). If you want
some part of a grammar to be LR(1), then you want all the sub-parts of
that grammar to be LR(1). If you embed one type of grammar in the
other, you want the outer checks applied to the inner grammar, and thus
the inner grammar must satisfy both.


The idea of the "mostly" restriction is for cases where you know that
most of your grammar should follow some standard (i.e. most grammars
one naturally writes are both LL(1) and LR(1)), but you may have
certain problem parts that you want verified to a looser standard, or
perhaps not verified at all because you can't write them in a
class-conformant manner and you have hand-checked them.
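

As a rough sketch of how that might look, using the family notation I
quoted above (the rule name below is made up purely for illustration,
and the exact syntax is not final):

    family grammar: mostly ll(1);   // most of the grammar should check as LL(1)
    family dangling_if: lalr(1);    // a known problem spot, verified only to the looser LALR(1) standard

A change that broke the LL(1) property of the rest of the grammar
would then be flagged, while the dangling_if rule stays exempt from
the stricter check.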


> I wonder what's the purpose of the grammar classification.
> Shall it describe the properties (and syntax?) of existing grammars, or
> shall it indicate what grammar class is acceptable for the intended parser?


The purpose is to help users write "trusted" grammars. If your
grammar is LL or LR, you know it is unambiguous, and a tool can
verify that your grammar fits those requirements.


However, if one is writing a more general grammar, a predicated one, a
PEG (parsing expression grammar), or a GLR grammar, one might write an
ambiguous grammar without knowing it. Moreover, such grammars cannot,
in general, be mechanically checked for ambiguity.
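

The textbook case (nothing to do with Yacc++ specifically) is an
expression grammar like

    expr : expr '-' expr
         | NUM
         ;

An LR check reports the conflict immediately, but a GLR parser will
happily accept the grammar as written, and "1 - 2 - 3" silently has
two parses (left- and right-associated); nothing forces the author to
notice.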


Fortunately, if your grammar doesn't have any recursive productions
that escape from its LL or LR sub-parts, then you can verify the
sub-parts and be certain that the outer levels, which use
backtracking, are still relatively "safe"--you won't have complicated
nested ambiguous cases; all your ambiguities will be simple.


> As indicated above, wouldn't it be sufficient to specify a
> compatibility tree (or graph) for the grammar types? Then the parser
> type can be determined from the given grammars, and it can be restricted
> to some "level" in the tree. This would lead to a distinction between
> the grammar and parser types, where grammar types describe the given
> grammars, and the parser type describes the desired parser.


My intent is to have a separate declaration that describes the "output
type", such as recursive descent, LL tables, LR tables, GLR tables,
DFA, or NFA. The first implementation is likely to support only one or
two output types.


> IMO precedence and binding can be added to LL grammars, in the same way
> as they already can be added to LR grammars. It's a matter of the
> implementation, whether such an annotated LL grammar will be
> auto-expanded into a traditional LL grammar, internally, or whether it
> will be implemented immediately in a specialized rule (subparser) class.


I'm pretty certain that you are right about this. I'll know more as I
work through the details.
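

For instance (just sketching the idea; the annotations below are not
meant to be Yacc++'s actual syntax), a rule written as

    expr : expr '+' expr   // level 1, left associative
         | expr '*' expr   // level 2, left associative
         | NUM
         ;

could be auto-expanded internally into the traditional stratified LL
form

    expr   : term ('+' term)* ;
    term   : factor ('*' factor)* ;
    factor : NUM ;

or, as Hans-Peter suggests, handled directly by a specialized
operator-precedence sub-parser for that rule.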


> I'd also favor an optional disambiguation by sequential application of
> alternatives (in LL parsers), turning LL(1) errors into non-fatal
> warnings. The resulting code would look like:
> if (token in FirstOfAlt1) Alt1()
> else if (token in FirstOfAlt2) Alt2()
> ...
> instead of the traditional
> case (token) of {...}


Yes, I've thought about that as a useful thing. I don't know if I
will incorporate it into what I'm doing, as I don't know if I'm even
going to have a "recursive descent" output option yet.


>> What should I call LL grammars that depend on the hack that allows them
>> to deal with the if-then-else ambiguity?
>
> Also the implied handling of the dangling else, by a "longest match"
> attribute, IMO is applicable to every grammar type?


That thought is worth considering. The more things I can factor out
this way, the better the categorization scheme will be.
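

For reference, the dangling-else case looks the same in every class.
Given

    stmt : IF expr THEN stmt
         | IF expr THEN stmt ELSE stmt
         | other
         ;

the "longest match" resolution (prefer to consume the ELSE now) binds
each ELSE to the nearest unmatched IF, whether that is realized as the
usual LL hack, as a shift preference in LR, or by putting the longer
alternative first in a PEG's ordered choice.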


> What about:
> family grammar: mostly lalr, lalr or ll;


I like this comma notation: it is a nice implied "and" that doesn't
use the "and" keyword and thus won't get confused with the Boolean
"and", which can keep its normal precedence. It also works well with
what I wrote earlier:

>> I guess if I implicitly "and" multiple family rules, then one could
>> write it as:
>>
>> family grammar: lalr or ll; // any part of this grammar must be lalr or ll
>> family grammar: mostly lalr; // the top rule(s) must be lalr


I can then talk about the list of rules being implicitly ANDed
together (i.e. they all apply) and yet still leave the normal Boolean
"and" and "or" with their usual precedence. That will be natural and
not confusing.
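

So, continuing the illustrative syntax from above, these two forms
would mean the same thing:

    family grammar: lalr or ll;    // any part of this grammar must be lalr or ll
    family grammar: mostly lalr;   // the top rule(s) must be lalr

    family grammar: mostly lalr, lalr or ll;   // comma: both restrictions apply

while inside a single restriction, "and" and "or" keep their ordinary
Boolean precedence.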


> Why that? What's the purpose or the consequences of according
> (mis-)behaviour?


Worth emphasizing: the purpose of this notation is to facilitate error
checking of the grammar. The notation that Yacc++ supports allows one
to write ambiguous grammars, which can be a good thing, since it
simplifies the writing of grammars. However, with predicates you can
get subtle ambiguous cases that may not mean what you think. This will
become even more true as I extend Yacc++ to handle the GLR and PEG
grammar classes.


In fact, it was the PEG grammar class that got me thinking about this.
When one writes a PEG grammar there is an implicit ordering of the
alternatives that affects how things are parsed. In many cases this is
fine; in fact, it is part of why PEGs are "composable". However, it
isn't fine when one writes a recursive case where the recursion
depends on an ordered sub-non-terminal. In that case, one gets the
same kind of ambiguity I don't like about hand-written recursive
descent: you aren't really sure what is or is not in the language
being parsed (and, if something is parsed, what it is recognized as).
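

The standard small example from the PEG literature shows the problem:

    A <- 'a' A 'a' / 'aa'

It looks like it should match every even-length string of a's, but
because the ordered choice commits inside the recursion, it actually
matches (as a complete input) only strings whose length is a power of
two. Nothing in the notation itself warns you that the ordering and
the recursion interact that way.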


That problem doesn't occur if all the recursive parts of one's grammar
are LL or LR. Thus, if one had a tool where one could specify that
the different fragments were in the various classes (what I proposed
above), the tool checked those constraints, and one also specified
that the parts outside the checked classes weren't recursive, one
would eliminate the worst ambiguity problems--the nested ones.


You still wouldn't know whether two fragments of the grammar parsed
the same input sequence, but at least that problem would exist only at
the "gross" level, and there a case where something might be both a
declaration and an expression may well be an acceptable ambiguity that
you want resolved in a specific way.
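

The familiar C-family instance is something like

    stmt : declaration
         | expression ';'
         ;

where an input such as "T * p ;" fits both alternatives (a pointer
declaration or a discarded multiplication). That is a "gross"-level
overlap one can see and deliberately resolve, say in favor of the
declaration, rather than a subtle nested one.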


Most importantly for me, it can help one "maintenance-error-proof" the
grammar. One does that by marking the parts of the grammar one knows
are supposed to be well-formed (e.g. LL and/or LR), so that if someone
makes a change that breaks that property, the change is flagged.


The reason behind this is that some Yacc++ grammars are "venerable"
these days. I have clients who have grammars where none of the
original authors are still on the team. Moreover, those grammars are
large and often maintained by less experienced developers. Sometimes
even I have trouble figuring out what a given subtle change to a
grammar does to the language parsed. Keeping these grammars working
can be a major task. The more I can help the users write sound,
bug-free code, the better everyone's life is.


> Apart from LL, LR etc. grammar types, the grammar syntax should be
> addressed. IMO EBNF allows one to describe both LL and LR grammars,
> whereas BNF may exclude an LL interpretation, due to explicitly unrolled
> repetitions (into left or right recursion).


Yes, Yacc++ already accepts EBNF with extensions. And I agree that
writing things as regular expressions is usually better than writing
explicit recursions.
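

That is, a list written with a repetition, roughly

    arg_list : arg (',' arg)* ;

reads equally well as LL or LR, whereas the unrolled BNF form

    arg_list : arg_list ',' arg
             | arg
             ;

has already committed to left recursion and so shuts out a
straightforward LL interpretation.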


Again, thanks for the detailed feedback. I hope my responses make it
clearer what I am attempting to do.


BTW, it is my intent that a version of this will be released as free
(open source) software (either GPL or BSD, I'm not sure yet), as well
as under the current license. I will let you know more as it gets
closer to reality.


-Chris


*****************************************************************************
Chris Clark                  Internet : compres@world.std.com
Compiler Resources, Inc.     Web Site : http://world.std.com/~compres
23 Bailey Rd                 voice    : (508) 435-5016
Berlin, MA 01503 USA         fax      : (978) 838-0263 (24 hours)

