Interface (Grammar precedence/notation) question

Chris F Clark <cfc@shell01.TheWorld.com>
28 Dec 2006 13:16:56 -0500

          From comp.compilers

Related articles
Interface (Grammar precedence/notation) question cfc@shell01.TheWorld.com (Chris F Clark) (2006-12-28)
Re: Interface (Grammar precedence/notation) question DrDiettrich1@aol.com (Hans-Peter Diettrich) (2006-12-29)
Re: Interface (Grammar precedence/notation) question haberg@math.su.se (2006-12-29)
Re: Interface (Grammar precedence/notation) question cfc@shell01.TheWorld.com (Chris F Clark) (2007-01-05)
| List of all articles for this month |

From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 28 Dec 2006 13:16:56 -0500
Organization: The World Public Access UNIX, Brookline, MA
Keywords: parse, question
Posted-Date: 28 Dec 2006 13:16:56 EST

I am in the process of considering a new feature for a new version of
Yacc++ and I want some input on the notation I am planning to use, in
particular the precedence of the "and" and "or" operators I will be
using. Typically, "and" binds tighter than "or", so that clauses like
"a and b or c", mean if c is true the whole clause is true otherwise
both a and b must be true for the whole clause to be true. In the
notation I am considering it would be more convenient (as I will
explain) to reverse this precedence, so that "a and b or c" means a
mut be true for the whole clause to be true and also either b or c
must also be true.


<For those of you who want to quote a minimal amount and retain
context, here is the core line to quote:>


The question is, will this reversing of the precedence of and and or
be a bad interface idea?


<end question to quote>


The notation I am considering looks something like this:


family nterm1, nterm2, nterm3: ll or lr and mostly lr;


The grammar for the notation looks like this:


    familyrule: "family" nterm ("," nterm)* ":" familyand ";";


    familyand: familyor ("and" familyand)*;


    familyor: familyprinciple ("or" familyprinciple)*;


    familyprinciple: familyname | "(" familyand ")";


    familyname: "mostly"?
                            ( "any"
                            | "ll" number? // also allow parens around number
                            | "lr" number?
                            | "lalr" number?
                            | "slr" number?
                            | "glr"
                            | "peg"
                            | "regular"
                            | "no" "recursion"
                            // more alternatives may be added
                            );




Now, the meaning of a familyrule is that the parser generator should
check the non-terminals listed for obeying (being in) the grammar
class listed.


Thus, a typical "yacc" grammar would be annotated like:


family yacc_grammar: lalr 1;


This means that the entire grammar rooted at the non-terminal
yacc_grammar should describe a language that is LALR(1) parseable,
that is that a LALR(1) parser can be constructed which parses the
language described by the grammar. If the grammar doesn't meet that
constraint (the grammar isn't LALR(1)), an error should be generated.


An "antlr" (or "pccts") grammar would be annotated:


family antlr_grammar: ll;


Thus, an antlr grammar must be parseable by an LL parser, but it
doesn't need to be an LL(1) parser.


A rats! grammar would be annotated:


family rats_grammar: peg;


Finally, a glr grammar would be anotated:


family glr_grammar: glr;


If one wanted a grammar that was parseable by mosy any parser
generator, one might specify it as:


family nice_grammar: ll(1) and slr(1);


This notation means that the grammar nice_grammar must be both LL(1)
and SLR(1), which would mean it would be acceptable to nearly any
tool.


Finally, perhapes we have a grammar that all we care about is that it
is parseable by some other tool and we don't care whether that tool is
yacc-like or pccts-like. We might annotate that grammar as.


family ok_grammar: ll or lalr(1);


This means that ok_grammar can be either LL or it can be LALR(1) and
no error will be generated. However, we don't care which language
class the grammar belongs to, as long as it is one or the other.


Finally, if we didn't want the grammar checked, we would mark it
any.


family unchecked_grammar: any;


No errors are reported for grammars marked any. If they don't parse
what you expect, well you told us not to warn you.


Ok, that's the simple cases. Let's deal with grammars inside grammars
(sub-grammars) so to speak. Suppose, we have a grammar
(inside_grammar) that we know is LL(1) and we want to embed it in a
larger grammar (outside_grammar), and we want that larger grammar to
be LALR(1). We mould mark that case as:


family inside_grammar: LL(1);
family outside_grammar: LALR(1);


Now, note that because the inside_grammar is embedded inside the
outside_grammar, it must be LALR(1), because otherwise the checks on
the outside grammar will fail.


If we wanted to relax that constraint, we would write:


family inside_grammar: LL(1);
family outside_grammar: MOSTLY LALR(1);


That would mean the parser generator would start checking
outside_grammar using the LALR(1) rules, but when it came to
inside_grammar, since it was marked LL(1), it would apply the LL(1)
checks instead. (There's a clever paper by one of Nigel Horspool's
students whose name escapes me now, since I don't have it in front of
me, that shows how one can do exactly that--well, with a *little*
handwaving.)


The point of the "mostly" clause is that it constrains the top level
non-terminal to be of a certain class (and any unmarked non-terminals
below it), but allows phrases inside to follow different rules, as
long as they are marked as to what rules they follow. A real common
use might be to make an LL grammar for the bulk of the language, but
have expressions and if_statements be LR.


family grammar: mostly ll;
family expression, if_statement: lr;


The nice thing about specifying something that way, is that one gets
assurances that the vast majority of the language is easy to parse, as
it is LL, but one has the freedom to use the more powerful LR parsing
algorithm in the places where one needs it, so that one doesn't have
to create all those nit-picky non-terminals to encode the precedence
hierarchy.


<another quotable question:>


What should I call LL grammars that depend on the hack that allow them
to deal with the if-then-else ambiguity?


<end quotable question>


Technically, such grammars aren't actually LL. However, since most LL
parser generators can deal with them, it seems wrong to not allow one
to express that a grammar fits in the class the such parser generators
can handle. I'm thinking of calling them "near LL". However, if there
is an established term for them (or if near LL is used for something
else), I'm open to other names.


And, now to why I want the precedence reversed. Suppose one wants a
gramar where any fragment should either be LL or LALR, but the top set
of rules should be LALR. It seems natural to write it as thus.


family grammar: mostly lalr and lalr or ll;
or
family grammar: lalr or ll and mostly lalr;


Instead of requiring parenthesis as in:


family grammar: mostly lalr and (lalr or ll);


However, I am well aware that little issue of precedence can make
languages harder to write (it was an often laid critique of Wirth's
languages that with each one he got a different set of precedences
wrong), and the precedence I am proposing is "reversed" from normal
Boolean logic. My worry is that people will not trust the unusual
precedence and simply parenthesize all and/or combinations.


I guess if I implicitly "and" multiple family rules, then one could
write it as:


family grammar: lalr or ll; // any part of this grammar must be lalr or ll
family grammar: mostly lalr; // the top rule(s) must be lalr


Is that more clear?


Any thoughts?
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.