Re: Problem with ambigous grammar

Chris F Clark <cfc@shell01.TheWorld.com>
2 Jan 2007 01:01:32 -0500

          From comp.compilers

Related articles
Problem with ambigous grammar sr@247ms.com (2007-01-01)
Re: Problem with ambigous grammar cfc@shell01.TheWorld.com (Chris F Clark) (2007-01-02)
| List of all articles for this month |

From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 2 Jan 2007 01:01:32 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 07-01-007
Keywords: parse
Posted-Date: 02 Jan 2007 01:01:32 EST

sr@247ms.com writes:
> I am quite new to parsing and don't know if this newsgroup is
> appropriate for this type of questions, anyway, the question regards a
> certain type of ambiguity in a CF grammar and maybe some of you can
> still help.
>
> I have a production R -> A | B and for the non-terminals A and B I
> have: FIRST(A) intersect FIRST(B) != EMPTY-SET.
> A sentence with a terminal in this intersection will now eventually
> have 2 parse trees.
>
> example:
>
> R -> keyword | identifier
> keyword -> if | then | else | class | ...
> identifier -> [a-z]+


Yes, this is actually a very common theoretical problem. As, John our
moderator said, it is rare to find a grammar that doesn't suffer from
this problem.


In addition, there are numerous solutions to this problem. You have
stumbled on some of them. Please read through them all, since the
order you listed them in isn't the most practical order, as many of
your ideas are overkill.


> a) when the rule R->keyword produces a string s, then don't try the
> rule R->identifier.


This is the "parsing expression grammar" approach. One orders the
parser's attempts by the order the rules appear in the grammar and
stops once one has found a rule that matches. The PEG approach came
out of extending the work Terence Parr pioneered on "predicates".
Predicates are ways of specifying what you call "precedence" in a
grammar just the way you would like, which is not actually the normal
meaning of the term, but that's ok--your meaning was understood.
Predicates are a way of saying, if two rules apply, which rule you
want the parser to use.


Predicates are a very powerful mechanism. You can solve many
non-context free parsing problems using them. However, they are also
"dangerous" in the sense that you can write predicates that effectively
override the way a parser works and in doing so create a language
where you really aren't sure what is accepted. The main problem with
predicates is that in the general case we have no way of verifying
exactly what language a predicated parser accepts, as explained when I
discuss your XOR idea.


It is worth noting that predicates were initially implemented by
backtracking (and thus took non-linear time) and are a sophisticated
way of specifying controlled backtracking. However, one of the
advances of PEGs was to establish a set of grammars that could be
processed in linear time.


> b) Get all parse trees. Afterwards remove the 'bad' ones.


This is the GLR approach, make a parse forest and if the result isn't
a tree, use semantics to prune away the ambiguous trees that aren't
wanted. As you noted in some text I elided, the GLR approach can be
slow if used two liberally, since it can easily have n**2 running
times, rather than linear ones which DFA, LL, and LR techniques have.
Using it to parse identifiers would probably be such a case, as you
wouldn't want your n**2 case to be on the length of each identifier.


> d) replace B with a new non-terminal C such that FIRST(C) =
> FIRST(B)-FIRST(A). Question: can such a C be constructed (at all?
> sometimes maybe? )


In this case, the intersection can be constructed. However, in many
cases, especially with complex recursion the taking of the
intersection is non-trivial (and I'm certain it cannot be automated
because whether two context-free grammar (fragments) parse the same
text is undecidable in general). Still, it is worth noting that the
complex cases don't occur in practice. If you look up "scanner-less
parsing", you should find papers where they deal with this problem by
simply assuming that a suitably powerful parser generator can
successfully take the intersection.


> I can immagine that c) and d) are similar. Of course both c and d will
> Of course they solved ambiguity by simply specifying this extra 'that
> is not a keyword' in the grammar, but this is definately not BNF
> notation to me.


That's right it is not BNF. However, as John pointed out, that's
probably *THE CLASSIC* practical solution.


Simply put, one splits the grammar in two parts the "lexer" and the
"parser". The lexer handles the rules for identifiers and keywords.
The parser handles the larger structures. Next, one arranges the
lexer grammar such that if the string matches both a keyword and an
identifier, the string is treated as a keyword. This is easy to do
with regular expressions, particularly, if you have a tool like LEX
that has a "first match" rule, that allows you to specify if two
regular expressions match, which one you want returned by the lexer.
Note, this is such a powerful solution, almost all lexer generators
support this feature.


The nice thing about this split is that, while in theory one could use
this to do all sorts of powerful and unmaintainable things, in
practice it can easily be used in a very limited and maintainable
fashion. One writes a simple lexer grammar, that is easily verified
and one can easily check the exact cases where one wants one token
instead of another. Then, one writes a normal parsing grammar based
on those tokens and does not need to worry about the token
interactions.


> This brings me then to another question: How can we specify a 'XOR'
> operation using BNF notation and how can we specify a 'NOT' operation
> using BNF notation. If we had a way to specify a XOR operator in a CF
> grammar then we could easily solve the problem above. maybe a CF
> grammar is not strong enough to specify 'XOR', then maybe in a CS or
> PS grammar ?


You are right in tradition CF notation, there are no XOR nor NOT
operations. In fact, deterministic CF grammars are not closed under
these operations (set-difference and complement). If you introduce
these notations, one has effectively introduced a subset of
"predicated" grammars.


There is no known parsing technology that can deal with the XOR or NOT
concepts, plus deterministic CF grammars, that isn't able to deal with
predicated grammars. Moreover, there is little need for such a
technology either, because once you have the XOR or NOT operation, you
can describe grammars where you have to know if the intersection of
two CF grammars is empty (the undecidable problem mentioned above),
and as a result it isn't possible to mechanically validate that such
grammars aren't ambiguous.


As a result, there are basically four classes of your problem,
although others, especially those which don't like lexical feedback and
prefer alternate solutions, may divide these groupings differently.


1) Your language is parsable under a sufficiently powerful
      deterministic (but not predicated scheme). You can use a
      scanner-less parser in that case. These cases can be automatically
      checked.


2) Your language needs only the traditional lexer/parser split to solve
      it. Most languages are in this set. These cases can usually be
      trivially checked by hand. The triviality of these checks are
      such, that one doesn't usually even bother.


3) Not discussed above, you need a slight variation on the
      lexer/parser split to solve the language. The most common issue is
      the need for "lexer feedback" to resolve some parsing problem.
      These cases can also usually checked by hand, but not always
      trivially. Languages with context sensitive keywords and/or macro
      processors often fall into this class. C++ may or may not fit in
      this class.


4) You need full predicated (or backtrack) parsing. Writing proofs of
      these cases can be quite hard. One predicated parser generator
      (meta-S aka grammarforge) has been shown to be Turing powerful,
      which means determining whether a grammar that is acceptible to it
      is unambiguous is equivalent to solving the halting problem.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.