Re: detecting ambiguous grammars

Chris F Clark <cfc@world.std.com>
23 Feb 2001 00:00:08 -0500

From comp.compilers

Related articles
detecting ambiguous grammars thant@acm.org (Thant Tessman) (2001-02-15)
Re: detecting ambiguous grammars davidpereira@home.com (David Pereira) (2001-02-17)
*Re: detecting ambiguous grammars cfc@world.std.com (Chris F Clark)* (2001-02-23)**
Re: detecting ambiguous grammars joachim_d@gmx.de (Joachim Durchholz) (2001-03-01)
Re: detecting ambiguous grammars henry@spsystems.net (2001-03-04)
Re: detecting ambiguous grammars thant@acm.org (Thant Tessman) (2001-03-04)
Re: detecting ambiguous grammars davidpereira@home.com (David Pereira) (2001-03-08)
Re: detecting ambiguous grammars dlester@cs.man.ac.uk (2001-03-10)
Re: detecting ambiguous grammars thant@acm.org (Thant Tessman) (2001-03-12)
[13 later articles]

| List of all articles for this month |

From:	Chris F Clark <cfc@world.std.com>
Newsgroups:	comp.compilers
Date:	23 Feb 2001 00:00:08 -0500
Organization:	Compilers Central
References:	01-02-080
Keywords:	parse, theory
Posted-Date:	23 Feb 2001 00:00:08 EST

> Is there a general algorithm for detecting whether a grammar is
> ambiguous? I've read posts that suggest the answer is no, but could
> someone point me at the reason why?

First, let me clarify your question. You are probably asking about
the ambiguity of general context free languages (CFLs). CFLs are
generally what one is talking about when one takes about a grammar or
BNF. CFLs the languages defined by the set of grammars one gets when
one allows (potentially recursive) productions with rules of the form:

non-terminal: sequence-of-terminals-and-non-terminals;

e.g.

stmt: if expr then stmt (else stmt)?
| var "=" expr
;

For the general class of CFLs, i.e. just an arbitrary grammar whose
rules are not otherwise constrained, the answer is there is no
algorithm to determine whether the grammar is ambiguous or
unambiguous.

By algorithm I mean a program that terminates (reaches an answer) and
that answer correctly distinguishes the two cases (ambiguous and
unambiguous) for all grammars. There are algorithms which can detect
subsets on one class--e.g. find a set of grammars such that all
grammars in the class are [un]ambiguous--but where there are grammars
outside the class may still be either ambiguous or not.

The reason this is so, is that it is possible to write a grammar that
solves a "Post Correspondence Problem". You can find a proof that
this is so in _The Theory of Parsing, Translation, and Compiling_ by
Aho and Ullman. A "Post Correspondence Problem" shows whether one
string can be converted to another using certain replacement rules.
You can map Turing Machine problems into Post Correspondence Problems.
Thus, if you could solve the Post Correspondence Problem, you could
solve the equivalent Turing Machine problem. Thus, the algorithm the
determines that a grammar is unambiguous, would be able to show that a
particular Turing Machine halts. And, of course, we all know that no
algorithm (run on anything equivalent to a Turing Machine) cannot
determine that an arbitrary Turing Machine halts. Thus, we cannot
decide that a grammar is unambiguous (for any arbitrary grammar).

Thus, for arbitrary CFLs, the problem is unsolvable. However, there
are subclasses where the problem is solvable.

> A naive approach would be to systematically generate all possible
> strings less than a given length and just check to see if any of the
> strings are equivalent. Of course this only works for statements
> less than the given string length. (Is there a way to prove that
> this is sufficient for some class of grammars?)

Your naive algorithm is not a bad start. However, most interesting
grammars generate strings of arbitrary length. The way to beat such
an algorithm is akin to one way people used to beat chess playing
programs--push the bad case sufficiently far into the future. In
other words, you can always define a grammar that generates a string
of an arbitrary length before generating an ambiguity. Thus, you can
always find a grammar whose ambiguity requires a string longer that
what your algorithm has checked so far.

While that is fatal to the algorithm, it is not as bad as it seems.
To create a grammar whose ambiguity lies indefinitely far in the
future, one must have an arbitrarily large grammar. In other words,
if you have a grammar of a known size (i.e. a fixed number of rules
and a fixed length for each rule), there does exists some bound where
the naive algorithm will have divided the grammars (of that size and
smaller) into ambiguous and unambiguous classes.

However, given the general unsolvability of the problem, there can be
no algorithm that computes that bound. Thus, there is some number
where we could let your naive algorithm run and it would have solved
the problem for our arbitrary grammar in question. However, we don't
know what that number is. Moreover, one of the two answers (ambiguous
or unambiguous) must be represented by the naive program still not
having found an answer in that time.

In other words, the naive algorithm must detect an unambiguous grammar
by not having found an ambiguity in the grammar within the bounded
number of steps. However, since we don't know what that bound is, we
don't know when to stop the naive program and declare that the grammar
is unambiguous. We only know that the grammar is unambiguous or it is
ambiguous, but not within the first n symbols. So, your naive
algorithm is one that allows you to detect grammars that are ambiguous
in the first n symbols, but there are both unambiguous and ambiguous
grammars that your algorithm fails to distinguish.

> But how about generating all possible strings where each
> non-terminal is expanded at least three times. Why three times?
> Because that's what seems to work in the hand expansions I've done.

In the general case, 3 will not be large enough. The above argument
proves that any fixed constant will be too small (for arbitrary
grammars).

On the other hand, most grammars are trivially ambiguous or
unambiguous. Especially the grammars that people write for things
like programming languages. As a result, 3 is not a bad cutoff for a
tool that helps you catch the obvious cases. Just realize that your
tool may not detect some subtle ambiguities. Note setting the number
to 4, 5, or more will catch a few more subtle cases. However, you
will find that diminshing returns quickly set in. Unfortuntately, for
any real sized grammar, the subtlest cases will not be distinguished
until well past the point where diminshing returns have set in.

> The grammar has produced two instances of the string "id + id + id"
> (as well as some identical E3 strings) which shows that the grammar
> is ambiguous. Can folks suggest why this approach may or may not be
> a wild goose chase? (I'm going to be using a top-down parser and my
> goal is not efficiency, but the exploration of the implications of
> various grammar designs.)

Finally, if you are going to build a top-down parser, I hope you will
use one of the available tools to do so. Of course, those tools solve
the ambiguity problem. Any grammar which is LL is unambiguous. (Same
is true for LR grammars, but you mentioned top-down which is generally
a synonym for LL). The only place you would have potential ambiguity
problems is where you used the tools extensions that allow you to
escape the LL limitations (e.g. predicates or java code).

I have a lot more thoughts on this topic as it is relevant to the
LR-infinity work that I purse off-and-on. However, I hope this has
addressed your needs.

Regards,
-Chris

*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: detecting ambiguous grammars

Chris F Clark <cfc@world.std.com>23 Feb 2001 00:00:08 -0500

Chris F Clark <cfc@world.std.com>
23 Feb 2001 00:00:08 -0500