23 Feb 2001 00:00:08 -0500

Related articles |
---|

detecting ambiguous grammars thant@acm.org (Thant Tessman) (2001-02-15) |

Re: detecting ambiguous grammars davidpereira@home.com (David Pereira) (2001-02-17) |

Re: detecting ambiguous grammars cfc@world.std.com (Chris F Clark) (2001-02-23) |

Re: detecting ambiguous grammars joachim_d@gmx.de (Joachim Durchholz) (2001-03-01) |

Re: detecting ambiguous grammars henry@spsystems.net (2001-03-04) |

Re: detecting ambiguous grammars thant@acm.org (Thant Tessman) (2001-03-04) |

Re: detecting ambiguous grammars davidpereira@home.com (David Pereira) (2001-03-08) |

Re: detecting ambiguous grammars dlester@cs.man.ac.uk (2001-03-10) |

Re: detecting ambiguous grammars thant@acm.org (Thant Tessman) (2001-03-12) |

[13 later articles] |

From: | Chris F Clark <cfc@world.std.com> |

Newsgroups: | comp.compilers |

Date: | 23 Feb 2001 00:00:08 -0500 |

Organization: | Compilers Central |

References: | 01-02-080 |

Keywords: | parse, theory |

Posted-Date: | 23 Feb 2001 00:00:08 EST |

*> Is there a general algorithm for detecting whether a grammar is*

*> ambiguous? I've read posts that suggest the answer is no, but could*

*> someone point me at the reason why?*

First, let me clarify your question. You are probably asking about

the ambiguity of general context free languages (CFLs). CFLs are

generally what one is talking about when one takes about a grammar or

BNF. CFLs the languages defined by the set of grammars one gets when

one allows (potentially recursive) productions with rules of the form:

non-terminal: sequence-of-terminals-and-non-terminals;

e.g.

stmt: if expr then stmt (else stmt)?

| var "=" expr

;

For the general class of CFLs, i.e. just an arbitrary grammar whose

rules are not otherwise constrained, the answer is there is no

algorithm to determine whether the grammar is ambiguous or

unambiguous.

By algorithm I mean a program that terminates (reaches an answer) and

that answer correctly distinguishes the two cases (ambiguous and

unambiguous) for all grammars. There are algorithms which can detect

subsets on one class--e.g. find a set of grammars such that all

grammars in the class are [un]ambiguous--but where there are grammars

outside the class may still be either ambiguous or not.

The reason this is so, is that it is possible to write a grammar that

solves a "Post Correspondence Problem". You can find a proof that

this is so in _The Theory of Parsing, Translation, and Compiling_ by

Aho and Ullman. A "Post Correspondence Problem" shows whether one

string can be converted to another using certain replacement rules.

You can map Turing Machine problems into Post Correspondence Problems.

Thus, if you could solve the Post Correspondence Problem, you could

solve the equivalent Turing Machine problem. Thus, the algorithm the

determines that a grammar is unambiguous, would be able to show that a

particular Turing Machine halts. And, of course, we all know that no

algorithm (run on anything equivalent to a Turing Machine) cannot

determine that an arbitrary Turing Machine halts. Thus, we cannot

decide that a grammar is unambiguous (for any arbitrary grammar).

Thus, for arbitrary CFLs, the problem is unsolvable. However, there

are subclasses where the problem is solvable.

*> A naive approach would be to systematically generate all possible*

*> strings less than a given length and just check to see if any of the*

*> strings are equivalent. Of course this only works for statements*

*> less than the given string length. (Is there a way to prove that*

*> this is sufficient for some class of grammars?)*

Your naive algorithm is not a bad start. However, most interesting

grammars generate strings of arbitrary length. The way to beat such

an algorithm is akin to one way people used to beat chess playing

programs--push the bad case sufficiently far into the future. In

other words, you can always define a grammar that generates a string

of an arbitrary length before generating an ambiguity. Thus, you can

always find a grammar whose ambiguity requires a string longer that

what your algorithm has checked so far.

While that is fatal to the algorithm, it is not as bad as it seems.

To create a grammar whose ambiguity lies indefinitely far in the

future, one must have an arbitrarily large grammar. In other words,

if you have a grammar of a known size (i.e. a fixed number of rules

and a fixed length for each rule), there does exists some bound where

the naive algorithm will have divided the grammars (of that size and

smaller) into ambiguous and unambiguous classes.

However, given the general unsolvability of the problem, there can be

no algorithm that computes that bound. Thus, there is some number

where we could let your naive algorithm run and it would have solved

the problem for our arbitrary grammar in question. However, we don't

know what that number is. Moreover, one of the two answers (ambiguous

or unambiguous) must be represented by the naive program still not

having found an answer in that time.

In other words, the naive algorithm must detect an unambiguous grammar

by not having found an ambiguity in the grammar within the bounded

number of steps. However, since we don't know what that bound is, we

don't know when to stop the naive program and declare that the grammar

is unambiguous. We only know that the grammar is unambiguous or it is

ambiguous, but not within the first n symbols. So, your naive

algorithm is one that allows you to detect grammars that are ambiguous

in the first n symbols, but there are both unambiguous and ambiguous

grammars that your algorithm fails to distinguish.

*> But how about generating all possible strings where each*

*> non-terminal is expanded at least three times. Why three times?*

*> Because that's what seems to work in the hand expansions I've done.*

In the general case, 3 will not be large enough. The above argument

proves that any fixed constant will be too small (for arbitrary

grammars).

On the other hand, most grammars are trivially ambiguous or

unambiguous. Especially the grammars that people write for things

like programming languages. As a result, 3 is not a bad cutoff for a

tool that helps you catch the obvious cases. Just realize that your

tool may not detect some subtle ambiguities. Note setting the number

to 4, 5, or more will catch a few more subtle cases. However, you

will find that diminshing returns quickly set in. Unfortuntately, for

any real sized grammar, the subtlest cases will not be distinguished

until well past the point where diminshing returns have set in.

*> The grammar has produced two instances of the string "id + id + id"*

*> (as well as some identical E3 strings) which shows that the grammar*

*> is ambiguous. Can folks suggest why this approach may or may not be*

*> a wild goose chase? (I'm going to be using a top-down parser and my*

*> goal is not efficiency, but the exploration of the implications of*

*> various grammar designs.)*

Finally, if you are going to build a top-down parser, I hope you will

use one of the available tools to do so. Of course, those tools solve

the ambiguity problem. Any grammar which is LL is unambiguous. (Same

is true for LR grammars, but you mentioned top-down which is generally

a synonym for LL). The only place you would have potential ambiguity

problems is where you used the tools extensions that allow you to

escape the LL limitations (e.g. predicates or java code).

I have a lot more thoughts on this topic as it is relevant to the

LR-infinity work that I purse off-and-on. However, I hope this has

addressed your needs.

Regards,

-Chris

*****************************************************************************

Chris Clark Internet : compres@world.std.com

Compiler Resources, Inc. Web Site : http://world.std.com/~compres

3 Proctor Street voice : (508) 435-5016

Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.