2 Jan 2007 01:01:32 -0500

Related articles |
---|

Problem with ambigous grammar sr@247ms.com (2007-01-01) |

Re: Problem with ambigous grammar cfc@shell01.TheWorld.com (Chris F Clark) (2007-01-02) |

From: | Chris F Clark <cfc@shell01.TheWorld.com> |

Newsgroups: | comp.compilers |

Date: | 2 Jan 2007 01:01:32 -0500 |

Organization: | The World Public Access UNIX, Brookline, MA |

References: | 07-01-007 |

Keywords: | parse |

Posted-Date: | 02 Jan 2007 01:01:32 EST |

sr@247ms.com writes:

*> I am quite new to parsing and don't know if this newsgroup is*

*> appropriate for this type of questions, anyway, the question regards a*

*> certain type of ambiguity in a CF grammar and maybe some of you can*

*> still help.*

*>*

*> I have a production R -> A | B and for the non-terminals A and B I*

*> have: FIRST(A) intersect FIRST(B) != EMPTY-SET.*

*> A sentence with a terminal in this intersection will now eventually*

*> have 2 parse trees.*

*>*

*> example:*

*>*

*> R -> keyword | identifier*

*> keyword -> if | then | else | class | ...*

*> identifier -> [a-z]+*

Yes, this is actually a very common theoretical problem. As, John our

moderator said, it is rare to find a grammar that doesn't suffer from

this problem.

In addition, there are numerous solutions to this problem. You have

stumbled on some of them. Please read through them all, since the

order you listed them in isn't the most practical order, as many of

your ideas are overkill.

*> a) when the rule R->keyword produces a string s, then don't try the*

*> rule R->identifier.*

This is the "parsing expression grammar" approach. One orders the

parser's attempts by the order the rules appear in the grammar and

stops once one has found a rule that matches. The PEG approach came

out of extending the work Terence Parr pioneered on "predicates".

Predicates are ways of specifying what you call "precedence" in a

grammar just the way you would like, which is not actually the normal

meaning of the term, but that's ok--your meaning was understood.

Predicates are a way of saying, if two rules apply, which rule you

want the parser to use.

Predicates are a very powerful mechanism. You can solve many

non-context free parsing problems using them. However, they are also

"dangerous" in the sense that you can write predicates that effectively

override the way a parser works and in doing so create a language

where you really aren't sure what is accepted. The main problem with

predicates is that in the general case we have no way of verifying

exactly what language a predicated parser accepts, as explained when I

discuss your XOR idea.

It is worth noting that predicates were initially implemented by

backtracking (and thus took non-linear time) and are a sophisticated

way of specifying controlled backtracking. However, one of the

advances of PEGs was to establish a set of grammars that could be

processed in linear time.

*> b) Get all parse trees. Afterwards remove the 'bad' ones.*

This is the GLR approach, make a parse forest and if the result isn't

a tree, use semantics to prune away the ambiguous trees that aren't

wanted. As you noted in some text I elided, the GLR approach can be

slow if used two liberally, since it can easily have n**2 running

times, rather than linear ones which DFA, LL, and LR techniques have.

Using it to parse identifiers would probably be such a case, as you

wouldn't want your n**2 case to be on the length of each identifier.

*> d) replace B with a new non-terminal C such that FIRST(C) =*

*> FIRST(B)-FIRST(A). Question: can such a C be constructed (at all?*

*> sometimes maybe? )*

In this case, the intersection can be constructed. However, in many

cases, especially with complex recursion the taking of the

intersection is non-trivial (and I'm certain it cannot be automated

because whether two context-free grammar (fragments) parse the same

text is undecidable in general). Still, it is worth noting that the

complex cases don't occur in practice. If you look up "scanner-less

parsing", you should find papers where they deal with this problem by

simply assuming that a suitably powerful parser generator can

successfully take the intersection.

*> I can immagine that c) and d) are similar. Of course both c and d will*

*> Of course they solved ambiguity by simply specifying this extra 'that*

*> is not a keyword' in the grammar, but this is definately not BNF*

*> notation to me.*

That's right it is not BNF. However, as John pointed out, that's

probably *THE CLASSIC* practical solution.

Simply put, one splits the grammar in two parts the "lexer" and the

"parser". The lexer handles the rules for identifiers and keywords.

The parser handles the larger structures. Next, one arranges the

lexer grammar such that if the string matches both a keyword and an

identifier, the string is treated as a keyword. This is easy to do

with regular expressions, particularly, if you have a tool like LEX

that has a "first match" rule, that allows you to specify if two

regular expressions match, which one you want returned by the lexer.

Note, this is such a powerful solution, almost all lexer generators

support this feature.

The nice thing about this split is that, while in theory one could use

this to do all sorts of powerful and unmaintainable things, in

practice it can easily be used in a very limited and maintainable

fashion. One writes a simple lexer grammar, that is easily verified

and one can easily check the exact cases where one wants one token

instead of another. Then, one writes a normal parsing grammar based

on those tokens and does not need to worry about the token

interactions.

*> This brings me then to another question: How can we specify a 'XOR'*

*> operation using BNF notation and how can we specify a 'NOT' operation*

*> using BNF notation. If we had a way to specify a XOR operator in a CF*

*> grammar then we could easily solve the problem above. maybe a CF*

*> grammar is not strong enough to specify 'XOR', then maybe in a CS or*

*> PS grammar ?*

You are right in tradition CF notation, there are no XOR nor NOT

operations. In fact, deterministic CF grammars are not closed under

these operations (set-difference and complement). If you introduce

these notations, one has effectively introduced a subset of

"predicated" grammars.

There is no known parsing technology that can deal with the XOR or NOT

concepts, plus deterministic CF grammars, that isn't able to deal with

predicated grammars. Moreover, there is little need for such a

technology either, because once you have the XOR or NOT operation, you

can describe grammars where you have to know if the intersection of

two CF grammars is empty (the undecidable problem mentioned above),

and as a result it isn't possible to mechanically validate that such

grammars aren't ambiguous.

As a result, there are basically four classes of your problem,

although others, especially those which don't like lexical feedback and

prefer alternate solutions, may divide these groupings differently.

1) Your language is parsable under a sufficiently powerful

deterministic (but not predicated scheme). You can use a

scanner-less parser in that case. These cases can be automatically

checked.

2) Your language needs only the traditional lexer/parser split to solve

it. Most languages are in this set. These cases can usually be

trivially checked by hand. The triviality of these checks are

such, that one doesn't usually even bother.

3) Not discussed above, you need a slight variation on the

lexer/parser split to solve the language. The most common issue is

the need for "lexer feedback" to resolve some parsing problem.

These cases can also usually checked by hand, but not always

trivially. Languages with context sensitive keywords and/or macro

processors often fall into this class. C++ may or may not fit in

this class.

4) You need full predicated (or backtrack) parsing. Writing proofs of

these cases can be quite hard. One predicated parser generator

(meta-S aka grammarforge) has been shown to be Turing powerful,

which means determining whether a grammar that is acceptible to it

is unambiguous is equivalent to solving the halting problem.

Hope this helps,

-Chris

*****************************************************************************

Chris Clark Internet : compres@world.std.com

Compiler Resources, Inc. Web Site : http://world.std.com/~compres

23 Bailey Rd voice : (508) 435-5016

Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)

------------------------------------------------------------------------------

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.