20 Aug 2003 01:29:23 -0400

Related articles |
---|

Re: Re: DFAs,NFAs,REs,CFGs,PDAs ARGH! bobkfoster@comcast.net (Bob Foster) (2003-08-20) |

Re: Re: DFAs,NFAs,REs,CFGs,PDAs ARGH! derkgwen@HotPOP.com (Derk Gwen) (2003-08-23) |

Re: DFAs,NFAs,REs,CFGs,PDAs ARGH! joachim.durchholz@web.de (Joachim Durchholz) (2003-08-23) |

From: | "Bob Foster" <bobkfoster@comcast.net> |

Newsgroups: | comp.compilers |

Date: | 20 Aug 2003 01:29:23 -0400 |

Organization: | Compilers Central |

References: | <20030818051131.4276.qmail@tom.iecc.com> <084801c36559$e4bea560$6701a8c0@snobird> <Pine.BSI.4.56.0308181033270.26284@tom.iecc.com> |

Keywords: | parse, theory |

Posted-Date: | 20 Aug 2003 01:29:23 EDT |

A recent posting reiterated the Chomsky hierarchy of languages,

characterizing the regular languages as "type 3" and context-free

languages as "type 2" but did not mention a well-known class of

languages that are approximately 2.5 on that scale. Regular tree

languages occupy an interesting position in the heirarchy; they are

regular and can be parsed using regular grammar techniques (such as

Brzozowski derivatives [1]), yet they can be recursively

defined. Considered as a string, a regular tree language is,

obviously, a "parenthesis language", i.e., can have a rule of the form

R -> aRb, where a and b are the special symbols used to denote tree

nesting, and serve as parenthesis. IOW, it is context-free as a string

and regular as a tree, a convenient circumstance that is

much-exploited these days in validating XML documents (e.g., in RELAX

NG [2]) but not much used in a more general parsing context.

But I didn't write this note to bore you with well-known facts about

regular trees, but to point up a little-known fact about tree-like

context-free languages.

In regular tree languages, the "parenthesis tokens" are not in the

alphabet of the language, but they could be. If they are, the result

is a class of context-free languages (somewhat inadvertantly)

described in a 1970 paper by Steven Vere [3]. Vere's technique uses

derivatives to produce DFA for each rule of a context-free grammar,

which are then used by a pushdown automaton to parse the language. At

the end of the paper it is noted that the technique fails for some

context-free grammars. Judging by citations, this seems to have led

the paper to be viewed as a negative result, sidelining the approach.

This seems a shame. What Vere actually found was that a context-free

language can be parsed as a nested regular language if it observes a

few not very onerous restrictions with respect to recursion. The first

(conventional) restriction is that left recursion is either removed or

disallowed. The second (unconventional) restriction can be summed up

tidily as "it cannot be ambiguous whether to push or pop". In other

words, there cannot be an ambiguous choice one branch of which is in

the First set of a recursive rule and another not, nor an ambiguous

choice one branch of which is in the Follow set of a recursive rule

and another not. (The "or pop" restriction can be selectively waived,

as it is common to do in parsing the recursively defined if/then/else

construction; it can be disambiguated by consistently picking one path

or the other.)

These restrictions are commonly met by parenthesis languages, which

observe the sufficient but not necessary condition that entrance to

and exit from recursive rules are bracketed by tokens that are not

otherwise used in the rule(s). Thus, many parenthesis languages can be

parsed using regular language techniques, e.g., by derivatives

augmented in one way or the other by a simple stack.

People trained to think in terms of LL(n), LR(n), LALR(n),

etc. parsing might be surprised to learn that these parsing approaches

allow unbounded ambiguity (except as noted above) without lookahead,

backtracking or exponential blowup.

If, as seems likely, the irreducible recursion in many modern

programming languages meets the restrictions above, this approach to

parsing might have broad utility. At least, it deserves more attention

than it appears to have gotten.

Bob Foster

[1] Brzozowski, Janusz A., "Derivatives of Regular Expressions," JACM 11:4,

1964.

[2] Clark, James, "An algorithm for RELAX NG validation,"

http://www.thaiopensource.com/relaxng/derivative.html.

[3] Vere, Steven, "Translation Equations," CACM 13:2, 1970.

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.