|Re: Re: DFAs,NFAs,REs,CFGs,PDAs ARGH! email@example.com (Bob Foster) (2003-08-20)|
|Re: Re: DFAs,NFAs,REs,CFGs,PDAs ARGH! derkgwen@HotPOP.com (Derk Gwen) (2003-08-23)|
|Re: DFAs,NFAs,REs,CFGs,PDAs ARGH! firstname.lastname@example.org (Joachim Durchholz) (2003-08-23)|
|From:||"Bob Foster" <email@example.com>|
|Date:||20 Aug 2003 01:29:23 -0400|
|References:||<firstname.lastname@example.org> <084801c36559$e4bea560$6701a8c0@snobird> <Pine.BSI.email@example.com>|
|Posted-Date:||20 Aug 2003 01:29:23 EDT|
A recent posting reiterated the Chomsky hierarchy of languages,
characterizing the regular languages as "type 3" and context-free
languages as "type 2" but did not mention a well-known class of
languages that are approximately 2.5 on that scale. Regular tree
languages occupy an interesting position in the heirarchy; they are
regular and can be parsed using regular grammar techniques (such as
Brzozowski derivatives ), yet they can be recursively
defined. Considered as a string, a regular tree language is,
obviously, a "parenthesis language", i.e., can have a rule of the form
R -> aRb, where a and b are the special symbols used to denote tree
nesting, and serve as parenthesis. IOW, it is context-free as a string
and regular as a tree, a convenient circumstance that is
much-exploited these days in validating XML documents (e.g., in RELAX
NG ) but not much used in a more general parsing context.
But I didn't write this note to bore you with well-known facts about
regular trees, but to point up a little-known fact about tree-like
In regular tree languages, the "parenthesis tokens" are not in the
alphabet of the language, but they could be. If they are, the result
is a class of context-free languages (somewhat inadvertantly)
described in a 1970 paper by Steven Vere . Vere's technique uses
derivatives to produce DFA for each rule of a context-free grammar,
which are then used by a pushdown automaton to parse the language. At
the end of the paper it is noted that the technique fails for some
context-free grammars. Judging by citations, this seems to have led
the paper to be viewed as a negative result, sidelining the approach.
This seems a shame. What Vere actually found was that a context-free
language can be parsed as a nested regular language if it observes a
few not very onerous restrictions with respect to recursion. The first
(conventional) restriction is that left recursion is either removed or
disallowed. The second (unconventional) restriction can be summed up
tidily as "it cannot be ambiguous whether to push or pop". In other
words, there cannot be an ambiguous choice one branch of which is in
the First set of a recursive rule and another not, nor an ambiguous
choice one branch of which is in the Follow set of a recursive rule
and another not. (The "or pop" restriction can be selectively waived,
as it is common to do in parsing the recursively defined if/then/else
construction; it can be disambiguated by consistently picking one path
or the other.)
These restrictions are commonly met by parenthesis languages, which
observe the sufficient but not necessary condition that entrance to
and exit from recursive rules are bracketed by tokens that are not
otherwise used in the rule(s). Thus, many parenthesis languages can be
parsed using regular language techniques, e.g., by derivatives
augmented in one way or the other by a simple stack.
People trained to think in terms of LL(n), LR(n), LALR(n),
etc. parsing might be surprised to learn that these parsing approaches
allow unbounded ambiguity (except as noted above) without lookahead,
backtracking or exponential blowup.
If, as seems likely, the irreducible recursion in many modern
programming languages meets the restrictions above, this approach to
parsing might have broad utility. At least, it deserves more attention
than it appears to have gotten.
 Brzozowski, Janusz A., "Derivatives of Regular Expressions," JACM 11:4,
 Clark, James, "An algorithm for RELAX NG validation,"
 Vere, Steven, "Translation Equations," CACM 13:2, 1970.
Return to the
Search the comp.compilers archives again.