Re: "Regular expressions" for stack automata?

Chris F Clark <world!cfc@uunet.uu.net>
20 Sep 1999 12:05:38 -0400

          From comp.compilers

Related articles
"Regular expressions" for stack automata? Marko.Makela@HUT.FI (Marko =?ISO-8859-1?Q?M=E4kel=E4?=) (1999-09-10)
Re: "Regular expressions" for stack automata? Marko.Makela@HUT.FI (Marko =?ISO-8859-1?Q?M=E4kel=E4?=) (1999-09-16)
Re: "Regular expressions" for stack automata? world!cfc@uunet.uu.net (Chris F Clark) (1999-09-20)
Re: "Regular expressions" for stack automata? ilya@math.ohio-state.edu (1999-09-24)
Re: "Regular expressions" for stack automata? qjackson@wave.home.com (Quinn Tyler Jackson) (1999-10-04)
| List of all articles for this month |

From: Chris F Clark <world!cfc@uunet.uu.net>
Newsgroups: comp.compilers
Date: 20 Sep 1999 12:05:38 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 99-09-030 99-09-050
Keywords: lex, parse

Marko (answering himself) wrote:
> Marko> Now, my questions: Has this field been researched? Is there
> Marko> any grep-like tool that uses stack automata instead of finite
> Marko> state automata?
>
> I was pointed out in an e-mail message that QED (the ancestor of ed
> and vi, and probably the first tool that introduced regular expression
> based search and replace functions) has this. The construct is very
> simple: named regular expressions. From the QED manual (see
> <URL:http://cm.bell-labs.com/cm/cs/who/dmr/qed.html>):
>
> The regular expression "\E(bal)" defined by
>
> e(bal)/[^`']*(`\E(bal)')*[^`']*/
>
> matches any string balanced with respect to single quotes "`" and "'".
>
> By the way, I was very impressed that the regular expressions have
> remained practically unchanged for a so long time.
>
> John> [Many implementations of REs such as GNU grep extend the
> John> REs in interesting ways with backreferences. -John]
>
> True, but I haven't seen anything like the \E(re) construct of QED
> anywhere, and \1..\9 cannot express the same thing. I wonder how well
> Perl regular expressions would perform with this construct. I would
> expect especially the non-greedy closure operators *? and +? to be
> utterly inefficient with this construct.


You will find the \E(bal) concept only in lexer generators that use
something like LL or LR parsing technology to lex. There are at least
two examples of this that I know of, ANTLR 2.x (LL lexing) and Yacc++
(LR lexing).


The reason you will only find the extension in lexers generated by
these more powerful tools is that you need a stack automata and
"regular expressions" (as opposed to regexp's, i.e. what perl, emacs,
and grep have) are defined mathematically to be equivalent to simple
finite state machines with no stack. Finite state machines with a
stack are equivalent to context free languages. (And there is plenty
of studies of CFL's, but not listed under "regular expressions" since
to the theorist they are different beasts.)


Now, regular expressions (RE's) haven't seemed to advance in a long
time because it has been proven that all regular expressions can be
reduced to simple FSM's and vice-versa so no additional operators
would theoretically be necessary (or useful) to someone studying them
as automata. As a result, most interesting extensions to RE's don't
get press in the books that teach them (automata or compiler) as the
extensions are simply equivalent to something already covered (or they
aren't RE's at all).


However, if you read works in the area of computational linguistics
you will find that there is a more lively interest in regular
expressions. The linguists use RE's to model language translations
and have systems that use a variety of extensions to RE's. (They are
often interested in FST's (finite state transducers), which are
essentially RE's with outputs or "synchronized" FSM's where two
machines (one for the input and one for the output) are
interconnected.)


Now, let's return to regexp's (i.e. what perl, grep, and emacs have).
You are right that the backreference capability is not the same as
stack automata. In fact, the two facilities are orthogonal. Both
facilities solve different problems (and there are expressions you can
write in one that you can't write in the other). Equally important is
that the backreference capacity (like the stack) takes you out of the
domain of pure regular expressions. That is you can't translate a
backreference into a set of states in a FSM (at least not at regexp
compile time). In fact, you will find that most regexp packages do
not generate FSM's (like most lever generators do) but instead
"interpret" the regexp while doing the match. If you use this
approach, the backreference facility is easy to implement. That's
because you don't even try to generate an FSM for it, but merely do
the match against the appropriate backreferenced string at the correct
stage of the interpretation.


The use of stack automata goes the other way. Stack automata are
usually implemented by converting the regular expressions to FSM's.
That does not mean that a regexp package could not include a stack and
push things onto it at the appropriate points in the interpretation.


However, one often wants to take more care with stack automata--they
introduce ambiguity problems that one would like to know about before
running them on an input and getting an incorrect parse (i.e. the
regexp package chose the wrong ambiguous case to match with). As far
as I know, the only way of detecting those ambiguities is to run them
through an appropriate parser generator and seeing if it detects
conflicts. Even that method is not precise as most parser generators
will reject some unambiguous grammars that are simply beyond the
generator's resolution capacity. There is a proof, in fact, that the
problem cannot be solved in general and that one must either accept
some ambiguous grammars or reject some unambiguous grammars.


The ambiguity problem does not plague pure regular expressions,
because all regular expressions that match the same set of strings are
considered equivalent (and there is no feature in regular expressions
to distinguish how the parts of the regular expression were lined up
against the string).


However, a similar ambiguity problem also plagues backreferences, but
most regexp packages sweep it under the rug. They have the ambiguity
problem because the backreferences allow one to determine how the
parts of the regexp lined up against the string. That breaks the
equivalence of the underlying regular expressions (i.e. "a \(a*\)" is
not equivalent to "\(a+\)"). Note that the perl regexps introduce the
non-greedy verions of the regular expression operators to give the
user control over how the underlying regular expressions line up.


This is probably more that you wanted to know.
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. CompuServe : 74252,1375
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)
------------------------------------------------------------------------------
Web Site in Progress: Web Site : http://world.std.com/~compres


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.