Re: POSIX regex subexpression matching semantics

"Stefan Monnier" <monnier+comp/compilers/news/>
3 Apr 2000 04:09:17 -0400

          From comp.compilers

Related articles
POSIX regex subexpression matching semantics (2000-04-01)
Re: POSIX regex subexpression matching semantics monnier+comp/compilers/news/ (Stefan Monnier) (2000-04-03)
Re: POSIX regex subexpression matching semantics (2000-04-05)
| List of all articles for this month |

From: "Stefan Monnier" <monnier+comp/compilers/news/>
Newsgroups: comp.compilers
Date: 3 Apr 2000 04:09:17 -0400
Organization: Compilers Central
References: 00-04-021
Keywords: lex, DFA
User-Agent: Gnus/5.0804 (Gnus v5.8.4) Emacs/21.0.90

>>>>> "Ville" == Ville Laurikari <> writes:
> The GNU regex(7) man page says that when (a*)* is matched against
> "bc" both the whole RE _and_ the parenthesized subexpression match the
> empty string. If this is so, when matched against "aa" shouldn't the
> whole RE match the whole string first _and then_ the empty string at the
> end?

I believe this is indeed what Perl does.
I do not like this behavior, tho: since there is generally no way to get the
list of all matches of a subexpression but only the last match, it means
that in `(a*)*' the submatch will always end up as the empty string,
which is rather uninteresting.

I'd thus argue that when `(a*)*' is matched against `bc' the subexpression
should not match at all (i.e. the outer repetition operator loops 0
times rather than once).
And more generally, if the body of a repetition operator can match the empty
string, it should only match such an empty string if necessary (since
it will always be possible). Matching the empty string is made necessary
when there is a minimum number of matches as is the case with `+' and
with `{n,m}'.

The rule that subexpressions also should match the longest match is
often not properly followed by backtracking implementations (i.e.
all implementations I can think of that allow the use of backreferences.
I'm actually curious: has anyone tried to extend an NFA or DFA so
as to allow matching backreferences ?).

More specifically, in your example, `(a|(a*b*))*' will usually
match `aaaaabbbbbbbb' by matching the `a' alternative five times
and then that `a*b*' once although the longest match rule would say that
it should have just matched `a*b*' once.

This behavior makes implementation easier and can be convenient to the
user since it makes the | operator non-commutative, allowing us to express
a limited form of preference.
For example, Perl's non-greedy repetition operator `a*?' is simply translated
into `(|a)*?'.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.