# Question on regular expression matching

## wolf@di.epfl.ch (Thomas Wolf)15 Mar 1998 00:18:43 -0500

From comp.compilers

Related articles
Question on regular expression matching wolf@di.epfl.ch (1998-03-15)
Re: Question on regular expression matching henry@zoo.toronto.edu (Henry Spencer) (1998-03-18)
| List of all articles for this month |

 From: wolf@di.epfl.ch (Thomas Wolf) Newsgroups: comp.compilers Date: 15 Mar 1998 00:18:43 -0500 Organization: Compilers Central Keywords: lex, question

What is the supposed result of POSIX-style subexpression matches for
ambiguous regular expressions? As an example, consider the RE
"((a)(b|c)|(acab))*" matching the string "acab". What should the
subexpression matches be in this case?

I would expect either

match[0] = {0, 4} The whole RE
match[1] = {0, 4} The "*"
match[2] = {-1, -1} Yes, "(a)" matched at pos. 2, but we took the
other branch of the "|".
match[3] = {-1, -1} Again, we took the other branch ('cause it's
longer).
match[4] = {0, 4} "(acab)" matched the whole string.

or

match[0] = {0, 4} The whole RE
match[1] = {2, 4} The last time the "*" is considered
match[2] = {2, 3} Match of "(a)" in the last iteration of the "*"
match[3] = {3, 4} Match of "(b|c)" in the last iteration
match[4] = {-1, -1} Although it matches the whole input, it doesn't
do so on the last iteration.

Which of these is it, and why? If it's the second one, would the
result change if the RE was written "((acab)|(a)(b|c))*" (besides
the different subexpression numbering, of course)? And if it's
neither, where did I go wrong? Would a match result reporting a
null match of the "*" at the end of "acab" be possible, too?

Personally, I think it should be the first one, for otherwise matching
the RE "((a)(b|c)|(acax))*" against the string "acax" would report a
full match, but claim that none of the subexpressions matched!?
--
Thomas Wolf
--

Post a followup to this message