|Question on regular expression matching firstname.lastname@example.org (1998-03-15)|
|Re: Question on regular expression matching email@example.com (Henry Spencer) (1998-03-18)|
|From:||Henry Spencer <firstname.lastname@example.org>|
|Date:||18 Mar 1998 22:51:57 -0500|
|Organization:||SP Systems, Toronto, Canada|
Thomas Wolf <email@example.com> wrote:
>What is the supposed result of POSIX-style subexpression matches for
>ambiguous regular expressions? As an example, consider the RE
>"((a)(b|c)|(acab))*" matching the string "acab". What should the
>subexpression matches be in this case?
Within an expression, the POSIX spec says that subexpressions match the
longest substrings they can, and that earlier subexpressions have priority
over later ones. (The wording of the spec is admittedly less than ideal.)
Here, the earliest subexpression is the one that starts earliest: the one
in the outermost set of parentheses. It starts out by matching the
longest string it can, using its second branch. That gobbles up "acab".
Then it tries to match again because of the "*", but can't.
That having been settled... Just how that subexpression matches its
substring is sorted out the same way. The earlier subexpressions again
get priority, but that makes no difference in this case because the first
branch cannot match "acab" and the second must be used.
Earlier (or outer) subexpressions match as much as they can, and later
(or inner) ones do what they can within those constraints.
So match, , and  are all the whole string, and match and 
are void since those subexpressions did not participate.
>...Would a match result reporting a
>null match of the "*" at the end of "acab" be possible, too?
Only if you started the match attempt there. POSIX REs always match
starting as early as they can.
| Henry Spencer
Return to the
Search the comp.compilers archives again.