Re: Regex -> DFA question

Chris F Clark <cfc@world.std.com>
3 Apr 2000 11:27:23 -0400

          From comp.compilers

Related articles
Regex -> DFA question bkubesh@earthlink.net (bkubesh) (2000-03-11)
Regex -> DFA question uday@cs.unipune.ernet.in (Uday Khedkar) (2000-04-01)
Re: Regex -> DFA question cfc@world.std.com (Chris F Clark) (2000-04-03)
| List of all articles for this month |

From: Chris F Clark <cfc@world.std.com>
Newsgroups: comp.compilers
Date: 3 Apr 2000 11:27:23 -0400
Organization: Compilers Central
References: 00-03-062 00-04-005
Keywords: lex, DFA

bkubesh <bkubesh@earthlink.net> wrote:
>ago, but with a twist. When constructing multiple Regex's how does one
>handle the following case:
>
>R1 | R2
>
>Where R1=(ab.*cd) and R2=(gh.*ij)
>
>Suppose we process the following input: abghxxcidij
>
>After we have processed the starting sequences, it would seem as though
>we need to be able to "cross track" between the two expressions in the
>OR. How do I handle this case for R2 in the above example. If I first


to which Uday Khedker correctly replied:
> That's not quite correct. When you say (R1 | R2) you match either R1
> or R2 (or both in case R1 == R2). There is no concept of "changing"
> the track. If you start matching R1 in such way that the string you
> matched cannot begin R2, then you continue matching R1 by ignoring R2
> completely.
>
> So the input provided above does not belong to the language (R1|R2).


However, one source of confusion in this is that "regexps" are often
unsed in contexts other than lexing. In fact, the term "regexp" is
usually used when the context is string search rather than lexing.
One reason for this is that "regexp" as a term often allows certain
variations that aren't strictly "regular expressions".


However, there is a subtle second reason, which I think is implied in
the original question. When one is doing string searches, the entire
input does not have to match the regexp, there only has to exist some
substring of the input that matches the regexp. That is a slightly
different question than does the string match the given regular
expression (and also a different question can the string be broken
down into a series of "tokens" where each token matches one of the
regular expressions from a set).


Each of the three different questions (substring matching, whole
string matching, and tokenizing) have slightly different implications
on how the "engine" (automata) matches characters to states.


In the "whole string" matching case, once a mismatch is found the
string cannot be matched to that part of the regular expression. In
the example if the first character of the string is not an "a" and the
second a "b", the string cannot match R1. Similarly, if the first
character is not a "g" and the second and "h", the string cannot match
R2. For those use to the string searching semantics, this is
equivalent to the start of the match being anchored at the start of
the string and the end of the match being anchored at the end of the
string. Most formal treatments of automata are implicitly talking
about the "whole string" matching case. When an automata book
describes the "language" defined by a particular machine (or regular
expression or grammar), the book is talking about the set of "whole
strings" that match that particular machine (...).


The lexing (tokenizing) world uses a variation on the same scheme,
such that the theorems from the "whole string" matching case apply.
In the lexing case, the left (starting) end of the string is still
anchored (to exactly where the previous token ended). The reason is
that we want to partition the input into tokens that have no gaps
between them. Therefore, every character must belong to some token
(and no character can belong to more than one token). The right end
is not anchored but instead matches the string specified by its
machine (or regular expression or grammar). One facet of having the
left end anchored, is that the same failure property holds--That is
once the start of a particular part of a regular expression doesn't
match it can never match.


The substring matching case, leaves both ends unanchored. This is
equivalent to writing .*(regexp).* for the whole string matching case.
By removing the anchor from the left end of the string, it is possible
for a mtch to be "picked up" later. That's equivalent to making the
.* at the beginning of the string match more characters. In that
context the question makes sense.


The subtle difference between the three questions explain why the
algorithms for each of the three questions are slightly different.
There are string search algorithms the build dfa's to do the search.
However, many popular string search algorithms use other techniques.
The reason being that in a string search it is extremely likely that
the match will not occur with the string anchored at the starting
position. Therefore, the faster the algorithm can determine where the
match sould be anchored, the faster it will operate.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.