From: Kaz Kylheku <kkylheku@gmail.com>
Newsgroups: comp.compilers
Date: Tue, 5 Jan 2010 01:58:15 +0000 (UTC)
Organization: A noiseless patient Spider
References: 10-01-013
Keywords: lex, DFA
Posted-Date: 05 Jan 2010 13:36:03 EST
On 2010-01-03, Peng Yu <pengyu.ut@gmail.com> wrote:
> Suppose that I have two regular sets A and B that are generated by two
> regular expressions. I'm wondering if there is always a regular
> expression that can generate the difference between the two sets,
> i.e., A-B. If there is such a regular expression, is there a mechanical
> way to generate such a regular expression?
The following paper, which is actually not a paper but a set of solutions
to a midterm examination, has some important clues.
It gives two algorithms which are defined over NFA's: one for
computing the complement of an NFA (~A), and one for computing the
intersection of two NFA's, A&B.
The difference A-B can be given by this equivalence:
A-B <=> A&(~B)
I.e. applying both algorithms. In fact this is one of the questions
in this midterm:
http://cs164fa09.pbworks.com/f/midterm1_solutions.pdf
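(I haven't transcribed the midterm's exact algorithms here, but the
standard product construction for intersection looks roughly like the
Python sketch below. The representation is my own toy one -- a tuple of
states, alphabet, transition map, start states and accepting states --
and epsilon transitions are assumed absent.)

  from itertools import product

  def intersect_nfa(nfa1, nfa2):
      """Product construction: run both machines in lockstep and accept
      only when both would accept. delta maps (state, char) -> set of states."""
      states1, sigma, delta1, starts1, accepts1 = nfa1
      states2, _,     delta2, starts2, accepts2 = nfa2
      states = set(product(states1, states2))
      delta = {}
      for (p, q), c in product(states, sigma):
          delta[(p, q), c] = {(p2, q2)
                              for p2 in delta1.get((p, c), set())
                              for q2 in delta2.get((q, c), set())}
      starts = set(product(starts1, starts2))
      accepts = {(p, q) for (p, q) in states
                 if p in accepts1 and q in accepts2}
      return states, sigma, delta, starts, accepts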
Actually, by de Morgan's we should be able to eliminate &, right?
A&(~B) <=> ~(~A|~(~B)) <=> ~(~A|B)
So it looks like all we need is the complement to do difference.
Constructing the complement over NFA's is far simpler
than going from NFA -> DFA and back.
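(Union, the other ingredient in that identity, really is trivial over
NFA's: just lay the two machines side by side. A sketch using the same
toy representation as above, with states tagged so the two machines
cannot collide:)

  def union_nfa(nfa1, nfa2):
      """Disjoint union: keep both machines, both start sets, both accept sets."""
      states1, sigma, delta1, starts1, accepts1 = nfa1
      states2, _,     delta2, starts2, accepts2 = nfa2
      t1 = lambda s: (1, s)          # tag states from the first machine
      t2 = lambda s: (2, s)          # tag states from the second machine
      states = {t1(s) for s in states1} | {t2(s) for s in states2}
      delta = {}
      for (s, c), targets in delta1.items():
          delta[t1(s), c] = {t1(t) for t in targets}
      for (s, c), targets in delta2.items():
          delta[t2(s), c] = {t2(t) for t in targets}
      starts = {t1(s) for s in starts1} | {t2(s) for s in starts2}
      accepts = {t1(s) for s in accepts1} | {t2(s) for s in accepts2}
      return states, sigma, delta, starts, accepts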
Complement is simple, in particular: the complemented NFA inherits all
of the states and transitions of the original NFA. The meaning of the
states is inverted: non-accepting states in the original become
acceptance states, and vice versa. A new failure state is added which
is an acceptance state (due to the inverted sense of the
complement). Extra transitions are added such that for any state in
the original NFA which has no transition on some input character C, a
transition is added for that state and character which takes it to
this failure state. This rule is applied to the failure state itself,
which self-transitions on all inputs, serving as a ``honeypot'' for
arbitrarily long suffixes. I.e. every explicit failure in the
original NFA (no transition on some character) is routed to this
failure state which has accept semantics under the complement, and
whenever the original NFA reaches an accepting state, the corresponding
state in the complement NFA is non-accepting. Thus success becomes
failure and vice versa.
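(Here is that construction as a sketch, with names of my own. One
caveat worth flagging: the acceptance-swapping step yields the
complement language only when the automaton is deterministic -- one
successor per state and character -- so the sketch assumes the machine
has been determinized first.)

  def complement_dfa(states, sigma, delta, start, accepts):
      """delta maps (state, char) -> state and may be partial; the missing
      entries are the 'explicit failures' described above."""
      fail = object()                    # fresh failure/honeypot state
      new_states = set(states) | {fail}
      new_delta = dict(delta)
      for s in new_states:
          for c in sigma:
              # route every missing transition to the failure state; the
              # failure state itself self-transitions on every input
              new_delta.setdefault((s, c), fail)
      # invert acceptance: old accepting states now reject, everything
      # else (including the failure state) now accepts
      new_accepts = {s for s in new_states if s not in accepts}
      return new_states, sigma, new_delta, start, new_accepts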
This still leaves you with the problem of reconstructing a regular
expression from an NFA.
A workable approach for doing the complement, difference, etc,
directly over the regular expression may be to exploit derivatives
somehow.
The derivative of a language L with respect to some string u is the
set of suffixes of all strings in L which have u as a prefix, with the
u prefix removed. E.g. if the language is { abc, def, deg }, its
derivative with respect to de is { f, g }.
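(Taken literally over a finite language represented as a set of
strings, that is a one-liner; the names below are mine.)

  def derive_language(language, u):
      """Suffixes of the strings in `language` that begin with the prefix u."""
      return {w[len(u):] for w in language if w.startswith(u)}

  assert derive_language({"abc", "def", "deg"}, "de") == {"f", "g"}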
Since a regex generates a language, the derivative concept can be
applied to the regex itself. For instance the derivative of
the regex abb* with respect to ab is the regex b* (which can derive the
empty string). The derivative can sometimes be the empty regex e,
or a special null value which means ``empty set'' or ``matches nothing''.
E.g. the derivative of the regex a* with respect to the string b is
this null value, which may be notated 0. We can write this as
d(b,a*) = 0.
Derivatives can be used to match regular expressions, as an alternative to NFA
construction, and quite readily support operations like complement and
intersection. They can be translated directly to DFA.
It looks like there should be a way to construct a complement, difference or
intersection using derivatives somehow.
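(Here is a toy, self-contained sketch of that machinery in Python:
Brzozowski derivatives over a small regex syntax extended with
intersection (&) and complement (~) nodes, plus a matcher that just
folds derivatives over the input. The representation is mine, and the
simplification needed to keep the set of derivatives finite (see the
paper below) is omitted, so derivatives here only equal the expected
regexes up to equivalence.)

  from dataclasses import dataclass

  class Empty: pass               # the null regex 0: matches nothing
  class Eps: pass                 # the empty-string regex e

  @dataclass
  class Char:                     # a single letter
      c: str

  @dataclass
  class Alt:                      # r|s
      r: object
      s: object

  @dataclass
  class Cat:                      # rs
      r: object
      s: object

  @dataclass
  class Star:                     # r*
      r: object

  @dataclass
  class And:                      # r&s  (intersection)
      r: object
      s: object

  @dataclass
  class Not:                      # ~r   (complement)
      r: object

  def nullable(r):
      """True iff r matches the empty string."""
      if isinstance(r, (Eps, Star)):   return True
      if isinstance(r, (Empty, Char)): return False
      if isinstance(r, Alt):           return nullable(r.r) or nullable(r.s)
      if isinstance(r, (Cat, And)):    return nullable(r.r) and nullable(r.s)
      if isinstance(r, Not):           return not nullable(r.r)

  def deriv(a, r):
      """Derivative of r with respect to the single character a."""
      if isinstance(r, (Empty, Eps)): return Empty()
      if isinstance(r, Char):         return Eps() if r.c == a else Empty()
      if isinstance(r, Alt):          return Alt(deriv(a, r.r), deriv(a, r.s))
      if isinstance(r, Star):         return Cat(deriv(a, r.r), r)
      if isinstance(r, And):          return And(deriv(a, r.r), deriv(a, r.s))
      if isinstance(r, Not):          return Not(deriv(a, r.r))
      if isinstance(r, Cat):
          first = Cat(deriv(a, r.r), r.s)
          return Alt(first, deriv(a, r.s)) if nullable(r.r) else first

  def matches(r, s):
      """Match by taking derivatives letter by letter, then testing nullability."""
      for a in s:
          r = deriv(a, r)
      return nullable(r)

  # d(b, a*) is equivalent to 0, so a* does not match "b":
  assert not matches(Star(Char("a")), "b")
  # abb* matches "ab"; its derivative with respect to "ab" is nullable, like b*:
  abbstar = Cat(Char("a"), Cat(Char("b"), Star(Char("b"))))
  assert matches(abbstar, "ab")
  # complement and intersection come essentially for free:
  assert matches(Not(abbstar), "ba")
  assert not matches(And(abbstar, Star(Char("a"))), "ab")

With simplification layered on top, memoizing the derivative for each
character is essentially the direct-to-DFA translation mentioned above.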
Useful paper on derivatives: _Regular Expression Derivatives
Re-Examined_, Owens, Reppy, Turon:
http://www.ccs.neu.edu/home/turon/re-deriv.pdf (and its references).
Using pencil and paper, I think I found the start of a workable
algorithm for computing the complement of a regex using derivatives,
directly. A similar approach should work for intersections.
Using a four letter {a, b, c, d} alphabet, I computed the following
example complement:
complement(a(abc)*d) -> (|[bcd]|a(abc)*[bc]|a(abc)*d[abcd]+|
aab[abd][abcd]*|aa[acd][abcd]*)
On informal inspection this seems right (do you see any mistakes?).
I.e. an empty string mismatches, since the regex derives only nonempty
strings, so we include it. The one-letter strings b, c and d do
not match, because the regex has a null derivative with respect to
these. An a followed by zero or more repetitions of "abc",
followed by something other than an a or d (i.e. [bc]) does not match
and so is a branch of the complement. And so forth.
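(One mechanical way to answer the "do you see any mistakes?" question
over the four-letter alphabet is to brute-force all short strings with
an ordinary regex engine and check that every string lands in exactly
one of the two languages. A quick sketch using Python's re module:)

  import re
  from itertools import product

  ALPHABET = "abcd"
  ORIGINAL = re.compile(r"a(abc)*d")
  CLAIMED  = re.compile(r"(|[bcd]|a(abc)*[bc]|a(abc)*d[abcd]+|"
                        r"aab[abd][abcd]*|aa[acd][abcd]*)")

  def check(max_len=8):
      """Every string over the alphabet should match exactly one of the two."""
      for n in range(max_len + 1):
          for letters in product(ALPHABET, repeat=n):
              s = "".join(letters)
              in_original = ORIGINAL.fullmatch(s) is not None
              in_claimed  = CLAIMED.fullmatch(s) is not None
              if in_original == in_claimed:
                  print("mistake at:", repr(s))

  check()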
The ``pre-algorithm'', loosely speaking, is to consider the space of
possible inputs, letter by letter. For each letter, compute the
derivative of the regex with respect to that letter. For all letters
deriving a 0 (null), you can construct a set; this is the set matched
by the complement regex at that character position: add this as a
disjunction to the complement regex (remembering that in the branch
you must match the prior characters leading up to that position). The
* operator is handled specially, along these lines. When the regex is
of the form (r1)*.r2 (i.e. a Kleene star followed by more regex material)
the complement should match zero or more r1, and then a character
which may not begin r1 or r2. Another branch matches (r1*), and then
continues the process. Of course, cases also have to be included for
mismatching /into/ r1, character by character. I don't have all this
worked out in detail, but just playing with pencil and paper
derivations, the concept seems promising.
In the above example, the rule gives us the two branches a(abc)*[bc],
and a(abc)*d[abcd]. In the first one, the complement regex matches
leading a, the string abc zero or more times, but then a "bc" (letters
which do not begin another "abc" repetition, nor match the "d" suffix
which follows). In the second one, the complement matches a leading a
with zero or more of the "abc" repeat, then a d which branches to the
next part of the pattern, but then an extra letter (wildcard of the
whole alphabet) which spoils the match. How that last one works is
that the derivative of the regex d with respect to a, b, or c is 0, so we
match a d (the complement of those three letters). Then we must match
the complement of the epsilon regex, which is any letter, one or more
times. I.e. the complement regex must also match strings whose prefix
matches the original regex but which carry an incompatible suffix: d
followed by junk.
The mismatches through the (abc) part of (abc)* are reflected in
the branches aa[acd][abcd]* and aab[abd][abcd]*. I.e. the complement matches
the leading a, and then "a" (possible start of "abc") followed by something
other than b, and also the leading a, then "ab", followed by something other
than c, failing to complete the "abc". Both branches have to allow an arbitrary
amount of junk, all of which mismatches the original regex and so matches
the complement.
Without question, character class syntax with complements is essential, since
for an actual character set, the complement spaces are huge; e.g. we want to
encode [abcd] simply as . ; the equivalent of literally writing [abcd] even in
ASCII will kill us, not to mention Unicode.
In our above example, we really want to write it like this:
complement(a(abc)*d) -> (|[^a]|a(abc)*[^ad]|a(abc)*d.+|aab[^c].*|aa[^b].*)
Now it works for an alphabet which includes symbols other than just a, b, c
and d.