|compiling case insensitive regular expressions firstname.lastname@example.org (Armel) (2010-11-01)|
|Re: compiling case insensitive regular expressions email@example.com (glen herrmannsfeldt) (2010-11-03)|
|Re: compiling case insensitive regular expressions firstname.lastname@example.org (2010-11-03)|
|Re: compiling case insensitive regular expressions email@example.com (Armel) (2010-11-04)|
|Re: compiling case insensitive regular expressions firstname.lastname@example.org (Russ Cox) (2010-11-04)|
|Re: compiling case insensitive regular expressions email@example.com (glen herrmannsfeldt) (2010-11-05)|
|Re: compiling case insensitive regular expressions firstname.lastname@example.org (BGB) (2010-11-06)|
|From:||Russ Cox <email@example.com>|
|Date:||Thu, 4 Nov 2010 15:15:24 -0400|
|Posted-Date:||04 Nov 2010 22:25:04 EDT|
> I need to compile regular expressions which are case insensitive,
> there are two cases, the part which must be matched case insensitvely
> might be just a portion but it can be the entire RE as well. B The RE
> will be Unicode enabled and must be compiled to AFD.
> I am wondering what is the "best practice" in this field, is it
> generally more efficient to precompute for each RE its "case
> insensitive RE" (by replacing each symbol by a ORed version of it,
> i.e; hello becomes (h|H)(e|E)(l|L)(l|L)(o|O)..) and compile that in
> place of the original or is it as good to simply replace the input
> symbols by their lowercase version and compile a "lowercase only"
> version of the RE?
My experience has been that you almost always want to do it in
the RE, not by fiddling with the input stream.
* If only some of the regexp is case-insensitive, transforming the
input is incorrect.
* A tight byte-at-a-time DFA inner loop slows down noticeably
(10% or so on a typical x86) when you have the extra conditional
branch to turn upper case into lower.
* Computing "ToLower" of an arbitrary Unicode code point is
quite expensive. If you bake it into the RE and then the DFA
then the work happens during table construction, and you get
much faster execution.
All that said, it is also the case that turning /Hello, world/ into
/[Hh][Ee][Ll][Ll][Oo], [Ww][Oo][Rr][Ll][Dd]/ is perhaps not the best
use of memory, especially if you have an inefficient representation
of character classes and an efficient representation of literal strings.
In RE2 (http://code.google.com/p/re2/) I introduced a bit in the parsed
RE representation to mean "ASCII case-insenstiive" so that I could
continue to store the parsed form as the more compact literal strings
where possible. Even so, ASCII is really the only easy case. If you
type in /=CE=9A=CE=B1=CE=BB=CE=B7=CE=BC=CE=AD=CF=81=CE=B1 =CE=BA=CF=8C=CF=
=83=CE=BC=CE=B5/ then RE2 turns it into a sequence of
One final warning: even ASCII is not completely straightforward:
according to the Unicode spec, a case-insensitive match for /sky/ should
match =C5=BFKy - that's a long s, a Kelvin symbol, and an ordinary y.
Return to the
Search the comp.compilers archives again.