Related articles |
---|
compiling case insensitive regular expressions armelasselin@hotmail.com (Armel) (2010-11-01) |
Re: compiling case insensitive regular expressions gah@ugcs.caltech.edu (glen herrmannsfeldt) (2010-11-03) |
Re: compiling case insensitive regular expressions benhanson2@icqmail.com (2010-11-03) |
Re: compiling case insensitive regular expressions armelasselin@hotmail.com (Armel) (2010-11-04) |
Re: compiling case insensitive regular expressions rsc@swtch.com (Russ Cox) (2010-11-04) |
Re: compiling case insensitive regular expressions gah@ugcs.caltech.edu (glen herrmannsfeldt) (2010-11-05) |
Re: compiling case insensitive regular expressions cr88192@hotmail.com (BGB) (2010-11-06) |
From: | Russ Cox <rsc@swtch.com> |
Newsgroups: | comp.compilers |
Date: | Thu, 4 Nov 2010 15:15:24 -0400 |
Organization: | Compilers Central |
References: | 10-11-004 |
Keywords: | lex, practice |
Posted-Date: | 04 Nov 2010 22:25:04 EDT |
> I need to compile regular expressions which are case insensitive,
> there are two cases, the part which must be matched case insensitvely
> might be just a portion but it can be the entire RE as well. B The RE
> will be Unicode enabled and must be compiled to AFD.
>
> I am wondering what is the "best practice" in this field, is it
> generally more efficient to precompute for each RE its "case
> insensitive RE" (by replacing each symbol by a ORed version of it,
> i.e; hello becomes (h|H)(e|E)(l|L)(l|L)(o|O)..) and compile that in
> place of the original or is it as good to simply replace the input
> symbols by their lowercase version and compile a "lowercase only"
> version of the RE?
My experience has been that you almost always want to do it in
the RE, not by fiddling with the input stream.
* If only some of the regexp is case-insensitive, transforming the
input is incorrect.
* A tight byte-at-a-time DFA inner loop slows down noticeably
(10% or so on a typical x86) when you have the extra conditional
branch to turn upper case into lower.
* Computing "ToLower" of an arbitrary Unicode code point is
quite expensive. If you bake it into the RE and then the DFA
then the work happens during table construction, and you get
much faster execution.
All that said, it is also the case that turning /Hello, world/ into
/[Hh][Ee][Ll][Ll][Oo], [Ww][Oo][Rr][Ll][Dd]/ is perhaps not the best
use of memory, especially if you have an inefficient representation
of character classes and an efficient representation of literal strings.
In RE2 (http://code.google.com/p/re2/) I introduced a bit in the parsed
RE representation to mean "ASCII case-insenstiive" so that I could
continue to store the parsed form as the more compact literal strings
where possible. Even so, ASCII is really the only easy case. If you
type in /=CE=9A=CE=B1=CE=BB=CE=B7=CE=BC=CE=AD=CF=81=CE=B1 =CE=BA=CF=8C=CF=
=83=CE=BC=CE=B5/ then RE2 turns it into a sequence of
character classes.
One final warning: even ASCII is not completely straightforward:
according to the Unicode spec, a case-insensitive match for /sky/ should
match =C5=BFKy - that's a long s, a Kelvin symbol, and an ordinary y.
Russ
Return to the
comp.compilers page.
Search the
comp.compilers archives again.