|compiling case insensitive regular expressions firstname.lastname@example.org (Armel) (2010-11-01)|
|Re: compiling case insensitive regular expressions email@example.com (glen herrmannsfeldt) (2010-11-03)|
|Re: compiling case insensitive regular expressions firstname.lastname@example.org (2010-11-03)|
|Re: compiling case insensitive regular expressions email@example.com (Armel) (2010-11-04)|
|Re: compiling case insensitive regular expressions firstname.lastname@example.org (Russ Cox) (2010-11-04)|
|Re: compiling case insensitive regular expressions email@example.com (glen herrmannsfeldt) (2010-11-05)|
|Re: compiling case insensitive regular expressions firstname.lastname@example.org (BGB) (2010-11-06)|
|Date:||Sat, 06 Nov 2010 11:46:47 -0700|
|Posted-Date:||07 Nov 2010 00:42:30 EDT|
On 11/4/2010 12:15 PM, Russ Cox wrote:
admittedly, I don't use regexes much, although I have used them in a few
cases, and a few times I have used regex-like strategies (such as a
special notation for matching/encoding x86 instructions, ...).
most of my tokenizers/... are plain C code.
> One final warning: even ASCII is not completely straightforward:
> according to the Unicode spec, a case-insensitive match for /sky/ should
> match ſKy - that's a long s, a Kelvin symbol, and an ordinary y.
rarely do I find it necessary to resort to the full level of Unicode
pedantry, especially WRT most code and data tasks (it would just slow
down the handling logic).
so, one can assume a single "sane" mapping between any upper and
lower-case characters, and ignore any cases where there is not a 1:1
mapping (random characters in other locales/... which may compare equal
in some cases, characters which map between multiple characters in upper
or lower-case, ...).
that or just treat Unicode space as a "big ASCII": there is all the
special case logic for ASCII, and everything else is often just treated
as "unknown solid characters" (this is how most of my compiler code works).
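
as a rough illustration of the "big ASCII" approach, here is a minimal sketch in C, assuming the text is UTF-8. the names ascii_tolower and ascii_lower_inplace are made up for this example and are not taken from any actual compiler code. only 'A'..'Z' are folded; every byte >= 0x80 (all UTF-8 lead and continuation bytes) passes through untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Fold a single byte: ASCII upper-case letters map to lower case,
 * everything else (digits, punctuation, bytes >= 0x80) is unchanged.
 * Hypothetical helper name, for illustration only. */
unsigned char ascii_tolower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(c + ('a' - 'A'));
    return c;
}

/* Lower-case a NUL-terminated UTF-8 string in place, ASCII range only.
 * Multi-byte sequences are treated as opaque "solid" characters. */
void ascii_lower_inplace(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        *s = (char)ascii_tolower((unsigned char)*s);
}
```

since no byte of a multi-byte UTF-8 sequence falls in the ASCII range, this cannot corrupt non-ASCII text; it simply never folds it.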
for example, my assembler is partly case-insensitive, but is only case
insensitive for certain things (opcode and register names, but not
labels or variables), and only in ASCII range (there are no greek or
cyrillic or similar opcode names anyways, so it doesn't matter).
I usually compare by forcing both sides to lower case during the
compare. or, for case-insensitive string interning, simply force the
string to lower case prior to interning it (one can intern "FOO" or
"Foo" and get back "foo"). interning allows '==' rather than string
compare operations, and so is often faster (though not always, since if
only doing a small number of compares the cost of interning the string
is greater than the cost of doing the compares).
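
both strategies can be sketched in C roughly as follows. this is only an illustration under some assumptions: folding is ASCII-only, the names fold, ascii_casecmp and intern_nocase are invented here, and the intern table is a plain linked list (a real one would more likely use a hash table).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* ASCII-only lower-casing of one byte (hypothetical helper). */
static unsigned char fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Compare two strings, forcing both sides to lower case as we go.
 * Returns 0 if equal, otherwise the difference of the first
 * mismatching folded bytes (same convention as strcmp). */
int ascii_casecmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;; a++, b++) {
        unsigned char ca = fold((unsigned char)*a);
        unsigned char cb = fold((unsigned char)*b);
        if (ca != cb || ca == '\0')
            return (int)ca - (int)cb;
    }
}

/* Toy intern table: a singly linked list of lower-cased strings. */
struct intern_node {
    struct intern_node *next;
    char str[];               /* C99 flexible array member */
};
static struct intern_node *intern_head;

/* Return the unique lower-cased copy of s, creating it if needed.
 * Two calls with strings equal up to ASCII case return the same
 * pointer, so callers can compare results with '=='.
 * Returns NULL only if allocation fails. */
const char *intern_nocase(const char *s)
{
    struct intern_node *p;
    size_t n, i;

    assert(s != NULL);
    for (p = intern_head; p != NULL; p = p->next)
        if (ascii_casecmp(p->str, s) == 0)
            return p->str;

    n = strlen(s);
    p = malloc(sizeof *p + n + 1);
    if (p == NULL)
        return NULL;
    for (i = 0; i <= n; i++)  /* copies the terminating NUL too */
        p->str[i] = (char)fold((unsigned char)s[i]);
    p->next = intern_head;
    intern_head = p;
    return p->str;
}
```

with this, intern_nocase("FOO") and intern_nocase("Foo") return the same pointer to "foo". the lookup cost is paid once per string, which is the tradeoff noted above: it only pays off if the interned result is compared more than a few times.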
usually, I leave things case sensitive (and demand an exact 1:1
code-point match), since this makes things simpler and faster. also I
tend to see case-sensitive as more intuitive, since to me 'A' and 'a'
seem like different letters anyways.
and, yes, I am also a fan of UTF-8, since it maps nicely to ASCII and
95% of the time takes less space than UTF-16.
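
for reference, the per-code-point sizes behind that comparison can be worked out as in the sketch below (function names are made up for the example). it says nothing about the 95% figure itself, which depends on the text being encoded; it only shows that ASCII costs 1 byte in UTF-8 versus 2 in UTF-16, and that UTF-8 is larger only for U+0800..U+FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to encode one Unicode code point in UTF-8. */
size_t utf8_len(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80UL)    return 1;   /* ASCII */
    if (cp < 0x800UL)   return 2;   /* Latin supplements, Greek, Cyrillic, ... */
    if (cp < 0x10000UL) return 3;   /* rest of the BMP */
    return 4;                       /* supplementary planes */
}

/* Bytes needed to encode one Unicode code point in UTF-16. */
size_t utf16_len(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);
    return (cp < 0x10000UL) ? 2 : 4;  /* one code unit, or a surrogate pair */
}
```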
decided to leave out a bit about various ways to implement
uppercase/lowercase mapping, as this is probably not the issue.