Re: compiling case insensitive regular expressions

BGB <cr88192@hotmail.com>
Sat, 06 Nov 2010 11:46:47 -0700

          From comp.compilers

Related articles
compiling case insensitive regular expressions armelasselin@hotmail.com (Armel) (2010-11-01)
Re: compiling case insensitive regular expressions gah@ugcs.caltech.edu (glen herrmannsfeldt) (2010-11-03)
Re: compiling case insensitive regular expressions benhanson2@icqmail.com (2010-11-03)
Re: compiling case insensitive regular expressions armelasselin@hotmail.com (Armel) (2010-11-04)
Re: compiling case insensitive regular expressions rsc@swtch.com (Russ Cox) (2010-11-04)
Re: compiling case insensitive regular expressions gah@ugcs.caltech.edu (glen herrmannsfeldt) (2010-11-05)
Re: compiling case insensitive regular expressions cr88192@hotmail.com (BGB) (2010-11-06)

From: BGB <cr88192@hotmail.com>
Newsgroups: comp.compilers
Date: Sat, 06 Nov 2010 11:46:47 -0700
Organization: albasani.net
References: 10-11-004 10-11-010
Keywords: lex
Posted-Date: 07 Nov 2010 00:42:30 EDT

On 11/4/2010 12:15 PM, Russ Cox wrote:


<snip, regexes>


admittedly, I don't use regexes much, although I have used them in a
few cases, and a few times I have used regex-like strategies (such as a
special notation for matching/encoding x86 instructions, ...).


most of my tokenizers/... are plain C code.




> One final warning: even ASCII is not completely straightforward:
> according to the Unicode spec, a case-insensitive match for /sky/ should
> match ſKy - that's a long s, a Kelvin symbol, and an ordinary y.


rarely do I find it necessary to resort to the full level of Unicode
pedantry, especially WRT most code and data tasks (it would just slow
down the handling logic).


so, one can assume a single "sane" mapping between any upper and
lower-case characters, and ignore any cases where there is not a 1:1
mapping (random characters in other locales/... which may compare equal
in some cases, characters which map between multiple characters in upper
or lower-case, ...).


that or just treat Unicode space as a "big ASCII": there is all the
special case logic for ASCII, and everything else is often just treated
as "unknown solid characters" (this is how most of my compiler code works).
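
a minimal sketch of the "big ASCII" idea, assuming the input has already
been decoded to code points. the names fold_ascii and cp_equal_nocase are
made up for illustration and are not taken from any actual compiler:

```c
#include <stdint.h>

/* Fold a code point for comparison, "big ASCII" style: only 'A'..'Z'
 * are mapped to 'a'..'z'. Every other code point is returned unchanged,
 * i.e. treated as an opaque character with no case. */
static uint32_t fold_ascii(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + ('a' - 'A');
    return cp;
}

/* Compare two code points, ignoring case in the ASCII range only. */
static int cp_equal_nocase(uint32_t a, uint32_t b)
{
    return fold_ascii(a) == fold_ascii(b);
}
```

under this scheme 'S' and 's' compare equal, but U+017F (long s) does not
match 's' and U+212A (Kelvin sign) does not match 'k', which is exactly
the Unicode behavior being deliberately ignored here.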


for example, my assembler is partly case-insensitive, but is only case
insensitive for certain things (opcode and register names, but not
labels or variables), and only in ASCII range (there are no greek or
cyrillic or similar opcode names anyways, so it doesn't matter).
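
a rough sketch of what "case insensitive only for certain things" might
look like. the opcode table and the function names are hypothetical, not
the actual assembler's code; the assumption is that opcode names are
stored in lower case and input is UTF-8 or plain ASCII:

```c
#include <stddef.h>
#include <string.h>

/* ASCII-only lowercase; bytes >= 0x80 (UTF-8 multibyte) pass through. */
static int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Case-insensitive equality, folding both sides during the compare. */
static int ascii_ieq(const char *a, const char *b)
{
    while (*a && *b) {
        if (ascii_lower((unsigned char)*a) != ascii_lower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* Hypothetical opcode table, stored in lower case. */
static const char *const opcodes[] = { "mov", "add", "sub", "jmp" };

/* Opcode names: case-insensitive. Returns the table index, or -1. */
static int lookup_opcode(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof opcodes / sizeof opcodes[0]; i++)
        if (ascii_ieq(name, opcodes[i]))
            return (int)i;
    return -1;
}

/* Labels: case-sensitive, exact byte-for-byte match. */
static int same_label(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

with this, "MOV" and "Mov" both find the "mov" entry, while the labels
"Loop" and "loop" stay distinct.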


I usually compare by forcing both sides to lower case during the
compare. or, for case-insensitive string interning, I simply force the
string to lower-case prior to interning it (one can intern "FOO" or "Foo"
and get back "foo"). interning allows '==' rather than string compare
operations, and so is often faster (though not always: if only doing a
small number of compares, the cost of interning the string is greater
than the cost of doing the compares).
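
a minimal sketch of case-insensitive interning, assuming a small chained
hash table and ASCII-only folding. intern_nocase, the table size, and the
hash function are all made up for illustration:

```c
#include <stdlib.h>
#include <string.h>

#define INTERN_BUCKETS 256

struct intern_ent {
    struct intern_ent *next;
    char *str;              /* lower-cased, owned by the table */
};

static struct intern_ent *intern_tab[INTERN_BUCKETS];

/* Lower-case s (ASCII only) and return the unique interned copy.
 * Two inputs differing only in ASCII case yield the same pointer.
 * Returns NULL if allocation fails. */
static const char *intern_nocase(const char *s)
{
    size_t len = strlen(s), i;
    unsigned hash = 0;
    char *low = malloc(len + 1);
    struct intern_ent *e;

    if (!low)
        return NULL;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= 'A' && c <= 'Z')
            c = (unsigned char)(c + ('a' - 'A'));
        low[i] = (char)c;
        hash = hash * 31 + c;
    }
    low[len] = '\0';

    /* Already interned? Then discard our copy and reuse the old one. */
    for (e = intern_tab[hash % INTERN_BUCKETS]; e; e = e->next) {
        if (strcmp(e->str, low) == 0) {
            free(low);
            return e->str;
        }
    }

    e = malloc(sizeof *e);
    if (!e) {
        free(low);
        return NULL;
    }
    e->str = low;
    e->next = intern_tab[hash % INTERN_BUCKETS];
    intern_tab[hash % INTERN_BUCKETS] = e;
    return low;
}
```

after interning, intern_nocase("FOO") == intern_nocase("Foo") holds as a
plain pointer compare, and both point at "foo". each call still pays for
a hash and a copy, which is the up-front cost that only wins if the
result gets compared many times.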


usually, I leave things case sensitive (and demand an exact 1:1
code-point match), since this makes things simpler and faster. also, I
tend to see case-sensitivity as more intuitive anyways, since to me 'A'
and 'a' seem like different letters.




and, yes, I am also a fan of UTF-8, since it maps nicely to ASCII and
95% of the time takes less space than UTF-16.
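
for reference, a small sketch of the per-code-point sizes behind the
space comparison. the function names are made up, and valid code points
(at most U+10FFFF, no surrogates) are assumed:

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8. */
static size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;   /* rest of the BMP */
    return 4;                     /* supplementary planes */
}

/* Bytes needed to encode one code point in UTF-16. */
static size_t utf16_len(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;  /* one code unit, or a surrogate pair */
}
```

so UTF-8 is smaller for ASCII, the same size for U+0080..U+07FF and for
supplementary characters, and larger (3 bytes vs 2) only for U+0800..U+FFFF.
for text that is mostly ASCII, such as source code, UTF-8 comes out ahead.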




decided to leave out a bit about various ways to implement
uppercase/lowercase mapping, as this is probably not the issue.


