From: email@example.com (Paul Eggert)
Date: 30 Oct 1998 13:09:42 -0500
Organization: Twin Sun Inc, El Segundo, CA, USA
References: 98-10-068 98-10-080 98-10-103
"Dr Richard A. O'Keefe" <firstname.lastname@example.org> writes:
>The set of letters does NOT vary with locale in C, C9x, C++, or Java.
This isn't true of the latest draft for C9x, which extends identifiers
to allow implementation-defined characters in addition to the required
characters and UCNs. Such characters may or may not correspond to UCNs.
I believe that the intent here is that the locale used to invoke the
compiler can affect the charset/encoding allowed in identifiers. For
example, on a Solaris 2.6 host, compiling with the locale "ja" might
allow EUC-JIS Japanese letters and digits in identifiers, whereas
compiling with the locale "ja_JP.PCK" might allow Shift-JIS Japanese
letters and digits.
Thus it might well make sense to use '[:alpha:]' in a regular
expression that matches C9x identifiers, since '[:alpha:]' also depends
on the (compile-time) locale.
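As a rough sketch of that idea (not code from the original post), the POSIX <regex.h> interface can test whether a string looks like an identifier under the current locale. The helper name looks_like_identifier is hypothetical, and the pattern is only an approximation: it knows nothing about UCNs or about which extended characters a particular compiler actually accepts.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: returns true if `s` looks
   like an identifier.  [[:alpha:]] and [[:alnum:]] are POSIX character
   classes whose membership depends on the LC_CTYPE locale in effect
   when the pattern is compiled and matched.  */
static bool looks_like_identifier(const char *s)
{
    regex_t re;
    bool ok;

    /* Anchored extended regex: a letter or underscore, followed by any
       number of letters, digits, or underscores.  */
    if (regcomp(&re, "^[[:alpha:]_][[:alnum:]_]*$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return false;

    ok = regexec(&re, s, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

A program would normally call setlocale(LC_CTYPE, "") from <locale.h> before using the helper, so that the character classes follow the user's environment. In the default "C" locale only the ASCII letters and digits match.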
>In fact the details about universal character names in C9x are
Very much so! C9x UCNs are not the same as C++ UCNs. Also, UCNs in
the latest C9x draft are quite a different animal from earlier drafts.
UCNs are a new feature, for which there is little practical
experience, and I advise authors of portable software to stay away
from them for a decade or so.
>It would have been pleasant had C9x followed the rules in
>section 5.14 "Identifiers" of the Unicode 2.0 book
Wouldn't this have required Unicode canonicalization? That would have
been controversial, as (excuse me while I hand-wave a bit) to some
extent canonicalization is a bit like case-folding, and historically C
identifiers have been case-sensitive.
Draft C9x sidesteps this problem by requiring support only for letters
and digits that can be represented as a single Unicode character, which
is analogous to sidestepping the case-sensitivity issue by requiring
support only for lower-case letters.
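To make the canonicalization issue concrete, here is a minimal sketch (an illustration, not part of any C9x draft). It assumes UTF-8 source text and a tool that compares identifiers byte for byte; the helper name same_identifier is hypothetical. Unicode treats U+00E9 and the sequence U+0065 U+0301 as canonically equivalent, yet without canonicalization they compare as different spellings, much as "foo" and "Foo" do.

```c
#include <stdbool.h>
#include <string.h>

/* Two UTF-8 spellings of the same visible name, "cafe" with an acute
   accent on the final letter.  */
static const char precomposed[] = "caf\xC3\xA9";   /* ends in U+00E9 */
static const char decomposed[]  = "cafe\xCC\x81";  /* ends in U+0065 U+0301 */

/* Hypothetical helper: compares identifiers exactly, with no Unicode
   canonicalization, the way case-sensitive comparison does no case
   folding.  */
static bool same_identifier(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here same_identifier(precomposed, decomposed) is false even though the two strings are canonically equivalent. Only the precomposed spelling uses a single Unicode character for the accented letter.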
>there are EIGHT "translation phases" in C9x:
>1. Map source multibyte characters to the source character set.
> This includes converting end of record to newline, and it
> SPECIFICALLY INCLUDES CONVERTING non-basic characters to UCNs.
This is no longer true in the latest C9x draft.
Several of the crucial details have changed in this area.
>What this means is that if you want a tool to do something useful with
>identifiers in C source files, you would have to be very very silly
>not to do it by taking a freely available preprocessor (such as the
>GNU one) and bolting your tool on the end.
It depends on the application. If your application is required to find
all identifiers, then you're correct. But if you're just trying to
write a regular expression that matches (say) function headers, and if
you can assume a reasonable style in the source code, then it's
reasonable to use an ordinary regular expression containing '[:alpha:]'.
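A sketch of such a heuristic follows, under one stated style assumption: a function definition puts its name in column 0, immediately followed by optional blanks and an opening parenthesis, while calls inside function bodies are indented. The helper name count_function_headers is hypothetical, and the pattern will also match other column-0 constructs of the same shape, such as macro invocations.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts lines in `src` that begin, in column 0,
   with an identifier-like word followed by optional blanks and '('.
   Returns -1 if the pattern cannot be compiled.  */
static int count_function_headers(const char *src)
{
    regex_t re;
    regmatch_t m;
    int count = 0;
    int flags = 0;

    /* REG_NEWLINE makes '^' match at the start of every line.  */
    if (regcomp(&re, "^[[:alpha:]_][[:alnum:]_]*[[:blank:]]*\\(",
                REG_EXTENDED | REG_NEWLINE) != 0)
        return -1;

    while (regexec(&re, src, 1, &m, flags) == 0) {
        count++;
        src += m.rm_eo;      /* resume the search after this match */
        flags = REG_NOTBOL;  /* the new start is mid-line, not a line start */
    }

    regfree(&re);
    return count;
}
```

Because the pattern uses '[:alpha:]' and '[:alnum:]', which names it accepts depends on the locale in effect, just as with the identifier case discussed earlier.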
By the way, the GCC2 preprocessor (which I help maintain) doesn't
understand UCNs yet. I'm waiting for the C9x spec to settle down.
My impression is that UCNs were mandated by the ISO bureaucracy, and
will provide grist for the C language lawyers' mills indefinitely,
but in their current form they aren't particularly useful in real code.
For example, one can't reliably use UCNs for program interchange, since
(in the latest C9x draft, at least), uniformly replacing extended
characters with their corresponding UCNs can change the meaning of the