Re: Universal Character Names

eggert@twinsun.com (Paul Eggert)
30 Oct 1998 13:09:42 -0500

          From comp.compilers

From: eggert@twinsun.com (Paul Eggert)
Newsgroups: comp.compilers
Date: 30 Oct 1998 13:09:42 -0500
Organization: Twin Sun Inc, El Segundo, CA, USA
References: 98-10-068 98-10-080 98-10-103
Keywords: C, i18n



"Dr Richard A. O'Keefe" <ok@atlas.otago.ac.nz> writes:


>The set of letters does NOT vary with locale in C, C9x, C++, or Java.


This isn't true of the latest draft for C9x, which extends identifiers
to allow implementation-defined characters in addition to the required
characters and UCNs. Such characters may or may not correspond to UCNs.


I believe that the intent here is that the locale used to invoke the
compiler can affect the charset/encoding allowed in identifiers. For
example, on a Solaris 2.6 host, compiling with the locale "ja" might
allow EUC-JIS Japanese letters and digits in identifiers, whereas
compiling with the locale "ja_JP.PCK" might allow Shift-JIS Japanese
letters and digits.


Thus it might well make sense to use '[:alpha:]' in a regular
expression that matches C9x identifiers, since '[:alpha:]' also depends
on the (compile-time) locale.
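
As a rough illustration of that idea, here is a minimal sketch using the POSIX <regex.h> interface. The function name is_identifier is hypothetical, and the pattern covers only the "letter or underscore, then letters, digits, or underscores" shape; it knows nothing about UCNs or keywords. Note that '[:alpha:]' is only valid inside a bracket expression, hence '[[:alpha:]_]'.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `s` is entirely an identifier-shaped
   string under the current LC_CTYPE locale, 0 if it is not, and -1 if
   the pattern fails to compile. */
int is_identifier(const char *s)
{
    regex_t re;

    /* [[:alpha:]] and [[:alnum:]] are interpreted according to the
       locale in effect when the pattern is compiled and executed. */
    if (regcomp(&re, "^[[:alpha:]_][[:alnum:]_]*$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int matched = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

A caller that wants the environment's locale to apply would call setlocale(LC_CTYPE, "") first; without that call the program runs in the "C" locale, and whether a given non-ASCII letter matches then depends on the locale and on the regex implementation.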


>In fact the details about universal character names in C9x are
>somewhat subtle.


Very much so! C9x UCNs are not the same as C++ UCNs. Also, UCNs in
the latest C9x draft are quite a different animal from earlier drafts.
UCNs are a new feature, for which there is little practical
experience, and I advise authors of portable software to stay away
from them for a decade or so.


>It would have been pleasant had C9x followed the rules in
>section 5.14 "Identifiers" of the Unicode 2.0 book


Wouldn't this have required Unicode canonicalization? That would have
been controversial, as (excuse me while I hand-wave a bit) to some
extent canonicalization is a bit like case-folding, and historically C
identifiers have been case-sensitive.


Draft C9x sidesteps this problem by requiring support only for letters
and digits that can be represented as a single Unicode character, which
is analogous to sidestepping the case-sensitivity issue by requiring
support only for lower-case letters.
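
To make the canonicalization issue concrete, here is a small sketch that assumes UTF-8 encoding. The names precomposed, decomposed, and same_spelling are made up for the example. It shows that two canonically equivalent spellings of the same letter are different byte strings, so a compiler that compares identifiers byte by byte, with no canonicalization, would treat them as distinct.

```c
#include <string.h>

/* "e with acute" as the single character U+00E9, encoded in UTF-8. */
const char precomposed[] = "\xc3\xa9";

/* The same letter spelled as "e" followed by U+0301 COMBINING ACUTE
   ACCENT, encoded in UTF-8. */
const char decomposed[] = "e\xcc\x81";

/* Hypothetical helper: plain byte-wise comparison, with no Unicode
   canonicalization applied. Returns 1 if the spellings are identical. */
int same_spelling(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here same_spelling(precomposed, decomposed) returns 0, even though a reader sees the same letter on the screen in both cases.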


>there are EIGHT "translation phases" in C9x:


>1. Map source multibyte characters to the source character set.
> This includes converting end of record to newline, and it
> SPECIFICALLY INCLUDES CONVERTING non-basic characters to UCNs.


This is no longer true in the latest C9x draft.
Several of the crucial details have changed in this area.


>What this means is that if you want a tool to do something useful with
>identifiers in C source files, you would have to be very very silly
>not to do it by taking a freely available preprocessor (such as the
>GNU one) and bolting your tool on the end.


It depends on the application. If your application is required to find
all identifiers, then you're correct. But if you're just trying to
write a regular expression that matches (say) function headers, and if
you can assume a reasonable style in the source code, then it's
reasonable to use an ordinary regular expression containing '[:alpha:]'.
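
For instance, here is a minimal sketch of such a matcher using POSIX <regex.h>. It assumes a style in which a function definition's name starts in column 0 and is followed by an opening parenthesis; that style and the function name header_name_length are assumptions made for the example, not something the C9x draft says anything about.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: if `line` starts in column 0 with an
   identifier-shaped name followed by optional whitespace and "(",
   return the length of that name; return 0 if the line does not look
   like a function header, or -1 if the pattern fails to compile. */
int header_name_length(const char *line)
{
    regex_t re;
    regmatch_t m[2];   /* m[0]: whole match, m[1]: the name */
    int len = 0;

    if (regcomp(&re, "^([[:alpha:]_][[:alnum:]_]*)[[:space:]]*\\(",
                REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, line, 2, m, 0) == 0)
        len = (int)(m[1].rm_eo - m[1].rm_so);

    regfree(&re);
    return len;
}
```

With this sketch, "main (int argc, char **argv)" yields 4, while an indented call such as "  return f (x);" yields 0. A keyword that starts a line in column 0, such as "if (x)", would also match, which is exactly where the assumption about source style comes in.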


By the way, the GCC2 preprocessor (which I help maintain) doesn't
understand UCNs yet. I'm waiting for the C9x spec to settle down.


My impression is that UCNs were mandated by the ISO bureaucracy, and
will provide grist for the C language lawyers' mills indefinitely,
but in their current form they aren't particularly useful in real code.
For example, one can't reliably use UCNs for program interchange, since
(in the latest C9x draft, at least), uniformly replacing extended
characters with their corresponding UCNs can change the meaning of the
program.

