Re: Universal Character Names

"Dr Richard A. O'Keefe" <ok@atlas.otago.ac.nz>
17 Oct 1998 01:54:36 -0400

From comp.compilers


From:	"Dr Richard A. O'Keefe" <ok@atlas.otago.ac.nz>
Newsgroups:	comp.compilers
Date:	17 Oct 1998 01:54:36 -0400
Organization:	Department of Computer Science, University of Otago
References:	98-10-068 98-10-080
Keywords:	C, i18n

Eric Lemings <eric.b.lemings@lmco.com> wrote:
> [C++, Java, and C9x have "universal character names"]
> Needless to say this makes the old regexp for identifiers:
> [a-zA-Z_]+[a-zA-Z0-9_]*
> obsolete. How would you modify it to handle UCN?

That old regular expression was totally broken anyway, at least for C
and C++. You forgot that a backslash-newline sequence can be inserted
at _any_ point within _any_ token. So a _working_ regular expression
for recognising identifiers used to be
      [a-zA-Z_](((\?\?/|\\)\n)*[a-zA-Z0-9_])*
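
As a rough illustration of that pattern (a sketch only, written in Python's `re` dialect rather than lex; the names `SPLICE`, `IDENT`, `raw_identifier` and `spelled` are made up for this example), the question marks of the `??/` trigraph are escaped so that they are matched literally:

```python
import re

# A splice is a backslash, or its trigraph spelling ??/, followed by newline.
SPLICE = r"(?:(?:\?\?/|\\)\n)"

# A letter or underscore, then any number of (optional splices + one more
# identifier character). Splices are only consumed when another identifier
# character follows them.
IDENT = re.compile(r"[a-zA-Z_](?:" + SPLICE + r"*[a-zA-Z0-9_])*")

def raw_identifier(text):
    """Return the identifier at the start of text, splices included, or None."""
    m = IDENT.match(text)
    return m.group(0) if m else None

def spelled(raw):
    """Drop the splices, giving the identifier as later phases would see it."""
    return re.sub(SPLICE, "", raw)

if __name__ == "__main__":
    raw = raw_identifier("foo\\\nbar = 1;")
    print(repr(raw), "->", spelled(raw))
```

Note that the matched text and the identifier's spelling differ, which is one reason a plain regular expression is an awkward tool here.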

Brian Inglis wrote:
> Emphasis is on the *OLD*!
> AFAIR, modern syntax designed to handle this is:
> [[:alpha:]_]+[[:alpha:][:digit]_]*
> AFAIR, the extra brackets are a required part of the syntax...

Wrong. The set of letters does NOT vary with locale in C, C9x, C++,
or Java. The whole point of [:alpha:] is to adapt to the locale,
which an identifier pattern MUST NOT DO.

In fact the details about universal character names in C9x are
somewhat subtle. It is NOT the case that EVERY UCN is allowed at all,
still less that every UCN is allowed in an identifier. Annex H of the
draft standard lists as ranges exactly which UCNs are allowed in an
identifier. It would have been pleasant had C9x followed the rules in
section 5.14 "Identifiers" of the Unicode 2.0 book; the C9x people
were aware of those rules and Annex H is to some extent an
approximation of them.
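
A range list like that is easy to check mechanically. The following is only a sketch: the two ranges in `ALLOWED_RANGES` are PLACEHOLDERS assumed for illustration, not the Annex H list, and the function names are invented for this example.

```python
import re

# PLACEHOLDER ranges, assumed for illustration only; NOT the Annex H list.
# A real tool would copy the ranges from the annex of the draft it targets.
ALLOWED_RANGES = [(0x00C0, 0x00D6), (0x00D8, 0x00F6)]

# \u followed by four hex digits, or \U followed by eight.
UCN = re.compile(r"\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{8})")

def ucn_code_point(text):
    """Return the code point named by a UCN spelling, or None if malformed."""
    m = UCN.fullmatch(text)
    if m is None:
        return None
    return int(m.group(1) or m.group(2), 16)

def ucn_allowed_in_identifier(text, ranges=ALLOWED_RANGES):
    """True if text is a well-formed UCN whose value lies in a listed range."""
    cp = ucn_code_point(text)
    return cp is not None and any(lo <= cp <= hi for lo, hi in ranges)
```

The table is fixed at build time and never consults the locale, which is the property an identifier pattern needs.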

Frankly, I _wouldn't_ describe C9x/C++/Java identifiers using a
regular expression. Remember, there are EIGHT "translation phases" in
C9x:

1. Map source multibyte characters to the source character set.
        This includes converting end of record to newline, and it
        SPECIFICALLY INCLUDES CONVERTING non-basic characters to
        UCNs. So you are allowed to have an <e-acute> character
        in your source code, and it may even be represented in the
        source file by a single 16#E9# byte, but subsequent phases
        of translation will 'see' \u00E9 or possibly even
        \u0065\u0301 <e,floating acute> instead.

        The main consequence of this for your regular expression is
        that if you want to recognise identifiers in SOURCE files,
        you need to handle the full range of local multibyte codes
        AS WELL AS universal character names. If your regular
        expression processor is 8-bit-clean, you might be able to
        get away with
        letter = [a-zA-Z_] | [\x80-\xFF]+ | \\u[0-9a-fA-F]{4} | ...
        ident = letter (letter | digit)*

1a. After this, trigraphs are replaced. (Yes, that means C9x
        really has nine phases, not 8.)

2. \<newline> is spliced out.

3. The input is tokenized as a sequence of pp-tokens and white space.

4. Preprocessing is done: directives, macros, &c.
        THIS PHASE MAY GENERATE NEW IDENTIFIERS, so foo(x)(y) may
        actually _be_ an identifier even though it doesn't _look_
        like one. (No, you can't generate new UCNs here.)

5. Characters are now converted from the source character set
        to the execution character set.

6. Strings are pasted (narrow strings with narrow strings, wide
        strings with wide strings). The effect of "x" L"y" and
        L"x" "y" is not defined, which is a pity, because that was
        a very nasty problem that they should have fixed.

7. Now pp-tokens are converted to tokens, and of course some
        pp-tokens that look like identifiers are actually keywords.
        White space including comments is finally discarded.

7a. The program is parsed. (Yes, that means there are really
        ten phases, not 8.)

8. External references are resolved and everything is put into an
        "image" suitable for execution in the target environment.
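
To make the phase 1 point concrete, here is a crude sketch of the 8-bit-clean source-file pattern, using Python's `re` on bytes. The `\U` eight-digit alternative is my assumption about what the "..." should include, and `LETTER`, `DIGIT`, `IDENT` and `identifiers` are names invented for this example. It ignores splices, comments, strings and numbers (so the `x1F` in `0x1F` would be reported), and it does not check UCN values against the permitted ranges.

```python
import re

# Any run of high-bit bytes is treated as one "letter", so local multibyte
# encodings pass through untouched; UCNs are matched by their spelling.
LETTER = rb"(?:[a-zA-Z_]|[\x80-\xFF]+|\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})"
DIGIT = rb"[0-9]"
IDENT = re.compile(LETTER + rb"(?:" + LETTER + rb"|" + DIGIT + rb")*")

def identifiers(source: bytes):
    """Return every identifier-shaped byte string found in source."""
    return [m.group(0) for m in IDENT.finditer(source)]

if __name__ == "__main__":
    # 0xE9 is e-acute in ISO 8859-1; the second name spells it as a UCN.
    for name in identifiers(b"caf\xe9 = r\\u00E9sum\\u00e9 + x1;"):
        print(name)
```

The two spellings of e-acute come out as different byte strings even though later phases treat them as the same character, so a tool that compares identifiers still has to normalise them.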

What this means is that if you want a tool to do something useful with
identifiers in C source files, you would have to be very very silly
not to do it by taking a freely available preprocessor (such as the
GNU one) and bolting your tool on the end.

At least with Java there's none of the preprocessor nonsense to worry
about, so you _could_ write a regular expression to recognise
identifiers in Java source, but C9x and C++ use the preprocessor to
put this beyond the reach of the average programmer.

/??/
*no identifiers in here*??/
/

If your tool thinks there are four identifiers there,
it's broken!
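
A small sketch shows why. It is a toy, not a preprocessor: it replaces only the `??/` trigraph, splices lines, strips `/* */` comments with a regular expression (so it would be fooled by comment markers inside string literals), and then looks for identifier-shaped words. All function names are invented for this example.

```python
import re

WORD = r"[A-Za-z_][A-Za-z0-9_]*"

def early_phases(text):
    """Toy versions of phases 1a and 2."""
    text = text.replace("??/", "\\")    # phase 1a: only the ??/ trigraph
    return text.replace("\\\n", "")     # phase 2: splice out \<newline>

def strip_comments(text):
    """Replace each /* ... */ comment with a single space."""
    return re.sub(r"/\*.*?\*/", " ", text, flags=re.DOTALL)

def naive_identifiers(text):
    """What a tool sees if it scans the raw source text."""
    return re.findall(WORD, text)

def identifiers(text):
    """Scan only after trigraphs, splices and comments are dealt with."""
    return re.findall(WORD, strip_comments(early_phases(text)))

if __name__ == "__main__":
    sample = "/??/\n*no identifiers in here*??/\n/\n"
    print(naive_identifiers(sample))
    print(identifiers(sample))
```

Once `??/` becomes a backslash and the backslash-newlines are spliced out, the three lines are the single comment `/*no identifiers in here*/`, so the second scan finds nothing.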
