From: | "Dr Richard A. O'Keefe" <ok@atlas.otago.ac.nz> |
Newsgroups: | comp.compilers |
Date: | 17 Oct 1998 01:54:36 -0400 |
Organization: | Department of Computer Science, University of Otago |
References: | 98-10-068 98-10-080 |
Keywords: | C, i18n |
Eric Lemings <eric.b.lemings@lmco.com> wrote:
> [C++, Java, and C9x have "universal character names"]
> Needless to say this makes the old regexp for identifiers:
> [a-zA-Z_]+[a-zA-Z0-9_]*
> obsolete. How would you modify it to handle UCN?
That old regular expression was totally broken anyway, at least for C
and C++. You forgot that a backslash-newline sequence can be inserted
at _any_ point within _any_ token. So a _working_ regular expression
for recognising identifiers used to be
[a-zA-Z_](((\?\?/|\\)\n)*[a-zA-Z0-9_])*
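As a rough illustration, the pattern above can be tried out in Python. This is only a sketch: the names SPLICE, IDENT and ident_name are made up for the example, and the question marks of the trigraph are escaped because Python's re module would otherwise read them as quantifiers.

```python
import re

# A line splice: backslash-newline, or the trigraph ??/ (which stands for
# a backslash) followed by a newline.
SPLICE = r"(?:(?:\?\?/|\\)\n)"

# An identifier in which any number of splices may precede each
# non-initial character.
IDENT = re.compile(r"[a-zA-Z_](?:" + SPLICE + r"*[a-zA-Z0-9_])*")

def ident_name(text):
    """Return the identifier at the start of text with splices removed,
    or None if text does not start with an identifier."""
    m = IDENT.match(text)
    if m is None:
        return None
    return re.sub(SPLICE, "", m.group(0))

# "fo", backslash-newline, "o", ??/ newline, "bar"
print(ident_name("fo\\\no??/\nbar = 1;"))  # prints foobar
```

The match itself still contains the splices, so the spelling of the identifier has to be recovered in a second step, as ident_name does here.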
Brian Inglis wrote:
> Emphasis is on the *OLD*!
> AFAIR, modern syntax designed to handle this is:
> [[:alpha:]_]+[[:alpha:][:digit:]_]*
> AFAIR, the extra brackets are a required part of the syntax...
Wrong. The set of letters does NOT vary with locale in C, C9x, C++,
or Java. The whole point of [:alpha:] is to adapt to the locale,
which an identifier pattern MUST NOT DO.
In fact the details about universal character names in C9x are
somewhat subtle. It is NOT the case that EVERY ucn is allowed at all,
still less that every ucn is allowed in an identifier. Annex H of the
draft standard lists as ranges exactly which UCNs are allowed in an
identifier. It would have been pleasant had C9x followed the rules in
section 5.14 "Identifiers" of the Unicode 2.0 book; the C9x people
were aware of those rules and Annex H is to some extent an
approximation of them.
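Since the allowed UCNs are given as a list of ranges, a tool would most naturally check them with a table lookup rather than a character class. A minimal Python sketch follows. The ranges in it are PLACEHOLDERS chosen for illustration and are not copied from Annex H; ALLOWED_RANGES and ucn_allowed are hypothetical names.

```python
from bisect import bisect_right

# PLACEHOLDER ranges for illustration only; NOT transcribed from Annex H.
# A real tool would copy the ranges from the standard it targets.
# The list must be sorted and the ranges must not overlap.
ALLOWED_RANGES = [
    (0x00C0, 0x00D6),
    (0x00D8, 0x00F6),
    (0x00F8, 0x00FF),
]
_STARTS = [lo for lo, _ in ALLOWED_RANGES]

def ucn_allowed(code_point):
    """True if code_point lies inside one of the ranges in the table."""
    # Find the last range whose start is <= code_point, then check its end.
    i = bisect_right(_STARTS, code_point) - 1
    return i >= 0 and code_point <= ALLOWED_RANGES[i][1]

print(ucn_allowed(0x00E9))  # True: inside the second placeholder range
print(ucn_allowed(0x00D7))  # False: in the gap between two ranges
```

The point of the table is that it is fixed at build time and never consults the locale.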
Frankly, I _wouldn't_ describe C9x/C++/Java identifiers using a
regular expression. Remember, there are EIGHT "translation phases" in
C9x:
1. Map source multibyte characters to the source character set.
This includes converting end of record to newline, and it
SPECIFICALLY INCLUDES CONVERTING non-basic characters to
UCNs. So you are allowed to have an <e-acute> character
in your source code, and it may even be represented in the
source file by a single 16#E9# byte, but subsequent phases
of translation will 'see' \u00E9 or possibly even
\u0065\u0301 <e,floating acute> instead.
The main consequence of this for your regular expression is
that if you want to recognise identifiers in SOURCE files,
you need to handle the full range of local multibyte codes
AS WELL AS universal character names. If your regular
expression processor is 8-bit-clean, you might be able to
get away with
letter = [a-zA-Z_] | [\x80-\xFF]+ | \\u[0-9a-fA-F]{4} | ...
ident = letter (letter | digit)*
1a. After this, trigraphs are replaced. (Yes, that means C9x
really has nine phases, not 8.)
2. \<newline> is spliced out.
3. the input is tokenized as a sequence of pp-tokens and white space
4. preprocessing is done, directives, macros, &c.
THIS PHASE MAY GENERATE NEW IDENTIFIERS, so foo(x)(y) may
actually _be_ an identifier even though it doesn't _look_
like one. (No, you can't generate new UCNs here.)
5. Characters are now converted from the source character set
to the execution character set.
6. Strings are pasted (narrow strings with narrow strings, wide
strings with wide strings). The effect of "x" L"y" and
L"x" "y" is not defined, which is a pity, because that was
a very nasty problem that they should have fixed.
7. Now pp-tokens are converted to tokens, and of course some
pp-tokens that look like identifiers are actually keywords.
White space including comments is finally discarded.
7a. The program is parsed. (Yes, that means there are really
ten phases, not 8.)
8. External references are resolved and everything is put into an
"image" suitable for execution in the target environment.
What this means is that if you want a tool to do something useful with
identifiers in C source files, you would have to be very very silly
not to do it by taking a freely available preprocessor (such as the
GNU one) and bolting your tool on the end.
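To give a feel for what even the earliest phases involve, here is a minimal Python sketch of trigraph replacement (1a) and line splicing (2) only. It is an illustration under simplifying assumptions, not a substitute for a real preprocessor; the function names are invented for the example.

```python
import re

# The nine trigraphs and the characters they stand for.
TRIGRAPHS = {
    "??=": "#", "??(": "[", "??/": "\\", "??)": "]", "??'": "^",
    "??<": "{", "??!": "|", "??>": "}", "??-": "~",
}
_TRIGRAPH_RE = re.compile(r"\?\?[=(/)'<!>-]")

def replace_trigraphs(text):
    """Phase 1a: replace each trigraph, scanning left to right."""
    return _TRIGRAPH_RE.sub(lambda m: TRIGRAPHS[m.group(0)], text)

def splice_lines(text):
    """Phase 2: delete every backslash immediately followed by newline."""
    return text.replace("\\\n", "")

def early_phases(text):
    # Order matters: ??/ must become a backslash before splicing.
    return splice_lines(replace_trigraphs(text))

print(early_phases("??=def??/\nine X 1\n"), end="")  # prints #define X 1
```

Tokenizing, macro expansion and keyword recognition all still lie ahead of this point, which is why reusing an existing preprocessor is the sensible route.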
At least with Java there's none of the preprocessor nonsense to worry
about, so you _could_ write a regular expression to recognise
identifiers in Java source, but C9x and C++ use the preprocessor to
put this beyond the reach of the average programmer.
/??/
*no identifiers in here*??/
/
If your tool thinks there are four identifiers there,
it's broken!
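The difference is easy to check with a short Python sketch. It is deliberately crude: it handles only the ??/ trigraph, and its comment stripping ignores string and character literals, so treat it as a demonstration of this one example rather than a usable tool. All names are hypothetical.

```python
import re

# The three lines from the example above.
SOURCE = "/??/\n*no identifiers in here*??/\n/\n"

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def naive_identifiers(text):
    """Apply the identifier pattern directly to the raw source text."""
    return IDENT.findall(text)

def identifiers_after_early_phases(text):
    """Replace ??/, splice lines, drop comments, then look for identifiers."""
    text = text.replace("??/", "\\")   # the only trigraph used in the example
    text = text.replace("\\\n", "")    # backslash-newline is spliced out
    text = re.sub(r"/\*.*?\*/", " ", text, flags=re.DOTALL)  # comment becomes a space
    return IDENT.findall(text)

print(naive_identifiers(SOURCE))               # ['no', 'identifiers', 'in', 'here']
print(identifiers_after_early_phases(SOURCE))  # []
```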