Re: UCS Identifiers and compilers

Hans-Peter Diettrich
Thu, 11 Dec 2008 12:17:27 +0100

          From comp.compilers

From: Hans-Peter Diettrich
Newsgroups: comp.compilers
Date: Thu, 11 Dec 2008 12:17:27 +0100
Organization: Compilers Central
References: 08-12-061
Keywords: i18n
Posted-Date: 11 Dec 2008 07:56:02 EST

William Clodius wrote:

> 4. How does the incorporation of the larger character sets affect your
> lexical analysis? Is hash table efficiency affected? Do you have to deal
> with case/accent independence and if so how useful are the UCS
> recommendations for languages?

IMO a compiler is not a text processor, and consequently should not be
burdened with natural language conventions and textual representation
oddities. Text handling (as opposed to string handling) should be the
task of the application coder, not of the compiler or language designer.

Handling Unicode in string literals is not a special problem when the
source code is already stored in UCS/UTF. The use of escape sequences in
string literals may deserve some consideration; a concatenation of
"plain" text and control character sequences may be preferable.
Identifiers should also be no problem, in particular when they are case
sensitive, so that a binary comparison is sufficient for lookup.
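
A minimal sketch of what such a binary lookup could look like, assuming
identifiers arrive as UTF-8 and are stored byte for byte. The SymbolTable
class and its method names are hypothetical, chosen only for illustration:

```python
class SymbolTable:
    """Hypothetical symbol table keyed on the raw UTF-8 bytes of a name."""

    def __init__(self):
        self._entries = {}

    def declare(self, name: str, info: str) -> None:
        # No case folding and no accent handling: the key is just bytes.
        self._entries[name.encode("utf-8")] = info

    def lookup(self, name: str):
        # Hashing and equality both work on the byte sequence alone.
        return self._entries.get(name.encode("utf-8"))


table = SymbolTable()
table.declare("zähler", "int")
table.declare("Zähler", "float")
```

Because the comparison is purely binary, "zähler" and "Zähler" are two
distinct identifiers here, and a spelling such as "ZÄHLER" that was never
declared is simply not found.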

APL used glyphs for keywords, which in those days required appropriately
equipped systems; nowadays this would be more practical and could
simplify the lexers. I'm also waiting for the first Chinese programming
language with glyphs for keywords; what will then happen to the shared
code, which is currently still written in traditional programming
languages and ASCII characters?

AFAIK the Russian attempts to translate keywords into Cyrillic were
not really successful, but mostly due to the longer words, not
because of readability. As with Chinese glyphs, it should be possible
to parse "nationalized" source code and to reproduce the AST in a
different translation, properly retaining comments. Perhaps future
development systems will include such a feature, along with a dictionary
for commonly used identifier names and abbreviations?
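
A rough sketch of the idea, with loud caveats: it works on a token stream
rather than a real AST, the keyword dictionary RU_TO_EN is made up for
illustration, comments are assumed to run from '#' to the end of the line,
and string literals are not treated specially.

```python
import re

# Hypothetical keyword dictionary: Russian spelling -> English spelling.
RU_TO_EN = {"если": "if", "иначе": "else", "пока": "while"}

# A comment up to end of line, a run of word characters, or any single
# non-word character (whitespace and punctuation included).
TOKEN = re.compile(r"#[^\n]*|\w+|\W")


def translate_keywords(source: str, mapping: dict) -> str:
    """Replace keywords per the mapping; copy everything else unchanged."""
    out = []
    for match in TOKEN.finditer(source):
        token = match.group(0)
        if token.startswith("#"):
            out.append(token)  # comments are retained verbatim
        else:
            out.append(mapping.get(token, token))
    return "".join(out)


source = (
    "если x > 0:  # если в комментарии\n"
    "    пока x:\n"
    "        x -= 1\n"
    "иначе:\n"
    "    x = 0\n"
)
translated = translate_keywords(source, RU_TO_EN)
```

The keywords come out in the other "translation" while the comment keeps
its original wording; a real tool would do the mapping on the parsed tree
and would also need the identifier dictionary mentioned above.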

Just my 0,02$ <BG>