Re: UCS Identifiers and compilers

"Ira Baxter" <idbaxter@semdesigns.com>
Thu, 11 Dec 2008 19:27:39 -0600

          From comp.compilers


From: "Ira Baxter" <idbaxter@semdesigns.com>
Newsgroups: comp.compilers
Date: Thu, 11 Dec 2008 19:27:39 -0600
Organization: Compilers Central
References: 08-12-061
Keywords: i18n
Posted-Date: 12 Dec 2008 10:32:38 EST

"William Clodius" <wclodius@los-alamos.net> wrote in message
> As a hobby I have started work on a language design and one of the
> issues that has come to concern me is the impact on the usefulness and
> complexity of implementation is the incorporation of UCS/Unicode into
> the language, particularly in identifiers.


What's the problem? What this basically does is extend the characters
allowed in identifiers, literal strings and comments. This isn't
particularly hard to specify or implement. Our DMS Software
Reengineering Toolkit has offered this capability since 1996.


> Most languages these days seem to be trying to exploit UCS including
> C, C++, Ada, Haskell, and Scheme,


and virtually every new language that comes along. We just
encountered CLIF, the Common Language for Interchange of (logic)
Formulas. Its identifiers and comments are explicitly Unicode based.


Also, if you want to process languages in many foreign character sets,
e.g. those with double-byte characters such as Japanese Shift-JIS, you
had better be prepared for this. Translating everything into Unicode
before you lex makes this particularly easy. (Even EBCDIC is easily
handled this way.)
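
A minimal sketch of the decode-first idea, in Python rather than
anything DMS-specific. The names decode_source and identifiers are
made up for illustration, the identifier rule is a toy one, and cp037
is just one of several EBCDIC code pages. The point is only that once
the bytes are decoded, the same lexer code sees Unicode text no matter
what the file was stored in.

```python
import re

def decode_source(raw: bytes, encoding: str) -> str:
    """Translate source bytes into Unicode text before any lexing happens."""
    return raw.decode(encoding)

# Toy identifier rule (an assumption, not any real language's definition):
# a letter or underscore followed by letters, digits or underscores.
IDENT = re.compile(r"[^\W\d]\w*")

def identifiers(text: str) -> list[str]:
    """Return the identifiers found in already-decoded Unicode text."""
    return IDENT.findall(text)

# The same lexer code handles a Shift-JIS file and an EBCDIC file.
sjis_bytes = "カウント = count + 1".encode("shift_jis")
ebcdic_bytes = "total = total + 1".encode("cp037")

print(identifiers(decode_source(sjis_bytes, "shift_jis")))
print(identifiers(decode_source(ebcdic_bytes, "cp037")))
```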


> although there are at least a few holdouts such as Fortran. As
> posters to this newsgroup are both users, implementors and language
> designers with a bit more contact with the outside world I would like
> responses to the following questions
>
> 1. Do many of your users make use of letters outside the ASCII/Latin-1
> sets?


We have Japanese customers. The identifiers are Latin. But the
comments, they are definitely Japanese.


> 2. What are the most useful development environments in terms of dealing
> with extended character sets?


Unicode :-}


> 3. Visually how well do alternative character sets mesh with a language
> with ASCII keywords and left to right, up and down display, typical of
> most programming languages? eg. how well do scripts with ideographs,
> context dependent glyphs for the same character, and alternative spatial
> ordering work, or character sets with characters with glyphs similar to
> those used for ASCII (the l vs 1 and O vs. 0 problem multiplied)


Well, we haven't seen Hebrew or Arabic programming languages.
The left-to-right paradigm seems to work pretty well so far.


> 4. How does the incorporation of the larger character sets affect
> your lexical analysis? Is hash table efficiency affected? Do you
> have to deal with case/accent independence and if so how useful are
> the UCS recommendations for languages?


Lexical analysis is slowed a little bit. We use the heuristic that
quite a lot of characters are still in the Latin set, and tend to do
table lookups (unit time) to classify characters in that range. For
character codes larger than that, our lexer generator manufactures
functions that classify character ranges with balanced binary trees
of comparisons, so even ugly character groupings require at worst 15
compares (for the 16-bit Unicode set). The difference isn't really
noticeable in practice.
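
The two-tier classification can be sketched roughly as follows. This
is a hand-written Python illustration, not the code our lexer
generator emits: the range list is a small made-up sample rather than
real Unicode identifier data, the class names are arbitrary, and a
binary search over sorted ranges stands in for the generated tree of
compares.

```python
import bisect

def _small_class(cp: int) -> str:
    """Classify one code point below 256; used only to fill the table."""
    ch = chr(cp)
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "other"

# Tier 1: direct table lookup (unit time) for the Latin range.
LATIN_TABLE = [_small_class(cp) for cp in range(256)]

# Tier 2: sorted, non-overlapping (low, high, class) ranges.
# Illustrative sample only; whole blocks are treated as letters.
RANGES = [
    (0x0370, 0x03FF, "letter"),  # Greek
    (0x0400, 0x04FF, "letter"),  # Cyrillic
    (0x3040, 0x309F, "letter"),  # Hiragana
    (0x30A0, 0x30FF, "letter"),  # Katakana
    (0x4E00, 0x9FFF, "letter"),  # CJK ideographs
]
STARTS = [low for low, _, _ in RANGES]

def classify(ch: str) -> str:
    """Return the lexical class of a single character."""
    cp = ord(ch)
    if cp < 256:
        return LATIN_TABLE[cp]
    # Binary search for the last range starting at or before cp.
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return "other"

for sample in "a7 +яカ漢€":
    print(repr(sample), classify(sample))
```

The number of compares in the second tier grows with the logarithm of
the number of ranges, which is why even a fragmented character class
stays cheap.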


-- IDB

