Related articles:
  Programming language specification languages — nmm1@cus.cam.ac.uk (2001-09-20)
  Re: Programming language specification languages — rkrayhawk@aol.com (2001-09-25)
  Re: Programming language specification languages — joachim_d@gmx.de (Joachim Durchholz) (2001-10-06)
  I'm russian! was character sets — crystal-pin@mail.ru (2001-10-13)
  Re: I'm russian! was character sets — tmaslen@wedgetail.com (Thomas Maslen) (2001-10-20)
  Unicode, was: I'm Russian! — bear@sonic.net (Ray Dillinger) (2001-11-25)
  Re: Unicode, was: I'm Russian! — loewis@informatik.hu-berlin.de (Martin von Loewis) (2001-11-26)
From: Martin von Loewis <loewis@informatik.hu-berlin.de>
Newsgroups: comp.compilers
Date: 26 Nov 2001 21:56:48 -0500
Organization: Humboldt University Berlin, Department of Computer Science
References: 01-09-087 01-09-106 01-10-021 01-10-061 01-10-105 01-11-103
Keywords: i18n
Posted-Date: 26 Nov 2001 21:56:48 EST
Ray Dillinger <bear@sonic.net> writes:
> First, there are multiple characters that look the same. We are all
> familiar with the problems posed by lower-case "L" and the digit "1",
> and by the upper-case "O" and the digit "0". Unicode has multiplied
> these problems by hundreds, making it possible to create pages of code
> that look correct but will not compile in any reasonable system.
I think this is a made-up problem. Mixing, say, a latin A and a
cyrillic A simply won't happen in a program, in real life (unlike the
1/l problem, which does happen). People writing cyrillic identifiers
will use the cyrillic A consistently, and anybody else will have
problems even typing these identifiers in the first place.
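A small Python sketch makes the confusable-character point concrete (the snippet is illustrative only; Python is just a convenient vehicle here):

```python
import unicodedata

latin_a = "A"          # U+0041 LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"  # U+0410 CYRILLIC CAPITAL LETTER A

# The two glyphs render identically in most fonts, but they are
# distinct code points, so identifiers containing them differ.
print(latin_a == cyrillic_a)          # False
print(unicodedata.name(latin_a))      # LATIN CAPITAL LETTER A
print(unicodedata.name(cyrillic_a))   # CYRILLIC CAPITAL LETTER A
```

The point stands: the confusion is only possible if someone deliberately mixes scripts, which a programmer typing on one keyboard layout will not do by accident.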
For a long time, many programmers will restrict themselves to ASCII
for identifiers, because you cannot properly use a library with, say,
Kanji identifiers unless you have a Kanji keyboard.
> Second, the characters have "directionality". This interferes with
> the programmer's understanding of the sequence of characters in a
> page of source code, causing yet more debugging problems.
Strings always have the "right" directionality in Unicode. It would be
pointless to display Arabic characters in source code in a
left-to-right fashion; nobody could read it anymore. This is not a
problem with Unicode; it is inherent in the languages.
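Directionality is not guesswork on the renderer's part: every Unicode code point carries a bidirectional category, which the bidi algorithm uses to lay text out. A quick Python look at that metadata (illustrative only):

```python
import unicodedata

# Each code point has a bidirectional category; renderers apply the
# Unicode bidi algorithm to these to display mixed-direction text.
print(unicodedata.bidirectional("A"))       # 'L'  (strong left-to-right)
print(unicodedata.bidirectional("\u05D0"))  # 'R'  (Hebrew alef, right-to-left)
print(unicodedata.bidirectional("\u0627"))  # 'AL' (Arabic alef, right-to-left)
```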
> Third, "Endian" issues can happen as unicode documents migrate across
> platforms; the efforts of the committee to provide a "graceful"
> solution instead require particular and special handling in all
> migrations, and code that can recognize either endianness.
I think this problem will disappear in the long run as everybody will
use UTF-8 for Unicode files.
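The reason UTF-8 sidesteps the endianness question is that it is a byte-oriented encoding: there is exactly one serialization on every platform, unlike UTF-16, which needs a BOM or out-of-band knowledge to pick a byte order. A minimal demonstration in Python:

```python
text = "h\u00e9llo"  # "héllo"

# UTF-16 comes in two byte orders, so files must signal which one:
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")
print(little != big)          # True: the two serializations differ

# UTF-8 is defined byte by byte; every platform produces the same bytes.
print(text.encode("utf-8"))   # b'h\xc3\xa9llo'
```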
> Fourth, the system that was supposed to finally save us from having
> multiple different character lengths mixed together as we mixed
> alphabets has gone utterly mad; now there are 8, 16, 20, and 32-bit
> representations for characters within Unicode.
Again, in the long run, I expect that programmers can continue to
assume simple indexing of characters in a string (normalization
issues aside). Libraries will either use a fixed-width representation
for internal storage, or transparently offer random access on a
variable-length representation. Unlike earlier multi-byte encodings,
this is possible for Unicode with little effort; you can use the same
indexing algorithm for all documents.
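This is how modern string libraries already behave: the on-disk encoding is decoded once, and indexing then works on code points regardless of how many bytes each one occupied. Sketched in Python (illustrative; the string values are arbitrary):

```python
s = "abc\u4e16\u754c"  # ASCII mixed with two CJK characters

# Indexing and length are by code point, not by encoded byte,
# so the same algorithm works for any document.
print(len(s))                     # 5 code points
print(s[3])                       # the fourth code point, U+4E16
print(len(s.encode("utf-8")))     # 9 bytes: 3 one-byte + 2 three-byte sequences
```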
> Fifth, there are "holes" in the sequence of unicode character codes
> and applications have to be aware of them. This makes iterating over
> the code points into a major pain in the butt.
Why would you want to do that?
> Sixth, I don't want to add all the code and cruft to every system I
> produce, that I would have to add to support the complexities and
> subtleties of Unicode. It's just not worth it.
Right. Instead, all the cruft is in the system libraries (just like it
is for ASCII).
> If a simple, 32-bit "extended ascii" code comes along, I'll be the
> first to support it. But Unicode as we now see it is a crock.
There won't be anything else. Just assume that Unicode is a simple,
32-bit "extended ascii" today, and make every input you get fit that
view of the world.
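That "extended ascii" view can be sketched in a few lines of Python (the helper name is mine, not an established API): decode whatever encoding the input arrives in, and work with plain code point values from then on.

```python
def code_points(raw: bytes, encoding: str = "utf-8") -> list[int]:
    """Decode input bytes into a flat list of code point values --
    the simple '32-bit extended ascii' view of the text."""
    return [ord(ch) for ch in raw.decode(encoding)]

print(code_points(b"Az"))                 # [65, 122]
print(code_points("\u20ac".encode("utf-8")))  # [8364], U+20AC EURO SIGN
```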
Regards,
Martin