Re: Unicode, was: I'm Russian!


Related articles
Programming language specification languages nmm1@cus.cam.ac.uk (2001-09-20)
Re: Programming language specification languages rkrayhawk@aol.com (2001-09-25)
Re: Programming language specification languages joachim_d@gmx.de (Joachim Durchholz) (2001-10-06)
I'm russian! was character sets crystal-pin@mail.ru (2001-10-13)
Re: I'm russian! was character sets tmaslen@wedgetail.com (Thomas Maslen) (2001-10-20)
Unicode, was: I'm Russian! bear@sonic.net (Ray Dillinger) (2001-11-25)
Re: Unicode, was: I'm Russian! loewis@informatik.hu-berlin.de (Martin von Loewis) (2001-11-26)

From: Martin von Loewis <loewis@informatik.hu-berlin.de>
Newsgroups: comp.compilers
Date: 26 Nov 2001 21:56:48 -0500
Organization: Humboldt University Berlin, Department of Computer Science
References: 01-09-087 01-09-106 01-10-021 01-10-061 01-10-105 01-11-103
Keywords: i18n
Posted-Date: 26 Nov 2001 21:56:48 EST

Ray Dillinger <bear@sonic.net> writes:


> First, there are multiple characters that look the same. We are all
> familiar with the problems posed by lower-case "L" and the digit "1",
> and by the upper-case "O" and the digit "0". Unicode has multiplied
> these problems by hundreds, making it possible to create pages of code
> that look correct but will not compile in any reasonable system.


I think this is a made-up problem. Mixing, say, a Latin A and a
Cyrillic A simply won't happen in a real program (unlike the
1/l problem, which does happen). People writing Cyrillic identifiers
will use the Cyrillic A consistently, and anybody else will have
trouble even typing those identifiers in the first place.
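
For illustration, here is a minimal Python sketch (Python's
unicodedata module is just one convenient way to poke at the
character database; the identifier and the crude script check are
made up for the example):

    import unicodedata

    latin_a = "A"          # U+0041 LATIN CAPITAL LETTER A
    cyrillic_a = "\u0410"  # U+0410 CYRILLIC CAPITAL LETTER A

    # The glyphs look alike, but the code points and names differ.
    print(hex(ord(latin_a)), unicodedata.name(latin_a))
    print(hex(ord(cyrillic_a)), unicodedata.name(cyrillic_a))
    print(latin_a == cyrillic_a)          # False

    def scripts_used(identifier):
        """Crude script check: the first word of each letter's
        Unicode name, e.g. LATIN or CYRILLIC."""
        return {unicodedata.name(ch).split()[0]
                for ch in identifier if ch.isalpha()}

    print(scripts_used("p\u0410ge"))      # {'LATIN', 'CYRILLIC'}

A compiler or lint tool that wanted to warn about such mixing could do
so cheaply; the point is only that in practice it rarely comes up.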


For a long time, many programmers will restrict themselves to ASCII
for identifiers, because you can't properly use a library with, say,
Kanji identifiers unless you have a Kanji keyboard.


> Second, the characters have "directionality". This interferes with
> the programmer's understanding of the sequence of characters in a
> page of source code, causing yet more debugging problems.


Strings always have the "right" directionality in Unicode. It would be
pointless to display Arabic characters in source code in a
left-to-right fashion; nobody could read it anymore. This is not a
problem with Unicode; it is inherent in the languages.
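
The directionality lives in the per-character properties, not in the
byte order of the file: the source is stored in logical order, and it
is the editor's job to display it. A small sketch (again Python,
purely illustrative):

    import unicodedata

    # Bidirectional category is a property of each character:
    for ch in ["a", "1", "\u05D0", "\u0627"]:   # Latin a, digit 1, Hebrew alef, Arabic alef
        print("U+%04X" % ord(ch), unicodedata.bidirectional(ch))
    # 'L'  left-to-right letter, 'EN' European number,
    # 'R'  right-to-left (Hebrew), 'AL' right-to-left Arabic letter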


> Third, "Endian" issues can happen as unicode documents migrate across
> platforms; the efforts of the committee to provide a "graceful"
> solution instead require particular and special handling in all
> migrations, and code that can recognize either endianness.


I think this problem will disappear in the long run as everybody will
use UTF-8 for Unicode files.
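
The endianness question only exists for the 16- and 32-bit encoding
forms; UTF-8 is a plain byte stream and looks the same on every
platform. A small sketch (Python, illustrative):

    text = "\u041a\u043e\u0434"              # "Код", Russian for "code"

    # UTF-16 needs a byte order (or a BOM announcing it):
    print(text.encode("utf-16-le").hex())    # 1a043e043404
    print(text.encode("utf-16-be").hex())    # 041a043e0434

    # UTF-8 has no byte-order variants; the same bytes work everywhere:
    print(text.encode("utf-8").hex())        # d09ad0bed0b4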


> Fourth, the system that was supposed to finally save us from having
> multiple different character lengths mixed together as we mixed
> alphabets has gone utterly mad; now there are 8, 16, 20, and 32-bit
> representations for characters within Unicode.


Again, in the long run, I expect that programmers can continue to
assume simple indexing of characters in a string (leaving aside
normalization issues). Libraries will either use a fixed-width
representation for internal storage, or transparently offer random
access on a variable-length representation. Unlike earlier multi-byte
encodings, this is possible for Unicode with little effort; you can
use the same indexing algorithm for all documents.
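
In other words, the encoding is a property of the file on disk; once
the text is decoded into whatever internal form the library uses, the
programmer sees a sequence of code points. A minimal Python sketch of
that boundary (the variable names are invented):

    source = "\u043f\u0435\u0440\u0435\u043c = 1"   # "перем = 1"

    data_utf8  = source.encode("utf-8")     # different bytes on disk ...
    data_utf16 = source.encode("utf-16")

    s1 = data_utf8.decode("utf-8")          # ... same code points in memory
    s2 = data_utf16.decode("utf-16")
    assert s1 == s2
    assert s1[0] == "\u043f" and len(s1) == 9   # indexing by code point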


> Fifth, there are "holes" in the sequence of unicode character codes
> and applications have to be aware of them. This makes iterating over
> the code points into a major pain in the butt.


Why would you want to do that?
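
And if a tool really does need to know whether a given code point is
assigned, that knowledge already sits in the character database the
library ships with; nothing in the application has to hard-code the
gaps. A rough Python sketch (the helper name is made up):

    import unicodedata

    def is_assigned(cp):
        # General category "Cn" means "not assigned" in the Unicode
        # tables this interpreter was built with.
        return unicodedata.category(chr(cp)) != "Cn"

    print(is_assigned(0x0410))   # True  -- CYRILLIC CAPITAL LETTER A
    print(is_assigned(0x0378))   # False -- currently unassigned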


> Sixth, I don't want to add all the code and cruft to every system I
> produce, that I would have to add to support the complexities and
> subtleties of Unicode. It's just not worth it.


Right. Instead, all the cruft is in the system libraries (just like it
is for ASCII).


> If a simple, 32-bit "extended ascii" code comes along, I'll be the
> first to support it. But Unicode as we now see it is a crock.


There won't be anything else. Just assume that Unicode is a simple,
32-bit "extended ascii" today, and make every input you get fit that
view of the world.
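
Concretely, that means decoding at the boundary and working with code
points inside. A minimal sketch, assuming UTF-8 input (the function
name and path are invented for the example):

    def read_as_code_points(path, encoding="utf-8"):
        """Decode whatever arrives at the boundary; inside the program,
        treat the text as a flat sequence of 32-bit code points."""
        with open(path, encoding=encoding) as f:
            return [ord(ch) for ch in f.read()]

    # e.g.  code_points = read_as_code_points("module.src")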


Regards,
Martin

