Unicode, was: I'm Russian!

Ray Dillinger <bear@sonic.net>
25 Nov 2001 22:37:06 -0500

From comp.compilers

Related articles
Programming language specification languages nmm1@cus.cam.ac.uk (2001-09-20)
Re: Programming language specification languages rkrayhawk@aol.com (2001-09-25)
Re: Programming language specification languages joachim_d@gmx.de (Joachim Durchholz) (2001-10-06)
I'm russian! was character sets crystal-pin@mail.ru (2001-10-13)
Re: I'm russian! was character sets tmaslen@wedgetail.com (Thomas Maslen) (2001-10-20)
Unicode, was: I'm Russian! bear@sonic.net (Ray Dillinger) (2001-11-25)
Re: Unicode, was: I'm Russian! loewis@informatik.hu-berlin.de (Martin von Loewis) (2001-11-26)


From:	Ray Dillinger <bear@sonic.net>
Newsgroups:	comp.compilers
Date:	25 Nov 2001 22:37:06 -0500
Organization:	Compilers Central
References:	01-09-087 01-09-106 01-10-021 01-10-061 01-10-105
Keywords:	i18n
Posted-Date:	25 Nov 2001 22:37:06 EST

Thomas Maslen wrote:
>
> >> Of course, scripting languages intended for the hand of the end user
> >> *should* be able to support 16-bit characters (which, today, means
> >> Unicode).
> >
> >Yes. String, character and character array types and constants must
> >support 16 bit representation. At least.
>
> Yup, at least.
>
> Until recently, Unicode and ISO 10646 only defined characters in the range
> U+0000..U+FFFF (the "Basic Multilingual Plane", i.e. the first 16 bits).
>
> However, Unicode 3.1 now defines characters in three more 16-bit planes:
> U+10000..U+1FFFF, U+20000..U+2FFFF and U+E0000..U+EFFFF. For details, see
> the "New Character Allocations" section of
>
> http://www.unicode.org/unicode/reports/tr27/
>
> All is not lost, because the 16-bit representation of Unicode was designed
> with this in mind, and it can represent U+0000..U+10FFFF (i.e. a little over
> 20 bits) using "surrogate pairs":

I'm actually a trifle irritated at the Unicode Standard. It has
become far more complex than any application I have for characters
requires it to be. I don't mind it being bigger than ASCII; in fact, I
applauded when I heard that there was going to be a more inclusive
character set standard. However, all the things that make it more
*complicated* than ASCII are things that militate against its ever
being used as a source code representation.

First, there are multiple characters that look the same. We are all
familiar with the problems posed by lower-case "L" and the digit "1",
and by the upper-case "O" and the digit "0". Unicode has multiplied
these problems by hundreds, making it possible to create pages of code
that look correct but will not compile in any reasonable system.
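
To make the look-alike problem concrete, here is a minimal Python
sketch. The choice of Latin "a" versus Cyrillic "а" is my own
illustration, and the helper name describe is hypothetical; the only
library used is the standard unicodedata module.

```python
import unicodedata

latin = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A, drawn like "a" in most fonts


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The two characters usually render identically but compare unequal.
same_glyph_different_char = latin != cyrillic
names = (describe(latin), describe(cyrillic))

# Compatibility normalization does not merge them either: they are
# distinct letters that belong to distinct scripts.
still_distinct = (
    unicodedata.normalize("NFKC", latin) != unicodedata.normalize("NFKC", cyrillic)
)
```

An identifier spelled with the Cyrillic letter is therefore a different
identifier from the one spelled with the Latin letter, even though the
two lines of source may look the same on screen.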

Second, the characters have "directionality". Text can be displayed
in an order that differs from the order in which the characters are
stored, which interferes with the programmer's understanding of the
sequence of characters in a page of source code, causing yet more
debugging problems.
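
As a rough illustration of directionality, the sketch below lists the
bidirectional class that the Unicode character database assigns to each
character of a short line. The sample string and the helper name
bidi_classes are my own assumptions; unicodedata.bidirectional is the
standard-library call.

```python
import unicodedata


def bidi_classes(text):
    """Pair each character's code point with its bidirectional class."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# A Latin identifier assigned a name written with Hebrew alef and bet.
# In memory the characters sit in this order; a bidi-aware display may
# draw the Hebrew run right to left.
sample = "x = \u05d0\u05d1"
classes = bidi_classes(sample)
```

Here "x" is class L (left to right), the Hebrew letters are class R
(right to left), and "=" is class ON (other neutral), so its placement
on screen depends on its neighbours.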

Third, "Endian" issues can happen as Unicode documents migrate across
platforms; the efforts of the committee to provide a "graceful"
solution instead require particular and special handling in all
migrations, and code that can recognize either endianness. When
Endian conversions take place, the byte order of the files changes
while no semantic change has taken place, which either confuses every
"diff" or version-control system or requires special code in it.
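
A small Python sketch of the byte-order problem, using only the
standard codecs; the sample text is arbitrary. It shows the same four
characters producing two different byte strings, and the byte order
mark that a reader has to look for to tell them apart.

```python
import codecs

text = "Bear"

# The same characters, serialized in the two possible byte orders.
le = text.encode("utf-16-le")  # b'B\x00e\x00a\x00r\x00'
be = text.encode("utf-16-be")  # b'\x00B\x00e\x00a\x00r'

# A byte-wise diff sees two different files.
bytes_differ = le != be

# With a byte order mark in front, the generic "utf-16" decoder can
# pick the right order by itself.
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")

# Guessing the wrong order raises no error here; it silently yields
# different characters.
wrong = le.decode("utf-16-be")
```

Without the mark, the reader has to be told the byte order by some
other means, which is exactly the special handling complained about
above.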

Fourth, the system that was supposed to finally save us from having
multiple different character lengths mixed together as we mixed
alphabets has gone utterly mad; now there are 8, 16, 20, and 32-bit
representations for characters within Unicode.
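
For a sense of how the lengths mix, the sketch below measures one
character in the three standard encoding forms and computes the
surrogate pair mentioned in the quoted text. The function names
encoded_sizes and surrogate_pair are hypothetical; the arithmetic is
the usual UTF-16 rule of subtracting 0x10000 and splitting the
remaining 20 bits into two 10-bit halves.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character takes in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


def surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogate pairs")
    v = cp - 0x10000                      # 20 significant bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


ascii_sizes = encoded_sizes("A")              # U+0041
cyrillic_sizes = encoded_sizes("\u044f")      # U+044F CYRILLIC SMALL LETTER YA
plane2_sizes = encoded_sizes("\U00020000")    # first code point of plane 2
pair = surrogate_pair(0x20000)
```

So "A" takes 1, 2, or 4 bytes, the Cyrillic letter takes 2, 2, or 4,
and the plane 2 character takes 4 bytes in every form, as two 16-bit
units in UTF-16.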

Fifth, there are "holes" in the sequence of Unicode character codes,
and applications have to be aware of them. This makes iterating over
the code points a major pain in the butt.
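
A sketch of what iterating around the holes can look like in Python.
The names is_surrogate and assigned_code_points are hypothetical, and
which code points count as unassigned depends on the version of the
Unicode database shipped with the interpreter.

```python
import unicodedata


def is_surrogate(cp):
    """True for the range reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF


def assigned_code_points(start, stop):
    """Yield code points in [start, stop), skipping the holes."""
    for cp in range(start, stop):
        if is_surrogate(cp):
            continue  # reserved for UTF-16, never a character by itself
        if unicodedata.category(chr(cp)) == "Cn":
            continue  # unassigned or noncharacter in this database version
        yield cp


# U+FFFD REPLACEMENT CHARACTER is assigned; U+FFFE and U+FFFF are
# permanent noncharacters and are skipped.
tail_of_bmp = list(assigned_code_points(0xFFF0, 0x10000))

# Nothing is yielded from the surrogate block.
around_surrogates = list(assigned_code_points(0xD7F0, 0xE010))
```

A plain range(0, 0x110000) loop, by contrast, walks straight through
the surrogates and every unassigned value.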

Sixth, I don't want to add to every system I produce all the code and
cruft that supporting the complexities and subtleties of Unicode
would require. It's just not worth it.

If a simple, 32-bit "extended ASCII" code comes along, I'll be the first
to support it. But Unicode as we now see it is a crock.

Bear
