Re: I'm russian! was character sets


Related articles
Programming language specification languages nmm1@cus.cam.ac.uk (2001-09-20)
Re: Programming language specification languages rkrayhawk@aol.com (2001-09-25)
Re: Programming language specification languages joachim_d@gmx.de (Joachim Durchholz) (2001-10-06)
I'm russian! was character sets crystal-pin@mail.ru (2001-10-13)
Re: I'm russian! was character sets spinoza1111@yahoo.com (2001-10-14)
Re: I'm russian! was character sets tmaslen@wedgetail.com (Thomas Maslen) (2001-10-20)
Unicode, was: I'm Russian! bear@sonic.net (Ray Dillinger) (2001-11-25)
Re: Unicode, was: I'm Russian! loewis@informatik.hu-berlin.de (Martin von Loewis) (2001-11-26)

From: Thomas Maslen <tmaslen@wedgetail.com>
Newsgroups: comp.compilers
Date: 20 Oct 2001 21:59:04 -0400
Organization: Distributed Systems Technology CRC
References: 01-09-087 01-09-106 01-10-021 01-10-061
Keywords: i18n
Posted-Date: 20 Oct 2001 21:59:04 EDT

>> Of course, scripting languages intended for the hand of the end user
>> *should* be able to support 16-bit characters (which, today, means
>> Unicode).
>
>Yes. String, character and character array types and constants must
>support 16-bit representation. At least.


Yup, at least.


Until recently, Unicode and ISO 10646 only defined characters in the range
U+0000..U+FFFF (the "Basic Multilingual Plane", i.e. the first 16 bits).


However, Unicode 3.1 now defines characters in three more 16-bit planes:
U+10000..U+1FFFF, U+20000..U+2FFFF and U+E0000..U+EFFFF. For details, see
the "New Character Allocations" section of


http://www.unicode.org/unicode/reports/tr27/


All is not lost, because the 16-bit representation of Unicode was designed
with this in mind, and it can represent U+0000..U+10FFFF (i.e. a little over
20 bits) using "surrogate pairs":


Two 10-bit slices are reserved within the 16-bit range, and a "high surrogate"
(U+D800..U+DBFF) immediately followed by a "low surrogate" (U+DC00..U+DFFF)
produces 20 bits of information to specify a single character in the range
U+10000..U+10FFFF.
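
For concreteness, here's a rough sketch of that arithmetic in Java (the
class and method names are made up for this example):

    public class SurrogateDemo {

        // Encode a code point in U+10000..U+10FFFF as a high/low surrogate pair.
        static char[] toSurrogatePair(int codePoint) {
            int v = codePoint - 0x10000;                // 20 bits of information
            char high = (char) (0xD800 + (v >> 10));    // top 10 bits
            char low  = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
            return new char[] { high, low };
        }

        // Decode a high/low surrogate pair back into a single code point.
        static int fromSurrogatePair(char high, char low) {
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
        }

        public static void main(String[] args) {
            int cp = 0x10400;                           // a plane-1 character (Deseret)
            char[] pair = toSurrogatePair(cp);
            System.out.println(Integer.toHexString(cp) + " -> "
                    + Integer.toHexString(pair[0]) + " "
                    + Integer.toHexString(pair[1]) + " -> "
                    + Integer.toHexString(fromSurrogatePair(pair[0], pair[1])));
            // prints: 10400 -> d801 dc00 -> 10400
        }
    }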


In theory all code that uses the 16-bit Unicode representation (including
both Java and the Windows NT family) does the right thing with these
surrogate pairs, and life is generally wonderful. In practice Unicode 3.1
is probably the first time that this stuff has got a real workout, and even
now this code will only be tickled if you happen to use the new characters
(which are fairly uncommon), so chances are it'll take a while to shake out
the bugs.
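
One concrete example of the kind of code that only breaks once these
characters show up (sketched in Java, with invented names): anything that
equates "number of 16-bit chars" with "number of characters".

    public class CountDemo {

        // Naive length: one 16-bit char per character.  Fine for the BMP,
        // wrong as soon as surrogate pairs appear.
        static int naiveLength(String s) {
            return s.length();
        }

        // Count code points instead, treating a high surrogate followed by
        // a low surrogate as a single character.
        static int codePointLength(String s) {
            int n = 0;
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()
                        && s.charAt(i + 1) >= 0xDC00 && s.charAt(i + 1) <= 0xDFFF) {
                    i++;                    // skip the low surrogate
                }
                n++;
            }
            return n;
        }

        public static void main(String[] args) {
            // "abc" followed by U+10400 written as an explicit surrogate pair.
            String s = "abc\uD801\uDC00";
            System.out.println(naiveLength(s) + " chars, "
                    + codePointLength(s) + " code points");
            // prints: 5 chars, 4 code points
        }
    }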


If you're designing something from scratch, and you can live with the memory
consumption, then it might be more straightforward to use a pure 32-bit
Unicode representation internally and probably use UTF-8 externally. Or, if
you care about memory size and can trade off some performance, then maybe
use UTF-8 internally too.
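
A rough sketch of the boundary in that last scheme -- code points held as
plain ints internally, converted to UTF-8 bytes only on output -- again in
Java with made-up names:

    import java.io.ByteArrayOutputStream;

    public class Utf8Demo {

        // Encode one code point (U+0000..U+10FFFF) as 1..4 UTF-8 bytes.
        static void encode(int cp, ByteArrayOutputStream out) {
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }

        public static void main(String[] args) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int[] text = { 'a', 0x044F, 0x10400 };   // 'a', Cyrillic ya, a plane-1 character
            for (int i = 0; i < text.length; i++) {
                encode(text[i], out);
            }
            byte[] bytes = out.toByteArray();
            StringBuffer hex = new StringBuffer();
            for (int i = 0; i < bytes.length; i++) {
                hex.append(Integer.toHexString(bytes[i] & 0xFF)).append(' ');
            }
            System.out.println(hex);    // prints: 61 d1 8f f0 90 90 80
        }
    }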


Thomas Maslen
tmaslen@wedgetail.com

