Related articles
Programming language specification languages nmm1@cus.cam.ac.uk (2001-09-20)
Re: Programming language specification languages rkrayhawk@aol.com (2001-09-25)
Re: Programming language specification languages joachim_d@gmx.de (Joachim Durchholz) (2001-10-06)
I'm russian! was character sets crystal-pin@mail.ru (2001-10-13)
Re: I'm russian! was character sets spinoza1111@yahoo.com (2001-10-14)
Re: I'm russian! was character sets tmaslen@wedgetail.com (Thomas Maslen) (2001-10-20)
Unicode, was: I'm Russian! bear@sonic.net (Ray Dillinger) (2001-11-25)
Re: Unicode, was: I'm Russian! loewis@informatik.hu-berlin.de (Martin von Loewis) (2001-11-26)
From: Thomas Maslen <tmaslen@wedgetail.com>
Newsgroups: comp.compilers
Date: 20 Oct 2001 21:59:04 -0400
Organization: Distributed Systems Technology CRC
References: 01-09-087 01-09-106 01-10-021 01-10-061
Keywords: i18n
Posted-Date: 20 Oct 2001 21:59:04 EDT
>> Of course, scripting languages intended for the hand of the end user
>> *should* be able to support 16-bit characters (which, today, means
>> Unicode).
>
>Yes. String, character, and character array types and constants must
>support 16-bit representation. At least.
Yup, at least.
Until recently, Unicode and ISO 10646 only defined characters in the range
U+0000..U+FFFF (the "Basic Multilingual Plane", i.e. the first 16 bits).
However, Unicode 3.1 now defines characters in three more 16-bit planes:
U+10000..U+1FFFF, U+20000..U+2FFFF and U+E0000..U+EFFFF. For details, see
the "New Character Allocations" section of
http://www.unicode.org/unicode/reports/tr27/
All is not lost, because the 16-bit representation of Unicode was designed
with this in mind, and it can represent U+0000..U+10FFFF (i.e. a little over
20 bits) using "surrogate pairs":
Two 10-bit slices are reserved within the 16-bit range, and a "high surrogate"
(U+D800..U+DBFF) immediately followed by a "low surrogate" (U+DC00..U+DFFF)
produces 20 bits of information to specify a single character in the range
U+10000..U+10FFFF.
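For what it's worth, here's a rough sketch of that arithmetic in Java (whose
char type is a 16-bit UTF-16 code unit); the class and method names are just
made up for illustration:

    public class SurrogateDemo {

        // Encode a code point in U+10000..U+10FFFF as a high/low surrogate pair.
        static char[] toSurrogatePair(int codePoint) {
            int v = codePoint - 0x10000;               // the 20 bits of information
            char high = (char) (0xD800 + (v >> 10));   // top 10 bits
            char low  = (char) (0xDC00 + (v & 0x3FF)); // bottom 10 bits
            return new char[] { high, low };
        }

        // Decode a high/low surrogate pair back into a single code point.
        static int fromSurrogatePair(char high, char low) {
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
        }

        public static void main(String[] args) {
            int cp = 0x10400;   // DESERET CAPITAL LETTER LONG I, new in Unicode 3.1
            char[] pair = toSurrogatePair(cp);
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " -> " + Integer.toHexString(pair[0]).toUpperCase()
                    + " " + Integer.toHexString(pair[1]).toUpperCase()
                    + " -> U+" + Integer.toHexString(
                            fromSurrogatePair(pair[0], pair[1])).toUpperCase());
        }
    }

Running it prints "U+10400 -> D801 DC00 -> U+10400".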
In theory all code that uses the 16-bit Unicode representation (including
both Java and the Windows NT family) does the right thing with these
surrogate pairs, and life is generally wonderful. In practice Unicode 3.1
is probably the first time that this stuff has got a real workout, and even
now this code will only be tickled if you happen to use the new characters
(which are fairly uncommon), so chances are it'll take a while to shake out
the bugs.
If you're designing something from scratch, and you can live with the memory
consumption, then it might be more straightforward to use a pure 32-bit
Unicode representation internally and probably use UTF-8 externally. Or, if
you care about memory size and can trade off some performance, then maybe
use UTF-8 internally too.
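In case it's useful, here's a similar sketch (again in Java, again with
made-up names) of the UTF-8 encoding rules for a single code point, which
shows the size trade-off: 1 to 4 bytes per character in UTF-8 versus a fixed
4 bytes in a pure 32-bit representation:

    import java.io.ByteArrayOutputStream;

    public class Utf8Demo {

        // Encode one code point (U+0000..U+10FFFF) as a UTF-8 byte sequence.
        static byte[] toUtf8(int cp) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
                out.write(cp);
            } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
            return out.toByteArray();
        }

        public static void main(String[] args) {
            int[] samples = { 0x41, 0x416, 0x4E2D, 0x10400 };  // Latin, Cyrillic, CJK, Deseret
            for (int i = 0; i < samples.length; i++) {
                System.out.println("U+" + Integer.toHexString(samples[i]).toUpperCase()
                        + " -> " + toUtf8(samples[i]).length
                        + " byte(s) in UTF-8, 4 bytes in UTF-32");
            }
        }
    }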
Thomas Maslen
tmaslen@wedgetail.com