Re: Internal Representation of Strings

Hans Aberg <haberg_20080406@math.su.se>
Mon, 23 Feb 2009 20:39:33 +0100

          From comp.compilers

From: Hans Aberg <haberg_20080406@math.su.se>
Newsgroups: comp.compilers
Date: Mon, 23 Feb 2009 20:39:33 +0100
Organization: Aioe.org NNTP Server
References: 09-02-051 09-02-077 09-02-092 09-02-104 09-02-112
Keywords: i18n
Posted-Date: 24 Feb 2009 07:50:36 EST

Hans-Peter Diettrich wrote:
>> in general, UTF-8 takes less space than UTF-16 (and mixes much better with
>> code designed for ASCII), but many languages like UTF-16 more
>> potentially because it works better when being treated as an array.
>
> This IMO is a typical misconception of English-only speakers, which have
> caused a lot of trouble in the evolution of programming languages :-(


I might add: the UTF encodings were not designed with compression in
mind. If space is an issue, use a compression algorithm instead,
because it will be more efficient. And only use UTF-16 for backwards
compatibility (when libraries and other programs you must use require
it); UTF-8 avoids the endianness issue, as one nowadays mostly agrees
on how to order the bits of a byte. (The BOM used with some UTF-16
data to sort out endianness is optional; the Unicode standard does not
require it.) UTF-32 might be good in cases where fixed-length code
units or perhaps speed is needed (like internally in programs); but
this requires endianness to be sorted out between platforms.
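
To make the points above concrete, here is a minimal Python sketch. The
sample text, the variable names, and the choice of zlib as "a
compression algorithm" are my own assumptions for illustration, not
anything prescribed by Unicode. It compares raw and compressed sizes
per encoding, shows that UTF-16 and UTF-32 each have two byte orders
while UTF-8 has one, and shows that only UTF-32 is fixed length.

```python
import zlib

# Hypothetical sample: mixed Latin, Greek and Japanese text, repeated.
text = "Internal representation of strings: naïve café, Ελληνικά, 日本語. " * 50

# Encodings with an explicit byte order, so Python prepends no BOM.
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Raw size in bytes of the same text in each encoding form.
sizes = {name: len(text.encode(name)) for name in encodings}

# Size after general-purpose compression (zlib is an assumed stand-in).
compressed = {name: len(zlib.compress(text.encode(name))) for name in encodings}

# Byte order: one serialization for UTF-8, two each for UTF-16/UTF-32.
ch = "é"  # U+00E9
utf8 = ch.encode("utf-8")          # b'\xc3\xa9'
utf16_le = ch.encode("utf-16-le")  # b'\xe9\x00'
utf16_be = ch.encode("utf-16-be")  # b'\x00\xe9'
utf32_le = ch.encode("utf-32-le")  # b'\xe9\x00\x00\x00'
utf32_be = ch.encode("utf-32-be")  # b'\x00\x00\x00\xe9'

# Fixed length: in UTF-32 every code point takes 4 bytes; in UTF-8 and
# UTF-16 the length per code point varies.
bmp_char, astral_char = "A", "𝄞"  # U+0041 and U+1D11E
bytes_per_char = {
    name: (len(bmp_char.encode(name)), len(astral_char.encode(name)))
    for name in encodings
}

if __name__ == "__main__":
    for name in encodings:
        print(f"{name:10} raw={sizes[name]:6} zlib={compressed[name]:6} "
              f"bytes for (A, 𝄞)={bytes_per_char[name]}")
```

On text like this I would expect the compressed sizes to sit far below
any of the raw sizes, which is the sense in which picking an encoding
is a poor substitute for compression; the exact figures depend on the
input and on the compressor.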


      Hans Aberg


