Re: Internal Representation of Strings

"cr88192" <>
Sat, 21 Feb 2009 16:06:20 +1000

          From comp.compilers


References: 09-02-051 09-02-077 09-02-092
Keywords: storage, i18n
Posted-Date: 21 Feb 2009 09:33:31 EST

"Tony" <> wrote in message news:09-02-092@comp.compilers...
> "cr88192" <> wrote in message
>> "Tony" <> wrote
>>> What are some good ways/concepts of internal string representation?
>> depends on how they will be used.
>> I opt to minimize interop issues with existing code.
>>> Are/should string literals, fixed-length strings and dynamic-length
>>> strings handled differently? My first tendency is to avoid like the
>>> plague NUL-terminated strings (aka, C strings) and to opt for some
>>> kind of array with a length at the beginning followed by the
>>> characters that could be encapsulated at the library level with
>>> appropriate functions.
>> IMO, this should not be necessary (avoiding ASCIIZ style strings).
>> the reason is that, if strings are actually being used for text, NUL
>> should almost never happen in practice.
> I forgot to mention that ASCII was a simplifying assumption that I
> will be taking advantage of. Probably 7-bit ASCII. Yes, I am avoiding
> null-termination (at least as the primary mechanism anyway).

in practice, it is not a big issue IMO.
in any case, it is better than MS-DOS style '$' termination...

>> In general, I keep UTF-8 and UTF-16 strings separate, but it is
>> possible that some library functions could allow both equally, for
>> example, by either using automatic recognition (problematic), or by
>> having UTF-16 strings generally start with a BOM (inconvenient).
> Is UTF-8 the ASCII subset of UTF-16?
> [No, it's a different encoding of 32 bit Unicode characters. -John]

pretty much...

UTF-8 maps characters in the ASCII range to the same single bytes as plain
ASCII. as a result, every ASCII string is also a valid UTF-8 string (ASCII is
a proper subset of UTF-8, which can be rather confusing).

larger characters are encoded with a variable-length coding scheme.

U+0080..U+07FF   (128..2047):       110xxxxx 10xxxxxx
U+0800..U+FFFF   (2048..65535):     1110xxxx 10xxxxxx 10xxxxxx
U+10000..U+1FFFFF (65536..2097151): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

actually, this scheme can be taken a lot further, but the entire unicode
space is currently limited to:
U+0000 to U+10FFFF (or 0x00000000 to 0x0010FFFF)
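the scheme above is easy to code up directly. a minimal encoder sketch in C
(the function name utf8_encode is mine, not from any library; it returns the
number of bytes written, or 0 for values outside the Unicode range):

```c
#include <stdint.h>

/* encode one codepoint as UTF-8; returns byte count (1..4), 0 if invalid */
static int utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                         /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {                 /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {               /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {             /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;   /* beyond U+10FFFF: not a valid Unicode codepoint */
}
```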

note that with UTF-16, any character above 0xFFFF is encoded as two 16-bit
values, known as a "surrogate pair" (and, since surrogate pairs can only
reach up to 0x10FFFF, that is sadly where the scheme tops out).
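the surrogate-pair split itself is just a little bit-twiddling. a sketch
(encode_surrogates is my own helper name): subtract 0x10000 to get a 20-bit
value, then put the high 10 bits in the lead surrogate and the low 10 bits
in the trail surrogate:

```c
#include <stdint.h>

/* split a supplementary codepoint (U+10000..U+10FFFF) into a UTF-16
   surrogate pair */
static void encode_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    cp -= 0x10000;                        /* now a 20-bit value */
    *hi = (uint16_t)(0xD800 | (cp >> 10));   /* high (lead) surrogate */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low (trail) surrogate */
}
```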

in general, UTF-8 takes less space than UTF-16 (and mixes much better with
code designed for ASCII), but many languages prefer UTF-16, possibly because
it works better when treated as an array.

now, there is another funky issue:
there are several possible ways to map between UTF-16 and UTF-8.

one is the "proper" UTF-8, which maps in terms of codepoints. so, UTF-16
surrogate pairs are properly decoded, and then re-encoded as UTF-8 values.

another form is often called UTF-8, but it is more properly called CESU-8,
where each individual 16-bit value is encoded as if it were a codepoint in
its own right (surrogate pairs are not combined first).

usually, both UTF-8 and CESU-8 will convert to the same UTF-16 sequence
using the same conversion function.

as a result: UTF-8 -> UTF-16 -> UTF-8 may not output exactly (as in bytes)
the same string as the input, for example, because the input string is
CESU-8 and the output is UTF-8, or the output produced is CESU-8 with UTF-8
input, ...

likewise, a "proper" decoder that also accepts CESU-8 may actually have to
handle surrogate pairs inline.
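the difference is easy to see on a concrete character. a sketch of CESU-8
encoding for one supplementary character (function names are mine): split
into surrogates, then encode each surrogate as if it were an ordinary 3-byte
character, giving 6 bytes where proper UTF-8 would give 4:

```c
#include <stdint.h>

/* encode one 16-bit value as a 3-byte UTF-8-style sequence */
static void enc3(uint16_t v, unsigned char *out)
{
    out[0] = (unsigned char)(0xE0 | (v >> 12));
    out[1] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (v & 0x3F));
}

/* CESU-8: split the codepoint into UTF-16 surrogates, then encode each
   surrogate independently (6 bytes total, vs 4 for proper UTF-8) */
static int cesu8_encode_supplementary(uint32_t cp, unsigned char *out)
{
    uint32_t v = cp - 0x10000;
    enc3((uint16_t)(0xD800 | (v >> 10)), out);      /* high surrogate */
    enc3((uint16_t)(0xDC00 | (v & 0x3FF)), out + 3); /* low surrogate */
    return 6;
}
```

for example, U+10437 comes out as ED A0 81 ED B0 B7 in CESU-8, but as
F0 90 90 B7 in proper UTF-8: same UTF-16 sequence, different bytes.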

and, likewise, some variants exist which use "overlong" encodings for
certain values (normally NUL); another possible use is to escape certain
characters in a communication channel, ... (some programs have also
exploited overlong encodings to bypass security measures, by obscuring
characters or text from ASCII-based filtering code).

but, most commonly, it allows embedding a NUL in an ASCIIZ string as the
byte sequence '0xC0 0x80'...
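this works because a lenient 2-byte decoder happily turns 0xC0 0x80 back
into codepoint 0, while ASCIIZ code scanning the bytes never sees a zero
byte. a decoder sketch (decode2 is my own name) that flags the overlong
case, which a strict decoder must reject:

```c
#include <stdint.h>

/* decode a 2-byte sequence 110xxxxx 10xxxxxx; returns 1 if the encoding
   is the shortest possible one, 0 if it is overlong (e.g. 0xC0 0x80) */
static int decode2(unsigned char b0, unsigned char b1, uint32_t *cp)
{
    *cp = ((uint32_t)(b0 & 0x1F) << 6) | (b1 & 0x3F);
    return *cp >= 0x80;   /* values below 0x80 should have been 1 byte */
}
```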

>> Something used in my case is actually, that, every on-heap object has
>> an associated type (this is transparent, as the heap-pointer points
>> after the header), and so I can use different type names for different
>> string types.
>> Note that it is problematic, as conversions between UTF-8 and UTF-16
>> may not be in all cases lossless (one may convert from one to the
>> other and back again and in some cases get a different string, ...).
>> As for length-prefixed strings, there is this often critical and
>> awkward limitation: the strings will often have an arbitrary max
>> length limit.
> I don't see a problem with 32-bit length "limitation". Even 16-bit would
> be fine for most cases that weren't trying to be the utmost in generality.

I will disagree...

it is not too uncommon IME to read in a whole text-file as a single big
string, and this can easily exceed 64k chars...

whereas, a 32 bit length uses 4 bytes for every string, and even then it is
not inconceivable that in maybe 10 or 15 years or so, we may periodically
see single big strings longer than 4G chars...

meanwhile, if one were to do UTF-8 anyways, why not just use the same coding
scheme for a length prefix (though maybe extended a little further).
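a sketch of what such a variable-length prefix could look like (this
particular layout is LEB128-style, a common choice, rather than literal
UTF-8; write_len is my own name): 7 bits of length per byte, with the high
bit meaning "more bytes follow", so small strings pay one byte and big ones
are still unbounded:

```c
#include <stdint.h>

/* write a variable-length length prefix; returns prefix size in bytes */
static int write_len(uint32_t len, unsigned char *out)
{
    int n = 0;
    do {
        unsigned char b = len & 0x7F;
        len >>= 7;
        out[n++] = (unsigned char)(b | (len ? 0x80 : 0)); /* 0x80 = "more" */
    } while (len);
    return n;
}
```

lengths under 128 take one byte, under 16384 two bytes, and so on, so the
common short-string case costs no more than a Pascal-style byte prefix.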

>> So, for many cases where frameworks have used length prefixes, this
>> ended up being an arbitrary limit at 127, 255, 32767, or 65535
>> characters... this requires either using a size field larger than
>> the maximum sane size of a string (for example, 32 or 64 bits...),
>> or the use of a variable-length coding (more what I would opt for,
>> where a small string only needs a single byte for the length, but a
>> very long string can still be handled).
> Every time I revisit that (var length field or fixed length field) I
> rethink it. So it's probably going to depend on my mood the day I actually
> get to revamping my string lib. I have the web page open that the mod
> suggested but haven't examined it yet:


>>> But just a length seems like not enough information: the capacity
>>> (array length) also would be nice to have around. All thoughts, old
>>> and novel, welcome.
>> This goes into the awkward area where many frameworks opt for
>> representing their strings as objects. the cost to this though is
>> that it may add a good amount of overhead to working with strings, but
>> also a few partial advantages (for example, it is much easier to
>> create substrings, get multiple string literals to share the same
>> character array, ...).
> With a capacity and length field, it makes the underlying representation
> look like an array. Currently my library-level string has an array class
> underpinning.

well, you can think of it this way if you want...

>> But, on this front, another case is whether or not the implementation
>> should make use of "ropes" for the strings, where the idea is that any
>> non-trivial string becomes a rope, ...
>> as another thing:
>> it makes sense to make a contrast between "strings" and "character
>> buffers".
> Yes, I alluded to that, noting that the underlying representation is not
> really a "string" but rather that the wrapper around it is. But I use the
> term in both contexts and hopefully it is clear by the context what is
> meant.

well, I consider them different enough to where normally I would consider
allowing a different representation...

it may also be worth noting that if one does not merge strings, a significant
part of their memory usage can go into having duplicate strings (although,
as a plus point, giving each string its own buffer does make it easier to
free them absent having to rely on the GC to "get around to it"...).

>> So, in my thinking, strings, as such, are always assumed to be
>> immutable. so, if one wants to modify a string, they copy it to a
>> character buffer, and then later retrieve the new string from this
>> buffer.
> That just sounds like copy-on-write and is just a feature of a string
> rather than a basis.

"copy-on-write" assumes that a string is mutable.
I tend to assume that strings are immutable (the identity of a string is
then based on the characters it contains).

as such, retrieving a new string from a buffer will create a conceptually
different string, not a mutated version of the original. usually as well,
buffers are just a way to get operations implemented, and are not really the
conceptual model of string processing.

>> So, for example, the string is UTF-8, but could be copied into a
>> UTF-16 or UTF-32 buffer, worked on, and then copied back into a UTF-8
>> or UTF-16 string when done.
>> One advantage of this approach is that it allows effectively merging
>> duplicate copies of the same strings (one can keep a big hash tables
>> of seen strings, and so if a string is seen for one already in the
>> hash, it is reused, and if not, it is added).
> I think I see a "fixed string" as the special case rather than the
> fundamental thing.

I tend to think of strings more as atomic values.
one then manipulates them with operators (splitting them up into tokens,
adding them back together, ...).
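with immutable strings, the "big hash table of seen strings" mentioned above
is just interning. a minimal sketch (all names mine; a fixed-size table with
linear probing, no resizing or freeing, just to show the idea): duplicate
strings collapse to one canonical pointer, so identity comparison becomes a
pointer comparison.

```c
#include <stdlib.h>
#include <string.h>

#define TBL 1024
static char *tbl[TBL];   /* assumes the table never fills up */

static const char *intern(const char *s)
{
    unsigned long h = 5381;
    for (const char *p = s; *p; p++)
        h = h * 33 + (unsigned char)*p;          /* djb2 hash */
    for (unsigned long i = h % TBL; ; i = (i + 1) % TBL) {
        if (!tbl[i]) {                           /* unseen: store a copy */
            size_t n = strlen(s) + 1;
            tbl[i] = malloc(n);
            memcpy(tbl[i], s, n);
            return tbl[i];
        }
        if (strcmp(tbl[i], s) == 0)              /* seen: reuse it */
            return tbl[i];
    }
}
```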

memory usage is also a big concern, and fixed strings can save lots of
memory in the common case (but, alas, they can also waste memory with large
and unusual strings, which sit around clogging up memory until the GC
figures out that they can be freed...).

as such, larger strings are often treated more as manually-managed buffers,
which is not as good if maintaining a purely consistent model is one's
goal...
