Re: Internal Representation of Strings

"Tony" <>
Wed, 18 Feb 2009 08:02:03 -0600

          From comp.compilers


From: "Tony" <>
Newsgroups: comp.compilers
Date: Wed, 18 Feb 2009 08:02:03 -0600
Organization: at&t
References: 09-02-051 09-02-077
Keywords: storage
Posted-Date: 18 Feb 2009 17:10:51 EST

"cr88192" <> wrote in message
> "Tony" <> wrote
>> What are some good ways/concepts of internal string representation?
> depends on how they will be used.
> I opt to minimize interop issues with existing code.
>> Are/should string literals, fixed-length strings and dynamic-length
>> strings handled differently? My first tendency is to avoid like the
>> plague NUL-terminated strings (aka, C strings) and to opt for some
>> kind of array with a length at the beginning followed by the
>> characters that could be encapsulated at the library level with
>> appropriate functions.
> IMO, this should not be necessary (avoiding ASCIIZ style strings).
> the reason is that, if strings are actually being used for text, NUL
> should almost never happen in practice.

I forgot to mention that ASCII was a simplifying assumption that I
will be taking advantage of, probably 7-bit ASCII. Yes, I am avoiding
null-termination (at least as the primary mechanism).
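As a rough illustration of the length-plus-array layout raised in the original question, a minimal C sketch might look like the following (the type and helper names are made up here, not from any actual library):

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Hypothetical length-prefixed string: a 32-bit length followed by the
   characters. Embedded NUL bytes are allowed because the length, not a
   terminator, marks the end of the string. */
typedef struct {
    uint32_t len;      /* number of characters actually stored */
    char     data[];   /* flexible array member holding the characters */
} LStr;

/* Build a length-prefixed string from a buffer (illustrative helper). */
LStr *lstr_new(const char *src, uint32_t len) {
    LStr *s = malloc(sizeof(LStr) + len);
    if (!s) return NULL;
    s->len = len;
    memcpy(s->data, src, len);
    return s;
}
```

Library functions then take and return `LStr *`, encapsulating the layout as the original question suggests.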

> However, if one is using UTF-8, there is a trick used by the JVM and
> friends, which maps the NUL character to a different
> representation. I ended up adding a similar hack in my case for
> UTF-16, where the NUL is hidden in another codepoint (I used 10FFFE
> I think, I forget, though in UTF-16 there is no good or obvious
> option...). the main reason for this hack though is mostly to allow
> converting strings with embedded NULs into UTF-16 without truncating
> the string.
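The NUL-hiding trick described above is, for UTF-8, the JVM's "modified UTF-8" convention: U+0000 is written as the overlong two-byte pair 0xC0 0x80, so the encoded buffer never contains a 0x00 byte. A hedged C sketch (the function name and interface are illustrative):

```c
#include <stddef.h>

/* Re-encode a byte string so it contains no NUL bytes, following the
   modified-UTF-8 convention: 0x00 becomes the overlong pair 0xC0 0x80.
   Returns the number of bytes written; the caller must size `out` for
   the worst case of 2*n bytes. */
size_t encode_nul_free(const unsigned char *in, size_t n, unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == 0x00) {
            out[o++] = 0xC0;   /* overlong two-byte form of U+0000 */
            out[o++] = 0x80;
        } else {
            out[o++] = in[i];  /* all other bytes pass through unchanged */
        }
    }
    return o;
}
```

A strict UTF-8 decoder would reject the overlong pair, which is why this is a private convention between an encoder and its matching decoder rather than standard UTF-8.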

> As for UTF-8 vs UTF-16, I personally by far opt for UTF-8, mostly
> because it saves a good deal of space in most cases (and also,
> "proper" handling of UTF-16 isn't really any easier than UTF-8,
> where I don't consider treating UTF-16 text as a big array of 16-bit
> values as proper, especially if there is a risk of characters
> outside the 16-bit range, ...).

> In general, I keep UTF-8 and UTF-16 strings separate, but it is
> possible that some library functions could allow both equally, for
> example, by either using automatic recognition (problematic), or by
> having UTF-16 strings generally start with a BOM (inconvenient).

Is UTF-8 the ASCII subset of UTF-16?
[No, it's a different encoding of 32 bit Unicode characters. -John]

> Something used in my case: every on-heap object has
> an associated type (this is transparent, as the heap-pointer points
> after the header), and so I can use different type names for different
> string types.
> Note that it is problematic, as conversions between UTF-8 and UTF-16
> may not be in all cases lossless (one may convert from one to the
> other and back again and in some cases get a different string, ...).
> As for length-prefixed strings, there is this often critical and
> awkward limitation: the strings will often have an arbitrary max
> length limit.

I don't see a problem with 32-bit length "limitation". Even 16-bit would be
fine for most cases that weren't trying to be the utmost in generality.

> So, for many cases where frameworks have used length prefixes, this
> ended up being an arbitrary limit at 127, 255, 32767, or 65535
> characters... this requires either using a size field larger than
> the maximum sane size of a string (for example, 32 or 64 bits...),
> or the use of a variable-length coding (more what I would opt for,
> where a small string only needs a single byte for the length, but a
> very long string can still be handled).
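A variable-length length prefix of the kind suggested above is commonly done LEB128-style: 7 bits of the length per byte, with the high bit as a continuation flag, so strings up to 127 bytes need only one prefix byte while arbitrarily long strings remain representable. A C sketch with made-up function names:

```c
#include <stdint.h>
#include <stddef.h>

/* Write `len` as a LEB128-style variable-length prefix into `out`
   (which needs at most 10 bytes for a 64-bit length).
   Returns the number of prefix bytes written. */
size_t write_varlen(uint64_t len, unsigned char *out) {
    size_t n = 0;
    do {
        unsigned char b = len & 0x7F;
        len >>= 7;
        if (len) b |= 0x80;   /* high bit set: more bytes follow */
        out[n++] = b;
    } while (len);
    return n;
}

/* Decode the prefix back; *consumed receives how many bytes were read. */
uint64_t read_varlen(const unsigned char *in, size_t *consumed) {
    uint64_t len = 0;
    int shift = 0;
    size_t i = 0;
    unsigned char b;
    do {
        b = in[i++];
        len |= (uint64_t)(b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);
    *consumed = i;
    return len;
}
```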

Every time I revisit that (variable-length field vs. fixed-length
field), I rethink it, so it's probably going to depend on my mood the
day I actually get to revamping my string lib. I have the web page
open that the mod suggested but haven't examined it yet.

>> But just a length seems like not enough information: the capacity
>> (array length) also would be nice to have around. All thoughts, old
>> and novel, welcome.
> This goes into the awkward area where many frameworks opt for
> representing their strings as objects. the cost to this though is
> that it may add a good amount of overhead to working with strings, but
> also a few partial advantages (for example, it is much easier to
> create substrings, get multiple string literals to share the same
> character array, ...).

With capacity and length fields, the underlying representation looks
like an array. Currently my library-level string wraps an array class.
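A header carrying both fields might look like the following C sketch; the names are hypothetical and not taken from the poster's library:

```c
#include <stdint.h>

/* String header with both length and capacity, making the
   representation an array the library can grow in place. */
typedef struct {
    uint32_t cap;   /* bytes allocated for data */
    uint32_t len;   /* bytes currently in use (len <= cap) */
    char    *data;  /* character storage, not necessarily NUL-terminated */
} StrBuf;

/* Append without reallocating when spare capacity suffices.
   Returns 0 on success, -1 when the buffer would overflow
   (the caller must grow it first). */
int strbuf_append(StrBuf *s, const char *src, uint32_t n) {
    if (s->len + n > s->cap) return -1;
    for (uint32_t i = 0; i < n; i++) s->data[s->len + i] = src[i];
    s->len += n;
    return 0;
}
```

Keeping `cap` alongside `len` is exactly what lets appends avoid a reallocation until the slack is exhausted.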

> But, on this front, another case is whether or not the implementation
> should make use of "ropes" for the strings, where the idea is that any
> non-trivial string becomes a rope, ...
> as another thing:
> it makes sense to make a contrast between "strings" and "character
> buffers".

Yes, I alluded to that, noting that the underlying representation is
not really a "string" but rather that the wrapper around it is. But I
use the term in both contexts, and hopefully it is clear from context
which is meant.
> So, in my thinking, strings, as such, are always assumed to be
> immutable. so, if one wants to modify a string, they copy it to a
> character buffer, and then later retrieve the new string from this
> buffer.

That just sounds like copy-on-write: a feature of a string rather than
a basis for one.

> So, for example, the string is UTF-8, but could be copied into a
> UTF-16 or UTF-32 buffer, worked on, and then copied back into a UTF-8
> or UTF-16 string when done.
> One advantage of this approach is that it allows effectively merging
> duplicate copies of the same strings (one can keep a big hash table
> of seen strings, and so if a string is seen for one already in the
> hash, it is reused, and if not, it is added).
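The duplicate-merging (interning) idea in the quoted passage can be sketched with a small open-addressing table. This toy version assumes NUL-terminated input for brevity and never resizes or evicts; all names are illustrative:

```c
#include <stdlib.h>
#include <string.h>

/* Fixed-size open-addressing table of seen strings. Interning the same
   text twice returns the same pointer, so immutable duplicate strings
   are merged and can be compared by pointer equality. */
#define INTERN_SLOTS 1024
static const char *intern_tab[INTERN_SLOTS];

const char *intern(const char *s) {
    unsigned long h = 5381;                     /* djb2 hash */
    for (const char *p = s; *p; p++) h = h * 33 + (unsigned char)*p;
    for (unsigned long i = h % INTERN_SLOTS; ; i = (i + 1) % INTERN_SLOTS) {
        if (!intern_tab[i]) {                   /* empty slot: add a copy */
            char *copy = malloc(strlen(s) + 1);
            strcpy(copy, s);
            return intern_tab[i] = copy;
        }
        if (strcmp(intern_tab[i], s) == 0)      /* already seen: reuse */
            return intern_tab[i];
    }
}
```

Note that interning only works safely because the stored strings are immutable, which is the premise of the quoted design.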

I think I see a "fixed string" as the special case rather than the
fundamental thing.

