Re: Internal Representation of Strings

Marco van de Voort <marcov@stack.nl>
Sat, 14 Feb 2009 17:57:28 +0000 (UTC)

          From comp.compilers

From: Marco van de Voort <marcov@stack.nl>
Newsgroups: comp.compilers
Date: Sat, 14 Feb 2009 17:57:28 +0000 (UTC)
Organization: Stack Usenet News Service
References: 09-02-051
Keywords: storage
Posted-Date: 14 Feb 2009 16:51:28 EST

On 2009-02-14, Tony <tony@my.net> wrote:
> What are some good ways/concepts of internal string representation?
> Are/should string literals, fixed-length strings and dynamic-length strings
> handled differently? My first tendency is to avoid like the plague
> NUL-terminated strings (aka, C strings) and to opt for some kind of array
> with a length at the beginning followed by the characters that could be
> encapsulated at the library level with appropriate functions. But just a
> length seems like not enough information: the capacity (array length) also
> would be nice to have around. All thoughts, old and novel, welcome.


Have a look at Delphi's string types, most notably the AnsiString type:
- String is a first-class type.
- A string value is a pointer to the first char of the char array.
- The length and ref count are stored before the first char (at negative
    offsets from the pointer).
- There is no capacity field; capacity is tracked by the memory manager.
- Although it has a length, it is also double #0 terminated, so for read
    purposes it can be passed to C code.
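
To make the layout concrete, here is a minimal C sketch of a string whose
header sits at a negative offset from the char pointer. It is not Delphi's
RTL code: StrHeader, str_new, str_length and str_free are hypothetical names,
and the 32-bit field widths and the field order are assumptions.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the character data.
   Field widths and order are assumptions for illustration. */
typedef struct {
    int32_t length;    /* number of bytes, excluding the terminators */
    int32_t refcount;  /* number of references to this data */
} StrHeader;

/* Allocate header + data + two terminating zero bytes and return a
   pointer to the first char, not to the header. */
char *str_new(const char *src, int32_t len) {
    StrHeader *h = malloc(sizeof(StrHeader) + (size_t)len + 2);
    if (h == NULL) return NULL;
    h->length = len;
    h->refcount = 1;
    char *data = (char *)(h + 1);
    memcpy(data, src, (size_t)len);
    data[len] = '\0';
    data[len + 1] = '\0';
    return data;
}

/* The length is read from the header in front of the data, so it costs
   O(1) and does not stop at an embedded zero byte. */
int32_t str_length(const char *s) {
    const StrHeader *h = (const StrHeader *)(s - sizeof(StrHeader));
    return h->length;
}

/* Free the whole block, starting at the header. */
void str_free(char *s) {
    free(s - sizeof(StrHeader));
}
```

Because the pointer addresses the chars directly, the result of str_new can
be handed to read-only C functions such as strlen or puts without conversion,
while str_length still reports the stored length.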


Literals are encoded with the same layout
([length] [ref count] [length bytes chardata] #0#0 ) but have refcount -1.
The ref count is what makes copy-on-write schemes possible, and the -1 marks
the data as constant, so it is never modified in place or freed.
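
A rough C sketch of how a refcount of -1 can drive copy on write. All names
here (StrHeader, STR_HDR, greeting_lit, str_share, str_release, str_unique)
are hypothetical, the field order simply follows the layout written above,
and the code is not thread safe.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int32_t length;
    int32_t refcount;  /* -1 means constant data, never freed */
} StrHeader;

/* Step back from the char pointer to the header in front of it. */
#define STR_HDR(s) ((StrHeader *)((char *)(s) - sizeof(StrHeader)))

/* A literal with the same layout as a heap string, but refcount -1. */
static struct { StrHeader hdr; char data[7]; } greeting_lit =
    { .hdr = { .length = 5, .refcount = -1 }, .data = "hello\0" };

/* Allocate a fresh heap string with refcount 1. */
static char *str_alloc(const char *src, int32_t len) {
    StrHeader *h = malloc(sizeof(StrHeader) + (size_t)len + 2);
    if (h == NULL) return NULL;
    h->length = len;
    h->refcount = 1;
    char *data = (char *)(h + 1);
    memcpy(data, src, (size_t)len);
    data[len] = data[len + 1] = '\0';
    return data;
}

/* "Assignment": share the data; literals are not counted. */
char *str_share(char *s) {
    if (STR_HDR(s)->refcount > 0) STR_HDR(s)->refcount++;
    return s;
}

/* Drop one reference; literals (refcount -1) are left alone. */
void str_release(char *s) {
    StrHeader *h = STR_HDR(s);
    if (h->refcount > 0 && --h->refcount == 0) free(h);
}

/* Call before writing: returns data that only the caller references.
   Shared strings and literals are copied first. */
char *str_unique(char *s) {
    if (STR_HDR(s)->refcount == 1) return s;
    char *copy = str_alloc(s, STR_HDR(s)->length);
    if (copy == NULL) return NULL;
    str_release(s);
    return copy;
}
```

Assigning a string is then just str_share, and a write goes through
str_unique first: a literal or a shared string gets copied, while a string
with a single reference is modified in place.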


As far as I know, Delphi 2009 extends the header with:
- a codepage (which can also be UTF-8 or UTF-16)
- an element size (currently 1 or 2) that specifies the granularity of the
      encoding.
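
As a hedged illustration only, such an extended header could look like the
C struct below. The names (UStrHeader, ustr_byte_length), the field widths
and the field order are assumptions and have not been checked against the
Delphi RTL; the code page numbers are the usual Windows identifiers.

```c
#include <stddef.h>
#include <stdint.h>

/* Windows code page identifiers for two Unicode encodings. */
enum { CP_UTF16LE = 1200, CP_UTF8_ID = 65001 };

/* Hypothetical extended header placed before the character data. */
typedef struct {
    uint16_t code_page;  /* encoding of the payload */
    uint16_t elem_size;  /* bytes per code unit: 1 or 2 */
    int32_t  length;     /* number of code units, not bytes */
    int32_t  refcount;   /* -1 for constant data */
} UStrHeader;

/* Payload size in bytes, excluding any terminator. */
size_t ustr_byte_length(const UStrHeader *h) {
    return (size_t)h->length * h->elem_size;
}
```

With the element size in the header, generic routines can compute byte sizes
without knowing whether the payload is a 1-byte or a 2-byte encoding.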


However, I'm not that familiar with the Unicode extensions.


