Re: Internal Representation of Strings

"cr88192" <cr88192@hotmail.com>
Mon, 23 Feb 2009 11:42:25 +1000

          From comp.compilers

Related articles
[20 earlier articles]
Re: Internal Representation of Strings idbaxter@semdesigns.com (Ira Baxter) (2009-02-21)
Re: Internal Representation of Strings cr88192@hotmail.com (cr88192) (2009-02-22)
Re: Internal Representation of Strings DrDiettrich1@aol.com (Hans-Peter Diettrich) (2009-02-22)
Re: Internal Representation of Strings DrDiettrich1@aol.com (Hans-Peter Diettrich) (2009-02-22)
Re: Internal Representation of Strings bartc@freeuk.com (Bartc) (2009-02-22)
Re: Internal Representation of Strings scooter.phd@gmail.com (Scott Michel) (2009-02-22)
Re: Internal Representation of Strings cr88192@hotmail.com (cr88192) (2009-02-23)
Re: Internal Representation of Strings marcov@stack.nl (Marco van de Voort) (2009-02-23)
Re: Internal Representation of Strings haberg_20080406@math.su.se (Hans Aberg) (2009-02-23)
Re: Internal Representation of Strings tony@my.net (Tony) (2009-02-24)
Re: Internal Representation of Strings DrDiettrich1@aol.com (Hans-Peter Diettrich) (2009-02-24)
Re: Internal Representation of Strings tony@my.net (Tony) (2009-02-25)
Re: Internal Representation of Strings armelasselin@hotmail.com (Armel) (2009-02-26)
[7 later articles]
| List of all articles for this month |

From: "cr88192" <cr88192@hotmail.com>
Newsgroups: comp.compilers
Date: Mon, 23 Feb 2009 11:42:25 +1000
Organization: albasani.net
References: 09-02-051 09-02-068 09-02-078 09-02-115
Keywords: storage
Posted-Date: 24 Feb 2009 07:49:59 EST

"Scott Michel" <scooter.phd@gmail.com> wrote in message
> You might also want to have a look at prefix tries, as well, since
> that is likely to give you the best of many worlds (computing string
> length, etc.) There is additional complexity in inserting and removing
> strings from the trie. However, it should also make doing comparisons
> with I18N character sets somewhat less complicated. IIRC, some of the
> "rope" implementations take this approach.
>


yes, a potentially good suggestion, however:
a trie is not a mutable array, and I think the OP has his mind set on this;
tries don't play so well with a GC, whereas a weak hash is much better at
dropping unused strings (for GC the trie would likely need to be tightly
coupled with the GC, and this could get hairy);
tries are good for lots of strings with common prefixes, but less so for
random long strings (where a good number of nodes may be used);
a trie is more complicated than most other options;
...




it does bring up the interesting possible idea of an app using LZW as a
basis for strings though, as LZW could be essentially combined far more
easily with a GC (giving many of the advantages of a trie). likewise, the
patent for LZW has long-since expired.


as a cost, given the likely representational overhead of using in-memory
LZW, likely it would only really be used if matches of a certain minimal
length can be found, and for strings exceeding a certain length.


so, likely, per-character LZW would be used for building an internal
representation of the string space, but actual matches are only used if they
exceed a certain minimum length (for example, 12 or 16 bytes if the
representation is based on 8-byte CONS cells), where any shorter segments
are encoded directly (although probably added to a table as well so that
they can be effectively matched later).


so, if done well this could save a lot of memory even for large numbers of
larger strings, however, processing the strings in such a matter wouldn't be
quite so fast...




most likely, this would be an extended rope type, and not the basic string
type, since encoding strings in this manner would presuppose the existence
of a primitive string type.


for much longer matches, likely LZW would be done on the segment level as
well, such that rope segments would be reused, ...



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.