Re: Internal Representation of Strings

Waldek Hebisch <hebisch@math.uni.wroc.pl>
Thu, 5 Mar 2009 03:13:28 +0000 (UTC)

          From comp.compilers

Related articles
[32 earlier articles]
Re: Internal Representation of Strings armelasselin@hotmail.com (Armel) (2009-02-26)
Re: Internal Representation of Strings marcov@stack.nl (Marco van de Voort) (2009-02-27)
Re: Internal Representation of Strings tony@my.net (Tony) (2009-02-28)
Re: Internal Representation of Strings cr88192@hotmail.com (cr88192) (2009-03-03)
Re: Internal Representation of Strings armelasselin@hotmail.com (Armel) (2009-03-02)
Re: Internal Representation of Strings tony@my.net (Tony) (2009-03-03)
Re: Internal Representation of Strings hebisch@math.uni.wroc.pl (Waldek Hebisch) (2009-03-05)
Re: Internal Representation of Strings cr88192@hotmail.com (cr88192) (2009-03-06)
| List of all articles for this month |

From: Waldek Hebisch <hebisch@math.uni.wroc.pl>
Newsgroups: comp.compilers
Date: Thu, 5 Mar 2009 03:13:28 +0000 (UTC)
Organization: Politechnika Wroclawska
References: 09-02-051 09-02-068 09-02-078 09-02-084 09-02-090 09-02-105
Keywords: storage, practice
Posted-Date: 05 Mar 2009 06:04:23 EST

Tony <tony@my.net> wrote:
>
> What if you make every item in a parse tree contain a string. Those strings
> are likely to be very small, a lot of one-character strings. It just seems
> like low overhead strings always have a place. (No, I haven't built a
> compiler, yet).
>
> Tony
> [Let's say you have a gigantic parse tree with 10,000 nodes. That means
> you'd have 40K of length words. Who cares? -John]


I have moderately large parse tree: source is 139840 lines 5418754
characters as reported by wc. The resulting tree has 1164004 nodes
referencing about 630000 symbols and 15562 string constants
(that is source language string constants). The actual number
of distinct symbols is about 13000. The whole tree takes about
20Mb of memory (typical node contains 2 64-bit pointers giving
16 bytes per node). Clearly overhead due to string storage
is limited to few hundred kilobutes, that is few percent. In
other word, the main thing is to store symbol names only
once. The tree contains pointers to symbols and each symbol
has pointer to its name. Since on average each symbol is
referenced about 50 times space taken by pointers dominate
and space taken by strings is less relevant.


Also, since source is limited to ASCII, we have at most 95
distinct one character symbols...


Note, the source is in a language which uses a lot of overloading,
other language may require more global names. But I would not
expect a dramatic difference, extra 10000 global names should
be enough for big systems. Significant proportion of symbols
in tree above are operator symbols (+, -, *, :=, ...) -- while
other language may have different number of operators one
can expect that relatively small number of symbols will
be responsible for most node. Or alternatively, parse tree
may use special king of nodes for operators, then number of
symbol nodes will be much smaller.


Anyway, it is not hard to scale parse tree to 30 million of
nodes. OTOH later processing is likely to be need much
more space than the parse tree.


--
                                                            Waldek Hebisch
hebisch@math.uni.wroc.pl



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.