Re: Internal Representation of Strings

"cr88192" <cr88192@hotmail.com>
Sun, 22 Feb 2009 07:41:13 +1000

From comp.compilers

From:	"cr88192" <cr88192@hotmail.com>
Newsgroups:	comp.compilers
Date:	Sun, 22 Feb 2009 07:41:13 +1000
Organization:	albasani.net
References:	09-02-051 09-02-068 09-02-078 09-02-084 09-02-090 09-02-105
Keywords:	storage
Posted-Date:	21 Feb 2009 18:56:57 EST

> What if you make every item in a parse tree contain a string. Those
> strings are likely to be very small, a lot of one-character
> strings. It just seems like low overhead strings always have a
> place. (No, I haven't built a compiler, yet).

> Tony
> [Let's say you have a gigantic parse tree with 10,000 nodes. That means
> you'd have 40K of length words. Who cares? -John]

small overheads build up quickly, especially with common objects, and when
combined with a garbage collector.

more memory means more garbage, which means more GC activity, which means
slowness (or, more annoyingly, occasional pauses if a non-concurrent GC is
used...).

now, granted, with most traditional allocator strategies, far more memory
will likely go into per-object overhead than into the contents of small
strings (so saving 1 or 2 bytes will not matter much if the memory comes
from malloc or, sadly, for that matter, from Boehm...).
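
to put rough numbers on this, here is a minimal sketch in C. the function
names and the 16-byte allocator header are assumptions made up for
illustration, not measurements of malloc or Boehm; only the 10,000 nodes and
the 4-byte length word come from the quoted text.

```c
#include <stddef.h>

/* Bytes spent on length words alone: one word per string node. */
size_t length_word_bytes(size_t nodes, size_t word_size)
{
    return nodes * word_size;
}

/* Total heap bytes for `nodes` short strings when every allocation also
 * pays a fixed allocator header. alloc_overhead is an assumed figure,
 * not a measured one; payload counts the characters plus terminator. */
size_t total_string_bytes(size_t nodes, size_t alloc_overhead,
                          size_t word_size, size_t payload)
{
    return nodes * (alloc_overhead + word_size + payload);
}
```

length_word_bytes(10000, 4) gives 40,000 bytes, which is the moderator's 40K.
total_string_bytes(10000, 16, 4, 2) gives 220,000 bytes for one-character
strings, of which 160,000 would be the assumed allocator header, so under
that assumption the length word is the smaller part of the cost.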

but, in my case, my GC places much more emphasis on smaller objects, and so
small size differences (a few bytes) can have a notable impact on how many
heap cells are used by each object, and thus on the total memory overhead...
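
the cell effect can be sketched as follows, in C. this assumes a heap made of
fixed-size cells with objects rounded up to a whole number of cells; the
16-byte cell size used in the examples and the name cells_for_string are
hypothetical, since the post does not give the actual cell size or layout.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fixed-size heap cells needed for one string object.
 * chars:        characters in the string, excluding the NUL terminator
 * header_bytes: length word or other per-string header
 *               (0 for a bare NUL-terminated string)
 * cell_size:    bytes per heap cell (an assumed parameter) */
size_t cells_for_string(size_t chars, size_t header_bytes, size_t cell_size)
{
    assert(cell_size > 0);
    size_t bytes = header_bytes + chars + 1;     /* +1 for the NUL */
    return (bytes + cell_size - 1) / cell_size;  /* round up to whole cells */
}
```

with 16-byte cells, a 15-character string with no header needs 16 bytes and
fits in one cell (cells_for_string(15, 0, 16) is 1), while the same string
with a 4-byte length word needs 20 bytes and spills into a second cell
(cells_for_string(15, 4, 16) is 2). for one-character strings the header
makes no difference at this cell size, so the cost only shows up near cell
boundaries.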

now, granted, all this doesn't matter as much in the cases where the app can
manage to run entirely inside the initial heap space, but it makes a
difference when all the memory is used up (and a few GB will often go a lot
less far than one might think...).

for example, not too long ago, an app of mine was making use of HMM-based
text modeling, and as a test I decided to run a 40MB text file through it.
the app didn't finish, and I killed the process after it had used up more
than 1GB of memory and I was running low on free swap space (it was a new
Windows install, and I had yet to increase the swap: apart from the 2GB of
RAM, Windows had set an initial max limit of 2GB on swap, which I have since
expanded to 16GB spread across several drives).

it also starts to matter some when one realizes the huge amounts of garbage
that can be produced when compiling collections of C source files, which
forces the GC to more or less go berserk, recollecting the heap after every
module (of course, this is with each module including maybe several MB of
headers...).

I have since remedied some of these issues, for example, by caching things,
checking dependencies, ... but, still they linger (after all, code may have
a tendency to be edited, ...).

well, beyond the delay of compiling each module (this is not GC-related;
apparently the preprocessor and parser use much of the time...), this has
given me some doubt as to the goodness of using C as a primary language for
dynamically loaded code. I am considering the possibility of partly giving
its place over to C#, noting that, among other things, 'csc' runs MUCH
faster than gcc, and I suspect this is primarily due to the lack of header
inclusion. so, if I were to dynamically compile C# instead of C, I could
likely expect a similar level of speedup, as well as a much lower cost for
parsing and compiling each module (my compiler runs a little slower than
gcc, but not by that large of a factor).

actually, it will probably be less work to get a working dynamic C# compiler
in place than it will be to get CIL (or JBC) to work (nonetheless, C# does
pose its share of technical complexities...). my idea is to start from an
earlier version of my C compiler and use this as a base.

all this matters more when the app also has to do other things as well...
