Re: How to design a symbol table?

"John Max Skaller" <>
21 Oct 2004 22:26:35 -0400

          From comp.compilers

Related articles
How to design a symbol table? (2004-10-09)
Re: How to design a symbol table? (2004-10-12)
Re: How to design a symbol table? (Jeff Kenton) (2004-10-12)
Re: How to design a symbol table? (Rodney M. Bates) (2004-10-17)
Re: How to design a symbol table? (2004-10-21)
Re: How to design a symbol table? (John Max Skaller) (2004-10-21)
| List of all articles for this month |

From: "John Max Skaller" <>
Newsgroups: comp.compilers
Date: 21 Oct 2004 22:26:35 -0400
Organization: Compilers Central
References: 04-10-075 04-10-103 04-10-126
Keywords: symbols
Posted-Date: 21 Oct 2004 22:26:34 EDT

On Sun, 17 Oct 2004 16:10:39 -0400, Rodney M. Bates wrote:

> Having always done it and seen it done this way for years, I have
> recently become intrigued with the opposite approach: All the symbol
> table contains is a pointer to the AST subtree for the declaration.
> Furthermore, when a lookup is done, a copy of theisST declaration
> pointer becomes a semantic attribute of the AST node for the
> reference. Then the symbol table goes away.
> Anybody have other experience with this approach?

I have a similar situation where the input is a table of unbound terms
rather than an AST, and the output is bound terms. In my system
'bound' means the string name is replaced by an index into the symbol

The need to do this arises from the language definition: a symbol is
visible in the block that defines it, including before its
declaration. This means all definitions in the block are mutually
recursive. It also means a simple linear binding order doesn't exist
-- the lookup algorithm which replaces names by indexes has to
recursively examine unbound terms to bind a term.

There is a rich variety of definitional forms, including overloaded
functions with C++ style overloading of partial specialisations of
generic functions, 'typeof' operator, and inference of variable and
function return types. Because of this the lookup is very expensive,
and often re-evaluates the same term many times.

Surprisingly, when I tried a hashtable based memoisation, it slowed
the code down even more, instead of speeding it up. This was because
the table was keyed by terms, and both hashing and comparison needed a
recursive descent, which actually increased the cost for infreqently
scanned terms, compared to just doing the requisite calculations.

The implementation used Ocaml and utilised the polymorphic comparison
and hash functions it provided. I was already memorising some special
cases such as calculating the environment and the type of a name, and
that seems optimial.

Lookup currently uses about 60% of all compile time: here are actual
times in seconds, the sum of many compilations of my regression tests
(mostly with optimisation off at the moment):

parse=252.760000 <-- parsing
desugar=243.830000 <--- flattening and code/data splitting
build=234.510000 <---- build unbound symbol tables
bind=1735.760000 <---- lookup to make bound tables
opt=36.310000 <-- optimiser
inst=47.400000 <-- instatiator
gen=290.010000 <-- code generator

I have no idea how to reduce lookup time. Directly annotating the
(moral equivalent of the) AST doesn't work, because for example, the
type of a term can depend on itself (recursive type). When I tried
this, the annotation became the point of recursion, instead of the
fixpoint binder. Hmmm. Can I explain that better .. In this example:

var u = (1, &u);

the variable u has the type

(int * 'a pointer) as 'a

Unfortunately .. the u in &u got labelled with type 'a instead,
when I tried to cache the type of each term directly in the
unbound term, ie, it became a free variable. This leads to the
expression (1,&u) having the correct type.. but u and &u inside
the term get the wrong type. To fix <g> this the calculation of
the type of the u in &u has to re-enter the calculation of the
type of the u in var u -- the correct type can't be ascribed
to the u in &u until that calculation is complete. That is,
I have to go around twice, which makes it very hard to know
when it is safe to store a computed type into the unbound term.
So I don't :)

The 'usual' solution here is to simply invent a type variable, and
when you're finished solve the resultant equations by
unification. However that won't work for me, because of function
overloading. I would need a unification engine that supported
non-discriminated union types. This is exponential in space and
time. (I think..) My solution is not to do full inference: I require
function argument types to be annotated, and then I just get left with
exponential time and linear space cost.

[This seems slow .. but in practice my compiler can generate code
faster than g++ can compile it by a factor of somewhere between 5 to
infinity .. where infinity means g++ just drops dead completely]

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.