Symbol Table

firth@sei.cmu.edu (Robert Firth)
10 Mar 89 18:24:41 GMT

          From comp.compilers

Related articles
Symbol table dorina@esenet.dk (Rina) (2001-04-22)
Re: Symbol table mg169780@zodiac.mimuw.edu.pl (Michal Gajda) (2001-04-26)
Symbol Table firth@sei.cmu.edu (1989-03-10)
Re: Symbol Table henry@zoo.toronto.edu (1989-03-14)
Symbol Table guadagnoli.massimo@usa.net (Massimo Guadagnoli) (1997-12-17)
| List of all articles for this month |

From: firth@sei.cmu.edu (Robert Firth)
Newsgroups: comp.compilers
Date: 10 Mar 89 18:24:41 GMT

This is a brief response to the question of building a symbol table when
compiling a C-like language. I assume the language has at least these features
of C: no factored names (such as Text_IO.Put_Line) and no overloading (such as
Text_IO.Put). However, we do have global names, local names, and block
structure.


The symbol table seems to serve two purposes: mapping lexical identifiers into
names, and finding defining occurrences of names from applied occurrences.


For the first purpose, I believe a simple hash scheme is much the best. Hash
the identifier into a list head, and chain on collisions. Keep it very simple.
The hashing function I usually use takes the identifier length and a few of
its characters, does some shifts and adds, and then takes the final modulus.
There is very little point in being clever provided your string comparison
function is slick: the cost of a couple more probes is less than the cost of
some elaborate all-character hash, quadratic rehash, or whatever.


The number of list heads depends on how big you expect a program to be; for
medium-size compilations (eg of compiler modules with maybe 1500 lines and 600
names) something around 500 to 1000 is about right. The average number of
probes is then only 2 or 3.


For the second purpose, I have experimented with several structures, and I
believe this is the best:


(a) Each lexical identifier is stored once only, in the string table,
        accessed as described above.


(b) Each lexical identifier is associated with exactly one 'global
        declaration cell', which records a global declaration of that
        identifier. Note that this relies on the fact that all global
        (or outermost scope) names must be unique.


(c) The global declaration cell has two special flags


        . Not globally declared


        . Also locally declared


        If an identifier is ever redeclared locally, the second flag is
        set. If the identifier has been declared, but not globally, the
        first flag is set.


(d) A linear list of local redeclarations is kept, with the usual
        simple LIFO algorithm: add declarations to the end; search in
        reverse order; mark on block entry and reset on block exit.


The strategy for using this structure is simple: from the lexical identifier
one can go via the hash to the entry in the string table, and from there
directly to the global declaration cell. If the identifier has never been
redeclared, you are done. If the identifier is marked as redeclared, you have
to search the linear list.


The reasoning behind it is as follows: typically, programs have a large set of
global names, used sparsely, and a small set of local names, used densely.
Rarely is a global name redeclared locally. So the great majority of uses of
globally-declared identifiers are indeed applied occurrences of the global
declaration, and can be found in O(1+). The great majority of uses of
locally-declared identifiers are uses of the innermost current declaration,
and again will be found rapidly by linear search, in O(average number of local
declarations per block).


Since the same local names (x,i,p,r,ch,...) are used over and over again, I
don't even bother to reset the 'Also locally declared' flag on block exit: if
you run off the end of the local list, the global declaration applies; if the
'Not globally declared' flag is set, there is no global declaration and you
have an error.


The strategy also deals moderately well with those wretched 'include' files:
the declared names do get stuffed into the string table and so slow the hash
lookup, but their entries in the global declaration list merely take up space
- they don't slow down the cost of looking up declarations. Since small
programs may never even use 90% of the names they import via 'include' files,
this is a desirable property.


I've used this scheme in two compiler front ends, and it was very fast indeed.


Hope that helps


Robert Firth
[From firth@sei.cmu.edu (Robert Firth)]
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.