Re: Table Compression Methods in Dragon Book
Mon, 17 Oct 1994 21:28:42 GMT

          From comp.compilers

Related articles
Question: Table Compression Methods in Dragon Book (1994-10-09)
Re: Question: Table Compression Methods in Dragon Book (Jan Peter de Ruiter) (1994-10-10)
Re: Table Compression Methods in Dragon Book (1994-10-17)
| List of all articles for this month |

Newsgroups: comp.compilers
Keywords: yacc
Organization: Compilers Central
References: 94-10-071 94-10-080
Date: Mon, 17 Oct 1994 21:28:42 GMT

      Greg Schmidt wrote:
      > There is a section titled "Table-Compression Methods" at the end of the
      > chapter discussing lexical analysis in the newer dragon book. The intent
      > is to implement efficient transition tables by leveraging off of the
      > similarities among states, but the example is unclear to me. I don't
      > understand what is meant by "default" transitions or "special" states.
      > I would be grateful if some kind soul could provide a simple example
      > showing the appropriate structures.

      Jan Peter de Ruiter wrote:
      !I too have problems understanding this mechanism from the dragon book.
      !Even my compiler guru admits he doesn't understand it. I'd like to see an
      !algorithm (in pseudocode or something equivalent) that describes how one
      !*builds up* this table. Once you have the table it's easy to apply the
      !pseudocode from the dragon book, but how the h*ll do you create it is my

Let us consider a DFA to accept the following regular expressions and
return tokens

end {return END}
else {return ELSE}
[endls]+ {return IDENTIFIER}

The keywords are "end" and "else" and everything else is an
identifier. The state transition table is

    state | e n d l s
        0 | 1 7 7 7 7
        1 | 7 2 7 2 7
        2 | 7 7 3 7 7
        3 | 7 7 7 7 7
        4 | 7 7 7 7 5
        5 | 6 7 7 7 7
        6 | 7 7 7 7 7
        7 | 7 7 7 7 7

Such a 2-dimensional table can be compressed so fewer entries are
actually stored.

An obvious scheme is to store an adjacency list instead of the
adjacency matrix. Each state stores a list of transitions on certain
inputs. for example, state 1 will carry 3 transitions namely, to state
2 on an input "n", to state 2 on an input "l" and if the two cases
fail, a third, "default" transition to state 7. "default" because that
is the transition on any other input. (This representation could be
further compressed by exploting the fact that on both inputs "n" and
"l", the next state is 2. Adjacency lists conserve space but the
overhead is in searching through the adjacency lists for an
appropriate transition.

The default-base-next-check method has the advantage of direct access
since all the transitions are stored in arrays. The disadvantage (the
worst case scenario in my opinion) is that the next-check arrays can
each end up as large as the adjacency matrix. I have trouble believing
the first part of discussion on page 145 in the dragon book. The
authors write "...Here we use a data structure consisting of four
arrays indexed by state numbers...". The fact is that the next-check
arrays can be much larger than the number of states! (we will see this
in the following example). Anyways...

The basic idea of this method is to store the two dimensional
adjacency matrix as a linear array. In other words, to lay out the
rows of the matrix linearly. Simply linearizing the rows would mean a
space utilization equal to that of the adjacency matrix. Instead, if
some of the entries could be overlapped...? well, that is where the
default transitions help. If we look at the transitions out of state 1
of the above table from a different angle, we can think of all
transitions out of state 1 as being "default" and the transitions on
inputs "n" and "l" as being "special" since the default transitions
outnumber the special cases in the above example. So, the "next" table
is nothing but a linear representation of the adjacency matrix with
some of the entries overlapped (hence table compression)

At this point, we know that we have to linearize the rows but we also
know that a dumb approach will not buy us any _compression_. One of
the heuristics that I have given some thought is to identify some sort
of an "inheritance" tree. In other words, to identify rows such that
the special transitions in the rows do not clash, when overlapped
(informally, a row inherits another if all the entries in the
inherited row is the same as the base row except for special
transitions). For example, if the rows corresponding to rows 0 and 5
are overlapped, the special transitions on "e" will clash but
overlapping 0 and 3 would not result in any clashes. Similarly, 0, 1
and 4 can be overlapped without the special cases clashing with each
other. So we have to identify such combinations so that the rows can
be stored within the same "space" in the linear organization. This
heuristic will buy us storage space in most cases. Once we identify
such combinations, we start filling the next-check arrays starting at
the least numbered index possible and laying out the rows one by one
(overlapping as much as possible). The "next" array's entries directly
fall out of the adjacency matrix and the "check" array's entries are
based on the construction of the "next" array.

In the table above, we notice that states 3, 6 and 7 can be treated as
default states since (1) they are all identical anyway and (2) all
other states can be treated as "inheriting" from them. so let us lay
them out first in the next-check table. Next, let us try to insert
rows 0 and 1 overlapping them as much as possible. At this point the
tables look as follows

        state default base index next check
            0 3 5 0 7 3
            1 3 5 1 7 3
            2 2 7 3
            3 X 0 3 7 3
            4 4 7 3
            5 5 1 0
            6 3 X 6 2 1
            7 3 X 7 X X
                                                                              8 2 1
                                                                              9 X X

we will assume that the offsets from the base index in the next-check
arrays are computed using the values 0 through 4 corresponding to
e,n,d,l,s respectively. For this table and subsequent ones, treat X to
mean some index that is invalid and does not exist in the table. We
see that we already need 10 entries in the next-check tables though
some are not used. This is because the base index to states 0 and 1
in the next-check table is 5, but each can be faced with 5 different
inputs so an offset of 5 is required from the base index. For
instance, if, at state 0, an input e is seen, the next state is
computed directly as state 1. If the input "s" is seen, the next state
at index 9 in the check table is not 0 (the current state), so the
default state (3) is assumed as the current state and the next state
computed as 7. And so on. The final table after inserting the rows in
the order 3, 6, 7, 0, 1, 4, 5, 2 is

    token state default base index next check
      IDENT 0 3 5 0 7 3
      IDENT 1 3 5 1 7 3
      IDENT 2 3 8 2 7 3
      END 3 X 0 3 7 3
      IDENT 4 3 5 4 7 3
      IDENT 5 3 7 5 1 0
      ELSE 6 3 X 6 2 1
      IDENT 7 3 X 7 6 5
                                                                                            8 2 1
                                                                                            9 5 4
                                                                                          10 3 2
                                                                                          11 X X
                                                                                          12 X X

space usage = 2*8 + 2*13 = 42 = default-base-size + next-check-size
size of Adjacency matrix = 8*5 = 40

Although we have ended up using up 2 entries more than the adjacency
matrix in this example, this method would compress the table in any
typical scanner when the number of states is far more than 8 and
almost all possible inputs have to be considered. The order of
inserting the rows also matters. For instance, inserting the rows in
the order 6, 7, 3, 1, 4, 2, 0, 5 will use 46 entries.

I have also included a "pattern" entry with each of the states so
that, if the end of the current token is recognized, the value
returned is the token at the final state. (The dragon book mentions
this in a footnote)


        author = "Aho, A. V. and Ullman, J. D.",
        title = "Principles of compiler Design",
        publisher = AW,
        address = "Reading, {MA}",
        year = 1977,

    author = "A. V. Aho, R. Sethi and J. D. Ullman",
    title = "Compilers: Principles, Techniques and Tools",
    year = "1986",
    publisher = "Addison-Wesley",

                author = "Robert E. Tarjan and A.C. Yao",
                title = "Storing a sparse table",
                journal = cacm,
                volume = 22,
                number = 11,
                pages = "606--611",
                year = 1979,

    title="Optimization of Parser Tables for Portable Compilers",
    author="Peter Dencker and Karl {D\"urre} and Johannes Heuft",

    author = "M.L. Fredman and J. Komlos and E. Szemeredi",
    title = "Storing a sparse table with {O}(1) worst case access time",
    journal = jacm,
    volume = 31,
    number = 3,
    pages = "538--544",
    year = 84,

Nandakumar Sankaran, G34 Jordan, Comp. Sci. Dept., Clemson Univ. SC 29634
311-8 Old Greenville Hwy. Clemson SC 29631-1651 (803)653-7749

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.