Re: Computing first sets on a left-recursive grammar

"Sönke Kannapinn" <soenke.kannapinn@wincor-nixdorf.com>
12 Apr 2001 02:46:09 -0400

          From comp.compilers

Related articles
Computing first sets on a left-recursive grammar cyrus@stwing.upenn.edu (Cyrus Najmabadi) (2001-04-10)
Re: Computing first sets on a left-recursive grammar paul@parsetec.com (Paul Mann) (2001-04-12)
Re: Computing first sets on a left-recursive grammar soenke.kannapinn@wincor-nixdorf.com (Sönke Kannapinn) (2001-04-12)
Re: Computing first sets on a left-recursive grammar max@max.mcs.gac.edu (Max Hailperin) (2001-04-14)
Re: Computing first sets on a left-recursive grammar parag@pinhead.parag.codegen.com (2001-04-14)
From: "Sönke Kannapinn" <soenke.kannapinn@wincor-nixdorf.com>
Newsgroups: comp.compilers
Date: 12 Apr 2001 02:46:09 -0400
Organization: Siemens Inc.
References: 01-04-062
Keywords: parse, theory
Posted-Date: 12 Apr 2001 02:46:09 EDT

Hi Cyrus.


> I'm trying to understand (while simultaneously code) the algorithm for
> computing first and follow sets in the Dragon book (p177).
> Unfortunately, I'm being pretty unsuccessful at implementing this
> obfuscated explanation. The fault seems to lie in left-recursive
> grammar productions. Because "First" calls itself on the individual
> elements of the yield of the production, a left-recursive production
> will cause "first" to loop forever.
>
> Unfortunately, I find the explanation on these functions in the
> Dragon book rather difficult to understand, so I'm making no headway
> into determining if this a limitation of the "First" function, or if
> there's some simple way to get about this problem. If it's a
> limitation, is the correct solution to eliminate left-recursion? Or,
> if it's not a limitation, can someone post (or link to) an algorithm
> that can calculate the "first" set even on a left-recursive
> production.


The computation of FIRST and FOLLOW sets of a grammar should, of
course, not be limited to special grammars (such as non-left-
recursive ones): their formal definition is based simply on the
respective grammar's derivation relation, which means that FIRST and
FOLLOW are always *defined* no matter how weird the grammar is. So,
if your algorithm fails to compute FIRST and/or FOLLOW correctly on
some grammars then, clearly, the algorithm is insufficient!


I'll describe an algorithm below which is adequate because its
structure follows exactly the structure of the problem; that
correspondence is exactly what your approach lacks. Unfortunately, I
know only a single compiler-related text book that treats the problem
well (see the trailer); most compiler books present algorithms I would
*not* call "appropriate" in the above sense. (Don't misunderstand
that: I'm not saying they are wrong; I'm saying they are
inappropriate.) The usual solutions apply some least-fixpoint-
computation strategy that can roughly be sketched as follows (I'll
deal with FIRST only):


Compute obvious initializing sets of FIRST(A) for all nonterminals A.
REPEAT
        Find all dependencies in the grammar where some set FIRST(A)
        must obviously contain some set FIRST(B) for some nonterminals A
        and B, and (re-)compute FIRST(A) := FIRST(A) + FIRST(B).
UNTIL, during a complete grammar examination, none of the set unions
        computed has added any new elements


Clearly, there are unnecessary re-computations done in this approach
because of the REPEAT-recompute-everything-UNTIL-no-more-change
pattern.
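

To make that concrete, here is a minimal Python sketch of that naive
fixpoint strategy (my own formulation, not taken from any of the books
mentioned; the grammar representation - a dict mapping each nonterminal
to a list of right-hand-side tuples, with epsilon as the empty tuple -
is just an assumption for the example):

    # Naive fixpoint computation of nullability and FIRST sets.
    # Terminals are the symbols that are not keys of the grammar dict.

    def nullable_nonterminals(grammar):
        nullable = set()
        changed = True
        while changed:
            changed = False
            for A, rhss in grammar.items():
                if A not in nullable and any(all(X in nullable for X in rhs)
                                             for rhs in rhss):
                    nullable.add(A)
                    changed = True
        return nullable

    def first_sets_naive(grammar):
        nullable = nullable_nonterminals(grammar)
        first = {A: set() for A in grammar}
        changed = True
        while changed:                    # REPEAT ... UNTIL no more change
            changed = False
            for A, rhss in grammar.items():
                for rhs in rhss:
                    for X in rhs:
                        if X in grammar:              # X is a nonterminal
                            if not first[X] <= first[A]:
                                first[A] |= first[X]
                                changed = True
                            if X not in nullable:
                                break
                        else:                         # X is a terminal
                            if X not in first[A]:
                                first[A].add(X)
                                changed = True
                            break
        return first

    # A left-recursive production is no problem for the fixpoint approach:
    # first_sets_naive({'S': [('S', 'a'), ('b',)]})  -->  {'S': {'b'}}

Note that the outer loop may sweep over the whole grammar several
times; that is exactly the waste the rest of this article avoids.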


I'll try to explain now how to do better. The algorithm I'll sketch
assumes that the grammar is reduced.


I can't do entirely without formulas, so let me use G=(V,T,P,S)
(V vocabulary, T terminals, P productions, S start symbol,
N=V\T nonterminals). Furthermore,
- Greek letters alpha, beta, ... denote strings from V*,
- epsilon is the empty word,
- A,B,... are nonterminals from N, and
- a,b,... are terminals from T.


Then we have


FIRST(A) = { a in T : A =>* a alpha } and
FOLLOW(A) = { a in T : S =>* alpha A a beta }.


(Of course, => is the derivation relation, =>* its reflexive
transitive closure.)


Now, looking at the grammar productions we see that we can read
off direct and indirect contributions to sets FIRST(A):


Examining successively each p in P, we see


        direct contributions:
        If p is of the shape A -> alpha a beta where alpha =>* epsilon
        then, of course,
                a in FIRST(A).


        indirect contributions:
        If p is of the shape A -> alpha B beta where alpha =>* epsilon
        then, of course,
                FIRST(A) is a superset of FIRST(B).
                Let's express this formally by a relationship
                (A,B) in contains-the-FIRSTs-of, or, in infix notation:
                A contains-the-FIRSTs-of B.


We'll interpret the relation contains-the-FIRSTs-of in N x N as
the edge relation of a directed graph with node set N.
If you think about it, loops in this graph express left recursion!


Here's an intricate example grammar:
A -> F C a | E A
B -> C G h
C -> D B a | b h | c | epsilon
D -> F e | epsilon
E -> G h | D f
F -> E A | d F | epsilon
G -> g


From the above, we know:
FIRST0(A)={a}, FIRST0(B)={}, FIRST0(C)={b,c}, FIRST0(D)={e},
FIRST0(E)={f}, FIRST0(F)={d}, FIRST0(G)={g}, and
contains-the-FIRSTs-of = { (A,C), (A,E), (A,F), (B,C), (B,G), (C,B),
(C,D), (D,F), (E,D), (E,G), (F,E)}.


Drawn as a poor man's ASCII-graph this is:


        G <----------------
        ^     _________   |
        |     |       |   |
        |     |       v   |
        |     A-->F-->E----
        |     |   ^   |
        |     v   |   |
        B<----C-->D<---
         ----->


Let's assume that we know whether a nonterminal is nullable. Then,
both
- the initial sets FIRST0(A), reflecting the "direct contributions",
- and the relation contains-the-FIRSTs-of, reflecting exactly the
    "indirect contributions" (subset-dependencies of FIRST sets),
can be computed in a single pass over the grammar productions in
time linear to the size of G.
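

If it helps, here is one way that single pass could look in Python,
reusing the grammar representation and the nullable computation from
the earlier sketch (again just my own rendering of the idea):

    # One pass over the productions: collect the direct contributions
    # FIRST0(A) and the edges of the contains-the-FIRSTs-of relation.

    def first0_and_edges(grammar, nullable):
        first0 = {A: set() for A in grammar}
        edges = {A: set() for A in grammar}   # A -> {B : A contains-the-FIRSTs-of B}
        for A, rhss in grammar.items():
            for rhs in rhss:
                for X in rhs:                 # walk the nullable prefix of rhs
                    if X in grammar:          # nonterminal: indirect contribution
                        edges[A].add(X)
                        if X not in nullable:
                            break
                    else:                     # terminal: direct contribution
                        first0[A].add(X)
                        break
        return first0, edges

    # The example grammar above, in this representation
    # (epsilon = the empty tuple); its nullable set is {'C', 'D', 'F'}:
    EXAMPLE = {
        'A': [('F', 'C', 'a'), ('E', 'A')],
        'B': [('C', 'G', 'h')],
        'C': [('D', 'B', 'a'), ('b', 'h'), ('c',), ()],
        'D': [('F', 'e'), ()],
        'E': [('G', 'h'), ('D', 'f')],
        'F': [('E', 'A'), ('d', 'F'), ()],
        'G': [('g',)],
    }
    # first0_and_edges(EXAMPLE, {'C', 'D', 'F'}) reproduces the FIRST0 sets
    # and the contains-the-FIRSTs-of pairs listed above.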


Now one observes that, for each nonterminal A, any terminal a in
FIRST(A) indeed stems either from a direct contribution - i.e.
a in FIRST0(A) - or from an indirect contribution - i.e. a in
FIRST(B) for some nonterminal B such that A contains-the-FIRSTs-of B.
In other words, we can collectively characterize all FIRST sets
as the smallest sets FIRST(A) satisfying, for each A in N:
        1.) FIRST(A) contains FIRST0(A), and
        2.) for each nonterminal B, s.t. A contains-the-FIRSTs-of B:
                        FIRST(A) contains FIRST(B).
By the way, this is equivalent to
        FIRST(A) = Union of { FIRST0(B) : A contains-the-FIRSTs-of* B }.
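

Just to illustrate that last characterization, here is a deliberately
naive Python check (quadratic, not the efficient algorithm developed
below; it uses the first0/edges representation from the sketch above):

    # FIRST(A) = union of FIRST0(B) over all B reachable from A in the
    # contains-the-FIRSTs-of graph (reachability includes A itself).
    def first_by_reachability(first0, edges):
        first = {}
        for A in first0:
            reachable, todo = {A}, [A]
            while todo:
                for B in edges[todo.pop()]:
                    if B not in reachable:
                        reachable.add(B)
                        todo.append(B)
            first[A] = set().union(*(first0[B] for B in reachable))
        return first

In the example grammar every nonterminal is reachable from A, so this
gives (unless I have miscalculated) FIRST(A) = {a,b,c,d,e,f,g}.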


Now, what an appropriate algorithm must do in view of the
problem visualization as a graph is to compute the FIRST0 set of
each graph node ("init"), and to traverse the contains-the-FIRSTs-of
graph in depth-first manner, unioning FIRST(B) into FIRST(A)
when returning from the traversal of an edge (A,B) ("union").

Well, and cycles must be treated correctly, i.e. all nodes
(nonterminals) along a cycle must finally get the same FIRST sets
assigned ("distribute"):
For example, for the cycle D->F->E->D we have
        FIRST(D) contains FIRST(F) contains FIRST(E) contains FIRST(D),
so FIRST(D) = FIRST(F) = FIRST(E).
And there can be cycles in cycles...


Therefore, to do the minimum work required, an appropriate
algorithm must determine the structure of the contains-the-FIRSTs-of
graph w.r.t. cycles (or, to be precise, the "strongly connected
components" (SCCs) of the graph).


All this can be perfectly combined with Tarjan's brilliant algorithm
for the computation of a directed graph's SCCs. All we must do is
to add three set computation statements to Tarjan's algorithm, and
there we are.


Here is a sketch of an implementation-friendly variant of Tarjan's
algorithm computing what we need in case of FIRST (the three
set-computing statements added are marked by (*...!*) ).


        (* Assume that nonterminals are INTEGERs 0, 1, ..., n-1 *)
        (* and that sets FIRST0[0..n - 1] are already computed. *)


        PROCEDURE ComputeFIRST;
            CONST clean = -1; done = n;

            VAR Stack, Low: ARRAY [0 .. n - 1] OF INTEGER;
                Top: INTEGER;

            PROCEDURE Traverse(v: INTEGER);
                VAR w, Top1: INTEGER;
            BEGIN
                INC(Top); Stack[Top] := v; Low[v] := Top; Top1 := Top;
                FOREACH w such that "v contains-the-FIRSTs-of w" DO
                    IF Low[w] = clean THEN Traverse(w) END;
                    (* union for every edge, also when w is already finished *)
                    FIRST[v] := FIRST[v] + FIRST[w];             (*union!*)
                    IF Low[w] < Low[v] THEN Low[v] := Low[w] END
                END;
                IF Low[v] = Top1 THEN                   (* v root of SCC *)
                    WHILE Top >= Top1 DO
                        w := Stack[Top];
                        FIRST[w] := FIRST[v];               (*distribute!*)
                        Low[w] := done; DEC(Top)
                    END
                END
            END Traverse;

            VAR v: INTEGER;

        BEGIN
            FOR v := 0 TO n - 1 DO FIRST[v] := FIRST0[v] END;     (*init!*)
            Top := -1;
            FOR v := 0 TO n - 1 DO Low[v] := clean END;
            FOR v := 0 TO n - 1 DO
                IF Low[v] = clean THEN Traverse(v) END
            END
        END ComputeFIRST;


In general, this algorithm performs no unnecessary set computations.
The underlying depth-first graph traversal visits each node
(nonterminal) exactly once and traverses each edge
(contains-the-FIRSTs-of relationship) exactly once, and it has linear
run-time w.r.t. the size of the graph (or, here, the grammar). You
can hardly do any better.
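

For what it's worth, here is a rough transcription of that sketch into
Python (my own rendering, on top of the first0/edges representation
used in the earlier sketches; for very large grammars the recursion
would have to be replaced by an explicit stack):

    # SCC-based computation of FIRST: Tarjan-style depth-first traversal
    # of the contains-the-FIRSTs-of graph with the three set-computing
    # statements (init, union, distribute) added.
    def compute_first(first0, edges):
        DONE = len(first0) + 1               # larger than any stack depth
        low = {v: None for v in first0}      # None plays the role of "clean"
        first = {v: set(s) for v, s in first0.items()}           # init!
        stack = []

        def traverse(v):
            stack.append(v)
            low[v] = top1 = len(stack)       # 1-based stack depth
            for w in edges[v]:
                if low[w] is None:
                    traverse(w)
                first[v] |= first[w]                              # union!
                if low[w] < low[v]:
                    low[v] = low[w]
            if low[v] == top1:               # v is the root of an SCC
                while len(stack) >= top1:
                    w = stack.pop()
                    first[w] = set(first[v])                      # distribute!
                    low[w] = DONE

        for v in first0:
            if low[v] is None:
                traverse(v)
        return first

On the example grammar, compute_first(*first0_and_edges(EXAMPLE,
{'C', 'D', 'F'})) gives (unless I have miscalculated) FIRST(D) =
FIRST(E) = FIRST(F) = {d,e,f,g}, FIRST(B) = FIRST(C) = {b,c,d,e,f,g},
FIRST(A) = {a,b,c,d,e,f,g}, and FIRST(G) = {g}.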


The algorithm is appropriate because its basic algorithmic structure
corresponds to the structure of the FIRST-computation problem. I
hope I have made that clear enough so that you will agree.


Computing FOLLOW is structurally very similar except for the fact,
of course, that the initial sets FOLLOW0 are different from the sets
FIRST0, and that the graph-spanning relation contains-the-FOLLOWs-of
is different from contains-the-FIRSTs-of.
In other words, a completely different graph will be traversed.
Once you have understood the above, it shouldn't be hard to derive
how to compute FOLLOW adequately.
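

For completeness, here is my sketch of the standard construction of
those two ingredients in Python (again on top of the representation
used above; it assumes the FIRST sets and the nullable set have
already been computed):

    # Direct FOLLOW contributions and the contains-the-FOLLOWs-of
    # relation, read off the productions A -> ... B beta:
    # FOLLOW(B) gets FIRST(beta), and if beta can derive epsilon then
    # B contains-the-FOLLOWs-of A, i.e. FOLLOW(B) is a superset of FOLLOW(A).
    # (Parser generators usually also put an end-of-input marker into
    # FOLLOW of the start symbol; that is left out here.)
    def follow0_and_edges(grammar, nullable, first):
        follow0 = {A: set() for A in grammar}
        edges = {A: set() for A in grammar}
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, B in enumerate(rhs):
                    if B not in grammar:
                        continue              # only nonterminals have FOLLOW
                    rest_nullable = True
                    for X in rhs[i + 1:]:     # FIRST of the suffix after B
                        if X in grammar:
                            follow0[B] |= first[X]
                            if X not in nullable:
                                rest_nullable = False
                                break
                        else:
                            follow0[B].add(X)
                            rest_nullable = False
                            break
                    if rest_nullable:
                        edges[B].add(A)       # B contains-the-FOLLOWs-of A
        return follow0, edges

Feeding follow0 and these edges to the same compute_first traversal
sketched above then yields the FOLLOW sets, since that routine only
cares about initial sets and an edge relation.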


Some final remarks:


In case you compute FIRST sets for an LL(1) parser you can easily
extend the above algorithm to complain about left recursion, i.e.
cycles in the graph.


I chose the above example grammar such that the resulting graph
is the one used in the classic book
K.Mehlhorn: Graph Algorithms and NP-Completeness, Springer 1984
to study Tarjan's SCC algorithm with.


The idea to add set-computing statements to (a variant of) Tarjan's
SCC algorithm is explained in the beautiful article
F.DeRemer, T.Pennello: Efficient Computation of LALR(1) Look-Ahead
        Sets, TOPLAS 1982.


Tarjan's SCC algorithm has been published in
R.E.Tarjan: Depth-First Search and Linear Graph Algorithms, SIAM
        Journal on Computing 1972.


Tarjan's original article on SCC computation is hard to understand.
Maybe you prefer the presentation in
T.H.Cormen, C.E.Leiserson, R.L. Rivest: Introduction to Algorithms,
        The MIT Press, 1990.


Also, chapter 2 of
S.Sippu, E.Soisalon-Soininen: Parsing Theory, vol.1, Springer 1988
treats the computation of set-valued functions over graphs in detail.
It explains in a more mathematical fashion what I've only sketched
above for the "theory-aware practitioner".


Hope that helps you tackle your problems successfully.


Regards,
Soenke Kannapinn

