2 Jul 2003 00:41:23 -0400

Related articles |
---|

FIRST function computation problem. mefrill@yandex.ru (2003-06-22) |

Re: FIRST function computation problem. derkgwen@HotPOP.com (Derk Gwen) (2003-07-02) |

Re: FIRST function computation problem. torbenm@diku.dk (2003-07-02) |

Re: FIRST function computation problem. haberg@matematik.su.se (2003-07-02) |

Re: FIRST function computation problem. soenke.kannapinn@wincor-nixdorf.com (=?Windows-1252?Q?S=F6nke_Kannapinn?=) (2003-07-02) |

From: | =?Windows-1252?Q?S=F6nke_Kannapinn?= <soenke.kannapinn@wincor-nixdorf.com> |

Newsgroups: | comp.compilers |

Date: | 2 Jul 2003 00:41:23 -0400 |

Organization: | Siemens Business Services |

References: | 03-06-103 |

Keywords: | parse |

Posted-Date: | 02 Jul 2003 00:41:23 EDT |

*> I have a problem with FIRST function for 1 lookahead symbol evalation.*

*> To lookup such the symbols for some nonterminal A I am doning follows:*

*>*

*> 1. go through all rules that left part is A. Here I do following:*

*> a) If I have rule A --> aB, where a is terminal. I just add this*

*> terminal to lookahead set for A.*

*> b) If I have rule A --> e, where e is epsilon, I add e to lookahead*

*> set for A.*

*> c) If I have rule A --> B C ... where B - nonterminal, I at first*

*> compute the lookahead set for B, if B's lookahead set contains*

*> epsilon, check C and etc. I found that recursive function here is the*

*> best. But I do not know how to work with rules like follows: A --> B,*

*> B --> C A D... To compute lookahead set for A I have to at first*

*> compute lookaheads for B, but if lookaheads for C contain epsilon I*

*> need to know the same about A to understand do checking of D or*

*> not...*

*>*

*> Please, help to decide this problem and there may be the good FIRTS*

*> computaion algorithm?*

*>*

*> Thanks,*

*> Vladimir.*

I have commented on the computation of FIRST sets some time ago in

comp.compilers (http://compilers.iecc.com/comparch/article/01-04-079)

Unfortunately, when browsing old pure-ASCII articles as html from the

comp.compiler archive all "formatting" spaces seem to be messed up.

I therefore decided to add a copy of that old text again with

formatting spaces re-inserted. Hope it helps you understand the nature

of FIRST.

Regards,

Dr. Sönke Kannapinn

--

The computation of FIRST and FOLLOW sets of a grammar should, of

course, not be limited to special grammars (such as non-left-

recursive ones) because their formal definition is based simply on the

respective grammar's derivation relation which means that FIRST and

FOLLOW are always *defined* no matter how weird the grammar is. So,

if your algorithm fails to compute FIRST and/or FOLLOW correctly on

some grammars then, clearly, the algorithm is insufficient!

I'll describe an algorithm below which is adequate because its

structure will follows exactly the structure of the problem. This is

what's wrong with your approach. Unfortunately, I know only a single

compiler-related text book that treats the problem well (see the

trailer); most compiler books present algorithms I'd call

"appropriate" in the above sense. (Don't misunderstand that: I didn't

say they are wrong; I said they are inappropriate.) Usual solutions

apply some least-fixpoint-computation strategy that can roughly be

sketched as follows (I'll deal with FIRST only):

Compute obvious initializing sets of FIRST(A) for all nonterminals A.

REPEAT

Find all dependencies in the grammar where some set FIRST(A)

must obviously contain some set FIRST(B) for some nonterminals A

and B, and (re-)compute FIRST(A) := FIRST(A) + FIRST(B).

UNTIL, during a complete grammar examination, none of the set unions

computed has added any new elements

Clearly, there are unnecessary re-computations done in this approach

because of the REPEAT-recompute-everithing-UNTIL-no-more-change

pattern.

I'll try to explain now how to do better. The algorithm I'll sketch

assumes that the grammar is reduced.

I can't do completely without formulas, so let me use G=(V,T,P,S)

(V vocabulary, T termininals, P productions, S start symbol,

N=V\T nonterminals). Furthermore,

- greeks alpha, beta, ... are strings from V*,

- epsilon is the empty word,

- A,B,... are nonterminals from N, and

- a,b,... are terminals from T.

Then we have

FIRST(A) = { a in T : A =>* a alpha } and

FOLLOW(A) = { a in T : S =>* alpha A a beta }.

(Of course, => is the derivation relation, =>* its reflexive

transitive closure.)

Now, looking at the grammar productions we see that we can read

off direct and indirect contributions to sets FIRST(A):

Examining successively each p in P, we see

direct contributions:

If p is of the shape A -> alpha a beta where alpha =>* epsilon

then, of course,

a in FIRST(A).

indirect contributions:

If p is of the shape A -> alpha B beta where alpha =>* epsilon

then, of course,

FIRST(A) is a superset of FIRST(B).

Let's express this formally by a relationship

(A,B) in contains-the-FIRSTs-of, or, in infix notation:

A contains-the-FIRSTs-of B.

We'll interpret the relation contains-the-FIRSTs-of in N x N as

the edge relation of a directed graph with node set N.

If you think about it, loops in this graph express left recursion!

Here's an intricate example grammar:

A -> F C a | E A

B -> C G h

C -> D B a | b h | c | epsilon

D -> F e | epsilon

E -> G h | D f

F -> E A | d F | epsilon

G -> g

*>From the above, we know:*

FIRST0(A)={a}, FIRST0(B)={}, FIRST0(C)={b,c}, FIRST0(D)={e},

FIRST0(E)={f}, FIRST0(F)={d}, FIRST0(G)={g}, and

contains-the-FIRSTs-of = { (A,C), (A,E), (A,F), (B,C), (B,G), (C,B),

(C,D), (D,F), (E,D), (E,G), (F,E)}.

Drawn as a poor man's ASCII-graph this is:

G <------------

^ _______ |

| | | |

| | v |

| A-->F-->E--

| | ^ |

| v | |

B<--C-->D<--

| ^

| |

---

Let's assume that we know whether a nonterminal is nullable. Then,

both

- the initial sets FIRST0(A), reflecting the "direct contributions",

- and the relation contains-the-FIRSTs-of, reflecting exactly the

"indirect contributions" (subset-dependencies of FIRST sets),

can be computed in a single pass over the grammar productions in

time linear to the size of G.

Now one observes that, for each nonterminal A, any terminal a in

FIRST(A) indeed stems either from a direct contribution - i.e.

a in FIRST0(A) - or from an indirect contribution - i.e. a in

FIRST(B) for some nonterminal B, and A contains-the-FIRSTs-of A.

In other words, we can collectively characterize all FIRST sets

as the smallest sets FIRST(A) satisfying, for each A in N:

1.) FIRST(A) contains FIRST0(A), and

2.) for each nonterminal B, s.t. A contains-the-FIRSTs-of B:

FIRST(A) contains FIRST(B).

By the way, this is equivalent to

FIRST(A) = Union of { FIRST0(B) : A contains-the-FIRSTs-of* B }.

Now, what an appropriate algorithm must do in view of the

problem visualization as a graph is to compute the FIRST0 set of

each graph node ("init"), and to traverse the contains-the-FIRSTs-of

graph in depth-first manner unioning FIRST(B) into FIRST(A)

when returning from the traversal of en edge (A,B) ("union").

Well, and cycles must be treated correctly, i.e. all nodes

(nonterminals) along a cycle must finally get the same FIRST sets

assigned ("distribute"):

For example, for the cycle D->F->E->D we have

FIRST(D) contains FIRST(F) contains FIRST(E) contains FIRST(D),

so FIRST(D) = FIRST(F) = FIRST(E) = FIRST(D).

And there can be cycles in cycles...

Therefore, to do the minimum work required, an appropriate

algorithm must determine the structure of the contains-the-FIRSTs-of

graph w.r.t. cycles (or, to be precise, the "strongly connected

components" (SCCs) of the graph).

All this can be perfectly combined with Tarjan's brilliant algorithm

for the computation of a directed graph's SCCs. All we must do is

to add three set computation statements to Tarjan's algorithm, and

there we are.

Here is a sketch of an implementation-friendly variant of Tarjan's

algorithm computing what we need in case of FIRST (the three

set-computing statements added are marked by (*...!*) ).

(* asssume that nonterminals are INTEGERs 0, 1, ..., n-1 *)

(* and that sets FIRST0[0..n - 1] are already computed. *)

PROCEDURE ComputeFIRST;

CONST clean = - 1; done = n;

VAR Stack, Low: ARRAY [0 .. n - 1] OF INTEGER;

Top: INTEGER;

PROCEDURE Traverse(v: INTEGER);

VAR k, j, w, Top1: INTEGER;

BEGIN

INC(Top); Stack[Top] := v; Low[v] := Top; Top1 := Top;

FOREACH w such that "v contains-the-FIRSTs-of w" DO

IF Low[w] = clean THEN

Traverse(w);

FIRST[v] := FIRST[v] + FIRST[w]; (*union!*)

END;

IF Low[w] < Low[v] THEN Low[v] := Low[w] END

END;

IF Low[v] = Top1 THEN (* v root of SCC *)

WHILE Top >= Top1 DO w := Stack[Top];

FIRST[w] := FIRST[v]; (*distribute!*)

Low[w] := done; DEC(Top)

END

END

END Traverse;

VAR v: INTEGER;

BEGIN

FOR v := 0 TO n - 1 DO FIRST[v] := FIRST0[v] END; (*init!*)

Top := - 1;

FOR v := 0 TO n - 1 DO Low[v] := clean END;

FOR v := 0 TO n - 1 DO

IF Low[v] = clean THEN Traverse(v) END

END

END ComputeFIRST;

In general, this algorithm does no more unnecessary set

computations. The underlying graph depth-first graph traversal

algorithm visits each node (nonterminal) exactly once and traverses

each edge (contains-the-FIRSTs-of relationship) exactly once, and it

has linear run-time w.r.t. the size of the graph (or, here, the

grammar). You can hardly do any better.

The algorithm is appropriate because its basic algorithmic structure

corresponds to the structure of the FIRST-computation problem. I

hope I have made that clear enough so that you will agree.

Computing FOLLOW is structurally very similar except for the fact,

of course, that the initial sets FOLLOW0 are different from the sets

FIRST0, and that the graph-spanning relation contains-the-FOLLOWs-of

is different from contains-the-FIRSTs-of.

In other words, a completely different graph will be traversed.

Having understood the above, it shouldn't be hard to derive how

FOLLOW is computed adequately.

Some final remarks:

In case you compute FIRST sets for an LL(1) parser you can easily

extend the above algorithm to complain about left recursion, i.e.

cycles in the graph.

I chose the above example grammar such that the resulting graph

is the one used in the classic book

K.Mehlhorn: Graph Algorithms and NP-Completeness, Springer 1984

to study Tarjan's SCC algorithm with.

The idea to add set-computing statements to (a variant of) Tarjan's

SCC algorithm is explained in the beautiful article

F.DeRemer, T.Pennello: Efficient Computation of LALR(1) Look-Ahead

Sets, TOPLAS 1984.

Tarjan's SCC algorithm has been published in

R.E.Tarjan: Depth-First Search and Linear Graph Algorithms, SIAM

Journal on Computing 1972.

Tarjan's original article on SCC computation is hard to understand.

Maybe you prefer the presentation in

T.H.Cormen, C.E.Leiserson, R.L. Rivest: Introduction to Algorithms,

The MIT Press, 1990.

Also, chapter 2 of

S.Sippu, E.Soisalon-Soininen: Parsing Theory, vol.1, Springer 1988

treats the computation of set-valued functios over graphs in detail.

It explains more mathematically inclined what I've only sketched

above for the "theory-aware practitioner".

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.