Re: Question about Earley's algorithm

friwi@prosun.first.gmd.de (F.W. Schroeer)
5 Jul 1999 13:46:33 -0400

          From comp.compilers

Related articles
Question about Earley's algorithm scavadini@hotmail.com (Salvador V.Cavadini) (1999-06-29)
Re: Question about Earley's algorithm alan@ez0.ezlink.com (1999-07-01)
Re: Question about Earley's algorithm friwi@prosun.first.gmd.de (1999-07-05)
| List of all articles for this month |

From: friwi@prosun.first.gmd.de (F.W. Schroeer)
Newsgroups: comp.compilers
Date: 5 Jul 1999 13:46:33 -0400
Organization: Compilers Central
References: 99-06-113
Keywords: parse

Salvador Cavadini (salvador@ucseii.edu.ar) wrote:


> How does Earley's algorithm work with the following grammar?
>
> D -> D
> D -> id
>
> Secuence of tokens: id


Earley's algorithm is a recognizer, it accepts or rejects a token
sequence. When processing the input it constructs lists of "items"
that express progress in processing. These can be used to build a
parse tree. There are several algorithms to turn the recognizer into
a parser. One can select specific trees to handle ambiguities or one
can list all trees (in the example this would be an infinite number).
Earley's paper sketches a procedure to create a "factored
representation of all possible parse trees". If this representation
isn't processed "carefully", it would result into an endless loop for
the example.


I used the ACCENT compiler compiler (see below) to process this
example. In debug mode it reports the ambiguity by printing two
different derivation trees (it can do so when confronted with an
ambiguous input although the general question whether a grammar is
ambiguous is undecidable). One can then resolve the ambiguity by a
grammar annotation, e.g. specify that the second rule should be
preferred. The example is pathological in the sense that one cannot
demand that the first rule should always be preferred, because each
derivation has to use also the second rule once (in a situation where
also the first rule rule would be applicable again).


I will shortly sketch Earley's algorithm and then give a trace
(produced with ACCENT) how it works with the grammar.


The algorithm constructs an item list for each input token.
An item has the form
      M0 : M1 ... Mi * Mi+1 ... Mn
for a corresponding grammar rule
      M0 : M1 ... Mi Mi+1 ... Mn
An item is a grammar rule in processing.
The dot ("*") indicates that the next member should be recognized.


The start of the item list for the current token 't' is given by
items in the previous list that have the form
      M : ... * 't' ...
The dot is now placed behind 't'
      M : ... 't' * ...
to indicate that 't' has been recognized.


If the dot appears in front of a nonterm, the "predictor" is invoked,
it inserts items that start the processing of the member.


If the dot appears at the end of a rule, the "completer" is invoked,
it takes the item that caused the processing of this rule and
puts the dot after the corresponding nonterminal.


There is a "backpointer" to find items that triggered the actual one,
I do not discuss this here.


For technical reasons a new rule
    YYSTART -> S eof
is added, where S is the start symbol of the grammar.
Processing starts with the item
    YYSTART -> * S eof


Below I give a trace for processing the example.


The step marked by (*) shows that different parses are represented by
the same item. So even if the number of parse trees is infinite,
the item list is finite.


            We start with the item
ITEM 1: YYSTART : * D 'eof'
            We inspect item 1,
            the dot is in front of a nonterminal, so we apply the predictor
            this adds two new items
ITEM 2: D : * 'id'
ITEM 3: D : * D
            We inspect item 3 and apply the predictor
            this would add the item
ITEM 4: D : * 'id'
            but this already present in this item list
            and therefore it is not inserted again.
            The predictor would also add
ITEM 4: D : * D
            which is also already present
            and therefore is not inserted again.


            This completes the item list
ITEM 4: [ separator-item ]


            The next item list starts with items that are
            triggered by those items where the dot was in front of
            the current input symbol ('id')
ITEM 5: D : 'id' *
            We apply the completer for this item, since the dot is at the end
            this adds those rules where the dot appears after the D
            completer for item 5:
ITEM 6: YYSTART : D * 'eof'
ITEM 7: D : D *
            We apply the completer for item 7
            this would add
ITEM 8: YYSTART : D * 'eof'
            but such an item is already present (item 6)
            and hence it is not inserted
            item 6 was caused by item 5 ( D : 'id' * )
            which corresponds to the 'id' choice.
            the next trial to add the item was caused by by item 7 ( D : D * )
            which corresponds to the D choice.
            (*) See note above


            The completer applied to item 7 would also add
ITEM 8: D : D *
            but this is also already present
            and hence it is not inserted


            This completes this item list
            end of item list
ITEM 8: [ separator-item ]


            The next item list start with items that are triggered by
            those items where there dot was in front of the current
            input symbol ('eof')
ITEM 9: YYSTART : D 'eof' *
            We have recognized the input


ACCENT, A Compiler Compiler for the ENTire class of context-free languages,
is a system that combines Earley's approach and techniques from LL(1) parsing.
ACCENT can be used like YACC (it also cooperates with LEX and allows to
insert semantic actions). It provides both inherited and synthesized
attributes and supports the Extended Backus Naur Form for grammar
specifications (i.e. regular expression on the right hand side of rules).
ACCENT is intended for users who are familiar with the notion of grammar
but don't want to be confronted with parser implementation details,
e.g. implementors of domain specific languages who are experts in their
field but not necessary experts in resolving shift/reduce conflicts.
ACCENT accepts all grammars without any restrictions.
ACCENT parsers aren't as fast as YACC parsers, but they are practical
for most applications. The ACCENT Java parser processes 10,000 lines
of code in a few seconds. ACCENT will be freely available from
http://www.combo.org


Regards,
Friedrich Wilhelm Schroeer
friwi@first.gmd.de


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.