# Re: Question about Earley's algorithm

## friwi@prosun.first.gmd.de (F.W. Schroeer)5 Jul 1999 13:46:33 -0400

From comp.compilers

Related articles
Re: Question about Earley's algorithm friwi@prosun.first.gmd.de (1999-07-05)
| List of all articles for this month |

 From: friwi@prosun.first.gmd.de (F.W. Schroeer) Newsgroups: comp.compilers Date: 5 Jul 1999 13:46:33 -0400 Organization: Compilers Central References: 99-06-113 Keywords: parse

> How does Earley's algorithm work with the following grammar?
>
> D -> D
> D -> id
>
> Secuence of tokens: id

Earley's algorithm is a recognizer, it accepts or rejects a token
sequence. When processing the input it constructs lists of "items"
that express progress in processing. These can be used to build a
parse tree. There are several algorithms to turn the recognizer into
a parser. One can select specific trees to handle ambiguities or one
can list all trees (in the example this would be an infinite number).
Earley's paper sketches a procedure to create a "factored
representation of all possible parse trees". If this representation
isn't processed "carefully", it would result into an endless loop for
the example.

I used the ACCENT compiler compiler (see below) to process this
example. In debug mode it reports the ambiguity by printing two
different derivation trees (it can do so when confronted with an
ambiguous input although the general question whether a grammar is
ambiguous is undecidable). One can then resolve the ambiguity by a
grammar annotation, e.g. specify that the second rule should be
preferred. The example is pathological in the sense that one cannot
demand that the first rule should always be preferred, because each
derivation has to use also the second rule once (in a situation where
also the first rule rule would be applicable again).

I will shortly sketch Earley's algorithm and then give a trace
(produced with ACCENT) how it works with the grammar.

The algorithm constructs an item list for each input token.
An item has the form
M0 : M1 ... Mi * Mi+1 ... Mn
for a corresponding grammar rule
M0 : M1 ... Mi Mi+1 ... Mn
An item is a grammar rule in processing.
The dot ("*") indicates that the next member should be recognized.

The start of the item list for the current token 't' is given by
items in the previous list that have the form
M : ... * 't' ...
The dot is now placed behind 't'
M : ... 't' * ...
to indicate that 't' has been recognized.

If the dot appears in front of a nonterm, the "predictor" is invoked,
it inserts items that start the processing of the member.

If the dot appears at the end of a rule, the "completer" is invoked,
it takes the item that caused the processing of this rule and
puts the dot after the corresponding nonterminal.

There is a "backpointer" to find items that triggered the actual one,
I do not discuss this here.

For technical reasons a new rule
YYSTART -> S eof
is added, where S is the start symbol of the grammar.
Processing starts with the item
YYSTART -> * S eof

Below I give a trace for processing the example.

The step marked by (*) shows that different parses are represented by
the same item. So even if the number of parse trees is infinite,
the item list is finite.

ITEM 1: YYSTART : * D 'eof'
We inspect item 1,
the dot is in front of a nonterminal, so we apply the predictor
ITEM 2: D : * 'id'
ITEM 3: D : * D
We inspect item 3 and apply the predictor
ITEM 4: D : * 'id'
but this already present in this item list
and therefore it is not inserted again.
ITEM 4: D : * D
and therefore is not inserted again.

This completes the item list
ITEM 4: [ separator-item ]

The next item list starts with items that are
triggered by those items where the dot was in front of
the current input symbol ('id')
ITEM 5: D : 'id' *
We apply the completer for this item, since the dot is at the end
this adds those rules where the dot appears after the D
completer for item 5:
ITEM 6: YYSTART : D * 'eof'
ITEM 7: D : D *
We apply the completer for item 7
ITEM 8: YYSTART : D * 'eof'
but such an item is already present (item 6)
and hence it is not inserted
item 6 was caused by item 5 ( D : 'id' * )
which corresponds to the 'id' choice.
the next trial to add the item was caused by by item 7 ( D : D * )
which corresponds to the D choice.
(*) See note above

The completer applied to item 7 would also add
ITEM 8: D : D *
but this is also already present
and hence it is not inserted

This completes this item list
end of item list
ITEM 8: [ separator-item ]

those items where there dot was in front of the current
input symbol ('eof')
ITEM 9: YYSTART : D 'eof' *
We have recognized the input

ACCENT, A Compiler Compiler for the ENTire class of context-free languages,
is a system that combines Earley's approach and techniques from LL(1) parsing.
ACCENT can be used like YACC (it also cooperates with LEX and allows to
insert semantic actions). It provides both inherited and synthesized
attributes and supports the Extended Backus Naur Form for grammar
specifications (i.e. regular expression on the right hand side of rules).
ACCENT is intended for users who are familiar with the notion of grammar
but don't want to be confronted with parser implementation details,
e.g. implementors of domain specific languages who are experts in their
field but not necessary experts in resolving shift/reduce conflicts.
ACCENT accepts all grammars without any restrictions.
ACCENT parsers aren't as fast as YACC parsers, but they are practical
for most applications. The ACCENT Java parser processes 10,000 lines
of code in a few seconds. ACCENT will be freely available from
http://www.combo.org

Regards,
Friedrich Wilhelm Schroeer
friwi@first.gmd.de

Post a followup to this message