Object-oriented parsers

compres!chris@crackers.clearpoint.com (Chris Clark)
Mon, 2 Sep 91 09:36:43 EDT

          From comp.compilers

Related articles
Object-oriented parsers whatis@gnu.ai.mit.edu (1991-08-29)
Re: Object-oriented parsers kww@cblph.att.com (1991-08-29)
Re: Object-oriented parsers whatis@gnu.ai.mit.edu (1991-08-30)
Re: Object-oriented parsers adam@visix.com (1991-08-30)
Object-oriented parsers compres!chris@crackers.clearpoint.com (1991-09-02)
Re: Object-oriented parsers paj@gec-mrc.co.uk (1991-09-03)

Newsgroups: comp.compilers,comp.object
From: compres!chris@crackers.clearpoint.com (Chris Clark)
Keywords: yacc, parse, OOP
Organization: Compilers Central
References: 91-08-150
Date: Mon, 2 Sep 91 09:36:43 EDT

Steve Boswell made the claim that the object-oriented parser generators
Yacc++ and the Language Objects Library, or Bison++, are not very OOP
because they don't support a particular model of grammar development. I
think of the method he describes as object-oriented recursive-descent
parsing. It is a very natural method of merging oop with parsing, and was
an initial design approach we discussed for Yacc++. However, what Steve
asks for can easily be done with (even non-oop) parser generators. Yacc++
has oop features which address other problems. I further argue that oop
recursive-descent parsing is equivalent to [E]LL(k) methods unless (1) the
host language supports non-deterministic or lazy-parallel procedure calls
or something equivalent, or (2) the parser writers use an arbitrary
look-ahead technique in their parsing methods.


Kevin Wall (kww@cblph.att.com) wrote text marked with >
Steve Boswell (whatis@gnu.ai.mit.edu) wrote text marked with >>
Adam (adam@visix.com) wrote text marked with >>>
My notes refer to sections marked with [%d]


>> Lexical/syntactic/semantic style parsing, while affording division of
>> labor, is really only suited to procedural programming properties.
. . .
>>                          IDENTIFIER
>>
>>        LOCAL_VARIABLE  CLASS_ATTRIBUTE
>>                  \        /
>>                   \      /
>>                    V    V
>>      READ_ONLY    WRITABLE      FUNCTION  PROCEDURE
>>           \         /              \         /
>>            \       /                \       /
>>             V     V                  V     V
>>              VALUE                  OPERATION
>>                  \                    /
>>                   \                  /
>>                    V                V
>>                    IDENTIFIER_BASE                        [1]
>>
>> Thus, once an identifier is lexed, the symbol table can be consulted and
>> the proper descendent of IDENTIFIER_BASE can be created and attached to
>> IDENTIFIER (or returned by IDENTIFIER if IDENTIFIER is a pure parsing
>> class.) IDENTIFIER is allowed to know about all the descendents of
>> IDENTIFIER_BASE. [2]
>> Then dynamic binding on the resulting types can be used to weed through the
>> possibilities in whatever rule is being parsed. [3]
. . .
>> The parsing method of STATEMENT would have to tell whether an initial
>> IDENTIFIER starts an ASSIGNMENT or PROCEDURE_CALL. [4]
. . .
>> So by dynamic binding, the correct parsing method would be called. [5]


> I may be missing something here, but what I think you are saying is that
> the language's grammar for which you are building a parser is implemented
> via the class inheritance lattice!?! . . . [6]


>> Mostly, yes. It keeps the grammar implementation abstract and it's
>> very extendible (which is one of the main benefits of OOP anyways). [7]


> Maybe I'm just old fashioned but I think parser generators based on
> (extended) BNF notations (e.g., bison, yacc++) are a bit easier to change
> if all you want to do is tweak the grammar a bit.


>> I disagree. Shift/reduce and reduce/reduce errors are a pain in the
>> arse, especially when the semantics of the language could resolve the
>> ambiguity. [8]
>> Tweaking the grammar in big ways (e.g. adding a new type
>> of OPERATION called ONCE for Eiffel) is comparatively easy. [9]


> I think that you definitely would need an OOPL that was well suited to
> prototyping to be able to construct an "as-yet-undefined" language in
> this manner. (No doubt one reason you thought CLOS to be suitable.)


>> Actually, the only reason I suggested CLOS was that it allowed dynamic
>> binding on its arguments and not just the object. (I know C++ does
>> too, but I personally dislike C++.) This approach would also lend
>> itself to languages being extended/redefined, which languages tend to
>> do now and then. [10]


> I for one, find this a case of simply trying to shoe-horn a problem into
> an OOD. I always say (sometimes) "use the paradigm that fits the problem;
> don't try to fit the problem to the paradigm".


>>> Sorry, but I'll have to turn the accusation around. Simply because yacc
>>> and lex are so well understood, most language designers assume that they
>>> are the "best" paradigm to use. [11]


What Steve Boswell is arguing for [1-7] is what I think of as OOP
recursive-descent parsing. The model is basically this:


    Create a class for each terminal and non-terminal.
    Create methods for each non-terminal which correspond to each
    production that defines the non-terminal.
    Use the inheritance hierarchy to provide generalizations of
    the non-terminals and terminals.


The basic thrust here is that productions are mapped onto procedures
(methods) which parse the corresponding sentences. This is the
recursive-descent model.
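

To make the model concrete, here is a minimal C++ sketch of oop
recursive-descent parsing (C++ because that is my bias; every class,
method, and token name below is invented for illustration and comes from
no posted system):


// A minimal sketch of oop recursive-descent parsing.
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

enum class Kind { Identifier, Assign, LParen, RParen, Semi, End };
struct Token { Kind kind; std::string text; };

struct TokenStream {
    std::vector<Token> toks;
    std::size_t pos = 0;
    const Token& peek() const { return toks[pos]; }
    Token next() { return toks[pos++]; }
};

// One class per non-terminal; each parse() corresponds to the
// productions that define that non-terminal.
struct Statement {
    virtual ~Statement() = default;
    static std::unique_ptr<Statement> parse(TokenStream& ts);
};

struct Assignment : Statement {          // IDENTIFIER ":=" ... ";"
    static std::unique_ptr<Statement> parse(TokenStream& ts) {
        ts.next();                                // consume ":="
        while (ts.next().kind != Kind::Semi) { }  // skip the expression
        return std::make_unique<Assignment>();
    }
};

struct ProcedureCall : Statement {       // IDENTIFIER "(" ... ")" ";"
    static std::unique_ptr<Statement> parse(TokenStream& ts) {
        ts.next();                                // consume "("
        while (ts.next().kind != Kind::Semi) { }  // skip args and ")"
        return std::make_unique<ProcedureCall>();
    }
};

// STATEMENT picks between its productions by looking at the token
// after the leading IDENTIFIER -- one token of look-ahead.
std::unique_ptr<Statement> Statement::parse(TokenStream& ts) {
    ts.next();                           // consume the IDENTIFIER
    switch (ts.peek().kind) {
    case Kind::Assign: return Assignment::parse(ts);
    case Kind::LParen: return ProcedureCall::parse(ts);
    default: throw std::runtime_error("expected \":=\" or \"(\"");
    }
}


Note that Statement::parse commits to a production after one token of
look-ahead beyond the IDENTIFIER; that observation is the heart of the
equivalence argument later in this posting.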


How does one do this with a parser generator? One performs the mapping in
reverse. In particular, one can determine how to support the dynamic
binding Steve desires with the features provided in any parser generator
including yacc. Just define the non-terminals and terminals from the
classes, define the productions that correspond to the parser methods, and
define the inheritance hierarchy using productions. Here is how to do
Steve's example:


First, create a lexer which returns leaf types for the desired inheritance
hierarchy based upon some form of semantic analysis. With Yacc++ and the
Language Objects Library, this can be done with the following semantic
action code in the lexer.


IDENTIFIER : ('a'..'z')* ;
        { my_sym_ptr =
              yy_lookup(yy_lex_token(), yy_lex_len(), IDENTIFIER_);
          yy_lex_rdc() = yy_sym_type_this(my_sym_ptr);
        }


This code says to find the identifier object in the symbol table and cause
the lexer to return the type associated with that object. This is what
Steve asked for in [2]. This precisely matches the runtime lookup of an
object type within a class hierarchy. Different semantic models can be
implemented by changing how the return type is selected.
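

Outside of Yacc++, the same trick is only a few lines in the host
language. A hand-rolled C++ sketch (the names here are mine, not Library
calls):


#include <string>
#include <unordered_map>

// Leaf types of the identifier hierarchy, used as token codes.
enum TokenType { LOCAL_VARIABLE, CLASS_ATTRIBUTE, READ_ONLY,
                 FUNCTION, PROCEDURE, IDENTIFIER_ };

// A toy symbol table mapping names to their declared leaf type.
std::unordered_map<std::string, TokenType> symtab;

// After matching ('a'..'z')*, the lexer asks the symbol table which
// leaf of the hierarchy this name is and returns that token type, so
// the parser sees LOCAL_VARIABLE, FUNCTION, etc., not a bare IDENTIFIER.
TokenType classify(const std::string& name) {
    auto it = symtab.find(name);
    return it != symtab.end() ? it->second : IDENTIFIER_;
}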


Second, define the inheritance hierarchy [1,6] as a series of productions
in the parser. The way to read these productions is the class on the LHS
is a base class for the derived classes on the RHS. In this model,
multiple inheritance is easy: just list the derived class in the RHS for
each of its parent classes. For Steve's example [1] it would be:


       WRITABLE : LOCAL_VARIABLE | CLASS_ATTRIBUTE;
          VALUE : READ_ONLY | WRITABLE;
      OPERATION : FUNCTION | PROCEDURE;
IDENTIFIER_BASE : VALUE | OPERATION;


The reason this does exactly what Steve wants is certainly worth further
discussion, but not in this posting. It can provide semantic
disambiguation as he requests in [8].
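

To give one taste of that disambiguation (the production is mine, written
in the same notation): an assignment can demand a writable identifier on
its left-hand side, so once the lexer has typed the names, something like
read_only_name := x is rejected by the parser itself.


ASSIGNMENT : WRITABLE ":=" VALUE ";" ;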


Third, define the productions using the appropriate members of the
hierarchy. This is what Steve requires in [4].


PROCEDURE_CALL : PROCEDURE "(" VALUE ("," VALUE)* ")" ";" ;


When this grammar definition is run through a parser generator, the
generator determines which token types are legal at each stage of the
parse. The parser generator takes care of [3] and [5].


Steve argues in [7,9] that his model makes extension easier. Is it easier
to add a new class to an inheritance hierarchy than it is to add a
production to a grammar? He then brings up shift-reduce and reduce-reduce
conflicts and semantic disambiguation [8]. Here lies the crux of his easy
extension argument.


If the host language supports non-determinism, lazy-parallel evaluation
of procedures, unification, dynamic programming, or something similar,
then there may be an advantage to oop recursive-descent parsing.
Otherwise, oop recursive-descent parsing is essentially equivalent to
using an [E]LL(k) parser generator.


How do I justify that claim???


At runtime the code must call a parsing method which will parse the
sentence. In his example [4], he bases the parsing of the sentence upon
its first token (and presumably the current context). If all tokens are
read left-to-right (i.e. you always get tokens via get-next-token), this
is exactly equivalent to [E]LL(1) parsing. If you allow yourself to read
k tokens of the sentence before determining which production to apply, you
have [E]LL(k) parsing. If you allow yourself even more freedom, such as
reading to the end of the token stream and using a Turing machine to
decide which production to apply, you can get a much larger grammar class,
but that is essentially defining your way out of the problem by using a
more powerful method to begin with. [Since in most lisp dialects you can
construct calls with an arbitrary number of arguments at runtime, I
presume you can make CLOS (or whatever your implementation choice is) do
your arbitrary dispatch. Personally, I find it hard to imagine how you
would describe the parameter bindings for such variadic methods. I would
probably use regular expressions, but that shows my bias :-). But I must
admit, I'm a C++ person and do not know any of the oop extensions to
lisp.] Also, if you start depending on scanning ahead too much, you may
find it difficult to explain exactly what language your system supports.
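

Concretely, all that changes across these cases is the width of the
window the dispatching code may inspect before committing. A small
self-contained C++ sketch of the k-token window (names invented):


#include <cstddef>
#include <vector>

enum class Kind { Identifier, Assign, LParen, RParen, Semi, End };

// The dispatching method may inspect tokens[pos] .. tokens[pos+k-1]
// before committing to a production.  A fixed k gives [E]LL(k);
// letting k grow with the input is where the "defining your way out
// of the problem" begins.
struct Lookahead {
    const std::vector<Kind>& toks;
    std::size_t pos;
    Kind peek(std::size_t i) const {     // i ranges over 0 .. k-1
        return pos + i < toks.size() ? toks[pos + i] : Kind::End;
    }
};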


Unfortunately, aside from the arbitrary look-ahead via the runtime
variadic-call case, Steve's dynamic binding has not bought anything,
except perhaps some runtime efficiency and cleanliness of code.
Recursive-descent compilers based upon switching on the first k tokens are
a well-known technology. Some compiler writers like them. It is
certainly easy to hack them.


How does an [E][LA]LR(k) parser generator differ? When the generator runs,
it simulates calling the parsing methods in parallel. As long as there is
a token (within k tokens after the end of the production) which allows the
parser generator to make a deterministic decision on which parsing method
should have been called, the parser generator produces a conflict-free
parser. Thus, parser generators effectively support a deterministic form
of parallelism, and the class of languages they support is larger than the
LL class. The class of languages supported by LR(k) parser generators is
generally large enough for most syntax descriptions, particularly if you
allow the lexer the privilege of inspecting the symbol table before
determining the type of an object.


The "dreaded" conflicts come from grammars which are ambiguous. Some
methods for handling conflicts are well-known. The precedence directives
of most parser generators do a good job of handling most shift-reduce
conflicts. Kjell's recent proposal improves that for many dynamic cases.
Using regular-expression (ELR) grammars can help eliminate any
reduce-reduce conflicts. I suggest that you will find it is easier to
eliminate conflicts than to rewrite a grammar to LL(1) form, which you
need to do to use simple recursive-descent. Reason: all LL(1) grammars
are LR(1) grammars. [Note, Yacc++ fixes a problem found in most yacc
implementations of mid-production action code where yacc introduces
artificial reductions which can cause spurious conflicts. Yacc++ executes
mid-production action code as part of the shift and this reduces the
number of conflicts.]
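

For those who have not used the precedence directives: in plain yacc, a
deliberately ambiguous expression grammar parses deterministically once
the operators are ranked, with no rewriting of the productions:


%token NUMBER
%left '+' '-'
%left '*' '/'
%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | NUMBER
     ;


Without the two %left lines, yacc reports a shift-reduce conflict for
every rule/operator pair here; with them it reports none.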


I think the point of an abstract representation [7] is actually better
served by a non-procedural representation of the language to be parsed.
That is exactly what a grammar is. A grammar does not specify the method
used to parse it. If the grammar is LL(1), an LL parser, an LR parser, a
Turing machine, or some other mechanism could be used to parse it. A
parser generator is merely a way to convert the non-procedural
representation into a procedural representation. If the parser is written
as a series of procedures in a programming language, it is much harder to
switch to using a different parsing method. This is also my answer to
[11]. Grammars are the best solution because they are non-procedural and
capture exactly what we want to define. We usually find a well-written
ELR grammar to be one of the most concise, precise, and readable
definitions of a language.


The issues of extending a grammar [7,9,10] are certainly well worth
consideration. Here is where we feel oop features should be added to
parser generators. In particular, Yacc++ supports inheritance of
productions from one language to another. This makes it easy to build
experimental extensions of a language by simply deriving the extended
language from the base language and then changing the appropriate
productions in the extended grammar.


We feel that the appropriate use of dynamic binding in an object-oriented
lexer/parser system is in allowing lexer or parser definitions to be
switched at runtime, and this is the model that Yacc++ supports. This
allows a system to easily support a family of dialects, such as a base
language and its descendants. It can also facilitate parsing streams
which come from more than one source (as in a distributed application).
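

The general shape of that model, as a C++ sketch (my interface for
illustration, not the actual Yacc++ class library): the parser is an
object behind a virtual interface, so the grammar in use can be chosen
per stream, at runtime.


#include <istream>

// An abstract parser interface; the base language and each dialect
// would be generated classes implementing it.
struct Parser {
    virtual ~Parser() = default;
    virtual void parse(std::istream& in) = 0;
};

struct BaseLanguageParser : Parser {
    void parse(std::istream&) override { /* generated tables go here */ }
};

struct DialectParser : Parser {
    void parse(std::istream&) override { /* generated tables go here */ }
};

// Because the parser is reached through a virtual call, a stream from
// one source can be handed to a different grammar than a stream from
// another, and the choice can change while the system runs.
void handle(std::istream& in, Parser& grammar) { grammar.parse(in); }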


Chris Clark                 email : crackers.clearpoint.com!compres!chris
Compiler Resources, Inc.    voice : (508) 435-5016
3 Proctor Street            fax   : (508) 435-4847
Hopkinton, MA 01748

