newbie grammar/parser/interpreter questions

craigh@cserver.plexus.com (Craig Heilman)
Thu, 7 May 1992 21:38:01 GMT

          From comp.compilers

Related articles
newbie grammar/parser/interpreter questions craigh@cserver.plexus.com (1992-05-07)
Re: newbie grammar/parser/interpreter questions parrt@ecn.purdue.edu (1992-05-08)
experience with code transformation systems? unisoft@tigger.jvnc.net (1992-05-09)
| List of all articles for this month |
Newsgroups: comp.compilers
From: craigh@cserver.plexus.com (Craig Heilman)
Keywords: parse, question
Organization: Compilers Central
Date: Thu, 7 May 1992 21:38:01 GMT

Hi,


I'm in the process of writing a compiler (actually an interpreter) for a
psuedo-object-oriented language. I have been reading a book "Compiler
Design in C" by A. I. Holub and have learned a lot but I still have some
implementation-specific questions. My basic goal is to create a correct
grammar for my language and from there determine whether I will hard-code
the parser/interpreter or rely on a tool like yacc or bison to generate a
parser for me. I also have to figure out how to "generate code" that can
be executed right after a valid parse. This all happens internal to my
device which has no disk space, minimal RAM (although I do have dynamic
memory) and a slow processor (80x86 family). Implementation languages are
C & assembly.




CURRENT STATUS


I have a working lexical analyzer that currently can recognize all the
discrete constructs of my language. I also have a preliminary recursive-
descent parser that can handle most of the syntax of my language.
Unfortunately there are a few precedence rules I forgot about, thus my
parser is now breaking on valid language statements. My parser currently
builds an actual parse tree when it parses a language statement. I
thought I would be able to just traverse the tree and call some internal
routines, effectively executing the language statement. I thought the
order of traversal (preorder, inorder, postorder) and the structure of the
tree would handle the precedence rules but...




LANGUAGE BASICS


A typical expression looks like this:


target message argument


(ie. in C++ this would be: target.message(argument); )


Nested subexpressions for the target and/or argument are valid using
parenthesis (any level of nesting is allowed). In the example below, the
first subexpression must evaluate to a valid target (at execution time)
and the second subexpression must evaluate to the type of argument that
message2 is expecting (again at execution time).


(target1 message1 argument1) message2 (target3 message3 argument3)


Nested subexpressions for the target and/or argument are also valid using
the precedence associated with the message type (3 types of messages):


target1 message1 target?/arg? message2 argument2


if message2 has higher precedence than message1, the above
expression will be interpreted as if it were written like so:


target1 message1 (target2 message2 argument2)


if message1 has higher precedence than message2, the above
expression will be interpreted as if it were written like so:


(target1 message1 argument1) message2 argument2




Grammar Specification


A simplified grammar for my language is as follows. Please note that this
is my first attempt at creating a grammar so constructive criticism is
welcome.


Notes: Terminals are capitalized or single-quoted and an empty string is
represented by E. The end-of-input marker "|-" is implicit in the first
production. Rules lower in a production have higher precedence.


A. "Statements" consist of one or more "expression's" separated by periods.


statement -> expression
| expression '.' statement


B. There are two types of expressions:


expression -> msg_expression
| block


msg_expression -> object_spec msg


block -> '[' statement ']'


      Comparing the msg_expression to the typical language expression
      described above, an object_spec corresponds to a target and a msg
      corresponds to the message and argument combined. A block is just
      a delimited statement that may be treated as a discrete object (ie.
      one can send messages to a block like "execute yourself").


C. There can be several msg's following an object_spec, each separated
      by semicolons ';'. An optional receiver may also be specified for
      each msg_spec.


msg -> receiver msg_spec
| receiver msg_spec ';' msg


receiver -> RECEIVER_SUPERCLASS
| E


D. An object_spec (or "target") may have an optional unit_spec in front
      of the actual object description (object_desc). We also take into
      account the possibility of a parenthesized subexpression here. Within
      a unit_spec, there may be several subunit_desc's. Note that if the
      object_desc is empty, the unit_spec forms the object that is the
      target of the message.


object_spec -> unit_spec object_desc
| '(' expression ')'


unit_spec -> unit_desc subunit_desc


unit_desc -> UNIT_CLASS UNIT_FUNCTION_ID unit_number


unit_number -> SSHORTINTEGER_CLASS SSHORTINTEGER_DATA
| E


subunit_desc -> subunit_desc' subunit_desc
| E


subunit_desc' -> SUBUNIT_CLASS SUBUNIT_FUNCTION_ID subunit_number


subunit_number -> SSHORTINTEGER_CLASS SSHORTINTEGER_DATA
| E


object_desc -> OBJECT_CLASS_NAME OBJECT_FUNCTION_ID
| E


E. A msg_spec can be one of three types of messages:


msg_spec -> KEYWORD_ID object_value
| ARITHMETIC_ID object_value
| SIMPLE_ID


      The precedence of the three types of messages is as shown, ie. a
      simple message has higher precedence than an arithmetic message
      which has higher precedence than a keyword message. Unfortunately
      my parser implementation doesn't handle this precedence correctly
      without using parenthesis. Note that a simple message doesn't
      take an argument (object_value). Also of importance is the fact
      that the message tokens (KEYWORD_ID, ...) are mutually exclusive
      with the tokens for the classes (SUBUNIT_CLASS, ...).


F. An object_value can be a discrete value or it could be a block
      (which is really just a special type of object) or it could be a
      subexpression. The data is just the actual value of an object, for
      instance a short integer '3' is represented by SSHORTINTEGER 3 (of
      course there is a unique code representing SSHORTINTEGER...).


object_value -> GENERAL_OBJECT_CLASS data
| block
| '(' expression ')'




QUESTIONS


1. Can this grammar be modified to include the precedence of messages
      as outlined in section E above, or would it be better to just rewrite
      the grammar from scratch?


2. Considering my hardware constraints, would it be prudent to hard-code
      the parser as I have done to date, or should I rely on yacc or bison
      generated code. (I played with yacc a bit and the output was pretty
      unreadable...). I'm somewhat worried about future maintenance.


3. If I hard-code the parser, should I switch to one-of the table-driven
      approaches instead of the recursive-descent approach. I found writing
      a recursive-descent parser to be fairly simple and it's very easy to
      understand and debug. I'm running it with a simple test program on my
      workstation - haven't ported to target hardware yet so efficiency hasn't
      been an issue, but that may become a concern. My parser does a top-down
      parse - should I switch to a bottom-up approach?


4. What would be an appropriate way to represent a statement that has
      been parsed correctly so that it may then be executed? In my parser
      I currently create a tree structure that represents the statement.
      Each node in the tree represents a non-terminal or terminal in my
      grammar. Nodes with data (terminals) include the token and lexeme
      representing that data. Nodes without data just include the token.


      Would it be better to use some sort of stack structure, or do I need
      an internal language? I need to be able to execute a statement
      immediately upon completion of a parse. Given a statement with the
      basic form shown above (target message argument) the internal execution
      would proceed as follows: (1) look up target in object table - an
      object must be created (in parser) if it doesn't already exist in the
      table, (2) look up message code in method table for that type of object
      - unless there is a "receiver" specified, in which case you look up the
      method code in the receiver's method table, (3) look up argument in the
      object table, (4) make the "procedure call" like target.message(argument).


That's it... To anyone who has taken the time to read this and anyone who
can give me a little feedback - I thank you.


Craig
--
Craig A. Heilman, craigh@cserver.plexus.com
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.