Re: Homegrown java parser ?!?

clkfa32y4001@sneakemail.com (alan r)
16 Feb 2002 01:21:40 -0500

          From comp.compilers

Related articles
Homegrown java parser ?!? bat_fastard_2000@yahoo.com (bf.) (2002-02-06)
Re: Homegrown java parser ?!? pfroehli@ics.uci.edu (Peter H. Froehlich) (2002-02-16)
Re: Homegrown java parser ?!? clkfa32y4001@sneakemail.com (2002-02-16)
| List of all articles for this month |
From: clkfa32y4001@sneakemail.com (alan r)
Newsgroups: comp.compilers
Date: 16 Feb 2002 01:21:40 -0500
Organization: http://groups.google.com/
References: 02-02-026
Keywords: Java, parse
Posted-Date: 16 Feb 2002 01:21:40 EST

I'll give you one technique I've used for hand-coding a parser. This
will be a fair amount of work for a language the size of Java but once
you've done this a few times it could probably be done in a week or
three. The alternative would be to use a parser generator.


First you need a specification of the rules of the language you are
trying to process, which is written in a language called BNF. For java
I found a suitable one somewhere at www.javasoft.com back when I was
writing a parser for a java-like language.


Then you write a piece of code for each piece of the specification
(called a production). You code each one exactly the same way , and in
the end you will be able to parse any legal source into a parse tree.


Lets take a very simple example. Suppose you have a language that
consists of numbers or binary expressions which can be wrapped in
parentheses. The BNF might appear as follows:


Program := expr
expr := primary { binop expr}
binop := + | - | * | /
primary := number | parenExpr
parenExpr := ( expr )


This says: a Program consists of an expr. An Expr consists of a
primary, optionally followed by a binop and another expression. A
binop consists of one of +-*/. A primary is either a number or a
parenExpr. A parenExpr is a ( token followed by an expr followed by a
) token.


Next I would create a class hierarchy something like:


ProgramNode
    Program
    Expression
          Primary
              Token
                  BinaryOperator
                  Number
              ParenExpression
          BinaryExpression


(The hierarchy in this example which I just cooked up has some
problems, eg a BinaryOperator is not really a Primary). The solution
is dont use inheritance to indicate "Primary-ness", rather code
isPrimary() in suitable places. But this is just for illustrative
purposes).


Next, you teach each class how to recover (deserialize) an instance of
itself from a Token stream. I like to use 2 streams. The first stream
is on the raw input such as a string. You teach Token the method (you
already have this)


static Token from(characterStream)


Then you create another class say Scanner. This object encapsulates a
characterStream on the input data, and a readWrite stream holding
Tokens produced from the character stream. Saying next() to a Scanner
gives you a Token.


Then you teach each of the possible nodes in the language (start from
the most elemental so you can unit test) to recognize itself
according to the rules. 2 examples:


Class Program {
        Expression e;
// Rule: Program := expr
static Program from(Scanner s) {
          Expression e := Expression.from(s);
          if (e == null) return null;
          return new Program(e);}}


Class Expression {
// Rule: expr := primary { binop expr}
static Expression from(Scanner s) {
          Primary p = Primary.from(s);
          if (p == null) return null;
          int pos = s.position() // remember position in case failure
          if ( ! s.peek().isBinop()) return p; // found simple primary
          Token binop = s.next(); // consume binop from scanner
          Expression expr = Expression.from(s);
          if ( expr == null) {
                  // oops we found primary binop but without the rightside expr
                  // backup s to before the binop and return p
// alternatively we could just return null here
                  s.position(pos);
                  return p}
            else { return new BinaryExpression(p,binop,expr) }}}


and so on. Note, as above, that each recognizer is responsible for
repositioning the scanner's token pointer if the recognition fails.


Then you say Program.fromString(aProgramString)
which would be implemented as:


Program.from( new Scanner(aProgramString))


if aProgramString is a legal program you'll get back a Program node
else a null.


No doubt the above code is rife with errors as I just typed it up for
this answer and it's not been anywhere near a compiler. I left a lot
unexplained and there are a lot of enhancements possible but hopefully
it's enough to explain the basic technique.


alan reider






On 6 Feb 2002 23:40:26 -0500, "bf." <bat_fastard_2000@yahoo.com>
wrote:


>I'm currently involved with a project that involves analysis of Java
>source code. To achieve this, we have decided that we need to parse
>each source file and load it into a tree representing its progmatic
>structure. We can then make better comparisons between files.
>
>So far, we have written a tokeniser that breaks a source file down by
>seperating its components. This scans a file looking for certain
characters
>which mark tokens, and then creates an array. ([public] [class] [x]
[{]
>[public] [method] [(] [int] [name] [,] [int] [name] [)] ... )
>
>We have written a basic language node, and extended it to some basic
>language constructs such as import, class, method, declaration. Of
course
>as the system grows, more of these will be created.
>
>Our problem is how to proceed. How do we convert a series of tokens
into a
>tree of language nodes ?
>
>[public] [static] [class] [x] [{] ..... [}]
>becomes Node_Class (name=x, public=yes, static=yes)
>
>and of course this node will need a pointer to child nodes of the
class.
>
>We are not writing a compiler, so we don't need to verify syntax. But
>otherwise
>I think (?) we're doing similar things to a compiler.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.