Related articles |
---|
Homegrown java parser ?!? bat_fastard_2000@yahoo.com (bf.) (2002-02-06) |
Re: Homegrown java parser ?!? pfroehli@ics.uci.edu (Peter H. Froehlich) (2002-02-16) |
Re: Homegrown java parser ?!? clkfa32y4001@sneakemail.com (2002-02-16) |
From: | clkfa32y4001@sneakemail.com (alan r) |
Newsgroups: | comp.compilers |
Date: | 16 Feb 2002 01:21:40 -0500 |
Organization: | http://groups.google.com/ |
References: | 02-02-026 |
Keywords: | Java, parse |
Posted-Date: | 16 Feb 2002 01:21:40 EST |
I'll give you one technique I've used for hand-coding a parser. This
will be a fair amount of work for a language the size of Java but once
you've done this a few times it could probably be done in a week or
three. The alternative would be to use a parser generator.
First you need a specification of the rules of the language you are
trying to process, which is written in a language called BNF. For java
I found a suitable one somewhere at www.javasoft.com back when I was
writing a parser for a java-like language.
Then you write a piece of code for each piece of the specification
(called a production). You code each one exactly the same way , and in
the end you will be able to parse any legal source into a parse tree.
Lets take a very simple example. Suppose you have a language that
consists of numbers or binary expressions which can be wrapped in
parentheses. The BNF might appear as follows:
Program := expr
expr := primary { binop expr}
binop := + | - | * | /
primary := number | parenExpr
parenExpr := ( expr )
This says: a Program consists of an expr. An Expr consists of a
primary, optionally followed by a binop and another expression. A
binop consists of one of +-*/. A primary is either a number or a
parenExpr. A parenExpr is a ( token followed by an expr followed by a
) token.
Next I would create a class hierarchy something like:
ProgramNode
Program
Expression
Primary
Token
BinaryOperator
Number
ParenExpression
BinaryExpression
(The hierarchy in this example which I just cooked up has some
problems, eg a BinaryOperator is not really a Primary). The solution
is dont use inheritance to indicate "Primary-ness", rather code
isPrimary() in suitable places. But this is just for illustrative
purposes).
Next, you teach each class how to recover (deserialize) an instance of
itself from a Token stream. I like to use 2 streams. The first stream
is on the raw input such as a string. You teach Token the method (you
already have this)
static Token from(characterStream)
Then you create another class say Scanner. This object encapsulates a
characterStream on the input data, and a readWrite stream holding
Tokens produced from the character stream. Saying next() to a Scanner
gives you a Token.
Then you teach each of the possible nodes in the language (start from
the most elemental so you can unit test) to recognize itself
according to the rules. 2 examples:
Class Program {
Expression e;
// Rule: Program := expr
static Program from(Scanner s) {
Expression e := Expression.from(s);
if (e == null) return null;
return new Program(e);}}
Class Expression {
// Rule: expr := primary { binop expr}
static Expression from(Scanner s) {
Primary p = Primary.from(s);
if (p == null) return null;
int pos = s.position() // remember position in case failure
if ( ! s.peek().isBinop()) return p; // found simple primary
Token binop = s.next(); // consume binop from scanner
Expression expr = Expression.from(s);
if ( expr == null) {
// oops we found primary binop but without the rightside expr
// backup s to before the binop and return p
// alternatively we could just return null here
s.position(pos);
return p}
else { return new BinaryExpression(p,binop,expr) }}}
and so on. Note, as above, that each recognizer is responsible for
repositioning the scanner's token pointer if the recognition fails.
Then you say Program.fromString(aProgramString)
which would be implemented as:
Program.from( new Scanner(aProgramString))
if aProgramString is a legal program you'll get back a Program node
else a null.
No doubt the above code is rife with errors as I just typed it up for
this answer and it's not been anywhere near a compiler. I left a lot
unexplained and there are a lot of enhancements possible but hopefully
it's enough to explain the basic technique.
alan reider
On 6 Feb 2002 23:40:26 -0500, "bf." <bat_fastard_2000@yahoo.com>
wrote:
>I'm currently involved with a project that involves analysis of Java
>source code. To achieve this, we have decided that we need to parse
>each source file and load it into a tree representing its progmatic
>structure. We can then make better comparisons between files.
>
>So far, we have written a tokeniser that breaks a source file down by
>seperating its components. This scans a file looking for certain
characters
>which mark tokens, and then creates an array. ([public] [class] [x]
[{]
>[public] [method] [(] [int] [name] [,] [int] [name] [)] ... )
>
>We have written a basic language node, and extended it to some basic
>language constructs such as import, class, method, declaration. Of
course
>as the system grows, more of these will be created.
>
>Our problem is how to proceed. How do we convert a series of tokens
into a
>tree of language nodes ?
>
>[public] [static] [class] [x] [{] ..... [}]
>becomes Node_Class (name=x, public=yes, static=yes)
>
>and of course this node will need a pointer to child nodes of the
class.
>
>We are not writing a compiler, so we don't need to verify syntax. But
>otherwise
>I think (?) we're doing similar things to a compiler.
Return to the
comp.compilers page.
Search the
comp.compilers archives again.