Re: How to organize the data structure in a parser?

j1k1cki@hotmail.com
24 Jan 2005 11:28:04 -0500

          From comp.compilers

Related articles
How to organize the data structure in a parser? gnu04@yahoo.com (Andy) (2005-01-22)
Re: How to organize the data structure in a parser? mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2005-01-24)
Re: How to organize the data structure in a parser? j1k1cki@hotmail.com (2005-01-24)
Re: How to organize the data structure in a parser? cppljevans@cox-internet.com (Larry Evans) (2005-01-25)
Re: How to organize the data structure in a parser? gnu04@yahoo.com (Andy) (2005-01-30)
| List of all articles for this month |

From: j1k1cki@hotmail.com
Newsgroups: comp.compilers
Date: 24 Jan 2005 11:28:04 -0500
Organization: http://groups.google.com
References: 05-01-072
Keywords: OOP, parse
Posted-Date: 24 Jan 2005 11:28:04 EST

Andy wrote:
> I am trying to design a parser for C program using C++. Currently
> what I did for syntax tree is to design a class for each nontermials
> in the grammar, and use inherentance to link them. For example, as
> for the expression part, the classes are something like the follows:


> ....
> class MultiplicativeExpr : public AdditiveExpr
> {
> MultiplicativeExpr * multi_expr;
> int op_type;
> PmExpr * pm_expr;
> };
>
> class AdditiveExpr : public ShiftExpr
> {
> int op_type;
> AdditiveExpr * additive_expr;
> MultiplicativeExpr * multi_expr;
> };
>
> class ShiftExpr : public RelationalExpr
> {
> int op_type;
> ShiftExpr * shift_expr;
> AdditiveExpr * additive_expr;
> };
>
> class RelationalExpr : public EqualityExpr
> {
> int op_type;
> RelationalExpr * relational_expr;
> ShiftExpr * shift_expr;
> };
>
> ....
>
> I noticed that the bad thing about this solution is that using
> inheritance, the children in the leaf of inheritance tree will be of
> large size because it comprised of all data fields of its ancestors.
> How can I organize the data structure more efficienty? Thanks a lot!


0-th question is: why do this? There are existing projects that
accumulated lots of work in this area. If you want an exercise in
parsing, join one of them. If you need a parser done, reuse one of
them. I particularily recommend opencxx.sf.net , it implements lexer,
parser and static analyzer of C++ in C++ itself.


That being said the first question is do you really want to implement
the Standard grammar in your syntax tree? Perhaps instead of


multiplicative-expression:
pm-expression
multiplicative-expression * pm-expression
multiplicative-expression / pm-expression
multiplicative-expression % pm-expression


additive-expression:
multiplicative-expression
additive-expression + multiplicative-expression
additive-expression ?- multiplicative-expression


you can use


expression:
pm-expression
expression * expression
expression / expression
expression % expression
expression + expression
expression - expression


For many applications (e.g. evaluation) such tree would be more
convenient.


In fact you have already altered the grammar; your attempt seems better
aligned with this grammar:


multiplicative-expression:
pm-expression
multiplicative-expression mult-op pm-expression


mult-op:
%
/
*


additive-expression:
multiplicative-expression
additive-expression add-op multiplicative-expression


add-op
+
-


Second, I don't think there is a benefit in employing inheritance the
way you do. See [1] for good guidance on (ab)using inheritance.


Context free grammars have straightforward translation into
implementations of Composite design pattern (cf. [2]). Grammar can be
seen as a set of productions:


LHS -> T1 T2 ... TN /* NAME */


where LHS is nonterminal, and T1...TN are terminals and/or
nonterminals, and NAME is a name of this production. Given a grammar
one can define an abstract base class for each nonterminal derive
concrete classes from it, one class for each production of a given
nonterminal.


Let me rewrite your grammar to shorten the nonterminals' names and
introduce productions' names:


MultExpr -> PmExpr /* MultExprSimple */
MultExpr -> MultExpr MultOp PmExpr /* MultExprBinary */
MultOp -> '*'
MultOp -> '/'
MultOp -> '%/
AddExpr -> MultExpr /* AddExprMult */
AddExpr -> AddExpr AddOp MultExpr /* AddExprBinary */
AddOp -> '+'
AddOp -> '-'


This grammar translates to


struct MultExpr { ... };
struct MultExprSimple : public MultExpr {
PmExpr* pm;
};
struct MultExprBinary : public MultExpr {
MultExpr* left; MultOp* op; MultExpr *right;
};
struct AddExpr { ... };
struct AddExprMult : public AddExpr {
MultExpr* mult;
};
struct AddExprBinary : public AddExpr {
AddExpr* left; AddOp* op; MultExpr* right;
};


// Virtual destructors in base classes and all functions
// implemented in-tree (e.g. PrintOn()) or visitation support
// (e.g. Accept()) are omitted for brevity.


In this implementation inheritance serves as a way to implement
discriminated union (co-product) of alternatives defined by multiple
productions of the same nonterminal.


In most applications trees rooted in AddOp and AddMult nodes need not
carry any modifiable information. Moreover, they are always of the same
height, so it makes a lot of sense to reimplement them using just a
value
type, e.g.:


struct AddExprBinary : public AddExpr {
AddExpr* left; AddOp op; MultExpr* right;
};


// where AddOp is a value type, e.g. int, char, enum, or user-defined
// type with value semantics.


The most twisted point of such approach is the existence of productions


having only one nonterminal on rhs (linear productions), e.g.


AddExpr -> MultExpr /* AddExprMult */


the trick is that instead of implementing AddExprMult class
carying just a pointer to MultExpr:


struct AddExprMult : public AddExpr {
MultExpr* mult;
};


+-------+
|AddExpr|
+-------+
A
+-----+-------+
| |
+--------+ +-----------+ ...
|MultExpr|<----|AddExprMult|
+--------+ +-----------+
A
...




one may just move derivatives of MultExpr to become derivatives
of AddExpr by collapsing MultExpr and AddExprMult classes:




+-------+
|AddExpr|
+-------+
A
+-----+-------+
| |
+---------+ ...
|MultExpr |
+---------+
A
...


After such transformation implementation of the tree would look like
this:


struct AddExpr { ... };
struct AddExprBinary : public AddExpr {
AddExpr* left; AddOp* op; MultExpr* right;
};
struct MultExpr : public AddExpr { ... }
struct PmExpr : public MultExpr { ... };
struct MultExprBinary : public MultExpr {
MultExpr* left; MultOp* op; MultExpr *right;
};


This transformation is neither always possible (e.g. cannot be fully
applied in grammar having both "A->B" and "B->A" productions), nor
always desirable (my experience is that it obfuscates visitation).


The above way of implementing syntax tree is by no means universal or
best. There are other ways, e.g. generic implementation of a tree
using only untyped binary nodes. I also know at least two hybrid
aproaches, that mix generic, untyped implementation with some form of
Composite pattern (W3C DOM and OpenC++ AST).


HTH
Grzegorz


[1] Sutter "Exceptional C++" has in-depth discussion on when (and when
not) to use public inheritance in C++. See also Robert Martin's papers
on Liskov Subsitution Principle (available on the net or in print in
"Agile Software Development").


[2] Gamma, Helm, Johnson, Vlissides "Design Patterns", especially
chapters about Composite and Visitor.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.