Thu, 3 Jan 2013 13:33:50 -0800 (PST)

Related articles |
---|

[2 earlier articles] |

Re: Compiling expressions mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2012-12-30) |

Re: Compiling expressions james.harris.1@gmail.com (James Harris) (2013-01-02) |

Re: Compiling expressions james.harris.1@gmail.com (James Harris) (2013-01-02) |

Re: Compiling expressions matzebraun@googlemail.com (matzebraun@googlemail.com) (2013-01-03) |

Re: Compiling expressions torbenm@diku.dk (2013-01-03) |

Re: Compiling expressions james.harris.1@gmail.com (James Harris) (2013-01-03) |

Re: Compiling expressions james.harris.1@gmail.com (James Harris) (2013-01-03) |

Re: Compiling expressions mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2013-01-04) |

Re: Compiling expressions james.harris.1@gmail.com (James Harris) (2013-01-06) |

Re: Compiling expressions vonbrand@inf.utfsm.cl (Horst von Brand) (2013-01-14) |

Re: Compiling expressions james.harris.1@gmail.com (James Harris \(es\)) (2013-03-07) |

From: | James Harris <james.harris.1@gmail.com> |

Newsgroups: | comp.compilers |

Date: | Thu, 3 Jan 2013 13:33:50 -0800 (PST) |

Organization: | Compilers Central |

References: | 12-12-035 |

Keywords: | parse |

Posted-Date: | 04 Jan 2013 11:41:07 EST |

On Dec 29 2012, 1:11 pm, James Harris <james.harri...@gmail.com>

wrote:

...

*> 1. Hand-written, not the output of a parser generator.*

*> 2. Efficient and without backtracking.*

*> 3. Precedences (and possibly associativities) defined in tables.*

*> 4. Output to be a tree structure.*

*> 5. Parenthesised subexpressions allowed.*

*> 6. Some operator families are *not* to associate with each other. See*

*> below.*

*> 7. Monadic prefix, dyadic infix and monadic postfix operators are all*

*> allowed.*

*> 8. Prefix and infix operators can use some same symbols (e.g. minus*

*> sign).*

*>*

*> Infix and postfix operators use distinct symbols.*

...

Here is an idea for an expression parser to try to address the points

mentioned. I am not sure if it covers all the bases yet. For some

reason, although the requirement "parse an expression" is ostensibly

straightforward, when precedence and three types of operators are

thrown in I have been finding it very hard to reduce to easily

understandable concepts.

I should say what concepts the proposal is based on:

* An expression has exactly two key types of elements which I am

calling values and operators. The operators are the normal types of

operator symbols: prefix, infix and postfix. The values are

identifiers and literals. A subexpression in parentheses is also a

value.

* A value can have an arbitrary number of prefix operators to its left

and an arbitrary number of postfix operators to its right.

* A prefix operator applies not just to the value immediately to its

right but to an arbitrary expression to its right as far as an

operator with lower precedence. For example,

-a.b+c

The unary minus in this case applies not just to the variable, a, but

to the subexpression a.b and terminates at the + sign. The dot here is

a binding operator which has a higher precedence than unary minus.

* Parentheses can appear in two contexts. Before a value (i.e. when

scanning for a value) a left paren is regarded as a grouping

parenthesis and starts the enclosure of a subexpression. Conversely,

after a value a left paren indicates a function call or similar. For

example,

a + (...) //a grouped expression

a(...) //a function call

So, here is an attempt to produce such an expression parser. This one

is partially recursive and written in fairly low-level pseudocode. Of

the approaches I have tried I think this is the easiest to understand.

It is certainly short (which is cool!) but it may not be completely

right. Just about all the issues I have had have been to do with

applying precedence properly. Discussion and/or corrections would be

welcome.

Parsing of an expression is intended to start with a call to

expression_parse which takes one parameter. On the initial call this

is a dummy start-of-expression symbol which has a lower precedence

than any real operator. Once the parse is complete expression_parse

should return a tree representing the expression.

This function is essentially in two parts: first obtain a value and

then iterate over as many pairs of (infix operator, value) as have

higher precedence than that of the initial symbol.

function expression_parse(op1)

op2 = token //a copy of the current token

if op2 is lparen //*grouping* parens

v1 = expression_parse(lparen)

consume rparen

else if op2 is a prefix operator

v1 = expression_parse(prefix version of op2)

v1 = node(PREFIX_OP, op2, v1)

else

v1 = value_parse(op1)

op2 = current token (or dummy end of expression)

while prec(op2) > prec(op1)

if incompatible(op1, op2)

raise "incompatible - grouping parens needed"

v2 = expression_parse(op2)

v1 = node(INFIX_OP, op2, v1, v2)

op2 = current token (or dummy end of expression)

return v1

I won't insult your intelligence by explaining it all but some notes

are in order. Each call to this routine ends leaving the current token

pointer pointing at the next unconsumed operator. The line

v1 = expression_parse(prefix version of op2)

is intended to deal with such things as a prefix (unary) minus having

a different precedence than an infix minus. When such a symbol is seen

in prefix context the "prefix version" of it (which has the higher

precedence) is used on the recursive call.

When expression_parse gets to a value - an identifier or a literal -

it calls value_parse, below. The value_parse function is intended not

just to read the value but to accumulate all the following operations

- postfix and infix - which bind tighter than the precedence in force

at the time value_parse was called.

function value_parse(op1)

v1 = value() //identifier or literal

while true

op2 = current token or dummy end of expression

if lparen //parens of a function call or similar

v1 = args_parse(op2)

consume rparen

else if prec(op2) > prec(op1)

if incompatible(op1, op2)

raise "incompatible - grouping parens needed"

v1 = node(POSTFIX_OP, v1)

return v1

There is a need to accumulate a list of arguments. That is the job of

args_parse, below.

function args_parse()

v1 = empty list

while true

append to v1 expression_parse(lparen)

if current token is a comma

consume comma

else

break loop

return v1

I am aware of some little things I would change to render it into a

programming language but the main issue is whether it would handle

precedences generally and correctly (or could be made easier to

understand). As I say, comments and corrections - especially on the

bigger issues - would be very welcome.

James

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.