1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648

"Edsko de Vries" <devriese@cs.tcd.ie>
31 Jan 2006 09:57:37 -0500

          From comp.compilers

Related articles
1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 devriese@cs.tcd.ie (Edsko de Vries) (2006-01-31)
Re: 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 torbenm@app-0.diku.dk (2006-01-31)
Re: 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 rsc@swtch.com (Russ Cox) (2006-01-31)
Re: 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 jm@bourguet.org (Jean-Marc Bourguet) (2006-01-31)
Re: 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2006-01-31)
Re: 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 gneuner2@comcast.net (George Neuner) (2006-01-31)
Re: 1 - 1, 1 -1, 1-1, 1 - -1 and -2147483648 rivers@dignus.com (Thomas David Rivers) (2006-01-31)
[5 later articles]
| List of all articles for this month |

From: "Edsko de Vries" <devriese@cs.tcd.ie>
Newsgroups: comp.compilers
Date: 31 Jan 2006 09:57:37 -0500
Organization: http://groups.google.com
Keywords: lex, arithmetic, question
Posted-Date: 31 Jan 2006 09:57:37 EST

Hi,


I have a question regarding lexical analysis. I recently came across a
bug in our lexical analyser in phc (www.phpcompiler.org), that I am
unsure how to solve. This is the problem: our current definition for
integer constant looks something like


INT ([1-9][0-9]*)|0


In particular, note that it does not allow for an (optional) "+" or "-"
at the start of the integer. This means that the strings "1 - 1", "1
-1" and "1-1" all generate the same sequence of three tokens INT(1),
OP(-), INT(1), for which the syntax analyser generates the subtree
BIN_OP(-, 1, 1).


For the string "1 - -1", the lexer (unsurprisingly) generates INT(1),
OP(-), OP(-), INT(1). The syntax analyser recognises this as BIN_OP(1,
UNARY_OP(-, 1)). In other words, the second "-" is treated as a unary
operator, rather than as part of the number.


This works fine, with the sole exception of the number "-2147483648".
The problem is, of course, overflow: -2147483648 is a valid negative
number (assuming 32-bit numbers), but the integer 2147483648 is _not_ a
valid positive number. Thus, the above method of dealing with "-" as a
unary operator breaks down.


The solution is to interpret the "-" as part of the number, and
generate INT(-2147483648), rather than OP(-), INT(...). However,
changing the definition of INT to


INT [+-]?([1-9][0-9]*)|0


causes "1-1" to be recognised as INT(1), INT(-1), which is incorrect.
Even requiring a space before the operator


INT (" "[+-])?([1-9][0-9]*)|0


does not help, because "1 -1" will be recognised as INT(1), INT(-1)
instead of INT(1), OP(-), INT(1).


Is there a good solution that solves both problems? At the moment, I
have added a new rule to the lexer that matches this exact number,
which solves the problem - but it's a hack.. (actually, there are two
rules, one for 32-bit machines, one for 64-bit machines..)


Thanks,


Edsko


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.