Re: Making semicolons optional moves LALR(1) language to LALR(2)?

"BGB / cr88192" <>
Sat, 6 Feb 2010 09:45:19 -0700

          From comp.compilers

Related articles
Making semicolons optional moves LALR(1) language to LALR(2)? ng2010@att.invalid (ng2010) (2010-02-05)
Re: Making semicolons optional moves LALR(1) language to LALR(2)? (BGB / cr88192) (2010-02-06)
Re: Making semicolons optional moves LALR(1) language to LALR(2)? ng2010@att.invalid (ng2010) (2010-02-06)
Re: Making semicolons optional moves LALR(1) language to LALR(2)? (Kaz Kylheku) (2010-02-08)

From: "BGB / cr88192" <>
Newsgroups: comp.compilers
Date: Sat, 6 Feb 2010 09:45:19 -0700
References: 10-02-025
Keywords: LALR, parse
Posted-Date: 10 Feb 2010 10:53:45 EST

"ng2010" <ng2010@att.invalid> wrote in message
> For a hypothetical programming language that is LALR(1) and uses
> semicolons as statement terminators, would a change that makes semicolons
> only required on multi-statement lines and using the newline as an
> implicit statement terminator make the language LALR(2)?
> [Seems to me that it makes a newline syntactically equivalent to a
> semicolon, unless you
> have some plan for multi-line statements you haven't mentioned. -John]

this change should not be nearly so fundamental (as far as the parser
category goes).

admittedly, my experience is largely limited to recursive descent parsers,
so all this may not be entirely valid in other contexts.

the basic idea is that, in contexts where a semicolon would normally be a
valid break, one can detect a linebreak and possibly accept this instead.

in my parsers, this leads to there being 2 'EatWhite' functions (for eating
whitespace):
"ParseEatWhite()", which eats all whitespace (including comments,
linebreaks, ...), this being the most common whitespace-eating function;
"ParseEatWhiteOnly()", which only eats "pure" whitespace (excluding comments
and linebreaks), where this one may be used in some contexts if a
linebreak-sensitive syntax is used (or also in the case of special commands
which may be embedded in comments, ...).
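
as a rough illustration of the two whitespace-eating functions described above, here is a minimal Python sketch. the names eat_white() and eat_white_only() are hypothetical stand-ins for "ParseEatWhite()" and "ParseEatWhiteOnly()", and the C-style "//" and "/* */" comment syntax is an assumption, since the post does not say which comment forms are in use.

```python
def eat_white_only(src: str, pos: int) -> int:
    """Skip spaces and tabs only; stop at a linebreak or a comment."""
    while pos < len(src) and src[pos] in " \t":
        pos += 1
    return pos


def eat_white(src: str, pos: int) -> int:
    """Skip all whitespace, including linebreaks and comments."""
    while pos < len(src):
        if src[pos] in " \t\r\n":
            pos += 1
        elif src.startswith("//", pos):
            # line comment (assumed syntax): skip up to the newline,
            # which the next loop iteration then consumes
            while pos < len(src) and src[pos] != "\n":
                pos += 1
        elif src.startswith("/*", pos):
            # block comment (assumed syntax): skip past the closing "*/",
            # or to end of input if it is unterminated
            end = src.find("*/", pos + 2)
            pos = len(src) if end < 0 else end + 2
        else:
            break
    return pos
```

a linebreak-sensitive rule would call eat_white_only() and then check whether the next character is a newline; everything else would call eat_white().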

often then, the semicolon is simply "eaten" in an outer level of the parser,
and its main use is actually to "get in the way" of parsing other syntactic
forms (essentially causing the parser to "unwind" while trying to parse it).
in cases where the parser unwinds on its own (usually because no other parse
is possible) the semicolon can be made optional (the "EatSemicolon()"
function is then a no-op if there is no semicolon to eat...).
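
the optional-semicolon idea above can be sketched with a toy recursive descent parser. this is only an illustration under assumptions: the grammar (assignments of '+' expressions), the Parser class, and all of its method names are hypothetical, and only eat_semicolon() corresponds to the "EatSemicolon()" mentioned in the post.

```python
import re

# a token is a number, an identifier, or one of '=', '+', ';'
TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[=+;])")


def tokenize(src):
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character near offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, src):
        self.toks = tokenize(src)
        self.i = 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def next(self):
        tok = self.peek()
        self.i += 1
        return tok

    def eat_semicolon(self):
        # no-op when there is no semicolon to eat
        if self.peek() == ";":
            self.i += 1

    def parse_term(self):
        tok = self.next()
        if tok is None or tok in ("=", "+", ";"):
            raise SyntaxError(f"expected operand, got {tok!r}")
        return tok

    def parse_expr(self):
        # expr := term ('+' term)* ; stops ("unwinds") at the first
        # token that cannot continue the expression
        left = self.parse_term()
        while self.peek() == "+":
            self.next()
            left = ("+", left, self.parse_term())
        return left

    def parse_statement(self):
        name = self.next()
        if not name.isidentifier():
            raise SyntaxError(f"expected identifier, got {name!r}")
        if self.next() != "=":
            raise SyntaxError("expected '='")
        value = self.parse_expr()
        self.eat_semicolon()
        return ("assign", name, value)

    def parse_program(self):
        stmts = []
        while self.peek() is not None:
            stmts.append(self.parse_statement())
        return stmts
```

in this toy grammar the expression parser always stops on its own, so the semicolon is never needed. with something like a prefix '-' operator this would no longer hold for every input, which is where the semicolon (or a linebreak check) becomes useful again.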

a more explicit option would be to add bits "here and there" which would
detect a linebreak as an explicit break. this would, in turn, likely add a
number of special notational rules in which to allow correct parsing:
being valid, but:
being not valid.

some languages also use indentation for disambiguation, ...
however, personally I have found both of these strategies to not be
worthwhile (they seem to add more hassle in using the syntax than they

alternatively, one could always require \ for any multi-line statement:
i=x \

however, this usually makes more sense for inherently line-structured
syntax/parsers (it is out of place in a token-based syntax).
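
for a line-structured parser, the backslash rule can be handled before parsing by joining physical lines into logical ones. the following is a minimal sketch; join_continued_lines() is a hypothetical helper, and the continuation text used in the usage note is made up for illustration.

```python
def join_continued_lines(text: str) -> list:
    """Merge physical lines ending in a backslash with the line that follows."""
    logical, pending = [], ""
    for line in text.split("\n"):
        stripped = line.rstrip()
        if stripped.endswith("\\"):
            # drop the backslash and keep accumulating
            pending += stripped[:-1]
        else:
            logical.append(pending + line)
            pending = ""
    if pending:
        # input ended on a continuation line; keep what was gathered
        logical.append(pending)
    return logical
```

for example, the two physical lines "i=x \" and "  +y" would come out as the single logical line "i=x   +y", after which each logical line can be treated as one statement.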

in some cases though, I have used whitespace-sensitive disambiguation (where
whether and where whitespace appears also serves to disambiguate the syntax). for
example, my assembler uses this so that ";" can be used both as a comment,
and as a way of placing several instructions on a single line.

nop ; nop ;comment
nop ;nop ;comment
nop;nop ;comment
nop; nop ;grouping

a few other special-purpose syntaxes of mine have also used this to good
effect:
"2-3", "2 - 3", "2- 3" => "2-3";
"2 -3" => "2, -3";
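
the '-' rule in the two lines above can be sketched as a small splitter: a '-' with whitespace before it and none after it starts a new (negated) item, and any other '-' is treated as binary. split_items() is a hypothetical name, and the handling of two operands separated only by whitespace is an added assumption that the post does not cover.

```python
def split_items(src: str) -> list:
    """Split a whitespace-sensitive item list such as '2 -3' into items."""
    items, cur = [], ""
    n = len(src)
    for i, ch in enumerate(src):
        if ch.isspace():
            continue
        prev_is_space = i > 0 and src[i - 1].isspace()
        if ch == "-" and cur:
            next_is_space = i + 1 >= n or src[i + 1].isspace()
            if prev_is_space and not next_is_space:
                # "2 -3": space before, none after, so '-' is a prefix
                # on a new item
                items.append(cur)
                cur = ""
        elif cur and prev_is_space and cur[-1] != "-":
            # assumption: operands separated only by whitespace are
            # separate items
            items.append(cur)
            cur = ""
        cur += ch
    if cur:
        items.append(cur)
    return items
```

so "2-3", "2 - 3" and "2- 3" each give the single item "2-3", while "2 -3" gives the two items "2" and "-3".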

so, a lot is specifics.

most "fundamental" changes in the parser more often arise from a need to deal
with otherwise ambiguous syntax, where a prior general strategy would be
unable to parse a given syntax.

examples of this would be trying to use a simple (context insensitive)
parser to parse C or C++, or apparently trying to use such a plain C-style
parser to parse C# (which I suspect actually requires a different parsing
strategy, but I have yet to really "dig into" the spec enough to really
figure out exactly how the ideal parser would be structured). in the past I
had suspected it would require a 2-pass strategy, but I am now wondering if
there is a better single-pass approach (but, alas, I would need to get
around to it...).

or such...
