Intricate problem with scannerless LALR(1) parser

mailings@jmksf.com
Fri, 6 Jun 2008 13:12:37 +0200 (CEST)

From comp.compilers

Related articles
*Intricate problem with scannerless LALR(1) parser mailings@jmksf.com* (2008-06-06)**
Re: Intricate problem with scannerless LALR(1) parser kamalpr@hp.com (kamal) (2008-06-08)
Re: Intricate problem with scannerless LALR(1) parser GeniusVczh@gmail.com (vczh) (2008-06-09)
Re: Intricate problem with scannerless LALR(1) parser paul@paulbmann.com (Paul B Mann) (2008-06-25)
Re: Intricate problem with scannerless LALR(1) parser gah@ugcs.caltech.edu (glen herrmannsfeldt) (2008-06-26)
Re: Intricate problem with scannerless LALR(1) parser paul@paulbmann.com (Paul B Mann) (2008-06-26)
Re: Intricate problem with scannerless LALR(1) parser parsersinc@earthlink.net (SLK Mail) (2008-06-27)

| List of all articles for this month |

From:	mailings@jmksf.com
Newsgroups:	comp.compilers
Date:	Fri, 6 Jun 2008 13:12:37 +0200 (CEST)
Organization:	Compilers Central
Keywords:	parse, LR(1), question
Posted-Date:	06 Jun 2008 10:00:18 EDT

Hi there!

I'm currently developing a LALR(1) parser generator which generates
parsers running directly on the input character stream, which means
that it is scannerless. Lexemes or higher syntacical elements are
directly recognized by the grammar itself. The development of the
generator has nearly finished, and I added many utilizing features to
it, for example whitespace recognition and a keyword-feature, matching
terminals which exist of more than one character using a definite
state machine.

However, that keyword-feature has one side effect which I would
discuss on the mailing list. Given the following grammar:

start: a

a: b 'XX'

b: c | '[' b ']'

c: 'X' | c 'X'

where "start", "a", "b" and "c" are nonterminals ("start" is goal
symbol) and "XX", "X", "[" and "]" are the terminals ("XX" is a so
called keyword, "X", "[" and "]" are characters), my generator
constructs the following LALR(1) states (I used the state reduction
algorithm as described by [Goose], using SHIFTREDUCE-Actions):

    --- State 0 ---

    Kernel:
                    start: .a

    --- State 1 ---

    Kernel:
                    a: b .'XX'

    --- State 2 ---

    Kernel:
                    b: c . { 'XX' ']' }
                    c: c .'X'

    --- State 3 ---

    Kernel:
                    b: '[' .b ']'

    --- State 4 ---

    Kernel:
                    b: '[' b .']'

The configuration in State 2 is now the sticking point: The parser
will reduce here if it detects a keyword "XX", which is even made up
from the same character as the string matched by nonterminal "c". So a
parser with such a grammar, which has no shift-reduce or reduce-reduce
conflicts at all, will only accept the input strings "XXX" and
"[XXX]". The parse tree for "XXX":

          start
              |
a
            /
   b "XX"
        /
      c
    /
"X"

Strings like "XXXX" or "[XXX]XX" cannot be parsed using this parsing
technique, althought they look valid.

My question is now: Is there a possiblity to detect problems of this
kind beginning from state 2? My first idea was that it must be tried
to parse the keyword that should be reduced (in this case: "XX") using
nonterminal "c", but how could my parser generator invoke this in any
possibly situation this kind of problems comes up? How can it find out
that "c" is the nonterminal which possibly matches the keyword? Please
note, that now shift-reduce conflict will come up, because "X" and
"XX" are two different terminals.

Any ideas on this topic?

Best regards
Jan
[Your grammar is ambiguous. To see where, replace "XX" with xx and
define it like this: xx: "X" "X"

Assuming you want to use normal tokenizing rules, your "X" token is
really "X followed by something other than a letter or digit, and if
the something is white space, skip over the white space. Oh, and skip
comments, too." Now you know why we use separate lexer and parser
generators, because they need separate state machines to keep the
parser grammar frome exploding. -John]

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Intricate problem with scannerless LALR(1) parser

mailings@jmksf.comFri, 6 Jun 2008 13:12:37 +0200 (CEST)

mailings@jmksf.com
Fri, 6 Jun 2008 13:12:37 +0200 (CEST)