Re: Can this be proven for LL(1) grammars?

Kaz Kylheku <kaz@kylheku.com>
Fri, 7 Nov 2014 19:43:44 +0000 (UTC)

From comp.compilers

Related articles
Can this be proven for LL(1) grammars? lhp+news@toft-hp.dk (Lasse =?iso-8859-1?q?Hiller=F8e?= Petersen) (2014-11-06)
*Re: Can this be proven for LL(1) grammars? kaz@kylheku.com (Kaz Kylheku)* (2014-11-07)**
Re: Can this be proven for LL(1) grammars? slkpg4@gmail.com (SLK Mail) (2014-11-08)

| List of all articles for this month |

From:	Kaz Kylheku <kaz@kylheku.com>
Newsgroups:	comp.compilers
Date:	Fri, 7 Nov 2014 19:43:44 +0000 (UTC)
Organization:	Aioe.org NNTP Server
References:	14-11-006
Keywords:	theory, parse
Posted-Date:	07 Nov 2014 15:55:25 EST

On 2014-11-06, Lasse HillerC8e Petersen <lhp+news@toft-hp.dk> wrote:
> Given a context free grammar G that is LL(1), is the answer to the
> following questions "yes"? I intuitively believe it is, except for (2),
> but I am not sufficiently confident of my skills in proving this in full
> generality, so I would much appreciate expert advice.
>
> 1. From G, for each nonterminal symbol R define a grammar GR, where the
> start rule is changed to R. Rules that are no longer reachable are of
> course useless and discarded. Is each GR also LL(1)?

This is true, and the informal proof is to work backwards: you cannot
take some grammar GR which is not LL(1) and embed it in a larger
context free grammar G which fixes the problem.

> 2. From all the GR grammars from (1), define G' with a new start rule S:
> GR1 | GR2 ... |GRn. Is G' also LL(1)? Given that there could be rules Gi
> and Gj such that their FIRST sets intersect, I suppose the answer is no.

Indeed:

    G1 -> blah blah blah 'x'
    G2 -> blah blah blah 'y'

Oops! Of course, the language is LL(1) parseable, but requires
a refactored grammar in which we merge the similar left parts:

    G1_G2 -> blah blah blah G1_G2'
    G1_G2' -> 'x' | 'y'

If we are building a syntax tree as a semantic action, we can handle this
language by encoding the difference between the 'x' and 'y' variant in the
tree. We don't have G1 and G2 phrase structures, but a G1_G2 that
generates multiple forms.

> What if instead, S is defined only from a selection of rules with
> disjoint FIRST sets?

This might not help because of the threat of FIRST/FOLLOW conflicts.

The problem is that G1 and G2 might share some constituent, call it C.
The G1 and G2 rules contribute different material to FOLLOW(C):

    G1 -> 'x' C Z
    G2 -> 'y' C W

G1 and G2 nicely derive a starting nonterminal, so their FIRST sets are clearly
disjoint (in compliance with your stated constraint on S). But the combintion
of G1 and G2 causes a merging of FOLLOW(C) from G1, and FOLLOW(C) from G2.
Moreover, if C can derive empty (C -> N5), FIRST(C) is also being affected. In
general, conflicts can arise from this. Even though there is no
FIRST(C)/FOLLOW(C) conflict in the G1 and G2 rules in isolation, possibly the
FIRST(C) from G1 could conflict with FOLLOW(C) in G2 or vice versa,
under the merge of these sets.

> 3. From G, define a new grammar G', such that for each nonterminal R, a
> rule R: "R" is added, where "R" is a new terminal symbol. Is G' also LL
> (1)?

This lets you replace any phrase structure in the language with
a "super token" which represents that phrase structure like
3 + MULTIPLICATIVE_EXPR.

This seems to fit in with sentential form representations that you
would use in generation or derivation of sentences.

This introduction cannot introduce a new FIRST/FOLLOW conflict.
R -> "R" means that the terminal "R" is in FIRST(R). Can it also
appaer in FOLLOW(R)? Only if grammar derives R R somewhere, and R happens to
derive N5. But in that case, there is already an existing FIRST/FOLLOW
conflict, unless perhaps the *only* rule for R was previously R -> N5.

If we have:

  S -> R R
  R -> N5

Then we add:

  R -> 'R'

We have a FIRST/FOLLOW conflict. Given the input 'R' it is ambiguous
whether it is S -> N5 'R' or S -> 'R' N5.

What's not clear to me is that the original before the new rule doesn't already
have a conflict. I think not because the symbol N5 only appears in FIRST sets,
not FOLLOW sets. In the original unaugmented grammar, we have FIRST(R) = { N5 }
and FOLLOW(R) = { }. The grammar clearly matches the input in exactly
one way, by deriving R -> N5 twice.

Suppose that R actually derives something in addition to N5, and is embedded
in a correct grammar with no conflicts. Since there is no FIRST/FOLLOW
conflict that means that either R does not derive N5, or FIRST(R)
and FOLLOW(R) do not intersect. If they do not intersect, that means that
FOLLOW(R) was not derived from R: there is nothing in the grammar of
the form R R or R X R where symbols X potentially disappear by deriving N5.

Then if we add R -> 'R', it is harmless.

Informally, since R already derives *something* non-empty without conflicts,
then making R derive a new nonterminal, unique to R, will not create
a conflict. (R already succesfully derives tokens that are not necessarily
unique to the R rule, so why wouldn't it handle an additional unique one.)

> 4. Do (1) and (3) hold for other types of CFG?

I suspect (1) may pop out from what it means for a grammar to be "context
free", but since I already hand-waved through the first comment about it, I'm
going to leave it at that. :) I look forward to other responses.

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: Can this be proven for LL(1) grammars?

Kaz Kylheku <kaz@kylheku.com>Fri, 7 Nov 2014 19:43:44 +0000 (UTC)

Kaz Kylheku <kaz@kylheku.com>
Fri, 7 Nov 2014 19:43:44 +0000 (UTC)