Re: First vs Predict and LL(*)

Alexander Morou <alexander.morou@gmail.com>
Thu, 22 Jan 2015 01:07:47 -0600

From comp.compilers

Related articles
First vs Predict and LL() alexander.morou@gmail.com* (2015-01-07)
Re: First vs Predict and LL() alexander.morou@gmail.com* (2015-01-07)
Re: First vs Predict and LL() DrDiettrich1@netscape.net (Hans-Peter Diettrich)* (2015-01-07)
First vs Predict and LL() slkpg4@gmail.com (SLK Mail)* (2015-01-07)
Re: First vs Predict and LL() alexander.morou@gmail.com (Alexander Morou)* (2015-01-09)
Re: First vs Predict and LL() alexander.morou@gmail.com (Alexander Morou)* (2015-01-09)
Re: First vs Predict and LL() DrDiettrich1@netscape.net (Hans-Peter Diettrich)* (2015-01-09)
*Re: First vs Predict and LL() alexander.morou@gmail.com (Alexander Morou) (2015-01-22)**
Re: First vs Predict and LL() DrDiettrich1@netscape.net (Hans-Peter Diettrich)* (2015-01-22)
Re: First vs Predict and LL() alexander.morou@gmail.com (Alexander Morou)* (2015-01-23)

| List of all articles for this month |

From:	Alexander Morou <alexander.morou@gmail.com>
Newsgroups:	comp.compilers
Date:	Thu, 22 Jan 2015 01:07:47 -0600
Organization:	Compilers Central
References:	15-01-003
Keywords:	LL(1), parse
Posted-Date:	22 Jan 2015 10:01:29 EST

After much more effort, I've finally got what I think is an effective
predictor, but will be able to know more once I can emit the code. I
didn't want to get ahead of myself because it's one thing to predict
that ten paths all use 'attributes' so parse attributes and reduce
those tokens, but another to resolve 'Dangling Else' style ambiguities.

From the prediction mechanism above, I can effectively know all paths
that lead into a rule, even if it's nested 39 levels deep, but it runs
me into my next dilemma: the dangling else.

So far I have an effective strategy for detecting them, though someone
please pipe in and tell me if I'm short sighted on my approach:
For Every Rule 'x', gather every edge node 'e' within its automation
which contains outgoing transitions. Then compare these outgoing
transitions with the transitions of the state 's' which is just after
the state 'c' which called rule 'x' . If there is a symbolic collision
between the two, you have a dangling else ambiguity, naturally it's not
enough to consider just that initial state 's' because it could also be
an edge state and the rule which called its parent would need
considered as well.

This brings me to my next point: what if the point of ambiguity leads
to a situation where the calling rule's tie for that symbol is a
requirement for the parse, but the edge state is optional? Granted you
have already noted a slight ambiguity, and with most languages, the
dangling else is known to be attached to the innermost rule. In the
mentioned situation that won't work, if you include it in the innermost
rule's context, you would fail to parse the calling rule's states that
follow.

I've included here a situation that mimics this, which was discovered
in detecting this ambiguity. The example is an excerpt from a language
which is a derivative of C#:

        #Root "MemberName";
        #GrammarName "TestGrammar";

        MemberName ::=
                (Type:TypePart; '.')?
                Identifier:Name; ;

        /* ---------------------------------------- *\
        |* B.2.2 Types *|
        \* ---------------------------------------- */
        TypeArgumentList ::=
                '<' Type:Arguments; (',' Type:Arguments;)* '>';

        Type ::=>
                ReferenceType ;

        ReferenceType ::=>
                NonArrayType |
                ArrayType ;

        NonArrayType ::=>
                StructType |
                ClassType ;

        ArrayType ::=
                NonArrayType:ElementType;
                        RankSpecifier:DimensionInfo;+ ;

        ClassType ::=>
                TypeName ;

        TypeName ::=>
                NamespaceOrTypeName ;

        NamespaceOrTypeName ::=
                (
                        QualifiedAliasMember:AliasPart; |
                        SimpleName:Names;
                ) ('.' SimpleName:Names;)* ;

        QualifiedAliasMember ::=
                Identifier:AliasName; "::"
                SimpleName:Name; ;

        SimpleName ::=
                Identifier:Name;
                TypeArgumentList:TypeParameters;? ;

        StructType ::=>
                IntegralDataTypes |
                EnumBaseTypes |
                NullableType ;

        NullableType ::=
                Type:NonNullableType; '?' ;

        RankSpecifier ::=
                '[' ',':Rank;* ']';

        Identifier$ :=
                IdentifierOrKeyword |
                '@' IdentifierOrKeyword ;

        UnicodeEscapeSequence :=
                "\\u" HexChar{4} ;

        IdentifierStartCharacter :=
                LetterCharacter |
                '_' |
                UnicodeEscapeSequence ;

        IdentifierOrKeyword :=
                IdentifierStartCharacter
                IdentifierPartCharacter* ;

        IdentifierPartCharacter :=
                IdentifierPartCharacterNoEscape |
                UnicodeEscapeSequence ;

        IdentifierPartCharacterNoEscape :=
                LetterCharacter |
                CombiningCharacter |
                DecimalDigitCharacter |
                ConnectingCharacter |
                CombiningCharacter |
                FormattingCharacter ;

        LetterCharacter :=
                [:Lu::Ll::Lt::Lm::Lo::Nl:] ;

        CombiningCharacter :=
                [:Mn::Mc:] ;

        DecimalDigitCharacter :=
                [:Nd:] ;

        ConnectingCharacter :=
                [:Pc:] ;

        FormattingCharacter :=
                [:Cf:] ;

        HexChar :=
                [0-9A-Fa-f] ;

        StaticTokens :=
                '.':Period; |
                "::":AliasSeparator; |
                '[':LeftSquareBracket; |
                ']':RightSquareBracket; |
                '<':TypeParameterStart; |
                '>':TypeParameterEnd; |
                '?':Nullable; |
                ',':ArrayRankSpecifier; ;

        IntegralDataTypes>Identifier :=
                                            "bool":Boolean; |
                                            "char":Char; |
                                      "decimal":Decimal; |
                                        "double":Double; |
                                          "float":Float; |
                                        "object":Object; |
                                        "string":String; ;

        EnumBaseTypes>Identifier :=
                                            "byte":Byte; |
                                              "int":Int32; |
                                            "long":Int64; |
                                            "uint":UInt32; |
                                          "ulong":UInt64; |
                                        "ushort":UInt16; |
                                          "sbyte":SByte; |
                                          "short":Int16; ;

Since it wasn't obvious to me at first, I've included the ambiguity
context here:
{ Rule = {NamespaceOrTypeName ::= (QualifiedAliasMember:AliasPart; |
                SimpleName:Names;) ('.' SimpleName:Names;)*;},
  Edge = {StateValue: 1231},
  EdgeNode = {Rule Sub-Node for NamespaceOrTypeName at 1231},
  StartingPath = {
      MemberName::Type::ReferenceType::NonArrayType::ClassType::
          TypeName::[NamespaceOrTypeName]},
  TyingPath = {[MemberName(1193)]} }

{ and } denote enclosing scope of members following '='.
Node Paths refer to Rules by name, sub-states of rules
by Name(StateNumber).

Basically the MemberName's '.' following the Type and then the
Identifier would tie on the NamespaceOrTypeName's repetition
over ('.' SimpleName)*

The question I have is: given the context of what could be ambiguous,
couldn't you apply a similar prediction for determining which forward
rule to parse to the tail end of the equation?
The exception being, you couldn't simply consider the inner-most rule's
edges, you'd have to consider it a valid edge, if and only if the
parent rule's context is satisfied. The initial state that represented
the terminal edge would be valid, but you'd advance and note where
a valid edge should have been, only to accept once the follow context
is satisfied...?

Any insight on this is helpful. Grammar allows simple token inserts
of '.' | "::" to be directly in the rules, the definition of simple
tokens in a fixed alternation pattern below is understood and
rewritten.

Left recursion of Direct, Indirect, and Hidden allowed in parser
grammar. Though I'm thinking what I allow is the cause of my
headaches.

On Fri, Jan 9, 2015 at 12:47 PM, Alexander Morou
<alexander.morou@gmail.com> wrote:
>
> > If the source is "uCDPL" then a follow stack would be insufficient
> > in determining the next valid token, unless you ignored failures
> > in a rule when an accept state is reached and let the parent
> > rule figure it out by rewinding. Based on the above, the requirements
> > could change based on the entire stack, if there were other rules
> > which introduced other requirements, the parse stack would be a
> > determining factor on what was next.
> >
> > Is the above contrived example simple to calculate the Follow for?
>
> I might add, ANTLR 4, in looking after writing the above response,
> does what I surmised a possible solution would be:
> For each rule construct a state machine, uisng the tree context that
> represents your parse as you go, note the state that the parser was
> in when that tree was being constructed. This provides a traversable
> structure that gives you a per-rule state to know what elements are
> valid next. The predict method likely steps through the source input
> and walks up the tree evaluating the look-ahead as necessary with
> multiple state machines (or perhaps a single unified state machine.)
> This would require the states across all rules to use a shared state
> pool, but that's a simple enough matter.
>
> The only complaint I have on this is: ANTLR's source is not human
> alterable. And it's not terribly performant, though Terrence Parr
> did mention that was a secondary focus. 32KB worth of data in an
> altered version of the contrived example yielded 250ms for roughly 32KB
> of data, increasing the number of iterations increases time
> requirements linearly (320KB would be 2.5 seconds). Another
> limit seems to be ANTLR 4 allows direct left recursion but indirect
> left recursion is strangely not allowed. Likely a rewrite rule in
> effect.
>
> I'd prefer the state machines be expanded into the code so that this
> logic is laid out and easier for someone to infer the intent by
> stepping through the code.
>
> Someone pipe in if the view of how ANTLR 4 works is incorrect.

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: First vs Predict and LL(*)

Alexander Morou <alexander.morou@gmail.com>Thu, 22 Jan 2015 01:07:47 -0600

Alexander Morou <alexander.morou@gmail.com>
Thu, 22 Jan 2015 01:07:47 -0600