First vs Predict and LL(*)
Wed, 7 Jan 2015 02:09:59 -0800 (PST)

          From comp.compilers


Newsgroups: comp.compilers
Date: Wed, 7 Jan 2015 02:09:59 -0800 (PST)
Organization: Compilers Central
Keywords: LL(1), question
Posted-Date: 07 Jan 2015 08:47:58 EST

I took about a three-year hiatus (several stretches of consecutive months without
working on it) from a language-centric project, and now I'm having the fun of
re-acclimating myself to my own work.

Up front: I'm self-taught, so certain language concepts might be beyond my
current level of understanding; *please* don't assume I know something.

In the process of regaining my legs in the project, I re-read some common
articles and noticed that I don't see the point of differentiating First and
Predict sets.
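For what it's worth, the usual textbook distinction only shows up with nullable productions: PREDICT(A -> alpha) is FIRST(alpha), except that when alpha can derive the empty string, FOLLOW(A) is added in place of epsilon. A minimal sketch (the grammar and table representation here are made up for illustration):

```python
# Sketch: PREDICT(A -> alpha) = FIRST(alpha), plus FOLLOW(A) when alpha is
# nullable. Grammar assumed for the demo: S -> A 'b' ; A -> 'a' | epsilon
EPS = ""  # marker for the empty string

FIRST = {"S": {"a", "b"}, "A": {"a", EPS}}
FOLLOW = {"S": {"$"}, "A": {"b"}}

def first_of_seq(symbols):
    """FIRST of a symbol sequence; contains EPS only if every symbol is nullable."""
    out = set()
    for sym in symbols:
        f = FIRST.get(sym, {sym})      # a terminal is its own FIRST set
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)
    return out

def predict(lhs, rhs):
    f = first_of_seq(rhs)
    if EPS in f:
        return (f - {EPS}) | FOLLOW[lhs]
    return f

# For A -> epsilon, FIRST is just {EPS}, but PREDICT is FOLLOW(A) = {'b'}:
# exactly the case where the two sets differ.
print(predict("A", ["a"]))   # {'a'}
print(predict("A", []))      # {'b'}
```

For grammars with no nullable productions the two sets coincide, which may be why the distinction looks pointless.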

The project I'm writing is an LL(*) parser builder, which aims to scan
ahead and yield a state that denotes which path is the most logical one to
follow next.

To accomplish this, my basic approach is to track the source of a given symbol
referenced by a state within a rule's deterministic state machine.

This can lead to situations where a given rule has tens to hundreds of
possible sources, especially when the nesting goes 20+ levels deep (expression
precedence defined through left-recursive rules, as opposed to defining
operator-precedence sugar in the grammar definition language.)

So far I've reached the point where I collapse common routes into a single
view: if '(' is reachable through two possible parse paths, but both lead to
the same outgoing target at the current lookahead, they are considered the same.

An example:
A ::= B | C; //::= defines rules
B ::= D | E;
D ::= '(' P ')';
E ::= '(' Q ')';
C ::= @'O'*; //'@' means case insensitive.
P := [0-9A-Fa-f]+; //:= defines tokens
Q := [:Ll:]+; //:Xx: means unicode character classes as defined by .NET's
standard sets.

'A' would contain '(' within its First set, but it would be observed through
'B' and only 'B', so it is always valid to parse 'B' when '(' is
encountered at the start of 'A'. 'B''s definition would have to look ahead
further. For the sake of simplicity, token-level ambiguities are resolved by
longest-match prevails, or first-come-first-served on a tie. So if you want
[A-F] to be parsed as part of a 'Q' token, you'd have to specify it first,
unless other letters are parsed that make 'Q' match a longer substring.
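The dispatch on '(' can be sketched like so (names hypothetical; FIRST sets written out by hand for the example grammar above):

```python
# For A ::= B | C, map the lookahead token to the set of alternatives whose
# FIRST set contains it. '(' maps only to B, so on '(' the parser can commit
# to B without further lookahead; telling D from E inside B needs more.
FIRST = {
    "B": {"("},          # via D and E, both of which start with '('
    "C": {"O", "o"},     # @'O'* is case-insensitive
}

def dispatch(alternatives, lookahead):
    """Alternatives of a rule whose FIRST set contains the lookahead token."""
    return {alt for alt in alternatives if lookahead in FIRST[alt]}

print(dispatch(["B", "C"], "("))   # {'B'}: unambiguous at one token
```

Note that @'O'* is nullable, so a full treatment would also fold A's Follow set into C's prediction; the sketch ignores that for brevity.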

(The next step will involve collecting all states that observe a given symbol
and looking ahead further, then further still, until a decision is possible.
I might find this is computationally implausible. Left-recursive rules will
be handled through loops that parse the alternate paths until the rule itself
can be pushed onto the stack; once that is done, it can safely re-parse
alternate paths until there are no more symbol reductions. This is what I do
by hand in the Grammar Definition Language's parser.)
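The "look further until a decision is possible" idea can be sketched as deepening k until at most one alternative survives (the token names and prefix table below are made up for the D/E example; a real implementation would walk the state machines rather than enumerate prefixes):

```python
# Each alternative carries the token strings it can start with; deepen the
# lookahead k until the input prefix matches at most one alternative.
ALT_PREFIXES = {
    "D": [("(", "hex", ")")],     # D ::= '(' P ')'
    "E": [("(", "ident", ")")],   # E ::= '(' Q ')'
}

def resolve(alternatives, tokens, max_k=3):
    """Return (surviving alternatives, depth used) for a token lookahead."""
    live = list(alternatives)
    for k in range(1, max_k + 1):
        live = [a for a in alternatives
                if any(p[:k] == tuple(tokens[:k]) for p in ALT_PREFIXES[a])]
        if len(live) <= 1:
            return live, k
    return live, max_k

# D and E agree on '(' at k=1; the token after it decides at k=2.
print(resolve(["D", "E"], ["(", "hex", ")"]))   # (['D'], 2)
```

This is where the computational blow-up the post worries about lives: the set of candidate prefixes can grow with each extra token of lookahead.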

The main questions: what is the functional intent behind distinguishing the
First and Predict sets? And since the project I'm writing is a recursive
descent parser, is my classification of it as LL(*), based on what's described
above, accurate?
