Re: First vs Predict and LL(*)

Alexander Morou <>
Fri, 9 Jan 2015 00:37:23 -0600

          From comp.compilers

Related articles
First vs Predict and LL(*) (2015-01-07)
Re: First vs Predict and LL(*) (2015-01-07)
Re: First vs Predict and LL(*) (Hans-Peter Diettrich) (2015-01-07)
First vs Predict and LL(*) (SLK Mail) (2015-01-07)
Re: First vs Predict and LL(*) (Alexander Morou) (2015-01-09)
Re: First vs Predict and LL(*) (Alexander Morou) (2015-01-09)
Re: First vs Predict and LL(*) (Hans-Peter Diettrich) (2015-01-09)
Re: First vs Predict and LL(*) (Alexander Morou) (2015-01-22)
Re: First vs Predict and LL(*) (Hans-Peter Diettrich) (2015-01-22)
Re: First vs Predict and LL(*) (Alexander Morou) (2015-01-23)
| List of all articles for this month |

From: Alexander Morou <>
Newsgroups: comp.compilers
Date: Fri, 9 Jan 2015 00:37:23 -0600
Organization: Compilers Central
References: 15-01-003 15-01-005
Keywords: parse, LL(1)
Posted-Date: 09 Jan 2015 10:18:36 EST

On Wednesday, January 7, 2015 7:41:39 PM UTC-6, Hans-Peter Diettrich wrote:
> schrieb:
> Merging both, into an "scannerless parser", opens a can of worms, at
> least when the input is prepared by humans and should be readable
> (parseable) also by other people. One of the most annoying "features" of
> a scannerless parser IMO is the handling of token separators. You can
> find out easily, how complicated the grammar for some well known
> programming language will become, as soon as whithespace handling is
> moved out of the lexer.

> But of course there exist use cases for scannerless parsers, in detail
> for languages other than programming languages.

Apologies, the focus wasn't to yield a scanner-less parser. To my
understanding, the functional intent behind First, Follow, Predict is
to provide a contextual basis from which the parser can denote issues
with the source in the event that an unexpected token appears on the
look-ahead stack.

By LL(*) I'm referring to situations where tokens are parsed by a
lexer and the rules are parsed by a parser. The Star aspect comes
into play when you have two rules, say 'Field' and 'Method', both
productions refer to an optional 'Attributes' which starts with a
required '['.

The "First set" of each rule would be identical which would likely
mean that LL(1) would be out at this point. The issue comes into
play when you consider that Attributes can contain 'x' many actual

In evaluating all avenues for a given production, you'll note the
requirement for '[' all begin in 'Attributes'. The solution in this
instance is to have 'Attributes' be parsed in the rule that can
accept either a 'Field' or a 'Method' and have that pushed onto
the look-ahead stack as the next symbol for consumption.

I think, if done properly, this would remove the need to back-track
or memoize the structure of the 'Attributes' reference, since you've
decidedly figured out that it's always correct to have it there.

> > To accomplish this, my basic approach is to track the source of a given symbol
> > referenced by a state within a rule's deterministic state machine.
> >
> > This can lead to situations where a given rule has tens to hundreds of
> > possible sources, especially if the nesting goes 20+ deep (expression
> > precedence defined through left-recursive rules, versus defining operator
> > precedence sugar in the grammar definition language.)
> Right, precedence rules are a weak point in formal grammars. Here the
> human factor suggests (to me) some formal annotation, in favor of deeply
> nested rules, for the implementation of operator precedence, grouping etc.

I actually prefer the recursive approach. It's unambiguous and doesn't
require any special call-outs. Situations where Left Recursion are
detected, in theory can be solveable by the symbol approach mentioned
above: when a left-recursive case is encountered, replace the standard
parse method with a loop in which:
1. Failure only truly occurs if zero matches occur.
2. Using the symbol of the rule predict the most logical avenue.
    a. If multiple avenues, look ahead further.
    b. If zero avenues during look-ahead, and no matches have yet
been found, report an error; otherwise, exit gracefully.

This is also, of course, assuming my understanding of LL(k) and/or
LL(1) parsers is accurate, it would mean it's somewhat of a
pseudomorph of LR and LL-styled parsers due to the reduction and
placement of the rule on the stack.

If what I'm describing is something else entirely, someone please
note as such.

> > So far I'm to the point where I've collapsed common routes into a single view,
> > so if '(' is available through two possible parse paths, but both are the same
> > outgoing target at the current look ahead, it considers them the same.
> >
> > An example:
> > A ::= B | C; //::= defines rules
> > B ::= D | E;
> > D ::= '(' P ')';
> > E ::= '(' Q ')';
> > C ::= @'O'*; //'@' means case insensitive.
> > P := [0-9A-Fa-f]+; //:= defines tokens
> > Q := [:Ll:]+; //:Xx: means unicode character classes as defined by .NET's
> > standard sets.
> >
> > 'A' would contain '(' within its first set, but it would be observed through
> > 'B' and only 'B', so it would be valid to parse 'B' always when '(' is
> > encountered at the start of 'A'. B's definition would have to look ahead
> > further, for the sake of simplicity, token-level ambiguities are resolved by
> > longest prevails, or first come first serve. So if you want [A-F] to be
> > parsed in a 'Q' token, you'd have to specify it first, unless other letters
> > are parsed that make 'Q' parse a longer sub-string.
> Here I feel a need for some annotation, that tells the (human) generator
> of a sentence how to go ahead, in order to reach the intended goal. I
> think that you found the same need already, too?

Could you perhaps give me a run-down of what the annotation would

If I build a decision tree, in which all avenues point to 'x',
you would know, not only that 'x' is next, but what productions
require it. If 'x' is a rule, then you know, at the very least,
to parse that rule. Though I'm sure there's some technical aspect I
am not fully aware of yet as I'm still writing the thing and I started
in 2008...

> In LL(1) the rules D and E could be reduced into
> DE ::= '(' (P|Q) ')';
> what I'd understand better. I prefer to use EBNF as much as possible, in
> order to reduce the number of rules in an LL grammar. Also PEG
> parsers/grammars are easier to understand to me, when it comes to the
> clarification of ambiguities; a PEG parser will use the first
> interpretation, found in the grammar, so that it's easier to find out
> how to make the parser consider some alternative interpretation.

This ties into not adhering to a LL(1) compiler, LL(k) for a particular
k tokens wouldn't apply if the common rule required 'n' tokens and 'n'
is only determinable at the runtime of the parser.

> When both sets overlap (LL(1) First/First conflict), descending
> immediately into the same subroutine will yield an endless loop. That's
> why I add a rule to my LL parsers, that every call *must* consume at
> least one terminal token, before descending and (possibly) recursing. In
> the simplest implementation the parser remembers the current position in
> the input stream, after consuming its first (assumed) terminal, then
> tries to parse the first (optional) token/alternative. In case of an
> error it returns to the input position and tries the next alternative.
> This model can be optimized by using an stack or queue of tokens, in
> order to eliminate repeated lexing of the offending terminals.

Initially I posted 'difference between first and follow', but a later
message, that you likely didn't see, clarified: 'The difference between
'first and PREDICT'. SLK Mail already clarified this point, but for the
sake of the topic project at hand: I'll be using a follow stack (which
it turns out ANTLR 3 uses one to do this same thing) which will contain
what is next relative to the rules currently in play.

> > and the goal of the project I'm writing is a recursive descent parser,
> > from what's described above, is my classification of LL(*) accurate?
> I don't see your classification of LL(*) :-(
> I'd say that an LL(k) parser definitely knows how to proceed after
> inspection of at most k terminals, but I may be wrong here - number of
> terminals or tokens?

I think the confusion here is that I didn't specify what my
understanding was, so here goes: k in an LL(k) parser refers to the
number of TOKENS of look-ahead necessary to deterministically predict
what's next and parse the appropriate rule from that decision. In
situations where 'k' is unbound because the same rule is referenced
on multiple avenues, and that rule uses repetitions, you yield an
unbound 'k' requirement. This would also occur when two rules are
equivalent, in that they utilize the same set of tokens, which may or
may not use repetitions. The second case is much much trickier, from
what I can gather. The computational time necessary to find them is
*fairly* high.

The other question that seems to puzzle me is what do you do in
situations where the current rule is in an acceptable state,
it has possible further tokens, but what do you do when what is
valid next is also valid in the calling rule?

Let's use a contrived example:
S ::= A 'C'? 'X'? 'P'+ 'L';
A ::= B 'C' 'D'? 'P'?;
B ::= 'u' ('C'? 'O')?;

If the source is "uCDPL" then a follow stack would be insufficient
in determining the next valid token, unless you ignored failures
in a rule when an accept state is reached and let the parent
rule figure it out by rewinding. Based on the above, the requirements
could change based on the entire stack, if there were other rules
which introduced other requirements, the parse stack would be a
determining factor on what was next.

Is the above contrived example simple to calculate the Follow for?

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.