Determinism for Contextual Keywords

Alexander Morou <alexander.morou@gmail.com>
Sat, 11 Apr 2015 14:34:49 -0500

          From comp.compilers

Related articles
Determinism for Contextual Keywords alexander.morou@gmail.com (Alexander Morou) (2015-04-11)
| List of all articles for this month |

From: Alexander Morou <alexander.morou@gmail.com>
Newsgroups: comp.compilers
Date: Sat, 11 Apr 2015 14:34:49 -0500
Organization: Compilers Central
Keywords: parse, question
Posted-Date: 13 Apr 2015 00:19:30 EDT

Hello,


I have a general question about contextual keywords in a parse. I'm
trying to make a deterministic LL(*) parser generator, which I've posed
about a few times before, and a general thought occurred to me about
contextual keywords. Contextual Keywords are such that they go in and
out of relevance based off of the current state of the parser, so 'var'
in C# may or may not be the start of the declaration of a variable,
depending on what's after it. The contextual in this sense is not aware
of external influence (a type model or other hack that requires insight
beyond the active character stream), but more terms which are ambiguous,
have to be ambiguous, and can logically be rationed about during parse
should no ambiguity arise as a result.


In order to avoid situations where the contextual keyword wins out
every time, the only way to make this deterministic is to be aware
of the ambiguities, though analysis on the language defined, and points
where these ambiguities crop up, create a targeted symbol that
represents that situation, this would need to be done by the parser
compiler itself due to the number of possible permutations of a given
language.


These points of ambiguities would have their 'k' determined once you
look past the point of ambiguity, and the path would then be chose.
The question I have today, is: would such an approach be computationally
plausible or would it require an exponential amount of memory? I'm on
a system with 12GB of ram, and I've found that k analysis for unbound
k tends to be a bit memory hungry (current language predictions takes
about 3-5GB to process without taking this contextual situation in mind).
In May I'll be upgrading to a system with 64GB (with more computational
legroom) which will allow me to evaluate this more thoroughly I just
wanted to see if anyone else has traversed this road enough to tell me
the plausibility of such an attempt.


My biggest fear with such an analysis is situations where there are
diminishing sets:


S :=
Identifier Identifier Identifier 'c' |
A B C |
A C B |
B A C |
B C A |
C A B |
C B A ;


If each terminal represented a word for the language and 'Identifier'
is contextually valid in each scenario, the number of sets created to
identify the ambiguities would be:


{ A, B, C, Identifier },
{ A, B, Identifier},
{ A, C, Identifier},
{ B, C, Identifier},
{ A, Identifier },
{ B, Identifier },
{ C, Identifier }


Order not observed as relevant because they're each identities and
{ A, B, C, Identifier } would be equivalent to { A, Identifier, C, B }
This concerns me because of 'templates' which expand difficult to
outline, by hand, syntactic mechanisms like the modifiers of a type.
Verifying these post parse is possible, but this project is more about
trying different things. These situations sometimes create up to
720 variations of a given set of terminals, merged with the fact that
other identities can have equally complex sets of terminals the
sets possible are fairly staggering. I wonder if the upgrade to 64GB
will even be enough once context awareness is thrown in!


To curtail this, I added an annotation to the grammar to require these
identities to be explicitly called out as possibly ambiguous requiring
the said analysis. Other ambiguous identities are required to denote
a logical pecking order, since the grammar description language uses
includes and conditional compilation arguments, I can't rely on when
a term is first encountered to decide this priority.


Further, due to the ambiguity representing an identity within the result
language, each point where the symbols are valid, the set of states
which created the current deterministic state would need to be aware
of the ambiguity, limited to what is relevant for that state, to change
them to the proper identity once path determination was possible, since
each token should only be truly evaluated lexically once, if that state
which identified that ambiguity was itself ambiguous, it would need to
perform the projection again, though given it's less ambiguous at that
point the determination should be quicker since scanning has been done.


This would likely increase the chances for a given language being
ambiguous; however, typically language designers only introduce these
contextual mechanisms when they're versioning the language, or the
prevalence of the keyword is so minimal making it overpower identifiers
(or other identities) every time doesn't make sense (get/set in C#.)


Some of this disambiguation work is likely possible post-parse;
however, that's not always the case.


Any suggestions or insight towards what might have been missed is
appreciated.


-Allen Clark Copeland, Jr.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.