|Top-down deterministic parser? firstname.lastname@example.org (Alexander Morou) (2009-05-11)|
|Re: Top-down deterministic parser? email@example.com (Alexander Morou) (2009-07-11)|
|Re: Top-down deterministic parser? firstname.lastname@example.org (Alexander Morou) (2009-07-13)|
|From:||Alexander Morou <email@example.com>|
|Date:||Mon, 11 May 2009 21:16:50 -0500|
|Posted-Date:||12 May 2009 05:06:23 EDT|
I've been further researching parsing techniques over the past year
or two, and I have a few questions for the experts in language
Over the past year or so I've been building a framework that aids
in generative programming concepts, a code generator for higher
level languages built upon the .NET Common Language Infrastructure.
The goal is to replace the previous incarnation of the same. The
previous is complete for what it does, but not sufficient enough to
handle the full set of concepts behind a compiler: resolving the
symbols used in building the code, lambda, LINQ, extension, and
other concept transformation, resolution, and/or expansion.
Beyond that, I've also been coding a second project using the first
code generation project, a parser generator, which right now is a
fairly decent first attempt for a lexical analysis generator using
immutable state machines (fixed vs. dynamic, ie. hard-coded).
I've previously discussed the specifics of what I was interested in
doing, common concepts, and so forth with Chris F. Clark,
Hans-Peter Diettrich, and Ira Baxter, who have given me some
insight as to what I might expect, and what I might be doing wrong
(with the lexical phase).
Presently I'm working on fleshing out a concept for the syntactical
analysis aspect (which I've found is different from Semantic
Analysis, which defines meaning vs. grammatical correctness). My
aim initially was to create a backtrack-less recursive descent
parser. Since then, I've done a lot more reading up on the subject,
and to create a parser that uses an, potentially, infinite
look-ahead, I've discovered that the code that would result
in a fixed-execution parser would not only be ugly, but redundant.
The rule path determination code would have to be pre-processed and
fully replicated inside the production which calls for each
individual rule. This creates code which bloats, fast. Direct
left-recursion would be easy to circumvent by trying to parse all
the other productions first and wrapping them inside an instance of
the rule, then attempting to see if the next token in the
look-ahead, for the left-recursive production, would be valid, but
even that doesn't solve the initial problem of bloat.
So I went back to square one, and thought of other parsing
mechanisms such as LR and the commonly used LALR parser variant.
One of my issues with LR parsers is they're simply confusing, to
me, at least; however I did get something out of them, they're
deterministic machines, which led me to believe maybe I've already
created something very similar to them when I handled the lexical
Since most languages, of any great complexity, require recursion of
some sort, the standard DFA machine can't really be created in the
same fashion as seen in a scanner's state machine. The primary
reason for this is what I've known to call Combinatorial Explosion
(through Wikipedia -_-). Since it is recursive, fully expanding it
from the get-go would result in an infinitely large structure which
can't be expressed on machines with finite limitations. Other
systems I've read utilize what's called lazy evaluation.
The system I'm going to attempt to build now will utilize a DFA
construct, which defines the transitions in token and rule form in
their initial phase, upon entering a state of a given production,
the rule references would be weeded out into their tokenized form,
a node relative to the given state would be passed to the generated
productions to preserve the path information of a given production,
so the transducer-type machine that results will be able to
reconstruct a proper parse graph from the DFA states. The initial
structure of the transitions of an expanded node will be
non-deterministic, only as a valid token is encountered will the
transition state set be determinized.
Left-recursion, the direct form at least, could be handled by
creating a detached state set that determines the productions
post-left-recursive reference, and all exit states could transition
to that state, since all callers of a given production point the
exit states, of that new production, to the state following the
reference to the production: it's nondeterministic, as said above,
this is determinized as needed.
Granted a mutable state machine will have a time impact for the
first time a given production is encountered, there's no real way
to avoid the cost of dynamics, short of persisting the tree after
it is expanded and reloading it upon the next time the parser's
used (which then increases parser startup and shutdown time).
The main question I have for the people in the proper field of
study is: have I missed anything? I realize that ambiguity
resolution will still be an issue, but I figure it should be pretty
simplistic to determine such ambiguities by exit-states for a given
rule (which are still marked as an exit point, but only for that
rule, not the entire graph) overlapping on two transitions. So if
you're on a point where it's a double exit, it's ambiguous as to
which production is actually supposed to result.
If there's anything that you've think I've left out, please ask me
questions, being outside the area of academia there's likely things
I'm going to overlook, and methods that I don't know of. The
difference between this and LR parsers is it is still looking at
things from a top-down perspective. I didn't really get the point
of actions: the shifting and reducing, to me that seems like a lot
of work that could be avoided if you're looking at it from the
Any response is appreciated.
PS: I didn't get into token-token conflicts because that would
introduce a small level of nondeterminism to determine properly,
especially since I'm using a system that doesn't use the maximal
munch principle. This is due to the simplicity of the system I'm
working in. It does not allow the use of match injection so in
cases where token-token conflicts exist, it would have to evaluate
both paths until only one prevails. In cases where both do,
another type of ambiguity would arise.
Return to the
Search the comp.compilers archives again.