Re: Parser classifications?


          From comp.compilers

Related articles
Parser classifications? alexander.morou@gmail.com (Alexander Morou) (2009-08-11)
Re: Parser classifications? zaimoni@zaimoni.com (Kenneth 'Bessarion' Boyd) (2009-08-13)
Re: Parser classifications? alexander.morou@gmail.com (Alexander Morou) (2009-08-13)
Re: Parser classifications? zaimoni@zaimoni.com (Kenneth 'Bessarion' Boyd) (2009-08-16)
Re: Parser classifications? alexander.morou@gmail.com (Alexander Morou) (2009-08-17)
From: Alexander Morou <alexander.morou@gmail.com>
Newsgroups: comp.compilers
Date: Thu, 13 Aug 2009 12:56:44 -0700 (PDT)
Organization: Compilers Central
References: 09-08-020 09-08-021
Keywords: parse
Posted-Date: 14 Aug 2009 11:09:55 EDT

On Aug 13, 11:55 am, "Kenneth 'Bessarion' Boyd" <zaimoni@zaimoni.com>
wrote:
> . . .
> Looking ahead and seeing the performance concerns:
> * May I assume this *is* being done incrementally, in a way that
> calculates each step along the state-trail once?
> * If not, do we have a strong memory conservation design requirement
> that mandates this?
>
> (I see this is being written in C#, so I'll hold off on commentary
> specific to C++ .)


The process is done on an as-needed basis, so yes, it's incremental.
The primary reason it reuses states within the system is the way most
rules in my examples work. From the provided example (where I omitted
the flat-DFA view of the rules), methods always end with a '}', so
while the path that reaches the '}' might vary, leading to alternate
versions of that '}' state, the actual subset construction for what
comes -after- it is always identical, so the exiting states are
usually shared. The most notable exceptions are cases where the
actual state sets differ: for example, a 'compilation unit' in C#
would use a different initial state set for its very first state than
it would after encountering its first class or namespace. After that
class, certain rules go out of scope, such as using directives, which
changes the initial/first set after the first class's closing '}'.
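
Roughly, the reuse looks like this (a minimal C# sketch; the type and
member names here are made up for this post, not the project's actual
code): determinized states are cached by the set of underlying grammar
states they represent, so every path that reaches '}' with the same
follow set shares a single state object.

    using System.Collections.Generic;
    using System.Linq;

    class DfaState
    {
        public ISet<int> CoreStates;                   // underlying grammar states
        public Dictionary<string, DfaState> Transitions =
            new Dictionary<string, DfaState>();        // keyed by token name
    }

    class StateCache
    {
        // Canonicalize the core set so an identical subset-construction
        // result is looked up and reused rather than rebuilt.
        private readonly Dictionary<string, DfaState> cache =
            new Dictionary<string, DfaState>();

        public DfaState GetOrCreate(ISet<int> core)
        {
            string key = string.Join(",", core.OrderBy(s => s));
            DfaState state;
            if (!cache.TryGetValue(key, out state))
            {
                state = new DfaState { CoreStates = core };
                cache[key] = state;
            }
            return state;
        }
    }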


The overall caching mechanism is twofold: each state has its own
transition table, and there's a higher-level, unhinged cache for the
states. The paths themselves are fairly lightweight; while they
cross-reference one another, it's in a single direction, so C#'s GC
should be able to detect when useless paths were created and clean
them up on its own. I've often thought it might not be a bad idea to
do self-cleanup, but I want to get it working before I worry about
performance (that only matters once it works; optimize too early and
you might make things worse when you need to change something). State
caching works by checking the initial/first/follow set groups
calculated when a token transition is requested on a given state's
transition table. If the resulting initial/first/follow set matches a
set of paths that already exists, that match is reused from the
unhinged cache. Since the paths themselves are very simple
structures, the time involved in checking for them shouldn't be
terribly high (C#'s dictionaries use hash codes to avoid needless
full comparisons).
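
The two-level lookup is roughly this (again a sketch with names I've
made up for illustration, not the project's API): each state memoizes
its own transitions, falling back to the shared cache keyed by the
computed initial/first/follow sets before building anything new.

    using System;
    using System.Collections.Generic;

    class PathSet { /* the lightweight path structure; details omitted */ }

    class ParseState
    {
        public Dictionary<string, PathSet> Transitions =
            new Dictionary<string, PathSet>();
    }

    class LookaheadCache
    {
        // The higher-level "unhinged" cache: path sets keyed by a
        // canonical rendering of the sets that produced them.
        private readonly Dictionary<string, PathSet> shared =
            new Dictionary<string, PathSet>();

        public PathSet Transition(ParseState state, string token,
                                  Func<string> computeKey, Func<PathSet> build)
        {
            PathSet result;
            if (state.Transitions.TryGetValue(token, out result))
                return result;                 // per-state memo hit

            string key = computeKey();         // initial/first/follow sets
            if (!shared.TryGetValue(key, out result))
            {
                result = build();              // construct the new path set
                shared[key] = result;
            }
            state.Transitions[token] = result; // memoize on the state, too
            return result;
        }
    }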
The issue I've found so far is non-left recursion. If you were
working with a parenthesized expression, there would only be as many
right parentheses as left parentheses, which is similar to a
recursive descent parser, except that the paths emulate that theme:
instead of looking at one rule at a time, they look at all of them.
I don't know enough about Big-O notation to tell you what kind of
time cost is involved, but as an example, an expression with 200 left
parentheses took 0.7 seconds to calculate all the lookaheads as the
paths get deeper, 5.39 seconds for twice as many (400), and 55.39
seconds for 800. The number of sources per level is constant, and I
doubt anyone will use 200, 400, let alone 800 parentheses in one
expression. The ratios are 7.7x from 200->400 and 10.27x from
400->800; since doubling the input multiplies the time rather than
adding to it, I'd guess the growth is at least quadratic, though the
multiplier doesn't appear constant, so there are probably other
variables involved. Could someone pipe in if they know more about
this?
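
As a back-of-the-envelope check on those timings (my own arithmetic,
nothing more): if the runtime is roughly c*n^k, then doubling n
multiplies the time by 2^k, so k = log(t(2n)/t(n)) / log(2). That
gives exponents of about 2.9 and 3.4 here, which looks closer to
cubic than quadratic.

    using System;

    class GrowthEstimate
    {
        static void Main()
        {
            double[] n = { 200, 400, 800 };
            double[] t = { 0.7, 5.39, 55.39 };  // seconds, from above
            for (int i = 1; i < n.Length; i++)
            {
                double k = Math.Log(t[i] / t[i - 1])
                         / Math.Log(n[i] / n[i - 1]);
                Console.WriteLine("{0} -> {1}: exponent ~ {2:F2}",
                                  n[i - 1], n[i], k);
            }
            // Prints ~2.94 and ~3.36; pure quadratic growth would give
            // exactly 2.00 (a 4x time increase per doubling).
        }
    }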


> > Does anyone here know what class of grammar this will be by the time
> > I'm done?
>
> You're setting this up to support arbitrarily long lookahead. I
> assume that if I'm wrong that this approach could support LL(infinity)
> (aside from resource limitations), that an actual expert will correct
> my misreading.


This is primarily due to the focus on avoiding backtracking. It
still doesn't handle cases where tokens might overlap. The parser
that drives the data aspect of this project is a very simple, [badly]
hand-written parser, so I ignored the maximal-munch principle, though
I think I'll add a small bit of information to the parser to define
precedence on equal-length tokens, since grammars can span multiple
files.
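
What I have in mind is roughly the following (a sketch only; the
types and names are invented for illustration, not the project's
actual tokenizer): the longest match wins, and precedence breaks ties
between equal-length matches.

    using System;
    using System.Collections.Generic;

    class TokenRule
    {
        public string Name;
        public int Precedence;                // higher wins equal-length ties
        public Func<string, int, int> Match;  // match length at pos, 0 if none
    }

    static class Tokenizer
    {
        public static TokenRule Next(IList<TokenRule> rules,
                                     string input, int pos)
        {
            TokenRule best = null;
            int bestLength = 0;
            foreach (var rule in rules)
            {
                int length = rule.Match(input, pos);
                if (length > bestLength ||
                    (length > 0 && length == bestLength &&
                     rule.Precedence > best.Precedence))
                {
                    best = rule;
                    bestLength = length;
                }
            }
            return best;   // null when nothing matches at this position
        }
    }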


> So, for instance, while distinguishing where the rules are coming
> from (1a and 1b) is important for hand-maintainability, it's not
> nearly so important for a formal description and it actually
> complicates the data representation.


The primary reason for including path information was to aid path
discovery. Figuring out which rule you're on should be as simple as
evaluating the initial state set for a given state in the language.
From what I could tell, even if two productions shared their first
100 tokens through a shared sub-production, the actual type of
production made itself clear fairly quickly after the shared point,
based on a few states distinct to each. Without this context
information, I'd have to rediscover which rules are involved and
essentially redetermine which rule is being parsed when the parse
graph is constructed. So it's about more than just maintainability.
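
Conceptually it's something like this (a rough sketch under my own
assumed names): each combined state remembers which rules its member
states came from, so once the surviving set narrows to one, the
production is identified without re-walking the tokens already
consumed.

    using System.Collections.Generic;
    using System.Linq;

    class AnnotatedState
    {
        // Map each underlying grammar state to the rule(s) that
        // contributed it to this combined state.
        public Dictionary<int, HashSet<string>> SourceRules =
            new Dictionary<int, HashSet<string>>();

        // The productions still possible at this point in the parse;
        // once this narrows to a single name, the rule is known.
        public IEnumerable<string> CandidateRules()
        {
            return SourceRules.Values.SelectMany(rules => rules).Distinct();
        }
    }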


> A formal description would be a state machine. I think your
> difficulty in searching the literature is the same as mine:
> * there is remarkably little concern about how to implement the state
> machine, and
> * the untutored ways of improvising state machines are very different
> from what is generated by tools like yacc.


Improvising? Yes, true; however, everything was improvised at some
point. With a more complete educational background, perhaps I'd have
had better luck defining a sound theory the first time. It may be
improvised now, but that doesn't mean it always will be. I'd like to
go back to school and research these things in a directed manner,
where I could avoid common mistakes, but I don't have that luxury
(school is a business; without money, no amount of interest can help
you), so I have to make do with what I have.


> Likewise, working with paths is a specific way of using the
> combinatoric graph used in formally defining the state machine. The
> literature concentrates on how the mature automatic generators work
> (e.g., rigging the control flow of the generated parser to process
> the initial conditions of rules with shared initial conditions only
> once, and working well with an incoming token stream).


If you don't mind my asking, what literature?


Thanks,


