Re: Parser classifications?

"Kenneth 'Bessarion' Boyd" <zaimoni@zaimoni.com>
Sun, 16 Aug 2009 21:30:04 -0700 (PDT)


Related articles
Parser classifications? alexander.morou@gmail.com (Alexander Morou) (2009-08-11)
Re: Parser classifications? zaimoni@zaimoni.com (Kenneth 'Bessarion' Boyd) (2009-08-13)
Re: Parser classifications? alexander.morou@gmail.com (Alexander Morou) (2009-08-13)
Re: Parser classifications? zaimoni@zaimoni.com (Kenneth 'Bessarion' Boyd) (2009-08-16)
Re: Parser classifications? alexander.morou@gmail.com (Alexander Morou) (2009-08-17)
From: "Kenneth 'Bessarion' Boyd" <zaimoni@zaimoni.com>
Newsgroups: comp.compilers
Date: Sun, 16 Aug 2009 21:30:04 -0700 (PDT)
Organization: Compilers Central
References: 09-08-020 09-08-021 09-08-022
Keywords: parse, comment
Posted-Date: 17 Aug 2009 00:48:23 EDT

On Aug 13, 2:56 pm, Alexander Morou <alexander.mo...@gmail.com> wrote:
> On Aug 13, 11:55 am, "Kenneth 'Bessarion' Boyd" <zaim...@zaimoni.com>
> wrote:
>
> ....
>
> > > Does anyone here know what class of grammar this will be by the time
> > > I'm done?
>
> > You're setting this up to support arbitrarily long lookahead. I
> > assume that if I'm wrong that this approach could support LL(infinity)
> > (aside from resource limitations), that an actual expert will correct
> > my misreading.
>
> This is primarily due to the focus on no backtracking. It still
> doesn't include cases where tokens might overlap. The parser that
> drives the data aspect of this project is a very simple [badly]
> hand-written parser. So I ignored the maximal munch principle, though
> I think I will add a small bit of information to the parser to define
> precedence (on equal-length tokens), since grammars can span multiple
> files.


Ok; no comment there (I'm not sure how the no-backtracking constraint
and the maximal munch principle/optimization interact).
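
For what it's worth, here's how I'd read that tie-break rule, as a
minimal Python sketch (the token names, patterns, and precedence
values are all invented for illustration): the longest match wins, and
precedence only decides between matches of equal length.

    import re

    # Hypothetical token table: (name, pattern, precedence).  Precedence
    # only breaks ties between matches of equal length; the longest match
    # still wins outright.
    TOKENS = [
        ("KEYWORD_IF", r"if",             2),
        ("IDENT",      r"[a-z][a-z0-9]*", 1),
        ("NUMBER",     r"[0-9]+",         1),
        ("WS",         r"[ \t]+",         0),
    ]

    def next_token(text, pos):
        # Best match at pos: compare (length, precedence) pairs.
        best = None
        for name, pattern, prec in TOKENS:
            m = re.match(pattern, text[pos:])
            if m:
                cand = (len(m.group(0)), prec, name, m.group(0))
                if best is None or cand[:2] > best[:2]:
                    best = cand
        return None if best is None else (best[2], best[3])

    def lex(text):
        pos = 0
        while pos < len(text):
            tok = next_token(text, pos)
            if tok is None:
                raise SyntaxError("no token matches at position %d" % pos)
            name, lexeme = tok
            if name != "WS":
                yield name, lexeme
            pos += len(lexeme)

    print(list(lex("if ifx 42")))
    # "ifx" lexes as one IDENT (maximal munch), but the bare "if" goes
    # to KEYWORD_IF: precedence breaks the equal-length tie with IDENT.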


> > So, for instance, while distinguishing where the rules are coming
> > from (1a and 1b) is important for hand-maintainability, it's not
> > nearly so important for a formal description and it actually
> > complicates the data representation.
>
> The primary reason for including path information was to hopefully aid
> in path discovery. Figuring out which rule you're on should be as
> simple as evaluating the initial state set for a given state in the
> language. From what I could tell, even if two productions shared
> their first 100 tokens via a shared sub-production, the actual type
> of production made itself clear fairly quickly after the shared
> point, based on a few states that were distinct from either. Without
> this context information to aid in such discovery, I'll have to
> rediscover which rules are involved, and essentially redetermine
> which rule is being parsed when the parse graph is constructed. So
> it's more than just about maintainability.


Ok. This is an example of something I'd never have guessed just from
reading about parsers, but which is blindingly obvious once you
actually have to implement one along these lines.
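
If I follow, here's a toy illustration in Python (the rules and token
sequences are entirely made up) of the candidate-rule set narrowing
only after the shared prefix, i.e., the information you'd otherwise
have to rediscover when constructing the parse graph:

    # Two hypothetical rules sharing a prefix (the "shared sub-production").
    RULES = {
        "FieldDecl":  ["Type", "Ident", ";"],
        "MethodDecl": ["Type", "Ident", "(", ")", "{", "}"],
    }

    def candidates_after(tokens):
        # Which rules' right-hand sides start with the tokens seen so far?
        live = set(RULES)
        for i, tok in enumerate(tokens):
            live = {r for r in live
                    if i < len(RULES[r]) and RULES[r][i] == tok}
        return live

    print(candidates_after(["Type", "Ident"]))       # both rules still live
    print(candidates_after(["Type", "Ident", "("]))  # only MethodDecl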


> > A formal description would be a state machine. I think your
> > difficulty in searching the literature is the same as mine:
> > * there is remarkably little concern about how to implement the state
> > machine, and
> > * the untutored ways of improvising state machines are very different
> > from what is generated by tools like yacc.
>
> Improvising? Yes, true; however, everything was improvised at one
> point. If I had a more complete educational background, perhaps I
> would have had better luck defining a perfect theory the first time.


Unlikely, when one is doing any kind of new research.


> It might be improvised, but that doesn't mean it will always be so.
> I hope to go back to school and research these things further, in a
> directed manner where I can avoid common mistakes. I don't have that
> luxury (school is a business; without money, no amount of interest
> can help you), so I have to make do with what I have.


Understood.


> > Likewise, working with paths is a specific way of using the
> > combinatoric graph used in formally defining the state machine. The
> > literature concentrates on how the mature automatic generators work
> > (e.g., rigging the control flow of the generated parser to process
> > the initial conditions of rules with shared initial conditions only
> > once, and working well with an incoming token stream).
>
> If you don't mind me asking, what literature?


Ok...consider how yacc and lex interact.


* yacc (and its successor bison) is basically a tool for automatically
constructing finite state machines in C that parse token sequences.
So the relevant literature would be on finite state machines, and how
to reduce them from the easy-to-specify kind (nondeterministic) to the
easy-to-automatically-generate kind (deterministic). One wouldn't do
badly starting with Wikipedia's citations for finite state machines
(http://en.wikipedia.org/wiki/Finite_state_machine); note that I
haven't carefully checked that the article itself states its citations
accurately.
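
For concreteness, the textbook reduction is the subset construction.
A sketch in Python, assuming an NFA encoded as
transitions[state][symbol] -> set of states, with "" for epsilon
moves (the encoding is my own choice, not anything standard):

    def epsilon_closure(states, trans):
        # All states reachable from `states` by epsilon ("") moves alone.
        stack, closure = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in trans.get(s, {}).get("", set()):
                if t not in closure:
                    closure.add(t)
                    stack.append(t)
        return frozenset(closure)

    def subset_construction(start, trans, alphabet):
        # Each DFA state is a frozenset of NFA states.
        start_set = epsilon_closure({start}, trans)
        dfa, work = {}, [start_set]
        while work:
            cur = work.pop()
            if cur in dfa:
                continue
            dfa[cur] = {}
            for a in alphabet:
                move = set()
                for s in cur:
                    move |= trans.get(s, {}).get(a, set())
                nxt = epsilon_closure(move, trans)
                dfa[cur][a] = nxt
                if nxt not in dfa:
                    work.append(nxt)
        return start_set, dfa

    # NFA for (a|b)*ab: nondeterministic because state 0 can move to
    # either 0 or 1 on "a"; state 2 is accepting.
    trans = {0: {"a": {0, 1}, "b": {0}},
             1: {"b": {2}}}
    start, dfa = subset_construction(0, trans, "ab")
    print(len(dfa), "DFA states")   # 3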


* lex is a tool that generates a C library (and a driver main()) for
incrementally lexing a text file. Again, there's a state machine in
the background. The constraint of returning tokens in sequence,
rather than producing them all at once, seems to be a historical
efficiency measure. (I'm speculating, but based on what I've read
about 1970s systems this seems very plausible.)
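
The pull-style interface itself is easy to mimic; a sketch of the
convention in Python (the yylex() name is the classic one, and the
whitespace-splitting body is just a stand-in for the real state
machine):

    class Scanner:
        def __init__(self, text):
            self.text, self.pos = text, 0

        def yylex(self):
            # Return one token per call, or None at end of input,
            # rather than tokenizing the whole file up front.
            while self.pos < len(self.text) and self.text[self.pos].isspace():
                self.pos += 1
            if self.pos >= len(self.text):
                return None
            start = self.pos
            while self.pos < len(self.text) and not self.text[self.pos].isspace():
                self.pos += 1
            return self.text[start:self.pos]

    s = Scanner("one two three")
    tok = s.yylex()
    while tok is not None:
        print(tok)
        tok = s.yylex()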


However, while state machines are relatively easy to write automatic
generators for, and even to optimize automatically, they are very hard
to verify explicitly. So they're not closely related to how one would
write a parser directly, where explicit verification (both proof and
extensive unit testing) becomes very important.


[Lex and yacc both generate state machines to match their input. Yacc
also has a state stack that lets it match nested and recursive
patterns, which the lex state machine can't do. A lex scanner calls
code you write every time it matches a token; you can return it
immediately to a calling routine (usually a yacc parser) or buffer it
somewhere and keep going; lex doesn't care. Compilers frequently feed
hints back from the parser to the scanner that modify its behavior
for subsequent tokenizing, which wouldn't work if the scanner blasted
through the whole input first. -John]
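
[The classic instance of that feedback is the C typedef problem.
A sketch in Python (the names are hypothetical, not any real API)
where the parser registers a name with the scanner mid-parse,
changing how later occurrences of the same spelling tokenize:

    class Lexer:
        def __init__(self, words):
            self.words = iter(words)
            self.typedef_names = set()   # mutated by the parser mid-parse

        def next_token(self):
            w = next(self.words, None)
            if w is None:
                return None
            # The same spelling tokenizes differently once the parser
            # has told us it names a type.
            kind = "TYPE_NAME" if w in self.typedef_names else "IDENT"
            return (kind, w)

    lx = Lexer(["foo", "foo"])
    print(lx.next_token())          # ('IDENT', 'foo')
    lx.typedef_names.add("foo")     # parser just saw "typedef ... foo;"
    print(lx.next_token())          # ('TYPE_NAME', 'foo')

If the scanner had tokenized the whole input before the parser ran,
there would be no chance to apply the feedback. -John]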

