Re: [Query] Tree-based parsing?


Related articles
[Query] Tree-based parsing? jlilley@empathy.com (John Lilley) (1997-04-16)
Re: [Query] Tree-based parsing? fjh@mundook.cs.mu.OZ.AU (1997-04-18)
Re: [Query] Tree-based parsing? bear@sonic.net (Ray Dillinger) (1997-04-22)
Re: [Query] Tree-based parsing? geert@sun3.iaf.nl (1997-05-04)
Re: [Query] Tree-based parsing? nkramer@cs.cmu.edu (Nick Kramer) (1997-05-04)
Re: [Query] Tree-based parsing? scotts@metaware.com (Scott Stanchfield) (1997-05-08)
Re: [Query] Tree-based parsing? scotts@metaware.com (Scott Stanchfield) (1997-05-08)
Re: [Query] Tree-based parsing? wilson@cs.utexas.edu (1997-05-08)

From: Ray Dillinger <bear@sonic.net>
Newsgroups: comp.compilers
Date: 22 Apr 1997 21:14:22 -0400
Organization: Cognitive Dissidents
References: 97-04-096
Keywords: parse

John Lilley wrote:


> Hi,
> I've had this idea recently that surely has been explored by those
> more knowledgeable than I....


<snip>


> Essentially, I would take the following steps:
> 1) Lex the input stream into tokens as usual.
> 2) Convert a token stream directly into a degenerate "AST"


<snip>


> 3) Walk the flat "AST" using LL or LALR methods and match gross
> features of the language that are invariant regardless of local
> syntax.


  <snip>


> 4) Perform successive refinement steps of the AST.


  <snip>


> Yes, this approach would definitely take a lot of memory and probably
> be slower than a "normal" parser. I am hoping that despite these
> drawbacks, it would have significant benefits in terms of:


> 1) Error recovery.


<snip>


> 2) Incremental processing.


<snip>


> 3) AST generation.


<snip>


> 4) Decoupling of semantics and syntax.


<snip>


> 5) I believe, but have no concrete evidence, that successive tree
> rewrites where for example sequences of modifiers are collapsed into a
> subtree before being used, will allow more powerful pattern-matching
> to be applied than would be available from monolithic parsing of the
> token stream.


One thing that has probably blinded a lot of people to this
approach is that you can't do it with Yacc. For a lot of compiler
writers, the range of ideas they can have is limited to a certain
extent by the tools they work with. Lex and Yacc are good tools, but
they're not the be-all and end-all of compiler writing, and I often
exhort people to "reinvent the wheel," hoping that they'll come up
with better wheels, or at least different ones, and explore a little
of the territory away from the main road.


I think it's probably a good approach to analyzing C/C++ code,
but it's not fully generalizable. That is, there are languages you
won't be able to compile this way, particularly languages whose gross
syntax is dynamically defined (a Forth with user-defined parsing
words, say, or a Lisp with reader macros). But the advantages you
mention are reasonable expectations and probably worth the limitation.
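

To make the "degenerate AST" idea concrete, here is a toy sketch (my
own illustration, in Python; the function names and the choice of
parens as the only landmark are mine, not John's): the parse starts as
a flat list of tokens, and a refinement pass rewrites bracketed runs
of that list into subtrees.

    # Toy sketch: the parse starts as a flat token list (the "degenerate AST")
    # and each refinement pass rewrites part of it into nested subtrees.

    def lex(text):
        """Step 1: split the input into tokens (crude whitespace lexer)."""
        return text.replace('(', ' ( ').replace(')', ' ) ').split()

    def refine_parens(tokens):
        """One refinement pass: collapse ( ... ) runs into subtrees (lists)."""
        stack = [[]]
        for tok in tokens:
            if tok == '(':
                stack.append([])            # open a new subtree
            elif tok == ')':
                sub = stack.pop()
                stack[-1].append(sub)       # hang it on the enclosing level
            else:
                stack[-1].append(tok)
        return stack[0]

    # refine_parens(lex("f ( a , b ( c ) ) ;"))
    #   => ['f', ['a', ',', 'b', ['c']], ';']

Later passes would work the same way on the subtrees themselves,
rewriting semicolon-delimited runs, declarations, and so on, each pass
seeing a slightly less flat tree than the one before.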


And there is something you can do to make this approach friendlier
in terms of memory use: interleave the passes. Read the input stream,
tokenizing as you go, only until you reach one of your "top-level"
landmarks (from a left paren to its matching right paren, for
example). Then traverse the tokens in that lexical space down to your
second-level landmarks (semicolons, say), traverse the resulting
subtrees down to your third-level landmarks, and so on, recursively,
until that first lexical space is fully parsed; only then read on to
the next one. You still get the recursively-refined block structure,
but because the passes are interleaved you never have to hold the
early parts of the parsed tree in memory, nor the unread input.
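

Here's a rough sketch of that interleaving, again a toy in Python
with parens standing in for the top-level landmark (it reuses
refine_parens from the sketch above); a real front end would pick
landmarks suited to its own grammar:

    # Toy sketch of the interleaving: read tokens only until a top-level
    # landmark closes, refine that span completely, hand it off, and
    # discard it before reading any further input.

    def top_level_spans(tokens, open_tok='(', close_tok=')'):
        """Yield one top-level ( ... ) span at a time; never buffers more
        than the current span."""
        buf, depth = [], 0
        for tok in tokens:
            buf.append(tok)
            if tok == open_tok:
                depth += 1
            elif tok == close_tok:
                depth -= 1
                if depth == 0:          # the landmark just closed
                    yield buf
                    buf = []            # forget this span before reading on
        if buf:
            yield buf                   # stray tokens outside any landmark

    def process(tokens):
        for span in top_level_spans(tokens):
            tree = refine_parens(span)  # refine this span all the way down...
            print(tree)                 # ...hand it downstream, then drop it

The point is just that nothing upstream of the current span has to
stay resident: each top-level lexical space is read, refined through
every level, consumed, and thrown away before the next one is touched.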


Good Luck!


Bear
--

