Re: Do we need parsers?



From: Nick Kramer <nkramer@jprc.com>
Newsgroups: comp.compilers
Date: 30 May 1997 23:16:50 -0400
Organization: JPRC
References: 97-05-158
Keywords: parse, design

John Carter writes:
> If every statement were an object, i.e. you had an "if-then" object,
> an "if-then-else" object, an "assign" object, and so on...
>
> If each object knew how to display itself, either as something like
> cout << "if" << ifObj.expr << " then\n" << ifObj.statements << "endif\n";
> or more graphically. And if each object knew how to execute itself,
> either by invoking the execute methods of other objects or by
> invoking primitives. They could even have a compile method that
> compiles the object into lower-level code.
>
> Then do we need lexical analysers and parsers? Couldn't we just
> create object editors and persistent object stores and bypass the
> whole grammar issue? An object editor would never let you delete or
> add, say, the 'i' in "if", but only the entire "if-then" object.
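

To make sure we're talking about the same thing, here's a rough
sketch of the scheme as I read it. This is my own made-up
illustration, not anyone's actual code; the class and member names
are invented:

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Every construct is an object that knows how to display and
    // execute itself; there is no source text anywhere.
    struct Expression {
        virtual ~Expression() {}
        virtual bool evaluate() const = 0;
        virtual void print(std::ostream &out) const = 0;
    };

    struct Statement {
        virtual ~Statement() {}
        virtual void execute() = 0;
        virtual void print(std::ostream &out) const = 0;
    };

    // The "if-then" object holds its condition and body directly.
    struct IfThen : Statement {
        Expression *expr;
        std::vector<Statement *> statements;

        // Executes itself by invoking the execute methods of its
        // parts, exactly as described above.
        void execute() {
            if (expr->evaluate())
                for (std::size_t i = 0; i < statements.size(); i++)
                    statements[i]->execute();
        }

        // Displays itself; a fancier version could draw graphics.
        void print(std::ostream &out) const {
            out << "if ";
            expr->print(out);
            out << " then\n";
            for (std::size_t i = 0; i < statements.size(); i++)
                statements[i]->print(out);
            out << "endif\n";
        }
    };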


There are two ideas here. The first is that "programs are not just a
bunch of characters" (they have syntax, semantics, etc.). This is a
good idea, although not a unique one. The Gwydion Project, which I
work with, is doing research along these lines; the buzzword is
"hypercode". See http://legend.gwydion.cs.cmu.edu/gwydion/ for a
rather out-of-date white paper.


The second idea in your message is that hypercode should be
implemented using the particular scheme you described, by editing and
internally representing programs with really smart Token objects. I'd
say this is not such a good idea.


This "token-centric" scheme has a lot of problems. The larger ones
are implementation issues. If you write a token-centric editor,
you'll end up scattering the parser throughout several dozen classes.
Conventional parsers may not be much fun to write, but a parser
scattered across several dozen files would be *far* less enjoyable.
Besides, with token-centric parsing you can't use existing parser
generators like YACC or JavaCC.
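

To make the scattering concrete, here's a made-up sketch of where
the grammar ends up living under the token-centric scheme. Every
production that a YACC grammar would state in one line gets buried
inside its own class (the names here are invented):

    #include <vector>

    // Each token object must answer "what may legally follow me?"
    // so each class privately holds one slice of the grammar.
    struct Token {
        virtual ~Token() {}
        virtual bool legalFollowers(const std::vector<Token *> &rest) const = 0;
    };

    // In if-token.cpp:  encodes  stmt -> IF expr THEN stmts ENDIF
    struct IfToken : Token {
        bool legalFollowers(const std::vector<Token *> &rest) const {
            return !rest.empty();  // stands in for the real check
        }
    };

    // In assign-token.cpp:  encodes  stmt -> ident '=' expr
    struct AssignToken : Token {
        bool legalFollowers(const std::vector<Token *> &rest) const {
            return rest.size() >= 2;  // stands in for the real check
        }
    };

    // ...and so on, one class and one grammar fragment per
    // construct, spread across as many files as there are tokens.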


But I think the more interesting issue is how the user interacts with
the system. As I see it, the token-centric approach is a variation on
the structure editor. A classic structure editor allows/requires you
to manipulate programs as a tree structure, which ensures that your
program is always syntactically correct.


The problem, as I see it, is that structure editors are very rigid.
When I write code, I frequently have unbalanced braces and other
syntax errors. And I correct them a few seconds later. Classic
structure editors won't let me edit that way, and that's a real pain
in the butt.


I've heard of two general approaches to this problem. One is to write
a hybrid structure/text editor. Above some granularity, usually
statements or expressions, you use tree operations. Below that
granularity, the editor acts like a normal text editor. The other
approach is to give the user what looks like a normal text editor, but
then re-parse the program with every keystroke (using fancy
incremental lexing and parsing algorithms to make this computationally
feasible).


The Gwydion Project has been prototyping a variation on the first
approach: we break the program into "components." In a
language like C, a component is one top-level form, such as a typedef
or a function definition. You can move components around using
tree-editor techniques. But each individual component acts like a
little Emacs buffer. You type in it for a while, then you hit the
"commit edit" button and the component is re-parsed.


This approach seems to work pretty well. In general, it lets users
edit programs the way they want to -- as text. But at the same time,
the editor understands the program at a deeper level than simple text,
and can do all those neat things other people have described in this
thread. Perhaps most important of all, adapting the editor for a new
programming language is relatively easy, because you can use a
conventional parser.


Actually, our prototype is slightly more complicated than I described
above. In order to do syntax highlighting (a.k.a. font-lock mode) for
components while they are being edited, we maintain a list of tokens
for each component. With each keystroke, we re-tokenize the line
that's been changed. (We originally re-tokenized the entire component
with each keystroke, but that was a bit too slow for large
components.) That's our "incremental" lexer, and it works great.


One of these days, I'd like to see how slow it would be to reparse the
component with every keystroke. After all, lexing is the most
expensive part of parsing, and since the lexing has already been done,
the parse on every keystroke just might be doable.


I think the big unknown with our approach is what users will think
about having to perform a "commit edit" action every time they're done
editing a component. I think that if it's convenient enough to commit
edits, users won't mind, but this is speculation. It's also possible
that re-parsing with every keystroke could eliminate the need to
explicitly commit edits, but I'm not going to hold my breath on that
one.
--

