Re: Compiler Books? Parsers?

Carl Cerecke <>
23 Dec 2003 00:22:20 -0500

          From comp.compilers


From: Carl Cerecke <>
Newsgroups: comp.compilers
Date: 23 Dec 2003 00:22:20 -0500
Organization: TelstraClear
References: 03-10-113 03-10-145 03-11-010 03-11-083 03-12-017 03-12-116 03-12-125
Keywords: parse, errors, practice
Posted-Date: 23 Dec 2003 00:22:20 EST

Chris F Clark wrote:

> The second point on this topic, which I think I mentioned in another
> thread, is that many (most to my mind) errors are actually sub-lexical
> occurring at the single character level and not at the parsing level.

As part of my recently completed PhD, I analysed about 200,000
incorrect Java programs written by novice programmers. Nearly all are
correct lexically. That is, the program constitutes a valid lexical
stream. About half of the programs had no syntax errors, only
semantic errors. (Using the term "program" loosely here of course. If
the files really were programs, they wouldn't have any errors, and I
wouldn't have looked at them. They are mostly "almost programs",
although some wouldn't even be classed as that; you'd be surprised
what some novice programmers submit to a compiler!).

Anyway, the point is that almost all errors involve a lexically valid
token stream. However....

> Even most hand-written parsers use some form of separate lexer (which
> is mostly context insensitive), so when I make the error of omitting
> the closing quote from my character string, it swallows far too much
> relevant text and the resulting error recovery isn't important,
> because the basic problem, the missing character, is not in the parser's
> purview at all.

....the most difficult errors to repair mostly fall into two related
categories: comment delimiter problems, and string delimiter problems.

Some novice programmers have real trouble remembering that /* opens a
comment and */ closes one, and often transpose one or both of them.
Suddenly, the parser is asked to make sense of the tokens <star>
<slash> <ident> <ident> <ident> ... and gets rather confused. Seeing
this has convinced me that it is better for the comment delimiters of a
language to be single-character tokens that are not used for any other
purpose. Also, notice how the above stream of tokens could look like a
malformed expression to a parser attempting recovery.
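
To make the transposed-delimiter case concrete, here is a minimal sketch of a toy lexer in Python. The token names and the tiny token set are assumptions chosen for illustration; this is not a full Java lexer and not the tool used in the study. It only shows that a comment written as */ ... /* raises no lexical error and simply turns into ordinary tokens.

```python
import re

# Hypothetical token set for illustration only; not a full Java lexer.
TOKEN_SPEC = [
    ("comment", r"/\*.*?\*/"),       # well-formed block comment (skipped)
    ("ident",   r"[A-Za-z_]\w*"),
    ("number",  r"\d+"),
    ("star",    r"\*"),
    ("slash",   r"/"),
    ("punct",   r"[=;(){}+\-]"),
    ("space",   r"\s+"),             # whitespace (skipped)
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)

def tokenize(source):
    """Return the list of token kinds, dropping comments and whitespace."""
    kinds = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"lexical error at offset {pos}")
        if match.lastgroup not in ("comment", "space"):
            kinds.append(match.lastgroup)
        pos = match.end()
    return kinds

# Correct delimiters: the comment disappears before the parser sees anything.
good = tokenize("/* add one */ x = x + 1;")

# Transposed delimiters: no lexical error, just a stream of ordinary tokens.
bad = tokenize("*/ add one /* x = x + 1;")
```

In the second call the lexer never complains. The parser receives star, slash, ident, ident, slash, star, ... and has to work out for itself that this was meant to be a comment.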

Even though missing delimiter problems are a lexical issue from the
programmer's point of view, it is the parser that has to deal
(gracefully!) with the problem, while the lexer often remains
blissfully ignorant.
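
The missing-quote case can be sketched the same way. The toy lexer below assumes a hypothetical language in which string literals may span lines (Java itself rejects a newline inside a string literal), and its token names are again made up for illustration. It shows how one missing closing quote makes the lexer pair the wrong quotes, swallow real code into a string, and notice a problem only later, at a quote that is actually fine.

```python
import re

# Hypothetical mini-lexer. Assumption: string literals may span lines,
# which is NOT true of Java; it is chosen here to show the swallowing effect.
TOKEN = re.compile(r'''
    (?P<string>"[^"]*")        # from a quote up to the next quote, newlines included
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<punct>[=;])
  | (?P<space>\s+)
  | (?P<error>.)               # anything else, such as a quote with no partner
''', re.VERBOSE | re.DOTALL)

def tokenize(source):
    """Return (kind, text) pairs, skipping whitespace."""
    return [(m.lastgroup, m.group())
            for m in TOKEN.finditer(source)
            if m.lastgroup != "space"]

# Both strings closed properly.
ok = tokenize('a = "one";\nb = "two";')

# Closing quote missing after one on the first line.
broken = tokenize('a = "one;\nb = "two";')
```

With the quote missing, the first string token runs from the first line into the second and takes `b =` with it. The word two comes out as an identifier, and the only lexical complaint is about the final quote on the second line, which the programmer typed correctly. Whatever recovery happens next is left to the parser.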

