
          From comp.compilers


From: Chris F Clark <>
Newsgroups: comp.compilers
Date: 17 Dec 2004 00:30:44 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 04-11-089 04-11-111 04-12-044 04-12-072
Keywords: parse, design
Posted-Date: 17 Dec 2004 00:30:44 EST

Henry Spencer wrote:

> It is also a very nice "separation of concerns", dividing the problem
> into two steps and thus often making it easier to solve.
> I have found a scanner/parser separation very useful even in problems
> normally thought too simple to require it.

Vidar Hokstad replied:

> I think the most valuable part of the traditional scanner/parser
> separation is when it guides language design. A language designer
> who thinks like a compiler writer seems to me to often produce
> languages where tokenization is simple enough that the separation
> isn't important.

I don't know if I've ever seen a language where the separation wasn't
important. However, I tend to approach the problem from the point of
view that there is a relatively simple "core" lexer/scanner that is
about 80-90% correct for any target language. As a result, I tend to
start with the core lexer/scanner and then only add things to it when
adding things makes the parser less complex. This practice has
certainly served me well over the years.

It is also part of the design rationale in Yacc++ and the Language
Objects Library. Over the years as I found "best practice" "hacks"
that simplified problems for certain languages, I have tried to
canonize those hacks into the tool set. (An important point of modesty
is due here, only a few of these hacks were my own discovery, most of
them came from someone else.) The goal is that once the "hacks"
are supported and documented as features of the tool set, they can
become more than hacks: patterns to guide compiler writers.

The key point in canonizing the hacks into patterns is that it gives
the hacks a name and a well-defined meaning. For example, if I say
"discard token", any Yacc++ user can quickly know what I'm talking
about, because discard tokens relate to a specific feature of the tool
set, and that feature solves a specific problem. (In particular, the
discard token feature is used to represent tokens that the lexer
recognizes, but which are discarded before reaching the parser, so
the parser never sees them. Comments and whitespace are typical
discard tokens.)
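
The discard-token idea is easy to illustrate outside any particular
tool. This is not Yacc++ syntax, just a minimal Python sketch of the
concept: the lexer recognizes every token, and a filter drops the
discard kinds before the parser sees the stream.

```python
# Token kinds the parser should never see (an assumed, illustrative set).
DISCARD = {"WHITESPACE", "COMMENT"}

def tokens_for_parser(all_tokens):
    """Yield only the tokens the parser should see."""
    for kind, text in all_tokens:
        if kind not in DISCARD:
            yield (kind, text)

# A hand-built token stream as the lexer might produce it:
stream = [("IDENT", "x"), ("WHITESPACE", " "), ("PUNCT", "="),
          ("COMMENT", "/* init */"), ("INT", "1")]
print(list(tokens_for_parser(stream)))
```

The parser receives only IDENT, PUNCT, and INT; the whitespace and
comment tokens are recognized but silently dropped.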

But, I digress. The main point is that there is this core
scanner/lexer that I carry around in my head that solves 80% of the
scanning problem for whatever language I intend to parse next. It
typically includes:

                - an identifier token
                - an integer token (and perhaps a floating-point token and a hex token)
                - a quoted string token
                - a comment (discard) token
                - a whitespace (discard) token
                - some punctuation mark tokens
                - some list of keyword tokens
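
That core can be sketched in a few lines. The following is a rough
Python version (using the standard re module) with assumed, C-like
patterns for each token class; any real language would need its own
tweaks, which is exactly the point of starting from a core.

```python
import re

# The "core" token set: whitespace, comments, numbers, strings,
# identifiers, punctuation, keywords. Patterns here are illustrative.
TOKEN_SPEC = [
    ("WHITESPACE", r"[ \t\n]+"),          # discard token
    ("COMMENT",    r"//[^\n]*"),          # discard token
    ("FLOAT",      r"\d+\.\d+"),          # must precede INT
    ("HEX",        r"0[xX][0-9a-fA-F]+"), # must precede INT
    ("INT",        r"\d+"),
    ("STRING",     r'"[^"\n]*"'),
    ("IDENT",      r"[A-Za-z_]\w*"),
    ("PUNCT",      r"[+\-*/=(){};,]"),
]
KEYWORDS = {"if", "else", "while"}        # promoted from IDENT
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def lex(text):
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise SyntaxError(f"bad character at {pos}: {text[pos]!r}")
        kind, value = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("WHITESPACE", "COMMENT"):
            continue                      # discard tokens never reach the parser
        if kind == "IDENT" and value in KEYWORDS:
            kind = "KEYWORD"
        yield (kind, value)

print(list(lex('if x { y = 0x1F; } // done')))
```

Ordering FLOAT and HEX ahead of INT in the alternation is the kind of
subtle detail that differs per language, and it is why the core is
80-90% of a lexer rather than all of it.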

In addition to that true core, there are some typical extensions that
get applied. A common example is that many languages have a C-like
preprocessor. Many of the rudiments of that are parts of the common
patterns that are in the tool set.

In practice, I generally cobble together my next lexer by taking some
existing lexer and slightly modifying it, although writing these
tokens from scratch is also not that hard. And this was the point I
was driving at.
for. No matter how simple the lexing phase is, by separating it into
a distinct phase, I can generally reuse a significant portion of an
already written lexer to solve the problem. It doesn't matter that
writing one of these lexers isn't that hard, it is almost always
easier to "borrow" an existing solution.

Now, just as I didn't originate most of the patterns that are in
Yacc++, I'm not the only one who has seen that most lexers have the
above structure. In fact, some toolkit authors have tried to simplify
the problem into providing semi-pre-canned lexers, where one has some
limited ability to customize the above items. In practice, I've found
that the subtle differences in how such things work militate against
a semi-pre-canned solution. The systems I have seen are too limited
in their customization options, being more pre-canned than flexible.

It reminds me of our esteemed moderator's comments on creating an
Uncol--it is easy to create a very functional version that handles
some core set of languages, however as one grows the diversity of
languages supported, eventually the differences overwhelm the
commonalities and the attempt dies due to heat death, too much entropy
pulling the core model apart and fragmenting it. Curiously, I think
the same principle holds here: there is an 80-90% core "intermediate
language", bytecode, or virtual machine that would work for most
languages. There is also a set of patterns that are useful
customizations for large subsets of languages.

The solution to these kinds of problems is a method of describing
what one wants (i.e. the input to a tool) plus a library that
includes patterns that solve common problems. However, the ultimate
control
needs to be in the author's hands. At some level the author needs to
be able to say, I want this solution at this point (but not over
here). If one has that, then the author can borrow a previous
solution and make the changes that describe the exact places where the
previous solution didn't fit and a different choice needs to be made.
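
One way that borrow-and-override workflow might look, in a hypothetical
sketch (the names and patterns here are illustrative, not any real
tool's API): keep the core token definitions in a shared table, and let
each language replace only the entries where the core choice is wrong.

```python
# A shared core token table (illustrative patterns).
CORE_TOKENS = {
    "COMMENT": r"//[^\n]*",       # C++-style line comments
    "STRING":  r'"[^"\n]*"',
    "IDENT":   r"[A-Za-z_]\w*",
}

def derive_lexer_spec(overrides):
    """Copy the core spec, then apply this language's changes on top."""
    spec = dict(CORE_TOKENS)
    spec.update(overrides)
    return spec

# A language with Ada-style comments borrows everything else unchanged:
ada_like = derive_lexer_spec({"COMMENT": r"--[^\n]*"})
print(ada_like["COMMENT"])   # the one override
print(ada_like["IDENT"])     # inherited from the core
```

The author states only the places where the previous solution didn't
fit; everything else comes along for free.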

In summary:

The separation of lexers and parsers is good. One of the advantages
of the separation is that one can develop patterns for parts of the
problem independently. This allows one to reuse previous solutions
more easily than if one didn't split that process into parts. Even if
the separation reduces one of the parts (e.g. lexing) to almost
trivial, it is still worthwhile if the trivial part can be done via
reuse. In the case of languages, the lexing part isn't trivial enough
to be completely automated down to a semi-pre-canned solution. It is
still worthwhile having a tool+library approach, so that one can make
the exact tweaks one needs for each language's special cases.

Just my opinion,

Actually, I have a lot more on this topic, but this posting was
already too unfocused.

Chris Clark Internet :
Compiler Resources, Inc. Web Site :
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
