Re: allowing \n in tokens

cort@shay.ecn.purdue.edu (Cortland D Starrett)
30 Jun 1997 23:02:17 -0400

          From comp.compilers

Related articles
Re: allowing \n in tokens cort@shay.ecn.purdue.edu (1997-06-30)
| List of all articles for this month |

From: cort@shay.ecn.purdue.edu (Cortland D Starrett)
Newsgroups: comp.compilers.tools.pccts,comp.compilers
Date: 30 Jun 1997 23:02:17 -0400
Organization: Purdue University, W. Lafayette, IN
References: <5oq7af$qov@mozo.cc.purdue.edu> <33B17422.446B9B3D@metaware.com>
Summary: white space in parsers
Keywords: parse, design

Thank you Sinan for the excellent suggestion using 'ch'.
Thank you Scott for reminding me that, "Tokens delimit tokens."


Scott's post straightened me out enough to write a long-winded
design note to my co-workers sharing my (mis)understanding of
the work of the tokenizer (dlg, lex, etc). I include it hoping
it will be of some benefit to another enthusiastic, but struggling
parser builder. I have eliminated company names, etc....


Cort.




White Space in Parsers
--------------------------------------------------


This design note addresses the issue of white space in the
                                                        compilers. A summary of the current
implementation is presented along with theories to its rationale,
followed by a proposed strategy for the new parsers.




The Approach
-----------------


The current "handles" white space in
a very specific way. In the tokenizer, lex explicitly defines
two regular expressions, one called opt_ws (optional white
space), another called man_ws (mandatory white space). The ----
---- seeks to provide flexibility in the placement of white
space and comments while enforcing some minimum mandatory white
space to delimit tokens. Using the opt_ws and
man_ws regular expressions, the lex file allows white space
almost anywhere while requiring it around language elements.


It is interesting to note that due to limitations in the
approach, FOREACH is an allowed sequence of the two tokens FOR
and EACH (according to the grammar). In the lex file, FOR is
defined as {opt_ws}FOR{opt_ws}, and EACH is defined as
EACH{man_ws} (lex syntax). The white space following FOR cannot
be made mandatory, because of the use of FOR in the END FOR
combination. Mandatory white space following FOR would require
space preceding the final semicolon in "end for;".




Subtle Tokenizer Rules
----------------------


Note that there is a tokenizer rule (applied both by lex and
dlg) which prevents FOREACH from actually parsing. The lexical
analyzer wants to recognize the longest string of characters [Scott, Tom!!]
possible. When the tokenizer finds two or more matches of the
same length, the rule listed first in the grammar is chosen.
(That is why key words are listed before ID "[a-zA-Z0-9_]+".)


Since "foreach" matches ID and is longer than "for", ID is
matched (to be flagged illegal by the parser). These rules
mean that textual words, key words and IDs, cannot legally be
crunched together.




          Takes a Wrong Turn
-----------------------


The approach to white space in the tokenizer taken by the ----
parser seems to illustrate a lack of understanding of the above
subtle tokenizer rules. [No discredit intended. Some of these
issues are somewhat elusive, and I'll have to admit that given
their examples, I did not immediately identify the clumsiness of
their approach.]


The ---- approach seems to assume that white space must be
"told" where to be (----! :). It is more appropriate for the
parser as a whole to simply define where white space is _allowed_.
The ---- approach attempts to force the tokenizer into
explicitly delimiting tokens with explicit white space.




How to Properly Delimit a Token
-------------------------------


Tokens are delimited by other tokens. White space is a token.
Newline is a token. A comment is a token. A key word, operator
or --- element is a token.


Tokens are delimited by other tokens. In the phrase "saucer one
fork", the SAUCER token is delimited by the white space token
following it. In the phrase "a=b;", the ID token for "a" is
delimited by the EQUALS token that follows it. Whether a token
is returned to the parser depends upon the (optional) action
associated with the token. White space carries no semantic
meaning in ----, and therefore, need not be passed on to the
parser. We can "skip" white space (and comments) in our ------
-------- parsers. This is done in the action associated with
the token by invoking skip() in dlg or by simply going on to the
next token in lex.


Therefore, explicitly glomming white space onto other tokens
(e.g. {opt_ws}saucer{man_ws}) is not only not necessary, but
ends up making the grammar wrong (as in for each).


Tokens are delimited by other tokens.




Design Approach in PCCTS
------------------------


PCCTS uses DLG which is a tokenizer somewhat weaker than lex
used in the existing code. [The PCCTS crowd will argue that
tokenizers should not be too smart for their own good.] With
the weaker tokenizer, the temptation to sprinkle white space
everywhere is not so compelling.


In the compiler, the tokenizer is responsible
for maintaining line and column number information. This
information is stored in the token objects passed to the parser.
The newline character is of special significance when keeping
track of line and column information. Therefore, the newline
character must be recognized in one and only one spot. (Thus,
explicitly distributing white space all over the place is
clearly undesirable.)




The Parsers
---------------


In the parsers, white space and comments are totally
skipped. The tokens created for white space are never passed to
the parser and never referenced afterward. The newline
character is recognized and used to maintain line and column
information.


White space characters and comments should not be part of the
key words, operators or ----------- tokens. (This is great.
The solution is to eliminate all that ugly complexity. :)




Other Options [Sinan!!]
-------------


DLG provides a single character of look-ahead. This look-ahead
character could be used to further enforce the esthetics of the
parsed ---------------.


Cosider:


#token SAUCER "saucer" <<ws();>>
#token "\n" <<newline();skip();>>
#token "[\t\ \r]" <<skip);>>


ws() {
    switch (ch) {
case ' ': break;
case '\t': break;
case '\r': break;
case '\n': break;
default: Error("Required white space not found.");
    }
}
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.