Re: More on the purpose of Lexers

"VBDis" <vbdis@aol.com>
18 Oct 2002 23:19:51 -0400

          From comp.compilers

Related articles
More on the purpose of Lexers dos_programmer@yahoo.com (JMB) (2002-10-13)
Re: More on the purpose of Lexers vbdis@aol.com (VBDis) (2002-10-18)

From: "VBDis" <vbdis@aol.com>
Newsgroups: comp.compilers
Date: 18 Oct 2002 23:19:51 -0400
Organization: AOL Bertelsmann Online GmbH & Co. KG http://www.germany.aol.com
References: 02-10-028
Keywords: lex, practice
Posted-Date: 18 Oct 2002 23:19:51 EDT

"JMB" <dos_programmer@yahoo.com> schreibt:


>I have read another recent article about how the lexer should avoid as
>many checks as possible, and sometimes even allow for
>character-by-character reconstruction of the source. In other words,
>comments aren't skipped over, and whitespace is actually given back to
>the module that uses the lexer.


In my current project, comments are scanned as a whole, resulting in
several token kinds (comment until EOL, delimited comment, continued
comment). For Pascal compiler options {$...} another comment type may
be required, equivalent to C preprocessor commands. The scanner
creates tokens for all new lines (new line, continuation line,
continued comment), with an indication of the size and/or characters
of the whitespace or continued comment at the beginning of a line.
Inside a line I skip whitespace, but the column number is retained for
each token, so that the original source code can be reconstructed from
the tokens on demand.
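
A minimal sketch (in C, with names of my own invention, not taken from
the project) of what such a token record could look like:

    /* Comments, line breaks and continuation lines are real tokens,
       not skipped, so the original layout can be rebuilt from the
       token stream. */
    enum TokenKind {
        tkIdent, tkNumber, tkString, tkPunct,
        tkCommentEol,      /* comment until end of line              */
        tkCommentBlock,    /* delimited comment, e.g. { ... }        */
        tkCommentCont,     /* comment continued on the next line     */
        tkCommentOption,   /* compiler option, e.g. {$...}           */
        tkNewLine,         /* end of a source line                   */
        tkContLine,        /* continuation line                      */
        tkEof
    };

    typedef struct Token {
        enum TokenKind kind;
        int line;          /* source line number                     */
        int column;        /* column of the first character; inside a
                              line whitespace is skipped, but the
                              column lets the spacing be rebuilt      */
        const char *text;  /* lexeme; for comments the full text     */
    } Token;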


Files are scanned as a whole, so that after the first stage tokens for
all line ends, comments and preprocessor commands are present. The
parser then reads a token stream from which all special tokens
(EOL...) are filtered out. The preprocessor is also implemented as a
filter, which allows for selective interpretation of preprocessor
definitions and conditional compilation. Of course the parser can go
crazy when it receives the tokens from all branches of a conditional
compilation, so that manual intervention and selection of the
appropriate conditional branch may be necessary. Some "hard"
preprocessor symbols can be defined or undefined in the environment
(command line), which makes the preprocessor filter out conditional
passages according to the given settings of these symbols. Otherwise
all conditional branches are passed to the parser, which then must be
able to handle the remaining preprocessor commands. In the simplest
case the parser (cross compiler or pretty printer) e.g. translates
#if (C) into $IF (Pascal). I could not find a better solution for
implementing the preprocessor when the full source code must be
retained for listings or cross compilation. Perhaps different filters
must be implemented for semantic parsers, pretty printers etc.
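
A rough sketch of that filter idea (again my own names, and only the
"hard"-symbol case; the #else/#endif bookkeeping of resolved #ifs is
omitted for brevity). Tokens flow from the scanner through the filter
to the parser; conditionals on hard symbols are resolved here, all
other conditionals and their branches are passed through unchanged:

    #include <stdbool.h>
    #include <string.h>

    typedef enum { ppNone, ppIf, ppElse, ppEndif } PpKind;

    typedef struct PpToken {
        PpKind pp;            /* preprocessor directive, or ppNone      */
        const char *symbol;   /* condition symbol of an #if/#ifdef      */
        /* ... ordinary token fields (kind, line, column, text) ...     */
    } PpToken;

    /* "hard" symbols taken from the command line, e.g. -DDEBUG -UNDEBUG */
    static const char *hard_on[]  = { "DEBUG" };
    static const char *hard_off[] = { "NDEBUG" };

    static bool in_list(const char *s, const char **l, int n) {
        for (int i = 0; i < n; i++)
            if (strcmp(s, l[i]) == 0) return true;
        return false;
    }
    static bool is_hard(const char *s)    { return in_list(s, hard_on, 1)
                                                || in_list(s, hard_off, 1); }
    static bool hard_value(const char *s) { return in_list(s, hard_on, 1); }

    /* Returns true if the token shall be forwarded to the parser. */
    bool pp_filter(const PpToken *t) {
        static int skip_depth = 0;        /* >0 inside a dropped branch  */
        if (skip_depth > 0) {
            if (t->pp == ppIf)         skip_depth++;
            else if (t->pp == ppEndif) skip_depth--;
            return false;                 /* drop the dead branch        */
        }
        if (t->pp == ppIf && is_hard(t->symbol)) {
            if (!hard_value(t->symbol))
                skip_depth = 1;           /* false condition: drop branch */
            return false;                 /* the directive is consumed    */
        }
        /* "soft" conditionals and ordinary tokens reach the parser */
        return true;
    }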


When source code is tokenized in the scanner without assistance from
the parser (a parserless scanner ;-), the scanner must also be able to
detect required switches from one language to another, e.g. to
assembly or preprocessor syntax. This requires that the scanner
recognizes at least a minimal set of keywords.
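
As an illustration (my own sketch, not the project's code): the
scanner keeps a current language mode and checks each identifier
against the few keywords that can trigger a switch, e.g. into an
inline assembly block:

    #include <string.h>

    enum Lang { langPascal, langAsm };

    /* Called for every identifier the scanner produces; returns the
       language mode to use for the following tokens. */
    enum Lang check_switch(enum Lang cur, const char *ident) {
        if (cur == langPascal && strcmp(ident, "asm") == 0)
            return langAsm;       /* inside asm...end use assembler rules */
        if (cur == langAsm && strcmp(ident, "end") == 0)
            return langPascal;    /* back to normal Pascal tokenization   */
        return cur;
    }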


For the many optional keywords and semicolons in Delphi I used
different classifications of identifiers in an older project, and the
parser gives the scanner information about the expected set of
keywords, depending on the context. In my current C parser project
I'll use a similar approach for special preprocessor tokens, type
names etc.
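
A small sketch of that feedback loop, with a hypothetical API of my
own: before requesting the next token, the parser announces which
identifiers act as keywords in the current context; everything else
comes back as a plain identifier.

    typedef struct Scanner Scanner;
    typedef struct Token Token;

    /* assumed scanner interface, not the project's actual names */
    void scanner_set_keywords(Scanner *s, const char **words, int count);
    const Token *scanner_next(Scanner *s);

    /* e.g. while parsing a Delphi property declaration */
    void parse_property(Scanner *s) {
        static const char *prop_words[] = { "read", "write",
                                            "stored", "default" };
        scanner_set_keywords(s, prop_words, 4);
        const Token *t = scanner_next(s);  /* "read" is now a keyword */
        /* ... */
        (void)t;
    }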


DoDi

