More on the purpose of Lexers

"JMB" <dos_programmer@yahoo.com>
13 Oct 2002 16:17:22 -0400

          From comp.compilers

Related articles
More on the purpose of Lexers dos_programmer@yahoo.com (JMB) (2002-10-13)
Re: More on the purpose of Lexers vbdis@aol.com (VBDis) (2002-10-18)
From: "JMB" <dos_programmer@yahoo.com>
Newsgroups: comp.compilers
Date: 13 Oct 2002 16:17:22 -0400
Organization: http://groups.google.com/
Keywords: lex, comment
Posted-Date: 13 Oct 2002 16:17:22 EDT

I was wondering if anyone had any ideas about this topic -- the
purpose of the lexical analyzer.


My first attempt at an interpreter had a lexer that ignored comments
and whitespace and gave back to the interpreter only the relevant
tokens.


I have read another recent article about how the lexer should avoid as
many checks as possible, and sometimes even allow for
character-by-character reconstruction of the source. In other words,
comments aren't skipped over, and whitespace is actually given back to
the module that uses the lexer.


So then the question of context arises. My lexer recognizes the
beginning of a comment ("{" and "(*"), but once it has passed that
point it doesn't know that the next token should be part of a
comment -- main handles that. Which means the check for an identifier
is pointless: an identifier within a comment isn't an identifier at
all, it's merely text. The same goes for floating-point numbers. Why
bother calling the routine to extract numbers when the only characters
it needs to recognize are the comment closers, "}" and "*)"?
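For concreteness, here is a minimal sketch of the alternative design,
where the lexer consumes a whole comment as one unit so the caller
never sees comment text as identifiers or numbers. It's in Python just
for illustration (my lexer isn't Python), and the names scan_comment
and tokens are made up for this sketch; numbers and multi-character
symbols are omitted for brevity:

```python
def scan_comment(src, i):
    """Skip a Pascal-style comment starting at src[i] ('{' or '(*').
    Returns the index just past the closing '}' or '*)'."""
    if src[i] == "{":
        end = src.find("}", i + 1)
        if end < 0:
            raise SyntaxError("unterminated comment")
        return end + 1
    # src[i:i+2] == "(*"
    end = src.find("*)", i + 2)
    if end < 0:
        raise SyntaxError("unterminated comment")
    return end + 2

def tokens(src):
    """Yield (class, text) tokens, silently consuming whitespace
    and comments inside the lexer instead of in main."""
    i = 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1
        elif c == "{" or src[i:i+2] == "(*":
            i = scan_comment(src, i)   # comment handled here, in one call
        elif c.isalpha():
            j = i
            while j < len(src) and src[j].isalnum():
                j += 1
            yield ("ident", src[i:j])
            i = j
        else:
            yield ("sym", c)           # single-char symbols only, for brevity
            i += 1
```

With this shape, main never needs a "we are inside a comment" mode at
all; the identifier and number routines are simply never called on
comment text.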


What do you think a solution is? Currently, when I call the GetToken
routine, I tell the lexer which token classes to check for; if I say
check only for symbols and it encounters an alphanumeric character, it
calls the GetWholeWord() routine and ignores the class the token
belongs to. That way the context kept by main can speed up the lexing
of comments and strings.


1.) Is there a better way to do it?


2.) So now I also wonder: should it be the lexer's job to check
whether an identifier is a reserved word? Or should it determine only
that a particular token is a symbol, number, string, or identifier?
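For question 2, the approach I've seen described most often is to scan
the whole word first and then decide with a table lookup whether it is
reserved. A minimal sketch, again in Python purely for illustration
(the function name and the keyword set here are made up, not a real
Pascal keyword list):

```python
# Illustrative subset of reserved words; a real table would be complete.
KEYWORDS = {"begin", "end", "if", "then", "else", "while", "do"}

def classify_word(text):
    """Given a word already scanned by the lexer, return either
    ('keyword', w) or ('ident', w)."""
    w = text.lower()   # assuming case-insensitive keywords, as in Pascal
    return ("keyword", w) if w in KEYWORDS else ("ident", w)
```

The point of the lookup-after-scan design is that the scanning code
stays a single simple loop; distinguishing "begin" from "beginner"
falls out for free because the whole word is in hand before the table
is consulted.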


Thanks for your input.
[I take it back -- it's ask three people, get six answers. If you're
doing source-to-source conversion, saving the formatting makes sense.
If it's a normal translation to machine code, there's no point. -John]
