Using Lex to scan a processor specific Assembly Language email@example.com (2000-12-20)
Re: Using Lex to scan a processor specific Assembly Language firstname.lastname@example.org (Joachim Durchholz) (2000-12-21)

From: "Joachim Durchholz" <email@example.com>
Date: 21 Dec 2000 14:54:43 -0500
Posted-Date: 21 Dec 2000 14:54:43 EST
msnead <firstname.lastname@example.org> wrote:
> A comment can start with a star, which is easy enough, "*"[^\n]*
> But text in the fourth column of the file is also considered a
> comment. And this type of comment does not start with any kind of
> special character like the '*'.
I usually layer the lexical phase into two (or even more) passes,
exactly because the rules for what counts as a comment often aren't
easily captured in a regular expression. (Nested comments, for example,
cannot be described by a regular expression at all.)
In general, I get a structure like this:
Layer 1: Reader
    End-of-line conversion (CRLF into LF and such).
    Trivial global character set substitutions.
Layer 2: Whitespace coalescer
    Replace end-of-line with blanks.
    Replace multiple whitespace characters with a single blank.
    Insert line/column information for error messages (important!).
Layer 3: Tokenizer
    Group characters into tokens.
    Usually done using a tool (lex or similar).
Layer 4: Parser
    Group tokens and smaller constructs into larger constructs.
    Usually done using a tool (yacc, ANTLR etc.),
    or (sometimes) by hand if using recursive descent.
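
To make the first two layers concrete, here is a minimal hand-coded
sketch in Python. It is only an illustration of the structure listed
above, not code from any real assembler: the names Char,
read_normalized and coalesce_whitespace are made up for this example,
and it assumes the whole source fits in one string.

```python
from typing import Iterator, NamedTuple


class Char(NamedTuple):
    """One character plus the source position it came from."""
    text: str
    line: int    # 1-based line number in the original source
    column: int  # 1-based column number in the original source


def read_normalized(raw: str) -> str:
    """Layer 1: convert CRLF and lone CR line ends into LF."""
    return raw.replace("\r\n", "\n").replace("\r", "\n")


def coalesce_whitespace(source: str) -> Iterator[Char]:
    """Layer 2: emit one blank per run of whitespace (line ends included),
    tagging every emitted character with its original line and column."""
    line, column = 1, 1
    in_blank_run = False
    for ch in source:
        if ch.isspace():
            if not in_blank_run:
                # First whitespace character of a run: emit a single blank.
                yield Char(" ", line, column)
                in_blank_run = True
        else:
            yield Char(ch, line, column)
            in_blank_run = False
        # Track the position in the original text for error messages.
        if ch == "\n":
            line, column = line + 1, 1
        else:
            column += 1


chars = list(coalesce_whitespace(read_normalized("LDA  #1\r\n  STA X")))
print("".join(c.text for c in chars))  # LDA #1 STA X
```

Because every character carries its original position, the tokenizer
in layer 3 can still report accurate line and column numbers even
though the line ends themselves are gone.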
Layer 1 and layer 2 could both be done using the same tools that apply
to layer 3 and 4, but usually the rules on these layers are so simple
that it's easier to code them directly. If layer 1 and layer 2 are
very simple, they can be merged into one layer. Layer 1 gets added as
an afterthought when the language is successful enough to be ported to
operating systems with deviating line end or character set
conventions. And, being an afterthought, it's more often than not just
hacked into the reader routine of the scanner.
Back to your question: It's clearly a layer-2 issue, and as John
wrote, it's best handled by handcoding. Lexing tools won't save you
anything for this type of requirement. The layering I'm
presenting above is just an attempt at removing the "hackish" aspect
from handcoding; it's always a good idea to separate complicated
semantics out if there's any chance that the code will have to be
maintained.
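
As a rough illustration of what such handcoding might look like, here
is a small Python sketch that splits comments off before the tokenizer
sees the line. It rests on assumptions that the original question does
not confirm: that "fourth column" means the fourth whitespace-separated
field (label, opcode, operand, comment), that a line starting with
whitespace has an empty label field, and that "*" in the first position
marks a whole-line comment. The function name split_code_and_comment
is made up for this example.

```python
from typing import List, Optional, Tuple


def split_code_and_comment(line: str) -> Tuple[List[str], Optional[str]]:
    """Return (code_fields, comment) for one source line.

    Assumed layout (hypothetical): label, opcode, operand, then a
    free-form comment that runs to the end of the line.
    """
    line = line.rstrip()
    if not line:
        return [], None
    if line.startswith("*"):
        # Whole-line comment introduced by a star.
        return [], line[1:].strip()

    # A line that starts with whitespace is assumed to have no label.
    has_label = not line[0].isspace()
    code_field_count = 3 if has_label else 2

    # Split off the code fields; whatever remains is the comment.
    parts = line.split(None, code_field_count)
    fields = parts[:code_field_count]
    comment = parts[code_field_count] if len(parts) > code_field_count else None
    if not has_label:
        fields.insert(0, "")  # keep the label slot so positions line up
    return fields, comment


print(split_code_and_comment("LOOP  LDA  #1   load the accumulator"))
# (['LOOP', 'LDA', '#1'], 'load the accumulator')
```

Note that this sketch would misread the comment on an instruction that
takes no operand, since it counts fields blindly. Getting that right
needs knowledge of the opcodes, which is the kind of rule that is
awkward to express in a lex specification.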