Re: Using Lex to scan a processor specific Assembly Language

"Joachim Durchholz" <joachim_d@gmx.de>
21 Dec 2000 14:54:43 -0500

          From comp.compilers


From: "Joachim Durchholz" <joachim_d@gmx.de>
Newsgroups: comp.compilers
Date: 21 Dec 2000 14:54:43 -0500
Organization: Compilers Central
References: 00-12-091
Keywords: lex, assembler
Posted-Date: 21 Dec 2000 14:54:43 EST

msnead <mprestonsnead@yahoo.com> wrote:
>
> A comment can start with a star, which is easy enough, "*"[^\n]*
>
> But text in the fourth column of the file is also considered a
> comment. And this type of comment does not start with any kind of
> special character like the '*'.


I usually layer the lexical phase into two (or even more) passes,
precisely because the rules for what counts as a comment often aren't
easily captured in a regular expression. (Nested comments, for
example, aren't regular.)
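
To illustrate why nested comments need more than a regular expression, here is a minimal sketch that strips them with a depth counter. The delimiters `(*` and `*)` and the function name `strip_nested_comments` are assumptions chosen for illustration; the sketch also ignores string literals, which a real implementation would have to respect.

```python
def strip_nested_comments(text, open_delim="(*", close_delim="*)"):
    """Remove comments that may nest, tracking nesting depth with a counter."""
    out = []
    depth = 0  # how many comments are currently open
    i = 0
    while i < len(text):
        if text.startswith(open_delim, i):
            depth += 1
            i += len(open_delim)
        elif depth > 0 and text.startswith(close_delim, i):
            depth -= 1
            i += len(close_delim)
        else:
            if depth == 0:
                out.append(text[i])  # keep only text outside all comments
            i += 1
    if depth != 0:
        raise ValueError("unterminated comment")
    return "".join(out)
```

The counter is exactly the piece of state that a finite automaton cannot keep for arbitrary nesting depth, which is why this tends to end up as hand-written code in front of the scanner.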


In general, I get a structure like this:
Layer 1: Reader.
    End-of-line conversion (CRLF into LF and such).
    Trivial global character set substitutions.
Layer 2: Whitespace coalescer
    Replace end-of-line with blanks.
    Replace multiple whitespace by single blanks.
    Strip comments.
    Insert line/column information for error messages (important!).
Layer 3: Tokenizer
    Group characters into tokens.
    Usually done using a tool (lex or similar).
Layer 4: Parser
    Group tokens and smaller constructs into larger constructs.
    Usually done using a tool (yacc, ANTLR etc.),
    or (sometimes) by hand if using recursive descent.
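
The first three layers above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names are hypothetical, `;` is an assumed comment character, and `tokenize` is a trivial stand-in for a lex-generated scanner. Instead of inserting position markers into the character stream, the sketch carries the line number alongside each line, and it does not track columns.

```python
import re

def read_lines(raw):
    """Layer 1: convert CRLF and lone CR line ends to LF, then split into lines."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    return text.split("\n")

def coalesce(lines, comment_start=";"):
    """Layer 2: strip comments, collapse whitespace, keep line numbers."""
    for lineno, line in enumerate(lines, start=1):
        code = line.split(comment_start, 1)[0]      # drop the comment part
        code = re.sub(r"[ \t]+", " ", code).strip() # multiple blanks -> one
        if code:
            yield lineno, code

def tokenize(numbered_lines):
    """Layer 3 stand-in: split on blanks; a real scanner would go here."""
    for lineno, code in numbered_lines:
        for word in code.split(" "):
            yield lineno, word
```

Each layer only sees the output of the one before it, so the tokenizer never has to know about line-end conventions or comment syntax.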


Layers 1 and 2 could both be done with the same tools that apply
to layers 3 and 4, but the rules on these layers are usually so simple
that it's easier to code them directly. If layers 1 and 2 are
very simple, they can be merged into one layer. Layer 1 often gets added
as an afterthought once the language is successful enough to be ported to
operating systems with different line-end or character-set
conventions. And, being an afterthought, it is more often than not just
hacked into the reader routine of the scanner.


Back to your question: It's clearly a layer-2 issue, and as John
wrote, it's best handled by handcoding. Lexing tools won't save you
anything for this type of requirement. The layering I presented
above is just an attempt at removing the "hackish" aspect
from handcoding; it's always a good idea to separate complicated
semantics out if there's any chance that the code will have to be
maintained.
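
A hand-coded layer-2 routine for the comment rules in the question might look like the sketch below. It rests on assumptions the original post does not confirm: that "fourth column" means the fourth whitespace-separated field (label, opcode, operands, comment), that a line beginning with whitespace has an empty label field, and that every instruction has an operand field containing no blanks. The function name `strip_asm_comment` is hypothetical.

```python
def strip_asm_comment(line):
    """Return the line without its comment, under the assumed field rules."""
    if line.startswith("*"):
        return ""                          # whole-line comment
    # A line that starts with whitespace has an empty label (field 1).
    has_label = line[:1].strip() != ""
    code_fields = 3 if has_label else 2    # fields that are not comment
    fields = line.split(None, code_fields)[:code_fields]
    if not fields:
        return ""                          # blank line
    # Keep one leading blank so later layers can still see "no label".
    prefix = "" if has_label else " "
    return prefix + " ".join(fields)
```

For example, `strip_asm_comment("LOOP LDA #1 load one")` gives `"LOOP LDA #1"`. If some opcodes take no operands, the routine would need an opcode table to decide where the comment field starts, which is one more reason to keep it separate from the lex rules.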


Regards,
Joachim

