Re: A question about lexer portability in C ?


Related articles
A question about lexer portability in C ? frederic.guerin@sympatico.ca (Frederic Guerin) (1997-09-23)
Re: A question about lexer portability in C ? cfc@world.std.com (1997-09-24)
Re: A question about lexer portability in C ? henry@zoo.toronto.edu (Henry Spencer) (1997-09-28)
From: cfc@world.std.com (Chris F Clark)
Newsgroups: comp.compilers
Date: 24 Sep 1997 22:29:35 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 97-09-090
Keywords: lex, i18n

Frederic Guerin asked:
> The question is : Can I fix this table at compile time or do I need to
> build it at run time so as to make sure that the correct codes will be
> assigned to the correct characters ?
...
> May I assume that all character sets used
> over the world are superset of the ANSI one ( with identical character
> code ) ?


To which our moderator correctly pointed out that EBCDIC is neither a
superset nor a subset of ASCII.


However, this does not mean that you cannot fix the tables at compile
time. First, most machines do not dynamically change character sets;
the most notable exception is "code pages" on PCs. The more modern
solution is Unicode, which defines a character set large enough to
accommodate the target languages' character sets as subsets (and ASCII
is a subset of Unicode in one sense). Second, even when a machine
provides the ability to switch character sets on the fly, there is
usually a universal subset which all character sets on the machine
support. This corresponds roughly to your ANSI set.
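The compile-time trick works because character literals in the lexer's
own source are translated by the compiler into whatever codes the
machine natively uses. A minimal sketch (the class names and table
layout are illustrative, not from any particular generator):

```c
/* Classification table indexed by the machine's native character
 * codes.  It is filled from character literals, which the compiler
 * translates into the machine's own codes (ASCII, EBCDIC, ...), so
 * the same source yields a correct table wherever it is compiled. */
#include <limits.h>

enum char_class { CC_OTHER, CC_LETTER, CC_DIGIT, CC_SPACE };

static unsigned char cls[UCHAR_MAX + 1];

static void init_table(void)
{
    const char *letters = "abcdefghijklmnopqrstuvwxyz"
                          "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *digits  = "0123456789";
    const char *p;

    for (p = letters; *p; p++) cls[(unsigned char)*p] = CC_LETTER;
    for (p = digits;  *p; p++) cls[(unsigned char)*p] = CC_DIGIT;
    cls[' '] = cls['\t'] = cls['\n'] = CC_SPACE;
}
```

The table is filled once at startup, but only from literals fixed at
compile time, so no run-time inspection of the character set is needed.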


More importantly, each machine generally has a "compiler culture" (or
locale) already established. That is, there is some character set
"native" to the machine in which all compilers expect their input.
Since you are comparing to C, this would correspond to the "C locale",
and the machine's C compiler will make specific requirements on the
character set (and those are likely to be the only characters you can
put in C literals). It is worth mentioning here that the C character
set is not a strict subset of the universal character set on all
machines which support C, which is where trigraphs come in.
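For concreteness, the nine ISO C trigraphs let those C characters that
some national character sets lack be spelled with characters every set
has. They can be tabulated as data (note that trigraphs were removed
in C23, and modern compilers typically need a flag to honor them):

```c
/* The nine ISO C trigraph sequences and the characters they denote.
 * The `?\?' escape keeps a trigraph-aware compiler from translating
 * the sequences inside these very string literals. */
struct trigraph { const char *seq; char repl; };

static const struct trigraph trigraphs[] = {
    { "?\?=", '#' }, { "?\?(", '[' }, { "?\?)", ']'  },
    { "?\?<", '{' }, { "?\?>", '}' }, { "?\?/", '\\' },
    { "?\?'", '^' }, { "?\?!", '|' }, { "?\?-", '~'  },
};
```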


Thus, if you are able to define your language in terms of some subset
of the characters on each machine (preferably the same set used by
other compilers on the machine), you can do your compile time trick.
Even if not, you simply need to set up several sets of lexer tables
(at compile time), one for each supported target character set, and
switch between them.


Note that defining your language in terms of a subset of characters
is not always the correct solution. For example, many non-English
speakers like to be able to use non-ASCII alphabetic characters in
identifiers. If you want your language to support that, you have
increased the odds that you will need run-time identification of
characters.


> If No, are parser/lexer generating tools safe in this regard ?


The answer to this is maybe (some are and some are not). Each
parser/lexer generating tool has a set of target character sets it
supports. In most cases, it is only one set. However, some support
more than one (and may have a universal subset defined). For
example, the lexer in Yacc++ has a mapping from ASCII to EBCDIC, so
you can specify your lexer in ASCII and have an EBCDIC lexer
generated. There are some characters which have the same graphic in
both sets, and these define the universal subset. (Not knowing EBCDIC
myself, I'm not sure what that set is, but I know it includes the
alphanumerics and most of the characters one would expect.)


------------------------------------------------------------------------


I think the question you are really trying to ask is whether you can
use a lexer generator or not. The answer is yes (even if you need
run-time tables).


If you can do the subset case, you are probably safe with any lexer
generator which handles ASCII. If you ever have to port to a
non-ASCII-compatible machine in the future (which is unlikely), you
can deal with that problem then.


If you need run-time tables, look for a lexer generator which can
decide character codes by function calls (e.g. whitespace is found by
a call such as isspace); I have seen at least one which works that
way. You will pay for it in performance, but you will get a portable
lexer which queries the environment to match the users' expectations.
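In C terms, the function-call approach amounts to routing character
classification through <ctype.h>, which follows the current locale
(the token-class names below are illustrative):

```c
/* Classify characters via <ctype.h> rather than baked-in tables.
 * Slower per character, but portable across character sets and
 * locales.  The ctype functions must be given a value representable
 * as unsigned char, or EOF. */
#include <ctype.h>

enum token_class { TC_SPACE, TC_IDENT, TC_NUMBER, TC_OTHER };

static enum token_class classify(int c)
{
    if (isspace(c))             return TC_SPACE;
    if (isalpha(c) || c == '_') return TC_IDENT;
    if (isdigit(c))             return TC_NUMBER;
    return TC_OTHER;
}
```

With setlocale(LC_CTYPE, "") called at startup, isalpha will also
accept whatever non-ASCII letters the user's locale defines, which is
exactly the run-time identification mentioned above.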


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. CompuServe : 74252,1375
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)
--

