Related articles:
Double-byte lex and yacc? moleary@primus.com (Michael O'Leary) (1997-04-02)
Re: Double-byte lex and yacc? Julian.Orbach@unisys.com (1997-04-03)
How to implement a double byte Lex and Yacc jukkaj@ping.at (JUKKA) (1997-04-16)
Re: How to implement a double byte Lex and Yacc jlilley@empathy.com (John Lilley) (1997-04-20)
Re: How to implement a double byte Lex and Yacc clark@quarry.zk3.dec.com (1997-04-22)
Re: How to implement a double byte Lex and Yacc arons@panix.com (Stephen Arons) (1997-05-05)
Re: How to implement a double byte Lex and Yacc vern@daffy.ee.lbl.gov (1997-05-06)
From: "JUKKA" <jukkaj@ping.at>
Newsgroups: comp.compilers
Date: 16 Apr 1997 00:09:48 -0400
Organization: GOOD
References: 97-04-013 97-04-023
Keywords: lex, i18n
Michael O'Leary wrote:
> Are there any versions of lex and/or yacc that are capable of
> accepting double-byte character streams as input?
> [Yacc doesn't process text, it processes tokens so the character set
> isn't much of an issue. Lex is much harder, since all the
> versions of lex that I know use 256 byte dispatch tables indexed by
> character code. This came up a year ago, suggestions included the
> plan 9 versions and re2c. -John]
[ in a later message, the moderator said ]
>[If you could cook us up a double byte version of lex, we'd be most
>grateful. Pay particular attention to the places where it creates arrays
>indexed by character codes. -John]
Here is what I could cook up. I think one should use Unicode to
implement a truly universal scanner. Granted, Lex is geared to the
old 256-character (8-bit) set. But computers now mostly have 32 MB
of memory, not the 32 KB they had when Lex was written, so basically
the tables can be bigger now.
Basically, there are two options:
a) Change the method Lex itself uses.
b) Keep the old Lex method, but preprocess (and postprocess)
   the double-byte input so that it is reduced to a subset of
   Unicode that needs no more than 256 or so distinct
   characters. A simple mapping array will do!
Here is how to implement the easier option, b):
1. First check which characters from Unicode the user actually
wants to use in his scanner. Notice that with Unicode (16-bit) you
can handle all modern (= 99.9%) languages and characters at the
same time. And make use of the fact that most people want to
implement something for their own language, which is always a
subset of Unicode. 80% of the important languages can manage with
256 characters, though notice that they all need different sets.
The largest is probably Chinese, which needs something like 1,500
to 16,000 characters. But in any case it is a subset of Unicode,
which can handle them all at the same time.
2. There should be a statement in the Lex source that declares
which characters from the Unicode set are actually used.
(Notice that this could sometimes also shrink the character
set to fewer than 256 characters and make the tables smaller
still in some cases.)
3. Then use this declared subset to produce the scanner, covering
only the set the application needs rather than all possible
characters in the world. On input there is a mapping from 64,000
or so characters down to 126/256 {/2,000/16,000} or so. Then you
run the old Lex, and that produces your tokens. Nothing very much
to change.
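And here is the input side of that mapping, as a sketch assuming
flex (which provides the YY_INPUT hook) rather than the original
AT&T lex. read16() is my own helper, map_unit() is the lookup from
the sketch above, and the input is assumed to be little-endian
16-bit units:

    %{
    #include <stdio.h>

    unsigned char map_unit(unsigned u);   /* from the mapping sketch above */

    /* Read one little-endian 16-bit code unit from fp, or EOF. */
    static int read16(FILE *fp)
    {
        int lo = fgetc(fp), hi;
        if (lo == EOF || (hi = fgetc(fp)) == EOF)
            return EOF;
        return (hi << 8) | lo;
    }

    /* Feed old Lex one remapped byte per 16-bit unit of input. */
    #define YY_INPUT(buf, result, max_size)                                \
        {                                                                  \
            int u = read16(yyin);                                          \
            result = (u == EOF) ? YY_NULL                                  \
                                : ((buf)[0] = (char)map_unit((unsigned)u), 1); \
        }
    %}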
If somebody really needs a scanner that can handle all the
characters in the world *at the same time*, he should change the
method inside Lex, or hand-write a dedicated scanner. Yacc takes
only tokens, so it will always do its job, whatever the character
set is.
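Just to illustrate that last point (my own sketch, assuming yacc or
bison): the grammar below only ever mentions token codes, so
whatever the lexer does with double-byte text is invisible to the
parser.

    %{
    #include <stdio.h>
    int yylex(void);
    void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
    %}
    %token IDENT NUMBER
    %%
    input : /* empty */
          | input item
          ;
    item  : IDENT   { /* the token text may be double-byte; yacc never looks */ }
          | NUMBER
          ;
    %%
    /* Stub lexer: a real one would return IDENT/NUMBER from the remapped input. */
    int yylex(void) { return 0; }
    int main(void) { return yyparse(); }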
JOJ