How to implement a double byte Lex and Yacc

"JUKKA" <>
16 Apr 1997 00:09:48 -0400

          From comp.compilers

Related articles
Double-byte lex and yacc? (Michael O'Leary) (1997-04-02)
Re: Double-byte lex and yacc? (1997-04-03)
How to implement a double byte Lex and Yacc (JUKKA) (1997-04-16)
Re: How to implement a double byte Lex and Yacc (John Lilley) (1997-04-20)
Re: How to implement a double byte Lex and Yacc (1997-04-22)
Re: How to implement a double byte Lex and Yacc (Stephen Arons) (1997-05-05)
Re: How to implement a double byte Lex and Yacc (1997-05-06)

From: "JUKKA" <>
Newsgroups: comp.compilers
Date: 16 Apr 1997 00:09:48 -0400
Organization: GOOD
References: 97-04-013 97-04-023
Keywords: lex, i18n

Michael O'Leary wrote:
> Are there any versions of lex and/or yacc that are capable of
> accepting double-byte character streams as input?

> [Yacc doesn't process text, it processes tokens so the character set
> isn't much of an issue. Lex is much harder, since all the
> versions of lex that I know use 256 byte dispatch tables indexed by
> character code. This came up a year ago, suggestions included the
> plan 9 versions and re2c. -John]

[ in a later message, the moderator said ]
>[If you could cook us up a double byte version of lex, we'd be most
>grateful. Pay particular attention to the places where it creates arrays
>indexed by character codes. -John]

Here is what I could cook up. I think one should use Unicode to
implement a truly universal scanner. Granted, Lex is geared toward
the old 256-character ASCII/extended-ASCII alphabet, but computers
now mostly have 32 MB of memory rather than the 32 KB of the era
when Lex was written. So basically the tables can be bigger now.


There are two approaches:

a) Change the method used inside Lex itself.

b) Keep the old Lex method, but preprocess
      (and postprocess) the double-byte input
      so that it is reduced to some subset of
      Unicode needing no more than 256 or so
      distinct characters. A simple mapping
      array will do!
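The mapping array of option (b) can be sketched in C roughly as
follows. This is my own illustration, not code from any actual Lex;
the names `charmap`, `declare_char`, and `map_char` are hypothetical.
A 64K-entry table of bytes (64 KB, cheap by 1997 standards) maps each
16-bit code point to a small internal code that the ordinary 256-entry
lex tables can index.

```c
/* Hypothetical sketch of the option (b) mapping array.
 * One byte per 16-bit code point: 64 KB total.
 * 0 means "not in the declared subset". */
static unsigned char charmap[65536];
static unsigned char next_code = 1;

/* Declare a code point as part of the scanner's character subset,
 * assigning it the next free internal code. */
static void declare_char(unsigned int cp)
{
    if (charmap[cp] == 0)
        charmap[cp] = next_code++;
}

/* Translate one input character to its internal code.
 * A return value of 0 signals a character outside the subset,
 * i.e. a lexical error. */
static unsigned char map_char(unsigned int cp)
{
    return charmap[cp];
}
```

As long as fewer than 256 distinct characters are declared, every
internal code fits in one byte and the old Lex machinery works
unchanged.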

Here is how to implement the easier option, (b):

1. First determine which Unicode characters the user actually
      wants to use in his scanner. Notice that with 16-bit Unicode
      you can handle all modern (=99.9%) languages and character
      sets at the same time. And make use of the fact that people
      mostly want to implement something for their own language,
      which is always a subset of Unicode. Perhaps 80% of the
      important languages can do it with 256 characters, though
      note that each of them needs a different set. The largest is
      probably Chinese, which needs on the order of 16,000
      characters. But in any case each is a subset of Unicode,
      which can handle them all at the same time.

2. Add a statement to the Lex source that declares which
      characters from the Unicode set are actually used.
      (Notice that this can sometimes even shrink the character
      set below 256 and make the tables smaller still in some
      cases.)

3. Then use this declared subset to produce the scanner, for
      only the set the application needs, not all possible
      characters in the world. On input there is a mapping from
      the 65,536 possible characters down to 126/256 {/2,000/16,000}
      or so. Then you run the old Lex, and that produces your
      tokens. Not very much to change.
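The preprocessing pass in step 3 can be sketched as a simple filter
in front of the unmodified scanner. Again this is a hypothetical
illustration (the function and parameter names are mine): each 16-bit
input character is reduced to one small internal byte, so the old
256-entry lex tables can be used downstream.

```c
#include <stddef.h>

/* Map a buffer of 16-bit input characters into the small internal
 * alphabet, one output byte per input character.  'map' is the
 * 64K-entry table built from the declared subset; an output of 0
 * marks a character outside that subset. */
static void map_input(const unsigned short *in, size_t n,
                      unsigned char *out,
                      const unsigned char map[65536])
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = map[in[i]];
}
```

In a real scanner this mapping would be applied inside the input
routine (e.g. wherever the generated scanner reads its next
character), so the lex-generated tables never see anything but the
small internal codes.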

If somebody really needs a scanner that can handle all the
characters in the world !at the same time! .. he should
change the method inside Lex, or hand-write a dedicated scanner.
Yacc takes only tokens, so it will always do the job, whatever
the character set is.

