Re: Double-byte lex and yacc?

Julian.Orbach@unisys.com (Julian Orbach)
3 Apr 1997 14:02:31 -0500

          From comp.compilers


From: Julian.Orbach@unisys.com (Julian Orbach)
Newsgroups: comp.compilers
Date: 3 Apr 1997 14:02:31 -0500
Organization: Australian Centre for Unisys Software
References: 97-04-013
Keywords: i18n, lex

On 2 Apr 1997 16:00:13 -0500, "Michael O'Leary" <moleary@primus.com>
wrote:


>Are there any versions of lex and/or yacc that are capable of
>accepting double-byte character streams as input?
>
>Michael O'Leary
>moleary@primus.com
>[Yacc doesn't process text, it processes tokens so the character set isn't
>much of an issue. Lex is much harder, since all the versions of lex that
>I know use 256 byte dispatch tables indexed by character code. This came
>up a year ago, suggestions included the plan 9 versions and re2c. -John]


I asked about this around a year ago. I found that, depending on
the language being lexed, you may be able to get away with a kludge.
I haven't seen it described anywhere else, so here it is.


The language I was working with allowed Unicode characters to appear
in strings, identifiers and comments. Significantly, none of the
keywords required anything beyond the ASCII character set, and the lex
specification didn't care *which* non-ASCII character it was looking at.
(I believe Java also satisfies these criteria.)


In the end, I modified lex's prototype file to generate a lexer
which:


* Stored the incoming Unicode characters in "yyUnicodeText", a
parallel (but wider) data structure to "yytext".
* Stored the 7-bit equivalents of the Unicode characters in the
standard yytext. Where a character had no 7-bit equivalent, it was
converted into CTRL-A (an otherwise unused code).


The lexer specification could then treat all of the non-ASCII
characters identically, catching them all by catching the CTRL-A
character.


If the actual text of the token is required by a parser action, it
can be extracted from yyUnicodeText rather than yytext.
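
A minimal sketch of the folding step described above, in C. It assumes
16-bit code units as input; the helper name fold_to_ascii, the macro
YY_NON_ASCII and the buffer parameters are hypothetical and are not taken
from any real lex prototype file. It only illustrates the idea of a 7-bit
shadow buffer kept in step with a wider one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define YY_NON_ASCII 0x01  /* CTRL-A: stands in for any non-ASCII character */

/* Hypothetical helper: fold n 16-bit code units into a 7-bit shadow
 * buffer (the role yytext plays) while keeping the originals alongside
 * (the role yyUnicodeText plays).
 * Both output buffers must have room for n + 1 elements. */
static void fold_to_ascii(const uint16_t *in, size_t n,
                          char *text, uint16_t *unicode_text)
{
    assert(in != NULL && text != NULL && unicode_text != NULL);
    for (size_t i = 0; i < n; i++) {
        unicode_text[i] = in[i];                 /* keep the real character */
        text[i] = (in[i] < 0x80) ? (char)in[i]   /* ASCII passes through */
                                 : (char)YY_NON_ASCII;
    }
    text[n] = '\0';
    unicode_text[n] = 0;
}
```

With input folded this way, an identifier rule in the lex specification
might read something like `[A-Za-z_\001][A-Za-z0-9_\001]*`, and the action
would read the matching span of the wide buffer when it needs the real
characters.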


Other details to consider:
* Unicode has many "space" characters - should it be legal to separate
tokens with them? I converted them all to CTRL-B, instead of CTRL-A,
so I could detect them.
* How should a legitimate CTRL-A (or CTRL-B) in the text be handled?
* Use #defines in the prototype, so Unicode support can be turned off
and on?
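
One possible way to handle the three points above, sketched in C. The
names fold_char, is_unicode_space, UNICODE_LEXER and YY_RAW_CTRL are
hypothetical. Mapping a literal CTRL-A or CTRL-B in the input to a third
code (CTRL-C here) is just one assumed policy, not something the scheme
above prescribes, and the space test covers only some of the Unicode
space characters.

```c
#include <assert.h>
#include <stdint.h>

#ifndef UNICODE_LEXER
#define UNICODE_LEXER 1   /* set to 0 to compile the Unicode folding out */
#endif

enum {
    YY_NON_ASCII = 0x01,  /* CTRL-A: any other non-ASCII character */
    YY_UNI_SPACE = 0x02,  /* CTRL-B: a non-ASCII space character */
    YY_RAW_CTRL  = 0x03   /* CTRL-C: literal CTRL-A/B/C in the input (assumed policy) */
};

#if UNICODE_LEXER
/* Illustrative, incomplete test for non-ASCII space characters. */
static int is_unicode_space(uint16_t c)
{
    return c == 0x00A0 || c == 0x1680 || (c >= 0x2000 && c <= 0x200A) ||
           c == 0x2028 || c == 0x2029 || c == 0x202F || c == 0x205F ||
           c == 0x3000;
}
#endif

/* Return the 7-bit stand-in that the lexer tables would see for c. */
static char fold_char(uint16_t c)
{
#if UNICODE_LEXER
    if (c >= 0x01 && c <= 0x03)
        return (char)YY_RAW_CTRL;   /* keep real control codes distinguishable */
    if (c < 0x80)
        return (char)c;             /* plain ASCII is unchanged */
    return is_unicode_space(c) ? (char)YY_UNI_SPACE : (char)YY_NON_ASCII;
#else
    assert(c < 0x100);              /* 8-bit input only when support is off */
    return (char)c;
#endif
}
```

A whitespace rule could then accept `\002` next to the usual blank and
tab, and a rule for `\003` could report a stray control character.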


I wrote this prototype for the project, and gave it some perfunctory
testing, but then the requirement for Unicode support was postponed,
and finally the project folded. These ideas have not been tested in
the wild.


Julian Orbach
Australian Centre for Unisys Software
[Someone else pointed out that if you use multibyte encodings rather than
wide characters, more often than not a regular 8 bit lexer will do the
right thing. -John]
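
To illustrate the note above, here is a small C sketch that assumes the
multibyte encoding is UTF-8 (the note does not name one). In UTF-8 every
byte of a non-ASCII character has its high bit set, so an 8-bit scanner
can treat bytes 0x80 to 0xFF as identifier characters without decoding
them. The function names are hypothetical. Encodings whose trailing bytes
can fall in the ASCII range, such as Shift-JIS, would not be safe with
this approach.

```c
#include <assert.h>
#include <stddef.h>

/* Byte class an 8-bit lexer might use for identifier characters,
 * roughly the flex character class [A-Za-z0-9_\x80-\xff]. */
static int is_ident_byte(unsigned char b)
{
    return b >= 0x80 || b == '_' ||
           (b >= '0' && b <= '9') ||
           (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z');
}

/* Length in bytes of the identifier-like run at the start of s. */
static size_t ident_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0' && is_ident_byte((unsigned char)s[n]))
        n++;
    return n;
}
```

The scanner never needs to know where one character ends and the next
begins; it only has to pass the matched bytes through unchanged.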

