|Double-byte lex and yacc? email@example.com (Michael O'Leary) (1997-04-02)|
|Re: Double-byte lex and yacc? firstname.lastname@example.org (1997-04-03)|
|Re: Double-byte lex and yacc? Julian.Orbach@unisys.com (1997-04-03)|
|Re: Double-byte lex and yacc? email@example.com (Duncan Smith) (1997-04-06)|
|Re: Double-byte lex and yacc? firstname.lastname@example.org (Michael O'Leary) (1997-04-16)|
|From:||"Michael O'Leary" <email@example.com>|
|Date:||16 Apr 1997 00:22:12 -0400|
|Organization:||Primus Communications Corporation|
|Keywords:||lex, i18n, comment|
So if I wanted to tokenize Unicode text that is a mixture of Latin-1
and Japanese characters, would it work to use lex to group pairs of
bytes into double-byte character tokens of type LATIN1_ALPHA,
LATIN1_PUNCT, JAPANESE_HIRAGANA, JAPANESE_KATAKANA, JAPANESE_KANJI,
etc., and then use yacc to perform the higher-level tokenization into
Latin-1 and Japanese substrings, or would that be too slow?
Also, can lex handle 0x00 bytes in an input stream, or would it always
treat them as string terminators? Thanks. Michael O'Leary
Michael O'Leary wrote:
> Are there any versions of lex and/or yacc that are capable of
> accepting double-byte character streams as input?
[If it's Unicode, there aren't any double-byte characters. But yes, in
general, once you get lex to find your tokens, yacc can parse the result.
And lex does handle 0 bytes, albeit with a modest performance penalty.]
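The point about 0 bytes applies to flex, which is 8-bit clean and can match NUL explicitly in patterns (the documented performance penalty comes from rules that can match NUL). A hypothetical sketch of rules grouping big-endian UCS-2 byte pairs, assuming LATIN1_ALPHA and JAPANESE_HIRAGANA are token codes defined elsewhere (e.g. by yacc's -d output):

%option noyywrap
%%
\x00[A-Za-z]    { return LATIN1_ALPHA; }      /* NUL high byte + ASCII letter */
\x30[\x41-\x9f] { return JAPANESE_HIRAGANA; } /* U+3041..U+309F */
.|\n            { /* skip anything else in this sketch */ }
%%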