Multibyte lexers in flex?

fussylizard@my-deja.com
10 May 2000 02:51:56 -0400

          From comp.compilers


From: fussylizard@my-deja.com
Newsgroups: comp.compilers
Date: 10 May 2000 02:51:56 -0400
Organization: Deja.com - Before you buy.
Keywords: lex, i18n, comment

    Does anyone have any experience in or tricks for developing scanners
in flex (or a variant) that support multibyte characters? I am
interested in developing a lexer (actually extending an existing one)
that will have to support different code pages at runtime. So, for
example, I would like to recognize patterns such as:


    KEYWORD = VALUE


where VALUE can contain multibyte characters in the current code page
(Japanese Shift-JIS, EUC-JP, etc., depending on where the executable is
running).


    I realize there are some ways around this by writing patterns such
as:


KEYWORD[ \t]*=  { BEGIN(MB_MODE); }


<MB_MODE>.+     { /* punt to some external fcn to
                   handle the multibyte string */ }


but this is somewhat ugly and requires me to be very careful about how
I write my patterns. Anyone have any ideas?


Thanks,
Chris
[This has come up before. In its usual 8-bit transparent mode, lex handles
multibyte characters just fine as multi-character sequences. -John]
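
As a rough illustration of the "multi-character sequence" view, here is a minimal C sketch that walks a Shift-JIS buffer one character at a time by treating a lead byte plus a trail byte as a single unit. The function names (sjis_is_lead, sjis_is_trail, sjis_char_len, sjis_count_chars) are hypothetical, and the lead-byte range assumes the common vendor-extended Shift-JIS layout (0x81-0x9F and 0xE0-0xFC); a stricter reading of the standard would stop at 0xEF. The same byte ranges could presumably be written as character classes in a lex pattern, but that is not shown here.

```c
#include <stddef.h>

/* Assumed lead-byte ranges for a two-byte Shift-JIS character:
   0x81-0x9F and 0xE0-0xFC (the upper range includes vendor extensions). */
static int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Trail-byte ranges: 0x40-0x7E and 0x80-0xFC. Note that this overlaps
   ASCII, so a trail byte can look like '\\' (0x5C) or a letter. */
static int sjis_is_trail(unsigned char c)
{
    return (c >= 0x40 && c <= 0x7E) || (c >= 0x80 && c <= 0xFC);
}

/* Length in bytes of the character starting at s[0], with n bytes
   available. Returns 0 for an empty buffer, 2 for a valid lead/trail
   pair, and 1 otherwise (single-byte character, or a malformed or
   truncated pair that is skipped one byte at a time). */
static size_t sjis_char_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;
    if (n >= 2 && sjis_is_lead(s[0]) && sjis_is_trail(s[1]))
        return 2;
    return 1;
}

/* Count characters (not bytes) in a Shift-JIS buffer of n bytes. */
static size_t sjis_count_chars(const unsigned char *s, size_t n)
{
    size_t count = 0;
    size_t i = 0;

    while (i < n) {
        i += sjis_char_len(s + i, n - i);
        count++;
    }
    return count;
}
```

For the four bytes 0x41 0x83 0x5C 0x3D, sjis_count_chars reports three characters: the 0x5C is consumed as a trail byte rather than being seen as a backslash. That overlap with ASCII is why a pattern like ".+" alone cannot tell where a VALUE really ends.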

