Related articles
looking for Lex/Bison unicode support porky72@hotmail.com (Yaron Bracha) (2000-01-19)
Re: looking for Lex/Bison unicode support qjackson@wave.home.com (Quinn Tyler Jackson) (2000-01-21)
Re: looking for Lex/Bison unicode support dmr@bell-labs.com (Dennis Ritchie) (2000-01-21)
Re: looking for Lex/Bison unicode support chet@watson.ibm.com (2000-01-23)
Re: looking for Lex/Bison unicode support webid@asi.fr (Armel) (2000-02-04)
From: "Quinn Tyler Jackson" <qjackson@wave.home.com>
Newsgroups: comp.compilers
Date: 21 Jan 2000 00:40:43 -0500
Organization: Compilers Central
References: 00-01-081
Keywords: lex, i18n
The moderator said:
> [... Lex or flex is harder since all of
> the implementations I know of use the character codes as indexes into
> tables to implement the lex state machine. But if you do that for
> Unicode, you'll have 64K entry tables rather than 256 entry tables and
> severe program bloat. I believe that plan 9 has a Unicode lex,
> presumably with some hackery to keep the table sizes down -John]
In LPM's lexical scanner, which can be compiled in Unicode mode, I got
around the huge table problem by using a table for everything less
than UCHAR_MAX, and allowing codes greater than that only in clause
parameters. This is only possible because the clause is the atom in
LPM, rather than the character. This means that the RE:
lexicographic*[a-z]+
is expressed in LPM as:
%'lexicographi'$ %*'c'$ %'a-z'#
This works in LPM because, during the match, the engine matches each
clause against the stream. The character class [a-z]+ above, when
compiled in standard mode, builds a table that is always UCHAR_MAX
entries big; in Unicode mode it behaves differently, since building
tables for Unicode character classes could result in huge tables.
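The general technique described above (a fixed small table for codes up
to UCHAR_MAX, with explicit range checks instead of table lookups for
anything wider) can be sketched roughly like this. This is not LPM's
actual implementation; the CharClass type, its single-wide-range
limitation, and the function names are all illustrative assumptions:

```c
/* Hypothetical sketch of the small-table-plus-range-check idea.
 * Not LPM's real code; names and the one-wide-range limit are
 * simplifying assumptions for illustration. */
#include <limits.h>
#include <stdbool.h>

typedef struct {
    bool table[UCHAR_MAX + 1]; /* bitmap only for codes <= UCHAR_MAX */
    unsigned lo, hi;           /* one range beyond UCHAR_MAX (assumption) */
    bool has_wide;
} CharClass;

static void cc_add_range(CharClass *cc, unsigned lo, unsigned hi)
{
    /* Codes that fit in the small table go into the bitmap. */
    for (unsigned c = lo; c <= hi && c <= UCHAR_MAX; c++)
        cc->table[c] = true;
    /* Anything wider is remembered as a range, not a table. */
    if (hi > UCHAR_MAX) {
        cc->has_wide = true;
        cc->lo = lo > UCHAR_MAX ? lo : UCHAR_MAX + 1;
        cc->hi = hi;
    }
}

static bool cc_match(const CharClass *cc, unsigned c)
{
    if (c <= UCHAR_MAX)
        return cc->table[c];   /* fast 256-entry table lookup */
    /* Wide codes: a couple of comparisons, no 64K table. */
    return cc->has_wide && c >= cc->lo && c <= cc->hi;
}
```

The trade-off is the same one the moderator raises: the table stays at
256 entries regardless of how wide the character set is, at the cost of
a comparison per wide-range membership test instead of one indexed load.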
--
Quinn Tyler Jackson
http://www.qtj.net/~quinn/