Related articles
looking for Lex/Bison unicode support porky72@hotmail.com (Yaron Bracha) (2000-01-19)
Re: looking for Lex/Bison unicode support qjackson@wave.home.com (Quinn Tyler Jackson) (2000-01-21)
Re: looking for Lex/Bison unicode support dmr@bell-labs.com (Dennis Ritchie) (2000-01-21)
Re: looking for Lex/Bison unicode support chet@watson.ibm.com (2000-01-23)
Re: looking for Lex/Bison unicode support webid@asi.fr (Armel) (2000-02-04)
From: "Quinn Tyler Jackson" <qjackson@wave.home.com>
Newsgroups: comp.compilers
Date: 21 Jan 2000 00:40:43 -0500
Organization: Compilers Central
References: 00-01-081
Keywords: lex, i18n
The moderator said:
> [... Lex or flex is harder since all of
> the implementations I know of use the character codes as indexes into
> tables to implement the lex state machine. But if you do that for
> Unicode, you'll have 64K entry tables rather than 256 entry tables and
> severe program bloat. I believe that plan 9 has a Unicode lex,
> presumably with some hackery to keep the table sizes down -John]
In LPM's lexical scanner, which can be compiled in Unicode mode, I got
around the huge table problem by using a table for everything less
than UCHAR_MAX, and allowing codes greater than that only in clause
parameters. This is only possible because the clause is the atom in
LPM, rather than the character. This means that the RE:
lexicographic*[a-z]+
is expressed in LPM as:
%'lexicographi'$ %*'c'$ %'a-z'#
This works in LPM because, during the match, the engine matches each
clause against the stream. The character class [a-z]+ above, when
compiled in standard mode, builds a table that is always UCHAR_MAX
entries big; in Unicode mode it behaves differently, since building
tables for Unicode character classes could result in huge tables.
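The general technique described above (a fixed small table for codes up
to UCHAR_MAX, with explicit range checks instead of table lookups for
anything wider) can be sketched roughly like this. This is not LPM's
actual implementation; the CharClass type, its single-wide-range
limitation, and the function names are all illustrative assumptions:

```c
/* Hypothetical sketch of the small-table-plus-range-check idea.
 * Not LPM's real code; names and the one-wide-range limit are
 * simplifying assumptions for illustration. */
#include <limits.h>
#include <stdbool.h>

typedef struct {
    bool table[UCHAR_MAX + 1]; /* bitmap only for codes <= UCHAR_MAX */
    unsigned lo, hi;           /* one range beyond UCHAR_MAX (assumption) */
    bool has_wide;
} CharClass;

static void cc_add_range(CharClass *cc, unsigned lo, unsigned hi)
{
    /* Codes that fit in the small table go into the bitmap. */
    for (unsigned c = lo; c <= hi && c <= UCHAR_MAX; c++)
        cc->table[c] = true;
    /* Anything wider is remembered as a range, not a table. */
    if (hi > UCHAR_MAX) {
        cc->has_wide = true;
        cc->lo = lo > UCHAR_MAX ? lo : UCHAR_MAX + 1;
        cc->hi = hi;
    }
}

static bool cc_match(const CharClass *cc, unsigned c)
{
    if (c <= UCHAR_MAX)
        return cc->table[c];   /* fast 256-entry table lookup */
    /* Wide codes: a couple of comparisons, no 64K table. */
    return cc->has_wide && c >= cc->lo && c <= cc->hi;
}
```

The trade-off is the same one the moderator raises: the table stays at
256 entries regardless of how wide the character set is, at the cost of
a comparison per wide-range membership test instead of one indexed load.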
--
Quinn Tyler Jackson
http://www.qtj.net/~quinn/