Related articles
looking for Lex/Bison unicode support porky72@hotmail.com (Yaron Bracha) (2000-01-19)
Re: looking for Lex/Bison unicode support qjackson@wave.home.com (Quinn Tyler Jackson) (2000-01-21)
Re: looking for Lex/Bison unicode support dmr@bell-labs.com (Dennis Ritchie) (2000-01-21)
Re: looking for Lex/Bison unicode support chet@watson.ibm.com (2000-01-23)
Re: looking for Lex/Bison unicode support webid@asi.fr (Armel) (2000-02-04)
From: chet@watson.ibm.com (Chet Murthy)
Newsgroups: comp.compilers
Date: 23 Jan 2000 10:11:22 -0500
Organization: IBM_Research
References: 00-01-081 00-01-087
Keywords: lex, i18n
There's another way to do this. Take your favorite regular-expression
over Unicode (UTF-16) glyphs. Convert each glyph to its UTF-8
representation as a series of characters.
Then feed that to Lex. You'll need to convert your input codepage to
UTF-8 before handing it to your lexer, but that's not so tough.
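As a rough sketch of the glyph-to-bytes step (in Python rather than the CAML the author used, and with a hypothetical helper name), each code point becomes the escaped UTF-8 byte sequence a byte-oriented Lex would match:

```python
# Hypothetical sketch: render Unicode code points as the UTF-8 byte
# strings a byte-at-a-time Lex can match on.

def utf8_pattern(codepoint):
    """One code point as a Lex-style escaped UTF-8 byte sequence."""
    return "".join("\\x%02x" % b for b in chr(codepoint).encode("utf-8"))

# A small lexical class becomes an alternation over byte strings,
# e.g. U+00C0..U+00C5:
alts = "|".join(utf8_pattern(cp) for cp in range(0xC0, 0xC6))
print(alts)  # \xc3\x80|\xc3\x81|...|\xc3\x85
```

A real generator would also emit whole ranges at once, but the principle is just this per-glyph expansion.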
Now, if you have a Lex that can handle *huuuuge* regular-expressions,
it'll just *work*.
If not, and if your Lex is relatively stupid (in this case, stupidity
is a *good* thing ;-) about not trying to build optimal lexing
automata, you can pre-process your regular-expression to factor out
lots of common cases.
E.g., in the 'Name' lexical class in the XML spec, there are 35K
different glyphs. If you apply a few factorization transformations to
the naive 35K-way disjunction, you can get the disjunction down to 107
branches, each of which, admittedly, has a hundred or so sub-branches.
But in the process, you end up removing a lot of opportunity for
state-explosion.
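One way to picture that factorization (a Python sketch with a hypothetical function name, not the author's actual transformation): group code points by their shared UTF-8 prefix bytes, so the huge disjunction becomes a few prefix branches, each ending in a simple alternation over final bytes.

```python
# Hypothetical sketch of prefix factorization: code points whose UTF-8
# encodings share all but the last byte collapse into one branch.
from collections import defaultdict

def factor_by_prefix(codepoints):
    """Map each UTF-8 prefix (all bytes but the last) to its last bytes."""
    groups = defaultdict(list)
    for cp in codepoints:
        b = chr(cp).encode("utf-8")
        groups[b[:-1]].append(b[-1])
    return groups

# 100 CJK code points (three UTF-8 bytes each) share only a couple of
# two-byte prefixes, so the naive 100-way disjunction factors sharply:
groups = factor_by_prefix(range(0x4E00, 0x4E64))
print(len(groups))  # far fewer branches than 100
```

Because the shared prefixes are matched once instead of per-alternative, the lexer's NFA stays small even when the class has tens of thousands of members.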
Now, I don't claim that this is pretty, nor that it will work every
time. But it certainly got me past all the lexical issues in XML, and
I was able to use (CAML) Lex/Yacc to do it all.
--chet--
[Wow, that's gross. But it's a good idea. -John]