Re: looking for Lex/Bison unicode support

chet@watson.ibm.com (Chet Murthy)
23 Jan 2000 10:11:22 -0500

          From comp.compilers

Related articles
looking for Lex/Bison unicode support porky72@hotmail.com (Yaron Bracha) (2000-01-19)
Re: looking for Lex/Bison unicode support qjackson@wave.home.com (Quinn Tyler Jackson) (2000-01-21)
Re: looking for Lex/Bison unicode support dmr@bell-labs.com (Dennis Ritchie) (2000-01-21)
Re: looking for Lex/Bison unicode support chet@watson.ibm.com (2000-01-23)
Re: looking for Lex/Bison unicode support webid@asi.fr (Armel) (2000-02-04)
From: chet@watson.ibm.com (Chet Murthy)
Newsgroups: comp.compilers
Date: 23 Jan 2000 10:11:22 -0500
Organization: IBM_Research
References: 00-01-081 00-01-087
Keywords: lex, i18n

There's another way to do this. Take your favorite regular expression
over Unicode (UTF-16) characters. Convert each character to its UTF-8
representation as a series of bytes.
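
A minimal sketch of that conversion step, in Python purely for illustration.
The helper name utf8_lex_pattern is made up here, and the sketch assumes your
Lex accepts \xHH byte escapes in patterns, which you should check for your
particular tool.

```python
def utf8_lex_pattern(text: str) -> str:
    """Spell every character of `text` as escaped UTF-8 bytes.

    Hypothetical helper: the \\xHH escape syntax is an assumption about
    what the target lexer generator accepts.
    """
    return "".join(f"\\x{b:02X}" for b in text.encode("utf-8"))


if __name__ == "__main__":
    print(utf8_lex_pattern("é"))  # \xC3\xA9
    print(utf8_lex_pattern("€"))  # \xE2\x82\xAC
```

Each non-ASCII character becomes a sequence of two or more bytes, so a
byte-oriented lexer sees it as an ordinary concatenation.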


Then feed that to Lex. You'll need to convert your input from its codepage
into UTF-8 before handing it to your lexer, but that's not so tough.
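
For instance, the input conversion might look like the sketch below. The
function name to_utf8 and the choice of Latin-1 as the source codepage are
assumptions for the example only.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Transcode raw input bytes from `source_encoding` to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    latin1_input = b"caf\xe9"  # "café" in Latin-1 (assumed source codepage)
    print(to_utf8(latin1_input, "latin-1"))  # b'caf\xc3\xa9'
```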


Now, if you have a Lex that can handle *huuuuge* regular-expressions,
it'll just *work*.


If not, and if your Lex is relatively stupid (in this case, stupidity
is a *good* thing ;-) in that it doesn't try to build optimal lexing
automata, you can pre-process your regular expression to factor out
lots of common cases.


E.g., in the 'Name' lexical class in the XML spec, there are 35K
different characters. If you apply a few factorization transformations to
the naive 35K-way disjunction, you can get the disjunction down to 107
branches, each of which, admittedly, has a hundred or so branches.
But in the process, you end up removing a lot of the opportunity for
state explosion.
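
The post doesn't say which factorizations were used, so the following is only
one plausible sketch: group the characters by their leading UTF-8 bytes and
collapse the final bytes into character-class ranges. The name
factor_by_prefix and the \xHH escape syntax are assumptions, and the input
must not contain surrogate code points.

```python
from collections import defaultdict


def esc(byte: int) -> str:
    """Render one byte as a \\xHH escape (assumed lexer syntax)."""
    return f"\\x{byte:02X}"


def factor_by_prefix(code_points):
    """Turn a set of code points into a short list of pattern branches.

    Characters that share all UTF-8 bytes except the last end up in one
    branch; their last bytes are merged into ranges inside a [...] class.
    """
    groups = defaultdict(list)
    for cp in sorted(set(code_points)):
        encoded = chr(cp).encode("utf-8")
        groups[encoded[:-1]].append(encoded[-1])

    branches = []
    for prefix, lasts in groups.items():
        # `lasts` is already ascending because UTF-8 preserves code point order.
        ranges = []
        for b in lasts:
            if ranges and b == ranges[-1][1] + 1:
                ranges[-1][1] = b  # extend the current run
            else:
                ranges.append([b, b])  # start a new run
        cls = "".join(
            esc(lo) if lo == hi else f"{esc(lo)}-{esc(hi)}" for lo, hi in ranges
        )
        branches.append("".join(esc(b) for b in prefix) + "[" + cls + "]")
    return branches


if __name__ == "__main__":
    # Greek lowercase alpha..omega, U+03B1..U+03C9: 25 characters.
    for branch in factor_by_prefix(range(0x03B1, 0x03CA)):
        print(branch)
    # \xCE[\xB1-\xBF]
    # \xCF[\x80-\x89]
```

Here a 25-way disjunction shrinks to two branches; the same grouping applied
to a much larger class gives the kind of two-level shape described above.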


Now, I don't claim that this is pretty, or that it will work
every time. But it certainly got me past all the lexical issues in
XML, and I was able to use (CAML) Lex/Yacc to do it all.


--chet--
[Wow, that's gross. But it's a good idea. -John]




