Re: looking for Lex/Bison unicode support (Chet Murthy)
23 Jan 2000 10:11:22 -0500

          From comp.compilers

Related articles
looking for Lex/Bison unicode support (Yaron Bracha) (2000-01-19)
Re: looking for Lex/Bison unicode support (Quinn Tyler Jackson) (2000-01-21)
Re: looking for Lex/Bison unicode support (Dennis Ritchie) (2000-01-21)
Re: looking for Lex/Bison unicode support (2000-01-23)
Re: looking for Lex/Bison unicode support (Armel) (2000-02-04)

From: (Chet Murthy)
Newsgroups: comp.compilers
Date: 23 Jan 2000 10:11:22 -0500
Organization: IBM_Research
References: 00-01-081 00-01-087
Keywords: lex, i18n

There's another way to do this. Take your favorite regular expression
over Unicode (UTF-16) characters. Convert each character to its UTF-8
representation as a series of bytes, which Lex treats as ordinary
8-bit characters.

Then feed that to Lex. You'll need to convert your input from its
codepage into UTF-8 before handing it to your lexer, but that's not so
tough.
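
To make the conversion concrete, here is a minimal sketch in Python (not
the tooling I actually used). The function names are made up for
illustration, and the \xHH escape syntax is an assumption; adjust it to
whatever your Lex accepts for raw bytes.

```python
def utf8_pattern(cp):
    """Return a Lex-style pattern matching the UTF-8 bytes of one code point.

    Assumes the target Lex accepts \\xHH escapes for single bytes.
    Surrogate code points (U+D800..U+DFFF) have no UTF-8 encoding,
    so encode() raises UnicodeEncodeError for them.
    """
    data = chr(cp).encode("utf-8")
    return "".join("\\x%02x" % b for b in data)


def naive_alternation(code_points):
    """Build the naive N-way disjunction, one branch per code point."""
    return "|".join(utf8_pattern(cp) for cp in code_points)


if __name__ == "__main__":
    # U+0041 'A', U+00E9 'e' with acute accent, U+4E2D (a CJK ideograph)
    print(naive_alternation([0x41, 0xE9, 0x4E2D]))
```

Each branch is just a fixed string of one to four bytes, so any 8-bit
Lex can match it; the only problem is how many branches you end up with.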

Now, if you have a Lex that can handle *huuuuge* regular-expressions,
it'll just *work*.

If not, and if your Lex is relatively stupid (in this case, stupidity
is a *good* thing ;-), in that it doesn't try to build optimal lexing
automata, you can pre-process your regular expression to factor out
lots of common cases.

E.g., in the 'Name' lexical class in the XML spec, there are 35K
different characters. If you apply a few factorization transformations
to the naive 35K-way disjunction, you can get the disjunction down to
107 branches, each of which, admittedly, has a hundred or so branches.
But in the process, you end up removing a lot of opportunity for
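
One possible factorization, sketched in Python under my own assumptions
(this is not necessarily the set of transformations described above):
group the UTF-8 encodings by everything except their last byte, then
collapse the last bytes of each group into a character class with
ranges. The helper names are hypothetical, and the \xHH escape syntax
is again an assumption about the target Lex.

```python
from collections import defaultdict


def byte_ranges(values):
    """Collapse a set of byte values into sorted (lo, hi) ranges."""
    ranges = []
    for b in sorted(set(values)):
        if ranges and b == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], b)  # extend the current run
        else:
            ranges.append((b, b))            # start a new run
    return ranges


def char_class(values):
    """Render byte values as a Lex-style character class, e.g. [\\x80-\\xbf]."""
    parts = []
    for lo, hi in byte_ranges(values):
        if lo == hi:
            parts.append("\\x%02x" % lo)
        else:
            parts.append("\\x%02x-\\x%02x" % (lo, hi))
    return "[" + "".join(parts) + "]"


def factored_pattern(code_points):
    """One branch per distinct UTF-8 prefix instead of one per code point."""
    groups = defaultdict(set)
    for cp in code_points:
        data = chr(cp).encode("utf-8")
        groups[data[:-1]].add(data[-1])      # key: all bytes but the last
    branches = []
    for prefix in sorted(groups):
        head = "".join("\\x%02x" % b for b in prefix)
        branches.append(head + char_class(groups[prefix]))
    return "|".join(branches)


if __name__ == "__main__":
    print(factored_pattern(range(0x41, 0x5B)))    # ASCII capitals
    print(factored_pattern([0xE8, 0xE9, 0xEA]))   # three Latin-1 letters
```

Since a UTF-8 trailing byte carries six bits, this step alone cuts the
number of branches by up to a factor of 64; applying the same grouping
again to the prefixes would shrink the top-level disjunction further.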

Now, I don't claim that this is pretty, or that it will work every
time. But it certainly got me past all the lexical issues in XML, and
I was able to use (CAML) Lex/Yacc to do it all.

[Wow, that's gross. But it's a good idea. -John]
