Related articles
looking for Lex/Bison unicode support porky72@hotmail.com (Yaron Bracha) (2000-01-19)
Re: looking for Lex/Bison unicode support qjackson@wave.home.com (Quinn Tyler Jackson) (2000-01-21)
Re: looking for Lex/Bison unicode support dmr@bell-labs.com (Dennis Ritchie) (2000-01-21)
Re: looking for Lex/Bison unicode support chet@watson.ibm.com (2000-01-23)
Re: looking for Lex/Bison unicode support webid@asi.fr (Armel) (2000-02-04)
From: chet@watson.ibm.com (Chet Murthy)
Newsgroups: comp.compilers
Date: 23 Jan 2000 10:11:22 -0500
Organization: IBM_Research
References: 00-01-081 00-01-087
Keywords: lex, i18n
There's another way to do this. Take your favorite regular-expression
over Unicode (UTF-16) glyphs. Convert each glyph to its UTF-8
representation as a series of characters.
Then feed that to Lex. You'll need to convert your input codepage to
UTF-8 before handing it to your lexer, but that's not so tough.
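As a rough sketch of the glyph-to-bytes step (in Python rather than the CAML the author used, and with a hypothetical helper name), each code point becomes the escaped UTF-8 byte sequence a byte-oriented Lex would match:

```python
# Hypothetical sketch: render Unicode code points as the UTF-8 byte
# strings a byte-at-a-time Lex can match on.

def utf8_pattern(codepoint):
    """One code point as a Lex-style escaped UTF-8 byte sequence."""
    return "".join("\\x%02x" % b for b in chr(codepoint).encode("utf-8"))

# A small lexical class becomes an alternation over byte strings,
# e.g. U+00C0..U+00C5:
alts = "|".join(utf8_pattern(cp) for cp in range(0xC0, 0xC6))
print(alts)  # \xc3\x80|\xc3\x81|...|\xc3\x85
```

A real generator would also emit whole ranges at once, but the principle is just this per-glyph expansion.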
Now, if you have a Lex that can handle *huuuuge* regular-expressions,
it'll just *work*.
If not, and if your Lex is relatively stupid (in this case, stupidity
is a *good* thing ;-) about not trying to build optimal lexing
automata, you can pre-process your regular-expression to factor out
lots of common cases.
E.g., in the 'Name' lexical class in the XML spec, there are 35K
different glyphs. If you apply a few factorization transformations to
the naive 35K-way disjunction, you can get the disjunction down to 107
branches, each of which, admittedly, has a hundred or so sub-branches.
But in the process, you end up removing a lot of opportunity for
state-explosion.
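One way to picture that factorization (a Python sketch with a hypothetical function name, not the author's actual transformation): group code points by their shared UTF-8 prefix bytes, so the huge disjunction becomes a few prefix branches, each ending in a simple alternation over final bytes.

```python
# Hypothetical sketch of prefix factorization: code points whose UTF-8
# encodings share all but the last byte collapse into one branch.
from collections import defaultdict

def factor_by_prefix(codepoints):
    """Map each UTF-8 prefix (all bytes but the last) to its last bytes."""
    groups = defaultdict(list)
    for cp in codepoints:
        b = chr(cp).encode("utf-8")
        groups[b[:-1]].append(b[-1])
    return groups

# 100 CJK code points (three UTF-8 bytes each) share only a couple of
# two-byte prefixes, so the naive 100-way disjunction factors sharply:
groups = factor_by_prefix(range(0x4E00, 0x4E64))
print(len(groups))  # far fewer branches than 100
```

Because the shared prefixes are matched once instead of per-alternative, the lexer's NFA stays small even when the class has tens of thousands of members.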
Now, I don't claim that this is pretty, nor that it will work every
time. But it certainly got me past all the lexical issues in XML, and
I was able to use (CAML) Lex/Yacc to do it all.
--chet--
[Wow, that's gross. But it's a good idea. -John]