|looking for Lex/Bison unicode support email@example.com (Yaron Bracha) (2000-01-19)|
|Re: looking for Lex/Bison unicode support firstname.lastname@example.org (Quinn Tyler Jackson) (2000-01-21)|
|Re: looking for Lex/Bison unicode support email@example.com (Dennis Ritchie) (2000-01-21)|
|Re: looking for Lex/Bison unicode support firstname.lastname@example.org (2000-01-23)|
|Re: looking for Lex/Bison unicode support email@example.com (Armel) (2000-02-04)|
|From:||firstname.lastname@example.org (Chet Murthy)|
|Date:||23 Jan 2000 10:11:22 -0500|
There's another way to do this. Take your favorite regular expression
over Unicode (UTF-16) glyphs. Convert each glyph to its UTF-8
representation as a sequence of 8-bit characters (bytes).
Then feed that to Lex. You'll need to convert your input from its
codepage into UTF-8 before handing it to your lexer, but that's not so
tough.
Now, if you have a Lex that can handle *huuuuge* regular expressions,
it'll just *work*.
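As a rough illustration of the conversion step, here is a minimal Python sketch. The function names `utf8_lex_literal` and `alternation` are made up for this example, and it assumes a Lex dialect that accepts `\xNN` byte escapes (flex does; other dialects may need a different escape syntax).

```python
def utf8_lex_literal(cp: int) -> str:
    """Render one Unicode code point as a run of \\xNN byte escapes.

    Hypothetical helper: the code point is encoded to UTF-8 and each
    resulting byte is written as an escape a byte-oriented Lex can match.
    Surrogate code points (U+D800..U+DFFF) cannot be encoded and raise
    UnicodeEncodeError.
    """
    return "".join(f"\\x{b:02x}" for b in chr(cp).encode("utf-8"))


def alternation(code_points) -> str:
    """Build the naive N-way disjunction over the given code points."""
    return "(" + "|".join(utf8_lex_literal(cp) for cp in code_points) + ")"


if __name__ == "__main__":
    # 'A' is one byte in UTF-8, e-acute (U+00E9) is two.
    print(alternation([0x41, 0xE9]))  # prints (\x41|\xc3\xa9)
```

For a class with tens of thousands of members this naive disjunction gets very large, which is why the size of regular expression your Lex can digest matters.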
If not, and if your Lex is relatively stupid (in this case, stupidity
is a *good* thing ;-) in that it doesn't try to build optimal lexing
automata, you can pre-process your regular expression to factor out
lots of common cases.
E.g., in the 'Name' lexical class in the XML spec, there are 35K
different glyphs. If you apply a few factorization transformations to
the naive 35K-way disjunction, you can get the disjunction down to 107
branches, each of which, admittedly, has a hundred or so branches.
But in the process, you end up removing a lot of opportunity for
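The post does not say which factorization transformations were used. One plausible transformation, sketched here in Python under that caveat, is to group the UTF-8 encodings by their leading bytes and collapse the final bytes of each group into a character class. The names `esc`, `byte_class`, and `factored_regex` are invented for this sketch, and `\xNN` escapes inside character classes are assumed to be supported by the target Lex.

```python
from collections import defaultdict


def esc(b: int) -> str:
    """Write one byte as a \\xNN escape."""
    return f"\\x{b:02x}"


def byte_class(last_bytes) -> str:
    """Collapse a non-empty set of byte values into a class like [\\x80-\\x89]."""
    values = sorted(set(last_bytes))
    ranges = []
    start = prev = values[0]
    for b in values[1:]:
        if b == prev + 1:  # still inside a contiguous run
            prev = b
            continue
        ranges.append((start, prev))
        start = prev = b
    ranges.append((start, prev))
    body = "".join(
        esc(lo) if lo == hi else f"{esc(lo)}-{esc(hi)}" for lo, hi in ranges
    )
    return f"[{body}]"


def factored_regex(code_points) -> str:
    """Factor a disjunction of code points by shared UTF-8 leading bytes."""
    groups = defaultdict(list)
    for cp in code_points:
        encoded = chr(cp).encode("utf-8")
        # Key: every byte except the last; value: the possible last bytes.
        groups[encoded[:-1]].append(encoded[-1])
    branches = [
        "".join(esc(b) for b in prefix) + byte_class(groups[prefix])
        for prefix in sorted(groups)
    ]
    return "(" + "|".join(branches) + ")"


if __name__ == "__main__":
    # Greek lowercase alpha..omega: 25 code points become 2 branches.
    print(factored_regex(range(0x3B1, 0x3CA)))
    # prints (\xce[\xb1-\xbf]|\xcf[\x80-\x89])
```

Each branch shares one byte prefix, so the number of top-level branches is the number of distinct prefixes rather than the number of code points.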
Now, I don't claim that this is pretty, or that it will work every
time. But it certainly got me past all the lexical issues in XML, and
I was able to use (CAML) Lex/Yacc to do it all.
[Wow, that's gross. But it's a good idea. -John]