Re: Unicode lexical analysis and parsing with flex and bison in C++?


Related articles
Unicode lexical analysis and parsing with flex and bison in C++? craigh@visio.com (Craig Hobbs) (1998-01-12)
Re: Unicode lexical analysis and parsing with flex and bison in C++? henry@zoo.toronto.edu (Henry Spencer) (1998-01-13)
Re: Unicode lexical analysis and parsing with flex and bison in C++? schwartz@finch.cse.psu.edu (1998-01-14)

From: Henry Spencer <henry@zoo.toronto.edu>
Newsgroups: comp.compilers
Date: 13 Jan 1998 00:18:41 -0500
Organization: SP Systems, Toronto
References: 98-01-042
Keywords: lex, i18n

Craig Hobbs <craigh@visio.com> wrote:
>I'd imagine (although I haven't tried it) that one would need to
>modify the generated code to use wchar_t instead of char...
>[It's not that simple, lex uses character values as indexes into
>arrays, and you might not want to have a lot of 64K arrays in your
>code...]


The orthodox approach to this, although it would take some changes to
lex, is to use a single big table to map the incoming character into
one of a small number of equivalence classes, and to use the
equivalence-class number to index into your internal arrays. Even
with 8-bit characters, the number of equivalence classes is almost
always *much* smaller than the number of distinct characters. (For
example, even with a fairly complex system of numeric constants, the
digits 1-7 are typically one equivalence class, because the regular
expressions don't distinguish between them.)
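
As a concrete illustration, here is a minimal sketch in C of such a
mapping for 8-bit characters; the class names and numbers are invented
for the example, and this is not flex's actual table layout:

    #include <stdio.h>

    /* Invented equivalence-class numbers, for illustration only. */
    enum { C_OTHER, C_ZERO, C_OCT, C_DEC, NCLASSES };

    static unsigned char eqclass[256];   /* one entry per character */

    static void init_classes(void)
    {
        int c;
        for (c = 0; c < 256; c++)
            eqclass[c] = C_OTHER;
        eqclass['0'] = C_ZERO;               /* '0' is special: octal prefix */
        for (c = '1'; c <= '7'; c++)
            eqclass[c] = C_OCT;              /* 1-7 are one class: the REs
                                                never tell them apart */
        eqclass['8'] = eqclass['9'] = C_DEC; /* 8-9 occur only in decimal */
    }

    int main(void)
    {
        const char *s = "0755 1998";
        init_classes();
        /* The scanner's transition table would then be indexed as
           next_state[state][eqclass[ch]]: NCLASSES columns per state
           instead of 256 (or 65536) of them. */
        for (; *s != '\0'; s++)
            printf("'%c' -> class %d\n", *s, eqclass[(unsigned char)*s]);
        return 0;
    }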


The size of the mapping table can be kept under control, if that's an
issue, by using a two-level table, with the upper byte of the
character indexing into a first-level table which contains pointers to
second-level tables, which in turn are indexed by the lower byte. The
advantage is that many of the second-level tables are identical, and
you need only one copy of each *different* second-level table. (This
helps both memory consumption and cache hit ratio, and is typically
quicker than fooling around with conditional branches to implement
things like run-length compression of the table.)
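
A sketch of that two-level scheme for 16-bit characters follows; again
the names and class numbers are invented, and a real generator would
emit these tables rather than build them at run time:

    #include <stdio.h>

    /* Second-level tables, 256 class numbers each.  Every page whose
       characters all fall in the "other" class shares one all-zero
       table; only pages that differ get their own copy. */
    static const unsigned char page_other[256];   /* all class 0 */
    static unsigned char page_latin[256];         /* page 0x00xx */

    /* First level: 256 pointers, indexed by the upper byte. */
    static const unsigned char *first_level[256];

    static void init_tables(void)
    {
        int i;
        for (i = 0; i < 256; i++)
            first_level[i] = page_other;   /* most pages share one table */
        for (i = '1'; i <= '7'; i++)
            page_latin[i] = 1;             /* invented classes, as above */
        page_latin['8'] = page_latin['9'] = 2;
        first_level[0x00] = page_latin;    /* only page 0x00xx differs */
    }

    /* Map a 16-bit character to its equivalence class: two plain
       array lookups, no conditional branches. */
    static int char_class(unsigned int wc)
    {
        return first_level[(wc >> 8) & 0xFF][wc & 0xFF];
    }

    int main(void)
    {
        init_tables();
        printf("class of U+0037 = %d\n", char_class(0x0037)); /* '7': 1 */
        printf("class of U+4E00 = %d\n", char_class(0x4E00)); /* 0 */
        return 0;
    }

In this sketch there are only two distinct second-level tables, so the
whole mapping costs 2*256 bytes of class data plus 256 pointers,
rather than a flat 64K array.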
--
Henry Spencer
henry@zoo.toronto.edu

