Lex character class question

"Michael O'Leary" <moleary@primus.com>
13 Apr 1998 00:28:32 -0400

          From comp.compilers


From: "Michael O'Leary" <moleary@primus.com>
Newsgroups: comp.compilers
Date: 13 Apr 1998 00:28:32 -0400
Organization: Primus Communications Corporation
Keywords: lex, question, comment

We are using Flex in an application to tokenize English text, and have a
set of rules for determining when certain punctuation characters should
be treated as part of a token and when they indicate a token boundary
(such as a period in foo.com vs at the end of a sentence).


We would like to customize this in such a way that different users can
declare at run time (i.e., at the start of execution) which category
certain characters are assigned to, since they may be operating on
different kinds of text where word-internal punctuation serves different
purposes. My impression is that Flex does not permit this kind of
flexibility, since as far as I can tell the tables are built before
compile time when Flex is run on the .ll file. Furthermore, the patterns
are specified with literal characters and character classes, and it
appears that there is no way to use something along the lines of
reassignable variables in patterns instead.


Am I overlooking a way to supply this feature using Flex? Or could we
provide it with another tokenizing application? If not, I suppose we
could do some kind of post-processing of the tokens (although in the
worst case it seems the post-processor would have to redo everything the
Flex run had done).


Mike O'Leary
moleary@primus.com
[Flex does indeed precompute the tables, which is one of the reasons
that it's so fast. Redoing the tables at run time is impractical
unless you want to fork off an entire run of flex each time, but there
are several possible ways to handle your situation. If you want to
handle a moderate number of known punctuation setups, use exclusive
start states with a start state per setup. Mark the various patterns
that handle punctuation differently with the various start states, and
at runtime switch to the appropriate start state and stay there.
(Start states are defined as preprocessor symbols, and the BEGIN macro
can take any expression so long as it evaluates to the value of one of
the defined start states.) Or if you really want people to be able to
put any punctuation into any syntax class, redefine YY_INPUT so that
it looks up all the input characters in a 256 byte table before it
gives the input string to the lexer, set up the table at the beginning
of your run to map each character to an exemplar character for the
class to which the user has assigned it, and write the lexer in terms
of the exemplar characters. The start state approach won't slow down
your lexer at all; the translation approach will slow it a little
since there's one lookup per character, but it should still be quite
fast. -John]
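
The translation-table idea in the note above could be sketched roughly as
follows. This is only an illustration under stated assumptions, not code
from the original post: the names xlate, xlate_init, xlate_assign and
xlate_buffer are made up for the example, and the YY_INPUT definition
assumes the scanner reads from yyin with fread. YY_INPUT(buf, result,
max_size) itself is the standard flex hook for supplying input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical 256-entry table: xlate[c] is the exemplar character the
   lexer will see in place of input character c. */
static unsigned char xlate[256];

/* Start from the identity mapping, so every character stands for itself. */
void xlate_init(void)
{
    for (int i = 0; i < 256; i++)
        xlate[i] = (unsigned char)i;
}

/* Assign every character in `members` to the class represented by
   `exemplar`. Intended to be called once at startup, from user settings. */
void xlate_assign(const char *members, char exemplar)
{
    assert(members != NULL);
    for (size_t i = 0, n = strlen(members); i < n; i++)
        xlate[(unsigned char)members[i]] = (unsigned char)exemplar;
}

/* Translate n bytes of buf in place: one table lookup per character. */
void xlate_buffer(char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (char)xlate[(unsigned char)buf[i]];
}

/* Assumed hook for the definitions section of the .l file: read a block
   from yyin, translate it, and report the byte count (0 means EOF). */
#define YY_INPUT(buf, result, max_size)                          \
    do {                                                         \
        size_t xlate_n_ = fread((buf), 1, (max_size), yyin);     \
        xlate_buffer((buf), xlate_n_);                           \
        (result) = (int)xlate_n_;                                \
    } while (0)
```

With a setup like this, a user who wants hyphens and apostrophes treated
as word-internal might call xlate_init() and then xlate_assign("-'", '.')
before scanning, and the lexer rules would mention only '.'. One
consequence to keep in mind is that the token text the lexer matches
contains the exemplar characters rather than the originals, so recovering
the original spelling would need a separate copy of the input.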