Re: Alternatives to Regexps

bromage@cs.mu.OZ.AU (Andrew Bromage)
11 Jul 1998 23:45:03 -0400

          From comp.compilers

Related articles
[3 earlier articles]
Re: Alternatives to Regexps jamz@cdsnet.net (1998-07-10)
Re: Alternatives to Regexps d.rourke@arpc.com (Daniel Rourke) (1998-07-10)
Re: Alternatives to Regexps bear@sonic.net (Ray Dillinger) (1998-07-10)
Re: Alternatives to Regexps bpr@best.com (Brian Rogoff) (1998-07-10)
Re: Alternatives to Regexps mav@naxos.esat.kuleuven.ac.be (Maurizio Vitale) (1998-07-10)
Re: Alternatives to Regexps lord@emf.emf.net (1998-07-11)
Re: Alternatives to Regexps bromage@cs.mu.OZ.AU (1998-07-11)
Re: Alternatives to Regexps cfc@world.std.com (Chris F Clark) (1998-07-13)
Re: Alternatives to Regexps torbenm@diku.dk (Torben Mogensen) (1998-07-13)
Re: Alternatives to Regexps thetick@magelang.com (Scott Stanchfield) (1998-07-20)
| List of all articles for this month |

From: bromage@cs.mu.OZ.AU (Andrew Bromage)
Newsgroups: comp.compilers
Date: 11 Jul 1998 23:45:03 -0400
Organization: Computer Science, The University of Melbourne
References: 98-07-057 98-07-075
Keywords: DFA, design

G'day all.


Torben Mogensen <torbenm@diku.dk> writes:


>You can use context free grammars also at the lexical level. Sometimes
>LR parsetables for grammars that actually represent regular languages
>can be smaller than the DFA's generated from the regular expressions.


Do you have any pointers to some stats on this?


I'd also be interested to know if any good tools for (partially?)
automating the process of writing such a lexer exist. I don't think
that most off-the-shelf LR parser generators will cut it.


Warning: Unjustified over-generalisations and opinions follow. Proceed
at your own risk. :-)


One thing that separates lexical analysis and syntax analysis in current
mainstream practice (i.e. not necessarily desirable in a perfect world)
is the amount of disambiguation that lexical analyser generators do for
you. Consider C identifiers and keywords. In lex, you might write:


%%


if return IF;
int return INT;
while return WHILE;
. . .
[A-Za-z_][a-Za-z0-9_]* {
if (is_typedef(yytext)) {
return TYPENAME;
}
else {
return IDENTIFIER;
}
}


%%


Lex resolves the overlap between the keywords and identifiers/type names
by favoring rules which appear syntactically earlier in the file. This
is generally what you want (arguably there should be a better mechanism
for specifying relative rule priorities, but that's another topic).
Typical lex specifications tend to do this often. For example, it may
be more convenient to make the rules for a floating-point number overlap
with those of an integer and simply put the integer rule at a higher
priority than the floating-point rule.


Doing a similar job using current parser generator tools is not as
straightforward, since you generally don't want disambiguation to happen
behind your back like this. If there is a conflict, it's probably a bug.


One other option, writing an unambiguous regular expression/CFG for both
non-keyword identifiers, is difficult. Even something as simple as a C
inline comment is quite difficult to express in regular expression syntax
(unless your regex language has set difference; an invaluable tool for
regex-based scanner generators).


Cheers,
Andrew Bromage
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.