Re: Alternatives to Regexps

Torben Mogensen <torbenm@diku.dk>
13 Jul 1998 23:48:33 -0400

          From comp.compilers

Related articles
[5 earlier articles]
Re: Alternatives to Regexps bear@sonic.net (Ray Dillinger) (1998-07-10)
Re: Alternatives to Regexps bpr@best.com (Brian Rogoff) (1998-07-10)
Re: Alternatives to Regexps mav@naxos.esat.kuleuven.ac.be (Maurizio Vitale) (1998-07-10)
Re: Alternatives to Regexps lord@emf.emf.net (1998-07-11)
Re: Alternatives to Regexps bromage@cs.mu.OZ.AU (1998-07-11)
Re: Alternatives to Regexps cfc@world.std.com (Chris F Clark) (1998-07-13)
Re: Alternatives to Regexps torbenm@diku.dk (Torben Mogensen) (1998-07-13)
Re: Alternatives to Regexps thetick@magelang.com (Scott Stanchfield) (1998-07-20)
| List of all articles for this month |

From: Torben Mogensen <torbenm@diku.dk>
Newsgroups: comp.compilers
Date: 13 Jul 1998 23:48:33 -0400
Organization: Compilers Central
References: 98-07-057 98-07-075 98-07-107
Keywords: DFA, design

Andrew Bromage wrote:


>Torben Mogensen <torbenm@diku.dk> writes:
>
>>You can use context free grammars also at the lexical level. Sometimes
>>LR parsetables for grammars that actually represent regular languages
>>can be smaller than the DFA's generated from the regular expressions.
>
>Do you have any pointers to some stats on this?


Only an example: When writing a lexer for Scheme, the DFA generated
from regexps was huge, mainly due to the complexity of numerical
constants. An SLR table had a much smaller number of states. I don't
have the exact figures at hand.


>I'd also be interested to know if any good tools for (partially?)
>automating the process of writing such a lexer exist. I don't think
>that most off-the-shelf LR parser generators will cut it.


We used a homebrew SLR parser generator for this, but it actually
wasn't well suited for this, as it didn't have precedence declarations
(see below). I think yacc will work, though. AFAIR, yacc allows you to
write single-character terminals directly in the yacc source. Integers
less than 256 in the input represent these, wheras larger integers
represent composite lexical symbols. It has been a while since I used
yacc, though, so I can't guarantee that it will work.


>One thing that separates lexical analysis and syntax analysis in current
>mainstream practice (i.e. not necessarily desirable in a perfect world)
>is the amount of disambiguation that lexical analyser generators do for
>you.


>Lex resolves the overlap between the keywords and identifiers/type names
>by favoring rules which appear syntactically earlier in the file. This
>is generally what you want (arguably there should be a better mechanism
>for specifying relative rule priorities, but that's another topic).
>Typical lex specifications tend to do this often. For example, it may
>be more convenient to make the rules for a floating-point number overlap
>with those of an integer and simply put the integer rule at a higher
>priority than the floating-point rule.
>
>Doing a similar job using current parser generator tools is not as
>straightforward, since you generally don't want disambiguation to happen
>behind your back like this. If there is a conflict, it's probably a bug.


Ambiguities of this kind will usually lead to reduce/reduce conflicts
in LR parsers. Many LR parser generators allow you to prioritize rules
by adding operator precedence. Normally, this is not what quite you
need for lexical disambiguation, since you don't have any obvious
operators to fix these precedences to. However, most parser generators
with precendece declarations allow you to locally change the
precedence of a rule (often used for unary minus etc.), which should
do the trick.


Another disambiguation that lexers normally do is to scan for the
longest token. In an LR parser, this corresponds to a shift/reduce
conflict where you want to shift instead of reduce. This is the
default action taken by yacc etc. for such conflicts (though they will
usually issue warnings).


>One other option, writing an unambiguous regular expression/CFG for both
>non-keyword identifiers, is difficult. Even something as simple as a C
>inline comment is quite difficult to express in regular expression syntax
>(unless your regex language has set difference; an invaluable tool for
>regex-based scanner generators).


Indeed. To be really useful for lexical analysis, grammar notation
should be extended with a convenient way to represent sets of
characters using ranges, union and complement.


Torben Mogensen (torbenm@diku.dk)
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.