Magic (was Re: ^/$ in lexical analyzers)

goer@midway.uchicago.edu (Richard L. Goerwitz)
Wed, 30 Jun 1993 13:45:13 GMT

          From comp.compilers

Related articles
^/$ in lexical analyzers goer@midway.uchicago.edu (1993-06-27)
Re: ^/$ in lexical analyzers lord+@andrew.cmu.edu (Tom Lord) (1993-06-28)
Magic (was Re: ^/$ in lexical analyzers) goer@midway.uchicago.edu (1993-06-30)
| List of all articles for this month |
Newsgroups: comp.compilers
From: goer@midway.uchicago.edu (Richard L. Goerwitz)
Keywords: DFA, lex
Organization: University of Chicago
References: 93-06-071 93-06-075
Date: Wed, 30 Jun 1993 13:45:13 GMT

> [How are ^ and $ handled in regexp matchers?]
>
>There is a subtlety to this. In the dfa conversion, all the possible
>paths through the regexp nfa are explored in parallel -- breadth first
>fasion. But when characters like '^' and '$' are added it becomes
>possible that some paths will require checks for `^' while others don't.


The problems become particularly severe when you get a pattern like
^ab|ac. If you treat ^ as a special character of some kind (but a char-
acter nonetheless), then the DFA will have transitions out of the start
state on ^ and a. If we follow ^, though, on a pattern like "ac," we
will have to backtrack. Poof. There goes determinism.


There is a way to maintain the determinism. Just check, when creating
the DFA's start state (e-closure(NFA.start)), whether there is a tran-
sition on ^. If there is, then save the just-generated start state, and
then create another start state where ^ is equivalent to an epsilon
transition. I.e. take the e/^-closure of the NFA's start state. Use
this automaton when at the beginning of a line, and use the old "canoni-
cal" DFA start state when the start position is past the line's begin
point. This is very easy to implement, and doesn't require prepending
".*" to any portion of the regexp. It also doesn't require insertion of
any dummy newlines. The extra cost is that the canonical DFA start
state has to be constructed, even if it is not used. A single extra
state isn't too bad, especially since one could generate the new start
state for the beginning of the line by simply doing a union of ^-closure(
NFA.start) with the canonical start state.


>Watch for GNU rx, a regexp library based on this approach, to be released
>quite soon.


This will be delightful. I'm a philologist by trade, and I find the stuff
you computer people do to be just short of magic :-).
--
      -Richard L. Goerwitz goer%midway@uchicago.bitnet
      goer@midway.uchicago.edu rutgers!oddjob!ellis!goer
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.