Re: ^/$ in lexical analyzers

mike@ichips.intel.com
Wed, 30 Jun 1993 18:27:43 GMT

          From comp.compilers

Related articles
^/$ in lexical analyzers goer@midway.uchicago.edu (1993-06-27)
Re: ^/$ in lexical analyzers lord+@andrew.cmu.edu (Tom Lord) (1993-06-28)
Re: ^/$ in lexical analyzers mike@ichips.intel.com (1993-06-30)
| List of all articles for this month |

Newsgroups: comp.compilers
From: mike@ichips.intel.com
Keywords: DFA, lex
Organization: Compilers Central
References: 93-06-071 93-06-075
Date: Wed, 30 Jun 1993 18:27:43 GMT

Tom Lord wrote:
>There is a subtlety to this. In the dfa conversion, all the possible
>paths through the regexp nfa are explored in parallel -- breadth first
>fasion. But when characters like '^' and '$' are added it becomes
>possible that some paths will require checks for `^' while others don't.
>The short summary of the solution is that we won't convert to a dfa at all
>but rather to a second nfa; one that is as deterministic as we can make it
>while still leaving in necessary nondeterminism for things like '^'.


There is no need to go to the hybrid automaton if your only problem is ^
and $ and the like.


GNU egrep uses a pure dfa in this case. Here's how:


It uses the parse-tree-to-dfa conversion (roughly) in the Red Dragon book.
However, for each "position" on the parse tree, it associates a few bits
of "allowable context" information.


Forexample, the parse tree for the regexp "^x" would look like:


CAT
/ \
              ^ x


Then it builds "follow" sets in the usual way, treating "^" as an ordinary
character. This results in an NFA without epsilon transitions. Next, it
performs a pseudo epsilon closure on this NFA. At this point, constructs
like ^, $, \< and \> (and some other GNU-specific things) get treated as
epsilon transitions, that propogate some constraints forward to their
"follow" sets.


The result is a new NFA without epsilon transitions, together with
contraints on when positions in that NFA are allowed to match. In this
case, the constraint associated with the "x" position would be that it
must be at the beginning of the string, or the previous character must be
a newline.


Finally, it uses the canonical NFA-to-DFA subset construction, modified as
follows:


Each state is labeled with the kind of incoming transition that led to it.
For example, whether or not the state was reached by matching a newline.


The contents of the state (a set of positions it can match) are built in
the ordinary way. However, before computing the outgoing transitions the
positions whose constraint fail to match the state's context are
eliminated. So, in the above example the position "x" would be eliminated
from any state not labeled as having been reached by matching a newline.


In GNU egrep, there are only three flavors of states: those reached on a
newline, those reached on an alphabetic, and those reached on a
nonalphabetic. Consequently, the introduction of ^, $, \<, and \> into a
regexp can at most triple the size of the DFA. There is no penalty at all
if these constructs are not used. In this case the GNU egrep DFA is
identical to the ordinary Red Dragon book DFA.
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.