Re: ^/$ in lexical analyzers

Tom Lord <lord+@andrew.cmu.edu>
Mon, 28 Jun 1993 16:43:13 GMT

          From comp.compilers

Related articles
^/$ in lexical analyzers goer@midway.uchicago.edu (1993-06-27)
Re: ^/$ in lexical analyzers lord+@andrew.cmu.edu (Tom Lord) (1993-06-28)
Magic (was Re: ^/$ in lexical analyzers) goer@midway.uchicago.edu (1993-06-30)
Re: ^/$ in lexical analyzers mike@ichips.intel.com (1993-06-30)
| List of all articles for this month |

Newsgroups: comp.compilers
From: Tom Lord <lord+@andrew.cmu.edu>
Keywords: DFA, lex
Organization: School of Computer Science, Carnegie Mellon, Pittsburgh, PA
References: 93-06-071
Date: Mon, 28 Jun 1993 16:43:13 GMT

            [How are ^ and $ handled in regexp matchers?]


Some matchers compile expressions into a byte-coded program. The program
is a straightforward backtracking search through the pattern space for
possible matches. In those matchers, `^' and `$' get special bytecodes
and their action is to look at the matcher context and either fail or move
on to the next byte-code.


Here is another way:


A simple, truly regular, regular expression can be compiled into an NFA
who's edges are labeled with characters (or epsilon). Using well-known
magic, such an nfa can be converted to a dfa. For on-line applications,
it is best to do the dfa conversion lazily (on-demand).


So, you wind up with a matcher that is a simple loop following the edges
described in a dfa transition table. Some table entries might not have
been constructed yet, so you need an exception mechanism to detect an
incomplete table entry in order to be able to bail to the more complicated
code (i.e., the code that builds some of the dfa).


We can expand this model by letting edges of the regexp nfa be labled with
things besides characters; for example, `^' and `$'. In the dfa, we use
the same exception mechanism that drives lazy-dfa conversion to detect
edge crossings that require special actions. An example of a special
action is to look at the context of the match to see if the conditions
required by `^' are satisfied. So, the matcher looks similar to the basic
lazy-dfa algorithm, but has more exceptional cases than just
dfa-cache-misses.


There is a subtlety to this. In the dfa conversion, all the possible
paths through the regexp nfa are explored in parallel -- breadth first
fasion. But when characters like '^' and '$' are added it becomes
possible that some paths will require checks for `^' while others don't.
The short summary of the solution is that we won't convert to a dfa at all
but rather to a second nfa; one that is as deterministic as we can make it
while still leaving in necessary nondeterminism for things like '^'.


(If the only weird meta characters were `^' and `$' we might be able to
fudge a bit and still build a dfa. But other common meta characters make
it harder to do anything but move to an nfa, for example: `(' and `)' when
the matcher has to report the position and extent of the subexpression
they match.)


Watch for GNU rx, a regexp library based on this approach, to be released
quite soon.


-t
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.