Re: regular expression search algorithm

pardo@cs.washington.edu (David Keppel)
Fri, 19 Mar 1993 19:37:14 GMT

          From comp.compilers

Related articles
Regular expression search algorithm harvinder.singh@patni.com (Harry) (2002-07-15)
Re: Regular expression search algorithm joachim_d@gmx.de (Joachim Durchholz) (2002-07-21)
Re: Regular expression search algorithm ralph@inputplus.co.uk (Ralph Corderoy) (2002-07-21)
Re: Regular expression search algorithm peter.tillier@btinternet.com (Peter S Tillier) (2002-07-24)
Re: Regular expression search algorithm arnold@skeeve.com (Aharon Robbins) (2002-09-03)
regular expression search algorithm forsyth@minster.york.ac.uk (1993-03-18)
Re: regular expression search algorithm pardo@cs.washington.edu (1993-03-19)
| List of all articles for this month |

Newsgroups: comp.compilers
From: pardo@cs.washington.edu (David Keppel)
Keywords: lex, DFA
Organization: Computer Science & Engineering, U. of Washington, Seattle
References: 93-03-063
Date: Fri, 19 Mar 1993 19:37:14 GMT

forsyth@minster.york.ac.uk writes:
>[Ken Thompson's DFA RE search algorithm.]


I found the paper rather cryptic, and, unfortunately, I've deleted both my
writeup and my reimplementation. But since it is a bit tricky, maybe I
can shed some light on what it did (and maybe this will either convince
you that you have to read the paper or that you never want to read it!).


The RE is translated in to code blocks that are used to implement a
threaded interpreter. For example, a RE a(a|b)* needs two code blocks.
One code block {1} that matches the input against `a' and, if equal,
pushes the address of block {2} on to a threaded interpreter list. The
code block {2} matches against `(a|b)' and pushes the address of {2}, if
either `a' or `b' is found.


The matcher is initialized by pushing the address of {1} on the "curr"
list. During execution, the matcher steps through the "curr" list,
executing the blocks, where each block may push zero or more state
addresses on the "next" list. When the "curr" list is exhausted, the
"next" list is moved to the "curr" list, {1} is appended, the next input
character is read, and the matcher continues. Execution (matching) halts
when if (a) both the "curr" and "next" lists are empty; (b) a final state
is found or (c) if the end-of-list code detects end-of-input when it reads
a character.


The lists are stored as FIFO queues and the last element in a queue is the
address of a block that handles the empty queue case. Thus, there is no
explicit test for end-of-queue in the code blocks themselves.


In Thompson's engine, the `threads' are not actually the addresses of
blocks, but instead are jump instructions. Thus, each block jumps to an
instruction in the thread and the thread jumps to the next block. A
register is dedicated to point in to the thread. The block ends with a
jump-indirect-postincrement that transfers control to an instruction in
the thread and updates the register to point at the next jump instruction
in the thread. The thread contains a jump instruction that transfers
control to the next block. Each block pushes state on the "next" list by
pushing a jump immediate instruction.


;-D on ( The matcher in the why ) Pardo
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.