31 Jan 2006 21:19:27 -0500

Related articles |
---|

NFA -> DFA on-the-fly determinization rsc@swtch.com (Russ Cox) (2006-01-31) |

Re: NFA -> DFA on-the-fly determinization d148f3wg02@sneakemail.com (Karsten Nyblad) (2006-02-02) |

Re: NFA -> DFA on-the-fly determinization rsc@swtch.com (Russ Cox) (2006-02-03) |

From: | "Russ Cox" <rsc@swtch.com> |

Newsgroups: | comp.compilers |

Date: | 31 Jan 2006 21:19:27 -0500 |

Organization: | Compilers Central |

Keywords: | lex, question |

I am trying to find a reference for the technique of converting an NFA

to a DFA as needed during NFA execution and caching the result to

avoid repeating the conversion at each step.

Thompson's 7094 implementation represents each DFA state as a list of

NFA states (themselves function pointers) and computes the next state

afresh on each transition.

As a simple example, suppose you have an NFA for matching "a..". The

NFA has four states [] [a] [a.] and [a..] (I am using [] to mean the

state corresponding to reading the given string). The DFA has 15 or

so states: [] [a] [!a] [aa] [a!a] [!aa] [!a!a] [aaa] [aa!a] ...

(here, !a means a character that is not a). Each DFA state is a list

of NFA states: [aa!a] = ([], [a.], [a..]), [a!aa] = ([], [a], [a..]),

and so on.

If Thompson's algorithm is in the [aa!a] state and reads an a, it

executes the first list to produce the second list. The next time it

is in the [aa!a] state and reads an a, it will do this again. It does

not learn from past work. This is entirely reasonable given the

memory constraints at the time and the potentially large size.

It is a simple optimization to remember each state in some kind of

hash table and re-use previously-computed transitions. Assuming the

input only exercises a small part of the DFA, after you've primed the

DFA, each input character's processing is a single memory access.

In fact, this is the approach taken by Thompson's current grep, which

I posted a link to last month, but I expect that the fact that

NFA->DFA can be done incrementally has been well known for quite some

time.

With apologies for the long-winded explanation (I wanted it to be

clear why I wasn't asking for Thompson's 1968 paper), does anyone know

of any references to the original papers that pointed this out?

Thanks.

Russ

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.