11 May 2006 01:53:22 -0400

Related articles |
---|

Fast NFA engine anyone? rd3ntat0@hotmail.com (Remo D.) (2006-04-21) |

Re: Fast NFA engine anyone? rsc@swtch.com (Russ Cox) (2006-04-22) |

Re: Fast NFA engine anyone? rd3ntat0@hotmail.com (Remo D.) (2006-04-22) |

Re: Fast NFA engine anyone? cfc@shell01.TheWorld.com (Chris F Clark) (2006-04-23) |

Re: Fast NFA engine anyone? reeuwijk@few.vu.nl (2006-04-23) |

Re: Fast NFA engine anyone? Danny.Dube@ift.ulaval.ca (2006-05-11) |

Re: Fast NFA engine anyone? rd3ntat0@hotmail.com (Remo D.) (2006-05-12) |

From: | Danny.Dube@ift.ulaval.ca (=?iso-8859-1?q?Danny_Dub=E9?=) |

Newsgroups: | comp.compilers |

Date: | 11 May 2006 01:53:22 -0400 |

Organization: | Compilers Central |

References: | 06-04-121 06-04-129 |

Keywords: | lex |

"Remo D." <rd3ntat0@hotmail.com> writes:

*> [...]*

*> Well, consider that I'm interested in subpatterns and back references as*

*> well (sorry, I didn't mentioned that in my post).*

*>*

*> For example things like: ([_^]*)[A-G]([',]*) where I want to get the,*

*> possibly empty, sequences I matched at the beginning and at the end. Or*

*> things like: (["'])[A-Z]\1 where I want uppercase strings enclosed in*

*> single or double quotes.*

*> [...]*

Maybe the technique described in:

Dubé D., Feeley M. (2000), "Efficiently building a parse tree from

a regular expression", Acta Informatica, volume 37, no. 2, pages

121-144.

will interest you. However, it takes care of subpatterns only, not of

back references.

In this technique, the idea of extracting subpatterns is taken to the

extreme. Essentially, all sub-expressions of a regular expression are

considered tagged for subpattern extraction. The subpatterns are not

extracted as substrings (nor coordinates inside) of the matching

lexeme but as unconventional parse trees. A parse tree describes how

a word is generated by a regular expression, much the same way as a

conventional parse tree describes how a word is generated by a

context-free grammar.

The shape of a parse tree for a word `w' matching a regular expression

`e' is mostly dictated by `e'. For instance:

- if `e = r1|r2|...|rN', then the tree looks like `#I:t', where `I'

indicates a choice among the `N' alternatives and `t', the way `w'

is generated by `rI';

- if `e = r1r2...rN', then the tree is a list of trees, `[t1,...,tN]',

where each `tI' indicates how `wI' is generated by `r', such that

`w1...wN = w';

- if `e = r*', then the tree is a list of trees, `[t1,...,tN]' where

each `tI' indicates how `wI' is generated by `r', such that `w1...wN

= w'.

Each kind of regular expression leads to specific shapes of parse

trees. The technique works with both NFA and DFA and exhibits linear

complexity in the length of lexemes (`e' affects only the hidden

constant). It is only theoretical work for now but we are working on

the integration into a lexical analyzer generator.

--

Danny Dubé

DannyDOTDube@iftDOTulavalDOTca

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.