Re: Fast NFA engine anyone?

Danny.Dube@ift.ulaval.ca (=?iso-8859-1?q?Danny_Dub=E9?=)
11 May 2006 01:53:22 -0400

          From comp.compilers

Related articles
Fast NFA engine anyone? rd3ntat0@hotmail.com (Remo D.) (2006-04-21)
Re: Fast NFA engine anyone? rsc@swtch.com (Russ Cox) (2006-04-22)
Re: Fast NFA engine anyone? rd3ntat0@hotmail.com (Remo D.) (2006-04-22)
Re: Fast NFA engine anyone? cfc@shell01.TheWorld.com (Chris F Clark) (2006-04-23)
Re: Fast NFA engine anyone? reeuwijk@few.vu.nl (2006-04-23)
Re: Fast NFA engine anyone? Danny.Dube@ift.ulaval.ca (2006-05-11)
Re: Fast NFA engine anyone? rd3ntat0@hotmail.com (Remo D.) (2006-05-12)
| List of all articles for this month |

From: Danny.Dube@ift.ulaval.ca (=?iso-8859-1?q?Danny_Dub=E9?=)
Newsgroups: comp.compilers
Date: 11 May 2006 01:53:22 -0400
Organization: Compilers Central
References: 06-04-121 06-04-129
Keywords: lex
Posted-Date: 11 May 2006 01:53:22 EDT

"Remo D." <rd3ntat0@hotmail.com> writes:
> [...]
> Well, consider that I'm interested in subpatterns and back references as
> well (sorry, I didn't mentioned that in my post).
>
> For example things like: ([_^]*)[A-G]([',]*) where I want to get the,
> possibly empty, sequences I matched at the beginning and at the end. Or
> things like: (["'])[A-Z]\1 where I want uppercase strings enclosed in
> single or double quotes.
> [...]


Maybe the technique described in:


        Dubé D., Feeley M. (2000), "Efficiently building a parse tree from
        a regular expression", Acta Informatica, volume 37, no. 2, pages
        121-144.


will interest you. However, it takes care of subpatterns only, not of
back references.


In this technique, the idea of extracting subpatterns is taken to the
extreme. Essentially, all sub-expressions of a regular expression are
considered tagged for subpattern extraction. The subpatterns are not
extracted as substrings (nor coordinates inside) of the matching
lexeme but as unconventional parse trees. A parse tree describes how
a word is generated by a regular expression, much the same way as a
conventional parse tree describes how a word is generated by a
context-free grammar.


The shape of a parse tree for a word `w' matching a regular expression
`e' is mostly dictated by `e'. For instance:


- if `e = r1|r2|...|rN', then the tree looks like `#I:t', where `I'
    indicates a choice among the `N' alternatives and `t', the way `w'
    is generated by `rI';


- if `e = r1r2...rN', then the tree is a list of trees, `[t1,...,tN]',
    where each `tI' indicates how `wI' is generated by `r', such that
    `w1...wN = w';


- if `e = r*', then the tree is a list of trees, `[t1,...,tN]' where
    each `tI' indicates how `wI' is generated by `r', such that `w1...wN
    = w'.


Each kind of regular expression leads to specific shapes of parse
trees. The technique works with both NFA and DFA and exhibits linear
complexity in the length of lexemes (`e' affects only the hidden
constant). It is only theoretical work for now but we are working on
the integration into a lexical analyzer generator.


--
                Danny Dubé
                DannyDOTDube@iftDOTulavalDOTca


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.