Related articles |
---|
Fast NFA engine anyone? rd3ntat0@hotmail.com (Remo D.) (2006-04-21) |
Re: Fast NFA engine anyone? rsc@swtch.com (Russ Cox) (2006-04-22) |
Re: Fast NFA engine anyone? rd3ntat0@hotmail.com (Remo D.) (2006-04-22) |
Re: Fast NFA engine anyone? cfc@shell01.TheWorld.com (Chris F Clark) (2006-04-23) |
Re: Fast NFA engine anyone? reeuwijk@few.vu.nl (2006-04-23) |
Re: Fast NFA engine anyone? Danny.Dube@ift.ulaval.ca (2006-05-11) |
Re: Fast NFA engine anyone? rd3ntat0@hotmail.com (Remo D.) (2006-05-12) |
From: | Danny.Dube@ift.ulaval.ca (=?iso-8859-1?q?Danny_Dub=E9?=) |
Newsgroups: | comp.compilers |
Date: | 11 May 2006 01:53:22 -0400 |
Organization: | Compilers Central |
References: | 06-04-121 06-04-129 |
Keywords: | lex |
Posted-Date: | 11 May 2006 01:53:22 EDT |
"Remo D." <rd3ntat0@hotmail.com> writes:
> [...]
> Well, consider that I'm interested in subpatterns and back references as
> well (sorry, I didn't mentioned that in my post).
>
> For example things like: ([_^]*)[A-G]([',]*) where I want to get the,
> possibly empty, sequences I matched at the beginning and at the end. Or
> things like: (["'])[A-Z]\1 where I want uppercase strings enclosed in
> single or double quotes.
> [...]
Maybe the technique described in:
Dubé D., Feeley M. (2000), "Efficiently building a parse tree from
a regular expression", Acta Informatica, volume 37, no. 2, pages
121-144.
will interest you. However, it takes care of subpatterns only, not of
back references.
In this technique, the idea of extracting subpatterns is taken to the
extreme. Essentially, all sub-expressions of a regular expression are
considered tagged for subpattern extraction. The subpatterns are not
extracted as substrings (nor coordinates inside) of the matching
lexeme but as unconventional parse trees. A parse tree describes how
a word is generated by a regular expression, much the same way as a
conventional parse tree describes how a word is generated by a
context-free grammar.
The shape of a parse tree for a word `w' matching a regular expression
`e' is mostly dictated by `e'. For instance:
- if `e = r1|r2|...|rN', then the tree looks like `#I:t', where `I'
indicates a choice among the `N' alternatives and `t', the way `w'
is generated by `rI';
- if `e = r1r2...rN', then the tree is a list of trees, `[t1,...,tN]',
where each `tI' indicates how `wI' is generated by `r', such that
`w1...wN = w';
- if `e = r*', then the tree is a list of trees, `[t1,...,tN]' where
each `tI' indicates how `wI' is generated by `r', such that `w1...wN
= w'.
Each kind of regular expression leads to specific shapes of parse
trees. The technique works with both NFA and DFA and exhibits linear
complexity in the length of lexemes (`e' affects only the hidden
constant). It is only theoretical work for now but we are working on
the integration into a lexical analyzer generator.
--
Danny Dubé
DannyDOTDube@iftDOTulavalDOTca
Return to the
comp.compilers page.
Search the
comp.compilers archives again.