RE->DFA with subexpression matching

"Remo D." <rdentato@news.tin.it>
Sun, 20 Jan 2008 12:19:45 +0100

          From comp.compilers

Related articles
Fast NFA engine anyone? rd3ntat0@hotmail.com (Remo D.) (2006-04-21)
RE->DFA with subexpression matching rdentato@news.tin.it (Remo D.) (2008-01-20)
| List of all articles for this month |

From: "Remo D." <rdentato@news.tin.it>
Newsgroups: comp.compilers
Date: Sun, 20 Jan 2008 12:19:45 +0100
Organization: TIN.IT (http://www.tin.it)
References: 06-04-121
Keywords: lex, DFA
Posted-Date: 20 Jan 2008 10:46:36 EST

Almost two years ago I asked for help on this group
(http://compilers.iecc.com/comparch/article/06-04-121) to solve the
following problem:


Given a set of regular expressions, create a DFA that would recognize
them and upon successful match would report which RE has been matched
and would capture any subexpression. Text is to be matched starting
from its first character.


What I wanted to achieve, in the end, is to create a tool like re2c
(http://re2c.sf.net) for embedding lexical scanners into C code.


Unfortunately at that time I couldn't learn all that I needed to solve
the problem but it really interested me so I continued working on it
in my, very little, spare time.


Starting from the advices got in this group, I've read a lot on the
subject, tried many alternatives and failed many times until I devised
an algorithm that, I think, could solve the problem.


It's based on tagged automata and I've described it here:
http://yrx.googlecode.com/files/yrxreg.pdf .


The description is rather informal and intuitive but I hope it's clear
enough. Should I not find any flaw in it, I'll refine the document to be
more formal and precise.


To check that what I was doing had any sense, I've implemented it in C
( http://yrx.googlecode.com/files/yrx_080120.zip ) with the most up to
date version stored here: http://yrx.googlecode.com/svn/trunk . The
current version takes a list of regexps and create a graph
representation for the DFA in dot format. AT&T GraphViz can be used to
produce the real graphics.


I would be very grateful if anyone would have a look at the description
(and at the code) and would comment on the algorithm.


I'd like to know if you find any irremediable flaw, if there are other
approaches that I should consider, etc.


For example, I suspect that this approach is equivalent to (and probably
less general than) the one described by Ville Laurikari
(http://laurikari.net/tre ) but I've not been able to prove it nor I
understood Ville's work enough to implement his algorithm.


Next steps for me would be to generate source code for the DFA.


I appreciate that a lot could be done on the theoretical side. For
example could tagged automata be extended to implement negation and
intersection?


Also Danny Dube' suggested a different approach that I now understand to
be based on creating the parse trees for the REs and consider that a
capture is a non-terminal of the grammar. Would that approach be better
(faster, more general, ...) than tagged automata?


Any comment and suggestion will be welcome!


                      Remo.D


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.