Re: Generating strings from regular expressions

"Jonathan Barker" <ucapjab@ucl.ac.uk>
27 May 1999 23:23:37 -0400

          From comp.compilers

Related articles
Generating strings from regular expressions kgatlin@cs.ucsd.edu (KSG) (1999-05-24)
Re: Generating strings from regular expressions Helmut.Richter@lrz-muenchen.de (1999-05-27)
Re: Generating strings from regular expressions ucapjab@ucl.ac.uk (Jonathan Barker) (1999-05-27)
Re: Generating strings from regular expressions std10381@moritz.dial.techfak.uni-kiel.de (Carsten Fritz) (1999-05-27)
Re: Generating strings from regular expressions djello@well.com (1999-05-27)
| List of all articles for this month |

From: "Jonathan Barker" <ucapjab@ucl.ac.uk>
Newsgroups: comp.compilers,comp.theory
Date: 27 May 1999 23:23:37 -0400
Organization: Deja.com - Share what you know. Learn what you don't.
Distribution: inet
References: 99-05-119
Keywords: lex

    KSG <kgatlin@cs.ucsd.edu> wrote:
> Does anyone know of any code or an algorithm that does the following:
>
> input: a regular expression, R, and an integer k.
> output: a list of all strings of length k that are in R
>
> I've thought about building the NFA and traversing that, but I'm not
> sure if building the NFA is really necessary. Is there a simpler way?
>
> Thanks,
>
> KSG
> [Simpler? Sure. Enumerate all the strings of length k, run them through
> the RE matcher of your choice, and list the one that match. But I will
> admit that for large k it could be kind of slow. -John]


Surely it'll be kind of slow whatever - for example the number of
matches of (a|b)* of length k is obviously 2^k so we can't expect to
do much better than exponential time (in k).


I was intrigued by this problem - here is a solution based on the
syntax tree. I am not 100% sure that it is correct but it looks ok to
me. It may not be very efficient to implement since it contains a lot
of set manipulation and recursion.


I tried implementing the algorithm in C++. I you like I'll email you
a copy of my source but I warn you now, it is pretty poor (and
probably buggy) code :-)


Regards


Jonathan


DEFINITIONS AND NOTATION


Express regular expressions as given by the
following "grammar":


        R -> CAT(R,R) // contatentation
        R -> OR(R,R) // alternation
        R -> STAR(R) // closure
        R -> CHAR(a) // symbols from alphabet


where a is taken from the alphabet A. That is,
the input to the algorithm is a syntax tree.


If a is a character from the alphabet
then let STRING(a) be the string of length
1 whose only character is a.


Let EMPTY denote the empty string. Note that
concatenation with EMPTY leaves a string
unchanged.


Define the cross product operation on sets of
strings by:


        A * B := { ab | a in A and b in B }


where ab is the concatenation of a and b. If
A and B are sets of strings, then A*B denotes
the set of all strings formed by taking a string
from A and concatenating a string from B.


For notational convenience, whenever I write
the expression:


        Union[i=1,m] f(i)


I will mean:


f(1) union f(2) union ... union f(m)




THE ALGORITHM


Inputs:
R - a regular expression (as above)
k - a positive integer


Outputs:
S - a set of strings, each of length k.


The algorithm is defined recursively in pseudo-code
as follows:


function ENUM(R,k)


// There are four cases depending on
// the form of R:


// Case 1: R represents a leaf of the tree
// carrying the character a from the alphabet


if R = CHAR(a)
{
        if k = 1 then
S = {STRING(a)}
else
S = {}
}


// Case 2: R represents an alternation node
// with left node A and right node B


else if R = OR(A,B)
{
        S = Enum(A,k) union Enum(B,k)
}


// Case 3: R represents a concatenation node
// with left node A and right node B.


else if R = CAT(A,B)
{
        S = Union[i=0,k] Enum(A,i)*Enum(B,k-i)
}


// Case 4: R represets a "star" node
// with child node A


else if R = STAR(A)
{
        if k = 0 then
S = {EMPTY}
else
S = Union[i=1,k] Enum(A,i)*Enum(R,k-i)
}


// Now S contains the result (it may be empty).


return S
end


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.