Re: Generating strings from regular expressions

"Jonathan Barker" <>
27 May 1999 23:23:37 -0400

          From comp.compilers

Related articles
Generating strings from regular expressions (KSG) (1999-05-24)
Re: Generating strings from regular expressions (1999-05-27)
Re: Generating strings from regular expressions (Jonathan Barker) (1999-05-27)
Re: Generating strings from regular expressions (Carsten Fritz) (1999-05-27)
Re: Generating strings from regular expressions (1999-05-27)
| List of all articles for this month |

From: "Jonathan Barker" <>
Newsgroups: comp.compilers,comp.theory
Date: 27 May 1999 23:23:37 -0400
Organization: - Share what you know. Learn what you don't.
Distribution: inet
References: 99-05-119
Keywords: lex

    KSG <> wrote:
> Does anyone know of any code or an algorithm that does the following:
> input: a regular expression, R, and an integer k.
> output: a list of all strings of length k that are in R
> I've thought about building the NFA and traversing that, but I'm not
> sure if building the NFA is really necessary. Is there a simpler way?
> Thanks,
> [Simpler? Sure. Enumerate all the strings of length k, run them through
> the RE matcher of your choice, and list the one that match. But I will
> admit that for large k it could be kind of slow. -John]

Surely it'll be kind of slow whatever - for example the number of
matches of (a|b)* of length k is obviously 2^k so we can't expect to
do much better than exponential time (in k).

I was intrigued by this problem - here is a solution based on the
syntax tree. I am not 100% sure that it is correct but it looks ok to
me. It may not be very efficient to implement since it contains a lot
of set manipulation and recursion.

I tried implementing the algorithm in C++. I you like I'll email you
a copy of my source but I warn you now, it is pretty poor (and
probably buggy) code :-)




Express regular expressions as given by the
following "grammar":

        R -> CAT(R,R) // contatentation
        R -> OR(R,R) // alternation
        R -> STAR(R) // closure
        R -> CHAR(a) // symbols from alphabet

where a is taken from the alphabet A. That is,
the input to the algorithm is a syntax tree.

If a is a character from the alphabet
then let STRING(a) be the string of length
1 whose only character is a.

Let EMPTY denote the empty string. Note that
concatenation with EMPTY leaves a string

Define the cross product operation on sets of
strings by:

        A * B := { ab | a in A and b in B }

where ab is the concatenation of a and b. If
A and B are sets of strings, then A*B denotes
the set of all strings formed by taking a string
from A and concatenating a string from B.

For notational convenience, whenever I write
the expression:

        Union[i=1,m] f(i)

I will mean:

f(1) union f(2) union ... union f(m)


R - a regular expression (as above)
k - a positive integer

S - a set of strings, each of length k.

The algorithm is defined recursively in pseudo-code
as follows:

function ENUM(R,k)

// There are four cases depending on
// the form of R:

// Case 1: R represents a leaf of the tree
// carrying the character a from the alphabet

if R = CHAR(a)
        if k = 1 then
S = {STRING(a)}
S = {}

// Case 2: R represents an alternation node
// with left node A and right node B

else if R = OR(A,B)
        S = Enum(A,k) union Enum(B,k)

// Case 3: R represents a concatenation node
// with left node A and right node B.

else if R = CAT(A,B)
        S = Union[i=0,k] Enum(A,i)*Enum(B,k-i)

// Case 4: R represets a "star" node
// with child node A

else if R = STAR(A)
        if k = 0 then
S = Union[i=1,k] Enum(A,i)*Enum(R,k-i)

// Now S contains the result (it may be empty).

return S

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.