27 May 1999 23:23:37 -0400

Related articles |
---|

Generating strings from regular expressions kgatlin@cs.ucsd.edu (KSG) (1999-05-24) |

Re: Generating strings from regular expressions Helmut.Richter@lrz-muenchen.de (1999-05-27) |

Re: Generating strings from regular expressions ucapjab@ucl.ac.uk (Jonathan Barker) (1999-05-27) |

Re: Generating strings from regular expressions std10381@moritz.dial.techfak.uni-kiel.de (Carsten Fritz) (1999-05-27) |

Re: Generating strings from regular expressions djello@well.com (1999-05-27) |

From: | "Jonathan Barker" <ucapjab@ucl.ac.uk> |

Newsgroups: | comp.compilers,comp.theory |

Date: | 27 May 1999 23:23:37 -0400 |

Organization: | Deja.com - Share what you know. Learn what you don't. |

Distribution: | inet |

References: | 99-05-119 |

Keywords: | lex |

KSG <kgatlin@cs.ucsd.edu> wrote:

*> Does anyone know of any code or an algorithm that does the following:*

*>*

*> input: a regular expression, R, and an integer k.*

*> output: a list of all strings of length k that are in R*

*>*

*> I've thought about building the NFA and traversing that, but I'm not*

*> sure if building the NFA is really necessary. Is there a simpler way?*

*>*

*> Thanks,*

*>*

*> KSG*

*> [Simpler? Sure. Enumerate all the strings of length k, run them through*

*> the RE matcher of your choice, and list the one that match. But I will*

*> admit that for large k it could be kind of slow. -John]*

Surely it'll be kind of slow whatever - for example the number of

matches of (a|b)* of length k is obviously 2^k so we can't expect to

do much better than exponential time (in k).

I was intrigued by this problem - here is a solution based on the

syntax tree. I am not 100% sure that it is correct but it looks ok to

me. It may not be very efficient to implement since it contains a lot

of set manipulation and recursion.

I tried implementing the algorithm in C++. I you like I'll email you

a copy of my source but I warn you now, it is pretty poor (and

probably buggy) code :-)

Regards

Jonathan

DEFINITIONS AND NOTATION

Express regular expressions as given by the

following "grammar":

R -> CAT(R,R) // contatentation

R -> OR(R,R) // alternation

R -> STAR(R) // closure

R -> CHAR(a) // symbols from alphabet

where a is taken from the alphabet A. That is,

the input to the algorithm is a syntax tree.

If a is a character from the alphabet

then let STRING(a) be the string of length

1 whose only character is a.

Let EMPTY denote the empty string. Note that

concatenation with EMPTY leaves a string

unchanged.

Define the cross product operation on sets of

strings by:

A * B := { ab | a in A and b in B }

where ab is the concatenation of a and b. If

A and B are sets of strings, then A*B denotes

the set of all strings formed by taking a string

from A and concatenating a string from B.

For notational convenience, whenever I write

the expression:

Union[i=1,m] f(i)

I will mean:

f(1) union f(2) union ... union f(m)

THE ALGORITHM

Inputs:

R - a regular expression (as above)

k - a positive integer

Outputs:

S - a set of strings, each of length k.

The algorithm is defined recursively in pseudo-code

as follows:

function ENUM(R,k)

// There are four cases depending on

// the form of R:

// Case 1: R represents a leaf of the tree

// carrying the character a from the alphabet

if R = CHAR(a)

{

if k = 1 then

S = {STRING(a)}

else

S = {}

*}*

// Case 2: R represents an alternation node

// with left node A and right node B

else if R = OR(A,B)

{

S = Enum(A,k) union Enum(B,k)

*}*

// Case 3: R represents a concatenation node

// with left node A and right node B.

else if R = CAT(A,B)

{

S = Union[i=0,k] Enum(A,i)*Enum(B,k-i)

*}*

// Case 4: R represets a "star" node

// with child node A

else if R = STAR(A)

{

if k = 0 then

S = {EMPTY}

else

S = Union[i=1,k] Enum(A,i)*Enum(R,k-i)

*}*

// Now S contains the result (it may be empty).

return S

end

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.