Related articles |
---|
Generating strings from regular expressions kgatlin@cs.ucsd.edu (KSG) (1999-05-24) |
Re: Generating strings from regular expressions Helmut.Richter@lrz-muenchen.de (1999-05-27) |
Re: Generating strings from regular expressions ucapjab@ucl.ac.uk (Jonathan Barker) (1999-05-27) |
Re: Generating strings from regular expressions std10381@moritz.dial.techfak.uni-kiel.de (Carsten Fritz) (1999-05-27) |
Re: Generating strings from regular expressions djello@well.com (1999-05-27) |
From: | "Jonathan Barker" <ucapjab@ucl.ac.uk> |
Newsgroups: | comp.compilers,comp.theory |
Date: | 27 May 1999 23:23:37 -0400 |
Organization: | Deja.com - Share what you know. Learn what you don't. |
Distribution: | inet |
References: | 99-05-119 |
Keywords: | lex |
KSG <kgatlin@cs.ucsd.edu> wrote:
> Does anyone know of any code or an algorithm that does the following:
>
> input: a regular expression, R, and an integer k.
> output: a list of all strings of length k that are in R
>
> I've thought about building the NFA and traversing that, but I'm not
> sure if building the NFA is really necessary. Is there a simpler way?
>
> Thanks,
>
> KSG
> [Simpler? Sure. Enumerate all the strings of length k, run them through
> the RE matcher of your choice, and list the one that match. But I will
> admit that for large k it could be kind of slow. -John]
Surely it'll be kind of slow whatever - for example the number of
matches of (a|b)* of length k is obviously 2^k so we can't expect to
do much better than exponential time (in k).
I was intrigued by this problem - here is a solution based on the
syntax tree. I am not 100% sure that it is correct but it looks ok to
me. It may not be very efficient to implement since it contains a lot
of set manipulation and recursion.
I tried implementing the algorithm in C++. I you like I'll email you
a copy of my source but I warn you now, it is pretty poor (and
probably buggy) code :-)
Regards
Jonathan
DEFINITIONS AND NOTATION
Express regular expressions as given by the
following "grammar":
R -> CAT(R,R) // contatentation
R -> OR(R,R) // alternation
R -> STAR(R) // closure
R -> CHAR(a) // symbols from alphabet
where a is taken from the alphabet A. That is,
the input to the algorithm is a syntax tree.
If a is a character from the alphabet
then let STRING(a) be the string of length
1 whose only character is a.
Let EMPTY denote the empty string. Note that
concatenation with EMPTY leaves a string
unchanged.
Define the cross product operation on sets of
strings by:
A * B := { ab | a in A and b in B }
where ab is the concatenation of a and b. If
A and B are sets of strings, then A*B denotes
the set of all strings formed by taking a string
from A and concatenating a string from B.
For notational convenience, whenever I write
the expression:
Union[i=1,m] f(i)
I will mean:
f(1) union f(2) union ... union f(m)
THE ALGORITHM
Inputs:
R - a regular expression (as above)
k - a positive integer
Outputs:
S - a set of strings, each of length k.
The algorithm is defined recursively in pseudo-code
as follows:
function ENUM(R,k)
// There are four cases depending on
// the form of R:
// Case 1: R represents a leaf of the tree
// carrying the character a from the alphabet
if R = CHAR(a)
{
if k = 1 then
S = {STRING(a)}
else
S = {}
}
// Case 2: R represents an alternation node
// with left node A and right node B
else if R = OR(A,B)
{
S = Enum(A,k) union Enum(B,k)
}
// Case 3: R represents a concatenation node
// with left node A and right node B.
else if R = CAT(A,B)
{
S = Union[i=0,k] Enum(A,i)*Enum(B,k-i)
}
// Case 4: R represets a "star" node
// with child node A
else if R = STAR(A)
{
if k = 0 then
S = {EMPTY}
else
S = Union[i=1,k] Enum(A,i)*Enum(R,k-i)
}
// Now S contains the result (it may be empty).
return S
end
Return to the
comp.compilers page.
Search the
comp.compilers archives again.