How to rewrite a regexp without word boundaries?

dave_140390@hotmail.com
Sun, 5 Jul 2009 02:02:08 -0700 (PDT)

          From comp.compilers

Related articles
How to rewrite a regexp without word boundaries? dave_140390@hotmail.com (2009-07-05)
Re: How to rewrite a regexp without word boundaries? h.b.furuseth@usit.uio.no (Hallvard B Furuseth) (2009-07-05)
Re: How to rewrite a regexp without word boundaries? dave_140390@hotmail.com (2009-07-06)
Re: How to rewrite a regexp without word boundaries? haberg_20080406@math.su.se (Hans Aberg) (2009-07-07)
Re: How to rewrite a regexp without word boundaries? h.b.furuseth@usit.uio.no (Hallvard B Furuseth) (2009-07-07)
Re: How to rewrite a regexp without word boundaries? andrew@tomazos.com (Andrew Tomazos) (2009-07-07)
Re: How to rewrite a regexp without word boundaries? cfc@shell01.TheWorld.com (Chris F Clark) (2009-07-13)
[2 later articles]
| List of all articles for this month |

From: dave_140390@hotmail.com
Newsgroups: comp.compilers,comp.theory
Date: Sun, 5 Jul 2009 02:02:08 -0700 (PDT)
Organization: Compilers Central
Keywords: lex, theory, question
Posted-Date: 05 Jul 2009 09:29:57 EDT

Hi,


I have been wondering, with limited success, how to rewrite a regexp
without word boundaries.


According to http://www.perl.com/doc/manual/html/pod/perlre.html:


        <from Perl manual>
        In addition, Perl defines the following:
                \w Match a "word" character (alphanumeric plus "_")
                \W Match a non-word character
        [...]
        Perl defines the following zero-width assertions:
                \b Match a word boundary
        [...]
        A word boundary (\b) is defined as a spot between two characters
that has a \w on one side of it and a \W on the other side of it (in
either order), counting the imaginary characters off the beginning and
end of the string as matching a \W.
        </from Perl manual>


Thus, regexp "ex" matches "example" and "text", whereas regexp "\bex"
matches "example" but not "text".


Now, if "\b" occurs at the beginning (or end) of the regexp, I think
it's easy to rewrite the regexp without using "\b". For example,
"\bex" could be rewritten as "\Wex".


But what if "\b" occurs within the regexp? For example, how to get rid
of "\b" in "<RE>\bex" (with "<RE>" being any regexp)? "<RE>\Wex"
wouldn't work here: for example (with "<RE>" = "\W"), "\W\Wex" is not
equivalent to "\W\bex".


So, is there an algorithm that transforms any regexp into an
equivalent regexp that contains no "\b"?


NOTE: In this message, by "regexp" I mean regular expressions in their
traditional meaning (= empty set, empty string, single symbol, union,
concatenation, Kleene star, and nothing else).


-- dave



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.