Re: additional regular expression operators

torbenm@pc-003.diku.dk (Torben =?iso-8859-1?Q?=C6gidius?= Mogensen)
Fri, 17 Apr 2009 11:05:31 +0200

          From comp.compilers

Related articles
[4 earlier articles]
Re: additional regular expression operators torbenm@pc-003.diku.dk (2009-03-31)
Re: additional regular expression operators rpboland@gmail.com (Ralph Boland) (2009-03-31)
Re: additional regular expression operators torbenm@pc-003.diku.dk (2009-04-14)
Re: additional regular expression operators zayenz@gmail.com (MZL) (2009-04-15)
Re: additional regular expression operators anton@mips.complang.tuwien.ac.at (2009-04-16)
Re: additional regular expression operators gah@ugcs.caltech.edu (glen herrmannsfeldt) (2009-04-16)
Re: additional regular expression operators torbenm@pc-003.diku.dk (2009-04-17)
Re: additional regular expression operators mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2009-04-17)
| List of all articles for this month |

From: torbenm@pc-003.diku.dk (Torben =?iso-8859-1?Q?=C6gidius?= Mogensen)
Newsgroups: comp.compilers
Date: Fri, 17 Apr 2009 11:05:31 +0200
Organization: Department of Computer Science, University of Copenhagen
References: 09-03-111 09-03-123 09-04-001
Keywords: lex
Posted-Date: 17 Apr 2009 06:12:28 EDT

Ralph Boland <rpboland@gmail.com> writes:




> I have never found a useful example of set intersection though.


In lexers for programming languages, such examples are, indeed, rare
-- as are any nontrivial regular expression. The most complex are
usually for floating-point numbers, comments and strings.


But you also use regular languages to specify properties for state
machines, and if you want to check that two properties are both
fulfilled, you can do so by checking with the intersection property
instead of once for each property.


> I have found so far only one paper on implementing these operators
> and it is complex. I will take a look at derivative based algorithms.


Here is a short introduction:


We define three functions of regular expressions:


Nullable(r) is true if r accepts the empty string, false otherwise.


First(r) is the set of characters that can start strings in L(r).


Next(r,c) is a regular expression s that accepts a string w if and
only if r accepts the string cw.


We can define these structurally on regular expressions in this way:


Nullable(\epsilon) = true
Nullable('a') = false
Nullable(r|s) = Nullable(r) or Nullable(s)
Nullable(rs) = Nullable(r) and Nullable(s)
Nullable(r*) = true
Nullable(r+) = Nullable(r)
Nullable(r&s) = Nullable(r) and Nullable(s)


where r&s is the intersection of r and s.


First(\epsilon) = {}
First('a') = {'a'}
First(r|s) = First(r) U First(s)
First(rs) = First(r) U First(s), if Nullable(r)
First(rs) = First(r), if not Nullable(r)
First(r*) = First(r)
First(r+) = First(r)
First(r&s) = First(r) \cap First(s)


where \cap is the set intersection operator.


Next(\epsilon, c) = \phi
Next('a', 'a') = \epsilon
Next('a', c) = \phi, if c != 'a'
Next(r|s, c) = Next(r, c)| Next(s, c)
Next(rs, c) = Next(r, c)s | Next(s, c), if Nullable(r)
Next(rs, c) = Next(r, c)s, if not Nullable(r)
Next(r*, c) = Next(r, c)r*
Next(r+, c) = Next(r, c)r*
Next(r&s, c) = Next(r, c) & Next(s, c)


where \phi is the regular expression that accepts no strings. We can
use the following reduction rules:


\phi | r = r
r | \phi = r
\phi r = \phi
r \phi = \phi
(\phi)* = \epsilon
(\phi)+ = \phi
\phi & r = \phi
r & \phi = \phi


We can now build a DFA from a regular expression by marking satets
with regular expressions:


  - The initial state is marked with r


  - There are outgoing edges from r on all symbols c in First(r)


  - The transition from r on c is to Next(r,c)


  - A state marked r is acception if Nullable(r)


To avoid getting an infinite DFA, you need to reduce regular
expressions so you won't get an infinite number of equivalent regular
expressions. The more reduction rules you use, the smaller your
resulting DFA will be. A good idea is to define some (arbitrary)
total order on regular expressions and make sure r<s in r|s and r&s.
If r=s, both can be reduced to r. Some useful rules are:


(r|s)|t = r(s|t)
(rs)t = r(st)
\epsilon r = r
r \epsilon = r
\epsilon* = \epsilon
r(r*) = r+
r(r+) = r+
r+ | \epsilon = r*
r* | \epsilon = r*


Torben



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.