Re: Create nfa from a regular expression

Pascal Bourguignon <pjb@informatimago.com>
7 Oct 2006 02:23:26 -0400

          From comp.compilers

Related articles
Create nfa from a regular expression felix.dorner@gmail.com (Felix Dorner) (2006-10-06)
Re: Create nfa from a regular expression pjb@informatimago.com (Pascal Bourguignon) (2006-10-07)
| List of all articles for this month |

From: Pascal Bourguignon <pjb@informatimago.com>
Newsgroups: comp.compilers
Date: 7 Oct 2006 02:23:26 -0400
Organization: Informatimago
References: 06-10-025
Keywords: lex
Posted-Date: 07 Oct 2006 02:23:26 EDT

"Felix Dorner" <felix.dorner@gmail.com> writes:
> I Tried To Write A Simple Compiler That Creates A Nondeterministic
> Finite Automaton From A Regular Expression, However I Am Really Not
> Sure My Solution Is The Way To Go, Most Because I Cannot Figure Out How
> To Introduce Operator Precedence. So the given expression is evaluated
> simply from the left to the right. (Operator precedence can then be
> forced by putting braces) There are some cases that are still to be
> catched, but the main work is done. I would be really pleased to get
> some comments on my code.


Precedence is implicit, deduced from the grammar rules, or explicitely
specified by parentheses in the expression.


So, the first step is to define your grammar for your regular
expressions.


For example, reading:


      man 7 regex




Start Non Terminal:


    re


Productions:


    re ::= branches .
    branches ::= branch | branch '|' branches .
    branch ::= piece | piece branch .
    piece ::= atom | atom '*' | atom '+' | atom '?' | atom bound .
    bound ::= '{' integer '}' | '{' integer ',' '}'
                                  | '{' integer ',' integer '}' .
    atom ::= '(' re ')' | '(' ')' | bracket | '.'
                                  | escaped-special-char | normal-char .
    bracket ::= '[' ['^'] [']'|'-'] list-of-char ['-'] ']'
    list-of-char ::= bracket-char | collating-sequence
                                  | equivalence-class | character-class .
    collating-sequence ::= '[.' collating-item '.]' .
    collating-item ::= character | multi-character | collating-sequence-name .
    equivalence-class ::= '[=' character | multi-character '=]' .
    character-class ::= '[:' class-name ':]' .
    class-name ::= identifier .
    collating-sequence-name ::= identifier .


Terminals (in addition to the literal terminals in the grammar rules):


    integer ::= /[0-9]+/ .
    escaped-char ::= /\\./ .
    normal-char ::= /[^][\\|()*+?]/
    bracket-char ::= /[^]-]/
    identifier ::= /[a-zA-Z]+/
    character ::= /./
    multi-character ::= /.+/


Don't mind the complexity of list-of-char, it merely reduces to set of
characters.




The precedence of * over |, or of [] over *, is given by the grammar
rules. A parser that analyses this grammar will automatically
implement these operator precedences.






So, once you've parsed the regular expression, you get a parse tree
with nodes of the form:


        node ::= (ALT node...) . -- a|b|c
        node ::= (SEQ node...) . -- abc
        node ::= (REP min max node) . -- x* x+ x? x{...}
        node ::= (SET characters) . -- bracket [xxx]
        node ::= (INV characters) . -- bracket [^xxx]




For example, (ab)+|cd(e[a-e]){3,}|[^a]{0,5}
would result in this tree:


    (ALT (REP 1 INFINITE (SEQ (SET "a") (SET "b")))
              (SEQ (SET "c")
                        (SET "d")
                        (REP 3 INFINITE (SEQ (SET "e")
                                                                  (SET "abcde"))))
              (SEQ (REP 0 5 (INV "a"))))


Here, there's no notion of operator precedence anymore. You can have
subnodes of any kind, the structure of the tree directly indicates the
order of the operators.






Now, to generate a NFA from such a tree, you can easily do it
processing each kind of node:




(SET characters) translates to:


          +-----(character-1)----+
          | |
  *---+-----(character-2)----+---->*
          | |
          | ... |
          | |
          +-----(character-n)----+


(INV characters) translates to the same, but for the characters that
are not in the set given in INV.




(SEQ node...) translates to:


  *--(ε)-->[NFA for node 1]--+
                                                        |
  +-----------(ε)------------+
  |
  v
  *--(ε)-->[NFA for node 2]--+
                                                        |
  +-----------(ε)------------+
  |
  v
...
  *--(ε)-->[NFA for node n]--(ε)-->*




Or you can directly merge the final state of the NFA for node i with
the start state of the NFA for node i+1:


[NFA for node 1]--->[NFA for node 2]--...-->[NFA for node n]






(ALT node...) translates to:


          +--(ε)--[NFA for node 1]--(ε)--+
          | |
  *---+--(ε)--[NFA for node 2]--(ε)--+---->*
          | |
          | ... |
          | |
          +--(ε)--[NFA for node n]--(ε)--+






(REP min max node) with max < INFINITE translates to:


  *--(ε)-->[NFA for node]--+ \
                                                    | |
  +-----------(ε)----------+ |
  | |
  v |
  *--(ε)-->[NFA for node]--+ |
                                                    | |
  +-----------(ε)----------+ | min times
  | |
  v |
... |
  *--(ε)-->[NFA for node]--+ |
                                                    | |
  +-----------(ε)----------+ /
  |
  v
  *------------(ε)-------->*---(ε)--+
                                                    | |
  +-----------(ε)----------+ |
  | |
  v |
  *--(ε)-->[NFA for node]--+---(ε)--+ \
                                                    | | |
  +-----------(ε)----------+ | |
  | | |
  v | |
  *--(ε)-->[NFA for node]--+---(ε)--+ |
                                                    | | |
  +-----------(ε)----------+ | |
  | | | (max - min - 1 ) times
  v | |
  *--(ε)-->[NFA for node]--+---(ε)--+ |
                                                    | | |
  +-----------(ε)----------+ | |
  | | /
  v v
  *--(ε)-->[NFA for node]---(ε)-----+--->*






(REP min INFINITE node) translates to:


  *--(ε)-->[NFA for node]--+ \
                                                    | |
  +-----------(ε)----------+ |
  | |
  v |
  *--(ε)-->[NFA for node]--+ |
                                                    | |
  +-----------(ε)----------+ | min times
  | |
  v |
... |
  *--(ε)-->[NFA for node]--+ |
                                                    | |
  +-----------(ε)----------+ /
  |
  v
  *------------(ε)------------>*---(ε)--->*
  ^ |
  | |
  +--(ε)--[NFA for node]--(ε)--+






This gives a NFA with some useless epsilon transaction, but you can
easily reduce this NFA or transform it into a DFA.




--
__Pascal Bourguignon__ http://www.informatimago.com/


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.