Regular express to NFA using Thompson construction, questions, HELP ME

heng@Ag.Arizona.Edu (Heng Yuan)
22 Sep 1998 01:24:18 -0400

          From comp.compilers

Related articles
Regular express to NFA using Thompson construction, questions, HELP ME heng@Ag.Arizona.Edu (1998-09-22)
Re: Regular express to NFA using Thompson construction, questions mkjacobs@erols.com (Eric Jacobs) (1998-09-24)
| List of all articles for this month |

From: heng@Ag.Arizona.Edu (Heng Yuan)
Newsgroups: comp.compilers
Date: 22 Sep 1998 01:24:18 -0400
Organization: University of Arizona, Tucson, AZ
Keywords: lex

Hi,


I have some questions regarding regular express to NFA using
Thompson construction.


The Normal Thompson Construction for a((ab*)*)* :
(a) node with an out edge labeled with a.
(e) epsilon node


Here is what I thought how the Thompson construction is made
1) (a)-(e) // process the first letter a
2) (a)-(e) // process the second letter a
3) (a)-(e)-(b)-(e) // process the b, is this how?


                                          v---.
4) (a)-(e)-(e)-(e)-(b)-(e) // process b*, is this how?
                                  |-----------^


                                  v---.
5) (a)-(e)-(e)-(b)-(e) // then remove extra node?
                          |-----------^


                          .---------------.
                          v v---. |
5) (e)-(a)-(e)-(e)-(b)-(e)-(e) // then this?
                  | |-----------^ ^
                  |-----------------------|


                          .-----------------------.
                          | .---------------. |
                          v v v---. | |
6) (e)-(e)-(a)-(e)-(e)-(b)-(e)-(e)-(e)
                  | | |-----------^ ^ ^
                  | |-----------------------| |
                  |-------------------------------|


                                  .-----------------------.
                                  | .---------------. |
                                  v v v---. | |
7) (a)-(e)-(e)-(a)-(e)-(e)-(b)-(e)-(e)-(e)
                          | | |-----------^ ^ ^
                          | |-----------------------| |
                          |-------------------------------|


Is this how Thompson construction made? I found it quite difficult to
code in C :(


Here comes the second question:
Can I do the following instead? with pseudo codes following charts.
The method only lists once for the same situation, only results list in
the repeated ones.


regexp = a((ab*)*)* :


struct nfa
{
    short thischar; // epsilon or alphabets
    struct nfa *next; // first default edge out
    struct nfa *next2; // second edge out
}


(#) is epsilon node, (0) = start.


0) before start, construct
                (0)-(1)
                variable prevnfa = (0), prevnfa2 = (0), end = (1).


1) (0)-(a)-(2) // see a, replace (1) with (a).
                if (prevnfa2 != prevnfa) // advance prevnfa and prevnfa2
                    prevnfa2 = prevnfa;
                prevnfa = (a);
2) (0)-(a)-(2) // see (
                pushstack (end) // to be popped and use as prevnfa
                pushstack (prevnfa) // to be popped and use as prevnfa2
3) (0)-(a)-(2) // read the second (
                pushstack (end)
                pushstack (prevnfa)
4) (0)-(a)-(a)-(3) // read a, replace (2) with (a)
                prevnfa = second (a)
                prevnfa2 = first (a)
5) (0)-(a)-(a)-(b)-(4) // read b, replace (3) with (b)
                prevnfa = (b)
                prevnfa2 = second (a)
                end = (4)
                                          v---.
6) (0)-(a)-(a)-(b)-(4)-(5) // read *
                                  |-----------^
                end->next = new nfa. // make (4)-(5), end is (4) at this point
                end->next2 = prevnfa;
                prevnfa2->next2 = end;
                end = end->next; // finish end


                                          v---.
7) (0)-(a)-(a)-(b)-(4)-(5) // read )
                                  |-----------^
                prevnfa2 = popstack ();
                prevnfa = popstack ();


                at this point,
                prevnfa = seond (a), end = (5), prevnfa2 = first (a)


                                  v-----------.
                                          v---. |
8) (0)-(a)-(a)-(b)-(4)-(5)-(6) // read *
                          | |-----------^ ^
                          |-------------------|


9) repeat, after reading another ) and another *, but the
                figure remain the same. If + or ? operator is used here,
the result remains the same.


In case of | operator, since there is no arrow pointed to (0) start,
the starting pointer can remain the same. Just copy the start next
and next2 pointers to the new one, and let (0) point to the new
instead. If the start of the second only has 1 out edge (2 edges
in case of a*, or a?) , it can be recycled.
Two expressions:
                (0)-(a)-(1)
and (2)-(b)-(3)
| operator:
                (0)-(4)-(a)-(1)-(2)
                          |--(b)-(3)--^
This is useful in case of a(a|b), make:
                (0)-(a)-(a)-(1)
and (2)-(b)-(3)
Finally
                (0)-(a)-(a)-(1)-(2)
                          |--(b)-(3)--^
Note the position of NFA states. Original Thompson construction
would give:
                (a)-(1)-(a)-(2)-(3)
                          |--(b)-(4)--^


instead. And my way seems to be more compact and a lot of NFA states
can be saved. The NFA to DFA code is not required to be modified.
It is also easier to construct. The states pushed onto the stacks
remain the same relationship to the NFA states constructed and therefore
no need to be modified no matter how complex the following expression is.


I don't have backgrounds in CS, so please correct me and help me.
Thanks a lot.


Heng Yuan
heng@ag.arizona.edu
--
Name: Heng Yuan Email: heng@ag.arizona.edu
Home: http://ag.arizona.edu/~heng/
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.