Re: RE's to CFG's

Chris F Clark <cfc@world.std.com>
19 Dec 2000 17:03:27 -0500

From comp.compilers

Related articles
RE's to CFG's clive_minnican@hotmail.com (Clive Minnican) (2000-12-18)
*Re: RE's to CFG's cfc@world.std.com (Chris F Clark)* (2000-12-19)**
Re: RE's to CFG's philip.fortomas@virgin.net (Philip Fortomas) (2000-12-20)

| List of all articles for this month |

From:	Chris F Clark <cfc@world.std.com>
Newsgroups:	comp.compilers
Date:	19 Dec 2000 17:03:27 -0500
Organization:	The World Public Access UNIX, Brookline, MA
References:	00-12-070
Keywords:	parse, lex
Posted-Date:	19 Dec 2000 17:03:27 EST

> Does anyone know of any good internet documents explaining the process
> involved in converting Regular Expressions (RE's) to Context-Free
> Grammars (CFG's) by hand? Or can anyone explain it to me easily?

The easy way to convert a regular expression to a context-free grammar
is by way of the linear grammars. Linear grammars have rules like:

non_terminal: non_terminal TERMINAL_1 TERMINAL_2 TERMINAL_3
                        | TERMINAL_4 TERMINAL_5;

Left-linear grammars allow rules with left-recursion (but no right nor
"center" recursion--the type used for parenthesized expressions).
Right-linear grammars only allow right recursion. General linear
grammars allow both left and right recursion, but no "center"
recursion.

// disallowed "center" recursion--not a linear grammar
non_terminal: TERMINAL_1 non_terminal TERMINAL_2;

I believe (but cannot state definitively) that you can reference other
non-terminals in a linear grammar. Technically, however, the
references to the other non-terminals must not result in indirect
recursion of the prohibited type(s) (or the grammar is not a linear
one). Now, since we don't really care whether the resulting grammar
is linear or not, this prohibition is irrelevant. However, if you had
a grammar that you wanted to turn back into a regular expression, the
prohibition is important, because only simple left and right
recursions have regular expression equivalents and any indirect
recursion that cannot be reduced to a simple left or right recursion
cannot be translated into a regular expression.

In any case, there is a simple translation into a linear grammar for
each regular-expression operator. Consider the following example:

non_terminal: A (B | C D+) E? (F G+ H)*;

1) Create a new non-terminal for each regular-expression operator.

b_or_c_d_plus: ... // nested | operators are a regular-expression
d_plus: ... // + is a regular-expression operator
e_qmark: ... // ? is a regular-expression operator
f_g_plus_h_star: ... // * is a regular-expression operator
g_plus: ... // + is a regular-expression operator

2) Write each new operator non-terminal out as its right-linear
      translation.

non_terminal: A b_or_c_d_plus e_qmark f_g_plus_h_star;
      // use new non_terminals in place of regular-expressions
b_or_c_d_plus: b | c d_plus;
      // simply rewrite the nested | operator as is
d_plus: d | d d_plus;
      // right linear form for a + operator
      // uses right recursion to represent iteration
e_qmark: /* empty */ | e;
      // linear form for a ? operator
      // uses an empty (aka epsilon or lambda) alternative
f_g_plus_h_star: /* empty */ | f g_plus h
      // right linear form for a * operator
      // combines + (recursion) and ? (empty) concepts
g_plus: g | g_plus;
      // same as before

Note, the translation of regular-expressions into new non-terminals
increases the number of non-terminals in the grammar and as a result
makes the parsing tables larger. However, before 1990 there weren't
common parser generators that could handle ELL or ELR grammars
(grammars with embedded regular expressions). As a result, you will
find a lot of language standards that have non-terminals like
id-list-opt to represent id* (or perhaps (id ("," id)*)? ). These
non-terminals represented the linearized regular-expressions.

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: RE's to CFG's

Chris F Clark <cfc@world.std.com>19 Dec 2000 17:03:27 -0500

Chris F Clark <cfc@world.std.com>
19 Dec 2000 17:03:27 -0500