25 Feb 2007 13:24:42 -0500

Related articles |
---|

CEX - Context-Free Expression Filter markwh04@yahoo.com (2007-02-25) |

From: | markwh04@yahoo.com |

Newsgroups: | comp.compilers |

Date: | 25 Feb 2007 13:24:42 -0500 |

Organization: | Compilers Central |

Keywords: | parse, design |

Posted-Date: | 25 Feb 2007 13:24:42 EST |

In 1993, the regular expression filter REX was introduced. Notable

amongst its features was that processing was carried out entirely

within an algebraic paradigm. The utility and the algorithms

underlying it have eventually found their way into other applications

(e.g. XML-Query).

The following is a preliminary description of a planned utility called

CEX; and of the syntax, algebra and language underlying the utility.

CEX is the long-planned successor to REX. It will not handle the full

range of REX functionality, but will process language expressions at

the next stage up the Chomsky hierarchy: Context-Free Expressions.

The Untold Story of Formal Languages

http://federation.g3z.com/CompSci/index.htm#Untold

(Numerous other references may be found on this page, both above and

below this link)

1. SYNTAX

As mentioned in the supplementary material to the RegEx software, REX

was originally meant to be the precursor to an even more expanded GREP

superset that would also have a syntax for processing context-free

expressions. The biggest obstacle had been finding a suitable

embodiment of the underlying algebra.

The following elements are novel to CEX:

* (x = e, f) -- substitution expression

* (x -> e, f) -- a "rule" expression

* <x:e> -- least-fixed-point abstraction

* <e> -- expectation value

* <e| -- bra

* |e> -- ket

* [e] -- projection

In addition, the following standard regular expression operators are

incorporated

* e* -- Kleene star

* e + f -- alteration

* e f -- concatenation

* 1 -- success

* 0 -- failure

* e+ = e e*

* e? = | e

The two central features of the context-free expression algebra are

its fixed-point calculus and its use of the Bra-Ket algebra. The

latter algebra is defined by the identities

<x| |x> = 1; <x| |y> = 0, if x != y

as well as a "completeness" relation

|x0><x0| + ... + |xn><xn| = 1

summed over the entire set of bra-ket operator pairs.

An example of a CEX syntax context-free expression is the context-free

language given by the grammar

S -> a S b; S -> c

which may be represented by any of the following

S -> a S b, S -> c, S

S -> a S b + c, S

<S: a S b + c>

<(a <1|)* c (|1> b)*>

<0| (a <1|)* c (|1> b)* |0>.

Raise = a <1|, Lower = |1> b, <0| Raise* c Lower* |0>.

2. SEMANTICS

The value of an expression is that obtained by first applying all the

substitutions in its first. Thus, the (x = e, f) is equal to the

expression that results from substituting the expression e for all

free occurrences of the variable x in the expression f. Substitutions

may also be done within the bra, ket and projection symbols.

Substitutions do not take place in contexts where the variable x is

bound. Binding contexts include the following

(x = e, f), (x -> e, f), <x:e>

The remainder of the semantics is therefore defined, assuming the

substitutions have all been carried out.

First, the value of the expectation <e> is the same as <u| e |u>,

where the operator pair <u|, |u> is otherwise unused in e. The

expectation value is implicitly taken in the following contexts:

* for the top-level expression

* for the expression e in the rule (x -> e, f).

Thus, (x -> e, f) = (x -> <e>, f).

A limited subset of the expression syntax may be applied within the

bra, ket and projection operators. Thus,

<e*| = <e|*, |e*> = |e>*

<e+f| = <e|+<f|, |e+f> = |e>+|f>

<ef| = <f|<e|, |ef> = |e>|f>

<e+| = <e|+, |e+> = |e>+

<e?| = <e|?, |e?> = |e>?

The symbols 0 and 1 do not carry the meaning of failure and success

within a bra and ket, but are just used as ordinary labels. Note the

reversal of order in <ef|. This is to ensure the property that

<e| <-> |e>

be obtainable from one another by the process of reversal, for

arbitrary expressions e.

For the projections, the equivalences are

[e*] = [e?] = [] = 1

[e+] = [e]

[e,f] = [e+f] = [e] + [f]

[e* f] = [e? f] = [f]

[(e+) f] = [e f]

[(e f) g] = [e (f g)]

[(e + f) g] = [e g] + [f g]

[x e] = |x> [e] <x|,

where x is a label.

The operations a* and a+ are, respectively,

a* = <x : 1 + a x>; a+ = <x : a + a x>.

The least-fixed-point expression <x:e> is equivalent to the rule

expression (x -> e, x). If the variable x does not occur freely in e,

then <x:e> = <e>.

A "complete" rule expression is a subexpression e of the form

e = (x1 -> e1, x2 -> e2, ..., xn -> en, f)

such that (1) e is not contained within a larger expression of the

form (x0 -> e0, e), and f is not a rule expression. Its value is the

equivalent of that obtained by collecting all the rules for a given

variable and summing the expressions in the right-hand side. Thus, for

instance

(S -> S T, T -> x, S -> 1, T -> u S v, S b S)

is equivalent to

(S -> S T + 1, T -> x + u S v, S b S)

The resulting expression is then said to be in fixed-point form. This

can be equivalently expressed in terms of the

least-fixed-point operator. Thus, if

x0 -> e0, x1 -> e1, x2 -> e2, ..., xn -> en, f

is in fixed-point form, then its equivalent value will be

x0 = <x0 : e0>, (x1 -> e1, x2 -> e2, ..., xn -> en, f).

The rules for fixed-point expressions commute. Thus, for instance,

(S -> S T + 1, T -> x + u S v, S b S)

and

(T -> x + u S v, S -> S T + 1, S b S)

denote the same value.

Neither the least-fixed-point operator, nor the rule expression is

taken as primitive. Both may be reduced in terms of the bra-ket

algebra. The result is analogous to writing out a decimal expansion

for a solution to a non-linear equation.

The processing may be done on a rule expression in its entirety, or on

a fixed-point expression to eliminate one variable at a time. A

variety of methods may be employed. The one described here may be

taken as the normative reference. This is also described in section 4

of the Untold Story Of Formal Languages reference. Other methods

exist. Section 7 describes the resolution of the fixed-point operator.

The article "Algebraic LR Parsing" contains notes on implementing the

LR paradigm in algebraic form.

For a given fixed point expression, let Q denote its set of variables,

H(q) the expression on the right hand side of the rule for variable q,

and S the main expression. This process assumes that there are no

further "rule expressions" contained in any H(q) or S (otherwise, we'd

apply the reduction to these first). Add in an auxillary variable s to

denote the "main variable" of the system and set H(s) = S.

Each expression H(q) and S is written as the least-fixed point

solution in its "main variable". Thus, we have

H(q) = q_0

q_i -> F_i

q_i -> H^j_i q_j

where the F's are expressions free of the variables in Q and the H's

are either variables in q or expressions free of variables from Q.

This system is converted as follows. First, for the main variable q of

the subsidary system, we have

q -> q_0.

Second, for each rule of the form q_i -> F_i, we write

q_i -> F_i q_F.

Each rule of the form

q_i -> e q_j

where e is free of variables is kept intact. Finally, each rule of the

form

q_i -> r q_j

where r is another variable in Q is transformed into the pair of rules

q_i -> <u| r_0

r_F -> |u> q_j,

where <u| and |u> are a fresh operator pair. For the main variable s,

we write

s -> <0| s_0

and

s_F -> |0>.

The result is a system that is right-linear in its new set of

variables. The same process for converting a right-linear system to a

regular expression can be used to find the actual context-free

expression that results, though it will prove to be more useful to

keep the system in right-linear form.

3. PROCESSING INPUT: THE NORMAL FORM THEOREM, BRA-KET REDUCTION &

SHORTEST-PATHS

I won't go into too much detail here, other than to mention the main

features of the processing algorithm. Given a context-free expression

e (or its equivalent right-linear system), and an input x, there is a

linear-time algorithm that will find the intersection of e and {x}.

The right-linear system corresponding to this result is a context-free

expression either for the set {x} or {}. The latter case occurs if the

word x is not a member of the language given by e. Once the

intersection, say e_x, is found, the symbols comprising the word x can

be erased (i.e. mapped to 1). As a result, e' becomes a pure operator

expression.

A context-free expression e in right-linear form can be written as a

system in matrix-vector form

e -> S Q

Q -> C Q

Q -> U_i <i| Q

Q -> |i> V_i Q

Q -> F

where C, U_i, V_i are matrices with coefficients consisting of 0's and

input symbols; S is a row vector, Q and F are column vectors. The

least fixed-point of this solution is

e = S (U + V + C)* F

where

U = sum U_i <i|

V = sum |i> V_i.

The normal form theorem states that this can be expressed equivalently

as

e = S (N V)* N (U N)* F

where N is a matrix of context-free expressions that comprise the

least fixed-point solution to

N -> (sum (U_i N V_i) + C) N + I.

In the special case, where the expression e' is a pure operator

expression, the C matrix is 0 and the result reduces to the equivalent

of a "shortest paths" problem. This, of course, leads directly to the

equivalent of Valiant's O(n^(log 7)) context-free recognition

algorithm.

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.