CEX - Context-Free Expression Filter

25 Feb 2007 13:24:42 -0500

          From comp.compilers

Related articles
CEX - Context-Free Expression Filter markwh04@yahoo.com (2007-02-25)
CEX - Context-Free Expression Filter: update. rockbrentwood@gmail.com (Rock Brentwood) (2021-04-04)
| List of all articles for this month |

From: markwh04@yahoo.com
Newsgroups: comp.compilers
Date: 25 Feb 2007 13:24:42 -0500
Organization: Compilers Central
Keywords: parse, design
Posted-Date: 25 Feb 2007 13:24:42 EST

In 1993, the regular expression filter REX was introduced. Notable
amongst its features was that processing was carried out entirely
within an algebraic paradigm. The utility and the algorithms
underlying it have eventually found their way into other applications
(e.g. XML-Query).

The following is a preliminary description of a planned utility called
CEX; and of the syntax, algebra and language underlying the utility.

CEX is the long-planned successor to REX. It will not handle the full
range of REX functionality, but will process language expressions at
the next stage up the Chomsky hierarchy: Context-Free Expressions.

The Untold Story of Formal Languages

(Numerous other references may be found on this page, both above and
below this link)

As mentioned in the supplementary material to the RegEx software, REX
was originally meant to be the precursor to an even more expanded GREP
superset that would also have a syntax for processing context-free
expressions. The biggest obstacle had been finding a suitable
embodiment of the underlying algebra.

The following elements are novel to CEX:
* (x = e, f) -- substitution expression
* (x -> e, f) -- a "rule" expression
* <x:e> -- least-fixed-point abstraction
* <e> -- expectation value
* <e| -- bra
* |e> -- ket
* [e] -- projection

In addition, the following standard regular expression operators are
* e* -- Kleene star
* e + f -- alteration
* e f -- concatenation
* 1 -- success
* 0 -- failure
* e+ = e e*
* e? = | e

The two central features of the context-free expression algebra are
its fixed-point calculus and its use of the Bra-Ket algebra. The
latter algebra is defined by the identities
      <x| |x> = 1; <x| |y> = 0, if x != y
as well as a "completeness" relation
      |x0><x0| + ... + |xn><xn| = 1
summed over the entire set of bra-ket operator pairs.

An example of a CEX syntax context-free expression is the context-free
language given by the grammar
      S -> a S b; S -> c
which may be represented by any of the following
      S -> a S b, S -> c, S
      S -> a S b + c, S
      <S: a S b + c>
      <(a <1|)* c (|1> b)*>
      <0| (a <1|)* c (|1> b)* |0>.
      Raise = a <1|, Lower = |1> b, <0| Raise* c Lower* |0>.

The value of an expression is that obtained by first applying all the
substitutions in its first. Thus, the (x = e, f) is equal to the
expression that results from substituting the expression e for all
free occurrences of the variable x in the expression f. Substitutions
may also be done within the bra, ket and projection symbols.
Substitutions do not take place in contexts where the variable x is
bound. Binding contexts include the following
      (x = e, f), (x -> e, f), <x:e>

The remainder of the semantics is therefore defined, assuming the
substitutions have all been carried out.

First, the value of the expectation <e> is the same as <u| e |u>,
where the operator pair <u|, |u> is otherwise unused in e. The
expectation value is implicitly taken in the following contexts:
* for the top-level expression
* for the expression e in the rule (x -> e, f).
Thus, (x -> e, f) = (x -> <e>, f).

A limited subset of the expression syntax may be applied within the
bra, ket and projection operators. Thus,
      <e*| = <e|*, |e*> = |e>*
      <e+f| = <e|+<f|, |e+f> = |e>+|f>
      <ef| = <f|<e|, |ef> = |e>|f>
      <e+| = <e|+, |e+> = |e>+
      <e?| = <e|?, |e?> = |e>?
The symbols 0 and 1 do not carry the meaning of failure and success
within a bra and ket, but are just used as ordinary labels. Note the
reversal of order in <ef|. This is to ensure the property that
      <e| <-> |e>
be obtainable from one another by the process of reversal, for
arbitrary expressions e.

For the projections, the equivalences are
      [e*] = [e?] = [] = 1
      [e+] = [e]
      [e,f] = [e+f] = [e] + [f]
      [e* f] = [e? f] = [f]
      [(e+) f] = [e f]
      [(e f) g] = [e (f g)]
      [(e + f) g] = [e g] + [f g]
      [x e] = |x> [e] <x|,
where x is a label.

The operations a* and a+ are, respectively,
      a* = <x : 1 + a x>; a+ = <x : a + a x>.

The least-fixed-point expression <x:e> is equivalent to the rule
expression (x -> e, x). If the variable x does not occur freely in e,
then <x:e> = <e>.

A "complete" rule expression is a subexpression e of the form
      e = (x1 -> e1, x2 -> e2, ..., xn -> en, f)
such that (1) e is not contained within a larger expression of the
form (x0 -> e0, e), and f is not a rule expression. Its value is the
equivalent of that obtained by collecting all the rules for a given
variable and summing the expressions in the right-hand side. Thus, for
      (S -> S T, T -> x, S -> 1, T -> u S v, S b S)
is equivalent to
      (S -> S T + 1, T -> x + u S v, S b S)

The resulting expression is then said to be in fixed-point form. This
can be equivalently expressed in terms of the
least-fixed-point operator. Thus, if
      x0 -> e0, x1 -> e1, x2 -> e2, ..., xn -> en, f
is in fixed-point form, then its equivalent value will be
      x0 = <x0 : e0>, (x1 -> e1, x2 -> e2, ..., xn -> en, f).
The rules for fixed-point expressions commute. Thus, for instance,
      (S -> S T + 1, T -> x + u S v, S b S)
      (T -> x + u S v, S -> S T + 1, S b S)
denote the same value.

Neither the least-fixed-point operator, nor the rule expression is
taken as primitive. Both may be reduced in terms of the bra-ket
algebra. The result is analogous to writing out a decimal expansion
for a solution to a non-linear equation.

The processing may be done on a rule expression in its entirety, or on
a fixed-point expression to eliminate one variable at a time. A
variety of methods may be employed. The one described here may be
taken as the normative reference. This is also described in section 4
of the Untold Story Of Formal Languages reference. Other methods
exist. Section 7 describes the resolution of the fixed-point operator.
The article "Algebraic LR Parsing" contains notes on implementing the
LR paradigm in algebraic form.

For a given fixed point expression, let Q denote its set of variables,
H(q) the expression on the right hand side of the rule for variable q,
and S the main expression. This process assumes that there are no
further "rule expressions" contained in any H(q) or S (otherwise, we'd
apply the reduction to these first). Add in an auxillary variable s to
denote the "main variable" of the system and set H(s) = S.

Each expression H(q) and S is written as the least-fixed point
solution in its "main variable". Thus, we have
      H(q) = q_0
      q_i -> F_i
      q_i -> H^j_i q_j
where the F's are expressions free of the variables in Q and the H's
are either variables in q or expressions free of variables from Q.

This system is converted as follows. First, for the main variable q of
the subsidary system, we have
      q -> q_0.
Second, for each rule of the form q_i -> F_i, we write
      q_i -> F_i q_F.
Each rule of the form
      q_i -> e q_j
where e is free of variables is kept intact. Finally, each rule of the
      q_i -> r q_j
where r is another variable in Q is transformed into the pair of rules
      q_i -> <u| r_0
      r_F -> |u> q_j,
where <u| and |u> are a fresh operator pair. For the main variable s,
we write
      s -> <0| s_0
      s_F -> |0>.

The result is a system that is right-linear in its new set of
variables. The same process for converting a right-linear system to a
regular expression can be used to find the actual context-free
expression that results, though it will prove to be more useful to
keep the system in right-linear form.

I won't go into too much detail here, other than to mention the main
features of the processing algorithm. Given a context-free expression
e (or its equivalent right-linear system), and an input x, there is a
linear-time algorithm that will find the intersection of e and {x}.

The right-linear system corresponding to this result is a context-free
expression either for the set {x} or {}. The latter case occurs if the
word x is not a member of the language given by e. Once the
intersection, say e_x, is found, the symbols comprising the word x can
be erased (i.e. mapped to 1). As a result, e' becomes a pure operator

A context-free expression e in right-linear form can be written as a
system in matrix-vector form
      e -> S Q
      Q -> C Q
      Q -> U_i <i| Q
      Q -> |i> V_i Q
      Q -> F
where C, U_i, V_i are matrices with coefficients consisting of 0's and
input symbols; S is a row vector, Q and F are column vectors. The
least fixed-point of this solution is
      e = S (U + V + C)* F
      U = sum U_i <i|
      V = sum |i> V_i.
The normal form theorem states that this can be expressed equivalently
      e = S (N V)* N (U N)* F
where N is a matrix of context-free expressions that comprise the
least fixed-point solution to
      N -> (sum (U_i N V_i) + C) N + I.

In the special case, where the expression e' is a pure operator
expression, the C matrix is 0 and the result reduces to the equivalent
of a "shortest paths" problem. This, of course, leads directly to the
equivalent of Valiant's O(n^(log 7)) context-free recognition

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.