Re: TeX syntax? ... In-stream/parallel parsing & intelligent handling of macros

Rock Brentwood <>
Fri, 9 Apr 2021 15:18:40 -0700 (PDT)

          From comp.compilers

Related articles
Re: macros of yore, was TeX syntax? (gah4) (2021-04-09)
| List of all articles for this month |

From: Rock Brentwood <>
Newsgroups: comp.compilers
Date: Fri, 9 Apr 2021 15:18:40 -0700 (PDT)
Organization: Compilers Central
References: 07-02-024 21-04-002 21-04-004
Injection-Info:; posting-host=""; logging-data="5637"; mail-complaints-to=""
Keywords: macros
Posted-Date: 09 Apr 2021 19:20:24 EDT
In-Reply-To: 21-04-004

I remade {tangle,weave}.web and c{tangle,weave}.w both in more C-like form as
ordinary contiguous code+comments (+ UNICODE, because I'd like to keep some of
the original spirit of making it as more math-/typeset-friendly
program+documentation) ... and might post what I have on GitHub or somewhere,
even though it hasn't yet been baseline tested. Among other things, I want to
remake my C compiler to directly accept UNICODE for the operators. The
glossary I use is (↑,↓,→,≠,≡,≤,≥,∷,∧,∨,¬) for
(++,--,->,!=,==,<=,>=,::,&&,||,!), with the latter treated as legacy digraph
forms of the former. I'll post a note on the link, when I get it up. The
context-sensitive grammar is a switch statement that runs up to 4 levels deep.
It *might* be yacc-convertible, as is, since some versions of yacc may allow
for stack-hacking, e.g. an $-1 argument in a rule.

On Monday, April 5, 2021 at 9:55:35 AM UTC-5, gah4 wrote:
> It seems to me that macro processors were popular in the 1970's.
> Some years ago, I used to work with Mortran, which is a macro processor
> meant to be used to process an improved language like Fortran, which is
> then converted into Fortran...

You hit the nail on the head here.

Partially in defense of Knuth, it looks like what he was trying to do (in the
background of all this) was establish an entirely new framework for parsing,
that got out from under the deficiencies of LR, that was (a) streaming and (b)
potentially even parallelizable. By "streaming" I mean that you can literally
start and end a partial parse in mid-stream and *still* get meaningful
results, even of the chunk that was processed doesn't form any kind of
grammatical unit at all ... which has obvious utility for intelligently
processing macros.

That's CYK-on-steroids.

The lack of such a facility also stood as a roadblock to making cfront
directly into a C++ compiler. In the old cfront code (e.g. the first version,
release E, which we transcribed and is currently here>front

an updated/correction version which will be out on GitHub hopefully soon)
shows indication in it that Stroustrup was trying to make it into a *direct*
C++ to C processor. He had an option in the build script in release E or the
other early releases to allow for this, but never implemented it. The macro
layer stands in the way of everything. So, instead, he set it up C++ to C
conversion as a Rube Goldberg device that got rid of macros first ... thus
totally destroying the transparency of the translation process; when what we
actually wanted as a translation that kept the macros intact, but
intelligently processed them.

(His original aim is much more doable now, by the way, than in the 1980's,
because the *forward* advancement of C from the 1980's to C99 actually entails
a dramatic *simplification* of what's needed for a "cfront" like utility, and
much of what went into the original is no longer needed, on that account. You
can almost even write up a utility as a simple editor script, now! ... Except
for exception-handling. Most people forget, translation is a *contravariant*
function of source language - it simplifies when source language complexifies.
So things become *more* doable, not less!)

At the time of LR's inception, you had 2 major results which were never fully
developed or exploited: (a) the Chomsky-Schützenberger Theorem (CST), which
*very nearly* gives you the ability to make 100% streaming representation for
translators, with one "frame" per word -- except the CST "kernels" have word
boundary markers; and (b) its successor of sorts , Greibach's "Hardest CFL"
construction, which successfully incorporates word-boundary markers into the
word frames.

I'll show you how it can be done, now, with the newer formalism for
context-free expressions.

L → α | L S β,
S → x γ | u L v δ,
S w L ω
is a translation grammar containing input words from the set {u,v,w,x}, output
actions from the set {α,β,γ,δ,ω} which have an obvious interpretation in
terms of "reduce" (α,β,γ,δ) and "accept" (ω) actions; plus a starting
configuration S w L ω. That's typical of what yacc assumes, except for the
loosening of restrictions on what's allowed for the "start".

It is possible, by the way, to hack yacc to allow for *any* rule body under
%start and to allow "%start" to be issued anywhere, and multiple times; by
just treating each "%start α" declaration as being synonymous with a hidden
rule of the form "$accept: α $end;" ... because that's actually already how
it is processed and even displayed under the "-v" option in some
implementations. With the loosening, you can even add in start actions under
%start (thereby eliminating the need for some of bison's kludges).

This translation grammar is in "canonical bottom-up" form - all actions occur
at the end, no in-between actions. (In yacc, in-between actions are hacked by
turning them into actions for special single-use non-terminals that have empty
transitions ... which you can see on display when you run the "-v" option on a
source file containing rules with in-between actions.)

The translation grammar has this as its CST-kernel:
<0| (u<u| + v<v| + w<w| + x<x| + (α + |S>|L>β)<L| + (|x>γ +
|v>|L>|u>δ)<S|)* |L>|w>|S>|0> ω

In the 1960's version of the theorem, the labelled brackets <u|,<v|,...;
|u>,|v>,... were interpreted as just extra word symbols (and actually Chomsky
& Schützenberger replaced words with brackets too). In the newer algebraic
form ... which may finally see publication soon this year ... the brackets
form an algebra with <i| |j> = 0 if i ≠ j, <i| |i> = 1, which commute with
the other symbols. In addition, for translator grammars, input and output
symbols commute (e.g. αx = xα, the conversion of αx to xα is, itself, the
very essence of a "look-ahead"). One can also conceive of cases where there
are *multiple* input and output channels, each having their own alphabet, the
different channels' alphabet commuting with one another.

The kernel has the form
I (X + YR)* YF
I = <0|
X = u<u| + v<v| + w<w| + x<x|
YR = α + |S>|L>β)<L| + (|x>γ + |v>|L>|u>δ)<S|
YF = |L>|w>|S>|0> ω
and, in fact, YR and YF both factor into
Y = α<α| + β<β| + γ<γ| + δ<δ| + ω<ω|
R = (|α> + |β>|S>|L>)<L| + (|γ>|x> + |δ>|v>|L>|u>)<S|
F = |ω>|L>|w>|S>|0>
and the original kernel is actually equal to I (X + Y + R)* F ... a form which
is suited for translation grammars that are *not* in bottom-up canonical form.
The reduction I (X + Y + R)* F = I (X + YR)* YF only applies if the
translation grammar is in canonical form.

By Kleene algebra, the kernel is equal to: I (YR)* (X (YR)*)* YF which
contains the "renormalized" forms of:
Start action: I' = I (YR)* = <0| (YR)*
Inputs: X' = X (YR)* = Σ_x x p'(x), composed of
Word frames: p'(x) = <x| (YR)*
Final action: F' = F = |L>|w>|S>|0>ω

That's a form suitable for both (a) mid-stream parsing, e.g. a word "vxu"
would be processed as p'(v) p'(x) p'(u) ... for instance, if vxu were a macro
body; and (b) a parallelized form of CYK. Except: instead of assembling chunks
bottom-up in dynamic programming from length 1, then length 2, then length 3,
etc. in a "CYK pyramid", you could do it both in parallel *and* exponentially,
from length 1, length 2, length 4, length 8, etc.

So, there are your "word frames" and you have a means to produce both
in-stream and even parallel translators ... which is ideally suited also for
intelligently handling macros.

It is also suitable for both deterministic and non-deterministic parsing; and
it ultimately subsumes the stack-based framework - devised originally by
Tomita - for GLR parsing; as well as generalizing to translation grammars
which are not canonical.

The role that Knuth's LR theory plays is that it provides a means to construct
a shift-reduce parser. Seen in this context, it's easy to explain both what it
is and does. A shift-reduce parser provides a set of transitions z: q→q'
between states q, q' on a [non-]terminal z, which generates a morphism
<z| ↦ p(z) ≡ Σ_{z:q→q'} |q><q|<q'| for all [non-]terminals z; and <0|
↦ <0|
|z> ↦ q(z) ≡ Σ_{z:q→q'} |q'>|q><q| for all [non-]terminals z; and |0>
↦ |0>
that (a) yields the same result when substituted into the CST kernel and (b)
provides for a more nearly-deterministic push-down transducer when the bracket
symbols are subject to the obvious stack interpretation with |0><0| playing
the role of the "empty stack" projection operator.

It can be written more succinctly in graphical form by writing defining [q]
≡ |q><q|, q⇐ ≡ |q>, ⇒q ≡ <q|, [q,⋯,q'] = [q] + ⋯ + [q]' (and
other comma-separated subgraphs being understood to be added, e.g.
[0]⇒2,[1]⇒3 = |0><0|<2| + |1><1|<3|). In that case, a rule such as S → u
L v corresponds to |v>|L>|u><S| which translates into q(v)q(L)q(u)p(S) which
takes the form of a sum of graphs that looks like q⇐q⇐q⇐[q]⇒q.

That's the expression for the "roll-back" action. The absence of the explicit
account of the roll-back actions from LR theory is the chief deficiency of the
entire formalism! It's normally imposed externally, either as a narrative
description or else (in an implementation) as a cookie-cutter code skeleton
(like in BSD yacc) or in the system library (like UNIX yacc) or
user-specifiable (like in bison).

The right-ward pointing segment on the left corresponds to what is actually
called a "lane" in LR-theory, while [q] is called a "laneahead" state. You can
read the LALR "lookahead"'s directly off the graph - it's whatever lookaheads
the states, q, to the right of the laneahead state have.

The current state of LR(k), for k > 0, by the way, runs pretty much like this.
Pager's classical theory - which came out shortly after Knuth, has been
largely superseded by IELR (the IELR paper is here which is what bison uses; and to
a limited extent by Pager and Chen's newer formulation that is in "hyacc",
which better handles LR(1) and a limited subset of LR(k) for k > 1. Pager and
Chen, however, haven't figured out how to deal with "cyclic" lookaheads -
which I think correspond to the cases where the (YR)* roll-back expression has
star-height 1. Nor do I know how their newer method compares with IELR; other
than that IELR is limited to LR(1).

You might be able to handle the situation within the algebraic framework just
described here and, thereby, extend their approach to all LR(k) grammars for k
> 1.

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.