Related articles |
---|
Recursive Descent vs. LALR JohnMResler@netscape.net (John Resler) (2003-06-20) |
Re: Recursive Descent vs. LALR bear@sonic.net (2003-06-25) |
Re: Recursive Descent vs. LALR cfc@shell01.TheWorld.com (Chris F Clark) (2003-07-02) |
Re: Recursive Descent vs. LALR kamalpr@yahoo.com (2003-07-03) |
Re: Recursive Descent vs. LALR John.M.Resler@Boeing.com (Boeing) (2003-07-04) |
Re: Recursive Descent vs. LALR bear@sonic.net (2003-07-04) |
From: | Chris F Clark <cfc@shell01.TheWorld.com> |
Newsgroups: | comp.compilers |
Date: | 2 Jul 2003 00:45:05 -0400 |
Organization: | The World Public Access UNIX, Brookline, MA |
References: | 03-06-093 03-06-117 |
Keywords: | parse |
Posted-Date: | 02 Jul 2003 00:45:04 EDT |
Note, this question has been asked before, and I've answered it before
(as well as have others). In some ways, I'm a little surpised it
isn't in the FAQ, but I guess it isn't asked that frequently, only
about once per year or two. Here is more than you wanted to know.
There is a hiearchy of languages, and a separate hierarchy of
grammars. The difference (which is important) is that when we say a
(set of) language(s) is in a certain class, we mean there exists a
grammar that parses that lanugage which is in the appropriate grammar
class. However, we may not know what the grammar is, and more
importantly there may be no mechanical way of finding that grammar.
Thus, just because a certain language is LR, does not mean that a
grammar one might have for that language is LR. In fact, "you" might
never be able to find an LR grammar for the language.
In addition, to the classes listed in this hierarchy, there are many
other grammar/language classes (in fact, infinitely many).
1) Regular (type 3) languages
These are things one can express with just a regular expression.
Also, text one can parse with a FSA (finite state automata) FSM
(... machine), two words for the same thing, with no auxilary
storage. It does not matter what the FSA is determinitic or
non-deterministic, both can parse exact the same languages.
Equivalently, text one can parse with a "linear" grammar. A linear
grammar has only 1 type of recursion, either left or right, but not
both in the same grammar and no recursion which is "central"
recursion--what one does to express parenthesis.
Note, given any one of the above representations of a type 3
language, it is possible to apply a mechanical algorithm that
produces any of the other forms.
Most "lexer" generators use regular expressions and handle type 3
languages. The "regeps" of editors are type 3, unless they support
back-references (which most do), which require a more powerful
parser.
2) Dike grammars
These are grammars that handle parenthesis (and only parenthesis).
They have only central recursion.
There is little overlap between Dike grammars and regular grammars.
What one can express in one, one cannot express in the other. The
exception being the empty language (i.e. parsing only an empty
file, a file with no tokens).
2) LL(k) (top-down or predictive) grammars
The k refers to how many tokens of lookahead one needs at the be
examine at the start of a rule to determine which rule (or
alternative of a rule) to apply.
If at the beginning of any non-terminal in the grammar, one can
always look at just one token and decide whether the rule should be
apllied or not (to a syntactically legal input), the grammar is
LL(1). This is the most common case.
All regular languages are LL(1), in particular all right linear
grammars (grammars that use only right recursion) are LL(1).
LL(1) grammars also allow some forms of central recursion,
parenthesization. All Dike grammars are also LL(1).
LL grammars do not allow left recursion.
Also, some forms of central recursion mixed with right recursion
are not LL(1). In particular, the common if-then/if-then-else rule
of most commonly used programming languages is not LL(k) for any k.
This is not as big of problem as it would seem as there is a simple
easy way (hack) to extend LL parsers to handle the if-then/if-then-
else conflict by having it simply choose the else version if and
only if the else is present. This binds the else to the closest
if, which is the normally desired semantics. Most LL parser
generators will automatically do this for you with at most a
warning.
Most recursive descent parsers are "almost" LL(k), where k is the
number of tokens they need to look at to figure out what to do in
the worst case. The reason they are almost LL(k) is that they
generally implicitly do the if-then-else hack without being aware
that they are.
Strictly speaking, for each k, there exists a grammar which is
LL(k+1) but not LL(k).
There are mechanical manipulations one can apply, removing left
recursion, and factoring to convert a non-LL grammar into an LL
one. However, I believe there are LL languages that have CFGs
(described later) which cannot be converted mechanically into an
equivalent LL grammar. (However, I could be wrong on this, as I
don't keep that rule in my head securely.) Equally importantly,
the meachanical transformations to make a grammar LL can sometimes
destroy the "character" of the grammar, making it impossible to
refer to the way the tokens were grouped into non-terminals in the
original grammar, for sematic purposes.
Also, LL grammars need expression precendence to be expressed
explicitly in the grammar, rather than in a separate formalism
(i.e. not in a precedence table nor in precedence declarations).
However, all of those "negatives" aside, almost every programming
language is either LL(k) +if-the-else hack for some very small k (~
5 or less with most being LL(1)) or is not parsable with any type 2
grammar (defined later). The only difficultly is that if one has
an LR grammar for the language, it may not be easy to get an
equivalent LL grammar.
Moreover, certain extensions to LL grammars: backtracking,
predicates, and adaptive grammars allow extended LL grammars to
parse type 1 and even type 0 languages (again defined later).
3) LR(k) (bottom up or shift-reduce) grammars
In LR(k) grammars, the k refers to how many tokens one needs at the
end of the rule to decide which rule to apply.
If at the end of a rule, one can always determine with one token of
lookahead which rule to apply the grammar is LR(1). If at the end
of the rule, one doesn't need any tokens of lookahead to decide
which rule to apply (i.e. the ambiguities have all been sorted out
before coming to the end of the rule) the grammar is LR(0).
LR(0) is important because two important sub-classes of LR grammars
arise from it, SLR(k) and LALR(k). The key feature of LR(0)
grammars is that they throw away all left (preceding) context in
the lookahead calculations. The throwing away of the left context
occurs, because the LR(0) mahcine "merges" states without regard to
their lookahead information. thus, if in one context the lookahead
for one rule is three tokens: a, b, or c and in another the
lookahead for the same rule is d or e, the LR(0) machaine marges
those two states, confusing the lookaheads, so that at the end of
the rule any of a, b, c, d, or e should result in a reduction (note
if the language is LR(0) meaning we don't need to look at all, it
doesn't matter, since only the correct token will appear anyway in
a syntactically legal program). Most "LR" parser generation
algorithms first build the LR(0) machine and then resolve the
conflicts the occur in the reduce states.
A grammar is SLR(k) if at the end of all uses of a rule, the same k
tokens always determine whether to apply the rule or not. A
grammar is LALR(k) if at the end of each use of a rule, there are k
tokens which decide whther to apply the rule or not and those k
tokens do not conflict with the lookahead tokens for some other
rule that is applied in the same LR(0) reduce state. The
difference between the two grammar classes is subtle, and all
SLR(k) grammars are LALR(k) (and all LALR(k) grammars are LR(k)).
Essentially, LALR(k) gives some context sensitivity to the
lookahead tokens and different sets of conflicting rules can have
different lookahead tokens to distinguish them.
That brings us back to "full" or canonical LR(k), where k is >= 1.
LR(k) parsing preserves the needed left context to disambiguate
certain situations where LALR(k) finds the rules to conflict. The
canonical way of doing this is to not merge states with different
lookaheads in the first place. That causes an explosion in the
table size as many states are kept distinct simply because they
have different lookaheads, when in reality the lookahead for those
states will never be consulted. A more modern method for solving
the problem involves splitting the states only when a conflict is
detected. Then, if the grammar is LALR, no splitting will occur
and one has the same size machine as the LR(0), but as conflicts
arise the require left context to resolve, the tables slowly grow.
Back now, to the inclusion rules.
Every LL(k) grammar is an LR(k) grammar, but not necessarily an
SLR(k) or LALR(k) grammar. The troublesome cases occur with empty
rules (null productions). An LL(k) grammar uses left context to
decide which lookahead to apply to each use of an empty rule.
However, the LR(0) machine throws that context away and thus the
result SLR or LALR machine may have conflicts due to ignoring the
left context. If there are no rules shorter than k-1, an LL(k)
grammar is always an SLR(k) or LALR(k) grammar.
And to the languages, every LR(k) language is actually also an
LR(1) language, meaning that there exists an LR(1) grammar for the
language. However, there is no mechanical method to find that
LR(1) grammar.
The LR(k) languages also correspond to what you can parse with an
deterministic FSA and an auxillary stack (also know as a fifo) with
k tokens of lookahead, sometimes called a (D)PDA ((determinsitic)
push down automata). These are also called the deterministic
context free languages. Since every LR(k) grammar is also LR(1),
one can show that one needs only one token of lookahead in the PDA.
By the way, there are many other grammar classes in the area such
as the various precedence grammars and the bounded context
grammars. Many of these have their only parsing methodologies.
Moreover, the grammars form a farily complex hiearchy with some
nesting inside others and other sets not nesting. However, all of
these languages are a subset of the LR(1) languages, meaning that
there is always an LR(1) grammar for anything that any one of these
grammars can parse, but not necessarily (and often not) a method
of finding that LR(1) grammar.
4) type 2 grammars (cfgs, NPDAs)
There aare also non-deterministic PDA's. These can parse some
languages that determinstic ones can't, notably palindromes.
Backtracking is equivalent to non-determinism.
Any grammar with only one [non-]terminal on the left-hand-side
(LHS) of a rule (e.g. "a: b c d;" but not "a b: c d e;") is called
a context free grammar (cfg) and the resulting language a context
free language (cfl). Context free, because the additional
[non-]terminals on the LHS of a rule are considered the context
where the (non-context) non-terminal applies. These are the same
languages that one can parse with an NPDA.
The places where a DPDA gets confused and an NPDA doesn't are
called ambiguities. (The places where a parser generator gets
confused are called conflicts. All ambiguities cause conflicts.
Not all conflicts are true ambiguities.)
Of course, any language which can be parsed deterministically can
also be parsed non-determinstically, by simply having the
non-determinstic parse choose the determinstic "choices".
5) type 1 grammars (LBAs context sensitive grammars)
Any grammar where there can be more than one token on the LHS of a
rule, but the LHS of a rule cannot be shorter than the RHS of a
rule are called "context sensitive grammars" (csgs) (such as "a b:
b a;" but not "a b: c;"). All cfgs are trivially csgs.
These are the grammars that can be parse by a Turing Machine (TM)
with only a fixed amount of tape that is a linear multiple of the
input size, also known as a linear bounded automata.
It is worth noting that the intersection of two cfgs need not be a
cfg, but the intersection of two csgs is a csg. Predicates
(mentioned above) allow one to write grammatically grammars that
are the intersection of context free grammars, which allows
predicated parsers (either LL or LR) to parse context-sensitive
languages.
6) type 0 languages (Turning Machines)
At the other end of the spectrum are Turing Machines (TMs) which
ware equivalent to grammars with arbitrary numbers of symbols on
both the left and right hand sides). They are also equivalent to
machines with a queue rather than a stack or machines with two
stacks, or numerous other models of computation.
Everything previously discussed can be parsed by a TM (or one of
the equivalent forms) and non-determinism does not add anything to
a TM. Moreover, most of the TM equivalents have a mechanical
method for converting a TM into (or out of) their representation,
which is how TM equivalence is generally proven.
BTW, adaptive grammars are equivalent in power to TMs.
Hope this helps,
-Chris
*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
19 Bronte Way #33M voice : (508) 435-5016
Marlboro, MA 01752 USA fax : (508) 251-2347 (24 hours)
Return to the
comp.compilers page.
Search the
comp.compilers archives again.