
Related articles:
  Recursive Descent vs. LALR  JohnMResler@netscape.net (John Resler) (2003-06-20)
  Re: Recursive Descent vs. LALR  bear@sonic.net (2003-06-25)
  Re: Recursive Descent vs. LALR  cfc@shell01.TheWorld.com (Chris F Clark) (2003-07-02)
  Re: Recursive Descent vs. LALR  kamalpr@yahoo.com (2003-07-03)
  Re: Recursive Descent vs. LALR  John.M.Resler@Boeing.com (Boeing) (2003-07-04)
  Re: Recursive Descent vs. LALR  bear@sonic.net (2003-07-04)

From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 2 Jul 2003 00:45:05 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 03-06-093 03-06-117
Keywords: parse
Posted-Date: 02 Jul 2003 00:45:04 EDT

Note, this question has been asked before, and I've answered it before
(as have others). In some ways, I'm a little surprised it isn't in the
FAQ, but I guess it isn't asked that frequently, only about once every
year or two. Here is more than you wanted to know.

There is a hierarchy of languages, and a separate hierarchy of
grammars. The difference (which is important) is that when we say a
(set of) language(s) is in a certain class, we mean there exists a
grammar in the corresponding grammar class that parses that language.
However, we may not know what that grammar is, and more importantly
there may be no mechanical way of finding it. Thus, just because a
certain language is LR does not mean that a grammar one might have for
that language is LR. In fact, "you" might never be able to find an LR
grammar for the language.

In addition to the classes listed in this hierarchy, there are many
other grammar/language classes (in fact, infinitely many).

1) Regular (type 3) languages

These are the things one can express with just a regular expression.
Equivalently, text one can parse with an FSA (finite state automaton)
or FSM (finite state machine), two names for the same thing, with no
auxiliary storage. It does not matter whether the FSA is deterministic
or non-deterministic; both can parse exactly the same languages.

Equivalently, text one can parse with a "linear" grammar. A linear
grammar has only one kind of recursion, either left or right, but not
both in the same grammar, and no "central" recursion, which is what
one uses to express parentheses.

Note, given any one of the above representations of a type 3
language, it is possible to apply a mechanical algorithm that
produces any of the other forms.

Most "lexer" generators use regular expressions and handle type 3
languages. The "regexps" of editors are type 3, unless they support
back-references (which most do), which require a more powerful
parser.
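As an illustration (my own, not part of the original post), here is a
sketch of a two-state DFA recognizing the regular language of simple
identifiers, [a-z][a-z0-9]*; note that it needs no auxiliary storage
beyond the current state:

```python
# A minimal DFA sketch for the regular language [a-z][a-z0-9]*.
# States: "start" (nothing read yet) and "rest" (a valid prefix read).
def is_identifier(text):
    state = "start"
    for ch in text:
        if state == "start" and ch.islower():
            state = "rest"                    # first char: a letter
        elif state == "rest" and (ch.islower() or ch.isdigit()):
            state = "rest"                    # later chars: letter or digit
        else:
            return False                      # no transition: reject
    return state == "rest"                    # accept only in "rest"
```

The same language could equally be written as the right-linear grammar
id: letter rest; rest: letter rest | digit rest | /*empty*/;, which a
mechanical construction turns into exactly this automaton.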

2) Dyck grammars

These are grammars that handle parentheses (and only parentheses).
They have only central recursion.

There is little overlap between Dyck grammars and regular grammars:
what one can express in one, one cannot express in the other. The
exception is the empty language (i.e. parsing only an empty file, a
file with no tokens).
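A sketch of my own: a recognizer for the one-bracket Dyck language.
The unbounded counter (a degenerate stack) is exactly the auxiliary
storage an FSA lacks:

```python
# Recognize balanced parentheses (the one-bracket Dyck language).
# The counter tracks nesting depth, which no finite-state machine
# can do for unbounded inputs.
def is_dyck(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a ")" with no matching "("
                return False
        else:
            return False           # only parentheses are in the language
    return depth == 0              # every "(" must have been closed
```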

3) LL(k) (top-down or predictive) grammars

The k refers to how many tokens of lookahead one needs to examine at
the start of a rule to determine which rule (or alternative of a
rule) to apply.

If, at the beginning of any non-terminal in the grammar, one can
always look at just one token and decide whether the rule should be
applied or not (to a syntactically legal input), the grammar is
LL(1). This is the most common case.

All regular languages are LL(1); in particular, all right-linear
grammars (grammars that use only right recursion) are LL(1).
LL(1) grammars also allow some forms of central recursion, i.e.
parenthesization. All Dyck grammars are also LL(1).

LL grammars do not allow left recursion.

Also, some forms of central recursion mixed with right recursion
are not LL(1). In particular, the common if-then/if-then-else rule
of most commonly used programming languages is not LL(k) for any k.
This is not as big a problem as it would seem, as there is a simple,
easy way (a hack) to extend LL parsers to handle the if-then/if-then-
else conflict: have the parser choose the else version if and only
if the else is present. This binds the else to the closest if, which
is the normally desired semantics. Most LL parser generators will
automatically do this for you, with at most a warning.

Most recursive descent parsers are "almost" LL(k), where k is the
number of tokens they need to look at to figure out what to do in
the worst case. The reason they are only almost LL(k) is that they
generally do the if-then-else hack implicitly, without being aware
that they are doing it.
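To make the hack concrete, here is a hedged sketch (the grammar and
token names are my own, not from the post) of a recursive-descent rule
that greedily consumes an optional else, binding it to the nearest if:

```python
# Sketch of the dangling-else hack in recursive descent, for the
# deliberately tiny grammar:
#   stmt : "if" stmt ("else" stmt)?  |  "pass"
# parse_stmt returns the index just past the statement it consumed.
def parse_stmt(toks, i=0):
    if i < len(toks) and toks[i] == "if":
        i += 1                                   # consume "if" (condition elided)
        i = parse_stmt(toks, i)                  # then-branch
        if i < len(toks) and toks[i] == "else":  # the hack: take the "else"
            i += 1                               # whenever it is present,
            i = parse_stmt(toks, i)              # binding it to this "if"
        return i
    if i < len(toks) and toks[i] == "pass":
        return i + 1
    raise SyntaxError("unexpected input at token %d" % i)
```

Fed ["if", "if", "pass", "else", "pass"], the inner recursive call
consumes the else, so it binds to the closest if, as described above.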

Strictly speaking, for each k, there exists a grammar which is

LL(k+1) but not LL(k).

There are mechanical manipulations one can apply, removing left
recursion and factoring, to convert a non-LL grammar into an LL
one. However, I believe there are LL languages that have CFGs
(described later) which cannot be converted mechanically into an
equivalent LL grammar. (However, I could be wrong on this, as I
don't keep that rule in my head securely.) Equally importantly,
the mechanical transformations to make a grammar LL can sometimes
destroy the "character" of the grammar, making it impossible to
refer, for semantic purposes, to the way the tokens were grouped
into non-terminals in the original grammar.

Also, LL grammars need expression precedence to be expressed
explicitly in the grammar, rather than in a separate formalism
(i.e. not in a precedence table nor in precedence declarations).
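As a sketch of both points (the grammar and names are my own):
removing the left recursion from expr : expr '+' term | term gives
expr : term ('+' term)*, and layering expr over term over atom
encodes the precedence of '*' over '+' directly in the grammar:

```python
# An LL(1) expression evaluator after left-recursion removal.
# Precedence lives in the rule layering: expr (loosest) -> term -> atom.
def parse(tokens):
    pos = [0]
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None
    def eat():
        pos[0] += 1
        return tokens[pos[0] - 1]
    def expr():                       # expr : term ('+' term)*
        v = term()
        while peek() == "+":
            eat()
            v += term()
        return v
    def term():                       # term : atom ('*' atom)*
        v = atom()
        while peek() == "*":
            eat()
            v *= atom()
        return v
    def atom():                       # atom : '(' expr ')' | number
        if peek() == "(":
            eat()
            v = expr()
            assert eat() == ")", "missing close paren"
            return v
        return int(eat())
    v = expr()
    assert peek() is None, "trailing input"
    return v
```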

However, all of those "negatives" aside, almost every programming
language is either LL(k) plus the if-then-else hack for some very
small k (roughly 5 or less, with most being LL(1)) or is not
parsable with any type 2 grammar (defined later). The only
difficulty is that if one has an LR grammar for the language, it
may not be easy to get an equivalent LL grammar.

Moreover, certain extensions to LL grammars (backtracking,
predicates, and adaptive grammars) allow extended LL grammars to
parse type 1 and even type 0 languages (again, defined later).

4) LR(k) (bottom-up or shift-reduce) grammars

In LR(k) grammars, the k refers to how many tokens one needs at the
end of a rule to decide which rule to apply.

If, at the end of a rule, one can always determine with one token of
lookahead which rule to apply, the grammar is LR(1). If, at the end
of the rule, one doesn't need any tokens of lookahead to decide
which rule to apply (i.e. the ambiguities have all been sorted out
before coming to the end of the rule), the grammar is LR(0).

LR(0) is important because two important sub-classes of LR grammars
arise from it, SLR(k) and LALR(k). The key feature of LR(0)
grammars is that they throw away all left (preceding) context in
the lookahead calculations. The throwing away of the left context
occurs because the LR(0) machine "merges" states without regard to
their lookahead information. Thus, if in one context the lookahead
for one rule is three tokens (a, b, or c) and in another the
lookahead for the same rule is d or e, the LR(0) machine merges
those two states, confusing the lookaheads, so that at the end of
the rule any of a, b, c, d, or e should result in a reduction.
(Note, if the grammar is LR(0), meaning we don't need to look at
all, it doesn't matter, since only the correct token will appear
anyway in a syntactically legal program.) Most "LR" parser
generation algorithms first build the LR(0) machine and then
resolve the conflicts that occur in the reduce states.
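For concreteness, here is a toy sketch of my own of the shift-reduce
loop, for the grammar S : '(' S ')' | 'x'. A real LR generator drives
this from a precomputed state machine; the sketch just pattern-matches
the stack top, which happens to suffice for this LR(0) grammar:

```python
# Shift-reduce recognizer for S : '(' S ')' | 'x'.
# Shift tokens onto the stack; reduce whenever a rule's right-hand
# side appears on top; accept when the stack is exactly [S] at "$".
def parse(tokens):
    stack, toks = [], list(tokens) + ["$"]
    i = 0
    while not (stack == ["S"] and toks[i] == "$"):
        if stack[-1:] == ["x"]:
            stack[-1:] = ["S"]              # reduce S -> x
        elif stack[-3:] == ["(", "S", ")"]:
            stack[-3:] = ["S"]              # reduce S -> ( S )
        elif toks[i] != "$":
            stack.append(toks[i])           # shift
            i += 1
        else:
            raise SyntaxError("no action for stack " + repr(stack))
    return True
```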

A grammar is SLR(k) if, at the end of all uses of a rule, the same k
tokens always determine whether to apply the rule or not. A
grammar is LALR(k) if, at the end of each use of a rule, there are k
tokens which decide whether to apply the rule or not, and those k
tokens do not conflict with the lookahead tokens for some other
rule that is applied in the same LR(0) reduce state. The
difference between the two grammar classes is subtle, and all
SLR(k) grammars are LALR(k) (and all LALR(k) grammars are LR(k)).
Essentially, LALR(k) gives some context sensitivity to the
lookahead tokens, and different sets of conflicting rules can have
different lookahead tokens to distinguish them.

That brings us back to "full" or canonical LR(k), where k >= 1.
LR(k) parsing preserves the needed left context to disambiguate
certain situations where LALR(k) finds the rules to conflict. The
canonical way of doing this is to not merge states with different
lookaheads in the first place. That causes an explosion in the
table size, as many states are kept distinct simply because they
have different lookaheads, when in reality the lookahead for those
states will never be consulted. A more modern method for solving
the problem involves splitting the states only when a conflict is
detected. Then, if the grammar is LALR, no splitting will occur
and one has the same size machine as the LR(0) one, but as conflicts
arise that require left context to resolve, the tables slowly grow.

Back now to the inclusion rules.

Every LL(k) grammar is an LR(k) grammar, but not necessarily an
SLR(k) or LALR(k) grammar. The troublesome cases occur with empty
rules (null productions). An LL(k) grammar uses left context to
decide which lookahead to apply to each use of an empty rule.
However, the LR(0) machine throws that context away, and thus the
resulting SLR or LALR machine may have conflicts due to ignoring
the left context. If there are no rules shorter than k-1, an LL(k)
grammar is always an SLR(k) or LALR(k) grammar.

And on to the languages: every LR(k) language is actually also an
LR(1) language, meaning that there exists an LR(1) grammar for the
language. However, there is no mechanical method to find that
LR(1) grammar.

The LR(k) languages also correspond to what one can parse with a
deterministic FSA plus an auxiliary stack (a LIFO) and k tokens of
lookahead, sometimes called a (D)PDA ((deterministic) push-down
automaton). These are also called the deterministic context free
languages. Since every LR(k) language is also LR(1), one can show
that one needs only one token of lookahead in the PDA.

By the way, there are many other grammar classes in the area, such
as the various precedence grammars and the bounded context
grammars. Many of these have their own parsing methodologies.
Moreover, the grammars form a fairly complex hierarchy, with some
nesting inside others and other sets not nesting. However, all of
these languages are a subset of the LR(1) languages, meaning that
there is always an LR(1) grammar for anything that any one of these
grammars can parse, but not necessarily (and often not) a method
of finding that LR(1) grammar.

5) type 2 grammars (cfgs, NPDAs)

There are also non-deterministic PDAs. These can parse some
languages that deterministic ones can't, notably palindromes.
Backtracking is equivalent to non-determinism.

Any grammar with only one [non-]terminal on the left-hand-side
(LHS) of each rule (e.g. "a: b c d;" but not "a b: c d e;") is
called a context free grammar (cfg), and the resulting language a
context free language (cfl). "Context free" because the additional
[non-]terminals on the LHS of a rule are considered the context in
which the (non-context) non-terminal applies. These are the same
languages that one can parse with an NPDA.

The places where a DPDA gets confused and an NPDA doesn't are

called ambiguities. (The places where a parser generator gets

confused are called conflicts. All ambiguities cause conflicts.

Not all conflicts are true ambiguities.)

Of course, any language which can be parsed deterministically can
also be parsed non-deterministically, by simply having the
non-deterministic parser make the deterministic "choices".
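A sketch of my own of simulating an NPDA on palindromes: the
non-deterministic "guess" of where the middle lies (which a DPDA
cannot make) is replaced by backtracking over every possible midpoint:

```python
# Backtracking simulation of an NPDA recognizing palindromes.
# Push phase: stack the first part of the input; pop phase: match
# the rest against the stack. Try every midpoint, optionally
# skipping one center symbol for odd-length inputs.
def npda_palindrome(s):
    for mid in range(len(s) + 1):          # the non-deterministic guess
        stack = list(s[:mid])              # push phase
        for skip in (0, 1):                # skip=1: odd-length center
            rest = s[mid + skip:]
            st = stack[:]
            ok = True
            for ch in rest:                # pop phase
                if st and st[-1] == ch:
                    st.pop()
                else:
                    ok = False
                    break
            if ok and not st:              # all matched, stack empty
                return True
    return False                           # every guess failed
```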

6) type 1 grammars (LBAs, context sensitive grammars)

Any grammar where there can be more than one token on the LHS of a
rule, but the RHS of a rule cannot be shorter than the LHS (such as
"a b: b a;" but not "a b: c;"), is called a "context sensitive
grammar" (csg). All cfgs are trivially csgs.

These are the grammars that can be parsed by a Turing Machine (TM)
with only a fixed amount of tape that is a linear multiple of the
input size, also known as a linear bounded automaton (LBA).

It is worth noting that the intersection of two context free
languages need not be context free, but the intersection of two
context sensitive languages is context sensitive. Predicates
(mentioned above) allow one to write grammars that are the
intersection of context free grammars, which allows predicated
parsers (either LL or LR) to parse context-sensitive languages.
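A small illustration of my own: a^n b^n c^n is not context free, but
it is the intersection of two context free languages, so a predicated
parser can accept it by checking both conditions:

```python
import re

# { a^i b^i c^j } and { a^i b^j c^j } are each context free; their
# intersection { a^n b^n c^n } is only context sensitive. Each check
# below splits the (regular) shape a*b*c* and compares the counts.
def in_L1(s):                  # a-count must equal b-count
    m = re.fullmatch(r"(a*)(b*)(c*)", s)
    return bool(m) and len(m.group(1)) == len(m.group(2))

def in_L2(s):                  # b-count must equal c-count
    m = re.fullmatch(r"(a*)(b*)(c*)", s)
    return bool(m) and len(m.group(2)) == len(m.group(3))

def in_anbncn(s):              # the intersection of the two cfls
    return in_L1(s) and in_L2(s)
```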

7) type 0 languages (Turing Machines)

At the other end of the spectrum are Turing Machines (TMs), which
are equivalent to grammars with arbitrary numbers of symbols on
both the left and right hand sides. They are also equivalent to
machines with a queue rather than a stack, or machines with two
stacks, or numerous other models of computation.

Everything previously discussed can be parsed by a TM (or one of
its equivalent forms), and non-determinism does not add anything to
a TM. Moreover, most of the TM equivalents have a mechanical
method for converting a TM into (or out of) their representation,
which is how TM equivalence is generally proven.

BTW, adaptive grammars are equivalent in power to TMs.

Hope this helps,

-Chris

*****************************************************************************

Chris Clark Internet : compres@world.std.com

Compiler Resources, Inc. Web Site : http://world.std.com/~compres

19 Bronte Way #33M voice : (508) 435-5016

Marlboro, MA 01752 USA fax : (508) 251-2347 (24 hours)
