Re: Recursive Descent vs. LALR


Related articles
Recursive Descent vs. LALR JohnMResler@netscape.net (John Resler) (2003-06-20)
Re: Recursive Descent vs. LALR bear@sonic.net (2003-06-25)
Re: Recursive Descent vs. LALR cfc@shell01.TheWorld.com (Chris F Clark) (2003-07-02)
Re: Recursive Descent vs. LALR kamalpr@yahoo.com (2003-07-03)
Re: Recursive Descent vs. LALR John.M.Resler@Boeing.com (Boeing) (2003-07-04)
Re: Recursive Descent vs. LALR bear@sonic.net (2003-07-04)
From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 2 Jul 2003 00:45:05 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 03-06-093 03-06-117
Keywords: parse
Posted-Date: 02 Jul 2003 00:45:04 EDT

Note, this question has been asked before, and I've answered it before
(as have others). In some ways, I'm a little surprised it isn't in
the FAQ, but I guess it isn't asked that frequently, only about once
every year or two. Here is more than you wanted to know.


There is a hierarchy of languages, and a separate hierarchy of
grammars. The difference (which is important) is that when we say a
(set of) language(s) is in a certain class, we mean there exists a
grammar in the appropriate grammar class that parses that language.
However, we may not know what that grammar is, and more importantly
there may be no mechanical way of finding it.


Thus, just because a certain language is LR does not mean that a
grammar one might have for that language is LR. In fact, "you" might
never be able to find an LR grammar for the language.


In addition to the classes listed in this hierarchy, there are many
other grammar/language classes (in fact, infinitely many).


1) Regular (type 3) languages


      These are things one can express with just a regular expression.


      Also, text one can parse with an FSA (finite state automaton) or
      FSM (... machine), two words for the same thing, with no auxiliary
      storage. It does not matter whether the FSA is deterministic or
      non-deterministic; both can parse exactly the same languages.


      Equivalently, text one can parse with a "linear" grammar. A linear
      grammar has only one type of recursion, either left or right, but
      not both in the same grammar, and no "central" recursion--the kind
      one uses to express parentheses.


      Note, given any one of the above representations of a type 3
      language, it is possible to apply a mechanical algorithm that
      produces any of the other forms.


      Most "lexer" generators use regular expressions and handle type 3
      languages. The "regeps" of editors are type 3, unless they support
      back-references (which most do), which require a more powerful
      parser.
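
      As an illustration, here is a minimal sketch (hypothetical Python,
      not from any particular tool) of the kind of deterministic FSA a
      lexer generator might produce for the regular expression
      [a-zA-Z_][a-zA-Z0-9_]*, i.e. C-style identifiers:

          # Two live states and a dead state; no auxiliary storage,
          # just the current state, as with any FSA.
          def is_identifier(text):
              state = 0                      # 0 = start, 1 = accepting
              for ch in text:
                  if state == 0:
                      state = 1 if (ch.isalpha() or ch == '_') else -1
                  elif state == 1:
                      state = 1 if (ch.isalnum() or ch == '_') else -1
                  if state == -1:            # dead state: reject
                      return False
              return state == 1

          assert is_identifier("foo_42") and not is_identifier("42foo")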


2) Dyck grammars


      These are grammars that handle parentheses (and only parentheses).
      They have only central recursion.


      There is little overlap between Dyck grammars and regular grammars.
      What one can express in one, one cannot express in the other. The
      exception is the trivial language (i.e. parsing only an empty
      file, a file with no tokens).
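
      For example (a sketch of mine, not from the original post), the
      Dyck language of balanced parentheses has only central recursion,
      and that recursion is what lets it track arbitrarily deep nesting,
      which no FSA can do:

          dyck: '(' dyck ')' dyck | /* empty */ ;

          # Python recognizer for the rule above; the call stack
          # supplies the unbounded counting.
          def dyck(s, i=0):
              while i < len(s) and s[i] == '(':
                  i = dyck(s, i + 1)         # nested dyck
                  if i >= len(s) or s[i] != ')':
                      raise SyntaxError("unmatched '('")
                  i += 1                     # consume ')', parse trailing dyck
              return i

          assert dyck("(()())") == 6         # all six tokens consumed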


3) LL(k) (top-down or predictive) grammars


      The k refers to how many tokens of lookahead one needs to examine
      at the start of a rule to determine which rule (or alternative of
      a rule) to apply.


      If, at the beginning of any non-terminal in the grammar, one can
      always look at just one token and decide whether the rule should be
      applied or not (to a syntactically legal input), the grammar is
      LL(1). This is the most common case.


      All regular languages are LL(1); in particular, every regular
      language has a right linear grammar (one using only right
      recursion) that is LL(1).
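
      For instance, this right linear grammar for the regular language
      a*b* is LL(1); the next token (or end of input) always selects
      the production (an illustrative grammar, not from the original
      post):

          as: 'a' as | bs ;
          bs: 'b' bs | /* empty */ ;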


      LL(1) grammars also allow some forms of central recursion, i.e.
      parenthesization. All Dyck grammars are also LL(1).


      LL grammars do not allow left recursion.


      Also, some forms of central recursion mixed with right recursion
      are not LL(1). In particular, the common if-then/if-then-else rule
      of most commonly used programming languages is not LL(k) for any k.
      This is not as big a problem as it would seem, as there is a simple
      (hacky) way to extend LL parsers to handle the if-then/if-then-
      else conflict: have the parser choose the else version if and
      only if the else is present. This binds the else to the closest
      if, which is the normally desired semantics. Most LL parser
      generators will automatically do this for you with at most a
      warning.
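
      Here is a sketch of the hack in recursive descent terms (the
      token names and tree shape are hypothetical): a statement is
      either the atom 's' or 'if' 'c' 'then' stmt with an optional
      'else' stmt:

          # After the "then" branch, greedily consume an 'else' if one
          # is present, binding it to the nearest 'if' -- the hack.
          def parse_stmt(toks, i):
              if toks[i] == 's':                      # simple statement
                  return 's', i + 1
              assert toks[i:i+3] == ['if', 'c', 'then']
              then_part, i = parse_stmt(toks, i + 3)
              else_part = None
              if i < len(toks) and toks[i] == 'else':
                  else_part, i = parse_stmt(toks, i + 1)
              return ('if', then_part, else_part), i

          tree, _ = parse_stmt(
              ['if', 'c', 'then', 'if', 'c', 'then', 's', 'else', 's'], 0)
          assert tree == ('if', ('if', 's', 's'), None)  # inner if got the else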


      Most recursive descent parsers are "almost" LL(k), where k is the
      number of tokens they need to look at to figure out what to do in
      the worst case. The reason they are only almost LL(k) is that they
      generally do the if-then-else hack implicitly, without being aware
      that they are doing so.


      Strictly speaking, for each k, there exists a grammar which is
      LL(k+1) but not LL(k).


      There are mechanical manipulations one can apply--removing left
      recursion and left factoring--to convert a non-LL grammar into an
      LL one. However, I believe there are LL languages that have CFGs
      (described later) which cannot be converted mechanically into an
      equivalent LL grammar. (However, I could be wrong on this, as I
      don't keep that rule in my head securely.) Equally importantly,
      the mechanical transformations to make a grammar LL can sometimes
      destroy the "character" of the grammar, making it impossible to
      refer to the way the tokens were grouped into non-terminals in the
      original grammar, for semantic purposes.
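
      For example, the standard left recursive expression rule

          expr: expr '+' term | term ;

      is not LL (the parser would recurse forever without consuming a
      token), but removing the left recursion mechanically yields an
      LL(1) form:

          expr: term expr_tail ;
          expr_tail: '+' term expr_tail | /* empty */ ;

      This also illustrates the "character" point: the original rule
      directly expresses the left-associative grouping of "a + b + c",
      while the transformed rules no longer do.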


      Also, LL grammars need expression precedence to be expressed
      explicitly in the grammar, rather than in a separate formalism
      (i.e. not in a precedence table nor in precedence declarations).
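
      The usual LL encoding is one rule per precedence level, each
      level referring to the next tighter one (a textbook layering in
      EBNF-ish notation, not from the original post):

          expr:   term ('+' term)* ;        /* loosest: additive */
          term:   factor ('*' factor)* ;    /* tighter: multiplicative */
          factor: '(' expr ')' | NUMBER ;   /* tightest: atoms, parens */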


      However, all of those "negatives" aside, almost every programming
      language is either LL(k) plus the if-then-else hack for some very
      small k (~5 or less, with most being LL(1)) or is not parsable with
      any type 2 grammar (defined later). The only difficulty is that if
      one has an LR grammar for the language, it may not be easy to get
      an equivalent LL grammar.


      Moreover, certain extensions to LL grammars--backtracking,
      predicates, and adaptive grammars--allow extended LL grammars to
      parse type 1 and even type 0 languages (again defined later).


4) LR(k) (bottom up or shift-reduce) grammars


      In LR(k) grammars, the k refers to how many tokens one needs at the
      end of the rule to decide which rule to apply.


      If, at the end of a rule, one can always determine with one token
      of lookahead which rule to apply, the grammar is LR(1). If, at the
      end of the rule, one doesn't need any tokens of lookahead to decide
      which rule to apply (i.e. the ambiguities have all been sorted out
      before coming to the end of the rule), the grammar is LR(0).


      LR(0) is important because two significant sub-classes of LR
      grammars arise from it, SLR(k) and LALR(k). The key feature of
      LR(0) grammars is that they throw away all left (preceding) context
      in the lookahead calculations. The throwing away of the left
      context occurs because the LR(0) machine "merges" states without
      regard to their lookahead information. Thus, if in one context the
      lookahead for one rule is three tokens: a, b, or c, and in another
      the lookahead for the same rule is d or e, the LR(0) machine merges
      those two states, confusing the lookaheads, so that at the end of
      the rule any of a, b, c, d, or e should result in a reduction (note
      that if the language is LR(0), meaning we don't need to look ahead
      at all, it doesn't matter, since only the correct token will appear
      anyway in a syntactically legal program). Most "LR" parser
      generation algorithms first build the LR(0) machine and then
      resolve the conflicts that occur in the reduce states.


      A grammar is SLR(k) if at the end of all uses of a rule, the same k
      tokens always determine whether to apply the rule or not. A
      grammar is LALR(k) if at the end of each use of a rule, there are k
      tokens which decide whether to apply the rule or not, and those k
      tokens do not conflict with the lookahead tokens for some other
      rule that is applied in the same LR(0) reduce state. The
      difference between the two grammar classes is subtle, and all
      SLR(k) grammars are LALR(k) (and all LALR(k) grammars are LR(k)).
      Essentially, LALR(k) gives some context sensitivity to the
      lookahead tokens, and different sets of conflicting rules can have
      different lookahead tokens to distinguish them.
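
      The textbook illustration of the difference (not from the
      original post) is the pointer-assignment grammar:

          stmt: lval '=' rval | rval ;
          lval: '*' rval | ID ;
          rval: lval ;

      After seeing an lval, SLR(1) cannot choose between shifting '='
      and reducing "rval: lval", because '=' is in the global follow
      set of rval. LALR(1) keeps enough per-state lookahead to see
      that '=' can never actually follow rval at that point, so the
      conflict disappears.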


      That brings us back to "full" or canonical LR(k), where k is >= 1.
      LR(k) parsing preserves the needed left context to disambiguate
      certain situations where LALR(k) finds the rules to conflict. The
      canonical way of doing this is to not merge states with different
      lookaheads in the first place. That causes an explosion in the
      table size, as many states are kept distinct simply because they
      have different lookaheads, when in reality the lookahead for those
      states will never be consulted. A more modern method for solving
      the problem involves splitting the states only when a conflict is
      detected. Then, if the grammar is LALR, no splitting will occur
      and one has the same size machine as the LR(0), but as conflicts
      arise that require left context to resolve, the tables slowly grow.
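
      The textbook grammar needing that left context (again not from
      the original post) is:

          s: 'a' e 'c' | 'a' f 'd' | 'b' f 'c' | 'b' e 'd' ;
          e: 'z' ;
          f: 'z' ;

      After "a z", lookahead 'c' means reduce to e and 'd' means reduce
      to f; after "b z", the roles are exactly swapped. Canonical LR(1)
      keeps the two "z" reduce states apart, but LALR(1) merges them
      (they have the same LR(0) core), producing a reduce-reduce
      conflict on both 'c' and 'd'.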


      Back now to the inclusion rules.


      Every LL(k) grammar is an LR(k) grammar, but not necessarily an
      SLR(k) or LALR(k) grammar. The troublesome cases occur with empty
      rules (null productions). An LL(k) grammar uses left context to
      decide which lookahead to apply to each use of an empty rule.
      However, the LR(0) machine throws that context away, and thus the
      resulting SLR or LALR machine may have conflicts due to ignoring
      the left context. If there are no rules shorter than k-1, an LL(k)
      grammar is always an SLR(k) or LALR(k) grammar.


      And to the languages, every LR(k) language is actually also an
      LR(1) language, meaning that there exists an LR(1) grammar for the
      language. However, there is no mechanical method to find that
      LR(1) grammar.


      The LR(k) languages also correspond to what you can parse with a
      deterministic FSA and an auxiliary stack (also known as a LIFO)
      with k tokens of lookahead, sometimes called a (D)PDA
      ((deterministic) push down automaton). These are also called the
      deterministic context free languages. Since every LR(k) language
      is also LR(1), one can show that one needs only one token of
      lookahead in the PDA.


      By the way, there are many other grammar classes in the area, such
      as the various precedence grammars and the bounded context
      grammars. Many of these have their own parsing methodologies.
      Moreover, the grammars form a fairly complex hierarchy, with some
      nesting inside others and other sets not nesting. However, all of
      these languages are a subset of the LR(1) languages, meaning that
      there is always an LR(1) grammar for anything that any one of these
      grammars can parse, but not necessarily (and often not) a method
      of finding that LR(1) grammar.


5) type 2 grammars (cfgs, NPDAs)


      There are also non-deterministic PDAs. These can parse some
      languages that deterministic ones can't, notably palindromes.
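
      A grammar for palindromes over a and b (an illustration, not from
      the original post):

          pal: 'a' pal 'a' | 'b' pal 'b' | 'a' | 'b' | /* empty */ ;

      Reading left to right in a single pass, a deterministic PDA can
      never know when it has reached the middle of the input; an NPDA
      can "guess" the midpoint and then verify the guess.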


      Backtracking is equivalent to non-determinism.
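
      For instance, a backtracking recognizer for even-length
      palindromes can simply try every possible midpoint, which is
      exactly the guess an NPDA makes (a hypothetical Python sketch):

          def pal(s):
              # Try each midpoint m; accept if the first half mirrors
              # the second half. Backtracking over m plays the role of
              # the NPDA's non-deterministic guess.
              return any(s[:m] == s[m:][::-1] for m in range(len(s) + 1))

          assert pal("abba") and not pal("abab")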


      Any grammar with only one [non-]terminal on the left-hand side
      (LHS) of a rule (e.g. "a: b c d;" but not "a b: c d e;") is called
      a context free grammar (cfg) and the resulting language a context
      free language (cfl). "Context free", because in more general
      grammars the additional [non-]terminals on the LHS of a rule are
      the context in which the (non-context) non-terminal applies. These
      are the same languages that one can parse with an NPDA.


      The places where a DPDA gets confused and an NPDA doesn't are
      called ambiguities. (The places where a parser generator gets
      confused are called conflicts. All ambiguities cause conflicts.
      Not all conflicts are true ambiguities.)


      Of course, any language which can be parsed deterministically can
      also be parsed non-deterministically, by simply having the
      non-deterministic parser make the deterministic "choices".


6) type 1 grammars (LBAs, context sensitive grammars)


      Any grammar where there can be more than one symbol on the LHS of
      a rule, but the RHS of a rule cannot be shorter than the LHS, is
      called a "context sensitive grammar" (csg) (such as "a b: b a;"
      but not "a b: c;"). All cfgs are trivially csgs.


      These are the grammars that can be parsed by a Turing Machine (TM)
      with only a fixed amount of tape that is a linear multiple of the
      input size, also known as a linear bounded automaton (LBA).


      It is worth noting that the intersection of two cfls need not be a
      cfl, but the intersection of two csls is a csl. Predicates
      (mentioned above) allow one to write grammars whose languages are
      the intersection of context free languages, which allows
      predicated parsers (either LL or LR) to parse context-sensitive
      languages.
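
      The textbook example (not from the original post): the
      context-sensitive language a^n b^n c^n is the intersection of two
      context free languages,

          l1: ab cs ;                        /* a^n b^n c^m */
          ab: 'a' ab 'b' | /* empty */ ;
          cs: 'c' cs | /* empty */ ;

          l2: as bc ;                        /* a^m b^n c^n */
          as: 'a' as | /* empty */ ;
          bc: 'b' bc 'c' | /* empty */ ;

      A predicated parser can parse with one of the grammars while
      requiring the input to satisfy the other as a predicate, thereby
      accepting exactly a^n b^n c^n, which no cfg can describe.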


7) type 0 languages (Turing Machines)


      At the other end of the spectrum are Turing Machines (TMs), which
      are equivalent to grammars with arbitrary numbers of symbols on
      both the left and right hand sides. They are also equivalent to
      machines with a queue rather than a stack, or machines with two
      stacks, or numerous other models of computation.


      Everything previously discussed can be parsed by a TM (or one of
      the equivalent forms) and non-determinism does not add anything to
      a TM. Moreover, most of the TM equivalents have a mechanical
      method for converting a TM into (or out of) their representation,
      which is how TM equivalence is generally proven.


      BTW, adaptive grammars are equivalent in power to TMs.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
19 Bronte Way #33M voice : (508) 435-5016
Marlboro, MA 01752 USA fax : (508) 251-2347 (24 hours)

