Re: How test if language is LL(k) or LR(k) ?

Stephen Horne <sh006d3592@blueyonder.co.uk>
Wed, 12 Nov 2008 05:27:53 +0000

From: Stephen Horne <sh006d3592@blueyonder.co.uk>
Newsgroups: comp.compilers
Date: Wed, 12 Nov 2008 05:27:53 +0000
Organization: virginmedia.com
References: 08-10-044 08-10-055 08-11-033 08-11-035 08-11-043
Keywords: parse, theory
Posted-Date: 12 Nov 2008 09:58:17 EST

On Mon, 10 Nov 2008 17:54:35 +0100, torbenm@pc-003.diku.dk (Torben
Ægidius Mogensen) wrote:


>Stephen Horne <sh006d3592@blueyonder.co.uk> writes:
>
>> On Fri, 07 Nov 2008 14:27:18 +0100, torbenm@pc-003.diku.dk (Torben AEgidius
>> Mogensen) wrote:
>>
>>>Stephen Horne <sh006d3592@blueyonder.co.uk> writes:
>>>
>>>> For LR, *almost* all context-free grammars are LR(k < infinity), so if
>>>> you can prove that the grammar is context-free you are almost there.
>>>
>>>That's not very useful, and not even true. There are many useful
>>>grammars that are not LR(k) for any k and even useful languages for
>which there don't exist LR(k) grammars for any k.
>>
>> Context-free grammars? And for any finite k?
>
>Indeed. The language of all even palindromic sequences over the
>alphabet {a,b} is not LR(k) for any k. Note that this has an
>unambiguous grammar:
>
> P -> aPa
> P -> bPb
> P ->


This is annoying because formally, I'm sure you are absolutely 100%
correct (with any bounded lookahead, the parser has no way of knowing
where the middle of the palindrome is), and I cannot even appeal to
usefulness, since palindrome-like expressions are part of many
programming language grammars (esp WRT different styles of
parentheses).


Sort of.


Yet despite this, I still think your criticism is wrong.


Basically, palindrome-like expressions are common, as I said, in many
programming languages - most of which parse perfectly happily with
LR(1). I specifically mention problems with empty rules later in the
original post. Therefore, the only problem you *can* have WRT this is
in the subjective meaning of the term "almost all" in my original
post.


My impression is that a lot of this discussion comes down to exactly
that. I can't fault you on your theory - for that matter, I have to
thank you, because you have made me take a serious look at some of my
sloppy thinking. Even so, the original post wasn't meant to be formal,
the bits that needed qualifiers were (mostly, at least) qualified, and
what has led you to see me as simply wrong seems to be mostly a matter
of interpretation of subjective terms such as "almost all".


In these situations, I have learned to stop and rethink what the real
issue is, and I suspect that the real issue is the tone in my original
post. Very often, my writing seems to have an unintentional tone of
authority or even of arrogance.


All I can say here is that I have Aspergers. I have developed a
pedantic and formal communication style as a desperate and misguided
strategy to avoid being misunderstood, due to my dodgy nonverbals etc.
By now, it is a deeply ingrained habit. Mixing a pedantic and formal
communication style with an informal approach to a subject is of
course bound to cause misunderstandings, especially given that I
mostly see the tone I intend when I write - the tone that others see
is of course obvious, but only once someone else has pointed it out to
me :-(


Anyway, at this point, I think the best thing to do is try to draw
this to an end before I cause any more upset.


>> 1. As I said, there's a pretty big proviso in there, related to the
>> halting problem. The only improvement in the power of this
>> relative to backtracking/Tomita is that it can detect problems for
>> specific inputs at run-time.
>>
>> Actually, there is a proviso to that proviso. My method doesn't
>> need reduce sizes to be encoded in the state model - they are
>> determined at run-time by the table. This makes it easy to support
>> EBNF definitions for nonterminals. There's no extra power in terms
>> of the set-of-strings definition of a grammar, but it's a little
>> more flexible in how you get to define the grammar.
>
>I just don't understand this description. Can you write it as
>pseudo-code?


By "proviso to the proviso" I am not describing a feature related to
the detection of problems at run-time, only a
maybe-there-is-a-slight-improvement-in-power-after-all.


In an LR(0) state model, you might describe a state with a set of
dotted rules. For LR(1) etc, each dotted rule also has a lookahead,
but for simplicity I'll deal with LR(0).


If one of those dotted rules for the state has the following form...


      NONTERM : a b C D e f.


then that indicates that the top of the stack contains those 6 tokens,
and that a reduction is possible which replaces those 6 tokens with
the single NONTERM token. The reduction size - 6 - is encoded in the
state model.


In normal LR, with a BNF form grammar, you always have a fixed reduce
length - a fixed number of tokens to pop from the stack, which is
encoded in the state model, and easy to determine using a dotted rules
based representation of the state model.
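

As a concrete (if simplified) sketch of that fixed-length case - the
names and details below are invented for illustration, not taken from
any particular generator - the reduce action in a conventional LR
parser looks something like this:


    from collections import namedtuple

    # Illustrative only. A stack entry records a grammar symbol and the
    # LR state reached after shifting it; a rule records its left-hand
    # side and the fixed length of its right-hand side.
    StackEntry = namedtuple("StackEntry", "symbol state")
    Rule = namedtuple("Rule", "lhs rhs_length")

    def reduce_fixed(stack, rule, goto_table):
        # The pop count is a compile-time constant from the state
        # model - 6 for the NONTERM rule above.
        del stack[len(stack) - rule.rhs_length:]
        state = goto_table[stack[-1].state][rule.lhs]
        stack.append(StackEntry(rule.lhs, state))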


If you want to use EBNF rules to define your grammar, this basically
means that some rules are themselves defined using a regular grammar.
The normal approach to handling this is IIRC to translate the EBNF
form into a BNF form. However, tabular LR offers an alternative.
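

Just to spell out that usual translation (made-up rule names, purely
for illustration), an EBNF repetition such as


    X : A ( B C )* D


would normally be rewritten with a helper nonterminal:


    X : A LIST D
    LIST : LIST B C
    LIST :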


Basically, you can in principle have dotted rules such as...


    NONTERM : A (b C D e)* f.


A standard LR parser cannot cope with this - it doesn't know how many
tokens to pop off the stack in order to do the reduction - but a
tabular LR parser can derive that information from the table at
run-time.
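

The variable-length case differs only in where the pop count comes
from. Again, this is just an illustrative sketch - how the count is
actually read out of the table is the part I'm glossing over:


    from collections import namedtuple

    StackEntry = namedtuple("StackEntry", "symbol state")  # as above

    def reduce_variable(stack, lhs, pop_count, goto_table):
        # pop_count is not a constant from the state model; it is
        # whatever length the tabular evaluation determined for this
        # particular reduction at this point in the input.
        del stack[len(stack) - pop_count:]
        state = goto_table[stack[-1].state][lhs]
        stack.append(StackEntry(lhs, state))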


Obviously, this approach has both benefits (state models should be a
little smaller) and costs (attribute evaluation may be problematic).


>> 2. As I said, at best I re-invented it ten years after the fact.
>> Formal papers have already been published for tabular LR parsing.
>
>Indeed. But while they claim linear-time parsing for tables without
>conflicts, there is no claim for tables with conflicts.


*Tables* with conflicts? If this refers to the cycles within a column
of the table, then perhaps I am making a new claim, though it seems a
fairly obvious idea to me.


In the simplest form of the algorithm, you evaluate the input and
table backwards using a standard LR state model. It's a kind of
dynamic programming approach. Since it is evaluated one column of the
table (one token of the input) at a time, with all cells being fully
evaluated in each column before progressing, forward references are
always trivially resolved - the cells being referenced have already
been evaluated.
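

In outline - and this really is only the outer loop, with the cell
contents and the per-cell evaluation left as placeholders, since those
are exactly the parts I haven't described here - it looks something
like this:


    def build_table(tokens, states, order_column, evaluate_cell):
        # order_column decides an evaluation order for the cells of one
        # column (detecting cycles as it goes); evaluate_cell fills in
        # one cell using cells earlier in this column plus cells in the
        # columns to the right, which are already complete. Both are
        # placeholders for the parts of the algorithm not shown here.
        table = {}
        # One column per input token, evaluated right to left; every
        # cell in a column is completed before moving to the column on
        # its left.
        for pos in range(len(tokens) - 1, -1, -1):
            for state in order_column(pos, states, table):
                table[(state, pos)] = evaluate_cell(state, pos,
                                                    tokens, table)
        return table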


The only problem is the evaluation of the cells in a single column,
which may reference each other. The only "conflict" I can see,
therefore, is the possibility of cycles. Since the number of cells is
constant for a particular grammar, order of evaluation can be
determined (and cycles detected) in constant time.


Perhaps you're thinking of conflicts in the grammar/state model. But
then the whole point of tabular LR (like backtracking LR and Tomita)
is to finitely resolve those conflicts/ambiguities that can be
finitely resolved. The only real differences between these, from this
perspective, are that tabular LR can detect cycles at run-time rather
than going into a loop, and that tabular LR can always give the
first/best parse in linear time (though enumerating alternative parses
obviously breaks the linear time claim).


>Knowing the full input up-front is no problem, but if you make a table
>specifically to this input, you must show that this can be constructed
>in linear time and that subsequent parsing using this table is also in
>linear time.


Absolutely true, and an area in which I fully acknowledge that some
variants of my algorithm "cheat a bit".


With input known fully in advance, the table is just a big array -
constant size per column, one column per input token. Trouble is, it's
a *big* array. Therefore, some variants look at ways to reduce the
memory requirement at any point in time, and the memory management and
lookup issues for that have not been properly addressed except to note
that it would be hard (perhaps impossible) to ensure strictly linear
time, though "practically" linear time shouldn't be too hard.


n log m (where m is the maximum number of table cells in memory at
once) should be easy. Unfortunately, you cannot bound m based on the
grammar alone, so, for example, tree-based data structures are not
suitable for table-cell lookup if strictly linear time is the goal.
Hash tables *might* work, but although hash tables can give amortized
constant-time inserts, that is only under certain constraints - when
inserts and deletes are mixed in such a way that the table grows
substantially at times and shrinks substantially at others, the
analysis is a lot more complex.
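

(To put a rough number on that: even with a million or so cells live
at once, log2 m is only around 20, which is why I say "practically"
linear shouldn't be too hard.)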


>For just proving linear-time, big constant factors don't matter - as
>long as they are constants independent of the input size.


Yes - but big constant factors do affect whether something is
practical.


There's a common perception that linear time implies something faster
than, for example, n log n. That is only true in general for
sufficiently long inputs, and "sufficiently long" can be very very
long indeed. Furthermore, in practice, there is some correlation
between length of input and grammar complexity. That is, if the
grammar is very simple, you are unlikely to be dealing with very long
inputs, and if the inputs are all very short, it's unlikely that they
have a complex grammar.
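

To put purely made-up numbers on that: if the linear algorithm costs
around 500n operations and the competing algorithm costs around
5 n log2 n, the linear one only pulls ahead once log2 n exceeds 100 -
that is, for inputs of more than 2^100 tokens, which will never
happen.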


No-one is going to switch to using a linear parsing algorithm for
their parser when - for typical source files - it takes many times
longer than their existing non-linear parsing algorithm. This includes
me - and it's the major reason why I haven't developed the idea
further.


Comparing tabular LR with basic LR, my guess would be that tabular LR
would run between 100 and 1000 times slower for the same fairly
typical domain-specific-language scale grammar. As the grammar gets
more complex, the table gets many more rows to evaluate per token
parsed, so I could easily be underestimating by orders of magnitude.
Sure, tabular LR can cope with some things that basic LR can't, but
just how much am I willing to pay for those things?


Basically, tabular LR has scaling constants based on the number of
nonterminal tokens and the number of LR states, which both tend to
increase the number of rows in the table (the number of cells to
evaluate per column). Whereas LR parsing speed is only weakly
dependent on the complexity of the grammar (depending on the state
model representation in the parser), and backtracking LR is usually
not much slower, tabular LR speed is extremely sensitive to grammar
complexity due to the evaluation of the table.


I've not done any formal analysis (or even looked back at how the
table is structured), but my intuition suggests at least quadratic
time for each token parsed as the grammar complexity increases (and
thus both the number of nonterminals and the number of states),
basically due to quadratically increasing numbers of rows in the
table. It could be much worse due to order-of-evaluation and
cycle-detection issues as each column is evaluated, since this also
has to be decided at run-time, and also because the number of states
probably increases polynomially with "grammar complexity", however
that is defined. So my best guess is "probably polynomial rather than
exponential, but pretty seriously bad" - and I'm certainly not ruling
out exponential, either.


Anyway, I'm sure you can see how this can get very impractical, both
in terms of time and memory, even for quite modest grammars. Niche
applications with very large inputs and very simple grammars might
work, but I suspect that there are very very few applications where
the grammars are simple enough and the inputs long enough for it to be
worthwhile - as I said, there is some correlation between grammar
complexity and typical input size, and having simple grammars but
large inputs seems very niche indeed.


For the record, I mostly use Kelbt - a backtracking parser generator
that's a bit beta, really. Sometimes, simple mistakes cause core
dumps, for example. What's particularly strange is that I don't even use
the backtracking features - LALR(1) works fine for what I need, and I
don't even miss precedence and associativity stuff that much. The big
gain for me is a lot of flexibility in how the generated code is
structured, since you basically insert generated code fragments into a
template that you write for yourself.


Since backtracking LR is overkill for what I need, it's fairly obvious
that designing and implementing a tabular LR parser generator just for
my few DSLs would be overkill on an epic scale.


>If you really have an algorithm for linear-time CF parsing, send a
>paper to a journal to get it reviewed. If you get it past the
>reviewers, I might be persuaded to read your article. There are just
>too many "I have this wonderful solution to (open problem), which I
>have described in (long and impenetrable text)" claims out there for
>me to want to read one more.


Well, it's not that long and impenetrable. Several posts in this
thread were probably longer and more impenetrable, but I'm not sure
how well I can present the table format in ASCII.


Anyway, considering that I can't really be bothered with it myself, I
certainly don't blame you.


Besides, as one of those who proposed a "wonderful solution to
linear-time parsing" back in 2002, with a long, vague, impenetrable
description of a broken algorithm - well, I don't exactly get to
complain ;-)


I have a lot of these ideas that get scribbled down, partly so that I
can forget them and move on. I have some old notes for "texture
mapping" that I wrote when I was maybe 17 - something in the order of
20 years ago, anyway. I had never seen real texture mapping done at
the time, though I had seen some stuff where the graphics were only
projected for a fixed number of orientations. My idea then wasn't
texture mapping, as it turns out, but a limited form of ray tracing
done the hard way (simultaneous equations rather than vectors etc).


If anyone discovers all this junk after I'm dead, I'll probably be
seen as a kind of tenth-rate wannabe Da Vinci.


