
Related articles
[4 earlier articles]
Re: How test if language is LL(k) or LR(k) ? cfc@shell01.TheWorld.com (Chris F Clark) (2008-11-06)
Re: How test if language is LL(k) or LR(k) ? torbenm@pc-003.diku.dk (2008-11-07)
Re: How test if language is LL(k) or LR(k) ? ulf.schwekendiek@googlemail.com (ulf.schwekendiek@googlemail.com) (2008-11-07)
Re: How test if language is LL(k) or LR(k) ? sh006d3592@blueyonder.co.uk (Stephen Horne) (2008-11-07)
Re: How test if language is LL(k) or LR(k) ? torbenm@pc-003.diku.dk (2008-11-10)
Re: How test if language is LL(k) or LR(k) ? cfc@shell01.TheWorld.com (Chris F Clark) (2008-11-10)
Re: How test if language is LL(k) or LR(k) ? sh006d3592@blueyonder.co.uk (Stephen Horne) (2008-11-12)

From: Stephen Horne <sh006d3592@blueyonder.co.uk>
Newsgroups: comp.compilers
Date: Wed, 12 Nov 2008 05:27:53 +0000
Organization: virginmedia.com
References: 08-10-044 08-10-055 08-11-033 08-11-035 08-11-043
Keywords: parse, theory
Posted-Date: 12 Nov 2008 09:58:17 EST

On Mon, 10 Nov 2008 17:54:35 +0100, torbenm@pc-003.diku.dk (Torben
Ægidius Mogensen) wrote:

>Stephen Horne <sh006d3592@blueyonder.co.uk> writes:
>
>> On Fri, 07 Nov 2008 14:27:18 +0100, torbenm@pc-003.diku.dk (Torben
>> Ægidius Mogensen) wrote:

>>
>>>Stephen Horne <sh006d3592@blueyonder.co.uk> writes:
>>>
>>>> For LR, *almost* all context-free grammars are LR(k < infinity), so if
>>>> you can prove that the grammar is context-free you are almost there.
>>>
>>>That's not very useful, and not even true. There are many useful
>>>grammars that are not LR(k) for any k, and even useful languages for
>>>which there doesn't exist an LR(k) grammar for any k.
>>
>> Context-free grammars? And for any finite k?
>
>Indeed. The language of all even palindromic sequences over the
>alphabet {a,b} is not LR(k) for any k. Note that this has an
>unambiguous grammar:
>
>  P -> aPa
>  P -> bPb
>  P ->

This is annoying because formally, I'm sure you are absolutely 100%
correct, and I cannot even appeal to usefulness, since palindrome-like
expressions are part of many programming language grammars (especially
with regard to different styles of parentheses).

Sort of.

Yet despite this, I still think your criticism is wrong.

Basically, palindrome-like expressions are common, as I said, in many
programming languages - most of which parse perfectly happily with
LR(1). I specifically mention problems with empty rules later in the
original post. Therefore, the only problem you *can* have WRT this is
in the subjective meaning of the term "almost all" in my original
post.

My impression of this discussion is that a lot of this has happened. I
can't fault you on your theory - for that matter, I have to thank you,
because you have made me take a serious look at some of my sloppy
thinking. Even so, the original post wasn't meant to be formal, the
bits that needed qualifiers were (mostly, at least) qualified, and
what has led you to see me as simply wrong seems to be mostly a
matter of interpretation of subjective terms such as "almost all".

In these situations, I have learned to stop and rethink what the real
issue is, and I suspect that the real issue is the tone in my original
post. Very often, my writing seems to have an unintentional tone of
authority or even of arrogance.

All I can say here is that I have Asperger's. I have developed a
pedantic and formal communication style as a desperate and misguided
strategy to avoid being misunderstood, due to my dodgy nonverbals etc.
By now, it is a deeply ingrained habit. Mixing a pedantic and formal
communication style with an informal approach to a subject is of
course bound to cause misunderstandings, especially given that I
mostly see the tone I intend when I write - the tone that others see
is of course obvious, but only once someone else has pointed it out to
me :-(

Anyway, at this point, I think the best thing to do is try to draw
this to an end before I cause any more upset.

>> 1. As I said, there's a pretty big proviso in there, related to the
>> halting problem. The only improvement in the power of this
>> relative to backtracking/Tomita is that it can detect problems for
>> specific inputs at run-time.
>>
>> Actually, there is a proviso to that proviso. My method doesn't
>> need reduce sizes to be encoded in the state model - they are
>> determined at run-time by the table. This makes it easy to support
>> EBNF definitions for nonterminals. There's no extra power in terms
>> of the set-of-strings definition of a grammar, but it's a little
>> more flexible in how you get to define the grammar.
>
>I just don't understand this description. Can you write it as
>pseudo-code?

By "proviso to the proviso" I am not describing a feature related to
the detection of problems at run-time, only a
maybe-there-is-a-slight-improvement-in-power-after-all.

In an LR(0) state model, you might describe a state with a set of
dotted rules. For LR(1) etc., each dotted rule also has a lookahead,
but for simplicity I'll deal with LR(0).

If one of those dotted rules for the state has the following form...

  NONTERM : a b C D e f.

then that indicates that the top of the stack contains those 6 tokens,
and that a reduction is possible which replaces those 6 tokens with
the single NONTERM token. The reduction size - 6 - is encoded in the
state model.

In normal LR, with a BNF form grammar, you always have a fixed reduce
length - a fixed number of tokens to pop from the stack, which is
encoded in the state model, and easy to determine using a dotted-rules
based representation of the state model.
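As a minimal sketch of this fixed-length case (the rule table and names here are hypothetical, just the example rule above, not output from any real generator):

```python
# Sketch of a standard LR reduce: the pop count is a fixed property
# of the grammar rule, baked into the (hypothetical) rule table.

# Example rule from above: NONTERM : a b C D e f  (reduce length 6)
RULES = {
    "r1": ("NONTERM", 6),   # (left-hand side, fixed pop count)
}

def reduce_step(stack, rule_id):
    lhs, length = RULES[rule_id]
    del stack[-length:]     # pop exactly `length` symbols...
    stack.append(lhs)       # ...and push the single nonterminal
    return stack

stack = ["$", "a", "b", "C", "D", "e", "f"]
reduce_step(stack, "r1")    # stack is now ["$", "NONTERM"]
```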

If you want to use EBNF rules to define your grammar, this basically
means that some rules are themselves defined using a regular grammar.
The normal approach to handling this is IIRC to translate the EBNF
form into a BNF form. However, tabular LR offers an alternative.
Basically, you can in principle have dotted rules such as...

  NONTERM : A (b C D e)* f.

A standard LR parser cannot cope with this - it doesn't know how many
tokens to pop off the stack in order to do the reduction - but a
tabular LR parser can derive that information from the table at
run-time.

Obviously, this approach has both benefits (state models should be a
little smaller) and costs (attribute evaluation may be problematic).
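One way to make the variable-length reduce concrete - this is a generic sketch under my own assumptions, not necessarily the poster's tabular mechanism, whose details are not specified - is to record at run time where the match for the EBNF item began, and pop back to that point:

```python
# Sketch: variable-length reduce for an EBNF rule such as
#   NONTERM : A (b C D e)* f
# The pop count is not fixed by the rule; here it is recovered at
# run time from a marker recording where the match began.

def reduce_variable(stack, start_marks, lhs):
    start = start_marks.pop()   # stack position where the match began
    del stack[start:]           # pop however many symbols were matched
    stack.append(lhs)
    return stack

# Two iterations of the starred group, so 10 symbols to pop:
stack = ["$", "A", "b", "C", "D", "e", "b", "C", "D", "e", "f"]
start_marks = [1]               # match of NONTERM began just after "$"
reduce_variable(stack, start_marks, "NONTERM")
# stack is now ["$", "NONTERM"]
```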

>> 2. As I said, at best I re-invented it ten years after the fact.
>> Formal papers have already been published for tabular LR parsing.
>
>Indeed. But while they claim linear-time parsing for tables without
>conflicts, there is no claim for tables with conflicts.

*Tables* with conflicts? If this refers to the cycles within a column
of the table, then perhaps I am making a new claim, though it seems a
fairly obvious idea to me.

In the simplest form of the algorithm, you evaluate the input and
table backwards using a standard LR state model. It's a kind of
dynamic programming approach. Since it is evaluated one column of the
table (one token of the input) at a time, with all cells being fully
evaluated in each column before progressing, forward references are
always trivially resolved - the cells being referenced have already
been evaluated.

The only problem is the evaluation of the cells in a single column,
which may reference each other. The only "conflict" I can see,
therefore, is the possibility of cycles. Since the number of cells is
constant for a particular grammar, order of evaluation can be
determined (and cycles detected) in constant time.
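The per-column ordering and cycle detection can be sketched as a depth-first topological pass over the column's cells; since the cell count is fixed by the grammar, this is constant work per column. The dependency structure here is hypothetical, standing in for whatever the real table encodes:

```python
# Sketch: order the cells of one column for evaluation, detecting
# cycles. deps[c] lists the cells in the SAME column that c reads;
# references to earlier columns are already resolved and omitted.

def evaluate_column(deps, compute):
    order, state = [], {}
    def visit(c):
        s = state.get(c)
        if s == "done":
            return
        if s == "visiting":
            raise ValueError("cycle detected in column")
        state[c] = "visiting"
        for d in deps.get(c, ()):
            visit(d)
        state[c] = "done"
        order.append(c)
    for c in deps:
        visit(c)
    for c in order:        # evaluate in dependency order
        compute(c)
    return order

order = evaluate_column({"x": ["y"], "y": [], "z": ["x"]}, lambda cell: None)
# cells come out in dependency order: y before x before z
```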

Perhaps you're thinking of conflicts in the grammar/state model. But
then the whole point of tabular LR (like backtracking LR and Tomita)
is to finitely resolve those conflicts/ambiguities that can be
finitely resolved. The only real differences between these, from this
perspective, are that tabular LR can detect cycles at run-time rather
than going into a loop, and that tabular LR can always give the
first/best parse in linear time (though enumerating alternative parses
obviously breaks the linear-time claim).

>Knowing the full input up-front is no problem, but if you make a table
>specifically to this input, you must show that this can be constructed
>in linear time and that subsequent parsing using this table is also in
>linear time.

Absolutely true, and an area in which I fully acknowledge that some
variants of my algorithm "cheat a bit".

With input known fully in advance, the table is just a big array -
constant size per column, one column per input token. Trouble is, it's
a *big* array. Therefore, some variants look at ways to reduce the
memory requirement at any point in time, and the memory management and
lookup issues for that have not been properly addressed, except to note
that it would be hard (perhaps impossible) to ensure strictly linear
time, though "practically" linear time shouldn't be too hard.

n log m (where m is the maximum number of table cells in memory at
once) should be easy. Unfortunately, you cannot decide m based on the
grammar alone. Therefore, for example, tree-based data structures are
not suitable for table-cell lookup. Hash tables *might* work, but
while hash tables can give amortized constant-time inserts, that's
only under certain constraints - when inserts and deletes are mixed in
such a way that the table grows substantially at times, and shrinks
substantially at other times, the analysis is a lot more complex.

>For just proving linear-time, big constant factors don't matter - as
>long as they are constants independent of the input size.

Yes - but big constant factors do affect whether something is
practical.

There's a common perception that linear time implies something faster
than, for example, n log n. That is only true in general for
sufficiently long inputs, and "sufficiently long" can be very, very
long indeed. Furthermore, in practice, there is some correlation
between length of input and grammar complexity. That is, if the
grammar is very simple, you are unlikely to be dealing with very long
inputs, and if the inputs are all very short, it's unlikely that they
have a complex grammar.
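To put numbers on "sufficiently long": if the linear algorithm costs c_lin * n and the rival costs c_log * n * log2(n), the linear one only wins once log2(n) > c_lin / c_log. The constant factors below are hypothetical, chosen just to illustrate the scale:

```python
import math

def break_even(c_lin, c_log):
    # Input length n at which c_lin * n equals c_log * n * log2(n),
    # i.e. the point past which the "linear" algorithm wins.
    return 2.0 ** (c_lin / c_log)

# A linear parser 200x slower per step than an n log n rival:
n_star = break_even(1000, 5)             # 2**200: astronomically long

# At a realistic input size (a million tokens), n log n still wins:
cost_lin = 1000 * 10**6                  # 1e9 steps
cost_log = 5 * 10**6 * math.log2(10**6)  # roughly 1e8 steps
```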

No one is going to switch to using a linear parsing algorithm for
their parser when - for typical source files - it takes many times
longer than their existing non-linear parsing algorithm. This includes
me - and it's the major reason why I haven't developed the idea
further.

Comparing tabular LR with basic LR, my guess would be that tabular LR
would run between 100 and 1000 times slower for the same fairly
typical domain-specific-language scale grammar. As the grammar gets
more complex, the table gets many more rows to evaluate per token
parsed, so I could easily be underestimating by orders of magnitude.
Sure, tabular LR can cope with some things that basic LR can't, but
just how much am I willing to pay for those things?

Basically, tabular LR has scaling constants based on the number of
nonterminal tokens and the number of LR states, which both tend to
increase the number of columns in the table. Where LR parsing speed is
only weakly dependent on the complexity of the grammar (depending on
the state model representation in the parser), and backtracking LR is
usually not much slower, tabular LR speed is extremely sensitive to
grammar complexity due to the evaluation of the table.

I've not done any formal analysis (or even looked back at how the
table is structured), but my intuition suggests at least quadratic
time for each token parsed as the grammar complexity increases (and
thus both the number of nonterminals and the number of states),
basically due to quadratically increasing numbers of columns in the
table. It could be much worse due to order-of-evaluation and
cycle-detection issues as each column is evaluated, since this also
has to be decided at run-time, and also because the number of states
probably increases polynomially with "grammar complexity", however
that is defined. So my best guess is "probably polynomial rather than
exponential, but pretty seriously bad". And I'm certainly not ruling
out exponential, either.

Anyway, I'm sure you can see how this can get very impractical, both
in terms of time and memory, even for quite modest grammars. Niche
applications with very large inputs and very simple grammars might
work, but I suspect that there are very few applications where the
grammars are simple enough and the inputs long enough for it to be
worthwhile - as I said, there is some correlation between grammar
complexity and typical input size, and having simple grammars but
large inputs seems very niche indeed.

For the record, I mostly use Kelbt - a backtracking parser generator
that's a bit beta, really. Sometimes, simple mistakes cause core
dumps, for example. What's particularly strange is that I don't even
use the backtracking features - LALR(1) works fine for what I need, and
I don't even miss the precedence and associativity stuff that much. The
big gain for me is a lot of flexibility in how the generated code is
structured, since you basically insert generated code fragments into a
template that you write for yourself.

Since backtracking LR is overkill for what I need, it's fairly obvious
that designing and implementing a tabular LR parser generator just for
my few DSLs would be overkill on an epic scale.

>If you really have an algorithm for linear-time CF parsing, send a
>paper to a journal to get it reviewed. If you get it past the
>reviewers, I might be persuaded to read your article. There are just
>too many "I have this wonderful solution to (open problem), which I
>have described in (long and impenetrable text)" claims out there for
>me to want to read one more.

Well, it's not that long and impenetrable. Several posts in this
thread were probably longer and more impenetrable, but I'm not sure
how well I can present the table format in ASCII.

Anyway, considering that I can't really be bothered with it myself, I
certainly don't blame you.

Besides, as one of those who proposed a "wonderful solution to
linear-time parsing" back in 2002, with a long, vague, impenetrable
description of a broken algorithm - well, I don't exactly get to
complain ;-)

I have a lot of these ideas that get scribbled down, partly so that I
can forget them and move on. I have some old notes for "texture
mapping" that I wrote when I was maybe 17 - something in the order of
20 years ago anyway. I had never seen real texture mapping done at the
time, though I had seen some stuff where the graphics are only
projected for a fixed number of orientations. My idea then wasn't
texture mapping, as it turns out, but a limited form of ray tracing
done the hard way (simultaneous equations rather than vectors etc).

If anyone discovers all this junk after I'm dead, I'll probably be
seen as a kind of tenth-rate wannabe Da Vinci.
