Shallow Parsing

smryan@garth.uucp (Steven Ryan)
13 Jun 88 23:09:51 GMT


Newsgroups: sci.lang,comp.compilers
Organization: INTERGRAPH (APD) -- Palo Alto, CA
Church: Our Lady of Reluctant Software.

I once heard a suggestion that humans use a different parsing strategy than
compilers. Compilers do a deep parse using lotsa malenky rules and produce
a very impressive parse tree.

The suggestion was that humans memorise faster than they generate, and are
better at handling large chunks in parallel than small chunks nested.

What I take that to mean is that we memorise words in each inflected form,
even regular forms, and possibly even groups of words. Language generation
then consists of inserting these chunks into a verb-frame chunk, with each
sentence form being a different chunk.

I think I'll call this suggestion shallow parsing: the parse tree only goes
down two or three levels before running into unanalysed (id est, memorised)
chunks. In terms of productions, this would mean having thousands of similar
yet distinct productions instead of factoring out the similarities to reduce
the number of productions.
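As a minimal sketch of the idea (my own illustration, not from the original
suggestion), a shallow parser can be a greedy longest-match against a flat
store of memorised chunks; the resulting "tree" is only two levels deep, with
unanalysed chunks as the leaves. The chunk dictionary here is invented for
the example.

```python
# Hypothetical sketch: parse by longest-match against memorised chunks,
# yielding a flat (two-level) parse instead of a deep recursive one.

CHUNKS = {
    ("the", "big", "dog"): "NP",        # memorised as a whole, unanalysed
    ("the", "cat"): "NP",
    ("chased",): "V",
}

def shallow_parse(tokens):
    """Return a flat list of (label, chunk) pairs by greedy longest-match."""
    out, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):   # try the longest chunk first
            chunk = tuple(tokens[i:j])
            if chunk in CHUNKS:
                out.append((CHUNKS[chunk], chunk))
                i = j
                break
        else:
            raise ValueError(f"no memorised chunk starting at {tokens[i]!r}")
    return out

print(shallow_parse("the big dog chased the cat".split()))
# [('NP', ('the', 'big', 'dog')), ('V', ('chased',)), ('NP', ('the', 'cat'))]
```

Note there is no recursion below the chunk level: the thousands of similar
productions live in the chunk store, not in a factored grammar.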

What interests me about this is the possible application to compilers:
humans parse an ambiguous and nondeterministic language in almost always
linear time. Most programming languages are intentionally designed with a
deterministic context-free syntax which is LL(1) or LR(1). LR(1) parsing is
all very interesting, but the resulting language is often cumbersome: the
programmer must write out a sufficient left context to help out the parser
even when he/she/it/they/te/se/... could look a little to the right and know
what is happening.

            Example: the Pascal 'var' is there to help the parser know this
                              is a declaration and still remain LL(1):
                  var v1,v2,...,vn:type;

To claim LL(1)/LR(1) is superior because of its linear time, O(n), ignores
the fact that this is a context-free parse and must be followed by symbol
identification. Assuming the number of distinct symbols is logarithmic in the
program length, the time necessary for a context-sensitive parse ranges from
O(n log log n) to O(n**2).
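A back-of-the-envelope check of that range (my own model, assumed rather
than taken from the post): with O(log n) distinct symbols kept sorted, each
of the n lookups costs O(log log n) comparisons by binary search, giving
O(n log log n) overall; a linear scan over a table that grows to O(n)
entries would instead give O(n**2).

```python
# Sketch: per-lookup cost of symbol identification under the assumption
# that the number of distinct symbols is logarithmic in program length n.
import math

def binary_search(table, key):
    """Return (found, comparisons) for a lookup in a sorted symbol table."""
    lo, hi, comps = 0, len(table) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comps += 1
        if table[mid] == key:
            return True, comps
        elif table[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, comps

n = 2 ** 16                                    # program length in tokens
table = sorted(f"sym{i}" for i in range(int(math.log2(n))))  # ~log n symbols
found, comps = binary_search(table, table[0])
# ceil(log2(log2 n)) + 1 = 5 comparisons at most for n = 2**16
assert found and comps <= math.ceil(math.log2(len(table))) + 1
```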

It would be interesting if we could learn something from our own minds and
make computer languages more like natural languages (but excluding ambiguity).

                                                                        Make it simple--not simplistic.
                                                                                                                        sm ryan
