|Shallow Parsing email@example.com (1988-06-13)|
|Re: Shallow Parsing firstname.lastname@example.org (Bart Massey) (1988-06-16)|
|Re: Shallow Parsing email@example.com (1988-06-17)|
|From:||firstname.lastname@example.org (Steven Ryan)|
|Date:||13 Jun 88 23:09:51 GMT|
|Organization:||INTERGRAPH (APD) -- Palo Alto, CA|
|Posted:||Mon Jun 13 16:09:51 1988|
|Church:||Our Lady of Reluctant Software.|
I once heard a suggestion humans use a different parsing strategy than
compilers. Compilers use a deep parse using lotsa malenky rules and produce
a very impressive parse tree.
The suggestion was that humans memorise faster than generate and are better
at handling large chunks in parallel than small chunks nested.
What I take that to mean, we memorise words in each inflected form, even
regular forms, and possibly even groups words possibly up to words. Than
language generation consists of inserting these chunks into a verb frame
chunk, with each sentence form being a different chunk.
I think I'll call this suggestion shallow parsing: the parse tree will
only go down two or three levels before running into unanalysed (ig est
memorised) chunks. In terms of a productions, this would mean having
thousands of similar yet distinct productions instead factoring the
similarities to reduce the number of productions.
What interests me about this is the possible application to compilers:
humans parse an ambiguous and nondeterministic language in almost always
linear time. Most programming languages are intentionally designed with a
deterministic context free syntax which is LL(1) or LR(1). LR(1) parsing is
all very interesting, but the resulting language is often cumbersome: the
programmer must write out a sufficient left context to help out the parser
even when he/she/it/they/te/se/... can look a little to right and know what
Example: the Pascal 'var' is there to help the parser know this
is declaration and still remain LL(1).
To claim LL(1)/LR(1) is superior because of the linear time, O(n), ignores
the fact that this is context free parse and must be followed symbol
identification. Assuming the number of distinct symbols is logarithmic in the
program length, the time necessary for a context sensitive parse is from
O(n log log n) to O(n**2).
It would be interesting if we could learn something from our own minds and
make computer languages more like natural languages (but excluding ambiguity).
Make it simple--not simplistic.
[From email@example.com (Steven Ryan)]
Return to the
Search the comp.compilers archives again.