RE: Can Coco/R do multiple tokenizations

Quinn Tyler Jackson <quinn-j@shaw.ca>
24 Aug 2005 18:23:08 -0400

          From comp.compilers

Related articles
Re: Can Coco/R do multiple tokenizations cfc@shell01.TheWorld.com (Chris F Clark) (2005-08-21)
RE: Can Coco/R do multiple tokenizations quinn-j@shaw.ca (Quinn Tyler Jackson) (2005-08-24)
| List of all articles for this month |
From: Quinn Tyler Jackson <quinn-j@shaw.ca>
Newsgroups: comp.compilers
Date: 24 Aug 2005 18:23:08 -0400
Organization: Compilers Central
References: 05-08-070
Keywords: lex
Posted-Date: 24 Aug 2005 18:23:08 EDT

Chris F. Clark said:


> Multiple tokenizations is "hard". I agree that Neta-S, I believe now
> called GrammarForge, is your best bet for built-in support.


Yes, multiple tokenization is hard. As DoDi noted in another post in this
thread, scannerless parsers are one possible route, and the Meta-S parser is
a scannerless parser. (Terminology side note: The Grammar Forge is the
IDE/grammar development environment, the parsing engine is still called
Meta-S. The algorithm used to effect a Meta-S parse is called adaptive(k).)


Regarding the original construction in question:


> Consder this string (note no whitespaces ):
> 'a!=b'
> Valid tokenization/parsing can yield several posibbilityes
> 1. 'a!=b' .. a single token.
> 2. 'a!' '=' 'b' .. an assignment
> 3. 'a' '!=' 'b' .. an comparision.


Since the $-calculus is TP, it can most certainly deal with such
constructions, as Chris notes, and in a "built-in" way (that is -- without
ad hockery). All questions of practicality/usability aside, the only thing
really necessary to recognize a!=b as a single token would be to have
variable names bind more tightly than binary operators in terms of
precedence, and to have the strict requirement that all variables be
declared before use. These issues are easily handled by $-grammars without
attaching code to disambiguate. It would get a bit trickier if a and b had
*already* been declared, but $-grammars would allow you to either reject
such cases with predicates, or prefer one over the other in terms of
precedence.


The only thing really necessary is that the grammar allow for the run-time
binding of tokens to productions, such that some "declared_variable"
production know that a!=b has been declared previously as a variable.


In a traditional parser, this could be handled by having the declaration
production add a!=b to the lexer's symbol table, such that the lexer would
return IDENTIFIER when encountering a!=b, rather than the three tokens
IDENTIFIER OPERATOR IDENTIFIER.


That said, however, it is theoretically not really as easy as all that,
since every change to the lexer's symbol table that included infixed symbols
that also have tokens of their own (in this case !=) could introduce an
intractable ambiguity into the grammar as a whole. This would mean
re-binding the entire grammar every time a new identifier with symbols that
also map to other tokens is added, and rechecking for ambiguity, which could
be an explosive situation computationally speaking.


(BTW, Chris -- emails sent directly to you typically still bounce back.)


--
Chevalier Quinn Tyler Jackson


http://members.shaw.ca/qtj/



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.