Re: Can Coco/R do multiple tokenizations

Darius Blasband <darius@raincode.com>
21 Aug 2005 00:21:57 -0400

          From comp.compilers

Related articles
Can Coco/R do multiple tokenizations vardhanvarma@gmail.com (2005-08-13)
Re: Can Coco/R do multiple tokenizations gneuner2@comcast.net (George Neuner) (2005-08-16)
Re: Can Coco/R do multiple tokenizations DrDiettrich@compuserve.de (Hans-Peter Diettrich) (2005-08-16)
Re: Can Coco/R do multiple tokenizations gene@abhost.us (Gene Wirchenko) (2005-08-16)
Re: Can Coco/R do multiple tokenizations cfc@shell01.TheWorld.com (Chris F Clark) (2005-08-21)
Re: Can Coco/R do multiple tokenizations darius@raincode.com (Darius Blasband) (2005-08-21)
| List of all articles for this month |

From: Darius Blasband <darius@raincode.com>
Newsgroups: comp.compilers
Date: 21 Aug 2005 00:21:57 -0400
Organization: Compilers Central
References: 05-08-053 05-08-065
Keywords: lex
Posted-Date: 21 Aug 2005 00:21:57 EDT

George Neuner wrote:
>
>>Can Coco/R, (can any other parser/lexer generator ) do multiple
>>tokenizations & parser-tree-generations
>
>
> AFAIK, there are no existing lexer gen tools which allow alternate
> tokenizations for the same input text. You are free to write one of
> course.
>
> I don't know Coco/R, but what you want is possible by using deliberate
> backtracking and multiple lexers. It's a slow and painstaking process
> of trying a particular parse, saving the AST if the parse succeeds,
> then backtracking, switching lexers and trying the same parse again.
> If you end up with no ASTs, the parse failed, and if you end up with
> multiple ASTs, the code was ambiguous.
>


As a matter of fact, my PhD thesis


(http://www.phidani.be/homes/darius/thesis.html)


was just about that: a lexer and parser
generator that can backtrack, and in practice, this is convenient
enough. Most ambiguities are local, up to a point where backtracking
is a convenience feature with reasonable impact on performance, far
from the theoretical worst case. This technology is robust enough to
be used in RainCode (http://www.raincode.com) and has been used to
parse poorly defined languages such as COBOL and PL/1.


For instance, in COBOL, something such as:


          X(9)


could be a picture definition, recognized as a single token,
or an access to an array named X. Originally, the COBOL standard
specified that parenthesis should be surrounded by spaces, but most
modern compiler would accept the notation above anyway.


Our backtracking lexer will first recognize the above fragment
as a single "Picture" token (please note that there is no need for more
complex structures, as one cannot nest parenthesis in this context).
If the parser fails to do anything useful with this, it will then
recognize the same fragment as 4 tokens ("X", "(", "9", ")").


The same can be used to parse fortran (where white space is ignored
altogether) and PL/1 where keywords are not reserved and can be used
as user-defined identifiers.



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.