From: | grosch@cocolab.sub.com (Josef Grosch) |
Newsgroups: | comp.compilers |
Date: | 24 Aug 1996 21:32:47 -0400 |
Organization: | CoCoLab, Karlsruhe, Germany |
Keywords: | parse |
Mark Thiehatten <mark@research.techforce.nl> wrote:
> I am working on a parser for a language that allows keywords to
> be used as identifiers.
I have implemented this for FORTRAN 77, Fortran 90, and PL/I.
The solutions depend very much on the language.
In Fortran you need backtracking (at least two passes) in both, scanner and
parser. The scanner can be switched to recognize keywords as such or as
identifiers. A first pass tries to recognize a statement as an assignment or
some similar statement. During this pass the scanner does not recognize
keywords. If the first pass fails then a second pass is initiated. Now the
scanner does recognize keywords and the parser tries to parse for a
statement that starts with a keyword. After recognizing the starting keyword
the scanner is switched into identifier mode in order allow all identifiers
to appear within a statement. Only at a few places within statements it is
necessary to switch the scanner into keyword mode. At some places it is
possible to accept an identifier where a keyword is required and add a check
whether the identifier is equal to the required keyword. Example:
select_case_stmt = SELECT identifier "(" expr ")" .
instead of
select_case_stmt = SELECT CASE "(" expr ")" .
In PL/I the non-reserved keywords are indeed a problem, but far less than
I expected. In my solution, the scanner distinguishes between keywords and
ordinary identifiers. The grammar has a rule similar to:
identifier = identifier_token
| BEGIN
| END
| IF
| THEN
| ELSE
| ... /* all keywords */
Surprisingly, this solution almost works. In most cases an LALR(1) parser can
decide from the context whether a keyword stands for itself or an identifier.
Of course, some situations remain where the grammar is not LALR(1). Then a
lookahead of unlimited length or backtracking is needed. For example:
PUT (X + Y);
PUT (X + Y) = Z;
The first PUT is a keyword and the second one an identifier. This decision
can be made only after seeing the tokens ';' or '='.
With an advanced parser generator such as Lark from the Cocktail Toolbox for
compiler construction these problems can be solved. Lark offers lookahead of
unlimited length or backtracking. For example the above problem at PUT is
solved as follows:
identifier = identifier_token
| PUT ? ref_tail .
| BEGIN
| END
| ... /* all keywords */
ref_tail = arguments ':'
| arguments '->'
| arguments '('
| arguments '.'
| arguments ','
| arguments '='
| arguments PT
arguments = '(' ')'
| '(' subscript_commalist ')'
...
The construct "? ref_tail" behind PUT is a predicate that triggers
backtracking. The rule is recognized (PUT is an identifier) only if the
token PUT is followed by tokens that can be correctly parsed using the start
symbol ref_tail. ref_tail is a nonterminal that describes those right
contexts where PUT is an identifier.
Regards
Josef Grosch
CoCoLab
Hagsfelder Allee 16
D-76131 Karlsruhe
Germany
Tel.: +49-721-697061
Fax : +49-721-661966
Mail: grosch@cocolab.sub.com
--
Return to the
comp.compilers page.
Search the
comp.compilers archives again.