Related articles
How to parse keywords that can be used as identifiers? mark@research.techforce.nl (Mark Thiehatten) (1996-08-19)
Re: How to parse keywords that may be used as identifiers scooter@mccabe.com (Scott Stanchfield) (1996-08-19)
Re: How to parse keywords that may be used as identifiers gleeh@tulletts.sprint.com (1996-08-21)
From: Scott Stanchfield <scooter@mccabe.com>
Newsgroups: comp.compilers.tools.pccts,comp.compilers
Date: 19 Aug 1996 23:17:04 -0400
Organization: McCabe & Associates
References: <32184418.167E@research.techforce.nl> 96-08-058
Keywords: parse
>[How can I parse languages where keywords aren't reserved words?]
One way to do this with PCCTS is to use "token classes." For example,
if you were doing a language like PL/I (no reserved words -- how evil!)
you might have something like
    #token IF "if"
    #token THEN "then"
    #token ELSE "else"
    #token IDENT "[a-zA-Z$_][a-zA-Z0-9$_]*" // might not be quite right...
    #tokclass IDENTIFIER {IDENT IF THEN ELSE}
then, in a rule like
    if_statement
        : IF expression THEN statement ELSE statement
        ;
assuming that "expression" could match IDENTIFIER, you should be able to
parse an incredibly evil statement like
    if if then else = then else then = if
fairly well. (Don't sue me if it takes a while to get it to work
right...)
The token class is basically shorthand for a rule like
    identifier
        : IDENT
        | IF
        | THEN
        | ELSE
        ;
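
To make the token-class idea concrete outside PCCTS, here is a minimal hand-written sketch in Python. It is not PCCTS output, and everything in it is an assumption made for illustration: the names (tokenize, Parser, parse), the rule that an expression is just a single IDENTIFIER, and the two-token lookahead that decides whether a leading "if" starts an if-statement or an assignment to a variable called "if".

```python
import re

KEYWORDS = {"if": "IF", "then": "THEN", "else": "ELSE"}
# Counterpart of "#tokclass IDENTIFIER {IDENT IF THEN ELSE}".
IDENTIFIER = {"IDENT", "IF", "THEN", "ELSE"}


def tokenize(text):
    """Split text into (kind, spelling) pairs; keywords keep their own kind."""
    tokens = []
    for word in re.findall(r"[A-Za-z$_][A-Za-z0-9$_]*|=", text):
        kind = "ASSIGN" if word == "=" else KEYWORDS.get(word, "IDENT")
        tokens.append((kind, word))
    tokens.append(("EOF", ""))
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self, k=0):
        # Kind of the k-th token of lookahead, clamped at EOF.
        return self.tokens[min(self.pos + k, len(self.tokens) - 1)][0]

    def expect(self, kinds):
        kind, spelling = self.tokens[self.pos]
        if kind not in kinds:
            raise SyntaxError(f"expected {sorted(kinds)}, got {kind} {spelling!r}")
        self.pos += 1
        return spelling

    def statement(self):
        # Two tokens of lookahead: "if =" starts an assignment to a
        # variable named "if"; any other "if" starts an if-statement.
        if self.peek() == "IF" and self.peek(1) != "ASSIGN":
            return self.if_statement()
        return self.assignment()

    def if_statement(self):
        self.expect({"IF"})
        cond = self.expression()
        self.expect({"THEN"})
        then_part = self.statement()
        self.expect({"ELSE"})
        else_part = self.statement()
        return ("if", cond, then_part, else_part)

    def assignment(self):
        target = self.expect(IDENTIFIER)
        self.expect({"ASSIGN"})
        return ("assign", target, self.expression())

    def expression(self):
        # Simplification for this sketch: an expression is one IDENTIFIER.
        return self.expect(IDENTIFIER)


def parse(text):
    parser = Parser(text)
    tree = parser.statement()
    parser.expect({"EOF"})
    return tree


print(parse("if if then else = then else then = if"))
```

With these simplifications the evil statement comes out as an if-statement whose condition is the variable "if", whose then-branch assigns "then" to "else", and whose else-branch assigns "if" to "then". A real expression grammar (for instance one where "=" is also a comparison) would need more lookahead or predicates than this sketch uses.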
This may look feasible in yacc as well, but you'll need to delay symbol
table lookup until you're inside the parser so you can determine the
symbol's function based on context. You can't look up "if" in the symbol
table while scanning; you must wait until you see a rule like
    primary_expression_component_or_whatever_you_call_it
        : literal
        | IDENTIFIER
          <<lookup in symbol table>>
        ;
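
As a rough illustration of doing the lookup in the parser action rather than in the scanner, here is a small Python sketch. The function name primary, the token kinds, and the dictionary used as a symbol table are all hypothetical; the only point is that a token whose kind is IF is still resolved as an ordinary name once the parser knows it sits in operand position.

```python
# Token kinds that the hypothetical IDENTIFIER token class accepts.
IDENTIFIER = {"IDENT", "IF", "THEN", "ELSE"}


def primary(kind, spelling, symbols):
    """Parser action for an operand: the lookup happens here, not in the scanner."""
    if kind == "LITERAL":
        return ("literal", spelling)
    if kind in IDENTIFIER:
        if spelling not in symbols:
            raise NameError(f"undeclared name {spelling!r}")
        return ("name", spelling, symbols[spelling])
    raise SyntaxError(f"unexpected token {kind} {spelling!r}")


# The scanner reported kind IF, but in operand position it is just a name.
symbols = {"if": "integer variable"}
print(primary("IF", "if", symbols))
```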
This leads to a potentially bigger problem: the language you are parsing
might be syntactically ambiguous, meaning that a statement's meaning might
only be known from the "types" of its components. The T(x) ambiguity in
C++ is an example -- is it a variable declaration or a function call?
To resolve the syntactic ambiguity, a yacc-based parser would likely
have the lexer return different tokens based on the "type" of the ident
being scanned. (The scanner performs the symbol table lookup.) With a
language that has non-reserved words, you can't have the scanner just
look up something like "if" and tell if it's being used as a var or
keyword without the scanner keeping track of context as well.
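
The scanner-side lookup described above can be sketched in Python as follows. This is only an illustration under assumed names (scan, classify, a dictionary as the symbol table, token kinds TYPE_NAME and IDENT), not the code of any particular yacc grammar.

```python
import re


def scan(text, symbols):
    """Scanner that consults the symbol table to choose each token kind."""
    tokens = []
    for word in re.findall(r"[A-Za-z_][A-Za-z0-9_]*|[()]", text):
        if word in ("(", ")"):
            tokens.append((word, word))
        elif symbols.get(word) == "type":
            tokens.append(("TYPE_NAME", word))
        else:
            tokens.append(("IDENT", word))
    return tokens


def classify(tokens):
    """Stand-in for the grammar: the token kinds alone decide the parse."""
    kinds = [kind for kind, _ in tokens]
    if kinds == ["TYPE_NAME", "(", "IDENT", ")"]:
        return "declaration"
    if kinds == ["IDENT", "(", "IDENT", ")"]:
        return "call"
    raise SyntaxError(f"cannot classify {kinds}")


print(classify(scan("T(x)", {"T": "type"})))
print(classify(scan("T(x)", {"T": "function"})))
```

The same trick does not carry over to non-reserved keywords: whether "if" is a keyword or a variable depends on where it appears in the statement, and this scanner knows nothing about position.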
A predicated parser generator, such as PCCTS, can resolve that ambiguity
using semantic predicates. (See my post on comp.compilers RE lookahead
and parser->scanner communication.) However, if you're lucky, the
language will not be syntactically ambiguous...
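
For comparison, here is a hedged Python imitation of the predicate approach. It does not use PCCTS syntax; the names is_type and statement and the list of (predicate, action) pairs are made up for this sketch. The scanner hands over plain spellings, and the parser tries each alternative only if its predicate holds.

```python
def is_type(name, symbols):
    # Semantic predicate: true when the name is currently declared as a type.
    return symbols.get(name) == "type"


def statement(tokens, symbols):
    """Parse 'T ( x )' given as a list of spellings, choosing by predicate."""
    if len(tokens) != 4 or tokens[1] != "(" or tokens[3] != ")":
        raise SyntaxError(f"expected NAME ( NAME ), got {tokens}")
    name, arg = tokens[0], tokens[2]
    # Alternatives are tried in order; each is guarded by a predicate.
    alternatives = [
        (lambda: is_type(name, symbols), lambda: ("declaration", name, arg)),
        (lambda: True, lambda: ("call", name, arg)),
    ]
    for predicate, action in alternatives:
        if predicate():
            return action()
    raise SyntaxError("no alternative matched")


print(statement(["T", "(", "x", ")"], {"T": "type"}))
print(statement(["T", "(", "x", ")"], {}))
```

The design difference from the scanner-side lookup is that the decision is made inside the parser, where the surrounding context is known, so the scanner can stay ignorant of the symbol table.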
Hope this helps a bit,
Scott
--
Scott Stanchfield McCabe & Associates -- Columbia, Maryland
--