Related articles |
---|
AnaGram: a new LALR-1 parser generator (long) jholland@world.std.com (1993-08-09) |
Newsgroups: | comp.compilers |
From: | jholland@world.std.com (Jerome T Holland) |
Keywords: | parse, tools |
Organization: | Compilers Central |
Date: | Mon, 9 Aug 1993 17:54:21 GMT |
Parsifal Software has released AnaGram(TM), an extended LALR-1 parser
generator. AnaGram produces C or C++ input processing modules from
formal descriptions of input syntax. It has been designed to make
syntax directed parsing a practical, usable tool for everyday
software development.
AnaGram provides a number of features designed to help get reliable,
robust parsers up and running quickly. Among the most important are:
* Lexical scanner not required: AnaGram's notation, with character
sets, keywords and fine control over white space suppression,
usually makes a separate lexical scanner unnecessary.
* Interactive working environment: Pop up a wide variety of windows
and interactive tools to analyze and trace conflicts and
other problems. Step through any sequence of inputs and
watch your parser's response.
* File Trace: See precisely how a parser derived from your grammar
will interpret a test file. Fast forward, parse character by
character, back up, try again.
* Semantic control over parsing: Semantically determined productions
allow determination of reduction token at parse time.
* Automatic setup of parser stack, with provision for arbitrary C
or C++ data types, including template classes.
* Parser options include automatic error diagnostics and recovery,
event driven parsers and coverage analysis.
* Notation provides simple, familiar representation of choices,
options, and repetition as well as symbolic linkage between
grammar rules and semantic actions.
AnaGram comes with extensive cross-referenced, context sensitive
online help, a comprehensive reference manual including an
introduction to parsing, and numerous documented examples.
A Brief Introduction to AnaGram
-------------------------------
AnaGram was written by a working programmer as an everyday working
tool. It is now being released to the software marketplace on the
premise that other programmers and software developers will also find
it useful.
The first primitive version of AnaGram was written for the PDP-11 over
ten years ago as an example of the use of a software prototyping
tool, now long forgotten. At the time the author, like most
programmers, did not have access to UNIX or UNIX tools. Therefore
AnaGram developed independently of YACC and other compiler development
tools. Almost from the beginning, AnaGram has been written in AnaGram.
Because of AnaGram's inherent flexibility, it was easy to add features,
many of which were discarded. Many, however, remain: too many to even
mention in this brief article.
As AnaGram evolved, two basic obstacles to general use of syntax
directed parsing became clear. The first problem has to do with the
characterization of input tokens. The traditional approach to this
problem has been the adoption of lexical scanners. Lexical scanners,
however, turn the parsing problem into a two-step process, and thereby
make it unduly cumbersome for everyday use.
The second obstacle is the tendency of hastily written grammars to
exhibit conflicts. Conflicts can be hard to track down and fix. Of
course, programmers who parse by hand, writing tedious conditional
code, can have ambiguities also. The difference is that they don't know
it.
Three distinct features of AnaGram evolved to cope with the first
obstacle: Character set logic, keyword logic, and "disregard" logic,
which taken together render separate lexical scanners unnecessary in
most instances. These features are illustrated in the example which
follows.
To cope with conflicts, an interactive working environment was built
and a number of special tools evolved. These are described very briefly
following the example.
Since parsing is rarely the whole problem, parsers must be able to
interface cleanly and smoothly with the rest of the user's program.
Since there is no single right way to do this, a number of options
evolved. Some of the options for building parsers are described briefly
at the end of this article.
An AnaGram Example
------------------
The following example is a complete program: The output produced by
AnaGram from this example can be compiled, linked and run without any
support modules other than the standard run-time library provided by
any C compiler. In the interest of brevity, the example has been kept
very simple.
FFCALC.SYN implements a simple four function calculator which reads its
input from stdin. The calculator has 52 registers, labeled 'a' through
'z' and 'A' through 'Z'. FFCALC evaluates arithmetic expressions and
assignment statements and prints the results to stdout. The expressions
may contain '+', '-', '*', and '/' operators as well as parentheses. In
addition, FFCALC supports the free use of white space and C style
comments in the input. It also contains complete error handling,
including syntax error diagnostics and resynchronization after syntax
errors.
For purposes of annotation, line numbers have been inserted at the
left margin. The line numbers are *not* part of the AnaGram syntax.
Immediately following the example are some brief explanatory notes
keyed to the line numbers.
1 {/* FOUR FUNCTION CALCULATOR: FFCALC.SYN */}
2 // -- CONFIGURATION SECTION ----------------------------
3 [
4 default token type = double
5 disregard white space
6 lexeme { real}
7 traditional engine
8 ]
9 // -- FOUR FUNCTION CALCULATOR -------------------------
10 (void) calculator $
11 -> [calculation?, '\n']..., eof
12 (void) calculation
13 -> expression:x =printf("%g\n",x);
14 -> name:n, '=', expression:x ={
15 printf("%c = %g\n",n+'A',value[n]=x);}
16 -> error
17 expression
18 -> term
19 -> expression:x, '+', term:t = x+t;
20 -> expression:x, '-', term:t = x-t;
21 term
22 -> factor
23 -> term:t, '*', factor:f = t*f;
24 -> term:t, '/', factor:f = t/f;
25 factor
26 -> name:n = value[n];
27 -> real
28 -> '(', expression:x, ')' = x;
29 -> '-', factor:f = -f;
30 // -- LEXICAL UNITS ------------------------------------
31 digit = '0-9'
32 eof = -1
33 (void) white space
34 -> ' ' + '\t' + '\r' + '\f' + '\v'
35 -> "/*", ~eof?..., "*/"
36 (int) name
37 -> 'a-z' + 'A-Z':c = c-'A';
38 real
39 -> integer part:i, '.', fraction part:f = i+f;
40 -> integer part, '.'?
41 -> '.', fraction part:f = f;
42 integer part
43 -> digit:d = d-'0';
44 -> integer part:x, digit:d = 10*x + d-'0';
45 fraction part
46 -> digit:d = (d-'0')/10.;
47 -> digit:d, fraction part:f = (d-'0' + f)/10.;
48 { /* -- EMBEDDED C ---------------------------------- */
49 double value[64]; /* registers */
50 void main(void) {
51 ffcalc();
52 }
53 } // -- END OF EMBEDDED C ------------------------------
Notes to example:
General note: When an AnaGram grammar is written to use direct character
input, the terminal tokens are written as character sets. A single
character is construed to be the set consisting only of the character
itself. Otherwise character sets can be defined by ranges, e.g., 'a-z',
or by set expressions using +, -, &, or ~ to represent union,
difference, intersection, or complement respectively. If the sets used
in the grammar are not pairwise disjoint, and they seldom are, AnaGram
calculates a disjoint covering of the character universe, and extends
the grammar appropriately. The semantic value of a terminal token is
the ascii character code, so that semantic distinctions may still be
made even when characters are syntactically equivalent.
1 Braces { } are used to denote embedded C or C++ code that should
be passed unchanged to the parser. Embedded C at the very beginning of
the syntax file is placed at the beginning of the parser file. All
other embedded C is placed following a set of definitions and
declarations AnaGram needs for the code it generates. AnaGram saves up
all the reduction procedures, or semantic actions, and places them
after all the embedded C.
3 Brackets [ ] are used to denote configuration sections.
Configuration sections contain settings for configuration parameters
and switches, and a number of attribute statements that provide
metasyntactic information.
4 This statement sets the default token type for nonterminal tokens
to double. The default value for "default token type" is void. You can
override the type for a particular token using an explicit cast. See
line 10. The default type for terminal tokens is int. AnaGram uses the
token type declarations to set up calls and definitions of reduction
procedures and also to set up the parser value stack.
5 The disregard statement tells AnaGram to extend the grammar so
that the generated parser will skip all instances of white space which
are not contained within lexemes. "White space" is a token defined at
line 33. There is nothing magic about the name. Any other name could
have been used.
6 The lexeme statement identifies a list of nonterminal tokens within
which the "disregard" statement is inoperative. real is defined at line
38.
7 "traditional engine" is a configuration switch. Simply asserting it
turns it on. You can also write: traditional engine = ON. To turn off a
switch use ~: thus ~traditional engine would guarantee the switch is
off, whatever its default value. Alternatively set traditional engine =
OFF.
AnaGram parsers normally use a parsing engine with more than the
standard four parsing actions: shift, reduce, error and accept. The
extra actions are compound actions. The result of using these actions
is to speed up the parser and to reduce the size of the state table by
about fifty per cent. The traditional engine switch turns this
optimization off, so the parser will only use the four traditional
actions. This is usually only done for clarity when using the File
Trace or Grammar Trace options described below.
9 AnaGram supports both C and C++ style comments. Nesting of C
comments is controlled by the "nest comments" switch.
10 An explicit cast can be used to override the token type for
nonterminal tokens. Types can be just about any C or C++ type,
including template types. Basically, the only exceptions are types
containing () or [].
The simplest way to specify the goal token for a grammar is to mark it
with a dollar sign. You can also simply name it "grammar", or set the
"grammar token" parameter in a configuration section.
11 For "->" read "produces". '?' following a token name makes it
optional. Tokens in a rule are separated by commas. Multiple rules with
the same left side can also be separated with '|'.
The rules for character constants are the same as for C. Brackets []
indicate the rule is optional. Braces { } would be used if the rule
were not optional. Brackets and braces can include multiple rules
separated by |. The ellipsis ... indicates unlimited repetition. These
constructs are referred to as "virtual productions". eof is defined at
line 32.
Lines 10 and 11 taken together specify that this grammar describes a
possibly empty sequence of lines terminated with an eof character. Each
line contains an optional "calculation" followed by a newline character.
13 To assign the value of a token (stored on the parser value stack)
to a c variable for use in a semantic action, or reduction procedure,
simply follow the token name with a colon and the name of the variable.
Short form reduction procedures are simple C or C++ expressions
terminated with a semicolon. They cannot include a newline character.
The name of the C variable is local to this particular procedure.
Normally the value of the reduction procedure is assigned to the token
on the left side of the production. In this case, since calculation is
of type "void", the result of the printf call is discarded.
14 When reduction procedures won't fit on a single line or are more
complex than a single expression, they can be enclosed in braces { }.
Use a return statement to return a value.
16 The error token works more or less like the error token in YACC. In
this case it matches any portion of a "calculation" up to a syntax
error and then everything up to the next newline, as determined by the
production on line 11. AnaGram also provides an alternative form of
error continuation called "automatic resynchronization" which uses a
heuristic approach derived from the grammar. By default, AnaGram
parsers provide syntax error diagnostics. The user may provide his own
if he wishes.
18 If a grammar rule does not have a reduction procedure, the value of
the first token in the rule is assigned to the token on the left side
of the production.
19 Since the default type specification given on line 4 was "double",
x and t have type double, and the reduction procedure returns their
sum, also double.
24 Note that in the interest of simplicity, this reduction procedure
omits any provision for divide by zero errors.
31 Definition statements may be used to provide shorthand names. '0-9'
is a character range, as discussed above.
32 Input characters can also be defined using decimal, octal or hex
notation. They are not limited to any particular range, so that it is
possible to define the end of file token as the standard stream I/O end
of file value.
33 Note that AnaGram permits embedded blanks in token names.
34 The set consisting of blank, tab, return, form feed or vertical
tab.
35 Keywords are strings of characters enclosed in double quotes.
Standard C rules apply for literal strings. Keywords stand outside the
character space and are recognized in preference to individual
characters.
~ indicates the complement of a character set, so that ~eof is any
character except end of file. The character universe is the set of
characters on the range 0..255 unless there are characters outside this
range, in which case it is extended to the smallest contiguous range
which includes the outside characters.
?... allows zero or more comment characters. This rule describes a
standard C comment (no nesting allowed).
36 The value of name is an int, an index into the value table.
37 The '+' is set union. Therefore c is any alphabetic character.
50 If you don't have any embedded C in your syntax file, AnaGram will
create a main program automatically. Since there was already embedded C
at line 1, AnaGram won't automatically create a main program, so we
need to define one explicitly.
51 The default function name for the parser is taken from the file
name, in lower case. There is a configuration parameter available to
set it to something else if necessary. Lacking any contrary
specification, the parser will read its input from stdin.
AnaGram's Interactive Environment
---------------------------------
AnaGram uses a simple pop up window system to provide information
about a grammar and to support interactive analysis. The most
important tools are the interactive, interpretive parsers called File
Trace and Grammar Trace. These are supported by a variety of static
windows which display supporting information. All windows can be
resized or moved about on the screen. Most windows can be cloned, so
you can have more than one window open on a table. You can also pop up
context sensitive help from almost every window.
File Trace. When you select File Trace and identify a test file,
AnaGram opens two windows: the File Trace window and a window on the
test file. Each line in the File Trace window represents one level in
the parse stack and displays the parser state number for that level and
the name and number of the token that was seen in that state.
In the test file window, you simply move the cursor down or to the
right to cause AnaGram to parse text. When you move the cursor, AnaGram
performs all the parsing actions necessary to catch up to the cursor.
The current contents of the parser stack are displayed in the File
Trace window. If you wish a finer-grained picture, you may proceed a
single step at a time by using the Enter key. The portion of the file
that has not been parsed is highlighted.
If you move the cursor left or up, the parse will not automatically
back up. To back up the parse to a desired position, move the cursor
to the desired position and press Enter. The parse will back up to that
point.
In the File Trace window, if you move the cursor bar, the highlighted
region in the test file window changes. As long as the cursor bar is on
the last line of the stack, the unparsed portion of the file is
highlighted. As soon as you move the cursor bar up or down, the
highlight in the test file window is moved to cover all of the text
which comprises the token under the cursor bar in the File Trace
window.
>From the File Trace window, you may also access any of ten auxiliary
windows to provide more information about the current status of the
parse. Of these, the most important is the Rule Stack window, which
shows all active grammar rules at each level in the parser stack. The
cursor bar in the Rule Stack window is synched with the cursor in the
window on your syntax file so that, as you move the cursor up and
down, your syntax file window will always show the selected rule in
context. There are also a number of auxiliary windows accessible from
the Rule Stack window.
To help you make sure you have tested all the intricacies of your
grammar, AnaGram keeps a count of the number of times each grammar rule
is reduced. These counts can be seen in the Trace Coverage window.
Grammar Trace. The Grammar Trace is similar to the File Trace, but
differs in that it is completely interactive and does not use an input
file. The Grammar Trace display is the same as the File Trace display,
showing state numbers, token numbers and token names on the parser
stack. At any level in the stack, you can press Enter to pop up
a list of all the tokens, terminal and nonterminal, that are admissible
in the state specified at that level. If you select a token, the parser
stack will be truncated to that level, the action appropriate to
the token you have selected will be taken, and the stack will be
updated accordingly. You may even clone a Grammar Trace window, so that
you can try one input in the original window and another in the clone.
AnaGram provides pre-initialized Grammar Traces under numerous
circumstances. You may make a copy of the stack in a File Trace
window, for example, and continue your analysis interactively. In any
window where the cursor bar identifies a parser state unambiguously you
may ask AnaGram to provide you with a Grammar Trace preloaded with a
sequence of tokens which gets you to the indicated state. If you set
the "error trace" configuration parameter in your syntax file, your
parser will save its state stack to a temporary file any time it
encounters a syntax error. The next time you run AnaGram, it will
provide a trace to the state in which the error occurred.
The most important use for the Grammar Trace is in analyzing conflicts.
For any conflict you may request a Conflict Trace. AnaGram will find a
sequence of tokens which illustrates the conflict and preload a Grammar
Trace leading to the conflict state. For any completed rule in the
conflict state, you may also request a Reduction Trace which shows what
the result of reducing that rule would be.
All the auxiliary windows, including the Rule Stack, that are available
from the File Trace are also available from the Grammar Trace.
Other Windows. Altogether, AnaGram has over thirty different varieties
of windows to illustrate various aspects of a grammar. Some of these
are quite pedestrian, the Symbol Table and Token Table windows, for
instance. Others, such as the Conflict Derivation and Token Derivation
windows, which show how a conflict derives from the rules in the
grammar, are more esoteric. From most windows it is possible to pop up
the Auxiliary Windows menu, which typically offers a choice of eight or
ten supplementary windows.
AnaGram Parsers
---------------
AnaGram builds complete, ready-to-use parsers. It sets up the parser
stack automatically, allowing for all the data types associated with
tokens in the grammar. It has options for extending the parser: to
maintain a context stack or to perform coverage analysis, for example.
All AnaGram options have reasonable defaults. If you build a parser for
a pure grammar, for instance, you will get a syntax checker that reads
its input from stdin.
There is no one right way to build a parser. Different parsers have
different requirements. AnaGram provides the flexibility needed to
accommodate differing requirements. AnaGram provides numerous options
for the following problems:
* Invoking the parser
* Identifying input tokens
* Diagnosing syntax errors
* Error recovery
* Invoking the parser. The parser's name defaults to the name of the
syntax file. A configuration parameter allows you to change it to
anything you wish. As a consequence, it is easy to use more than one
parser in a program. AnaGram itself uses three, and one of the examples
distributed with AnaGram uses five.
Somewhat more subtle is the way a parser gets its input. Conventionally
you call a parser and it reads and parses its input until it has
identified its goal token. This is the default protocol for AnaGram
parsers. AnaGram also provides a special pointer option for those cases
where the entire input is already in memory.
AnaGram offers an alternative, inside-out, event-driven protocol:
Initialize the parser, and then call it once for each input token. Each
time it is called, it processes exactly one token and returns to the
calling program when it can go no further. The calling program can then
interrogate a flag to see whether the parser needs more data, or has
identified the goal token. Event driven parsers are extremely easy to
interface. You can easily build chains of parsers, the output from one
driving the next.
The entire state of a parser is maintained in a single structure called
a parser control block. Because of this, it is easy, even in C
programs, to have multiple instances of a single parser running
concurrently. It is also possible to have a number of distinct parsers
processing the same input stream simultaneously.
* Identifying input tokens. Most AnaGram parsers use direct character
input, as in the example above. AnaGram parsers, however, can also
accept more complex input tokens as generated by lexical scanners or
other parsers. AnaGram provides a variety of ways to make sure lexer
and parser agree on token numbers.
* Diagnosing syntax errors. AnaGram provides optional syntax error
diagnostics. Combined with the optional line and column tracking, these
diagnostics are often all that is needed. If they are inadequate,
however, the user can easily install his own.
* Error recovery. AnaGram provides not only "error token"
resynchronization, similar to that available in YACC, but also
"automatic resynchronization", a heuristic technique based on the
user's grammar. To use automatic resynchronization, the user need only
set a switch. The user is also free to incorporate his own resynch
mechanisms if these are not adequate.
AnaGram requires an IBM compatible AT, 386, or 486 computer with a
minimum of 400K of available memory, DOS 3.1 or higher, and
approximately 1 megabyte of hard disk space. A C or C++ compiler is
required to compile the generated code.
AnaGram costs $495 and requires no royalties or runtime licenses.
A free trial copy with numerous examples is available.
AnaGram is available from:
Parsifal Software
P.O. Box 219, Wayland, MA 01778
Phone: (800) 879-2577 or (508) 358-2564
Internet: jholland@world.std.com
CompuServe: 72603,1763
--
Return to the
comp.compilers page.
Search the
comp.compilers archives again.